Provided by: ocrodjvu_0.10.2-1_all
NAME
hocr2djvused - hOCR to djvused script converter
SYNOPSIS
hocr2djvused [option...] [hocr-file...]
DESCRIPTION
hocr2djvused reads one or more hOCR[1] files (as produced by OCRopus[2], Cuneiform[3], or Tesseract[4]) and converts them to a djvused script. If no filename is given on the command line, hOCR is read from the standard input.
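The two input modes described above (named files versus standard input) can be sketched from a calling script as follows. This is a minimal sketch, not part of ocrodjvu: the helper names build_argv and convert are hypothetical, and it assumes that hocr2djvused is on PATH and writes the generated djvused script to the standard output.

```python
import subprocess
from typing import Optional, Sequence


def build_argv(hocr_files: Sequence[str] = (), options: Sequence[str] = ()) -> list:
    """Return the argument vector: program name, options, then input files."""
    # Hypothetical helper; the argument order follows the SYNOPSIS line.
    return ["hocr2djvused", *options, *hocr_files]


def convert(hocr_files: Sequence[str] = (), hocr_data: Optional[bytes] = None) -> bytes:
    """Run hocr2djvused and return whatever it writes to stdout.

    With no file names, hocr_data is fed to the tool on standard input.
    Assumption: the djvused script is emitted on stdout.
    """
    if hocr_files and hocr_data is not None:
        raise ValueError("pass either file names or stdin data, not both")
    result = subprocess.run(
        build_argv(hocr_files),
        input=hocr_data,          # None leaves stdin inherited from the caller
        stdout=subprocess.PIPE,
        check=True,               # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Calling convert(["page1.html", "page2.html"]) corresponds to naming the files on the command line, while convert(hocr_data=data) corresponds to piping hOCR into the tool.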
OPTIONS
Text segmentation options
    -t lines, --details=lines
        Record the location of every line. Don't record locations of particular words or characters.
    -t words, --details=words
        Record the location of every line and every word. Don't record locations of particular characters. This is the default.
    -t chars, --details=chars
        Record the location of every line, every word and every character.
    --word-segmentation=simple
        Consider each non-empty sequence of non-whitespace characters a single word. This is the default, despite being linguistically incorrect.
    --word-segmentation=uax29
        Use the Unicode Text Segmentation[5] algorithm to break lines into words. This option breaks the assumption of some DjVu tools that words are separated by spaces, and is therefore not recommended.
Other options
    --rotation=n
        Assume that DjVu pages are rotated by n degrees.
    --page-size=widthxheight
        Specify that the page size is width pixels × height pixels. This option is required for hOCR generated by Cuneiform (< 0.8) and superfluous otherwise.
    --html5
        Use an HTML5 parser[6], which is more robust but slower than the default parser.
    --fix-utf8
        Attempt to fix UTF-8 encoding issues and eliminate unwanted control characters. This option might be needed for hOCR generated by Cuneiform[7] or Tesseract[8].
    --version
        Output version information and exit.
    -h, --help
        Display help and exit.
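The rule behind --word-segmentation=simple can be illustrated with a short Python sketch. It is not the tool's actual implementation: the function name simple_words is hypothetical, and it assumes that "whitespace" means whatever Python's \S pattern treats as non-whitespace for Unicode strings.

```python
import re


def simple_words(line: str) -> list:
    """Split a line into (start, end, text) triples, one per word.

    A word is any maximal run of non-whitespace characters, so
    punctuation stays attached to the neighbouring letters.
    """
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", line)]


print(simple_words("Don't stop,  now."))
# → [(0, 5, "Don't"), (6, 11, 'stop,'), (13, 17, 'now.')]
```

The output shows why this rule is described as linguistically incorrect: "stop," and "now." keep their punctuation. It also shows why it is the safe default, since every word boundary coincides with whitespace.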
BUGS
Please report bugs at: https://github.com/jwilk/ocrodjvu/issues
SEE ALSO
djvu(1), ocrodjvu(1), djvu2hocr(1), djvused(1)
NOTES
1. hOCR
   https://docs.google.com/View?docid=dfxcv4vc_67g844kf
2. OCRopus
   https://code.google.com/p/ocropus/
3. Cuneiform
   https://launchpad.net/cuneiform-linux
4. Tesseract
   https://github.com/tesseract-ocr/tesseract
5. Unicode Text Segmentation
   http://unicode.org/reports/tr29/
6. HTML5 parser
   https://html.spec.whatwg.org/multipage/syntax.html#parsing
7. https://bugs.launchpad.net/cuneiform-linux/+bug/585418
8. https://code.google.com/p/tesseract-ocr/issues/detail?id=690