Provided by: ocrodjvu_0.7.9-1.1_all
NAME
hocr2djvused - hOCR to djvused script converter
SYNOPSIS
hocr2djvused [option...]
DESCRIPTION
hocr2djvused reads a hOCR[1] file (as produced by OCRopus[2], Cuneiform[3], or Tesseract[4]) from the standard input and converts it to a djvused script.
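As an illustration of the stdin-based workflow described above, the following minimal Python sketch builds a hocr2djvused command line and feeds it hOCR data. The helper names build_command and convert are hypothetical and not part of ocrodjvu. The sketch also assumes that the resulting djvused script is written to the standard output, which this page does not state explicitly.

```python
import subprocess


def build_command(details="words", page_size=None, html5=False):
    """Build an argument list for hocr2djvused (hypothetical helper).

    Only options documented under OPTIONS are used.
    """
    if details not in ("lines", "words", "chars"):
        raise ValueError("details must be 'lines', 'words' or 'chars'")
    cmd = ["hocr2djvused", "--details=" + details]
    if page_size is not None:
        width, height = page_size
        # --page-size expects the form WIDTHxHEIGHT, in pixels.
        cmd.append("--page-size=%dx%d" % (width, height))
    if html5:
        cmd.append("--html5")
    return cmd


def convert(hocr_bytes, **options):
    """Pipe hOCR data through hocr2djvused and return the script as bytes.

    Assumption: the djvused script is written to the standard output.
    """
    result = subprocess.run(
        build_command(**options),
        input=hocr_bytes,          # hOCR is read from the standard input
        stdout=subprocess.PIPE,
        check=True,                # raise CalledProcessError on failure
    )
    return result.stdout
```

The returned script could then be saved to a file and applied to a DjVu document with djvused(1).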
OPTIONS
Text segmentation options
    -t lines, --details=lines
        Record location of every line. Don't record locations of particular words or characters.

    -t words, --details=words
        Record location of every line and every word. Don't record locations of particular characters. This is the default.

    -t chars, --details=chars
        Record location of every line, every word and every character.

    --word-segmentation=simple
        Consider each non-empty sequence of non-whitespace characters a single word. This is the default, despite being linguistically incorrect.

    --word-segmentation=uax29
        Use the Unicode Text Segmentation[5] algorithm to break lines into words. This option breaks the assumption of some DjVu tools that words are separated by spaces, and is therefore not recommended.

Other options
    --rotation=n
        Assume that DjVu pages are rotated by n degrees.

    --page-size=widthxheight
        Specify that the page size is width pixels × height pixels. This option is required for hOCR generated by Cuneiform (< 0.8) and superfluous otherwise.

    --html5
        Use an HTML5 parser[6], which is more robust but slower than the default parser.

    --version
        Output version information and exit.

    -h, --help
        Display help and exit.
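The rule behind --word-segmentation=simple (each non-empty run of non-whitespace characters is one word) can be illustrated with a short Python sketch. This is only a model of the rule as stated above, not the code used by hocr2djvused, and the function name simple_words is hypothetical.

```python
import re


def simple_words(line):
    """Split a line into words the "simple" way (illustrative only).

    Every maximal run of non-whitespace characters counts as one word,
    so punctuation stays attached to the neighbouring letters.
    """
    return re.findall(r"\S+", line)


print(simple_words("Hello, world!  It's 3.14"))
# prints ['Hello,', 'world!', "It's", '3.14']
```

The attached punctuation ("Hello," rather than "Hello") shows why this rule is described as linguistically incorrect, while every resulting word is still delimited by whitespace.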
SEE ALSO
ocrodjvu(1), djvused(1)
AUTHOR
Jakub Wilk <jwilk@jwilk.net>
    Author.
NOTES
1. hOCR
   http://docs.google.com/View?docid=dfxcv4vc_67g844kf

2. OCRopus
   http://ocropus.googlecode.com/

3. Cuneiform
   http://launchpad.net/cuneiform-linux

4. Tesseract
   http://tesseract-ocr.googlecode.com/

5. Unicode Text Segmentation
   http://unicode.org/reports/tr29/

6. HTML5 parser
   http://www.whatwg.org/specs/web-apps/current-work/#html-parser