Provided by: ocrodjvu_0.9.1-1_all

NAME

       hocr2djvused - hOCR to djvused script converter

SYNOPSIS

       hocr2djvused [option...] [hocr-file...]

DESCRIPTION

       hocr2djvused reads one or more hOCR[1] files (as produced by OCRopus[2], Cuneiform[3], or
       Tesseract[4]) and converts them to a djvused script.

       Unless a filename is explicitly provided on the command line, hOCR is read from the
       standard input.
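
       As an illustration of the invocation described above, the following Python sketch builds
       a command line from options documented on this page and saves the result to a file. The
       helper names (hocr2djvused_argv, write_djvused_script) are hypothetical, and the sketch
       assumes that the generated script is written to standard output, which this page does not
       state explicitly.

```python
import subprocess


def hocr2djvused_argv(hocr_files, details="words"):
    """Build an argument list using only options documented on this page."""
    if details not in ("lines", "words", "chars"):
        raise ValueError(f"unsupported detail level: {details!r}")
    return ["hocr2djvused", f"--details={details}", *hocr_files]


def write_djvused_script(hocr_files, script_path, details="words"):
    """Run hocr2djvused and save its output as a djvused script.

    Assumption: the script is emitted on standard output.
    """
    with open(script_path, "wb") as script:
        subprocess.run(hocr2djvused_argv(hocr_files, details),
                       stdout=script, check=True)
```

       The saved script could then presumably be applied to a document with djvused(1), for
       example with its -f (script file) and -s (save) options.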

OPTIONS

   Text segmentation options
       -t lines, --details=lines
           Record location of every line. Don't record locations of particular words or
           characters.

       -t words, --details=words
           Record location of every line and every word. Don't record locations of particular
           characters.

           This is the default.

       -t chars, --details=chars
           Record location of every line, every word and every character.

       --word-segmentation=simple
           Consider each non-empty sequence of non-whitespace characters a single word.

           This is the default, despite being linguistically incorrect.
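
           The rule above can be sketched in a few lines of Python. The function name
           simple_words is hypothetical; this is an illustration of the stated rule, not the
           tool's actual code.

```python
def simple_words(line):
    """Split a line into maximal runs of non-whitespace characters."""
    # str.split() with no arguments splits on runs of whitespace
    # and never yields empty strings.
    return line.split()


words = simple_words("Hello, world!  It's 3.14")
```

           Note that punctuation stays attached to the neighbouring word ("Hello," and
           "world!"), which is why the rule is described as linguistically incorrect.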

       --word-segmentation=uax29
           Use the Unicode Text Segmentation[5] algorithm to break lines into words.

           This option breaks the assumption made by some DjVu tools that words are separated
           by spaces, and is therefore not recommended.
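
           The following Python sketch illustrates why such segmentation can confuse tools that
           expect space-separated words. The regular expression is only a crude stand-in that
           splits punctuation from letters and digits; it is not an implementation of the
           Unicode Text Segmentation algorithm, and the name rough_words is hypothetical.

```python
import re


def rough_words(line):
    """Crude stand-in for UAX #29: split punctuation off letters and digits.

    NOT the real Unicode Text Segmentation algorithm; illustration only.
    """
    return re.findall(r"\w+|[^\w\s]", line)


line = "Hello, world!"
tokens = rough_words(line)
# A tool that assumes words are separated by spaces would rebuild the line
# like this, which no longer matches the original text.
rejoined = " ".join(tokens)
```

           Here "Hello" and "," become separate words with no space between them in the
           original line, so joining the words with spaces does not reproduce the line.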

   Other options
       --rotation=n
           Assume that DjVu pages are rotated by n degrees.

       --page-size=widthxheight
           Specify that the page size is width × height pixels.

           This option is required for hOCR generated by Cuneiform (< 0.8) and superfluous
           otherwise.
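
           A minimal Python sketch of how a widthxheight value might be parsed and used. It
           assumes that hOCR bounding boxes are measured from the top-left corner of the page
           and that DjVu coordinates are measured from the bottom-left corner, so the page
           height is needed to flip the vertical axis. Both helper names are hypothetical, and
           whether the tool performs exactly this computation is not stated on this page.

```python
def parse_page_size(value):
    """Parse a 'WIDTHxHEIGHT' string such as '2550x3300' into two integers."""
    width, height = value.lower().split("x")
    return int(width), int(height)


def hocr_bbox_to_djvu(bbox, page_height):
    """Flip a (x0, y0, x1, y1) box from a top-left to a bottom-left origin.

    Assumption: input uses a top-left origin, output a bottom-left origin.
    """
    x0, y0, x1, y1 = bbox
    return (x0, page_height - y1, x1, page_height - y0)


width, height = parse_page_size("2550x3300")
flipped = hocr_bbox_to_djvu((100, 200, 400, 260), height)
```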

       --html5
           Use an HTML5 parser[6], which is more robust but slower than the default parser.

       --fix-utf8
           Attempt to fix UTF-8 encoding issues and eliminate unwanted control characters.

           This option might be needed for hOCR generated by Cuneiform[7] or Tesseract[8].
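
           As a rough idea of what this kind of clean-up can involve, the Python sketch below
           replaces undecodable bytes and drops control characters other than tab, newline and
           carriage return. It is an illustration under those assumptions only; the function
           name clean_text is hypothetical and the tool's actual behaviour may differ.

```python
import unicodedata


def clean_text(raw):
    """Decode bytes as UTF-8 and strip unwanted control characters."""
    # Invalid byte sequences become U+FFFD instead of raising an error.
    text = raw.decode("utf-8", errors="replace")
    # Keep common whitespace controls; drop every other "Cc" character.
    return "".join(
        ch for ch in text
        if ch in "\t\n\r" or unicodedata.category(ch) != "Cc"
    )


cleaned = clean_text(b"caf\xc3\xa9\x00 ok\xff")
```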

       --version
           Output version information and exit.

       -h, --help
           Display help and exit.

BUGS

       Please report bugs at: https://bitbucket.org/jwilk/ocrodjvu/issues

SEE ALSO

       djvu(1), ocrodjvu(1), djvu2hocr(1), djvused(1)

NOTES

        1. hOCR
           https://docs.google.com/View?docid=dfxcv4vc_67g844kf

        2. OCRopus
           https://code.google.com/p/ocropus/

        3. Cuneiform
           https://launchpad.net/cuneiform-linux

        4. Tesseract
           https://code.google.com/p/tesseract-ocr/

        5. Unicode Text Segmentation
           http://unicode.org/reports/tr29/

        6. HTML5 parser
           http://www.whatwg.org/specs/web-apps/current-work/#html-parser

        7. https://bugs.launchpad.net/cuneiform-linux/+bug/585418

        8. https://code.google.com/p/tesseract-ocr/issues/detail?id=690