Provided by: ocrodjvu_0.10.2-1_all

NAME

       ocrodjvu - OCR for DjVu files

SYNOPSIS

       ocrodjvu {-o | --save-bundled} output-djvu-file [option...] djvu-file

       ocrodjvu {-i | --save-indirect} index-djvu-file [option...] djvu-file

       ocrodjvu --save-script script-file [option...] djvu-file

       ocrodjvu --in-place [option...] djvu-file

       ocrodjvu --dry-run [option...] djvu-file

       ocrodjvu {--version | --help | -h | --list-engines | --list-languages}

DESCRIPTION

       ocrodjvu is a wrapper for OCR systems that allows you to perform OCR on DjVu files.

       The following OCR engines are supported:

       •   OCRopus[1] (internally, ocrodjvu calls ocroscript's recognize (or rec-tess) command,
           so that ultimately Tesseract acts as the OCR backend).

       •   Cuneiform for Linux[2].

       •   Ocrad[3].

       •   GOCR[4].

       •   Stand-alone Tesseract[5].

OPTIONS

   OCR engine options
       -e, --engine=engine-id
           Use this OCR engine.

           The default is “tesseract”. (The default was “ocropus” prior to ocrodjvu 0.8.)

       --list-engines
           Print list of available OCR engines.

   Options controlling output
       -o, --save-bundled=output-djvu-file
           Save OCR results as a bundled multi-page document into output-djvu-file.

       -i, --save-indirect=index-djvu-file
           Save OCR results as an indirect multi-page document. Use index-djvu-file as the index
           file name; put the component files into the same directory. The directory must exist
           and be writable.

       --save-script=script-file
           Save a djvused script with OCR results into script-file.

       --in-place
           Save OCR results in place.

           (Use this option to retain compatibility with ocrodjvu < 0.2.)

       --dry-run
           Don't change any files; throw OCR results away.

       It is mandatory to use exactly one of the above options.
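
       The "exactly one output option" rule can be sketched as a small command-line builder.
       This is only an illustration in Python: the helper name build_command and its mode
       keywords are hypothetical and not part of ocrodjvu; only the option spellings are
       taken from this page.

```python
# Hypothetical helper (not part of ocrodjvu): build an argument list that
# always carries exactly one of the mutually exclusive output options.
OUTPUT_MODES = {
    "bundled": "--save-bundled",
    "indirect": "--save-indirect",
    "script": "--save-script",
    "in-place": "--in-place",
    "dry-run": "--dry-run",
}
# Modes whose option takes a file name argument.
MODES_WITH_TARGET = {"bundled", "indirect", "script"}


def build_command(djvu_file, mode, target=None, extra_options=()):
    """Return an argv list for ocrodjvu with a single output option."""
    if mode not in OUTPUT_MODES:
        raise ValueError("unknown output mode: %r" % (mode,))
    needs_target = mode in MODES_WITH_TARGET
    if needs_target and target is None:
        raise ValueError("mode %r requires a target file" % (mode,))
    if not needs_target and target is not None:
        raise ValueError("mode %r does not take a target file" % (mode,))
    argv = ["ocrodjvu"]
    if needs_target:
        argv.append("%s=%s" % (OUTPUT_MODES[mode], target))
    else:
        argv.append(OUTPUT_MODES[mode])
    argv.extend(extra_options)
    argv.append(djvu_file)
    return argv
```

       The resulting list could be passed to subprocess.run(); because the mode is a single
       argument, a caller cannot accidentally combine two output options.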

       --ocr-only
           If OCR results are to be saved to a separate document (-o/--save-bundled or
           -i/--save-indirect), save only the pages selected for OCR.

           The default is to save all pages, even when the -p/--pages option is in effect.

       --clear-text
           Remove existing hidden text if present in the pages not selected for OCR.

           (Use this option to retain compatibility with ocrodjvu < 0.2.)

       --save-raw-ocr=output-directory
           Save raw OCR results (typically in the hOCR format) into output-directory. The
           directory must exist and be writable.

       --raw-ocr-filename-template=template
           Specifies the file naming scheme for raw OCR results.

           The template language uses the Python string formatting syntax[6]. The following
           fields are available:

           page, page+N, page-N
               page number, optionally shifted by a number N

           id
               page identifier

           id-ext
               page identifier without file extension

           The default template is “{id-ext}”.
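
           The template fields can be approximated in Python with a string.Formatter subclass.
           This is a sketch only, not ocrodjvu's actual implementation: the class and function
           names are hypothetical, and it assumes that id-ext is obtained by stripping the last
           file extension from the page identifier.

```python
import os
import re
import string


class _PageFieldFormatter(string.Formatter):
    """Formatter that understands the page, page+N and page-N fields."""

    _page_field = re.compile(r"^page(?:([+-])(\d+))?$")

    def get_value(self, key, args, kwargs):
        if isinstance(key, str):
            match = self._page_field.match(key)
            if match:
                page = kwargs["page"]
                if match.group(1):
                    shift = int(match.group(2))
                    page += shift if match.group(1) == "+" else -shift
                return page
        # Any other field (id, id-ext) is looked up as-is.
        return super().get_value(key, args, kwargs)


def format_raw_ocr_name(template, page, page_id):
    """Expand a raw-OCR file name template for one page."""
    fields = {
        "page": page,
        "id": page_id,
        # Assumption: "without file extension" means os.path.splitext().
        "id-ext": os.path.splitext(page_id)[0],
    }
    return _PageFieldFormatter().format(template, **fields)
```

           For instance, under these assumptions the template “{page+10:04}.html” would shift
           the page number by 10 and zero-pad it to four digits.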

   Text segmentation options
       -t lines, --details=lines
           Record location of every line. Don't record locations of particular words or
           characters.

           This is the default for OCRopus 0.2. The option is ineffective with stand-alone
           Tesseract 2.0.

       -t words, --details=words
           Record location of every line and every word. Don't record locations of particular
           characters.

           This is the default for most OCR engines.

           This option is ineffective with OCRopus 0.2 and stand-alone Tesseract 2.0.

       -t chars, --details=chars
           Record location of every line, every word and every character.

           This option is ineffective with OCRopus 0.2 and stand-alone Tesseract 2.0.

       --word-segmentation=simple
           Consider each non-empty sequence of non-whitespace characters a single word.

           This is the default, despite being linguistically incorrect.

       --word-segmentation=uax29
           Use the Unicode Text Segmentation[7] algorithm to break lines into words.

           This option breaks assumptions of some DjVu tools that words are separated by spaces,
           and therefore it is not recommended.
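
           The “simple” rule described above amounts to splitting a line on whitespace. A
           minimal Python sketch follows; the function name simple_words is hypothetical, and
           the character offsets are included only to show how each word could be mapped back
           to a position in the line.

```python
import re

# One word = one non-empty run of non-whitespace characters.
_WORD = re.compile(r"\S+")


def simple_words(line):
    """Return (start, end, text) for every word in the line."""
    return [(m.start(), m.end(), m.group()) for m in _WORD.finditer(line)]
```

           Note that punctuation stays attached to the neighbouring word (“Hello,” is one
           word), which is why this scheme is called linguistically incorrect.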

   Other options
       -l, --language=language-id
           Set recognition language.  language-id is typically an ISO 639-2/T three-letter code.

           Tesseract ≥ 3.02 allows specifying multiple languages separated by “+” characters.

           For OCRopus, the default is “eng” (English), unless the tesslanguage environment
           variable is set. For other OCR engines, the default is always “eng”.
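
           The default-language rules can be summarised in a short Python sketch. The function
           names are hypothetical, and the sketch assumes that an explicit --language value
           always takes precedence over the environment variable.

```python
import os


def resolve_language(engine, requested=None, environ=None):
    """Pick the recognition language following the rules described above."""
    environ = os.environ if environ is None else environ
    if requested:
        # Assumption: an explicit --language value always wins.
        return requested
    if engine == "ocropus":
        # Only OCRopus consults the tesslanguage environment variable.
        return environ.get("tesslanguage", "eng")
    return "eng"


def join_languages(codes):
    """Combine several language codes for Tesseract >= 3.02."""
    return "+".join(codes)
```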

       --list-languages
           Print list of available languages for the currently selected OCR engine.

       --render=mask
           Render only masks of page images.

           This is the default.

       --render=foreground
           Render only foreground layers of page images.

       --render=all
           Render all layers of page images.

           This option is necessary to OCR DjVu files with invalid foreground/background
           separation.

       -p, --pages=page-range
           Specifies pages to process.  page-range is a comma-separated list of sub-ranges. Each
           sub-range is either a single page (e.g. 17) or a contiguous range of pages
           (e.g. 37-42). Pages are numbered from 1.

           The default is to process all pages.
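
           The page-range syntax can be illustrated with a small Python parser. This is a
           sketch, not the code ocrodjvu uses: the function name is hypothetical, and rejecting
           reversed ranges (such as 42-37) and keeping duplicate pages are assumptions.

```python
def parse_page_range(spec):
    """Expand a page-range string such as "17,37-42" into page numbers."""
    pages = []
    for part in spec.split(","):
        first, sep, last = part.strip().partition("-")
        start = int(first)
        end = int(last) if sep else start
        # Pages are numbered from 1; reversed ranges are rejected (assumption).
        if start < 1 or end < start:
            raise ValueError("invalid sub-range: %r" % (part,))
        pages.extend(range(start, end + 1))
    return pages
```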

       -j, --jobs=n
           Start up to n OCR processes.

       --version
           Output version information and exit.

       -h, --help
           Display help and exit.

   Advanced options
       -D, --debug
           To ease debugging, don't delete intermediate files.

       -X key=value
           This option allows controlling some details of how ocrodjvu operates.

       --on-error=abort
           Stop program execution when an exceptional situation (e.g., malformed output from the
           OCR engine, internal ocrodjvu error, etc.) occurs.

           This is the default.

       --on-error=resume
           Attempt to recover from exceptional situations.

           This option is strongly discouraged.

       --html5
           Use an HTML5 parser[8], which is more robust but slower than the default parser.

EXIT STATUS

       One of the following exit values can be returned by ocrodjvu:

       0
           The program finished successfully.

       1
           A fatal error occurred.

       2
           The program recovered from an error (--on-error=resume).
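
       A calling script might interpret these values as follows. This is a hedged Python
       sketch: the helper names are hypothetical, and run_and_describe assumes that an
       ocrodjvu executable is available on the PATH.

```python
import subprocess

# Meanings taken from the list of exit values above.
EXIT_STATUS = {
    0: "finished successfully",
    1: "fatal error",
    2: "recovered from an error (--on-error=resume)",
}


def describe_exit_status(code):
    """Translate an ocrodjvu exit status into a short message."""
    return EXIT_STATUS.get(code, "unexpected exit status %d" % code)


def run_and_describe(argv):
    """Run a command (for example an ocrodjvu argv list) and describe the result."""
    completed = subprocess.run(argv)
    return describe_exit_status(completed.returncode)
```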

ENVIRONMENT

       The following environment variables affect ocrodjvu:

       tesslanguage
           Recognition language for Tesseract.

           (Use of this variable is deprecated in favor of the --language option.)

       TMPDIR
           ocrodjvu makes heavy use of temporary files. It will store them in a directory
           specified by this variable. The default is /tmp.

BUGS

   Known bugs
       Tesseract 3.00 is affected by a bug[9] making it produce invalid hOCR output in certain
       circumstances. ocrodjvu does not try to recover from this fault (which couldn't be done
       reliably anyway) unless you pass the -X fix-html=1 option.

       When using Tesseract ≥ 3.00, extracting bounding boxes of particular characters (which
       happens when either --details=chars or --word-segmentation=uax29 is in effect) is
       inefficient. This is due to limitations of the Tesseract command-line interface.

   Reporting new bugs
       Please report bugs at: https://github.com/jwilk/ocrodjvu/issues

SEE ALSO

       djvu(1), djvu2hocr(1), hocr2djvused(1),

       ocroscript(1), tesseract(1), cuneiform(1), ocrad(1), gocr(1)

NOTES

        1. OCRopus
           https://code.google.com/p/ocropus/

        2. Cuneiform for Linux
           https://launchpad.net/cuneiform-linux

        3. Ocrad
           https://www.gnu.org/software/ocrad/

        4. GOCR
           http://jocr.sourceforge.net/

        5. Tesseract
           https://github.com/tesseract-ocr/tesseract

        6. Python string formatting syntax
           https://docs.python.org/library/string.html#format-string-syntax

        7. Unicode Text Segmentation
           http://unicode.org/reports/tr29/

        8. HTML5 parser
           https://html.spec.whatwg.org/multipage/syntax.html#parsing

        9. https://code.google.com/p/tesseract-ocr/issues/detail?id=376