Provided by: ocrodjvu_0.9.1-1_all
NAME
djvu2hocr - DjVu to hOCR converter
SYNOPSIS
djvu2hocr [option...] djvu-file
djvu2hocr {--version | --help | -h}
DESCRIPTION
djvu2hocr converts hidden text from a DjVu file to the hOCR[1] format.
OPTIONS
Input selection options
    -p, --pages=page-range
        Specifies pages to convert. page-range is a comma-separated list of sub-ranges. Each sub-range is either a single page (e.g. 17) or a contiguous range of pages (e.g. 37-42). Pages are numbered from 1. The default is to convert all pages.

Text segmentation options
    --word-segmentation=simple
        Use the same word segmentation as found in the DjVu file. This is the default.
    --word-segmentation=uax29
        Use the Unicode Text Segmentation[2] algorithm to break lines into words, possibly fixing word segmentation found in the DjVu file.

HTML output options
    --title=title
        Specifies the document title. The default is “DjVu hidden text layer”.
    --css=style
        Add the specified CSS style to the document. For example, --css='.ocrx_line { display: block; }' can be used to visually preserve line breaks.

Other options
    --version
        Output version information and exit.
    -h, --help
        Display help and exit.
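The page-range syntax accepted by -p, --pages can be illustrated with a short Python sketch. This is not the parser used by djvu2hocr; the function name parse_page_range is hypothetical, and the choice to reject empty, zero, or descending sub-ranges is an assumption made for the example rather than documented behaviour.

```python
def parse_page_range(spec):
    """Expand a page-range string such as '17,37-42' into a list of page numbers.

    Illustrative only: the error handling here is an assumption, not
    the documented behaviour of djvu2hocr.
    """
    pages = []
    for sub_range in spec.split(","):
        sub_range = sub_range.strip()
        # A sub-range is either a single page ("17") or "first-last" ("37-42").
        first, sep, last = sub_range.partition("-")
        start = int(first)
        end = int(last) if sep else start
        # Pages are numbered from 1, and a range must not run backwards.
        if start < 1 or end < start:
            raise ValueError(f"invalid sub-range: {sub_range!r}")
        pages.extend(range(start, end + 1))
    return pages


if __name__ == "__main__":
    print(parse_page_range("17,37-42"))
```

Under these assumptions, the string 17,37-42 selects page 17 followed by pages 37 through 42.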
PORTABILITY
djvu2hocr uses a custom extension to hOCR to retain characters which cannot be directly represented in an HTML/XML document. For example, the control character BEL (^G, U+0007) is converted into the following HTML chunk:

    <span class="djvu_char" title="#x07"> </span>
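A consumer of the output may want to turn these spans back into the original characters. The following Python sketch shows one way to do so with the standard html.parser module. It assumes that every such span carries a title of the form #x followed by hexadecimal digits, as in the single example above; the class and function names (DjvuCharDecoder, decode_djvu_chars) are hypothetical, and the code returns plain text only, discarding all other markup.

```python
from html.parser import HTMLParser


class DjvuCharDecoder(HTMLParser):
    """Collect text, replacing djvu_char spans with the character they encode."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = False  # True while inside a djvu_char span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        title = attrs.get("title") or ""
        # Assumption: the title is always "#x" plus hex digits (e.g. "#x07").
        if tag == "span" and "djvu_char" in classes and title.startswith("#x"):
            self.parts.append(chr(int(title[2:], 16)))
            self._skip = True

    def handle_endtag(self, tag):
        if tag == "span" and self._skip:
            self._skip = False

    def handle_data(self, data):
        # The placeholder text inside a djvu_char span is dropped.
        if not self._skip:
            self.parts.append(data)


def decode_djvu_chars(html_fragment):
    """Return the text of html_fragment with djvu_char spans decoded."""
    decoder = DjvuCharDecoder()
    decoder.feed(html_fragment)
    decoder.close()
    return "".join(decoder.parts)


if __name__ == "__main__":
    sample = 'ding<span class="djvu_char" title="#x07"> </span>dong'
    print(repr(decode_djvu_chars(sample)))
```

The sketch does not handle spans nested inside a djvu_char span, which the example in this section gives no reason to expect.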
BUGS
Please report bugs at: https://bitbucket.org/jwilk/ocrodjvu/issues
SEE ALSO
djvu(1), hocr2djvused(1), ocrodjvu(1)
NOTES
1. hOCR
   https://docs.google.com/View?docid=dfxcv4vc_67g844kf
2. Unicode Text Segmentation
   http://unicode.org/reports/tr29/