Provided by: exactimage_0.7.4-2ubuntu1_i386
hocr2pdf - hOCR to PDF converter of the ExactImage library
hocr2pdf [-c|--concurrent-lines NUMBER] [-d|--directions BITFIELD]
[-s|--line-skip NUMBER] [-t|--threshold VALUE] FILE...FILE
ExactImage is a fast C++ image processing library. Unlike ImageMagick,
it allows operation in several color spaces and bit depths natively,
resulting in much lower memory and computational requirements. Some
optimized algorithms operate in 1/20 of the time ImageMagick requires,
and displaying large images can be as fast as 1/10 of the time the
"display" program takes.
hocr2pdf is a command-line front end to the image processing library.
It creates precisely laid-out, searchable PDF files from hOCR (annotated
HTML) input obtained from an OCR system.
-i, --input  Input image filename.
-o  Output PDF filename.
-n, --no-image  Do not place the image over the text.
-s, --sloppy-text  Sloppily place text, group words, do not draw single glyphs.
Extract text, including trying to remove hyphens.
Show summary of options.
Creating a Searchable PDF from hOCR input
hOCR (annotated HTML) input must be provided on STDIN, and the image
data is read from the file named by the -i or --input argument. For
example:
$ hocr2pdf -i scan.tiff -o test.pdf < cuneiform-out.hocr
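The invocation above can be wrapped in a small shell function that checks both inputs before starting the conversion. This is only a sketch: the function name hocr_to_pdf is hypothetical and not part of ExactImage, and it uses nothing beyond the -i and -o options and the STDIN redirection already shown.

```shell
# Hypothetical helper (not part of ExactImage): convert one scan plus its
# hOCR file into a searchable PDF.
# Usage: hocr_to_pdf IMAGE HOCR OUTPUT
hocr_to_pdf() {
    image=$1
    hocr=$2
    output=$3
    # Fail early if either input is missing, before hocr2pdf is started.
    [ -r "$image" ] || { echo "cannot read image: $image" >&2; return 1; }
    [ -r "$hocr" ] || { echo "cannot read hOCR: $hocr" >&2; return 1; }
    # The hOCR data goes to STDIN; the image file is named with -i.
    hocr2pdf -i "$image" -o "$output" < "$hocr"
}
```

Called as hocr_to_pdf scan.tiff cuneiform-out.hocr test.pdf, it should run the same command as the example above.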
By default the text layer is hidden by the real image data. Inclusion
of the image data can be disabled with -n, --no-image, so that only the
recognized text from the OCR is visible, e.g. for debugging or to save
space:
$ hocr2pdf -i scan.tiff -n -o test.pdf < cuneiform-out.hocr
Too many gaps between letters in individual words
This might be a problem with imprecise OCR data or justified text with
huge gaps. ExactImage includes a special mode, activated with the
command line argument -s, --sloppy-text, that groups the glyphs between
whitespace into words, which can help PDF viewers produce better
results when copying and pasting text:
$ hocr2pdf -i scan.tiff -s -o test.pdf < cuneiform-out.hocr
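When many pages have to be converted, the same command can be run in a loop. The following is a minimal sketch, assuming each scan NAME.tiff sits next to an OCR result NAME.hocr. That naming scheme and the function name hocr_batch are assumptions made for illustration, not something hocr2pdf requires. Extra options such as -s or -n are passed through unchanged.

```shell
# Hypothetical batch helper (not part of ExactImage): convert every
# NAME.tiff in a directory that has a matching NAME.hocr into NAME.pdf.
# Usage: hocr_batch DIR [extra hocr2pdf options, e.g. -s or -n]
hocr_batch() {
    dir=$1
    shift
    for image in "$dir"/*.tiff; do
        # If the glob matched nothing, the pattern itself is returned.
        [ -e "$image" ] || continue
        base=${image%.tiff}
        if [ ! -r "$base.hocr" ]; then
            echo "skipping $image: no $base.hocr" >&2
            continue
        fi
        # hOCR on STDIN, image via -i, remaining arguments passed as given.
        hocr2pdf -i "$image" "$@" -o "$base.pdf" < "$base.hocr"
    done
}
```

For instance, hocr_batch ./scans -s would apply the sloppy-text mode to every page found in ./scans.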
More information about hocr2pdf and the ExactImage project can be found
ExactImage was written by ExactCODE GmbH <http://www.exactcode.de/>.
This manual page was written by Daniel Baumann <email@example.com>, for
the Debian project (but may be used by others).