Ubuntu Manpage: pdfsandwich - A generator for sandwich OCR pdfs from scanned pdf files

NAME

       pdfsandwich - A generator for sandwich OCR pdfs from scanned pdf files

SYNOPSIS

       pdfsandwich [options] inputfile.pdf
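
       The synopsis can also be driven from a script. The following Python sketch assembles an
       argument list of the form shown above. The helper names default_output_name and
       build_command are hypothetical, and the output naming only follows the wording of the -o
       entry under OPTIONS; it is not taken from the pdfsandwich source code.

```python
import os


def default_output_name(inputfile):
    """Follow the documented -o default: insert "_ocr" before the extension."""
    stem, ext = os.path.splitext(inputfile)
    return stem + "_ocr" + ext


def build_command(inputfile, langs=("eng",), first_page=None, last_page=None, output=None):
    """Return an argv list of the form: pdfsandwich [options] inputfile.pdf"""
    # Multiple languages are joined with plus characters, as described for -lang.
    cmd = ["pdfsandwich", "-lang", "+".join(langs)]
    if first_page is not None:
        cmd += ["-first_page", str(first_page)]
    if last_page is not None:
        cmd += ["-last_page", str(last_page)]
    if output is not None:
        cmd += ["-o", output]
    cmd.append(inputfile)
    return cmd
```

       Passing the resulting list to subprocess.run(cmd, check=True) would run the command
       without shell quoting concerns, assuming pdfsandwich is installed and on the PATH.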

DESCRIPTION

       pdfsandwich generates "sandwich" OCR pdf files: pdf files which contain only images (no
       text) are processed by optical character recognition (OCR), and the recognized text is
       added to each page invisibly "behind" the images.  pdfsandwich needs the following
       programs: unpaper, convert, gs, tesseract, and, for tesseract < 3.03, hocr2pdf.  As
       tesseract >= 3.03 can write pdf files itself, hocr2pdf is only needed for older versions
       of tesseract.  For further information, visit http://www.tobias-elze.de/pdfsandwich.
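
       Before a batch run it can help to check that the external programs are present. The
       sketch below is one possible way to do that in Python; the function names are
       hypothetical, and the version parsing assumes a dotted version string such as "3.02" or
       "4.1.1" that the caller has already obtained from tesseract.

```python
import shutil

# Programs named in the description; hocr2pdf is handled separately below.
# Note that the -gs entry describes gs as optional (only required for resizing).
REQUIRED = ("unpaper", "convert", "gs", "tesseract")


def missing_programs(names=REQUIRED, which=shutil.which):
    """Return the program names that cannot be found on the PATH."""
    return [name for name in names if which(name) is None]


def needs_hocr2pdf(tesseract_version):
    """True if the dotted version string denotes a tesseract older than 3.03."""
    parts = tesseract_version.split(".")
    major, minor = int(parts[0]), int(parts[1])
    return (major, minor) < (3, 3)
```

       If the binaries have non-default names, the names passed to missing_programs would have
       to match whatever is given to -convert, -gs, -tesseract, and -unpaper.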

OPTIONS

       -convert
              -convert filename : name of convert binary (default: convert)

       -coo   -coo options : additional convert options; make sure to quote, e.g. -coo
              "-normalize -black-threshold 75%"; call convert --help or see man convert for all
              convert options

       -debug keep all temporary files in /tmp (for debugging)

       -enforcehocr2pdf
              use hocr2pdf even if tesseract >= 3.03

       -first_page
              -first_page number : number of page to start OCR from (default: 1)

       -gray  use grayscale for images (default: black and white)

       -grayfilter
              enable unpaper's gray filter; further options can be set by -unpo

       -gs    -gs filename : name of  gs  binary  (default:  gs);  optional,  only  required  for
              resizing

       -hocr2pdf
              -hocr2pdf  filename  :  name  of  hocr2pdf  binary (default: hocr2pdf); ignored for
              tesseract >= 3.03 unless option -enforcehocr2pdf is set

       -hoo   -hoo options : additional hocr2pdf options; make sure to quote

       -identify
              -identify filename : name of identify binary (default: identify)

       -last_page
              -last_page number : number of page up to which to process OCR (default:  number  of
              pages in inputfile)

       -lang  -lang language : language of the text; option to tesseract (default: eng), e.g. eng,
              deu, deu-frak, fra, rus, swe, spa, ita, ...; see option -list_langs.  Multiple
              languages may be specified, separated by plus characters.

       -layout
              -layout { single | double | none } : layout of the scanned pages; requires unpaper
              single: one page per sheet
              double: two pages per sheet
              none: no auto-layout (default)

       -list_langs
              list  currently  available  languages  and  exit;  in  case  of  custom binaries of
              tesseract, place this after the -tesseract option

       -maxpixels
              -maxpixels NUM : maximal number of pixels allowed for input file; if
              (resolution/72)^2 * width * height > maxpixels, then scale page of input file down
              prior to OCR so that page size in pixels corresponds to maxpixels; default:
              17415167 (A3 @ 300 dpi)

       -noimage
              do  not  place  the  image  over  the  text  (requires  hocr2pdf;  ignored  without
              -enforcehocr2pdf option)

       -nopreproc
              do not preprocess with unpaper

       -nthreads
              -nthreads number : number of parallel threads (default: guessed number of CPUs;  if
              guessing fails: 1)

       -o     -o  filename  :  output file; default: inputfile_ocr.pdf (if extension is different
              from .pdf, original extension is kept)

       -pagesize
              -pagesize { original | NUMxNUM } : set page size of output pdf (requires
              ghostscript)
              original: same as input file (default)
              NUMxNUM: width x height in pixels (e.g. for A4: -pagesize 595x842)

       -pdfinfo
              -pdfinfo filename : name of pdfinfo binary (default: pdfinfo)

       -pdfunite
              -pdfunite filename : name of pdfunite binary (default: pdfunite)

       -resolution
              -resolution NUM : resolution (dpi) used for OCR (default: 300)

       -rgb   use RGB color space for images (default: black and white); use  with  care:  causes
              problems with some color spaces

       -sloppy_text
              sloppily  place text, group words, do not draw single glyphs; ignored for tesseract
              >= 3.03 unless option -enforcehocr2pdf is set

       -tesseract
              -tesseract filename : name of tesseract binary (default: tesseract)

       -tesso -tesso options : additional tesseract options; make sure to quote

       -unpaper
              -unpaper filename : name of unpaper binary (default: unpaper)

       -unpo  -unpo options : additional unpaper options; make sure to quote

       -quiet suppress output

       -verbose
              produce more output

       -version
              print version and quit

       -help  Display this list of options

       --help Display this list of options
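
       The arithmetic behind -maxpixels and the A4 example for -pagesize can be worked through
       in a few lines of Python. This is a sketch under stated assumptions, not pdfsandwich's
       own code: it assumes that width and height in the -maxpixels formula are given in units
       of 1/72 inch (as the division by 72 suggests), that the downscaling is uniform in both
       directions, and that the -pagesize numbers use the same unit. All function names are
       hypothetical.

```python
import math

UNITS_PER_INCH = 72
DEFAULT_MAXPIXELS = 17415167  # default stated in the -maxpixels entry (A3 @ 300 dpi)


def pixel_count(width, height, resolution=300):
    """(resolution/72)^2 * width * height: the quantity compared with maxpixels."""
    return (resolution / UNITS_PER_INCH) ** 2 * width * height


def downscale_factor(width, height, resolution=300, maxpixels=DEFAULT_MAXPIXELS):
    """Linear scale factor that brings a page down to maxpixels (1.0 if it fits)."""
    pixels = pixel_count(width, height, resolution)
    if pixels <= maxpixels:
        return 1.0
    # Scaling both sides by f scales the pixel count by f^2 (assumed uniform scaling).
    return math.sqrt(maxpixels / pixels)


def mm_to_units(mm):
    """Convert millimetres to 1/72 inch units, rounded to the nearest integer."""
    return round(mm / 25.4 * UNITS_PER_INCH)
```

       Under these assumptions an A3 page (842 x 1191) at 300 dpi stays below the default limit
       and is left unscaled, whereas an A2 page (1191 x 1684) would be scaled by a linear factor
       of about 0.71. Converting the A4 dimensions of 210 mm x 297 mm with mm_to_units gives
       595 and 842, matching the -pagesize 595x842 example.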

LANGUAGES

       Via Tesseract, numerous language packages are available; see
       http://code.google.com/p/tesseract-ocr/downloads/list for a complete list. Here is an
       incomplete selection of supported languages and their abbreviations:

       ara (Arabic), aze (Azerbaijani), bul (Bulgarian), cat (Catalan), ces (Czech), chi_sim
       (Simplified Chinese), chi_tra (Traditional Chinese), chr (Cherokee), dan (Danish),
       dan-frak (Danish (Fraktur)), deu (German), ell (Greek), eng (English), enm (Middle
       English), epo (Esperanto), est (Estonian), fin (Finnish), fra (French), frm (Middle
       French), glg (Galician), heb (Hebrew), hin (Hindi), hrv (Croatian), hun (Hungarian), ind
       (Indonesian),
       ita (Italian), jpn (Japanese), kor (Korean), lav (Latvian), lit (Lithuanian), nld (Dutch),
       nor (Norwegian), pol (Polish),  por  (Portuguese),  ron  (Romanian),  rus  (Russian),  slk
       (Slovakian), slv (Slovenian), sqi (Albanian), spa (Spanish), srp (Serbian), swe (Swedish),
       tam (Tamil), tel (Telugu), tgl (Tagalog), tha (Thai), tur (Turkish), ukr (Ukrainian),  vie
       (Vietnamese)

       Multiple  languages  may  be  specified,  separated  by  plus  characters.  Note  that the
       respective tesseract language package needs to be installed on your system to be usable by
       pdfsandwich. Option -list_langs lists the languages which are available on your system.
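
       A script can verify a -lang value against the installed languages before starting a long
       OCR run. The following Python sketch uses a hypothetical helper, missing_languages, and
       expects the caller to supply the installed codes (for instance gathered from
       -list_langs, whose output format is not specified here).

```python
def missing_languages(requested, installed):
    """Return the codes in a '+'-separated -lang value that are not installed.

    requested: string as passed to -lang, e.g. "deu+eng"
    installed: iterable of language codes available on the system
    """
    available = set(installed)
    return [code for code in requested.split("+") if code not in available]
```

       An empty result means every requested language package is present.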

AVAILABILITY

       Sources and packages as well as comprehensive help can be found at
       http://www.tobias-elze.de/pdfsandwich.

AUTHOR

       Tobias Elze <sourceforge@tobias-elze.de>

                                         25 January 2017                           PDFSANDWICH(1)