Ubuntu Manpage: pdfsandwich - A generator for sandwich OCR pdfs from scanned pdf files

Provided by: pdfsandwich_0.1.7-2build6_amd64

NAME

       pdfsandwich - A generator for sandwich OCR pdfs from scanned pdf files

SYNOPSIS

       pdfsandwich [options] inputfile.pdf

DESCRIPTION

       pdfsandwich  generates  "sandwich" OCR pdf files, i.e. pdf files which contain only images
       (no text) will be processed by optical character recognition (OCR) and the  text  will  be
       added  to  each  page  invisibly  "behind"  the  images.   Note that pdfsandwich needs the
       following programs: unpaper, convert, gs, hocr2pdf (for tesseract < 3.03), and  tesseract.
       As  tesseract  >=  3.03 can write pdf files, hocr2pdf is only needed for older versions of
       tesseract.  Please visit http://www.tobias-elze.de/pdfsandwich.

OPTIONS

       -convert
              -convert filename : name of convert binary (default: convert)

       -coo   -coo  options  :  additional  convert  options;  make  sure  to  quote;  e.g.  -coo
              "-normalize  -black-threshold  75%"  call  convert  --help  or  man convert for all
              convert options

       -debug keep all temporary files in /tmp (for debugging)

       -enforcehocr2pdf
              use hocr2pdf even if tesseract >= 3.03

       -first_page
              -first_page number : number of page to start OCR from (default: 1)

       -gray  use grayscale for images (default: black and white)

       -grayfilter
              enable unpaper's gray filter; further options can be set by -unpo

       -gs    -gs filename : name of  gs  binary  (default:  gs);  optional,  only  required  for
              resizing

       -hocr2pdf
              -hocr2pdf  filename  :  name  of  hocr2pdf  binary (default: hocr2pdf); ignored for
              tesseract >= 3.03 unless option -enforcehocr2pdf is set

       -hoo   -hoo options : additional hocr2pdf options; make sure to quote

       -identify
              -identify filename : name of identify binary (default: identify)

       -last_page
              -last_page number : number of page up to which to process OCR (default:  number  of
              pages in inputfile)

       -lang  -lang language : language of the text; option to tesseract (default: eng) e.g: eng,
              deu, deu-frak, fra, rus, swe, spa,  ita,  ...   see  option  -list_langs;  Multiple
              languages may be specified, separated by plus characters.

       -layout
              -layout  { single | double | none } : layout of the scanned pages; requires unpaper
              single: one page per sheet  double:  two  pages  per  sheet  none:  no  auto-layout
              (default)

       -list_langs
              list  currently  available  languages  and  exit;  in  case  of  custom binaries of
              tesseract, place this after the -tesseract option

       -maxpixels
              -maxpixels  NUM  :  maximal  number  of  pixels   allowed   for   input   file   if
              (resolution/72)^2  *width*height  >  maxpixels  then  scale page of input file down
              prior to OCR so that  page  size  in  pixels  corresponds  to  maxpixels;  default:
              17415167 (A3 @ 300 dpi)

       -noimage
              do  not  place  the  image  over  the  text  (requires  hocr2pdf;  ignored  without
              -enforcehocr2pdf option)

       -nopreproc
              do not preprocess with unpaper

       -nthreads
              -nthreads number : number of parallel threads (default: guessed number of CPUs;  if
              guessing fails: 1)

       -o     -o  filename  :  output file; default: inputfile_ocr.pdf (if extension is different
              from .pdf, original extension is kept)

       -omp_thread_limit
              -omp_thread_limit number : number of  threads  tesseract  may  use  for  each  page
              (default: 1)

       -pagesize
              -pagesize  {  original  |  NUMxNUM  }  :  set  page  size  of  output pdf (requires
              ghostscript) original: same as input file (default)  NUMxNUM:  width  x  height  in
              pixel (e.g. for A4: -pagesize 595x842)

       -pdfinfo
              -pdfinfo filename : name of pdfinfo binary (default: pdfinfo)

       -pdfunite
              -pdfunite filename : name of pdfunite binary (default: pdfunite)

       -resolution
              -resolution NUM : resolution (dpi) used for OCR (default: 300)

       -rgb   use  RGB  color  space for images (default: black and white); use with care: causes
              problems with some color spaces

       -sloppy_text
              sloppily place text, group words, do not draw single glyphs; ignored for  tesseract
              >= 3.03 unless option -enforcehocr2pdf is set

       -tesseract
              -tesseract filename : name of tesseract binary (default: tesseract)

       -tesso -tesso options : additional tesseract options; make sure to quote

       -unpaper
              -unpaper filename : name of unpaper binary (default: unpaper)

       -unpo  -unpo options : additional unpaper options; make sure to quote

       -quiet suppress output

       -verbose
              produce more output

       -version
              print version and quit

       -help  Display this list of options

       --help Display this list of options

LANGUAGES

       Via    Tesseract,   numerous   language   packagess   available   -   follow   this   link
       http://code.google.com/p/tesseract-ocr/downloads/list for a  complete  list.  Here  is  an
       incomplete selection of supported languages and their abbreviations:

       ara  (Arabic),  aze  (Azerbauijani),  bul (Bulgarian), cat (Catalan), ces (Czech), chi_sim
       (Simplified Chinese), chi_tra (Traditional Chinese), chr (Cherokee),  dan  (Danish),  dan-
       frak  (Danish (Fraktur)), deu (German), ell (Greek), eng (English), enm (Old English), epo
       (Esperanto),  est  (Estonian),  fin  (Finnish),  fra  (French),  frm  (Old  French),   glg
       (Galician),  heb (Hebrew), hin (Hindi), hrv (Croation), hun (Hungarian), ind (Indonesian),
       ita (Italian), jpn (Japanese), kor (Korean), lav (Latvian), lit (Lithuanian), nld (Dutch),
       nor  (Norwegian),  pol  (Polish),  por  (Portuguese),  ron  (Romanian), rus (Russian), slk
       (Slovakian), slv (Slovenian), sqi (Albanian), spa (Spanish), srp (Serbian), swe (Swedish),
       tam  (Tamil), tel (Telugu), tgl (Tagalog), tha (Thai), tur (Turkish), ukr (Ukrainian), vie
       (Vietnamese)

       Multiple languages  may  be  specified,  separated  by  plus  characters.  Note  that  the
       respective tesseract language package needs to be installed on your system to be usable by
       pdfsandwich. Option -list_langs lists the languages which are available on your system.

AVAILABILITY

       Sources and packages as well as comprehensive help  can  be  found  at  http://www.tobias-
       elze.de/pdfsandwich.

AUTHOR

       Tobias Elze <sourceforge@tobias-elze.de>

                                          18 August 2023                           PDFSANDWICH(1)