Ubuntu Manpage: pdfsandwich - A generator for sandwich OCR pdfs from scanned pdf files

Provided by: pdfsandwich_0.1.4-1build2_amd64

NAME

       pdfsandwich - A generator for sandwich OCR pdfs from scanned pdf files

SYNOPSIS

       pdfsandwich [options] inputfile.pdf

DESCRIPTION

       pdfsandwich  generates  "sandwich" OCR pdf files, i.e. pdf files which contain only images (no text) will
       be processed by optical character recognition (OCR) and the text will be added  to  each  page  invisibly
       "behind"  the images.  Note that pdfsandwich needs the following programs: unpaper, convert, gs, hocr2pdf
       (for tesseract < 3.03), and tesseract.  As tesseract >= 3.03 can write pdf files, hocr2pdf is only needed
       for older versions of tesseract.  Please visit http://www.tobias-elze.de/pdfsandwich.

OPTIONS

       -convert
              -convert filename : name of convert binary (default: convert)

       -coo   -coo options : additional convert options; make sure  to  quote;  e.g.  -coo  "-normalize  -black-
              threshold 75%" call convert --help or man convert for all convert options

       -debug keep all temporary files in /tmp (for debugging)

       -enforcehocr2pdf
              use hocr2pdf even if tesseract >= 3.03

       -first_page
              -first_page number : number of page to start OCR from (default: 1)

       -grayfilter
              enable unpaper's gray filter; further options can be set by -unpo

       -gs    -gs filename : name of gs binary (default: gs)

       -hocr2pdf
              -hocr2pdf  filename  :  name of hocr2pdf binary (default: hocr2pdf); ignored for tesseract >= 3.03
              unless option -enforcehocr2pdf is set

       -hoo   -hoo options : additional hocr2pdf options; make sure to quote

       -identify
              -identify filename : name of identify binary (default: identify)

       -last_page
              -last_page number : number of page up to which  to  process  OCR  (default:  number  of  pages  in
              inputfile)

       -lang  -lang  language : language of the text; option to tesseract (defaut: eng) e.g: eng, deu, deu-frak,
              fra, rus, swe, spa, ita, ...   see  option  -list_langs;  Multiple  languages  may  be  specified,
              separated by plus characters.

       -layout
              -layout  {  single  |  double | none } : layout of the scanned pages; requires unpaper single: one
              page per sheet double: two pages per sheet none: no auto-layout (default)

       -list_langs
              list currently available languages and exit; in case of custom binaries of tesseract,  place  this
              after the -tesseract option

       -maxpixels
              -maxpixels   NUM  :  maximal  number  of  pixels  allowed  for  input  file  if  (resolution/72)^2
              *width*height > maxpixels then scale page of input file down prior to OCR so  that  page  size  in
              pixels corresponds to maxpixels; default: 17415167 (A3 @ 300 dpi)

       -noimage
              do not place the image over the text (requires hocr2pdf; ignored without -enforcehocr2pdf option)

       -nopreproc
              do not preprocess with unpaper

       -nthreads
              -nthreads number : number of parallel threads (default: guessed number of CPUs; if guessing fails:
              1)

       -o     -o  filename  :  output  file;  default:  inputfile_ocr.pdf  (if extension is different from .pdf,
              original extension is kept)

       -pagesize
              -pagesize { original | NUMxNUM } : set page size of  output  pdf  original:  same  as  input  file
              (default) NUMxNUM: width x height in pixel (e.g. for A4: -pagesize 595x842)

       -resolution
              -resolution NUM : resolution (dpi) used for OCR (default: 300)

       -rgb   use  RGB  color  space  for images (default: black and white); use with care: causes problems with
              some color spaces

       -sloppy_text
              sloppily place text, group words, do not draw single glyphs; ignored for tesseract >= 3.03  unless
              option -enforcehocr2pdf is set

       -tesseract
              -tesseract filename : name of tesseract binary (default: tesseract)

       -tesso -tesso options : additional tesseract options; make sure to quote

       -unpaper
              -unpaper filename : name of unpaper binary (default: unpaper)

       -unpo  -unpo options : additional unpaper options; make sure to quote

       -quiet suppress output

       -verbose
              produce more output

       -version
              print version and quit

       -help  Display this list of options

       --help Display this list of options

LANGUAGES

       Via     Tesseract,     numerous     language     packagess     available     -     follow    this    link
       http://code.google.com/p/tesseract-ocr/downloads/list  for  a  complete  list.  Here  is  an   incomplete
       selection of supported languages and their abbreviations:

       ara  (Arabic),  aze  (Azerbauijani),  bul  (Bulgarian),  cat  (Catalan), ces (Czech), chi_sim (Simplified
       Chinese), chi_tra (Traditional Chinese), chr (Cherokee), dan (Danish), dan-frak (Danish  (Fraktur)),  deu
       (German),  ell (Greek), eng (English), enm (Old English), epo (Esperanto), est (Estonian), fin (Finnish),
       fra (French), frm  (Old  French),  glg  (Galician),  heb  (Hebrew),  hin  (Hindi),  hrv  (Croation),  hun
       (Hungarian),  ind  (Indonesian),  ita  (Italian),  jpn  (Japanese),  kor  (Korean),  lav  (Latvian),  lit
       (Lithuanian), nld  (Dutch),  nor  (Norwegian),  pol  (Polish),  por  (Portuguese),  ron  (Romanian),  rus
       (Russian), slk (Slovakian), slv (Slovenian), sqi (Albanian), spa (Spanish), srp (Serbian), swe (Swedish),
       tam (Tamil), tel (Telugu), tgl (Tagalog), tha (Thai), tur (Turkish), ukr (Ukrainian), vie (Vietnamese)

       Multiple  languages  may  be  specified, separated by plus characters. Note that the respective tesseract
       language package needs to be installed on your system to be usable  by  pdfsandwich.  Option  -list_langs
       lists the languages which are available on your system.

AVAILABILITY

       Sources and packages as well as comprehensive help can be found at http://www.tobias-elze.de/pdfsandwich.

AUTHOR

       Tobias Elze <sourceforge@tobias-elze.de>

                                                12 February 2016                                  PDFSANDWICH(1)