Provided by: pstotext_1.9-7_amd64

NAME

       pstotext - extract ASCII text from a PostScript or PDF file

SYNTAX

       pstotext [option|pathname]...

       where option includes:

       -cork
       -landscape
       -landscapeOther
       -portrait
       -
       -output file
       -gs command
       -debug
       -bboxes

DESCRIPTION

       pstotext  reads  one  or  more  PostScript  or  PDF files, and writes to standard output a
       representation of the plain text that would be  displayed  if  the  PostScript  file  were
       printed.  As described in the DETAILS section below, this representation is only an
       approximation.  Nevertheless, it is often useful for information retrieval (e.g.,  running
       grep(1) or building a full-text index) or to recover the text from a PostScript file whose
       source you have lost.

       pstotext calls Ghostscript, and  requires  Aladdin  Ghostscript  version  3.51  or  newer.
       Ghostscript  must  be  invokable on the current search path as gs.  Alternatively, you can
       use the -gs option to specify the command (pathname and options) to run Ghostscript.   For
       example, on Windows you might use -gs "c:\gs\gswin32c.exe -Ic:\gs;c:\gs\fonts".

       pstotext  reads  and  processes  its command line from left to right, ignoring the case of
       options.  When it encounters a  pathname,  it  opens  the  file  and  expects  to  find  a
       PostScript  job  or  PDF  document  to  process.  The option - means to read and process a
       PostScript job from standard input.  If  no  -  or  pathname  arguments  are  encountered,
       pstotext reads a PostScript job from standard input. (PDF documents require random access,
       hence cannot be read from standard input.) You can use the -output option to name an
       output file (give it before the input file, because options are processed from left to
       right); otherwise pstotext writes to standard output.
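
       The ordering rule above can be sketched from a calling program.  This is a minimal
       sketch, not part of pstotext: the helper names build_pstotext_argv and run_pstotext
       are hypothetical, and placing -gs ahead of the input file is an assumption drawn from
       the left-to-right processing described above.

```python
import subprocess


def build_pstotext_argv(input_path, output_path=None, gs_command=None):
    """Build a pstotext argument list (hypothetical helper, not part of pstotext)."""
    argv = ["pstotext"]
    if gs_command is not None:
        # Assumption: like -output, -gs should appear before the input it affects.
        argv += ["-gs", gs_command]
    if output_path is not None:
        # The command line is processed left to right, so -output precedes the input.
        argv += ["-output", output_path]
    # Pass "-" as input_path to make pstotext read a PostScript job from standard input.
    argv.append(input_path)
    return argv


def run_pstotext(input_path, output_path=None, gs_command=None):
    """Run pstotext and return its standard output as raw bytes."""
    argv = build_pstotext_argv(input_path, output_path, gs_command)
    return subprocess.run(argv, check=True, stdout=subprocess.PIPE).stdout
```

       With output_path set, the text goes to that file and the returned bytes are expected
       to be empty.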

       The option -cork is only relevant for PostScript files produced by dvips from TeX or LaTeX
       documents; it tells pstotext to use the Cork encoding (known as T1 in LaTeX) rather than
       the old TeX text encoding (known as OT1 in LaTeX).  The option is needed because files
       produced by dvips do not record which font encoding was used.

       The  options  -landscape  and  -landscapeOther  should  be used for documents that must be
       rotated 90 degrees clockwise or counterclockwise, respectively, in order to be readable.

       The options -debug and -bboxes are mostly of use for the maintainers of pstotext.   -debug
       shows  Ghostscript  output  and  error  messages.  -bboxes  outputs one word per line with
       bounding box information.

DETAILS

       pstotext does its work by telling Ghostscript to load a PostScript library that causes  it
       to write to its standard output information about each string rendered by a PostScript job
       or PDF document.  This information includes the  characters  of  the  string,  and  enough
       additional   information   to  approximate  the  string's  bounding  rectangle.   pstotext
       post-processes this information and outputs  a  sequence  of  words  delimited  by  space,
       newline, and formfeed.

       pstotext  outputs  words  in the same sequence as they are rendered by the document.  This
       usually, but not always, follows the order that a human would read the words  on  a  page.
       Within  this sequence, words are separated by either space or newline depending on whether
       or not they fall on the same line.  Each page is terminated with a formfeed.  If  you  use
       the  incorrect  option  from the set {-portrait, -landscape, -landscapeOther}, pstotext is
       likely to substitute newline for space.
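
       The delimiter scheme described above (space between words, newline between lines,
       formfeed after each page) can be undone with a few lines of Python.  This is a minimal
       sketch; the function name split_pstotext_output is hypothetical, and the input is
       assumed to be text that has already been decoded to a string.

```python
def split_pstotext_output(text):
    """Split pstotext output into pages, lines, and words (hypothetical helper)."""
    pages = text.split("\f")
    # Each page is terminated by a formfeed, so the piece after the last one is empty.
    if pages and not pages[-1].strip():
        pages.pop()
    return [
        # Words on one line are separated by spaces; blank lines are skipped.
        [line.split() for line in page.split("\n") if line.strip()]
        for page in pages
    ]
```

       The result is a list of pages, each a list of lines, each a list of words.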

       A PostScript job or PDF document often renders one word as several strings in order to get
       correct  spacing  between  particular  pairs  of  characters.   pstotext  does its best to
       assemble these strings back into words, using a simple heuristic: strings separated  by  a
       distance  of  less  than  0.3 times the minimum of the average character widths in the two
       strings are considered to be part of the same  word.   Note  that  this  typically  causes
       leading and trailing punctuation characters to be included with a word.
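
       The joining rule can be illustrated as follows.  This is a sketch under stated
       assumptions, not the code pstotext uses: the RenderedString type, its x_start and
       x_end fields, and the function names are hypothetical, every string is assumed to be
       non-empty, and all strings are assumed to lie on one horizontal line in rendering
       order.

```python
from dataclasses import dataclass


@dataclass
class RenderedString:
    """One rendered string with its horizontal extent (hypothetical type)."""
    text: str
    x_start: float
    x_end: float

    @property
    def avg_char_width(self):
        # Average width per character; assumes text is non-empty.
        return (self.x_end - self.x_start) / len(self.text)


def same_word(left, right, factor=0.3):
    """True if the gap is below factor times the smaller average character width."""
    gap = right.x_start - left.x_end
    threshold = factor * min(left.avg_char_width, right.avg_char_width)
    return gap < threshold


def assemble_words(strings):
    """Merge adjacent strings into words using the gap rule above."""
    words = []
    previous = None
    for current in strings:
        if previous is not None and same_word(previous, current):
            words[-1] += current.text
        else:
            words.append(current.text)
        previous = current
    return words
```

       For example, "Ty" spanning 0 to 12 and "pe" spanning 12.5 to 24.5 are joined, because
       the gap of 0.5 is smaller than 0.3 times the average character width of 6.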

       The  PostScript  language  provides a flexible encoding scheme by which character codes in
       strings select specific characters (symbols), so a PostScript  job  is  free  to  use  any
       character code.  pstotext, by contrast, always translates to the ISO 8859-1 (Latin-1)
       character set, an extension of ASCII covering most Western European
       languages.  When a character isn't present in ISO 8859-1, pstotext uses a sequence of
       characters, e.g., "---" for em dash or "A\226" for Abreve.  pstotext can be  fooled  by  a
       font  whose Encoding vector doesn't follow Adobe's conventions, but it contains heuristics
       allowing it to handle a wide variety of misbehaving fonts.
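
       Because the output is ISO 8859-1, a program that reads it as bytes can decode it
       directly.  A minimal sketch, with a hypothetical function name:

```python
def decode_pstotext_bytes(raw):
    """Decode raw pstotext output bytes to a string (hypothetical helper)."""
    # ISO 8859-1 assigns a character to every byte value, so decoding never fails.
    return raw.decode("latin-1")
```

       Substitution sequences such as "---" pass through unchanged and have to be mapped
       back by the caller if the original characters are wanted.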

       (pstotext no longer translates hyphen (\255) to minus (\055).)

AUTHOR

       Andrew Birrell (PostScript libraries), Paul McJones (application), Russell  Lang  (Windows
       and OS/2 adaptation), and Hunter Goatley (VMS adaptation).

SEE ALSO

       pstotext  incorporates  technology  originally  developed for the Virtual Paper project at
       SRC; see http://www.research.digital.com/SRC/virtualpaper/.

       As    mentioned    above,    pstotext    invokes    Ghostscript.      See     gs(1)     or
       http://www.cs.wisc.edu/~ghost/.

       Copyright 1995-8 Digital Equipment Corporation.
       Distributed only by permission.
       See file /usr/share/doc/pstotext/copyright for details.

       Last modified on Sat Feb  5 21:00:00 AEST 2000 by rjl
            modified on Fri Jun  5 14:02:37 PDT 1998 by mcjones
            modified on Wed Jun  7 17:47:56 PDT 1995 by birrell

       This  file  was  generated  automatically  by  mtex  software;  see  the mtex home page at
       http://www.research.digital.com/SRC/mtex/.

                                                                                      pstotext(1)