Provided by: catdoc_0.94.2-1_i386
 

NAME

        catdoc - reads an MS-Word file and writes its contents as plain text
        to standard output
 

SYNOPSIS

        catdoc [-vlu8btawxV] [-m number] [-s charset] [-d charset]
        [-f output-format] file
 

DESCRIPTION

        catdoc behaves much like cat(1), but it reads an MS-Word file and
        produces human-readable text on standard output. Optionally it can
        use latex(1) escape sequences for characters which have special
        meaning for LaTeX. It also makes some effort to recognize MS-Word
        tables, although it never tries to write correct headers for the
        LaTeX tabular environment. Additional output formats, such as HTML,
        can be easily defined.
 
        catdoc doesn’t attempt to extract formatting information other than
        tables from an MS-Word document, so the output modes differ mainly
        in which characters are escaped and in how characters missing from
        the output charset are represented. See CHARACTER SUBSTITUTION
        below.
 
        catdoc uses internal unicode(7) representation of text, so it  is  able
        to  convert texts when charset in source document doesn’t match charset
        on target system.  See CHARACTER SETS below.
 
        If no file names are supplied, catdoc processes its standard input
        unless it is a terminal. It is unlikely that somebody would type a
        Word document from the keyboard, so if catdoc is invoked without
        arguments and stdin is not redirected, it prints a brief usage
        message and exits. Processing of standard input (even among other
        files) can be forced by using a dash ’-’ as a file name.
 
        By default, catdoc wraps lines which are more than 72 chars long and
        separates paragraphs by blank lines. This behavior can be turned off
        with the -w switch. In wide mode catdoc prints each paragraph as one
        long line, suitable for import into word processors which perform
        word wrapping.
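
        The two output modes can be illustrated with a minimal Python
        sketch. It is not catdoc's own code: render_paragraphs is a
        hypothetical helper, the wrapping is done by Python's textwrap
        module, and it assumes that wide mode drops the blank separator
        lines as well as the wrapping.

```python
import textwrap


def render_paragraphs(paragraphs, margin=72):
    """Render paragraphs in wrapped mode or in wide mode.

    margin > 0: wrap each paragraph to at most `margin` columns and
    separate paragraphs with a blank line (the documented default).
    margin == 0: wide mode, one long line per paragraph (assumed here
    to come without blank separator lines).
    """
    if margin == 0:
        return "\n".join(paragraphs) + "\n"
    wrapped = [textwrap.fill(p, width=margin) for p in paragraphs]
    return "\n\n".join(wrapped) + "\n"


if __name__ == "__main__":
    sample = ["word " * 30, "A short second paragraph."]
    print(render_paragraphs(sample))            # wrapped at 72 columns
    print(render_paragraphs(sample, margin=0))  # one line per paragraph
```

        The margin argument plays the role of the right margin, with 0
        selecting wide mode.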
 

OPTIONS

        -a      Shortcut for -f ascii. Produces ASCII text as output.
                Separates table columns with TAB.
 
        -b      Process broken MS-Word file. Normally, catdoc checks whether
                the first 8 bytes of the file are the Microsoft OLE
                signature. If so, it processes the file; otherwise it just
                copies it to stdout. This makes it possible to use catdoc as
                a filter for viewing all files with a .doc extension.
 
        -dcharset
                Specifies the destination charset name. The charset file has
                the format described in CHARACTER SETS below, should have a
                .txt extension, and should reside in the catdoc library
                directory (${exec_prefix}/lib/catdoc). By default, the
                current locale charset is used if langinfo support is
                compiled in.
 
        -fformat
                Specifies the output format as described in CHARACTER
                SUBSTITUTION below. catdoc comes with two output formats,
                ascii and tex. You can add your own if you wish.
 
        -l      Causes catdoc to list the names of available charsets on
                stdout and exit successfully.
 
        -mnumber
                Specifies the right margin for text (default 72). -m 0 is
                equivalent to -w.
 
        -scharset
                Specifies the source charset (the one used in the Word
                document), if the Word document doesn’t contain UTF-16 text.
                When reading RTF documents this is typically not necessary,
                because RTF documents contain an ansicpg specification. But
                it can be set wrong by Word (I’ve seen RTF documents in
                Russian where cp1252 was specified). In this case this
                option takes precedence over the charset specified in the
                document. The source_charset statement in the configuration
                file, however, has lower priority than the charset in the
                document.
 
        -t      Shortcut for -f tex. Converts all printable chars which have
                special meaning for LaTeX(1) into appropriate control
                sequences. Separates table columns by &.
 
        -u      Declares that the Word document contains a UNICODE (UTF-16)
                representation of text (as some Word-97 documents do). If
                catdoc fails to correctly decode a Word document with the
                default charset, try this option.
 
        -8      Declares that the Word document is 8-bit, just in case
                catdoc recognizes the file format incorrectly.
 
        -w      Disables word wrapping. By default catdoc output is split into
                lines no longer than 72 characters (or the number specified
                by the -m option) and paragraphs are separated by a blank
                line. With this option each paragraph is one long line.
 
        -x      causes  catdoc  to  output unknown UNICODE character as \xNNNN,
                instead of question marks.
 
        -v      causes catdoc to print some useless information about word doc‐
                ument structure to stdout before actual start of text.
 
        -V      outputs catdoc version
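
        The signature check described under -b can be sketched in Python
        as follows. This is an illustration, not catdoc's implementation:
        the signature bytes are taken from the OLE compound file format
        rather than from this page, looks_like_ole and filter_input are
        hypothetical names, and the force flag encodes the assumption that
        -b simply skips the check.

```python
# Signature that starts every OLE2 compound file; the value comes from the
# compound file format, not from this manual page.
OLE_SIGNATURE = bytes.fromhex("d0cf11e0a1b11ae1")


def looks_like_ole(data: bytes) -> bool:
    """Return True if the first 8 bytes are the OLE signature."""
    return data[:8] == OLE_SIGNATURE


def filter_input(data: bytes, convert, force: bool = False) -> bytes:
    """Convert OLE input with `convert`; pass anything else through unchanged.

    `convert` is a hypothetical callable standing in for the real text
    extraction. `force` models the assumption that -b skips the check.
    """
    if force or looks_like_ole(data):
        return convert(data)
    return data


if __name__ == "__main__":
    fake_convert = lambda d: b"<extracted text>"
    print(filter_input(b"just a text file named .doc", fake_convert))
    print(filter_input(OLE_SIGNATURE + b"...", fake_convert))
```

        Passing non-OLE input through unchanged is what lets the program
        act as a viewer filter for arbitrary .doc files.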
 

CHARACTER SETS

        When processing an MS-Word file, catdoc uses information about two
        character sets, typically different: input and output. They are
        stored in plain text files in the catdoc data directory. Each line
        of a character set file should contain two whitespace-separated
        hexadecimal numbers: the 8-bit code in the character set and the
        16-bit Unicode code. Anything from a hash mark to the end of the
        line is ignored, as are blank lines.
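
        A reader for this file format might look like the Python sketch
        below. parse_charset is a hypothetical helper, the sample entries
        are an illustrative excerpt and not a shipped file, and skipping
        lines that carry fewer than two numbers is an assumption made so
        that incomplete entries do not abort parsing.

```python
def parse_charset(text):
    """Parse charset-file text into {8-bit code: Unicode code point}."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comment and padding
        if not line:
            continue                          # blank or comment-only line
        fields = line.split()
        if len(fields) < 2:
            continue                          # assumption: skip incomplete entries
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


# Illustrative excerpt in the two-column layout described above.
SAMPLE = """\
# 8-bit code, Unicode code
0x41 0x0041   # LATIN CAPITAL LETTER A
0x80 0x20AC   # EURO SIGN

0x93 0x201C   # LEFT DOUBLE QUOTATION MARK
"""

if __name__ == "__main__":
    for code, ucs in sorted(parse_charset(SAMPLE).items()):
        print(f"0x{code:02X} -> U+{ucs:04X}")
```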
 
        catdoc distribution includes some of these character  sets.  Additional
        character  set  definitions,  directly usable by catdoc can be obtained
        from ftp.unicode.org. Charset files have .txt suffix,  which  shouldn’t
        be specified in command-line or configuration files.
 
        Note that catdoc is distributed with Cyrillic charsets as default. If
        you are not Russian, you probably don’t want that, and should
        reconfigure catdoc at compile time or in the runtime configuration
        file.
 
        When dealing with documents with charsets other than the default,
        remember that Microsoft never uses ISO charsets. While letters in,
        say, cp1252 are at the same positions as in ISO-8859-1, some
        punctuation signs would be lost if you specify ISO-8859-1 as the
        input charset. If you use cp1252, catdoc deals with those signs as
        described in CHARACTER SUBSTITUTION below.
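
        The difference between the two charsets can be seen with Python's
        built-in codecs, used here in place of catdoc's own charset files.
        The sample bytes and the helper name high_code_points are
        illustrative only.

```python
# Bytes as a cp1252 document might store them: curly quotes and an en dash.
RAW = b"\x93quoted\x94 \x96 dash"


def high_code_points(raw: bytes, charset: str) -> list:
    """Decode `raw` and return the code points above US-ASCII."""
    return [ord(ch) for ch in raw.decode(charset) if ord(ch) > 0x7F]


if __name__ == "__main__":
    for name in ("cp1252", "iso-8859-1"):
        print(name, [hex(cp) for cp in high_code_points(RAW, name)])
```

        Decoded as cp1252, the three punctuation bytes become U+201C,
        U+201D and U+2013. Decoded as ISO-8859-1, they become the control
        codes U+0093, U+0094 and U+0096, so the punctuation is lost while
        the letters are unaffected.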
 

CHARACTER SUBSTITUTION

        catdoc converts an MS-Word file into the following internal Unicode
        representation:
 
        1. Paragraphs are separated by ASCII Line Feed symbol (0x000A)
 
        2. Table cells within row are separated by ASCII Field Separator symbol
            (0x001C)
 
        3. Table rows are separated by ASCII Record Separator (0x001E)
 
        4. All printable characters, including whitespace, are represented
            with their respective UNICODE codes.
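
        As an illustration of this representation, the Python sketch below
        splits such text back into paragraphs, rows and cells. The helper
        names are hypothetical, and the sample assumes that the Record
        Separator sits between rows rather than after each one.

```python
PARAGRAPH_SEP = "\x0a"  # ASCII Line Feed
CELL_SEP = "\x1c"       # ASCII Field Separator
ROW_SEP = "\x1e"        # ASCII Record Separator


def split_paragraphs(text):
    """Split on Line Feed only.

    str.splitlines() is avoided on purpose: it also treats 0x1C and
    0x1E as line boundaries and would break tables apart.
    """
    return text.split(PARAGRAPH_SEP)


def split_rows(table_text):
    """Split table text into rows, and each row into its cells."""
    return [row.split(CELL_SEP) for row in table_text.split(ROW_SEP)]


if __name__ == "__main__":
    table = "Name\x1cAge\x1eAda\x1c36"
    for row in split_rows(table):
        print(" | ".join(row))
```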
 
        This UNICODE representation is subsequently converted into 8-bit text
        in the target character set using the following four-step algorithm:
 
        1. List of special characters is searched for given Unicode  character.
            If found,  then  appropriate  multi-character  sequence  is  output
            instead of character.
 
        2. If there is an equivalent in target character set, it is output.
 
        3. Otherwise, the replacement list is searched and, if there is a
            multi-character substitution for this UNICODE char, it is output.
 
        4. If all of the above fail, the "Unknown char" symbol (question
            mark) is output.
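
        The four steps can be sketched in Python as follows. The
        dictionaries stand in for the special character list, the target
        charset and the replacement list; their entries are made up for
        the example, the function names are hypothetical, and the
        substitution strings are assumed to be plain ASCII.

```python
def convert_char(ch, special, charset, replacement, unknown="?"):
    """Convert one Unicode character to 8-bit output bytes.

    special, replacement: dicts mapping a character to a substitute string
    charset: dict mapping a character to its 8-bit code in the target set
    """
    if ch in special:                        # step 1: special characters
        return special[ch].encode("ascii")
    if ch in charset:                        # step 2: direct equivalent
        return bytes([charset[ch]])
    if ch in replacement:                    # step 3: replacement list
        return replacement[ch].encode("ascii")
    return unknown.encode("ascii")           # step 4: unknown character


def convert_text(text, special, charset, replacement, unknown="?"):
    """Apply convert_char to every character of `text`."""
    return b"".join(
        convert_char(ch, special, charset, replacement, unknown) for ch in text
    )


if __name__ == "__main__":
    special = {"&": "\\&"}                         # escaped even though '&' exists
    charset = {"A": 0x41, "&": 0x26, "\u00e9": 0xE9}
    replacement = {"\u2014": "---"}
    print(convert_text("A&\u00e9\u2014\u4e2d", special, charset, replacement))
```

        The order of the checks matters: a character such as & is escaped
        in step 1 even though step 2 would also find it in the target
        character set.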
 
        The list of special characters and the list of substitutions are
        character set-independent: special chars should be escaped
        regardless of their existence in the target character set (usually
        they are part of US-ASCII, and therefore exist in any character
        set), and the replacement list is searched only for those characters
        which are not found in the target character set.
 
        These lists are stored in the catdoc data directory in files whose
        names are prefixed with the format name. These files have the
        following format:
 
        Each line is either a comment (starting with a hash mark) or
        contains a hexadecimal UNICODE value, separated by whitespace from
        the string which is substituted for it. If the string contains no
        whitespace it can be used as is; otherwise it should be enclosed in
        single or double quotes. The usual backslash sequences like
        ’\n’ and ’\t’ can be used in these strings.
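
        A line of such a map file could be parsed along the lines of the
        Python sketch below. parse_map_line and unescape are hypothetical
        helpers, the sample entries are invented, and only the \n, \t and
        \\ sequences are translated; leaving any other backslash sequence
        untouched is an assumption.

```python
ESCAPES = {"\\n": "\n", "\\t": "\t", "\\\\": "\\"}


def unescape(s):
    """Translate the backslash sequences listed in ESCAPES, left to right."""
    out, i = [], 0
    while i < len(s):
        pair = s[i:i + 2]
        if pair in ESCAPES:
            out.append(ESCAPES[pair])
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)


def parse_map_line(line):
    """Return (code point, substitute string), or None for comments/blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    parts = line.split(None, 1)
    code = int(parts[0], 16)
    rest = parts[1].strip() if len(parts) > 1 else ""
    if len(rest) >= 2 and rest[0] in "'\"" and rest[-1] == rest[0]:
        value = rest[1:-1]           # quoted string, may contain whitespace
    else:
        value = rest.split()[0] if rest else ""
    return code, unescape(value)


if __name__ == "__main__":
    for sample in ["# em dash", "0x2014 ---", '0x00A9 "(c) "', "0x2028 '\\n'"]:
        print(repr(sample), "->", parse_map_line(sample))
```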
 

RUNTIME CONFIGURATION

        Upon startup catdoc reads its system-wide configuration file
        /etc/catdocrc and then the user-specific configuration file
        ${HOME}/.catdocrc.
 
        These files can contain the following directives:
 
        source_charset = charset-name
                Sets the default source charset, which is used if no -s
                option is specified. Consult the configuration of a nearby
                Windows workstation to find the one you need.
 
        target_charset = charset-name
                Sets the default output charset. You probably know which one
                you use.
 
        charset_path = directory-list
                Colon-separated list of directories which are searched for
                charset files. This allows you to install additional charsets
                in your home directory. If the first directory component of a
                path is ~, it is replaced by the contents of the HOME
                environment variable. On the MS-DOS platform, if a directory
                name starts with %s, it is replaced with the directory of the
                executable file. An empty element in the list (i.e. two
                consecutive colons) is considered the current directory.
 
        map_path = directory-list
                colon-separated  list  of  directories,  which are searched for
                special character map and replacement map.   Same  substitution
                rules as in charset_path are applied.
 
        format = format name
                Output format which is used by default. catdoc comes with
                two formats, ascii and tex, but nothing prevents you from
                writing your own format (a set of two map files: a special
                character map and a replacement map).
 
        unknown_char = character specification
                Sets the character to output instead of an unknown Unicode
                character (default ’?’). The character specification can
                have one of two forms: a character enclosed in single quotes
                or a hexadecimal code.
 
        use_locale = (yes|no)
                Enables or disables automatic selection of the output charset
                (default yes), based on system locale settings (if enabled at
                compile time). If automatic detection is enabled, then output
                charset settings in the configuration files (but not on the
                command line) are ignored, and the current system locale
                charset is used instead. There is no automatic choice of
                input charset based on locale language, because most modern
                Word files (since Word 97) are Unicode anyway.
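
        The directive syntax and the path rules can be illustrated with a
        short Python sketch. It is not catdoc's parser: parse_catdocrc and
        expand_path_list are hypothetical helpers, the sample values and
        directories are invented, and the MS-DOS %s rule is left out.

```python
import os


def parse_catdocrc(text):
    """Collect `name = value` directives into a dict; skip other lines."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue                  # no '=' on this line: not a directive
        settings[name.strip()] = value.strip()
    return settings


def expand_path_list(value, home=None):
    """Expand a colon-separated directory list as described above."""
    if home is None:
        home = os.environ.get("HOME", "")
    dirs = []
    for item in value.split(":"):
        if item == "":
            dirs.append(".")          # empty element: current directory
        elif item == "~" or item.startswith("~/"):
            dirs.append(home + item[1:])
        else:
            dirs.append(item)
    return dirs


# Illustrative configuration; the directories are made up for the example.
SAMPLE = """\
source_charset = cp1251
target_charset = koi8-r
charset_path = ~/.catdoc/charsets::/usr/share/catdoc
use_locale = no
"""

if __name__ == "__main__":
    cfg = parse_catdocrc(SAMPLE)
    print(cfg)
    print(expand_path_list(cfg["charset_path"], home="/home/user"))
```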
 

BUGS

        Doesn’t  handle fast-saves properly. Prints footnotes as separate para‐
        graphs at the end of file, instead of producing correct LaTeX commands.
        Cannot distinguish between empty table cell and end of table row.
 

SEE ALSO

        xls2csv(1), cat(1), strings(1), utf(4), unicode(7)
 

AUTHOR

        V.B.Wagner <vitus@45.free.net>