Provided by: spamassassin_4.0.0-8ubuntu5_all 
      
    
NAME
       ExtractText - extracts text from documents.
SYNOPSIS
       loadplugin Mail::SpamAssassin::Plugin::ExtractText
       ifplugin Mail::SpamAssassin::Plugin::ExtractText
         extracttext_external  pdftotext  /usr/bin/pdftotext -nopgbrk -layout -enc UTF-8 {} -
         extracttext_use       pdftotext  .pdf application/pdf
         # http://docx2txt.sourceforge.net
         extracttext_external  docx2txt   /usr/bin/docx2txt {} -
         extracttext_use       docx2txt   .docx application/docx
         extracttext_external  antiword   /usr/bin/antiword -t -w 0 -m UTF-8.txt {}
         extracttext_use       antiword   .doc application/(?:vnd\.?)?ms-?word.*
         extracttext_external  unrtf      /usr/bin/unrtf --nopict {}
         extracttext_use       unrtf      .doc .rtf application/rtf text/rtf
         extracttext_external  odt2txt    /usr/bin/odt2txt --encoding=UTF-8 {}
         extracttext_use       odt2txt    .odt .ott application/.*?opendocument.*text
         extracttext_use       odt2txt    .sdw .stw application/(?:x-)?soffice application/(?:x-)?starwriter
         extracttext_external  tesseract  {OMP_THREAD_LIMIT=1} /usr/bin/tesseract -c page_separator= {} -
         extracttext_use       tesseract  .jpg .png .bmp .tif .tiff image/(?:jpeg|png|x-ms-bmp|tiff)
         add_header   all          ExtractText-Flags _EXTRACTTEXTFLAGS_
         header       PDF_NO_TEXT  X-ExtractText-Flags =~ /\bpdftotext_NoText\b/
         describe     PDF_NO_TEXT  PDF without text
         score        PDF_NO_TEXT  0.001
         header       DOC_NO_TEXT  X-ExtractText-Flags =~ /\b(?:antiword|openxml|unrtf|odt2txt)_NoText\b/
         describe     DOC_NO_TEXT  Document without text
         score        DOC_NO_TEXT  0.001
         header       EXTRACTTEXT  exists:X-ExtractText-Flags
         describe     EXTRACTTEXT  Email processed by extracttext plugin
         score        EXTRACTTEXT  0.001
       endif
DESCRIPTION
       This module uses external tools to extract text from message parts, and then sets the text as the
       rendered part. The external tool must output plain text, not HTML or any other non-plain-text result.
       How to extract text is completely configurable, and based on MIME part type and file name.
CONFIGURATION
       All configuration lines in user_prefs files will be ignored.
       extracttext_maxparts (default: 10)
           Configures the maximum number of MIME parts to analyze. A value of 0 means all MIME parts will be
           analyzed.
       extracttext_timeout (default: 5 10)
           Configures the timeout, in seconds, of external tool checks, per attachment.
           The second argument specifies the maximum total time for all checks.
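           As a sketch, both limits could be tightened in a site-wide configuration file such as local.cf
           (user_prefs is ignored, as noted above). The values 5, 3 and 8 are illustrative assumptions, not
           recommendations:

```
ifplugin Mail::SpamAssassin::Plugin::ExtractText
  # Analyze at most 5 MIME parts per message (0 would mean no limit).
  extracttext_maxparts  5
  # Allow 3 seconds per attachment and 8 seconds for all attachments together.
  extracttext_timeout   3 8
endif
```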
   Tools
       extracttext_use
           Specifies what tool to use for what message parts.
           The general syntax is
           extracttext_use  "name"  "specifiers"
       name
           the internal name of a tool.
       specifiers
           File extensions and regular expressions for file names and MIME types. The regular expressions are
           anchored at the beginning and end.
       Examples
               extracttext_use  antiword  .doc application/(?:vnd\.?)?ms-?word.*
               extracttext_use  openxml   .docx .dotx .dotm application/(?:vnd\.?)openxml.*?word.*
               extracttext_use  openxml   .doc .dot application/(?:vnd\.?)?ms-?word.*
               extracttext_use  unrtf     .doc .rtf application/rtf text/rtf
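           Because the regular expressions are anchored at both ends, a specifier such as application/pdf
           matches only that exact MIME type. The sketch below assumes a pdftotext tool defined as in the
           SYNOPSIS, and assumes that some senders label PDF parts as application/x-pdf; the optional group
           lets one specifier cover both spellings:

```
# Matches "application/pdf" and "application/x-pdf", but not "application/pdfx".
extracttext_use  pdftotext  .pdf application/(?:x-)?pdf
```

           The same anchoring is why the examples above end a MIME type pattern with .* wherever trailing
           characters should be accepted.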
       extracttext_external
           Defines an external tool. The tool must read a document on standard input or from a file and write
           text to standard output.
           The special keyword "{}" will be substituted at runtime with the temporary filename to be scanned by
           the external tool.
           Environment variables can be defined with "{KEY=VALUE}"; these strings will be removed from the
           command line.
           The command line must write its result directly to STDOUT.
           The general syntax is
           extracttext_external "name" "command" "parameters"
       name
           The internal name of this tool.
       command
           The full path to the external command to run.
       parameters
           Parameters for the external command. The temporary file name containing the document will be
           automatically added as the last parameter.
       Examples
               extracttext_external  antiword  /usr/bin/antiword -t -w 0 -m UTF-8.txt {} -
               extracttext_external  unrtf     /usr/bin/unrtf --nopict {}
               extracttext_external  odt2txt   /usr/bin/odt2txt --encoding=UTF-8 {}
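           The tesseract definition from the SYNOPSIS combines both special forms and may serve as an
           annotated example. The comments are explanatory additions; reading the trailing "-" as
           tesseract's standard output target is based on tesseract's own usage and should be checked
           against the installed version:

```
# {OMP_THREAD_LIMIT=1}  sets an environment variable for the tool and is
#                       removed from the command line
# {}                    is replaced with the temporary file name
# -                     asks tesseract to write the text to standard output
extracttext_external  tesseract  {OMP_THREAD_LIMIT=1} /usr/bin/tesseract -c page_separator= {} -
```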
   Metadata
       The plugin adds some pseudo headers to the message. These headers are seen by the Bayes system and can
       be used in normal SpamAssassin rules.
       The headers are also available as template tags as noted below.
       Example
       The fictional example headers below are based on a message containing this:
       1. A perfectly normal PDF.
       2. An OpenXML document with a Word document inside. Neither Office document contains text.
       Headers
       X-ExtractText-Chars
           Tag: _EXTRACTTEXTCHARS_
           Contains a count of characters that were extracted.
           X-ExtractText-Chars: 10970
       X-ExtractText-Words
           Tag: _EXTRACTTEXTWORDS_
           Contains a count of "words" that were extracted.
           X-ExtractText-Words: 1599
       X-ExtractText-Tools
           Tag: _EXTRACTTEXTTOOLS_
           Contains chains of tools used for extraction.
           X-ExtractText-Tools: pdftotext openxml_antiword
       X-ExtractText-Types
           Tag: _EXTRACTTEXTTYPES_
           Contains chains of MIME types for parts found during extraction.
           X-ExtractText-Types: application/pdf; application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/ms-word
       X-ExtractText-Extensions
           Tag: _EXTRACTTEXTEXTENSIONS_
           Contains chains of canonicalized file extensions for parts found during extraction.
           X-ExtractText-Extensions: pdf docx
       X-ExtractText-Flags
           Tag: _EXTRACTTEXTFLAGS_
           Contains notes from the plugin.
           X-ExtractText-Flags: openxml_NoText
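           The template tags listed above can be written into the processed message with add_header, in the
           same way the SYNOPSIS does for the flags. This is a minimal sketch; which tags are worth exposing
           is a local choice:

```
add_header  all  ExtractText-Chars  _EXTRACTTEXTCHARS_
add_header  all  ExtractText-Words  _EXTRACTTEXTWORDS_
add_header  all  ExtractText-Tools  _EXTRACTTEXTTOOLS_
```

           Note that add_header prefixes the given name with "X-Spam-", so these lines produce headers such
           as X-Spam-ExtractText-Chars rather than the pseudo headers themselves.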
       Rules
       Example:
               header    PDF_NO_TEXT  X-ExtractText-Flags =~ /\bpdftotext_NoText\b/
               describe  PDF_NO_TEXT  PDF without text
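        The other pseudo headers can be combined in the same way. The following sketch flags messages where
        pdftotext was used but fewer than 100 words were extracted in total. The rule names and the
        100-word threshold are hypothetical, and the word count covers all extracted parts, not only the
        PDF:

```
# Sub-rules (names starting with "__") are not scored on their own.
header    __EXTRACT_PDF_USED   X-ExtractText-Tools =~ /\bpdftotext\b/
header    __EXTRACT_FEW_WORDS  X-ExtractText-Words =~ /^\d{1,2}$/
meta      PDF_FEW_WORDS        __EXTRACT_PDF_USED && __EXTRACT_FEW_WORDS
describe  PDF_FEW_WORDS        PDF extracted, fewer than 100 words in total
score     PDF_FEW_WORDS        0.001
```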
perl v5.38.2                                       2024-04-12              Mail::SpamAssa...in::ExtractText(3pm)