Provided by: mnogosearch-common_3.2.33-1ubuntu1_all

NAME

       indexer.conf - configuration file for indexer

DESCRIPTION

       This is the configuration file for indexer(1). The configuration file
       consists of commands and their arguments. All commands are
       case-insensitive. Use # to comment out lines.

VARIABLES

       Global parameters

              These  commands  should be used only once and take global effect
              for the whole configuration file.

       DBType type
              Database type. Currently supported values are mysql, pgsql,
              msql, solid, mssql, oracle, ibase and sqlite. The value does not
              matter when native library support is used, but ODBC users must
              specify one of the supported values. If your database type is
              not supported, use unknown instead.

       DBHost host
              SQL host name (not required for ODBC).

              Default: localhost

       DBName mnogosearch
              SQL database name or ODBC DSN

              Default: mnogosearch

       DBUser foo
              User name used to connect to the database.

              Default: no user

       DBPass bar
              Password used to connect to the database.

              Default: no password

       DBMode single|multi|crc|crc-multi
              Word storage mode for SQL databases. Does not apply to the
              built-in database. When single is specified, all words are
              stored in the same table. multi means that words are stored in
              different tables depending on word length. multi mode is usually
              faster, but it requires more tables in the database. In crc
              mode, mnoGoSearch stores 32-bit integer word IDs calculated by
              the CRC32 algorithm instead of the words themselves. crc mode
              requires less disk space and is faster than single and multi
              modes. crc-multi mode shares its storage structure with crc
              mode, but stores words in different tables depending on word
              length, like multi mode.

              Default: single
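
       Example:
              A sketch only, assuming an SQL database is in use; the mode
              chosen here is an illustration, not a recommendation:

              #Store CRC32 word IDs, split across tables by word length
              DBMode crc-multi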

       LocalCharset charset
              Defines the charset of the local file system. It is required if
              you are using 8-bit characters and is not applicable to 7-bit
              characters. This command should be used only once and takes
              global effect for the whole configuration file.

              Example:
              LocalCharset windows-1250

       CrossWords yes|no
              Enables building of the CrossWords index. Crosswords are the
              words used in a link that points to the current page. The
              default value is no.

       StopWordFile filename
              This command indicates which file contains the stopword list to
              load. You may specify either an absolute file name or a file
              name relative to the mnoGoSearch /etc directory. You may use
              several StopWordFile commands.

       MinWordLength characters
       MaxWordLength characters
              With these commands you can change the default length range of
              words stored in the database. By default mnoGoSearch stores
              words that are longer than 1 and shorter than 32 characters.

              Example:
              MaxWordLength 35

       MaxDocSize bytes
              Specify the maximum size, in bytes, of a document that can be
              indexed. The default value is 1048576 (1 MB). This command takes
              global effect for the whole config file.

       HTTPHeader header
              You may add custom HTTP headers to the indexer HTTP request. Do
              not use the "If-modified-since" and "Accept-Charset" headers,
              since they are composed by indexer itself. "User-Agent:
              mnoGoSearch/version" is sent too, although you may override it.
              This command has global effect for the whole configuration file.

       ServerTable table_name
              This command works only with an SQL database and is not
              applicable to the built-in database mode. It loads servers with
              all their parameters from the table table_name. For an example
              of the table structure, refer to the file
              create/mysql/server.txt. You may use several arguments with
              this command:

              ServerTable my_servers1 my_servers2 my_servers3

              or just a single argument:

              ServerTable server

       DeleteNoServer yes|no
              Use this command to specify whether to delete URLs that have no
              corresponding Server command. The default value is yes.

       VarDir /path/to/my/var/dir
              Specify a custom path to the directory where indexer stores its
              data when used with the built-in database and in cache mode. By
              default the /var directory of the mnoGoSearch installation is
              used.

URL Control Configuration

       Allow [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ...]
              Use this command to allow URLs that match (or do not match) the
              given argument. The first three optional parameters describe the
              type of comparison. The default values are Match, NoCase and
              String. Use NoCase or Case to choose case-insensitive or
              case-sensitive comparison. Use Regex to choose regular
              expression comparison. Use String to choose string comparison
              with wildcards. The wildcards are * for any number of
              characters and ? for one character. Because * and ? have a
              special meaning in the String match type, use Regex to describe
              documents with ? and * signs in the URL. String matching is much
              faster than Regex, so use String where possible. You may use
              several arguments for one Allow command and use this command any
              number of times. It takes global effect for the config file.
              Note that mnoGoSearch automatically adds one Allow regex .*
              command after reading the config file. That command means that
              everything that is not disallowed is allowed.
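
       Example:
              A sketch of the syntax only. The patterns are illustrative
              assumptions, and it is assumed here that a String pattern has
              to match the whole URL:

              #String match with wildcards, the default comparison type
              Allow *.html *.htm
              #Case-sensitive regular expression for the same extensions
              Allow Match Case Regex \.html?$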

       Disallow [Match|NoMatch] [Case|NoCase] [String|Regex] [<arg> ...]
              Use this to disallow indexing documents  with  URLs  that  match
              given  argument.   The  meaning  of  the  first  three  optional
              parameters is exactly the same as with the  Allow  command.  You
              can use several arguments for one Disallow command. Takes global
              effect for config file.

       Example:
              #Exclude cgi-bin and non-parsed-headers
              Disallow /cgi-bin/ \.cgi /nph

              #Exclude some known extensions
              Disallow \.b$  \.sh$     \.md5$
              Disallow \.arj$  \.tar$  \.zip$  \.tgz$  \.gz$
              Disallow \.lha$ \.lzh$ \.tar\.Z$  \.rar$  \.zoo$
              Disallow \.gif$  \.jpg$  \.jpeg$ \.bmp$  \.tiff$
              Disallow \.vdo$  \.mpeg$ \.mpe$  \.mpg$  \.avi$  \.movie$
              Disallow \.mid$  \.mp3$  \.rm$   \.ram$  \.wav$  \.aiff$ \.ra$
              Disallow \.vrml$ \.wrl$
              Disallow \.exe$  \.cab$  \.dll$  \.bin$  \.class$
              Disallow \.tex$  \.texi$ \.xls$  \.doc$  \.texinfo$
              Disallow \.rtf$  \.pdf$  \.cdf$  \.ps$
              Disallow \.ai$   \.eps$  \.ppt$  \.hqx$
              Disallow \.cpt$  \.bms$  \.oda$  \.tcl$
              Disallow \.rpm$

              #Exclude Apache directory list in different sort order
              Disallow \?D=A$ \?D=D$ \?M=A$ \?M=D$ \?N=A$ \?N=D$ \?S=A$ \?S=D$

              #Exclude ./. and ./.. from Apache and Squid directory list
              Disallow /[.]{1,2} /\%2e /\%2f

       CheckOnly regexp [regexp [...] ]
              Indexer will use the HEAD HTTP method instead of GET for URLs
              that match regexp. This means that the file will only be checked
              and will not be downloaded. Useful for zip, exe, arj and similar
              files. You can use several arguments for one CheckOnly command.
              You can use this command any number of times, but not more than
              MAXFILTER in indexer.h. Takes global effect for the config file.

       Examples:
              #Use HEAD method for some known non-text extensions:
              CheckOnly \.b$ \.sh$     \.md5$
              CheckOnly \.arj$  \.tar$  \.zip$  \.tgz$  \.gz$
              CheckOnly \.lha$ \.lzh$ \.tar\.Z$  \.rar$  \.zoo$
              CheckOnly \.gif$  \.jpg$  \.jpeg$ \.bmp$  \.tiff$
              CheckOnly \.vdo$  \.mpeg$ \.mpe$  \.mpg$  \.avi$  \.movie$
              CheckOnly \.mid$  \.mp3$  \.rm$   \.ram$  \.wav$  \.aiff$
              CheckOnly \.vrml$ \.wrl$
              CheckOnly \.exe$  \.cab$  \.dll$  \.bin$  \.class$
              CheckOnly \.tex$  \.texi$ \.xls$  \.doc$  \.texinfo$
              CheckOnly \.rtf$  \.pdf$  \.cdf$  \.ps$
              CheckOnly \.ai$   \.eps$  \.ppt$  \.hqx$
              CheckOnly \.cpt$  \.bms$  \.oda$  \.tcl$
              CheckOnly \.rpm$

       HrefOnly regexp [regexp [...] ]
              Indexer scans HTML documents that match regexp as it would scan
              any other URLs, except that it will not index the contents. It
              will add any URLs it finds in the HTML document to the database.
              Useful when indexing mailing list archives with big index pages
              that contain mostly URLs. You can use several arguments for one
              HrefOnly command. You can use this command any number of times,
              but not more than MAXFILTER in indexer.h. Takes global effect
              for the config file.

       Examples:
              #Scan these files for href tags only, but do not index their contents.
              HrefOnly mail.*\.html$ thr.*\.html$

MIME types and external parsers

       UseRemoteContentType yes|no
              This command specifies whether the indexer should get the
              content type from HTTP server headers (yes), or from its AddType
              settings (no). If set to no, and the indexer could not determine
              content-type with its AddType settings,

       SyslogFacility facility
              Useful only if indexer is compiled with syslog support and you
              do not like the default. The argument is the same as used in the
              syslog.conf file (for example: local7, daemon). For a list of
              possible facilities see syslog.conf(5). Takes global effect and
              should be used only once. Default: depends on compilation.

       LogdAddr host[:port]
              Use cachelogd at the given host and port, if specified. Required
              for cache mode only. The default values are localhost and port
              7000.

       FollowOutside yes|no
              Allow/disallow indexer to walk outside current  server.   Should
              be used carefully (see MaxHops command).

              Default: no

       Period seconds
              Reindex period in seconds (604800 = 1 week). May be used before
              every Server command and takes effect till the end of the config
              file or till the next Period command.
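
       Example:
              A sketch with illustrative values: the first server is
              reindexed weekly, the second daily (86400 seconds):

              Period 604800
              Server http://www.yoursite.com/
              Period 86400
              Server http://localhost/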

       Tag number
              Use  this  parameter  for  your  own  purposes.  For example for
              grouping some servers into one group, etc.  May be used multiple
              times  before every Server command and takes effect till the end
              of config file or till next Tag command.

       MaxHops number
              Maximum distance, in "mouse clicks", from the start URL given in
              the Server command. May be used multiple times before every
              Server command and takes effect till the end of the config file
              or till the next MaxHops command.

              Default: 256
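
       Example:
              A sketch with illustrative values: indexer is allowed to leave
              the start server, but stops two clicks away from the start URL:

              FollowOutside yes
              MaxHops 2
              Server http://www.yoursite.com/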

       MaxNetErrors number
              Maximum number of network errors for each server. If there are
              too many network errors on some server (server is down, host
              unreachable, etc.), indexer will try not to make more than
              number attempts to connect to this server. May be used multiple
              times before a Server command and takes effect till the end of
              the config file or till the next MaxNetErrors command.

              Default: 16

       TitleWeight number
              Weight of the words in the <title>...</title>. Can be set
              multiple times before a Server command and takes effect till the
              end of the config file or till the next TitleWeight command.

              Default: 2

       BodyWeight number
              Weight  of  the  words  in  the  <body>...</body>  of  the  html
              documents  and in the contents of the text/plain documents.  Can
              be set multiple times before Server  command  and  takes  effect
              till the end of config file or till next BodyWeight command.

              Default: 1

       DescWeight number
              Weight of the words in the <META NAME="Description"
              Content="...">. Can be set multiple times before a Server
              command and takes effect till the end of the config file or till
              the next DescWeight command.

              Default: 2

       KeywordWeight number
              Weight of the words in the <META NAME="Keywords" Content="...">.
              Can be set multiple times before a Server command and takes
              effect till the end of the config file or till the next
              KeywordWeight command.

              Default: 2

       UrlWeight number
              Weight  of  the  words  in the URL of the documents.  Can be set
              multiple times before Server command and takes effect  till  the
              end of config file or till next UrlWeight command.

              Default: 0
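
       Example:
              A sketch, assuming the weights are plain integers as in the
              defaults above: title words are emphasised for one server and
              the default is restored for the next:

              TitleWeight 4
              Server http://www.yoursite.com/
              TitleWeight 2
              Server http://localhost/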

       DeleteBad yes|no
              Controls whether indexer deletes bad (not found, forbidden,
              etc.) URLs from the database. Setting it to no is useful if you
              want to check the 'integrity' of your server(s): the "bad" URLs
              will then remain in the database. Can be set multiple times
              before a Server command and takes effect till the end of the
              config file or till the next DeleteBad command.

              Default: yes

       Robots yes|no
              Allows/disallows using robots.txt and <META NAME="robots">
              exclusions. Useful if you want to check the 'integrity' of your
              server(s). Can be set multiple times before a Server command and
              takes effect till the end of the config file or till the next
              Robots command.

              Default: yes.

       Section <string> <number>
              Here <string> is a section name and <number> is a section ID
              between 0 and 255. Use 0 if you do not want to index a section.
              It is better to use different section IDs for different document
              parts. You will then be able to give a different weight to each
              part at search time, or even exclude some sections from a
              search.

       Index yes|no
              Index no prevents indexer from storing words in the database.
              Useful if you want to check the 'integrity' of your server(s).
              Can be set multiple times before a Server command and takes
              effect till the end of the config file or till the next Index
              command.

              Note: Instead of Index no you can use the alternate form
              NoIndex.

              Default: yes

       Follow yes|no
              Allow/disallow  indexer  to  store <a href="..."> into database.
              Can be set multiple times before Server command and takes effect
              till the end of config file or till next Follow command.

              Note:  Instead  of  Follow  no  you  can  use the alternate form
              NoFollow

              Default: yes
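
       Example:
              A sketch, assuming a hypothetical /archive/ path: with Follow
              left at its default, links found there should still be stored,
              but the words of those pages are not:

              Index no
              Server http://www.yoursite.com/archive/
              Index yes
              Server http://www.yoursite.com/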

       MaxDocSize size

              Limits the maximum document size. size is given in bytes. If a
              document is larger than size, indexer will parse only the first
              size bytes of the document.

              Default: 1048576 (which is 1 megabyte)

       Mime   <from_mime> <to_mime>[;charset] ["command line [$1]"]

              This command adds support for parsing documents with mime types
              other than text/plain and text/html. This can be done via an
              external parser (which should produce plain text or HTML
              output) or just by substituting the mime type so that indexer
              can understand the document directly.

              <from_mime> and <to_mime> are standard mime types. <to_mime>
              should be either text/plain or text/html, because these are the
              only types that indexer understands.

              The external parser is assumed to write its results to stdout
              (if it does not, you have to write a small script that sends the
              results to stdout).

              The optional charset parameter is used to change the charset if
              needed.

              The command line parameter is optional. If there is no command
              line, the command only changes the mime type. The command line
              may also contain a $1 parameter, which stands for a temporary
              file name. Some parsers cannot operate on stdin, so indexer
              creates a temporary file for the parser and passes its name in
              place of $1.
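
       Example:
              A sketch of both forms, assuming the catdoc utility is
              installed and writes plain text to stdout; the text/x-log type
              is a hypothetical name chosen for illustration:

              #Convert Word documents to plain text with an external parser
              Mime application/msword text/plain "catdoc $1"
              #Only substitute the mime type, no parser is run
              Mime text/x-log text/plain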

       CharSet charset
              Useful for 8-bit character sets. WWW servers send data in
              different character sets. charset is the default character set
              of the servers in the following Server command(s). May be used
              before every Server command and takes effect till the end of
              the config file or till the next CharSet command.

              Currently indexer supports the Cyrillic koi8-r, cp1251, cp866,
              iso8859-5 and x-mac-cyrillic, Arabic cp1256, Western
              iso-8859-1, and Central European iso-8859-2 and cp1250
              character sets.

              This parameter is the default character set for "bad" servers
              that do not send charset information in the header (just
              "Content-type: text/html" instead of, for example,
              "Content-type: text/html; charset=koi8-r") and do not send
              charset information in META tags.

       Examples:

              CharSet koi8-r
              CharSet windows-1250
              CharSet ISO-8859-1

       ForceIISCharset1251 yes|no
              This option is useful for users dealing with Cyrillic content
              and broken (or misconfigured?) Microsoft IIS web servers, which
              tend to report the charset incorrectly. This is a really dirty
              hack, but if this option is turned on it is assumed that all
              servers that are reported as 'Microsoft' or 'IIS' have content
              in the Windows-1251 codepage. This command should be used only
              once in the configuration file and takes global effect.

              Default: no

       AuthBasic login:passwd
              Use HTTP basic authorization. Can be set before every Server
              command and takes effect only for the next Server command.

       Examples:

              AuthBasic somebody:something

              If you have password-protected directories, but the rest of the
              server is open, use:

              AuthBasic login1:passwd1
              Server http://my.server.com/my/secure/directory1/
              AuthBasic login2:passwd2
              Server http://my.server.com/my/secure/directory2/
              Server http://my.server.com/

       ProxyAuthBasic login:passwd
              Use HTTP proxy basic authorization. Can be used before every
              Server command and takes effect only for the next Server
              command. It should also be placed before the Proxy command.

       Example:
              ProxyAuthBasic somebody:smth

       Proxy your.proxy.host[:port]
              Connect via a proxy rather than directly. FTP servers can be
              indexed only when using a proxy. If port is not specified, it is
              set to the default value of 3128 (Squid). If the proxy host is
              not specified, a direct connection will be used. Can be set
              before every Server command and takes effect till the end of the
              config file or till the next Proxy command.

       Examples:
              Proxy atoll.anywhere.com
               - proxy on atoll.anywhere.com, port 3128

              Proxy lota.anywhere.com:8090
               - proxy on lota.anywhere.com, port 8090

              Proxy
               - turn off proxy usage (direct connection)
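
              To combine a proxy with proxy authorization, a sketch using
              the placeholder host and credentials from this page
              (ProxyAuthBasic is placed before Proxy and applies to the next
              Server command only):

              ProxyAuthBasic somebody:smth
              Proxy your.proxy.host:3128
              Server http://www.yoursite.com/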

       Server URL
              This is the main configuration command. Use it to add the start
              URL of a server to be indexed. You may use many Server commands
              in the same indexer.conf file.

       Examples:

              Server http://localhost/
              Server http://www.yoursite.com/
              Server http://www.yoursite.com/~yourname/
              Server ftp://ftp.yourdomain.com/pub/

EXAMPLE

       This is a minimal sample indexer config file:

              DBHost         localhost
              DBName         udmsearch
              DBUser         foo
              DBPass         bar
              Server         http://localhost/
              Disallow /cgi-bin/ \.cgi /nph
              Disallow \.b$  \.sh$     \.md5$
              Disallow \.arj$  \.tar$  \.zip$  \.tgz$  \.gz$
              Disallow \.lha$ \.lzh$ \.tar\.Z$  \.rar$  \.zoo$
              Disallow \.gif$  \.jpg$  \.jpeg$ \.bmp$  \.tiff$
              Disallow \.vdo$  \.mpeg$ \.mpe$  \.mpg$  \.avi$  \.movie$
              Disallow \.mid$  \.mp3$  \.rm$   \.ram$  \.wav$  \.aiff$ \.ra$
              Disallow \.vrml$ \.wrl$
              Disallow \.exe$  \.cab$  \.dll$  \.bin$  \.class$
              Disallow \.tex$  \.texi$ \.xls$  \.doc$  \.texinfo$
              Disallow \.rtf$  \.pdf$  \.cdf$  \.ps$
              Disallow \.ai$   \.eps$  \.ppt$  \.hqx$
              Disallow \.cpt$  \.bms$  \.oda$  \.tcl$
              Disallow \.rpm$
              Disallow \?D=A$ \?D=D$ \?M=A$ \?M=D$ \?N=A$ \?N=D$ \?S=A$ \?S=D$
              Disallow /[.]{1,2} /\%2e /\%2f

SEE ALSO

       indexer(1), syslog.conf(5)