Provided by: swish-e_2.4.7-6.1build2_amd64

NAME

       SWISH-FAQ - The Swish-e FAQ. Answers to Common Questions

OVERVIEW

       List of commonly asked and answered questions.  Please review this document before asking
       questions on the Swish-e discussion list.

       General Questions

       What is Swish-e?

       Swish-e is Simple Web Indexing System for Humans - Enhanced.  With it, you can quickly and
       easily index directories of files or remote web sites and search the generated indexes for
       words and phrases.

       So, is Swish-e a search engine?

       Well, yes.  Probably the most common use of Swish-e is to provide a search engine for web
       sites.  The Swish-e distribution includes CGI scripts that can be used with it to add a
       search engine for your web site.  The CGI scripts can be found in the example directory of
       the distribution package.  See the README file for information about the scripts.

       But Swish-e can also be used to index all sorts of data, such as email messages, data
       stored in a relational database management system, XML documents, or documents such as
       Word and PDF documents -- or any combination of those sources at the same time.  Searches
       can be limited to fields or MetaNames within a document, or limited to areas within an
       HTML document (e.g. body, title).  Programs other than CGI applications can use Swish-e,
       as well.

       Should I upgrade if I'm already running a previous version of Swish-e?

       A large number of bug fixes, feature additions, and logic corrections were made in version
       2.2.  In addition, indexing speed has been drastically improved (reports of indexing times
       changing from four hours to 5 minutes), and major parts of the indexing and search parsers
       have been rewritten.  There are better debugging options, enhanced output formats, more
       document meta data (e.g. last modified date, document summary), options for indexing from
       external data sources, and faster spidering, just to name a few changes.  (See the CHANGES
       file for more information.)

       Since so much effort has gone into version 2.2, support for previous versions will
       probably be limited.

       Are there binary distributions available for Swish-e on platform foo?

       Foo?  Well, yes there are some binary distributions available.  Please see the Swish-e web
       site for a list at http://swish-e.org/.

       In general, it is recommended that you build Swish-e from source, if possible.

       Do I need to reindex my site each time I upgrade to a new Swish-e version?

       At times it might not strictly be necessary, but since you don't really know if anything
       in the index has changed, it is a good rule to reindex.

       What's the advantage of using the libxml2 library for parsing HTML?

       Swish-e may be linked with libxml2, a library for working with HTML and XML documents, and
       can then use it to parse both kinds of document.

       The libxml2 parser is a better parser than Swish-e's built-in HTML parser.  It offers more
       features, and it does a much better job of extracting the text from a web page.  In
       addition, you can use the "ParserWarningLevel" configuration setting to find structural
       errors in your documents that could (and would with Swish-e's HTML parser) cause documents
       to be indexed incorrectly.

       Libxml2 is not required, but is strongly recommended for parsing HTML documents.  It's
       also recommended for parsing XML, as it offers many more features than the internal Expat
       xml.c parser.

       The internal HTML parser will receive only limited support and has a number of bugs.  For
       example, HTML entities may not always be correctly converted, and properties do not have
       entities converted.  The internal parser tends to get confused by invalid HTML, whereas
       the libxml2 parser copes with it more often.  The document structure is also better
       detected with the libxml2 parser.

       If you are using the Perl module (the Perl interface to the Swish-e C library) you may
       wish to build two versions of Swish-e, one with the libxml2 library linked in the binary,
       and one without, and build the Perl module against the library without the libxml2 code.
       This is to save space in the library.  Hopefully, the library will someday soon be split
       into indexing and searching code (volunteers welcome).

       Does Swish-e include a CGI interface?

       Yes.  Kind of.

       There are two example CGI scripts included, swish.cgi and search.cgi.  Both are installed
       at $prefix/lib/swish-e.

       Both require a bit of work to set up and use.  Swish.cgi is probably what most people will
       want to use as it contains more features.  Search.cgi is for those who want to start with
       a small script and customize it to fit their needs.

       An example of using swish.cgi is given in the INSTALL man page and in the swish.cgi
       documentation.  As is often the case, it will be easier to use if you first read the
       documentation.

       Please use caution about CGI scripts found on the Internet for use with Swish-e.  Some are
       not secure.

       The included example CGI scripts were designed with security in mind.  Regardless, you are
       encouraged to have your local Perl expert review them (and all other CGI scripts you use)
       before placing them into production.  This is just a good policy to follow.

       How secure is Swish-e?

       We know of no security issues with using Swish-e.  Careful attention has been paid to
       common security problems, such as buffer overruns, when programming Swish-e.

       The most likely security issue with Swish-e is when it is run via a poorly written CGI
       interface.  This is not limited to CGI scripts written in Perl, as it's just as easy to
       write an insecure CGI script in C, Java, PHP, or Python.  A good source of information is
       included with the Perl distribution.  Type "perldoc perlsec" at your local prompt for more
       information.  Another must-read document is located at
       "http://www.w3.org/Security/faq/wwwsf4.html".

       Note that there are many free yet insecure and poorly written CGI scripts available --
       even some designed for use with Swish-e.  Please carefully review any CGI script you use.
       Free is not such a good price when you get your server hacked...

       Should I run Swish-e as the superuser (root)?

       No.  Never.

       What files does Swish-e write?

       Swish writes the index file, of course.  This is specified with the "IndexFile"
       configuration directive or by the "-f" command line switch.

       The index file is actually a collection of files, but all start with the file name
       specified with the "IndexFile" directive or the "-f" command line switch.

       For example, the file ending in .prop contains the document properties.

       When creating the index files Swish-e appends the extension .temp to the index file names.
       When indexing is complete Swish-e renames the .temp files to the index files specified by
       "IndexFile" or "-f".  This is done so that existing indexes remain untouched until it
       completes indexing.

       Swish-e also writes temporary files in some cases during indexing (e.g. "-S http",
       "-S prog" with filters), when merging, and when using "-e".  Temporary files are created
       with the mkstemp(3) function (with 0600 permission on unix-like operating systems).

       The temporary files are created in the directory specified by the environment variables
       "TMPDIR" and "TMP", in that order.  If those are not set then swish uses the TmpDir
       configuration setting.  Otherwise, the temporary file will be located in the current
       directory.
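
       For example, either of the following would direct temporary files to /var/tmp (the
       directory and the swish.conf file name are only placeholders).  From the shell:

           TMPDIR=/var/tmp swish-e -c swish.conf

       or in the configuration file, which applies only when neither environment variable is set:

           TmpDir /var/tmp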

       Can I index PDF and MS-Word documents?

       Yes, you can use a Filter to convert documents while indexing, or you can use a program
       that "feeds" documents to Swish-e that have already been converted.  See "Indexing" below.

       Can I index documents on a web server?

       Yes, Swish-e provides two ways to index (spider) documents on a web server.  See
       "Spidering" below.

       Swish-e can retrieve documents from a file system or from a remote web server.  It can
       also execute a program that returns documents back to it.  This program can retrieve
       documents from a database, filter compressed document files, convert PDF files, extract
       data from mail archives, or spider remote web sites.

       Can I implement keywords in my documents?

       Yes, Swish-e can associate words with MetaNames while indexing, and you can limit your
       searches to these MetaNames while searching.

       In your HTML files you can put keywords in HTML META tags or in XML blocks.

       Keywords can take two forms in your source documents.  As an HTML META tag:

           <META NAME="DC.subject" CONTENT="digital libraries">

       And in XML format (can also be used in HTML documents when using libxml2):

           <meta2>
               Some Content
           </meta2>

       Then, to inform Swish-e about the existence of the meta name in your documents, edit the
       MetaNames line in your configuration file:

           MetaNames DC.subject meta1 meta2

       When searching you can now limit some or all search terms to that MetaName.  For example,
       you can look for documents that contain the word apple and also have either fruit or
       cooking in the DC.subject meta tag.
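
       As an illustrative sketch only, following the "metaname=(words)" query syntax used
       elsewhere in this FAQ, and assuming the index was built with the MetaNames line above,
       such a search might look like:

           swish-e -w 'apple and DC.subject=(fruit or cooking)'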

       What are document properties?

       A document property is typically data that describes the document.  For example,
       properties might include a document's path name, its last modified date, its title, or its
       size.  Swish-e stores a document's properties in the index file, and they can be reported
       back in search results.

       Swish-e also uses properties for sorting.  You may sort your results by one or more
       properties, in ascending or descending order.

       Properties can also be defined within your documents.  HTML and XML files can specify tags
       (see previous question) as properties.  The contents of these tags can then be returned
       with search results.  These user-defined properties can also be used for sorting search
       results.

       For example, if you had the following in your documents

          <meta name="creator" content="accounting department">

       and "creator" is defined as a property (see "PropertyNames" in SWISH-CONFIG) Swish-e can
       return "accounting department" with the result for that document.

           swish-e -w foo -p creator

       Or for sorting:

           swish-e -w foo -s creator

       What's the difference between MetaNames and PropertyNames?

       MetaNames allow keyword searches in your documents.  That is, you can use MetaNames to
       restrict searches to just parts of your documents.

       PropertyNames, on the other hand, define text that can be returned with results, and can
       be used for sorting.

       Both use meta tags found in your documents (as shown in the above two questions) to define
       the text you wish to use as a property or meta name.

       You may define a tag as both a property and a meta name.  For example:

          <meta name="creator" content="accounting department">

       placed in your documents and then using configuration settings of:

           PropertyNames creator
           MetaNames creator

       will allow you to limit your searches to documents created by accounting:

           swish-e -w 'foo and creator=(accounting)'

       That will find all documents with the word "foo" that also have a creator meta tag that
       contains the word "accounting".  This is using MetaNames.

       And you can also say:

           swish-e -w foo -p creator

       which will return all documents with the word "foo", but the results will also include the
       contents of the "creator" meta tag along with results.  This is using properties.

       You can use properties and meta names at the same time, too:

           swish-e -w 'creator=(accounting or marketing)' -p creator -s creator

       That searches only in the "creator" meta name for either of the words "accounting" or
       "marketing", prints out the contents of the "creator" property, and sorts the results by
       the "creator" property name.

       (See also the "-x" output format switch in SWISH-RUN.)

       Can Swish-e index multi-byte characters?

       No.  This will require much work to change.  But, Swish-e works with eight-bit characters,
       so many character sets can be used.  Note that it does call the ANSI-C tolower() function
       which does depend on the current locale setting.  See locale(7) for more information.

       Indexing

       How do I pass Swish-e a list of files to index?

       Currently, there is not a configuration directive to include a file that contains a list
       of files to index.  But, there is a directive to include another configuration file.

           IncludeConfigFile /path/to/other/config

       And in "/path/to/other/config" you can say:

           IndexDir file1 file2 file3 file4 file5 ...
           IndexDir file20 file21 file22

       You may also specify more than one configuration file on the command line:

           ./swish-e -c config_one config_two config_three

       Another option is to create a directory with symbolic links of the files to index, and
       index just that directory.

       How does Swish-e know which parser to use?

       Swish can parse HTML, XML, and text documents.  The parser is set by associating a file
       extension with a parser by the "IndexContents" directive.  You may set the default parser
       with the "DefaultContents" directive.  If a document is not assigned a parser it will
       default to the HTML parser (HTML2 if built with libxml2).
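
       As a rough sketch, a configuration that assigns parsers by extension might look like the
       following.  The extensions are only illustrative, and the parser names available depend
       on how Swish-e was built, so check SWISH-CONFIG before relying on them:

           # parse .htm and .html files as HTML, .xml as XML, .txt and .log as text
           IndexContents HTML .htm .html
           IndexContents XML  .xml
           IndexContents TXT  .txt .log

           # anything not matched above falls back to the text parser
           DefaultContents TXT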

       You may use Filters or an external program to convert documents to HTML, XML, or text.

       Can I reindex and search at the same time?

       Yes.  Starting with version 2.2 Swish-e indexes to temporary files, and then renames the
       files when indexing is complete.  On most systems renames are atomic.  But, since Swish-e
       also generates more than one file during indexing there will be a very short period of
       time between renaming the various files when the index is out of sync.

       Settings in src/config.h control some options related to temporary files, and their use
       during indexing.

       Can I index phrases?

       Phrases are indexed automatically.  To search for a phrase simply place double quotes
       around the phrase.

       For example:

           swish-e -w 'free and "fast search engine"'

       How can I prevent phrases from matching across sentences?

       Use the BumpPositionCounterCharacters configuration directive.
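
       A minimal sketch, assuming the directive takes a string of characters and that each one
       found in the text bumps the word position (see SWISH-CONFIG for the exact behavior):

           # keep phrases from matching across sentence boundaries
           BumpPositionCounterCharacters .!?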

       Swish-e isn't indexing a certain word or phrase.

       There are a number of configuration parameters that control what Swish-e considers a
       "word" and it has a debugging feature to help pinpoint any indexing problems.

       Configuration file directives (SWISH-CONFIG) "WordCharacters", "BeginCharacters",
       "EndCharacters", "IgnoreFirstChar", and "IgnoreLastChar" are the main settings that
       Swish-e uses to define a "word".  See SWISH-CONFIG and SWISH-RUN for details.

       Swish-e also uses compile-time defaults for many settings.  These are located in the
       src/config.h file.

       The command line arguments "-k", "-v" and "-T" are useful when debugging these problems.
       Using "-T INDEXED_WORDS" while indexing will display each word as it is indexed.  You
       should specify one file when using this feature since it can generate a lot of output.

            ./swish-e -c my.conf -i problem.file -T INDEXED_WORDS

       You may also wish to index a single file that contains words that are or are not indexing
       as you expect and use -T to output debugging information about the index.  A useful
       command might be:

           ./swish-e -f index.swish-e -T INDEX_FULL

       Once you see how Swish-e is parsing and indexing your words, you can adjust the
       configuration settings mentioned above to control what words are indexed.

       Another useful command might be:

            ./swish-e -c my.conf -i problem.file -T PARSED_WORDS INDEXED_WORDS

       This will show the whitespace-delimited words parsed from the document (PARSED_WORDS), and
       how those words are split up into separate words for indexing (INDEXED_WORDS).

       How do I keep Swish-e from indexing numbers?

       Swish-e indexes words as defined by the "WordCharacters" setting, as described above.  So
       to avoid indexing numbers you simply remove digits from the "WordCharacters" setting.
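
       For instance, a letters-only setting might look like the sketch below.  The character list
       is only an example; adjust it to the characters your documents actually use:

           # no digits are listed, so digits are not treated as part of a word
           WordCharacters abcdefghijklmnopqrstuvwxyz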

       There are also some settings in src/config.h that control what "words" are indexed.  You
       can configure swish to never index words that are all digits, vowels, or consonants, or
       that contain more than some consecutive number of digits, vowels, or consonants.  In
       general, you won't need to change these settings.

       Also, there's an experimental feature called "IgnoreNumberChars" which allows you to
       define a set of characters that describe a number.  If a word is made up of only those
       characters it will not be indexed.

       Swish-e crashes and burns on a certain file. What can I do?

       This shouldn't happen.  If it does please post to the Swish-e discussion list the details
       so it can be reproduced by the developers.

       In the meantime, you can use a "FileRules" directive to exclude the particular file name,
       or pathname, or its title.  If there are serious problems in indexing certain types of
       files, they may not have valid text in them (they may be binary files, for instance). You
       can use NoContents to exclude that type of file.

       Swish-e will issue a warning if an embedded null character is found in a document.  This
       warning will be an indication that you are trying to index binary data.  If you need to
       index binary files try to find a program that will extract the text (e.g. strings(1),
       catdoc(1), pdftotext(1)).

       How do I prevent indexing of some documents?

       When using the file system to index your files you can use the "FileRules" directive.
       Other than "FileRules title", "FileRules" only works with the file system ("-S fs")
       indexing method, not with "-S prog" or "-S http".
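
       As an illustration, rules along the following lines skip files by path, by name, or by
       title.  The words being matched are made up for the example; see "FileRules" in
       SWISH-CONFIG for the full syntax:

           FileRules pathname contains admin testing demo trash
           FileRules filename is index.html
           FileRules title contains construction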

       If you are spidering a site you have control over, use a robots.txt file in your document
       root.  This is a standard way to exclude files from search engines, and is fully
       supported by Swish-e.  See http://www.robotstxt.org/

       If spidering a website with the included spider.pl program then add any necessary tests to
       the spider's configuration file.  Type "perldoc spider.pl" in the "prog-bin" directory for
       details or see the spider documentation on the Swish-e website.  Look for the section on
       callback functions.

       If using the libxml2 library for parsing HTML (which you probably are), you may also use
       the Meta Robots Exclusion in your documents:

           <meta name="robots" content="noindex">

       See the obeyRobotsNoIndex directive.

       How do I prevent indexing parts of a document?

       To prevent Swish-e from indexing a common header, footer, or navigation bar when you are
       using libxml2 for parsing HTML, you may wrap the text you wish to ignore in a fake HTML
       tag and name that tag in the "IgnoreMetaTags" directive.  This will generate an error
       message if the "ParserWarningLevel" is set, as it's invalid HTML.

       "IgnoreMetaTags" works with XML documents (and HTML documents when using libxml2 as the
       parser), but not with documents parsed by the text (TXT) parser.
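
       A sketch of that approach, using a made-up tag name "noindexme" (any name that is not a
       real HTML tag should do).  In the document:

           <noindexme>
               Home | Products | Contact
           </noindexme>

       and in the configuration file:

           IgnoreMetaTags noindexme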

       If you are using the libxml2 parser (HTML2 and XML2) then you can use the following
       comments in your documents to prevent indexing:

              <!-- SwishCommand noindex -->
              <!-- SwishCommand index -->

       and/or these may be used also:

              <!-- noindex -->
              <!-- index -->

       How do I modify the path or URL of the indexed documents?

       Use the "ReplaceRules" configuration directive to rewrite path names and URLs.  If you are
       using the "-S prog" input method you may set the path to any string.
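
       For example, to turn file system paths into URLs (the directory and host name below are
       placeholders, and "replace" is one of several ReplaceRules operations described in
       SWISH-CONFIG):

           ReplaceRules replace /var/www/htdocs/ http://www.example.com/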

       How can I index data from a database?

       Use the "prog" document source method of indexing.  Write a program to extract the data
       from your database, and format it as XML, HTML, or text.  See the examples in the
       "prog-bin" directory, and the next question.

       How do I index my PDF, Word, and compressed documents?

       Swish-e can internally only parse HTML, XML and TXT (text) files by default, but can make
       use of filters that will convert other types of files such as MS Word documents, PDF, or
       gzipped files into one of the file types that Swish-e understands.

       Please see SWISH-CONFIG and the examples in the filters and filter-bin directory for more
       information.

       See the next question to learn about the filtering options with Swish-e.

       How do I filter documents?

       The term "filter" in Swish-e means the conversion of a document of one type (one that
       swish-e cannot index directly) into a type that Swish-e can index, namely HTML, plain
       text, or XML.  To add to the confusion, there are a number of ways to accomplish this in
       Swish-e.  So here's a bit of background.

       The FileFilter directive was added to swish first.  This feature allows you to specify a
       program to run for documents that match a given file extension.  For example, to filter
       PDF files (files that end in .pdf) you can specify the configuration setting of:

           FileFilter .pdf pdftotext   "'%p' -"

       which says to run the program "pdftotext" passing it the pathname of the file (%p) and a
       dash (which tells pdftotext to output to stdout).   Then for each .pdf file Swish-e runs
       this program and reads in the filtered document from the output from the filter program.

       This has the advantage that it is easy to set up -- a single line in the config file is all
       that is needed to add the filter into Swish-e.  But it also has a number of problems.  For
       example, if you use a Perl script to do your filtering it can be very slow since the
       filter script must be run (and thus compiled) for each processed document.  This is
       exacerbated when using the -S http method since the -S http method also uses a Perl script
       that is run for every URL fetched.  Also, when using -S prog method of input (reading
       input from a program) using FileFilter means that Swish-e must first read the file in from
       the external program and then write the file out to a temporary file before running the
       filter.

       With -S prog it makes much more sense to filter the document in the program that is
       fetching the documents than to have swish-e read the file into memory, write it to a
       temporary file and then run an external program.

       The Swish-e distribution contains a couple of example -S prog programs.  spider.pl is a
       reasonably full-featured web spider that offers many more options than the -S http method.
       And it is much faster than running -S http, too.

       The spider has a perl configuration file, which means you can add programming logic right
       into the configuration file without editing the spider program.  One bit of logic that is
       provided in the spider's configuration file is a "call-back" function that allows you to
       filter the content.  In other words, before the spider passes a fetched web document to
       swish for indexing the spider can call a simple subroutine in the spider's configuration
       file passing the document and its content type.  The subroutine can then look at the
       content type and decide if the document needs to be filtered.

       For example, when processing a document of type "application/msword" the call-back
       subroutine might call the doc2txt.pm perl module, and a document of type "application/pdf"
       could use the pdf2html.pm module.  The prog-bin/SwishSpiderConfig.pl file shows this
       usage.

       This system works reasonably well, but also means that more work is required to set up the
       filters.  First, you must explicitly check for specific content types and then call the
       appropriate Perl module, and second, you have to know how each module must be called and
       how each returns the possibly modified content.

       In comes SWISH::Filter.

       To make things easier the SWISH::Filter Perl module was created.  The idea of this module
       is that there is one interface used to filter all types of documents.  So instead of
       checking for specific types of content you just pass the content type and the document to
       the SWISH::Filter module and it returns a new content type and document if it was
       filtered.  The filters that do the actual work are designed with a standard interface and
       work like filter "plug-ins". Adding new filters means just downloading the filter to a
       directory and no changes are needed to the spider's configuration file.  Download a filter
       for Postscript and next time you run indexing your Postscript files will be indexed.

       Since the filters are standardized, hopefully when you have the need to filter documents
       of a specific type there will already be a filter ready for your use.

       Now, note that the perl modules may or may not do the actual conversion of a document.
       For example, the PDF conversion module calls the pdfinfo and pdftotext programs.  Those
       programs (part of the Xpdf package) must be installed separately from the filters.

       The SwishSpiderConfig.pl example spider configuration file shows how to use the
       SWISH::Filter module for filtering.  This file is installed at
       $prefix/share/doc/swish-e/examples/prog-bin, where $prefix is normally /usr/local on
       unix-type machines.

       The SWISH::Filter method of filtering can also be used with the -S http method of
       indexing.  By default the swishspider program (the Perl helper script that fetches
       documents from the web) will attempt to use the SWISH::Filter module if it can be found in
       Perl's library path.  This path is set automatically for spider.pl but not for swishspider
       (because it would slow down a method that's already slow and spider.pl is recommended over
       the -S http method).

       Therefore, all that's required to use this system with -S http is adding the filter
       directory to Perl's @INC search path, for example through the PERL5LIB environment
       variable.

       For example, if the swish-e distribution was unpacked into ~/swish-e:

          PERL5LIB=~/swish-e/filters swish-e -c conf -S http

       will allow the -S http method to make use of the SWISH::Filter module.

       Note that if you are not using the SWISH::Filter module you may wish to edit the
       swishspider program and disable the use of the SWISH::Filter module using this setting:

           use constant USE_FILTERS  => 0;  # disable SWISH::Filter

       This prevents the program from attempting to use the SWISH::Filter module for every non-
       text URL that is fetched.  Of course, if you are concerned with indexing speed you should
       be using the -S prog method with spider.pl instead of -S http.

       If you are not spidering, but you still want to make use of the SWISH::Filter module for
       filtering you can use the DirTree.pl program (in $prefix/lib/swish-e).  This is a simple
       program that traverses the file system and uses SWISH::Filter for filtering.

       Here are two examples of how to run a filter program, one using Swish-e's "FileFilter"
       directive, another using a "prog" input method program.  See the SwishSpiderConfig.pl file
       for an example of using the SWISH::Filter module.

       These filters simply use the program "/bin/cat" as a filter and only index .html files.

       First, using the "FileFilter" method, here's the entire configuration file (swish.conf):

           IndexDir .
           IndexOnly .html
           FileFilter .html "/bin/cat"   "'%p'"

       and index with the command

           swish-e -c swish.conf -v 1

       Now, the same thing with using the "-S prog" document source input method and a Perl
       program called catfilter.pl.  You can see that it's much more work than using the
       "FileFilter" method above, but provides a place to do additional processing.  In this
       example, the "prog" method is only slightly faster.  But if you needed a perl script to
       run as a FileFilter then "prog" will be significantly faster.

           #!/usr/local/bin/perl -w
           use strict;
           use File::Find;  # for recursing a directory tree

           $/ = undef;      # slurp mode: read each file in one go
           find(
               { wanted => \&wanted, no_chdir => 1, },
               '.',
           );

           sub wanted {
               return if -d;
               return unless /\.html$/;

               my $mtime  = (stat)[9];

               # fork; the child runs the filter and the parent reads its output
               my $child = open( FH, '-|' );
               die "Failed to fork $!" unless defined $child;
               exec '/bin/cat', $_ unless $child;

               my $content = <FH>;
               close FH;
               my $size = length $content;

               # headers, a blank line, then the document itself
               print <<EOF;
           Content-Length: $size
           Last-Mtime: $mtime
           Path-Name: $_

           EOF

               print $content;
           }

       And index with the command:

           swish-e -S prog -i ./catfilter.pl -v 1

       This example will probably not work under Windows due to the '-|' open.  A simple piped
       open may work just as well:

       That is, replace:

           my $child = open( FH, '-|' );
           die "Failed to fork $!" unless defined $child;
           exec '/bin/cat', $_ unless $child;

       with this:

           open( FH, "/bin/cat $_ |" ) or die $!;

       Perl will try to avoid running the command through the shell if meta characters are not
       passed to the open.  See "perldoc -f open" for more information.

       Eh, but I just want to know how to index PDF documents!

       See the examples in the conf directory and the comments in the SwishSpiderConfig.pl file.

       See the previous question for the details on filtering.  The method you decide to use will
       depend on how fast you want to index, and your comfort level with using Perl modules.

       Regardless of the filtering method you use you will need to install the Xpdf packages
       available from http://www.foolabs.com/xpdf/.

       I'm using Windows and can't get Filters or the prog input method to work!

       Both the "-S prog" input method and filters use the "popen()" system call to run the
       external program.  If your external program is, for example, a perl script, you have to
       tell Swish-e to run perl, instead of the script.  Swish-e will convert forward slashes to
       backslashes when running under Windows.

       For example, you would need to specify the path to perl as (assuming this is where perl is
       on your system):

           IndexDir e:/perl/bin/perl.exe

       Or run a filter like:

           FileFilter .foo e:/perl/bin/perl.exe 'myscript.pl "%p"'

       It's often easier to just install Linux.

       How do I index non-English words?

       Swish-e indexes 8-bit characters only.  This is the ISO 8859-1 Latin-1 character set, and
       includes many non-English letters (and symbols).  As long as they are listed in
       "WordCharacters" they will be indexed.

       Actually, you probably can index any 8-bit character set, as long as you don't mix
       character sets in the same index and don't use libxml2 for parsing (see below).

       The "TranslateCharacters" directive (SWISH-CONFIG) can translate characters while indexing
       and searching.  You may specify the mapping of one character to another character with the
       "TranslateCharacters" directive.

       "TranslateCharacters :ascii7:" is a predefined set of characters that will translate
       eight-bit characters to ascii7 characters.  Using the ":ascii7:" rule will, for example,
       translate "Ääç" to "aac".  This means: searching "Çelik", "çelik" or "celik" will all
       match the same word.
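
       The effect of an ":ascii7:" style mapping can be illustrated outside of Swish-e.  The
       following Python sketch is only an approximation: it strips accents by Unicode
       decomposition and lowercases the result, which is an assumption about the visible effect
       and not the table Swish-e actually uses.  The function name "to_ascii7" is made up for
       this example.

```python
import unicodedata


def to_ascii7(text: str) -> str:
    """Approximate an :ascii7: style mapping: strip accents, then lowercase."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping everything outside ASCII removes the combining marks.
    stripped = decomposed.encode("ascii", "ignore").decode("ascii")
    return stripped.lower()


if __name__ == "__main__":
    for word in ("Ääç", "Çelik", "çelik", "celik"):
        print(word, "->", to_ascii7(word))
```

       Letters that have no decomposition (such as "ß" or "ø") are simply dropped by this
       sketch, so a real translation table would need explicit entries for them.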

       Note: When using libxml2 for parsing, parsed documents are converted internally (within
       libxml2) to UTF-8.  This is converted to ISO 8859-1 Latin-1 when indexing.  In cases where
       a string cannot be converted from UTF-8 to ISO 8859-1 (because it contains non 8859-1
       characters), the string will be sent to Swish-e in UTF-8 encoding.  This will result in
       some words being indexed incorrectly.  Setting "ParserWarningLevel" to 1 or more will
       display warnings when UTF-8 to 8859-1 conversion fails.
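
       If you want to find problem documents before indexing, a check along these lines may
       help.  This is a minimal Python sketch, assuming you already have the document text as a
       decoded string; "fits_latin1" is a hypothetical helper and not part of Swish-e.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character in text exists in ISO 8859-1."""
    try:
        text.encode("latin-1")
    except UnicodeEncodeError:
        # At least one character has no Latin-1 equivalent.
        return False
    return True


if __name__ == "__main__":
    for sample in ("Çelik", "Łódź", "€"):
        print(sample, fits_latin1(sample))
```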

       Can I add/remove files from an index?

       Try building swish-e with the "--enable-incremental" option.

       The rest of this FAQ applies to the default swish-e format.

       Swish-e currently has no way to add or remove items from its index.  But, Swish-e indexes
       so quickly that it's often possible to reindex the entire document set when a file needs
       to be added, modified or removed.  If you are spidering a remote site then consider
       caching compressed copies of the documents locally.

       Incremental additions can be handled in a couple of ways, depending on your situation.
       It's probably easiest to create one main index every night (or every week), and then
       create an index of just the new files between main indexing jobs and use the "-f" option
       to pass both indexes to Swish-e while searching.

       You can merge the indexes into one index (instead of using -f), but it's not clear that
       this has any advantage over searching multiple indexes.

       How does one create the incremental index?

       One method is by using the "-N" switch to pass a file path to Swish-e when indexing.  It
       will only index files that have a last modification date "newer" than the file supplied
       with the "-N" switch.

       This option has the disadvantage that Swish-e must process every file in every directory
       as if it were going to be indexed (the test for "-N" is done last, right before indexing
       of the file contents begins and after all other tests on the file have been completed) --
       all that just to find a few new files.

       Also, if you use the Swish-e index file as the file passed to "-N" there may be files that
       were added after indexing was started, but before the index file was written.  This could
       result in a file not being added to the index.
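
       The modification-time test can be pictured with the following Python sketch.  It is not
       how Swish-e implements "-N"; it only shows the comparison against a reference file, and
       the name "files_newer_than" is invented for the example.  Using a separate marker file
       that is touched just before indexing starts, rather than the index file itself, is one
       way to narrow the window described above.

```python
import os
from typing import List


def files_newer_than(root: str, reference: str) -> List[str]:
    """Return paths under root whose mtime is later than the reference file's."""
    cutoff = os.stat(reference).st_mtime
    newer = []
    # Every file still has to be visited, which mirrors the cost described above.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_mtime > cutoff:
                newer.append(path)
    return sorted(newer)
```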

       Another option is to maintain a parallel directory tree that contains symlinks pointing to
       the main files.  When a file is added to (or changed in) the main directory tree you
       create a symlink to the real file in the parallel directory tree.  Then just index the
       symlink directory to generate the incremental index.

       This option has the disadvantage that you need to have a central program that creates the
       new files that can also create the symlinks.  But, indexing is quite fast since Swish-e
       only has to look at the files that need to be indexed.  When you run full indexing you
       simply unlink (delete) all the symlinks.
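
       A minimal Python sketch of the symlink bookkeeping follows.  The function names and the
       directory layout are assumptions made for illustration; your central program would call
       something like this whenever it writes a file.  Check the "FollowSymLinks" directive in
       SWISH-CONFIG, as it probably needs to be enabled for the symlink tree to be indexed.

```python
import os


def link_into_incremental_tree(real_path: str, main_root: str, link_root: str) -> str:
    """Create a symlink under link_root mirroring real_path's place in main_root."""
    relative = os.path.relpath(real_path, main_root)
    link_path = os.path.join(link_root, relative)
    os.makedirs(os.path.dirname(link_path), exist_ok=True)
    if os.path.lexists(link_path):
        os.remove(link_path)  # replace a stale link left by an earlier change
    os.symlink(os.path.abspath(real_path), link_path)
    return link_path


def clear_incremental_tree(link_root: str) -> None:
    """Remove every symlink under link_root, as done after a full reindex."""
    for dirpath, _dirnames, filenames in os.walk(link_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                os.unlink(path)
```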

       Both of these methods have issues where files could end up in both indexes, or files being
       left out of an index.  Use of file locks while indexing, and hash lookups during searches
       can help prevent these problems.

       I run out of memory trying to index my files.

       It's true that indexing can take up a lot of memory!  Swish-e is extremely fast at
       indexing, but that comes at the cost of memory.

       The best answer is to install more memory.

       Another option is to use the "-e" switch.  This will require less memory, but indexing will
       take longer as not all data will be stored in memory while indexing.  How much less memory
       and how much more time depends on the documents you are indexing, and the hardware that
       you are using.

       Here's an example of indexing all .html files in /usr/doc on Linux.  This first example is
       without "-e" and used about 84M of memory:

           270279 unique words indexed.
           23841 files indexed.  177640166 total bytes.
           Elapsed time: 00:04:45 CPU time: 00:03:19

       This is with "-e", and used about 26M of memory:

           270279 unique words indexed.
           23841 files indexed.  177640166 total bytes.
           Elapsed time: 00:06:43 CPU time: 00:04:12

       You can also build a number of smaller indexes and then merge them together with "-M".
       Using "-e" while merging will save memory.

       Finally, if you do build a number of smaller indexes, you can specify more than one index
       when searching by using the "-f" switch.  Sorting large result sets by a property will be
       slower when specifying multiple index files while searching.

       "too many open files" when indexing with -e option

       Some platforms report "too many open files" when using the -e economy option.  The -e
       feature uses many temporary files (something like 377) plus the index files and this may
       exceed your system's limits.

       Depending on your platform you may need to set "ulimit" or "unlimit".

       For example, under Linux bash shell:

         $ ulimit -n 1024

       Or under an old Sparc:

         % unlimit openfiles
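
       To see what limit a process actually runs under, something like the following Python
       sketch can be used on Unix-like systems.  The figure of 377 is the rough number quoted
       above, and the margin of 32 for index files and standard streams is a guess, so treat
       the result as a hint only.

```python
import resource


def enough_descriptors(soft_limit: int, needed: int = 377, margin: int = 32) -> bool:
    """Guess whether soft_limit leaves room for the -e temporary files."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= needed + margin


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft limit:", soft, "hard limit:", hard)
    print("probably enough for -e:", enough_descriptors(soft))
```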

       My system admin says Swish-e uses too much of the CPU!

       That's a good thing!  That expensive CPU is supposed to be busy.

       Indexing takes a lot of work -- to make indexing fast much of the work is done in memory
       which reduces the amount of time Swish-e is waiting on I/O.  But, there are two things you
       can try:

       The "-e" option will run Swish-e in economy mode, which uses the disk to store data while
       indexing.  This makes Swish-e run somewhat slower, but also uses less memory.  Since it is
       writing to disk more often it will be spending more time waiting on I/O and less time in
       CPU.  Maybe.

       The other thing is to simply lower the priority of the job using the nice(1) command:

           /bin/nice -n 15 swish-e -c search.conf

       If you are concerned about search time, make sure you are using the -b and -m switches to
       return only one page of results at a time.  If you know that your result sets will be
       large, that you will return results one page at a time, and that several pages of the same
       query will often be requested, it may be wise to request all the documents on the first
       request and then cache the results to a temporary file.  The perl module File::Cache
       makes this very simple to accomplish.
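
       The caching idea can be sketched in a few lines of Python.  This is an illustration
       under stated assumptions, not the File::Cache approach: "run_search" stands in for
       whatever code of yours runs Swish-e and returns the full list of hits, and the cache
       file naming is invented for the example.

```python
import hashlib
import json
import os
from typing import Callable, List


def cached_page(query: str, page: int, page_size: int,
                run_search: Callable[[str], List[str]], cache_dir: str) -> List[str]:
    """Return one page of results, running the full search only on a cache miss."""
    key = hashlib.sha256(query.encode("utf-8")).hexdigest()
    cache_file = os.path.join(cache_dir, key + ".json")
    if os.path.exists(cache_file):
        with open(cache_file, encoding="utf-8") as fh:
            results = json.load(fh)
    else:
        results = run_search(query)  # fetch every hit once
        with open(cache_file, "w", encoding="utf-8") as fh:
            json.dump(results, fh)
    start = page * page_size  # pages are numbered from 0 in this sketch
    return results[start:start + page_size]
```

       The sketch never expires its cache files; a real front end would need to discard them
       whenever the index is rebuilt.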

       Spidering

       How can I index documents on a web server?

       If possible, use the file system method "-S fs" of indexing to index documents in your web
       area of the file system.  This avoids the overhead of spidering a web server and is much
       faster.  ("-S fs" is the default method if "-S" is not specified).

       If this is impossible (the web server is not local, or documents are dynamically
       generated), Swish-e provides two methods of spidering. First, it includes the http method
       of indexing "-S http". A number of special configuration directives are available that
       control spidering (see "Directives for the HTTP Access Method Only" in SWISH-CONFIG).  A
       perl helper script (swishspider) is included in the src directory to assist with spidering
       web servers. There are example configurations for spidering in the conf directory.

       As of Swish-e 2.2, there's a general purpose "prog" document source where a program can
       feed documents to it for indexing.  A number of example programs can be found in the
       "prog-bin" directory, including a program to spider web servers.  The provided spider.pl
       program is full-featured and is easily customized.

       The advantage of the "prog" document source feature over the "http" method is that the
       program is only executed one time, whereas the swishspider program used in the "http"
       method is executed once for every document read from the web server.  The forking of
       Swish-e and compiling of the perl script can be quite expensive, time-wise.

       The other advantage of the "spider.pl" program is that it's simple and efficient to add
       filtering (such as for PDF or MS Word docs) right into the spider.pl's configuration, and
       it includes features such as MD5 checks to prevent duplicate indexing, options to avoid
       spidering some files, or index but avoid spidering.  And since it's a perl program there's
       no limit on the features you can add.

       Why does swish report "./swishspider: not found"?

       Does the file swishspider exist where the error message displays?  If not, either set the
       configuration option SpiderDirectory to point to the directory where the swishspider
       program is found, or place the swishspider program in the current directory when running
       swish-e.

       If you are running Windows, make sure "perl" is in your path.  Try typing perl from a
       command prompt.

       If you are not running Windows, make sure that the shebang line (the first line of the
       swishspider program that starts with #!) points to the correct location of perl.
       Typically this will be /usr/bin/perl or /usr/local/bin/perl.  Also, make sure that you
       have execute and read permissions on swishspider.

       The swishspider perl script is only used with the -S http method of indexing.

       I'm using the spider.pl program to spider my web site, but some large files are not
       indexed.

       The "spider.pl" program has a default limit of 5MB file size.  This can be changed with
       the "max_size" parameter setting.  See "perldoc spider.pl" for more information.

       I still don't think all my web pages are being indexed.

       The spider.pl program has a number of debugging switches and can be quite verbose in
       telling you what's happening, and why.  See "perldoc spider.pl" for instructions.

       Swish is not spidering Javascript links!

       Swish cannot follow links generated by Javascript, as they are generated by the browser
       and are not part of the document.

       How do I spider other websites and combine it with my own (filesystem) index?

       You can either merge two indexes into a single index with "-M", or use "-f" to specify more
       than one index while searching.

       You will have better results with the "-f" method.

       Searching

       How do I limit searches to just parts of the index?

       If you can identify "parts" of your index by the path name you have two options.

       The first option is by indexing the document path.  Add this to your configuration:

           MetaNames swishdocpath

       Now you can search for words or phrases in the path name:

           swish-e -w 'foo AND swishdocpath=(sales)'

       So that will only find documents with the word "foo" and where the file's path contains
       "sales".  That might not works as well as you like, though, as both of these paths will
       match:

           /web/sales/products/index.html
           /web/accounting/private/sales_we_messed_up.html

       This can be solved by searching with a phrase (assuming "/" is not a WordCharacter):

           swish-e -w 'foo AND swishdocpath=("/web/sales/")'
           swish-e -w 'foo AND swishdocpath=("web sales")'  (same thing)

       The second option is a bit more powerful.  With the "ExtractPath" directive you can use a
       regular expression to extract out a sub-set of the path and save it as a separate meta
       name:

           MetaNames department
           ExtractPath department regex !^/web/([^/]+).+$!$1!

       Which says: match a path that starts with "/web/", extract everything after that up to,
       but not including, the next "/" and save it in variable $1, and then match everything
       from the "/" onward.  Then replace the entire matched string with $1.  That result gets
       indexed as meta name "department".
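
       The pattern can be tried out on a few paths before indexing.  The Python sketch below
       uses the same regular expression; it only demonstrates the matching and capture, and
       returns None for paths that do not match because what Swish-e does with those is not
       covered here.  The function name is made up for the example.

```python
import re
from typing import Optional

# Same pattern as in the ExtractPath line; group 1 is what the config calls $1.
PATTERN = re.compile(r"^/web/([^/]+).+$")


def extract_department(path: str) -> Optional[str]:
    """Return the first path component after /web/, or None if the path does not match."""
    match = PATTERN.match(path)
    if match is None:
        return None
    return match.group(1)


if __name__ == "__main__":
    for path in ("/web/sales/products/index.html",
                 "/web/accounting/private/sales_we_messed_up.html",
                 "/internal/marketing/plan.html"):
        print(path, "->", extract_department(path))
```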

       Now you can search like:

           swish-e -w 'foo AND department=sales'

       and be sure that you will only match the documents in the /web/sales/* path.  Note that
       you can map completely different areas of your file system to the same metaname:

           # flag the marketing specific pages
           ExtractPath department regex !^/web/(marketing|sales)/.+$!marketing!
           ExtractPath department regex !^/internal/marketing/.+$!marketing!

           # flag the technical departments pages
           ExtractPath department regex !^/web/(tech|bugs)/.+$!tech!

       Finally, if you have something more complicated, use "-S prog" and write a perl program or
       use a filter to set a meta tag when processing each file.

       How is ranking calculated?

       The "swishrank" property value is calculated based on which Ranking Scheme (or algorithm)
       you have selected. In this discussion, any time the word fancy is used, you should consult
       the actual code for more details. It is open source, after all.

       Things you can do to affect ranking:

       MetaNamesRank
           You may configure your index to bias certain metaname values more or less than others.
           See the "MetaNamesRank" configuration option in SWISH-CONFIG.

       IgnoreTotalWordCountWhenRanking
           Set to 1 (default) or 0 in your config file. See SWISH-CONFIG.  NOTE: You must set
           this to 0 to use the IDF Ranking Scheme.

       structure
           Each term's position in each HTML document is given a structure value based on the
           context in which the word appears. The structure value is used to artificially inflate
           the frequency of each term in that particular document.  These structural values are
           defined in config.h:

            #define RANK_TITLE             7
            #define RANK_HEADER            5
            #define RANK_META              3
            #define RANK_COMMENTS          1
            #define RANK_EMPHASIZED        0

           For example, if the word "foo" appears in the title of a document, the Scheme will
           treat that document as if "foo" appeared 7 additional times.

       All Schemes share the following characteristics:

       AND searches
           The rank value is averaged for all AND'd terms. Terms within a set of parentheses ()
           are averaged as a single term (this is an acknowledged weakness and is on the TODO
           list).

       OR searches
           The rank value is summed and then doubled for each pair of OR'd terms. This results in
           higher ranks for documents that have multiple OR'd terms.

       scaled rank
           After a document's raw rank score is calculated, a final rank score is calculated
           using a fancy "log()" function. All the documents are then scaled against a base score
           of 1000.  The top-ranked document will therefore always have a "swishrank" value of
           1000.

       Here is a brief overview of how the different Schemes work. The number in parentheses
       after the name is the value to invoke that scheme with "swish-e -R" or "RankScheme()".

       Default (0)
           The default ranking scheme considers the number of times a term appears in a document
           (frequency), the MetaNamesRank and the structure value. The rank might be summarized
           as:

            DocRank = Sum of ( structure + metabias )

           Consider this output with the DEBUG_RANK variable set at compile time:

            Ranking Scheme: 0
            Word entry 0 at position 6 has struct 7
            Word entry 1 at position 64 has struct 41
            Word entry 2 at position 71 has struct 9
            Word entry 3 at position 132 has struct 9
            Word entry 4 at position 154 has struct 9
            Word entry 5 at position 423 has struct 73
            Word entry 6 at position 541 has struct 73
            Word entry 7 at position 662 has struct 73
            File num: 1104.  Raw Rank: 21.  Frequency: 8 scaled rank: 30445
             Structure tally:
             struct 0x7 = count of 1 ( HEAD TITLE FILE ) x rank map of 8 = 8

             struct 0x9 = count of 3 ( BODY FILE ) x rank map of 1 = 3

             struct 0x29 = count of 1 ( HEADING BODY FILE ) x rank map of 6 = 6

             struct 0x49 = count of 3 ( EM BODY FILE ) x rank map of 1 = 3

           Every word instance starts with a base score of 1.  Then for each instance of your
           word, a running sum is taken of the structural value of that word position plus any
           bias you've configured.  In the example above, the raw rank is "1 + 8 + 3 + 6 + 3 =
           21".

           Consider this line:

             struct 0x7 = count of 1 ( HEAD TITLE FILE ) x rank map of 8 = 8

           That means there was one instance of our word in the title of the file.  Its context
           was in the <head> tagset, inside the <title>.  The <title> is the most specific
           structure, so it gets the RANK_TITLE score: 7. The base rank of 1 plus the structure
           score of 7 equals 8. If there had been two instances of this word in the title, then
           the score would have been "8 + 8 = 16".
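
           The tally above can be reproduced with a few lines of Python.  This is only a
           restatement of the arithmetic in the sample output, not the Swish-e code: the
           struct values and the rank map are copied from the debug listing, and the leading
           1 is taken from the sum quoted above.

```python
from collections import Counter

# Struct value of each word entry in the sample output (decimal, as printed).
STRUCTS = [7, 41, 9, 9, 9, 73, 73, 73]

# "rank map" figure shown for each struct value in the structure tally.
RANK_MAP = {0x07: 8, 0x09: 1, 0x29: 6, 0x49: 1}


def raw_rank(structs, rank_map, base=1):
    """Sum count x rank map over each struct value, starting from base."""
    tally = Counter(structs)
    return base + sum(count * rank_map[struct] for struct, count in tally.items())


if __name__ == "__main__":
    print("Raw Rank:", raw_rank(STRUCTS, RANK_MAP))
```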

       IDF (1)
           IDF is short for Inverse Document Frequency. That's fancy ranking lingo for taking
           into account the total frequency of a term across the entire index, in addition to the
           term's frequency in a single document. IDF ranking also uses the relative density of a
           word in a document to judge its relevancy. Words that appear more often in a doc make
           that doc's rank higher, and longer docs are not weighted higher than shorter docs.

           The IDF Scheme might be summarized as:

             DocRank = Sum of ( density * idf * ( structure + metabias ) )

           Consider this output from DEBUG_RANK:

            Ranking Scheme: 1
            File num: 1104  Word Score: 1  Frequency: 8  Total files: 1451
            Total word freq: 108   IDF: 2564
            Total words: 1145877   Indexed words in this doc: 562
            Average words: 789   Density: 1120    Word Weight: 28716
            Word entry 0 at position 6 has struct 7
            Word entry 1 at position 64 has struct 41
            Word entry 2 at position 71 has struct 9
            Word entry 3 at position 132 has struct 9
            Word entry 4 at position 154 has struct 9
            Word entry 5 at position 423 has struct 73
            Word entry 6 at position 541 has struct 73
            Word entry 7 at position 662 has struct 73
            Rank after IDF weighting: 574321
            scaled rank: 132609
             Structure tally:
             struct 0x7 = count of  1 ( HEAD TITLE FILE ) x rank map of 8 = 8

             struct 0x9 = count of  3 ( BODY FILE ) x rank map of 1 = 3

             struct 0x29 = count of  1 ( HEADING BODY FILE ) x rank map of 6 = 6

             struct 0x49 = count of  3 ( EM BODY FILE ) x rank map of 1 = 3

           It is similar to the default Scheme, but notice how the total number of files in the
           index and the total word frequency (as opposed to the document frequency) are both
           part of the equation.
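
           The figures in this sample happen to be reproducible with truncating integer
           arithmetic.  The Python sketch below is inferred from this one listing only and is
           not taken from the Swish-e source, so every formula in it is an assumption; consult
           the code for the real calculation.  The value 20 is the sum of the structure tally
           (8 + 3 + 6 + 3).

```python
import math


def idf_debug_numbers(total_files, total_word_freq, total_words,
                      words_in_doc, frequency, structure_sum):
    """Recompute the sample DEBUG_RANK figures; all formulas are guesses."""
    # Assumed: natural log of an integer quotient, scaled by 1000 and truncated.
    idf = int(math.log(total_files // total_word_freq) * 1000)
    average_words = total_words // total_files
    # Assumed: relative document length (as a percentage) times term frequency.
    density = (average_words * 100 // words_in_doc) * frequency
    word_weight = density * idf // 100
    rank = 1 + word_weight * structure_sum
    return {"idf": idf, "average_words": average_words, "density": density,
            "word_weight": word_weight, "rank": rank}


if __name__ == "__main__":
    print(idf_debug_numbers(1451, 108, 1145877, 562, 8, 20))
```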

       Ranking is a complicated subject. SWISH-E allows for more Ranking Schemes to be developed
       and experimented with, selected with the -R option (of the swish-e command) or with
       RankScheme() (see the API documentation). Experiment and share your findings via the
       discussion list.

       How can I limit searches to the title, body, or comment?

       Use the "-t" switch.

       I can't limit searches to title/body/comment.

       Or, I can't search with meta names, all the names are indexed as "plain".

       Check in the config.h file if #define INDEXTAGS is set to 1. If it is, change it to 0,
       recompile, and index again.  When INDEXTAGS is 1, ALL the tags are indexed as plain text,
       that is, you index "title", "h1", and so on, AND they lose their indexing meaning.  If
       INDEXTAGS is set to 0, you will still index meta tags and comments, unless you have
       indicated otherwise in the user config file with the IndexComments directive.

       Also, check for the "UndefinedMetaTags" setting in your configuration file.

       I've tried running the included CGI script and I get an "Internal Server Error"

       Debugging CGI scripts is beyond the scope of this document.  Internal Server Error
       basically means "check the web server's log for an error message", as it can mean a bad
       shebang (#!) line, a missing perl module, FTP transfer error, or simply an error in the
       program.  The CGI script swish.cgi in the example directory contains some debugging
       suggestions.  Type "perldoc swish.cgi" for information.

       There are also many, many CGI FAQs available on the Internet.  A quick web search should
       offer help.  As a last resort you might ask your webadmin for help...

       When I try to view the swish.cgi page I see the contents of the Perl program.

       Your web server is not configured to run the program as a CGI script.  This problem is
       described in "perldoc swish.cgi".

       How do I make Swish-e highlight words in search results?

       Short answer:

       Use the supplied swish.cgi or search.cgi scripts located in the example directory.

       Long answer:

       Swish-e can't because it doesn't have access to the source documents when returning
       results, of course.  But a front-end program of your creation can highlight terms.  Your
       program can open up the source documents and then use regular expressions to replace
       search terms with highlighted or bolded words.

       But, that will fail with all but the most simple source documents.  For HTML documents,
       for example, you must parse the document into words and tags (and comments).  A word you
       wish to highlight may span multiple HTML tags, or be a word in a URL and you wish to
       highlight the entire link text.

       Perl modules such as HTML::Parser and XML::Parser make word extraction possible.  Next,
       you need to consider that Swish-e uses settings such as WordCharacters, BeginCharacters,
       EndCharacters, IgnoreFirstChar, and IgnoreLastChar to define a "word".  That is, you
       can't consider that a string of characters with white space on each side is a word.

       Then things like TranslateCharacters, and HTML Entities may transform a source word into
       something else, as far as Swish-e is concerned.  Finally, searches can be limited by
       metanames, so you may need to limit your highlighting to only parts of the source
       document.  Throw phrase searches and stopwords into the equation and you can see that it's
       not a trivial problem to solve.

       All hope is not lost, though, as Swish-e does provide some help.  Using the "-H" option
       it will return in the headers the current index (or indexes) settings for WordCharacters
       (and others) required to parse your source documents as it parses them during indexing,
       and will return a "Parsed Words:" header that will show how it parsed the query
       internally.  If you use fuzzy indexing (word stemming, soundex, or metaphone) then you
       will also need to stem each word in your document before comparing with the "Parsed
       Words:" returned by Swish-e.

       The Swish-e stemming code is available either by using the Swish-e Perl module
       (SWISH::API) or the C library (included with the swish-e distribution), or by using the
       SWISH::Stemmer module available on CPAN.  Also on CPAN is the module
       Text::DoubleMetaphone.  Using SWISH::API probably provides the best stemming support.
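
       As a starting point, here is a deliberately small Python sketch of the tokenizing step
       for plain text.  It assumes you have already obtained the WordCharacters setting and the
       parsed query words from the "-H" headers by some means of your own, and it ignores
       BeginCharacters, EndCharacters, stemming, phrases, and markup entirely.  The function
       name and the use of <b> tags are choices made for the example.

```python
import re
from typing import Iterable


def highlight(text: str, parsed_words: Iterable[str], word_chars: str) -> str:
    """Wrap each token of plain text that matches a parsed query word in <b> tags."""
    wanted = {word.lower() for word in parsed_words}
    # A token is a run of characters listed in word_chars (case-insensitive).
    token = re.compile("[" + re.escape(word_chars) + "]+", re.IGNORECASE)

    def mark(match):
        word = match.group(0)
        if word.lower() in wanted:
            return "<b>" + word + "</b>"
        return word

    return token.sub(mark, text)


if __name__ == "__main__":
    letters = "abcdefghijklmnopqrstuvwxyz"
    print(highlight("Swish-e Indexes files fast", ["indexes"], letters))
```

       Note how the result depends on the character set: with "-" added to the word characters,
       "swish-e" becomes a single token instead of two.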

       Do filters affect the performance during search?

       No.  Filters (FileFilter or via "prog" method) are only used for building the search index
       database.  During search requests there will be no filter calls.

       I have read the FAQ but I still have questions about using Swish-e.

       The Swish-e discussion list is the place to go (see http://swish-e.org/).  Please do not
       email developers directly.  The list is the best place to ask questions.

       Before you post please read QUESTIONS AND TROUBLESHOOTING located in the INSTALL page.
       You should also search the Swish-e discussion list archive which can be found on the
       swish-e web site.

       In short, be sure to include the following when asking for help.

       * The swish-e version (./swish-e -V)
       * What you are indexing (and perhaps a sample), and the number of files
       * Your Swish-e configuration file
       * Any error messages that Swish-e is reporting

Document Info

       $Id: SWISH-FAQ.pod 2147 2008-07-21 02:48:55Z karpet $
