Provided by: wwwstat_2.0-7_all

NAME

       wwwstat - summarize WWW server (httpd) access statistics

SYNOPSIS

       wwwstat [-F system_config] [-f user_config] [options...]  [--]
               [ summary | logfile | + | - ]...

DESCRIPTION

       wwwstat reads a sequence of httpd common logfile format (CLF) access_log files and/or
       prior wwwstat output summary files and/or the standard input and outputs a summary of the
       access statistics in HTML.

       Since wwwstat does not make any changes to the input files or write any files in the
       server directories, it can be run by any user with read access to the input logfile(s) and
       summary file(s).  This allows people other than the webmaster to run specialized analyses
       of just the things they are interested in summarizing.

       wwwstat provides World Wide Web (WWW) access statistics, which do not necessarily
       correspond to statistics on individual users. It counts the number of HTTP requests
       received by the server and the number of bytes transmitted in response to those requests,
       according to what is in the logfile(s), and outputs those counts as tables broken down by
       category of request.

       wwwstat output summaries can be read by gwstat to produce fancy graphs of the summarized
       statistics. The splitlog program can be used to split a large logfile into separate files
       by entry prefix or URL path.

       wwwstat is a perl script, which means you need to have a perl interpreter to run the
       program.  It has been tested with perl versions 4.036 and 5.002.

   Output Sections
       wwwstat's output consists of a set of cross-reference links, the sum totals and averages
       for the processed data, and a sequence of amount-by-category tables partitioned into
       sections.  The section categories are based on the characteristics evident from the access
       request, as provided by the common logfile format (see NOTES).  These include:

       Request Date        e.g., "Feb  2 1996"

       Request Hour        e.g., "00" through "23"

       Client Domain       The Fully-Qualified Domain Name (FQDN) suffix that corresponds to an
                           organization type or country name.

       Reversed Subdomain  The FQDN, usually minus the first (machine name) component, and
                           reversed so that it is easier to read when sorted.

       URL/Archive         Grouping based on Request-URI or non-success status code.

       Identity            The user identity based on IdentityCheck token or Authorization field.

       Each section can be enabled/disabled using the configuration files or command-line options
       (see Section Display Options).
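
       The Client Domain and Reversed Subdomain categories above can be illustrated with a
       short Python sketch.  This is not wwwstat's own (perl) code; the function names and the
       rule of keeping two-label names intact are assumptions made for illustration.

```python
def client_domain(fqdn):
    """Return the last label of a hostname, e.g. 'edu' or 'uk'."""
    return fqdn.lower().rsplit(".", 1)[-1]


def reversed_subdomain(fqdn, strip_machine=True):
    """Reverse the labels of a hostname, optionally dropping the first one."""
    labels = fqdn.lower().split(".")
    if strip_machine and len(labels) > 2:
        labels = labels[1:]  # drop the machine name component (assumed rule)
    return ".".join(reversed(labels))


print(client_domain("kiwi.ics.uci.edu"))       # edu
print(reversed_subdomain("kiwi.ics.uci.edu"))  # edu.uci.ics
```

       Reversing the labels means that a plain string sort groups hosts from the same
       organization next to each other.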

   Output Table Format
       Inside each section, the statistics are presented as a preformatted table.

       %Reqs %Byte  Bytes Sent  Requests   category-type
       ----- ----- ------------ -------- |---------------
       NN.NN NN.NN NNNNNNNNNNNN NNNNNNNN | category-value
       100.0 100.0 NNNNNNNNNNNN NNNNNNNN | category-value

       Requests    Requests received for this category-value.
       Bytes Sent  Bytes transmitted for this category-value.
       %Reqs       (<Requests>/<Total Requests>)*100.
       %Byte       (<Bytes Sent>/<Total Bytes>)*100.

       The table can be sorted by category-value (-sort key), number of requests received (-sort
       req), or number of bytes transmitted (-sort byte).  It can also be limited to the -top N
       entries.
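
       The column formulas and the sort/top behaviour described above might be sketched in
       Python as follows.  The function name, the input data, and the descending order for
       req and byte sorts are assumptions; wwwstat itself is a perl script and may differ in
       detail.

```python
def table_rows(counts, sort="key", top=None):
    """counts maps category-value -> (requests, bytes_sent)."""
    total_reqs = sum(r for r, _ in counts.values())
    total_bytes = sum(b for _, b in counts.values())
    rows = []
    for value, (reqs, sent) in counts.items():
        pct_reqs = 100.0 * reqs / total_reqs if total_reqs else 0.0
        pct_byte = 100.0 * sent / total_bytes if total_bytes else 0.0
        rows.append((pct_reqs, pct_byte, sent, reqs, value))
    if sort == "req":
        rows.sort(key=lambda row: row[3], reverse=True)   # most requests first
    elif sort == "byte":
        rows.sort(key=lambda row: row[2], reverse=True)   # most bytes first
    else:
        rows.sort(key=lambda row: row[4])                 # by category-value
    return rows[:top] if top else rows


# Made-up counts for three request hours.
hours = {"00": (10, 5000), "01": (30, 1000), "02": (60, 4000)}
for pct_reqs, pct_byte, sent, reqs, value in table_rows(hours, sort="req", top=2):
    print("%5.2f %5.2f %12d %8d | %s" % (pct_reqs, pct_byte, sent, reqs, value))
```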

OPTIONS

   Configuration Options
       These options define how wwwstat should establish defaults and interpret the command-line.

       -F filename
              Get system configuration defaults from the given file.  If used, this must be the
              first argument on the command-line, since it needs to be interpreted before the
              other command options.  The file wwwstat.rc is included with the distribution as an
              example of this file; it contains perl source code which directly sets the control
              and display options provided by wwwstat.  If filename is not a pathname, the
              include path (see FILES) is searched for filename.  An empty string as filename
              will disable this feature.  [-F "wwwstat.rc"]

       -f filename
              Get user configuration defaults from the given file. If used, this must be the
              first argument on the command-line after -F (if any). The file is the same format
              as for the -F option (see wwwstat.rc).  If filename is not a pathname, the include
              path (see FILES) is searched for filename.  An empty string as filename will
              disable this feature.  [-f ".wwwstatrc"]

       --     Last option (the remaining arguments are treated as input files).

   Diagnostic Options
       These options provide information about wwwstat usage or about some unusual aspects of the
       logfile(s) being processed.

       -h     Help - display usage information to STDERR and then exit.

       -v     Verbose display to STDERR of each log entry processed.

       -x     Display to STDERR all requests resulting in HTTP error responses.

       -e     Display to STDERR all invalid log entries. Invalid log entries can occur if the
              server is miswriting or overwriting its own log, if the request is made by a broken
              client or proxy, or if a malicious attacker is trying to gain privileged access to
              your system.  For the latter reason, the webmaster should run wwwstat with this
              option on a regular basis.

   Display Options
       These options modify the output format.

       -H string
              Use the given string as the HTML title and heading for output.

       -X string
              Use the given string as the cross-reference URL to the last summary output.  Any
              occurrence of the characters "%M" or "%Y" is replaced by the month or year,
              respectively, of the month prior to the first log entry date.  The empty string
              will exclude any cross-reference.

       -R     Display the daily stats table sorted in reverse. This option is primarily for use
              with the gwstat program for producing graphs of the output.

       -l
       -L     Do (-l) or don't (-L) display the full DNS hostname of clients in your local domain
              (which is determined by the configured value of $AppendToLocalhost) in the section
              on subdomain statistics.  The default [-L] is to strip the machine name from local
              addresses.

       -o
       -O     Do (-o) or don't (-O) display the full DNS hostname of clients outside your local
              domain in the section on subdomain statistics.  The default [-O] is to strip the
              machine name from outside addresses.

       -u
       -U     Do (-u) or don't (-U) display the IP address of clients with unresolved domain
              names in the section on subdomain statistics. The -dns option can be used to
              resolve some names, but not all IP hosts have a DNS name (SLIP/PPP connections) and
              sometimes a host's DNS service is inaccessible. The default [-U] is to group all
              such addresses under the category "Unresolved".

       -dns
       -nodns Do (-dns) or don't (-nodns) use the system's hostname lookup facilities to find the
              DNS hostname associated with any unresolved IP addresses. Looking up a DNS name may
              be very slow, particularly when the results are negative (no DNS name), which is
              why a caching capability is included as well.  [-nodns]

       -cache filename
              Use the given DBM database as the read/write persistent DNS cache (the .dir and
              .pag extensions are appended automatically). Cached entries (including negative
              results) are removed after the time configured for $DNSexpires [two months].  No
              caching is performed if filename is the empty string, which may be needed if your
              system does not support DBM or NDBM functionality. Running -dns without a
              persistent cache is not recommended.  [-cache "dnscache"]

       -trunc N
              Truncate the URLs listed in the archive section after the Nth hierarchy level. This
              option is commonly used to reduce the output size and memory requirements of
              wwwstat by grouping the requests by directory tree instead of listing every URL.
              The default [-trunc 0] is to display every requested URL.

       -files
       -nofiles
              Do (-files) or don't (-nofiles) include the last component of a URL (usually the
              filename) in the archive section. This option is commonly used to reduce the output
              size and memory requirements of wwwstat by grouping the requests by directory
              instead of listing every URL.  The default [-files] is to display the entire
              requested URL.

       -link
       -nolink
              Do (-link) or don't (-nolink) add a hypertext link around each archive URL.  This
              option is useful for local maintenance, but it is not recommended for publication
              of the HTML results (it often results in links to temporary or nonexistent
              resources, and leads people/robots to resources that might not be publicly
              available).  [-nolink]

       -cgi
       -nocgi Do (-cgi) or don't (-nocgi) prefix the summary output with CGI header fields
              appropriate for use with the HTTP common gateway interface.  Using wwwstat as a CGI
              script is not recommended - it is usually better to simply run the wwwstat program
              periodically and serve the static output file.  [-nocgi]
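
       The grouping performed by -trunc N and -nofiles (described earlier in this list) can be
       pictured with the following Python sketch.  It is an illustration only: the function
       name is hypothetical, and the exact meaning of "Nth hierarchy level" and the trailing
       "/" on grouped names are assumptions rather than documented behaviour.

```python
def archive_name(url, trunc=0, files=True):
    """Group a URL path by directory, as a sketch of -trunc N and -nofiles."""
    parts = url.split("/")          # "/a/b/c.html" -> ["", "a", "b", "c.html"]
    if not files and parts[-1] != "":
        parts[-1] = ""              # drop the last component, keep trailing "/"
    if trunc > 0 and len(parts) - 1 > trunc:
        parts = parts[:trunc + 1] + [""]   # keep N levels, mark as a directory
    return "/".join(parts)


url = "/pub/websoft/wwwstat/index.html"
print(archive_name(url))                # /pub/websoft/wwwstat/index.html
print(archive_name(url, files=False))   # /pub/websoft/wwwstat/
print(archive_name(url, trunc=2))       # /pub/websoft/
```

       Because many URLs collapse onto one grouped name, fewer distinct rows have to be kept,
       which is where the savings in output size and memory come from.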

   Section Display Options
       These options change the display of entire sections (as opposed to the entries within
       those sections).  They allow the user to enable or disable an entire section, set the
       sorting method for that section, and limit the number of displayed entries for that
       section.  These options are context-sensitive and processed in the order given.

       -all
       -noall Include (-all) or exclude (-noall) all of the display sections. The -noall option
              is commonly used just prior to one or more of the other section options, such that
              only the listed sections are displayed.

       -daily
       -nodaily
              Include (-daily) or exclude (-nodaily) the section of statistics by request date
              and set the scope for later -sort and -top options to this section.

       -hourly
       -nohourly
              Include (-hourly) or exclude (-nohourly) the section of statistics by request hour
              and set the scope for later -sort and -top options to this section.

       -domain
       -nodomain
              Include (-domain) or exclude (-nodomain) the section of statistics by the client's
              Internet domain and set the scope for later -sort and -top options to this section.

       -subdomain
       -nosubdomain
              Include (-subdomain) or exclude (-nosubdomain) the section of statistics by the
              client's Internet subdomain (reversed for display) and set the scope for later
              -sort and -top options to this section.

       -archive
       -noarchive
              Include (-archive) or exclude (-noarchive) the section of statistics by requested
              URL/archive and set the scope for later -sort and -top options to this section.

       -r
       -ident
       -noident
              Include (-r or -ident) or exclude (-noident) the section of statistics by the
              identity of the user (if IdentityCheck is ON) or the authentication userid (if
              supplied) and set the scope for later -sort and -top options to this section.  DO
              NOT PUBLISH this information, as that would reveal security-related identities and
              be a violation of privacy.  This option is provided for administrative purposes
              only.

       -sort (key|byte|req)
              Sort this section by its primary key, the number of bytes transmitted, or the
              number of requests received.  [-sort key]

       -top N Display only the top N entries for this section. This option assumes that the -sort
              option has been set to either byte or req.

       -both  Display both the top N entries for this section [10, sorted by requests], and then
              the full section (all entries) sorted by key.
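
       Because these options are context-sensitive, the order of arguments matters.  The
       Python sketch below models one plausible reading of that behaviour: each section
       option sets the current scope, and a later -sort or -top applies to it.  The data
       structure, the defaults, and the choice to ignore -sort/-top before any section option
       are assumptions, not taken from the wwwstat source.

```python
SECTIONS = ["daily", "hourly", "domain", "subdomain", "archive", "ident"]


def parse_section_options(args):
    """Apply section options left to right, tracking the current scope."""
    # Placeholder defaults; the real defaults come from the configuration files.
    state = {name: {"show": True, "sort": "key", "top": None} for name in SECTIONS}
    scope = None
    it = iter(args)
    for arg in it:
        name = arg.lstrip("-")
        if name in ("all", "noall"):
            for section in state.values():
                section["show"] = (name == "all")
        elif name in SECTIONS:
            state[name]["show"] = True
            scope = name
        elif name.startswith("no") and name[2:] in SECTIONS:
            state[name[2:]]["show"] = False
            scope = name[2:]
        elif name == "sort" and scope:
            state[scope]["sort"] = next(it)
        elif name == "top" and scope:
            state[scope]["top"] = int(next(it))
    return state


args = ["-noall", "-domain", "-sort", "req", "-top", "10", "-archive", "-sort", "byte"]
state = parse_section_options(args)
print(state["domain"])    # {'show': True, 'sort': 'req', 'top': 10}
print(state["archive"])   # {'show': True, 'sort': 'byte', 'top': None}
```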

   Search Options
       These options are used to limit the analysis to requests matching a pattern.  The pattern
       is supplied in the form of a perl regular expression, except that the characters "+" and
       "." are escaped automatically unless the -noescape option is given.  Enclose the pattern
       in single-quotes to prevent the command shell from interpreting some special characters.

       Multiple occurrences of the same option result in an OR-ing of the regular expressions.
       Search options are only applied to logfile entries; any summary files input must have been
       created with the same search options.

       -a regexp
       -A regexp
              Include (-a) or exclude (-A) all requests containing a hostname/IP address matching
              the given perl regular expression.

       -c regexp
       -C regexp
              Include (-c) or exclude (-C) all requests resulting in an HTTP status code matching
              the given perl regular expression.

       -d regexp
       -D regexp
              Include (-d) or exclude (-D) all requests occurring on a date (e.g., "Feb  2 1994")
              matching the given perl regular expression.

       -t regexp
       -T regexp
              Include (-t) or exclude (-T) all requests occurring during the hour (e.g., "23" is
              11pm to midnight) matching the given perl regular expression.

       -m regexp
       -M regexp
              Include (-m) or exclude (-M) all requests using an HTTP method (e.g., "HEAD")
              matching the given perl regular expression.

       -n regexp
       -N regexp
              Include (-n) or exclude (-N) all requests on a URL (archive name) matching the
              given perl regular expression.

       -noescape
              Do not escape the special characters ("+" and ".") in the remaining search options.
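
       The automatic escaping and the OR-ing of repeated options can be approximated in
       Python, whose regular expression syntax is close to perl's for these simple patterns.
       This is a sketch under assumptions: the helper name is hypothetical, and wwwstat's
       actual escaping code may treat edge cases (such as an already escaped character)
       differently.

```python
import re


def compile_search(patterns, escape=True):
    """OR together several search patterns, escaping '+' and '.' by default."""
    prepared = []
    for pattern in patterns:
        if escape:
            # Put a backslash before every "+" and "." so they match literally.
            pattern = re.sub(r"([+.])", r"\\\1", pattern)
        prepared.append("(?:%s)" % pattern)
    return re.compile("|".join(prepared))


commercial = compile_search([".com$"])
print(bool(commercial.search("www.example.com")))   # True
print(bool(commercial.search("telecom")))           # False

# With escaping turned off, "." matches any character again.
print(bool(compile_search([".com$"], escape=False).search("telecom")))   # True
```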

INPUT

       After parsing the options, the remaining arguments on the command-line are treated as
       input arguments and are read in the order given.  If no input arguments are given, the
       configured default logfile is read [+].

       -      Read from standard input (STDIN).

       +      Read the default logfile. [as configured]

       filename...
              Read the given file and determine from the first line whether it is a previous
              output summary or a CLF logfile.  If the filename's extension indicates that it is
              compressed (gz|z|Z), then pipe it through the configured decompression program
              [gunzip -c] first. Summary files must have been created with the same (or similar)
              configuration and command-line options as the currently running program; if not,
              weird things will happen.
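
       How each input argument is handled can be summarized with a small Python sketch.  The
       function name and the returned tuples are hypothetical; only the "-", "+", extension
       test, and "gunzip -c" default come from the description above.  Telling a summary file
       from a CLF logfile by its first line is left out because the summary header format is
       not described here.

```python
import re

# Extensions that mark a compressed file, per the (gz|z|Z) rule above.
COMPRESSED = re.compile(r"\.(gz|z|Z)$")


def plan_input(arg):
    """Describe how one input argument would be opened."""
    if arg == "-":
        return ("stdin", None)
    if arg == "+":
        return ("default-logfile", None)
    if COMPRESSED.search(arg):
        return ("pipe", ["gunzip", "-c", arg])   # configured decompression program
    return ("file", arg)


for arg in ["-", "+", "access_log", "access_log.gz"]:
    print(arg, plan_input(arg))
```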

USAGE

       wwwstat is used for many purposes:

         o    as a diagnostic utility for measuring server activity, finding incorrect URL
              references, and detecting attempted misuse of the server;

         o    as a public relations tool for measuring technology or information transfer (i.e.,
              Is the message getting out? To the right people?);

         o    as an archival tool for tracking web usage over time without storing the entire
              logfile; and,

         o    most often, as an easy mechanism for justifying all the hard work that went into
              creating the web content that people out there are requesting.

       In most cases, wwwstat is run on a periodic basis (nightly, weekly, and/or monthly) by a
       wrapper program as a crontab entry shortly after midnight, typically in conjunction with
       rotating the current logfile.  The output is usually directed to a temporary file which
       can later be moved to a published location.  The temporary file is necessary to avoid
       erasing your published file during wwwstat's processing (which would look very odd if
       someone tried to GET it from your web).
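
       The temporary-file step can be sketched in Python as below.  This is one possible
       wrapper fragment, not part of wwwstat: the function name and file mode are assumptions,
       and capturing wwwstat's output (for example by running it as a subprocess) is left to
       the caller.

```python
import os
import tempfile


def publish(html, destination):
    """Write html next to destination, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(destination))
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(html)
        os.chmod(temp_path, 0o644)          # mkstemp creates the file owner-only
        os.replace(temp_path, destination)  # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(temp_path):
            os.unlink(temp_path)
        raise
```

       Creating the temporary file in the same directory as the published file keeps the
       final rename on one filesystem, so readers see either the old summary or the new one,
       never a partial file.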

       wwwstat can be run as a CGI script (-cgi), but that is not recommended unless the input
       logfile is very small.

       All of the command-line options, and a few options that are not available from the
       command-line, can be changed within the user and system configuration files (see
       wwwstat.rc).  These files are actually perl library modules which are executed as part of
       the program's initialization.  The example provided with the distribution includes
       complete documentation on what variables can be set and their range of values.

   Perl Regular Expressions
       The Search Options and many of the configuration file settings allow for full use of perl
       regular expressions (with the exception that the -a, -A, -n and -N options treat '+' and
       '.'  characters as normal alphabetic characters unless they are preceded by the -noescape
       option).  Most people only need to know the following special characters:

       ^       at start of pattern, means "starts with pattern".
       $       at end of pattern, means "ends with pattern".
       (...)   groups pattern elements as a single element.
       ?       matches preceding element zero or one times.
       *       matches preceding element zero or more times.
       +       matches preceding element one or more times.
       .       matches any single character.
       [...]   denotes a class of characters to match. [^...] negates the class.  Inside a class,
               '-' indicates a range of characters.
       (A|B|C) matches if A or B or C matches.

       Depending on your command shell, some special characters may need to be escaped on the
       command line or enclosed in single-quotes to avoid shell interpretation.

EXAMPLES

       Summarize requests from commercial domains.
              wwwstat -a '.com$'

       Summarize requests from the host kiwi.ics.uci.edu
              wwwstat -a '^kiwi.ics.uci.edu$'

       Summarize requests not from kiwi.ics.uci.edu
              wwwstat -A '^kiwi.ics.uci.edu$'

       Summarize requests resulting in temporary redirects
              wwwstat -c '302'

       Summarize requests resulting in server errors
              wwwstat -c '^5'

       Summarize unsuccessful requests
              wwwstat -C '^2' -C '304'

       Summarize requests in first week of the month
              wwwstat -d ' [1-7] '

       Summarize requests in second week of the month
              wwwstat -d ' ([89]|1[0-4]) '

       Summarize requests in third week of the month
              wwwstat -d ' (1[5-9]|2[01]) '

       Summarize requests in fourth week of the month
              wwwstat -d ' 2[2-8] '

       Summarize requests in leftover days of the month
              wwwstat -d ' (29|30|31) '

       Summarize requests in February
              wwwstat -d 'Feb'

       Summarize requests in year 1994
              wwwstat -d '1994'

       Summarize requests not in April
              wwwstat -D 'Apr'

       Summarize requests between midnight and 1am
              wwwstat -t '00'

       Summarize requests not received between noon and 1pm
              wwwstat -T '12'

       Summarize requests with a gif extension
              wwwstat -n '.gif$'

       Summarize requests under user's URL
              wwwstat -n '^/~user/'

       Summarize requests not under "hidden" paths
              wwwstat -N '/hidden/'
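
       The five week-of-month date patterns above rely on the day being space-padded and
       surrounded by spaces (as in "Feb  2 1996").  The following Python sketch checks that
       every day from 1 to 31 matches exactly one of them.  It assumes that date layout and
       uses Python's re module, which accepts these particular patterns unchanged.

```python
import re

WEEK_PATTERNS = {
    "first": r" [1-7] ",
    "second": r" ([89]|1[0-4]) ",
    "third": r" (1[5-9]|2[01]) ",
    "fourth": r" 2[2-8] ",
    "leftover": r" (29|30|31) ",
}


def week_of(date):
    """Return the names of the week patterns that match a date string."""
    return [name for name, pattern in WEEK_PATTERNS.items() if re.search(pattern, date)]


for day in range(1, 32):
    date = "Jan %2d 1996" % day   # day is space-padded, as in "Feb  2 1996"
    assert len(week_of(date)) == 1, date

print(week_of("Feb  2 1996"))   # ['first']
print(week_of("Feb 14 1996"))   # ['second']
```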

ENVIRONMENT

       HOME        Location of user's home directory, placed on INC path.

       LOGDIR      Used instead of HOME if latter is undefined.

       PERLLIB     A colon-separated list of directories in which to look for include and
                   configuration files.

FILES

       Unless a pathname is supplied, the configuration files are obtained from the current
       directory, the user's home directory (HOME or LOGDIR), the standard library path
       (PERLLIB), and the directory indicated by the command pathname (in that order).

       .wwwstatrc     User configuration file.

       wwwstat.rc     System configuration file.

       domains.pl     Mapping of Internet domain to country or organization.

       dnscache.dir
       dnscache.pag   DBM files for persistent DNS cache.

SEE ALSO

       crontab(1), gwstat(1), httpd(1m), perl(1), splitlog(1)

       More info and the latest version of wwwstat can be obtained from

            http://www.ics.uci.edu/pub/websoft/wwwstat/
             ftp://www.ics.uci.edu/pub/websoft/wwwstat/

       If you have any suggestions, bug reports, fixes, or enhancements, please join the
       <wwwstat-users@ics.uci.edu> mailing list by sending e-mail with "subscribe" in the subject
       of the message to the request address <wwwstat-users-request@ics.uci.edu>.  The list is
       archived at the above address.

   More About HTTP
       HTTP/1.1 Proposed Standard
              R. Fielding, J. Gettys, J. C. Mogul, H. Frystyk, and T. Berners-Lee.  "Hypertext
              Transfer Protocol -- HTTP/1.1", U.C. Irvine, DEC, MIT/LCS, August 1996.
              http://www.ics.uci.edu/pub/ietf/http/

   More About Perl
       The Perl Language Home Page
              http://www.perl.com/perl/index.html

       Johan Vromans' Perl Reference Guide
              http://www.xs4all.nl/~jvromans/perlref.html

DIAGNOSTICS

       See also the Diagnostic Options above.

       "[none] to [none]" dates
              wwwstat did not find any matching data to summarize.  If you get such an empty
              summary, it means that either: 1) there was no valid data (the input files are all
              invalid or empty), or 2) none of the data matched the search options given.  Try
              using the -e option to show invalid data.

       100% unresolved
              If the subdomain section indicates that all of the client requests come from
              unresolved hostnames (IP addresses), this probably means that your server is
              running without DNS resolution (common for very busy sites).  You can use the -dns
              option to have wwwstat perform the hostname lookups.  If 100% of the hosts are
              still unresolved with the -dns option in effect, then it may be that all of the
              clients accessing your server are doing so from temporary SLIP/PPP addresses
              without DNS names, or it may be a problem with wwwstat's DNS cache (delete the
              cache files), with your system's DNS software (contact your system administrator),
              or with your network connection.
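
       The interaction between -dns lookups and the persistent cache, including cached
       negative results and expiry, can be modelled with a short Python sketch.  The names
       below are hypothetical, a plain dictionary stands in for the DBM file, and "two months"
       is taken to mean 60 days; none of this is drawn from the wwwstat source.

```python
import time

DNS_EXPIRES = 60 * 24 * 60 * 60   # assumed: "two months" taken as 60 days


def lookup(ip, cache, resolver, now=None):
    """Return the hostname for ip, or None, consulting the cache first."""
    now = time.time() if now is None else now
    entry = cache.get(ip)
    if entry is not None and now - entry[1] < DNS_EXPIRES:
        return entry[0]               # may be None: a cached negative result
    try:
        name = resolver(ip)
    except OSError:
        name = None                   # no DNS name, or DNS unreachable
    cache[ip] = (name, now)           # remember positive and negative results
    return name
```

       A real resolver could be passed as lambda ip: socket.gethostbyaddr(ip)[0], which
       raises an OSError subclass when no name is found.  If a stale or corrupted cache keeps
       returning negative results, every host stays unresolved until the entries expire or
       the cache files are deleted.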

NOTES

   Hits vs Requests vs Visitors
       wwwstat counts HTTP requests received by the server.  When a request is successful, it is
       often referred to as a "hit". Retrieving a single image is one GET request. Retrieving an
       HTML page is also one GET request, but that does not include the separate requests made
       for in-line images or related objects.  Checking to see if a cached image is still valid
       (a HEAD or conditional GET) is also one request.

       In all sections except the archive section, wwwstat shows the statistics for all requests
       (successful or not).  In the archive section, it normally shows all non-successful
       requests under a special category for the status code and only successful requests (hits)
       under the URL or archive tree associated with the request.  However, this grouping of non-
       successful requests is disabled when wwwstat is used with the search options -n, -c, and
       -C, since those options are normally used for finding error conditions.
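
       The archive-section grouping just described might be sketched in Python as follows.
       The sketch treats 2xx and 304 responses as successful, which is inferred from the
       "unsuccessful requests" entry in EXAMPLES; the function name and the "Code NNN" label
       are hypothetical.

```python
def archive_category(url, status, searching=False):
    """Pick the archive-section row for one request."""
    successful = status.startswith("2") or status == "304"
    if successful or searching:
        return url              # hits (or any request during -n/-c/-C searches)
    return "Code %s" % status   # hypothetical label for the status-code category


print(archive_category("/index.html", "200"))                  # /index.html
print(archive_category("/missing.html", "404"))                # Code 404
print(archive_category("/missing.html", "404", searching=True))  # /missing.html
```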

       wwwstat does not count "visitors" -- individual people or programs making the requests.
       HTTP does not, by default, provide any information that can be accurately correlated to an
       individual person, though it is possible (in an unreliable manner) to use HTTP extensions
       and request profiles as a means of tracking individual client programs.  Such tracking
       requires extensive resources (memory and diskspace) and is often considered a violation of
       privacy.

       With the exception of the ident section, wwwstat does not reveal information about the
       individual people making requests.  Unless the output is limited to a specific URL or a
       specific hostname, wwwstat's output does not connect the requester to the URL being
       requested.

   Common Logfile Format
       The httpd common logfile format (CLF) was defined in early 1994 as the result of
       discussions among server and access_log analyzer developers (Roy Fielding, John Franks,
       Kevin Hughes, Ari Luotonen, Rob McCool, and Tony Sanders) on how to make it easier for
       analysis tools to be used across multiple servers.  The format is:

       remote_host ident authuser [date-time zone] "Request-Line" Status-Code bytes

       where          means
       ------------   --------------------------------------
       remote_host    Client DNS hostname or IP address
       ident          Identity check token or "-"
       authuser       Authorization user-id or "-"
       date-time      dd/Mmm/yyyy:hh:mm:ss
       zone           +dddd or -dddd
       Request-Line   The first line of the HTTP request, which normally includes the method,
                      URL, and HTTP-version.
       Status-Code    Response status from server or "-"
       bytes          Size of Entity-Body transmitted or "-"
       ------------   --------------------------------------

       with each field separated by a single space (it turns out that problems occur if the ident
       token contains a space, which was not anticipated by the original designers).
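
       A minimal Python sketch of a parser for this format follows.  The regular expression
       is written from the field table above and is not the one used by wwwstat; the sample
       entry is made up.  Like the format itself, it assumes the ident token contains no
       space, and it tolerates extra fields appended after the byte count.

```python
import re

CLF = re.compile(
    r'^(?P<remote_host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<date_time>\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2}) (?P<zone>[+-]\d{4})\] '
    r'"(?P<request_line>.*)" (?P<status>\d{3}|-) (?P<bytes>\d+|-)'
)


def parse_clf(line):
    """Return a dict of CLF fields, or None if the line is not valid CLF."""
    match = CLF.match(line)
    return match.groupdict() if match else None


# A made-up entry for illustration only.
sample = 'kiwi.ics.uci.edu - - [02/Feb/1996:23:15:07 -0800] "GET /index.html HTTP/1.0" 200 1234'
entry = parse_clf(sample)
print(entry["remote_host"], entry["status"], entry["bytes"])   # kiwi.ics.uci.edu 200 1234
```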

LIMITATIONS

       wwwstat cannot be more accurate than its input.

       The common logfile format does not include the number of bytes transferred in HTTP header
       fields and in error responses.  wwwstat attempts to estimate those bytes based on the
       response code.  Although the built-in estimates will suffice for most applications, your
       results will be more accurate if the estimates are customized for the particular server
       software that generated the logfile.

       Modern httpd servers have extended the CLF to include additional fields (Referer and User-
       Agent) or to make the entire format configurable.  Although wwwstat is able to read
       logfiles which append information to the CLF, it will not make use of that additional
       information.  However, wwwstat is written in perl, so if you want to parse a different
       format all you have to do is change the parsing code.

       wwwstat does not do anything with Referer [sic] or User-Agent information that may be
       present in extended logfiles.  In order to do anything interesting with Referer, the
       program would have to build a Request-URI x Referer x Count table, which would require
       huge gobs of memory and is better done using a separate program with a persistent
       database.  Naturally, this is easy to do once you learn perl.

AUTHOR

       Roy Fielding (fielding@ics.uci.edu), University of California, Irvine.  Please do not send
       questions or requests to the author, since the number of requests has long since
       overwhelmed his ability to reply, and all future support will be through the mailing list
       (see above).

       wwwstat was originally based on a multi-server statistics program called fwgstat-0.035 by
       Jonathan Magid (jem@sunsite.unc.edu) which, in turn, was heavily based on xferstats
       (packaged with version 17 of the Wuarchive FTP daemon) by Chris Myers
       (chris@wugate.wustl.edu).

       This work has been sponsored in part by the Defense Advanced Research Projects Agency
       under Grant Numbers MDA972-91-J-1010 and F30602-94-C-0218.  This software does not
       necessarily reflect the position or policy of the U.S. Government and no official
       endorsement should be inferred.

                                         03 November 1996                              wwwstat(1)