xenial (1) webcheck.1.gz

Provided by: webcheck_1.10.4-1_all

NAME

       webcheck - website link checker

SYNOPSIS

       webcheck [OPTION]...  URL

DESCRIPTION

       webcheck  will  check  the document at the specified URL for links to other documents, follow these links
       recursively and generate an HTML report.

       -i,  --internal=PATTERN
              Mark URLs matching the PATTERN (perl-type regular expression) as an internal link.   Can  be  used
              multiple  times.   Note  that  the  PATTERN  is  matched against the full URL.  URLs matching this
              PATTERN will be considered internal, even if they match one of the --external PATTERNs.

       -x,  --external=PATTERN
              Mark URLs matching the PATTERN (perl-type regular expression) as an external link.   Can  be  used
              multiple times.  Note that the PATTERN is matched against the full URL.
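
       Because the PATTERN is matched against the full URL, a candidate pattern can be tried out against
       sample URLs before it is passed to webcheck.  A minimal sketch using grep -E (the URLs are
       hypothetical, and POSIX extended regular expressions stand in for perl-type ones, which agree for
       simple patterns like this):

```shell
# '/blog' is matched against the full URL, not just the path component.
url='http://www.example.com/blog/post.html'
if echo "$url" | grep -E '/blog' >/dev/null; then
    echo "matches: webcheck -i '/blog' would mark this URL internal"
fi
```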

       -y, --yank=PATTERN
               Do not check URLs matching the PATTERN (perl-type regular expression).  Similar to the -x flag,
               except that this option causes webcheck not to check the matched link at all, whereas -x will
               check the link but not its children.  Can be used multiple times.  Note that the PATTERN is
               matched against the full URL.
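
       For example, to verify that links into a bug tracker resolve without crawling the tracker itself,
       while skipping a slow mirror entirely (the hostnames and paths are hypothetical):

```shell
# -x: pages under /tracker are fetched once to confirm they exist, but
#     their own links are not followed.
# -y: URLs on mirror.example.org are never retrieved at all.
webcheck -x '/tracker' -y 'mirror\.example\.org' http://www.example.com/
```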

       -b, --base-only
              Consider any URL not starting with the base URL to be external.  For example, if you run
                  webcheck -b http://www.example.com/foo
              then http://www.example.com/foo/bar will be considered  internal  whereas  http://www.example.com/
              will be considered external.  By default all the pages on the site will be considered internal.

       -a, --avoid-external
              Avoid  external  links.   Normally  if webcheck is examining an HTML page and it finds a link that
              points to an external document, it will check to see if that external document exists.  This  flag
              disables that action.

       --ignore-robots
              Do  not  retrieve  and  parse  robots.txt  files.   By  default robots.txt files are retrieved and
               honored.  If you are sure you want to ignore and override the webmaster's decision, this
               option can be used.
              For more information on robots.txt handling see the NOTES section below.

       -q, --quiet, --silent
              Do not print out progress as webcheck traverses a site.

       -d, --debug
              Print debugging information while crawling the site.  This option is mainly useful for developers.

       -o, --output=DIRECTORY
              Output  directory.  Use to specify the directory where webcheck will dump its reports. The default
              is the current directory or as specified by config.py. If this directory does not exist it will be
              created for you (if possible).

       -c, --continue
              Try  to continue from a previous run. When using this option webcheck will look for a webcheck.dat
              in the output directory.  This file is read to restore the state  from  the  previous  run.   This
              allows  webcheck  to  continue  a  previously  interrupted  run.   When  this  option is used, the
              --internal, --external and --yank options will be ignored as  well  as  any  URL  arguments.   The
              --base-only and --avoid-external options should be the same as the previous run.
               Note that this option is experimental and its semantics may change in future releases
              (especially in relation to other options).  Also note that the stored files are not guaranteed  to
              be compatible between releases.
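
       An interrupted run can usually be resumed by pointing webcheck at the same output directory (the
       directory name is hypothetical; -b is repeated because it must match the previous run):

```shell
webcheck -b -o report http://www.example.com/   # first run, interrupted (e.g. with Ctrl-C)
webcheck -b -o report -c                        # resumes from report/webcheck.dat
```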

       -f, --force
              Overwrite files without asking.  This option is required for running webcheck non-interactively.

       -r, --redirects=N
               Redirect depth: the number of redirects webcheck should follow when following a link.  A
               value of 0 means that all redirects are followed.

       -u, --userpass=URL
              Specify a URL with username and password information to use for basic authentication when visiting
              the site.
              e.g. http://test:secret@example.com/
              This option may be specified multiple times.

       -w, --wait=SECONDS
               Wait SECONDS between document retrievals.  Usually webcheck will process a URL and
               immediately move on to the next.  However, on heavily loaded systems it may be desirable to
               have webcheck pause between requests.  This option can be set to any non-negative number.
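
       For example, to pause two seconds between requests (the site address is hypothetical):

```shell
# Crawl politely, waiting 2 seconds between document retrievals.
webcheck -w 2 http://www.example.com/
```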

       -v, --version
              Show version of program.

       -h, --help
              Show short summary of options.

URL CLASSES

       URLs are divided into two classes:

       Internal  URLs  are  retrieved and the retrieved item is checked for syntax.  Also, the retrieved item is
       searched for links to other items (of any class) and these links are followed.

       External URLs are only retrieved to test whether they are valid and to gather some basic information from
       them (title, size, content-type, etc).  The retrieved items are not inspected for links to other items.

       Apart  from  their  class,  URLs  can  also  be  considered  yanked  (as  specified  with  the  --yank or
       --avoid-external options).  The URLs can be either internal or external and  will  not  be  retrieved  or
       checked at all.  URLs of unsupported schemes are also considered yanked.

EXAMPLES

       Check the site www.example.com but consider any path with "/webcheck" in it to be external.
           webcheck http://www.example.com/ -x /webcheck
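
       A non-interactive run, e.g. from cron, might combine several of the options above (the paths and
       patterns are hypothetical):

```shell
# Quietly crawl the site, skip printer-friendly pages entirely, overwrite
# any previous report without prompting, and write the output to a
# web-served directory.
webcheck -q -f -y '/print/' -o /var/www/webcheck http://www.example.com/
```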

NOTES

       When checking internal URLs webcheck honors the robots.txt file, identifying itself as user-agent
       webcheck.  Disallowed links will not be checked at all, just as if the -y option had been specified
       for those URLs.  To allow webcheck to crawl parts of a site that other robots are disallowed from
       crawling, use something like:
           User-agent: *
           Disallow: /foo

           User-agent: webcheck
           Allow: /foo

ENVIRONMENT

       <scheme>_proxy
               Proxy URL for <scheme>.
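
       For example, to route HTTP retrievals through a caching proxy (the proxy address is hypothetical):

```shell
# webcheck reads the conventional <scheme>_proxy environment variables.
http_proxy='http://proxy.example.com:3128/'
export http_proxy
echo "$http_proxy"
```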

REPORTING BUGS

       Bug reports should be sent to the mailing list <webcheck-users@lists.arthurdejong.org>.  More information
       on reporting bugs can be found on the webcheck homepage:
       http://arthurdejong.org/webcheck/

COPYRIGHT

       Copyright © 1998, 1999 Albert Hopkins (marduk)
       Copyright © 2002 Mike W. Meyer
       Copyright © 2005, 2006, 2007, 2008, 2009, 2010 Arthur de Jong
       webcheck is free software; see the source for copying conditions.  There is NO  warranty;  not  even  for
       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
       The  files  produced  as  output  from  the software do not automatically fall under the copyright of the
       software, unless explicitly stated otherwise.