Provided by: linkchecker_8.6-2_amd64

NAME

       linkchecker - command line client to check HTML documents and websites for broken links

SYNOPSIS

       linkchecker [options] [file-or-url]...

DESCRIPTION

       LinkChecker features

       •      recursive and multithreaded checking,

       •      output  in  colored  or  normal  text,  HTML,  SQL,  CSV, XML or a sitemap graph in
              different formats,

       •      support for HTTP/1.1, HTTPS, FTP, mailto:, news:,  nntp:,  Telnet  and  local  file
              links,

       •      restriction of link checking with URL filters,

       •      proxy support,

       •      username/password authorization for HTTP, FTP and Telnet,

       •      support for robots.txt exclusion protocol,

       •      support for cookies,

       •      support for HTML5,

       •      HTML and CSS syntax check,

       •      antivirus check,

       •      a command line, GUI and web interface.

EXAMPLES

       The  most common use checks the given domain recursively, plus any URL pointing outside of
       the domain:
         linkchecker http://www.example.net/
       Beware that this checks the whole site which can have  thousands  of  URLs.   Use  the  -r
       option to restrict the recursion depth.
       Don't check mailto: URLs. All other links are checked as usual:
         linkchecker --ignore-url=^mailto: mysite.example.org
       Checking a local HTML file on Unix:
         linkchecker ../bla.html
       Checking a local HTML file on Windows:
         linkchecker c:\temp\test.html
       You can skip the http:// url part if the domain starts with www.:
         linkchecker www.example.com
       You can skip the ftp:// url part if the domain starts with ftp.:
         linkchecker -r0 ftp.example.org
       Generate a sitemap graph and convert it with the graphviz dot utility:
         linkchecker -odot -v www.example.com | dot -Tps > sitemap.ps

OPTIONS

   General options
       -fFILENAME, --config=FILENAME
              Use    FILENAME    as    configuration    file.   As   default   LinkChecker   uses
              ~/.linkchecker/linkcheckerrc.

       -h, --help
              Help me! Print usage information for this program.

       --stdin
              Read list of white-space separated URLs to check from stdin.

       -tNUMBER, --threads=NUMBER
              Generate no more than the given number of threads. Default  number  of  threads  is
              100. To disable threading specify a non-positive number.

       -V, --version
              Print version and exit.

   Output options
       --check-css
              Check syntax of CSS URLs with the W3C online validator.

       --check-html
              Check syntax of HTML URLs with the W3C online validator.

       --complete
              Log all URLs, including duplicates. Default is to log duplicate URLs only once.

       -DSTRING, --debug=STRING
              Print  debugging  output  for  the  given  logger.   Available loggers are cmdline,
              checking, cache, gui, dns and all.  Specifying all is an alias for  specifying  all
              available  loggers.  The option can be given multiple times to debug with more than
              one logger.   For accurate results, threading will be disabled during debug runs.

       -FTYPE[/ENCODING][/FILENAME], --file-output=TYPE[/ENCODING][/FILENAME]
              Output to a file linkchecker-out.TYPE, $HOME/.linkchecker/blacklist  for  blacklist
              output,  or FILENAME if specified.  The ENCODING specifies the output encoding, the
              default   is   that   of   your   locale.    Valid   encodings   are   listed    at
              http://docs.python.org/library/codecs.html#standard-encodings.
              For the none output type the FILENAME and ENCODING parts are ignored. For all other
              types an existing file will be overwritten.  You can specify this option  more
              than  once.  Valid  file  output  types  are  text,  html, sql, csv, gml, dot, xml,
              sitemap, none or blacklist.  Default is no file output. The  various  output  types
              are documented below. Note that you can suppress all console output with the option
              -o none.
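
       As an illustration of the TYPE[/ENCODING][/FILENAME] syntax, the following Python sketch
       splits such a value into its parts. The function parse_file_output is hypothetical and
       not part of LinkChecker; it assumes that the second field is always the encoding and
       that a FILENAME may itself contain slashes.

```python
def parse_file_output(spec):
    """Split TYPE[/ENCODING][/FILENAME] into (type, encoding, filename).

    Hypothetical helper for illustration; not LinkChecker's own parser.
    """
    # Split at most twice so that a FILENAME containing slashes stays intact.
    parts = spec.split("/", 2)
    output_type = parts[0]
    if output_type == "none":
        # The ENCODING and FILENAME parts of the none type are ignored.
        return output_type, None, None
    encoding = parts[1] if len(parts) > 1 and parts[1] else None
    filename = parts[2] if len(parts) > 2 and parts[2] else None
    if filename is None:
        # Defaults as described in the option text above.
        if output_type == "blacklist":
            filename = "~/.linkchecker/blacklist"
        else:
            filename = "linkchecker-out." + output_type
    return output_type, encoding, filename
```

       An encoding of None stands for the locale default.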

       --no-status
              Do not print check status messages.

       --no-warnings
              Don't log warnings. Default is to log warnings.

       -oTYPE[/ENCODING], --output=TYPE[/ENCODING]
              Specify output type as text, html, sql,  csv,  gml,  dot,  xml,  sitemap,  none  or
              blacklist.  Default type is text. The various output types are documented below.
              The  ENCODING  specifies  the  output encoding, the default is that of your locale.
              Valid encodings are listed at  http://docs.python.org/library/codecs.html#standard-
              encodings.

       -q, --quiet
              Quiet operation, an alias for -o none.  This is only useful with -F.

       --scan-virus
              Scan content of URLs for viruses with ClamAV.

       --trace
              Print tracing information.

       -v, --verbose
              Log all checked URLs. Default is to log only errors and warnings.

       -WREGEX, --warning-regex=REGEX
              Define a regular expression which prints a warning if it matches any content of the
              checked link.  This applies only to valid pages, since their content must be retrievable.
              Use this to check for pages that contain some form of error, for example "This page
              has moved" or "Oracle Application error".
              Note  that  multiple  values can be combined in the regular expression, for example
              "(This page has moved|Oracle Application error)".
              See section REGULAR EXPRESSIONS for more info.
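
              To preview what such an expression would catch, it can be tried with Python's re
              module. This is only a sketch: the function content_warning is hypothetical, and
              it assumes the expression is searched anywhere in the page content.

```python
import re

# Combined pattern taken from the option text above.
WARNING_REGEX = re.compile(r"(This page has moved|Oracle Application error)")


def content_warning(content):
    """Return the matched text if the content would trigger a warning, else None."""
    match = WARNING_REGEX.search(content)
    return match.group(0) if match else None
```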

       --warning-size-bytes=NUMBER
              Print a warning if content size info is available and exceeds the given  number  of
              bytes.

   Checking options
       -a, --anchors
              Check HTTP anchor references. Default is not to check anchors.  This option enables
              logging of the warning url-anchor-not-found.

       -C, --cookies
              Accept and send HTTP cookies according to RFC 2109. Only  cookies  which  are  sent
              back  to  the  originating  server  are  accepted.   Sent  and accepted cookies are
              provided as additional logging information.

       --cookiefile=FILENAME
              Read a file with initial cookie data. The cookie data format is explained below.

       --ignore-url=REGEX
              URLs matching the given regular expression will be ignored and not checked.
              This option can be given multiple times.
              See section REGULAR EXPRESSIONS for more info.

       -NSTRING, --nntp-server=STRING
              Specify an NNTP server  for  news:  links.  Default  is  the  environment  variable
              NNTP_SERVER. If no host is given, only the syntax of the link is checked.

       --no-follow-url=REGEX
              Check but do not recurse into URLs matching the given regular expression.
              This option can be given multiple times.
              See section REGULAR EXPRESSIONS for more info.

       -p, --password
              Read  a  password  from console and use it for HTTP and FTP authorization.  For FTP
              the default password is anonymous@. For HTTP there is no default password. See also
              -u.

       -PNUMBER, --pause=NUMBER
              Pause the given number of seconds between two subsequent connection requests to the
              same host. Default is no pause between requests.

       -rNUMBER, --recursion-level=NUMBER
              Recursively check all links up to the given depth.  A negative depth enables
              infinite recursion.  Default depth is infinite.

       --timeout=NUMBER
              Set  the  timeout  for  connection  attempts  in seconds. The default timeout is 60
              seconds.

       -uSTRING, --user=STRING
              Try the given username for  HTTP  and  FTP  authorization.   For  FTP  the  default
              username is anonymous. For HTTP there is no default username. See also -p.

       --user-agent=STRING
              Specify   the   User-Agent   string  to  send  to  the  HTTP  server,  for  example
              "Mozilla/4.0". The default is "LinkChecker/X.Y" where X.Y is the current version of
              LinkChecker.

CONFIGURATION FILES

       Configuration files can specify all options above. They can also specify some options that
       cannot be set on the command line.  See linkcheckerrc(5) for more info.

OUTPUT TYPES

       Note that by default only errors and warnings are logged.  You should  use  the  --verbose
       option to get the complete URL list, especially when outputting a sitemap graph format.

       text   Standard text logger, logging URLs in keyword: argument fashion.

       html   Log  URLs  in keyword: argument fashion, formatted as HTML.  Additionally has links
              to the referenced pages.  Invalid  URLs  have  HTML  and  CSS  syntax  check  links
              appended.

       csv    Log check result in CSV format with one URL per line.

       gml    Log parent-child relations between linked URLs as a GML sitemap graph.

       dot    Log parent-child relations between linked URLs as a DOT sitemap graph.

       gxml   Log check result as a GraphXML sitemap graph.

       xml    Log check result as machine-readable XML.

       sitemap
              Log   check   result   as   an   XML   sitemap  whose  protocol  is  documented  at
              http://www.sitemaps.org/protocol.html.

       sql    Log check result as SQL script with INSERT commands. An example  script  to  create
              the initial SQL table is included as create.sql.

       blacklist
              Suitable  for cron jobs. Logs the check result into a file ~/.linkchecker/blacklist
              which only contains entries with invalid URLs and the number  of  times  they  have
              failed.

       none   Logs nothing. Suitable for debugging or checking the exit code.

REGULAR EXPRESSIONS

       LinkChecker    accepts    Python   regular   expressions.    See   http://docs.python.org/
       howto/regex.html for an introduction.

       An addition is that a leading exclamation mark negates the regular expression.
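
       The negation rule can be pictured with the following Python sketch. The helper
       url_matches is hypothetical and assumes the expression is searched within the URL; it
       is not LinkChecker's actual implementation.

```python
import re


def url_matches(pattern, url):
    """Apply a pattern to a URL, honoring a leading '!' as negation.

    Hypothetical helper for illustration only.
    """
    negate = pattern.startswith("!")
    if negate:
        # Strip the exclamation mark; the rest is an ordinary Python regex.
        pattern = pattern[1:]
    found = re.search(pattern, url) is not None
    # With negation the result is inverted.
    return found != negate
```

       With this reading, "!^mailto:" matches every URL that does not start with mailto:.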

COOKIE FILES

       A cookie file contains standard HTTP header (RFC 2616) data with  the  following  possible
       names:

       Scheme (optional)
              Sets the scheme the cookies are valid for; default scheme is http.

       Host (required)
              Sets the domain the cookies are valid for.

       Path (optional)
              Gives the path the cookies are valid for; default path is /.

       Set-cookie (optional)
              Set cookie name/value. Can be given more than once.

       Multiple  entries  are separated by a blank line.  The example below will send two cookies
       to all URLs starting with http://example.com/hello/ and one  to  all  URLs  starting  with
       https://example.org/:

        Host: example.com
        Path: /hello
        Set-cookie: ID="smee"
        Set-cookie: spam="egg"

        Scheme: https
        Host: example.org
        Set-cookie: baggage="elitist"; comment="hologram"
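
       To make the format concrete, the following Python sketch reads the example above into
       a list of entries and applies the documented defaults. The function parse_cookie_file
       and the dictionary layout are assumptions for illustration; LinkChecker's own parser
       may differ.

```python
import re

EXAMPLE = """\
Host: example.com
Path: /hello
Set-cookie: ID="smee"
Set-cookie: spam="egg"

Scheme: https
Host: example.org
Set-cookie: baggage="elitist"; comment="hologram"
"""


def parse_cookie_file(text):
    """Parse cookie-file text into a list of entry dicts (illustrative only)."""
    entries = []
    # Entries are separated by a blank line.
    for block in re.split(r"\n\s*\n", text.strip()):
        # Documented defaults: scheme http, path /.
        entry = {"scheme": "http", "host": None, "path": "/", "cookies": []}
        for line in block.splitlines():
            name, _, value = line.partition(":")
            name, value = name.strip().lower(), value.strip()
            if name == "set-cookie":
                # Set-cookie can be given more than once.
                entry["cookies"].append(value)
            elif name in ("scheme", "host", "path"):
                entry[name] = value
        if entry["host"] is None:
            raise ValueError("cookie entry without required Host header")
        entries.append(entry)
    return entries
```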

PROXY SUPPORT

       To use a proxy on Unix or Windows set the $http_proxy, $https_proxy or $ftp_proxy
       environment variables to the proxy URL. The URL should be of the form
       http://[user:pass@]host[:port].  LinkChecker also detects manual proxy settings of
       Internet Explorer under Windows systems. On a Mac use the Internet Config to select a
       proxy.  You can also set a comma-separated domain list in the $no_proxy environment
       variable to ignore any proxy settings for these domains.  Setting an HTTP proxy on Unix
       for example looks like this:

         export http_proxy="http://proxy.example.com:8080"

       Proxy authentication is also supported:

         export http_proxy="http://user1:mypass@proxy.example.org:8081"

       Setting a proxy on the Windows command prompt:

         set http_proxy=http://proxy.example.com:8080
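
       The parts of a proxy URL of the form http://[user:pass@]host[:port] can be inspected
       with Python's urllib.parse module, as in this sketch. Both functions are hypothetical
       helpers; in particular, the suffix matching used for the no_proxy list is an
       assumption and not necessarily how LinkChecker evaluates it.

```python
from urllib.parse import urlsplit


def describe_proxy(proxy_url):
    """Split a proxy URL into host, port and optional credentials."""
    parts = urlsplit(proxy_url)
    return {
        "host": parts.hostname,
        "port": parts.port,
        "user": parts.username,
        "password": parts.password,
    }


def bypasses_proxy(host, no_proxy):
    """Return True if host falls under a domain in a comma-separated list.

    Assumption: a domain entry covers the domain itself and its subdomains.
    """
    domains = [d.strip().lstrip(".") for d in no_proxy.split(",") if d.strip()]
    return any(host == d or host.endswith("." + d) for d in domains)
```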

PERFORMED CHECKS

       All  URLs  have  to  pass  a  preliminary syntax test. Minor quoting mistakes will issue a
       warning, all other invalid syntax issues are errors.  After the syntax check  passes,  the
       URL is queued for connection checking. All connection check types are described below.

       HTTP links (http:, https:)
              After connecting to the given HTTP server the given path or query is requested. All
              redirections are followed, and if  user/password  is  given  it  will  be  used  as
              authorization  when necessary.  Permanently moved pages issue a warning.  All final
              HTTP status codes other than 2xx are errors.  HTML page contents  are  checked  for
              recursion.

       Local files (file:)
              A  regular, readable file that can be opened is valid. A readable directory is also
              valid. All other files, for example device files, unreadable or non-existing  files
              are errors.  HTML or other parseable file contents are checked for recursion.

       Mail links (mailto:)
              A  mailto:  link  eventually resolves to a list of email addresses.  If one address
              fails, the whole list will fail.  For each mail  address  we  check  the  following
              things:
                1) Check the address syntax, both of the part before and after
                   the @ sign.
                2) Look up the MX DNS records. If we find no MX record,
                   print an error.
                3) Check if one of the mail hosts accepts an SMTP connection.
                   Check hosts with higher priority first.
                   If no host accepts SMTP, we print a warning.
                4) Try to verify the address with the VRFY command. If we get
                   an answer, print the verified address as an info.

       FTP links (ftp:)
              For FTP links we do:
                1) connect to the specified host
                2) try to login with the given user and password. The default
                   user is anonymous, the default password is anonymous@.
                3) try to change to the given directory
                4) list the file with the NLST command

       Telnet links (telnet:)
              We try to connect and if user/password are given, login to the
              given telnet server.

       NNTP links (news:, snews:, nntp:)
              We try to connect to the given NNTP server. If a news group or
              article is specified, try to request it from the server.

       Unsupported links (javascript:, etc.)
              An unsupported link will only print a warning. No further checking
              will be made.
              The complete list of recognized, but unsupported links can be found
              in the linkcheck/checker/unknownurl.py source file.
              The most prominent of them should be JavaScript links.

RECURSION

       Before  descending  recursively into a URL, it has to fulfill several conditions. They are
       checked in this order:

       1. A URL must be valid.

       2. A URL must be parseable. This currently includes HTML files,
          Opera bookmarks files, and directories. If a file type cannot
          be determined (for example it does not have a common HTML file
          extension, and the content does not look like HTML), it is assumed
          to be non-parseable.

       3. The URL content must be retrievable. This is usually the case
          except for example mailto: or unknown URL types.

       4. The maximum recursion level must not be exceeded. It is configured
          with the --recursion-level option and is unlimited by default.

       5. It must not match the ignored URL list. This is controlled with
          the --ignore-url option.

       6. The Robots Exclusion Protocol must allow links in the URL to be
          followed recursively. This is checked by searching for a
          "nofollow" directive in the HTML header data.

       Note that the directory recursion reads all files in that directory,  not  just  a  subset
       like index.htm*.
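
       The order of these conditions can be sketched in Python as follows. The UrlInfo record
       and the may_recurse function are hypothetical names, and the depth comparison assumes
       that the URLs given on the command line have depth 0, so that a maximum level of 0
       disables recursion altogether.

```python
import re
from dataclasses import dataclass


@dataclass
class UrlInfo:
    """Hypothetical record of what is known about a checked URL."""
    url: str
    valid: bool
    parseable: bool
    retrievable: bool
    depth: int
    nofollow: bool = False


def may_recurse(info, max_depth=-1, ignore_patterns=()):
    """Apply the six recursion conditions in the documented order."""
    if not info.valid:            # 1. URL must be valid
        return False
    if not info.parseable:        # 2. URL must be parseable
        return False
    if not info.retrievable:      # 3. content must be retrievable
        return False
    # 4. a negative maximum depth means unlimited recursion
    if max_depth >= 0 and info.depth >= max_depth:
        return False
    # 5. URL must not match the ignored URL list
    if any(re.search(p, info.url) for p in ignore_patterns):
        return False
    if info.nofollow:             # 6. no "nofollow" directive
        return False
    return True
```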

NOTES

       URLs on the command line starting with ftp. are treated like ftp://ftp., URLs starting with
       www. are treated like http://www..  You can also give local files as arguments.
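
       The prefix rule above amounts to the following Python sketch. The function guess_url
       is a hypothetical name used for illustration only.

```python
def guess_url(argument):
    """Expand a command line argument according to the www./ftp. prefix rule."""
    if argument.startswith("www."):
        return "http://" + argument
    if argument.startswith("ftp."):
        return "ftp://" + argument
    # Anything else, such as a local file name or a full URL, is left as given.
    return argument
```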

       If you have your system configured to automatically establish a connection to the internet
       (e.g.  with  diald),  it will connect when checking links not pointing to your local host.
       Use the --ignore-url option to prevent this.

       JavaScript links are not supported.

       If your platform does not support threading, LinkChecker disables it automatically.

       You can supply multiple user/password pairs in a configuration file.

       When checking news: links the given NNTP host doesn't need to be the same as the  host  of
       the user browsing your pages.

ENVIRONMENT

       NNTP_SERVER - specifies default NNTP server
       http_proxy - specifies default HTTP proxy server
       ftp_proxy - specifies default FTP proxy server
       no_proxy - comma-separated list of domains to not contact over a proxy server
       LC_MESSAGES, LANG, LANGUAGE - specify output language

RETURN VALUE

       The return value is 2 when

       •      a program error occurred.

       The return value is 1 when

       •      invalid links were found or

       •      link warnings were found and warnings are enabled

       Else the return value is zero.
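
       A calling script can act on the return value, for example from a cron job. The
       following Python sketch assumes that the linkchecker command is on the PATH; the
       function names are hypothetical.

```python
import subprocess


def describe_exit_code(code):
    """Translate a linkchecker return value into a short description."""
    if code == 0:
        return "no invalid links found"
    if code == 1:
        return "invalid links found, or warnings found while warnings are enabled"
    if code == 2:
        return "a program error occurred"
    return "unexpected return value"


def run_linkchecker(url):
    """Run linkchecker on the given URL and return its return value.

    Assumes the linkchecker executable is installed and on the PATH.
    """
    result = subprocess.run(["linkchecker", "--no-status", url])
    return result.returncode
```

       For example, describe_exit_code(run_linkchecker("http://www.example.net/")) yields a
       readable summary of the run.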

LIMITATIONS

       LinkChecker  consumes  memory  for each queued URL to check. With thousands of queued URLs
       the amount of consumed memory can become quite large. This might slow down the program  or
       even the whole system.

FILES

       ~/.linkchecker/linkcheckerrc - default configuration file
       ~/.linkchecker/blacklist - default blacklist logger output filename
       linkchecker-out.TYPE - default logger file output name
       http://docs.python.org/library/codecs.html#standard-encodings - valid output encodings
       http://docs.python.org/howto/regex.html - regular expression documentation

SEE ALSO

       linkcheckerrc(5)

AUTHOR

       Bastian Kleineidam <bastian.kleineidam@web.de>

COPYRIGHT

       Copyright © 2000-2014 Bastian Kleineidam