Provided by: linkchecker_9.4.0-2_amd64
linkchecker - command line client to check HTML documents and websites for broken links
linkchecker [options] [file-or-url]...
LinkChecker features
  · recursive and multithreaded checking,
  · output in colored or normal text, HTML, SQL, CSV, XML or a sitemap graph in different formats,
  · support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet and local file links,
  · restriction of link checking with URL filters,
  · proxy support,
  · username/password authorization for HTTP, FTP and Telnet,
  · support for robots.txt exclusion protocol,
  · support for cookies,
  · support for HTML5,
  · HTML and CSS syntax check,
  · antivirus check,
  · a command line and web interface.
The most common use checks the given domain recursively:
    linkchecker http://www.example.com/
Beware that this checks the whole site, which can have thousands of URLs. Use the -r option to restrict the recursion depth.

Don't check URLs with /secret in their names. All other links are checked as usual:
    linkchecker --ignore-url=/secret mysite.example.com

Checking a local HTML file on Unix:
    linkchecker ../bla.html

Checking a local HTML file on Windows:
    linkchecker c:\temp\test.html

You can skip the http:// URL part if the domain starts with www.:
    linkchecker www.example.com

You can skip the ftp:// URL part if the domain starts with ftp.:
    linkchecker -r0 ftp.example.com

Generate a sitemap graph and convert it with the graphviz dot utility:
    linkchecker -odot -v www.example.com | dot -Tps > sitemap.ps
General options
  -fFILENAME, --config=FILENAME
      Use FILENAME as configuration file. By default LinkChecker uses ~/.linkchecker/linkcheckerrc.
  -h, --help
      Help me! Print usage information for this program.
  --stdin
      Read list of white-space separated URLs to check from stdin.
  -tNUMBER, --threads=NUMBER
      Generate no more than the given number of threads. Default number of threads is 10. To disable threading specify a non-positive number.
  -V, --version
      Print version and exit.
  --list-plugins
      Print available check plugins and exit.

Output options
  -DSTRING, --debug=STRING
      Print debugging output for the given logger. Available loggers are cmdline, checking, cache, dns, plugins and all. Specifying all is an alias for specifying all available loggers. The option can be given multiple times to debug with more than one logger. For accurate results, threading will be disabled during debug runs.
  -FTYPE[/ENCODING][/FILENAME], --file-output=TYPE[/ENCODING][/FILENAME]
      Output to a file linkchecker-out.TYPE, $HOME/.linkchecker/blacklist for blacklist output, or FILENAME if specified. The ENCODING specifies the output encoding; the default is that of your locale. Valid encodings are listed at http://docs.python.org/library/codecs.html#standard-encodings. For the none output type, the FILENAME and ENCODING parts are ignored; for any other type, an existing file is overwritten. You can specify this option more than once. Valid file output types are text, html, sql, csv, gml, dot, xml, sitemap, none or blacklist. Default is no file output. The various output types are documented below. Note that you can suppress all console output with the option -o none.
  --no-status
      Do not print check status messages.
  --no-warnings
      Don't log warnings. Default is to log warnings.
  -oTYPE[/ENCODING], --output=TYPE[/ENCODING]
      Specify output type as text, html, sql, csv, gml, dot, xml, sitemap, none or blacklist. Default type is text. The various output types are documented below.
      The ENCODING specifies the output encoding; the default is that of your locale. Valid encodings are listed at http://docs.python.org/library/codecs.html#standard-encodings.
  -q, --quiet
      Quiet operation, an alias for -o none. This is only useful with -F.
  -v, --verbose
      Log all checked URLs. Default is to log only errors and warnings.
  -WREGEX, --warning-regex=REGEX
      Define a regular expression which prints a warning if it matches any content of the checked link. This applies only to valid pages, since only their content can be retrieved. Use this to check for pages that contain some form of error, for example "This page has moved" or "Oracle Application error". Note that multiple values can be combined in the regular expression, for example "(This page has moved|Oracle Application error)". See section REGULAR EXPRESSIONS for more info.

Checking options
  --cookiefile=FILENAME
      Read a file with initial cookie data. The cookie data format is explained below.
  --check-extern
      Check external URLs.
  --ignore-url=REGEX
      URLs matching the given regular expression will be ignored and not checked. This option can be given multiple times. See section REGULAR EXPRESSIONS for more info.
  -NSTRING, --nntp-server=STRING
      Specify an NNTP server for news: links. Default is the environment variable NNTP_SERVER. If no host is given, only the syntax of the link is checked.
  --no-follow-url=REGEX
      Check but do not recurse into URLs matching the given regular expression. This option can be given multiple times. See section REGULAR EXPRESSIONS for more info.
  -p, --password
      Read a password from console and use it for HTTP and FTP authorization. For FTP the default password is anonymous@. For HTTP there is no default password. See also -u.
  -rNUMBER, --recursion-level=NUMBER
      Check recursively all links up to given depth. A negative depth will enable infinite recursion. Default depth is infinite.
  --timeout=NUMBER
      Set the timeout for connection attempts in seconds. The default timeout is 60 seconds.
  -uSTRING, --user=STRING
      Try the given username for HTTP and FTP authorization. For FTP the default username is anonymous. For HTTP there is no default username. See also -p.
  --user-agent=STRING
      Specify the User-Agent string to send to the HTTP server, for example "Mozilla/4.0". The default is "LinkChecker/X.Y" where X.Y is the current version of LinkChecker.
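The TYPE[/ENCODING][/FILENAME] argument of the -F option can be read as up to three slash-separated fields. The following Python sketch shows one way to interpret such an argument according to the rules stated for -F. It is not LinkChecker's own parser: the function name parse_file_output and the VALID_TYPES set are made up for illustration, and treating an empty ENCODING field as "use the locale default" is an assumption.

```python
import codecs
import os

# Output types listed for the -F option in this manual page.
VALID_TYPES = {"text", "html", "sql", "csv", "gml", "dot",
               "xml", "sitemap", "none", "blacklist"}


def parse_file_output(spec):
    """Split TYPE[/ENCODING][/FILENAME] into (type, encoding, filename).

    Hypothetical helper; encoding None stands for the locale default.
    """
    # Split at most twice so that FILENAME may itself contain slashes.
    parts = spec.split("/", 2)
    out_type = parts[0]
    if out_type not in VALID_TYPES:
        raise ValueError(f"unknown output type: {out_type!r}")
    if out_type == "none":
        # ENCODING and FILENAME are ignored for the none output type.
        return out_type, None, None
    encoding = parts[1] if len(parts) > 1 and parts[1] else None
    if encoding is not None:
        codecs.lookup(encoding)  # raises LookupError for unknown codec names
    if len(parts) > 2:
        filename = parts[2]
    elif out_type == "blacklist":
        filename = os.path.join(os.path.expanduser("~"),
                                ".linkchecker", "blacklist")
    else:
        filename = f"linkchecker-out.{out_type}"
    return out_type, encoding, filename
```

For example, parse_file_output("html/utf-8/report.html") yields the type html, the encoding utf-8 and the file name report.html, while parse_file_output("csv") falls back to linkchecker-out.csv.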
Configuration files can specify all options above. They can also specify some options that cannot be set on the command line. See linkcheckerrc(5) for more info.
Note that by default only errors and warnings are logged. You should use the --verbose option to get the complete URL list, especially when outputting a sitemap graph format.
  text
      Standard text logger, logging URLs in keyword: argument fashion.
  html
      Log URLs in keyword: argument fashion, formatted as HTML. Additionally has links to the referenced pages. Invalid URLs have HTML and CSS syntax check links appended.
  csv
      Log check result in CSV format with one URL per line.
  gml
      Log parent-child relations between linked URLs as a GML sitemap graph.
  dot
      Log parent-child relations between linked URLs as a DOT sitemap graph.
  gxml
      Log check result as a GraphXML sitemap graph.
  xml
      Log check result as machine-readable XML.
  sitemap
      Log check result as an XML sitemap whose protocol is documented at http://www.sitemaps.org/protocol.html.
  sql
      Log check result as SQL script with INSERT commands. An example script to create the initial SQL table is included as create.sql.
  blacklist
      Suitable for cron jobs. Logs the check result into a file ~/.linkchecker/blacklist which only contains entries with invalid URLs and the number of times they have failed.
  none
      Logs nothing. Suitable for debugging or checking the exit code.
LinkChecker accepts Python regular expressions. See http://docs.python.org/howto/regex.html for an introduction. In addition, a leading exclamation mark negates the regular expression.
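The negation rule can be illustrated with Python's re module. This is a minimal sketch, not LinkChecker's code: the helper name url_matches is hypothetical, and the use of re.search (a match anywhere in the URL rather than only at its start) is an assumption.

```python
import re


def url_matches(pattern, url):
    """Return True if url matches pattern; a leading '!' inverts the result.

    Hypothetical helper. Assumes the pattern may match anywhere in the URL.
    """
    negate = pattern.startswith("!")
    if negate:
        pattern = pattern[1:]  # strip the negation marker before compiling
    found = re.search(pattern, url) is not None
    return found != negate
```

With this helper, the pattern /secret matches http://mysite.example.com/secret/a.html, whereas !/secret matches every URL that does not contain /secret.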
A cookie file contains standard HTTP header (RFC 2616) data with the following possible names:
  Host (required)
      Sets the domain the cookies are valid for.
  Path (optional)
      Gives the path the cookies are valid for; default path is /.
  Set-cookie (required)
      Set cookie name/value. Can be given more than once.
Multiple entries are separated by a blank line. The example below will send two cookies to all URLs starting with http://example.com/hello/ and one to all URLs starting with https://example.org/:
    Host: example.com
    Path: /hello
    Set-cookie: ID="smee"
    Set-cookie: spam="egg"

    Host: example.org
    Set-cookie: baggage="elitist"; comment="hologram"
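To make the entry structure concrete, the following Python sketch reads cookie file text into a list of entries. It follows only the rules stated in this section (required Host, optional Path defaulting to /, one or more Set-cookie lines, blank-line separation). The function name parse_cookie_file and the dictionary layout are hypothetical, and LinkChecker's own reader may differ in details such as error handling.

```python
import re


def parse_cookie_file(text):
    """Parse cookie file text into a list of entry dictionaries.

    Hypothetical helper; each entry has 'host', 'path' and 'cookies' keys.
    """
    entries = []
    # Entries are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        entry = {"host": None, "path": "/", "cookies": []}
        for line in block.splitlines():
            if not line.strip():
                continue
            name, _, value = line.partition(":")
            name = name.strip().lower()  # header names are case-insensitive
            value = value.strip()
            if name == "host":
                entry["host"] = value
            elif name == "path":
                entry["path"] = value
            elif name == "set-cookie":
                entry["cookies"].append(value)
        if entry["host"] is None or not entry["cookies"]:
            raise ValueError("each entry needs Host and at least one Set-cookie")
        entries.append(entry)
    return entries
```

Applied to the example in this section, the sketch returns two entries: one for example.com with path /hello and two cookies, and one for example.org with the default path / and one cookie.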
To use a proxy on Unix or Windows, set the $http_proxy, $https_proxy or $ftp_proxy environment variables to the proxy URL. The URL should be of the form http://[user:pass@]host[:port]. LinkChecker also detects manual proxy settings of Internet Explorer under Windows systems, and gconf or KDE on Linux systems. On a Mac use the Internet Config to select a proxy. You can also set a comma-separated domain list in the $no_proxy environment variable to ignore any proxy settings for these domains.

Setting an HTTP proxy on Unix, for example, looks like this:
    export http_proxy="http://proxy.example.com:8080"

Proxy authentication is also supported:
    export http_proxy="http://user1:firstname.lastname@example.org:8081"

Setting a proxy on the Windows command prompt:
    set http_proxy=http://proxy.example.com:8080
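The proxy URL form http://[user:pass@]host[:port] can be taken apart with Python's standard urllib.parse module, which is a convenient way to check a value before exporting it. The sketch below is illustrative only: the helper names describe_proxy and bypasses_proxy are hypothetical, the credentials in the usage note are placeholders, and matching $no_proxy entries by exact name or subdomain suffix is an assumption, since this page does not specify the matching rule.

```python
from urllib.parse import urlsplit


def describe_proxy(proxy_url):
    """Split a proxy URL of the form http://[user:pass@]host[:port]."""
    parts = urlsplit(proxy_url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,      # None when no credentials are given
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,          # None when no port is given
    }


def bypasses_proxy(hostname, no_proxy):
    """Return True if hostname is covered by a comma-separated no_proxy list.

    Assumption: an entry covers the domain itself and its subdomains.
    """
    domains = [d.strip().lstrip(".").lower()
               for d in no_proxy.split(",") if d.strip()]
    host = hostname.lower()
    return any(host == d or host.endswith("." + d) for d in domains)
```

For instance, describe_proxy("http://user1:secret@proxy.example.com:8081") reports the user user1, the host proxy.example.com and the port 8081.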
There are two plugin types: connection and content plugins. Connection plugins are run after a successful connection to the URL host. Content plugins are run if the URL type has content (mailto: URLs, for example, have no content) and if the check is not forbidden (i.e. by HTTP robots.txt). See linkchecker --list-plugins for a list of plugins and their documentation. All plugins are enabled via the linkcheckerrc(5) configuration file.
Before descending recursively into a URL, it has to fulfill several conditions. They are checked in this order:
  1. A URL must be valid.
  2. A URL must be parseable. This currently includes HTML files, Opera bookmarks files, and directories. If a file type cannot be determined (for example it does not have a common HTML file extension, and the content does not look like HTML), it is assumed to be non-parseable.
  3. The URL content must be retrievable. This is usually the case except for example mailto: or unknown URL types.
  4. The maximum recursion level must not be exceeded. It is configured with the --recursion-level option and is unlimited per default.
  5. It must not match the ignored URL list. This is controlled with the --ignore-url option.
  6. The Robots Exclusion Protocol must allow links in the URL to be followed recursively. This is checked by searching for a "nofollow" directive in the HTML header data.
Note that the directory recursion reads all files in that directory, not just a subset like index.htm*.
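The ordered conditions above can be sketched as a chain of early returns, which makes it clear that the first failing condition decides the outcome. Everything in this Python sketch is hypothetical: the UrlInfo class, its field names and the recursion_blocker function are invented for illustration and do not reflect LinkChecker's internals. The depth comparison assumes that the start URL has depth 0 and that a limit of 0 means no recursion at all.

```python
from dataclasses import dataclass


@dataclass
class UrlInfo:
    """Hypothetical summary of what is known about a checked URL."""
    valid: bool
    parseable: bool
    retrievable: bool
    depth: int        # 0 for a URL given on the command line (assumption)
    ignored: bool     # matches an --ignore-url pattern
    nofollow: bool    # HTML header data contains a "nofollow" directive


def recursion_blocker(url, max_depth=-1):
    """Return the first reason not to recurse into url, or None if allowed.

    A negative max_depth means unlimited recursion.
    """
    if not url.valid:
        return "invalid URL"
    if not url.parseable:
        return "not parseable"
    if not url.retrievable:
        return "content not retrievable"
    if max_depth >= 0 and url.depth >= max_depth:
        return "recursion level exceeded"
    if url.ignored:
        return "matches ignored URL list"
    if url.nofollow:
        return "nofollow directive"
    return None
```

Because the checks run in the documented order, a URL that is both invalid and ignored is reported as invalid, not as ignored.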
NNTP_SERVER - specifies default NNTP server
http_proxy - specifies default HTTP proxy server
ftp_proxy - specifies default FTP proxy server
no_proxy - comma-separated list of domains to not contact over a proxy server
LC_MESSAGES, LANG, LANGUAGE - specify output language
The return value is 2 when
  · a program error occurred.
The return value is 1 when
  · invalid links were found or
  · link warnings were found and warnings are enabled.
Else the return value is zero.
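A script that calls linkchecker, for example from a cron job, can branch on these return values. The following Python sketch wraps the call with the standard subprocess module. The function names describe_exit and run_linkchecker are hypothetical, and the sketch assumes that a linkchecker executable is available on the PATH.

```python
import subprocess


def describe_exit(code):
    """Translate a linkchecker return value into a short description."""
    if code == 0:
        return "ok: no invalid links and no reportable warnings"
    if code == 1:
        return "problems: invalid links, or warnings while warnings are enabled"
    if code == 2:
        return "program error"
    return f"unexpected return value {code}"


def run_linkchecker(url):
    """Run linkchecker on url and return (return value, description).

    Assumes the linkchecker command is installed and on the PATH.
    """
    result = subprocess.run(
        ["linkchecker", "--no-status", url],
        capture_output=True,
        text=True,
    )
    return result.returncode, describe_exit(result.returncode)
```

A caller would typically treat a return value of 1 as "the site needs attention" and 2 as "the check itself failed", and handle the two cases differently.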
LinkChecker consumes memory for each queued URL to check. With thousands of queued URLs the amount of consumed memory can become quite large. This might slow down the program or even the whole system.
~/.linkchecker/linkcheckerrc - default configuration file
~/.linkchecker/blacklist - default blacklist logger output filename
linkchecker-out.TYPE - default logger file output name
http://docs.python.org/library/codecs.html#standard-encodings - valid output encodings
http://docs.python.org/howto/regex.html - regular expression documentation
Bastian Kleineidam <email@example.com>
Copyright © 2000-2014 Bastian Kleineidam