Ubuntu Manpage: urlwatch-cookbook - Advanced topics and recipes for urlwatch

NAME

       urlwatch-cookbook - Advanced topics and recipes for urlwatch

ADDING URLS FROM THE COMMAND LINE

       To quickly add a new URL to the job list from the command line, use the --add option:

          urlwatch --add url=http://example.org,name=Example

USING WORD-BASED DIFFERENCES

       You can specify an external diff-style tool (a tool that takes two filenames (old, new) as
       parameters and writes the difference between the files to its standard output). For
       example, use wdiff(1) to get word-based differences instead of line-based differences:

          url: https://example.com/
          diff_tool: wdiff

       Note  that  diff_tool  specifies  an  external  command-line  tool,  so  that tool must be
       installed separately (e.g. apt install wdiff on Debian or brew install  wdiff  on  macOS).
       Coloring is supported for wdiff-style output, but potentially not for other diff tools.
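
       To illustrate what "word-based" means here, the following is a minimal Python sketch that
       marks changed words in the style of wdiff's default output. It is not how wdiff itself is
       implemented; the word_diff helper and the sample sentences are made up for illustration.

```python
import difflib

old = "The quick brown fox jumps over the lazy dog"
new = "The quick red fox jumps over the lazy dog"


def word_diff(old_text, new_text):
    """Return a wdiff-like string marking removed and added words."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(old_words[i1:i2])
            continue
        if i2 > i1:  # words only present in the old text
            out.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if j2 > j1:  # words only present in the new text
            out.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(out)


result = word_diff(old, new)
print(result)  # The quick [-brown-] {+red+} fox jumps over the lazy dog
```

       A line-based diff of the same input would report the whole sentence as removed and added
       again, which is why word-based output can be easier to read for long lines.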

IGNORING CONNECTION ERRORS

       In  some  cases,  it  might  be  useful  to  ignore  (temporary)  network  errors to avoid
       notifications being sent. While there is a  display.error  config  option  (defaulting  to
       true)  to control reporting of errors globally, to ignore network errors for specific jobs
       only, you can use the ignore_connection_errors key in the job list configuration file:

          url: https://example.com/
          ignore_connection_errors: true

       Similarly, you might want to ignore some (temporary) HTTP errors on the server side:

          url: https://example.com/
          ignore_http_error_codes: 408, 429, 500, 502, 503, 504

       or ignore all HTTP errors if you like:

          url: https://example.com/
          ignore_http_error_codes: 4xx, 5xx
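
       The intended meaning of such a list can be sketched in a few lines of Python. The
       is_ignored helper below is hypothetical and only models the matching rule suggested by
       the examples above (exact codes plus 4xx/5xx class patterns); it is not taken from the
       urlwatch source.

```python
def is_ignored(status, spec):
    """Return True if an HTTP status code matches a comma-separated spec.

    Each entry is either an exact code ("503") or a class pattern ("5xx").
    """
    code = str(status)
    patterns = [p.strip().lower() for p in str(spec).split(",")]
    return any(
        p == code or (len(p) == 3 and p.endswith("xx") and p[0] == code[0])
        for p in patterns
    )


print(is_ignored(503, "408, 429, 500, 502, 503, 504"))  # True
print(is_ignored(404, "408, 429, 500, 502, 503, 504"))  # False
print(is_ignored(404, "4xx, 5xx"))                      # True
```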

OVERRIDING THE CONTENT ENCODING

       For web pages with misconfigured HTTP headers or rare  encodings,  it  may  be  useful  to
       explicitly     specify     an     encoding     from     Python’s     Standard    Encodings
       <https://docs.python.org/3/library/codecs.html#standard-encodings>.

          url: https://example.com/
          encoding: utf-8
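
       If you are unsure whether a name is one of Python's standard encodings, a quick check
       with the codecs module can help before putting it into the job list. This is a small
       sketch; the is_known_encoding helper is made up for illustration.

```python
import codecs


def is_known_encoding(name):
    """Return True if Python can resolve the given encoding name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


print(is_known_encoding("utf-8"))        # True
print(is_known_encoding("iso-8859-15"))  # True
print(is_known_encoding("not-a-codec"))  # False
```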

CHANGING THE DEFAULT TIMEOUT

       By default, url jobs time out after 60 seconds. If you want a different timeout period,
       use the timeout key to specify it in seconds, or set it to 0 to never time out.

          url: https://example.com/
          timeout: 300

SUPPLYING COOKIE DATA

       It is possible to add cookies to HTTP requests for pages that need them. The YAML syntax
       for this is:

          url: http://example.com/
          cookies:
              Key: ValueForKey
              OtherKey: OtherValue
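
       As a rough illustration, a key/value mapping like the one above corresponds to a single
       Cookie request header of name=value pairs. The Python sketch below only builds that
       header string by hand; it makes no claim about how urlwatch assembles the request
       internally.

```python
# The same mapping as in the YAML example above.
cookies = {"Key": "ValueForKey", "OtherKey": "OtherValue"}

# Cookie pairs are joined with "; " in the request header.
header = "; ".join(f"{name}={value}" for name, value in cookies.items())

print("Cookie: " + header)  # Cookie: Key=ValueForKey; OtherKey=OtherValue
```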

COMPARING WITH SEVERAL LATEST SNAPSHOTS

       If a webpage frequently changes between several known stable states, it may  be  desirable
       to have changes reported only if the webpage changes into a new unknown state. You can use
       compared_versions to do this.

          url: https://example.com/
          compared_versions: 3

       In this example, changes are only reported if  the  webpage  becomes  different  from  the
       latest three distinct states. The differences are shown relative to the closest match.
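
       The behavior described above can be sketched in Python as follows. This is a simplified
       model under stated assumptions, not the urlwatch implementation: the check helper, the
       newest-first history list and the use of a similarity ratio to pick the closest match
       are all illustrative choices.

```python
import difflib


def check(new, history, compared_versions):
    """Return a diff if `new` is an unknown state, else None.

    history is a newest-first list of previously stored snapshots.
    """
    distinct = []
    for snapshot in history:
        if snapshot not in distinct:
            distinct.append(snapshot)
        if len(distinct) == compared_versions:
            break
    if not distinct or new in distinct:
        return None  # nothing to compare with, or a known state
    # Diff against the most similar of the known states.
    closest = max(
        distinct,
        key=lambda s: difflib.SequenceMatcher(None, s, new).ratio(),
    )
    return list(
        difflib.unified_diff(closest.splitlines(), new.splitlines(), lineterm="")
    )


history = ["A\nB", "A\nC", "A\nB", "X\nY"]
print(check("A\nC", history, 3))  # None: the page flipped back to a known state
print(check("A\nD", history, 3))  # a diff against the closest known state
```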

RECEIVING A REPORT EVERY TIME URLWATCH RUNS

       If you are watching pages that rarely change, but you still want to be notified daily
       that urlwatch still works, you can watch the output of the date command, for example:

          name: "urlwatch watchdog"
          command: "date"

       Since the output of date changes every second, this job should produce a report every time
       urlwatch is run.

USING REDIS AS A CACHE BACKEND

       If you want to use Redis as a cache backend instead of the default SQLite3 file:

          urlwatch --cache=redis://localhost:6379/

       There is no migration path from the SQLite3 format, so the cache will be empty the first
       time Redis is used.

WATCHING CHANGES ON .ONION (TOR) PAGES

       Since pages on the Tor Network <https://www.torproject.org> are not accessible via  public
       DNS  and  TCP,  you  need  to either configure a Tor client as HTTP/HTTPS proxy or use the
       torify(1) tool from the tor package (apt install  tor  on  Debian,  brew  install  tor  on
       macOS).  Setting  up  Tor  is  out  of  scope  for this document. On a properly set up Tor
       installation, one can just prefix the urlwatch command with the torify wrapper  to  access
       .onion pages:

          torify urlwatch

WATCHING FACEBOOK PAGE EVENTS

       If you want to be notified of new events on a public Facebook page, you can use the
       following job pattern. Replace PAGE with the name of the page (it can be found by
       navigating to the events page in your browser):

          url: http://m.facebook.com/PAGE/pages/permalink/?view_type=tab_events
          filter:
            - css:
                selector: div#objects_container
                exclude: 'div.x, #m_more_friends_who_like_this, img'
            - re.sub:
                pattern: '(/events/\d*)[^"]*'
                repl: '\1'
            - html2text: pyhtml2text
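
       The re.sub step strips everything after the numeric event ID from event links, so that
       changing tracking parameters do not show up as changes. Its effect can be tried out in
       Python; the HTML snippet below is invented for illustration and is not real Facebook
       markup.

```python
import re

# Hypothetical link with query parameters that may differ on every request.
html = '<a href="/events/1234567890?acontext=abc&ref=page">Concert</a>'

# Same pattern and replacement as in the job definition above.
cleaned = re.sub(r'(/events/\d*)[^"]*', r'\1', html)

print(cleaned)  # <a href="/events/1234567890">Concert</a>
```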

ONLY SHOW ADDED OR REMOVED LINES

       The  diff_filter  feature  can  be used to filter the diff output text with the same tools
       (see filters) used for filtering web pages.

       In order to show only added lines, use:

          url: http://example.com/things-get-added.html
          diff_filter:
            - grep: '^[@+]'

       This will only keep diff lines starting with @ or  +.  Similarly,  to  only  keep  removed
       lines:

          url: http://example.com/things-get-removed.html
          diff_filter:
            - grep: '^[@-]'

       More sophisticated diff filtering is possible by combining existing filters, writing a new
       filter or using shellpipe to delegate the filtering/processing of the diff output to an
       external tool.
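
       To see which lines these two patterns keep, the following Python sketch builds a small
       unified diff with difflib and applies the same regular expressions line by line. The
       sample data is made up, and the list comprehensions only mimic what a grep-style filter
       does.

```python
import difflib
import re

old = ["apple", "banana", "cherry"]
new = ["apple", "blueberry", "cherry", "date"]
diff = list(difflib.unified_diff(old, new, "old", "new", lineterm=""))

# Keep only lines matching the pattern, like the grep filter does.
added = [line for line in diff if re.search(r"^[@+]", line)]
removed = [line for line in diff if re.search(r"^[@-]", line)]

print(added)    # ['+++ new', '@@ -1,3 +1,4 @@', '+blueberry', '+date']
print(removed)  # ['--- old', '@@ -1,3 +1,4 @@', '-banana']
```

       Note that the "+++" and "---" header lines also match these patterns and are kept.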

       Read the next section if you want to disable empty notifications.

DISABLE EMPTY NOTIFICATIONS

       As  an extension to the previous example, let's say you want to only get notified with all
       lines added, but receive no notifications at all if lines are removed.

       A diff usually looks like this:

          --- @       Fri, 04 Mar 2022 19:58:14 +0100
          +++ @       Fri, 04 Mar 2022 19:58:22 +0100
          @@ -1,3 +1,3 @@
           someline
          -someotherlines
          +someotherline
           anotherline

       We want to keep only the lines starting with "+", but because the diff header also
       contains a line starting with "+++", that line has to be excluded again. This can be
       accomplished like so:

          url: http://example.com/only-added.html
          diff_filter:
            - grep: '^[+]'      # Include all lines starting with "+"
            - grepi: '^[+]{3}'  # Exclude the line starting with "+++"

       This  deals  with all diff lines now, but since urlwatch reports "changed" pages even when
       the diff_filter returns an empty string (which might be useful in some cases), you have to
       explicitly  opt  out  by using urlwatch --edit-config and setting the empty-diff option to
       false in the display category:

          display:
            empty-diff: false
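
       The following Python sketch mimics the two-step filter on the sample diff above and on a
       diff that only removes a line, to show where the empty result comes from. The
       filter_added helper is hypothetical; it only models grep (keep matching lines) followed
       by grepi (drop matching lines).

```python
import re


def filter_added(diff_lines):
    """Keep lines starting with "+", then drop the "+++" header line."""
    kept = [line for line in diff_lines if re.search(r"^[+]", line)]
    return [line for line in kept if not re.search(r"^[+]{3}", line)]


changed = [
    "--- @       Fri, 04 Mar 2022 19:58:14 +0100",
    "+++ @       Fri, 04 Mar 2022 19:58:22 +0100",
    "@@ -1,3 +1,3 @@",
    " someline",
    "-someotherlines",
    "+someotherline",
    " anotherline",
]
removal_only = [line for line in changed if line != "+someotherline"]

print(filter_added(changed))       # ['+someotherline']
print(filter_added(removal_only))  # []
```

       The second result is empty, which is the case that the empty-diff option set to false
       suppresses.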

PASS DIFF OUTPUT TO A CUSTOM SCRIPT

       In some situations, it might be useful to run a script with the diff as input when changes
       were  detected  (e.g.  to  start  an  update  or  process  something). This can be done by
       combining diff_filter with the shellpipe filter, which can be any custom script.

       The output of the custom script then replaces the diff result reported by urlwatch, so if
       the script outputs any status, the CHANGED notification sent by urlwatch will contain the
       output of the custom script, not the original diff. The job can still have a "normal"
       filter attached, for example to only watch links (the css: a part of the filter
       definitions):

          url: http://example.org/downloadlist.html
          filter:
            - css: a
          diff_filter:
            - shellpipe: /usr/local/bin/process_new_links.sh
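
       A script used this way reads the diff on standard input and writes its report to standard
       output. The source does not show process_new_links.sh, so the following is only a sketch
       of what a comparable script could look like if written in Python. It assumes that the
       filtered page content consists of <a href="..."> elements; the function names and the
       sample diff are invented.

```python
import io
import re


def extract_new_links(diff_text):
    """Return href values found on added diff lines (not the +++ header)."""
    links = []
    for line in diff_text.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            links.extend(re.findall(r'href="([^"]+)"', line))
    return links


def main(stdin, stdout):
    """Read a diff from stdin and write one status line per new link."""
    for link in extract_new_links(stdin.read()):
        stdout.write("New link: " + link + "\n")


# Demonstration with in-memory streams; a real script would call
# main(sys.stdin, sys.stdout) instead.
sample = io.StringIO(
    "--- @\n"
    "+++ @\n"
    "@@ -1,1 +1,2 @@\n"
    ' <a href="/files/a.zip">a.zip</a>\n'
    '+<a href="/files/b.zip">b.zip</a>\n'
)
report = io.StringIO()
main(sample, report)
print(report.getvalue(), end="")  # New link: /files/b.zip
```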

SETTING THE CONTENT WIDTH FOR HTML2TEXT (LYNX METHOD)

       When  using  the  lynx  method  in the html2text filter, it uses a default width that will
       cause additional line breaks to be inserted.

       To set the lynx output width to 400 characters, use this filter setup:

          url: http://example.com/longlines.html
          filter:
            - html2text:
                method: lynx
                width: 400

COMPARING WEB PAGES VISUALLY

       To compare  the  visual  contents  of  web  pages,  Nicolai  has  written  pyvisualcompare
       <https://github.com/nspo/pyvisualcompare>  as  a frontend (with GUI) to urlwatch. The tool
       can be used to select a region of a web  page.  It  then  generates  a  configuration  for
       urlwatch to run pyvisualcompare and generate a hash for the screen contents.

CONFIGURING HOW LONG BROWSER JOBS WAIT FOR PAGES TO LOAD

       For  browser jobs, you can configure how long the headless browser will wait before a page
       is considered loaded by using the wait_until option. It can take one of four values:

          • load will wait until the load browser event is fired (default).

          • documentloaded will wait until the DOMContentLoaded browser event is fired.

          • networkidle0 will wait until there are no more than  0  network  connections  for  at
            least 500 ms.

          • networkidle2  will  wait  until  there  are no more than 2 network connections for at
            least 500 ms.

TREATING NEW JOBS AS CHANGED

       In some cases (e.g. when the diff_tool or diff_filter executes some external command as a
       side effect that should also run for the initial page state), you can set the
       treat_new_as_changed key to true, which will make the job report as CHANGED instead of NEW
       the first time it is retrieved (and the diff will be reported, too).

          url: http://example.com/initialpage.html
          treat_new_as_changed: true

       This  option  will  also  change the behavior of --test-diff-filter, and allow testing the
       diff filter if only a single version of the page has been retrieved.

MONITORING THE SAME URL IN MULTIPLE JOBS

       Because urlwatch uses the url/navigate (for URL/Browser  jobs)  and/or  the  command  (for
       Shell  jobs)  key  as  unique identifier, each URL can only appear in a single job. If you
       want to monitor the same URL multiple times, you can append #1, #2, ... (or anything  that
       makes them unique) to the URLs, like this:

          name: "Looking for Thing A"
          url: http://example.com/#1
          filter:
            - grep: "Thing A"
          ---
          name: "Looking for Thing B"
          url: http://example.com/#2
          filter:
            - grep: "Thing B"
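
       This works because the fragment (the part after #) makes the URL strings differ while
       HTTP clients do not send the fragment to the server, so both jobs fetch the same
       resource. A short Python check using only the standard library illustrates this:

```python
from urllib.parse import urldefrag

urls = ["http://example.com/#1", "http://example.com/#2"]

# The job keys are distinct strings ...
distinct_keys = set(urls)

# ... but without the fragment, both refer to the same resource.
fetched = {urldefrag(url).url for url in urls}

print(len(distinct_keys))  # 2
print(fetched)             # {'http://example.com/'}
```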

RUNNING A SUBSET OF JOBS

       To  run one or more specific jobs instead of all known jobs, provide the job index numbers
       to the urlwatch command. For example, to run jobs with index 2, 4, and 7:

          urlwatch 2 4 7

SENDING HTML FORM DATA USING POST

       To simulate submitting an HTML form using the POST method, you can pass the form fields in
       the data field of the job description:

          name: "My POST Job"
          url: http://example.com/foo
          data:
            username: "foo"
            password: "bar"
            submit: "Send query"

       By default, the request will use the HTTP POST method, and the Content-type will be set to
       application/x-www-form-urlencoded.
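
       For reference, this is what the form fields from the example look like when encoded as
       application/x-www-form-urlencoded. The Python sketch uses the standard library's
       urlencode to show the format; it is not meant to describe urlwatch internals.

```python
from urllib.parse import urlencode

# The same fields as in the job description above.
data = {"username": "foo", "password": "bar", "submit": "Send query"}

body = urlencode(data)

print(body)  # username=foo&password=bar&submit=Send+query
```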

SENDING ARBITRARY DATA USING HTTP PUT

       It is possible to customize the HTTP method and Content-type header, allowing you to  send
       arbitrary requests to the server:

          name: "My PUT Request"
          url: http://example.com/item/new
          method: PUT
          headers:
            Content-type: application/json
          data: '{"foo": true}'
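
       When the payload is more complex than this example, it can be less error-prone to
       generate the JSON string with a tool rather than typing it by hand. A minimal Python
       sketch, using the payload from the example above:

```python
import json

payload = {"foo": True}

# Serialize to the exact string that goes into the data key.
data = json.dumps(payload)

print(data)  # {"foo": true}
```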

COPYRIGHT

       2022 Thomas Perl
