Ubuntu Manpage: urlwatch-filters - Filtering output and diff data of urlwatch jobs

NAME

       urlwatch-filters - Filtering output and diff data of urlwatch jobs

SYNOPSIS

       urlwatch --edit

DESCRIPTION

       Each  job  can have two filter stages configured, with one or more filters processed after
       each other:

       • Applied to the downloaded page before diffing the changes (filter)

       • Applied to the diff result before reporting the changes (diff_filter)

       While creating your filter pipeline, you might want to preview what  the  filtered  output
       looks like. You can do so by first configuring your job and then running urlwatch with the
       --test-filter command, passing in the index (from --list) or the URL/location of  the  job
       to be tested:

          urlwatch --test-filter 1   # Test the first job in the list
          urlwatch --test-filter https://example.net/  # Test the job with the given URL

       The  output  of this command will be the filtered plaintext of the job, this is the output
       that will (in a real urlwatch run) be the input to the diff algorithm.

       The filter is only applied to new content, the old content was already  filtered  when  it
       was  retrieved. This means that changes to filter are not visible when reporting unchanged
       contents (see Display for details), and the diff output will be between (old content  with
       filter at the time old content was retrieved) and (new content with current filter).

       Once  urlwatch  has collected at least 2 historic snapshots of a job (two different states
       of a webpage) you  can  use  the  command-line  option  --test-diff-filter  to  test  your
       diff_filter settings; this will use historic data cached locally.

BUILT-IN FILTERS

       The list of built-in filters can be retrieved using:

          urlwatch --features

       At the moment, the following filters are built-in:

       • beautify: Beautify HTML

       • css: Filter XML/HTML using CSS selectors

       • csv2text: Convert CSV to plaintext

       • element-by-class: Get all HTML elements by class

       • element-by-id: Get an HTML element by its ID

       • element-by-style: Get all HTML elements by style

       • element-by-tag: Get an HTML element by its tag

       • format-json: Convert to formatted json

       • grep: Filter only lines matching a regular expression

       • grepi: Remove lines matching a regular expression

       • hexdump: Convert binary data to hex dump format

       • html2text: Convert HTML to plaintext

       • pdf2text: Convert PDF to plaintext

       • pretty-xml: Pretty-print XML

       • ical2text: Convert iCalendar <https://en.wikipedia.org/wiki/ICalendar> to plaintext

       • ocr: Convert text in images to plaintext using Tesseract OCR

       • re.sub: Replace text with regular expressions using Python's re.sub

       • reverse: Reverse input items

       • sha1sum: Calculate the SHA-1 checksum of the content

       • shellpipe: Filter using a shell command

       • sort: Sort input items

       • remove-duplicate-lines: Remove duplicate lines (case sensitive)

       • strip: Strip leading and trailing whitespace

       • striplines: Strip leading and trailing whitespace in each line

       • xpath: Filter XML/HTML using XPath expressions

       • jq: Filter, transform and extract values from JSON

PICKING OUT ELEMENTS FROM A WEBPAGE

       You  can  pick  only a given HTML element with the built-in filter, for example to extract
       <div id="something">.../<div> from a page, you can use the following in your urls.yaml:

          url: http://example.org/idtest.html
          filter:
            - element-by-id: something

       Also, you can chain filters, so you can run html2text on the result:

          url: http://example.net/id2text.html
          filter:
            - element-by-id: something
            - html2text

CHAINING MULTIPLE FILTERS

       The example urls.yaml file also demonstrates the use of built-in filters, here  3  filters
       are  used:  html2text,  line-grep  and whitespace removal to get just a certain info field
       from a webpage:

          url: https://example.net/version.html
          filter:
            - html2text
            - grep: "Current.*version"
            - strip

EXTRACTING ONLY THE <BODY> TAG OF A PAGE

       If you want to extract only the body tag you can use this filter:

          url: https://example.org/bodytag.html
          filter:
            - element-by-tag: body

FILTERING BASED ON AN XPATH EXPRESSION

       To filter based on an XPath  <https://www.w3.org/TR/1999/REC-xpath-19991116/>  expression,
       you can use the xpath filter like so:

          url: https://example.net/xpath.html
          filter:
            - xpath: /html/body/marquee

       This  filters only the <marquee> elements directly below the <body> element, which in turn
       must be below the <html> element of the document, stripping out everything else.

       See        Microsoft’s        XPath        Examples        <https://msdn.microsoft.com/en-
       us/library/ms256086(v=vs.110).aspx>  page  for  some other examples.  You can also find an
       XPath of an <html> node in the Chromium/Google Chrome developer tools by right clicking on
       the node and selecting copy XPath.

FILTERING BASED ON CSS SELECTORS

       To     filter     based     on    a    CSS    selector    <https://www.w3.org/TR/2011/REC-
       css3-selectors-20110929/>, you can use the css filter like so:

          url: https://example.net/css.html
          filter:
            - css: ul#groceries > li.unchecked

       This would filter only <li class="unchecked">  tags  directly  below  <ul  id="groceries">
       elements.

       Some   limitations   and  extensions  exist  as  explained  in  cssselect’s  documentation
       <https://cssselect.readthedocs.io/en/latest/#supported-selectors>.

USING XPATH AND CSS FILTERS WITH XML AND EXCLUSIONS

       By default, XPath and CSS filters are set up for HTML documents.  However, it is  possible
       to  use  them  for XML documents as well (these examples parse an RSS feed and filter only
       the titles and publication dates):

          url: https://example.com/blog/xpath-index.rss
          filter:
            - xpath:
                path: '//item/title/text()|//item/pubDate/text()'
                method: xml

          url: http://example.com/blog/css-index.rss
          filter:
            - css:
                selector: 'item > title, item > pubDate'
                method: xml
            - html2text: re

       To match  an  element  in  an  XML  namespace  <https://www.w3.org/TR/xml-names/>,  use  a
       namespace prefix before the tag name. Use a : to separate the namespace prefix and the tag
       name in an XPath expression, and use a | in a CSS selector.

          url: https://example.net/feed/xpath-namespace.xml
          filter:
            - xpath:
                path: '//item/media:keywords/text()'
                method: xml
                namespaces:
                  media: http://search.yahoo.com/mrss/

          url: http://example.org/feed/css-namespace.xml
          filter:
            - css:
                selector: 'item > media|keywords'
                method: xml
                namespaces:
                  media: http://search.yahoo.com/mrss/
            - html2text

       Alternatively, use the XPath expression //*[name()='<tag_name>'] to bypass  the  namespace
       entirely.

       Another  useful  option  with XPath and CSS filters is exclude.  Elements selected by this
       exclude expression are removed from the final result. For example, the following job  will
       not have any <a> tag in its results:

          url: https://example.org/css-exclude.html
          filter:
            - css:
                selector: body
                exclude: a

LIMITING THE RETURNED ITEMS FROM A CSS SELECTOR OR XPATH

       If  you  only  want  to  return  a subset of the items returned by a CSS selector or XPath
       filter, you can use two additional subfilters:

       • skip: How many elements to skip from the beginning (default: 0)

       • maxitems: How many elements to return at most (default: no limit)

       For example, if the page has multiple elements, but you only want to select the second and
       third matching element (skip the first, and return at most two elements), you can use this
       filter:

          url: https://example.net/css-skip-maxitems.html
          filter:
            - css:
                selector: div.cpu
                skip: 1
                maxitems: 2

   Dealing with duplicated results
       If you get multiple results on one page, but you only expected one (e.g. because the  page
       contains  both a mobile and desktop version in the same HTML document, and shows/hides one
       via CSS depending on the viewport size), you can use maxitems: 1 to only return the  first
       item.

FIXING LIST REORDERINGS WITH CSS SELECTOR OR XPATH FILTERS

       In  some  cases,  the  ordering  of  items on a webpage might change regularly without the
       actual content changing. By default, this would show up in the diff output as  an  element
       being removed from one part of the page and inserted in another part of the page.

       In  cases  where  the  order  of items doesn't matter, it's possible to sort matched items
       lexicographically to avoid spurious reports when only the ordering of items changes on the
       page.

       The subfilter for css and xpath filters is sort, and can be true or false (the default):

          url: https://example.org/items-random-order.html
          filter:
            - css:
                selector: span.item
                sort: true

FILTERING PDF DOCUMENTS

       To  monitor  the  text  of  a  PDF  file,  you  use  the  pdf2text filter. It requires the
       installation                     of                     the                      pdftotext
       <https://github.com/jalan/pdftotext/blob/master/README.md#pdftotext>  library  and  any of
       its OS-specific dependencies <https://github.com/jalan/pdftotext/blob/master/README.md#os-
       dependencies>.

       This  filter must be the first filter in a chain of filters, since it consumes binary data
       and outputs text data.

          url: https://example.net/pdf-test.pdf
          filter:
            - pdf2text
            - strip

       If the PDF file is password protected, you can specify its password:

          url: https://example.net/pdf-test-password.pdf
          filter:
            - pdf2text:
                password: urlwatchsecret
            - strip

DEALING WITH CSV INPUT

       The csv2text filter can be used to turn CSV data to a prettier textual representation This
       is   done   by   supplying   a   format_string   which   is   a   python   format   string
       <https://docs.python.org/3/library/string.html#format-string-syntax>.

       If the CSV has a header, the format string should use the header  names  lowercased.   For
       example, let's say we have a CSV file containing data like this:

          Name;Company
          Smith;Initech
          Doe;Initech

       A possible format string for the above CSV (note the lowercase keys):

          Mr {name} works at {company}

       If there is no header row, you will need to use the numeric array notation:

          Mr {0} works at {1}

       You can force the use of numeric indices with the flag ignore_header.

       The  key has_header can be used to force use the first line or first ignore the first line
       as header, otherwise csv.Sniffer  <https://docs.python.org/3/library/csv.html#csv.Sniffer>
       will be used.

SORTING OF WEBPAGE CONTENT

       Sometimes  a  web page can have the same data between comparisons but it appears in random
       order. If that happens, you can choose to sort before the comparison.

          url: https://example.net/sorting.txt
          filter:
            - sort

       The sort filter takes an optional separator parameter that defines the item separator  (by
       default  sorting is line-based), for example to sort text paragraphs (text separated by an
       empty line):

          url: http://example.org/paragraphs.txt
          filter:
            - sort:
                separator: "\n\n"

       This can be combined with a boolean reverse  option,  which  is  useful  for  sorting  and
       reversing  with  the  same  separator  (using % as separator, this would turn 3%2%4%1 into
       4%3%2%1):

          url: http://example.org/sort-reverse-percent.txt
          filter:
            - sort:
                separator: '%'
                reverse: true

REVERSING OF LINES OR SEPARATED ITEMS

       To reverse the order of items without sorting, the reverse filter can be used. By  default
       it reverses lines:

          url: http://example.com/reverse-lines.txt
          filter:
            - reverse

       This  behavior  can  be changed by using an optional separator string argument (e.g. items
       separated by a pipe (|) symbol, as in 1|4|2|3, which would be reversed to 3|2|4|1):

          url: http://example.net/reverse-separator.txt
          filter:
            - reverse: '|'

       Alternatively, the filter can be specified more verbose with  a  dict.   In  this  example
       "\n\n" is used to separate paragraphs (items that are separated by an empty line):

          url: http://example.org/reverse-paragraphs.txt
          filter:
            - reverse:
                separator: "\n\n"

WATCHING GITHUB RELEASES AND GITLAB TAGS

       This  is  an  example  how to watch the GitHub “releases” page for a given project for the
       latest release version, to be notified of new releases:

          url: https://github.com/tulir/gomuks/releases
          filter:
            - xpath: '(//div[contains(@class,"d-flex flex-column flex-md-row my-5 flex-justify-center")]//h1//a)[1]'
            - html2text: re
            - strip

       This is the corresponding version for Github tags:

          url: https://github.com/thp/urlwatch/tags
          filter:
            - xpath:
                path: //*[@class="Link--primary"]
                maxitems: 1
            - html2text:

       and for Gitlab tags:

          url: https://gitlab.com/chinstrap/gammastep/-/tags
          filter:
            - xpath: (//a[contains(@class,"item-title ref-name")])[1]
            - html2text

       Alternatively, jq can be used for filtering:

          url: https://api.github.com/repos/voxpupuli/puppet-rundeck/tags
          filter:
            - jq: '.[0].name'

REMOVE OR REPLACE TEXT USING REGULAR EXPRESSIONS

       Just like Python’s re.sub function, there’s the possibility to apply a regular  expression
       and  either remove of replace the matched text. The following example applies the filter 3
       times:

       1. Just specifying a string as the value will replace the matches with the empty string.

       2. Simple patterns can be replaced with another string using “pattern” as  the  expression
          and “repl” as the replacement.

       3. You  can use groups (()) and back-reference them with \1 (etc..) to put groups into the
          replacement string.

       All        features        are        described         in         Python’s         re.sub
       <https://docs.python.org/3/library/re.html#re.sub>  documentation  (the  pattern  and repl
       values are passed to this function as-is, with the value of repl defaulting to  the  empty
       string).

          url: https://example.com/regex-substitute.html
          filter:
              - re.sub: '\s*href="[^"]*"'
              - re.sub:
                  pattern: '<h1>'
                  repl: 'HEADING 1: '
              - re.sub:
                  pattern: '</([^>]*)>'
                  repl: '<END OF TAG \1>'

       If  you  want to enable certain flags (e.g. re.MULTILINE) in the call, this is possible by
       inserting    an    "inline     flag"     documented     in     flags     in     re.compile
       <https://docs.python.org/3/library/re.html#re.compile>, here are some examples:

       • re.MULTILINE: (?m) (Makes ^ match start-of-line and $ match end-of-line)

       • re.DOTALL: (?s) (Makes . also match a newline)

       • re.IGNORECASE: (?i) (Perform case-insensitive matching)

       This allows you, for example, to remove all leading spaces (only space character and tab):

          url: http://example.com/leading-spaces.txt
          filter:
            - re.sub: '(?m)^[ \t]*'

USING A SHELL SCRIPT AS A FILTER

       While  the  built-in  filters  are powerful for processing markup such as HTML and XML, in
       some cases you might already know how you would filter your content using a shell  command
       or  shell script. The shellpipe filter allows you to start a shell and run custom commands
       to filter the content.

       The text data to be filtered will be written to the standard input (stdin)  of  the  shell
       process and the filter output will be taken from the shell's standard output (stdout).

       For  example,  if you want to use grep tool with the case insensitive matching option (-i)
       and printing only the matching part of the line (-o), you can specify  this  as  shellpipe
       filter:

          url: https://example.net/shellpipe-grep.txt
          filter:
            - shellpipe: "grep -i -o 'price: <span>.*</span>'"

       This  feature  also  allows  you  to  use  sed(1),  awk(1) and perl(1) one-liners for text
       processing (of course, any text tool that works in a shell can be used). For example, this
       awk(1) one-liner prepends the line number to each line:

          url: https://example.net/shellpipe-awk-oneliner.txt
          filter:
            - shellpipe: awk '{ print FNR " " $0 }'

       You  can  also  use  a multi-line command for a more sophisticated shell script (| in YAML
       denotes the start of a text block):

          url: https://example.org/shellpipe-multiline.txt
          filter:
            - shellpipe: |
                FILENAME=`mktemp`
                # Copy the input to a temporary file, then pipe through awk
                tee $FILENAME | awk '/The numbers for (.*) are:/,/The next draw is on (.*)./'
                # Analyze the input file in some other way
                echo "Input lines: $(wc -l $FILENAME | awk '{ print $1 }')"
                rm -f $FILENAME

       Within  the  shellpipe  script,  two  environment  variables  will  be  set  for   further
       customization  (this  can be useful if you have an external shell script file that is used
       as filter for multiple jobs, but needs to treat each job in a slightly different way):

                      ┌───────────────────────┬──────────────────────────────────┐
                      │Environment variable   │ Contents                         │
                      ├───────────────────────┼──────────────────────────────────┤
                      │$URLWATCH_JOB_NAME     │ The name of the job (name key in │
                      │                       │ jobs YAML)                       │
                      ├───────────────────────┼──────────────────────────────────┤
                      │$URLWATCH_JOB_LOCATION │ The  URL  of the job, or command │
                      │                       │ line (for shell jobs)            │
                      └───────────────────────┴──────────────────────────────────┘

CONVERTING TEXT IN IMAGES TO PLAINTEXT

       The ocr filter uses the Tesseract OCR engine <https://github.com/tesseract-ocr> to convert
       text  in images to plain text. It requires two Python modules to be installed: pytesseract
       <https://github.com/madmaze/pytesseract> and Pillow <https://python-pillow.org>. Any  file
       formats supported by Pillow (PIL) are supported.

       This  filter must be the first filter in a chain of filters, since it consumes binary data
       and outputs text data.

          url: https://example.net/ocr-test.png
          filter:
            - ocr:
                timeout: 5
                language: eng
            - strip

       The subfilters timeout and language are optional:

       • timeout: Timeout for the recognition, in seconds (default: 10 seconds)

       • language: Text language (e.g. fra or eng+fra, default: eng)

FILTERING JSON RESPONSE DATA USING JQ SELECTORS

       The jq  filter  uses  the  Python  bindings  for  jq  <https://stedolan.github.io/jq/>,  a
       lightweight  JSON  processor.   Use  of this filter requires the optional jq Python module
       <https://github.com/mwilliamson/jq.py> to be installed.

          url: https://example.net/jobs.json
          filter:
             - jq:
                query: '.[].title'

       The subfilter query is optional:

       • query: A valid jq filter string.

       Supports aggregations, selections, and the  built-in  operators  like  length.   For  more
       information     on     the     operations     permitted,     see     the     jq     Manual
       <https://stedolan.github.io/jq/manual/>.

FILES

       $XDG_CONFIG_HOME/urlwatch/urls.yaml

COPYRIGHT

       2023 Thomas Perl

                                           May 03, 2023                       URLWATCH-FILTERS(5)

NAME

SYNOPSIS

DESCRIPTION

BUILT-IN FILTERS

PICKING OUT ELEMENTS FROM A WEBPAGE

CHAINING MULTIPLE FILTERS

EXTRACTING ONLY THE <BODY> TAG OF A PAGE

FILTERING BASED ON AN XPATH EXPRESSION

FILTERING BASED ON CSS SELECTORS

USING XPATH AND CSS FILTERS WITH XML AND EXCLUSIONS

LIMITING THE RETURNED ITEMS FROM A CSS SELECTOR OR XPATH

FIXING LIST REORDERINGS WITH CSS SELECTOR OR XPATH FILTERS

FILTERING PDF DOCUMENTS

DEALING WITH CSV INPUT

SORTING OF WEBPAGE CONTENT

REVERSING OF LINES OR SEPARATED ITEMS

WATCHING GITHUB RELEASES AND GITLAB TAGS

REMOVE OR REPLACE TEXT USING REGULAR EXPRESSIONS

USING A SHELL SCRIPT AS A FILTER

CONVERTING TEXT IN IMAGES TO PLAINTEXT

FILTERING JSON RESPONSE DATA USING JQ SELECTORS

FILES

SEE ALSO

COPYRIGHT