Provided by: html-xml-utils_7.7-1_amd64

NAME

       hxpipe - convert an HTML or XML file to a format that is easier to parse with Perl or AWK

SYNOPSIS

       hxpipe [ -l ] [ -- ] [ file-or-URL ]

DESCRIPTION

       hxpipe parses an HTML or XML file and outputs a line-oriented representation of it that is well suited to
       further processing with AWK or similar tools. The format  is  similar  to  the  ESIS  (Element  Structure
       Information Set) that is output by nsgmls/onsgmls.

       The reverse operation, converting back to mark-up, is performed by the hxunpipe program.

       The output format is as follows:

       <!--comment-->
                 Comments are output as

                     *comment

                 I.e., a single line starting with "*" followed by the text of the comment. Line feeds, carriage
                 returns and tabs in the text are written as "\n", "\r" and "\t", respectively. Text that  looks
                 like  a numerical character entity is written with the "&" replaced by "\".  The line ends with
                 a line feed.

                 Note that onsgmls outputs comments starting with a "_" instead of a "*" and doesn't replace the
                 "&" of numerical character entities by "\" (and by default it omits comments altogether).

       <?processing instruction>
                 Processing instructions are output as

                     ?processing instruction

                 I.e., a single line starting with a "?" followed by the text of the processing instruction. The
                 text is escaped as for comments (see above).
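
                 The escaping shared by comment and processing-instruction lines could be undone roughly as
                 follows. This is only a sketch: the names unescape and parse_comment_or_pi are hypothetical
                 helpers, not part of html-xml-utils, and only the four escapes described above are handled.
                 Any other backslash sequence is left unchanged, since its meaning is not defined here.

```python
import re

# Escapes documented above: \n, \r, \t, and "\#" standing in for the
# "&#" that starts a numerical character entity.
_ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "#": "&#"}


def unescape(text):
    """Undo the documented escapes (hypothetical helper)."""
    # Assumption: backslash sequences other than the four above are
    # passed through untouched.
    return re.sub(r"\\([nrt#])", lambda m: _ESCAPES[m.group(1)], text)


def parse_comment_or_pi(line):
    """Return ("comment", text) for a "*" line, ("pi", text) for a "?" line.

    Any other line yields None.
    """
    line = line.rstrip("\n")
    kind = {"*": "comment", "?": "pi"}.get(line[:1])
    if kind is None:
        return None
    return kind, unescape(line[1:])
```

                 The same unescape step would apply to attribute values and text lines, which use the same
                 escaping.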

       <!DOCTYPE root PUBLIC "-//foo//DTD bar//EN" "http://example.org/dtd">
                 DOCTYPEs are output as one of the following:

                     !root "-//foo//DTD bar//EN" http://example.org/dtd
                     !root "-//foo//DTD bar//EN"
                     !root "" http://example.org/dtd
                     !root ""

                 for, respectively, a DOCTYPE with (1) both a public and a system identifier, (2) only a public
                 identifier, (3) only a system identifier, or (4) neither of the two. I.e., a single line
                 starting with "!" followed by the name of the root element, a space, and a possibly empty
                 quoted string, optionally followed by a space and arbitrary text. Note the quotes for the
                 public identifier and the absence of quotes for the system identifier.
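
                 As an illustration, the four forms above could be split into their parts with a single
                 regular expression. The function name parse_doctype is hypothetical, and the sketch assumes
                 that the public identifier itself contains no double quote.

```python
import re

# "!" + root element name, a space, the quoted (possibly empty) public
# identifier, then optionally a space and the unquoted system identifier.
_DOCTYPE = re.compile(r'^!(\S+) "([^"]*)"(?: (.*))?$')


def parse_doctype(line):
    """Split a "!" line into (root, public_id, system_id).

    Absent identifiers are returned as None. Returns None if the line
    is not a DOCTYPE line. (Hypothetical helper.)
    """
    m = _DOCTYPE.match(line.rstrip("\n"))
    if m is None:
        return None
    root, public_id, system_id = m.groups()
    return root, public_id or None, system_id
```

                 An empty quoted string is mapped to None here so that callers can tell "no public
                 identifier" apart from a present one.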

       <elt att1="value1" att2="value2">
                 A start tag is output as

                     Aatt1 CDATA value1
                     Aatt2 CDATA value2
                     (elt

                 I.e., as zero or more lines for the attributes and one line for the element type. Each line for
                 an attribute starts with "A" followed by the name of the attribute, a space, the literal string
                 "CDATA", another space, and the attribute value. The text of the attribute value is escaped  as
                 for comments (see above). The line for the element type starts with "(" followed by the element
                 type.

                 hxpipe does not read DTDs and assumes that attributes are  always  CDATA.  It  never  generates
                 other types (IMPLIED, TOKEN, ID, etc.), unlike onsgmls.

       </elt>    End tags are output as

                     )elt

                 I.e., as a line starting with ")" followed by the element type.

       <empty att1="val1" att2="val2"/>
                 Empty elements (in XML) are output as

                     Aatt1 CDATA val1
                     Aatt2 CDATA val2
                     |empty

                 I.e.,  as  zero  or  more  lines  for attributes and one line starting with "|" followed by the
                 element type.

                 Note that onsgmls never outputs "|". (However, it can optionally output a line consisting of  a
                 single "e" just before the "(" line, to indicate that the element is empty.)
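
                 Because attribute lines precede the "(" or "|" line they belong to, a consumer has to buffer
                 them until the element line arrives. A minimal sketch of that bookkeeping follows. The names
                 parse_attribute and elements are hypothetical, and attribute values are returned still
                 escaped.

```python
def parse_attribute(line):
    """Split an "A" line into (name, type, value)."""
    name, _, rest = line[1:].partition(" ")
    attr_type, _, value = rest.partition(" ")
    return name, attr_type, value


def elements(lines):
    """Yield (element_type, attributes, is_empty) for each "(" or "|" line.

    Attributes seen since the previous element line are attached to the
    next one. Lines of any other kind are skipped.
    """
    pending = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("A"):
            name, _attr_type, value = parse_attribute(line)
            pending[name] = value
        elif line.startswith(("(", "|")):
            yield line[1:], pending, line[0] == "|"
            pending = {}
```

                 Splitting on the first two spaces only keeps attribute values that contain spaces intact.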

       text      Text is output as

                     -text

                 I.e., as a single line starting with a "-". The text is escaped as for comments (see above).

       line numbers
                 When the -l option is in effect, hxpipe will intersperse the output with lines of the form

                     L12

                 where "12" is replaced with the line number in the source where the next output came from.
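
                 One way to consume the "L" lines is to remember the most recent number and attach it to
                 every output line that follows. The sketch below does that. The name with_line_numbers is
                 hypothetical, and it assumes that a line number stays valid until the next "L" line
                 appears.

```python
import re


def with_line_numbers(lines):
    """Yield (source_line, output_line) pairs from "hxpipe -l" output.

    source_line is None until the first "L" line has been seen.
    """
    current = None
    for line in lines:
        line = line.rstrip("\n")
        m = re.fullmatch(r"L(\d+)", line)
        if m:
            # An "L" line only updates the position; it is not passed on.
            current = int(m.group(1))
        else:
            yield current, line
```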

       hxpipe does not normalize the input and does not add missing tags. It is thus possible that there are
       unequal numbers of "(" and ")" lines. If it is important that every start tag is matched by an  end  tag,
       pipe the input through hxnormalize -x first.
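
       A consumer that wants to detect such mismatches itself could count "(" and ")" lines per element
       type. The following is a rough sketch with a hypothetical function name, tag_imbalance. It compares
       counts only and does not check nesting order.

```python
from collections import Counter


def tag_imbalance(lines):
    """Return {element_type: starts - ends} for types whose counts differ.

    "|" lines are ignored, since an empty element needs no end tag.
    """
    balance = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("("):
            balance[line[1:]] += 1
        elif line.startswith(")"):
            balance[line[1:]] -= 1
    return {name: n for name, n in balance.items() if n != 0}
```

       An empty result means every element type has as many start tags as end tags. A positive value
       means end tags are missing for that type.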

OPTIONS

       The following options are supported:

       -l        Add "L" lines to the output to indicate the line numbers in the source.

OPERANDS

       The following operand is supported:

       file-or-URL
                 The name or URL of an HTML or XML file. If absent, standard input is read instead.

EXIT STATUS

       The following exit values are returned:

       0         Successful completion.

       > 0       An  error  occurred  in the parsing of the HTML file.  hxpipe will try to correct the error and
                 produce output anyway.

ENVIRONMENT

       To use a proxy to retrieve remote files, set the environment variables http_proxy and  ftp_proxy.   E.g.,
       http_proxy="http://localhost:8080/"

BUGS

       The error recovery for incorrect HTML is primitive.  hxpipe can currently only retrieve remote files
       over HTTP. It doesn't handle password-protected files, nor files whose content depends on HTTP "cookies."

SEE ALSO

       hxunpipe(1), onsgmls(1).