Provided by: html-xml-utils_7.7-1.1_amd64

NAME

       hxpipe - convert XML file to a format easier to parse with Perl or AWK

SYNOPSIS

       hxpipe [ -l ] [ -- ] [ file-or-URL ]

DESCRIPTION

       hxpipe parses an HTML or XML file and outputs a line-oriented representation of it that is
       well suited to further processing with AWK or similar tools. The format is similar to  the
       ESIS (Element Structure Information Set) that is output by nsgmls/onsgmls.

       The reverse operation, converting back to mark-up, is performed by the hxunpipe program.

       The output format is as follows:

       <!--comment-->
                 Comments are output as

                     *comment

                 I.e.,  a single line starting with "*" followed by the text of the comment. Line
                 feeds, carriage returns and tabs in the text are written as "\n", "\r" and "\t",
                 respectively.  Text that looks like a numerical character entity is written with
                 the "&" replaced by "\".  The line ends with a line feed.

                 Note that onsgmls outputs comments starting with a "_"  instead  of  a  "*"  and
                 doesn't  replace  the "&" of numerical character entities by "\" (and by default
                 it omits comments altogether).

       <?processing instruction>
                 Processing instructions are output as

                     ?processing instruction

                 I.e., a single line starting with a "?" followed by the text of  the  processing
                 instruction. The text is escaped as for comments (see above).

       <!DOCTYPE root PUBLIC "-//foo//DTD bar//EN" "http://example.org/dtd">
                 DOCTYPEs are output as one of the following:

                     !root "-//foo//DTD bar//EN" http://example.org/dtd
                     !root "-//foo//DTD bar//EN"
                     !root "" http://example.org/dtd
                     !root ""

                 for, respectively, a DOCTYPE with (1) both a public and a system identifier, (2)
                 only a public identifier, (3) only a system identifier, or (4) neither of the
                 two. I.e., a single line starting with a "!" and the name of the root element,
                 followed by a space and a possibly empty quoted string, followed optionally by a
                 space and arbitrary text. Note the quotes for the public identifier and the
                 absence of quotes for the system identifier.

       <elt att1="value1" att2="value2">
                 A start tag is output as

                     Aatt1 CDATA value1
                     Aatt2 CDATA value2
                     (elt

                 I.e., as zero or more lines for the attributes and  one  line  for  the  element
                 type.  Each  line  for  an attribute starts with "A" followed by the name of the
                 attribute, a space, the literal string "CDATA", another space, and the attribute
                 value.  The  text of the attribute value is escaped as for comments (see above).
                 The line for the element type starts with "(" followed by the element type.

                 hxpipe does not read DTDs and assumes that attributes are always CDATA. It never
                 generates other types (IMPLIED, TOKEN, ID, etc.), unlike onsgmls.

       </elt>    End tags are output as

                     )elt

                 I.e., as a line starting with ")" followed by the element type.

       <empty att1="val1" att2="val2"/>
                 Empty elements (in XML) are output as

                     Aatt1 CDATA val1
                     Aatt2 CDATA val2
                     |empty

                 I.e.,  as  zero  or  more  lines  for  attributes and one line starting with "|"
                 followed by the element type.

                 Note that onsgmls never outputs "|". (However, it can optionally output  a  line
                 consisting  of  a  single  "e"  just  before  the "(" line, to indicate that the
                 element is empty.)

       text      Text is output as

                     -text

                 I.e., as a single line starting with a "-". The text is escaped as for  comments
                 (see above).

       line numbers
                 When  the  -l option is in effect, hxpipe will intersperse the output with lines
                 of the form

                     L12

                 where "12" is replaced with the line number in the source where the next  output
                 came from.
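
       The line types above can be read back with a small dispatcher on the first character
       of each line. The following Python sketch is illustrative only and is not part of
       html-xml-utils; the names unescape and parse_line are hypothetical. It assumes that
       the only escapes are the ones documented above ("\n", "\r", "\t", and "\#" for the
       "&#" of a numerical character entity) and leaves DOCTYPE lines unparsed.

```python
import re

# Escapes documented above. "\#" stands in for the "&#" of a numerical
# character entity. Other backslash sequences are left untouched, which is
# an assumption: the format description does not mention any others.
_ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "#": "&#"}

# Line types whose payload is escaped text.
_ESCAPED_KINDS = {"*": "comment", "?": "pi", "-": "text"}

# Line types whose payload is returned as-is.
_PLAIN_KINDS = {"(": "start", ")": "end", "|": "empty",
                "!": "doctype", "L": "line"}


def unescape(text):
    """Undo the documented escapes in a single left-to-right pass."""
    return re.sub(r"\\([nrt#])", lambda m: _ESCAPES[m.group(1)], text)


def parse_line(line):
    """Split one output line into a (kind, payload) pair."""
    line = line.rstrip("\n")
    tag, rest = line[:1], line[1:]
    if tag == "A":
        # "Aname CDATA value": the name, the literal type, the escaped value.
        parts = rest.split(" ", 2)
        value = parts[2] if len(parts) > 2 else ""
        return ("attribute", (parts[0], unescape(value)))
    if tag in _ESCAPED_KINDS:
        return (_ESCAPED_KINDS[tag], unescape(rest))
    if tag in _PLAIN_KINDS:
        return (_PLAIN_KINDS[tag], rest)
    raise ValueError("unrecognized line: %r" % line)
```

       Splitting the attribute line at most twice keeps any spaces inside the attribute
       value intact.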

       hxpipe does not normalize the input and does not add missing tags. It is thus possible that
       there are unequal numbers of "(" and ")" lines. If it is important that every start tag is
       matched by an end tag, pipe the input through hxnormalize -x first.
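
       A consumer that depends on balanced output can check for it before processing. The
       sketch below is a hypothetical helper, not part of html-xml-utils. It applies a
       deliberately strict rule: an end tag only closes the most recently opened element,
       and anything else is reported as stray.

```python
def unbalanced_tags(lines):
    """Return (unclosed, stray) lists of element names.

    unclosed: start tags that were never closed, outermost first.
    stray:    end tags that did not match the innermost open element.
    """
    stack = []
    stray = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("("):
            stack.append(line[1:])
        elif line.startswith(")"):
            name = line[1:]
            if stack and stack[-1] == name:
                stack.pop()
            else:
                stray.append(name)
        # Text lines start with "-", so a "(" inside text is never
        # mistaken for a start tag. "|" lines need no end tag.
    return stack, stray
```

       If both returned lists are empty, every "(" line has a matching ")" line.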

OPTIONS

       The following options are supported:

       -l        Add "L" lines to the output to indicate the line numbers in the source.
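
       As an illustration of consuming the "L" lines, the following Python sketch pairs
       each output line with the most recent line number. The function name is hypothetical.
       It assumes that a line number stays in effect until the next "L" line appears, which
       the description above does not state explicitly.

```python
def with_line_numbers(lines):
    """Yield (source_line, output_line) pairs from output produced with -l.

    source_line is None until the first "L" line has been seen.
    """
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("L") and line[1:].isdigit():
            # An "L" line only updates the position; it is not yielded.
            current = int(line[1:])
        else:
            yield (current, line)
```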

OPERANDS

       The following operand is supported:

       file-or-URL
                 The name or URL of an HTML or XML file. If absent, standard input is read instead.

EXIT STATUS

       The following exit values are returned:

       0         Successful completion.

       > 0       An  error  occurred in the parsing of the HTML file.  hxpipe will try to correct
                 the error and produce output anyway.
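
       Because output is still produced after a parse error, a calling script may want to
       keep the output and merely report the non-zero status. A minimal Python sketch,
       assuming hxpipe is on the PATH; the function names are hypothetical and the command
       is a parameter so that another program can be substituted:

```python
import subprocess
import sys


def run_filter(command, data):
    """Run command with data on standard input.

    Returns (exit_status, stdout_text). A non-zero status does not raise.
    """
    result = subprocess.run(command, input=data,
                            capture_output=True, text=True)
    return result.returncode, result.stdout


def hxpipe_lines(data, command=("hxpipe",)):
    """Return the output lines, warning on stderr if the status is non-zero."""
    status, out = run_filter(list(command), data)
    if status != 0:
        print("warning: exit status %d, using recovered output" % status,
              file=sys.stderr)
    return out.splitlines()
```

       For example, hxpipe_lines("<p>text</p>") would run hxpipe on the given string and
       return its output as a list of lines.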

ENVIRONMENT

       To use a proxy to retrieve remote files, set  the  environment  variables  http_proxy  and
       ftp_proxy.  E.g., http_proxy="http://localhost:8080/"

BUGS

       The error recovery for incorrect HTML is primitive. hxpipe can currently only
       retrieve remote files over HTTP. It doesn't handle password-protected files, nor files
       whose content depends on HTTP "cookies."

SEE ALSO

       hxunpipe(1), onsgmls(1).