lunar (1) tagsoup.1.gz

Provided by: libtagsoup-java_1.2.1+-1.1_all bug

NAME

       tagsoup - convert nasty, ugly HTML to clean XHTML

SYNOPSIS

       java -jar /usr/share/java/tagsoup.jar [ options ] [ files ]

DESCRIPTION

       Rectify arbitrary HTML into clean XHTML, using a tailored description of HTML.  The output
       will be well-formed XML, but not necessarily valid XHTML.

       --files
              multiple input files should be processed into corresponding output files

       --encoding=encoding
              specifies the encoding of input files

       --output-encoding=encoding
              specifies the encoding of the output (if the encoding name begins with ``utf'', the
              output will not contain character entities; otherwise, all non-ASCII characters are
              represented as entities)

       --html output rectified HTML rather  than  XML,  omitting  the  XML  declaration  and  any
              namespace declarations

       --method=html
              output rectified HTML rather than XML (end-tags are omitted for empty elements, and
              no character escaping is done in script and style elements)

       --omit-xml-declaration
              omit the XML declaration

       --lexical
              output lexical features (specifically comments and any DOCTYPE declaration)

       --nons suppress namespaces in output

       --nobogons
              suppress unknown non-HTML elements in output

       --nodefaults
              suppress default attribute values

       --nocolons
              change explicit colons in element and attribute names to underscores

       --norestart
              don't restart any restartable elements

       --ignorable
              pass through ignorable whitespace (whitespace  in  element-only  content)  via  SAX
              method handler ignorableWhitespace

       --any  treat unknown non-HTML elements as allowing any content (default)

       --emptybogons
              treat unknown non-HTML elements as empty elements

       --norootbogons
              don't allow unknown non-HTML elements to be root elements

       --doctype-system=system-id
              force DOCTYPE declaration to be output with specified system identifier

       --doctype-public=public-id
              force DOCTYPE declaration to be output with specified public identifier

       --standalone=[yes|no]
              specify standalone pseudo-attribute in output XML declaration

       --version=version
              specify  version pseudo-attribute in output XML declaration (does not affect actual
              version of XML output)

       --nocdata
              treat the CDATA-content elements script and style as ordinary elements (mostly  for
              testing)

       --pyx  output PYX format rather than XML (mostly for testing)

       --pyxin
              input is PYX-format HTML (mostly for testing)

       --reuse
              reuse the same Parser object internally (for testing only)

       --help output basic help

       --version
              output version number

       TagSoup  is  a parser and reformatter for nasty, ugly HTML.  Its normal processing mode is
       to accept HTML files on the command line, or from the standard input if  none  are  given,
       and  output  them  as clean XML to the standard output.  The encoding is assumed to be the
       platform-local encoding on input, and is always UTF-8 on output.

       When the --files option is given, each input file is processed into an output file of  the
       corresponding  name,  with  the  extension  changed to xhtml.  If the extension is already
       xhtml, it is changed to xhtml_.

       TagSoup will repair, by whatever means necessary, violations of XML  well-formedness.   In
       particular,  it  will  fix up malformed attribute names and supply missing attribute-value
       quotation marks.  More significantly, it supplies end-tags where HTML allows  them  to  be
       omitted,  and sometimes where it doesn't.  It will even supply start-tags where necessary;
       for example, if a document begins with a <li> tag, TagSoup will  automatically  prefix  it
       with <html><body><ul>.

BUGS

       TagSoup  can  be  fooled  by missing close quotes after attribute values, and by incorrect
       character encodings (it does not contain an encoding guesser).

       TagSoup doesn't understand namespace declarations, which are not properly  part  of  HTML.
       Instead,  any  element  or  attribute  name beginning foo: will be put into the artificial
       namespace urn:x-prefix:foo.

       For the same reasons, namespace-qualified attributes like xml:space can't be  returned  as
       default  values,  though  an explicit attribute in the xml namespace will be returned with
       the proper namespace URI.

AUTHOR

       John Cowan <cowan@ccil.org>

       Copyright © 2002-2008 John Cowan
       TagSoup is free software; see the source for copying conditions.  There  is  NO  warranty;
       not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.