Provided by: libtagsoup-java_1.2.1+-1.1_all

NAME

       tagsoup - convert nasty, ugly HTML to clean XHTML

SYNOPSIS

       java -jar /usr/share/java/tagsoup.jar [ options ] [ files ]

DESCRIPTION

       Rectify  arbitrary HTML into clean XHTML, using a tailored description of HTML.  The output will be well-
       formed XML, but not necessarily valid XHTML.

       --files
              multiple input files should be processed into corresponding output files

       --encoding=encoding
              specifies the encoding of input files

       --output-encoding=encoding
              specifies the encoding of the output (if the encoding name begins with ``utf'',  the  output  will
              not contain character entities; otherwise, all non-ASCII characters are represented as entities)

       --html output rectified HTML rather than XML, omitting the XML declaration and any namespace
              declarations

       --method=html
              output  rectified  HTML rather than XML (end-tags are omitted for empty elements, and no character
              escaping is done in script and style elements)

       --omit-xml-declaration
              omit the XML declaration

       --lexical
              output lexical features (specifically comments and any DOCTYPE declaration)

       --nons suppress namespaces in output

       --nobogons
              suppress unknown non-HTML elements in output

       --nodefaults
              suppress default attribute values

       --nocolons
              change explicit colons in element and attribute names to underscores

       --norestart
              don't restart any restartable elements

       --ignorable
              pass through ignorable whitespace (whitespace in element-only  content)  via  SAX  method  handler
              ignorableWhitespace

       --any  treat unknown non-HTML elements as allowing any content (default)

       --emptybogons
              treat unknown non-HTML elements as empty elements

       --norootbogons
              don't allow unknown non-HTML elements to be root elements

       --doctype-system=system-id
              force DOCTYPE declaration to be output with specified system identifier

       --doctype-public=public-id
              force DOCTYPE declaration to be output with specified public identifier

       --standalone=[yes|no]
              specify standalone pseudo-attribute in output XML declaration

       --version=version
              specify  version pseudo-attribute in output XML declaration (does not affect actual version of XML
              output)

       --nocdata
              treat the CDATA-content elements script and style as ordinary elements (mostly for testing)

       --pyx  output PYX format rather than XML (mostly for testing)

       --pyxin
              input is PYX-format HTML (mostly for testing)

       --reuse
              reuse the same Parser object internally (for testing only)

       --help output basic help

       --version
              output version number

       TagSoup is a parser and reformatter for nasty, ugly HTML.  Its normal processing mode is to  accept  HTML
       files  on the command line, or from the standard input if none are given, and output them as clean XML to
       the standard output.  Unless overridden with --encoding and --output-encoding, the input is assumed to
       be in the platform-local encoding and the output is UTF-8.
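
       The entity rule given for --output-encoding above can be sketched in Java as follows.  This only
       illustrates the documented behavior and is not TagSoup's own code: the class and method names are
       hypothetical, and both the case-insensitive match on ``utf'' and the use of decimal numeric character
       references are assumptions.

```java
import java.util.Locale;

// Hypothetical sketch of the documented --output-encoding rule; not TagSoup code.
class EntityEscapeSketch {
    /** True when the encoding name begins with "utf" (case-insensitive match is an assumption). */
    static boolean keepsRawCharacters(String encodingName) {
        return encodingName.toLowerCase(Locale.ROOT).startsWith("utf");
    }

    /**
     * Returns text unchanged for "utf" encodings; otherwise replaces every
     * non-ASCII code point with a decimal numeric character reference.
     * Escaping of markup characters such as < and & is not modeled here.
     */
    static String escapeForEncoding(String text, String encodingName) {
        if (keepsRawCharacters(encodingName)) {
            return text;
        }
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);                 // ASCII passes through
            } else {
                out.append("&#").append(cp).append(';'); // e.g. U+00E9 becomes &#233;
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeForEncoding("caf\u00e9", "UTF-8"));
        System.out.println(escapeForEncoding("caf\u00e9", "ISO-8859-1"));
    }
}
```

       With an encoding name such as ISO-8859-1 the sketch turns the e-acute (U+00E9) into &#233;, while a
       name such as UTF-8 leaves the text untouched.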

       When  the  --files option is given, each input file is processed into an output file of the corresponding
       name, with the extension changed to xhtml.  If the extension is already xhtml, it is changed to xhtml_.
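
       The naming rule for --files can be sketched in Java as follows.  This is a minimal illustration of the
       rule as stated above, not TagSoup's actual implementation; the class and method names are hypothetical,
       and the handling of names with no extension (appending .xhtml) is an assumption, since the text above
       does not cover that case.

```java
// Hypothetical sketch of the documented --files naming rule; not TagSoup code.
class OutputNameSketch {
    /** Maps an input file name to the output name described for --files. */
    static String outputName(String inputName) {
        // Only look for a dot inside the last path component.
        int slash = Math.max(inputName.lastIndexOf('/'), inputName.lastIndexOf('\\'));
        int dot = inputName.lastIndexOf('.');
        if (dot <= slash) {
            // Assumption: a name without an extension simply gains ".xhtml".
            return inputName + ".xhtml";
        }
        String base = inputName.substring(0, dot);
        String ext = inputName.substring(dot + 1);
        if (ext.equals("xhtml")) {
            // Already xhtml: change the extension to xhtml_ so the input is not overwritten.
            return base + ".xhtml_";
        }
        return base + ".xhtml";
    }

    public static void main(String[] args) {
        System.out.println(outputName("index.html"));
        System.out.println(outputName("page.xhtml"));
    }
}
```

       Under these assumptions index.html maps to index.xhtml and page.xhtml maps to page.xhtml_.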

       TagSoup will repair, by whatever means necessary, violations of XML well-formedness.  In  particular,  it
       will  fix  up  malformed  attribute  names  and  supply  missing  attribute-value  quotation marks.  More
       significantly, it supplies end-tags where HTML allows them to be omitted, and sometimes where it doesn't.
       It will even supply start-tags where necessary; for example, if  a  document  begins  with  a  <li>  tag,
       TagSoup will automatically prefix it with <html><body><ul>.

BUGS

       TagSoup  can  be  fooled  by  missing  close  quotes  after  attribute values, and by incorrect character
       encodings (it does not contain an encoding guesser).

       TagSoup doesn't understand namespace declarations, which are not properly part  of  HTML.   Instead,  any
       element or attribute name beginning with foo: will be put into the artificial namespace urn:x-prefix:foo.

       For  the same reasons, namespace-qualified attributes like xml:space can't be returned as default values,
       though an explicit attribute in the xml namespace will be returned with the proper namespace URI.
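
       The prefix-to-namespace mapping described in the two paragraphs above can be sketched in Java as
       follows.  This is an illustration only, not TagSoup's own code: the class and method names are
       hypothetical, and names without a prefix are left unmapped because the text above does not say how
       they are treated.

```java
import java.util.Optional;

// Hypothetical sketch of the documented prefix handling; not TagSoup code.
class PrefixNamespaceSketch {
    /** The namespace URI reserved for the xml prefix by the XML Namespaces recommendation. */
    static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    /**
     * Returns the namespace a prefixed name would be placed in, or an empty
     * Optional for names with no prefix (not modeled in this sketch).
     */
    static Optional<String> namespaceFor(String qName) {
        int colon = qName.indexOf(':');
        if (colon <= 0) {
            return Optional.empty();
        }
        String prefix = qName.substring(0, colon);
        if (prefix.equals("xml")) {
            return Optional.of(XML_NS);              // explicit xml: names keep the proper URI
        }
        return Optional.of("urn:x-prefix:" + prefix); // artificial namespace for any other prefix
    }

    public static void main(String[] args) {
        System.out.println(namespaceFor("foo:bar").orElse("(none)"));
        System.out.println(namespaceFor("xml:space").orElse("(none)"));
    }
}
```

       Here foo:bar lands in urn:x-prefix:foo, and xml:space is reported in the standard xml namespace.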

AUTHOR

       John Cowan <cowan@ccil.org>

COPYRIGHT

       Copyright © 2002-2008 John Cowan
       TagSoup is free software; see the source for copying conditions.  There is  NO  warranty;  not  even  for
       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

TagSoup 1.2.1                                     January 2008                                        TAGSOUP(1)