| ´ This file is part of TagSoup and is Copyright 2002‐2008 by John |
| Cowan. ´ ´ TagSoup is licensed under the Apache License, ´ Ver‐ |
| sion 2.0. You may obtain a copy of this license at ´ |
| http://www.apache.org/licenses/LICENSE‐2.0 . You may also have ´ |
| additional legal rights not granted by this license. ´ ´ TagSoup |
| is distributed in the hope that it will be useful, but ´ unless |
| required by applicable law or agreed to in writing, TagSoup ´ is |
| distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS |
| ´ OF ANY KIND, either express or implied; not even the implied |
| warranty ´ of MERCHANTABILITY or FITNESS FOR A PARTICULAR PUR‐ |
| TAGSOUP(1) User Commands TAGSOUP(1) |
| |
| |
| |
| POSE. ´ |
| |
| NAME |
| tagsoup - convert nasty, ugly HTML to clean XHTML |
| |
| SYNOPSIS |
| java -jar tagsoup-1.2 [ options ] [ files ] |
| |
| DESCRIPTION |
| Rectify arbitrary HTML into clean XHTML, using a tailored description |
| of HTML. The output will be well-formed XML, but not necessarily valid |
| XHTML. |
| |
| |
| --files |
| multiple input files should be processed into corresponding out‐ |
| put files |
| |
| --encoding=encoding |
| specifies the encoding of input files |
| |
| --output-encoding=encoding |
| specifies the encoding of the output (if the encoding name |
| begins with ‘‘utf’’, the output will not contain character enti‐ |
| ties; otherwise, all non-ASCII characters are represented as |
| entities) |
| |
| --html output rectified HTML rather than XML, omitting the XML declara‐ |
| tion and any namespace declarations |
| |
| --method=html |
| output rectified HTML rather than XML (end-tags are omitted for |
| empty elements, and no character escaping is done in script and |
| style elements) |
| |
| --omit-xml-declaration |
| omit the XML declaration |
| |
| --lexical |
| output lexical features (specifically comments and any DOCTYPE |
| declaration) |
| |
| --nons suppress namespaces in output |
| |
| --nobogons |
| suppress unknown non-HTML elements in output |
| |
| --nodefaults |
| suppress default attribute values |
| |
| --nocolons |
| change explicit colons in element and attribute names to under‐ |
| scores |
| |
| --norestart |
| don’t restart any restartable elements |
| |
| --ignorable |
| pass through ignorable whitespace (whitespace in element-only |
| content) via SAX method handler ignorableWhitespace |
| |
| --any treat unknown non-HTML elements as allowing any content |
| (default) |
| |
| --emptybogons |
| treat unknown non-HTML elements as empty elements |
| |
| --norootbogons |
| don’t allow unknown non-HTML elements to be root elements |
| |
| --doctype-system=system-id |
| force DOCTYPE declaration to be output with specified system |
| identifier |
| |
| --doctype-public=public-id |
| force DOCTYPE declaration to be output with specified public |
| identifier |
| |
| --standalone=[yes|no] |
| specify standalone pseudo-attribute in output XML declaration |
| |
| --version=version |
| specify version pseudo-attribute in output XML declaration (does |
| not affect actual version of XML output) |
| |
| --nocdata |
| treat the CDATA-content elements script and style as ordinary |
| elements (mostly for testing) |
| |
| --pyx output PYX format rather than XML (mostly for testing) |
| |
| --pyxin |
| input is PYX-format HTML (mostly for testing) |
| |
| --reuse |
| reuse the same Parser object internally (for testing only) |
| |
| --help output basic help |
| |
| --version |
| output version number |
| |
| TagSoup is a parser and reformatter for nasty, ugly HTML. Its normal |
| processing mode is to accept HTML files on the command line, or from |
| the standard input if none are given, and output them as clean XML to |
| the standard output. The encoding is assumed to be the platform-local |
| encoding on input, and is always UTF-8 on output. |
| |
| When the --files option is given, each input file is processed into an |
| output file of the corresponding name, with the extension changed to |
| xhtml. If the extension is already xhtml, it is changed to xhtml_. |
| |
| TagSoup will repair, by whatever means necessary, violations of XML |
| well-formedness. In particular, it will fix up malformed attribute |
| names and supply missing attribute-value quotation marks. More signif‐ |
| icantly, it supplies end-tags where HTML allows them to be omitted, and |
| sometimes where it doesn’t. It will even supply start-tags where nec‐ |
| essary; for example, if a document begins with a <li> tag, TagSoup will |
| automatically prefix it with <html><body><ul>. |
| |
| |
| BUGS |
| TagSoup can be fooled by missing close quotes after attribute values, |
| and by incorrect character encodings (it does not contain an encoding |
| guesser). |
| |
| TagSoup doesn’t understand namespace declarations, which are not prop‐ |
| erly part of HTML. Instead, any element or attribute name beginning |
| foo: will be put into the artificial namespace urn:x-prefix:foo. |
| |
| For the same reasons, namespace-qualified attributes like xml:space |
| can’t be returned as default values, though an explicit attribute in |
| the xml namespace will be returned with the proper namespace URI. |
| |
| AUTHOR |
| John Cowan <cowan@ccil.org> |
| |
| COPYRIGHT |
| Copyright © 2002-2008 John Cowan |
| TagSoup is free software; see the source for copying conditions. There |
| is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICU‐ |
| LAR PURPOSE. |
| |
| |
| |
| TagSoup 1.2 January 2008 TAGSOUP(1) |