| \' This file is part of TagSoup and is Copyright 2002-2008 by John Cowan. |
| \' |
| \' TagSoup is licensed under the Apache License, |
| \' Version 2.0. You may obtain a copy of this license at |
| \' http://www.apache.org/licenses/LICENSE-2.0 . You may also have |
| \' additional legal rights not granted by this license. |
| \' |
| \' TagSoup is distributed in the hope that it will be useful, but |
| \' unless required by applicable law or agreed to in writing, TagSoup |
| \' is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS |
| \' OF ANY KIND, either express or implied; not even the implied warranty |
| \' of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. |
| \' |
| .TH TAGSOUP "1" "January 2008" "TagSoup 1.2" "User Commands" |
| .SH NAME |
| tagsoup \- convert nasty, ugly HTML to clean XHTML |
| .SH SYNOPSIS |
| .B java -jar tagsoup-1.2 |
| [ |
| .I options |
| ] [ |
| .I files |
| ] |
| .SH DESCRIPTION |
| .\" Add any additional description here |
| .PP |
| Rectify arbitrary HTML into clean XHTML, |
| using a tailored description of HTML. |
| The output will be well-formed XML, but not necessarily |
| .I valid |
| XHTML. |
| .PP |
| .TP |
| .B --files |
| multiple input |
| .I files |
| should be processed into corresponding output files |
| .TP |
| .BI --encoding= encoding |
| specifies the encoding of input files |
| .TP |
| .BI --output-encoding= encoding |
| specifies the encoding of the output |
| (if the encoding name begins with ``utf'', |
| the output will not contain character entities; |
| otherwise, all non-ASCII characters are |
| represented as entities) |
| .TP |
| .B --html |
| output rectified HTML rather than XML, |
| omitting the XML declaration |
| and any namespace declarations |
| .TP |
| .B --method=html |
| output rectified HTML rather than XML |
| (end-tags are omitted for empty elements, |
| and no character escaping is done in |
| script and style elements) |
| .TP |
| .B --omit-xml-declaration |
| omit the XML declaration |
| .TP |
| .B --lexical |
| output lexical features (specifically comments and any DOCTYPE declaration) |
| .TP |
| .B --nons |
| suppress namespaces in output |
| .TP |
| .B --nobogons |
| suppress unknown non-HTML elements in output |
| .TP |
| .B --nodefaults |
| suppress default attribute values |
| .TP |
| .B --nocolons |
| change explicit colons |
| in element and attribute names |
| to underscores |
| .TP |
| .B --norestart |
| don't restart any restartable elements |
| .TP |
| .B --ignorable |
| pass through ignorable whitespace |
| (whitespace in element-only content) |
| via SAX method handler ignorableWhitespace |
| .TP |
| .B --any |
| treat unknown non-HTML elements as allowing any content (default) |
| .TP |
| .B --emptybogons |
| treat unknown non-HTML elements as empty elements |
| .TP |
| .B --norootbogons |
| don't allow unknown non-HTML elements to be root elements |
| .TP |
| .BI --doctype-system= system-id |
| force DOCTYPE declaration to be output with specified system identifier |
| .TP |
| .BI --doctype-public= public-id |
| force DOCTYPE declaration to be output with specified public identifier |
| .TP |
| .B --standalone=[yes|no] |
| specify standalone pseudo-attribute in output XML declaration |
| .TP |
| .BI --version= version |
| specify version pseudo-attribute in output XML declaration |
| (does not affect actual version of XML output) |
| .TP |
| .B --nocdata |
| treat the CDATA-content elements |
| .I script |
| and |
| .I style |
| as ordinary elements |
| (mostly for testing) |
| .TP |
| .B --pyx |
| output PYX format rather than XML |
| (mostly for testing) |
| .TP |
| .B --pyxin |
| input is PYX-format HTML |
| (mostly for testing) |
| .TP |
| .B --reuse |
| reuse the same Parser object internally |
| (for testing only) |
| .TP |
| .B --help |
| output basic help |
| .TP |
| .B --version |
| output version number |
| .PP |
| .B TagSoup |
| is a parser and reformatter for nasty, ugly HTML. |
| Its normal processing mode is to accept HTML files on the command line, |
| or from the standard input if none are given, and output them |
| as clean XML |
| to the standard output. The encoding is assumed to be the platform-local |
| encoding on input, and is always UTF-8 on output. |
| .PP |
| When the |
| .B --files |
| option is given, each input file is processed into an output file of the |
| corresponding name, with the extension changed to |
| .IR xhtml . |
| If the extension is already |
| .IR xhtml , |
| it is changed to |
| .IR xhtml_ . |
| .PP |
| TagSoup will repair, by whatever means necessary, |
| violations of XML well-formedness. In particular, it will fix up |
| malformed attribute names and supply missing attribute-value quotation marks. |
| More significantly, it supplies end-tags where HTML allows them |
| to be omitted, and sometimes where it doesn't. It will even supply |
| start-tags where necessary; for example, if a document begins with a |
| <li> tag, TagSoup will automatically prefix it with <html><body><ul>. |
| .PP |
| .SH BUGS |
| TagSoup can be fooled by missing close quotes after attribute values, and by |
| incorrect character encodings (it does not contain an encoding guesser). |
| .PP |
| TagSoup doesn't understand namespace declarations, which are not properly |
| part of HTML. Instead, any element or attribute name beginning |
| .IR foo : |
| will be put into the artificial namespace |
| .RI urn:x-prefix: foo . |
| .PP |
| For the same reasons, namespace-qualified attributes like |
| xml:space |
| can't be returned as default values, |
| though an explicit attribute in the xml namespace |
| will be returned with the proper namespace URI. |
| .SH AUTHOR |
| John Cowan <cowan@ccil.org> |
| .SH COPYRIGHT |
| Copyright \(co 2002-2008 John Cowan |
| .br |
| TagSoup is free software; see the source for copying conditions. There is NO |
| warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. |