| <!-- |
| // This file is part of TagSoup and is Copyright 2002-2008 by John Cowan. |
| // |
| // This file is part of TagSoup and is Copyright 2002-2008 by John Cowan. |
| // |
| // TagSoup is licensed under the Apache License, |
| // Version 2.0. You may obtain a copy of this license at |
| // http://www.apache.org/licenses/LICENSE-2.0 . You may also have |
| // additional legal rights not granted by this license. |
| // |
| // TagSoup is distributed in the hope that it will be useful, but |
| // unless required by applicable law or agreed to in writing, TagSoup |
| // is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS |
| // OF ANY KIND, either express or implied; not even the implied warranty |
| // of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. |
| --> |
| |
| <html><head><title>TagSoup home page</title></head><body> |
| <h1>TagSoup - Just Keep On Truckin'</h1> |
| |
| <h3>Introduction</h3> |
| <p>This is the home page of TagSoup, a SAX-compliant parser written in Java |
| that, instead of parsing well-formed or valid XML, parses HTML as it is |
| found in the wild: |
| <a href="http://oregonstate.edu/instruct/phl302/texts/hobbes/leviathan-c.html"> |
| poor, nasty and brutish</a>, though quite often far from short. |
| TagSoup is designed for people who have to process this stuff using some |
| semblance of a rational application design. By providing a SAX interface, |
| it allows standard XML tools to be applied to even the worst HTML. |
| TagSoup also includes a command-line processor that reads HTML files |
| and can generate either clean HTML or well-formed XML that is a |
| close approximation to XHTML.</p> |
| |
| <p>This is also the README file packaged with TagSoup.</p> |
| |
| <p>TagSoup is free and Open Source software. As of version 1.2, it |
| is licensed under the |
| <a href="http://opensource.org/licenses/apache2.0.php"> |
| Apache License, Version 2.0</a>, which allows proprietary re-use as well |
| as use with GPL 3.0 or GPL 2.0-or-later projects. (If anyone needs a |
| GPL 2.0 license for a GPL 2.0-only project, feel free to ask.) |
| |
| <h3><i>Warning:</i> TagSoup will not build on stock Java 5.x or 6.x!</h3> |
| |
| <p>Due to a bug in the versions of Xalan shipped with Java 5.x and |
| 6.x, TagSoup will not build out of the box. You need to retrieve |
| <a href="http://prdownloads.sourceforge.net/saxon/saxon6-5-5.zip"> |
| Saxon 6.5.5</a>, which does not have the bug. Unpack the |
| zipfile in an empty directory and copy the <tt>saxon.jar</tt> and |
| <tt>saxon-xml-apis.jar</tt> files to <tt>$ANT_HOME/lib</tt>. The Ant |
| build process for TagSoup will then notice that Saxon is available and |
| use it instead.</p> |
| |
| <h3>TagSoup 1.2 released</h3> |
| |
| <p>There are a great many changes, most of them fixes for long-standing |
| bugs, in this release. Only the most important are listed here; for |
| the rest, see the CHANGES file in the source distribution. Very special |
| thanks to Jojo Dijamco, whose intensive efforts at debugging made this |
| release a usable upgrade rather than a useless mass of undetected bugs.</p> |
| |
| <ul> |
| |
| <li><p>As noted above, I have changed the license to Apache 2.0.</p></li> |
| |
| <li><p>The default content model for bogons (unknown elements) is now |
| ANY rather than EMPTY. <b>This is a breaking change</b>, which I have |
| done only because there was so much demand for it. It can be undone |
| on the command line with the <code>--emptybogons</code> switch, or |
| programmatically with <code>parser.setFeature(Parser.emptyBogonsFeature, |
| true)</code>.</p></li> |
| |
| <li><p>The processing of entity references in attribute values has |
| finally been fixed to do what browsers do. That is, a reference |
| is only recognized if it is properly terminated by a semicolon; |
| otherwise it is treated as plain text. This means that URIs |
| like <code>foo?cdown=32&cup=42</code> are no longer seen as |
| containing an instance of the ∪ character (whose name happens to |
| be <code>cup</code>).</p></li> |
| |
| <li><p>Several new switches have been added: |
| |
| <ul> |
| |
| <li><p><code>--doctype-system</code> and <code>--doctype-public</code> |
| force a <code>DOCTYPE</code> declaration to be output and allow setting |
| the system and public identifiers.</p></li> |
| |
| <li><p><code>--standalone</code> and <code>--version</code> allow control |
| of the XML declaration that is output. (Note that TagSoup's XML output |
| is always version 1.0, even if you use <code>--version=1.1</code>.)</p></li> |
| |
| <li><p><code>--norootbogons</code> causes unknown elements not to be allowed |
| as the document root element. Instead, they are made children of the |
| default root element (the <code>html</code> element for HTML).</p></li> |
| |
| </ul> |
| <li><p>The TagSoup core now supports character entities with values |
| above U+FFFF. As a consequence, the HTML schema now supports all |
| 2,210 standard character entities from the |
| <a href="http://www.w3.org/TR/2007/WD-xml-entity-names-20071214"> |
| 2007-12-14 draft of XML Entity Definitions for Characters</a>, except the |
| 94 which require more than one Unicode character to represent.</p></li> |
| |
| <li>The SAX events <code>startPrefixMapping</code> and |
| <code>endPrefixMapping</code> are now being reported for all cases of |
| foreign elements and attributes.</li> |
| |
| <li><p>All bugs around newline processing on Windows should now be gone.</p></li> |
| |
| <li>A number of content models have been loosened to allow elements |
| to appear in new and non-standard (but commonly found) places. |
| In particular, tables are now allowed inside paragraphs, against the |
| letter of the W3C specification.</p> |
| |
| <li><p>Since the <code>span</code> element is intended for fine |
| control of appearance using CSS, it should never have been a |
| restartable element. This very long-standing bug has now been |
| fixed.</p></li> |
| |
| <li><p>The following non-standard elements are now at least partly |
| supported: <code>bgsound</code>, <code>blink</code>, <code>canvas</code>, |
| <code>comment</code>, <code>listing</code>, <code>marquee</code>, |
| <code>nobr</code>, <code>rbc</code>, <code>rb</code>, <code>rp</code>, |
| <code>rtc</code>, <code>rt</code>, <code>ruby</code>, <code>wbr</code>, |
| <code>xmp</code>.</p></li> |
| |
| <li><p>In HTML output mode, boolean attributes like <code>checked</code> |
| are now output as such, rather than in XML style as |
| <code>checked="checked"</code>.</p></li> |
| |
| <li><p>Runs of < characters such as << and <<< are now |
| handled correctly in text rather than being transformed into extremely |
| bogus start-tags.</p></li> |
| |
| </ul> |
| |
| <p><a href="tagsoup-1.2.jar">Download</a> the TagSoup 1.2 jar |
| file here. |
| It's about 87K long.<br/> |
| <a href="tagsoup-1.2-src.zip">Download</a> the full TagSoup 1.2 source |
| here. If you don't have zip, you can use jar to unpack it. <br/> |
| <a href="CHANGES">Download</a> the current CHANGES file here.</p> |
| |
| <h3>TagSoup 1.1 released</h3> |
| |
| <p>TagSoup 1.1 adds Tatu Saloranta's JAXP support for TagSoup. |
| To use TagSoup within the JAXP framework (which is not |
| something I necessarily recommend, but it is part of the Java |
| XML platform), you can create a <tt>SAXParser</tt> by calling |
| <tt>org.ccil.cowan.tagsoup.jaxp.SAXParserImpl.newInstance()</tt>. You can |
| also set the system property <tt>javax.xml.parsers.SAXParserFactory</tt> |
| to <tt>org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl</tt>, but <i>be |
| aware</i> that doing this will cause all JAXP-based XML parsing to go |
| through TagSoup, which is a Bad Thing if your application also reads |
| XML documents.</p> |
| |
| <h3>What TagSoup does</h3> |
| <p>TagSoup is designed as a parser, not a whole application; it isn't |
| intended to permanently clean up bad HTML, as |
| <a href="http://tidy.sf.net">HTML Tidy</a> does, only to |
| parse it on the fly. Therefore, it does not convert presentation HTML |
| to CSS or anything similar. It does guarantee well-structured results: |
| tags will wind up properly nested, default attributes will appear |
| appropriately, and so on.</p> |
| |
| <p>The semantics of TagSoup are as far as practical those of actual HTML |
| browsers. In particular, never, never will it throw any sort of syntax |
| error: the TagSoup motto is |
| <a href="http://www.crumbmuseum.com/truckin.html"> |
| "Just Keep On Truckin'"</a>. But there's much, |
| much more. For example, if the first tag is LI, it will supply the |
| application with enclosing HTML, BODY, and UL tags. Why UL? Because |
| that's what browsers assume in this situation. For the same reason, |
| overlapping tags are correctly restarted whenever possible: text like:</p> |
| |
| <pre>This is <B>bold, <I>bold italic, </b>italic, </i>normal text |
| </pre> |
| |
| <p>gets correctly rewritten as:</p> |
| |
| <pre>This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text. |
| </pre> |
| |
| <p>By intention, TagSoup is small and fast. It does not |
| depend on the existence of any framework other than SAX, and should be |
| able to work with any framework that can accept SAX parsers. |
| In particular, <a href="http://www.cafeconleche.org/XOM">XOM</a> |
| is known to work. |
| |
| <p>You can replace the low-level HTML scanner with one based on Sean McGrath's |
| <a href="http://gnosis.cx/publish/programming/xml_matters_17.html">PYX</a> |
| format (very close to James Clark's ESIS format). You can also supply |
| an AutoDetector that peeks at the incoming byte stream and guesses a |
| character encoding for it. Otherwise, the platform default is used. |
| If you need an autodetector of character sets, consider trying to |
| adapt the <a href="http://jchardet.sourceforge.net/">Mozilla one</a>; |
| if you succeed, let me know.</p> |
| |
| <h3>Note: TagSoup in Java 1.1</h3> |
| |
| <p>If you go through the TagSoup source and replace all references to |
| <code>HashMap</code> with <code>Hashtable</code> and recompile, |
| TagSoup will work fine in Java 1.1 VMs. Thanks to Thorbjørn |
| Vinne for this discovery.<p> |
| |
| <h3>The TSaxon XSLT-for-HTML processor</h3> |
| <p><a href="http://www.ccil.org/~cowan">I</a> am also distributing |
| <a href="tsaxon">TSaxon</a>, a repackaging of version 6.5.5 of Michael |
| Kay's Saxon XSLT version 1.0 implementation that includes TagSoup. |
| TSaxon is a drop-in replacement for Saxon, and can be used to process |
| either HTML or XML documents with XSLT stylesheets. |
| |
| |
| <h3>TagSoup as a stand-alone program</h3> |
| <p>It is possible to run TagSoup as a program by saying <code>java |
| -jar tagsoup-1.0.1 [<i>option ...</i>] [<i>file ...</i>]</code>. |
| Files mentioned on the command line will be parsed individually. If no |
| files are specified, the standard input is read.</p> |
| |
| <p>The following options are understood:</p> |
| |
| <dl> |
| <dt><code>--files</code></dt> |
| <dd>Output into individual files, with <code>html</code> extensions changed |
| to <code>xhtml</code>. Otherwise, all output is sent to the standard output.</dd> |
| |
| <dt><code>--html</code></dt> |
| <dd>Output is in clean HTML: the XML declaration is suppressed, as are end-tags |
| for the known empty elements.</dd> |
| |
| <dt><code>--omit-xml-declaration</code></dt> |
| <dd>The XML declaration is suppressed.</dd> |
| |
| <dt><code>--method=html</code></dt> |
| <dd>End-tags for the known empty HTML elements are suppressed.</dd> |
| |
| <dt><code>--doctype-system=<i>systemid</i></code></dt> |
| <dd>Forces the output of a <code>DOCTYPE</code> declaration with the specified systemid.</dd> |
| |
| <dt><code>--doctype-public=<i>publicid</i></code></dt> |
| <dd>Forces the output of a <code>DOCTYPE</code> declaration with the specified publicid.</dd> |
| |
| <dt><code>--version=<i>version</i></code></dt> |
| <dd>Sets the version string in the XML declaration.</dd> |
| |
| <dt><code>--standalone=</code>[<code>yes</code>|<code>no</code>]</dt> |
| <dd>Sets the standalone declaration to yes or no.</dd> |
| |
| <dt><code>--pyx</code></dt> |
| <dd>Output is in PYX format.</dd> |
| |
| <dt><code>--pyxin</code></dt> |
| <dd>Input is in PYXoid format (need not be well-formed).</dd> |
| |
| <dt><code>--nons</code></dt> |
| <dd>Namespaces are suppressed. Normally, all elements are in the XHTML 1.x |
| namespace, and all attributes are in no namespace.</dd> |
| |
| <dt><code>--nobogons</code></dt> |
| <dd>Bogons (unknown elements) are suppressed.</dd> |
| |
| <dt><code>--nodefaults</code></dt> |
| <dd>suppress default attribute values</dd> |
| |
| <dt><code>--nocolons</code></dt> |
| <dd>change explicit colons in element and attribute names to underscores</dd> |
| |
| <dt><code>--norestart</code></dt> |
| <dd>don't restart any normally restartable elements</dd> |
| |
| <dt><code>--ignorable</code></dt> |
| <dd>output whitespace in elements with element-only content</dd> |
| |
| <dt><code>--emptybogons</code></dt> |
| <dd>Bogons are given a content model of EMPTY rather than ANY.</dd> |
| |
| <dt><code>--any</code></dt> |
| <dd>Bogons are given a content model of ANY rather than EMPTY (default).</dd> |
| |
| <dt><code>--norootbogons</code></dt> |
| <dd>Don't allow bogons to be root elements; make them subordinate to the root.</dd> |
| |
| <dt><code>--lexical</code></dt> |
| <dd>Pass through HTML comments and DOCTYPE declarations. Has no effect |
| when output is in PYX format.</dd> |
| |
| <dt><code>--reuse</code></dt> |
| <dd>Reuse a single instance of TagSoup parser throughout. Normally, a new one |
| is instantiated for each input file.</dd> |
| |
| <dt><code>--nocdata</code></dt> |
| <dd>Change the content models of the <code>script</code> and <code>style</code> elements |
| to treat them as ordinary #PCDATA (text-only) elements, as in XHTML, rather than |
| with the special CDATA content model.</dd> |
| |
| <dt><code>--encoding=</code><i>encoding</i></dt> |
| <dd>Specify the input encoding. The default is the Java platform default.</dd> |
| |
| <dt><code>--output-encoding=</code><i>encoding</i></dt> |
| <dd>Specify the output encoding. The default is the Java platform default.</dd> |
| |
| <dt><code>--help</code></dt> |
| <dd>Print help.</dd> |
| |
| <dt><code>--version</code></dt> |
| <dd>Print the version number.</dd> |
| |
| </dl> |
| |
| <a name="properties"></a><h3>SAX features and properties</h3> |
| |
| <p>TagSoup supports the following SAX features in addition to the |
| standard ones:</p> |
| |
| <dl> |
| <dt><tt>http://www.ccil.org/~cowan/tagsoup/features/ignore-bogons</tt></dt> |
| <dd>A value of "true" indicates that the parser will ignore |
| unknown elements.</dd> |
| |
| <dt><tt>http://www.ccil.org/~cowan/tagsoup/features/bogons-empty</tt></dt> |
| <dd>A value of "true" indicates that the parser will give unknown |
| elements a content model of EMPTY; a value of "false", a |
| content model of ANY.</dd> |
| |
| <dt><tt>http://www.ccil.org/~cowan/tagsoup/features/root-bogons</tt></dt> |
| <dd>A value of "true" indicates that the parser will allow unknown |
| elements to be the root of the output document.</dd> |
| |
| |
| <dt><tt>http://www.ccil.org/~cowan/tagsoup/features/default-attributes</tt></dt> |
| <dd>A value of "true" indicates that the parser will return default |
| attribute values for missing attributes that have default values.</dd> |
| |
| <dt><tt>http://www.ccil.org/~cowan/tagsoup/features/translate-colons</tt></dt> |
| <dd>A value of "true" indicates that the parser will |
| translate colons into underscores in names.</dd> |
| |
| <dt><tt>http://www.ccil.org/~cowan/tagsoup/features/restart-elements</tt></dt> |
| <dd>A value of "true" indicates that the parser will |
| attempt to restart the restartable elements.</dd> |
| |
| <dt><tt>http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace</tt></dt> |
| <dd>A value of "true" indicates that the parser will |
| transmit whitespace in element-only content via the SAX |
| ignorableWhitespace callback. Normally this is not done, |
| because HTML is an SGML application and SGML suppresses |
| such whitespace.</dd> |
| |
| <dt><tt>http://www.ccil.org/~cowan/tagsoup/features/cdata-elements</tt></dt> |
| <dd>A value of "true" indicates that the parser will |
| process the <tt>script</tt> and <tt>style</tt> elements |
| (or any elements with <tt>type='cdata'</tt> in the TSSL schema) |
| as SGML CDATA elements (that is, no markup is recognized except |
| the matching end-tag).</dd> |
| |
| </dl> |
| |
| <p>TagSoup supports the following SAX properties in addition to |
| the standard ones:</p> |
| |
| <dl> |
| |
| <dt><tt>http://www.ccil.org/~cowan/tagsoup/properties/scanner</tt></dt> |
| <dd>Specifies the Scanner object this parser uses.</dd> |
| |
| <dt><tt>http://www.ccil.org/~cowan/tagsoup/properties/schema</tt></dt> |
| <dd>Specifies the Schema object this parser uses.</dd> |
| |
| <dt><tt>http://www.ccil.org/~cowan/tagsoup/properties/auto-detector</tt></dt> |
| <dd>Specifies the AutoDetector (for encoding detection) this parser uses.</dd> |
| |
| </dl> |
| |
| <h3>More information</h3> |
| <p>I gave a presentation (a nocturne, so it's not on the schedule) at |
| <a href="http://www.extrememarkup.com/extreme/2004">Extreme Markup Languages 2004</a> |
| about TagSoup, updated from the one |
| presented in 2002 at the New York City XML SIG and at XML 2002. |
| This is the main high-level documentation about how TagSoup works. |
| Formats: |
| <a href="tagsoup.odp">OpenDocument</a> |
| <a href="tagsoup.ppt">Powerpoint</a> |
| <a href="tagsoup.pdf">PDF</a>. |
| |
| <p>I also had people add <a href="extreme.html">"evil" HTML</a> to a large |
| poster so that I could <a href="extreme.xhtml">clean it up</a>; |
| View Source is probably more useful than ordinary browsing. |
| The original instructions were:</p> |
| |
| <p align="center">SOUPE DE BALISES (BE EVIL)!</br> |
| Ecritez une balise ouvrante (sans attributs)<br/> ou fermante HTML ici, s.v.p.<p/> |
| |
| |
| <p>There is a <a href="http://groups.yahoo.com/group/tagsoup-friends"> |
| tagsoup-friends</a> mailing list hosted at <a href="http://groups.yahoo.com"> |
| Yahoo Groups</a>. You can |
| <a href="http://groups.yahoo.com/group/tagsoup-friends/join">join</a> |
| via the Web, or by sending a blank email to |
| <a href="mailto:tagsoup-friends-subscribe@yahoogroups.com"><i> |
| tagsoup-friends-subscribe@yahoogroups.com</i></a>. |
| The <a href="http://groups.yahoo.com/group/tagsoup-friends/messages"> |
| archives</a> are open to all.</p> |
| |
| <p>Online TagSoup processing for publicly accessible HTML documents |
| is now <a href="http://xmlarmyknife.org/docs/xhtml/tagsoup/">available</a> |
| courtesy of Leigh Dodds.</p> |