 TAGSOUP(1)                       User Commands                      TAGSOUP(1)

 This file is part of TagSoup and is Copyright 2002-2008 by John Cowan.

 TagSoup is licensed under the Apache License, Version 2.0.  You may
 obtain a copy of this license at
 http://www.apache.org/licenses/LICENSE-2.0 .  You may also have
 additional legal rights not granted by this license.

 TagSoup is distributed in the hope that it will be useful, but unless
 required by applicable law or agreed to in writing, TagSoup is
 distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied; not even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

 NAME
        tagsoup - convert nasty, ugly HTML to clean XHTML

 SYNOPSIS
        java -jar tagsoup-1.2.jar [ options ] [ files ]

 DESCRIPTION
        Rectify  arbitrary  HTML into clean XHTML, using a tailored description
        of HTML.  The output will be well-formed XML, but not necessarily valid
        XHTML.


        --files
               multiple input files should be processed into corresponding
               output files

        --encoding=encoding
               specifies the encoding of input files

        --output-encoding=encoding
               specifies the encoding of the output (if the encoding name
               begins with "utf", the output will not contain character
               entities; otherwise, all non-ASCII characters are represented
               as entities)

        --html output rectified HTML rather than XML, omitting the XML
               declaration and any namespace declarations

        --method=html
               output rectified HTML rather than XML (end-tags are omitted  for
               empty  elements, and no character escaping is done in script and
               style elements)

        --omit-xml-declaration
               omit the XML declaration

        --lexical
               output lexical features (specifically comments and  any  DOCTYPE
               declaration)

        --nons suppress namespaces in output

        --nobogons
               suppress unknown non-HTML elements in output

        --nodefaults
               suppress default attribute values

        --nocolons
               change explicit colons in element and attribute names to
               underscores

        --norestart
               don’t restart any restartable elements

        --ignorable
               pass through ignorable whitespace (whitespace in element-only
               content) via the SAX handler method ignorableWhitespace

        --any  treat   unknown   non-HTML  elements  as  allowing  any  content
               (default)

        --emptybogons
               treat unknown non-HTML elements as empty elements

        --norootbogons
               don’t allow unknown non-HTML elements to be root elements

        --doctype-system=system-id
               force DOCTYPE declaration to be  output  with  specified  system
               identifier

        --doctype-public=public-id
               force  DOCTYPE  declaration  to  be output with specified public
               identifier

        --standalone=[yes|no]
               specify standalone pseudo-attribute in output XML declaration

        --version=version
               specify version pseudo-attribute in output XML declaration (does
               not affect actual version of XML output)

        --nocdata
               treat  the  CDATA-content  elements script and style as ordinary
               elements (mostly for testing)

        --pyx  output PYX format rather than XML (mostly for testing)

        --pyxin
               input is PYX-format HTML (mostly for testing)

        --reuse
               reuse the same Parser object internally (for testing only)

        --help output basic help

        --version
               output version number
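
        The --output-encoding rule listed above can be sketched in Python.
        This only illustrates the documented behaviour and is not TagSoup's
        own code: the function name escape_for_encoding, the case-insensitive
        prefix test, and the use of decimal numeric character references are
        all assumptions of the sketch.

```python
def escape_for_encoding(text: str, encoding: str) -> str:
    """Show how non-ASCII text might be written for a given output encoding."""
    # Assumption: the "utf" prefix test ignores case.
    if encoding.lower().startswith("utf"):
        return text  # UTF output: characters are written as-is
    # Any other encoding: every non-ASCII character becomes an entity.
    # Decimal numeric references are an assumption of this sketch.
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)
```

        For instance, the word "café" would pass through unchanged for
        utf-8, while for iso-8859-1 the final letter would be written as an
        entity.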

        TagSoup is a parser and reformatter for nasty, ugly HTML.   Its  normal
        processing  mode  is  to accept HTML files on the command line, or from
        the standard input if none are given, and output them as clean  XML  to
        the  standard output.  The encoding is assumed to be the platform-local
        encoding on input, and is always UTF-8 on output.

        When the --files option is given, each input file is processed into  an
        output  file  of  the corresponding name, with the extension changed to
        xhtml.  If the extension is already xhtml, it is changed to xhtml_.
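
        The naming rule for --files can be sketched as follows.  This is a
        minimal Python illustration, not TagSoup's implementation; the
        function name output_name is hypothetical, and the handling of a
        file with no extension (appending .xhtml) is an assumption, since
        the text above does not cover that case.

```python
import os


def output_name(path: str) -> str:
    """Derive the output file name for an input file under --files."""
    root, ext = os.path.splitext(path)
    if ext == ".xhtml":
        # Already .xhtml: append an underscore so the input is not overwritten.
        return path + "_"
    # Otherwise swap the extension for .xhtml (assumption: a name with no
    # extension simply gains one).
    return root + ".xhtml"
```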

        TagSoup will repair, by whatever means necessary, violations of XML
        well-formedness.  In particular, it will fix up malformed attribute
        names and supply missing attribute-value quotation marks.  More
        significantly, it supplies end-tags where HTML allows them to be
        omitted, and sometimes where it doesn’t.  It will even supply
        start-tags where necessary; for example, if a document begins with a
        <li> tag, TagSoup will automatically prefix it with <html><body><ul>.


 BUGS
        TagSoup can be fooled by missing close quotes after  attribute  values,
        and  by  incorrect character encodings (it does not contain an encoding
        guesser).

        TagSoup doesn’t understand namespace declarations, which are not
        properly part of HTML.  Instead, any element or attribute name
        beginning with a prefix such as foo: will be put into the artificial
        namespace urn:x-prefix:foo.

        For the same reasons,  namespace-qualified  attributes  like  xml:space
        can’t  be  returned  as default values, though an explicit attribute in
        the xml namespace will be returned with the proper namespace URI.
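
        The prefix handling described in the two paragraphs above can be
        sketched in Python.  The function name namespace_for is
        hypothetical, and the sketch does not model unprefixed names (it
        returns None for them); only the urn:x-prefix: mapping and the
        special case for the xml prefix come from the text.

```python
from typing import Optional

# The namespace URI reserved for the xml prefix by the XML Namespaces spec.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def namespace_for(qname: str) -> Optional[str]:
    """Map a prefixed element or attribute name to a namespace URI."""
    prefix, sep, _local = qname.partition(":")
    if not sep:
        return None  # unprefixed names are outside the scope of this sketch
    if prefix == "xml":
        return XML_NS  # explicit xml: attributes keep the proper URI
    # Any other prefix gets an artificial namespace built from its name.
    return "urn:x-prefix:" + prefix
```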

 AUTHOR
        John Cowan <cowan@ccil.org>

 COPYRIGHT
        Copyright © 2002-2008 John Cowan
        TagSoup is free software; see the source for copying conditions.  There
        is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
        PARTICULAR PURPOSE.


 TagSoup 1.2                      January 2008                       TAGSOUP(1)