 .\" This file is part of TagSoup and is Copyright 2002-2008 by John Cowan.
 .\"
 .\" TagSoup is licensed under the Apache License,
 .\" Version 2.0.  You may obtain a copy of this license at
 .\" http://www.apache.org/licenses/LICENSE-2.0 .  You may also have
 .\" additional legal rights not granted by this license.
 .\"
 .\" TagSoup is distributed in the hope that it will be useful, but
 .\" unless required by applicable law or agreed to in writing, TagSoup
 .\" is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
 .\" OF ANY KIND, either express or implied; not even the implied warranty
 .\" of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 .\"
 .TH TAGSOUP "1" "January 2008" "TagSoup 1.2" "User Commands"
 .SH NAME
 tagsoup \- convert nasty, ugly HTML to clean XHTML
 .SH SYNOPSIS
 .B java -jar tagsoup-1.2.jar
 [
 .I options
 ] [
 .I files
 ]
 .SH DESCRIPTION
 .\" Add any additional description here
 .PP
 Rectify arbitrary HTML into clean XHTML,
 using a tailored description of HTML.
 The output will be well-formed XML, but not necessarily
 .I valid
 XHTML.
 .PP
 .TP
 .B --files
 multiple input
 .I files
 should be processed into corresponding output files
 .TP
 .BI --encoding= encoding
 specifies the encoding of input files
 .TP
 .BI --output-encoding= encoding
 specifies the encoding of the output
 (if the encoding name begins with ``utf'',
 the output will not contain character entities;
 otherwise, all non-ASCII characters are
 represented as entities)
 .TP
 .B --html
 output rectified HTML rather than XML,
 omitting the XML declaration
 and any namespace declarations
 .TP
 .B --method=html
 output rectified HTML rather than XML
 (end-tags are omitted for empty elements,
 and no character escaping is done in
 script and style elements)
 .TP
 .B --omit-xml-declaration
 omit the XML declaration
 .TP
 .B --lexical
 output lexical features (specifically comments and any DOCTYPE declaration)
 .TP
 .B --nons
 suppress namespaces in output
 .TP
 .B --nobogons
 suppress unknown non-HTML elements in output
 .TP
 .B --nodefaults
 suppress default attribute values
 .TP
 .B --nocolons
 change explicit colons
 in element and attribute names
 to underscores
 .TP
 .B --norestart
 don't restart any restartable elements
 .TP
 .B --ignorable
 pass through ignorable whitespace
 (whitespace in element-only content)
 via the SAX
 .I ignorableWhitespace
 handler method
 .TP
 .B --any
 treat unknown non-HTML elements as allowing any content (default)
 .TP
 .B --emptybogons
 treat unknown non-HTML elements as empty elements
 .TP
 .B --norootbogons
 don't allow unknown non-HTML elements to be root elements
 .TP
 .BI --doctype-system= system-id
 force DOCTYPE declaration to be output with specified system identifier
 .TP
 .BI --doctype-public= public-id
 force DOCTYPE declaration to be output with specified public identifier
 .TP
 .B --standalone=[yes|no]
 specify standalone pseudo-attribute in output XML declaration
 .TP
 .BI --version= version
 specify version pseudo-attribute in output XML declaration
 (does not affect actual version of XML output)
 .TP
 .B --nocdata
 treat the CDATA-content elements
 .I script
 and
 .I style
 as ordinary elements
 (mostly for testing)
 .TP
 .B --pyx
 output PYX format rather than XML
 (mostly for testing)
 .TP
 .B --pyxin
 input is PYX-format HTML
 (mostly for testing)
 .TP
 .B --reuse
 reuse the same Parser object internally
 (for testing only)
 .TP
 .B --help
 output basic help
 .TP
 .B --version
 output version number
 .PP
 .B TagSoup
 is a parser and reformatter for nasty, ugly HTML.
 Its normal processing mode is to accept HTML files on the command line,
 or from the standard input if none are given, and output them
 as clean XML
 to the standard output.  By default, the input is assumed to be in the
 platform-local encoding and the output is written in UTF-8; the
 .B --encoding
 and
 .B --output-encoding
 options override these defaults.
 .PP
 When the
 .B --files
 option is given, each input file is processed into an output file of the
 corresponding name, with the extension changed to
 .IR xhtml .
 If the extension is already
 .IR xhtml ,
 it is changed to
 .IR xhtml_ .
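.PP
The naming rule for
.B --files
can be sketched in Python as follows.
This is only an illustration, not TagSoup's implementation:
the function name
.I output_name
is hypothetical, and the case-sensitive extension comparison
and the handling of names without an extension are assumptions.
.nf
```python
import os.path


def output_name(input_name: str) -> str:
    """Derive the output file name used for one input file.

    Assumptions: the extension comparison is case-sensitive, and a name
    with no extension simply gains ".xhtml".
    """
    root, ext = os.path.splitext(input_name)
    if ext == ".xhtml":
        # Avoid overwriting the input: xhtml becomes xhtml_.
        return root + ".xhtml_"
    return root + ".xhtml"
```
.fi
.PP
For example, an input named index.html maps to index.xhtml,
and page.xhtml maps to page.xhtml_.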
 .PP
 TagSoup will repair, by whatever means necessary,
 violations of XML well-formedness.  In particular, it will fix up
 malformed attribute names and supply missing attribute-value quotation marks.
 More significantly, it supplies end-tags where HTML allows them
 to be omitted, and sometimes where it doesn't.  It will even supply
 start-tags where necessary; for example, if a document begins with a
 <li> tag, TagSoup will automatically prefix it with <html><body><ul>.
 .PP
 .SH BUGS
 TagSoup can be fooled by missing close quotes after attribute values, and by
 incorrect character encodings (it does not contain an encoding guesser).
 .PP
 TagSoup doesn't understand namespace declarations, which are not properly
 part of HTML.  Instead, any element or attribute name beginning
 .IR foo :
 will be put into the artificial namespace
 .RI urn:x-prefix: foo .
 .PP
 For the same reasons, namespace-qualified attributes like
 xml:space
 can't be returned as default values,
 though an explicit attribute in the xml namespace
 will be returned with the proper namespace URI.
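.PP
The prefix handling described above can be summarized
with a short Python sketch.
It is an illustration only: the function name
.I namespace_for
is hypothetical, and returning None for unprefixed names
is a simplification, since this page does not say
which namespace such names receive.
.nf
```python
from typing import Optional

# The namespace URI that the XML Namespaces specification binds to "xml".
XML_NS = "http://www.w3.org/XML/1998/namespace"


def namespace_for(qname: str) -> Optional[str]:
    """Map a prefixed name to the namespace described in the BUGS section."""
    prefix, sep, _local = qname.partition(":")
    if not sep:
        # Unprefixed names are outside the scope of this sketch.
        return None
    if prefix == "xml":
        # Explicit xml: attributes get the proper namespace URI.
        return XML_NS
    # Any other prefix lands in an artificial namespace.
    return "urn:x-prefix:" + prefix
```
.fi
.PP
Under this sketch, foo:bar maps to urn:x-prefix:foo,
while xml:space maps to the real XML namespace URI.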
 .SH AUTHOR
 John Cowan <cowan@ccil.org>
 .SH COPYRIGHT
 Copyright \(co 2002-2008 John Cowan
 .br
 TagSoup is free software; see the source for copying conditions.  There is NO
 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.