Transcribed from an email to Olivier Thereaux. A friend writes:
There is a reported bug in the validator, that SGML character number
128-159 are not allowed for xml-based markup languages.
We have a test case at:
Our parser is opensp, and our opensp uses
In the bugzilla item I mentioned above, Terje Bless, who generally
knows much more about SGML than I do, thinks it may just be that our
sgml declaration for xml should be updated to include this character
range. http://www.w3.org/Bugs/Public/show_bug.cgi?id=3164#c5
As I am rather confused by the issue, I'd appreciate any guidance,
diagnosis, or pointer, you could provide.
This is the second time this week I've indulged my
inner language-lawyer in response to some query. I reproduce
my reply to this question here, as an Awful Warning to those
who might otherwise be tempted to ask me questions about XML.
What I wrote was (more or less):
OK, I'll try.
I'm going to ask a number of short, pointed questions, provide long,
digressive answers (sorry about that), and then say what I think it
all means for your problem.
Note that the character range x80-x9F is known, for historical
reasons, as "the C1 range" or "the C1 characters". I can explain if
you wish. But be careful what you wish for.
In XML 1.0, the grammar production for Char includes them, so yes.
The formulation of the Char production has changed from time to time,
but 7F and the C1 range have always been included.
In XML 1.1, the grammar production for Char continues to allow them,
but the 'document' production takes care to exclude them (in their
literal form) from the document.
In 1.1, the C1 characters may be referred to using numeric character
references (&#x80; etc.) but not used as literal characters.
So: XML 1.0 does not forbid the use of these characters. XML 1.1
forbids their appearance as literals but not as numeric character
references.
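
For readers who prefer to see the rule as code, here is a minimal sketch in Python. The ranges are transcribed from the Char productions of XML 1.0 and 1.1 and from the RestrictedChar production of XML 1.1 (which, note, leaves x85 out of the restricted set); the function names are my own invention for illustration and come from no library.

```python
# Code point ranges transcribed from the XML 1.0 and XML 1.1 grammars.
XML10_CHAR = [(0x9, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
              (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
XML11_CHAR = [(0x1, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
XML11_RESTRICTED = [(0x1, 0x8), (0xB, 0xC), (0xE, 0x1F),
                    (0x7F, 0x84), (0x86, 0x9F)]


def in_ranges(cp, ranges):
    """True if code point cp falls in any (low, high) inclusive range."""
    return any(low <= cp <= high for low, high in ranges)


def literal_ok(cp, version):
    """May cp appear as a literal character in a document of this version?"""
    if version == "1.0":
        return in_ranges(cp, XML10_CHAR)
    # XML 1.1: Char minus RestrictedChar
    return in_ranges(cp, XML11_CHAR) and not in_ranges(cp, XML11_RESTRICTED)


def charref_ok(cp, version):
    """May cp be written as a numeric character reference?"""
    # Character references need only match the Char production.
    return in_ranges(cp, XML10_CHAR if version == "1.0" else XML11_CHAR)


for cp in (0x7F, 0x80, 0x9F):
    print(hex(cp),
          "1.0 literal:", literal_ok(cp, "1.0"),
          "1.1 literal:", literal_ok(cp, "1.1"),
          "1.1 charref:", charref_ok(cp, "1.1"))
```

This checks single code points only; it is not a parser and says nothing about where in a document a character occurs.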
It might be argued (I think Chris Lilley has done so) that since the
C1 characters aren't really Unicode, they aren't legal in a document
whose document character set is supposed to be Unicode.
The Unicode 2.0 spec open on my desk, however, says
Like the C0 control codes, the Unicode Standard makes no specific
use of these C1 control codes, but provides for the passage of
their numeric code values intact, neither adding to nor
subtracting from their semantics. The semantics of the C1
controls are generally determined by the application with which
they are used. However, in the absence of specific application
uses, they may be interpreted according to the semantics specified
in ISO 6429.
(p. 6-5, section "Latin-1 Supplement: U+0080 - U+00FF")
I take that to mean that for all intents and purposes they are legal
Unicode characters. That Unicode does not assign meanings to them
does not constitute an argument that they are excluded from Unicode:
there are lots of gaps in Unicode. U+0FB0, for example, is also not
defined as meaning a specific character (at least in Unicode 2.0; I'm
too lazy to check the current version), but it's clearly got to be
accepted in a Unicode data stream.
So: Unicode includes these characters.
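
A quick way to check this against a current Unicode database, rather than the 2.0 book on my desk, is Python's standard unicodedata module. This is only a sketch: it reports what the Unicode version bundled with your Python says, which is an assumption about your installation, not a statement about Unicode 2.0.

```python
import unicodedata

# The C1 range, x80 through x9F inclusive.
c1 = [chr(cp) for cp in range(0x80, 0xA0)]

# Collect the general categories assigned to the C1 code points.
categories = {unicodedata.category(ch) for ch in c1}
print(categories)

# Each of them survives a round trip through UTF-8 unchanged.
round_trips = all(ch.encode("utf-8").decode("utf-8") == ch for ch in c1)
print(round_trips)
```

The general category reported for all of them is Cc (control), the same category as the C0 controls: they have no assigned names or glyphs, but they are code points like any others.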
No. The test case you linked to illustrates that very nicely.
Yes, the declaration at
The document character set is defined by a CHARSET declaration
The "document character set" as defined by SGML is rather unlike the
"document character set" concept of HTML 4, which brilliantly co-opted
the SGML term and gave it a new and better meaning. (At least, that's
the way I understand the history of events.)
As defined by SGML, the document character set is the actual coded
character set (aka character encoding) the parser can expect to
encounter, conceived as a mapping from integers to characters. The
bit combinations come in, and the parser knows what characters they
represent by reference to the character set declaration.
In HTML, by contrast, the "document character set" is the repertoire
of abstract objects called "characters" which may occur in an HTML
document and which are mapped 1:1 with a set of integers. The integer
mappings are relevant for numeric character references, but for
nothing else. In particular, the HTML spec explicitly clarifies that
the document character set has nothing in particular to do with the
encoding in which data may arrive, except that the abstract characters
encoded by the encoding had better be present in the document
character set. (The HTML and later XML view is concisely summarized
by Gavin Nicol at
XML essentially adopted the ideas of HTML 4 on this point: the ISO
10646/Unicode character set is conceived as a large and abstract
pairing of integers and characters, one step divorced from the messy
business of actual encoding.
So from the point of view of an SGML processor, the character set
description above documents which bit patterns will and won't occur in
the input stream. (The ones that won't occur are important, because
the SGML spec assumes a processor may want to use them for its own
internal purposes.) From the HTML and XML point of view, the
description documents which characters will and won't occur.
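
The difference between the two views can be made concrete with a small sketch using Python's standard codecs. The choice of the byte x80 and of these particular encodings is mine, for illustration only.

```python
# One bit pattern, two coded character sets: the SGML-style question is
# "which character does this byte represent in the incoming stream?"
byte = b"\x80"
as_latin1 = byte.decode("latin-1")   # the C1 control U+0080
as_cp1252 = byte.decode("cp1252")    # the euro sign U+20AC

print(hex(ord(as_latin1)), hex(ord(as_cp1252)))

# One abstract character, several encodings: the HTML/XML-style view is
# that the integer x20AC identifies the character, and the bytes used to
# carry it are a separate matter.
euro = "\u20ac"
encodings = {name: euro.encode(name)
             for name in ("utf-8", "utf-16-be", "cp1252")}
for name, data in encodings.items():
    print(name, data.hex())
```

In the HTML and XML view a numeric character reference names the integer on the abstract side of this picture, never the bytes, which is why the document's encoding has no effect on what a reference such as &#x80; means.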
How does it work? The BASESET says that we'll describe the document
character set by reference to the coded character set whose public
identifier is "ISO Registration Number 177//CHARSET ISO/IEC
10646-1:1993 UCS-4 with implementation level 3//ESC 2/5 2/15 4/6".
The reader is assumed to be in a position to understand what
references to that character spec mean.
The DESCSET bit contains a sequence of triples which assign meanings
to integers, using a kind of run-length documentation.
And so on. So the lines
A character set declaration similar to this one, but which allows
DEL and the C1 range, would have a DESCSET section like this:
You could of course replace the three lines for 127-55295 with
the single line
So: the SGML declaration used by many people as representing the rules
of XML 1.0 disagrees with the XML 1.0 spec on the characters x7F
through x9F.
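
The run-length reading of DESCSET can be sketched in a few lines of Python. Be warned that the triples below are illustrative: I have written them to be consistent with the ranges discussed here (127 and 128-159 marked UNUSED, with the described range running up to 55295), and they are not quoted from any published declaration. The names DESCSET_STRICT, DESCSET_RELAXED and describe are likewise hypothetical.

```python
UNUSED = None  # stands in for the SGML keyword UNUSED

# Each triple is (first document character number, how many numbers the
# run covers, first base set number or UNUSED).
DESCSET_STRICT = [
    (0, 9, UNUSED),
    (9, 2, 9),
    (11, 2, UNUSED),
    (13, 1, 13),
    (14, 18, UNUSED),
    (32, 95, 32),
    (127, 1, UNUSED),      # DEL
    (128, 32, UNUSED),     # the C1 range
    (160, 55136, 160),
]

# The same description with 127-55295 collapsed into one permissive run.
DESCSET_RELAXED = DESCSET_STRICT[:6] + [(127, 55169, 127)]


def describe(number, descset):
    """Return the base set number that `number` maps to, or UNUSED."""
    for start, count, base in descset:
        if start <= number < start + count:
            return UNUSED if base is UNUSED else base + (number - start)
    raise ValueError(f"{number} is not described by this DESCSET")


for n in (0x41, 0x7F, 0x80, 0x9F, 0xA0):
    print(hex(n), describe(n, DESCSET_STRICT), describe(n, DESCSET_RELAXED))
```

Under the first list the numbers 127 through 159 come back as UNUSED; under the second they map to themselves, which is what the XML 1.0 Char production calls for.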
It may be worth noting that the rule in the XML 1.1 spec which some
people find odd, that says that the characters in the range x7F-x9F
may be referred to using numeric character references but must not
appear as literals, is precisely the rule implied by the SGML
declaration: by marking the characters UNUSED, the SGML declaration
says they don't appear as literals, but not that they can't be
referred to numerically.
As far as I can tell, it's always been there.
I have looked at all the published drafts of XML 1.0 to see if some
early draft excluded the C1 characters; no, as mentioned above they
all include 7F and the C1 controls.
I have consulted Dave Peterson, who worked intensively with me in the
winter of 1996-97 to categorize all of the divergences between SGML
and the first draft of XML, and who on the basis of that work prepared
the first draft of what became the Web SGML Annex, to ask him if he
remembered the responsible ISO WG deciding that they needed to exclude
the C1 controls. He has no memory of such a decision, and neither he
nor I can think of a reason the SGML WG would have felt it necessary.
The SGML spec goes to extreme lengths to try to make it possible to
describe arbitrarily weird encodings and use them to encode SGML
documents. (In fact even the huge complexity of the character set
mechanisms in SGML falls short of the ingenuity of some designers of
character encodings, so SGML can't describe some existing encodings
well -- but even so, those encodings can be used to encode SGML
documents.)
It appears likely that the SGML declaration in the Web SGML Annex was
copied from the SGML declaration formulated by James Clark during the
development of XML and published as part of the SGML/XML note
(
Further excavation reveals that an SGML declaration was included in
the first published working draft of XML -- in the printed form only,
however, not the version at
Like every other SGML declaration for XML I have found today, that
one excludes 7F and the C1 controls.
Those whose pain threshold for character set discussions has not
already been exceeded will find more discussion in the long thread at
Initially, one might be unsure.
The prose suggests that they are legal. HTML 4.01 says
that its document character set is Unicode, and nowhere
in the section on HTML document representation in HTML 4.01
(
On the other hand, the HTML 4 spec has an SGML declaration that
indicates quite clearly that the characters are not legal.
(
The relevant part of the SGML declaration reads:
Is the SGML declaration normative? It would seem to be: it's in a
numbered section, not an appendix, and it's not labeled non-normative
or informative. And the section on conformance describes HTML as a
conforming SGML application.
So: I conclude that the SGML declaration is normative and that 7F and
the C1 controls are not legal in HTML 4.
It might appear not.
XHTML describes itself as a reformulation in XML of HTML 4.01, so I
believe that the character-set restriction of HTML 4 is inherited by
XHTML 1.0. It's no longer enforced by the lower-level markup system,
so in XHTML it would appear to be an "application convention", i.e. a
rule that goes beyond those imposed by XML. The comparison of XHTML
1.0 with HTML 4.01
(
May we conclude that XHTML 1.0, like HTML 4, excludes x7F and the C1
controls? In the first draft of this treatise I did so conclude. But
a different analysis is possible.
XHTML 1.0 was intended as an XML 1.0 application, and all XML
applications have the same rule for character sets. The WG
regarded itself as just adopting whatever it was that the XML
spec said; they didn't believe they had an option. On that
analysis, the rule for XHTML 1.0 is whatever the rule for XML 1.0
is, which does not forbid these characters.
XHTML 1.0 assumes a ‘generic’ XML parser. The absence of this
difference from the list of differences is to be understood not as
a statement that there is no difference, but as an omission from
the list, either because it was regarded as uninteresting or
because the WG didn't notice this particular difference. The
overarching goal in developing XHTML 1.0 was, to quote the chair of
the HTML WG, “to be as generic XML as we could”.
So: we conclude that XHTML 1.0, unlike HTML 4, does not exclude
x7F and the C1 controls by means of its SGML declaration. If
they are legal in XML in general, they are legal in XHTML 1.0.
N.B. this does not mean that it's a good idea to
1) There's definitely a bug in the SGML declarations in wide
circulation for XML 1.0. Either that, or I am being defeated once
more by ISO 8879's character set mechanisms.
2) HTML 4 excludes 7F and the C1 range.
3) XHTML 1.0 does not.
4) The validator seems to be correct in rejecting the characters
in question in HTML 4.0 documents.
5) The validator appears to be incorrect in rejecting the test
document at
I hope this helps.