Re: How to store XML DTDs/Schemas on the web

James Tauber (jtauber@jtauber.com)
Sat, 26 Jun 1999 08:29:14 +0800


----- Original Message -----
From: Ernest Prabhakar <prabhaka@apple.com>
> a) Are people using DTDs, or moving to Schemas/XML-Data, or whatever
> that's called?

DTDs are still the principal means of defining "document types" (for more on
that term, see answer to next question) although XML-Data stuff is emerging
(despite initial indications that DCD superseded XML-Data). Invariably the
documentation for these document types defines additional constraints that
are not expressible via DTD syntax. With new schema languages (and *the* new
schema language as a W3C REC), these additional constraints will move more
and more into the more expressive schemata.

> b) What do we call these anyway?
> - vocabularies
> - schemas
> - naming schemes
> - document definitions

In SGML (ISO 8879), the term "document type" refers to a class of documents.
A particular document is said to be an "instance" of a document type. A
"document type definition", "document definition" or DTD is a specification
of a class of documents. Note that in SGML, a DTD is not just the markup
declarations (which is what commonly gets called the DTD) but also any
additional information necessary to characterise the document type. In other
words, as far as ISO 8879 is concerned, a comment that says something about
the semantics of an element type *is part of the DTD* as is a bit of code
that tests a requirement that foo elements only contain the letters "B", "A"
and "R". Note that people don't generally use the term "document type
definition" that way, but that's how the standard defines it.
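
A minimal sketch of the kind of "bit of code" described above, assuming
Python and its standard xml.etree.ElementTree module. The function name and
the sample document are hypothetical and chosen only for illustration; the
check itself is the "B", "A" and "R" requirement from the paragraph, which
markup declarations alone cannot express.

```python
import re
import xml.etree.ElementTree as ET

# Constraint from the text: foo elements may only contain the letters
# "B", "A" and "R". The pattern also accepts an empty foo element.
ALLOWED = re.compile(r"[BAR]*")

def invalid_foo_elements(xml_text):
    """Return the text of every foo element that breaks the constraint.

    Only the text directly inside each foo element is checked; child
    elements are ignored in this sketch.
    """
    root = ET.fromstring(xml_text)
    bad = []
    for elem in root.iter("foo"):
        content = elem.text or ""
        if not ALLOWED.fullmatch(content):
            bad.append(content)
    return bad

# Hypothetical sample document, used only to exercise the check.
SAMPLE = "<doc><foo>BAR</foo><foo>BARN</foo><foo>ABBA</foo></doc>"
print(invalid_foo_elements(SAMPLE))
```

In the ISO 8879 sense, a check like this would count as part of the DTD; in
the REC-xml sense it sits outside the DTD, alongside it.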

In XML (REC-xml), a "document type definition" is defined far more in line
with what people generally mean when they say "document type definition" or
DTD. As far as REC-xml is concerned, a DTD is a grammar defining a set of
documents that are "valid" instances of that DTD. In XML, a DTD *is* just
the markup declarations.

Now, it is possible to have an XML document without a DTD. But it still
could be said that such a document belongs to a "document type" in the SGML
sense. I can come up with a set of conventions for what constitutes a
purchase order in XML but I don't have to write up those conventions as a
DTD (in the XML sense). If I do write up those conventions, no matter how I
do it, I have a schema.

So a schema, as the term seems to be used in the XML context, is some
specification of a document type. So any given XML DTD is a schema. Any
given SGML DTD (even in the broader sense including code and semantics) is a
schema.

Of course, it's no good writing schemata if no one else understands them so
you need a "schema language". REC-xml defines a language for XML DTDs (this
language doesn't have a name, which can lead to confusion in the use of the
term "DTD"). ISO 8879 defines a language for part of an SGML DTD (the part
that is a DTD in the XML sense). The XML-Data specification defines a
language, the SOX NOTE defines a language and the REC that the XML Schema WG
comes up with will define a language.

In the case of XML-Data or SOX, we have a name for the schema language
(XML-Data and SOX respectively) and I guess you can call instances of these
XML-Data schemata and SOX schemata. The real confusion is going to come when
the XML Schema WG produces its REC. What will the schema language be
called? People seem to often use the term "Schema" to mean the XML Schema
WG's schema language. This is mixing up different levels of abstraction. A
"schema language" defines a set of valid schemata (with a specification of
the semantics of such schemata). A "schema" defines a set of valid documents
(*sometimes* with a specification of the semantics of such documents).

I suspect the term "vocabularies" came into use as part of the move away
from the SGML notion of a monolithic document type. In SGML, you have
documents that are instances of document types. Often, you want to just talk
about a few element names or attribute names. So a "vocabulary" in the XML
context generally refers to a set of element names and/or attribute names
along with some notion of semantics and possibly constraint of content and
location. So I could say that I have a document that makes use of the HTML
"vocabulary" without it following the HTML DTD. In other words I use "P" to
mean a paragraph without my document necessarily having a TITLE. So the term
"vocabulary" seems to be used to focus more on the semantics of what is
being labeled rather than syntactic constraints (although not entirely; for
one thing, the two are generally not completely separable).

The notion of a vocabulary is quite useful, I think, when talking about
namespaces in XML. XML namespaces are basically a means of avoiding name
collisions in vocabularies. Namespaces solve the problem of combining
vocabularies, not the problem of combining DTDs.
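
To illustrate the name-collision point, here is a small sketch, again
assuming Python's standard xml.etree.ElementTree. The namespace URIs, element
names and function name are hypothetical. Two vocabularies both define a
"title" element, and the namespace is what keeps them apart when they are
combined in one document.

```python
import xml.etree.ElementTree as ET

# Hypothetical document mixing two vocabularies that both use "title".
DOC = """<order xmlns:book="http://example.com/ns/book"
       xmlns:person="http://example.com/ns/person">
  <book:title>XML in Practice</book:title>
  <person:title>Dr</person:title>
</order>"""

def titles_by_namespace(xml_text):
    """Map each namespace URI to the text of its child "title" element."""
    root = ET.fromstring(xml_text)
    result = {}
    for elem in root:
        # ElementTree expands a prefixed name to "{namespace-uri}local-name".
        if elem.tag.startswith("{"):
            uri, local = elem.tag[1:].split("}", 1)
            if local == "title":
                result[uri] = elem.text
    return result

print(titles_by_namespace(DOC))
```

Note that nothing here says whether the combined document is valid against
any DTD; the namespaces only disambiguate the names.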

I hope all that helps. If nothing else, it provides insight into how I view
and use those terms :-)

> c) Is it considered "good policy" to post a description in both DTD &
> Schema format? Or if we want broad usage, should we just use DTD?

In light of my answer to (b), what do you mean by "schema"? Firstly, I would
say you are making the abstraction-level mistake. At least as I use the
terms, you are really asking whether to post a DTD (written in a language
that is defined by REC-xml but doesn't have a name) and a schema written in
some other schema language. You haven't said what schema language you mean.
Sorry for being pedantic but I think it avoids a lot of confusion.

To answer your question as I believe you meant it, though: I would post
both. If you publish a DTD you are going to include some extra stuff anyway
explaining semantics and any additional constraints. A richer schema
language will just let you formalise more of that extra stuff, probably with
less ambiguity than is possible with, say, prose accompanying the DTD.

> d) Are there any formal or informal naming conventions for where to store
> these?
> e.g.:
> xml.drernie.com/schemas
> www.drernie.com/xml/dtds

I haven't seen any emerge. I'm still working on how best to organise
SCHEMA.NET.

> What is best practice? I tried looking at places like
> msdn.microsoft.com/xml, but only got more confused...

I'm planning on working with people like XML.ORG and ONTOLOGY.ORG to get
some conventions happening between those two and SCHEMA.NET.

Anyway. Hope all this helps.

James