XML: Metadata for the Rest of Us (Part 1)
The hypertext markup language, as we are all well aware, was an experiment
that got out of the lab too soon. It was, and to a certain extent still is,
a very simple way to describe a limited set of information for transmission
and display on the Web. In the few short years it's been around, we've seen
that various political and commercial forces have stretched the language
almost to the point of breaking. So what's the next step?
Well, what if you could merge the simplicity of HTML with the unparalleled
flexibility of standard generalized markup language, or SGML? That's the
idea behind the extensible markup language, or XML.
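To make the idea concrete, here's a hypothetical fragment of what an XML document might look like. The tag names are invented for illustration; unlike HTML's fixed tag set, XML lets authors define whatever vocabulary suits their content:

```xml
<?xml version="1.0"?>
<!-- A made-up vocabulary: the author, not a standards body, chose these tags -->
<article>
  <headline>XML: Metadata for the Rest of Us</headline>
  <byline>Jeffrey Veen</byline>
  <pubdate>1997</pubdate>
</article>
```

The document is as simple to read as HTML, but the tags describe what the data *is* rather than how to display it.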
I've asked Tim Bray, co-editor of the XML spec, to give us some background
on the project. Tim spent three years working on one of the largest
electronic publishing initiatives in history - the New Oxford English
Dictionary project. He then co-founded Open Text Corp., which created one
of the first large search engines on the Web. He currently has an
independent consulting practice called Textuality, and is representing
Netscape in the XML standards process, including their work on the Meta
Content Framework.
This week, we'll take a look at the motivation behind SGML on the Web, and
how that resulted in the XML project. Next week, we'll dig into some
practical applications of the technology.
JEFF: Can you tell us how the XML project came about?
TIM: Going back several years, some prominent techies in the SGML community
had been saying that SGML was a good idea, but it was just too hairy for
real people to get into; you could crack great big problems, but sometimes
not do the simple things simply. Then the Web came along and showed the
power of doing simple things simply, with the Internet providing the
horsepower. Anyhow, in the summer of '96, Jon Bosak, a Sun guy and longtime
SGML user (he did the Novell docs site), badgered the W3C about doing
something for SGML on the Web, and they said he could form a committee and
see what could be done. The people he picked for the committee were the
same ones from SGML-land who had been talking simplification for years. The
committee is pretty heavy - almost everyone on it is a chief scientist or
Internet IPO architect or standards editor or some such.
The ostensible agenda was (a) better stylesheets than CSS, (b) better
hyperlinking than <a href=....>, and (c) a simpler form of the language.
Once we got together, it took about 15 seconds to decide to do it in the
order (c), (b), and (a). Furthermore, there were, I think, no less than
five of us who had already cooked up designs for an SGML simplification.
The premise was, put in everything that's proven to work and easy to
implement, throw the rest out. The work was mostly done between August and
November '96 - it was pretty intense. When we first trotted it out, the
SGML community mostly leapt on board instantly; getting our nose into the
Web-grunts' tent has been a bit tougher, but it sounds like we're making
good progress on that front. Interestingly, there were a couple of places
where SGML had features that were going to be a *total* pain in the ass in
network deployments; the SGML gang is impressed enough with XML that they
have cooked up a "technical corrigendum" to SGML to iron out these wrinkles
and keep XML Net-capable without losing ISO-SGML compatibility.
JEFF: We've already seen Microsoft using XML for their Channel Definition
Format (CDF) for scheduling and delivering Web-based content. Apple's work
on the Meta Content Framework is now being embraced by Netscape as another
XML application. Why is metadata suddenly so important?
TIM: The difference between a library and a pile of books on the floor of a
big room is the card catalog (which is now computerized, of course). The
card catalog uses an agreed-on format and an agreed-on vocabulary to let
you find books by author, title, subject, and some other things. Of course,
the Web has no librarians (aside from the guys at Yahoo and so on, who are
way outnumbered), but even if you could get people to put cards in the
catalog for their own pages, there's no agreed-on format or vocabulary.
That's what we're trying to provide with MCF and XML. Once we have this,
the people who publish on the Web and have their act together absolutely
will make the effort to keep their metadata up to scratch. Then I'll be
able to go to a search engine and do things like pull up resources on
limnology of polluted waters hosted by US universities and updated since
January '97 - or entertainment magazines with articles about Beck prior to
July '96 that aren't talking about Jeff Beck - or mailing lists that
discuss dual-citizenship issues.
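Searches like the ones Tim describes depend on publishers filing structured "catalog cards" for their pages. A sketch of what such a record might look like in XML - every element name here is invented for illustration, not drawn from any real schema:

```xml
<!-- Hypothetical catalog-card record for a Web resource -->
<resource href="http://limnology.example.edu/polluted-waters/">
  <title>Limnology of Polluted Waters</title>
  <publisher type="university" country="US">Example University</publisher>
  <subject>limnology</subject>
  <updated>1997-03-15</updated>
</resource>
```

With records like this, a search engine could match on publisher type, country, and update date directly, instead of guessing those facts from the page text.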
Historically, the Net has no metadata to speak of. But all of a sudden in
recent times there have been a lot of proposals for doing metadata. The
idea behind MCF is that if all the different sorts of metadata in the world
share something by way of vocabulary and data model, you get quite a bit of
interoperability and the ability to ask questions about all sorts of
different metadata in the same framework. For example, if Wired were to
define an "Internet hipness index" and start assigning it to things out
there, you'd define your own property, called IHI, and even if I didn't
know exactly what the semantics were, in an MCF environment I would be able
to find out that the property exists, that its domain is Web sites and its
range is numeric values, that it comes from Wired, and that it was last
updated on a particular date.
It's a richer world. The Web has made for less data being stored in
proprietary formats. Metadata is just as important.
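Tim's "Internet hipness index" might be declared along these lines - a loose sketch of an MCF-style property description rendered in XML syntax, with all names invented rather than taken from the actual MCF draft:

```xml
<!-- Hypothetical property declaration: not actual MCF syntax -->
<property name="IHI">
  <label>Internet hipness index</label>
  <definedBy>Wired</definedBy>
  <domain>WebSite</domain>  <!-- what the property applies to -->
  <range>Number</range>     <!-- what kind of values it takes -->
  <lastUpdated>1997-06-01</lastUpdated>
</property>
```

The point of the design: even an agent that has no idea what "hipness" means can read the domain and range, discover that the property rates Web sites numerically, and let you query on it.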
Next week: Practical applications of XML.
Jeffrey Veen is HotWired's interface director. His beard is growing back.