[RAW NOTES] XML namespaces, RDF, DCD, XML-data, ICE...

I Find Karma (adam@cs.caltech.edu)
Fri, 14 Aug 1998 11:23:55 -0700


Disclaimer: these are just raw notes, a first pass. There is much still
to be done.

First, recall our XML page: http://www.cs.caltech.edu/~adam/local/xml.html
(although I haven't edited this in months...)

and my May 8 FoRKpost, the XML family of drafts (xent currently has
forbidden permissions or I'd send this.)

Second, a very useful page is the WWW7 Metadata tutorial:
http://purl.oclc.org/~emiller/talks/www7/tutorial/

Part 1 is about the Dublin Core, Part 2 is about RDF. Lots of slides,
lots of good examples.
http://purl.oclc.org/~emiller/talks/www7/tutorial/part2/

Third, see Ralph Swick's RDF technical overview from WWW7:
http://www.w3.org/Talks/1998/0418-WWW7-RDF/

Fourth, see Janne Saarela's "using RDF to model multimedia content" slides:
http://www.w3.org/Architecture/1998/06/Workshop/paper29/slides/

Also, the digital libraries metadata resources page:
http://www.nlc-bnc.ca/ifla/II/metadata.htm

And now, here comes a whole bunch of random notes I've written, cut and
pasted, or pulled out of my butt.

W3C Metadata Activity page: http://www.w3.org/Metadata
Dublin Core Metadata page: http://purl.oclc.org/metadata/dublin_core
W3C Metadata Activity statement: http://www.w3.org/Metadata/Activity.html

W3C's strong interest in metadata has prompted development of the
Resource Description Framework (RDF), a language for representing
metadata, and its relative PICS (Platform for Internet Content
Selection). Both PICS and RDF are described briefly below.

W3C's work on metadata aims to evolve:

1. A language for expressing metadata which is simple to process by
machine. Our chosen language for this purpose is RDF.

2. A language for defining the vocabularies for use with particular
applications. Two applications might both be written in RDF, yet adopt
quite different headings and categories when it comes to organizing
material. In other words, the vocabulary is likely to be
application-specific: perhaps title, director and language for an
application associated with information about movies, but author,
keywords and date published for, say, scientific research papers online.

3. RDF leaves programmers free to choose their own vocabulary,
spelling out in detail the allowed use of the vocabulary -
essentially the "grammar" of the application. RDF clarifies which
vocabulary is being used by assigning each vocabulary a Web address.

4. A language for expressing filters. An application written in RDF
has something in common with a database; just as with a database, you
can apply database-like operations. Suppose you have a "database" of Web
pages. You could then apply queries against the metadata expressed in
RDF, to filter and to sort the pages as appropriate.

5. A syntax for digitally signing RDF statements. You can take a
metadata statement and digitally sign it, in effect saying that the
statement has not been tampered with since it was "signed" and
furthermore, you can certify that the signature is to be trusted.

6. An algorithm for canonicalizing metadata, which is needed for
creating and verifying digital signatures. The idea here is that, if
you want to compare "records" of information in the database, then you
need to be able to map everything into one standard form, which makes
comparison easier.

7. A vocabulary for expressing PICS labels in RDF, and a conversion
algorithm from PICS 1.1. This is so that you can take a "database" of
PICS labels and convert it to RDF.
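
To make item 6 concrete, here is a rough sketch of why a canonical form
matters before you compare or sign anything. This is NOT the W3C
algorithm (I haven't seen one yet); the tab-separated, sorted layout and
the function names are my own assumptions, and a SHA-256 digest stands in
for a real signature.

```python
import hashlib

def canonical_form(statements):
    """Map a set of (resource, property, value) statements to one byte string.

    Assumed layout (not from any spec): drop duplicates, join each
    statement with tabs, sort the lines, join the lines with newlines.
    Assumes no field contains a tab or newline.
    """
    lines = sorted("\t".join(s) for s in set(statements))
    return "\n".join(lines).encode("utf-8")

def digest(statements):
    # A digest of the canonical bytes; a real signature would sign these bytes.
    return hashlib.sha256(canonical_form(statements)).hexdigest()

record_a = [
    ("metadata.html", "Title", "Understanding Metadata"),
    ("metadata.html", "Author", "http://www.iw.com/staff/jpublic"),
]
# Same statements, different order: should compare equal after canonicalizing.
record_b = list(reversed(record_a))
# One value changed: should no longer compare equal.
record_c = [record_a[0], ("metadata.html", "Author", "someone else")]

print(digest(record_a) == digest(record_b))
print(digest(record_a) == digest(record_c))
```

The point is only that two "records" saying the same thing in a different
order hash to the same value once they are canonicalized, and a tampered
record does not.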

RDF will play an important role in enabling a whole gamut of new
applications, for example, the automation of many tasks involving
bibliographic records, product features and terms and conditions.
Metadata will facilitate searching, helping authors to describe their
documents in ways that search engines, browsers and Web crawlers can
understand.

PICS consists of a suite of specifications which enable people to
distribute information about the content of digital material in a
simple, computer-readable form. Information can be given a label, which
computers can then process in the background, filtering out undesirable
material or directing users to sites that may be of special interest to
them. Historically, PICS came first, with the ability to provide labels
for Web resources. RDF provides a more general treatment of metadata,
and PICS has been reformulated as an application of RDF.

PICS was originally designed to allow parents and teachers to screen out
materials unsuitable for children using the Internet. Rather than simply
censoring the information itself, as various legislative bodies have
suggested, PICS gives users the responsibility to control, or to
delegate control of, what they receive on their browsers.

W3C is involved in defining the RDF language in conjunction with experts
in knowledge representation and artificial intelligence. We have links
with the Dublin Core Workshop series. The Dublin Core is an attempt to
define bibliographic categories for Web pages.

The Project Manager for the W3C Metadata Project is Ralph Swick, who
reports to the Domain Leader for Technology and Society. Ralph is
responsible for establishing and enforcing project deliverables,
deadlines, and coordination between the various tasks within the
Project, other W3C Activities related to the Project, and with
organizations outside of W3C that have related work or need access to
the Project's deliverables.

The RDF Syntax Working Group has been set up to define the RDF data
model and write the RDF syntax. The Schema Working Group is meanwhile
working on ways to specify the sets of vocabularies specific to each
application. Its charter is to: 1) Specify a language for defining
vocabularies (for encoding and exchange of schemas), and 2) Make it
possible to automatically translate PICS-1.1 rating service descriptions
to RDF schemas. The Metadata Coordination Group is the forum in which
dependencies on and from other activities such as XML and P3P are
managed.

http://www.w3.org/Talks/1998/04/WWW7-XML/slide13-0.htm
XML: order, occurrence
RDF: logical consistency

XML Namespaces
--------------

Specification (July 31): http://www.w3.org/TR/WD-xml-names

Namespace declarations use an attribute whose prefix is xmlns.
<?xml version="1.0"?>
<x xmlns:edi="http://ecommerce.org/schema">
<!-- the edi namespace applies to the "x" element and contents -->
</x>

Each namespace prefix is scoped to the element that declares it (and
that element's contents), and there's a way to declare a default namespace:
http://www.w3.org/TR/1998/WD-xml-names-19980731#scoping-defaulting
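
A quick way to see scoping and defaulting in action is to let a
namespace-aware parser expand the prefixes. The sketch below uses
Python's standard xml.etree.ElementTree, which follows the namespaces
mechanism as it was eventually finalized, so treat it as an illustration
rather than a test of the July 31 draft. The edi:price element and both
example.org URIs are made up for the demo.

```python
import xml.etree.ElementTree as ET

DOC = """<x xmlns:edi="http://ecommerce.org/schema">
  <edi:price units="Euro">32.18</edi:price>
  <y xmlns:edi="http://example.org/other-schema">
    <!-- inside y, the edi prefix is rebound to a different URI -->
    <edi:price>1.00</edi:price>
  </y>
  <z xmlns="http://example.org/default-schema">
    <!-- unprefixed elements inside z pick up the default namespace -->
    <w/>
  </z>
</x>"""

root = ET.fromstring(DOC)

# ElementTree reports each element name as "{namespace-uri}local-name",
# or just "local-name" when the element is in no namespace.
tags = [element.tag for element in root.iter()]
for tag in tags:
    print(tag)
```

Note that the two edi:price elements end up with different expanded
names, because the inner declaration on y overrides the outer one for
everything inside y.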

Resource Description Framework (RDF)
------------------------------------

Specification of RDF syntax (July 20): http://www.w3.org/TR/WD-rdf-syntax
Specification of RDF schemas (April 9): http://www.w3.org/TR/WD-rdf-schema
Simple Introduction: http://www.w3.org/TR/NOTE-rdf-simple-intro
Really good page for RDF is Dave Beckett's Resource Description
Framework (RDF) Resources:
http://www.cs.ukc.ac.uk/people/staff/djb1/research/metadata/rdf.shtml
Good RDF examples page: John Cowan's RDF Made Easy
http://www.ccil.org/~cowan/XML/RDF-made-easy.html
Eric Miller's RDF intro: http://www.dlib.org/dlib/may98/miller/05miller.html
Dan Brickley's RDF-dev mailing list: http://www.mailbase.ac.uk/lists/rdf-dev/

RDF is a framework for metadata; it provides interoperability between
applications that exchange machine-understandable information on the
Web. RDF emphasizes facilities to enable automated processing of Web
resources. RDF metadata can be used in a variety of application areas;
for example: in resource discovery to provide better search engine
capabilities; in cataloging for describing the content and content
relationships available at a particular Web site, page, or digital
library; by intelligent software agents to facilitate knowledge sharing
and exchange; in content rating; in describing collections of pages that
represent a single logical "document"; for describing intellectual
property rights of Web pages, and in many others. RDF with digital
signatures will be key to building the "Web of Trust" for electronic
commerce, collaboration, and other applications.

RDF will provide the following features:
1. interoperability of metadata
2. machine understandable semantics for metadata
3. a uniform query capability for resource discovery
4. better precision in resource discovery than full text search
5. a processing rules language for automated decision-making about Web
resources
6. a language for retrieving metadata from third parties
7. future-proofing applications as schemas evolve

In general, RDF provides the basis for generic tools for authoring,
manipulating, and searching machine understandable data on the Web
thereby promoting the transformation of the Web into a
machine-processable repository of information.

RDF offers a way to list structured information about a Web resource
that a program can use to intelligently match results. For example,
with the right RDF schemas and software in place, a person could perform
a very specific Web search for biographical information about Virginia
Woolf, and not be deluged with lists of essays about Woolf's work or Web
sites about the state of Virginia.

Searching and cataloging text is just one possible application for RDF.
RDF is also at the center of two other highly publicized W3C efforts,
the Platform for Privacy Preferences (P3P) specification for exchanging
personal information on the Web, and the Platform for Internet Content
Selection (PICS) specification for content labeling.

The RDF effort grew out of PICS and was also influenced by the Dublin
Core Workshop Series, which is focused on defining a metadata vocabulary
for describing electronic documents.

RDF isn't restricted to XML resources. You can make statements about
any addressable (with a URI) thing: HTML docs, GIFs, etc.

Resource = Anything with a URI.

RDF lets programmers make statements about those resources. A statement
asserts a value for a property (sometimes referred to as an "attribute")
of the resource. Each property is of a specific property type, such as
"Title" or "Author." For example, consider the statement "The title of
the document metadata.html is 'Understanding Metadata.'" Expressed in
terms of the RDF data model, metadata.html is the resource. The property
type is "Title." The value for the property is "Understanding Metadata."
The value for a property need not be a simple string of characters. A
value can be another resource, which in turn can have its own
properties. For example, we might want to make the statement, "The
author of the document metadata.html is John Q. Public, whose e-mail
address is jpublic@iw.com." But there's a problem. John Q. Public can't
really be the value for a property, because property values can only be
simple pieces of data or Web resources, and a person qualifies as
neither. You can get around this by associating John Q. Public with a
unique URI, such as http://www.iw.com/staff/jpublic.

Now you can rephrase the statement for RDF: "The author of the document
metadata.html is http://www.iw.com/staff/jpublic. The resource
http://www.iw.com/staff/jpublic has name John Q. Public and e-mail
address jpublic@iw.com." RDF statements are encoded in XML. For
instance, the above example might be encoded as:

<rdf:RDF xmlns:rdf="http://www.w3.org/TR/WD-rdf-syntax/"
xmlns:s="http://www.iw.com/OurRDFSchema">
<rdf:Description about="metadata.html">
<s:Author resource="http://www.iw.com/staff/jpublic"/>
</rdf:Description>
<rdf:Description about="http://www.iw.com/staff/jpublic">
<s:Name>John Q. Public</s:Name>
<s:Address>jpublic@iw.com</s:Address>
</rdf:Description>
</rdf:RDF>

The code provides a description of the resource metadata.html, and a
second description of the resource http://www.iw.com/staff/jpublic.
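
To check that the encoding really does boil down to (resource, property,
value) statements, here is a small sketch that walks the example with
Python's standard xml.etree.ElementTree. It only understands the two
forms used in this example (a "resource" attribute or plain text
content); it is nowhere near a real RDF parser, and the triples() helper
is a name I made up.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/TR/WD-rdf-syntax/"

RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/TR/WD-rdf-syntax/"
         xmlns:s="http://www.iw.com/OurRDFSchema">
  <rdf:Description about="metadata.html">
    <s:Author resource="http://www.iw.com/staff/jpublic"/>
  </rdf:Description>
  <rdf:Description about="http://www.iw.com/staff/jpublic">
    <s:Name>John Q. Public</s:Name>
    <s:Address>jpublic@iw.com</s:Address>
  </rdf:Description>
</rdf:RDF>"""

def triples(xml_text):
    """Return a list of (resource, property type, value) tuples."""
    root = ET.fromstring(xml_text)
    found = []
    for description in root.findall("{%s}Description" % RDF_NS):
        resource = description.get("about")
        for prop in description:
            # The value is either another resource or a simple string.
            value = prop.get("resource") or (prop.text or "").strip()
            found.append((resource, prop.tag, value))
    return found

statements = triples(RDF_XML)
for statement in statements:
    print(statement)
```

The property types come back in ElementTree's "{namespace-uri}name" form,
so the "s" prefix has already been replaced by the schema URI it was
mapped to.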

RDF syntax also provides other facilities, such as ways to refer to
containers that hold a number of resources or values, and ways to make
statements that describe other statements rather than Web resources.
All of RDF, however, is fundamentally based on the simple model of
resources, properties, and values.

You may have noticed the "s" prefix next to the property types in the
code above, and the "xmlns" attributes at the start of the code. These
relate to the problem of differing semantics. Different communities
attach different semantic meanings to words; for instance, "address"
could mean anything from a street address to an IP address to a speech
at Gettysburg. One solution might be to have a central organization
define what the word "address" means in all contexts.

RDF takes a different approach; it specifies no particular vocabulary
and instead lets communities of users define their own vocabularies. An
RDF description identifies each property type with a prefix (such as "s"
in the code example), which is in turn mapped to a specific URI using
the XML namespaces mechanism, which is currently a working draft at the
W3C.

The "http://www.iw.com/OurRDFSchema" URI identifies a specific RDF
schema that contains descriptions of what is meant by various property
types and specifies rules for properties. For instance, our example
schema might specify that an author must have one and only one name and
can have zero or more addresses.
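
As a rough idea of what "rules for properties" could mean to a program,
here is a sketch that checks the two rules just mentioned (exactly one
name, zero or more addresses) against a list of properties. The RULES
table and the violations() helper are my own invention for illustration;
the real schema language is still being worked out by the Schema Working
Group.

```python
from collections import Counter

# Hypothetical rule table: property type -> (minimum, maximum) occurrences.
# A maximum of None means "no upper limit".
RULES = {
    "Name": (1, 1),
    "Address": (0, None),
}

def violations(properties, rules=RULES):
    """Return a list of human-readable problems; empty means the rules hold.

    `properties` is a list of (property type, value) pairs for one resource.
    """
    counts = Counter(name for name, _value in properties)
    problems = []
    for name, (minimum, maximum) in rules.items():
        seen = counts[name]
        if seen < minimum or (maximum is not None and seen > maximum):
            problems.append("%s: found %d, allowed %s..%s"
                            % (name, seen, minimum,
                               "many" if maximum is None else maximum))
    return problems

author_ok = [("Name", "John Q. Public"), ("Address", "jpublic@iw.com")]
author_no_name = [("Address", "jpublic@iw.com")]
author_two_names = [("Name", "John Q. Public"), ("Name", "J. Public")]

print(violations(author_ok))
print(violations(author_no_name))
print(violations(author_two_names))
```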

This decentralized, bottom-up design lets each community define its own
vocabulary according to its needs.

RDF and Metadata article by Tim Bray
------------------------------------
http://www.xml.com/xml/pub/98/06/rdf.html

Resource Description Framework, as its name implies, is a framework for
describing and interchanging metadata. It is built on the following
rules:

1. A Resource is anything that can have a URI; this includes all the
world's Web pages, as well as individual elements of an XML document. An
example of a resource is a draft of the document you are now reading and
its URL is http://www.textuality.com/RDF/Why.html

2. A PropertyType is a Resource that has a name and can be used as a
property, for example Author or Title. In many cases, all we really care
about is the name; but a PropertyType needs to be a resource so that it
can have its own properties.

3. A Property is the combination of a Resource, a PropertyType, and a
value. An example would be: "The Author of
http://www.textuality.com/RDF/Why.html is Tim Bray." The Value can just
be a string, for example "Tim Bray" in the previous example, or it can
be another resource, for example "The Home-Page of
http://www.textuality.com/RDF/Why.html is http://www.textuality.com."

4. There is a straightforward method for expressing these abstract
Properties in XML, for example:

<RDF:Description href='http://www.textuality.com/RDF/Why-RDF.html'>
<Author>Tim Bray</Author>
<Home-Page RDF:href='http://www.textuality.com' />
</RDF:Description>

RDF is carefully designed to have the following characteristics:

1. Independence

Since a PropertyType is a resource, any independent organization (or
even person) can invent them. I can invent one called Author, and you
can invent one called Director (which would only apply to resources that
are associated with movies), and someone else can invent one called
Restaurant-Category. This is necessary since we don't have www.GOD.org
to take care of it for us.

2. Interchange

Since RDF Properties can be converted into XML, they are easy for us to
interchange. This would probably be necessary even if we did have
www.GOD.org.

3. Scalability

RDF properties are simple three-part records (Resource, PropertyType,
Value), so they are easy to handle and look things up by, even in large
numbers. The Web is already big and getting bigger, and we are probably
going to have (literally) billions of these floating around (millions
even for a big Intranet), so this is important.

4. PropertyTypes are Resources

This means that they can have their own properties and can be found and
manipulated like any other Resource. This is important because there are
going to be lots of them; too many to look at one by one. For example, I
might want to know if anyone out there has defined a PropertyType that
describes the genre of a movie, with values like Comedy, Horror,
Romance, and Thriller. I'll need metadata to help with that.

5. Values Can Be Resources

For example, most Web pages will have a property named Home-Page which
points at the home page of their site. So the values of properties,
which obviously have to include things like title and author's name,
also have to include Resources.

6. Properties Can Be Resources

So they can have properties too. Since there's no www.GOD.org to provide
useful assertions for all the resources, and since the Web is way too
big for us to provide our own, we're going to need to do lookups based
on other people's metadata (as we do today with Yahoo!). This means that
we'll want, given any Property such as "The Subject of this Page is
Donkeys", to be able to ask "Who said so? And When?" One useful way to
do this would be with metadata; so Properties will need to have
Properties.
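
To make the "simple three-part records" idea concrete, here is a toy
in-memory store that keeps the records in a flat list plus two lookup
indexes, and lets a Property itself be the Resource of another Property
(the "Who said so?" case). The TripleStore class, the Asserted-By
property type and the example.org URI are all made up for illustration;
nothing here comes from Tim's article or the RDF drafts.

```python
from collections import defaultdict

class TripleStore:
    """Flat (Resource, PropertyType, Value) records with two lookup indexes."""

    def __init__(self):
        self.records = []
        self.by_resource = defaultdict(list)
        self.by_property_type = defaultdict(list)

    def add(self, resource, property_type, value):
        record = (resource, property_type, value)
        self.records.append(record)
        self.by_resource[resource].append(record)
        self.by_property_type[property_type].append(record)
        return record

store = TripleStore()
page = "http://www.textuality.com/RDF/Why.html"

# A string value and a resource value, as in rule 3 above.
authorship = store.add(page, "Author", "Tim Bray")
store.add(page, "Home-Page", "http://www.textuality.com")

# A Property about a Property: record who asserted the authorship claim.
store.add(authorship, "Asserted-By", "http://example.org/some-cataloguer")

print(store.by_property_type["Author"])
print(store.by_resource[authorship])
```

Because each record is a fixed-size tuple with no ordering between
records, lookups by resource or by property type are plain dictionary
hits, which is the kind of handling that stays cheap as the numbers grow.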

Why Not Just Use XML? XML allows you to invent tags, and for the tags
to contain both text data and other tags. Also, XML has a built-in
distinction between element types, for example the IMG element type in
HTML, and elements, for example an individual <IMG SRC='Madonna.jpg'>;
this corresponds naturally to the distinction between PropertyTypes and
Properties. So it seems as though XML documents should be a natural
vehicle for exchanging general purpose metadata.

XML, however, falls apart on the Scalability design goal. There are two
problems:

1. The order in which elements appear in an XML document is significant and
often very meaningful. This seems highly unnatural in the metadata
world. Who cares whether a movie's Director or Title is listed first, as
long as both are available for lookups? Furthermore, maintaining the
correct order of millions of data items is expensive and difficult, in
practice.

2. XML allows constructions like this:

<Description>The value of this property contains some
text, mixed up with child properties such as its temperature
(<Temp>48</Temp>) and longitude
(<Longt>101</Longt>). [&Disclaimer;]</Description>

When you represent general XML documents in computer memory, you get
weird data structures that mix trees, graphs, and character strings. In
general, these are hard to handle in even moderate amounts, let alone by
the billion.
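
You can see the awkwardness directly by parsing that construction. The
sketch below uses Python's standard xml.etree.ElementTree, which stores
the text before the first child in .text and the text after each child
in that child's .tail. I dropped the [&Disclaimer;] entity reference
because it isn't declared anywhere and a parser would reject it.

```python
import xml.etree.ElementTree as ET

MIXED = (
    "<Description>The value of this property contains some "
    "text, mixed up with child properties such as its temperature "
    "(<Temp>48</Temp>) and longitude "
    "(<Longt>101</Longt>).</Description>"
)

root = ET.fromstring(MIXED)

# The character data is scattered over the parent and its children:
# parent.text, then for each child its own text and its trailing "tail".
pieces = [root.text]
for child in root:
    pieces.append(child.text)
    pieces.append(child.tail)

for piece in pieces:
    print(repr(piece))
```

Five separate string fragments hang off three nodes just to hold one
sentence and two numbers, which is a long way from a flat
(Resource, PropertyType, Value) record.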

On the other hand, something like XML is an absolutely necessary part of
the solution to RDF's Interchange design goal. XML is unequalled as an
exchange format on the Web; but by itself, it doesn't provide what you
need in a metadata framework.

XML-Data
--------

Specification (January 5): http://www.w3.org/TR/1998/NOTE-XML-data

PICSRules
---------

Specification (December 29): http://www.w3.org/TR/REC-PICSRules

Document Content Description (DCD)
----------------------------------

Specification (July 31): http://www.w3.org/TR/NOTE-dcd

DCD is an RDF vocabulary that incorporates some of XML-Data and some
basic data types. The note says DCD is intended to define document
constraints in an XML syntax; these constraints may be used in the same
fashion as traditional XML DTDs.

DCDs improve on DTDs in the following three principal ways:

1. Unlike the DTD, the DCD provides the ability to specify data types.
For example, if the value, or content, of a tag is the number 120874,
a DCD will let the developer specify whether that number is a date, a
time, a time interval, a Boolean value, an integer, a decimal, or some
other type of data.

2. DCDs will let authors create open content models. The way it is now
with the DTD's closed model, an author cannot add tags to a completed
DTD. But the DCD will let authors carve out a space for additional
tags to be specified at some point in the future.

3. DCDs allow new flexibility in letting developers reuse tags. For
example, an invoice written using XML could reuse an address tag set
within the document. Another XML document also could use that tag set.
Neither of these capabilities exists with the current DTD.

DCDs are XML documents; DTDs are written in a different syntax.
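
To illustrate point 1: the same characters mean different things
depending on the declared type. This sketch is not DCD syntax and does
not use DCD's actual type names; the "int", "date" and "string" labels,
the MMDDYY date layout and the interpret() helper are all assumptions of
mine, just to show what a declared data type buys you.

```python
from datetime import datetime

RAW = "120874"

def interpret(raw, declared_type):
    """Convert element content according to a (hypothetical) declared type."""
    if declared_type == "int":
        return int(raw)
    if declared_type == "date":
        # Assumed layout: two-digit month, day, year (MMDDYY).
        return datetime.strptime(raw, "%m%d%y").date()
    if declared_type == "string":
        return raw
    raise ValueError("unknown type: %s" % declared_type)

as_int = interpret(RAW, "int")
as_date = interpret(RAW, "date")
as_string = interpret(RAW, "string")

print(as_int + 1)           # arithmetic works on the integer reading
print(as_date.isoformat())  # the date reading is a calendar day
print(as_string)
```

With a DTD, all an application gets is the string; the choice between
these readings has to live in application code or in a human's head.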

Information and Content Exchange (ICE)
--------------------------------------

No specification out yet

ICE would create a defined way, based on XML, for content providers and
publishers to exchange content. While modern Web servers can perform all
kinds of magic with the content they serve up to users -- pulling it
from databases, customizing it for individual users, and pouring it into
various templates -- they're still lousy at trading content with other
servers. Online publishers, content aggregators, and retailers spend a
lot of time and money to acquire articles, product specifications, and
other "media assets" that they then repurpose onto their own sites.

At present, this is mostly done through systems that site managers
develop internally; content aggregators like Excite and Yahoo have
legions of engineers devoted to maintaining their feeds from news
services, specialty Web publishers, and content partners. The cost makes
it prohibitive for most Web sites to incorporate content from multiple
sources -- which in turn may be keeping content providers from sources
of licensing revenue.

ICE is being created to standardize the way in which businesses can set
up online relationships with other businesses to exchange information in
a controlled, measured, targeted fashion. Today there is no accepted
protocol for establishing these relationships, forcing companies to
create one-off technology integrations with each site they choose to do
business with. Currently, businesses have no standard means of
controlling, exchanging and sharing information with other businesses
without time-consuming and expensive manual processes. Developers have
no common platform on which to build powerful networked applications to
serve multiple data types. And individuals are finding it difficult to access
rich, personalized networked experiences while protecting their privacy.
With ICE, businesses can easily partner with any number of affiliates to
create online destinations such as syndicated publishing networks, Web
superstores, and online reseller channels.

The ICE ad-hoc working group is composed of a unique blend of two types
of customer-driven companies: a) technology firms including Adobe,
Firefly, JavaSoft, Microsoft, National Semiconductor, and Vignette, and
b) asset exchanging companies including CNET, Hollinger International,
News Internet Services, Preview Travel, Tribune Media Services, and ZD
Net.

ICE is a logical extension to the work being done on P3P because it
manages the exchange of electronic assets from business to business
(server to server) in a completely trusted and controlled manner. In
keeping with the guiding principles of OPS -- control by source and
informed consent -- ICE enables businesses to automatically describe what
information can be passed to each affiliate and what may be done with
that asset once displayed on an affiliate site. Supporting OPS/P3P will
also enable businesses to present the best personalized content and
information to customers.

----
adam@cs.caltech.edu

Nothing means a thing to me. It's not a habit; it's cool, I feel alive.
If you don't have it, you're on the other side. I'm not an addict; maybe
that's a lie.
-- K's Choice, "Not an Addict"