Re: Baked XML With Sour Grapes and ArChives [Draft -1 !!!]

Ron Resnick
Sat, 28 Jun 1997 16:54:15 +0300

At 01:30 AM 6/25/97 -0400, Rohit wrote:
>Baked XML With Sour Grapes and ArChives
>Rohit Khare, 6/25/97 * brought
>to you by $3.41 worth of chai * 1200 words in 50 minutes
>Dan Connolly suggested codifying the argument I have bandied about as a
>pet theory for a few years now:

Hmm. Before digging into this, I suppose it's worth considering what
motivation Rohit would have in writing this, and who his target audience
is. Possibilities include:

i) Organize his own ideas purely for his own sake. Not trying to convince
anyone of anything. This is perfectly legitimate; I do this all the time.
In that case, ignore much of the below as mere crankiness.
ii) Preach to the converted. He and Dan Connolly and Adam and other
regular churchgoers can read Rohit's paper and all chant "Hallelujah!"
I doubt that's the intent.
iii) Preach to the unconverted. Attempt to present a reasonable argument
grounded in logic which can perhaps sway potential converts to join
the fold.
iv) ???

I'm going to assume that since Rohit posted this publicly, he's willing
to at least get feedback from people who take this as (iii).

Here goes.

First, I'm somewhat surprised that the theme is usage of XML for
pickling. I'd always figured karmakids were presenting XML as a
tool for knowledge representation, as needed for things like
dynamic session formation.

Pickling, to me, implies notions of composite storage facilities. Without
that, you haven't really 'archived' anything, have you? You need to treat
the durability of a related set of entities as a whole, or else you haven't
truly created an archive. Sure, the 'make durable' command can deposit
the durable pieces scattered around in distributed fashion - I'm not
saying it has to centrally gather and store everything. But without
a centralized command to 'make it so', I have a hard time thinking
of what you suggest as an 'archive'.

I get the sense that the main thing that irks Rohit about frozen
sets of objects is the (supposed) need to centrally gather them and
store them all-in-a-bundle.

Whereas, as you suggest, by leaving things out where
they are, distributed over the web, you can archive as little or as
much as you need, on the fly. But why do you suppose there's something
magic about http/XML etc., such that your suggestion works only with them?

Come on,
really you're just pointing out the differences between call-by-value
and call-by-reference, and the need to have both, and be able to
flexibly move some bits by value, and leave others at arms length,
accessed by reference, with the ability to switch what's what on the fly.
That's very true and quite profound, I believe. It's good to see it pointed
out, in whatever context that's done. But it can be done in any sufficiently
computationally powerful framework, web or otherwise.

The real trouble with
CORBA (one of its real troubles, anyway) has been its stubborn
inability to move objects by
value. And, one of the real troubles with Java early on was that
it gave a great by-value bytecode mech, but zip for by-reference calls
(e.g. RMI).

If your call is "need both!", I fully agree. Need both, need to be able
to pick which at the drop of a hat, need to be able to change your
decision at the drop of a hat.
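Just to make that by-value/by-reference distinction concrete, here's a minimal modern sketch (Python's copy module; the Employee class is my invention, not anything from Rohit's post). The alias tracks the live object; the deep copy is a frozen snapshot, which is exactly the pickling tradeoff:

```python
# Sketch of by-value vs by-reference; Employee is a hypothetical class.
import copy

class Employee:
    def __init__(self, name, manager=None):
        self.name = name
        self.manager = manager

def demo():
    boss = Employee("Bob")
    emp = Employee("Alice", manager=boss)

    by_ref = emp                 # by-reference: an alias to the live object
    by_val = copy.deepcopy(emp)  # by-value: an independent frozen snapshot

    boss.name = "New boss"       # mutate the running system...
    # ...the reference sees the change; the snapshot does not
    return by_ref.manager.name, by_val.manager.name
```

Being able to pick which of those two you want, per object, per moment, is the "need both!" point.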

>XML is an ideal substrate for archiving the state of
>distributed systems. In this case, I mean distributed in the sense of
>'across organizational boundaries' more than I mean 'across address
>spaces' (though there's some of that, too). Let me draw a picture, then
>we'll go back over the theory (as meager as it is). Let's recapitulate the
>most basic case: archiving a network of dependent objects within one application.

Yuck. As Mark will hasten to tell you in great detail, 'application' is
such an ugly word, in your world and ours. If you really believe in
a web full of documents, and everything is a document, then where
pray tell are these 'applications' coming from?

>Suppose we have a human resources application with Employee,
>Department, and Manager::Employee classes and assorted instances thereof.
>Let's consider what happens, at first without considering HTML/XML at all.
> As a zeroth attempt, you might 'pickle' a Department by simply writing
>down its data structure in memory. Immediately, though, we see that there
>are pointers to a Department's Manager, so we actually need to package both
>objects together to make a complete statement. If you wanted to get all
>the Employees within a department, too, then you discover the general
>case. Since there is a web of employees, all related to each other because
>some are managers, we cannot simply write down the Department, the
>Manager, and then each Employee reporting to that manager: we could fall
>off the edge into a cyclic loop. In fact, the general case is
>mark-and-sweep: first you trace out every object connected to the one at
>hand; then collect and serialize every affected object in that subset.
>[@@expn could be clarified through superclass write: methods instead (you
>may know what friends you need pickled, but not what your superclass
>implementation might be using).] This is expensive! Before you can even put
>the first byte on the wire, you have to plan out the entire pickle.

Um, yes, I'd agree with much of this. Mind you, it presupposes that
the point of the exercise is to pull all the objects together to one
(time/space) coordinate, which I've already indicated doesn't have to be
the goal.
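For what it's worth, the "trace out every object connected to the one at hand" step Rohit describes is just reachability with a visited set, which is what keeps the cyclic manager/employee links from looping forever. A toy sketch (Node is my invention):

```python
# The 'mark' phase: collect everything reachable from root, cycle-safe.
class Node:
    def __init__(self, name):
        self.name = name
        self.refs = []   # pointers to related objects

def reachable(root):
    visited, order, stack = set(), [], [root]
    while stack:
        obj = stack.pop()
        if id(obj) in visited:   # already traced; breaks cycles
            continue
        visited.add(id(obj))
        order.append(obj)
        stack.extend(obj.refs)
    return order

# dept -> manager -> employee -> manager (a reporting cycle)
dept, mgr, emp = Node("Dept"), Node("Manager"), Node("Employee")
dept.refs = [mgr]
mgr.refs = [emp]
emp.refs = [mgr]
names = sorted(n.name for n in reachable(dept))
```

And yes, you have to run that whole traversal before the first byte hits the wire, which is his cost argument.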

>fragile, too, because two subsequent snapshots may yield separate values
>for some subsets of the archive: some reporting roles change, etc. (i.e.
>duplicate copies of the state of an object in two archives).

Well, um, yeah - an archive *is* a snapshot at some moment in time. Sure,
the real operational system is changing continuously, and differs from
the snapshot moments after it's taken. Isn't that kind of the point of
it all?

Maybe I've missed something here... Perhaps you're troubled
by how to do incremental saves to a baseline archive? This is certainly
thorny, but (i) the database community is full of techniques for
addressing this problem
(ii) I don't see how the web escapes it - it's not a problem of any
particular technology; it's part of the very notion of what archiving a
running system means.

Version management is tricky in any system; look at the hoops
source repositories go through to deal with it. They're document based,
right ;-)?

>Now, wait:
>you actually know a radically different way of solving this problem under
>your nose: transferring a Web page! A page has many subsidiary resources,
>some of which load other subparts in turn; some of which are shared with
>other pages, and so on. But WE don't have to pickle:

See Rohit, this is where I return to those (i), (ii), (iii) options at top.
"But WE" kind of irked me. Are you trying to convince people like me
of something? The "But WE" stuff will be very counterproductive, I'm afraid.

I'm approaching your "web can do it all" approach from this perspective:

We all agree that ultimately, it's all just
bits. Damn little non-thinking automata, 1s and 0s. The most you can
get them to do is what Turing says they can do, no more no less. As such,
it really doesn't matter ultimately whether you start from CORBA, Java,
Web, DCOM, TCP/IP, sockets, DCE, ObjectiveC, or whatever. They're
all computationally equivalent. They differ in elegance and style, not
in ability. They're all better at some things, worse at others. (Well,
some suck at the whole thing, but that's beside the point.)

I'm quite willing to grant that the Web as we know it can
evolve to the visions we share, of a single unified networked space that
spans everything. So can ReXX over SNA, or CMIP over OSI, or Lotus Notes,
or a multitude of other possible starting grounds, so what?

The real discussion, imho,
should be over (i) what services do we need (ii) which tech
currently best understands some subset of those services (iii) how
to superimpose and merge and blend and morph those "best of breed"
subsets into the
total set. I'm not anti-web or pro-java. I'm a believer in morphing the best
of everything, and ditching the worst in everything.

But if you want to present the discussion as a "But WE" can do things
they can't, I'll quickly lose interest. (Maybe that's your intent :-).

>HTTP servers don't
>grovel over home pages and send out neatly packaged bundle of
>html-with-all-embedded-images-and-sounds-in-one-MIME-multipart. We have the
> miracle of names!

Miracle? Oh, and I suppose object refs could never, ever do this? Besides,
URL naming is the one thing I'd happily grab from the web into my
morphed "best-of" collection of services. That one's a gimme.

See? I don't play fair - that's the thing about a morpher. We have no
allegiances. We'll steal the best ideas you can come up with, and
simply scavenge them, leaving behind the not so tasty parts :-).
We've been known to infuriate die-hard Javaheads, dyed-in-the-wool
CORBA types, and perhaps now you, with this approach :-).

Surprisingly, I've yet to encounter an MS idea I thought
was "best of breed" and worth incorporating. They seem to be
scavengers too, but they've got a butt ugly base to begin with in
OLE/COM, so it makes their job a lot harder. But they
do have very smart people (hi JoeB :-),
and are willing to pay $ to attract more, so they will get there in time.

>Instead of expensive marshalling burdens on the server
>(writer), we just send over the one object at hand with names as pointers
>to other resources. Then, let the client (reader) pull what they need to
>build a complete map... the delay bottleneck goes away, so we can really
>stream these puppies asap. [Now, of course, as a performance optimization,
>we can pipeline the next few employee records you'll need to mask the
>underlying latency... cache push] Lesson 1: URLs are an excellent way to
>capture distributed state. Versioning, security, etc, can be layered on
>top of that mechanism. There are several red-herring issues: space used by
>long URLs (compress the transport), fragility of locations (bzzt!
>Locations == names). Using style sheets conveys the same
>information in a human-readable form. Lesson 2: Documents are an
>extremely convenient way of encoding object state in a way that's usable
>to humans and computers and is more palatable, reusable, non-fragile than
>binary formats.

Yes, I've been thinking a lot about this. The beauty of some of the most
important standards we've seen in this industry (e.g. ASCII, IP, Win32, x86,
Java bytecode, JPG, html tags) is that they're, well, standard. In the
intuitive, English sense of the word. They're ubiquitous and global.
Everybody, or at least most everybody, uses them.

So the next step would be to really get one universal world-spanning,
system-spanning representation for all our bits, right up to the highest
semantic levels, right?

Well, yes, with some reservations - never mind those for now - they're
off topic. More on-topic is, as you note, wouldn't it be great to have
all those representations human-readable, as well as just machine-readable?
What *is* so great about Java bytecode, anyway? I mean, I can transfer
code around a network in standardized, human readable representations
too, can't I? E.g., any scripting language, VBasic, etc. Then I get the
benefits of write once, run anywhere, but my shipped bits are readable;
better than .class files, right?

Wrong. The answer, of course, is that so long as we have hardware
differences, there will always be variations and ambiguities in reps
standardized at the source level only. What's an int? How many bits?
Big/little endian? Mantissa/exponent issues in floating point?

The standard rep. has to go right from the bottom to the top. Short
circuiting the process and standardizing the semantic knowledge
up top without the bottom rungs isn't going to work. Does it have
to be Java? Of course not. There's a perfectly usable (ha!) binary
format already: x86. Any takers? Thought not.
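The endianness ambiguity is easy to make concrete. With Python's struct module, the "same" 32-bit integer has two different wire forms depending on byte order, which is exactly why a source-level-only standard isn't enough:

```python
# One integer, two incompatible byte-level representations.
import struct

big    = struct.pack(">i", 1)   # big-endian:    b'\x00\x00\x00\x01'
little = struct.pack("<i", 1)   # little-endian: b'\x01\x00\x00\x00'
```

Two machines that "agree" on int at the source level can still disagree on every one of those four bytes.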

>And the beauty of *XML* is that I
>don't need to compile new programs to process each DTD: I really CAN
>dynamically learn about new document types. XML fixes bugs in SGML. XML
>adds a real naming scheme for DTDs. Lesson 3: documents with
>self-describing grammars are infinitely more reusable than ad-hoc

Sure, I believe in this. Self-describing and self-referencing data
is key for anything that wants to pull itself out of the primeval muck. Again,
only XML knows how to do this? I think not. Try the KIF refs from the
dist-obj faq, as an example of alternatives.

>(i.e. motivation of *MLs) But now, we have only defined our
>precise meaning: nothing lets us share WebCard semantics yet. What we need
>is a way to equivalence my vCard to your WebCard: a filter between XML
>DTDs. Traditionally, we know filters as converting between encoding types
>(e.g. jpg to png to asciiart). DTD calculus lets us *coordinate across
>administrative domains* (each having its own worldview). This brings
>synergy to the table from ad-hocracy. Instead of being held prisoner by
>some industrywide megaproject to define 'employee' and 'purchase order'
>like the OMG is doing, we can let it emerge organically.

Oh, yes. Mega-agreement here. Have you read Sims, by the way?
Try it :-).

I don't believe much in vertical
frameworks either, at least not in the way OMG would have them, with
straitjacket IDL interfaces for everything. We definitely need much
looser notions of "standard interfaces" that give a common semantic/
conceptual context, but are free form after that.

Take natural language as an example:
A set of radiologists all speak "English" but have their own particular
vernacular and jargon unique to their profession.
So does a set of taxidermists, taxi-drivers,
and tax collectors. You don't force them to agree on:
taxidermist.stuff( Animal theBeast, Stuffing StrawOrSawdust);
That's silly and offensive, imho.
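The looser alternative - Rohit's "filter between DTDs" - can be sketched as a declarative field mapping: translate one administrative domain's record into another's shape, keeping only the fields both worldviews share. All the field names below are invented for illustration:

```python
# A hypothetical vCard -> WebCard filter: coordinate across domains
# without forcing either side into the other's schema.
VCARD_TO_WEBCARD = {"FN": "full_name", "ORG": "organization", "TEL": "phone"}

def filter_card(vcard, mapping=VCARD_TO_WEBCARD):
    # rename shared fields; silently drop what the other side lacks
    return {mapping[k]: v for k, v in vcard.items() if k in mapping}

card = filter_card({"FN": "Ron Resnick", "TEL": "555-1212", "X-QUIRK": "?"})
```

Each domain keeps its own vernacular; only the mapping is negotiated, and it can emerge pairwise and organically rather than by industrywide decree.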

>Lesson 4:
>declaring what we mean instead of operational encoding (programs like
>Java), we can deterministically transform data with high fidelity. Look,
>Documents are Archived Data. We create them by pickling programs (the
>output of CGIs), and we can extract pickles back out through deterministic
>reverse-engineering (webMethods' package tracker). Most of all, the
>evolutionary advantage is that *combo* human/mr documents will be most
>powerful at focusing attention. People will invest more to make the
>'purchase order' forms look good, collect the right data, and generally
>sweat the details which they won't for relational DB table schemas.
>Lesson 5: human-readable documents form an excellent common ground for many
> ways to generate and extract data of that kind. There are millions of
>employee databases in gadzooolians of languages, but a single pretty
>common framework for business cards (admittedly with a thousand variations
>in the graphic details -- but that's an XML dtd with many CSS sheets).
>So, the best way to archive the state of a distributed computation like the
> business of an entire corporation is its intranet web. And the best way to
> pickle the world is the Web.

Hmm. Part of our differences is in vision too though. I don't believe in
"organizations" that have "intranets" populated with "databases" ultimately.
Those are all such centralized things. Rather, I consider one big
soup of micro-cogs, any set of which can be arbitrarily and dynamically put
together in a session to form a virtual organization, and just as quickly
dissolved.

>The best content for it is XML. The pointers
>we get with XML are also more powerful and finer-grained so they are
>better marshalling tools. This is a protocol issue, too: caching of object
>state becomes a visible, soluble issue (it's swept under the rug in
>rpc/corba/dcom systems instead).