[FoRK] Now with magic pixie dust!

Stephen D. Williams sdw at lig.net
Sun May 23 12:23:27 PDT 2004

Infosets vs. SAX + custom data structures vs. DOM:
Every application works with a data structure that is the embodiment of 
some abstract infoset.
Generally speaking, a SAX application uses events to construct a 
proprietary, custom data structure.  You can't compare SAX by itself to 
anything that builds or maintains a data structure.  You have to compare 
alternatives against SAX coupled either with a general-purpose data 
structure or with a custom data structure.
A DOM application uses a data structure that is general purpose for 
infosets, representing those same events in a form compatible with XML 
expression.
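The contrast above can be sketched in a few lines. This is a minimal Python illustration (the post itself names no code); the `TreeBuilder` handler and the sample document are my own invention, showing SAX events feeding a custom structure while DOM builds its general-purpose one from the same events.

```python
import xml.sax
from xml.dom.minidom import parseString

XML = "<order><item>widget</item><item>gadget</item></order>"

# SAX by itself only emits events; the application must supply the data
# structure.  Here the handler builds a custom (tag, children) tree.
class TreeBuilder(xml.sax.ContentHandler):
    def __init__(self):
        self.stack = [("", [])]              # sentinel root: (tag, children)

    def startElement(self, name, attrs):
        node = (name, [])
        self.stack[-1][1].append(node)       # attach to current parent
        self.stack.append(node)              # descend

    def characters(self, content):
        if content.strip():
            self.stack[-1][1].append(content.strip())

    def endElement(self, name):
        self.stack.pop()                     # ascend

handler = TreeBuilder()
xml.sax.parseString(XML.encode(), handler)
custom_tree = handler.stack[0][1][0]         # the proprietary structure

# DOM turns the same parse events into a general-purpose infoset structure.
dom_tree = parseString(XML)
items = [n.firstChild.data for n in dom_tree.getElementsByTagName("item")]
```

The fair comparison is between `dom_tree` and `custom_tree` plus the handler that built it, not between `dom_tree` and the event stream alone.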

Collection classes, data structures, tree+graph data structure completeness:
With the collection classes of C++ (STL), Java, Perl, etc., no one 
should be building raw data structures very often.
Most abstract infosets can be expressed with a small number of 
constructs: trees (which include arrays), graphs, and queues.  With 
small additional semantics, a tree/array system can also represent 
graphs.  Graphs and arrays can represent queues, linked lists, etc.
A collection class that supports all of these with indexing/hashing and 
support for any type of value/payload, including binary blocks, would 
suffice for many programming tasks, including nearly all business 
applications.
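As a quick sketch of that claim, here are the three constructs built from nothing but generic collections (the example structures are mine, not from the post). The "small additional semantics" that turns a tree into a graph is simply referencing nodes by key instead of by containment.

```python
from collections import deque

# A tree as nested dicts/lists: indexed, hashable, any payload type.
tree = {"root": {"children": [{"name": "a"}, {"name": "b"}]}}

# A graph as adjacency lists: same collections, plus the one extra
# semantic of key-reference rather than containment.
graph = {"a": ["b", "c"], "b": ["c"], "c": []}

# Queues (and linked lists) fall out of the same machinery.
q = deque(graph["a"])
first = q.popleft()
```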

Minimum overhead of "line/file format" vs. "conversion + operational 
memory data structure + modification overhead":
Applications that read mostly XML (or similar general format) data and 
write XML data in an SOA, n-tier environment often have a minimum 
overhead computational load that dwarfs the actual processing they are 
accomplishing.  Certainly with many distributed computing methods, such 
as CORBA, DCOM, and systems based on ASN.1, where you are working with IDL or 
IDL-like systems, the maintenance overhead and tightness of binding lead 
to serious long-term issues.

Theory of minimum distributed-application processing and data overhead:
Ideally, the overhead of getting data into and out of an application and 
into and out of data structures that functional code can actually do 
something with should be reduced to something approaching the 
theoretical minimum.  My theory is that you can have both the general, standardized 
data expression of XML and avoid nearly all overhead except raw I/O of 
blocks and a slight overhead of traversal/access/modification.  This 
overhead should be linear to the number of operations performed, not to 
the number or type of elements in a block of data.

Serialized, wire/file-formats of data have been optimized mostly 
independently of memory data structures.
Memory-based data structures have been optimized mostly independently of 
serialized data formats.  Mostly they have been the concern of language 
and library designers with respect to in-memory processing.  Pascal, the 
original Wirth Pascal, didn't even have input/output operators: all 
input/output in actual Pascal implementations was non-standard.

My observation is that as applications and application components become 
more and more distributed, componentized, and bound in ways that force 
frequent transitions between serialized form and operational memory 
form, the overhead of existing methods will continue to increase sharply 
and become less tolerable.  After working on it for a while, I am 
convinced that it is possible to solve this problem in a way that will 
create a new paradigm at the 3GL and below levels of the stack while 
supporting a variety of existing and alternate models above.  In 
particular, data and data structures that are read in, operated on, and 
written out should not be expressed in 3GL method variables but in a format 
like esXML and accessed via a collections-style interface like esDOM.
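The post does not define the esXML or esDOM interfaces, so the following is a hypothetical Python sketch of the access pattern described: the document lives in a serialized-style buffer end to end, a collections-style wrapper gets and sets values in place, and write-out is raw block I/O. The class name, fixed uint32 layout, and method names are all my assumptions, not the actual esDOM API.

```python
import struct

class FlatDoc:
    """Collections-style view over a serialized buffer: values are read
    and written in place, never copied out into 3GL method variables."""

    def __init__(self, nfields):
        self.buf = bytearray(nfields * 4)    # uint32 slots, zero-filled

    def get(self, i):
        return struct.unpack_from("<I", self.buf, i * 4)[0]

    def set(self, i, v):
        struct.pack_into("<I", self.buf, i * 4, v)

    def to_wire(self):
        return bytes(self.buf)               # write-out is raw block I/O

doc = FlatDoc(8)
doc.set(3, 42)
```

The point of the sketch is that `to_wire` does no serialization work at all: the operational form and the wire form are the same bytes.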

Back in 1998, which is when I started thinking about this problem quite 
a bit, FoRK had a discussion about YML which explored substantially the 
same territory.


Gavin Thomas Nicol wrote:

>On Friday 21 May 2004 10:35 pm, Stephen D. Williams wrote:
>>I think that an XPath based API is pretty general, with certain
>>semantics.  You need to be able to get, set (create/replace), append,
>>insert.  You need array indexing, array counting, iteration/enumeration,
>>subtree operations (get, set, append, insert subtrees).
>These are not necessary for a significant number of applications... for 
>example, rendering a page of data in a read-only scenario, or sucking in a 
>SOAP message doesn't really need much more than a stack and some SAX events 
>(bit of an oversimplification, but...). XPath as such is likewise overkill 
>(and overhead!) for many applications.
>In many cases, these are also not only not necessary, but  irrelevant. Go one 
>or two levels higher in the application, and the XML can't (or shouldn't be) 

swilliams at hpti.com http://www.hpti.com Per: sdw at lig.net http://sdw.st
Stephen D. Williams 703-724-0118W 703-995-0407Fax 20147-4622 AIM: sdw
