[FoRK] Metastructure description language instances as compiled binary encoding metadata

Stephen D. Williams sdw at lig.net
Thu Mar 31 22:34:51 PST 2005


The W3C Binary Characterization Working Group has published final documents:
http://www.w3.org/XML/Binary/

This is a new working group style that acted essentially as an incubator 
to consider use cases, properties, requirements, and feasibility of 
creating a spec for binary XML before actually working to create such a 
spec.

My proposed binary XML format, esXML, has evolved to support very 
compact encoding of data with externalization (as opposed to XML-style 
self-contained self-description, see the Measurements Methodology 
discussion for clarification: http://www.w3.org/TR/xbc-measurement/ ).  
I call externalization the process of factoring out redundancies in 
data.  This unifies the use of IDLs and schemas (like XML Schema) for 
long-term redundancy with the use of deltas (which I favor in some 
cases) for short-term redundancy.  Externalization is undesirable in 
many cases, 
but is required in others.  In fact, in some cases the competition is 
hand-coded bit packing.
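A minimal sketch of the delta form of externalization, in Python (the 
field names and baseline record here are invented for illustration): 
the redundancy is "externalized" into a baseline that both sides 
already share, so each message carries only what changed.

```python
# Hypothetical sketch: short-term externalization via deltas.
# Both sides hold a shared baseline; each message transmits only
# the fields that differ from it.

baseline = {"host": "lig.net", "port": 80, "path": "/"}

def delta(msg, base):
    # Keep only fields whose values differ from the baseline.
    return {k: v for k, v in msg.items() if base.get(k) != v}

def apply_delta(d, base):
    # Reconstruct the full message from baseline plus delta.
    merged = dict(base)
    merged.update(d)
    return merged

msg = {"host": "lig.net", "port": 80, "path": "/esxml"}
d = delta(msg, baseline)
print(d)  # only the changed field survives

assert apply_delta(d, baseline) == msg
```

The obvious cost, as noted above, is that the result is no longer 
self-contained: the baseline must be communicated or agreed on 
somehow.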

I have never liked IDL/schema-based externalization because of 
versioning issues, the messy and inefficient stub code it generates, 
etc.  Using high-level schemas to indicate how bits should be packed 
is also just hand-packing with high-level syntax, and it doesn't take 
advantage of optimization.  A further issue is that one schema 
doesn't fit everyone, and there are automated analyses that can be 
very helpful.

My solution to all of this at the format level is to support a content 
that is a mix of self-describing tokenized data along with bit-packed 
data streams that refer to a metastructure instance for description.  
The metastructure instance is a metadata format that defines structure, 
typing, options, alternation, extension, and certain encoding details.  
The idea is that the metastructure instance is a kind of microcode 
instruction set for encoder/decoder engines.  The metastructure instance 
can be created not only from compiled schemas but also from historical 
examples of data, which supports automated selection of encoding 
methods.  The metastructure instance is preferably not compiled into 
code or stubs but rather is interpreted by codec engines and can be 
communicated in or out of band.  This solves many of the problems of 
schema-based encoding while being able to outperform existing manual and 
automated methods.

As an example, an object/element/document that contains a number of 
booleans might be found during analysis to express certain combinations 
much more frequently than others.  This could be used to Huffman 
encode 
the most common combinations of several booleans and encode the less 
common combinations directly.
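A minimal sketch of that boolean example in Python (the history data 
and the two-code table are invented for illustration; a real analyzer 
would build a full Huffman tree from measured frequencies):

```python
from collections import Counter

# Hypothetical sketch: historical records of three booleans show a
# skewed distribution, so the common combinations get short codes
# and the rest get an escape prefix plus the raw bits.

history = [
    (True, False, False), (True, False, False), (True, False, False),
    (True, False, False), (False, True, False), (False, True, False),
    (True, True, True),
]

counts = Counter(history)
# Give the two most common combinations short prefix-free codes;
# everything else is escaped with "11" followed by the raw 3 bits.
common = [combo for combo, _ in counts.most_common(2)]
codes = {common[0]: "0", common[1]: "10"}

def encode(combo):
    if combo in codes:
        return codes[combo]
    return "11" + "".join("1" if b else "0" for b in combo)

bits = "".join(encode(c) for c in history)
raw = 3 * len(history)  # cost of direct bit packing
print(len(bits), raw)   # the skew makes the coded form smaller
```

With this toy distribution the coded stream is 13 bits against 21 
for direct packing; the win, of course, depends entirely on how 
skewed the real data is.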

So, my question is, have you seen this before?

The closest thing I know of is the Data Format Description Language 
(DFDL) http://forge.gridforum.org/projects/dfdl-wg/
That was an effort to describe existing formats so they could be 
mapped automatically.  Interesting, but impossible to solve 
completely.  
The metastructure description language that I am describing is somewhat 
like pcode instructions for converting between uncompressed 
self-contained self-describing data and minimal bit-packed data with 
potentially dozens of optional encoding strategies.  The strategy chosen 
could be indicated manually in a schema or chosen by any algorithm 
including genetic search based on historical instances.

sdw

-- 
swilliams at hpti.com http://www.hpti.com Per: sdw at lig.net http://sdw.st
Stephen D. Williams 703-724-0118W 703-995-0407Fax 20147-4622 AIM: sdw



