[FoRK] Metastructure description language instances as compiled
binary encoding metadata
Stephen D. Williams
sdw at lig.net
Thu Mar 31 22:34:51 PST 2005
The W3C Binary Characterization Working Group has published final documents:
This is a new working group style that acted essentially as an incubator
to consider use cases, properties, requirements, and feasibility of
creating a spec for binary XML before actually working to create such a
My proposed binary XML format, esXML, has evolved to support very
compact encoding of data with externalization (as opposed to XML-style
self-contained self-description, see the Measurements Methodology
discussion for clarification: http://www.w3.org/TR/xbc-measurement/ ).
I call externalization the process of factoring out redundancies in
data. This unifies the use of IDLs and schemas (like XML Schema) for
long-term redundancy along with short-term redundancy from deltas (which
I favor in some cases). Externalization is undesirable in many cases,
but is required in others. In fact, in some cases the competition is
hand-coded bit packing.
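
To make the delta case concrete, here is a minimal Python sketch of
short-term externalization: only the fields that differ from the
previous message are sent. The function names and the {"set", "del"}
layout are hypothetical, chosen for illustration; they are not part of
esXML.

```python
_MISSING = object()  # sentinel: distinguishes "absent" from a None value

def make_delta(previous, current):
    """Return only what changed between two flat dict messages."""
    changed = {k: v for k, v in current.items()
               if previous.get(k, _MISSING) != v}
    removed = [k for k in previous if k not in current]
    return {"set": changed, "del": removed}

def apply_delta(previous, delta):
    """Rebuild the current message from the previous one plus a delta."""
    result = dict(previous)
    for key in delta["del"]:
        result.pop(key, None)
    result.update(delta["set"])
    return result
```

The receiver must hold the previous message, which is exactly the
kind of external context a self-contained format avoids.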
I have never liked IDL/Schema-based externalization because of
versioning issues, messy and inefficient stub code, etc. Using
high-level schemas to indicate how bits should be packed is also just
hand-packing with high-level syntax, and it doesn't take advantage of
optimization. A further issue is that one schema doesn't fit everyone,
and automated analysis can be very helpful.
My solution to all of this at the format level is to support content
that is a mix of self-describing tokenized data along with bit-packed
data streams that refer to a metastructure instance for description.
The metastructure instance is a metadata format that defines structure,
typing, options, alternation, extension, and certain encoding details.
The idea is that the metastructure instance is a kind of microcode
instruction set for encoder/decoder engines. The metastructure instance
can be created not only from compiled schemas but also from historical
examples of data, which supports automated selection of encoding
methods. The metastructure instance is preferably not compiled into
code or stubs but rather is interpreted by codec engines and can be
communicated in or out of band. This solves many of the problems of
schema-based encoding while being able to outperform existing manual and
As an example, an object/element/document that contains a number of
booleans might be found during analysis to express certain combinations
much more frequently than others. This could be used to Huffman encode
the most common combinations of several booleans and encode the less
common combinations directly.
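
A minimal Python sketch of that boolean example, under stated
assumptions: the most frequent flag combinations found in historical
data get Huffman codes, and every other combination is sent as an
escape code followed by the raw bits. The ESCAPE marker, the top_k
cutoff, the function names, and the sample history are all
hypothetical and made up for illustration.

```python
import heapq
from collections import Counter
from itertools import count

ESCAPE = "ESC"  # hypothetical marker: "rare combination, raw bits follow"

def build_codes(history, top_k=3):
    """Huffman-code the top_k most frequent flag tuples plus ESCAPE."""
    freq = Counter(history)
    weights = dict(freq.most_common(top_k))
    rare_total = sum(freq.values()) - sum(weights.values())
    weights[ESCAPE] = max(rare_total, 1)
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(w, next(tie), {sym: ""}) for sym, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]

def encode(flags, codes):
    """Return a bit string for one tuple of booleans."""
    if flags in codes:
        return codes[flags]
    raw = "".join("1" if b else "0" for b in flags)
    return codes[ESCAPE] + raw

def decode(bits, codes, n_flags):
    """Decode one tuple of n_flags booleans from the start of bits."""
    inverse = {code: sym for sym, code in codes.items()}
    prefix = ""
    for i, bit in enumerate(bits):
        prefix += bit
        if prefix in inverse:
            sym = inverse[prefix]
            if sym == ESCAPE:
                rest = bits[i + 1:i + 1 + n_flags]
                return tuple(b == "1" for b in rest)
            return sym
    raise ValueError("no valid code found")

# Made-up historical data: four booleans per record, heavily skewed.
history = ([(True, False, False, False)] * 50
           + [(False, False, False, False)] * 30
           + [(True, True, False, False)] * 15
           + [(False, True, True, True)] * 3
           + [(True, True, True, True)] * 2)
codes = build_codes(history)
```

With a skewed history like this one, the common combinations cost
fewer bits than the four raw bits, while the rare ones cost slightly
more because of the escape prefix.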
So, my question is, have you seen this before?
The closest thing I know of is the Data Format Description Language
That was an effort to describe existing formats so they could be
mapped automatically. Interesting, but impossible to completely solve.
The metastructure description language that I am describing is somewhat
like p-code instructions for converting between uncompressed
self-contained self-describing data and minimal bit-packed data with
potentially dozens of optional encoding strategies. The strategy chosen
could be indicated manually in a schema or chosen by any algorithm
including genetic search based on historical instances.
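
As a rough illustration of the interpreted approach, the Python sketch
below treats a metastructure instance as plain data (a list of field
descriptors) that a generic codec walks to pack and unpack bits, with
no generated stubs. The descriptor layout, field names, and bit widths
are assumptions made up for this example, not the actual esXML
metastructure format.

```python
# Hypothetical metastructure instance: (field name, kind, width in bits).
METASTRUCTURE = [
    ("version", "uint", 3),
    ("urgent", "bool", 1),
    ("priority", "uint", 4),
    ("sensor_id", "uint", 10),
]

def pack(record, meta):
    """Bit-pack a dict into bytes by interpreting the descriptors."""
    value = 0
    nbits = 0
    for name, _kind, width in meta:
        field = int(record[name])
        if field < 0 or field >= (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        value = (value << width) | field
        nbits += width
    nbytes = (nbits + 7) // 8
    value <<= nbytes * 8 - nbits  # left-align, zero-pad the last byte
    return value.to_bytes(nbytes, "big")

def unpack(data, meta):
    """Recover the dict from packed bytes using the same descriptors."""
    total = sum(width for _name, _kind, width in meta)
    value = int.from_bytes(data, "big") >> (len(data) * 8 - total)
    record = {}
    shift = total
    for name, kind, width in meta:
        shift -= width
        raw = (value >> shift) & ((1 << width) - 1)
        record[name] = bool(raw) if kind == "bool" else raw
    return record
```

Because the descriptors are data, a different encoding can be selected
at run time by handing the same engine a different instance, whether it
was written by hand or produced by analysis of historical messages.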
swilliams at hpti.com http://www.hpti.com Per: sdw at lig.net http://sdw.st
Stephen D. Williams 703-724-0118W 703-995-0407Fax 20147-4622 AIM: sdw