[FoRK] binary XML

damien morton fork at bitfurnace.com
Fri Jan 21 23:39:00 PST 2005


It's not about compression, it's more about compaction - that quality of 
an encoding which makes it both fast to decode and efficient to transmit.

It's a goal which says "I am willing to pay 20-40% above the going 
entropy cost in order to ensure that my encoding is easy and fast to decode".
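
To make "compaction" concrete, here is a minimal sketch of an encoding that trades some size for a trivial decode loop. Everything in it is an assumption made for illustration: the shared tag table, the one-byte tag id and the one-byte length prefix are hypothetical, and it is not any existing binary XML format.

```python
# Hypothetical tag table that sender and receiver are assumed to share.
TAGS = ["sequence", "name", "instrumentContext", "pricingContext"]
TAG_ID = {tag: i for i, tag in enumerate(TAGS)}


def encode(fields):
    """Pack (tag, value) pairs as: 1-byte tag id, 1-byte length, UTF-8 value."""
    out = bytearray()
    for tag, value in fields:
        data = value.encode("utf-8")
        if len(data) > 255:
            raise ValueError("value too long for a one-byte length prefix")
        out.append(TAG_ID[tag])
        out.append(len(data))
        out += data
    return bytes(out)


def decode(buf):
    """Inverse of encode(): no scanning for delimiters, just index arithmetic."""
    fields = []
    pos = 0
    while pos < len(buf):
        tag = TAGS[buf[pos]]
        length = buf[pos + 1]
        start = pos + 2
        fields.append((tag, buf[start:start + length].decode("utf-8")))
        pos = start + length
    return fields


fields = [("sequence", "528612"), ("pricingContext", "NYBrokerCompare")]
packed = encode(fields)
print(len(packed), decode(packed) == fields)
```

The values are left uncompressed, which is where the size above the entropy cost comes from; in exchange the decoder never has to search for a closing tag.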

Gavin, in your "edge transformation", who is the consumer? Is it the 
parser itself, or is it the erstwhile notepad-equipped xml-o-naut?


Here's a data point:

I get bursts which range up to 10 megabits a second of these messages 
(a sample message is at the end of this mail).

I'm using the C# XmlReader, which is able to handle about 8000 of these 
messages a second.

Without wanting to blow my own horn (too much), I wrote my own xml 
parser, tailored for the form of the xml received (which is emitted 
according to some consistent formatting rules by some machine 
somewhere). The result was a 20-fold improvement in parsing performance.

Now, my parser will break on anything other than the rigidly defined xml 
I wrote it for, but even working in the pure text domain, it's possible 
to create an xml-like subset of xml syntax that can be parsed 20 times 
faster than a generalised xml parser would parse it.
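
As an illustration only (this is not the parser described above, which was written against C#), here is a minimal Python sketch of what parsing such a rigid subset can look like. It assumes the layout of the sample message: every leaf element sits on its own line as <tag>value</tag>, with no attributes, entities, comments or CDATA. The function name parse_flat_fields is hypothetical.

```python
def parse_flat_fields(text):
    """Return (tag, value) pairs for one-line leaf elements, in document order.

    Assumes rigidly formatted input: each leaf is "<tag>value</tag>" on its
    own line. Anything else (declarations, container tags, close tags) is
    skipped rather than validated.
    """
    fields = []
    for line in text.splitlines():
        line = line.strip()
        # Only lines that open an element are of interest.
        if not line.startswith("<") or line.startswith(("</", "<?")):
            continue
        tag_end = line.find(">")
        if tag_end == -1:
            continue  # start tag wrapped across lines, e.g. the envelope
        close = line.find("</", tag_end + 1)
        if close == -1:
            continue  # container start tag such as <message ...>
        tag = line[1:tag_end]
        if " " in tag:
            continue  # leaf with attributes: outside the assumed subset
        fields.append((tag, line[tag_end + 1:close]))
    return fields


SAMPLE = """\
<message xsi:type="mepsm:SharedSpreadRelationship">
  <sequence>528612</sequence>
  <isActive>1</isActive>
  <benchmark1 xsi:type="mepsm:ObservableReference">
    <name>POfficial</name>
  </benchmark1>
</message>
"""
print(parse_flat_fields(SAMPLE))
```

The sketch flattens the nesting and does no well-formedness checking at all, which is exactly the trade being described: it only works because the producer's formatting is known in advance.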

What's wrong with this picture?



<?xml version="1.0" encoding="UTF-8"?>
<mepsm:messageEnvelope 
xmlns:mepsm="http://www-server.ms.com/ms/dist/fidev/merlin/PricingServiceModel" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
   <message xsi:type="mepsm:SharedSpreadRelationship">
     <sequence>528612</sequence>
     <name>RsharedLIBERTY10000037219</name>
     <instrumentContext>fid1Id_10000011524</instrumentContext>
     <pricingContext>NYBrokerCompare</pricingContext>
     <isActive>1</isActive>
     <isDriver>0</isDriver>
     <benchmarkAttribute1>beYield</benchmarkAttribute1>
     <dependentAttribute1>beYield</dependentAttribute1>
     <benchmark1 xsi:type="mepsm:ObservableReference">
       <name>POfficial</name>
       <instrumentContext>fid1Id_10000037219</instrumentContext>
       <pricingContext>NYBrokerCompare</pricingContext>
     </benchmark1>
     <dependent1 xsi:type="mepsm:PYModelReference">
       <name>POfficial</name>
       <instrumentContext>fid1Id_10000011524</instrumentContext>
       <pricingContext>NYBrokerCompare</pricingContext>
     </dependent1>
     <benchmarkAttribute2>beYield</benchmarkAttribute2>
     <dependentAttribute2>beYield</dependentAttribute2>
     <benchmark2 xsi:type="mepsm:ObservableReference">
       <name>PYstaticLIBERTY</name>
       <instrumentContext>fid1Id_10000037219</instrumentContext>
       <pricingContext>NYBrokerCompare</pricingContext>
     </benchmark2>
     <dependent2 xsi:type="mepsm:PYModelReference">
       <name>PYstaticLIBERTY10000037219</name>
       <instrumentContext>fid1Id_10000011524</instrumentContext>
       <pricingContext>NYBrokerCompare</pricingContext>
     </dependent2>
   </message>
</mepsm:messageEnvelope>


> 
> On Jan 22, 2005, at 12:12 AM, Reza B'Far (Voice Genesis) wrote:
> 
>> This all assumes you have pretty intelligent design engineers (not just
>> people who know how to count bytes, but have the ability to understand
>> statistics and things like Huffman encoding).  Obviously, there are more
>> advanced domain-based compression techniques that could take in text and
>> produce text.  And, if they are simple enough, you can use an XSL to view
>> them during testing, development, etc.
>>
>> So, this is hacky, but it is a way to buy a little performance.
> 
> 
> FWIW. This is similar to the approach that I call "edge transformation",
> where you emphasise the ability for the consumer to interpret the data
> stream, rather than standardising the data stream. It gives you more
> flexibility, and as in your case, allows true application-based
> optimisation.
> 
> I assume you are preloading the huffman dictionaries with the most common
> symbols?
> 
> 
> 
> 
> _______________________________________________
> FoRK mailing list
> http://xent.com/mailman/listinfo/fork
> 
> 


