[FoRK] Now with magic pixie dust!
Kenneth.Meltsner at ca.com
Fri May 21 14:24:49 PDT 2004
It's also worth pointing out that Tarari is comparing their hardware to relatively naïve implementations of XPath operations -- reading the white paper, they don't mention various indexing schemes (BitCubes, Quasi-BitCubes), translations to previously solved problems (XPath to SQL, for example), or the high-performance XML stream processing efforts at U Washington and elsewhere. So, in the end it's "naïve" vs. "sophisticated," not "software" vs. "hardware."
It is an open question whether they handle full XML, or some binary-friendly subset -- Elliotte Rusty Harold at Café con Leche discussed this while blogging WWW2004:
(not a permalink, hence the huge FoRK-friendly quote):
The talk that sold me on the Optimizing Encoding session was IBM's Roberto J. Bayardo "An Evaluation of Binary XML Encoding Optimizations for Fast Stream based XML Processing." (Co-authors: Daniel Gruhl, Vanja Josifvoski, and Jussi Myllymaki) I suspect I'll hate it, but I could be wrong. My prediction for yesterday's schema talk proved to be quite off base. Maybe these guys will surprise me too. My specific prediction is that they'll speed up XML processing by failing to do well-formedness checking. Of course, they'll hide that in a binary encoding and claim the speed gain is from using binary rather than text. However, I've seen a dozen of these binary panaceas and they never really work the way the authors think they work. An XML processor needs to be able to accept any stream of data and behave reliably, without crashing, freezing, or reporting incorrect content. And indeed a real XML parser can do exactly this. No matter what you feed it, it either succeeds or reports an error in a predictable, safe way. Most binary schemes incorrectly assume that because the data is binary they don't need to check it for well-formedness. But as anyone who's ever had Microsoft Word crash because of a corrupted .doc file or had to restore a hosed database from a backup should know, that's not actually true. Binary data is often incorrect, and robust programs need to be prepared for that. XML parsers are robust. Most binary parsers aren't. The fact is when you pull well-formedness checking out of a text-based XML parser it speeds up a lot too, as many developers of incomplete parsers like ???? have found. Binary is not fundamentally faster than text, as long as the binary parser does the same work as the text based parser does. Anyway, that's my prediction for where this talk is going (I haven't read the paper yet.) In an hour, we'll see if my prediction was right.
He thinks XML is too heavyweight for high performance apps. (I completely disagree. I'm really having to bite my tongue here to let the speaker finish in his allotted small time.) both in parsing overhead and space requirements. "Bad side of XML is addressed by throwing away (much of) the good side." XML compression trades speed for small size. Also negatively impacts the ability to stream. Some parsing optimizations slow down the encoding. In other words they trade speed of encoding for speed of decoding. He's only considering single stream encodings. Strategies tested include:
Tokenization of common strings: string table makes this inappropriate for streaming because you need to know the tokenized string in advance. But demand driven tokenization may work, but causes problems for random access.
Random access (indexing) support: doesn't work well with most APIs and many queries
Schema-based optimizations: doesn't work for schema-less documents and brittle in face of schema changes
Most of their evaluations are based on IBM's XTalk plus various enhanced versions of XTalk. (He seems to be confusing SAX with unofficial C++ versions. ) TurboXPath is a representative application. Two sample data sets: record-oriented (a big bibliography, 180MB) and more deeply nested collection of real estate listings grouped by cities (50 MB). Used Visual C+++ on Windows. Linux results were similar. They tested expat and Xerces-C. (Why not libxml2?) Their numbers show maybe a third to a half improvement by using binary parsing. expat was three times faster than Xerces-C. (on their first example query). On the second query, improvement is a little more, but in the same ballpark. Skip pointers are a significant improvement here. "Don't throw out plain XML if you haven't tried expat." "Java and DOM based parsers are not competitive."
It took till Q&A to find out for sure, but I was right. They aren't doing full well-formedness checking. For instance, they don't check for legal characters in names. That's why XTalk is faster. They claim expat isn't either (I need to check that) but there was disagreement from the audience on this point. But the bottom line is they failed to demonstrate a difference in speed due to binary vs. text.
More information about the FoRK