[FoRK] Multicore, async segmented sequential models

J. Andrew Rogers andrew at jarbox.org
Fri May 10 14:00:34 PDT 2013

On May 10, 2013, at 12:56 AM, Stephen Williams <sdw at lig.net> wrote:
> Specifics?  I didn't see it in Sybase, SQL Server, Oracle (a while ago), MySQL, or PostgreSQL.  There was not one example for doing this with SQL fields, tables, and joins.

Many databases (PostgreSQL, SQL Server, et al.) support XML document types, for example, with indexed access via XPath expressions and the like. The query executor can see the internal structure. Most of the big enterprise databases support materializing a set of join relationships between logical tables into document-like record structures in the storage layer, with varying degrees of sophistication (Oracle 7 could do this). The big guys added support for join recursion sometime in the 1990s (yay, graph traversal). And then there are the object databases (remember those?), most of which were document-organized. What the designers of document-oriented databases claim is that their implementation is better, more modern, or relatively faster for certain types of queries.
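A rough sketch of what "the query executor can see the internal structure" means in practice, using SQLite's json1 functions as a stand-in for the XML/XPath support in the engines named above (the table and values here are made up for illustration):

```python
import sqlite3

# Documents stored as values in an ordinary relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [('{"user": "alice", "age": 31}',),
     ('{"user": "bob", "age": 27}',)],
)

# The executor evaluates a path expression against the document's
# internal structure rather than treating it as an opaque blob; such
# expressions can also be indexed.
rows = conn.execute(
    "SELECT json_extract(body, '$.user') FROM docs "
    "WHERE json_extract(body, '$.age') > 30"
).fetchall()
print(rows)  # [('alice',)]
```

The same shape of query works with XPath against XML columns in PostgreSQL or SQL Server; json1 is just the most portable way to demonstrate it.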

Arguments about relative efficiency aside, the long-term technical issue is poor concurrency and distributability.

> Sure, if you have a few fixed width fields and you have a zillion of them you can optimize most overhead out.  However, as soon as you want to be able to put a lot of data in a table, you'll want to consider column and row compression or similar.  Sure, you can just relegate blobs to another table / space and just put the reference in the row, but that's not necessarily going to make retrieval of that row faster.

Compression does not change the relative overhead. Per-column intra-page compression strategies are sophisticated and support direct, extremely fast query evaluation on rows in the compressed format, but most require some level of strictly typed structure to work well. If all of your records are arbitrary document graphs, you are left with naive (read: slow) row- and page-level compression options.
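To make "query evaluation on rows in the compressed format" concrete, here is a toy sketch of dictionary encoding, one common per-column scheme (the column values are invented for illustration). An equality predicate becomes one dictionary lookup plus integer comparisons on the codes; the column is never decompressed. Note that this only works because the column has a single known type, which is exactly why arbitrary document graphs can't use it:

```python
# A typed string column, as a columnar engine would see it.
column = ["DE", "US", "DE", "FR", "US", "DE"]

# Dictionary encoding: each distinct value maps to a small integer code.
dictionary = {v: i for i, v in enumerate(sorted(set(column)))}
codes = [dictionary[v] for v in column]  # the compressed column

# Evaluate `country = 'DE'` directly on the codes: one lookup up front,
# then integer comparisons -- no per-row string decoding.
target = dictionary["DE"]
matches = [i for i, c in enumerate(codes) if c == target]
print(matches)  # [0, 2, 5]
```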

> Applications already have to manage, parse, etc. this stuff.  Done correctly, the database wouldn't have to parse much or any of this.

This is not true. Every value in a database has a schema associated with it; for performance reasons, that schema is stored with the highest-level construct where it appears fixed. So-called schema-less database engines have to tag every value with a schema, and those tags have to be evaluated at query time. In addition to adding significant storage overhead, evaluating documents this way injects a lot of conditional branching into the processing pipeline.
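A minimal sketch of the two layouts, using Python's struct module (the tag values and record contents are invented). With a fixed schema the row decodes in one unconditional unpack; the tagged encoding pays extra bytes per value and a branch per value at decode time:

```python
import struct

# Fixed schema: layout known up front, one unconditional unpack.
fixed_row = struct.pack("<iq", 42, 7)   # int32 + int64 = 12 bytes
print(struct.unpack("<iq", fixed_row))  # (42, 7)

# "Schema-less": every value carries a type tag.
TAG_I32, TAG_I64 = 0, 1
tagged_row = struct.pack("<Bi", TAG_I32, 42) + struct.pack("<Bq", TAG_I64, 7)

def decode_tagged(buf):
    values, off = [], 0
    while off < len(buf):
        tag = buf[off]
        if tag == TAG_I32:              # a conditional branch per value
            values.append(struct.unpack_from("<i", buf, off + 1)[0])
            off += 5
        else:
            values.append(struct.unpack_from("<q", buf, off + 1)[0])
            off += 9
    return values

print(decode_tagged(tagged_row))        # [42, 7]
print(len(fixed_row), len(tagged_row))  # 12 vs 14 bytes
```

The per-value branch is what wrecks the processing pipeline at scale; the tag bytes are the storage overhead.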

It is not controversial that a highly optimized pipeline for conventional page structures is one to two orders of magnitude faster to evaluate (depending on the data model) than a similar pipeline for document-organized page structures. Evaluating document structures usually clocks in at around 100 MB/s per core.

> This isn't true for all storage strategies.  Write new records in compressed extents, store the extent and offset in indexes, etc. Lowering IO enough completely compensates for computation.

It inherently has a larger I/O footprint in terms of storage and only reduces query I/O in a narrow case. Storage formats with more structure have much better strategies for compression and query evaluation available.

As an aside, big RAM and SSDs have made optimizing I/Os somewhat less important while increasing the need to optimize for maximizing throughput of memory structures through a core.

> Are you sure that this is true once you take into account multiple queries needed to join and follow equivalent data?  Sometimes, a little extra processing in the database might obviate multiple queries.

Yes, I am taking that into account. This is a narrow optimization. It makes a single join much faster at the cost of making virtually everything else you might want to do with the data significantly slower. The cost exceeds the benefit for most data models.

> What does doing it right look like?  Besides pseudo-tables or just storing indexed blobs (in which you are using an XML database, not SQL), what does doing it right for traditional SQL look like for these problems?

There are many more options and strategies available. I was successfully designing architectures for cases like this in the 1990s using Oracle, which by modern standards was not a very sophisticated database engine though excellent for the time. It also had a ton of knobs for modifying storage organization behaviors that most database architects were unfamiliar with. Contrary to what I stated previously, I think I did use the document-organization option once for storing a mess of user attributes when I was designing Netflix's systems (Oracle 8 IIRC).

> SparQL always felt half done to me.  But you can express complex graph & semantic searches in a fairly powerful way.
> What will cut it?  Is SQL adding features to compete with it?  One of the other graph query languages?
> Or more likely the addition of graph / semantic search to SQL?

It is unbelievably crufty and was not designed to be extensible. I'm not offering a solution, just criticizing it. :-)  Most other query languages are narrow tweaks on SQL for a specific purpose -- they do not go far enough.

SQL has been able to do the semantic stuff and some types of graph analysis for ages, in its own ugly way. Geospatial was added as a gross hack. The language itself would be more easily extensible if it were an object-functional model. Good luck trying to express an analytical processing pipeline in it, even though that would be computationally efficient (there are several vendor-specific extensions for this purpose). SQL never really conceived of generalized spatial data models, so those require alternative DDL if you are on one of the few database engines with that capability. And DDL should be disallowed from specifying storage format; many database engines already fake support for storage-format clauses rather than implementing them, because the whole idea is an anachronism.
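For the "graph analysis in its own ugly way" point, here is the standard mechanism: a recursive common table expression (standardized in SQL:1999) walking an edge table. SQLite stands in for the big engines, and the graph is invented for illustration:

```python
import sqlite3

# A toy edge table: a -> b -> c -> d, plus an unreachable x -> y.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
conn.executemany("INSERT INTO edges VALUES (?, ?)",
                 [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")])

# Transitive closure from node 'a' via a recursive CTE -- SQL's
# workhorse (and arguably clunky) idiom for graph traversal.
rows = conn.execute("""
    WITH RECURSIVE reachable(node) AS (
        VALUES ('a')
        UNION
        SELECT e.dst FROM edges e JOIN reachable r ON e.src = r.node
    )
    SELECT node FROM reachable ORDER BY node
""").fetchall()
print([r[0] for r in rows])  # ['a', 'b', 'c', 'd']
```

It works, but expressing anything richer (shortest paths with weights, pattern matching) in this idiom gets verbose fast, which is the complaint.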

In short, SQL has been unable to adequately or cleanly express the increasing sophistication of both the analytics and the data models that people are trying to build with databases. There was a time when that was because people did not know how to use SQL well, but these days there are more cases of fundamental brokenness.
