[FoRK] Multicore, async segmented sequential models

Stephen D. Williams sdw at lig.net
Fri May 10 15:11:13 PDT 2013


You started out sounding like you disagreed with my main premise and then concluded in the same general direction: that SQL is broken, even if various frankenDBMS products have enough bolted on to solve a wider range of problems.

SPARQL is a tricky beast: apparently, a key concept is that the query model was intentionally limited, not including generalized graph regular expressions for instance, precisely because it was intended to query both a dumb triple-store and a semantics-enabled knowledge base.  The latter provides more or less unlimited semantic power through the expressiveness of the reasoning system and the particular schema / knowledge loaded.  So, somewhat like the legacy HTML vs. CSS+XHTML split, rather than explicitly implementing a particular graph traversal for a particular form of knowledge, you simply query for "descendants of" and get children, grandchildren, etc.
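
To make that concrete, here is a toy sketch in plain Python (the predicate names parentOf / hasDescendant are made up for illustration): the store materializes hasDescendant as the transitive closure of parentOf, so a flat single-pattern query returns children, grandchildren, and so on without the query ever spelling out a traversal.

    triples = {
        ("alice", "parentOf", "bob"),
        ("bob",   "parentOf", "carol"),
        ("carol", "parentOf", "dave"),
    }

    def materialize_descendants(facts):
        """Naive forward chaining: parentOf implies hasDescendant, and
        hasDescendant is transitive; iterate to a fixed point."""
        inferred = {(s, "hasDescendant", o) for (s, p, o) in facts if p == "parentOf"}
        changed = True
        while changed:
            changed = False
            for (a, _, b) in list(inferred):
                for (c, _, d) in list(inferred):
                    if b == c and (a, "hasDescendant", d) not in inferred:
                        inferred.add((a, "hasDescendant", d))
                        changed = True
        return facts | inferred

    def match(facts, s=None, p=None, o=None):
        """Flat triple-pattern query; None acts as a variable."""
        return [t for t in facts
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]

    kb = materialize_descendants(triples)
    # The "dumb" query: who are the descendants of alice?
    print(sorted(o for (_, _, o) in match(kb, s="alice", p="hasDescendant")))
    # -> ['bob', 'carol', 'dave']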

I like and dislike this.  I find it a huge punt relative to what they could have done; however, they may have a point in striving toward a powerful logical semantics rather than letting queries be poisoned by the specifics of data form and mechanics.  The former is much more likely to be a timeless expression of intent than the latter.  Even if they didn't do it well, it is a good direction.

The way that triplestores work is interesting, and I can see a lot of takeoffs to improve chunkiness (a la document databases), etc. 
Every constant is interned in a constants table, and each triple consists of three IDs (four for a quad).  These are then indexed for every possible combination: SPO, POS, OPS, OSP, etc.  You can do both full-text searches of the constant space and relational / graph searches in any direction for any relationship chain.  That is more indexing than a typical RDBMS, and the granularity is generally too fine, but it is totally flexible and, for some tasks, actually seems efficient.  Since it can be highly compressed and/or minimized, in memory for instance, there are some interesting cases.
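
Roughly, and only as a sketch in plain Python with invented names, the mechanics look something like the block below.  Which permutations get indexed is deliberately left as a knob, which matters for the tunability point further down.

    from collections import defaultdict

    class ToyTripleStore:
        """Sketch of the layout described above: every constant is interned
        once in a constants table, each triple is stored as three integer IDs,
        and one index is built per configured permutation so any bound /
        unbound combination can be answered."""

        def __init__(self, orders=("spo", "pos", "osp")):
            self.constants = {}                        # term -> id
            self.terms = []                            # id -> term
            self.indexes = {o: defaultdict(set) for o in orders}

        def _intern(self, term):
            if term not in self.constants:
                self.constants[term] = len(self.terms)
                self.terms.append(term)
            return self.constants[term]

        def add(self, s, p, o):
            ids = {"s": self._intern(s), "p": self._intern(p), "o": self._intern(o)}
            for order, index in self.indexes.items():
                # e.g. for "pos" the key is (p, o) and the value set holds s
                index[(ids[order[0]], ids[order[1]])].add(ids[order[2]])

        def lookup(self, order, first, second):
            """Bind the first two positions of a permutation, get the third."""
            key = (self.constants[first], self.constants[second])
            return {self.terms[i] for i in self.indexes[order].get(key, set())}

    store = ToyTripleStore()
    store.add("alice", "knows", "bob")
    store.add("carol", "knows", "bob")
    print(store.lookup("spo", "alice", "knows"))   # what does alice know? -> {'bob'}
    print(store.lookup("pos", "knows", "bob"))     # who knows bob? -> {'alice', 'carol'}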

On the other hand, document databases win for things that are accessed and used in a chunky manner.  Third normal form is a waste in 
that case.

Some of my old thinking involved combining both ideas: chunky, versioned, delta-enabled, document-based triplestores.  I'm getting back there now.  You want the full flexibility of a graph-query triplestore but the efficiency of an RDBMS and/or document and/or content stores and/or filesystems.
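
One way to picture the combination, purely as a sketch with hypothetical names: treat a document as a named chunk of triples, store each new version as a delta (triples added, triples removed), and reconstruct any version on demand, so storage and I/O happen at chunk granularity while the flat triple view stays queryable.

    class ChunkedTripleDoc:
        """Sketch (invented names) of a chunky, versioned, delta-enabled
        document whose contents are triples: version 0 is a full chunk,
        later versions are stored only as (added, removed) deltas, and any
        version can be reconstructed for flat triple-level querying."""

        def __init__(self, doc_id, base_triples):
            self.doc_id = doc_id
            self.base = frozenset(base_triples)
            self.deltas = []                           # list of (added, removed)

        def commit(self, added=(), removed=()):
            self.deltas.append((frozenset(added), frozenset(removed)))
            return len(self.deltas)                    # version number

        def triples_at(self, version=None):
            """Replay deltas up to `version` (default: latest)."""
            if version is None:
                version = len(self.deltas)
            current = set(self.base)
            for added, removed in self.deltas[:version]:
                current -= removed
                current |= added
            return current

    doc = ChunkedTripleDoc("order-42", {("order-42", "status", "open"),
                                        ("order-42", "customer", "alice")})
    v1 = doc.commit(added={("order-42", "status", "shipped")},
                    removed={("order-42", "status", "open")})
    print(doc.triples_at(0))    # the original chunk
    print(doc.triples_at(v1))   # current view: still plain, queryable triples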

Another way of putting it is that you need a model that allows configuration, manual or automatic, so that you can vary between maximum flexibility and power on one end and various types of performance optimization on the other.  It doesn't seem likely that you can get everything at once, but it does seem possible to have everything available with little or no code change, requiring only a reconfiguration of data / indexes / memory.  The key question is whether this is best done by a multibase, with various modes bolted on, or by some unifying but tunably flexible solution.

Stephen

On 5/10/13 2:00 PM, J. Andrew Rogers wrote:
> On May 10, 2013, at 12:56 AM, Stephen Williams <sdw at lig.net> wrote:
>> Specifics?  I didn't see it in Sybase, SQL Server, Oracle (a while ago), MySQL, or PostgreSQL.  There was not one example for doing this with SQL fields, tables, and joins.
>
> Many databases (PostgreSQL, SQL Server, et al) support e.g. XML document types and indexed access via xpath expressions and similar. The query executor can see the internal structure. Most of your big enterprise databases support materializing a set of join relationships between logical tables into document-like record structures in the storage layer with varying degrees of sophistication (Oracle 7 could do this). The big guys added support for join recursion sometime in the 1990s (yay, graph traversal). And then there are the object databases (remember those?) most of which were document-organized. What the designers of document-oriented databases claim is that their implementation is better or more modern or relatively faster when doing certain types of queries.
>
> Arguments about relative efficiency aside, the long-term technical issue is poor concurrency and distributability.
>
>
>> Sure, if you have a few fixed width fields and you have a zillion of them you can optimize most overhead out.  However, as soon as you want to be able to put a lot of data in a table, you'll want to consider column and row compression or similar.  Sure, you can just relegate blobs to another table / space and just put the reference in the row, but that's not necessarily going to make retrieval of that row faster.
>
> Compression does not change the relative overhead. Per-column intra-page compression strategies are sophisticated and support direct and extremely fast query evaluation on rows in the compressed format but most require some level of strictly typed structure to work well. If all of your records are arbitrary document graphs, you are left with naive (read: slow) row- and page-level compression options.
>
>
>> Applications already have to manage, parse, etc. this stuff.  Done correctly, the database wouldn't have to parse much or any of this.
>
> This is not true. Every value in a database has a schema associated with it; for performance reasons, the schema is stored with the highest-level construct where it appears fixed. So-called schema-less database engines have to tag every value with a schema, and those tags have to be evaluated at query time. In addition to adding significant storage overhead, the evaluation of documents injects a lot of conditional branching into the processing pipeline.
>
> It is not controversial that a highly-optimized pipeline for conventional page structures is 1-2 orders of magnitude faster to evaluate (depending on the data model) than similar pipelines for document-organized page structures. Evaluating document structures usually clocks in at around 100 MB/s per core.
>
>
>> This isn't true for all storage strategies.  Write new records in compressed extents, store the extent and offset in indexes, etc. Lowering IO enough completely compensates for computation.
>
> It inherently has a larger I/O footprint in terms of storage and only reduces query I/O in a narrow case. Storage formats with more structure have much better strategies for compression and query evaluation available.
>
> As an aside, big RAM and SSDs have made optimizing I/Os somewhat less important while increasing the need to optimize for maximizing throughput of memory structures through a core.
>
>
>> Are you sure that this is true once you take into account multiple queries needed to join and follow equivalent data?  Sometimes, a little extra processing in the database might obviate multiple queries.
>
> Yes, I am taking that into account. This is a narrow optimization. It makes a single join much faster at the cost of making virtually everything else you might want to do with the data significantly slower. The cost exceeds the benefit for most data models.
>
>
>> What does doing it right look like?  Besides pseudo-tables or just storing indexed blobs (in which you are using an XML database, not SQL), what does doing it right for traditional SQL look like for these problems?
>
> There are many more options and strategies available. I was successfully designing architectures for cases like this in the 1990s using Oracle, which by modern standards was not a very sophisticated database engine though excellent for the time. It also had a ton of knobs for modifying storage organization behaviors that most database architects were unfamiliar with. Contrary to what I stated previously, I think I did use the document-organization option once for storing a mess of user attributes when I was designing Netflix's systems (Oracle 8 IIRC).
>
>
>> SPARQL always felt half done to me.  But you can express complex graph & semantic searches in a fairly powerful way.
>> What will cut it?  Is SQL adding features to compete with it?  One of the other graph query languages?
>> Or more likely the addition of graph / semantic search to SQL?
>
> It is unbelievably crufty and was not designed to be extensible. I'm not offering a solution, just criticizing it. :-)  Most other query languages are narrow tweaks on SQL for a specific purpose -- they do not go far enough.
>
> SQL has been able to do the semantic and some types of graph analysis stuff for ages, in its own ugly way. Geospatial was added as a gross hack. The language itself would be more easily extensible if it was an object-functional model. Good luck trying to express an analytical processing pipeline in it even though that would be computationally efficient (there are several vendor-specific extensions for this purpose). SQL never really conceived of generalized spatial data models so that requires alternative DDL if you are one of the few database engines with that capability. Disallow DDL to specify storage format; many database engines already fake support for storage formats rather than implementing them because it is an anachronism.
>
> In short, SQL has been unable to adequately or cleanly express the increasing sophistication of both the analytics and data models that people are trying to build with databases. There was a time when that was because people did not know how to use SQL well but these days there are more cases of fundamental broken-ness.
>
>

sdw


