[FoRK] Multicore, async segmented sequential models

Stephen Williams sdw at lig.net
Fri May 10 00:56:47 PDT 2013


On 5/9/13 11:59 PM, J. Andrew Rogers wrote:
> On May 9, 2013, at 9:45 PM, Stephen Williams <sdw at lig.net> wrote:
>> On 5/9/13 7:27 PM, J. Andrew Rogers wrote:
>>> Fixed format tables? What does that mean?
>>>
>> First, delete all data definition statements.
>> Next, fully qualify names of each "column" on "row" insert.
>> Insert N rows with every row potentially having a different set of column names.
>> Make up a new column name, insert a row containing it, query on that column.
>> Allow indexing on any field name, even if not present in all rows.  Support dynamic automatic indexing based on queries.
>> Adding a new field for new rows?  Just start inserting rows using that column.
>>
>> Apparently they couldn't imagine how to do that efficiently enough in the 70's...
>> But we are no longer IN the 70's.
>
> This has been available in databases for as long as I remember, under a variety of interfaces and names. Good relational databases can support it in almost exactly the same way that so-called "document-oriented" databases support it. And if this functionality has existed in databases in some form or another forever then it should raise the question of why no one used it until some young hipsters recently "discovered" it.

Specifics?  I didn't see it in Sybase, SQL Server, Oracle (a while ago), MySQL, or PostgreSQL.  There was not one example for 
doing this with SQL fields, tables, and joins.
Storing an XML or similar blob in a field is not the same thing. I've actually built a number of systems using that method to 
maintain sanity, including what is probably the highest volume payment management system in the US.  I've also worked on a 
system that did the relational database within a relational database (table of tables, table of row+columns+values, etc.), aka 
pseudo-tables.

>
> The short version: it is an inherently inefficient and non-scalable way to fundamentally organize a database. Write performance will generally suck, concurrency will suck, it won't scale worth a damn, and some common database queries will be painfully slow relative to more conventional organizations. If your database is only 10 GB or something like that, you might not notice the performance and concurrency issues because it fits in memory. If you have TBs of data, the problems will become hard to ignore.

Sure, if you have a few fixed width fields and you have a zillion of them you can optimize most overhead out.  However, as soon 
as you want to be able to put a lot of data in a table, you'll want to consider column and row compression or similar.  Sure, 
you can just relegate blobs to another table / space and just put the reference in the row, but that's not necessarily going to 
make retrieval of that row faster.

>
> It exists in databases for those few cases where it might offer a benefit that offsets the cost. (Back when I was architecting databases, I never found a use case for this particular optimization where it was worth it.)

What are you talking about?  The feature that shall not be named?

I know that the kitchen sink has been added to Oracle et al now, so this is somewhat moot.  We were talking about the 
historically narrow solution that SQL provided.

>
>
>> Want to do exactly the same optimized storage that you could do with fixed format tables?  Just do it for the rows that share the same fields, or hybrid with common fields and overflow / alternate, or whatever.
>
> You have not through the implications of this.
>
> How is this stored on pages when every row looks different? When every row looks the same? How are you minimizing the number of pages touched per operation? Outside of some narrow cases, this bit of cleverness will increase the average number of page accesses required to satisfy most workloads.

I already pointed out that a system could support uniform rows in the traditional way while maintaining more diverse data 
separately, probably in what amounts to a separate table.  The point is programmer, management, and perhaps performance in some 
cases would be a lot better if the database supported more flexible models.

>
> Do you think storing and parsing all of this row metadata is inexpensive? Far from it!  PostgreSQL had a big across the board performance win when they shaved 4 bytes from the row header, and they don't have to describe the structure of every row. This stuff adds up surprisingly quickly.

Applications already have to manage, parse, etc. this stuff.  Done correctly, the database wouldn't have to parse much or any of 
this.

>
> Also, this makes almost all page structure optimization methods for both write and query performance largely non-usable. That is an integer factor performance hit for evaluation of most queries so you better hope your queries are in the subset that might save a page access.

This isn't true for all storage strategies.  Write new records in compressed extents, store the extent and offset in indexes, 
etc. Lowering IO enough completely compensates for computation.

>
> I think you might be underestimating just how computationally efficient database engines actually are. Things you think of as trivial additional overhead will show up as huge performance hits in many databases.

Are you sure that this is true once you take into account multiple queries needed to join and follow equivalent data?  
Sometimes, a little extra processing in the database might obviate multiple queries.

>
>
>> Don't think this is a real problem?  "Build a database for a hypermarket (Super Walmart / Target / Meijer / Fry's) with a row for each item having searchable representation for each salient attribute.  Metrics for a tire, screw, high heeled shoe, book, steak, shirt, etc.  Now do it in one inventory table.  And among the 30,000 different types of items, there is a daily turnover of about 200 new types of items.  No, the store cannot hire a DBA."  The solutions to this with SQL are bleak: incomplete, irregular, inefficient, and otherwise ugly.
>
> Yer doin' it wrong.

What does doing it right look like?  Besides pseudo-tables or just storing indexed blobs (in which you are using an XML 
database, not SQL), what does doing it right for traditional SQL look like for these problems?

>
>
>> True.  The fixed format tables (assumption that all rows have exactly the same fields of the same types, with conversion of data needed for changes) is one of the worst.
>
> I don't think "fixed format" means what you think it means. A column can have a complex data structure as a type where the contents of the complex structure are queryable (XML and JSON docs are popular). And you can index the contents of the complex structures if you like. Hierarchies of tables can have the joins materialized into the same database page under the hood (basically constructing a document in terms of physical storage).

Nesting XML, JSON, etc. in a relational SQL database isn't really solving the problem at the SQL level.  From the point of view 
of SQL, that's delegating everything to the application, although extensions do this in the database.

>
> What, precisely, is it that a modern relational database does not do? They are *more* flexible than something like MongoDB, which basically only supports a fixed, 2-column format, a key and a document.

A modern multi-database can do everything.  But then it isn't strictly "SQL" or "relational".

>
>
>> The crazy thing is that nothing in the query model implies a need for this.
>
> No one said that it did.

They enforced it in the design of shipping databases, until a more modern era.  That's the same thing.

>
>
>> While we shouldn't stop at SparQL, it is a fairly powerful alternative to SQL, especially with arbitrary semantics possible in the engine SparQL is querying.
>
> SparQL is not going to cut it.

SparQL always felt half done to me.  But you can express complex graph & semantic searches in a fairly powerful way.
What will cut it?  Is SQL adding features to compete with it?  One of the other graph query languages?
Or more likely the addition of graph / semantic search to SQL?  OrientDB and others do this.  If they have all of SQL and SparQL, including OWL et al, and other powerful ideas, it could get very good.


sdw



More information about the FoRK mailing list