[FoRK] Multicore, async segmented sequential models

J. Andrew Rogers andrew at jarbox.org
Thu May 9 23:59:21 PDT 2013

On May 9, 2013, at 9:45 PM, Stephen Williams <sdw at lig.net> wrote:
> On 5/9/13 7:27 PM, J. Andrew Rogers wrote:
>> Fixed format tables? What does that mean?
> First, delete all data definition statements.
> Next, fully qualify names of each "column" on "row" insert.
> Insert N rows with every row potentially having a different set of column names.
> Make up a new column name, insert a row containing it, query on that column.
> Allow indexing on any field name, even if not present in all rows.  Support dynamic automatic indexing based on queries.
> Adding a new field for new rows?  Just start inserting rows using that column.
> Apparently they couldn't imagine how to do that efficiently enough in the 70's...
> But we are no longer IN the 70's.

This has been available in databases for as long as I remember, under a variety of interfaces and names. Good relational databases can support it in almost exactly the same way that so-called "document-oriented" databases support it. And if this functionality has existed in databases in some form or another forever then it should raise the question of why no one used it until some young hipsters recently "discovered" it.  

The short version: it is an inherently inefficient and non-scalable way to fundamentally organize a database. Write performance will generally suck, concurrency will suck, it won't scale worth a damn, and some common database queries will be painfully slow relative to more conventional organizations. If your database is only 10 GB or something like that, you might not notice the performance and concurrency issues because it fits in memory. If you have TBs of data, the problems will become hard to ignore.

It exists in databases for those few cases where it might offer a benefit that offsets the cost. (Back when I was architecting databases, I never found a use case for this particular optimization where it was worth it.)

> Want to do exactly the same optimized storage that you could do with fixed format tables?  Just do it for the rows that share the same fields, or hybrid with common fields and overflow / alternate, or whatever.

You have not through the implications of this. 

How is this stored on pages when every row looks different? When every row looks the same? How are you minimizing the number of pages touched per operation? Outside of some narrow cases, this bit of cleverness will increase the average number of page accesses required to satisfy most workloads. 

Do you think storing and parsing all of this row metadata is inexpensive? Far from it!  PostgreSQL had a big across the board performance win when they shaved 4 bytes from the row header, and they don't have to describe the structure of every row. This stuff adds up surprisingly quickly. 

Also, this makes almost all page structure optimization methods for both write and query performance largely non-usable. That is an integer factor performance hit for evaluation of most queries so you better hope your queries are in the subset that might save a page access.

I think you might be underestimating just how computationally efficient database engines actually are. Things you think of as trivial additional overhead will show up as huge performance hits in many databases.

> Don't think this is a real problem?  "Build a database for a hypermarket (Super Walmart / Target / Meijer / Fry's) with a row for each item having searchable representation for each salient attribute.  Metrics for a tire, screw, high heeled shoe, book, steak, shirt, etc.  Now do it in one inventory table.  And among the 30,000 different types of items, there is a daily turnover of about 200 new types of items.  No, the store cannot hire a DBA."  The solutions to this with SQL are bleak: incomplete, irregular, inefficient, and otherwise ugly.  

Yer doin' it wrong.

> True.  The fixed format tables (assumption that all rows have exactly the same fields of the same types, with conversion of data needed for changes) is one of the worst.

I don't think "fixed format" means what you think it means. A column can have a complex data structure as a type where the contents of the complex structure are queryable (XML and JSON docs are popular). And you can index the contents of the complex structures if you like. Hierarchies of tables can have the joins materialized into the same database page under the hood (basically constructing a document in terms of physical storage).

What, precisely, is it that a modern relational database does not do? They are *more* flexible than something like MongoDB, which basically only supports a fixed, 2-column format, a key and a document.

> The crazy thing is that nothing in the query model implies a need for this.

No one said that it did.

> While we shouldn't stop at SparQL, it is a fairly powerful alternative to SQL, especially with arbitrary semantics possible in the engine SparQL is querying.

SparQL is not going to cut it. 

More information about the FoRK mailing list