[FoRK] Multicore, async segmented sequential models

J. Andrew Rogers andrew at jarbox.org
Thu May 9 17:04:37 PDT 2013

On May 9, 2013, at 11:15 AM, Stephen Williams <sdw at lig.net> wrote:
> This is an apparently complete list of NoSQL databases, although they don't cover content stores like S3 and similar:
> http://nosql-database.org/

That list is not complete but it covers a lot of ground. There are unconventional database engines that could support a SQL interface but simply choose not to and others that implement both SQL and NoSQL type interfaces that are excluded from the list because they offer the former as an option. The classification of "NoSQL" has no implications in terms of capabilities these days.

> PostgreSQL seems to have been making a name for itself in the geosearching area.  Useful even if all you use it for is an index.

PostgreSQL is a mediocre geospatial indexing platform but wins by default because there is nothing else comparable in open source. PostgreSQL still does not do geodetic computations correctly and the implementation is incredibly slow but at least it tries to do a good job. It is pretty good if your data set is tiny and queries simple. Oracle is better, and they do geodetic correctly AFAIK, but not so much better that it is a compelling choice versus PostgreSQL.

There are no fast, scalable geospatial indexes on the market today. Scalable geospatial indexes require a fundamentally different architecture than those of the database engines onto which they are tacked. The situation is even worse on Hadoop hence the complete lack of useful geospatial support there.

> On the MySQL side, MariaDB has been getting press.  And I've been spammed about NuoDB, not to mention various fully commercial MySQL scalers.

MySQL seems to be increasingly relegated to the role of an object store. I think it is safe to say that PostgreSQL won the open source RDBMS war at this point, a good result since PostgreSQL evolved into something that was superior in almost every regard. 

PostgreSQL is quite a good default database for the vast majority of applications and what I usually recommend absent a good reason to use something else.

> Cassandra is well supported in some cloud packages, as is S3 compatible content stores.

Cassandra is a sophisticated key-value store known for its good write performance and robust handling of node failures. However, it inherits the query limitations of most key-value stores. It is unlikely to be anything but a niche player, in part because it is built in Java; it is much easier to optimize the workload it was designed for in C++ which has allowed competitors to differentiate. Cool in 2008, less so in 2013.

> For fine-grained work, I'd probably favor something like Neo4J.

I'm not sure what "fine-grained" means but Neo4J is probably the best default choice for a graph database assuming you are doing complex graph analytics. Like every other graph database, it scales poorly. Every claim of "scalable" graph databases needs to be followed by about a dozen asterisks and lots of fine print. Quality relational databases support the type of join recursion required to do graph analysis but it is not their sweet spot by any stretch of the imagination.

Best practice for most companies doing simple graph analytics is to materialize some of the implicit joins and use a more conventional database engine that scales better. You can't do complex ad hoc queries on a graph implemented this way but for most applications that is not a requirement and the runtime performance is usually better.

> Savvy real-world comments on the best of these for different kinds of tasks would be very helpful to avoid recapitulating the pain of wasted exploration.  Most things work well enough in a prototype, and most fail in annoying ways at scale.  These days, we ought to have a rich development and data model, very simple install and management from embedded, single instance, to scaled, redundant, replicating, etc.

There are deep tradeoffs that bleed into the interfaces even if carefully designed.

Database engine internals are designed to fit characteristic curves in myriad dimensions so that the overall system throughput peaks at a selected point in the phase space of the several metrics by which database performance, scalability, etc is measured. There is no correct way to design the internals per se, it is selecting from a complex set of tradeoffs to (hopefully) produce a system that optimally exhibits the desired characteristics in aggregate. 

For a simple example, a database engine optimized for GB-scale data models is going to suck at PB-scale data models and vice versa, and those internals will look very, very different even if you just look at individual shards. There are database engine designs where the overall system becomes significantly faster as the size of the indexed data model approaches the PB range.

A good engine can often be repurposed to do an adequate job of targeting narrow use cases outside its design spec. However, this requires the user to figure out how to optimally express that data model and workloads in terms of the construction of the database engine. Most database architects already do this without thinking about it but it is exposed. Workload optimization is necessarily a compile-time decision as far as the database engine is concerned.

More information about the FoRK mailing list