[FoRK] Multicore, async segmented sequential models

J. Andrew Rogers andrew at jarbox.org
Sun May 12 21:25:59 PDT 2013

On May 12, 2013, at 12:29 PM, Stephen D. Williams <sdw at lig.net> wrote:
> On 5/10/13 3:58 PM, J. Andrew Rogers wrote:
>> SQL is just a query language. It is largely divorced from database engine implementation.
> Exactly my earlier point about why it is bizarre and ignorant for so little innovation in storage method to have happened until recently, when the stranglehold (mainly in the minds of corporations and their technologists) of the oligopoly of Oracle / SQL Server / IBM finally has loosened.

What would be an example of a new storage method that is not at least a decade old?  There have not been any in open source that I can think of, and the ones I can think of were done at IBM, Oracle, et al.

> Yes, but.  Innovations can be done that provide something that works as if it were a traditional triplestore, but while it takes aggressive shortcuts whenever possible.  Need for indexes flows directly from the particular queries made.  A database should really be a query->index compiler, where that means a lot more potentially than what a traditional RDBMS is capable of.

Not to dampen your ardor, but this has been done several times over the last few decades by people who did not bother to read the literature from the prior attempts. There are no mysteries as to why this does not work; there are clear theoretical reasons.

You are sort of handwaving away a lot of theoretical computer science as though that does not apply to whatever it is you want to do. 

> Yes but they're scaling this variety of system anyway, into the billions of triples.

No, they are not even scaling it to billions of triples unless you append so many conditions and qualifications that it has no relevance to ordinary use cases. And in any case, a billion triples is tiny; that problem fits in the RAM of my desktop, which raises the question of why it is considered a meaningful benchmark. (It is not a hypothetical, I literally process billion-edge graphs in memory on my desktop.) Graph databases have a used-car-salesman reputation for good reason.
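A rough back-of-envelope sketch of why a billion triples can plausibly fit in desktop RAM. It assumes each triple is stored as three fixed-width integer IDs (dictionary-encoded subject, predicate, object) and ignores indexes, the string dictionary, and allocator overhead, so real footprints will be larger. The function names and the 4-byte and 8-byte ID widths are illustrative assumptions, not anything taken from a particular database.

```python
def triple_store_bytes(num_triples, id_bytes=8, fields_per_triple=3):
    """Raw size in bytes of num_triples triples stored as fixed-width integer IDs.

    Assumption: no indexes, no string dictionary, no per-record overhead.
    """
    return num_triples * fields_per_triple * id_bytes


def to_gib(num_bytes):
    """Convert a byte count to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    one_billion = 10**9
    for id_bytes in (4, 8):
        size = triple_store_bytes(one_billion, id_bytes)
        # 4-byte IDs: 11.2 GiB, 8-byte IDs: 22.4 GiB
        print(f"{id_bytes}-byte IDs: {to_gib(size):.1f} GiB")
```

Under these assumptions the raw data is on the order of 11 to 22 GiB, which is within reach of a well-equipped desktop.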

Companies like IBM can eat through graph analytic problems that are considerably more difficult, at scales several orders of magnitude larger, without the asterisks. Nothing in open source has any idea how to do that. They are using inferior computer science.

>> "simply requiring a reconfiguration of data / index / memory"
>> Yeah, well that is the real trick now, isn't it. In real systems, no one is willing to pay the extraordinary cost of that operation. In a distributed system it would be straight pathological. 
> Depends on what one means by this and how clever one is.

Actually, no. Cleverness does not solve theoretical limitations no matter how much one refuses to acknowledge them.

>>> or some unifying but tunably flexible solution.
>> Yes, this is how it has to be done. So why has no one built such a solution?
> People are trying, somewhat, although usually coming at it from a particular narrow need.  At least we are well beyond everything being a relatively dumb RDBMS.

I was asking a cynical trick question. Theoretical computer scientists who work in this space know what the problems are, and those problems include a great many things of which nouveau database designers are ignorant. In reality, there are multiple hard theoretical problems in the way of this glorious future that have withstood the attempts of ordinary PhDs to solve them for decades. The idea that someone who can't be bothered to read the vast bodies of prior literature will solve all of these is dubious and on some levels insulting to all the computer scientists who have spent their lives working on these problems.

It is not obvious to me that you appreciate the seriously hardcore theoretical nature of the problem you think some clever hipster will solve over the weekend. Designing good, scalable databases is extremely difficult even if you are a frackin' genius.

> This is a nice map of the space, although categorization & placement is a bit simplistic in some cases:
> http://blogs.the451group.com/information_management/files/2013/02/db_Map_2_13.jpg

Really? That is basically a catalog of databases that suck more than high-end relational databases. For the record, I am not even a fan of the databases that prove that rule. This chart is a dubious model, shoehorned into a wrong model that people can understand.

In any case, that graphic is not particularly educational.

> http://blog.sqrrl.com/post/49905027780/how-to-choose-a-nosql-database

I do not want to bust your balls but your understanding of databases seems to be driven by marketing hype. 

My advantage is that I've (1) read tens of thousands of pages of the relevant literature and (2) developed several of the algorithm families that are currently considered the state-of-the-art in this field. The second reason is probably more important. :-) 

The lack of scalable, flexible, expressive databases is not because computer scientists are assholes but because the problem is *hard*. Like Turing Award hard. I can't tell you how many high-reputation computer scientists have dedicated their lives to this and done squat.

Please consider the possibility that this is not easy.
