[FoRK] Multicore, async segmented sequential models

Stephen D. Williams sdw at lig.net
Sun May 12 23:03:18 PDT 2013


On 5/12/13 9:25 PM, J. Andrew Rogers wrote:
> On May 12, 2013, at 12:29 PM, Stephen D. Williams <sdw at lig.net> wrote:
>> On 5/10/13 3:58 PM, J. Andrew Rogers wrote:
>>> SQL is just a query language. It is largely divorced from database engine implementation.
>>>
>> Exactly my earlier point about why it is bizarre and ignorant for so little innovation in storage method to have happened until recently, when the stranglehold (mainly in the minds of corporations and their technologists) of the oligopoly of Oracle / SQL Server / IBM finally has loosened.
>
> What would be an example of a new storage method that is not at least a decade old?  There has not been any in open source that I can think of, and what I can think of was done at IBM, Oracle, et al.

I did message-oriented development with lightweight message queues for high-performance, scalable systems 15-20 years before 
the approach became well known and commonly used.  I believe it was common somewhere in mainframe systems long before that 
(although probably not using network links in clusters).  I know that most things were fundamentally identified and explored 
long ago.  That doesn't mean those ideas were usable in real projects; sometimes that took decades.  Just because there are 
few or no new fundamental ideas in C++11 doesn't mean it isn't far ahead of the C++ I had 25 years ago.

I'm looking for products that I can feasibly use to scale from mobile to desktop to arbitrarily large cloud systems without spending 
a career inventing them.

>
>
>> Yes, but.  Innovations can be done that provide something that works as if it were a traditional triplestore, but while it takes aggressive shortcuts whenever possible.  Need for indexes flows directly from the particular queries made.  A database should really be a query->index compiler, where that means a lot more potentially than what a traditional RDBMS is capable of.
>
> Not to dampen your ardor but this has been done several times over the last few decades by people that did not bother to read the literature from the prior attempts. There are no mysteries as to why this does not work, there are clear theoretical reasons.

While you could argue this for many areas of databases, it seems odd that you would pick on this one.  There is no reason that, 
given a particular set of data and a history of queries against that data, a variety of choices for representing and indexing 
that data couldn't be explored.  A system could rerun the historical queries, vary parameters, run Monte Carlo simulations over 
data distributions, tune block placement, and so on.  It could even do genetic-programming exploration.  All of this is made of 
fairly mundane steps, but I cannot download and use such a solution right now, at least not one that documents heroic efforts 
to optimize data to this degree.
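To make that concrete, here is a minimal sketch of the kind of search loop I have in mind.  Everything in it is hypothetical 
(the cost model especially); a real system would also charge for index maintenance, storage, and write amplification, and 
would explore data layouts as well as index sets:

import itertools

# Hypothetical sketch: choose the best set of single-column indexes for
# a recorded query workload by replaying the workload against each
# candidate configuration and scoring total estimated lookup cost.

TABLE_ROWS = 1_000_000

# A query here is just (filter_column, selectivity): the fraction of
# rows that survive the filter.
query_log = [
    ("customer_id", 0.0001),
    ("customer_id", 0.0001),
    ("region", 0.05),
    ("order_date", 0.01),
]

def estimated_cost(query, indexed_columns):
    column, selectivity = query
    if column in indexed_columns:
        return 1 + selectivity * TABLE_ROWS  # index probe + matching rows
    return TABLE_ROWS                        # full table scan

def workload_cost(indexed_columns):
    return sum(estimated_cost(q, indexed_columns) for q in query_log)

columns = {c for c, _ in query_log}
candidates = (set(combo)
              for r in range(len(columns) + 1)
              for combo in itertools.combinations(sorted(columns), r))
best = min(candidates, key=workload_cost)
print("best index set:", best, "cost:", workload_cost(best))

The same skeleton extends to the fancier moves: swap the replay for Monte Carlo draws over simulated data distributions, or 
drive candidate generation with a genetic search instead of brute-force enumeration.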

>
> You are sort of handwaving away a lot of theoretical computer science as though that does not apply to whatever it is you want to do.

A) Not really.  Generally, people first assume certain constraints on the solution are fixed, when they are not, and then 
argue that you can do no better.  Changing the problem, or the strategy of the solution, can often beat what the theory 
seemed to imply to begin with.

B) A lot of theoretical computer science is falling by the wayside, at least in some important, if not strict, fashion.  The 
Nyquist limit was smashed (for sparse radio signals, via compressed sensing), and large TSP instances are now solved in 
polynomial time, just without a guarantee of optimality, only a "most of the time" probabilistic measure.  Strictly speaking, 
the theoretical limits stand, but practically speaking, they sometimes get worked around.  We haven't come close to fully 
emulating a brain, but we have self-driving cars.
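The TSP case is easy to demonstrate.  Nearest-neighbor construction plus 2-opt improvement runs in polynomial time and 
usually lands close to optimal, with no guarantee at all; a toy version:

import math, random

# Toy illustration of "good answers without guarantees": a polynomial-
# time TSP heuristic (nearest neighbor + 2-opt) that typically finds
# near-optimal tours without ever proving anything about them.

random.seed(1)
cities = [(random.random(), random.random()) for _ in range(50)]

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def tour_length(tour):
    return sum(dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

# Greedy nearest-neighbor construction: O(n^2).
unvisited = set(range(1, len(cities)))
tour = [0]
while unvisited:
    nxt = min(unvisited, key=lambda c: dist(cities[tour[-1]], cities[c]))
    tour.append(nxt)
    unvisited.remove(nxt)

# 2-opt: keep reversing segments while doing so shortens the tour.
improved = True
while improved:
    improved = False
    for i in range(1, len(tour) - 1):
        for j in range(i + 2, len(tour)):
            candidate = tour[:i] + tour[i:j][::-1] + tour[j:]
            if tour_length(candidate) < tour_length(tour):
                tour, improved = candidate, True

print("heuristic tour length:", round(tour_length(tour), 3))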

>
>
>> Yes but they're scaling this variety of system anyway, into the billions of triples.
>
> No, they are not even scaling it to billions of triples unless you append so many conditions and qualifications that it has no relevance to ordinary use cases. And in any case, a billion triples is tiny, that problem fits in the RAM of my desktop which raises the question of why it is considered a meaningful benchmark. (It is not a hypothetical, I literally process billion edge graphs in memory on my desktop.) Graph databases have a used car salesman reputation for good reason.

"Graph databases" includes the old notion of graph databases which I agree is not interesting at all.  If you just want to write 
records with pointers between them, as you would in memory, with a btree++ index, it's not all that interesting.  A "triplestore" 
level "graph database" is different.  The straightforward implementation of a triplestore is more like a graph-oriented relational 
database.  A number of successful "triplestore" databases are implemented on a more or less traditional relational engine (Virtuoso).
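To illustrate the point (my sketch, not Virtuoso's actual schema): the standard trick is one (subject, predicate, object) 
table with permuted covering indexes, so that any triple pattern resolves to an index range scan.  Any stock relational 
engine can do it:

import sqlite3

# Illustrative only: a triplestore on a stock relational engine.
# One (subject, predicate, object) table plus permuted covering
# indexes so every access pattern becomes an index range scan.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
db.execute("CREATE INDEX spo ON triples (s, p, o)")
db.execute("CREATE INDEX pos ON triples (p, o, s)")
db.execute("CREATE INDEX osp ON triples (o, s, p)")

db.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("alice", "knows",   "bob"),
    ("alice", "worksAt", "acme"),
    ("bob",   "worksAt", "acme"),
])

# "Who works at acme?" is a range scan on the pos index.
rows = db.execute(
    "SELECT s FROM triples WHERE p = ? AND o = ?",
    ("worksAt", "acme")).fetchall()
print(rows)  # [('alice',), ('bob',)] (order may vary)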

1 trillion triples were loaded and queried in 2011 here:
http://www.w3.org/wiki/LargeTripleStores

Stores loaded with up to 150B triples are benchmarked here:
http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/results/V7/

In any case, completely generic triplestore methods are probably never going to be as scalable as alternative storage systems, 
especially for many common problems.  If you have 200 fields in a relational table joined with another table of 201 fields, 
you retrieve all 400 distinct fields with a couple of index lookups, where a triplestore might pay nearly 400 lookups for the 
same data.  What would be interesting is to find methods that provide the flexibility of a triplestore with the efficiency of 
relational storage in this case.  Leaving out the query language details, it's possible for a database to support both storage 
and semantic styles yet provide the same query efficiency.  That's logistically difficult, but no stretch of theory.
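Back-of-the-envelope, using the numbers from that example (the join key counted once):

# Back-of-the-envelope lookup counts for the example above: one entity
# split across two joined relational rows vs. the same entity stored
# one-triple-per-field in a generic triplestore.

fields_a, fields_b = 200, 201           # two tables joined on a shared key
total_fields = fields_a + fields_b - 1  # key counted once -> 400

# Relational: one index probe per table fetches an entire wide row.
relational_lookups = 2

# Naive triplestore: each field is its own (s, p, o) entry, so
# materializing the entity costs roughly one lookup per field.
triplestore_lookups = total_fields

print(f"{total_fields} fields: relational ~{relational_lookups} lookups, "
      f"triplestore ~{triplestore_lookups} lookups")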

>
> Companies like IBM can eat through graph analytic problems that are considerably more difficult at scales several orders of magnitude larger without the asterisks. Nothing in open source has any idea how to do that. They are using inferior computer science.

There are some computer scientists involved on the open source side too.  It's hard to say which groups will gain the momentum 
to evolve something competitive.  Some of this technology is getting a workout in the real world, with real money available to 
improve it as a commodity.  Those without a huge budget, and not on a "solve it now at all costs" mission, might explore how 
far the more accessible products will go.

>
>
>>> "simply requiring a reconfiguration of data / index / memory"
>>>
>>> Yeah, well that is the real trick now, isn't it. In real systems, no one is willing to pay the extraordinary cost of that operation. In a distributed system it would be straight pathological.
>>>
>> Depends on what one means by this and how clever one is.
>
> Actually, no. Cleverness does not solve theoretical limitations no matter how much one refuses to acknowledge them.

There have been a number of counterexamples to that point, though generally it's obvious.  You are implying some specific 
constraints in this statement.  Clearly it is possible to reconfigure data / index / memory; some examples of that are 
trivial.  You allude to some deeper result that you imply definitively precludes all strategies.  It is not obvious what you 
are referring to; we'd like to know more.
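One trivial single-process instance of what I mean, just to pin the word down (obviously not the distributed case, where the 
interesting constraints live): build the replacement index on the side and swap it in, so readers never see a half-built state.

# Trivial illustration of "reconfigure the index without stopping the
# world": build a replacement index off to the side, then swap the
# reference.  In CPython the attribute assignment is atomic, so readers
# see either the old index or the new one, never a half-built state.

class Store:
    def __init__(self, rows):
        self.rows = rows
        self.index = {r["id"]: r for r in rows}   # initial index, by id

    def reindex_by(self, key):
        new_index = {r[key]: r for r in self.rows}  # built off-line
        self.index = new_index                      # atomic reference swap

store = Store([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
store.reindex_by("name")
print(store.index["b"])  # {'id': 2, 'name': 'b'}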

>
>
>>>> or some unifying but tunably flexible solution.
>>> Yes, this is how it has to be done. So why has no one built such a solution?
>> People are trying, somewhat, although usually coming at it from a particular narrow need.  At least we are well beyond everything being a relatively dumb RDBMS.
>
> I was asking a cynical trick question. Theoretical computer scientists who work in this space know what the problems are which matches a great many things of which nouveau database designers are ignorant. In reality, there are multiple hard theoretical problems in the way of this glorious future that have withstood the attempts of ordinary PhDs to solve them for decades. The idea that someone that can't be bothered to read the vast bodies of prior literature will solve all of these is dubious and on some levels insulting to all the computer scientists that have spent their lives working on these problems.
>
> It is not obvious to me that you appreciate the seriously hardcore theoretical nature of the problem you think some clever hipster will solve over the weekend. Designing good, scalable databases is extremely difficult even if you are a frackin' genius.

I don't expect a hipster to solve anything.  Plenty of computer scientists and ur-scientists magically appear every year. 
Google, Netflix, etc. weren't possible in practical terms before they happened, but there they are.  Any back-of-the-napkin 
analysis would have proven that they couldn't possibly have afforded the Oracle and IBM licenses and hardware needed to even 
get started.

Until recently, you could have proven that many problems weren't practical, essentially because A) you couldn't get a machine 
with enough memory, B) the memory wasn't fast enough, and C) you couldn't have afforded it even if you could.  Now 64GB of ECC 
memory is $700, and you can get machines with hundreds of GB and dozens of cores on 3 or more memory channels for less than 
the price of a Prius.

I know there are unsolved hard problems.  I also know there are many solved problems and known methods that aren't efficiently 
available yet in all of the forms that might be needed for an optimal solution.

>
>
>> This is a nice map of the space, although categorization & placement is a bit simplistic in some cases:
>> http://blogs.the451group.com/information_management/files/2013/02/db_Map_2_13.jpg
>
> Really? That is basically a catalog of databases that suck more than high-end relational databases. For the record, I am not even a fan of the databases that prove that rule. This chart is a dubious model that can be shoehorned into a wrong model people can understand.
>
> In any case, that graphic is not particularly educational.

It is a list of products with some kind of organization.  Do you have a better map or list, perhaps with reviews of what is 
good for what, if anything?

>
>
>> http://blog.sqrrl.com/post/49905027780/how-to-choose-a-nosql-database
>
> I do not want to bust your balls but your understanding of databases seems to be driven by marketing hype.

Sometimes I dig into the details, and the source code, of systems to understand their limitations and what they will and 
won't do.  I have many other areas of interest, and the database arena is proliferating rapidly.  I know most of it is 
uninteresting, but there is enough usage and money that something will get on a Linux-like trajectory.  I'm just sampling so 
I can make an early choice when the question comes up.  I asked earlier whether there was any useful review that anyone trusted.

>
> My advantage is that I've (1) read tens of thousands of pages of the relevant literature and (2) developed several of the algorithm families that are currently considered the state-of-the-art in this field. The second reason is probably more important. :-)

I've read a number of the papers over the years.  Somewhere I have a license for Postgres (or whatever it was called then) 
from Stonebraker himself, from the early-to-mid '90s, when I worked on porting it from BSD to SysV.  I've designed my own 
database storage mechanisms, etc.  I know various interesting details about Oracle, Sybase, MySQL, PostgreSQL, etc.  But I'm 
not an expert, and I've fallen behind on the relevant papers while learning semantic web ideas, machine learning, machine 
vision, etc.  (Would you mind sharing a reading list?)

>
> The lack of scalable, flexible, expressive databases is not because computer scientists are assholes but because the problem is *hard*. Like Turing Award hard. I can't tell you how many high-reputation computer scientists have dedicated their lives to this and done squat.
>
> Please consider the possibility that this is not easy.

I know it isn't easy.  I also know that a lot of fairly straightforward computer science, for things I wanted every time I 
worked with a database, went unimplemented in database systems for decades.  None of that was theoretically hard; it just 
languished in an oligopoly with no incentive to improve drastically.  They were like Microsoft with IE 6, or Windows, or 
Office, or Outlook...  Or SQL Server.  Just getting past ODBC and "database drivers" took Java / JDBC, and later AWS S3 et 
al., to finally resolve the obvious stupidity there.

sdw


