[FoRK] Database tech (was: Multicore, async segmented sequential models)
J. Andrew Rogers
andrew at jarbox.org
Tue May 14 10:11:21 PDT 2013
On May 12, 2013, at 11:03 PM, Stephen D. Williams <sdw at lig.net> wrote:
> B) A lot of theoretical computer science is falling by the wayside, at least in some important but not strict fashion. The Nyquest limit was smashed (for radio data), TSP was solved in better than polynomial time, just without a guarantee, only a "most of the time" probabilistic measure. Strictly speaking, the theoretical limits stand, but practically speaking, sometimes they are worked around. We haven't come close to fully emulating a brain, but we have self-driving cars.
Databases are currently designed with non-strict methods and approximations -- that is the problem. Secondary indexing, distributed hashing, etc need to be removed from engine designs or the generality and scalability will be hobbled. How would you suggest designing a parallel engine with no indexes and no hashing (among other well-worn design elements)?
Databases need to scale in a semi-general and deterministic way. At sufficiently large scales, theoretical elegance is mandatory or you won't achieve the necessary parallelism and distributability. The legion of old, hard problems surrounding database engines are about finding more elegant and general models -- eliminating narrow optimizations, not adding them.
A radically better database engine is unlikely to be produced from the casual interest of open source hackers. At least three hurdles that make it unlike most other open source software projects:
1.) Without deep advances in computer science, it will manifest the same weaknesses as existing databases. Hackers would need to crack multiple esoteric CS problems that have stood for decades to have a viable product.
2.) The expertise required to design a completely new database engine from scratch is not evident in the open source community. The reason Postgres is head and shoulders above the rest of open source is because Stonebraker designed it.
3.) Minimum viable implementation of modern core architecture is a lot of code written by excellent software engineers. For a skeletal, bootstrappable system, 100-150 kLoC of low-level C++ at a minimum. That is many quality man-years, a very long-term commitment for a side project.
I think skepticism is warranted. It would be a longshot even if you had millions of dollars with which to hire the small team of people with the rarified skill sets required to make the attempt plausible.
> 1 trillion triples were loaded and queried in 2011 here:
It is marketing literature. Notice the lack of query benchmarks or submissions to Graph500 (where IBM would pwn them). Loading triples is the easy part, and averaging 830k triples per second is pretty poor for a 240-core cluster. I have a parallel triple store prototype I designed in 2009 that, coincidentally, also runs on a 240-core test cluster with similar specs. The sustained, average triple load rate is hundreds of millions per second. So a few orders of magnitude faster. It is not that interesting to people that know graph engines (and I'm not trying to sell triple stores). There is a huge technical gap between the few companies working at the high-end and the rest of this market.
> It is a list of products with some kind of organization. Do you have a better map / list, perhaps with reviews of what is good for what if anything?
Nah, a good matrix for databases has far too many dimensions to be easily constructed. It is easier to just use the strongest implementation in a broad class and learn how to exploit the internals to optimize for your use case. The caveat is that there are many applications / data models / workloads for which no usable database currently exists in the market; being able to identify those cases can save much unnecessary pain.
> I've hit a number of papers in the past. Somewhere I have a license for Postgres (or whatever it was called) from Stonebraker himself from the early to mid 90's that I worked on porting from BSD to SysV. I've designed my own database storage mechanisms, etc. I know various interesting details about Oracle, Sybase, MySQL, PostgreSQL, etc. But I'm not an expert and I've fallen behind in relevant papers while I learned semantic web ideas, machine learning, machine vision, etc. (Would you mind sharing a reading list?)
Other than skimming things that show up in my Internet feeds I am not a big consumer of database literature. I find more value in the theoretical computer science that is not related to databases per se. Not much help I'm afraid. :-)
If you were on top of the state-of-the-art and prior literature in 2000 then you'd immediately recognize most of the internals of every database in the market today. The only topic area that has moved significantly since then in the literature is transaction processing. Advanced R&D is primarily done by a few companies and publication of those advances is suppressed for years.
Most database engines liberally copy existing designs. You will find Postgres genetics under the hood of a majority of commercial database engines, for example. It is a good place to start. Designing a completely new database engine from scratch is a completely different type of problem than implementing a database engine from scratch; having the skill to do the latter does not imply the skill to do the former. The benefit of the former is that you do not inherit the fundamental limitations of the copied design.
Databases are like a lot of other applied fields. The basics that everyone learns contain a lot of half-truths and lossy simplifications because at that level most people are not prepared for the unfiltered version. Unless you design database engines, you'd probably never notice.
More information about the FoRK