[FoRK] no moore

Eugen Leitl < eugen at leitl.org > on > Mon Jan 15 09:43:09 PST 2007


no moore (part 1)...
A recent string of comments in Alex's blog at Pythian got me into thinking about this whole subject of "where is database technology heading".

Sure: it's true that Moore's law has been valid for a long time.

Fact is: it ain't no more! I don't often use wikipedia as a reference as there is far too much plain incorrect content there. But in this case, it's spot-on. Read the whole entry, not just the definition!

There are huge consequences for IT from this. But before anyone looks into that, what of data management and databases?

Moore's law dealt with density of fabrication. Not speed, not capacity. And no, the three are not inter-dependent although they are related!

The undeniable fact is that processing speed - directly dependent on density of packaging due to propagation and temperature constraints - has been increasing exponentially.

The other undeniable fact is that disk storage access speed has NOT been increasing at anywhere near the same rate. We now have disk storage systems that can sustain 100MB/s IO rates at best. Ignore buffer cache speeds - those are not sustainable.

That is hardly fantastically more than the 10MB/s of 1996 and nowhere near the exponential growth rates of other electronic devices.

Those with an electrical engineering background will know perfectly well why this is so: the physics of transmission lines and their limits in speed have been understood for many, many decades. In fact the main reason one needs to reduce the size of CPU components in order to jack up the clock frequency is PRECISELY these limits!

So, for databases and more generally data managment, where to then?

Because undeniably, there is a need for this technology to progress at a rate than can supply with data all those speed hungry CPU and processing capacities created by the consequences of Moore's law in the last two decades.

There are those who propose that with the improvements in CPU speed, there will be less need for adequate indexing and optimisation of access methods: an approach based on simple brute force "full scan" of a solid state disk - SSD - storage will suffice, the hardware will be capable of returning results in usable time.

Anyone interested in this should check out the research carried out by Jim Gray of Microsoft. In one of his papers he proposes that the entire life history of a human being can be contained in a Petabyte (PB) of data - he calls it the "Personal Petabyte". Read it, it's very interesting and I believe he is 100% right. We are heading fast into a world where it will be possible to store that PB about a person in a finite storage element!

Any future marketing exercises wanting to address a given population better be prepared to be able to digest this sort of volume of information, in a usable time! Because it will happen, very soon, scaled out to the size of their audience. Yes, Virginia: we're talking Hexabytes - HB - here!

OK, so let's be incredibly optimistic and assume for a second that we'll see the same rate of growth for data access speed in the next 10 years. Yeah, that brings us to 1GB/sec by 2017, doesn't it?

Hang on, didn't good old Jim say 1PB? well, that's 1 million (10**6) of these GB/s, boys and girls! Think you can fast scan 1PB at these rates? Yeah, that's 1 million seconds, or aproximately 11 days of solid data transfer rate to go through it. Don't forget as well that your marketing campaign will be remarkably ineffective if all you can look at is the data of one client... and that you might as well be prepared to multiply all those times by 1000 as you may well be dealing with HB volumes, not PB!

'scuse the French folks, but "fast scan" my arse!

Of course: one can argue that disks will soon become solid state and much faster. That, is so not the problem. The problem was NEVER the speed of the disk but how fast that data can travel to the CPU and its memory!

That is what is taking 100MB/s now and was taking 10MB/s 10 years ago and 10 years from now - with luck - will take 1GB/s!

No, clearly we'll need MUCH MORE sophisticated indexing and cataloguing techniques to manage and derive any value out of these huge amounts of data.

"fast scans" and/or "full scans" are so not there, it's not even worth thinking about them!

So, what will these techniques be, you ask?

I'll blog about what I think they'll be very soon.

catchyalata, folks

no moore (part 2)
I was going to call it "the sequel"... :-)
Part 2 it is, then.

OK, so let's make a few things clear upfront:

1- Jim Gray's "Personal Petabyte" was used here as a metaphor to facilitate the discussion and increase awareness of the incoming problem: the ELDB - as in Extremely Large Data Base. It doesn't mean that everyone will store all your videos somewhere! Think blurfl RFID data here for a concrete example of what I'm talking about - I'm amazed wikipedia doesn't know what a blurfl is...

2- The problem is not the random access speed of disks! The problem is that even with bulk transfer rates of 1GB/s - which you MIGHT get in another 10 years! - you still will need 11 days to full scan 1PB! Therefore, any suggestion that brute force "full scans" will replace indexing and good design ignores the real problem. Which is one of speed of BULK data access. Not speed of random data access!

3- SSD addresses one problem only: the speed of random access. In other words: the seek times of traditional disks. No SSD so far has addressed the problem of bulk data transfer: they all are within the same ball park of current disks when it comes to bulk. Yes, folks: 100MB/s and 10K IOPS is perfectly possible with any low cost SAN right now, SSD has improved nothing in that chapter! As such, thinking SSD when you're dealing with a problem of ELDB and replacement of indexes by full scans is actually a perfect example of shooting oneself in the foot!

And now I'll go even more extreme!

The problem of handling ELDB is not one of indexing or design either! Think about it: if you are struggling to full scan 1PB in 11 days at data transfer speeds that you MAY be achieving in another ten years, do you seriously believe you can properly index that sort of volume? And maintain that index?

Dream on, bro!

Just the CREATE INDEX will take you a full month. Throw in 15 indexes on the same table and the task of inserting a record becomes herculean. Yes, parallel it, you may well say. So what? You still have to TRANSFER all that data somewhere else to create the index, dude! That, is THE problem: the actual transfer!

No, indexing is not the solution either. Not at these volumes.

So, what is the solution?

Well, let's backtrack for a second here. How did we locate very large volumes of data when computers didn't exist?

Perfect example: libraries. How did they manage to organize themselves so they could find the precise location of a book amidst tens of thousands, without the help of so much as a calculator let alone a computer?

They used ranking and catalogue systems to do so. Storage ranking as well.

Ie, they didn't just throw every book into any shelf and expect some magical "brute force" mechanism to find it, next year.

Well, some of us do so... :-(

But, I digress...

Books were catalogued and sorted into precise locations according to precise rules when they were added to the library and therefore were very easy to locate thereafter: all the client had to know was an author and/or the subject domain of the book, and bingo, up came the shelf or shelf unit where the book was. The "brute force" scan was reduced to a simple matter of looking up one book amid 300 or so, rather than one in the middle of 50000!

And if the librarian was a really good one, the books would be sorted out by title within the shelf. Which made locating the book kid's play.

I know: I was one of the kids playing!

Anyone notice a parallel (pun intended!) here between ranking/cataloguing and what we now know as "partitioning"? Anyone noticed also the similarity of the sorted shelf titles with local indexes in a partition?

Yes, folks. THAT is the solution I see. Sure, we need to evolve the SQL design and syntax itself as well as the storage architecture to make this possible. But that is a LOT easier than trying to subvert limitations of physics, believe me!

There is no way with current or future disk or SSD hardware that we can handle PB databases unless we reduce the problem to small portions. Simple divide to conquer, really. Apply it to very large volumes and we got it. Provided of course that we adjust our way of thinking to this brave new world.

How? Read on.

"OK, so we partition everything, Nuno. Have you thought of how long the CREATE TABLE statement will now become? Have you looked at the syntax for partitioning? Man, the CREATE TABLE section of the SQL manual is now half the size of the bloody book and you want to make it MORE complex? What you been smoking, dude?"

Well, no. Not really, folks. Notice that I said we need to evolve SQL design and syntax. Just like we need to evolve the hardware itself: it ain't gonna happen with what we know now!

Two pronged attack. Disks first:

As far as making disks faster, be they SSD or conventional, the solution is very simple: if you can't bring Mohamed to the CPU/memory "mountain" in a usable time, then bring the mountain to Mohamed! Slap a core2 Duo or whatever it will be in 10 years time, with 100GB of memory, on EACH disk!

Then, make it so that Oracle or db2 or SQL Server can download a simplified meta-SQL interpreter - the output of the optimizer - into each of these very large capacity disks.

Now, what was that access problem again for bulk data? Yup, as a rdbms kernel, you now can download an entire partition management subsystem into a single disk. And have that disk take care of managing the local indexes that handle the partition(s) residing in that disk! And all the other disk space paraphernalia needed to manage a single device capacity of 500TB. Anyone got the "Oracle ASM" bell ringing now? Yes: it's the start of that, folks.

With PMR, in another 10 years you'll be looking at a single disk of that capacity. Do you seriously think that the SAN and NAS makers -and their users and potential market - out there are gonna listen to BAARF and/or BAHD? Particularly when, as Kevin so well points out, the vast majority of data storage is going to be unstructured and outside databases?

No, folks: baarf and badh have as much chance of getting anywhere as I have of becoming the president. That's why I am not a member of either.

Don't get me wrong! Both address a very real problem. It's the solution they propose that is, IMO, completely inadequate: the problem is not speed of random access. The problem is speed of bulk access and how to master the ELDB with the h/w and s/w that realistically we will get in the next 10 years!

OK, that was disks. Now, software.

As in: RDBMS. Because the alternatives - OODB and other such flights of fancy - are not really there and have never proven to even be aware of the problem! Forget a working solution from these...

Clearly, what we need is a MAJOR simplification of the entire reality of the Oracle partitioning concept. No more 50 pages to describe all the nuances and subtleties of the partitioning syntax!

I want to type:
CREATE TABLE blurfl_table (
dept "datatype spec",
blurfl_type "datatype spec",
rfid_date "datatype spec",
aggregate on dept, blurfl_type,rfid_date;

Boom. Period. Sure: later on, I want to be able to indicate to the rdbms engine that I want the physical aggregation for dept to go to my "wonder disks" 44,45 and 46. And within that, I want the aggregation blurfl_type to be spread by individual value in as many partitions as needed to cope with the data spread. And so on.

Or maybe not: I just give all those to ASM and it knows how to physically aggregate - the library shelf metaphore - among all the storage units it knows about. And automatically have a local index on that aggregation key stored in meta-data somewhere else and local indexes in those devices for all PKs, RIs and any other grouping constraints I might also specify with the table.

Now, I write my humble end-user yearly "mother-of-all-queries":

from blurfl_table
where rfid_date >= to_date('2019-01-01')
and rfid_date < to_date('2020-01-01)
group by dept,blurfl_type,trunc(rfid_date,'MONTH');

or words to that effect...

and instead of it becoming the legacy query that my grand-children will see the results of, the following happens:

1- The optimizer - FINALLY doing something smart! - will recognize from the syntax and the meta-data in the dictionary that I'm grouping on the exact aggregation keys of my data. It then reads the locations of the aggregations for those keys off the dictionary, slaps them on the ASM layer with the model SELECT for each one of them.

2- The ASM layer, having previously loaded the SQL meta-interpreter into each of those intelligent 500TB storage units, then fires off the modified model SELECT to each one of them with instructions to aggregate all that data and come back with the result.

3- When the results reach the ASM, it now further aggregates that, a-la Materialized View, into a result set for the end-user query.

4- Who then gets the result within a reasonable time. Of course, end-users being what they are, the speed will still not be satisfactory. And they'll still loudly complain they could do it much faster in Excel and "Windows Panorama" on their MacTel desktop!

But, what the heck:

We tried!...

Yes, yes: I know. There is a lot of "Steve Jobs presentation" in the above scenario.


1- have you had a close look at how 10gr2 manages the AWR partitioning totally transparent to you? And how it partitions the performance data?

2- Have you had a look at what ASM keeps in its own instance of meta-data?

3- Have you noticed the 11g "wish list" for range-range composite partitioning?

Dudes, Dudettes: get ready! Because it's coming.

At light speed. And I can't wait to get my mitts on it!

(end of crystal ball gazing for another decade...)

Eugen* Leitl <a href="http://leitl.org">leitl</a> http://leitl.org
ICBM: 48.07100, 11.36820            http://www.ativel.com
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE

More information about the FoRK mailing list