[FoRK] large scale dataset mailing list/resources?

Stephen D. Williams <sdw at lig.net> on Mon Feb 25 21:56:50 PST 2008

Reza B'Far wrote:
> Hi Luis:
>
> I spent the past year and a half at Oracle (well, we got bought by Oracle)
> solving a similar problem... Ken is quite right in that "Most straight RDF
> triplestores seem to hit the wall at millions of triples"... however, there
>   
I'm very interested in this also.  I can't see any theoretical reason 
why an RDF store, with sufficient automatic intelligence, couldn't do 
relational work on RDF data as fast as an RDBMS does relational work on 
fixed tuples.  The simplistic case has a 'blown to bits' problem, and 
that is the very thing that makes it a be-all, end-all flexible data, 
er... knowledge, model.

What makes an RDF store, fundamentally, more expensive than an RDBMS? 
I think that all it comes down to is that:

o  Each (key, column) pair becomes a triple, so a single RDBMS tuple 
becomes a bunch of triples.  (I.e. "blown to bits".)
o  The default case is that every value of each resulting triple goes 
into an index.
o  Everyone seems to come to the conclusion that triples are best 
managed as triples of integers, which seems elegant and efficient but 
adds a layer of indirection.
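
To make those three costs concrete, here is a minimal Python sketch, not 
how any particular triplestore does it.  The row, the `emp:1` key, and the 
names `explode`, `TermDictionary`, and `build_indexes` are all made up for 
illustration.  It blows one row into triples, encodes every term as an 
integer, and puts every position of every triple into an index.

```python
from collections import defaultdict


def explode(row_key, row):
    """Turn one relational row into (subject, predicate, object) triples."""
    return [(row_key, column, value) for column, value in row.items()]


class TermDictionary:
    """Maps terms to integer ids and back: the extra layer of indirection."""

    def __init__(self):
        self.ids = {}
        self.terms = []

    def encode(self, term):
        if term not in self.ids:
            self.ids[term] = len(self.terms)
            self.terms.append(term)
        return self.ids[term]

    def decode(self, term_id):
        return self.terms[term_id]


def build_indexes(triples, dictionary):
    """Index every triple by subject, by predicate, and by object."""
    by_s, by_p, by_o = defaultdict(list), defaultdict(list), defaultdict(list)
    for s, p, o in triples:
        si, pi, oi = (dictionary.encode(t) for t in (s, p, o))
        by_s[si].append((pi, oi))
        by_p[pi].append((si, oi))
        by_o[oi].append((si, pi))
    return by_s, by_p, by_o


# Hypothetical one-row "employee" table with three non-key columns.
row = {"name": "Ada", "dept": "R&D", "salary": 100}
triples = explode("emp:1", row)

dictionary = TermDictionary()
indexes = build_indexes(triples, dictionary)
index_entries = sum(len(v) for index in indexes for v in index.values())

print(len(triples), index_entries, len(dictionary.terms))
```

Under these assumptions one 3-column row turns into 3 triples, 9 index 
entries, and 7 dictionary entries, and every lookup has to pass through 
the dictionary on the way in and on the way out.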

Clearly, with relatively obvious algorithms, you could recognize 
tuple-like usage of the graph space and tuple-ize the actual storage and 
indexing.  This can all be automatic and query driven.  At that point, 
for those queries, performance should be the same, no?
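
One possible shape for such an algorithm, as a hedged Python sketch: 
group triples by subject, treat subjects that share the same set of 
single-valued predicates as rows of an implied table, and leave everything 
else as plain triples.  The function name `tupleize`, the `min_subjects` 
threshold, and the sample data are assumptions for illustration.  This 
version looks only at the shape of the data; a query-driven version would 
presumably also use the query log to decide which tables are worth 
materializing.

```python
from collections import defaultdict


def tupleize(triples, min_subjects=2):
    """Split triples into row-shaped tables plus leftover triples.

    Returns (tables, leftovers), where tables maps a sorted tuple of
    predicates to {subject: {predicate: object}}.
    """
    by_subject = defaultdict(list)
    for s, p, o in triples:
        by_subject[s].append((p, o))

    groups = defaultdict(dict)  # predicate signature -> rows
    leftovers = []
    for s, pairs in by_subject.items():
        predicates = [p for p, _ in pairs]
        if len(set(predicates)) != len(predicates):
            # A repeated predicate is not row-shaped; keep it as triples.
            leftovers.extend((s, p, o) for p, o in pairs)
            continue
        groups[tuple(sorted(predicates))][s] = dict(pairs)

    tables = {}
    for signature, rows in groups.items():
        if len(rows) >= min_subjects:
            tables[signature] = rows
        else:
            # Too few subjects share this shape to justify a table.
            for s, row in rows.items():
                leftovers.extend((s, p, o) for p, o in row.items())
    return tables, leftovers


# Hypothetical data: two row-like employees and one graph-like subject.
sample = [
    ("emp:1", "name", "Ada"), ("emp:1", "dept", "R&D"),
    ("emp:2", "name", "Bob"), ("emp:2", "dept", "Ops"),
    ("emp:3", "knows", "emp:1"), ("emp:3", "knows", "emp:2"),
]
tables, leftovers = tupleize(sample)
print(tables)
print(leftovers)
```

Here the two employees end up as rows of one (dept, name) table that 
could be stored and indexed like an ordinary relation, while the "knows" 
edges stay in triple form.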

That method gives new meaning to "query optimization" as the data would 
actually be reorganized physically on the fly.
> is actually a pretty elegant solution to this that combines distributed
> ontology techniques with MapReduce-like technique... folks that know what
> these two things are can probably interpolate the solution fairly
> obviously...
>   
Seems to make sense, depending on what you mean by "distributed ontology 
techniques".  But isn't this mainly needed because you have a bits 
multiplier and because you are doing more complex queries on graph space 
rather than simple data joins?
> Another alternative to get to billions of triples is Oracle 11g RDF Store :)
> (that's a blatant plug)...
>   

The problem with Oracle is that it long ago priced itself beyond any 
budget except those of well-established companies, or well-funded ones 
with a rapid burn rate.  That's fine for Oracle, but it isn't interesting 
for anything that I'm likely to bootstrap, even when that means fairly 
large corporate or government projects.

sdw
> First solution, IMHO, is better... the second one is quicker.
>
>
> -----Original Message-----
> From: fork-bounces at xent.com [mailto:fork-bounces at xent.com]On Behalf Of
> Luis Villa
> Sent: Wednesday, February 20, 2008 12:32 PM
> To: Friends of Rohit Khare
> Subject: Re: [FoRK] large scale dataset mailing list/resources?
>
>
> On Wed, Feb 20, 2008 at 3:02 PM, Jeff Bone <jbone at place.org> wrote:
>   
>>  On Feb 20, 2008, at 9:20 AM, Luis Villa wrote:
>>
>>  > Hey, all-
>>  >
>>  > A friend is working on a fairly large-scale data project- will
>>  > probably top out in the neighborhood of 5M records (but potentially 25-50
>>  > times that if it really takes off), each of which is both a lot of text
>>  > to be analyzed (5-50K words, with link and potentially grammar
>>  > analysis) and an associated pdf (original source material.) Goal is to
>>  > do good search and probably eventual statistical analysis for
>>  > research. (No prizes for guessing what this is if you've been
>>  > following my blog ;)
>>
>>  This is big?
>>     
>
> Big enough to make doing it on a single machine much less responsive
> than he'd like. Or to put it another way: his data sets are growing
> larger and harder to parse faster than his machines are growing bigger
> and faster at parsing, so things are getting more complicated.
>
>   
>>  > Currently search is Apache Solr-powered; he's considering moving to an
>>  > RDF store
>>
>>  Yeah, good luck w/ that! ;-)
>>     
>
> Yeah, I didn't want to tell him that flat out, since it isn't really
> my project on the technology side, but I'm hoping to nudge him away.
>
>   
>>  Random and tangential, but anybody seen this:
>>
>>    http://blog.freebase.com/?p=108
>>     
>
> Eeenteresting.
>
> Luis
> _______________________________________________
> FoRK mailing list
> http://xent.com/mailman/listinfo/fork
>

-- 
swilliams at hpti.com http://www.hpti.com Per: sdw at lig.net http://sdw.st
Stephen D. Williams 703-371-9362C 703-995-0407Fax 94043 AIM: sdw

