[FoRK] large scale dataset mailing list/resources?

Luis Villa <luis at tieguy.org> on Wed Feb 20 12:31:34 PST 2008

On Wed, Feb 20, 2008 at 3:02 PM, Jeff Bone <jbone at place.org> wrote:
>
>  On Feb 20, 2008, at 9:20 AM, Luis Villa wrote:
>
>  > Hey, all-
>  >
>  > A friend is working on a fairly large-scale data project- will
>  > probably top out in the neighborhood of 5M records (but potentially 25-50
>  > times that if it really takes off), each of which is both a lot of text
>  > to be analyzed (5-50K words, with link and potentially grammar
>  > analysis) and an associated pdf (original source material.) Goal is to
>  > do good search and probably eventual statistical analysis for
>  > research. (No prizes for guessing what this is if you've been
>  > following my blog ;)
>
>  This is big?

Big enough to make doing it on a single machine much less responsive
than he'd like. Or to put it another way: his data sets are growing
larger and harder to parse faster than his machines are growing bigger
and faster at parsing, so things are getting more complicated.

>  > Currently search is Apache Solr-powered; he's considering moving to an
>  > RDF store
>
>  Yeah, good luck w/ that! ;-)

Yeah, I didn't want to tell him that flat out, since it isn't really
my project on the technology side, but I'm hoping to nudge him away.

>  Random and tangential, but anybody seen this:
>
>    http://blog.freebase.com/?p=108

Eeenteresting.

Luis
