[FoRK] large scale dataset mailing list/resources?
Luis Villa
<luis at tieguy.org> on
Wed Feb 20 12:31:34 PST 2008
On Wed, Feb 20, 2008 at 3:02 PM, Jeff Bone <jbone at place.org> wrote:
>
> On Feb 20, 2008, at 9:20 AM, Luis Villa wrote:
>
> > Hey, all-
> >
> > A friend is working on a fairly large-scale data project- will
> > probably top out in the neighborhood of 5M records (but potentially 25-50
> > times that if it really takes off), each of which is both a lot of text
> > to be analyzed (5-50K words, with link and potentially grammar
> > analysis) and an associated pdf (original source material.) Goal is to
> > do good search and probably eventual statistical analysis for
> > research. (No prizes for guessing what this is if you've been
> > following my blog ;)
>
> This is big?

Big enough to make doing it on a single machine much less responsive
than he'd like. Or to put it another way: his data sets are growing
larger and harder to parse faster than his machines are growing bigger
and faster at parsing, so things are getting more complicated.

> > Currently search is Apache Solr-powered; he's considering moving to an
> > RDF store
>
> Yeah, good luck w/ that! ;-)

Yeah, I didn't want to tell him that flat out, since it isn't really
my project on the technology side, but I'm hoping to nudge him away.

> Random and tangential, but anybody seen this:
>
> http://blog.freebase.com/?p=108

Eeenteresting.

Luis