[FoRK] large scale dataset mailing list/resources?
Luis Villa
<luis at tieguy.org> on
Wed Feb 20 06:20:06 PST 2008
Hey, all-
A friend is working on a fairly large-scale data project- it will
probably top out in the neighborhood of 5M records (but potentially 25-50
times that if it really takes off), each of which is both a lot of text
to be analyzed (5-50K words, with link and potentially grammar
analysis) and an associated pdf (original source material.) Goal is to
do good search and probably eventual statistical analysis for
research. (No prizes for guessing what this is if you've been
following my blog ;)
Currently search is Apache Solr-powered; he's considering moving to an
RDF store but I don't know the details there. The pre-processing is
becoming a total PITA- every time he improves the parser to get more
data out of the original sources, it takes several days' worth of data
processing, and many, many gigs (not quite a terabyte yet but getting
there) of data are created and moved around. He's looking at moving
that to AWS to cheaply parallelize, but that only solves some problems
and creates others.
This is an area with very sketchy resources: the only people who seem
to know how to do it well are deeply locked inside G/Y!/MS. More and
more people outside the big three are getting into it, but there
doesn't yet seem to be much documentation of the CW, best practices,
etc., which has frustrated my friend. (As he put it, 'I don't know
anything yet, but that hasn't stopped O'Reilly from asking me to help
write a book about it...')
So... he's doing his best to teach himself this stuff on the fly, but
he asked me if I had any pointers to good resources/discussions/etc.
on this. I had no idea, but I said I'd ask around- does anyone have
pointers to good resources, places where people discuss these
problems, etc.?
thanks in advance-
Luis