[FoRK] large scale dataset mailing list/resources?

Justin Mason <jm at jmason.org> on Wed Feb 20 08:53:21 PST 2008

FoRKer Aaron Swartz to the rescue! http://theinfo.org/

  'This is a site for large data sets and the people who love them: the
  scrapers and crawlers who collect them, the academics and geeks who
  process them, the designers and artists who visualize them. It's a place
  where they can exchange tips and tricks, develop and share tools
  together, and begin to integrate their particular projects.'

--j.

Luis Villa writes:
> Hey, all-
> 
> A friend is working on a fairly large-scale data project- will
> probably top out in the neighborhood of 5M records (but potentially 25-50
> times that if it really takes off), each of which is both a lot of text
> to be analyzed (5-50K words, with link and potentially grammar
> analysis) and an associated pdf (original source material.) Goal is to
> do good search and probably eventual statistical analysis for
> research. (No prizes for guessing what this is if you've been
> following my blog ;)
> 
> Currently search is Apache Solr-powered; he's considering moving to an
> RDF store but I don't know the details there. The pre-processing is
> becoming a total PITA- every time he improves the parser to get more
> data out of the original sources, it takes several days' worth of data
> processing, and many, many gigs (not quite a terabyte yet but getting
> there) of data are created and moved around. He's looking at moving
> that to AWS to cheaply parallelize, but that only solves some problems
> and creates others.
> 
> This is an area with very sketchy resources: the only people who seem
> to know how to do it well are deeply locked inside G/Y!/MS. More and
> more people outside the big three are getting into it, but there
> doesn't yet seem to be much documentation of the CW, best practices,
> etc., which has frustrated my friend. (As he put it, 'I don't know
> anything yet, but that hasn't stopped O'Reilly from asking me to help
> write a book about it...')
> 
> So... he's doing his best to teach himself this stuff on the fly, but
> he asked me if I had any pointers to good resources/discussions/etc.
> on this. I had no idea, but I said I'd ask around- does anyone have
> pointers to good resources, places where people discuss these
> problems, etc.?
> 
> thanks in advance-
> Luis
> _______________________________________________
> FoRK mailing list
> http://xent.com/mailman/listinfo/fork
