Indexing the web for regular expression search, revisited

Gordon Mohr gojomo at usa.net
Wed Sep 24 23:56:59 PDT 2003


There was talk here back in June about wanting a web-wide search
facility that could do a better job of matching regular expressions,
or some useful subset thereof.

The following research wasn't mentioned (except perhaps
that James Rogers seems to have elliptically referred to similar
techniques), but this 2001 paper would seem to indicate that the
idea of a "RegExp Google" isn't entirely far-fetched...

   "A Fast Regular Expression Indexing Engine (2001)"
   Junghoo Cho, Sridhar Rajagopalan
   http://citeseer.nj.nec.com/cho01fast.html

   Abstract: In this paper, we describe the design, architecture,
   and the lessons learned from the implementation of a fast
   regular expression indexing engine FREE. FREE uses a prebuilt
   index to identify the text data units which may contain a
   matching string and only examines these further. In this way,
   FREE shows orders of magnitude performance improvement in
   certain cases over standard regular expression matching
   systems, such as lex, awk and grep.

Looking at their example test vectors, it seems to me that even
the three (out of 10) that showed no improvement (over a traditional
scan of the full document corpus) could potentially be optimized
using a variant of their technique, if you could create synthetic
multigrams using character-class atoms.

- Gordon



More information about the FoRK mailing list