Indexing the web for regular expression search, revisited
gojomo at usa.net
Wed Sep 24 23:56:59 PDT 2003
There was talk here back in June about wanting a web-wide search
facility that could do a better job of matching regular expressions,
or some useful subset thereof.
The following research wasn't mentioned (except perhaps
that James Rogers seems to have elliptically referred to similar
techniques), but this 2001 paper would seem to indicate that the
idea of a "RegExp Google" isn't entirely far-fetched...
"A Fast Regular Expression Indexing Engine (2001)"
Junghoo Cho, Sridhar Rajagopalan
Abstract: In this paper, we describe the design, architecture,
and the lessons learned from the implementation of a fast
regular expression indexing engine FREE. FREE uses a prebuilt
index to identify the text data units which may contain a
matching string and only examines these further. In this way,
FREE shows orders of magnitude performance improvement in
certain cases over standard regular expression matching
systems, such as lex, awk and grep.
Looking at their example test vectors, it seems to me that even
the three (out of 10) that showed no improvement (over a traditional
scan of the full document corpus) could potentially be optimized
using a variant of their technique, if you could create synthetic
multigrams using character-class atoms.
More information about the FoRK