Indexing the web for regular expression search, revisited

James Rogers jamesr at best.com
Fri Sep 26 15:35:21 PDT 2003


On Wed, 2003-09-24 at 22:56, Gordon Mohr wrote:
> The following research wasn't mentioned (except perhaps
> that James Rogers seems to have elliptically referred to similar
> techniques), but this 2001 paper would seem to indicate that the
> idea of a "RegExp Google" isn't entirely far-fetched...
> 
>    "A Fast Regular Expression Indexing Engine (2001)"
>    Junghoo Cho, Sridhar Rajagopalan
>    http://citeseer.nj.nec.com/cho01fast.html

> Looking at their example test vectors, it seems to me that even
> the three (out of 10) that showed no improvement (over a traditional
> scan of the full document corpus) could potentially be optimized
> using a variant of their technique, if you could create synthetic
> multigrams using character-class atoms.


Interesting paper, but more case-specific than what I was talking
about.  The types of algorithms I work on are more like universal
indexes that can converge on high-order patterns (i.e. roughly optimal
n-order UPs).  Some functional differences:

- The indexing algorithms I use have a substantially larger memory
footprint (and the current implementation is pretty well-tuned, short of
byte-slicing and making the code ugly), so I have never used a text
corpus as large as the one in the paper and have to extrapolate from
the convergence function generated by similar but much smaller
datasets.  Within the physical memory allocated solely to the test
process (1.0 Gbyte) on a machine with CPU power similar to the one in
the paper, it is only around 3-4% efficient byte-wise when it runs out
of memory (i.e. the index is 25-30x larger than the corpus).
However, the math and the convergence function (which varies as a function
of the entropy in the data stream) suggest that we should get >100%
efficiency given enough RAM and a large enough data stream (demonstrable
with simpler data sources); the Kolmogorov complexity of your typical
collection of real test documents is too high for the model/index to
converge well in the puny quantity of RAM available.  My relevant
implementations are not text-specific; the text is treated as a raw
binary stream with each character cast to a 32-bit atom (there is a toy
sketch of this after the list).

- The poor memory efficiency of the designs I work on is offset by the
fact that they exhaustively index the patterns in every data stream they
process.

- Indexing is fast, being basically bound by memory bandwidth/latency.

- Pattern matching on my indexes is not limited to simple regexes, and
can handle any arbitrarily complicated pattern that is exact, partial,
fuzzy, or dirty (e.g. bad or misplaced tokens) with similar efficiency
and fast query times.  Ordering of matches is roughly based on
information-theoretic diffs from the query (implicitly, not
post-processed).
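
Since the memory blow-up is hard to picture in the abstract, here is a toy
Python sketch of the exhaustive-indexing idea from the first two points
above -- nothing like the production code; the order cap, the atom
widening, and the sample stream are all invented for illustration:

from collections import defaultdict

MAX_ORDER = 4  # hypothetical cap on pattern order for this toy version

def to_atoms(data: bytes):
    """Treat input as a raw binary stream; each byte becomes a 32-bit atom."""
    return [b & 0xFFFFFFFF for b in data]   # fixes the atom width, nothing more

def exhaustive_index(atoms):
    """Record every position of every n-gram of atoms, n = 1..MAX_ORDER."""
    index = defaultdict(list)
    for n in range(1, MAX_ORDER + 1):
        for i in range(len(atoms) - n + 1):
            index[tuple(atoms[i:i + n])].append(i)
    return index

stream = "abracadabra abracadabra".encode()   # stand-in for any byte stream
idx = exhaustive_index(to_atoms(stream))
print(len(stream), len(idx))   # distinct index entries already outnumber input bytes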

The bad news is that these types of constructs cannot really be
simplified into more memory-efficient implementations.  The good news is
that you can get much greater memory efficiency (and, for practical
purposes with finite memory, better predictive performance) by lowering
the resolution (i.e. the Kolmogorov complexity) of the data stream, similar
to lossy compression techniques.
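
A minimal sketch of what I mean by lowering the resolution, assuming a
plain byte stream (the bucketing scheme is just an example, analogous to
the quantization step in lossy compression):

def lower_resolution(data: bytes, buckets: int = 16):
    """Quantize each byte into one of `buckets` coarse symbols."""
    step = 256 // buckets
    return bytes(b // step for b in data)

raw = bytes(range(256)) * 4              # stand-in for a real data stream
coarse = lower_resolution(raw)
print(len(set(raw)), len(set(coarse)))   # 256 distinct symbols -> 16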

So for simpler cases, the paper describes an indexing algorithm that is,
for all practical purposes, much more efficient.  The types of constructs
I've been working with are much more ambitious in capability, but they
also don't become relatively more efficient on the data sets people care
about until you have machines with mountains of low-latency memory.

Planning to get a much bigger machine/cluster shortly,

-James Rogers
 jamesr at best.com
