On Tue, Jul 04, 2006 at 05:24:04AM +1000, Damien Morton wrote:

> Doubtful, given that the extremely poor quality scan is pretty unusual.

That's just the first line of the scan. The quality improves
further down the page.

> Why are the scans such poor quality anyway?

It's an outlier and, I hope, not very typical. A large organization
is preparing the scans (which come in different fonts, probably
even different resolutions, and varying quality), so I have no
control over quality. My trouble is finding a diagnostic procedure
which handles such variations gracefully. Unfortunately, the things
contain graphs, structural formulas and IUPAC names, so just looking
at the IUPAC error count from a nomenclature spell checker (there
is no such thing yet) won't do. Taking a quorum over multiple feeds
(unfortunately, there's just FineReader and OmniPage, and not
much else) is also nontrivial because of frequent frame shifts
in the OCR'd text. One could probably steal something from
bioinformatics to produce optimal alignments of fragments.
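
To make the alignment idea concrete, here is a rough sketch of
Needleman-Wunsch global alignment applied to two OCR outputs of the
same line. It is only an illustration, not something I have run on
the real feeds: the function names (align, disagreements), the unit
match/mismatch/gap scores and the sample strings are all my own
assumptions. A gap is represented as None rather than '-', since
hyphens occur in IUPAC names.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b (Needleman-Wunsch).

    Returns a list of (char_from_a, char_from_b) columns, where None
    marks a gap. Scores are illustrative assumptions, not tuned values.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two characters
                score[i - 1][j] + gap,            # character missing in b
                score[i][j - 1] + gap,            # character missing in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    columns = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            columns.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            columns.append((a[i - 1], None))
            i -= 1
        else:
            columns.append((None, b[j - 1]))
            j -= 1
    columns.reverse()
    return columns


def disagreements(columns):
    """Return (column_index, char_a, char_b) wherever the two feeds differ."""
    return [(k, x, y) for k, (x, y) in enumerate(columns) if x != y]


if __name__ == "__main__":
    # Hypothetical outputs of two OCR engines for the same text.
    feed_a = "2-methylpropane"
    feed_b = "2-methypropane"   # one dropped character causes a frame shift
    for k, x, y in disagreements(align(feed_a, feed_b)):
        print(k, repr(x), repr(y))
```

After alignment, a dropped or inserted character shows up as a single
gap column instead of shifting every later position, so the two feeds
can be compared column by column. With only two engines this flags
disagreements rather than resolving them. The table is quadratic in
time and memory, so it would have to be applied per line or per
fragment, not per page.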

I wonder which joker had the idea (extracting full-text indexable
text from OCR pages with little to no human postprocessing) in 
the first place. I could have told him right from the start that
it was going to be difficult, without even trying it. Time to move
on, I guess; too many morons in leading positions.

