[FoRK] Re: [>Htech] OCR help needed

Eugen Leitl <eugen at leitl.org> on Mon Jul 3 12:55:18 PDT 2006

On Tue, Jul 04, 2006 at 05:24:04AM +1000, Damien Morton wrote:

> Doubtful, given that the extremely poor quality scan is pretty unusual.

That's the first line of the scan. Quality improves further down
the page.

> Why are the scans such poor quality anyway?

It's an outlier and, I hope, not typical. A large organization
is preparing the scans (which come in different fonts, probably
even different resolutions, and varying quality), so I have no
control over quality. My trouble is finding a diagnostic procedure
which handles such variation gracefully. Unfortunately, the documents
contain graphs, structural formulas and IUPAC names, so just looking
at the IUPAC error count from a nomenclature spell checker (there
is no such thing yet) won't do. Doing a quorum on multiple feeds
(unfortunately, there's just FineReader and OmniPage, and not
much else) is also nontrivial because of frequent frame shifts
in the OCR'd text. One could probably steal something from
bioinformatics to produce optimal alignments of fragments.
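
To make the alignment idea concrete, here is a rough sketch of what
I mean: a plain Needleman-Wunsch global alignment run over word
tokens from two OCR feeds, so that a dropped or inserted word shows
up as a gap instead of shifting everything after it. The function
names (align, agreement) and the scoring values are my own
assumptions for illustration, not anything FineReader or OmniPage
provide, and a real pipeline would likely want fuzzy token matching
and something smarter than whitespace tokenization.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of two token lists.

    Returns a list of (token_a, token_b) pairs; None marks a gap,
    i.e. a token present in one feed but missing in the other.
    Scoring values are illustrative assumptions, not tuned.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align both tokens
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal path.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((a[i - 1], b[j - 1]))
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def agreement(pairs):
    """Fraction of aligned positions where both feeds give the same token."""
    if not pairs:
        return 1.0
    same = sum(1 for x, y in pairs if x is not None and x == y)
    return same / len(pairs)
```

Calling align() on the whitespace-split output of both engines gives
the aligned pairs; positions where the feeds disagree (or where one
has a gap) are the candidates for a closer look, and agreement()
could serve as a crude per-page quality indicator. With only two
feeds this is disagreement detection rather than a true quorum.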

I wonder which joker had the idea (extracting full-text indexable
text from OCR pages with little to no human postprocessing) in
the first place. I could have told him right from the start that
it was going to be difficult, without even trying it. Time to move
on, I guess; too many morons in leading positions.

Eugen* Leitl leitl http://leitl.org
ICBM: 48.07100, 11.36820            http://www.ativel.com
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
