[FoRK] Re: [>Htech] OCR help needed
Eugen Leitl <eugen at leitl.org>
Mon Jul 3 12:55:18 PDT 2006
On Tue, Jul 04, 2006 at 05:24:04AM +1000, Damien Morton wrote:
> Doubtful, given that the extremely poor quality scan is pretty unusual.
That's the first line of the scan. Quality improves further down the page.
> Why are the scans such poor quality anyway?
It's an outlier and, I hope, not very typical. A large organization
is preparing the scans (which come in different fonts, probably
even different resolutions, and varying quality), so I have no control
over quality. My trouble is finding a diagnostic procedure which
handles such variation gracefully. Unfortunately, the scans
contain graphs, structural formulas and IUPAC names, so just looking
at the IUPAC error count from a nomenclature spell checker (no such
thing exists yet) won't do. Doing quorum on multiple feeds
(unfortunately, there's just FineReader and OmniPage, and not
much else) is also nontrivial because of frequent frame shifts
in OCR'd text. One could probably steal something from bioinformatics
to produce optimal alignments of fragments.
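A minimal sketch of that alignment idea, assuming word-level tokens and a
plain Needleman-Wunsch global alignment (the textbook bioinformatics method,
used here purely for illustration; nothing in this thread has been implemented
this way). The names `align` and `agreement` and the scoring values are
hypothetical. The agreement ratio between two engines might serve as a rough
per-page quality diagnostic, but that is an untested assumption.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of two token lists.

    Returns a list of (x, y) pairs; None marks a gap (a frame shift).
    Scoring values are illustrative assumptions, not tuned for OCR.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal path.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                pairs.append((a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))   # token missing from feed b
            i -= 1
        else:
            pairs.append((None, b[j - 1]))   # token missing from feed a
            j -= 1
    pairs.reverse()
    return pairs


def agreement(pairs):
    """Fraction of aligned columns where both feeds agree exactly."""
    if not pairs:
        return 1.0
    return sum(1 for x, y in pairs if x == y) / len(pairs)


# Two hypothetical OCR feeds of the same line; the second dropped a word
# (a frame shift) and misread another.
feed_a = "the quick brown fox jumps".split()
feed_b = "quick brown f0x jumps".split()
pairs = align(feed_a, feed_b)
print(pairs)
print(agreement(pairs))
```

With only two engines there is no real quorum, just agreement or
disagreement; a third feed would be needed to vote on the mismatched columns.
The table is quadratic in the number of tokens, so this would have to run per
line or per paragraph rather than per document.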
I wonder which joker had the idea (extracting full-text indexable
text from OCR pages with little to no human postprocessing) in
the first place. I could have told him right from the start that
it was going to be difficult, without even trying it. Time to move
on, I guess; too many morons in leading positions.
Eugen* Leitl http://leitl.org
ICBM: 48.07100, 11.36820 http://www.ativel.com
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE