How to Handle Soft Hyphens in OCR'd PDF? « Digital Humanities Questions & Answers

(6 posts) (5 voices)

Asked 6 years ago by cforster
Latest answer from Ge0ffrey

cforster
Member
I am using ABBYY Finereader 10.0 to OCR some material I've scanned. I would like to export the result as a PDF with the text "under" the image. Everything seems to be going well, except the way that soft-hyphens (Finereader calls them "optional hyphens") are handled in the PDF output.

While the OCR successfully recognizes most end-of-line hyphens as soft hyphens, when the result is exported to PDF, the word remains split in half and therefore will not appear in search results. For example, if the word "digital" appeared at the end of a line, hyphenated as "digi-tal," Finereader recognizes the hyphen as a soft hyphen. But if I export to PDF and search for the term "digital" it will not be found (but "digi-" and "tal" would).

Any thoughts on how to handle this? I could just manually rejoin these words, but that seems absurd. After futzing about for quite a while, though, I've been unable to find a better solution.
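
As a rough illustration of the problem and of an automated alternative to rejoining words by hand, here is a minimal Ruby sketch. It assumes the text has been extracted from the PDF to a plain string and that the optional hyphen is encoded as U+00AD (SOFT HYPHEN) followed by a line break; neither is confirmed for FineReader's output, and `rejoin_soft_hyphens` is a hypothetical helper name.

```ruby
# Hypothetical helper: rejoin words split by an end-of-line soft hyphen.
# Assumption: the optional hyphen is U+00AD, optionally followed by a
# line break and leading indentation on the next line.
def rejoin_soft_hyphens(text)
  # Remove each soft hyphen together with any line break right after it.
  text.gsub(/\u00AD(?:\r?\n[ \t]*)?/, "")
end

sample = "tools for digi\u00AD\ntal humanities"

puts sample.include?("digital")                      # false
puts rejoin_soft_hyphens(sample).include?("digital") # true
```

Hard hyphens (the ordinary "-" character) are left alone, so compounds such as "end-of-line" survive the cleanup.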
Posted 6 years ago Permalink
Wayne Graham
Member
Do you really need to put the words together? I think this is a case where you may need to rely on something like a search engine in order to get "expected" results. Something like Solr should help with these kinds of issues. Essentially, using its stemmers, words like digital drop the "tal" suffix in the index anyway, so you would get a match in a result as expected. Using the new ExtractingRequestHandler (http://wiki.apache.org/solr/ExtractingRequestHandler), you can actually add PDFs to an index without patching the core and recompiling the sources.

You could do something along these lines with Ruby to index your PDFs:
#! /usr/bin/env ruby

Dir["path/to/pdfs/*.pdf"].each do |file|
  # Extract the file name out of the file path, excluding extension
  fname = File.basename(file, '.pdf')

  # Don't do it this way in a production environment; intended just to get you going
  # Use rsolr gem if you want to do this on a server
  `curl "http://localhost:8983/solr/update/extract?literal.id=#{fname}&commit=true" -F "myfile=@#{file}"`
end

Unless there's some really weird hyphenation, most modern English words will be hyphenated in the same place the stemmers will drop suffixes, so you'll still get a hit on "digital" on "digi" and "tal" (and probably "dig" and "ital"). For more information on the stemmers used, see the Solr wiki at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Stemming.
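
One weakness of the quick script above is that the file name goes into the URL and the shell command unescaped, so a name containing spaces or "&" would break the request. A slightly more defensive sketch, using only Ruby's standard library, might look like the following. The endpoint is the one from the curl example; `extract_url` and `index_pdf` are hypothetical helper names, and the code assumes curl is installed and Solr is running locally.

```ruby
require "uri"

# Assumed endpoint, taken from the curl example in this post.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"

# Hypothetical helper: build the extract URL with the document id URL-encoded.
def extract_url(id, base = SOLR_EXTRACT_URL)
  query = URI.encode_www_form("literal.id" => id, "commit" => "true")
  "#{base}?#{query}"
end

# Hypothetical helper: post one PDF to Solr via curl.
# Passing the arguments as a list means no shell is involved,
# so spaces and quotes in file names need no extra escaping.
def index_pdf(path)
  id = File.basename(path, ".pdf")
  system("curl", extract_url(id), "-F", "myfile=@#{path}")
end

puts extract_url("Moby Dick & other tales")
# http://localhost:8983/solr/update/extract?literal.id=Moby+Dick+%26+other+tales&commit=true
```

Calling `index_pdf` for each entry of `Dir["path/to/pdfs/*.pdf"]` would then replace the backtick call in the loop.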
Posted 6 years ago Permalink
Stéfan Sinclair
Admin

If searching within the image-over-text PDF is the important thing, it sounds like the PDF software may actually be the problem, rather than the output from Finereader. I'm curious how different PDF readers handle this – could you make available somewhere a hyphenated PDF? (It would be nice to be able to upload here.)

Posted 6 years ago Permalink
cforster
Member

Wayne, what we would like is a searchable PDF; so a single file that would be searchable for keywords (without losing any to hyphenation).

But, in fact, this may all be moot; I went to output a single page from Finereader to share as an example... and it seems to be searching successfully across the soft hyphens. This may be, as Stéfan suggests, a matter of what reader one is using. But, like that odd car noise which disappears when the mechanic arrives, I need to be able to more reliably reproduce this problem before I take anyone's time...

Sorry about that.

Posted 6 years ago Permalink
Joe Wicentowski
Member

Sounds like this would be a good question to post to the ABBYY Finereader forums - perhaps they can help you disable the creation of the soft hyphens. I just noticed this problem myself today when working with a PDF document that I OCRed with ABBYY and saved to MS Word. Even in Word, I couldn't find & replace the soft hyphens away...

Posted 6 years ago Permalink
Ge0ffrey
Member

Replying to @Joe Wicentowski's post:

There's a solution if you first import the text into Word before turning it into a PDF. First, highlight the soft hyphen and copy it. Then press Ctrl+H (Find and Replace) and paste the hyphen into the "Find what" box. Replace it with a regular hyphen or some other sequence of characters, such as ***. (A placeholder like *** is probably the safer choice, since deleting every regular hyphen afterwards would also strip the legitimate ones.) Then do a second search and replace for the regular hyphen or the *** and replace it with nothing. Et voilà. For some odd reason, searching in Word for the hyphen generated by ABBYY FineReader and replacing it directly with nothing doesn't work.

Posted 4 years ago Permalink
