I have a batch of JPEG images of typescript correspondence from the early 20th century which should be plenty high-quality to run through OCR. All I'm looking for is a good way to get a mostly-trustworthy set of searchable text out of a mass of images. I don't really have the budget to invest in commercial OCR software, and I know there's gotta be an open-source way to do what I want.
So far, the best open-source OCR option I've found is OCRopus, which is still under heavy development. I've been able to get it to generate basic output in the hOCR HTML format, but there's no good documentation on how to train it for better recognition of a particular corpus of images.
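For reference, the closest thing to a working recipe I've pieced together for producing that hOCR output looks roughly like this; the script names come from the ocropy examples and may well differ in other OCRopus releases:

```
# binarize and deskew a scanned page into the "book" working directory
ocropus-nlbin letter-001.jpg -o book
# segment the binarized page image into individual text lines
ocropus-gpageseg 'book/????.bin.png'
# run recognition on each line image with the default English model
ocropus-rpred -m en-default.pyrnn.gz 'book/????/??????.bin.png'
# stitch the per-line results back together into an hOCR HTML file
ocropus-hocr 'book/????.bin.png' -o letter-001.html
```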
I'm imagining that I should be able to take the hOCR output from an OCRopus run over a particular image (A.html), copy it to B.html, correct the mis-recognitions in B.html, and then run some command on the pair of files to train OCRopus for better recognition. However, the documentation that does exist is aimed more at developers than at users, and I'm having trouble working out the right way to do this.
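To make that concrete, the kind of invocation I'm hoping exists is something like the following; to be clear, `ocropus-train` and its flags are entirely made up here, just to illustrate the workflow I'm after:

```
# Hypothetical: "ocropus-train" and these flags are invented to show the
# workflow I want -- original output plus corrected output in, an improved
# recognition model for my corpus out.
ocropus-train --recognized A.html --corrected B.html \
              --base-model en-default --output my-corpus-model
```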
Has anyone successfully done this, and if so, how? Is there some other package that's easier or better to use for this purpose?
Thanks in advance for your thoughts.