I have a batch of JPEG images of typescript correspondence from the early 20th century which should be plenty high-quality to run through OCR. All I'm looking for is a good way to get a mostly-trustworthy set of searchable text out of a mass of images. I don't really have the budget to invest in commercial OCR software, and I know there's gotta be an open-source way to do what I want.
So far, the best open-source OCR option I've found is OCRopus, which is still under heavy development. I've been able to get it to generate basic output in the hOCR HTML format, but there's no good documentation on how to train it for better recognition of a particular corpus of images.
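For reference, the closest thing to a working recipe I've pieced together for producing that hOCR output looks roughly like this; the script names come from the ocropy examples and may well differ in other OCRopus releases:

```
# binarize and deskew a scanned page into the "book" working directory
ocropus-nlbin letter-001.jpg -o book
# segment the binarized page image into individual text lines
ocropus-gpageseg 'book/????.bin.png'
# run recognition on each line image with the default English model
ocropus-rpred -m en-default.pyrnn.gz 'book/????/??????.bin.png'
# stitch the per-line results back together into an hOCR HTML file
ocropus-hocr 'book/????.bin.png' -o letter-001.html
```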
I'm imagining that I should be able to take the hOCR output from an OCRopus run over a particular image (A.html), copy it to B.html, correct the mis-recognitions in B.html, and then run some command on the pair of files to train OCRopus for better recognition. However, the documentation that does exist is aimed more at developers than at users, and I'm having trouble working out the right way to do this.
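To make that concrete, the kind of invocation I'm hoping exists is something like the following; to be clear, `ocropus-train` and its flags are entirely made up here, just to illustrate the workflow I'm after:

```
# Hypothetical: "ocropus-train" and these flags are invented to show the
# workflow I want -- original output plus corrected output in, an improved
# recognition model for my corpus out.
ocropus-train --recognized A.html --corrected B.html \
              --base-model en-default --output my-corpus-model
```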
Has anyone successfully done this, and if so, how? Is there some other package that's easier or better to use for this purpose?
Thanks in advance for your thoughts.