How to Convert PDFs to Plain Text

I'm trying to create a corpus of texts to use for various data mining purposes, but I can only access the texts in PDF form. I have 21 PDFs ranging from 250-600 pages; most are in the 400-500 page range. I've tried online converters, but they aren't able to handle the job. Does anyone have a way of doing this?
Posted 4 years ago
If you know a friendly neighborhood Linux user, there's a package called pdf2txt that would likely be able to do it. I don't know if something similar is available on other operating systems.
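The exact tool name varies a bit: poppler ships a command called pdftotext, and pdfminer ships pdf2txt.py. Assuming poppler's pdftotext is installed and on the PATH, and with placeholder folder names, a rough batch loop like this would run all 21 files in one go instead of one at a time:

    # Batch-conversion sketch: assumes poppler's pdftotext is on the PATH
    # and that the folder names below are adjusted to your setup.
    import pathlib
    import subprocess

    pdf_dir = pathlib.Path("pdfs")      # folder holding the source PDFs
    out_dir = pathlib.Path("corpus")    # where the .txt files will go
    out_dir.mkdir(exist_ok=True)

    for pdf in sorted(pdf_dir.glob("*.pdf")):
        txt = out_dir / (pdf.stem + ".txt")
        # -layout keeps the page layout; drop it if you want rawer text
        subprocess.run(["pdftotext", "-layout", str(pdf), str(txt)], check=True)
        print("converted", pdf.name)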
Posted 4 years ago
If it's already in digital text form, pdf2txt is great.
If it's just images (the easiest way to check is whether you can copy and paste from the file in Acrobat or Preview or whatever), you have to do the OCR as well. If you have a Mac, I wrote up a blow-by-blow guide to a free command-line solution for my grad class a while ago. There are also software packages your library might have that would make it easier.
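If you'd rather not open each file by hand, a rough programmatic version of that copy-and-paste check is to see whether a text extractor gets anything substantial back from the first few pages. This sketch assumes pdfminer.six is installed (pip install pdfminer.six) and uses a placeholder filename:

    # Rough text-layer check: if almost nothing comes back, the PDF is
    # probably scanned images and will need OCR first.
    from pdfminer.high_level import extract_text

    def has_text_layer(path, sample_pages=3, min_chars=100):
        text = extract_text(path, maxpages=sample_pages) or ""
        return len(text.strip()) >= min_chars

    print(has_text_layer("example.pdf"))  # placeholder filename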
Posted 4 years ago
Does anyone know if these solutions convert the PDFs to completely plain text, with no formatting at all, or if there are others out there that do?
We have a prof who is working with PDFs from the Congressional Record, and he needs absolutely clean, unformatted plain text to then process programmatically. So far he's been doing this by hand because the OCR tools he's tried still leave some formatting.
Posted 4 years ago
pdf2txt really does give you plain text. PrimeOCR has settings for the output: among other possible formats are RTF and plain text. I would imagine that other OCR packages can also create plain text if properly configured. If not, I would think you could post-process the output with another tool (such as rtf2txt).
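And if a little formatting residue still sneaks through, a small clean-up pass over the converted text usually gets the rest. The rules below are only illustrative (placeholder filenames, nothing tuned to the Congressional Record specifically):

    # Post-processing sketch: strip page breaks and collapse stray whitespace
    # left behind by converters/OCR. Adjust the rules to the actual output.
    import re

    def clean_text(raw):
        text = raw.replace("\f", "\n")            # pdftotext marks page breaks with form feeds
        text = re.sub(r"[ \t]+", " ", text)       # collapse runs of spaces/tabs
        text = re.sub(r"\n{3,}", "\n\n", text)    # collapse runs of blank lines
        return text.strip()

    with open("raw.txt", encoding="utf-8") as f:       # placeholder filenames
        cleaned = clean_text(f.read())
    with open("clean.txt", "w", encoding="utf-8") as f:
        f.write(cleaned)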
Posted 4 years ago
What are the "software packages your library might have" that would make this process easier? Many thanks!
Posted 3 years ago