How to Convert PDFs to Plain Text

I'm trying to create a corpus of texts to use for various data mining purposes, but I can only access the texts in PDF form. I have 21 PDFs ranging from 250-600 pages; most are in the 400-500 page range. I've tried online converters, but they aren't able to handle the job. Does anyone have a way of doing this?
Posted 4 years ago
If you know a friendly neighborhood Linux user, there's a package called pdf2txt that would likely be able to do it. I don't know if something similar is available on other operating systems.
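The exact tool name varies a bit: poppler ships a command called pdftotext, and pdfminer ships pdf2txt.py. Assuming poppler's pdftotext is installed and on the PATH, and with placeholder folder names, a rough batch loop like this would run all 21 files in one go instead of one at a time:

    # Batch-conversion sketch: assumes poppler's pdftotext is on the PATH
    # and that the folder names below are adjusted to your setup.
    import pathlib
    import subprocess

    pdf_dir = pathlib.Path("pdfs")      # folder holding the source PDFs
    out_dir = pathlib.Path("corpus")    # where the .txt files will go
    out_dir.mkdir(exist_ok=True)

    for pdf in sorted(pdf_dir.glob("*.pdf")):
        txt = out_dir / (pdf.stem + ".txt")
        # -layout keeps the page layout; drop it if you want rawer text
        subprocess.run(["pdftotext", "-layout", str(pdf), str(txt)], check=True)
        print("converted", pdf.name)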
Posted 4 years ago
If it's already in digital text form, pdf2txt is great.
If it's just images (the easiest way to check is whether you can copy and paste from the file in Acrobat or Preview or whatever), you have to do the OCR as well. If you have a Mac, I wrote up a blow-by-blow guide to a free command-line solution for my grad class a while ago. There are also software packages your library might have that would make it easier.
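If you'd rather not open each file by hand, a rough programmatic version of that copy-and-paste check is to see whether a text extractor gets anything substantial back from the first few pages. This sketch assumes pdfminer.six is installed (pip install pdfminer.six) and uses a placeholder filename:

    # Rough text-layer check: if almost nothing comes back, the PDF is
    # probably scanned images and will need OCR first.
    from pdfminer.high_level import extract_text

    def has_text_layer(path, sample_pages=3, min_chars=100):
        text = extract_text(path, maxpages=sample_pages) or ""
        return len(text.strip()) >= min_chars

    print(has_text_layer("example.pdf"))  # placeholder filename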
Posted 4 years ago
Does anyone know if these solutions convert the PDFs to completely plain text, with no formatting at all, or if there are others out there that do?
We have a prof who is working with PDFs from the Congressional Record, and he needs absolutely clean, unformatted plain text to then process programmatically. So far he's been doing this by hand because the OCR tools he's tried still leave some formatting.
Posted 4 years ago
pdf2txt really does give you plain text. PrimeOCR has settings for the output: among other possible formats are RTF and plain text. I would imagine that other OCR packages can also create plain text if properly configured. If not, I would think you could post-process the output with another tool (such as rtf2txt).
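And if a little formatting residue still sneaks through, a small clean-up pass over the converted text usually gets the rest. The rules below are only illustrative (placeholder filenames, nothing tuned to the Congressional Record specifically):

    # Post-processing sketch: strip page breaks and collapse stray whitespace
    # left behind by converters/OCR. Adjust the rules to the actual output.
    import re

    def clean_text(raw):
        text = raw.replace("\f", "\n")            # pdftotext marks page breaks with form feeds
        text = re.sub(r"[ \t]+", " ", text)       # collapse runs of spaces/tabs
        text = re.sub(r"\n{3,}", "\n\n", text)    # collapse runs of blank lines
        return text.strip()

    with open("raw.txt", encoding="utf-8") as f:       # placeholder filenames
        cleaned = clean_text(f.read())
    with open("clean.txt", "w", encoding="utf-8") as f:
        f.write(cleaned)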
Posted 4 years ago
What are the "software packages your library might have" that would make this process easier? Many thanks!
Posted 3 years ago