I need to index all back issues of Mamluk Studies Review (open access, now digital only but formerly print) but have not had much luck finding ideas about how to go about it.
Searching the Web for information about indexing PDFs mostly turns up results about indexing them on a computer for faster searching, or about indexing services.
I hope to find software (or scripts, or something!) that can do the following (a rough sketch of what I'm imagining follows the list):
- read PDF files
- understand the idea of page numbers
- understand that each page in a PDF is a distinct entity
- handle Unicode and diacritics (and, ideally, Arabic script)
- treat phrases or hyphenated words that break across page boundaries as single items
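To give a concrete idea of the per-page handling I'm picturing, here is a very rough Python sketch using the pdfminer.six library (which copes with Unicode). The function names and the hyphen-joining heuristic are my own placeholders, and a real version would still have to map PDF page indices onto the printed page numbers:

```python
from pdfminer.high_level import extract_text
from pdfminer.pdfpage import PDFPage

def page_texts(path):
    """Yield (page_index, text), treating each PDF page as a distinct entity.
    These are 1-based PDF page indices, not the journal's printed page
    numbers; mapping one to the other needs a per-issue offset."""
    with open(path, "rb") as f:
        page_count = sum(1 for _ in PDFPage.get_pages(f))
    for i in range(page_count):
        yield i + 1, extract_text(path, page_numbers=[i])

def full_text(path):
    """Concatenate pages, re-joining words hyphenated across a page break."""
    out = ""
    for _, text in page_texts(path):
        if out.rstrip().endswith("-"):
            # drop the trailing hyphen and glue the broken word together
            out = out.rstrip()[:-1] + text.lstrip()
        else:
            out = out + "\n" + text
    return out
```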
I don't expect anything to happen automatically: I know I (or, better yet, an unwary grad student) will have to go through and mark the words and phrases to be included in the index.
Bonus points if it can be taught to ignore certain strings when alphabetizing. For example, since 'al-' is Arabic for 'the', it doesn't affect alphabetization (so al-Nasir Muhammad goes in the N section).
Similarly, there needs to be a way to tell it that ā and a are the same for purposes of alphabetization, as are ṣ and s, and so on (a sketch of the kind of sort key I mean follows).
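In case it helps anyone point me at the right tool, this is the sort-key logic I mean, as a rough Python sketch. Unicode decomposition handles ā → a and ṣ → s nicely, though standalone transliteration signs like ʿayn are letters rather than combining marks and would need extra handling:

```python
import unicodedata

def sort_key(term):
    """Alphabetization key that ignores a leading article 'al-' and folds
    diacritics by stripping Unicode combining marks."""
    if term.lower().startswith("al-"):
        term = term[3:]
    decomposed = unicodedata.normalize("NFD", term)  # split letter + mark
    folded = "".join(c for c in decomposed if not unicodedata.combining(c))
    return folded.casefold()

names = ["al-Nasir Muhammad", "Baybars", "ṣūfī orders", "Aleppo"]
print(sorted(names, key=sort_key))
# ['Aleppo', 'Baybars', 'al-Nasir Muhammad', 'ṣūfī orders']
```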
Super bonus points if it can recognize (or learn to recognize) variations on a word or phrase in terms of spelling (often inconsistent when transliteration is involved), word order or intervening words.
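To make the "super bonus" concrete, the crudest version of what I imagine looks something like this (difflib is in the Python standard library; the scoring and any threshold would obviously need tuning, and folding the diacritics first takes care of most transliteration-only differences):

```python
import difflib
import unicodedata

def fold(s):
    """Lowercase and strip combining marks so transliteration variants
    such as Qalawun / Qalāwūn compare as equal."""
    d = unicodedata.normalize("NFD", s.casefold())
    return "".join(c for c in d if not unicodedata.combining(c))

def variant_score(a, b):
    """Crude similarity between candidate index terms: compare the folded
    strings both as-is and with their words sorted, so differences in
    word order still score high."""
    a, b = fold(a), fold(b)
    plain = difflib.SequenceMatcher(None, a, b).ratio()
    reordered = difflib.SequenceMatcher(
        None, " ".join(sorted(a.split())), " ".join(sorted(b.split()))
    ).ratio()
    return max(plain, reordered)

print(variant_score("Qalāwūn", "Qalawun"))                      # 1.0
print(variant_score("al-Nasir Muhammad", "Muhammad al-Nasir"))  # 1.0
```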
What I have: 23 issues of the journal as whole-book PDFs, plus individual PDFs of every article. Unfortunately, the first half dozen or so were created without Unicode, using proprietary fonts with non-standard encodings. Messy, but I can work around it somehow, perhaps with a per-font character remapping (see the sketch below). I also have InDesign files (various versions) for about half the issues. This will all be done on Windows (32-bit XP and 64-bit 7), and I always have the latest version of Acrobat (the full program, not Reader).
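For those pre-Unicode issues, the remapping I have in mind is a per-font translation table, once I figure out which real characters the old font slots stood for. The mappings below are invented placeholders, not the actual encoding:

```python
# HYPOTHETICAL mapping: the real slot-to-character table has to be
# recovered for each legacy font by inspecting its glyphs.
LEGACY_MAP = str.maketrans({
    "\u00E6": "\u0101",  # suppose the old font drew ā in the æ slot
    "\u00F0": "\u1E63",  # suppose it drew ṣ in the ð slot
})

def repair_legacy(text):
    """Remap text extracted from the pre-Unicode issues into real Unicode."""
    return text.translate(LEGACY_MAP)
```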
The resulting index will be posted on the Web, probably both as a PDF and in some more dynamic and usable format(s).
Any ideas for ways to streamline this would be appreciated.
Thanks!
Olaf