I need to index all back issues of Mamluk Studies Review (open access, now digital only but formerly print) but have not had much luck finding ideas about how to go about it.
Searching the Web for information about indexing PDFs mostly turns up results about indexing them on a computer for faster searching, or about indexing services.
I hope to find software (or scripts, or something!) that can do the following (a rough sketch of what I'm imagining follows the list):
- read PDF files
- understand the idea of page numbers
- understand that each page in a PDF is a distinct entity
- handle Unicode and diacritics (and, ideally, Arabic script)
- treat phrases or hyphenated words that break across page boundaries as single items
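To give a concrete idea of the per-page handling I'm picturing, here is a very rough Python sketch using the pdfminer.six library (which copes with Unicode). The function names and the hyphen-joining heuristic are my own placeholders, and a real version would still have to map PDF page indices onto the printed page numbers:

```python
from pdfminer.high_level import extract_text
from pdfminer.pdfpage import PDFPage

def page_texts(path):
    """Yield (page_index, text), treating each PDF page as a distinct entity.
    These are 1-based PDF page indices, not the journal's printed page
    numbers; mapping one to the other needs a per-issue offset."""
    with open(path, "rb") as f:
        page_count = sum(1 for _ in PDFPage.get_pages(f))
    for i in range(page_count):
        yield i + 1, extract_text(path, page_numbers=[i])

def full_text(path):
    """Concatenate pages, re-joining words hyphenated across a page break."""
    out = ""
    for _, text in page_texts(path):
        if out.rstrip().endswith("-"):
            # drop the trailing hyphen and glue the broken word together
            out = out.rstrip()[:-1] + text.lstrip()
        else:
            out = out + "\n" + text
    return out
```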
I don't expect anything to happen automatically: I know I (or, better yet, an unwary grad student) will have to go through and mark the words and phrases to be included in the index.
Bonus points if it can be taught to ignore certain strings when alphabetizing. For example, since 'al-' is Arabic for 'the', it doesn't affect alphabetization (so al-Nasir Muhammad goes in the N section).
Similarly, there needs to be a way to tell it that ā and a are the same for purposes of alphabetization, as are ṣ and s, and so on (a sketch of the kind of sort key I mean follows).
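In case it helps anyone point me at the right tool, this is the sort-key logic I mean, as a rough Python sketch. Unicode decomposition handles ā → a and ṣ → s nicely, though standalone transliteration signs like ʿayn are letters rather than combining marks and would need extra handling:

```python
import unicodedata

def sort_key(term):
    """Alphabetization key that ignores a leading article 'al-' and folds
    diacritics by stripping Unicode combining marks."""
    if term.lower().startswith("al-"):
        term = term[3:]
    decomposed = unicodedata.normalize("NFD", term)  # split letter + mark
    folded = "".join(c for c in decomposed if not unicodedata.combining(c))
    return folded.casefold()

names = ["al-Nasir Muhammad", "Baybars", "ṣūfī orders", "Aleppo"]
print(sorted(names, key=sort_key))
# ['Aleppo', 'Baybars', 'al-Nasir Muhammad', 'ṣūfī orders']
```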
Super bonus points if it can recognize (or learn to recognize) variations on a word or phrase in terms of spelling (often inconsistent when transliteration is involved), word order or intervening words.
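To make the "super bonus" concrete, the crudest version of what I imagine looks something like this (difflib is in the Python standard library; the scoring and any threshold would obviously need tuning, and folding the diacritics first takes care of most transliteration-only differences):

```python
import difflib
import unicodedata

def fold(s):
    """Lowercase and strip combining marks so transliteration variants
    such as Qalawun / Qalāwūn compare as equal."""
    d = unicodedata.normalize("NFD", s.casefold())
    return "".join(c for c in d if not unicodedata.combining(c))

def variant_score(a, b):
    """Crude similarity between candidate index terms: compare the folded
    strings both as-is and with their words sorted, so differences in
    word order still score high."""
    a, b = fold(a), fold(b)
    plain = difflib.SequenceMatcher(None, a, b).ratio()
    reordered = difflib.SequenceMatcher(
        None, " ".join(sorted(a.split())), " ".join(sorted(b.split()))
    ).ratio()
    return max(plain, reordered)

print(variant_score("Qalāwūn", "Qalawun"))                      # 1.0
print(variant_score("al-Nasir Muhammad", "Muhammad al-Nasir"))  # 1.0
```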
What I have: 23 issues of the journal as whole-book PDFs, plus individual PDFs of every article. Unfortunately, the first half dozen or so were created without Unicode, using proprietary fonts with non-standard encodings. Messy, but I can work around it somehow, perhaps with a per-font character remapping (see the sketch below). I also have InDesign files (various versions) for about half the issues. This will all be done on Windows (32-bit XP and 64-bit 7), and I always have the latest version of Acrobat (the full program, not Reader).
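For those pre-Unicode issues, the remapping I have in mind is a per-font translation table, once I figure out which real characters the old font slots stood for. The mappings below are invented placeholders, not the actual encoding:

```python
# HYPOTHETICAL mapping: the real slot-to-character table has to be
# recovered for each legacy font by inspecting its glyphs.
LEGACY_MAP = str.maketrans({
    "\u00E6": "\u0101",  # suppose the old font drew ā in the æ slot
    "\u00F0": "\u1E63",  # suppose it drew ṣ in the ð slot
})

def repair_legacy(text):
    """Remap text extracted from the pre-Unicode issues into real Unicode."""
    return text.translate(LEGACY_MAP)
```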
The resulting index will be posted on the Web, probably both as a PDF and in some more dynamic and usable format(s).
Any ideas for ways to streamline this would be appreciated.
Thanks!
Olaf