<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="bbPress/1.0.2" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title>Digital Humanities Questions &#38; Answers &#187; Topic: How create a real (with page numbers) index of journal&#039;s entire run, from PDFs?</title>
		<link>http://digitalhumanities.org/answers/topic/how-create-a-real-with-page-numbers-index-of-journals-entire-run-from-pdfs</link>
		<description>Digital Humanities Questions &amp; Answers &#187; Topic: How create a real (with page numbers) index of journal&#039;s entire run, from PDFs?</description>
		<language>en-US</language>
		<pubDate>Wed, 03 Aug 2016 01:15:20 +0000</pubDate>
		<generator>http://bbpress.org/?v=1.0.2</generator>
		<textInput>
			<title><![CDATA[Search]]></title>
			<description><![CDATA[Search all topics from these forums.]]></description>
			<name>q</name>
			<link>http://digitalhumanities.org/answers/search.php</link>
		</textInput>
		<atom:link href="/rss/topic/how-create-a-real-with-page-numbers-index-of-journals-entire-run-from-pdfs/index.xml" rel="self" type="application/rss+xml" />

		<item>
			 
				<title>Joel Kalvesmaki on "How create a real (with page numbers) index of journal&#039;s entire run, from PDFs?"</title>
						<link>http://digitalhumanities.org/answers/topic/how-create-a-real-with-page-numbers-index-of-journals-entire-run-from-pdfs#post-1913</link>
			<pubDate>Thu, 07 Mar 2013 10:22:39 +0000</pubDate>
			<dc:creator>Joel Kalvesmaki</dc:creator>
			<guid isPermaLink="false">1913@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;You could try indexing software such as &#60;a href=&#34;http://www.pdfindexgenerator.com/&#34; rel=&#34;nofollow&#34;&#62;http://www.pdfindexgenerator.com/&#60;/a&#62;. But it sounds as if the level of quality and detail to which you aspire would be best handled not so much by software but by hiring a professional indexer who already uses such software and can write a strong index in a timely manner. Of course, if you have more time than money, this may not be feasible.
&#60;/p&#62;</description>
		</item>
		<item>
			 
				<title>olaf on "How create a real (with page numbers) index of journal&#039;s entire run, from PDFs?"</title>
						<link>http://digitalhumanities.org/answers/topic/how-create-a-real-with-page-numbers-index-of-journals-entire-run-from-pdfs#post-1911</link>
			<pubDate>Wed, 06 Mar 2013 16:25:43 +0000</pubDate>
			<dc:creator>olaf</dc:creator>
			<guid isPermaLink="false">1911@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;&#60;em&#62;Replying to @&#60;a href='/profile/olaf'&#62;olaf&#60;/a&#62;'s &#60;a href=&#34;http://digitalhumanities.org/answers/topic/how-create-a-real-with-page-numbers-index-of-journals-entire-run-from-pdfs#post-1910&#34;&#62;post&#60;/a&#62;:&#60;/em&#62;&#60;/p&#62;
&#60;p&#62;The OCR idea seems to be a bust, unfortunately. It's a pain to convert to a &#34;flat&#34; file without renderable text. Then, even the newest version of Acrobat is finding it difficult to understand diacritics, italics and anything else non-standard. I think I'll be better off working with mistakes that follow a regular pattern (such as a≠ always equals ā) and working on a script or something to do mass replacements.&#60;br /&#62;
Don't know why Adobe doesn't allow OCR of files with renderable text in them. What could be the harm?
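&#60;br /&#62;
A rough sketch of what such a mass-replacement script might look like, assuming Python. Only the a≠ to ā mapping comes from the actual files; the names REPLACEMENTS and repair are made up for illustration, and the table would need to be filled in by checking each old font's garbling patterns.

```python
# Mapping from garbled sequences to the characters they stand for.
# Only the first entry is taken from the thread; add more after
# checking the real files (this table is an illustration, not a survey).
REPLACEMENTS = {
    "a≠": "ā",
}


def repair(text, table=REPLACEMENTS):
    """Apply every replacement, longest garbled sequence first."""
    # Longest-first ordering keeps a short pattern from clobbering
    # part of a longer one once the table grows.
    for garbled in sorted(table, key=len, reverse=True):
        text = text.replace(garbled, table[garbled])
    return text


print(repair("maqa≠mah"))
```

This only helps where the garbling is regular; cases where the characters are also reordered (such as mah≠maqa) would still need fixing by hand.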
&#60;/p&#62;</description>
		</item>
		<item>
			 
				<title>olaf on "How create a real (with page numbers) index of journal&#039;s entire run, from PDFs?"</title>
						<link>http://digitalhumanities.org/answers/topic/how-create-a-real-with-page-numbers-index-of-journals-entire-run-from-pdfs#post-1910</link>
			<pubDate>Wed, 06 Mar 2013 15:00:33 +0000</pubDate>
			<dc:creator>olaf</dc:creator>
			<guid isPermaLink="false">1910@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;&#60;em&#62;Replying to @Peter Organisciak's &#60;a href=&#34;http://digitalhumanities.org/answers/topic/how-create-a-real-with-page-numbers-index-of-journals-entire-run-from-pdfs#post-1909&#34;&#62;post&#60;/a&#62;:&#60;/em&#62;&#60;/p&#62;
&#60;p&#62;Thanks for the tips.&#60;/p&#62;
&#60;p&#62;Not automated. Just more convenient, and perhaps with some automated features to help with the actual index creation.&#60;/p&#62;
&#60;p&#62;I hadn't thought of running any OCR on the older files, since I made them many years ago from the original Word or Nisus files (i.e., they were never scanned or OCRed), but that's a great idea that I'm about to try. Don't know if OCR will ignore the text that's already 'live' though, or if I'll have to flatten them first.&#60;/p&#62;
&#60;p&#62;I'll definitely take a stroll through the research and see what I can find.
&#60;/p&#62;</description>
		</item>
		<item>
			 
				<title>Peter Organisciak on "How create a real (with page numbers) index of journal&#039;s entire run, from PDFs?"</title>
						<link>http://digitalhumanities.org/answers/topic/how-create-a-real-with-page-numbers-index-of-journals-entire-run-from-pdfs#post-1909</link>
			<pubDate>Wed, 06 Mar 2013 14:41:49 +0000</pubDate>
			<dc:creator>Peter Organisciak</dc:creator>
			<guid isPermaLink="false">1909@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;&#60;em&#62;Replying to @&#60;a href='/profile/olaf'&#62;olaf&#60;/a&#62;'s &#60;a href=&#34;http://digitalhumanities.org/answers/topic/how-create-a-real-with-page-numbers-index-of-journals-entire-run-from-pdfs#post-1908&#34;&#62;post&#60;/a&#62;:&#60;/em&#62;&#60;/p&#62;
&#60;p&#62;I believe you're looking for an automated way to create a back-of-the-book index, correct? 'Indexing' tends to refer to building indices for information retrieval (such as Terrier and Lucene's PDF parsers), which is why you couldn't find it on Google.&#60;/p&#62;
&#60;p&#62;Back-of-the-book indexes are tough to parse. Patrick Juola wrote about the need for such software and the technical challenges in &#60;a href=&#34;http://llc.oxfordjournals.org/content/23/1/73.full?sid=f01a5ee3-2477-4711-8fff-d42916eead6d&#34;&#62;Killer Applications for Digital Humanities&#60;/a&#62;. If I recall, he had early work in the area: I'm not sure what came of it. &#60;/p&#62;
&#60;p&#62;I don't know if there is any software that would do what you need. However, since it's a tough problem, you can be sure that researchers have tried it. Your best bet is to look through the research literature and see if any researchers have released their code. A scholar search for 'back-of-the-book indexing' along with keywords like 'unsupervised', 'semi-supervised', or 'automated' gave me some potentially useful articles. Still, you'd probably have to split the problem into two parts (parsing PDFs to text and generating an index), as I suspect there aren't any tools mature enough to include PDF parsing.&#60;/p&#62;
&#60;p&#62;To be honest, your approach of going through manually and highlighting notable terms sounds more tractable to me. With the OCR problems: have you tried re-applying text recognition on the older issues with the newest version of Acrobat Professional? Their OCR improves often.&#60;/p&#62;
&#60;p&#62;Sorry that I don't have a better answer for you. Good luck.
&#60;/p&#62;</description>
		</item>
		<item>
			 
				<title>olaf on "How create a real (with page numbers) index of journal&#039;s entire run, from PDFs?"</title>
						<link>http://digitalhumanities.org/answers/topic/how-create-a-real-with-page-numbers-index-of-journals-entire-run-from-pdfs#post-1908</link>
			<pubDate>Wed, 06 Mar 2013 14:21:41 +0000</pubDate>
			<dc:creator>olaf</dc:creator>
			<guid isPermaLink="false">1908@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;One more wish for the wishlist: a way to designate a term as fitting into more than one topic in the index. For example, al-Zahir Baybars would be indexed as himself and under &#34;sultans&#34;.
&#60;/p&#62;</description>
		</item>
		<item>
			 
				<title>olaf on "How create a real (with page numbers) index of journal&#039;s entire run, from PDFs?"</title>
						<link>http://digitalhumanities.org/answers/topic/how-create-a-real-with-page-numbers-index-of-journals-entire-run-from-pdfs#post-1907</link>
			<pubDate>Wed, 06 Mar 2013 14:09:23 +0000</pubDate>
			<dc:creator>olaf</dc:creator>
			<guid isPermaLink="false">1907@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;&#60;em&#62;Replying to @Dorothea Salo's &#60;a href=&#34;http://digitalhumanities.org/answers/topic/how-create-a-real-with-page-numbers-index-of-journals-entire-run-from-pdfs#post-1905&#34;&#62;post&#60;/a&#62;:&#60;/em&#62;&#60;/p&#62;
&#60;p&#62;I mean a real index, not a concordance. The need to leave out passing mentions is one of the reasons that no software will be able to automate the process.
&#60;/p&#62;</description>
		</item>
		<item>
			 
				<title>olaf on "How create a real (with page numbers) index of journal&#039;s entire run, from PDFs?"</title>
						<link>http://digitalhumanities.org/answers/topic/how-create-a-real-with-page-numbers-index-of-journals-entire-run-from-pdfs#post-1906</link>
			<pubDate>Wed, 06 Mar 2013 14:07:37 +0000</pubDate>
			<dc:creator>olaf</dc:creator>
			<guid isPermaLink="false">1906@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;One thing I've been playing with today is going through a pdf and using the highlight tool on words/phrases, in the hope that I can then export the comments list (which has page numbers) to some format I can work with. Doesn't work very well for the older issues with the messy fonts, since you can't always tell what the word was supposed to be (Ṣubḥ becomes ˝ubh˝S and maqāmah becomes maqa≠mah or mah≠maqa, and words with lots of diacritics become almost unrecognizable as words). Those fonts were on long-dead Macs running OS7-OS9, so aren't available to me now.
&#60;/p&#62;</description>
		</item>
		<item>
			 
				<title>Dorothea Salo on "How create a real (with page numbers) index of journal&#039;s entire run, from PDFs?"</title>
						<link>http://digitalhumanities.org/answers/topic/how-create-a-real-with-page-numbers-index-of-journals-entire-run-from-pdfs#post-1905</link>
			<pubDate>Wed, 06 Mar 2013 13:59:59 +0000</pubDate>
			<dc:creator>Dorothea Salo</dc:creator>
			<guid isPermaLink="false">1905@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;I'm confused. Are you making a concordance (list of words/phrases present in text with pointers), or an index (synthesized list of important terminology, with pointers to meaningful mentions while omitting passing ones)? They're not at all the same thing.
&#60;/p&#62;</description>
		</item>
		<item>
			 
				<title>olaf on "How create a real (with page numbers) index of journal&#039;s entire run, from PDFs?"</title>
						<link>http://digitalhumanities.org/answers/topic/how-create-a-real-with-page-numbers-index-of-journals-entire-run-from-pdfs#post-1904</link>
			<pubDate>Wed, 06 Mar 2013 13:57:10 +0000</pubDate>
			<dc:creator>olaf</dc:creator>
			<guid isPermaLink="false">1904@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;I need to index all back issues of &#60;em&#62;&#60;a href=&#34;http://mamluk.uchicago.edu&#34;&#62;Mamluk Studies Review&#60;/a&#62;&#60;/em&#62; (open access, now digital only but formerly print) but have not had much luck finding ideas about how to go about it.&#60;br /&#62;
Searching the Web for info about indexing PDFs leads largely to results about indexing them on a computer for improved searches, or to indexing services.&#60;br /&#62;
I hope to find software (or scripts or something!) that can &#60;/p&#62;
&#60;ul&#62;
&#60;li&#62;read PDF files&#60;/li&#62;
&#60;li&#62;understand the idea of page numbers&#60;/li&#62;
&#60;li&#62;understand that each page in a pdf is a distinct entity&#60;/li&#62;
&#60;li&#62;handle Unicode and diacritics (and, ideally, Arabic script)&#60;/li&#62;
&#60;li&#62;see phrases or hyphenated words that break across pages as single items&#60;/li&#62;
&#60;/ul&#62;
&#60;p&#62;I don't expect anything to happen automatically: I know I (or better yet an unwary grad student) will have to actually go through and mark words and phrases to be included in the index.&#60;/p&#62;
&#60;p&#62;Bonus points if it can be taught to ignore certain strings when alphabetizing. For example, since 'al-' is Arabic for 'the', it doesn't affect alphabetization (so al-Nasir Muhammad goes in the N section).&#60;br /&#62;
Similarly, there needs to be a way to instruct it that ā and a are the same for purposes of alphabetization, as are ṣ and s, etc.&#60;/p&#62;
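&#60;p&#62;For what it's worth, the two alphabetization rules above can be sketched as a sort key. This is only a rough Python illustration, not an existing indexing tool: the names IGNORED_PREFIXES, strip_diacritics and sort_key are made up here, and it only covers diacritics that Unicode can decompose into a base letter plus a combining mark.&#60;/p&#62;

```python
import unicodedata

# Hypothetical list of leading strings to skip when alphabetizing.
IGNORED_PREFIXES = ("al-",)


def strip_diacritics(text):
    """Decompose characters and drop combining marks, so ā sorts as a."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def sort_key(term):
    """Build a comparison key: no diacritics, case-folded, no leading 'al-'."""
    key = strip_diacritics(term).casefold()
    for prefix in IGNORED_PREFIXES:
        if key.startswith(prefix):
            key = key[len(prefix):]
            break
    return key


terms = ["al-Nāṣir Muḥammad", "Baybars", "maqāmah", "Ṣubḥ", "sultans"]
ordered = sorted(terms, key=sort_key)
for term in ordered:
    print(term)
```

&#60;p&#62;The original spelling is kept for display; only the comparison key is simplified, so al-Nāṣir Muḥammad lands in the N section without losing its diacritics.&#60;/p&#62;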
&#60;p&#62;Super bonus points if it can recognize (or learn to recognize) variations on a word or phrase in terms of spelling (often inconsistent when transliteration is involved), word order or intervening words.&#60;/p&#62;
&#60;p&#62;What I have: 23 issues of the journal as whole-book pdfs, as well as individual pdfs of all articles. Unfortunately, the first half dozen or so were created without Unicode, using proprietary fonts with non-standard encodings. Messy, but I can work around it somehow. I also have InDesign files (various versions) for about half the issues. This will all be done in Windows (32-bit XP and 64-bit 7). I always have the latest version of Acrobat (not reader, the full program).&#60;/p&#62;
&#60;p&#62;The resulting index will be posted on the Web, probably both as a PDF and in some more dynamic and usable format(s). &#60;/p&#62;
&#60;p&#62;Any ideas for ways to streamline this would be appreciated. &#60;/p&#62;
&#60;p&#62;Thanks!&#60;br /&#62;
Olaf
&#60;/p&#62;</description>
		</item>

	</channel>
</rss>
