resources for the theory and math behind text-mining algorithms

(6 posts) (5 voices)

Asked 6 years ago by jeri.elizabeth@gmail.com
Latest answer from johnlaudun

jeri.elizabeth@gmail.com
Member
I am working on a minor field reading list in digital history and am looking for recommendations for texts or other resources that deal with the theory and math behind the algorithms digital humanists use for text mining and topic modeling. I have already identified Blei's "Probabilistic Topic Models: Surveying a suite of algorithms that offer a solution to managing large document archives" and am looking for more of the same.

Thank you!
Posted 6 years ago Permalink
tedunderwood
Member

Excellent question! I started out reading a little of this and a little of that, mainly in works by digital humanists themselves, which works fine for a long time.

For instance, the user instructions for Wordhoard at Northwestern are very helpful. I haven't used the tool itself, but browsing the instructions taught me a lot. E.g., what's "Dunning's log likelihood"?

http://wordhoard.northwestern.edu/userman/index.html
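
For anyone wondering what that statistic looks like in practice, here is a minimal sketch of the two-term log-likelihood (G2) form commonly used to compare a word's frequency across two corpora. The function name and argument names are my own, and the counts in the usage lines are made up for illustration; check the Wordhoard manual for the exact variant it uses.

```python
import math

def log_likelihood(a, b, c, d):
    """G2 score for one word.

    a: count of the word in corpus 1, which has c tokens in total
    b: count of the word in corpus 2, which has d tokens in total
    """
    # Expected counts if the word were equally frequent in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

# Same relative frequency in both corpora: the score is zero.
same_rate = log_likelihood(10, 20, 1000, 2000)
# Three times as frequent in corpus 1: the score is clearly positive.
skewed = log_likelihood(30, 10, 1000, 1000)
```

The larger the score, the less plausible it is that the word is equally common in both corpora. The score alone does not say which corpus overuses the word; compare a/c with b/d for that.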

But eventually I realized that there was a whole different *discipline* out there that I was getting only indirectly, in bits and pieces. I think, sooner or later, we need to face that reality and go direct to the source. You're doing that by putting Blei on your list, so congrats.

On the other hand, this is kind of a crazy thing to ask a grad student to do. You're specializing in one part of Discipline A (history) -- you can't seriously add "(plus everything in Discipline B)" to your fields list!

But what can I say? This is a crazy enterprise.

So, if you're looking for an overview of data science / machine learning, I would honestly just pick up a textbook. Let's face it: this is a whole different discipline. Don't read the whole book, necessarily. Browse for parts that are relevant to what you want to do. Online, I often find myself consulting this one:

http://nlp.stanford.edu/IR-book/

It'll explain things like naive Bayes document classification.
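
To give a sense of what that involves, here is a rough sketch of a multinomial naive Bayes classifier with add-one smoothing, the general approach textbooks of this kind describe. The function names, the toy documents, and the "voyage" / "politics" labels are all invented for illustration; this is not code from the book.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count documents per class and words per class from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(tokens, model):
    """Return the class with the highest log-probability for the tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Start from the log prior: how common is this class overall?
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for t in tokens:
            if t in vocab:  # words never seen in training are skipped
                # Add-one smoothing keeps unseen class/word pairs above zero.
                score += math.log(
                    (word_counts[label][t] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, made up for this example.
training = [
    ("ship whale sea".split(), "voyage"),
    ("whale harpoon sea".split(), "voyage"),
    ("parliament vote tax".split(), "politics"),
    ("tax law vote".split(), "politics"),
]
model = train(training)
guess_a = classify("whale sea".split(), model)
guess_b = classify("vote tax".split(), model)
```

The "naive" part is the assumption that each word contributes independently to the score, which is plainly false for real language but often works well enough in practice.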

But my current favorite source is _Data Mining_ by Witten, Frank, and Hall. It's pretty readable -- more readable than that Stanford text -- and it focuses on practical questions. A lot of it may be more ambitious than you can actually do at this stage, but it's important to know what's possible. E.g., different kinds of exploratory clustering that might help you recognize groups of similar documents.

I'm constantly running into mathematical notation I don't understand. I think Steve Ramsay is about to come out with a book on mathematical notation for humanists. When he does, we should all buy it. Until then, honestly, I have two strategies: a) start by ignoring the math and just try to understand the text and/or pseudocode and then b) if it turns out I really need to understand the math, Wikipedia!

Godspeed. This stuff is not easy, but I think it's a thrilling challenge.

Posted 6 years ago Permalink
jeri.elizabeth@gmail.com
Member

Wonderful! This is very helpful and I really appreciate the suggestions of particular textbooks.

I am trying to remember to not add an entire other field into the list, but I do want my minor field to include some background information to help me be more focused and intentional in future uses of these tools. As you say, it is a crazy enterprise!!

Posted 6 years ago Permalink
dbamman
Member

I second Ted's rec to pick up a textbook -- machine learning is a vast field. In my opinion, Kevin Murphy's new one ("Machine Learning: A Probabilistic Perspective") is hands down the most up-to-date and one of the best out there for the theory and math behind the methods; Hastie, Tibshirani and Friedman's Elements of Statistical Learning is also excellent (and you can get the pdf for that one online: http://www-stat.stanford.edu/~tibs/ElemStatLearn/).

David

Posted 6 years ago Permalink
Peter Organisciak
Member

Ted says it best: what you're looking for is a large area and can be daunting at the start. It can also be quite fulfilling and fun, as text mining is essentially an exercise in explicating your intuitions about the world in a way that a computer could understand.

In general, you want to concentrate on the basic assumptions made in text mining. Documents are often represented as a "bag of words." The bag of words model has intuitive flaws but is a good starting spot because of its simplicity. It represents a document by the terms that occur in it and how often they occur, while disregarding the more complex information contained in where they occur and what they occur alongside. A number of natural questions follow. What about uninteresting words, like "the" or "and"? If words are stand-ins for concepts, don't synonyms and verb tenses violate the assumption? Such questions lead you to stopword lists, stemming and lemmatization, and term weighting.
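
As a minimal sketch of the bag of words idea, assuming a deliberately tiny stopword list and a crude regular-expression tokenizer (both are placeholders of mine; real projects use much longer lists and better tokenizers):

```python
import re
from collections import Counter

# A toy stopword list for illustration only.
STOPWORDS = {"the", "and", "a", "of", "in", "to"}

def bag_of_words(text, stopwords=STOPWORDS):
    """Return term counts for a document, discarding word order and position."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

doc = "The whale and the ship: the ship chased the whale."
bag = bag_of_words(doc)
```

Note what is lost: the bag for this sentence is identical to the bag for "The ship and the whale: the whale chased the ship." That is exactly the kind of information the model throws away.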

Like Ted, I also recommend the Manning et al. Information Retrieval book (http://nlp.stanford.edu/IR-book/). For an introduction, the earlier Manning and Schütze book on Foundations of Statistical Natural Language Processing might be even better. It's not available online, but your library is sure to have it. Both of these were on my qualifying exam reading list, and at times they are more helpful than the foundational papers which they describe.

For the former book, try the following chapters: 1 (Boolean retrieval), 2 (term vocabulary), 6 (term weighting and the vector space model), 9 (query expansion), 11 (probabilistic IR), and 13 (text classification and Naive Bayes). For the latter, most of the Words and Grammar sections would be helpful, especially the chapters on collocations, n-gram models, word sense disambiguation, and part-of-speech tagging. Finally, the text classification and hard clustering chapters help bring it together.

These books try both to teach beginners and to serve as a reference for experts, so the non-math descriptions are generally complete. Skip the math, but read the text carefully.

Posted 6 years ago Permalink
johnlaudun
Member
It's a great question and one that also highlights, I believe, that we are still early in the era where there is a great deal of sorting out to do. One of the questions we haven't really asked is: Is there a difference between what historians need to do algorithmically and what literary scholars need to do? Much of the computer science involved assumes a referential understanding of the nature of texts, but the referential function may not necessarily be the dominant function, let alone the only function of a text.

No matter how those distinctions either shake out or reveal new kinds of lines, hopefully fuzzy ones, for disciplines, I find myself, like Jeri, wanting to understand the statistics behind the algorithms and, to some degree, the math behind the statistics. And, like others, I have found myself pulling pieces from here and there, for instance:
- Stefan Gries' "Introduction" to his Quantitative Corpus Linguistics with R
- Bird, Klein, and Loper's Natural Language Processing with Python
To name but two examples, and that doesn't include a wide range of programming books, PDFs, and math and statistics courses I have sampled, just trying to begin to "think like a programmer" as one book titled itself.

I would love to see a more coherent reading list assembled that also had some sense of sequence or "if you want to know x, then read y" in its organization.
Posted 5 years ago Permalink
