I have a student who has a corpus of materials, and we'd like to plumb its depths with some data mining. We think that topic modeling, as exemplified by 'Mining the Dispatch' (http://bit.ly/hv4I6E), might be one way to go. I'd like the student to run with this herself. What are some good introductory tutorials for data mining more generally, and topic modeling specifically? What about tools? Preferably, we'd like to use tools that don't involve any command-line interaction. Thanks everyone!
What is the easiest tool for topic modelling?
(11 posts) (7 voices)-
Posted 7 years ago Permalink
A caution on NVivo: It does NOT do well with large data sets/files, no matter how much power it has available to it. If you're working with lots of small files, it can be very useful. But throw 750 pages of text at it and it grinds to a halt.
Still, it has a 30 day free trial and might be worth playing around with.
Posted 7 years ago Permalink -
"easy" and "topic-modeling" are not often used in the same sentence:-) So as an alternative to "easy" let me suggest "best." In that case I highly recommend MALLET from umass (http://mallet.cs.umass.edu/). There are some other packages, including a few for R, but Mallet is, from my experience, the standard by which others are judged.
Posted 7 years ago Permalink -
Ha! Isn't that the truth... I've been playing with Mallet; or rather, I've been growing increasingly frustrated with it. I've set the environment variables, got Java all hunky dory... but I get the following: http://i.imgur.com/xPuox.png
"exception in thread"... any ideas how I should tackle this?
Posted 7 years ago Permalink -
I was going to suggest MALLET before realizing, as Matt did, that the question was looking for 'easy'. Unfortunately, topic modelling does seem to have some overhead.
Have you considered PLSA instead of LDA-based topic modelling? I've had success with the Lemur toolkit's implementation, which may be easier but would still require some command-line hammering.
The exception you got, by the way, possibly means that the Java classpath isn't set properly. Not sure, but worth a try... http://download.oracle.com/javase/1.3/docs/tooldocs/win32/classpath.html
Posted 7 years ago Permalink -
Replying to @organisciak@gmail.com's post:
Hmm. Wasn't familiar with Lemur. Thanks! Will give that an explore. It's also been suggested to me that Weka http://www.cs.waikato.ac.nz/ml/weka/ or Orange http://orange.biolab.si/ might be suitable.
Thank you for the link to the classpath info. I'll see where that leads me... ;)
Posted 7 years ago Permalink -
"Mining the Dispatch" is my project. It uses MALLET, which I cannot speak more highly of. Shawn, I've written some scripts to dump topic and topic proportion data output by MALLET into a relational database. Unless your student has some experience with relational databases that, of course, isn't easy either, but it's necessary to do any meaningful work with the topic modeling data. I'd be happy to share those with you and your student if that would be helpful: rnelson2@richmond.edu.
Posted 7 years ago Permalink -
Hi everyone,
Just thought I'd share the following link: https://dhs.stanford.edu/spatial-humanities/the-digital-humanities-as-topic-network/ as an interesting application of topic modeling.
I've also discovered what the issue with MALLET was: it has to be installed directly off C:\ (thus, C:\Mallet), and an environment variable called MALLET_HOME has to be created with the value C:\Mallet\ (environment variables are reached via Control Panel >> System and Security >> System >> Advanced system settings >> Environment Variables). And Java has to be installed too, natch.
And then you're good to go!
My thanks to Elijah Meeks and Rob Nelson.
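A small sanity check for the setup described in this post might look like the following. This is only a sketch: it checks that MALLET_HOME is set and points at an existing folder, and that a java executable can be found on the PATH. It does not verify the MALLET installation itself, and the function name check_mallet_setup is invented for this example.

```python
import os
import shutil
from pathlib import Path


def check_mallet_setup(env=None):
    """Return a list of problems found; an empty list means the basics look fine."""
    env = os.environ if env is None else env
    problems = []

    home = env.get("MALLET_HOME")
    if not home:
        problems.append("MALLET_HOME is not set")
    elif not Path(home).is_dir():
        problems.append(f"MALLET_HOME points to a missing folder: {home}")

    # shutil.which searches the PATH of the current process for 'java'.
    if shutil.which("java") is None:
        problems.append("java was not found on PATH")

    return problems


if __name__ == "__main__":
    for message in check_mallet_setup() or ["No obvious problems found"]:
        print(message)
```

Run it from a freshly opened terminal or IDE session, since a newly created environment variable is usually not visible to programs that were already running when it was set.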
Posted 7 years ago Permalink -
Hi!
I understand that this thread is rather old, but I wanted to share my positive experience with a Python library called Gensim (http://radimrehurek.com/gensim/) for topic modelling.
For those of you who already work in a Python environment, this would probably be the easiest way to try something out with topic modelling. However, I must admit that I don't know much about the possible differences between the LDA implementations in MALLET and Gensim.
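To give a flavour of what trying Gensim involves, here is a minimal sketch of training an LDA model on a toy corpus. The documents, the tiny stopword list, and the parameter values (two topics, ten passes) are placeholders chosen for illustration, not recommendations. A real corpus would need more careful preprocessing and many more documents. Gensim is imported inside train_lda so that the tokenizer can be tried without it installed.

```python
import re

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "for", "was", "on"}


def tokenize(text):
    """Lowercase the text, keep alphabetic words, drop stopwords and very short words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS and len(w) > 2]


def train_lda(documents, num_topics=2):
    """Train a small LDA model with Gensim (requires 'pip install gensim')."""
    from gensim import corpora, models

    texts = [tokenize(doc) for doc in documents]
    dictionary = corpora.Dictionary(texts)                # word <-> id mapping
    corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors
    return models.LdaModel(
        corpus,
        num_topics=num_topics,
        id2word=dictionary,
        passes=10,
        random_state=0,  # fixed seed so repeated runs are comparable
    )


def demo():
    """Train on four invented sentences and print the top words per topic."""
    docs = [
        "The price of cotton and tobacco rose in the market",
        "Cotton and tobacco prices fell on the market",
        "The regiment marched to the battle in the valley",
        "Soldiers of the regiment fought in the battle",
    ]
    lda = train_lda(docs)
    for topic_id, words in lda.print_topics(num_words=5):
        print(topic_id, words)
```

Calling demo() prints each topic as a weighted list of words. With a corpus this small the topics will not be very meaningful, but the same steps apply to a larger collection.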
Posted 6 years ago Permalink