Does anyone have any experience using topic modeling to analyze data from JSTOR's "Data for Research"?
DFR lets you request datasets based on queries of the JSTOR database. Full text of the JSTOR material, however, is not available. Instead, one can request keywords, various n-grams (bigrams, trigrams, or quadgrams), or word counts. Requesting the word counts, one gets a set of files: one file per article containing the word counts (CSV or XML), plus a manifest connecting filenames to complete(ish) citations. (Minor note: looking at the raw counts, it seems like these may be samples of the articles rather than full word counts for the whole article, though I'm not totally sure.)
My question: is there a way to get MALLET to take word counts as input rather than raw text? Since topic modeling treats texts/documents as bags of words, it should be able to work with the frequency counts as effectively as with raw text, right? I could write a script to reassemble texts in the proportions described by the word frequencies, but that seems so utterly absurd that I suspect (hope) I may be missing something.
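For what it's worth, the "reassemble" workaround is only a few lines. Here's a minimal sketch, assuming each count file is a two-column CSV of word,count rows (the actual DFR column layout and header may differ): it expands the counts into a bag-of-words pseudo-document that MALLET's `import-dir` could then ingest like any plain-text file.

```python
import csv
import io

def expand_counts(csv_text):
    """Expand word,count rows into a bag-of-words pseudo-document.

    Assumes a two-column CSV (word, count); adjust for the real
    DFR header/format as needed. Header or malformed rows are skipped.
    """
    words = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2 or not row[1].isdigit():
            continue  # skip header rows and anything malformed
        word, count = row[0], int(row[1])
        words.extend([word] * count)
    # Word order is irrelevant to a bag-of-words model,
    # so simply concatenating repetitions is fine.
    return " ".join(words)

# Tiny fake count file for illustration
sample = "topic,3\nmodel,2\n"
print(expand_counts(sample))  # -> topic topic topic model model
```

Since topic models ignore word order anyway, the resulting documents lose nothing relative to the original counts; it just feels wasteful to inflate counts back into tokens only so MALLET can count them again.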
Anyone have experience with this?