Does anyone have any experience using topic modeling to analyze data from JSTOR's "Data for Research"?
DFR lets you request datasets based on queries of the JSTOR database. Full text of the JSTOR material, however, is not available. Instead, one can request keywords, various n-grams (bigrams, trigrams, or quadgrams), or word counts. Requesting the word counts, one gets a set of files: one file per article containing the word counts (CSV or XML), plus a manifest connecting filenames to complete(ish) citations. (Minor note: looking at the raw counts, it seems like these may be samples of the articles rather than full word counts for the whole article, though I'm not totally sure.)
My question: is there a way to get MALLET to take word counts as input rather than raw text? Since topic modeling treats texts/documents as bags of words, it should be able to work with the frequency counts as effectively as with raw text, right? I could write a script to reassemble texts in the proportions described by the word frequencies, but that seems so utterly absurd that I suspect (hope) I may be missing something.
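For what it's worth, the "reassemble" workaround is only a few lines. Here's a minimal sketch, assuming each count file is a two-column CSV of word,count rows (the actual DFR column layout and header may differ): it expands the counts into a bag-of-words pseudo-document that MALLET's `import-dir` could then ingest like any plain-text file.

```python
import csv
import io

def expand_counts(csv_text):
    """Expand word,count rows into a bag-of-words pseudo-document.

    Assumes a two-column CSV (word, count); adjust for the real
    DFR header/format as needed. Header or malformed rows are skipped.
    """
    words = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2 or not row[1].isdigit():
            continue  # skip header rows and anything malformed
        word, count = row[0], int(row[1])
        words.extend([word] * count)
    # Word order is irrelevant to a bag-of-words model,
    # so simply concatenating repetitions is fine.
    return " ".join(words)

# Tiny fake count file for illustration
sample = "topic,3\nmodel,2\n"
print(expand_counts(sample))  # -> topic topic topic model model
```

Since topic models ignore word order anyway, the resulting documents lose nothing relative to the original counts; it just feels wasteful to inflate counts back into tokens only so MALLET can count them again.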
Anyone have experience with this?