I'm looking for a corpus of literary criticism that would allow me to do some text mining. Anyone have any suggestions? I found OTA, but it doesn't seem big enough...
I'm not sure I put the right "category" for this question...
Are you interested in contemporary lit. criticism or older stuff?
If older stuff, I know many folks turn to Project Gutenberg to get a whole lot of text data quickly; while their text data is totally unstructured and devoid of metadata, it is relatively easy to preprocess and start mining. They don't categorize things in a way that makes it easy to identify literary criticism, but search "Essays" in the title field and I think you'll already find some relevant texts.
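To give a sense of what "relatively easy to preprocess" can look like, here is a minimal Python sketch. It assumes the file uses the "*** START OF ..." and "*** END OF ..." marker lines that many Project Gutenberg plain-text files carry; the exact wording varies between files, so treat the patterns (and the function names, which are just illustrative) as a starting point rather than something that will work on every text.

```python
import re

# Assumption: the text carries Gutenberg-style marker lines. Wording varies
# from file to file, so these patterns are deliberately loose.
START_RE = re.compile(
    r"^\*\*\* ?START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the start and end markers, if present."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0       # no marker: keep from the top
    finish = end.start() if end else len(text)  # no marker: keep to the end
    return text[begin:finish].strip()


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())
```

Calling tokenize(strip_gutenberg_boilerplate(raw_text)) on a downloaded file gives you a plain list of words to count or feed into whatever mining tool you prefer.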
The other alternative, perhaps, would be trying to scrape the web for literary criticism from sites with freely available content (both the London Review of Books and the New York Review of Books, for instance, have plenty of freely available content).
If it is contemporary academic criticism you're interested in, you might look at JSTOR's Data For Research. It is an interesting way to search/explore/mine JSTOR's database, but it won't let you really "mine" their data. It would be great if JSTOR or MUSE would let you have more direct access if mining that sort of corpus is what you're interested in--though I'm unsure of any way to actually do that.
You might look through the Directory of Open Access Journals for appropriate journals to spider and text-mine.
I'm not sure about all the specifics, but I know the INKE project has harvested the full texts available from many of the Open Journal Systems (OJS) journals. There's probably considerable overlap with the Directory of Open Access Journals. Anyway, you may want to check out http://pkp.sfu.ca/?q=harvester and/or contact Julie Meloni (@jcmeloni) for more details.
As Stefan suggests, you can get certain kinds of canned results from JSTOR's research service (for example, lists of bigrams from articles in specific issues of the journals they provide). The Data Types they offer are: Citations Only (all requests come with citations by default), Word Counts, Bigrams, Trigrams, Quadgrams, Key Terms, and References. Their output formats are XML or CSV. You haven't said much yet about what you want to do, so it's hard to know if these options would be useful. Assuming you're interested in what critics say about authors, author names would come up as part of the n-grams, but there's no guarantee that the other relevant words would fall within the same n-gram. However, if you asked for quadgrams, for example, you'd probably get them in overlapping order, so you might actually be able to find keywords in an arbitrary context. I might be inclined to ask for a sample. Using their faceted search, you could get to this set, which has a total of 9 articles in it:
* Subject:
    o British Studies
    o Language & Literature
* Has References:
    o Yes
* Journal:
    o Eighteenth-Century Studies
* Article Type:
    o Research article
Ask for quadgrams for these nine articles, and since it's not a big request, you might get it pretty quickly; once you have it, you can see if it can be made to work for your purposes. If not, you'll have a basis for making a special request to JSTOR for the full text of a specified set of articles (I still might adhere to their 1,000 items ceiling, unless you chose to ask for the whole run of a particular journal, which would probably be pretty easy for them to pull).
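To illustrate why overlapping quadgrams might be enough to recover context around an author's name, here is a small Python sketch. It builds the quadgrams from a token list itself rather than reading JSTOR's actual XML or CSV output, whose layout I'm not assuming anything about; the function names and the sample sentence are made up for the example.

```python
from collections import Counter


def quadgrams(tokens, n=4):
    """Return every overlapping window of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def context_words(grams, target):
    """Count the words that share a quadgram with the target word.

    Because the windows overlap, words closer to the target are counted
    more often, so the counts act as a rough proximity weight.
    """
    counts = Counter()
    for gram in grams:
        if target in gram:
            counts.update(word for word in gram if word != target)
    return counts


# Hypothetical sample sentence, lowercased and split on whitespace.
tokens = "the satire of swift is savage and precise".split()
grams = quadgrams(tokens)
nearby = context_words(grams, "swift")
```

With quadgrams, any word within three positions of the name ends up in at least one shared window, so "satire" and "savage" both show up in nearby while "precise", four words away, does not. Whether that reach is enough depends on what you're trying to find.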