At the end of THATCamp Prime's "Dork Shorts" this year, my fellow GMU History student Sasha Hoffman made a plea for help figuring out how to solve a problem that I have with my dissertation research as well.
Both of us are working on comics in historic newspapers-- Sasha's dealing with midcentury political cartoons, and I'm dealing with early 20th century daily comic strips. Both of us are looking for ways to gauge how the comics relate to the editorial bias in the papers' written editorial or news content.
Because it intersects with my own research, I've been mulling it over for a while. I think I have an idea for a workflow that might get some good data, but I'm not sure what tools to use, or how workable the idea is. This is where I'm looking for help from y'all.
Essentially, here's what I have come up with:
- After selecting the stuff you want, collect all the comics, and tag them. Tag them as much as physically and intellectually possible, identifying issues addressed, visual elements, people discussed, news items being referenced-- get as much data as possible. You never know whether a visual metaphor, or something else you aren't consciously catching, might rise and fall over time, so create the richest tagging taxonomy you can.
- Using the databases that have the relevant newspapers-- assuming they've been digitized-- scrape out all the relevant OCR'd text, along with data about paper, date, etc.
AND THEN, EITHER...
- Run some sort of ngram-based data-mining software (MALLET?) to look for patterns in the textual elements.
- Run some sort of data-mining software to look at the information about the cartoons based on the tags, along with information about cartoonist, date, paper, etc.
- Using some sort of data-mining software, compare the outputs of the last two pieces of data mining to see where patterns emerge between the two data sets.
OR...
- Find a piece of data mining software that can handle the two above types of data sets as one data set, resolve the data, and see what you get.
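To make the "compare the two data sets" step a bit more concrete, here is a rough sketch in Python of one very simple way it could go. Everything in it is an assumption for illustration: the sample cartoons and articles are made up, the function names are hypothetical, and plain word counts plus a Pearson correlation stand in for whatever a real text-mining tool would produce. It bins cartoon tags and article words by month, then checks whether a given tag and a given word rise and fall together.

```python
from collections import Counter, defaultdict
from math import sqrt
import re

# Hypothetical hand-tagged cartoons: (ISO date, paper, set of tags).
cartoons = [
    ("1912-01-05", "Daily Example", {"tariff", "elephant"}),
    ("1912-01-19", "Daily Example", {"tariff"}),
    ("1912-02-02", "Daily Example", {"suffrage"}),
    ("1912-03-08", "Daily Example", {"tariff", "suffrage"}),
]

# Hypothetical OCR'd article text: (ISO date, paper, text).
articles = [
    ("1912-01-06", "Daily Example", "The tariff debate continues; tariff bill stalls."),
    ("1912-02-03", "Daily Example", "Suffrage parade draws crowds."),
    ("1912-03-09", "Daily Example", "Tariff vote nears as suffrage leaders meet."),
]


def month(date):
    """Reduce an ISO date like 1912-01-05 to its year-month, 1912-01."""
    return date[:7]


def tag_counts_by_month(items):
    """Count how many cartoons carry each tag in each month."""
    counts = defaultdict(Counter)
    for date, _paper, tags in items:
        counts[month(date)].update(tags)
    return counts


def term_counts_by_month(items):
    """Count lowercase word occurrences in the article text for each month."""
    counts = defaultdict(Counter)
    for date, _paper, text in items:
        counts[month(date)].update(re.findall(r"[a-z]+", text.lower()))
    return counts


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; None if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        return None
    return cov / (dev_x * dev_y)


def compare(tag, term, cartoon_items, article_items):
    """Line up a cartoon tag and an article word month by month and correlate them."""
    tags = tag_counts_by_month(cartoon_items)
    terms = term_counts_by_month(article_items)
    months = sorted(set(tags) | set(terms))
    tag_series = [tags[m][tag] for m in months]
    term_series = [terms[m][term] for m in months]
    return months, tag_series, term_series, pearson(tag_series, term_series)


months, tag_series, term_series, r = compare("tariff", "tariff", cartoons, articles)
print(months, tag_series, term_series, r)
```

With a real corpus you would want far more months than this, and raw counts would probably need normalizing by how much text each paper printed, but the shape of the data (tags and words both keyed by date and paper) is the part that would carry over to a heavier tool.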
I'm (clearly) very new to text mining, but I'm very interested in learning it, and if I could do so in the course of my dissertation research, it would only be a bonus.
That said, I'm really not sure how well the above would work, how realistic a workflow it is, or what tools I might want to try to use for such an approach. This is where I'm hoping you all can give me some advice.