Digital Humanities Questions & Answers » Topic: I think I have a workflow, help me figure out what tools...

http://digitalhumanities.org/answers/topic/i-think-i-have-a-workflow-help-me-figure-out-what-tools-1

kevin.s.hawkins on "I think I have a workflow, help me figure out what tools..." http://digitalhumanities.org/answers/topic/i-think-i-have-a-workflow-help-me-figure-out-what-tools-1#post-1705 Thu, 21 Jun 2012 02:19:28 +0000

While there is something to be said for open-ended exploration of data with ngram viewers and such, in this case I'm not sure that text mining will actually be helpful. But visualization software might. You say that both of you want to "gauge the relationship of the paper's editorial bias in writing with what was in the papers' editorial or news content." This means that, for example, you're curious whether cartoons with a leftish bent appear in newspapers that run editorials with a leftish bent as well. Right? Even if that's not it, let's pretend it is for the sake of argument. It appears that in any case you're looking for a correlation between *some dimension* of the document (an article or cartoon) and its source (the particular newspaper). If you wanted the computer to identify the correlation, you would use information retrieval (IR) techniques. You would manually tag a "training set" as leftish or rightish and then have a computer use IR techniques to classify the rest (assuming there's OCR text of all of them) as one or the other. But you'd want to verify a sample of the output by hand, and if the results aren't good, you would need to provide a larger training set and re-run the classification.
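To make the tag, classify, verify loop a bit more concrete, here is a minimal sketch in Python. It is only an illustration under loud assumptions: the post does not name a specific algorithm, so a small naive Bayes classifier is swapped in as one common choice; the class name, function names, and every example text and label below are invented for the sketch and stand in for real OCR text and real hand tags.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, labelled_docs):
        for text, label in labelled_docs:
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are ignored
                    score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, checked_sample):
    """Share of hand-checked documents where the model agrees with the human."""
    hits = sum(1 for text, label in checked_sample if model.classify(text) == label)
    return hits / len(checked_sample)


# Hypothetical hand-tagged training set (invented text, not real newspaper data).
training_set = [
    ("workers deserve unions and higher wages", "leftish"),
    ("strike for fair wages and workers rights", "leftish"),
    ("cut taxes and shrink government spending", "rightish"),
    ("lower taxes help business and free enterprise", "rightish"),
]

# Hypothetical sample of classifier output that a human then verifies by hand.
checked_sample = [
    ("unions demand higher wages", "leftish"),
    ("government should cut taxes", "rightish"),
]

model = NaiveBayes()
model.train(training_set)
print("agreement with hand-checked sample:", accuracy(model, checked_sample))
```

If the agreement on the hand-checked sample is poor, the loop described above applies: tag more documents, retrain, and check a fresh sample.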
I suspect that a computer won't be very good at this, since the left and right tend to talk about the same things and criticize each other, thereby using the same terms. And I suspect it's nearly impossible for cartoons, where the captions are quite brief. In that case, you'll need to rely on a human expert. In fact, in step 1, you already propose "tagging" the documents in various ways by hand, so it seems you were already open to doing this. You could use qualitative analysis software for the manual tagging. Such software is common in the social sciences for things like tagging transcriptions of interviews. Once you've tagged lots of documents (and carefully recorded the source newspaper for each), you can view your documents by source and see whether they come out more often in the leftish or rightish category. If there are multiple dimensions (not just leftish and rightish), you will need some sort of visualization software that can cluster similar documents in one of those three-dimensional tree diagrams. (I don't know this area, but others probably do.) Perhaps that software can use colors to distinguish the sources and different shapes (triangles and squares) for cartoons and articles.

Tad Suiter on "I think I have a workflow, help me figure out what tools..." http://digitalhumanities.org/answers/topic/i-think-i-have-a-workflow-help-me-figure-out-what-tools-1#post-1704 Wed, 20 Jun 2012 20:27:09 +0000

At the end of THATCamp Prime's "Dork Shorts" this year, my fellow GMU History student Sasha Hoffman made a plea for help figuring out how to solve a problem that I have with my dissertation research as well. Both of us are working on comics in historic newspapers: Sasha's dealing with midcentury political cartoons, and I'm dealing with early 20th century daily comic strips.
Both of us are looking for ways to gauge the relationship of the paper's editorial bias in writing with what was in the papers' editorial or news content. Because it intersects with my own research, I've been mulling it over for a while. I think I have an idea about a workflow that might get some good data, but I'm not sure about what tools to use, or how workable the idea is. This is where I'm looking for help from y'all. Essentially, here's what I have come up with: <ol> <li>After selecting the stuff you want, collect all the comics, and tag them. Tag them as much as physically and intellectually possible, identifying issues addressed, visual elements, people discussed, news items being referenced; get as much data as possible. You can't know in advance whether certain visual metaphors, or something else you're not catching, might rise and fall at different times, so create the richest tagging taxonomy you can.</li> <li>Using the databases that have the relevant newspapers (assuming they've been digitized), scrape out all the relevant OCR'd text, along with data about paper, date, etc.</li> </ol> AND THEN, EITHER... <ul> <li>Running some sort of ngram-based data-mining software (MALLET?), look for patterns in the textual elements.</li> <li>Running some sort of data-mining software, look at the information about the cartoons based on the tags, along with information about cartoonist, date, paper, etc.</li> <li>Using some sort of data-mining software, compare the outputs of the last two pieces of data mining to see where patterns emerge between the two data sets.</li> </ul> OR... <ul> <li>Find a piece of data-mining software that can handle the two types of data sets above as one data set, resolve the data, and see what you get.</li> </ul> I'm (clearly) very new to text mining, but I'm very interested in learning it, and if I could do so in the course of my dissertation research, it would only be a bonus.
That said, I'm really not sure how well the above would work, how realistic a workflow it is, or what tools I might want to try to use for such an approach. This is where I'm hoping you all can give me some advice.
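For the "tag by hand, record the paper, then compare by source" part of this thread, a rough sketch of the counting step might look like the following. Everything in it is an assumption made for illustration: the record layout (newspaper, document type, label), the paper names, the labels, and the helper names `crosstab` and `share` are all hypothetical, and real tags would come out of whatever tagging tool is used rather than a hard-coded list.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged records: (newspaper, document type, label).
records = [
    ("Paper A", "cartoon", "leftish"),
    ("Paper A", "cartoon", "leftish"),
    ("Paper A", "editorial", "leftish"),
    ("Paper A", "editorial", "rightish"),
    ("Paper B", "cartoon", "rightish"),
    ("Paper B", "cartoon", "leftish"),
    ("Paper B", "editorial", "rightish"),
    ("Paper B", "editorial", "rightish"),
]


def crosstab(records):
    """Count labels separately for each (newspaper, document type) pair."""
    table = defaultdict(Counter)
    for paper, doc_type, label in records:
        table[(paper, doc_type)][label] += 1
    return table


def share(table, paper, doc_type, label):
    """Fraction of a paper's documents of one type that carry the given label."""
    counts = table[(paper, doc_type)]
    total = sum(counts.values())
    return counts[label] / total if total else 0.0


table = crosstab(records)
for paper, doc_type in sorted(table):
    leftish = share(table, paper, doc_type, "leftish")
    print(f"{paper:8} {doc_type:10} leftish share: {leftish:.2f}")
```

Putting the cartoon and editorial shares for the same paper side by side is the simplest version of the comparison; with a richer tag taxonomy the same table just gets more columns, which is roughly where clustering or visualization tools would take over.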