Methods to process a plain text collection into a bimodal graph...

(8 posts) (2 voices)

Asked 3 years ago by johnlaudun
Latest answer from Scott Weingart

johnlaudun
Member
I am starting a new project wherein I want to take a closer look at narratives both in terms of topics as well as in terms of morphologies. For the moment, I have narrowed my initial exploration to two collections of treasure legends, one drawn from oral sources (and transcribed by folklorists) and one drawn from materials found on on-line forums. Both collections are, purposefully as I take baby steps here, small: each is only sixteen texts. The texts range in size from 153 to 1025 words for the oral collection and from 155 to 3081 words for the web collection.

I am looking to create a bimodal graph for each of the collections which represents the relationship between the words used in each text, such that I can examine the relationship either between texts, based on words in common, or between words, based on texts in common.

What I need, I think, is a Python script or some other kind of code/application which will work through each collection of plain text files and generate a CSV or network file of some kind that will let me then work in Gephi or Sci2. I would especially like it if the script would allow me to feed in a stopword list of my choosing.
Posted 3 years ago Permalink
Scott Weingart
Member

Replying to @johnlaudun's post:
This can actually be done within Sci2 itself, once you have a csv with titles in one column and text in the second. Once that's loaded into Sci2, go to Preprocessing->Topical->Stopword Text (you can select your own stopword list from that dialog).

Then go to Data Preparation->Extract Directed Network, first column title, second column text, hit okay, and voila! You can then make that into texts based on words in common or words based on texts in common using either Data Prep->Extract Co-Citation Network, or Extract Bibliographic Coupling Network (depending on your need), and you've got what you're looking for.

Posted 3 years ago Permalink
Scott Weingart
Member

Replying to @johnlaudun's post:
Also, on a side note, if you do the first normalization step and then "Extract Word Co-Occurrence" network, you can also get to words that co-occur frequently, as explained here: http://wiki.cns.iu.edu/pages/viewpage.action?pageId=2200066#id-514StudyingFourMajorNetSciResearchersISIData-5145WordCo-OccurrenceNetwork5145WordCo-OccurrenceNetwork

Posted 3 years ago Permalink
johnlaudun
Member

Replying to @Scott Weingart's post:

Hey, Scott, thanks for the replies. I will try them out, but first I have to get to the CSV of the bimodal network! Any ideas on how to get there from a collection of plain texts? It's this bit that has stumped me on a couple of occasions.

Posted 3 years ago Permalink
johnlaudun
Member
Replying to @johnlaudun's post:

What I'm working on is a script that goes from scraping out word frequencies to changing the csv to simply be the name-of-the-text, word, like this:
```
    text1, word1
    text1, word2
    text1, word3
```
I can then join those CSVs, to produce a CSV that looks like so:
```
    t1, w1
    t1, w2
    t1, w3
    t2, w1
    t2, w3
    t3, w2
    t3, w3
    t3, w4

    and so on ...
```
I'm going to try that out in Sci2.
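A minimal Python sketch of that text-to-word edge list, for anyone following along. It assumes the texts sit in one folder as UTF-8 `.txt` files; the function names, the `TEXT`/`WORD` header, and the crude letters-only tokenizer are illustrative choices, not anything Sci2 prescribes.

```python
import csv
import re
from pathlib import Path


def read_collection(folder):
    """Map each .txt file's name (without extension) to its contents."""
    return {p.stem: p.read_text(encoding="utf-8")
            for p in sorted(Path(folder).glob("*.txt"))}


def edge_rows(texts, stopwords=frozenset()):
    """Return sorted, de-duplicated (text_name, word) pairs."""
    rows = set()
    for name, body in texts.items():
        # Crude tokenizer: lowercase, runs of ASCII letters only.
        for word in re.findall(r"[a-z]+", body.lower()):
            if word not in stopwords:
                rows.add((name, word))
    return sorted(rows)


def write_edges(rows, out_path):
    """Write the pairs as a two-column CSV with a header row."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["TEXT", "WORD"])  # hypothetical column names
        writer.writerows(rows)
```

A stopword list of your own choosing can be loaded with something like `set(Path("stopwords.txt").read_text(encoding="utf-8").split())` (the file name is hypothetical) and passed as `stopwords`.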
Posted 3 years ago Permalink
Scott Weingart
Member
Replying to @johnlaudun's post:
I think maybe I didn't explain what I meant well enough the first time, sorry! If you have a csv that looks like
```
title1, fulltext1
title2, fulltext2
title3, fulltext3
```
Doing the steps I described above will get you
```
t1, w1
t1, w2
t1, w3
t2, w1
t2, w3
t3, w2
t3, w3
t3, w4
```
without any additional scripting. Hope it helps!
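For anyone starting from plain texts, a hedged sketch of how the title/full-text CSV itself might be produced in Python. It assumes the texts are already loaded into a dict of title to text; `fulltext_csv` and the `TITLE`/`TEXT` header names are made up for illustration. A header row is included because Sci2 turns out to expect one, as noted further down the thread.

```python
import csv
import io


def fulltext_csv(texts, header=("TITLE", "TEXT")):
    """Return CSV text with one row per document: title, full text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    for title, body in texts.items():
        # Collapse line breaks and repeated spaces so each text stays on one line.
        writer.writerow([title, " ".join(body.split())])
    return buf.getvalue()
```

The `csv` module quotes any field that contains a comma, so punctuation in the full text does not break the two-column layout. The dict can be built from a folder with `pathlib.Path.read_text` on each file.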
Posted 3 years ago Permalink
johnlaudun
Member
Actually you said it quite well the first time:

once you have a csv with titles in one column and text in the second

But apparently my reading comprehension is not up to par!

Here's a brief summary of travails:
- Sci2 expects CSVs to have a header row. This is important. Without one, it treats the first row of data as the header, so that row is not processed and the outputs end up awkwardly named. (This took me a surprisingly long time to puzzle out.)
- I was glad to figure out the format of the file, because it also means I could potentially do my own stopwording, lemmatizing, etc. The menu item here is a wonderful one-stop shop, but I don't know that I always want everything it brings to the table.
- I got the directed network, no problem, but a visualization (using GUESS because I couldn't get the Gephi hand-off to work) reveals that the projection isn't really what one would expect:
<img src="http://farm8.staticflickr.com/7054/8691631906_8554b3905e_z.jpg" width="640" height="424" alt="Screen Shot 2013-04-28 at 18.04.38">

As I note on Flickr -- if this image makes it through -- what should have been either a word-to-word projection or a text-to-text projection based on a text-to-word CSV is in fact a rather large visualization in which the sixteen texts sit in a tight ball in the middle of an Oort cloud of words.

To be sure, I grabbed a screen capture of the Sci2 Data Manager pane, which does a lovely job of capturing at a glance what I had done:

<img src="http://farm8.staticflickr.com/7055/8690511907_c97c81dcc9_z.jpg" width="501" height="150" alt="Screen Shot 2013-04-28 at 18.12.46">

The CSV should work:
```
TEXT,WORDS
ancelet-88,in|charenton|north|o
ancelet-89,went|meet|old|man|ma
ancelet-90,mom|said|use|dig|lot
ancelet-91,know|jess|venabl|fat
laudun-1,yeah|s|like|talk|shit|
laudun-2,like|said|famili|weird
```
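A file in this layout (header row, then a text name and its pipe-delimited words) could be generated with a short Python sketch like the one below. The function names are hypothetical, the tokenizer is a crude letters-only one, and no stemming is done, so the output would not match the stemmed words shown above.

```python
import csv
import io
import re


def pipe_rows(texts, stopwords=frozenset()):
    """Build [name, 'w1|w2|...'] rows, keeping word order and repeats."""
    rows = []
    for name, body in texts.items():
        words = [w for w in re.findall(r"[a-z]+", body.lower())
                 if w not in stopwords]
        rows.append([name, "|".join(words)])
    return rows


def to_csv(rows, header=("TEXT", "WORDS")):
    """Serialize the rows as CSV text, header row first."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()
```

Doing the stopwording here rather than in the Sci2 menu leaves room to swap in a different word list or add a lemmatizer before the join.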
I'm going to keep playing with the various options to see what else might work. It's been a rough day -- also trying to get an 8 year old to step through a book presentation. That'll drive you nuts, right there.
Posted 3 years ago Permalink
Scott Weingart
Member

Replying to @johnlaudun's post:
Ah, sorry that I neglected to mention the header row! What you're seeing in the first picture is actually what you're looking for (a document-document network based on shared words), but the word nodes were not removed. Preprocessing->Delete Isolates can do that. Also, if that was the bibliographic coupling, then the other algorithm will give you the word-word network based on shared documents. As for getting the graph into Gephi, try saving the network out as GraphML, renaming the file from .xml to .graphml, and then loading it into Gephi.

In any case, because there are likely many words per document, you might do well to threshold above a certain value (filtering in Gephi, or Preprocessing->Extract Edges Above or Below Value in Sci2) to make the network a bit more sparse.
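The two projections and the thresholding idea can also be sketched outside Sci2 in plain Python. This is only an illustration, not Sci2's implementation: `project` and `threshold` are made-up names, and the edge weight is assumed to be the number of shared neighbours (shared words for a text-text edge, shared texts for a word-word edge).

```python
from collections import defaultdict
from itertools import combinations


def project(edges, onto="texts"):
    """Project (text, word) pairs onto one mode of the bimodal graph.

    Returns {(a, b): weight}, where weight counts shared neighbours.
    """
    neighbours = defaultdict(set)
    for text, word in edges:
        if onto == "texts":
            neighbours[word].add(text)   # group texts by the word they share
        else:
            neighbours[text].add(word)   # group words by the text they share
    weights = defaultdict(int)
    for group in neighbours.values():
        for a, b in combinations(sorted(group), 2):
            weights[(a, b)] += 1
    return dict(weights)


def threshold(weights, minimum):
    """Keep only edges whose weight is at least `minimum`."""
    return {pair: w for pair, w in weights.items() if w >= minimum}
```

With the toy edge list from earlier in the thread (t1 through t3, w1 through w4), `project(edges, "texts")` gives the t1-t2 edge a weight of 2, since those texts share w1 and w3, and `threshold(..., 2)` then drops the weaker t2-t3 edge.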

Posted 3 years ago Permalink
