How to extract tagged data and text from TEI file? « Digital Humanities Questions & Answers


(6 posts) (4 voices)

Asked 2 years ago by aliciapeaker@gmail.com
Latest answer from petris.it@googlemail.com
This question has a best answer.

aliciapeaker@gmail.com
Member
I’ve been using CATMA (http://www.catma.de/) to mark up a text with some analytical tags I’ve created. I then exported the file as TEI, and I’m now trying to extract the data I’ve marked up in order to measure tag frequencies, but I am finding it quite difficult.

Rather than tagging text with the labels I’ve created, CATMA has established a somewhat complicated (though likely necessary) system of identifiers. So, for example, I’ve tagged the word “clouds” in my text with the tag “weather,” which is a child of the tagset “non-living.”

CATMA represents the tag in the text like this:
<text>
<body>
<ab type="catma">
Small feckless <seg ana="#CATMA_0036983F-4D37-48C2-8BC7-5846A8364D26">clouds</seg> were hurried across the vast untroubled sky...
</ab>
</body>
</text>

The identifier then points to this feature statement after the body of the text:

<text>
<body>
</body>
<fs xml:id="CATMA_0036983F-4D37-48C2-8BC7-5846A8364D26" type="CATMA_3CDE1FE4-CA5D-4460-9BFF-739537D753DE">
<f name="catma_displaycolor">
<string>-16710765</string>
</f>
<f name="catma_markupauthor">
<string>name@email</string>
</f>
</fs>
</text>

The id for the type of the fs then points back up to the feature statement declaration in the header:

<teiHeader>
<encodingDesc>
<fsDecl xml:id="CATMA_3CDE1FE4-CA5D-4460-9BFF-739537D753DE" n="2014-12-16T13:30:36.000+0000" type="CATMA_3CDE1FE4-CA5D-4460-9BFF-739537D753DE">
<fsDescr>Weather</fsDescr>
<fDecl xml:id="CATMA_699BAC76-8D15-408E-A30A-984849115A71" name="catma_displaycolor">
<vRange>
<vColl>
<string>-16710765</string>
</vColl>
</vRange>
</fDecl>
<fDecl xml:id="CATMA_8653855B-B611-48E8-AE9D-00E0160A37DB" name="catma_markupauthor">
<vRange>
<vColl>
<string>name@email</string>
</vColl>
</vRange>
</fDecl>
</fsDecl>
</encodingDesc>
</teiHeader>

I need to extract the text and data, perhaps into a CSV file (or another output format, if that’s easier), with the tagged text (e.g. “clouds”) in one column, the first tag applied to it in the next column (e.g. “weather”), and the tagset or category to which that tag belongs in the third (e.g. “non-living”).
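The lookup chain described above (seg, then fs, then fsDecl) can be sketched in Python with the standard library. This is only an illustration under assumptions: the shortened ids (CATMA_A1, CATMA_T1) and the wrapping TEI root element are made up for the sample, the helper names are hypothetical, and the tagset column is left out because the excerpt does not show how CATMA records which tagset a tag belongs to.

```python
import csv
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under this expanded name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, shortened stand-in for a CATMA export (ids are made up).
SAMPLE = """<TEI>
<teiHeader><encodingDesc>
<fsDecl xml:id="CATMA_T1" type="CATMA_T1"><fsDescr>Weather</fsDescr></fsDecl>
</encodingDesc></teiHeader>
<text><body>
<ab type="catma">Small feckless <seg ana="#CATMA_A1">clouds</seg> were hurried</ab>
</body>
<fs xml:id="CATMA_A1" type="CATMA_T1"/>
</text></TEI>"""


def local(tag):
    """Drop a namespace prefix so this works with or without the TEI namespace."""
    return tag.rsplit("}", 1)[-1]


def tagged_rows(root):
    """Return (tagged text, tag name) pairs by following seg -> fs -> fsDecl."""
    tag_names = {}  # fsDecl id -> text of its fsDescr
    fs_types = {}   # fs id -> id of the fsDecl it points to
    for el in root.iter():
        if local(el.tag) == "fsDecl":
            descr = next((c.text for c in el if local(c.tag) == "fsDescr"), "")
            tag_names[el.get(XML_ID)] = descr or ""
        elif local(el.tag) == "fs":
            fs_types[el.get(XML_ID)] = el.get("type")

    rows = []
    for el in root.iter():
        if local(el.tag) == "seg" and el.get("ana"):
            text = "".join(el.itertext()).strip()
            # @ana may hold several space-separated "#id" references.
            for ref in el.get("ana").split():
                fs_id = ref.lstrip("#")
                rows.append((text, tag_names.get(fs_types.get(fs_id), "")))
    return rows


def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["text", "tag"])
    writer.writerows(rows)
    return buf.getvalue()


print(to_csv(tagged_rows(ET.fromstring(SAMPLE))))
```

For a real export, ET.parse(path).getroot() would replace ET.fromstring(SAMPLE).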

Or perhaps there’s a better way—really, what I’d like to be able to do is get the frequencies of each tag & tagset for each chapter. If there’s an easier way to mark up the text in TEI that would better allow for what I need, I’m open to re-encoding manually.

I’ve also tried playing around a bit with some XSLT and a Python script (http://www.rdegges.com/quickly-extract-xml-data-with-python/), but with very little experience with either, I quickly find myself out of my depth. Open to suggestions, and thanks in advance for your help!
Posted 2 years ago Permalink
Ethan Gruber
Member

To clarify, is there also an fsDecl for the "non-living" category which contains fDecls for "weather" and other tags? It's doable in XSLT. I don't think you need to count the segs in the XSLT because your TEI (presumably) contains an <fs> for every annotation you've created in your body.

You'll need to iterate through every fsDecl and count every fs elsewhere in the document that has a @type equal to the @xml:id of that fsDecl. You'd have to tweak this somewhat to include counts for the whole tagset and to produce the counts per chapter instead of overall. Without seeing more, it's difficult to construct XPath to handle the document chapter by chapter. See this gist for a basic bit of XSLT: https://gist.github.com/ewg118/6b0b99d953ae1f4d8eaf
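For readers more comfortable with Python than XSLT, the same counting idea can be sketched as below. This is not the XSLT from the gist, only a rough equivalent under assumptions: the sample ids are invented, the function name tag_counts is hypothetical, and it relies on there being exactly one fs per annotation, as presumed above.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under this expanded name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical miniature document: two "Weather" annotations, no "Animal" ones.
SAMPLE = """<TEI>
<teiHeader><encodingDesc>
<fsDecl xml:id="CATMA_T1" type="CATMA_T1"><fsDescr>Weather</fsDescr></fsDecl>
<fsDecl xml:id="CATMA_T2" type="CATMA_T2"><fsDescr>Animal</fsDescr></fsDecl>
</encodingDesc></teiHeader>
<text><body/>
<fs xml:id="CATMA_A1" type="CATMA_T1"/>
<fs xml:id="CATMA_A2" type="CATMA_T1"/>
</text></TEI>"""


def local(tag):
    """Drop a namespace prefix so this works with or without the TEI namespace."""
    return tag.rsplit("}", 1)[-1]


def tag_counts(root):
    """For every fsDecl, count the fs elements whose @type equals its @xml:id."""
    per_type = Counter(
        el.get("type") for el in root.iter() if local(el.tag) == "fs"
    )
    counts = {}
    for decl in root.iter():
        if local(decl.tag) != "fsDecl":
            continue
        decl_id = decl.get(XML_ID)
        descr = next((c.text for c in decl if local(c.tag) == "fsDescr"), None)
        # Fall back to the raw id when no fsDescr label is present.
        counts[descr or decl_id] = per_type[decl_id]
    return counts


print(tag_counts(ET.fromstring(SAMPLE)))
```

Counting per chapter would additionally require knowing which chapter each annotation falls in, which this sketch does not attempt.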

Posted 2 years ago Permalink
Ondine
Member

I can't pretend to be the most knowledgeable person about XML and especially not about querying it for data analysis purposes. But I have used the TEI for markup and delivery a good bit, generally in oXygen and generally with a fairly constrained set of the TEI P5 tags.

From that, I can tell you that I rarely encounter anything as complicated as the markup CATMA is giving you. It seems far more complex than XML for most humanities encoding purposes would need to be, esp given that part of the point of XML, and esp TEI, is that it is human readable. Of course, for some of the more complex content analysis goals that some DHers are pursuing with enormous corpora of humanities texts, this kind of markup may be necessary.

But based on what I *think* you're trying to do, the complexity here might be unnecessarily mystifying your markup of your content. If you simply need to measure the frequency of the presence of specific tags that appear in the text, based on--I assume--your own criteria for how those tags should be applied, then it may be that a straightforward TEI document in a transparent editor (oXygen would be my choice) would give you far more control.

Simply counting the number of uses of a particular tag could be done in oXygen using an XPath query, which you can refine according to attributes, hierarchy, position, etc.
The XPath wouldn't generate a new product from your XML, but it would give you a results list (plain text) with a count, showing where all the instances are.
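Outside oXygen, the same kind of XPath count can be tried with Python's standard library, which supports a limited XPath subset. The sketch below is an illustration only: it assumes the export declares the standard TEI namespace (http://www.tei-c.org/ns/1.0), and the sample ids and the helper name count_matches are made up.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map used in the XPath expressions below.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample: two annotated segs and one seg without @ana.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<ab type="catma">Small feckless <seg ana="#CATMA_A1">clouds</seg> and
<seg ana="#CATMA_A2">rain</seg> over the <seg>sky</seg></ab>
</body></text></TEI>"""


def count_matches(root, xpath):
    """Count the elements selected by an ElementTree-style XPath expression."""
    return len(root.findall(xpath, NS))


root = ET.fromstring(SAMPLE)

# Every seg that carries an @ana attribute, i.e. every annotated span.
all_tagged = count_matches(root, ".//tei:seg[@ana]")

# Only the segs pointing at one particular annotation id.
one_annotation = count_matches(root, ".//tei:seg[@ana='#CATMA_A1']")

print(all_tagged, one_annotation)
```

ElementTree has no count() function of its own, so the count is taken from the length of the result list; a full XPath engine such as the one in oXygen can evaluate count(//tei:seg[@ana]) directly.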

If you want a new product, you can use XSLT to generate a new XML document that retains just the elements you want and/or that adds sequential numbers to them, again based on attributes, hierarchy, position, etc., as a way to select exactly what you want.

The CATMA document looks so complicated that I would expect it to be very difficult to parse with XSLT, but parsing a more straightforward TEI P5 document for a count of specific tags shouldn't be so difficult.

All that said, I don't use either tool often enough--and haven't recently enough--to be able to offer concrete direction. For that, I recommend going on the TEI discussion list, which you can sign up for here: http://www.tei-c.org/Support/#tei-l

Posted 2 years ago Permalink
petris.it@googlemail.com
Member
Best Answer

You could simply use the CATMA Analyzer to count and extract the tagged information.
Assuming you have loaded text and annotations into the Tagger:
Click on "Analyze Document"
Type: tag="%" into the query box and hit "Execute query"
Select the tab "Result by markup"
You'll see all tags with the frequency counts there.
You can also export the results to a CSV file for further processing.
So there is no need for painful XML XSLT hacking so far.

Posted 2 years ago Permalink
aliciapeaker@gmail.com
Member

Wonderful! Thank you all for your replies! I've used CATMA to return the tag frequencies and then exported a CSV file with the compiled results. This gives me everything I need except the location in the text of each tag, which would enable me to track frequencies by chapter (for which I have a list of CATMA locations). Is there a way I could search CATMA for tags within a set of location ranges to output a set of results for each chapter?

Posted 2 years ago Permalink
petris.it@googlemail.com
Member

If you have tagged the chapters, and if the number of chapters is not that large, you could query the results for every chapter like this:

tag="%" where tag="chapter1" boundary
this one assumes you have a tag for each chapter

tag="%" where tag="chapter" property="number" value="1" boundary
this one assumes you have only one chapter tag and a property that holds the chapter number

But you are right, there should be a way of extracting the positions of each tag. You could, as a workaround, extract the KWIC for each tag into its own CSV file, which gives you the positions of each instance that belongs to the tag you selected. If you add a tag column manually, you will then be able to merge the contents of the per-tag files into one file and get tags with positions.
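The merge step of that workaround could also be scripted instead of done by hand. The sketch below is hypothetical throughout: the column names (keyword, start, end), the presence of a header row, and the chapter ranges are assumptions for illustration, not the actual layout of CATMA's KWIC export, so the names would need adjusting to the real files.

```python
import csv
import io
from collections import Counter


def merge_tag_exports(exports):
    """Merge per-tag CSV texts into one list of rows, adding a 'tag' column.

    exports maps a tag name to the text of that tag's CSV export,
    which is assumed to start with a header row.
    """
    merged = []
    for tag, text in exports.items():
        for row in csv.DictReader(io.StringIO(text)):
            merged.append({"tag": tag, **row})
    return merged


def chapter_of(position, chapters):
    """Return the chapter whose [start, end) range contains position, or None."""
    for name, (start, end) in chapters.items():
        if start <= position < end:
            return name
    return None


# Made-up exports and chapter ranges, standing in for real files and locations.
exports = {
    "weather": "keyword,start,end\nclouds,15,21\nrain,120,124\n",
    "animal": "keyword,start,end\nfox,40,43\n",
}
chapters = {"chapter1": (0, 100), "chapter2": (100, 200)}

merged = merge_tag_exports(exports)

# Frequency of each tag within each chapter, keyed by (chapter, tag).
per_chapter = Counter(
    (chapter_of(int(row["start"]), chapters), row["tag"]) for row in merged
)
print(per_chapter)
```

With real exports, each file would be read with open(path, newline="") and its contents passed in under the tag name it belongs to.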

Posted 2 years ago Permalink
