I’ve been using CATMA (http://www.catma.de/) to mark up a text with some analytical tags I’ve created. I then exported the file as TEI, and I’m now trying to extract the data I’ve marked up in order to measure tag frequencies, but I’m finding it quite difficult.
Rather than tagging text with the labels I’ve created, CATMA has established a somewhat complicated (though likely necessary) system of identifiers. So, for example, I’ve tagged the word “clouds” in my text with the tag “weather,” which is a child of the tagset “non-living.”
CATMA represents the tag in the text like this:
<text>
<body>
<ab type="catma">
Small feckless <seg ana="#CATMA_0036983F-4D37-48C2-8BC7-5846A8364D26">clouds</seg> were hurried across the vast untroubled sky...
</ab>
</body>
</text>
The identifier then points to this feature structure (<fs>) after the body of the text:
<text>
<body>
</body>
<fs xml:id="CATMA_0036983F-4D37-48C2-8BC7-5846A8364D26" type="CATMA_3CDE1FE4-CA5D-4460-9BFF-739537D753DE">
<f name="catma_displaycolor">
<string>-16710765</string>
</f>
<f name="catma_markupauthor">
<string>name@email</string>
</f>
</fs>
</text>
The type attribute of the <fs> then points back up to the feature structure declaration (<fsDecl>) in the header:
<teiHeader>
<encodingDesc>
<fsDecl xml:id="CATMA_3CDE1FE4-CA5D-4460-9BFF-739537D753DE" n="2014-12-16T13:30:36.000+0000" type="CATMA_3CDE1FE4-CA5D-4460-9BFF-739537D753DE">
<fsDescr>Weather</fsDescr>
<fDecl xml:id="CATMA_699BAC76-8D15-408E-A30A-984849115A71" name="catma_displaycolor">
<vRange>
<vColl>
<string>-16710765</string>
</vColl>
</vRange>
</fDecl>
<fDecl xml:id="CATMA_8653855B-B611-48E8-AE9D-00E0160A37DB" name="catma_markupauthor">
<vRange>
<vColl>
<string>name@email</string>
</vColl>
</vRange>
</fDecl>
</fsDecl>
</encodingDesc>
</teiHeader>
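For concreteness, the lookup chain described above (seg → fs → fsDecl → fsDescr) could be followed in Python roughly like this. It is only a sketch under stated assumptions: the three excerpts are stitched into one document under a hypothetical <TEI> root, the helper names (local, tagged_segments) are invented for illustration, and element names are matched by local name so it should not matter whether the real export declares the TEI namespace. The tagset is left out because the excerpts don’t show where it is recorded.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Assumption: the excerpts combined into one file under a hypothetical <TEI> root.
SAMPLE = """<TEI>
<teiHeader>
<encodingDesc>
<fsDecl xml:id="CATMA_3CDE1FE4-CA5D-4460-9BFF-739537D753DE" type="CATMA_3CDE1FE4-CA5D-4460-9BFF-739537D753DE">
<fsDescr>Weather</fsDescr>
</fsDecl>
</encodingDesc>
</teiHeader>
<text>
<body>
<ab type="catma">Small feckless <seg ana="#CATMA_0036983F-4D37-48C2-8BC7-5846A8364D26">clouds</seg> were hurried across the vast untroubled sky...</ab>
</body>
<fs xml:id="CATMA_0036983F-4D37-48C2-8BC7-5846A8364D26" type="CATMA_3CDE1FE4-CA5D-4460-9BFF-739537D753DE"/>
</text>
</TEI>"""


def local(tag):
    """Return an element name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def tagged_segments(xml_text):
    """Return (tagged text, tag name) pairs by following seg -> fs -> fsDecl."""
    root = ET.fromstring(xml_text)
    decl_names = {}  # fsDecl xml:id -> text of its fsDescr child
    fs_types = {}    # fs xml:id -> value of its type attribute
    for el in root.iter():
        name = local(el.tag)
        if name == "fsDecl":
            descr = next((c for c in el if local(c.tag) == "fsDescr"), None)
            label = (descr.text or "").strip() if descr is not None else ""
            decl_names[el.get(XML_ID)] = label
        elif name == "fs":
            fs_types[el.get(XML_ID)] = el.get("type")

    rows = []
    for el in root.iter():
        if local(el.tag) == "seg" and el.get("ana"):
            text = "".join(el.itertext()).strip()
            # ana may hold several space-separated "#id" pointers.
            for ref in el.get("ana").split():
                fs_id = ref.lstrip("#")
                rows.append((text, decl_names.get(fs_types.get(fs_id), "")))
    return rows


print(tagged_segments(SAMPLE))
```

With the sample above this prints [('clouds', 'Weather')]. A segment carrying several tags would yield one row per tag.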
I need to extract the text and data, perhaps into a CSV file (or another output format, if that’s easier), so that the tagged text (e.g. “clouds”) is listed in one column, the first tag applied to it in the next column (e.g. “weather”), and the tagset or category to which that tag belongs in the third (e.g. “non-living”).
Or perhaps there’s a better way—really, what I’d like to be able to do is get the frequencies of each tag & tagset for each chapter. If there’s an easier way to mark up the text in TEI that would better allow for what I need, I’m open to re-encoding manually.
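As a rough illustration of the output described here, the three-column listing and the frequency counts might look like this in Python once the rows have been extracted. This is a sketch, not a tested solution for a real CATMA export: the function names are invented, and apart from the “clouds” row the sample rows are made up purely to give the counters something to count.

```python
import csv
import io
from collections import Counter

# (tagged text, tag, tagset). Only the first row comes from the example above;
# the other two are hypothetical placeholders.
ROWS = [
    ("clouds", "Weather", "non-living"),
    ("sky", "Weather", "non-living"),
    ("sparrow", "Bird", "living"),
]


def rows_to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["text", "tag", "tagset"])
    writer.writerows(rows)
    return buf.getvalue()


def frequencies(rows):
    """Count how often each tag and each tagset occurs in the rows."""
    tag_counts = Counter(tag for _, tag, _ in rows)
    tagset_counts = Counter(tagset for _, _, tagset in rows)
    return tag_counts, tagset_counts


print(rows_to_csv(ROWS))
tag_counts, tagset_counts = frequencies(ROWS)
print(tag_counts.most_common())
print(tagset_counts.most_common())
```

Per-chapter counts would need an extra chapter value in each row and a counter keyed on (chapter, tag). How to derive the chapter depends on how chapters are marked in the export, which the excerpts above don’t show.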
I’ve also tried playing around a bit with some XSLT and a Python script (http://www.rdegges.com/quickly-extract-xml-data-with-python/), but with very little experience in either, I quickly find myself out of my depth. I’m open to suggestions, and thanks in advance for your help!