How do I best convert hundreds of TEI P5 documents to plaintext?

(10 posts) (7 voices)

Asked 3 years ago by Arno Bosse
Latest answer from acrymble
This question has a best answer.

Arno Bosse
Member
I'd like to use the corpora available from the German Text Archive (http://www.deutschestextarchiv.de/download) to train OCR software. For this I need the texts as plaintext. However, all of the German Text Archive texts are tagged in TEI P5. How do I best convert these (hundreds of) documents into plaintext?

I'm comfortable on the command line and with small shell scripts but I wouldn't be able to write an app to make use of a public API to such a service. Ideally I'd like to find some tei2text-ish command line tool but the ones I've found in googling around and looking on GitHub don't appear (to me, leastways) to be suitable for TEI texts.
Posted 3 years ago Permalink
cforster
Member
Best Answer
Well, best is a tricky one. But if you're comfortable using XSLT, this is a pretty classic XSLT sort of problem. The default XSLT rules essentially output the text content of the nodes, so a very simple script could do what you're looking to do (I've done it many times before). Chances are, though, that you are going to want to strip out the teiHeader and just get the body of the text. You can write an empty rule to match the teiHeader, which will have the effect of silencing its output. A simple (though imperfect) way to do this would be an XSLT stylesheet something like the following:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:tei="http://www.tei-c.org/ns/1.0"
                xmlns="http://www.w3.org/1999/xhtml"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <xsl:output method="text" encoding="UTF-8" />
  <xsl:template match="tei:TEI">
    <xsl:apply-templates />
  </xsl:template>
  <xsl:template match="tei:teiHeader"></xsl:template>
</xsl:stylesheet>
If you're on OSX, the built-in XSLT processor (xsltproc) can handle this. Just save the above XSLT code into a file (call it strip.xsl) and then, at the command line: xsltproc strip.xsl [TEI FILE NAME]. By default this writes to standard output (i.e. the screen), but you can redirect it into a file: xsltproc strip.xsl [TEI FILE NAME] > stripped.txt. And at this point, if you can write a shell script, it should be possible to loop over all the files in a directory.
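The loop over a directory mentioned above might look something like the following minimal sketch. It assumes the stylesheet is saved as strip.xsl, that the TEI files end in .xml, and that xsltproc is on your PATH; the function name tei_to_text and the directory names are made up for illustration.

```shell
# Convert every TEI file in one directory to plain text with xsltproc.
# Assumptions: files end in .xml; the stylesheet defaults to strip.xsl.
tei_to_text() {
    src_dir=$1                   # directory holding the TEI files
    out_dir=$2                   # directory that receives the .txt files
    stylesheet=${3:-strip.xsl}   # stylesheet from the post above

    mkdir -p "$out_dir"
    for f in "$src_dir"/*.xml; do
        [ -e "$f" ] || continue  # the glob matched nothing
        name=$(basename "$f" .xml)
        xsltproc "$stylesheet" "$f" > "$out_dir/$name.txt"
    done
}
```

With the function defined, something like tei_to_text dta_xml dta_txt strip.xsl would write one .txt file per input file, leaving the originals untouched.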
Posted 3 years ago Permalink
Hugh Cayless
Member

Replying to @Arno Bosse's post:

@cforster is almost certainly right that you want an XSLT. It all depends on the level of encoding and whether you really want all of the text, laid out as it is in the document (maybe there are footnotes in the OCR for example that have been turned into inline notes in the TEI).

That said, if you really just want a dead-simple text extraction, you can do something like the following with xmllint:

xmllint --xpath "//*[local-name()='text']//text()" file.xml

That can be bundled up with a find command (or other method of iterating over the files) and the output redirected to files in order to do your batch processing.

Posted 3 years ago Permalink
Stéfan Sinclair
Admin
Replying to @Hugh Cayless's post:

I was going to suggest something similar to Hugh, though the text() syntax doesn't seem to work when I test, but this does:
```
xmllint --xpath "string(//*[local-name()='body'])" FILENAME.xml
```
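To run this over a whole directory tree with find, as suggested earlier in the thread, a rough sketch could look like the following. It assumes an xmllint build that supports --xpath and TEI files ending in .xml; the function name extract_all and the choice to write each .txt next to its source file are illustrative only.

```shell
# Extract the body text of every TEI file below a directory with xmllint.
# Assumption: each output file sits beside its source, with a .txt suffix.
extract_all() {
    src_dir=$1
    find "$src_dir" -type f -name '*.xml' | while IFS= read -r f; do
        xmllint --xpath "string(//*[local-name()='body'])" "$f" > "${f%.xml}.txt"
    done
}
```

Calling extract_all dta_xml would then process every .xml file under dta_xml. File names containing newlines would trip up the read loop, which is unlikely to matter for a downloaded corpus.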
Posted 3 years ago Permalink
Arno Bosse
Member

Thank you very much everyone for your help - both techniques worked equally well. I still need to swap some characters and remove line breaks from the processed texts but I can easily batch that in my text editor. Thanks again!
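
For anyone who would rather script that last clean-up step than do it in an editor, here is a rough sketch. Which characters need swapping is not stated above, so the sed expression is passed in as an argument and the long-s default is purely a placeholder; clean_text is an invented name.

```shell
# Join all lines of an extracted text file and substitute characters.
# Assumption: the second argument is a sed expression describing the swap;
# the default (long s to round s) is only a placeholder.
clean_text() {
    file=$1
    swap=${2:-s/ſ/s/g}
    # Turn newlines into spaces, apply the swap, then squeeze repeated spaces.
    tr '\n' ' ' < "$file" | sed -e "$swap" -e 's/  */ /g'
}
```

For example, clean_text stripped.txt > cleaned.txt. Words hyphenated across line ends are not rejoined by this sketch.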

Posted 3 years ago Permalink
Kevin Hawkins
Member
Do keep in mind that the approach recommended here strips the markup away, leaving only the text in between the tags. Since the TEI (and other document-based XML languages) generally leave the transcribed text inside tags (rather than as, say, the values of attributes), this approach will work quite well. But also keep in mind that the TEI has intentionally given up on the naive assumption that if you strip away the markup, you get the exact text that appeared on the page. For example, the TEI's <choice> element includes text between tags for more than one interpretation of what you see on the page, so these two strings will appear in the output of the above command sequences, one after the other, while the source document contained only one of these (though the encoder is unclear on which).

So beware that things like <choice> will introduce errors in your OCR training.
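
One way to soften this problem is to extend the stylesheet with an empty rule for the alternatives you do not want. The sketch below assumes that, for OCR training, the forms found on the page (orig, sic, abbr) should be kept and the editorial alternatives (reg, corr, expan) dropped. That is a guess about the goal, and a given corpus may use <choice> differently. The function and file names are made up for illustration.

```shell
# Write a stylesheet that keeps only one reading from each <choice>.
# Assumption: reg, corr and expan are editorial and can be dropped.
write_diplomatic_xsl() {
    cat > "${1:-strip-choice.xsl}" <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:tei="http://www.tei-c.org/ns/1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <xsl:output method="text" encoding="UTF-8"/>
  <!-- Silence the header, as in the stylesheet earlier in the thread -->
  <xsl:template match="tei:teiHeader"/>
  <!-- Inside choice, silence the editorial alternatives -->
  <xsl:template match="tei:choice/tei:reg | tei:choice/tei:corr | tei:choice/tei:expan"/>
</xsl:stylesheet>
EOF
}
```

After write_diplomatic_xsl strip-choice.xsl, running xsltproc strip-choice.xsl [TEI FILE NAME] should output only the first-listed kind of reading for each <choice>.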
Posted 3 years ago Permalink
Kevin Hawkins
Member

I should have also noted that OxGarage ( http://www.tei-c.org/oxgarage/ ) allows conversion from P5 to "plain text". This page explains how to get the source code behind it: http://wiki.tei-c.org/index.php/OxGarage .

Posted 1 year ago Permalink
Stéfan Sinclair
Admin

This is an older post, so it's probably not worth offering many more solutions, but I want to mention that some simple document conversion, including TEI, is available through Voyant.

You would create a Voyant corpus as usual (say a zip file with TEI documents):

http://voyant-tools.org/docs/#!/guide/corpuscreator

And then you could use the Documents tool (middle tab in the lower left-hand panel of the default skin) to download the corpus in one or more formats (original, minimal HTML for Voyant, plain text):

http://voyant-tools.org/docs/#!/guide/documents

Posted 1 year ago Permalink
scottkleinman
Member

I just saw that this topic had been revived, and I thought I'd point out that Lexos enables the user to strip tags selectively, customising how each tag is handled. This can be really useful if, for example, you want to keep readings in <orig> and delete readings in <reg>. We haven't tested it with large numbers of files, but you can always do your processing in small batches if it gets too slow.

Posted 1 year ago Permalink
acrymble
Member

There is a good tutorial on XSLT on the Programming Historian if you'd like to learn more.

http://programminghistorian.org/lessons/transforming-xml-with-xsl

Posted 11 months ago Permalink
