Replying to @lit_cht's post:
Dear Colleagues, though it has been some time since this discussion came to a preliminary end, I want to thank you again for all your hints. For those generally concerned with quality assurance in large TEI corpora, the Deutsches Textarchiv's recently published article might be of interest. An English abstract is below; the article itself is in German. All the best
- for the DTA-Team -
Christian Thomas
Alexander Geyken, Susanne Haaf, Bryan Jurish, Matthias Schulz, Christian Thomas, Frank Wiegand: "TEI und Textkorpora: Fehlerklassifikation und Qualitätskontrolle vor, während und nach der Texterfassung im Deutschen Textarchiv." In: Jahrbuch für Computerphilologie, http://www.computerphilologie.de/jg09/geykenetal.html [retrieved 2012-08-09].
Abstract:
This paper deals with the issue of quality assurance in very large, XML/TEI-encoded full-text collections. The text corpus edited by the DFG-funded project Deutsches Textarchiv (henceforth: DTA), a large and still growing reference corpus of historical German, is a fine example of such a collection. The following remarks focus on text prepared in a Double-Keying-process, since the major part of the DTA-corpus is compiled by applying this highly accurate method. An extensive and multi-tiered approach, which is currently applied by the DTA for the analysis and correction of errors in double-keyed text, is introduced. The process of quality assurance is pursued in a formative way in order to prevent as many errors as possible, as well as in a summative way in order to track errors which nevertheless may have occurred in the course of full-text digitization. To facilitate the latter, DTAQ, a web-based, collaborative tool for finding and commenting errors in the corpus, was developed. On the profound basis of practical experience in the past four years, the preliminaries and possible methods of conducting a widespread quality assurance are being discussed.