Digital Humanities Questions & Answers » Tag: text mining

Recent Posts
http://digitalhumanities.org/answers/tags/text-mining
Fri, 14 Jul 2017 10:05:17 +0000

petris.it@googlemail.com on "How to extract tagged data and text from TEI file?"
http://digitalhumanities.org/answers/topic/how-to-extract-tagged-data-and-text-from-tei-file#post-2293
Sat, 28 Feb 2015 14:12:42 +0000

If you have tagged the chapters, and if the number of chapters is not that large, you could query the results for every chapter like this:

    tag="%" where tag="chapter1" boundary

This one assumes you have a tag for each chapter.

    tag="%" where tag="chapter" property="number" value="1" boundary

This one assumes you have only one chapter tag and a property that holds the chapter number.

But you are right, there should be a way of extracting the positions of each tag. As a workaround, you could extract the KWIC for each tag into its own CSV file, which gives you the positions of each instance that belongs to the tag you selected. If you then add a tag column manually, you will be able to merge the contents of the per-tag files into one file and get tags with positions.

aliciapeaker@gmail.com on "How to extract tagged data and text from TEI file?"
http://digitalhumanities.org/answers/topic/how-to-extract-tagged-data-and-text-from-tei-file#post-2292
Fri, 27 Feb 2015 13:38:11 +0000

Wonderful! Thank you all for your replies! I've used CATMA to return the tag frequencies and then exported a CSV file with the compiled results. This gives me everything I need except the location in the text of each tag, which would enable me to track frequencies by chapter (for which I have a list of CATMA locations).
Is there a way I could search CATMA for tags within a set of location ranges to output a set of results for each chapter?

petris.it@googlemail.com on "How to extract tagged data and text from TEI file?"
http://digitalhumanities.org/answers/topic/how-to-extract-tagged-data-and-text-from-tei-file#post-2291
Fri, 27 Feb 2015 09:47:36 +0000

You could simply use the CATMA Analyzer to count and extract the tagged information. Assuming you have loaded text and annotations into the Tagger:

1. Click on "Analyze Document".
2. Type tag="%" into the query box and hit "Execute query".
3. Select the tab "Result by markup".

You'll see all tags with their frequency counts there. You can also export the results to a CSV file for further processing. So there is no need for painful XML/XSLT hacking so far.

Ondine on "How to extract tagged data and text from TEI file?"
http://digitalhumanities.org/answers/topic/how-to-extract-tagged-data-and-text-from-tei-file#post-2290
Thu, 26 Feb 2015 20:55:32 +0000

I can't pretend to be the most knowledgeable person about XML, and especially not about querying it for data analysis purposes. But I have used the TEI for markup and delivery a good bit, generally in oXygen and generally with a fairly constrained set of the TEI P5 tags. From that, I can tell you that I rarely encounter anything as complicated as the markup CATMA is giving you. It seems far more complex than XML for most humanities encoding purposes would need to be, especially given that part of the point of XML, and especially TEI, is that it is human readable. Of course, for some of the more complex content analysis goals that some DHers are pursuing with enormous corpora of humanities texts, this kind of markup may be necessary. But based on what I *think* you're trying to do, the complexity here might be unnecessarily mystifying your markup of your content.
If you simply need to measure the frequency of specific tags that appear in the text, based on (I assume) your own criteria for how those tags should be applied, then it may be that a straightforward TEI document in a transparent editor (oXygen would be my choice) would give you far more control. Simply counting the number of uses of a particular tag could be done in oXygen using an XPath query, which you can refine according to attributes, hierarchy, position, etc. The XPath wouldn't generate a new product from your XML, but it would give you a results list (plain text) with a count that shows where all the instances are. If you want a new product, you can use XSLT to generate a new XML document that retains just the elements you want and/or that adds sequential numbers to them, again based on attributes, hierarchy, position, etc., as a way to select exactly what you want. The CATMA document looks so complicated that I would expect it to be very difficult to parse with XSLT, but parsing a more straightforward TEI P5 document for a count of specific tags shouldn't be so difficult.

All that said, I don't use either tool often enough, and haven't recently enough, to be able to offer concrete direction. For that, I recommend going on the TEI discussion list, which you can sign up for here: http://www.tei-c.org/Support/#tei-l

Ethan Gruber on "How to extract tagged data and text from TEI file?"
http://digitalhumanities.org/answers/topic/how-to-extract-tagged-data-and-text-from-tei-file#post-2287
Thu, 26 Feb 2015 17:54:52 +0000

To clarify, is there also an fsDecl for the "non-living" category which contains fDecls for "weather" and other tags? It's doable in XSLT. I don't think you need to count the segs in the XSLT because your TEI (presumably) contains an <fs> for every annotation you've created in your body.
You'll need to iterate through every fsDecl and perform a count of every fs elsewhere in the document that has a @type equal to the @xml:id of the fsDecl. You'd have to tweak this somewhat to include counts of the total tagset and to run the counts per chapter instead of overall. Without seeing more, it's difficult to construct XPath to handle the document chapter by chapter. See this gist for a basic bit of XSLT: https://gist.github.com/ewg118/6b0b99d953ae1f4d8eaf

aliciapeaker@gmail.com on "How to extract tagged data and text from TEI file?"
http://digitalhumanities.org/answers/topic/how-to-extract-tagged-data-and-text-from-tei-file#post-2286
Thu, 26 Feb 2015 16:56:54 +0000

I've been using CATMA (http://www.catma.de/) to mark up a text with some analytical tags I've created. I then exported the file in TEI, and I'm now trying to extract the data I've marked up in order to measure tag frequencies, but am finding it quite difficult. Rather than tagging text with the labels I've created, CATMA has established a somewhat complicated (though likely necessary) system of identifiers. So, for example, I've tagged the word "clouds" in my text with the tag "weather," which is a child of the tagset "non-living." CATMA represents the tag in the text like this:

    <text>
      <body>
        <ab type="catma">
          Small feckless <seg ana="#CATMA_0036983F-4D37-48C2-8BC7-5846A8364D26">clouds</seg> were hurried across the vast untroubled sky...
        </ab>
      </body>
    </text>

The identifier then points to this feature structure after the body of the text:

    <text>
      <body>
      </body>
      <fs xml:id="CATMA_0036983F-4D37-48C2-8BC7-5846A8364D26" type="CATMA_3CDE1FE4-CA5D-4460-9BFF-739537D753DE">
        <f name="catma_displaycolor">
          <string>-16710765</string>
        </f>
        <f name="catma_markupauthor">
          <string>name@email</string>
        </f>
      </fs>
    </text>

The id in the type attribute of the fs then points back up to the feature structure declaration in the header:

    <teiHeader>
      <encodingDesc>
        <fsDecl xml:id="CATMA_3CDE1FE4-CA5D-4460-9BFF-739537D753DE" n="2014-12-16T13:30:36.000+0000" type="CATMA_3CDE1FE4-CA5D-4460-9BFF-739537D753DE">
          <fsDescr>Weather</fsDescr>
          <fDecl xml:id="CATMA_699BAC76-8D15-408E-A30A-984849115A71" name="catma_displaycolor">
            <vRange>
              <vColl>
                <string>-16710765</string>
              </vColl>
            </vRange>
          </fDecl>
          <fDecl xml:id="CATMA_8653855B-B611-48E8-AE9D-00E0160A37DB" name="catma_markupauthor">
            <vRange>
              <vColl>
                <string>name@email</string>
              </vColl>
            </vRange>
          </fDecl>
        </fsDecl>
      </encodingDesc>
    </teiHeader>

I need to extract the text and data, perhaps into a CSV file (or another output format, if it's easier), as something that lists the tagged text (e.g. "clouds") in one column, the first tag applied to it in the next column (e.g. "weather"), and the tagset or category to which that tag belongs in the next (e.g. "non-living"). Or perhaps there's a better way. Really, what I'd like to be able to do is get the frequencies of each tag and tagset for each chapter. If there's an easier way to mark up the text in TEI that would better allow for what I need, I'm open to re-encoding manually.

I've also tried playing around a bit with some XSLT and a Python script (http://www.rdegges.com/quickly-extract-xml-data-with-python/), but with very little experience with either, I find myself quickly out of my depth. Open to suggestions, and thanks in advance for your help!
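For readers who want to try the Python route mentioned in the question, the following is a minimal sketch of the lookup chain described in this thread: each seg's @ana points to an fs, the fs's @type points to an fsDecl, and the fsDecl's fsDescr holds the tag name. It uses only the Python standard library. The function names, the `<TEI>` wrapper around the sample, and the namespace-stripping helper are assumptions made for illustration; a real CATMA export may differ (for example in namespaces or in how several annotations share one seg). The tagset ("non-living") is not resolved, because the thread does not show how the export encodes it.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree expands the xml:id attribute to this qualified name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Cut-down sample based on the snippets quoted in the question.
# The <TEI> root element is an assumption added so the sample parses.
SAMPLE = """<TEI>
  <teiHeader><encodingDesc>
    <fsDecl xml:id="CATMA_3CDE1FE4-CA5D-4460-9BFF-739537D753DE"
            type="CATMA_3CDE1FE4-CA5D-4460-9BFF-739537D753DE">
      <fsDescr>Weather</fsDescr>
    </fsDecl>
  </encodingDesc></teiHeader>
  <text>
    <body><ab type="catma">Small feckless <seg
      ana="#CATMA_0036983F-4D37-48C2-8BC7-5846A8364D26">clouds</seg>
      were hurried across the vast untroubled sky...</ab></body>
    <fs xml:id="CATMA_0036983F-4D37-48C2-8BC7-5846A8364D26"
        type="CATMA_3CDE1FE4-CA5D-4460-9BFF-739537D753DE"/>
  </text>
</TEI>"""


def local(tag):
    """Return an element name without any {namespace} prefix."""
    return tag.rsplit("}", 1)[-1]


def extract_tagged_text(tei_source):
    """Return (tagged text, tag name) pairs found in a TEI string."""
    root = ET.fromstring(tei_source)
    decl_names = {}  # fsDecl xml:id -> tag name taken from fsDescr
    fs_types = {}    # fs xml:id -> fsDecl xml:id taken from @type
    for el in root.iter():
        el_id = el.get(XML_ID)
        if el_id is None:
            continue
        if local(el.tag) == "fsDecl":
            descr = next((c for c in el if local(c.tag) == "fsDescr"), None)
            if descr is not None:
                decl_names[el_id] = (descr.text or "").strip()
        elif local(el.tag) == "fs":
            fs_types[el_id] = el.get("type")

    rows = []
    for el in root.iter():
        if local(el.tag) != "seg":
            continue
        text = "".join(el.itertext()).strip()
        # @ana may hold several space-separated pointers of the form "#id".
        for pointer in (el.get("ana") or "").split():
            decl_id = fs_types.get(pointer.lstrip("#"))
            rows.append((text, decl_names.get(decl_id, "")))
    return rows


def rows_to_csv(rows):
    """Serialize the pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["text", "tag"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    rows = extract_tagged_text(SAMPLE)
    print(rows)                                # [('clouds', 'Weather')]
    print(Counter(tag for _, tag in rows))     # tag frequencies
    print(rows_to_csv(rows))
```

To get frequencies per chapter, one option under the same assumptions would be to run `extract_tagged_text` on one chapter-sized document at a time, or to filter the seg elements by whatever chapter container the encoding uses, and then build one `Counter` per chapter.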

scottkleinman on "Experience with Lexos or Text Mining Software?"
http://digitalhumanities.org/answers/topic/experience-with-lexos-or-text-mining-software#post-2260
Thu, 18 Dec 2014 00:22:19 +0000

I am one of the Lexos developers, and I can't seem to reproduce the error you are getting, at least not on upload. Feel free to drop me a line (scott.kleinman@csun.edu), and we'll see if we can identify the problem.

Muzel on "Experience with Lexos or Text Mining Software?"
http://digitalhumanities.org/answers/topic/experience-with-lexos-or-text-mining-software#post-2259
Wed, 17 Dec 2014 19:14:04 +0000

What is your desired output? Have you tried Voyant Tools?

rubyperlmutter@gmail.com on "Experience with Lexos or Text Mining Software?"
http://digitalhumanities.org/answers/topic/experience-with-lexos-or-text-mining-software#post-2258
Tue, 18 Nov 2014 19:57:32 +0000

Hi! I'm trying to use Lexos (http://lexos.wheatoncollege.edu/upload), and the text of the files I upload keeps getting scrambled. I've tried several of the accepted formats, as well as creating new files, with no luck. Any ideas about what the problem could be? Or any recommendations for free and easy-to-use text-mining tools? Thank you, Ruby

slh@ens-lyon.fr on "Text mining tools that work with RTL texts?"
http://digitalhumanities.org/answers/topic/text-mining-tools-that-work-with-rtl-texts#post-2172
Sat, 03 May 2014 12:52:10 +0000

Replying to @sinai.rusinek@gmail.com's post (http://digitalhumanities.org/answers/topic/text-mining-tools-that-work-with-rtl-texts#post-1912): Hi Sinai, I suggest you give TXM (http://textometrie.ens-lyon.fr/?lang=en) a try.
We haven't designed the GUI with RTL writing systems in mind, but UTF-8 RTL text appears to be well supported overall by the default technology, with one notable exception: in concordances, the left and right contexts are interchanged. The current state of the software and possible evolutions concerning writing systems are described here (in French): https://groupes.renater.fr/wiki/txm-info/public/specs_langues?s=%C3%A9criture. If there is sufficient interest, we could make things evolve more rapidly with respect to RTL.

Mind that GUI management of RTL display is independent of the word segmentation/tokenization of raw text, which can also have a deep impact on the usability of textual analysis software. Even if one can always use software on character strings, it is much better to use it on words or lexical items. For TXM, we are beginning to address Semitic-language word tokenization with Arabic. See the current state here: https://groupes.renater.fr/wiki/txm-info/public/specs_import_annotation_lexicale_auto#etat_de_l_art_pour_l_arabe

If there is sufficient interest, we could include Hebrew in our roadmap, for example with the MorphTagger software: http://www.cs.technion.ac.il/~barhaim/MorphTagger.

Brett Bobley on "New research questions in the humanities"
http://digitalhumanities.org/answers/topic/new-research-questions-in-the-humanities#post-2129
Sun, 03 Nov 2013 02:41:15 +0000

Replying to @tedunderwood's post (http://digitalhumanities.org/answers/topic/new-research-questions-in-the-humanities#post-2111): Ted makes a great point here.
Indeed, there is a lot to be done in this domain. Brett

tedunderwood on "New research questions in the humanities"
http://digitalhumanities.org/answers/topic/new-research-questions-in-the-humanities#post-2111
Sat, 19 Oct 2013 12:01:51 +0000

@Inna: I have to confess that it seems to me expansion of scale is in no danger of turning humanities research into "a set of standardized operations." It might be nice if we had standards for any part of this process! But we don't. At this point there are only maybe twenty or thirty people seriously attempting macroscopic, quantitative humanities research, and many of those people are not in humanities disciplines: they're psychologists or computer scientists or linguists. So there's a huge diversity of approach. We're all posing different questions, and having to improvise our own ad-hoc solutions. And my sense is that we haven't even begun to discover what's possible in this domain. I believe that because I keep stumbling on really big obvious questions that haven't been posed yet.

I understand that a lot of people in the humanities are philosophically or temperamentally uneasy with quantitative methods, especially at a macroscopic scale. Which is fine! There are lots of persuasive reasons not to do this kind of research. But the notion that "it has already been done; it's standardized now" is not one of the reasons I find persuasive. I'm too vividly aware that almost nothing has been done yet in this domain.

Smallpiper on "New research questions in the humanities"
http://digitalhumanities.org/answers/topic/new-research-questions-in-the-humanities#post-2083
Sat, 17 Aug 2013 12:01:04 +0000

Replying to @Patrick Murray-John's post (http://digitalhumanities.org/answers/topic/new-research-questions-in-the-humanities#post-2064): The Oxford English Dictionary (OED) is a great example of an early (1988-) digital resource that has produced a wealth of research questions that only it could answer, but which are nonetheless basically humanistic questions about intellectual history, literature, society, and culture. Some examples include the significance of particular authors (e.g. Shakespeare) for (A) the dictionary and (B) the language (and (C) the gap between A and B); word coinage and sense coinage; changing editorial practices (reflecting changing societal attitudes) regarding words about sex, race, and religion; and the varying significance of different literary periods to late Victorian lexicographers (and their successors). All this and much more using just the interfaces (CD-ROM and online) supplied by OUP over the years. My own DH project uses the back data to detect and document the OED's influence on poetry (and vice versa), as well as on other kinds of literary production (http://poetry-contingency.uwaterloo.ca).

Smallpiper on "History and theory question: lexicography / discourse analysis / text mining"
http://digitalhumanities.org/answers/topic/history-and-theory-question-lexicography-discourse-analysis-text-mining#post-2082
Fri, 16 Aug 2013 23:04:02 +0000

Perhaps at an angle to what you're describing: from time to time I post on lexicography and literary writing, especially poetry, with an emphasis on text mining and comparison, here: http://poetry-contingency.uwaterloo.ca/

Inna Kizhner on "New research questions in the humanities"
http://digitalhumanities.org/answers/topic/new-research-questions-in-the-humanities#post-2066
Mon, 12 Aug 2013 13:42:16 +0000

Thanks a lot for your replies! Yes, many people just do close reading. Thank you, Ted, for sincerely admitting that there are things that 'DH does not do'. But don't they need digital methods to build (say) networks of metaphors (some metaphors may belong to a particular set of characters), or don't they need spatial tools to map their characters and study how particular words are related to geographical settings? Thank you, Linkoln, for reminding us that these are old research questions that are easier to answer with digital tools.

Anyway, the current changes brought by big data might turn research in the humanities into a set of standardized operations, a sort of assembly line. This may explain Eliah Meek's remark that the DH conference this year was dominated by text analysis. Don't you think that DH is now at the stage of market-driven consumerism rather than the stage of looking for new intellectual discoveries in the humanities, similar to the spiritual and intellectual wave that the printing press raised in Europe in the sixteenth century?