Text mining tools that work with RTL texts? « Digital Humanities Questions & Answers

Asked 4 years ago by sinai.rusinek@gmail.com
Latest answer from slh@ens-lyon.fr

sinai.rusinek@gmail.com
Member
I tried AntConc with Unicode-encoded Hebrew texts. It works, but the results come out left-to-right. Any recommendations on how to solve this, or on tools better adapted to RTL text?
Posted 4 years ago Permalink
slh@ens-lyon.fr
Member

Replying to @sinai.rusinek@gmail.com's post:

Hi Sinai,

I suggest you give TXM a try.

We haven't designed the GUI with RTL writing systems in mind, but UTF-8 encoded RTL text appears to be well supported overall by the underlying technology, with one notable exception: in concordances, the left and right contexts are swapped.
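To illustrate the concordance issue, here is a minimal Python sketch of a keyword-in-context (KWIC) listing that swaps the context columns when the keyword is RTL. It is not TXM or AntConc code; the function names `is_rtl` and `kwic_rows` are made up for this example, and it assumes the text is already split into tokens in logical (reading) order.

```python
import unicodedata


def is_rtl(text):
    """Guess direction from the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # Hebrew letters are "R", Arabic letters "AL"
            return True
        if bidi == "L":
            return False
    return False  # no strong character found: assume LTR


def kwic_rows(tokens, keyword, window=3):
    """Return (left column, keyword, right column) for each hit.

    Tokens are in logical order. For an RTL keyword, the text that
    precedes it belongs in the right-hand column, so the two context
    columns are swapped.
    """
    rtl = is_rtl(keyword)
    rows = []
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        before = " ".join(tokens[max(0, i - window):i])
        after = " ".join(tokens[i + 1:i + 1 + window])
        rows.append((after, tok, before) if rtl else (before, tok, after))
    return rows


hebrew = "בראשית ברא אלהים את השמים ואת הארץ".split()
english = "in the beginning god created".split()
hebrew_rows = kwic_rows(hebrew, "אלהים", window=2)
english_rows = kwic_rows(english, "god", window=2)
```

Each cell is kept in logical order; the sketch relies on the display layer to render the characters inside a cell right-to-left, which is an assumption about the GUI toolkit and not something the code enforces.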

The current state of the software and possible developments concerning writing systems are described here (in French): https://groupes.renater.fr/wiki/txm-info/public/specs_langues?s=%C3%A9criture

If there is sufficient interest, we could speed up our work on RTL support.

Note that GUI handling of RTL display is independent of the word segmentation (tokenization) of the raw text, which can also have a deep impact on the usability of text analysis software. Even if one can always run software on character strings, it is much better to run it on words or lexical items. For TXM, we are beginning to address word tokenization for Semitic languages, starting with Arabic. See the current state here: https://groupes.renater.fr/wiki/txm-info/public/specs_import_annotation_lexicale_auto#etat_de_l_art_pour_l_arabe
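The difference between counting characters and counting words can be sketched in a few lines of Python. This is only an illustration using the standard library, not what TXM does; the helper names are hypothetical, and the sketch assumes that stripping combining marks (such as Hebrew vowel points) before matching is acceptable for the analysis at hand.

```python
import re
import unicodedata
from collections import Counter


def strip_marks(text):
    """Remove nonspacing combining marks (e.g. Hebrew vowel points)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def word_tokens(text):
    """Naive word tokenization: runs of letters, digits and underscores."""
    return re.findall(r"\w+", strip_marks(text))


def char_tokens(text):
    """Character-level units: every non-whitespace character."""
    return [ch for ch in strip_marks(text) if not ch.isspace()]


text = "בראשית ברא אלהים את השמים ואת הארץ"
word_counts = Counter(word_tokens(text))
char_counts = Counter(char_tokens(text))
```

Here `word_counts` treats "את" and "ואת" as two unrelated items, although the second is the first with a prefixed conjunction. Naive splitting of this kind cannot recover lexical items in such cases, which is why a morphological analysis step matters for Semitic languages.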

If there is sufficient interest, we could include Hebrew in our roadmap, for example with the MorphTagger software: http://www.cs.technion.ac.il/~barhaim/MorphTagger

Posted 3 years ago Permalink
