I tried AntConc with Unicode-encoded Hebrew texts. It works, but the results are displayed left-to-right. Any recommendations on how to solve this, or on tools better adapted to RTL text?
Text mining tools that work with RTL texts?
Posted 4 years ago
Replying to @sinai.rusinek@gmail.com's post:
Hi Sinai,
I suggest you give a try to TXM.
We haven't designed the GUI with RTL writing systems in mind, but UTF-8 encoded RTL text appears to be well supported overall by the underlying technology, with one notable exception: in concordances, the left and right contexts are swapped.
The current state of the software and possible developments concerning writing systems are described here (in French): https://groupes.renater.fr/wiki/txm-info/public/specs_langues?s=%C3%A9criture.
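To illustrate the context issue in general terms, here is a minimal Python sketch. It is not TXM or AntConc code, and the function names `kwic` and `layout_columns` are made up for this example. It assumes the text is stored in logical (reading) order, as Unicode text normally is, and shows that for an RTL language the preceding context belongs in the right-hand column:

```python
def kwic(tokens, keyword, width=3):
    """Collect (preceding, keyword, following) contexts in logical order."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            before = tokens[max(0, i - width):i]   # words read before the hit
            after = tokens[i + 1:i + 1 + width]    # words read after the hit
            rows.append((before, tok, after))
    return rows


def layout_columns(row, rtl=False):
    """Return the three visual columns of one row, leftmost column first.

    In an LTR layout the preceding context goes on the left.
    In an RTL layout it goes on the right, so the outer columns swap.
    """
    before, key, after = row
    b, a = " ".join(before), " ".join(after)
    return (a, key, b) if rtl else (b, key, a)


if __name__ == "__main__":
    # Whitespace splitting is a simplification used only for this demo.
    tokens = "בראשית ברא אלהים את השמים ואת הארץ".split()
    for row in kwic(tokens, "אלהים", width=2):
        print(layout_columns(row, rtl=True))
```

A tool that always uses the LTR layout will still find the right hits, but the reader sees the preceding words on the wrong side of the keyword.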
If there is sufficient interest, we could make things evolve more rapidly with respect to RTL.
Note that GUI handling of RTL display is independent of the word segmentation (tokenization) of the raw text, which can also have a deep impact on the usability of text analysis software. Even if one can always run software on character strings, it is much better to run it on words or lexical items. For TXM we have begun to address Semitic language word tokenization with Arabic. See the current state here: https://groupes.renater.fr/wiki/txm-info/public/specs_import_annotation_lexicale_auto#etat_de_l_art_pour_l_arabe
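As a rough sketch of why tokenization is a separate matter from display, the Python snippet below counts Hebrew word forms without any reference to text direction. It is only an illustration under simple assumptions, not what TXM does: the names `strip_marks`, `tokenize` and `word_frequencies` are hypothetical, a token is taken to be any run of the letters U+05D0 to U+05EA, and vowel points and cantillation marks are dropped.

```python
import re
import unicodedata
from collections import Counter

# Assumption: a "word" is a run of Hebrew letters (alef to tav, final forms included).
HEBREW_WORD = re.compile(r"[\u05D0-\u05EA]+")


def strip_marks(text):
    """Remove combining marks (niqqud, cantillation) after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def tokenize(text):
    """Split text into surface word forms, in logical (reading) order."""
    return HEBREW_WORD.findall(strip_marks(text))


def word_frequencies(text):
    """Count surface word forms rather than individual characters."""
    return Counter(tokenize(text))


if __name__ == "__main__":
    sample = "בראשית ברא אלהים את השמים ואת הארץ"
    for word, count in word_frequencies(sample).most_common():
        print(word, count)
```

This only yields surface forms. Hebrew attaches prefixes such as the conjunction and the article to the following word, and the regex also splits abbreviations written with gershayim, so a proper lexical analysis needs a morphological tool on top of it.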
If there is sufficient interest, we could include Hebrew in our roadmap, for example with the MorphTagger software: http://www.cs.technion.ac.il/~barhaim/MorphTagger.
Posted 3 years ago