I should add that for now we're following Laura's suggestion and using ampersand pound 8212 semicolon
TEI Basic Tags
(24 posts) (13 voices)-
Posted 7 years ago Permalink
I favor Unicode numerical entities, because it's less hassle for me than either using entity definitions or using finicky character palettes, but I am a HUGE NERD and have quite a few of the commonest Unicode punctuation codepoints memorized. If you prefer the HTML-ish formulations, Wayne's suggestion above should work just fine.
Posted 7 years ago Permalink -
The best way to do characters like em dashes is surely to just write them in your file in UTF-8. Forget numerical entities, whatever, just put the character in... your editing application is sure to have a way of doing it via a character map or menu or the like. Why treat them differently from any other character you normally type?
Posted 7 years ago Permalink -
Some people avoid putting non-ASCII characters into their XML for any of the following reasons:
* The project's encoders or quality-control team have trouble distinguishing between characters with similar glyphs (like the ASCII hyphen and the em dash). While this can be mitigated by choosing a different font, you don't always have control over the font used if users are working in various environments. See the next point.
* There are tools in use (text editors, scripts, etc.) that do not read and write Unicode, at least not without proper configuration that users might not always remember to undertake. So if all text is in ASCII (with entity references for anything outside ASCII), you ensure fidelity of interchange.
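As a hedged illustration of the look-alike problem in the first point, here is a minimal Python sketch (not something from this thread; the function name `report_non_ascii` is made up for the example). It lists every non-ASCII character in a string with its code point and Unicode name, so a proofreader does not have to rely on the font to tell a hyphen from a dash:

```python
import unicodedata

def report_non_ascii(text):
    """Return (index, 'U+XXXX', name) for each non-ASCII character in text."""
    report = []
    for index, char in enumerate(text):
        if ord(char) > 127:
            # Fall back to "UNKNOWN" for code points that have no name.
            name = unicodedata.name(char, "UNKNOWN")
            report.append((index, f"U+{ord(char):04X}", name))
    return report

sample = "pp. 10\u201320 \u2014 see note"  # an en dash, then an em dash
for entry in report_non_ascii(sample):
    print(entry)
```

The ASCII hyphen never shows up in the report, so anything listed is a character that some tool in the chain might mishandle.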
Mnemonic entities (like &mdash;) are easy to proofread but require extra configuration for validation (as mentioned previously in this thread). Decimal and hexadecimal character references are always allowed and are functionally interchangeable. If you are consulting Unicode charts, hexadecimal references are more convenient since they match the Unicode code points.
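A small check of these three claims, sketched with Python's standard `xml.etree.ElementTree` parser (my choice of tool for illustration, not one mentioned in the thread). The element name `p` and the inline entity declaration are assumptions made up for the example; a real TEI project would normally get its entity declarations from a DTD or entity file:

```python
import xml.etree.ElementTree as ET

# Decimal and hexadecimal references to U+2014 parse to the same character.
decimal = ET.fromstring("<p>a&#8212;b</p>").text
hexadecimal = ET.fromstring("<p>a&#x2014;b</p>").text
print(decimal == hexadecimal == "a\u2014b")

# A mnemonic entity with no declaration is rejected by the parser.
try:
    ET.fromstring("<p>a&mdash;b</p>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
print(undeclared_ok)

# Declaring the entity (here in an internal DTD subset) makes it usable.
declared = ET.fromstring(
    '<!DOCTYPE p [<!ENTITY mdash "&#8212;">]><p>a&mdash;b</p>'
).text
print(declared == "a\u2014b")
```

Note that 8212 in decimal is 2014 in hexadecimal, which is why the hex form lines up with the U+2014 entry in the Unicode charts.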
Posted 7 years ago Permalink -
Would you say the same for someone encoding texts in more or less any European language (let alone the scripts the majority of people in the world read and write)? I don't think French readers would like it if you suggested they write
<![CDATA[<hi>tête-à-tête</hi>]]>
in their files and proofread like that. The case of en dash vs. em dash vs. hyphen vs. minus sign may be a special case, I admit. That there are tools which cannot do UTF-8 Unicode properly is true, but for how long do we have to tolerate the tyranny of old software? That a tool would be Unicode-aware, yet not do UTF-8, stretches credibility, and if a tool be not Unicode-aware, should we not cast it into the outer darkness?
Posted 7 years ago Permalink -
Ach, I should have guessed that would happen: an editor which claims to be HTML but is a pale shadow of it. My code example should have been
<hi>tête-à-tête</hi>
Posted 7 years ago Permalink -
try the third
<hi>t&#234;te-&#224;-t&#234;te</hi>
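For anyone wondering what the ASCII-only spelling that the forum software keeps swallowing would look like, here is a rough Python sketch using only the standard library (nothing here comes from the posters, and the choice of decimal references is simply what Python's `xmlcharrefreplace` error handler produces):

```python
import html

original = "t\u00eate-\u00e0-t\u00eate"  # "tête-à-tête" written with escapes

# Replace every non-ASCII character with a decimal character reference.
ascii_only = original.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(ascii_only)

# html.unescape resolves numeric references (and HTML named entities) back.
print(html.unescape(ascii_only) == original)
print(html.unescape("t&ecirc;te-&agrave;-t&ecirc;te") == original)
```

Going in either direction is mechanical, so a project can keep literal characters in its working files and still hand ASCII-only copies to tools that need them.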
Posted 7 years ago Permalink -
I'd generally second what Sebastian says here. But, I'd also add a software-specific hint:
If you happen to be using oXygen then you can use what are called 'code templates'. A template could consist of highly complex markup, or even just a single character.
So, for example, although I can go to oXygen's character map, find a capital thorn character, and click on it, I find it easier to create a code template once containing a capital thorn and name it THORN, with another named thorn for the lowercase. That means I can start writing 'thor' and hit Ctrl-Space to see if there are any templates that this matches, and select the one that comes up. Typing tho, Ctrl-Space, Enter is certainly more keystrokes, but it is much easier than needing to remember Unicode code points, and it puts an actual thorn character in there.
One could do the same with an em dash or any manner of things. I realise that this is specific to a single editor, but any _good_ editor should allow you to do similar forms of abbreviation expansion. (So you can probably do something similar in whatever editor you are using.)
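To make the idea of abbreviation expansion concrete, here is a toy Python sketch. It is not how oXygen implements code templates; the table `TEMPLATES` and the functions `matching_templates` and `expand` are hypothetical names invented for the illustration:

```python
# Hypothetical template table: abbreviation name -> text to insert.
TEMPLATES = {
    "THORN": "\u00de",  # capital thorn
    "thorn": "\u00fe",  # lowercase thorn
    "mdash": "\u2014",  # em dash
}

def matching_templates(prefix):
    """Return template names starting with prefix, ignoring case."""
    wanted = prefix.lower()
    return sorted(name for name in TEMPLATES if name.lower().startswith(wanted))

def expand(name):
    """Return the text for a template name, or the name itself if unknown."""
    return TEMPLATES.get(name, name)

print(matching_templates("tho"))
print(expand("thorn"))
```

The point is only that the editor stores the literal character once and the encoder remembers a name rather than a code point.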
-James
Posted 7 years ago Permalink -
In response to Sebastian, when I said that some people do not put non-ASCII characters into their XML, this was meant simply as an observation of the current state of affairs and is not a recommendation. That is, I'm trying to explain why some people might say not to do this. Naturally French speakers creating XML will use their keyboard layout to directly enter characters used in French and will have no trouble reading and proofreading these.
As for use of tools, not everyone has full control over the environment in which everyone in their project operates.
Posted 7 years ago Permalink -
Belatedly, since I was checking to see what the current received wisdom is: character entities in P4 are also handy for glyphs for which no Unicode code point exists (yet?). Though I like Junicode, I'm leery of pasting in glyphs from its Private Use Area. Very much in agreement with @kevin.s.hawkins here, also.
I'm glad to see @jamesc's reminder about oXygen's code templates, at any rate, and will define a few templates for the glyphs in my manuscripts which aren't satisfied by the many Unicode hex points I've memorized....
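For those who share the worry about Private Use characters slipping into a transcription, a quick check along these lines might help. This is only a sketch in Python, relying on the Unicode general category `Co` (private use); the function name `private_use_characters` is made up for the example:

```python
import unicodedata

def private_use_characters(text):
    """Return (index, 'U+XXXX') for each Private Use character in text."""
    return [
        (index, f"U+{ord(char):04X}")
        for index, char in enumerate(text)
        if unicodedata.category(char) == "Co"
    ]

sample = "abc\ue000def"  # U+E000 is the first Private Use Area code point
print(private_use_characters(sample))
```

Anything it reports depends on a particular font's private assignments and will not survive a change of font.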
Posted 6 years ago Permalink