Wednesday, March 30, 2011

Using the Ş or Ș In #Romanian

The definition of the subset of #Unicode characters used for the Romanian language is quite clear: only Ș and ș, written with a comma below, are correct for the letter șe. This does not mean, however, that everybody writes the S with a comma below rather than with a cedilla.
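The cedilla and comma-below forms look almost identical but are separate Unicode code points, which is easy to check. Here is a minimal sketch in Python using the standard `unicodedata` module; the helper name `describe` is made up for this illustration.

```python
import unicodedata

def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Cedilla and comma-below variants of S and T, upper and lower case.
for ch in "\u015E\u0218\u015F\u0219\u0162\u021A\u0163\u021B":
    print(ch, describe(ch))
```

The cedilla forms sit at U+015E, U+015F, U+0162 and U+0163, while the comma-below forms used for Romanian sit at U+0218 to U+021B.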

Before Unicode became commonplace, typing a comma under a character was really hard. As a consequence, a great deal of written Romanian is technically incorrect: people got used to writing with a cedilla.

With the later versions of MS Windows, the keyboard mapping for Romanian makes it easy to write correct Romanian. For people stuck with the wrong keyboard, we can easily have Narayam provide a modern input method for Romanian.

This is currently a big deal for the Romanian and English language Wiktionaries. They are in the process of correcting every occurrence of a wrongly written șe and țe. This is quite an undertaking because it affects interwiki links to other Wiktionaries as well.

It also means that they want to ensure that only a correct șe or țe is written in Romanian. This is complicated by the fact that Ş is correct in, for instance, Turkish. Being able to identify the language of a text is therefore quite important.

The solution currently implemented on the Romanian Wikipedia is that any s or t with a cedilla is converted to a proper șe or țe. As a consequence, Turkish names of people and places are likely to be spelled incorrectly.
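A blanket conversion like this can be sketched in a few lines of Python. It only illustrates the idea and is not the code the Romanian Wikipedia actually runs; the names `CEDILLA_TO_COMMA` and `to_comma_below` are hypothetical.

```python
import unicodedata

# Hypothetical table: cedilla forms mapped to their comma-below forms.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # capital S with cedilla -> capital S with comma below
    "\u015F": "\u0219",  # small s with cedilla   -> small s with comma below
    "\u0162": "\u021A",  # capital T with cedilla -> capital T with comma below
    "\u0163": "\u021B",  # small t with cedilla   -> small t with comma below
})

def to_comma_below(text):
    """Replace every s or t with cedilla by its comma-below counterpart."""
    # Normalize to NFC first, so a base letter followed by a combining
    # cedilla becomes the single precomposed character the table handles.
    return unicodedata.normalize("NFC", text).translate(CEDILLA_TO_COMMA)

print(to_comma_below("Bucure\u015Fti"))      # Romanian name, cedilla corrected
print(to_comma_below("Be\u015Fikta\u015F"))  # Turkish name, now misspelled
```

The second call shows the side effect: the function cannot know that Beşiktaş is Turkish, so it replaces the cedillas there as well.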

PS Please note that the font used for the title does not cope with these characters.
