Thursday, April 12, 2012

#CLDR is not used solely for locales

#Unicode is best known for its architecture for the digital representation of the characters of scripts. The Unicode Consortium also hosts the Common Locale Data Repository project, known as the CLDR. This repository records how languages are used in different areas: for instance, which currency symbol is used, in which direction the language is written, and how numbers and dates are written.
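To make the kind of data concrete, here is a minimal sketch of what a single locale record might hold. The `LocaleData` class, its field names, and the sample values are all assumptions made for illustration; they are written by hand and are not copied from the CLDR, whose real data is far richer and structured differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocaleData:
    """Hypothetical, simplified record of locale conventions."""
    language: str           # language code, e.g. "en"
    region: str             # region code, e.g. "US"
    currency_symbol: str    # symbol shown next to amounts
    text_direction: str     # "ltr" or "rtl"
    decimal_separator: str  # character between integer and fraction
    date_pattern: str       # illustrative short date pattern


# Hand-written sample values, for illustration only.
SAMPLE_LOCALES = {
    "en-US": LocaleData("en", "US", "$", "ltr", ".", "M/d/yy"),
    "he-IL": LocaleData("he", "IL", "₪", "rtl", ".", "d.M.yyyy"),
}


def describe(tag: str) -> str:
    """Return a one-line summary of the sample record for a locale tag."""
    data = SAMPLE_LOCALES[tag]
    return (f"{tag}: currency {data.currency_symbol}, "
            f"direction {data.text_direction}, "
            f"dates as {data.date_pattern}")


print(describe("he-IL"))
```

Note that every record is keyed by a language together with a region, which is the assumption that becomes awkward for a language without a locale.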

Many applications rely on this data. Several open source word processors use it to decide which languages can be enabled for editing. It is great that the data is used, but this particular use is problematic: not every language is still spoken, let alone spoken in a locale.

A great example is Ancient Greek. It is not a living language, but it is taught in schools all over the world. When students do their homework, what they write is certainly not modern Greek. From a technical perspective, such a document is only correct when its metadata indicates that it is Ancient Greek.

When Ancient Greek and other extinct languages are supported in word processors, surviving texts can be transcribed with modern tools. These transcribed texts will have correct metadata by default, and they will be easier to find once they are placed on the Internet.

For this to happen, either the CLDR embraces the fact that it is used to enable languages in word processors, or the word processors that currently use the CLDR allow for alternate sources of primary data.
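The second option could look roughly like the sketch below: a word processor consults its locale-based source first and falls back to an alternate source for languages that have no locale. The table names, their contents, and the `lookup_language` function are hypothetical; only the language codes `el` (modern Greek) and `grc` (Ancient Greek) are real ISO 639 codes. Whether a given language is present in the actual CLDR is not something this sketch asserts.

```python
from typing import Optional

# Hypothetical primary source: languages that come with locale data.
LOCALE_BASED = {
    "el": {"name": "Modern Greek", "direction": "ltr"},
}

# Hypothetical alternate source: languages without a locale.
ALTERNATE = {
    "grc": {"name": "Ancient Greek", "direction": "ltr"},
}

# Sources are consulted in order; the first match wins.
SOURCES = [("locale-based", LOCALE_BASED), ("alternate", ALTERNATE)]


def lookup_language(code: str, sources=SOURCES) -> Optional[dict]:
    """Return language data plus the name of the source that supplied it,
    or None when no source knows the code."""
    for source_name, table in sources:
        if code in table:
            return {"code": code, "source": source_name, **table[code]}
    return None


for code in ("el", "grc", "xx"):
    print(code, "->", lookup_language(code))
```

With a lookup like this, a document written in Ancient Greek can be tagged `grc` rather than being mislabeled as modern Greek simply because only `el` was available.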
