Sunday, April 15, 2012

Supporting #Asturian in the #CLDR

Finding the data to support a language in the CLDR can be a struggle. There are only a few core requirements, so how hard can it be?
  1. (04) Exemplar sets: main, auxiliary, index, punctuation. [main/xxx.xml]
  2. (02) Orientation (bidi writing systems only) [main/xxx.xml]
  3. (01) Plural rules [supplemental/plurals.xml]
  4. (01) Default content script and region (normally: the country with the largest population using that language, and the normal script for that).  [supplemental/supplementalMetadata.xml]
  5. (N) Verify the country data ( i.e. which territories in which the language is spoken enough to create a locale ) [supplemental/supplementalData.xml]
  6. *(N) Romanization table (non-Latin writing systems only) [spreadsheet, we'll translate into transforms/xxx-en.xml]
Reading this text, which states the initial requirements, makes it quite obvious why this process has such a bad reputation.
  • It is neither clear nor relevant to a contributor in which XML file the provided data ends up.
  • Orientation is very much an aspect of a script, not of a language or a locale.
  • In the survey tool, Esperanto proves that a language does not necessarily fit into a locale.
  • Romanisation is stated as a requirement. It is, however, not at all obvious that every script or language has ever been romanised in a standardised way, nor why the lack of a romanisation should keep a language out of the standard.
In the past there has been an attempt to provide information for the Asturian language to the CLDR. The good news is that there is documentation on why it failed. The problem was that when you establish data about a language, you need to be certain. Four Unicode characters ('Ḥ ḥ Ḷ ḷ') are used for writing the Asturian language properly, and the literature on the subject was not consistent. This issue was resolved, but it took more time than was available in the CLDR time box.
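
For readers who want to see exactly which characters are at stake, here is a minimal Python sketch that prints the code point, the Unicode name and the canonical decomposition of each of the four letters. It uses only the standard `unicodedata` module. The helper name `describe` and the list name `ASTURIAN_DOTTED` are made up for this illustration, and nothing here is part of the CLDR tooling.

```python
import unicodedata

# The four characters named in the text, written as escapes so that the
# precomposed forms are used regardless of how this file was normalised.
ASTURIAN_DOTTED = ["\u1E24", "\u1E25", "\u1E36", "\u1E37"]  # Ḥ ḥ Ḷ ḷ


def describe(ch):
    """Return (code point, Unicode name, NFD code points) for one character."""
    decomposed = unicodedata.normalize("NFD", ch)
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        [f"U+{ord(c):04X}" for c in decomposed],
    )


if __name__ == "__main__":
    for ch in ASTURIAN_DOTTED:
        code, name, parts = describe(ch)
        print(ch, code, name, parts)
```

Each letter decomposes into a plain Latin letter followed by U+0323 (COMBINING DOT BELOW), so the same text can be stored in more than one way. That is one reason why a character list has to be stated precisely before it can go into a standard.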

The Asturian example proves that getting data ready for a standard takes time. The practice of closing a request because the data was not provided within a set amount of time is what stopped people dead in their tracks. We can only hope that Asturians will find what it takes to get support for their language in this time box.

