Sunday, April 15, 2012

Supporting #Asturian in the #CLDR

Finding the data to support a language in the CLDR can be a struggle. There are only a few core requirements, so how hard can it be...
  1. (04) Exemplar sets: main, auxiliary, index, punctuation. [main/xxx.xml]
  2. (02) Orientation (bidi writing systems only) [main/xxx.xml]
  3. (01) Plural rules [supplemental/plurals.xml]
  4. (01) Default content script and region (normally: country with largest population using that language, and normal script for that). [supplemental/supplementalMetadata.xml]
  5. (N) Verify the country data ( i.e. which territories in which the language is spoken enough to create a locale ) [supplemental/supplementalData.xml]
  6. *(N) Romanization table (non-Latin writing systems only) [spreadsheet, we'll translate into transforms/xxx-en.xml]
Once you read this text, which states what the initial requirements are, it becomes quite obvious why this process has such a bad reputation:
  • It is neither clear nor relevant where in the XML files the provided data ends up.
  • Orientation is very much an aspect of a script, not of a language nor of a locale.
  • In the survey tool it is Esperanto that proves that a language may not fit into a locale anyway.
  • Romanisation is stated as a requirement. It is, however, not at all obvious that every script or language has ever been romanised in a standardised way, nor why its absence should keep a language out of the standard.
There has been an earlier attempt to provide information for the Asturian language to the CLDR. The good news is that there is documentation on why it failed. The problem was that when you establish data about a language, you need to be certain. Four Unicode characters ('Ḥ ḥ Ḷ ḷ') are used for writing the Asturian language properly, and the literature on the subject was not consistent. This issue was resolved, but it took more time than was available in the CLDR time box.
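For readers who want to see exactly which characters are meant, here is a minimal Python sketch using only the standard `unicodedata` module. It prints the code point, the Unicode name and the canonical decomposition of each of the four letters. The names `ASTURIAN_DOTTED` and `describe` are made up for this illustration, and nothing here says how the CLDR itself stores or checks exemplar sets.

```python
import unicodedata

# The four characters named in the post, written as escapes so the
# source file does not depend on how the editor normalises them.
ASTURIAN_DOTTED = "\u1E24\u1E25\u1E36\u1E37"  # Ḥ ḥ Ḷ ḷ


def describe(ch):
    """Return (code point, Unicode name, NFD decomposition as code points)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        [f"U+{ord(c):04X}" for c in decomposed],
    )


for ch in ASTURIAN_DOTTED:
    code, name, parts = describe(ch)
    print(ch, code, name, " ".join(parts))
```

Each letter decomposes into a plain Latin letter followed by U+0323 (combining dot below), so the same text can be encoded in two ways. Anyone comparing sources on Asturian spelling would probably want to normalise first.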

The Asturian example proves that getting data ready for a standard takes time. The practice of closing a request because the data was not provided within a set amount of time is what stopped people dead in their tracks. We can only hope that Asturians will find what it takes to get support for their language in this time box.


Steven R. Loomis (ICU/CLDR Project) said...


I think a key issue here is communication. We'd like to improve communication to and from the CLDR team. The expectation of the page you linked to is that the data listed in core would be provided so that we can collect other data via the automated Survey Tool process. As a result of this discussion, I've re-written that page into a new page (work in progress!!) which puts that same request for data in a questionnaire form. But to reply to some specific points:

1. XML format isn't the question here- all that's needed is to inform us of the data requested.

2. Orientation: Data requested here may be stored in different forms. It is true that this data is relevant on a script basis, and in fact CLDR tools verify that languages with right-to-left characters have right-to-left orientation. I don't see why this is a problem, though. Not all of the questions on this list are of a per-locale basis.

3. I'm not sure what Esperanto proves - perhaps we have a different definition of "Locale"?

4. Lack of Romanization data never kept anything out of CLDR. Anyways, I don't think this is a hard requirement (hence the asterisk).

As I understand the Asturian situation, there was no data entered, and (to this day) no reply as to providing this data. The core Asturian data could be provided at any time; it wasn't waiting on the CLDR time box. Data submission opened two weeks ago, following our Beta period.

When the request #2868 was closed, it clearly stated: "please open a new ticket providing the necessary core data and we will reinstate it". That request had to be closed, because it already had active commits on it. It should in no way be construed as a rejection of Asturian.

"What it takes" is to click here and file a new ticket. Please, work with us.


Anonymous said...

Gerard, Steven,

It's not that I blame CLDR or anyone in the Asturian case; I understand that some outdated information got in our way last time.

Once that info was fixed, I put off providing the data because I knew that a new data submission period would start in a few months and, unfortunately, we are a small language and are always short of manpower to keep up with all the projects we are involved in.

The new questionnaire is a good step towards helping non-technical users provide data for new languages. Please keep it as easy as possible!

And, certainly, we want to work with you, and we will!

Best regards,