Friday, May 05, 2006

Language codes at the Wikimedia Foundation

In the past we started using the ISO-639-1 codes to indicate languages. Because that list was too limited, it was extended with the ISO-639-2 list. Even so there were issues with the list, and it was augmented with codes from Ethnologue. This was often still considered not enough, so we created our own codes.

Now there are the ISO-639-3 codes. They have a provisional status because there will be even further extensions of the list, but they highlight one thing: using our "own" codes is really problematic. I will give you some examples, and I will also indicate why many arguments used in the discussions on new languages are wrong.

The ksh.wikipedia is for something called Ripuarian. The ksh code is for the Kölsch language. Ripuarian is considered to be a language family. The consequence is that there IS no single Ripuarian orthography, language or culture.

The als.wikipedia is called Alemannic. According to ISO-639 this covers four languages. The problem is that the als code is used for the main Albanian language. Albanian itself, with the code sq (ISO-639-3 sqi), is also considered a language family. The two main variants are the Albanian and the Kosovar languages. Two other members of the Albanian language family are spoken in Italy and Macedonia.
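To make the two mismatches above concrete, here is a minimal sketch of how one could list them side by side. The table only contains the two examples from this post, and the names `PROJECT_CODES` and `conflicting_codes` are made up for illustration; this is not an existing WMF tool or an authoritative copy of the ISO-639-3 list.

```python
# Hypothetical table: project code -> (what the wiki uses the code for,
#                                      what this post says the ISO-639-3 code stands for)
PROJECT_CODES = {
    "ksh": ("Ripuarian", "Kölsch"),
    "als": ("Alemannic", "Albanian (main variety)"),
}


def conflicting_codes(codes):
    """Return, sorted, the project codes whose wiki usage differs from the ISO meaning."""
    return sorted(
        code
        for code, (wiki_use, iso_meaning) in codes.items()
        if wiki_use != iso_meaning
    )


for code in conflicting_codes(PROJECT_CODES):
    wiki_use, iso_meaning = PROJECT_CODES[code]
    print(f"{code}.wikipedia is used for {wiki_use}, but the code stands for {iso_meaning}")
```

A real check would have to load the full ISO-639-3 list and the full list of projects; the point here is only that the comparison itself is simple once both lists are written down.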

Some of the most heated discussions on the requests for new projects are about the status of a language: is it a language or a dialect? Often the arguments are of a political nature. The inclusion of languages in ISO-639 has been political in the past. With ISO-639-3, many of these arguments are answered by the many new language codes that have been created.

The result is that we have wikipedias like the ku, the fa and the sq wikipedia, where the language is now considered a language family and where a request can be made for recognition of a language that is part of that language family. There are more projects like that which I have not identified yet.

Another "nice" situation is the Low Saxon nds wikipedia. When you look at what Ethnologue has to say about the Low Saxon language family, you get the impression that there is not really a single thing called Low Saxon in the Netherlands. The varieties of Low Saxon of the Netherlands are all there. They are named and they have their own codes.

The point that I am raising is that languages are a mess. The codes for our projects are, as a consequence, a mess. The procedures for new projects are a mess because of the politics and because of the codes we have come up with in the past. What we need are better guidelines on how our codes relate to the ISO-639 codes. If the WMF says it uses the ISO-639 codes, then either the codes must be used as defined or our own codes must be clearly different.

Last but not least, according to the terms of use, we are not allowed to extend the codes in the way that we do.


