Wednesday, May 31, 2006

What use is a community for others ..

I was at the LREC 2006 conference in Genoa, and one recurring theme was the use of software because there are not enough people to do things manually. Some things a computer can do well, some things it does not do so well. You are often presented with a percentage that shows how far the computer is off from what a human would do.

One presentation, the prize winner's presentation at the end of the conference, mentioned a nice scheme where two concepts were compared and the question was to what extent the first concept is associated with the second. People doing this are trained with a first set of concepts, they are then asked to do a further set of concepts to see to what extent they have learned things, and then they are off. This worked really well, but you need a large group of volunteers or you use a computer. The computer either did a good job or gave COMPLETELY different answers from what a human would do (these are the things to watch out for in a Turing test).

My idea is that when WiktionaryZ gets itself a large community of people interested in languages, it would be natural to ask this community if they are interested in helping out with research. One strategy would be to have just one person check a machine-derived result and, when there is a discrepancy, have some more people look at it.
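
The checking strategy could be sketched roughly like this. It is only an illustration: the function name, the majority vote among the human reviewers and the way extra reviewers are asked are all assumptions of mine, nothing like this exists in WiktionaryZ.

```python
from collections import Counter


def review(machine_answer, first_reviewer_answer, ask_more_reviewers):
    """Return (accepted_answer, escalated).

    Hypothetical sketch: one person checks the machine-derived result.
    Only when that person disagrees are more people asked, and the
    human majority decides (an assumed rule, not a tested method).
    """
    if first_reviewer_answer == machine_answer:
        # No discrepancy: one human check was enough.
        return machine_answer, False

    # Discrepancy: collect more human answers and count the votes.
    extra_answers = ask_more_reviewers()
    votes = Counter([first_reviewer_answer, *extra_answers])
    answer, _count = votes.most_common(1)[0]
    return answer, True


# Agreement: nobody else is bothered.
print(review("strong", "strong", lambda: []))
# Disagreement: three more volunteers are asked.
print(review("strong", "weak", lambda: ["weak", "strong", "weak"]))
```

The nice thing about this is that most of the volunteer time goes to the cases where the computer and a human disagree.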

Another interesting experiment would be to test the difference between the different groups of users of English, including people who use English as a second language. What do you think?


Saturday, May 20, 2006

Languages, dialects and orthographies

When a text is known to be in a certain language, and this language is more or less familiar to a person, this text may be meaningful. I have had some French classes and when I am in Italy there is quite a lot that I can understand. An automated process cannot do this; it helps quite a lot when a text has metadata that indicates what language, dialect or script it is in.

One of the things that it makes sense to be aware of is what orthography a text, phrase or word is in. It is definitely something that is in a class of its own, and it matters when text is to be understood in an automated way. Languages change over time, and the recognized correct orthography changes to reflect this. The German and Dutch languages have both had their fair share of changes. The functional design of WiktionaryZ has always had a place to indicate that a given spelling is dated. The way we will export the WiktionaryZ data will be by using standards like TBX, LMF, maybe RDS, SKOS or something different but standard. The problem is: how do we indicate that a given word needs to be spelled differently since a given date?
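
To make the question concrete, here is a minimal sketch of what a dated spelling could look like as data. The class, its field names and the cut-over date are assumptions for illustration only; this is not the WiktionaryZ design, and it says nothing about how TBX or LMF would express it.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Spelling:
    """Hypothetical record: a spelling plus the period it is correct in."""
    text: str
    valid_from: Optional[date] = None   # None: no known start date
    valid_until: Optional[date] = None  # None: still current (end is exclusive)

    def is_current_on(self, day: date) -> bool:
        starts_ok = self.valid_from is None or self.valid_from <= day
        ends_ok = self.valid_until is None or day < self.valid_until
        return starts_ok and ends_ok


def spellings_on(spellings, day):
    """Return the spellings that count as correct on the given day."""
    return [s.text for s in spellings if s.is_current_on(day)]


# German "daß" became "dass" with the spelling reform; the exact
# cut-over date used here is an illustrative assumption.
REFORM = date(1998, 8, 1)
dass = [
    Spelling("daß", valid_until=REFORM),
    Spelling("dass", valid_from=REFORM),
]

print(spellings_on(dass, date(1990, 1, 1)))
print(spellings_on(dass, date(2006, 5, 20)))
```

With something like this, "dated" is no longer a yes or no flag: you ask which spelling was correct on the date the text was written.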


Friday, May 05, 2006

Language codes at the Wikimedia Foundation

In the past we started using the ISO-639-1 codes for indicating languages. Because this list was too limited, it was extended with the ISO-639-2 list. Even so there were issues with the lists, and they were augmented with codes from Ethnologue. This was often considered not enough, so we created our own codes.

Now there are the ISO-639-3 codes. They have a provisional status because there will be even further extensions of the list, but they highlight one thing: us using our "own" codes is really problematic. I will give you some examples and I will also indicate why many arguments used in the discussions on new languages are wrong.

The ksh.wikipedia is for something called Ripuarian. The ksh code is for the Kölsch language. Ripuarian is considered to be a language family. The consequence is that there IS no single Ripuarian orthography, language or culture.

The als.wikipedia is called Alemannic. According to ISO-639 these are four languages. The problem is that the als code is used for the main Albanian language. Albanian itself, with the code sq (ISO-639-3 sqi), is also considered a language family. The two main variants are the Albanian and the Kosovar languages. Two other members of the Albanian language family are spoken in Italy and Macedonia.
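
The two examples above can be put in a small table and checked by a program. This is only a sketch: the table layout and the function name are made up for illustration, the ksh and als rows just repeat what is said above, and the nl row is added by me as a case where the project and the ISO code agree.

```python
# Hypothetical table: what a project uses a code for versus what the
# ISO-639 code stands for (as described in the text above).
PROJECT_CODES = {
    "ksh": {"project_language": "Ripuarian", "iso_language": "Kölsch"},
    "als": {"project_language": "Alemannic", "iso_language": "Albanian"},
    # Added for contrast (my own example): project and ISO agree here.
    "nl": {"project_language": "Dutch", "iso_language": "Dutch"},
}


def conflicting_codes(table):
    """Return the codes whose project use differs from the ISO meaning."""
    return sorted(
        code
        for code, entry in table.items()
        if entry["project_language"] != entry["iso_language"]
    )


print(conflicting_codes(PROJECT_CODES))
```

A list like this, kept up to date, would at least make visible which of our codes say something different from what the standard says.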

Some of the most heated discussions on the requests for new projects are about the status of a language: is it a language or a dialect? Often the arguments are of a political nature. The inclusion of languages in ISO-639 has been political in the past. With ISO-639-3 many of these arguments have an answer in the many new language codes that have been created.

The result is that we have wikipedias like the ku, the fa and the sq wikipedia, where the language is now considered a language family and where a request can be made for recognition of a language that is part of that language family. There are more projects like these that I have not identified yet.

Another "nice" situation is the Low Saxon nds wikipedia. When you look at what Ethnologue has to say about the Low Saxon language family, then you get the impression that there is not really a single thing called Low Saxon in the Netherlands. The varieties of Low Saxon of the Netherlands are all there; they are named and have their codes.

The point that I am raising is that languages are a mess. The codes for our projects are as a consequence a mess. The procedures for new projects are a mess because of the politics and the codes we have come up with in the past. What we need are better guidelines on how our project codes relate to the ISO-639 codes. If the WMF says it uses the ISO-639 codes, then either those codes must be used as they are defined, or our own codes must be clearly different.

Last but not least, according to the terms of use, we are not allowed to extend the codes in the way that we do.