Tuesday, November 28, 2006

A new language on the Internet?

I received an email forwarded to me by Martin Benjamin (of the Kamusi project). It was written by a gentleman who wants to promote his mother tongue, the Bangubangu language. This is the first time that written material in this language has been created. The first project is a book in English, French and Swahili called “Teach yourself the Kibangubangou”.

I am thrilled with initiatives like this. The book is there; now what to do next? Do you print it through an organisation like Lulu? Do you advise making it a Wikibook? To what extent are concepts like copyright and licensing relevant and understood?

It would be great to make this project succeed and have kids learn their mother tongue. There are stumbling blocks. Google supports fewer languages than the Wikimedia Foundation; Google supports some hundred, and the WMF some two hundred and fifty. One of the problems is that there is not much material in any but the bigger languages, and Google does well to already be doing this much.

To make an impact, I think it is crucial for material, particularly in endangered languages, to be tagged correctly. This gives Google and the other search engines a fighting chance to function on little material. For the Bangubangu language, there is no proper tag yet. The question is whether the IETF will consider doing something about it. They hold firmly to the position that ISO 639-3 is not approved yet and that the bnx code is therefore not to be used. At issue is that a real need presents itself for this language.
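For anyone publishing Bangubangu material on the web, tagging comes down to putting a language code on the content. A minimal sketch in Python, assuming the HTML lang attribute is the mechanism search engines would read; the helper only checks that a code is shaped like an ISO 639-3 code, it does not consult the actual registry:

```python
import re

# ISO 639-3 codes are three lowercase letters; "bnx" is the code
# mentioned in this post for Bangubangu. This is a format check
# only, not a lookup in the ISO 639-3 registry.
ISO_639_3_PATTERN = re.compile(r"^[a-z]{3}$")

def tag_html(text: str, code: str) -> str:
    """Wrap text in an HTML span carrying a language tag."""
    if not ISO_639_3_PATTERN.match(code):
        raise ValueError(f"not a well-formed ISO 639-3 code: {code!r}")
    return f'<span lang="{code}">{text}</span>'

print(tag_html("Kibangubangou", "bnx"))
# → <span lang="bnx">Kibangubangou</span>
```

Until a tag is actually usable, of course, the markup above is exactly what the IETF position rules out.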


Sunday, November 26, 2006

Kanab Ambersnail

According to a heading on the English Wikipedia main page, the article about the Kanab Ambersnail was the 1,500,000th article. I think this is quite spectacular. Personally, I find it funny that I can write about it before I have read about it on Slashdot :)

Congratulations to all who make Wikipedia so special.


The I&I conference

Last Wednesday and Thursday I was at the annual I&I conference in Lunteren. This conference brings together many of the ICT (information and communication technology) coordinators of Dutch and also some Flemish schools. I had the privilege of attending for the second time. This year we had beautiful new Wikipedia leaflets, and Marc Bergsma demonstrated an OLPC motherboard; it got a lot of interest, particularly when it was learned that a full system could be available for Dutch schools for the school year 2008/2009.

One great project I learned about is called TEEM. TEEM is a project where teachers evaluate educational websites and CD/DVD-based resources for their use in classrooms. The way it is organised is such that I can understand why British teachers trust it as a resource. The way it is funded, however, creates a sad systemic bias. Reviews are paid for by publishers; the consequence is that open-content courseware will not be evaluated. This funding model also keeps out those publishers that do not pay for a review. This political choice by the government was meant to encourage innovation and competition. It is not unreasonable to suggest that it effectively costs British schools money, as they do not learn about what is available for free.


Friday, November 24, 2006

Whose language is it anyway?

Much of Microsoft's software has been translated into Mapuzugun, or is it the Mapudungun language? They did this in consultation with the Chilean government.

The Mapuche people have gone to court because they disagree that Microsoft or the Chilean government had the right to do this. The language, it is claimed, is theirs, and the translation was done without consulting the Mapuche people.

Many people do not understand or know what is behind this. Why would people be opposed to becoming part of the digital world? To me, the key thing to appreciate in the reporting is the accusation of "violating their cultural and collective heritage". There are competing orthographies for this language, and one is strongly favoured by the government.

So if I remember things well, this has everything to do with the involvement of the people who use the language. Mapudungun is spoken in both Argentina and Chile, and here the government of a nation and the most powerful company in the world are being taken on because they do not represent the Mapuche people and are therefore denied the right to decide for them.

In that light it makes perfect sense to go to court and insist that this is very much not wanted.


Friday, November 17, 2006

To the winner go all the spoils

When changing your name makes you money, would you do it? Would you change your name for a pig or a goat? Accepting such an offer could make me "Pig Meijssen"; it would go well with my mascot. The name these people changed their name to, "Hornsleth", is perfectly honourable. This "project" by the Danish artist Kristian Von Hornsleth is very much intended to demonstrate that international aid fails people. It fails people because it assumes that "the way of the donor" is best.

If you want to be helped, you have to do this, that, whatever. For me, the great thing about Wikipedia is that it helps. It brings information to people. Its intention has always been to bring people information in their language. This means that culture, people and language are respected, and that people are enabled to help themselves.

For the Western languages there is an abundance of great information. For many other languages, Wikipedia still has to take root. As the world becomes more wired, I expect Wikipedia will take root and become relevant as a resource for the culture, the people and the language.

When it is considered acceptable to bring information only in English, French, Arabic, Chinese or whatever is considered a big language, the notion of the Neutral Point of View that the Wikimedia Foundation offers is diminished. An NPOV exists because all points of view can be brought in through the diversity of languages and cultures reflected in the 250 Wikipedias that currently exist.

It may be efficient to concentrate on the "important" languages... but I do not want to consider the loss.


Sunday, November 12, 2006

The worth of MediaWiki

According to the dark art of economics, everything can be valued. Everything can be given a price tag. People may object to this, and do so on principle, but sorry, it has been done for MediaWiki.

MediaWiki has a worth of $3,810,127 when you assume that a developer costs $55,000 a year, plus some other assumptions. It was "valued" as of the first of January 2006, and as the year is almost gone, it will be worth a lot more by now.
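For what it is worth, the two figures imply a straightforward bit of arithmetic. A minimal sketch, assuming the valuation is essentially estimated development effort multiplied by developer cost; the real methodology behind the number is not given here, so this is illustration only:

```python
# Back-of-the-envelope reading of the valuation, using only the
# two dollar figures quoted in this post.
DEVELOPER_COST_PER_YEAR = 55_000  # dollars, as quoted
REPORTED_VALUE = 3_810_127        # dollars, as quoted

# If value = effort x cost, the implied effort is:
implied_person_years = REPORTED_VALUE / DEVELOPER_COST_PER_YEAR
print(f"implied effort: {implied_person_years:.1f} person-years")
# → implied effort: 69.3 person-years
```

In other words, the figure amounts to saying MediaWiki represents roughly seventy person-years of paid development.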

Another economic truism is that money makes money. Because of the success of MediaWiki, more people will start developing for it. I am not in a position to deny this: WiktionaryZ extends MediaWiki. With new functionality, much more becomes possible.

There is one thing missing in this argument: MediaWiki is a tool. A tool that produces something that is far more valuable. Wikipedia is not the only project that MediaWiki has enabled. I am sure some people who understand this dark art of valuation will be able to come up with a better number.

Given that money makes money, it is possible to leverage the content that MediaWiki generates. Much proprietary content is not really relevant because it does not get exposure; by making content available, it can get exposure. Material was often created precisely to get exposure. By keeping it proprietary, and thereby hidden from view, it does not do all that it could. By making it available under a free license, new opportunities arise.

MediaWiki enabled, among others, Wikipedia. WiktionaryZ has potential; my hunch is that, like MediaWiki, it will enable the creation of content in a different way. I hope and expect that it will help us negotiate the release of much content under a free/open license and allow us to collaborate with many organisations and people.


Friday, November 10, 2006

Running the interwiki bot for Wiktionary

I run the interwiki functionality of the pywikipedia bot on all the Wiktionaries. It is something I started, and it is the kind of public service that needs doing. It links all the words that are spelled the same by adding "interwiki" links; these are the links you see on the left-hand side indicating that there is also information in another language.

I have done this now for over a year, and what I just noticed is that the number of words I do not understand is growing rapidly. On the one hand this is to be expected, as it is in line with the rapid growth of projects like the Vietnamese Wiktionary. What now starts to happen more often is that multiple Wiktionaries have words in common. That is what I see when I watch the bot.
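The core of what the bot does can be sketched in a few lines of Python. This is illustration only: plain dictionaries stand in for the live wikis that pywikipedia actually talks to, and the titles are made up:

```python
# Toy stand-in for the Wiktionaries: language code -> set of page titles.
# The real bot queries the live wikis; these sets are invented.
wiktionaries = {
    "en": {"water", "fiets"},
    "nl": {"water", "fiets", "huis"},
    "vi": {"water"},
}

def interwiki_links(title: str, home: str) -> list[str]:
    """Return [[lang:title]] links for every other wiki with this exact title."""
    return [
        f"[[{lang}:{title}]]"
        for lang, titles in sorted(wiktionaries.items())
        if lang != home and title in titles
    ]

print(interwiki_links("water", "en"))
# → ['[[nl:water]]', '[[vi:water]]']
```

The matching is purely by spelling, which is exactly why the bot works across languages its operator cannot read.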

In a way it would be fun to have WiktionaryZ in there. Currently we have 159,004 Expressions and 10,557 DefinedMeanings. Based on the Expressions we would be the fourth project in size; it would be more reasonable to use the DefinedMeanings for the comparison, and that would have us as the 26th in size.

Comparing Wiktionary with WiktionaryZ is comparing apples and oranges. Where Wiktionary has each word only once, WiktionaryZ counts a word as existing in a language. Where there can be many red-linked articles on a Wiktionary page, the WiktionaryZ Expressions are implicitly there.

It makes better sense to appreciate what the implications of the numbers are. In lexicography, size counts. Only when people have a good chance of finding the information they are looking for will they find a resource useful. That is one reason why it makes sense to concentrate on certain topics or domains. WiktionaryZ is rich in ecological terminology thanks to the information we got by including the GEMET thesaurus. By working on the OLPC children's dictionary, we get a lot of the basic vocabulary that is the bread and butter of dictionaries.

Tuesday, November 07, 2006

Some thoughts on Alexa

Alexa is a website that provides an indication of the popularity of websites on the Internet. Whatever I do does not register, as Alexa only measures the use of Internet Explorer, which is statistically becoming a less and less brilliant idea. Given the number of people using Firefox on WiktionaryZ, I am sure Alexa does not know where much of the alpha crowd hangs out.

We had our downtime this week, and I have been looking at how this affects our standing at Alexa. Sure enough, after two days of downtime, we hit the 900,000th place. Now that we are back up, we have rebounded nicely, and today we are already back at number 478,316 for the weekly average.

WiktionaryZ has its own statistics; there you will find that our daily average hits did take a pounding. Given no more downtime, and given that the trend of continued interest continues, the numbers will improve, but the average will remain depressed for a while. At this moment all this is not crucial. When we want people to rely on WiktionaryZ, though, our service level needs to be much improved.

In a conversation with a developer I once said: if you respect our users, you have to treat them as if they cost us $150 an hour. The point is that when you realise how valuable contributors are, you are more likely to give them the respect they are due. With professional people using wikis, there is actually money paid for the time spent editing; this makes it more plain, but it does not make a difference. Editors are to be respected, and it is important to make the most of what they do for us.


Wednesday, November 01, 2006

The semantic web to the rescue?

The reporting on the Internet Governance Forum in Athens is mighty interesting. Yet another nice article on the BBC website with some thought-provoking ideas.

I find it really interesting that spoken languages are considered. However, at this stage the Internet is in practice very much oriented towards written languages, and given the number of stumbling blocks that exist to integrating languages other than those using a Latin script, I am afraid this is just a red herring.

I was also amused to see that the semantic web was brought to the fore as one solution to the problem of linguistic diversity. Yes, it is intended to be understood by computers. But computers are used by people, and the semantic web is decidedly English. This raises the question of how a computer that apparently understands English communicates with a user who does not.

There are, however, some great things to be said about the semantic web; first of all, the terms used should be unambiguous. This in turn means that it should be possible to translate them into languages other than English. This is a challenge that we face in WiktionaryZ. We are able to have semantic relations, and our semantic relations do translate into the language of the user interface. So once the terms have been translated, WiktionaryZ relations can be understood not only by people but also by computers.
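A sketch of the idea in Python: the relation itself is stored language-neutrally, and only its label is looked up per user-interface language. The identifiers and translations here are invented for illustration and are not WiktionaryZ's actual data model:

```python
# Labels for language-neutral relation identifiers, per UI language.
# Relation id and translations are made up for this example.
RELATION_LABELS = {
    "hypernym": {"en": "broader term", "nl": "breder begrip"},
}

# A relation triple is stored once, independent of any language.
relations = [("dog", "hypernym", "animal")]

def render(triple: tuple, ui_lang: str) -> str:
    """Show a stored triple with its label in the user's UI language."""
    subject, relation, obj = triple
    label = RELATION_LABELS[relation][ui_lang]
    return f"{subject} [{label}] {obj}"

print(render(relations[0], "nl"))
# → dog [breder begrip] animal
```

The same stored fact renders for an English and a Dutch user alike; a computer only ever sees the neutral identifier.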

This is a good moment for a disclaimer: WiktionaryZ is pre-alpha software. Many of the issues that have been tackled in the development of the semantic web we have not yet considered, let alone touched. We hope and expect that we will be allowed to stand on "the shoulders of giants".


Is this a "good" word

MediaWiki is the software that drives Wikipedia, but also WiktionaryZ. It is probably one of the best pieces of software when it comes to internationalisation and localisation. This is demonstrated by the many localisations that have already been done for the software. You can expect me to sing the praises of MediaWiki; why else develop WiktionaryZ on top of it?

This does not mean, however, that all is well. I have written before about the problems with the Neapolitan language and its issue with the '' combination. Today I learned that a language called Hai||om uses the "pipe" character in its name, and consequently I cannot make it work properly in a MediaWiki installation.
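To see why the pipe is such a nuisance: in wikitext, "|" separates a link target from its label, so a simple parser cannot tell the pipes in "Hai||om" from the separator. A small Python illustration (this is not MediaWiki's actual parser):

```python
def parse_link(wikitext: str) -> tuple:
    """Split a [[target|label]] link the way a naive wikitext parser would."""
    inner = wikitext.strip("[]")
    # Everything before the first "|" is taken as the target.
    target, _, label = inner.partition("|")
    return target, label or target

print(parse_link("[[Dutch|the Dutch language]]"))
# → ('Dutch', 'the Dutch language')
print(parse_link("[[Hai||om]]"))
# → ('Hai', '|om')  -- the name is truncated at the first pipe
```

The second result is the problem in miniature: the link target becomes "Hai" and the rest of the name is misread as a label.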

There are ways around such a problem; I can use one of the alternate names, San or Saan. I can expect that there will be no Wikipedia created in this language (it has only 16,000 speakers). But the point is that even a system that does really well is only as good as the next language that proves it has an issue with its presumptions.