Tuesday, February 28, 2006

What would you have done that makes sense

I talked with SJ the other day. We talked about many things, but the thing that comes back to me is: what would we have people do if they wanted to spend serious effort on things that serve the aim of the Wikimedia Foundation?

The aim of the Wikimedia Foundation is to bring all knowledge to people in their own language. The aim is breathtaking; it is absolutely audacious. How do you go about making this happen? In a way it is a journey you embark upon, and there are many small things along the way.

So what are the things that make a difference, things that can be done within half a year? How about creating fonts, fonts for languages that do not have a Free font yet, or even defining a script for a language that does not have one? How about creating inflection boxes for parts of speech for WiktionaryZ? How about thinking wildly about how you could do something you take for granted, and doing it in a new way? How about writing documentation for MediaWiki or for a project ("Wikipedia: das Buch" I do recommend :) )?

How about writing software to ease the translation of wiki content? This could be done by having OmegaT read and write directly to a MediaWiki resource. How about having people work on content that is underdeveloped? Yes, the English Wikipedia will have a million articles, but where is the content in Swahili, Farsi or Hopi? Even a million articles will not tell you about all the villages in Ghana, Honduras or Belarus.
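As a sketch of how such a tool could talk to a wiki: MediaWiki serves the raw wikitext of a page through its index.php?action=raw interface, which a translation tool could read (and, with an authenticated session, write back). This is a minimal, hypothetical Python sketch; the URL and function names are illustrative and not part of OmegaT:

```python
# Hypothetical sketch: how a translation tool could read wiki content
# directly, by fetching the raw wikitext of a page through MediaWiki's
# index.php?action=raw interface.
from urllib.parse import quote


def raw_wikitext_url(base_url, title):
    """Build the URL that returns the raw wikitext of a page."""
    return f"{base_url}/index.php?title={quote(title)}&action=raw"


def fetch_wikitext(base_url, title):
    """Fetch the raw wikitext of a page (requires network access)."""
    from urllib.request import urlopen
    with urlopen(raw_wikitext_url(base_url, title)) as response:
        return response.read().decode("utf-8")
```

Writing changes back would additionally need a login and an edit submission, but reading alone already lets a translator work against live wiki content.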

These are just some of the things that I can come up with without trying. What would you have a student, or a few students, do when they have half a year to work on a "term project"?


Thursday, February 23, 2006

Upper or lower case in Wiktionary

Yesterday Brion made an executive decision: ALL wiktionaries will have lower-case articles from now on. He changed the database so that lower-case articles are allowed and ran a program that changed all existing entries from upper case to lower case.

The result was predictable: some wiktionaries, like the Norwegian one, are really happy. They were anticipating this change to their database; they were ready for it, and the major problems were resolved in a few hours. Not so with some of the other wiktionaries, where the change was completely unanticipated and a lot of feathers were ruffled. When you do not have a plan, and the change goes against what you wish, it does not make for an effective change.

The other thing that was not really pleasant is that the bot used to create the interwiki links is broken, yet again. It does work after a fashion, but it does not do all the work that it is supposed to do.

This month will probably set another record for the RobotGMwikt bot. As so many wiktionary entries have changed, they have to be changed on ALL projects. This month may be good for some 500,000 edits. It is indeed great that we have interwiki links, but the technology is not efficient.
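To illustrate why one changed title ripples into so many edits: for every renamed entry, an interwiki bot has to add or update a link line on each other language edition. A toy sketch of that fan-out (the language codes and title are examples, not the bot's actual code):

```python
# Toy sketch of the fan-out an interwiki bot performs: for one entry,
# generate the interwiki link lines that must exist on every OTHER
# language edition of that entry.
def interwiki_links(title, language_codes, skip=None):
    """Return the interwiki link lines for `title`, omitting the home wiki."""
    return [f"[[{code}:{title}]]" for code in sorted(language_codes) if code != skip]
```

With dozens of wiktionaries, a mass rename of entries means each renamed title multiplied by the number of other editions, which is how a single configuration change turns into hundreds of thousands of edits.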


Saturday, February 18, 2006

White spelling

When you consider Wikimedia projects, one of the leading guidelines is the NPOV. The Neutral Point Of View is one of the most important things in dealing with differences of opinion. Particularly for Wikipedia it is really relevant. It gets the sting out of many arguments because what you can quarrel over is limited. Prove your point and show that it is a fair representation.

Getting into an argument about NPOV on a Wiktionary is different: you define a concept, and when someone is of a different opinion, it often means that they are defining another concept. Technically these are DefinedMeanings in WiktionaryZ. So today I had one of these NPOV situations, something to do with my mother tongue.

The official spelling is published in what is known as the "groene boekje"; the latest version was published in 2005. There are a few problems with it: it is a proprietary list, so organisations like OpenOffice.org cannot use it to build new spell checkers, and it is also not available as a list of Expressions for WiktionaryZ. The data design is such that it allows for spellings that are correct according to an authority.

The other big problem is acceptance. Yes, it is the official spelling, but what if people, and certainly big publishers, do not accept it? There is a new movement called "witte spelling" that intends to create an alternative that is less confusing. This will result in a list of words spelled correctly according to that movement; the outcome is a "green" and a "white" spelling. When we get the witte spelling as a resource, we can create a spell checker for OpenOffice.org, and we can inform people about the correct spelling according to the white spelling.
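Assuming both lists were available as plain word lists, turning them into spell-checker input would be straightforward. A hypothetical sketch; the words and authority labels here are made up for illustration:

```python
# Hypothetical sketch: merge two spelling lists, labeling each word with
# the authority that sanctions it, then extract a plain word list per
# authority for use by a spell checker.
def merge_spellings(green, white):
    """Map each word to the set of spelling authorities that accept it."""
    words = {}
    for word in green:
        words.setdefault(word, set()).add("groene boekje")
    for word in white:
        words.setdefault(word, set()).add("witte spelling")
    return words


def wordlist(words, authority):
    """Extract a sorted word list for one authority."""
    return sorted(w for w, auth in words.items() if authority in auth)
```

The point of keeping both labels is exactly the data-design idea mentioned above: a spelling is not simply "correct", it is correct according to some authority.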

From an NPOV point of view, doing it in this way is problematic. The official spelling gets underrepresented, but how can we do it justice when it is proprietary? In several ways the official spelling becomes less relevant.

If anything, this is a great example that making what is supposed to be a standard proprietary is a self-defeating strategy.


Thursday, February 16, 2006

Building a community of developers

The number of people who work on the MediaWiki software is limited, while there is a growing need for functionality. This need is the consequence of MediaWiki being considered the best of breed of the wiki engines. It is a consequence of the use of wikis for knowledge management and for documentation efforts. MediaWiki is used a lot.

Many needs for improvement arise from within the Wikimedia projects. These needs are typically taken care of by people who scratch their own itch. I am particularly interested in things that help me with the WiktionaryZ project; other people have a need for Wikinews or WikiSpecies functionality.

One problem with the current model is that the developers are nominally all volunteers. When you analyse it, this no longer works, and it is also not true: the best developers are being snapped up by organisations, and consequently they either work for interested parties or are no longer available for MediaWiki work.

This means that it becomes more and more difficult to get things programmed. WiktionaryZ is an ambitious project. It needs available programmers, and it needs people who can program and who know languages other than English and the "European" languages. This need is felt more and more acutely.

I hope to develop the contacts that I have made in Africa: a programmer who came recommended to me by someone who manages a great Swahili project as best he can. Erik wrote a nice specification for something called InstantCommons, and we hope they will develop this for us. When this works out, we will have some new MediaWiki developers.

As we typically do, we discussed what to do when this is a success and we have a need for MORE developers. Because of my interest in Iran (Farsi and Luri), I said that trying something similar there would be a good idea. This got me into an interesting argument: Iran, with its present policies and president, is seen more and more as an enemy, and some people consider this a reason not to use the contacts that exist. From my perspective, this is not rocket science; it is about words and how they are used and understood. If anything, we should collaborate on this.

Another strategy that we could adopt is having a competent developer, someone who is also a good communicator, help students when they do term projects related to MediaWiki, any project related to MediaWiki. I think up to 50 projects could be handled, given that a project takes half a year; this would mean on average 100 students. The benefit would be twofold: a lot of work can be done this way, and we would probably retain something like 5% of the people as developers for a period of at least a year.

Building a community of developers is essential. It is however not that easy :)


Friday, February 10, 2006

What is a language

When I was in school, I had to learn several definitions of "intelligence". My favourite still is: "Intelligence is what the intelligence test measures". Many people use a similar definition for what a language is: "a language is a language when it has its language code".

From a technical point of view, such a definition is beneficial. For all the major languages it is simple; it is obvious that there is a language code that describes them. For some languages there is a language code because people in the West take an interest; tlh is one code for one such language. For other languages it is more problematic: some of them have their code, but that can make the problem worse. Some tools rely on the codes being there and applicable. OmegaT, a CAT tool, for instance relies on the codes that exist in its programming environment. This is a serious problem because this programming environment supports ISO 639-1, and only with ISO 639-2 did a code for the Neapolitan language become available. Consequently, a translation tool does not support MANY languages. Even ISO 639-2 does not really help; Kurdish, for instance, is acknowledged not to be a single language but a language family that consists of at least three languages. These languages are acknowledged in ISO/DIS 639-3.

While ISO/DIS 639-3 is a huge improvement, it gets opposition from a few quarters. Many people, particularly developers of software, are of the opinion that some 8,000 languages is too many. Other people are of the opinion that the number of languages is not big enough. Also, some languages that had support can come to be considered a dialect of another language, as is happening with the Twi language. How this will be appreciated by the people who speak Twi is anybody's guess. Twi is considered to be part of the Akan language, and the article on the Akan language is indeed another example of the systematic lack of attention Africa gets.

For a CAT tool, it is really relevant that it allows its users to use the tool to its fullest potential. This does not mean that standards should not be supported; it means that multiple standards should be supported AND that you can introduce user-defined languages as well.
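What that could look like in practice is a small registry that accepts codes from the various ISO 639 parts and additionally lets users register private codes (RFC 3066 reserves the "x-" prefix for private use). A sketch with a few illustrative entries; a real tool would of course load the full code tables:

```python
# Sketch of a language registry that supports multiple ISO 639 parts
# AND user-defined languages. The sample entries are illustrative.
class LanguageRegistry:
    def __init__(self):
        # A tiny sample of standard codes; a real tool would load full tables.
        self.codes = {
            "nl": "Dutch",             # ISO 639-1
            "nap": "Neapolitan",       # ISO 639-2; it has no 639-1 code
            "ckb": "Central Kurdish",  # ISO/DIS 639-3 splits Kurdish
        }

    def add_user_language(self, code, name):
        """Register a language that the standards do not cover yet."""
        if code.startswith("x-"):  # private-use convention from RFC 3066
            self.codes[code] = name
        else:
            raise ValueError("user-defined codes should use the x- prefix")

    def name(self, code):
        return self.codes.get(code)
```

A design like this means the tool keeps working for the major languages, gains the long tail through ISO 639-2 and 639-3, and never blocks a user whose language has no code at all.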


Wednesday, February 08, 2006

Gmail now with instant message functionality

The great thing about Google tools is that they are useful. Not only useful: they have a knack of taking something that everybody does and then adding something to it that makes it better. Gmail was great; all the mail that you receive, online AND presented in a way that makes sense.

When you receive as much mail as I do, most of it comes overwhelmingly from the same "source": mailing lists. Many people are subscribed to the same mailing lists, and that is the reason why most people use Gmail. It would be cool if mail from lists like the WMF mailing lists could be identified, because then it could be stored differently. The content is not personal, and it would be nice if mailing-list mail could be recognised and filed separately.
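Identifying mailing-list mail is actually easy in principle: list software adds a List-Id header (RFC 2919) to every message it sends, so a filter only has to look for that header. A sketch, with a made-up folder scheme:

```python
# Sketch of a mailing-list filter: messages from list software carry a
# List-Id header (RFC 2919), which can be used to file them separately.
from email import message_from_string


def folder_for(raw_message):
    """Return a folder name: the list id for list mail, 'inbox' otherwise."""
    msg = message_from_string(raw_message)
    list_id = msg.get("List-Id")
    if list_id:
        # Keep only the address-like part between angle brackets, if any.
        return list_id.split("<")[-1].rstrip(">").strip()
    return "inbox"
```

This is essentially what a "store mailing lists differently" feature would do behind the scenes: one header check, one routing decision.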

The great news is the chat functionality that was added today. People who do not use Skype or IRC but who have Gmail can now be chatted with. Two people I have communicated with for quite some time are now available for a chat; really powerful, and guess what, one of them uses Google Talk. That was really sweet.


Wednesday, February 01, 2006

MediaWiki is secure software

MediaWiki, the software behind projects like Wikipedia and Wiktionary is software written with an eye for security. The practices that are employed prove that security is taken seriously. The point is that people do not always understand what security means and what security is provided.

Many of the MediaWiki implementations allow everybody to create and change articles. This is a conscious decision; it is part of the formula, and consequently this is not a problem from a security point of view. As a consequence, the problem of maintaining quality content and preventing people from vandalising the content is a management problem. The tools to manage this problem are diverse, but many tools that are considered security tools are usable.

Often vandals do not know that what they do is useless. People frequently add links to all kinds of websites in order to increase their Google rating, but the MediaWiki software indicates to the Google crawler NOT to include these external websites in its ratings.
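The mechanism is simple: when wikitext is rendered to HTML, external links get a rel="nofollow" attribute, which tells search-engine crawlers not to count the link towards the target's rating. A simplified stand-in for what a wiki engine's renderer does (this is not MediaWiki's actual code):

```python
# Simplified stand-in for a wiki renderer: external links are emitted
# with rel="nofollow" so search engines do not count them, which takes
# the reward out of link spamming.
def render_external_link(url, label):
    """Render an external link with the nofollow hint for crawlers."""
    return f'<a rel="nofollow" href="{url}">{label}</a>'
```

The spammer's link still appears on the page, but it no longer buys any search-engine ranking, which is why the effort is useless.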

Blocking IP ranges and users because of persistent vandalism is one tool. Trusting logged-in users more than anonymous users is another. There are many Wikimedia projects, and at present all of them still have their own users. In February, it is planned to develop single sign-on for the Wikimedia projects. Single login has been on the wishlist of many of the people who are active on multiple projects.

With single login, in essence a management issue with security implications, it becomes feasible to use this as a stepping stone for the implementation of security features that help with the management of vandalism.

The feature that I would like best is to differentiate the strength of authentication based on where a user comes from. When a user comes from a school with a history of vandalism, it makes sense not to allow anonymous edits. Many of these types of soft security measures are possible.
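A sketch of that soft-security idea: keep a list of network ranges with a vandalism history and refuse anonymous edits only from those ranges, while everyone else can still edit without logging in. The ranges below are made up for illustration:

```python
# Sketch of origin-based soft security: anonymous edits are refused only
# for addresses inside a watched network range; all others are allowed.
from ipaddress import ip_address, ip_network

# Hypothetical ranges, e.g. a school network with a vandalism history.
WATCHED_RANGES = [ip_network("192.0.2.0/24")]


def anonymous_edit_allowed(ip):
    """Return False when the address falls inside a watched range."""
    addr = ip_address(ip)
    return not any(addr in net for net in WATCHED_RANGES)
```

This is softer than an outright block: the users behind the watched range can still edit, they just have to authenticate first.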

On mailing lists about Wikimedia, there was talk about a patch that allows users to log in by authenticating themselves with OpenID. The interesting thing was that people had two issues with this: first, it would not allow me to use my MediaWiki ID as an OpenID; second, to some extent OpenID is going to fit into the YADIS framework (Yet Another Decentralized Identity Interoperability System).

YADIS is interesting because it is linked to the eXtensible Resource Identifier or XRI, a standard that is developed by OASIS. It is also linked to the W3C (YASB, yet another standards body :) ).

In the end it comes back to standards: if the WMF were to support two-way YADIS authentication, it would make for a VERY relevant implementation of security-related functionality. This could provide for better management in the fight against vandalism. It is, however, important that what we provide is a standard. That is why I am of the opinion that the WMF should support standards.