Tuesday, October 31, 2006

The importance of good standards

At the Internet Governance Forum Mr Vint Cerf has said that changing the way the Internet works to accommodate a multi-lingual Internet raises concerns. The question raised in the BBC article is whether it is a technical issue or not.

Interoperability on the Internet depends on standards that are designed for it. The current URL system is based on the Latin script. This was a sensible choice in the days when computing was developed in America, in a world where the only script supported by all computers was the Latin script. These days practically all computers support UTF-8, so any modern computer can store and exchange text in any script out of the box. This does not mean, however, that every computer is able to display every script; even my computer does not support all scripts, and I have spent considerable time adding support for all kinds of scripts to my operating system.
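The gap between what a computer can store and what an Internet address accepts can be shown in a few lines of Python. This is only a sketch: the host name "bücher.example" is made up, and the ASCII form comes from Python's built-in "idna" codec, which is one existing convention for squeezing other scripts into Latin-only host names and not something this post prescribes.

```python
# Illustration only: "bücher.example" is a made-up host name.

word = "лошадь"
utf8_bytes = word.encode("utf-8")  # text in any script can be stored as UTF-8
print(len(word), "characters,", len(utf8_bytes), "bytes")

host = "bücher.example"
ascii_host = host.encode("idna")   # Python's built-in IDNA codec
print(ascii_host)                  # the Latin-only form that goes over the wire
print(ascii_host.decode("idna"))   # and back to the readable form
```

The text itself travels without trouble; it is the address that has to be rewritten into a Latin-only code that no human would call readable.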

The next issue is that content on the Internet should properly indicate what language it is in. Here there is a big technical issue: the standards only acknowledge the existence of a subset of languages. The result is that for many languages it is not possible to indicate, using the existing standards, what language a text is in.

Yes, the net will fragment into parts that will be "seen" by some and not "seen" by others. This is however not necessarily because of technical restrictions, but much more because the people and the services involved do not support what is in the other script, the other language. When I, for instance, ask Google to find лошадь or paard, I get completely different results even though both times I am asking for information about Equus caballus. In essence this split of the Internet already exists. The question seems to me to be much more about how to make a system that is interoperable.

The Internet is interoperable because of the standards that underlie it. With the emancipation of Internet users outside of its original area, these standards have to become usable for the users of the Latin, Cyrillic, Arabic, Han and other scripts alike. It seems to me that at the core of this technical problem is the fact that the current standards are completely Latin oriented and also truly focused on what used to be good. At this moment the codes that are used are considered to be human readable. I would argue that this is increasingly not the case, as many of these codes are only there for computers to use. When this becomes accepted fact, it will matter less what these codes look like, because their relevance will be in their being unambiguous.

For those who have read this blog before, it will be no surprise that the current lack of support for ISO-639-3 for language names is one of my hobby horses. As I have covered this subject before, I will not do so again. What I do want to point out is that insisting on "backwards compatibility" is more likely to break the current mould of what is the Internet than to preserve it.


Sunday, October 29, 2006

New functionality of Firefox and blogging

I use Firefox as my browser. I have upgraded it to the latest version, and now my English is spell checked for me in real time. The thing I will probably like best is that a word like "localising" is now seen as correctly spelled. This is because Firefox allows me my British English. Blogger, although nice, expects people to use American English spelling. This is useless if I were to blog in any other language.

Another nice thing is that it allows me to accept words that are correct in the texts that I write; WiktionaryZ is such a word .. and so are MediaWiki, Wikimedia and Wikipedia.

By having spell checking done client side, the server functionality becomes cheaper for the service provider as well. The quality goes up too. All in all, a good reason to upgrade to the latest Firefox.


Monday, October 23, 2006

The NPOV of language names

Yesterday Sannab pointed me to this posting on the linguistlist. The gist was that there had not been full consultation with the academic community about the adoption of the Ethnologue database for the ISO-639-3 codes of languages. A secondary argument was that Ethnologue is primarily a religious organization, and the question was raised whether it could be ethical to have such an organization be the guardian of what is to be considered a language.

This e-mail is a reaction to what Dr. Hein van der Voort wrote in the SSILA-Bulletin number 242 of August 22 of 2006.

The problem I see with the stance taken is not so much the realization that some of the Ethnologue information needs to be curated, nor the fact that some would consider Ethnologue to be the wrong organization to play this part. The problem is that no viable solution is offered. The need for the ISO-639-3 list is not only to identify what languages there are from a linguistic point of view; it very much addresses the urgent need to identify text on the Internet as being in a specific language.

At WiktionaryZ we are creating language portals. These language portals are linked into country portals; both countries and languages have ISO codes. When I had questions about language names, Ethnologue was really interested in learning what I had to say about what are, to me, obscure languages. The point here is that Ethnologue wants to cooperate. Some people do not want to cooperate for ideological reasons and at the same time do not provide a viable alternative. This is from my point of view really horrible. The need for codes that are more or less usable is expanding in Internet time and not in the glacial time that is the time of academics.

When WiktionaryZ proves itself and becomes a relevant resource for an increasing number of languages, all kinds of services will be built on the basis of it, using standardized identification for content. The ISO-639-2 code is inadequate. It is not realistic to expect ISO to review its decision at this stage and not include Ethnologue. It is not realistic to expect such a review without providing an alternative that is clearly superior to what ISO-639-3 is. It is clearly better to improve together on what is arguably in need of improvement than not to provide the tools to work with in the first place.



Wednesday, October 18, 2006

Learning a language .. because you must

When you want to go live in another country, to live there permanently, it is best to know the country, the language. When you want to emigrate to the Netherlands, it is often required that you first pass a test that shows that you have some basic ability speaking Dutch. This test is to be passed while still abroad. This test can be taken at a Dutch embassy; both for the embassy and for the people who have to take this test, it is a logistical challenge.

The test is to find out whether the "A1-" level of comprehension exists. People need to be able to listen to someone who speaks Dutch SLLOOWWWLY and uses a limited range of words. This list of words is finite. It is likely that many of these words are the same words that WiktionaryZ needs for its OLPC project.

Many of the techniques that you would use in a school are the same as the ones needed to prepare for this exam. You need sound files; you may want illustrations, both pictures and clips; and you want definitions in the many languages that people understand.

Given a list of words expected to be known for the "A1-" exam, it would be easy to get the communities of the would-be emigrants to add translations for both the word and the definition. As there is a group of people who do not read or write, a sound file needs to be produced for both as well.

The next thing is making this content available on the Internet and serving it as a public service. Maybe there would even be public money to make this a public service. Then again, even if there is no money available for it, it is a nice challenge. And when you can do this, there are other countries that people emigrate to, even people from the Netherlands :)


Monday, October 16, 2006

How to integrate wordlists in WiktionaryZ

There are many GREAT resources on the Internet. One I (re)discovered the other day is http://dicts.info. It provides information for some 79 languages and the information they provide is Freely licensed; the data can be downloaded for personal use as it can change rapidly.

So how do we integrate such information in WiktionaryZ? WiktionaryZ insists on the concept of the DefinedMeaning. As this is central to how WiktionaryZ works, it is crucial that we have the concept defined. The dicts.info data is split into two parts: a "from" part and a "to" part. The translations include synonyms and alternate spellings.

An application that is to include these translations could work like this: when an Expression is found in the "to" language and the translation is not there already, a user is shown the WiktionaryZ content with the suggestion to add the translation. This way the new information is integrated into WiktionaryZ.
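The workflow described here could be sketched roughly as follows in Python. Everything in it is an assumption made for illustration: the `DefinedMeaning` class is a simplified stand-in and not the WiktionaryZ data model, `suggest_translations` is a hypothetical helper, and the word pairs are invented sample data rather than dicts.info content.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class DefinedMeaning:
    """Hypothetical, simplified stand-in for a WiktionaryZ DefinedMeaning."""
    definition: str
    # language code -> spellings already attached to this meaning
    expressions: dict[str, set[str]] = field(default_factory=dict)


def suggest_translations(meanings, pairs, from_lang, to_lang):
    """Return (meaning, new_word) pairs for a human to review.

    A pair is suggested when the "to" word is already attached to a
    meaning and the "from" word is not. Nothing is added automatically.
    """
    suggestions = []
    for from_word, to_word in pairs:
        for meaning in meanings:
            has_to = to_word in meaning.expressions.get(to_lang, set())
            has_from = from_word in meaning.expressions.get(from_lang, set())
            if has_to and not has_from:
                suggestions.append((meaning, from_word))
    return suggestions


# Invented sample data.
meanings = [DefinedMeaning("large domesticated hoofed mammal",
                           {"nld": {"paard"}, "eng": {"horse"}})]
wordlist = [("лошадь", "paard"), ("собака", "hond")]

for meaning, word in suggest_translations(meanings, wordlist, "rus", "nld"):
    print(f"Add {word!r} to: {meaning.definition}?")
```

The sketch only produces suggestions. A word such as "paard" can belong to several meanings, so the decision to attach the translation stays with a person who knows both languages.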

What this "how to" still needs is possibly some discussion on the finer details, but certainly someone who will take up this challenge and develop this for us.


Saturday, October 14, 2006

Eating your own dogfood

WiktionaryZ is about lexicology, terminology and ontology. Consequently you would expect that these concepts are in there .. Now they are.

For me, not understanding a word and looking up what it means creates a moral obligation to include it in WiktionaryZ .. the latest word I did not know was divot. It was used on the BBC news website and yes, once you know it you can picture what it is. The word in Dutch still escapes me :)


Wednesday, October 11, 2006

Medical terminology

Yesterday I read an article on the BBC news website claiming that the term schizophrenia is invalid. The article points out that schizophrenia is not a single syndrome but a multitude of syndromes. The problem with the word is that, because it is understood to be a single syndrome, many patients are treated in a "one cure fits all" fashion. This is tragic, as schizophrenia is seen as something that cannot be treated, which is not necessarily true.

As we do not have much medical data yet, I have added the word to WiktionaryZ. WiktionaryZ will soon include an important resource of medical data, the UMLS. This will certainly include words like schizophrenia. I have added the source on which I based the definition. With such a large body of medical terminology, it can be expected that many people will find their way to WiktionaryZ to learn what this medical terminology means. These people in turn will be interested in creating definitions that are consistent with what the scientific field considers something to be.

The word schizophrenia is well entrenched. It has a specific meaning to laymen, and my understanding from the BBC article is that many professionals also need to learn about the ambiguity of the term. The question is: when is there enough support to deprecate the old meaning of a well known word like schizophrenia? How will peer review work out in a wiki, and how will the "public" react to these definitions?


Tuesday, October 10, 2006


Yesterday I met some people at the Universiteit van Amsterdam; the topic was their use of a content management tool called Sakai. This tool is meant to provide a collaboration and learning environment for education. It is a tool that is used very much in universities.

At the UvA they want to use it as a shared environment for Dutch and Iranian students. This is in and of itself a splendid idea. The software is open source, so much can be done with it. I recommended that they localize the software into Persian in order to provide a friendly environment. For me one of the bigger challenges was that Sakai provides a wiki environment, but with a high level of authorization controlling what people can and cannot see. The challenge is how to get such an environment to the tipping point where its community takes off and becomes autonomous in its actions. Carving it up makes this much more problematic, I expect.

The people who wrote the software, however, were braindead when it came to the use of their software in other languages. I learned from the University of Bamberg that they will not use it because localization is done by changing texts that can be found in the source code. A university in Spain, I was told, decided not to upgrade the software because it was not feasible to localize the software again.

It is sad when a tool with such promise is dead in the water because it was not considered that in order to be useful you have to allow for proper localization.
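What "allowing for proper localization" usually comes down to can be shown with a minimal Python sketch. It assumes nothing about how Sakai is actually built; the `MESSAGES` catalog, its keys and the `translate` function are all hypothetical names chosen for illustration.

```python
# Hypothetical message catalogs: the texts live in data, not in program logic.
MESSAGES = {
    "en": {"welcome": "Welcome", "logout": "Log out"},
    "nl": {"welcome": "Welkom"},  # deliberately incomplete translation
}


def translate(key, lang, fallback="en"):
    """Look up a text by key, falling back to the default language.

    An unknown key is returned unchanged so a missing text is visible
    instead of crashing the application.
    """
    catalog = MESSAGES.get(lang, {})
    if key in catalog:
        return catalog[key]
    return MESSAGES[fallback].get(key, key)


print(translate("welcome", "nl"))  # translated text
print(translate("logout", "nl"))   # falls back to English
```

With the texts kept apart like this, a new release of the program does not throw away the work of the translators, and adding Persian means adding a catalog rather than editing source code.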


Monday, October 09, 2006

Bi-lingual content

We are working hard to help the OLPC with content. WiktionaryZ seems to be really user oriented. WiktionaryZ wants to work together with organizations and with professionals. Quite often they have rich resources that are extremely useful but need integration. Often this information comes in the format of a spreadsheet. When these are two column affairs, with words in one language and words in another, it is not known what meaning these words have. Yes, they can be imported, but only people who know both languages can integrate them. At this moment it is manual work.

Manual work is sometimes necessary, but it is time consuming, and as it is we do not have the tools to make it less time consuming. When such lists are imported with a "flag", it would be possible to locate them easily and compare them to existing Expressions for that language. This would help integration. When neither of the two linked Expressions exists yet, it does not follow that the DefinedMeaning does not exist.
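A tool for this could start out very small. The Python sketch below assumes that the imported spreadsheet rows are simple word pairs and that the existing Expressions per language are available as sets; the function name `triage_pairs` and the sample words are hypothetical.

```python
def triage_pairs(pairs, known_a, known_b):
    """Sort imported (word_a, word_b) pairs by how much is already known.

    known_a and known_b are the sets of existing Expressions for the two
    languages of the list.
    """
    report = {"both_known": [], "one_known": [], "neither_known": []}
    labels = ("neither_known", "one_known", "both_known")
    for word_a, word_b in pairs:
        hits = int(word_a in known_a) + int(word_b in known_b)
        report[labels[hits]].append((word_a, word_b))
    return report


# Invented sample data standing in for a flagged import.
imported = [("paard", "horse"), ("hond", "dog"), ("zode", "divot")]
report = triage_pairs(imported, known_a={"paard", "hond"}, known_b={"horse"})

for label, rows in report.items():
    print(label, rows)
```

Pairs where both words are known are the cheapest to check, pairs with one known word point at a likely DefinedMeaning, and pairs with no known word still need a person to decide whether the meaning already exists under other Expressions.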

I am sure that we will have to deal often with these situations. We are at the stage where it makes sense to think about this. We are approaching the point where we have to deal with this.


Friday, October 06, 2006

Using Microsoft as an initial standard

I do confess that I use a Microsoft operating system and Microsoft software. I will not buy any new Microsoft software if I can help it. Having said all that, it is clear that Microsoft's monopoly does not allow me to buy a laptop without paying the Microsoft tax.

I was using Word on my computer and I wanted to change the language, as I have learned to write British rather than American English. While changing the language for a text I found a rich list of languages that Microsoft supports. Of particular interest for me was how it splits English into many versions. This is something that I can easily emulate in WiktionaryZ. The question is: I can, but should I? Would it be better to wait until someone wants to include something that is, for instance, Jamaican English, or is it better to be proactive?


Thursday, October 05, 2006

Admin rights

At WiktionaryZ, an admin is someone that we trust to edit. When this someone edits, it helps us that he is a sysop. It allows him to block and delete errors when this is needed.

For those of you who know the Wikimedia Foundation projects, how would one of their communities react when a "bureaucrat" starts promoting users and creates some 20 new sysops in an hour? I am sure that it would create an uproar. Not on WiktionaryZ, I am happy to say.