Sunday, December 31, 2006


It is the last day of the year, and it is a great moment to consider what to do in the New Year. For me, the one thing that will typically make the difference is collaboration. What we aim to do with OmegaWiki will be a success when we aim to be inclusive; this will allow everyone to benefit from the mutual effort on the same data.

The economies of scale will really work to our advantage when the work on the data is shared. When people and organisations work together, tipping points will be reached that enable applications of the data that would otherwise not really be possible.

Today I learned that it may be possible to collaborate with Logos. This would be really great; Logosdictionary is already relevant, and it has a community that makes the daily logosquote a rich reality. It has a children's dictionary, a conjugator and more. Truly, the prospect of such a collaboration is significant; the challenge will be to make it happen.

2007 will be an interesting year .. :)

Thanks and "gelukkig nieuwjaar!"

Saturday, December 30, 2006


The Marathi Wiktionary has 121 articles. OmegaWiki has 120 expressions in Marathi. I had a look at the recent changes of the Marathi Wiktionary; it was full of changes to the MediaWiki messages. Consider: the same work was probably already done on the Marathi Wikipedia. When these updates are done in this way, people will not be able to benefit from them on other projects that are of interest to people who read and write Marathi.

It is really relevant that the work done on languages like Marathi counts. BetaWiki was a place that functioned well; however, as it was not part of the MediaWiki projects, it was ignored by some of the developers. With the inclusion of the software written for BetaWiki in the Incubator, there will be a more obvious place for the localisation of MediaWiki. It will be a place that is easy to understand for translators. The work done there will benefit all the projects where people are able to use their language for their user interface.


Wednesday, December 27, 2006

Playing a game

It is the festive season and families come together. Adults talk and kids play. The kids in my family play games, computer games. I was asked if they could use my computer to play a multi-user game. I did not mind; I made them promise that they would uninstall the game when they were done, before I left.

The question: is this the type of thing that the industry considers illegal? It must be, because no money changed hands. If this is indeed illegal, where is the fun?


Lies, damned lies and statistics

There was an e-mail on the Wiktionary mailing list in which a comparison was made between the English language Wiktionary and OmegaWiki. It was based on an analysis by Zdenek Broz comparing the number of translations between the two projects.

The numbers may be right, but there is so much more to the English Wiktionary that it should not be reduced to such a numerical comparison. There is the Wikisaurus, there is a lot of etymological information, there are frequency lists and rhymes, and most relevantly there are a lot of people who make it a great project.

It will take a lot more work before OmegaWiki can include all the information that is in en.wiktionary, and it will take even more work before this information is actually in there. I think this may happen, but we do not need a competition for that. Both projects have their strong points, and when we are able and willing to learn from each other it will be awesome.


Friday, December 22, 2006

A Christmas story about traditions

I really enjoyed this wonderful Christmas story. It is about centuries-old traditions and about personal traditions and how they change. I really enjoyed it and want to share it with you ..

Wednesday, December 20, 2006

Running in front of the pack

MediaWiki is the software used by the projects of the Wikimedia Foundation. If there is one thing MediaWiki can boast about, it is the amount of localising that has been done. There are currently projects in 250 languages, and many have been localised somehow, somewhere.

The Wikimedia Foundation has a big need for money. If ever there was an organisation that took care of the money it received, it is the WMF. According to Alexa, Wikipedia has the 12th traffic rank at the moment, and the growth of their projects is something like doubling every four to six months.

The WMF has an ongoing funding drive, and they created software in Drupal to manage it. The functionality is good for some languages, but for others Drupal has not been localised. Some will say: it does a good job for my language. But for most languages Drupal is just not up to the task. MediaWiki is far ahead of the pack when it comes to localisation, and consequently there are no other tools that will support all the languages that the Wikimedia Foundation needs.

When Drupal gets more localisation done, it will find that in the same time frame the WMF will have added more languages. There is a moral question here as well: is it acceptable to use tools that convey an important message but do not support the languages that you need?


Tuesday, December 19, 2006

Glyphs, fonts and the need to support them

In OmegaWiki (formerly known as WiktionaryZ) we want to support all words of all languages. Some languages have scripts that are not supported by the operating system that you use. This can be bad or it can be really bad. It is bad when you have to find and install a font: where do you get it, what does it cost? It is really bad when there is no font. It can be extremely bad when the glyphs have not even been defined in Unicode.

In the Wikimedia Foundation there are projects that do require another font; Khmer and Laotian come to mind. For the Ripuarian language the situation is extremely bad: there is an official orthography that defines characters that do not yet exist in Unicode.

Today I learned about a really nice project called DejaVu. It is an open source project that works on the creation of fonts. They do good work, but there is a long way to go before they will have tackled the Chinese and Indian languages.

For OmegaWiki, projects like DejaVu are important because they will ease things for the people who are interested in seeing all the information that it will contain. Unicode and the work done by Michael Everson are just as important, because without the glyphs being defined they will not go into fonts, and without fonts we cannot have all words of all languages.


Monday, December 11, 2006

A tribute to the creator of a language

Many people are not impressed when non-natural languages are mentioned. WiktionaryZ is by its very nature more inclusive; that is to say, all artificial languages with sufficient recognition will be welcome. One class of languages that WiktionaryZ will not include is programming languages.

Because of my background, I am personally very much interested in these languages too. Today, I learned from an article on the BBC website that Grace Hopper was born 100 years ago. Rear Admiral Hopper was one of the most influential people in the development of computing. She is certainly one of my heroes.


Wednesday, December 06, 2006

Preparing for a conference

In two weeks, on December 14 and 15, I will speak at a conference in Vienna. I was surprised when I was asked to speak at the Language Standards for Global Business conference. What is it that I can bring to the table? Yes, I proposed a Wiki for Standards at the Berlin conference, and yes, I am a member of the Wikimedia Foundation language subcommittee, and yes, WiktionaryZ is a big user of standards. I am still shocked and awed that I was asked.

Now that I am getting used to the idea, I find that standards are increasingly taking up my time. Standards are crucial for the projects I am involved in. When they fit a need, you do not need to explain and argue why individual choices were made; you refer to the standard. When you export data and you implement a standard for the format, like TBX, LMF or OWL, you hide the complexities of the database and you provide the export in a stable, mature way that allows people to build upon it.

The problem with standards, however, is that there are so many of them. Also, many of the standards work at cross purposes by focusing on single issues. This has led to separate standards for the Internet and for libraries, all indicating languages. This plethora of standards and requirements prevents interoperability. It also prevents the general adoption of these standards.

As I learn more about standards, I find myself with WiktionaryZ at the cutting edge. How to publish content for a language like Bangubangu? As far as I know there is little or no content on the Internet at all. I am pleased now to have the Babel templates for Bangubangu. But as the content grows, how do we get the search engines to find it? It is here that appropriate standards can make a difference.

In preparation for my presentation I have looked at other conferences, and I find that many are very much driven by commercial needs; needs that do not necessarily take into account what is in the long tail of the industry that is represented. The maturity that can be found in translations between the languages of economic powerhouses like America, Japan and Germany is worlds apart from the African courtrooms where the defendant is lucky when he understands the judge or a witness. There is often no translation there, and the tools that are available in the translation industry are not available even for the translation of court papers.

My problem for the conference is that there is so much that I would like to address; I have to concentrate and make what I will say count. The good news is that it is open for business. When people interested in standards take an interest and collaborate, we may address all issues and get a better understanding of what all these standards are there for and, more importantly, how they interrelate.

The challenge will be to build a community that understands that it is only collaboration that will make their standards relevant and integrated with the other standards that are not relevant to their business.


Tuesday, December 05, 2006

Another certificate but no tulip

In my living room I have a vase with tulips. Every one of these fabric tulips stands for a certificate that I earned in my computer career. To me it expresses that in order to remain up to date you have to work.

Last week I was invited to a lecture by Marshall B. Rosenberg in Rotterdam. Mr Rosenberg is credited with developing a method of communication called "nonviolent communication". I attended, and I got a certificate to prove it. It does not rate a tulip, because there is no achievement for me yet.

In the handout we got, there were lists of feelings and needs. It was mentioned that the English language was not made to express feelings and needs; Spanish was said to be much richer when you want to express either. When you think of it, it is not surprising that language can be intimidating. The choice and the appreciation of words makes all the difference. As was explained during the lecture, in order to change your language you have to be aware of what you say and what its effect is on the party on the receiving end of that language.

In the discussion about male domination that I mentioned in my previous post, the language used is a major contributing factor to the unease that is being felt. When people do not perceive that it is their very words that make others feel uneasy, angry, even discriminated against, it is very hard to come to an understanding. Yes, it may be that for some English is a second or even third language, but that does not negate the effects these words have; it is at best an explanation, not an excuse. Bullets do kill, words do hurt.



On the mailing list for the English Wikipedia there were some women who articulated that a sizeable group of women do not feel comfortable in the en.wikipedia community of editors. To alleviate this issue, they decided to create a community of women called Wikichix.

Angela, who is one of the best around at creating communities, set up some infrastructure, including a mailing list for those interested in joining. A lively discussion started about this. Many men denied that there is a need for, or that it is appropriate to have, such a self-help group. Several of them expressed that they feel excluded and even discriminated against. Several women, including Anthere, provided graphic evidence of how women are dealt with by some of the males who think nothing of making disparaging remarks and qualifying them as "jokes".

Even though it was said time and time again that this was to engage more women to become part of the Wikipedia mainstream, some people could not accept that what is good for some does not need to be an affront to others. Sadly, Anthere has now asked for the mailing list not to make use of WMF infrastructure. :(

Personally, I feel this is a loss. A loss because it does not help to engage more people. A loss because it may even prevent the engagement of people who are not yet part of the Wikipedia community. And all this because of the boorish behaviour of some.


Tuesday, November 28, 2006

A new language on the Internet ?

I received a mail forwarded to me by Martin Benjamin (of the Kamusi project). The mail was by a gentleman who wants to promote his mother tongue, the Bangubangu language. This is the first time that written material in this language has been created. The first project is a book in English, French and Swahili called “Teach yourself the Kibangubangou”.

I am thrilled with initiatives like this. The book is there; now what to do next? Do you print it with an organisation like Lulu? Do you advise making it a Wikibook? To what extent are concepts like copyright and licenses relevant and understood?

It would be great to make this project succeed and have kids learn their mother tongue. There are stumbling blocks. Google supports fewer languages than the Wikimedia Foundation; Google does some hundred and the WMF some two hundred and fifty. One of the problems is that there is not much material in any but the bigger languages, and Google does well by already doing this much.

To make an impact, I think it is crucial for material, particularly in endangered languages, to be tagged correctly. This gives Google and the other search engines a fighting chance to function on little material. For the Bangubangu language, there is no proper tag yet. The question is whether the IETF will consider doing something about it. They are religious in their belief that ISO-639-3 is not yet approved and that the bnx code is therefore not to be used. At issue is that a need presents itself for this language now.
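To make the tagging point concrete, here is a small sketch in Python. It is my own illustration, not part of any project mentioned here: the function names and the tiny code list are invented for the example, and the set stands in for the older, smaller code lists. It shows how content gets a language tag in HTML, and why a code such as bnx falls through the cracks of a validator built on those older lists.

```python
# Mark up a snippet of text with a language tag so that search engines
# can tell what language it is in. "bnx" is the ISO-639-3 code for
# Bangubangu mentioned above; OLD_CODES is an illustrative stand-in
# for the older, incomplete code lists (it is NOT the real standard).

OLD_CODES = {"en", "fr", "sw", "nl"}  # illustrative subset only

def tag_html(text: str, code: str) -> str:
    """Wrap text in a span carrying its language tag."""
    return f'<span lang="{code}">{text}</span>'

def is_recognised(code: str) -> bool:
    """A validator limited to the older lists rejects newer codes."""
    return code in OLD_CODES

print(tag_html("mwana", "bnx"))  # <span lang="bnx">mwana</span>
print(is_recognised("sw"))       # True: Swahili has long had a code
print(is_recognised("bnx"))      # False: the older lists do not know it
```

The tagging itself is trivial; the problem the post describes is entirely in the second function, where the accepted code list decides which languages can be indicated at all.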


Sunday, November 26, 2006

Kanab Ambersnail

According to a heading on the English Wikipedia main page, the article about the Kanab Ambersnail was the 1.500.000th article. I think this is very spectacular. Personally I find it funny that I can write about it before I have read about it on Slashdot :)

Congratulations to all who make Wikipedia so special.


The I&I conference

Last Wednesday and Thursday I was at the annual I&I conference in Lunteren. This conference brings together many of the ICT (information & computer technology) coordinators of Dutch and also some Flemish schools. I have had the privilege of being there for the second time. This year we had beautiful new Wikipedia folders, and Marc Bergsma demonstrated an OLPC motherboard; it got a lot of interest, particularly when it was learned that a full system could be available for Dutch schools for the school year 2008/2009.

One great project I learned about is called TEEM. TEEM is a project where teachers evaluate educational websites and CD/DVD-based resources for their use in classrooms. The way it is organised is such that I can understand why British teachers trust it as a resource. The way it is funded, however, creates a sad systemic bias. Reviews are paid for by publishers; the consequence is that open-content courseware will not be evaluated. This funding model also keeps out those publishers that do not pay for a review. This governmental policy choice was meant to encourage innovation and competition. It is not unreasonable to suggest that it effectively costs British schools more money, as they do not learn about what is available for free.


Friday, November 24, 2006

Whose language is it anyway ?

Much of Microsoft's software has been translated into the Mapuzugun, or is it the Mapudungun, language. They did this in consultation with the Chilean government.

The Mapuche people have gone to court because they disagree that Microsoft or the Chilean government had the right to do this. The language, it is claimed, is theirs, and the translation was done without consulting the Mapuche people.

Many people do not understand or know what is behind this. Why would people be opposed to becoming part of the digital world? To me, the key thing to appreciate in the reporting is the accusation of "violating their cultural and collective heritage". There are competing orthographies for this language, and one is strongly favoured by the government.

So, if I remember things well, this has everything to do with the involvement of the people that use the language. Mapudungun is spoken in both Argentina and Chile, and here the government of a nation and the most powerful company in the world are being taken on because they do not represent the Mapuche people and are therefore denied the right to decide for them.

In that light it makes perfect sense to go to court and insist that this is very much not wanted.


Friday, November 17, 2006

To the winner go all the spoils

When changing your name makes you money, would you do it? Would you change your name for a pig or a goat? Accepting such an offer could make me "Pig Meijssen"; it would go well with my mascot. The name that these people changed theirs to, "Hornsleth", is perfectly honourable. This "project" by the Danish artist Kristian von Hornsleth is very much intended to demonstrate that international aid fails people. It fails people because it assumes that "the way of the donor" is best.

If you want to be helped, you have to do this, that, whatever. For me, the great thing about Wikipedia is that it helps. It brings information to people. Its intention has always been to bring people information in their language. This means that culture, people and language are respected and that people are enabled to help themselves.

For the Western languages there is an abundance of great information. For many other languages, Wikipedia still has to take root. As the world becomes more wired, I expect Wikipedia will take root and become relevant as a resource for the culture, the people and the language.

When it is considered acceptable to bring information only in English, French, Arabic, Chinese or whatever is considered a big language, the notion of the Neutral Point of View that the Wikimedia Foundation offers is diminished. An NPOV exists because all points of view can be brought in through the diversity of languages and cultures that is reflected in the 250 Wikipedias that currently exist.

It may be efficient to concentrate on the "important" languages.. but I do not want to consider the loss.


Sunday, November 12, 2006

The worth of MediaWiki

According to the dark art of economics, everything can be valued. Everything can be given a price tag. People may object to this, and do on principle, but sorry: they have done just this for MediaWiki.

MediaWiki has a worth of $3.810.127 when you assume that a developer costs $55.000 a year, and some other stuff. It was "valued" as of the first of January 2006, and as the year is almost gone, it will be worth a lot more now.

Another economic truism is that money makes money. Because of the success of MediaWiki, more people will start developing for MediaWiki. I am not in a position to deny this. WiktionaryZ extends MediaWiki, and with new functionality much more becomes possible.

There is one thing missing in this argument: MediaWiki is a tool. A tool that produces something that is far more valuable. Wikipedia is not the only project that MediaWiki has enabled. I am sure some people who understand this dark art of valuation will be able to come up with a better number.

Given that money makes money, it is possible to leverage the content that MediaWiki generates. Much proprietary content is not really relevant because it does not get exposure. By making content available, it can get exposure. Material was often created to get exposure in the first place; by keeping it proprietary, and thereby hidden from view, it does not do all that it could do. By making it available under a free license, new opportunities arise.

MediaWiki enabled, among others, Wikipedia. WiktionaryZ has potential. My hunch is that, like MediaWiki, it will enable the creation of content in a different way. I hope and expect that it will help us to negotiate the release of much content under a Free/Open license and allow us to collaborate with many organisations and people.


Friday, November 10, 2006

Running the interwiki bot for Wiktionary

I run the interwiki functionality of the pywikipedia bot on all the Wiktionaries. It is a thing that I started, and it is the kind of public service that needs doing. It links all the words that are spelled the same by adding "interwiki" links. These are the links that you see on the left-hand side, indicating that there is also information in another language.

I have done this now for over a year, and what I just noticed is that the number of words that I do not understand is growing rapidly. On the one hand this is to be expected, as it is in line with the rapid growth of projects like the Vietnamese Wiktionary. What now starts to happen more often is that multiple Wiktionaries have the same words. That is what I see when I watch the bot.
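For those unfamiliar with what the bot produces, here is a rough sketch in Python of the core idea. This is my own simplification, not the actual pywikipedia code: the function name and the toy data are invented for the example. Given a word and the set of page titles each Wiktionary has, it emits an interwiki link line for every other wiki with a page spelled exactly the same.

```python
# Sketch of the interwiki job: for a given page title, find the other
# Wiktionaries that have a page with exactly the same spelling and
# emit the [[xx:title]] interwiki link lines for them.

def interwiki_links(title: str, titles_by_wiki: dict[str, set[str]],
                    home: str) -> list[str]:
    """Return [[xx:title]] lines for every other wiki that has the title."""
    return [f"[[{code}:{title}]]"
            for code, titles in sorted(titles_by_wiki.items())
            if code != home and title in titles]

# Toy data standing in for the per-wiki title lists.
wikis = {
    "en": {"water", "paard"},
    "nl": {"water", "paard"},
    "vi": {"water"},
}
print(interwiki_links("water", wikis, home="en"))
# ['[[nl:water]]', '[[vi:water]]']
```

The real bot also fetches the title lists from the live wikis and writes the links back to the pages; the sketch only shows the matching step, which is why identical spelling across languages is all it takes to get linked.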

In a way it would be fun to have WiktionaryZ in there. Currently we have 159.004 Expressions and 10.557 DefinedMeanings. Based on the Expressions we would be the fourth project in size; it would be more reasonable to use the DefinedMeanings for the comparison, and that would have us as the 26th in size.

Comparing Wiktionary with WiktionaryZ is comparing apples and oranges. Where Wiktionary has each word only once, WiktionaryZ counts words as existing in a language. Where there can be many red-linked articles on a Wiktionary page, the WiktionaryZ Expressions are implicitly there.

It makes better sense to appreciate the implications of the numbers. In lexicology, size counts. Only when people have a good chance of finding the information they are looking for will they find a resource useful. It is one reason why it makes sense to concentrate on certain topics or domains. WiktionaryZ is rich in ecological terminology thanks to the information that we got by including the GEMET thesaurus. By working on the OLPC children's dictionary, we get a lot of the basic stuff that is the bread and butter of dictionaries.

Tuesday, November 07, 2006

Some thoughts on Alexa

Alexa is a website that provides an indication of the popularity of websites on the Internet. Whatever I do does not matter, as Alexa only measures the use of Internet Explorer, which is statistically becoming a less brilliant idea. Given the number of people using Firefox on WiktionaryZ, I am sure that they do not know where much of the alpha crowd hangs out.

We have had our downtime this week, and I have been looking at how this affects our standing at Alexa. Sure enough, after two days of downtime, we hit the 900.000th place. Now that we are back up, we have rebounded nicely, and today we are already back at number 478,316 for the weekly average.

WiktionaryZ has its own statistics; there you will find that our daily average hits did take a pounding. Given no more downtime, and given that interest continues, the numbers will improve, but the average will be depressed for a while. At this moment all this is not crucial. But when we get people to rely on WiktionaryZ, our service level needs to be much improved.

In a conversation with a developer I once said: when you respect your users, you have to treat them as if they cost us $150 an hour. The point is that when you realise how valuable contributors are, you are more likely to give them the respect that they are due. With professional people using wikis, there actually is money paid for the time spent editing wikis; this makes it more plain, but it does not make a difference. Editors are to be respected, and it is important to make the most out of what they do for us.


Wednesday, November 01, 2006

The semantic web to the rescue ?

The reporting on the Internet Governance Forum in Athens is mighty interesting. Yet another nice article on the BBC website with some thought-provoking ideas.

I find it really interesting that spoken languages are considered. However, at this stage the Internet is in practice very much oriented towards written languages and, given the number of stumbling blocks that exist to integrating languages other than the ones using a Latin script, I am afraid this is just a red herring.

I was also amused to see that the semantic web was brought to the fore as one solution to the problem of linguistic diversity. Yes, it is intended to be understood by computers. But computers are used by people, and the semantic web is decidedly English. This raises the question of how a computer that apparently understands English communicates with a user who does not.

There are, however, some great things to be said about the semantic web; first of all, the terms used should be unambiguous. This in turn means that it should be possible to translate them into languages other than English. This is a challenge that we face in WiktionaryZ. We are able to have semantic relations, and our semantic relations do translate to the language of the user interface. So when the terms have been translated, relations in WiktionaryZ can be understood not only by people but also by computers.

This is a good moment for a disclaimer: WiktionaryZ is pre-alpha software. Many of the issues that have been tackled in the development of the semantic web we have not considered, let alone touched. We hope and expect that we will be allowed to stand on "the shoulders of giants".


Is this a "good" word

MediaWiki is the software that drives Wikipedia, but also WiktionaryZ. It is probably one of the best pieces of software when it comes to internationalisation and localisation. This is demonstrated by the many localisations that have already been done for the software. Singing the praises of MediaWiki is to be expected from me; why else develop WiktionaryZ on top of MediaWiki?

This does not mean, however, that all is well. I have written before about the problems with the Neapolitan language and their issue with the '' combination. Today I learned that a language called Hai||om uses the "pipe" character, and consequently I cannot make it work properly in a MediaWiki installation.
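To illustrate why the pipe character is such a problem, here is a sketch of my own in Python; it is not MediaWiki's actual parser, and the function name is invented for the example. In wikitext, a "|" inside [[...]] separates the link target from its display label, so a name like Hai||om gets cut apart by that rule.

```python
# A naive model of how wikitext reads an internal link: everything
# before the first "|" is the page name, the rest is the display label.
# A language name containing "|" therefore cannot survive intact.

def parse_link(wikitext: str) -> tuple[str, str]:
    """Naively split an [[target|label]] link the way wikitext syntax does."""
    inner = wikitext.strip("[]")
    target, _, label = inner.partition("|")
    return target, label or target

print(parse_link("[[San]]"))      # ('San', 'San'): a normal link
print(parse_link("[[Hai||om]]"))  # ('Hai', '|om'): the name is mangled
```

The real parser is far more elaborate, but it shares this assumption: "|" is syntax, not content, which is exactly the presumption that Hai||om trips over.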

There are ways around such a problem; I can use one of the alternate names, San or Saan. I can expect that there will be no Wikipedia created in this language (only 16.000 speakers). But the point is that even a system that does really well is only as good as the next language that proves that it has an issue with its presumptions.


Tuesday, October 31, 2006

The importance of good standards

At the Internet Governance Forum, Mr Vint Cerf said that changing the way the Internet works to accommodate a multi-lingual Internet raises concerns. The question raised in the BBC article is whether it is a technical issue or not.

Interoperability on the Internet is possible when the standards used make interoperability possible. The current URL system is based on the Latin script. This was a sensible choice in the days when computing was developed in America; it made sense in a world where the only script supported by all computers was the Latin script. These days, computers all support UTF-8, and all modern computers can support any script out of the box. This means that all computers are inherently able to represent all characters. It does not mean, however, that all computers are able to display all scripts; even my computer does not support all scripts, and I have spent considerable time adding all kinds of scripts to my operating system.

The next issue is for content on the Internet to be properly indicated as to what language it is in. Here there is a big technical issue: the standards only acknowledge the existence of a subset of languages. The result is that, using the existing standards, it is not possible to indicate the language of just any text.

Yes, the net will fragment into parts that will be "seen" by some and not "seen" by others. This is, however, not necessarily because of technical restrictions, but much more because the people and the services involved do not support what is in the other script, the other language. When I, for instance, ask Google to find лошадь or paard, I get completely different results even though I am asking both times for information about Equus caballus. In essence this split of the Internet already exists. The question seems to me to be much more about how to make a system that is interoperable.

The Internet is interoperable because of the standards that underlie it. With the emancipation of Internet users outside of its original area, these standards have to become usable for the users of the Latin, Cyrillic, Arabic, Han and other scripts alike. It seems to me that at the core of this technical problem is the fact that the current standards are completely Latin oriented and truly focused on what used to be good. At this moment the codes that are used are considered to be human readable. I would argue that this is increasingly not the case, as many of these codes are only there for computers to use. When this becomes accepted fact, it will be less relevant what these codes look like, because their relevance will be in their being unambiguous.

For those who have read this blog before, it will be no surprise that the current lack of support for ISO-639-3 language names is one of my hobby horses. As I have covered this subject before, I will not do so again. What I do want to point out is that insisting on "backwards compatibility" is more likely to break the current mould of what is the Internet than to preserve it.


Sunday, October 29, 2006

New functionality of Firefox and blogging

I use Firefox as my browser. I have upgraded it to the latest version, and now my English is spell-checked for me in real time. The thing I will probably like best is that a word like "localising" is now seen as correctly spelled, because Firefox allows me my British English. Blogger, although nice, expects people to use American English spelling; that would be useless were I to blog in any other language.

Another nice thing is that it allows me to accept words that are correct in the texts that I write; WiktionaryZ is such a word, and so are MediaWiki, Wikimedia and Wikipedia.

By having spell checking done client-side, the server functionality becomes cheaper for the service provider as well. The quality goes up. All in all, a good reason to upgrade to the latest Firefox.


Monday, October 23, 2006

The NPOV of language names

Yesterday Sannab pointed me to this posting on the Linguist List. The gist was that there had not been full consultation with the academic community about the adoption of the Ethnologue database for the ISO-639-3 codes of languages. A secondary argument was that Ethnologue is primarily a religious organization, and the question was raised whether it is ethical to have such an organization be the guardian of what is to be considered a language.

This e-mail is a reaction to what Dr. Hein van der Voort wrote in the SSILA-Bulletin number 242 of August 22 of 2006.

The problem I see with the stance taken is not so much in the realization that some of the Ethnologue information needs to be curated, nor in the fact that some would consider Ethnologue to be the wrong organization to play this part; the problem is that no viable solution is offered. The need for the ISO-639-3 list is not only to identify what languages there are from a linguistic point of view, it very much addresses the urgent need to identify text on the Internet as being in a specific language.

At WiktionaryZ we are creating language portals. These language portals are linked into country portals; both countries and languages have ISO codes. When I had questions about language names, Ethnologue was really interested in learning what I had to say about what are, to me, obscure languages. The point here is, Ethnologue wants to cooperate. Some people do not want to cooperate for ideological reasons and at the same time do not provide a viable alternative. That is, from my point of view, really horrible. The need for codes that are more or less usable is expanding at Internet speed, not at the glacial pace that is the time of academics.

When WiktionaryZ proves itself and becomes a relevant resource for an increasing number of languages, all kinds of services will be built on the basis of it, using standardized identification for content. The ISO-639-2 code is inadequate. It is not realistic to expect ISO to review its decision at this stage and not include Ethnologue. It is not realistic to expect such a review without providing an alternative that is clearly superior to ISO-639-3. It is clearly better to improve together on what is arguably in need of improvement than not to provide the tools to work with in the first place.



Wednesday, October 18, 2006

Learning a language .. because you must

When you want to go and live in another country permanently, it is best to know the country and the language. When you want to emigrate to the Netherlands, it is often required that you first pass a test showing that you have some basic ability to speak Dutch. This test is to be passed while still abroad and can be taken at a Dutch embassy; both for the embassy and for the people who have to take it, this is a logistical challenge.

The test is to determine whether the "A1-" level of comprehension exists. People need to be able to listen to someone who speaks Dutch SLLOOWWWLY and uses a limited range of words. This list of words is finite. It is likely that many of them are the same words that WiktionaryZ needs for its OLPC project.

Many of the techniques that you would use in a school are the same as the ones needed to prepare for this exam. You need soundfiles, you may want illustrations, both pictures and clips, and you want definitions in the many languages that people understand.

Given a list of words expected to be known for the "A1-" exam, it would be easy to get the communities of the would-be emigrants to add translations for both the word and the definition. As there is a group of people that do not read or write either language, a soundfile needs to be produced as well.

The next thing is making this content available on the Internet and serving it as a public service. Maybe there would even be public money to make this happen. Then again, even if there is no money available for it, it is a nice challenge. And when you can do this.. there are other countries that people emigrate to, even people from the Netherlands :)


Monday, October 16, 2006

How to integrate wordlists in WiktionaryZ

There are many GREAT resources on the Internet. One I (re)discovered the other day provides information for some 79 languages, and the information is Freely licensed; the data can be downloaded for personal use, as it can change rapidly.

So how do we integrate such information in WiktionaryZ? WiktionaryZ insists on the concept of the DefinedMeaning. As this is central to how WiktionaryZ works, it is crucial that we have the concept defined. Each entry in such a wordlist is split into two parts: a from part and a to part. The translations include synonyms and alternate spellings.

An application that is to include these translations could work like this: When an Expression is found in the to language and the translation is not there already, a user is shown the WiktionaryZ content with the suggestion to add the translation. This way the new information is integrated into WiktionaryZ.
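The flow described above can be sketched in a few lines. This is a toy illustration only; the data and the function name are made up and are not WiktionaryZ's actual API.

```python
# Toy stand-in for existing WiktionaryZ content:
# expression -> set of translations already recorded.
existing = {
    "cat": {"kat", "poes"},
    "dog": {"hond"},
}

def suggest(pair):
    """Return a suggestion tuple when the translation is not recorded
    yet, or None when the information is already integrated."""
    source, target = pair
    known = existing.get(source, set())
    if target in known:
        return None  # already present, nothing to do
    return (source, target)  # show to a user for confirmation

print(suggest(("cat", "kat")))   # already known -> None
print(suggest(("dog", "hund")))  # new -> ('dog', 'hund')
```

The point of the sketch is that the machine only filters; a human still confirms each suggested translation before it enters the database.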

One part of this "how to" that is still needed is some discussion on the finer details; more importantly, we need someone who will take up this challenge and develop this for us.


Saturday, October 14, 2006

Eating your own dogfood

WiktionaryZ is about lexicology, terminology and ontology. Consequently you would expect that these concepts are in there .. Now they are.

For me, not understanding a word and looking up what it means creates a moral obligation to include it in WiktionaryZ .. the latest word I did not know was divot. It was used on the BBC news website and yes, now you can appreciate what it would be like. The word in Dutch still escapes me :)


Wednesday, October 11, 2006

Medical terminology

Yesterday I read an article on the BBC news website claiming that the term schizophrenia is invalid. The article points out that schizophrenia is not a single syndrome but a multitude of syndromes. The problem with the word is that, given that it is understood to be a single syndrome, many patients are treated in a "one cure fits all" fashion. This is tragic, as schizophrenia is seen as something that cannot be treated, which is not necessarily true.

As we do not have much medical data yet, I have added the word to WiktionaryZ. WiktionaryZ will soon include an important resource of medical data, the UMLS. This will certainly include words like schizophrenia. I have added a reference to the resource on which I based the definition. With such a large body of medical terminology, it can be expected that many people will find their way to WiktionaryZ to learn what this medical terminology means. These people in turn will be interested in creating definitions that are consistent with what the scientific field considers something to be.

The word schizophrenia is well entrenched; it has a specific meaning to laymen, and my understanding from the BBC article is that many professionals also need to learn about the ambiguity of the term. The question is, when is there enough support to deprecate the old meaning of a well known word like schizophrenia.. How will peer review work out in a wiki, and how will the "public" react to these definitions..


Tuesday, October 10, 2006


Yesterday I met some people at the Universiteit van Amsterdam; the topic was their use of a content management tool called Sakai. This tool is to be used as a collaboration and learning environment for education. It is a tool that is used a lot in universities.

At the UvA they want to use it as a shared environment for Dutch and Iranian students. This is in and of itself a splendid idea. The software is open source, so much can be done with it. I recommended that they localize the software into Persian in order to provide a friendly environment. For me one of the bigger challenges was that Sakai provides a wiki environment, but with a high level of authorization over what people can and cannot see. The challenge is how to make such an environment reach the tipping points where its community takes off and becomes autonomous in its actions. Carving it up makes that much more problematic, I expect.

The people who wrote the software, however, were braindead when it came to the use of their software in other languages. I learned from the University of Bamberg that they will not use it because localization is done by changing texts that can be found in the source code. A university in Spain, I was told, decided not to upgrade the software because it was not feasible to localize it all over again..

It is sad when a tool with such promise is dead in the water because it was not considered that in order to be useful you have to allow for proper localization.


Monday, October 09, 2006

Bi-lingual content

We are working hard to help the OLPC with content. WiktionaryZ aims to be really user oriented; it wants to work together with organizations and with professionals. Quite often they have rich resources that are extremely useful but need integration. Often this information comes in the format of a spreadsheet. When these are two-column affairs with words in one language and words in another, it is not known what meaning these words have. Yes, they can be imported, but only people who know both languages can integrate the result. At this moment it is manual work.

Manual work is sometimes necessary, but it is time consuming, and as it is we do not have the tools to make it less so. When such lists are imported with a "flag", it would be possible to locate them easily and compare them to existing Expressions for that language. This would help integration. When the Expression does not exist for both linked Expressions, it does not follow that the DefinedMeaning does not exist.
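A minimal sketch of the "flag" idea, under stated assumptions: imported two-column entries are stored with a provenance flag, and each side is checked against the existing Expressions for its language so editors can find the unmatched ones later. All names and data here are hypothetical, not WiktionaryZ's actual schema.

```python
# Toy stand-in for existing Expressions per language.
existing_expressions = {"nl": {"water", "huis"}, "en": {"water", "house"}}

def import_pairs(pairs, lang_from, lang_to, flag):
    """Import word pairs, tagging each with a provenance flag and
    noting whether each side already exists as an Expression."""
    imported = []
    for a, b in pairs:
        imported.append({
            "from": a, "to": b, "flag": flag,
            "from_known": a in existing_expressions[lang_from],
            "to_known": b in existing_expressions[lang_to],
        })
    return imported

rows = import_pairs([("huis", "house"), ("boom", "tree")],
                    "nl", "en", "wordlist-2006")
for r in rows:
    print(r["from"], r["to"], r["from_known"], r["to_known"])
```

An editor could then filter on the flag and on `from_known`/`to_known` to see exactly which imported words still need a human to connect them to a DefinedMeaning.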

I am sure that we will often have to deal with these situations. We are at the stage where it makes sense to think about this; we are approaching the point where we have to deal with it.


Friday, October 06, 2006

Using Microsoft as an initial standard

I do confess that I use a Microsoft operating system and software. However, I will not buy any new Microsoft software if I can help it. Having said that, it is clear that Microsoft's monopoly does not allow me to buy a laptop without paying the Microsoft tax.

I was using Word on my computer and I wanted to change the language, as I learned to write British rather than American English. Changing the language for a text, I found a rich list of languages that Microsoft supports. Of particular interest to me was how it splits English into many versions. This is something that I can easily emulate in WiktionaryZ. The question is: I can, but should I? Would it be better to wait until someone wants to include something that is, for instance, Jamaican English, or is it better to be proactive?


Thursday, October 05, 2006

Admin rights

At WiktionaryZ, an admin is someone that we trust to edit. When such a person edits, it helps us that he is a sysop; it allows him to block and delete errors when this is needed.

For those of you who know the Wikimedia Foundation projects: how would one of their communities react when a "bureaucrat" starts promoting users and creates some 20 new sysops in an hour? I am sure that it would create an uproar. Not on WiktionaryZ, I am happy to say.

Saturday, September 30, 2006


I run the pywikipedia bot for the Wiktionary projects. I have done this for quite some time, and what I do is a "public service".. The software is quirky; when it works, it works well. That is until recently when it decided to blank pages for no reason.

This was a great moment to update the software. That did not work; authentication problems. Sourceforge decided to have me change my password. Thank you, Sourceforge. Even then it still did not work.

I asked Andre Engels to have a look. The result: the bot works again after some major surgery. The way it works for me is also different; it assumes that I have a user on every wiktionary.. This was already more or less the case. It will now check more systems than before to see if an expression exists there..

All in all, I hope / expect that this solves my problems running the pywikipedia bot.


The patron saint for the translators

Today, the 30th of September, is the day of St Jerome. He is best known as the translator of the Bible from Greek and Hebrew into Latin, and he is recognized by the Vatican as a Doctor of the Church. It is also "International Translation Day".

WiktionaryZ will be a tool for everyone; it welcomes people from all countries and languages. Typically there is not that much political or religious content to be found. This does not mean that we should not take a moment's notice; we had Ramadan as the word of the day, and today I blog about St Jerome.


Thursday, September 28, 2006

European day of languages

On the 26th of September, the European day of languages was held. We did not know.. This means that we are either not into languages or there was not that much marketing for this event.

The European Centre for Modern Languages has a website; it used to have posters, agendas and all kinds of information about this. The stuff just does not load on my computer. It may be that the moment has gone; on the other hand, many of the other pages of their website do not load for me either.

I really wonder: if you have a website that does not work for people, does it serve its purpose?? Anyway, there is always next year .. the 26th of September :)


Monday, September 25, 2006

Cooperation ? Not with Debian...

I learned how "nice" it is to be "Free" at any price. Firefox, my favorite browser, is Open Source. A great amount of effort was spent on making Firefox popular; advertisements in the New York Times covering a whole page.. The marketing of Firefox is truly one of the success stories of the Open Source world. Firefox is a trademark, it has a slick logo, and the Mozilla Foundation protects its assets.

The Debian distribution is one of those "Free" distributions that does not appreciate that derivations of logos are not permitted, and it consequently does not allow these files in its distribution. Firefox insists that the logo and the name go together.. this results in a mess where Debian is likely to rename Firefox..

In my mind this is foolishness. Logos and trademarks are there to help products like Firefox create a market. The evident denial of this is, IMHO, as stupid as the Wikimedia Foundation only allowing its own logos in the Commons repository while denying the logos of other organizations.

It is much better to appreciate that logos have a special place and that it is great when organizations allow their logo to be used with encyclopedic content.. For now this is not to be, because it is considered not to be "Free".. in effect denying to others a freedom that is reserved for one's own organization. It is not consistent .. it is a muddle.. It is bad practice.


Sunday, September 24, 2006

When Mohammed doesn't come to the mountain ..

Some things are inevitable; WiktionaryZ has always tried to be a project about content. Getting the attention of people has been difficult because of the open questions: what is WZ about, what is its relation with the Wiktionary projects, what is its relation with its partners and paid developers and, last but not least, what is the relation between WiktionaryZ and the Wikimedia Foundation.

At some stage, the "Special Projects Committee" of the WMF issued a resolution that they want to host WiktionaryZ. Combined with Jimmy Wales' wish to have WiktionaryZ in the WMF, this put us under some pressure. On the other hand, the experience with the InstantCommons project taught us that even friends need contracts to stay friends, and also that when you want to get things done, it is often best to do it yourself.

With the election of Erik to the board of the Wikimedia Foundation, it seems that the mountain has come to Mohammed. Erik has been and is a key contributor to WiktionaryZ. This should facilitate a great relation between the WMF and the WiktionaryZ community and partners.

We have invested a lot of ourselves in WiktionaryZ. We intend to invest even more in its success; we have schemes for how it should integrate with the WMF projects. We see a bright future, but as guardians of the project we will protect what WiktionaryZ stands for and nurture what we think WiktionaryZ will make possible.


Thursday, September 14, 2006


Boinc, or the Berkeley Open Infrastructure for Network Computing, is something I learned about the other day on a BBC World Service program. I was fascinated by the wealth of options that it provides. It is quite similar to the "Ligandfit" application that I have been running for the last 2 years and 153 days. It is different in that there is a much bigger array of things you can work on. Both have projects that I do recommend.

Given that many people like myself have loads of computer cycles going to waste, it makes sense to do something with them. For me there are two things that I do with my excess computing power: I have bots running doing maintenance work on Wiktionary, and Ligandfit..

Do you have cycles to donate ?

Wednesday, September 13, 2006

Transliteration, do we need to do things twice?

At WiktionaryZ we support all the scripts UNICODE supports. This means that we already do Serbian in both the Cyrillic and the Latin script.. We also support Mandarin in both the simplified and the traditional script.. We want to support Cherokee, which is available in the Cherokee and the Latin script ..

Many of these conversions can be done by a program or, as for Mandarin, there are databases with both versions of the script. We received permission from Jeffrey V. Merkey to use a long list of Cherokee words, and we also received permission to use a program called chr2syl that does transliteration automatically. There is a similar program for Serbian.
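A sketch of what such a rule-based conversion looks like, using Serbian Cyrillic to Latin as the example because the mapping in that direction is unambiguous (a few letters map to Latin digraphs). The table covers lowercase letters only and is an illustration, not the chr2syl program itself.

```python
# One-to-one mapping from lowercase Serbian Cyrillic to Gaj's Latin.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

def transliterate(word):
    """Convert a lowercase Serbian Cyrillic word to the Latin script;
    characters without a mapping are passed through unchanged."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in word)

print(transliterate("љубав"))  # -> ljubav ("love")
```

This is exactly why getting "the other version" automagically is feasible for scripts like these: the conversion is a lookup, not a judgment call.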

WiktionaryZ is at a pre-alpha stage, so we do not even have the basic functionality available, but would it not be nice if, when we add a Serbian word, we automagically get the other version?


Tuesday, September 12, 2006


Papiamento is a language with two orthographies; there is the Aruban and the Antillean version of this language. Today I have the possibility to add this language to WiktionaryZ. The problem is that I do not have a clue what the code for the language should look like.


Sunday, September 10, 2006

Antonyms in WiktionaryZ

An antonym is "a word or phrase that has exactly or nearly exactly the opposite meaning to another word or phrase". In WiktionaryZ, meaning is associated with what is called a DefinedMeaning. A DM differs from the way people usually think about concepts in that, when you approach a concept from an Expression, it refers to both synonyms and translations.

The technical problem is: when implementing antonyms in WiktionaryZ, are they still antonyms? Traditionally antonymy is considered within one language, but because of the way WZ implements things, this is not true within WZ.

Antonymy has its own problems too. When a concept is considered the opposite of another, this notion is often cultural. The problem then is: does an antonym translate in the first place?


Thursday, September 07, 2006

Internationalisation or internationalization or I18N

The interest that we have in localisation and internationalisation is well known to many; we need it for WiktionaryZ, as we want a User Interface in the languages that we want to support. We want to support all languages.

One of us was approached by a really interesting company, asking us if we know people who are in the business of I18N — people who would be interested in doing a great job for a great company.

I find it really funny that we are seen as being (becoming) relevant for this subject...


Wednesday, September 06, 2006

A small annoyance

The user interface is important; it is how people learn to work with a software environment. For WiktionaryZ the content is in the "WiktionaryZ" namespace. This is already counter-intuitive, as the default namespace is only supportive of what is done in WiktionaryZ.

The small annoyance is in the "New entries": it only shows you what is new in the "default namespace".. Not really relevant ..


Lies, damn lies and statistics

A lot of number crunching is done on the Wikipedia content. Particularly the most famous one, the English language version, has a lot going for it. My favorite number crunching project is Wikiword, which tries to get semantic content out of Wikipedia. It is a great project.

Many statistics seem to prove what the researcher sets out to prove. Aaron Swartz wrote a really nice article on who writes Wikipedia. It is great because it challenges conventional wisdom. The conventional wisdom is that a small group of people write Wikipedia. Aaron makes it plausible that it is the anonymous user who contributes most of the letters of the articles that you read in Wikipedia.

What is particularly important to me are the consequences for WiktionaryZ. The suggestion is that we have to make it easy for the casual user; that people contribute to the things they know and care about; that an intuitive screen helps. WYSIWYG helps the people who are the writers; the wikisyntax is for editors, the people who make things pretty.

The user interface of WiktionaryZ needs to be compared against the user interface of Wiktionary. Wiktionary is flat file, and almost every Wiktionary is essentially different. WiktionaryZ will have one user interface for all languages; contributions made by some will be available to all. I trust that we have an edge.

The flipside of the coin is that indeed a limited group of people do the EDITING. What we do not have in place are tools that help editors. Editors are the people that will prevent WiktionaryZ from becoming a mess. Particularly the merging and deletion of the [[DefinedMeaning]] and [[Expression]] will be important to get right..

There are ideas on how to do this, they are not mature yet.


Monday, September 04, 2006


No, I will not talk about the spelling of the word. I want to raise the subject of defining colours. There are many colours, and in order to agree on what a colour IS, you cannot really use words. For colours there is a standard, the RAL codes. This effort to bring quality to the delivery of colours started off with some 40 colours and now defines some 1900.

Lemon yellow is such a colour; it has the code RAL-1012. Because of its long existence, the colour indicated by this code has been translated into numerous languages; suurlemoengeel is the name of the colour in Afrikaans. When the colour is defined by the RAL code, and you use the names associated with this colour, you have an identical meaning for the word. This is obvious, as this is what the RAL codes are there for.

The question is how to list the RAL codes themselves in WiktionaryZ. The RAL colours are a collection; we can describe them as such. We can also include the RAL codes as expressions; the zxx language code seems obvious to me. We can do either and we can do both.
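The pivot idea above can be sketched as data: the RAL code acts as a language-neutral key (which is why zxx, "no linguistic content", fits), with per-language names attached. Only the example from the text plus its English name are used; the structure is hypothetical, not WiktionaryZ's schema.

```python
# RAL code -> names per language code.
ral_names = {
    "RAL-1012": {
        "en": "lemon yellow",
        "af": "suurlemoengeel",
    },
}

def name_of(code, lang):
    """Look up the name of a RAL colour in a given language, if known."""
    return ral_names.get(code, {}).get(lang)

print(name_of("RAL-1012", "af"))  # -> suurlemoengeel
```

Because every name hangs off the same code, two names in different languages that share a RAL code are guaranteed to denote the identical colour.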


Sunday, September 03, 2006

The Chinese language

There is no such thing as "the Chinese language". This should be no surprise to people. When you learn more about Chinese languages, it does not take long to notice that people mean Mandarin and the simplified script when they talk about the Chinese language. For WiktionaryZ, this is not a position that we can leave like that; WiktionaryZ is to be a lexicological, terminological and ontological resource in every language. So we will change Chinese and have it called Mandarin (simplified), we will add Mandarin (traditional), and we will add Min Nan to the languages that we will support in WiktionaryZ.

The basis for this is the way we have embraced ISO-639-3 and the experience we have gained with Serbian and English. For Serbian we have two scripts and this works well; for English the words that are universal are English, while the specifically American words are now English (American). The only thing left for English is to include English (British).


Sunday, August 20, 2006

Latin and species

We have an ever growing list of languages in WiktionaryZ. There are always good reasons why we add a specific language; it has a nice script or there is someone interested in adding content or we have a potential partner with an interest in THAT language.

It was suggested to me to have Latin. The motivation was: I have this list of birds, and would it not make sense.. It does make sense on one level and, on another, it does not. The Latin used to come up with all these taxonomical names is not necessarily the kind of Latin that helps you learn the language.

When we want to include THAT kind of Latin, it makes sense to include the taxonomical relations and attributes that make taxonomy the science that it is. Considering this, it will need Wikiauthors for its publication data. It will need some specific database functionality to make this feasible..

I do want Latin, but I doubt that this is a good moment to include it.


Friday, August 11, 2006

After Wikimania ... semantic mediawiki

I had a great time at Wikimania, and I am now slowly but surely trying to get things organised. This is not a trivial thing.. loads of people.. loads of interest in what we are doing with WiktionaryZ. Becoming organised is increasingly important.. So how do I do this; what tool should I use?

When I was in Rome I met Denny. He demonstrated how well Semantic MediaWiki can be used for personal purposes. I was really impressed. Denny installed all the necessary bits on my computer, and it does many of the things that I need. It allows me to create multiple lists associated with a subject, for instance the organisations and the persons involved with a project..

The more I see it, the more impressed I am. The only potential problem I see is that people have to learn more wiki-syntax. As it is so powerful, I would not mind including Semantic MediaWiki in WiktionaryZ.

Wednesday, August 09, 2006

Yochai Benkler: The Wealth of Networks

At Wikimania there were many presentations. Some were great, some were awesome. It is really hard to assess how influential many of these will be. There are several ways in which a presentation may become relevant:
  • The message was heard for a first time by a public.
  • The message was told for the first time.
  • The discussion following a presentation brought new insights.
  • The presentation raises questions.
I have truly enjoyed Wikimania. For me the presentation by Yochai Benkler was intriguing. It is about the way the Internet is changing how information fits into society. What I understand from the presentation is that business as usual has had its day, and we are working on what the new model for the future will be.

As I did not grasp what the new role for organizations will be in this brave new world, I bought the book. I bought it because I strongly believe that there will be a role for organizations to play, and if there is a recipe for including organizations and businesses, I want to have it. If there is no such recipe ..

I have browsed the book so far; many things I do not grasp. In discussions I had earlier in the week I learned how different the notion of "liberal" is depending on the context. As this is also a strong theme in the book, I will be struggling. However, it is fun.


Sunday, July 30, 2006

WiktionaryZ and attributes

At this time, WiktionaryZ is able to have text-based attributes. The implementation allows us to have attributes at the level of the DefinedMeaning. In plain English it means that we can have free form text without any wiki-syntax. It does not do much for us at this moment. It does not allow us to have sample sentences, and it does not allow us to indicate part of speech, inflection or gender (these all belong on the SynTrans level).

The good thing is that it is an indicator of things that may one day make our day. The thing, however, is that it does not necessarily have our highest priority. At this moment our priority is to get our versioning working. We need to have historic data, and we need improved information in our recent changes.. This is what we need badly.

We need it badly because this is the core functionality of WiktionaryZ or Wikidata. It needs to be done. It needs to be done well. It needs to be done before we add even more complicated functionality in WiktionaryZ.

There is a lot of work that needs to be done. We want to have proper support for terminology; for that we need to include the notion of "domains". We want to have proper support for lexicology; for that we need language dependent attributes. We need relation types that can only be chosen in the correct context, the context being either the language or the domain.

All these things will happen. They will happen in a way that is consistent with the amount of resources available. This is why we have a need for involvement. This involvement can come in many different ways, from many different places, providing many different angles to improve our data. The key thing is that we will maintain our architectural integrity. It is completely unacceptable to build all kinds of wished-for functionality while the technical foundation has not been laid.

The collaboration that became WiktionaryZ will start its third year at the end of August. There were many reasons why it has taken this long. What it brought us is a design and a philosophy that may work. The third year will be the year where I expect we will have our full functionality.. When we do, things will evolve more quickly.


Tuesday, July 25, 2006

The operational definition meets the DefinedMeaning

In WiktionaryZ, we aim to have definitions for all the concepts associated with an expression in a language. These definitions have to be good; they have to express well what the concept is. We have great definitions, operational definitions, when more than 90% of the people correctly identify the concept in a corpus, given the definition.

When a concept cannot be correctly identified by 90% of the people, it means that all the definitions of the different concepts are suspect. It may mean that too many concepts have been identified; certainly more work needs to be done to define things better.

In WiktionaryZ, many DefinedMeanings may exist where it has been indicated that the definition does not define the concept really well. These concepts are close, but no cigar; they should not be taken into account when determining whether the Definitions are indeed operational definitions.
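The 90% criterion amounts to a simple check: given how many readers matched a definition to the right concept in a corpus test, is the definition operational? The threshold comes from the text; the function name and figures are made up for illustration.

```python
def is_operational(correct, total, threshold=0.9):
    """True if at least `threshold` of readers correctly identified
    the concept from the definition in a corpus test."""
    return total > 0 and correct / total >= threshold

print(is_operational(93, 100))  # -> True
print(is_operational(85, 100))  # -> False: the definitions are suspect
```

Definitions flagged as not defining the concept well would simply be excluded from `correct` and `total` before this check is applied.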

The question is, how do you then identify the quality of the translations and the usability of the synonyms.


Sunday, July 23, 2006

Being too busy

The last month has been a rollercoaster; the new functionality of WiktionaryZ that we now have is awesome. The whole idea of the [[DefinedMeaning]] starts to make sense. People are collaborating on the same data. The idea that information can be shared and that it only needs to be added once is now a reality.

The thing that amazes me is that I am so busy making sure that everything works well and that we have some crucial data. Things like the Swadesh lists and the list of 1000 basic English words are really important because these are basic words; they demonstrate best the merits of the concept. And I am really pleased with how things are progressing.

The thing that annoys me is that there is so much work that I would like to get done.. Some of it is plainly not for me to do. Other stuff, like writing on the blog, is very much for me to do. I may have to be even more selective in what I do. Making these choices is hard.. Then again, I am glad I am in this position.. So many great things are happening and I am part of it :)


Tuesday, July 11, 2006

Water; at least three DefinedMeanings

When the word "water" is considered, people say it is this liquid that people drink, and also that it is this chemical known as H2O. Well, actually these two things are not the same. The chemical would be closest to distilled water, and it is not healthy to drink. The water you drink has traces of all kinds of chemicals, like salts, in it. This is what makes it good to drink. Another use for the word "water" is to indicate a body of water that you can swim in, or raft or boat on. In essence it is short for open water, and as such it can be both salt and fresh water.

It has been said before by many: it is the simple words, where "everyone" understands what is meant, that are the hardest to define.


Saturday, July 08, 2006

Just a word

Working on projects like WiktionaryZ is a job that takes a lot of my time. Often I would like to add just a few words to my home project, the Dutch Wiktionary. Today, I felt a need to add a word, record the pronunciation and add translations to the English word for the same phenomenon. The word is uitzaaiing.

In Dutch the word has a clear agricultural background. It refers to how a weed, once it is established, seeds itself to neighboring areas. It is one of those words that you hate to hear in another domain. Today I did. There is so little that I can do; I feel sad and can only hope for the best for Scott.


Sunday, July 02, 2006

A dubious record

I run the pywikipedia interwiki bot for the Wiktionary projects. This is a program that finds articles with the same name in different Wiktionary projects and creates links between the projects that share this article. I have been running it for quite some time now and, as the projects grow bigger, they share more articles.

The bot is quite nice, it can run autonomously and it updates some 30 wiktionaries at the same time. For the smaller wiktionaries, I run it specifically for a project every now and then.

Yesterday I chalked up the 200,000th edit for the English Wiktionary. It makes it not unrealistic to think that I have some 600,000 edits on all the wiktionaries. There are six instances of the bot running at any one time; they run on different projects. I think it is a dubious record because it does not bring me any happiness; it only indicates that there is information about a word that is written in the same way. I doubt very much that anybody does anything with it.

For your amusement; there are MANY English words on the Chinese Wiktionary that cannot yet be found in the English Wiktionary .. :)


Saturday, July 01, 2006

What is your mother tongue

Sabine's kids speak primarily Italian; they are living in an area where Neapolitan is spoken, but Sabine is German. Sabine talks frequently in both Italian and German to her kids and when she gets angry she turns to Neapolitan, which can be really expressive .. :)

Now what is the mother tongue of Sabine's kids? They speak primarily Italian...

In WiktionaryZ we have people who indicate that their mother tongue is zho, or Chinese. According to Ethnologue, Chinese is a macrolanguage. This implies that Chinese cannot be a mother tongue; only one of the 13 languages Chinese is divided into can be the mother tongue. This is a potential hot potato when people equate Chinese with the country and not the language.

WiktionaryZ is about languages and only about languages.

It can also be understood differently: they may mean that Chinese is the first written language that they learned. However, if I understand things well, when people talk about the Chinese written language, it is actually Mandarin. Yue, for instance, needs additional characters that only appear in one of the later versions of Unicode. This is however not what I would consider a mother tongue. A mother tongue is the language that you learned from your mother. Writing is what you learn at school.
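The macrolanguage point can be shown with a small lookup. This is an illustrative sketch, not WiktionaryZ code; zho and sqi really are ISO 639-3 macrolanguages, but the member lists below are deliberately partial and the function names are made up.

```python
# Partial, illustrative mapping of ISO 639-3 macrolanguage codes to
# some of their individual member languages (lists are not complete).
MACROLANGUAGES = {
    "zho": ["cmn", "yue", "wuu", "hak", "nan"],  # Chinese
    "sqi": ["aln", "als", "aae", "aat"],         # Albanian
}

def valid_mother_tongue(code: str) -> bool:
    """A macrolanguage code cannot itself be a mother tongue."""
    return code not in MACROLANGUAGES

def suggest_members(code: str) -> list[str]:
    """For a macrolanguage, list individual codes a user could pick instead."""
    return MACROLANGUAGES.get(code, [])

print(valid_mother_tongue("zho"))  # False: zho is a macrolanguage
print(suggest_members("zho"))
```

A user interface could use a check like this to reject "zho" as a mother tongue and offer the individual languages instead.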


Saturday, June 10, 2006

Punjabi and what IS that script

On the WiktionaryZ main page, we have a list of languages that point to "portal" pages for those languages. It is quite clear that a project like WiktionaryZ has to take into account the different scripts a language may manifest itself in. After a lot of head scratching, I created links to "cmn-Hans" and "cmn-Hant", to indicate that Mandarin exists in both a simplified and a traditional script. One other reason: it looks more organised this way.

I then tried my hand at the Punjabi language. Punjabi is written in two scripts, and I guessed wrong trying to identify them. One was indeed an Indic script: ਪੰਜਾਬੀ is written in the Gurmukhī script, while پنجابی is written in the Shahmukhi script. Shahmukhi is indeed an Arabic script, but it is not the Arabic script. In order to properly identify these words, I looked them up at Unicode, where there is a nice list of the ISO 15924 script codes. Gurmukhī has Guru as its code and Shahmukhi .. is absent.

It is probably pretty safe to indicate it as Arab, but when my information says that it is not, that is indeed problematic. I could also mark it as an uncoded script. The problem with these standards is that they work up to a point. The point is: what are they there to do?

When you write Dutch, the standard Latin script is used; however, that leaves out one character, and consequently all word processors capitalise the ij wrong: it should be IJ and not Ij.. I think a similar thing is happening with the Shahmukhi script. It is assumed to be Arabic, but the style of the glyphs is different. I think it is just one of those things that may change in the future.

I think I will indicate it to be Arab for now .. :)
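The portal links described above amount to composing language-script tags. The sketch below is a hypothetical illustration of that idea: Guru, Hans and Hant are real ISO 15924 codes, and because Shahmukhi has no code of its own the sketch falls back to the generic "Arab", as the post decides to do.

```python
# ISO 15924 script codes for the scripts each language is written in.
# Shahmukhi lacks its own code here, so generic "Arab" stands in for it.
SCRIPTS_BY_LANGUAGE = {
    "pan": ["Guru", "Arab"],  # Punjabi: Gurmukhi and Shahmukhi
    "cmn": ["Hans", "Hant"],  # Mandarin: simplified and traditional
}

def language_script_tags(lang: str) -> list[str]:
    """Compose language-script tags such as 'pan-Guru' or 'cmn-Hant'."""
    return [f"{lang}-{script}" for script in SCRIPTS_BY_LANGUAGE.get(lang, [])]

print(language_script_tags("pan"))  # ['pan-Guru', 'pan-Arab']
print(language_script_tags("cmn"))  # ['cmn-Hans', 'cmn-Hant']
```

This language-then-script composition follows the same pattern as the "cmn-Hans"/"cmn-Hant" portal links already on the main page.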


Sunday, June 04, 2006

What to do next

With the GEMET data we have our first data online. We have already planned the import of the languages that are in ISO-639-3 and will allow for the translation of these names into other languages. We will also extend the number of languages that can be edited. Given that we are in pre-alpha and at this moment still very much a development project, we include all languages that showed some activity towards localising the MediaWiki user interface.

The question is what to do next. We cannot and should not rush the development, but we do have a host of data that we could import. It could be other thesauri, glossaries or ontologies. It could be a long list of Expressions in a given language to populate a spell checker. We could import the data resulting from Duesentrieb's Wikiword application; this could give us a link to Wikipedia articles.

If we had more active collaborating developers, we could consider doing the import and export routines, or we could start working on inflections.

What would you consider the next bit of data to import after the languages? What would you start programming on given where we are at this moment?


Wednesday, May 31, 2006

What use is a community for others ..

I was at the LREC 2006 conference in Genoa, and one recurring theme was the use of software because there are not enough people to do things manually. Some things a computer does well, some things a computer does not do so well. You are often presented with a percentage by which the computer is off from what a human would do.

One presentation, the prize winner's presentation at the end of the conference, mentioned a nice scheme where two concepts were compared and the question was to what extent the first concept is associated with the second. People doing this are trained with a first set of concepts; they are then asked to do a further set of concepts to see to what extent they have learned things and then .. they are off. This worked really well but you need a large group of volunteers, or you use a computer. The computer either did a good job or gave COMPLETELY different answers from what a human would give (these are the things to watch out for in a Turing test).

My idea is: when WiktionaryZ gets itself a large community of people interested in languages, it would also be natural to ask this community if they are interested in helping out with research. One strategy would be to have just one person check a machine-derived result and, when there is a discrepancy, have some more people look at it ..

Another interesting experiment would be to test the difference between the different groups of users of English, including people who use English as a second language.. What do you think ???


Saturday, May 20, 2006

Languages, dialects and orthographies

When a text is known to be in a certain language, and this language is more or less familiar to a person, the text may be meaningful. I have had some French classes and when I am in Italy there is quite a lot that I can understand. An automated process cannot do this; it helps quite a lot when a text has metadata that indicates what language, dialect or script it is in.

One of the things that makes sense to be aware of is what orthography a text, phrase or word is in. It is definitely something that is in a class of its own, and it matters when text is to be understood in an automated way. Languages do change over time and the recognised correct orthography changes to reflect this. The German and Dutch languages have both had their fair share of changes. The functional design of WiktionaryZ has always had a place to indicate that a given spelling is dated. The way we will export the WiktionaryZ data will be by using standards like TBX, LMF, maybe RDF, SKOS or something different but standard. The problem is: how do we indicate that a given word needs to be spelled differently since a given date?


Friday, May 05, 2006

Languagecodes on the Wikimedia Foundation

In the past we started using the ISO-639-1 codes for indicating languages. This list was extended with the ISO-639-2 list because it was too limited. Even so, there were issues with the list and it was augmented with codes from Ethnologue. This was often considered not enough, so we created our own codes.

Now there are the ISO-639-3 codes. They have a provisional status because there will be even further extensions of the list, but it highlights one thing: us using our "own" codes is really problematic. I will give you some examples and also indicate why many arguments used in the discussion on new languages are wrong.

The ksh.wikipedia is for something called Ripuarian. The ksh code is for the Kölsch language. Ripuarian is considered to be a language family. The consequence is that there IS no single Ripuarian orthography, language or culture.

The als.wikipedia is called Alemannic. According to ISO-639 these are four languages. The problem is that the als code is used for the main Albanian language. The code for Albanian, sq (ISO-639-3 sqi), is also considered a language family. The two main variants are the Albanian and the Kosovar languages. Two other members of the Albanian language family are spoken in Italy and Macedonia.

Some of the most heated discussions on requests for new projects are about the status of a language: is it a language or a dialect? Often the arguments are of a political nature. The inclusion of languages in ISO-639 has been political in the past. With ISO-639-3, many of these arguments have an answer in the many new language codes that have been created.

The result is that we have wikipedias like the ku, the fa and the sq wikipedia, where the language is now considered a language family and where a request can be made for recognition of a language that is part of that language family. There are more projects like that that I have not identified yet.

Another "nice" situation is the Low Saxon nds wikipedia. When you look at what Ethnologue has to say about the Low Saxon language family, you get an impression of the extent to which there is not really something called Low Saxon in the Netherlands. The varieties of Low Saxon of the Netherlands are all there.. They are named and have their codes..

The point that I am raising is: languages are a mess. The codes for our projects are as a consequence a mess. The procedures for new projects are a mess because of the politics and the codes we have come up with in the past. What we need are better guidelines on how our codes relate to the ISO-639 codes. If the WMF says it uses the ISO-639 codes, the codes must be used as defined or they must be clearly different.

Last but not least, according to the terms of use, we are not allowed to extend the codes in the way that we do.


Friday, April 28, 2006

and now for some other news ..

I read on the BBC news website that the UN is cutting in half the daily rations of the refugees in the Darfur region due to a severe funding shortfall .....

From May, the ration will be half of the minimum amount required for each day.

They are starving and they are the lucky ones ......

Thursday, April 27, 2006


The classic business model for lexicology, terminology and thesauri is .... create a dictionary and keep it to yourself. Protecting this investment is difficult; the contents are facts, so they are only protected as a collection of data. The classic way of "proving" that the collection has been "stolen" is by adding some nonsense words or other bogus data to your collection.

The open / free business model for lexicology, terminology and thesauri is .... work together on a stellar resource and make the data available to everybody. With the data available to everybody, the big question is how to achieve the best result. For an open / free resource, the best way is by providing the data in a standard way. The first standard I want WiktionaryZ to use is the "TermBase eXchange" or TBX standard. Given what we do in WiktionaryZ, LMF and SKOS are two other great standards.

Why use standards? Simple, our definition of success is: "when people find a use for our data we did not think of". By providing the data in a standard way, it will be available in a stable way and, as a result, it will be easier for people to make use of our data. It will be easier for WiktionaryZ to become a success.

Yes, I love our "competition", but I will love them to bits when they want to be as relevant as we want to be.


Monday, April 17, 2006

What to do with stuff that is good but not standard

What to do if you can get a lot of content that is good in some respects and lousy in others? German uses different characters than the English language and there are ways to indicate an Ä/ä, Ö/ö, Ü/ü or ß. When you are writing German these characters should be used, and not an ue for instance. So when should we accept content with German that is non-standard?

I have been thinking about this for a few days. The answer for me became obvious; it is in the database. When we have the MisSpelling table, we can have the community identify the words that should have an umlaut. With proper logic, the representation for our public will be the proper German with the umlauts.. But the first thing is to have the MisSpellings..
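The MisSpelling idea can be sketched in a few lines. This is a hypothetical illustration, not the WiktionaryZ schema: the table below stands in for community-maintained data mapping non-standard German spellings to the proper forms, and `display_form` stands in for the "proper logic" that shows the public the correct spelling.

```python
# Hypothetical MisSpelling table: non-standard German spellings mapped
# to the proper forms with umlauts and eszett, as filled in by the
# community.
MISSPELLINGS = {
    "fuer": "für",
    "schoen": "schön",
    "strasse": "straße",
}

def display_form(word: str) -> str:
    """Show the proper spelling when one is recorded, else the word as-is."""
    return MISSPELLINGS.get(word, word)

print(display_form("fuer"))    # für
print(display_form("wasser"))  # wasser (no entry, shown unchanged)
```

The point of keeping the mapping in a table rather than applying a blanket ue-to-ü rule is that the community decides per word; not every "ue" in German hides an umlaut.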


Friday, April 14, 2006

Diana posted an answer ..

Diana posted an answer to an entry of this blog. I know she did because I was sent a message to inform me of the fact. I was quite happy with her message; she wants to get into contact with me, but I do not know how to get into contact with her.

I am GerardM .. :) You can find me on the WiktionaryZ site. I am very happy to talk about collaboration. I am happy to remind everyone how we define success for the project: Success is when people find an application of our data that we did not consider in the first place..


Tuesday, April 04, 2006

Some of the best things in life are free

Last week I went to the Berlin 4 Open Access - From Promise to Practice conference in Golm, Germany. For me it was an education. The really big thing that I now appreciate even more than before is the extent to which science is prevented from being science because of restrictive practices.

Typically, something can be called scientific when the conclusions are arrived at in a methodical way and the method is repeatable. This is exactly what the Open Access movement wants to bring back. In order to do this, they have to wrest away the restrictions that copyright has put on scientific data for far too long. A lot of bad science is the result of these restrictions, and a lot of wastage is the result of these restrictions.

What I learned is that many superb resources are becoming available to the world as a consequence of this movement. Open Access is a rich tapestry with many threads in many fabrics of many colours.

If there was one thing disappointing, it was the lack of awareness of licenses. "It is a CC license.." was the answer, and people applauded. Well, Creative Commons has great licenses, but it is a bit like Animal Farm; all CC licenses are Free but some are more Free than others ...


Saturday, March 25, 2006

A new concept; the "Regime"

In WiktionaryZ we want to have a lot of data: lexicological, terminological and thesaurus information. Information may come from many sources, reputable sources. And yes, in WiktionaryZ people will be able to add and modify. When the information fits in with the existing data, it will be important to know where it fits in from a quality point of view.

Enter the Regime. A regime would be a procedure that people can submit to. It would be optional. There will be many areas where we can develop these regimes. It may be that a regime is developed or executed with organisations that we will partner with.. One thing, to state the obvious: our data is Free, so the process will be transparent and the resulting data will be Free.

It is an idea that we came up with recently. We discussed it with some people and now we are interested in what you think of it.


Saturday, March 11, 2006

Babel templates and WiktionaryZ

I discussed how to proceed with WiktionaryZ with Dvortygirl. What she said was that it is time to ask people to get involved. According to the planning, we hope to have an editable version of the relational data at the end of the month. The software will be a "pre-alpha release"; the importance of this release is to show that we can edit relational data in a wiki.

WiktionaryZ is a fully functional wiki. This means that we can add content; we can create users, templates and categories. We should when it makes sense. And it does. When we start the coming test, we will start with people who understand what WiktionaryZ is about. This means that they understand the concept of the DefinedMeaning. Another factor that will help us decide who to ask is the languages these people master.

WiktionaryZ will for now not be available to anonymous users. People can create a user account. When they add Babel information to their user page, we will learn who has expertise in what language. The templates we start with have been copied from the en.wikipedia and we hope we will get many more templates, allowing us to have five levels: the native speaker, and levels 1 to 4 to indicate a growing level of proficiency.

The scope of the test is limited; the priority is learning about the edit process for relational data. What does work and what does not? What improvements will be needed to make us ready to meet the "great unwashed"? We will also be able to work on information that will be important in later phases, things like names of languages and other terminology that is likely to end up in the user interface...


Tuesday, March 07, 2006

If I had money that I could freely spend ...

If I could hire a programmer to give me one extra piece of functionality, something that is not scheduled at this moment, what would I have him do? What if I had a budget for at least a few weeks of work ..

Interwiki links
I hate these things. They are always out of sync. Many people spend a lot of hard work on getting them right, while the problem is getting bigger and is not solved at all. It feels like a waste. When a project grows, it has articles that need to be linked. As more and more wikipedias grow, the number of articles that need updating grows rapidly. Small projects are not easily integrated. It costs a lot of resources.

With a centralised database, we could link an article to another article and, by inference, it would be known to all the articles that are linked to it. As all the articles are about the same subject, we could check if article names are translations. When they are, that is a basis for linking to the lexicological content of WiktionaryZ ..
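The inference described above can be sketched as a simple grouping structure. This is an illustrative toy, not pywikipedia or any real backend, and all names in it are made up: articles about the same subject share one group, so recording a single link connects a new article to every existing member.

```python
# Toy centralised interwiki store: articles about the same subject are
# kept in one group, so one recorded link implies links to all members.
class InterwikiGroups:
    def __init__(self):
        self.group_of = {}  # (wiki, title) -> group id
        self.members = {}   # group id -> set of (wiki, title)
        self.next_id = 0

    def link(self, a, b):
        """Record that articles a and b cover the same subject."""
        ga, gb = self.group_of.get(a), self.group_of.get(b)
        if ga is None and gb is None:
            ga = self.next_id
            self.next_id += 1
            self.members[ga] = set()
        elif ga is None:
            ga = gb
        elif gb is not None and ga != gb:
            # Merge the smaller group into the larger one.
            if len(self.members[ga]) < len(self.members[gb]):
                ga, gb = gb, ga
            for art in self.members.pop(gb):
                self.group_of[art] = ga
                self.members[ga].add(art)
        for art in (a, b):
            self.group_of[art] = ga
            self.members[ga].add(art)

    def links_for(self, article):
        """All articles linked to this one, by inference."""
        g = self.group_of.get(article)
        return set() if g is None else self.members[g] - {article}

iw = InterwikiGroups()
iw.link(("en", "Water"), ("nl", "Water"))
iw.link(("nl", "Water"), ("de", "Wasser"))
print(iw.links_for(("en", "Water")))  # the nl and de articles, inferred
```

Compare this with the bot described above: instead of every wiki page carrying its own copy of the link list, one edit to the central store updates the links everywhere at once.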

Inflection boxes
When a verb, a noun or an adjective changes under given rules, it makes sense to have inflection boxes. They are generated using templates on many wiktionaries, but it makes more sense to have software that allows us to build these boxes; software that associates inflections of one language with the inflections of other languages for the purpose of translation.

Better support for tools like OmegaT
OmegaT is a CAT tool; it helps translators with their work. I would do two things to OmegaT: I would have it read directly from a MediaWiki wiki and, when the translation is finished, write it to another wiki. I would also have its translation glossary function make use of WiktionaryZ..

Yes, there would be a quid pro quo: when a translator adds a word to the glossary, it would be fed back to WiktionaryZ.


PS: What would your suggestion be?