Saturday, December 31, 2005

A meeting with Brion

I have been to Berlin again. Again for a conference, and again first and foremost to meet one man. Brion Vibber is the Chief Technical Officer of the Wikimedia Foundation, the person responsible for the technology behind Wikipedia and all the other Wikimedia Foundation projects. Brion thinks that the first development milestone, the Wikidata namespace manager, could go into the next release of MediaWiki. WOW!

We discussed the long standing need for single login. Because it is a dependency for many other projects related to Ultimate Wiktionary and Wikimedia in general, if at all possible, Brion will start working on it soon after the release of MediaWiki 1.6. Brion will also look into Surfnet's A-Select in order to interface Wikimedia with other authentication service providers. There is a need for outside authentication to Wikimedia projects, especially Wiktionary, and A-Select has the potential to provide it. However, Brion feels that thinking about federation is only possible after the Wikimedia-internal authentication problems are fully resolved.

Together with Erik, we discussed the need for handling multiple languages inside a MediaWiki installation, which is obviously related to Wikidata and UW and will likely be one of our next development milestones. While Brion sees this as a quite complex problem, he did agree that the situation as it currently is - that multilingual projects like Meta and Commons have no language-awareness whatsoever - is broken. What we agreed to do is to send him specifications to review before any implementation begins. Brion also pointed out some current and potential problems with MySQL: that certain UTF-8 characters cannot be stored with the proper charset encoding, and that it may not be possible to have multiple sorting orders on a field without duplicating it. (In this context, we debated the need for following important standards such as CLDR for locale data.)
Besides meeting Brion, we made an appointment with two officials of Wikimedia Germany to discuss the potential for cooperation in different areas. So far, the signals are positive.
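
On the MySQL point above: a minimal sketch, assuming the storage problem Brion mentioned is the one where MySQL's "utf8" charset keeps at most three bytes per character, so characters outside the Basic Multilingual Plane do not fit. The function name and the sample words are only there for illustration.

```python
def storable_in_three_byte_utf8(text: str) -> bool:
    """Return True if every character of text needs at most 3 bytes in UTF-8."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


# Illustrative samples; the Gothic letter lies outside the Basic Multilingual Plane.
samples = {
    "Dutch": "woordenboek",
    "Japanese": "\u8f9e\u66f8",
    "Gothic": "\U00010348",  # GOTHIC LETTER HWAIR, needs 4 bytes in UTF-8
}

for name, word in samples.items():
    print(name, storable_in_three_byte_utf8(word))
```

A check like this would tell us up front which expressions a three-byte charset cannot hold.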

Guten Rutsch!


Monday, December 26, 2005

May I present to you ...

I have been hoping for this, I have been more or less promising this, and it is with great happiness and gratitude that I announce that on Boxing Day we can showcase the first public Wikidata database. It contains some 70,000 words in 22 languages, with definitions typically in 4 languages.

We have been given permission by the European Environment Information and Observation Network (EIONET) to host the GEneral Multilingual Environmental Thesaurus (GEMET). This showcases our wish to have great, relevant content in many languages, including content that has the structures found in thesauri.

The data is not complete yet and the content will be improved. To put it in perspective, this read-only implementation of a Wikidata database is a proof of concept. It will be expanded slowly but surely, to improve not only the technical features but also the information and the user interface.

We really welcome all your constructive comments.


Sunday, December 25, 2005

More pressies

Today I received another present. Angela informed me that a gentleman from Australia became aware that we might be interested in a machine translation engine. This is absolutely great news, because the engine is an "n-gram" based design. This means that it has the potential to function for many languages. Another very relevant detail is that it has a small footprint.
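
To give an idea of why an n-gram design is not tied to one language, here is a minimal sketch of counting word n-grams. It says nothing about how the engine itself works, which I have not seen; the function names and the sample sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(sentences, n=2):
    """Count n-grams over whitespace-tokenised sentences (any language)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.split(), n))
    return counts


counts = ngram_counts([
    "merry christmas and a happy new year",
    "a happy new year to all",
])
print(counts[("happy", "new")])  # 2
```

Nothing in the counting knows about grammar; it only needs text, which is why such a design can in principle be fed with any language.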

What is important is that even when this software is not, and does not become, Free software, we will be able to cooperate. This is the key thing of the Ultimate Wiktionary project.

PS I received an e-mail from Erik that he is importing the GEMET data into a Wikidata database. It does a cool 1,000 records a minute.

PS2 I am also celebrating Christmas. But I am in two minds... I want this so bad. How much I want to be like a Santa bringing good cheer :)


Saturday, December 24, 2005

Under the Christmas tree

On Christmas Eve, by magic presents appear under the Christmas tree. The great thing is the suspense. The suspense of what will it be or the suspense of how a particular present will be appreciated. Presents under the tree, Christmas wishes presented in many ways, it is a festive moment.

Well, what do we find under our virtual Christmas tree? Today I received an e-mail from Erik, who wants meat on the bones of his present. He does want "versioned tables" to be included, even though the main purpose is that we have something to show :) . It is still very much intended to be on-line in the coming day or days.

Yesterday, I had a conversation with Jimmy Wales. We discussed many things. I was really happy to learn that he considers it necessary to have "committees" that, on behalf of the board, take care of Wikimedia Foundation projects that do not get the attention they deserve. Wiktionary would be one of these projects.

In the last month many things have happened; I hinted at some and not at others. Some of the highlights are the potential cooperation with several organisations. Two of these I want to highlight: the GEvTerm project and the ProZ organisation.

  • GEvTerm is based on a great idea. When you have an international event somewhere, many people will congregate in one place. They need to communicate, but it cannot be expected that all people share one common language. The idea of GEvTerm is to concentrate on translations that are associated with the particular event and to make them available to ease the human interaction.
  • ProZ is a member-based organisation of professional translators. It serves the largest community of translators. ProZ has been active building glossaries, and they have their KudoZ, where colleagues can help out with particularly problematic translations. They were about to create their own dictionary and we were lucky to get into contact with them through Sabine, who is a ProZ member. We are now talking about how we can create one fabulous resource together.

As these are only two wonderful opportunities, you will appreciate that a consortium of like-minded organisations will realise an even bigger potential to bring a resource with lexicological, terminological and thesaurus information that is there to be used Freely.

I wish everyone the most joyous of

Friday, December 23, 2005

Is a rose by any other name as beautiful?

Ultimate Wiktionary is the name of a project. The initial goal was to improve on the existing Wiktionaries. This would bring cooperation between people who are interested in specific languages but choose a different language for their user interface. When adopted, it would mean that we can concentrate our effort in one resource.

The name of the project does not determine that the eventual project will be found at "". There are reasons for it, and there are reasons against it. Naming it like this is cheap, as the wiktionary domain is already owned by the Wikimedia Foundation. The second reason is that a fair number of people already know this name, and finally it links the old with the new. A big argument against it is the use of the "ultimate" label. The functionality of the software will grow, but at first there will not be much that deserves this accolade.

Now there is this opportunity: what name to pick, and what arguments to use?
  • Wiktionary2
  • WiktionaryZ
These are the two contenders at the moment that people came up with. Personally I like WiktionaryZ best. When pronounced it is "wiktionaries" and it reflects that we will include all the content of the current Wiktionaries and also it really links into the past. Given the way it is written it also reflects modern times and the approach that we take with Ultimate Wiktionary is an exponent of our time.

Wiktionary2 is another great name. It also gives a great link to the current Wiktionaries and it symbolises well the big technological step that it represents. One thing that is problematic is that, as some people suggested, it could be seen as a version number; this would mean that the future might bring us a "". This is not a sensible way of doing things, as you do not want to reflect version numbers in your domain name.

I hope that people will like these suggestions, and there is room for many more.


Tuesday, December 20, 2005

Extending the community

Ultimate Wiktionary is there to be used. My definition for success is "when people find an application for the data that we did not think of". There is however nothing wrong with us coming up with new ways in which we can extend the potential use.

Particularly interesting to me are the changes that extend the community of users. We want the scientists and the translators, but we could also have the puzzlers. For me it would be really cool if UW became a challenge to my mother; she likes her crosswords and her cryptograms. Puzzlers are interested in synonymy and definitions, so by adding this one field in the Expression table, the first step is taken to charm yet another group of people into the Ultimate Wiktionary...


Sunday, December 18, 2005

Changes to the data design

I have posted the new data design and I did a lot of annotations. Not all of them by a long shot, and I am sure that Erik wants many more changes. The adage of Open Source is to publish often, so I do.


New tables due to many new ideas and other novelties

Ultimate Wiktionary is making a rush towards its first "outing". This results in all kinds of interesting things. It results in even more interest before we have anything to show for ourselves. It results in reasoned suggestions to replace some labels with others: Attribute instead of Label, because "label" has a meaning that is confusing to a large constituency of UW's community.

Wikidata is not the same as Ultimate Wiktionary and consequently has requirements of its own. It has language requirements of its own. It may need longer texts, and it may require texts in a format that Ultimate Wiktionary frowns upon, like capitalised expressions. As we are investigating the use of TBX for the static part of Ultimate Wiktionary, it made sense to think about TMX as well for this issue. This means that we need some basic stuff to deal with handling translation projects. I have come up with this extension of Ultimate Wiktionary; this data design makes use of tables that are part of UW and may as a result become part of MediaWiki proper.

I realise that when we implement this, we have the core of a translation / localisation workflow. This makes sense when you consider that Wikipedia, one of the biggest websites of this world, exists in 212 different languages. When a MediaWiki message is changed, who is going to do the translation? I doubt that there is one organisation that can do that well on a continuous basis. As I am a firm believer in using standards AND in eating my own dogfood, this is my first take on this issue.


Saturday, December 17, 2005

The relevancy or lack thereof of standards

When is a standard a standard? A standard is a standard when a standards body says it is.

That in a nutshell describes the situation for many standards. As far as I am concerned, I would prefer a definition that includes relevancy: "A standard is a standard when a standards body says so and when it is freely available for adoption". When a standard is not freely available, it means that the standard will not be adopted by some for monetary reasons. The consequence is that money removes relevancy from a standard when it leads to it not being adopted.

In my mind the worst thing that can happen to a standard is that it is ignored or not adopted.


Friday, December 16, 2005

Alternate representations

A new problem in need of a solution is that of "alternate representations". Alternate representations are expressions that do not fit the mold of how you want to have expressions in a lexicological resource. One of the rules has always been that capitalisation is only used for words that are always capitalised, e.g. English (the language) is always capitalised in English. There are resources, resources that we would like to include, that have capitalised forms as synonyms. Another example is "plague, bubonic"; to me that should be "bubonic plague".

Many of these things are the legacy of a paper-based origin. In a digital resource with some magic linking "plague" and "bubonic plague", one would suffice. The problem is in how to make the Ultimate Wiktionary relevant. When we do include "plague, bubonic" in some way, we allow for one-to-one linking from the Unified Medical Language System to Ultimate Wiktionary and vice versa. It would even allow for the inclusion of UMLS data in Ultimate Wiktionary.

My current thinking is about two options. I know that in lexicology they have some annotation to describe in what relation a word exists in a sentence. The other option is to have an AlternateRepresentation table that links an Expression to the preferred Expression.

I do want this annotation anyway; what I do not know is whether this annotation is aware of capitalisation.
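
The second option could look roughly like this. It is only a sketch, using SQLite in memory; the column names (expression_id, spelling, preferred_expression_id) and the helper preferred_spelling are assumptions of mine and not part of the posted data design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Expression (
    expression_id INTEGER PRIMARY KEY,
    spelling      TEXT NOT NULL UNIQUE
);
-- One row per alternate form, pointing at the preferred Expression.
CREATE TABLE AlternateRepresentation (
    expression_id           INTEGER PRIMARY KEY
                            REFERENCES Expression(expression_id),
    preferred_expression_id INTEGER NOT NULL
                            REFERENCES Expression(expression_id)
);
""")
conn.executemany(
    "INSERT INTO Expression (expression_id, spelling) VALUES (?, ?)",
    [(1, "bubonic plague"), (2, "plague, bubonic")],
)
conn.execute("INSERT INTO AlternateRepresentation VALUES (2, 1)")


def preferred_spelling(conn, spelling):
    """Return the preferred spelling for an expression, or None if unknown."""
    row = conn.execute(
        """
        SELECT COALESCE(pref.spelling, e.spelling)
        FROM Expression e
        LEFT JOIN AlternateRepresentation ar
               ON ar.expression_id = e.expression_id
        LEFT JOIN Expression pref
               ON pref.expression_id = ar.preferred_expression_id
        WHERE e.spelling = ?
        """,
        (spelling,),
    ).fetchone()
    return row[0] if row else None


print(preferred_spelling(conn, "plague, bubonic"))
```

Both forms stay in the Expression table, so an outside resource can still link one to one to "plague, bubonic", while we show "bubonic plague" as the preferred form.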


Thursday, December 15, 2005

Terms and a neutral point of view

Ultimate Wiktionary wants to be open, welcoming to all communities and... yes, I was human, so I was convinced that Term was a better term. However, it is one of those words with multiple meanings, particularly in the worlds of terminology, lexicology and thesauri. After some discussion we came to the conclusion that this is not the right word to describe what we mean, and also that it is not really neutral. So, a new word was agreed upon: LexicalItem.

There are some more changes that we decided on in Berlin; the Label table contains attributes and also the name Attribute is less confusing. It is also necessary to include some more intelligence; this means that it must be possible to group the attributes.

There are loads of things that I have learned that I am still internalising. When I have, there will be several other changes.

Oh, the great news is that many of these changes are inspired by the great people that I met in Berlin. It will make Ultimate Wiktionary more relevant to the science types :)


Tuesday, December 13, 2005

terms and what not

In the data design I have a table called Word. Well, strike that. At the Language Standards for Global Business conference in Berlin there were more people impressing on me the need to change this to Term.

I am only human,


Sunday, December 11, 2005

Even handed approach

There was one great thing said in a presentation on Flemish. When you research whether a word is particular to a dialect, it is just as relevant to know what words are NOT part of that particular dialect. There are therefore different situations:
  • A word is specific to the dialect
  • A word is used in all areas where the language is spoken
  • A word is not used in the dialect but is specific to the parts where the dialect is not spoken

This approach would be an NPOV approach. The research that is needed to find these words is difficult. One scientific approach would be to present people with a text and ask them to correct it.
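
The three situations can be written down as a small classification. This is just a sketch of the idea: the region names are placeholders, the function name is mine, and "used in all areas" is simplified to "used both inside and outside the dialect area".

```python
def classify(word_regions, dialect_regions):
    """Classify a word by where it is used relative to a dialect area."""
    used_inside = bool(word_regions & dialect_regions)
    used_outside = bool(word_regions - dialect_regions)
    if used_inside and used_outside:
        return "shared"               # used across the language area
    if used_inside:
        return "dialect-specific"     # only found where the dialect is spoken
    if used_outside:
        return "absent from dialect"  # only found where it is not spoken
    return "unattested"               # no usage recorded at all


# Placeholder regions, purely for illustration.
dialect_area = {"region_a"}

print(classify({"region_a"}, dialect_area))
print(classify({"region_a", "region_b"}, dialect_area))
print(classify({"region_b"}, dialect_area))
```

The point is that the third outcome is recorded just as carefully as the first, which is what makes the approach even handed.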

This approach is not problematic when you consider the Dutch and Belgian situation. Languages / dialects like Andalusian are much more problematic, because people will bring political dimensions to it. The history of the creation of new Wikipedia projects often proves that a language is a dialect with an army.


Saturday, December 10, 2005

TST centrale

The "Instituut voor Nederlandse Lexicologie" has in its TST centrale a resource where people with a need for lexicological content can choose what they need. All this material is copyrighted and it is made available at the lowest possible cost. The material is the result of many scientific projects and it is considered basic material for lexicology based on the Dutch language.

There are other resources that are important to people interested in lexicology. Logos in its dictionary provides a rich tapestry of words with translations. Through its link to wordtheque, you find the words in their context. In the philosophy of Logos, this often provides as clear an idea as a definition would. A link to such a publicly available resource is not available through the resources of the TST centrale.

In the KudoZ open glossaries of ProZ, you find a rich resource of hard-to-translate words. When you start looking for resources that are relevant to the creation of dictionaries, there are many resources that are not created in a "scientific" manner. In practice they can be extremely useful. It is a shame that the scientific resources are not Free, and consequently that the "unscientific" resources cannot be used to enrich them.

Anyway, as long as these resources are used side by side there is nothing that stops the research of lexicology. As the Wikipedias are a rich resource of contemporary language, and as its content is categorised as to subject matter, it is good to know that scientists are free to use it for their research. I checked it with Jimmy Wales and he was happy to confirm this.

Tomorrow I will be going to Berlin. We will be talking about interfacing the Ultimate Wiktionary using the TBX standard.


Friday, December 09, 2005

A difference in approach

Yesterday, I was at a conference for Dutch language lexicologists. It was my first such thing and it was a grand experience. Lexicologists had always been abstract people; now they have faces, they exist in many shapes and forms and they largely do many different things. They work together in many ways and to me they do marvellous things.

The difference in our approach and the scientific approach can be given in one word: scientific. What we try to do with Ultimate Wiktionary is not scientific. Being scientific has never been considered. Our outlook has always been practical. We want to do practical things with our dictionary. Publishing a scientific paper is not practical to us. That is not what our goal is.

Given this difference in approach, there is still very much that we can do for each other. By building a resource that is useful but not complete, it may have a limited scientific value, but it does have a value. Being built by people who do not necessarily share the same methodology, it may be chaotic, but it still has a scientific value. Even with all these "issues", a project that makes lexicons relevant to people who typically do not care is probably the most valuable gift we can give to the science of lexicology. If we can make lexicons relevant and exciting, there will be new people who will find their way into this profession.


Monday, December 05, 2005

Sinterklaas or Christmas is coming

Today, the 5th of December, it is "pakjesavond", the night before the birthday of Sinterklaas, when traditionally presents are given, sometimes with a rhyme or a surprise. Christmas is therefore not that far away, and it is a time to think about how to do things in line with the Christmas spirit. It is a time of Christmas spirit and Christmas gifts.

Last year we started to collaborate on Christmas wishes and it was good fun. It is so funny to see a text in alphabetic script and not have a clue as to how it is pronounced: "Përshumvjet Krishtlindjen dhe Gëzuar Vitin e Ri". Last year we were as ambitious as this year; we would love more people to translate and say "Merry Christmas and a happy New Year!" in their language.

We hope and expect that the first tangible results of all the effort that has gone into Ultimate Wiktionary will be our Christmas gift. In the meantime we will also do some more work on our Christmas glossary. Have a look and see how you can make the glossary yours as well :)


Friday, December 02, 2005

Wikipedia Is The Next Google

There is quite a lot of buzz about a blog entry by Steve Rubel; "Wikipedia Is The Next Google" is, together with the comments, a nice read. For me there are two things to this article: the disruptive nature of organisations, and Wikipedia.

When disruptive technology appears, it changes business as usual. It has done so in our society from the moment when innovation was considered to be good. Innovation was never considered to be universally good, but it led to our current society, with a number of people having it good in a way that could not be conceived one hundred years ago. In a way, with the ever increasing speed of communication, new ideas get an audience with an ever increasing speed.

Wikipedia is an encyclopedia, it is internet based and it is growing as quickly as new servers can be brought online. It can only do this because of the huge pent up demand for affordable information that has a neutral point of view. Important is the realisation that Wikipedia is not one but many encyclopedias. Every month there is yet another language that gets its own Wikipedia.

These Wikipedias all have the ambition to equal the star Wikipedias like the German and the English Wikipedia. They will have to grow from a small project where everybody knows everybody to a project where even the heroes of last year are not known by all anymore. Slowly but surely these projects create Free information and get recognition for the viability of the languages they express.

Certainly when there are few resources in a language, the impact that a wikipedia may have is big. Comparatively Wikipedia cannot be as important for languages like English and German as it could be for Swahili. It will take its own good time..

With all this talk about disruptive technology, it is fun for me to predict that Wikidata and Ultimate Wiktionary will be disruptive in their own right. They will be so in more ways than one. I am anxious to see how conservative the Wikimedia crowd will prove to be. If they are like I expect them to be, they will allow both Wikidata and Ultimate Wiktionary to develop their potential.


Thursday, December 01, 2005

Luxury problems

When you need more people to work on programming and you have them do the work for money, then at some stage you need to pay them. This is actually a great moment, because it means that you have something to show. We are about to hit the first milestone of our development of the Ultimate Wiktionary. This is where we have functionality to identify tables and their relations within a MediaWiki environment.

The problem is that we have to pay outside of the European Community. So we do get into silly stuff like currency and costs. We have to find out what the cheapest way is to get money elsewhere.

It is a problem but I prefer this to not having code finished.