Monday, November 28, 2005

Content looking to be seen

If there are good reasons to share lexicological content, the best seems to me that sharing makes what you have more relevant. Some content does not really need more exposure, but the work of Professor Rennison on the Koromfe language is content that does. The language is not well known; it does not even have an article by that name in Wikipedia. Even the Koromba people (indigenous to Burkina Faso) do not have their article yet.

Professor Rennison, who worked some twenty years on the Koromfe language, made his resource more relevant by supplying not only an English but also a French and a German translation of the Koromfe idiom. Consequently his data is more accessible. At this time, Ultimate Wiktionary is little more than a promise. It will be great if it can become a place where the Koromfe language is as important as any other language.


Saturday, November 26, 2005


The English-language Wiktionary has a new experiment. Connel, one of the Wiktionarians, has downloaded Project Gutenberg, performed a word count and ranked the words by frequency. The word "seeing", for instance, currently occupies 621st place.
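A minimal sketch of such a frequency ranking, in Python. This is not Connel's actual script; the function name `rank_words`, the tokenising rule and the sample sentence are assumptions made for illustration only.

```python
import re
from collections import Counter


def rank_words(text):
    """Return a dict mapping each word to its rank (1 = most frequent)."""
    # Assumption: a "word" is a run of letters and apostrophes, case-insensitive.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Sort by descending count, then alphabetically so that ties are stable.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _count) in enumerate(ordered, start=1)}


# Hypothetical sample text standing in for the downloaded corpus.
sample = "the cat saw the dog and the dog saw nothing"
ranks = rank_words(sample)
print(ranks["the"], ranks["dog"], ranks["saw"])
```

On a real corpus the same function would be fed the concatenated texts; how ties are ranked is a design choice, here they are broken alphabetically.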

I really like this example of being creative with this aspect of wiki. When I discussed it with Erik in Berlin, we found that there is indeed little room for it at this moment in time. Maybe we should consider having some free space at designated places where one can freely enter data that is not structured.

more resources: frequency list discussion


Thursday, November 24, 2005

Alternative definitions

In my Berlin III blog of November 21, people commented on what I said about Meanings. They objected that there would only be one DefinedMeaning. Thinking about this, I had to conclude that often one lemma can have multiple definitions and still be the same thing.
  • the problem solving ability
  • what the intelligence test measures
These two definitions are both for intelligence and they have the same translations. I had to learn these, among others. As I said in my reply, there is a need to decide which definition is the one that ties the DefinedMeaning down. That will remain the case. However, alternative definitions are welcome as long as they try to define the same thing.


The I&I conference and IEEE LOM

Yesterday and today the 15th I&I conference was held. There were some 200 educators who deal with integrating computers into the educational process. For Wikipedia the 14th edition was important, as it was there that teachers told Kennisnet that Wikipedia was important for Dutch education. This resulted in cooperation between Kennisnet and the Wikimedia Foundation.

This time there was a large group of Wikimedians there to inform people about what we do. Kennisnet gave a great presentation on how they are experimenting with wikis in education.

I learned from many people that tagging educational content is starting to become important. The IEEE LOM standard adopted in the Netherlands (EDUSTANDAARD) is being implemented in many software applications. One problem is that for such a standard to make an impact, much data needs to be tagged. This means that everybody is a winner when educational material is shared among schools. To do this, some authorisation and authentication is needed, so that students will not get exams from the schools that do share. Kennisnet provides just such a service in their Entree. Having this is likely to be a key enabler for successful sharing of content.

One other key enabler is to just share. People will have to get used to the idea that it is like marriage: both parties think they are giving more than the other... :)

One thing I would really like is for MediaWiki to include the potential to have IEEE LOM tags. Through our interwiki links we know which articles share the same subject. Consequently these articles can share many of the tags given in one flavour of IEEE LOM. There are several reasons why we should do this:
  • it improves the accessibility of our information
  • it would involve many people in education in our projects
  • it would stimulate all implementations of IEEE LOM
  • there would be another reason for having Ultimate Wiktionary; it could localise the IEEE LOM tags.
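The idea of sharing tags across interwiki-linked articles could be sketched roughly as follows. Everything here is an assumption for illustration: MediaWiki has no such function, the article names are made up, and the tag keys (`context`, `age_range`) are placeholders rather than real IEEE LOM element names.

```python
def share_tags(interwiki_group, tags_by_article):
    """Give every article in an interwiki group the tags set on any of them.

    interwiki_group: list of article identifiers about the same subject.
    tags_by_article: dict mapping an article identifier to its own tag dict.
    """
    merged = {}
    for article in interwiki_group:
        for key, value in tags_by_article.get(article, {}).items():
            # Assumption: when two articles disagree, the first value seen wins.
            merged.setdefault(key, value)
    # Each article gets its own copy so later edits do not leak between them.
    return {article: dict(merged) for article in interwiki_group}


# Hypothetical example: only the Dutch article has been tagged so far.
group = ["nl:Fotosynthese", "en:Photosynthesis", "de:Photosynthese"]
tags = {"nl:Fotosynthese": {"context": "secondary education", "age_range": "12-15"}}
shared = share_tags(group, tags)
print(shared["en:Photosynthesis"])
```

Only language-independent tags can be shared this way; anything like a title or description would still need translating, which is where a lexicological resource would come in.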

Tuesday, November 22, 2005

Home but busy

Well, I am home. A great weekend but a tiring trip home. At home I went to bed and I am still a bit groggy. Tomorrow I will be at the I&I conference. Only after that, in two days, can I start work on all the details that I have to document... :(

MediaWiki is a great product and it is great at scaling. With Commons we have our image repository; with Ultimate Wiktionary we will have our lexicological resource. All this is part of the great idea of bringing all information to all people in their own language.

It is great when you can be part of this puzzle of how to get all this together. There is, however, so much to do; what to do first?


Monday, November 21, 2005

Berlin III

Still in Berlin. At the end of three days of Ultimate Wiktionary we have done a lot of work. There is still a lot of work left, because what people want, need and deserve is something to show for all the work done. As visibility is important, we are going to have two things happen as soon as possible: first the Wikidata milestone 1 has to be finalized and committed to the CVS release branch, and then we will have an extra step to publish a read-only version of the GEMET data. This, combined with all the languages that are in the ISO 639-3 provisional version (in English), will give a clear idea of what we want: great lexicological content in all languages.

Technically, some things in the data design will be changed, among them a change of the Meaning table; it will become DefinedMeaning. This is to reflect that the DefinedMeaning defines which MeaningText is the one that truly defines a meaning. The point is that for a meaning you have to decide what language and what word define what it is. The other MeaningTexts in the other languages should be a translation of that specific text.
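The relation between a DefinedMeaning and its MeaningTexts might be pictured like this. It is only a sketch in Python, not the actual database design: the field and method names are assumptions, and the Dutch text is an illustrative translation.

```python
from dataclasses import dataclass, field


@dataclass
class MeaningText:
    language: str
    text: str


@dataclass
class DefinedMeaning:
    # Assumption: the defining text is identified by its language code.
    defining_language: str
    texts: dict = field(default_factory=dict)  # language code -> MeaningText

    def add_text(self, language, text):
        self.texts[language] = MeaningText(language, text)

    def defining_text(self):
        """The one MeaningText that truly defines this meaning."""
        return self.texts[self.defining_language]

    def translations(self):
        """All other MeaningTexts, which should translate the defining text."""
        return [t for lang, t in self.texts.items() if lang != self.defining_language]


meaning = DefinedMeaning(defining_language="en")
meaning.add_text("en", "the problem solving ability")
meaning.add_text("nl", "het probleemoplossend vermogen")
print(meaning.defining_text().text)
```

The point of the structure is that exactly one text anchors the meaning; every other text is judged against it rather than against each other.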

One great thing is that we came up with an improvement regarding inflections. The problem is that it does not make sense for all inflections to show up in the list of synonyms and translations only because they share the same DefinedMeaning. By adding to each inflection the key of the InflectionWord it belongs to, we can show only the headword for the parts of speech. Yes, it is a database change.
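A rough sketch of what that extra key buys, in Python rather than SQL. The `Word` class and the field name `inflection_of` are assumptions for illustration; they are not the real table or column names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Word:
    spelling: str
    language: str
    defined_meaning_id: int
    # Hypothetical key: the headword this form is an inflection of.
    # None means the word is itself the headword.
    inflection_of: Optional[str] = None


def headwords(words, defined_meaning_id):
    """List only the headwords that share the given DefinedMeaning."""
    return [
        w.spelling
        for w in words
        if w.defined_meaning_id == defined_meaning_id and w.inflection_of is None
    ]


words = [
    Word("see", "en", 1),
    Word("sees", "en", 1, inflection_of="see"),
    Word("seeing", "en", 1, inflection_of="see"),
    Word("zien", "nl", 1),
    Word("ziet", "nl", 1, inflection_of="zien"),
]
print(headwords(words, 1))  # ['see', 'zien']
```

Without the key, all five forms would show up as translations of each other because they share the same DefinedMeaning.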

Erik was not happy with my Table table. He called it a hack and, yes, it is a hack. So he does not want it, he does not want it, he does not want it... So it is to go. Erik is correct when he says that NOT having this hack means much cleaner code, and that it will help with the scalability issue. A major point. So I will create a few more Relation tables that will be more specific.


Saturday, November 19, 2005

Berlin II

Today was a lot of work in a short time. We went over the database and, given Wikidata and the Ultimate Wiktionary, it has some constraints. UW may be the first implementation of Wikidata, but it will certainly be one of the more complicated ones. This is good in a way; it means that the technology will be fleshed out from the beginning.

Some of the things were thought to be hacks and yes, in a way they are, but they are hacks that work... things like my Table table... :)



Today I am in Berlin to work on Ultimate Wiktionary. I am really happy to be here and I expect it to help a lot in making Ultimate Wiktionary AND Wikidata available. We are now having a break and one fun thing we already experienced is how much Erik and I have a different outlook. I am very much Ultimate Wiktionary oriented while Erik is more into Wikidata.

We have discussed many things. The subject of hosted thesauri is one that will come back again. We are now into data design.


Friday, November 18, 2005

Adopting changes in an included thesaurus/glossary

When authoritative glossaries or thesauri are included in Ultimate Wiktionary, their maintainers have to choose how they want to make their content available. When they choose to be separate and have a restricted group of people work on their content, they will not benefit from the wiki way. As the definitions and structures are separate, they will find that alternative meanings and structures will arise: meanings that are essentially the same.

It is therefore that a method needs to be found to merge glossary entries, thesaurus entries and wiktionary entries. This may seem like a simple thing and in many ways it is. The problems arise particularly in the thesaurus structures that are associated with a lemma.

When the community around a thesaurus decides to adopt a lemma, the version that is adopted is tagged as being part of the thesaurus. When the lemma is changed, the later version can be adopted as well. This allows for one way of quality control. Alternatively, the changed lemma is flagged as "pending approval"; this in turn allows for a second method of quality assurance. The wiki way would be to assume good faith and expect a change to be a change for the better.
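The adoption of a specific version could be sketched like this. It is a simplified model under stated assumptions: the class names `Lemma` and `Adoption`, the revision numbering and the sample definitions are hypothetical and not part of any existing design.

```python
from dataclasses import dataclass, field


@dataclass
class Lemma:
    revisions: list = field(default_factory=list)

    def edit(self, text):
        """Store a new version and return its revision number."""
        self.revisions.append(text)
        return len(self.revisions) - 1

    @property
    def latest(self):
        return len(self.revisions) - 1


@dataclass
class Adoption:
    """Records which revision of a lemma a thesaurus community has adopted."""
    lemma: Lemma
    adopted_revision: int

    def status(self):
        if self.adopted_revision == self.lemma.latest:
            return "adopted"
        return "pending approval"

    def adopted_text(self):
        return self.lemma.revisions[self.adopted_revision]

    def approve_latest(self):
        self.adopted_revision = self.lemma.latest


lemma = Lemma()
first = lemma.edit("water: a clear liquid")
adoption = Adoption(lemma, adopted_revision=first)
lemma.edit("water: a clear liquid, H2O")
print(adoption.status())
```

Until `approve_latest()` is called, the thesaurus keeps presenting the version it adopted, while the wiki version moves on.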

When agreement can be reached on how a word is defined and how it is translated, the relations may need to be tagged as belonging to a specific thesaurus or glossary. As agreement could be reached on the Meaning, the connections between the different thesauri are what bring the similarities into focus. This in turn may help bring more understanding.


Thursday, November 17, 2005

Why dictionaries under a Free license should cooperate

It is easy to explain why the non-free dictionaries are not “Free”. They are not free because people think that more money can be made with these dictionaries. It is elementary. It is impossible to understand why a dictionary available under a “Free” license should be licensed under any particular license.

From my perspective, there are a few things that are relevant. It must be “Free”; this is something that is shared by all, so that people are able to use it. The other thing is that there must be attribution.

When we work together on a FREE resource, it will not take too long before we have enough relevant information to matter in a language. From that moment onwards quality and quantity will grow, because "enough" is the tipping point where it makes sense to add content to the shared resource. When a new spell checker is generated every week, it is obvious where to correct mistakes and where to add content. A spell checker only needs to say where this can be done (this IS the reason for attribution); that would enable people to get updates, and it would also make it possible to find who contributed to the resource.

It is really important to understand about DATA that a license can only be "viral" with respect to DATA. Using a spell checker generated with a GPL licensed dictionary does not change the license of the software running this data. To be honest, I am happy that it does not, because it makes it obvious. It makes it obvious that the “Free” dictionaries should work together. By working together we will achieve more than what we can achieve separately.


Ideas on quality assurance for Ultimate Wiktionary

Ultimate Wiktionary is intended to be used, not only interactively but also by programs. Certainly when the lexicological data is used in earnest, there will be an understandable need for quality assurance. On the other hand, the way of the wiki is that we allow for the collaboration of everyone who has something to contribute.

This is a complicated thing and the current thinking is as follows:
  • All edits need to be validated two times by different people to be considered "good"
  • When two words are considered to be the same in Meaning and Expression, they can be merged. The translations of these meanings will be merged but they get the status of a newly added word. This is to ensure that translations are considered again for their validity.
  • Bots will be disallowed from making interactive edits. Every bot has to be associated with an interactive user.
  • A thesaurus or glossary can, when it is agreed that it needs this status, be write protected; this means that comments can be made on the "talk page".
  • When a bot is to be used for a specific purpose, i.e. the maintenance of a specific write-protected thesaurus or glossary, it can be given a status of implied quality control. This means that the organisation that maintains this resource in the Ultimate Wiktionary is wholly responsible for its own quality. We are thinking of terminology like that of the Roman Catholic church, where the exact nature of the definitions is a matter of doctrine.
  • The user associated with such a bot will be the admin for this glossary or thesaurus. This admin can allow users to make changes to its resource.
  • Like on the other Wikimedia projects, we will need admins to do the necessary maintenance. The admin status for deletions should be given per language.
  • A privileged few will be admin / bureaucrat for the whole of the project. They will have access to all languages and all resources. As you can imagine, they will be in a glass cage.
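The first two points above could be sketched as follows. This is a sketch of the current thinking, not a design: the class name `Edit`, the constant and the user names are assumptions, and whether the original author may count as one of the validators is left open here.

```python
REQUIRED_VALIDATIONS = 2  # "validated two times by different people"


class Edit:
    def __init__(self, description):
        self.description = description
        # A set, so the same person validating twice still counts once.
        self.validators = set()

    def validate(self, user):
        self.validators.add(user)

    def reset(self):
        """Back to the status of a newly added word, e.g. after a merge."""
        self.validators.clear()

    @property
    def is_good(self):
        return len(self.validators) >= REQUIRED_VALIDATIONS


# Hypothetical translation entry and user names.
translation = Edit("en:intelligence -> nl:intelligentie")
translation.validate("anna")
translation.validate("anna")   # the same person again: still one validation
print(translation.is_good)     # False
translation.validate("ben")
print(translation.is_good)     # True
```

When two words are merged, calling `reset()` on the merged translations would force them to be considered again for their validity.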

Tuesday, November 15, 2005

Semantic web

At some stage it had to be. The semantic web, the holy grail of this digital age had to come my way. The sum of all knowledge is to be found by it. It is a great endeavour, many people work really hard to make it a reality and so far it passed me by.

I am happy that it passed me by until now. I am happy because in my ignorance I was able to come up with the Ultimate Wiktionary. Ultimate Wiktionary is in its own way equally ambitious; it wants to have all lexicological information on all words of all languages. Some people say that you cannot know a thing if you do not have a word for it.

The semantic web came my way in a meeting at the University of Rotterdam; they need a lexicological resource for a big thesaurus. They have experimented with products that are closely related to the semantic web.

What I have understood is that certain words are used in a tree of concepts; in order to make this information usable in other languages, these concepts have to be translated. In my opinion that is exactly what the Ultimate Wiktionary is intended to do. When a "concept" is associated with a particular "Meaning", it follows that the translations and synonyms can be used to present these relations in other languages.

I understand that there is this idea that a concept and the tag used in the relations are considered by some to be distinct. At this moment I think that is not really practical. It is great that I will learn more about the semantic web.

I am on record that when people try to find a use of the Ultimate Wiktionary that I did not consider, I would think the Ultimate Wiktionary a success. By that standard, even though it is not operational yet, it is doing well.


Monday, November 14, 2005

The Century Dictionary

The Century Dictionary was in its time a wonderful resource; even though it has aged, it is still a wonderful resource. It gives a great impression of how dictionaries were at the end of the 19th and the beginning of the 20th century (1889-1910).

The fact that much effort was undertaken to make it available in this digital day and age is wonderful. It is advertised as the biggest on-line English dictionary on the Internet; with more than 500,000 definitions it may be just that.

The Wikimedia Foundation was asked for advice on how this splendid resource could be modernized and updated. I felt privileged to be asked for an opinion. As I think highly of resources like the Century Dictionary, I would at most convert the digitized content where this improves the usability of the data. As I value the Century Dictionary for what it is, I would definitely keep the data as is.

This does not mean that all this lexicological information cannot be used to build a modern dictionary. This can be done in many ways. An important consideration is that the data of the Century Dictionary is firmly in the public domain. This means that any existing project that works on building a dictionary can and may use this data.

I would not mind including the data of the Century Dictionary in the Ultimate Wiktionary. It would prove a challenge to fit it into what has always been envisioned as a modern dictionary. Then again, the Ultimate Wiktionary is also to be inclusive. So when the opportunity comes to include whole dictionaries, I am sure we will find a way and make sure that it makes sense for our users as well.

The conversion of the Century Dictionary will be a lot of work. However, there are many professions where the skills for such a project are taught in universities. It is therefore that I could see students working on such a project for a term project.


Sunday, November 13, 2005

How to cooperate with a community

Wikipedia is an outrageously successful project of the Wikimedia Foundation. Its most prestigious project is the Wikipedia in English; it has over 818,000 articles and an amazing number of active contributors. The achievements of this community are not only in this high number of articles, but also in the rating given by Alexa (today no. 38), the attention we get in the press and the overall quality of the articles.

The community that makes all this possible is essential to what is done. The community does not exist as one big, always agreeing whole. If anything, the difference in what people want out of Wikipedia is huge. There is a tension between people who want to concentrate on the "main" Wikipedias and people who want to create new Wikipedias. There is a tension between having only illustrations that are Free and having illustrations that are free to be used. Given the size of our community (a middle-sized town), we have people willing to utter any POV.

In what I am doing, I am truly outside the Wikipedia community; I am firmly into dictionaries and I want to take the Wiktionaries to the next level. Ultimate Wiktionary is intended to be inclusive of all the lexicological data and applications we can think of. We came up with spell checkers and computer-aided translation tools; being both a translation dictionary and a descriptive dictionary is what we started with.

As Ultimate Wiktionary is there to be an inclusive lexicological resource, we invite everyone to join us in making it exactly that. It is easy for people to join; they just do. For organisations it is different. They often have a lot to offer but they also have their requirements. To address these requirements you have to be approachable. Sure, the Wikipedia community is approachable, but it lacks the ability to come up with a single response, as the community is divided. It is unable to come up with a quick response, and when a response is given it often does not answer the question that was asked in the first place.

The Wikimedia Foundation has as its goal to bring all information to all people of the world. Wiktionary has as its goal to bring all lexicological information to all people of the world. It therefore makes sense to have a consortium where organisations can find a focal point where their need for cooperation can be discussed, given that we are open to cooperation in a non-discriminatory way and that adding information makes us richer and more relevant. When organisations have a need for quality, we should find our way to providing this quality assurance. We should, when it is a legitimate request.

Ultimate Wiktionary will find its legitimacy not in being yet another on-line dictionary but in giving this data an application. When it does not go beyond what every dictionary does, it will be a failure. When organisations like the University of Bamberg have a use for the Ultimate Wiktionary, Ultimate Wiktionary will become a credible and important resource.