Sunday, July 30, 2006

WiktionaryZ and attributes

At this time, WiktionaryZ is able to have text-based attributes. The implementation allows us to have attributes on the level of the DefinedMeaning. In plain English, it means that we can have free-form text without any wiki syntax. It does not do much for us at this moment. It does not allow us to have sample sentences (they are on the SynTrans level), and it does not allow us to indicate part of speech, inflection or gender (they are on the SynTrans level as well).
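To make the two levels concrete, here is a minimal sketch in Python. The class and field names are my own assumptions for illustration; this is not the actual WiktionaryZ schema.

```python
from dataclasses import dataclass, field


@dataclass
class SynTrans:
    """One expression (a synonym or translation) attached to a concept."""
    language: str
    expression: str
    # Hypothetical: part of speech, gender, inflection and sample
    # sentences would live here once attributes exist on this level.
    attributes: dict = field(default_factory=dict)


@dataclass
class DefinedMeaning:
    """A concept with its definition and its expressions."""
    definition: str
    # What the post describes as available today: free-form text attributes.
    text_attributes: dict = field(default_factory=dict)
    syntrans: list = field(default_factory=list)


water = DefinedMeaning(definition="The liquid that people drink.")
water.text_attributes["note"] = "Plain text, no wiki syntax."
water.syntrans.append(SynTrans("eng", "water"))
water.syntrans.append(SynTrans("nld", "water"))
```

In this sketch the text attribute sits on the concept, while the `attributes` of each SynTrans stay empty, which is the gap described above.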

The good thing is that it is an indicator of things that may one day make our day. It does not, however, necessarily have our highest priority. At this moment our priority is to get our versioning working. We need to have historic data, and we need to have improved information in our recent changes. This is what we need badly.

We need it badly because this is the core functionality of WiktionaryZ or Wikidata. It needs to be done. It needs to be done well. It needs to be done before we add even more complicated functionality in WiktionaryZ.

There is a lot of work that needs to be done. We want to have proper support for terminology; we need to include the notion of "domains" for that. We want to have proper support for lexicology; we need to include the notion of language-dependent attributes for that. We need to have relation types that can only be chosen in the correct context, the context being either the language or the domain.
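A rough sketch of what "only in the correct context" could look like, in Python. None of this exists yet; the relation type names and the contexts are hypothetical examples, not WiktionaryZ data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationType:
    name: str
    context_kind: str   # either "language" or "domain"
    context_value: str  # a language code or a domain name


# Hypothetical examples only.
RELATION_TYPES = [
    RelationType("diminutive of", "language", "nld"),
    RelationType("is a compound of", "domain", "chemistry"),
]


def selectable(language=None, domain=None):
    """Return the names of the relation types valid in the given context."""
    names = []
    for relation in RELATION_TYPES:
        if relation.context_kind == "language" and relation.context_value == language:
            names.append(relation.name)
        elif relation.context_kind == "domain" and relation.context_value == domain:
            names.append(relation.name)
    return names
```

Calling `selectable(language="nld")` would offer only the language-bound relation type, and `selectable(domain="chemistry")` only the domain-bound one.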

All these things will happen. They will happen in a way that is consistent with the amount of resources available. This is why we have a need for involvement. This involvement can come in many different ways, from many different places, providing many different angles to improve our data. The key thing is that we will maintain our architectural integrity. It is completely unacceptable to build all kinds of wished-for functionality while the technical foundation has not been laid.

The collaboration that became WiktionaryZ will start its third year at the end of August. There were many reasons why it has taken this long. What it brought us is a design and a philosophy that may work. The third year will be the year where I expect that we will have our full functionality. When we do, things will evolve more quickly.


Tuesday, July 25, 2006

The operational definition meets the DefinedMeaning

In WiktionaryZ, we aim to have definitions for all the concepts that have to do with an expression in a language. These definitions have to be good; they have to express well what the concept is. We have great definitions, operational definitions, when more than 90% of the people correctly identify the concepts in a corpus, given the definitions.

When such an operational meaning cannot be correctly identified by 90%, it means that all the definitions of the different concepts are suspect. It may mean that too many concepts have been identified; certainly more work needs to be done to define things better.
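The 90% rule can be written down as a small check. This is a sketch in Python under my own assumptions: the function names and the shape of the input (counts of correct identifications per concept) are hypothetical, and how the counts are collected from a corpus is left out.

```python
def is_operational(correct, total, threshold_percent=90):
    """True when more than threshold_percent of the people got it right."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Integer arithmetic, so that exactly 90% does not pass by rounding.
    return correct * 100 > total * threshold_percent


def suspect_definitions(results, threshold_percent=90):
    """results maps a concept to (correct, total) for one expression.

    When any concept fails the test, all definitions are suspect.
    """
    if all(is_operational(c, t, threshold_percent) for c, t in results.values()):
        return []
    return sorted(results)
```

For example, `suspect_definitions({"liquid": (95, 100), "chemical": (80, 100)})` returns both concepts, because one failing definition casts doubt on the others.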

In WiktionaryZ, many DefinedMeanings may exist where it has been indicated that the definition does not define the concept really well. These concepts are close but no cigar; they should not be taken into account when it is determined if the Definitions are indeed operational definitions.

The question is: how do you then identify the quality of the translations and the usability of the synonyms?


Sunday, July 23, 2006

Being too busy

The last month has been a rollercoaster; the new functionality of WiktionaryZ that we now have is awesome. The whole idea of the [[DefinedMeaning]] starts to make sense. People are collaborating on the same data. The idea that information can be shared and that it only needs to be added once is now a reality.

The thing that amazes me is that I am so busy making sure that everything works well and that we have some crucial data. Things like the Swadesh lists and the list of 1000 basic English words are really important; because these are basic words, they demonstrate best the merits of the concept. And I am really pleased with how things are progressing.

The thing that annoys me is that there is so much work that I would like to get done. Some of it is plainly not for me to do. Other stuff, like writing on the blog, is very much for me to do. I may have to be even more selective in what I do. Making these choices is hard. Then again, I am glad I am in this position. So many great things are happening and I am part of it :)


Tuesday, July 11, 2006

Water; at least three DefinedMeanings

When the word "water" is considered, people say it is this liquid that people drink and also that it is this chemical known as H2O. Well, actually these two things are not the same. The chemical would be closest to distilled water, and it is not healthy to drink. The water you drink has traces of all kinds of chemicals, like salts, in it. This is what makes it good to drink. Another use for the word "water" is to indicate a body of water where you can swim, raft or boat. In essence it is short for open water, and as such it can be both salt and fresh water.

It has been said before by many: it is the simple words, where "everyone" understands what is meant, that are the hardest to define.


Saturday, July 08, 2006

Just a word

Working on projects like WiktionaryZ is a job that takes a lot of my time. Often I would like to add just a few words to my home project, the Dutch Wiktionary. Today, I felt a need to add a word, and to record the pronunciation and the translation to the English word for the same phenomenon. The word is uitzaaiing.

In Dutch the word has a clear agricultural background. It refers to how a weed, once it is established, seeds itself to neighboring areas. It is one of those words that you hate to hear in another domain. Today I did. There is so little that I can do; I feel sad and can only hope for the best for Scott.


Sunday, July 02, 2006

A dubious record

I run the pywikipedia interwiki bot for the Wiktionary projects. This is a program that finds articles by the same name in different Wiktionary projects and creates a link to the projects that share this article. I have been running it for quite some time now, and as the projects grow bigger, they share more articles.

The bot is quite nice; it can run autonomously and it updates some 30 wiktionaries at the same time. For the smaller wiktionaries, I run it specifically for a project every now and then.
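The matching that the bot does can be sketched in a few lines of plain Python. This is not the pywikipedia code and it does not use its API; the function name and the sample titles are assumptions, and reading or writing the actual wiki pages is left out.

```python
def interwiki_links(titles_by_project):
    """Map each title to the sorted list of projects with an article by that exact name."""
    projects_by_title = {}
    for project, titles in titles_by_project.items():
        for title in titles:
            projects_by_title.setdefault(title, set()).add(project)
    # Only titles shared by at least two projects produce links.
    return {
        title: sorted(projects)
        for title, projects in projects_by_title.items()
        if len(projects) > 1
    }


# Hypothetical sample data: project code -> article titles.
sample = {
    "en": {"water", "uitzaaiing"},
    "nl": {"water", "uitzaaiing", "fiets"},
    "it": {"water"},
}
links = interwiki_links(sample)
```

The comparison is on the spelling of the title only, which is why the links say nothing about whether the articles describe the same meaning.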

Yesterday I chalked up the 200,000th edit for the English Wiktionary. It makes it not unrealistic to think that I have some 600,000 edits on all the wiktionaries. There are six instances of the bot that run at any one time; they run on different projects. I think it is a dubious record because it does not bring me any happiness; it only indicates that there is information about a word that is written in the same way. I doubt very much that anybody does anything with it.

For your amusement; there are MANY English words on the Chinese Wiktionary that cannot yet be found in the English Wiktionary .. :)


Saturday, July 01, 2006

What is your mother tongue

Sabine's kids speak primarily Italian. They are living in the area where Neapolitan is spoken, but Sabine is German. Sabine talks frequently in both Italian and German to her kids, and when she gets angry she turns to Neapolitan; it can be really expressive :)

Now what is the mother tongue of Sabine's kids? They speak primarily Italian...

In WiktionaryZ we have people who indicate that their mother tongue is zho, or Chinese. According to Ethnologue, Chinese is a macrolanguage. This implies that Chinese cannot be a mother tongue; only one of the 13 languages that Chinese is divided into can be the mother tongue. This is a potential hot potato when people equate Chinese with the country and not the language.
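A small sketch in Python of how a mother tongue given as a macrolanguage could be narrowed down. The function names are hypothetical, and the table is deliberately partial: it lists only three of the member languages of zho (Mandarin, Yue and Wu) for illustration.

```python
# Partial, illustrative table: zho has more member languages than these.
MACROLANGUAGE_MEMBERS = {
    "zho": {"cmn", "yue", "wuu"},
}


def is_macrolanguage(code):
    """True when the code stands for a macrolanguage in the table above."""
    return code in MACROLANGUAGE_MEMBERS


def mother_tongue_candidates(code):
    """Return the individual languages to choose from, or the code itself."""
    return sorted(MACROLANGUAGE_MEMBERS.get(code, {code}))
```

With this table, `mother_tongue_candidates("zho")` asks the user to pick one of the listed member languages, while a code such as "nld" is returned as it is.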

WiktionaryZ is about languages and only about languages.

It can also be understood differently; they may mean that Chinese is the first written language that they learned. However, if I understand things well, when people talk about the Chinese written language, it is actually Mandarin. For Yue, for instance, there is a need for additional characters that are in one of the later versions of Unicode. This is, however, not what I would consider a mother tongue. A mother tongue is the language that you learned from your mother. Writing is what you learn at school.