Sunday, July 31, 2005

Working on my presentations

Wikimania is only a few days off. There will be many exciting speakers, and I hope to be one of them. Not that I will not speak; I am just getting nervous. I did my homework and I wrote my presentation, but there are so many things I would like to include and can't.

I can't because things have to develop some more, because Ultimate Wiktionary is still some way off, and because people are still thinking in terms of the old Wiktionary and I would only confuse most of them even more.

Really, when we get UW live and functional it will be great. I used filenames like Spelling, Word and Meaning. The problem is that people think of them as a spelling, a word or a meaning. Sometimes I think I should have named them Kwik, Kwek and Kwak (you may also know them as Huey, Dewey, and Louie). It would make it more abstract, but would it help ??

Wednesday, July 27, 2005

Translating proverbs

When you put proverbs in a dictionary, you add all the usual things. You describe the meaning, you add an etymology and you translate the proverb. However, how do you translate proverbs? Do you translate them literally, or do you give a proverb with the same meaning?

The proverb "de beste stuurlui staan aan wal" has a similar meaning to the English "backseat drivers". The Dutch version is nautical and the English one obviously is not. So it is hardly a literal translation, but it is a functional one. In general terms the meaning can be described, and as such it would function; however, I can appreciate that there will be a certain drift when proverbs from many languages are put forward.

The problem for me is to consider how to deal with this in the Ultimate Wiktionary. Well, at this moment we do not have it yet, so it is not a problem .. :)

Thanks,
GerardM

Saturday, July 23, 2005

Wikimania

In a few days Wikimania, the first world conference of the Wikimedia Foundation, will open in Frankfurt. It will be a great time for the people who are able to come. Some things will be broadcast over the Internet. The agenda is absolutely smashing. I will be speaking as well, and I will speak in "der grosse Saal". It will be the final presentation in a series of four:

*Wikidata will be about the technology that will be behind Ultimate Wiktionary and many other projects.
*Wikisign will be about creating a lexicological resource for the deaf.
*Logos will bring us their experience of hosting lexicological content.
*Ultimate Wiktionary will be about what we hope to achieve in the next generation of Wiktionary.

As I have been talking so much about what I hope to achieve, I may not bring you anything new. However, there is so much to it..

Thanks,
GerardM

Friday, July 22, 2005

Feedback

I am starting to get feedback on my ERD. As was to be expected, what is clear to me is not necessarily clear to others. The great thing is the suggestions that come with the feedback, things like calling it "Script" instead of "Characterset", for which there is an ISO code. Another improvement was to include a field to say that a specific relation (e.g. idiom or proverb) is language dependent.

Some of these things are so basic that you tend to forget to make them explicit. All in all it proves important: publishing and publishing again does work.
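
To make the two suggestions concrete, here is a minimal sketch in Python. It is not the actual ERD: the class and field names are my own assumptions, and I assume the ISO code meant is ISO 15924, the standard with four-letter codes for scripts. The "broader term" relation is only there as a hypothetical example of a relation that is not language dependent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Script:
    """A writing system (formerly "Characterset")."""
    iso_code: str  # assumed to be a four-letter ISO 15924 code, e.g. "Latn"
    name: str


@dataclass(frozen=True)
class RelationType:
    """A kind of relation between entries, e.g. idiom or proverb."""
    name: str
    language_dependent: bool  # True when the relation only holds within one language


LATIN = Script("Latn", "Latin")
ARABIC = Script("Arab", "Arabic")

PROVERB = RelationType("proverb", language_dependent=True)
IDIOM = RelationType("idiom", language_dependent=True)
# Hypothetical counter-example, for illustration only.
BROADER_TERM = RelationType("broader term", language_dependent=False)

for relation in (PROVERB, IDIOM, BROADER_TERM):
    print(relation.name, "is language dependent:", relation.language_dependent)
```

Making the flag an explicit field is exactly the point of the feedback: nothing has to be remembered implicitly.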

Wednesday, July 20, 2005

Sound and sign

I am progressing with the data design for Ultimate Wiktionary. The current challenge I am facing is to deal with both oral languages and sign languages.

The easiest part for sign languages was the realisation that a movie is the "Pronunciation" of a signed word. This made me change the field name from "Soundfile" to "Mediafile". More complicated is the fact that there are some four written sign languages. These I would really want in the Ultimate Wiktionary. The question is: do they, like Chinese, have their own characters in Unicode (and thus in UTF-8)? If they do, I do not have to do anything. It would just work as designed.

I have realised that languages like Arabic and Chinese are formal written languages. There are many people whose spoken language is grammatically and syntactically (does this word exist?) different from the formal written form. So when I record pronunciations, how do I deal with those? How do I register those languages? How do I indicate that these languages use Chinese / Arabic for their written language?

My working theory for the moment is that there may be transcriptions for those languages. Certainly when they have been noted by someone who has some authority, these can be used to link the essentially oral words with something that has characters. These characters are needed at this moment to make it possible to enter them in the database. Now the question is how to relate them to the written language ... At this time it is just a matter of having the written word as a translation; in effect this is correct.
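
As an illustration of the "Mediafile" idea, here is a minimal Python sketch. Everything in it is an assumption made for the example: the class, the file names and the small extension table are hypothetical and not part of the actual design.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to the kind of media.
MEDIA_KINDS = {".ogg": "audio", ".wav": "audio", ".ogv": "video", ".mpg": "video"}


@dataclass
class Pronunciation:
    """How a word is spoken or signed."""
    word: str
    language: str
    media_file: str  # formerly "Soundfile"; may hold a sound recording or a movie

    @property
    def media_kind(self) -> str:
        # Derive the kind of media from the file extension.
        suffix = PurePosixPath(self.media_file).suffix.lower()
        return MEDIA_KINDS.get(suffix, "unknown")


spoken = Pronunciation("huis", "Dutch", "dutch-huis.ogg")
signed = Pronunciation("huis", "Dutch Sign Language", "dutch-sign-huis.ogv")

print(spoken.media_kind)
print(signed.media_kind)
```

The record is the same for an oral and a signed word; only the kind of file it points to differs.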

Thanks,
GerardM

Thursday, July 14, 2005

Working on a table design

I am working on the table design for the Ultimate Wiktionary. I have posted the current version of my ERD and, true to the tenets of Open Source, I am working on it and will post often, so that people can deduce what I am thinking. The funny thing is that since the last time I created a more or less working model for a UW, I have learned so much. The resulting data design is significantly different. I have come to the realisation that what I create is very much the result of this process of assimilating a lot of loose ends in the current Wiktionaries.

One of the things that is funny is that when you design a database you not only have to think of the data itself, you also have to think about how it is to be used. The problem for me is that the database and the development are pretty much divorced. I know databases pretty well, but I do not know the restrictions of MySQL in combination with what Wikidata will bring us.

I find it really thrilling that we are at the stage where there is an imminent need for the data design for Ultimate Wiktionary..

Thanks,
GerardM

Friday, July 08, 2005

Talking tables / files

We are arriving at the point where we have to talk file design. What tables do we need, what fields will they have, and how will this relate to the functionality of what we have?

The three most important tables will be "Language", "Word" and "Meaning". They are related top down. The most difficult to understand will be "Meaning", because the meaning itself will be in a separate table, "Meaning-text". This is because the text of a meaning is to be had in every language, and it is the abstraction of the meaning that is in "Meaning".

This "Meaning" will relate to synonyms and translations (a synonym is equivalent to a translation in the same language). This will give people an instant problem: many words are not the exact translation of another, so how will we deal with this?
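
The tables described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under my own assumptions: the column names, the `word_meaning` link table and the sample definitions are hypothetical, and the real design targets MySQL and Wikidata rather than SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE language (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE word (
    id INTEGER PRIMARY KEY,
    language_id INTEGER NOT NULL REFERENCES language(id),
    spelling TEXT NOT NULL
);
-- "Meaning" holds only the abstraction; it has no text of its own.
CREATE TABLE meaning (id INTEGER PRIMARY KEY);
-- "Meaning-text": the definition of one meaning, once per language.
CREATE TABLE meaning_text (
    meaning_id INTEGER NOT NULL REFERENCES meaning(id),
    language_id INTEGER NOT NULL REFERENCES language(id),
    text TEXT NOT NULL,
    PRIMARY KEY (meaning_id, language_id)
);
-- Assumed link table: a word can carry several meanings.
CREATE TABLE word_meaning (
    word_id INTEGER NOT NULL REFERENCES word(id),
    meaning_id INTEGER NOT NULL REFERENCES meaning(id),
    PRIMARY KEY (word_id, meaning_id)
);
INSERT INTO language VALUES (1, 'English'), (2, 'Dutch');
INSERT INTO word VALUES (1, 1, 'girl'), (2, 1, 'lass'), (3, 2, 'meisje');
INSERT INTO meaning VALUES (1);
INSERT INTO meaning_text VALUES (1, 1, 'a female child'), (1, 2, 'een vrouwelijk kind');
INSERT INTO word_meaning VALUES (1, 1), (2, 1), (3, 1);
""")


def related_words(conn, spelling, language):
    """Return (spelling, language, kind) for every word sharing a meaning with the given word.

    A shared meaning in the same language is a synonym, otherwise a translation.
    """
    return conn.execute("""
        SELECT w2.spelling, l2.name,
               CASE WHEN w2.language_id = w1.language_id
                    THEN 'synonym' ELSE 'translation' END
        FROM word w1
        JOIN language l1 ON l1.id = w1.language_id
        JOIN word_meaning wm1 ON wm1.word_id = w1.id
        JOIN word_meaning wm2 ON wm2.meaning_id = wm1.meaning_id
        JOIN word w2 ON w2.id = wm2.word_id
        JOIN language l2 ON l2.id = w2.language_id
        WHERE w1.spelling = ? AND l1.name = ? AND w2.id != w1.id
        ORDER BY w2.spelling
    """, (spelling, language)).fetchall()


for row in related_words(conn, "girl", "English"):
    print(row)
```

Because synonyms and translations both hang off the same "Meaning" row, one query serves both; the only difference is whether the languages match.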

When a word is translated, the word picked in the translation is the one that fits best in the meaning of the original word. This meaning is therefore one that is of importance to this word as well. This meaning can be endemic to the language of the word, which makes it a natural fit, or the meaning can be external to the language of the word. When the meaning is external to the language, this meaning is only relevant when translating the word.

This sounds problematic. The words girl, meisje and Mädchen are good translations. In the Neapolitan language there are words that are specific to girls of a certain age. The meaning of these words is included in the meaning of the word girl. They need to be shown when you are interested in the Neapolitan language. However, when you are not interested, these meanings, which are external to words of the English language, do not have to be shown.

I have been told that there are some four words that can be included in the word girl. These meanings do relate to each other, and as such it makes sense to use thesaurus-like structures to describe these relations. As these relations describe the meanings, they are relevant when you are interested in the Neapolitan language. They do help a translator choose the best fit, and also alternatives when one word is used too often.
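
The idea of showing external meanings only to people who are interested can be sketched as follows. This is a rough Python illustration under my own assumptions: the class, the function and the "age group" labels are placeholders, not real Neapolitan words and not the actual design.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Meaning:
    label: str
    endemic_to: Set[str]  # languages in which this meaning is a natural fit
    # Thesaurus-like links to more specific meanings.
    narrower: List["Meaning"] = field(default_factory=list)


def meanings_to_show(meaning: Meaning, interests: Set[str]) -> List[str]:
    """Return the meaning itself plus any narrower meaning that is
    endemic to at least one language the reader is interested in."""
    shown = [meaning.label]
    for sub in meaning.narrower:
        if sub.endemic_to & interests:
            shown.append(sub.label)
    return shown


# Placeholder labels; the actual Neapolitan words are not given here.
girl = Meaning(
    "girl",
    {"English", "Dutch", "German"},
    narrower=[
        Meaning("girl (age group A)", {"Neapolitan"}),
        Meaning("girl (age group B)", {"Neapolitan"}),
    ],
)

print(meanings_to_show(girl, {"English"}))
print(meanings_to_show(girl, {"English", "Neapolitan"}))
```

A reader interested only in English sees the single meaning, while a translator working with Neapolitan also sees the narrower meanings and can pick the best fit.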

Thanks,
GerardM

Thursday, July 07, 2005

Supporting a "bot"

The Wiktionary projects are very grateful to the programmers of the pywikipedia bot. Andre Engels in particular has been important in supporting this bot for the Wiktionary projects. He has programmed new functionalities that helped us work together more than we would have done without it.

As can be deduced from its name, most people use the pywikipedia bot for Wikipedia projects. Many of the innovations have been programmed with Wikipedia in mind. The latest innovation allows you to be logged in to several projects at the same time. The interwiki bot makes use of this facility, and when it finds that one project needs to be updated, it will do so. This enhances the quality of the bot dramatically.

Supporting a non-programmer like myself is a pain. It is therefore important that tools like tortoise work well. It means that a common baseline can be created. This in turn facilitates the analysis of error conditions. Today we finally got it to work. We had to remove the application and start it all over again.. This time it did download the pywikipediabot software from Sourceforge..

Really, Open Source rocks when there are friendly people like Andre ..

Thanks,
GerardM

Wednesday, July 06, 2005

IATE and Free content

IATE, or Inter-Agency Terminology Exchange, is a project to create a glossary for use by the European Union. It has live data, and until recently its content was appreciated by many translators: there was a guest profile with a guest password that was used by many. Because the IATE database is "not ready for the general public as it may not cope with the demand that might be put upon it", this public access has been removed.

To gain access, you have to translate for the EU and you have to sign a contract saying that you will use it only for EU work. There is however one bright spot: its copyright. The IATE copyright says clearly that you can have this data and use it as long as you attribute it to the institution that manages this information.

It is therefore a lucky coincidence that we want to make Ultimate Wiktionary relevant. It is equally fortunate that we already plan to cooperate with the EU by publishing its GEMET content. When we have proven that we can host lexicological data, we can ask the EU if we can host this data. It is relevant data, it is important data, so much so that the EU expects that its modern systems will crash under the strain of all these people who want it.

With the Wikimedia servers, we are used to providing as good a service as we can. We do not promise 0.9999 uptime; we do the best that we can. And if this data can be had for the lexicological information that it is, we are quite happy to host it. We are quite happy to cooperate with the EU to make this information available and more relevant than it is at the moment, being a "secret".

Thanks,
GerardM

Sunday, July 03, 2005

The need for a reference implementation

To make Ultimate Wiktionary relevant, we need data, we need a big community, we need relevance. Relevance can be had in several ways. One way in which we may get both more people and more content is by making the content of UW available as a translation glossary to be used in translation tools. There are several translation tools: Sun Microsystems opened up its CAT tool, OmegaT is another, and there are more.

The functionality that all these tools will derive from UW is the same. So having an implementation that provides the bare bones of what is needed makes sense. It does help to make a bigger group of people aware of the wish for this cooperation. It hopefully leads to the cooperation of the different communities behind these tools, in order to improve the quality of all the tools.

To communicate about the tool, we have started an experiment with Google groups. Here you find a discussion list. Everyone can read it, but only members may write to it.. this helps against spam :) . It is not a Sourceforge environment yet; this is something that the people who will develop this reference implementation should decide on.

It is always exciting to see how these things develop. I hope for the best.

Thanks,
GerardM