Saturday, December 31, 2005
We discussed the long-standing need for single login. Because it is a dependency for many other projects related to Ultimate Wiktionary and Wikimedia in general, Brion will, if at all possible, start working on it soon after the release of MediaWiki 1.6. Brion will also look into Surfnet's A-Select in order to interface Wikimedia with other authentication service providers. There is a need for outside authentication to Wikimedia projects, especially Wiktionary, and A-Select has the potential to provide it. However, Brion feels that thinking about federation is only possible after the Wikimedia-internal authentication problems are fully resolved.
Together with Erik, we discussed the need for handling multiple languages inside a MediaWiki installation, which is obviously related to Wikidata and UW and will likely be one of our next development milestones. While Brion sees this as a quite complex problem, he did agree that the situation as it currently is - that multilingual projects like Meta and Commons have no language-awareness whatsoever - is broken. What we agreed to do is to send him specifications to review before any implementation begins. Brion also pointed out some current and potential problems with MySQL: that certain UTF-8 characters cannot be stored with the proper charset encoding, and that it may not be possible to have multiple sorting orders on a field without duplicating it. (In this context, we debated the need for following important standards such as CLDR for locale data.)
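The storage problem Brion mentioned can be illustrated with a small check. This is only a rough sketch, and it rests on an assumption of mine: that the problem is the limit of three bytes per character in MySQL's `utf8` character set. The function name `fits_mysql_utf8` is made up for this example and is not part of any library.

```python
def fits_mysql_utf8(text: str) -> bool:
    """Return True when every character needs at most three UTF-8 bytes.

    Assumption: the column uses MySQL's three-byte `utf8` character set,
    so characters outside the Basic Multilingual Plane do not fit.
    """
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


# An Albanian word with a diacritic fits; a Gothic letter (U+10330) does not.
print(fits_mysql_utf8("Gëzuar"))
print(fits_mysql_utf8("\U00010330"))
```

A check like this could be run over imported data to find expressions that need special treatment before they are stored.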
Besides meeting Brion, we made an appointment with two officials of Wikimedia Germany to discuss the potential for cooperation in different areas. So far, the signals are positive.
Monday, December 26, 2005
We have been given permission by the European Environment Information and Observation Network (EIONET) to host the GEneral Multilingual Environmental Thesaurus (GEMET). This showcases our wish to have great, relevant content in many languages that has the kind of structure found in thesauri.
The data is not complete yet and the content will be improved. To put it in perspective, this read-only implementation of a Wikidata database is a proof of concept. It will be expanded slowly but surely, to improve not only the technical features but also the information and the user interface.
We really welcome all your constructive comments.
Sunday, December 25, 2005
What is important is that even when this software is not and does not become Free software, we will be able to cooperate. This is the key thing about the Ultimate Wiktionary project.
PS I received an e-mail from Erik that he is importing the GEMET data into a Wikidata database. It does a cool 1,000 records a minute.
PS2 I am also celebrating Christmas. But I am in two minds.. I want this so badly. How much do I want to be like a Santa bringing good cheer.. :)
Saturday, December 24, 2005
Well, what do we find under our virtual Christmas tree? Today I received an e-mail from Erik, who wants meat on the bones of his present. He does want "versioned tables" to be included, even though the main purpose is that we have something to show :) . It is still very much intended to be on-line in the coming day or days.
Yesterday, I had a conversation with Jimmy Wales. We discussed many things. I was really happy to learn that he considers it necessary to have "committees" that, on behalf of the board, take care of Wikimedia Foundation projects that do not get the attention they deserve. Wiktionary would be one of these projects.
In the last month many things have happened; I hinted at some and not at others. Some of the highlights are the potential cooperation with several organisations. Two of these I want to highlight: the GEvTerm project and the ProZ organisation.
- GEvTerm is based on a great idea: when there is an international event somewhere, many people will congregate in one place. They need to communicate, but it cannot be expected that all of them share one common language. The idea of GEvTerm is to concentrate on translations that are associated with the particular event and make them available to ease the human interaction.
- ProZ is a member-based organisation of professional translators. It serves the largest community of translators. ProZ has been active building glossaries, and they have their KudoZ, where colleagues can help out with particularly problematic translations. They were about to create their own dictionary and we were lucky to get into contact with them through Sabine, who is a ProZ member. We are now talking about how we can create one fabulous resource together.
I wish everyone the most joyous of Christmases..
Friday, December 23, 2005
The name of the project does not determine that the eventual project will be found at "http://ultimate.wiktionary.org". There are reasons for it and there are reasons against it. It is cheap to name it like this, as the wiktionary domain is already owned by the Wikimedia Foundation. The second reason is that a fair number of people already know this name, and finally it does link the old with the new. A big argument against is the use of this "ultimate" label. The functionality of the software will grow, but at first there will not be much that deserves this accolade.
Now there is this opportunity: what name to pick, and what arguments to use?
Wiktionary2 is another great name. It also gives a great link to the current Wiktionaries and it symbolises well the big technological step that it represents. One thing that is problematic is that some people suggested that it could be seen as a version number; this would mean that the future might bring us a "http://wiktionary3.org". This is not a sensible way of doing things, as you do not want to reflect version numbers in your domain name.
I hope that people will like these suggestions, and there is room for many more.
Tuesday, December 20, 2005
Ultimate Wiktionary is there to be used. My definition for success is "when people find an application for the data that we did not think of". There is however nothing wrong with us coming up with new ways in which we can extend the potential use.
Particularly interesting to me are the changes that extend the community of users. We want the scientists and the translators, but we could also have the puzzlers. For me it would be really cool if UW became a challenge to my mother; she likes her crosswords and her cryptograms. Puzzlers are interested in synonymy and definitions, so by adding this one field in the Expression table, the first step is taken to charm yet another group of people into the Ultimate Wiktionary...
Sunday, December 18, 2005
Wikidata is not the same as Ultimate Wiktionary and consequently has requirements of its own. It has language requirements of its own. It may need longer texts, and it may require texts in a format that Ultimate Wiktionary frowns upon, like capitalised expressions. As we are investigating the use of TBX for the static part of Ultimate Wiktionary, it made sense to think about TMX as well for this issue. This means that we need some basic stuff to deal with handling translation projects. I have come up with this extension of Ultimate Wiktionary; this data design makes use of tables that are part of UW and may as a result become part of MediaWiki proper.
I realise that when we implement this, we have the core of a translation / localisation workflow. This makes sense when you consider that Wikipedia, one of the biggest websites of this world, exists in 212 different languages. When a MediaWiki message is changed, who is going to do the translation? I doubt that there is one organisation that can do that well on a continuous basis. As I am a firm believer in using standards AND in eating my own dogfood, this is my first take on this issue.
Saturday, December 17, 2005
That in a nutshell describes the situation for many standards and, as far as I am concerned, I would prefer a definition that includes relevancy. "A standard is a standard when a standards body says so and when it is freely available for adoption". When a standard is not freely available, it means that the standard will not be adopted by some for monetary reasons. The consequence is that money removes relevancy from a standard when it leads to it not being adopted.
In my mind the worst thing that can happen to a standard is that it is not adopted or is ignored.
Friday, December 16, 2005
Many of these things are the legacy of a paper-based origin. In a digital resource with some magic linking "plague" and "bubonic plague", one would suffice. The problem is in how to make the Ultimate Wiktionary relevant. When we do include "plague, bubonic" in some way, we allow for one-to-one linking from the Unified Medical Language System to Ultimate Wiktionary and vice versa. It would even allow for the inclusion of UMLS data in Ultimate Wiktionary.
My current thinking is about two options. I know that in lexicology they have some annotation to describe in what relation a word exists in a sentence. The other option is to have an AlternateRepresentation table that links an Expression to the preferred Expression.
I do want this annotation anyway; what I do not know is whether this annotation is aware of capitalisation.
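The second option can be sketched in a few lines. This is only an illustration of the idea, not the actual data design: the field names (`expression_id`, `alternate_id`, `preferred_id`) and the function `preferred` are assumptions of mine, and in-memory Python objects stand in for database tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expression:
    expression_id: int
    spelling: str


@dataclass(frozen=True)
class AlternateRepresentation:
    alternate_id: int  # the Expression in its alternate (e.g. inverted) form
    preferred_id: int  # the Expression it should resolve to


# Hypothetical sample data: the UMLS-style form points at the preferred form.
expressions = {
    1: Expression(1, "bubonic plague"),
    2: Expression(2, "plague, bubonic"),
}
alternates = [AlternateRepresentation(alternate_id=2, preferred_id=1)]


def preferred(expression_id: int) -> Expression:
    """Follow an AlternateRepresentation link if one exists for this Expression."""
    for link in alternates:
        if link.alternate_id == expression_id:
            return expressions[link.preferred_id]
    return expressions[expression_id]


print(preferred(2).spelling)
```

With a link like this, "plague, bubonic" stays available for one-to-one linking with an outside resource, while readers are led to the preferred Expression.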
Thursday, December 15, 2005
Ultimate Wiktionary wants to be open, welcoming to all communities and.. yes, I was human, so I was convinced that Term was a better term. However, it is one of those words with multiple meanings, particularly in the worlds of terminology, lexicology and thesauri. After some discussion we came to the conclusion that this is not the right word to describe what we mean, and also that it is not really neutral. So, a new word was agreed upon: LexicalItem.
There are some more changes that we decided on in
There are loads of things that I have learned that I am still internalising. When I have, there will be several other changes.
Oh, the great news is that many of these changes are inspired by the great people that I met In
Tuesday, December 13, 2005
Sunday, December 11, 2005
- A word is specific to the dialect
- A word is used in all areas where the language is spoken
- A word is not used in the dialect but is specific to the parts where the dialect is not spoken
This approach is not problematic when you consider the Dutch and Belgian situation; languages / dialects like Andalusian are much more problematic because people will bring political dimensions to it. The history of the creation of new Wikipedia projects often proves that a language is a dialect with an army.
Saturday, December 10, 2005
There are other resources that have importance to people interested in lexicology. Logos in its dictionary provides a rich tapestry of words with translations. In its link to Wordtheque, you find the words in their context. In the philosophy of Logos, this often provides as clear an idea as a definition would do. A link to this publicly available resource is not available through the resources of the TST centrale.
In the KudoZ open glossaries of ProZ, you find a rich resource of hard-to-translate words. When you start looking for resources that have a relevance for the creation of dictionaries, there are many resources that are not created in a "scientific" manner. Practically they can be extremely useful. It is a shame that the scientific resources are not Free and consequently that they make the "unscientific" resources unavailable for their enrichment.
Anyway, as long as these resources are used side by side there is nothing that stops the research of lexicology. As the Wikipedias are a rich resource of contemporary language, and as its content is categorised as to subject matter, it is good to know that scientists are free to use it for their research. I checked it with Jimmy Wales and he was happy to confirm this.
Tomorrow I will be going to Berlin. We will be talking about interfacing the Ultimate Wiktionary using the TBX standard..
Friday, December 09, 2005
Yesterday, I was at a conference for Dutch language lexicologists. It was my first such thing and it was a grand experience. Lexicologists have always been abstract people to me; now they have faces. They exist in many shapes and forms and they do many different things. They work together in many ways and to me they do marvellous things.
The difference in our approach and the scientific approach can be given in one word: scientific. What we try to do with Ultimate Wiktionary is not scientific. Being scientific has never been considered. Our outlook has always been practical. We want to do practical things with our dictionary. Publishing a scientific paper is not practical to us. That is not what our goal is.
Given this difference in approach, there is still very much that we can do for each other. By building a resource that is useful but not complete, it may have a limited scientific value but it does have a value. Being built by people who do not necessarily share the same methodology, it may be chaotic but it still has a scientific value. Even with all these "issues", a project that makes lexicons relevant to people who typically do not care is probably the most valuable gift we can give to the science of lexicology. If we can make lexicons relevant and exciting, there will be new people who will find their way into this profession..
Monday, December 05, 2005
Last year we started to collaborate on Christmas wishes and it was good fun. It is so funny to see a text in alphabetic script and not have a clue as to how it is pronounced.. "Përshumvjet Krishtlindjen dhe Gëzuar Vitin e Ri". Last year we were as ambitious as this year; we would love more people to translate and say: "Merry Christmas and a happy New Year!" in their language..
We hope and expect that the first tangible results of all the effort that has gone into Ultimate Wiktionary will be our Christmas gift.. In the meantime we will also do some more work on our Christmas glossary.. Have a look and see how you can make the glossary yours as well :)
Friday, December 02, 2005
When disruptive technology appears, it changes business as usual. It has done so in our society from the moment when innovation was considered to be good. Innovation was never considered to be universally good, but it led to our current society with a number of people having it good in a way that could not be conceived one hundred years ago. In a way, with the ever increasing speed of communication, new ideas get an audience with an ever increasing speed.
Wikipedia is an encyclopedia, it is internet based and it is growing as quickly as new servers can be brought online. It can only do this because of the huge pent up demand for affordable information that has a neutral point of view. Important is the realisation that Wikipedia is not one but many encyclopedias. Every month there is yet another language that gets its own Wikipedia.
These Wikipedias all have the ambition to equal the star Wikipedias like the German and the English Wikipedia. They will have to grow from a small project where everybody knows everybody to a project where even the heroes of last year are not known by all anymore. Slowly but surely these projects create Free information and get recognition for the viability of the languages they express.
Certainly when there are few resources in a language, the impact that a wikipedia may have is big. Comparatively Wikipedia cannot be as important for languages like English and German as it could be for Swahili. It will take its own good time..
With all this talk about disruptive technology, it is fun for me to predict that Wikidata and Ultimate Wiktionary will be disruptive in their own right. They will be in more ways than one.. I am anxious to see how conservative the Wikimedia crowd will prove to be. If they are like I expect them to be, they will allow both Wikidata and Ultimate Wiktionary to develop their potential.
Thursday, December 01, 2005
The problem is that we have to pay outside of the European Community. So we do get into silly stuff like currency and costs.. We have to find out what the cheapest way is to get money elsewhere.
It is a problem but I prefer this to not having code finished.
Monday, November 28, 2005
Professor Rennison, who worked some twenty years on the Koromfe language, made his resource more relevant by supplying not only an English but also a French and German translation to the Koromfe idiom. Consequently his data is more accessible. At this time, Ultimate Wiktionary is little more than a promise. It will be great if it can be a place where we can give the Koromfe language a place that is as important as any other language.
Saturday, November 26, 2005
I really like this example of being creative with this aspect of wiki. When I discussed it with Erik in Berlin, we found that there is indeed little room at this moment in time. Maybe we should consider having some free space at designated places where one can freely enter data that is not structured.
more resources: frequency list discussion
Thursday, November 24, 2005
- the problem solving ability
- what the intelligence test measures
This time there was a large group of wikimedians there to inform about what we do. Kennisnet gave a great presentation on how they are experimenting with wikis in education.
I learned from many people that tagging educational content is starting to become important. The IEEE LOM standard adopted in the Netherlands (EDUSTANDAARD) is being implemented into many of the software applications. One problem is that in order for such a standard to make an impact, much data needs to be tagged. This means that everybody is a winner when educational material is shared among schools. In order to do this, some authorisation and authentication is needed. This is needed so that students will not get exams from the schools that do share.. Kennisnet provides just such a service in their Entree. Having this is likely to be a key enabler for successful sharing of content.
One other key enabler is to just share. People will have to get used to the idea that it is like marriage, both parties think they are giving more than the other .. :)
One thing I would really like is for MediaWiki to include the potential to have IEEE LOM tags. Through our interwiki links we know which articles share the same subject. Consequently these articles can share many of the tags given in one flavour of IEEE LOM. There are several reasons why we should do this:
- it improves the accessibility of our information
- it would involve many people in education in our projects
- it would stimulate all implementations of IEEE LOM
- there would be another reason for having Ultimate Wiktionary; it could localise the IEEE LOM tags.
Tuesday, November 22, 2005
MediaWiki is a great product and it is great at scaling. With Commons we have our image repository; with Ultimate Wiktionary we will have our lexicological resource. All this is part of this great idea of bringing all information to all people in their own language.
It is great when you can be part of this puzzle of how to get all this together. There is however so much to do; what to do first?
Monday, November 21, 2005
Technically, some things in the data design will be changed, among them a change of the Meaning table; it will become DefinedMeaning. This is to reflect that the DefinedMeaning defines which MeaningText is the one that truly defines a meaning. The point is that for a meaning you have to decide what language and what word define what it is. The other MeaningTexts in the other languages should be a translation of that specific text.
One great thing is that we came up with an improvement regarding inflections. The problem is that it does not make sense for all inflections to show up in the list of synonyms and translations only because they share the same DefinedMeaning. By adding the key of the InflectionWord it belongs to, we can show only the headword for the parts of speech. Yes, it is a database change.
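The inflection change can be sketched roughly like this. It is my reading of the idea and not the actual schema: the `Word` record, its `headword` key and the function `translations` are names I made up, and a Python list stands in for the database table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Word:
    spelling: str
    language: str
    defined_meaning_id: int
    # Hypothetical key: the headword this inflection belongs to.
    # None means the word is a headword itself.
    headword: Optional[str] = None


words = [
    Word("walk", "en", 1),
    Word("walks", "en", 1, headword="walk"),
    Word("walked", "en", 1, headword="walk"),
    Word("lopen", "nl", 1),
    Word("loopt", "nl", 1, headword="lopen"),
]


def translations(defined_meaning_id: int, entries: List[Word]) -> List[str]:
    """List only headwords for a DefinedMeaning, leaving inflections out."""
    return [
        w.spelling
        for w in entries
        if w.defined_meaning_id == defined_meaning_id and w.headword is None
    ]


print(translations(1, words))
```

All five words share the same DefinedMeaning, but only the two headwords show up in the list of synonyms and translations.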
Erik was not happy with my Table table. He called it a hack and, it is a hack. So he does not want it, he does not want it, he does not want it... So, it is to go. Erik is correct where he says that NOT having this hack means that it will be much cleaner code and, it will help with the scalability issue.. A major point. So I will create a few more Relation tables that will be more specific.
Saturday, November 19, 2005
Some of the things were thought to be a hack and yes, in a way they are, but hacks that work.. things like my Table table .. :)
We have discussed many things. The subject of hosted thesauri is one that will come back again.. We are now into data design.
Friday, November 18, 2005
It is therefore that a method needs to be found to merge glossary entries, thesaurus entries and wiktionary entries. This may seem like a simple thing and in many ways it is. The problems arise particularly in the thesaurus structures that are associated with a lemma.
When the community around a thesaurus decides to adopt a lemma, the version that is adopted is tagged as being part of the lemma. When the lemma is changed, the later version can be adopted as well. This allows for one way of quality control. Alternatively the changed lemma is flagged as "pending approval"; this in turn allows for a second method of quality assurance. The Wiki way would be to assume good faith and expect a change to be a change for the better.
When agreement can be reached on how a word is defined and how it is translated, the relations may need to be tagged as belonging to a specific thesaurus or glossary. As agreement could be reached on the Meaning, the connections between the different thesauri are what bring the similarities into focus. This in turn may help bring more understanding.
Thursday, November 17, 2005
It is easy to explain why the non-free dictionaries are not “Free”. They are not free because people think that more money can be made with these dictionaries. It is elementary. It is impossible to understand why a dictionary available under a “Free” license should be licensed under any particular license.
From my perspective, there are a few things that are relevant. It must be “Free”, this is something that is shared by all so that people are able to use it. The other thing is there must be attribution.
When we work together on a FREE resource, it will not take too long before we have enough relevant information to matter in a language. From that moment onwards quality and quantity will grow because "enough" is the tipping point where it makes sense to add content to the shared resource. When a new spell checker is generated every week, it is obvious where to correct mistakes and where to add content. A spell checker only needs to say where this can be done (this IS the reason for attribution); it would enable people to get updates, and it would also make it possible to find out who contributed to this resource.
It is really important to understand about DATA that a license can only be "viral" with respect to DATA. Using a spell checker generated with a GPL licensed dictionary does not change the license of the software running this data. To be honest, I am happy that it does not, because it makes it obvious. It makes it obvious that the “Free” dictionaries should work together. By working together we will achieve more than what we can achieve separately.
This is a complicated thing and the current thinking is as follows:
- All edits need to be validated two times by different people to be considered "good"
- When two words are considered to be the same in Meaning and Expression, they can be merged. The translations of these meanings will be merged but they get the status of a newly added word. This is to ensure that translations are considered again for their validity.
- Bots will be disallowed from making interactive edits. Every bot has to be associated with an interactive user.
- A thesaurus or glossary can, when it is agreed that it needs this status, be write protected; this means that comments can be made on the "talk page".
- When a bot is to be used for a specific purpose, i.e. the maintenance of a specific write-protected thesaurus or glossary, it can be given a status of implied quality control. This means that the organisation that maintains this resource in the Ultimate Wiktionary is wholly responsible for its own quality. We are thinking of terminology like the terminology of the Roman Catholic church where the exact nature of the definitions is a matter of doctrine.
- The user associated with such a bot will be the admin for this glossary or thesaurus. This admin can allow users to make changes to its resource.
- Like on the other Wikimedia projects, we will need admins to do the necessary maintenance. The admin status for deletions should be given per language.
- A privileged few will be admin / bureaucrat for the whole of the project. They will have access to all languages and all resources. As you can imagine they will be in a glass cage.
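The first rule in this list can be sketched as a small check. This is only my take on it and not a specification: the function `is_good` is hypothetical, and that an author cannot validate their own edit is an assumption of mine that the rule itself does not state.

```python
from typing import Iterable


def is_good(author: str, validators: Iterable[str]) -> bool:
    """An edit counts as "good" once two different people have validated it.

    Assumption: a validation by the author of the edit does not count.
    """
    distinct = {name for name in validators if name != author}
    return len(distinct) >= 2


print(is_good("gerard", ["erik", "sabine"]))  # two different validators
print(is_good("gerard", ["erik", "erik"]))    # the same validator twice
```

The same check would apply to a merged word: as it gets the status of a newly added word, its list of validators starts out empty again.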
Tuesday, November 15, 2005
At some stage it had to be. The semantic web, the holy grail of this digital age had to come my way. The sum of all knowledge is to be found by it. It is a great endeavour, many people work really hard to make it a reality and so far it passed me by.
I am happy that it passed me by until now. I am happy because in my ignorance I was able to come up with the Ultimate Wiktionary. Ultimate Wiktionary is in its own way equally ambitious; it wants to have all lexicological information on all words of all languages. Some people say that you cannot know a thing if you do not have a word for it..
The semantic web came my way in a meeting at the University of Rotterdam; they need a lexicological resource for a big thesaurus. They have experimented with products that are closely related to the semantic web.
What I have understood is that certain words are used in a tree of concepts; in order to make this information usable in other languages, these concepts have to be translated. In my opinion that is exactly what the Ultimate Wiktionary is intended to do. When a "concept" is associated with a particular "Meaning", it follows that the translations and synonyms can be used to present these relations in other languages.
I understand that there is this idea that a concept and the tag used in the relations are considered by some to be distinct. At this moment I think it is not really practical. It is great that I will learn more about the semantic web.
I am on record that when people try to find a use of the Ultimate Wiktionary that I did not consider, I would think the Ultimate Wiktionary a success. By that standard, even though it is not operational yet, it is doing well.
Monday, November 14, 2005
The Century Dictionary was in its time a wonderful resource; even though it has aged, it is still a wonderful resource. It gives a great impression of how dictionaries were at the end of the 19th and the beginning of the 20th century (1889-1910).
The fact that much effort was undertaken to make it available in this digital day and time is wonderful. It is advertised as the biggest on-line English dictionary on the Internet; with more than 500,000 definitions it may be just that.
The Wikimedia Foundation was asked for advice on how this splendid resource could be modernized and updated. I felt privileged to be asked for an opinion. As I think highly of resources like the Century Dictionary, I would at best convert the digitized content when this improves the usability of the data. As I value the Century Dictionary for what it is, I would definitely keep the data as is.
This does not mean that all this lexicological information cannot be used to build a modern dictionary. This can be done in many ways. An important consideration is that the data of the Century Dictionary is firmly in the public domain. This means that any existing project that works on building a dictionary can and may use this data.
I would not mind including the data of the Century Dictionary in the Ultimate Wiktionary. It would prove a challenge to fit it in what has always been envisioned to be a modern dictionary. Then again, the Ultimate Wiktionary is also to be inclusive. So when the opportunity comes to include whole dictionaries, I am sure we will find a way and make sure that it makes sense for our users as well..
The conversion of the Century Dictionary will be a lot of work. However, there are many professions where the skills for such a project are taught in universities. It is therefore that I could see students working on such a project for a term project.
Sunday, November 13, 2005
Wikipedia is an outrageously successful project of the Wikimedia Foundation. Its most prestigious project is the Wikipedia in English: it has over 818,000 articles and an amazing number of active contributors, and the achievements of this community are not only in this high number of articles but also in the rating given by Alexa (today nr 38), the attention we get in the press, and the overall quality of the articles.
The community that makes all this possible is essential to what is done. The community does not exist as one big always agreeing whole. If anything the difference in what people want out of Wikipedia is huge. There is this tension between people who want to concentrate on the "main" Wikipedias and people who want to create new Wikipedias. A tension between having only illustrations that are Free and illustrations that are free to be used. Given the size of our community (a middle sized town) we have people willing to utter any POV.
In what I am doing, I am truly outside the Wikipedia community; I am firmly into dictionaries and I want to take the Wiktionaries to the next level. Ultimate Wiktionary is intended to be inclusive for all the lexicological data and applications we can think of. We came up with spell checkers and computer aided translation tools, and it being a translation and a descriptive dictionary is what we started with.
As Ultimate Wiktionary is there to be an inclusive lexicological resource, we invite everyone to join us in making it exactly that. It is easy for people to join; they just do. For organisations it is different. They often have a lot to offer but they also have their requirements. To address these requirements you have to be approachable. Sure, the Wikipedia community is approachable, but it lacks the ability to come up with a single response as the community is divided. It is unable to come up with a quick response, and when a response is given it often does not answer the question that was asked in the first place.
The Wikimedia Foundation has as its goal to bring all information to all people of the world. Wiktionary has as its goal to bring all lexicological information to all people of the world. It is therefore that it makes sense to have a consortium where organisations can find a focal point where their need for cooperation can be discussed. We are open to cooperation in a non-discriminatory way, and adding information makes us richer and more relevant. When organisations have a need for quality, we should find our way in providing this quality assurance. We should when it is a legitimate request.
Ultimate Wiktionary will find its legitimacy not in being yet another on-line dictionary but in giving this data an application. When it does not go beyond what every dictionary does, it will be a failure. When organisations like the University of Bamberg have a use for the Ultimate Wiktionary, Ultimate Wiktionary will become a credible and important resource.
Wednesday, October 12, 2005
Dicologos is the dictionary of Logos. It is a great resource and there are many things to learn from it in order to make Ultimate Wiktionary a success. Dicologos has more than 7 million words, there are words in many languages, and the content is growing all the time. Its main problem is that it is not well known and that its focus is on translators.
When translators use a dictionary, they typically use it to find confirmation for the translation of a word. All the rest is not really relevant to them. An ordinary user of a dictionary uses it as much as anything to find a definition of a word. As the bulk of the potential public for Dicologos does NOT consist of translators, the lack of definitions is a big problem when you want to establish public awareness of Dicologos as a Free resource. For Logos it is important that Dicologos is seen as an important resource because it demonstrates that Logos contributes to society by adding to the culture of our society.
When you have worked on this resource like I have, you will appreciate that it is much more responsive than the Wiktionary servers but you miss the cooperation, the sense of community. There are no talk pages like in all the MediaWiki wikis. There are no mailing lists. There is no sense of community.
When Dicologos and the Wiktionaries are to work together, a common ground must be found where the content and the communities can find each other. Technically, the content of the Wiktionaries cannot be converted to the Dicologos database because many types of information cannot find a place in the database design of Dicologos. In the same way it is not possible to convert the Dicologos data to the Wiktionaries, because you would have to do it so many times and there is so much overlap in the data.
When Logos decides that they are going to work together in a lexicological resource, they will find that in Ultimate Wiktionary they can include all their content. They will have all the community features that are implicitly available in the MediaWiki software. If they want to take the next step, they can work together in a resource that will be hosted by the Wikimedia Foundation. The specific needs of Logos can be addressed in what will be the Ultimate Wiktionary. These needs will be addressed in a non-discriminatory way.
Saturday, October 08, 2005
The result is that documents can be identified to be about a particular subject. As a result your time is better spent; you will read about things that are of interest.
Another application is that when words are used together in a given setting, they can be identified to be about a given subject matter. For translators it means that it helps to choose the correct meaning for a word. When this is done automagically, a correct translation glossary can be loaded.
Thursday, October 06, 2005
I did go to Modena (Italy) to make it happen. Modena is where the Logos Group is based. They have been working for 20 years on an online database. This resource is huge; this resource is important. It has between 7 and 10 million lemmas. It has an active community of translators working on the content and it is my pleasure to help make this resource even better.
One of the things Logos wants to do is to cooperate with the Wiktionary communities, and as Ultimate Wiktionary is also to be implemented in a MediaWiki environment, what makes more sense than to join the resources of two vibrant communities?
Wednesday, September 21, 2005
The great thing is that it allows for, among other things, the use of this content for teaching languages. Consider: when you create a language exercise you need words to select. When these words are available in an electronic resource, you can make selections including particular vocabulary related to specific subject matter.
For the provider and the user of the Free content / Free education, it is a win-win situation. Both need ample data and with more eyes looking at content and structure, the data can only improve in quantity and quality.
The good news is all of this is happening.
Wednesday, September 07, 2005
For me it is a treasure trove. Because all the content described needs its place in the Ultimate Wiktionary. I have to think through how to add homophones. I have to think about how to have them as content. At this stage I do not need to concern myself with where it will end up in the actual screens. I have to have it in the database.
Homophones led to the most drastic change in a long time. I divorced Relations from Meanings. Now RelationType is connected to the Table table. This in effect allows me to use the Relation table in combination with Words as well. This gives me right and rite as homophones.
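To make the idea concrete, here is a minimal sketch of a relation table that is not tied to meanings. The table and column names (word, relation_type, relation, applies_to) are my own assumptions for illustration and not the actual Ultimate Wiktionary design; the only thing taken from the design is that the relation type says which table a relation connects.

```python
import sqlite3

# Hypothetical schema: all names here are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word (id INTEGER PRIMARY KEY, spelling TEXT, language TEXT);
CREATE TABLE relation_type (
    id INTEGER PRIMARY KEY,
    name TEXT,
    applies_to TEXT  -- which table the relation connects: 'word' or 'meaning'
);
CREATE TABLE relation (
    type_id INTEGER REFERENCES relation_type(id),
    from_id INTEGER,
    to_id INTEGER
);
""")
conn.execute("INSERT INTO word VALUES (1, 'right', 'en'), (2, 'rite', 'en')")
conn.execute("INSERT INTO relation_type VALUES (1, 'homophone', 'word')")
conn.execute("INSERT INTO relation VALUES (1, 1, 2)")


def homophones_of(spelling):
    """Return the spellings stored as homophones of the given word."""
    rows = conn.execute(
        """
        SELECT w2.spelling
        FROM word w1
        JOIN relation r ON r.from_id = w1.id
        JOIN relation_type t ON t.id = r.type_id
        JOIN word w2 ON w2.id = r.to_id
        WHERE w1.spelling = ?
          AND t.name = 'homophone'
          AND t.applies_to = 'word'
        """,
        (spelling,),
    ).fetchall()
    return [row[0] for row in rows]


print(homophones_of("right"))
```

In this sketch a relation is stored in one direction only, so "rite" does not list "right" unless a second row is added. Whether relations should be symmetric is a design choice that I leave open here.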
Tuesday, September 06, 2005
Eponyms are however a funny thing; they are definitely related to words and not to their meaning. Actually this should be quite obvious, because in German you have "Röntgenstrahlen" while in English it is called "x-ray". One is an eponym while the other is not, and they do share the same meaning.
Thinking about eponyms and how to include them in the database design made me see the light: I was really wrong about how I had etymologies in the data design. Like eponyms, etymologies are word related and not meaning related; they too are language specific.
The funny bit is that many people have looked at it and nobody noticed. I think it is like with so many things, the best designs do not survive reality unscathed. :)
Sunday, September 04, 2005
Sun has its own Computer Aided Translation (CAT) tool. These open language tools are written in Java. The open language tools are licensed under the CDDL license. It would be great if they could share their code with the OmegaT CAT tool. OmegaT is also written in Java.
My point is that there is this big concentration of effort and power in the commercial CAT tool business, to the extent that there is a genuine monopoly. It does not make sense to have all the Open/Free CAT tools work separately. To stimulate cooperation, we hope to make a success out of the reference tool for a translation glossary. The best thing that could happen is that some serious attention is given to more cooperation.
Wednesday, August 24, 2005
I am actively looking for money to enhance OmegaT. Not because I am likely ever to use it, but because I want quality translations in Ultimate Wiktionary. This reference implementation will do exactly that. In order to make OmegaT more user-friendly, there are several little things that can be done. Sabine defined some quirks that are a barrier for newbies. She also identified that "Trados" compatibility would be really important to introduce many translators to OmegaT.
A friend of mine is working on three annoying things. He is a professional programmer. He will inform us how much time it costs and how much it would cost if 100 translators paid for this. The point that I want to drive home is that you can either pay for a license or pay for functionality. When you pay for functionality, it will prove to be much cheaper.
I have my reasons for wanting OmegaT to be a success: Microsoft happens to be the monopolist in the translation/localisation business. It is a genuine surprise, but when you think about it, it should not be.
Sunday, August 21, 2005
The differences between the current Wiktionary and Logos are profound. In Wiktionary anyone, even anonymous users can edit almost everything. In Logos an anonymous user can add words that are checked. When you are a professional, you edit translations in the languages that you know best. The thing that I missed were the talk pages; a place where you can discuss an individual word. I missed the IRC channel where I can discuss issues about words or meanings.
I understand the differences; they make sense because they show where you are coming from. Logos provides very much a tool for translators by translators. Wiktionary is very much a tool for people who care about words/lexicology and share this in their mailing list, IRC channel and talk pages.
What kind of a community would result when these two communities were to merge? What kind of content? It would be an interesting experiment.
Tuesday, August 16, 2005
Consider what is going to happen when we can Free all the recordings of sign languages that exist in many universities and combine them with the content that we hope / expect to have: it will be an awesome lexicological resource. Because of its scope it will get a relevance of its own, and that is why it is so great that an organisation like the Wikimedia Foundation will host it. It is not party to any of the rivalries between organisations or institutions; it is just there to provide information to all the people of this world in their own language.
Sunday, August 14, 2005
Wikimania was also important because it paved the way for including sign languages into the project. Wolfgang Georgsdorf gave a great presentation and together with Ascander we changed the data design to include sign languages as well. I learned that there are ISO-639 codes for sign languages as well :) .
When I came back I worked hard on many things that can be considered the fallout of the conference, and I still have not finished all the things that I want to do. The nds thing did not go away and it does cost me time. There are all kinds of things that I want to have done and I work on them. It is about priorities. Informing about what is going on is a priority; I did some work on the nl.wikimedia server. And now I finally have written here as well.
Sunday, July 31, 2005
I can't because things have to develop some more. Because Ultimate Wiktionary is still some way off. Because people are still thinking old Wiktionary and I would only confuse most people even more.
Really, when we get UW live and functional it will be great. The filenames that I used have names like Spelling, Word and Meaning. The problem is that people think of them as a spelling, a word or a meaning. Sometimes I think I should have named them Kwik, Kwek and Kwak (you may also know them as Huey, Dewey, and Louie). It would make it more abstract, but would it help ??
Wednesday, July 27, 2005
The proverb "de beste stuurlui staan aan wal" has a similar meaning to the English "backseat drivers". The Dutch version is nautical and the English one obviously is not. So it is hardly a literal translation, but it is a functional translation. In general terms the meaning can be described and as such it would function; however, I can appreciate that there will be a certain drift when proverbs from many languages are put forward.
The problem for me is to consider how to deal with this in the Ultimate Wiktionary.. Well, at this moment we do not have it yet so it is not a problem .. :)
Saturday, July 23, 2005
*Wikidata will be about the technology that will be behind Ultimate Wiktionary and many other projects.
*Wikisign will be about creating a lexicological resource for the deaf.
*Logos will bring us what their experience is hosting lexicological content.
*Ultimate Wiktionary will be about what we hope to achieve in the next generation of Wiktionary.
As I have been talking so much about what I hope to achieve, I may not bring you anything new. However, there is so much to it..
Friday, July 22, 2005
Some of these things are so basic that you tend to forget to include it by making it explicit. All in all it proves important, publishing and publishing again does work.
Wednesday, July 20, 2005
The easiest for sign languages was the realisation that a movie is the "Pronunciation" of a signed word. This made me change the field name from "Soundfile" to "Mediafile". More complicated is the fact that there are some four written sign languages. These I would really want in the Ultimate Wiktionary. The question is, do they have their own Unicode characters, like Chinese does? When they do, I do not have to do anything. It would just work as designed.
I have realised that languages like Arabic and Chinese are formal written languages. There are many people who have a spoken language that is grammatically and syntactically (does this word exist?) different from the formal language. So when I record pronunciations, how do I deal with those? How do I register those languages? How do I indicate that these languages use Chinese / Arabic for their written language..
My working theory for the moment is that there may be transcriptions for those languages. Certainly when they have been noted by someone who has some authority, these can be used to link the essentially oral words with something that has characters. These characters are needed at this moment to make it possible to enter them in the database. Now the question is, how to relate them to the written language ... At this time it is just a matter of having the written word as a translation.. in effect this is correct.
Thursday, July 14, 2005
One of the things that is funny is that when you design a database you not only have to think of the data itself, you also have to think about how it is to be used. The problem for me is that the database and the development are pretty much divorced. I know databases pretty well but I do not know the restrictions of MySQL in combination with what Wikidata will bring us.
I find it really thrilling that we are at the stage where there is an imminent need for the data design for Ultimate Wiktionary..
Friday, July 08, 2005
The three most important tables will be "Language", "Word" and "Meaning". They are related top down. The most difficult to understand will be "Meaning", because the text of the meaning itself will be in a separate table, "Meaning-text". This is because the text of a meaning is to be had in every language, and it is the abstraction of the meaning that is in "Meaning".
This "Meaning" will relate to synonyms and translations (a synonym is equivalent to a translation in the same language). This will give people an instant problem: many words are not the exact translation of another, so how will we deal with this?
When a word is translated, the word picked in the translation is the one that fits best in the meaning of the original word. This meaning is therefore one that is of importance to this word as well. This meaning can be endemic to the language of the word, which makes it a natural fit, or the meaning can be external to the language of the word. When the meaning is external to the language, this meaning is only relevant when translating the word.
This sounds problematic. The words girl, meisje and Mädchen are good translations of each other. In the Neapolitan language there are words that are specific to girls of a certain age. The meaning of these words is included in the meaning of the word girl. They need to be shown when you are interested in the Neapolitan language. However, when you are not interested, these meanings, which are external to words of the English language, do not have to be shown.
I have been told that there are some four words whose meanings are included in that of the word girl. These meanings do relate to each other and as such it makes sense to use thesaurus-like structures to describe these relations. As these relations describe the meanings, they are relevant when you are interested in the Neapolitan language. They do help a translator choose the best fit and also alternatives when one word is used too often.
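The table layout described in this post could be sketched roughly as follows. This is only an illustration under my own assumptions: the column names, the word_meaning link table, the English word "lass" and the definition texts are all made up for the example and are not the actual data design. What it tries to show is that the abstract "Meaning" carries no text itself, that its text lives per language in "Meaning-text", and that a synonym is simply a translation in the same language.

```python
import sqlite3

# Hypothetical layout; names and sample rows are illustrative assumptions.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE language (code TEXT PRIMARY KEY);
CREATE TABLE word (
    id INTEGER PRIMARY KEY,
    language_code TEXT REFERENCES language(code),
    spelling TEXT
);
CREATE TABLE meaning (id INTEGER PRIMARY KEY);  -- the abstraction, no text
CREATE TABLE meaning_text (                     -- one definition per language
    meaning_id INTEGER REFERENCES meaning(id),
    language_code TEXT REFERENCES language(code),
    definition TEXT
);
CREATE TABLE word_meaning (                     -- assumed link table
    word_id INTEGER REFERENCES word(id),
    meaning_id INTEGER REFERENCES meaning(id)
);
""")
db.executemany("INSERT INTO language VALUES (?)", [("en",), ("nl",), ("de",)])
db.executemany(
    "INSERT INTO word VALUES (?, ?, ?)",
    [(1, "en", "girl"), (2, "en", "lass"), (3, "nl", "meisje"), (4, "de", "Mädchen")],
)
db.execute("INSERT INTO meaning VALUES (1)")
db.executemany(
    "INSERT INTO meaning_text VALUES (1, ?, ?)",
    [("en", "a young female person"), ("nl", "een jong vrouwelijk persoon")],
)
db.executemany("INSERT INTO word_meaning VALUES (?, 1)", [(1,), (2,), (3,), (4,)])


def related_words(word_id):
    """Split the words sharing a meaning into (synonyms, translations)."""
    rows = db.execute(
        """
        SELECT other.spelling, other.language_code = w.language_code
        FROM word w
        JOIN word_meaning wm ON wm.word_id = w.id
        JOIN word_meaning om ON om.meaning_id = wm.meaning_id
                            AND om.word_id != w.id
        JOIN word other ON other.id = om.word_id
        WHERE w.id = ?
        ORDER BY other.id
        """,
        (word_id,),
    ).fetchall()
    synonyms = [spelling for spelling, same_language in rows if same_language]
    translations = [spelling for spelling, same_language in rows if not same_language]
    return synonyms, translations


print(related_words(1))
```

The same query serves both purposes; the only difference between a synonym and a translation is whether the language of the other word matches. Meanings that are external to a language, like the Neapolitan ones, would need an extra marker that this sketch does not attempt.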
Thursday, July 07, 2005
As can be deduced from its name, most people use the pywikipedia bot for Wikipedia projects. Many of the innovations have been programmed with Wikipedia in mind. The latest innovation allows you to be logged in to several projects at the same time. The interwiki bot makes use of this facility and when it finds that one project needs to be updated, it will do so. This enhances the quality of the bot dramatically.
Supporting a non-programmer like myself is a pain. It is therefore important that tools like tortoise work well. It means that a common baseline can be created. This in turn facilitates the analysis of error conditions. Today we finally got it to work. We had to remove the application and start all over again.. This time it did download the pywikipediabot software from Sourceforge..
Really, Open Source rocks when there are friendly people like Andre ..
Wednesday, July 06, 2005
To gain access, you have to translate for the EU and you have to sign a contract that you use it only for EU use. There is however one bright spot; its copyright. The IATE copyright says clearly that you can have this data and use it as long as you attribute it to the institution that manages this information.
It is therefore a lucky coincidence that we want to make Ultimate Wiktionary relevant. It is as fortunate that we already plan on cooperating with the EU by publishing its GEMET content. When we have proven that we can host lexicological data, we can ask the EU if we can host this data. It is relevant data, it is important data; so much so that the EU expects that its modern systems will crash under the strain of all these people who want it.
With the Wikimedia servers, we are used to providing as good a service as we can. We do not promise 0.9999 uptime; we do the best that we can. And, if this data can be had for the lexicological information that it is, we are quite happy to host it. We are quite happy to cooperate with the EU to make this information available and more relevant than it is at the moment, being a "secret".
Sunday, July 03, 2005
The functionality that all these tools will derive from UW is the same. So having an implementation that provides the bare bones of what is needed makes sense. It does help to make a bigger group of people aware of the wish for this cooperation. It hopefully leads to the cooperation of the different communities behind these tools, in order to improve the quality of all the tools.
To communicate about the tool, we have started an experiment with Google groups. Here you find a discussion list. Everyone can read this, but only members may write to this list.. this helps against SPAM :) . It is not a Sourceforge environment yet, this is something that people who will develop this reference implementation should decide on.
It is always exciting to see how these things develop. I hope for the best.
Thursday, June 30, 2005
The English Wiktionary now has a problem and it has an opportunity. The problem is that many entries are wrong. The problem is that the interproject links to Wiktionary from Wikipedia are wrong. The opportunity is that there are many other things wrong as well, and it is therefore an unsought opportunity to revisit and improve the content.
Many people will feel frustrated because of all the hoo-ha. Many people will feel angry because the timing was not great; among other things it temporarily stopped the migration of Wikipedia to release 1.5. But the opportunity is there, and it is also the time to step up to the plate and do the best that can be done.
I am speaking to Andre Engels and I hope that he will come up with a bot that will find the words that should be capitalised and move them back to their capitalised form. This bot should also be able to list the words that can be found in both upper- and lowercase. After this the interwiki.py bot can be run again.. There is also a need for a bot that checks the en.wikipedia content for links to Wiktionary, checks if the article is there and, if not, changes the link to lowercase..
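The two checks I have in mind can be sketched in plain Python. This is not pywikipedia code and it does not talk to any wiki; it only works on a list of page titles, and the function names and sample titles are made up for the illustration.

```python
def lowercase_first(title):
    """Return the title with only its first letter lowercased."""
    return title[:1].lower() + title[1:]


def titles_in_both_cases(titles):
    """List capitalised titles that also exist with a lowercase first letter."""
    title_set = set(titles)
    return sorted(
        t for t in title_set
        if t[:1].isupper() and lowercase_first(t) in title_set
    )


def resolve_link(target, existing_titles):
    """Pick the title a Wikipedia link to Wiktionary should point to.

    Keeps the link if the entry exists, falls back to the lowercase form
    if only that exists, and returns None if neither is there.
    """
    titles = set(existing_titles)
    if target in titles:
        return target
    lowered = lowercase_first(target)
    if lowered in titles:
        return lowered
    return None


# Made-up sample of entry titles.
sample = ["Polish", "polish", "fruit", "Amsterdam"]
print(titles_in_both_cases(sample))
print(resolve_link("Fruit", sample))
```

A real bot would still need a way to decide which lowercase entries ought to be capitalised; that judgement cannot be derived from the titles alone.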
Yes, there was a need to prepare this change but it is also understandable that given that the decision was reached so long ago it could go wrong as it did. So now we have to do without preparation and just do the work ..
Monday, June 27, 2005
There is this thing with convincing people: to what length should you go? To what length do you want to go to make people buy into an idea? My time is valuable in that I can spend it only once, and if I spend too much time arguing I do not speak with people who have ideas on how things can be done. Arguing does not help the project when the result is not positive.
There are people who insist that I do everything by IRC or e-mail while I prefer to skype as it gives me better feed-back. There are people who are only interested in a tiny specific part of Wiktionary.. only English seems to be relevant to some. There are people who think they quote me and say things I would never say.
Basically, to me there are three groups. People who understand what I am saying, people who want to understand what I am saying and people who for whatever reason do not want to hear what I say or cannot understand what I say. With the first two groups I can talk. We do not have to agree but there is this basis of understanding. With the last group, who can be quite vocal, I find that they waste my time.
The problem is, there is always the off-chance that it is me who does not hear what they are saying. It may be a dilemma where there is no good solution. When I can address the problem of why they do not hear or understand what I say, they may become part of the people that become relevant... So, how much time to spend on this and how much to spend on new things?
Spending time on new things is a hazard in itself. It moves me even further away from the people who find it hard to hear / understand what I am on about..
I think I will add the word "frustratie" to the nl.wiktionary..
Sunday, June 26, 2005
When we can pull off these kinds of things, we will add extra relevance to the Ultimate Wiktionary.
Saturday, June 25, 2005
The spelling of the Dutch language will change in 2006; it will change things that are artificial, like paardenbloem, back to paardebloem, which is how it has always been pronounced.. The result will be that many words will be wrong from 2006 onwards.
In October 2005 a list of words will be published with the old and new spelling. It means that we have to cater for this list in the Ultimate Wiktionary. So the Ultimate Wiktionary has to be more ambitious alas..
This is then the time to start experimenting. I am using the word Imbiß as an example; in modern German it is spelled Imbiss. I have introduced two new templates: one to be used in front of everything to signal the old spelling and the correct one, and one to say that it used to be correct.
Having a date for the change will make the information even more valuable. When UW is used within software for optical character recognition, it may be used in a second pass after the initial pass that did the scanning. It will allow for an appropriate spellcheck that enhances the quality of the OCR process.
One thing to consider as well is that some spellings are local to a certain region or country. Rudolf Heß is called Rudolf Hess in Switzerland.. the "scharfes S" is not used in die Schweiz.. So words that still have their "scharfes S" in German are spelled differently in Switzerland. This is just spelling. Some words or their meanings are not known to all people who speak German, like "Paradeis", which Austrians know to be a "Tomate".
I am more and more appreciating the fact that linguists find it astounding that we attempt to make the Ultimate Wiktionary a reality. What makes us try it is that it was for us a natural growth path from Wiktionary. So we get our problems serially and not in a parallel fashion. The issues are there to be solved and they can be solved. Getting the issues serially helps because it prevents you from being overwhelmed by complexities; for us it is just a matter of refactoring.
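As a rough sketch of what a dated spelling could look like: each spelling gets an optional start and end date, and a spellcheck asks whether a word was correct on the date of the document being scanned. The class, the field names and above all the change date are assumptions for the illustration; the date below is a placeholder and not the real date of the German spelling reform.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Spelling:
    text: str
    valid_from: Optional[date] = None   # None: no known start date
    valid_until: Optional[date] = None  # None: still correct; end is exclusive

    def is_correct_on(self, day):
        """True if this spelling was the correct one on the given day."""
        if self.valid_from is not None and day < self.valid_from:
            return False
        if self.valid_until is not None and day >= self.valid_until:
            return False
        return True


# Placeholder date, chosen only for the example.
CHANGE = date(2000, 1, 1)

SPELLINGS = [
    Spelling("Imbiß", valid_until=CHANGE),
    Spelling("Imbiss", valid_from=CHANGE),
]


def spelling_status(word, day):
    """Classify a word for a document dated `day`."""
    for spelling in SPELLINGS:
        if spelling.text == word:
            if spelling.is_correct_on(day):
                return "correct"
            return "not correct on this date"
    return "unknown"


print(spelling_status("Imbiß", date(1990, 5, 1)))
print(spelling_status("Imbiß", date(2005, 6, 25)))
```

For OCR this means an old book can be checked against the spelling of its own time, instead of having every old spelling flagged as an error.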
Tuesday, June 21, 2005
OSOSS is working hard to make the list of properly spelled words maintained by the NTU available for the public. Because of all kinds of contractual restrictions this is not possible at this time. To alleviate this issue, they are working as the focal point of the Dutch Open world to get the list of the NTG, the Nederlandstalige TeX Gebruikersgroep, validated for the spelling. This means that some 222,872 words will be validated.
This list of differently spelled words comes with indications of how the word is to be broken up at the end of a line. When the UW is to host such a list, it will mean some adaptations to the software; we will want to keep track of the correct spelling. As the Dutch spelling will change in August 2006, it means that we will want to retain the old spelling and mark it as such. As the change of the spelling rules will be in the future, we will have to consider how to deal with this.
When we host a resource like this for the NTG, it means that our license has to be compatible with the NTG. Currently they use the GNU Lesser General Public License. They do not care who uses it under what license as long as it stays Free.
Technically there is this issue; we want to host this data for the NTG. It would be really cool to be the resource for the Open/Free content world and host the Open/Free resource for the Dutch language. It would very much be in line with our objectives. We will find a solution for this issue; one thing is sure: the LGPL is not applicable to a wiki. :)
Monday, June 20, 2005
My opinion is a bit more inclusive, I would like to have extensive definitions and etymology but for me the sheer fact that a word is properly spelled is enough to have it in an electronic dictionary. The Dutch language knows an institution that does provide the authorised list of correctly spelled Dutch words. For me a list with these words would be a worthwhile contribution to the Ultimate Wiktionary. Obviously, it would be a bit meagre but it does serve its purpose.
When the correct way of spelling words changes, like it will do on the 15th of October, an electronic dictionary has a clear advantage over paper based dictionaries. It is however not clear to me how we should cover the old correct spellings. In a way it is relevant to have a history of correct spellings. It could/should be part of the database..
Thursday, June 16, 2005
With UW we will get the cooperation, the synergy that we do not have at this moment. The will to cooperate is there but it just does not happen. So I am unapologetic, only with the UW we will get the synergy that we so desperately want. It is not that we do not want to, it is that it does not happen in a practical manner.
Tuesday, June 14, 2005
When I want to connect to people who are "official" or high up in an organisation, there is little chance for me to actually reach the right level. There are often many intermediary levels before my message gets to Mr or Mrs Right. These intermediary levels have strategies similar to mine; I do not expect that they are impressed with my myrealbox.com or gmail.com e-mail addresses. It makes me just a member of the public (and I am), not someone who asks something on behalf of the Wikimedia Foundation. So it would be helpful if people who are known to be active on behalf of the WMF had a wikimedia.org e-mail address. It helps to overcome the barriers thrown up by the intermediary levels and get a job done, a message delivered.
Monday, June 13, 2005
Влади́мир Влади́мирович Пу́тин is a surprise for me because it is the first famous person I found on Wikipedia with a sound file that I did not ask for.
The funny thing with pronunciations is that the pronunciation of Mr Bush can be heard on the Dutch Wikipedia. The English Wikipedia objects to the sound file; it has already been removed several times. Some people use Wikipedia to learn languages; it is therefore useful to learn how a local pronounces famous names.
What I would really like is to have soundfiles of famous people. We have already asked the new pope... One can always hope :)
Saturday, June 11, 2005
A lesson has two parts; the spoken Farsi words and the translation in Dutch. When you click on the Farsi words, you may hear the pronunciation in the .ogg format. When you press the Dutch words it takes you to the nl.wiktionary.org. When a word does not exist I create it when the word exists I add the Farsi translation.
It is really hard if not impossible to use my favourite browser; I have to move to the other side, as it is clearly superior when editing a page like FarsiLes5. In the past I did enter bug reports for Mozilla and MediaWiki, and I learned today that they are working on it. I hope they do a good job, because Firefox is almost useless when editing pages where there is a mix of languages.
Thursday, June 02, 2005
I decided to wade through the wikitech-l and found this interesting concept of "transcluding a sound". I think they mean that this means that a sound is played automagically. This needs some clever software that will play the sound in-line. The article says that there is no need for such a feature .. Would it not be cool if you find a word, you hear it automagically ?? Yes, it could also be something that you can enable/disable from your preferences ..
Saturday, May 28, 2005
Today I was working on some translations from the English Wiktionary and the word "fruit" had two translations in languages I had not seen before. One of these was the "Ojibwe" language. There is no mention of this language in ISO-639-3, so I had a problem. The language codes that indicate which language a word belongs to are based on this code.
Google as so often turned out to be my friend; the Ojibwe are better known as the Chippewa; the code for the Chippewa language is ciw. So I was pleased to have a code to go with the Ojibwe language.
In the Ultimate Wiktionary there will be no reliance on the existence of an ISO 639 code. We could have more languages and nobody would realise ..
Wednesday, May 25, 2005
A template like this can be used not only to indicate who has some expertise in some language, but also as a filter on the recent changes in an Ultimate Wiktionary. This could function in the same way as it would work for the inclusion/exclusion of bots.
Another use would be to help indicate who shares knowledge of a language, this in turn could help form a community for a language.. I can appreciate that there could be a need for a "village well" or a "kroeg" for each community.
One thing I did find was that it is also a bit of information where you need LOADSA localisation.. There are three templates for each language; Oscar one of my Dutch Wikipedia friends is a tr-3 there is no template for that yet. His being only a tr-3 means that you cannot expect him to write this template in Turkish :)
Friday, May 20, 2005
As I recently explained we do want to use the data in Ultimate Wiktionary for non server purposes as well. I mentioned the .dict data format. Data in this format is also used in off line usage. To create this data you create a subset of the data we hold. Given the license we should inform about every contributor to each word. This is not practical. It is practical to refer to the UW for the history of every word.
As the Ultimate Wiktionary is a new database, it is best to start with an appropriate license that is free and prevents the data from becoming unfree.
With the UW containing the free data, it does not pay to be too concerned about mirrors that host UW data. Given time and the enthusiasm of our community, the UW content will grow and therefore has the potential to outcompete this type of competition in all important ways.
Monday, May 16, 2005
The biggest challenge however will be to grow both the content and the community. As there is still a limited number of languages present, we do need to grow a presence, a community for the missing languages. For many languages, including my own mother tongue, there is no comprehensive coverage yet. We will be searching for content to be added by our community and by incorporating existing glossaries, wordlists and thesauri.
Today I learned about a third way of making free content available: RFC 2229, a protocol to provide people with dictionary information over the Internet. The trick here is that there is a database and that it provides the information where it is available. So from a user's point of view it would be great if we cooperated with dict.org.
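To give an impression of what the protocol looks like on the wire, here is a small sketch. In RFC 2229 a client sends a DEFINE command with a database name and a word, and the server answers with numbered status lines: 150 announces the definitions, each 151 line is followed by definition text that ends with a line holding a single dot, and 250 closes the exchange. The default port is 2628. The helper names and the sample response below are made up for the illustration; nothing here connects to a real server.

```python
def define_command(word, database="*"):
    """Build a DEFINE command; '*' asks the server to search all databases."""
    return f'DEFINE {database} "{word}"\r\n'


def parse_definitions(response):
    """Extract the definition texts from a raw DEFINE response."""
    definitions = []
    current = None  # lines of the definition being read, or None
    for line in response.split("\r\n"):
        if current is None:
            if line.startswith("151 "):
                current = []  # a definition body starts after a 151 line
        elif line == ".":
            definitions.append("\n".join(current))
            current = None
        else:
            # A leading dot in the text is doubled on the wire; undo that.
            current.append(line[1:] if line.startswith("..") else line)
    return definitions


# Made-up response in the shape the RFC describes, not real server output.
SAMPLE_RESPONSE = (
    "150 1 definitions retrieved\r\n"
    '151 "fruit" sample "Sample dictionary"\r\n'
    "fruit\r\n"
    "  the edible product of a plant\r\n"
    ".\r\n"
    "250 ok\r\n"
)

print(define_command("fruit"))
print(parse_definitions(SAMPLE_RESPONSE))
```

An Ultimate Wiktionary that wants to serve this protocol would have to produce responses of this shape from its own database, one definition block per matching entry.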
There will be two issues that need to be resolved. We use the GNU-FDL license for our content and the GPL for our software and they use the GPL. In order to cooperate we will need to work something out. Licenses are a necessary evil but it would be a travesty if free licenses are found to be mutually exclusive.
RFC 2229 compliance would be for Ultimate Wiktionary mark II .. It is funny that Ultimate will be as much a work in progress as everything else.. Not really a surprise. :)