Tuesday, January 31, 2006
Today a nice standard called OLIF was pointed out to me. OLIF stands for Open Lexicon Interchange Format; it is an open standard for encoding lexical and terminological data. It is essentially a Free and Open standard, and it has many illustrious and industrious backers. But it is a tad Eurocentric. It now has East Asian language support, but it is best at English, German, French, Spanish, Danish and Portuguese.
I would not want to dismiss a standard like OLIF when people are actively involved in it. To be useful, it would be necessary to define how a standard is lacking; it might be that it is just a matter of some refining. It might also be that the standard considers things that I have not considered yet (and I do know that I do not consider everything all the time).
On Meta we are voting for a new Wikimedia project; this project is about standards. It would discuss standards, but more importantly it would be a conduit for working on those standards that do not fit the requirements that we have for standards in the Wikimedia Foundation's projects. Now there are two options: either I am wrong, or the standards will fit our needs like a glove.
Saturday, January 28, 2006
If you know Dutch, you will love it. If there is more of this on the Internet, also for other languages, I would appreciate learning about it :)
Monday, January 23, 2006
I have learned a few things. I am right in understanding that the location of words in a sentence is relevant. The TST-Centrale uses a fixed notation for it; I wonder how universal it is. On the CD it is explained that people who use this content can select the information they want to use, and that this is part of the secret of its success. We hope that having WiktionaryZ available as a database will serve in a similar way.
Reading back the first paragraph, I feel like a Tom Thumb. Every second noun is something to look up, and it was hard work writing it. The great thing is that when you have it in front of you, when you see it demonstrated, it does make sense. When we collaborate on content and structure, when we make WiktionaryZ something that is useful because it has an application, we will have giants helping us little people make progress, allowing us to stand on their shoulders and move forwards.
Saturday, January 21, 2006
From my perspective, the biggest problem I have with trusted computing is that it may be an open standard, but it is not a free standard. The specification of the Trusted Computing Group is available to organisations; it costs at least $1000, and thereby it excludes all those people who create wonderful programs that people share. Because of this lack of openness it has a fundamental problem. It makes me trust organisations that I do not necessarily trust; why should I trust Yahoo, MSN and AOL when they demonstrate what I perceive as a lack of protection for the privacy of their clients? This trusted computing architecture does not allow me to trust my own software or the software created by a friend. It does not, because I do not see how I can create software that will be trusted by my system.
Personally I do not think trusted computing is the equivalent of digital rights management. I am of the opinion that DRM leads to giving away rights that are mine.
Trusted computing does one other thing. I expect that it will take away much of the anonymity that is still with us on the Internet. This aspect is probably something that few people have considered. My first clue came after I realised that it is a perfect tool for fighting vandals on Wikipedia. It gives us a tool to know more precisely where these people come from, and it can give us much better protection against this scourge. The other side of the coin is that while it provides us with more control, it will also give more control to those that I do not trust to use it wisely.
It is only lacking in this one resource: money. If it does not find money, it will pack it in. They are now finishing off the software so that the project will be in the best shape it can be. I truly hope that the Kamusi project will find its way into a bright future. If you can help, please do :)
Friday, January 20, 2006
As it says in the information on the main page, they only have some 20,000 visitors, and the size of the database is "only" 51.08 MB. My reaction to these problems is predictable; I would like to host this information in WiktionaryZ when we are ready for etymological content. There are however a few issues. These issues have to do with the perception of Wikipedia. Let me quote:
"Approach Wikipedia about a partnership, or actually merge the site into Wikipedia. This is a painful option. In a sense, this site is the anti-Wikipedia. It is deliberately not open source. You'd understand why if you saw the regular stream of e-mails I get from people insisting that their own crackpot folk-etymology idea is absolutely correct. Such things can be based on some fervent politico-racial agenda, simple insanity, or "my French teacher in 10th grade told me."
A major reason this site exists is to serve as a template against which to measure people's best guesses and wacko theories. The whole Internet is a big Wikipedia; this site is a compilation of the most rigorous academic information."
In WiktionaryZ we have a need to address quality. In the current Wiktionary we have our crackpots. There is this loon who insists that there is a word called "exicornt". He even threatened admins in order to have this word exist in the Wiktionary.. :( We have to deal with these persons.
We could do something for etymonline. We can approach them. We can offer to host their content by importing it (obviously with proper attribution), but we surely have to address the issues. Quality is important, and we have to protect quality content for our own sake. When we do, and when we are successful, opportunities like this will be less problematic.
Thursday, January 19, 2006
This is a marvelous resource; there are some caveats. The on-line data only goes back to 1979, while the resource started in 1932. The printed edition is available in the UNESCO library in Paris...
There are many possibilities with a resource like this. It is indeed one of the resources that would benefit on a massive scale from a Google Books-style action. Obviously I would not care who does the scanning and who spends effort on digitizing this content. It is however an extremely important resource, and everyone would benefit if the data from 1932 to 1979 becomes publicly available, as is the intention of this project.
There is more to learn about this project. When considering using it for presenting our sources in Wikimedia projects, we would like an API to refer to it. I have not looked into it.
Wednesday, January 18, 2006
What do you do when the WIFI router does not work anymore .. You use the ethernet option this router provides...
What do you do when the keyboard mapping is broken .. you do not use question marks ..
Life is a bitch.
Monday, January 16, 2006
Yesterday Amgine wrote an "Advertising proposal". The idea is that we can advertise the services that the Wikimedia Foundation provides. Essential for good advertising and marketing is that you target your audience. With such an attitude comes a focus that would improve our content. The advertising would be done using an advertising server; the many people who have their own wiki or their own blog could include advertisements from this server.
The functionality that was originally conceived for Wikidata will end up in MediaWiki itself; this is both a blessing and a curse. The great thing is that it signals that much of the functionality we conceived is indeed relevant, and it will bring functionality to all the wikis that use MediaWiki. The drawback is that it has implications for the design. It does complicate things at this stage, but on the other hand, when we have a great instead of a merely good design from the start, the extra time and effort needed now may pay itself back in the long run.
Sunday, January 15, 2006
All these things are not new. It happened all the time. Now we will make this process more visible; it is good to read this other blog, the WiktionaryZ.blogger.com blog of the commission.
The thing I am not sure about is to what extent I will continue blogging here and to what extent I will blog on the new blog.
Saturday, January 14, 2006
There is one issue; people think that the problem with this project is that it is run by one person: me. It was once put to me that I represented a "truck factor" for the project. From my perspective, WiktionaryZ is certainly my dream, but it has never been my dream alone. I have worked hard to make WiktionaryZ happen, but I am not the only one who worked hard to get it this far. I came up with many of the ideas that made WiktionaryZ what it is, but not all ideas have been mine; all the ideas became WiktionaryZ because of the many conversations, e-mails and IRC chats about them.
From my perspective, there is this issue that I do not scale. There are things that amount to policy, and they need to be expressed because policy dictates technological choices, and we are building the technology for WiktionaryZ. This implies that some policies have been set, but it also implies that more questions will raise their head that require an answer, and require it quickly, because we are building the software, the network and the connections now.
I did discuss this issue with Jimmy Wales, and he came up with the great idea of having a commission for projects like Wiktionary. This commission could have several functions; it can act for a project in a similar way as the chapters do for countries. One role of the commission would be to ensure that the policies of a project are consistent with the aims and policies of the Wikimedia Foundation. Another, equally important part would be to represent the community that will make WiktionaryZ its own. To do all this, members of such a commission have to be part of the discussions about developing Wikidata and WiktionaryZ.
I like the idea, and I have asked several people to become part of an initial commission. They are all Wiktionarians (with one exception), they represent many Wiktionary projects, and they are and have to be communicative; they use Skype/VOIP and can often be found on IRC.
Including Sabine Cretella is obvious. Sabine developed WiktionaryZ with me from the start. WiktionaryZ is a dream she has fostered for a long long time. Sabine is active on the Italian Wiktionary and is one of the initiators of the Neapolitan Wikipedia. Sabine is a professional translator and is known in this world as an evangelist of Open Source and Open Content.
Of all the Wiktionaries, English is the most relevant. Dvortygirl has been active there for a long time. Like me, she is an admin and she is well liked and respected for her work.
Gangleri is active on many Wiktionaries. The thing I really appreciate is his involvement with right-to-left languages like Yiddish. On the Internet, these languages have their own issues; Gangleri is active in the Mozilla organisation to address several of these.
Yann is active on the French, Hindi and Gujarati Wiktionaries. He is also the treasurer of the French chapter. One of Yann's challenges is to help us get more people interested in the languages of India.
Erik is the one who is not into Wiktionary. The reason he is invited is that he is the architect and realiser of Wikidata, the enabling technology for WiktionaryZ. Erik is also the realiser of WiktionaryZ; Wikidata has in WiktionaryZ its first application. As this is a truly big and complex project, many of the things that would eventually hit Wikidata need to be addressed from the start. Erik is also important as a linking pin to the MediaWiki developers.
GerardM, if I need an introduction I invite you to read this blog.
Some policies or, a glossary of our policies:
Availability: "our data is to be made available through open standards and in a non-discriminatory manner"
Data design: "WiktionaryZ is implemented as a relational database. When information that is relevant in the context of WiktionaryZ cannot be added, we will try to amend the design."
Full functionality: "WiktionaryZ needs to be able to include the information that is available in the Wiktionaries"
Partner: "a partner is an organisation that collaborates with us in the realisation of what we intend with WiktionaryZ"
Sponsor: "a person or organisation that donates money or content to the project or to the Foundation"
Success: "success is when people find an application for the WiktionaryZ data that we did not think of."
User Interface: "the interface should be available in any language. We want this for both the MediaWiki and the WiktionaryZ user interface"
Thursday, January 12, 2006
Java uses ISO 639 for its language codes; the codes used are those of ISO 639-1. Consequently the Neapolitan language is not known. OmegaT is an open source CAT tool; it uses the languages known to Java as the languages it can translate. So in order to translate into Neapolitan you have to pretend that it is a different language.. Not nice.. So the nice people of Sun were asked about this, and we have great expectations.
Today there was a request on Meta, the website about the Wikimedia Foundation's projects, for a new Wikipedia. The request is for Tarantino; it is considered a dialect, a dialect of Neapolitan. This request is problematic because there is not even an ISO 639 code for it. Consequently there is little chance of a Wikipedia being created for it. Now, with the new namespace manager, it is possible to create a separate namespace within nap.wikipedia.org for the Tarantino dialect. This is also a solution for the problematic request for a Lower Saxon Wikipedia that would be in an orthography that is not German..
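The code-registry gap can be sketched in a few lines (Python used purely for illustration; the code sets below are tiny hand-picked samples, not the real registries):

```python
# Hand-picked samples, NOT the full ISO code lists.
ISO_639_1 = {"en", "nl", "de", "fr", "it"}      # two-letter codes, what Java's Locale knows
ISO_639_3 = ISO_639_1 | {"nap", "nds"}          # the three-letter standard covers far more languages

def has_code(lang, registry):
    """Return True when a language code exists in the given registry."""
    return lang in registry

print(has_code("nap", ISO_639_1))  # False: Neapolitan has no two-letter code
print(has_code("nap", ISO_639_3))  # True: ISO 639-3 does include it
```

A tool that validates only against the two-letter list, as Java did, simply cannot name Neapolitan, let alone Tarantino.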
It is sobering to see that standards can both enable and prevent things from happening. Good standards are vital, and ISO/DIS 639-3 is a big move forward.
Wednesday, January 11, 2006
To create a Machine Translation engine for these languages, you would need all kinds of rules about the languages that you want to translate to. Now, I would like to know these rules. Not so much to build Machine Translation engines, but because they have this other application, one that is much closer to my heart: they are needed for software that teaches people languages. The Universität Bamberg is working on exactly this. They want to use the data of WiktionaryZ for this purpose. So if you have a nice set of rules for them, I would be obliged.
As astonishing as the growth of Wikipedia is the apparent popularity of Wiktionary. According to Alexa Wiktionary is more popular than dictionary projects that have much more content. This is probably the effect that the association with Wikipedia brings us.
When you look at the traffic details for Wiktionary, the thing that strikes me most is the popularity of the Russian Wiktionary. Such details point to the apparent strength of this project, or to the need for Russian content. For me, this is relevant because it could be one reason to concentrate resources on a given language.
As WiktionaryZ is a true wiki project, there is no need to worry too much about the user interface. People WILL find what works and what does not work, and to a large extent the user interface will evolve. However, at this moment we are thinking about the infrastructure of the project.
In the current infrastructure a resource is indicated by preceding the project name with the ISO 639 code; the Dutch Wiktionary is therefore http://nl.wiktionary.org. For WiktionaryZ we do not have a separate database for each project. When we maintain this kind of link to the main page for a language, we benefit in three ways: there is a main page for the whole project, there is a localised entry point per language, and the statistics of Alexa remain relevant for some basic analysis of the demand for our project.
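The naming convention can be made concrete with a small sketch (Python, with a hypothetical helper name):

```python
def split_project_url(hostname):
    """Split a Wikimedia hostname like 'nl.wiktionary.org' into (language, project).

    The first label is the ISO 639 language code, the second the project name.
    """
    language, project, _domain = hostname.split(".")
    return language, project

print(split_project_url("nl.wiktionary.org"))  # ('nl', 'wiktionary')
print(split_project_url("ru.wiktionary.org"))  # ('ru', 'wiktionary')
```

Keeping per-language entry points in a single-database WiktionaryZ would preserve exactly this addressing scheme, even though the data behind it is shared.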
Tuesday, January 10, 2006
The current implementation of our database is the first iteration of what is going to be WiktionaryZ. The intention started off as at least including all the information that is included in the Wiktionaries. So far I was against the inclusion of transliterations, as we would want the translations anyway, and often a translation is in effect a transliteration. The case was made that this is too simple.
Several points were made:
- people find transliterations useful
- many phrasebooks include them (their transliterations are often quite bad)
- having a "standard" transliteration helps because for names like معمر القذافي in excess of 20 different transliterations exist.
- transliteration should exist in addition to translations AND they are specific to a language
- there are standards for transliteration (eg for transliterating Russian into English)
Given my stance on what should be included in the database, I would say I want to have this. However, this is complicated by the fact that transliterations are language specific AND they are on the same level as pronunciations.
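One way this could be modelled — a sketch with hypothetical field names, not the actual WiktionaryZ schema — is to attach transliterations to a translation keyed by the reader's language, much as pronunciations would be:

```python
from dataclasses import dataclass, field

@dataclass
class Translation:
    language: str   # language of the translation itself
    text: str
    # Transliterations are specific to the reader's language and script,
    # so they sit beside the translation, keyed by target language.
    transliterations: dict = field(default_factory=dict)

name = Translation(language="ar", text="معمر القذافي")
name.transliterations["en"] = "Muammar Gaddafi"
name.transliterations["de"] = "Muammar al-Gaddafi"

print(name.transliterations["en"])  # Muammar Gaddafi
```

This keeps the point from the list above explicit: one Arabic expression, several per-language "standard" transliterations, instead of twenty competing ad-hoc ones.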
Yes, this gentleman knows how to find Njoe Jork; he studied some Afrikaans at one point in his life.. :)
- We have discussed that we are in danger of getting too much content. This may mean that we have to absorb content slowly and not make a big mess.
- There was someone new to me who came to Sabine and wanted to help us connect with someone we are already working with. The great thing is that this is a person who can help the Wiki for Standards project along. He is also someone who could become relevant for WiktionaryZ.
- Today I found this proposal called Wikimaps. This is a proposal that was missing localisation, so I added it as a "must have" feature. When this happens, many translators will be happy to have a resource that helps with geographic data.
- I have been thinking more about partners. There is so much work and there are so many problems that will come our way and there are so many organisations that can help us manage. It would be folly not to collaborate and it would be ungrateful not to acknowledge these organisations.
- Quistnix told me that conversions are happening because of messages about "namespaces". He is very active with interwiki links, so he would notice these things.
- I received a mail about Maltese verbal morphology. This came with a request on how we could collaborate.
Monday, January 09, 2006
- The way copyright infringements are signalled in relation to lexicological content
- Do we want to emulate what others have done, or are we masters of our own destiny
- Where are the French, the Swahili, the Kannada and the words of all the other languages that are equally deserving of attention
With WiktionaryZ we have the opportunity to use Wikipedia as our corpus. In Wikipedia we find the words as they are used today. When we concentrate our effort on these words, we provide added value to the information contained in Wikipedia, and at the same time Wikipedia adds value to WiktionaryZ because it allows us to show words in context. In Wikipedia we have contributors from many countries, and they use words that are normal in their locale. With the "OED and AHD" we do not necessarily get these words.
It is not a bad thing that people interested in English concentrate on English. As such I welcome this effort. However, in my opinion the emphasis is too much on major languages like English. In these languages it is hard for WiktionaryZ to become relevant. To become relevant we have to do things that others do not. Relevancy can be gained in many ways; translation to minor languages is a way for some, counting the characters in a word is a way for others.
From my perspective, we become relevant by harbouring communities and special interest groups and allowing them to make WiktionaryZ their project. When we maintain our core values of freedom, of inclusiveness, of non-discriminatory access and of open standards, we will be relevant to some if not to all.
Sunday, January 08, 2006
When we have single login, people will know that Sabine is not a newbie. So what will happen when you are new to a project; would that mean that you will not be welcomed? Sabine and I found it funny that she was welcomed, but then again it is the charm of our project that you are welcomed.. What is funny is this..
The biggest improvement for the production of statistics has been the policy that the dumps of the Wikimedia Foundation are in an XML format. This provides a much more stable basis for the production of statistics. With the advent of Wikidata and single login, I worry about how this will affect these popular statistics.
It was great to have Erik reassure me that we will cross that bridge when we come to it, "because we have always done that". Issues that may arise: currently we do not have Wikimedia-wide statistics; with single login implemented, we can. Wikidata projects will be .. where, and will probably not count for the project that the data has been entered for. With the introduction of the namespace manager in MediaWiki 1.6, some of the assumptions in the statistics may have to be revisited..
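To show why an XML dump is a stabler basis for statistics than scraping, here is a tiny sketch; the element names loosely mirror the general shape of a MediaWiki export, simplified for illustration:

```python
import xml.etree.ElementTree as ET

# A minimal stand-in for a MediaWiki XML dump (structure simplified).
dump = """<mediawiki>
  <page><title>Amsterdam</title><revision><text>...</text></revision></page>
  <page><title>Rotterdam</title><revision><text>...</text></revision></page>
</mediawiki>"""

root = ET.fromstring(dump)
# Counting pages becomes a single structural query instead of fragile text matching;
# the query keeps working even when page content or layout changes.
page_count = len(root.findall("page"))
print(page_count)  # 2
```

Whether such queries stay meaningful once Wikidata records and single login blur the per-project boundaries is exactly the open question above.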
Let me be clear about my position; I am all in favour of providing sources with articles. However, for every controversial subject you can find literature that "proves" all the crackpot ideas that are floating around. To complicate things even more, the literature available in one country or language is not all the literature that exists on a subject. It is therefore my position that you do not prove anything by providing sources; you only show that there are sources that helped us come up with a particular article, and that this raises the standard of quality.
Yesterday, there was a New Year meeting of Wikipedians in Rotterdam, and as is usual at such meetings all kinds of everything were discussed. Including the problems with sources. During the discussion we came up with the following:
We need a central database to do away with the current interwiki links. When we have such a database, we should have all sources for the same subject in there as well, never mind which Wikipedia they come from. It helps people find sources, but also, when there is a difference in the view taken between the different Wikipedias, the sources can be compared, and it will prove to be a useful instrument to deal with cultural bias.
The thing that triggered this idea was that Oscar had bought a book about the Amazigh. On the Dutch Wikipedia there was a guy who quoted sources that none of us could read, as they were in the Berber language. I was really pleased that Oscar took the effort to buy this, because it shows very much his good faith, our good faith. By having all sources for the same subject in one place, we would show similar good faith, and it would be a really powerful tool to remove much cultural bias from the Wikipedias.
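The central database idea could look something like this sketch (a hypothetical structure, not an actual Wikimedia schema): one record per subject carries both the sister articles and a shared pool of sources:

```python
# One record per subject, shared across all wikis (hypothetical structure).
subject = {
    "id": "amazigh",  # a single identifier for the concept, replacing N*N interwiki links
    "articles": {"nl": "Berbers", "fr": "Berbères", "en": "Berber people"},
    "sources": [
        # every wiki's sources pooled in one place, whatever their language
        {"title": "A book about the Amazigh", "language": "ber"},
    ],
}

def interwiki_links(record, from_lang):
    """All sister articles for a subject, derived from the central record."""
    return {lang: title for lang, title in record["articles"].items() if lang != from_lang}

print(interwiki_links(subject, "nl"))  # {'fr': 'Berbères', 'en': 'Berber people'}
```

With this shape, adding one article updates the links on every wiki at once, and the pooled source list is what makes cross-wiki comparison of views practical.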
Friday, January 06, 2006
When people want to do this kind of research, it makes sense to have a place for it in WiktionaryZ as well. It dawned on me that I have not really considered many of the User Interface questions that will come up. Then again some things are obvious. People will not only want to select the language that they see. There will be a need for a portal page for each language. This could be the kernel for a portal page for the English language (from the Dutch Wiktionary).
With starter portals like this, they can be expanded in many ways. Many internal (to the Wikimedia Foundation) and external resource links can be added to give users a rich experience. The main page of WiktionaryZ would therefore be similar to the http://wiktionary.org website, particularly to point people in the right direction.
Thursday, January 05, 2006
The fun will start when English, South African, Australian, New Zealand, Canadian and US-American legal terms end up together. The only way this can work is by having distinct terms clearly associated with a specific legal glossary for each jurisdiction.
The same report mentions a biblical Hebrew dictionary; it seems that the people working on such a resource want to have it in a collaborative environment.
Both are examples of the pent up demand that exists for a resource like WiktionaryZ.
Wednesday, January 04, 2006
We ended up with a design that will allow for a lot of refinement. However, many people who looked at it think it has great potential. Ultimate Wiktionary, the project, is our dream. As we worked on it over the last year and a half, we understood more and more the pent-up demand that exists for information that can be stored in our project.
Our dreams are big. We want to realise them. We are fortunate as we are associated with the best organisation to make this happen, the Wikimedia Foundation. It has a great reputation, it has a great community and what they do with Wikipedia is astounding.
From an organisational point of view, the WiktionaryZ project will be different from the Wikipedia project. Wikipedia is community driven; the community creates the data and finances the project. WiktionaryZ will be different because much of the data that will be included already exists. Many organisations struggle to maintain their resources. For WiktionaryZ the opportunity exists to focus all this energy in one place.
The development of WiktionaryZ was made possible by organisations supporting our effort. Kennisnet, a WMF partner, provided the initial investment. The Universität Bamberg was the second organisation to help. More work needs to be done and there are more organisations that are willing to collaborate technically and who are willing to share their resources to make WiktionaryZ happen.
WiktionaryZ is going to be big. I have made the bet that in the first year of full-featured production we will need two hundred thousand EURO (not US$) in servers. There are people who have told me that my "guesstimate" is on the low end.. :)
People who work on content are and will be attributed in the normal way; it can and will be found in the history of the content. Organisations that prove to be partners of our project could be credited on the left hand side and may end up under the toolbox. A link will refer to a page about our partner.
Our project is as much about collaboration as any of the other Wikimedia projects. There is however no other project where organisations will play such an important role. This calls for a different way of organising their effort. I therefore propose to combine these organisations in a consortium that will be the focal point for the contributions of organisations.
The WiktionaryZ consortium will have two functions; managing the collaboration of organisations and finding the resources to make WiktionaryZ possible.
Tuesday, January 03, 2006
This situation is more complicated on the Wikipedias. Because of disambiguation, articles are not necessarily named in an obvious way. With currently 212 languages, it is impossible to maintain the links in a timely and reliable manner.
There is also the problem that some people consider linking from Wikipedia to Wiktionary "spamming" and are so extreme as to ban people for this. The situation is a mess.
With the advent of more centralisation for users and the integration of Wikidata in MediaWiki, my solution would be technical. Each Wikipedia can link an article once to other projects and languages. Other projects can link in the same way. The behaviour of these projects is not necessarily the same. WiktionaryZ could, because of its thesaurus structures, provide a universal browsing capability through the articles and the terminology of many languages. Commons could provide galleries of pictures that are associated with a word as its "category".
I would also have this mechanism integrated with our search engine; when for instance the word Chihuahua does not exist at all in a Wikipedia, it might exist in the Wiktionary and thereby provide a disambiguation similar to the one in the English Wikipedia. When the Russian Wikipedia has an article on the dog, it can be suggested to read that one instead, due to the relation in the thesaurus.
The benefits in short:
- Better and more timely information
- Fewer edits by bots
- Improved user experience
- Better integration of the smaller Wikipedias
Monday, January 02, 2006
- All words must belong to a language
- All words must be spelled as they are used in that language
- Only data that is more or less structured can be parsed and thereby become a candidate for inclusion in the database
If this means that some data cannot be converted, I find that disturbing. However, I do not see how it can be done in a different way.
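The three rules above amount to a simple gate on conversion; here is a minimal sketch, assuming (my assumption, purely for illustration) that candidate entries arrive as simple dictionaries:

```python
def can_include(entry):
    """Apply the conversion rules from the list above."""
    # Rule 3: only structured data (here: a dict) can be parsed at all.
    if not isinstance(entry, dict):
        return False
    # Rule 1: every word must belong to a language.
    # Rule 2: it must carry its spelling as used in that language.
    return bool(entry.get("language")) and bool(entry.get("spelling"))

print(can_include({"language": "nl", "spelling": "fiets"}))  # True
print(can_include({"spelling": "fiets"}))                    # False: no language
print(can_include("a blob of free-form wikitext"))           # False: unstructured
```

The last case is the disturbing one: free-form Wiktionary text that never acquires structure simply falls through the gate.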
Sunday, January 01, 2006
The next important deliverable is a "write up" by Erik where he explains what Wikidata is about: how it can be used and how certain essentials are delivered. As much as an explanation, it will also be the mental exercise needed to get more programmers involved in Wikidata. Combine this with the amount of inline documentation that PHP provides, and it will open up many possibilities to many programmers, inside the Wikimedia Foundation and out.
Let me be really clear about one thing: Wikidata is powerful stuff in its own right. In one way it is really great that its first implementation is so ambitious; when you can model this, you can model almost everything. In another way it means that, as WiktionaryZ is dependent on Wikidata, it will move forward technically at the same speed. Given that the namespace manager will be in MediaWiki 1.6, and given that other infrastructure issues are being addressed, things could not look more promising than the way they do.
For the language guys reading this: we are seriously considering "stealing" a page out of the LMF book. Particularly for lexicological information, we are going to have "Attributes" that are defined in a language-specific way, and that will be defined in three particular places: the Expression, the SynTrans and the DefinedMeaning level. This means that we are about to ditch the LexicalItem. This will do two things for us: it will make the core of WiktionaryZ more efficient, and it will allow us to be more language specific from the start. As we will have Attributes that are conditional on other Attributes, and as this will be reflected in the user interface, I think this may reflect the core idea of LMF. Then again, I do not really know, as I do not yet understand enough of LMF.
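A very rough sketch of that structure, using the level names from this post (everything else, including the field names and example values, is my assumption, not the real WiktionaryZ schema):

```python
from dataclasses import dataclass, field

@dataclass
class Attribute:
    name: str
    language: str   # Attributes are defined in a language-specific way
    value: str

@dataclass
class Expression:   # a spelling in a given language
    spelling: str
    language: str
    attributes: list = field(default_factory=list)

@dataclass
class DefinedMeaning:   # a concept identified by a definition
    definition: str
    attributes: list = field(default_factory=list)

@dataclass
class SynTrans:     # links an Expression to a DefinedMeaning
    expression: Expression
    meaning: DefinedMeaning
    attributes: list = field(default_factory=list)

fiets = Expression("fiets", "nl")
bicycle = DefinedMeaning("a two-wheeled, pedal-driven vehicle")
link = SynTrans(fiets, bicycle)
# An attribute hangs off the link itself, not off a separate LexicalItem.
link.attributes.append(Attribute("register", "nl", "neutral"))

print(link.expression.spelling)  # fiets
```

The point of ditching the LexicalItem is visible here: with Attributes attachable at all three levels, the intermediate object has nothing left to carry.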