Wednesday, February 28, 2007

NIH Wiki Fair

The National Institutes of Health organised a Wiki fair today. There were great presentations; the line-up of speakers and subjects was impressive. Many of the presentations are online, and the whole event can be seen as a video as well ... all 5:41 hours of it...

For Wiki aficionados, the presentations by Dr Larry Sanger and Dr Barend Mons are particularly interesting. Larry is part of the Wikipedia history and Barend is part of the OmegaWiki presence.

Larry's presentation is self-serving; he gives a sales pitch for Citizendium and he urges the people from the NIH to join his project. That is in and of itself OK, as he is just doing what he has to do. As the raison d'être of Citizendium is doing a better job than Wikipedia, there are loads of comparisons with Wikipedia. It is not surprising that you find no nicks when that is your policy; it cannot be considered something that is "good" in comparison. Given that Larry finds it relevant to be known as Dr Sanger, I would expect a more scientific approach in his presentation. Given the small size of the project, it is not so strange to find that there is a lot of harmony around. This is very much the experience of all Wiki projects. It has been well documented in many scientific papers and I am sure Dr Sanger is aware of this.

Being part of the history of Wikipedia and having created Citizendium as a reaction to what is considered "wrong" in Wikipedia means that Citizendium will always be compared to Wikipedia. I think that having started from scratch is daring, and it is the right thing to do. There are certainly more scientists working on Wikipedia, and given that Wikipedia has more than 1000 times as many articles, it will be interesting to see what kind of attention this project will get.


Saturday, February 24, 2007

Open Access

Given that I am very much in the open source and open content world, it will not be a surprise that I am very much in favour of open access and that I do not think highly of the patent system.

I think that it is easy to argue that patents kill. My argument goes like this: the pharmaceutical industry is documented to be interested only in patentable medicine. When a medicine is not patentable, it has no incentive to research its application for particular purposes. Recently the University of Alberta found by accident that dichloroacetate (DCA) appears to suppress the growth of cancer cells without affecting normal cells. The problem is that there is no funding for the research needed to prove the efficacy of this medicine. It is therefore easy to understand that only patentable substances are considered in much of the bio-medical research.

I think that it is easy to argue that the classic business model for scientific publishing kills. The cost of reading scientific articles is staggering. The article published in Nature about what is shown in the demo is, for me, a really exciting and relevant article, yet it costs $30 to read. I cannot lawfully send it to my mother to read. I can only tell her about it. Scientific journals assume that the only people who need to read their papers are scientists; the scientific libraries will have a subscription and that is how it has always been. Except that it is not true. Laymen have as much of a need to read bio-medical science papers; they or their loved ones are affected by all kinds of afflictions. They need to understand what is happening to them. Often doctors do not know all the details that are available, and there are enough documented cases where literature research by laymen made a difference. One famous example is known as "Lorenzo's oil", a story about a disease called adrenoleukodystrophy or ALD.

It is not only laymen who are denied access to literature; many universities cannot afford the cost of literature either. With the results of science owned by business, it is important to understand how vital the whole "open" movement already is. A new and different business model is needed, and many organisations are developing such models. The big challenge is for science to become science again: to be able to do research without having the public pay again and again while some companies walk away with only an eye for profit.

The challenge will be to find a balance that does justice to all parties involved.


Thursday, February 22, 2007

Mother language day

Yesterday was International Mother Language Day. There is a program in Paris with all kinds of speeches and happenings. I read the program; I know some of the people involved and I know of some others. I wish I had been there. I wish they had recorded some of the speeches and presentations.

One thing I find funny is that UNESCO indicates that there are some 6000 languages, whereas SIL indicates the existence of over 7000 languages and ISO 639-6 will know at least 25,000 linguistic entities .. :) I wonder if they need to be emancipated so that UNESCO will feel a need to at least acknowledge their existence..

It would be cool if International Mother Language Day had hit the Internet; a podcast would have been nice ..


Sunday, February 18, 2007

A root canal treatment, anyone?

Yesterday, I had an appointment with a dentist specialised in the noble art of endodontics. I had a molar that had already had a root canal treatment before and it needed some more work. I was nervous. Actually I am terrified of dentists and consequently I am my own worst enemy, as a person suffers most from the suffering that he fears and that often never materialises.

So I went to my appointment wearing a Wikimania 2006 t-shirt. A person waiting in the reception reacted to this, and we got to talk about science, creating educational content in a collaborative way, licenses and copyright, OmegaWiki, the Nature article and the Wikiprofessional demo.

When it was my turn to sit in the "chair", the doctor had heard much of the conversation and asked several questions. There was no time gap between being operated on and talking about the things that are so dear to me. One big difference between what an endodontist does and what a dentist does is in the tools of the trade; an endodontist uses microscopes and a lot of digital imagery. It really gave me the feeling that I was being operated on. Anyway, after the operation I gave a demonstration to my endodontist.

The good news is that I did not have time to be nervous. Two more people may have a good look at OmegaWiki. It feels good even though my molar is still sensitive .. :)


Friday, February 16, 2007

A friend of mine is a scientist...

A friend of mine is a scientist. He is a terminologist. He has published a lot of papers and he is considered one of the best in the field by some other people I know.

I told him about Wikiprofessional and the kind of things we are doing in the bio-medical field, and what we could do for other fields as well when we have the terminology available to us. This was two days ago. Today I learned that much of his work was once available on diskettes. These diskettes are no longer there .....

I discussed this with another friend. She has some great OCR software. We know each other, so when the terminologist scans his papers, the translator can convert them from an analog into a digital format. The terminologist can define his terminology, his papers will be known and, just as relevant, we will have started supporting the terminology of terminology in OmegaWiki.

It is great to have friends ...


Friday, February 09, 2007

Cleaning a bit on the sk.wiktionary

Many of the Wiktionary projects have a moribund existence. At some stage people worked on them. They left the project, things changed and were never properly taken care of. Today I was in an extended chat and I was not the lead talker, so I had some time to do some stuff that does not take much attention.

I cleaned up a bit on the sk.wiktionary. At some stage all Wiktionaries started to support proper casing. For acronyms like ADHD this meant that they became aDHD. Somebody needed to go and fix these so that they properly became ADHD again. Today I fixed more than sixty acronyms that started with an a. There are many more of these that need fixing, and not only on the Slovak Wiktionary.
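The kind of fix described here lends itself to a small helper. What follows is only a sketch of how such candidates could be spotted; it is not the way I did it (I fixed them by hand), and the function name and the sample titles are made up for the illustration.

```python
def restore_acronym_case(title: str) -> str:
    """Upper-case the first letter when the rest of the title is an all-caps acronym.

    Hypothetical helper: "aDHD" becomes "ADHD", ordinary words are left alone.
    """
    head, tail = title[:1], title[1:]
    # Only touch titles with a lowercase first letter followed by
    # at least two letters that are all upper case.
    if head.islower() and len(tail) >= 2 and tail.isalpha() and tail.isupper():
        return head.upper() + tail
    return title


# Illustrative titles only, not an export from the Slovak Wiktionary.
titles = ["aDHD", "aIDS", "apple", "ADHD", "a"]
fixed = [restore_acronym_case(t) for t in titles]
print(fixed)
```

A rule like this only proposes candidates. Terms that are legitimately written with a lowercase first letter, such as mRNA, match it too, so somebody still has to check each move.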

I am really pleased that at OmegaWiki we only need to fix things once..


Thursday, February 08, 2007

The Shtooka recorder

I was told to have a look at the Shtooka recorder. It is a tool that makes it really easy to record the pronunciation of words. You provide it with a list of words, you provide the metadata, and it creates an .ogg file for you in the directory of your choice. It makes it really easy..

For me the challenge was to understand how to use it. There is no user manual, and mostly it is self-evident. Once I understood that I had to press the "record" button, it started to work.

This software does wonders when recording words that are provided in a Latin script. I tried to record a Persian word, کم حرف, and it failed. I checked it with something Russian and it failed as well.

For me it is a big improvement; I did some 15,000 words with Audacity.. It would have saved me a lot of time if I had had Shtooka.. The next thing to automate: the uploading to Commons...


Friday, February 02, 2007

Ten things you want to know about dictionaries

I met Erin McKean at Wikimania 2006. I loved her presentation then and I was really happy when I found her presentation in the Google Author series titled "Ten things you want to know about dictionaries". I loved it; I have seen it twice now. I may even have added value to it by adding the "dictionary" label to the presentation. Here I am going to react to it with my OmegaWiki hat on. So yes, please watch the presentation (almost an hour and well worth it) as I hope it will improve the understanding of my reaction.

There is no one dictionary; it is a tool
OmegaWiki as a resource is very much a child of the Internet; consequently it has the potential to let people configure its use. When people do not care for particular information, they should be able to make it invisible or turn it off. In a way it is like the cordless drill: by replacing the drill bit with a different attachment it becomes another tool. The same is true for pronunciation; we love to include IPA, but we can also record pronunciations, so that people do not need the understanding that is required for reading IPA.

Please read the "front matter"
People indeed assume that they understand tools like dictionaries, and wikis for that matter, well enough not to RTFM. For a consumer good like a read-only lexical resource, it is pretty safe when the introductions have not been read. As OmegaWiki allows people to add to and edit the information that is in there, this proves to be much more problematic.

Inclusion in the dictionary is because it is useful
As we do not have all the functionality that we need to be a credible lexical resource, this is very much a state we hope to get to. However, our aim to include all the lexicological, terminological and ontological data sounds pretty much like megalomania. Our standard excuse is that it is already less problematic because we only want to do this once, and this is where we came from. The data will only be useful when there are people who care about particular categories of data. I am totally with Erin that only data that is useful should be included. Getting rid of unnecessary cruft is hard work.

Horrible words make it in there too
I am a fan of swear words in dictionaries, particularly when there is some etymology to them. Most often people use swear words as expletives without much understanding of their actual original meaning. As English is a second language for me, it is relevant for me to understand why I would rather be a bigot than a racist or someone who discriminates.

The other part of horrible words are those words that are actually used and offend the aesthetic sensitivities. In several medical resources you find stuff like MALARIA and Malaria. UGLY. However as it is useful to these folks, it makes sense to include them anyway. As they are exactly the same as the preferred English expression of malaria, it does not hurt.

As for words like "irregardless": well, being a non-native speaker of the language, I just want to be able to find them.

You have to look at all definitions to find the REAL meaning
The way the New Oxford American Dictionary does this is exquisite. They use a core sense / sub sense approach. To me this seems an approach that is very much language specific. For OmegaWiki to have such an approach, it will need quite a lot of thinking on how to build this.

Approaching the understanding of an expression with core senses / sub senses could be one way of stretching the number of concepts that people can juggle with. For Operational Definitions there is currently this practical limit of some 7 different meanings.

Dictionaries have a sell by date
For OmegaWiki this is not an issue as it is a web-based resource. However, the same issue still applies; a Dutch book printed in 2003 will use the orthography of 1995 and not the 2005 orthography. People will still want to be able to understand what its words mean; annotating them as not being the official spelling since 2005 is relevant.

When words are tagged as used in earnest up to a certain date, I could even include 15th century German words and not have people be confused.. filtering would also help here ..
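To make the idea of tagging and filtering a bit more concrete, here is a minimal sketch. It is not how OmegaWiki stores its data; the `Spelling` class, its fields and the placeholder entries are assumptions I made up for the illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Spelling:
    """Hypothetical record for one spelling of an expression."""
    text: str
    language: str
    used_from: Optional[int] = None   # first year of use, None when unknown
    used_until: Optional[int] = None  # last year of use, None when still current


def in_use(spelling: Spelling, year: int) -> bool:
    """True when the given year falls inside the tagged period of use."""
    started = spelling.used_from is None or spelling.used_from <= year
    not_retired = spelling.used_until is None or year <= spelling.used_until
    return started and not_retired


def filter_by_year(spellings: List[Spelling], year: int) -> List[Spelling]:
    """Keep only the spellings that were in use in the given year."""
    return [s for s in spellings if in_use(s, year)]


# Placeholder entries, not real dictionary data.
entries = [
    Spelling("old-spelling", "nld", used_until=2005),
    Spelling("new-spelling", "nld", used_from=2005),
    Spelling("archaic-word", "deu", used_from=1400, used_until=1500),
]

print([s.text for s in filter_by_year(entries, 2007)])
```

With tags like these nothing has to be thrown away; a reader of a 2003 book filters for 2003, and somebody who does not care for 15th century words never gets to see them.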

Facts are good
Referring to actual usage seems obvious. What we intend to do is link OmegaWiki's content to Wikipedia; this will be the most obvious resource to start off with, and we aim to have a Wikipedia in all languages. Its language is modern usage, so given our Wiki credentials it is the obvious corpus. I totally agree when it is said that we are limited in this way; it is however a great start. When we gain a community with people and organisations that introduce us to other resources, it will be great.

At this moment OmegaWiki is still very much like a "stamp collection"; it is a nice collection, and at some stage it will even become useful.

What we do is like an iceberg
The work done on the New Oxford American Dictionary may be in preparation for the moment when other information, like thesaurus information, will be included. For OmegaWiki, including information from thesauri is what we did from the start by including the GEMET data. OmegaWiki at this moment is very much "you get what you see"; there is little of an iceberg yet. In a way this limits the usefulness of our data because there is often too much to take in.

Technically etymology is one of the hardest nuts to crack. When a word has its root in Latin, it often came to the English language through a French or Spanish connection. I wonder how the NOAD does this, indeed I do not have a copy so I have not read the "front matter" either :) .

At this moment there is not much of a problem with neologisms yet. Our community is still small and sane.. sort of (you must be weird to involve yourself in a project like this). My current thinking is that this problem can be solved using annotation. When a word is tagged as "Neologism; this word is not used except by the author" it will be pretty devastating to the prestige of the word and/or the author on OmegaWiki.

The bonus: Using Google and other resources
Erin explains that the resources for building a resource like the NOAD are hardly as much as she would want. Given that this is true for a successful resource for the American English market, consider what this means for languages like Kituba, Stellingwerfs or Seeltersk. Consider what it means when you want to use a translation dictionary for such languages.. These resources become a reality when there is the necessary cooperation; this is what OmegaWiki hopes to achieve.

There is a need for people to work on their terminology and indeed we would like to include the terminology of falconry, tennis, and ships. It will happen when it does.

*OmegaWiki does want to include out of copyright content as well. For us it is a start. Collaborating with for instance WordNet would, however, be more important and relevant.
*Proper names; yes we want them; we have George W. Bush already for quite some time.
*We would retire words by tagging them with the date when they went out of use
*Circular definitions are even more problematic in OmegaWiki, this is a great example why.
*Context .. yes, I wish this was a problem that we have to deal with.. we need more functionality
*Print versions .. this is at this stage no issue. Nobody has indicated that they want to work on this.

I really enjoyed Erin's presentation. It helps me to get my mind around issues that have not popped up for OmegaWiki yet. As many of them will quite likely make their appearance, it is best to be forewarned; it allows us to be forearmed.


Thursday, February 01, 2007

Google defuses the Googlebomb .. GREAT

When you are of the opinion that George W. Bush or "Shrubya" is a "miserable failure", you could find confirmation for this by googling for it, and this truth would be on top in the Google, Yahoo and Microsoft search engines. Effectively it is a prank. It is not what you really want to find, and Google announced it has worked on an algorithm that will prevent a Googlebomb in future. This effectively makes the word a misnomer; it is now more correct to call it a Yahoobomb or even better, a Microsoftbomb.

In an article in the Guardian, the fear is expressed that by manipulating rankings in this way, Google will exert its power and be able to manipulate what is seen as true. At issue is that the reason why Google defused this bomb was because people believed it to be true because Google said so... (there is no such thing as common sense as common sense ain't common)

In many content projects, bots create links to websites they hope to make more relevant in the eyes of the search engines. This type of vandalism resulted in a backlash where Wikipedia now tells search engines to disregard any and all links, thereby invalidating the basis on which search engines operate. If Google were to have algorithms that filter and punish this type of SPAM, it would lead to a saner environment.

Google is open about its intentions. Microsoft is open about its intentions as well; as long as it discriminates against its competitors like Wikipedia I will be happy to use Google knowing that Google is kept honest by having competitors.