Monday, August 31, 2015

#Wikidata - Kian and #quality

Last time Kian was very much a promise. This time, after the announcement by Amir, Kian is so much more. Kian is a tool that can be trained to identify items for what they are. Training means that parameters are provided whereby the software can act on its own: based on likelihood, it will either make the identification or list the item as a "maybe".
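As I understand it, the "identify or maybe" decision comes down to comparing a likelihood against cut-off values. What follows is only a minimal sketch of that idea in Python, not Kian's actual code; the function name `triage` and the thresholds `accept_at` and `reject_at` are my own assumptions for illustration.

```python
def triage(probability, accept_at=0.9, reject_at=0.1):
    """Sort one prediction into 'identified', 'maybe' or 'rejected'.

    The thresholds are hypothetical values chosen for this example;
    they are not the values Kian uses.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= accept_at:
        return "identified"   # confident enough to act without a human
    if probability <= reject_at:
        return "rejected"     # confident the item is not of this kind
    return "maybe"            # in doubt: report it for a person to decide


if __name__ == "__main__":
    for p in (0.97, 0.55, 0.03):
        print(p, triage(p))
```

Everything that ends up as "maybe" is what would be offered to people for a decision.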

Obviously once it is known what an item, an article is about, so much more can be deduced. That is something Kian will do as well.

The thing that pleases me most is that Kian makes use of autolists for its learning; it means that Kian became part of the existing ecosystem of tools. Even though hard mathematics is the background of Kian, it is relatively easy to train because prior knowledge is of value.

In the announcement mail Amir asks for collaboration. One area where this will be particularly relevant is where people are asked to decide where Kian has its doubts. It currently uses reports in the Wiki but it would be awesome if such questions could be asked in the same environment where Magnus asks for collaboration.

Yes, Kian makes use of hard scientific knowledge but as it is structured in this way, it makes a real difference. It is possible to learn to train Kian and when ambiguous results can be served to people for a result, Kian will be most glorious. Its bus factor will not be Amir.

Sunday, August 30, 2015

#Wikidata - the #firing #line

Wikidata is not good or bad, it is indifferent. It does not care what subjects it includes. For me, firing ranges and the use of guns are not something I expect a reasonable person to be involved in. Only people with guns kill with guns. I do not understand people who collect guns or shoot them. I have no respect for them. People get killed as there are too many people with guns.

The suggestion was raised to have an app determining the closest shooting range based on Wikidata data. I objected and was told not to express my opinion. My freedom of speech was shot because apparently it is not allowed to express such an opinion. An opinion that makes the obvious link between guns and the value of life.

People may tell me to shut up, they may be a Wikidata admin, even an oversighter, but by telling people to shut up they effectively kill not only freedom of speech.

#Wikidata - Mian Ghulam Jilani, who is he?

According to a Wikidata description Mr Jilani is a "British Indian general". The highest rank he achieved in the British Indian army was actually lieutenant; he did become a major general, but that was in the Pakistani army.

Mr Jilani was also a Pakistani politician. He was a prisoner of war during the second world war and tortured by the Japanese and a prisoner of conscience according to Amnesty International. He escaped prison and died in the USA.

Mr Jilani was never an Indian citizen; in fact he fought against India in several battles and was decorated several times. He served as a diplomat for Pakistan in the USA, became a politician, became an outspoken critic of the Pakistani prime minister Zulfikar Ali Bhutto, was imprisoned, escaped to the USA and died as a refugee in Fairfax, Virginia.

In conclusion, Mr Jilani was never British, Indian nor a general for either country. The automated description has him as: Indian-pakistani politician, military personnel, and diplomat (1913–2004); Legion of Merit, Imtiazi Sanad, and Sitara-e-Quaid-i-Azam ♂. This is not perfect but much better.

Friday, August 28, 2015

#Wikidata - Joseph Reagle not an #author?

The English Wikipedia article says: "Joseph Michael Reagle Jr. is an American academic and author focused on technology and Wikipedia". It seems obvious that the occupation "author" fits Mr Reagle.

Not so, I am told; the word "auteur" is a generic term in French, so "author" in this sense is at best an anglicism. This gets us in a tricky position because it is suggested that if this appears in infoboxes which automatically import stuff from Wikidata, it will create an absolute mess in the French Wikipedia, with everybody being credited as an "auteur", which does not make sense at all.

When you analyse "author" in Wikidata, it is a subclass of "creator". Creator seems to me to be what the French understand for "auteur". Consequently, the labels used in French do not match what is meant by author in English.

Arguably, when items are labelled in a way where the meaning in one language is not the same as in other languages, this has major consequences for the integrity of Wikidata.

NB Mr Reagle wrote a few books, that makes him more than an "essayist".

Thursday, August 27, 2015

#Wikidata - Heinz R. Pagels Human Rights of Scientists Award

Awards are often the subject of this blog. Every award has its own merit and every award connects many people as a result. The Heinz R. Pagels Human Rights of Scientists Award is an award hidden in an article on the Committee on Human Rights of Scientists. The story of Mr Pagels is interesting but so are the people who received the award.

Some of them have been prisoners of conscience, all of them have relevance. Most of them deserve more attention, be it by improving their articles, by adding statements in Wikidata, or by reading about them. For people to receive an award like this, they have to have been in harm's way. It is important to know how easy it is to get into problems and also why some such problems are worth it.

By exposing awards like this, the people connected in this way get more attention. It is one way of making sure that their effort is valued.

Saturday, August 22, 2015

#Wikidata - recent #changes

Databases change all the time. The expectation is that these changes make things different, better. This is true for all the online resources Wikidata connects to.

There are several good reasons to refer to an external database:
  • to indicate that the external source is about the same subject
  • to acknowledge the external source served as the source for a statement
  • to indicate whether shared values match
As databases change all the time, there is little value in indicating that a database shared the same value at a given date and time. Consider for instance the item for Mr Sundar Pichai; apparently he went twice to the Indian Institute of Technology Kharagpur and to Stanford University. When two sources state that he went there, one source may know what academic degree was achieved at the end of the study where the other does not. When you only verify whether the information in the two sources matches, both sources match. One source may not care about what degree was achieved or when, and the other does. When you quote them as the source for the statement, you expect them to fully endorse the current content. Mr Pichai went to each educational institution once. Having two statements for the same thing completely defeats the objective of Wikidata; the objective of Wikidata being usable.

Having references for statements makes sense when statements are exactly the same. When they are not, arguably there is little point in doing anything but indicate that all values for a source match. This can be done by showing the source in green. It is a lot more reassuring to see all sources in green than a lot of references that give no assurance that the values are indeed the same.

Friday, August 14, 2015

#Wikidata - Mr Sundar Pichai

I heard of a dispute about the facts of Mr Pichai's study by Wikipedians. That was yesterday so I hoped that some of that discussion would transpire at the item for Mr Pichai.

Mr Pichai's item is indeed in need of serious attention. The stated place of birth should be more specific, and his education has the same school entered twice for no obvious reason. He was born in India but Wikidata has him as an "Indian American" for whatever reason.

The information when you Google Mr Pichai is much better. If Google and Wikidata were to compare each other's records, the Wikidata item would certainly be flagged as problematic.

As a lot of Wikipedians have invested serious attention in Mr Pichai, comparing the Wikipedia article will expose the weakness of the Wikidata entry. I am not particularly interested in Mr Pichai; I leave it for someone else to sort this out.

Thursday, August 13, 2015

#Wikidata - #Quality, #probability and #set theory

The problem with any source is that it has errors. It cannot be helped. There is always a certain percentage that is wrong. When you take all the items of Wikidata that have statements, the type of process that added those statements provides an indication of the percentage of errors that were included.

I made thousands of mistakes. In a way I am entitled to have made those mistakes because I made over 2 million edits. Amir made even more edits with his bot. Because of the process involved, the percentage of his errors will be lower. When you only look at Wikidata and its items, you can be confident that these errors exist, you can be confident about what percentage is likely, but there is no way to make an educated guess about what is right or what is wrong. The only way to improve the data is by sourcing one statement at a time. It is a process that will introduce its own errors. That is something we know from experience elsewhere.

To add value to Wikidata, we need both quality and quantity. Let us consider the use of external sources that are known to have been created with the best of intentions. Consider one type of information, the place of birth for example. It is highly likely that Wikidata and that external source have many items in common. Once they are defined as being about the same person, we can use the logic of set theory. We can establish the number of records where both have a value for the place of birth. We can determine the number of matching items, we can determine the number where one has a value and the other does not, and we can determine the number of items where there is a mismatch.

It is probable that most errors will be found where Wikidata and the source do not match. It is certain that even where the two match there will still be factual errors as both can be wrong.
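The comparison described above can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: both sources are represented as a mapping from an item identifier to a place of birth (or `None` when the value is missing), the identifiers have already been matched between the two sources, and the function name `compare_sources` and the sample data are made up for the example.

```python
def compare_sources(wikidata, external):
    """Partition the items both sources share by how their values compare.

    Each argument maps an item identifier to a value, or to None when
    that source has no value for the item. Items known to only one of
    the two sources are left out, as they cannot be compared.
    """
    groups = {
        "match": set(),          # both have a value and the values agree
        "mismatch": set(),       # both have a value and the values differ
        "only_wikidata": set(),  # Wikidata has a value, the source does not
        "only_external": set(),  # the source has a value, Wikidata does not
        "both_missing": set(),   # neither has a value
    }
    for item in wikidata.keys() & external.keys():  # set intersection
        ours, theirs = wikidata[item], external[item]
        if ours is None and theirs is None:
            groups["both_missing"].add(item)
        elif theirs is None:
            groups["only_wikidata"].add(item)
        elif ours is None:
            groups["only_external"].add(item)
        elif ours == theirs:
            groups["match"].add(item)
        else:
            groups["mismatch"].add(item)
    return groups


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    wikidata = {"a": "Paris", "b": "Rome", "c": None, "d": "Oslo", "e": "Lima"}
    external = {"a": "Paris", "b": "Milan", "c": "Cairo", "d": None, "f": "Kyiv"}
    for name, items in compare_sources(wikidata, external).items():
        print(name, len(items), sorted(items))
```

The "mismatch" group is where people are best asked to look first. The "only_external" group is where new data could be accepted, and the "match" group is where confidence is highest without being a guarantee.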

Quality and confidence have much in common. Wikipedia has quality but we know it has issues. Wikidata has quality but we know it has issues. The easiest and most economical way to improve the quality of Wikidata is by comparing sources, many sources and concentrating on the differences. It is easy and obvious and when we ask someone to add a source to a statement we are confident that the result matters. It matters for both Wikidata and the external source.

This approach is not available to Wikipedia. It cannot easily compare with other sources and therefore there is no option but to source everything. Given that many statements find their origin in Wikipedia, new insights in Wikidata may prove a point and a need to adapt articles.

Consequently, applying set theory and probability will enhance the quality of Wikidata. It will help drive fact checking in Wikipedia and it is therefore the best approach to improve quality. Accepting new data from external sources and iterating this process of comparison will ensure that Wikidata will become much more reliable. Reliable because you can expect that the data is there, and reliable because you know that quality has been a priority in what we do.