Monday, August 31, 2015

#Wikidata - Kian and #quality

Last time Kian was very much a promise. This time, after the announcement by Amir, Kian is so much more. Kian is a tool that can be trained to identify items for what they are. Training means that parameters are provided whereby the software can act on its own and, based on likelihood, either make the identification or list the item as a "maybe".
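To make the idea of a "maybe" concrete, here is a minimal sketch of how a likelihood could be turned into a decision. It is not Kian's code; the function name and both thresholds are assumptions of mine, chosen only for illustration.

```python
def triage(likelihood, accept_at=0.9, reject_at=0.1):
    """Turn a likelihood between 0 and 1 into 'yes', 'no' or 'maybe'.

    The thresholds are illustrative assumptions, not values Kian uses.
    """
    if not 0.0 <= likelihood <= 1.0:
        raise ValueError("likelihood must be between 0 and 1")
    if likelihood >= accept_at:
        return "yes"    # confident enough for the software to act on its own
    if likelihood <= reject_at:
        return "no"     # confident that the item is something else
    return "maybe"      # in doubt: list the item for a person to decide
```

Everything that lands in the "maybe" group is what people would be asked to look at.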

Obviously once it is known what an item, an article is about, so much more can be deduced. That is something Kian will do as well.

The thing that pleases me most is that Kian makes use of autolists for its learning; it means that Kian has become part of the existing ecosystem of tools. Even though hard mathematics is the background of Kian, it is relatively easy to train because prior knowledge is of value.

In the announcement mail Amir asks for collaboration. One area where this will be particularly relevant is where people are asked to decide where Kian has its doubts. It currently uses reports in the Wiki but it would be awesome if such questions could be asked in the same environment where Magnus asks for collaboration.

Yes, Kian makes use of hard scientific knowledge but as it is structured in this way, it makes a real difference. It is possible to learn to train Kian and when ambiguous results can be served to people for a decision, Kian will be most glorious. Its bus factor will not be Amir.

Sunday, August 30, 2015

#Wikidata - the #firing #line

Wikidata is not good or bad, it is indifferent. It does not care what subjects it includes. For me firing ranges, the use of guns is not something I expect a reasonable person to be involved in. Only people with guns kill with guns. I do not understand people who collect guns or shoot them. I have no respect for them. People get killed as there are too many people with guns.

The suggestion was raised to have an app determining the closest shooting range based on Wikidata data. I objected and was told not to express my opinion. My freedom of speech was shot because apparently it is not allowed to express such an opinion. An opinion that makes the obvious link between guns and the value of life.

People may tell me to shut up, they may be a Wikidata admin or even an oversighter, but by telling people to shut up they effectively kill not only freedom of speech.

#Wikidata - Mian Ghulam Jilani, who is he?

According to a Wikidata description Mr Jilani is a "British Indian general". The highest rank he achieved in the British Indian army was actually lieutenant; he did become a major general but that was in the Pakistani army.

Mr Jilani was also a Pakistani politician. He was a prisoner of war during the second world war and tortured by the Japanese and a prisoner of conscience according to Amnesty International. He escaped prison and died in the USA.

Mr Jilani was never an Indian citizen; in fact he fought against India in several battles and was decorated several times. He served as a diplomat for Pakistan in the USA, became a politician, became an outspoken critic of the Pakistani prime minister Zulfikar Ali Bhutto, was imprisoned, escaped to the USA and died as a refugee in Fairfax, Virginia.

In conclusion, Mr Jilani was never British or Indian, nor a general for either country. The automated description has him as: Indian-pakistani politician, military personnel, and diplomat (1913–2004); Legion of Merit, Imtiazi Sanad, and Sitara-e-Quaid-i-Azam ♂. This is not perfect but much better.

Friday, August 28, 2015

#Wikidata - Joseph Reagle not an #author?

The English Wikipedia article says: "Joseph Michael Reagle Jr. is an American academic and author focused on technology and Wikipedia". It seems obvious that the occupation "author" fits Mr Reagle.

Not so, I am told: the word "auteur" is a generic term in French, so it is at best an anglicism. This gets us in a tricky position because it is suggested that if this appears in infoboxes which automatically import stuff from Wikidata, it will create an absolute mess in the French Wikipedia, with everybody being credited as an "auteur", which does not make sense at all.

When you analyse "author" in Wikidata, it is a subclass of "creator". Creator seems to me to be what the French understand for "auteur". Consequently, the labels used in French do not match what is meant by author in English.

Arguably, when items are labelled in a way where the meaning in one language is not the same as in other languages, this has major consequences for the integrity of Wikidata.

NB Mr Reagle wrote a few books, that makes him more than an "essayist".

Thursday, August 27, 2015

#Wikidata - Heinz R. Pagels Human Rights of Scientists Award

Awards are often the subject of this blog. Every award has its own merit and every award connects many people as a result. The Heinz R. Pagels Human Rights of Scientists Award is an award hidden in an article on the Committee on Human Rights of Scientists. The story of Mr Pagels is interesting but so are the people who received the award.

Some of them have been prisoners of conscience, all of them have relevance. Most of them deserve more attention, be it in improving their articles, by adding statements in Wikidata, or reading about them. For people to receive an award like this, they have to have been in harm's way. It is important to know how easy it is to get into problems and also why some such problems are worth it.

By exposing awards like this, the people connected in this way get more attention. It is one way of making sure that their effort is valued.

Saturday, August 22, 2015

#Wikidata - recent #changes

Databases change all the time. The expectation is that these changes make things different, better. This is true for all the online resources Wikidata connects to.

There are several good reasons to refer to an external database:
  • to indicate that the external source is about the same subject
  • to acknowledge the external source served as the source for a statement
  • to indicate whether shared values match
As databases change all the time, there is little value in indicating that a database shared the same value at a given date and time. Consider for instance the item for Mr Sundar Pichai; apparently he went twice to the Indian Institute of Technology Kharagpur and to Stanford University. When two sources state that he went there, one source may know what academic degree was achieved at the end of the study where the other does not. When you only verify whether the information in the two sources matches, both sources match. One source may not care about what degree was achieved or when, and the other does. When you quote them as the source for the statement, you expect them to fully endorse the current content. Mr Pichai went to either educational institution once. Having two statements for the same thing completely defeats the objective of Wikidata; the objective of Wikidata being useable.

Having references for statements makes sense when statements are exactly the same. When they are not, arguably there is little point except to indicate that all values for a source match. This can be done by showing the source in green. It is a lot more reassuring to see all sources in green than a lot of references that give no assurance that the values are indeed the same.

Friday, August 14, 2015

#Wikidata - Mr Sundar Pichai

I heard of a dispute about the facts of Mr Pichai's study by Wikipedians. That was yesterday so I hoped that some of that discussion would transpire at the item for Mr Pichai.

Mr Pichai's item is indeed in need of serious attention. The stated place of birth should be more specific and his education has the same school entered twice for no obvious reason. He was born in India but Wikidata has him as an "Indian American" for whatever reason.

The information when you Google Mr Pichai is much better. If Google and Wikidata were to compare each other's records, the Wikidata item would certainly be flagged as problematic.

As a lot of Wikipedians have invested serious attention to Mr Pichai, comparing the Wikipedia article will expose the weakness of the Wikidata entry. I am not particularly interested in Mr Pichai, I leave it for someone else to sort this out.

Thursday, August 13, 2015

#Wikidata - #Quality, #probability and #set theory

The problem with any source is that it has errors. It cannot be helped. There is always a certain percentage that is wrong. When you take all the items of Wikidata that have statements, the type of process that added those statements provides an indication of the percentage of errors that were included.

I made thousands of mistakes. In a way I am entitled to have made those mistakes because I made over 2 million edits. Amir made even more edits with his bot. Because of the process involved the percentage of his errors will be lower. When you only look at Wikidata and its items, you can be confident that these errors exist, you can be confident about what percentage is likely but there is no way to make an educated guess what is right or what is wrong. The only way to improve the data is by sourcing one statement at a time. It is a process that will introduce its own errors. That is something we know from experience elsewhere.

To add value to Wikidata, we need both quality and quantity. Let us consider the use of external sources that are known to have been created with the best of intentions. Consider one type of information, the place of birth for example. It is highly likely that Wikidata and that external source have many items in common. Once they are defined as being about the same person, we can use the logic of set theory. We can establish the number of records where both have a value for the place of birth. We can determine the number of matching items, we can determine the number where one has a value and the other does not, and we can determine the number of items where there is a mismatch.

It is probable that most errors will be found where Wikidata and the source do not match. It is certain that even where the two match there will still be factual errors as both can be wrong.
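The comparison described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name, the bucket names and the records are made up, and each source is simplified to a mapping from an item to one place of birth.

```python
def compare_values(wikidata, external):
    """Sort the items two sources have in common into comparison buckets.

    Both arguments map an item id to a place of birth, or to None when
    the source has no value for that item.
    """
    buckets = {
        "match": set(),
        "mismatch": set(),
        "only_wikidata": set(),
        "only_external": set(),
        "neither": set(),
    }
    # Only items already linked as being about the same person are compared.
    for item in wikidata.keys() & external.keys():
        ours, theirs = wikidata[item], external[item]
        if ours is None and theirs is None:
            buckets["neither"].add(item)
        elif theirs is None:
            buckets["only_wikidata"].add(item)
        elif ours is None:
            buckets["only_external"].add(item)
        elif ours == theirs:
            buckets["match"].add(item)
        else:
            buckets["mismatch"].add(item)
    return buckets


# Made-up records; the ids and places are placeholders, not real data.
wikidata = {"A": "Town 1", "B": "Town 2", "C": None, "D": "Town 4", "E": "Town 5"}
external = {"A": "Town 1", "B": "Town 9", "C": "Town 3", "D": None, "F": "Town 6"}

report = compare_values(wikidata, external)
for name, items in report.items():
    print(name, sorted(items))
```

The "mismatch" bucket is where people are most likely to find errors; the "match" bucket only says that the two sources agree, not that they are right.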

Quality and confidence have much in common. Wikipedia has quality but we know it has issues. Wikidata has quality but we know it has issues. The easiest and most economical way to improve the quality of Wikidata is by comparing sources, many sources and concentrating on the differences. It is easy and obvious and when we ask someone to add a source to a statement we are confident that the result matters. It matters for both Wikidata and the external source.

This approach is not available to Wikipedia. It cannot easily compare with other sources and therefore there is no option but to source everything. Given that many statements find their origin in Wikipedia, new insights in Wikidata may prove a point and a need to adapt articles.

Consequently, applying set theory and probability will enhance the quality of Wikidata. It will help drive fact checking in Wikipedia and it is therefore the best approach to improve quality. Accepting new data from external sources and iterating this process of comparison will ensure that Wikidata will become much more reliable. Reliable because you can expect that the data is there and, reliable because you know that quality has been a priority in what we do.

Tuesday, August 11, 2015

#Wikidata - #Pen #awards

The many chapters of Pen International confer many awards. Mr Mazin Darwish now has his 2014 award and would it not be fun to have a query that shows all the people who ever were awarded one of the many, many "Pen awards"?

First, all the chapters have to be part of Pen International, then all the awards have to be conferred by a Pen chapter and finally all the people have to be recognised as honored with one of the Pen awards.

This is something that is of interest, it is awarding and, why not.

It is much better than following the "instructions" on solving the "garbage" that is the honorary university degrees and doctorates. I am told to find sources for people who have an honorary doctorate or whatever and add sources that provide credence to such a statement. It may be a solution but it is a solution that does not scale.

To be honest, I cannot be bothered. When Wikidata in its infinite wisdom does not have a way to deal with contaminated data, it has a bigger problem, it makes me doubt all existing statements. All that is needed to cope with such issues is a way to flag data for being "suspicious".

With known "no good" data, you invite people to participate in providing a solution. The proposed solution however is not my cup of tea; it is not what I do. I cannot be bothered.

Monday, August 10, 2015

#Wikidata - #Free Mazin Darwish

It is satisfying to learn that Mr Darwish has been freed. He had been jailed since February 2012 and the BBC has it that he was freed. It mentions that he is the director of the Syrian Centre for Media and Freedom of Expression (SCM) and received many international awards.

Wikidata already knew about a few of those awards, finding more awards was a matter of reading the three Wikipedia articles. It is just a matter of doing the research. One of the awards Mr Darwish received was the PEN Pinter Prize in 2014. However, the Wikipedia article calls it the "Pinter International Writer of Courage Award". This award is not listed on the "List of PEN awards".

There is a reason to celebrate. Mr Darwish is free. It is satisfying to see that a lot of information is already there. Working on the data that exists on Mr Darwish connects him with more people sharing similar connections.

Every day there is someone who is worthy of attention. I can do this, you can do this. It is how Wikidata gains relevance. Relevance because it is information available for use in any language including Arabic.

#Wikidata - #corroboration and #sourcing

The problem with sources available on statements in Wikidata is that even when they are by definition the source of a statement, it is not what we understand a source to be. When I use tools to add statements to Wikidata based on lists and categories from a Wikipedia, that Wikipedia is my source. My tools do not help me add this fact so I do not add Wikipedia as a source. Other tools do and consequently there are some 20 million statements sourced in this way.

When no source is available, a statement can be corroborated by finding identical information in an external source. The difference is important. The external source is no source proving the veracity or the origin of the fact, it merely indicates that it does not differ. Corroboration is important, it does improve the likelihood that a statement is correct. It adds a notion of quality.

Wikidata items often refer to many external sources. Only when a fact new to Wikidata is added as a statement from one of these sources, the external source IS the source.

Some external sources provide information with the authority of a respected organisation. When the RKD Netherlands Institute for Arts History indicates that Nora van de Vlier received the Willink van Collen Prize in 1954, I would consider it a source and happily accept it as a source for a new statement in Wikidata. When such information is from DBpedia or Freebase, I would appreciate more references at a later date.

When it is not the original source the only thing I care to know is that there is no discrepancy between the data provided and the data available at the external sources. When external data is pushed into Wikidata as a reference, it could easily be considered a fraud. It is certainly clutter.

Sunday, August 09, 2015

#Quality for #Wikidata and for external #sources

There are always arguments to find why not to accept Wikidata as a quality resource. Many Wikipedians ignore Wikidata because they do not trust the quality of its data. They require sources because that is why they trust a fact in Wikipedia to be good.

The practical problem is that Wikidata has some 15 million items and most have one or multiple statements. Each statement should be sourced given the notion of sources as a requirement. Given the speed of new information in Wikidata, sourcing for all statements is not going to happen anytime soon and consequently an alternative that demonstrates quality is needed.

One best practice of Wikidata is publishing external sources for our items. It already adds a feeling of quality because it allows a person to see what those external sources have to say. It takes some software and a workflow to leverage this sense of quality and solidify it as a measurable quality improvement.

Obviously both Wikidata and external sources have their issues. Where they all agree, there is the least need to work on improving quality. Where Wikidata has no data, it is obvious to add data and use the external source as a reference. It becomes interesting when there is a difference.

The first thing to do is flag a differing statement as suspicious. It signals to software and people that there is a need for attention. People can research the issue and come to the conclusion that
  • Wikidata is correct
  • the external source is correct
  • both are incorrect
In all these circumstances, the flag for the statement will be changed, the statement may be changed and in every case a source is to be provided. This is when true sources make the biggest difference because the flag does not go away and with quality sources where there is this obvious need, the quality of Wikidata is easier to appreciate.
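A workflow like this could look roughly as follows. It is a sketch under my own assumptions: Wikidata has no such flag, and the class names, flag names and placeholder values are invented for the illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Flag(Enum):
    UNCHECKED = "unchecked"
    SUSPICIOUS = "suspicious"
    WIKIDATA_CORRECT = "Wikidata is correct"
    EXTERNAL_CORRECT = "the external source is correct"
    BOTH_INCORRECT = "both are incorrect"


VERDICTS = {Flag.WIKIDATA_CORRECT, Flag.EXTERNAL_CORRECT, Flag.BOTH_INCORRECT}


@dataclass
class Statement:
    value: str
    flag: Flag = Flag.UNCHECKED
    source: Optional[str] = None


def flag_if_different(statement, external_value):
    """Mark the statement as suspicious when the external source disagrees."""
    if external_value is not None and statement.value != external_value:
        statement.flag = Flag.SUSPICIOUS
    return statement


def resolve(statement, verdict, source, corrected_value=None):
    """Record the outcome of the research; a source is always required."""
    if statement.flag is not Flag.SUSPICIOUS:
        raise ValueError("only suspicious statements need a verdict")
    if verdict not in VERDICTS:
        raise ValueError("verdict must be one of the three conclusions")
    if not source:
        raise ValueError("a source is required in every case")
    if verdict is not Flag.WIKIDATA_CORRECT:
        if corrected_value is None:
            raise ValueError("a corrected value is needed when Wikidata was wrong")
        statement.value = corrected_value
    statement.flag = verdict   # the flag changes but does not go away
    statement.source = source
    return statement


# Placeholder values, not real data.
birthplace = flag_if_different(Statement("Place A"), "Place B")
resolve(birthplace, Flag.EXTERNAL_CORRECT, "a reference found during research", "Place B")
print(birthplace)
```

Because the verdict stays on the statement together with its source, both software and people can see where the attention went.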

Saturday, August 08, 2015

#Wikidata - #Google on Morton Mintz

Contrary to what Google has to say about him, Mr Mintz was not born in 1946. That is impossible when the Wikipedia article is correct in stating that "in his early years (1946–1958) reported for two St. Louis, Missouri newspapers, the Star-Times and the Globe-Democrat".

This is exactly one of those exceptions where sources matter. Typically Google will have it right and in my opinion only when sources differ does it become relevant to provide a source for the provided information.

Mr Mintz was born on 26 January 1922 and he wrote this himself in an essay. I expect that Google will pick up on this information. It is interesting to learn from where they pick this up. Will it be from Wikidata, from the source I provided above or from this blogpost?

I will be most happy when it is Wikidata. It would beg the question if Google would be interested in reporting on differences it is aware of. For me such flags will improve our quality rapidly. Concentrating on differences will have a huge impact and not only at Wikidata.

#Wikidata - Frances Oldham Kelsey

When someone like Mrs Kelsey dies, it is wonderful to read about her in the Wikipedia article. There are always some factoids that can be added to the Wikidata item.

What struck me most is not the story of how Mrs Kelsey protected the USA from the effects of thalidomide in babies but the importance that an article in the Washington Post had. Thanks to an article by Morton Mintz there was an outcry that resulted in the passing of the Kefauver Harris Amendment. It required drug manufacturers to provide proof of the effectiveness and safety of their drugs before approval.

Another fun fact is that the FDA named an award after Mrs Kelsey; she was the first to be awarded the FDA Kelsey Award. There is not much to find about this award because there is a controversy about the award. People protested that Mrs Kelsey's good name was abused by political appointees to the FDA who wanted to diminish its powers.

For Mrs Kelsey only a few awards were added in Wikidata. There are more awards but as always, there is more to do.

Thursday, August 06, 2015

#Wikidata - #garbage in, garbage out

Wikidata has a problem. It does not have a process for indicating that data is suspect. When I wrote about Mrs Clinton, I indicated that two properties had the same label and that their content was mixed up. It was suggested that the two properties should be merged; in answer it was then indicated that the two properties are distinct and should be kept. The problem we are left with is that the content of both properties cannot be relied on.

As there is no way to indicate that single statements are suspect or correct, there is a mess. There is no way to fix such a mess.

When other sources include this data, it is possible to use them to have statements use the correct property based on what those sources say. Include a reference and one statement is fixed. The problem is that Wikidata distrusts other sources. It shows in the reluctance to embrace data from other sources. It shows in the reluctance in accepting that a lot of data is flawed in its core; particularly when it is flawed in its definition.

Tuesday, August 04, 2015

#Wikidata and its #references

When a Wikidata statement is referenced, it means to me that the statement as stated can be verified in the reference. Rather black and white; you add a reference to prove that THAT statement is correct.

Apparently not so at Wikidata. When you change a statement and make it more precise, the reference is "still good". So much so that an admin threatens to block because this is so "obvious".

In my opinion, references should always back up the fact as stated. When they did not, they are wrong in the first place. When a referenced fact is found wanting, the old known good is no longer good and consequently there is no point to keep the reference with the improved facts.

When Wikidata is in the business of keeping values that show that at some time a booboo was made, then I am pretty sure that Wikidata will crumble under the weight of known errors. It is hard enough to distinguish between fact and fiction; it does not need the addition of right and wrong.

Monday, August 03, 2015

#Wikidata - #featured item; László Krasznahorkai

Mr Krasznahorkai will be this week's Wikidata "featured item". It shows off how much information may be available for the subject at hand. Indeed, a lot of great work went into including and linking all the work Mr Krasznahorkai produced over the years. Adding information like that is a labour of love.

Mr Krasznahorkai is a celebrated writer. It shows in the many awards that were bestowed on him. All in all there are 21 listed, among them the Man Booker International Prize.

For many of the awards like the Vilenica Prize, it is just a matter of harvesting the data and adding it to the right items; easy enough. For others like the "Hungarian Heritage-Award", it is a bit more involved because it does not exist as a Wikipedia article in English. The article on Mr Krasznahorkai in Hungarian however looks promising. Many more awards are available there.

By harvesting awards for Mr Krasznahorkai, the information on the featured item becomes even better. The best thing is that in the process all the other items for people who are known to be awarded will have better information as well. Obviously, data from English Wikipedia is overexposed as more people do know that language and it is rather rich.

Sunday, August 02, 2015

#Wikipedia - Maintaining the #Akademy award

Lydia challenged Wikipedians to update the information about Akademy. The conferences in 2014 and 2015 have come and gone by now. The most interesting part of the English article is the information about the awards. There are three of them; one for best application, one for best non application and a jury's award.

Awards are important and Lydia has two main points:
  • the most important one is collectively celebrating successes in a community
  • the other one is being able to communicate to the outside about the achievements
At Wikidata there is enough to celebrate. I would welcome a Wikidata award for 2015 for the reasons stated above. Sjoerd and Amir would be on my short list. <grin> If nobody reacts in time, I may even announce the award </grin>. Then again, why not steal a page out of the KDE book and have multiple awards?

#Wikipedia - list of awards of Jimmy Carter

President Carter is not special in having a separate Wikipedia article with his awards. As a list in an article is rather boring it makes sense to have that separate article.

With distinguished people like Mr Carter, the list is long and it tells a story of recognition. Most but not all of the awards are already known in Wikidata but surprisingly the quality of the information is not always that great. It says: "International Mediation medal, American Arbitration Association, 1979" and it is hard to find anything on that medal except in this list. There is no doubt that Mr Carter received the medal, it is just that there is no other source available.

Both Mr Carter and the American Arbitration Association will be able to acknowledge the existence of the medal and, the AAA will be able to inform us about any and all organisations and people who were honoured in this way. For them it is one way to get extra mileage out of the PR that is so often the reason for an award.

Saturday, August 01, 2015

#Wikidata - the Forough Faroukhzad Award

When I wrote about the Radcliffe fellow Mrs Mehrangiz Kar, I mentioned that recognition came to her in the form of many awards. One of them was the Forough Faroukhzad Award. This award was only known as a string of text. I asked an Iranian friend if he could identify a few more people who received this award. He identified a few more for me and consequently, this award became incrementally more relevant.

When you concentrate on the people of the Radcliffe college or alternatively the winners of the Forough Faroukhzad award, you make them better connected. It gets you to other universities or other awards. It may show you where those who teach were educated or it may show you nothing at all because the data is just not there.

A Dutch scientist published in a journal that dementia has a one in three link to heart and circulatory issues. It is highly likely that such knowledge may spur people to take better care of their health. The scientist does not have an article or an item. The organisation that reports on it funded the research. It has a high impact on research on this issue in the world and it is hardly known outside the Netherlands by people who are not working in this field.

The Hartstichting also issues awards and it makes it obvious why the persons involved are important. I can read Dutch, not Iranian. We need friends from all over the world making the connections. Giving us a view that brings a new perspective.