Saturday, November 26, 2016

The problem with #science explained with #Wikipedia

It is a recurring theme. People study a subject and reality turns out to be different. The science is flawless, the results are impressive and important strides forward are indeed made. The study of heart disease is a great example; many studies resulted in an improved life expectancy for men, particularly white men. The Dutch Hartstichting is raising funds for new research because of this existing bias in research. For women in the Netherlands, heart disease is the number one killer because heart disease is different in women; this was not noticed before because heart disease in women was not studied.

Wikipedia as it is commonly known in research has the same problem. It is not Wikipedia as we know it; it is English Wikipedia. My contributions to Wikipedia have not been to English Wikipedia; they went to the Dutch Wikipedia, and I will not be noticed as one of the most prolific contributors to Wikimedia projects because my contributions to "Wikipedia" are hardly significant.

As I blogged before, scientific papers do not get published when they do not involve English Wikipedia. The consequence is that when people quote research, their quotes include this bias, and strictly speaking what they quote is not necessarily true when you consider all of Wikipedia. The problem with biased research is that the policies of the WMF are based on the known "facts".

Nothing new so far. We all know it when we are honest. So what can we do to remove some of the bias? The first thing is to devalue any and all research that is English Wikipedia only. It covers less than half of what we do. The second thing is to evaluate research for its algorithms. When both the algorithms and the data are available, it is possible to run the algorithm on a more inclusive data set and check its validity. With the quality of Wikidata data as a source on all the Wikipedias improving, such an approach is increasingly feasible. The last thing is for the Wikimedia Foundation itself to address this bias. With English Wikipedia being less than 50% of its traffic and workflow, it would be good if a similar percentage of its efforts were focused on the bigger half of what we all do.

So what is the harm? We expect all Wikipedians largely to do what "Wikipedians" do. However, we are not all English Wikipedians. The needs other people have are not discussed, not taken seriously. We have seen wonderful examples of potential functionality showcased, but they are not taken further, not taken into production, because they do not fit the preconceived ideas of what we do; they are not part of the road map. The projects in Wikidata are not about Wikidata but about how to make us all into one big data glob, and USING the data is only seen in relation to Wikipedia articles. We do not know how much Wikidata is used; some studies have been done but they are in relation to "Wikipedia" and that is not relevant to me. We find that Wikisource gains more and more content that may be valuable to our readers, but we do not market this data because we never did marketing for Wikipedia. There are several websites that only do this, in a way that could be much improved if we took Wikisource seriously.

It hurts us to only consider English Wikipedia and this bias in research and policy is more damaging than the bias that is considered by the English Wikipedians.

Wednesday, November 23, 2016

#Bias in #research

Actually, it starts with something else. You need to publish, so you have to select a subject to study that will be of interest to the publisher.

As a consequence hardly any research is done about the other Wikipedias. I have been informed by a reliable source that it has to be English or it will not be published.

Now Wikimedia Foundation, how about that? Is there any research done on Wikipedia or is all the research biased in this way?

Tuesday, November 01, 2016

#Wikidata year 4; What Gupta year is that?

Wikidata is celebrating its fourth birthday. It is celebrated with some mighty fine gifts. It is a time to reflect on what has gone before and what is ahead of us. Obviously there are challenges we face, and my gift is a set of queries and questions I do not know how to address. I focus on the Gupta empire because it currently has my interest.

During the era of the Gupta empire there was a "Gupta year". An article refers to it and my first question is: what date would the birthdate of Wikidata be in Gupta years?

Obviously there are many maps that include the Gupta empire. Can I have them sorted by date please? What other countries border the Gupta empire? Who were its rulers and how does the map change over time?

To get answers is nice, but for me it is important that the algorithms involved are relevant to any country, old and new, and relevant to timelines old and new. When we can express dates in the "Year Gupta", we can check if dates in Wikidata are indeed Julian or maybe Gregorian.
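As a rough illustration of what such a check could look like, here is a minimal Python sketch. It converts a Gregorian date to the same day in the Julian calendar by way of the Julian Day Number, and it computes an era year by simple subtraction. The function names are mine, the date used for the launch of Wikidata (29 October 2012) is an assumption, and so is the epoch year 319 CE for the Gupta era; the real epoch and the start of the Gupta year are debated, which is why the epoch is a parameter and not a constant.

```python
from datetime import date

# Offset between Python's proleptic Gregorian ordinal and the Julian Day Number.
JDN_OFFSET = 1721425


def gregorian_to_julian(d):
    """Return (year, month, day) in the Julian calendar for a Gregorian date."""
    jdn = d.toordinal() + JDN_OFFSET
    # Standard integer arithmetic for turning a Julian Day Number
    # into a Julian calendar date.
    c = jdn + 32082
    y = (4 * c + 3) // 1461
    e = c - (1461 * y) // 4
    m = (5 * e + 2) // 153
    day = e - (153 * m + 2) // 5 + 1
    month = m + 3 - 12 * (m // 10)
    year = y - 4800 + m // 10
    return (year, month, day)


def era_year(ce_year, epoch_ce_year):
    """Years elapsed since an assumed epoch year; ignores when the era's year starts."""
    return ce_year - epoch_ce_year


# Assumed launch date of Wikidata, in the Gregorian calendar.
wikidata_birthday = date(2012, 10, 29)
print(gregorian_to_julian(wikidata_birthday))

# 319 is an assumed epoch for the Gupta era, used for illustration only.
print(era_year(wikidata_birthday.year, 319))
```

The same two functions work for any date and any era once an epoch is agreed on, which is the point: the algorithm should not be specific to one country or one calendar.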

When we have continuity in maps over time, we will know which country, and which culture, a location such as a city or the land of a tribe was part of.

Wikidata live long and prosper :)