Monday, May 12, 2014

#WMHack - #Wikidata and bias towards English #Wikipedia

English Wikipedia has more articles and it has loads of infoboxes. Just as importantly, most of the people importing data know English, and consequently a lot of data is imported into Wikidata from that source. According to some, Wikidata must therefore be biased, because information available in the English Wikipedia is better represented.

Uhm, a great argument. Let's analyse that. The current statistics show that 50% of all Wikidata items have zero or one statement. They typically refer to only one article and have only one label. Effectively, such an item is garbage wherever it originated from. That leaves us with the other half. Less than one in thirty items has ten or more statements, and less than one in fifteen has ten or more labels. Only a fraction of the subjects covered by the English Wikipedia has great information.

We assume good faith when we say that the English Wikipedia aims to share in the sum of all knowledge AND aims to provide a neutral point of view. It is easiest to retrieve information from en.wp, and when we add all that to Wikidata, most of us gain the most improvement in quality for the least amount of effort. Even then, it is surprisingly problematic to set up the effort to get data in. It takes time and an inordinate amount of bickering before data finally finds its place.

When you look at all that data once it has found its place, you find that Wikidata itself looks best in English. As long as fallback languages for statements are not supported, its information looks reasonable only when seen in the Reasonator.

Wikidata aims to be a quality resource for information. When all it has is information from the English Wikipedia, you might as well go to the English Wikipedia. When Wikidata gets its house in order and compares its information with many sources and reports on the differences found, it helps improve information everywhere and reduces the existing bias.

When those sources include multiple Wikipedias, it will be biased towards those it compares with.
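To make the idea of comparing sources and reporting on the differences a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function name `report_differences`, the source names and the values are hypothetical, and this is not how Wikidata or any existing tool actually does it.

```python
from collections import Counter


def report_differences(claims):
    """Group sources by the value they give for one fact.

    `claims` maps a source name to the value that source states.
    Returns None when every source agrees, otherwise a dict that maps
    each distinct value to the sorted list of sources stating it.
    """
    counts = Counter(claims.values())
    if len(counts) <= 1:
        return None  # all sources agree, nothing to report
    return {
        value: sorted(src for src, val in claims.items() if val == value)
        for value in counts
    }


# Hypothetical data: one fact (a year of birth) as stated by four sources.
birth_year = {
    "wikidata": "1815",
    "enwiki": "1815",
    "nlwiki": "1816",
    "external_catalogue": "1815",
}

report = report_differences(birth_year)
```

Note that such a report can only flag disagreement among the sources it was given. When every source in the comparison is a Wikipedia, a value they all share goes unquestioned.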
