Monday, October 14, 2013

Battle of the sexes, the #Wikidata story II

#Wikidata is the most obvious place to study the distribution of articles among the sexes. There is however a problem: many "items" representing people do not indicate that they are "human", nor do they state their sex.

One approach to this problem has been indicated by Markus Krötzsch. In a blog post he describes how he determines the sex of a person based on the first name. He produced a long list of first names and the probable sex of people with each name.
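To make the idea concrete, here is a minimal sketch of such a lookup. It is not Markus's actual code: the table NAME_TO_SEX, its three entries, the probabilities in it and the function guess_sex are all made up for illustration, and a real list would be far longer and derived from data.

```python
from typing import Optional

# Hypothetical sample table: first name -> (probable sex, probability).
# The values are placeholders, not measured figures.
NAME_TO_SEX = {
    "maria": ("female", 0.99),
    "john": ("male", 0.99),
    "andrea": ("female", 0.60),
}


def guess_sex(label: str, min_confidence: float = 0.9) -> Optional[str]:
    """Guess the sex from the first word of a label, or return None."""
    parts = label.strip().split()
    if not parts:
        return None  # empty label, nothing to go on
    entry = NAME_TO_SEX.get(parts[0].lower())
    if entry is None:
        return None  # first name is not in the list
    sex, confidence = entry
    # Ambiguous names are left undecided rather than guessed.
    return sex if confidence >= min_confidence else None
```

Taking the first word of a label as the first name is itself an assumption; it fails for cultures where the family name comes first, so a name that is not found or is ambiguous is best left alone.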

With this approach it is possible to determine the overall distribution of the sexes within Wikidata. It should also be possible to look at the distribution of the sexes within the different Wikipedias because of the interwiki links. When first names are known in the Latin script, determining the same first names in other scripts, and thereby expanding the reach of this process, should be feasible as well.
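Counting per Wikipedia could look something like the sketch below. It assumes each item has already been reduced to a small record with a "sex" value and a list of the wikis it links to; that record layout and the function names are my own invention, not the Wikidata data format.

```python
from collections import Counter, defaultdict


def distribution_per_wiki(items):
    """Count items per sex for every wiki an item links to."""
    counts = defaultdict(Counter)
    for item in items:
        for wiki in item["sitelinks"]:
            counts[wiki][item["sex"]] += 1
    return counts


def shares(counter):
    """Turn absolute counts into fractions of the total."""
    total = sum(counter.values())
    return {sex: n / total for sex, n in counter.items()}


# Hypothetical records, only to show the shape of the input.
sample = [
    {"sex": "female", "sitelinks": ["enwiki", "nlwiki"]},
    {"sex": "male", "sitelinks": ["enwiki"]},
    {"sex": "male", "sitelinks": ["enwiki", "dewiki"]},
    {"sex": "male", "sitelinks": ["enwiki"]},
]
per_wiki = distribution_per_wiki(sample)
```

An item counts once for every Wikipedia that has an article about it, so the totals per wiki add up to more than the number of items.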

As this algorithm works pretty well for research, it could be the basis for a tool. When someone is shown the first paragraph of an article together with the suggested sex and a yes/no button to confirm it, it should be really easy to quickly add a lot of information to Wikidata.
