When category names are well chosen, they predict similarity between the things in a category. A research paper named "Recognizing Descriptive Wikipedia Categories for Historical Figures" came to this conclusion, and it is complete with a lot of mathematics. They did their panel research so it must be good.
At the back of the paper there is a list of English categories and their "Surprise level". It focuses on balancing the effect of both the size of the category and the probability that inside-category pairs become close neighbours.
This makes the size of the category relevant. One of the categories is "International Tennis Hall of Fame inductees"; it has 222 entries, the German category knows about 22 additional inductees, and the S-level is 163.84. For "24 Hours of Le Mans drivers" there are 1,247 entries, the German category knows about 501 additional drivers, and the S-level is 138.47.
Categories in Wikidata may include a definition of their content. For instance: "is a list of" - "human" with a qualifier of "award received" - "International Tennis Hall of Fame". This definition can be used in tools, even bots, to add all the missing statements to Wikidata.
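To make the idea of using such a definition in a tool a bit more concrete, here is a minimal sketch in Python. It is not how any existing bot works. It assumes we already have the list of items in the category and the list of items that already carry the statement; how those lists are fetched is left out. The item identifiers are hypothetical placeholders, and the tab-separated "item, property, value" output is only modelled on the kind of line a batch-editing tool could accept. P166 is the Wikidata property "award received".

```python
from dataclasses import dataclass

# Wikidata property "award received".
P_AWARD_RECEIVED = "P166"


@dataclass(frozen=True)
class CategoryDefinition:
    """The qualifier part of a category definition: property and value."""
    property_id: str
    value_id: str


def missing_statements(members, already_stated, definition):
    """Return one tab-separated line per category member that lacks the statement.

    members: item ids found in the category (assumed to be given).
    already_stated: item ids that already have the statement (assumed to be given).
    """
    missing = sorted(set(members) - set(already_stated))
    return [
        "\t".join([item, definition.property_id, definition.value_id])
        for item in missing
    ]


# Hypothetical placeholder ids; real Wikidata ids would go here.
definition = CategoryDefinition(P_AWARD_RECEIVED, "Q_HALL_OF_FAME")
members = ["Q_PLAYER_A", "Q_PLAYER_B", "Q_PLAYER_C"]
already_stated = ["Q_PLAYER_B"]

example_lines = missing_statements(members, already_stated, definition)
for line in example_lines:
    print(line)
```

The set difference is the whole trick: everything that is in the category but has no statement yet is a candidate. A person should still check the candidates, because a category can contain items that do not fit its definition.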
The absence of articles shows a bias; the articles that exist are what editors found notable enough to write about. It is however not that relevant. One question is whether this research translates to other Wikipedias and their categories. Another is whether there is a predictive value for the relevance of missing articles in other Wikipedias for the same category.
For some categories, relevance exists because of the interest in a specific culture. For me Johan Cruyff is more relevant than any "wide receiver" in American football; I cannot name one. This research is interesting but it does not give us the most famous people ever. This is obvious because of the distribution of topics in English Wikipedia.
Thanks,
GerardM