Friday, October 18, 2013

#Wikidata needs 3,780,000,000 labels

For an item to be found in Wikidata, it needs a label. Ideally every item has a label in every language that Wikidata supports, and currently there are 280+ languages. As it stands, Wikidata is not really useful in practically any of these languages. The statistics show that only a small percentage of items has more than 10 labels and that most items have only one label.

There are over 13.5 million items and consequently there is a need for 3,780,000,000 labels give or take a few.
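The arithmetic behind that number is simple enough to check. A minimal sketch, taking the rounded figures from this post (13.5 million items, 280 languages) as assumptions; the real counts are a bit higher on both sides:

```python
# Rounded figures from the post; both are lower bounds ("over", "280+").
ITEM_COUNT = 13_500_000
LANGUAGE_COUNT = 280


def labels_needed(items: int, languages: int) -> int:
    """One label per item per supported language."""
    return items * languages


total = labels_needed(ITEM_COUNT, LANGUAGE_COUNT)
print(f"{total:,}")  # prints 3,780,000,000
```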

There are several ways to drastically improve the number of labels. There are also ways to minimise the impact of missing labels.

The most obvious one is to use the title of every Wikipedia article as a label. There are for instance 729,712 articles in Chinese and 244,699 articles in Arabic; these will probably provide the most requested information in those languages. We can also use lexical information from any freely available resource. Wiktionary for instance includes a wealth of usable data. Implementing these two strategies alone will have a big impact.

The names of people typically stay the same in most languages with the same script. There are standards for transliteration and they can provide us with an adequate result. Add these strategies to the mix and the result becomes even better.
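To make the idea of filling missing labels concrete, here is a rough sketch of two of the strategies above: article titles as labels, and reusing a person's name across languages that share a script. It is not how Wikidata does or should implement this. The function names, the `SCRIPT_OF` table and the plain dictionaries are all hypothetical, and stripping a trailing parenthetical from a title is my own assumption about how disambiguated titles would be handled.

```python
import re

# Hypothetical, hand-made mapping from language code to script.
SCRIPT_OF = {"en": "Latn", "nl": "Latn", "de": "Latn", "ru": "Cyrl", "uk": "Cyrl"}


def title_to_label(title):
    """Drop a trailing disambiguation such as ' (schrijver)' (assumption)."""
    return re.sub(r"\s*\([^)]*\)$", "", title)


def fill_labels(labels, titles, is_person, languages):
    """Return a copy of `labels` with gaps filled where a strategy applies.

    labels: existing labels, language code -> label
    titles: Wikipedia article titles, language code -> title
    """
    result = dict(labels)

    # Strategy 1: use the article title where a language has no label yet.
    for lang, title in titles.items():
        result.setdefault(lang, title_to_label(title))

    # Strategy 2: for people, reuse a known name in same-script languages.
    if is_person:
        by_script = {}
        for lang, label in result.items():
            script = SCRIPT_OF.get(lang)
            if script is not None:
                by_script.setdefault(script, label)
        for lang in languages:
            script = SCRIPT_OF.get(lang)
            if lang not in result and script in by_script:
                result[lang] = by_script[script]

    return result


example = fill_labels(
    labels={"en": "Douglas Adams"},
    titles={"nl": "Douglas Adams (schrijver)"},
    is_person=True,
    languages=["en", "nl", "de", "ru"],
)
print(example)
```

In this example Dutch gets its label from the article title and German from the shared Latin script. Russian stays empty, because filling it would need a transliteration step that this sketch leaves out.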

Finally, there is our community. When we make them aware of which labels are most often requested and not found, they can either add a label to an existing item or add a new item.

Statistics indicate that we have a problem, but with this awareness we can build statistics that indicate what works and where our efforts have the most impact in making Wikidata truly useful.
