Monday, September 02, 2013

The need for a mass merge in #Wikidata

At the Swedish #Wikipedia, many new articles about animal and insect species were created by bot. According to a mail reporting on the progress, close to a million articles were created.

This has been a huge undertaking, and the success has been repeated by running the bot on the Cebuano and Waray-Waray Wikipedias as well.

The problem is that many of these taxa already exist on other projects, for instance the English Wikipedia, and already have their own Wikidata items.

Two Wikidata items have just been merged; both are about Hersilia, a genus of spiders. Because an item has been created for each "taxon", I am quietly confident that over 100,000 duplicates exist in Wikidata. Probably more.

This probably means that the best approach to creating new articles with a bot is to introduce the data to Wikidata first. It is not nice to have to merge so many items. In this case the data in the infoboxes of the linked articles can be compared. When that data matches, the two items are likely about the same subject.
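
To make the idea of comparing infobox data a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the item identifiers are placeholders, the field names `taxon_name` and `rank` are hypothetical stand-ins for whatever an infobox provides, and the data is assumed to have been extracted already. It only groups items whose normalized values agree; it is not how any actual merge was done.

```python
from collections import defaultdict


def normalize(value):
    """Collapse whitespace and ignore case so trivial differences do not matter."""
    return " ".join(value.split()).casefold()


def find_duplicate_candidates(items, fields=("taxon_name", "rank")):
    """Group item ids whose infobox values agree on all the given fields.

    `items` is a list of dicts with an "id" key plus the (hypothetical)
    infobox fields. Items missing any of the fields are skipped.
    """
    groups = defaultdict(list)
    for item in items:
        values = [item.get(field) for field in fields]
        if any(v is None or not v.strip() for v in values):
            continue  # not enough data to compare
        key = tuple(normalize(v) for v in values)
        groups[key].append(item["id"])
    # Only groups with more than one item are merge candidates.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Placeholder data: the ids are made up, not real Wikidata identifiers.
items = [
    {"id": "item-A", "taxon_name": "Hersilia", "rank": "genus"},
    {"id": "item-B", "taxon_name": "hersilia ", "rank": "Genus"},
    {"id": "item-C", "taxon_name": "Hersiliidae", "rank": "family"},
]

candidates = find_duplicate_candidates(items)
print(candidates)
```

A match on name and rank only produces candidates. The same name can be used for different organisms, so comparing more fields (the parent taxon, for instance) or a human check would still be needed before merging.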

This is a nice puzzle involving a lot of data.
Thanks,
      GerardM