Wednesday, November 05, 2025

Missing award recipients in both Wikidata and the Wikipedias

Professor Fei-Fei Li is one of the recipients of the 2025 Queen Elizabeth Prize for Engineering. It says so on the English Wikipedia and it is confirmed on the website of the prize.

There are nine Wikipedias with an article for the award, and there is Wikidata. When the 2025 awardees are known on a Wikipedia, "2025" should appear in the text of the article; otherwise the article is likely out of date. The recipients should be known on Wikidata AND there should be an "award received" statement for the award with a date of 2025.

When you check Wikidata for this award using "Reasonator", you will find that Wikidata is in need of an update. I learned of this award by accident. Updates are a hit-or-miss affair; this would improve when a bot produces a list of all the awards that are in need of an update. When a bot produces this list for every Wikipedia for all the known awards, it enables people to do this maintenance work.
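
A rough sketch of the kind of query such a bot could start from, assuming Q618779 ("award") as the class of awards; P166 is "award received" and P585 is "point in time". Run over all awards at once it will hit the limits of the Wikidata Query Service, so a real bot would work through the awards in batches.

    import requests

    # Sketch only: list awards that have at least one known recipient but where
    # nobody has an "award received" (P166) statement dated (P585) in 2025.
    QUERY = """
    SELECT DISTINCT ?award ?awardLabel WHERE {
      ?award wdt:P31/wdt:P279* wd:Q618779 .     # the item is an award
      ?anyone wdt:P166 ?award .                 # ... with at least one known recipient
      FILTER NOT EXISTS {
        ?someone p:P166 [ ps:P166 ?award ; pq:P585 ?when ] .
        FILTER (YEAR(?when) = 2025)             # nobody has a 2025 date yet
      }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 100
    """

    rows = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "award-update-list-sketch/0.1"},
        timeout=120,
    ).json()["results"]["bindings"]

    for row in rows:
        print(row["awardLabel"]["value"])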

Obviously 2025 is this year and it will see the most changes. A similar job can be run for other years, but it is less likely to bring many additions; more likely these lists will shrink over time.

Thanks,

      GerardM

Saturday, November 01, 2025

English Wikipedia awards, a Wikidata user story

I noticed on Bluesky that Yadvinder Malhi received the Ramon Margalef Prize. There is an English Wikipedia article for both Mr Malhi and the prize. I looked it up: Mr Malhi is not listed as a prize winner on the article for the prize; the prize is, however, mentioned in his personal article as a text reference.

So why not have a tool that produces, for each Wikipedia, a list of awards where Wikidata knows about an award winner, both the award and the winner have a Wikidata item, and the winner is not mentioned on the award article? Easy, obvious, and it will improve the quality of articles about awards.

This can work both ways.. Why not also have a tool that produces a list of awards known at Wikidata that are not linked on the recipient's article.
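
To make the first of these checks concrete, here is a rough sketch for a single award, using the Queen Elizabeth Prize for Engineering as the example. It asks Wikidata who received the award ("award received", P166) and reports the winners whose English Wikipedia article is not linked from the article about the award. A real tool would loop over all awards on all Wikipedias.

    import requests
    from urllib.parse import unquote

    WIKI_API = "https://en.wikipedia.org/w/api.php"
    ARTICLE = "Queen Elizabeth Prize for Engineering"

    # 1. The Wikidata item behind the award article.
    page = next(iter(requests.get(WIKI_API, params={
        "action": "query", "titles": ARTICLE, "prop": "pageprops",
        "ppprop": "wikibase_item", "format": "json"}, timeout=60,
    ).json()["query"]["pages"].values()))
    award_qid = page["pageprops"]["wikibase_item"]

    # 2. Winners according to Wikidata, with their English Wikipedia article.
    query = f"""
    SELECT ?article WHERE {{
      ?winner wdt:P166 wd:{award_qid} .
      ?article schema:about ?winner ;
               schema:isPartOf <https://en.wikipedia.org/> .
    }}
    """
    rows = requests.get("https://query.wikidata.org/sparql",
                        params={"query": query, "format": "json"},
                        headers={"User-Agent": "award-article-check-sketch/0.1"},
                        timeout=60).json()["results"]["bindings"]
    winner_titles = {unquote(r["article"]["value"].rsplit("/wiki/", 1)[1]).replace("_", " ")
                     for r in rows}

    # 3. Pages that are actually linked from the award article.
    linked_titles = set()
    params = {"action": "query", "titles": ARTICLE, "prop": "links",
              "pllimit": "max", "format": "json"}
    while True:
        data = requests.get(WIKI_API, params=params, timeout=60).json()
        for p in data["query"]["pages"].values():
            linked_titles.update(link["title"] for link in p.get("links", []))
        if "continue" not in data:
            break
        params.update(data["continue"])

    # 4. Winners known to Wikidata but missing from the award article.
    for title in sorted(winner_titles - linked_titles):
        print("not linked from the award article:", title)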

Technically it is not that hard. It is just a few queries that are to be run on a regular basis. It is the user interface where it becomes tricky. How will a user know that something was fixed.. How will we run it for all the Wikipedias.. Will we be smart and recognise red links..

Another tool could indicate to Wikipedias with an article for an award when something changed for that award.. particularly new award winners for the current year. It could be a list that triggers editors to revisit their articles.

Thanks,

       GerardM

Sunday, October 26, 2025

Automated updates for Wikimedia projects


I revisited my Wikipedia user page. On it I have several subpages that are automatically updated when things change on Wikidata. One of them is about the "Prix Roger Nimier"; I had not looked at it for years. I updated Wikidata from the data on the French Wikipedia and, to make it interesting, I added the Listeria template to my French Wikipedia user page. It updated, and the English and French pages are now nearly identical. The difference is in the descriptions.

There are many personal project pages that are automatically updated from Wikidata. The point that I wanted to make: topics are not universally maintained. When I had another look after a few years, I found that many pages have had regular updates. The quality, however, is not that great. From a Wikimedia perspective, it seems that we have not one audience but many. When we allow for automatic updates, we will be able to share the sum of all our knowledge with a much bigger audience.

Thanks,

       GerardM

Sunday, October 19, 2025

Providing Resources for a subject used in a Wikimedia project

So you want to write an article in a Wikimedia Project. It may be new to your project, but it is likely part of the sum of all Wikimedia knowledge. Let's consider a workflow that acknowledges this reality.

An article typically starts with a title, and that title may be linked to an existing item in Wikidata. If so, the item, the concept, is linked to a workflow. All the references used in the articles on that concept are gathered. All relations known at Wikidata are presented. Based on what kind of item it is, tools are identified that present information in the articles on the concept: categories and info boxes. References for the content of the info boxes are included as well.

Another workflow is for existing articles. All references and relations expressed in the article show as green; unused references and relations show as orange. Missing categories and missing values in info boxes are presented, and the author may click to include them in the article. Values in info boxes may show black, red or blue; it will be whatever the author chooses.
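
A rough sketch of the green/orange idea for one existing article, limited to relations (not references), assuming the article is already connected to a Wikidata item. Every item-valued statement on the item shows green when its target is linked from the article and orange when it is not. Albert Einstein (Q937) is used only as a concrete, well-connected example.

    import requests

    ITEM_QID = "Q937"   # Albert Einstein, purely as an example

    # 1. All item-valued statements on the Wikidata item, plus its enwiki title.
    entity = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbgetentities", "ids": ITEM_QID,
                "props": "claims|sitelinks", "format": "json"},
        timeout=60,
    ).json()["entities"][ITEM_QID]

    related = set()
    for prop, claims in entity["claims"].items():
        for claim in claims:
            value = claim["mainsnak"].get("datavalue", {}).get("value")
            if isinstance(value, dict) and value.get("entity-type") == "item":
                related.add((prop, value["id"]))

    # 2. Wikidata items of the pages linked from the article
    #    (first batch only; a real tool would follow API continuation).
    article = entity["sitelinks"]["enwiki"]["title"]
    data = requests.get("https://en.wikipedia.org/w/api.php", params={
        "action": "query", "generator": "links", "titles": article, "gpllimit": "max",
        "prop": "pageprops", "ppprop": "wikibase_item", "format": "json"}, timeout=60).json()
    linked_qids = {page["pageprops"]["wikibase_item"]
                   for page in data["query"]["pages"].values()
                   if "wikibase_item" in page.get("pageprops", {})}

    # 3. Green when the related item is linked from the article, orange otherwise.
    for prop, qid in sorted(related):
        print("green" if qid in linked_qids else "orange", prop, qid)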

The workflow is enabled once the concept or the article is linked to Wikidata. Wikipedians who do not want to change simply do not make use of this workflow and are left unbothered. There will be harvesting processes based on the recent changes on all projects; a change will trigger processes that may look for vandalism, for new relations and for suggestions for new labels.

The most important beneficiary will be our audience. This workflow makes the sum of all our knowledge actionable: to improve articles, populate articles and reflect what we know in all our articles. Our editors have the choice to use this tool or not. Obviously their edits will be harvested and evaluated in a broader context: all of the Wikimedia projects. The smaller projects, where more new articles are created, will have an easy time adding info boxes and references. The bigger projects will find the relations that are not, or not sufficiently, expressed with references.

Providing subject resources will work only when it is supported at Foundation scale. It is not that volunteers cannot build a prototype; it is that Toolforge does not provide the scalability and sustained performance that are needed.

Thanks,

      GerardM

Saturday, October 18, 2025

Using AI for both Wikidata/Wikipedia quality assurance

When people consider the relation between Wikipedia and Wikidata, it is typically seen from the perspective of creating new information, either in a Wikipedia or in Wikidata. However, what can we do for the quality of both Wikipedia and Wikidata when we consider the existing data in all Wikipedias and compare it to the Wikidata information?

All Wikipedia articles on the same subject are linked to one and only one Wikidata item. Articles linked from a Wikipedia article are consequently known to Wikidata. When Wikidata knows about a relation between these two articles, depending on the relation, it could feature in info boxes and/or categories in the article. At Wikidata we know about categories and what they should contain. Info boxes are known to Wikipedias for what they contain, and relations are likely to be known both to Wikidata and to Wikipedia.
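
The comparison can also start from the article instead of from the item. A sketch for one article, using the Fei-Fei Li article as a concrete example: every page linked from the article is resolved to its Wikidata item, and the linked items about which the article's own item has no statement at all are reported as candidates for missing relations, or as possible false friends.

    import requests

    WIKI_API = "https://en.wikipedia.org/w/api.php"
    ARTICLE = "Fei-Fei Li"

    def page_qids(extra):
        """Map page titles to Wikidata QIDs for a query, following API continuation."""
        result, params = {}, dict(extra, action="query", prop="pageprops",
                                  ppprop="wikibase_item", format="json")
        while True:
            data = requests.get(WIKI_API, params=params, timeout=60).json()
            for page in data.get("query", {}).get("pages", {}).values():
                qid = page.get("pageprops", {}).get("wikibase_item")
                if qid:
                    result[page["title"]] = qid
            if "continue" not in data:
                return result
            params.update(data["continue"])

    article_qid = page_qids({"titles": ARTICLE})[ARTICLE]
    linked = page_qids({"generator": "links", "titles": ARTICLE, "gpllimit": "max"})

    # Items that the article's own Wikidata item points at through any statement.
    claims = requests.get("https://www.wikidata.org/w/api.php",
                          params={"action": "wbgetclaims", "entity": article_qid,
                                  "format": "json"}, timeout=60).json()["claims"]
    related = set()
    for claim_list in claims.values():
        for claim in claim_list:
            value = claim["mainsnak"].get("datavalue", {}).get("value")
            if isinstance(value, dict) and value.get("entity-type") == "item":
                related.add(value["id"])

    # Linked pages whose item has no relation to the article's item in Wikidata.
    for title, qid in sorted(linked.items()):
        if qid != article_qid and qid not in related:
            print("no Wikidata relation:", title, qid)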

Fixing the issues identified in this way will substantially improve the integrity of the data in all our projects. We expect to find false friends and missing information both in Wikidata and in all the Wikipedias.

Using AI to identify issues ensures that quality is constantly part of the process: that basic facts are correct, so that the information we provide to our audience is as good as we have it.

Thanks,

       GerardM

Monday, October 13, 2025

Batch processes for Wikidata .. importing from ORCiD and Crossref - a more comprehensive trick

Every week a process runs that produces a list of all the papers for all the scientists known to Wikidata that have an ORCiD identifier. The papers are known by a DOI, and typically all scientific papers at Wikidata have a DOI. ORCiD-Scraper uses this list so that interested users can upload the information for these papers to Wikidata using the "QuickStatements" tool, one paper at a time, for one author at a time.
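
For the curious, the weekly list boils down to a query of this shape; a sketch, not the actual job. P496 is "ORCID iD", P50 is "author" and P356 is "DOI".

    import requests

    # Authors with an ORCiD identifier and, where present, the DOIs of the
    # papers Wikidata already credits them with. The real job runs over the
    # full set; the LIMIT keeps the sketch within query limits.
    QUERY = """
    SELECT ?author ?orcid ?doi WHERE {
      ?author wdt:P496 ?orcid .        # scientists with an ORCiD identifier
      OPTIONAL {
        ?paper wdt:P50 ?author ;       # papers that list them as author
               wdt:P356 ?doi .         # ... and that have a DOI
      }
    }
    LIMIT 1000
    """

    rows = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "orcid-paper-list-sketch/0.1"},
        timeout=120,
    ).json()["results"]["bindings"]

    for row in rows:
        print(row["author"]["value"], row["orcid"]["value"],
              row.get("doi", {}).get("value", ""))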

What if.. what if all new papers of all authors known to ORCiD are added? The challenge will be not to introduce duplicate papers or duplicate authors.. So let's agree on two things: we only introduce authors who have an ORCiD identifier and, for now, we only introduce papers that have a DOI and at least one author with an ORCiD identifier.

The trick is to cycle through all authors known to Wikidata, for instance a thousand at a time. All new papers have their DOI entered in a relational database where a DOI can exist only once. These papers may include multiple authors; they enter the same relational database, where an author is unique as well. When all papers for the first thousand authors and their associated authors are in the database, we can first add all missing authors to Wikidata and record the Wikidata identifiers in the relational database. We then add the missing papers. It is likely that no duplicate papers are introduced, but there will be duplicates for authors where Wikidata does not know about the ORCiD identifier.
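
Those uniqueness guarantees are exactly what a relational database gives for free. A minimal sketch in SQLite, with invented table and column names, just to show where the "exists only once" constraints live; re-running a batch is harmless because duplicates are simply ignored.

    import sqlite3

    conn = sqlite3.connect("orcid_batch.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS paper (
        doi   TEXT PRIMARY KEY,      -- a DOI can exist only once
        qid   TEXT                   -- filled in once the item exists in Wikidata
    );
    CREATE TABLE IF NOT EXISTS author (
        orcid TEXT PRIMARY KEY,      -- an author (ORCiD) is unique as well
        qid   TEXT
    );
    CREATE TABLE IF NOT EXISTS authorship (
        doi   TEXT REFERENCES paper(doi),
        orcid TEXT REFERENCES author(orcid),
        UNIQUE (doi, orcid)
    );
    """)

    def register(doi, orcids):
        """Record one paper and its ORCiD authors; duplicates are ignored."""
        conn.execute("INSERT OR IGNORE INTO paper (doi) VALUES (?)", (doi,))
        for orcid in orcids:
            conn.execute("INSERT OR IGNORE INTO author (orcid) VALUES (?)", (orcid,))
            conn.execute("INSERT OR IGNORE INTO authorship (doi, orcid) VALUES (?, ?)",
                         (doi, orcid))
        conn.commit()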

We cycle through all our known authors and we can rerun each week.

We can do this but why.. 

One reason is an English Wikipedia template named {{Scholia}}; it may be used for scholars and subjects. Unlike Wikipedia, the information it presents will always be up to date. There are more reasons, but this is a great starter.

Thanks,

      GerardM

Saturday, September 27, 2025

Moving forward with Amir's "Internal Links in #Wikipedia" presentation

At Wikimania 2019, my friend Amir presented "Internal Links in Wikipedia". It provides a wonderful exposé of what is problematic with the existing blue and red link functionality in all our Wikipedias. At the end of 2024, things have technically moved forward; this blog post's intention is to argue what a local Wikibase for wiki links will bring to both editors and readers, and why it does not need to be controversial. By definition, changes made to Wikipedia are controversial.

Functionally, every link, red or blue, should remain exactly as it is. Technically, every blue link refers to one article, and every article SHOULD have an item at Wikidata. Every link, blue or red, may be referred to from many places and SHOULD be about only one concept. For every destination there MAY be a link to an item at Wikidata. At this time we have no way of knowing whether there is only one concept and whether there is an item at Wikidata for that concept.

Many years ago Wikidata solved a similar problem. Wikidata was an instant success because it replaced the interwiki functionality. The solution proposed today is similar and only possible now that Wikidata can be "federated" with many instances of a Wikibase. 

All destinations for both red and blue links will be known in a local Wikibase federated with Wikidata. Any destination may be linked to a Wikidata item but the name of the local article/destination will remain unique. Thanks to this federation, disambiguation support may be provided based on what is known both locally and globally when a new link is created. It will know about the synonymy for each subject.
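
To make the data model concrete, a deliberately minimal illustration in SQLite. The actual proposal is a local Wikibase federated with Wikidata, not a SQL table; the sketch only shows that every destination, red or blue, exists exactly once locally and that the connection to a Wikidata item is optional.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
    CREATE TABLE link_destination (
        local_title  TEXT PRIMARY KEY,   -- unique per wiki, whether the link is red or blue
        wikidata_qid TEXT                -- NULL until an item is connected
    )
    """)

    def register_destination(title, qid=None):
        """Create the destination once; later edits may attach or change the item."""
        conn.execute("INSERT OR IGNORE INTO link_destination (local_title) VALUES (?)",
                     (title,))
        if qid:
            conn.execute("UPDATE link_destination SET wikidata_qid = ? WHERE local_title = ?",
                         (qid, title))

    # A red link stays purely local until somebody connects it to a Wikidata item.
    register_destination("Hypothetical future laureate")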

This change does not need to be controversial because, like with the interwiki links, people can opt out of this new functionality. Even when only a subset of the editor community becomes involved, the quality of all links will improve quickly. With the interwiki links fixed, Wikidata was ready to become a knowledge base. As the wiki links in the local Wikibases get into shape, the Wikidata knowledge base may be used to signal that articles should be in specific categories, or that red links could be added in summation articles such as articles about an award.

Our dependence on Wikipedia editors will remain key, but tools like the Wikidata knowledge base can bring us up-to-date information and improve the connections between all our articles. Manually checking wiki links is a Sisyphean task; with tooling it becomes manageable and worthwhile.

Thanks,

      GerardM

Batch processes for Wikidata .. importing from ORCiD and Crossref

One of my favourite batch processes produces data for a tool called "Orcid-Scraper". I use it to add the missing publications known to ORCiD. I do it as a hobby; however, my time would be used more effectively if, instead of producing a database that enables me to add new data, the new data were added to Wikidata directly.

This was done in the past by a different tool. It was a disaster because Wikidata is NOT a relational database. The problem is that an item cannot be created with the certainty that it will be unique. To ensure that new items will be unique, there are plenty of tricks available.

The easiest trick is to have an option in the tool to create all the missing papers known for a given author, one author at a time, from Scholia. It makes use of the results of a batch process that runs once a week. Cheap, cheerful and highly effective.

Then there is a need for another batch process. For all the "author string"s that include an ORCiD identifier, existing authors are sought, and these author strings are changed into "author"s, removing the link to the ORCiD identifier as it is implicitly part of the author item. This process can run once a week.
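
A sketch of the query behind this batch process, assuming the ORCiD identifier is stored as a qualifier (P496) on the "author name string" (P2093) statement, which is how I read it here. It finds the author strings whose ORCiD already belongs to an existing author item, so that the string can be promoted to a proper "author" (P50) statement, for instance via QuickStatements.

    import requests

    QUERY = """
    SELECT ?paper ?authorString ?orcid ?authorItem WHERE {
      ?paper p:P2093 ?statement .
      ?statement ps:P2093 ?authorString ;
                 pq:P496 ?orcid .          # ORCiD attached to the author string
      ?authorItem wdt:P496 ?orcid .        # an author item with the same ORCiD exists
    }
    LIMIT 500
    """

    rows = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "author-string-cleanup-sketch/0.1"},
        timeout=120,
    ).json()["results"]["bindings"]

    for row in rows:
        # Each hit is a candidate edit: replace the author string on ?paper with
        # an "author" statement pointing at ?authorItem.
        print(row["paper"]["value"], row["authorString"]["value"],
              row["authorItem"]["value"])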

A second batch process, also running once a week, looks for "author string"s with ORCiD identifiers that have no corresponding author. It generates a list of ORCiD identifiers with their associated "author string"s and creates one new item per ORCiD identifier, uniquely identified by that identifier.

Obviously new authors make it useful to run the first batch process again.

These batches could run exclusively for an author processed by Orcid-Scraper, making this tool and Scholia more powerful and up to date.

Thanks,

       GerardM