Sunday, October 26, 2025

Automated updates for Wikimedia projects

 

I revisited my Wikipedia user page. It has several subpages that are automatically updated whenever the underlying data changes on Wikidata. One of them is about the "Prix Roger Nimier"; I had not looked at it for years. I updated Wikidata with the data from the French Wikipedia and, to make it interesting, I added the Listeria template to my French Wikipedia user page. It updated, and the English and French pages are now nearly identical. The difference is in the descriptions.

There are many personal project pages that are automatically updated from Wikidata. The point I want to make: topics are not universally maintained. Having another look after a few years, I found that many of these pages have received regular updates, but the quality is not that great. From a Wikimedia perspective, we have not one audience but many. When we allow for automatic updates, we will be able to share the sum of all our knowledge with a much bigger audience.

Thanks,

       GerardM

Sunday, October 19, 2025

Providing Resources for a subject used in a Wikimedia project

So you want to write an article in a Wikimedia project. It may be new to your project, but it is likely already part of the sum of all Wikimedia knowledge. Let's consider a workflow that acknowledges this reality.

An article typically starts with a title, and that title may be linked to an existing item in Wikidata. If so, the item, that is the concept, is linked to a workflow: the references for all articles on the subject are gathered, and all relations known at Wikidata are presented. Based on the kind of item, tools are identified that present information found in articles on the concept, such as categories and info boxes. References for the content in the info boxes are included as well.

Another workflow is for existing articles. All references and relations already expressed in the article show as green; unused references and relations show as orange. Missing categories and missing values in info boxes are presented, and the author may click to include them in the article. Values in info boxes may show black, red or blue; it will be whatever the author chooses.
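The green/orange classification above can be sketched in a few lines. This is a hypothetical illustration: the function name and the reference identifiers are made up, and a real workflow would draw its sets from the article's wikitext and from the Wikidata item.

```python
def classify_references(article_refs: set[str], wikidata_refs: set[str]) -> dict[str, str]:
    """Label every reference: 'green' if the article already cites it,
    'orange' if it is known (e.g. at Wikidata) but unused in the article."""
    colours = {}
    for ref in article_refs | wikidata_refs:
        colours[ref] = "green" if ref in article_refs else "orange"
    return colours

# Made-up DOIs: the article cites one reference, Wikidata knows of two.
article_refs = {"doi:10.1000/demo1"}
wikidata_refs = {"doi:10.1000/demo1", "doi:10.1000/demo2"}
print(classify_references(article_refs, wikidata_refs))
```

The same pattern extends to relations, categories and info box values; only the sets being compared change.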

The workflow is enabled once the concept or the article is linked to Wikidata. Wikipedians who do not want to change simply do not use the workflow and are left unbothered. There will be harvesting processes based on the recent changes on all projects; a change will trigger processes that may look for vandalism, for new relations and for suggestions for new labels.

The most important beneficiary will be our audience. This workflow makes the sum of all our knowledge actionable: to improve articles, populate articles and reflect what we know in all our articles. Our editors have the choice to use this tool or not. Obviously their edits will be harvested and evaluated in a broader context: all of the Wikimedia projects. The smaller projects, where proportionally more new articles are created, will have an easy time adding info boxes and references. The bigger projects will find the relations that are not, or not sufficiently, expressed with references.

Providing subject resources will only work when it is supported at Foundation scale. It is not that volunteers cannot build a prototype; it is the need for scalability and sustained performance that Toolforge does not provide.

Thanks,

      GerardM

Saturday, October 18, 2025

Using AI for both Wikidata and Wikipedia quality assurance

When people consider the relation between Wikipedia and Wikidata, it is typically seen from the perspective of creating new information, either in a Wikipedia or in Wikidata. However, what can we do for the quality of both when we take the existing data in all Wikipedias and compare it to the Wikidata information?

All Wikipedia articles on the same subject are linked to a single Wikidata item. Articles linked from a Wikipedia article are consequently known to Wikidata. When Wikidata knows about a relation between two such articles, depending on the relation, it could feature in the article's info boxes and/or categories. At Wikidata we know about categories and what they should contain. Info boxes are known to Wikipedias for what they contain; relations are likely to be known to both Wikidata and Wikipedia.
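A minimal sketch of that comparison, assuming the article's outgoing links have already been resolved to Wikidata item identifiers (the QIDs below are made up for illustration):

```python
def compare_links(article_links: set[str], wikidata_relations: set[str]):
    """Compare the items an article links to with the relations
    Wikidata records for that article's item."""
    # Known to the article but not yet to Wikidata: candidate new relations.
    candidate_relations = article_links - wikidata_relations
    # Known to Wikidata but not linked in the article: candidate additions.
    candidate_links = wikidata_relations - article_links
    return candidate_relations, candidate_links

# Hypothetical example: the article links Q1 and Q2; Wikidata relates Q2 and Q3.
new_for_wikidata, new_for_article = compare_links({"Q1", "Q2"}, {"Q2", "Q3"})
print(new_for_wikidata, new_for_article)
```

Each candidate on either side is exactly the kind of discrepancy, a possible false friend or a missing statement, that an AI or a human reviewer would then evaluate.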

Issues identified in this way will substantially improve the integrity of the data in all our projects. We can expect to find false friends and missing information in Wikidata and in all Wikipedias.

Using AI to identify such issues ensures that quality is constantly part of the process: that basic facts are correct, so that the information we provide to our audience is as good as what we have.

Thanks,

       GerardM

Monday, October 13, 2025

Batch processes for Wikidata .. importing from ORCiD and Crossref - a more comprehensive trick

Every week a process runs that produces a list of all the papers of all the scientists known to Wikidata who have an ORCiD identifier. The papers are identified by a DOI, and typically all scientific papers at Wikidata have a DOI. ORCiD-Scraper uses this list to let interested users upload the information of these papers to Wikidata with the "QuickStatements" tool, one paper at a time for one author at a time.

What if.. what if all new papers of all authors known to ORCiD are added? The challenge will be not to introduce duplicate papers or duplicate authors. So let's agree on two things: we only introduce authors who have an ORCiD identifier, and for now we only introduce papers that have a DOI and at least one author with an ORCiD identifier.

The trick is to cycle through all authors known to Wikidata, for instance a thousand at a time. All new papers have their DOI entered in a relational database where a DOI can exist only once. These papers may include multiple authors; they enter the same relational database, where each author is unique. When all papers for the first thousand authors, and their associated authors, are in the database, we first add all missing authors to Wikidata and record the Wikidata identifiers in the relational database. We then add the missing papers. It is likely that no duplicate papers are introduced, but there will be duplicate authors where Wikidata does not know about the ORCiD identifier.
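The uniqueness guarantees above map directly onto relational constraints. As a sketch, with sqlite3 standing in for "a relational database" and made-up DOIs and ORCiD identifiers:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE papers  (doi   TEXT PRIMARY KEY);           -- a DOI can exist only once
    CREATE TABLE authors (orcid TEXT PRIMARY KEY, qid TEXT); -- each author is unique
""")

def stage(doi: str, orcids: list[str]) -> None:
    """Stage one scraped paper and its authors; duplicates are silently ignored."""
    con.execute("INSERT OR IGNORE INTO papers VALUES (?)", (doi,))
    con.executemany("INSERT OR IGNORE INTO authors (orcid) VALUES (?)",
                    [(o,) for o in orcids])

# The same paper staged twice still leaves a single paper row.
stage("10.1000/demo", ["0000-0002-0000-0001"])
stage("10.1000/demo", ["0000-0002-0000-0001", "0000-0002-0000-0002"])
print(con.execute("SELECT COUNT(*) FROM papers").fetchone()[0])   # 1
print(con.execute("SELECT COUNT(*) FROM authors").fetchone()[0])  # 2
```

After a batch is staged, the `qid` column would be filled in for authors matched or created in Wikidata, and only papers without a Wikidata identifier would be uploaded.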

We cycle through all our known authors, and we can rerun the process each week.

We can do this but why.. 

One reason is an English Wikipedia template named {{Scholia}}; it may be used for scholars and subjects. Unlike static Wikipedia text, the information it presents will always be up to date. There are more reasons, but this is a great starter.

Thanks,
      GerardM