Monday, October 13, 2025

Batch processes for Wikidata .. importing from ORCiD and Crossref - a more comprehensive trick

Every week a process runs that produces a list of all the papers for all the scientists known to Wikidata that have an ORCiD identifier. Each paper is identified by its DOI, and scientific papers at Wikidata typically have one. ORCiD-Scraper uses this list so that interested users can upload the information about these papers to Wikidata with the "QuickStatements" tool. One paper at a time for one author at a time.

What if.. what if all new papers of all authors known to ORCiD are added? The challenge will be not to introduce duplicate papers or duplicate authors.. So let's agree on two things: we only introduce authors who have an ORCiD identifier, and for now we only introduce papers that have a DOI and at least one author who has an ORCiD identifier.

The trick is to cycle through all authors known to Wikidata, for instance a thousand at a time. All new papers have their DOI entered in a relational database where a DOI can exist only once. These papers may have multiple authors; they enter the same relational database, where an author is unique as well. When all papers for the first thousand authors, and their associated authors, are in the database, we can first add all missing authors to Wikidata and record their Wikidata identifiers in the relational database. We then add the missing papers. It is likely that no duplicate papers are introduced, but there will be duplicate authors in those cases where Wikidata already has the author but does not know about the ORCiD identifier.
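To make the bookkeeping concrete, here is a minimal sketch in Python with SQLite. It is only an illustration, not an existing tool: the table names, column names and function names are all made up for this post, and the DOIs and ORCiD identifiers in the usage example are placeholders. It assumes the papers and their authors have already been fetched from ORCiD or Crossref.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create the (hypothetical) staging database; names are illustrative."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS author (
            orcid TEXT NOT NULL PRIMARY KEY,  -- an author can exist only once
            qid   TEXT                        -- Wikidata item, NULL until known
        );
        CREATE TABLE IF NOT EXISTS paper (
            doi TEXT NOT NULL PRIMARY KEY,    -- a DOI can exist only once
            qid TEXT                          -- Wikidata item, NULL until known
        );
        CREATE TABLE IF NOT EXISTS authorship (
            doi   TEXT NOT NULL REFERENCES paper(doi),
            orcid TEXT NOT NULL REFERENCES author(orcid),
            PRIMARY KEY (doi, orcid)
        );
        """
    )
    return conn


def normalize_doi(doi):
    # DOIs are case-insensitive, so store one canonical spelling.
    return doi.strip().upper()


def record_batch(conn, works):
    """Store (doi, [orcid, ...]) pairs; repeats are silently ignored."""
    with conn:  # one transaction per batch
        for doi, orcids in works:
            doi = normalize_doi(doi)
            conn.execute("INSERT OR IGNORE INTO paper (doi) VALUES (?)", (doi,))
            for orcid in orcids:
                conn.execute(
                    "INSERT OR IGNORE INTO author (orcid) VALUES (?)", (orcid,)
                )
                conn.execute(
                    "INSERT OR IGNORE INTO authorship (doi, orcid) VALUES (?, ?)",
                    (doi, orcid),
                )


def set_author_qid(conn, orcid, qid):
    """Record the Wikidata identifier once the author exists at Wikidata."""
    with conn:
        conn.execute("UPDATE author SET qid = ? WHERE orcid = ?", (qid, orcid))


def missing_authors(conn):
    """ORCiD identifiers that still need a Wikidata item (add these first)."""
    rows = conn.execute("SELECT orcid FROM author WHERE qid IS NULL ORDER BY orcid")
    return [row[0] for row in rows]


def missing_papers(conn):
    """DOIs that still need a Wikidata item (add these second)."""
    rows = conn.execute("SELECT doi FROM paper WHERE qid IS NULL ORDER BY doi")
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = open_db()
    # Placeholder data: the same paper reported by two different authors.
    record_batch(conn, [
        ("10.1234/example.1", ["0000-0000-0000-0001", "0000-0000-0000-0002"]),
        ("10.1234/EXAMPLE.1", ["0000-0000-0000-0002"]),
    ])
    print(missing_authors(conn))
    print(missing_papers(conn))
```

The uniqueness lives in the primary keys, so it does not matter how often the same paper turns up through different authors. This only prevents duplicates inside the staging database; whether a DOI or an ORCiD identifier already exists at Wikidata still has to be checked before anything is uploaded.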

We cycle through all our known authors and we can rerun each week.
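Cycling through the authors in groups can be as simple as the sketch below. The helper name `batches` and the default size of a thousand are only assumptions for illustration; the list of authors would come from the weekly process.

```python
from itertools import islice


def batches(items, size=1000):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return  # nothing left, the cycle is complete
        yield chunk


if __name__ == "__main__":
    # Placeholder identifiers standing in for the weekly list of authors.
    authors = [f"author-{n}" for n in range(2500)]
    for number, group in enumerate(batches(authors), start=1):
        print(number, len(group))
```

Because every group is handled on its own, a weekly rerun can start again from the first group and the work already done is not repeated as long as duplicates are ignored when they are stored.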

We can do this but why.. 

One reason is an English Wikipedia template named {{Scholia}}; it may be used for scholars and subjects. Unlike Wikipedia, the information it presents will always be up to date. There are more reasons, but this is a great starter.
Thanks,
      GerardM
