Saturday, October 05, 2019

Rebecca R. Richards-Kortum

A text on the Internet read: "She’s Rice’s first-ever MacArthur grant winner. But her real claim to fame? Her clever medical inventions might just save your life." It is not as if I know her, even though I have added to her Wikidata item in the past.

I looked her up because she approves of the NEST360° organisation on Twitter. It is an organisation committed to reducing neonatal mortality in hospitals in sub-Saharan Africa by 50 percent.

Such organisations deserve a place in Wikidata; I am adding its members. I consider it part of my "Africa project" even though it does not have a place there yet.

Yesterday I added an item for "neonatal care", and all the papers about neonatal care that are already included in Wikidata need to be associated with the subject. Scientists like Prof Joy Lawn are to be marked for their specialty.

How is it possible that it takes a 60-year-old white male from the Netherlands to add something this basic to Wikidata? We are talking about more yearly deaths than Ebola.

Tuesday, October 01, 2019

What data is wrangled is obvious when its presentation is considered

When you watch a game, you want to know the score. When you have a favourite author, you want to know all his/her publications, and when you hear about a place, you want to know where it is. Easy.

Such data may be included in a repository like Wikidata and, in essence, the data is still simple. You still want to know the score, the publications or the location. The question is how you get the data in a format that makes sense.

People are really good at understanding data when it is in an agreeable format. Here are three formats for the same data: a scientist in Wikidata. This is how Wikidata presents its data, and imho the data is really hard to understand. This is the same data in Reasonator, a general purpose tool that shows data and its relations. It can be used for all kinds of data; it is my go-to tool to get to grips with data related to one item. Finally, Scholia presents data formatted in a way that makes sense for this scientist.

Given how awful the default presentation of Wikidata is, it is obvious why everyone teaching the use of Wikidata focuses on querying the data; as a result, people seek and work on the results provided by what is their default tool. I typically focus on particular subjects. Today it was Dr Shima Taheri: I added a reference, some publications and genders for her co-authors. What triggers me to do this is the presentation of the data in the tools I use.

The holy grail for Wikidata is the use of its data in Wikipedia info boxes. However, people are taught to query data and that approach does not align well with the data items you find in info boxes. So when the purpose of Wikidata is in Wikipedia info boxes, presentation needs to become a priority.

Thursday, September 26, 2019

The lowest hanging fruit in #DBpedia

What I hate with a vengeance is make-work. DBpedia as a project retrieves information from all the Wikipedias, wrangles it into shape and publishes it. In one scenario a fact has unanimous support: one or more Wikipedias agree on it, and each may have its own references.

We should import such agreeable data without further ado. An additional manual step to import to Wikidata is not smart because manual operations introduce new errors. Arguably, when there is no unanimous support, manual intervention may improve the quality, but given the quantity of the data involved, it means that a lot of data will not become available. THAT in and of itself has a negative impact on the quality of available data as well.

So what to do? Harvest all the data that is of an acceptable quality, that is, the data DBpedia accepts for its own purposes. Enable an interface where people verify the data where their project is challenged.
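
The split between data to import and data to verify can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name `partition_facts` and the shape of the input (a mapping from a subject and property to the value each Wikipedia reports) are made up for the example and are not how DBpedia stores its data.

```python
def partition_facts(facts):
    """Split facts into those every source agrees on and those in dispute.

    facts: dict mapping (subject, property) -> dict of source -> value.
    This input shape is hypothetical, chosen only for illustration.
    """
    importable = {}    # unanimous: safe to import without a manual step
    needs_review = {}  # disputed: offer these to people for verification
    for key, by_source in facts.items():
        distinct = set(by_source.values())
        if len(distinct) == 1:
            importable[key] = next(iter(distinct))
        else:
            needs_review[key] = by_source
    return importable, needs_review
```

The disputed facts keep the value reported by each source, so an interface can show people where their own project is challenged.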

When we truly aim to engage people, we enable them to target the data they want to work on. I will happily work on scientists but do not expect me to work on "sucker stars". More than likely there will be people who care about soccer stars but not about "crazy professors".

Wednesday, September 25, 2019

With #DBpedia to the (data) cleaners

The people at DBpedia are data wranglers. What they do is make the most of the data provided to them by the Wikipedias, Wikidata and a generous sprinkling of other sources. They are data wranglers because they take what is given to them and make the data shine.

Obviously, it takes skill and resources to get the best result and obviously, some of the data gathered does not pass the smell test. The process the data wranglers use includes a verification stage as described in this paper. They have two choices when data that should be the same is not: they either have a preference or they go with the consensus, i.e. the result that shows up most often.

For data wranglers this is a proper choice. There is another option, for another day: these discrepancies are left for the cleaners.
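
The two choices, preference or consensus, can be sketched as follows. This is a minimal sketch of the idea, not the procedure from the paper; the function `resolve` and its arguments are hypothetical names for the example.

```python
from collections import Counter


def resolve(values_by_source, preferred=None):
    """Pick one value when sources disagree.

    values_by_source: dict mapping a source name to the value it reports.
    preferred: optional source name; when it reports a value, that value wins.
    Otherwise the value reported most often (the consensus) is returned.
    """
    if preferred is not None and preferred in values_by_source:
        return values_by_source[preferred]
    counts = Counter(values_by_source.values())
    value, _count = counts.most_common(1)[0]
    return value
```

Note that a tie has no real consensus; this sketch then simply returns one of the tied values, which is exactly the kind of case worth leaving to a human.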

With the process well described, the data openly advertised as available, the cleaners will come. First people akin to the wranglers, they have the skills to build the queries, the tools to slice and dice the data. When these tools are discovered, particularly by those who care about specific subsets, they will dive in and change things where applicable. They will seek the references, make the judgments necessary to improve what is there.

The DBpedia data wranglers are part of the Wikimedia movement and do more than build something on top of what the Wikis produced; DBpedia and the Wikimedia projects work together improving our movement's qualities. With the processing data generally available this will become even more effective.

Sunday, September 22, 2019

Comparing datasets, bigger or better or it does not matter?

When Wikidata was created, it was created with a purpose. It replaced the Wikipedia based interwiki links, it did a better job and it still does the best job at that. Since then the data has been expanded enormously; no longer can Wikidata be defined by its links to Wikipedia, as the linked items are now only a subset.

There are many ongoing efforts to extract information from the Wikipedias. The best organised project is DBpedia: it continuously improves its algorithms to get more and higher grade data and it republishes the data in a format that is both flexible and scalable. Information is also extracted from the Wikipedias by the Wikidata community. There are plenty of tools like PetScan and the awarder, and plenty of people working on single items one at a time.

Statistically, on the scale of Wikidata, individual efforts make little or no impression, but in the subsets the effects may be massive. Take, for instance, Siobhan, who works on New Zealand butterflies and other critters. Siobhan writes Wikipedia articles as well, strengthening the ties that bind Wikidata to Wikipedia. Her efforts have been noticed and Wikidata is becoming increasingly relevant to and used by entomologists.

There are many data sets; because of its wiki links, every Wikipedia is one as well. The notion that one is bigger or better does not really matter. It is all in the interoperability, it is all in the usability of the data. Wikipedia wiki links are highly functional and not interoperable at all. More and more Wikipedias accept that cooperation will get them better quality information for their readers. Once the biggest accept data as a resource and help curate the shared data, the result of comparing data sets is improved quality for all.

Saturday, September 07, 2019

Language barriers to @Wikidata

Wikidata is intended to serve all the languages of all the Wikipedias for starters. It does in one very important way; all the interwiki links or the links between articles on the same subject are maintained in Wikidata.

For most other purposes Wikidata serves the "big" languages best, particularly English. This is awkward because people reading other languages in particular stand to gain most from Wikidata. The question is: how do we chip away at this language barrier?

Giving Wikidata data an application is the best way to entice people to give Wikidata a second look. Here are two:
  • Commons is being wikidatified and it now supports a "depicts" statement. As more labels become available in a language, finding pictures in "your" language becomes easy and obvious. It just needs an application.
  • Many subjects are likely to be of interest in a language. Why not have projects like the Africa project with information about Africa shared and updated by the Listeria bot? Add labels and it becomes easier to use, link to Reasonator for understanding and add articles for a Wikipedia to gain content.
Key is the application of our data. Wikidata includes a lot, the objective is to find the labels and we will when the results are immediately applicable. It will also help when we consider the marketing opportunities that help foster our goals.


@Wikidata - #Quality is in the network

What amounts to quality is a recurring and controversial subject. For me quality is not so much in the individual statements for a particular Wikidata item, it is in how it links to other items.

As always, there has to be a point to it. You may want to write Wikipedia articles about chemists, artists, award winners. You may want to write to make the gender gap less in your face but who to write about?

Typically connecting to small subsets is best. However we want to know about the distribution of genders so it is very relevant to add a gender. Statistically it makes no difference in the big picture but for subsets like: the co-authors of a scientist or a profession, an award, additional data helps understand how the gender gap manifests itself.
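
Counting genders in such a subset is simple once the data is there. The sketch below assumes the co-authors have already been fetched into a list of plain dictionaries; the function name and the record layout are made up for the example.

```python
from collections import Counter


def gender_distribution(people):
    """Return {gender: (count, share)} for a list of person records.

    people: list of dicts with an optional "gender" key (hypothetical layout).
    Records without a gender are counted as "unknown", which shows how much
    data is still missing in the subset.
    """
    counts = Counter(p.get("gender") or "unknown" for p in people)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gender: (n, n / total) for gender, n in counts.items()}
```

The share of "unknown" is the part that adding a gender statement shrinks, and in a small subset every single addition is visible.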

The inflation of "professions" like "researcher" is such that it is no longer distinctive, at most it helps with the disambiguation from for instance soccer stars. When a more precise profession is known like "chemist" or "astronomer", all subclasses of researcher, it is best to remove researcher as it is implied.
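
Removing an implied profession can be sketched like this. The subclass table is a toy stand-in for the subclass relations in Wikidata, and both function names are hypothetical.

```python
# Toy subclass table: each occupation points to its parent occupation.
# In Wikidata this relation would come from "subclass of" statements.
SUBCLASS_OF = {
    "chemist": "researcher",
    "astronomer": "researcher",
    "researcher": None,
}


def ancestors(occupation, subclass_of):
    """Collect every occupation that the given one implies."""
    seen = set()
    parent = subclass_of.get(occupation)
    while parent is not None and parent not in seen:
        seen.add(parent)
        parent = subclass_of.get(parent)
    return seen


def prune_implied(occupations, subclass_of):
    """Drop occupations already implied by a more precise one."""
    implied = set()
    for occupation in occupations:
        implied |= ancestors(occupation, subclass_of)
    return [o for o in occupations if o not in implied]
```

With this table, a person listed as both "researcher" and "chemist" keeps only "chemist", while a person listed only as "researcher" keeps that value.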

Lists like the members of the "Young Academy of Scotland" have their value when they link as widely as possible. Considering only Wikidata misses the point; it is particularly the links to the organisations and the authorities (ORCID, Google Scholar, VIAF) but also Twitter, like for this psychologist. We may have links to all of them, the papers, the co-authors. But do we provide quality when people do not go into the rabbit hole?

Sunday, August 25, 2019

There is much more to read; introducing the "one page wonder"

Given that our aim is to share in the "sum of all knowledge", realistically we will not have it all at our disposal to share. It is also fairly likely that we will not know about all subjects.

When you google for a given subject, it is as likely as not that you will drown in too much data, too many false friends or find nothing at all when there is nothing to find in "your" language.

Increasingly, what we know about in the Wikiverse is linked to a Wikidata item. Pictures may depict a subject, articles may be written about a subject and all of them refer to a Wikidata item that may have labels in any language. Items that may even have links to references.

When we are to find more to read for Wikipedia readers, we need a mechanism, a place where we can link a subject to external resources. Resources like the Internet Archive, "your" library, and the papers we know in WikiCite, but then the free versions of these papers. The page will show the label in "your" language and a picture. It links to all the pictures depicting the subject as well.

Putting the "one page wonder" in production is easy. It is all on one page and is fully internationalised. The localisation is done at and when people want to make it useful in "their" language, they will add the missing labels for the Wikidata items.

With the "one page wonder" in place it becomes interesting:
  • Is "your" local library known to us and do we get your permission to find it for you? How do we supply "your" library with a search string?
  • The Internet Archive's Wayback Machine may have content in "your" language but can you navigate its English-only user interface?
  • What other organisations do we want to partner with to provide you with more to read?
  • Will we be able to show local pictures? A Dutch cow looks different from an Indian cow.
  • What other issues will there be?
  • Oh and yes, we can include the Reasonator, queries and what have you. We just have to think about what to show.
GerardM

The @Wikimedia movement infrastructure most people even do not know

Just consider this: there are more than 200 functioning Wikipedias and this is only possible because people localise the MediaWiki software in over 280 languages. This makes the website where all this work happens a strategic resource to the Wikimedia movement.

Internationalisation (i18n) and localisation (l10n) are an integral part of software development. They are part of a continuous process and they require constant attention. The day to day jobs are well in hand. The localisation itself is a community effort and, with developers continually expanding the software base, a continuous effort is needed from the translators to keep up with their language. This is hard and for many languages it is a struggle to keep up with even the "most used messages".

Managing this effort is an ongoing task; it is essential to maintain the i18n and the localisation optimally. It follows that it should be obvious what messages have the biggest impact, first on the readers and then on the editors of a Wikipedia. What should be in the "most used messages" changes over time and when it is considered strategic, such maintenance is to be considered a Wikimedia/MediaWiki undertaking.

Translatewiki has always been an independent partner of the Wikimedia Foundation and it has always been firmly part of the Wikimedia movement. Given that partnerships are a key part of the strategic plans of the WMF, the proof of the partnership pudding is very much in how it interacts with its partners. TWN does not need to be part of the WMF organisation for the WMF to fund TWN; it is clearly a quid pro quo. The WMF should even encourage TWN and other partners to collaborate on their i18n and l10n and enable this for strategic purposes, strengthening these partners globally.