Tuesday, April 23, 2019

Scopus is "off side"

At Wikidata we have all kinds of identifiers for all kinds of subjects. All of them aim to provide unique identifiers, and the value of Wikidata is that it brings them together, allowing the information from multiple sources about the same subject to be combined.

Scientists may have a Scopus identifier. In Wikidata, Scopus is very much a second-rate system because learning which identifier goes with which person requires jumping through proprietary hoops. Scopus sits behind a paywall, it has its own advertising budget, and consequently it does not need the effort of volunteers like me to put the spotlight on the science it holds for ransom. When we come across Scopus identifiers we include them, but they remain second-class citizens.

At Wikipedia we have been blindsided by scientists who gained awards and became instant sensations because of their accomplishments. For me this is largely the effect of us not knowing who they are or what their work is. Thanks to ORCiD, we increasingly know about more and more scientists and their work. When we do not know of them, when their work is hidden from the real world, I don't mind. When we do know about them and their work in Wikidata, it is different. That is when we could, and should, recognise their notability.
Thanks,
      GerardM

Sunday, April 14, 2019

The Bandwidth of Katie Bouman

First things first: yes, many people were involved in everything it took to make the picture of a black hole. However, the reason it is justified that Katie Bouman is the face of this scientific novelty is that she developed the algorithms needed to distill the image from the data. To give you a clue about the magnitude of the problem she solved: the data was physically shipped on hard drives from multiple observatories. For big science, the Internet often cannot cope.

There are eternal arguments about why people are notable in Wikipedia. For a lot of that knowledge a static environment like Wikipedia is not appropriate, and that environment is what causes many of those arguments. To come back to Katie, or rather to every scientist: their work is collaborative and much of it is condensed into "scientific papers". One of the black hole papers is "First M87 Event Horizon Telescope Results. I. The Shadow of the Supermassive Black Hole". This paper has many authors, not only "Katherine L. Bouman". When a major event like a first picture of a black hole is added, it is understandable that a paper like this is at first attributed to a single author.

Wikimedia projects have to deal with the ramifications of science for many reasons. The most obvious one is that papers are used for citations. To do this properly, it is the science that defines what is written, not papers selected to support an opinion. The public is invited to read these papers, yet the current Wikipedia narrative rests on single papers and single points of view. This makes some sense because the presentation is static. In Wikidata the papers on any given topic are continuously expanded; the same needs to be true for papers by any given author. Technically a Wikipedia could use Wikidata as the source for publications on a subject or by an author. The author could be Katie Bouman, and a proper presentation would make it obvious that the pictures of a black hole were a group effort, with Katie responsible for the algorithms.
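
For the technically inclined: a Wikipedia that uses Wikidata as its source could list an author's publications with little more than a query against the Wikidata Query Service. The Python sketch below is only an illustration of that idea; the item identifier is a placeholder, not a confirmed identifier for Katie Bouman.

    # Minimal sketch: list the works whose author (P50) is a given Wikidata item.
    # The QID passed in at the bottom is a placeholder, not a confirmed item.
    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"

    def publications_by_author(author_qid):
        query = """
        SELECT ?work ?workLabel WHERE {
          ?work wdt:P50 wd:%s .
          SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
        }
        """ % author_qid
        response = requests.get(
            ENDPOINT,
            params={"query": query, "format": "json"},
            headers={"User-Agent": "sum-of-all-knowledge-sketch/0.1"},
        )
        response.raise_for_status()
        bindings = response.json()["results"]["bindings"]
        return [(b["work"]["value"], b["workLabel"]["value"]) for b in bindings]

    # Example use; replace the placeholder with the author's actual item.
    for uri, title in publications_by_author("Q00000000"):
        print(title, uri)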
Thanks,
       GerardM

Tuesday, April 09, 2019

@Wikidata is no relational #database

When you consider the functionality of Wikidata, it is important to appreciate that it is not a relational database. As a consequence there is no implicit way to enforce restrictions. Emulating relational restrictions fails because it is not possible to check in real time what it is that is to be restricted.

An example: in a process, new items are created when no item with a given external identifier is available. A query indicates that no such item exists and a new item is created. A few moments later the existence of an item with the same external identifier is checked, again using a query. Because of the time lag between what the query service knows and what is actually in the database, the query indicates there is no item and a new but duplicate item is created.

The implications are important.

Wikidata is a wiki, and the implications are quite different. In a wiki things need not be perfect, and the restrictions of a relational model are in essence recommendations only. In such a model the duplicate items described above are not a real problem; batch jobs may merge them when they occur often enough. Processes may keep track of the items they created earlier, for instance in an array, and thereby minimise the issue.
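
To illustrate that mitigation, here is a minimal Python sketch of a process that remembers the items it created itself instead of trusting the lagging query service; find_item_by_external_id() and create_item() are placeholders for whatever the real process uses.

    # Sketch of the mitigation described above: remember the items created in
    # this run, because they will not show up in the query service right away.
    created_this_run = {}  # external identifier -> item created in this run

    def get_or_create_item(external_id, find_item_by_external_id, create_item):
        # 1. Check our own record first; items created moments ago are not yet
        #    visible through the query service.
        if external_id in created_this_run:
            return created_this_run[external_id]
        # 2. Only then ask the (possibly lagging) query service.
        existing = find_item_by_external_id(external_id)
        if existing is not None:
            return existing
        # 3. Nothing found anywhere: create the item and remember it locally.
        new_item = create_item(external_id)
        created_this_run[external_id] = new_item
        return new_item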

What is important is that we do not blame people for what Wikidata is not, and that we accept its limitations. Functionality like SourceMD enables what Wikidata may become: a link to all knowledge. Never mind whether it is knowledge in Wikipedia articles, scholarly articles or the sources used to prove whatever point.
Thanks,
      GerardM

Sunday, March 24, 2019

#Sharing in the Sum of all #Knowledge from a @Wikimedia perspective II

When we are to share in the "sum of all knowledge", we share what we know about subjects: articles, pictures, data. We may share what knowledge we have and what others have, and that is what it takes for us to share in the sum of all knowledge. The questions are why we should share all this, how to go about it, and finally how it will benefit our public and help us share the sum of all knowledge.

At the moment we do not really know what people are looking for. One reason is that search engines like the ones by Google, Microsoft and DuckDuckGo recommend Wikipedia articles, and as a consequence the search process is hidden from us. However, some people prefer the "Wikipedia search engine" in their browser. For them we can do better and present more interesting search results. From a statistical point of view, we do not need big numbers to gain significant results.

When we check what the "competition" does, we find their results in many tabs; "the web" and "images" are the first two. The first is text based and offers whatever there is on the web. What we will bring is whatever we, and the organisations we partner with, have to offer. It will be centered on subjects and their associated factoids, presented in any language.

One template to consider is how Scholia presents its subjects. The presentation differs depending on whether the subject is a publication, a university, a scholar or a paper. Large numbers make specific presentations feasible, and thanks to Wikidata we know what kind of presentation fits a particular subject. A similar approach is possible for sports or politics. It takes experimentation, and that is what makes it a wiki approach.
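
To make this concrete, here is a small Python sketch of how software could pick a presentation based on what an item is an instance of (P31). This is not Scholia's actual code, and the mapping from classes to presentations is only illustrative.

    # Sketch: choose a presentation template from an item's "instance of" (P31).
    import requests

    PRESENTATION_BY_CLASS = {
        "Q5": "scholar",                 # human
        "Q3918": "university",           # university
        "Q13442814": "scholarly article",
    }

    def presentation_for(qid):
        url = "https://www.wikidata.org/wiki/Special:EntityData/%s.json" % qid
        entity = requests.get(url).json()["entities"][qid]
        for claim in entity.get("claims", {}).get("P31", []):
            value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
            if value.get("id") in PRESENTATION_BY_CLASS:
                return PRESENTATION_BY_CLASS[value["id"]]
        return "generic"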

Thanks to this subject-based approach, language plays a different role. What is vital is that, for finding the subjects, potentially differing labels are available or become available. One important difference with the Google, Microsoft or DuckDuckGo approach is that, as a wiki, we can ask people to add labels and missing statements. This will make our subject-based data better understood in the languages people support. Yes, we can ask people to have a Wikimedia profile, and yes, we may ask people to support us where we think people looking for information have to overcome hurdles.
Thanks,
       GerardM

Saturday, March 16, 2019

#Sharing in the Sum of all #Knowledge from a @Wikimedia perspective I

Sharing the sum of all knowledge is what we have always aimed for in our movement. In Commons we have realised a project that illustrates all Wikimedia projects and in Wikidata we have realised a project that links all Wikimedia projects and more.

When we tell the world about the most popular articles in Wikipedia, it is important to realise that we do not say what the most popular subjects are. We could, but so far we don't. The popularity of a subject is the sum of the traffic of all Wikipedia articles on that subject. Providing this data is feasible; it is a "big data" question.

We do have accumulated data for the traffic of articles on all Wikipedias, and we can link the articles to their Wikidata items. What follows is simple arithmetic, and powerful arithmetic at that, because it will show that English Wikipedia accounts for less than fifty percent of all traffic. That will help make the existing bias towards English Wikipedia and its subjects visible, particularly because it will be possible to answer a question like "What are the most popular subjects that do not have an article in English?" and compare those to popular diversity articles.
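
To show how simple the arithmetic is, here is a rough Python sketch: take the sitelinks of one Wikidata item and add up the monthly page views of every Wikipedia article about that subject, using the public Wikimedia pageviews API. The item and the dates are examples, and the test for what counts as a Wikipedia is deliberately crude.

    # Sketch: sum the page views of all Wikipedia articles about one subject.
    import requests

    HEADERS = {"User-Agent": "sum-of-all-knowledge-sketch/0.1"}
    PAGEVIEWS = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
                 "%s/all-access/all-agents/%s/monthly/%s/%s")

    def traffic_for_subject(qid, start="2019030100", end="2019033100"):
        url = "https://www.wikidata.org/wiki/Special:EntityData/%s.json" % qid
        sitelinks = requests.get(url, headers=HEADERS).json()["entities"][qid]["sitelinks"]
        total = 0
        for site, link in sitelinks.items():
            # Crude filter: site keys like "enwiki" or "zh_yuewiki" are Wikipedias,
            # keys like "commonswiki" or "enwikiquote" are not.
            if not site.endswith("wiki") or site in ("commonswiki", "specieswiki", "metawiki"):
                continue
            project = site[:-4].replace("_", "-") + ".wikipedia.org"
            article = link["title"].replace(" ", "_")
            response = requests.get(PAGEVIEWS % (project, article, start, end), headers=HEADERS)
            if response.ok:
                total += sum(month["views"] for month in response.json()["items"])
        return total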

In Wikidata we know about the subjects of all Wikipedias, but it too is very much a project based on English. That is a pity when Wikidata is to be the tool that helps us find which subjects people are looking for that are missing in a Wikipedia. For some Wikipedias there is an extension to the search functionality that helps find information. It uses Wikidata and it supports automated descriptions.

Now consider that this tool is available on every Wikipedia. We would share more information. With some tinkering, we would know what is missing where. There are other opportunities; we could ask logged-in users to help by adding labels in their language to improve Wikidata. When Wikidata does not include the missing information, we could ask them to add a Wikidata item, additional statements and a description to improve our search results.

This data approach is based on the result of a process, the negative results of our own search, and on the active cooperation of our users. At the same time, we accumulate negative results of searches where there has been no interaction, link them to Wikidata labels and gain an understanding of the relevance of these missing articles. This fits in nicely with the marketing approach to "what it is that people want to read in a Wikipedia".
Thanks,
      GerardM

Saturday, March 09, 2019

A #marketing approach to "what it is that people want to read in a @Wikipedia"

All the time people want to read articles in a Wikipedia, articles that are not there. For some Wikipedias that is obvious because there is so little, and, based on what people read in other Wikipedias, recommendations have been made suggesting what would generate new readers. This has been the approach so far; a quite reasonable approach.

This approach does not consider cultural differences, and it does not consider what is topical in a given "market". To find an answer to the question of what people want to read, there are several strategies. One is what researchers do: they ask panels, write papers, and once that is done there is a position to act upon. There are drawbacks:
  • you can only research so many Wikipedias
  • for all the other Wikipedias there is no attention
  • the composition of the panels is problematic, particularly when they are self-selecting
  • there are no results while the research is being done
The objective of a marketing approach is centered around two questions: 
  • what is it that people are looking for now (and cannot find) 
  • what can be done to fulfill that demand now
The data needed for this approach: negative search results. People search for subjects all the time and there are all kinds of reasons why they do not find what they are looking for. Spelling, disambiguation and nothing to find are all perfectly fine reasons for a no-show.

The "nothing to find" scenario is obvious; when it is sought often, we want an article. Exposing a list of missing articles is one motivator for people to write. Once they have written, we do have the data of how often an article was read. When the most popular new articles of the last month are shown, it is vindication for authors to have written popular articles. It is easy, obvious and it should be part of the data Wikimedia Foundation already collects.. In this way the data is put to use. It is also quite FAIR to make this data available. 

For the "disambiguation" issue, Wikidata may come to the rescue. It knows what is there and, it is easy enough to add items with the same name for disambiguation purposes. Combine this with automated descriptions and all that is requires is a user interface to guide people to what they are looking for. When there is "only" a Wikidata item, it follows that its results feature in the "no article" category.

The "spelling" issue is just a variation on a theme. Wikidata does allow for multiple labels. The search results may use of them as well. Common spelling errors are also a big part of the problem. With a bit of ingenuity it is not much of a problem either.

Marketing this marketing approach should not be hard. It just requires people to accept what is staring them in the face. It is easy to implement, it works for all 280+ languages, and it is likely to give a boost not only to all the other Wikipedias but also to Wikidata.
Thanks,
        GerardM

Sunday, February 17, 2019

@WikiResearch - Nihil de nobis, sine nobis

There is this wonderful notion that Research is going to tell us what to do in light of the strategic Wikimedia 2030 plans. Wonderful. There is going to be a taxonomy of the information we are missing.

Let me be clear: we do need research, and the data it is based on is to be available to us. There is no point in a future taxonomy of missing knowledge when we have been asking for decades: "what articles are people looking for that they cannot find?" If there is to be a taxonomy, what else should it be based on?

When we are to fill in the gaps in what Wikipedia covers, we can stimulate more new articles by indicating what traffic they get in their first month. We can stimulate our readers to learn more by showing what Wikidata has to offer and by showing its links to texts in other languages. It may even result in new stubs, even articles, in "their" language. This technology has been available for years now.

WikiResearch is full of arguments on the importance of citations and of Wikidata as the platform for all Wikipedia sources, so why are the WikiResearch papers not in Wikidata from the start? What is it that makes WikiResearchers consider that Wikidata is not about them, just as it is about any other subject Wikidata covers? What is it that makes their work less findable (FAIR) than what is known to have been published as open content by the NIH?

The point I want to make is that, no matter how well intended what WikiResearch aims to achieve is, they lose the interest, involvement and commitment of people like me, the very people they need to get the results they aim for.

Yes, do research, but we should not wait for its results; we already know how to stimulate people to write new articles.
Thanks,
      GerardM