Sunday, September 22, 2019

Comparing datasets: bigger or better, or does it not matter?

When Wikidata was created, it was created with a purpose: it replaced the Wikipedia-based interwiki links, it did a better job, and it still does the best job at that. Since then the data has been expanded enormously; Wikidata can no longer be defined by its links to Wikipedia, as those are now only a subset.

There are many ongoing efforts to extract information from the Wikipedias. The best organised project is DBpedia; it continuously improves its algorithms to get more and higher-grade data, and it republishes the data in a format that is both flexible and scalable. Information is also extracted from the Wikipedias by the Wikidata community: plenty of tools like PetScan and the Awarder, and plenty of people working on single items one at a time.

Statistically, on the scale of Wikidata, individual efforts make little or no impression, but in the subsets the effects may be massive. Take, for instance, Siobhan working on New Zealand butterflies and other critters. Siobhan writes Wikipedia articles as well, strengthening the ties that bind Wikidata to Wikipedia. Her efforts have been noticed, and Wikidata is becoming increasingly relevant to and used by entomologists.

There are many datasets; because of its wiki links, every Wikipedia is one as well. The notion that one is bigger or better does not really matter. It is all in the interoperability, it is all in the usability of the data. Wikipedia wiki links are highly functional and not interoperable at all. More and more Wikipedias accept that cooperation will get them better quality information for their readers. Once the biggest accept data as a resource to curate the shared data, comparing datasets improves quality for all.
Thanks,
      GerardM

Saturday, September 07, 2019

Language barriers to @Wikidata

Wikidata is intended to serve all the languages of all the Wikipedias for starters. It does in one very important way: all the interwiki links, the links between articles on the same subject, are maintained in Wikidata.

For most other purposes Wikidata serves the "big" languages best, particularly English. This is awkward because it is precisely the people reading other languages who stand to gain the most from Wikidata. The question is: how do we chip away at this language barrier?

Giving Wikidata data an application is the best way to entice people to give Wikidata a second look. Here are two:
  • Commons is being wikidatified and it now supports a "depicts" statement. As more labels become available in a language, finding pictures in "your" language becomes easy and obvious. It just needs an application; a query sketch follows this list.
  • Many subjects are likely to be of interest in a language. Why not have projects like the Africa project, with information about Africa shared and updated by the Listeria bot? Add labels and it becomes easier to use, link to Reasonator for understanding, and add articles for a Wikipedia to gain content.
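
To make the query side concrete, here is a minimal sketch, assuming the Wikimedia Commons Query Service (WCQS) as the endpoint, of how an application could find files that depict a subject via "depicts" (P180). The example subject, house cat (Q146), is arbitrary:

    # Sketch for the Wikimedia Commons Query Service (WCQS):
    # find Commons files that depict (P180) a house cat (Q146).
    SELECT ?file WHERE {
      ?file wdt:P180 wd:Q146 .   # file depicts house cat
    }
    LIMIT 50

With a label for Q146 in "your" language, the same search works without knowing any English; the application only needs to resolve the label to the item.
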
Key is the application of our data. Wikidata includes a lot; the objective is to find the labels, and we will when the results are immediately applicable. It will also help when we consider the marketing opportunities that help foster our goals.

Thanks,
      GerardM

@Wikidata - #Quality is in the network

What amounts to quality is a recurring and controversial subject. For me quality is not so much in the individual statements of a particular Wikidata item; it is in how the item links to other items.

As always, there has to be a point to it. You may want to write Wikipedia articles about chemists, artists or award winners. You may want to write to make the gender gap less in your face, but who to write about?

Typically, connecting to small subsets is where it matters most. However, we also want to know about the distribution of genders, so it is very relevant to add a gender. Statistically it makes no difference in the big picture, but for subsets, like the co-authors of a scientist, a profession or an award, the additional data helps us understand how the gender gap manifests itself.
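
A minimal sketch of such a subset query on the Wikidata Query Service; physicist (Q169470) stands in for whatever subset interests you:

    # Gender distribution among people with occupation physicist (Q169470).
    SELECT ?genderLabel (COUNT(DISTINCT ?person) AS ?count) WHERE {
      ?person wdt:P106 wd:Q169470 ;   # occupation: physicist
              wdt:P21 ?gender .       # sex or gender
      ?gender rdfs:label ?genderLabel .
      FILTER(LANG(?genderLabel) = "en")
    }
    GROUP BY ?genderLabel
    ORDER BY DESC(?count)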

The inflation of "professions" like "researcher" is such that it is no longer distinctive; at most it helps with disambiguation from, for instance, soccer stars. When a more precise profession is known, like "chemist" or "astronomer", both subclasses of researcher, it is best to remove "researcher" as it is implied.
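
A sketch of a maintenance query that would surface these redundant statements; it assumes Q1650915 is the item for "researcher", which should be verified before acting on the results:

    # People whose occupations include both researcher (Q1650915)
    # and a more specific occupation that is a subclass of it.
    SELECT DISTINCT ?person ?occupation WHERE {
      ?person wdt:P106 wd:Q1650915 ;       # occupation: researcher
              wdt:P106 ?occupation .       # ... and another occupation
      ?occupation wdt:P279+ wd:Q1650915 .  # which is a subclass of researcher
    }
    LIMIT 100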

Lists like the members of the "Young Academy of Scotland" have their value when they link as widely as possible. Considering only Wikidata misses the point; it is particularly the links to the organisations and the authorities (ORCID, Google Scholar, VIAF), but also Twitter, as for this psychologist. We may have links to all of them, the papers, the co-authors. But do we provide quality when people do not go down the rabbit hole?
Thanks,
      GerardM

Sunday, August 25, 2019

There is much more to read; introducing the "one page wonder"

Given that our aim is to share in the "sum of all knowledge", realistically we will not have it all at our disposal to share. It is also fairly likely that we will not know about all subjects.

When you google for a given subject, it is as likely as not that you will drown in too much data, too many false friends or find nothing at all when there is nothing to find in "your" language.

Increasingly, what we know about in the Wikiverse is linked to a Wikidata item. Pictures may depict a subject, articles may be written about a subject, and all of them refer to a Wikidata item that may have labels in any language; items that may even have links to references.

When we are to find more to read for Wikipedia readers, we need a mechanism, a place where we can link a subject to external resources. Resources like the Internet Archive, "your" library, the papers we know in WikiCite, but linking to the free versions of those papers. The page will show the label in "your" language and a picture. It links to all the pictures depicting the subject as well.
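
As a sketch of the data such a page could pull together for one subject, here is a query for the label in the reader's language, a picture and one external identifier. William Shakespeare (Q692), Dutch and VIAF (P214) are arbitrary example choices:

    # The basics a "one page wonder" could show for one subject:
    # a label in the reader's language, an image, an external identifier.
    SELECT ?label ?image ?viaf WHERE {
      OPTIONAL { wd:Q692 rdfs:label ?label . FILTER(LANG(?label) = "nl") }
      OPTIONAL { wd:Q692 wdt:P18 ?image . }    # image
      OPTIONAL { wd:Q692 wdt:P214 ?viaf . }    # VIAF identifier
    }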

Putting the "one page wonder" in production is easy. It is all on one page and is fully internationalised. The localisation is done at translatewiki.net and when people want to make it useful in "their" language, they will add the missing labels for the Wikidata items.

With the "one page wonder" in place it becomes interesting:
  • Is "your" local library known to us and do we get your permission to find it for you. How do we supply "your" library with a search string?
  • The Internet Archive's wayback machine may have content in "your" language but can you navigate its English only user interface. 
  • What other organisations do we want to partner with to provide you with more to read
  • Will we be able to show local pictures, a Dutch cow looks different from an Indian cow..
  • What other issues will there be..
  • Oh and yes, we can include the Reasonator, queries and what have you.. we just have to think about what to show.
Thanks,
       GerardM

#Translatewiki.net, the @Wikimedia movement infrastructure most people do not even know


Just consider this: there are more than 200 functioning Wikipedias, and this is only possible because people localise the MediaWiki software in over 280 languages. It makes translatewiki.net, the website where all this work happens, a strategic resource for the Wikimedia movement.

Internationalisation (i18n) and localisation (l10n) are an integral part of software development. They are part of a continuous process and require constant attention. The day-to-day jobs are well in hand. The localisation itself is a community effort, and with developers continually expanding the software base, a continuous effort is needed from the translators to keep up with their language. This is hard, and for many languages it is a struggle to keep up with even the "most used messages".

Managing this is a continuous effort; it is essential to maintain the i18n and the l10n optimally. It follows that it should be obvious which messages have the biggest impact, first on the readers and then on the editors of a Wikipedia. What should be in the "most used messages" changes over time, and when it is considered strategic, such maintenance is to be considered a Wikimedia/MediaWiki undertaking.

Translatewiki has always been an independent partner of the Wikimedia Foundation, and it has always been firmly part of the Wikimedia movement. Given that partnerships are a key part of the strategic plans of the WMF, the proof of the partnership pudding is very much in how it interacts with translatewiki.net. TWN does not need to be part of the WMF organisation for the WMF to fund it; it is clearly a quid pro quo. The WMF should even encourage TWN and other partners to collaborate on their i18n and l10n and enable this for strategic purposes, strengthening these partners globally.
Thanks,
     GerardM

Sunday, August 11, 2019

How to value open data and why Wikidata won't go stale

The data in Wikidata is data everyone knows or could know. A lot of awful things could be said of its content and quality, and all of it misses one important point: it is being used, its use is increasing, it is increasingly used by Wikipedias, and that provides an incentive to maintain the data.

What Wikipedia indicates is that most data is stable, not stale. A date of birth, a place of birth: so much remains the same. When we bury data in text, it is always a challenge to get the data out. When we bury data in Wikidata, it just takes a query to bring it back to life. Who was a member of multiple "National Young Academies, Similar Bodies and YS Networks", for instance? You do not find it in the texts of those organisations, but you will increasingly find it in Wikidata. Once the data is in there, it is stable and available for query.
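
A sketch of the kind of query meant here, counting "member of" (P463) statements into a set of academies; the Q-ids in the VALUES clause are placeholders, to be replaced by the actual items for these organisations:

    # People who are members (P463) of more than one of a given set
    # of young academies. The Q-ids below are placeholders.
    SELECT ?person (COUNT(DISTINCT ?academy) AS ?memberships) WHERE {
      VALUES ?academy { wd:Q000001 wd:Q000002 }  # substitute real items
      ?person wdt:P463 ?academy .
    }
    GROUP BY ?person
    HAVING (COUNT(DISTINCT ?academy) > 1)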

As GLAMs make their content available under a free license, their collections gain relevance because they gain an audience. Just consider that only a small part is on display to the public in the GLAM itself, while on Commons it is there for all to find. Commons is being wikidatified, and those collections become available in any language, gaining additional relevance in the process.

The best example is what the Biodiversity Heritage Library does. It is instrumental in the digitisation of books, it makes them publicly available, and it gains an audience for the collections they come from. Volunteers prove themselves in this process, and both professionals and the wider world benefit. From a data perspective the data is new, because it is only now available.

When a publisher mocks open data, it is self-serving. It is in their interest that data is inaccessible, only there for those who pay. There are plenty of examples of great data initiatives that went to ground; obviously, when the data does not pay the rent, publishers will pull the plug. It is different for the data at Wikidata. It is managed by an organisation that has as its motto "share in the sum of all knowledge". The audience the WMF has makes it a world top-ten website; it is not for sale and it is not going anywhere. As long as there are people like me who care about the availability of information, the data at Wikidata may go stale in places, waiting for another volunteer to pick up the slack.
Thanks,
      GerardM

Saturday, August 10, 2019

#Statistics, or how many researchers are a #physicist

At @Wikidata most "researchers" are given this "occupation" out of convenience. We do not know how to label them properly, there are too many, so, as all scholars must be researchers, we make them so.

Nothing inherently wrong here; it is better to know them for what they also are than to know nothing about them at all. One issue though: we do not know the physicists from the chemists, from the behaviorists or any other specialism in science. We can query for physicists anyway, but we will not catch them all.

Queries that show the numbers for a profession are easy enough to make. The value of such one-time wonders is minimal; the results are fleeting: any moment now another scientist like Walter Hofstetter may become known to be a physicist, and the numbers are no longer true. They become useful when we run queries like these regularly, save the results, and present them like Magnus does for Wikidata itself.
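
Such a snapshot query is indeed easy. A minimal sketch for physicists, which would have to be run and stored on a schedule to show a trend the way Magnus does:

    # Snapshot count of people with occupation physicist (Q169470).
    SELECT (COUNT(DISTINCT ?person) AS ?physicists) WHERE {
      ?person wdt:P106 wd:Q169470 .  # occupation: physicist
    }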

What it takes is a mechanism that mimics Magnus's approach. We gain insight into how Wikidata is performing over time, and it provides motivation for the people who care, for instance about physicists.
Thanks,
       GerardM

Tuesday, August 06, 2019

#Statistics for National Young Academies

The Global Young Academy is linked to many national young academies. Together they represent many relevant scientists, they represent all of science, and they are interested in representing science to the public. The question: how can we make them visible?

First you add the organisations to Wikidata, and then the scientists to the organisations. When you then add the same Listeria lists to Wikipedias, we will see a picture when we have one, and we may notice who has a local Wikipedia article.
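
A sketch of the SPARQL such a Listeria list could be driven by; the academy's Q-id is a placeholder to be substituted:

    # Members (P463) of one young academy, with a portrait when we have one.
    # wd:Q000003 is a placeholder; substitute the academy's actual item.
    SELECT ?person ?personLabel ?image WHERE {
      ?person wdt:P463 wd:Q000003 .          # member of: the academy
      OPTIONAL { ?person wdt:P18 ?image . }  # image
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }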

There are many interesting statistics possible (a query sketch for the first one follows the list):
  • the gender ratio
  • the different professions in the mix
  • awards received 
  • the known number of publications per person
  • the organisations they are employed at
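
A sketch for the gender ratio among one academy's members; again the academy's Q-id is a placeholder:

    # Gender ratio among the members of one academy (placeholder wd:Q000003).
    SELECT ?gender (COUNT(DISTINCT ?person) AS ?count) WHERE {
      ?person wdt:P463 wd:Q000003 ;  # member of: the academy
              wdt:P21 ?gender .      # sex or gender
    }
    GROUP BY ?gender
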
However, first things first. It is my intention to include all the current members and the alumni of the Young Academy of Sweden before Wikimania. Second, these scholars are bright :) once they put their minds to it, they will help themselves to nice statistics based on the info we accumulate in Wikidata. These statistics can be linked to from the Wikipedia pages.
Thanks,
     GerardM

Sunday, August 04, 2019

Helping @Wikipedia readers find their read, one author, one publication at a time

Reading is what the public of Wikipedia does, and in a way every Wikipedia is an invitation to further reading. Wikipedia is an encyclopedia and, by definition, its coverage of a subject is limited. Its reliability is defined by its sources, and they themselves are typically a subset of what may be read about that subject.

The quality of the invitation for further reading differs. How do we invite people to read a Shakespeare in Dutch, German, Malayalam, Kannada or even English?

The primary partner in this quest for further reading: the local library. We can put all of them on a map and invite people to go there, or to probe its website for further reading. Having them all in Wikidata with their coordinates puts libraries, and what they stand for, on the map. We can invite them to use services like OpenLibrary or WorldCat; the bottom line: people read.
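
Putting them all on a map is a standard Wikidata Query Service pattern. A minimal sketch, assuming libraries are instances (P31) of library (Q7075) or a subclass, with a coordinate location (P625):

    #defaultView:Map
    # Libraries with coordinates, shown on a map by the query service.
    SELECT ?library ?libraryLabel ?coord WHERE {
      ?library wdt:P31/wdt:P279* wd:Q7075 ;  # instance of (a kind of) library
               wdt:P625 ?coord .             # coordinate location
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 1000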

In this people-first approach, the user interface is in the language people want to read their book in. It follows that the screen may be sparse. If it is to be a success, it is run like a business. We have statistics on libraries, people seeking, books found, and a perspective over time. It is about people reading books, not about transliterating books. Our business model: people reading. Funding is by people and organisations who care about more reading by more people. The numbers entice people to volunteer their efforts, making more books and publications available in the language they care about.

To make this happen, the WMF takes the lead, enabling and maintaining such a system and partnering with any and all organisations that care about this, organisations like OCLC and the Internet Archive.

We will succeed when we make the effort.
Thanks,
       GerardM