Saturday, March 31, 2012

# Wikidata, the interview

The press release is out, and the mailing lists are full of it. But what is this brand new Wikidata project about? Who better to ask than Lydia Pintscher and Daniel Kinzler. Lydia does "community communications" for the Wikidata project, and Daniel has been involved in many data-related projects, including OmegaWiki, the first Wikidata iteration.

Wikidata is still brand new; it does not even have its own logo yet. It is ambitious, however, and I expect that it will improve the quality and the consistency of the data used in MediaWiki projects everywhere. Enjoy the answers Lydia and Daniel gave to these ten questions.

What is it that Wikidata hopes to achieve?
The Wikidata project aims to bring structured data to Wikipedia with a central knowledge base. This knowledge base will be accessible to all Wikipedias as well as to third parties who would like to make use of the data in it.

When you express this in REALLY simple language, what is the "take home" message of Wikidata?
We are creating a central place where data can be stored. This could, for example, be the name of a famous person together with that person's birth date (as well as a source for that statement). Each Wikipedia (and others) will then be able to access this information and integrate it, for example, in infoboxes. If needed, this data can then be updated in one place instead of several. There is more to it, but this is the really simple and short version. The Wikidata FAQ has more details.
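The "update once, use everywhere" idea described above can be sketched as a minimal key-value store. All names below are hypothetical illustrations for this post, not Wikidata's actual API:

```python
# Minimal sketch of a central data store shared by all Wikipedias.
# Item and property names here are made up for illustration.

central_store = {
    "douglas_adams": {
        "birth_date": {"value": "1952-03-11", "source": "Who's Who"},
    },
}

def infobox_value(item_id, prop):
    """Any Wikipedia's infobox would fetch the fact from this one place."""
    return central_store[item_id][prop]["value"]

# The English and German Wikipedia both read the same central fact:
assert infobox_value("douglas_adams", "birth_date") == "1952-03-11"

# Correcting the fact once updates it for every consumer at the same time:
central_store["douglas_adams"]["birth_date"]["value"] = "1952-03-12"
```

Each language edition would call something like `infobox_value` instead of storing its own copy of the fact.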

Structured data is very much like illustrations: the same data can be used over and over again. Will there be a single place to update the data for everywhere it is used?
Yes. There will be a central place, but we are also working on integrating this into the editing process in the individual Wikipedias.

This project is organised and funded by the German chapter. Will this ensure that the data can be used in many languages?
The German chapter is indeed organising it. Funding is coming from Google Inc., AI2 and the Moore Foundation. One of the main points of Wikidata is that it will no longer be necessary to have redundant facts in different Wikipedias. For example, it is not really necessary to have the length of Route 66 in each language's article about it. It should be enough to store it once, including sources for the statement, and then use it in all of them. In the end, the editor will be free to decide if he or she wants to use that particular fact from Wikidata or not. This should be especially helpful for smaller Wikipedias, which can then make better use of the work of larger Wikipedias.
In order to make the data useful in pages written in different languages, we of course have to provide a way to supply information in different languages. This is described in more detail below.

In new Wikipedias a lot of time is spent localising infoboxes. Will Wikidata make this easier?
The template for the infobox still has to be created by the respective Wikipedia community -- but filling the infoboxes would be much easier! Wikidata could be used to get the parameters for the infobox templates automatically, so that they do not need to be provided by the Wikipedians.

Will it be possible to associate the data with labels in many languages? 
Each entity record can have a label, a description (or definition) and some aliases in every language. Not only that: each language version of the label can have additional information, like pronunciation, attached. For example, the record representing the city of Vienna may have the label “Vienna” in English and “Wien” in German, with the respective pronunciations attached (/viːˈɛnə/ and [viːn], respectively).
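The Vienna example can be sketched as a data structure. The field names below are hypothetical, chosen only to illustrate the shape of such a record:

```python
# Hypothetical sketch of an entity record with per-language labels,
# descriptions, aliases, and extra label information such as pronunciation.
vienna = {
    "labels": {
        "en": {"text": "Vienna", "pronunciation": "/viːˈɛnə/"},
        "de": {"text": "Wien", "pronunciation": "[viːn]"},
    },
    "descriptions": {
        "en": "capital of Austria",
        "de": "Hauptstadt Österreichs",
    },
    "aliases": {
        "de": ["Wean"],
    },
}

def label(entity, lang, fallback="en"):
    """Pick the label in the reader's language, falling back if it is missing."""
    labels = entity["labels"]
    return (labels.get(lang) or labels[fallback])["text"]
```

A German Wikipedia page would render `label(vienna, "de")` as "Wien", while a language with no label yet would fall back to the English "Vienna".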

Many people who are part of the project have a Semantic MediaWiki background. How does this affect the Wikidata project? 
Wikidata will profit from the team’s experience with Semantic MediaWiki in two ways: they know what worked for SMW, and they know what caused problems. We plan to be compatible with classic SMW in some areas: for instance, we plan to re-use SMW plugins for showing query results. On the other hand, we will use a data model and storage mechanisms that are better suited to the needs of a data set of Wikipedia’s style and size.

To what extent will Wikidata data be ready to be expressed in a "semantic" way, and if so, what are the benefits?
Wikidata will only express very limited semantics, following the linked data paradigm rather than trying to be a true semantic web application. However, Wikidata will support output in RDF (the Resource Description Framework), albeit relying on vocabularies of limited expressiveness, such as SKOS.

The DBpedia project extracts data from the Wikipedias. To a large extent this is the same data Wikidata could host. Is there a vision on how Wikidata and DBpedia will coexist?
If and when all structured data currently in Wikipedia is maintained within Wikidata, the extraction part of DBpedia will no longer be necessary. However, a large part of DBpedia’s value lies in the mapping and linking of this information to standard vocabularies and data sets, as well as maintaining a Wikipedia-specific topic ontology. These things are and will remain very valuable to the linked data community.

You worked on the first Wikidata iteration, OmegaWiki. What is the biggest difference?
The idea of Wikidata is quite similar to OmegaWiki -- it’s no coincidence that the software project that OmegaWiki was originally based on was also called “Wikidata”. But we have moved on since the original experiments in 2005: the data model has become a bit more flexible, to accommodate the complexity of the data we find in infoboxes. For a single property, it will be possible to supply several values from different sources, as well as qualifiers like the level of accuracy. For instance, the length of the river Rhine could be given as 1232 km (with an accuracy of 1 km) citing the Dutch Rijkswaterstaat as of 2011, and as 1320 km according to Knaurs Lexikon of 1932. The latter value could be marked as deprecated and annotated with the explanation that this number was likely a typographical error, misrepresenting earlier measurements of 1230 km.
This level of depth of information is not easily possible with the old Wikidata approach or with classic Semantic MediaWiki. It is, however, required in order to reach the level of quality and transparency Wikipedia aims for. This is one of the reasons the Wikidata project decided to implement the data model and representation from scratch.
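The Rhine example above can be sketched as a data structure: one property carrying several sourced, qualified values, one of them marked as deprecated. The field names are hypothetical, for illustration only:

```python
# Hypothetical sketch of a property with multiple sourced, qualified values.
rhine_length = {
    "property": "length",
    "statements": [
        {
            "value_km": 1232,
            "accuracy_km": 1,          # qualifier: level of accuracy
            "source": "Rijkswaterstaat",
            "as_of": 2011,
            "rank": "normal",
        },
        {
            "value_km": 1320,
            "source": "Knaurs Lexikon",
            "as_of": 1932,
            "rank": "deprecated",
            "note": "likely a typo for earlier measurements of 1230 km",
        },
    ],
}

def best_values(prop):
    """Return the non-deprecated statements, e.g. for display in an infobox."""
    return [s for s in prop["statements"] if s["rank"] != "deprecated"]
```

An infobox would show only `best_values(rhine_length)`, while the deprecated 1932 figure stays in the record with its source and explanation for transparency.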
