Monday, October 07, 2013

More heady stuff about #Wikidata and ontologies

When I asked Emw many questions, I received only three answers. Having an opinion on Wikidata and expressing it is hard. It feels very much like exploring new frontiers. The questions did not go away, so I asked them again, and I am mighty pleased that Antoine Isaac was willing to provide me with some answers. Antoine does not do Wikidata; he is a/the scientific coordinator at Europeana. He wrote an email to the Wikidata mailing list that got me interested in asking him my questions. I hope we will collaborate well on our GLAM data with Antoine and Europeana.

I exchanged several emails with Antoine, but the answers stand on their own. I will react in a follow-up blog post. I hope you appreciate what Antoine has to say as much as I do.

The Wikipedia article calls upper level ontologies "political". Why should Wikidata be interested in any of that?
I wouldn't call them 'political', that's a bit far-fetched. Indeed, ULOs embody some abstract considerations on how to represent the world. And as said before, this can backfire as soon as you consider an open web of data, where different representations may well co-exist. Considering a single upper-level ontology as the guiding principle for everything is dangerous. But it is true that they provide valuable bodies of knowledge to re-use, and it may make sense to re-use them for specific domains (e.g. biology, geography) that could be compatible as a whole with the approach of one ULO.
Does it not make sense to group statements together as qualifiers as part of a statement (like it is done for office held [1])?
I like qualifiers and dislike them at the same time! On the one hand, it's good to have some meta-metadata about the provenance of a statement (who made/endorsed it), or its scope (e.g. the time it applies). In your example, this is the "start date" and "end date". This practice actually fits what is happening in the RDF / Semantic Web area, where a lot of work is being done about Provenance, and many people use quad stores with 'named graphs' instead of just triple stores.
On the other hand, I am anxious about qualifiers being used with other semantics than "here's some info about a statement". In your example, "preceded by" and "succeeded by" are more difficult to interpret in this sense (compare with property P580, which mentions "statement" explicitly). I mean, it is possible to interpret your qualifiers as data on the statement. But I really feel that people (and you?) will understand it as, say, "Te Rata is the predecessor of Korokī Mahuta" and not "The statement 'Te Rata held the office of Maori Monarch' is the predecessor of the statement 'Korokī Mahuta held the office of Maori Monarch'". The latter would be the right reading (I mean, the one compatible with the "start date" semantics).
Note that what you described in the other email is the kind of use of qualifiers that would worry me: it is the person who has a birth/death date or a sex, not the statement 'is a person'. Of course one could see the birth and death dates as influencing the 'date of validity of the statement "is a person"'. But still we'd be talking about two different things, from a knowledge representation perspective.
Note also (just for the fun of referring to upper-level ontologies and their dangers) that for some ULOs 'is a person' doesn't have a begin and end date. Being a person is 'rigid', i.e. it must stick to the subject forever. You can't have been a person once and then cease to be a person. Even if you die you're still a person... And don't tell me that rigidity foresees that some ontologies may have Person as an anti-rigid property. That is probably not the choice made in your favourite ULO, unless it's one that addresses both reality and beliefs as two possible sides of the same property. But then, good luck re-using it!!!
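To make the distinction in this answer concrete, here is a minimal sketch of a statement with qualifiers, split into qualifiers that read naturally as data about the statement and qualifiers that people will tend to read as data about the subject. The `Statement` class, the `STATEMENT_LEVEL` set and the placeholder dates are assumptions made for illustration only; none of this is Wikidata's actual data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """Hypothetical, simplified stand-in for a statement with qualifiers."""
    subject: str
    prop: str
    value: str
    qualifiers: dict = field(default_factory=dict)


# Assumption: these qualifiers describe the scope of the statement itself.
STATEMENT_LEVEL = {"start date", "end date"}


def split_qualifiers(statement):
    """Separate statement-level qualifiers from all the others."""
    about_statement = {
        k: v for k, v in statement.qualifiers.items() if k in STATEMENT_LEVEL
    }
    ambiguous = {
        k: v for k, v in statement.qualifiers.items() if k not in STATEMENT_LEVEL
    }
    return about_statement, ambiguous


# The dates are placeholders, not real data.
office = Statement(
    subject="Te Rata",
    prop="office held",
    value="Maori Monarch",
    qualifiers={
        "start date": "START",
        "end date": "END",
        "succeeded by": "Korokī Mahuta",
    },
)

about_statement, ambiguous = split_qualifiers(office)
print(about_statement)  # qualifiers that scope the statement
print(ambiguous)        # qualifiers likely to be read as facts about the person
```

In this reading, "succeeded by" ends up in the second group: it is stored on the statement, but most readers will interpret it as a relation between two people.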
DBpedia does not have qualifiers, will this impact their ability to use data from Wikidata?
I can't really speak for DBpedia. But I'd say that if qualifiers are used in a way that is both consistent and compatible with the understanding of 'named graphs' in RDF, then they might be interested.
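As a rough illustration of what "compatible with named graphs" could mean, the sketch below turns one statement and its qualifiers into quads: the statement lives in its own graph, and each qualifier becomes a triple about that graph. The function `to_quads`, the graph identifier and the placeholder dates are hypothetical; this is one possible mapping, not how DBpedia or Wikidata actually export data.

```python
def to_quads(graph_id, subject, prop, value, qualifiers):
    """Return (subject, predicate, object, graph) tuples for one statement.

    The statement itself is placed in the named graph `graph_id`;
    qualifiers are expressed as triples whose subject is that graph,
    stored in a graph called "default" (an assumption of this sketch).
    """
    quads = [(subject, prop, value, graph_id)]
    for q_prop, q_value in qualifiers.items():
        quads.append((graph_id, q_prop, q_value, "default"))
    return quads


quads = to_quads(
    "graph:statement-1",  # hypothetical identifier
    "Te Rata",
    "office held",
    "Maori Monarch",
    {"start date": "START", "end date": "END"},  # placeholder values
)

for quad in quads:
    print(quad)
```

This only works cleanly when every qualifier really is information about the statement, which is the consistency condition raised in the answer.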
As we map Wikidata items to the content in other repositories, what do we need to compare the data from these repositories?
Which repositories are you talking about? Which content? Are 'repositories' knowledge bases, like OCLC or Europeana in the book/artwork domain? Is 'content' 'data'? If yes, then what is required to establish correspondences is the hard work of looking at the fields and seeing how they correspond. This may of course be alleviated if all (Wikidata included) look at what's already happening and try to minimize the risk of coming up with data models that are too idiosyncratic (this is why something like RDF is quite useful!).
When differences in content between repositories are found, is there a standard method to harmonise the content?
Again assuming a reading of 'repositories' and 'content' as above, I don't think there is a standard method. What you can pray for (and of course as designers of data repositories, we are somehow in the position of making it happen!) is that all repositories keep track of as many unambiguous identifiers as they can (e.g. ISBNs for books), which would help automatic reconciliation. Otherwise, make sure the data on the content (where 'content' = 'the object in the real world') is as complete as possible. For works of art that would mean that comparisons can be made on titles, creators, dates and places of creation, etc.
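The identifier-based reconciliation described here can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout (`id`, `isbn`, `title`), the identifiers and the ISBN are all made up, and the normalization does not convert between ISBN-10 and ISBN-13.

```python
def normalize_isbn(raw):
    """Keep only digits and a possible 'X' check character, uppercased."""
    return "".join(ch for ch in raw if ch.isdigit() or ch in "xX").upper()


def reconcile(records_a, records_b):
    """Match records from two repositories on their normalized ISBN.

    Returns (matched, unmatched): matched is a list of (id_a, id_b) pairs,
    unmatched lists ids from records_a with no ISBN match in records_b.
    """
    index = {
        normalize_isbn(r["isbn"]): r for r in records_b if r.get("isbn")
    }
    matched, unmatched = [], []
    for record in records_a:
        key = normalize_isbn(record["isbn"]) if record.get("isbn") else None
        if key and key in index:
            matched.append((record["id"], index[key]["id"]))
        else:
            unmatched.append(record["id"])
    return matched, unmatched


# Made-up records for illustration only.
repo_a = [
    {"id": "A1", "isbn": "978-0-00-000000-2", "title": "Example Book"},
    {"id": "A2", "isbn": None, "title": "Untitled Painting"},
]
repo_b = [
    {"id": "B7", "isbn": "9780000000002", "title": "Example book"},
]

matched, unmatched = reconcile(repo_a, repo_b)
print(matched)
print(unmatched)
```

Records left in `unmatched` are the ones that need the second strategy from the answer: comparing titles, creators, dates and places, which only works when those fields are filled in as completely as possible.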
