Tuesday, September 17, 2013

Some answers about the heady stuff of #Wikidata

I asked Emw questions as a result of his email about the migration away from the "GND main type". I am happy with the answers I received and I hope you will enjoy reading them.

At Wikidata, I contribute to discussions about properties, where I espouse using W3C recommendations and conventions from the wider Semantic Web. I'm also active in discussions about how to model molecular biology data.  Outside of Wikidata, I've been an active contributor to Wikipedia and Commons for several years.

I only had time to answer three of your questions, but I did that much pretty extensively.  The remaining questions are mostly beyond my knowledge and I don't have any well-formed opinion on them.  If you'd like, I can try to answer those questions or others next week.

My answers to your questions:

1.  The GND system has been ditched. Can you explain why this is a good thing?
The GND main type property has several major problems. Deprecating that property helps us focus on better solutions for classifying knowledge on Wikidata.
Major issues with P107:
  1. With the GND main type property, "person" can mean things well beyond the common understanding of that word.  It can mean things like Coco Chanel -- i.e. 'person' as conventionally understood -- or it can mean a god, literary character, pseudonym, collective pseudonym or spirit.  The standard response to this glaring issue is "'person' is meant to generalize, don't take the term literally". That is not a sufficient solution.  If a classification system for all human knowledge considers Vishnu and Coco Chanel to both be 'persons', that's a big problem.  Beyond giving users bizarrely unexpected query results, it means properties that should be safe to assume for any given 'person' item simply cannot be.
  2. Any item that is not a person, place, event, organization or work is classified as a "term", which contains virtually no information.  We need to be able to classify things like gravity, carbon, DNA, cancer, clarinet, Twelver Shia Islam, fashion boot, dog and potato as more than simply "terms".  One sixth of the property is cruft.
  3. Not even the GND directly uses GND main types.  The GND Ontology has a hierarchical class system and the Deutsche Nationalbibliothek -- which developed it -- uses the lowest-level, most specific GND class available for a subject.  This indicates that the GND senses the GND main types are not appropriate to use as they are with P107.
  4. The nature of P107 implies that the property is only for the highest level of classification, and that additional properties would be needed for each level in the hierarchy of classification for lower-level types. This would entail lots of unnecessary work to create and update classifications. For example, want to specifically classify Nauru? If property P107 were to persist, then you would need to add something to the effect of "main type: Place" and "subtype: Administrative unit". The problem gets drastically worse for subjects with more levels of classification, like organisms, instruments, molecules, diseases, towns, etc.
The GND system itself -- the GND Ontology -- is not the real problem.  The real problem is that P107 is a "main type" property.  In a project to structure all knowledge -- which Wikidata is -- restricting all items into a small set of types will inevitably lead to many, many classifications that are either A) too broad to be useful or B) simply incorrect.
2.  You sent an email asking for attention to what comes next. Why should there be something next?
Because -- although it is complex -- the world has structure, and classes or types are a useful way to express that structure.  The lopsided debates in the Primary sorting property RFC indicate that so-called "main type" properties (sometimes also called "principal group" or "primary sorting" properties) are a bad idea.  However, that does not mean that the basic notion of grouping things into "types" or "classes" is also a bad idea.
A much better solution for classifying things is to use "type" properties recommended for the Semantic Web by the W3C -- that is, use rdf:type and rdfs:subClassOf. These properties exist in Wikidata as instance of (P31) and subclass of (P279). These properties have been part of W3C recommendations for the Semantic Web for almost a decade. They are fundamental properties used in large controlled vocabularies to structure data into knowledge.  They facilitate classification at an arbitrary granularity.  Together 'instance of' and 'subclass of' can classify all subjects and be used to determine precisely where each subject exists in the hierarchy of knowledge -- or, perhaps, in a collection of hierarchies of knowledge.
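
To make the idea concrete, here is a minimal sketch of how 'instance of' and 'subclass of' statements combine. The class labels and the two dictionaries are made-up toy data for illustration, not actual Wikidata statements, and the function names are hypothetical.

```python
from collections import deque

# Toy 'instance of' (P31) statements: item -> classes it is an instance of.
# Labels are illustrative assumptions, not real Wikidata data.
INSTANCE_OF = {
    "Nauru": ["sovereign state"],
    "Coco Chanel": ["human"],
    "Vishnu": ["deity"],
}

# Toy 'subclass of' (P279) statements: class -> its direct superclasses.
SUBCLASS_OF = {
    "sovereign state": ["administrative unit"],
    "administrative unit": ["place"],
    "human": ["organism"],
    "deity": ["supernatural being"],
}


def all_classes(item):
    """Follow 'instance of' once, then 'subclass of' transitively."""
    seen = []
    queue = deque(INSTANCE_OF.get(item, []))
    while queue:
        cls = queue.popleft()
        if cls in seen:
            continue  # guard against cycles and repeated superclasses
        seen.append(cls)
        queue.extend(SUBCLASS_OF.get(cls, []))
    return seen


def is_a(item, cls):
    """True if the item belongs to cls at any level of the hierarchy."""
    return cls in all_classes(item)


print(all_classes("Nauru"))
print(is_a("Coco Chanel", "human"), is_a("Vishnu", "human"))
```

With this kind of structure, Nauru gets one specific 'instance of' statement and the broader classes follow from the hierarchy, and a query for humans no longer has to return Vishnu.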

Not only do they solve those structural problems of P107 and other "main type" properties, but by being based on W3C recommendations, instance of (P31) and subclass of (P279) also make Wikidata more interoperable with the rest of the Semantic Web.
That said, deciding on properties like P31 and P279 is only the beginning of forming a better way to do classification on Wikidata.  We need a way to map the information in P107 to use P31 and P279.  That's a topic of active discussion on Wikidata.
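
As a rough illustration of why that mapping is not purely mechanical, here is a small sketch. The carry-over rule, the function name and the review messages are all assumptions made for this example; they do not describe any migration plan the Wikidata community has agreed on.

```python
# Assumption for illustration: these main types could carry over as a broad
# starting class, to be refined later with more specific classes.
CARRY_OVER = {"place", "event", "organization", "work"}


def propose_statement(label, main_type):
    """Suggest what to do with one item's GND main type value (hypothetical)."""
    if main_type in CARRY_OVER:
        return ("instance of", main_type)
    if main_type == "person":
        # 'person' mixes humans with gods, characters and pseudonyms.
        return ("needs review", "could be a human, deity, character or pseudonym")
    # 'term' and anything else carries too little information to convert.
    return ("needs review", "main type carries too little information")


for label, main_type in [("Nauru", "place"), ("Vishnu", "person"), ("gravity", "term")]:
    print(label, propose_statement(label, main_type))
```

The point of the sketch is only that "person" and "term" values cannot be converted without looking at the item itself.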
3.  The GND is a library system, and then you mention upper ontologies. What is the difference, and how are they practical in the Wikidata context?
The GND (Gemeinsame Normdatei) authority file is used as a library classification system, but it's based on the GND Ontology.  The ontology has a hierarchy of high-level entities and sub-classes.  The P107 property is based on those so-called "high-level entities", which were called "main types" in Wikidata as shorthand.  The main GND types are person, place, event, organization, work, term or "undifferentiated person".  These main types are fine as a way to classify items of general interest in a large library, but they're much too small a set to form a sound basis for a classification system for all human knowledge.

That's what upper ontologies are for.  An upper ontology is a way to have standard vocabulary about high-level entities in our world.  The idea is to formalize these very general concepts in a way that captures the richness of human language while also being precise enough to be machine-understandable. 

For example, the Suggested Upper Merged Ontology (SUMO) sets a class "entity" as the most general type of thing -- everything is an "entity".  From there, SUMO classifies things in the world as either "physical" or "abstract".  "Physical" things can be "objects" or "processes".  "Abstract" things include so-called "set-classes", "propositions", "quantities" and "attributes".  (More information on SUMO is available in Towards a standard upper ontology.)
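
The top of that hierarchy, as described above, can be sketched as a simple parent lookup. The labels below are the informal ones used in this post rather than SUMO's formal term names, and the helper function is hypothetical.

```python
# Child -> parent, using the informal labels from the paragraph above.
PARENT = {
    "physical": "entity",
    "abstract": "entity",
    "object": "physical",
    "process": "physical",
    "set-class": "abstract",
    "proposition": "abstract",
    "quantity": "abstract",
    "attribute": "abstract",
}


def path_to_root(cls):
    """Return the chain from cls up to the most general class."""
    path = [cls]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


print(path_to_root("process"))
print(path_to_root("quantity"))
```

Every chain ends at "entity", which is what makes such an ontology a shared starting point for more specific classes.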
There are several other upper ontologies available, like BFO and UMBEL.  I am not an expert in ontologies, and I have not learned enough about each of them to make an informed statement on their advantages and disadvantages.  However, because they seem to offer unifying terminology for different domains of knowledge, upper ontologies strike me as something worth consideration by the Wikidata community.
