Wednesday, September 25, 2019

With #DBpedia to the (data) cleaners

The people at DBpedia are data wranglers. What they do is make the most of the data provided to them by the Wikipedias, Wikidata and a generous sprinkling of other sources. They are data wranglers because they take what is given to them and make the data shine.

Obviously, it takes skill and resources to get the best result and obviously, some of the data gathered does not pass the smell test. The process the data wranglers use includes a verification stage as described in this paper. They have two choices for when data that should be the same is not; they either have a preference or they go with the consensus ie the result that shows most often.

For data wranglers this is a proper choice.. There is an other option for another day, these discrepancies are left for the cleaners.

With the process well described, the data openly advertised as available, the cleaners will come. First people akin to the wranglers, they have the skills to build the queries, the tools to slice and dice the data. When these tools are discovered, particularly by those who care about specific subsets, they will dive in and change things where applicable. They will seek the references, make the judgments necessary to improve what is there.

The DBpedia data wranglers are part of the Wikimedia movement and do more than build something on top of what the Wikis produced; DBpedia and the Wikimedia projects work together improving our movement's qualities. With the processing data generally available this will become even more effective.
Thanks,
        GerardM

No comments: