Words and what not: September 2015

Monday, September 28, 2015

#Wikidata - ten questions about #Kian

The quality and quantity of Wikidata rely heavily on technology. When the people who develop their tools collaborate, the results increase exponentially. Amir has made his mark in the pywikibot environment and now he spreads his wings with Kian.

I have mentioned Kian before and I am happy that Amir was willing to answer some questions now that Kian has over 82,000 edits.
Enjoy,
GerardM

What is Kian

Right now Kian is a tool that can give the probability of an item having a certain statement, based on categories, but the goal is for it to become a general AI system that serves Wikidata.

Why did you write Kian

The huge number of items without statements always bothered me, and I thought I should write something that can analyse articles and extract some data from them.

How is Kian different from other bots

It uses AI to extract data; I have never seen something like this in Wikipedia before. The main advantage of using AI is adaptability. I can now run Kian on languages that I know nothing about.

Another advantage of using AI is having a probability, which can be useful in lots of cases, such as generating a list of mismatches between Wikipedia and Wikidata that shows possible mistakes in Wikidata or Wikipedia.
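To make the idea of a probability and a mismatch list concrete, here is a minimal sketch. It is not Kian's actual model: it assumes a simple smoothed per-category rate instead of a trained AI system, and the item identifiers, category names and thresholds are hypothetical.

```python
from collections import defaultdict


def train(examples):
    """examples: iterable of (categories, has_statement) pairs.
    Returns category -> [items with the statement, all items]."""
    counts = defaultdict(lambda: [0, 0])
    for categories, has_statement in examples:
        for cat in categories:
            counts[cat][1] += 1
            if has_statement:
                counts[cat][0] += 1
    return counts


def probability(counts, categories):
    """Average the smoothed per-category rates; 0.5 when no category is known."""
    rates = []
    for cat in categories:
        if cat in counts:
            positives, total = counts[cat]
            rates.append((positives + 1) / (total + 2))  # Laplace smoothing
    return sum(rates) / len(rates) if rates else 0.5


def mismatches(counts, items, high=0.75, low=0.25):
    """items: iterable of (item_id, categories, has_statement).
    List items where the categories and the stored data disagree."""
    found = []
    for item_id, categories, has_statement in items:
        p = probability(counts, categories)
        if p >= high and not has_statement:
            found.append((item_id, "statement probably missing"))
        elif p <= low and has_statement:
            found.append((item_id, "statement probably wrong"))
    return found


# Hypothetical training data: does the item have "instance of: human"?
examples = [({"1950 births"}, True)] * 3 + [({"Rivers of France"}, False)] * 3
counts = train(examples)
items = [
    ("item-1", {"1950 births"}, False),      # looks like a person, no statement
    ("item-2", {"Rivers of France"}, True),  # looks like a river, marked human
    ("item-3", {"1950 births"}, True),       # consistent
]
for item_id, reason in mismatches(counts, items):
    print(item_id, reason)
```

Items that fall between the two thresholds are left alone in this sketch; only the clear disagreements end up on the list for people to check.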

Is there a point in using Kian iteratively

With each run of Kian, Wikidata becomes better. After a while we would have so much certainty in the data that we can assure Wikipedia and other third-party users that using our data is a good thing.

What can Kian do other bots cannot do

First is generating possible mistakes and building a quality assurance workflow.

Another one is its adaptability: it can add a broad range of statements with high accuracy.

What can Kian do for data from Wikipedias in "other" languages

We can build a system to create these articles in languages such as English, because with Kian we now have data about those articles.

Let me give you an example: maybe there is an article in the Hindi Wikipedia. We can't read this article, but Kian can extract several statements from it. Then, using Reasonator or other tools, we can write articles in the English Wikipedia or in other languages.

What question did I fail to ask

My plans for Kian: what I'm doing to make Kian better. Hopefully we will have a suggestion tool using Kian very soon.

What does it take for me to use Kian

We have instructions on GitHub; you only need an account on Wikimedia Labs.

Does Kian use other tools

Yes, right now it uses Autolist, which keeps it up to date.

What is your favourite tool that is not a bot

Autolist. Wikidata can't go on without this tool.

Sunday, September 27, 2015

#Wikidata - primary sources tool statistics

It is a good thing that there are statistics for the primary sources tool. It is a dashboard that shows the current state only.

Given the discussion on the usefulness of this tool, this is not really helpful. It does not help any argument because everyone will be given different numbers at a different time.

Compare this with useful statistics for Wikidata. Here values are available that show trends over time. Consequently, action can be taken based on the numbers. It would be really welcome if, as part of the creation of these statistics, current numbers for the primary sources tool were included.

Either way, success or failure, statistics help when people agree that numbers are relevant.
Thanks,
GerardM

Saturday, September 26, 2015

#Wikidata - #Freebase atrophies in the Primary sources status

It is a good thing that there are statistics for the primary sources status. They demonstrate clearly how dysfunctional it is. Only 18K statements have been approved. After all the time that the tool has existed, it is not even one percent.

For a "serious" power user it is quite possible to add this number of statements to Wikidata in a day, any day. The sad thing is that there is every reason to believe that the quality of a power user's work is just as good as anything that is in this dump in the first place.

Mathematics shows that it is easy to check and verify the data that is in Wikidata against other sources. When such a process is well designed, it is iterative, and consequently data that is deemed useful for inclusion in Wikidata will be processed in every iteration.
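A minimal sketch of such an iterative comparison, under stated assumptions: both data sets are reduced to hypothetical mappings from an identifier to a single value, values missing in Wikidata are imported on each pass, and disagreements are listed for people to review. None of this describes an existing tool.

```python
def compare(wikidata, external):
    """Sort the external identifiers into agree, differ and missing."""
    agree, differ, missing = [], [], []
    for key in sorted(external):
        if key not in wikidata:
            missing.append(key)
        elif wikidata[key] == external[key]:
            agree.append(key)
        else:
            differ.append(key)
    return agree, differ, missing


def run_iteration(wikidata, external):
    """One pass: import what is missing, return what needs human review."""
    agree, differ, missing = compare(wikidata, external)
    for key in missing:
        wikidata[key] = external[key]
    return differ


# Hypothetical identifiers and dates of birth.
wikidata = {"item-a": "1950-01-01", "item-b": "1960-05-05"}
external = {"item-a": "1950-01-01", "item-b": "1961-05-05", "item-c": "1970-07-07"}
print(run_iteration(wikidata, external))  # identifiers that need review
```

Running the same pass again after the external source has grown picks up only the new differences, which is what makes the process iterative.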

These sad statistics demonstrate one thing and one thing only: the failure of this approach. It would be wise to abandon it and concentrate instead on workflows that leverage the value of the huge community that may help fix issues.
Thanks,
GerardM

Sunday, September 13, 2015

#Wikidata - What is the #Buikslotermeer

In the history of the Netherlands, land was steadily disappearing. The peat that was the land was replaced by water and this process increased in speed as lakes increased in size. One solution was to make a polder out of a lake. It worked well and it resulted among many others in the polder of the Buikslotermeer.

As the city of Amsterdam grew in size, a new part was named after the old polder.

Wikidata needs disambiguation between the two. One of the reasons is that a picture like this one is about the polder and not at all about the neighbourhood of Amsterdam.

The polder will have statements about things like when the dikes broke.
Thanks,
GerardM

Saturday, September 12, 2015

#Wikidata's embarrassment of riches

Wikidata is improving its content constantly. Proof may be found in people pointing to issues and the follow up it generates. They add data, change data and remove data; Wikidata is better for it.

With the official Wikidata Query being live, it is even easier for people who understand SPARQL to query, compare and comment on Wikidata's content. As mentioned before, it is in the comparison of data that it is easiest to improve both quality and quantity.

For this reason it is an embarrassment how a resource as rich as Freebase is treated; it might as well not exist. It lingers in the "primary sources tool", where a lot of well-intentioned work is done. In Q3/2015 there may even be a workflow to include even more data in there.

Probably this tool is only relevant for static data, and that is not necessarily the best. Actively maintained data is much to be preferred. If I understand things well, people may tinker with the data in this data dungeon and it is then for the "community" to decide upon inclusion in Wikidata. It is not obvious what its arguments could be. It is not even obvious how any of this data will compare to the quality of Wikidata itself. Its quality is not quantified either.

Once data is included, there are many ways to curate the data. It is done by comparing it against other sources. It is obviously a wiki way because it invites people to collaborate.
Thanks,
GerardM

Monday, September 07, 2015

#Wikimedia - more #contributors or more #editors?

The foundation of the Wikimedia projects is their people. The more effort these people put in, the more data it generates. The data may be in the form of text, images, sources, software or statements but it is all about mangling it into information. The question is not so much what has value for us all, as it all has its own value, its own merit. It gets its value from the people who take an interest.

The question is how to generate more merit. How do we get people involved to do their "own" thing? One way of doing this is by not seeking the perfect solution. Yes, we can do a lot in an automated way. However, this will mostly get us more of the same and not necessarily more of what is of interest to some.

Consider, there are people who demand better quality. When all they can do is look helplessly from the sidelines, they get frustrated. When you give them something to do, they have a choice; to put up or to shut up. Personally I care about human rights so I enrich content related to human rights. The data is not perfect but I notice improvements. I notice when other people contribute as well. It feels positive.

The problem with many tools is that they are great for what they aim to do but once they get into the grey area of doubt and uncertainty they flounder. Technically the negative results from Kian are perfect. It is just that it does not make it easy for people to work on these results. It is not obvious what result will be enough for Kian.

What we really want is tools that people can use, tools that are as obvious as we can make them, tools that have descriptions and workflows. Tools that do not need nerds or developers to use. Tools that can be used by you and me. Tools that get us more contributors. Contributors who, like me, work on a subject they care about.
Thanks,
GerardM

Sunday, September 06, 2015

#Wikimedia - improving #search

One "key performance indicator" for search is the number of people who get zero results for a search. The objective is to make the number of people who do as small as possible. [1]

In the early days of the Dutch Wikipedia, a librarian was always happy to explain how he improved search results at his library.

His first observation was that he needed to know what people could not find. The observations were aggregated in timeslots. In this way he knew what people were looking for. His favourite observation was "People are stupid; they do not know how to spell". Allowing for the most prevalent spelling errors improved the results a lot. The other part was that people were looking for things the library did not provide.
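The librarian's approach can be sketched roughly as follows. This is an illustration only: the log format, the hourly timeslot, the list of known titles and the similarity cutoff are all assumptions, and fuzzy matching with Python's `difflib` stands in for whatever he actually used.

```python
from collections import Counter
from datetime import datetime
import difflib


def failed_by_slot(log):
    """log: iterable of (ISO timestamp, query, number of results).
    Count the zero-result queries per hourly timeslot."""
    slots = {}
    for stamp, query, hits in log:
        if hits == 0:
            hour = datetime.fromisoformat(stamp).strftime("%Y-%m-%d %H:00")
            slots.setdefault(hour, Counter())[query.lower()] += 1
    return slots


def classify(query, titles, cutoff=0.8):
    """Guess whether a failed query is a misspelling of a known title
    or asks for something that is not there at all."""
    match = difflib.get_close_matches(query.lower(), titles, n=1, cutoff=cutoff)
    return ("misspelling", match[0]) if match else ("missing", None)


# Hypothetical search log and title list.
log = [
    ("2015-09-06T10:15:00", "Amsterdm", 0),
    ("2015-09-06T10:40:00", "amsterdm", 0),
    ("2015-09-06T11:05:00", "Buikslotermeer", 0),
    ("2015-09-06T11:10:00", "Wikidata", 12),
]
titles = ["amsterdam", "wikidata"]
for hour, queries in sorted(failed_by_slot(log).items()):
    for query, count in queries.items():
        print(hour, query, count, classify(query, titles))
```

The "misspelling" bucket can be served with the closest title right away; the "missing" bucket is the list of things people look for that are not provided.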

The message for the Wikimedia search team: consider publishing the known errors aggregated over time. Have the community mark the spelling errors as such and use that to serve content anyway. For the other part, where we do not have data, consider that Wikidata has more information than any Wikipedia. When results do not exist as articles, publish what people seek and there might be a community out there adding the missing articles.
Thanks,
GerardM

#Wikidata - Tiberius, Modestus and Florentia

Tiberius, Modestus and Florentia are three Catholic martyrs. There is a Dutch article by that name and Kian has it on its "problems list".

Lists like these "lists of possible mistakes" are necessary. At best they are an evolutionary step between having no awareness and being aware of issues. It would be wonderful if there were workflows for fixing issues in a way that prevents them from reappearing on such a list.

For these three martyrs new items were added and they are linked to the item for a "group of people". In the Dutch article, Joseph Guislain is also a museum in Ghent. It is easy enough to add an item for the museum and link it to the person. But will it fix this issue for Wikidata?

Workflows for the issues that we face would be wonderful:

when done, an item should disappear from a "to do" list
it should be more obvious what it is that will fix an issue
when we identify martyrs or whatever, we should involve people who are into the related subjects

Magnus has many tools that have people fix things. They are workflows we could adopt and by doing this make them even easier to use.

Thanks,

GerardM

#Wikidata - #StrepHit, the package damaged the message

When a good idea is posted, the message of the announcement can completely blow it away.

First the good news. StrepHit has the potential of becoming a valuable tool for new content for Wikidata. It is all about Natural Language Processing and consequently it is all about harvesting facts from text. The idea is to harvest structured facts and provide references for statements and harvest references for existing statements. This is really welcome, it may prove to be important.

For the bad news, the plan is based on a number of awful assumptions that prevent it from being taken seriously at first glance.

The best thing the authors can do is appreciate that what they are building is a tool. A tool that analyses text, a tool that can be trained to do a good job. A tool that can be integrated with other tools. A tool that is not defined by particular use cases or assumptions.

When it runs in an optimal way, it is much like Kian. Kian runs and makes changes to Wikidata directly. This week it added 21,426 statements with a very high rate of certainty. Problematic data is identified and lists are created, and this is where people are invited to make a difference.

Kian works in the Wiki way: it does its thing and it invites people to collaborate. It does not assume that people have to do this, that or the other. Contrast this with StrepHit, where the author suggests that people should not be allowed to add statements without references. If that is not enough, it will not even add data to Wikidata but considers the data it generates a "gift" and condemns its data to the "Primary sources tool". It is a sad place where valuable data lingers without finding its way into Wikidata.

StrepHit and tools like it may become valuable. Its value will be in a direct relation to how it integrates in other tools. When it does it will be great, otherwise it will sit in its corner gathering dust.
Thanks,
GerardM

Thursday, September 03, 2015

#Wikidata - #Wikimedia Public Policy

To make its point about its public policies absolutely clear, the Wikimedia Foundation dedicated a website of its own to it. It is well worth a visit, and the subject is well worth a good think.

In Wikidata the discussion was started on one of the more important Wikipedia policies: its "BLP" or Biographies of Living Persons. Obviously, Wikidata does not have a BLP because it does not have biographies. We do however have data on living people and data on people can be as libelous as text. Talk about "hard data"...

With data on people, there are all kinds of potential issues. It may be incomplete, it may be wrong, it may be problematic. It is obvious that Wikidata has its problems with quality; this blog has mentioned them before.

When Wikidata is to have a DLP or Data on Living Persons, there are two parts to it. The first is having a way of addressing issues. The second is a way to prevent issues arising.

When issues arise, many of the best practices of the BLP can be adopted. Yes, have sources. Yes, investigate the sources. But first things first: have a policy, have a place where issues can be reported.

The question of quality has two parts. Typically Wikidata does not have enough data to be balanced. This can be remedied in many ways. We should be more aggressive in adding data; this can be done by cooperating with other sources and by investing in tools like Kian. The other part is in being sure about the veracity of the available data. This is also something where tools will make a difference.

Both a BLP and a DLP are important aspects of a Wikimedia Public Policy. Wikidata shows its maturity by not having had reason to have its DLP. Something to be grateful for.
Thanks,
GerardM