Thursday, December 29, 2016

#Wikidata - Khagan of the Rouran

A great sign that Wikidata gains traction in other languages: much of the data for Yujiulü Shelun, Khagan of the Rouran from 402 to 410, does not have labels in English. When the idea is to include all these Khagans of the Rouran it becomes a challenge. The English article does have many names but do they fit what is already there for other languages?

The challenge is to do good and bring things together. It is relevant to have all the right items properly connected. One thing is missing: the item for Khagan of the Rouran. That is easily fixed.

Tuesday, December 27, 2016

#Wikimedia and the "official point of view"

One of the pillars of #Wikipedia is its Neutral Point of View (NPOV). The point is that we should not take sides in an argument but should present arguments from both ends and thereby remain neutral. The problem is what to do when arguments are manifestly wrong, when science repeatedly shows that there is no merit in a point of view.

What to do when it is even worse, when science is manipulated to show what is of benefit to some? When the Wikimedia Foundation had its collaboration with Cochrane, it was onto something important. Cochrane is big on debunking bad science.

The new government of the USA has a reputation that precedes its actions. It already states that science is bad. It will state its point of view. They will argue that it is good for all but how will they substantiate this? In the meantime much of what science said so far will remain standing. The snake oil salesmen will try to sell you their product and I wonder how it will find its way into Wikipedia. Will we look at science and will we resist the snake oil?

Monday, December 26, 2016

#Wikidata - #caste and how to include it in Wikidata

With all respect to cultural heritage, forcing people to be included in any caste is a form of discrimination. In an article about Nangeli, a woman of the Nadars, it becomes clear how important it is to understand its history.

The Nadars are a heterogeneous group, comprising people of diverse standing. When in a school curriculum the story of Nangeli was included, it did not do justice to this diversity.

The problem with discrimination is that it has to be simple or it is not understood. It is how I interpret why it was pulled from the curriculum. This whole notion of the impossibility of there being one simple caste system is expressed well in the Wikipedia article on a historic event involving the Nadars, the Sivakasi riots: "This belief, that the Nadars had been the kings of Tamil Nadu, became the dogma of the Nadar community in the 19th century". It casts doubt on schemas where castes are expressed in a simple way.

What we can do is link what we know is related. Link historic facts associated with class and castes. But it starts with making the effort.

Sunday, December 25, 2016

#Wikidata - the grandson of King Thibaw

When the BBC writes an article about royalty, it makes sense for both Wikipedia and Wikidata to have correct information available.

For the descendants of King Thibaw Min, it helps when it is known that this king was part of the Konbaung Dynasty and that a dynasty is not a country. This is relevant because any claim to Myanmar is based on being part of that dynasty.

It is simple; dynasty is family. It is why Mr Trump and his offspring are factually a business dynasty. When we are to get our facts straight, it makes sense to understand such basics. A dynasty can lose control over its "assets" but it remains a family.

Historically there have been many families with claims to a crown. Understanding such a claim is of interest and it is relevant to know the history of the whole world. History is not only lived in the western world.

Tuesday, December 20, 2016

#Wikidata - a country is not a dynasty

When a "country" comes into being, it is after a struggle. In the same way when a "country" comes to an end, it is after a struggle. The same is true for dynasties; when a royal line comes to a start or an end, it is not without a struggle. However sometimes in a country there is continuation and one dynasty follows a previous one. Several dynasties succeeded each other in the Delhi Sultanate. The "country" finally ended with the last of the Lodi dynasty.

So when a country knows only one dynasty and starts and ends with that dynasty, it does not make the dynasty the country. Making up a name for a country is easy; when these monarchs are called "king" it is a kingdom, when they are a "sultan", it is a sultanate.

If there is one drawback, it is that there might be a name for that country in the languages of the people who were linked to it. For this reason all the countries that I am about to create may be prime suspects for a merger. The item, not the country :)

Saturday, December 10, 2016

#Wikidata - Sembiyan Mahadevi - is it a title or is she a queen?

Queen Sembiyan Mahadevi was the spouse of Gandaraditya; her son was Uttama Chola. Many of the Chola queens who followed her used "Sembiyan Mahadevi" as a title. This is what the English article tells us.

To really accept that it was a title, a source would help. It would be cool to have a list of all the people who used the title and it would be good to separate the person from the title in separate articles. It seems that the Tamil article is more substantial but I do not read Tamil and Google Translate does not help me sufficiently to understand what it says.

Queen Sembiyan Mahadevi matters not only because she is important in the Chola dynasty but also because of the relevance she has in Tamil culture. Her father was a Mazhavarayar chieftain but Wikipedia does not know about them. 

When Wikidata knows about Indian nobility, its dates and connections, it becomes a resource that is helpful. Once her father has a name and it is clear what is meant by a "Mazhavarayar chieftain", slowly but surely it becomes clear who ruled where and who were contemporaries. It would be cool when Wikidata allows for a query that shows a "monarch" and shows fellow monarchs in neighbouring countries. 
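What such a query has to do can be sketched without Wikidata at all: take one monarch, look up the neighbouring countries, and keep the rulers whose reigns overlap. The sketch below is only an illustration. The `Reign` structure, the `contemporaries` function and all the data in it are made up for the example; real dates and borders would have to come from Wikidata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reign:
    monarch: str
    country: str
    start: int  # first year of the reign
    end: int    # last year of the reign


def contemporaries(target, reigns, neighbours):
    """Return monarchs of neighbouring countries whose reign overlaps the target's."""
    bordering = neighbours.get(target.country, set())
    return [
        r.monarch
        for r in reigns
        # Two closed intervals overlap when each starts before the other ends.
        if r.country in bordering
        and r.start <= target.end
        and target.start <= r.end
    ]


# Entirely hypothetical data, for illustration only.
reigns = [
    Reign("Monarch A", "Kingdom X", 950, 957),
    Reign("Monarch B", "Kingdom Y", 940, 952),
    Reign("Monarch C", "Kingdom Y", 953, 970),
    Reign("Monarch D", "Kingdom Z", 900, 949),
]
neighbours = {"Kingdom X": {"Kingdom Y", "Kingdom Z"}}

print(contemporaries(reigns[0], reigns, neighbours))
```

With this made-up data, Monarch B and Monarch C are contemporaries of Monarch A, while Monarch D ruled too early. The hard part is not the comparison but having the dates and the neighbours in the first place.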

Thursday, December 08, 2016

Was Cezhiyan Cendana a Pandyan king?

There is no way for me to find out if Cezhiyan Cendan was a Pandyan king or not. The only source I can find is a blog saying so. The problem is that texts in Wikipedia make me doubt. The text in the article for Maravarman Avani Culamani states that he is succeeded by his son Jayantavarman.

One fun fact is that templates do not have sources. It is however what I base information on when I add information to Wikidata. The other interesting point is that dates given are overlapping to the extent that they are not reliable.

So this is where we get into a problem. When information is good enough for a Wikipedia, is it good enough for Wikidata? More important is the question: how do we curate information like this in a way that helps us all?

Wednesday, December 07, 2016

A Pandya King did not rule #India

The Pandyan Kingdom existed for some fourteen centuries; for many of the kings not much is known. A template contains much of what is known about them: not much.

Arguably, having this information in Wikidata serves a purpose. The information can be curated by people who know about the Pandyan kings and there are several things that they could do.
  • Some of the names of kings seem to be incorrect, certainly inconsistent.
  • The names of these kings can be added in the original language
  • Dates may be added to the period these kings were king
  • The data can be used in one of the other Wikipedias that are relevant in India.
One funny fact is that for all these kings it is impossible to have been a citizen of India. They were citizens of the Pandyan kingdom. Many such facts were added by bot and reflect factoids that exist in Wikipedias. It is just wrong.

Tuesday, December 06, 2016

#Research to help #Wikipedia do better

It is one thing to bemoan everything that is problematic with research, it is another to do better. For research on Wikipedia to be published, it has to be about "English" OR it has to be linked to English OR publication is not the end goal.

At the Dutch Wikimedia Conference Professor de Rijke gave the keynote speech. He spoke about the kind of research he is into and he spoke about "Wikipedia" research performed at the University of Amsterdam. He challenged his audience to cooperate and his challenge resulted in me formulating ten proposals for research. The point of these proposals is that I hope they provide more worthwhile insight and include a link to “English” in order to be published.
  1. Previous research studied how long it took for a subject to appear in English Wikipedia after it was first mentioned in the news / social media. The new question would be: how long does it take for the same subject to appear in any Wikipedia, how long does it take and to what extent does it happen for those articles to get corresponding articles in other Wikipedias, and how long does it take for the English Wikipedia to take notice?
  2. In the search engine for Wikidata we use the description to help differentiate between homonyms. There are two approaches to a description. Many existing descriptions are not helpful and hardly any items have texts in all of the 280 languages. There are however automatically generated descriptions. The question is: what do people like more, the automated descriptions or the existing descriptions? Is there a real difference for people who use Wikidata in English as well?
  3. Many people know their languages, this is obviously true for readers of Wikipedia. For the regulars there is a “Babel” template that allows them to indicate what languages they know. For the others for some purposes geo-location is used to make a guess. Do people find it useful to have it indicated that articles exist in the languages they know in search requests? Does it make a difference that a quality indicator is set for those other texts on the same subject?
  4. Many people make spelling errors when they search for a subject or when they create a wiki link to another subject. Google famously suggests what people may be looking for. We can expand the search and include items from Wikidata (40% increase in reach) but we can also use Google or any other search engine to help people get to the sum of all knowledge. We can ask people to answer some questions after they are done. Are people willing to do this and how does it expand the range of subjects that we know about? Are people willing to curate this information so that we can expand Wikidata and at least recognise the subjects we have no articles about?
  5. When we show the traffic for the articles people edited in the last month, we gain an insight into what people actually read. We also congratulate people on the work they did and show appreciation. Does this kind of stimulus stimulate more articles? How do you stimulate for subjects that people hardly read (e.g. Indian nobility)? Do you compare with existing articles in the same category?
  6. There have been several Wikipedias that include bot generated texts. It is a famously divisive issue in the Wikipedia community. There has been no research done on this. With Wikidata there is an alternative way to exploit the underlying data. When the data is included in Wikidata, it is possible to generate text on the fly. This text may be cached for performance reasons but there are two main advantages; both the script and the data can be updated. The question is: does it serve a purpose for our readers? Will editors update the data or the script to improve results or will they use the text as a template for new articles? Will it take the heat out of the argument about generated texts? How will it affect projects that were not part of the existing controversy and does it work for them?
  7. Wikidata does not allow for the dating of its labels. It follows that it is not easily understood what the relation is between Jakarta and Batavia. How are such issues generally stored as data and what alternatives exist for Wikidata? How does it improve the usefulness of Wikidata as a general topic resource?
  8. Wikidata now includes data from sources like Swiss-Prot. What are the benefits to both parties? Does it make for people editing this data at Wikidata and what is the quality of such edits? Does it get noticed by Swiss-Prot and is there a cooperation happening? How is this organised and to what extent does “the community” interfere with the notions of academia? Do such communications exist or are these groups doing “their own thing”?
  9. What is the effect on the ultra small Wikipedias when generated texts are available based on available labels? Does it mean more interest in creating the templates for articles and work on labelling? What does it mean when such generated articles are available to search engines?
  10. At this time many articles in the English Wikipedia are written by students, university students. The result is positive on many levels but the question is, is what they write understood by Wikipedia readers? When students write their articles, it is mostly based on literature. It is well known that the bias in scientific papers is huge. Negative results are not published and many results from studies are ignored. The question would be: is sufficient weight given to debunking studies or are they put aside with an argument of a “neutral point of view”? This would make sense when students are graded on what they write given accepted fact at the university.

Saturday, November 26, 2016

The problem with #science explained with #Wikipedia

It is a recurring theme. People study a subject and reality is different. The science is flawless, the results are impressive and indeed important strides are made forward. The study of heart disease is a great example; many studies resulted in an improved life expectancy for men. Particularly white men. The Dutch Hartstichting is raising funds for new research because of this existing bias in research. For women in the Netherlands, heart disease is the number one killer because heart disease is different in women; it was not noticed before because heart disease in women was not studied.

Wikipedia as it is commonly known in research has the same problem. It is not Wikipedia as we know it, it is English Wikipedia. My contributions to Wikipedia have not been to English Wikipedia; they went to the Dutch Wikipedia and I will not be noticed as one of the most prolific contributors to Wikimedia projects because my contributions to "Wikipedia" are hardly significant.

As I blogged before, scientific papers are not published when they do not involve English Wikipedia. The consequence is that when people quote research, their quotes include this bias and strictly speaking what they quote is not necessarily true when you consider Wikipedia as a whole. The problem with biased research is that the policies of the WMF are based on the known "facts".

Nothing new so far. We all know it when we are honest. So what can we do to remove some of the bias? The first thing is to devalue any and all research that is English Wikipedia only. It covers less than half of what we do. The second thing is to evaluate research for its algorithms. When both the algorithms and the data are available, it is possible to run the algorithm on a more inclusive data set and check the validity. With the quality of Wikidata data as a source on all the Wikipedias improving, such an approach is increasingly feasible. The last thing is for the Wikimedia Foundation itself to address this bias. With English Wikipedia being less than 50% of its traffic and workflow, it would be good when a similar percentage of its efforts is focused on the bigger half of what we all do.

So what is the harm? We expect all Wikipedians largely to do what "Wikipedians" do. However, we are not all English Wikipedians. The needs other people have are not discussed, not taken seriously. We have seen wonderful examples of potential functionality showcased but they are not taken further, not taken into production, because they do not fit the preconceived ideas of what we do; they are not part of the road map. The projects in Wikidata are not about Wikidata but about how to make us all into one big data glob and USING the data is only seen in relation to Wikipedia articles. We do not know how much Wikidata is used; some studies are done but they are in relation to "Wikipedia" and that is not relevant to me. We find that Wikisource gains more and more content that may be valuable to our readers but we do not market this data because we never did marketing for Wikipedia. There are several websites that only do this in a way that could be much improved if we took Wikisource seriously.

It hurts us to only consider English Wikipedia and this bias in research and policy is more damaging than the bias that is considered by the English Wikipedians.

Wednesday, November 23, 2016

#Bias in #research

Actually, it starts with something else. You need to publish so you have to select a subject to study that will be of interest to the publisher.

As a consequence hardly any research is done about the other Wikipedias. I have been informed by a reliable source that it has to be English or it will not be published.

Now Wikimedia Foundation, how about that? Is there any research done on Wikipedia or is all the research biased in this way?

Tuesday, November 01, 2016

#Wikidata year 4; What Gupta year is that?

Wikidata is celebrating its fourth birthday. It is celebrated with some mighty fine gifts. It is a time to reflect on what has gone before and what is ahead of us. Obviously there are challenges we face and my gift is some queries / questions I do not know how to address. I focus on the Gupta empire because it currently has my interest.

During the era of the Gupta empire there was a "Gupta year". An article refers to it and my first question is: what date would the birthdate of Wikidata be in Gupta years?

Obviously there are many maps including the Gupta empire. Can I have them sorted by date please? What other countries border the Gupta empire? Who were its rulers and how does the map change over time?

To get answers is nice but for me it is important that the algorithms involved are relevant to any country old and new. Relevant to timelines old and new. When we can express dates in the "Year Gupta", we can check if dates in Wikidata are indeed Julian or maybe Gregorian.
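At the level of whole years, expressing a date in a "Gupta year" is no more than subtracting an epoch. The sketch below shows only that arithmetic. It assumes an epoch of about 319 CE, which is a commonly cited value but a debated one, and it ignores that a Gupta year does not start on 1 January, so a result can be off by one. The names `ASSUMED_GUPTA_EPOCH_CE`, `to_gupta_year` and `from_gupta_year` are made up for this example, and it does nothing to settle the Julian or Gregorian question.

```python
# Assumption: the Gupta era is counted from about 319 CE. The exact epoch is
# debated, so it is a parameter and not a fact baked into the functions.
ASSUMED_GUPTA_EPOCH_CE = 319


def to_gupta_year(ce_year, epoch=ASSUMED_GUPTA_EPOCH_CE):
    """Convert a Common Era year to a Gupta year (whole years only)."""
    if ce_year <= epoch:
        raise ValueError("year lies before the assumed start of the Gupta era")
    return ce_year - epoch


def from_gupta_year(gupta_year, epoch=ASSUMED_GUPTA_EPOCH_CE):
    """Convert a Gupta year back to a Common Era year (whole years only)."""
    if gupta_year < 1:
        raise ValueError("Gupta years start at 1")
    return gupta_year + epoch


# Wikidata turned four in 2016, so it was born in 2012.
print(to_gupta_year(2012))
```

Because the epoch is a parameter, the same two functions work for any era that counts years from a fixed starting point, which is the point of wanting algorithms that are relevant to timelines old and new.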

When we have continuity in maps over time, we will know of which country, of which culture, a location (a city for instance, or the land of a tribe) is a part.

Wikidata live long and prosper :)

Saturday, October 29, 2016

#Wikidata - Queen Kumaradevi

Queen Kumaradevi was married to Chandragupta I. According to Wikipedia she was of the Licchavi clan. The coin shows her with her husband; it was minted by their son.

When you read Wikipedia, you will read about daughters of kings married off to nobility. They paint a picture of alliances, their marriages often meant some stability in an often brutal world.

When you are interested in such things, western nobility is well documented. Not so for nobility of India. I have added lately a series of maharajahs, kings and emperors and am every time amazed that nobody beat me to it. I often document who was related to who and often find missing links documented and add items for them. Regularly the missing links are implied but miss a generation.

I am sure of one thing; India has its fair share of people who know and care about such things. How do we get them interested, how do we get proper information about all this in Wikidata?

Sunday, October 23, 2016

Kigeli V, Mwami of Rwanda

Kigeli was the last ruling Mwami of Rwanda. He died October 16.

When a last ruler dies, it follows that there are previous rulers and there is a lot that is of interest in the history of the mwamis. His father for instance was deposed because he refused to become Catholic.

I have added the rule of several mwamis to Wikidata because such basic information is often lacking. Wikipedia articles are often stubs at best and sources are often absent.

Typically a monarch is part of a dynasty. A new dynasty often represents a new family but certainly a change big enough for it to be recognised as such. The article on the kingdom of Rwanda describes the role of the mothers of a king. They are as yet unknown to us and consequently a lot of relevant information is missing.

When you see all those red links, it is obvious that significant red links exist in any language. When they are linked to Wikidata, adding information like the succession of rulers and who is related to whom becomes a task that can be done once and be done well. It is one way to emancipate information that has been of little concern to Wikipedias.

Saturday, October 22, 2016

#Wikidata - statements are doing fine

In September there are more Wikidata items with 10 or more statements than items with no statements. Wikidata is growing up.

Thursday, September 29, 2016


I read an article, I found what was written astounding and signalled that I had to read it again to really understand what is said and what it implies. The article was published in a quality newspaper; the Independent. The reply that I got was: "Indeed. And it's Fisk, so you can't just pretend it is an obscure journalist talking about something that may have happened..."

As I did not know Robert Fisk, I looked him up. I checked his Wikipedia article and found that he has indeed a reputation that is really good. He received many more awards than was known at Wikidata so I added several and it is fun to establish the quality of its sources. For the Lannan Cultural Freedom Prize the Lannan website says it all. It is linked on the item for the award and that should suffice. For the Amnesty International UK Media Award it is not so obvious. It is conferred by the UK branch of Amnesty International and it has no dedicated page for the award. I added the award, the chapter and had a look at the pages for the award ceremony for each year. These Wikipedia articles refer to webpages that no longer exist.

For the Lannan Cultural Freedom Prize I added the other recipients because it gives some insight in the relevance of the award. I did not do this for the Martha Gellhorn prize for journalism.

The point of this all is that reputation amounts to trust in the message that is written. Read the article; it is likely that you are not familiar with the Wahhabi belief, a subset of Sunni Islam that is practiced in Saudi Arabia. The article is about 200 Sunni scholars that denounce the Wahhabi belief. Several major scholars are involved. Have a read and have a think; the article is by a major journalist, published in a major newspaper, about something that is not without consequences.

Thursday, September 08, 2016

#Wikimedia - the need for #scepticism

It is all over the news; another psychology study debunked. With two thirds of the repeated studies being debunked, there is a lot in the literature of psychology that is no longer valid. The source for the article I read is Mr Eric-Jan Wagenmakers, professor at the University of Amsterdam.

The NWO, the Netherlands Organisation for Scientific Research, is funding 3 million Euro to repeat key research. The problem is that science is in love with what is new and quick results. Three million is at best a start.

When science cannot be relied on, collaboration with scientists and universities easily becomes controversial. The programs taught are inherently point of view and often a conflict of interest is easily established. Consider; when doctors prescribe substances that are FDA approved, it seems obvious that these substances have a positive effect on patients. Then consider that we have a Wikipedian in Residence at Cochrane; they make a reputation from debunking much of the use of such substances. We provide end user information and it seems obvious that just repeating the list of FDA approved substances without further information is not at all in our users' best interest. It is even likely that we are liable for misinformation in several jurisdictions.

There is a need to be sceptical about sources. It is important that we not only improve the technology behind our sources, we also need an ability to mark information as debunked and have that information filter through our projects and into the information we provide. Remember, debunked is not a POV; it comes with sources of its own.

Sunday, September 04, 2016

#Diversity - A Woman's hall of Fame

Wikipedia has a category of some 40 Women's Halls of Fame. They honour women from the past and the present who are seen as exemplary. For all the women who have an English article there is now a statement indicating that they are seen as such.

For many women who are on these lists there is no article. Obviously when the objective is to have quality articles on notable women, it is good when there are lists with articles that could be written.

There are such lists and the best thing is that there is some form of automated maintenance. The Women in Red project has such lists. Many of their lists find their basis in Wikidata and it is therefore possible to add people to their lists by adding key data.

All the women who have articles are now known as such. The next thing is to add the missing articles, the red links. So far I have added items for them one by one and stated what they are known for. Obviously this is a stub. More information is needed to state what they are known for, where they lived, why they are notable. It is not only how you enrich the data, it is also how you increase diversity.

#Wikidata - the conflict of interest in medical information

According to the clinical evidence handbook only 12% of the 2500 substances and treatments most prescribed by doctors are not proven effective. There is a massive conflict of interest when unsubstantiated facts are allowed in Wikidata. Arguments like "it is NPOV" are used to defend the practice or "it is harmful for patients" when they can find out that a substance is no better than a placebo but does have negative side effects.

When an external source knows about a substance, it is fine to link to that source. This is not the same as importing the data wholesale particularly when the data is so obviously categorically problematic.

The Wikimedia Foundation has a responsibility and it is not in indicating what substances are prescribed. When we are to include information it is not on the basis that it has been approved for use but on the basis that it is actually proven to be beneficial. An error rate of 12% on such vital information is not acceptable.

Sunday, August 28, 2016

#Wikidata - La Galería de las Mujeres de Costa Rica

#Marketing is something the #Wikimedia Foundation does not do. It does not mean that concepts like KPI are foreign to the WMF. Take this list from the English article "La Galería de las Mujeres de Costa Rica"; the women listed are "women who have broken gender stereotypes and advanced human rights principals".

A lot of effort goes into fighting for a diverse Wikipedia where women are given proper attention. If I were a marketing man, I would say that lists like this provide pointers to people who want to help. I would be happy with a list that shows all the current people with an article and I would be ecstatic when I had a list that would show all the missing articles and that would auto update.

The funny thing is that technically it is not that hard to produce. It is not even that hard to include the technology into MediaWiki but it takes a marketing man to drive the point home that you have to engage people and that it shows the quality of a Wikipedia project when we know where we are lacking and where we should concentrate.
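To give an idea of why it is not that hard: a list of missing articles is one SPARQL query away once the women have items. The sketch below only builds the query text. It assumes that membership of the list is recorded as "award received" (P166), which is a modelling choice and may not match how a given list is stored. The function name `missing_articles_query` and the item id `Q0` are placeholders made up for the example, not real identifiers.

```python
import re


def missing_articles_query(award_qid, wiki_url="https://en.wikipedia.org/"):
    """Build a SPARQL query for people with a given award who lack an article.

    Assumption: the list is modelled with P166 (award received).
    """
    if not re.fullmatch(r"Q\d+", award_qid):
        raise ValueError("expected a Wikidata item id such as 'Q123'")
    return f"""
SELECT ?person ?personLabel WHERE {{
  ?person wdt:P166 wd:{award_qid} .
  # Keep only the people without an article on the chosen Wikipedia.
  FILTER NOT EXISTS {{
    ?article schema:about ?person ;
             schema:isPartOf <{wiki_url}> .
  }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
"""


# "Q0" is a placeholder, not the item of any real list.
print(missing_articles_query("Q0"))
```

Pasting the printed text into the Wikidata query service would give a list that updates itself every time it is run. Dropping the `NOT` gives the other list, the people who already have an article.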

Tuesday, August 23, 2016

#Wikidata - Colorado Women's Hall of Fame

There is a continuous effort underway in #Wikipedia to celebrate notable women. When women are seen as a role model, it is obvious that they deserve attention.

The Colorado Women's Hall of Fame is an organisation that celebrates women and every year 10 more women are included. The article on the organisation includes a list and it includes many red links. So more can be done, not only in Wikipedia but also in Wikidata.

As Wikidata is maturing, SPARQL is now of sufficient quality that many of the tools developed by Magnus are transitioning to SPARQL. This takes time and at the same time some tools are discontinued or do not fully function any more. Linked Items is one such tool. It creates a list of items that are found in a Wikipedia text. It is ideal when a text based file full of wiki links exists. It is just a matter of copying in the links and it will generate a list with Wikidata items for you. It is then necessary to restrict the items that are used and for this it was possible to use WDQ, the engine that could do this when SPARQL for Wikidata was a distant dream. Sadly it does not work anymore.

A solution is taking the list of items and copying it to Petscan, the tool Magnus favours. It uses SPARQL and it is something of a Swiss army knife for data. When you are used to earlier tools like Autolist, many of the assumptions are wrong and it takes time to discover how the tool works. It does work and that is why there are a large number of women who are now known to be in the Colorado Women's Hall of Fame.

Sunday, August 14, 2016

#Wikidata - #quality is not abstract

There is a new "Request for Comments" on quality for Wikidata. It is an attempt to describe quality in a top down approach. It is about words, it is abstract and well, I wish them well.

Wikidata has qualities. When you understand Wikidata by what it is and what it does you understand the not so abstract qualities it has. Its principal aim is to bring structure to the data that is in the Wikimedia projects.

The first quality that Wikidata brought was that it replaced the text based interwiki links. The improvement was important; in a short space of time the quality of these interwiki links improved and the associated number of edits went down. The quality of the interwiki links is not absolute but there has been no research on the follow up.

Interwiki links represent a connection between articles of Wikimedia projects that are about the same subject. Within a Wikipedia, a Wikisource there are links that are in essence similar to Wikidata statements. When a university is mentioned, the subject may be a student or staff at that university and when the statement has been made there is a reason for inclusion in categories. We can research the concurrence of such statements and Wikilinks. Quality improves when the concurrence improves.
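One way to make "concurrence" measurable, and it is only one possible definition, is the share of the items linked from an article that also appear as the value of a statement on the item for that article. The sketch below assumes exactly that definition. The function name `concurrence` and the item names are made up for the example.

```python
def concurrence(wikilinked_items, statement_values):
    """Share of wikilinked items that also occur as a statement value.

    Returns a number between 0.0 (no link is backed by a statement)
    and 1.0 (every link is backed by a statement).
    """
    links = set(wikilinked_items)
    if not links:
        return 0.0  # an article without links has nothing to compare
    return len(links & set(statement_values)) / len(links)


# Hypothetical item names, for illustration only.
links_in_article = ["item-a", "item-b", "item-c", "item-d"]
values_on_item = ["item-b", "item-d", "item-x"]

print(concurrence(links_in_article, values_on_item))
```

In this made-up case two of the four links are backed by a statement, so the score is 0.5. Measured over time and over many articles, a rising number would be a not so abstract sign of improving quality.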

When enough data is available, it becomes possible to use Wikidata statements in templates. Templates and info boxes expect high quality data in Wikidata and the available data is typically not good enough. When it is easy to make statements to wiki links and red links, the data in an info box will grow with the added statements.

We do need to work on the quality for our readers. This is done best by leveraging the data we have and engage our communities not only to link articles together but also by expanding these links with the statements that bind them together.

Yes, we will have to solve abstract issues but the reality is that they are not so abstract. Issues have their basis in what Wikidata is, and we have to understand them in the light of what we hope to achieve: serving the world with the sum of all our available knowledge.

Monday, August 08, 2016

Is convergence between #Wikipedia and #Wikidata possible?

Wikidata is piggybacking on Wikipedia, I was told. This is true; much data is imported from any and all of the Wikipedias and thereby Wikidata changes for the better. It improves in quality and becomes much more than what any single Wikipedia has to offer. At the same time Wikidata is rather awkward in its use and there has been too much thinking in terms of what people know and expect for their own project.

Perspectives evolve. I tend to think of Wikidata as not yet good enough for most purposes. It is incomplete and its quality is inconsistent when we consider statements about its items. The remedy is obvious; work on the areas that are relevant and where Wikidata can easily make a difference.

That is a fine road map for me but Wikipedians also use Wikidata, they even need to use Wikidata. When they add an article about a person, the authority control data is served from Wikidata and they have to add the information to Wikidata if it is to show. So what can be done to make this easy so that the use of Wikidata and Wikipedia may converge?

One aspect that seems important is that Wikidata information needs to function in whatever edit mode. The biggest motivational handicap I found is that most of what I did does not have an effect. It is much more rewarding when effects are more noticeable. All wiki links in an article link to other articles that have items of their own. Why not have a toggle that either shows these links with relations or not? For the brave hearts that take an interest it is cool; the others do not even have to notice.

When such links are annotated, they result in statements and such statements may even imply categories or other subsequent functionality. Currently bots only harvest in Wikipedia but why not have them add to the Wikipedias in a predetermined way? It makes for a much more dynamic editing process and it will definitely improve quality.

What do you think?

Tuesday, July 26, 2016

Have #Wikipedia share the sum of available #knowledge

If Wikipedia is to succeed in sharing the sum of all knowledge, it has to first share the sum of available knowledge. To do this Wikipedians have to become more inclusive. They have to realise that Wikipedia is not about them but about its readers.

Typically the question "What do readers want" is answered by what readers find. This answer has one flaw. It assumes that Wikipedia includes what people seek and it forgets what people seek and do not find. This is a lost opportunity on many levels. To start with, Wikipedia is not singular and a subject may exist in another language. As we do not know what is missed, we do not know what to write to satisfy an existing demand. Finally more and more available information does not even have a Wikipedia article but its information is available in other projects.

A partial solution to these issues was around for a long time. It extends search by adding results from Wikidata. It allows you to find data in any script from any project. If there was no article, it shows information using the Reasonator. It is relatively easy to revive this and it will make even more sense when its results are included as positive results.

Once Wikipedians consider Wikidata as a tool, they will find that both red links and wiki links may link to Wikidata items. Typically they are the same links for the same subject in any language. This is relevant to editors because it is one way to clarify what links exist to an article and, it is only one step away to annotate them as statements in Wikidata and thereby document such links. They will find a lot of erroneous links and it will improve overall quality.

The good news: the links between wiki links and Wikidata items already exist. What is lacking is a verification process that these wiki links are good. Adding links to statements for red links is technically not that hard. It will add some turmoil at the Wikidata end; many items will be added and will have to be merged eventually. One benefit of this approach is that it is not necessary for everyone to collaborate but it will benefit the people that matter most; all the readers of all the Wikipedias.

Saturday, July 23, 2016

#Wikipedia - #GMO controversy as a red herring

#Wikipedia has/had a big discussion on the safety of GMO food. When you read what the Signpost has to say, it is only about the safety for people to eat this stuff.

The problem is that many promises have been made and this is only one issue, not even the most relevant issue. Read the article "20 years of failure" by Greenpeace or read its rebuttal to what some Nobel Prize winners had to say.

The question if it is safe to eat is only one. The question if it will do us any good is more relevant. It does not bring us a more reliable food supply. It will not bring us more resiliency against climate change and it is very much in doubt that "golden rice" actually brings additional vitamins while a balanced diet does.

The important point of Greenpeace is that it backs its assertions with science. It is not in it for the money and its aim? A world that we can live in.

Monday, July 18, 2016

#Wikipedia - Dr Mary Meeker and SOI testing

Mrs Meeker and her husband Robert Meeker worked on a system used in education. She is known for applying Guilford's Structure of Intellect theory ("SI") to creating assessments and curriculum materials for use in teaching children and adults. The premise of SI is that intelligence comprises many underlying mental abilities or factors, organized along three dimensions—Operations (e.g., comprehension), Content (e.g., semantic), and Products (e.g., relations). When you are interested, read the article.

The article compared her work to the debunked Myers Briggs Type Indicator. This is something we should not do. The article on Mrs Laurie Helgoe provides all the arguments needed to restrict information on that indicator to that article. It is not best practice to use tools that are ambiguous in their results and therefore using it in comparison is not in the interest of our readers.

Sunday, July 17, 2016

#Wikipedia - notability of Mrs Laurie Helgoe

When popular knowledge gets debunked, it makes for notability. Mrs Helgoe debunked the Myers Briggs Type Indicator. It is used a lot even by those who should know better to classify human personality traits.

It is quite something when research shows how much popular methods are wrong. Instead of representing 25-30% of the population, introverts make up 57% of the population. It means that Myers Briggs is off by 100%.

The critique of the article for Mrs Helgoe has it that it is an orphan; no articles link to it. Having read the article, it is more valid to find fault at the Myers Briggs article; it does state that the method is not valid but it more or less glosses over that fact.

The problem with the Myers Briggs article is that it attempts to explain the method used, a method that is invalid.

#Reasonator - the perspective on #Wikidata people do not get

#Wikidata is where Wikimedia data lives. It started with a big service to Wikipedia; it centralised its interwiki data and this was a huge step forward in its quality. There is still a lot of work done on improving it even further because many of the problems left need a different perspective.

The next official challenge is to provide data to infoboxes. This problem is utterly different from the challenge of replacing interwiki links. It is impossible to import all the data from infoboxes all at once and start improving. The quality of the data in infoboxes is worse but that is not the problem.

So people have imported oodles of data and the quality is as expected; poor but improving. One problem is that all the work is happening at Wikidata and it does not transfer to Wikipedia. There is not even an official way to have a good look at the data available at Wikidata. The unofficial tool is Reasonator, it is currently broken and it is why I am reflecting.

Reasonator provides an intelligible perspective on the data of an item. It makes many problems "obvious". It shows imported statements and it shows all the references to the item that is shown. It allows you to see all (with a maximum of 500) statements that share common properties.

With a functional Reasonator, many people work on data from Wikipedia with a Wikidata perspective. When Wikidata is to fulfil its promise of improving the quality of data of Wikipedia considerably, the first thing to do is change objectives and perspective. The perspective could be Wikipedia based and the objective is not replacing data in infoboxes but quality. The good thing is that it is actually possible to achieve this.

A few observations; all wikilinks are in effect links between Wikidata items. Many of the links indicate that an article "needs" to be in a category and consequently this can be automated.

Why do this? When people look at all the wikilinks with a Wikidata perspective, it will make a lot of faulty links obvious. A painter of the 16th century did not receive a 20th century award for instance. Quality will improve. As more statements and possibly items are created, it will affect every article about the same and related topics.
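The kind of check hinted at here can be sketched in a few lines of Python. This is only an illustration with made-up data; the `Person` and `Award` classes and the function name are hypothetical and not part of Reasonator or Wikidata. It flags awards that were first conferred after the person died. Posthumous awards do exist, so the result is a list of links to review, not a list of errors.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Person:
    name: str
    birth_year: int
    death_year: Optional[int] = None  # None when the person is still alive


@dataclass
class Award:
    name: str
    first_awarded: int  # year the award was conferred for the first time


def implausible_award_links(person: Person, awards: List[Award]) -> List[Award]:
    """Return the awards that were first conferred after the person died."""
    if person.death_year is None:
        return []
    return [a for a in awards if a.first_awarded > person.death_year]


# Hypothetical example data, not taken from Wikidata.
painter = Person("Example Painter", birth_year=1500, death_year=1560)
linked_awards = [
    Award("Hypothetical guild honour", first_awarded=1540),
    Award("Hypothetical 20th century award", first_awarded=1950),
]
suspicious = implausible_award_links(painter, linked_awards)
```

Here only the 20th century award ends up in `suspicious`; the guild honour predates the painter's death and passes.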

It needs only one thing, a Reasonator like view of the data from a Wikipedia point of view.

Thursday, July 14, 2016

#Wikidata - Virginia Berninger; Samuel Torrey Orton award 2015

The Samuel Torrey Orton award is conferred by the International Dyslexia Association. It is named after Samuel Orton who was a pioneer in the field of dyslexia.

Mrs Berninger was added to Wikidata because she is the 2015 recipient of the award. It is my intent that Wikidata slowly but surely knows about the more recent award winners, one at a time. It so happened that two of my projects intersected; adding information about female psychologists and awards. Mrs Margaret J. Snowling received the award and this bit of data was added.

My notion of quality for Wikidata is that items need their statements and that more links are better. This allows for all kinds of statements: linking awards to the conferring organisation, the website of an award or an organisation, other awardees.

The funny thing is that adding Mrs Berninger may encourage Wikipedians to write an article about her or at least add her to the list of award winners :)

Monday, July 11, 2016

#Wikidata - Margaret D. Foster - a #female #scientist

I was asked to blog about Mrs Foster. The argument was: "This article missed all the points why Mrs Foster is notable". One great feature of the improved article is a picture that was lovingly restored by Adam Cuerden.

Well, to be honest, I remember a presentation by Rosie Stephenson-Goodknight where she argued that the first step to get some gender balance is to write an article warts and all. It does not have to be perfect, the least it does is be there and invite scorn and improvements.

This sentiment is part of the original Wikipedia ethos; it is good to have stubs and red links. It is good to have a start to improve upon. In this sentimental spirit I improved the data on Mrs Foster on Wikidata a bit. I used Autolist to add the content of a few categories and, I added some universities she had attended.

So yes, the article has improved and it is exactly why both Wikipedia and Rosie are a success.

Saturday, July 09, 2016

#Wikidata - Bródy Sándor-díj

When #Wikidata has really succeeded, it includes all the data of all the Wikipedias. The Sándor Bródy Prize is known on three Wikipedias and it is reasonable that the Hungarian Wikipedia has the most information.

The last known winner, Gábor Kálmán, won the prize in 2012. Currently it is a red link. There is no information about who won in later years and my Hungarian is not enough to find out more if the prize was conferred.

All this transpired from a recent idea that in order to improve the quality of Wikidata for awards, we should add all the winners of awards for 2015. Lydia suggested asking on Twitter for a query and both Magnus and Wikidatafacts provided a SPARQL query. For the Sándor Bródy Prize no winners were known; this was remedied with the "Linked Items" tool. As the objective was to only add the last winner, Mr Kálmán and the date for 2012 were added. There are some 13,881 awards known without a 2015 winner.
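For those who wonder what such a query may look like, here is a minimal sketch. It is not the query that Magnus or Wikidatafacts provided; it is my own reconstruction and it assumes the usual Wikidata identifiers: P166 for "award received", P585 for "point in time" and Q618779 for "award". The Python function only builds the query text; running it against the query service is left out, and on the full data it may well time out without further restrictions.

```python
def awards_without_winner_query(year: int, limit: int = 100) -> str:
    """Build a SPARQL query for awards that have no winner dated in `year`.

    Assumed identifiers: P166 = award received, P585 = point in time,
    Q618779 = award. This is an illustration, not the original query.
    """
    return f"""
SELECT ?award ?awardLabel WHERE {{
  ?award wdt:P31/wdt:P279* wd:Q618779 .
  FILTER NOT EXISTS {{
    ?winner p:P166 ?statement .
    ?statement ps:P166 ?award ;
               pq:P585 ?when .
    FILTER(YEAR(?when) = {int(year)})
  }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {int(limit)}
""".strip()


query_2015 = awards_without_winner_query(2015)
```

The important part is the "point in time" qualifier; without it on the award statements there is no way to tell which year is missing.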

The objective for the Sándor Bródy Prize has not been achieved. However, the quality of the data has improved considerably. To make it as good as the information on the Hungarian Wikipedia, dates have to be added and two items have to be added to fill in for existing red links.

The point of all this is that it is possible to quantify a lack of data in Wikidata and by inference a lack of quality. As time goes by, people can use these queries as a tool to make improvements or people will just add data and as a consequence the quality will improve. Either way it is obvious that it takes time and effort to get the desired quality. However on a micro level, it is possible for Wikidata to be better than any of the other projects because its data for a specific award is better. For the Sándor Bródy Prize all it takes is two items and a few dates.

Friday, July 08, 2016

#Wikidata - making a statement

#Statistics are powerful particularly when they tell the whole story like the ones produced by Magnus. They are a set of statistics and they indicate the progress of Wikidata. The most relevant statistics are included; they indicate the number of statements, the number of labels and the number of links over time.

There is more to the statistics. For some people, references or, better still, the lack of references is why they oppose the use of Wikidata. Especially for them there are five statistics that indicate progress made. The good news is that more and more items have referenced statements (62.64%). This growth can be understood because a lot of effort has been going into providing tools to add sources and many people do add them.

Improving the quality of Wikidata is complex. There are many factors that make a difference. Personally I care most about rich annotations with statements for each and every item. Others care more about references. It is important that improvements are made in every way. This is why Wikidata becomes increasingly relevant and why new ways open up to improve its quality even further.

As Wikidata matures, its quality becomes increasingly obvious. When more Wikimedia projects use its data, it will grow the number of people who are involved and Wikidata will evolve into a rich and trustworthy source of data.

Thursday, July 07, 2016

#Wikimedia chapters

As a movement, much of the local effort is channeled through chapters. A lot of important work is done because they provide continuance to the work that we do. It enables us to foster relations and organise the more complicated activities.

The map shows beautifully where national chapters exist. In the USA chapters do exist but the map does not acknowledge this.

Typically I am content with the information that is shown through the Reasonator; it however has its limits. This is where the lists maintained through ListeriaBot become relevant. This list of chapters shows additional columns like "country" and "start date". It enables sorting and this is relevant functionality. The same list may exist on a Wiki supporting a different language, for instance Dutch. The cool thing is that the same bot will at some stage update all these "similar" lists. It is just a matter of completing the data for better use.

Tuesday, July 05, 2016

#Wikimedia talks - The long tail

Many talks about Wikimedia products and issues are given at a Wikimania. However it is not uniquely Wikimania where Wikimedia is the big thing; there are the conferences held by chapters. Many chapters, like the Dutch chapter, have a tradition of recurring conferences.

Such presentations are relevant. What was presented was the current state of affairs by the "thought leaders" at the time. Many presentations have not been recorded on video or audio but quite often the slides or a paper is available somewhere.

It is easy enough to add these conferences, these talks to Wikidata. It is relevant because it allows for lists that can be generated with a bot like ListeriaBot, it enables people to find the presentations and when they find it interesting they can even see what is available. What it also does is link presentations to the people involved. At some stage you may find how often Lydia presented and where. <grin> at this time, only once </grin>.

Monday, July 04, 2016

#Wikimania - What have we learned, how to experience it

We have not all been to Esino Lario. It is where Wikimania 2016 happened. That does not mean that it is not possible to see many of the presentations. You may find them on YouTube, maybe elsewhere. The same goes for presentations of previous Wikimanias.

In the history of Wikipedia and our movement, these presentations are notable. It even makes the people who presented notable. As I have been watching several presentations, as I argued that we are really bad at recognising our own notability, I have created a list; it links to lists of presentations for previous Wikimanias. The cool thing is that they are updated regularly by the ListeriaBot. So when I or someone else adds a Wikimania talk, the talk will magically appear.

What you find is rather basic but it works. You will be linked to the presentation on YouTube, you will be linked to the items for the "author", the language used for the presentation and the talk itself. At this stage Wikidata like Commons is mostly a repository. For best effect use Reasonator. Compare Wikidata with Reasonator for the presentation of James Heilman for instance.

I am really happy to have been helped by TweetsFactsAndQueries. He helped a lot in getting the SPARQL queries in a reasonable shape. He figured out how to show only the YouTube video ID instead of the full URL. It is probably possible to show an icon instead but that is for later. What is missing are links to the presentation (the power point) and the submission paper. I have no idea yet how that is to be modelled in Wikidata.
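To show what "only the YouTube video ID instead of the full URL" means, here is a small sketch. In the lists this was solved inside the SPARQL query; the Python below is a different technique that gets to the same result and the function name is my own. It assumes the two common URL shapes, `youtube.com/watch?v=...` and `youtu.be/...`; anything else is left alone.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def youtube_video_id(url: str) -> Optional[str]:
    """Return the video ID from a YouTube URL, or None when none is found."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None
    if host == "youtube.com" or host.endswith(".youtube.com"):
        # Watch links carry the ID in the "v" query parameter.
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None


# Made-up IDs, purely for illustration.
long_form = youtube_video_id("https://www.youtube.com/watch?v=abc123XYZ_0&t=42")
short_form = youtube_video_id("https://youtu.be/abc123XYZ_0")
```

Both calls give the same ID, and that ID is what gets stored as the YouTube video ID on the item for the talk.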

It is important to include such data for several reasons. First it brings access to the presentations for the people who are interested. Secondly it documents what we do and puts a timestamp on our thinking. Thirdly it documents our history.

Sunday, July 03, 2016

#Wikimania - James Heilman on #quality and #language support

James Heilman presented at Wikimania 2016. I have not been to Wikimania and even the people who went to Wikimania may not have listened in. James had a lot to say about quality and the power of having information in "other languages". The talk is powerful and the arguments are compelling. At this time the talk has only 48 views on YouTube.

Saturday, July 02, 2016

#Wikimania and #Wikidata #dogfooding

Being critical is one thing, showing how a difference can be made one item at a time is how to make a difference. At the latest Wikimania many interesting and relevant presentations were made. So far Wikimania talks were missing at Wikidata and as I have been watching several talks, I have added these talks largely using the Ted talks as a model.

What you find are only some of the talks, it is easy to add more and by adding the YouTube Video ID it is possible to see the presentations directly from the Reasonator.

So when you want to add your presentation, it is easy.

#Wikimedia - #notadog and #notmemberofthepack

It is controversial when you tell the #Wikipedia crowd that they do not look after their own. I am not a Wikipedian, I do not want to be as my experiences are not that positive. It is not that I do not care deeply about free content and the opportunities the Wikimedia Foundation offers.

My point is simple; as a community we seem to be only interested in the production and maintenance of content and not really in the quality of the reader experience / the consumption of all the good stuff that exists. I will support it with a number of examples.

Wikimania is the annual Wikimedia conference and it boasts many high grade presentations important for the understanding of the past and the history of what we do. For this reason it makes sense to do a thorough job and include all the presentations in Wikidata so that we provide the same opportunity to explore this content as we do for TED conferences and presentations. When we do, we will honour the many Wikimedians who presented in the past. This may prove controversial because of the many conflicting notions of notability.

Wikisource is a "Wikimedia project, an online digital library of free content textual sources on a wiki". For readers there is one vital problem. Much of it consists of works in progress; some need a finishing touch, others still need a lot of tender loving care. With a different approach to finished goods, and it is easy enough to know the status of sources, a clean user interface can be delivered to potential readers, expanding the reach of the work that is done.

Commons is an "online repository of free-use images, sound and other media files, part of the Wikimedia Foundation". As a repository it functions really well. Large numbers of media files are deposited and used within the Wikimedia projects but my experience is that when you seek an image, it is really hard. The category system is hard to navigate, there are no labels attached to images that help finding relevant content and it is English only. Finding something among 32,289,013 files is really hard. For now I have given up on Commons.

Being critical of Wikipedia is frowned upon. A typical response is "so fix it" but when solutions are offered that improve its quality, typically a suggestion falls on deaf ears. It is easy enough to improve the functionality of red links, but this idea is probably too mundane to consider even though it has been proven to be easy to implement.

People fault me for being blunt. To some extent it is part of the culture I grew up in; to some extent it is because I have lost faith in the "community". My experience is that there is too much group think and I am definitely not a member of the pack. I prefer to make up my own mind, I articulate my opinions and arguments and care not too much when people react negatively without considering the arguments.

Thursday, June 30, 2016

#Wikipedia - Wikipedian of the year Rosie Stephenson-Goodknight

Every year at Wikimania, a "Wikipedian of the year" is selected. This year there were two. Rosie Stephenson-Goodknight is one of them.

From the Wikipedia article it is not so clear why Rosie was selected for this honour. At best it is a stub and it needs a lot of work before it becomes clear why Rosie is notable. The article is a one liner with a lot of "external links".

Rosie was the one person missing as an award winner so it was easy enough to remedy this. It is now just for anyone to do a better job for Rosie both on Wikipedia and Wikidata.

Sunday, June 26, 2016

#Wikidata supports over 300 languages - implications

Wikidata is a single project that supports over 300 languages. The aim is that all data is usable in any and all of them. One important consequence is that for each item 300 translations of the label are needed.

Obviously fewer items is more. Each item has to have a purpose that is clear, obvious and cannot be expressed in another way.

For this reason I am opposed to the addition of all kinds of subclasses that add no value. There is no point to "APA Award". It is an award that is conferred by the American Psychiatric Association (APA) and it can be easily described in two statements.

It makes it extra hard to add translations. It is relevant to know that it is an award. It is important to know who conferred it but there is no point having this expressed in a combined item. It does not fit with the work that is done on awards. It makes Wikidata less usable and consequently such items need to be deleted.

Saturday, June 25, 2016

#Wikidata has a CC-0 license. This should not change II

Wikidata is becoming a repository where people may choose to share their data .. or not. When they do not want to share their data with Wikidata, it is their choice and that is fine.

The bottom line of copyrights for databases is that single facts cannot be copyrighted. It is only the whole of a database that can be under a copyright. When you look at the data of Wikidata and its structure, it is in many ways a reflection of all the Wikipedias. Increasingly its data finds its way into Wikidata and as a consequence data that may be found in a specialised database gets included in Wikidata.

Wikidata also has the habit of including identifiers to external sources on an item level. As a consequence people can see what other sources have to say about the same subject. It also enables bots to make a comparison. When a bot writes a report about the differences, it is original research and consequently it does not violate any copyright. When based on such a report people make changes, it takes an effort to find what is correct and consequently it does not violate copyright either. When an agreement is in place, it is possible to add missing data to Wikidata. When done properly there will be an attribution of the original source and, when it is done by a bot, it may be a bot dedicated to that resource.

The objective of the Wikimedia Foundation is to share data. This is why it makes so much sense for Wikidata to have a CC-0 license. As the quality improves, as more and more comparisons are made and the differences are reconciled the data becomes more valuable. Given its scope, not much is out of scope and it is obvious that Wikidata needs to include data from other sources wholesale. It may get information in so many ways. With the CC-0 license it is obvious. Use our data, compare our data, improve our data and this will bring more power to us all.

Wednesday, June 22, 2016

#Wikidata - Home Children

As I am adding more information to a female psychologist, Mrs Margaret Humphreys, I found that she documented Home Children, a British program of sending destitute children abroad. Sending them away was cheaper than leaving them with their families on welfare.

It became such a scandal that prime ministers of several countries that were involved apologised for the awful way people were treated.

Originally it was considered a solution to the slavery in the British match making industry. From good intentions it became something dreadful.

The problem; what statements to use to identify this program, the people who apologised for it, the original good intentions.

Tuesday, June 21, 2016

#Wikidata - the Lange-Taylor Prize

Wikidata knows about many awards and it is a challenge to make the information available but it is even harder to keep it up to date. An example is the Lange-Taylor Prize.

Have a look at the English Wikipedia article; strictly speaking it is not a list. It is a mishmash. To make this a "proper" Wikidata list, it helps when the "point in time" is added to the award winners. It helps when they are completed. Michel Huneault is the 2015 winner; he or she has to be added as an item to Wikidata and is not the only award winner who has neither a red link nor an item.

Adding the point in time as a qualifier has an additional relevance. It becomes possible to build a query with no award winners for 2015. When it is missing and this happens a lot both in Wikipedia and Wikidata, we can check the website for the award and maybe find a 2016 winner as well.

As it is, the English article is a stub. There are missing links, for instance to Mrs Katherine Dunn. Adding all this info to Wikidata improves the quality of its data but it also makes it possible to incorporate this list on both the English and the Czech Wikipedia.

Monday, June 20, 2016

#Wikidata - #Pakistan Peoples Party politicians

There has been an announcement that lists may be generated on a Wikipedia using Wikidata data. For the Urdu Wikipedia, a list of politicians of the Pakistan Peoples Party could be interesting. This functionality is not available yet, but Reasonator does show us what would be on such a list.

As you can see many of them do not yet have an article in Urdu or alternatively a label. Once a label has been added, it will show up in the list. This may also help other languages from Pakistan like Sindhi because the label will fall back to Urdu and not English.
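How such a fallback could work is easy to sketch. The chain below is hypothetical: the language codes "sd" (Sindhi), "ur" (Urdu) and "en" (English) are real, but the order is my assumption for the sake of the example and not the actual configuration of Wikidata or MediaWiki.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical fallback chains; the real configuration may differ.
FALLBACK_CHAINS: Dict[str, List[str]] = {
    "sd": ["sd", "ur", "en"],
    "ur": ["ur", "en"],
}


def pick_label(labels: Dict[str, str], language: str) -> Optional[Tuple[str, str]]:
    """Return (language code, label) for the first language in the chain that has a label."""
    for code in FALLBACK_CHAINS.get(language, [language, "en"]):
        if code in labels:
            return code, labels[code]
    return None


# A made-up item that has an English label and a freshly added Urdu label.
labels = {"en": "Example Politician", "ur": "مثال"}
shown_for_sindhi = pick_label(labels, "sd")
```

With this chain a Sindhi reader gets the Urdu label as soon as it exists, and English only when Urdu is missing as well.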

Saturday, June 18, 2016

#Wikidata has a CC-0 license. This should not change.

The Wikipedia Signpost is a publication of the English Wikipedia. It published a piece about copyright and Wikidata and it suggested that a more restrictive license would be fine. Their problem: others benefit and do not need to acknowledge Wikidata as a source.

For me the most important thing of our work is that it is used. Everything we can do to make our data used more increases the value of our data. This is best achieved by refusing to put any restrictions on our data.

One argument for another license is that "it recognises the labour that goes into maintaining the data". The question is how to recognise this and why. Every data point has its own history both for the property and for the data and as a consequence it is the database that you refer to for the attribution. For human consumption it is the label that gives Wikidata much of its relevance; giving tribute to the people who add labels is as relevant.

Data is mostly generated in an automated or semi-automated way. I would not have over 2 million edits if all statements I added had to be done by hand. With StrepHit, a tool that retrieves facts from authorised sources, data gathering will become even more sophisticated, reliable and complete. The link to personal glory in attribution becomes very much absent.

Wikidata will become increasingly rich in references and tools like StrepHit will ensure the quality of such references. Wikidata is already very rich in references to other sources of data and it is why Wikidata will evolve into a resource for comparison with the data in these sources. These other sources may opt to adopt or report and the same is our option. Comparisons allow us to research the issues that exist with the data we hold and these comparisons will become highly automated and intelligent.

My point is very much that Wikidata is not a glory project. Our data is incomplete and immature and in several ways more ambitious than what a Wikipedia aims to do. Wikidata can include the ambitions of a Wikipedia up to a point. To realise its own ambitions, becoming a valuable and valued resource in the web of data set, it is important to be as open and available as we can be. A license that does not restrict is one of the underpinnings. Moving towards a more restricted license will only create a morass of uncertainty and doubt. It will bring us no benefit.

Thursday, June 16, 2016

#Wikidata - Mark Fiore won the 2016 #Herblock prize

The Herblock prize is just one award I added data to. I grabbed the data from the Wikipedia article and used "Linked Items" to import the winners. I checked the website of the award and noticed that there is a winner.

I added Mr Fiore as the 2016 Herblock prize winner.

I have done this before but something is changing. At Wikidata they are investigating how lists with Wikidata data may be used in a Wikipedia. Now that makes all the work that I have done relevant because I have concentrated on such lists and categories.

When this works out well, it takes one edit to include new data in every Wikipedia that has an interest in certain data. As Wikidata is finally evolving in this direction, things like showing a label, hopefully any label, when a label in the language of the Wikipedia is missing are now relevant. Another new feature is that changes from Wikidata may be shown in the history.

The next thing to consider is that when Wikidata knows that somebody studied at a university, it automatically shows in an associated category. Technically it is not hard; selling it to the Wikipedia crowd may be.