Tuesday, December 06, 2016

#Research to help #Wikipedia do better

It is one thing to bemoan everything that is problematic with research; it is another to do better. For research on Wikipedia to be published, it has to be about "English", OR it has to be linked to English, OR publication is not the end goal.

At the Dutch Wikimedia Conference, Professor de Rijke gave the keynote speech. He spoke about the kind of research he does and about the "Wikipedia" research performed at the University of Amsterdam. He challenged his audience to cooperate, and his challenge resulted in me formulating ten proposals for research. The point of these proposals is that I hope they provide more worthwhile insight and that each includes a link to “English” so that it can be published.
  1. Previous research studied how long it took for a subject to appear in English Wikipedia after it was first mentioned in the news or on social media. The new question would be: how long does it take for the same subject to appear in any Wikipedia, how long does it take and to what extent does it happen that those articles get corresponding articles in other Wikipedias, and how long does it take for the English Wikipedia to take notice?
  2. In the search engine for Wikidata we use the description to help differentiate between homonyms. There are two approaches to a description. Many existing descriptions are not helpful, and hardly any items have descriptions in all of the 280 languages. There are, however, also automatically generated descriptions. The question is: what do people like more, the automated descriptions or the existing descriptions? Is there a real difference for people who use Wikidata in English as well?
  3. Many people know several languages; this is obviously true for readers of Wikipedia. For the regulars there is a “Babel” template that allows them to indicate what languages they know. For the others, geo-location is used for some purposes to make a guess. Do people find it useful when search results indicate that articles exist in the languages they know? Does it make a difference when a quality indicator is shown for those other texts on the same subject?
  4. Many people make spelling errors when they search for a subject or when they create a wiki link to another subject. Google famously suggests what people may be looking for. We can expand the search and include items from Wikidata (40% increase in reach), but we can also use Google or any other search engine to help people get to the sum of all knowledge. We can ask people to answer some questions after they are done. Are people willing to do this, and how does it expand the range of subjects that we know about? Are people willing to curate this information so that we can expand Wikidata and at least recognise the subjects we have no articles about?
  5. When we show the traffic for the articles people edited in the last month, we gain an insight into what people actually read. We also congratulate people on the work they did and show appreciation. Does this kind of stimulus lead to more articles? How do you stimulate work on subjects that people hardly read (e.g. Indian nobility)? Do you compare with existing articles in the same category?
  6. There have been several Wikipedias that include bot-generated texts. It is a famously divisive issue in the Wikipedia community. There has been no research done on this. With Wikidata there is an alternative way to exploit the underlying data. When the data is included in Wikidata, it is possible to generate text on the fly. This text may be cached for performance reasons, but there are two main advantages: both the script and the data can be updated. The question is: does it serve a purpose for our readers? Will editors update the data or the script to improve results, or will they use the text as a template for new articles? Will it take the heat out of the argument about generated texts? How will it affect projects that were not part of the existing controversy, and does it work for them?
  7. Wikidata does not allow for the dating of its labels. It follows that it is not easily understood what the relation is between Jakarta and Batavia. How are such issues generally stored as data, and what alternatives exist for Wikidata? How would this improve the usefulness of Wikidata as a general topic resource?
  8. Wikidata now includes data from sources like Swiss-Prot. What are the benefits to both parties? Does it lead to people editing this data at Wikidata, and what is the quality of such edits? Does it get noticed by Swiss-Prot, and is there cooperation happening? How is this organised, and to what extent does “the community” interfere with the notions of academia? Do such communications exist, or are these groups doing “their own thing”?
  9. What is the effect on the ultra small Wikipedias when generated texts are available based on available labels? Does it mean more interest in creating the templates for articles and in work on labelling? What does it mean when such generated articles are available to search engines?
  10. At this time many articles in the English Wikipedia are written by university students. The result is positive on many levels, but the question is: is what they write understood by Wikipedia readers? When students write their articles, they mostly base them on literature. It is well known that the bias in scientific papers is huge. Negative results are not published and many results from studies are ignored. The question would be: is sufficient weight given to debunking studies, or are they put aside with an argument of a “neutral point of view”? This would make sense when students are graded on what they write according to the accepted facts at the university.
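
To make proposals 6 and 9 a little more concrete, here is a minimal sketch in Python of what "generating text on the fly" from labels could look like. Everything in it is an assumption for illustration: the names `TEMPLATES`, `ITEM` and `generate_text` are hypothetical, the data is a hard-coded dictionary rather than anything read from Wikidata, and the templates are plain strings rather than any existing Wikimedia tool. The point is only that the "script" (the template) and the data (the labels) live apart, so either can be updated on its own, and that a language without labels or without a template simply gets no text.

```python
from string import Template

# Hypothetical per-language "scripts": plain templates, not a Wikidata feature.
TEMPLATES = {
    "en": Template("$label is a $instance_of in $country."),
    "nl": Template("$label is een $instance_of in $country."),
}

# Illustrative item data keyed by language. A real system would read labels
# and statements from Wikidata instead of a hard-coded dictionary.
ITEM = {
    "en": {"label": "Amsterdam", "instance_of": "city", "country": "the Netherlands"},
    "nl": {"label": "Amsterdam", "instance_of": "stad", "country": "Nederland"},
}


def generate_text(item, language):
    """Render a one-sentence stub, or None when the template or labels are missing."""
    template = TEMPLATES.get(language)
    labels = item.get(language)
    if template is None or labels is None:
        return None
    try:
        return template.substitute(labels)
    except KeyError:
        # A label that the template needs is not available in this language.
        return None


print(generate_text(ITEM, "en"))  # Amsterdam is a city in the Netherlands.
print(generate_text(ITEM, "fr"))  # None: no French template or labels in this sketch
```

In a sketch like this, improving the English template or adding a missing label changes the generated sentence the next time it is rendered, which is the behaviour the research questions above ask about.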