Tuesday, March 31, 2015

#Wikidata - What to do in a #Datathon II

There is little point to a Datathon when the results have no practical impact. By implication there is little point to Wikidata when it has no practical application. Luckily, most of the Indian languages use Magnus's search extension, making any and all advances in Wikidata immediately useful.

The next thing to decide for a datathon is what it is you want to expose in your language, your script. The result will be biased, but the alternative is not sharing this information. That option is even worse.

Having said that, there is one upside to concentrating on a subject domain. Take for instance the "King of Nepal"; you will see that it is referred to in the "List of monarchs of Nepal". All these listed monarchs are now a "king of Nepal". It now takes one person to add a label in Nepalese to make this label visible on all the monarchs of Nepal. "King of Nepal" is a subclass of "king", and it likewise takes one person to add a label in Nepalese for "king" to make it visible for all subclasses of king.

This is the beauty of adding labels in Wikidata. Once a label has been added, it is used everywhere. Labels for "politician", "lawyer" and "date of death" are added once and are in use on hundreds of thousands of items. Adding labels is therefore really satisfying and effective.
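
Why one label goes such a long way can be sketched in a few lines of Python. This is only a toy model: the item identifiers, the dictionaries and the `show_position` helper are made up for illustration and are not how Wikidata stores its data.

```python
# Toy model, not Wikidata's real data model; all identifiers are hypothetical.
labels = {"Q_KING_OF_NEPAL": {"en": "King of Nepal"}}

# Every monarch points at the same "King of Nepal" item.
monarchs = {
    "Q_MONARCH_A": {"position held": "Q_KING_OF_NEPAL"},
    "Q_MONARCH_B": {"position held": "Q_KING_OF_NEPAL"},
    "Q_MONARCH_C": {"position held": "Q_KING_OF_NEPAL"},
}


def show_position(item_id, lang, fallback="en"):
    """Return the label of the position an item holds, falling back to English."""
    position = monarchs[item_id]["position held"]
    names = labels[position]
    return names.get(lang, names[fallback])


before = [show_position(m, "ne") for m in monarchs]

# One person adds one label (placeholder text, not a real translation) ...
labels["Q_KING_OF_NEPAL"]["ne"] = "<label in Nepalese>"

# ... and it shows up on every monarch at once.
after = [show_position(m, "ne") for m in monarchs]
print(before)
print(after)
```

The label lives on the item that is pointed at, not on the items that point to it, so a single edit is enough.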

Sunday, March 29, 2015

#Wikidata - What to do in a #Datathon

I was asked for pointers for a "datathon". It is adding data to Wikidata for a specific purpose. The most obvious thing is to be clear what it is you want to achieve.

What to add to Wikidata:
  • adding labels to items in a language
  • adding statements for existing items
  • adding items and statements based on a Wiki project
  • adding missing items to create links among items
Realistically, it is always a bit of all of that. The people attending are not all the same either; they differ in interest and they differ in skills. One goal for a "datathon" may be the transfer of skills. When this is the case, start with the basics of Wikidata: how to add labels, how to add statements, how to add items.

Another goal is to add information for a specific domain. This may be based on information known to a Wiki project but that is optional. When information for a specific domain is to be worked on, working together and using as many tools as are available makes a real difference.

As a blogpost should not be too long, more later..

Thursday, March 26, 2015

#Wikidata is ready for #Wikipedia on its own terms

Yet again a Wikipedian raises the old question about the quality of Wikidata. Yet again the same questions are raised. Yet again the same answers are given. The same questions are raised but with a "different" angle; "our policies have it that"... It is really old wine in new casks.

Wikidata is immature; it does not include enough data. This is true for Wikipedia as well; neither includes the sum of all knowledge. Arguably, Wikidata is more inclusive.

Several Wikipedias have a policy requiring sources for facts. What Wikidata does is compare its data with other sources and flag differences. This process is immature but it exists. It is probably as reliable as, or better than, the Wikipedia way of relying on one source at a time.

When someone enters incorrect data at three sources, he will be asked not to do it again or else... Just like in any Wikipedia.

As Wikidata matures, such questions will be increasingly desperate because who will care in the end?

#Wikipedia - Suzette Jordan 1974-2015

Additional attention for important women is always welcome. Mrs Jordan was known as the victim of the Park Street Rape Case. What makes Mrs Jordan so special is that she spoke out. This was a novelty and not really welcomed by the status quo. It was suggested by senior politicians that it was a misunderstanding between a lady and her client.

Thanks to women like Mrs Jordan, the silence around rape is changing into indignation.

Thank you Mrs Jordan,

Wednesday, March 25, 2015

#Wikimedia - Guy Kawasaki

With the title "The art of the start" Mr Kawasaki proves himself an author who is known for looking at things with a fresh eye. I have read the book and found it inspiring.

It is therefore that I am ever so happy to hear that Mr Kawasaki is the latest member of the board of the Wikimedia Foundation.

It will be interesting to see what a philosophy of looking fresh at issues and with an eye to create results will do for our movement.

I welcome Mr Kawasaki to our movement and I am ever so happy that in the quote used Wikimedia and not Wikipedia is mentioned. It inspires hope for more inclusive policies.

#Wikidata - #Chiefs and #Nigerians

Mr Willie Obiano is the current Governor of Anambra State. At this time, there is no article for one of the more influential men from Nigeria.

It is easy to add information about him to Wikidata as abundant information about him is available on the Internet. When you read about Mr Obiano, he is referred to as Chief Willie Obiano. It follows that in addition to being a governor he is influential for being a chief.

You may also find that Chiefs in Nigeria are organised and as such Mr Obiano is a "chief of chiefs". How to recognise Mr Obiano as a chief is not clear to me. He was elected a chief and he was elected to be the chief of chiefs..

I would love it when people step up to the plate and add this information, which is important for understanding Nigeria.

Tuesday, March 24, 2015

#Wikidata - Kian dehumanising disambiguation pages

Logic has it that a disambiguation page like the one for Hristo Nikolov is not a human. There are at least two of them. There are more of them but this disambiguation page does not know about the others..

Hristo Nikolov is the first item of a recent list produced by Kian. All of them include Wikipedia disambiguation pages that have been recognised as a human. It takes attention to fix the issues involved. It includes linking Wikidata items to the correct article and, it means removing and adding statements.

Once all items of this list are done, quality has improved. We can turn over a page and go into "maintenance mode". When this report is run weekly or monthly, there will be only a few new cases. They can be fixed quickly and we can be more confident about the Wikidata quality.

Sunday, March 22, 2015

#Wikidata - Gaston's war

Gaston's war is a "1997 film" according to the automated description in Reasonator. At the time of the Wikimedia Foundation Metrics meeting of 3.5.2015 there was no description in English. Since then somebody added the description in English. Hurray!!... Eh, not really.

It is only "Hurray" when you only care about English. In contrast, the automated description was already there for most other languages. Just as relevant: when an item is updated with statements, this is reflected in the automated descriptions in all languages, while a fixed description becomes stale, maybe even wrong.
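
The difference between a generated and a stored description can be sketched as follows. This is a minimal illustration under stated assumptions: the phrase tables, the `describe` function and the sample item are hypothetical and are not Reasonator's actual code or templates.

```python
# Hypothetical phrase tables; Reasonator's real templates are not shown here.
KIND = {"film": {"en": "film", "nl": "film"}}
PATTERN = {"en": "{year} {kind}", "nl": "{kind} uit {year}"}


def describe(statements, lang):
    """Build a short description from an item's statements for one language."""
    kind = KIND[statements["instance of"]][lang]
    return PATTERN[lang].format(year=statements["year"], kind=kind)


# A made-up item, not the actual data for any film.
item = {"instance of": "film", "year": 1997}
stored_english = describe(item, "en")  # typed in once and kept as fixed text

# Suppose the year on the item is corrected later (1998 is an arbitrary example).
item["year"] = 1998

print(stored_english)        # the stored text did not follow the change
print(describe(item, "en"))  # the generated text did
print(describe(item, "nl"))  # and so did every other language
```

A stored description has to be maintained per language; a generated one only needs the statements to be right.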

In the Metrics meeting it was mentioned that images might exist in the article.. Actually, Wikidata often has images where a Wikipedia article has not.

PS there is an API to create the automated description.. Why not use it?

Saturday, March 21, 2015

#Commons - Picture of the Year

Every year Commons has its Picture of the Year competition. This year's winner shows a butterfly drinking the eye fluid of a tortoise. There is a word for it: lachryphagy. I had never heard of it.

Congratulations to Commons for another successful year :)

#Wikidata - the National Thesaurus for Author names

In Wikidata we link to external sources. For authors the NTA Identifier is just one of many. It is associated with the Dutch Royal Library. This identifier is currently associated with 128,701 authors from all over the world.

While the NTA Identifier is used a lot, there is no article about it and, there is no item for it either. As such it is not exceptional.

With a link to the Dutch libraries, it is easy to understand why we could and should cooperate. Libraries are even more into "sharing the sum of all knowledge" than we are; Wikipedia is in the final analysis only an encyclopaedia.

We could do the following:

  • share the information we hold with them
  • ask if they will share the information they hold
  • promote the reading of books and publications
  • link to the Dutch libraries for authors and books to see if a book is available.
Our aim is to share in the sum of all knowledge. The aim of libraries is to share in the sum of all knowledge. We use what they provide as sources in our projects. It is easy and obvious to understand why we should seek cooperation.

#Wikipedia - Joseph Reagle

For whatever obscure reason English Wikipedia decided to delete the article on Joseph Reagle. I do not understand it. Mr Reagle is bound to publish more books, and conflating him with his book "Good Faith Collaboration" is a bit silly. It feels like part of the Wikipedia culture where we do not look after our own; we do not consider our efforts as notable.

Mr Reagle is certainly notable enough to remain in Wikidata. Not only as the author of the book but also because of the French article. It will be wonderful when additional data is added to the item.

Tuesday, March 17, 2015

#Diversity - Therezinha Zerbini, a lawyer from #Brazil II

The Spanish article about Mrs Zerbini has it that she won the prestigious Bertha Lutz award. According to the Portuguese article about the award, there are 74 women who have been celebrated in this way but not Mrs Zerbini.

Because of the extra attention given to women this month, I have added all these women to Wikidata. They are now all known as award winners, they are all women and, I added the date when the award was given.

There were three issues: some articles did not have a Wikidata item, some were not known to be human and, finally, some articles did not exist. At this moment these issues have been solved for Wikidata and I even added Dutch labels for good measure.

What still needs to be done is adding labels in Portuguese and writing articles in any language. Finally, I would appreciate it if someone could establish whether Mrs Zerbini did win the award or not.

Monday, March 16, 2015

#Diversity - Therezinha Zerbini, a lawyer from #Brazil

Therezinha Zerbini died. She is the kind of woman that deserves to be better known. There is no article yet in English but from what I understand using Google Translate, it is people like her that shaped history.

She is a founder of the Movimento Feminino pela Anistia. It does not even have a Wikipedia article in Portuguese..

This organisation was vocal about the existence of imprisonment, torture and political persecution at the time when these things happened. As such it was timely and relevant, and in the end it won the day.

Mrs Zerbini is the kind of woman we should know more about.

Friday, March 13, 2015

#Wikidata - #African #American

Harvesting Wikipedia has its perils. It is easy and obvious to make statements about occupation, employer, alma mater. Subjects like nationality, religion, ethnicity are problematic. It is like the image says... you are not African American when you are from the UK but how do you tell the difference?

The problem is very much that stigma is involved. Harvesting such information on "auto pilot" brings those issues to the front in Wikidata. People from all over the world are involved. The "best" practices of every Wikipedia raise their heads, their questions, their bias.

Not harvesting certain data is "safe" and, it is appropriate for me. It is not that I do not see the issues, it is more that it is not for me to do.

Thursday, March 12, 2015

#Wikimedia #Commons - 25 million images

#Wikidata - #Diversity and #LGBT

Richard Glatzer died. He died from motor neurone disease. I added this for everyone who died in this way. Mr Glatzer is associated through four categories with LGBT and I will not touch any of these with a bargepole.

I will not touch it because it is so redundant and opaque. When people are associated with LGBT, I do not know whether they are gay, or in what way. What I care for is that sexual orientation is expressed in Wikidata once.

Associating people with LGBT does not mean that people who are gay have been active and known as such. It does not even mean that the people who are championing the cause of LGBT are gay.

From my position, the information about LGBT sucks big time, I can not use it and, I find this regrettable.

Auto-#translations - let's talk "man and horse"

Niklas wrote in his blog about using a grammatical framework for generating texts. He also mentions Reasonator and indicates that its support for language is limited.

When you look at the code for generated texts, it is indeed a programmer's job and not so much something a translator does. What it does is parse existing data and generate strings of text. The generated text for Marilyn Monroe for instance has it that she is an "actor".

"Let's talk man and horse" is a Dutch proverb and it is exactly what we will not have in generated texts. It is indeed grammar that we need to concentrate on. I am sure that many people will be EAGER to work on generated text using a grammatical framework.

Once it is shown to work in Reasonator, the next step is obvious: generating the texts for consumption in Wikipedias that want to share in the sum of available knowledge. This is best done by caching results because of the flexibility it offers. When this is seen as problematic, it is even easier to generate articles using bots.

Niklas, what category of topics tickles your fancy?

Wednesday, March 11, 2015

#Kian does #Kannada #Wikipedia

Mr A. Surya Prakash is an author and journalist. There is an article about him on the Kannada Wikipedia. The item on Wikidata exists thanks to Kian. It is one of 48 items created based on articles. These 48 items are identified as human.

Using Google Translate it is easy to find the English name. Finding additional information about Mr Prakash was easy as well. Several awards he received are not known in English. I added one in Wikidata. Many of the things I did I could have left for Kian to do in a later run. It is however so cool to be able to share in this knowledge. One thing I did was to merge the Kannada category for journalist.

I asked Amir to have Kian run for all the languages from India. It finished with 500 new humans for Hindi; my latest news is that it is running for Tamil. What we know from the Kian run on the German Wikipedia is that most items have already been identified for what they are. I found that for only 11 people I could add that they were a journalist.

Tuesday, March 10, 2015

Categoria:Morti a Rottenburg am Neckar

The category on the Italian Wikipedia has been expanded with statements that help Reasonator find all the people who died in Rottenburg am Neckar. This was done by bot and it was done for the categories that identify the place of death.

This is awesome! Not only do we find all the people who are included in the Italian category, we also find all the people who have been identified based on information from elsewhere. Caspar Adelmann for instance only has an article on the German Wikipedia.

With Kian we have a tool to make the most of information that exists in our projects. The relevance of tools like Kian is in sharing the information of Wikidata. The value is in its use. Reasonator is how we add value to the data we have and it is wonderful to see how things become connected.

Monday, March 09, 2015

#Kian - Adding humans known to #German #Wikipedia

The first feat of Kian was to recognise humans based on articles from the German Wikipedia. Now that is the holy grail because once you know that an item is human, it is trivially easy to make all kinds of assumptions.

It was done by learning from what is already there. Given that for 90% of all Wikidata items it is already known what they are, there is plenty of learning material for Kian.

Once everything was said and done, 3600 articles are now known to be about humans. Given that much of the learning is based on Wikidata data itself, it follows that languages like Malayalam or Chinese can be tackled in the future. The future is bright and we happily welcome the first 3600 humans Kian recognised.

Saturday, March 07, 2015

#Kian - the future is bright

Send someone bright to a course on artificial intelligence, someone with a pedigree of great bot-work, and you get "Kian". In Farsi it means "glory" and, yes, I have seen the light.

Amir introduced Kian in an e-mail: "Kian is a three-layered neural network with flexible number of inputs and outputs. So if we can parametrize a job, we can teach him easily and get the job done." If I understand properly what this means, I will stop doing most of the work I have been doing on Wikidata with AutoList. I will not reach three million edits and I will be ever so happy.

Amir's example for a first job: Add P31:5 (human) to items of Wikidata based on categories of articles in Wikipedia. The only thing we need to is get list of items with P31:5 and list of items of not-humans (P31 exists but not 5 in it). then get list of category links in any wiki we want[2] and at last we feed these files to Kian and let him learn. Afterwards if we give Kian other articles and their categories, he classifies them as human, not human, or failed to determine. As test I gave him categories of ckb wiki (a small wiki) and worked pretty well and now I'm creating the training set from German Wikipedia and the next step will be English Wikipedia. Number of P31:5 will drastically increase this week.
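
As I read Amir's description, the preparation of the learning material could look something like the sketch below. It is not Kian's code: the function names, the sample items and the category names are all made up, and the neural network itself is left out.

```python
HUMAN = "Q5"  # P31:5 in the e-mail means "instance of: human"


def split_by_p31(p31_values):
    """Split items into humans, non-humans and items without any P31."""
    humans, non_humans, unknown = [], [], []
    for item_id, values in p31_values.items():
        if not values:
            unknown.append(item_id)      # no P31 yet: these are left to classify
        elif HUMAN in values:
            humans.append(item_id)       # P31 contains Q5
        else:
            non_humans.append(item_id)   # P31 exists but Q5 is not in it
    return humans, non_humans, unknown


def training_rows(p31_values, categories):
    """Pair each labelled item's categories with 1 (human) or 0 (not human)."""
    humans, non_humans, _ = split_by_p31(p31_values)
    rows = [(categories.get(i, []), 1) for i in humans]
    rows += [(categories.get(i, []), 0) for i in non_humans]
    return rows


# Made-up sample data: item ids and category names are placeholders.
p31_values = {"Q_A": ["Q5"], "Q_B": ["Q_SOMETHING_ELSE"], "Q_C": []}
categories = {"Q_A": ["Category X"], "Q_B": ["Category Y"], "Q_C": ["Category X"]}
print(split_by_p31(p31_values))
print(training_rows(p31_values, categories))
```

Items like `Q_C`, which have categories but no P31, are the ones a trained classifier would then be asked about.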

I am ever so happy because once we know what an item is, it becomes easy to make all kinds of inferences.

Indian #politicians from #Kerala

The information on politicians from #India on the Malayalam Wikipedia is more extended for the members of the Kerala legislative assembly than on the English.. That is entirely reasonable and good.

It shows that English does not include the sum of all information that is available to us. It also shows that we need someone to help us translate the names of these fine people into English to make this available to you and me.

Friday, March 06, 2015

#Kiwix - getting #Labs ready for the #Wikipedia big time

Offline #Wikipedia received a big boost. It is updating its images monthly for most of the #Wikimedia projects. Most but not all. Emmanuel was asked to write up his challenges and I am happy to share this with his permission. Developments like this make both Labs and Kiwix even more strategic to our goals.

Following Yuvi's and Andrew's invitation, I write this email to explain what I want to do with Labs and share with you my first experiences. 
== Context == 
Most of the people still don't have a free and cheap broadband access to fully enjoy reading Wikimedia web sites. With Kiwix and openZIM, a WikimediaCH program, we have been working on solutions for almost ten years to bring Wikimedia content "offline".
We have built a multi-platform reader and have created ZIM, a file format to store web site snapshots. As a result, Kiwix is currently the most successful solution to access Wikipedia offline. 
== Problem == 
However, one of the weak point of the project is that we still don't achieve to generate often enough new fresh snapshots (ZIM files). Generating ZIM snapshots periodically (we want to provide a new fresh version each month) of +800 projects needs pretty much hardware resources.
This might look like a detail but it's not. The lack of up-to-date snapshots brakes many action within our movement to advert more broadly our offer. As a consequence, too few people are aware about it reported last Wikimedia readership update. An other side effect is that every few months, volunteer developers get the idea to build a new offline reader based on the XML dumps (the only up2date snapshots we provide for now), which is near to be a dead-end approach. 
== Goal == 
Our goal with Labs  is to have a sustainable and efficient solution to build, one time a month, new ZIM files for all our projects (for each project, one with thumbnails and one without). This is at the same time a requirement for and a part of a broader initiative which has for purpose to increase the awareness about our "offline offer". Other tasks are for example, storing all the ZIM files on Wikimedia servers (we currently only store part of them on download.wikimedia.org) and improve their accessibility by making them more visible (WPAR has for example customised their sidebar to provide a direct access 
== Needs == 
Building a ZIM file from a MediaWiki is done using a tool called mwoffliner which is a scraper based on both Parsoid & MediaWiki APIs. mwoffliner, after scraping and rewriting content, store them in a directory. At the end, the content is then self-sufficient (without online dependencies) and can be then packed in one step in a ZIM file (using a tool called zimwriterfs).
To run this software you better have:
  • A little bit bandwidth
  • Low network latency (lots of HTTP requests)
  • Fast storage
  • Pretty much storage (~100GB per million article)
  • Many cores for compression (ZIM, ZIP and picture optimisation)
  • Time (~400.000 articles can be dumped per day on a machine)
My guess is that we need a total of around a dozen of VMs and 1.5 TB of storage. 
== Current achievements == 
We have currently 3 x-large VMs in our "MWoffliner" project:
With them we are able to provide, one time a month, ZIM for all instances of Wikivoyage, Wikinews, Wikiquote, Wikiversity, Wikibooks, Wikispecies, Wikisource, Wiktionary and a few minors Wikipedias.
Here are a few feedbacks about our first months with Labs:
  • Labs is a great tool, it's fully in the Wikimedia spirit and it works.
  • Support on IRC is efficient and friendly
  • We faced a little bit instability in December but instances seem to be stable now
  • The Documentation on wikitech wiki seems to be pretty complete, but the overall presentation is to my opinion too chaotic and stepping-in is might be easier with a more user-friendly presentation.
  • Mediawiki Sementic & OpenStackManager sync/cache/cookie problems are a little bit annoying
  • Overall VM performance looks good although suffering from sporadic instabilities (bandwidth not available, all the processes stuck in "kernel time", slow storage).
In general, Labs does the job, we are satisfied and think this is an adapted solution to our project. 
== Next steps == 
We want to complete our effort and mirror the biggest Wikipedia projects. Unfortunately, we have reached the limits of a traditional usage of Labs. We need more quota and we need to experiment with the NFS storage because an x-large instance in not able to mirror more than 1.5 millions of articles at a time. How might that be made possible?
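
The two rules of thumb from the "Needs" section of Emmanuel's mail (~100GB per million articles, ~400.000 articles dumped per day on a machine) can be turned into a rough estimate. This is a back-of-the-envelope sketch: the `estimate` function is my own, and it assumes that the work splits evenly over machines, which the mail does not claim.

```python
import math

GB_PER_MILLION_ARTICLES = 100       # "~100GB per million article"
ARTICLES_PER_MACHINE_DAY = 400_000  # "~400.000 articles can be dumped per day"


def estimate(articles, machines=1):
    """Return (storage in GB, whole days needed) for one snapshot.

    Assumption: the articles can be divided evenly over the machines.
    """
    storage_gb = articles / 1_000_000 * GB_PER_MILLION_ARTICLES
    days = math.ceil(articles / (ARTICLES_PER_MACHINE_DAY * machines))
    return storage_gb, days


# 1.5 million articles is the limit mentioned for one x-large instance.
print(estimate(1_500_000))
print(estimate(1_500_000, machines=3))
```

With these numbers a project of 1.5 million articles needs about 150 GB and four days on one machine, or two days when three machines share the work.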

Thursday, March 05, 2015

The #Italian job; a perspective on #bias

The Italian #Wikipedia is really into registering death and place of death. This is why an initial bot-run by Amir resulted in more deaths in Italy than elsewhere. This bias was visualised by Vizidata; I blogged about it in the past.

Today Amir finished another run of his bot on the Italian Wikipedia and he increased the bias with over 100,000 edits.

The point is very much that Wikidata has a bias and it is based on the quality of the Wikipedia data it depends on. Running a bot repeatedly will increase quality and by inference bias. People die in Italy, apparently not so much in the Netherlands.

Now who says this Italian bias is bad?

Tuesday, March 03, 2015

#Wikimedia Nederland - #Sebastiaan is leaving

Sebastiaan was project leader at the Dutch Wikimedia chapter. He did tons of great work, and made a point of registering activities with his camera. Of particular mention is his huge involvement in the GLAM area.

Sebastiaan will pursue his career elsewhere. His friendly cooperative attitude will be sorely missed.
Thank you,

Monday, March 02, 2015

#Diversity - the Lillian Smith Book #Award II

There are two ways of improving the content of Wikidata. It can be by adding large amounts of statements or by adding more details to existing data. As I was adding the details, I found that several award winners do not have an article. Adding them in Wikidata is easy and obvious.

A.G. Mojtabai for instance received the award in 1986. Adding a red link in Wikipedia is not hard either. Thanks to the Redwd template, I linked her to both Wikidata and to Reasonator. One issue is that all these authors and the award are primarily known on the English Wikipedia. Consequently their work and relevance have at this time a limited public.

It would be nice when the presence of great information at Wikidata will lead to articles in other languages. The question is very much if it does.

Sunday, March 01, 2015

#Hackathon - the #genealogy of Catharina-Amalia, Princess of Orange

Visualising the genealogy of a person can be extremely interesting. The best tool we had exists in Reasonator, and for the Dutch Royal family it did not really work. Too many people are involved.

The approach of a new tool developed at the hackathon in Bern is really interesting. It limits itself to five generations and it shows pictures of the people. It is nice to see princess Catharina-Amalia.

It is also great to notice that you can have the same information in for instance Chinese or Russian. When you click on one of the persons in the genealogy, it will produce the genealogy for that person..

At this time the new tool is very much in development. It is great to show why hackathons are so relevant.

#Diversity - the Lillian Smith Book #Award

Lillian Smith, who is obviously white, openly embraced controversial positions on matters of race and gender equality. She was a southern liberal unafraid to criticize segregation and work toward the dismantling of Jim Crow laws, at a time when such actions almost guaranteed social ostracism.

The Lillian Smith Book award honours those authors who, through their outstanding writing about the American South, carry on Smith's legacy of elucidating the condition of racial and social inequity and proposing a vision of justice and human understanding.

It is obvious that these writers are important as sources for the subject and consequently, registering them as award winners is important. This was done by harvesting the information from the article using Magnus's LinkedItems tool.

Obviously more can be done.
  • including all her work in Wikidata; it does not need a Wikipedia article
  • including all the works of the prize winning authors
  • adding dates as a qualifier for the award winners
  • complete the list of award winners
  • work on similar awards
There is always more that can be done on a subject as relevant as this.