Thursday, October 31, 2013

#divcon - the first update about #Wikidata queries

The WikiDataQuery back-end is more sophisticated than it may appear: every 10 minutes or so the data used for the queries is updated.
Last but not least, the queries have a "permalink" function. You can find out what the current numbers are :)

#divcon - The current number of #males and #females in #Wikidata

With the WikiDataQuery it is easy to learn the number of males (798,920) and females (152,118) known to Wikidata. It is based on the last dump; a real-time query is being worked on.

These are de-duplicated numbers of males and females that have zero or more Wikipedia articles.
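As a sketch of what such a query looks like, the snippet below builds the kind of claim expression the WikiDataQuery service used, with P21 ("sex or gender") and the items for male (Q6581097) and female (Q6581072). The endpoint URL and parameter names are assumptions for illustration, not the tool's documented API.

```python
# Sketch: building a WikidataQuery (WDQ) request to count items by gender.
# P21 is "sex or gender", Q6581097 is "male", Q6581072 is "female".
from urllib.parse import urlencode

WDQ_API = "https://wdq.wmflabs.org/api"  # assumed endpoint, for illustration

def wdq_claim_query(prop: int, item: int) -> str:
    """Return the WDQ expression matching items with a given claim."""
    return f"claim[{prop}:{item}]"

def wdq_url(query: str) -> str:
    """Return a permalink-style URL for a WDQ query."""
    return WDQ_API + "?" + urlencode({"q": query, "noitems": 1})

males = wdq_claim_query(21, 6581097)
print(males)  # claim[21:6581097]
print(wdq_url(males))
```

The "permalink" function mentioned above amounts to exactly this: a stable URL that re-runs the query against the freshest data.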

Ron Woodroof, an #American activist

This weekend I will speak at the #Wikimedia #diversity conference in Berlin. The conference aims to increase the pool of knowledge in the Wikimedia projects and to enable a heterogeneous, colourful, diverse community.

There will be a focus on the imbalance in gender participation in our projects. There is also an imbalance in how subjects are covered.

Ron Woodroof was an activist who is getting more attention these days because the movie "Dallas Buyers Club" is centred on him. He was notable enough for the Swedish Wikipedia to have an article written about him in 2012. In the English Wikipedia he was nothing but a redirect to the movie.

Love him or hate him, Mr Woodroof is notable. He was part of a counterculture in the USA, and he died of AIDS. It will be interesting to see how many similar stories are told at the conference.

#Gapminder, the #Ignorance project

When the aim is to share in the sum of all knowledge, it seems obvious that ignorance is the ultimate enemy that is to be defeated. To defeat ignorance you have to get the facts straight.

Gapminder and Hans Rosling are well known for bringing an insight based on published facts, statistics and visualisation. The presentations are world class and the next insight they are seeking is the ignorance people have about our world.

It is not to gloat but to bring home the message that we do not know basic facts about our world. Bringing attention to ignorance implies that you want to ensure that people are better informed. Informing, or sharing in the sum of all knowledge, is what the Wikimedia Foundation does best in its projects.

By covering 280+ languages most people who can read and have access to the Internet can be reached. It is "just" a matter of making this information available to all of them on a platform they use. Wikipedia is that platform and Wikidata is where the data should get a home.

There are two issues stopping us for now: Wikidata does not yet support the property type "value", and we do not have the visualisation tools yet. Both can be helped; support for the property type is in the pipeline. Seeking cooperation with Gapminder seems an obvious thing to do.

Remove erroneous #constraints from #Wikidata

When a particular property is used, like P218 (that is ISO 639-1 to you), you can expect specific values to be associated with it. The constraints expected are listed on the talk page. For the constraints mentioned, a list of constraint violations is published.
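Such a constraint check can be done mechanically. Below is a minimal sketch, assuming a "format" constraint that says ISO 639-1 values are exactly two lowercase letters; the constraint itself is an illustration, not a quote from the talk page.

```python
# Sketch: a mechanical check for one plausible constraint on P218 (ISO 639-1).
# The assumed "format" constraint: values are exactly two lowercase letters.
import re

ISO639_1 = re.compile(r"^[a-z]{2}$")

def violates_format(value: str) -> bool:
    """True when a value does not look like an ISO 639-1 code."""
    return ISO639_1.match(value) is None

values = ["en", "nl", "eng", "NL", "x"]
violations = [v for v in values if violates_format(v)]
print(violations)  # ['eng', 'NL', 'x']
```

A constraint-violation report is, in essence, just such a filter run over every value of the property.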

The point of such a list is to find members of the Wikidata community to cooperate and fix those errors.

One constraint used on many properties is associated with the deprecated property P107, the "main type (GND)". Lists of constraint violations like this one for P218 are regularly generated.

When asked, removing all the erroneous constraint reports that are based on this false premise was NOT thought to be a good idea. So far I have constrained myself from uttering profanities. It is better to make the case, and to make it publicly and loudly.

Wednesday, October 30, 2013

Missing labels in #Wikidata; a call for #data driven #user #participation

As more Wikidata items gain statements, people may benefit from the new information provided. That is, when there is a label for the properties and related items. When there is no label you will see a property or item identifier.

It is extremely annoying, and it is why I use Wikidata in English and not in Dutch.

When every reference to a missing label is registered for a period of 30 days, it becomes easy to suggest to our community what needs to be done to improve the usefulness of the data provided. As more of the high impact labels are translated, users will get more pertinent information.
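The proposal above can be sketched in a few lines: tally every hit on a missing label, then rank the labels by demand. The log entries below are hypothetical (entity id, language) pairs, as a server might record them.

```python
# Sketch of the proposal: tally every hit on a missing label, then rank them
# so translators can start with the highest-impact ones. The log entries are
# hypothetical (entity id, language) pairs.
from collections import Counter

missing_label_hits = [
    ("P31", "nl"), ("Q5", "nl"), ("P31", "nl"), ("P31", "oc"), ("Q5", "nl"),
    ("P31", "nl"),
]

by_entity = Counter(eid for eid, lang in missing_label_hits)
# The most requested missing labels come first: translate these first.
print(by_entity.most_common(2))  # [('P31', 4), ('Q5', 2)]
```

Registering for 30 days and then ranking like this is all the "data driven" part amounts to; the community does the actual translating.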

The beauty of this suggestion is that it involves only the Wikidata development team. This makes it easy to implement. All these new labels will enlarge the number of items to search from.

An experiment like this is like putting a toe in the water; data is accumulated that points to where our community involvement does the most good. Once there is some clarity that data-driven user participation makes an impact, there are many more opportunities to explore.

Let us start with this easy opportunity.

Magnus, master of #Wikimedia #tools

When Magnus agreed to answer some questions, I wanted his answers to have impact. So I asked a few influential people what they would ask Magnus. I am happy both with the questions and the answers; they offer us plenty of opportunities for making more of an impact.

Lydia: what's his next cool hack
Technically, since the moment I received these questions, it would be AutoLists
I haven't decided what to do after this. There is the prospect of extending/rewriting the Free Image Search Tool to use Wikidata, and to suggest images for Wikidata from the various Wikimedia projects in return. 
Also, with the long-term plans to use Wikidata, or at least its software, to finally get a grip on metadata for files on Commons, it might be interesting to create and seed a database for the "obvious cases", which could then in turn seed "Wikidata Commons", when it finally shows up.
Lydia: and what he'd like to do but doesn't get to (so others can do it)
A million things! Many of my tools could do with some fixing, code review, extension, you name it. If you can code, have a look at the existing tools and maybe sign up for a tool or two which you would like to improve (not just mine, of course!). 
I still have to port tools from the toolserver. Have a look here at tools to port and, help me prioritise! 
Another thing I had started back on the toolserver was a "tool pipeline", where you could chain up tools so that the output of one tool is the input of the next. This could be quite powerful, especially for less tech-savvy users who can't write their own glue code, even if the concept sounds uncomfortably like the human centipede ;-) Have a look at my attempt back in 2012.
Brion: What's the awesomest tool you've created that people don't (yet?) know about?
That's a hard one, as I carefully and thoroughly spam the community for every new tool I create ;-) 
One that is probably less known, and that I use myself on Wikipedia, is a special CSS for wide screens, where it shows the Wikipedia main text in a central column, and arranges TOC, infoboxes, thumbnails, and small tables on the left or right. Screenshot & CSS

As for actual running code, less known but with huge potential IMHO are live category intersects on Wikipedia. Together with [[User:Obiwankenobi]], we came up with this: Screenshot JS
Brion: What's the best way to make things happen? (getting cool projects done, even if nobody else is working on them yet)
Have something to show people. Even if it's little more than a mock-up. Even if it's full of bugs, slow, ugly, fails half the time, and does only 1% of what it should do. A single thing that people can see and click on can get people to the cause, where a hundred pages of text will not. Stallman's GNUpedia existed on a mailing list with many a fancy idea; Wikipedia was there to see, to "touch", to edit. Even if it was a single, bare Perl script.
Alolita: what are your recommendations for the next 5 innovative tools WMF needs to fund to build for Wikimedia websites
Not sure about funding for tools; coding projects big enough to fund should probably be designed as Wiki(m|p)edia-default extensions, or even MediaWiki core. In terms of new functionality prioritised in-house beyond the current scope, these would be the most interesting ones for me:
  • OSM integration. This has been on the WMF back burner for years.
  • Commons file metadata as Wikidata. On Wikidata itself, or added to Commons itself.
  • More data for Tools Labs. View data so that we can support GLAM projects; there has been some talk about this lately, but it has been a few weeks away for years now. Also, faster access to the wikitext than through the API/dumps would be useful.
  • Add Wikispecies to Wikidata. It might revive Wikispecies (which seems sadly underused), or at least rescue much of its information onto Wikidata.
  • One old pet idea of mine: Wikimedia (bi-)weekly magazine. Consider an amalgamate of Wikinews (which, let's face it, has pretty much failed as a daily source of news), Wikipedia "did you know"-type interesting articles, interesting places from Wikivoyage (which appears to be booming), fascinating images from Commons, an excerpt from Wikibooks, a short story from Wikisource, you get the idea. An editorial text, or political comment (relevant to the Wikimedia world; think CISPA), from individuals might not be too far fetched an idea. An issue would be generated in the usual fashion, maybe with an editor-in-chief (rotating? elected every three months?) to address the time-critical nature of the project. A completed issue would then be generated in several formats; on-wiki, PDF, etc. Maybe even go through proprietary distribution channels like Amazon or Apple, if the license situation allows it. I am sure tablet users would flock to it, and we might get some of these "consumers" to participate in one of the Wikimedia projects, maybe with "photos by readers" as an entry-level drug :-)
Alolita: How WMF can help support the development community better * process recommendations * policy recommendations * participation in helping grow more developer contributors
I'll answer these as one question. For tool developers, Labs is already a fine solution. There are a few rough edges which could be smoothed, though; "find someone on IRC" is often the way things are done, which is not always the best. There were a few outages recently, which can happen, and it took hours for someone to have a look at the issue. That is not exactly a life-or-death situation, but still frustrating for tool authors, and not just because of "your tool is broken" messages piling up around them. I understand that only core systems can have someone on standby 24/7, but maybe Labs issues could bubble up the chain automatically if no one from the usual Labs crew is around. 
Getting more people to make use of the great Labs facilities is important. Part of that could be a simple visibility issue: if more people knew that Labs offers not only CPU and storage, but live mirrors of the Wikimedia databases, or that they could easily help fixing or improving that vital tool, we would probably see more widespread adoption of Labs by programmers. Labs-based git repositories for each tool could also enhance visibility, and improve code reuse (which I practice heavily within my own tools). 
In terms of adoption of tools and extensions by volunteer developers into Wikimedia wikis, MediaWiki core itself, or as stand-alone systems, clear and time-limited review procedures would be required, as well as assistance by Wikimedia developers to get code up to MediaWiki standards. On the other end, a public listing with project ideas that, if implemented, would be considered for adoption by Wikimedia, could focus volunteer programmer participation. I am talking about a simple list of "we would like to have these, but don't have the manpower" issues, written by Wikimedia staff, as opposed to obscure feature requests in bugzilla that might be long obsolete.
Open source also means working together on software and tools. Your tools are available; how do we get more people to cooperate on them?
By refusing to fix bugs, we could force the bug submitter to do it ;-) 
On a more serious note, I think it is about improving the visibility, as mentioned above, of the great things Labs offers, and of the fact that participation in existing tools is encouraged and welcome. Just like many people use Wikipedia but still are not aware that they could edit it themselves, potential volunteer programmers might think that these tools are made by Wikimedia staff, or some carefully chosen group, rather than by run-of-the-mill geeks like them.
If you had someone to work on improving ONE tool, what would you want him or her to do?
Extend Wiri, a.k.a. "the talk page". It is not so much a normal, "useful" tool as a nice technology demo, not just for Wikidata: a confluence of technologies (data from Wikidata, speech generation from third-party open source, speech recognition from Google) enables functionality that plays in the same ballpark as big commercial solutions like Siri or Google Now. 
I am well aware it will never be on par with the Mathematica/Wolfram Alpha backend in terms of calculation power, or up-to-the-minute sports results. But even simple improvements, like a listing of options if the (speech) input is ambiguous (e.g. several people with the same name), translating complex questions into Wikidata query requests, displaying images, maps, or using AutoLists to show multiple results, could turn it from a mere demo into an actually useful tool. With proper, fault-tolerant grammar parsing, real reasoning, and the ability to explain the reasoning, it could be amazing!

Tuesday, October 29, 2013

Using #Babel should be obvious in #Wikidata

When you know multiple languages, you want to see the Wikidata labels used for the languages that are relevant to you.

My current list of languages is defined using the #Babel template on my user page. The syntax is:
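For illustration, a Babel declaration on a user page uses the Babel parser function; the language codes and proficiency levels below are made-up examples:

```
{{#babel:nl|en-3|de-1}}
```

A code without a suffix marks a native language; a numeric suffix such as -3 indicates the level of proficiency.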
It makes a world of difference when you are able to add useful labels in the languages you know. Added labels ensure that subjects can be found.

#Wikidata .. hip hip hurray !!

Monday, October 28, 2013

Abdullah, King of #Saudi Arabia

King Abdullah has many brothers and sisters. In #Wikidata he and his siblings are registered as children of their father. What is different is that the Reasonator now shows the siblings by querying for the other children of the parents; it is no longer needed to register a "brother" or "sister" statement on each and every sibling.
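The derivation the Reasonator performs can be sketched as follows: given child-to-father statements, the siblings of a person are the other children of the same father. The data below is a toy stand-in, not actual Wikidata content.

```python
# Sketch of the Reasonator's derivation: siblings are the other children of
# the same father, so no explicit sibling statements are needed.
father_of = {  # child -> father (toy stand-in for Wikidata statements)
    "Abdullah": "Abdulaziz",
    "Salman": "Abdulaziz",
    "Sultan": "Abdulaziz",
    "Hussein": "Sharif",
}

def siblings(person: str) -> set:
    dad = father_of.get(person)
    return {c for c, f in father_of.items() if f == dad and c != person}

print(sorted(siblings("Abdullah")))  # ['Salman', 'Sultan']
```

One statement per child is enough; the sibling relation is computed, not stored.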

The edits of #Wikidata II

The number of Wikidata edits is measured in real time, and hidden in the number of edits is the number of qualifiers added.

As there is a new set of statistics in town, their numbers are measured as well. Currently there are 23,176 qualifiers on statements. As you can see in the graphic, they are growing fast. So far there are no bots that add qualifiers.

When the bots know how to set qualifiers, it will make the available information more understandable and relevant to humans.
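As an illustration of what a qualifier adds, here is a statement modelled as plain data: a bare "position held" claim only becomes understandable with "start time" and "end time" qualifiers. The property and item ids are used illustratively.

```python
# Sketch: a statement with qualifiers, modelled as plain data. P39 ("position
# held") with P580 ("start time") and P582 ("end time") qualifiers is the
# classic case where qualifiers make a bare claim understandable.
statement = {
    "property": "P39",          # position held
    "value": "Q30461",          # an office, used illustratively
    "qualifiers": {
        "P580": "1993-01-20",   # start time
        "P582": "2001-01-20",   # end time
    },
}

def qualifier_count(stmt: dict) -> int:
    """Count the qualifiers on one statement, as the statistics above do."""
    return len(stmt.get("qualifiers", {}))

print(qualifier_count(statement))  # 2
```

The 23,176 figure above is this count summed over every statement in Wikidata.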

Sunday, October 27, 2013

John Sherman, #USA Secretary of State and of the Treasury

It is fun to check what #Wikidata has to say about people and places prominently mentioned in #Wikipedia. The article about Mr Sherman is a featured article. There is always something to improve, and it is now registered that Mr Sherman had a father, a grandfather and several brothers. Sadly it does not register in the Reasonator.

Mr Sherman was part of the "Opposition Party"; this is an article that needs some serious attention. The article describes three instances of an Opposition Party, and this does not make sense in the context of Wikidata. From the article it is not clear when Mr Sherman moved from the Whig Party to this party, and later to the Republican Party.

Saturday, October 26, 2013

The edits of #Wikidata

"Right now the meter is standing on my fridge and is watched by a webcam"

I really enjoy this image; I find that I go back to it. One of the implications earlier in the day was that in a month every Wikidata item may or may not get one edit.

Raja Lumu, the first sultan of #Selangor

His father and son both have articles in one #Wikipedia; in all other Wikipedias they are red links. His son was his successor and the second sultan of Selangor.

When a new article gets written on Ibrahim Shah of Selangor, it is best practice to link this article to other articles. Given the existing links in articles to other articles, there is already a concept cloud for Raja Ibrahim Shah. This "concept cloud" will include his father, his successor and child, among others. 

With a bit of luck, these concepts have labels in Wikidata and when they have not, it makes sense to add them and get familiarised with them. All these concepts are likely to be used in the new article. 

When the concepts known to exist for an item are available while editing an article, it makes sense to use them while editing the article.
  • editors become aware of relevant concepts
  • editors are prompted to link to existing articles
  • editors are prompted to add missing labels
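The idea above can be sketched as a small function: collect the concepts linked from existing articles about a subject and flag those that still lack a label in the editor's language. All names and labels below are illustrative.

```python
# A toy sketch of the "concept cloud" idea: gather the concepts linked from
# existing articles, then flag which of them still lack a label in the
# editor's language. All names and labels are illustrative.
links = {  # article -> concepts already linked from it
    "Raja Lumu": ["Selangor", "Ibrahim Shah of Selangor", "sultan"],
}
labels_ms = {"Selangor", "sultan"}  # concepts with a (hypothetical) Malay label

def missing_labels(article: str) -> list:
    """Concepts in an article's cloud that an editor should still label."""
    return [c for c in links.get(article, []) if c not in labels_ms]

print(missing_labels("Raja Lumu"))  # ['Ibrahim Shah of Selangor']
```

An editor writing the new article would be prompted with exactly this list.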

Friday, October 25, 2013

An unknown woman from #Indonesia, a picture from the #Tropenmuseum

The Tropenmuseum has provided us with a large collection of images from its collection. Within Commons these images are particularly relevant because there is so little available on the subjects covered by this collection.

Attribution to the GLAMs we are working with is vitally important. Providing insight into how a collection is used determines subsidies.

As these images are by and large in a category, it is possible to use that category and accumulate data with it. GLAMorous informs about the number of articles using images in a category. Even more relevant is the BaGLAMa tool; this is where you find the number of page views for these images.

There is room for improvement; there is a template for the Tropenmuseum but it is not part of the data on most images from the Tropenmuseum. There are templates for creators like Isidore van Kinsbergen; they are not universally applied either.

When the data on images exists in a database, it will become much easier to combine data in many new ways. Adding the creator and institution templates prepares for the moment when these templates are populated with data from Wikidata.

The Gelman #Library, an institution in Washington DC

The Gelman Library, or by its full name the "Estelle and Melvin Gelman Library", is one of the many GLAMs we have content from on Commons. For all these institutions there is a template in the "Institution" namespace.

All these templates are an amazing resource; they must have been compiled by many people over many years. On Commons they link to material that is known to be part of the collection of that institution. Making such links is in itself a Herculean job, and one that is not finished at all.

The templates for all those institutions typically refer to a Wikipedia article. The Wikipedia article links to a Wikidata item, and the aim is for that item to include all the information that exists in the institution template on Commons. To harvest all that information, tools like the pywikipedia bot will be used.
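Harvesting a field from such an institution template can be sketched with a regular expression; a real job would use the pywikipedia framework, and the template text below is a simplified assumption about its layout.

```python
# Sketch of harvesting one field from a Commons "Institution" template.
# Real harvesting would go through pywikipedia; the template text and
# field names here are simplified assumptions.
import re

wikitext = """{{Institution
 |name = Estelle and Melvin Gelman Library
 |city = Washington, D.C.
}}"""

def template_field(text: str, field: str):
    """Return the value of one |field = value line, or None if absent."""
    m = re.search(r"\|\s*%s\s*=\s*(.+)" % re.escape(field), text)
    return m.group(1).strip() if m else None

print(template_field(wikitext, "name"))  # Estelle and Melvin Gelman Library
```

Each harvested field would then become a statement on the institution's Wikidata item.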

Some people complain about Wikipedia that "everything is said and done". For those people Wikidata is a breath of fresh air; there is so much more that can be done.

Thursday, October 24, 2013

Manna Dey, a #singer from #India

Today Manna Dey died. He was really famous in India and #Bangladesh. When you want to know about Mr Dey, you will not find anything on a Wikipedia that has no article about him.

Realistically, you would not expect an article about him in the Occitan Wikipedia. Sadly, the Occitan Wikipedia cannot share the sum of all knowledge. To demonstrate that it could provide information after all, the following template was added:
{{Infobox Musica (artista)}}
It resulted in an infobox with several labels missing in Occitan. They were added, and the result is relevant information about Mr Dey on the Occitan Wikipedia. As you can see, the links are red links for the Occitan Wikipedia.

What if the result of a "not found" was nicely formatted data from Wikidata?

Wednesday, October 23, 2013

#SignWriting is free to use - please use it and publish with it! ;-)

This message so deserves to be heard far and wide. So I make use of the license and republish :)

SignWriting List
October 23, 2013

Dear SignWriting List -
Did you know that you all have my FULL permission to use and publish with SignWriting and Sutton Movement Writing? Some of you know that, but in our crazy world not everyone can believe it ;-)

Sometimes I receive emails from people who want so badly to use SignWriting, but are fearful that I will be mad at them if they use it…they want legal proof that they have permission to use the system. I feel sad when I receive those emails because I want so much for people to use the system freely, and there are no restrictions whatsoever… of course I always give my full permission.

I encourage you all to use the materials on our sites freely and to write freely in all of your languages… and then in return I am delighted to post your new work using the system…it is a great sharing of information …. 

Have you heard about the free contracts or licenses now available on the web to help people feel secure and not concerned? They are "free licenses" with logos that can be placed on a web site that shows that the information on that web site is free to use… apparently these free licenses are an easy way to communicate "permissions" ...

Therefore, I am showing you here the three free licenses I have chosen for our web sites and for our system…they are all free licenses….meaning that nothing has changed….You were free to use it before, and you are free to use it now …. But the licenses simply state this officially. For your information, you can find these licenses at the bottom of our web sites:

#Ethnicity in #Wikidata

In the Ottoman Empire, the Grand Vizier was the most powerful person but for the Sultan. As long as the "blood tax" or Devşirme was levied, the Grand Viziers were not Turkish. 

The blood tax was the practice by which the Ottoman Empire took boys from their Christian families and converted them to Islam, with the primary objective of selecting and training the ablest children for leadership positions, either as military leaders or as high administrators to serve the Empire.

These kids gained a lot of power and as a group they maintained that power base. Many of them married into the family of the Sultan.

In this context it is relevant to know the ethnicity because it must have been comforting to their family that they did well. It will have served as a stabilising force in the Empire.

It feels really awkward to specify the ethnicity of people. People are equal, right? So much so that the property Ethnic group is considered controversial in Wikidata because it is so easy to abuse. 
  • Mr Obama is half black and half white. Does that not qualify him more as an American than as an Afro-American? 
  • Mr Morsi would not qualify himself as an Arab but as an Egyptian and a Muslim first.

''#Pywikipediabot'' ''20:00 Thursday, October 24, 2013 - Sunday, October 27 22:00'' UTC

The one tool used by most bots has not only stood the test of time, it is getting ready to complete its rejuvenation. It has moved closer to the tools used by the Wikimedia Foundation. The road map has it that a triage is needed.

This interview happened largely on the pywikipedia mailing list, another great resource for bot runners. What you read is a compilation; more takes can be found in the list archive. :)

What is pywikipedia?
  • A Python-based framework to manipulate MediaWiki installations. Any installation, not only those run by the Wikimedia Foundation, can be worked on. You can change everything you can change in the wiki as an editor, via the API. Thus, pywikipediabot can be used to create so-called bot tasks: making the same change to a lot of pages, and doing it fast.
  • A huge bunch of scripts which use the framework above for all tasks you can think of: one script for mass uploading pictures (like scans of a book), one script for cleaning up page source code (like removing <br>, multiple empty lines, reordering interwiki links, ...), scripts for fixing common errors, and so on.
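The "cleaning up page source code" task mentioned above can be sketched as a pure function; a real bot would wrap this in the pywikibot framework and save the result through the API.

```python
# A minimal sketch of the "clean up page source" kind of task the scripts
# perform: normalise <br> tag variants and collapse runs of empty lines.
import re

def clean_source(text: str) -> str:
    text = re.sub(r"<br\s*/?>", "<br />", text, flags=re.I)  # normalise <br>
    text = re.sub(r"\n{3,}", "\n\n", text)                   # collapse blanks
    return text

page = "Hello<br>world\n\n\n\nBye<BR/>"
print(clean_source(page))
```

A bot task is exactly this: such a transformation applied to a lot of pages, fast.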
As I understand it there are two versions, "core" and "compat". What is pywikipedia core and what is compat?
Compat is old. Core is the redesign with complete new and cleaner data structures. Most new API functions (like those to modify Wikidata) are only or much better supported by core.
Why keep them both? It must be a lot of work to maintain them both.
There are so many working scripts which have done their job for years now - thus, there is not much pressure to move them to core. Nonetheless, if you plan to do something new, use core; it is much more appealing.
Recently all the bugs have been moved to bugzilla ... What is it that you hope to achieve by this?
Handling bugs will be a lot easier, there will be more eyes on the debugging process, tasks will be centralised, and it will be more wikified by merging into the WMF infrastructure; you can see more on this suggestion page.  
At Sourceforge you have your code repository and - unlinked - your bug and feature request system which over the years get filled with a lot of dormant, outdated bugs, which were fixed a long time ago. We hope to get a clean list of bugs.
Recently all the code has been moved to GIT ... What is it that you hope to achieve by this?
Git allows you to work much more naturally and doesn't break everything when you merge your different ideas again. There are so many new features compared to SVN that I do not want to outline them here. In short: in SVN you can branch your code if you do something different, but merging never works, therefore no one uses this important feature. With Git, merging works (as it stores much more information than SVN). Thus, whenever you have a new idea or start a new task, you branch.
What is the biggest challenge running the pywikipedia bot?
On the pywikibot side: learning its limitations and working around them. On the larger, bot side of things: performance. You always have to balance the API/server performance with the local computer's performance (mostly disk access for me; it might be RAM limitations when running on a VPS) and the network performance. While not related to pwb, this is something you have to consider for each new bot one writes. pwb could help by providing more advanced support for threads, such as thread-safe generators, and better support for sections (such as retrieving the whole page but submitting a section, or vice versa).
How many people are using the pywikipedia bot and how many people are developing code for pywikipedia bot
It has been widely used since 2003, and it has more than 100 authors, but now there are just five active developers. After the launch of Wikidata many bots lost their work :-) e.g. one bot was active on all Wikipedias; now it works only on 2-4 of them and the wiktionaries.
It is possible to use the pywikipedia bot in so many ways... Is it easy to learn what it can and cannot do?
Yes: Just join the irc channel and ask. For nearly all tasks, some "swiss army knife" exists, and the developers are extremely helpful. Just ask.  
When scripts have documentation inside, it is easier. Some scripts have documentation, but nobody really knows everything about all of them.
How long does it take before pywikipedia bot supports a new Wikidata data type ?
Usually it takes a week or two, based on how much the developers are willing to do. The following datatypes are already supported: item, string, URL, coordinates. Support for the time datatype is nearly done; expect it in about two months.
Tomorrow there will be the pywikibot triage... What is it and who can participate?
The bug day will focus mainly on categorising, prioritising and closing non-reproducible bugs, and fixing them if the bug is not a big deal. Because we just migrated, there are ~700 open bugs (some of them are really old and were fixed a long time ago), so we need to clean them up. It is really necessary for us to have this "bug triage".
The good news is that anybody can participate.

Tuesday, October 22, 2013

Junkers 188A-3, a photo from the #Bundesarchiv

The Junkers Ju 188 was a German Luftwaffe high-performance medium bomber built during World War II. A picture was part of the continuing cooperation with the German Federal Archives.

Once images become part of Commons, it is not the end of the cooperation. When the descriptions of the pictures are thought to be problematic, people report this, and these reports often result in changes both in the descriptions on Commons and at the German Federal Archives.

The problem with this picture is that at the indicated time this type of airplane was not flying with the unit it is attributed to.

Lutfi Pasha, a Grand Vizier of the Ottoman Empire

To become a Grand Vizier in the Ottoman Empire was the culmination of a long and distinguished career. Many of them combined this rank with being involved in the many wars the Ottoman Empire fought.

Before today, there were no statements for Lutfi Pasha in Wikidata. The Wikipedia article indicated that he had written several books on religion and history. This made it likely that there would be a VIAF identifier.
  • VIAF and its associated resources and Wikipedia disagree on the date of his death.
  • ISNI has several names for Lutfi Pasha, but they seem to be transliterations in several languages.
  • More resources are indicated
What will be interesting is if one of the many bots will pick up on the new data in Wikidata and how they will enrich it even more.

The same #data provides more #statistics and insights

#Wikidata is useful because it brings things together. As it brings more things together its value will only increase when its data is leveraged.

With these statistics you get a feeling for how things are developing and what can be done next.

Today the new statistics have been enhanced with information about the number of links per item. Many items are not linked to articles at all; the sheer number of them is what makes this remarkable.

The best statistics are the ones that are actionable. What these statistics show is how much the different Wikipedias do not cover. It makes it obvious that you do not share in the sum of all knowledge when your language is not aware of known information.

The statistics suggest that when Wikidata provides enough information from statements, they could be made available when there is no article and even serve as the target for a red link. Presentation will be key; what is provided by the Reasonator could be a start for persons.

Monday, October 21, 2013

#Wikidata #statistics: presenting labels per item

Understanding statistics is very much a consequence of how the data is presented.
I understand the presentation with the most statements at the bottom best; it best shows the work done by the Wikidata community. Compare it with the presentation below, based on the same data.

Sunday, October 20, 2013

#Wikidata: Operation "Fat Head" is a GO

New functionality for #Pywikipedia has been written. When you use search in Wikidata, there are many missing results because many items have no label in your language.

Operation "Fat head" is a job executed by Dexbot; it uses the names of Wikipedia articles for the new labels. These names are used for searching Wikipedia, so they will work equally well for Wikidata.
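Operation "Fat head" can be sketched in a few lines: when an item has a sitelink in a language but no label there, reuse the article title as the label. The items below are illustrative stand-ins for real Wikidata content.

```python
# Sketch of operation "Fat head": when an item has a sitelink but no label in
# that language, reuse the article title as the label. The item ids and
# titles are illustrative stand-ins.
items = {
    "Q1075": {"sitelinks": {"nl": "Manna Dey"}, "labels": {}},
    "Q64":   {"sitelinks": {"nl": "Berlijn"}, "labels": {"nl": "Berlijn"}},
}

def fill_labels(item: dict, lang: str) -> dict:
    """Copy the sitelink title into the label when the label is missing."""
    title = item["sitelinks"].get(lang)
    if title and lang not in item["labels"]:
        item["labels"][lang] = title
    return item

fill_labels(items["Q1075"], "nl")
print(items["Q1075"]["labels"])  # {'nl': 'Manna Dey'}
```

A bot like Dexbot would run this over every item with a sitelink in the target language and save the new labels through the API.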

I expect that the most requested items will benefit the most and, as a result Wikidata will gain relevance in many languages.

#Statistics; beware of the long tail ..

Wikidata started by including the interwiki links. In the old system it takes at least two articles to have such a link, so it is weird that the majority of items in Wikidata have only one link.

The number of labels is closely related to the number of links; one label is typically added when a link is created. So when you look at the growth of the labels, it follows that articles with no interwiki links have been added as well.

Byrial created some really interesting statistics that are informative and useful. For many languages the number of links is higher than the number of labels. When the associated article names are added as labels, this will probably swell the "fat head" more than the "long tail" in the distribution of labels.

Wikidata becomes useful for a language when the most sought-after search items are included as labels. We need to know what people are looking for and fail to find. There are two categories: the red links and the failed search items.

It will be interesting to learn if the WMF statistics department knows where our search fails. When they do, we will know not only what labels to add but also what articles to write.

My moment of "glory"

Someone is adding "notable" Wikimedians to #Wikidata. Most of them do not have a Wikipedia article. This is no longer a criterion for inclusion; now there is the dratted Commons category. I did not want to embarrass a fellow Wikimedian, so here is my moment of glory..

I do think that including an image and the commons category is compulsory to ensure a minimum of notability and relevance.

Saturday, October 19, 2013

Do not use #Wikipedia as your #source

You know it, I know it, we all know it. You should use Wikipedia to learn the basics and never ever refer to it as your source.

What you should refer to as a source are publications that have authority. They can be scholarly works or periodicals: anything stable enough that you can go back to it for verification. Yes, Wikipedia has its "Cite this page", but it starts with the following notification:
IMPORTANT NOTE: Most educators and professionals do not consider it appropriate to use tertiary sources such as encyclopedias as a sole source for any information—citing an encyclopedia as an important reference in footnotes or bibliographies may result in censure or a failing grade. Wikipedia articles should be used for background information, as a reference for correct terminology and search terms, and as a starting point for further research.
As with any community-built reference, there is a possibility for error in Wikipedia's content—please check your facts against multiple sources and read our disclaimers for more information.
It is ironic that Wikipedia is the most often used source in Wikidata. It "deserves" some choice words and maybe more but let's leave it at that.

What is more interesting is that resources like VIAF and the JPL Small-Body Database make it into this listing. They have authority; whether a source is only there so that we can verify a statement or because it is the origin of the information, it has relevance.

By referring to these sources, they gain relevance. When these sources are US governmental resources like the JPL, the use in Wikidata may even strengthen their position for future funding.

By using sources responsibly we contribute back. It is very much like what we expect when people quote Wikipedia; it is called attribution. When we use sources responsibly, we may even find errors in external data. They will welcome such contributions and we will all share in the sum of all knowledge.

#Google #Translate is a friend

"I want a watch list of the English #Wikipedia in Russian" was why technical support was sought on IRC. Technically, it is doable to retrieve a label from Wikidata in any language in which it exists.

Consider the use case: when something changes for a subject in a specific language, you get to know what changes there are so that you can comment on it or improve the corresponding article in your language.

When you learn that things have changed, you want to read the article and its changes. This is where Google Translate really helps. It allows you to understand what is new. It even helps you to express your opinion on the talk page. A tool like this will increase the number of people who are interested in the points of view on a subject expressed in the many languages of Wikipedia.
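The building block for such a tool is fetching an item's label in the reader's language. The request parameters below match the real `wbgetentities` module of the Wikidata API; the function names are illustrative, and actually fetching the URL (with `urllib` or similar) is left out so the sketch stays self-contained.

```python
# Sketch: look up the label of a Wikidata item in a chosen language,
# the building block for a "watchlist of the English Wikipedia in Russian".

from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"

def label_url(qid, lang):
    """Return the wbgetentities URL that fetches one label."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "props": "labels",
        "languages": lang,
        "format": "json",
    }
    return API + "?" + urlencode(params)

def label_from_response(response, qid, lang):
    """Pull the label out of a wbgetentities JSON response, if present."""
    labels = response.get("entities", {}).get(qid, {}).get("labels", {})
    entry = labels.get(lang)
    return entry["value"] if entry else None
```

With this in place, a watchlist tool only has to map each changed page to its item and swap the English title for the label in the reader's language, falling back to the original title when no label exists.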

Friday, October 18, 2013

#Wikidata needs 3,780,000,000 labels

For an item to be found in Wikidata, it needs a label. Ideally any item has a label in every language that Wikidata supports; currently there are 280+ languages. For practically all of these languages Wikidata is not really useful yet. The statistics show that only a small percentage of items has more than 10 labels and most items have only one label.

There are over 13.5 million items and consequently there is a need for 3,780,000,000 labels give or take a few.
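The headline number is simple arithmetic; the item and language counts are the ones quoted in the text.

```python
# Every item ideally has a label in every supported language,
# so the number of labels needed is items times languages.
items = 13_500_000
languages = 280

labels_needed = items * languages
print(labels_needed)  # 3780000000
```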

There are several ways to drastically improve the number of labels. There are also ways to minimise the impact of missing labels.

The most obvious one is to use every title of a Wikipedia article as a label. There are for instance 729,712 articles in Chinese and 244,699 articles in Arabic; these will probably provide the most requested information in those languages. We can also use lexical information from any freely available resource; Wiktionary for instance includes a wealth of usable data. Implementing these two strategies alone will have a big impact.

The names of people typically stay the same in most languages with the same script. There are standards for transliteration, and they can provide us with an adequate result. Add these strategies to the mix and it will become even better.
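Rule-based transliteration of names is straightforward to sketch. The mapping below covers only a handful of Cyrillic letters and is not any official standard (real standards include ISO 9 and GOST romanisation); it only illustrates the mechanism.

```python
# Illustrative, deliberately incomplete Cyrillic-to-Latin mapping.
CYRILLIC_TO_LATIN = {
    "А": "A", "а": "a", "Б": "B", "б": "b", "В": "V", "в": "v",
    "Г": "G", "г": "g", "Д": "D", "д": "d", "Е": "E", "е": "e",
    "И": "I", "и": "i", "К": "K", "к": "k", "Л": "L", "л": "l",
    "Н": "N", "н": "n", "О": "O", "о": "o", "Р": "R", "р": "r",
}

def transliterate(name):
    """Replace each known Cyrillic letter; pass unknown characters through."""
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in name)

# transliterate("Ленин") -> "Lenin"
```

Generating label candidates this way is cheap; a human or a frequency check should still confirm them before they are saved, since names do not always transliterate mechanically.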

Finally, there is our community. When we make them aware of which labels are most often requested but not found, they can either add a label to an existing item or add a new item.

Statistics indicate that we have a problem but with this awareness we can build statistics that will indicate what works and where our efforts have the most impact in making Wikidata truly useful.

#SignWriting, #sign languages - an #Interview with Valerie II

Half of the people who sign are not deaf. Therefore the number of people who benefit from a sign language that can be written is more than just the number of people who are deaf and sign.

Thanks to SignWriting, sign languages can move on from only having an "oral tradition". I have been privileged to witness the SignWriting community moving slowly but surely ahead, gaining recognition by making their languages and their cultures equal to any other language in the age of the Internet.

Knowing Valerie is an inspiration. She is a real mover and shaker for so many people. I rate her as highly as Jimmy Wales. Anyway, these are the questions I put to her. Enjoy!

How do you explain what it means when a language cannot be written?
If I understand it correctly, most of the world's languages do not have a written form. All languages CAN be written. But most languages are not written.

Sign languages are now written languages!  ;-))

But it takes effort by people who know their languages to want to develop a way to write it and lots of languages, for example, in Africa and Asia, may not have the political position, nor the funds, to invest in the development.
How many languages are written in the SignWriting Script?
That is also hard to say, but we estimate that small groups of people are writing their sign languages in around 40 countries…based on real written literature and also by word of mouth…we have definite proof for many, and some proof for some…. Some sign languages have 100s if not 1000s of written documents - ASL and DGS are two examples
Can you recognise what sign language it is from a written text?
Yes, between written ASL and written DGS (German Sign Language) we can definitely see a difference immediately - Other sign languages, not so much yet, because we do not have that much experience comparing written documents - as each sign language has more and more literature, it will be easier to recognize differences in the sign language literature - a lot has to do with the choice of style of writing… German Sign Language is written with more mouth movements than our ASL written documents… so one can see which language it is quickly
What are the most active sign languages (that are writing with SignWriting)?
American Sign Language, Argentine Sign Language, Brazilian Sign Language, Catalan Sign Language, Czech Sign Language, French-Canadian Sign Language, French-Belgian Sign Language, French-Swiss Sign Language, French Sign Language, Flemish Sign Language, German Sign Language, German Swiss Sign Language, Italian Sign Language, Jordanian Sign Language, Korean Sign Language, Maltese Sign Language, Nicaraguan Sign Language, Norwegian Sign Language, Polish Sign Language, Portuguese Sign Language, Saudi Arabian Sign Language, Slovenian Sign Language, Spanish Sign Language, Tunisian Sign Language and more…

The use of SignWriting is growing rapidly. How do you know about how it develops?

Through the internet, in different ways. And through individuals sending me documents. And through mentions on the SignWriting List. Some write documents publicly in SignPuddle. Some get their school degrees - Ph.Ds and Master Degree theses are posted written on SignWriting or using SignWriting… Papers are presented about SignWriting and they are listed in publications, and occasionally people write to me privately or join the SignWriting List or Facebook or Twitter… but I actually do not know how many people use SignWriting and I never will, because it is free on the Internet and the way it spreads is like it has a life of its own
Can SignWriting be used on mobile phones or is there an app for that…
Yes, we have two apps for the iPhone…one from Germany and one from California:
SignWriting App from California, by Jake Chasan (age 16 ;-)
Signbook App from Germany, by Lasse Schneider, University of Hamburg
Are there many schools where they teach SignWriting?
SignWriting is spread freely on the internet, so lots of people learn SignWriting on their computers, but there are schools with official courses too, such as

  • Osnabrück School for the Deaf in Germany
  • A school in French-Canada (Quebec)
  • Schools in French-Belgium
  • A school in Flemish-Belgium (Brussels)
  • A school in Poland
  • A University in the Czech Republic
  • A Catholic School for the Deaf in Slovenia
  • Bible Translators teach SignWriting in Madrid
  • A university in Barcelona (Catalan Sign Language)
  • ASL classes in a hearing high school in Tucson, Arizona
  • ASL classes at San Diego Mesa Community College in San Diego, California
  • ASL classes at UCSD, University of California San Diego
  • Courses on SignWriting are taught throughout Brazil, by Libras Escritas, also online
  • A School for Deaf Children in Brazil: Teacher Sonia Messerschimidt
  • Santa Maria-Rio Grande do sul - Brasil , Escola Estadual de Educação Especial
  • Letras LIBRAS , UFSC Florianópolis, (SignWriting is an integral part of the curriculum)
  • the list goes on and on ... 
How hard would it be to have the pupils at these schools write two articles a month ... How many Wikipedias could be started that way…
Just as soon as Steve Slevinski and Yair Rand are ready for us to move Nancy Romero's 37 articles, and Adam's 2 articles and Charles Butler's 1 article, from the Wikimedia Labs to the Incubator, and we have tested the ASL User Interface, and we have tested the new features like linking and selecting text, and when Steve has completed the new SignWriting Editor program that will make it possible to write articles directly on the Incubator site, then of course we can ask teachers and students to test our new software and start writing articles… our software isn't ready yet though. 
That is why Nancy Romero, our most prolific English-to-ASL translator and SignWriter, is writing enough articles to lay a foundation so we can get started - we could ask students to write the articles over in SignPuddle Online, and then we can move those articles over to the Incubator for them - but until the Editor software is completed it wouldn't be as much fun for the students as it will be later - that day is coming and it is an excellent idea for the future.
Why is Wikipedia strategically important for getting more people to know about SignWriting?
I consider it VERY important because it provides us literature for readers to read written in ASL that are not children's stories or religious literature. Ironically we have plenty of children's stories written in ASL, including Cat in the Hat, Goldilocks, Snow White and others…and Nancy Romero has written close to the entire New Testament in ASL based on the New Living Translation, and another Deaf church has also written much of the Bible - but we are really grateful to have articles to read in Wikipedia that are general non-fiction - educational, historical, scientific.
Wikipedias are also important because they will encourage others to write articles in ASL which will indirectly teach people how to write ASL. Another wonderful indirect result will be an added respect from the general public. Surprised visitors will realize that ASL and other sign languages can now be written -
Val ;-)

Reasonator gets "instance of" "#human"

Reasonator, the tool by Magnus now appreciates that a person is a "human". An increasing number of items in Wikidata identify a person as a human. As a consequence, all the tools that rely on appreciating that an item is about a human need to adapt. When you are aware of tools that still need to adapt, pinging the author of the tool is a reasonable approach :)

I selected Herman Melville for the illustration; he is mentioned today as the author of Moby Dick on the main page of the English language Wikipedia.

#Statistics for #Wikidata

Finally some up-to-date and useful statistics for Wikidata. Magnus Manske brings us these wonderful statistics that are current and actually useful. To appreciate the one above, you have to know that many items are linked to outside sources like VIAF. Items with more than 5 statements probably include contributions made by volunteers.

When an item has no label in a language, you cannot search for it. If anything, this is what worries me. To be useful in a language, you have to be able to find an item. Given that there are 280+ languages represented in Wikidata, this is probably the biggest challenge facing Wikidata.

Thursday, October 17, 2013

#VIAF, sources and the battle of the sexes continued ..

In the English language #Wikipedia there are more articles about men. #Wikidata knows more persons than VIAF does, and both are aware of more men as well.

When you research Wikidata you may learn the gender ratio of men and women in any Wikipedia. As more statements are added, the data will become more precise and more articles will become known to be about "humans".
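Counting genders over a set of items is the kind of query WikiDataQuery answers. In the sketch below, Q6581097 (male) and Q6581072 (female) are the real Wikidata item identifiers used as values of property P21 (sex or gender), but the claims layout is a simplification and `gender_counts` is an illustrative helper, not the query engine's code.

```python
# Sketch: tally the values of property P21 (sex or gender) over items.
from collections import Counter

MALE, FEMALE = "Q6581097", "Q6581072"

def gender_counts(items):
    """Count P21 values across items; items without P21 are skipped."""
    counts = Counter()
    for claims in items:
        value = claims.get("P21")
        if value:
            counts[value] += 1
    return counts
```

Run over a dump, a tally like this yields exactly the kind of male/female numbers quoted above, and the counts sharpen as more items gain a P21 statement.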

VIAF is maintained by OCLC and it brings together the information of many libraries from around the world. It links people, authors, and subjects to books. Wikipedia does to a certain extent the same thing by linking sources to facts in articles. Many of these sources are books or periodical publications identified by an ISBN or ISSN number.

The notes, references and further reading used in Wikipedia articles do refer to books and periodical publications and often provide the ISBN and ISSN numbers as well. They may even refer you to your library.

Wikidata allows for adding sources to its statements. Typically, however, statements have no real sources at all. Sadly this is used as an excuse for not using Wikidata.

When all the Wikipedias are mined for their sources and these are added to Wikidata as sources for a subject, Wikidata will become a useful tool for the library world. For people who love to add sources, it becomes much easier not only to add sources to statements but also to verify the veracity of statements.

As to the battle of the sexes, when you care about more articles about women, you have to write them. There are so many notable humans who do not have an article yet.

#SignWriting, #sign languages - an #Interview with Valerie I

The biggest challenge for a #language to gain permanence is to be written. Many languages became written languages by adopting an existing script. Sign languages are fundamentally different and therefore there was no script to adopt.
Valerie Sutton, a ballerina, developed a method to register dance movements. Linguists who researched sign languages asked Valerie if this could be applied to sign languages. Many iterations later, SignWriting became an ISO-recognised script; it is known to be used for at least 40 sign languages, all of which may gain their own Wikipedia.

When I asked Valerie to answer ten questions, her response was to ask the SignWriting community for their opinion. This, the first part, explains the need for sign languages. As it is the most often asked question about sign languages it deserves a full response.

Some people say "Why do they not use English?" .. How different is a sign language from a spoken language?
A lot of Deaf people DO use English … ;-)

Deaf people who use a sign language as their primary daily language, also use English as their second language. But, they cannot hear their second language, English, and American Sign Language (ASL) and other sign languages are rich languages that give a deep communication that is more profound than speaking a second language you struggle to hear, or cannot hear at all… Lip reading does not give all the sounds made on the lips, and many conversations have to be guessed at most of the time… They say that at best, lip reading gives 30% understanding and everything else is guessed at….

So if a signing Deaf person, whose primary language is American Sign Language, lives in the United States or English-speaking Canada, they have to get around, and they learn English to get by

Just as I learned Danish when I lived in Denmark. Learning a second language is a requirement and you do your best… But there was one difference for me - I can HEAR Danish, which was my second language years ago when I lived in Denmark… I would have found it much harder to learn Danish if I were Deaf and could not hear it…And truth be told, if I really wanted a deep and profound communication, I would always migrate back to my native language English. My second language did not give me the true communication of my native tongue.

So the question is really not "why is one language better or easier than the other?" but instead a realization that deafness creates a barrier to learning spoken languages, and that first and second languages are different experiences… Deaf people are not choosing one language over the other, but instead managing the best they can between the hearing and Deaf communities.

Both signed and spoken languages are good languages and should be equally respected. And it is my feeling that everyone should learn another language if they can. Hearing people who speak English as their native language oftentimes enjoy learning American Sign Language, but no one asks them why they just don't use English?! (at least I hope not ;-)

In school we are asked to learn a foreign language…so why not learn American Sign Language or other signed languages?

And in return, Deaf people spend most of their lives learning spoken languages to the best of their ability and I give them my utmost respect for the hard work I know they must go through everyday...

Some Deaf people are born into Deaf families. Deaf children in Deaf signing families have a native language, sign language, from the beginning, so their language development is early and considered the same as a hearing child's language development. Spoken languages are a "second language" to everyone in the Deaf family. So they do not feel "different" than their parents or siblings.

Deaf children born into hearing families sometimes have a harder time, because oftentimes the family doesn't even know the child is deaf until later, and so language development may not start early, and also they are different than their own parents and family members.

So native signing Deaf people have their own native language, a sign language, and yes, the grammar and structure of American Sign Language, for example, is quite different than the grammar and structure of English. Verbs are conjugated differently, adverbs and adjectives are in different positions in the sentence, and there are elements of American Sign Language that are much more sophisticated than in English, or at least expressed very differently, and so oftentimes there is not a real way to translate between the two languages that is a true "match"… what takes a paragraph in English can be expressed with a short phrase in ASL, and vice versa… Some say that the grammar of ASL is closer to Russian or Spanish than it is to English...

That is why SignWriting is important. When both languages can be written, both languages can be compared, and understood better…

Val ;-)