Friday, September 27, 2013

Importing data from the #Polish #Wikipedia

All new information in #Wikidata has an origin. It can come from many sources and the quality varies. When Matmarex mentioned that his source was a Polish project collecting information about people, I wanted to learn more.

This is the kind of project that we should welcome at Wikidata. Please have a read and be happy with great undertakings like this.

What is the data you are importing based on?
The data is based on the index of biographies maintained by hand by a few dedicated Polish Wikipedians at Noty_biograficzne. The nature of created-by-hand data is that not all of it can be automatically parsed, but it was surprisingly consistent – accepting just a few common variants resulted in over 60 000 items the bot could understand and only several hundred that could not be used (I am hoping to sort these out by hand). Some typos in the source are unavoidable, but overall the quality seems to be very high.
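The idea of "accepting just a few common variants" can be sketched as a small parser. This is only an illustration: the line format below (a bulleted wikilink, optional years in parentheses, a dash, then a free-form description) is an assumption, not the documented markup of the index, and `parse_entry` is a hypothetical helper rather than part of the bot described here.

```python
import re

# ASSUMED entry format, e.g.:
#   * [[Jan Kowalski]] (1900–1980) – polski pisarz
#   * [[Maria Nowak]] (ur. 1950) - polska aktorka
#   * [[Sting (muzyk)|Sting]] – brytyjski muzyk
ENTRY = re.compile(
    r"^\*\s*\[\[(?P<title>[^\]|]+)(?:\|[^\]]*)?\]\]"  # wikilink, optional label
    r"(?:\s*\((?:ur\.\s*)?(?P<born>\d{1,4})"          # optional "(born" year
    r"(?:\s*[–-]\s*(?P<died>\d{1,4}))?\))?"           # optional "–died)" year
    r"\s*[–-]\s*(?P<desc>\S.*)$"                      # dash, then description
)


def parse_entry(line):
    """Return a dict for a recognised entry, or None if it needs a human."""
    m = ENTRY.match(line.strip())
    if m is None:
        return None
    return {
        "title": m.group("title").strip(),
        "born": int(m.group("born")) if m.group("born") else None,
        "died": int(m.group("died")) if m.group("died") else None,
        "description": m.group("desc").strip(),
    }
```

Lines that return `None` would be set aside for manual review, which mirrors the split between the entries the bot could understand and the few hundred it could not.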
Getting quality personal information has been a project on pl.wp for quite some time. Can you explain what it means?
I am not sure myself what the index was intended to be, as I was not yet a Wikipedian when it was started in 2004 – possibly a crossover between a category system (the concept of a category was only introduced that year; I don't know which came first) and a list of articles needing creation. Currently it serves as, well, an index – a list of all biographies on the Polish Wikipedia, ordered alphabetically by last name (or, in some cases, by pseudonym). If you only remember a person's last name and possibly their occupation, it is easier to find what you're looking for there than with the built-in search system (for example, search suggestions are ordered by article title – thus by first name – and it's not possible to limit results to biographies only).
You are running a bot adding descriptions in Polish. What software are you using?
Unlike most bot operators I'm not using the Pywikibot framework – I opted for my own custom-written library in Ruby called Sunflower and a set of scripts using it.
Does it use the Wikidata API, and why is this important?
Yes, both for uploading and using the information. The API is basically what made the project possible.
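As a rough illustration of what uploading through the API involves, the sketch below builds the POST parameters for the Wikibase `wbsetdescription` action. It is written in Python, not with Sunflower, and it is not the bot's actual code; the function names are hypothetical, and login plus fetching a CSRF token are assumed to have happened elsewhere.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.wikidata.org/w/api.php"


def description_params(item_id, language, text, csrf_token):
    """Build the POST parameters for setting one description on one item."""
    return {
        "action": "wbsetdescription",
        "id": item_id,          # e.g. "Q42"
        "language": language,   # e.g. "pl"
        "value": text,          # the description itself
        "token": csrf_token,    # assumed to come from an authenticated session
        "bot": "1",             # mark the edit as a bot edit
        "format": "json",
    }


def post(params, opener):
    """Send the request using an already authenticated opener.

    `opener` is assumed to be something like the result of
    urllib.request.build_opener(...) that carries the login cookies.
    """
    data = urllib.parse.urlencode(params).encode("utf-8")
    with opener.open(API_URL, data=data) as response:
        return json.load(response)
```

Calling `description_params("Q42", "pl", "angielski pisarz", token)` only prepares the request; nothing is sent until `post` is called with a logged-in opener.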
What other data do you have about all these people?
The old index contains birth and death years in addition to the descriptions. I didn't upload them because they are basically unsourced (and unsourced data is seemingly as frowned upon on Wikidata as it is on Wikipedia, if not more) and because, when I tried comparing them with the birth and death categories on the biographies themselves, I found over 1500 conflicts.
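A comparison of this kind can be sketched as follows. The category name patterns ("Urodzeni w 1900", "Zmarli w 1980") and both helper functions are assumptions made for illustration; the actual comparison may well have been done differently.

```python
import re

# ASSUMED category naming on the Polish Wikipedia.
BORN = re.compile(r"^Urodzeni w (\d{1,4})$")
DIED = re.compile(r"^Zmarli w (\d{1,4})$")


def years_from_categories(categories):
    """Extract (born, died) years from a list of category names."""
    born = died = None
    for name in categories:
        m = BORN.match(name)
        if m:
            born = int(m.group(1))
            continue
        m = DIED.match(name)
        if m:
            died = int(m.group(1))
    return born, died


def find_conflicts(index_years, category_years):
    """Compare two {title: (born, died)} mappings.

    A conflict is reported only when both sources give a year and the
    years differ; a year missing on either side is not a conflict.
    """
    conflicts = []
    for title, (i_born, i_died) in index_years.items():
        if title not in category_years:
            continue
        c_born, c_died = category_years[title]
        for field, a, b in (("born", i_born, c_born), ("died", i_died, c_died)):
            if a is not None and b is not None and a != b:
                conflicts.append((title, field, a, b))
    return conflicts
```

Each reported tuple names the article, the field, and the two disagreeing years, which is the kind of list a human would need in order to check sources.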
Can you use your data to compare against the data on Wikidata?
Not really; there isn't much to compare in this area, especially since I was uploading basically free-form text. Only several hundred items out of the 60 thousand my bot edited already had a description in Polish; a couple of other editors and I reviewed them all in a few days.
The birth/death data could be compared, though; I haven't looked into it. Any help would be welcome!
Can you add your data where Wikidata has none?
There are a few things the index uses that are not yet present on Wikidata. The birth and death dates are the biggest one; real names of people using pseudonyms (such as Sting or Madonna) would be a valuable piece of information as well. I didn't try to upload either – the dates would need better sources (the 1500 conflicts are a strong indicator that information from Wikipedia might not be good enough) and there are currently no properties defined for first / last names because of how complicated the topic is (there is currently a discussion under way).
Did you know that this type of data becomes available on several Wikipedias in stub articles?
I considered parsing the articles themselves to extract the descriptions, but decided that this would be too error-prone to automate entirely. Instead I developed a gadget that helps users write short descriptions for biographic articles by extracting the information from the lead-in paragraph and presenting it on the index pages – they can be adjusted by a human and saved to Wikidata with one click! This benefits both projects at once and I think is a good example of how they can work together.
Is it possible to transfer this biography project from pl.wp to Wikidata?
It could be done entirely on Wikidata and using Wikidata information, but there are two preconditions – presence of the required information on Wikidata (birth/death dates, last names for correct sorting) and ability to generate lists from the data (so-called "phase 3" could accomplish this).

