Wednesday, July 28, 2010

Failing #statistics (finally)

Now that Erik Zachte has announced issues with the statistics as published, it is a good moment to reflect. It is the aim of the Wikimedia Foundation to double its reach in five years' time, doubling our traffic. The expected result is expressed numerically and consequently we require hard numbers.

The likes of Alexa and comScore publish numbers on Wikimedia's traffic, so there are alternative numbers providing us with a second opinion. Their numbers, while good, are no substitute for the numbers we need for our own purposes.

The numbers are used in many ways and for many audiences. They are important for the GLAMs that contributed material to us. These same numbers provide the arguments for other GLAMs to work with us. They are used to learn how a competition is doing. They provide background numbers when we talk to the press on many subjects.

Our statistics are vital. When I asked for a slot for a panel discussion at Wikimania about statistics, the numbers ended up being quite different. I am now at a loss as to how to appreciate the numbers we have. I understand that some statistics will be approximated to what they should have been. Other numbers will not receive such royal treatment.

This mishap is painful and I really hope it is felt that way. As we have several people working professionally on statistics, as many studies are done based on the numbers we provide, as the Toolserver is another resource that relies heavily on us accruing the right numbers, it is fair to call statistics one of our primary processes.

For our other databases we have redundancies. I hope that we will hear from those responsible for accumulating the data our statistics are based upon how our data collection will be made more robust in the future.
Thanks,
       GerardM

4 comments:

Erik Zachte said...

I don't think you have enough appreciation for how many plates a very small tech team needs to keep spinning. As I mentioned in my blog post: once I was convinced there was an issue, and I raised it, I got a speedy response.

Erik Zachte said...

You seem to be a bit confused here: your session at Wikimania focussed on which languages/geographies should get most attention.

Underreported page views, however regrettable, are distributed pro rata over the globe and over languages, so there is no distortion relevant to your discussion topic at all.

GerardM said...

There may have been a quick response once you were convinced there was an issue. It is, however, very much beside the point. The point is that other applications were installed on the stats server without adequate capacity planning.

My session at Wikimania was not about languages and geographies. It was about what data is considered relevant and the consequences of that choice, which bring us biased numbers and approaches.
Thanks,
GerardM

Erik Zachte said...

"without adequate capacity planning."

Very cheap criticism. Why do you think WMF is hiring more tech staff?

"My session at Wikimania was not about languages and geographies."

I must have gotten that wrong then. Maybe you should blog about it more often.