Friday, March 06, 2015

#Kiwix - getting #Labs ready for the #Wikipedia big time

Offline #Wikipedia received a big boost. It now updates its images monthly for most of the #Wikimedia projects. Most, but not all. Emmanuel was asked to write up his challenges, and I am happy to share this with his permission. Developments like this make both Labs and Kiwix even more strategic to our goals.
Thanks,
       GerardM

Following Yuvi's and Andrew's invitation, I am writing this email to explain what I want to do with Labs and to share my first experiences with you.
== Context == 
Most people still don't have the free or cheap broadband access needed to fully enjoy reading Wikimedia web sites. With Kiwix and openZIM, a WikimediaCH program, we have been working for almost ten years on solutions to bring Wikimedia content "offline".
We have built a multi-platform reader and have created ZIM, a file format to store web site snapshots. As a result, Kiwix is currently the most successful solution to access Wikipedia offline. 
== Problem == 
However, one of the weak points of the project is that we still don't manage to generate fresh snapshots (ZIM files) often enough. Generating ZIM snapshots periodically (we want to provide a fresh version each month) for 800+ projects requires substantial hardware resources.
This might look like a detail, but it's not. The lack of up-to-date snapshots holds back many efforts within our movement to advertise our offer more broadly. As a consequence, too few people are aware of it, as reported in the last Wikimedia readership update. Another side effect is that every few months, volunteer developers get the idea of building a new offline reader based on the XML dumps (the only up-to-date snapshots we provide for now), which is close to a dead-end approach.
== Goal == 
Our goal with Labs is to have a sustainable and efficient solution to build, once a month, new ZIM files for all our projects (for each project, one with thumbnails and one without). This is both a requirement for, and a part of, a broader initiative whose purpose is to increase awareness of our "offline offer". Other tasks include, for example, storing all the ZIM files on Wikimedia servers (we currently store only part of them on download.wikimedia.org) and improving their accessibility by making them more visible (WPAR has, for example, customised their sidebar to provide a direct access 
== Needs == 
Building a ZIM file from a MediaWiki is done using a tool called mwoffliner, a scraper based on both the Parsoid and MediaWiki APIs. After scraping and rewriting the content, mwoffliner stores it in a directory. At that point the content is self-sufficient (without online dependencies) and can be packed in one step into a ZIM file (using a tool called zimwriterfs).
To run this software, you ideally have:
  • A little bandwidth
  • Low network latency (lots of HTTP requests)
  • Fast storage
  • Plenty of storage (~100 GB per million articles)
  • Many cores for compression (ZIM, ZIP and picture optimisation)
  • Time (~400,000 articles can be dumped per day on one machine)
My guess is that we need a total of around a dozen VMs and 1.5 TB of storage.
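As a rough illustration of how the figures in the list above combine, here is a minimal back-of-the-envelope sketch in Python. The two constants come from the list; the function names, the 30-day window and the perfect-parallelism assumption are hypothetical and only there for illustration. It does not account for building each project twice (with and without thumbnails), nor for compression time.

```python
import math

# Figures quoted in the list above.
GB_PER_MILLION_ARTICLES = 100       # ~100 GB of storage per million articles
ARTICLES_PER_DAY_PER_VM = 400_000   # ~400,000 articles dumped per day per machine


def storage_gb(articles: int) -> float:
    """Estimated scratch storage (in GB) needed to scrape `articles` articles."""
    return articles / 1_000_000 * GB_PER_MILLION_ARTICLES


def days_on_one_vm(articles: int) -> float:
    """Estimated number of days one machine needs to dump `articles` articles."""
    return articles / ARTICLES_PER_DAY_PER_VM


def vms_needed(total_articles: int, window_days: int = 30) -> int:
    """Minimum number of VMs to dump everything within `window_days`.

    Assumes (optimistically) that the work splits perfectly across machines.
    """
    return math.ceil(total_articles / (ARTICLES_PER_DAY_PER_VM * window_days))


if __name__ == "__main__":
    articles = 1_500_000  # example input only
    print(storage_gb(articles))      # 150.0 (GB)
    print(days_on_one_vm(articles))  # 3.75 (days)
```

Such an estimate is only a lower bound: sporadic slowdowns, retries and the compression step all add to the real numbers.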
== Current achievements == 
We currently have 3 x-large VMs in our "MWoffliner" project.
With them we are able to provide, once a month, ZIM files for all instances of Wikivoyage, Wikinews, Wikiquote, Wikiversity, Wikibooks, Wikispecies, Wikisource, Wiktionary and a few minor Wikipedias.
Here is some feedback about our first months with Labs:
  • Labs is a great tool, it's fully in the Wikimedia spirit and it works.
  • Support on IRC is efficient and friendly
  • We faced a little instability in December, but instances seem to be stable now
  • The documentation on the wikitech wiki seems to be pretty complete, but the overall presentation is, in my opinion, too chaotic; getting started might be easier with a more user-friendly presentation.
  • Semantic MediaWiki & OpenStackManager sync/cache/cookie problems are a little bit annoying
  • Overall VM performance looks good, although it suffers from sporadic instabilities (bandwidth not available, all the processes stuck in "kernel time", slow storage).
In general, Labs does the job; we are satisfied and think it is a suitable solution for our project.
== Next steps == 
We want to complete our effort and mirror the biggest Wikipedia projects. Unfortunately, we have reached the limits of a traditional usage of Labs. We need more quota, and we need to experiment with the NFS storage, because an x-large instance is not able to mirror more than 1.5 million articles at a time. How might that be made possible?
