Saturday, March 31, 2012

#Chennai hackathon III

The biggest challenge after a #hackathon is probably to keep the hackers engaged. When they know that their effort is appreciated, they are more likely to remain involved in our projects.

One of the projects of the Chennai hackathon is the "Random Good WP India article tool" and it is really nice to learn that the tool is now available for your use. Have a look and see if a tool like this is something for your wiki.

If there is enough interest, a tool like this can become available for the projects in the Wikis themselves.

#Wikidata, the interview

The press release is out, the mailing lists are full of it. But what is this brand new Wikidata project about? Who better to ask than Lydia Pintscher and Daniel Kinzler? Lydia does "community communications" for the Wikidata project and Daniel has been involved in many data related projects including OmegaWiki, the first Wikidata iteration.

Wikidata is still brand new; it does not have its own logo. It is ambitious, however, and I expect that it will improve the quality and the consistency of the data used in MediaWiki projects everywhere. Enjoy the answers to the ten questions answered by Lydia and Daniel.

What is it that Wikidata hopes to achieve?
The Wikidata project aims to bring structured data to Wikipedia with a central knowledge base. This knowledge base will be accessible for all Wikipedias as well as 3rd parties who would like to make use of the data in it.

When you express this in REALLY simple language, what is the "take home" message of Wikidata?
We are creating a central place where data can be stored. This could for example be something like the name of a famous person together with the birthdate of that person (as well as a source for that statement). Each Wikipedia (and others) will then be able to access this information and integrate it in infoboxes for example. If needed this data can then be updated in one place instead of several. There is more to it but this is the really simple and short version. The Wikidata FAQ has more details.

Structured data is very much like illustrations. The same data can be used over and over again. Will there be a single place to update for everywhere where it is used?
Yes. There will be a central place but we are also working on integrating this in the editing process in the individual Wikipedias.

This project is organised and funded by the German chapter. Will this ensure that the data can be used in many languages?
The German chapter is indeed organising it. Funding is coming from Google Inc., AI2 and the Moore Foundation. One of the main points of Wikidata is that it will no longer be necessary to have redundant facts in different Wikipedias. For example it is not really necessary to have the length of Route 66 in each language’s article about it. It should be enough to store it once, including sources for the statement, and then use it in all of them. In the end the editor will be free to decide if he or she wants to use that particular fact from Wikidata or not. This should be especially helpful for smaller Wikipedias who can then make better use of the work of larger Wikipedias.
In order to make the data useful in pages written in different languages, we of course have to provide a way to supply information in different languages. This is described in more detail below.

In new Wikipedias a lot of time is spent on localising info boxes. Will Wikidata make this easier?
The template for the infobox still has to be created by the respective Wikipedia community -- but filling the infoboxes would be much easier! Wikidata could be used to get the parameters for the infobox templates automatically, so that they do not need to be provided by the Wikipedians.

Will it be possible to associate the data with labels in many languages? 
Each entity record can have a label, a description (or definition) and some aliases in every language. Not only that: each language version of the label can have additional information like pronunciation attached. For example, the record representing the city of Vienna may have the label “Vienna” in English and “Wien” in German, with the respective pronunciations attached (/viːˈɛnə/ resp [viːn]).
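As a rough illustration of the answer above, an entity record with per-language labels, descriptions, aliases and pronunciations could be sketched as follows. The class and field names (`Term`, `Entity`, `label`) and the fallback to English are assumptions made for this sketch; they are not the actual Wikidata data model. The descriptions are illustrative values as well.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Term:
    """Everything an entity carries in one language (hypothetical structure)."""
    label: str
    description: str = ""
    aliases: List[str] = field(default_factory=list)
    pronunciation: str = ""


@dataclass
class Entity:
    """One record in the knowledge base, keyed by language code."""
    terms: Dict[str, Term] = field(default_factory=dict)

    def label(self, lang: str, fallback: str = "en") -> Optional[str]:
        # Use the requested language, else the fallback language, else None.
        term = self.terms.get(lang) or self.terms.get(fallback)
        return term.label if term else None


vienna = Entity(terms={
    "en": Term("Vienna", "capital of Austria", pronunciation="/viːˈɛnə/"),
    "de": Term("Wien", "Hauptstadt von Österreich", pronunciation="[viːn]"),
})

print(vienna.label("de"))  # label stored for German
print(vienna.label("fr"))  # no French term, so the English label is used
```

The fallback is only one possible design choice; a wiki could just as well show nothing, or the entity identifier, when a label is missing in its language.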

Many people who are part of the project have a Semantic MediaWiki background. How does this affect the Wikidata project? 
Wikidata will profit from the team’s experience with Semantic MediaWiki in two ways: they know what worked for SMW, and they know what caused problems. We plan to be compatible with the classic SMW in some areas: for instance, we plan to re-use SMW plugins for showing query results. On the other hand, we will use a data model and storage mechanisms that are more suitable to the needs of a data set of the style and size of Wikipedia.

To what extent will Wikidata data be ready to be expressed in a "semantic" way and if so what are the benefits?
Wikidata will only express very limited semantics, following the linked data paradigm rather than trying to be a true semantic web application. However, Wikidata will support output in RDF (the Resource Description Framework), albeit relying on vocabularies of limited expressiveness, such as SKOS.

The DBpedia project extracts data from the Wikipedias. To a large extent this is the same data Wikidata could host. Is there a vision on how Wikidata and DBpedia will coexist?
If and when all structured data currently in Wikipedia is maintained within Wikidata, the extraction part of DBpedia will no longer be necessary. However, a large part of DBpedia’s value lies in the mapping and linking of this information to standard vocabularies and data sets, as well as maintaining a Wikipedia-specific topic ontology. These things are and will remain very valuable to the linked data

You worked on the first Wikidata iteration; OmegaWiki. What is the biggest difference?
The idea of Wikidata is quite similar to OmegaWiki - it’s no coincidence that the software project that OmegaWiki was originally based on was also called “Wikidata”. But we have moved on since the original experiments in 2005: The data model has become a bit more flexible, to accommodate the complexity of the data we find in infoboxes: for a single property, it will be possible to supply several values from different sources, as well as qualifiers like the level of accuracy. For instance, the length of the river Rhine could be given as 1232 km (with an accuracy of 1 km) citing the Dutch Rijkswaterstaat as of 2011, and with 1320 km according to Knaurs Lexikon of 1932. The latter value could be marked as deprecated and annotated with the explanation that this number was likely a typographical error, misrepresenting earlier measurements of 1230 km.
This level of depth of information is not easily possible with the old Wikidata approach or the classic Semantic MediaWiki. It is however required in order to reach the level of quality and transparency Wikipedia aims for. This is one of the reasons the Wikidata project decided to implement the data model and representation from scratch.
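To make the Rhine example concrete, here is a minimal sketch of one property carrying several sourced values with qualifiers. The `Statement` class, its field names and the `current_values` helper are hypothetical and only illustrate the idea described in the answer; the real Wikidata data model may look quite different.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Statement:
    """One sourced value for a property (hypothetical structure)."""
    prop: str
    value: float
    unit: str
    source: str
    accuracy: Optional[float] = None  # qualifier: +/- in the same unit
    deprecated: bool = False          # kept for transparency, not for display
    note: str = ""


# Both figures from the interview are kept, each with its own source.
rhine_length = [
    Statement("length", 1232, "km", "Rijkswaterstaat (2011)", accuracy=1),
    Statement("length", 1320, "km", "Knaurs Lexikon (1932)", deprecated=True,
              note="likely a typographical error for 1230 km"),
]


def current_values(statements: List[Statement]) -> List[Statement]:
    """Return the statements an infobox would normally show."""
    return [s for s in statements if not s.deprecated]


for s in current_values(rhine_length):
    print(f"{s.prop}: {s.value} {s.unit} (source: {s.source})")
```

Keeping the deprecated value in the list, rather than deleting it, is what lets a reader see why a figure found in an older source is not used.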

#WebFonts for the #French #Wikisource

Following the deployment of the WebFonts extension on the English Wikisource, it is now the French who asked for WebFonts. With WebFonts it is possible to show selected texts in a specific font. At this time the selected texts will not be in French because no fonts are available for the Latin script.

Once content is labelled as being in a different language, it is obvious what texts are in French or in another language. This will not only trigger WebFonts for the languages with a default font, it will also help people analysing the content of the French Wikisource.

WebFonts is a tool that delivers fonts where needed. So far we have not really considered supporting languages in the Latin script with WebFonts. When there are fonts that serve a purpose and are freely licensed, adding support will certainly be considered.

One of the issues that we will have to consider is how to support the really big fonts. When they support multiple scripts it does affect the time it takes to send a font to the user.  This is one area where local fonts and web fonts substantially differ.

Historic scripts like Runic and Cuneiform but also modern scripts like the Latin script have evolved a lot over time. To do justice to this and to differences in different places, it will be important to have fonts that reflect these differences.

With the interest of the Wikisourcerers we may learn from them about the fonts available under a free license that do justice to all such considerations.

Friday, March 30, 2012

#Chennai Hackathon II

Several of the topics at the Chennai hackathon had to do with the manipulation of data.

Word lists
One project searched for a list of unique words in the Tamil Wikipedia. They used a dump and retrieved an initial list of 1.3 million unique words. Such a list is quite useful once it has been filtered for words in other languages and other problematic strings. It can be used to build spell checkers and it can even be the basis for research into building a more effective "transliteration input method". Currently the phonetic typing for Tamil is considered inefficient since it takes too many keystrokes for the same text as opposed to the Tamil99 keyboard layout. It takes a big corpus to find the patterns that are more effective.
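The core of such a word list can be sketched in a few lines. This is not the hackathon project's code; it is a minimal sketch that assumes the dump text has already been extracted into a plain string, and that a "word" is any run of characters from the Unicode Tamil block (U+0B80 to U+0BFF). That assumption filters out Latin-script words and ASCII digits in one step, but it also keeps Tamil digits and symbols, which a real tool would want to handle separately.

```python
import re
from collections import Counter

# A run of characters from the Tamil Unicode block. An explicit range is used
# because Python's \w does not match the combining vowel signs of Tamil.
TAMIL_RUN = re.compile(r"[\u0B80-\u0BFF]+")


def tamil_word_counts(text: str) -> Counter:
    """Count every Tamil-script word in the given text."""
    return Counter(TAMIL_RUN.findall(text))


sample = "தமிழ் Wikipedia தமிழ் மொழி 2012"
counts = tamil_word_counts(sample)
unique_words = sorted(counts)

print(len(unique_words), "unique words")
print(counts.most_common(1))
```

Keeping the counts, not just the set of words, costs little and gives the frequency data that research into more efficient input methods would need.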

A project like this may be beneficial for other languages as well. Wikipedia often represents one of the largest corpora of the modern use of a language. Once this project is documented and when there is a set-up to run it repeatedly, it will stimulate the development of more applicable language technology.

Structured database search over Wikipedia
This project is vintage hacking; learning how to use semantic search using DBpedia and ending up with Wikipedia content is great stuff. The hackathon report has this to say:  "Amazing search tool that made it super easy to query information in a natural way". That makes it really interesting to have it hosted somewhere so that people can comment and experience the semantic web first hand.

Parsing Movie data into a database
Wikipedia is rich in information about movies. Articles about movies typically include info boxes and when you extract everything Wikipedia has to offer into a database, it becomes really awesome. It can be the beginning of an "IMDB" for the movies from India for instance.

As this is also the approach taken by DBpedia, the skills learned can be applied to for instance the Tamil Wikipedia. Once a language is supported in DBpedia, it has a prominent entry in the Semantic web.

Random Good WP India article tool
Spreading the word about great articles about India is one way of enthusing people to write some more. Technically it may not be that hard but when you start without any knowledge of the MediaWiki API or JSON, it is quite a feat when a working prototype is produced in a day.

A tool like this can be used on any Wiki and by any project that wants to show off its quality content. You never know how pushing great content will pull in more interest for our projects.
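For readers who wonder what "the MediaWiki API and JSON" amounts to here, the following is a minimal sketch and not the hackathon prototype. It builds a `list=categorymembers` query and picks a random title from the JSON that the API returns. The category name and the canned response are made up for illustration, and no network request is made; fetching the URL is left out on purpose.

```python
import json
import random
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def category_members_url(category: str, limit: int = 50) -> str:
    """Build a MediaWiki API URL that lists the pages in a category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def pick_random_title(response_text: str, rng=random) -> str:
    """Pick one article title from a categorymembers JSON response."""
    members = json.loads(response_text)["query"]["categorymembers"]
    return rng.choice(members)["title"]


# Hypothetical category name and a canned response in the API's shape.
url = category_members_url("Category:Example good articles")
canned = json.dumps({"query": {"categorymembers": [
    {"pageid": 1, "ns": 0, "title": "Example article A"},
    {"pageid": 2, "ns": 0, "title": "Example article B"},
]}})

print(url)
print(pick_random_title(canned))
```

Any wiki that marks its quality content with a category could reuse the same two functions by changing `API` and the category name.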

#Chennai Hackathon

The Chennai hackathon has come and gone. There has been a report on the India mailing list. It is a wonderful report, but many people who should be interested will not read it because they are not subscribed to that list.

The projects people worked on were very diverse:

  • Wikiquotes via SMS - @MadhuVishy and @YesKarthik
  • API to Rotate Images (Mediawiki Core Patch) - Vivek
  • Find list of unique Tamil words in tawiki - Shrinivasan T
  • Program to help record pronunciations for words in tawikt
  • Translation of Gadgets/UserScripts to tawiki - SuryaPrakash [[:ta:பயனர்:Surya_Prakash.S.A.]]
  • Structured database search over Wikipedia - Ashwanth
  • Photo upload to commons by Email - Ganesh
  • Lightweight offline Wiki reader - Feroze
  • Patches to AssessmentBar - gsathya
  • Parsing Movie data into a database - Arunmozhi (Tecoholic) and Lavanya
  • Random Good WP India article tool - Shakti and Sharath
  • Fix bugs on tawiki ShortURL gadget - Bharath
  • Add 'My Uploads' to top bar along with My Contributions, etc - SrikanthLogic
  • WikiPronouncer for Android -  Russel Nickson
  • Wiktionary cross lingual statistics - PranavRC
Most of these projects deserve the attention of a larger public. They represent missing functionality or they have a different approach to something we are struggling with. They are all by people who have a keen interest in the projects of the Wikimedia Foundation and as such they represent our "latest generation".

There is too much great stuff in this list to single out in this blog post. Then again, there is always the next blog post.

Monday, March 26, 2012

Translate reports of the #Hungarian #Wikimedia chapter

Like so many chapters, the Hungarian chapter writes monthly reports. Their main audience are the members of the chapter and for them Hungarian is a must. Many of the international partners and relations need the same report in English.

The Hungarian monthly reports are really needed in two languages. Having them translated into other languages would certainly be appreciated; the German WikimediaWoche for instance refers regularly to newly published chapter reports. This is a great argument for having reports translated into German. When people want to translate the Hungarian reports into another language, adding that language is only a click away.

Obviously many more chapters post their monthly reports; you can also find them on Meta. Having the reports both in English and in the language of the land seems obvious. When more chapters translate their reports, it becomes relevant to translate the pages on Meta that point to these reports as well. This in turn will promote the translation of the reports of other chapters even more.

Wednesday, March 21, 2012

The #Santali language and the Ol Chiki script II

A request for a #Wikipedia has been made for the Santali language. The request was made for use with the Latin script but the Santali language is also written in the Ol Chiki script. It stands to reason that when a language is written in multiple scripts, it is relevant for the people who read that language to have fonts available for all the scripts in use. It is unlikely that an Ol Chiki font is part of any standard operating system.

When you are interested in an Ol Chiki font, one resource that pops up when you search for it is Wesanthals. Several fonts can be found at this website. Sadly they are not available under a standard free license. The good news is that we are reaching out to developers of Ol Chiki fonts.

When we do have fonts available under a free license, we can provide support for the Ol Chiki font in WebFonts. A text in Ol Chiki will be tagged like this:
<div lang="sat-olck"> </div> or <span lang="sat-olck"> </span>

Tuesday, March 20, 2012

#WebFonts for the #Tibetan #Wikipedia

Some time ago we enabled the Jomolhari font for the Dzongkha language. It is not only Dzongkha but also the Tibetan language, Denzongkha and the Ladakhi language that use the Tibetan script.

Being able to support the Tibetan language is particularly relevant as there is a Wikipedia in this language. For this reason the Jomolhari font was enabled for Tibetan as well and we are now looking for confirmation from the Tibetan community that they are happy with the result.
As you can see on this user page, the standard text shows readable Tibetan thanks to the WebFonts extension. The text to the right in the Babel user information however shows the Unicode blocks. This indicates that WebFonts does not get triggered.

As you can see in the edit screen, the text is explicitly marked as Tibetan. When WebFonts is enabled on a Wiki and when Tibetan is configured  with one or more fonts, Tibetan will show properly. The HTML in the Babel user information however does not include a language attribute. It should.

Monday, March 19, 2012

#WebFonts on #Wikisource

As the #MediaWiki Webfonts extension has been implemented on the English language Wikisource, it provides all kinds of opportunities. What it does require is that you understand how things work.

There are a few things that are important:
  • All web fonts are available where WebFonts is enabled
  • When a text is properly tagged and a font exists the web font will be triggered
  • People easily recognise different languages and different scripts. It is computers and search engines that really benefit from proper tagging. By inference our public does.
  • As more freely licensed fonts become available for more scripts, including historic languages and scripts, the user experience of projects like Wikisource and Wiktionary will greatly improve
The Amiri font, a wonderful font for the Arabic script, recently became available in WebFonts. As it is so much prettier than the default font that exists on so many computers, I added the necessary tags to trigger Amiri as an example on one article in en.wikisource. What made my day is that the Wikisourcerers improved on what I did and have the English and Arabic text of the national anthem of Saudi Arabia side by side.


Sunday, March 18, 2012

The #Santali language and the Ol Chiki script

At first the Santali language was hardly written and when it was, it was written in the Bengali alphabet, the Oriya alphabet, or the Latin alphabet. As Santali is not an Indo-Aryan language (like most other languages in the north of India), Indic scripts do not have letters for all of Santali's phonemes, especially its stop consonants and vowels, which made writing the language accurately in an unmodified Indic script difficult.

The Ol Cemet', Ol Chiki or simply Ol script was created in 1925 by Raghunath Murmu specifically for the Santali language; in this script one letter is assigned to each phoneme.

A Wikipedia has been requested for the Santali language and it will be localised in the script that is popular with most people. Apparently this is the Latin script. This does not mean that the Ol Chiki script is not relevant for use within a Santali Wikipedia.

There is a font that is available for free. Sadly it is not available under a free license. Finding the copyright holder for these fonts allows us to request a change of the license. Once the license is changed, we can make the font available as a web font.

Friday, March 16, 2012

#mlwlux - #Wikipedia, how many people do live in #Amsterdam

At the Multilingual Web conference in Luxembourg, it was mentioned that Amsterdam has a different number of people living there depending on what Wikipedia you are reading.

It needs a solution, it was said. Yes it does, and letting us know is part of a solution. It certainly allows the number for Amsterdam to become the same. A real solution would be one where such data is maintained centrally and served wherever it is needed.

The Wikidata project is starting and one of its objectives is to serve as a central data repository. The people at the conference will appreciate that it is not simple; 783.364 is the number quoted on the Dutch Wikipedia while an American would say that a number like 783,364 is quite a lot more.
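The separator problem is easy to show in code. The sketch below is an illustration only: the `GROUP_SEPARATORS` table and the two helper functions are made up for this example and cover just the two conventions mentioned above, whereas a real central repository would store the bare number and leave formatting to each wiki's locale data.

```python
# Digit-group separators for the two conventions in the example (assumed table).
GROUP_SEPARATORS = {"nl": ".", "en-US": ","}


def format_population(n: int, lang: str) -> str:
    """Format an integer with the digit grouping used for the given language."""
    return f"{n:,}".replace(",", GROUP_SEPARATORS[lang])


def parse_population(text: str, lang: str) -> int:
    """Read a grouped integer back, given the language it was written in."""
    return int(text.replace(GROUP_SEPARATORS[lang], ""))


amsterdam = 783364  # stored once, without any formatting

print(format_population(amsterdam, "nl"))     # Dutch grouping
print(format_population(amsterdam, "en-US"))  # American grouping

# Reading the Dutch string with the wrong convention gives a decimal fraction.
print(float("783.364"))
```

Storing the plain integer and formatting it on output is what makes the same value usable on every Wikipedia.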

Anyway; we do appreciate that the numbers game and the data game are very much something we will have fun with in the near future.

Thursday, March 15, 2012

#DBpedia and #Wikidata

At the #mlwlux conference, the new Wikidata project was discussed. The question raised was: what will that do to DBpedia? To understand this, DBpedia is very big in the semantic web world and once Wikidata is going to maintain the data locally, it will affect DBpedia.

Given that the world changes around us anyway, the best thing that could happen for Wikipedia, DBpedia and Wikidata is that we start actually using and considering the data aspects of the information that resides in Wikipedia.

Much of the data in DBpedia is retrieved from info boxes. What DBpedia is really good at is finding inconsistencies in those info boxes. When info boxes transition to Wikidata based info boxes, all these inconsistencies will have to be addressed. When we start now, it will improve consistency and quality in Wikipedia.

Such activity will give the DBpedia and the Wikidata communities a shared goal. Many of the people involved are members of all three communities. Really, once the data is used practically, not only on the semantic web but also on the wiki, everyone wins.

Wednesday, March 14, 2012

#multilingweb - #Arabic and #Hebrew are #RTL

Amir asked me to put a question to people like Richard Ishida at the Multilingual Web conference in Luxembourg:
"How hard would it be to allow assigning element directionality according to lang.  In HTML4 and in the current draft of HTML5, <span lang="ar"> has dir="ltr", unless specified otherwise, and I find it ridiculous". 
The usual replies Amir gets are:
  • Backwards compatibility
  • Many websites already use HTML-5 even though it is not a finished product. This will break them
His reply to this:
  • If a document explicitly specifies that it's HTML5, it should have directionality assignment by default.
  • Add an attribute to the root HTML tag, something like: <html dir="bylang"> or <html dirbylang="true">
Amir has aired his view before and an answer he gets from some "standards people" is: "It would be very problematic to do it, because most web developers don't use the lang attribute". This is rather funny because this will work only when the lang attribute is used.  And anyway at some point in time the public at large do not care really about previous versions of HTML as all the websites still in use will have moved on.
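What Amir asks for can be sketched as a small lookup. This is an illustration of the proposal and not of any browser's behaviour: the `RTL_LANGUAGES` set is a short, incomplete sample, the function name is made up, and script subtags in the language tag are ignored to keep the sketch short.

```python
from typing import Optional

# A few languages that are normally written right-to-left (sample, not complete).
RTL_LANGUAGES = {"ar", "he", "fa", "ur"}


def direction_for(lang_tag: str, explicit_dir: Optional[str] = None) -> str:
    """Derive the text direction of an element from its lang attribute.

    An explicit dir attribute always wins, so existing pages keep working.
    """
    if explicit_dir:
        return explicit_dir
    primary = lang_tag.split("-")[0].lower()
    return "rtl" if primary in RTL_LANGUAGES else "ltr"


print(direction_for("ar"))         # direction derived from the language
print(direction_for("en"))         # left-to-right stays the default
print(direction_for("ar", "ltr"))  # an explicit dir overrides the language
```

Because an element without a lang attribute never reaches this lookup, the rule only changes pages that already tag their languages, which is the point made above.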

What we need is proper meta data for all languages and such data can hide very nicely inside a browser. Developers of websites do not know, and do not want to know, about all the linguistic niceties necessary to support a multilingual web. In so many ways it makes sense to provide language support from inside the browser.

At the Wikimedia Foundation we don't just support a lot of languages, we are also well aware of the languages we support. Ours is a world-wide community of people who have the opportunity to openly complain to us about bad support for their language and they expect that their complaint is actually read and is being taken care of.

Tuesday, March 13, 2012

#multilingweb - #i18N and #l10n testing framework

A subject for the conference about the Multilingual Web is what gaps exist in supporting a multilingual web. Obviously people who are living the multilingual web, like my colleagues in the WMF localisation team, suffer from these gaps. Our team has been asked what to do next and there is one idea of Amir's that I would love to put forward.
A testing framework for localization
I searched and I couldn't find any testing framework that is focused on localization. Many localization-specific issues must be tested, for example, grammatical correctness of generated messages, text readability in different scripts, support for encodings and fonts, etc. You can find a fuller list in Wikipedia (i wrote most of that section myself). It is possible to test all these things using the current frameworks, but much of it would be manual.
For example, i'm not familiar with any tool that would automatically or semi-automatically create screenshots of all the possible translated strings with their complete context. This would be useful for the translator, to see how to translate a message; for the developer, to see whether any message runs out of the screen or hides a button that must remain visible; and for the tester who speaks the language and wants to see whether all the generated messages are grammatically correct. Currently, a developer must do this manually; it is time-consuming, inefficient, hard to plan and to maintain, and the fact is that the developers are hardly ever doing this.
Such a testing framework would be really great for so many organisations. At we support many organisations with their internationalisation and localisation. A framework will make it easy to repeat the testing often. This will improve consistency and quality and makes for a great multi lingual experience.

#Wikimedia #Mobile Frontend is opening up

The mobile frontend of #Wikipedia achieved 9.5% of the total traffic. It is growing rapidly. Having a strategy to support your wiki on mobile platforms is really important. There was a bit of good news on the Wikitech mailing list:

[Wikitech-l] First steps at making MobileFrontend usable beyond the WMF infrastructure
Over the last couple of weeks, I've taken a few steps to remove some of the WMF-specific bits of the MobileFrontend code base:
Also, now if you view an article with "useformat=mobile" in the URL's querystring, MobileFrontend will keep the mobile view enabled as you browse page to page until you explicitly exit the mobile view.
While MobileFrontend has now been generalized enough to be used beyond the WMF cluster, there are still quite a few things that could be done to improve the ease of out-of-the-box usage. For instance, adding configurable WURFL support (which I believe should be fairly straightforward - this is out of date but the same idea should still work: making it possible to use path-based modifiers to signify mobile view (eg -> -
It would be great if anyone can help test/provide feedback for the changes that have been made, and especially if anyone wants to help add the features mentioned above!
There are many organisations large and small who use MediaWiki. Many of them stand to gain a lot of traffic with out of the box support for mobile devices. The great thing of this effort is that the MobileFrontend will be supported in the next releases; it is what the WMF uses itself.

A #Wikimedia #storyteller is the new #Communications manager

There are many wonderful stories waiting to be told relating to any and all of the Wikimedia projects. When Matthew Roth, one of the three storytellers, transitioned into the role of Global Communications Manager, I asked him ten questions and I am happy to have received his answers.

You were a storyteller for the last Fundraiser, how many stories were written and how many stories were used?
At the Wikimedia Foundation, the three Storytellers -- Victor Grigas, Aaron Muszalski, and I -- did approximately 220 interviews over 5 months. The Storytellers worked with the fundraising directors to narrow those interviews down to about 40 that we thought might make for good fundraising appeals. Of those, the managers of the fundraiser wrote appeals from 14 Wikipedia editors from around the world, in addition to the Jimmy Wales appeal (some Wikimedia chapters ran localized appeals with other people). We were very proud to show a more diverse and demographically disparate group of appeals than in past fundraisers and I think the goal will be to continue this trend moving forward in 2012 and beyond.
What story is behind a title of "global communications manager" role? Is it about communicating WMF or communicating Wikipedia ?
I think my staff page says it best: In the role of Global Communications Manager for the Wikimedia Foundation, I work to support the community of communications volunteers representing Wikimedia chapters around the world. With them, I help craft the global story of Wikipedia and the Wikimedia Foundation, while working to increase the population of editors, raise funds from our readers, and serve our enormous volunteer community. I'm accountable for promoting the Wikimedia Foundation mission and vision through our public communications materials, including press releases, our social media channels, and community listservs. Every day I have the privilege of conversing with the dynamic community of Wikimedians, as well as the world’s top reporters, bloggers and thought leaders.
As there is a global audience, we may have to localise some messages to be really effective. What are your thoughts?
Where possible Foundation messaging should be localized and reflect the audience. In many cases, Wikimedia chapters will be best positioned to communicate localized messages and influence our global message to better align with localized need. My role involves supporting this work at the Chapter level. Similarly for very active individuals who are not in formal chapters but who do very meaningful work to promote the health of the projects, I want to be sure their work is promoted.
There are many social media platforms. They are wildly popular and much Wikimedia related activity can be found on them. At the same time people say "Wikipedia is not a social media". I think this attitude does not help; what do you think?
I think the Wikipedia community will debate the merits of incorporating more social platforms within the editing space and will arrive at a solution that works for the various projects. There is probably a spectrum of what will or won't work across the different projects. In an anecdotal way, photowalks can be a great way to get photographers together to take photos and upload them to Commons and hackathons can work wonderfully for developers and programmers. Such social meet-ups might not be as effective in every situation for writing and editing Wikipedia.
For more standard social media platforms, like Twitter, Facebook or Google+, the Foundation is having an ongoing discussion about how to better utilize our assets. Even though the Facebook page has 1 million likes, for web properties as big and well-known as Wikipedia and sister projects, that's not very many people. Our Twitter account has a relatively tiny following, given the size and reach of the projects. For instance, @tinucherian, a communications member of the Wikimedia India Chapter, currently has 10,000 more followers than Wikipedia (this attests to his remarkable social media acumen, and to our underachievement as a large and worldwide entity).

These facts reflect the historical distance we've kept between Wikipedia and social media. There is no single staff member who is responsible for updating social media platforms, but rather the role is done by a committee of people from different departments.
How does your work relate to other people blogging about and spreading the Wikimedia message?
Again, I think it's safe to say social media has not historically been one of the highest priorities of the Foundation, so there have not been too many formal or informal relationships developed with bloggers or others in social media spaces. I think to the extent we can, the Foundation supports initiatives through the Community, Tech, and Global Development departments to support those improving Wikipedia projects and those committed to the Wikimedia mission and goals, but it is an open question how far we will go to embrace social media.
We do not communicate really about Commons, Wikibooks, Wiktionary, Wikisource ...
I don't have enough background to comment on this assertion. I personally use and rely on Commons often and greatly enjoy contributing photos there (though with work, I haven't had as much time as I would like to take or process photos). I hope to do a great deal to promote these projects where and when I can.
Issues like the extension of copyright are not so much a Wikipedia but very much a Commons, Wikisource and Wikibooks issue. Should our movement take up this issue, as it does affect our mission so much?
I'm not as well versed as some in the foundation about this, but our legal department often works in this space and promotes open licenses and free access. Our stance on Golan v. Holder -- that public domain works should not be taken from the public domain and returned to copyright -- is only the most recent example. If you want more detail on this, however, I'd have to defer to our legal team, as they know much more than I do. 
When "global" is in your title, reaching out to people who do not speak English is part of the job. What do you think about translation?
While I haven't been fully apprised of the extent of the translation work you and the localization team are doing for the Foundation, I think some of the most exciting work in the projects happens with translation. The MediaWiki Translate extension and provide some of the more important tools for enabling successful translation of open source materials throughout the world. Their application in numerous platforms speaks to their importance. I certainly hope to learn more about the work as part of my job.
From the exposure I had to the translation infrastructure for the Wikimedia Fundraiser, I was amazed by the organization and cooperation among volunteers around the world who quickly and efficiently translated fundraising appeals and guaranteed the 2011 fundraiser was a remarkable success. 
I speak Spanish and Portuguese, so from a personal level, I'm very interested in language and translation. I often say that I learned the most about my native language, English, when I learned to speak and write in Spanish. For instance, I had no idea there was a subjunctive tense and that I constantly used it until I had to render my thoughts in another language where its usage is frequent and vital. I'm also very interested in linguistics and language formation in bilingual (or trilingual and beyond) children. My understanding of it is limited, but I've read some material about language formation and I always have it in the back of my head that I would love to take more classes in linguistics.

I am also interested in the politics of translation as it relates to what gets translated and published across different languages and in different countries, and what that means for cultural and political relationships. One of the more interesting classes I took on the issue was in 2003 at the Naropa Institute in Boulder, Colorado. The professor was a Mizrahi Jewish translator and publisher who talked a lot about the politics and culture behind what gets published and what doesn't in Hebrew. Completely opened my eyes to a power dynamic I didn't understand before.

With the rise of free knowledge and the ever eroding limits to publication, I think the dynamics of power have already shifted dramatically and will continue to shift. The ability to freely disseminate information and knowledge in one's own language, and the ability to have that translated into other languages, can have significant socio-political implications. I'm probably not smart enough to really even understand how significant that is :) Feel free to point me to good reading on the issue, if you like.
Two of our regular bloggers followed the excellent workshop Siebrand gave on Translate, our translation software. Do you know it? Have you used it?
I have not used it.
You moved from telling stories to get us funding. Now all our stories and all our objectives are your domain. Do you have room to decide on what is hot and what is not?
I don't know about "hot" stories, but I am very interested in expanding the stories we tell about the Wikimedia movement and include as many engaging, thoughtful voices as possible. We are actively working to expand the public's understanding of the marvelous work that happens in the projects and the fascinating people behind them.
Matthew Roth
Global Communications Manager
Wikimedia Foundation

Monday, March 12, 2012

#Standards - A gap in plural support

When you can speak for only fifteen minutes, there is so much you do not have the time for. Issues that are relevant, things the conference is meant to address.

When I wrote about plural, a subject that is actively discussed, Niklas told me about his frustration that his inventory of plural rules in various databases had not resulted in anything at all.

As the Multilingual Web conference explicitly asks to identify where standards and best practices cover our needs, this is certainly one that is relevant to us.

As the hashtag of the conference will surely find its way to Twitter, this can be seen as an experiment: will people who go to the conference see this, and will there be some follow-up at the conference?

PS Niklas will be at the conference as well

Supporting #plural in #gettext and #MediaWiki

Gettext, the #i18n module of the #GNU software, is and has been really important for the internationalisation and localisation of open and free software. To a large extent it is what is used by many localisation platforms.

What gettext provides is technology. What it does not provide is the specific rules needed to implement the internationalisation for a specific language. When we bootstrapped plural support, we copied the rules from other applications to start off with.
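
To make "the specific rules" concrete, here is a minimal sketch in Python of what a plural rule is: a function from a number to the index of the message form to use. The English and Russian rules follow the Plural-Forms expressions commonly published for gettext; the function names and the sample messages are made up for this illustration and are not taken from any particular application.

```python
# A plural rule maps a count n to the index of the message form to use.
# The function names and the sample messages are illustrative assumptions.

def plural_index_english(n: int) -> int:
    # Two forms: index 0 for exactly one, index 1 for everything else.
    return 0 if n == 1 else 1


def plural_index_russian(n: int) -> int:
    # Three forms, chosen by the last one or two digits of n.
    if n % 10 == 1 and n % 100 != 11:
        return 0
    if 2 <= n % 10 <= 4 and not (12 <= n % 100 <= 14):
        return 1
    return 2


ENGLISH_FORMS = ["{n} page", "{n} pages"]


def format_plural(n: int, forms: list, rule) -> str:
    # Pick the form selected by the rule and fill in the number.
    return forms[rule(n)].format(n=n)


if __name__ == "__main__":
    for n in (1, 2, 5, 11, 21, 22):
        print(n, plural_index_english(n), plural_index_russian(n))
```

A localiser for Russian has to supply three forms for every message with plural support, a localiser for English only two. This is why the rule for each language needs to be known and agreed upon.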

The way plural is implemented for the applications we support is well documented. When you read the documentation, it is clear that there is no consistency, and these inconsistencies are documented as well.

One of our contributors, Lloffiwr, is taking an active interest in the subject and is compiling a list that shows the MediaWiki plural rules for the languages enabled for localisation. Such a list informs our localisers what is expected of them when they localise a message with plural support.

Sunday, March 11, 2012

The web is Multilingual - #multilingweb

Presenting at the conference about the Multilingual Web will be fun. When you read what the conference is about, what they ask presenters to include in their presentations is interesting:
  • existing best practices and/or standards that are relevant
  • new standards and best practices that are currently in development
  • gaps that are not covered by best practices and/or standards
In so many ways, what we do is implement the best practices as we know them. We are establishing best practices and are running into the gaps of the standards regularly, because no other project supports the 412 languages that have an existing Wikipedia or are requesting a Wikipedia.

It is great for us to have two people at this conference; we will learn a lot from the other presenters and from the people who attend. We expect that many best practices are set in a professional environment. Our environment consists of dedicated volunteers. The monthly update to our community of localisers has 4600 recipients. Our puzzle will be to adapt what we learn for our setting.

We want to learn about translation workflow, and we want to discuss what to do about languages that are not yet supported in the CLDR. Most of all we want to learn what we do not know: our blind spots.

Friday, March 09, 2012

Fixing #terminology in #translatewiki is easy

The most important objective of translatewiki.net is to make things easy for our localising community. Another really important objective is to make sure that we meet all the needs of our localisers.

When we learn that a particular community localises on their local wiki, we really want to understand why. When we do, we will look for a solution that makes the localisers more efficient and preferably leads them to reconsider using translatewiki.

At this time localisation is done locally on the Sanskrit Wikipedia. The argument for this is that a lot of work goes into deciding on the best terminology to use. Once the community has decided on what word or phrase to use, the changes that reflect the decision are made locally one at a time.

Translatewiki offers an alternative that is much more efficient. It is possible to export the messages. The result is a text file in the "gettext" format and all it takes to replace the changed terminology is to use the tools available in any decent text editor. Once this is done, the updated file is uploaded to complete this task.
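
For those who prefer a script over a text editor, the replacement step could look something like the sketch below. It is only an illustration under assumptions: the function name `replace_term` and the sample messages are invented here, and the sketch handles only the basic layout of a gettext file (`msgid` and `msgstr` lines with quoted continuation lines). It changes the translations and leaves the source messages alone.

```python
def replace_term(po_text: str, old: str, new: str) -> str:
    """Replace a term in the translations (msgstr) of a gettext file only."""
    result = []
    in_translation = False
    for line in po_text.splitlines():
        if line.startswith("msgstr"):
            # A translation starts here.
            in_translation = True
        elif not line.startswith('"'):
            # Anything that is not a quoted continuation line ends it.
            in_translation = False
        if in_translation:
            # Only touch the quoted part, never the msgstr keyword itself.
            head, quote, rest = line.partition('"')
            line = head + quote + rest.replace(old, new)
        result.append(line)
    return "\n".join(result)


if __name__ == "__main__":
    sample = 'msgid "Edit this leaf"\nmsgstr "Edit this leaf"'
    print(replace_term(sample, "leaf", "page"))
```

The result is still a gettext file and can be uploaded in the same way as a file that was edited by hand.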

Every language on translatewiki.net has its portal page, and the Sanskrit portal should include references to the terminology that is used. This will prevent a person new to localising Sanskrit from using the terminology that he or she likes best.

Obviously it is up to the Sanskrit community to do as they please. Localising in an environment dedicated to localisation makes the people involved in localisation more efficient. The benefits of their work will be widely shared, and the developers at translatewiki.net are eager to learn why their environment is not good enough. They will fix things where they are broken.

It is all #Malayalam to me

When you give me a text to read in a script like the Malayalam script, it is all the same to me. Like in the example below, I can recognise different elements; I recognise the user interface, I notice an introductory text followed by the text that is introduced. It is likely this way because it is after all text in the Malayalam Wikisource.

For automated systems like the ones employed by Google, recognising a script is easy; all characters of a script have their own number and it is just a matter of looking them up. This is what automated systems do really well.
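
As a rough illustration of "looking them up", the sketch below guesses the script of a text from the Unicode code points of its characters. The block ranges for Devanagari, Malayalam and Mongolian are the ones in the Unicode standard; the function name and the small table are assumptions made for this example, and nothing here claims to show how Google does it.

```python
from collections import Counter

# Unicode blocks for a few scripts (first and last code point).
SCRIPT_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Malayalam": (0x0D00, 0x0D7F),
    "Mongolian": (0x1800, 0x18AF),
}


def detect_script(text: str):
    """Return the most frequent known script in the text, or None."""
    counts = Counter()
    for ch in text:
        code_point = ord(ch)
        for script, (first, last) in SCRIPT_BLOCKS.items():
            if first <= code_point <= last:
                counts[script] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    print(detect_script("സംസ്കൃതം"))
```

The lookup tells you the script and nothing more; it cannot tell which language is written in that script.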

They are, like me, not able to recognise that the bottom text in the example below is written in Sanskrit.

Automated systems can be helped by identifying texts using labels embedded in the text. Something like this: <div lang="sa">സംസ്കൃതം</div>. This helps Google, but it can also confuse its systems: the default script for Sanskrit is Devanagari, and that is what is implied when you only use the language code.

When the embedded label is like this <div lang="sa-Mlym">സംസ്കൃതം</div>, it will be clear that it is Sanskrit written in the Malayalam script. It will not only be clear to Google, it will also be clear to the WebFonts extension because it can be configured to use fonts for the Malayalam script in a Sanskrit text.
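
How software might read such a label can be sketched as follows. This is a simplified illustration, not a full parser for language tags: it only looks at a four-letter script code directly after the language code. The function name is invented here, and the table with Devanagari ("Deva") as the default for Sanskrit is an assumption that follows the text above.

```python
# Assumption following the text above: Devanagari is implied for Sanskrit
# when the label carries no script code.
DEFAULT_SCRIPT = {"sa": "Deva"}


def language_and_script(tag: str):
    """Split a label like 'sa-Mlym' into a language code and a script code."""
    parts = tag.split("-")
    language = parts[0].lower()
    script = None
    # A script code is four letters, for example 'Mlym' or 'Deva'.
    if len(parts) > 1 and len(parts[1]) == 4 and parts[1].isalpha():
        script = parts[1].title()
    if script is None:
        script = DEFAULT_SCRIPT.get(language)
    return language, script


if __name__ == "__main__":
    print(language_and_script("sa"))
    print(language_and_script("sa-Mlym"))
```

With the script made explicit, software no longer has to fall back on a default and can pick a font for the right script.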

The #Mongolian #script

Have a look at the article "Mongolian script". Now do the same thing with another browser or use Chrome as your browser.

The blocks represent the Unicode characters and I do not have a Mongolian font on my system yet. What is really cool is that Chrome shows Mongolian correctly from top to bottom. So far the lack of support for top down text in browsers is what prevents us from figuring out what we need to do in MediaWiki.

It will be cool when top down scripts are supported in most modern browsers. It enables us to support the people who read the top down scripts as well and consequently it brings us closer to what we aim to achieve.

Transliterate when it grows your audience

The Chinese and the Serbian Wikipedia have one thing in common; their content is shown in one of the two scripts you can select from. In essence the process is simple; you change a text from one script to another using a fixed set of rules and the only difference is the script of the language. You do not change the orthography; it should be just other characters saying the same thing.

Understanding this is quite important because transliteration is not to accommodate differences in dialects. Far from it. When dialects are expressed in the same script, the differences become easier to understand when there is no longer any confusion because of the different scripts.

The "InScript" input methods exist for the scripts of India. What makes them special is that the same sounds are placed at the same location. This makes typing in different scripts easy. This gives the impression that it should be relatively easy to transliterate between scripts.

Changing scripts for a text is of relevance for Sanskrit. Sanskrit is written in many scripts and when a text originated in a script different from Devanagari, many readers of such an original text are helped with a transliteration into a script they are familiar with. At the same time it helps people appreciate how broad a cultural base the Sanskrit language has.

When transliteration works for Sanskrit, it is likely that the same or similar routines will work well for languages like Konkani. At Silpa there is a tool where you can test transliteration. It is a work in progress; each script has many features that need attention. Custom logic needs to be written, sometimes for script pairs, sometimes for specific language attributes.

What would be cool is when someone works on this existing code for the transliteration of Indian scripts. It needs more work both on script-specific rules and on language-specific rules.
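
To show why transliteration between the scripts of India looks easy at first, here is a naive sketch. It relies on the fact that the Unicode blocks for these scripts are laid out in parallel, so a character can often be moved to another script by shifting its code point. This is not the Silpa code; the function name is an assumption for this example, and the sketch ignores exactly the script-specific and language-specific rules mentioned above, including positions that are not assigned in the target script.

```python
DEVANAGARI_START = 0x0900
MALAYALAM_START = 0x0D00
BLOCK_SIZE = 0x80  # each of these Unicode blocks holds 128 positions


def shift_script(text: str, source_start: int, target_start: int) -> str:
    """Naively move characters from one script block to the same position
    in another block. Characters outside the source block are kept."""
    result = []
    for ch in text:
        offset = ord(ch) - source_start
        if 0 <= offset < BLOCK_SIZE:
            result.append(chr(target_start + offset))
        else:
            result.append(ch)
    return "".join(result)


if __name__ == "__main__":
    devanagari = "संस्कृत"
    print(shift_script(devanagari, DEVANAGARI_START, MALAYALAM_START))
```

Shifting back with the start values swapped returns the original text. Anything beyond this simple case is where the custom logic comes in.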

Wednesday, March 07, 2012

#Twitter does #RTL

When Twitter started to support right to left languages in earnest, they did need localisation for their software. Given the relevance of Twitter a great localisation is important; many people will actually see it, use it.

With the implementation of RTL languages like Arabic, Farsi and Hebrew, and the simultaneous localisation, there are many people who are exposed to the same issues and flaws in the software. What is so wonderful is that all the people localising software find each other and work together. It is as a Saudi Wikipedian wrote on the Arabic village pump: "we all have the same problems".

Amir, one of my colleagues in the WMF Localisation team, does a fair amount of localisation on other projects in his own time. The one thing we learn from him is about cool features used in other localisation platforms.

What really pleases us is how often we learn how valid our community model is. It does make a difference when people are not working on their own little island. Many applications are localising an upcoming release and this makes all these people testers as well. People learning together about issues and working together on solutions, particularly with developers involved as well, make for a perfect and agile software development environment.

Meet ProofreadPage a #MediaWiki extension for #Wikisource

The ProofreadPage extension is an essential tool for Wikisource. Essential because it is installed on every Wikisource. As you can see from the screen-print, it is a tool that helps with the workflow of transcribing a text to the computer. The example is an old French text and in yellow it says that the transcribed text conforms to the scan.

It is an essential tool with one potential problem: its primary maintainer ThomasV has abandoned it. With the upgrade to MediaWiki release 1.19 several bugs surfaced and they were documented on the Scriptorium of the English language Wikisource. This was done by one of the very active Wikisourcerers, user:Billinghurst.

This problem was signalled by Lars to several WMF notables. Indeed there are plenty of bugs registered in Bugzilla for ProofreadPage, including patches. It is however really cool to notice that Sumana is finding reviewers for them, and it is wonderful to learn that ThomasV was only one of more than seven developers committing code in the last couple of months.

Tender loving care is needed to make sure that it will continue to work well after the MediaWiki release 1.19 and it is wonderful to learn that Zaran indicated his willingness to take over as its primary maintainer. One of his first tasks is to identify the existing relevant patches and integrate them.

Monday, March 05, 2012

#Sanskrit, sources and scripts

According to the English #Wikipedia, the Sanskrit language is written in Devanāgarī, various Brāhmī-based alphabets, and the Latin script. Practically we assume that Sanskrit is written in the Devanagari script and for its Wikipedia that is fine.

When Sanskrit sources exist in many scripts or alphabets, and particularly when the Brahmi based alphabets are not in common use, it is even more interesting for the Sanskrit community to find freely licensed fonts for these alphabets.
Asokan Edict - Delhi Inscription

The Omniglot website provides a list of the different scripts used for Sanskrit:
Writing systems used to write Sanskrit
Brāhmi, Devanāgari, Grantha, Kharoṣṭhi, Śāradā, Siddham, Thai, Tibetan 
Syllabic alphabets / abugidas 
Ahom, Balinese, Batak, Bengali, Brahmi, Buhid, Burmese, Chakma, Cham, Dehong Dai, Devanagari, Dhives Akuru, Ethiopic, Evēla Akuru, Gondi, Grantha, Gujarati, Gupta, Gurmukhi (Punjabi), Hanuno'o, Hmong, Javanese, Kannada, Kharosthi, Khmer, Lanna, Lao, Lepcha, Limbu, Lontara/Makasar, Malayalam, Manpuri, Modi, New Tai Lue, Oriya, Pallava, Phags-pa, Ranjana, Redjang, Shan, Sharda, Siddham, Sindhi, Sinhala, Sorang Sompeng, Sourashtra, Soyombo, Sundanese, Syloti Nagri, Tagalog, Tagbanwa, Takri, Tamil, Telugu, Thai, Tibetan, Tocharian, Varang Kshiti
It may be that it is not only a Brahmi font that will add more value to the Sanskrit Wikisource for original texts in Sanskrit. An original text is written in its original script and such a script may have different styles. It is great that with the WebFonts extension we can truly provide an authentic experience. It is just a matter of having the freely licensed fonts and an expressed demand.