Words and what not: Measuring linguistic diversity on the Internet

Monday, March 08, 2010

Measuring linguistic diversity on the Internet

Analysing linguistic diversity on the Internet is both like and unlike measuring Wikipedia traffic. It is like Wikipedia in that the "big" languages have an overwhelming presence. It is unlike Wikipedia in that the smaller languages are hardly noticeable, because documents in them are not clearly marked as such.

A language like Picard is recognised as a language; however, software in common use, be it proprietary or open source, is not configured to identify a document as being in that language. The language recognition software of a company like Google is not yet able to recognise it by its characteristics. And as a bug keeps the Picard Wikipedia out of the Wikipedia traffic statistics, it can be argued that Picard does not exist on the Internet at all.

Figure: Evolution of percentages of English-speaking Internet users and web pages

When research on linguistic diversity on the Internet is done for an organisation like UNESCO, the question is what such research is meant to achieve. UNESCO aims to preserve and promote linguistic diversity, and the technical ability of languages to manifest themselves on the Internet is a key enabler.

The UNESCO research documents the issues of measuring linguistic diversity on the Internet from a traffic perspective for a few languages, but it does not look into what enables such traffic. It does not explain why it is so hard to extend the research to the long tail of Internet traffic.

Part of the metadata of a document, on the Internet or elsewhere, is an indication of what language the document is in. Typically software only knows about a subset of the recognised languages. So one valid metric is which languages software allows you to write in. For OpenOffice, for instance, it is essential that the locale data is known in the CLDR. The CLDR data is public, and statistics can be created from its development. As you can imagine, there is no data for the Picard language ...
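To make the metric a bit more concrete, here is a minimal sketch in Python of how such a statistic could be computed. The function names and the sample data are hypothetical and chosen only for illustration; the locale list is not real CLDR or OpenOffice data, and "pcd" is the ISO 639-3 code for Picard.

```python
def base_language(locale_tag: str) -> str:
    """Return the language subtag of a locale tag such as 'fr_FR' or 'pt-BR'."""
    return locale_tag.replace("_", "-").split("-")[0].lower()


def coverage(recognised: set[str], supported_locales: list[str]) -> tuple[float, list[str]]:
    """Return the share of recognised languages that have at least one
    supported locale, plus the sorted list of languages that have none."""
    if not recognised:
        raise ValueError("the set of recognised languages must not be empty")
    supported = {base_language(tag) for tag in supported_locales}
    missing = sorted(recognised - supported)
    share = (len(recognised) - len(missing)) / len(recognised)
    return share, missing


# Illustrative sample only (an assumption, not real data):
# four recognised languages and the locales some piece of software ships with.
recognised_languages = {"en", "fr", "nl", "pcd"}
software_locales = ["en_US", "en_GB", "fr_FR", "nl_NL"]

share, missing = coverage(recognised_languages, software_locales)
print(f"{share:.0%} of the sample languages are supported; missing: {missing}")
```

With real input, the recognised set would come from a registry of language codes and the locale list from whatever the software or the CLDR actually ships; the long tail shows up as the length of the missing list.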

When UNESCO includes such statistics in its linguistic diversity report, it will become clear how much needs to be done in order to make support for linguistic diversity a reality.
Thanks,
GerardM
