Monday, January 14, 2013

Multiple languages, multiple scripts in a #Wikipedia article

Many Wikipedia articles describe something "foreign". For instance, Moscow is known in the Russian language as Москва. It is best practice to mention the native name of a subject and to include its pronunciation expressed in IPA as well.

When an article includes text in multiple languages, it is important to be able to identify which text is in which language. Doing so serves several purposes:
  • provide support for language technology like web fonts and input methods
  • enable data mining for the building of spell checkers
  • create frequency lists for words
The standard for identifying text as being in a specific language in a web page is well defined. The language is identified at the start of the text, and the end of the text in that language is marked as well. In HTML this is done by wrapping the text in an element that carries a lang attribute.
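
As an illustration of what such marking looks like, here is a minimal sketch in Python. The function name tag_language is hypothetical, and using a span element is an assumption; any HTML element can carry the lang attribute.

```python
from html import escape


def tag_language(text: str, lang: str) -> str:
    """Wrap text in a span whose lang attribute names its language.

    tag_language is a hypothetical helper, not part of MediaWiki.
    The opening tag marks where the language starts and the closing
    tag marks where it ends.
    """
    return f'<span lang="{escape(lang, quote=True)}">{escape(text)}</span>'


# The Russian name is marked; the surrounding English text is not.
sentence = f"Moscow is known in Russian as {tag_language('Москва', 'ru')}."
print(sentence)
```

The language code "ru" is the standard code for Russian; the surrounding sentence would normally inherit its language from the page as a whole.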

When a Wikipedia article is parsed, such tags make it possible to enable the relevant technology for that language. This is of particular relevance for the "smaller" languages, because the Wikipedia in such a language is quite often its biggest corpus. Research on such a corpus is a lot easier when all words that are explicitly in another language are marked as such.
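
To show how tagged text helps corpus work, here is a rough sketch that builds a word frequency list per language from HTML. It uses Python's standard html.parser module. The class name LangWordCounter and the default code "und" (undetermined) are my own choices, and the sketch assumes well-formed markup in which every non-void element is explicitly closed.

```python
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not touch the stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class LangWordCounter(HTMLParser):
    """Count words per language, following lang attributes (hypothetical helper)."""

    def __init__(self, default_lang="und"):
        super().__init__()
        self.lang_stack = [default_lang]     # language in effect at each nesting level
        self.counts = defaultdict(Counter)   # language code -> word frequencies

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        # An element without its own lang attribute inherits from its parent.
        lang = dict(attrs).get("lang")
        self.lang_stack.append(lang or self.lang_stack[-1])

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if len(self.lang_stack) > 1:
            self.lang_stack.pop()

    def handle_data(self, data):
        words = re.findall(r"\w+", data.lower())
        self.counts[self.lang_stack[-1]].update(words)


counter = LangWordCounter(default_lang="en")
counter.feed('<p>Moscow is known as <span lang="ru">Москва</span> in Russian.</p>')
counter.close()
print(dict(counter.counts))
```

Because the Russian word is tagged, it ends up in the Russian frequency list and stays out of the English one, which is exactly what a spell checker or a frequency list for the article's own language needs.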

Adding such tags is relatively easy. To a large extent, bots can be used to add them. Making use of these tags in how MediaWiki works is something else. Tagging is, however, the first step that needs to be taken, and there is no reason I know of why this cannot be done at this time.
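
What a bot could do can be sketched roughly as follows. This assumes the wiki offers a template along the lines of English Wikipedia's {{lang|code|text}}; the function name add_lang_templates and the list of known terms are hypothetical, and a real bot would need a much more careful check for text that is already tagged.

```python
import re
from typing import Mapping


def add_lang_templates(wikitext: str, terms: Mapping[str, str]) -> str:
    """Wrap known foreign terms in a {{lang|code|term}} template.

    Hypothetical sketch: `terms` maps a term to its language code.
    Occurrences directly preceded by "|" are skipped, a rough heuristic
    for text that already sits inside a template.
    """
    for term, code in terms.items():
        pattern = re.compile(r"(?<!\|)\b" + re.escape(term) + r"\b")
        wikitext = pattern.sub(
            lambda match, code=code: "{{lang|%s|%s}}" % (code, match.group(0)),
            wikitext,
        )
    return wikitext


tagged = add_lang_templates(
    "Moscow is known in Russian as Москва.", {"Москва": "ru"}
)
print(tagged)
```

Running the function a second time on its own output leaves the text unchanged, which matters for a bot that may visit the same article more than once.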
