When an article includes text in multiple languages, it is important to be able to identify which text is in which language. Marking the language serves several purposes:
- provide support for language technology such as web fonts and input methods
- enable data mining for building spell checkers
- create word frequency lists
When a Wikipedia article is parsed, language tags make it possible to enable the relevant technology for each language. This is of particular relevance for the "smaller" languages, because the Wikipedia in such a language is quite often its biggest corpus. Research on such a corpus is a lot easier when all words that are explicitly in another language are marked as such.
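To make the corpus point concrete, here is a minimal sketch in Python of how per-language word frequency lists could be built once the text is tagged. It assumes the tags end up as HTML `lang` attributes in the rendered article, which is my assumption and not something settled above. The names `LangTextCollector` and `word_frequencies` are made up for illustration, and the tokenisation is deliberately naive.

```python
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not touch the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LangTextCollector(HTMLParser):
    """Count words per language, using lang attributes (an assumed tag format)."""

    def __init__(self, default_lang):
        super().__init__()
        self.lang_stack = [default_lang]      # language currently in effect
        self.counts = defaultdict(Counter)    # language code -> word counts

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # An element inherits the enclosing language unless it declares its own.
        lang = dict(attrs).get("lang") or self.lang_stack[-1]
        self.lang_stack.append(lang)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if len(self.lang_stack) > 1:
            self.lang_stack.pop()

    def handle_data(self, data):
        # Naive tokenisation: lowercase runs of word characters.
        words = re.findall(r"\w+", data.lower())
        self.counts[self.lang_stack[-1]].update(words)


def word_frequencies(html, default_lang):
    """Return a mapping of language code to a Counter of words."""
    collector = LangTextCollector(default_lang)
    collector.feed(html)
    collector.close()
    return collector.counts


# Hypothetical fragment of a Dutch article containing tagged English text.
sample = '<p>De kat zegt <span lang="en">hello hello world</span> en de hond blaft.</p>'
counts = word_frequencies(sample, default_lang="nl")
print(counts["nl"].most_common(3))
print(counts["en"].most_common(3))
```

Without the tag, the English words would end up in the Dutch frequency list. With it, they are counted separately, and the Dutch list stays clean.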
Adding such tags is relatively easy; to a large extent, bots can be used to add them. Making use of these tags in how MediaWiki works is another matter. Tagging is, however, the first step that needs to be taken, and there is no reason I know of why this cannot be done now.
Thanks,
GerardM