Friday, March 09, 2012

It is all #Malayalam to me

When you give me a text to read in a script like the Malayalam script, it is all the same to me. In the example below, I can recognise different elements: I recognise the user interface, and I notice an introductory text followed by the text that it introduces. That is likely how it is laid out because it is, after all, a text in the Malayalam Wikisource.

For automated systems like the ones employed by Google, recognising a script is easy: every character of a script has its own number (its Unicode code point), and it is just a matter of looking the characters up. This is what automated systems do really well.
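To illustrate how simple such a lookup can be, here is a minimal sketch in Python. It is not how Google or any other system actually does it; the function name `detect_script` and the two-entry table are made up for this illustration. The only facts it relies on are the Unicode block ranges for Malayalam (U+0D00 to U+0D7F) and Devanagari (U+0900 to U+097F).

```python
# Hypothetical helper for illustration only.
# Unicode block ranges (inclusive) for the two scripts discussed in this post.
SCRIPT_RANGES = {
    "Malayalam": (0x0D00, 0x0D7F),
    "Devanagari": (0x0900, 0x097F),
}


def detect_script(text):
    """Return the name of the script most characters belong to, or None."""
    counts = {}
    for ch in text:
        code_point = ord(ch)  # the "number" of the character
        for name, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                counts[name] = counts.get(name, 0) + 1
    if not counts:
        return None  # no character from a known script was found
    return max(counts, key=counts.get)


if __name__ == "__main__":
    print(detect_script("സംസ്കൃതം"))
    print(detect_script("संस्कृतम्"))
```

Note what the lookup does and does not tell you: it finds the script, but it says nothing about the language the text is written in.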

Like me, however, they are not able to recognise that the bottom text in the example below is written in Sanskrit.

Automated systems can be helped by identifying texts using labels embedded in the text. Something like this: <div lang="sa">സംസ്കൃതം</div>. It helps Google, but it will also confuse its systems: the default script for Sanskrit is Devanagari, and that script is implied when you use only the language code.

When the embedded label is like this: <div lang="sa-Mlym">സംസ്കൃതം</div>, it will be clear that it is Sanskrit written in the Malayalam script. It will not only be clear to Google, it will also be clear to the WebFonts extension, because that extension can be configured to use fonts for the Malayalam script in a Sanskrit text.
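How a program might read such a label can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the code used by Google or by the WebFonts extension: the function `parse_lang_tag` and the table `DEFAULT_SCRIPT` are hypothetical, the default of Devanagari for Sanskrit is taken from the text above, and the sketch only looks at the subtag directly after the language code, where a four-letter subtag names the script.

```python
# Hypothetical table: the implied script when a tag names only the language.
# The single entry reflects the statement above that Sanskrit defaults to Devanagari.
DEFAULT_SCRIPT = {"sa": "Deva"}


def parse_lang_tag(tag):
    """Split a tag such as 'sa-Mlym' into (language, script).

    Simplified: only the subtag right after the language is inspected,
    so longer tags with extra subtags are not fully handled.
    """
    parts = tag.split("-")
    language = parts[0].lower()
    if len(parts) > 1 and len(parts[1]) == 4 and parts[1].isascii() and parts[1].isalpha():
        # An explicit four-letter script subtag, written in title case.
        script = parts[1].title()
    else:
        # No script given: fall back to the implied default, if one is known.
        script = DEFAULT_SCRIPT.get(language)
    return language, script


if __name__ == "__main__":
    print(parse_lang_tag("sa"))
    print(parse_lang_tag("sa-Mlym"))
```

With only "sa" the program has to assume Devanagari; with "sa-Mlym" the script is stated explicitly and nothing has to be guessed.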