Thursday, July 07, 2011

Digitizing a #Malayalam #dictionary

Digitizing a Malayalam book is quite different from digitizing an English book. For an English book you use OCR and move straight on to the proofreading and formatting of the text.

It becomes even more interesting when the text is rich in all kinds of annotations, as in this dictionary. It is the first Malayalam-English-Malayalam Dictionary by Dr Herman Gundert. There are all kinds of opportunities here, ehm, problems.

There is no OCR for Malayalam yet, so it has to be crowdsourced. The annotations are cryptic and only make sense in a dead tree dictionary. When you digitize such a text, it remains flat text, and consequently it is extremely hard to use the public domain content elsewhere, in OmegaWiki or in Wiktionary for instance.

Santhosh is experimenting with Semantic MediaWiki. It will allow for exportable information, and that will make the data gained much more useful.
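
To give an idea of the difference between flat text and exportable information, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the entry layout ("headword, abbreviation, senses separated by semicolons"), the abbreviation table and the function name `parse_entry` are invented, and none of it is taken from the Gundert dictionary or from Semantic MediaWiki.

```python
import json
import re

# Hypothetical abbreviation table; the annotations in the real
# dictionary are different and far more varied.
ABBREVIATIONS = {"n.": "noun", "v.": "verb", "adj.": "adjective"}


def parse_entry(line):
    """Split a flat 'headword abbr. sense; sense' line into named fields."""
    match = re.match(r"^(\S+)\s+(\S+\.)\s+(.*)$", line)
    if match is None:
        raise ValueError(f"unrecognized entry: {line!r}")
    headword, abbreviation, rest = match.groups()
    return {
        "headword": headword,
        # Expand the cryptic abbreviation when we know it, keep it otherwise.
        "part_of_speech": ABBREVIATIONS.get(abbreviation, abbreviation),
        "senses": [s.strip() for s in rest.split(";") if s.strip()],
    }


# An invented placeholder entry, not a real line from the dictionary.
entry = parse_entry("headword n. first sense; second sense")
print(json.dumps(entry, ensure_ascii=False, indent=2))
```

Once an entry has named fields like these, another project can pick up just the headword or just the senses, which is not possible while everything is one string.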
