Friday, March 30, 2012

#Chennai Hackathon II

Several of the topics at the Chennai hackathon had to do with the manipulation of data.

Word lists
One project set out to build a list of the unique words in the Tamil Wikipedia. The team used a dump and retrieved an initial list of 1.3 million unique words. Such a list is quite useful once words from other languages and other problematic strings have been filtered out. It can be used to build spell checkers, and it can even be the basis for research into building a more effective "transliteration input method". Currently, phonetic typing for Tamil is considered inefficient because it takes more keystrokes for the same text than the Tamil99 keyboard layout does. It takes a big corpus to find the patterns that are more effective.

A project like this may be beneficial for other languages as well. Wikipedia often represents one of the largest corpora of the modern use of a language. Once this project is documented and there is a set-up to run it repeatedly, it will stimulate the development of more applicable language technology.
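To give an idea of what such a repeatable set-up could look like, here is a minimal sketch in Python. It is not the hackathon team's code. It assumes the dump has already been reduced to plain lines of text, and it treats every run of characters from the Tamil Unicode block (U+0B80 to U+0BFF) as a word, which drops words in other scripts as a side effect. The function name and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of characters from the Tamil Unicode block.
# The block also contains Tamil digits and symbols, so a real filter
# would need to be stricter than this.
TAMIL_WORD = re.compile(r"[\u0B80-\u0BFF]+")


def count_tamil_words(lines):
    """Count every run of Tamil characters in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(TAMIL_WORD.findall(line))
    return counts


# Made-up sample lines standing in for text taken from a dump.
sample = [
    "தமிழ் விக்கிப்பீடியா is a free encyclopedia",
    "[[தமிழ்]] மொழி 2012",
]

counts = count_tamil_words(sample)
unique_words = sorted(counts)  # the unique word list
```

Keeping the counts next to the words is a deliberate choice: strings that occur only once are likely candidates for the "problematic strings" that still need filtering.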

Structured database search over Wikipedia
This project is vintage hacking; learning how to do semantic search with DBpedia and ending up with Wikipedia content is great stuff. The hackathon report has this to say: "Amazing search tool that made it super easy to query information in a natural way". That makes it really interesting to have it hosted somewhere so that people can comment and experience the semantic web first hand.
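For those who have never seen a structured query against DBpedia, here is a rough sketch in Python of the layer underneath a tool like this. It is not the hackathon tool itself, and it does nothing with natural language. It assumes the public DBpedia SPARQL endpoint at https://dbpedia.org/sparql and the `dbo:Film` class and `dbo:director` property from the DBpedia ontology. The function names and the example director are only there for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public DBpedia SPARQL endpoint.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def films_by_director_query(director_resource, limit=10):
    """Build a SPARQL query for films directed by a given DBpedia resource."""
    return f"""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {{
  ?film a dbo:Film ;
        dbo:director dbr:{director_resource} .
}} LIMIT {int(limit)}
""".strip()


def build_request_url(query):
    """Encode the query as a GET request that asks for JSON results."""
    params = urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"}
    )
    return f"{DBPEDIA_ENDPOINT}?{params}"


def run_query(query):
    """Send the query and return the film URIs (needs network access)."""
    with urllib.request.urlopen(build_request_url(query), timeout=30) as response:
        data = json.load(response)
    return [row["film"]["value"] for row in data["results"]["bindings"]]


# Building the query and the URL works offline; run_query() does not.
query = films_by_director_query("Mani_Ratnam", limit=5)
url = build_request_url(query)
```

The film URIs that come back mirror Wikipedia article titles, which is how a DBpedia search ends up with Wikipedia content.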

Parsing Movie data into a database
Wikipedia is rich in information about movies. Movie articles typically include infoboxes, and when you extract everything Wikipedia has to offer into one database, it becomes really awesome. It can be the beginning of an "IMDB" for the movies from India, for instance.

As this is also the approach taken by DBpedia, the skills learned can be applied to, for instance, the Tamil Wikipedia. Once a language is supported in DBpedia, it has a prominent entry in the Semantic Web.
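The infobox-to-database step can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not what the team built: it expects one infobox parameter per line, it only unwraps simple wiki links, and it stores the result in an in-memory SQLite table. The sample wikitext is written for this example and is not copied from an article; the table layout is made up as well.

```python
import re
import sqlite3

# [[Target]] becomes "Target"; [[Target|Label]] becomes "Label".
LINK = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]")

COLUMNS = ("name", "director", "released", "language")

# Illustrative wikitext, not taken verbatim from Wikipedia.
SAMPLE = """{{Infobox film
| name = Roja
| director = [[Mani Ratnam]]
| music = [[A. R. Rahman]]
| released = 1992
| language = Tamil
}}"""


def parse_infobox(wikitext):
    """Return {field: value} for infobox parameters written one per line."""
    fields = {}
    for line in wikitext.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue  # skip the template name, the closing braces, and so on
        key, value = line[1:].split("=", 1)
        fields[key.strip()] = LINK.sub(r"\1", value).strip()
    return fields


film = parse_infobox(SAMPLE)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE film (name TEXT, director TEXT, released TEXT, language TEXT)"
)
# Fields that are missing from the infobox are stored as NULL.
conn.execute(
    "INSERT INTO film VALUES (:name, :director, :released, :language)",
    {column: film.get(column) for column in COLUMNS},
)
rows = conn.execute("SELECT name, director FROM film").fetchall()
```

Real infoboxes contain nested templates and multi-line values, so a dedicated wikitext parser would be the safer choice for anything beyond a prototype.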

Random Good WP India article tool
Spreading the word about great articles about India is one way of enthusing people to write some more. Technically it may not be that hard, but when you start without any knowledge of the MediaWiki API or JSON, it is quite a feat to produce a working prototype in a day.

A tool like this can be used on any wiki and by any project that wants to show off its quality content. You never know how pushing great content will pull in more interest for our projects.
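For the curious, the core of such a tool might look like the Python sketch below. It is not the prototype from the hackathon. It uses the MediaWiki API's `list=categorymembers` query and picks one member at random. The category name in the usage note, the User-Agent string, and the canned response with its page ids are assumptions made up for this example.

```python
import json
import random
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def category_members_url(category, limit=50):
    """Build a MediaWiki API URL that lists the pages in a category as JSON."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": limit,
        "format": "json",
    }
    return f"{API_URL}?{urllib.parse.urlencode(params)}"


def pick_random_title(api_response, rng=random):
    """Pick the title of one random page from a categorymembers response."""
    members = api_response["query"]["categorymembers"]
    return rng.choice(members)["title"]


def fetch_random_title(category):
    """Ask the live API for the category and pick a title (needs network)."""
    request = urllib.request.Request(
        category_members_url(category),
        headers={"User-Agent": "random-article-sketch/0.1"},  # made-up name
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return pick_random_title(json.load(response))


# Offline demo: a canned response in the shape the API returns.
sample_response = json.loads(
    '{"query": {"categorymembers": ['
    '{"pageid": 1, "ns": 0, "title": "Chennai"}, '
    '{"pageid": 2, "ns": 0, "title": "Tamil language"}]}}'
)
title = pick_random_title(sample_response, random.Random(0))
```

With network access, something like `fetch_random_title("Category:GA-Class India articles")` would do the real thing, assuming that is the category a project uses to track its good articles. A single request only returns the first batch of members, so a complete tool would also have to follow the API's continuation parameters.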
Thanks,
      GerardM
