Saturday, June 10, 2006

Punjabi and what IS that script

On the WiktionaryZ main page we have a list of languages; each one points to a "portal" page for that language. It is quite clear that a project like WiktionaryZ has to take into account the different scripts that a language may be written in. After a lot of head scratching, I created links to "cmn-Hans" and "cmn-Hant", to indicate that there is Mandarin in both a simplified and a traditional script. One other reason: it looks more organised this way.

I then tried my hand at the Punjabi language. Punjabi is written in two scripts, and I guessed wrong when trying to identify them. One was indeed an Indic script: ਪੰਜਾਬੀ is written in the Gurmukhī script, while پنجابی is written in the Shahmukhi script. Shahmukhi is indeed an Arabic-based script, but it is not the Arabic script itself. In order to identify the scripts properly, I looked them up at Unicode, where there is a nice list of the ISO 15924 script codes. Gurmukhi has Guru as its code and Shahmukhi .. is absent.
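To make the lookup concrete, here is a minimal sketch in Python of how one might guess the script code of a word from its characters. It leans on the Unicode character names that ship with Python's `unicodedata` module; the names `SCRIPT_CODES`, `guess_script_code` and `tag`, and the use of "pan" as the language code for Punjabi, are my own assumptions for illustration and not anything WiktionaryZ does. Note that the Shahmukhi letters carry ARABIC names in Unicode, so this sketch can only ever answer Arab for them.

```python
import unicodedata
from collections import Counter

# Hypothetical table: first word of a Unicode character name -> ISO 15924 code.
# Only the two scripts discussed in this post are listed.
SCRIPT_CODES = {"GURMUKHI": "Guru", "ARABIC": "Arab"}


def guess_script_code(text):
    """Return the most frequent known script code in text, or None."""
    counts = Counter()
    for ch in text:
        name = unicodedata.name(ch, "")  # "" for unnamed characters
        prefix = name.split(" ")[0] if name else ""
        if prefix in SCRIPT_CODES:
            counts[SCRIPT_CODES[prefix]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def tag(language, text):
    """Build a language-script tag such as 'pan-Guru' (language code is caller-supplied)."""
    code = guess_script_code(text)
    return f"{language}-{code}" if code else language


print(tag("pan", "ਪੰਜਾਬੀ"))
print(tag("pan", "پنجابی"))
```

A character-name heuristic like this cannot tell Shahmukhi apart from any other use of the Arabic script, which is exactly the limitation described above.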

It is probably pretty safe to indicate it as Arab, but when my information says that it is not the Arabic script, that is problematic. I could also mark it as an uncoded script. The problem with these standards is that they only work up to a point, and where that point lies depends on what they are meant to do.

When you write Dutch, the standard Latin script is used. However, that leaves out one character, the ij, and consequently all word processors capitalise it wrongly: it should be IJ and not Ij. I think a similar thing is happening with the Shahmukhi script. It is assumed to be Arabic, but the style of the glyphs is different. I think it is just one of those things that may change in the future.
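The Dutch capitalisation problem is easy to show in a few lines of Python. This is only a sketch: `dutch_capitalize` is a hypothetical helper of my own, and it simply assumes that any word starting with "ij" should get both letters capitalised.

```python
def dutch_capitalize(word):
    """Capitalise a word, treating a leading 'ij' as a single letter."""
    if word[:2].lower() == "ij":
        return "IJ" + word[2:]
    return word[:1].upper() + word[1:]


# The generic rule only touches the first character.
print("ijsselmeer".capitalize())
# The Dutch-aware sketch capitalises the whole digraph.
print(dutch_capitalize("ijsselmeer"))
```

The generic `str.capitalize` has no notion of language, which is the same kind of gap as a script code that has no notion of Shahmukhi.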

I think I will indicate it to be Arab for now .. :)


Sunday, June 04, 2006

What to do next

With our GEMET data, we have our first data online. We have already planned the import of the languages that are in ISO 639-3, and we will allow for the translation of these language names into other languages. We will also extend the number of languages in which people can edit. Given that we are in pre-alpha and at this moment still very much a development project, we include all languages that have shown some activity towards localizing the MediaWiki user interface.

The question is what to do next. We cannot and should not rush the development, but we do have a host of data that we could import. It could be other thesauri, glossaries or ontologies. It could be a long list of Expressions in a given language to populate a spell checker. We could import the data resulting from Duesentrieb's Wikiword application; this could give us a link to Wikipedia articles.

If we had more active collaborating developers, we could consider building the import and export routines, or we could start working on inflections.

What would you consider the next bit of data to import after the languages? What would you start programming on given where we are at this moment?