Saturday, May 20, 2006

Languages, dialects and orthographies

When a text is known to be in a certain language, and that language is more or less familiar to the reader, the text can be meaningful. I have had some French classes, and when I am in Italy there is quite a lot that I can understand. An automated process cannot do this, so it helps a great deal when a text carries metadata that indicates which language, dialect or script it is in.

One thing that is worth being aware of is which orthography a text, phrase or word is in. Orthography is a category of its own, and it matters when text is to be understood in an automated way. Languages change over time, and the recognized correct orthography changes to reflect this. German and Dutch have both had their fair share of changes. The functional design of WiktionaryZ has always had a place to indicate that a given spelling is dated. We will export the WiktionaryZ data using standards like TBX, LMF, maybe RDS, SKOS or something different but standard. The problem is: how do we indicate that a given word has to be spelled differently from a given date onward?
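
One way to think about that question is to attach a validity period to every spelling, so that "dated" becomes something a program can compute instead of a plain flag. The sketch below is only an illustration of that idea. The `Spelling` class, its field names and the `spellings_on` helper are made up for this post; they are not part of the WiktionaryZ design or of any of the export standards mentioned above. The cutover date for the German example is an assumption used for illustration as well.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Spelling:
    """A hypothetical record: one spelling plus the period in which it is correct."""
    text: str
    language: str                       # e.g. an ISO 639 code such as "deu"
    valid_from: Optional[date] = None   # None: no known start date
    valid_until: Optional[date] = None  # None: still current (end is exclusive)

    def is_current_on(self, day: date) -> bool:
        # A spelling is current if the day falls inside [valid_from, valid_until).
        starts_ok = self.valid_from is None or self.valid_from <= day
        ends_ok = self.valid_until is None or day < self.valid_until
        return starts_ok and ends_ok


def spellings_on(spellings: List[Spelling], day: date) -> List[str]:
    """Return the spellings that count as correct on the given day."""
    return [s.text for s in spellings if s.is_current_on(day)]


# Illustrative cutover date for the German spelling reform (an assumption).
REFORM = date(1998, 8, 1)

dass = [
    Spelling("daß", "deu", valid_until=REFORM),
    Spelling("dass", "deu", valid_from=REFORM),
]

print(spellings_on(dass, date(1990, 1, 1)))
print(spellings_on(dass, date(2006, 5, 20)))
```

With dates on both ends, the old spelling does not have to be marked as dated by hand: it simply stops being returned for any day on or after the cutover. Whether the export formats can carry such a period is a separate question that this sketch does not answer.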

