Tuesday, June 12, 2012

#Gutenberg; #Unicode is the new #ASCII

Project Gutenberg is the original project for making out-of-copyright books available to read. To date it has made some 39,000 books available. Through affiliated organisations, including Wikibooks and Project Runeberg, there are some 100,000 books to read.

When the project started in 1971, ASCII was how text was stored digitally on a computer. It has been a guiding principle of Project Gutenberg to store text in ASCII ever since.

When a book was written in Greek, the characters of the Greek alphabet were transliterated into ASCII. At the time this made perfect sense, because ASCII was the standard every computer understood.
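To make the idea concrete, here is a minimal sketch of what such a transliteration does. The table `GREEK_TO_ASCII` and the function `transliterate` are made up for illustration and are deliberately tiny; this is not Project Gutenberg's actual scheme. It also hints at a cost: when two Greek letters share one ASCII stand-in, different words can end up spelled the same.

```python
# Hypothetical, deliberately tiny table: NOT Project Gutenberg's real scheme.
# Note that eta and epsilon both map to "e", and omega and omicron both map to "o".
GREEK_TO_ASCII = {
    "α": "a", "γ": "g", "ε": "e", "η": "e", "λ": "l",
    "ν": "n", "ο": "o", "ω": "o", "ς": "s", "σ": "s", "τ": "t",
}


def transliterate(text: str) -> str:
    """Replace each Greek letter with its ASCII stand-in; keep other characters."""
    return "".join(GREEK_TO_ASCII.get(ch, ch) for ch in text)


print(transliterate("λογος"))                      # logos
print(transliterate("τον"), transliterate("των"))  # ton ton
```

With a table like this, "τον" and "των" both come out as "ton", so the ASCII text alone no longer tells you which word was in the book.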

Modern computers use Unicode to encode text. The first notions of Unicode came into being sixteen years after the start of Project Gutenberg. Most scripts are defined in Unicode, and modern software expects it. Now, twenty-five years later, there are free fonts for all the important scripts. Transliterating these languages into ASCII makes the resulting product unusable for many people, because most people do not know how to read transliterated text.
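As a small illustration of the difference, the sketch below checks whether a string can be stored as ASCII at all and then stores it as UTF-8, one of the Unicode encodings. The helper name `fits_in_ascii` is invented for this example; `str.encode` and `bytes.decode` are standard Python.

```python
def fits_in_ascii(text: str) -> bool:
    """Return True if every character of text can be encoded as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


greek = "λογος"
print(fits_in_ascii("logos"))  # True
print(fits_in_ascii(greek))    # False

# UTF-8 stores the Greek letters directly and gives them back unchanged.
encoded = greek.encode("utf-8")
print(len(greek), len(encoded))          # 5 10
print(encoded.decode("utf-8") == greek)  # True
```

The Greek word takes two bytes per letter in UTF-8 instead of one, but nothing is lost on the way back.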

Is there still a point in sticking to ASCII, or can ASCII safely be replaced by Unicode?

1 comment:

Bawolff said...

I imagine the transliteration is not a 1:1 mapping. Otherwise I'd suggest getting a computer program to untransliterate the text.