Thursday, June 28, 2012

#ImpactOCR - A #font using #unicode private use characters

When very old historical texts are digitised and run through OCR, the letter shapes found in the images are mapped to the correct characters. Characters are defined in Unicode; when a character is NOT defined there, it can be given a code point in the "private use" area.
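To make this concrete, here is a minimal Python sketch for spotting private use characters in OCR output. It relies on the fact that Unicode gives private use code points the general category "Co". The function name and the sample string are made up for illustration and are not taken from the Impact project.

```python
import unicodedata


def private_use_chars(text):
    """Return the distinct private use characters in text, sorted by code point."""
    # Every private use code point has the Unicode general category "Co".
    return sorted({ch for ch in text if unicodedata.category(ch) == "Co"})


# Hypothetical OCR output containing two private use characters.
sample = "histori\ue000al te\ue001t"
found = private_use_chars(sample)
print([f"U+{ord(ch):04X}" for ch in found])
```

A report like this shows which private use code points a text depends on, which matters because their meaning is only known to those who agreed on it.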

As part of the Impact project, very old texts in many languages have been digitised. At the recent presentation, two speakers mentioned that some characters used in Slovenian and Polish are not (yet) defined in Unicode. In their project, the missing characters were given code points in the Unicode private use area and the scanning software was taught to use them.

With the research completed, and with the need for these characters and their shapes established, it would be great for these characters to find their way into Unicode proper. Once the code points for the missing characters are defined and agreed, the OCR software can learn to recognise the characters at the new code points, a conversion program can be written for the existing texts, and it will be more inviting to include these characters in fonts.
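The conversion program mentioned above could be quite small. The following Python sketch assumes a table that maps each private use code point to its eventual standard code point. Both the private use values and the target values in the table are placeholders chosen for illustration, since the real assignments do not exist yet.

```python
# Hypothetical mapping: private use code point -> agreed standard code point.
# These values are placeholders, not real assignments from any project.
PUA_TO_STANDARD = {
    0xE000: 0xA75B,
    0xE001: 0xA76F,
}


def convert(text, mapping=PUA_TO_STANDARD):
    """Replace every mapped private use character; leave all other text as is."""
    # str.translate accepts a dict that maps code points to code points.
    return text.translate(mapping)


converted = convert("histori\ue000al te\ue001t")
```

Private use characters that are missing from the table pass through unchanged, so the conversion can be rerun safely as more code points are agreed.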

Now that the project is at its end, it is the right moment to extend the Latin script in Unicode even further.
