Words and what not: #Djatoka to focus in on a scanned #newspaper

Tuesday, March 16, 2010

#Djatoka to focus in on a scanned #newspaper

Making an archive of newspapers digitally available can involve two things: scanning the paper and running optical character recognition (OCR) on the text. Scanning is already a big job; performing quality OCR is an even bigger one and often prohibitively expensive.

A scanned newspaper page consists of multiple articles. Where each article starts is relatively easy to recognise and, with a bit of programming, these anchor points into the text can be detected automatically. This would make the digital navigation of a scanned paper a lot easier.
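To give an idea of how such an anchor point could be used for navigation, here is a minimal sketch that builds a Djatoka request for just the part of the page where an article sits. It assumes a Djatoka resolver that answers OpenURL `getRegion` requests; the function name, the resolver address and the image identifier in the usage note are made up for illustration, and the coordinates are assumed to be pixels on the full-size scan.

```python
from urllib.parse import urlencode


def djatoka_region_url(resolver, image_id, x, y, width, height, level=None):
    """Build a URL asking a Djatoka resolver for one region of a scanned page.

    Hypothetical helper: the parameter names follow Djatoka's OpenURL
    getRegion service as I understand it, so check them against your install.
    """
    params = {
        "url_ver": "Z39.88-2004",
        "rft_id": image_id,
        "svc_id": "info:lanl-repo/svc/getRegion",
        "svc_val_fmt": "info:ofi/fmt:kev:mtx:jpeg2000",
        "svc.format": "image/jpeg",
        # Djatoka expects the region as top, left, height, width.
        "svc.region": f"{y},{x},{height},{width}",
    }
    if level is not None:
        # Optional resolution level; left out so the server picks its default.
        params["svc.level"] = str(level)
    return f"{resolver}?{urlencode(params)}"
```

Calling `djatoka_region_url("http://localhost:8080/adore-djatoka/resolver", "http://example.org/page1.jp2", x=100, y=200, width=800, height=600)` would give a link that shows only that article, so a list of anchor points could become a list of such links.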

When a user identifies an article as relevant, it can be named, which allows for easy referencing. An OCR process can then run on the text of the article, and the user can be asked to proofread the result. In this way the article gains usability as a resource.
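The steps above could be tracked with a very small record per article. This is only a sketch of the idea; the class, its field names and the status labels are all hypothetical, and the OCR itself is left out: the text is simply handed in by whatever process produces it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Article:
    """One article on a scanned page (hypothetical model for illustration)."""
    page_id: str                     # identifier of the scanned page
    x: int                           # left edge of the article, in pixels
    y: int                           # top edge of the article, in pixels
    width: int
    height: int
    name: Optional[str] = None       # given by the user for easy referencing
    ocr_text: Optional[str] = None   # raw or corrected OCR text
    proofread: bool = False

    def add_ocr(self, text: str) -> None:
        # Fresh OCR output has not been checked by anyone yet.
        self.ocr_text = text
        self.proofread = False

    def accept_proofread(self, corrected_text: str) -> None:
        # Proofreading only makes sense once there is OCR output to check.
        if self.ocr_text is None:
            raise ValueError("no OCR text to proofread")
        self.ocr_text = corrected_text
        self.proofread = True

    @property
    def status(self) -> str:
        if self.proofread:
            return "proofread"
        if self.ocr_text is not None:
            return "ocr"
        if self.name is not None:
            return "named"
        return "identified"
```

An article would start as "identified", become "named" once a user gives it a name, and move on to "ocr" and "proofread" as the text is produced and checked. The scan itself is never replaced; the record only points at a region of it.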

An important part of such a workflow is that the underlying scanned newspaper remains an essential part of the resource. Not only is it the source material, it also provides provenance, because people can return to the original and verify what is digitally available.
Thanks,
GerardM
