Scanning Neel Doff – from pdf to text

July 23rd, 2013

After the consumer-style scanning of two books by Neel Doff, I thought I was in heaven. The resulting pdf is neat, keeps the nice texture of the old book and reads very well on an e-reader.
But to make this work accessible to machine agents, the text needs to be available unformatted, in a plain text file. The Linux command I knew of seemed easy and magic: pdftotext file.pdf.
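For anyone who wants to run that command over several scans, here is a minimal Python sketch. It assumes pdftotext (part of poppler-utils) is installed and on the PATH. The function names and the keep_layout parameter are my own, chosen for illustration; -layout is a pdftotext option that tries to preserve the physical layout of the page.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, keep_layout=False):
    """Return the pdftotext command and the text file it will write."""
    pdf = Path(pdf_path)
    txt = pdf.with_suffix(".txt")  # book.pdf -> book.txt
    command = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical layout of the page
        command.append("-layout")
    command += [str(pdf), str(txt)]
    return command, txt


def pdf_to_text(pdf_path, keep_layout=False):
    """Run pdftotext and return the extracted text (assumes it is installed)."""
    command, txt = build_pdftotext_command(pdf_path, keep_layout)
    subprocess.run(command, check=True)  # raises if pdftotext fails
    return txt.read_text(encoding="utf-8", errors="replace")
```

Calling pdf_to_text("file.pdf") should leave a file.txt next to the pdf and return its content as a string. It only extracts the text layer that the scanner already put in the pdf, so it cannot be better than that layer.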

With the naivety of a consumer I opened the text file, expecting to only have to delete the side texts, such as the introduction and the credits.
But, oh, no. The content of the file is pure art, a beautiful piece of text that is hardly legible in some places!

OCR works very well in the typeset pdf, because the form of the letters corresponds to what we interpret as a letter. Once the formatting is gone, the letters and words can be anything.

The Tesseract OCR software is what you need for this, or the proprietary Adobe software. Sound and net artist Andre Castro developed a very nice script that does a good job.
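To give an idea of what such a workflow can look like, here is a rough Python sketch. It is not Andre Castro's script, only an illustration under my own assumptions: pdftoppm (from poppler-utils) and tesseract are installed, the French language data for Tesseract is available (hence "fra", since Neel Doff wrote in French), and 300 dpi is a reasonable resolution. All function names are hypothetical.

```python
import subprocess
from pathlib import Path


def rasterize_command(pdf_path, image_prefix, dpi=300):
    """Command that turns each pdf page into a PNG named <prefix>-<page>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(image_prefix)]


def ocr_command(image_path, language="fra"):
    """Command that makes Tesseract write <image name without suffix>.txt."""
    output_base = Path(image_path).with_suffix("")
    return ["tesseract", str(image_path), str(output_base), "-l", language]


def ocr_pdf(pdf_path, work_dir, language="fra"):
    """Rasterize a pdf, OCR every page image and return the joined text."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_command(pdf_path, work / "page"), check=True)
    pages = []
    # pdftoppm zero-pads the page numbers, so a plain sort keeps page order
    for image in sorted(work.glob("page-*.png")):
        subprocess.run(ocr_command(image, language), check=True)
        pages.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(pages)
```

Something like ocr_pdf("file.pdf", "ocr-pages") would then return the text of the whole book as one string, with the page images and per-page text files left in the working directory for inspection.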

Unfortunately the result is not perfect. Because OCR has difficulty interpreting particular elements of layout and fonts, the txt file comes with a lot of errors.
Some recurring phenomena are:
*specific letters get confused in some fonts (it can take m for n or I for i etc.)
*headers might have become part of sentences
*footnotes are placed inside the flowing text
*page numbers are not recognized as such
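Two of these phenomena, page numbers and running headers, can be partly cleaned up automatically when they end up on lines of their own. The Python sketch below is only a heuristic with assumptions of mine: a page number is a line holding nothing but one to four digits (optionally wrapped in dashes), and a header is a short line that comes back at least min_repeats times. The function names and thresholds are made up for illustration. Headers already merged into a sentence, confused letters and misplaced footnotes still need a human eye.

```python
import re
from collections import Counter

# A line with only a number, optionally wrapped in dashes: "12" or "- 13 -"
PAGE_NUMBER = re.compile(r"^\s*[-–]?\s*\d{1,4}\s*[-–]?\s*$")


def strip_page_numbers(lines):
    """Drop lines that contain nothing but a page number."""
    return [line for line in lines if not PAGE_NUMBER.match(line)]


def strip_running_headers(lines, min_repeats=5, max_length=60):
    """Drop short lines that recur often, as running headers tend to do."""
    counts = Counter(line.strip() for line in lines if line.strip())
    headers = {
        text
        for text, count in counts.items()
        if count >= min_repeats and len(text) <= max_length
    }
    # Blank lines are never in the header set, so paragraph breaks survive
    return [line for line in lines if line.strip() not in headers]
```

A short line of dialogue that is repeated many times in the book would be removed as well, so it is worth comparing the result with the original before throwing anything away.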

Project Gutenberg developed a collective tool to proofread OCR’ed scans. If your book is in the public domain and you sign up as a proofreader, you can start your own project and invite other proofreaders in. As a result your book becomes part of Project Gutenberg, a stable and durable collection, and will be freely shared all over the planet.
If your scanned book is under copyright, you can speed up the text corrections by copying the txt into an OpenOffice document. The spellcheck option will be your greatest collaborator.
