
OCR is the technology used to turn an image of text into plain (editable, searchable) text. If you’re like me (i.e., a nerd), you probably have a pile of scanned journal articles, books, and the like meticulously sorted on your hard drive (as PDFs, for example). You can read them and print them, but you can’t search them or edit them. Wouldn’t it be nice if you could?

Well, there are a number of free options on the web, but they all have their problems. Google has some of the best OCR technology out there (they recently acquired reCAPTCHA to make it even better), and they have apparently been rolling this out into Google Docs. The Google Docs version is not as wonderful as you might like, but it works on high-res documents. Read about how to turn your images into text here.

Update: I was not able to get this to work with PDFs, surprisingly. The web-app only accepts PNG, JPEG, or GIF images right now. That is unfortunate, and I assume will be “corrected” in the future. Has anyone tried this on an image yet?

 

As a follow-up to my previous post, here is an excellent review of some more great PDF conversion and manipulation tools.

Also, I am happy to report that I have had good success converting PDF images to plain text with OCR Terminal, so give it a try!

 


Paper isn’t going away, of course, but having all your documents on such an antiquated medium is often less than ideal. There is at least one major disadvantage to paper: searching is much more difficult. That’s just one of the reasons PDFs are so popular! Anybody can open a PDF file for free, search it for the information they need, and store it for later browsing without any significant impact on hard drive space.

Not all PDFs are Created Equal

But perhaps you don’t know that there are two kinds of PDFs. The best kind of PDF is the kind generated by computer software from a text file. These PDFs are searchable because the text is preserved.

But many PDFs are generated from images rather than text. If you create a PDF by scanning a document in a photocopier or image scanner, then the result is usually an image-based PDF rather than a text-based PDF. This means that your PDF will not be searchable, because your computer does not have access to the underlying text, even though you can read it just fine.
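One rough way to tell the two kinds of PDF apart is to try extracting text and see whether anything comes back. The sketch below is only an illustration under stated assumptions: `has_text_layer`, `pdf_page_texts`, and the `min_chars` threshold are names and values I made up, and the extraction helper assumes the third-party pypdf package is installed. A scanned PDF that has already been run through OCR will also pass this check, since it carries a text layer too.

```python
def has_text_layer(page_texts, min_chars=20):
    """Guess whether a PDF is text-based from its extracted page text.

    page_texts: one string per page, as returned by a PDF text extractor.
    min_chars:  hypothetical threshold; image-only pages usually yield
                little or no extractable text.
    """
    total = sum(len(text.strip()) for text in page_texts if text)
    return total >= min_chars


def pdf_page_texts(path):
    """Extract text page by page (assumes the pypdf package is installed)."""
    # Imported lazily so the heuristic above works without any dependency.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


# Usage with already-extracted text (no PDF file needed):
print(has_text_layer(["Chapter 1. In the beginning was the text layer."]))
print(has_text_layer(["", "   \n"]))
```

With a real file you would call `has_text_layer(pdf_page_texts("scan.pdf"))`, where `scan.pdf` is a placeholder path.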

Searching any PDF with OCR

So how can you overcome this difficulty? By using Optical Character Recognition (OCR) software. OCR tools look at the image and try to convert it to plain text, which can then be searched, copied and pasted, and indexed just like any other document (I worked with several such software systems during my undergraduate degree).

There are several good free OCR tools available for converting PDF documents to plain text. The best out there is the one used by Google, which powers its Google Books service. The problem here is that you don’t have direct access to their software. You need to go fishing and wait for Google to bite. You can find instructions for doing that here.

If you want more control over your software, and you probably do, check out this list of handy PDF tools, many of which are OCR converters. There is also a lot of great software on this list.

Finally, a new service, PDF-to-word, currently in invite-only beta, accurately converts PDF images to MS Word documents. You might have to just bookmark this one since it’s not yet available to the public, but you might find an invite code online, such as here.

Conclusions

One remaining limitation of all this is that the OCR software listed above is optimized for English. Problems often occur with German and French, and don’t even bother trying it on Greek or Hebrew. Nevertheless, for scanned English documents the advantages are worth the time it takes to experiment with one of these systems, especially if you have a lot of scanned PDF documents.

 

Digitization is the way of the future, and with the recent deal between authors and Google Books, that future may in fact be bright for all parties.

In the course of my dissertation work I often have to track down primary sources, and when those sources are particularly rare it becomes difficult. Or it used to be difficult. Now I Google it.

Exhibit A: This morning I needed to track down some homilies on Hebrews by Chrysostom. Being a dedicated Greek Geek, I wanted the “original,” which means I need Patrologia Graeca volume 63. Where am I going to get it? Google Books, of course: they have the entire series digitized and downloadable for your convenience. This is what sites like Google Books and archive.org are made for: primary sources in the public domain.

Image view of v63 of Patrologia Graeca


Here are some screenshots for you. The first is the standard scan, downloadable as a PDF. The second is Google’s attempt at a little OCR, which is obviously struggling with both the Greek and the Latin. This is to be expected. I did a little natural language processing way back when; a lot of OCR software will “guess” the letters based not only on shape, but also on the software’s (limited) understanding of the language, which for Greek and Hebrew is probably NULL. Still, I was impressed, and this is a harbinger of great things to come.
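To make the “guessing” idea concrete, here is a toy sketch of how shape evidence and language knowledge might be combined. It is not how Google’s OCR or any particular product works; the function `best_word`, the candidate probabilities, and the tiny lexicon are all made up for illustration. Each character position gets a few candidate letters with a shape-based probability, and every candidate word is rescored by how likely that word is in the language.

```python
from itertools import product


def best_word(shape_guesses, lexicon, unknown=1e-6):
    """Pick the word that best balances letter shapes and language knowledge.

    shape_guesses: list of positions, each a list of (letter, probability).
    lexicon:       dict mapping known words to a rough probability.
    unknown:       hypothetical floor probability for words not in the lexicon.
    """
    best, best_score = None, -1.0
    for combo in product(*shape_guesses):
        word = "".join(letter for letter, _ in combo)
        shape_score = 1.0
        for _, prob in combo:
            shape_score *= prob
        score = shape_score * lexicon.get(word, unknown)
        if score > best_score:
            best, best_score = word, score
    return best


# Made-up shape evidence: the middle letter looks slightly more like "b" than "h".
guesses = [
    [("t", 0.9)],
    [("b", 0.55), ("h", 0.45)],
    [("e", 0.8), ("c", 0.2)],
]

print(best_word(guesses, {"the": 0.05}))  # -> the  (language knowledge overrides the shape)
print(best_word(guesses, {}))             # -> tbe  (no language knowledge: shape alone decides)
```

The second call is the situation described above for Greek: with an empty lexicon the software has nothing to fall back on except the shapes, so its mistakes go uncorrected.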

OCR view of v63 of Patrologia Graeca


So what primary sources have you been trying to track down? How do you use research tools like these? Post in the comments!

© 2011 Nerdlets