Paper isn’t going away, of course, but keeping all your documents on such an antiquated medium is often less than ideal. One major disadvantage of paper is that searching it is much more difficult. That’s just one of the reasons PDFs are so popular! Anybody can open a PDF file for free, search it for the information they need, and store it for later browsing without any significant impact on hard drive space.
Not all PDFs are Created Equal
But perhaps you don’t know that there are two kinds of PDFs. The best kind of PDF is the kind generated by computer software from a text file. These PDFs are searchable because the text is preserved.
But many PDFs are generated from images rather than text. If you create a PDF by scanning a document in a photocopier or image scanner, the result is usually an image-based PDF rather than a text-based PDF. This means that your PDF will not be searchable, because your computer does not have access to the underlying text, even though you can read it just fine.
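If you are not sure which kind of PDF you have, a rough check is possible with nothing but the Python standard library. The sketch below is only a heuristic, not a real PDF parser: it assumes that text-based PDFs declare a `/Font` resource while pure scans contain only `/Subtype /Image` objects. The function name `classify_pdf_bytes` and the labels it returns are made up for this illustration, and the check can be fooled by PDFs that store their objects in compressed streams.

```python
import re

# Byte patterns that tend to appear in uncompressed PDF object dictionaries.
# \b keeps "/Font" from matching longer names such as "/FontDescriptor".
FONT_MARKER = re.compile(rb"/Font\b")
IMAGE_MARKER = re.compile(rb"/Subtype\s*/Image\b")


def classify_pdf_bytes(data: bytes) -> str:
    """Guess whether raw PDF bytes look text-based or image-based.

    Heuristic only: compressed object streams can hide both markers,
    in which case the function reports "unknown".
    """
    if FONT_MARKER.search(data):
        # Fonts are only needed when the file contains real text.
        return "probably text-based"
    if IMAGE_MARKER.search(data):
        # Images but no fonts: most likely a plain scan.
        return "probably image-based"
    return "unknown"
```

To try it on a real file, pass in the file's bytes, for example `classify_pdf_bytes(pathlib.Path("scan.pdf").read_bytes())`, where `scan.pdf` is a placeholder name. Opening the file in a PDF reader and trying to select a word is still the more reliable test.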
Searching any PDF with OCR
So how can you overcome this difficulty? By using Optical Character Recognition (OCR) software. OCR tools look at the image and try to convert it to plain text, which can then be searched, copied and pasted, and indexed just like any other document (I worked with several such software systems during my undergraduate degree).
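Once an OCR tool has handed you plain text, searching it is ordinary string matching. Here is a minimal sketch, assuming the OCR output has already been collected as one string per page; the function name `find_pages` and the sample pages are invented for illustration. Because OCR output usually keeps the line breaks of the scanned page, the sketch collapses whitespace before comparing, so a phrase split across two lines is still found.

```python
def find_pages(pages: list[str], term: str) -> list[int]:
    """Return the 1-based numbers of pages whose text contains `term`.

    Matching ignores case and treats any run of whitespace
    (including line breaks) as a single space.
    """
    needle = " ".join(term.split()).casefold()
    hits = []
    for number, text in enumerate(pages, start=1):
        haystack = " ".join(text.split()).casefold()
        if needle in haystack:
            hits.append(number)
    return hits


# Hypothetical OCR output, one string per scanned page.
pages = [
    "Paper is an antiquated\nmedium.",
    "OCR converts images to plain\ntext.",
]
print(find_pages(pages, "Plain Text"))
```

This only finds exact matches, so a word the OCR tool misread will be missed.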
There are several good free OCR tools available for converting PDF documents to plain text. The best out there is the one Google uses to power its Google Books service. The problem is that you don’t have direct access to that software: you need to go fishing and wait for Google to bite. You can find instructions for doing that here.
If you want more control over your software, and you probably do, check out this list of handy PDF tools, many of which are OCR converters. There is also a lot of great software on this list.
Finally, a new service, PDF-to-word, currently in invite-only beta, accurately converts PDF images to MS Word documents. You might have to just bookmark this one since it’s not yet available to the public, but you might find an invite code online, such as here.
Conclusions
One remaining limitation of all this is that the OCR software listed above is optimized for English. Problems often occur with German and French, and don’t even bother trying it on Greek or Hebrew. Nevertheless, for scanned English documents the advantages are worth the time spent experimenting with one of these systems, especially if you have a lot of scanned PDF documents.