Optical Character Recognition HowTo
Optical character recognition (OCR) converts images of text into normal computer text that you can edit, copy, paste and search. With OCR software you can convert old image-based PDFs to text. About 95% of the words in a cleanly scanned PDF set in a modern font are recognized correctly. Of the roughly 1 in 20 words that confuse the algorithm, about half are easily corrected with a spell checker, and the rest must be fixed by hand.
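To put those rates in concrete terms, here is a back-of-the-envelope sketch in shell arithmetic. The function name `ocr_manual_fixes` and the 10,000-word document are only illustrations; the 5% error rate and the one-half spell-checker rate are the rough estimates quoted above, not measurements.

```shell
# Estimate how many words need manual fixing after OCR.
# Assumes ~5% of words are misrecognized and a spell checker fixes ~half of those.
ocr_manual_fixes() {
    words=$1
    wrong=$(( words * 5 / 100 ))      # roughly 1 in 20 words misrecognized
    spell_fixed=$(( wrong / 2 ))      # about half caught by a spell checker
    echo $(( wrong - spell_fixed ))   # the remainder must be fixed by hand
}

ocr_manual_fixes 10000   # prints 250
```

So for a 10,000-word document you would expect something on the order of 250 hand corrections.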
Google has made the tesseract OCR code they use for Google Books available, but they don’t officially support OS X. Below are instructions to get tesseract-ocr running on OS X. As usual, the Developer Tools (Xcode) need to be installed.
A slightly simpler installation skips LibTIFF, but then you can only convert single-page, uncompressed TIFF files. Because PDFs usually have multiple pages, it is worth installing LibTIFF and compiling with it.
    # prepare dependencies
    fink install libtiff libtiff-shlibs

    # fetch
    svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-svn
    cd tesseract-svn

    # compile
    ./runautoconf
    export CXXFLAGS=-m32 # Force 32-bit architecture
    ./configure --prefix=/Users/mankoff/local/tesseract/ --with-libtiff=/sw/lib
    make
    make install

    # test run
    ~/local/tesseract/bin/tesseract image.tif out # run it
    say Tesseract finished # might take a while. Turn on your speaker.
    gn Finished Tesseract # alias gn='growlnotify -s -m'
    less out.txt # check
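If you have a whole directory of scans, the single test run can be wrapped in a loop. This is a minimal sketch, not part of the original instructions: `ocr_all` is a hypothetical helper name, and the `TESSERACT` variable is assumed to point at the binary installed under the prefix used above (override it if yours lives elsewhere).

```shell
# Run tesseract on every .tif file in a directory, writing NAME.txt next to NAME.tif.
# ocr_all is a hypothetical helper; TESSERACT defaults to the install prefix used above.
TESSERACT=${TESSERACT:-$HOME/local/tesseract/bin/tesseract}

ocr_all() {
    dir=${1:-.}
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue           # no .tif files matched: nothing to do
        "$TESSERACT" "$img" "${img%.tif}"   # tesseract appends .txt to the output base
    done
}
```

For example, `ocr_all ~/scans` should leave a `.txt` file beside each `.tif` in `~/scans`.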
Notes:
- Here is a sample image for testing
- The image must be in TIF format.
- The extension must have one “f”: TIF or tif, not TIFF.
- Images and complex equations are not handled.
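Since the extension matters, scans saved as `.tiff` need renaming before they are passed to tesseract. A small sketch of one way to do that follows; `fix_tif_extensions` is a hypothetical helper name, and it only renames files, so it assumes the contents are already valid TIFF images.

```shell
# Rename *.tiff / *.TIFF files in a directory to the single-"f" form.
# fix_tif_extensions is a hypothetical helper name.
fix_tif_extensions() {
    dir=${1:-.}
    for f in "$dir"/*.tiff "$dir"/*.TIFF; do
        [ -e "$f" ] || continue       # skip glob patterns that matched nothing
        mv -- "$f" "${f%.*}.tif"      # strip the old extension, append .tif
    done
}
```

Running `fix_tif_extensions .` in the scan directory turns `page1.tiff` into `page1.tif`. Note that it will overwrite an existing file of the same name.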
I tried to run tesseract on my handwriting and it could not decode it. I wrote a simple sentence as clearly as possible, took a photo:
And the result was:
THE ®.U\(.K [awww Fox TUMPED oval #IE uxzv ooé
It got “THE” and “Fox” and most of “JUMPED”. However, tesseract supports full training, so if you need to convert your handwritten notes, read the documentation and post what you learn below.

July 22nd, 2010 at 14:56
Just found WatchOCR, a live Linux distro that runs optical character recognition on a PDF and adds the recognized text back into it, so the PDF becomes searchable.
http://www.watchocr.com/index.html
December 17th, 2010 at 09:22
And fink now has a package called ‘gocr’, which is very easy to install, and produces the following output on the image:
TE QUlK
oWN FoX
MeED oE
TE LĄy