Optical Character Recognition HowTo

May 7th, 2010 by Ken Mankoff

Optical character recognition (OCR) converts images of text into ordinary computer text that you can edit, copy, paste, and search. With OCR software you can convert old image-based PDFs to text. For a cleanly scanned PDF set in a modern font, about 95% of words are recognized correctly. Of the roughly 1 in 20 that confuse the algorithm, about half are easily corrected with a spell checker, and the rest must be fixed by hand.
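
To get a feel for what those rates mean in practice, here is a rough back-of-the-envelope sketch. The 10,000-word document is a made-up example, and the 5% and one-half figures are only the approximate rates quoted above, not measurements.

```shell
#!/bin/sh
# Estimate how many words need manual correction after OCR.
# Assumption: the rough rates quoted in the text (5% misread, half of those
# caught by a spell checker). The word count passed in is hypothetical.
manual_fixes() {
    words=$1
    misread=$(( words * 5 / 100 ))     # about 1 in 20 words is misrecognized
    spell_fixed=$(( misread / 2 ))     # about half are fixed by a spell checker
    echo $(( misread - spell_fixed ))  # the rest must be corrected by hand
}

manual_fixes 10000   # prints 250
```

So a 10,000-word scan would leave something like 250 words to fix by hand, which is worth knowing before starting on a long document.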

Google has released tesseract, the OCR code they use for Google Books. They don’t officially support OS X, so below are instructions to get tesseract-ocr running there. As usual, the Developer Tools (Xcode) need to be installed.

A slightly simpler installation does not use LibTIFF, but then you can only convert single-page, uncompressed TIFF files. Since PDFs usually have multiple pages, it is worth installing LibTIFF and compiling against it.
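
Since tesseract reads TIFF rather than PDF, a scanned PDF has to be converted first. One possible way, assuming Ghostscript (gs) is installed (it is not part of the instructions in this post), is a small helper like the sketch below. The function name pdf_to_tif, the 300 dpi resolution, and the tiffg4 device (black-and-white, G4-compressed) are illustrative choices, and the GS variable exists only so the command can be swapped out or dry-run.

```shell
#!/bin/sh
# Convert a PDF into one multi-page TIFF with a single-"f" .tif extension.
# Assumptions: Ghostscript is available as "gs"; pdf_to_tif and GS are
# hypothetical names introduced for this example.
pdf_to_tif() {
    pdf=$1
    tif=${2:-${pdf%.pdf}.tif}          # default: same name, .tif extension
    "${GS:-gs}" -q -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r300 \
        -sOutputFile="$tif" "$pdf"
}

# Usage (not run here): pdf_to_tif scan.pdf    # would write scan.tif
```

The result is a compressed, multi-page file, which is exactly the case that needs the LibTIFF build. Setting GS=echo before calling the function prints the Ghostscript arguments instead of running them.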

# prepare dependencies
fink install libtiff libtiff-shlibs

# fetch
svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-svn
cd tesseract-svn

# compile
./runautoconf
export CXXFLAGS=-m32 # Force 32-bit architecture
./configure --prefix=$HOME/local/tesseract/ --with-libtiff=/sw/lib
make
make install

# test run
~/local/tesseract/bin/tesseract image.tif out # run it; the text goes to out.txt
say Tesseract finished # might take a while. Turn on your speaker.
gn Finished Tesseract # optional: needs alias gn='growlnotify -s -m'
less out.txt # check
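
If the pages are stored as separate TIF files, the test run above can be wrapped in a loop. This is a minimal sketch, assuming the install location used in the test run; the function name ocr_all and the TESSERACT override variable are hypothetical.

```shell
#!/bin/sh
# Run tesseract on every .tif file in a directory.
# Each page.tif produces page.txt next to it, mirroring "tesseract image.tif out".
# Assumptions: ocr_all and TESSERACT are names made up for this example.
ocr_all() {
    dir=$1
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue      # no matches: skip the literal pattern
        base=${img%.tif}               # output name without the extension
        "${TESSERACT:-$HOME/local/tesseract/bin/tesseract}" "$img" "$base" || return 1
    done
}

# Usage (not run here): ocr_all ~/scans
```

The loop stops at the first page that fails, so a bad image is noticed right away instead of being buried in a long run.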

Notes:

  • Here is a sample image for testing
  • The image must be in TIFF format.
  • The file extension must have a single “f”: TIF or tif, not TIFF.
  • Images and complex equations are not handled.
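
Because of the single-“f” rule, files saved as .tiff need renaming before tesseract will take them. A small sketch of one way to do that in bulk; the function name fix_tif_extensions is made up for this example, and it deliberately refuses to overwrite an existing file.

```shell
#!/bin/sh
# Rename *.tiff to *.tif and *.TIFF to *.TIF in one directory.
# Assumption: fix_tif_extensions is a hypothetical helper, not part of tesseract.
fix_tif_extensions() {
    dir=$1
    for f in "$dir"/*.tiff "$dir"/*.TIFF; do
        [ -e "$f" ] || continue        # pattern matched nothing
        case $f in
            *.tiff) new=${f%.tiff}.tif ;;
            *.TIFF) new=${f%.TIFF}.TIF ;;
        esac
        if [ -e "$new" ]; then
            echo "skipping $f: $new already exists" >&2
        else
            mv "$f" "$new"
        fi
    done
}

# Usage (not run here): fix_tif_extensions ~/scans
```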

I tried running tesseract on my handwriting and it could not decode it. I wrote a simple sentence as clearly as possible and took a photo:


The Quick Brown Fox


And the result was:

THE ®.U\(.K
[awww Fox
TUMPED oval
#IE uxzv ooé

It got “THE” and “Fox” and most of “JUMPED”. However, tesseract supports full training, so if you need to convert your handwritten notes, read the documentation and post what you learn below.

2 Responses to “Optical Character Recognition HowTo”

  1. Ken Mankoff Says:

    Just found WatchOCR, a live Linux distro that runs optical character recognition on a PDF and writes the text back into it, so the PDF becomes searchable.

    http://www.watchocr.com/index.html


  2. Ken Mankoff Says:

    And fink now has a package called ‘gocr’, which is very easy to install, and produces the following output on the image:

    TE QUlK
    oWN FoX
    MeED oE
    TE LĄy

