Tesseract is a program that does OCR (optical character recognition). The goal is to take a picture of text and transform it into text; e.g. you scan a page of a book and it will turn it into editable text. Note that Tesseract does not do layout analysis, so a two-column layout might confuse it. Also, the input image must be within certain parameters. Tesseract will not convert the image for you, nor will it complain much if you give it a bad image; it will just fail to output the text.
To actually use Tesseract, you’re going to want to install ImageMagick to get the command-line utility `convert`. The best way to do this is with homebrew, which in my experience has been a thousand times better than MacPorts. Basically, homebrew does not clobber the system stuff, does not install to /opt/ or any other weird place, and does recognize system compilers, X libraries and things like that…
Anyways, let’s start:
1) Install imagemagick
brew install imagemagick
Imagemagick has jpeg, libtiff, little-cms and jasper as dependencies, but all of these are also in homebrew and will install automatically.
2) Install tesseract
brew install tesseract
Tesseract depends on libtiff, which should have been installed when you installed imagemagick. Pretty straightforward.
After these two steps you should have both the `tesseract` and the `convert` command-line programs.
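A quick way to confirm that both programs really ended up on your PATH is sketched below. The function name `missing_tools` is my own invention for illustration; it only relies on the standard `command -v` shell builtin.

```shell
# Print the name of every command given as an argument that is NOT on PATH.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only when the shell can resolve the name
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools tesseract convert)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line
    echo "Still missing:" $missing >&2
else
    echo "tesseract and convert are both installed"
fi
```

If anything is reported as missing, re-run the matching `brew install` step before going on.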
3) Acquire an image
I just copied a selection from a PDF that I was reading for class. This PDF is a scanned book and had two columns (both pages). I saved it as a PNG with 200 DPI. (I also read a tip to save the PNGs without an Alpha layer – I’m pretty sure TIFFs don’t support Alpha channels so I think that will just go away in the conversion process anyways.) The image is below:
4) Convert the image to a TIFF
It doesn’t really matter what file format you capture your image in, but it has to be high-res enough – the 200-300 DPI range should do. If your source image is not within that range, you will have to resample the image at a higher DPI. Anyways, this is where ImageMagick comes in:
convert -density 200 -units PixelsPerInch -type Grayscale +compress test.png test_input.tif
Now, the -density and -units options tell ImageMagick we want the output file to be 200 DPI. These must be there even if your input file is 200 DPI; when I ran it without these options I got a file that had a much lower DPI (72, I think), which Tesseract cannot use. The -type Grayscale option makes sure to remove colors from the output image. Tesseract doesn’t really handle colors too well, but handles black and white great. And since the image is pretty much just black text on white paper, this isn’t a problem. The +compress option must be there as well. I’m still reading up on ImageMagick and don’t know a whole lot about the TIFF format, but without it tesseract will not be able to open or process the input image file. For whatever reason WordPress did not like the TIFF file so there is no thumbnail.
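Since a too-low DPI fails silently, it can be worth checking the number before handing the file to Tesseract. Here is a minimal sketch of such a check. The function name `dpi_in_range` is hypothetical, and the commented-out usage line assumes that ImageMagick's `identify` prints the horizontal resolution for the `%x` format escape (possibly followed by a unit name, depending on the version), so treat that part as an assumption to verify on your own install.

```shell
# Succeed (exit status 0) when the given value is a number between 200 and 300.
dpi_in_range() {
    value=${1%% *}       # keep only the first word: "200 PixelsPerInch" -> "200"
    value=${value%%.*}   # drop any fractional part: "200.0" -> "200"
    case $value in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$value" -ge 200 ] && [ "$value" -le 300 ]
}

# Demonstration with plain numbers
for d in 72 200 300; do
    if dpi_in_range "$d"; then
        echo "$d DPI: ok"
    else
        echo "$d DPI: resample first"
    fi
done

# Assumed usage against a real file (requires ImageMagick's identify):
# dpi_in_range "$(identify -format '%x' test_input.tif)" || echo "check the DPI" >&2
```

The check only looks at the number, so it is only meaningful when the units are pixels per inch, which is what the -units option in the convert command asks for.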
5) Run tesseract
I ran tesseract with the following command:
tesseract test_input2.tif output2 -l eng
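The first argument is the input TIFF (use whatever name you gave yours in step 4) and the second is the base name for the output file, to which Tesseract appends .txt. The -l option specifies a language, and, if you installed through homebrew, there will be a number of language training data packs installed in the correct place. Tesseract may default to English but I like being explicit. Here is the text output: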
Certainly theology needs empirical facts and scientific theoretical
insights. The social scientists offer help. Yet they do not accomplish
what l must now attempt. My main question is where and how the
church must stand to be the witnssing church; that is, what must be
the relation between the culture that is the church (and the larger
Christian and biblical metaculture the church represents) and those
cultures the church indwells, evangelizx, serves? Answering will
require all the resources that Christian theology can bring to bear, and
not a little help from such as Berger and Bellah as well. Already they
have showed us, willy»nilly that theology is required for the task: they
make such ample (and often skillful) use of it, themselves!
So, a few basic mistakes (“witnssing”, “evangelizx”) will be caught by a spell-checker. There is also a lowercase L instead of “I”, and “willy»nilly” picked up a stray “»”. Overall, I’m pretty happy with the results.
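The lowercase L mistake is the kind of thing a spell-checker may miss, but it is easy to patch up afterwards. The sketch below is one possible clean-up, not something Tesseract provides; `fix_lone_l` is a name I made up, and it assumes a standalone “l” is always a misread “I”, which is reasonable for English prose but not guaranteed.

```shell
# Replace every standalone lowercase "l" word with "I"; other words are untouched.
# Note: awk rebuilds a line it modifies, so runs of spaces in that line
# collapse to single spaces.
fix_lone_l() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i == "l") $i = "I"
        print
    }'
}

printf 'what l must now attempt. My main question is where\n' | fix_lone_l
```

To clean a whole file, redirect the .txt file Tesseract wrote into the function and send the result to a new file.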
Tesseract works great; fussing with the input image is the most difficult part. But some reminders: the input must be a TIFF (and named .tif at that), must be grayscale, and must have a resolution between 200 and 300 DPI.
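Steps 4 and 5 can be chained so the reminders above are applied the same way every time. This is only a sketch: the helper names `run` and `ocr_image` and the `RUN` variable are hypothetical, while the two commands are the ones used earlier in this post. By default it only prints what it would do; set RUN=1 to actually execute.

```shell
# Echo the command when RUN is not 1 (dry run); execute it when RUN=1.
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "$@"
    fi
}

# Convert one image to a 200 DPI grayscale uncompressed TIFF, then OCR it.
ocr_image() {
    src=$1                 # e.g. test.png
    base=${src%.*}         # strip the extension: test.png -> test
    run convert -density 200 -units PixelsPerInch -type Grayscale +compress \
        "$src" "${base}_input.tif" &&
    run tesseract "${base}_input.tif" "$base" -l eng
}

ocr_image test.png
```

Running it as `RUN=1 sh script.sh` would perform the conversion and the OCR for real, leaving the text in a .txt file named after the input image.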
OCRopus and many other frontends solve many of these problems by using the tesseract engine with a GUI. OCRopus is not on homebrew, and neither is one of its dependencies (iulib); the rest are. I’ll start working on a homebrew formula for iulib and OCRopus and write it up in a new post. Also, there is a Perl module called Image::OCR::Tesseract that I’ll be testing on my system and maybe extending.