OCR unsearchable PDF files

PDF files of text that aren't searchable are a real pain. With a very nice OCR program called Tesseract, converting them to plain text is relatively straightforward: each page is converted to a TIFF file, and each TIFF file is then passed through tesseract, which writes one text file per page. The two scripts below automate this process.

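To use them, you'll need imagemagick and tesseract, plus Ghostscript (the gs command that pdf2tif calls). Put both scripts somewhere on your PATH, since ocr.sh calls pdf2tif by name, then run ocr.sh file.pdf to process a PDF file.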
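
Before running the scripts, it can help to confirm that the commands they call are actually installed. The following is a minimal sketch, not part of the original scripts: missing_tools is a hypothetical helper name, and the list of commands checked (gs and tesseract) is taken from what the two scripts invoke.

```shell
#!/bin/sh
# Hypothetical helper: print the name of each given command
# that cannot be found on PATH.
missing_tools() {
    for tool in "$@"
    do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# gs is called by pdf2tif, tesseract by ocr.sh.
missing=`missing_tools gs tesseract`
if [ -n "$missing" ]
then
    # $missing is left unquoted so the names print on one line.
    echo "Missing required tools:" $missing 1>&2
fi
```

If nothing is printed, both commands were found.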

pdf2tif

#!/bin/sh
# Derived from pdf2ps.
# $Id: pdf2tif,v 1.0 2006/11/03 Fred Smith
# Convert PDF to TIFF file.
 
OPTIONS=""
while true
do
case "$1" in
-?*) OPTIONS="$OPTIONS $1" ;;
*) break ;;
esac
shift
done
 
if [ $# -eq 2 ]
then
 
    outfile=$2
 
elif [ $# -eq 1 ]
then
 
    outfile=`basename "$1" .pdf`-%04d.tif 
 
else
 
    echo "Usage: `basename $0` [gs options] input.pdf [output.tif]" 1>&2
 
    exit 1
 
fi
 
# Doing an initial 'save' helps keep fonts from being flushed between pages.
# We have to include the options twice because -I only takes effect if it
# appears before other options.
exec gs $OPTIONS -q -dNOPAUSE -dBATCH -dSAFER -r300x300 -sDEVICE=tiffg3 "-sOutputFile=$outfile" $OPTIONS -c save pop -f "$1"
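
When no output name is given, pdf2tif builds one from the input name plus a %04d placeholder, which gs fills in with the page number. The sketch below only previews that naming with printf and does not run gs. The input path is a made-up example, and it assumes gs numbers pages starting at 1.

```shell
#!/bin/sh
# Hypothetical input path, for illustration only.
infile="/tmp/scans/report.pdf"

# Same expression pdf2tif uses for its default output name.
pattern=`basename "$infile" .pdf`-%04d.tif
echo "$pattern"            # prints: report-%04d.tif

# Preview the name of the first page's file.
page1=`printf "$pattern" 1`
echo "$page1"              # prints: report-0001.tif
```

Because the directory part is stripped by basename, the TIFF files land in the current directory, and the zero padding keeps them in page order when listed with *.tif.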

ocr.sh

#!/bin/sh
 
# takes one parameter, the path to a pdf file to be processed.
# uses custom script 'pdf2tif' (which must be on your PATH) to
# generate the tif files at 300x300 dpi.
# drops them in our current directory,
# then runs $progdir/tesseract on every .tif file found there.
# the lines that delete the .raw and .map files are commented out
# below; un-comment them if your tesseract drops those files.
 
# quoted so that paths containing spaces work
pdf2tif "$1"
 
# edit this to point to wherever you've got your tesseract binary
progdir=/usr/bin
 
for j in *.tif
do
    x=`basename "$j" .tif`
    echo "Processing $j"
    # tesseract writes its output to ${x}.txt
    "${progdir}/tesseract" "$j" "$x"
    #rm "${x}.raw"
    #rm "${x}.map"

    # comment out the next line if you want to keep the .tif files when done.
    rm "$j"
done
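
The loop above leaves one text file per page. If a single text file for the whole PDF is wanted, the pages can be concatenated afterwards. This is a minimal sketch rather than part of the original scripts: join_pages is a hypothetical helper, and it assumes the default pdf2tif naming, so that the page files are called name-0001.txt, name-0002.txt, and so on.

```shell
#!/bin/sh
# Hypothetical helper: write the per-page text files for one PDF
# to standard output, in page order.
# $1 is the file name prefix, e.g. "report" for report-0001.txt.
join_pages() {
    prefix=$1
    for page in "$prefix"-[0-9][0-9][0-9][0-9].txt
    do
        # If nothing matched, the pattern is left unexpanded; skip it.
        [ -f "$page" ] || continue
        cat "$page"
    done
}

# Example use, after running ocr.sh report.pdf:
#   join_pages report > report.txt
```

The zero-padded page numbers mean the shell's alphabetical glob order is also the page order, so no explicit sort is needed.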