====== OCR unsearchable PDF files ======

PDF files whose text isn't searchable are a real pain.  With a very nice OCR program called [[http://code.google.com/p/tesseract-ocr/|Tesseract]], converting them to plain text is relatively straightforward: each page is converted to a TIFF file, and the TIFF files are then passed through tesseract.  The two scripts below automate this process.

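To use them, you'll need imagemagick and tesseract, plus Ghostscript, since ''pdf2tif'' calls ''gs'' directly.  Put both scripts on your ''PATH'', then run ''ocr.sh file.pdf'' to process a PDF file.  Run it in a directory that holds no other ''.tif'' files, because ''ocr.sh'' processes and then deletes every ''*.tif'' it finds there.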
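
Before running the scripts, it can help to confirm that the tools are actually installed. The sketch below is not part of the original scripts: ''missing_tools'' is a hypothetical helper name, and the check assumes that ''pdf2tif'' has been saved somewhere on your ''PATH''.

<code bash>
#!/bin/sh
# Hypothetical helper (not part of the original scripts):
# print the name of every listed command that is not found on PATH.
missing_tools() {
    for tool in "$@"
    do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# gs is called by pdf2tif; tesseract and pdf2tif are called by ocr.sh.
missing=`missing_tools gs tesseract pdf2tif`
if [ -n "$missing" ]
then
    echo "Not found on PATH:" $missing 1>&2
fi
</code>

If nothing is printed, all three commands were found.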

===== pdf2tif =====
<code bash>
#!/bin/sh
# Derived from pdf2ps.
# $Id: pdf2tif,v 1.0 2006/11/03 Fred Smith
# Convert PDF to TIFF file.

OPTIONS=""
while true
do
case "$1" in
-?*) OPTIONS="$OPTIONS $1" ;;
*) break ;;
esac
shift
done

if [ $# -eq 2 ]
then

    outfile=$2

elif [ $# -eq 1 ]
then

    outfile=`basename "$1" .pdf`-%04d.tif 

else

    echo "Usage: `basename $0` [gs options] input.pdf [output.tif]" 1>&2

    exit 1

fi

# Doing an initial 'save' helps keep fonts from being flushed between pages.
# We have to include the options twice because -I only takes effect if it
# appears before other options.
exec gs $OPTIONS -q -dNOPAUSE -dBATCH -dSAFER -r300x300 -sDEVICE=tiffg3 "-sOutputFile=$outfile" $OPTIONS -c save pop -f "$1"
</code>
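
When no output name is given, ''pdf2tif'' builds the pattern ''NAME-%04d.tif'', and Ghostscript replaces ''%04d'' with the zero-padded page number. The sketch below only illustrates the resulting file names, with ''printf'' standing in for Ghostscript; ''page_name'' is a hypothetical helper, not something the script defines.

<code bash>
#!/bin/sh
# Hypothetical helper: show the file name that pdf2tif's default
# pattern yields for a given PDF and page number.
page_name() {
    base=`basename "$1" .pdf`
    printf '%s-%04d.tif\n' "$base" "$2"
}

page_name /tmp/report.pdf 1    # prints report-0001.tif
page_name /tmp/report.pdf 12   # prints report-0012.tif
</code>

Because the numbers are zero-padded, a glob such as ''report-*.tif'' lists the pages in order for documents of up to 9999 pages.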

===== ocr.sh =====
<code bash>
#!/bin/sh

# Takes one parameter, the path to a PDF file to be processed.
# Uses the custom script 'pdf2tif' to generate the .tif files
# at 300x300 dpi in the current directory, then runs
# $progdir/tesseract on each of them.
# The lines that delete the .raw and .map files tesseract may
# leave behind are commented out below.

pdf2tif "$1"

# edit this to point to wherever you've got your tesseract binary
progdir=/usr/bin

for j in *.tif
do
    x=`basename "$j" .tif`
    echo "Processing $j"
    "${progdir}/tesseract" "$j" "$x"
    #rm "${x}.raw"
    #rm "${x}.map"

    # Comment out the next line to keep the .tif files when done.
    rm "$j"
done
</code>
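
''ocr.sh'' leaves one text file per page (''NAME-0001.txt'', ''NAME-0002.txt'', ...), since tesseract appends ''.txt'' to the output base it is given. If a single text file is more convenient, the pages can be joined afterwards. The following is a minimal sketch rather than part of the original scripts; ''combine_pages'' is a hypothetical helper, and it assumes the default four-digit naming produced by ''pdf2tif''.

<code bash>
#!/bin/sh
# Hypothetical helper: join the per-page text files written by
# tesseract (NAME-0001.txt, NAME-0002.txt, ...) into one NAME.txt.
combine_pages() {
    base=`basename "$1" .pdf`
    # Zero-padded page numbers make the glob expand in page order.
    cat "$base"-[0-9][0-9][0-9][0-9].txt > "$base.txt"
}

# Usage, after ocr.sh has been run on file.pdf in this directory:
# combine_pages file.pdf
</code>

The glob matches only the numbered page files, so the combined ''NAME.txt'' is never read back into itself.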

===== Links =====

  * [[http://www.groklaw.net/articlebasic.php?story=20061210115516438|Source [groklaw]]]