OCR unsearchable PDF files
PDF files of text that aren't searchable are a real pain. With a very nice OCR program called Tesseract, converting them to plain text is relatively straightforward. Each page needs to be converted to a TIFF file, and then these TIFF files get passed through tesseract, which writes one text file per page. The two scripts below automate this process.
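To use, you'll need imagemagick and tesseract. pdf2tif calls Ghostscript (the gs command) directly, so that must be installed as well; ImageMagick also relies on it to read PDFs. Put pdf2tif somewhere on your PATH, since ocr.sh calls it by name. Then run ocr.sh file.pdf to process a PDF file.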
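Before a first run it can help to confirm that the required commands are actually reachable. The following is only a sketch: the function name missing_tools is made up for illustration and is not part of the scripts on this page. It prints the name of every command it cannot find on the PATH.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original scripts):
# print the name of each given command that is not found on PATH.
missing_tools() {
    for tool in "$@"
    do
        # 'command -v' is the POSIX way to test whether a command exists
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

For this page's scripts the call would be missing_tools gs tesseract pdf2tif; empty output means everything was found.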
pdf2tif
#!/bin/sh
# Derived from pdf2ps.
# $Id: pdf2tif,v 1.0 2006/11/03 Fred Smith
# Convert PDF to TIFF file.

# Collect any leading options so they can be passed on to gs.
OPTIONS=""
while true
do
    case "$1" in
    -?*) OPTIONS="$OPTIONS $1" ;;
    *)  break ;;
    esac
    shift
done

if [ $# -eq 2 ]
then
    outfile=$2
elif [ $# -eq 1 ]
then
    # One file per page: input-0001.tif, input-0002.tif, ...
    outfile="$(basename "$1" .pdf)-%04d.tif"
else
    echo "Usage: $(basename "$0") [gs-options] input.pdf [output.tif]" 1>&2
    exit 1
fi

# Doing an initial 'save' helps keep fonts from being flushed between pages.
# We have to include the options twice because -I only takes effect if it
# appears before other options.
exec gs $OPTIONS -q -dNOPAUSE -dBATCH -dSAFER -r300x300 -sDEVICE=tiffg3 "-sOutputFile=$outfile" $OPTIONS -c save pop -f "$1"
ocr.sh
#!/bin/sh
# Takes one parameter, the path to a PDF file to be processed.
# Uses the custom script 'pdf2tif' to generate the tif files
# at 300x300 dpi and drops them in the current directory,
# then runs $progdir/tesseract on them. The lines that delete the
# .raw and .map files that tesseract drops are commented out.

pdf2tif "$1"

# edit this to point to wherever you've got your tesseract binary
progdir=/usr/bin

# note: this processes (and deletes) every .tif in the current directory
for j in *.tif
do
    x=$(basename "$j" .tif)
    echo "Processing $j"
    "${progdir}/tesseract" "$j" "$x"
    #rm "${x}.raw"
    #rm "${x}.map"
    # comment out the next line if you want to keep the .tif files when done.
    rm "$j"
done
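ocr.sh leaves one text file per page (file-0001.txt, file-0002.txt, and so on), because tesseract appends .txt to the output base name it is given. If a single text file is wanted, something along these lines could join them. This is a sketch only: the function name join_pages is hypothetical, and it assumes the four-digit page numbering that pdf2tif uses by default.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original scripts): join the per-page
# text files base-0001.txt, base-0002.txt, ... into a single base.txt.
join_pages() {
    base=$1
    out="${base}.txt"
    : > "$out"    # create or empty the output file
    # The zero-padded page numbers make the glob expand in page order.
    for page in "${base}"-[0-9][0-9][0-9][0-9].txt
    do
        [ -f "$page" ] || continue    # the glob matched nothing
        cat "$page" >> "$out"
    done
}
```

After running ocr.sh file.pdf, calling join_pages file in the same directory would write file.txt.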