====== OCR unsearchable PDF files ======
PDF files whose text isn't searchable are a real pain. With the OCR program [[http://code.google.com/p/tesseract-ocr/|Tesseract]], converting them to plain text is relatively straightforward: each page is converted to a TIFF file, and each TIFF file is then passed through Tesseract. The two scripts below automate this process.
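To use them, you'll need Ghostscript and Tesseract. ''pdf2tif'' calls ''gs'' directly, and ImageMagick relies on the same program for PDF input. Put both scripts on your PATH, make them executable, then run ''ocr.sh file.pdf'' to process a PDF file.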
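Before running anything, it can help to confirm that the required commands are actually installed. The snippet below is a minimal sketch of such a check; the function name ''missing_tools'' is made up for this example and is not part of the original scripts.

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
# 'missing_tools' is a hypothetical helper, not part of pdf2tif or ocr.sh.
missing_tools() {
    for tool in "$@"
    do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# The two scripts on this page need gs, tesseract and pdf2tif itself.
missing=$(missing_tools gs tesseract pdf2tif)
if [ -n "$missing" ]
then
    echo "Not found on PATH:" $missing 1>&2
fi
```

If nothing is printed, all three commands were found.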
===== pdf2tif =====
#!/bin/sh
# Derived from pdf2ps.
# $Id: pdf2tif,v 1.0 2006/11/03 Fred Smith
# Convert PDF to TIFF file.
OPTIONS=""
while true
do
case "$1" in
-?*) OPTIONS="$OPTIONS $1" ;;
*) break ;;
esac
shift
done
if [ $# -eq 2 ]
then
outfile=$2
elif [ $# -eq 1 ]
then
outfile=`basename "$1" .pdf`-%04d.tif
else
echo "Usage: `basename $0` [gs options] input.pdf [output-%04d.tif]" 1>&2
exit 1
fi
# Doing an initial 'save' helps keep fonts from being flushed between pages.
# We have to include the options twice because -I only takes effect if it
# appears before other options.
exec gs $OPTIONS -q -dNOPAUSE -dBATCH -dSAFER -r300x300 -sDEVICE=tiffg3 "-sOutputFile=$outfile" $OPTIONS -c save pop -f "$1"
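When no output name is given, pdf2tif builds one from the PDF's base name plus ''-%04d.tif'', and Ghostscript replaces ''%04d'' with the zero-padded page number, starting at 1. The sketch below only illustrates the resulting file names; ''page_name'' is a hypothetical helper and does not call Ghostscript.

```shell
#!/bin/sh
# Show which TIFF file name pdf2tif's default pattern produces for a page.
# 'page_name' is a hypothetical helper used only for illustration.
# $1 = path to the PDF, $2 = page number (1-based, as Ghostscript counts)
page_name() {
    printf '%s-%04d.tif\n' "`basename "$1" .pdf`" "$2"
}

page_name /tmp/report.pdf 3
```

Note that the directory part of the input path is dropped, so the TIFF files land in the current directory.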
===== ocr.sh =====
#!/bin/sh
# takes one parameter, the path to a pdf file to be processed.
# uses custom script 'pdf2tif' to generate the tif files,
# generates them at 300x300 dpi.
# drops them in our current directory
# then runs $progdir/tesseract on them, deleting the .raw
# and .map files that tesseract drops.
pdf2tif "$1"
# edit this to point to wherever you've got your tesseract binary
progdir=/usr/bin
for j in *.tif
do
x=`basename "$j" .tif`
echo "Processing $j"
${progdir}/tesseract "$j" "$x"
#rm ${x}.raw
#rm ${x}.map
#comment out the next line if you want to keep the .tif files when done.
rm "$j"
done
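Tesseract writes one ''.txt'' file per page, named after the TIFF file it read. If a single text file for the whole document is more useful, the pages can be concatenated afterwards. This is a minimal sketch assuming the default four-digit names produced by pdf2tif; ''merge_pages'' is a hypothetical helper, not part of the original scripts.

```shell
#!/bin/sh
# Join the per-page text files for one PDF into a single file.
# 'merge_pages' is a hypothetical helper, not part of ocr.sh.
# $1 = path to the PDF that was processed; run it in the directory
#      holding NAME-0001.txt, NAME-0002.txt, ...
merge_pages() {
    base=`basename "$1" .pdf`
    # Zero-padded page numbers make the glob expand in page order.
    cat "$base"-[0-9][0-9][0-9][0-9].txt > "$base.txt"
}

# Example: merge_pages file.pdf   (writes file.txt)
```

The merged file name has no four-digit suffix, so it does not match the glob and running the function again simply overwrites it.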
===== Links =====
* [[http://www.groklaw.net/articlebasic.php?story=20061210115516438|Source [groklaw]]]