Tuesday, April 20, 2010

Convert Scanned PDF Documents to Text Without Waiting for Google's Bots

Supporting old scientific hardware sometimes brings me some challenges. Usually the manuals exist only on paper, and when there is a digital version, it is just a scan with no searchable text.

A little googling turned up an article whose approach relies on waiting for Google to index your files and OCR them. But there is an open source alternative.

Looking a little further, I found two more articles, one at linuxquestions.org and one at linux.com, which pointed me to tesseract-ocr. To solve my problem, I first had to convert my PDF file into a set of TIFF images and then OCR them with tesseract, like this:

gs -dNOPAUSE -sDEVICE=tiffg4 -r300x300 -sOutputFile=page%03d.tif -- 1850_operators_manual.pdf

for f in page*.tif; do tesseract "$f" "${f%.tif}" -l eng; done
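
If you do this often, the two steps can be wrapped in a couple of small shell functions. This is only a sketch: the function names pdf_to_tiffs and ocr_pages and the merged output file are my own assumptions, not something that ships with gs or tesseract. It also adds -dBATCH so that gs exits when it is done, and it relies on tesseract appending ".txt" to the output base name it is given.

```shell
#!/bin/sh
# Sketch: split a scanned PDF into TIFF pages, OCR each page, merge the text.
# Function names and output file names are illustrative assumptions.

# Render every page of the PDF ($1) as a 300 dpi TIFF: page001.tif, page002.tif, ...
pdf_to_tiffs() {
    gs -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r300x300 \
       -sOutputFile=page%03d.tif "$1"
}

# OCR each page*.tif in the current directory and append the text,
# in page order, to the file named by $1.
ocr_pages() {
    : > "$1"                               # start with an empty output file
    for tif in page*.tif; do
        [ -e "$tif" ] || continue          # no TIFFs found: the glob stays literal
        base=${tif%.tif}                   # page001.tif -> page001
        tesseract "$tif" "$base" -l eng || return 1
        cat "$base.txt" >> "$1"            # tesseract wrote page001.txt
    done
}
```

After sourcing the file, something like "pdf_to_tiffs 1850_operators_manual.pdf && ocr_pages manual.txt" should leave one text file per page plus the whole manual in manual.txt.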

Hope this helps somebody...

