Supporting old scientific hardware sometimes brings me some challenges. Usually the manuals exist only on paper, and when there is a digital version, it is only a scan of the paper pages. A little googling turned up an article whose approach relies on waiting for Google to index your files and OCR them. But there is an open source alternative. Looking a little further, I found two more articles, at linuxquestions.org and at linux.com, which pointed me to tesseract-ocr.

So, to solve my issue, I first had to convert my PDF file to a bunch of TIF images, one per page, with Ghostscript:

    gs -dNOPAUSE -dBATCH -sDEVICE=tiffgray -r300x300 -sOutputFile=page%03d.tif 1850_operators_manual.pdf

And then OCR each image with tesseract, which writes one .txt file per page:

    for f in page*.tif; do tesseract "$f" "${f%.tif}" -l eng; done

Hope this helps somebody...
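If you do this often, the two steps above can be wrapped in a small script. What follows is only a sketch, assuming Ghostscript (gs) and tesseract are installed and on the PATH. The function names output_base and pdf_to_text, the temporary working directory, and the default output name manual.txt are my own choices for illustration and are not part of either tool.

```shell
#!/bin/sh
# Sketch: OCR every page of a scanned PDF and merge the text into one file.
# Assumes gs and tesseract are installed; all names here are illustrative.

# Strip the .tif extension, so tesseract writes page001.txt
# instead of page001.tif.txt.
output_base() {
    printf '%s\n' "${1%.tif}"
}

# Usage: pdf_to_text INPUT.pdf [OUTPUT.txt]
pdf_to_text() {
    pdf=$1
    out=${2:-manual.txt}
    workdir=$(mktemp -d) || return 1

    # Render each PDF page as a 300 dpi grayscale TIFF.
    gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffgray -r300x300 \
       -sOutputFile="$workdir/page%03d.tif" "$pdf" || return 1

    # OCR each page; tesseract appends .txt to the output base name.
    for tif in "$workdir"/page*.tif; do
        tesseract "$tif" "$(output_base "$tif")" -l eng || return 1
    done

    # Zero-padded names sort in page order (up to 999 pages),
    # so a plain glob concatenates the pages correctly.
    cat "$workdir"/page*.txt > "$out"
    rm -rf "$workdir"
}
```

After sourcing the file, calling pdf_to_text 1850_operators_manual.pdf manual.txt should leave a single manual.txt that you can search with grep. The intermediate images go into a temporary directory so they do not clutter the current one.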