Convert Scanned PDF Documents to Text without having to wait for Google bots

Supporting old scientific hardware brings its share of challenges. The manuals usually exist only on paper, and when there is a digital version it is a scan rather than searchable text. A little googling turned up an article whose approach relies on waiting for Google to index your files and OCR them. But there is an open source alternative: looking a little further, I found two more articles, at linuxquestions.org and at linux.com, which led me to tesseract-ocr.

To solve my issue, I first converted my PDF file to a set of TIFF images with Ghostscript, one per page, and then ran tesseract on each of them. This way:

    gs -dNOPAUSE -sDEVICE=tiffgray -r300x300 -sOutputFile=page%03d.tif -- 1850_operators_manual.pdf
    for f in page*.tif; do tesseract "$f" "$f" -l eng; done

tesseract appends .txt to the output name it is given, so the text of each page ends up in page001.tif.txt, page002.tif.txt, and so on.

Hope this helps somebody...
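For manuals with many pages it may be handy to wrap both steps in one function and merge the per-page text into a single file. The following is only a sketch of that idea, not part of the original recipe: the function name ocr_pdf, the merged output file, and the optional language argument are my own assumptions, and it adds -dBATCH so Ghostscript exits after the last page when the PDF is passed as a plain argument. It works in the current directory and expects gs and tesseract to be on the PATH.

```shell
#!/bin/sh
# Sketch: OCR every page of a scanned PDF and merge the text into one file.
# ocr_pdf is a hypothetical helper name; gs and tesseract must be installed.
ocr_pdf() {
    pdf=$1            # input PDF, e.g. 1850_operators_manual.pdf
    out=$2            # merged text file to create
    lang=${3:-eng}    # tesseract language code, defaults to English

    # Step 1: render each page to a 300 dpi grayscale TIFF (page001.tif, ...).
    gs -dNOPAUSE -dBATCH -sDEVICE=tiffgray -r300x300 \
       -sOutputFile=page%03d.tif "$pdf" || return 1

    # Step 2: OCR each page in order and append its text to the merged file.
    : > "$out"
    for f in page*.tif; do
        [ -e "$f" ] || continue          # no pages were rendered
        # tesseract appends .txt to the output base name (page001 -> page001.txt)
        tesseract "$f" "${f%.tif}" -l "$lang" || return 1
        cat "${f%.tif}.txt" >> "$out"
    done
}
```

After sourcing the file, a call such as ocr_pdf 1850_operators_manual.pdf manual.txt would leave the page images, the per-page text files, and the merged manual.txt in the current directory. The zero-padded page numbers keep the shell glob in page order for up to 999 pages.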