Supporting old scientific hardware sometimes brings me some challenges. Usually the manuals exist only on paper, and when there is a digital version, it is just a scan of the paper one.
Googling a little, I came across an article that relies on waiting for Google to index your files and OCR them. But there is an open source alternative.
Looking a little further, I found two more articles, at linuxquestions.org and at linux.com, which pointed me to tesseract-ocr. So to solve my issue, I first had to convert my PDF file to a bunch of TIFF images, and then OCR them with tesseract. This way:
gs -dNOPAUSE -sDEVICE=tiffgray -r300x300 -sOutputFile=page%03d.tif -- 1850_operators_manual.pdf
for f in page*.tif; do tesseract "$f" "${f%.tif}" -l eng; done
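The two steps above can be wrapped in a small script. This is only a minimal sketch, assuming Ghostscript (gs) and tesseract are on the PATH and that the working directory holds no unrelated page*.tif or page*.txt files. The helper names output_base and ocr_pdf are made up for illustration, and joining the pages into one text file is an extra step that is not part of the commands above.

```shell
#!/bin/sh
# Sketch: OCR a scanned PDF into one text file per page, plus a combined file.
# Assumes Ghostscript (gs) and tesseract are installed and on PATH.

# Hypothetical helper: strip the .tif extension to get tesseract's output base.
output_base() {
    printf '%s\n' "${1%.tif}"
}

# Hypothetical helper: rasterize the PDF, OCR each page, join the results.
# $1 = input PDF, $2 = tesseract language code (defaults to eng)
ocr_pdf() {
    pdf=$1
    lang=${2:-eng}

    # One 300 dpi grayscale TIFF per page: page001.tif, page002.tif, ...
    gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffgray -r300x300 \
       -sOutputFile=page%03d.tif "$pdf" || return 1

    for f in page*.tif; do
        [ -e "$f" ] || continue
        # tesseract appends .txt to the output base by itself
        tesseract "$f" "$(output_base "$f")" -l "$lang" || return 1
    done

    # Zero-padded names sort in page order, so the glob keeps pages in sequence.
    cat page*.txt > "${pdf%.pdf}.txt"
}

# Run only when a PDF is given on the command line.
if [ "$#" -ge 1 ]; then
    ocr_pdf "$@"
fi
```

Saved as, say, ocr_pdf.sh (a hypothetical name), it could be run as sh ocr_pdf.sh 1850_operators_manual.pdf, leaving the per-page text files next to a combined 1850_operators_manual.txt that grep can search.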
Hope this helps somebody...
Does it keep the formatting?
It keeps the line breaks, page breaks and the blank lines between the paragraphs.
Now, there's an app for that... ;-)
https://github.com/jbarlow83/OCRmyPDF