Convert Scanned PDF Documents to Text without having to wait for Google bots

Supporting old scientific hardware sometimes brings me some challenges. Usually the manuals exist only on paper, and when there is a digital version, it is just a scan of the paper pages.

A little googling turned up an article whose approach relies on waiting for Google to index your files and OCR them. But there is an open source alternative.

Looking a little further, I found two more articles, one at linuxquestions.org and one at linux.com, which pointed me to tesseract-ocr. To solve my problem, I first had to convert the PDF file into a set of TIFF images and then OCR them with tesseract, like this:

gs -dNOPAUSE -sDEVICE=tiffgray -r300x300 -sOutputFile=page%03d.tif -- 1850_operators_manual.pdf

for f in page*.tif; do tesseract "$f" "$f" -l eng; done
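The two steps above can also be wrapped in a small shell function. This is only a sketch: the name ocr_pdf, its argument order, the default prefix and the combined output file are my own assumptions, not something from the articles mentioned. It also adds -dBATCH and -q to the gs call and names the text files page001.txt instead of page001.tif.txt.

```shell
#!/usr/bin/env bash
# Sketch: OCR a scanned PDF page by page with Ghostscript and Tesseract.
# ocr_pdf, its arguments and the *_all.txt file are illustrative choices.

ocr_pdf() {
    local pdf=$1               # scanned PDF to convert
    local prefix=${2:-page}    # prefix for the per-page files
    local lang=${3:-eng}       # Tesseract language code

    # Render every page as a 300 dpi grayscale TIFF: page001.tif, page002.tif, ...
    gs -dNOPAUSE -dBATCH -q -sDEVICE=tiffgray -r300x300 \
        -sOutputFile="${prefix}%03d.tif" "$pdf" || return 1

    # Start with an empty combined file, then OCR the pages in order.
    : > "${prefix}_all.txt"
    local tif
    for tif in "${prefix}"[0-9]*.tif; do
        [ -e "$tif" ] || continue    # the glob matched nothing
        # Tesseract appends .txt to the output base name it is given.
        tesseract "$tif" "${tif%.tif}" -l "$lang" || return 1
        cat "${tif%.tif}.txt" >> "${prefix}_all.txt"
    done
}
```

Calling ocr_pdf 1850_operators_manual.pdf in an empty directory should leave one .tif and one .txt per page, plus page_all.txt with the whole manual. The zero-padded page numbers keep the files in order up to 999 pages.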

Hope this helps somebody...

Comments

Fernando M de Bittencourt, April 20, 2010 at 4:57 PM
Does it keep the formatting?
Filipi Vianna, April 21, 2010 at 10:31 AM
It keeps the line breaks, page breaks and the blank lines between the paragraphs.
Filipi Vianna, May 22, 2017 at 6:51 AM
Now, there's an app for that... ;-)
https://github.com/jbarlow83/OCRmyPDF