Convert Scanned PDF Documents to Text Without Waiting for Google's Bots

Supporting old scientific hardware sometimes brings me some challenges. Usually the manuals exist only on paper, and when there is a digital version it is just a scan of the paper one.

Googling a little, I came across an article whose approach relies on waiting for Google to index your files and OCR them. But there is an open source alternative.

Looking a little further, I found two more articles, at linuxquestions.org and at linux.com, which pointed me to tesseract-ocr. To solve my issue, I first had to convert my PDF file to a set of TIF images and then OCR them with tesseract, like this:

gs -dNOPAUSE -sDEVICE=tiffgray -r300x300 -sOutputFile=page%03d.tif -- 1850_operators_manual.pdf

for f in page*.tif; do tesseract "$f" "${f%.tif}" -l eng; done
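
As a rough sketch, the two steps above can be wrapped in small shell functions, with a third one that joins the per-page output into a single file. The function names and the merged file name are my own illustrative choices and not part of the original commands. I also assume -dBATCH is wanted so that gs exits after the last page, and that the pages are named page001.tif, page002.tif and so on.

```shell
#!/bin/sh
# Sketch only: scanned PDF -> per-page TIFFs -> per-page text -> one text file.
# Function names and the "page" prefix are illustrative assumptions.

# Render every page of the PDF given as $1 to a 300 dpi grayscale TIFF
# (page001.tif, page002.tif, ...).
pdf_to_tiffs() {
    gs -dNOPAUSE -dBATCH -sDEVICE=tiffgray -r300x300 \
       -sOutputFile=page%03d.tif "$1"
}

# OCR each TIFF in the current directory. tesseract adds the ".txt"
# extension to the output base name by itself.
ocr_tiffs() {
    for f in page*.tif; do
        [ -e "$f" ] || continue   # nothing matched: the glob stays literal
        tesseract "$f" "${f%.tif}" -l eng
    done
}

# Concatenate the per-page text files, in page order, into the file named by $1.
merge_text() {
    cat page*.txt > "$1"
}
```

With these defined, something like `pdf_to_tiffs 1850_operators_manual.pdf && ocr_tiffs && merge_text manual.txt` should leave the whole manual in manual.txt. The zero-padded page numbers keep the glob in page order for up to 999 pages.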

Hope this helps somebody...

Comments

  1. It keeps the line breaks, page breaks and the blank lines between the paragraphs.

  2. Now, there's an app for that... ;-)
    https://github.com/jbarlow83/OCRmyPDF
