Convert Scanned PDF Documents to Text Without Waiting for Google's Bots

Supporting old scientific hardware brings its share of challenges. The manuals usually exist only on paper, and when there is a digital version it is just a scan of the pages, with no searchable text.

A quick search turned up an article whose approach relies on waiting for Google to index your files and OCR them. But there is an open source alternative.

Looking a little further, I found two more articles, at linuxquestions.org and at linux.com, which pointed me to tesseract-ocr. To solve my problem, I first had to convert the PDF file into a set of TIFF images, one per page, and then OCR them with tesseract, like this:

gs -dNOPAUSE -sDEVICE=tiffgray -r300x300 -sOutputFile=page%03d.tif -- 1850_operators_manual.pdf

ls -1 *.tif | cut -d e -f 2 | while read line ; do tesseract "page"$line "page"$line -l eng; done
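The one-liner above works, but because it passes the same name as both input and output base, tesseract writes files named like page001.tif.txt. Below is a minimal sketch of the same loop as a shell function that globs the images directly and strips the extension, so the result is page001.txt. It assumes the page%03d.tif naming produced by the gs command; the function name ocr_pages and the OCR_BIN override are hypothetical, added only for illustration.

```shell
#!/bin/sh
# Sketch: OCR every page image in a directory, one text file per page.
# Assumes files named page*.tif, as produced by the gs command above.
# OCR_BIN is a hypothetical override for illustration; it defaults to tesseract.
ocr_pages() {
    dir=${1:-.}
    for img in "$dir"/page*.tif; do
        # If nothing matched, the glob stays literal; skip it.
        [ -e "$img" ] || continue
        # tesseract appends .txt to the output base: page001.tif -> page001.txt
        "${OCR_BIN:-tesseract}" "$img" "${img%.tif}" -l eng
    done
}
```

Calling `ocr_pages .` in the directory holding the TIFF files should leave one .txt file next to each image; something like `cat page*.txt > manual.txt` (the output name is arbitrary) would then join them into a single file.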

Hope this helps somebody...

Comments

  1. It keeps the line breaks, page breaks and the blank lines between the paragraphs.

  2. Now, there's an app for that... ;-)
    https://github.com/jbarlow83/OCRmyPDF
