
Convert Scanned PDF Documents to Text Without Waiting for Google Bots

Supporting old scientific hardware sometimes brings me some challenges. Usually the manuals exist only on paper, and when there is a digital version, it is just a scan of the paper pages.

After a little googling, I came across an article whose approach relies on waiting for Google to index your files and OCR them. But there is an open source alternative.

Looking a little further, I found two more articles, at linuxquestions.org and at linux.com, which led me to tesseract-ocr. To solve my problem, I first had to convert my PDF file into a bunch of TIF images and then OCR them with tesseract, like this:

gs -dNOPAUSE -sDEVICE=tiffgray -r300x300 -sOutputFile=page%03d.tif -- 1850_operators_manual.pdf

for f in page*.tif; do tesseract "$f" "$f" -l eng; done
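The two commands above could also be wrapped in a small shell function that renders the pages into a temporary directory, OCRs them, and joins the results into a single text file. This is only a sketch: the function name pdf_to_text, its arguments, and the use of a temporary directory are my own assumptions, not part of the original recipe, and it assumes gs and tesseract are both on the PATH.

```shell
#!/usr/bin/env bash
# Sketch: OCR a scanned PDF into one text file with Ghostscript and tesseract.
# pdf_to_text and its arguments are illustrative names, not from the post.

pdf_to_text() {
    local pdf="$1"          # scanned input PDF
    local out="$2"          # combined text file to create
    local lang="${3:-eng}"  # tesseract language code, defaults to English
    local tmp page

    tmp="$(mktemp -d)" || return 1

    # Render every page as a 300 dpi grayscale TIFF: page001.tif, page002.tif, ...
    gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffgray -r300x300 \
       -sOutputFile="$tmp/page%03d.tif" "$pdf" || { rm -rf "$tmp"; return 1; }

    # OCR each page; tesseract appends ".txt" to the output base name.
    for page in "$tmp"/page*.tif; do
        tesseract "$page" "${page%.tif}" -l "$lang" || { rm -rf "$tmp"; return 1; }
    done

    # Zero-padded names sort in page order (up to 999 pages),
    # so the glob concatenates the pages in the right sequence.
    cat "$tmp"/page*.txt > "$out"
    rm -rf "$tmp"
}
```

With the function loaded in the current shell, a call such as pdf_to_text 1850_operators_manual.pdf manual.txt would leave the whole manual in manual.txt and clean up the intermediate images.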

Hope this helps somebody...

Comments

  1. It keeps the line breaks, page breaks and the blank lines between the paragraphs.

  2. Now, there's an app for that... ;-)
    https://github.com/jbarlow83/OCRmyPDF
