Wednesday, January 30, 2013

Scan Tailor

For quite some time now, I've been putting all the readings (except books) for my various classes online, on my course websites (such as here or here). In the good cases, this just means downloading stuff from JSTOR or some equivalent, or linking directly to it, though I very often convert the PDFs I get that way to DjVu files, since (a) the DjVu files are often a lot smaller (like one tenth the size), and (b) I prefer to have files where two pages appear on a single page, so that when they are printed you don't waste so much paper.
I've developed a number of tools to make this easier to do. I'll blog about them later. (Some of them are already available here.) What I wanted to mention today is a really, really cool program I found last weekend, called "Scan Tailor". It's "free and open source" (GPLv3) and available for Linux, OSX, and Windoze.
I've found that the easiest way to scan stuff I need from books (e.g., Dummett's "Frege's Myth of the Third Realm", from Frege and Other Philosophers) is first to make a photocopy of the paper and then to scan that. (This is easiest with a sheet-fed scanner, such as the HP Scanjet 5590, which works fine with Linux.) You can simply make a PDF or DjVu from the result, but, if you just do that, it will usually look like crap. For one thing, you get big black marks along the sides and, often, down the middle (where the spine of the book is). And the pages can be hard to read on screen, since they are often not quite square. Even half a degree's rotation is easy to see, and very annoying.
So, in the past, I'd load the pages one by one into Gimp (an open source image editor) and fix them up. But this was time-consuming, and hard to get right. That is where Scan Tailor comes to the rescue.
Here's how it works. You put all the page images into some directory, and then you open Scan Tailor and create a new "project", pointing the program at that directory. All the pages then get loaded up. You can then rotate them, if need be. But the really useful bit is that the program will then automatically (i) split the pages in half (which you need to do for the next step), (ii) deskew them (i.e., correct for rotation of the text), (iii) put a bounding box around the actual text (thus eliminating the black marks), (iv) put clean new margins around that text, (v) despeckle the images (remove stray black dots), and then (vi) output the resulting page images to a directory of your choosing, so you can assemble them into a PDF or a DjVu.
And at every one of those steps, you can intervene to make manual corrections, if need be. It totally, totally rocks.
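For the last step, assembling the output pages, here is a minimal sketch in Python. It assumes the output directory holds TIFF files with page numbers in their names, and that the separate img2pdf tool is installed; neither assumption comes from Scan Tailor itself, and the function names are purely illustrative. The one subtle point is ordering: a plain alphabetical sort would put page_10 before page_2.

```python
import re
from pathlib import Path


def natural_key(name):
    """Sort key that compares digit runs as numbers, so page_10 follows page_2."""
    # re.split with a capturing group alternates text, digits, text, ...
    # so the same positions always hold the same types and compare safely.
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def ordered_pages(out_dir, suffixes=(".tif", ".tiff")):
    """Return the page images found directly in out_dir, in reading order."""
    pages = [p for p in Path(out_dir).iterdir()
             if p.suffix.lower() in suffixes]
    return sorted(pages, key=lambda p: natural_key(p.name))


def pdf_command(pages, pdf_path):
    """Build (but do not run) an img2pdf command line for the given pages."""
    # Assumption: img2pdf is on the PATH and takes "-o" for the output file.
    return ["img2pdf", *map(str, pages), "-o", str(pdf_path)]
```

Passing the list returned by pdf_command(ordered_pages("out"), "paper.pdf") to subprocess.run would then build the PDF; a DjVu toolchain could be substituted at that point in the same way.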

2 comments:

  1. Thanks for outlining your step by step use of the software. I hadn't learned to string together some of the commands like you wrote about here.

    I photographed one book that was too large and fragile to scan. It took a LOT of work just to straighten the pages. I haven't found a good de-skewer to date, at least for the stuff I scan. Perhaps this one will be better than the rest.

    For processing images, it does manipulations that 4 or more of my image processors will do. The thicken/fatten lines control is a slider like no other. Not contrast, not gamma, etc., but the results are exactly what I want to control.

  2. Even better, you can do almost all of this from the command line, or in a script, and then the whole process is automated. You can then load the result into ScanTailor, and fix what needs fixing manually.

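A rough sketch of what such a script might look like, in Python. It assumes the command-line companion is installed as scantailor-cli and that the flag names below match your version; they are written from memory of the tool's help text, so check scantailor-cli --help before relying on them. The function names and default values are illustrative only.

```python
import shutil
import subprocess


def scantailor_command(image_dir, out_dir, project_file=None,
                       margins_mm=10, dpi=600):
    """Build a scantailor-cli invocation that mirrors the GUI stages."""
    cmd = [
        "scantailor-cli",
        "--layout=2",                  # two book pages per scanned image
        "--deskew=auto",               # correct rotation of the text
        "--content-detection=normal",  # find the box around the text
        f"--margins={margins_mm}",     # clean margins around that box
        "--despeckle=normal",          # remove stray black dots
        f"--output-dpi={dpi}",
    ]
    if project_file:
        # Save a project file so the result can be opened in the GUI
        # and corrected by hand afterwards.
        cmd.append(f"--output-project={project_file}")
    cmd += [str(image_dir), str(out_dir)]
    return cmd


def run(cmd):
    """Run the command, failing early if the program is not installed."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    subprocess.run(cmd, check=True)
```

Calling run(scantailor_command("scans", "out", project_file="book.ScanTailor")) would process everything in scans and leave a project file to load into Scan Tailor for manual fixes.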
