Friday, August 5, 2016

Converting "Stitched" Pages from PDFs

More and more great books (including philosophy) are going out of copyright and so are appearing on public archives, like Project Gutenberg and archive.org. Unfortunately, though, many of the PDFs found there are constructed in a somewhat odd way, with each page consisting of several separate images that get "stitched" together. So if you try to extract the page images to run them through OCR, say, the result looks like you ran the pages through a shredder.

Fortunately, as I noted in an earlier post, we can use ImageMagick to fix this up. I'm having to do this often enough now that I've written a small script to automate the process. Here it is:
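#!/usr/bin/perl
use strict;
use warnings;

# Usage: stitch_pages -n INIT STEP PAGES
#        stitch_pages -s SPLIT1 SPLIT2 ... SPLITn
my $usage = "usage: stitch_pages -n INIT STEP PAGES\n"
          . "       stitch_pages -s SPLIT1 SPLIT2 ... SPLITn\n";
my $mode = @ARGV ? shift @ARGV : '';
my @splits;
if ($mode eq '-n') {
        # Same number of images on every page: compute the split points.
        die $usage unless @ARGV == 3;
        my ($init, $step, $pages) = @ARGV;
        @splits = map { $init + $_ * $step } 0 .. $pages;
} elsif ($mode eq '-s') {
        # Split points given explicitly.
        @splits = @ARGV;
} else {
        die $usage;
}
die $usage if @splits < 2;

# Zero-pad to three digits, to match image names ending in e.g. 007.pbm.
sub normalize {
        return sprintf("%03d", shift);
}

my $start = shift @splits;
my $newpage = 1;
for my $stop (@splits) {
        # Images $start through $stop - 1 make up one page.
        my @files = map { "*" . normalize($_) . ".pbm" } $start .. $stop - 1;
        # The shell expands the globs; -append stacks the images top to bottom.
        my $cmd = "convert " . join(" ", @files) . " -append outpage" . normalize($newpage) . ".tiff";
        print "$cmd\n\n";
        system($cmd) == 0 or warn "convert failed for page $newpage\n";
        $start = $stop;
        $newpage++;
}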

The script can also be downloaded here.

There are two ways to invoke the program.

stitch_pages -n INIT STEP PAGES

In this case, INIT gives the number of the first image (this will usually be 0 or 1); STEP tells how many images are used to construct each page; and PAGES tells how many pages we are constructing.
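
To make the arithmetic concrete, here is a minimal sketch of how the three numbers could be turned into the boundaries between pages. The helper name splits_from_step is hypothetical, made up for this illustration, and the values 0, 4 and 3 are just an example (three pages of four images each, numbered from 0).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn INIT, STEP, PAGES into a list of split points.
# Page k (counting from 0) starts at image INIT + k * STEP; the final entry
# is one past the last image used.
sub splits_from_step {
    my ($init, $step, $pages) = @_;
    return map { $init + $_ * $step } 0 .. $pages;
}

my @splits = splits_from_step(0, 4, 3);
print "@splits\n";    # prints "0 4 8 12"
```

So images 0-3 would form the first page, 4-7 the second, and 8-11 the third.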

Obviously, this assumes that there are the same number of partial images for each page. If that is not true, you can use the other form and specify the "splits" manually.

stitch_pages -s SPLIT1 SPLIT2 ... SPLITn

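In this case, we will stitch together the partial images numbered SPLIT1 through SPLIT2 - 1 to make the first page, those numbered SPLIT2 through SPLIT3 - 1 to make the second, and so on. The last split given should thus be one greater than the number of the last image available.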
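
As a rough illustration of how a list of splits maps onto pages, the sketch below pairs up consecutive split points into image ranges. The helper name ranges_from_splits and the example splits 0, 3, 8, 12 are assumptions made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pair consecutive split points into [first, last]
# image ranges, one range per output page.
sub ranges_from_splits {
    my @splits = @_;
    my @ranges;
    for my $i (0 .. $#splits - 1) {
        push @ranges, [ $splits[$i], $splits[$i + 1] - 1 ];
    }
    return @ranges;
}

my $page = 1;
for my $r (ranges_from_splits(0, 3, 8, 12)) {
    printf "page %d: images %d through %d\n", $page++, @$r;
}
# prints:
# page 1: images 0 through 2
# page 2: images 3 through 7
# page 3: images 8 through 11
```

Note that n splits yield n - 1 pages, and that the pages here need not all use the same number of images.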
