Batch processing image collections

Let me show you how I completed a task that would have taken more than three days of processing in less than six hours, through the magic of command line tools!

Timelapse source material takes up enormous amounts of disk space. I have hundreds of gigabytes of timelapse frames that I want to archive but do not need to keep at their original full resolution, so I am recompressing the individual frames to match the output size (usually HD, 1920×1080 pixels).

The obvious way to do this is using Adobe Bridge, Photoshop batch processing or any other tool that you can use to go through a list of files.

The main problem with this approach is that pretty much all these tools iterate through the files sequentially, which is not very efficient: it uses only a small fraction of the available CPU cores and would take many hours to complete.

We can surely find a way to parallelize the process using command line tools. A bit of searching quickly turned up a proposed solution for a very similar issue that I adapted to my own needs:

find . -name "*JPG" -print0 | xargs -0 -n 1 -P 16 sh -c 'echo "$1" && gm mogrify -resize 1920x1440 "$1"' sh

That looks like a daunting command, but it is actually just a sequence of rather simple tasks that we can investigate individually: find, xargs and gm, plus a bit of “glue” required to stick them together.

find pretty much does what it says on the box: it goes through the specified location and its subdirectories and returns a list of files that match the given criteria. If you don’t tell it what to do with them, it will just list them:

brain:2015 step$ find . -name "*JPG"
./2015-01-30/G0011490.JPG
./2015-01-30/G0011491.JPG
./2015-01-30/G0011492.JPG
./2015-01-30/G0011493.JPG
./2015-01-30/G0011494.JPG
./2015-01-30/G0011495.JPG
( ... many more ... )

On the other end, gm is the GraphicsMagick tool that is told to ‘mogrify’ (change in place) the given file into a version that fits into the specified size, for example:

gm mogrify -resize 1920x1440 G0011490.JPG

This will resize the specified image to fit into a box that is 1920×1440 pixels in size while keeping its proportions. As a result, landscape images that don’t fit the 16:9 proportions of HD (4:3 frames, for example) are still resized such that the longer side is 1920 pixels wide.
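To make the "fit into a box" idea concrete, here is a small sketch of the arithmetic. The fit_box helper and the frame sizes are made up for illustration, and the rounding is my own assumption; gm may round slightly differently.

```shell
# fit_box WIDTH HEIGHT: approximate size after fitting into a 1920x1440 box.
# Hypothetical helper for illustration; this is not part of GraphicsMagick.
fit_box() {
  awk -v w="$1" -v h="$2" 'BEGIN {
    sx = 1920 / w                 # scale needed to match the box width
    sy = 1440 / h                 # scale needed to match the box height
    s = (sx < sy) ? sx : sy       # the smaller scale keeps both sides inside
    printf "%dx%d\n", w * s + 0.5, h * s + 0.5
  }'
}

fit_box 4000 3000   # a 4:3 frame fills the whole box
fit_box 3840 2160   # a 16:9 frame is limited by the box width
fit_box 3000 4000   # a portrait frame is limited by the box height
```

The smaller of the two scale factors wins, which is why a 16:9 frame comes out at HD size even though the box itself is 4:3.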

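Finally, xargs is the ‘glue’ between the two commands; it collects the list of files returned by find and starts a gm call for each of them, running several of those calls at the same time, which achieves the parallelism we were looking for. The parameters -0 -n 1 -P 16 tell xargs what kind of input it should expect and how it should divide it up for the calls to gm: -0 reads file names separated by NUL characters (matching the -print0 of find, so names containing spaces survive intact), -n 1 passes one file name to each call, and -P 16 runs up to 16 calls in parallel. The trailing sh fills $0 of the small inline script, so the file name arrives as $1.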
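Before letting a command like this loose on real files, it can be worth doing a dry run. The sketch below is one way to do that: it builds a scratch directory with made-up file names, uses printf as a stand-in for the gm call, and picks the number of parallel jobs from the CPU count instead of hard-coding 16. The directory layout and the variable names are assumptions for the example.

```shell
# Scratch directory with hypothetical frames, one of them in a folder
# whose name contains a space.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/2015-01-30" "$demo_dir/night sky"
touch "$demo_dir/2015-01-30/A.JPG" "$demo_dir/2015-01-30/B.JPG" "$demo_dir/night sky/C.JPG"

# Number of parallel jobs: the CPU count if getconf knows it, else 4.
njobs=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)

# printf stands in for gm here. "$1" is quoted so that the space in
# "night sky" does not split the path into two arguments.
processed=$(find "$demo_dir" -name "*JPG" -print0 |
  xargs -0 -n 1 -P "$njobs" sh -c 'printf "would resize: %s\n" "$1"' sh)

echo "$processed"
count=$(printf '%s\n' "$processed" | wc -l | tr -d ' ')
echo "files seen: $count"

rm -rf "$demo_dir"
```

The order of the printed lines can change from run to run, since the jobs finish whenever they finish.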

A similar variation can be used to convert the images to a different file type – for some timelapses I shot RAW images in Nikon’s NEF format, and we can easily convert them to JPEG by giving different instructions to the gm tool. I am also running this command line through the time utility, which will provide a summary of how much wall time and CPU time the command took to complete:

time find . -name "*NEF" -print0 | xargs -0 -n 1 -P 16 sh -c 'echo "$1" && gm convert -resize 1920x1440 "$1" "$1".jpg' sh
./2013/2013-11-09 Aurora/_STP1558.NEF
./2013/2013-11-09 Aurora/_STP1561.NEF
./2013/2013-11-09 Aurora/_STP1560.NEF
./2013/2013-11-09 Aurora/_STP1559.NEF
./2013/2013-11-09 Aurora/_STP1562.NEF
./2013/2013-11-09 Aurora/_STP1563.NEF
./2013/2013-11-09 Aurora/_STP1564.NEF
./2013/2013-11-09 Aurora/_STP1565.NEF
./2013/2013-11-09 Aurora/_STP1566.NEF
./2013/2013-11-09 Aurora/_STP1567.NEF
( ... )
real 324m10.632s
user 4697m23.947s
sys 129m20.528s

Looking at the last three lines of output, we see that converting 11159 images this way took 324 minutes, or roughly 5½ hours. The user and sys items tell us how much time the CPUs actually spent on the problem (similar to the concept of man-hours in other projects). Since we used multiple CPUs working in parallel, this number is much larger than the actual wall time. For tasks using only one CPU it is usually lower than the real time, because the real time also includes the time the CPU was busy doing other tasks or had to wait for data to be read from or written to disk drives.
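For the curious, the numbers above can be turned into a rough speedup figure. This is a back-of-the-envelope sketch using the whole minutes from the time output (seconds ignored); the variable names are my own.

```shell
# Whole minutes taken from the time output above.
real_min=324
user_min=4697
sys_min=129

# Total CPU time in hours, i.e. roughly what a single core would have needed.
cpu_hours=$(awk -v u="$user_min" -v s="$sys_min" 'BEGIN { printf "%.1f", (u + s) / 60 }')

# CPU time divided by wall time: how many cores were busy on average.
speedup=$(awk -v u="$user_min" -v s="$sys_min" -v r="$real_min" 'BEGIN { printf "%.1f", (u + s) / r }')

echo "CPU time: $cpu_hours hours"
echo "speedup:  ${speedup}x"
```

About 80 hours of CPU time is where the "more than three days" estimate comes from, and a ratio close to 15 suggests the 16 parallel jobs kept the machine busy most of the time.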

Note that this transforms a file like _SMA1650.NEF into _SMA1650.NEF.jpg and keeps the original NEF around. Obviously we are going to remove the NEF file (that can easily be done in the Finder), and we can do some quite similar magic to clean up the file name from .NEF.jpg to .jpg. This could have been included in the command, but it is long and complicated enough as it is and running it separately makes it easier to build up and test the command sequence.
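The file name cleanup could look something like the sketch below. It is tried out on a scratch directory with a made-up file first; to use it for real, point find at the actual archive instead of $demo_dir. Treat it as one possible approach rather than the exact command I ran.

```shell
# Scratch directory with a hypothetical converted frame, so the rename
# can be tried without touching real data.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/2013-11-09 Aurora"
touch "$demo_dir/2013-11-09 Aurora/_SMA1650.NEF.jpg"

# ${1%.NEF.jpg} strips the double extension from the end of the path,
# then .jpg is appended again. mv -n refuses to overwrite an existing file.
find "$demo_dir" -name "*.NEF.jpg" -print0 |
  xargs -0 -n 1 sh -c 'mv -n "$1" "${1%.NEF.jpg}.jpg"' sh

# Show what is left after the rename.
renamed=$(find "$demo_dir" -name "*.jpg")
echo "$renamed"

rm -rf "$demo_dir"
```

Renaming is quick, so there is little to gain from -P here; the same find and xargs skeleton is reused simply because it already copes with spaces in folder names.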

