Digital Photo Archiving: Preserving digital files


We tend to assume that our file formats will be with us forever, often because many of the common ones have managed to stick around for a while. You can still download GIF images that people used to advertise their dial-up BBSes in the eighties and nineties, before everybody was on the Internet.

Film makes it look easy. All forms of film throughout history can be scanned and printed today. My father went through his lifetime collection and found only one roll that had been destroyed so far by the ravages of time. There's a relatively limited set of things you can do to make film last; beyond that, you have to accept that it will eventually fade to nothingness.

These days, we can use digital surrogates (meaning, scans of the film) to help. If the digital files are unreadable in the future, there's always the film to go back to, and if the film is damaged, you still have the scan.

But what about images that are born digital, meaning pictures you took on your digital camera? Most people are already at risk because they have only one copy of their pictures with no backups. Is it worse than that?

The images born in a standard format are easy. My Canon A95 and G7 only shoot JPG files, so that's all I've got. As long as I make sure that I keep my JPG files backed up and copied regularly to fresh media, that's not a problem.

I had a discussion about RAW files with some archivists, and it turns out that they don't realize just how deeply screwed we really are. Their position is that your RAW files should be saved as TIFF files for archiving, because TIFF files are lossless.

There's a problem with this, however. TIFF files are indeed lossless, but the conversion from RAW to TIFF is not a lossless operation. A RAW file contains the simplest, rawest, least processed version of an image. A TIFF is built around images stored as an RGB matrix, where every pixel carries all three primary colors. RAW images are different: they conform to how the sensor works, which generally means one primary color per pixel, sometimes with the sensor arranged in weird patterns. The sensor also responds linearly to light, whereas the eye and film respond roughly logarithmically. The converter therefore has to estimate the missing colors and remap the tones, and those estimates get baked into the TIFF.
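To make the one-color-per-pixel point concrete, here is a minimal sketch in Python. It assumes an RGGB Bayer layout, which is one common arrangement but by no means the only one, and the function names (`bayer_channel`, `mosaic`, `missing_fraction`) are made up for this illustration. It simulates what such a sensor would keep from a tiny RGB scene and counts how much the converter has to guess.

```python
# Sketch: what a Bayer-pattern sensor keeps from a full RGB scene.
# Assumption: an RGGB layout; real cameras may use other arrangements.

def bayer_channel(row, col):
    """Return the channel (0=R, 1=G, 2=B) an RGGB sensor records at (row, col)."""
    if row % 2 == 0:
        return 0 if col % 2 == 0 else 1
    return 1 if col % 2 == 0 else 2

def mosaic(rgb):
    """Reduce an RGB image (rows of (r, g, b) tuples) to one sample per pixel."""
    return [
        [pixel[bayer_channel(r, c)] for c, pixel in enumerate(row)]
        for r, row in enumerate(rgb)
    ]

def missing_fraction(rgb):
    """Fraction of the RGB values that a converter would have to estimate."""
    total = sum(len(row) for row in rgb) * 3
    kept = sum(len(row) for row in mosaic(rgb))
    return (total - kept) / total

# A 2x2 test scene in which every pixel has all three channels.
scene = [
    [(10, 20, 30), (40, 50, 60)],
    [(70, 80, 90), (100, 110, 120)],
]
raw = mosaic(scene)               # one number per pixel
guessed = missing_fraction(scene)  # share of the TIFF that is interpolated
```

In this model two thirds of the values in the eventual TIFF are the converter's estimates rather than sensor measurements, which is why a better converter can get a better picture out of the same RAW file.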

The conversion algorithms needed to go from RAW to TIFF (or any other format) are an evolving science. You can take a RAW file from one of the earliest RAW-format digital cameras and process it with a new converter, and the image will be sharper and have better color. Were you to lose the RAW file (or the ability to read it, for that matter), you would no longer be able to do that.

So, what's the problem with just saving RAW images? The camera manufacturers write their RAW files in undocumented proprietary formats. Nikon even encrypts parts of its RAW files to prevent you from reading them, at least partially to protect the revenue stream from its high-end RAW conversion tool. So photographers can convert their RAW images only at the whim of the camera manufacturer.

There have been some quite well documented cases of manufacturers abandoning their software. Nikon's F5 film camera has the ability to store metainformation on a computer, but Nikon hasn't upgraded the software since the days of Windows 95 and MacOS 7.1. Canon's T-90 has similar abilities, but the software was designed only for the MSX, a hardware platform that caught on only outside of the US and has been completely replaced by the modern PC architecture. Microsoft removed compatibility with the oldest PC software in the 64-bit versions of Windows Vista.

The goal is to save everything: not just the image information, but all of the metainformation. Encoded in any JPG or RAW file straight out of the camera is a substantial amount of useful information, primarily a snapshot of the camera settings at the moment a picture was taken. This is immediately useful to the photographer, but it is also useful to the archivist. It makes it easier for an archivist to place a photographer's work on a timeline, and it also provides provenance. If a photograph was taken with clearly different settings, somebody might have been borrowing the camera (e.g. my camera is only set in "Auto" mode if it's in somebody else's hands), and if the information is inconsistent, the claim of authorship could be fraudulent.

And, while we're talking about provenance, it should also be noted that you can characterize the individual way a sensor works mathematically, so you could further establish if a given file was taken by a camera known to be owned by a specific photographer or not.

The metainformation in JPEG and TIFF files is usually stored as EXIF or IPTC tags, but only some of that information is properly documented. Much useful information is written as MakerNote tags. Often, information that has a standard place in the EXIF tags will end up being stored as a proprietary MakerNote tag instead. For example, most Canon cameras don't put the ISO where the standard dictates, instead putting it in a custom Canon field. They don't consider it a bug because their own Canon-provided software works with it just fine.

There's another problem, slightly more festering. TIFF files have served as a standard format for quite some time, and they have substantial documentation in place. However, there are ways to make a TIFF file that is valid according to the spec but that not every program will read. There are also limits to what the TIFF format can do. Simply suggesting that a concerned photographer save their images as TIFF is not enough; you need to specify a relatively exact set of requirements for the TIFF file. By manipulating the parameters in Photoshop, I have been able to create files that other apps can't read.
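As a rough idea of what "a relatively exact set of requirements" could look like in practice, here is a small Python sketch that reads the first image directory of a TIFF and compares a few tags against a conservative profile. The profile itself (uncompressed, RGB, three interleaved samples per pixel) is my own assumption for illustration, not an established archival standard, and the function names are hypothetical. The tag numbers are the baseline TIFF ones. It only looks at the first directory and does not handle BigTIFF.

```python
import struct

# Hypothetical conservative profile: tag number -> (name, required value).
# These choices are an illustration, not an established archival standard.
PROFILE = {
    259: ("Compression", 1),                # 1 = uncompressed
    262: ("PhotometricInterpretation", 2),  # 2 = RGB
    277: ("SamplesPerPixel", 3),
    284: ("PlanarConfiguration", 1),        # 1 = interleaved (chunky)
}

def read_short_tags(data):
    """Return {tag: value} for single-valued SHORT entries in the first IFD."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF file: bad byte-order mark")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file (magic number is not 42)")
    (count,) = struct.unpack_from(order + "H", data, ifd_offset)
    tags = {}
    for i in range(count):
        entry = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, field_type, n = struct.unpack_from(order + "HHI", data, entry)
        if field_type == 3 and n == 1:   # type 3 = SHORT, stored inline
            (tags[tag],) = struct.unpack_from(order + "H", data, entry + 8)
    return tags

def profile_violations(data):
    """List the reasons a file falls outside PROFILE.

    A missing tag is reported too, even where the TIFF spec defines a
    default, because this profile asks for explicit values.
    """
    tags = read_short_tags(data)
    return [
        f"{name}: expected {wanted}, found {tags.get(tag)}"
        for tag, (name, wanted) in PROFILE.items()
        if tags.get(tag) != wanted
    ]

def build_tiff(tag_values):
    """Build header-only test data: a little-endian TIFF header plus one IFD."""
    entries = b"".join(
        struct.pack("<HHIH2x", tag, 3, 1, value)
        for tag, value in sorted(tag_values.items())
    )
    header = b"II" + struct.pack("<HI", 42, 8)
    return header + struct.pack("<H", len(tag_values)) + entries + struct.pack("<I", 0)

good = build_tiff({259: 1, 262: 2, 277: 3, 284: 1})
lzw = build_tiff({259: 5, 262: 2, 277: 3, 284: 1})  # 5 = LZW compression
good_report = profile_violations(good)
lzw_report = profile_violations(lzw)
```

On a real file you would pass in the bytes read from disk. A check like this says nothing about whether the pixels are any good; it only tells you whether the file stays inside the narrow subset of TIFF you decided to trust.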

It gets worse, too. JPEG and TIFF files are brittle. You have a good guarantee that if you save a copy of an image, you will be able to get the image itself back out of it, but any format extensions specific to an individual app, or any of the proprietary EXIF tags, might not make it. This means that if you use an image management tool to modify some of the metainformation in a file (say, to store a comment or a rotation value), you could corrupt the rest of the metainformation.

The DNG format is an attempt to create, on top of the TIFF file format, a neutral format for different camera manufacturers to use. However, it does not go nearly far enough. The DNG format still allows for undocumented segments in the file, which very likely makes the problem worse, not better. And since the Adobe RAW converter is written without the cooperation of the camera manufacturers, there is a high likelihood that at least some part of the metainformation isn't being decoded all of the time. DNG also inherits many of the problems of the TIFF format: it is brittle, with lots of ways to lose information.

So, where does this leave us? Well, first, you must protect your from-camera original images, RAW, TIFF or JPEG, and no photo management or editing tool should ever modify those images in any way, except under a defined set of circumstances.
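One way to check that nothing has touched the originals is to record a checksum for every file and compare again later. The following is a minimal sketch using only the Python standard library; the function names and the sample file name are hypothetical, and SHA-256 is simply one reasonable choice of hash.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path):
    """Hash a file in chunks so large RAW files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(folder):
    """Map each file's path (relative to folder) to its SHA-256 digest."""
    root = Path(folder)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def changed_files(folder, manifest):
    """Return names that are missing or whose contents no longer match."""
    current = build_manifest(folder)
    return sorted(
        name for name, digest in manifest.items()
        if current.get(name) != digest
    )

# Demonstration with a throwaway folder and a made-up file name.
with tempfile.TemporaryDirectory() as folder:
    original = Path(folder) / "IMG_0001.CR2"
    original.write_bytes(b"original sensor data")
    manifest = build_manifest(folder)
    before = changed_files(folder, manifest)   # nothing has changed yet
    # Simulate a tool rewriting the file in place.
    original.write_bytes(b"original sensor data, rotated")
    after = changed_files(folder, manifest)    # the edit is detected
```

In practice the manifest would be saved somewhere other than the disk holding the images, and re-checked every time the archive is copied to fresh media.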

Second, you must assume that your RAW images are going to be unreadable in the future. The OpenRAW group might succeed in getting manufacturers to make their formats more friendly, but you can't count on it. So you also want to preserve a copy of every RAW image as an archivally formatted TIFF file.

We don't have any good answers right now. The people who have been reverse engineering existing RAW format files, and publishing as much information as they've been able to glean from them, are doing the future a service. However, there's a very high long-term chance that, without active effort, huge amounts of current history will not be there for the historians (or curious great-grandchildren) of 2100.

