Monday, January 1, 2018

A Functional Introduction to GDAL, Part 2: GeoTiff Compression

So you just got a hard drive containing some gorgeous new 6" resolution aerial imagery for your hometown and you can't wait to play with it. Upon opening the imagery folder you're hit with a grim reality: high-res imagery file sizes are huge.

Compression techniques allow us to eliminate redundant data in a raster file in such a way that we can recreate the original file (lossless compression) or enough of the original to look good (lossy compression).

Because of it's ubiquitous support in other software and multitude of compression options, this post will focus on some of the compression options available in the GeoTiff driver. The different compression schemes can be specified with the creation option -co compress=<type>. The GeoTiff driver page has a list of all the available compression schemes.

Lossless and Lossy Compression

Compression methods fall into two general categories: "lossy" and "lossless."

"Lossy" compression schemes throw away chunks of data that they think aren't too important for the sake of smaller file sizes. jpeg compression relies on the way the human brain processes visual imagery to throw away varying amounts of "extraneous" data that (depending on the amount of compression specified) we don't usually notice. This results in files that are drastically smaller than if we had full three 8-bit values for each raster cell.

The reason it's called lossy compression is because you will lose data. Once you use a lossy compression scheme on a raster, you can not recover the original data. You can get something that looks very visually similar, but the underlying values are irrevocably changed.

In contrast, "lossless" compression schemes use patterns and other mathematical magic to compress the data in a way that it can be perfectly reproduced when it's decompressed. The compression ratios for lossless compression schemes usually aren't quite as good as lossy compression schemes, but you keep all of your original data.

As a rule, never use lossy compression on data where the cell value is important, like elevation data or reflectance values in multispectral imaging. It does work great for aerial imagery that you're just trying to use as a basemap, however.

Lossy Compression with JPEG

My preferred aerial imagery compression scheme is jpeg with YcBcR color representation (-co compress=jpeg -co photometric=ycbcr). If you're big into color theory you can probably explain better than I can what YcBcR is and why exactly it helps, but long story short it can sometimes give you better jpeg compression ratios (read: smaller files) for aerial imagery.

You can also play around with the jpeg quality creation option (higher quality = less data loss but larger file size), I've found the default of 75 to work pretty good for the projects I've done without major (or even noticeable) visual degradation. This gives me files that are roughly 5% to 10% the size of what they'd be without any compression. I've seen smaller ECW files from vendors, but this is close enough for what I need.

One gotcha I've come across with jpeg compression is artifacts, both inside the image data area and in the NoData areas along some borders. I'll write more on this later, with visual examples, but here are two guidelines to reduce/eliminate jpeg compression artifacts:
  1. To eliminate artifacts on the outside edges, make sure the original image data area is perfectly rectangular. An image of merged tiles that shift down a couple pixels every tile, creating a stair-step border of white NoData areas when merged together, will show some artifacts in the white NoData areas.
  2. To eliminate artifacts inside the data areas on the right and bottom of your image, make sure the tile/stripe size is a multiple of 8 or 16 (see the note about RGB jpeg compression on the GeoTiff driver page linked above): -co blockxsize=256 -co blockysize=256. If that doesn't work, make sure your total image height and width are multiples of your block size (see this mailing list thread). 

Lossless Compression with Deflate and LZW

When I'm working with large elevation datasets and want compression that preserves the original data, I'll usually use deflate compression (-co compress=deflate). This generally seems to give me higher compression ratios than LZW compression (the other main lossless compression scheme)—I've seen it reduce file size by around 50%, but your mileage may vary based on specific datasets.

You can use the -co predictor and -co num_threads options when working with LZW or deflate compression to speed up the compression. If you're just working with aerial imagery, -co predictor=3 doesn't make sense because your data isn't in floating point format (i.e., it's all whole numbers). These can be useful for very large datasets, but the smallish example dataset below only took a few seconds to compress no matter the method.

Compression Comparison

I used the San Francisco Bay image available from Earthstar Geographics to do a basic comparison of the compression schemes we've discussed. The image is 7204 x 7204 pixels in three bands of type byte, originally in the GeoTiff format. Note that the compression percentages will vary between every image (the relatively poor compression for LZW surprised me here) but the general ranking should stay the same.


Compression Scheme File Size % of Original File Size Lossy/Lossless
Uncompressed
155.9 MB
n/a
n/a
Deflate (z=6)
112.7 MB
72.3%
Lossless
LZW
142.9 MB
91.7%
Lossless
jpeg (75% Quality)
25.1 MB
16.1%
Lossy
jpeg (YcBcR, 75% Quality)
9.1 MB
4.8%
Lossy
jpeg (YcBcR, 90% Quality)
14.9 MB
9.49%
Lossy
ECW (2:1, from vendor)
28.3 MB
18.2%
Lossy

Note: The file sizes here were calculated on a Mac, which defines a MB as 1000^2 bytes (decimal-based), while Windows currently defines a MB as 1024^2 bytes (binary-based). If you download the example file yourself, you'll see the metadata file says its 148 MB, which would be the Windows version of the size. Confusing? Yeah. 

The jpeg compression is the clear winner here, and zoomed in to actual size (one pixel in the image is one pixel on my monitor) I couldn't tell any difference even between the 75% quality jpeg and the uncompressed GeoTiff.

Compression and BigTIFF

The default TIFF format only allows for files 4 GB or smaller. GDAL overcomes this barrier with the BigTIFF creation option that allows for really big files. The default option is if_needed, which tries to figure out if the new file will be greater than 4 GB and sets the flag if needed. Because this is the default, you have probably already created uncompressed GeoTiffs larger than 4 GB without even realizing this option exists.

However, compression throws a wrench into this default. The driver doesn't know how well the compression will work and fails to set the BigTIFF option, which often leads to write errors when creating large datasets. Whenever I'm working with large compressed datasets I use -co bigtiff=yes if there's any possibility the output will be greater than 4 GB.

ECW and MrSid

I've mentioned "ECW" and "MrSid" a couple times in this series. These are different formats (not compression options for GeoTiffs) that use mathematical magic known as wavelets to achieve lossy compression ratios greater than what's possible with jpeg, and with better image quality and data access to boot. Several Jpeg2000 compression schemes also use wavelet-based compression in a similar matter. 

The potential downside is that these are not open formats—they are protected by patents and require special (and possibly expensive) licenses. There's usually a free license for reading the data (which is what allows GDAL to read them), but creating them (or in the case of ECW, distributing images from a server as part of a webmap) require ponying up the license fee. You're the only one who can decide if that route is best for you.

Next up

Our next post will investigate gdal_translate, one of the most common and most useful GDAL tools.

Notes
Paul Ramsey's excellently concise compression discussion: http://blog.cleverelephant.ca/2015/02/geotiff-compression-for-dummies.html
Kersten Clauss' discussion (and test script!) on GDAL's lossless compression options: http://www.digital-geography.com/geotiff-compression-comparison/
A deep discussion on jpeg compression by Even Rouault, one of the GDAL developers: http://erouault.blogspot.com/2014/04/advanced-jpeg-in-tiff-uses-in-gdal.html (Seriously, this guy is awesome. Super knowledgeable and very active on the gdal-dev mailing list).