03-02-15 - Oodle LZ Pareto Frontier

I made a chart.

I'm showing the "total time to load" (time to load off disk at a simulated disk speed + time to decompress). You always want a lower total time to load - smaller files reduce the simulated load time, and faster decompression reduces the decompress time.


total_time_to_load = compressed_size / disk_speed + decompress_time

It looks neatest in the form of "speedup". "speedup" is the ratio of the effective speed to the speed of the disk :


effective_speed = raw_size / total_time_to_load

speedup = effective_speed / disk_speed
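
(For concreteness, here's a minimal sketch of these formulas as C functions - the names and units are mine, not Oodle's; any consistent units work, e.g. MB, seconds, MB/s :)

double total_time_to_load(double compressed_size, double disk_speed, double decompress_time)
{
    // time to read the compressed file off disk, plus time to decompress it
    return compressed_size / disk_speed + decompress_time;
}

double speedup(double raw_size, double compressed_size, double disk_speed, double decompress_time)
{
    // effective speed = raw bytes delivered per second of total load time
    double effective_speed = raw_size / total_time_to_load(compressed_size, disk_speed, decompress_time);
    return effective_speed / disk_speed;
}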

By varying disk speed you can see the tradeoff of compression ratio vs. cpu usage that makes different compressors better in different application domains.

If we write out what speedup is (with compression_ratio = raw_size / compressed_size and decompress_speed = raw_size / decompress_time) :


speedup = raw_size / (compressed_size + decompress_time * disk_speed)

speedup = 1 / (1/compression_ratio + disk_speed / decompress_speed)

speedup ~= harmonic_mean( compression_ratio , decompress_speed / disk_speed )

we can see that it's a "harmonic lerp" between compression ratio on one end and decompress speed on the other end, with the simulated disk speed as lerp factor.
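
(A quick numeric sanity check that the two forms agree - a sketch with made-up numbers, not measurements of any real compressor :)

#include <stdio.h>

int main(void)
{
    double raw_size        = 100.0;  // MB
    double compressed_size = 40.0;   // MB  -> compression_ratio = 2.5
    double decompress_time = 0.2;    // s   -> decompress_speed = 500 MB/s
    double disk_speed      = 100.0;  // MB/s

    double compression_ratio = raw_size / compressed_size;
    double decompress_speed  = raw_size / decompress_time;

    // direct form
    double s1 = raw_size / (compressed_size + decompress_time * disk_speed);

    // "harmonic lerp" form
    double s2 = 1.0 / (1.0/compression_ratio + disk_speed/decompress_speed);

    printf("%f %f\n", s1, s2);  // both print 1.666667
    return 0;
}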

These charts show "speedup" vs. log of disk_speed :

(the log is log2, so 13 is a disk speed of 8192 MB/s).

On the left side, the curves go flat. At the far left (x -> -infinity, disk speed -> 0) the height of each curve approaches the compression ratio. So just looking at how they stack up on the far left tells you the compression ratio performance of each compressor. As you move further right, decompression speed becomes more and more important and compressed size less so.
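
(If you want to reproduce the shape of these curves, a sketch like this prints speedup vs. log2(disk_speed) for two hypothetical compressors - the ratios and speeds are invented for illustration, not Oodle's actual numbers :)

#include <stdio.h>
#include <math.h>

static double speedup(double compression_ratio, double decompress_speed, double disk_speed)
{
    return 1.0 / (1.0/compression_ratio + disk_speed/decompress_speed);
}

int main(void)
{
    // hypothetical "high ratio, slow" vs. "low ratio, fast" compressors
    double ratio_A = 2.8, speed_A = 200.0;   // MB/s
    double ratio_B = 2.0, speed_B = 1500.0;  // MB/s

    for (int log2_disk = 0; log2_disk <= 13; log2_disk++)
    {
        double disk_speed = pow(2.0, log2_disk);  // MB/s
        printf("log2(disk)=%2d  A=%.3f  B=%.3f\n", log2_disk,
               speedup(ratio_A, speed_A, disk_speed),
               speedup(ratio_B, speed_B, disk_speed));
    }
    // A wins (higher speedup) at slow disk speeds, B wins at fast ones;
    // on the far left each curve flattens out at its compression ratio.
    return 0;
}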

With ham-fisted shading of the regions where each compressor is best :

The thing I thought was interesting is that there's a very obvious Pareto frontier. If I draw a tangent across the best compressors :

Note that at the high end (right), the tangent goes from LZB to "memcpy" - not to "raw". (raw is just the time to load the raw file from disk; we really have to compare to memcpy because all the compressors fill a buffer that's different from the IO buffer). (actually the gray line I drew on the right is not great - it should be tangent to memcpy, and it should be shaped just like each of the compressors' curves: flat on the left (compressed size dominant) and flat on the right (decompress time dominant))

You can see there are gaps where these compressors don't make a complete Pareto set; the biggest gap is between LZA and LZH, which is something I will address soon (something like LZHAM goes there). And you can predict what the performance of the missing compressor should be.

It's also a neat way to test that all the compressors are good tradeoffs. If the gray line didn't make a nice smooth curve, it would mean that some compressor was not doing a good job of hitting the maximum speed for its compression level. (of course I could still have a systematic inefficiency; like quite possibly they're all 10% worse than they should be)


ADDENDUM :

If instead of doing speedup vs. loading raw you do speedup vs. loading raw + memcpy, you get this :

The nice thing is the right-hand asymptotes are now constants, instead of decaying like 1/disk_speed.

So the left-hand y-intercepts (disk speed -> 0) show the compression ratio, the right-hand y-intercepts (disk speed -> inf) show the decompression speed, and in between shows the tradeoff.
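
(As a sketch of what that looks like in code - the "load raw + memcpy" baseline is my reading of the addendum, and the numbers are invented :)

#include <stdio.h>
#include <math.h>

int main(void)
{
    double raw_size        = 100.0;   // MB
    double compressed_size = 40.0;    // MB   -> compression ratio 2.5
    double decompress_time = 0.2;     // s    -> decompress speed 500 MB/s
    double memcpy_speed    = 8000.0;  // MB/s -> memcpy time 0.0125 s
    double memcpy_time     = raw_size / memcpy_speed;

    for (int log2_disk = 0; log2_disk <= 24; log2_disk += 4)
    {
        double disk_speed = pow(2.0, log2_disk);  // MB/s
        double baseline = raw_size / disk_speed + memcpy_time;              // load raw + memcpy
        double loading  = compressed_size / disk_speed + decompress_time;   // load compressed + decompress
        printf("log2(disk)=%2d  speedup=%.4f\n", log2_disk, baseline / loading);
    }
    // left end approaches the compression ratio (2.5);
    // right end approaches memcpy_time / decompress_time (a constant),
    // instead of decaying like 1/disk_speed.
    return 0;
}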

3 comments:

Ants Aasma said...

If the compressors can work in a streaming way wouldn't total_time be more accurately modeled as max(compressed_size/disk_speed, decompress_time)? The graphs would be significantly less curvealicious though.

cbloom said...

Yeah, this "total time" is really just a score. It should not be taken literally. You're right that "max" has some merit.

I showed a few different options back here :

http://cbloomrants.blogspot.com/2012/09/09-15-12-some-compression-comparison.html

I actually think L2 mean of decomp & load time has merit.

The issue is that actual "total load time" isn't the only thing you care about usually; you also want minimum CPU use, and you want minimum disk file size, for other reasons.

Cyan said...

It's an excellent visualization methodology.
