So at Digg, we have been working on our own Hadoop cluster using Cloudera’s distribution. One of the things we have been working through is how to split our large compressed data files and process them in parallel on Hadoop. One of the biggest drawbacks of compression formats like Gzip is that a compressed file can’t be split across multiple mappers, so a single mapper ends up reading the whole file. This is where LZO comes in.
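To make the splitting problem concrete, here is a minimal Python sketch, not anything from our actual cluster setup. The first half shows that a single Gzip stream can't be read starting from the middle. The second half shows the general idea that makes a block-based format splittable: compress fixed-size blocks independently and remember where each block starts. Everything here is an assumption for illustration: `zlib` stands in for LZO (which isn't in the Python standard library), and `compress_blocks`, `read_blocks`, and the 64 kB `BLOCK_SIZE` are hypothetical names and values, not part of any Hadoop or LZO API.

```python
import gzip
import zlib

# Hypothetical sample data standing in for a large log file.
data = b"".join(b"line %d of some log file\n" % i for i in range(20000))

# 1) One Gzip stream: a reader dropped into the middle has nothing to sync on.
blob = gzip.compress(data)
try:
    gzip.decompress(blob[len(blob) // 2:])
    mid_read_ok = True
except (OSError, EOFError, zlib.error):
    mid_read_ok = False

# 2) Independent blocks plus an index of block offsets (illustrative only;
#    zlib is used here as a stand-in codec, not LZO).
BLOCK_SIZE = 64 * 1024  # assumed block size for this sketch


def compress_blocks(raw, block_size=BLOCK_SIZE):
    """Compress each block on its own and record its start offset."""
    packed = bytearray()
    index = []
    for start in range(0, len(raw), block_size):
        index.append(len(packed))
        packed += zlib.compress(raw[start:start + block_size])
    return bytes(packed), index


def read_blocks(packed, index, first, last):
    """Decompress blocks first..last-1 without touching any other block."""
    bounds = index + [len(packed)]
    return b"".join(
        zlib.decompress(packed[bounds[i]:bounds[i + 1]])
        for i in range(first, last)
    )


packed, index = compress_blocks(data)

# Two "mappers", each handed a different range of blocks.
mid = len(index) // 2
part_a = read_blocks(packed, index, 0, mid)
part_b = read_blocks(packed, index, mid, len(index))

print("mid-stream gzip read worked:", mid_read_ok)
print("blocks:", len(index), "round trip ok:", part_a + part_b == data)
```

The point of the sketch is only that each reader needs a known place to start decompressing. With one continuous Gzip stream there is no such place other than byte zero; with independently compressed blocks, any block boundary will do.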
The LZO library implements a number of algorithms with the following features:
Compression is comparable in speed to deflate compression.
On modern architectures, decompression is very fast; in non-trivial cases able to exceed the speed of a straight memory-to-memory copy due to the reduced memory-reads.
Requires an additional buffer during compression (of size 8 kB or 64 kB, depending on compression level).
Requires no additional memory for decompression other than the source and destination buffers.
Allows the user to adjust the balance between compression quality and compression speed, without affecti