Parallel file compression
by Jørn Amundsen, last modified Feb 09, 2010 01:11 PM
An explanation of how to speed up file compression by utilizing the available cores on a multicore node.
Compression is used regularly to reduce the disk space requirements of stored data. On Unix systems, a file hierarchy is usually stored in an archive file with tar(1) and then compressed, either in-flight by tar (-a switch) or in a separate step with gzip or another popular compression program.
Caution should be exercised when compressing large output from scientific simulations. The main concern is that compression is time-consuming, and the gain in reduced storage might not justify the effort. Compression is useful only if the data contain repeated patterns. Binary simulation results, with significant data on a background of white noise, often contain few or no repeated patterns, and compressing such data might produce a file of equal or even larger size than the original. Text files, however, usually compress well: it is not unusual to see the size reduced by a factor of 3 or more.
Ordinary compression programs are sequential and usually utilize only one core on a multicore node. The pigz application by Mark Adler is threaded and can utilize all cores on a node to reduce compression time. Pigz is compatible with the popular gzip compression program on Unix systems: pigz output can be decompressed with gzip and vice versa. On njord, pigz is capable of utilizing all 16 cores on a node for efficient file compression. Speedup and efficiency from compressing a 512 MiB, a 1024 MiB and a 2048 MiB file are shown in the figure on speedup and efficiency below.
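The caution about data without repeated patterns can be checked on a small sample before committing to compressing a large file. The following is a minimal sketch in Python, using the standard zlib module (the same deflate algorithm as gzip). The helper name `compression_ratio` and both sample buffers are made up for illustration: the repeated text line stands in for a typical ASCII output file, and random bytes stand in for noisy binary results.

```python
import os
import zlib


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return original size divided by compressed size (higher is better)."""
    compressed = zlib.compress(data, level)
    return len(data) / len(compressed)


# Repetitive text, standing in for a log or ASCII output file (illustrative only).
text_like = b"step 000123  energy = -1.2345e+03  residual = 1.0e-08\n" * 20000

# Random bytes of the same length, standing in for noisy binary output.
noise_like = os.urandom(len(text_like))

print(f"text-like data : ratio {compression_ratio(text_like):.1f}")
print(f"noise-like data: ratio {compression_ratio(noise_like):.3f}")
```

A ratio close to 1 for a representative sample suggests that compressing the full data set is unlikely to be worth the time. Real simulation output will fall somewhere between these two extremes.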
Note the reduced performance above 14 cores. This is because pigz uses additional threads internally for data and work management, so requesting more than 14 cores oversubscribes the node. Figure 2 compares the performance of pigz 2.1.6 with gzip 1.3.13 on njord. If data must be compressed, using all cores to reduce wall time on a node is a better use of the node than leaving the other cores idle while a sequential compression program runs.
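To make the idea of threaded compression and the terms speedup and efficiency concrete, here is a rough sketch in Python. It is not pigz's internal design: it is a simplification that compresses fixed-size chunks in a thread pool and joins the results, relying on the fact that a gzip stream may consist of several concatenated members. The function names, the chunk size and the worker count are all assumptions chosen for illustration. Speedup is taken as serial wall time divided by parallel wall time, and efficiency as speedup divided by the number of cores.

```python
import gzip
import os
from concurrent.futures import ThreadPoolExecutor


def parallel_gzip(data: bytes, workers: int = 4, chunk_size: int = 1 << 20) -> bytes:
    """Compress fixed-size chunks in parallel and join them as gzip members."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the members stay in sequence.
        members = list(pool.map(gzip.compress, chunks))
    return b"".join(members)


def speedup(t_serial: float, t_parallel: float) -> float:
    """Serial wall time divided by parallel wall time."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, cores: int) -> float:
    """Speedup per core; 1.0 would be perfect scaling."""
    return speedup(t_serial, t_parallel) / cores


# Round-trip check on made-up data: the joined members decompress as one stream.
payload = b"abc" * 500000 + os.urandom(1000)
packed = parallel_gzip(payload, workers=4, chunk_size=1 << 18)
assert gzip.decompress(packed) == payload
```

Because the output is an ordinary multi-member gzip stream, it can be read back with the standard gzip tools. Compressing chunks independently usually costs a little in compression ratio compared with a single stream, which is one reason a real tool may choose a different scheme.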
Now, what about decompression, and what about in-flight parallel compression with tar(1)? Decompression cannot be parallelized; the best pigz can do is to offload data management to a few additional threads. In-flight compression with pigz is not a good idea either: tar opens a pipe to the compression program, and the sequential tar program cannot feed pigz fast enough to speed up the work.
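Given that in-flight compression through tar does not pay off, one possible workflow is to archive first and compress in a separate step. The sketch below only assembles and prints the two command lines; the directory name `results`, the archive name `results.tar` and the helper names are hypothetical, and the default of 14 threads is taken from the njord measurements above rather than being a general recommendation. It assumes a pigz that accepts `-p` to set the number of compression threads.

```python
import shutil
import subprocess


def archive_then_compress(directory: str, archive: str, threads: int = 14) -> list[list[str]]:
    """Build the two commands: sequential tar first, threaded pigz second."""
    return [
        ["tar", "-cf", archive, directory],      # step 1: uncompressed archive
        ["pigz", "-p", str(threads), archive],   # step 2: compress with N threads
    ]


def run(commands: list[list[str]]) -> None:
    """Run the commands in order, stopping at the first failure."""
    for argv in commands:
        if shutil.which(argv[0]) is None:
            raise FileNotFoundError(f"{argv[0]} not found on PATH")
        subprocess.run(argv, check=True)


# Print the commands instead of running them; call run(commands) to execute.
commands = archive_then_compress("results", "results.tar")
for argv in commands:
    print(" ".join(argv))
```

Like gzip, pigz replaces the input file with a compressed one carrying a `.gz` suffix, so the second step is expected to leave `results.tar.gz` behind. The uncompressed archive needs temporary disk space in the meantime, which is the price of keeping tar out of the compression pipeline.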
Read more about pigz(1) in the UNIX manual page on njord, with «man pigz».
