数据压缩：LZ77 vs. LZ4 vs. LZ4HC

1. Introduction

Data compression is a technique for reducing the size of data, saving storage space, and improving transmission speeds across networks. Various algorithms have been developed over the years, each offering different trade-offs between compression ratio, speed, and computational efficiency. Among these, LZ77, LZ4, and LZ4HC stand out as popular choices for different scenarios.

In this tutorial, we’ll explore these three algorithms, comparing their performance and use cases to figure out when to use each one.

2. What Is Data Compression?

Before diving into the specific algorithms, let’s briefly discuss data compression and why it’s important. Data compression involves encoding information using fewer bits than in the original representation. The goal is to reduce the size of the data while retaining as much of the original information as possible. This is particularly useful in areas like file storage, data transmission, and bandwidth management.

There are two main types of data compression. Lossless compression ensures that the original data can be perfectly reconstructed from the compressed data. It’s ideal for scenarios where accuracy is crucial, such as text files, databases, and executable programs. Lossy compression reduces file size by removing some of the data, often resulting in a loss of quality. It’s commonly used for multimedia files like images, videos, and audio where some loss of quality is acceptable.

LZ77, LZ4, and LZ4HC are all lossless compression algorithms, meaning they allow for exact reconstruction of the original data.

3. Understanding LZ77

3.1. The Basics of LZ77

LZ77 is a foundational data compression algorithm developed by Abraham Lempel and Jacob Ziv in 1977. It is the basis for many other compression algorithms, including LZ4 and LZ4HC.

The core idea behind LZ77 is to replace repeated patterns with references to the first match of the identified pattern in the uncompressed data.

LZ77 works by sliding a window over the data stream and identifying sequences of data that have already appeared within a certain range (the “window”). When a match is found, the algorithm replaces the second occurrence with a reference to the first occurrence, thus reducing the size of the data.

For general data compression tasks, such as compressing text files, LZ77 offers a good balance between speed and compression ratio. It’s fast enough for most tasks while still providing a reasonable reduction in data size.

3.2. How LZ77 Works

Here’s a simplified example of how LZ77 compresses data.

Let’s say we have the following string of text:

abcabcabcabc

LZ77 would process this as follows:

It reads the first three characters abc and stores them as they are
When it encounters the second abc, it recognizes that this sequence has already appeared. Instead of storing it again, it stores a reference to the first abc
This process continues for the entire string, significantly reducing the data size

The reference typically consists of two values: the distance back to the sequence’s start and the match’s length.

For the string abcabcabcabc, the LZ77 output might be something like:

abc(0, 3)(3, 3)(6, 3)

where (n, m) represents the match found n characters back with a length of m.

4. Understanding LZ4

4.1. The Basics of LZ4

LZ4 is a high-speed compression algorithm based on LZ77 but optimized for speed rather than compression ratio. It was developed by Yann Collet in 2011 and is widely used when quick compression and decompression are more important than achieving the highest possible compression ratio.

LZ4 is particularly popular in real-time applications with critical low latency, such as network traffic compression, real-time storage systems, and data transfer protocols.

Speed is critical in real-time video streaming. We need to compress data quickly to minimize latency and ensure smooth playback. LZ4 is ideal here because it offers fast compression and decompression, transmitting video streams efficiently without introducing significant delays.

4.2. How LZ4 Works

LZ4 operates similarly to LZ77 but with several optimizations that make it faster.

In LZ4, the matching algorithm is designed to quickly identify repeated sequences without spending too much time on each search. LZ4 focuses on finding sequences of a fixed length, usually 4 bytes (32 bits). This fixed-length approach simplifies the matching process because the algorithm doesn’t need to check every possible match length, as it would with LZ77. Instead, it quickly identifies sequences of this fixed size.

To further enhance speed, LZ4 uses hash tables to store and quickly retrieve the positions of these fixed-length sequences. When the algorithm encounters a new sequence in the data, it calculates a hash value for that sequence and checks if this hash has been seen before. If it has, LZ4 knows that a match exists and can quickly reference the earlier occurrence.

Additionally, unlike more thorough algorithms that might continue searching to find the absolute best match, LZ4 is designed to stop as soon as it finds a good enough match. This “early exit” strategy allows the algorithm to move on to the next part of the data more quickly, further speeding up the compression process.

However, these optimizations come at a cost. LZ4 typically achieves a lower compression ratio than more thorough algorithms like LZ4HC.

5. Understanding LZ4HC

5.1. The Basics of LZ4HC

LZ4HC (LZ4 High Compression) is a variant of LZ4 that prioritizes achieving a higher compression ratio over speed. While LZ4HC is slower than LZ4, it still operates faster than many other compression algorithms, making it a good middle-ground option for those who need better compression but still require decent speed.

LZ4HC is particularly useful when storage space is more constrained or when data needs to be compressed once and decompressed many times. When storing large amounts of data for long-term archival, storage space becomes a significant concern. We might compress data once and rarely access it, so the compression ratio is more important than speed. LZ4HC is perfect in this scenario, providing a high compression ratio and reducing the required storage space.

5.2. How LZ4HC Works

LZ4HC takes a more thorough approach to data compression compared to its faster counterpart, LZ4. This is evident in several key aspects of its matching algorithm.

First, LZ4HC employs a more exhaustive search algorithm designed to find the best possible matches in the data stream. By conducting a deeper search, LZ4HC significantly increases the likelihood of identifying longer and more frequent matches, which in turn results in better compression ratios. This thoroughness, however, comes at the cost of speed, as the algorithm spends more time ensuring that the matches it finds are optimal.

In addition to its exhaustive search method, LZ4HC enhances its matching capabilities through a backward search technique. Unlike LZ4, which conducts a simpler forward search from the current position, LZ4HC searches backward from the end of the current position. This backward search helps maximize the length of the matches by considering sequences that may not be immediately obvious in a forward search, further improving the compression efficiency.

Moreover, LZ4HC improves upon LZ4’s fixed block size approach by introducing adaptive block size capabilities. Instead of relying on a one-size-fits-all block size, LZ4HC can dynamically adjust the block size based on the characteristics of the data being compressed. This adaptability allows LZ4HC to optimize the compression process more effectively, ensuring that the algorithm can handle a wider variety of data types with improved results.

These enhancements result in a significantly better compression ratio compared to LZ4, though at the cost of compression speed.

6. Comparing LZ77, LZ4, and LZ4HC

Now that we’ve covered the basics of LZ77, LZ4, and LZ4HC, let’s compare their speed, compression ratio, and suitable use cases:

Algorithm

Speed

Compression Ratio

Ideal Use Cases

LZ4

Fast

Low

Real-time data compression: Network traffic, real-time storage systems, data transfer protocols

LZ77

Moderate

General-purpose data compression: Text file compression, moderate-speed applications

LZ4HC

Slow

High

High-compression scenarios: Archival storage, backup systems, scenarios with limited storage space

Each algorithm has its own trade-offs regarding speed. LZ4 is designed to be fast, making it ideal for real-time applications. LZ77 is slower but still efficient, and LZ4HC, while slower, offers a higher compression ratio.

The compression ratio is crucial when storage space is at a premium. LZ4HC provides the highest compression ratio, followed by LZ77, with LZ4 offering the lowest ratio in exchange for speed.

7. Conclusion

In this article, we’ve explored the differences between LZ77, LZ4, and LZ4HC, three popular data compression algorithms. Each algorithm offers unique advantages, from LZ4’s speed to LZ4HC’s high compression ratio.

Persistence

REST

Security