1. Overview

It’s common practice to use compressed files to reduce file size when working with large datasets. Accessing data from such files can be quite challenging, especially when using the command line. This is when the awk command comes in handy. In this tutorial, we’ll look at how to use the awk command with compressed files.

2. What Is awk?

Developed in the 1970s, awk is a powerful text processing tool used for data manipulation and analysis. It reads its input one record (by default, one line) at a time, which makes it well suited to processing large amounts of data quickly.

When we pass the -h flag to GNU awk (gawk), it lists the options that the command accepts:

$ awk -h
Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
Usage: awk [POSIX or GNU style options] [--] 'program' file ...
POSIX options:          GNU long options: (standard)
        -f progfile             --file=progfile
        -F fs                   --field-separator=fs
        -v var=val              --assign=var=val
Short options:          GNU long options: (extensions)
        -b                      --characters-as-bytes
        -c                      --traditional 
... other lines truncated ...
Examples:
        awk '{ sum += $1 }; END { print sum }' file
        awk -F: '{ print $1 }' /etc/passwd

Here, the output displays the syntax of the awk command. It also provides additional information with examples.
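To make the two examples from the help text concrete, here's a minimal sketch that runs them against small sample files. The file names and their contents are made up for illustration, and users.txt stands in for /etc/passwd so that the result doesn't depend on the local system:

```shell
# Create a scratch directory with hypothetical sample data.
tmpdir=$(mktemp -d)
printf '10 apples\n20 pears\n12 plums\n' > "$tmpdir/numbers.txt"
printf 'root:x:0:0\ndaemon:x:1:1\n' > "$tmpdir/users.txt"

# Add up the first field of every line and print the total in the END block.
total=$(awk '{ sum += $1 }; END { print sum }' "$tmpdir/numbers.txt")
echo "$total"

# Use ":" as the field separator and print the first field of each line.
names=$(awk -F: '{ print $1 }' "$tmpdir/users.txt")
echo "$names"

# Clean up the scratch directory.
rm -rf "$tmpdir"
```

The first command prints the sum of the numeric column, while the second prints one name per line.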

3. Using awk With Compressed Files

Next, we’ll discuss a couple of methods for using awk with compressed files.

One way to use awk with compressed files is to first uncompress the files and then run the awk command on the result. Another way is to use an appropriate tool to read the data from the compressed file and pipe it into awk.

3.1. Uncompressing the Compressed Files

There are various compressed file formats, including gzip, bzip2, zip, and more. A tar file, by contrast, is an archive that’s often combined with one of these compression formats. Later, we’ll look at tools that can read gzip and bzip2 files directly.

Before decompressing a file, we should always check its extension, since it usually indicates which tool was used to compress the file.

For now, we’ll use gzip to uncompress data.gz, a file in gzip format:

$ gzip -d data.gz

By default, this replaces data.gz with the uncompressed file data. After this, let’s use awk to process the data:

$ cat data | awk '{print $1}'
CONTAINER
1ebe91fa065c
3c4d4e670b88
c0f38d0b0352

The above command reads data from the uncompressed file data and prints the first field of each line. We can also select only the lines that match a pattern:

$ cat data | awk '/3c4d4e670b88/{print}'
3c4d4e670b88   postgres:11.5-alpine    "docker-entrypoint…   5 days ago      Up 5 days                  5432/tcp  postgres_db

Similarly, this command treats 3c4d4e670b88 as a regular expression and prints every line that contains a match for it.
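Because a /pattern/ matches anywhere in the line, it can select more lines than intended. As a minimal sketch, assuming a hypothetical sample file in which one ID is a prefix of another, we can compare a pattern match with an exact comparison on the first field:

```shell
# Create hypothetical sample data: the second ID starts with the first one.
tmpdir=$(mktemp -d)
printf 'CONTAINER IMAGE\n3c4d4e670b88 postgres\n3c4d4e670b88ff redis\n' > "$tmpdir/data"

# Pattern match: selects every line that contains the text anywhere.
regex_hits=$(awk '/3c4d4e670b88/ { print $2 }' "$tmpdir/data")

# String comparison: selects only lines whose first field equals the text.
exact_hits=$(awk '$1 == "3c4d4e670b88" { print $2 }' "$tmpdir/data")

echo "pattern: $regex_hits"
echo "exact:   $exact_hits"

# Clean up the scratch directory.
rm -rf "$tmpdir"
```

Here, the pattern selects both rows, whereas the comparison on $1 selects only the row whose ID is exactly 3c4d4e670b88.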

3.2. Reading Data From the Compressed Files

It’s possible to process compressed files without first writing the uncompressed data to disk by using specialized tools such as zcat for gzip files, bzcat for bzip2 files, and so on. These tools decompress the file as a stream to standard output, which we can feed into awk without extracting the contents of the file beforehand.

For instance, to process a gzip file, let’s use the zcat command:

$ zcat data.gz | awk '{print $1}'
CONTAINER
1ebe91fa065c
3c4d4e670b88
c0f38d0b0352

Here, the pipe symbol divides the command into two parts. First, zcat decompresses data.gz and writes the result to standard output. Then, awk reads that stream from the pipe and prints the first field of each line. The file data.gz itself remains unchanged.
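The gzip documentation describes zcat as equivalent to gunzip -c, so gzip -dc is a handy fallback on systems where zcat isn’t the gzip version. The following is a minimal sketch under that assumption, using made-up sample data, which also checks that the compressed file stays in place:

```shell
# Create hypothetical sample data and compress it, as with data.gz in the text.
tmpdir=$(mktemp -d)
printf 'CONTAINER IMAGE\n1ebe91fa065c nginx\n3c4d4e670b88 postgres\n' > "$tmpdir/data"
gzip "$tmpdir/data"   # replaces data with data.gz

# gzip -dc decompresses to standard output and leaves data.gz untouched.
first_fields=$(gzip -dc "$tmpdir/data.gz" | awk '{ print $1 }')
echo "$first_fields"

# Confirm that data.gz still exists and no uncompressed copy was written.
if [ -f "$tmpdir/data.gz" ] && [ ! -e "$tmpdir/data" ]; then
    still_compressed=yes
else
    still_compressed=no
fi
echo "data.gz untouched: $still_compressed"

# Clean up the scratch directory.
rm -rf "$tmpdir"
```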

Similarly, we can use bzcat for processing the bzip2 file data.bz2:

$ bzcat data.bz2 | awk '{print $1}'
CONTAINER
1ebe91fa065c
3c4d4e670b88
c0f38d0b0352

In this case, we used bzcat in place of zcat since the file is compressed with bzip2. The resulting output is identical to that of the previous example.
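When a script has to handle files in more than one format, we can choose the tool from the file extension. The sketch below is one possible approach, not a standard utility: stream_file is a hypothetical helper name, it uses gzip -dc and bzip2 -dc (the equivalents of zcat and bzcat), and the demonstration only exercises the gzip and plain-text branches with made-up data:

```shell
# stream_file (hypothetical helper): write a file's contents to standard
# output, decompressing it first when the extension calls for it.
stream_file() {
    case "$1" in
        *.gz)  gzip -dc "$1" ;;    # same result as zcat
        *.bz2) bzip2 -dc "$1" ;;   # same result as bzcat
        *)     cat "$1" ;;         # assume the file is plain text
    esac
}

# Create hypothetical sample data: one plain file and one gzip file.
tmpdir=$(mktemp -d)
printf 'CONTAINER IMAGE\n1ebe91fa065c nginx\n' > "$tmpdir/plain.txt"
printf 'CONTAINER IMAGE\n3c4d4e670b88 postgres\n' > "$tmpdir/data"
gzip "$tmpdir/data"

# The same awk program works on both; NR > 1 skips the header line.
plain_ids=$(stream_file "$tmpdir/plain.txt" | awk 'NR > 1 { print $1 }')
gz_ids=$(stream_file "$tmpdir/data.gz" | awk 'NR > 1 { print $1 }')
echo "$plain_ids"
echo "$gz_ids"

# Clean up the scratch directory.
rm -rf "$tmpdir"
```

Since the helper only decides how the data reaches the pipe, the awk program stays the same regardless of the compression format.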

4. Conclusion

In this article, we looked at two ways to use the awk command with compressed files: uncompressing the file first, and streaming its contents to awk with tools such as zcat and bzcat.

All in all, awk is a powerful command-line utility that’s well worth knowing when working with huge datasets, compressed files, or simply processing text.