1. Overview
In Linux, we usually pack multiple files into a single archive using the tar command. Further, tar with the -z option compresses the archive using gzip to save disk space.
In this tutorial, we’ll learn how to grep a tar.gz archive to find which files inside it contain a given pattern.
2. Introduction to the Problem
As usual, let’s understand the problem with an example.
2.1. The Example
First of all, let’s see some log files in the logs directory:
$ head logs/**/*.log
==> logs/app1/app.log <==
2022-01-20 15:21:10 application started
2022-01-20 17:07:14 [Warn] Security alert: 10 Permission Denied Requests from the same IP ...
2022-01-20 17:08:14 [Warn] High RAM usage: 90%
2022-01-20 17:14:10 RAM usage is back to normal
==> logs/app1/user.log <==
2022-01-20 19:22:10 user Kevin login
2022-01-20 20:21:10 user Kevin logout
2022-01-20 22:18:10 security alert: 10 times failed login from the same IP
==> logs/app2/app.log <==
2021-11-20 15:21:10 application started
2021-11-20 17:08:14 [Warn] High CPU usage: 80%
2021-11-20 17:14:10 CPU usage is back to normal
==> logs/app2/user.log <==
2021-11-20 19:21:10 user Eric login
2021-11-20 22:08:14 security alert: 10 times failed login from the same IP
2021-11-20 23:44:10 user Eric logout
As the output above shows, we have four log files for two applications – app1 and app2.
Next, let’s pack them into an archive using the tar command and verify if the created tarball contains all files we need:
$ tar czf app_logs.tar.gz logs
$ tar tzf app_logs.tar.gz
logs/
logs/app2/
logs/app2/app.log
logs/app2/user.log
logs/app1/
logs/app1/app.log
logs/app1/user.log
2.2. The Problem
Now, let’s say we want to do a case-insensitive search in the app_logs.tar.gz tarball to find out which log files contain “security alert” messages.
We expect to see three files in the result with the matched log entries:
logs/app2/user.log:2021-11-20 22:08:14 security alert: 10 times failed login from the same IP
logs/app1/app.log:2022-01-20 17:07:14 [Warn] Security alert: 10 Permission Denied Requests from the same IP ...
logs/app1/user.log:2022-01-20 22:18:10 security alert: 10 times failed login from the same IP
The first idea that may come up for solving the problem is probably the three-step solution:
- Extracting all files from the tarball to a directory
- Doing a grep search on the extracted files
- Removing the extracted files
This is the most straightforward way to achieve the goal. Our example has only four small log files. However, in the real world, a tarball may contain a large number of files, and those files can be much bigger than the ones in our example. Writing all of them to disk and deleting them again can therefore increase the disk I/O load dramatically.
If we just want to know which files contain the given pattern, extracting all files to a disk is unnecessary and inefficient.
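For reference, the three-step approach could be sketched as follows. The fixture is a hypothetical, cut-down stand-in for app_logs.tar.gz, and the scratch directories created with mktemp are assumptions made only so that the sketch runs on its own:

```shell
# Hypothetical fixture: a cut-down stand-in for app_logs.tar.gz.
work=$(mktemp -d)
mkdir -p "$work/logs/app1" "$work/logs/app2"
printf '%s\n' '2022-01-20 22:18:10 security alert: 10 times failed login from the same IP' \
  > "$work/logs/app1/user.log"
printf '%s\n' '2021-11-20 15:21:10 application started' \
  > "$work/logs/app2/app.log"
tar czf "$work/app_logs.tar.gz" -C "$work" logs

# Step 1: extract all files from the tarball to a scratch directory.
extract_dir=$(mktemp -d)
tar xzf "$work/app_logs.tar.gz" -C "$extract_dir"

# Step 2: run a recursive, case-insensitive grep on the extracted files.
matches=$(cd "$extract_dir" && grep -ri 'security alert' logs)
printf '%s\n' "$matches"

# Step 3: remove the extracted files (and the fixture).
rm -rf "$extract_dir" "$work"
```

Every archived byte is written to disk in step 1 and deleted again in step 3, which is exactly the overhead we want to avoid.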
In this tutorial, we’ll see some more efficient ways to solve the problem in one shot.
3. About the zgrep Command
When we see the requirement “searching in a gzipped tarball”, many of us may think of a handy utility called zgrep. Indeed, as its name implies, the zgrep command can do grep on gzipped files without extracting all files to disk. Also, zgrep supports most of grep’s options nicely.
First, let’s try to solve our problem using zgrep:
$ zgrep -Hai 'security alert' app_logs.tar.gz
app_logs.tar.gz:2021-11-20 22:08:14 security alert: 10 times failed login from the same IP
app_logs.tar.gz:2022-01-20 17:07:14 [Warn] Security alert: 10 Permission Denied Requests from the same IP ...
app_logs.tar.gz:2022-01-20 22:18:10 security alert: 10 times failed login from the same IP
We’ve used three of grep’s options in the command above:
- -H: Output the filename for each match
- -a: Treat binary input as text (the decompressed tar stream contains binary headers)
- -i: Ignore case distinctions when matching patterns
As the output above shows, zgrep has successfully found the three “security alert” occurrences.
However, if we take a closer look at the filenames in the output, we only see the tar.gz file’s name instead of the names of the log files in the archive.
Next, to figure out why it happens, we need to understand how zgrep works.
First, zgrep is just a shell script:
$ file $(which zgrep)
/usr/bin/zgrep: POSIX shell script, ASCII text executable
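That means we can read the source to understand how it works. Simply put, zgrep uses gzip to decompress each input file to stdout and pipes the result to grep to perform the search. It knows nothing about the tar format, so it sees the whole archive as one continuous stream.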
Basically, it’s pretty similar to the command:
$ tar xzfO app_logs.tar.gz | grep -Hai 'security alert'
Here, we use the -O option to ask the tar command to extract files to stdout instead of to disk.
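A small experiment can make the similarity concrete. This is only a rough sketch of the idea, not zgrep's actual source; it assumes gzip and zgrep are installed and uses a hypothetical one-file fixture:

```shell
# Hypothetical fixture: a cut-down stand-in for app_logs.tar.gz.
work=$(mktemp -d)
mkdir -p "$work/logs/app1"
printf '%s\n' \
  '2022-01-20 19:22:10 user Kevin login' \
  '2022-01-20 22:18:10 security alert: 10 times failed login from the same IP' \
  > "$work/logs/app1/user.log"
tar czf "$work/app_logs.tar.gz" -C "$work" logs

# Strip only the gzip layer and search the raw tar stream.
# grep receives one continuous stream, so member names are not available to it.
via_pipe=$(gzip -cd "$work/app_logs.tar.gz" | grep -ai 'security alert')

# The same search through zgrep.
via_zgrep=$(zgrep -ai 'security alert' "$work/app_logs.tar.gz")

printf '%s\n' "$via_pipe" "$via_zgrep"
rm -rf "$work"
```

Both commands should report the same matching line, and neither has any way to say which archive member it came from.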
Therefore, zgrep can search the files’ content in a compressed archive, but it cannot tell which file inside the archive hits the match.
That is to say, zgrep isn’t suitable to solve our problem.
4. Using tar with the --to-command Option
4.1. The Solution
Although zgrep cannot solve our problem, we’ve learned that if we extract files to stdout, we can pipe the output to grep directly and avoid the extra disk I/O.
So, our problem can be easily solved if there’s a way to decompress files inside the tarball to stdout and keep their original filenames.
Fortunately, tar provides the --to-command=COMMAND option.
This option tells tar to run COMMAND once for each regular file it extracts and to pipe that file’s content to COMMAND’s stdin instead of writing it to disk.
Moreover, COMMAND can obtain the information of files inside the tarball through a set of TAR_* environment variables, for example:
- TAR_FILENAME – The name of the file
- TAR_SIZE – The size of the file
- TAR_FILETYPE – The type of the file, for example, whether it is a regular file, directory, or symbolic link, and so on
- …
Obviously, the variable TAR_FILENAME is exactly what we’re looking for.
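To see these variables in action, we could try a sketch like the one below. It assumes GNU tar, and the fixture is a hypothetical one-file archive. The command only prints two of the variables and then discards the file content that tar pipes to it:

```shell
# Hypothetical fixture: a one-file stand-in for app_logs.tar.gz.
work=$(mktemp -d)
mkdir -p "$work/logs/app1"
printf 'hello\n' > "$work/logs/app1/app.log"
tar czf "$work/app_logs.tar.gz" -C "$work" logs

# The single quotes keep $TAR_FILENAME and $TAR_SIZE literal, so they are
# expanded by the shell that tar starts for each archived regular file.
# 'cat > /dev/null' drains the file content arriving on stdin.
listing=$(tar xzf "$work/app_logs.tar.gz" \
  --to-command='printf "%s: %s bytes\n" "$TAR_FILENAME" "$TAR_SIZE"; cat > /dev/null')

printf '%s\n' "$listing"
rm -rf "$work"
```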
On the other hand, the grep command has the --label=LABEL option to display LABEL as the filename. It’s pretty useful when we grep on stdin.
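A quick way to see the label in action, assuming GNU grep (demo.log is a made-up name used only for illustration):

```shell
# grep reads from stdin here; --label supplies the name that -H prints.
labeled=$(printf 'security alert: test\n' | grep -H --label=demo.log 'security alert')
printf '%s\n' "$labeled"
```

Without --label, the filename column would show grep's standard-input placeholder instead of demo.log.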
Therefore, to solve the problem, we can assemble a command like tar … --to-command='grep …' and pass tar’s TAR_FILENAME variable to grep’s --label option.
Let’s give it a try:
$ tar xzf app_logs.tar.gz --to-command='grep --label="$TAR_FILENAME" -Hi "security alert"; true'
logs/app2/user.log:2021-11-20 22:08:14 security alert: 10 times failed login from the same IP
logs/app1/app.log:2022-01-20 17:07:14 [Warn] Security alert: 10 Permission Denied Requests from the same IP ...
logs/app1/user.log:2022-01-20 22:18:10 security alert: 10 times failed login from the same IP
It works! We’ve got the expected result.
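If only the names of the matching files are needed, a variation along these lines may be enough. It is a sketch that assumes GNU tar and GNU grep, relies on grep's -l option printing the --label value for stdin, and again uses a hypothetical fixture:

```shell
# Hypothetical fixture: one matching and one non-matching log file.
work=$(mktemp -d)
mkdir -p "$work/logs/app1" "$work/logs/app2"
printf '%s\n' '2022-01-20 22:18:10 security alert: 10 times failed login from the same IP' \
  > "$work/logs/app1/user.log"
printf '%s\n' '2021-11-20 15:21:10 application started' \
  > "$work/logs/app2/app.log"
tar czf "$work/app_logs.tar.gz" -C "$work" logs

# -l prints only the (labeled) filename of each input that contains a match.
names=$(tar xzf "$work/app_logs.tar.gz" \
  --to-command='grep -il --label="$TAR_FILENAME" "security alert"; true')

printf '%s\n' "$names"
rm -rf "$work"
```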
4.2. A Couple of Notes
We’ve solved the problem. However, there are still a couple of minor points that are worth mentioning.
The first is the true command that we’ve appended after the grep command.
When using the --to-command=COMMAND option, the tar command will output an error message if COMMAND returns an error (non-zero) code.
On the other hand, grep returns 0 when it finds a match in the input and 1 when it doesn’t. An exit status of 2 signals an actual error, such as an unreadable file.
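These exit codes are easy to check by hand. The sketch below assumes a POSIX shell and grep; the path /nonexistent/missing.log is made up and is expected not to exist:

```shell
# A match: grep exits with 0.
status_match=0
printf 'security alert\n' | grep -q 'security alert' || status_match=$?

# No match: grep exits with 1.
status_nomatch=0
printf 'all good\n' | grep -q 'security alert' || status_nomatch=$?

# A real error (unreadable input): grep exits with 2. -s silences the message.
status_error=0
grep -qs 'security alert' /nonexistent/missing.log || status_error=$?

echo "$status_match $status_nomatch $status_error"
```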
As a result, tar treats every “no match found” case from grep as an error and outputs error messages:
$ tar xzf app_logs.tar.gz --to-command='grep --label="$TAR_FILENAME" -Hi "security alert"'
tar: 207869: Child returned status 1
logs/app2/user.log:2021-11-20 22:08:14 security alert: 10 times failed login from the same IP
logs/app1/app.log:2022-01-20 17:07:14 [Warn] Security alert: 10 Permission Denied Requests from the same IP ...
logs/app1/user.log:2022-01-20 22:18:10 security alert: 10 times failed login from the same IP
tar: Exiting with failure status due to previous errors
This messes up the output, which is definitely not what we want. Therefore, we add the true command at the end to make COMMAND always return 0 and suppress those error messages. Note that this also hides genuine grep failures, which are reported with exit status 2.
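A slightly stricter variation could accept only grep's exit codes 0 and 1, so that a real grep error would still surface through tar. This is a sketch under the same assumptions as before (GNU tar, GNU grep, a hypothetical fixture), not something the tar documentation prescribes:

```shell
# Hypothetical fixture: one matching and one non-matching log file.
work=$(mktemp -d)
mkdir -p "$work/logs/app1" "$work/logs/app2"
printf '%s\n' '2022-01-20 22:18:10 security alert: 10 times failed login from the same IP' \
  > "$work/logs/app1/user.log"
printf '%s\n' '2021-11-20 15:21:10 application started' \
  > "$work/logs/app2/app.log"
tar czf "$work/app_logs.tar.gz" -C "$work" logs

# '[ "$?" -le 1 ]' succeeds for grep's "match" (0) and "no match" (1)
# statuses, but fails for anything higher, such as grep's error status 2.
out=$(tar xzf "$work/app_logs.tar.gz" \
  --to-command='grep --label="$TAR_FILENAME" -Hi "security alert"; [ "$?" -le 1 ]' \
  2> "$work/tar.err")
errors=$(cat "$work/tar.err")

printf '%s\n' "$out"
rm -rf "$work"
```

With this variation, the non-matching file no longer triggers a “Child returned status 1” message, while an exit status of 2 from grep would still be reported.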
Another point we should note is that we’ve wrapped COMMAND in single quotes.
This is because the TAR_* variables are set by tar during its execution and passed to COMMAND’s environment. So, for example, if we double-quote COMMAND, the $TAR_FILENAME variable will be expanded by our own shell before the tar command even starts:
$ tar xzf app_logs.tar.gz --to-command="grep --label=$TAR_FILENAME -Hi 'security alert';true"
:2021-11-20 22:08:14 security alert: 10 times failed login from the same IP
:2022-01-20 17:07:14 [Warn] Security alert: 10 Permission Denied Requests from the same IP ...
:2022-01-20 22:18:10 security alert: 10 times failed login from the same IP
As the test above shows, we have three empty filenames in the output, as the shell variable $TAR_FILENAME doesn’t exist when we start the tar command.
Therefore, we need to use single quotes to prevent expanding variable names when starting tar.
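The difference between the two quoting styles can be seen without tar at all. In this sketch, we unset TAR_FILENAME to mimic an interactive shell in which the variable doesn't exist:

```shell
# In our own shell, TAR_FILENAME is not defined.
unset TAR_FILENAME

# Double quotes: the shell expands the variable immediately, to nothing.
double_quoted="grep --label=$TAR_FILENAME"

# Single quotes: the text stays literal, so a later shell can expand it.
single_quoted='grep --label=$TAR_FILENAME'

printf '%s\n' "$double_quoted" "$single_quoted"
```

Only the single-quoted string still contains the variable reference by the time tar hands it to the shell that runs COMMAND.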
5. Conclusion
In this article, we’ve learned how to search in a compressed tarball, without extracting all files to disk, to find which files contain the given pattern.
First, we’ve discussed why the commonly used zgrep command cannot solve our problem.
Then, we’ve solved the problem by combining tar’s --to-command option with grep’s --label option.