1. Overview
In this tutorial, we’ll learn how to efficiently remove files from a .tgz file (also known as .tar.gz). We’ll start with the most straightforward method, and then we’ll look at a faster method that uses an external tool.
2. Removing Files From a Tarball Using tar
A .tgz is a tar file that has been compressed using the gzip algorithm. First, we’ll need to decompress this file into a tar file. Then, we can delete the unwanted files using the tar command.
After removing the unwanted files, we just need to recompress the tar file. Several commands can handle gzip compression, but in this section, we’ll only cover gzip and pigz.
2.1. Removing an Individual File
Let’s begin by using gzip and tar to remove junk.txt from our .tgz:
$ gzip --to-stdout --decompress "infile.tgz" | tar --to-stdout --delete 'junk.txt' -f - | gzip > 'outfile.tgz'
We decompress the .tgz using gzip --decompress and then pipe the output to our tar --delete command. The --to-stdout flag allows us to pipe the output of one command to the next without using an intermediate file, and the -f - argument tells tar to read the archive from standard input rather than from a file. Finally, we pipe the resulting output to gzip to compress it.
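To see the pipeline work end to end, the sketch below builds a tiny throwaway tarball and checks that junk.txt is gone afterwards. The sample file names and the scratch directory are assumptions made for illustration, and tar --delete requires GNU tar:

```shell
#!/bin/sh
set -eu

# Work in a scratch directory so nothing outside it is touched.
workdir="$(mktemp -d)"
cd "${workdir}"

# Hypothetical sample files for the demonstration.
printf 'keep me\n' > keep.txt
printf 'remove me\n' > junk.txt
tar --create --gzip --file infile.tgz keep.txt junk.txt

# Same pipeline as above: decompress, delete the member, recompress.
gzip --to-stdout --decompress 'infile.tgz' | tar --to-stdout --delete 'junk.txt' -f - | gzip > 'outfile.tgz'

# List the members that survived the deletion.
remaining="$(tar --list --gzip --file outfile.tgz)"
printf '%s\n' "${remaining}"   # prints: keep.txt
```

Listing the new archive before discarding the original is a cheap way to confirm that only the intended member was removed.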
2.2. Installing pigz
To increase the speed of compression, we can replace gzip with pigz, a parallel implementation of gzip that spreads the compression work across multiple CPU cores. How much faster it is depends on the number of cores in our system.
To install pigz, we first need to download and extract the source files. Then, we can build from the source:
$ cd *pigz* && make
To build from source, we cd into the extracted directory and run the make command. The output of this command is the pigz executable. We can run it from this directory with ./pigz, or we can move it to a directory in our PATH (such as /usr/local/bin/) to make it globally accessible.
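Since pigz may not be installed on every machine, a script can check for it and fall back to gzip. The following is a minimal sketch of that idea; the compressor and roundtrip variable names are our own, not part of either tool:

```shell
#!/bin/sh
# Prefer pigz when it is on the PATH; otherwise fall back to gzip.
if command -v pigz > /dev/null 2>&1; then
    compressor='pigz'
else
    compressor='gzip'
fi
printf 'Using compressor: %s\n' "${compressor}"

# Round-trip a short string to confirm the chosen tool works.
roundtrip="$(printf 'hello\n' | "${compressor}" | "${compressor}" --decompress)"
printf '%s\n' "${roundtrip}"   # prints: hello
```

Because pigz accepts the same basic flags as gzip, the rest of a pipeline can use "${compressor}" wherever it would otherwise name one of them.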
2.3. Using Wildcards
To make removal easier, we can use wildcards. These let us remove files without spelling out their full paths:
$ pigz --to-stdout --decompress 'infile.tgz' | tar --to-stdout --wildcards --delete '*junk*' -f - | pigz > 'outfile.tgz'
In this example, we use the --wildcards flag with --delete '*junk*'. This will remove all files with the word “junk” in the name.
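Because a wildcard can match more than we expect, it can help to preview the matching members before deleting them. The sketch below does this on a throwaway tarball; the file names are hypothetical, GNU tar is assumed, and gzip is used so the example runs without pigz installed:

```shell
#!/bin/sh
set -eu

workdir="$(mktemp -d)"
cd "${workdir}"

# Hypothetical sample files: two contain "junk" in their names.
printf 'a\n' > keep.txt
printf 'b\n' > junk1.txt
printf 'c\n' > old_junk.log
tar --create --gzip --file infile.tgz keep.txt junk1.txt old_junk.log

# Preview: list only the members that the pattern matches.
matched="$(tar --list --gzip --file infile.tgz --wildcards '*junk*')"
printf 'Would remove:\n%s\n' "${matched}"

# Delete the matching members; pigz can be swapped in for gzip here.
gzip --to-stdout --decompress 'infile.tgz' | tar --to-stdout --wildcards --delete '*junk*' -f - | gzip > 'outfile.tgz'

remaining="$(tar --list --gzip --file outfile.tgz)"
printf 'Remaining:\n%s\n' "${remaining}"
```

If the preview lists a member we want to keep, we can tighten the pattern before running the delete step.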
3. Removing Files From a Tarball Using bsdtar
We can also remove a file from a tarball using the bsdtar command. This is slightly simpler than the previous methods since we only need a single command.
3.1. Installing bsdtar
To install bsdtar, we need to download the source files for libarchive, the package that provides it.
After downloading the source files, we can build from the source:
$ cd *libarchive* && cmake . && make
As we can see, the process is similar to building pigz. The difference is that we need to use cmake to generate the build files before we run the make command.
The bsdtar command can be called from the created bin directory or can be made globally accessible by moving it to a directory in our PATH.
3.2. Using bsdtar
The bsdtar command enables us to copy a tarball while excluding files or directories matching a specified pattern:
$ bsdtar --auto-compress --create --file 'outfile.tgz' --exclude 'junk.txt' @'infile.tgz'
To copy a tarball using bsdtar, we use the same flags we would use to create a tarball from scratch. However, instead of providing a directory, we provide the tarball we want to copy. To indicate that we want to operate on the contents of the tarball, we must prefix the filename with @.
After that, we just have to specify the unwanted files using the --exclude flag, and our new .tgz file will be created.
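As a quick check of this approach, the sketch below copies a throwaway tarball while excluding junk.txt and then lists the result. The sample files are hypothetical, and the script skips the copy when bsdtar is not on the PATH:

```shell
#!/bin/sh
set -eu

workdir="$(mktemp -d)"
cd "${workdir}"

# Hypothetical sample files for the demonstration.
printf 'keep me\n' > keep.txt
printf 'remove me\n' > junk.txt
tar -czf infile.tgz keep.txt junk.txt

if command -v bsdtar > /dev/null 2>&1; then
    # Copy the archive's members into a new archive, leaving out junk.txt.
    bsdtar --auto-compress --create --file 'outfile.tgz' --exclude 'junk.txt' @'infile.tgz'
    result="$(bsdtar --list --file 'outfile.tgz')"
else
    result='bsdtar not installed; skipped'
fi
printf '%s\n' "${result}"
```

Unlike the tar --delete pipeline, this never modifies an archive in place: the original stays untouched and a filtered copy is written next to it.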
4. Comparing Speeds
Now that we’ve gone over a few different methods, let’s compare their speeds.
4.1. Creating Our Script
To measure the time taken for a command to execute, we just have to find the difference in the current time before and after execution. Using this method, let’s create a basic script to find how long it takes each method to run:
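#!/bin/sh
# Usage: ./test_script.sh <tarball> <member-to-remove>

# gzip + tar: decompress, delete the member, recompress
gzip_com() {
    gzip --to-stdout --decompress "${1}" | tar --to-stdout --delete "${2}" -f - | gzip > "${3}"
}

# Same pipeline with pigz in place of gzip
pigz_com() {
    pigz --to-stdout --decompress "${1}" | tar --to-stdout --delete "${2}" -f - | pigz > "${3}"
}

# bsdtar: copy the archive while excluding the member
bsdtar_com() {
    bsdtar --auto-compress --create --file "${3}" --exclude "${2}" '@'"${1}"
}

# Time each method, writing to a separate output file per method
for i in 'gzip_com' 'pigz_com' 'bsdtar_com'; do
    s="$(date +%s)"
    "${i}" "${1}" "${2}" "outfile__${i}__.tgz"
    e="$(date +%s)"
    printf '\n-- %s --\nSeconds: %s\n' "${i}" "$((e-s))"
done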
In this script, we iterate over a group of functions corresponding to the commands we want to time. To get the current time, we use date with the +%s format, which prints the number of seconds since the Unix epoch. As a result, our measurements have a resolution of one second.
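The same before-and-after technique can be wrapped in a small reusable helper. This is only a sketch: time_cmd and its variables are hypothetical names of our own, and sleep stands in for a real command:

```shell
#!/bin/sh
# Run any command and report how many whole seconds it took.
time_cmd() {
    start="$(date +%s)"
    "$@"
    end="$(date +%s)"
    elapsed="$((end - start))"
    printf 'Seconds: %s\n' "${elapsed}"
}

# Example: time a command that takes about one second.
time_cmd sleep 1
```

Passing the command through "$@" keeps its arguments intact, so the helper works with any command line, for example time_cmd gzip --best somefile.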
4.2. Running Our Script
Now that we have created our script, let’s test it out using a moderately sized tarball (~2.5 GB), which we can find in the nerd-fonts GitHub repository:
$ ./test_script.sh 'nerd_fonts_v3_0_0.tgz' 'nerd-fonts-3.0.0/patched-fonts/Gohu'
The first argument we pass to our script is the name of our tarball, and the second argument is the path of the file/directory we want to remove.
The results of this script will change depending on the speed of our computer, but the output should look something like:
-- gzip_com --
Seconds: 264
-- pigz_com --
Seconds: 89
-- bsdtar_com --
Seconds: 277
As we can see, pigz was roughly three times faster than the other two methods in this run (89 seconds versus 264 and 277), while gzip and bsdtar took about the same time.
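A speed comparison is only meaningful if every method produces the same archive contents. As a sanity check, we could compare the member lists of two output files. The sketch below uses hypothetical helper names (list_members, same_members) and demonstrates them on two tiny archives built with different compression settings:

```shell
#!/bin/sh
set -eu

# Print an archive's member names in sorted order.
list_members() {
    tar -tzf "${1}" | sort
}

# Succeed only when both archives contain the same member names.
same_members() {
    [ "$(list_members "${1}")" = "$(list_members "${2}")" ]
}

workdir="$(mktemp -d)"
cd "${workdir}"

# Two archives of the same file, compressed in different ways.
printf 'x\n' > a.txt
tar -czf one.tgz a.txt
tar -cf - a.txt | gzip --best > two.tgz

if same_members one.tgz two.tgz; then
    verdict='same members'
else
    verdict='different members'
fi
printf '%s\n' "${verdict}"   # prints: same members
```

With the timing script's outputs, the call would look like same_members 'outfile__gzip_com__.tgz' 'outfile__pigz_com__.tgz'. Note that this compares member names only, not file contents.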
5. Conclusion
In this article, we learned how to remove files from a .tgz file using tar together with gzip or pigz, and using bsdtar. After that, we timed each method with a short script to compare their speeds.