1. Introduction
File deletion is a routine part of Linux administration. Whether manually or with scripts, we delete files as part of upgrades, log rotation, backups, and many other activities. Since directories can contain large numbers of files, knowing how to handle them optimally can save a lot of time.
In this tutorial, we explore how to efficiently delete a large directory in Linux. First, we discuss file deletion in general and go over when, how, and why large directories come about. Next, we prepare large directories for removal and test several tools with the task. In addition, we show how to delete large datasets in less than a second. Finally, we discuss alternative approaches and implementations for faster directory removal.
Notably, it’s recommended to use ionice and nice when running processor-intensive or input-output-intensive operations. This way, we can lower the priority of the workload and spread it more evenly over time.
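For example, a lower-priority removal might look like this, assuming an idle input-output class (-c 3), the lowest CPU priority (-n 19), and a hypothetical /path/to/largedir:
$ ionice -c 3 nice -n 19 rm --recursive --force /path/to/largedir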
We tested the code in this tutorial on Debian 12 (Bookworm) with GNU Bash 5.2.15. It is POSIX-compliant and should work in any such environment.
2. File Deletion
Under Linux, files are represented by inodes. An inode stores a file’s metadata, including the location of its contents. On the other hand, directories are lists of names pointing to inodes.
Because of this, there are different ways to delete files.
2.1. Link Removal
Once there is no hard link or handle left to a file, its inode becomes available. When that happens, the kernel marks the inode number as free:
$ touch /file.ext
$ tail --follow /file.ext &
[1] 667
$ lsof /file.ext
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
tail 667 root 3r REG 8,16 0 666 /file.ext
$ rm /file.ext
First, we create a file and open it in tail for watching. After that, we use the lsof (List Open Files) command to confirm a handle to the file exists. Finally, we remove the actual file.
As a result, we only have the lingering inode due to an open handle. Killing the background tail process would free that inode as well.
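To confirm and then release such a lingering inode, we can list deleted-but-open files via the +L1 switch of lsof and kill the background job; the output should resemble our earlier listing, with the link count (NLINK) at 0:
$ lsof +L1
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
tail 667 root 3r REG 8,16 0 0 666 /file.ext (deleted)
$ kill %1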
2.2. Purging
Importantly, the metadata and contents of a deleted file can remain intact on the storage until overwritten, i.e., purged. This behavior varies between older and newer ext filesystems. It’s like selling a house with everything from the last owner still in it. What does this mean for us?
We won’t need to bother calling the movers: the old data simply stays in place. Similarly, there are two main reasons it’s costly to overwrite storage segments with new data: slowness and wear.
Since inodes take up kilobytes at most, ext3 and later versions do indeed zero them out, but they don’t bother to purge the file contents. How does this behavior relate to directories as containers of files?
3. Create a Large Directory
Of course, to have a large directory to delete, it must first exist.
3.1. Common Large Directories
From what we explained earlier, we can deduce that the most efficient way to remove a directory is to just remove all references, i.e., the directory entry and those of its contents. In practice, that means size is not so much the issue as object quantity.
File stores with thousands or millions of entries exist for many reasons:
- log rotations
- database files
- distributed filesystems
- specific use cases
Let’s explore some preliminary steps we can take to avoid having to delete large amounts of files.
3.2. Considerations
Importantly, how well the kernel deals with many files depends strongly on the filesystem type. For example, XFS might be slow with many small files, while ReiserFS was specifically designed to handle them.
Another way to handle large directories with many files is to create a filesystem within a file. For that, we first create a file of a given size and format it with a given filesystem:
$ dd if=/dev/zero of=largedirfs.img bs=1G count=5 && mkfs.ext4 largedirfs.img
In this case, we generate the largedirfs.img file with an example size of 5GB.
After that, we mount the file as a loop device:
$ losetup /dev/loop666 largedirfs.img && mount -o X-mount.mkdir /dev/loop666 /mnt/largedir
This way, we can use the resulting /mnt/largedir path for any large directories and later just reformat or recreate the filesystem instead of handling millions of objects one by one.
Of course, doing so can be tedious and not part of the regular file creation process.
3.3. Artificial Large Directory Creation
So, let’s create a directory with 1 million files as quickly as possible:
$ mkdir /dir1m; for f in {1..1000000}; do > /dir1m/$f.ext; done
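To verify the result, we can count the entries, although even that takes a moment on a directory of this size:
$ find /dir1m -type f | wc -l
1000000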
We’ll test /dir1m with several deletion tools. Using the time command, we’ll measure how fast each operation runs.
4. Delete a Large Directory With rm (Remove)
The classic rm does indeed only unlink files and doesn’t purge them.
However, there are a couple of ways to do the same for whole directories, which we’ll look at next.
4.1. Wildcards
Combining rm with globbing, we might experience issues:
$ rm --force /dir1m/*.ext
/bin/rm: cannot execute [Argument list too long]
The problem here is that wildcard expansion turns all 1 million filenames into arguments. Consequently, the command line gets too long for the kernel to accept, and the command fails to execute.
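The exact limit comes from the kernel’s ARG_MAX value, which we can check via getconf; the figure below is from our test system and varies between setups:
$ getconf ARG_MAX
2097152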
However, we’ve no reason to use this syntax if we want the whole directory removed.
4.2. Recursion
The --recursive (-r) flag is best when dealing with many files. In fact, recursion is necessary to delete a directory or subdirectory:
$ time rm --recursive --force /dir1m
real 13.57s
user 1.04s
sys 8.11s
cpu 67%
This is our first real result: it took around 14 seconds to delete 1 million files.
So, what alternatives do we have to the standard rm?
5. Finding and Deleting Files With find
Of course, we can also use the find command to remove files. However, in its naive form, it uses considerably more resources and takes much more time to complete.
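For illustration only, the naive approach spawns one rm process per file, something we wouldn’t want to wait out on a million files:
$ find /dir1m -type f -exec rm {} \;
Notably, terminating -exec with + instead of \; batches multiple paths into each rm call, similar to what xargs does.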
One improvement would be to use the GNU -delete switch to find:
$ time find /dir1m -delete
real 29.93s
user 1.11s
sys 8.40s
cpu 31%
Doing this avoids spawning separate rm processes. Additionally, we can get better performance via xargs:
$ time find /dir1m -print0 | xargs --null --no-run-if-empty rm --recursive --force
real 12.80s
user 1.16s
sys 8.62s
cpu 76%
Basically, we just output NUL-separated file paths and pass them to xargs, which runs rm on them in batches. For a single directory, this performs about the same as a plain recursive rm.
Except for the last one, all of these options are slow mainly because they don’t use the internal iteration of rm with --recursive. Furthermore, they needlessly go through each file separately. That only makes sense when we filter what gets deleted.
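For instance, a hypothetical filtered cleanup that only removes files older than 30 days would justify the per-file traversal:
$ find /dir1m -name '*.ext' -mtime +30 -delete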
6. Deleting a Large Directory With rsync
An unlikely option for efficient deletion is the rsync command:
$ mkdir /void
$ time rsync --archive --delete /void/ /dir1m/
real 15.74s
user 1.50s
sys 12.47s
cpu 88%
$ rm --recursive --force /void /dir1m
First, we create an empty directory: /void. Next, we synchronize /dir1m against the empty /void source via the --archive and --delete flags, and finally remove the leftovers.
Similar to rm, rsync deletes via the unlink() system call. Unlike rm, however, rsync performs very little processing beyond that.
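If we want to verify that, strace can summarize the system calls a given deletion run makes; we skip the lengthy summary table here:
$ strace -c -f rsync --archive --delete /void/ /dir1m/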
There is another option that works the same way.
7. Using perl to Delete Directory Contents
In fact, perl is useful not only for text processing but also for file operations. Written in C, it’s also suitable for low-level system calls:
$ cd /dir1m
$ time perl -e 'for(<*>){((stat)[9]<(unlink))}'
real 17.05s
user 2.57s
sys 13.36s
cpu 93%
Here, we use -e (execute) to run a one-liner that iterates over all files in the current directory via the <*> glob and calls unlink() on each; the stat comparison is incidental, as the unlink call does the actual deletion.
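For comparison, an equivalent but more readable sketch without the stat trick produces the same effect:
$ perl -e 'unlink for glob("*")'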
Due to the overhead of a scripting language and its interpreter, this method is slightly slower than rsync and rm. Still, perl provides options for precise filtering, should we require that.
8. Custom Filesystem Formatting for Directory Removal
If we go with the filesystem-in-a-file approach, we can follow several steps to refresh its contents.
First, we umount the respective loop device:
$ umount /dev/loop666
Next, we format the file via mkfs.ext4:
$ mkfs.ext4 -F largedirfs.img
Here, we [-F]orce the format, since mkfs.ext4 usually detects an existing filesystem and asks for confirmation.
Finally, we can remount:
$ mount -o X-mount.mkdir /dev/loop666 /mnt/largedir
Overall, the whole process usually takes less than a second regardless of the filesystem contents:
$ time {
umount /dev/loop666 &&
mkfs.ext4 -F largedirfs.img &&
mount /dev/loop666 /mnt/largedir;
}
[...]
real 0m0.208s
user 0m0.005s
sys 0m0.021s
At this point, we have a clean /mnt/largedir with a fresh filesystem.
9. Modes of Delete Execution
Sometimes, it’s not about the particular tool, but the approach that the tool takes to perform the deletion.
9.1. Parallel Processing
Although parallelizing file deletion might sound like a good approach, it often isn’t for a number of reasons:
- multiple command runs are worse than a single command with internal unlink calls
- CPU processing isn’t a considerable part of deletion, so reducing it doesn’t bring the overall time down
- input-output operations depend mainly on the storage setup, not the central processing unit
If anything, parallel removal via, e.g., GNU parallel, might lead to considerably decreased performance if the storage medium isn’t up to the task.
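Purely as an illustration, such a parallelized run might batch NUL-separated paths across concurrent rm processes:
$ find /dir1m -type f -print0 | parallel --null -X rm --force
Again, on most storage setups, this is unlikely to beat a single recursive rm.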
For example, the rmz tool attempts to parallelize the removal process according to the current system specifications.
To begin with, let’s use wget to download the appropriate binary from the release page:
$ wget https://github.com/SUPERCILEX/fuc/releases/download/2.0.0/rmz-x86_64-unknown-linux-gnu
Next, we make the resulting file executable:
$ chmod +x rmz-x86_64-unknown-linux-gnu
Lastly, we use rmz:
$ time ./rmz-x86_64-unknown-linux-gnu /dir1m/
real 0m41.935s
user 0m0.490s
sys 0m13.286s
As we can see, the parallel solution is indeed much slower than most others. Of course, results can vary based on the system used.
9.2. Cron Scheduling
Instead of attempting parallel runs of rm, we can schedule a large directory deletion in the background with a scheduler like cron.
To do so, we create a script and add it to crontab for execution:
$ cat /etc/cron.daily/rmlargedir
#!/usr/bin/env bash
rm --recursive --force /dir1m
$ chmod +x /etc/cron.daily/rmlargedir
In particular, we create rmlargedir as an executable script under /etc/cron.daily/, so cron runs it every day via run-parts. Notably, run-parts skips filenames containing dots by default, which is why we omit the .sh extension. This way, we don’t have to manually run the command or wait for its completion. On the downside, tying the deletion to a fixed schedule can necessitate a matching creation cycle as well.
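Alternatively, to pick an exact time, we can add an entry to /etc/crontab directly; the 03:00 schedule here is an arbitrary choice:
0 3 * * * root rm --recursive --force /dir1m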
Here, we use rm, but any of the other deletion methods works as well.
10. Summary
In this article, we discussed methods for efficiently deleting a large directory in Linux.
The clear winner in our tests is the rm command. However, if we want to have some control over what we remove, then find and perl are viable alternatives. Further, if we prepare a special path for an arbitrarily large dataset, its removal can happen in less than a second.
In conclusion, we should always define what’s to be done when deleting and choose the most efficient way to do it.