1. Introduction
File deletion is a routine part of Linux administration. Whether manually or with scripts, we delete files as part of upgrades, log rotation, backups, and many other activities. Since directories can contain large numbers of files, knowing how to handle them optimally can save a lot of time.
In this tutorial, we explore how to efficiently delete a large directory in Linux. First, we discuss file deletion in general and go over when, how, and why large directories come about. Next, we prepare large directories for removal and test several tools with the task. In addition, we show how to delete large datasets in less than a second. Finally, we discuss alternative approaches and implementations for faster directory removal.
Notably, it’s recommended to use ionice and nice when running processor-intensive or input-output-intensive operations. This way, we can lower the priority of the workload and spread it more evenly over time.
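For example, a lower-priority removal might look like this, assuming an idle input-output class (-c 3), the lowest CPU priority (-n 19), and a hypothetical /path/to/largedir:
$ ionice -c 3 nice -n 19 rm --recursive --force /path/to/largedir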
We tested the code in this tutorial on Debian 12 (Bookworm) with GNU Bash 5.2.15. It is POSIX-compliant and should work in any such environment.
2. File Deletion
Under Linux, files are represented by inodes. An inode stores a file’s metadata, including the location of its contents. On the other hand, directories are lists of names pointing to inodes.
Because of this, there are different ways to delete files.
2.1. Link Removal
Once there is no hard link or handle left to a file, its inode becomes available. When that happens, the kernel marks the inode number as free:
$ touch /file.ext
$ tail --follow /file.ext &
[1] 667
$ lsof /file.ext
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
tail 667 root 3r REG 8,16 0 666 /file.ext
$ rm /file.ext
First, we create a file and open it in tail for watching. After that, we use the lsof (List Open Files) command to confirm a handle to the file exists. Finally, we remove the actual file.
As a result, we only have the lingering inode due to an open handle. Killing the background tail process would free that inode as well.
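To confirm and then release such a lingering inode, we can list deleted-but-open files via the +L1 switch of lsof and kill the background job; the output should resemble our earlier listing, with the link count (NLINK) at 0:
$ lsof +L1
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
tail 667 root 3r REG 8,16 0 0 666 /file.ext (deleted)
$ kill %1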
2.2. Purging
Importantly, the metadata and contents of a deleted file can remain intact on the storage until overwritten, i.e., purged. This behavior varies between older and newer ext filesystems. It’s like selling a house with everything from the last owner still in it. What does this mean for us?
We won’t need to bother calling the movers: the old data simply stays in place. Similarly, there are two main reasons it’s costly to overwrite storage segments with new data: slowness and wear.
Since inodes take up kilobytes at most, ext3 and later versions do indeed zero them out, but they don’t bother to purge the file contents. How does this behavior relate to directories as containers of files?
3. Create a Large Directory
Of course, to have a large directory to delete, it must first exist.
3.1. Common Large Directories
From what we explained earlier, we can deduce that the most efficient way to remove a directory is to just remove all references, i.e., the directory entry and those of its contents. In practice, that means size is not so much the issue as object quantity.
File stores with thousands or millions of entries exist for many reasons:
- log rotations
- database files
- distributed filesystems
- specific use cases
Let’s explore some preliminary steps we can take to avoid having to delete large amounts of files.
3.2. Considerations
Importantly, how well the kernel deals with many files depends strongly on the filesystem type. For example, XFS might be slow with many small files, while ReiserFS was specifically designed to handle them.
Another way to handle large directories with many files is to create a filesystem within a file. For that, we first create a file of a given size and format it with a given filesystem:
$ dd if=/dev/zero of=largedirfs.img bs=1G count=5 && mkfs.ext4 largedirfs.img
In this case, we generate the largedirfs.img file with an example size of 5GB.
After that, we mount the file as a loop device:
$ losetup /dev/loop666 largedirfs.img && mount -o X-mount.mkdir /dev/loop666 /mnt/largedir
This way, we can use the resulting /mnt/largedir path for any large directories and later just reformat or recreate the filesystem instead of handling millions of objects one by one.
Of course, doing so can be tedious and not part of the regular file creation process.
3.3. Artificial Large Directory Creation
So, let’s create a directory with 1 million files as quickly as possible:
$ mkdir /dir1m; for f in {1..1000000}; do > /dir1m/$f.ext; done
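To verify the result, we can count the entries, although even that takes a moment on a directory of this size:
$ find /dir1m -type f | wc -l
1000000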
We’ll test /dir1m with several deletion tools. Using the time command, we’ll measure how fast each operation runs.
4. Delete a Large Directory With rm (Remove)
The classic rm does indeed only unlink files and doesn’t purge them.
However, there are a couple of ways to do the same for whole directories, which we’ll look at next.
4.1. Wildcards
Combining rm with globbing, we might experience issues:
$ rm --force /dir1m/*.ext
/bin/rm: cannot execute [Argument list too long]
The problem here is that wildcard expansion turns all 1 million filenames into arguments. Consequently, the command line gets too long for the kernel to accept, and the command fails to execute.
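The exact limit comes from the kernel’s ARG_MAX value, which we can check via getconf; the figure below is from our test system and varies between setups:
$ getconf ARG_MAX
2097152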
However, we’ve no reason to use this syntax if we want the whole directory removed.
4.2. Recursion
The --recursive (-r) flag is best when dealing with many files. In fact, recursion is necessary to delete a directory or subdirectory:
$ time rm --recursive --force /dir1m
real 13.57s
user 1.04s
sys 8.11s
cpu 67%
This is our first real result: it took around 14 seconds to delete 1 million files.
So, what alternatives do we have to the standard rm?
5. Finding and Deleting Files With find
Of course, we can also use the find command to remove files. However, in its naive form, it uses considerably more resources and takes much more time to complete.
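For illustration only, the naive approach spawns one rm process per file, something we wouldn’t want to wait out on a million files:
$ find /dir1m -type f -exec rm {} \;
Notably, terminating -exec with + instead of \; batches multiple paths into each rm call, similar to what xargs does.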
One improvement would be to use the GNU -delete switch to find:
$ time find /dir1m -delete
real 29.93s
user 1.11s
sys 8.40s
cpu 31%
Doing this avoids spawning separate rm processes. Additionally, we can get better performance via xargs:
$ time find /dir1m -print0 | xargs --null --no-run-if-empty rm --recursive --force
real 12.80s
user 1.16s
sys 8.62s
cpu 76%
Basically, we just output NUL-separated file paths and pass them to xargs, which runs rm on them in batches. For a single directory, this performs about the same as a plain recursive rm.
Except for the last one, all of these options are slow mainly because they don’t use the internal iteration of rm with --recursive. Furthermore, they needlessly go through each file separately. That only makes sense when we filter what gets deleted.
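For instance, a hypothetical filtered cleanup that only removes files older than 30 days would justify the per-file traversal:
$ find /dir1m -name '*.ext' -mtime +30 -delete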
6. Deleting a Large Directory With rsync
An unlikely option for efficient deletion is the rsync command:
$ mkdir /void
$ time rsync --archive --delete /void/ /dir1m/
real 15.74s
user 1.50s
sys 12.47s
cpu 88%
$ rm --recursive --force /void /dir1m
First, we create an empty directory: /void. Next, we synchronize /dir1m against the empty /void source via the --archive and --delete flags, and finally remove the leftovers.
Similar to rm, rsync deletes via the unlink() system call. Unlike rm, however, rsync performs very little processing beyond that.
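If we want to verify that, strace can summarize the system calls a given deletion run makes; we skip the lengthy summary table here:
$ strace -c -f rsync --archive --delete /void/ /dir1m/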
There is another option that works the same way.
7. Using perl to Delete Directory Contents
In fact, perl is useful not only for text processing but also for file operations. Written in C, it’s also suitable for low-level system calls:
$ cd /dir1m
$ time perl -e 'for(<*>){((stat)[9]<(unlink))}'
real 17.05s
user 2.57s
sys 13.36s
cpu 93%
Here, we use -e (execute) to run a one-liner that iterates over all files in the current directory via the <*> glob and calls unlink() on each; the stat comparison is incidental, as the unlink call does the actual deletion.
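For comparison, an equivalent but more readable sketch without the stat trick produces the same effect:
$ perl -e 'unlink for glob("*")'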
Due to the overhead of a scripting language and its interpreter, this method is slightly slower than rsync and rm. Still, perl provides options for precise filtering, should we require that.
8. Custom Filesystem Formatting for Directory Removal
If we go with the filesystem-in-a-file approach, we can follow several steps to refresh its contents.
First, we umount the respective loop device:
$ umount /dev/loop666
Next, we format the file via mkfs.ext4:
$ mkfs.ext4 -F largedirfs.img
Here, we [-F]orce the format, since mkfs.ext4 usually detects an existing filesystem and asks for confirmation.
Finally, we can remount:
$ mount -o X-mount.mkdir /dev/loop666 /mnt/largedir
Overall, the whole process usually takes less than a second regardless of the filesystem contents:
$ time {
umount /dev/loop666 &&
mkfs.ext4 -F largedirfs.img &&
mount /dev/loop666 /mnt/largedir;
}
[...]
real 0m0.208s
user 0m0.005s
sys 0m0.021s
At this point, we have a clean /mnt/largedir with a fresh filesystem.
9. Modes of Delete Execution
Sometimes, it’s not about the particular tool, but the approach that the tool takes to perform the deletion.
9.1. Parallel Processing
Although parallelizing file deletion might sound like a good approach, it often isn’t for a number of reasons:
- multiple command runs are worse than a single command with internal unlink calls
- CPU processing isn’t a considerable part of deletion, so reducing it doesn’t bring the overall time down
- input-output operations depend mainly on the storage setup, not the central processing unit
If anything, parallel removal via, e.g., GNU parallel, might lead to considerably decreased performance if the storage medium isn’t up to the task.
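Purely as an illustration, such a parallelized run might batch NUL-separated paths across concurrent rm processes:
$ find /dir1m -type f -print0 | parallel --null -X rm --force
Again, on most storage setups, this is unlikely to beat a single recursive rm.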
For example, the rmz tool attempts to parallelize the removal process according to the current system specifications.
To begin with, let’s use wget to download the appropriate binary from the release page:
$ wget https://github.com/SUPERCILEX/fuc/releases/download/2.0.0/rmz-x86_64-unknown-linux-gnu
Next, we make the resulting file executable:
$ chmod +x rmz-x86_64-unknown-linux-gnu
Lastly, we use rmz:
$ time ./rmz-x86_64-unknown-linux-gnu /dir1m/
real 0m41.935s
user 0m0.490s
sys 0m13.286s
As we can see, the parallel solution is indeed much slower than most others. Of course, results can vary based on the system used.
9.2. Cron Scheduling
Instead of attempting parallel runs of rm, we can schedule a large directory deletion in the background with a scheduler like cron.
To do so, we create a script and add it to crontab for execution:
$ cat /etc/cron.daily/rmlargedir
#!/usr/bin/env bash
rm --recursive --force /dir1m
$ chmod +x /etc/cron.daily/rmlargedir
In particular, we create rmlargedir as an executable script under /etc/cron.daily/, so cron runs it every day via run-parts. Notably, run-parts skips filenames containing dots by default, which is why we omit the .sh extension. This way, we don’t have to manually run the command or wait for its completion. On the downside, tying the deletion to a fixed schedule can necessitate a matching creation cycle as well.
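Alternatively, to pick an exact time, we can add an entry to /etc/crontab directly; the 03:00 schedule here is an arbitrary choice:
0 3 * * * root rm --recursive --force /dir1m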
Here, we use rm, but any of the other deletion methods works as well.
10. Summary
In this article, we discussed methods for efficiently deleting a large directory in Linux.
The clear winner in our tests is the rm command. However, if we want to have some control over what we remove, then find and perl are viable alternatives. Further, if we prepare a special path for an arbitrarily large dataset, its removal can happen in less than a second.
In conclusion, we should always define what’s to be done when deleting and choose the most efficient way to do it.