1. Introduction

The du command is a way to get file size information. Although it has many options, filtering by extension isn’t one of them. While extensions aren’t that critical in Linux, they still usually distinguish file types. Because of this, adding extensions, as well as separating and handling data based on them, is often beneficial.

In this tutorial, we’ll look at the du command and methods to check the total file size of files with a particular extension. First, we go over file sizes and space usage. Next, we thoroughly explore the du command. Finally, we check ways to use du for our needs.

We tested the code in this tutorial on Debian 11 (Bullseye) with GNU Bash 5.1.4. It should work in most POSIX-compliant environments unless otherwise specified.

2. File Sizes and Space Usage

To begin with, files usually occupy at least a given minimal chunk of storage due to the block sizes at different levels of the storage hierarchy:

  • physical device controllers, where 512-byte sectors or blocks are common
  • filesystem configuration and limitations, which may necessitate different block or allocation unit sizes like 1024b, 2048b, 4096b, 8192b, and more

So, instead of getting the size of a file, we might end up getting the space used by it. Also, there are other specific conditions to consider, such as fragmented data, indirect blocks, and sparse data.

Thus, a 1b file would usually occupy at least the physical block size of 512 bytes on most physical storage media. Meanwhile, the space that the filesystem allocates for its data is usually larger. On the other hand, a 0b file often resides only as a logical mark instead of actually using even a single block of space on storage:

+------------------------------------------------------------------------------------------+
|            Real Size       |   Apparent Size  |            Size on Storage               |
+------------------------------------------------------------------------------------------+
| [ minBlockSize, inf]       | realSize         | ceil(realSize/minBlockSize)*minBlockSize |
|       ^                    | + fragmentedData | + fragmentedData                         |
|   min(physicalBlockSize,   | + indirectBlocks | + indirectBlocks                         |
|       filesystemBlockSize) | + sparseData     | + sparseData                             |
|------------------------------------------------------------------------------------------|
| [1b, minBlockSize]         | realSize         | minBlockSize                             | 
|------------------------------------------------------------------------------------------|
| 0b                         | 0b               | 0b                                       |
+------------------------------------------------------------------------------------------+

Because of these file size discrepancies, some tools have a way of getting the apparent and real sizes of a file.
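For instance, GNU stat is one such tool. The following is only a minimal sketch, assuming GNU coreutils; the temporary directory and the file name tiny are made up for illustration:

```shell
# Create a throwaway 1-byte file (hypothetical name) in a temporary directory
tmp=$(mktemp -d)
printf 'x' > "$tmp/tiny"

# %s = apparent size in bytes
# %b = number of allocated blocks, %B = size in bytes of each such block
apparent=$(stat --format='%s' "$tmp/tiny")
blocks=$(stat --format='%b' "$tmp/tiny")
block_bytes=$(stat --format='%B' "$tmp/tiny")

echo "apparent size: $apparent bytes"
echo "space used:    $((blocks * block_bytes)) bytes"

# Clean up the temporary directory
rm -r "$tmp"
```

The apparent size here is always 1 byte, while the space used depends on the filesystem and may even read as 0 until the data is flushed to storage.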

3. The du Command

The du (Disk Usage) command is a way to check how much space given files use and what their sizes are:

$ du
12      ./subdir
40      .

In our sample dataset, we see the current working directory . (period) uses up a total of 40K on the medium, while the only subdirectory subdir accounts for 12K. By default, GNU du reports these sizes in 1K blocks.

Alternatively, we can pass a file or directory as an argument:

$ du ./dir
12      ./dir/subdir
40      ./dir

To do so for multiple files or directories, we use --files0-from with a file name or - for stdin as an argument:

$ printf './dir/file1\0./dir/file2\0./dir\0' | du --files0-from=-
4       ./dir/file1
4       ./dir/file2
12      ./dir/subdir
32      ./dir

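Either source should pass all desired paths with NUL terminators, as we did here with printf and \0. Notably, ./dir shows 32 instead of 40 here, since du counts each file only once and file1 and file2 already appear as separate entries.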
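The same approach works when the list is stored in a file. Here's a minimal sketch, assuming GNU du; the names a, b, and list are hypothetical and only serve the example:

```shell
# Set up two small files with known sizes in a temporary directory
tmp=$(mktemp -d)
printf 'abc'   > "$tmp/a"    # 3 bytes
printf 'hello' > "$tmp/b"    # 5 bytes

# Store the NUL-terminated paths in a file instead of piping them
printf '%s\0' "$tmp/a" "$tmp/b" > "$tmp/list"

# --bytes (see section 3.1) keeps the numbers independent of block sizes;
# cut keeps only the size column
sizes=$(du --bytes --files0-from="$tmp/list" | cut -f1)
echo "$sizes"

# Clean up the temporary directory
rm -r "$tmp"
```

Keeping the list in a file can be handy when the same set of paths is checked repeatedly.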

3.1. Human-Readable Sizes

Since 1K blocks aren’t always a helpful measure of sizes, du offers flags to change this unit:

  • --human-readable or -h displays sizes in bigger units when convenient and adds the unit letters, while --si does the same using powers of 1000, not 1024
  • --bytes or -b is only available in GNU du and gets the file size in bytes
  • --block-size or -B can scale to a requested unit, with -k and -m being shorthands for --block-size=1K and --block-size=1M, respectively
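To see how the scaling flags relate, here's a small sketch, assuming GNU du. The file name sample is hypothetical, and --apparent-size (covered in section 3.2) is only there to keep the numbers reproducible across filesystems:

```shell
# Create a file of exactly 5000 bytes in a temporary directory
tmp=$(mktemp -d)
head -c 5000 /dev/zero > "$tmp/sample"

# Same file, scaled to different units; du rounds up to the next whole unit
in_bytes=$(du --apparent-size --block-size=1 "$tmp/sample" | cut -f1)
in_kib=$(du --apparent-size -k "$tmp/sample" | cut -f1)
in_kib_long=$(du --apparent-size --block-size=1K "$tmp/sample" | cut -f1)
in_mib=$(du --apparent-size -m "$tmp/sample" | cut -f1)

echo "bytes: $in_bytes, 1K units: $in_kib (long form: $in_kib_long), 1M units: $in_mib"

# Clean up the temporary directory
rm -r "$tmp"
```

Because of the rounding, small files look much bigger in large units: 5000 bytes become 5 units of 1K and 1 unit of 1M.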

So, let’s try to get a better picture of sizes:

$ du --human-readable
12K     ./subdir
40K     .
$ du --bytes
6199    ./subdir
24224   .

Still, from the above, only the non-standard --bytes or -b ensures we get the apparent file size. Let’s see a switch for that specific purpose.

3.2. Apparent Size

Because the size of a file depends on the viewpoint, du has the --apparent-size flag for getting a value as close as possible to the real file size:

$ du --human-readable file1
4.0K    file1
$ du --human-readable --apparent-size file1
666     file1

Notably, the second output displays the actual byte size instead of an approximation. This way, we can disregard hardware and filesystem blocks to get the actual size instead of space usage. For most purposes, --bytes (-b) and --apparent-size (which -b implies) provide the most accurate file size.
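As a quick check of that relationship, the sketch below compares -b with its long-form equivalent, --apparent-size together with --block-size=1. It assumes GNU du, and the file name note is hypothetical:

```shell
# Create an 11-byte file in a temporary directory
tmp=$(mktemp -d)
printf 'hello world' > "$tmp/note"

# -b is shorthand for --apparent-size --block-size=1 in GNU du
short_form=$(du -b "$tmp/note" | cut -f1)
long_form=$(du --apparent-size --block-size=1 "$tmp/note" | cut -f1)

echo "-b: $short_form, long form: $long_form"

# Clean up the temporary directory
rm -r "$tmp"
```

Both commands report the same byte count regardless of how many blocks the filesystem allocates for the file.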

3.3. List Files

Instead of providing files as arguments, we can use --all or -a to list the whole hierarchy below the current or supplied directory with sizes:

$ du --bytes --all
666     ./file1
79      ./file2
1120    ./subdir/file4
983     ./subdir/file5
6199    ./subdir
13184   ./file3
24224   .

Now, we have a wider view of the size distributions.

In addition to the usual --dereference (-L) and --no-dereference (-P) flags to follow and not follow links, we can dereference only arguments with --dereference-args or -D (-H).

Importantly, we can also perform filtering:

  • --one-file-system or -x is similar to the -xdev flag of the find command, as it skips directories on other filesystems during recursive searches
  • --exclude provides a way to exclude files by name pattern
  • --exclude-from or -X is similar to --exclude but gets patterns from a file passed as a value

In fact, we can filter and get a total for a given subset of the data.
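For example, here's a minimal sketch of --exclude, assuming GNU du. The directory src and the files main.c and main.o are hypothetical:

```shell
# Build a tiny tree with one source file and one object file
tmp=$(mktemp -d)
mkdir -p "$tmp/src"
printf 'int x;' > "$tmp/src/main.c"
printf 'obj'    > "$tmp/src/main.o"

# List everything below src except paths matching the pattern *.o;
# cut keeps only the path column
listing=$(du --bytes --all --exclude='*.o' "$tmp/src" | cut -f2)
echo "$listing"

# Clean up the temporary directory
rm -r "$tmp"
```

The listing contains main.c and the src directory itself, but not main.o. Notably, --exclude only removes matches, so it doesn't directly select a single extension.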

3.4. Totals

Conveniently, we can get a --total (-c) of the sizes for a given directory:

$ du --human-readable --apparent-size --all --total ./dir
666     ./dir/file1
79      ./dir/file2
1.1K    ./dir/subdir/file4
983     ./dir/subdir/file5
6.1K    ./dir/subdir
13K     ./dir/file3
24K     ./dir
24K     total

Thus, we see the total directory size along with its whole structure and a breakdown. The flag is often more useful when combining files and directories from different paths:

$ printf '/etc/ssh/sshd_config\0./dir\0' | du --human-readable --apparent-size --all --total --files0-from=-
3.3K    /etc/ssh/sshd_config
666     ./dir/file1
79      ./dir/file2
1.1K    ./dir/subdir/file4
983     ./dir/subdir/file5
6.1K    ./dir/subdir
13K     ./dir/file3
24K     ./dir
27K     total

Moreover, we can extract only the last row value for each argument with --summarize or -s:

$ du --human-readable --apparent-size --summarize ./dir
24K     ./dir

Further, we can combine --total and --summarize and use tools like tail to get only the last row, i.e., the grand total.
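The following sketch shows one way to do that, assuming GNU du and two hypothetical files a and b in separate directories:

```shell
# Two files with known sizes in different directories
tmp=$(mktemp -d)
mkdir -p "$tmp/one" "$tmp/two"
printf 'abc'   > "$tmp/one/a"   # 3 bytes
printf 'hello' > "$tmp/two/b"   # 5 bytes

# --summarize prints one row per argument, --total appends the sum,
# tail keeps only that last row, and cut drops the "total" label
grand_total=$(du --bytes --summarize --total "$tmp/one/a" "$tmp/two/b" |
  tail -n 1 | cut -f1)
echo "$grand_total"

# Clean up the temporary directory
rm -r "$tmp"
```

With --bytes, the result is simply the sum of the two apparent sizes.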

Now, let’s see how we can get the total for a given extension.

4. File Size for Extension

There are several ways to get the total size for files of a specific type.

4.1. Using Globbing

Many shells support globbing to select multiple paths at once.

First, we can apply globbing by file name within a single directory without recursion to get the total size of .c files:

$ du --human-readable --apparent-size --total --summarize ./dir/*.c

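To descend into the direct subdirectories, we can use the ** double wildcard, which Bash treats like a single * by default:

$ du --human-readable --apparent-size --total --summarize ./dir/**/*.c

In this case, the pattern only matches .c files exactly one level below ./dir, so we exclude the .c files in the root of ./dir. With the globstar option enabled via shopt -s globstar, ** matches zero or more directory levels instead, so the same pattern also covers the root of ./dir and deeper subdirectories. Yet, we also have a shell-independent way to get the total size of all files with a given extension under a provided path.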
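To make the effect of the Bash globstar option on ** concrete, the sketch below evaluates the same pattern with the option off and on. It assumes GNU du and a bash binary in the PATH, and all file names are hypothetical:

```shell
# Build a tree with .c files at three different depths
tmp=$(mktemp -d)
mkdir -p "$tmp/dir/sub/deep"
printf 'abc'     > "$tmp/dir/a.c"            # 3 bytes, root of dir
printf 'hello'   > "$tmp/dir/sub/b.c"        # 5 bytes, one level down
printf 'goodbye' > "$tmp/dir/sub/deep/c.c"   # 7 bytes, two levels down

# The pipeline stays unexpanded in a variable so each bash call globs it
cmd='du --bytes --total --summarize ./dir/**/*.c | tail -n 1 | cut -f1'

# +O globstar turns the option off, -O globstar turns it on
without_globstar=$(cd "$tmp" && bash +O globstar -c "$cmd")
with_globstar=$(cd "$tmp" && bash -O globstar -c "$cmd")

echo "without globstar: $without_globstar"
echo "with globstar:    $with_globstar"

# Clean up the temporary directory
rm -r "$tmp"
```

Without globstar, only sub/b.c matches. With it, all three .c files contribute to the total.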

4.2. Using find

As usual, the output of find provides a flexible way to process file paths:

$ LC_ALL=C find ./dir -iname '*.c' -type f -print0 |
  du --human-readable --apparent-size --total --summarize --files0-from=- |
  tail --lines=1

In essence, we use find to output the paths of all regular files (-type f) whose names match *.c regardless of case (-iname), i.e., both .C and .c, with each path NUL-terminated via -print0. Notably, we ensure the locale settings are standard with LC_ALL=C.

All paths we get are piped to du for processing with --files0-from=-. Finally, we only extract the total summary with tail.
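For repeated use, the pipeline can be wrapped in a small function. This is only a sketch under the same assumptions (GNU find and du); the function name ext_total_bytes and the sample files are hypothetical:

```shell
# Hypothetical helper: print the total apparent size in bytes of all
# regular files under directory $1 whose names end in .$2 (any case)
ext_total_bytes() {
  LC_ALL=C find "$1" -type f -iname "*.$2" -print0 |
    du --bytes --total --files0-from=- |
    tail -n 1 | cut -f1
}

# Sample tree: two C files with different extension case and one text file
tmp=$(mktemp -d)
mkdir -p "$tmp/dir/sub"
printf 'abc'     > "$tmp/dir/a.c"       # 3 bytes
printf 'hello'   > "$tmp/dir/sub/b.C"   # 5 bytes
printf 'skip me' > "$tmp/dir/notes.txt"

c_total=$(ext_total_bytes "$tmp/dir" c)
echo "$c_total"

# Clean up the temporary directory
rm -r "$tmp"
```

Here, --bytes replaces --human-readable and --apparent-size so that the result is a plain number suitable for scripts, and --summarize is left out since find only passes files.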

5. Summary

In this article, we looked at the du command and ways to get the total size of files with a given extension.

In conclusion, while du is a flexible command, combining it with find provides finer control over directory traversal and file selection.