1. Introduction

Everything is a file in UNIX and, by extension, Linux. Hence, checking file sizes is a common activity, especially when dealing with logs or storage considerations. For this reason, scripts often automate the task and link it to other actions that depend on the size of a given file.

In this tutorial, we look at portable commands to get file sizes in bytes. First, we briefly explore different file size units. Next, we delve into file size checks with several standard tools. Finally, we discuss some common interpreted programming languages that provide an easy way of getting file sizes.

For brevity, we implicitly assume all files to be regular files as opposed to other filesystem objects.

We tested the code in this tutorial on Debian 11 (Bullseye) with GNU Bash 5.1.4. It should work in most POSIX-compliant environments.

2. Storage and Precise File Size Checking

Regular files are simply data identified by an inode, part of a filesystem. Basically, this data takes up a given amount of space on storage for each file:

$ ls -l
total 20
-rw-r--r-- 1 baeldung baeldung  666 Oct 18 15:10 file1
-rw-r--r-- 1 baeldung baeldung  565 Oct 19 16:50 file2
-rw-r--r-- 1 baeldung baeldung  160 Oct 29 05:10 file3
-rw-r--r-- 1 baeldung baeldung  660 Jul 29 11:20 file4
drwxr-xr-x 2 baeldung baeldung 4096 Aug 18 18:16 subdir

In the code snippet above, the fifth column of ls shows the size of each file in bytes.

However, with the -l long list format, it also displays the total number of blocks all files take at the top of its output. Importantly, this number of blocks differs from the total size in bytes, since blocks reflect allocated space in units defined when the filesystem was formatted.

Also, the directory entry, as identified by the d prefix, has a byte value of 4096. This is a common minimum allocation unit on filesystems. Hence, it’s not really a value we’re after.
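The gap between bytes and blocks is easy to observe with a very small file. The following is only a sketch: the function name size_vs_allocated is hypothetical, and the allocated figure depends on the filesystem, so no particular value is guaranteed.

```shell
# Hypothetical helper: print a file's size in bytes next to its allocated space.
size_vs_allocated() {
    # Apparent size: the number of bytes of content
    bytes=$(wc -c < "$1" | tr -d '[:space:]')
    # Allocated space in 1024-byte units (-k is a POSIX du option)
    kib=$(du -k -- "$1" | awk 'NR == 1 { print $1 }')
    printf '%s bytes, %s KiB allocated\n' "$bytes" "$kib"
}
```

On many filesystems, a one-byte file reports a few KiB allocated, which is why block counts are a poor stand-in for the byte size.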

Finally, many applications provide a way for users to see human-readable instead of raw byte values:

$ ls -l /dir/file
-rwxr-xr-x 1 baeldung baeldung 207270832 Jun 10 06:23 /dir/file
$ ls -l -h /dir/file
-rwxr-xr-x 1 baeldung baeldung 198M Jun 10 06:23 /dir/file

With the -h or --human-readable flag, ls shows 198M instead of 207270832 bytes. Now, does M stand for Mega (10^6) or Mebi (2^20)? We have to check the documentation, which tells us that it’s the latter. So, we do the calculation 198 * 2^20 = 198 * 1048576 = 207618048, which gives us only a rough estimate of the actual 207270832, since ls rounds the displayed value to a whole mebibyte. So, we had to exert more effort to get a less accurate result.
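The same arithmetic can be reproduced with shell integer arithmetic. This is a small sketch using the two values from the ls output; the variable names are only illustrative.

```shell
# Values taken from the ls listings of /dir/file
actual=207270832                      # raw size in bytes from ls -l
shown=198                             # the "198M" printed by ls -l -h
estimate=$(( shown * 1024 * 1024 ))   # 198 MiB expressed in bytes
echo "estimate: $estimate"            # estimate: 207618048
echo "error: $(( estimate - actual ))"  # error: 347216
```

The human-readable value is off by 347216 bytes, roughly a third of a mebibyte.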

In essence, we most often can’t and shouldn’t rely on anything but the raw bytes value of file sizes, and even then – only for regular files.

3. Portable File Size Checks

File size checking is required on most systems for one reason or another. Because of this, we’ll explore ubiquitous methods to get a correct reading of file sizes.

For our purposes, we define portable as being compliant with POSIX or SUS, and ideally – both. This limit mostly covers UNIX systems.

3.1. ls

Since we already touched on the subject, let’s start with ls. As part of both standards above, it’s so common to UNIX platforms that the PowerShell Get-ChildItem cmdlet in Windows has an alias with the same name.

However, a major problem with ls is its inability to provide only the file size in bytes. To extract it from all the available information, we need to parse the -l long list format while forcing numeric user and group IDs instead of names with -n or --numeric-uid-gid (the long form is a GNU extension):

$ ls -l --numeric-uid-gid /dir/file
-rwxr-xr-x 1 1000 1000 207270832 Jun 10 06:23 /dir/file

With --numeric-uid-gid, we can assume there will be no spaces within the first 4 columns, so we can get the size from column 5 via cut:

$ ls -l --numeric-uid-gid /dir/file | cut --delimiter=' ' --fields=5
207270832

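Parsing the ls output with cut, we specify the delimiter as ' ' (a single space) and the field of interest as 5, which is the file size in bytes. Since cut treats every single space as a separator, this relies on ls placing exactly one space between columns. Similarly, we can extract column 5 with awk, which also tolerates runs of spaces between columns: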

$ ls -l --numeric-uid-gid /dir/file | awk '{print $5}'
207270832

We can even use bash to do the parsing:

$ FILEINFO=( $( ls -l --numeric-uid-gid /dir/file ) )
$ echo "${FILEINFO[4]}"
207270832
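For scripts that need to run outside GNU systems, the same parsing can be wrapped in a function that sticks to short options. This is a minimal sketch, assuming a regular file as input; the function name size_ls is hypothetical.

```shell
# Hypothetical helper: file size in bytes via ls, using only short options.
size_ls() {
    # -l: long format; -n: numeric user and group IDs;
    # -d: list a directory itself instead of its contents
    # NR == 1 guards against extra lines caused by unusual file names
    ls -ldn -- "$1" | awk 'NR == 1 { print $5 }'
}
```

Calling size_ls /dir/file would then print only the byte count from the fifth column.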

Still, all of these options use two commands and a pipe to complete a seemingly simple task. Let’s explore single tools that can achieve the same.

3.2. wc

Actually, the standard wc command can fit our needs:

$ wc --bytes < /dir/file
207270832

With its -c or --bytes parameter, wc processes any data it receives on standard input, returning the number of bytes within. On top of that, it works flawlessly for both text and binary files and is probably the most portable solution of all.

Importantly, if we replace the redirection with only the file path, the output includes that path in a second column.

Although more compact visually, the solution with wc can involve reading the whole file instead of using a simple and much less resource-intensive system call. Whether it does depends on the version and implementation, as GNU coreutils wc optimizes and avoids the overhead by also using a system call, albeit a different one.
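When the result feeds a numeric comparison, it can help to normalize the output, since some wc implementations pad the count with leading blanks. The sketch below assumes that behavior may occur; the function name size_wc is hypothetical.

```shell
# Hypothetical helper: file size in bytes via wc, with any padding removed.
size_wc() {
    # Redirection keeps the file name out of the output;
    # tr strips blanks and the trailing newline that wc may emit
    wc -c < "$1" | tr -d '[:space:]'
}
```

A script could then compare the value directly, for example with [ "$(size_wc /dir/file)" -gt 1048576 ].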

3.3. du

Unlike the du specified by POSIX, which reports allocated space in blocks, GNU coreutils du provides the -b flag to report the size in bytes:

$ du -b /dir/file
207270832 /dir/file

While it works and returns the correct size in bytes, du has critical disadvantages:

  • requires the correct du version
  • appends the file path or name
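The second disadvantage can be worked around by keeping only the first column. This sketch still assumes GNU coreutils du, so it does not address the first point; the function name size_du is hypothetical.

```shell
# Hypothetical helper: file size in bytes via GNU du, without the path column.
size_du() {
    # -b is a GNU extension that reports the apparent size in bytes;
    # awk keeps only the first field of the first output line
    du -b -- "$1" | awk 'NR == 1 { print $1 }'
}
```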

Let’s look at another tool that most, but not all, systems have.

3.4. stat

The stat command is more or less a direct interface to the standard file status system calls. As such, we can use it to get information about the file size:

$ stat --format=%s /dir/file
207270832

In this example, the -c or --format flag specifies the data and format we want to acquire about our file as %s (total size, in bytes). At first, this looks optimal since the underlying stat() system call is standardized. However, the stat command itself isn’t part of POSIX, so it’s not available on all platforms and might have different switches on some.
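One way to cope with the differing switches is to try the GNU syntax first and fall back to the BSD-style one. This is a sketch under the assumption that the fallback platform accepts -f with the %z format, as BSD-derived stat commands do; the function name size_stat is hypothetical.

```shell
# Hypothetical helper: file size in bytes via stat, trying two known syntaxes.
size_stat() {
    # Only handle existing regular files
    [ -f "$1" ] || return 1
    # GNU coreutils: -c %s; BSD-derived systems: -f %z
    stat -c %s -- "$1" 2>/dev/null || stat -f %z -- "$1"
}
```

Neither syntax is guaranteed to exist on every system, so the function can still fail where no stat command is installed.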

Due to this, we’ll move on to probably the most universal solution.

4. Programming Languages

Although most are not in the POSIX or SUS standards, programming languages have their own encapsulated worlds with implementations for basic system operations on all supported platforms and operating systems. Scripting languages, in particular, can assist with all but the most unique operating system tasks.

Basically, portability comes down to the number of operating systems which come with a given interpreter out of the box. Let’s see how we can use some common interpreters to get a file’s size.

4.1. Perl

The Practical Extraction and Report Language (Perl), as supported by the perl interpreter, has been a standard part of many UNIX operating systems for a long time. In fact, Perl first appeared on December 18, 1987.

With perl, getting the size of a file is trivial:

$ perl -e '@a=stat(shift);print $a[7];' /dir/file
207270832

Here, we use the -e switch to run one line of code, which takes a single argument (/dir/file), runs the stat() function on it, and prints only the field of interest.

4.2. Python

Python has been around since 1991 but only gained broad attention after 2003. It’s integral to many major Linux distributions like Debian, Ubuntu, and Red Hat. Moreover, like Perl, Python is shipped with other major UNIX distributions as well.

Of course, python can solve our problem as well:

$ python3 -c 'import os;import sys;print(os.path.getsize(sys.argv[1]));' /dir/file
207270832

Similar to Perl and its -e flag, -c makes python execute one-liners. We import the os module for operating system interfaces and the sys module for access to the command-line arguments. Next, we print the result of the os.path.getsize() function called with our argument (sys.argv[1]), the file of interest.

4.3. Ruby

The Ruby programming language, younger than both Perl and Python, offers great flexibility and is available on multiple platforms.

Let’s see how we can get our file’s size with the ruby interpreter:

$ ruby -e 'print File.stat(ARGV[0]).size' /dir/file
207270832

In this case, we use the File.stat() function on the file path argument (ARGV[0]) and extract the size field from it.

5. Summary

In this article, we explored how to get the true size of a file in a universal manner.

In conclusion, we have multiple options to get file sizes, but some are more portable than others, and many require additional considerations.

