1. Overview

In this tutorial, we’ll discuss how to create big files (> 100 MB) on a Linux system. Before we dive into the actual commands, we need to understand how Linux represents a file and how it organizes the file’s storage.

2. Inodes

Linux describes each file with a structure called an inode, which is short for index node. There is a one-to-one relationship between files and inodes. To accommodate different file sizes, each inode in the traditional layout has 12 direct pointers for smaller files and three additional pointers for larger files. Here, we’ll focus on the three large-file pointers.

2.1. Indirection Tables

The first of the three pointers, the 13th pointer, points to a single-indirection table. A single-indirection table is a cluster filled with pointers, each of which points to a storage cluster. Assuming 4 KB clusters and 8-byte pointers, a single-indirection table can address up to 2 MB (4 KB cluster size / 8 bytes per pointer = 512 pointers, each pointing to a 4 KB cluster).

The 14th pointer points to a double-indirection table. This pointer points to a cluster of pointers, and each pointer within this cluster points to a single-indirection table, giving it two layers of pointers. Therefore, with the same 4 KB clusters and 8-byte pointers, the double-indirection table can address up to 1 GB (4 KB / 8 bytes = 512 single-indirection tables of 2 MB each).

The last pointer points to a triple-indirection table. As its name suggests, there are three layers of pointers between the inode and the actual storage cluster. With 4 KB clusters and 8-byte pointers, the triple-indirection table can address up to 512 GB (4 KB / 8 bytes = 512 double-indirection tables of 1 GB each). At this point, the addressable size already exceeds the capacity of some storage devices.
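
As a quick check of the arithmetic above, the following shell sketch recomputes each capacity. It assumes the same figures as the text (4 KB clusters and 8-byte pointers); the variable names are illustrative only, and a real file system may use a different pointer size.

```shell
# Assumptions taken from the text: 4 KB (4096-byte) clusters, 8-byte pointers.
cluster=4096
ptr=8
per_table=$((cluster / ptr))        # pointers that fit in one cluster: 512

direct=$((12 * cluster))            # the 12 direct pointers
single=$((per_table * cluster))     # 13th pointer: single indirection
double=$((per_table * single))      # 14th pointer: double indirection
triple=$((per_table * double))      # 15th pointer: triple indirection

echo "direct: $((direct / 1024)) KB"                   # direct: 48 KB
echo "single: $((single / 1024 / 1024)) MB"            # single: 2 MB
echo "double: $((double / 1024 / 1024 / 1024)) GB"     # double: 1 GB
echo "triple: $((triple / 1024 / 1024 / 1024)) GB"     # triple: 512 GB
```

Changing cluster or ptr at the top shows how quickly the limits move with the cluster size.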

The maximum file size that an inode can address also grows if we increase the cluster size. However, we won’t cover that in this tutorial.

2.2. How Linux Handles the Indirection Tables

When creating files using inodes and indirection tables, Linux allows us to leave parts of a file “blank”, with no storage clusters assigned to them. We call these blank regions “holes”, and a file that contains them is known as a sparse file. There are several ways to create these holes.
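
One way to see a hole in practice is to write data at two distant offsets and leave the range between them untouched. The sketch below does this with dd. The file name hole_demo.bin is hypothetical, and stat -c assumes GNU coreutils. On a file system that supports sparse files, the allocated size should come out far smaller than the apparent size, although the exact figure depends on the file system.

```shell
f=hole_demo.bin                     # hypothetical scratch file
rm -f "$f"

# Write one byte at offset 0, then one byte at offset 1 MiB.
# Nothing is written in between, so that range can remain a hole.
printf 'A' > "$f"
printf 'Z' | dd of="$f" bs=1 seek=1048576 conv=notrunc 2>/dev/null

apparent=$(stat -c %s "$f")                               # logical size in bytes
allocated=$(( $(stat -c %b "$f") * $(stat -c %B "$f") ))  # bytes actually allocated

echo "apparent size:  $apparent bytes"
echo "allocated size: $allocated bytes"
rm -f "$f"
```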

With this background in mind, we can look at the Linux commands and POSIX methods that apply it to real cases. Linux offers several ways to create such large files quickly, regardless of their content.

3. Linux Command: truncate and fallocate

Linux provides two commands that are well suited to creating a large file: truncate and fallocate. Both achieve the same goal, but they create the file in different ways.

3.1. The truncate Command

The truncate command shrinks or extends each of the given files to the desired size. If we shrink a file, the content beyond the new size is lost. If we extend a file, the added region reads back as zeros. Either way, we risk breaking the format that an existing file is expected to follow, which can make it unreadable to the programs that use it.
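
The shrinking and extending behavior can be observed on a small scratch file. This is a minimal sketch that assumes GNU coreutils; truncate_demo.txt is a hypothetical file name, and -s is the option that sets the size (given here in plain bytes).

```shell
f=truncate_demo.txt                 # hypothetical scratch file
printf 'hello world' > "$f"         # 11 bytes of content

truncate -s 5 "$f"                  # shrink: everything after byte 5 is discarded
shrunk=$(cat "$f")
echo "after shrinking: $shrunk"     # after shrinking: hello

truncate -s 8 "$f"                  # extend: 3 bytes are added, all reading as zero
size=$(stat -c %s "$f")
added=$(tail -c 3 "$f" | od -An -tx1 | tr -d ' \n')
echo "size after extending: $size"  # size after extending: 8
echo "added bytes (hex): $added"    # added bytes (hex): 000000
rm -f "$f"
```

The string " world" is gone for good after the first call; extending the file again does not bring it back.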

Unless we pass the -c option, truncate creates the file if it doesn’t exist. Typically, we use truncate with the -s option, which sets the size. A suffix such as K, M, or G multiplies the size by a power of 1024, whereas KB, MB, or GB use powers of 1000:

$ truncate -s 5K test.txt 

This command doesn’t print anything to the terminal. If test.txt doesn’t exist, it’s created in the current directory; otherwise, the existing file is adjusted to the given size. We can verify this using the ls -lh command:

$ ls -lh test.txt
-rw-rw-r--    1    username    username    5.0K    Mar 25    21:01    test.txt 

However, the du command reports this file as using no space unless we pass the --apparent-size option. The reason is that du only counts disk usage. The extended region is a hole: it reads as zeros, but no clusters are allocated to it, so it isn’t considered used. We can compare the two results below:

$ du -sh test.txt
0    test.txt
=================================
$ du --apparent-size -h test.txt
5.0K     test.txt

In brief, we can use the truncate command to create a file of arbitrary size. Because the extended region is a hole, the operation is fast and uses almost no disk space. However, no storage is reserved for the file, and resizing an existing file this way might break its format.
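
To tie this back to the goal of this tutorial, here is a sketch that creates a file larger than 100 MB with truncate and compares its apparent size with its allocated size. The file name big_sparse.img and the 200M size are arbitrary choices for illustration, and stat -c assumes GNU coreutils.

```shell
f=big_sparse.img                    # hypothetical file name
rm -f "$f"

truncate -s 200M "$f"               # 200 * 1024 * 1024 bytes, no data written

size=$(stat -c %s "$f")                                   # apparent size in bytes
used=$(( $(stat -c %b "$f") * $(stat -c %B "$f") ))       # bytes actually allocated
nonzero=$(head -c 4096 "$f" | tr -d '\0' | wc -c)         # non-zero bytes in first 4 KB

echo "apparent size:  $size bytes"
echo "allocated size: $used bytes"
echo "non-zero bytes in the first 4 KB: $nonzero"
rm -f "$f"
```

On a file system that supports sparse files, the allocated size is expected to be zero or close to it, while the apparent size is the full 200 MB.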

3.2. The fallocate Command

Linux also offers the fallocate command. Unlike truncate, it actually allocates the storage clusters for the file, so the space is reserved on disk. It does this without writing zeros to those clusters, which makes it much faster than creating a file by filling it with zeros.

However, this command only asks the file system to reserve space for the file rather than writing any data to it. The reserved clusters are marked as uninitialized and read back as zeros, so the file still has no meaningful content, and a program that expects a particular format won’t be able to use it as is.

To create a file of a given size, we use the -l option:

$ fallocate -l 5K test2.txt 

To verify the outcome, we can simply use the ls -lh command:

$ ls -lh test2.txt
-rw-rw-r--    1    username    username    5.0K    Mar 25    21:22    test2.txt 

If we run the du command on the second file, the result is very different:

$ du -sh test2.txt
8.0K    test2.txt 
=================================
$ du --apparent-size -h test2.txt
5.0K    test2.txt

In this case, the du -sh command counts storage usage based on the number of allocated clusters. A typical system uses a cluster size of 4 KB, which is the smallest unit the file system can allocate. Therefore, the operating system must assign two clusters to a 5 KB file.
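
The same comparison can be scripted for a file larger than 100 MB. The sketch below is written under a few assumptions: big_alloc.img and the 128M size are arbitrary, stat -c assumes GNU coreutils, and the file system must support preallocation. Since fallocate can fail when preallocation is unsupported or when there isn’t enough free space, the script checks its exit status instead of assuming success.

```shell
f=big_alloc.img                     # hypothetical file name
rm -f "$f"

if fallocate -l 128M "$f" 2>/dev/null; then
    status=ok
    size=$(stat -c %s "$f")                               # apparent size in bytes
    used=$(( $(stat -c %b "$f") * $(stat -c %B "$f") ))   # bytes actually allocated
    echo "apparent size:  $size bytes"
    echo "allocated size: $used bytes"
else
    status=failed
    echo "fallocate failed (unsupported file system or not enough space)" >&2
fi
rm -f "$f"
```

When the call succeeds, the allocated size is expected to be at least as large as the apparent size, in contrast to the file created by truncate.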

4. Conclusion

In this tutorial, we first covered the fundamentals of how Linux file systems store files. With that understanding, we can easily create large files using terminal commands. Both truncate and fallocate can create a file of a given size: truncate leaves the new region as a hole that reads as zeros, while fallocate reserves clusters without initializing them.

When we create files this way, we need to keep in mind that they contain no meaningful data. The new regions read as zeros, which most programs can’t interpret as content in a valid format.