1. Overview
In this tutorial, we’re going to take a look at how we can display the first characters from a file. We’ll be using the tools that are already available on most Linux distributions. Additionally, we’ll provide some workarounds for the default behavior of the tools.
Finally, we’ll go a little deeper and discuss which command to pick for the job.
2. Using the head Command
The head command is used to display the first lines of a file. By default, the head command will print only the first 10 lines. The head command ships with the coreutils package, which might be already installed on our machine.
Let’s print the first 10 lines of a JSON file:
$ head package.json
{
"name": "in-your-pocket",
"version": "0.1.0",
"description": "",
"scripts": {
"watch": "webpack --config webpack/webpack.dev.js --watch",
"build": "webpack --config webpack/webpack.prod.js",
"clean": "rimraf dist",
"build:clean": "npm run clean && npm run build",
"test": "npx jest",
The head command can also print the first “n” bytes of a file. In the ASCII character set, each character takes one byte. Therefore, for ASCII text, we can print the first “n” characters of a file by supplying the --bytes or -c option:
$ head --bytes 100 package.json
{
"name": "in-your-pocket",
"version": "0.1.0",
"description": "",
"scripts": {
"watch":
Mind that newlines, tabs, and spaces are also counted as bytes. Alternatively, if the value of the --bytes option is a negative number, GNU head will print everything except the last “n” bytes.
Let’s suppose we have a file named alphabets that contains the lowercase letters, and we want to exclude the last 10 characters. Then we can supply -10 as the argument to the -c or --bytes option. The trailing newline of the file counts as one of those 10 characters:
$ head -c -10 alphabets
abcdefghijklmnopq
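Since head -c counts bytes, it can help to check the result with wc -c. The following is a minimal sketch: the first_bytes helper and the temporary sample file are hypothetical names introduced for illustration, and the negative count assumes GNU head.

```bash
# Hypothetical helper: print the first N bytes of a file.
first_bytes() {
    head -c "$1" "$2"
}

# Sample file for illustration: 26 letters plus a trailing newline (27 bytes).
sample=$(mktemp)
printf 'abcdefghijklmnopqrstuvwxyz\n' > "$sample"

first_bytes 5 "$sample"; echo        # abcde
first_bytes 5 "$sample" | wc -c      # 5
# GNU head only: a negative count drops the last N bytes instead.
head -c -10 "$sample"; echo          # abcdefghijklmnopq
```

Wrapping the call in a function keeps the byte count in one place when the same slice is needed for several files.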
3. The sed Utility
The sed command stands for Stream EDitor. It’s a tool that we can use to modify a text stream. It can also handle other tasks, such as printing the first “n” lines or characters of a file.
The general syntax for sed is very simple:
$ sed [OPTIONS] [EXPR] <FILE>
We can give sed an expression that instructs the tool on how to modify the text stream. To print the first “n” characters, we’ll supply sed with an expression and our alphabets file:
$ sed -z 's/^\(.\{12\}\).*/\1/' alphabets
abcdefghijkl
- The -z option makes sed treat NUL characters, rather than newlines, as line separators, so the whole file is processed as a single line instead of line by line.
- The script expression transforms the text read from the file and writes the result to the standard output; the file itself isn’t changed. In our case, we substitute the whole contents of the file with the first 12 characters.
- The final argument is our alphabets file that contains lowercase characters.
Additionally, we can change the value 12 to the number of characters that we need to print.
Usually, the above expression can be difficult to write over and over. For that reason, we can write another expression that’s relatively simple:
$ sed -z 's/.//6g' <<< $(cat alphabets)
abcde
In the command above, we feed the output of the cat command to the sed command through a here-string. The expression s/.//6g replaces the sixth match of the dot, and every match after it, with nothing, so only the first 5 characters remain. Because the number in the flag is the position of the first character to delete, we need to add 1 to the number of characters that we want to print.
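To avoid retyping the longer expression, we could wrap it in a shell function that takes the character count as a parameter. This is only a sketch: sed_first_chars is a hypothetical name, the -z option assumes GNU sed, and the file is assumed to contain no NUL bytes.

```bash
# Hypothetical helper: print the first N characters of a file with GNU sed.
sed_first_chars() {
    local count=$1 file=$2
    # -z makes sed read the file as one record (assuming it holds no NUL bytes).
    sed -z "s/^\(.\{$count\}\).*/\1/" "$file"
}

# Sample file for illustration only.
alphabets=$(mktemp)
printf 'abcdefghijklmnopqrstuvwxyz\n' > "$alphabets"

sed_first_chars 12 "$alphabets"; echo    # abcdefghijkl
```

If the file holds fewer characters than requested, the pattern doesn’t match and sed prints the file unchanged.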
4. Using the cut Command
The cut command is used to remove or extract parts of each line of a text file or a string. For instance, if we want to extract the nth character of a line, we can use cut:
$ cut -c 5 alphabets
e
The -c option selects characters by their position in the line. We can also specify a range of characters or bytes to print:
$ cut -c 1-5 alphabets
abcde
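As we can see, it works very well. The problem arises when we want to print the first “n” characters of a file containing newlines, because cut processes each line separately. As in sed, there is also the -z option available in cut, which makes it use NUL rather than newline as the line delimiter, so a file without NUL characters is treated as a single line.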
In the following snippets, we’ll illustrate this by using an alphanumeric file where each set of alphanumerics is separated by a newline:
$ cut -c 1-5 alphanumeric
abcde
ABCDE
01234
By supplying the -z or --zero-terminated option, we can override this default behavior:
$ cut -z -c 1-5 alphanumeric
abcde
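One detail worth knowing is that, with -z, cut ends its output with a NUL character instead of a newline. The sketch below, which assumes GNU cut and uses the hypothetical helper name cut_first_bytes, removes that terminator with tr so the result can be stored in a variable cleanly.

```bash
# Hypothetical helper: first N bytes of a whole file via GNU cut.
cut_first_bytes() {
    local count=$1 file=$2
    # -z switches the line delimiter to NUL, so the file is read as one record;
    # tr then removes the NUL terminator that cut appends to its output.
    cut -z -c "1-$count" "$file" | tr -d '\0'
}

# Sample file for illustration only.
alphanumeric=$(mktemp)
printf 'abcdefghijklmnopqrstuvwxyz\nABCDEFGHIJKLMNOPQRSTUVWXYZ\n0123456789\n' > "$alphanumeric"

cut_first_bytes 5 "$alphanumeric"; echo    # abcde
```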
5. Using the dd Utility
The dd command is mainly used to copy bytes or blocks from a source to a destination. It’s a powerful utility that provides several options. One of the options is the bs option. The bs option takes the number of bytes to read at a time as an argument. Let’s see it in action:
$ dd bs=1 count=5 if=alphanumeric
abcde5+0 records in
5+0 records out
5 bytes copied, 0.000109751 s, 45.6 kB/s
- The bs option is used to specify the number of bytes to read at a time.
- The count option specifies the number of blocks to copy. With bs=1, each block is a single byte, so count equals the total number of bytes.
- The if option specifies the input file to read from.
In the output, we can see our first 5 characters, along with some additional information that we don’t need. Fortunately, the dd command has the status option that we can use to suppress the I/O information:
$ dd bs=1 count=5 if=alphanumeric status=none
abcde
Alternatively, since dd writes the I/O information to the standard error, we can redirect it to /dev/null:
$ dd bs=1 count=5 if=alphanumeric 2> /dev/null
abcde
Oftentimes, we might need to specify a range of bytes to print. For that, we can use the skip option, which skips the given number of blocks at the start of the input. With bs=1, that means the first “n” bytes:
$ dd skip=5 bs=1 count=5 if=alphanumeric 2> /dev/null
fghij
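The skip and count options can be combined into a small helper that prints an arbitrary byte range. This is a minimal sketch; byte_range and its parameter names are hypothetical and introduced only for illustration.

```bash
# Hypothetical helper: print LENGTH bytes of FILE, starting after OFFSET bytes.
byte_range() {
    local file=$1 offset=$2 length=$3
    # With bs=1, skip and count are both measured in single bytes.
    dd if="$file" bs=1 skip="$offset" count="$length" 2>/dev/null
}

# Sample file for illustration only.
alphanumeric=$(mktemp)
printf 'abcdefghijklmnopqrstuvwxyz\nABCDEFGHIJKLMNOPQRSTUVWXYZ\n0123456789\n' > "$alphanumeric"

byte_range "$alphanumeric" 0 5; echo    # abcde
byte_range "$alphanumeric" 5 5; echo    # fghij
```

Redirecting the standard error is used here instead of status=none so that the sketch doesn’t depend on a GNU-specific option.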
6. Using the awk Command
The awk command is used to search for a pattern in a text file and carry out operations on it. It defines its own programming language, which is fairly easy to use. The awk utility is installed on most Linux distributions. However, if we don’t have it already, we can install it from the official repository using yum or apt.
Once installed, we can verify it through:
$ awk --version
GNU Awk 5.1.0, API: 3.0 (GNU MPFR 4.1.0-p13, GNU MP 6.2.1)
Now, let’s print the first 5 characters from our alphabets file:
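$ awk '{ print substr($0, 1, 5) }' alphabets
abcde
The awk command is followed by the program that it runs on each line of the input. In our case, we want to print a sub-string of the line. The substr function takes three arguments: the string, which is the whole line $0 here, the starting position, and the number of characters to return. Positions in awk start at 1, so we pass 1 as the starting position; a start value below 1 isn’t handled the same way by every awk implementation. Finally, we specify our alphabets file to awk for scanning.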
As can be seen in the above snippet, it works well with a file that has one line, but the output is different for files with multiple lines. Let’s see how it looks:
$ awk '{ print substr($0, 1, 5) }' alphanumeric
abcde
ABCDE
01234
Certainly, we didn’t expect this. By design, awk works on each line in a file individually, rather than treating the whole file as a string. For that reason, we’ll use a simple workaround with the echo and cat commands:
$ echo $(cat alphanumeric)
abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUXWYZ 0123456789
In the above example, the shell splits the unquoted output of the cat command into words, and the echo command joins those words with single spaces, so the newlines become spaces in the process. We can then pipe this output to our awk command to print the first 5 characters:
$ echo $(cat alphanumeric) | awk '{ print substr($0, 1, 5) }'
abcde
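The echo workaround replaces newlines with spaces, and the unquoted command substitution is also subject to filename expansion. As an alternative sketch, awk itself can collect lines until it has enough characters and then print the slice with the newlines kept. The name awk_first_chars is hypothetical, and the character count follows whatever awk’s length function counts in the current locale.

```bash
# Hypothetical helper: first N characters of a file, newlines preserved.
awk_first_chars() {
    local count=$1 file=$2
    awk -v n="$count" '
        { buf = buf $0 "\n" }          # re-attach the newline awk stripped
        length(buf) >= n { exit }      # stop reading once we have enough
        END { printf "%s", substr(buf, 1, n) }
    ' "$file"
}

# Sample file for illustration only.
alphanumeric=$(mktemp)
printf 'abcdefghijklmnopqrstuvwxyz\nABCDEFGHIJKLMNOPQRSTUVWXYZ\n0123456789\n' > "$alphanumeric"

awk_first_chars 5 "$alphanumeric"; echo     # abcde
awk_first_chars 29 "$alphanumeric"; echo    # the 26 lowercase letters, a newline, then AB
```

The exit statement stops the reading early, and the END rule still runs afterwards, which is where the slice is printed.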
7. Which One to Use?
For the most part, we should be fine with the head and cut commands, as they are ubiquitous and quick to use. However, the problem arises when we have a file that contains complex Unicode characters like emojis. Let’s suppose we have a file that contains a fire emoji followed by some text:
🔥 This is a fire emoji.
Now, if we want to print “🔥 This” using the cut command, we might try to slice the text from the first character to the sixth character:
$ cut -c 1-6 text_with_emoji
🔥 T
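Well, that is not the behavior that we expected. So far, we’ve assumed that each character takes a single byte, which is true only for ASCII text. In UTF-8, a character takes between 1 and 4 bytes, and an emoji that is built from several code points takes even more. As the output shows, cut counted bytes here rather than characters; the GNU cut documentation describes -c as the same as -b for now. In our case, the fire emoji takes 4 bytes, so “🔥 This” is 9 bytes long. So, if we use 1-9 as an argument, it will print the portion that we want: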
$ cut -c 1-9 text_with_emoji
🔥 This
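The difference between bytes and characters can be checked with wc. In this minimal sketch, the fire emoji is written as its four UTF-8 bytes in octal escapes so that the snippet doesn’t depend on the encoding of the script file; the variable name text is arbitrary.

```bash
# The fire emoji (U+1F525) as its four UTF-8 bytes, followed by " This".
text=$(printf '\360\237\224\245 This')

printf '%s' "$text" | wc -c    # bytes: 9
printf '%s' "$text" | wc -m    # characters: 6 in a UTF-8 locale
```

The byte count of 9 (4 for the emoji, 1 for the space, 4 for the word) is the reason the range 1-9 is needed with cut.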
With the range 1-9, cut prints what we expected. However, this isn’t an effective solution to the problem because we might want to process dynamic text in a script file, where we aren’t sure of the characters used. For that reason, we might want to use a more robust solution:
$ string=$(cat < "text_with_emoji") && printf '%s\n' "${string:0:6}"
🔥 This
Let’s break it down:
- We created a variable named string and assigned it the contents of our file using cat
- We used the built-in printf command to print the result, followed by a newline
- In the second argument to the printf command, we took a sub-string of our string variable that starts at offset 0 and is 6 characters long
- In a UTF-8 locale, bash counts characters rather than bytes in this sub-string expansion, so the emoji counts as a single character
By executing the command above, we have a more robust solution that we can implement in our scripts. Better yet, we can create a bash script with the above commands and use it instead of the other commands.
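Such a script could look like the sketch below. The function name first_chars and its argument order are hypothetical. It assumes bash, a UTF-8 locale for multi-byte text, and a file without NUL characters, since bash variables can’t hold them.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the first COUNT characters of FILE.
first_chars() {
    local file=$1 count=$2 content
    # Command substitution strips trailing newlines from the file contents.
    content=$(cat -- "$file") || return 1
    # Offset 0, length COUNT; bash counts characters in a UTF-8 locale.
    printf '%s\n' "${content:0:$count}"
}

# Sample file for illustration only: the fire emoji as UTF-8 bytes plus text.
sample=$(mktemp)
printf '\360\237\224\245 This is a fire emoji.\n' > "$sample"

first_chars "$sample" 6    # "🔥 This" in a UTF-8 locale
```

Because the whole file is read into a variable first, this approach suits small text files better than large ones.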
8. Conclusion
In this article, we showed how to use a variety of tools to print the first “n” characters from a file. For our examples, we used the head, sed, cut, dd, and awk commands.
Finally, we went through the issues regarding those commands and picking the right command for the job.