1. Overview
We use the grep command to search a file for lines that match a particular pattern of characters. The search pattern is called a regular expression. It’s one of the most used Linux commands for displaying the lines that contain the pattern we’re searching for.
When we’re using grep, we may wish to skip binary files to save time and keep the output readable. This can apply to certain text files that grep treats as binary, as well as to genuine binary files.
In this short tutorial, we’re going to look at how we can use grep and how to exclude binaries from our searches.
2. Why Binary Files Can Be a Problem With grep
There are two cases in which grep treats a file as binary: encoding errors and NUL bytes. Let’s explore each of them.
2.1. Encoding Errors
The grep tool considers a file to be binary if it contains an encoding error for the current locale, as determined by the C99 mbrlen function. We can see this with an example. Let’s create a file with a UTF-8 encoding error: \x80 is a continuation byte, so it cannot be the first byte of a UTF-8 encoded character:
$ printf 'Encoding\x80' >> encoding.txt
If we now grep for the word “Encoding”:
$ grep "Encoding" encoding.txt
Binary file encoding.txt matches
We see that grep interprets encoding.txt as a binary file, even though it is only a text file with an encoding error. Depending on the grep version, the exact wording of this message may differ slightly.
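To confirm that the file really is text with one stray byte, we can dump its bytes and then ask grep to treat the file as text. The following is a minimal sketch, assuming GNU grep and a POSIX od. The scratch directory and variable names are our own, and we write the byte in octal (\200, the same value as \x80) so the printf call doesn't depend on the shell:

```shell
# Work in a scratch directory so we don't touch any real files
tmpdir=$(mktemp -d)

# \200 is the octal spelling of \x80, the invalid UTF-8 lead byte
printf 'Encoding\200\n' > "$tmpdir/encoding.txt"

# Dump every byte as hex and squeeze out the spacing od adds
hex=$(od -An -tx1 "$tmpdir/encoding.txt" | tr -d ' \n')
echo "$hex"

# -a (--text) tells grep to process the file as text; -c counts matching lines
count=$(grep -a -c "Encoding" "$tmpdir/encoding.txt")
echo "$count"

rm -rf "$tmpdir"
```

The hex dump ends in 80 0a, the stray byte followed by a newline, and with -a grep counts the line as an ordinary match instead of reporting a binary file.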
2.2. NUL Bytes
The grep tool also treats a file as binary if it contains a NUL byte. It looks for NUL bytes in the buffers it reads, and it also tries to determine whether the remaining data must contain NULs. Holes are unwritten regions of a file, and Unix mandates that they read as NUL bytes, so if a file contains a hole, it contains a NUL, and grep will consider the file to be binary. Let’s see a very simple example where a text file contains a NUL byte:
$ printf "File with NUL byte\0" >> nul.txt
Let’s now run grep on this file:
$ grep "NUL" nul.txt
Binary file nul.txt matches
We can see that in this case, too, grep treats the file as binary rather than as a plain text file that happens to contain a NUL byte.
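If we suspect that NUL bytes are the reason, we can count them with tr and wc. This is a rough sketch under our own assumptions (a scratch directory, our own variable names, and \000 as the octal spelling of the NUL byte). It also shows that once the NULs are stripped, grep sees ordinary text:

```shell
# Work in a scratch directory so we don't touch any real files
tmpdir=$(mktemp -d)
printf 'File with NUL byte\000\n' > "$tmpdir/nul.txt"

# Delete everything except NUL bytes, then count what is left
nul_count=$(tr -cd '\000' < "$tmpdir/nul.txt" | wc -c | tr -d ' ')
echo "$nul_count"

# Strip the NUL bytes on the fly and count the matching lines
matches=$(tr -d '\000' < "$tmpdir/nul.txt" | grep -c "NUL")
echo "$matches"

rm -rf "$tmpdir"
```

A count greater than zero tells us the file contains at least one NUL byte, which is enough for grep to classify it as binary.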
3. The grep Command with Binary Files
When we try to find all files that contain a certain string value, it can be very costly to check binary files that we might not want to check. On some occasions, binary files can be very large and we would be wasting time and resources scanning through them. Let’s look at an example where we would not want to look inside a binary file.
3.1. Using grep Without Suppressing Binary Files
Let’s suppose we want to search for the text “printHello” among all our files. This is the name of a C function, “void printHello”, that is used in several places in our project, and we would like to know where and how. Let’s first create the source file (hello.c):
$ cat <<'EOF' > hello.c
#include <stdio.h>
#include <stdlib.h>
void printHello(){printf ("Hello World\n");}
int main() {
printHello();
return 0;
}
EOF
Let’s now compile hello.c and generate the binary file (out.x):
$ gcc hello.c -o out.x
To generate the file out.x, we’re using GCC, the C compiler available on most Linux distributions.
So, let’s now grep for “printHello” across all the files in the current directory:
$ grep "printHello" *
hello.c:void printHello(){printf ("Hello World\n");}
hello.c:printHello();
Binary file out.x matches
The grep output shows that “printHello” was found in the hello.c file. However, it was also found in the binary file out.x.
3.2. Using grep While Suppressing Binary Files
We would prefer to see only the text files that contain code, so let’s now use grep to skip binary files:
$ grep -I "printHello" *
hello.c:void printHello(){printf ("Hello World\n");}
hello.c:printHello();
Here we used the -I option. We could also use its equivalent long form, --binary-files=without-match. Either one makes grep process a binary file as if it contained no matching data. This is exactly what we were looking for: we now have all the matches from the text file but none from the binary file.
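As a quick check that the short and long forms behave the same way, here is a minimal sketch assuming GNU grep. To avoid needing a compiler, it uses a hypothetical file, fake.bin, containing a NUL byte as a stand-in for out.x. The scratch directory and variable names are also our own:

```shell
# Work in a scratch directory so we don't touch any real files
tmpdir=$(mktemp -d)
printf 'void printHello(){}\n' > "$tmpdir/hello.c"
printf 'printHello\000\n' > "$tmpdir/fake.bin"   # hypothetical stand-in for out.x

# Short form: skip files that grep classifies as binary
short=$(cd "$tmpdir" && grep -I "printHello" *)
echo "$short"

# Long form of the same option
long=$(cd "$tmpdir" && grep --binary-files=without-match "printHello" *)
echo "$long"

# Recursive variant that lists only the names of matching text files
names=$(cd "$tmpdir" && grep -rIl "printHello" .)
echo "$names"

rm -rf "$tmpdir"
```

Both forms report only the match in hello.c. Combining -I with -r and -l is handy when we want the names of matching text files in a whole directory tree.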
4. Conclusion
In this article, we saw how the grep tool understands binary files. We also saw in which cases our non-binary files can still be interpreted as binary by grep.
Finally, we learned a simple command that will help us to scan through text files while suppressing the binary files.