1. Overview
In the context of CSV files, a null field refers to a field with no data. The occurrence of null fields can arise for multiple reasons, such as missing information or errors while generating a field value.
Since null fields are anomalies in the CSV file, detecting lines or records containing such fields can be important.
In this tutorial, we’ll explore different Linux utilities to check for null fields in a CSV file. Though we’ve tested each approach on Ubuntu Linux, they should work fine on most other Linux distributions without any modification.
2. Understanding the Scenario
Let’s look at the sample data.csv file that contains null fields on multiple lines:
$ cat data.csv
,a1,a2,a3,a4,a5
,b2,b3,,b5
c1,,c3,c4,c5
d1,,,d4,
e1,e2,e3
f1,,f3
g1,g2,g3,
It’s important to note that the count of expected fields in records isn’t fixed. Moreover, the expected number of fields in a record is count_of_commas + 1. So, the expected fields in the first record are 6, while the expected fields in the last one are 4.
3. Using grep
grep is the de facto standard for searching through text. In this section, let’s solve our use case with the grep utility.
3.1. Null Field Patterns
A null field in a line can be present at the start, middle, or end of a line. Let’s see the patterns we should look for when finding a null field.
Firstly, if the null field is present at the start, then we should be looking for a comma as the first character of the line:
$ grep '^,' data.csv
,a1,a2,a3,a4,a5
,b2,b3,,b5
We got the first and second lines as the output as they start with a comma.
Secondly, when the null field is present in the middle, then there should be two consecutive commas:
$ grep ',,' data.csv
,b2,b3,,b5
c1,,c3,c4,c5
d1,,,d4,
f1,,f3
The second, third, fourth, and sixth lines have a null field in between, as they have two consecutive commas.
Lastly, the null field could also be present at the end, so we should be looking for a comma as the last character in a line:
$ grep ',$' data.csv
d1,,,d4,
g1,g2,g3,
We can see that the fourth and seventh lines have a null field at the end of the line.
3.2. With -E Option
We can combine the three patterns for null fields using the | operator. Now, let’s use the -E option with grep to apply all the patterns together:
$ grep -E '^,|,$|,,' data.csv
,a1,a2,a3,a4,a5
,b2,b3,,b5
c1,,c3,c4,c5
d1,,,d4,
f1,,f3
g1,g2,g3,
Great! We’ve got all the lines that contain one or more null fields.
4. Using sed
sed is also an effective utility for pattern search that we can use to find all the lines with null fields.
First, let’s find all the lines that contain the null field in the beginning:
$ sed -n '/^,/p' data.csv
,a1,a2,a3,a4,a5
,b2,b3,,b5
We used the -n option to disable sed‘s default behavior of printing every line. Further, we used the p command to print the lines that match the pattern.
Next, let’s get the lines that contain null fields in the middle:
$ sed -n '/,,/p' data.csv
,b2,b3,,b5
c1,,c3,c4,c5
d1,,,d4,
f1,,f3
Further, we can find the lines with the null field at the end:
$ sed -n '/,$/p' data.csv
d1,,,d4,
g1,g2,g3,
Lastly, we can combine all the patterns with an | operator to see all the lines that contain one or more null fields:
$ sed -n '/^,\|,,\|,$/p' data.csv
,a1,a2,a3,a4,a5
,b2,b3,,b5
c1,,c3,c4,c5
d1,,,d4,
f1,,f3
g1,g2,g3,
Fantastic! It looks like we got this one right. Further, we must note that we escaped the | operator so that sed doesn’t treat it as a literal character within the pattern.
5. Using awk
Alternatively, we can use the awk utility for our use case. Let’s see this in action.
As before, we can start by finding the null fields that are present at the beginning:
$ awk '/^,/{print $0}' data.csv
,a1,a2,a3,a4,a5
,b2,b3,,b5
We got the correct lines.
Now, let’s find the lines that have null fields in the middle:
$ awk '/,,/{print $0}' data.csv
,b2,b3,,b5
c1,,c3,c4,c5
d1,,,d4,
f1,,f3
As expected, we got the second, third, fourth, and sixth lines as the output.
Moving on, let’s find the lines that have null fields at the end:
$ awk '/,$/{print $0}' data.csv
d1,,,d4,
g1,g2,g3,
Finally, let’s merge these patterns using the | operator and search all the lines that have one or more null fields:
$ awk '/^,|,|,,/{print}' data.csv
,a1,a2,a3,a4,a5
,b2,b3,,b5
c1,,c3,c4,c5
d1,,,d4,
e1,e2,e3
f1,,f3
It looks like we nailed this one.
6. Using mlr
In this section, we’ll explore an interesting utility, mlr, that excels at working with structured text, such as CSV and TSV files.
Since mlr doesn’t come preinstalled in Linux distributions, let’s first install the utility:
$ apt-get install miller
Now, we can define a custom filter in the is_null_filter.mlr file to iterate over all fields and check for the presence of a null field:
$ cat is_null_filter.mlr
@is_null = false;
for (key, value in $*) {
if (value == "") {
@is_null = true;
break;
}
}
filter @is_null;
At first glance, it might seem unfamiliar. However, the definition is quite intuitive and simple. Let’s break down the logical flow of the script.
First, we’ve initialized a variable, @is_null, as false. Then, we iterate over each field’s value and compare it with an empty string. If we find a match, we set @is_null to true and break out of the for loop. Lastly, we set the filter as @is_null‘s value.
Next, let’s see this in action by processing the data.csv with the is_null_filter.mlr script:
$ mlr --fs comma put -f is_null_filter.mlr data.csv
1=,2=a1,3=a2,4=a3,5=a4,6=a5
1=,2=b2,3=b3,4=,5=b5
1=c1,2=,3=c3,4=c4,5=c5
1=d1,2=,3=,4=d4,5=
1=f1,2=,3=f3
1=g1,2=g2,3=g3,4=
The output lines are as expected, except for the extra indices for each field. Further, we must note that we passed comma as the field separator with –fs option. Additionally, we use the put sub-command to put the processing operation for each line with the is_null_filter.mlr script.
Lastly, let’s use the –nidx option to suppress the indices and show the lines as available in the data.csv file:
$ mlr --nidx --fs comma put -f is_null_filter.mlr data.csv
,a1,a2,a3,a4,a5
,b2,b3,,b5
c1,,c3,c4,c5
d1,,,d4,
f1,,f3
g1,g2,g3,
Excellent! We learned an interesting utility while solving our use case.
7. Conclusion
In this article, we learned how to check for null fields in a CSV file. Further, we explored different Linux utilities, such as sed, awk, and grep, to perform pattern matching for null fields and return the matching lines.
Additionally, we also learned about the mlr utility and used it to solve our use case with ease.