1. Overview
AWK is a data-driven scripting language originally designed for pattern searching and text manipulation. It requires no compilation and allows us to use variables, functions, and logical operators. The AWK language has different interpreters like awk, nawk, gawk, and mawk on different operating systems.
In this tutorial, we’ll discuss these interpreters and the differences between them.
2. Setup
Let’s create a sample file named employees.txt with the Nano editor:
$ nano employees.txt
We can then paste in this data:
John Doe 1784 1/22/54 750000
Lucy Kibaki 2054 4/12/54 350000
Fridah Machanja 1004 19/16/54 250000
Robert Edward 1654 7/22/54 650000
We’ll use this file throughout the tutorial to test how awk, nawk, gawk, and mawk work.
3. awk
The original version of awk was written in 1977 at AT&T Bell Laboratories as a pattern-matching program for processing files with large amounts of data, such as database files.
Some systems refer to it as original awk, old awk, or oawk.
The awk utility executes programs written with the AWK programming language. It specializes in textual data processing and manipulation. This allows us to write compact but powerful programs in the form of statements that scan for specific text patterns. The awk interpreter searches for patterns in each line of a document or documents, and we can define an action to perform whenever there’s a match.
It groups input data into records and fields. It then processes the records one at a time until the end of input. Furthermore, it uses a record separator to control how the input data is split. The default record separator is a newline, but we can change it using the RS variable.
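As a minimal sketch of how records and fields behave, the commands below recreate the employees.txt file from the setup section so that the snippet runs on its own. They use the generic awk command, assuming any POSIX-style interpreter is installed. The first command keeps the default separators; the second sets RS and FS on a made-up inline string:

```shell
# Recreate the sample file from the setup section.
printf '%s\n' \
  'John Doe 1784 1/22/54 750000' \
  'Lucy Kibaki 2054 4/12/54 350000' \
  'Fridah Machanja 1004 19/16/54 250000' \
  'Robert Edward 1654 7/22/54 650000' > employees.txt

# Default separators: one record per line, fields split on whitespace.
# NR is the current record number and NF the number of fields in that record.
awk '{ print NR, NF, $1 }' employees.txt

# Custom separators: records end at ";" and fields are split on ":".
printf 'John:1784;Lucy:2054;Robert:1654' |
  awk 'BEGIN { RS = ";"; FS = ":" } { print NR, $1, $2 }'
```

The first command prints lines such as "1 5 John", and the second prints "1 John 1784", "2 Lucy 2054", and "3 Robert 1654".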
Let’s find the installation directory of awk:
$ which awk
/usr/bin/awk
We can also find out the version available:
$ awk --version
GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)
Copyright (C) 1989, 1991-2019 Free Software Foundation.
...truncated
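Our current system runs Ubuntu Linux with gawk installed. GNU AWK, also known as gawk, is the default awk interpreter on many Linux distributions, although some, such as Debian-based systems, may ship mawk as the default instead.
On many Linux systems, awk and nawk are simply symlinks to the installed interpreter. Here they point to gawk, so running either gives the same result as running gawk directly.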
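To check which interpreter the awk command actually resolves to on a given system, we can follow its symlinks. This is a small sketch that assumes readlink supports the -f option (as in GNU coreutils); awk_path and awk_real are illustrative variable names:

```shell
# Locate the awk command on the PATH.
awk_path=$(command -v awk)

# Follow any chain of symlinks (for example through /etc/alternatives on
# Debian-based systems) to the binary that really runs.
awk_real=$(readlink -f "$awk_path")

echo "$awk_path -> $awk_real"
```

The resolved path typically ends in gawk, mawk, or another implementation, depending on the system.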
With the original awk, we can perform arithmetic, call string functions, use programming constructs such as conditionals and loops, and output formatted data reports. awk lets us treat a text file as a textual database made up of records and fields.
The original awk command has this basic syntax:
$ original-awk [ -F fs ] [ -v var=value ] [ 'prog' | -f progfile ] [ file... ]
To test the original awk command, we need to first install it:
$ sudo apt install original-awk
Once installed, we can use original-awk to simply print all the records from the employees.txt file:
$ original-awk '{print}' employees.txt
John Doe 1784 1/22/54 750000
Lucy Kibaki 2054 4/12/54 350000
Fridah Machanja 1004 19/16/54 250000
Robert Edward 1654 7/22/54 650000
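As a sketch of the arithmetic, conditional, and formatted-output features mentioned above, the following program prints a small report of employees earning at least 500000 and then a total. It assumes the fifth column is a salary and uses the generic awk command so that it runs with whichever interpreter is installed; original-awk can be substituted where available:

```shell
# Recreate the sample file from the setup section.
printf '%s\n' \
  'John Doe 1784 1/22/54 750000' \
  'Lucy Kibaki 2054 4/12/54 350000' \
  'Fridah Machanja 1004 19/16/54 250000' \
  'Robert Edward 1654 7/22/54 650000' > employees.txt

# The pattern is a numeric comparison, printf formats each row into
# aligned columns, and the END block prints the accumulated total.
report=$(awk '
  $5 >= 500000 { printf "%-8s %-10s %8d\n", $1, $2, $5; total += $5 }
  END          { printf "%-19s %8d\n", "Total", total }
' employees.txt)

printf '%s\n' "$report"
```

The report lists John Doe and Robert Edward, followed by a total of 1400000.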
4. nawk
The nawk interpreter was created in 1985. It’s a modification of the original version of the awk interpreter from 1977. It introduced user-defined functions, computed regular expressions, and several input streams. This made the AWK language more powerful, and nawk became widely available with Unix System V Release 3.1 in 1987.
The nawk interpreter is available on systems such as SunOS, OSF1, IRIX64, and many more.
It added many features to the original awk interpreter, including new keywords such as delete, do, function, and return.
There are new built-in functions such as atan2(), close(), gsub(), cos(), sin(), srand(), rand(), sub(), and system(). Furthermore, new variables such as FNR, ARGC, ARGV, SUBSEP, RSTART and RLENGTH were included in subsequent updates.
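A short sketch can make some of these additions concrete. The program below defines a hypothetical helper named initials() with the function and return keywords, rewrites the date field with gsub(), and uses match() together with RSTART and RLENGTH to extract the first run of digits from each record. It calls the generic awk command, assuming the installed interpreter supports these nawk-era features:

```shell
# Recreate the sample file from the setup section.
printf '%s\n' \
  'John Doe 1784 1/22/54 750000' \
  'Lucy Kibaki 2054 4/12/54 350000' \
  'Fridah Machanja 1004 19/16/54 250000' \
  'Robert Edward 1654 7/22/54 650000' > employees.txt

summary=$(awk '
  # A user-defined function, using the function and return keywords.
  function initials(first, last) {
    return substr(first, 1, 1) substr(last, 1, 1)
  }
  {
    date = $4
    gsub("/", "-", date)   # replace every "/" in our copy of the date
    match($0, /[0-9]+/)    # sets RSTART and RLENGTH for the first run of digits
    print initials($1, $2), date, substr($0, RSTART, RLENGTH)
  }
' employees.txt)

printf '%s\n' "$summary"
```

For the first record, this prints "JD 1-22-54 1784".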
The nawk interpreter has continued to be updated and some of the features in the most recent versions of nawk include:
- executing Unix commands from scripts
- easier management of multiple input streams
- defining custom functions
- flushing open output files and pipes
- processing command-line arguments more gracefully
- processing results of other Unix commands
- a new predefined variable named ENVIRON
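Several of the features listed above can be sketched in one BEGIN block: reading ENVIRON, processing the output of another command through a pipe, closing that pipe, and running a command with system(). GREETING is a made-up environment variable used only for illustration, and the generic awk command stands in for nawk:

```shell
# Recreate the sample file from the setup section.
printf '%s\n' \
  'John Doe 1784 1/22/54 750000' \
  'Lucy Kibaki 2054 4/12/54 350000' \
  'Fridah Machanja 1004 19/16/54 250000' \
  'Robert Edward 1654 7/22/54 650000' > employees.txt

result=$(GREETING=hello awk '
  BEGIN {
    # ENVIRON exposes the environment variables of the awk process.
    print "GREETING is", ENVIRON["GREETING"]

    # Read the first line produced by another command, then close the pipe.
    cmd = "sort -k5,5n employees.txt"
    cmd | getline line
    close(cmd)
    split(line, field, " ")
    print "lowest paid:", field[1], field[2]

    # system() runs a command and returns its exit status.
    print "exit status:", system("test -f employees.txt")
  }
')

printf '%s\n' "$result"
```

Because the program only has a BEGIN block, awk doesn't wait for any input of its own; all the data comes from the sort command.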
The nawk command has this basic syntax:
$ nawk [ -F fs ] [ -v var=value ] [ 'prog' | -f progfile ] [ file ... ]
The -F fs flag defines the input field separator. nawk scans every input file for lines that match any of the patterns specified literally in prog or in one or more files given as -f progfile.
The file ... arguments name the input files; if none are given, nawk reads from stdin. The -v var=value option assigns a value to a variable before prog starts executing, and we can use -v multiple times in one command.
Since all features in the original awk are available in nawk, we can use it to print only the records of the employees.txt file that match a pattern:
$ nawk '/Frida/' employees.txt
Fridah Machanja 1004 19/16/54 250000
In this command, we’re searching for lines containing the pattern “Frida”.
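The -F and -v options described earlier can be sketched with a small comma-separated input. The inline data and the variable names min and tag are invented for this example, and the generic awk command is used in place of nawk so the snippet runs with whichever interpreter is installed:

```shell
# -F, makes the comma the field separator.
# Each -v assigns a variable before the program starts running.
printf '%s\n' 'John,Doe,750000' 'Lucy,Kibaki,350000' |
  awk -F, -v min=500000 -v tag=high '$3 >= min { print tag, $1, $2 }'
```

Only the first line passes the comparison, so the command prints "high John Doe".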
5. gawk
The gawk interpreter is the GNU implementation of the AWK language. Paul Rubin wrote the first version in 1986, around a decade after the original awk. It’s a command-line utility available by default on many Linux systems, and many distributions also provide awk or nawk as symbolic links to gawk.
gawk offers the more recent Bell Laboratories awk extensions as well as a number of GNU-specific extensions. This gives us a lot of flexibility when searching through data files.
The gawk and nawk interpreters allow us to include custom functions and manage multiple input and output streams. We can also manipulate command-line arguments.
The gawk command has this basic syntax:
$ gawk [ POSIX or GNU style options ] -f program-file [ -- ] file
To handle characters from different countries and locales, POSIX extended basic and extended regular expressions with character classes that can be used inside bracket expressions.
gawk supports these character classes, as do other POSIX-compliant interpreters, although some older implementations don’t. They include [:alnum:] for alphanumeric characters, [:digit:] for numeric characters, [:alpha:] for alphabetic characters, and [:graph:] for characters that are both printable and visible (so not spaces).
Here’s an example of how to use bracketed character classes. Let’s use them to retrieve a record from our sample file:
$ gawk '/R+[[:lower:]]+[[:space:]]+E[[:lower:]]+[[:space:]]+[[:digit:]]/' employees.txt
Robert Edward 1654 7/22/54 650000
gawk scans for one or more occurrences of the letter R, followed by one or more lowercase letters and one or more whitespace characters. Next comes the letter E followed by one or more lowercase letters, one or more whitespace characters, and finally a digit. A character class such as [:lower:] is only valid inside a bracket expression, which is why each one is wrapped in a second pair of brackets.
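Character classes are also handy for validating a single field. The sketch below checks that the fourth field consists of three groups of digits separated by slashes and then flags any record whose first number is greater than 12. It assumes the dates are in month/day/year order, which the sample data suggests but doesn't state, and it uses the generic awk command on the assumption that the installed interpreter supports POSIX character classes; gawk can be substituted:

```shell
# Recreate the sample file from the setup section.
printf '%s\n' \
  'John Doe 1784 1/22/54 750000' \
  'Lucy Kibaki 2054 4/12/54 350000' \
  'Fridah Machanja 1004 19/16/54 250000' \
  'Robert Edward 1654 7/22/54 650000' > employees.txt

flagged=$(awk '
  # Only look at records whose fourth field is digits/digits/digits.
  $4 ~ /^[[:digit:]]+\/[[:digit:]]+\/[[:digit:]]+$/ {
    split($4, part, "/")
    # Assumption: the first number is a month, so it should not exceed 12.
    if (part[1] + 0 > 12)
      print "suspicious date:", $1, $2, $4
  }
' employees.txt)

printf '%s\n' "$flagged"
```

With the sample file, only the record for Fridah Machanja is flagged.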
6. mawk
The mawk interpreter was originally written by Mike Brennan, whose releases span 1991 to 1996. It then had no active maintainer until 2009, when Thomas E. Dickey started making improvements. He began with fixes from the Debian package and resolved issues that the original author hadn’t handled.
mawk is an interpreter of the AWK language that’s useful for the manipulation of files, prototyping algorithms, and text processing. It provides a few features and extensions not available in other AWK interpreters.
mawk is smaller than the other interpreters, and it’s often considerably faster than gawk at processing records.
Here’s a list of options specific to mawk that have a -W prefix:
- exec file: reads the program text from file; this must be the last option on the command line
- version: checks the current mawk version
- interactive: sets unbuffered writes to stdout and line-buffered reads from stdin
- posix_space: forces mawk to not consider ‘\n’ to be space
- usage: prints usage message to stderr and exits. It’s similar to “-W help”.
- dump: creates an assembler-like listing of the internal representation of a program to stdout
mawk allows multiple -W options, and we can combine them by separating each option with a comma.
The mawk command has this basic syntax:
$ mawk [-W option] [-F value] [-v var=value] [--] 'program text' [file ...]
Let’s check the installation directory of mawk on our system:
$ which mawk
/usr/bin/mawk
We can even check the current version available:
$ mawk -W version
mawk 1.3.4 20200120
Copyright 2008-2019,2020, Thomas E. Dickey
Copyright 1991-1996,2014, Michael D. Brennan
random-funcs: srandom/random
regex-funcs: internal
compiled limits:
sprintf buffer 8192
maximum-integer 2147483647
Let’s use mawk to print the first two lines of the employees.txt file:
$ mawk 'NR~/^(1|2)$/' employees.txt
John Doe 1784 1/22/54 750000
Lucy Kibaki 2054 4/12/54 350000
In this command, NR holds the number of the record currently being processed. The pattern matches NR against the regular expression ^(1|2)$, so only the first two lines are printed. We can select more line numbers by adding alternatives separated by pipes (‘|’).
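Matching NR against a regular expression is only one way to select lines by number. As a sketch of some alternatives, the commands below use a numeric comparison, a range pattern, and an early exit. They call the generic awk command, so mawk or any other interpreter can be substituted:

```shell
# Recreate the sample file from the setup section.
printf '%s\n' \
  'John Doe 1784 1/22/54 750000' \
  'Lucy Kibaki 2054 4/12/54 350000' \
  'Fridah Machanja 1004 19/16/54 250000' \
  'Robert Edward 1654 7/22/54 650000' > employees.txt

# Numeric comparison: records 1 and 2.
awk 'NR <= 2' employees.txt

# Range pattern: from record 2 through record 3.
awk 'NR == 2, NR == 3' employees.txt

# Print records and stop reading once the second one has been printed.
awk '{ print } NR == 2 { exit }' employees.txt
```

The last form avoids reading the rest of the input, which can matter for large files.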
7. Comparison
Let’s look at this table showing comparisons between these AWK interpreters:
| Comparison Factor | awk | nawk | gawk | mawk |
| --- | --- | --- | --- | --- |
| Definition | First interpreter of AWK language for manipulating text files | Sometimes referred to as new awk. It’s a new version of awk that had additional updates | GNU representation of the awk interpreter and has built-in features specific to gawk | An AWK language interpreter created by Mike Brennan |
| Format | original-awk [ -F fs ] [ -v var=value ] [ 'prog' \| -f progfile ] [ file... ] | nawk [ -F fs ] [ -v var=value ] [ 'prog' \| -f progfile ] [ file ... ] | gawk [ POSIX or GNU style options ] -f program-file [ -- ] file | mawk [-W option] [-F value] [-v var=value] [--] 'program text' [file ...] |
| Options available | -f progfile and -Fc | -v assignment, -F ERE, and -f progfile | -f, -F, -v, -b, -c, -C, -d, -D, -e, -E, -g, -h, -i, -I, -l, -L, -M, -n, -N, -o, -O, -p, -P, -r, -s, -S, -t, and -V | -f, -F, -v, -W dump, -W version, -W exec file, -W interactive, -W posix_space, and -W sprintf=num |
8. Conclusion
In this article, we’ve talked about the AWK scripting language and its interpreters: the original awk (sometimes referred to as old awk or oawk), nawk, gawk, and mawk.
By default, all AWK interpreters have built-in variables like FS, RS, OFS, ORS, NF, NR, and FILENAME. However, interpreters like mawk and gawk have features that aren’t available in other interpreters of the AWK language.