1. Overview

We sometimes need to extract IP addresses from a file containing a long list of them. This could be, for example, from a server’s access log file.

In this tutorial, we’ll explore the different methods we can use to extract IPv4 addresses from a file.

2. Setup

Let’s create a file sample.log that contains a few lines of an access log file:

$ touch sample.log

We then open it with the Nano editor:

$ nano sample.log

Next, let’s paste in these log entries and save the file:

13.66.139.0 - - [19/Dec/2020:13:57:26 +0100] "GET /index.php?option=com_phocagallery&view=category&id=1:almhuette-raith&Itemid=53 HTTP/1.1" 200 32653 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
157.48.153.185 - - [19/Dec/2020:14:08:06 +0100] "GET /apache-log/access.log HTTP/1.1" 200 233 "-" "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" "-"
157.48.153.185 - - [19/Dec/2020:14:08:08 +0100] "GET /favicon.ico HTTP/1.1" 404 217 "http://www.almhuette-raith.at/apache-log/access.log" "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" "-"
216.244.66.230 - - [19/Dec/2020:14:14:26 +0100] "GET /robots.txt HTTP/1.1" 200 304 "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, [email protected])" "-"
54.36.148.92 - - [19/Dec/2020:14:16:44 +0100] "GET /index.php?option=com_phocagallery&view=category&id=2%3Awinterfotos&Itemid=53 HTTP/1.1" 200 30662 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)" "-"
92.101.35.224 - - [19/Dec/2020:14:29:21 +0100] "GET /administrator/index.php HTTP/1.1" 200 4263 "" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)" "-"
73.166.162.225 - - [19/Dec/2020:14:58:59 +0100] "GET /apache-log/access.log HTTP/1.1" 200 1299 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36" "-"

Each line above is a separate log entry in this format:

IP-ADDRESS - - [TIMESTAMP] REQUEST & REQUEST-INFORMATION

In the next sections, we’ll explore different methods for extracting IP addresses from this file.

3. Using grep

The Linux grep command is one of the most powerful utilities for searching for a specific string of characters in one or more files. It’s very useful in situations where we have to search through large access log files.

We’ll use it by creating a regex pattern that matches the format of IP addresses:

$ grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' sample.log
13.66.139.0
157.48.153.185
157.48.153.185
216.244.66.230
54.36.148.92
92.101.35.224
73.166.162.225

The regular expression in the command above has four identical parts separated by literal dots. Each part matches one to three digits in the range 0 to 9.

By default, the grep command prints the whole line that contains the matching pattern. We’ve used the -o option to trim the results and print only the matched parts.

The regular expression above has a flaw: it also matches dotted strings whose parts fall outside the 0 to 255 range of a valid IPv4 address. We can only rely on it when we’re sure that the input file contains only valid IPv4 addresses.
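As a quick illustration of that flaw, the sketch below runs the same pattern against two made-up lines. The addresses, the temporary file, and the loose_ips helper name are assumptions for this example and aren't part of sample.log. Both the valid and the out-of-range address are printed:

```shell
# Hypothetical input: one valid address and one with out-of-range parts
demo=$(mktemp)
printf '%s\n' '10.0.0.1 - - ok' '300.400.500.600 - - bad' > "$demo"

# The loose pattern only checks for "one to three digits", so both lines match
loose_ips() {
  grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' "$1"
}

loose_ips "$demo"
```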

3.1. Extracting Only Valid IPv4 Addresses

To make things stricter, let’s modify our regular expression to only match valid IPv4 addresses.

To test this, let’s add an invalid IP address such as “999.888.777.666” to the end of the sample.log file:

$ echo "999.888.777.666" >> sample.log

The >> operator appends the invalid IP address to the file without overwriting the existing entries.

We can then modify the regular expression and run grep again:

$ grep -Eo '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)' sample.log
13.66.139.0
157.48.153.185
157.48.153.185
216.244.66.230
54.36.148.92
92.101.35.224
73.166.162.225

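This regular expression is stricter because each of its four parts only matches values from 0 to 255, so the 999.888.777.666 entry is no longer printed. It isn’t anchored, though, so it can still match part of a longer run of digits: for an input such as 256.1.1.1, it prints 56.1.1.1.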

We’re using the -E option to interpret the patterns as extended regular expressions (EREs) and the -o option to trim the results and only print the matched part.
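One way to avoid partial matches inside longer digit runs is to require word boundaries around the pattern. The sketch below is a variant under stated assumptions rather than the command used above: it builds the pattern from a shell variable (octet and ipv4 are names picked for this example), uses the {3} repetition available in EREs, and relies on the \b escape, which is a GNU grep extension:

```shell
# Build the pattern from one octet definition to keep it readable
octet='(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
ipv4="($octet\.){3}$octet"

# Unanchored: prints 56.1.1.1, a fragment of the invalid 256.1.1.1
echo 'host 256.1.1.1' | grep -Eo "$ipv4"

# With \b on both sides (assumes GNU grep): nothing matches, so the fallback runs
echo 'host 256.1.1.1' | grep -Eo "\b$ipv4\b" || echo 'no match'
```

A dot still counts as a word boundary, so this variant would accept the first four parts of a longer dotted string such as 1.2.3.4.5.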

We can push things further and pipe the results through the sort and uniq commands to count how many times each address appears. Since uniq only collapses adjacent duplicate lines, we sort the addresses first, count them with uniq, and then sort the counts in ascending order with sort -n:

$ grep -Eo '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)' sample.log | sort | uniq -c | sort -n
      1 13.66.139.0
      1 216.244.66.230
      1 54.36.148.92
      1 73.166.162.225
      1 92.101.35.224
      2 157.48.153.185

We’re passing the -c option to the uniq command to get the total count of individual IP addresses.
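To see why the order of the pipeline stages matters, here’s a small sketch with made-up addresses where the repeated one isn’t on neighboring lines. The ips variable is just a stand-in for the output of an extraction command. The sketch also uses sort -rn, which ranks the busiest addresses first:

```shell
# Hypothetical input where the duplicate addresses are not adjacent
ips=$(printf '%s\n' 10.0.0.1 10.0.0.2 10.0.0.1)

# uniq alone sees three separate runs, so every count is 1
printf '%s\n' "$ips" | uniq -c

# Sorting first groups the duplicates; sort -rn then ranks by count, highest first
printf '%s\n' "$ips" | sort | uniq -c | sort -rn
```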

4. Using Perl

Perl, often expanded as Practical Extraction and Report Language, was originally designed for processing text and printing reports based on data read from a file or stdin. It has since grown into a general-purpose language used for writing programs from quick one-liners to full-scale applications.

We can use Perl to extract the valid IP addresses by using the same regular expression as before:

$ perl -nle '/(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)/ and print $&' sample.log
13.66.139.0
157.48.153.185
157.48.153.185
216.244.66.230
54.36.148.92
92.101.35.224
73.166.162.225

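The -n option wraps the code passed to the -e option in a loop that runs once for each input line. The -l option strips the trailing newline from each line as it’s read and appends one to everything we print, so each address ends up on its own line. The $& variable holds the text matched by the regular expression.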

We can also pipe the Perl output through the sort and uniq commands to get the total count of each IP address, and then sort the counts in ascending order (sort -n) or descending order (sort -rn):

$ perl -nle '/(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)/ and print $&' sample.log | sort | uniq -c | sort -n
      1 13.66.139.0
      1 216.244.66.230
      ...truncated...

We can even save the results to a file by using the append redirection operator (>>) and specifying an output file. If we’d rather overwrite the file on each run, we can use > instead:

$ perl -nle '/(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)/ and print $&' sample.log >> output_file.txt
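The difference between the two redirection operators is easy to check on a scratch file. In this sketch, the temporary file and the addresses are placeholders chosen for the example:

```shell
out=$(mktemp)                 # hypothetical output file

echo '10.0.0.1' >> "$out"     # append: the file has one line
echo '10.0.0.2' >> "$out"     # append again: the file has two lines
echo '10.0.0.3' >  "$out"     # overwrite: the file has one line again

cat "$out"
```

Appending suits repeated runs that should accumulate results, while overwriting keeps only the latest run.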

5. Using awk

The awk command is a Linux utility to manipulate data and generate reports based on the data. It lets us write small but effective programs as statements that define text patterns to search for. Moreover, we can define an action to perform whenever a match is found.

Let’s look at how we can use the awk command to extract all the IP addresses from the sample.log file:

$ awk 'match($0, /(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)/) {print substr($0, RSTART, RLENGTH)}' sample.log
13.66.139.0
157.48.153.185
157.48.153.185
216.244.66.230
54.36.148.92
92.101.35.224
73.166.162.225

The match() function searches the string in its first argument ($0, the whole line) for the regular expression in its second argument. When it finds a match, it sets the predefined variables RSTART and RLENGTH to the starting index and the length of the matched text. We then pass these to substr() to print only the matched part of the line.
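To make the roles of RSTART and RLENGTH more concrete, this sketch prints both values next to the extracted text. The input line is made up, and the pattern is deliberately simplified to plain digit runs so the focus stays on the two variables:

```shell
# Made-up line; a simplified digit pattern keeps the focus on RSTART and RLENGTH
echo 'client 10.0.0.1 connected' |
  awk 'match($0, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/) {
         # RSTART is the 1-based start index, RLENGTH the match length
         print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
       }'
# prints: 8 8 10.0.0.1
```

The address starts at the eighth character of the line and is eight characters long, which is why both numbers happen to be 8 here.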

Alternatively, if we’re fetching the list of IP addresses from a valid access log file with the same format as the sample.log file, we can simply use awk to fetch the first column:

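$ awk '{print $1}' sample.log
13.66.139.0
157.48.153.185
157.48.153.185
216.244.66.230
54.36.148.92
92.101.35.224
73.166.162.225
999.888.777.666

In most access log formats, the first column is the client IP address. This approach doesn’t validate anything, though, so it also prints the 999.888.777.666 test entry we appended earlier.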
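If we want the convenience of the first-column approach but still want to filter out malformed entries, one option is to test the field against an anchored version of the pattern. This is a sketch under assumptions: the valid_first_column helper, the variable names o and re, and the input lines are all invented for the example. The pattern is assembled as a string so that it also works in awk implementations without interval expressions:

```shell
# Print the first field only when the whole field is a valid dotted quad
valid_first_column() {
  awk -v o='(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)' '
    BEGIN { re = "^" o "\\." o "\\." o "\\." o "$" }  # anchored, so no partial matches
    $1 ~ re { print $1 }
  ' "$@"
}

# Hypothetical input: only the first line has a valid address in column one
printf '%s\n' '10.0.0.1 - - ok' '999.888.777.666' '256.1.1.1 - - bad' | valid_first_column
```

Because the pattern is anchored with ^ and $, a field such as 256.1.1.1 is rejected outright instead of yielding a fragment.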

6. Conclusion

In this article, we’ve explored different methods to extract IPv4 addresses from an input file. The methods are similar to one another, with the core part being the regular expression that we define.

If we want to inspect the full log entry for each IPv4 address, we can run the grep command without the -o option so that it prints the whole matching line. When we only need the list of IP addresses, any of the methods we’ve covered will work.