1. Introduction
In this tutorial, we’ll study what a robots.txt file is and how attackers can exploit it to gain information about a web server.
2. The Robots Exclusion Protocol
The robots.txt file is described in RFC 9309, which defines the “Robots Exclusion Protocol” (REP). This protocol aims to regulate the behavior of automated crawlers and spiders on webpages. A web domain can contain, in the root folder of the web server, a file called robots.txt that comprises instructions for the crawlers. These instructions are organized as key: value pairs and can be complemented by comments prefixed with a hash character (#).
The most common keys used in robots.txt are:
- User-agent: this indicates the crawler (identified by its user agent string) to which the subsequent rules apply
- Disallow (or, alternatively, Allow): this lists the paths that the bot can’t access (respectively, the paths that it can access)
- Crawl-delay: this specifies the minimum interval that the server expects between two sequential requests
A basic robots.txt file can look like this:
```
# This is a comment to the file.
# This robots.txt informs Google's main web crawler
# that it is not allowed to access the directory
# "resources", and that it should wait 10 seconds
# between each sequential access to the webpage.

User-agent: googlebot
Disallow: /resources/
Crawl-delay: 10
```
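To make these semantics concrete, here’s a minimal sketch that interprets the file above with Python’s standard urllib.robotparser module (the URLs are just placeholders):

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the example above
robots_txt = """\
User-agent: googlebot
Disallow: /resources/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# googlebot must not fetch anything under /resources/ ...
print(parser.can_fetch("googlebot", "https://www.example.com/resources/data.html"))  # False

# ... but other paths remain accessible
print(parser.can_fetch("googlebot", "https://www.example.com/index.html"))  # True

# The parser also exposes the requested delay between requests
print(parser.crawl_delay("googlebot"))  # 10
```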
These lines are indications rather than strict “instructions”. This is because compliance with the RFC 9309 standard isn’t enforced, and crawlers follow it on a voluntary basis.
However, since the standard lets the administrator of a web server express the limitations that robots should respect when accessing web resources in an automated manner, it’s good practice to follow the administrator’s indications. An attacker can, however, exploit robots.txt, which is why we have to be careful about what we put in it.
3. How Attackers Exploit robots.txt
Because it contains information on the structure of a website, the robots.txt file can be used by an attacker to learn about resources that can’t be discovered simply by following hyperlinks. If we follow common security practices when building a web server, we’ve certainly disabled directory listing and created rules for accessing resources. There’s still a risk, however, that attackers take advantage of the robots file to learn about the structure of our web server.
For example, some Apache servers use a module that reports the status of the server. That report can expose in clear text, among other things, logins and passwords that are sent to our server via the POST method. In certain configurations of that module, the report is accessible as a webpage under www.example.com/?server-status, and we might be tempted to prevent crawlers from indexing it.
Therefore, we might consider including the following:
```
# Disallow crawlers from accessing
# the server status page

User-agent: *
Disallow: /?server-status
```
If the crawlers follow the instructions in our robots.txt, they’ll no longer access that resource. In doing so, however, we’d effectively confirm the existence of an important vulnerability in our server. A human attacker, simply by reading the robots file, can infer that there’s a log of the web server’s activity and that this log is exposed to the internet.
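To see how little effort this takes, here’s a minimal Python sketch (the target URL is just a placeholder) that downloads a site’s robots.txt and lists every path the administrator asked crawlers to avoid:

```python
import urllib.request

# Hypothetical target: replace with the site under assessment
url = "https://www.example.com/robots.txt"

with urllib.request.urlopen(url) as response:
    robots = response.read().decode("utf-8", errors="replace")

# Every Disallow line points the attacker at a path that the
# administrator considers worth hiding from crawlers
disallowed = [
    line.split(":", 1)[1].strip()
    for line in robots.splitlines()
    if line.strip().lower().startswith("disallow:")
]

print("\n".join(disallowed))
```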
4. How to Mitigate the Vulnerability
We can follow three methods to prevent attackers from learning about a website’s structure.
First, we shouldn’t rely on the principle of security through obscurity. If a certain resource or directory shouldn’t be accessed remotely, it should either not be placed on a machine exposed to the internet, or access to it should be restricted through the web server’s configuration rules. Removing all the links that point to it from other webpages isn’t sufficient.
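For instance, instead of merely hiding the Apache status page discussed earlier, we can restrict who may request it. This is a minimal sketch of an Apache 2.4 configuration, assuming mod_status is enabled and served from the conventional /server-status location:

```
# Sketch: serve the status report only to requests
# originating from the local machine
<Location "/server-status">
    SetHandler server-status
    Require local
</Location>
```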
We can also avoid pointing the attacker to the most valuable targets by not enumerating the resources that crawlers shouldn’t access. Suppose we disallow crawling in general (by indicating Disallow: / in the robots file) and then grant access to individual resources or directories (by using the Allow key). In that case, we never explicitly mention the paths that are most sensitive. This is, in a sense, analogous to the preference for whitelists over blacklists: a whitelist only names what is already authorized, whereas a blacklist spells out exactly the resources we consider worth protecting.
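For example, a whitelist-style robots.txt might look like this (the allowed directories are purely hypothetical):

```
# Block crawling by default, then explicitly allow only the
# public sections; sensitive paths are never named
User-agent: *
Disallow: /
Allow: /blog/
Allow: /docs/
```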
Finally, we can also limit the use of the robots file altogether. We can do so by including the relevant indications for the crawlers directly in the HTML header of a page. For example, if we don’t want crawlers to index a certain page, we can include the following in its HTML head:
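```
<meta name="robots" content="noindex">
```

This standard robots meta tag asks compliant crawlers not to index the page, without ever listing it in a publicly readable file.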
In this manner, we’ll make it harder for any attackers to exploit robots.txt to learn about the structure of the web server.
5. Conclusion
In this article, we studied what a robots.txt file is and how attackers can exploit it to learn about the directory structure of a web server. We also saw how to mitigate this vulnerability by drafting the robots file appropriately.