About /robots.txt
In a nutshell
Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called the Robots Exclusion Protocol.
It works like this: a robot wants to visit a Web site URL, say
http://www.example.com/welcome.html. Before it does so, it first checks for
http://www.example.com/robots.txt, and finds:
Code:
User-agent: *
Disallow: /
The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
There are two important considerations when using /robots.txt:
- robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention.
- the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use.
So don't try to use /robots.txt to hide information.
Why did this robot ignore my /robots.txt?
It could be that it was written by an inexperienced software writer. Occasionally schools set their students "write a web robot" assignments.
But these days it's more likely that the robot is explicitly written to scan your site for information to abuse: it might be collecting email addresses to send email spam, looking for forms to post links to ("spamdexing"), or probing for security holes to exploit.
Can I block just bad robots?
In theory yes; in practice, no. If the bad robot obeys /robots.txt, and you know the name it scans for in the User-Agent field, then you can create a section in your /robots.txt to exclude it specifically. But almost all bad robots ignore /robots.txt, making that pointless.
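As a sketch, if such a robot did honor /robots.txt and scanned for, say, the name "BadBot" (a made-up name here; substitute whatever string the robot actually looks for), that section would be:
Code:
User-agent: BadBot
Disallow: /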
If the bad robot operates from a single IP address, you can block its access to your web server through server configuration or with a network firewall.
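For example, on a Linux server you could drop all traffic from the offending address with a firewall rule along these lines (203.0.113.42 is just a documentation placeholder address; a deny rule in your web server configuration achieves the same thing):
Code:
iptables -A INPUT -s 203.0.113.42 -j DROP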
If copies of the robot operate at lots of different IP addresses, such as hijacked PCs that are part of a large botnet, then it becomes more difficult. The best option then is to use advanced firewall rules configuration that automatically blocks access to IP addresses that make many connections; but that can hit good robots as well as your bad robots.
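One hedged sketch of such a rule, using the iptables connlimit module, rejects new HTTP connections from any single address that already has more than 20 open at once (the threshold of 20 is arbitrary and needs tuning for your traffic, and a busy but legitimate crawler can trip it too):
Code:
iptables -A INPUT -p tcp --syn --dport 80 -m connlimit --connlimit-above 20 --connlimit-mask 32 -j REJECT --reject-with tcp-reset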