Robots.txt explained

robots.txt
Long name: Robots Exclusion Protocol
Status: Proposed Standard
First published: 1994; formally standardized in 2022

robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.
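For illustration, a minimal robots.txt in the format described here might look as follows; the User-agent and Disallow lines and the #-comment are part of the protocol, while the path is a hypothetical placeholder:

    # Rules for every crawler: keep out of /private/, everything else is allowed
    User-agent: *
    Disallow: /private/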

The standard, developed in 1994, relies on voluntary compliance. Malicious bots can use the file as a directory of which pages to visit, though standards bodies discourage countering this with security through obscurity. Some archival sites ignore robots.txt. The standard was used in the 1990s to mitigate server overload. In the 2020s many websites began denying bots that collect information for generative artificial intelligence.

The "robots.txt" file can be used in conjunction with sitemaps, another robot inclusion standard for websites.

History

The standard was proposed by Martijn Koster[1][2] while working for Nexor[3] in February 1994[4] on the www-talk mailing list, the main communication channel for WWW-related activities at the time. Charles Stross claims to have provoked Koster into suggesting robots.txt after he wrote a badly behaved web crawler that inadvertently caused a denial-of-service attack on Koster's server.[5]

The standard, initially called RobotsNotWanted.txt, allowed web developers to specify which bots should not access their website, or which pages bots should not access. The internet was small enough in 1994 to maintain a complete list of all bots, and server overload was a primary concern. By June 1994 it had become a de facto standard;[6] most bots complied, including those operated by search engines such as WebCrawler, Lycos, and AltaVista.[7]

On July 1, 2019, Google announced the proposal of the Robots Exclusion Protocol as an official standard under the Internet Engineering Task Force.[8] A proposed standard was published in September 2022 as RFC 9309.

Standard

When a site owner wishes to give instructions to web robots, they place a text file called robots.txt in the root of the web site hierarchy (e.g. https://www.example.com/robots.txt). Robots that choose to follow the instructions fetch and parse this file before retrieving any other file from the website; if the file does not exist, they assume the owner places no limitations on crawling the site.
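A sketch of a file served at that location, combining a group for one named crawler with a blanket rule for all others; the bot name and paths are illustrative, not taken from any real deployment:

    # Exclude one named crawler from the whole site
    User-agent: BadBot
    Disallow: /

    # Every other crawler may fetch anything except /tmp/ and /junk/
    User-agent: *
    Disallow: /tmp/
    Disallow: /junk/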

Notes and References

  1. "Historical". Greenhills.co.uk. Retrieved 2017-03-03. Archived at https://web.archive.org/web/20170403152037/http://www.greenhills.co.uk/historical.html (2017-04-03).
  2. Fielding, Roy (1994). "Maintaining Distributed Hypertext Infostructures: Welcome to MOMspider's Web" (PostScript). First International Conference on the World Wide Web, Geneva. Retrieved September 25, 2013. Archived at https://web.archive.org/web/20130927093658/http://www94.web.cern.ch/WWW94/PapersWWW94/fielding.ps (2013-09-27).
  3. "The Web Robots Pages". Robotstxt.org. 1994-06-30. Retrieved 2013-12-29. Archived at https://web.archive.org/web/20140112090633/http://www.robotstxt.org/orig.html#status (2014-01-12).
  4. Koster, Martijn (25 February 1994). "Important: Spiders, Robots and Web Wanderers". www-talk mailing list. Original link dead; Hypermail message archived at https://web.archive.org/web/20131029200350/http://inkdroid.org/tmp/www-talk/4113.html (October 29, 2013).
  5. "How I got here in the end, part five: 'things can only get better!'". Charlie's Diary. 19 June 2006. Retrieved 19 April 2014. Archived at https://web.archive.org/web/20131125220913/http://www.antipope.org/charlie/blog-static/2009/06/how_i_got_here_in_the_end_part_3.html (2013-11-25).
  6. Pierce, David (14 February 2024). "The text file that runs the internet". The Verge. Retrieved 16 March 2024.
  7. Schwartz, Barry (30 June 2014). "Robots.txt Celebrates 20 Years Of Blocking Search Engines". Search Engine Land. Retrieved 2015-11-19. Archived at https://web.archive.org/web/20150907000430/http://searchengineland.com/robots-txt-celebrates-20-years-blocking-search-engines-195479 (2015-09-07).
  8. "Formalizing the Robots Exclusion Protocol Specification". Official Google Webmaster Central Blog. 2019-07-10. Retrieved 2019-07-10. Archived at https://web.archive.org/web/20190710060436/https://webmasters.googleblog.com/2019/07/rep-id.html (2019-07-10).