Google open-sources the robots.txt parser
Google has submitted an RFC draft of the Robots Exclusion Protocol (REP) and made its robots.txt parser available under the Apache License 2.0. Until today there was no official standard for the Robots Exclusion Protocol and robots.txt (the closest thing to one was this), which left developers and site owners free to interpret it in their own way. The company's initiative aims to reduce the differences between implementations.
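To illustrate the kind of ambiguity involved, here is a hypothetical robots.txt fragment (the paths are made up for the example). Directives such as Crawl-delay were never part of the original protocol, so crawlers disagree on them: some honor the delay, while Googlebot ignores it.

```
User-agent: *
Disallow: /private/
Allow: /private/public-page.html
# Not in the draft standard; handled differently across crawlers:
Crawl-delay: 10
```

Even the interaction of Allow and Disallow for the same prefix was interpreted differently before the draft spelled out the longest-match rule.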
A draft of the standard can be viewed on the IETF website, and the repository is available on GitHub: https://github.com/google/robotstxt
The parser is the source code that Google uses as part of its production systems (apart from minor edits, such as cleaned-up header files used only inside the company), so robots.txt files are parsed exactly the way Googlebot parses them (including how it handles Unicode characters in patterns). The parser is written in C++ and essentially consists of two files. You will need a C++11-compatible compiler, although the library's code dates back to the 1990s, and you will encounter "raw" pointers and strpbrk in it.
To build it, Bazel is recommended (CMake support is planned in the near future).
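As an example, building with Bazel looks along these lines; the target names here are taken from the repository's README, so treat them as an assumption rather than a guarantee:

```shell
# Clone the repository and enter it.
git clone https://github.com/google/robotstxt.git
cd robotstxt

# Run the tests, then build the demo binary.
bazel test :robots_test
bazel build :robots_main
```

The resulting binary can then be pointed at a local robots.txt file, a user-agent string, and a URL to check whether fetching is allowed.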
The very idea of robots.txt and the standard belongs to Martijn Koster, who created it in 1994. According to legend, the trigger was a crawler written by Charles Stross, which brought down Koster's server in what amounted to a DoS attack. Koster's idea was picked up by others and quickly became the de facto standard for anyone involved in search engine development. Those who wanted to parse robots.txt sometimes had to reverse-engineer Googlebot's behavior; among them was Blekko, which wrote its own robots.txt parser in Perl for its search engine.
The parser is not without its amusing moments: for example, take a look at .