Google May Expand List of Unsupported Robots.txt Rules

Google may expand the list of unsupported robots.txt rules in its documentation based on analysis of real-world robots.txt data collected through the HTTP Archive.

Gary Illyes and Martin Splitt described the project in a recent episode of Search Off the Record. The work began after a community member submitted a pull request to Google’s robots.txt site suggesting two new tags be added to the unsupported list.

Illyes explained why the team broadened the scope beyond the two tags proposed in the pull request:

“We tried not to do things by ourselves, but instead to gather information.”

Rather than adding only the two proposed tags, the team decided to look at the top 10 or 15 most used unsupported rules. Illyes said the goal is “a good starting point, a decent foundation” for documenting the unsupported rules that are most common in the wild.

How the Research Works

The team used the HTTP Archive to learn which rules websites actually use in their robots.txt files. HTTP Archive runs monthly crawls of millions of URLs using WebPageTest and stores the results in Google BigQuery.

The first attempt hit a wall. The team “quickly discovered that no one is really asking for robots.txt files” during the crawls, which meant the HTTP Archive datasets did not include robots.txt content.

After consulting with Barry Pollard and the HTTP Archive community, the team wrote a custom JavaScript metric that extracts robots.txt rules line by line. The custom metric was in place ahead of the February crawl, and the resulting data is now available in the custom_metrics dataset in BigQuery.
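The custom metric itself isn’t shown in the episode; a minimal Python sketch of that kind of line-by-line extraction, assuming a simple “field: value” grammar (the function name and sample file are illustrative, not Google’s code):

```python
import re

# Matches a "field: value" robots.txt line; field names are
# case-insensitive and may be preceded by whitespace.
RULE_RE = re.compile(r"^\s*([A-Za-z-]+)\s*:\s*(.*?)\s*$")

def extract_rules(robots_txt: str) -> list[tuple[str, str]]:
    """Return (field, value) pairs for every line matching the pattern."""
    rules = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0]  # strip trailing comments
        m = RULE_RE.match(line)
        if m:
            rules.append((m.group(1).lower(), m.group(2)))
    return rules

sample = """User-agent: *
Disallow: /private/
Crawl-delay: 10   # not acted on by Google
"""
print(extract_rules(sample))
# → [('user-agent', '*'), ('disallow', '/private/'), ('crawl-delay', '10')]
```

Lines that don’t match the pattern at all, such as HTML served in place of a robots.txt file, simply produce no rules, which mirrors the garbage-data problem the team describes below.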

What the Data Shows

The parser extracts every line that matches a “field: value” pattern. Illyes described the resulting distribution:

“After allow and disallow and user-agent, the drop-off is very large.”

Beyond those three fields, the distribution falls into a long tail of unusual directives and garbage data from broken files that return HTML instead of plain text.

Google currently supports four fields in robots.txt: user-agent, allow, disallow, and sitemap. The documentation notes that some fields are “unsupported” without listing which unsupported fields are most common in the wild.
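A site owner can check their own file against that supported set; a minimal sketch (the supported list comes from the article, the function name and sample file are illustrative):

```python
# Fields Google's robots.txt parser acts on, per Google's documentation.
SUPPORTED_FIELDS = {"user-agent", "allow", "disallow", "sitemap"}

def unsupported_fields(robots_txt: str) -> set[str]:
    """Return field names in a robots.txt body that fall outside the supported set."""
    found = set()
    for line in robots_txt.splitlines():
        field, sep, _ = line.partition(":")
        if sep:
            name = field.strip().lower()
            if name and " " not in name and name not in SUPPORTED_FIELDS:
                found.add(name)
    return found

body = "User-agent: *\nCrawl-delay: 10\nDisallow: /admin/\nHost: example.com"
print(unsupported_fields(body))
# → {'crawl-delay', 'host'}
```

Anything this returns, such as crawl-delay or host in the sample, is a rule Google ignores even though other crawlers may honor it.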

Google has stated that unsupported fields are ignored. The current project extends that guidance by identifying the specific rules Google plans to document.

The top 10 to 15 most used rules beyond the four supported fields are expected to be added to Google’s unsupported rules list. Illyes did not say which specific rules would be included.

Typo Tolerance May Be Expanded

Illyes said the analysis also turned up a common misspelling of the disallow rule:

“Maybe I’ll increase the typos we accept.”

His phrasing suggests the parser already accepts some misspellings. Illyes did not commit to a timeline or name the specific typos.

Why This Matters

Search Console already flags unknown robots.txt rules. If Google documents the most common unsupported rules, its public documentation would more closely match the warnings site owners already see in Search Console.

Looking Forward

The planned update will affect Google’s public documentation and possibly how typos in disallow rules are handled. Anyone maintaining a robots.txt file with rules beyond user-agent, allow, disallow, and sitemap should be aware that those rules have never had any effect on Google’s crawling.

The HTTP Archive data is publicly queryable in BigQuery for anyone who wants to examine the distribution directly.


Featured Image: Screenshot from: YouTube.com/GoogleSearchCentral, April 2026.
