Google Says Markdown for AI SEO Strips Important Parts

Mosegas 12 hours ago

In a recent Search Off the Record podcast, hosts John Mueller and Martin Splitt pushed back on the AI SEO-inspired idea that stripped-down, content-only Markdown versions of web pages are the best way to rank better in AI search. They make the case that the things AI SEOs want to strip out are actually useful for ranking.

Non-Content Sections of Web Pages Are Important

The TL;DR for this part is that HTML exists so that browsers can render a visual page for people and screen readers can read it aloud.

Martin Splitt starts the discussion by explaining why plain HTML can seem like a poor way to provide content to AI agents and LLMs. The idea is that, in addition to the content, the HTML contains a lot of other code that is irrelevant to the LLM or AI agent that might be visiting the site to find the content.

The appeal of Markdown is that it can provide the content stripped of all the HTML that is intended to make a web page visible to humans or readable by a screen reader.

Splitt explains:

“And I think that’s why people think it’s good for LLMs, because you have little things, little tokens. And if you look at the HTML file without the browser that we provide, if you just look at plain HTML in a text editor, basically, then it’s hard to read the content, because there’s so much cruft, so many things in it.”

He also praises Markdown for its ability to convey the essence of the content:

“But if the Markdown render fails and you look at the Markdown file in a text editor, it’s still structured and readable. Like the link is the name of the link text, like the anchor text, then in square brackets and then in regular brackets. That’s probably what I would do if text was all I had.
If I were writing an email without actually being able to link things, I would mark some kind of link text and put some kind of way of saying, like, and that’s where you need to go to see that.
And I think this minimalism is probably what makes people think, yes, this is good for a machine that needs to understand this content, unlike HTML.”

Converting HTML To Text Is Trivial

Mueller and Splitt noted that however complex HTML looks, crawling it and making sense of it is trivial. The selling point of markdown for LLMs, that it makes content easier to crawl and index, breaks down completely at this point.

John Mueller explains:

“I think the big thing is that the web with HTML and everything has been around for a really long time, longer than Markdown. And all the crawlers out there, they’ve practiced HTML. And converting HTML to text is a trivial thing. There’s a lot of libraries that can do that for you. So if you think about what the average web browser might look for or might need to understand that page in an HTML page.”
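
Mueller's point about libraries can be illustrated with a minimal sketch that uses only Python's standard-library html.parser module. The names TextExtractor and html_to_text and the sample page are hypothetical, made up for this illustration. This is not how Google or any particular crawler extracts text; it only shows how little code the basic job takes.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self.parts.append(text)


def html_to_text(html: str) -> str:
    """Return the text of an HTML document as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical sample page, used only for this illustration.
SAMPLE_HTML = """
<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body><nav><a href="/about">About</a></nav>
<p>Hello <b>world</b>.</p><script>var x = 1;</script></body></html>
"""

print(html_to_text(SAMPLE_HTML))
```

The extractor keeps the navigation text and the body text while dropping the stylesheet and script, which is the "cruft" that makes raw HTML hard to read in a text editor.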

Markdown Fails At Discovery

Discovery is when a crawler visits a web page and finds other web pages through its links, both within one website and from website to website.

Splitt said that Markdown focuses on just one aspect of a page: the content itself. He explained that this makes it difficult for search engines to see a web page in the context of how it connects to the rest of the website’s content through links, which aid in discovery.

He explained:

“Yes, and I mean, another thing, yes, it’s good that Markdown is usually focused on a piece of content, but the HTML with all the links and navigation and headings and all kinds of things like that are extracted from the Markdown files that make the website important to understand the structure and how this connects to the rest of the site.
So I think that is also a bad thing. If we’re going to lose this, that’s probably not worth crawling on Discovery, huh?”
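
Splitt's discovery concern can be sketched by comparing the links a crawler could follow from an HTML page with the links left in a content-only Markdown version of the same page. Everything here is an assumption made for illustration: the sample page, its Markdown counterpart, and the helper names links_in_html and links_in_markdown. The Markdown link pattern is also deliberately simplified and only handles inline links of the form [text](url).

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def links_in_html(html: str) -> list:
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Simplified pattern for inline Markdown links: [anchor text](url)
MD_LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def links_in_markdown(markdown: str) -> list:
    return MD_LINK.findall(markdown)


# Hypothetical page with navigation, main content, and a footer.
PAGE_HTML = """
<nav><a href="/">Home</a> <a href="/guides/">Guides</a> <a href="/blog/">Blog</a></nav>
<article><h1>Widget care</h1>
<p>Clean it weekly. See the <a href="/guides/cleaning">cleaning guide</a>.</p></article>
<footer><a href="/contact">Contact</a></footer>
"""

# Hypothetical content-only Markdown version of the same page.
PAGE_MD = """# Widget care

Clean it weekly. See the [cleaning guide](/guides/cleaning).
"""

html_links = links_in_html(PAGE_HTML)
md_links = links_in_markdown(PAGE_MD)
lost = [url for url in html_links if url not in md_links]

print("Links in HTML:", html_links)
print("Links in Markdown:", md_links)
print("Lost for discovery:", lost)
```

In this sample, only the in-content link survives in the Markdown version. The home, section, and contact links that tie the page to the rest of the site are gone, which is the structural information Splitt says gets lost.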

Takeaways

From reading patents and research papers, it is clear that search engines see a website as a collection of individual web pages, but also as groups of web pages belonging to categories and subcategories, and as the entire website itself. Zoom out, and the website is one point among thousands and thousands of other websites in the world of websites, self-organized by links into categories and quality levels.

In SEO, we have to understand the site from a zoomed-out view to think about how all the pieces fit together. The reason is that this is what search engines do.

AI-based SEO seems to be focused on making it easier for LLMs and AI agents to crawl and index content. Crawling and indexing are valid concerns. But by insisting on markdown files, AI SEOs overlook the basics of discovery and how trivial it is to extract content from an HTML web page, which makes markdown files useless.

Apart from the above issues, there is also one of honesty. There used to be something called the meta keywords tag, which some search engines used for hints about what a web page was about. Naturally, site owners and SEOs used it to dump all the keywords they wanted to rank for, regardless of the content.

I’m not saying SEOs and website owners aren’t honest, but search traffic is money, and people will do what they will. So the final consideration is that search engines will never trust markdown content or treat it as canonical when it is trivial to crawl the HTML and extract the original content from it.

Circling back to what Mueller and Splitt discussed, Google’s position is that the AI SEO insistence on markdown strips out a significant amount of valuable content.

Watch Search Off the Record Episode 111 here:
