US Publishers Want Common Crawl to Stop Scraping Their Content

Mosegas 4 hours ago

Digital Content Next, the trade body representing US digital publishers, has sent a cease-and-desist letter to the Common Crawl Foundation.

The letter demands that Common Crawl stop collecting publisher content and remove material that is already in its dataset.

DCN CEO Jason Kint announced the legal notice in a blog post, and the Press Gazette reported more details on the letter this week.

Common Crawl has crawled several billion new pages every month since 2007 to create a free public archive. That archive has been used to train many of the AI models in use today. OpenAI’s GPT-3 paper lists filtered Common Crawl data as 60% of the model’s training mix.

The dispute matters for any site that blocks AI crawlers. Blocking Common Crawl’s crawler, CCBot, stops future collection but does not affect content already in the archive, which anyone can download.
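For readers who want to check what a CCBot block actually covers, here is a minimal sketch using Python's standard-library robots.txt parser. The robots.txt contents, the example.com URL, and the helper name is_allowed are placeholders invented for illustration, not taken from any publisher's site. The check only reports what the rules say about future crawling; it says nothing about pages already in the archive.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only: it blocks CCBot
# site-wide and leaves every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder domain.
page = "https://example.com/news/story"
ccbot_allowed = is_allowed(ROBOTS_TXT, "CCBot", page)
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", page)

print("CCBot allowed:", ccbot_allowed)
print("Other bot allowed:", other_allowed)
```

A robots.txt rule is a request that a well-behaved crawler chooses to honor, so a check like this shows intent rather than enforcement.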

What DCN wants

The letter calls on Common Crawl to stop “scraping, storing, or sharing copyrighted, paid, subscriber-only, or otherwise protected content from DCN member companies in datasets,” and to remove member content it has collected.

DCN claims that Common Crawl “flagrantly infringed” copyrighted content by creating datasets and sharing them with AI companies.

The letter says that “copyright law is not an opt-out regime.” In other words, DCN’s position is that publishers should not have to request an exclusion. Common Crawl should need permission to include their content.

Kint wrote of the notice:

“It challenges the growing assumption that content created with huge investment can be collected, stored, repurposed, and monetized simply because it’s technologically accessible.”

Why DCN Questions the Removal Process

The DCN letter asks whether Common Crawl honors opt-out instructions and whether it removes content when asked. Per Press Gazette, DCN attorneys are investigating whether Common Crawl’s statements to publishers “may have been inaccurate or misleading.”

Common Crawl publishes a public registry of websites that have requested removal. It includes submissions from the Associated Press, the BBC, and the News/Media Alliance, whose submission is the largest and covers hundreds of domains. The Press Gazette reports that the list includes other major publishers.

This is not the first time that the removal process has been questioned. The Atlantic reported in November that content from the New York Times and Danish publishers was still available after Common Crawl agreed to remove it.

Common Crawl Response

Common Crawl executive director Rich Skrenta declined to comment on the letter when contacted by the Press Gazette.

He has previously pushed back on similar claims. In a November blog post responding to The Atlantic, Skrenta denied that the organization had lied to publishers or scraped paywalled content.

He said the archive file format cannot be edited after publication without breaking its integrity. Instead, Common Crawl says it removes or filters the affected URLs from subsequent crawls and makes them inaccessible through its public tools and indexes:

“When a publisher asks us to remove previously crawled items, we respond quickly and initiate a removal process that reflects the technical design of our dataset.”

He added:

“No one at Common Crawl has ever claimed that this project was quick or perfect; rather, we were open about its complexity and ongoing nature.”

In a forum post this week, Skrenta said Common Crawl is taking part in open standards work on how websites signal their AI crawl preferences.

Why This Matters

The DCN letter targets the archive, not just future crawling, and says the burden shouldn’t fall on publishers to opt out in the first place.

Most publishers in BuzzStream’s sample have already decided to block, with 79% of the 100 news sites it surveyed blocking at least one training bot. The Cloudflare Year in Review data we compiled in January found CCBot among the bots most often fully disallowed across top domains. The question DCN raises is what those blocks accomplish if years of content remain available for training anyway.

Looking Forward

Whether DCN escalates depends on how Common Crawl responds, and Common Crawl has yet to say what it will do. The two sides want different rules on who has to act first.

Skrenta backs a standards process that lets sites signal their preferences, which keeps opt-out as the model. The UK’s CMA took a similar approach when it required Google to allow publishers to opt out of AI search features.

DCN says scrapers should need permission first. If more trade groups take up that argument, the pressure shifts from individual robots.txt files to the archives themselves.

Featured Image: Andre Boukreev/Shutterstock
