LLMs ‘Wouldn’t Exist’ Without Reddit Data

Mosegas 2 hours ago

5 minutes read

Reddit CEO Steve Huffman said that large language models “wouldn’t exist as we know it” without Reddit content. He called user-generated data the “modern fuel” of AI.

Huffman made the comments during an interview at the World’s Most Innovative Companies Conference.

What Huffman Says About Reddit’s Value In AI

Huffman described where Reddit data sits in the AI ecosystem.

Huffman said:

“LLMs would not exist as we know it without Reddit. Reddit is one of the largest sources of training information for LLMs and Reddit continues to be one of the main sources of both training data and we are the most cited platform, the most cited platform of all models.”

He attributed the citation claim to Profound, an AI data tracking company.

Huffman explained why AI companies depend on Reddit’s content.

“There is no artificial intelligence without real intelligence. At the end of the day, these models are very simple. They replicate to a large extent what they’ve consumed elsewhere, and a large part of that consumption is actually people’s discussion on Reddit because it’s natural and covers basically every topic you can think of.”

Deals For Some, Lawsuits For Others

Reddit announced data licensing agreements with Google and OpenAI in 2024. Huffman referred to those as Reddit’s first AI data deals and did not announce any additional deals.

“Since we did the first two deals with Google and OpenAI, that was over two years ago, so we’ve learned a lot. They’ve learned a lot. The whole world has learned a lot. Exactly how valuable Reddit data is and how useful it is. And so I think we’re very deliberate and selective there. But yes, we’re open and open for business.”

Reddit took legal action against companies that did not agree to licensing terms. The company sued Anthropic in California Superior Court, alleging unauthorized use of Reddit content and violation of Reddit’s terms of service. Reddit also filed a federal lawsuit against Perplexity and three data-scraping firms in the Southern District of New York, alleging DMCA anti-circumvention violations and related claims.

Huffman drew a line between the two groups.

“Companies like Google and OpenAI where we’ve had a good relationship, we can actually make an agreement and implement measures to monitor and access our data on behalf of our users but then collaborate on making products for the next generation of the Internet.”

He added that “not all companies are willing to be a cooperative partner, so unfortunately we have to go the other way, which is litigation.”

Huffman told the audience that Reddit’s position on commercial use is simple. “The commercial use of our data requires commercial terms,” he said. Reddit began charging for commercial API access in 2023, a move that predates current licensing agreements.

Huffman said Reddit still provides free data access to researchers and universities and is trying to remain flexible for non-commercial use.

What Changed Reddit’s Openness

According to Huffman, Reddit’s willingness to share data freely changed when the AI industry moved away from open research. As SEJ previously reported, Reddit has limited access for most search engines, while Google remains the exception.

“Historically, Reddit was like we were born on the open internet and Reddit was open and had permission to access its data. And honestly, I think we would be in a different situation today if AI companies were still open and still operating as open source and doing open research.”

Huffman said the problem is that Reddit can no longer track how its data is being used. “People are using our data and we don’t know what it was used for,” he told the audience.

Aside from commercial terms, Huffman said Reddit wants to prevent its data from being used to identify users, target them with ads, or replace or differentiate the platform.

Reddit’s Own AI Efforts

Huffman acknowledged what he called “a paradox.” Reddit’s content powers third-party AI programs, but the company also uses AI throughout its site.

The most visible product is Reddit Answers, an LLM-powered search feature. It reads posts and comments and organizes them into responses made up of verbatim quotes from users. Huffman noted that it is designed for questions without definitive answers.

“What Reddit Answers does are a couple of things that are different from Reddit. First, it only answers with voice quotes from real people. And then the second thing it does is try to present more opinions because the whole point when you’re on Reddit, you want someone’s opinion.”

Behind the scenes, Reddit uses AI for content moderation and classification. LLMs can assess whether a comment constitutes plagiarism, something Huffman previously described as difficult because of the subjectivity involved.

Huffman framed AI moderation as a way to reduce exposure to bad content, not as a replacement for Reddit’s community moderation model.

“The worst job on the Internet used to be looking at the worst content on the Internet and deciding whether or not it was online,” Huffman said. “That job ends.”

The Gray Area of AI-Written Posts

Huffman also addressed the challenge of users writing content with AI tools and pasting it into Reddit. That is different from automated bot activity, he emphasized.

“The most annoying thing I see not just on Reddit, but all over the internet is someone who wrote a post or comment with ChatGPT and then pasted it on Reddit. Like, is that a bot? It sure sounds like a bot, but there’s someone behind the idea.”

Huffman framed the issue around a single principle. “It’s very important to us that there’s someone behind the idea, behind the content, behind the information,” he said. But he also noted that the “writing sucks” when users rely on AI to write their posts.

Rather than creating a policy to address it, Huffman indicated that Reddit will let its communities handle the issue. Users are already policing AI-written posts and calling them out in comments. Huffman said Reddit will “give more power to users and more subreddits to reject that type of content altogether.”

He compared the broader question to calculators in math class. “Kids these days are just learning to write with AI. What are we going to do about it?” he said. “We have to learn, I think, along with everyone else.”

Why This Matters

Huffman’s comments reinforce Reddit’s position that user conversations are a key input for AI systems.

The AI-authored content problem Huffman described is one SEJ covered as part of a broader investigation into AI on YouTube. Reddit’s decision to let community policing manage AI-generated posts, rather than building detection tools, is a different approach from platforms that have used automatic labeling.

Looking Forward

Huffman told Fast Company that Reddit is “in the market talking to people all the time” about new data deals, though he did not reveal a third deal.

Reddit’s lawsuits against Anthropic and Perplexity are both ongoing. The Anthropic lawsuit was the subject of a federal court hearing in March.
