(Bloomberg) -- Artificial intelligence will become an important part of Reddit Inc.’s business, the company said Thursday in its long-awaited filing for an initial public offering — tapping into a revenue stream that could be both lucrative and controversial. 

San Francisco-based Reddit, a platform that hosts conversations on thousands of different topics, makes most of its money by selling ads that appear alongside social content. In its filing, the 19-year-old company outlined an additional line of business: selling that content to companies building ChatGPT-like chatbots.

Big tech companies, like Google and OpenAI, are willing to pay a lot of money for content to improve their large language models, AI software that is built using troves of data. On Thursday, in addition to its public filing, Reddit announced a deal with Alphabet Inc.’s Google, allowing Google’s AI products to use Reddit data to improve their technology. Bloomberg had earlier reported the existence of a $60 million AI deal. 

“Reddit’s vast and unmatched archive of real, timely, and relevant human conversation on literally any topic is an invaluable dataset for a variety of purposes, including search, AI training, and research,” Reddit co-founder and Chief Executive Officer Steve Huffman wrote in the filing, which described such deals as an “emerging opportunity” for the company.

In its S-1 filing, Reddit said that in January it entered into licensing agreements with an aggregate value of $203 million, with terms ranging from two to three years. The company also said that it expected to bring in at least $66.4 million from such deals this year. 

AI companies are snapping up licensing deals to feed their models more content. In December, OpenAI inked a deal worth tens of millions of euros with Axel Springer SE, which owns Politico and Business Insider. Such agreements are high-stakes, because AI models are often trained on copyrighted information, muddying claims of ownership. For example, the New York Times sued OpenAI in December, alleging copyright infringement.

Training AI models on user-generated data, the kind Reddit hosts, can also come with risks. The content is less reliably accurate than news articles, artificial intelligence researchers say. Reddit “is basically a forum where people post anything,” said Giada Pistilli, principal ethicist at Hugging Face, which makes and hosts AI models. “You can find conspiracy theories and any kind of problematic stuff.”

Os Keyes, a doctoral candidate at the University of Washington who studies artificial intelligence and data ethics, said that Reddit could introduce some problematic content into AI systems. 

“We’ve already seen that models are prone to hallucinate facts that don’t exist,” Keyes said. They pointed to a notable example, in 2013, when Reddit users incorrectly accused someone of being a suspect in the Boston Marathon bombing. “Stuff that appears on Reddit are not validated facts.”

Reddit said that when partners use its data API, they are required to stop showing content that has been taken down from the site. The company added that AI companies have already used Reddit to train models in the past without paying, and that organizing formal deals will help it enforce measures such as requiring the deletion of content that has been taken down because of policy violations.


Reddit has previously been criticized for its handling of toxic and hateful content posted by its users and largely moderated by unpaid volunteers. In 2020, about 15 years after the site’s founding, Reddit introduced a ban on hate speech. When it comes to moderating problematic content, it isn’t always clear where the line is. In 2021, for example, the company said it would leave up subreddits that spread misinformation related to Covid-19. Days later, after protest from many of its own users, Reddit banned one such forum, saying it had violated other rules.

The company says that in addition to its moderators, it has internal safety teams dedicated to enforcing its policies through both automation and human review.

If AI models absorb inaccurate content, companies can try to clean it afterward, Pistilli said, but the process can be difficult. “That’s a lot of effort and a lot of work. The better practice would be to clean your data before,” Pistilli said. “Unfortunately, people prefer quantity over quality.”

It’s still too soon to say how Reddit’s unusually vocal community of users will respond to the licensing push, if at all. Last year, thousands of subreddits staged a protest over the company’s decision to increase prices for third-party app developers.

--With assistance from Rachel Metz.

(Adds company comment on moderation in the 12th paragraph.)

©2024 Bloomberg L.P.