Questionable web content used to train LLMs

What happened: Google's massive C4 dataset was used to train prominent large language models, including Google's T5 and Meta's LLaMA.

Details: In a recent wide-ranging analysis, The Washington Post found that the public dataset contains text from Stormfront, Kiwi Farms, 4chan, and other potentially problematic websites, including at least 27 identified by the U.S. government as markets for counterfeits and piracy. Others included the white nationalist site VDARE, the far-right news site Breitbart, and the Russian state-backed outlet RT.

Background: Google built C4 (the Colossal Clean Crawled Corpus) by filtering a snapshot from Common Crawl, a nonprofit project that scrapes text from a huge number of web pages. According to Google, C4 was developed as a "cleaned version" of Common Crawl's raw web data, using heuristics that strip boilerplate, non-English text, and pages containing blocklisted words.
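
The sketch below illustrates a few of the published C4-style cleaning heuristics (keeping only lines that end in terminal punctuation, dropping very short lines and pages with too few sentences, and filtering pages against a word blocklist). It is a minimal illustration, not Google's actual pipeline, which also applies deduplication and language detection at far larger scale; the clean_page function and the BLOCKLIST placeholder are assumptions introduced here for clarity.

```python
from typing import Optional

# Illustrative stand-in; the real pipeline checks pages against a published
# "bad words" list and also deduplicates text and detects language.
BLOCKLIST = {"example_slur", "example_obscenity"}

def clean_page(text: str) -> Optional[str]:
    """Apply simplified C4-style heuristics to one scraped page.

    Returns cleaned text, or None if the whole page is dropped.
    """
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if len(line.split()) < 3:                    # drop very short lines
            continue
        if not line.endswith((".", "!", "?", '"')):  # keep only lines ending in terminal punctuation
            continue
        if "javascript" in line.lower():             # drop "enable JavaScript" boilerplate
            continue
        if "lorem ipsum" in line.lower():            # drop placeholder text
            continue
        kept.append(line)

    if len(kept) < 5:                                # drop pages with too few sentences
        return None
    cleaned = "\n".join(kept)
    if any(word in cleaned.lower() for word in BLOCKLIST):  # page-level blocklist filter
        return None
    return cleaned
```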

Why it matters: The Post released a search tool that lets website owners and others find out whether a specific site contributed text to Google's C4 dataset. The investigation found that the dataset was dominated by websites related to journalism, content creation, entertainment, and software development, with patents.google.com, wikipedia.org, and scribd.com the top three sources. However, text drawn from the more questionable sites could lead models trained on the data to generate racist, pornographic, unreliable, or otherwise harmful output.
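
For readers who prefer to inspect the data directly rather than rely on the Post's tool, the sketch below shows one way to scan a streamed sample of the public allenai/c4 copy on Hugging Face for URLs from a given domain. It assumes the Hugging Face datasets library and that dataset's url field; because it scans only a bounded sample, it can miss sites that appear deeper in the corpus, so treat a False result as inconclusive.

```python
from urllib.parse import urlparse

from datasets import load_dataset  # pip install datasets

def domain_in_c4_sample(domain: str, max_records: int = 100_000) -> bool:
    """Scan a streamed sample of the public C4 copy (allenai/c4, "en" config)
    for URLs belonging to `domain` or any of its subdomains."""
    stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
    for i, record in enumerate(stream):
        host = urlparse(record["url"]).netloc
        if host == domain or host.endswith("." + domain):
            return True
        if i + 1 >= max_records:  # stop after a bounded sample
            return False
    return False

if __name__ == "__main__":
    print(domain_in_c4_sample("patents.google.com"))
```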

