OpenAI’s Training Practices Under Scrutiny: New Findings on Copyrighted Data
A study co-authored by researchers at the University of Washington, Stanford, and the University of Copenhagen raises significant questions about OpenAI’s training practices. It provides evidence that some of OpenAI’s AI models may indeed have been trained on copyrighted material, an issue already at the heart of ongoing lawsuits against the company.
Legal Background
OpenAI faces multiple lawsuits filed by authors, programmers, and other rights holders who claim the company unlawfully used their original works, ranging from books to computer code, to develop its AI models. OpenAI maintains that this use is protected by fair use; the plaintiffs argue that current U.S. copyright law contains no such carve-out for training AI.
Insights from the Study
The newly published research introduces a novel methodology aimed at identifying specific training data that models may have “memorized.” These models function as prediction engines, absorbing vast amounts of data to discern patterns that enable them to generate diverse outputs including text, images, and more.
Interestingly, while the majority of the content produced by these models is not mere regurgitation of training data, the research indicates that instances of verbatim copying do occur. For example, certain image models have been known to reproduce screenshots from movies, while language models have shown tendencies to plagiarize from news articles.
Methodology Details
The study’s approach uses the concept of “high-surprisal” words: terms that are statistically unlikely in their surrounding context. For instance, the term “radar” in a sentence involving “humming” is high-surprisal compared to more generic alternatives that would fit the same slot far more often. The authors examined several iterations of OpenAI’s models, including GPT-4 and GPT-3.5, by removing high-surprisal words from excerpts of fiction books and New York Times articles and prompting the models to predict the masked words. The reasoning is that a model which reliably recovers such unlikely words can hardly be relying on general language patterns alone, so correct predictions point towards the model having memorized those particular snippets.
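To make the probe concrete, the sketch below mimics this masked-word test against the OpenAI chat completions API. It is a minimal illustration, not the authors’ exact protocol: the example passage, model name, prompt wording, and scoring rule are all illustrative assumptions.

```python
# Minimal sketch of a masked high-surprisal-word probe (not the authors' exact
# protocol). Assumes the official `openai` Python package (v1+) and an
# OPENAI_API_KEY in the environment; the passage, model name, and prompt
# wording below are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def guess_masked_word(passage_with_mask: str, model: str = "gpt-4") -> str:
    """Ask the model to fill in the single [MASK] token in a passage."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "One word in the following passage has been replaced by [MASK]. "
                "Reply with only the missing word.\n\n" + passage_with_mask
            ),
        }],
    )
    return response.choices[0].message.content.strip().lower()

# Hypothetical excerpt: "radar" is high-surprisal here because more generic
# words would fit the context far more often, so a correct guess is hard to
# explain by general language patterns alone.
passage = "We sat perfectly still, listening to the [MASK] humming overhead."
original_word = "radar"

prediction = guess_masked_word(passage)
# In the study's framing, reliably recovering such unlikely words from known
# copyrighted excerpts is treated as evidence of memorization.
print(f"model guessed: {prediction!r}  correct: {prediction == original_word}")
```

In practice, a probe like this would be run over many excerpts and scored in aggregate; a single correct guess proves little on its own.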
Key Findings
The results revealed that GPT-4 showed signs of memorization, particularly of well-known fiction texts from BookMIA, a dataset containing samples of copyrighted ebooks. The model was also found to have memorized excerpts from New York Times articles, though to a lesser extent.
Conclusion and Future Implications
Abhilasha Ravichander, a doctoral candidate at the University of Washington and co-author of the study, suggested that these findings are pivotal in understanding the contested data used to train these models, emphasizing that models must be transparent and open to scientific examination to be trustworthy. “In order to have large language models that are trustworthy, we need to have models that we can probe and audit and examine scientifically,” Ravichander stated, reinforcing the call for improved data transparency in artificial intelligence.
OpenAI has been a proponent of looser restrictions on the use of copyrighted content in AI training. Although the company has established several licensing agreements and mechanisms for copyright owners to opt out of having their works included in training datasets, it continues to advocate for legislative adaptations to fair use as it pertains to AI.