A new study from top universities shows that OpenAI’s AI models, like GPT-4, memorized parts of copyrighted books, news articles, and other works. Researchers say this could explain why OpenAI is facing lawsuits from authors and publishers who claim their work was used without permission.
How Did Researchers Discover This?
The study was conducted by researchers from the University of Washington, the University of Copenhagen, and Stanford University. The team tested OpenAI’s models by removing specific words from passages of books and news articles, then checking whether the AI could fill in the blanks accurately.
For example, in the sentence “Jack and I sat perfectly still with the radar humming,” the word “radar” is unusual; most people would expect a word like “engine” or “radio.” If the AI consistently guessed such unusual, high-surprisal words correctly, the researchers took that as evidence it had memorized the original text during training.
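The intuition behind this kind of probe can be sketched with a toy surprisal calculation. This is a minimal illustration, not the study’s actual code, and the word probabilities below are made up for demonstration:

```python
import math

def surprisal(prob: float) -> float:
    """Surprisal in bits: -log2(p). Rarer words carry more surprisal."""
    return -math.log2(prob)

# Hypothetical model probabilities for the word filling the blank in
# "Jack and I sat perfectly still with the ___ humming."
candidate_probs = {
    "engine": 0.40,   # common, expected completion
    "radio": 0.30,    # also plausible
    "radar": 0.02,    # unusual -- a "high-surprisal" word
}

for word, p in candidate_probs.items():
    print(f"{word}: {surprisal(p):.2f} bits")

# The rarest word has the highest surprisal. A model that still predicts
# it correctly is suspected of having memorized the source passage,
# rather than merely guessing a generic completion.
most_surprising = max(candidate_probs, key=lambda w: surprisal(candidate_probs[w]))
print(most_surprising)
```

The point of choosing high-surprisal words is that a model with no memory of the passage would tend to pick the common completions, so a correct guess on the rare word is informative.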
The analysis found that GPT-4 had memorized portions of popular fiction and New York Times articles, indicating that copyrighted material made its way into OpenAI’s training data.

Why Are Authors and Publishers Upset?
Writers, programmers, and companies like The New York Times are suing OpenAI. These groups say OpenAI used their books, code, and articles to build its AI without asking or paying. OpenAI argues this is allowed under “fair use” laws, which let people reuse small parts of work for education or research.
However, the lawsuits claim fair use does not apply here because OpenAI is making money from its AI tools. Authors worry that if AI can copy their work perfectly, fewer people will buy books or news subscriptions.
Abhilasha Ravichander, a researcher who worked on the study, said, “To trust AI, we need to know how it was trained. Right now, companies like OpenAI are not sharing enough details.”
What OpenAI Says About the Study
OpenAI has not directly commented on the study. However, the company has said it follows fair use rules and offers ways for creators to opt out of having their work used for training. For example, website owners can block OpenAI’s bots from scanning their content.
Still, many creators say these tools are hard to use or do not work well. They want clearer rules and compensation for their work.
AI’s Big Problem With Copyrighted Work
This is not the first time AI companies have been accused of using copyrighted material. Earlier this year, another study found that OpenAI’s GPT-4o model recognized content from paid programming books. The AI could answer questions about these books even though they were behind paywalls, suggesting it had been trained on unauthorized copies.
Critics argue that when AI companies keep using creative work without paying for it, they undermine the incentive to create. If writers, artists, and programmers start pulling their work offline to protect it, the amount of creative content on the Internet could shrink.

What Happens Next?
Governments around the world are debating new laws on AI training. The European Union’s upcoming AI Act, for example, will require companies to publish summaries of the data used to train their AI models.
In the United States, courts are now weighing lawsuits that could decide whether training AI on copyrighted work is legal. Some AI experts propose a licensing system that would let AI companies pay for permission to use books, articles, and music. Others warn that strict AI regulations could slow development and allow China to pull ahead.
For now, the debate continues, with the rest of society caught between authors and AI companies fighting over who owns the data. As one programmer suing OpenAI put it, “AI exists to assist humanity rather than usurp their work.”