Artificial intelligence (AI) has made massive strides in recent years, with systems like ChatGPT demonstrating an impressive ability to generate human-like text on demand. However, this rapid progress has not come without controversy, particularly regarding copyright protections. A recent high-profile lawsuit filed against tech giants Microsoft and OpenAI alleges the misuse of copyrighted material in developing state-of-the-art AI systems.
The Lawsuit and Who’s Behind It
On November 21, 2023, author Julian Sancton filed a blockbuster copyright infringement lawsuit in a Manhattan federal court accusing Microsoft and OpenAI of illicitly copying works without consent to train their lucrative AI models. Sancton, known for the immersive book “Madhouse at the End of the Earth” documenting a doomed 19th-century polar expedition, filed the class-action suit on behalf of himself and thousands of other nonfiction authors.
The suit makes explosive claims that while Microsoft and OpenAI have built multibillion-dollar businesses and soaring valuations on advanced AI, they have done so in part by unlawfully reproducing millions of copyrighted books, news articles, and online content without compensation or permission. This allegedly includes scraping full texts and passages to feed training datasets for systems like ChatGPT and Microsoft’s Copilot coding tool, which then regurgitate this content in response to user prompts.
The Deep Pockets Behind the Suit
Powering the lawsuit is high-profile law firm Susman Godfrey and its partner Justin Nelson, known for winning massive judgments in intellectual property cases. This suggests authors mean serious business in taking on the deep pockets of Big Tech over AI copyright matters.
“This lawsuit seeks to hold OpenAI and Microsoft accountable for their refusal to pay nonfiction authors,” Nelson said, “and to prevent the companies from infringing on works in the future.”
Microsoft and OpenAI’s Slippery Defense
In response to similar allegations of copyright infringement in recent months, Microsoft and OpenAI have maintained questionable defenses. They argue that AI systems don’t directly copy texts, but rather derive statistical representations and patterns from ingesting vast volumes of data. Thus, they claim, AI is fundamentally no different from technologies like cameras and microphones that recreate aspects of the perceptible world.
However, critics counter that the intensive training process still requires making verbatim reproductions of copyrighted content. The companies also insist responsibility lies primarily with end users, with Microsoft boldly stating “it merely provides tools for users to apply responsibly and legally.”
Yet the lawsuit rejects this defense, contending Microsoft and OpenAI are still liable for actively enabling infringement on their platforms, just as tech companies must address other illegal conduct enabled by their technologies and users.
Voracious Data Harvesting Under Scrutiny
At the crux of the issue is AI’s nearly insatiable appetite for data. Cutting-edge language models like ChatGPT are trained on enormous datasets spanning hundreds of billions to trillions of words scraped from published books, news articles, academic papers, blogs, websites, and more. It’s these mammoth troves of text that provide the essential raw material that lets AI generate coherent passages, answer questions knowledgeably, and even develop coding solutions.
For example, OpenAI’s own paper on GPT-3, a predecessor of the models behind ChatGPT, describes training on roughly 570 gigabytes of filtered web text from Common Crawl alongside two book corpora and Wikipedia, a mix that likely includes a very large number of copyrighted works. (The Pile, a separate 825-gigabyte dataset often cited in this debate, was assembled by the research group EleutherAI, not OpenAI.) And Copilot draws heavily on code posted in public software repositories. Yet the exact sources ingested by commercial AI remain undisclosed, leaving authors, journalists, and coders unsure if their creative output secretly helped develop AI now capitalized on by Big Tech giants like Microsoft.
Protecting Creators in the Age of Automation
Beyond immense monetary damages for past infringement, the lawsuit seeks court injunctions blocking unauthorized use of copyrighted writings in commercial AI going forward. But more broadly, the simmering tensions reflect wider debates on protecting the rights and interests of creators as automation accelerates through artificial intelligence.
“Spend years writing then get AI’d with no compensation,” tweeted journalist Kara Swisher in response to the lawsuit, crystallizing these concerns.
Some now propose legally requiring compensation for authors and coders whose work gets extracted into lucrative AI training datasets and models. Others call for mandatory attribution systems that automatically credit any text or code from copyrighted sources that gets regurgitated by AI. Rules around fair use and other copyright exemptions may also need rethinking considering AI can instantly generate human-quality outputs drawing on copyrighted works.
For now, though, the high-stakes lawsuits continue piling up as authors, journalists, and other creators try to protect their intellectual property rights against the ingenious but legally questionable copyright maneuvers of Big Tech companies aggressively commercializing AI. The unfolding battle raises crucial unanswered questions about who pays the costs of automation and who stands to profit most as creative works become mere fuel for the unrelenting machine learning advances of artificial intelligence.