Piracy vs Progress: AI’s great copyright heist
- Adam Spencer
- Apr 2
- 3 min read
From LibGen to Llama: The meteoric rise of generative artificial intelligence has ignited a fierce legal showdown over the data used to train these systems. Recently released court documents shine a light on how tech giants download pirated books to build AI models, while creators go unpaid and ethical questions mount.

Journalist Alex Reisner, in a gripping Atlantic report, exposed how Meta employees, with a nod from “MZ,” turned to the controversial online repository Library Genesis (LibGen) after deeming paying for licensed content “unreasonably expensive” and “incredibly slow”.
This moral sidestep has sparked lawsuits, including one from comedian Sarah Silverman, whose memoir The Bedwetter was among the pirated works.
This echoes the thorny questions raised in another recent Nerd News about AI and creative writing: how do we fuel generative AI’s potential without eroding the human foundations it relies on? What is a fair price to pay content creators?
Feeding the beast
AI thrives on massive training datasets - they are the lifeblood of its learning. Meta’s team, recognising this, decided that “books are actually more important than web data” and stressed the need to “get books ASAP”.
But this insatiable appetite for data clashes head-on with copyright laws, which were never designed for the era of mass-scale machine learning we’re now navigating.
LibGen - AI's dirty secret
Enter LibGen, the titan of the “shadow libraries” fuelling AI’s growth. Launched in 2008 by Russian scientists, it boasts 7.5 million books and 81 million research papers - everything from Silverman and Dickens to The Lancet and an Italian Bhagavad-Gita. All copied and uploaded to the web despite copyright protections.
Reisner reveals how Meta and OpenAI tapped into LibGen’s trove, with many other LLM developers likely following suit using LibGen or similar unauthorised sources.
Business author David Meerman Scott sums up the frustration felt by creators: “I’m totally cool with Generative AI tools training on all content that I put out there for free! Have at it. However, Meta chose to rip off my paid content without my permission (or my publishers’ permission) and use it in ways I didn’t authorise. Not cool, Zuck, not cool.”
The ethics of exploration
Meta argues “fair use", claiming their large language models “transform” copyrighted materials into new work, thus avoiding legal liability. But internal memos suggest executives knew the “medium-high legal risk” and strategised to hide their tracks.
One internal fear was that paying for even a single copyrighted book could neuter the fair use argument. From a company raking in $164 billion in revenue and $62 billion in profits in 2024, such calculated risks feel to many less like necessity and more like audacity.
Show me the money?
What would fair reimbursement even look like? Literary legend Ian McEwan advocates royalties based on how often a written work is used for training and tied to the AI platform's profits. The Authors Guild pushes simpler licensing fees of $1,000 to $3,000 per work.
These would ensure creators share at least somewhat in AI’s gains, but scaling payments fairly for more popular works and verifying which works actually ended up inside opaque training models would be significant challenges.
Charting AI's ethical escape route
Recent developments offer some hope. Companies like China’s DeepSeek suggest models trained on smaller, high-quality datasets can largely match those built on more dubiously sourced data. Harvard’s Institutional Data Initiative recently released nearly a million public-domain books for AI training - from Shakespearean classics to Czech mathematics textbooks - making ethical sourcing a practical option.
In summary, if tech bros want their AI models to evolve ethically alongside human creativity, they surely cannot keep trawling illegal stockpiles of copyrighted work. Only time will tell how big an ‘if’ this is.
That’s all from me for now. If you'd like more geeky fun, please check out my other newsletters below, or connect with me on LinkedIn and/or X.
Yours in nerdiness,
Adam