How is Meta Getting Its Hands on Advance Digital Galleys to Train Its AI?
Maris Kreizman: One of the Richest Companies in the World is Stealing From the Rest of Us
On Thursday, March 20, all of the writers I know were in a bit of a frenzy. That morning Alex Reisner at the Atlantic had published a piece about Llama 3, Meta’s AI model, and the astonishing number of pirated books on which it had been trained. Meta’s leadership, against the advice of their lawyers, had used LibGen, a pirate file-sharing site supposedly intended to make academic papers more accessible worldwide. Along with Reisner’s article came a handy search bar where you could type your name to see if Meta had used any of your writing to train its generative language models.
Anyone who has ever published a book, or even an academic paper, quickly entered their name in the search, and just about every traditionally published author I know came away 99.9 percent pissed off and maybe .01 percent validated (it was an honor to even be included, etc., etc.).
This is not the first time we writers have anxiously typed our names into a search bar. In 2023 Reisner published a piece that placed the number of books that were used to train Meta’s AI at 183,000. That time I felt that .01 percent of despair when I typed my name in to zero results, but I also felt a little optimistic that authors such as Sarah Silverman (comedians can be authors; The Bedwetter is very fun!), Paul Tremblay, and Michael Chabon, whose work had in fact been lifted, had filed separate copyright lawsuits that were then consolidated into one. Kadrey v. Meta is a class action suit in Northern California that is still ongoing.
Last week the Authors Guild assured writers that if our books were used by Meta at any point to train their AI (Reisner’s latest reporting puts the number of books at 7.5 million), we’re automatically included in the Kadrey v. Meta case. So I guess we… wait?
But here’s the thing. When I did my search, I found my previous book, and in the grand scheme of things that was shrug-worthy. It was published in 2015 and sold approximately 100 copies and is now out of print. But my upcoming essay collection won’t be published until July 1, and yet somehow Meta has already accessed it to train its AI. Advance copies of digital galleys are available legitimately for the most part only on NetGalley and Edelweiss, and both of those services have strict terms and conditions about what users can do with unpublished work (not much!). How in the hell did LibGen, and therefore Meta (and perhaps also OpenAI) get their hands on not yet published work?
I’ve been trying to make sense of why this kind of theft feels different, more invasive. I haven’t even received any pre-publication reviews yet, but my work already belongs to Meta. This is where I pull out the old nugget: the Authors Guild reported the results of a survey in 2022 revealing that the median income for authors was below the poverty level. It’s me! And many of my peers! It made me think of how musicians are also under attack, how they can no longer make a living from their art.
The last time I used a file-sharing site was in the days of searching for files on Napster in the very early aughts. I remember how thrilling it was to find music, but then how terrifying it quickly became when individual users started to be sued by record labels for downloading the latest 98 Degrees album or whatever. Ultimately Napster was shut down, but as Liz Pelly notes in her new book Mood Machine, the anti-pirating frenzy within the music industry paved the way for predatory streaming sites like Spotify to emerge as alternatives to piracy. The streaming sites have managed to devalue music and the artists who make it, all while enriching large corporations and making discovery more difficult for individual users. Don’t let this happen again.
I love the idea of file-sharing as a tool for making writing more accessible to those who can’t afford to buy it, especially in an age when public libraries are facing major existential threats from the Trump Administration. The idea that LibGen has digitized academic papers for the use of individuals who couldn’t otherwise get to them sounds noble as hell. So why does LibGen also have such an enormous book catalog, including access to swaths of not yet published works?
File-sharing as a tool to enrich the already obscenely rich and powerful (Meta’s valuation is currently $1.56 trillion, which seems like it would be more than enough to pay licensing fees) feels like the ultimate violation of copyright and artists’ voices and the power of the written word in general. The Authors Guild has some guidance for what to do if your work was in LibGen’s data set, but it’s difficult not to feel existential despair and a great deal of rage while we wait to see how this all plays out. I fear that once again the work of individual artists is being used and denigrated in order to benefit a class of people who don’t care about the art and fear no consequences.