It sure looks like Meta stole a lot of books to build its AI.
It’s a grim week for Meta. The company formerly known as Facebook, and before that Facemash, “designed to evaluate the attractiveness of female Harvard students,” now encompasses Facebook, Instagram, Threads, WhatsApp, and the metaverse, the failed vision for a remote workplace, fun-zone, and Zucker-verse where legs are always just around the corner.
CEO and founder Mark Zuckerberg announced that slurs are okay on their platforms, added a pro-Trump UFC boss to their board, and appeared across the aggrieved-weirdo media world to make some convoluted case that we need more masculine energy in business, more resentment overall, and more fealty to Don Trump. Zuckerberg has also recently switched up his personal style so that he now looks like he’s perpetually in a sitcom flashback where an older actor is unconvincingly costumed to look like their younger self.
And in the Northern District of California, Wired reports, recently unredacted court documents reveal that Meta used a database of pirated books to train its AI systems. These documents were unsealed as part of a copyright lawsuit, one of the earliest of many similar cases, called Kadrey et al. v. Meta Platforms. The plaintiffs in this case are a number of writers and performers, including Richard Kadrey, Christopher Golden, Junot Diaz, Laura Lippman, Sarah Silverman, Ta-Nehisi Coates, and—jump scare!—Mike Huckabee.
The new documents quote Meta employees frankly discussing their use of stolen material from a notorious piracy site:
…an internal quote from a Meta employee, included in the documents, in which they speculated, “If there is media coverage suggesting we have used a dataset we know to be pirated, such as LibGen, this may undermine our negotiating position with regulators on these issues.”…
…These newly unredacted documents reveal exchanges between Meta employees unearthed in the discovery process, like a Meta engineer telling a colleague that they hesitated to access LibGen data because “torrenting from a [meta-owned] corporate laptop doesn’t feel right 😃”. They also allege that internal discussions about using LibGen data were escalated to Meta CEO Mark Zuckerberg (referred to as “MZ” in the memo handed over during discovery) and that Meta’s AI team was “approved to use” the pirated material.
Meta has claimed that they used publicly available material that was legally accessible under fair use doctrine, but that doesn’t pass the smell test for me: just because something is public on the internet doesn’t make it legal to use.
The plaintiffs are arguing that they should be allowed to expand their case to incorporate these new findings:
“Meta, through a corporate representative who testified on November 20, 2024, has now admitted under oath to uploading (aka ‘seeding’) pirated files containing Plaintiffs’ works on ‘torrent’ sites,” the motion alleges. (Seeding is when a user keeps sharing a torrented file with other peers after their own download finishes.)
“This torrenting activity turned Meta itself into a distributor of the very same pirated copyrighted material that it was also downloading for use in its commercially available AI models.”
Legally, Meta and their lawyers may find a way to finagle the law and get around this. But in plain terms, it doesn’t seem defensible for a major company with tons of lawyers, money, and talent to knowingly use stolen work to build something that they then turn around and sell.
I’m not naive enough to think that this lawsuit, or any of the many others currently winding their way through the courts, will end in this kind of software leaving the market—in America, you can’t unring a bell that’s been valued in the billions. But I do hope that the writers and artists whose work was stolen are compensated.
In spite of all this, tech-optimists continue to push AI in more places, and people in power continue to trumpet it as the future of everything. In the case of publishing, for example, the excellent xoxopublishinggg Instagram account has been posting anonymous responses about publishing workers’ experiences with AI in the workplace—it seems like a lot of publishers are at least curious about these tools in ways that don’t bode well for an AI-less future.
If you’re considering using AI, or are feeling pressure at work to do so, you can add “built on piracy” to the list of concerns about this tech, alongside its environmental impact, its human toll on underpaid and marginalized workers, and the simple fact that it is incapable of making anything good.