Alex Reisner on Covering Books3 and Fighting Piracy
In Conversation with Whitney Terrell and V.V. Ganeshananthan on Fiction/Non/Fiction
Writer, programmer, and tech consultant Alex Reisner joins co-hosts Whitney Terrell and V.V. Ganeshananthan to talk about his recent Atlantic articles on Books3, a massive data set that includes hundreds of thousands of pirated e-books and that Meta and other companies have used to train generative AI. Reisner explains how he extracted author names and book titles from long strings of text in Books3 to create a searchable database, and why not finding yourself in the database doesn’t mean your work is safe. He also reflects on the dangers of metaphorical language in discussing AI, what he’s heard from legal experts, what publishers are and aren’t doing, and how piracy has shifted from benefiting individuals to helping corporations profit. Reisner reads from his groundbreaking Atlantic coverage.
Check out video excerpts from our interviews at Lit Hub’s Virtual Book Channel, Fiction/Non/Fiction’s YouTube Channel, and our website. This episode of the podcast was produced by Anne Kniggendorf and Todd Loughran.
From the episode:
V.V. Ganeshananthan: So, Alex, I wanted to go back to something you were talking about earlier—“substantial similarity.” I remember reading in one of your pieces that some of the lawyers were arguing that AI is creating works not substantially similar to our works. So then, that means—to Whitney’s point about hypocrisy—that there’s this two-faced thing going on where they’re saying to the court, “When we ask our AI to write a book in the style of Alice Munro, it’s not actually doing that.” But then the use is being marketed as, “We can get this AI to create a text that is Alice Munro-like, in a way that you won’t be able to tell the difference.” Am I understanding that correctly? What do you think about this “substantial similarity” argument?
Alex Reisner: Yeah, I think these systems can—to some degree—spit out their training text. They can’t do it 100 percent of the time, and the companies have gone to great lengths to prevent it from happening, but in some cases they can.
VVG: How can they say to some people, “We can do a perfect imitation of Sugi,” and then say to these other people, “This definitely does not sound like Sugi. It’s legally defensible.”
AR: I’m not really sure how to answer that. I think it’s really getting into this gray area, because nothing like this has been in the courts before. There’s an undergraduate writing exercise where you try to imitate the voice of a writer you like, but no one goes out and tries to sell that in the way that—very soon—could be happening here. Going back to the “substantial similarity” thing again, I think the legal meaning of that is very technical, and to be honest, I don’t know exactly what it means. I don’t know if the actual words need to be similar, or whether capturing an author’s voice is similar enough. I’m just not totally sure how the judges are going to see that.
VVG: Yeah, and I’m sure there’s going to be all sorts of variability—are they going to be mapping syntactical patterns, or vocabulary? All of which can probably be mathematically represented, as you were talking about earlier. So to ask the question that all of our listeners would probably like us to ask—if you had written a book that was in one of these data sets, what would you be doing right now to protect your own work?
The Authors Guild put out this piece and gave us some advice. Some of that advice is obvious, like sending letters to the company, or donating money to the Guild to support the lawsuit. Then some is less obvious, like setting up Google alerts for your book, sending takedown notices when you find unauthorized copies, or including the “no AI” training statement they suggest you put on your copyright page. Or… learning how to edit a robots.txt file so you can restrict OpenAI’s crawler, GPTBot. I just barely understood the last sentence I said. What do you think about this advice? Should I be learning how to edit a robots.txt file?
AR: Robots.txt is actually important and may become a key part of this. It’s pretty technical, but the quick explanation is that it’s a file that sits at the root of every website and describes what robots can and can’t view on your site, and to some extent, how they can use what they view. So a lot of people are now using robots.txt to block GPTBot, which is the robot OpenAI uses when it scrapes content from the web for ChatGPT’s training. You can do that, but again, it’s not going to help with books—it only covers material you put on your own website, because that’s the only website where you can control robots.txt. Even suggestions that seem like they should work—like a “no AI” training clause in the copyright notice in the front of your book—my understanding is that they’re not really going to work.
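[For readers curious what this looks like in practice: a minimal sketch of the robots.txt entry Reisner is describing, using the GPTBot user-agent string OpenAI publishes. The file lives at the root of your site (e.g., example.com/robots.txt), and compliance is voluntary on the crawler’s part:]

```
# Block OpenAI's web crawler from the entire site
User-agent: GPTBot
Disallow: /

# All other crawlers remain unaffected
User-agent: *
Allow: /
```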
As an author, you have a very limited ability to specify how your work can be used. For example, you could put in a copyright notice that no one can read this book on the Sabbath, but in court, a judge is gonna say you can’t enforce that—people who buy your book can read it whenever they want. In the same way, if a judge decides that training AI on copyrighted material is fair use, they’re going to say that no author can specify that a company can’t do that. So it’s really tricky on an individual author level.
I think what seems really important to me right now is staying on top of what the publishers are doing. As I said, they seem to be embracing generative AI, and they’re staying awfully quiet as all these authors are filing lawsuits. And I don’t know if the Authors Guild is planning some kind of interaction with them.
The Writers Guild of America just achieved something with the studios in Hollywood that could be helpful. The Authors Guild is a very different kind of organization. The whole labor situation is very different. But I don’t have any great advice other than to try to keep an eye on the publishers and maybe encourage them to keep AI out of the book acquisition and editing process.
Whitney Terrell: The Screenwriters Guild is much more powerful and has a much longer history of striking and negotiating with the studios. Authors like us, we’re more like professional golfers. We’re independent contractors. I don’t think people think of themselves as being in a union or guild in that way. So it may be the time for authors like us to learn how to do that because it’s going to take collective action to protect some of this stuff.
VVG: Alex, you were referring earlier to the guy who made Books3, Shawn Presser, who told you that he did it, in part, to have a data set available to people other than rich corporations who are developing AI. In other words, to level the playing field by making OpenAI-grade training data widely available. And as you wrote, piracy used to primarily benefit individuals. I have been thinking about this recently because I learned that my work appears in libraries like Z-Library. I was talking to someone else about it and they were like, “This is incredibly important for accessibility in the Global South. You’re writing about Sri Lanka and people there who want to be able to access your work might be accessing it this way.”
My initial reaction was to be like, “There are unauthorized copies out there, I feel violated.” And then she was talking about the grief that people experience when Z-Library gets taken down, pops back up again, gets taken down, pops back up again. The sadness people experience when they lose access to some of these things. I was moved by that story and thought about my friends who are copyleft activists, and have talked about this kind of accessibility. But now, this kind of piracy is benefiting corporations. So is there a way to thread that needle to stop corporations while adopting anything close to a copyleft perspective?
AR: Yeah, it’s a really good question. It’s pretty complicated. You’re talking basically about how we manage access to this stuff for different people. This situation is—as I see it—a consequence of just digitizing everything, which we’ve been doing for the past 25 years. There’s a sense in which digitization cheapens books. It cheapens writing. It just turns it into data and becomes kind of ephemeral. It spreads really easily across the internet. Since the advent of social media, we’ve seen how companies can scan and mine texts for demographic information, like our habits, our brand preferences, and our writing style, which they can mimic. Things being digital is extremely convenient, but this is part of the cost of it.
Transcribed by Otter.ai. Condensed and edited by Mikayla Vo.
Alex Reisner at The Atlantic • “What I Found in a Database Meta Uses to Train Generative AI” • “These 183,000 Books Are Fueling the Biggest Fight in Publishing and Tech” • “Revealed: The Authors Whose Pirated Books Are Powering Generative AI”
Open Letter to Generative AI Leaders (The Authors Guild) • Practical Tips for Authors to Protect Their Works from AI Use (The Authors Guild) • “Some writers are furious that AI consumed their books. Others? Less so,” by Sophia Nguyen, The Washington Post • Fiction/Non/Fiction, Season 6, Episode 17: “Chatbot vs. Writer: Vauhini Vara on the Perils and Possibilities of Artificial Intelligence” • “My Books Were Used to Train AI,” by Stephen King, The Atlantic • “Murdered by My Replica?” by Margaret Atwood, The Atlantic • “My Books Were Used to Train Meta’s Generative AI. Good.” by Ian Bogost, The Atlantic • Alice Munro • Rebecca Solnit • Meghan O’Rourke • George Saunders • Ta-Nehisi Coates • Martin Amis • “Sarah Silverman is suing OpenAI and Meta for copyright infringement,” by Wes Davis, The Verge