On the one hand: I hate it when people misuse their technological powers to commit literal copyright crimes. On the other hand, how much can I complain, really, when I had no Shelf Life topic at the ready and suddenly an ill-conceived publishing-related venture imploded right before everyone’s eyes on social? Is it possible that my sporadic lack of preparedness causes these scandals? Am I responsible for all this?
No; but it happens with surprising frequency and often coincides with days when I am unprepared (which also occur with surprising frequency).
To summarize the current issue: Benji Smith, creator of the cloud-based word processor Shaxpir, was called out with increasing vigor throughout the day Monday for Prosecraft.io, his companion project to Shaxpir. Prosecraft purported to be a tool for literary linguistic analysis, which took the full text of some 27,000 or more books, analyzed that text, and produced summary statistics including total word count, “vividness,” passive voice, and various subsets of adverbs. These statistics were expressed mostly as percentages (eg, a book might be 83% “vivid” and contain 1% adverbs) as well as percentiles (eg, a book might be in the 50th percentile of “vividness” of all the books indexed in the Prosecraft project).
I want to say that first of all I take issue with “vividness,” but that’s not actually where I take my first issue. The first issue (for, I think, everyone) is that the 27,000 books uploaded and analyzed for the Prosecraft project were not all in the public domain. Authors whose work was found within, and who joined everyone else in raising a cry about it, include Maureen Johnson, Celeste Ng, Diana Urban, and T. Kingfisher. All of them confirm that they did not give consent for their books to be included in the project, and neither did their publishers.
The problem starts with the provenance of the text that was being fed to Prosecraft. The text of these books was not provided by a legitimate source (that is, someone with the authority to grant rights and permissions for the title). Smith claims that he made use of web crawlers (basically automated tools that search the web for specific things) to find the texts. However, many of these texts are not freely available online, or on the web at all.
As Diana Urban explained in a tweet on Monday:
The “most vivid page” excerpt from my book was literally the most spoilery moment of the climax, not published publicly, nor scrapable … unless you were scraping book pirating sites? How did you acquire these books, exactly? Will the data be deleted permanently?
The only way to get the full text of many of these books in a digital format is to buy them or to download them illegally from piracy websites. Given that Smith already confirmed he collected the books using web crawlers and not by purchasing them, then by process of elimination….
In the apology blog post that went up Monday evening, after the removal of the Prosecraft.io website, Smith explained that he “believed [he] was honoring the spirit of the Fair Use doctrine.” Now, for my part, I don’t think you have to know anything about fair use to know that this is suspect; if someone says they thought they were honoring the spirit of a rule instead of, you know, following that rule, it feels like they’re already trying to weasel out of something. I think if you believed you had followed the rules of fair use you might say, “Hey, I followed the rules of fair use” and not “I believed I was honoring the spirit of these rules.”
But, yeah. The first thing to consider in determining whether your (or anyone’s) use of copyrighted material might fall under the auspices of fair use is “the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes.” Smith has not explained, anywhere I could find, why or in what way he believed his use of this copyrighted material was fair use. As far as I know, he has not said anything like, “This was fair use because I made Prosecraft exclusively for educational purposes.” I can only assume he has not made a statement like this because counsel has advised him not to. He has, however, stated repeatedly that Prosecraft is a free tool for authors to use to learn and improve their craft by applying the metrics found within to their own writing.
I could not figure out what other argument he might make for fair use; based on what I’ve read, the educational, not-making-money-off-this angle seems to be the main thrust of the fair use defense, although it’s only implied and he hasn’t stated outright that this is why he felt he was in the clear of any copyright infringement.
However: In a July 2018 post on the Shaxpir blog, in which Smith introduces the Prosecraft.io tool, he specifically states that the analysis of works in the Prosecraft library is available to paying users of his word processor Shaxpir 4: Pro.
What Benji Smith says about Prosecraft is that it’s a free tool with which writers may evaluate the work of authors they admire, to learn from them. Free, and educational.
But what he’s actually doing is using the illegally acquired full text of copyrighted works to train a language model (an AI) to create features for software that he sells to users. Commercial. For profit.
Many of the folks whose books had been scraped for the Prosecraft library spoke out, as did a good part of the rest of the online writing community, and Smith removed Prosecraft.io from the web late Monday. The site is still there as far as I can see, but searching doesn’t produce any results. This is great. However, Smith has not said, either on his Twitter account or in his apology blog post, what he will do with:
The data he has amassed from the illegally acquired text, or
The Prosecraft feature of his paid Shaxpir 4: Pro software.
As of right now, real-time use of the Prosecraft library analytics is still being touted as a feature of the paid Shaxpir word processor, so although the Prosecraft.io site is down, the AI trained on that dataset is still available to people who pay money to Smith for it. Which is kind of a big reason why this is not a fair use of the ill-gotten data.
Many Twitter users are still, rightly, calling for clarity on what will happen with the data collected for and by the Prosecraft.io library, and for its deletion. In the apology blog post, Smith writes:
But in the meantime, “AI” became a thing.
And the arrival of AI on the scene has been tainted by early use-cases that allow anyone to create zero-effort impersonations of artists, cutting those creators out of their own creative process.
That’s not something I ever wanted to participate in.
This seems to me like a deliberate misdirection. I might be reading this wrong. But I don’t think anyone is coming after him for training an LLM on 27,000 books, many of them under copyright, out of worry that he’ll create an AI tool to automatically write novels using the data he’s collected. I mean, he could probably, and that would be horrible. But that’s not why people object to what he has done. People are objecting to their copyrighted works being acquired illegally and then used for commercial purposes. To include “Don’t worry, I’m not making a generative AI tool to impersonate authors and I never will!” in the apology feels like a smokescreen to distract from what he actually has done and is still doing, the thing he did not stop doing when he took the Prosecraft site down, the thing that is already explicitly against the fair use doctrine and, therefore, actionable.
Either that or it is pure condescension; “Don’t worry guys, I know AI is scary but that’s just because you understand it poorly.” No we don’t. We understand it just fine.
Amanda Silberling has written an article on this for TechCrunch that covers a few additional scary uses of AI against authors, including a case in which someone used an AI tool to write several books and published them on Amazon using the name of established author Jane Friedman (and Amazon won’t remove them, of course).
At the end of his apology, Smith said he hopes to secure permission to use the stolen books (hmm, good luck with that) and relaunch the project.
At the time of this writing I am still waiting to see what happens when Nora Roberts (’s publisher) finds out her whole catalog got imported into this garbage fire project.