Editor’s note: Tom Snyder, executive director of rapidly growing Raleigh-based RIoT and a thought leader in the emerging Internet of Things, is the newest columnist to join WRAL TechWire’s list of top-drawer contributors. “Datafication Nation” premieres today. His columns are part of WRAL TechWire’s Startup Monday package.


RALEIGH – John Grisham, David Baldacci, George R.R. Martin and numerous other authors made headlines last week, suing OpenAI for training its large language model with their copyrighted material. Similar lawsuits exist against Google, Meta and other big tech companies that have ingested massive amounts of copyrighted material to train AI.

Copyright is established in US law to allow content creators to prevent others from profiting off protected work. There are “fair use” carve-outs that allow re-publishing excerpts of copyrighted content with proper citation for educational or journalistic purposes. Parody of copyrighted work, as seen regularly on Saturday Night Live, is also considered fair use.

Direct derivative interpretations, like creating a Broadway play version of an existing copyrighted novel, require permission from the copyright owner. Fan fiction is considered derivative work, but many authors directly encourage it, or don’t pursue litigation, since litigation can be complex and expensive. In short, copyright infringement of derivative works isn’t uniformly enforced.

ChatGPT and other AI tools have been trained with massive amounts of data, much of which is scraped from internet sites. This has opened a need for fresh interpretation of copyright law.

  • Is scraping data from websites by for-profit companies directly a copyright violation?
  • AI-generated content has no human author, so can an AI be held liable for infringement?
  • Should LLM-generated content be able to be copyrighted?
  • Is the use of copyrighted material in LLM training “fair use” as defined by the law?
  • If a user uploads copyrighted content into a ChatGPT prompt, is that infringement?
  • What is the fair value to compensate content creators for the use of their copyrighted data in training?

Some companies are trying new approaches to get ahead of these lawsuits. I met with Shutterstock at a recent conference and asked how they are handling image copyright. Historically Shutterstock acted as a marketplace between content creators and consumers. With copyrighted material to train from, Shutterstock can now become the creator, potentially displacing the creator half of the marketplace entirely.

Shutterstock’s strategy is to pay a royalty to any creator whose image was used to train the model behind an AI-generated derivative work. This seems like a good approach, until you do the math. Millions of images are used in the training sets, so the royalties to each content creator are tiny fractions of pennies. One analysis estimated ~$0.007 per image. It is difficult to see a future where the photographer side of that market continues to thrive.
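To see why the math works out so poorly for creators, here is a back-of-envelope sketch. The pool size and image counts below are illustrative assumptions, not Shutterstock’s actual figures; they are chosen only to show how a large training set dilutes any royalty pool down to fractions of a penny per image.

```python
# Back-of-envelope royalty dilution (all figures are hypothetical).
royalty_pool_usd = 7_000_000            # assumed pool set aside for contributors
images_in_training_set = 1_000_000_000  # assumed number of licensed images

per_image_royalty = royalty_pool_usd / images_in_training_set
print(f"${per_image_royalty:.4f} per image")  # → $0.0070 per image

# Even a prolific photographer with 10,000 images in the set earns little:
photographer_images = 10_000
payout = photographer_images * per_image_royalty
print(f"${payout:.2f} for {photographer_images:,} images")  # → $70.00 for 10,000 images
```

Scaling the pool up or the image count down changes the exact figure, but any pool divided across hundreds of millions of images lands in the sub-penny range the analysis above describes.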

It will be interesting to see how these lawsuits play out, and how copyright interpretation evolves. There have been a few wins for content creators, such as the takedown of Books3, a site that contained troves of pirated books from authors like Stephen King. But the copyrighted material is already so deeply embedded into the LLM algorithms that rewinding history may be impossible at this point.