Why AI Can't Steal Books: Understanding the Legal Landscape

Discover the legal landscape surrounding AI's use of copyrighted books. Learn why AI training isn't legally considered stealing books, and explore alternative approaches to address authors' concerns. Insights from a pro-AI author's perspective.

25 April 2025


Discover why you can relax about AI "stealing" books. This blog post explores the legal realities and practical considerations around AI's use of copyrighted material for training purposes, providing a balanced perspective that will put your mind at ease.

Why Relying on Lawsuits Won't Work

While it may feel frustrating to see copyrighted material used in AI training datasets without permission, courts have so far largely treated this practice as lawful. Multiple cases have been dismissed or decided in favor of the AI companies, with courts finding that using copyrighted material for training, without displaying or distributing significant portions to end users, does not constitute copyright infringement.

The legal consensus seems to be that current copyright law was not designed to address this kind of AI-driven use of copyrighted material. Lawsuits have proven largely ineffective, with few if any clear wins for plaintiffs so far. Courts have generally held that as long as an AI model does not reproduce substantial portions of a copyrighted work, there is no violation of copyright.

Given this legal landscape, relying on lawsuits is unlikely to be a productive approach. Instead, the best course of action is to focus efforts on advocating for changes to copyright laws that would provide more protection and compensation for authors whose works are used in AI training. Reaching out to government representatives to push for legislative updates is a more promising avenue than pursuing costly and ultimately unsuccessful legal battles.

It's important to understand that even if new laws are enacted, they would not be able to retroactively undo the use of copyrighted material in existing AI models. The focus should be on shaping future policies and practices, rather than trying to reverse what has already occurred.

How AI Models Actually Learn and Improve

There are a few key reasons why I don't worry too much about the use of copyrighted material in training AI models:

  1. How AI Learns: AI learns in a way loosely comparable to how humans learn, by building a mathematical model from the input data. Once the model is trained, the original training data can be discarded, because what the model keeps is a set of learned patterns and relationships rather than a copy of the text, much like how we remember the contents of a book we've read.

  2. Synthetic Data: There are now models specifically trained to generate synthetic data, reducing the need for using copyrighted material in training. This synthetic data can be used to further improve AI models without relying on copyrighted sources.

  3. Compensation Agreements: Some AI companies have already started making deals with content creators and publishers to compensate them for using their data in training. This trend is likely to continue as a way for AI companies to cover their legal bases.

  4. Ethical Training Data: There are large databases of public domain and Creative Commons content that can be used to train AI models in an ethical manner, without relying on copyrighted material.

  5. Iterative Improvement: AI models are improving through iterative optimization and fine-tuning, rather than relying on constantly expanding training datasets. This means future models may require less copyrighted material to achieve significant performance gains.
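The first point above, that a trained model keeps learned parameters rather than the data itself, can be illustrated with a toy sketch. This is not how a large language model is actually trained; it assumes a simple straight-line fit, and the names `fit_line` and `predict` are made up for this illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy "training data" that follows the pattern y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

# "Training" reduces the data to two numbers: the model's parameters.
slope, intercept = fit_line(xs, ys)

# The training data can now be thrown away.
del xs, ys


def predict(x):
    """Use only the learned parameters, not the original data."""
    return slope * x + intercept


print(predict(10.0))  # 21.0
```

After the `del` line, nothing in the program holds the original examples, yet `predict` still works because the pattern has been captured in `slope` and `intercept`.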

While the use of copyrighted material in AI training may feel unfair, the current legal consensus is that it is not considered copyright infringement, as long as the AI does not reproduce or distribute the original content. The best course of action is to work with policymakers to update copyright laws to address this new technology, rather than relying on lawsuits that have so far been unsuccessful.

Ethical Training Data Options

While the use of copyrighted material in AI training datasets has raised concerns, there are ethical alternatives available. Some options include:

  1. Public Domain Content: There are vast databases of content that is in the public domain, free from copyright restrictions. These can be utilized for AI training without legal issues.

  2. Creative Commons Licensed Content: Many creators choose to release their work under Creative Commons licenses, which permit certain kinds of reuse. The terms vary by license (some restrict commercial use or derivative works, for example), so content used within those terms provides an ethical source of training data.

  3. User-Generated Content: Platforms like YouTube allow creators to share their content, which can then be used for AI training where the platform's terms of service permit it. This content is provided willingly by the creators.

  4. Commissioned or Licensed Content: AI companies can work directly with content creators to license or commission data for training purposes, ensuring fair compensation.

  5. Synthetic Data Generation: Techniques exist to generate synthetic data that mimics real-world data, reducing the need for using copyrighted material in training.
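In practice, the first two options come down to checking a license tag before a text goes into a training set. Here is a minimal sketch of that idea, assuming each record carries a `license` field. The field name, the tag values, and the function `select_training_texts` are all hypothetical and chosen only for illustration.

```python
# Hypothetical license tags treated as acceptable in this sketch.
ALLOWED_LICENSES = {"public-domain", "cc0", "cc-by"}


def select_training_texts(records):
    """Keep only records whose license tag is in the allowed set.

    Records with no license tag are excluded, on the assumption that
    unknown status should be treated as "not cleared".
    """
    return [
        r for r in records
        if r.get("license", "").lower() in ALLOWED_LICENSES
    ]


corpus = [
    {"title": "Pride and Prejudice", "license": "public-domain"},
    {"title": "A Recent Bestseller", "license": "all-rights-reserved"},
    {"title": "An Open Essay", "license": "CC-BY"},
    {"title": "Untagged Manuscript"},
]

selected = select_training_texts(corpus)
print([r["title"] for r in selected])
# ['Pride and Prejudice', 'An Open Essay']
```

Excluding untagged works by default is a design choice in this sketch: it errs on the side of leaving out anything whose status is unclear.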

By exploring these ethical alternatives, AI developers can build their models while respecting the rights of content creators. This approach balances innovation with the protection of intellectual property.

The Futility of Trying to Put the Genie Back in the Bottle

The legal consensus seems to be that using copyrighted books to train AI models is currently not considered copyright infringement, as long as the AI does not display or distribute substantial portions of the copyrighted material to end-users. While this may be frustrating for authors who feel their work has been used without permission, the reality is that the genie is already out of the bottle.

Numerous court cases have already ruled in favor of the AI companies, dismissing claims of copyright infringement. Attempting to sue these companies is unlikely to be an effective strategy, as the legal system has so far upheld the legality of this practice. Instead, the best course of action is to focus efforts on advocating for changes to copyright law that would require AI companies to obtain permission and provide compensation to authors.

However, even if new laws are enacted, they would not be able to retroactively undo the training that has already occurred. The AI models developed using these datasets will remain legal, and the companies may simply shift to using synthetic data or content from other sources going forward.

Rather than dwelling on the perceived injustice, authors would be better served by exploring ways to leverage AI technology to enhance their own creative process. Automating certain tasks or using AI-generated content as a starting point could free up time and mental energy to focus on the aspects of writing that they most enjoy. Embracing the potential of AI, rather than resisting it, may be the most productive path forward.

Focusing on Positive Possibilities with AI

While the use of copyrighted material in AI training datasets can be concerning, it's important to focus on the positive possibilities that AI presents. Rather than dwelling on the limitations or perceived injustices, we should explore how to leverage this technology to enhance our creative endeavors.

The key is to shift our mindset from seeing AI as a threat to our work, to viewing it as a tool that can augment and empower our creativity. Instead of worrying about AI "stealing" or "plagiarizing" our content, we should consider how we can use AI to streamline the less enjoyable aspects of the writing process, freeing up time and mental space to focus on the parts we truly enjoy.

As the mathematician and philosopher Alfred North Whitehead said, "Civilization advances by extending the number of important operations which we can perform without thinking about them." By automating certain tasks, we can devote more energy to the higher-level, more fulfilling aspects of our craft. This mindset shift is crucial: we must stop thinking about how AI can replace us, and start exploring how it can help us achieve new heights.

Furthermore, rapid advances in AI technology, such as the iterative improvements to models like GPT-4, indicate that the quality of these systems can improve without relying on ever-expanding training datasets. The future of AI may therefore involve more ethical and sustainable practices, where synthetic data and voluntary licensing agreements become the norm.

Rather than wasting time and energy fighting a losing legal battle, we would be better served by channeling our efforts into more productive endeavors. Reaching out to government representatives to advocate for updated copyright laws is one avenue, but we should also explore how we can personally benefit from integrating AI into our creative workflows.

The key is to approach this challenge with an open and innovative mindset. By focusing on the positive possibilities, we can unlock new creative frontiers and truly harness the power of AI to advance our craft and our civilization.

FAQ