by Traverse Legal, reviewed by Enrico Schaefer - February 24, 2026 - AI Tips for Lawyers, Artificial Intelligence
AI copyright memorization now drives real product risk, not academic debate. A new paper, Extracting books from production language models, reports a method for pulling long blocks of in-copyright book text from several production-grade language models.
Model output matters because it sits in front of customers. If a model reproduces protected text, plaintiffs can frame the product as a copying machine, not a learning tool.
The researchers tested whether production language models could reproduce copyrighted books in near verbatim form. They used a two-phase approach: they started with a short prefix from a book, then iteratively prompted the model to continue, extending the extraction each time a continuation succeeded. The paper describes a metric called near verbatim recall to measure how much extracted text overlaps with the original. See the methodology section in the paper.
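For readers who want intuition for the mechanics, the loop can be sketched roughly as follows. This is an illustration, not the authors' code: the `continue_text` API, the overlap scoring via `difflib`, and the threshold value are all assumptions chosen for clarity, and the paper's actual near verbatim recall metric may differ.

```python
from difflib import SequenceMatcher

def near_verbatim_recall(extracted: str, original: str) -> float:
    """Illustrative overlap score: fraction of the original span covered
    by matching blocks. A stand-in, not the paper's exact metric."""
    matcher = SequenceMatcher(None, extracted, original)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(original), 1)

def extract_book_text(model, book_text, prefix_len=200, step=400, threshold=0.75):
    """Hypothetical two-phase loop: seed with a short prefix from the book,
    then keep extending while the model's continuation stays close to the
    source text, stopping when extraction stalls."""
    recovered = book_text[:prefix_len]              # phase 1: seed prefix
    while len(recovered) < len(book_text):
        continuation = model.continue_text(recovered)   # assumed model API
        target = book_text[len(recovered):len(recovered) + step]
        if near_verbatim_recall(continuation[:step], target) < threshold:
            break                                   # continuation diverged
        recovered += target                         # phase 2: extend, repeat
    return recovered
```

The key point the sketch captures: extraction is incremental and self-reinforcing. Each successful continuation becomes the prompt for the next one, so a single short prefix can, in the reported scenarios, unwind into a long passage.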
The paper reports varying extraction results across models and titles. In some scenarios, the researchers report high overlap between model output and the original book text. The point for business readers stays simple. The researchers claim they can extract long passages, sometimes at scale, even from production systems with refusal layers.
Output reproduction changes legal exposure because it shifts the fight from training inputs to product behavior. Training cases revolve around ingestion, transformation, and fair use arguments. Output cases revolve around what the user received and whether it matches the protected expression.
Copyright law grants authors exclusive rights to reproduce and distribute their work. You can read the core exclusive rights in 17 U.S.C. section 106. Fair use can still apply in some contexts, and courts analyze it under 17 U.S.C. section 107. The closer an output tracks a copyrighted passage, the harder it becomes to treat the result as harmless or purely transformative.
AI companies draw a bright line between learning patterns and storing books. That line supports a practical defense theme. The model does not operate like a searchable library with complete copies sitting inside it.
Courts care about copying because copyright law targets copying of protected expression, not general ideas. Courts also care about whether a defendant’s output mirrors protected wording closely enough to qualify as substantial copying. Lawyers fight over how much similarity counts and what parts count as protected expression, but the core principle stays stable. Copying protected text creates exposure.
This is why AI copyright memorization triggers litigation risk. A model can avoid holding a book in a human-readable file and still create serious exposure if it can reproduce long passages on request. Plaintiffs will argue capability and repeatability. Defendants will argue guardrails, rarity, and user conduct. Courts will evaluate the evidence and the legal theory tied to the claim.
People use memorization as a shortcut for one idea. The model can reproduce training text in the same wording, not merely the same meaning.
Plaintiffs focus on reproducible excerpts and repeatability. If the same prompt pattern can pull the same passage again, plaintiffs argue the system can deliver copyrighted expression on demand. AI copyright memorization becomes easier to explain when the output behaves like a repeatable feature instead of a one-off accident.
The Stanford and Yale researchers described an extraction approach built to pull long continuations from a short book prefix, then repeat and extend the continuation when the output stays close to the source text. The paper sits here: Extracting books from production language models.
People also talk about best-of-N prompting in this context. Best-of-N means running many prompt variations, then selecting the strongest result. Repetition increases the chance that the model lands on a continuation matching a known passage.
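A minimal sketch of best-of-N selection, under stated assumptions: the `generate` method, the prompt-variant list, and the scoring function are all hypothetical names, and real testing harnesses are more elaborate.

```python
import random

def best_of_n(model, prompt_variants, target_passage, score_fn, n=16):
    """Illustrative best-of-N: sample n outputs across prompt variants and
    keep the one that scores closest to a known passage. score_fn returns
    a higher value for closer matches; all names here are assumptions."""
    best_output, best_score = None, float("-inf")
    for _ in range(n):
        prompt = random.choice(prompt_variants)
        output = model.generate(prompt)         # assumed model API
        score = score_fn(output, target_passage)
        if score > best_score:
            best_output, best_score = output, score
    return best_output, best_score
```

The design point for risk analysis: even if any single attempt rarely matches, taking the best of many attempts raises the hit rate, which is why repetition features in plaintiff-side testing.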
Defendants argue ordinary users do not interact with products this way. They frame the result as a stress test or a misuse scenario, not a normal use case.
Plaintiffs still use these tests because they show capability. A court can treat capability as relevant even when the pathway requires persistence. Plaintiffs also use the technique to argue foreseeability. If extraction works through known prompting patterns, plaintiffs argue companies can detect and reduce the behavior.
For companies deploying AI, this point matters more than the philosophy debate. You do not control how an opposing expert will test your system once litigation starts.
This research strengthens a plaintiff narrative that many cases already push. Models can reproduce protected text, not only summarize it. Plaintiffs will use outputs as exhibits because judges and juries understand side-by-side text faster than training pipeline arguments.
Defendants will likely respond in a few predictable ways.
Courts may treat these facts differently depending on the claim and jurisdiction. Cases focused on output behavior will care more about repeatability and how easily a user can reach verbatim passages. Cases focused on training will still turn on fair use arguments, the scope of copying during training, and how the plaintiff frames harm.
AI copyright memorization risk shows up in outputs and in how you respond to incidents. Treat it as governance you can design and enforce.
Start with product controls. Policies do not block verbatim text. Your system has to block it. Cap long-form continuation length, add similarity checks for extended outputs, and trigger refusals when a prompt asks for full chapters or full books. When the system flags a potential match, route the user to a safer path, like summary, analysis, or citation style guidance, rather than continuation.
Next, lock employee rules. Support teams and marketers create risk when they paste third-party text into prompts or ask for large reproductions. Train staff to request summaries, outlines, and issue spotting. Train staff to stop when an output looks like a recognizable excerpt. Require them to capture the prompt and output for review instead of recycling the text into public content.
Logging matters because disputes turn on proof. Keep records of prompts and outputs where your privacy commitments allow it. Keep records of safety filter triggers and overrides. Keep versioned notes for model and policy changes. When someone reports verbatim output, freeze the relevant logs and route the incident to a single internal owner.
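An append-only audit record along these lines is one way to operationalize that logging. The field names and file format below are assumptions, a minimal sketch rather than a recommended schema; real deployments must also honor the privacy commitments mentioned above.

```python
import hashlib
import json
from datetime import datetime, timezone

def log_event(prompt, output, model_version, filter_triggered, path="ai_audit.jsonl"):
    """Illustrative append-only audit record for prompts, outputs, and
    safety-filter triggers. Field names are assumptions, not a standard."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,              # versioned model notes
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "prompt": prompt,
        "output": output,
        "filter_triggered": filter_triggered,        # safety filter events
    }
    with open(path, "a", encoding="utf-8") as f:     # append, never rewrite
        f.write(json.dumps(record) + "\n")
    return record
```

Append-only JSON lines keep each event intact and timestamped, which supports the "freeze the relevant logs" step when an incident is reported.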
If you rely on an AI vendor, push contract terms into operational reality. The contract should cover how the vendor handles verbatim output risk, how fast it responds to incidents, what logs it can provide, and how indemnity and liability limits align with your exposure.
Finally, build a response workflow before a dispute hits. Assign an owner for inbound notices. Preserve logs immediately. Avoid admissions in early communications. Remediate product behavior in parallel while counsel evaluates legal posture.
Creators gain leverage through clear evidence. Courts and opposing counsel respond to repeatable results, not one screenshot.
Capture the full prompt and full output, the date and time, and the exact model and interface used. Then test repeatability. Run the same prompt again and document whether the same passage appears. Small variations can also matter because they show whether the output depends on a single lucky prompt or a predictable pathway.
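The repeatability test described above can be automated in a few lines. This is a hypothetical sketch: the `generate` method is an assumed interface, and a real evidence packet would also record timestamps, model versions, and full transcripts as the paragraph describes.

```python
from collections import Counter

def repeatability_test(model, prompt, trials=5):
    """Illustrative repeatability check: run the same prompt repeatedly and
    count how often each distinct output appears, most frequent first."""
    counts = Counter(model.generate(prompt) for _ in range(trials))
    return counts.most_common()
```

If the same passage dominates the counts, the output looks like a predictable pathway rather than a single lucky prompt, which is the distinction the evidence needs to show.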
Decide on your goal early. Some creators want removal. Some want licensing. Some want a litigation posture. The right path depends on what the outputs show and how the company responds.
If the outputs keep reproducing long, recognizable excerpts, pause before you fire off a demand or post screenshots online. Build a clean evidence packet first, then choose a response path you can defend.
AI copyright memorization means a model can reproduce protected text in the same wording, not only in a summarized or paraphrased form.
Outputs can create copyright exposure when they reproduce protectable expression without permission. Liability depends on the facts, the claims, and defenses such as fair use.
Fair use depends on a multi-factor analysis under US law. Courts weigh purpose, nature of the work, amount used, and market effect. Outputs that reproduce long passages create more risk than outputs that summarize or transform.
Near verbatim copying means the output matches the original wording and structure closely enough that a side-by-side comparison shows substantial overlap.
Prompting methods can matter because they relate to foreseeability, user behavior, and product design. Companies may argue that the use is abnormal. Plaintiffs may argue capability and repeatability.
Businesses can reduce risk by blocking long-form reproduction, limiting data shared in outputs, training staff on safe prompting, logging key events, and tightening vendor controls.
As a founding partner of Traverse Legal, PLC, Enrico Schaefer has more than thirty years of experience as an attorney for both established companies and emerging start-ups. His extensive experience includes navigating technology law matters and complex litigation throughout the United States.
We’re here to field your questions and concerns. If you are a company able to pay a reasonable legal fee each month, please contact us today.
This page has been written, edited, and reviewed by a team of legal writers following our comprehensive editorial guidelines. This page was approved by attorney Enrico Schaefer, who has more than 20 years of legal experience as a practicing Business, IP, and Technology Law litigation attorney.
