
Artificial Intelligence and Insurance Policy Interpretation: What the Eleventh Circuit Opened, What the Scholars Are Fighting About, and What Missouri Practitioners Need to Know

How AI tools are already reshaping coverage analysis — and why the legal profession’s debate over their reliability matters for every Missouri insurance litigator

Missouri Injury & Insurance Law  |  missouriinjuryandinsurancelaw.com

Introduction

Insurance coverage disputes are, at their core, disputes about language. A claim is paid or denied, a defense is tendered or refused, and an extra-contractual claim rises or falls, all because of what a word or phrase in a policy does or does not mean. For decades, the tools for resolving those disputes have been largely unchanged: dictionary definitions, the contra proferentem canon, the doctrine of reasonable expectations, and the competing instincts of judges and lawyers who bring their own unexamined assumptions to the ordinary-meaning inquiry.

That may be changing. In 2024, an Eleventh Circuit judge named Kevin Newsom did something that would have seemed absurd a few years ago: he opened ChatGPT and asked it whether installing an in-ground trampoline counts as “landscaping.” He found the answer more useful than he expected. He wrote a 29-page concurrence describing the experiment, naming ChatGPT, Google’s Bard/Gemini, and Anthropic’s Claude as tools that courts and lawyers should “consider” as part of the ordinary-meaning toolkit. Four months later, he did it again in a criminal sentencing case, this time running 30 queries across three AI models. By early 2026, the legal academy was fully engaged: one major law review article argued that AI models could become the “new workhorse of contractual interpretation,” while a sharp counterattack in the Harvard Journal on Legislation argued with equal force that these models are simply not ready for judicial chambers.

This post covers the full arc of that debate — the judicial opinions, the competing scholarly literature, and the practical and ethical questions facing Missouri practitioners who want to use these tools today. The developments described here are not merely academic. They are arriving, and practitioners need to be prepared.

Part I: The Judicial Foundation — Judge Newsom’s Two Concurrences

A. Snell v. United Specialty Insurance Co., 102 F.4th 1208 (11th Cir. 2024)

The underlying dispute in Snell v. United Specialty Insurance Co., 102 F.4th 1208 (11th Cir. 2024), was conventional coverage fare. James Snell operated a landscaping company under a commercial general liability policy that limited coverage to his “landscaping” operations. A family hired Snell to convert an above-ground trampoline into a ground-level installation. He dug the pit, built a retaining wall, and capped the structure with decorative wood. A child was later injured. Snell tendered the claim to United Specialty, which denied coverage on the ground that installing a trampoline is not “landscaping.” The district court agreed. On appeal, both sides expended significant effort fighting over the ordinary meaning of the term.

The Eleventh Circuit affirmed on different grounds. Under Alabama law, Snell’s application was incorporated into the policy, and in that application Snell had denied doing recreational equipment construction. That was enough to resolve the case without deciding what “landscaping” means.

Judge Newsom joined the majority but wrote separately because he had spent considerable time wrestling with the ordinary-meaning question before the application-incorporation issue became dispositive — and in doing so, he had consulted ChatGPT.

“Here’s the proposal, which I suspect many will reflexively condemn as heresy, but which I promise to unpack if given the chance: Those, like me, who believe that ‘ordinary meaning’ is the foundational rule for the evaluation of legal texts should consider — consider — whether and how AI-powered large language models like OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude might — might — inform the interpretive analysis.” — Snell, 102 F.4th at 1221 (Newsom, J., concurring).

He was surprised by the outputs. ChatGPT and Bard generated definitions of “landscaping” that were broader and more contextually attuned than traditional dictionary definitions — encompassing both aesthetic and practical modifications to outdoor spaces, rather than limiting the concept to natural features alone. He found the outputs more transparent than dictionary-shopping, because the exact prompts and full verbatim responses could be disclosed and examined by any reader. He attached the queries and full responses to his concurrence as an appendix.

His concurrence then presented a systematic five-part case for LLM use in ordinary-meaning analysis.

1. LLMs Train on Ordinary Language

The ordinary-meaning rule rests on the premise that words should be understood as ordinary people use them. LLMs are trained on an enormous volume of real-world language — doctoral dissertations, news articles, government websites, industry publications, comment threads, and casual conversation. Newsom noted that GPT-3.5 Turbo alone trained on roughly 400 to 500 billion words. That dataset, he argued, may better capture how ordinary people actually use language than a dictionary compiled by editors selecting from published formal sources. See Snell, 102 F.4th at 1225-27 (Newsom, J., concurring).

2. LLMs Understand Context

Unlike a dictionary, which provides a list of definitions stripped from context, a modern LLM can assess how a word functions in a specific semantic environment. Newsom illustrated this with the word “bat” — LLMs readily distinguish the flying mammal from the wooden implement, based on surrounding context. He described LLMs as “high-octane language-prediction machines” that convert language into mathematical representations allowing detection of subtle usage patterns. See id. at 1226-28. That contextual sensitivity is directly relevant to coverage disputes, where the question is not merely what a word can mean in the abstract, but what it most plausibly means in the specific instrument at hand.

3. LLMs Are Accessible and Inexpensive

Expert linguists are expensive. Corpus linguistics databases require specialized training. Surveys of the general public are logistically impractical for routine litigation. LLM queries, by contrast, cost little or nothing and can be run by any practitioner in minutes. Newsom noted that this accessibility also has a democratizing dimension, giving smaller firms and solo practitioners access to interpretive tools that might otherwise require expensive expert retention. See id. at 1228-29.

4. LLMs May Be More Transparent Than Dictionaries

Courts treat dictionaries as authoritative without scrutinizing their premises — who chose the definitions, which sources the editors consulted, and how they ordered competing senses. Newsom cited Justice Scalia and Bryan Garner’s own warnings about uncritical dictionary reliance. See id. at 1229-30 (citing Antonin Scalia & Bryan A. Garner, Reading Law: The Interpretation of Legal Texts (2012)). By contrast, an LLM query can be fully disclosed: the exact prompt, the version of the model used, and the verbatim response can be appended to a brief or opinion for examination by all parties. Newsom modeled this by attaching his queries and full responses to his concurrence.

5. LLMs Compare Favorably to Other Empirical Approaches

Corpus linguistics and public surveys have both been proposed as alternatives to dictionary-based interpretation. Both have real advantages but practical limitations — surveys are logistically impossible for routine litigation, and corpus linguistics requires specialized database access and has been criticized for the discretion involved in dataset selection. Newsom described LLMs as a potential middle ground: empirically grounded in large real-world language samples, practically accessible, and less susceptible to the selection-bias criticism leveled at corpus analysis. See id. at 1230-33.

Newsom also acknowledged four serious risks: (1) hallucination, where LLMs generate plausible-sounding but fabricated information; (2) underrepresentation of offline communities in training data; (3) the risk of strategic prompt manipulation; and (4) the concern that judicial reliance on AI could slide toward algorithmic adjudication and “robo judges.” He concluded that each risk was real but manageable — and often no worse than the risks already embedded in conventional interpretive tools. His bottom line: “Having initially thought the idea positively ludicrous, I think I’m now a pretty firm ‘maybe.’” Id. at 1234.

B. United States v. Deleon, 116 F.4th 1260 (11th Cir. 2024) — The Sequel

Four months later, in United States v. Deleon, 116 F.4th 1260 (11th Cir. 2024), Newsom wrote what he explicitly called “a sequel of sorts” to Snell. The issue was whether a federal sentencing guideline enhancement for “physically restrained” applied when the defendant had pointed a gun at a cashier and demanded money but had never physically touched the victim. The two-word phrase “physically restrained” does not appear as a defined composite phrase in any standard dictionary — creating an interpretive gap that Newsom found useful for testing LLMs.

He queried ChatGPT (GPT-4o), Claude 3.5 Sonnet, and Gemini, each with the same question: “What is the ordinary meaning of ‘physically restrained’?” All three models produced responses that, in Newsom’s view, aligned with his own intuition: the phrase requires the application of tangible physical force, either through direct bodily contact or through a device or instrument — not mere psychological coercion through proximity to a weapon.

He then pushed further, running each model ten times for thirty total responses. The responses were not verbatim identical, but they were, he found, remarkably consistent in their common core. His treatment of this variation is one of the most practically useful parts of the opinion for coverage practitioners. He concluded that LLM variation is not a bug but a feature — it mirrors the variation you would get if you surveyed a large population of English speakers on the same question. See Deleon, 116 F.4th at 1272-77 (Newsom, J., concurring).

“I continue to believe — perhaps more so with each interaction — that LLMs have something to contribute to the ordinary-meaning endeavor. They’re not perfect, and challenges remain, but it would be myopic to ignore them.” — Deleon, 116 F.4th at 1278 (Newsom, J., concurring).

His four updated takeaways: (1) LLMs can assist in discovering ordinary meaning; (2) LLMs have a specific advantage over dictionaries for composite multi-word phrases whose meaning is more than the sum of their parts; (3) variation across repeated queries reflects and models everyday speech variation rather than unreliability; and (4) LLMs should serve as a supplemental tool alongside, not a replacement for, traditional interpretive methods. See id. at 1277-78.

The practical significance for insurance coverage practitioners is considerable: many of the most contested policy terms are composite phrases, not single words. “Sudden and accidental,” “expected or intended from the standpoint of the insured,” “arising out of,” “in the course of,” “professional services,” and “bodily injury caused by an occurrence” are all phrases that standard dictionaries address poorly, if at all, and for which Newsom’s Deleon experiment suggests LLMs may provide useful supplemental data.

Part II: The Scholarly Literature — Proponents, Skeptics, and a Critical Counterattack

A. Arbel & Hoffman, Generative Interpretation, 99 N.Y.U. L. Rev. 451 (2024)

The foundational academic work in this field is “Generative Interpretation,” authored by Professor Yonathan Arbel of the University of Alabama School of Law and Professor David Hoffman of the University of Pennsylvania Carey Law School, published in Volume 99 of the New York University Law Review in 2024. See Yonathan Arbel & David A. Hoffman, Generative Interpretation, 99 N.Y.U. L. Rev. 451 (2024). Judge Newsom explicitly and favorably cited this article in both Snell (102 F.4th at 1226-27 n.7) and Deleon (116 F.4th at 1266 n.1), making it essential reading for practitioners engaging with this area.

Arbel and Hoffman introduce “generative interpretation” as a new approach to estimating contractual meaning using LLMs. They demonstrate their methodology through grounded case studies drawn from actual contracts adjudicated in well-known judicial opinions, showing that AI models can help factfinders (1) ascertain ordinary meaning in context, (2) quantify ambiguity, and (3) fill gaps in parties’ agreements.

Two of their case studies are directly relevant to insurance coverage work.

First, the authors applied LLM “embedding” techniques — which convert words and phrases into high-dimensional mathematical vectors to measure semantic proximity — to the meaning of “flood” in an insurance contract. They offered what they described as cheap, objective, and immediately available support for the Fifth Circuit’s conclusion that “flood” encompasses water from any cause — without the court’s exhaustive resort to dictionaries, treatises, linguistic canons, out-of-circuit authority, and encyclopedias.
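
To make the embedding idea concrete, the following is a minimal sketch (not the authors' actual pipeline) of how a practitioner might measure semantic proximity between a policy term and candidate readings. The model name (text-embedding-3-small), the sample readings, and the use of the OpenAI Python SDK are illustrative assumptions; the raw similarity scores prove nothing by themselves and would need the methodological safeguards discussed in Part IV.

```python
from openai import OpenAI
import numpy as np

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> np.ndarray:
    """Return an embedding vector (a list of numbers) for a piece of text."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Closer to 1.0 means the two texts occupy nearby semantic territory."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

policy_term = "flood"
candidate_readings = [
    "inundation of normally dry land by water from any cause",
    "overflow of a natural body of water such as a river or lake",
]

term_vec = embed(policy_term)
for reading in candidate_readings:
    print(f"{cosine_similarity(term_vec, embed(reading)):.3f}  {reading}")
```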

Second, they examined C & J Fertilizer, Inc. v. Allied Mutual Insurance Co., 227 N.W.2d 169 (Iowa 1975), a landmark case involving a commercial burglary policy requiring visible exterior signs of forced entry to trigger coverage. Fertilizer worth $50,000 disappeared with no signs of forced external entry, only tire tracks leaving the premises. The insurer denied coverage. The authors used LLMs to analyze what “burglary” and the forced-entry provision would mean to an ordinary policyholder, and argued that this analysis quickly and cheaply supported the conclusion the court reached only after extended doctrinal analysis — that the provision was intended to exclude inside jobs, not to impose a hidden technical requirement that exculpated external thieves who avoided leaving marks.

Their overarching conclusion: “generative interpretation is good enough for many cases that currently employ more expensive, and arguably less certain, methodologies,” and LLMs could become the “new workhorse of contractual interpretation.” 99 N.Y.U. L. Rev. at 458. They acknowledge limitations, offer best practices, and expressly disclaim any suggestion of replacing judges with AI adjudicators.

B. The Critical Response: Grimmelmann, Sobel & Stein, Generative Misinterpretation, 63 Harv. J. on Legis. ___ (2026)

Published in the Harvard Journal on Legislation in January 2026, “Generative Misinterpretation” by Professors James Grimmelmann of Cornell Law School, Benjamin Sobel of the University of Wisconsin Law School, and David Stein of Vanderbilt Law School is the most systematic challenge to the generative interpretation project. See James Grimmelmann, Benjamin Sobel & David Stein, Generative Misinterpretation, 63 Harv. J. on Legis. ___ (2026). It was written with direct knowledge of and response to both the Arbel/Hoffman article and Judge Newsom’s concurrences in Snell and Deleon.

The authors identify two gaps that any empirical interpretive method must bridge to be methodologically and socially legitimate.

The first is the reliability gap: are LLM methods consistent and reproducible enough for high-stakes settings? Their answer is no. They demonstrate that LLM outputs are “brittle and frequently arbitrary” — small changes in prompt phrasing, framing, or context can produce substantially different outputs. Critically, this is not the benign, survey-mimicking variation that Newsom described in Deleon; it is inconsistency that reflects the fundamental instability of probabilistic text generation. Id.

The second is the epistemic gap: do LLM outputs actually measure what proponents claim they measure? Proponents point to (1) LLMs’ training on large real-world datasets, (2) empirical measurements of LLM output patterns, (3) the rhetorical persuasiveness of LLM outputs, and (4) the assumed predictability of algorithmic methods. Grimmelmann and his co-authors argue that all four justifications rest on faulty premises about how LLMs work and what legal interpretation requires. Id.

“The superficial fluency of LLM-generated text conceals fundamental gaps between what these models are currently capable of and what legal interpretation requires to be methodologically and socially legitimate. LLM proponents do not yet have a plausible story of what that ‘something more’ comprises.” — Grimmelmann, Sobel & Stein, 63 Harv. J. on Legis. ___ (2026).

The reliability critique is particularly important for insurance practitioners. A core use case for LLMs in coverage disputes is establishing that a policy term has (or lacks) a single clear ordinary meaning. If LLM outputs on that question are brittle — if different prompts yield different answers with no principled basis for preferring one framing — then LLM evidence may function less as an objective empirical datapoint and more as a rhetorical tool each side can manipulate by framing queries favorably. The party who crafts the most advantageous prompt wins, rather than the party with the better argument about actual ordinary usage.

C. Adjacent Empirical Research

An October 2025 arXiv preprint titled “Not Ready for the Bench: LLM Legal Interpretation Is Unstable and Out of Step with Human Judgments” directly tested LLM reliability across legal interpretation tasks, including the landscaping fact pattern from Snell. The study found that “models do not provide stable interpretive judgments: varying the question format can lead the model to wildly different conclusions,” with only “weak to moderate correlation with human judgment, with large variance across model and question variant.”

A 2025 working paper by Professor Eric Posner and co-author Shivam Saran from the University of Chicago Law School — “Judge AI: Assessing Large Language Models in Judicial Decision-Making,” Coase-Sandor Institute for Law & Economics Research Paper No. 25-03 (2025) — evaluated GPT-4o against 31 U.S. federal judges in a simulated appeal. The study found that GPT-4o is “strongly affected by precedent but not by sympathy” — the opposite of professional judges. The authors describe AI as “a formalist judge, not a human judge,” suggesting that AI and human legal reasoning differ structurally in ways that cannot be corrected through prompting.

A June 2025 American Enterprise Institute report by Clay Calvert — “Using Large Language Models to Understand Ordinary Meanings in Legal Texts: Some Early Judicial and Scholarly Experiments” — surveyed the full arc of judicial and academic experiments with LLMs in ordinary-meaning analysis, concluding that the balance of the evidence supports continued cautious experimentation rather than either wholesale adoption or categorical rejection.

The United Policyholders nonprofit organization published a detailed takeaway analysis of Snell and Deleon from the policyholder perspective in February 2026. See United Policyholders, AI and Insurance Policy Interpretation After Snell v. United Specialty: What Policyholders Need to Know (Feb. 25, 2026), available at uphelp.org. The organization warned that judges may already be consulting LLMs informally even when they do not cite them in opinions, and that insurers’ counsel are almost certainly running LLM queries on disputed policy terms. Practitioners who are unaware of how AI characterizes the terms in their cases are operating with incomplete information.

Part III: The Missouri Law Dimension — Vexatious Refusal, Third-Party Bad Faith, and the AI Temporal Problem

The Newsom concurrences address how courts should interpret policy language — but there is a second question that receives less attention and may prove equally important in Missouri practice: when, and by whom, was artificial intelligence used to interpret a policy, and what does that timing reveal about the reasonableness of the coverage determination?

This question implicates two distinct bodies of Missouri extra-contractual insurance law: the vexatious refusal statutes and the common law tort of third-party bad faith. These are related but separate bodies of law, and it is worth understanding the distinction before exploring how AI-assisted policy interpretation might affect both.

A. Missouri’s Vexatious Refusal Statutes — First-Party Coverage Disputes

Missouri’s vexatious refusal statutes — Mo. Rev. Stat. §§ 375.296 and 375.420 — provide an extra-contractual remedy when an insurer refuses to pay a first-party claim without reasonable cause or excuse. These statutes apply when the insured is seeking benefits under the insured’s own policy that have been denied — the paradigm case of an insured claiming under a homeowner’s, health, life, disability, or property policy. See Qureshi v. Am. Family Mut. Ins. Co., 604 S.W.3d 721, 727 (Mo. App. E.D. 2020) (setting out elements). Section 375.296 creates liability after a thirty-day demand period; section 375.420 authorizes the court or jury to award penalties of twenty percent of the first $1,500 of the loss and ten percent of the excess, together with a reasonable attorney’s fee. Mo. Rev. Stat. § 375.420.
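
For illustration, assume a hypothetical $50,000 covered loss that is vexatiously refused: the section 375.420 penalty would be twenty percent of the first $1,500 ($300) plus ten percent of the remaining $48,500 ($4,850), a total of $5,150, recoverable in addition to the loss itself and a reasonable attorney’s fee.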

The critical principle in vexatious refusal litigation is that the reasonableness of the insurer’s refusal is judged by “the situation as presented to the insurer at the time it was called on to pay,” not by the outcome of the litigation. See Russell v. Farmers & Merchants Ins. Co., 834 S.W.2d 209, 214 (Mo. App. S.D. 1992). This principle has direct implications for AI-assisted policy interpretation.

B. Third-Party Bad Faith — Common Law Extra-Contractual Liability

Missouri’s common law tort of bad faith refusal to settle is a separate and distinct cause of action that arises in the third-party context — that is, where the insured has been sued by a third party and the insurer either fails in its fiduciary duty to the insured or refuses to settle within policy limits, exposing the insured to an excess judgment. See Zumwalt v. Utilities Ins. Co., 360 Mo. 362, 228 S.W.2d 750 (Mo. 1950) (first recognizing the tort); Ganaway v. Shelter Mut. Ins. Co., 795 S.W.2d 554, 562 (Mo. App. S.D. 1990) (analyzing bad faith refusal to settle in a liability context); Scottsdale Ins. Co. v. Addison Ins. Co., 448 S.W.3d 818, 827-28 (Mo. banc 2014) (current elements of third-party bad faith claim). The insured who is subjected to a judgment in excess of the policy limits as a result of the insurer’s bad faith in disregarding its obligations may recover the full judgment, including the amount in excess of limits, along with additional tort damages such as emotional distress and reputational harm, and potentially punitive damages. See Overcast v. Billings Mut. Ins. Co., 11 S.W.3d 62, 67 (Mo. banc 2000).

The vexatious refusal statutes and the third-party bad faith tort are not the same claim, do not arise from the same facts, and do not involve the same remedy. The vexatious refusal statutes provide a statutory remedy for a first-party claim denial. The bad faith tort provides a common law remedy in a third-party liability context. A practitioner or claims adjuster who confuses the two frameworks will be poorly positioned to advise on exposure in either context.

C. The AI Temporal Problem — A New Dimension in Both Contexts

Now consider the intersection of AI-assisted policy interpretation with both bodies of law. The question of when an LLM was used to interpret a policy — and what that model’s training data reflected at that time — creates a temporal dimension that has no real analogue in traditional dictionary-based interpretation.

Suppose an insurer, during the claim adjustment process, uses an LLM to interpret a disputed policy term. The model generates a definition supporting the coverage denial. The insurer relies on that output as one basis for its refusal to pay. Here are the relevant time points at which the ordinary meaning of the term may have been different:

•  The time the policy was drafted, which may reflect the drafting attorney’s or ISO’s understanding of the term at the time of composition.

•  The time the policy was sold to the insured, when the reasonable policyholder’s understanding of the language governs.

•  The time the claim arose, when the relevant events occurred and the insured formed reasonable expectations about coverage.

•  The time the insurer made the coverage determination, when vexatious refusal liability is assessed under Missouri law.

•  The time any LLM was queried, which reflects that model’s training cutoff and the language patterns dominant in its training data.

•  The time of any legal briefing or trial, when the LLM output might be offered to a court but the model itself may have been retrained, modified, or retired.

This temporal problem is not merely theoretical. LLMs are periodically retrained on updated datasets. Ordinary language evolves. A model queried in 2026 about the meaning of “sudden and accidental” may reflect contemporary usage patterns that differ from those that would have prevailed when a policy was drafted in 2012 and when a pollution release occurred in 2019. If an insurer uses an LLM during the 2024 claim investigation and the model reflects 2024 language patterns, the model may not accurately capture what the language meant to the ordinary policyholder at the relevant time.

Moreover, under Missouri’s vexatious refusal framework, the insurer’s coverage determination is judged by the information available at the time of the refusal. See Russell, 834 S.W.2d at 214. If an insurer queries an LLM, receives a definition supporting the denial, and relies on that output without also consulting traditional sources or verifying the result, the adequacy of that investigation becomes a factual question. Was the reliance on a single LLM output — with its known limitations as to training cutoff, representativeness, and prompt sensitivity — a reasonable basis for the coverage determination? Or did the insurer use AI as a shortcut that substituted for the kind of genuine investigation that the vexatious refusal framework was designed to encourage?

In the third-party bad faith context, similar questions arise. An insurer’s refusal to settle a claim within policy limits turns on whether the insurer acted in good faith and with proper regard for the insured’s interests. See Scottsdale, 448 S.W.3d at 827-28. An insurer that used AI to interpret its own policy at the time of the coverage decision — and reached a coverage denial that an LLM trained on contemporary language would not have supported — may face additional scrutiny in a bad faith action over the quality and good faith of that investigation.

The temporal problem cuts in both directions. In a case where the insurer seeks to establish that its coverage denial was supported by the ordinary meaning of the policy language, presenting an LLM output obtained after the fact — at the time of briefing, by querying a model that did not exist at the time of the claim — tells the court something about current language usage but may not accurately reflect the ordinary meaning at any of the relevant historical time points. Practitioners offering LLM evidence on ordinary meaning should clearly identify the model queried, the date of the query, and, to the extent ascertainable, the approximate training cutoff of the model. As Newsom himself noted in Snell, practitioners should always “consider the temporal dimensions of the request.” Snell, 102 F.4th at 1234 (Newsom, J., concurring).

Part IV: Practical Guidance for Missouri Insurance Coverage Practitioners

A. Why Missouri’s Policy Interpretation Framework Matters Here

Missouri’s approach to insurance policy interpretation has features that make the AI-in-interpretation debate particularly relevant. The foundational rule is that policy language is given the meaning an ordinary policyholder would understand it to have at the time of the transaction. See Todd v. Missouri United School Ins. Council, 223 S.W.3d 156, 163 (Mo. banc 2007). Undefined terms are given their ordinary meaning. Where a term is susceptible to more than one reasonable interpretation, the ambiguity is resolved against the insurer under the contra proferentem doctrine. See id. Courts strictly construe exclusionary clauses against the insurer, who bears the burden of demonstrating that an exclusion applies. Burns v. Smith, 303 S.W.3d 505, 510 (Mo. 2010).

These doctrinal features create specific strategic opportunities for LLM use — and specific risks.

On the ambiguity question, if an LLM queried multiple times on the same disputed term generates materially different definitions across queries or across models, that divergence can itself serve as evidence that the term is susceptible to multiple reasonable interpretations. If reasonable ordinary people — and AI systems trained on how ordinary people use language — understand the term in multiple ways, that is a cogent argument for ambiguity, triggering the insured-favorable canon. Arbel and Hoffman specifically identified this use case in Generative Interpretation, noting that LLMs can help quantify the degree of ambiguity by measuring how similar or dissimilar different plausible interpretations are to the original term. See Arbel & Hoffman, 99 N.Y.U. L. Rev. at 487-92.

On the ordinary-meaning question for clear terms, consistent LLM outputs corroborating dictionary definitions and case law are useful additional support — not a replacement for traditional authority, but a supplemental datapoint that demonstrates the robustness of the claimed meaning.

Defense counsel representing insurers should anticipate both directions. Run LLM queries on disputed policy terms before filing coverage positions. If the LLMs support the insurer’s interpretation consistently and across multiple models, that corroborates the coverage analysis. If they do not — if the models generate broader or more coverage-favorable definitions — you need to know that before opposing counsel does.

B. How to Use AI in Policy Interpretation — Practical Methodology

Using AI tools — including Microsoft Copilot, Anthropic’s Claude, Google Gemini, and OpenAI’s ChatGPT — to check and cross-reference the definition of contested policy terms is a reasonable practice in the current landscape. It may or may not yield useful information in any particular case. But it is a legitimate supplement — a first-pass data point to be compared against and tested against other sources. Here is a recommended approach.

1. Use Definitional Prompts, Not Outcome Prompts

Ask: “What is the ordinary meaning of ‘sudden and accidental’ in the context of an insurance policy?” Do not ask: “Does the term ‘sudden and accidental’ require that a discharge have an identifiable start date?” The first tests language usage. The second asks the model to resolve a legal question — something LLMs are less equipped to do and that courts have no warrant to defer to. Newsom was explicit about this distinction, cautioning that “a cautious first use of an LLM would be in helping to discern how normal people use and understand language, not in applying a particular meaning to a particular set of facts to suggest an answer to a particular question.” Snell, 102 F.4th at 1234 (Newsom, J., concurring).

2. Run Multiple Queries Across Multiple Models

Because LLMs are probabilistic, variation across queries is expected. Run the same definitional query five to ten times on each platform you are using. Then run comparable queries on at least two different platforms. Consistent results across multiple runs and multiple models are evidence of reliability. Material divergence — where the same prompt generates fundamentally different definitions in different runs or on different platforms — signals either prompt sensitivity or genuine ordinary-meaning ambiguity, either of which is significant information for coverage analysis.
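
As a concrete illustration of this step, here is a minimal sketch of a repeated-query loop, assuming the OpenAI Python SDK; the model name, the sample prompt, and the run count are illustrative choices, and a comparable loop would be run through each other platform's own SDK (Claude, Gemini, Copilot) before drawing any conclusion about consistency.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "What is the ordinary meaning of 'sudden and accidental' "
    "in the context of an insurance policy?"
)

def run_queries(model: str, n: int = 5) -> list[str]:
    """Run the same definitional prompt n times and collect the full responses."""
    responses = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT}],
        )
        responses.append(resp.choices[0].message.content)
    return responses

# Review the runs side by side: consistent cores suggest reliability;
# material divergence suggests prompt sensitivity or genuine ambiguity.
for i, answer in enumerate(run_queries("gpt-4o"), start=1):
    print(f"--- Run {i} ---\n{answer}\n")
```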

3. Compare Against Traditional Sources

Cross-reference AI outputs against Merriam-Webster, Black’s Law Dictionary, Couch on Insurance, applicable Missouri caselaw defining the term, and any relevant industry usage guides. Note where traditional sources and LLM outputs converge (persuasive) and where they diverge (may signal ambiguity or LLM unreliability). Convergence across multiple independent sources — including LLMs — is stronger support for a claimed ordinary meaning than any single source standing alone.

4. Consider the Temporal Context

For the reasons explained in Part III, the date of the query matters. If the policy was issued in 2010, run queries on what the term meant at that time by framing the prompt temporally: “What was the ordinary meaning of ‘[term]’ as used in an insurance policy in 2010?” Understand that current models may not reliably reconstruct 2010 usage patterns. For older policies, supplement AI-generated definitions with historical dictionary editions, contemporaneous industry publications, and caselaw from the policy period.

5. Document Your Methodology — Everything

Record the model used (including version or the platform’s description of the underlying model), the date of each query, the exact text of each prompt, and the full verbatim response. Do not summarize or paraphrase AI outputs in your notes — preserve the complete exchange. If you ultimately offer this evidence to a court, you will be required to disclose this information and cannot reconstruct it retroactively. Under Newsom’s recommended transparency framework, methodology disclosure is a prerequisite to judicial use, not an afterthought.
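
One simple way to honor this documentation discipline is an append-only log with one complete record per query. The sketch below is illustrative only; the file name and field names are assumptions rather than any required format, and the same fields could just as easily be captured in a memo or spreadsheet.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

LOG_FILE = Path("llm_query_log.jsonl")  # one JSON record per line

def log_query(platform: str, model_version: str, prompt: str, response: str) -> None:
    """Append a complete, verbatim record of a single LLM query."""
    record = {
        "queried_at": datetime.now(timezone.utc).isoformat(),
        "platform": platform,            # the vendor and product used
        "model_version": model_version,  # the platform's own description of the model
        "prompt": prompt,                # exact text, never paraphrased
        "response": response,            # full verbatim output, never summarized
    }
    with LOG_FILE.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```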

6. Verify Every Factual Assertion, Without Exception

If an LLM-generated output references a case, cites a dictionary, quotes a statute, or attributes a usage pattern to a specific source — verify that every such assertion is accurate before treating it as reliable. LLMs hallucinate citations with unsettling frequency. They generate plausible-sounding case names, volume numbers, and page numbers that do not exist. Offering a fabricated citation to a court is a violation of Missouri Rule of Professional Conduct 4-3.3, which prohibits false statements of fact or law to a tribunal, and may constitute conduct involving dishonesty under Rule 4-8.4(c). Missouri federal courts, following national trends, have already sanctioned attorneys for submitting AI-generated briefs containing fabricated citations. See, e.g., Jennifer Kay, Fake AI Citations Produce Fines for California, Alabama Lawyers, Bloomberg Law (Oct. 13, 2025).

C. The Ethical Framework — Obligations When AI Evidence Is Offered to a Court

If you intend to offer LLM output to a court — whether in a brief, in a supporting declaration, or as evidence of ordinary meaning — you enter a domain governed by Missouri’s Rules of Professional Conduct and by standards that are still taking shape. Several obligations deserve specific attention.

1. The Distinction Between a Dictionary Definition and an LLM Output

When a practitioner cites a dictionary definition to a court in a coverage case, the methodology is simple and transparent: here is the dictionary, here is the edition, here is the page, here is the definition. There is a single, fixed, citable source. The court can locate and examine it. No methodology need be disclosed beyond the citation itself. An LLM output is different in kind. The same model queried on the same prompt on different days may generate different responses. The model’s training data, version, and configuration affect its outputs in ways that are not apparent from the printed response. An LLM output is not a fixed, citable source in the same way a Webster’s definition is. This is not a reason to refuse to use LLM outputs entirely — but it is a reason why the disclosure requirements for LLM evidence are necessarily more demanding than for a dictionary citation.

2. Rule 4-3.3 — No False Statements to a Tribunal

Missouri Rule of Professional Conduct 4-3.3 prohibits making false statements of fact or law to a tribunal. Any attorney who offers an LLM output to a court bears the obligation to ensure that the specific claims in that output are accurate. This means independent verification of every factual assertion, every citation, and every definition drawn from an LLM before it is offered as evidence of ordinary meaning.

3. Disclosure of Adverse Guidance Under Rule 4-3.3(a)(2)

Rule 4-3.3(a)(2) requires disclosure of directly adverse controlling legal authority known to the attorney. The Grimmelmann, Sobel & Stein article — “Generative Misinterpretation,” 63 Harv. J. on Legis. ___ (2026) — and the empirical instability research referenced above are not “controlling” in the binding sense, but they are directly relevant to any representation to a court that an LLM output constitutes reliable, stable evidence of ordinary meaning. A practitioner who presents LLM evidence as though the reliability debate does not exist, without flagging the critical literature, risks both a professional responsibility problem and a credibility problem if opposing counsel raises the critique in response. The better practice is to acknowledge the state of the debate and address the specific features of your methodology that mitigate the reliability concerns.

4. Prompt Design, Cherry-Picking, and Methodological Honesty

Practitioners should be aware that prompt design is not neutral. LLMs are highly sensitive to how questions are framed. A practitioner who crafts a prompt specifically to elicit the most favorable definition possible — testing framing variations and selecting the version that best supports the client’s position — is engaged in the same kind of result-driven manipulation that critics have long identified in dictionary-shopping. When the prompt has been optimized for a favorable result, a court is not receiving evidence of ordinary meaning; it is receiving advocacy dressed as empiricism. A complete disclosure of methodology should include not only the prompts used to generate the output being offered, but an honest description of the querying process overall. The discipline of running multiple prompts and disclosing the full range of results, rather than only the favorable ones, is what separates genuine empirical analysis from result-driven advocacy.

5. Rule 4-3.7 — The Lawyer as Witness Problem

Here is a practical problem that has received insufficient attention in discussions of LLM use in litigation: if you design, conduct, and document an AI query methodology for use as evidence in a case you are trying, you may be turning yourself into a necessary witness on your own methodology.

Missouri Rule of Professional Conduct 4-3.7 provides that a lawyer shall not act as advocate at a trial in which the lawyer is likely to be a necessary witness unless (1) the testimony relates to an uncontested issue, (2) the testimony relates to the nature and value of legal services rendered, or (3) disqualification of the lawyer would work substantial hardship on the client. Missouri courts apply this rule carefully. See generally State ex rel. Fleer v. Conley, 809 S.W.2d 405, 409 (Mo. App. 1991) (analyzing the competing interests when an attorney may be required to testify in a pending matter). The concern is that combining the roles of advocate and witness can confuse or mislead the fact-finder and can create a conflict of interest between the lawyer and the client.

If opposing counsel challenges the reliability of your AI methodology — and under the Grimmelmann critique they have good grounds to do so — the specific decisions made in querying the LLM become contested facts. What prompts were tried? Which were selected? Which were discarded? Who conducted the queries? What happened when the query produced an unfavorable result? If the attorney who conducted the queries is also the trial attorney, those questions may call the attorney as a witness. The methodological decisions that an attorney makes in querying an LLM are closer to the kind of contested factual choices that can give rise to witness testimony than the act of opening a dictionary to a specific page.

There are practical workarounds. One approach is to have the AI queries conducted and documented by a non-attorney professional — a trained litigation consultant, a paralegal working under careful supervision with complete documentation protocols, or a legal technology expert — whose work product can be offered through that person’s testimony rather than the attorney’s. Under such an arrangement, and if the work is directed by counsel in connection with anticipated litigation, the query documentation may also have a stronger claim to work product protection under the framework established by Warner v. Gilbarco, Inc., No. 2:24-cv-12333, 2026 WL 373043 (E.D. Mich. Feb. 10, 2026), where a court held that AI-assisted litigation preparation reflecting mental impressions is protected work product.

A second approach is to use a retained expert — someone with expertise in AI systems, prompt design, and LLM evaluation — who conducts the queries under a defined methodology, documents the full process, and is prepared to testify about the methodology and its reliability. This approach addresses both the witness-advocate conflict and the reliability critique simultaneously. The expert can address the Grimmelmann objections directly, explain the methodology’s safeguards, and subject themselves to cross-examination. The downside is cost. But if LLM evidence is going to play a meaningful role in a coverage dispute, the cost of an expert may be justified. This is analogous to the decision about whether to use a corpus linguist or a lexicographic expert in a high-stakes statutory or contractual interpretation case.

The law on these questions is genuinely undeveloped. The discoverability cases of early 2026 — discussed below — suggest that courts are just beginning to work through the privilege and work product implications of AI use in litigation. The witness-advocate issue has not been addressed directly in any published opinion involving LLM methodology evidence. The better approach, until clearer standards emerge, is to build the methodology around disclosure, documentation, and either third-party execution or expert presentation — rather than around an attorney personally conducting and vouching for AI queries that the attorney then argues in briefs.

6. Training Data, Model Version, and Bias Disclosure

When offering LLM evidence to a court, the question of how the model was trained is directly relevant to the reliability of its output. A model trained predominantly on internet text as of a particular date may not reliably capture the ordinary meaning of a policy term as understood at a different time, in a different industry context, or by a population that is underrepresented in online language. The representativeness critique — that LLMs trained on internet language overweight educated, urban, digital-native populations and underweight rural communities, older policyholders, and speakers for whom English is a second language — is a legitimate concern that a thorough methodology disclosure should address.

Practitioners should, to the extent ascertainable, document the approximate training cutoff of the model used, the platform’s description of the model version, and any known characteristics of the training data. This information is not always readily available, but the effort to obtain it is part of responsible methodology disclosure. Where it is not available, the limitation should be disclosed.

D. The Discoverability Problem — AI-Assisted Work Product in Litigation

An important development receiving less attention in coverage discussions is the emerging body of law on whether a party’s AI-assisted litigation materials are protected by attorney-client privilege or the work product doctrine — and whether using a consumer-grade AI tool waives those protections.

In February 2026, two federal courts addressed these questions for the first time, a week apart, and reached opposite conclusions.

In United States v. Heppner, No. 25-cr-00503-JSR, 2026 WL 436479 (S.D.N.Y. Feb. 17, 2026), Judge Jed Rakoff of the Southern District of New York held that 31 documents a criminal defendant created using the consumer version of Anthropic’s Claude were not protected by attorney-client privilege or the work product doctrine. The privilege claim failed on multiple grounds: Claude is not an attorney; the communications were not confidential because the consumer version of Claude’s privacy policy expressly provides that inputs may be used to train the model and may be disclosed to “governmental regulatory authorities”; and the defendant acted on his own initiative, without counsel’s direction. Id. at *2-5. The work product doctrine failed because the materials were not prepared “by or at the behest of counsel” and did not reflect defense counsel’s strategy. Id. at *5-6. Judge Rakoff noted, however, that the analysis might differ under a Kovel-type arrangement if counsel had directed the defendant to use the AI tool.

A week earlier, Magistrate Judge Anthony P. Patti of the Eastern District of Michigan had reached the opposite conclusion in Warner v. Gilbarco, Inc., No. 2:24-cv-12333, 2026 WL 373043 (E.D. Mich. Feb. 10, 2026). In a civil employment discrimination case, a pro se plaintiff had used ChatGPT to research legal questions and draft filings. When defendants moved to compel production of all her AI queries and responses, the court denied the motion. Because the plaintiff was a pro se litigant, she was effectively acting as her own counsel. Her AI interactions reflected her internal mental impressions prepared in anticipation of litigation and were protected as work product. The court rejected the waiver argument, holding that work product waiver requires disclosure to an adversary or in a manner likely to reach an adversary’s hands — and that “ChatGPT (and other generative AI programs) are tools, not persons.” Warner, 2026 WL 373043, at *3.

The practical implications for Missouri coverage practitioners are direct. If you use a consumer-grade AI platform to query policy terms or develop coverage arguments — and you do so without counsel’s direction, using a platform whose privacy policy allows data disclosure to third parties — those materials may not be protected if sought in discovery. Enterprise-grade AI tools with contractual data protection provisions present a different analysis. The lesson from Heppner is: if you are using AI in connection with active litigation, use it at the direction of counsel, use platforms with enterprise-level confidentiality protections, and do not use consumer tools to process materials your client received from counsel.

Part V: Looking Ahead

The Newsom concurrences are judicial experiments, not holdings. Snell and Deleon do not create a right to submit LLM evidence in any case, and no circuit has yet held that LLM outputs are admissible evidence of ordinary meaning. The academic debate is live, contested, and producing empirical research that cuts in both directions. The ethical and procedural framework governing AI use in court filings is being written in real time.

Several specific developments to watch:

•  Missouri Supreme Court and Bar guidance. The Missouri Bar and the Missouri Supreme Court have not yet issued formal rules or advisory opinions specifically addressing AI use in court filings. Practitioners should monitor the Missouri Bar’s professional responsibility committee and the standing orders of judges in the Western and Eastern Districts of Missouri and in the Missouri circuit courts.

•  The emergence of AI expert witnesses in coverage disputes. Both the Grimmelmann critique and the United Policyholders analysis anticipate that parties may begin retaining generative AI specialists to opine on the reliability and methodology of LLM outputs offered by the opposing party. This is a new category of potential expert in insurance coverage litigation.

•  Insurance-specific AI tools. Some companies are developing AI tools specifically trained on insurance policy language and coverage caselaw, rather than general internet text. Tools of this kind would partially address the representativeness critique, though they would introduce new questions about the nature and potential biases of specialized training data.

•  Evolving privilege and discoverability law. Heppner and Warner represent the first rulings on these questions. The law will develop rapidly, and practitioners should treat the current state of affairs as genuinely unsettled and requiring careful, case-specific analysis.

Conclusion

Judge Newsom’s concurrences in Snell v. United Specialty Insurance Co. and United States v. Deleon are not mere intellectual curiosities. They represent the opening of a serious judicial and scholarly conversation about whether and how artificial intelligence can improve the transparency and quality of legal interpretation — a conversation that will directly affect how insurance policy disputes are litigated and resolved.

The practical takeaway for Missouri insurance coverage practitioners is measured and specific. Using AI tools to check and cross-reference the definition of contested policy terms is a reasonable practice — a legitimate supplement, used carefully, to dictionaries, treatises, and caselaw. The practice must be accompanied by independent verification, methodological discipline, attention to temporal context, and a clear understanding of the tools’ limitations.

Offering AI-generated evidence to a court is a different matter, requiring careful attention to Missouri’s Rules of Professional Conduct, to the emerging scholarly and empirical debate about LLM reliability, and to the obligation to disclose rather than suppress adverse guidance. The reliability critique in Grimmelmann, Sobel & Stein is authority that an honest attorney must address rather than ignore. The temporal problem — the multiple time points at which an LLM might have generated a different definition — is directly relevant in both vexatious refusal and third-party bad faith cases where the quality of the insurer’s coverage investigation is at issue.

And the methodological problem — the risk that an attorney who personally conducts, selects, and vouches for AI queries becomes a necessary witness in the attorney’s own case under Missouri Rule 4-3.7 — is real and currently unaddressed by any court. Until clearer standards emerge, the better practice is to build AI methodology around third-party execution or retained expert testimony, not around attorney self-authentication of personally conducted queries.

AI is here to stay. The question is not whether to engage with these tools but how to do so profitably and responsibly. Missouri coverage practitioners who understand both the opportunity and the risk are better positioned than those who either ignore the tools entirely or embrace them without understanding what they can and cannot do.

For an in-depth treatment of how AI systems work and the professional rules governing the use of AI, see the foundational post in this AI series: AI Systems for Missouri Lawyers: How They Work, What They Risk, and How to Use Them Responsibly

Key Sources Referenced in This Post

Snell v. United Specialty Ins. Co., 102 F.4th 1208 (11th Cir. 2024) (Newsom, J., concurring)

United States v. Deleon, 116 F.4th 1260 (11th Cir. 2024) (Newsom, J., concurring)

Yonathan Arbel & David A. Hoffman, Generative Interpretation, 99 N.Y.U. L. Rev. 451 (2024)

James Grimmelmann, Benjamin Sobel & David Stein, Generative Misinterpretation, 63 Harv. J. on Legis. ___ (2026)

Clay Calvert, Using Large Language Models to Understand Ordinary Meanings in Legal Texts: Some Early Judicial and Scholarly Experiments (AEI, June 2025)

Eric A. Posner & Shivam Saran, Judge AI: Assessing Large Language Models in Judicial Decision-Making, Coase-Sandor Institute for Law & Economics Research Paper No. 25-03 (Univ. of Chicago, 2025)

United Policyholders, AI and Insurance Policy Interpretation After Snell v. United Specialty: What Policyholders Need to Know (Feb. 25, 2026), available at uphelp.org

Not Ready for the Bench: LLM Legal Interpretation Is Unstable and Out of Step with Human Judgments (arXiv, Oct. 2025)

United States v. Heppner, No. 25-cr-00503-JSR, 2026 WL 436479 (S.D.N.Y. Feb. 17, 2026)

Warner v. Gilbarco, Inc., No. 2:24-cv-12333, 2026 WL 373043 (E.D. Mich. Feb. 10, 2026)

Todd v. Missouri United School Ins. Council, 223 S.W.3d 156 (Mo. banc 2007)

Burns v. Smith, 303 S.W.3d 505 (Mo. 2010)

Overcast v. Billings Mut. Ins. Co., 11 S.W.3d 62 (Mo. banc 2000)

Scottsdale Ins. Co. v. Addison Ins. Co., 448 S.W.3d 818 (Mo. banc 2014)

Ganaway v. Shelter Mut. Ins. Co., 795 S.W.2d 554 (Mo. App. S.D. 1990)

Zumwalt v. Utilities Ins. Co., 360 Mo. 362, 228 S.W.2d 750 (Mo. 1950)

Qureshi v. Am. Family Mut. Ins. Co., 604 S.W.3d 721 (Mo. App. E.D. 2020)

Russell v. Farmers & Merchants Ins. Co., 834 S.W.2d 209 (Mo. App. S.D. 1992)

State ex rel. Fleer v. Conley, 809 S.W.2d 405 (Mo. App. 1991)

C & J Fertilizer, Inc. v. Allied Mutual Ins. Co., 227 N.W.2d 169 (Iowa 1975)

Mo. Rev. Stat. §§ 375.296 and 375.420 (vexatious refusal statutes)

Missouri Rules of Professional Conduct, Rules 4-3.3, 4-3.7, and 4-8.4(c)
