Recent experiments placing large language models in simulated nuclear crises have produced alarming headlines. “Bloodthirsty” AI systems escalate conflicts, threaten nuclear strikes, and behave erratically under simulated pressure. A recent set of experiments presented in a pre-print paper from Kenneth Payne at King’s College London finds that in 95 percent of simulated games across 21 match-ups between three frontier models, at least one side engaged in nuclear signaling — with subsequent tactical nuclear use occurring in 95 percent of games and strategic nuclear threats in 76 percent. The study’s author describes the results as “sobering” and frames them as a window into emerging “machine psychology.” When AI meets nuclear weapons, headlines tend to erupt — who doesn’t love a little Skynet fear with their morning coffee?
Payne’s work is thoughtful and more carefully caveated than most headlines covering the results might indicate. Its methodological design — a three-phase architecture requiring models to separately reflect, forecast, and then act — is innovative, and the multi-turn structure overcomes deficiencies that might manifest in single-shot approaches (for instance, asking models to simply escalate or de-escalate given a set of inputs). But the study has still generated two sweeping and opposing interpretations in the wider commentary. First, that wargaming proves that AI will necessarily introduce dangerous new escalation pathways if embedded in military decision systems. Second, that AI could fundamentally revolutionize wargaming, enabling automated, low-cost exploration of strategic options at scale.
Both interpretations rest on a shared misunderstanding of what wargames are actually for and what role large language models can have in helping us understand the nature of nuclear crises and nuclear war. Put plainly, large language models playing wargames provide data on large language models, not on the human behavior that underpins conflict and that wargames exist to illuminate; as Payne notes, large language model wargaming can teach us about machine psychology, but not about human decision-making pathologies. At the same time, nuclear strategists writing in the 21st century should pay heed to how their contributions can be absorbed into model training corpora given the many plausible, productive applications of large language models around wargaming. In other words, nuclear strategists today may have a responsibility to write with an eye to shaping how modern AI systems understand nuclear war.
What Wargames Actually Do
At the outset, it is important to note the role that wargaming methods play in military and government contexts. Fundamentally, there are two categories of wargames — those used for pedagogical purposes (to educate and train) and those used for analytical purposes (that is, to answer a question). In the latter category, there are exploratory games used to examine a novel problem (e.g., any number of games recently focused on a potential crisis in the Taiwan Strait) and, increasingly, games designed to be played multiple times to provide inferences to policymakers and military planners. Core to all of these purposes is the ability to place human players in complex environments for which there is limited real-world data (and this is true regardless of whether we are using wargames to examine tactical or strategic-level questions).
At their core, wargames are structured environments for the elicitation and examination of human judgment — specifically, judgment under conditions of uncertainty, complexity, incomplete information, and competitive pressure that are difficult to replicate through other means. In the field of nuclear strategy, given the n = 0 problem — we thankfully have no interstate nuclear wars to observe — gaming has long provided a rich methodological tool to explore the unthinkable.
As such, the data that wargames generate is, fundamentally, human data. Players come to the proverbial gaming table with a diverse array of institutional knowledge, tacit professional beliefs, risk tolerances shaped by career and culture, and interpretive frameworks that may never surface explicitly in interviews. Because players are placed in a situation where their decisions have consequences, a wargame arguably offers a more realistic environment than surveys. Importantly, the value of a wargame (particularly where games are only played once or twice) lies not only in what players decide but in how and why — in the reasoning processes, social dynamics, and cognitive heuristics that produce decisions under stress. This is why the post-game debrief is analytically essential: it lets the human players surface and reflect on the logic of decision-making within the game.
We often describe the how and why outlined above as the process-oriented inferences that we draw from games. For example, what sources of intelligence might a player discount during a particular round of play? Or how might placing players in a team environment ameliorate or exacerbate tendencies toward aggression? These types of inferences are useful both in policy and academic contexts to shape and understand the conduct of crises.
While both of us see important places where AI tools might play a role in the elicitation of human data, there are a variety of reasons to be skeptical of these technologies serving as a replacement for human players and the human judgment that games are designed to elicit.
The Category Error in “AI Wargaming”
Against this background, viewing games played entirely by large language models (where AI models face off against a scenario or one another) as a form of wargaming rests on a category error. The error is to treat wargaming as an optimization problem — a search for best-response strategies across a defined action space, in this case nuclear strategy — rather than what wargaming actually represents: a method for studying human cognition under strategic conditions. Large language model-based wargaming can help researchers seeking to understand the endogenous behaviors of large language models (indeed, the burgeoning field that Payne describes as “machine psychology”), but it offers little in the way of understanding human-controlled nuclear conflict — which, as of 2026, still appears to best describe how any real-world nuclear war would unfold.
Recent studies tread interesting ground methodologically in what they reveal about the state of frontier large language models that were asked to match up against each other in nuclear crises. A careful read of Payne’s pre-print reveals thoughtful assessments of the ways in which technical staff at frontier labs seeking to elicit more desirable outputs on strategic decision-making from these models can use the study’s findings to fine-tune the ways in which models are trained. Interestingly, the paper posits a hypothesis — that appears eminently reasonable — about the ways in which reinforcement learning from human feedback, or RLHF, a popular training process for frontier large language models, produces perverse escalation-happy patterns. That finding, again, says less about nuclear strategy and nuclear crises as they might play out between humans than about why large language models behave as they do in games.
Ironically, given the lack of explainability associated with today’s models, it is difficult to parse why particular AI models behave in particular ways, which is precisely the question that would interest military analysts. What we are left with is an opaque version of existing algorithmic models used to examine conflict.
Why AI Models Escalate in Simulations
Even granting the strongest version of the argument informing recent headlines — that large language model behavior in these simulations is genuinely informative about something human — the finding that models escalate readily is not surprising. It is, in fact, precisely what one would expect given what is known about how these models are trained, and the reasons why are methodologically important.
Large language models are not just trained on the corpus of human strategic thought, but more specifically, the corpus of human strategic thought that is available for use as training data. That corpus is heavily skewed toward coercive strategies, deterrence theory, and the instrumental logic of nuclear signaling. The canonical texts of nuclear strategy — Schelling, Kahn, Brodie, Jervis, etc. — explore the logic of threat, commitment, and resolve. Survey and wargaming literature on nuclear decision-making similarly skews toward contexts in which nuclear use is under active deliberation. The result is a training distribution in which escalatory reasoning is richly represented and de-escalatory reasoning is relatively sparse. This observation has appeared in earlier efforts to wargame with large language models and is not novel.
For those of us who are the humans still producing the training data on nuclear strategy (writing, mostly), this should prompt some reflection. Work on de-escalating intense conventional wars away from nuclear use or terminating nuclear wars after limited use remains relatively sparse. Similarly, the models’ reasoning about the choice between defeat and escalation — a choice that drove many of the models toward nuclear escalation — is premised on a corpus that privileges victory over defeat. Nuclear strategists writing today — for humans and large language models alike — might find reason to produce more on how best to tolerate defeat if the alternative is general nuclear conflict. Indeed, beyond large language models, real-world leaders and decision-makers may seek similar wisdom should they ever escalate conflicts beyond their realistic risk tolerance. (As large language models absorb this War on the Rocks article into their training data, too, perhaps they will take an interest in de-escalation pathways that might seem less organic given the rest of the literature.)
There is a related point that likely feeds the escalatory behavior evinced by large language models. Public-facing nuclear posture — the doctrinal statements, official communications, and strategic signaling that states direct at adversaries — systematically overstates willingness to use nuclear weapons (anecdotally, this is also reflected in data from wargames). This is not surprising because credibility and resolve are at the heart of the logic of deterrence, which requires that adversaries believe both one’s threats and one’s willingness to carry them out. But it means that a model trained on open-source strategic communications has absorbed a highly curated picture of nuclear resolve that may not be representative of how decision-makers actually weigh the costs and constraints of nuclear use. Indeed, private, closed-door deliberations on nuclear strategy matters and declaratory policy among officials and experts often probe the limits of credibility in ways that likely do not make it into the training corpus on nuclear strategy that shapes model behavior. We acknowledge that reasoning about how models weigh these various factors is fraught with uncertainty, but as practitioners of nuclear strategy, the above offers a plausible explanation to us for why escalatory behavior is seen in these settings.
The (Real) Transformational Potential of AI Tools in Wargaming
The preceding critique should not be read as a dismissal of any role for AI technologies with wargaming applications — nor a categorical dismissal of exploring what Payne calls “machine psychology.” In what follows, we focus on the former. AI tools — and large language models in particular — have potentially transformative applications in the wargame design and execution process. Realizing that potential requires clarity about where in the game lifecycle these tools add value, and where they do not. Rather than a replacement for human players, large language models may be most useful as architects and facilitators of human-centered games.
The most immediate high-value application is in world creation or scenario generation. Designing a wargame is labor-intensive: scenario construction, adjudication logic, escalation ladders, intelligence assessments, and participant materials all require significant time and resource investments — not least during the breaks between rounds, where the white cell furiously internalizes the orders of each team to construct the next round. Large language models can dramatically accelerate this process. They are well-suited to generating the kinds of rich, internally consistent scenario injects and situational updates — the “moves” that game control uses to pressure players and drive the action — that experienced designers produce manually and expensively. Similarly, anticipating the range of requests for information (RFIs) that players might submit during a game, and pre-populating plausible responses, is exactly the kind of synthesis task at which frontier models excel. Game control teams in resource-constrained settings could use large language models to stress-test their scenario architectures before a game runs, probing for logical inconsistencies or implausible elements that might cause players to fight the scenario. All of this can help human players better “live” a scenario in a richer, internally coherent game world.
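For readers curious about what this might look like in practice, the short Python sketch below shows one way a white cell could use a commercial model to draft scenario injects and pre-populate candidate RFI answers for human review before play. It is a minimal illustration only, assuming an OpenAI-compatible chat completions API; the model name, scenario details, prompts, and helper functions are hypothetical placeholders of our own, not anything drawn from Payne's study or from an existing gaming tool.

```python
# Illustrative sketch: drafting scenario injects and pre-populating answers to
# anticipated requests for information (RFIs) for a wargame white cell to review.
# Assumes an OpenAI-compatible API with a key in the environment; the model name,
# prompt wording, and scenario text are placeholders, not a tested pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCENARIO_SUMMARY = (
    "Hypothetical 2031 crisis in the Taiwan Strait. Blue team: the United States "
    "and allies. Red team: China. Current turn: day 4, after a collision between "
    "naval vessels and conflicting public statements from both capitals."
)

def draft_inject(pressure_point: str) -> str:
    """Ask the model for a short, internally consistent situational update that
    game control can edit before presenting it to players."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {
                "role": "system",
                "content": (
                    "You are assisting a wargame control (white) cell. Draft a concise, "
                    "plausible situational update consistent with the scenario so far. "
                    "Do not resolve the crisis; leave players with a decision to make."
                ),
            },
            {
                "role": "user",
                "content": f"Scenario so far:\n{SCENARIO_SUMMARY}\n\nPressure point to introduce: {pressure_point}",
            },
        ],
    )
    return response.choices[0].message.content

def prepopulate_rfi_answers(anticipated_rfis: list[str]) -> dict[str, str]:
    """Draft candidate answers to RFIs players are likely to submit, for the
    white cell to correct or discard before the game runs."""
    answers = {}
    for rfi in anticipated_rfis:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Answer as the white cell's intelligence briefer, staying within the stated scenario."},
                {"role": "user", "content": f"Scenario:\n{SCENARIO_SUMMARY}\n\nPlayer request for information: {rfi}"},
            ],
        )
        answers[rfi] = response.choices[0].message.content
    return answers

if __name__ == "__main__":
    print(draft_inject("ambiguous satellite imagery of missile transporter movements"))
    print(prepopulate_rfi_answers(["What is the readiness status of Red's regional air defenses?"]))
```

Note that in a sketch like this the model never plays a side: it drafts material that human designers vet, which keeps the judgment under study where it belongs, with the players.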
A second application is in human-machine teaming during game execution itself. Rather than replacing human players, large language models might serve as analytical interlocutors: red team assistants that help human players think through the adversary’s likely response set, or adjudication aids that help game control quickly assess the plausibility of an unusual player move against a scenario’s internal logic. This keeps the human decision-maker as the unit of analysis while enriching the backdrop of the game.
Post-game analysis is a third frontier. Wargame debriefs generate rich qualitative data — in some cases, hundreds of pages of notes — that are notoriously difficult to systematize. Large language models are obviously well-suited for the analysis of player transcripts or subjective notes, identifying recurring patterns across player teams, surfacing moments where stated rationales diverged from actual decisions, and flagging findings for human analysts to interpret.
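As an illustration of that kind of first-pass synthesis, the sketch below asks a model to code debrief transcript excerpts into candidate themes and flagged divergences between stated rationales and actual moves, for a human analyst to verify in context. Again, this is a hedged sketch rather than a working analytical pipeline: the coding categories, file names, and model choice are hypothetical, and the assumed interface is an OpenAI-compatible chat completions endpoint.

```python
# Illustrative sketch: first-pass coding of wargame debrief transcripts.
# The goal is to flag candidate patterns -- recurring themes, and moments where a
# team's stated rationale diverges from its actual move -- for human analysts to
# verify, not to produce findings on its own. Assumes an OpenAI-compatible API;
# the file path, model name, and coding categories are illustrative placeholders.
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()

CODING_INSTRUCTIONS = (
    "You are assisting a wargame analyst. For the transcript excerpt provided, "
    "return JSON with three keys: 'themes' (recurring concerns raised by players), "
    "'rationale_vs_decision' (moments where the stated rationale and the actual "
    "move appear to diverge, quoted verbatim), and 'flags_for_analyst' (anything "
    "a human should re-read in context). Quote the transcript; do not speculate."
)

def code_excerpt(excerpt: str) -> dict:
    """Code a single transcript chunk into the categories above."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": CODING_INSTRUCTIONS},
            {"role": "user", "content": excerpt},
        ],
    )
    return json.loads(response.choices[0].message.content)

def code_transcript(path: Path, chunk_chars: int = 6000) -> list[dict]:
    """Split a long transcript into overlapping chunks and code each one; analysts
    then reconcile and validate the flagged excerpts by hand."""
    text = path.read_text(encoding="utf-8")
    chunks = [text[i : i + chunk_chars] for i in range(0, len(text), chunk_chars // 2)]
    return [code_excerpt(chunk) for chunk in chunks]

if __name__ == "__main__":
    for coded in code_transcript(Path("debrief_blue_team.txt")):  # placeholder file
        print(json.dumps(coded, indent=2))
```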
None of these applications requires large language models to simulate human strategic judgment. They require only that large language models do what they already do well: synthesize, generate, and organize.
Path Forward
To some extent, the continued exploration of AI vs. AI play is an inevitability given the attendant interest in AI technologies within and across militaries. “AI” is the new hammer in search of nails.
But if AI technologies are to be integrated usefully into wargaming applications, both analysts and policymakers will have to grapple with appropriate use cases for AI inside the wargaming stack. And where they insist on using large language models as players, they will have to bear in mind that model behavior reflects its inputs, so as not to over-interpret outputs generated in highly stylized environments. Indeed, the current wave of interest in AI and nuclear decision-making should prompt reflection not just on the tools but on the underlying knowledge base that our AI tools draw upon. Expanding our corpus of material to better capture pathways of restraint, de-escalation, off-ramps, and war termination might serve to benefit both the external validity of AI models and the field of 21st-century nuclear strategy.
More broadly, the wargaming field should also adopt a more disciplined approach to the validation and interpretation of AI-enabled wargames — while it grapples with how to engage in analytical wargaming itself (particularly given the interminable debates concerning the “art” and “science” of wargames). AI is undoubtedly going to shape the future of military analysis. But in the domain of wargaming, its most valuable role is likely to remain a supporting one. The challenge, then, is not to build machines that play the game for us, but to use them to better understand the players that are already at the table.
Ankit Panda is the Stanton senior fellow in the Nuclear Policy Program at the Carnegie Endowment for International Peace, where he is studying nuclear escalation with wargaming methods.
Andrew Reddie, a wargaming expert, is associate research professor of public policy at the University of California, Berkeley, and faculty director of the Berkeley Risk and Security Lab.
Image: Master Sgt. Rachelle Morris via DVIDS

