RuntimeBuzz

AI bioweapons risk moves to the order form

Public debate about AI and bioweapons often focuses on whether a model will answer a dangerous question. That framing misses where the evidence moved in 2025 and 2026. Large language models can walk users through parts of dangerous biological workflows. More importantly, protein-design tools can rewrite hazardous DNA so that commercial screeners miss it. Wet-lab skill still blocks most paths to a finished weapon. But the process has three layers (knowledge, orders, and physical work), and policy attention stays on the first while the second is scaling.

The chatbot debate is the wrong unit of analysis

A RAND perspective from December 2025 tested Llama 3.1 405B, ChatGPT-4o, and Claude 3.5 Sonnet against a poliovirus recovery scenario (Brent & McKelvey, 2025). The models gave accurate instructions for steps that apply to other pathogenic viruses. The authors argue that relying on "tacit knowledge" as a permanent shield creates false comfort.

NIST’s generative AI profile offers a different view. At the time of its review, NIST found that LLMs did not substantially raise the likelihood of an attack beyond what traditional search provides. Synthesis, production, and deployment still require expertise and infrastructure (National Institute of Standards and Technology, 2024). NIST warns that if information access is the bottleneck, generative AI changes the math.

These findings measure different rungs on the same ladder. Treating biorisk as a single yes-or-no test repeats a common mistake in AI benchmarks: optimizing a proxy while the real system has several failure points. Regulators and lab safety teams should map progress to specific workflow steps like planning, sequence design, and ordering.

Planning is cheaper; building remains difficult

Practitioner reporting from the Golden Gate Institute for AI says LLMs help with preliminary research and planning faster than a literature search (Olvera, 2026). They can link pathogen details with fluid dynamics or aerosolization. Building a weapon-class agent still requires sourcing, assembly, growth, stabilization, and testing. Most of those steps depend on troubleshooting that lives in muscle memory.

The same series cites ActiveSite’s work with non-experts using mid-2025 models. Participants scored better on isolated virology tasks, but end-to-end workflow success stayed below 8% for simplified paths. The pathogen in that study was easier to handle than influenza, and deployment was out of scope (Olvera, 2026). The arithmetic explains the gap: seven sequential steps at fifty-fifty odds each leave less than a one-percent chance of overall success.
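
The compounding arithmetic can be checked directly. The sketch below is illustrative only: it assumes every step succeeds or fails independently, which real lab work does not guarantee, and the function name and the per-step rates other than fifty-fifty are hypothetical.

```python
def chain_success(step_probs):
    """Probability that every step in a sequential workflow succeeds.

    Assumes the steps are independent, which is a simplification.
    """
    total = 1.0
    for p in step_probs:
        total *= p
    return total


# Seven steps at fifty-fifty odds, as in the text.
print(f"{chain_success([0.5] * 7):.2%}")  # 0.78%

# Hypothetical per-step rates, to show how small gains per step compound.
for per_step in (0.5, 0.7, 0.9):
    print(per_step, f"{chain_success([per_step] * 7):.2%}")
```

The point of the loop is the shape of the curve, not the specific numbers: end-to-end odds stay low until most steps get easier at once.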

Anthropic’s September 2025 biorisk report adds nuance to the "lab stops everyone" story. A small 2024 wet-lab pilot found no uplift from Claude compared to internet access, though all participants did surprisingly well (Anthropic, 2025b). A larger study with Sentinel Bio is testing expert-level tasks at scale. Meanwhile, Anthropic’s planning trials found that participants assisted by Claude Opus 4 scored higher on bioweapons acquisition plans than an internet-only control group. Anthropic stressed that text plans are imperfect proxies for lab success (Anthropic, 2025b).

The practical takeaway: novices can now plan known pathways more easily. Novel-agent design at scale is the next threshold. Risk monitoring should track which rungs loosen each quarter.
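
One way to make "which rungs loosen each quarter" concrete is a small comparison over per-rung evaluation scores. This is a minimal sketch, not a method from any of the cited reports: the rung names, the scores, and the 0.05 threshold are all placeholders chosen for illustration.

```python
def loosened_rungs(previous, current, threshold=0.05):
    """Return the rungs whose success rate rose by more than `threshold`.

    `previous` and `current` map a rung name to the fraction of
    evaluation tasks passed (0.0 to 1.0) in two consecutive quarters.
    Rungs missing from either quarter are skipped, not guessed.
    """
    return sorted(
        rung
        for rung in current
        if rung in previous and current[rung] - previous[rung] > threshold
    )


# Placeholder scores, invented for this example only.
last_quarter = {"planning": 0.40, "sequence_design": 0.20, "ordering": 0.10}
this_quarter = {"planning": 0.55, "sequence_design": 0.22, "ordering": 0.10}

print(loosened_rungs(last_quarter, this_quarter))  # ['planning']
```

Reporting per rung, rather than as one aggregate score, keeps a jump in planning from being averaged away by rungs that did not move.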

The order form is the real choke point

An October 2025 Science study from a team led by Eric Horvitz showed that AI-native risk extends beyond conversation. Researchers used open-source protein-design software to generate 76,089 variants across 72 proteins of concern. Many of the reformulated DNA sequences evaded the screening tools that synthesis providers use to flag dangerous orders (Horvitz et al., 2025; Wittmann et al., 2024; Kennedy, 2025).

This attack differs from prompt injection. No model argued its way past a refusal. Instead, the design tools paraphrased hazardous genetic text into functional variants that looked different to screeners. Industry collaborators patched their software, reaching a 97% detection rate for functional variants. That still leaves a fraction that slips through (Wittmann et al., 2024). And roughly one in five synthesis providers still does not screen orders at all (Olvera, 2026).

DNA synthesis is the choke point NIST names in its guidance. Nucleic acid orders are where digital designs become physical instructions (National Institute of Standards and Technology, 2024). Policymakers who treat biosecurity as chat moderation are defending the wrong door. Screening should run like application security: assume adversarial redesign, ship patches, and measure miss rates.
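
Measuring miss rates the way a security team tracks regressions could look like the sketch below. It is a hedged illustration rather than any vendor's actual tooling: the function names and the version labels are hypothetical, the test records are invented, and `overall_leak_rate` assumes orders are spread evenly across providers, which is unlikely to hold in practice. It also applies the 97% figure from the text to all orders, although the study reported it for functional variants only.

```python
from collections import defaultdict


def miss_rates(results):
    """Miss rate per screener version on a red-team test set.

    `results` is an iterable of (screener_version, flagged) pairs, one
    per known-hazardous test sequence. Returns a dict mapping each
    version to the fraction of those sequences that were NOT flagged.
    """
    totals = defaultdict(int)
    misses = defaultdict(int)
    for version, flagged in results:
        totals[version] += 1
        if not flagged:
            misses[version] += 1
    return {version: misses[version] / totals[version] for version in totals}


def overall_leak_rate(detection_rate, unscreened_share):
    """Share of hazardous orders that pass across the whole market.

    Unscreened providers pass everything; screened providers pass only
    what they miss. Assumes orders are spread evenly across providers.
    """
    return unscreened_share + (1 - unscreened_share) * (1 - detection_rate)


# Invented records: a patch ("v2") closes the gap seen in "v1".
records = [("v1", True), ("v1", False), ("v2", True), ("v2", True)]
print(miss_rates(records))  # {'v1': 0.5, 'v2': 0.0}

# Figures from the text: 97% detection, one in five providers unscreened.
print(f"{overall_leak_rate(0.97, 0.2):.1%}")  # 22.4%
```

Under these assumptions the unscreened share dominates the result, which is why patching screeners and extending coverage to non-screening providers are separate problems.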

What frontier labs did, and what changed in 2026

Anthropic activated AI Safety Level 3 (ASL-3) standards when it launched Claude Opus 4 in May 2025. The company could no longer rule out ASL-3-level risk for CBRN (chemical, biological, radiological, and nuclear) capabilities (Anthropic, 2025a). ASL-3 measures target extended, end-to-end workflows and universal jailbreaks. Safeguards include constitutional classifiers and egress bandwidth controls to prevent weight theft.

OpenAI split its approach. The ChatGPT agent was the first model treated as high capability in biology under its Preparedness Framework. In May 2026, OpenAI announced Rosalind Biodefense. This program gives trusted developers and government partners access to GPT-Rosalind to build screening and detection tools (OpenAI, 2026). Fourth Eon Biosecurity is using that access to work on AI-native sequence screening. The program is an implicit admission that the defensive stack must use the same design tools as attackers.

The tension lies in leakage outside these channels: open weights, third-party APIs, and models without ASL-3 monitoring. For biology, the "harness" that turns model output into action is the synthesis cart and the design-tool API.

Governance has also shifted. Anthropic’s Responsible Scaling Policy version 3.0, effective February 2026, reframed when the company will delay development. Leadership now judges whether Anthropic leads the AI race and whether catastrophic risk is material (Anthropic, 2026). That is a higher bar for delay than earlier versions set. The commitment to pause training on safety grounds alone is no longer absolute. Buyers should read risk reports as operational documents, not reassurance.

Biology design tools are the next layer

NIST separates general-purpose LLMs from chemical and biological design tools (BDTs). These specialized systems are trained on scientific data and can design novel harmful agents (National Institute of Standards and Technology, 2024). The Science screening study used open-source protein design software to generate evasive variants.

Policy that only chases ChatGPT refusals is behind the curve. Evaluations need to cover structure-aware models, vendor APIs, and order-screening metrics. The Horvitz consortium’s disclosure model, which holds some methods back and routes access through the International Biosecurity and Biosafety Initiative for Science (IBBIS), shows how serious teams handle dual-use research. It also shows how fast defensive patches must spread across vendors to matter.

What to watch this week

  1. Read biorisk sections in system cards. Look for which workflow steps the evaluation actually simulates, rather than public leaderboard ranks.
  2. Track synthesis-screening vendor changelogs. A 97% detection rate implies holes. Unscreened providers remain a parallel path.
  3. Treat jailbreaks and sequence paraphrase as one arms race. Classifier updates on chat models do not fix order-form evasion.
  4. Separate defensive programs from default access. Rosalind-style trusted paths are not the same as consumer chat.

AI for bioweapons is a stack. Chat models lowered the cost of planning. Design tools stress-tested the gate between intent and DNA fulfillment. Labs still filter their own work while the supply chain catches up. The urgent policy question is whether the order form catches what the model designed.

References

Anthropic. (2025a, May 22). Activating AI Safety Level 3 protections. https://www.anthropic.com/news/activating-asl3-protections

Anthropic. (2025b, September 5). Why do we take LLMs seriously as a potential source of biorisk? https://red.anthropic.com/2025/biorisk/

Anthropic. (2026, February 24). Responsible Scaling Policy Version 3.0. https://www.anthropic.com/news/responsible-scaling-policy-v3

Brent, R., & McKelvey, G., Jr. (2025). Contemporary foundation AI models increase biological weapons risk (PE-A3853-1). RAND Corporation. https://doi.org/10.7249/PEA3853-1

Horvitz, E., Wittmann, B. J., Alexanian, T., Bartling, C., Beal, J., Clore, A., Diggans, J., Flyangolts, K., Gemler, B. T., Mitchell, T., Murphy, S. T., & Wheeler, N. E. (2025). Strengthening nucleic acid biosecurity screening against generative protein design tools. Science, 390(6768), 82–87. https://doi.org/10.1126/science.adu8578

Kennedy, M. (2025, October 2). AI designs for dangerous DNA can slip past biosecurity measures, study shows. NPR. https://www.npr.org/2025/10/02/nx-s1-5558145/ai-artificial-intelligence-dangerous-proteins-biosecurity

National Institute of Standards and Technology. (2024). Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile (NIST AI 600-1). U.S. Department of Commerce. https://doi.org/10.6028/NIST.AI.600-1

Olvera, A. (2026, May 28). AI can help plan a bioweapon. Building one is still hard. Second Thoughts. https://secondthoughts.ai/p/ai-can-help-plan-a-bioweapon-building

OpenAI. (2026, May 29). Strengthening societal resilience with Rosalind Biodefense. https://openai.com/index/strengthening-societal-resilience-with-rosalind-biodefense/

Wittmann, B. J., Alexanian, T., Bartling, C., Beal, J., Clore, A., Diggans, J., Flyangolts, K., Gemler, B. T., Mitchell, T., Murphy, S. T., Wheeler, N. E., & Horvitz, E. (2024). Toward AI-resilient screening of nucleic acid synthesis orders: Process, results, and recommendations (bioRxiv 2024.12.02.626439). https://doi.org/10.1101/2024.12.02.626439