The Virtual Lab of AI Agents Designs New SARS-CoV-2 Nanobodies
Nature AAP, 2025 - Paper Link
One-Sentence Take
A multi-agent LLM “Virtual Lab” co-designs a protein-engineering workflow that proposes 92 nanobody variants and yields two mutants with gained or improved binding to recent SARS-CoV-2 RBD variants, validated by ELISA; the work is a credible proof of principle for AI-orchestrated interdisciplinary research with clear strengths and important caveats.
What’s New, Exactly
- Architecture: A Principal-Investigator agent creates domain agents and coordinates “team” and “individual” meetings; a Scientific Critic agent presses for specificity and catches errors, while a human provides light-touch steering.
- Contribution: An end-to-end computational design loop that mutates 4 known nanobodies (Ty1, H11-D4, Nb21, VHH-72) toward KP.3 binding using ESM LLR → AF-Multimer interface pLDDT → Rosetta ΔG, combined via a weighted score with iterative rounds, then wet-lab screens the top 92.
- Outcome: Broad expression success, retention of Wuhan-RBD binding for most series, and two mutants that gain moderate binding to JN.1 or KP.3 while preserving Wuhan binding.
How the pipeline actually works (and why it makes sense)
- Scoring: WS = 0.2·(ESM LLR) + 0.5·(AF ipLDDT) − 0.3·(RS ΔG). Final selection swaps in ESM LLR_WT (mutant vs wild type) to avoid round-to-round drift. The blend rewards plausibility (ESM), confident interfaces (AF-M), and favorable binding energies (Rosetta); see the sketch after this list.
- Why ESM here: protein language models have been shown to suggest evolutionarily plausible antibody mutations that often improve fitness without antigen conditioning, which pairs well with an antigen-aware structural stage. Nat Biotech
- Why AF-Multimer and Rosetta: AF-Multimer’s interface confidence correlates with plausible antibody–antigen geometry, though accuracy varies; Rosetta’s ΔG adds an orthogonal physical criterion. Protein Sci (open)
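A minimal sketch of the blended score, using the weights given in the paper’s scoring-math section (documented further below); the function and variable names are illustrative, not the agents’ actual script:

```python
def weighted_score(esm_llr: float, af_iplddt: float, rosetta_dg: float) -> float:
    """WS = 0.2*(ESM LLR) + 0.5*(AF ipLDDT) - 0.3*(Rosetta dG).
    The Rosetta term enters with a negative sign because more negative
    binding energy is better; ESM rewards sequence plausibility and
    ipLDDT rewards interface confidence."""
    return 0.2 * esm_llr + 0.5 * af_iplddt - 0.3 * rosetta_dg

def weighted_score_wt(esm_llr_wt: float, af_iplddt: float, rosetta_dg: float) -> float:
    """WS_WT: same blend, but the ESM term is the mutant-vs-wild-type LLR,
    so designs carrying different numbers of mutations stay comparable
    across rounds."""
    return weighted_score(esm_llr_wt, af_iplddt, rosetta_dg)
```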
Evidence quality
- Computational selection improves all three metrics across rounds: 85% of mutants beat wild-type on AF ipLDDT, 65% improve Rosetta ΔG, and all 92 have positive ESM LLR vs wild-type. These internal signals are consistent with enrichment of plausible binders.
- Wet-lab validation: periplasmic expression is strong for many designs; ELISAs across Wuhan, JN.1, KP.3, KP.2.3, and BA.2 show one Nb21 mutant with improved JN.1 binding and detectable KP.3 binding, and one Ty1 mutant with gained JN.1 binding; the Nb21 mutant’s reported EC50s are 0.2 ng mL⁻¹ (Wuhan) vs 2.0 ng mL⁻¹ (JN.1).
- Variant context: JN.1 and KP.3 are recent Omicron-descended lineages with immune-evasive substitutions; the panel choice fits the period’s epidemiology. Lancet — JN.1 Lancet — KP.3.x
Strengths
- Interdisciplinary tasking is explicit and auditable: prompts, roles, and meeting flows are methodized; a critic is institutionalized rather than ad-hoc. This is good systems design.
- Wall-clock time and cost are reported for the orchestration phase; agents wrote the core scripts and humans wrote the glue, which is honest about current autonomy ceilings.
- The scoring stack triangulates three complementary signals; the final WT-normalized ESM term is a smart correction for multi-round accumulation.
Limitations and threats to validity
- Binding is shown by ELISA to recombinant RBDs; there is no pseudovirus or live-virus neutralization, no kinetic constants (e.g., k_on/k_off by SPR), no epitope mapping; “improved binding” means ELISA signal and EC50 shifts, which can differ from neutralization potency.
- Structure stage uses AF-Multimer not AF3; antibody–antigen accuracy can vary with paratope heavy loops and glycan context; ipLDDT is confidence, not accuracy. Using AF3 or orthogonal docking would strengthen claims. Protein Sci (open) Nature — AF3
- Agent reliability: authors acknowledge knowledge-cutoff and hallucinations; they propose RAG or finetuning, but the current runs depend on prompt engineering, human curation, and sandboxed execution that is still manual. Nature
- Selection bias: starting from four well-studied nanobodies constrains search to local sequence neighborhoods; no de novo scaffolding is attempted. This is a scoped demonstration, not a general nanobody discovery engine.
How to read the headline results
- “Two promising” is meaningful for a rapid turnaround and a tiny wet-lab budget, yet it is modest versus clinical benchmarks; framing should be “credible leads created quickly,” not “AI solved antibody discovery.” Nature
- The Nb21 mutant’s Wuhan vs JN.1 EC50 ratio (~10×) suggests moderate cross-variant recognition; KP.3 ELISA signal exists but is weaker, consistent with JN.1→KP.3 similarity and escape. Next steps should prioritize epitope mapping and neutralization. Lancet — JN.1 Lancet — KP.3.x
Reproducibility and compute notes
Section titled “Reproducibility and compute notes”- Orchestration: total meeting time about 1–2 hours and ~20 in GPT-4o tokens after prompt tuning; a week to run ESM+AF-M+Rosetta; ~8 weeks for synthesis and ELISAs. These are transparent and actionable timelines for replication.
- Methods include prompt schemas and meeting templates, enabling ablations over agent roles, parallelization, and critic effects; the authors report that specialized agents beat generic clones.
Comparative lens — why this works now
- Protein LMs bias mutation proposals toward fitness-compatible changes, then structure-based scoring injects antigen specificity. The Virtual Lab’s novelty is not a new model but a governance and composition layer that reliably chains them. Protein Sci (open) Nature — AF3
What I would do next
- Swap AF-M for AF3 for complex prediction, then re-tune the weights into calibrated three-way scores (e.g., z-scores across batches) to reduce model-stage drift; see the calibration sketch after this list. Nature — AF3
- Add epitope-focused constraints: paratope masks, epitope conservation filters, glycan-aware modeling, and negative selection against off-targets. Protein Sci (open)
- Validate with orthogonal assays: BLI or SPR for kinetics, pseudovirus neutralization for function, and deep mutational scanning for escape profiling. Nature
- Harden agency: RAG over curated protocol docs and code, sandboxed tool execution, and automatic citation checking to mitigate hallucinations. Nature
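A sketch of the batchwise calibration idea, assuming NumPy; `zscore_by_batch` is a hypothetical helper, not anything from the paper:

```python
import numpy as np

def zscore_by_batch(scores: np.ndarray, batch_ids: np.ndarray) -> np.ndarray:
    """Standardize one model's raw scores within each prediction batch, so the
    blended weights mean the same thing after a model swap (e.g., AF-M -> AF3)."""
    out = np.empty_like(scores, dtype=float)
    for b in np.unique(batch_ids):
        mask = batch_ids == b
        mu, sigma = scores[mask].mean(), scores[mask].std()
        out[mask] = (scores[mask] - mu) / (sigma if sigma > 0 else 1.0)
    return out
```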
Safety note (important)
- The work targets benign antigen binding and reports only non-actionable wet-lab details; any extension to pathogen functionality must follow institutional biosafety, dual-use review, and controlled-access protocols. Keep the Virtual Lab’s retrieval corpus safety-filtered.
Visual — pipeline at a glance
(Pipeline diagram not reproduced here.)
Further Questions Luc Had…
Tools the agent was given access to (as discussed/selected in-meeting and implemented via code the agents wrote):
- ESM (protein language model) to compute mutation log-likelihood ratios;
- AlphaFold-Multimer to predict complex structure and interface pLDDT;
- Rosetta to relax structures and compute binding energy (RS ΔG).
The paper’s “Tools selection/implementation” and workflow sections explicitly list these three, and the Extended Data summarizes the scripts the agents produced (“Python script to compute ESM LLR…”, “Python script to extract AF-Multimer interface pLDDT…”, “Python/XML scripts for Rosetta dG”). Nature AAP
LLM used?:
GPT-4o powered all the agents in the Virtual Lab’s application to SARS-CoV-2 nanobody design. Nature AAP
Were those tools exposed as MCP (Model Context Protocol) servers?
No. The paper never mentions MCP and, in fact, states that all scripts were run by the human researcher; the authors frame future work as building sandboxed environments so agents could independently install and run tools—i.e., they did not have direct tool execution via a protocol like MCP in this study. Nature AAP
Did the agent refuse any action due to safety concerns?
No such refusals are reported. The text and methods describe agent discussions, code writing, and human-executed computation/experiments, but do not record any safety-motivated refusals by the agents. Nature AAP
Tasks handed off to humans:
- Running everything: “All scripts were run by the human researcher,” with humans also writing ancillary data-wrangling/job-scheduling scripts for the compute environment.
- Wet-lab & validation: humans synthesized/expressed/purified nanobodies and performed ELISA binding experiments.
- Timeline indicates human execution of the computational pipeline (ESM+AF-Multimer+Rosetta), then multi-week lab work. Nature AAP
Deep dive — the paper’s stated autonomy ceilings, with exact constraints and why they matter.
- Knowledge scope ceiling — model cutoff limits agent competence without external retrieval. The authors say the agents may miss up-to-date science and code because LLMs are trained only up to a cutoff, giving AlphaFold 3 vs AF-Multimer as a concrete miss. They propose providing relevant docs via retrieval-augmented generation or finetuning to reduce this ceiling. Nature AAP
- Tool-execution ceiling — no autonomous installation or running of software. The paper explicitly frames sandboxed environments as future work to let agents independently install computational or AI tools, then write, debug, and run code that uses those tools. In the present study, this autonomy is unavailable. Nature AAP
- Prompting ceiling — high dependence on prompt engineering and agenda iteration. The authors state that without appropriate guidance, agents give vague answers, forcing the human to refine agendas several times before getting desirable responses. They expect this burden to shrink as base models improve, yet it is a current ceiling. Nature AAP
- Epistemic reliability ceiling — hallucinations require critic pressure and human verification. The paper notes that agents can invent incorrect facts or citations. They mitigate through multi-agent critique and by giving agents access to resources to verify knowledge, while still requiring the human researcher to verify key facts and decisions using trusted sources. Nature AAP
- Execution ceiling — code is authored with agent help but executed by a human. The study states that all scripts were run by the human researcher, including environment-specific handling. This places a hard ceiling on end-to-end autonomy of the Virtual Lab in this experiment. Nature AAP
- Decision-authority ceiling — humans retain high-level control of scientific choices. The paper emphasizes that the human provides high-level guidance where agents lack context, such as selecting readily available computational tools and constraining experimental validation. Even as LLMs improve, humans remain vital to guide questions, methods, and analyses. Nature AAP
- Wet-lab autonomy ceiling — experiments are human-performed, not robotically automated by agents. The results were “experimentally validated by human researchers,” covering synthesis, expression, and ELISA assays. The Virtual Lab orchestrates design but does not autonomously operate lab hardware in this work. Nature AAP
- Meeting-process ceiling — multi-round convergence requires human agenda setting and the PI agent’s synthesizing role. The framework relies on structured team and individual meetings, with the PI guiding discussions and summarizing. This organization improves quality but encodes a human-steered process rather than free-form autonomous planning. Nature AAP
Implications and author-proposed paths to raise the ceiling. First, attach an always-fresh knowledge interface via RAG or finetuning to shrink cutoff gaps. Second, deliver a hardened sandbox so agents can install and run domain tools with constrained permissions. Third, keep a critic agent and require human verification for high-stakes decisions to contain hallucination risk while letting more execution move off the human. Fourth, preserve human authority on goals and values, while gradually delegating low-level execution once sandboxing and verification are in place. Nature AAP
Bottom line. In this study the Virtual Lab is an effective planner, explainer, and coder that still depends on humans for up-to-date knowledge injection, safe tool execution, empirical verification, and high-level choices. The ceilings are explicit and actionable: knowledge cutoff, prompting burden, hallucination control, lack of autonomous tool execution, human-run scripts, human decision authority, and human wet-lab performance. Nature AAP
Deeper dive — concrete next steps that extend the paper’s own “what to do next,” mapped to explicit limits and methods.
- Swap AF-Multimer for AF3 and calibrate scores across batches. Goal: reduce interface-confidence miscalibration and improve complex geometry.
Design: re-run the exact ESM→structure→Rosetta workflow with AF3 in place of AF-Multimer, keep Rosetta as an orthogonal energy readout, and normalize each model’s outputs to batchwise z-scores before combining. Compare WS vs WS_WT using a small held-out set of designs.
Metric: correlation of structure-stage scores to wet-lab outcomes, and re-weight via logistic regression from [ESM LLR, AF interface confidence, Rosetta ΔG] to a single calibrated score. Risk: AF3 compute cost and input constraints; mitigate by running AF-Multimer on the long tail and AF3 on the top-k. Nature AAP
- Add explicit epitope and paratope constraints during selection. Goal: bias toward variants that keep a consistent binding footprint across variants.
Design: use predicted complexes to define interface residues within 4 Å, require stability of that interface set across JN.1 and KP.3 models, and penalize designs that move the footprint. Use AF interface confidence to mask residues and forbid mutations at core paratope positions during later rounds.
Metric: number of designs that preserve an interface residue set across variants and their ELISA signal retention. Caveat: ipLDDT is confidence, not accuracy; treat interface maps as soft constraints (a footprint sketch appears after this list). Nature AAP
- Strengthen wet-lab readouts with kinetics and functional assays. Motivation: Methods document multiplexed ELISA, not kinetics or neutralization.
Next: add BLI or SPR for k_on, k_off, K_D on Wuhan, JN.1, KP.3 panels, and run pseudovirus neutralization as an activity check. Use the same 23 picks from WS_WT to avoid selection bias, then iterate only after adding these readouts to the score.
Metric: rank correlation between ELISA EC50 and K_D, and between K_D and neutralization IC50; prefer designs that improve all three.
- Harden the agency: retrieval and sandboxed execution. RAG: attach curated papers, protocol snippets, and tool docs so agents do not miss AF3 or recent variant context. Finetune or parameter-efficiently adapt agents on project prompts and meeting transcripts. Sandbox: containerize ESM, structure, Rosetta, and plotting; bind-mount only per-run working dirs; force all tool calls through a single exec shim that logs command lines and checksums (sketched after this list). Keep the critic agent’s “verify claims” step mandatory, and require the human researcher to validate key facts against trusted sources, as the paper advises.
- Re-weighting and multiple-hypothesis control for selection. Motivation: WS and WS_WT are sensible but hand-weighted; the pipeline makes hundreds of proposals per round. Plan: learn weights from held-out designs using cross-validated logistic regression or gradient boosting on [ESM LLR or LLR_WT, AF interface confidence, Rosetta ΔG], then freeze weights for a prospective run. Apply a simple Benjamini–Hochberg control over the panel ELISA p-values to manage false discoveries across variants (see the sketch after this list). Report both learned weights and post-hoc calibration plots.
- Ablations that test each tool’s marginal utility. Design: reproduce the Extended Data score-evolution plots under four regimes: ESM-only, AF-only, Rosetta-only, and full stack. Hold the wet-lab budget constant by testing the top 23 from each regime.
Metric: hit-rate above wild-type per variant, and mean improvement in ELISA EC50 vs WT.
Expectation: the full stack should dominate, but AF-only may close the gap if interface prediction improves with AF3.
- Negative selection and broader antigen panels. Use the printed antigen array format to include off-targets and legacy RBDs. Penalize designs that gain non-specific binding or lose Wuhan binding while chasing KP.3.
Option: include MERS RBD as a non-SARS-CoV control since the Methods already include its expression recipe, then treat any measurable binding as a red flag.
- Execution autonomy where safe, with preserved human authority. Let agents submit jobs inside the sandbox, but require human-approved agendas and stop-conditions per meeting. Keep the current paper’s principle that humans verify key facts and decide goals. Promote only low-risk automation first: data wrangling, batch scheduling, and plotting. Gate wet-lab actions behind human SOPs.
- Reporting upgrades for reproducibility. Publish per-round CSVs with raw ESM, AF, and Rosetta scores, learned weights, and WS_WT ranks; include code to regenerate the Extended Data plots from those files. Add a minimal “replicate this run” script that enumerates seeds and exact tool versions, then emits a lockfile for the environment build.
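Three sketches fleshing out the bullets above. First, the footprint-stability check, using Biopython; the chain ids and the 4 Å heavy-atom cutoff are assumptions about how the predicted complexes are written, not the paper’s definitions:

```python
from Bio.PDB import PDBParser, NeighborSearch

def footprint(pdb_path, nb_chain="A", ag_chain="B", cutoff=4.0):
    # Nanobody residues with any heavy atom within `cutoff` angstroms of the antigen chain.
    model = PDBParser(QUIET=True).get_structure("cplx", pdb_path)[0]
    ns = NeighborSearch([a for a in model[ag_chain].get_atoms() if a.element != "H"])
    return {res.id[1] for res in model[nb_chain]
            if any(a.element != "H" and ns.search(a.coord, cutoff) for a in res)}

def footprint_overlap(pdb_a, pdb_b):
    # Jaccard overlap of binding footprints across two variant complexes;
    # designs whose footprint moves (low overlap) get penalized or filtered.
    a, b = footprint(pdb_a), footprint(pdb_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0
```

Second, the learned re-weighting plus Benjamini–Hochberg control, assuming scikit-learn; the arrays below are placeholders for per-design features and wet-lab labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)
X = rng.normal(size=(92, 3))     # placeholder [ESM LLR_WT, AF ipLDDT, Rosetta dG] per design
y = rng.integers(0, 2, size=92)  # placeholder "beat wild-type in ELISA" labels

# Cross-validated logistic regression learns the blend from outcomes,
# replacing the hand-set 0.2/0.5/0.3 weights; freeze coefficients prospectively.
clf = LogisticRegressionCV(cv=5).fit(X, y)
print("learned weights:", clf.coef_[0])

def benjamini_hochberg(pvals, alpha=0.05):
    # Standard BH step-up: boolean mask of discoveries at FDR `alpha`.
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, p.size + 1) / p.size
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    mask = np.zeros(p.size, dtype=bool)
    mask[order[:k]] = True
    return mask
```

Third, a minimal version of the single exec shim; the JSONL audit log and per-run working directories are a hypothetical layout:

```python
import hashlib, json, shlex, subprocess, time
from pathlib import Path

LOG = Path("run_logs.jsonl")  # append-only audit trail

def run_tool(cmd: list[str], workdir: Path) -> subprocess.CompletedProcess:
    # Single choke point for tool calls: record the exact command line,
    # a checksum of every input file in the working dir, and the exit code.
    checksums = {p.name: hashlib.sha256(p.read_bytes()).hexdigest()
                 for p in sorted(workdir.iterdir()) if p.is_file()}
    proc = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
    with LOG.open("a") as f:
        f.write(json.dumps({"t": time.time(), "cmd": shlex.join(cmd),
                            "inputs": checksums, "rc": proc.returncode}) + "\n")
    return proc
```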
Bottom line: the authors already point to RAG, finetuning, and sandboxing as direct levers; pairing those with AF3, interface-aware constraints, stronger assays, learned scoring, and explicit ablations converts a promising demonstration into a robust, calibrated design engine while keeping humans in charge of goals and verification.
Deep documentation — “Tools selection/implementation” and “Workflow design” sections. Nature AAP
TOOLS SELECTION — how the stack was chosen. The team first fixed scope (mutating four Wuhan-binding nanobodies toward KP.3 RBD) and then ran a dedicated team meeting asking agents to list ML/computational tools, explicitly emphasizing pre-trained models for simplicity. They parallelized the meeting in five simultaneous runs and merged with the PI + Scientific Critic. The chosen stack: ESM (protein LM), AlphaFold-Multimer (structure/interface confidence), and Rosetta (binding energy). Nature AAP
TOOLS IMPLEMENTATION — governance & meeting mechanics. After picking tools, the PI held an individual meeting to assign the most appropriate scientist agent per tool. For each tool, they ran individual meetings with the selected agent + Scientific Critic, again in five parallel iterations, followed by a scientist-led merge. Meetings enforced agenda rules on code quality (good documentation; no undefined functions). Initial implementations had small errors; they ran one follow-up individual meeting (no parallelization, no Critic) to auto-fix all errors. Nature AAP
Per-tool deliverables (what code the agents actually produced; hedged re-implementation sketches follow this list):
- ESM usage — The Machine Learning Specialist authored a 130-line Python script with three functions: (1) main runner; (2) CLI arg parser (e.g., input nanobody sequence); (3) routine using a pre-trained ESM model to compute mutation log-likelihood ratios (LLR) for all single-point mutants against the input sequence. Nature AAP
- AlphaFold-Multimer usage — The Computational Biologist authored a 144-line Python script that: (1) verifies PDB format; (2) identifies interface residues; (3) computes interface pLDDT (ipLDDT); (4) batch-runs ipLDDT across a directory of predicted nanobody–RBD complexes and writes one CSV. ipLDDT is used as a confidence metric for the binding interface and has been shown to correlate with antibody–antigen affinity. Nature AAP
- Rosetta usage — The Computational Biologist created: (a) a 30-line RosettaScripts XML to relax the AF-Multimer complex and compute interface binding energy (Rosetta “dG-separated”) using the REF15 scoring function; (b) a 71-line Python script to read all score files and emit a CSV of energies per complex. Nature AAP
WORKFLOW DESIGN — the end-to-end loop formalized by the PI. The PI agent designed a round-based pipeline that integrates the three tools with explicit scoring and selection. For each starting nanobody, evaluate all single-point mutants with ESM; for the top 20 by ESM LLR, run structure+energy scoring and rank by a weighted score (WS). The top five by WS seed the next round; repeat for four rounds total (up to four mutations). Nature AAP
Scoring math — definition and rationale. Weighted score: WS = 0.2·(ESM LLR) + 0.5·(AF ipLDDT) − 0.3·(RS dG), with the Rosetta term carrying a negative sign because more negative binding energy is better. To fairly compare designs across rounds (different mutation counts), final selection uses WS_WT, which swaps in ESM LLR_WT (mutant vs wild-type) for the ESM term. Nature AAP
Operational specifics — versions, batching, and round logic. Round 0 computes AF-Multimer (via LocalColabFold v1.5.5) + Rosetta (v3.14) on the wild type and assigns ESM LLR=0. Round 1 enumerates every single-point mutation, retains the top 20 by ESM LLR, and runs AF-Multimer + Rosetta on each; rank by WS and keep top 5. Rounds 2–4 repeat from those five seeds, producing up to 100 candidates per round (20 per seed), again down-selecting to five by WS. After four rounds, select 23 per seed overall using WS_WT, for 92 total designs across four seeds. Nature AAP
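The round logic above, condensed into a sketch; `propose`, `score`, `ws`, and `ws_wt` are placeholders for the agents’ ESM, AF-Multimer + Rosetta, and weighted-scoring steps:

```python
def design_one_nanobody(wt_seq, propose, score, ws, ws_wt,
                        n_rounds=4, esm_top=20, keep=5, final=23):
    """Rounds 1..n: 20 ESM-ranked single mutants per seed, structure + energy
    scoring, keep the top 5 by WS as next-round seeds; after the last round,
    pick 23 designs per starting nanobody by WS_WT."""
    seeds, seen = [wt_seq], set()
    for _ in range(n_rounds):
        cands = {m for s in seeds for m in propose(s, k=esm_top)}  # 20 per seed by ESM LLR
        for c in cands:
            score(c)  # AF-Multimer ipLDDT + Rosetta dG, cached per sequence
        seen |= cands
        seeds = sorted(cands, key=ws, reverse=True)[:keep]         # top 5 by WS
    return sorted(seen, key=ws_wt, reverse=True)[:final]           # 23 per nanobody via WS_WT
```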
Where these sections sit in the paper — quick map. “Tools selection” (team meeting, pre-trained bias, ESM/AF-M/Rosetta choice). “Tools implementation” (agent assignment via PI; agent+Critic parallel code-writing; agenda rules; fix-up pass). “Workflow design” (PI formalizes WS, rounds, and thresholds). The “Nanobody design workflow” subsection then narrates the operational run and the WS_WT final selection used for the 92-design panel. Nature AAP