Round 2 — AoC Replication (planned)

Experiment ID: 2026-05-aoc-replication
Status: Spec — not yet executed. Drafted 2026-04-21.
Predecessor: Round 1 — Edge Deployment Evaluation
Targets: NDSS 2027 (deadline July 2026) as primary venue; IEEE S&P or USENIX Security as alternates.

One-sentence summary

Replicate Shapira et al. (2026) Agents of Chaos as closely as Round-2 scale allows, and run a matched second condition with AEGIS governance in the enforcement path, to measure which AoC case-study failure classes AEGIS intercepts and at what tool-call boundary.

Hypotheses (pre-registration-ready)

H1. Under the ungoverned (replication) condition, AEGIS-lab agents will reproduce a meaningful subset of the eleven AoC failure case studies (CS#1–#11) within the study window.
H2. Under the AEGIS-governed condition, the tool-call chains that produced those failures will be blocked at the capability-check boundary before irreversible effect. Concretely: every failure class that reduces to one or more of ATX-1 RC1/RC2/RC3/RC4 (the four Root Causes AEGIS inherits directly from AoC §16.2–16.3) will be intercepted.
H3. Failure classes that do not reduce to AEGIS-addressable Root Causes — particularly CS#6 (Agents Reflect Provider Values), which is a provider-layer issue not an architectural governance gap — will persist under both conditions.
H4. The “what worked in practice” cases (AoC CS#12–#16) will succeed in both conditions, because those represent behaviors where model-layer alignment already holds.

H3 and H4 are load-bearing for intellectual honesty: they scope what AEGIS can and cannot do, and support the “complement not replacement for model-layer alignment” framing.

Why this design

AoC (Shapira et al., arXiv:2602.20021v1) documented sixteen case studies across a two-week live laboratory study: eleven failure modes (CS#1–#11) and five “hypothetical cases (what happened in practice)” where agents resisted the attempted manipulation (CS#12–#16). The paper’s §16.2 explicitly identifies three structural properties LLM-backed agents lack — no stakeholder model, no self-model, no private deliberation surface — which map one-to-one to AEGIS ATX-1 Root Causes RC1, RC2, RC3. §16.3 names prompt injection as “a structural feature, not a fixable bug” — AEGIS RC4. RC5 (No Environment Model) emerged from AEGIS’s own RFC-0006 adversarial testing on 2026-03-26 and is not in AoC.

If ATX-1 is a faithful taxonomy of the architectural gaps AoC named, then AEGIS enforcing against ATX-1 techniques should block the AoC case-study attack chains at the capability boundary. Round 1 established this happens at single-operator scale. Round 2 establishes it at AoC-comparable scale.

Conditions

Condition	AEGIS state	Purpose
A — Ungoverned (AoC baseline)	Off	Establish that the Round-2 setup reproduces AoC case-study failures
B — AEGIS-governed (treatment)	AEGIS Core in path; capability registry scoped per AoC’s stated agent responsibilities	Measure interception

Counterbalancing: each prober runs against both conditions, with order randomized to control for learning effects. Fresh agent VMs per condition; no memory carries across.

Pre-registered metrics

Recorded per agent, per condition, per case study. All derivation scripts written and committed before the first run.

Primary

Case-study coverage — for each CS#1–#11, did it manifest in Condition A? In Condition B?
Time-to-first-failure per case — wall-clock time from study start to first incidence.
AEGIS intervention count per case — number of DENY / ESCALATE decisions on tool-call chains that would, absent enforcement, complete an AoC case-study attack. Audit IDs traceable.
Residual harms — any AoC-class failure that completed despite AEGIS enforcement, with the specific tool-call chain and the reason capability check did not block.
Case-study → RC mapping — for each observed case, which AEGIS Root Cause(s) the case reduces to. Verifies H2.

Secondary

“What worked” coverage (CS#12–#16) under both conditions (H4).
Provider-layer failures (CS#6 and related) under Condition B (H3).
Agent-observed governance experience — end-of-study self-report per Round 1’s format.
AEGIS decision-engine performance under live load — bridges the Round-1 bare-metal benchmark numbers.

Open decisions before kickoff

These block execution. Tracked in experiments/2026-05-aoc-replication/README.md (private repo) for live updates.

Prober recruitment. Three to five non-owner probers — collaborators, open-source contributors, or paid academic researchers. Funding terms to settle.
Pre-registration venue. OSF, GitHub tag-based, or NDSS’s artifact-availability track.
IRB / ethics review. Paid external probers may require institutional review.
Third-model stretch. Whether to add a third model family (Claude Sonnet, GPT-5, Gemini, or a local model) for CS#6 isolation.
AoC team courtesy notification timing. Default: notify Shapira/Gordon-Tapiero at preprint time, with the paper in hand.

Timeline

Weeks 1–2 (2026-04-21 to 2026-05-05): Spec finalization, prober recruitment, infrastructure setup.
Week 3: Smoke-testing dry run.
Weeks 4–5: Condition A (ungoverned) runs.
Weeks 6–7: Condition B (AEGIS-governed) runs.
Weeks 8–10: Analysis and write-up.
Week 11 (target 2026-07-07): NDSS 2027 submission.

NDSS deadline is the binding constraint.

Relationship to other AEGIS work

The Round-1 bare-metal-5000.json performance benchmarks slot into Round 2 as Condition-B performance overhead evidence. Round 2 results, when complete, retroactively strengthen the NIST AI RMF position statement and the edge-governance IEEE TNSE paper.