Experiment ID: 2026-05-aoc-replication
Status: Spec — not yet executed. Drafted 2026-04-21.
Predecessor: Round 1 — Edge Deployment Evaluation
Targets: NDSS 2027 (deadline July 2026) as primary venue; IEEE S&P or USENIX Security as alternates.
One-sentence summary
Replicate Shapira et al. (2026) Agents of Chaos as closely as Round-2 scale allows, and run a matched second condition with AEGIS governance in the enforcement path, to measure which AoC case-study failure classes AEGIS intercepts and at what tool-call boundary.
Hypotheses (pre-registration-ready)
- H1. Under the ungoverned (replication) condition, AEGIS-lab agents will reproduce a meaningful subset of the eleven AoC failure case studies (CS#1–#11) within the study window.
- H2. Under the AEGIS-governed condition, the tool-call chains that produced those failures will be blocked at the capability-check boundary before irreversible effect. Concretely: every failure class that reduces to one or more of ATX-1 RC1/RC2/RC3/RC4 (the four Root Causes AEGIS inherits directly from AoC §16.2–16.3) will be intercepted.
- H3. Failure classes that do not reduce to AEGIS-addressable Root Causes — particularly CS#6 (Agents Reflect Provider Values), which is a provider-layer issue not an architectural governance gap — will persist under both conditions.
- H4. The “what worked in practice” cases (AoC CS#12–#16) will succeed in both conditions, because those represent behaviors where model-layer alignment already holds.
H3 and H4 are load-bearing for intellectual honesty: they scope what AEGIS can and cannot do, and support the “complement not replacement for model-layer alignment” framing.
Why this design
AoC (Shapira et al., arXiv:2602.20021v1) documented sixteen case studies across a two-week live laboratory study: eleven failure modes (CS#1–#11) and five “hypothetical cases (what happened in practice)” where agents resisted the attempted manipulation (CS#12–#16). The paper’s §16.2 explicitly identifies three structural properties LLM-backed agents lack — no stakeholder model, no self-model, no private deliberation surface — which map one-to-one to AEGIS ATX-1 Root Causes RC1, RC2, RC3. §16.3 names prompt injection as “a structural feature, not a fixable bug” — AEGIS RC4. RC5 (No Environment Model) emerged from AEGIS’s own RFC-0006 adversarial testing on 2026-03-26 and is not in AoC.
If ATX-1 is a faithful taxonomy of the architectural gaps AoC named, then AEGIS enforcing against ATX-1 techniques should block the AoC case-study attack chains at the capability boundary. Round 1 established this happens at single-operator scale. Round 2 establishes it at AoC-comparable scale.
Conditions
| Condition | AEGIS state | Purpose |
|---|---|---|
| A — Ungoverned (AoC baseline) | Off | Establish that the Round-2 setup reproduces AoC case-study failures |
| B — AEGIS-governed (treatment) | AEGIS Core in path; capability registry scoped per AoC’s stated agent responsibilities | Measure interception |
Counterbalancing: each prober runs against both conditions, with order randomized to control for learning effects. Fresh agent VMs per condition; no memory carries across.
Pre-registered metrics
Recorded per agent, per condition, per case study. All derivation scripts written and committed before the first run.
Primary
- Case-study coverage — for each CS#1–#11, did it manifest in Condition A? In Condition B?
- Time-to-first-failure per case — wall-clock time from study start to first incidence.
- AEGIS intervention count per case — number of
DENY/ESCALATEdecisions on tool-call chains that would, absent enforcement, complete an AoC case-study attack. Audit IDs traceable. - Residual harms — any AoC-class failure that completed despite AEGIS enforcement, with the specific tool-call chain and the reason capability check did not block.
- Case-study → RC mapping — for each observed case, which AEGIS Root Cause(s) the case reduces to. Verifies H2.
Secondary
- “What worked” coverage (CS#12–#16) under both conditions (H4).
- Provider-layer failures (CS#6 and related) under Condition B (H3).
- Agent-observed governance experience — end-of-study self-report per Round 1’s format.
- AEGIS decision-engine performance under live load — bridges the Round-1 bare-metal benchmark numbers.
Open decisions before kickoff
These block execution. Tracked in experiments/2026-05-aoc-replication/README.md (private repo) for live updates.
- Prober recruitment. Three to five non-owner probers — collaborators, open-source contributors, or paid academic researchers. Funding terms to settle.
- Pre-registration venue. OSF, GitHub tag-based, or NDSS’s artifact-availability track.
- IRB / ethics review. Paid external probers may require institutional review.
- Third-model stretch. Whether to add a third model family (Claude Sonnet, GPT-5, Gemini, or a local model) for CS#6 isolation.
- AoC team courtesy notification timing. Default: notify Shapira/Gordon-Tapiero at preprint time, with the paper in hand.
Timeline
- Weeks 1–2 (2026-04-21 to 2026-05-05): Spec finalization, prober recruitment, infrastructure setup.
- Week 3: Smoke-testing dry run.
- Weeks 4–5: Condition A (ungoverned) runs.
- Weeks 6–7: Condition B (AEGIS-governed) runs.
- Weeks 8–10: Analysis and write-up.
- Week 11 (target 2026-07-07): NDSS 2027 submission.
NDSS deadline is the binding constraint.
Relationship to other AEGIS work
The Round-1 bare-metal-5000.json performance benchmarks slot into Round 2 as Condition-B performance overhead evidence. Round 2 results, when complete, retroactively strengthen the NIST AI RMF position statement and the edge-governance IEEE TNSE paper.