Adversarial Security Evaluation in the Age of Autonomous Attackers
The Infrastructure Strikes Back is the published Field Report No. 02 in the Adversarial AI Experiment Series, the work product of Efficient Frontier Labs, LLC. © 2026 Efficient Frontier Labs, LLC. All rights reserved.
Discovery is becoming commoditized. Defensive coherence, observability under pressure, and exploit composition are the new bottlenecks. We support this claim with evidence from The Infrastructure Strikes Back, the second experiment in the Adversarial AI Experiment Series, a live two-phase adversarial exercise in which human-plus-AI red teams attacked a starter application and then targeted blue-team-fortified deployments during an active defense window.
Across roughly two hours, the event generated more than 210 raw submissions that deduplicated into 29 canonical vulnerability families. The first critical-severity finding arrived six minutes and thirty seconds after event start. This report's quantitative comparison covers four attacked targets with usable telemetry: one unhardened reference deployment and three defended deployments. Additional blue-team deployments existed in the scored exercise but are excluded from the quantitative tables because they did not produce usable attack-window telemetry. Platform-side telemetry from the reference deployment, the only target where ground truth could be measured, recorded 121 HTTP 500 crashes in the live window. The application's own event journal recorded 13, a measured 89% miss rate on crash events, attributable entirely to two routes whose crashes occurred before the handler's logging call could execute. The defender's telemetry was lying, and the attacker knew it.
A subsequent adversarial session at ClawCamp 2026 on April 16 ran a head-to-head comparison of the original naive codebase and a hardened rewrite, both exposed to the same attacker pool. The naive deployment recorded 11 crashes; the hardened deployment recorded zero, at a hardening cost of approximately 5–10ms of CPU per request. The hardened deployment nonetheless carried a Critical Framework-surface exposure via an unpatched Next.js release: the interpretation rule at the heart of the benchmark, playing out against a target that was demonstrably hardened at the application plane.
The data showed that raw finding volume was a poor measure of security performance. The strongest offensive outcomes came from exploit chaining and operational reasoning. The strongest defensive outcomes depended on continuity under pressure, targeted adaptation, and retained visibility into failure states. Three defended deployments starting from identical codebases produced three measurably distinct defensive signatures (deception-heavy, rate-limit-focused, and minimal-adaptation) despite beginning from the same seeded flaws.
We therefore propose CHES-4D, the Four-Dimensional Adversarial Security Framework, as the application-plane measurement core, built around four measurable dimensions: Composition Depth, Harm Reduction Rate, Endurance Under Attack, and Signal Fidelity. We also introduce CHESS Surfaces, a parallel qualification layer that measures whether application-plane scores can be interpreted as representative of deployed posture. CHESS Surfaces was part of the benchmark architecture from the experiment's design. Two empirical validation cases emerged from within the experiment series itself: the April 9 event produced the first, through a substrate-mediated state-reset mechanism in which ephemeral in-memory storage wiped during a defending team's own redeployments undermined an otherwise strong application-plane posture (Sections 6.3 and 7.4), and the April 16 ClawCamp 2026 session produced the second, a Framework-category Critical finding against a demonstrably hardened deployment (Section 7.5). Together, CHES-4D and CHESS Surfaces constitute CHESS, the broader benchmark architecture developed in a companion implementation specification.
The Infrastructure Strikes Back is the second experiment in the Adversarial AI Experiment Series, a research program investigating how human-plus-AI adversarial dynamics are reshaping security evaluation. The first experiment, The Penthouse Heist, established that structurally distinct AI-assisted attack architectures independently converged on the same high-value weaknesses within hours. This second experiment asked the complementary question: what becomes measurable when defenders are live, adaptive, instrumented, and allowed to fight back in real time?
The answer is that defensive quality is multidimensional, and no single metric captures it. Three blue teams began from identical codebases containing the same seeded architectural flaws, yet produced three measurably distinct defensive signatures during the live attack window. One sharply reduced real attack success through broad deception but suffered the highest crash volume and the weakest telemetry fidelity. A second preserved continuity and honest telemetry through rate limiting paired with targeted deception on a single high-risk path, at the cost of only modest reductions in attacker success. A third hardened the code in advance, made minimal live adaptations, and was statistically difficult to distinguish from the unhardened baseline.
A second finding, confirmed against platform-side ground truth, is that the application's own event journal missed 89% of crashes on the reference deployment. Only the route whose logging call wrapped the throw site in a try/catch block recorded its crashes; the two routes whose logging calls sat inside the handler body recorded zero of 108 crashes. This is the paper's strongest empirical result: a measured gap between what the defender's instrumentation saw and what actually happened, verified against platform telemetry.
No single scalar score ranked the defended deployments correctly. That empirical result, together with the measured telemetry gap, motivates the benchmark architecture proposed in Section 7.
This paper makes four claims.
| Dimension | Claim |
|---|---|
| Research program | The first two experiments in the series establish complementary findings: attack-side convergence and defense-side multidimensionality. |
| Empirical finding | Crash-before-log demonstrates that under machine-speed attack, defenders can lose visibility before they lose availability — measured here at an 89% telemetry-fidelity gap against platform-side ground truth. |
| Framework contribution | CHESS, a two-plane benchmark architecture, combines CHES-4D (Composition Depth, Harm Reduction Rate, Endurance Under Attack, Signal Fidelity) at the application plane with CHESS Surfaces (seven substrate and boundary categories) as a parallel qualification layer. |
| Benchmark implication | Future security evaluation should score workflow behavior — chaining, adaptation, continuity, telemetry fidelity — rather than relying on bug counts alone, and should read application-plane scores in the context of substrate exposure. |
This paper is a field report with benchmark implications. It does not claim to be a universal benchmark. It argues, based on telemetry from a live exercise, that a multi-plane adversarial view of security evaluation is now necessary.
The first experiment in the series, The Penthouse Heist, focused on attack-side convergence. Multiple structurally distinct AI-assisted attack architectures (including hierarchical swarms, single autonomous agents, AI-parallelized testing, and AI combined with existing pentesting tooling) independently converged on the same economically meaningful weaknesses within hours. The central lesson was that discovery is becoming commoditized.
The Infrastructure Strikes Back extends that line of inquiry. Instead of asking whether autonomous or AI-assisted attackers can find bugs quickly, it asks what becomes measurable when both sides of the loop are live: when defenders prepare, deploy, observe, adapt, and attempt to preserve service while autonomous attack traffic is already hitting production-like targets.
Taken together, the first two experiments support a cumulative thesis. The first shows that heterogeneous offensive systems converge. The second shows that defensive quality under live pressure is multidimensional. The implication is that future evaluation must move from isolated vulnerability findings to workflow-level security measurement.
The original framing of this work argued for a four-dimensional application-plane measurement layer. That framing still holds. But subsequent analysis sharpened a missing qualification: a system may score well inside the defended application boundary while still being invalidated by substrate weakness below that boundary.
Modern security evaluation must therefore operate as a multi-plane assessment. The concrete contribution of the CHESS framework spans two primary planes: the application plane, measured by CHES-4D, and the substrate and boundary plane, qualified by CHESS Surfaces.
Both planes are developed in Section 7 and specified in full in the companion implementation document.
The event was conducted against a synthetic target with no connection to production systems. Each domain contained exactly three intentionally seeded architectural flaws, marked SEEDED FLAW in source comments so that observed differences in defensive performance would be interpretable: teams started from the same codebase and were not solving different security problems, only responding to them differently. Participants operated within a defined scope and rules of engagement. Findings were disclosed only to organizers and judges. The event was designed to generate research evidence about human-agent security dynamics, not to train offensive operations against live third-party systems.
The exercise was split into two one-hour phases. During Phase 1, blue teams hardened the starter codebase, fixed seeded flaws, and deployed their versions. During Phase 2, red-team developers released autonomous agents and supporting offensive workflows against those deployments while blue teams watched telemetry, hot-fixed, redeployed, and attempted to preserve normal user functionality.
Blue teams were permitted to fight back within a bounded tactical space: rate limiting, IP blocking, ban lists, tarpits, honeypot routes returning fake data to classified adversarial traffic, additional telemetry, and any combination deployed as hot-fixes. They were forbidden from attacking other teams, modifying files outside their assigned domain, adding dependencies, or breaking legitimate traffic with deception.
Blue teams were scored using a six-category rubric with explicit weights and behavioral anchors:
| # | Category | Weight | What judges evaluated |
|---|---|---|---|
| 1 | Initial design quality | 25% | Did the team find and fix the seeded flaws? Were the code changes sound and proportionate? |
| 2 | Coverage of surface | 15% | Did the team address the highest-risk paths in their domain or only cosmetic issues? |
| 3 | Stability and continuity | 15% | Did the happy-path probe pass under pressure? |
| 4 | Live adaptation | 20% | Did the team read telemetry and respond to what red teams were actually doing? |
| 5 | Live exploit reduction | 15% | How much did attack success decline during the live window? |
| 6 | Explanation and rationale | 10% | Could the team clearly explain what they changed and why? |
The weighting reflected a hypothesis: under adaptive autonomous attack, static hardening would matter, but real-time defensive coherence would matter more. For that reason, the live-window categories (Stability, Live adaptation, Live exploit reduction — combined 50%) were weighted more heavily than many traditional code-review-style evaluations would allow. The results broadly validated that hypothesis.
A scripted probe with user-agent HLOS-Judge-Probe/1.0 ran against blue deployments every ten minutes during Phase 2, testing signup, login, authenticated read, and authenticated write. If the probe failed at any scored checkpoint, the blue team scored zero on Stability and continuity (15%) and also forfeited the other two live-window categories, Live adaptation and Live exploit reduction, a combined 35% of the total score.
This was not an operational footnote. It formalized the central tradeoff of production defense: defenders can rate-limit, tarpit, or deceive, but they cannot break the system they are defending. The gate operationalized the availability-under-defense constraint that many analyst and standards discussions describe only abstractly. In practical terms, it prevented blue teams from "winning" by making the target unusable.
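The interaction between the rubric weights and the probe gate can be sketched as follows. This is an illustrative model, not the judges' scoring tool: the category keys, the 0-to-1 mark scale, and the `totalScore` function are assumptions introduced here, while the weights and the gated categories come from the rubric and gate described above.

```typescript
type Category =
  | "design"
  | "coverage"
  | "stability"
  | "adaptation"
  | "exploitReduction"
  | "explanation";

// Weights from the rubric table (percent of the total score).
const WEIGHTS: Record<Category, number> = {
  design: 25,
  coverage: 15,
  stability: 15,
  adaptation: 20,
  exploitReduction: 15,
  explanation: 10,
};

// Categories that score zero when the happy-path probe fails.
const GATED: Category[] = ["stability", "adaptation", "exploitReduction"];

// `marks` holds a judge mark per category in [0, 1] (assumed scale).
function totalScore(marks: Record<Category, number>, probePassed: boolean): number {
  let total = 0;
  for (const category of Object.keys(WEIGHTS) as Category[]) {
    const gatedOut = !probePassed && GATED.includes(category);
    total += gatedOut ? 0 : WEIGHTS[category] * marks[category];
  }
  return total;
}

const perfectMarks: Record<Category, number> = {
  design: 1,
  coverage: 1,
  stability: 1,
  adaptation: 1,
  exploitReduction: 1,
  explanation: 1,
};

const scoreWithProbe = totalScore(perfectMarks, true); // 100
const scoreWithoutProbe = totalScore(perfectMarks, false); // 50
```

Under these assumptions, a team with perfect marks that breaks the happy path keeps only the 50 points attached to design, coverage, and explanation, which is why the gate dominated tactical choices during the live window.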
Raw submissions were consolidated into canonical vulnerability families when multiple reports described the same underlying control failure. The scale of overlap was substantial. Password-reset-token disclosure alone was reported independently more than 30 times; step-up authentication bypass generated more than 15 submissions; IDOR on profile update generated approximately 10; type-confusion crashes more than 15; session cookies missing the Secure attribute more than 10. That pattern is itself evidence for convergence: several independent offensive systems, operating in parallel with limited coordination, converged on the same vulnerability families within minutes of one another.
A methodological note on contested findings. This paper reports red team findings as submitted during the April 9 event, and defense profiles as inferred from judge-aggregator telemetry captured during the live window. Post-event, blue teams were invited to provide written responses clarifying their defensive posture and contesting specific red team observations. Where such responses have been received, the paper incorporates them, most substantively in Section 4 (attack composition), Section 6 (defensive profiles), and Sections 7.4–7.5 (the two in-experiment Surfaces validation cases). Refoldla's post-event response revealed that several "re-registration" claims against their target were mediated by the starter application's ephemeral in-memory storage: defensive redeployments (ban list updates, hot-fixes) restarted the serverless function instances and wiped the application state, which enabled attackers to re-register privileged usernames on the empty store. This is a substrate-level observation: defensive operations that restart the application become attack-enablers when state lives in memory. It is the first of the two in-experiment CHESS Surfaces validation cases and is developed in Section 7.4. A full adjudication of red-versus-blue disagreements is out of scope for this report. This paper reports findings as of April 22, 2026; additional blue-team written responses, platform-side telemetry, or other material evidence received after publication will be incorporated in dated revisions, logged in the revision history (Appendix D).
This section contains the paper's strongest empirical finding, measured against platform ground truth.
The scored exercise had five blue-team deployments in addition to the unhardened reference, with additional late or unshared deployments outside the red-team attack window. Analysis in this section is restricted to the three defended deployments for which the judge aggregator captured sustained telemetry during the live window — Saikavya, Refoldla, and Manny — plus the reference deployment. Two blue-team deployments are out of scope for the quantitative comparison: one team's deployment link reached the red teams too late in the live window to accumulate sufficient attack traffic for analysis; the other lacked usable aggregator data because the default admin token was rejected by the helper safety check. The application's judge-aggregator captured per-team event-log polls every 15 seconds. Only the reference deployment resided in a Vercel team we controlled, so only the reference deployment's crash counts could be verified against platform-level telemetry via vercel metrics. The three defended deployments analyzed below were hosted on separate Vercel accounts; their crash counts are drawn from journal-logged events only, and any extrapolation to platform-level totals is inference, not measurement.
Within that scope, the ground-truth comparison on the reference deployment is unambiguous:
| Route | Platform-side 500s | Journal-logged 500s | Missed |
|---|---|---|---|
| /api/auth/login | 75 | 0 | −75 |
| /api/identity/reset | 33 | 0 | −33 |
| /api/actions/create | 13 | 13 | 0 |
| Total (reference) | 121 | 13 | −108 (89% missed) |
The two routes whose handlers threw before reaching their logEvent() call recorded zero of 108 crashes in the journal. The one route whose throw sites were wrapped in try/catch + logEvent() recorded 13 of 13. Journal-logged 5xx counts for the defended deployments were 39 (Saikavya), 0 (Manny), and 0 (Refoldla), all concentrated on the same wrapped route. Without platform access we cannot measure the true totals for those three deployments, but because all four forked the same starter codebase and shared the same handler structure, the same telemetry blind spot was present in every defended deployment's code. What that blind spot actually undercounted on each target is unmeasured: a deployment that stayed blind to its crashes and one that patched the type-confused paths and had fewer crashes to miss are indistinguishable from the journal alone. The 89% undercount is therefore a demonstrated property of the reference deployment and an illustration of a blind spot the defended deployments shared structurally, not a measured undercount rate for them.
The measured 89% miss rate on the reference deployment is the paper's strongest single empirical result. It is not inferred from response-duration heuristics or derived from any proxy; it is a direct comparison of the defender's own instrumentation against the underlying platform's request counters for the same routes in the same window.
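The arithmetic behind the 89% figure can be reproduced directly from the per-route counts in the table. The sketch below only restates that table; the `RouteCrashCounts` type and `missRate` function are names introduced here for illustration.

```typescript
// Per-route counts copied from the reference-deployment table.
interface RouteCrashCounts {
  route: string;
  platform500s: number; // HTTP 500s counted by platform-side telemetry
  journal500s: number; // HTTP 500s recorded by the application's event journal
}

const referenceCounts: RouteCrashCounts[] = [
  { route: "/api/auth/login", platform500s: 75, journal500s: 0 },
  { route: "/api/identity/reset", platform500s: 33, journal500s: 0 },
  { route: "/api/actions/create", platform500s: 13, journal500s: 13 },
];

// Fraction of platform-observed crashes that never reached the journal.
function missRate(rows: RouteCrashCounts[]): number {
  const platform = rows.reduce((sum, r) => sum + r.platform500s, 0);
  const journal = rows.reduce((sum, r) => sum + r.journal500s, 0);
  return platform === 0 ? 0 : (platform - journal) / platform;
}

// (121 - 13) / 121 = 108 / 121, which rounds to 89%.
const overallMissPercent = Math.round(missRate(referenceCounts) * 100);
```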
The architectural cause is simple. The journal was implemented as a per-route try/catch + logEvent() pattern, meaning a handler that threw before reaching the logEvent() line produced no log entry. The crashes on /api/auth/login and /api/identity/reset were all triggered by type-confused payloads (non-string fields that the handler attempted to treat as strings), which threw at parse time, before any defensive logging could run. The handler died; so did its ability to describe its own death.
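A minimal sketch of the failure mode follows. It is not the starter application's code: the handler body, the `platformInvoke` stand-in for the hosting platform, and the payload are hypothetical, and only the `logEvent()` name and the route path are taken from the report. The point it illustrates is that a throw at parse time skips a logging call placed later in the handler.

```typescript
type JournalEntry = { route: string; status: number };
const eventJournal: JournalEntry[] = [];

function logEvent(route: string, status: number): void {
  eventJournal.push({ route, status });
}

// Hypothetical handler shape: parsing happens before any logging call.
function loginHandler(body: { username?: unknown }): number {
  // Throws a TypeError when username is not a string (for example, an object).
  const username = (body.username as string).trim();
  const status = username.length > 0 ? 200 : 400;
  logEvent("/api/auth/login", status); // never reached if the line above throws
  return status;
}

// Stand-in for the platform: an uncaught throw surfaces as an HTTP 500.
function platformInvoke(handler: () => number): number {
  try {
    return handler();
  } catch {
    return 500;
  }
}

// A type-confused payload crashes the handler; a well-formed one does not.
const crashStatus = platformInvoke(() => loginHandler({ username: { nested: true } }));
const okStatus = platformInvoke(() => loginHandler({ username: "alice" }));

// The platform saw one 500, but the journal holds only the successful request.
const journaled500s = eventJournal.filter((e) => e.status === 500).length;
```

In this sketch the platform-side view contains a 500 while the journal contains none, the same shape of disagreement measured on the reference deployment.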
The blind spot was independently confirmed from the attacker's perspective. One red team reported that they sent more than 300 crash-inducing requests using type-confused payloads and then inspected the admin event logs for matching failures. They found none. In their words, the crash occurred before the logging path, leaving the induced failure effectively invisible to monitoring.
This is a particularly important kind of result because it is convergent in two directions at once. Platform-side telemetry confirmed the defender's instrumentation was blind. Attacker-side experimentation independently established the same conclusion through behavioral observation.
The architectural lesson is broader than the specific flaw. Logging placement depth (whether telemetry is captured globally or only inside handlers) should be treated as a first-class security property and audited accordingly. A "clean" dashboard may simply mean the system failed before it could describe its own failure.
A subsequent adversarial session at ClawCamp 2026 on April 16 tested this conclusion directly. A hardened rewrite of the starter application (infra-strikes-back-v2) moved the journal write to a middleware layer that runs before any handler business logic, and added a process-level crash handler that fires after a handler throws. Against the same attacker pool that crashed the original codebase 11 times in the same window, the hardened rewrite recorded zero crashes on platform-side telemetry. The structural fix held.
The hardening cost approximately 5–10ms of additional CPU per request at p95, negligible at the session's volume. This is an important point for future benchmark deployments: telemetry placement is a security property, and the cost of getting it right is small.
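The structural idea behind the fix can be sketched as a wrapper that journals outside the handler's own code path. This is a simplified illustration under stated assumptions, not the infra-strikes-back-v2 implementation: `withJournal`, `JournalRecord`, and `fragileReset` are hypothetical names, and a try/catch in the wrapper stands in for the process-level crash handler described above.

```typescript
type JournalRecord = {
  route: string;
  phase: "received" | "completed" | "crashed";
  status?: number;
};
const journalV2: JournalRecord[] = [];

type RouteHandler = (body: unknown) => number; // returns an HTTP status code

// Wraps any handler so that journaling does not depend on the handler reaching
// its own logging line.
function withJournal(route: string, handler: RouteHandler): RouteHandler {
  return (body: unknown): number => {
    journalV2.push({ route, phase: "received" }); // written before business logic runs
    try {
      const status = handler(body);
      journalV2.push({ route, phase: "completed", status });
      return status;
    } catch {
      journalV2.push({ route, phase: "crashed", status: 500 }); // the crash is recorded
      return 500;
    }
  };
}

// Hypothetical fragile handler: throws when the token is not a string.
const fragileReset: RouteHandler = (body) => {
  const token = (body as { token: string }).token.toUpperCase();
  return token.length > 0 ? 200 : 400;
};

const journaledReset = withJournal("/api/identity/reset", fragileReset);
const statusOnBadInput = journaledReset({ token: 42 });
const crashRecords = journalV2.filter((r) => r.phase === "crashed").length;
```

Because the "received" entry is written first, even a crash that escaped the catch path would leave a request with no matching completion, which is itself a detectable signal.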
As David Wagner (UC Berkeley) observed during pre-release review of these findings, achieving true telemetry resilience under live fire suggests that the observability layer should be architecturally orthogonal to the application itself, ensuring that if the application is compromised, disrupted, or crashed, the security telemetry stream is not eliminated. The middleware-based journaling and platform-level metrics separation tested at ClawCamp represent a practical step toward that orthogonal design, demonstrating that security telemetry can be structurally isolated at negligible operational overhead.
This finding clarifies why observability belongs inside security evaluation rather than beside it. Traditional security scoring often treats logging and monitoring as operational hygiene or post-facto detectability. Under machine-speed attack, the ability to preserve truthful telemetry is part of the defense itself. A defender that loses observability may still appear stable from inside its own dashboard while in reality ceding initiative to the attacker.
For that reason, telemetry fidelity is treated as a measurable first-class dimension of adversarial security quality in the framework proposed in Section 7, not an implementation detail.
The event generated more than 210 raw submissions that deduplicated into 29 canonical vulnerability families. The issues that broke fastest were architectural flaws that nearly every serious team found early: reset-token disclosure, step-up inconsistencies, and adjacent-flow control failures. Timestamped submissions tell the speed story clearly. One team submitted its first critical (unauthenticated account takeover via in-band password reset token) at six minutes and thirty seconds after event start, then followed with three more critical findings within the next three minutes.
That is speed. But the most informative findings were not the easiest to rediscover. They were the ones that still mattered after obvious bug classes were already known.
One red team documented a seven-step elevated account takeover chain requiring only seven API calls:
That sequence is qualitatively different from a list of bugs. It shows how exploit composition converts individually understandable defects into operationally meaningful control loss. A counting-based evaluation would flatten the chain into its component parts. A workflow-native evaluation recognizes that the relevant unit is the chain itself. This is precisely what CHES-4D's Composition Depth dimension is designed to measure (see Section 7).
The password-policy bypass illustrates a vulnerability class that isolated endpoint testing often misses. One deployment enforced an eight-character minimum on signup but accepted single-character passwords through the reset endpoint. Once a reset token was obtained, attackers could downgrade account passwords to trivially guessable values. The issue existed because signup and reset were implemented as separate handlers that did not share a common validation function.
This is a textbook workflow failure: each endpoint may look individually plausible, but the system-level behavior is insecure when the flows interact.
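One way to close this class of gap is to route every password write through a single validation function. The sketch below is illustrative only: the function names, the in-memory store, and the boolean token check are assumptions, and password hashing is omitted for brevity. The eight-character minimum is the signup rule described above.

```typescript
const MIN_PASSWORD_LENGTH = 8; // the signup minimum described in the report

// Single policy function shared by every flow that sets a password.
function validatePassword(candidate: unknown): string {
  if (typeof candidate !== "string") throw new Error("password must be a string");
  if (candidate.length < MIN_PASSWORD_LENGTH) throw new Error("password too short");
  return candidate;
}

// Hypothetical credential store: username -> password (hashing omitted).
const credentialStore = new Map<string, string>();

function signup(username: string, password: unknown): void {
  credentialStore.set(username, validatePassword(password));
}

function resetPassword(username: string, tokenIsValid: boolean, newPassword: unknown): void {
  if (!tokenIsValid) throw new Error("invalid reset token");
  // Same policy as signup, so a valid token cannot downgrade the password.
  credentialStore.set(username, validatePassword(newPassword));
}
```

With both handlers calling `validatePassword`, an attacker holding a valid reset token can still change the password but cannot set a single-character one.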
One seeded flaw involved client-influenced session identity. A red team demonstrated that supplying an arbitrary identity field in the login request allowed impersonation of any identity, including admin, in a single API call. Downstream checks that trusted session.identity were all affected: ownership checks, authorization boundaries, and audit trails.
This matters because the flaw's downstream consequences exceeded the apparent scope of the seed. The initial issue looked like an input-handling weakness. In operation, it became a trust-propagation failure spanning the application's authorization model.
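The trust-propagation problem can be illustrated with two login variants. This is a hypothetical reconstruction, not the seeded code: the request shape, the account table, and the assumption that the flawed handler still checked a password are all introduced here for the example. The only detail taken from the report is that a client-supplied identity field ended up in `session.identity`.

```typescript
interface LoginBody {
  username: string;
  password: string;
  identity?: string; // client-supplied field that should never be trusted
}

interface Session {
  identity: string;
}

// Hypothetical account table (plaintext for brevity).
const accounts = new Map<string, string>([
  ["alice", "alice-pass"],
  ["admin", "admin-pass"],
]);

// Flawed variant: the session identity is whatever the client asked for.
function loginTrustingClient(body: LoginBody): Session | null {
  if (accounts.get(body.username) !== body.password) return null;
  return { identity: body.identity ?? body.username };
}

// Safer variant: the identity is derived only from the verified credential.
function loginServerDerived(body: LoginBody): Session | null {
  if (accounts.get(body.username) !== body.password) return null;
  return { identity: body.username };
}

const spoofAttempt: LoginBody = { username: "alice", password: "alice-pass", identity: "admin" };
const flawedSession = loginTrustingClient(spoofAttempt); // identity is "admin"
const safeSession = loginServerDerived(spoofAttempt); // identity is "alice"
```

Every downstream check that reads `session.identity` inherits whichever value the login step chose, which is why a single-field flaw spread across ownership checks, authorization boundaries, and audit trails.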
The contrast between red teams clarifies why composition matters more than counting. One team produced a broad catalog of vulnerabilities with strong evidence and professional write-ups. Another found fewer total issues but composed them into the event's most meaningful offensive outcomes: the seven-step takeover chain, the strongest identity-spoofing demonstration, and the crash-before-log blind spot.
Both outputs were valuable. But they reflect different offensive capabilities.
The red teams exhibited four distinct operating styles visible in both their submissions and their timing patterns. These styles corresponded, imperfectly, to the attack architectures observed in the prior experiment: hierarchical swarms, single autonomous agents, AI-parallelized testing, and AI combined with traditional pentesting tooling. Two experiments are insufficient to claim a stable taxonomy, but the recurrence is strong enough to be useful.
| Style | Strength | Limitation | Evaluation implication |
|---|---|---|---|
| Surgical discovery | Speed and efficiency | Limited depth | A benchmark measuring only time-to-first-critical overweights this style and underrates chain-builders whose highest-severity findings arrive later. |
| Systematic coverage | Cross-target validation | Submission volume can overstate signal | Useful for testing convergence claims: when the same flaw appears across every target, the finding's generalizability is empirically established. |
| Operational chain-building | Impact and exclusive chains | Slower start | Essential for measuring exploit composition depth; benchmarks that fail to reward chain quality will systematically undervalue this style and the risks it reveals. |
| Defense-in-depth assessment | Thoroughness and reporting quality | Less live adaptation | Useful for understanding broad surface quality and control interactions, but weak evidence for how a target holds up under active defensive pressure. |
These patterns matter because there is no single agentic attack style. A benchmark that scores only against chain-building adversaries may undervalue classes of risk that systematic attackers find first. A benchmark that overweights early criticals may miss deeper operational paths discovered later. Future evaluations should preserve adversarial diversity rather than collapsing all offense into a single pass/fail outcome.
Blue-team performance was more varied than the raw scoreboard suggested. Judge-aggregator telemetry from the three defended deployments and the unhardened baseline revealed three materially different defensive profiles emerging from identical starting codebases.
Three caveats govern the analysis below. First, defended-deployment telemetry is drawn from the judge aggregator's periodic polls of each team's /api/_admin/events journal, not from platform-side ground truth. As Section 3 demonstrated on the reference deployment, journal-based instrumentation substantially undercounts crashes; the per-team 5xx counts below should be read as lower bounds. Second, one blue team (Refoldla) provided a detailed post-event response that re-framed several specific red team claims as artifacts of the starter application's ephemeral in-memory storage; that response is incorporated in Section 6.3 and discussed in Section 7. Third, the profiles below for Saikavya and Manny are drawn from the judge-aggregator telemetry alone; Refoldla's profile additionally incorporates the team's post-event written response. Any additional written responses or platform-side data incorporated after publication will be reflected in dated revisions, logged in the revision history.
| Metric | Baseline | Saikavya | Refoldla | Manny |
|---|---|---|---|---|
| Non-admin events (judge-polled) | 1,113 | 2,965 | 2,610 | 3,011 |
| Journal-logged 5xx (lower bound) | 13 | 39 | 0 | 0 |
| Platform-side 5xx (where measurable) | 121 | n/a | n/a | n/a |
| Rate-limit 429s | 0 | 29 | 382 | 7 |
| Deception responses | 0 | 499 | 0 | 0 |
| Dashboard polls | 1,386 | 2,264 | 1,905 | 1,038 |
| Overall 2xx rate (journal-observed) | 67% | 40% | 60% | 64% |
| Late-window 2xx rate (inferred real) | ~63% | ~13% | ~60% | ~70% |
Platform-side 5xx counts for the three defended deployments are unavailable because those deployments reside in Vercel accounts outside our operational boundary.
[Figure: four per-deployment panels: Saikavya (deception-heavy), Refoldla (rate limiting + targeted deception), Manny (minimal adaptation), and the unhardened baseline.]
Saikavya's deployment (Profile A) fielded the most elaborate active-defense stack in the event. It activated deception roughly 35 minutes into the live attack window on the defended deployments, returning fake but apparently successful responses to classified adversarial traffic. (The 35-minute figure is timed from the opening of that window, when the defended-deployment URLs were released; the deployment's own event log began recording roughly twenty minutes earlier, so the first deception entry in the cited raw export sits about 55 minutes into that deployment-local log.) Of 2,965 non-admin events captured by the judge aggregator, 499 (17%) were handled by the deception layer rather than reaching the real handler. The majority (353) appeared as fake "bad credentials" responses on /api/auth/login, with smaller volumes on /api/identity/signup (136 fake successful signups), /api/identity/reset (7 fake null-token acknowledgments), and /api/actions/create (3 fake action creates). The per-route, per-status breakdown and the exact reproducer used to derive 499 from the judge-aggregator export are documented in §B.4.1. The classifier appeared to key on suspicious username patterns (literal admin, attacker, victim, plus rate-limit-probe prefixes) and redirect those handles into decoy flows. Later, limited rate limiting was added (29 × 429 on /api/identity/reset). The combined result was the strongest exploit-reduction trajectory in the event: real attack success fell from roughly 74% in the opening window to 13% late in the exercise.
But the cost was substantial. This deployment also recorded the highest journal-logged crash count (39, all on /api/actions/create) and the weakest telemetry fidelity. Deception increased adversarial friction while simultaneously widening the gap between what attackers observed and what actually occurred. The deception layer itself was probed: at least one attacker used honeypot_check_* handles to try to fingerprint it, consistent with the attacker-side awareness that something was intercepting their traffic.
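A username-keyed deception classifier of the kind inferred above might look like the following sketch. It is a guess at the shape, not the team's code: the `rl_` prefix, the 401 status, the response fields, and the demo credential are all assumptions, while the three literal handles and the fake "bad credentials" behavior on login come from the telemetry described in this section.

```typescript
// Handles treated as adversarial. The three literals come from the report;
// the "rl_" prefix is an assumed example of a rate-limit-probe prefix.
const DECOY_LITERALS = new Set<string>(["admin", "attacker", "victim"]);
const DECOY_PREFIXES: string[] = ["rl_"];

function isClassifiedAdversarial(username: string): boolean {
  const handle = username.toLowerCase();
  return DECOY_LITERALS.has(handle) || DECOY_PREFIXES.some((p) => handle.startsWith(p));
}

interface LoginOutcome {
  status: number;
  error: string | null;
  servedBy: "real" | "decoy"; // internal bookkeeping only, never sent to the client
}

// Hypothetical real handler: accepts one hard-coded credential pair for the demo.
function realLogin(username: string, password: string): LoginOutcome {
  const ok = username === "alice" && password === "correct-horse";
  return { status: ok ? 200 : 401, error: ok ? null : "bad credentials", servedBy: "real" };
}

// Classified traffic gets a plausible failure without reaching the real handler.
function loginWithDeception(username: string, password: string): LoginOutcome {
  if (isClassifiedAdversarial(username)) {
    return { status: 401, error: "bad credentials", servedBy: "decoy" };
  }
  return realLogin(username, password);
}
```

Recording `servedBy` on the server side is the design choice that matters for telemetry fidelity: without it, decoy responses and real failures become indistinguishable in the defender's own journal.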
Refoldla's deployment (Profile B) introduced rate limiting across a broad set of routes and fired 382 × 429 responses during the live window, the clearest case study of rate limiting as an active defense in the dataset. Distribution: 264 × 429 on /api/_admin/events (the dominant admin-token brute-force surface), 58 on /api/identity/reset, 31 on /api/auth/login, 28 on /api/auth/stepup, and 1 on /api/_admin/reset. The rate limiter fired across both admin-token and user-auth brute-force surfaces, and triggered on explicit rate-limit probes (rl_* handles), which meant attackers could verify the defense was operating, a useful piece of observable-defense signal.
The deployment's success trajectory remained relatively flat at around 60%, weaker on exploit reduction than the deception-heavy profile but much stronger on continuity and observability. The system remained legible. The telemetry stayed honest. The application remained stable.
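For readers unfamiliar with the mechanism, a fixed-window rate limiter that produces 429 responses can be sketched in a few lines. The report does not describe the algorithm the team used, so the fixed-window design, the key format, and the limits below are assumptions chosen for brevity.

```typescript
interface WindowState {
  windowStart: number; // timestamp (ms) when the current window opened
  count: number; // requests seen in the current window
}

// Returns a checker that yields 429 once a key exceeds `limit` requests
// within `windowMs`, and 200 otherwise.
function createRateLimiter(limit: number, windowMs: number) {
  const windows = new Map<string, WindowState>();
  return function check(key: string, nowMs: number): 200 | 429 {
    const state = windows.get(key);
    if (state === undefined || nowMs - state.windowStart >= windowMs) {
      windows.set(key, { windowStart: nowMs, count: 1 }); // open a fresh window
      return 200;
    }
    state.count += 1;
    return state.count > limit ? 429 : 200;
  };
}

// Example: at most 3 requests per second for a given client and route.
const checkLimit = createRateLimiter(3, 1000);
const limiterKey = "203.0.113.7:/api/auth/login"; // hypothetical client and route
const statuses = [0, 1, 2, 3].map((t) => checkLimit(limiterKey, t)); // 200, 200, 200, 429
const afterWindow = checkLimit(limiterKey, 1000); // 200, a new window has opened
```

Note that limiter state held in process memory, as in this sketch, is lost on restart, which is the same substrate dependency discussed for this deployment's account store.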
Alongside rate limiting, this deployment also operated a targeted deception layer on a single high-risk path: when a duplicate signup request arrived for an existing username, the server returned a byte-compatible HTTP 200 response with a synthetic user ID, and ran a dummy hashPassword() call so the response timing was indistinguishable from a successful signup. No second account was created, and the original account's credentials were preserved. The red team reported the path as unprotected: they observed the 200 responses and concluded the uniqueness check had been bypassed. The defense worked exactly as designed; it was the deception itself that produced the red team's mistaken conclusion. This is a different calibration of deception from the broad honeypot classification of Profile A (Saikavya): Profile A reclassified large swaths of adversarial traffic into deception routes; Profile B (this deployment) concealed a specific defensive behavior behind a single byte-compatible response on a single path. Both are deception; they trade off differently between coverage, observability, and legibility.
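The duplicate-signup behavior described above can be sketched as follows. This is an illustration under stated assumptions rather than the team's code: the `hashPassword` body is a non-cryptographic placeholder standing in for a real password hash, the sequential user IDs are a simplification, and the response shape is invented. What it preserves from the description is that the hash work runs on both branches and that a duplicate request changes nothing in the store.

```typescript
// Placeholder work function standing in for a real password hash.
// NOT cryptographic; it only models "the same work happens on both paths".
function hashPassword(password: string): string {
  let h = 0x811c9dc5;
  for (let round = 0; round < 1000; round++) {
    for (let i = 0; i < password.length; i++) {
      h ^= password.charCodeAt(i);
      h = Math.imul(h, 0x01000193) >>> 0;
    }
  }
  return h.toString(16);
}

interface SignupResponse {
  status: 200;
  body: { ok: true; userId: string };
}

const usersByName = new Map<string, { userId: string; passwordHash: string }>();
let nextUserNumber = 1;

// Simplified ID generator; a real system would use unpredictable IDs.
function newUserId(): string {
  return "u_" + String(nextUserNumber++).padStart(6, "0");
}

function signupWithDecoy(username: string, password: string): SignupResponse {
  const passwordHash = hashPassword(password); // runs for new and duplicate names alike
  const userId = newUserId();
  if (!usersByName.has(username)) {
    usersByName.set(username, { userId, passwordHash });
  }
  // For a duplicate, nothing is stored: the ID is synthetic and the hash is discarded.
  return { status: 200, body: { ok: true, userId } };
}

const firstSignup = signupWithDecoy("admin", "original-password");
const duplicateSignup = signupWithDecoy("admin", "attacker-password");
```

From the client's side the two responses have the same status and shape, which matches the red team's observation; only server-side state reveals that the second request was a no-op.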
A post-event response from this team re-framed one aspect of the red team's narrative that bears directly on the interpretation of these results. The starter application used ephemeral in-memory storage (globalThis Maps), which reset on every serverless function restart, including restarts triggered by crashes and by the defending team's own mid-event redeployments of ban lists and hot-fixes. The red team's most dramatic claim against this target (repeated admin-username registrations) was not, in this account, a uniqueness-check bypass; it was the same attacker re-registering the same username on a store that had just been wiped by a defensive redeployment. The defender's own remediation operations were the primary enabler of the attack. The team verified, on a post-event deployment with persistent storage (Upstash Redis), that duplicate signup is rejected with a timing-equalized fake-success response and that the original account password is preserved.
This observation is a substrate-level finding in its own right, and it is discussed further in Section 7.4 as an instance of the Surfaces interpretation rule: an application-plane defense posture that is nominally strong (rate limiting present, duplicate signup handled with targeted deception, step-up enforced) can be invalidated by substrate architecture (in-memory state on a platform that restarts the application frequently during defensive operations).
This deployment fixed the crash-inducing flaw during Phase 1 but introduced only limited defensive changes during the live window. It applied light rate limiting (7 × 429 responses on /api/actions/create only), no deception, and little evidence of adaptive response. Its attack success rate remained in the 64% to 86% range throughout the exercise, making it statistically difficult to distinguish from the unhardened baseline.
This is one of the paper's clearest practical lessons. Hardening alone was not enough. Without active adaptation during the live window, autonomous attack traffic found residual paths and continued exploiting them. The zero journal-logged 5xx count for this deployment should be read against Section 3's finding: per-route journal instrumentation can miss the large majority of crashes when a handler throws before its logging call. This deployment shares that instrumentation pattern, so its zero is consistent with either a genuine absence of crashes or an unmeasured undercount, and is not a proof of absence.
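The crash-before-log pattern is easy to reproduce in miniature. The sketch below is an assumption-laden illustration, not the starter application's code: the route name /api/example, the handler names, and the serve function standing in for the platform are all hypothetical.

```python
journal: list = []

def log_event(route: str, status: int) -> None:
    journal.append({"route": route, "status": status})

def create_action(body: dict) -> int:
    # Raises TypeError on type-confused input such as {"amount": {"x": 1}}.
    int(body["amount"])
    return 200

def naive_handler(body: dict) -> int:
    status = create_action(body)        # a throw here skips the log call below
    log_event("/api/example", status)
    return status

def with_journal(route: str, handler):
    """Wrap a handler so a journal entry is written even when it throws."""
    def wrapped(body: dict) -> int:
        status = 500
        try:
            status = handler(body)
            return status
        finally:
            log_event(route, status)
    return wrapped

def serve(handler, body: dict) -> int:
    """Stand-in for the platform: an uncaught exception becomes an HTTP 500."""
    try:
        return handler(body)
    except Exception:
        return 500

bad_body = {"amount": {"nested": "object"}}

naive_status = serve(naive_handler, bad_body)
entries_after_naive = len(journal)      # the crash left no journal entry

safe_status = serve(with_journal("/api/example", create_action), bad_body)
entries_after_safe = len(journal)       # the wrapper recorded the 500
```

Both calls return 500 to the client, but only the wrapped handler leaves evidence in the journal. A zero in a journal built like naive_handler therefore says nothing about whether crashes occurred.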
The unhardened baseline received approximately one-third the traffic of defended deployments, which limits direct per-window comparisons. It should be understood as an anchor rather than a fully traffic-matched control. Future experiment design should improve this with traffic replay or other controlled injection methods.
Even with that limitation, the baseline was informative. It retained the underlying crash-before-log flaw, produced the same route-specific crash behavior, and provided a rough reference point for interpreting whether defended deployments materially changed outcomes.
No single metric ranked the defenses correctly. If evaluated only on exploit reduction, the deception-heavy team would win. If continuity mattered most, the broad rate-limiting profile would win. If telemetry fidelity mattered most, the non-deception teams would rank much higher.
Defensive quality is multidimensional. Any evaluation system that compresses it into a single scalar score necessarily discards information that matters operationally. A weighted composite score may still be useful for operational reporting, but the weighting itself is a judgment that depends on context (a deception-heavy profile may be correct for one deployment and wrong for another) and that judgment cannot be made empirically without the underlying dimensions intact. The benchmark's analytic value depends on preserving them.
The rank-order comparisons below are descriptive summaries of this event, not claims of universal ordering across environments.
| Metric | Saikavya | Refoldla | Manny |
|---|---|---|---|
| Exploit reduction | 1st | 2nd | 3rd |
| Continuity | 3rd | 1st | 1st |
| Telemetry fidelity | 3rd | 1st | 1st |
| Adaptation speed | 1st | 2nd | 3rd |
| Dashboard engagement | 1st | 2nd | 3rd |
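To make the weighting point concrete, the sketch below turns the rank table into a weighted rank sum and shows that the "winner" changes with the weights. The composite function and both weight sets are illustrative assumptions; summing ordinal ranks is a crude device used here only to show the sensitivity, not a proposed scoring rule.

```python
# Rank positions from the table above (1 = best).
ranks = {
    "Saikavya": {"exploit_reduction": 1, "continuity": 3, "telemetry_fidelity": 3,
                 "adaptation_speed": 1, "dashboard_engagement": 1},
    "Refoldla": {"exploit_reduction": 2, "continuity": 1, "telemetry_fidelity": 1,
                 "adaptation_speed": 2, "dashboard_engagement": 2},
    "Manny":    {"exploit_reduction": 3, "continuity": 1, "telemetry_fidelity": 1,
                 "adaptation_speed": 3, "dashboard_engagement": 3},
}

def composite(weights: dict) -> dict:
    """Weighted rank sum per team; lower is better."""
    return {team: sum(weights[m] * r for m, r in row.items())
            for team, row in ranks.items()}

def winner(weights: dict) -> str:
    scores = composite(weights)
    return min(scores, key=scores.get)

# Two hypothetical weightings: all metrics equal, and exploit reduction dominant.
equal = {m: 1 for m in ranks["Saikavya"]}
exploit_heavy = dict(equal, exploit_reduction=5)

equal_winner = winner(equal)
exploit_winner = winner(exploit_heavy)
print(equal_winner, exploit_winner)
```

Under equal weights the broad rate-limiting profile comes out ahead; weighting exploit reduction five times as heavily hands the result to the deception-heavy profile. Neither answer is recoverable once the underlying dimensions have been collapsed.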
The findings above motivate a named benchmark architecture. CHESS is the overall structure. It combines application-plane measurement with a parallel qualification layer, on the empirical premise that defensive quality under live adversarial pressure is multidimensional, and that application-plane scores cannot be read as representative of deployed posture without understanding the substrate beneath them.
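CHESS has two planes: CHES-4D, the application-plane measurement vector, and CHESS Surfaces, the parallel qualification layer that assesses substrate and boundary exposure.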
The two planes answer different questions. CHES-4D answers: how did the system perform on benchmarked tasks under live adversarial pressure? CHESS Surfaces answers: what relevant system surfaces were assessed, how severe is their posture, and how trustworthy or complete is the supporting evidence?
The four CHES-4D dimensions are not proposed in the abstract. Each is grounded in a specific empirical result from this event: Composition Depth, which measures the longest validated chain from initial access to operational control (formalizing the chain-building evidence from Section 04); Harm Reduction Rate, representing the trajectory of attacker success over the defensive window (formalizing Section 06); Endurance Under Attack, the ability to preserve legitimate functionality and state continuity under pressure (formalizing Section 06); and Signal Fidelity, measuring the fraction of relevant attack behavior that remains visible in defender-controlled telemetry (formalizing the observability gap from Section 03). The dimensions exist because the event required them to describe what was measured.
Composition Depth. Definition. The longest validated chain from initial access to operational control.
Why it matters. Finding individual defects is no longer sufficient. The relevant question is whether attackers can compose those defects into meaningful control loss.
Measured here. Yes. Red teams documented validated chains of 5–7 steps ending in account takeover or privileged control (Section 04).
Harm Reduction Rate. Definition. The trajectory of attacker success over the defensive window.
Why it matters. Defenses do not merely exist; they change the shape of the attack over time.
Measured here. Yes. Observed trajectories ranged from ~74% → 13% for the strongest reducing defense to effectively flat in the minimally adaptive case (Section 06).
Endurance Under Attack. Definition. The ability to preserve legitimate functionality and state continuity under pressure.
Why it matters. A defense that breaks the system or repeatedly crashes while blocking attacks has not solved the production problem.
Measured here. Yes. Visible through happy-path probe results, crash counts, and operational stability (Section 06).
Signal Fidelity. Definition. The fraction of relevant attack behavior that remains visible in the defender's own instrumentation.
Why it matters. Defenders cannot adapt coherently to behavior they cannot see. Under machine-speed attack, truthful telemetry is part of the defense.
Measured here. Yes. Platform telemetry recorded 121 crashes on the reference deployment; the journal captured 13, a measured 89% undercount. One attacker independently confirmed 300+ induced failures invisible to the event log (Section 03).
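The undercount figure follows directly from the two crash totals. The sketch below restricts Signal Fidelity to crash events on the reference deployment, which is narrower than the full definition; the variable names are illustrative.

```python
platform_crashes = 121   # HTTP 500s seen by platform-side telemetry
journal_crashes = 13     # HTTP 500s recorded in the application's event journal

# Crash-only Signal Fidelity: the fraction of crashes the defender could see.
signal_fidelity = journal_crashes / platform_crashes
miss_rate = 1 - signal_fidelity

print(f"visible: {signal_fidelity:.1%}, missed: {miss_rate:.1%}")
```

On these numbers the defender's own instrumentation saw roughly one crash in nine.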
Some failures do not merely lower an application-plane score. They invalidate the premise that the score is representative of the defended system as deployed. That is the role of CHESS Surfaces.
Substrate qualification was a foundational concern in the benchmark's design. The Infrastructure Strikes Back experiment was instrumented to expose substrate-level failure modes alongside application-plane behavior, and two empirical validation cases emerged from within the experiment series itself. The April 9 event surfaced a substrate-mediated state-reset mechanism in which a defending team's ephemeral in-memory storage was wiped during their own defensive redeployments, described in §7.4. The April 16 ClawCamp 2026 session produced an independent Framework-category Critical finding against a demonstrably hardened deployment, described in §7.5. The seven-category taxonomy presented below is the formal articulation of an architecture that was present from the start; full validation across all seven categories at scale remains the subject of the next experiment in the series.
CHESS Surfaces measures seven substrate and boundary categories whose compromise can bypass, invalidate, or dominate application-plane defensive performance: Dependency, Framework, MCP/tool, Skill/plugin, Agent config, Delegation, and Vendor API.
Each category in a CHESS Surfaces assessment carries both a status (Clean, Qualified, Elevated, Critical) and a coverage state (complete, partial, insufficient). The design rule is explicit: a category is not called Clean without evidence. Status and coverage state remain distinct reporting dimensions so that absence of evidence is never silently conflated with evidence of absence.
CHES-4D scores describe application-plane adversarial performance. CHESS Surfaces determines whether those scores can be read as representative of deployed posture.
Strong application-plane results remain meaningful as measurements, but become conditional rather than standalone when Surfaces exposure is Critical.
The practical consequence is that a defender can score well on CHES-4D and still fail at the system level if the substrate beneath their application is exposed. The dashboard visualization rule follows from this: when aggregate Surfaces status is Critical, CHES-4D scores remain visible but are presented as conditional, not standalone.
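The status, coverage, and presentation rules can be sketched as a small data model. This is an illustration under stated assumptions, not the data contracts of the companion specification: CategoryResult, aggregate_status, and present_scores are hypothetical names, "worst category wins" is an assumed aggregation rule, and reading "not called Clean without evidence" as "Clean is rejected when coverage is insufficient" is one possible interpretation.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum

class Status(IntEnum):
    CLEAN = 0
    QUALIFIED = 1
    ELEVATED = 2
    CRITICAL = 3

class Coverage(Enum):
    COMPLETE = "complete"
    PARTIAL = "partial"
    INSUFFICIENT = "insufficient"

@dataclass(frozen=True)
class CategoryResult:
    category: str
    status: Status
    coverage: Coverage

    def __post_init__(self):
        # Assumed reading of the design rule: no Clean verdict without evidence.
        if self.status is Status.CLEAN and self.coverage is Coverage.INSUFFICIENT:
            raise ValueError(f"{self.category}: cannot report Clean with insufficient coverage")

def aggregate_status(results: list) -> Status:
    """Assumed aggregation: the worst category status dominates."""
    return max(r.status for r in results)

def present_scores(ches4d: dict, results: list) -> dict:
    """Scores stay visible; only their presentation changes when Surfaces is Critical."""
    conditional = aggregate_status(results) is Status.CRITICAL
    return {"scores": ches4d, "presentation": "conditional" if conditional else "standalone"}

results = [
    CategoryResult("Framework", Status.CRITICAL, Coverage.COMPLETE),
    CategoryResult("Dependency", Status.QUALIFIED, Coverage.PARTIAL),
]
# The score value here is a placeholder, not a measured result.
view = present_scores({"harm_reduction_rate": 0.8}, results)
```

Keeping status and coverage as separate fields is what lets a report distinguish "assessed and clean" from "not enough evidence to say".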
The first CHESS Surfaces validation case came out of the April 9 event itself, though it took post-event adjudication to surface it as such. The starter application used ephemeral in-memory storage (globalThis Maps on Vercel's serverless runtime), which reset on every function cold start, including restarts triggered by crashes and by a defending team's own mid-event redeployments. When a team deployed a ban-list update, a hot-fix, or any other vercel --prod invocation during the live window, the function instances restarted and the user store was wiped.
One blue team's post-event response (discussed in §6.3) documented the mechanism directly. The red team's most dramatic claim against their target — repeated admin-username registrations, 20+ consecutive retakes — was not a uniqueness-check bypass but a persistence loop: each defensive redeployment wiped the user store and handed the attacker a clean slate for re-registration, executed at under-one-second cadence by an automated agent. The cycle ran: crash the server (type-confused JSON payloads) or wait for the defender's next vercel --prod invocation → function instance restart → empty store → re-register the privileged username → repeat. Blue team redeployments occurred every two to three minutes during the live window; the attacker's re-registration closed the loop in under a second each time. The defender's application-plane posture, in isolation, was strong: rate limiting on all auth endpoints, duplicate-signup handling by a timing-equalized fake-200 deception response, and step-up enforcement gated by a server-side secret. But the substrate supporting those defenses did not persist.
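The persistence loop can be simulated with two toy stores. Everything in this sketch is hypothetical (class names, the seeded owner, the count of 20 redeployments), and for clarity register returns a plain boolean rather than the fake-200 response the real deployment used.

```python
class EphemeralStore:
    """In-memory user store that is wiped whenever the function restarts."""
    def __init__(self):
        self.users = {"admin": "defender"}

    def restart(self):
        self.users = {}

class PersistentStore(EphemeralStore):
    """Same interface, but state survives a restart."""
    def restart(self):
        pass

def register(store, username: str, owner: str) -> bool:
    # Uniqueness check: reject if the username is already taken.
    if username in store.users:
        return False
    store.users[username] = owner
    return True

def attacker_captures(store, redeploys: int) -> int:
    """Count how often the attacker ends up owning 'admin' across redeployments."""
    captures = int(register(store, "admin", "attacker"))
    for _ in range(redeploys):
        store.restart()                  # crash or defensive redeployment
        captures += int(register(store, "admin", "attacker"))
    return captures

ephemeral_captures = attacker_captures(EphemeralStore(), redeploys=20)
persistent_captures = attacker_captures(PersistentStore(), redeploys=20)
print(ephemeral_captures, persistent_captures)
```

The uniqueness check is identical in both runs and never fails. What changes the outcome is only whether the store survives the restart.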
This is not a defense-failure story in the ordinary sense; it is a defense-as-attack-vector story. Application-plane hardening that assumes persistent state cannot be evaluated honestly when the substrate does not provide it. Under the CHESS Surfaces taxonomy, this resolves to the Framework category: the exposure is the runtime model's ephemerality, activated by defensive operations that would have been safe on a persistent substrate. Under the threshold policy defined in the companion implementation specification, the finding rates Framework / Elevated, escalating to Framework / Critical for the duration of any defensive redeployment cycle. The team verified the resolution on a post-event redeployment with Upstash Redis: crashes still return HTTP 500, but the user store survives; duplicate signup receives the fake-200 defense with original credentials preserved; the persistence loop is broken at its foundation.
The significance for the benchmark is that this case was present in The Infrastructure Strikes Back from the start and exercised by live attacker traffic; it was simply not legible as a Surfaces case until Refoldla's written response adjudicated what the red team had observed. It is cited here as the first empirical validation of the Surfaces interpretation rule — strong application-plane performance becoming conditional, not standalone, when substrate exposure is elevated.
The case also illustrates a related point about cryptographic agility under live pressure. Several blue teams in this exercise converged on HMAC-signed session authority as a defensive move, a reasonable choice that nonetheless depended on state continuity to remain meaningful: a signed session that survives the substrate it was minted against still binds to a user record that may no longer exist. As attack-defense cycles compress from months to hours, the ability to rotate keys, replace libraries, and repair brittle trust paths at operational speed matters as much as the choice of cryptographic primitives. The Surfaces interpretation rule reads naturally as a generalization of that observation: trust propagation across substrate resets is exactly the property that makes application-plane scoring conditional rather than standalone.
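The dependence of a signed session on state continuity can be shown with the standard library alone. This is a sketch, not any team's session scheme: the token format, the SECRET value, and the function names are assumptions made for illustration.

```python
import hashlib
import hmac

SECRET = b"demo-secret"                    # hypothetical signing key
users = {"u-1001": {"name": "alice"}}      # hypothetical user store

def mint(user_id: str) -> str:
    sig = hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    return f"{user_id}.{sig}"

def signature_valid(token: str) -> bool:
    user_id, _, sig = token.rpartition(".")
    expected = hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)

def session_valid(token: str) -> bool:
    # A session is only meaningful if the signature AND the user record both hold.
    user_id = token.rpartition(".")[0]
    return signature_valid(token) and user_id in users

token = mint("u-1001")
valid_before_reset = session_valid(token)

users.clear()                              # substrate reset wipes the user store

signature_after_reset = signature_valid(token)
session_after_reset = session_valid(token)
```

After the reset the signature still verifies, because the key did not change, but the record it vouches for is gone. A verifier that checks only the signature would keep honoring the session.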
The second validation case came one week later. The April 16 session at ClawCamp 2026, Break This App: Live Adversarial Agent Experiment in 45 Minutes, was designed in part to test the same hypothesis against a hardened target: that substrate-level exposure can bypass strong application-plane hardening. The session ran a head-to-head comparison of the original naive starter application (v1) and a hardened rewrite (v2, deployed to defend.day-zero.dev), both facing the same attacker pool in the same time window.
Application-plane hardening was demonstrably effective. Against essentially identical attacker traffic over the same 1 hour 45 minute telemetry capture window, the naive deployment recorded 11 crashes on platform-side telemetry; the hardened deployment recorded zero. The hardened deployment also showed a roughly four-fold lower rate of reconnaissance 404s (64 of 1,601 requests versus 530 of 3,208, or 4.0% versus 16.5%), converted a substantially wider class of unauthenticated requests into explicit 401s (794 versus 302, a 49.6% versus 9.4% 401-rate), and fired its rate limiter against brute-force traffic where the naive deployment was silent. The hardening cost was measurable but modest: approximately 5–10ms of additional CPU per request at p95. By any reasonable application-plane measurement, v2 was harder to attack than v1.
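The rates quoted above can be recomputed from the raw counts. The sketch assumes the per-deployment request totals reported in the appendix for the April 16 capture window (3,208 for v1 and 1,601 for v2); the dictionary layout and the rate helper are illustrative.

```python
# Platform-side counts for the April 16 capture window.
v1 = {"requests": 3208, "404": 530, "401": 302, "crashes": 11}
v2 = {"requests": 1601, "404": 64, "401": 794, "crashes": 0}

def rate(deployment: dict, key: str) -> float:
    """Share of all requests to a deployment that ended in the given outcome."""
    return deployment[key] / deployment["requests"]

rate_401_v1, rate_401_v2 = rate(v1, "401"), rate(v2, "401")
rate_404_v1, rate_404_v2 = rate(v1, "404"), rate(v2, "404")
recon_ratio = rate_404_v1 / rate_404_v2

print(f"401 rate: v1 {rate_401_v1:.1%}, v2 {rate_401_v2:.1%}")
print(f"404 rate: v1 {rate_404_v1:.1%}, v2 {rate_404_v2:.1%} ({recon_ratio:.1f}x lower on v2)")
```

Normalizing by request volume matters here because v1 received about twice as many requests as v2; the four-fold figure refers to the 404 rate, not the raw 404 count.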
And yet v2 was exposed. Recon and exploitation analysis by Phoenix Taredi (TechSoulutions), with automated browser testing and CVE correlation in collaboration with Claude (Anthropic), surfaced the bypass. Authorized black-box reconnaissance on the publicly accessible client-side bundles identified that v2 was running Next.js 14.2.35, two major versions behind current release, and exposed to CVE-2026-23864, a React Server Components denial-of-service vulnerability affecting the App Router. The CVE, published in NVD on January 26, 2026 and sourced to Meta, covers multiple DoS vectors in react-server-dom-parcel, react-server-dom-turbopack, and react-server-dom-webpack. A single malformed POST to any App Router Server Function endpoint triggers CPU exhaustion, out-of-memory exceptions, or server crashes. No authentication is required. CVSS vector: AV:N/AC:L/PR:N/UI:N.
The finding mapped cleanly onto the CHESS Surfaces taxonomy. Framework category, live-exploit evidence strength (demonstrated), reachability true, patched status false at time of discovery. Under the threshold policy defined in the companion implementation specification, this resolved to Framework / Critical.
This is the interpretation rule playing out as intended. Application-plane defenses on defend.day-zero.dev were demonstrably working: zero crashes against a pool that crashed v1 eleven times, four-fold reduction in reconnaissance success, rate limiting fired in the right places. A CHES-4D-only evaluation of this deployment would have produced a defensible application-plane score. But the Framework-surface exposure meant that any allowlisted host (including the host machine itself) retained a one-request path to denial of service. The application-plane score remained meaningful as a measurement of the defended boundary. It was not representative of deployed posture.
defend.day-zero.dev has since been upgraded to a supported Next.js release. Taken together with §7.4, the two cases establish CHESS Surfaces as validated in-experiment through two distinct substrate mechanisms: runtime state ephemerality activated by defensive operations, and framework-release exposure at a dependency boundary. A larger comparative corpus (multiple systems, multiple categories, multiple severity levels) remains the subject of the next experiment.
The need for substrate qualification is visible from a second vantage point: the distributed-systems and data-governance side. Oleksandra Bovkun (Databricks, speaking in a personal capacity) observed in a conversation following the April 9 event that for roughly twenty years, application-layer security worked because users had to navigate through authorized applications to reach data. A user might technically hold broad underlying data access, but in practice could only see what the application they were authorized to use chose to surface. Governance was enforced by that navigation bottleneck, and it was enough.
Agentic systems do not navigate. A chatbot connected to a company's full data surface retrieves whatever it has technical access to, independent of the application-layer pattern a human user would have followed. The implicit boundary that twenty years of application security relied on does not hold.
The same pattern appears when compromise occurs below the application layer entirely. Recent public incidents reinforce the point. The LiteLLM supply chain incident demonstrated that the substrate beneath application-plane defenses (in that case, a widely adopted inference library) can become a compromise vector regardless of how well the application layer is defended. The April 2026 Vercel security incident, still under active investigation at the time of this writing, further illustrates the pattern: a third-party AI tool was compromised via its Google Workspace OAuth app, and the attacker used that access to reach the deployment platform's internal environment and customer environment variables that were not marked sensitive. The cross-boundary chain (Vendor API, Delegation, and Agent config surfaces interacting) is precisely the class of exposure CHESS Surfaces is designed to make legible in benchmark form, and is precisely what the ClawCamp finding in Section 7.5 demonstrated against a hardened target.
A more essayistic treatment of this argument, framed around the policy and design implications for organizations deploying agentic systems, is published alongside this paper.
This work sits within a rapidly moving landscape. Anthropic's Mythos Preview and Project Glasswing demonstrated that frontier models can autonomously discover serious vulnerabilities, including chains once considered too costly for broad automation. Microsoft's CTI-REALM moves evaluation toward end-to-end workflows by asking whether agents can convert threat intelligence into validated detections. DARPA AIxCC teams (Atlantis, Buttercup, RoboDuck) have shown that autonomous cyber reasoning systems can complete substantial portions of the discovery-and-patch cycle at operationally meaningful speed and cost. A cluster of commercial systems addresses adjacent parts of the same landscape across autonomous offensive validation, runtime security, and agentic observability.
This benchmark architecture is designed to be compatible with, rather than competitive against, existing analyst and standards frameworks. Forrester's Agentic Development Security category and AEGIS framework describe the defensive properties that matter under agentic attack in largely abstract terms. OWASP's Top 10 for Agentic AI catalogs threat classes. NIST AI RMF provides governance structure.
The contribution here is narrower and more operational: to define how some of those properties become measurable in live exercises, and how application-plane measurement should be interpreted alongside substrate qualification.
A directly complementary contemporaneous contribution is Fokou's Parallax (April 2026), which proposes an architectural paradigm for safe autonomous AI execution grounded in four principles drawn from established systems security: cognitive-executive separation, adversarial validation with graduated determinism, information flow control, and reversible execution. Parallax addresses the question of what to build; CHESS addresses the question of how to evaluate what has been built. The two are complementary rather than overlapping. A deployment that conforms to Parallax's architectural principles can be measured against CHES-4D and qualified against CHESS Surfaces, and the benchmark's substrate qualification dimension provides a method for verifying that Parallax's separations hold under adversarial conditions in the deployed environment, not only in the reference implementation. Both efforts share a motivating observation: prompt-level safety mechanisms operate at the same computational substrate as the threats they attempt to mitigate, and structural enforcement is necessary when execution capability is in scope.
The architectural description above is deliberately compact. The full evaluation contract (threshold policy, category attribution rules, the SurfaceFinding and SurfaceAssessment data contracts, the classification pipeline, and worked examples) is developed in a companion implementation document. This paper establishes the empirical motivation. The specification defines how the benchmark is computed.
A central challenge of adversarial security evaluation is that a simple count of raw findings flattens the diagnostic signal. A target that carries twenty minor findings may be more secure than one carrying three findings that compose into a catastrophic control loss. We therefore propose that CHES-4D's Composition Depth should be computed over evidence-backed chains rather than inferred from finding volume alone, treating each finding as a potential step in a larger graph. This approach introduces the kill-chain diagnosis artifact as the operator-reviewed unit of measurement, drawing inspiration from recent paradigms like ExploitGym (arXiv:2605.11086), which shift focus from isolated bug enumeration to cross-boundary composition.
Two empirical worked examples from a recent multi-repo ecosystem sweep (PR #272) illustrate the value of this diagnostic model. The first is KC-006, a Validated Composition Depth-6 chain spanning five distinct repositories (phoenix, omnis, hlos, keyfree, and staampid) that demonstrates how separate, individually minor rate-limit and identity-model defects compose into a substrate-wide economic drain. The second is KC-008, an exemplar of the credential-lifecycle-betrayal family, which validates all four operator-predicted betrayal vectors [resemblance (HLOS djb2 collision), selective disclosure (STAAMPID passthrough), refines-not-replaces (HLOS legacy fallback), and revocation (stale auth TTL and JWKS cache lag)] evidenced entirely by code-level findings. In both cases, the raw finding count fails to capture the true posture; only when compiled into operator-reviewed, evidence-backed kill chains does the composition risk become legible.
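The difference between a raw chain and a validated one can be sketched as a longest-path computation over a findings graph. The graph below is entirely hypothetical (F1 through F7 are placeholder findings, not KC-006 or KC-008), depth is counted in findings rather than transitions as an assumption, and the function assumes the graph is acyclic.

```python
from functools import lru_cache

# Hypothetical findings graph: (a, b) means finding a enables finding b.
# The boolean records whether operator review accepted that transition.
transitions = {
    ("F1", "F2"): True,
    ("F2", "F3"): True,
    ("F3", "F4"): False,   # plausible on paper, not accepted in review
    ("F4", "F5"): True,
    ("F2", "F6"): True,
    ("F6", "F7"): True,
}

def longest_chain(edges) -> int:
    """Number of findings on the longest path through an acyclic edge list."""
    graph, nodes = {}, set()
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        nodes.update((a, b))

    @lru_cache(maxsize=None)
    def depth(node: str) -> int:
        return 1 + max((depth(n) for n in graph.get(node, [])), default=0)

    return max((depth(n) for n in nodes), default=0)

raw_depth = longest_chain(list(transitions))
validated_depth = longest_chain([e for e, ok in transitions.items() if ok])
print(raw_depth, validated_depth)
```

Seven findings are present in both runs, yet the depth differs once the unaccepted transition is removed. Finding volume alone would not distinguish the two cases.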
This paper draws conclusions from a single event with a small number of teams, one starter codebase, a synthetic target, and a short time window. These limits are real.
The unhardened baseline did not receive a fully matched traffic load, which constrains direct numerical comparison between baseline and defended deployments. It remains a useful anchor but not a full experimental control.
The event is sufficient to motivate a benchmark architecture, but not to claim that its dimensions, weights, or threshold tables are final. The benchmark should be understood as a research instrument under refinement, not as a settled universal standard.
This paper provides strong empirical support for both application-plane multidimensionality and the parallel substrate-exposure layer. Two validation data points emerged from within the experiment series itself: the April 9 ephemeral-state observation (Section 7.4) and the April 16 ClawCamp CVE-2026-23864 finding (Section 7.5). Both are Surfaces-level outcomes, reached through different substrate mechanisms within the Framework category, and together they establish CHESS Surfaces as validated in-experiment rather than proposed retroactively. Two corroborating public incidents, LiteLLM and the April 2026 Vercel security incident, instantiate the same pattern at production scale and are referenced in Section 7.6. These four cases are not equivalent: the first two are measured findings from experiments in this series; the second two are referenced public disclosures. What they share is the same class of substrate exposure that CHESS Surfaces is designed to make legible.
What this paper does not yet provide is a large comparative corpus across many systems showing the full two-plane architecture in action with quantitative scoring across all seven Surfaces categories. Both in-experiment validation cases surfaced in the Framework category; the Dependency, MCP/tool, Skill/plugin, Agent config, Delegation, and Vendor API categories await empirical assessment in subsequent experiments. That broader validation work is in progress, with the next experiment in the series expected to produce the first multi-system CHESS Surfaces assessments published alongside CHES-4D scores.
Ground-truth platform-side telemetry was available for the reference deployment on both event days and for the hardened v2 deployment on April 16. It was not available for the three defended deployments on April 9, which reside in Vercel accounts outside our operational boundary. Requests for platform-side data from those teams are in progress. Until those data are incorporated, per-deployment crash counts for Saikavya, Refoldla, and Manny on April 9 should be read as journal-based lower bounds, not measured totals. Section 3 demonstrated an 89% undercount on the reference deployment from a telemetry blind spot these deployments shared in code; what that blind spot actually undercounted on each of them is unmeasured without platform-side ground truth, so their journal counts cannot be treated as either measured totals or a quantified floor across the defended set.
Deployment counts in this report distinguish between blue-team deployments that existed, targets that were shared in time for red-team attack, and targets with usable exported telemetry. The quantitative tables cover only the last category.
This paper incorporates a post-event response from one defended team (Section 6.3 and Section 7.4). Responses from the other two defended teams have been requested but not yet received at the time of this publication. A revised version of this paper will incorporate additional responses as they become available.
The findings of this event suggest five implications.
The central result of The Infrastructure Strikes Back is not merely that attackers were fast or that defenders varied. It is that when offense and defense operate in the same live window, different dimensions of security become measurable than static evaluations usually capture.
Some teams reduced harm but lost telemetry fidelity. Some preserved continuity but failed to adapt. Some fixed the obvious flaw yet remained close to baseline under sustained pressure. No single metric ranked them correctly, because defensive quality was not one-dimensional.
That empirical result supports a benchmark direction.
At the application plane, defensive quality under live pressure should be measured as a coordinated vector: Composition Depth, Harm Reduction Rate, Endurance Under Attack, and Signal Fidelity.
At the broader system plane, those scores should be read in the context of Surfaces exposure, because a weak substrate can make a strong application-plane showing conditional rather than representative. Two cases from within the experiment series validate the interpretation rule: a defending team's ephemeral in-memory storage wiped by its own mid-event redeployments (April 9), and a hardened deployment carrying a Critical Framework-surface exposure via an unpatched Next.js release (April 16 at ClawCamp 2026). Together they establish CHESS Surfaces as a measurement layer proven against real targets, not a speculative addition to the benchmark.
The practical lesson is simple: future security evaluation must measure not just whether a system contains vulnerabilities, but how it behaves, adapts, remains legible, and preserves trustworthy telemetry while attack and defense are both in motion.
That is the benchmark implication of this event.
This paper benefited from conversations with participants, judges, and external reviewers who engaged with early drafts and interim findings. Oleksandra Bovkun (Databricks) contributed substantively to the framing of CHESS Surfaces through a discussion of how traditional application-layer governance models break down under agentic access patterns. Her comments are cited in Section 07 in her personal capacity and do not represent Databricks as a company. Kevin Boyle (Enterprise Account Executive, Data + AI at Databricks) provided key thought leadership, feedback, and editorial contributions to the final version of the paper, specifically in auditing telemetry discrepancies and validating finding counts; his contributions are made in his personal capacity and do not represent Databricks. David Wagner (UC Berkeley) provided helpful observations on telemetry placement following the live event. Recon and exploitation analysis for the ClawCamp 2026 Surfaces-Critical finding (Section 7.5) was performed by Phoenix Taredi (TechSoulutions), with automated browser testing and CVE correlation in collaboration with Claude (Anthropic).
CHESS. Overall benchmark architecture combining application-plane live-adversarial measurement with substrate qualification.
CHES-4D. Application-plane measurement vector: Composition Depth, Harm Reduction Rate, Endurance Under Attack, Signal Fidelity.
CHESS Surfaces. Parallel qualification layer measuring substrate and boundary exposure across seven categories: Dependency, Framework, MCP/tool, Skill/plugin, Agent config, Delegation, Vendor API. Each category carries a status (Clean, Qualified, Elevated, Critical) and a coverage state (complete, partial, insufficient).
Signal Fidelity. The fraction of relevant attack behavior that remains visible in defender-controlled telemetry.
Crash-before-log. A class of failure in which attacker-triggered faults occur before telemetry capture, producing defender blindness without obvious dashboard evidence.
Kill-chain diagnosis. An evidence-backed sequence of findings showing how separate defects compose into an operationally meaningful attacker objective; used to support CHES-4D Composition Depth and CHESS Surfaces interpretation.
Validated Composition Depth. Longest chain whose steps are evidence-backed and whose transition logic is accepted by triage/operator review. Distinct from "longest chain"; requires operator acceptance to count.
This appendix documents the data sources, measurement boundaries, and reproducibility for the telemetry cited throughout the paper, particularly the ground-truth crash counts in Section 3 and the v1-versus-v2 comparison in Sections 6 and 7.5.
Two data sources underlie the telemetry in this paper:
vercel metrics CLI, schemas vercel.request.count and vercel.request.route_cpu_duration_ms. Provides ground truth for HTTP status counts, route-level request volumes, CPU duration, client IP, JA4 fingerprint, user agent, geographic distribution, and bot categorization. Available only for the two Vercel projects operated by the authors: the-infrastructure-strikes-back-starter (v1, deployed to attack.day-zero.dev) and infra-strikes-back-v2 (v2, deployed to defend.day-zero.dev).
Judge-aggregator export (isb-judge-export-20260410-1059). The Round 3 judge aggregator (Day-Zero Round 3 is the operational session identifier for the April 9 event; this paper is Field Report No. 02 in the Adversarial AI Experiment Series) polled each team's /api/_admin/events journal every 15 seconds during the live window and persisted the results to JSONL. Four attacked targets were exported: the reference deployment and three defended deployments with sustained aggregator access (Manny, Refoldla, Saikavya). Other blue-team deployments existed but are excluded from the quantitative comparison because they did not produce usable attack-window telemetry. Provides application-level events including HTTP status, route, session actor, and timestamps. Does not capture crashes that occur before the handler's logEvent() call (see Section 3). The aggregator source (snapshot.mjs, including the computeMetrics() definitions of "non-admin event", "attack route", "blocked", and "blocked ratio") is bundled in the export directory alongside the JSONL data and the export's own README.md, so the figures in §B.3, §B.4, and §B.4.1 are independently reproducible from the cited bundle without external references.
| Event | UTC window | Duration | Total platform requests |
|---|---|---|---|
| April 9, 2026 | 2026-04-10 00:45 → 06:00 | ~5 h | 3,568 (v1) |
| April 16, 2026 | 2026-04-16 19:45 → 21:30 | ~1 h 45 m | 3,208 (v1) + 1,601 (v2) |
Note: While the active live attack window for the April 16 ClawCamp event was bounded to 45 minutes (as detailed in §7.5), the platform-side metrics telemetry capture window was kept open for 1 hour 45 minutes to ensure all residual connection pools, late heartbeats, and slow event logs were fully captured.
Platform-side 5xx distribution on the reference deployment, same window as the journal export:
| Route | Platform 500s | Journal 500s | Miss |
|---|---|---|---|
| /api/auth/login | 75 | 0 | −75 |
| /api/identity/reset | 33 | 0 | −33 |
| /api/actions/create | 13 | 13 | 0 |
| Total | 121 | 13 | −108 (89%) |
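The Miss column and the 89% figure follow directly from the two count columns. The sketch below recomputes them with awk, assuming the per-route counts are supplied inline as tab-separated rows; the helper name miss_report is illustrative and is not part of the export bundle.

```shell
# Recompute per-route and total journal miss from the table above.
# Input rows: route <TAB> platform_500s <TAB> journal_500s
# miss_report is a hypothetical helper name, not part of the export bundle.
miss_report() {
  awk -F'\t' '
    { p += $2; j += $3
      printf "%s\tmissed=%d\n", $1, $2 - $3 }
    END {
      if (p > 0)
        printf "total\tmissed=%d\tmiss_rate=%.0f%%\n", p - j, 100 * (p - j) / p
    }
  '
}

# printf reuses its format for each group of three arguments (one row each).
printf '%s\t%s\t%s\n' \
  /api/auth/login 75 0 \
  /api/identity/reset 33 0 \
  /api/actions/create 13 13 | miss_report
```

With the three rows shown, the total line reports 108 missed crashes out of 121, which rounds to the 89% quoted in the table.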
Platform-side counts for the three defended deployments were not available. Journal-logged counts from the judge aggregator:
| Team | Non-admin events | 5xx logged | 429s | Deception events |
|---|---|---|---|---|
| reference | 1,113 | 13 | 0 | 0 |
| Saikavya | 2,965 | 39 | 29 | 499 |
| Refoldla | 2,610 | 0 | 382 | 0 |
| Manny | 3,011 | 0 | 7 | 0 |
All logged 5xx events on every team were on /api/actions/create, the one route where throw sites were wrapped in try/catch + logEvent(). Zero teams captured crashes on /api/auth/login or /api/identity/reset, despite platform telemetry showing 108 such crashes on reference alone.
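The mechanism behind this split can be shown with a toy model. The sketch below is not the event's code (the real routes are TypeScript handlers that call logEvent()); every function name here is hypothetical, and the shell functions only stand in for one unwrapped route and one wrapped route.

```shell
# Toy model of crash-before-log. All names are hypothetical stand-ins.
JOURNAL="$(mktemp)"
log_event() { printf '%s\t%s\n' "$1" "$2" >> "$JOURNAL"; }

# Unwrapped route: the fault fires before the logging call is reached.
login_handler() {
  [ -n "$1" ] || return 1          # stands in for a throw on malformed input
  log_event /api/auth/login 200
}

# Wrapped route: the fault is caught and journaled as a 500.
create_handler() {
  if [ -z "$1" ]; then
    log_event /api/actions/create 500
    return 1
  fi
  log_event /api/actions/create 201
}

login_handler ""  || true   # a platform counter would see a 500; the journal sees nothing
create_handler "" || true   # platform and journal both record a 500
journaled_500s="$(grep -c '500$' "$JOURNAL")"
echo "journaled 500s: $journaled_500s of 2 crashes"
rm -f "$JOURNAL"
```

Both calls fail, but only the wrapped route leaves a journal entry, which is the same shape as the table: crashes on /api/actions/create are visible, crashes on the other two routes are not.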
The 499 deception events in the Saikavya row above are not signalled by a dedicated boolean field in the export. The classification decision is made inside the handler and surfaces in the export only via the actor field: every request that the deployment's sentinel module classified as adversarial and rerouted to a decoy response was logged with the actor value [DECEPTION] <username>. The 499 figure is therefore reproducible from the cited export with a one-line filter:
jq -r 'select(.actor != null) | select(.actor | startswith("[DECEPTION] ")) | [.route, .status] | @tsv' \
team-saikavya/events.jsonl | sort | uniq -c | sort -rn | column -t
# 353 /api/auth/login 401
# 136 /api/identity/signup 201
# 7 /api/identity/reset 200
# 3 /api/actions/create 201
# ────
# 499 total
The classification rule itself is a stateless regex match on the submitted username, defined as ATTACK_ACTOR_RE in the Saikavya deployment's src/api/sentinel.ts. The regex was extended live during the attack window — successive commits added patterns such as fresh_, codex-, w[0-9]+_, and an explicit ^demo$ literal in response to observed traffic — so the 499 count reflects whichever revision of the regex was deployed at the time each request fired. The companion subtype tags defined in the same module ([DECEPTION:BOT] for the bot-heuristic path and [DECEPTION:SEQ] for the signup-then-reset sequence) and the related [RESERVED_PROBE] honeypot tag were present in the code but did not fire during the captured window; only the bare [DECEPTION] path produced events in this export.
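To make the shape of such a rule concrete, here is a minimal sketch of a stateless username match built from the four fragments named above. The real ATTACK_ACTOR_RE lives in src/api/sentinel.ts and is not reproduced in this paper; the combined expression, the start-of-string anchoring of the first three fragments, and the helper name tag_actor are assumptions made for illustration.

```shell
# Hypothetical reconstruction: only the fragments fresh_, codex-, w[0-9]+_
# and ^demo$ come from the paper. Anchoring the first three at the start of
# the username is an assumption.
ATTACK_ACTOR_RE='^(fresh_|codex-|w[0-9]+_)|^demo$'

# Print the journal actor value for a submitted username.
tag_actor() {
  if printf '%s\n' "$1" | grep -Eq "$ATTACK_ACTOR_RE"; then
    printf '[DECEPTION] %s\n' "$1"
  else
    printf '%s\n' "$1"
  fi
}

for u in fresh_42 codex-runner w17_probe demo demo2 alice; do
  tag_actor "$u"
done
```

Under these assumptions the first four usernames are tagged and demo2 and alice pass through untagged. Because the deployed regex changed during the window, replaying any single revision over the export would not necessarily reproduce the 499 count.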
Platform-side HTTP status distribution (selected statuses), same attacker pool, same time window:
| Status | v1 count | v1 % | v2 count | v2 % |
|---|---|---|---|---|
| 200 | 1,448 | 45.1 | 690 | 43.1 |
| 401 | 302 | 9.4 | 794 | 49.6 |
| 404 | 530 | 16.5 | 64 | 4.0 |
| 429 | 3 | 0.1 | 6 | 0.4 |
| 500 | 11 | 0.3 | 0 | 0.0 |
| Total requests | 3,208 | | 1,601 | |
The v1 crashes (11 total: 6 on /api/auth/login and 5 on /api/actions/create) all originated from a single venue NAT IP and clustered between 20:10 and 20:45 UTC.
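The percentage columns in the status table can be checked against the raw counts. The sketch below assumes the counts are pasted inline and takes the two request totals as arguments; share_table is an illustrative helper name, not part of any bundled tooling.

```shell
# Recompute the v1/v2 percentage columns from raw status counts.
# Usage: share_table <v1_total> <v2_total>, rows on stdin as
# status <TAB> v1_count <TAB> v2_count. share_table is a hypothetical name.
share_table() {
  awk -F'\t' -v t1="$1" -v t2="$2" '
    { printf "%s\tv1=%.1f%%\tv2=%.1f%%\n", $1, 100 * $2 / t1, 100 * $3 / t2 }
  '
}

printf '%s\t%s\t%s\n' \
  200 1448 690 \
  401 302 794 \
  404 530 64 \
  429 3 6 \
  500 11 0 | share_table 3208 1601
```

The rows shown are selected statuses, so the percentages are not expected to sum to 100 for either deployment.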
client_ip is the TCP source as seen by Vercel's edge. Both events had attendees behind shared venue NATs; IP-level counts conflate ~6 attendees per NAT into one row. JA4 fingerprints partially de-aggregate but collide when two attendees use the same browser/Node version. All attacker counts are lower bounds.
lib/journal.ts ring buffer (50k entries, in globalThis) reset at each Vercel cold start and is not available for either event. All figures in this paper come from either the judge-aggregator export (persisted to JSONL) or vercel metrics platform counters.
client_anomaly and browser_impersonation are useful as relative signals but not as labels.
route_cpu_duration_ms (Vercel-measured CPU, not wall-clock). Excludes network, queue, and downstream I/O.
iad1). v2 was a separate codebase deployed April 16 (deployment created April 16 15:39 PDT).
attack.day-zero.dev URL. v2 received almost none of this traffic, likely because the hardened target returned fewer informative responses to recon patterns. Attacker population comparisons between v1 and v2 should account for this asymmetry, though crash-count comparisons remain valid because the crash-inducing payloads originated almost entirely from the in-room venue NAT.
All queries require Vercel CLI ≥ 51, Vercel team hackathon-sandbox, Observability Plus.
# April 9 -- reference deployment (v1) only
P1=the-infrastructure-strikes-back-starter
F=2026-04-09T22:00:00Z; T=2026-04-10T08:00:00Z
vercel metrics vercel.request.count --project $P1 \
--since $F --until $T --group-by http_status
# Crash-count by route (the Section 3 measurement)
vercel metrics vercel.request.count --project $P1 \
--since $F --until $T \
--filter "http_status eq '500'" --group-by request_path
# CPU duration p95 (hardening cost measurement)
vercel metrics vercel.request.route_cpu_duration_ms \
--project $P1 --since $F --until $T \
--aggregation p95 --group-by request_path
# April 16 -- both projects
P2=infra-strikes-back-v2
F=2026-04-16T18:00:00Z; T=2026-04-17T02:00:00Z
# Same query set as above for $P1 and $P2
# Per-IP cross-project comparison:
vercel metrics vercel.request.count --project $P2 \
--since $F --until $T \
--filter "client_ip eq '63.195.206.253'" \
--group-by http_status --group-by request_path --limit 50
Platform-side crash counts for the three defended deployments on April 9 (Saikavya, Refoldla, Manny) would substantially strengthen the analysis in Section 3 by extending the journal-vs-platform comparison from one deployment to four. If such data become available, they will be incorporated in a dated revision per the revision history (Appendix D).
This paper draws on the work and good faith of many people. The author is grateful to all of them.
Specific contributions are credited inline in the main text: the Refoldla blue team (Ethan) for the post-event response that anchors the first in-experiment Surfaces validation case (§7.4 and §6.3); Phoenix Taredi (TechSoulutions), in collaboration with Claude (Anthropic), for the reconnaissance and exploitation analysis that produced the second validation case (§7.5); and Oleksandra Bovkun (Databricks), for the observation on agentic access and application-layer governance that anchors §7.6.
The scoring and adjudication of the April 9 event depended on the time and expertise of fifteen judges. Named here with the attribution lines each provided for this paper: Abhigyan Khaund, Software Engineer; Adwait Sathe, Data Engineer (Privacy); Arjun Chakraborty, Security AI Research at Microsoft; Arun Pandiyan Perumal, Site Reliability Engineer at Adobe; Damian Li, Senior Machine Learning Engineer at Retell AI; Goutham Nekkalapu, ML Engineer at Gen (Norton); Hardik Chawla, PM Data Platform and Integrations at Amazon; Nilesh Matai, Founding Engineer at Retell AI; Prabir Vora, Technical Chief of Staff at Retell AI; Preetham Kaukuntla, Staff Data Scientist at Glassdoor; Rohil Bansal, Software Engineer at Meta; Sujitha Vummaneni, Senior Security Engineer at Ripple; Sumanth Shiva Prakash, Group Product Manager at Adobe; Tyler D'Silva, Founding Engineer at Retell AI; and Usha Ratnam, Staff Software Engineer at Ripple.
Colin Behring provided the venue at 1900 Broadway in downtown Oakland where the April 9 event was held. Adeniji Asabi supported on-site coordination and logistics within the venue. Brian Sparkes (STAK Ventures) provided event-side support.
Ted Dessert, Ryan George, and Jason Esguerra supported the event itself and have worked with Day-Zero, Office-Hours, HLOS, and Efficient Frontier Labs teams on the surrounding content, documentation, and communication of this research program.
Named attribution of the four red teams and the scored blue teams of Day-Zero Round 3 is deferred to a later revision, pending written permissions from team members. The teams' collective work is the empirical foundation of this paper.
Neil Kittleson (VectorForge) provided early feedback on the framework proposal in Section 07 and contributed to the broader conversation about adversarial AI security architecture during the paper's development.
Kevin Boyle (Enterprise Account Executive, Data + AI at Databricks) provided valuable thought leadership, feedback, and key editorial contributions during the pre-release review, specifically auditing telemetry discrepancies and verifying finding counts. His contributions are made in his personal capacity and do not represent Databricks.
All remaining errors, omissions, and interpretive choices are the author's alone.
This paper is maintained as a versioned document. Material corrections, newly incorporated evidence, and substantive clarifications are logged here with dates and brief descriptions. Minor editorial changes — typographical fixes, formatting adjustments, and non-substantive rewording — are not logged.
| Document ID | Date | Summary |
|---|---|---|
| ee1a81b5301bca9c | June 1, 2026 | Lifted the embargo and published the report. Reframed the front-matter from an embargoed preview draft to the published Field Report No. 02: removed the cover DRAFT/Confidential/Embargo banner, the recipient-only confidentiality notice, the web preview statement, the print-footer embargo line, and the Embargo rows from the cover and header meta; set Status to Published; updated the page title. Made the report publicly accessible (no longer magic-link gated; indexable). Resolved the §6.2 and §6.1 chart deception-activation reconciliation (issue #67): the 35-minute figure is timed from the opening of the live attack window on the defended deployments, when the defended-deployment URLs were released, and the first [DECEPTION] event in the raw export sits about 55 minutes into that deployment's own event log, which began recording roughly twenty minutes before the window opened; the §6.1 chart subtitle now defines t=0. Scoped the 89% telemetry undercount in §3.1, §6, and §8 as a measured property of the reference deployment and a structurally shared blind spot rather than a quantified floor across the defended set. Removed the bundled PDF pending a refreshed export; the reader download control falls back to browser print. Resolves #65 and #67. No changes to empirical findings. |
| a94e88757f926d77 | May 29, 2026 | Strengthened audit trail and reproducibility ahead of embargoed-recipient distribution. Added §B.4.1 documenting the derivation of the Saikavya 499 deception figure from the cited judge-aggregator export, with a single-pipeline jq reproducer and a citation to the source-of-truth ATTACK_ACTOR_RE classifier in Saikavya's deployment src/api/sentinel.ts. Completed the §6.2 deception-route enumeration to include /api/actions/create (3 events) and surfaced the full 353 / 136 / 7 / 3 per-route breakdown. Corrected the §3.1 wording on the second excluded team: the deployment link reached the red teams too late in the live window for sufficient attack traffic, rather than the prior "blocked by platform configuration" framing. Added a §B.1 sentence noting the aggregator source (snapshot.mjs, including computeMetrics()) now travels in the export bundle, so §B.3, §B.4, and §B.4.1 reproduce from the cited bundle without external references. Extended the .subsection-num CSS selector to cover h4 so the new §B.4.1 heading inherits the existing JetBrains-Mono muted styling. The §6.1 chart and §6.2 prose continue to describe deception as activating "35 minutes into the live window"; reconciliation with the first observed [DECEPTION] event in the telemetry (which fires at +55 minutes relative to Saikavya's first aggregator event) is tracked separately in issue #67. No changes to other empirical findings. |
| 53ee16ccabcc3989 | May 25, 2026 | Clarified deployment-count scope in the Abstract, §3.1, §B.1, and added a scope-distinction sentence in §8.4: this report's quantitative tables cover four attacked targets with usable telemetry, distinct from the broader count of blue-team deployments in the scored exercise (resolves #63). Scrubbed emdashes from §2.4, §3.1, §6.3, §7.7, and §7.9. Resolved the CHES-4D 2x2 grid overlap on desktop views; switched the content panel from a fixed 260px height to min-height: 260px with align-items: start on the container so the Signal Fidelity card cannot be clipped under text-zoom. Added a ReportTracker client component that invokes the tracking script's __efl_tracker_cleanup hook on unmount, while keeping the tracking script itself server-rendered with the per-request CSP nonce. No changes to empirical findings or methodology. |
| 0cd13ef746b04d02 | May 14, 2026 | Added Rohil Bansal (Software Engineer at Meta) to the §C.2 judges roster and updated the count from fourteen to fifteen. Tightened phrasing: "the Infrastructure Strikes Back experiment series" → "the experiment series" in three places (abstract, §8.3, §10) to avoid suggesting the event is itself a series. Corrected the §3.1 deployment count from "six" to "five" to match the explicit 3+2 accounting. Added a parenthetical at first use of "Round 3" in §B.1 clarifying the dual numbering scheme. Standardized hyphenation to "Day-Zero" in §C.2 and §C.5 and the cover contact list (the §C.4 acknowledgments roll-call already used the hyphenated form). Updated the doc-header Date field from April 22 to May 14, 2026 and expanded the experiments meta row on cover and header to include the April 16 ClawCamp session. No changes to empirical findings or methodology. |
| 59b2da503515668a | May 14, 2026 | Renamed the program from "HLOS Experiment Series" to "Adversarial AI Experiment Series" throughout. Updated publisher attribution from "Efficient Frontier Labs / HLOS.ai" to "Efficient Frontier Labs". Replaced the single feedback contact with a routed contact list (AI Infrastructure + Settlement research; Day Zero event partnerships; paper feedback and all other inquiries). The "HLOS-Judge-Probe/1.0" user-agent string (§4) and the acknowledgments roll-call (§C.5) are preserved as factual record. No changes to empirical findings or methodology. |
| 320def791b130910 | May 14, 2026 | Embargo date extended from April 30 to May 31, 2026 to accommodate an additional round of pre-release review. No changes to empirical findings, methodology, or acknowledgments. |
| 7ef3a462b5d09f6e | April 27, 2026 | Embargo date extended from April 24 to April 30, 2026. Added §C.6 Draft review and intellectual contributions, acknowledging Neil Kittleson (VectorForge). Reformatted the CHES-4D dimension cards from a 2×2 grid to a single horizontal row, presenting C-H-E-S in line. No changes to empirical findings, methodology, or other acknowledgments. |
| a4652e9dc68d60f3 | April 27, 2026 | Added a closing paragraph to §7.7 acknowledging Fokou's Parallax (April 2026) as a directly complementary contemporaneous contribution. No other substantive changes; all empirical findings, methodology, and acknowledgments preserved verbatim from the prior revision. |
| 6bb68fa68ab21dfe | April 22, 2026 | Initial embargoed preview draft. Supersedes earlier internal working drafts. Incorporates the Refoldla post-event response (§6.3, §7.4); adopts the "persistence loop" framing for the first in-experiment Surfaces validation case; refines the scored-deployment scope description; commits the paper to a revision-history policy. |
Future revisions — including incorporation of additional blue-team written responses, platform-side telemetry for the defended deployments, corrections or clarifications requested by acknowledged collaborators, and updates to the Acknowledgments section (§C.5 named attribution of red and blue teams) — will be logged in this table with the date of publication and a one-line description of the change.
On Document IDs. Each revision is identified by the first 16 hex characters of the SHA-256 digest of its canonicalized HTML. The canonicalization rule, applied before hashing, replaces every <code class="doc-id">[a-f0-9]{16}</code> substring with <code class="doc-id">__DOCUMENT_HASH__</code>. The hash is therefore a fixed point of the document — embedding it back into the file does not change the value on a subsequent run.
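The fixed-point property can be checked locally on a throwaway file. The sketch below assumes a sed that supports -E and either sha256sum or shasum on the PATH; the sample HTML and the helper name doc_id are illustrative only and are not the report's actual source.

```shell
# doc_id: print the 16-hex Document ID of an HTML file (hypothetical helper).
# Canonicalize first, so any embedded ID is masked before hashing.
doc_id() {
  sed -E 's|<code class="doc-id">[a-f0-9]{16}</code>|<code class="doc-id">__DOCUMENT_HASH__</code>|g' "$1" \
    | { if command -v sha256sum >/dev/null 2>&1; then sha256sum; else shasum -a 256; fi; } \
    | cut -c1-16
}

tmp="$(mktemp)"
# Illustrative document carrying a placeholder ID.
printf '<p>ID: <code class="doc-id">0000000000000000</code></p>\n' > "$tmp"
first="$(doc_id "$tmp")"

# Embed the computed ID back into the file and hash again.
printf '<p>ID: <code class="doc-id">%s</code></p>\n' "$first" > "$tmp"
second="$(doc_id "$tmp")"

[ "$first" = "$second" ] && echo "fixed point: $first"
rm -f "$tmp"
```

Both runs hash the same canonical bytes because the embedded ID is masked before hashing, so the second value equals the first regardless of which 16-hex string is embedded.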
Recipient verification. The canonical source bytes for this revision are served at /api/research/<slug>/canonical over the same session cookie that grants access to this page. To reproduce the cover Document ID:
curl -sL https://efficientfrontierlabs.com/api/research/field-report-002-preview/canonical \
--cookie "fr002_preview_auth=<your session cookie>" \
| sed -E 's|<code class="doc-id">[a-f0-9]{16}</code>|<code class="doc-id">__DOCUMENT_HASH__</code>|g' \
| shasum -a 256 | cut -c1-16
Historical Document IDs identify earlier revisions as committed to source control; they are stable but reproducible only with access to the corresponding revision's source.