Findings from an Authorized Adversarial Security Experiment on AI Agent Settlement Infrastructure
This white paper is a public findings report for press, academic, industry, and investor audiences. Technical exploit details, exact endpoint paths, and exploit-enabling payloads have been intentionally omitted. Findings were disclosed to affected vendors and infrastructure providers prior to publication, and the critical issues described here were remediated before release.
Correspondence and feedback may be directed to context@hlos.ai.
On March 12, 2026, HLOS conducted an authorized adversarial security experiment against a live AI agent settlement platform. During the exercise, participants and participant-directed autonomous systems discovered and exercised 39 canonical vulnerability families across authentication, billing, access control, information exposure, and settlement-adjacent surfaces.
The most consequential operational failure was an environment isolation breakdown: the event sandbox retained production-connected Firebase and Stripe credentials instead of isolated staging credentials. As a result, the exercise generated real production-connected effects, including 424 account creations and 109 live Stripe checkout sessions representing approximately $60,000 in attempted checkout value.
The central finding of this report is not the presence of any single vulnerability. It is the convergence behavior of materially different AI attack architectures. Teams using a hierarchical agent swarm, a single autonomous agent, AI-parallelized testing, AI combined with existing pentesting tooling, and an open-weight model on a remote GPU independently converged on the same economically meaningful weaknesses within roughly two hours.
The platform did not fail uniformly. Multiple outer-layer controls failed, but no adversarial transaction completed, no funds were captured, and no settlement artifact was forged. Core settlement finality remained intact even as authentication, environment binding, ownership validation, and credential governance proved materially more fragile.
The Penthouse Heist was an authorized adversarial security experiment designed to test whether autonomous or semi-autonomous AI systems could discover meaningful weaknesses in a modern agent-financial stack faster than traditional manual testing would be expected to. Participants operated within a defined scope and under monitored conditions. Cash prizes were offered across three challenge paths: compromise the settlement pipeline, break receipt-chain integrity, and defeat agent attribution.
This document is a public findings report rather than a step-by-step technical disclosure. Exploit reproduction steps, exact endpoint paths, and exploit-enabling payloads have been omitted or generalized.
The target was a settlement infrastructure platform for AI agents. Its architecture included Firebase Authentication, Stripe payment processing, a Cloud Run application layer, API proxy services, wallet management, and a cryptographic settlement kernel responsible for producing and binding settlement artifacts. The event focused on application, authentication, billing, and settlement-adjacent surfaces exposed to participants during the exercise window.
A core analytical distinction in this report is that application-layer and environment controls failed in multiple places, while the deepest settlement integrity guarantees did not. That distinction shapes both the forensic interpretation of the event and the post-remediation implications for agent-financial infrastructure more broadly.
Participants were invited to explore the accessible attack surface using their own methods and tooling, including autonomous systems operating with limited human direction. For the purposes of reporting, overlapping submissions were consolidated into canonical vulnerability families when multiple reports described the same underlying control failure or exploit pattern. Similarly, attack architectures were classified by the dominant operating pattern demonstrated by the relevant team or participant.
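The consolidation step described above can be illustrated with a minimal sketch. It assumes each submission has already been labeled with the affected surface and the underlying control failure; the `Submission` fields, the grouping key, and the sample data are hypothetical, and the actual consolidation for this report involved reviewer judgment rather than a purely mechanical key match.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Submission(NamedTuple):
    """Hypothetical shape for one submitted finding."""
    team: str
    surface: str          # e.g. "billing"
    control_failure: str  # e.g. "missing ownership check"


def consolidate(submissions: Iterable[Submission]) -> Dict[Tuple[str, str], List[Submission]]:
    """Group overlapping submissions under one canonical family per root cause."""
    families: Dict[Tuple[str, str], List[Submission]] = defaultdict(list)
    for sub in submissions:
        # Normalize labels so trivially different wording maps to the same family.
        key = (sub.surface.strip().lower(), sub.control_failure.strip().lower())
        families[key].append(sub)
    return dict(families)


# Illustrative data only; these are not findings from the event.
sample = [
    Submission("team-a", "Billing", "Missing ownership check"),
    Submission("team-b", "billing", "missing ownership check "),
    Submission("team-c", "auth", "middleware fails open"),
]
families = consolidate(sample)
print(len(sample), "submissions ->", len(families), "families")
```

Keying on the control failure rather than on the exploit narrative is what lets reports from different teams collapse into a single family.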
The event was authorized. The production-connected credential crossover was not. The most significant operational error was that the sandbox deployment shared production-connected Firebase and Stripe credentials instead of isolated staging credentials. The resulting exposure materially changed the significance of otherwise expected event traffic and converted the exercise into a production-connected control failure with real but contained operational consequences.
Telemetry analysis reviewed 228,681 total spans across the 14-day period surrounding the event and identified a clear reconnaissance ramp into the live exercise window. The highest-confidence attack-associated pattern began approximately five hours before event start as participants mapped public pages, signup flows, billing surfaces, authentication states, project scopes, and settlement-relevant endpoints.
By the time the event formally began, a meaningful portion of the accessible attack surface had already been enumerated. The most concentrated attack hour was dominated by settlement-, receipt-, challenge-, token-, and MCP-related traffic, indicating that teams quickly oriented toward the platform's most economically meaningful surfaces rather than peripheral edge cases.
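A reconnaissance ramp and a peak attack hour of this kind can be surfaced by bucketing span timestamps by hour. The sketch below is illustrative only: the report does not describe the telemetry tooling used, and the function names and sample timestamps are assumptions.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Iterable, Tuple


def spans_per_hour(timestamps: Iterable[datetime]) -> Counter:
    """Count spans in each clock hour by truncating timestamps to the hour."""
    counts: Counter = Counter()
    for ts in timestamps:
        counts[ts.replace(minute=0, second=0, microsecond=0)] += 1
    return counts


def peak_hour(counts: Counter) -> Tuple[datetime, int]:
    """Return the (hour, span_count) pair with the highest span count."""
    if not counts:
        raise ValueError("no spans to analyze")
    return max(counts.items(), key=lambda item: item[1])


# Synthetic timestamps for illustration; not event telemetry.
sample_spans = [
    datetime(2026, 3, 12, 13, 5, tzinfo=timezone.utc),
    datetime(2026, 3, 12, 18, 10, tzinfo=timezone.utc),
    datetime(2026, 3, 12, 18, 45, tzinfo=timezone.utc),
]
hourly = spans_per_hour(sample_spans)
busiest_hour, busiest_count = peak_hour(hourly)
```

Comparing hourly counts in the hours before the event start against a quieter baseline is one simple way to make a pre-event ramp visible.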
During the event window, the environment isolation breakdown produced four externally visible consequences: 424 accounts were created in the production Firebase instance; 109 live Stripe checkout sessions were initiated against the production Stripe account; those sessions represented approximately $60,000 in attempted checkout value; and zero adversarial transactions completed because sessions expired and adversarial wallets contained no funds. This was a serious production-connected control failure, but not a completed financial-loss event.
The most notable result of the exercise was that materially different AI-assisted attack architectures independently converged on the same high-value weaknesses. The first-place participant used a three-tier hierarchical swarm built around an unmodified commercial coding assistant. The second-place participant used a single autonomous agent seeded primarily with the event PDF, which autonomously generated a reconnaissance report, reverse-engineered a substantial portion of the x402 settlement flow, identified a provider credential injection issue, and produced submission-quality writeups.
Other top-performing teams used AI to parallelize static analysis and live API testing, combined AI systems with existing pentesting frameworks, or ran coordinator-worker setups around open-weight models on remote GPUs. The significance is not that one attack architecture dominated. It is that several distinct architectures reached similar conclusions quickly, suggesting that broad AI-enabled exploration is now a practical offensive capability even without bespoke tooling or frontier-only models.
Ninety submissions were consolidated into 39 canonical vulnerability families. The families clustered across broken access control, authentication bypass, business-logic failure, information disclosure, security misconfiguration, and injection-related weaknesses. The dominant pattern was not exotic zero-day behavior. It was ordinary authorization and ownership assumptions that were shallow enough to be discovered in parallel at machine speed.
The most severe families included unauthenticated reachability of settlement-relevant surfaces, global settlement denial-of-service conditions, receipt-chain suppression, auth middleware fail-open behavior, billing misattribution through arbitrary wallet identifiers, cross-tenant read exposure, unrestricted Firebase account creation, provider credential injection, billing callback SSRF, and proxy SSRF. Once autonomous systems explored them systematically, these issues combined into a high-severity operating condition.
The 424 adversarial Firebase accounts were classifiable by testing intent. Observed naming and activity patterns included query-shaped accounts, settlement-themed aliases, object-reference probes, injection-oriented variants, flood accounts, path-traversal experiments, SSRF-oriented probes, authentication tests, and privileged-identity impersonation aliases. A subset impersonated internal system identities under the platform domain, which Firebase accepted without effective domain validation or rate controls.
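The two missing controls named above, domain validation and rate limiting, can be sketched as a guard placed in front of account creation. This is a minimal illustration under stated assumptions, not the platform's remediation: `SignupGuard`, the reserved domain `platform.example`, and the limits are all hypothetical, and a real deployment would also need email verification and shared state across instances.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict

# Hypothetical reserved domain; the report does not name the platform domain.
RESERVED_DOMAINS = frozenset({"platform.example"})


class SignupGuard:
    """Rejects reserved-domain signups and rate-limits accepted signups per source."""

    def __init__(self, max_per_window: int = 5, window_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.clock = clock
        self._accepted: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, email: str, source: str) -> bool:
        local, sep, domain = email.strip().lower().rpartition("@")
        if not sep or not local or not domain:
            return False  # malformed address
        if domain in RESERVED_DOMAINS:
            return False  # public signup may not claim an internal identity
        now = self.clock()
        recent = self._accepted[source]
        # Drop accepted signups that have aged out of the sliding window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_per_window:
            return False  # source exceeded its signup budget
        recent.append(now)
        return True
```

Injecting the clock keeps the limiter testable. The `source` key is deliberately abstract, since the right unit (IP address, device, or upstream identity) depends on the deployment.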
The platform did not fail uniformly. Core settlement finality remained intact. No settlement artifact was forged, and cryptographic hash bindings were not compromised. However, receipt completeness guarantees did fail, and the platform's outer layers proved substantially more fragile than its deepest integrity boundary. For external readers, this is a critical distinction: the event exposed serious operational and access-control weaknesses, but not a compromise of the core settlement artifact model.
Three root-cause themes emerged. First, environment isolation failed: production-connected credentials were present in the sandbox deployment, introduced during AI-assisted configuration work. When AI participates in infrastructure setup, environment binding and secret provenance must be treated as first-class controls rather than implicit assumptions.
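One way to treat environment binding as an explicit control is a startup check that refuses to run a non-production deployment holding production-connected secrets. The sketch below is an assumption-laden illustration rather than the remediation actually applied: the function name, environment labels, and project identifiers are hypothetical. It relies only on Stripe's documented key prefixes, where test-mode keys begin with `sk_test_` or `rk_test_` and live-mode keys with `sk_live_` or `rk_live_`.

```python
from typing import FrozenSet, List


class EnvironmentBindingError(RuntimeError):
    """Raised when a deployment's secrets do not match its declared environment."""


def check_environment_binding(env_name: str, stripe_key: str,
                              firebase_project: str,
                              production_projects: FrozenSet[str]) -> None:
    """Fail closed if a non-production deployment holds production-connected secrets."""
    if env_name == "production":
        return  # production is allowed to hold production secrets
    problems: List[str] = []
    # Require a positive match on test-mode prefixes instead of merely excluding live ones.
    if not stripe_key.startswith(("sk_test_", "rk_test_")):
        problems.append("Stripe key is not a test-mode key")
    # production_projects is a hypothetical inventory of production Firebase project IDs.
    if firebase_project in production_projects:
        problems.append("Firebase project is a production project")
    if problems:
        raise EnvironmentBindingError(f"{env_name}: " + "; ".join(problems))
```

Running a check like this before the application accepts traffic turns a silent credential crossover into a failed deployment.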
Second, permissive boundary behavior recurred across the stack. Several controls prioritized request completion over hard rejection at the boundary, including fail-open auth middleware and insufficient ownership validation. The lesson is not that AI-generated code is universally insecure; it is that AI-assisted development in security-sensitive systems requires explicit fail-closed review, especially around auth, billing, and credential mutation paths.
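The fail-closed posture described above can be made concrete with a short sketch. It assumes a token verifier supplied by the caller and a wallet-owner lookup. The names `authenticate`, `require_wallet_owner`, `verify`, and `owner_of` are hypothetical and do not represent the platform's code.

```python
from typing import Callable, Optional


class AuthError(Exception):
    """Raised for every denial; callers map it to a hard rejection."""


def authenticate(token: Optional[str], verify: Callable[[str], dict]) -> dict:
    """Return verified claims or raise AuthError; never fall back to a default identity."""
    if not token:
        raise AuthError("missing credential")
    try:
        claims = verify(token)
    except Exception as exc:
        # Fail closed: verifier outages and malformed tokens are denials, not pass-throughs.
        raise AuthError("verification failed") from exc
    if not isinstance(claims, dict) or not claims.get("uid"):
        raise AuthError("verified token carries no subject")
    return claims


def require_wallet_owner(claims: dict, wallet_id: str,
                         owner_of: Callable[[str], Optional[str]]) -> None:
    """Reject billing operations on wallets the caller does not own."""
    owner = owner_of(wallet_id)
    # Unknown wallets are rejected too; absence of a record is not permission.
    if owner is None or owner != claims["uid"]:
        raise AuthError("caller does not own wallet")
```

The design choice is that every uncertain outcome, including a verifier exception or an unknown wallet, resolves to rejection. A fail-open variant would typically log the error and continue, which is the behavior this theme warns against.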
Third, credential mutation emerged as a distinct threat model. The provider credential injection issue showed that an attacker could replace a stored credential and silently redirect platform traffic through attacker-controlled infrastructure. Credential administration and credential consumption should therefore be treated as separate security domains with different permissions, controls, and monitoring assumptions.
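The separation between credential administration and credential consumption can be sketched as two distinct permission scopes over the same store, with every mutation attempt recorded. The class, scope names, and audit format below are hypothetical illustrations. The report does not describe how the platform stores or scopes provider credentials.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Set, Tuple


class Scope(Enum):
    CREDENTIAL_USE = "credential:use"      # may read a credential to call a provider
    CREDENTIAL_ADMIN = "credential:admin"  # may replace a stored credential


@dataclass
class CredentialStore:
    _secrets: Dict[str, str] = field(default_factory=dict)
    audit_log: List[Tuple[str, str, str]] = field(default_factory=list)

    def use(self, scopes: Set[Scope], provider: str) -> str:
        """Consumption path: read-only access for request handling."""
        if Scope.CREDENTIAL_USE not in scopes:
            raise PermissionError("credential:use scope required")
        return self._secrets[provider]

    def replace(self, scopes: Set[Scope], actor: str, provider: str, secret: str) -> None:
        """Administration path: mutation requires a separate scope and is always audited."""
        if Scope.CREDENTIAL_ADMIN not in scopes:
            self.audit_log.append((actor, provider, "replace-denied"))
            raise PermissionError("credential:admin scope required")
        self._secrets[provider] = secret
        self.audit_log.append((actor, provider, "replaced"))
```

Under this split, an ordinary request path that holds only `credential:use` cannot redirect provider traffic by overwriting a stored credential, and denied attempts leave an audit trail that monitoring can alert on.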
For authentication providers, account-creation systems that accept arbitrary identities without strong verification are poorly matched to an environment in which AI can automate account generation at machine speed. For payment processors, the lesson is not that payment rails failed internally, but that upstream access-control weaknesses can make payment-relevant flows reachable by adversarial actors. For teams building with AI-assisted development tools, the same systems that accelerate implementation can also accelerate broad offensive discovery against shallow control failures.
This experiment also suggests that static defenses and rules-based anomaly systems alone may be insufficient against automated, parallelized attack exploration by autonomous agents. Continuous adversarial testing is better understood not as an optional maturity layer, but as a core control for platforms that combine identity, billing, and agent execution surfaces.
A broader implication is cognitive as well as technical. Prior research has shown that informed reviewers can miss vulnerability classes that independent reviewers later identify, and that broader review populations improve overall discovery coverage. This experiment reproduced that pattern under AI-mediated conditions. Despite substantial prior hardening work, external participants and participant-directed systems surfaced critical weaknesses within hours. Internal review remains necessary, but it inherits the assumptions of the builder. Adversarial testing adds a different search posture, one that becomes especially valuable when autonomous systems can explore broad attack surfaces without assuming that previously reviewed paths are already secure.
This finding also resonates with David Wagner's work at UC Berkeley on automated security analysis and systematic vulnerability discovery. A central lesson from that body of research is that machine-driven exploration is valuable not because any single probe is especially sophisticated, but because automated systems can test widely, consistently, and without the contextual assumptions that shape human review. The Penthouse Heist extended that dynamic into an agent-mediated setting: materially different AI attack architectures converged on the same high-value weaknesses with limited human direction, reinforcing the case for continuous externalized adversarial testing as a core control for modern agent infrastructure.
All critical vulnerabilities described in this report were remediated prior to publication. Completed remediation work included immediate containment, architecture hardening, access-control enforcement, and operational changes to reduce the likelihood of similar failures recurring.
| Remediation Track | Completed Actions Prior to Publication |
|---|---|
| Containment | Key rotation, sandbox decommissioning, and suspension of affected test paths. |
| Authentication | Auth middleware hardening, account suspension mechanisms, and per-user rate limiting. |
| Billing | Ownership validation enforcement and tighter controls around payment-relevant flow access. |
| Credentials | Restrictions on credential update paths and clearer separation between credential administration and credential consumption. |
| Operations | Stronger environment-binding controls and explicit review of AI-assisted configuration work involving secrets and deployment settings. |
The statistics below use the following definitions. "Participants" refers to registered event participants. "Total submitted findings" includes overlapping reports later consolidated. "Canonical vulnerability families" refers to grouped root-cause findings after deduplication. "Aggregate attempted checkout value" refers to the face value of initiated checkout sessions, not funds charged, captured, or settled.
| Metric | Value |
|---|---|
| Registered participants | 125+ |
| Total submitted findings | 90 |
| Canonical vulnerability families | 39 |
| Adversarial accounts created | 424 |
| Adversarial Stripe sessions initiated | 109 |
| Aggregate attempted checkout value | ~$60,000 |
| Completed adversarial transactions | 0 ($0 captured) |
| Peak telemetry spans in one hour | 89,246 |
| Total spans in the attack window | 157,664 |
| Pre-event reconnaissance duration | ~5 hours |
| Distinct AI attack architectures observed | 5 |
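
A few ratios can be derived from the reported figures. The short calculation below is an illustrative sketch: it uses only numbers stated in this report, treats the ~$60,000 checkout value as exact for arithmetic purposes, and assumes that the peak hour falls inside the attack window and that the attack window falls inside the 14-day telemetry period. The variable names are not part of the report.

```python
# Figures as reported in this document.
TOTAL_SPANS_14_DAYS = 228_681
ATTACK_WINDOW_SPANS = 157_664
PEAK_HOUR_SPANS = 89_246
SUBMISSIONS = 90
FAMILIES = 39
CHECKOUT_SESSIONS = 109
ATTEMPTED_CHECKOUT_USD = 60_000  # reported as approximate

derived = {
    # Average face value of an initiated checkout session, in USD.
    "avg_attempted_usd_per_session": ATTEMPTED_CHECKOUT_USD / CHECKOUT_SESSIONS,
    # Average number of overlapping submissions per canonical family.
    "submissions_per_family": SUBMISSIONS / FAMILIES,
    # Share of attack-window spans that occurred in the single peak hour.
    "peak_hour_share_of_attack_window": PEAK_HOUR_SPANS / ATTACK_WINDOW_SPANS,
    # Share of all 14-day spans that occurred in the attack window.
    "attack_window_share_of_14_days": ATTACK_WINDOW_SPANS / TOTAL_SPANS_14_DAYS,
}

for name, value in derived.items():
    print(f"{name}: {value:.3f}")
```

Under those assumptions, the figures work out to roughly $550 in attempted value per session, about 2.3 submissions per canonical family, about 57% of attack-window spans in the peak hour, and about 69% of the 14-day span volume inside the attack window.
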
Findings were disclosed to affected vendors prior to publication, including Firebase / Google Cloud and Stripe. This report excludes exploit reproduction steps and exact endpoint paths by design. The experiment was conducted under authorized rules of engagement, and the most serious findings were remediated before release.
This paper draws on the work and good faith of many people. The author is grateful to all of them.
The March 12 event was made possible by the collaboration of Adeniji Asabi, Brian Sparkes, Colin Behring, Ryan George, and Ted Dessert, alongside the author. This is the same core collaborator group that has supported the broader research program and subsequent field reports.
Special thanks to the judges and mentors who contributed to the event in their personal capacities. Their involvement helped make the exercise both rigorous and operationally informative. Judges and mentors included Adwait Sathe, Ari Gore, Avia Haimovich, Bhargava Kakrannaya, Chetan Jannu, Dhruv Diddi, Elizabeth "Lizzie" Siegle, Goutham Nekkalapu, Karthik Rao, Peter Van Voorhis, Rayyan Zahid, Ross Gates, Safiya Adatia, Sandeep Ayloo, Shri Lakshmi Rajagopal, Sumanth Shiva Prakash, Thiyagarajan Palaniyappan, Tosh Rayadhurgam, Usha Ratnam Jammula, and Usman Masood.
Winning teams included VCN, openClaw, JMD/DRMJ, NextPhase.ai, Socials, Canniffe, and Accelerando. Participating teams also included Scrooge, Eurasia, :), Innovators, BAOBAB, fogsignal, and ChefMate. The empirical foundation of this paper is their collective work.
All remaining errors, omissions, and interpretive choices are the author's alone.
This paper is maintained as a versioned document. Material corrections, newly incorporated evidence, and substantive clarifications are logged here with dates and brief descriptions. Minor editorial changes — typographical fixes, formatting adjustments, and non-substantive rewording — are not logged.
| Version | Date | Summary |
|---|---|---|
| v1.1 | April 22, 2026 | Reformatted into the Adversarial AI Experiment Series template used for Field Report No. 02 (The Infrastructure Strikes Back). Content unchanged from v1.0. Added this revision history appendix and restructured the Acknowledgments into the series-standard A.1–A.3 format; the event contributors line previously appearing on the cover was moved into §A.1, preserving the same visibility without altering the cover layout. |
| v1.0 | March 12, 2026 | Initial publication. |
Future revisions — including incorporation of additional event materials, participant writeups (with consent), methodology expansion, or corrections requested by acknowledged collaborators — will be logged in this table with the date of publication and a one-line description of the change.