Agentic Procurement Failure: Why Architecture Comes Before the Contract

The 2026 McKinsey Lilli incident looked like a SQL injection. It was a procurement-architecture failure. This piece sets out the engineering reality for AU listed and pre-IPO companies whose boards are now asking whether their agentic AI roadmap is an unpriced liability.

Bounded SaaS: the legacy procurement regime. Human at a screen; UI silently mediates every permission.
Unbounded Agents: the emerging procurement regime. Autonomous agent on APIs; permission must be in code, per call.
Agentic Due Diligence: the procurement-side answer. Engineering review during vendor evaluation, before contract signing.

By Gregory McKenzie · Registered Trans-Tasman Patent Attorney & Systems Architect · NETEVO · 14 min read · Published 29 May 2026

In March 2026, security researcher Paul Price, writing as CodeWall, disclosed that an autonomous AI agent had breached McKinsey's internal generative-AI platform (a tool named Lilli, used by 72 per cent of the firm) in under two hours, for roughly twenty dollars in tokens. McKinsey's response was textbook: patched within hours, transparent communication, and no evidence of client data accessed by unauthorised parties.

The incident reads as a SQL injection story. It is not. It is the most public expression to date of a procurement-architecture pattern hiding inside every enterprise agentic AI deployment: when the consumer of a vendor system is an autonomous agent rather than a human at a screen, the user interface stops being the permission boundary. The architectural work the UI did silently has to be done explicitly in code, on every API call. Twenty-two of the platform's two-hundred-plus endpoints required no authentication, and the database that fronted them held writable system prompts alongside user data.

This piece is for the people who decide — boards, CEOs, CFOs, heads of procurement, CTOs at strategic level — not the people who fix. The Law-to-Code Methodology treats governance as architecture rather than as PDFs; that principle extends from the visibility layer (the companion AEO vs GEO vs SEO pillar) into platform and agent infrastructure. NETEVO's principal is a registered Trans-Tasman patent attorney and systems architect.

What actually happened with Lilli

The verified facts, sourced to the primary disclosures. The point of stating them precisely is that the rest of the piece rests on a stable factual base; the procurement-architecture argument does not need exaggeration to be load-bearing.

Lilli is the McKinsey-internal generative-AI platform launched firmwide in July 2023. At the point of the incident, McKinsey reports seventy-two per cent of the firm using it actively, more than five hundred thousand prompts per month, and over forty knowledge sources connected to it. The attack landed on 2026-02-28. The researcher disclosed to McKinsey on 2026-03-01, patches were deployed on 2026-03-02, public disclosure followed on 2026-03-09, and McKinsey issued its official statement on 2026-03-11.

The technical root cause was unglamorous. Twenty-two of the two-hundred-plus endpoints documented in the platform required no authentication. One of them concatenated a JSON key directly into a SQL query, and the database returned its errors verbatim through the API response — classic SQL injection, the kind of issue an undergraduate web-security curriculum addresses. OWASP ZAP did not detect it; the researcher's autonomous agent did. From that single endpoint, the agent reached the database that held the platform's user accounts, its chat history, its retrieval-augmented-generation document store, and — load-bearing for the argument that follows — the writable system prompts that defined how Lilli itself behaved.
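
The concatenation pattern described above can be sketched in a few lines. This is an illustrative reconstruction, not Lilli's code: the `docs` table, the `ALLOWED_FILTER_COLUMNS` allow-list, and both function names are hypothetical, and an in-memory SQLite database stands in for the real store. The point it shows is that binding the value is not enough when the JSON key itself is spliced into the query text.

```python
import json
import sqlite3

# Hypothetical allow-list: the only JSON keys accepted as column names.
ALLOWED_FILTER_COLUMNS = {"owner", "title"}


def search_unsafe(conn, body: str):
    """Vulnerable pattern: the JSON key is concatenated into the SQL text."""
    key, value = next(iter(json.loads(body).items()))
    # The value is bound safely, but the key still becomes part of the query.
    return conn.execute(f"SELECT title FROM docs WHERE {key} = ?", (value,)).fetchall()


def search_safe(conn, body: str):
    """Safer pattern: identifiers come from an allow-list, values are bound."""
    key, value = next(iter(json.loads(body).items()))
    if key not in ALLOWED_FILTER_COLUMNS:
        # Generic error: no database detail is echoed back to the caller.
        raise ValueError("unsupported filter")
    # The key is now one of a fixed set of known column names.
    return conn.execute(f"SELECT title FROM docs WHERE {key} = ?", (value,)).fetchall()


def demo_connection():
    """Build a throwaway in-memory database for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (owner TEXT, title TEXT)")
    conn.executemany(
        "INSERT INTO docs VALUES (?, ?)",
        [("alice", "Q1 plan"), ("bob", "Board memo")],
    )
    return conn
```

Sending a body such as `{"1=1 OR owner": "nobody"}` to `search_unsafe` returns every row, because the key rewrites the `WHERE` clause; `search_safe` rejects the same body before any SQL is built.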

The damage-surface figures need a careful read. CodeWall asserts the database contained 46.5 million chat messages, 728,000 files, 57,000 user accounts, and 95 writable system prompts across twelve model types. McKinsey's official statement reports the firm "identified no evidence that client data or client confidential information were accessed." Both can be true: CodeWall measured what was reachable; McKinsey reports what was actually accessed. The architectural exposure does not depend on which figure prevails — the argument is about reachability, not exfiltration.

IDC analyst Alessandro Perilli, in Agentic AI Governance: When AI Becomes Critical Infrastructure, frames it plainly: "The McKinsey/Lilli incident should be read as a market signal."

Where the architecture fails

The pattern Lilli exposed is not specific to Lilli. It is specific to every system in which an autonomous agent is now the consumer of an API surface designed in the era when a human at a screen was the consumer. The shift is between two named procurement regimes — Bounded SaaS and Unbounded Agents — and the architectural work required by the second has to be done explicitly in code, where the user interface used to do it silently.

| Dimension | Bounded SaaS | Unbounded Agents |
| --- | --- | --- |
| Consumer | Human at a screen | Autonomous software agent |
| Permission boundary | UI rendering rules (silent; role-based) | Code on every API endpoint (explicit; scope-checked per call) |
| Blast radius of one compromised credential | Bounded by what the UI shows that user | Every endpoint the agent can authenticate to, often the whole platform |
| Where the architectural work lives | Vendor-built UI does it silently | Vendor and buyer must build it explicitly, in code, on every call |

Same substrate. Different procurement assumption. Three concrete failure modes follow.

The UI was the security boundary. Agents have no UI.

Between roughly 2005 and 2023, SaaS procurement worked because the user interface did invisible permission work. The screen rendered only what the user's role allowed; permission was enforced as a side-effect of what the user could see. The procurement question — can this vendor be trusted with our finance data? — landed on a regime where the UI bounded the blast radius.

Autonomous agents have no UI. The agent talks to the underlying API directly. Every endpoint that previously sat behind the screen is now reachable in principle. The architectural work required is the explicit enforcement of permission on every API call, against the agent's scoped credential, with an audit trail of every decision. The authoritative references are OWASP API Security Top 10 (API1, API2, API5) and the OWASP Top 10 for Agentic Applications 2026 (ASI03 Identity and Privilege Abuse). IDC's Perilli puts the same point at a higher level: "Once a system can see proprietary knowledge, shape work products, and connect to tools, it stops being a productivity layer."
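
A minimal sketch of what "explicit enforcement on every API call" can look like follows. It assumes a simple string-scope model; `AgentCredential`, `requires_scope`, the scope names, and the in-memory `AUDIT_LOG` are all hypothetical stand-ins, not a vendor's API. Each endpoint declares the scope it needs, and every call records an allow or deny decision before the handler runs.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from functools import wraps

AUDIT_LOG = []  # stand-in for a durable audit store (assumption)


@dataclass(frozen=True)
class AgentCredential:
    agent_id: str
    scopes: frozenset


class ScopeDenied(Exception):
    """Raised when a credential lacks the scope an endpoint requires."""


def requires_scope(scope: str):
    """Check the caller's scope on every call and record the decision."""
    def decorate(handler):
        @wraps(handler)
        def guarded(credential: AgentCredential, *args, **kwargs):
            allowed = scope in credential.scopes
            AUDIT_LOG.append({
                "at": datetime.now(timezone.utc).isoformat(),
                "agent": credential.agent_id,
                "endpoint": handler.__name__,
                "required_scope": scope,
                "decision": "allow" if allowed else "deny",
            })
            if not allowed:
                raise ScopeDenied(f"{credential.agent_id} lacks {scope}")
            return handler(credential, *args, **kwargs)
        return guarded
    return decorate


@requires_scope("invoices:read")
def list_invoices(credential):
    return ["INV-001", "INV-002"]


@requires_scope("invoices:approve")
def approve_invoice(credential, invoice_id):
    return f"{invoice_id} approved"
```

An agent holding only `invoices:read` can call `list_invoices` but is refused by `approve_invoice`, and both decisions appear in the log. In this design no endpoint is reachable without a declared scope, which is the property the UI used to provide implicitly.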

User data and AI configuration in the same database is a control-plane failure

Lilli's database held two different kinds of object alongside each other: user data (chat messages, files, accounts) and what should have been control-plane data (writable system prompts, RAG knowledge bases, agent configuration). When a SQL injection in the data plane reached the control plane, the consequences were not symmetric. Compromised data is a privacy and confidentiality event. Compromised control plane is a behavioural event — the agent itself becomes the lever. Ninety-five writable system prompts across twelve model types is not a database; it is the agent's behaviour, in a relational table, behind the same authentication boundary as user content.

OWASP LLM Top 10 (2025) names two of the relevant categories — LLM07 System Prompt Leakage and LLM08 Vector and Embedding Weaknesses — and the OWASP Top 10 for Agentic Applications 2026 extends them as ASI06 Memory and Context Poisoning. The architectural prescription is that data plane and control plane belong on separate stores, with separate authentication, separate audit, and separate revocation. Co-residence is a procurement-time question, not a security-engineering afterthought.

When Agent A delegates to Agent B, scope must narrow. It usually does not.

Multi-agent workflows compound the failure mode. When an upstream agent delegates a task to a downstream agent without explicitly narrowing the downstream agent's scope, the downstream agent inherits the upstream agent's full authority by default. NETEVO names this failure mode Implicit Authority Cascade (IAC). The word implicit is load-bearing: the failure is that scope was never narrowed, not that scope was widened. The 1988 confused-deputy problem is the prior art; what is new in 2026 is the frequency with which delegation now happens, in production, between agents, without a designer in the loop.
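
One way to design out the cascade is to make delegation impossible without an explicit, narrower scope list. The sketch below assumes a flat set-of-strings scope model; `AgentGrant`, `delegate`, and the scope names are hypothetical. There is deliberately no default that copies the parent's scopes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentGrant:
    agent_id: str
    scopes: frozenset

    def delegate(self, child_id: str, scopes: set) -> "AgentGrant":
        """Issue a child grant; the caller must name the child's scopes."""
        requested = frozenset(scopes)
        if not requested:
            # No implicit inheritance: an empty request is an error,
            # not a signal to copy the parent's authority.
            raise ValueError("delegation must name at least one scope")
        if not requested <= self.scopes:
            extra = sorted(requested - self.scopes)
            raise PermissionError(f"parent does not hold: {extra}")
        return AgentGrant(child_id, requested)
```

A planner holding `crm:read`, `crm:write`, and `email:send` can hand a summariser `crm:read` alone; asking for a scope the planner lacks, or for nothing at all, fails at the point of delegation and not later in production.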

The standards-track answer is forming. The IETF Internet-Draft on AI Agent Authentication and Authorization (Kasselman, Lombardo, Rosomakho, Campbell) builds on RFC 9635 (GNAP) and proposes unique agent identifiers, short-lived credentials, explicit scopes, revocation, observability, and audit. Forrester's Agent Control Planes Still Need A Robust Standards Stack — published 2026-03-20, eleven days after Lilli surfaced — observes that "agent control planes are a third plane" and that "agent governance identity does not travel". The vocabulary is converging. The architecture is engineering work, today.
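
The properties listed above (short-lived credentials, explicit scopes, revocation) can be illustrated with a small in-memory broker. This is a teaching sketch under stated assumptions, not an implementation of GNAP or of the Internet-Draft: `CredentialBroker`, its method names, and the five-minute default lifetime are all hypothetical.

```python
import secrets
import time


class CredentialBroker:
    """Issues opaque, expiring tokens and supports per-agent revocation."""

    def __init__(self, ttl_seconds: float = 300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._issued = {}    # token -> (agent_id, scopes, expires_at)

    def issue(self, agent_id: str, scopes) -> str:
        token = secrets.token_urlsafe(16)
        self._issued[token] = (agent_id, frozenset(scopes), self._clock() + self._ttl)
        return token

    def revoke_agent(self, agent_id: str) -> int:
        """Drop every live token for one agent; return how many were revoked."""
        doomed = [t for t, (owner, _, _) in self._issued.items() if owner == agent_id]
        for token in doomed:
            del self._issued[token]
        return len(doomed)

    def check(self, token: str, scope: str) -> bool:
        """True only if the token is known, unexpired, and carries the scope."""
        record = self._issued.get(token)
        if record is None:
            return False
        _, scopes, expires_at = record
        if self._clock() >= expires_at:
            del self._issued[token]  # expired tokens are forgotten
            return False
        return scope in scopes
```

Because every call goes through `check`, revocation takes effect on the next request, and a leaked token is useful only until it expires.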

The pattern across seventeen disclosed incidents

Lilli is not a one-off. It is one of seventeen publicly-disclosed enterprise agentic-AI security incidents between mid-2024 and mid-2026, eight months of which fall inside the most recent disclosure window. The most relevant entries, with primary citations:

| Date | Target | Discloser | Vulnerability class |
| --- | --- | --- | --- |
| 2026-05-07 | Microsoft Semantic Kernel | Microsoft Defender Security Research | Prompt-injection to remote code execution in agent framework |
| 2026-04-13 | Bain Pyxis | CodeWall | Hardcoded credentials, SQL injection, GraphQL account creation, Okta modification |
| 2026-03-31 | BCG X AI data platform | CodeWall | Unauthenticated raw-SQL endpoint, 3.17 trillion rows reachable |
| 2026-03-09 | McKinsey Lilli | CodeWall | Unauth APIs, SQL injection, IDOR, writable AI configuration |
| 2025-10-23 | ServiceNow Now Assist | Doyensec / AppOmni | Auth bypass (CVE-2025-12420), MFA/SSO bypass, agent execution |
| 2025-09-25 | Salesforce Agentforce | Noma Security | "ForcedLeak": indirect prompt injection + CSP bypass (CVSS 9.4) |
| 2025-09-06 | Microsoft 365 Copilot | Aim Labs / MSRC | "EchoLeak" zero-click indirect prompt injection (CVE-2025-32711) |
| 2024-07-17 | SAP AI Core | Wiz | "SAPwned" multi-tenant AI infrastructure control-plane compromise |

The instructive feature of the table is what does not connect the entries. The targets span consulting firms, a SaaS giant, hyperscaler AI platforms, an AI-data company, and an enterprise resource planner. The disclosers span independent researchers, vendor security teams, paid pen testers, and security-research labs. The vulnerability classes span SQL injection, broken authorisation, prompt injection, control-plane compromise, and indirect prompt injection. What connects them is the procurement pattern: every entry is a vendor system whose architecture was designed for one consumer (humans at screens) and is now being consumed by another (autonomous agents on APIs). The McKinsey response was textbook; the architectural exposure is category-wide.

The procurement question that should have been asked first

The procurement lesson is the load-bearing one for boards. The architectural review of a vendor's agentic AI platform belongs at procurement evaluation, alongside the commercial, legal, and technical due diligence that already happens, not after contract signing as a security afterthought.

NETEVO names this Agentic Due Diligence (ADD): the engineering and architectural review of a vendor's agentic AI platform, conducted during procurement evaluation alongside commercial, legal, and technical due diligence. ADD examines four dimensions — agent identity and scoping, policy-as-code enforcement, audit and observability, and revocation. The output is a board-readable risk position the buyer can sign on, not a checklist of vendor self-assertions.

IDC reframes the board question precisely: "How much authority have we delegated, to which systems, under what controls?" That is the procurement-time question. The closing question — the one that distinguishes data-loss-era thinking from agentic-era thinking — is its corollary: "What decisions were shaped by a compromised system?" The contrast with the older "Was data exposed?" is the shift in scope. A compromised agent that operated unchallenged across a quarter is the larger problem.

The standards anchor is the NIST AI Risk Management Framework 1.0, specifically the MAP function — MAP 3.3 (context and intended use), MAP 4.1 (technical and legal risks), MAP 4.2 (internal controls), MAP 5.1 (downstream impacts). Every subcategory is a question a buyer should be answering before the contract is signed, with the vendor's architecture demonstrably supporting the answer.

The market is already pricing this. IDC's Future Enterprise Resiliency and Spending Wave 10 survey reports 16.7 per cent of planned AI investment now going to AI and agent security and governance, with more than one billion actively deployed AI agents in production by 2029 and agentic AI exceeding twenty-six per cent of worldwide IT spending. Boards funding this category are doing so against an industry signal, not a NETEVO assertion. The companion solution page on Agent Infrastructure describes how NETEVO conducts ADD as a discrete pre-contract engagement.

What AU listed and pre-IPO obligations already require

The single most-cited fact about the Lilli incident, internationally, is that no regulator has yet commented on it. That is not a gap; it is the reason this section reads as it does. AU boards do not need a Lilli-specific regulator statement to act, because standing AU obligations already require what the incident has exposed as missing. The work is encoding obligations identified by counsel or compliance into executable controls — not interpreting which obligations attach to which facts.

APRA CPS 230 covers material service-provider and technology dependencies

The Australian Prudential Regulation Authority's Prudential Standard CPS 230 Operational Risk Management commenced on 1 July 2025. It covers operational risk management, business continuity, and the governance of material service-provider arrangements — including technology dependencies. APRA-regulated entities submitted their material service-provider registers to APRA by 1 October 2025. Agentic AI platforms supplied by third parties sit squarely inside scope.

The Privacy Act now requires disclosure of substantially automated decisions

The Privacy and Other Legislation Amendment Act 2024 introduced a requirement that privacy policies disclose when personal information is used for substantially automated decisions that significantly affect an individual's rights or interests. The Act received Royal Assent in December 2024, with the automated-decision disclosure requirement commencing in December 2026. The Office of the Australian Information Commissioner has published guidance for businesses selecting AI products and for developers training generative models. The Australian Privacy Principles apply to both the inputs to and outputs of AI systems, including AI-generated inferences about individuals.

ASX Listing Rule 3.1 continuous disclosure attaches to material events

The ASX Listing Rule 3.1 continuous-disclosure framework attaches to information about an entity that a reasonable person would expect to have a material effect on the price or value of its securities. The rule applies to material technology incidents the way it has always applied to other operational events. Whether a particular agentic AI incident attaches to a particular disclosure obligation is a question for the entity's continuous-disclosure advisers in light of the facts.

The AI6 Essential Practices set the de-facto private-sector benchmark

The National AI Centre published the Guidance for AI Adoption in October 2025, replacing the earlier Voluntary AI Safety Standard. The guidance organises six Essential Practices — the AI6: accountability, impacts and planning, risk measurement and management, information sharing, testing and monitoring, and human control. The practices are voluntary at federal level but form the de-facto private-sector benchmark against which AI deployments will be assessed in board reviews, regulator inquiries, and customer due-diligence questionnaires.

NETEVO encodes obligations like these as executable controls in policy-as-code. We do not interpret the application of any specific statute to any specific factual scenario, which is legal-practitioner work.

The Law-to-Code answer

The thesis of the Law-to-Code Methodology is that governance is architecture rather than text. The same evidentiary discipline used in patent prosecution (constraints defined precisely, defensible under examination, reproducible in practice), applied to digital infrastructure, produces controls that are event-sourced, immutable, audit-trailed, and revocable. For Layer 3, the agent-infrastructure layer, that means agents whose identity is verifiable, whose scope is narrowed at every delegation, whose actions are captured in an append-only log against a specific policy version, and whose access can be revoked from a console without a vendor deploy.
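
As an illustration of "an append-only log against a specific policy version", the sketch below chains each entry to the previous one by hash, so any later edit is detectable. It is a generic tamper-evidence pattern, not a description of RISKflo or any NETEVO system; `AuditLog`, its field names, and the `policy-v12` label are hypothetical.

```python
import hashlib
import json


class AuditLog:
    """Append-only, hash-chained record of agent decisions."""

    GENESIS = "0" * 64  # placeholder hash preceding the first entry

    def __init__(self):
        self._entries = []

    @staticmethod
    def _digest(body: dict) -> str:
        # Canonical JSON so the same body always hashes to the same value.
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def append(self, agent_id: str, action: str, decision: str, policy_version: str) -> dict:
        prev_hash = self._entries[-1]["hash"] if self._entries else self.GENESIS
        body = {
            "seq": len(self._entries),
            "agent_id": agent_id,
            "action": action,
            "decision": decision,
            "policy_version": policy_version,  # which rules produced this decision
            "prev_hash": prev_hash,
        }
        entry = dict(body, hash=self._digest(body))
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; False if any entry was altered or reordered."""
        prev_hash = self.GENESIS
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev_hash or entry["hash"] != self._digest(body):
                return False
            prev_hash = entry["hash"]
        return True
```

Hash chaining makes the record tamper-evident, not tamper-proof; a production design would also need write-once storage and an externally anchored head hash, which this sketch omits.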

NETEVO's RISKflo platform at HSBC is the proof point: more than 13 million events per year with 99-plus per cent uptime over 24 months. The architecture (event-sourced, policy-as-code, immutable audit) is the same primitive applied to the agentic AI engagement. The principal who designs the strategy is the patent attorney and systems architect who builds the platform, so there is no handoff between strategy and execution. The Agent Infrastructure solution page describes the engagement model, and the forthcoming Agent Infrastructure Whitepaper describes the four-dimension implementation blueprint in technical depth.

The argument boards take from Lilli is not that McKinsey failed. McKinsey responded textbook-fast; the architectural exposure they confronted is the exposure every enterprise agentic AI deployment now has to design against. The procurement-architecture question — can this vendor be deployed against our context safely? — belongs at evaluation. That is the work.

The two-step path is editorial to solution to engagement. If the diagnosis here matches what your team is confronting, the solution pages below describe how the architecture is built and the engagement model under which NETEVO delivers it.

Solution: AI Agent Infrastructure

Where Agentic Due Diligence (ADD) is conducted as a discrete pre-contract engagement; intent engineering, MCP architecture, multi-agent orchestration, knowledge graphs, and agent-native product design under one engagement.

Solution: AI Governance & Readiness

The Layer 4 sibling: board-defensible governance, policy-as-code for AI compliance, regulatory preparedness across APRA, OAIC, ASX, and the AI6 Essential Practices.

Insight: Architectural AI: Where the Leverage Lives

The leverage-side companion pillar. Where the safety-side argument here asks what happens when agentic procurement goes wrong, the architectural-AI argument asks where AI investment should actually go: the four board-paper questions that distinguish leverage from content velocity.

Insight: AEO vs GEO vs SEO

The Layer 1 companion pillar: three measurement surfaces on one underlying discipline, AU-context, listed and pre-IPO framing.

Case study: RISKflo at HSBC

The event-sourced platform proof point: 13M+ events per year, 99%+ uptime, the architecture pattern NETEVO applies to agent infrastructure.

Questions

Frequently asked questions

Procurement, governance, and AU-context questions. Service mechanics — what NETEVO builds, the engagement model, the four-dimension Agentic Due Diligence framework — are answered on the AI Agent Infrastructure solution page. Implementation depth — agent identity, policy-as-code engines, event-sourced audit, revocation — is answered in the forthcoming Agent Infrastructure Whitepaper.

What is agentic AI?

Agentic AI describes autonomous software systems that take actions without a human at the screen mediating each step. The agent has its own identity, credentials, and scope. It interacts with vendor systems through APIs rather than a user interface. The procurement and architectural models both differ from traditional SaaS.

What is the McKinsey Lilli incident?

In March 2026, security researcher Paul Price (writing as CodeWall) disclosed that an autonomous AI agent had breached McKinsey's internal Lilli platform in under two hours for roughly twenty dollars in tokens. Twenty-two endpoints lacked authentication and the database held writable system prompts alongside user data. McKinsey patched within hours and reports no evidence that client data was accessed by unauthorised parties.

Was client data exposed in the Lilli breach?

CodeWall asserts what was reachable from the compromised database — chat messages, files, accounts, system prompts. McKinsey's official statement reports no evidence that client data or client confidential information were accessed by unauthorised parties. Both can be true: reachability and actual access are different events. The architectural exposure remains the point.

Is agentic AI security a regulatory issue in Australia?

Yes, in the sense that standing AU obligations already apply. APRA CPS 230 covers material service-provider and technology dependencies. The Privacy Act 2024 amendments require disclosure of substantially automated decisions. ASX Listing Rule 3.1 attaches to material price-sensitive information. The National AI Centre's AI6 Essential Practices form the de-facto private-sector benchmark.

Does APRA CPS 230 apply to agentic AI deployments?

APRA CPS 230 covers operational risk management, business continuity, and material service-provider arrangements for APRA-regulated entities, with effect from 1 July 2025. Agentic AI platforms supplied by third parties sit inside scope as material dependencies. Whether a particular deployment is in scope for a particular obligation is for the entity's regulatory advisers.

What is the difference between an AI agent and a chatbot?

A chatbot is a conversational interface — a user types, the model responds, the user reads. An AI agent is an autonomous software system that takes actions: queries APIs, modifies records, triggers workflows, delegates subtasks. The agent has its own credentials and operates without a human in the loop on every step.

What is policy-as-code?

Policy-as-code is the practice of encoding compliance, security, and governance rules as executable, version-controlled configuration — not policy documents that an engineer reads and remembers. Engines like Open Policy Agent (Rego), AWS Cedar, and Oso evaluate proposed actions on every call and return allow, deny, or obligate decisions with audit context.
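
To make the idea concrete without tying it to any one engine, here is a minimal sketch in plain Python. It does not use Rego, Cedar, or Oso syntax; the rule format, `POLICY_VERSION`, `evaluate`, and the example actions are all hypothetical. Rules are data that can live in version control, the first matching rule wins, and anything unmatched is denied.

```python
import operator
from dataclasses import dataclass
from typing import Optional

POLICY_VERSION = "example-v1"  # hypothetical label, bumped on every rule change

OPS = {"eq": operator.eq, "gt": operator.gt, "le": operator.le}

# Rules are plain data; order matters because the first match wins.
RULES = [
    {"effect": "deny", "action": "payment.release", "field": "amount", "op": "gt", "value": 10_000},
    {"effect": "allow", "action": "payment.release", "field": "agent_role", "op": "eq", "value": "treasury"},
    {"effect": "allow", "action": "invoice.read", "field": "agent_role", "op": "eq", "value": "reporting"},
]


@dataclass(frozen=True)
class Decision:
    effect: str                   # "allow" or "deny"
    policy_version: str           # audit context: which rule set decided
    matched_rule: Optional[int]   # index into RULES, or None for default deny


def evaluate(action: str, context: dict) -> Decision:
    """Return the first matching rule's effect; deny when nothing matches."""
    for index, rule in enumerate(RULES):
        if rule["action"] != action or rule["field"] not in context:
            continue
        if OPS[rule["op"]](context[rule["field"]], rule["value"]):
            return Decision(rule["effect"], POLICY_VERSION, index)
    return Decision("deny", POLICY_VERSION, None)
```

A treasury agent releasing 500 is allowed by the second rule, the same agent releasing 50,000 is denied by the first, and an action with no rule at all is denied by default. Each decision carries the policy version that produced it.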

What is an AI agent control plane?

The control plane is the layer where authority is granted, scoped, observed, and revoked. For agent infrastructure, it is distinct from the data plane (where the agent operates) and the audit plane (where the record accumulates). Forrester's 2026 research argues the agent control plane is a third plane with its own standards stack still under construction.

How is agentic AI procurement different from traditional SaaS procurement?

Traditional SaaS procurement assumed the user interface enforced permission as a side-effect of role-based rendering. Agentic procurement removes the interface. Every endpoint must verify scope explicitly, in code, on every call. The architectural questions — agent identity, policy enforcement, audit, revocation — belong at procurement evaluation, not after contract signing.

What is the OWASP Top 10 for Agentic Applications?

OWASP's [Top 10 for Agentic Applications](https://owasp.org/), in its 2026 edition, complements the earlier OWASP Top 10 for Large Language Model Applications. It identifies significant security risks in agentic AI systems, including ASI01 Agent Goal Hijack, ASI02 Tool Misuse and Exploitation, ASI03 Identity and Privilege Abuse, and ASI06 Memory and Context Poisoning.

What should our board be asking about our AI agent roadmap?

Four procurement-time questions cover most of the territory: (1) Can the platform distinguish humans from agents, and can scope be narrowed per task? (2) Are controls executable, or asserted in PDFs? (3) Is every agent action logged immutably with decision context? (4) Can the buyer pull an agent's access from a console without a vendor deploy? These are the four dimensions Agentic Due Diligence examines.

What is NETEVO's view on the Lilli incident?

The Lilli incident is best read as a market signal about the procurement-architecture pattern hiding inside every enterprise agentic AI deployment, in IDC analyst Alessandro Perilli's framing. McKinsey's response was textbook; the architectural exposure they confronted is the one every agentic AI deployment now has to design against. Boards need procurement-time architectural review, not post-incident cleanup.

Should we wait for AU AI-specific legislation before deploying agents?

No. The relevant AU obligations and benchmarks are already in place: APRA CPS 230 (operational risk; in force from July 2025), the Privacy Act 2024 amendments (automated decisions; enacted December 2024, with the disclosure requirement commencing December 2026), ASX Listing Rule 3.1 (continuous disclosure), and the AI6 Essential Practices (October 2025). Whether and how each obligation applies to a particular deployment is for the organisation's regulatory advisers.

Author

Greg McKenzie is the Principal of NETEVO, a registered Trans-Tasman patent attorney and systems architect, and the architect of NETEVO's Law-to-Code Methodology. He writes from Sydney.