OAuth was built for humans clicking login buttons. AI agents are a completely different trust model.
We ran AIR Blackbox v1.2.0, an open-source EU AI Act compliance scanner, against seven major AI agent frameworks: OpenAI Agents SDK, LangChain, CrewAI, Phidata, GPT Researcher, LlamaIndex, and Mem0. We checked five specific patterns that determine whether an agent system tracks who authorized what and what the agent actually did with that authorization.
The results were worse than expected.
A user connects their Google account to an AI assistant. Two weeks later, the agent has sent 47 emails, created 12 calendar events, shared 3 documents, and booked a flight. OAuth says all of this is fine. The token is valid. The permissions were granted.
But the system cannot answer basic questions. Which agent performed these actions? Did the actions match the original intent? Would the user still approve them right now? And if something went wrong, can anyone reconstruct the decision chain?
OAuth solved identity for humans logging into apps. Agents introduce delegation, and delegation is where things break down.
We built five new checks into AIR Blackbox v1.2.0 targeting the OAuth delegation gap. Each maps to EU AI Act Article 14 (Human Oversight) or Article 12 (Record-Keeping):
Agent-to-user identity binding. Does the code track which user authorized each agent action? If an agent sends an email, is there a user_id attached to that action?
Token scope and permission validation. Before the agent acts, does it verify that the action falls within granted permissions?
Token expiry and revocation handling. Can you kill a rogue agent instantly? If an agent starts behaving unexpectedly at 3am, is there a mechanism to revoke its credentials in real time?
Agent action audit trail. Most frameworks log LLM calls. Almost none log the actions the agent takes as a result: the emails sent, the APIs called, the data modified.
Agent action boundaries. Is there a defined set of tools and actions the agent is allowed to use?
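To make the identity, scope, expiry, and boundary checks concrete, here is a minimal sketch of what a framework could enforce before every agent action. It is not code from AIR Blackbox or from any of the scanned frameworks. `DelegationGrant`, `DelegationError`, and `run_action` are hypothetical names, and the in-memory `revoked` flag stands in for whatever shared revocation store a real deployment would need.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


class DelegationError(Exception):
    """Raised when an agent action falls outside its delegation grant."""


@dataclass
class DelegationGrant:
    """Hypothetical record of what one user delegated to one agent."""
    user_id: str              # identity binding: who authorized the agent
    agent_id: str             # which agent holds the delegation
    scopes: frozenset         # permissions the user granted
    allowed_tools: frozenset  # action boundary: tools the agent may call
    expires_at: datetime      # hard expiry for the delegation
    revoked: bool = False     # kill switch, checked on every action

    def revoke(self) -> None:
        """Invalidate the grant immediately."""
        self.revoked = True

    def check(self, tool: str, scope: str, now: Optional[datetime] = None) -> None:
        """Raise DelegationError unless the action is currently permitted."""
        now = now or datetime.now(timezone.utc)
        if self.revoked:
            raise DelegationError(f"grant for {self.agent_id} was revoked")
        if now >= self.expires_at:
            raise DelegationError(f"grant for {self.agent_id} expired")
        if tool not in self.allowed_tools:
            raise DelegationError(f"tool {tool!r} is outside the action boundary")
        if scope not in self.scopes:
            raise DelegationError(f"scope {scope!r} was not granted")


def run_action(grant: DelegationGrant, tool: str, scope: str,
               action: Callable[[], object]) -> dict:
    """Validate the grant, run the action, and tag the result with both identities."""
    grant.check(tool, scope)
    return {
        "user_id": grant.user_id,
        "agent_id": grant.agent_id,
        "tool": tool,
        "result": action(),
    }


# Example: the user granted email sending only, for one hour.
grant = DelegationGrant(
    user_id="user-123",
    agent_id="assistant-1",
    scopes=frozenset({"gmail.send"}),
    allowed_tools=frozenset({"send_email"}),
    expires_at=datetime.now(timezone.utc) + timedelta(hours=1),
)
outcome = run_action(grant, "send_email", "gmail.send", lambda: "sent")
grant.revoke()  # from here on, run_action raises DelegationError for this grant
```

The point of routing every tool call through one function is that the scope, expiry, revocation, and boundary checks cannot be skipped, and every result carries the `user_id` and `agent_id` that produced it.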
| Framework | Pass | Warn | Fail | Total |
|---|---|---|---|---|
| Haystack (deepset) | 24 | 10 | 5 | 39 |
| OpenAI Agents SDK | 23 | 12 | 4 | 39 |
| Semantic Kernel (Microsoft) | 15 | 4 | 0 | 19 |
| GPT Researcher | 15 | 3 | 0 | 18 |
| Mem0 | 13 | 6 | 0 | 19 |
| DSPy (Stanford) | 12 | 6 | 1 | 19 |
Per-check results for four of the frameworks:

| Check | Haystack | OpenAI SDK | GPT Researcher | Semantic Kernel |
|---|---|---|---|---|
| Identity binding | ✅ 3 files | ✅ 7 files | ❌ Missing | ✅ 3 files |
| Scope validation | ✅ 8 files | ✅ 32 files | ✅ 4 files | ✅ 44 files |
| Token expiry | ✅ 12 files | ✅ 18 files | ✅ 4 files | ✅ 32 files |
| Action audit trail | ✅ 2 files | ❌ Missing | ✅ 4 files | ✅ 6 files |
| Action boundaries | ✅ 1 file | ✅ 7 files | ❌ Missing | ✅ 2 files |
The pattern is consistent. LLM call logging exists. Action-level logging does not. Token scoping exists in some. User identity binding is rare. Action boundaries are almost nonexistent.
OAuth says the agent is allowed. Nobody tracks what it does. That is how you get 1,000 emails sent overnight: technically authorized, zero accountability.
The EU AI Act high-risk system rules take effect August 2, 2026. Article 14 requires human oversight mechanisms. Article 12 requires automatic logging of events during operation. If your agent acts on behalf of a user and you cannot reconstruct who authorized it, what it did, and why, you have a compliance gap.
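As an illustration of what "reconstruct who authorized it, what it did, and why" could look like in code, here is a minimal sketch of an action-level audit log. It is an assumption-laden example, not the format AIR Blackbox checks for and not a statement of what Article 12 requires. `ActionAuditLog` and its field names are hypothetical, and the in-memory list stands in for durable, append-only storage.

```python
import json
from datetime import datetime, timezone


class ActionAuditLog:
    """Hypothetical append-only log of agent actions (not LLM calls)."""

    def __init__(self):
        # Each entry is stored as one JSON string, mimicking a JSON Lines file.
        self._lines = []

    def record(self, user_id, agent_id, action, params, reason, timestamp=None):
        """Append one action with the identities and the stated reason."""
        entry = {
            "timestamp": (timestamp or datetime.now(timezone.utc)).isoformat(),
            "user_id": user_id,    # who authorized the agent
            "agent_id": agent_id,  # which agent acted
            "action": action,      # what it did
            "params": params,      # with which arguments
            "reason": reason,      # why, as stated by the agent at call time
        }
        self._lines.append(json.dumps(entry, sort_keys=True))
        return entry

    def actions_for_user(self, user_id):
        """Return every logged action taken on behalf of one user, in order."""
        entries = [json.loads(line) for line in self._lines]
        return [e for e in entries if e["user_id"] == user_id]


log = ActionAuditLog()
log.record("user-123", "assistant-1", "send_email",
           {"to": "a@example.com"}, "user asked to confirm the meeting")
log.record("user-456", "assistant-2", "create_event",
           {"title": "Standup"}, "recurring schedule request")
trail = log.actions_for_user("user-123")
```

Here `trail` holds only the actions taken for `user-123`, each with the acting agent, the arguments, and the recorded reason. That is the information needed to walk back through a decision chain after the fact.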
    pip install air-blackbox
    air-blackbox comply --scan . -v
18 code-level checks. 5 OAuth delegation checks. Runs entirely on your machine. Apache 2.0.