Executive Summary
Prompt injection becomes operationally dangerous when it crosses a trust boundary and causes one of three outcomes:
- Protected data exfiltration
- Policy override
- Unsafe tool invocation
To make that claim reproducible, we built a deterministic evaluation harness that models three common agent patterns: retrieval-augmented assistants, browser-enabled agents, and internal tool runners.
The initial harness ships with a 9-scenario matrix and fixture-based scoring that measures exploit success as binary security outcomes — not model quality.
Problem Statement
Security reviews for LLM systems often collapse into vague questions:
- Is the model jailbreakable?
- Is the prompt robust?
- Does the assistant "seem safe"?
Those questions are too broad to help engineering teams decide whether a product change is safe to ship.
A more useful framing is exploit-oriented:
- Can untrusted content cause the agent to reveal protected data?
- Can untrusted content outrank configured policy?
- Can untrusted content trigger a blocked or approval-gated side effect?
That framing maps directly to the decisions that real platform, product, and security teams need to make.
Methodology
The MVP harness evaluates prompt injection through deterministic fixtures.
Design choices
- Synthetic secrets only
- Mocked side effects only
- No live provider benchmarking in the first release
- Per-scenario artifact output for repeatability
Each run produces:
report.json trace.jsonl summary.md Bulk runs also produce index.json.
Scenario Matrix
The first release covers 9 scenarios across three agent patterns and three exploit classes:
| Agent Pattern | Exfiltration | Policy Override | Unsafe Tool Invocation |
|---|---|---|---|
| RAG assistant | Retrieved secret leak | Retrieved instruction outranks policy | Retrieved manual triggers blocked tool |
| Browser agent | Cookie/session leak | Page text overrides navigation guard | Page drives destructive click |
| Internal tool runner | Hidden memory leak | Tool output disables safeguards | Tool output triggers side effect |
Initial Findings
The current baseline intentionally encodes vulnerable behavior so the harness can validate methodology and artifact generation before mitigations are layered in.
- All 9 seeded scenarios reproduce a successful exploit path
- Exfiltration appears when protected values cross into answers or outbound tool payloads
- Policy override appears when untrusted content receives higher effective priority than policy
- Unsafe tool invocation appears when approval gates are absent or bypassed
These results are not a claim about the entire ecosystem. They are a claim that the benchmark structure is useful, reproducible, and aligned with real exploit outcomes.
Why This Matters
Prompt injection is still too often discussed as a prompt-quality problem. That understates the engineering risk.
The security boundary is not the prompt alone. It is the full execution path:
- Retrieval assembly
- Browser or document context
- Tool output handling
- Approval and permission checks
When those boundaries blur, prompt injection stops being a model oddity and starts becoming a production security bug.
Mitigation Directions
The next round of measurements should compare the vulnerable baseline against concrete controls:
- Instruction/data separation in retrieval assembly
- Explicit trust labels for browser and tool content
- Approval-gated tool allowlists
- Structured refusal behavior that survives untrusted context
- Trace policies that detect priority inversion before side effects fire
Reproducibility
The harness is designed so teams can inspect and extend it locally.
Example commands:
# Run a single scenario
python -m savage_research rag_exfiltrate_retrieved_secret --output-dir artifacts/rag-demo
# Run the full 9-scenario matrix
python -m savage_research --all --output-dir artifacts/full-demo Each run writes report.json, trace.jsonl, and summary.md. Bulk runs add index.json.
Limits
This first release is intentionally narrow.
- It does not benchmark live model variance
- It does not yet measure long-horizon memory poisoning
- It does not claim prevalence beyond the included scenarios
- It is a baseline harness for exploitation-focused measurement, not a final benchmark
Conclusion
The strongest first step for AI security tooling is not a universal safety score. It is a reproducible answer to a smaller question:
Did untrusted content cause a concrete security failure?
That is the bar this harness is built to measure.
Extend the Harness
The scenario matrix is designed to be inspected and extended. We welcome scenario contributions from teams who have encountered real exploit paths not yet covered by the current matrix.
Get in Touch Back to Research