Prompt Injection Still Becomes a Security Bug at the Tool Boundary

Executive Summary

Prompt injection becomes operationally dangerous when it crosses a trust boundary and causes one of three outcomes:

Protected data exfiltration
Policy override
Unsafe tool invocation

To make that claim reproducible, we built a deterministic evaluation harness that models three common agent patterns: retrieval-augmented assistants, browser-enabled agents, and internal tool runners.

The initial harness ships with a 9-scenario matrix and fixture-based scoring that measures exploit success as binary security outcomes — not model quality.

Problem Statement

Security reviews for LLM systems often collapse into vague questions:

Is the model jailbreakable?
Is the prompt robust?
Does the assistant "seem safe"?

Those questions are too broad to help engineering teams decide whether a product change is safe to ship.

A more useful framing is exploit-oriented:

Can untrusted content cause the agent to reveal protected data?
Can untrusted content outrank configured policy?
Can untrusted content trigger a blocked or approval-gated side effect?

That framing maps directly to the decisions that real platform, product, and security teams need to make.

Methodology

The MVP harness evaluates prompt injection through deterministic fixtures.

Design choices

Synthetic secrets only
Mocked side effects only
No live provider benchmarking in the first release
Per-scenario artifact output for repeatability

Each run produces:

report.json trace.jsonl summary.md

Bulk runs also produce index.json.

Scenario Matrix

The first release covers 9 scenarios across three agent patterns and three exploit classes:

Agent Pattern	Exfiltration	Policy Override	Unsafe Tool Invocation
RAG assistant	Retrieved secret leak	Retrieved instruction outranks policy	Retrieved manual triggers blocked tool
Browser agent	Cookie/session leak	Page text overrides navigation guard	Page drives destructive click
Internal tool runner	Hidden memory leak	Tool output disables safeguards	Tool output triggers side effect

Initial Findings

The current baseline intentionally encodes vulnerable behavior so the harness can validate methodology and artifact generation before mitigations are layered in.

All 9 seeded scenarios reproduce a successful exploit path
Exfiltration appears when protected values cross into answers or outbound tool payloads
Policy override appears when untrusted content receives higher effective priority than policy
Unsafe tool invocation appears when approval gates are absent or bypassed

These results are not a claim about the entire ecosystem. They are a claim that the benchmark structure is useful, reproducible, and aligned with real exploit outcomes.

Why This Matters

Prompt injection is still too often discussed as a prompt-quality problem. That understates the engineering risk.

The security boundary is not the prompt alone. It is the full execution path:

Retrieval assembly
Browser or document context
Tool output handling
Approval and permission checks

When those boundaries blur, prompt injection stops being a model oddity and starts becoming a production security bug.

Mitigation Directions

The next round of measurements should compare the vulnerable baseline against concrete controls:

Instruction/data separation in retrieval assembly
Explicit trust labels for browser and tool content
Approval-gated tool allowlists
Structured refusal behavior that survives untrusted context
Trace policies that detect priority inversion before side effects fire

Reproducibility

The harness is designed so teams can inspect and extend it locally.

Example commands:

# Run a single scenario
python -m savage_research rag_exfiltrate_retrieved_secret --output-dir artifacts/rag-demo

# Run the full 9-scenario matrix
python -m savage_research --all --output-dir artifacts/full-demo

Each run writes report.json, trace.jsonl, and summary.md. Bulk runs add index.json.

Limits

This first release is intentionally narrow.

It does not benchmark live model variance
It does not yet measure long-horizon memory poisoning
It does not claim prevalence beyond the included scenarios
It is a baseline harness for exploitation-focused measurement, not a final benchmark

Conclusion

The strongest first step for AI security tooling is not a universal safety score. It is a reproducible answer to a smaller question:

Did untrusted content cause a concrete security failure?

That is the bar this harness is built to measure.

Extend the Harness

The scenario matrix is designed to be inspected and extended. We welcome scenario contributions from teams who have encountered real exploit paths not yet covered by the current matrix.

Get in Touch Back to Research