Prompt Injection Still Becomes a Security Bug at the Tool Boundary

Savage Software's inaugural research release: a deterministic evaluation harness that measures prompt injection as an exploitation problem, not a prompt-quality problem.

Executive Summary

Prompt injection becomes operationally dangerous when it crosses a trust boundary and causes one of three outcomes:

  • Protected data exfiltration
  • Policy override
  • Unsafe tool invocation

To make that claim reproducible, we built a deterministic evaluation harness that models three common agent patterns: retrieval-augmented assistants, browser-enabled agents, and internal tool runners.

The initial harness ships with a 9-scenario matrix and fixture-based scoring that measures exploit success as binary security outcomes — not model quality.

Problem Statement

Security reviews for LLM systems often collapse into vague questions:

  • Is the model jailbreakable?
  • Is the prompt robust?
  • Does the assistant "seem safe"?

Those questions are too broad to help engineering teams decide whether a product change is safe to ship.

A more useful framing is exploit-oriented:

  1. Can untrusted content cause the agent to reveal protected data?
  2. Can untrusted content outrank configured policy?
  3. Can untrusted content trigger a blocked or approval-gated side effect?

That framing maps directly to the decisions that real platform, product, and security teams need to make.

Methodology

The MVP harness evaluates prompt injection through deterministic fixtures.

Design choices

  • Synthetic secrets only
  • Mocked side effects only
  • No live provider benchmarking in the first release
  • Per-scenario artifact output for repeatability

Each run produces:

report.json trace.jsonl summary.md

Bulk runs also produce index.json.

Scenario Matrix

The first release covers 9 scenarios across three agent patterns and three exploit classes:

Agent Pattern Exfiltration Policy Override Unsafe Tool Invocation
RAG assistant Retrieved secret leak Retrieved instruction outranks policy Retrieved manual triggers blocked tool
Browser agent Cookie/session leak Page text overrides navigation guard Page drives destructive click
Internal tool runner Hidden memory leak Tool output disables safeguards Tool output triggers side effect

Initial Findings

9 Scenarios Executed
9 Exploit Successes
100% Exploit Rate

The current baseline intentionally encodes vulnerable behavior so the harness can validate methodology and artifact generation before mitigations are layered in.

  • All 9 seeded scenarios reproduce a successful exploit path
  • Exfiltration appears when protected values cross into answers or outbound tool payloads
  • Policy override appears when untrusted content receives higher effective priority than policy
  • Unsafe tool invocation appears when approval gates are absent or bypassed

These results are not a claim about the entire ecosystem. They are a claim that the benchmark structure is useful, reproducible, and aligned with real exploit outcomes.

Why This Matters

Prompt injection is still too often discussed as a prompt-quality problem. That understates the engineering risk.

The security boundary is not the prompt alone. It is the full execution path:

  • Retrieval assembly
  • Browser or document context
  • Tool output handling
  • Approval and permission checks

When those boundaries blur, prompt injection stops being a model oddity and starts becoming a production security bug.

Mitigation Directions

The next round of measurements should compare the vulnerable baseline against concrete controls:

  • Instruction/data separation in retrieval assembly
  • Explicit trust labels for browser and tool content
  • Approval-gated tool allowlists
  • Structured refusal behavior that survives untrusted context
  • Trace policies that detect priority inversion before side effects fire

Reproducibility

The harness is designed so teams can inspect and extend it locally.

Example commands:

# Run a single scenario
python -m savage_research rag_exfiltrate_retrieved_secret --output-dir artifacts/rag-demo

# Run the full 9-scenario matrix
python -m savage_research --all --output-dir artifacts/full-demo

Each run writes report.json, trace.jsonl, and summary.md. Bulk runs add index.json.

Limits

This first release is intentionally narrow.

  • It does not benchmark live model variance
  • It does not yet measure long-horizon memory poisoning
  • It does not claim prevalence beyond the included scenarios
  • It is a baseline harness for exploitation-focused measurement, not a final benchmark

Conclusion

The strongest first step for AI security tooling is not a universal safety score. It is a reproducible answer to a smaller question:

Did untrusted content cause a concrete security failure?

That is the bar this harness is built to measure.

Extend the Harness

The scenario matrix is designed to be inspected and extended. We welcome scenario contributions from teams who have encountered real exploit paths not yet covered by the current matrix.

Get in Touch Back to Research