Zurück zu allen Artikeln
Emerging Threats8. Mai 20266 Min. Lesezeit

Securing the AI & LLM Supply Chain: Prompt Injection and Beyond

Models, prompts, retrieved documents and tools are now part of your attack surface. We map the new AI/LLM risks and the controls that actually help.

Luca RomanoAppSec & AI Security Engineer · OSWE
Key takeaways
  • Treat every model output as untrusted input — it can carry instructions and payloads, not just answers.
  • Prompt injection is unsolved: contain its blast radius rather than trying to filter it away.
  • The real danger is what the model can do: tool-calling, data access and downstream actions define your risk.
  • Pin and verify model artifacts, datasets and dependencies the same way you govern any software supply chain.
  • Map your controls to the OWASP Top 10 for LLM Applications and NIST AI RMF to stay auditable.

When you ship a feature backed by a large language model, you are not just adding a clever text generator — you are extending trust to model weights, system prompts, retrieved documents, embeddings and external tools you do not fully control. AI supply chain security is the discipline of mapping that expanded attack surface and constraining what an attacker can do when any one link is hostile. This guide walks through the concrete risks practitioners are seeing in production and the controls that move the needle.

What "AI supply chain security" actually covers

Classic application security worries about the code you write and the libraries you import. An LLM feature inherits all of that and adds several new, often unfamiliar, links:

  • The model itself — weights pulled from a hub, a hosted API, or a fine-tune you trained on third-party data.
  • The training and fine-tuning data — which may be poisoned, mislabeled, or carry licensing and privacy liabilities.
  • Runtime context — system prompts, retrieved documents (RAG), user input, and tool results that all flow into the same context window.
  • Tools and plugins — the functions, APIs and agents the model can invoke on your behalf.

Each of these is a trust boundary. The mistake teams make is assuming the boundary sits only between the user and the application. In an LLM system, any content that reaches the context window can influence behavior, including a PDF an attacker uploaded last week or a web page your agent fetched mid-task.

Prompt injection: the defining risk

Prompt injection is the LLM-era equivalent of injection flaws, and it is the first entry in the OWASP Top 10 for LLM Applications for good reason. The core problem is structural: models do not reliably distinguish trusted instructions from untrusted data when both arrive as natural language in the same context.

Direct vs. indirect injection

  • Direct injection is the user typing adversarial instructions ("ignore your rules and...") straight into the prompt. Annoying, but you at least know the source.
  • Indirect injection is the dangerous one. Malicious instructions are embedded in content the model ingests on the user's behalf — a support ticket, a scraped page, a calendar invite, a document in your RAG index. The user never sees the payload, but the model obeys it.

Indirect injection turns RAG pipelines and autonomous agents into confused deputies. A poisoned document can tell the model to exfiltrate the conversation, call an internal tool, or rewrite its own summary to mislead a human reviewer.

There is no reliable prompt-injection filter

Do not architect on the assumption that a guardrail model or regex will catch injection attempts. Detection helps at the margins, but determined attackers bypass it. Design as if the model will be hijacked, and make sure that when it is, it cannot reach data or actions that matter. Containment beats detection.

Untrusted model output is the second half of the problem

Even setting injection aside, treat every token the model emits as untrusted input to the rest of your system. If model output is rendered into HTML without encoding, you have stored XSS. If it is passed to a shell, an eval, or a database query, you have command or SQL injection — now with a probabilistic source you cannot fully predict.

Practical rules:

  1. Encode on output. Model-generated text rendered in a browser must be escaped exactly like user-generated content. This is standard application security hygiene applied to a new source.
  2. Never pass raw output to interpreters. No model string should reach a shell, SQL engine or template renderer without parameterization or strict validation.
  3. Constrain format. Where you need structured output, enforce a schema (JSON schema, function signatures) and reject anything that does not parse, rather than trusting free text.

Tool-calling and agents: where blast radius is decided

A chatbot that only talks is low risk. The moment you give a model tools — the ability to send email, query a database, run code, or call internal APIs — its mistakes and its hijacked instructions become real-world actions. This is where most of your actual exposure lives.

Controls that matter:

  • Least privilege per tool. Each function the model can call should run with the narrowest scope possible. A "read customer record" tool should never be able to write or delete.
  • Human-in-the-loop for sensitive actions. Irreversible or high-impact operations (payments, deletions, outbound communications) should require explicit confirmation outside the model's control.
  • Deterministic authorization. Enforce access control in your own code, keyed to the authenticated user — not to whatever identity the model claims it is acting as. The model is not an authorization boundary.
  • Isolate untrusted data from privileged tools. If a session has ingested external content, treat it as tainted and restrict which tools remain available. Patterns like dual-LLM or capability gating help here.

For agentic systems, it helps to model attacker techniques explicitly. Frameworks such as MITRE ATLAS catalogue adversarial behaviors against AI systems and give you a vocabulary for threat modeling beyond generic appsec.

Securing the artifacts: models, data and dependencies

The "supply chain" in AI supply chain security is not a metaphor. Model artifacts arrive through the same risky channels as any dependency:

  • Pin and verify model provenance. Reference models by immutable digest, not a mutable tag. Prefer signed artifacts and verify checksums. Avoid loading weights in formats that allow arbitrary code execution during deserialization; prefer safe serialization formats.
  • Govern training and fine-tuning data. Data poisoning can implant backdoors or bias that surface only on specific triggers. Track dataset lineage, validate sources, and keep a record of what went into each model version.
  • Generate an AI-aware SBOM. Extend your software bill of materials to include models, datasets and the ML libraries around them, so you can answer "are we affected?" when a vulnerable component is disclosed.
  • Test before you trust. Subject AI features to adversarial evaluation and penetration testing that specifically targets injection, jailbreaks, data leakage and tool abuse — not just classic web flaws.

Map controls to recognized frameworks

You do not need to invent a methodology. Align your program to references your auditors and partners already accept:

  • OWASP Top 10 for LLM Applications — a prioritized risk list (prompt injection, insecure output handling, supply chain, excessive agency, sensitive information disclosure) that maps cleanly to engineering tasks.
  • NIST AI Risk Management Framework (AI RMF) — for governance, accountability and lifecycle risk.
  • MITRE ATLAS — adversary tactics and techniques specific to AI systems, useful for red-team scoping.
  • ISO/IEC 42001 — for organizations building an auditable AI management system.

Use these to structure reviews and to demonstrate diligence, but remember the list is a floor, not a ceiling.

Conclusion

LLM features expand your attack surface in ways traditional appsec checklists miss: untrusted instructions arrive as data, model output is itself untrusted, and tool access converts model mistakes into real actions. The durable strategy is not to chase a perfect prompt filter but to contain blast radius — least privilege, deterministic authorization, output encoding, verified artifacts and adversarial testing. Get those right and a hijacked prompt becomes an annoyance instead of an incident.

If you are putting AI features into production and want a clear-eyed assessment of where they could break, talk to our team — we will help you threat-model the pipeline and harden it before attackers do.

Written by

Luca Romano

AppSec & AI Security Engineer · OSWE

Secures application and AI/LLM supply chains. Maintainer of several SAST rulesets.