The Non-Negotiable AI Agent Stack
An AI agent is not a chatbot. It reasons across multiple steps, calls tools, writes to databases, retrieves documents, and makes decisions in loops. That is what makes it powerful. It is also what makes it a black box.
When something goes wrong, you cannot explain what happened. When an auditor asks for evidence, you do not have it. When a jailbreak gets through, you do not know until a customer tells you.
The stack below exists to solve that problem. Eight layers, each addressing a specific operational gap. Every tool mentioned is open source and self-hostable. Together they give you an observable, controllable, testable agent system — the precondition for any claim that your deployment is governance-ready.
The Stack at a Glance
How the Layers Connect on Every Request
01 — Orchestration
Why you need it. Agents execute multi-step reasoning with branching logic, tool calls, and state that persists across steps. You need a framework that manages the execution graph, handles failures, and lets humans intervene at critical decision points.
LangGraph models agents as state machines. You define nodes (actions), edges (transitions), and conditions. Built-in checkpointing means the agent's state survives crashes. Human-in-the-loop nodes let you pause execution and require approval before high-stakes actions proceed. The most mature option for complex, stateful agents.
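A minimal sketch of the state-machine pattern, assuming a recent langgraph release; the node logic is stubbed and the routing condition is purely illustrative:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class AgentState(TypedDict):
    request: str
    draft: str
    approved: bool

def plan(state: AgentState) -> dict:
    # A real node would call the LLM; this stub just drafts a proposed action.
    return {"draft": f"Proposed action for: {state['request']}"}

def execute(state: AgentState) -> dict:
    return {"approved": True}

def route(state: AgentState) -> str:
    # Conditional edge: low-risk requests execute, everything else waits for a human.
    return "execute" if "low-risk" in state["request"] else "human_review"

builder = StateGraph(AgentState)
builder.add_node("plan", plan)
builder.add_node("execute", execute)
builder.add_node("human_review", lambda state: {"approved": False})
builder.add_edge(START, "plan")
builder.add_conditional_edges("plan", route)
builder.add_edge("execute", END)
builder.add_edge("human_review", END)

# The checkpointer persists state per thread_id, so a crash does not lose progress.
graph = builder.compile(checkpointer=MemorySaver())
graph.invoke(
    {"request": "low-risk report refresh", "draft": "", "approved": False},
    config={"configurable": {"thread_id": "demo-1"}},
)
```

In a real deployment you would typically compile with interrupt_before=["human_review"], so execution pauses at that node and resumes from the checkpoint only after a reviewer approves.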
Microsoft Agent Framework is the choice if you are on Azure. It combines AutoGen's multi-agent conversation patterns with Semantic Kernel's enterprise integration — Entra ID, VNet, native OpenTelemetry. Use this if your infrastructure is already Microsoft.
CrewAI defines role-based agents in YAML. You describe agents with goals and responsibilities, assign them to tasks, and let them collaborate. Fastest setup time. Best for structured workflows like report generation and QA pipelines where the execution pattern is predictable.
The interoperability layer. Two protocols are standardising how agents talk to tools (MCP, by Anthropic) and to each other (A2A, by Google). Both are now under the Linux Foundation. Every major framework is adopting them — which means your orchestration choice will not lock you out of the broader tool ecosystem.
02 — Structured Tracing
Why you need it. This is the backbone. Without it, every other layer produces disconnected logs in different formats. With it, you get one correlated timeline of everything the agent did: which model was called, what tokens were consumed, which tools were invoked, what guardrails decided, what the evaluation scored — all linked by a single trace ID.
OpenTelemetry with GenAI Semantic Conventions. OTel is already the standard for distributed tracing across cloud infrastructure. The GenAI conventions extend it for AI operations. Every LLM call becomes a structured span with standardised attributes: model name, token counts, latency, finish reason. Agent invocations, tool executions, and retrieval operations each have their own span types.
OpenLLMetry (by Traceloop) is the instrumentation library. Two lines of initialisation code and it automatically captures traces from more than twenty LLM providers and frameworks. It outputs standard OTLP, so it feeds into whatever you already use — Datadog, Grafana, Jaeger, Splunk — without new infrastructure.
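The initialisation really is that small. A minimal sketch, assuming the traceloop-sdk package with the OTLP endpoint configured through the standard OTEL_EXPORTER_OTLP_* environment variables; the app_name is illustrative:

```python
from traceloop.sdk import Traceloop

# After this call, supported LLM and framework clients are auto-instrumented:
# each call is emitted as an OTLP span carrying GenAI semantic-convention attributes.
Traceloop.init(app_name="support-agent")
```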
Why this matters practically. When a guardrail blocks an output, the block decision is a span on the same trace as the LLM call that generated it. When an evaluation scores an output, the score is an annotation on the same trace. An auditor — or your own engineering team — can pull any trace and see the complete chain: input, reasoning, checks, score, output. No reconstruction. No guessing.
Privacy. Three modes are available. The production default captures metadata only — model, tokens, latency — with no content. Staging captures everything. A third mode stores content in separate encrypted storage and places only a reference in the trace.
03 — Observability
Why you need it. Tracing produces data. You need somewhere to visualise traces, debug prompt failures, track costs, run evaluation pipelines, and detect when agent behaviour drifts from baseline.
Langfuse (MIT) self-hosts on Kubernetes and connects to your OTel pipeline as an OTLP endpoint. It provides trace waterfalls, prompt versioning with Git sync, built-in LLM-as-judge evaluation pipelines, cost tracking across models, and annotation queues for human review. This is where your engineering team lives day-to-day when debugging and improving agents.
Arize Phoenix runs as a single Docker container. Faster to set up than Langfuse, with stronger built-in evaluators for hallucination and toxicity detection. Good for teams that want observability running in an afternoon without Kubernetes. Trade-off: Elastic License 2.0, not OSI-approved, restricts offering it as a managed service.
MLflow (Apache 2.0) is the right choice if you already have MLOps infrastructure. It adds LLM tracing and evaluation on top of your existing experiment tracking and model registry. The AI Gateway handles routing across multiple LLM providers with credential management.
Pick Langfuse for production-grade self-hosted observability. Pick Phoenix for rapid deployment. Pick MLflow if it is already in your stack.
04 — Red Teaming
Why you need it. Your agents will be attacked. Prompt injection, jailbreaks, encoding tricks, social engineering. You need to find what breaks before it is serving real users. This is pre-deployment testing — run it before every release, and on a schedule afterwards.
Garak (NVIDIA, Apache 2.0) is an automated vulnerability scanner. Think of it as a penetration testing tool for LLMs. It fires 120+ attack types at your agent: jailbreaks, base64-encoded injection, MIME-wrapped prompts, continuation exploits, data leakage probes. It connects to any LLM backend and produces structured reports mapping every vulnerability found. Run it against every agent before deployment.
PyRIT (Microsoft, MIT) is adaptive red teaming. Where Garak runs a fixed set of probes, PyRIT orchestrates multi-turn attack chains that evolve. One LLM generates adversarial prompts, another refines them based on your agent's responses, and automated scorers evaluate whether the attack succeeded. Built by Microsoft's AI Red Team and used internally on Copilot and Phi models. Use it for deep-dive testing of your highest-risk scenarios.
Promptfoo (MIT) bridges red teaming and CI/CD. Define your test cases, set pass/fail thresholds, run it in GitHub Actions on every deployment. This is how you make red teaming continuous rather than a one-time exercise.
Use all three. Garak for breadth (scan everything), PyRIT for depth (probe the scary scenarios), Promptfoo for continuity (gate every deployment).
05 — Guardrails
Why you need it. Red teaming happens before deployment. Guardrails happen on every interaction in production. They intercept inputs and outputs in real time — blocking prompt injection, redacting PII, filtering toxic content, enforcing topic boundaries, and validating output structure.
On input:
- Presidio (Microsoft, MIT) detects and anonymises PII. It scans for 30+ entity types — names, SSNs, credit cards, medical records — and replaces them with tokens before the input reaches the LLM (a short sketch follows this list). Under 30ms, no GPU. In agent pipelines, it de-anonymises responses afterwards to restore context for the user.
- LLM Guard (Protect AI, MIT) provides 15 input scanners covering prompt injection, invisible text, toxicity, secrets, and malicious URLs. Standalone HTTP API, under 30ms.
- NeMo Guardrails (NVIDIA, Apache 2.0) is the most comprehensive framework. Its unique capability is dialog rails: using the Colang DSL, you define allowed conversation flows, topic boundaries, and canonical response patterns. The agent cannot go off-script. NeMo also handles input and output rails, retrieval filtering (checking RAG chunks before they reach the LLM), and execution gating (blocking unauthorised tool calls).
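A minimal sketch of the input side using Presidio's analyzer and anonymizer engines; entity coverage and the placeholder tokens depend on the recognisers you enable:

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

user_input = "My name is Jane Doe and my card number is 4111 1111 1111 1111."

# Detect PII entities before the text is sent to the LLM.
findings = analyzer.analyze(text=user_input, language="en")

# Replace each finding with a placeholder token; the redacted text goes to the model.
redacted = anonymizer.anonymize(text=user_input, analyzer_results=findings)
print(redacted.text)  # e.g. "My name is <PERSON> and my card number is <CREDIT_CARD>."
```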
On output:
- Llama Guard 3 (Meta) is a purpose-built safety classifier. It evaluates every output against 14 hazard categories — violence, exploitation, hate, self-harm, CBRN. Runs on-premises. The 1B parameter version runs on edge hardware.
- Guardrails AI (Apache 2.0) handles structural validation. It ensures outputs match expected schemas. If the agent is supposed to return JSON with specific fields, Guardrails AI catches malformed output and can automatically retry with feedback.
The runtime sequence: input scanners run first (PII redaction, injection and toxicity checks), then the LLM call, then the output checks (safety classification, structural validation) before anything reaches the user.
Total added latency: 100–300ms for the full chain. A non-negotiable cost for production agents handling real data.
06 — Evaluation
Why you need it. Guardrails catch the obvious failures — toxicity, PII leaks, injection attacks. Evaluation catches the subtle ones. Is this response actually grounded in the documents it retrieved? Are the citations real? Did the agent's multi-step reasoning make sense? Is it answering the right question?
Ragas (Apache 2.0) is the standard for evaluating RAG pipelines. Its core metrics decompose every response into individual claims and verify each one:
- Faithfulness — Does the answer match the retrieved sources, or did the agent make things up?
- Answer relevancy — Does it actually address the question asked?
- Context precision — Were the right document chunks retrieved?
- Context recall — Did retrieval miss relevant information?
Run Ragas on every agent interaction, or on a sample in high-volume production. It uses an LLM as judge — a second model evaluates the first model's output against the retrieved context.
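A minimal sketch, assuming the Ragas 0.1-style API; metric names and dataset columns have shifted between releases, and each metric calls an LLM judge, so judge-model credentials must be configured:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

# One evaluation row: the question, the agent's answer, the retrieved chunks,
# and a reference answer used by the precision/recall metrics.
data = Dataset.from_dict({
    "question": ["What is our refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Our policy allows refunds within 30 days of purchase."]],
    "ground_truth": ["Refunds are accepted within 30 days."],
})

scores = evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(scores)  # per-metric values between 0 and 1
```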
DeepEval (Apache 2.0) is broader evaluation. It includes everything Ragas does, plus agent-specific metrics: Did the agent complete the task? Was its plan sensible? Did it use the right tools? Is the reasoning valid? DeepEval operates as a testing framework — define test cases with assertions and thresholds, run it in CI/CD, fail the build if quality drops.
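A minimal DeepEval sketch of the CI pattern, assuming its pytest-style test runner; the threshold and metric choice are illustrative, and the metric needs judge-model credentials:

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_refund_question():
    test_case = LLMTestCase(
        input="What is our refund window?",
        actual_output="Refunds are accepted within 30 days of purchase.",
        retrieval_context=["Our policy allows refunds within 30 days of purchase."],
    )
    # Fails the test (and therefore the build) if relevancy scores below the threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```

Run it with `deepeval test run test_agent.py` in the pipeline; a failing assertion fails the build.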
The critical pattern. Evaluation results attach as annotations to the original OTel trace. The score is permanently linked to the interaction. You can query "show me every interaction this week where faithfulness dropped below 0.7" and get the full trace — input, reasoning, retrieval, output, guardrail decisions, and evaluation score — for each one.
07 — Security
Why you need it. An agent executes code, calls APIs, accesses databases, and sends messages. If an attacker manipulates it through prompt injection, the agent becomes their proxy — with your credentials. Sandboxing and secrets management are not optional.
E2B (Apache 2.0) provides Firecracker microVM sandboxes. Every tool execution runs in an isolated Linux VM — not a Docker container sharing the host kernel (containers have persistent escape vulnerabilities). The agent's code runs in a sandbox that cannot touch your infrastructure. ~150ms cold start, configurable compute and network limits, up to 24-hour sessions. Deploy on your own cloud.
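A minimal sketch, assuming the e2b-code-interpreter Python SDK and an E2B API key in the environment; constructor and method names have changed between SDK versions, so treat this as the shape rather than the exact API:

```python
from e2b_code_interpreter import Sandbox

# Each tool execution gets its own Firecracker microVM, isolated from the host.
with Sandbox() as sandbox:
    execution = sandbox.run_code("print(2 + 2)")
    print(execution.logs)
```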
HashiCorp Vault provides dynamic secrets. Instead of long-lived database credentials in environment variables, agents receive just-in-time credentials valid for 30 minutes with automatic revocation. If the agent is compromised, the credentials expire before they can be exfiltrated.
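A minimal sketch using the hvac client; the `agent-readonly` role is a hypothetical database-secrets role configured with a short TTL:

```python
import hvac

client = hvac.Client(url="https://vault.internal:8200", token="...")  # token via your auth method

# Ask the database secrets engine for just-in-time credentials tied to the role's TTL.
creds = client.secrets.database.generate_credentials(name="agent-readonly")
username = creds["data"]["username"]
password = creds["data"]["password"]
# The lease expires automatically; Vault revokes the credentials when the TTL ends.
```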
MCP Authorization. The Model Context Protocol standardises agent-to-tool communication, but its authentication is optional by default. The MCP Authorization Specification (November 2025) adds OAuth 2.1, tool-level permissions, and step-up authorisation. Enforce this by proxying all MCP connections through a central gateway that verifies permissions and logs every tool invocation.
This threat is real. CVE-2025-32711 demonstrated silent data exfiltration from Microsoft 365 Copilot through hidden instructions in Word documents. The OWASP Top 10 for Agentic Applications (December 2025) catalogues the full threat surface: goal hijacking, tool misuse, memory poisoning, privilege escalation.
08 — Memory & RAG
Why you need it. An agent without memory repeats mistakes and loses context between sessions. An agent without retrieval hallucinates — generating plausible-sounding answers that are not grounded in your actual data.
Agent memory.
Mem0 is the emerging default. It provides a hybrid memory store combining vector similarity search, graph relationships, and key-value lookups. Agents remember user preferences, past decisions, and learned facts across sessions. Mem0 self-edits when new information conflicts with old, and integrates with LangChain, CrewAI, and the OpenAI Agents SDK.
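A minimal sketch, assuming the mem0 Python package; the default configuration calls a hosted LLM and embedder, so credentials for those must be set:

```python
from mem0 import Memory

memory = Memory()

# Store a fact learned in this session, scoped to a specific user.
memory.add("Prefers quarterly summaries over monthly ones", user_id="cust-42")

# A later session retrieves relevant memories before the agent plans its response.
related = memory.search("How often should I send reports?", user_id="cust-42")
```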
LangGraph checkpointing. If you are on LangGraph, memory is built in. The LangMem SDK adds a unique capability: procedural memory, where the agent updates its own system instructions based on experience.
Vector storage for RAG. Your agents need to ground their responses in enterprise documents — policies, knowledge bases, research, product data. That requires a vector database.
- pgvector — If you are already on PostgreSQL. Add the extension and you have vector search without new infrastructure (a short sketch follows this list). Good enough for collections under five million documents.
- Qdrant — When you need performance. Rust-native, 6ms query latency with complex filtering. Self-hostable, Apache 2.0.
- Milvus — When you need billion-scale. GPU-accelerated, distributed architecture, Linux Foundation project.
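A minimal pgvector sketch using psycopg, with a toy three-dimensional embedding standing in for a real model's output (production embeddings are typically 768 or 1536 dimensions):

```python
import psycopg

# Assumes a PostgreSQL instance with the pgvector extension installed.
with psycopg.connect("dbname=agents", autocommit=True) as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs ("
        "id bigserial PRIMARY KEY, body text, embedding vector(3))"
    )
    conn.execute(
        "INSERT INTO docs (body, embedding) VALUES (%s, %s::vector)",
        ("Refunds are accepted within 30 days.", "[0.12, 0.05, 0.91]"),
    )
    # <-> is pgvector's distance operator; ordering by it gives nearest-neighbour search.
    rows = conn.execute(
        "SELECT body FROM docs ORDER BY embedding <-> %s::vector LIMIT 5",
        ("[0.10, 0.07, 0.88]",),
    ).fetchall()
```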
The RAG pipeline: documents are chunked and embedded at ingestion; at query time the retriever pulls candidate chunks, a reranker orders them, and the top results are passed to the LLM as grounded context. Hybrid search — combining vector similarity with traditional keyword matching — consistently outperforms either approach alone. Reranking adds a 10–30% precision improvement at a 50–100ms latency cost.
The Ninth Layer
These eight layers give you an observable, controllable, testable agent system. What they do not give you is independent verification that the system works as claimed.
The principle that applies to financial audits and pharmaceutical approvals applies here. You cannot credibly grade your own homework. A governance layer — independent, third-party evaluation using a structured methodology — sits on top of the stack and transforms raw telemetry into evidence that holds up to external scrutiny.
This is where SingleAxis operates. Credentialled evaluators assess your agent against the SingleAxis Standardised AI Safety Framework: 11 categories, 103 codes, four evaluation layers. The output is an Evidence Report — structured, auditable proof that your system has been evaluated by qualified humans against a defined taxonomy, with findings classified by severity and linked to remediation.
The enterprise builds layers one through eight. The evaluation is independent.
The Bottom Line
If someone asks you to prove what your agent did, why it did it, and whether it was checked — can you?
If the answer is no, this is the stack you need. And when you are ready to prove it to someone outside your organisation, that is where we come in.