

Agent Metrics - A Guide for Engineers

Measuring the performance of non-deterministic, compound systems like LLM-powered chat applications is fundamentally different from traditional software. An output can be syntactically perfect and seem plausible, yet be factually incorrect, unhelpful, or unsafe.

A robust measurement strategy requires a multi-layered approach that covers everything from operational efficiency to nuanced aspects of output quality and user success. This requires a shift in thinking from simple pass/fail tests to a portfolio of metrics that, together, paint a comprehensive picture of system performance.

This guide breaks down metric design into two parts:

  1. Foundational Metric Types: The basic building blocks of any measurement system.
  2. A Layered Framework for LLM Systems: A specific, hierarchical approach for applying these metrics to your application.

Part 1: Foundational Metric Types

These are the fundamental ways to structure a measurement. Understanding these types is the first step to building a meaningful evaluation suite.

1. Classification (Categorical)

Measures which discrete, unordered category an item belongs to. The categories have no intrinsic order, and an item can only belong to one. This is crucial for segmenting analysis and routing logic.

Core Question: "What kind of thing is this?" or "Which bucket does this fall into?"

Examples:

  • Intent Recognition: [BookFlight], [CheckWeather], [GeneralChat]. This allows you to measure performance on a per-intent basis.
  • Error Type: [API_Failure], [Hallucination], [PromptRefusal], [InvalidToolOutput]. Segmenting errors is the first step to fixing them.
  • Tool Used: [Calculator], [CalendarAPI], [SearchEngine]. Helps diagnose issues with specific tools in a multi-tool agent.
  • Conversation Stage: [Greeting], [InformationGathering], [TaskExecution], [Confirmation].
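
To make segmentation concrete, here is a minimal sketch of computing per-intent accuracy over an eval set. The record fields (`predicted_intent`, `expected_intent`) are hypothetical names, not from any particular framework.

```python
from collections import defaultdict

# Hypothetical eval records: each has a predicted and an expected intent label.
records = [
    {"predicted_intent": "BookFlight", "expected_intent": "BookFlight"},
    {"predicted_intent": "GeneralChat", "expected_intent": "CheckWeather"},
    {"predicted_intent": "CheckWeather", "expected_intent": "CheckWeather"},
]

totals, correct = defaultdict(int), defaultdict(int)
for r in records:
    totals[r["expected_intent"]] += 1
    if r["predicted_intent"] == r["expected_intent"]:
        correct[r["expected_intent"]] += 1

for intent in totals:
    print(f"{intent}: {correct[intent] / totals[intent]:.0%} accuracy over {totals[intent]} examples")
```
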
2. Binary (Boolean)

A simplified version of classification with only two outcomes. It's the basis of most pass/fail tests and is particularly useful for high-stakes decisions where nuance is less important than a clear "go/no-go" signal.

Core Question: "Did it succeed or not?" or "Does this meet the minimum bar?" Examples:

  • Task Completion: [Success / Failure]
  • Tool Call Validity: [ValidAPICall / InvalidAPICall]. Was the generated tool call syntactically correct?
  • Contains Citation: [True / False]. Did the model cite a source for its claim?
  • Safety Filter Triggered: [True / False]. A critical metric for monitoring responsible AI guardrails.
  • Factually Correct: [True / False]. A high-stakes check that often requires human review or a ground-truth dataset.
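
A binary metric is usually just a predicate over the output. As an illustrative sketch (the required schema here is an assumption, not a standard), this checks whether a generated tool call parses as JSON and carries the expected fields:

```python
import json

REQUIRED_FIELDS = {"tool_name", "arguments"}  # assumed schema, for illustration only

def is_valid_tool_call(raw_output: str) -> bool:
    """Binary metric: does the generated tool call parse and contain the required fields?"""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(call, dict) and REQUIRED_FIELDS.issubset(call)

print(is_valid_tool_call('{"tool_name": "CalendarAPI", "arguments": {"date": "2025-01-01"}}'))  # True
print(is_valid_tool_call('call CalendarAPI(date=2025-01-01)'))                                   # False
```
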
3. Ordinal

Similar to classification, but the categories have a clear, intrinsic order or rank. This allows for more nuanced evaluation than binary metrics, capturing shades of quality. These scales are often defined in human evaluation rubrics.

Core Question: "How good is this on a predefined scale?"

Examples:

  • User Satisfaction Score: [1: Very Unsatisfied, ..., 5: Very Satisfied]. The classic user feedback mechanism.
  • Answer Relevance: [1: Irrelevant, 2: Somewhat Relevant, 3: Highly Relevant]. A common human-annotated metric.
  • Readability: [HighSchool_Level, College_Level, PhD_Level]. Helps align model output with the target audience.
  • Safety Risk: [NoRisk, LowRisk, MediumRisk, HighRisk]. Granular assessment for safety-critical applications.
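
Because ordinal categories are ranked but not necessarily evenly spaced, the median and the full distribution are usually safer summaries than a mean. A small sketch over hypothetical 1-5 ratings:

```python
from collections import Counter
from statistics import median

# Hypothetical 1-5 relevance ratings from human annotators.
ratings = [3, 4, 5, 2, 4, 4, 5, 1, 3, 4]

print("distribution:", dict(sorted(Counter(ratings).items())))   # {1: 1, 2: 1, 3: 2, 4: 4, 5: 2}
print("median rating:", median(ratings))                         # 4.0
print("share rated 4 or better:", sum(r >= 4 for r in ratings) / len(ratings))  # 0.6
```
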
4. Continuous (Scalar)

Measures a value on a continuous range, often normalized between 0.0 and 1.0 for scores, but can be any numeric range. These are often generated by other models or algorithms and provide fine-grained signals.

Core Question: "How much of a certain quality does this have?"

Examples:

  • Similarity Score: Cosine similarity between a generated answer's embedding and a ground-truth answer's embedding (e.g., 0.87).
  • Confidence Score: The model's own reported confidence in its tool use or answer, if the API provides it.
  • Toxicity Probability: The likelihood that a response is toxic, as determined by a separate classification model (e.g., 0.05).
  • Groundedness Score: A score from 0 to 1 indicating how much of the generated text is supported by provided source documents.
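
As one example of a continuous metric, a similarity score between a generated answer's embedding and a reference answer's embedding can be computed with plain NumPy. The embeddings would come from whatever embedding model you already use; the short vectors below are stand-ins.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Continuous score in [-1, 1]; closer to 1 means the embeddings are more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

generated_emb = np.array([0.12, 0.80, 0.35, 0.01])  # stand-in for the generated answer's embedding
reference_emb = np.array([0.10, 0.75, 0.40, 0.05])  # stand-in for the ground-truth embedding

print(f"similarity: {cosine_similarity(generated_emb, reference_emb):.2f}")
```
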
5. Count & Ratio

Measures the number of occurrences of an event or the proportion of one count to another. These are fundamental for understanding frequency, cost, and efficiency.

Core Question: "How many?" or "What proportion?"

Examples:

  • Token Count: Number of tokens in the prompt or response. This directly impacts both cost and latency.
  • Number of Turns: How many back-and-forths in a conversation. A low number can signal efficiency (quick resolution) or failure (user gives up). Context is key.
  • Hallucination Rate: (Count of responses with hallucinations) / (Total responses). A key quality metric.
  • Tool Use Attempts: The number of times the agent tried to use a tool before succeeding or failing. High numbers can indicate a flawed tool definition or a confused model.
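
Counts and ratios are straight aggregation over logged records. A sketch, assuming each record already carries a human- or model-assigned `has_hallucination` flag and a token count (both field names are hypothetical):

```python
# Hypothetical per-response log records.
records = [
    {"has_hallucination": False, "total_tokens": 412},
    {"has_hallucination": True,  "total_tokens": 988},
    {"has_hallucination": False, "total_tokens": 251},
]

hallucination_rate = sum(r["has_hallucination"] for r in records) / len(records)
avg_tokens = sum(r["total_tokens"] for r in records) / len(records)

print(f"hallucination rate: {hallucination_rate:.1%}")    # 33.3%
print(f"avg tokens per request: {avg_tokens:.0f}")        # 550
```
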
6. Positional / Rank

Measures the position of an item in an ordered list. This is crucial for systems that generate multiple options or retrieve information, as the ordering of results is often as important as the results themselves.

Core Question: "Where in the list was the correct answer?" or "How high up was the user's choice?"

Examples:

  • Retrieval Rank: In a RAG system, the position of the document chunk that contained the correct information. A rank of 1 is ideal; a rank of 50 suggests a poor retriever.
  • Candidate Generation: If the system generates 3 draft emails, which one did the user select? (1st, 2nd, or 3rd). If users consistently pick the 3rd option, maybe it should be the 1st.
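
A positional metric simply records where the correct item landed. A minimal sketch of measuring retrieval rank in a RAG pipeline (1-indexed, with `None` when the gold chunk never shows up):

```python
def retrieval_rank(ranked_ids: list[str], gold_id: str) -> int | None:
    """1-indexed position of the gold chunk in the retrieved list, or None if it was missed."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == gold_id:
            return position
    return None

print(retrieval_rank(["chunk_7", "chunk_2", "chunk_9"], gold_id="chunk_2"))   # 2
print(retrieval_rank(["chunk_7", "chunk_2", "chunk_9"], gold_id="chunk_42"))  # None
```
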

Part 2: A Layered Framework for LLM Systems

Thinking in layers helps isolate problems and understand the system's health from different perspectives. A failure at a lower level (e.g., high latency) will inevitably impact the higher levels (e.g., user satisfaction).

Layer 1: Operational & System Metrics (Is it working?)

This is the foundation. If the system isn't running, nothing else matters. These metrics are objective, easy to collect, and tell you about the health and efficiency of your service.

Latency (Time-based):
  • Time to First Token (TTFT): How long until the user starts seeing a response? This is a primary driver of perceived performance. A low TTFT makes an application feel responsive, even if the total generation time is longer.
  • Total Generation Time: Full time from prompt submission to completion.
Throughput (Volume-based):
  • Requests per Second (RPS): How many requests can the system handle? Essential for capacity planning.
Cost (Resource-based):
  • Tokens per Request: Average prompt and completion tokens. This is the primary driver of direct LLM API costs.
  • Cost per Conversation: Total cost of a multi-turn interaction, including all LLM calls, tool calls, and other API services.
Reliability (Error-based):
  • API Error Rate: How often do calls to the LLM or other external tools fail (e.g., due to network issues, rate limits, or invalid requests)?
  • System Uptime: The classic operational metric, representing the percentage of time the service is available.
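
Most of these operational numbers fall out of simple timers around your streaming call. A minimal sketch of measuring TTFT and total generation time, where `stream_tokens` is a stand-in for whatever streaming client you actually use:

```python
import time
from typing import Iterator

def stream_tokens(prompt: str) -> Iterator[str]:
    """Stand-in for a real streaming LLM client; yields tokens as they arrive."""
    for token in ["Hello", ",", " world", "!"]:
        time.sleep(0.05)  # simulated network / generation delay
        yield token

start = time.perf_counter()
ttft = None
for i, token in enumerate(stream_tokens("Say hello")):
    if i == 0:
        ttft = time.perf_counter() - start   # Time to First Token
total = time.perf_counter() - start          # Total Generation Time

print(f"TTFT: {ttft * 1000:.0f} ms, total: {total * 1000:.0f} ms")
```
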

Layer 2: Output Quality Metrics (Is the output good?)

This is the most complex layer and specific to generative AI. "Goodness" is multi-faceted and often subjective. These metrics require more sophisticated evaluation, including other models ("LLM-as-Judge") or structured human review.

Faithfulness / Groundedness (Is it true?):
  • Citation Accuracy (Binary/Ratio): Does the provided source actually support the generated statement? This can be a simple check (the source is relevant) or a strict one (the exact passage is highlighted).
  • Hallucination Rate (Ratio): What percentage of responses contain fabricated information? Defining a "hallucination" requires a clear rubric for human evaluators.
  • Contradiction Score (Continuous): A score from an NLI (Natural Language Inference) model on whether the response contradicts the source documents.
Relevance (Is it on-topic?):
  • Relevance Score (Ordinal/Continuous): How relevant is the response to the user's prompt? Often rated on a scale (e.g., 1-5) or scored by another model using embeddings.
  • Instruction Following (Binary/Ordinal): Did the model adhere to all constraints in the prompt (e.g., "Answer in 3 sentences," "Use a formal tone," "Format the output as a JSON object with keys 'name' and 'email'")? This is a key measure of model steerability.
Clarity & Coherence (Is it well-written?):
  • Readability Score (Continuous): Flesch-Kincaid or similar automated scores to ensure the output is appropriate for the target audience.
  • Grammar/Spelling Errors (Count): Number of detected mistakes.
  • Coherence Score (Ordinal): Does the response make logical sense from beginning to end? This is highly subjective and almost always requires human judgment.
Safety & Responsibility (Is it safe?):
  • Toxicity Score (Continuous): Output from a public or custom-trained toxicity classifier.
  • PII Detection Rate (Binary/Ratio): Does the model leak personally identifiable information, either from its training data or from provided context?
  • Jailbreak Attempt Detection (Binary): Was the user prompt an attempt to bypass safety filters?
  • Bias Measurement (Classification/Ratio): Using a benchmark dataset of templated prompts (e.g., "The [profession] from [country] went to..."), does the model generate responses that reinforce harmful stereotypes?
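
Many of these quality scores come from an "LLM-as-Judge" pass. A hedged sketch of a groundedness judge, where `call_llm` is a placeholder for your own model client and the rubric wording is only illustrative:

```python
JUDGE_PROMPT = """You are grading groundedness.
Source documents:
{sources}

Generated answer:
{answer}

Reply with a single integer from 1 (not supported by the sources) to 5 (fully supported).
"""

def call_llm(prompt: str) -> str:
    """Placeholder for your actual LLM client call."""
    raise NotImplementedError

def groundedness_score(answer: str, sources: str) -> int:
    """Parse the judge's reply into an ordinal 1-5 score, failing loudly on bad output."""
    raw = call_llm(JUDGE_PROMPT.format(sources=sources, answer=answer))
    score = int(raw.strip())
    if not 1 <= score <= 5:
        raise ValueError(f"Judge returned an out-of-range score: {raw!r}")
    return score
```
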

Layer 3: Task & User Success Metrics (Did it help?)

This is the ultimate measure of value. A model can produce a perfect, factual, safe answer, but if it doesn't help the user achieve their goal, the system has failed. These metrics connect model performance to real-world impact.

Task Success:
  • Task Completion Rate (Binary/Ratio): For goal-oriented systems (e.g., booking a ticket, summarizing a document), did the user successfully complete the task? This is often measured by tracking clicks on a final "confirm" button or reaching a specific state.
  • Goal Completion Rate (GCR): A more nuanced version asking if the user achieved their ultimate goal, even if it took a few tries. For example, a user might complete the "task" of finding a recipe but fail their "goal" because it required an ingredient they didn't have.
User Interaction:
  • Thumbs Up/Down Rate (Ratio): Simple, direct user feedback. The most valuable signal when available.
  • Conversation Length (Count): Shorter might mean efficiency; longer might mean engagement. This needs to be correlated with task success to be interpreted correctly.
  • Response Edit Rate (Ratio): How often do users have to copy and then significantly edit the AI's generated response? A high rate is a strong negative signal.
  • Follow-up Question Rate (Ratio): Are users asking clarifying questions because the first answer was incomplete, or are they naturally continuing the conversation?
Business Value:
  • Deflection Rate: In a customer support context, what percentage of issues were solved without escalating to a human agent? A high deflection rate is only good if user satisfaction is also high. This is also the pricing structure for Fin by Intercom.
  • Conversion Rate: Did the interaction lead to a desired business outcome (e.g., a sale, a sign-up)?
  • User Retention (Ratio): Are users coming back to use the application? This is a powerful long-term indicator of value.

RAG Metrics for Technical Leaders: Beyond Recall

MRR, nDCG, Hit Rate, and Recall: Know Your Retrieval Metrics

If you're working on RAG, search, or anything that touches vector databases, you've probably run into a mess of evaluation metrics: MRR, nDCG, hit rate, recall. Everyone throws these terms around. Few explain them well.

This post is for practitioners who want to go from vague intuition to confident decisions.

If you're just starting out and debugging a hallucinating LLM, use Hit Rate. Once you're ready to get serious, use MRR + Recall during retriever tuning, and nDCG + Hit Rate when tuning a reranker or doing system evals.

TL;DR: When to Use What

| Use Case / Need | Metric to Use | Why |
| --- | --- | --- |
| You just want to check if any correct result was retrieved | Hit Rate | Binary success metric, useful for RAG "was it in the top-k?" |
| You want to know how many of the correct results were found | Recall | Focused on completeness: how much signal did you recover |
| You want to know how early the 1st correct result appears | MRR | Good for single-answer QA and fast-hit UIs |
| You care about ranking quality across all relevant results | nDCG | Ideal for multi-relevance tasks like document or product search |

Understanding Each Metric in Detail

✅ Hit Rate

  • Binary metric: did any relevant doc show up in top-k?
  • Doesn’t care if it’s Rank 1 or Rank 5, just needs a hit.

Use Hit Rate when: You're debugging RAG. Great for checking if the chunk with the answer even made it through.

Think: "Did we even get one hit in the top-k?"

↑ Recall

  • Measures what fraction of all relevant documents were retrieved in top-k.
  • Penalizes for missing multiple relevant items.

Use Recall when: You want completeness. Think medical retrieval, financial documents, safety-critical systems.

Think: "Did we find enough of what we needed?"

🔮 MRR (Mean Reciprocal Rank)

  • Tells you how early the first relevant document appears.
  • If the first correct answer is at Rank 1 → score = 1.0
  • Rank 2 → score = 0.5; Rank 5 → score = 0.2

Use MRR when: Only one answer matters (QA, intent classification, slot filling). You care if your system gets it fast.

Think: "Do we hit gold in the first result?"

🔠 nDCG (Normalized Discounted Cumulative Gain)

  • Looks at all relevant docs, not just the first.
  • Discounts docs by rank: higher = better.
  • Supports graded relevance ("highly relevant" vs "somewhat relevant").

Use nDCG when: Ranking quality matters. Ideal for search, recsys, anything with many possible good results.

Think: "Did we rank the good stuff higher overall?"

How They Differ

| Metric | Binary or Graded | 1st Hit Only? | Sensitive to Rank? | Use For... |
| --- | --- | --- | --- | --- |
| Hit Rate | Binary | ❌ No | ❌ No (thresholded) | RAG debugging, presence check |
| Recall | Binary or Graded | ❌ No | ❌ No | Completeness, coverage |
| MRR | Binary | ✅ Yes | ✅ Yes | Fast hits, QA |
| nDCG | Graded | ❌ No | ✅ Yes | Ranking quality, search |

Retrieval Is Not One Metric

People default to one number because it's convenient. But retrieval is multi-objective:

  • You want early relevant hits (MRR)
  • You want most relevant hits (Recall)
  • You want them ranked well (nDCG)
  • You want to know if you're even in the game (Hit Rate)

Choose the metric that matches your product surface.

Pro Tips

  • Use Hit Rate when you're just starting out and debugging a hallucinating LLM

And then use the right metric for the right job:

  • Use MRR + Recall during retriever tuning
  • Use nDCG + Hit Rate when tuning a reranker or doing system evals

Final Word

MRR isn’t better than nDCG. Recall isn’t cooler than Hit Rate. They just answer different questions.

So the next time someone asks, "What's your retrieval performance?" You can say: "Depends. What do you care about?"

Vector Search at Scale: Balancing Cost, Quality, and Sanity

At scale, relevance isn't your only enemy. Cost is. Every millisecond of latency, every token passed to an LLM, and every unnecessary reranker call adds up fast. The iron triangle of retrieval, and hence of RAG, is relevance, cost, and latency: you can only pick two.

Today, we'll focus on cost and latency.

Here's a list of things that teams do that can be improved:

  • Run full-precision vector search for every query
  • Skip lexical signals altogether
  • Avoid reranking because "it's too expensive"
  • Have no system to analyze why results are bad

This post is a walkthrough of what a real retrieval stack looks like when it's designed not just for correctness, but also for operational efficiency and failure debugging.

Figure: Retrieval stack architecture (query router, BM25, vector search, aggregation, reranker) balancing cost, quality, and latency. Each layer maximizes relevance per dollar and enables debugging.

The Architecture

Forget monoliths. Retrieval is a pipeline. Here's the architecture I recommend. Each layer exists for a reason: to maximize relevance per dollar and to make debugging sane.

1. Query Router

This is your traffic cop. It decides how to fan out the query: to a lexical search engine (BM25), a fast vector index, or both. You can route based on query class, business priority, or budget.
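
What the router looks like in practice varies a lot. As a purely illustrative sketch, a rule-based router might fan out based on crude query features; the rules and thresholds below are assumptions, not recommendations:

```python
def route_query(query: str) -> list[str]:
    """Decide which retrieval backends to fan a query out to (illustrative heuristics only)."""
    backends = ["vector"]  # vector search is the default path
    tokens = query.split()
    # Short, keyword-ish queries with acronyms, codes, or digits tend to favour lexical search.
    if len(tokens) <= 4 or any((t.isupper() and len(t) > 1) or any(c.isdigit() for c in t) for t in tokens):
        backends.append("bm25")
    return backends

print(route_query("ERR_CONN_RESET on checkout"))          # ['vector', 'bm25']
print(route_query("how do I change my billing address"))  # ['vector']
```
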

2. BM25 (Lexical Search)

Not dead. In fact, BM25 still shines for acronym-heavy domains, product names, and anything with proper nouns. It's cheap, precise, and the ideal complement to lossy vector embeddings. Run it in parallel with your vector retrieval.

3. Binary Quantized Vector Search (RAM)

This is your fast recall layer: typically a compressed index (binary or scalar quantization, or IVF-PQ) in FAISS or ScaNN. It gets you a top-K quickly and cheaply. Think of it as a rough shortlist generator. Latency under 5 ms is normal.
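
As one concrete, hedged example of this layer, an IVF-PQ index in FAISS keeps compressed vectors in RAM and returns an approximate shortlist. The dimensions and index parameters below are placeholder values, and the random vectors stand in for your real embeddings:

```python
import faiss
import numpy as np

d, n = 384, 20_000                              # embedding dim and corpus size (placeholders)
xb = np.random.rand(n, d).astype("float32")     # stand-in for your document embeddings
xq = np.random.rand(1, d).astype("float32")     # stand-in for a query embedding

quantizer = faiss.IndexFlatL2(d)                    # coarse quantizer for the IVF lists
index = faiss.IndexIVFPQ(quantizer, d, 256, 48, 8)  # 256 lists, 48 sub-quantizers, 8 bits each
index.train(xb)                                     # train PQ/IVF on (a sample of) the corpus
index.add(xb)
index.nprobe = 8                                    # how many IVF lists to scan per query

distances, ids = index.search(xq, 20)               # rough top-20 shortlist for later refinement
print(ids[0][:5])
```
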

4. Full-Precision Vector Search (Disk)

From your shortlist, you can now hit the full-resolution vectors. Higher fidelity, slower access, stored on disk. You should only do this when needed—ambiguous queries, high-value flows, or when the approximate search isn't enough.

5. Cross-Encoder Reranker

This is the first component in the stack that actually understands relevance. Embeddings collapse meaning into vectors. Cross-encoders read both the query and the doc, and compute true semantic alignment. Expensive, yes. But reranking the top 20–100 candidates is usually all you need.
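
As a hedged sketch using the sentence-transformers library (one common option; the model name is just a publicly available example), reranking a shortlist looks roughly like this:

```python
from sentence_transformers import CrossEncoder

# Publicly available MS MARCO cross-encoder, used here purely as an example.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how do I rotate an API key?"
shortlist = [
    "To rotate an API key, create a new key and revoke the old one in the settings page.",
    "Our API supports JSON and XML response formats.",
    "Billing is calculated at the end of each month.",
]

scores = reranker.predict([(query, doc) for doc in shortlist])   # one relevance score per pair
reranked = sorted(zip(shortlist, scores), key=lambda pair: pair[1], reverse=True)
for doc, score in reranked[:2]:
    print(f"{score:.2f}  {doc}")
```
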

6. Result Aggregation

Once you've got candidates from both BM25 and vector search, and re-ranked the best ones, you blend them. The fusion logic depends on your goal: pure precision, diversity, confidence thresholds, etc.
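
One simple, widely used fusion option (not necessarily the one you should ship) is reciprocal rank fusion, which needs only the ranks from each retriever:

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Blend several ranked lists of doc ids; k=60 is the conventional smoothing constant."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["d2", "d5", "d1"]
vector_results = ["d1", "d3", "d2"]
print(reciprocal_rank_fusion([bm25_results, vector_results]))  # ['d2', 'd1', 'd5', 'd3']: docs in both lists rise
```
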

Building Feedback Loops

Most retrieval problems aren't one-off issues. They're patterns. Instead of debugging individual queries, cluster them. Use a mix of token overlap and embedding distance. Add UMAP or HDBSCAN if needed.

The goal isn't just analysis—it's systematic insight:

  • Which queries have zero recall?
  • Which are poorly reranked?
  • Which embeddings collapse semantically distinct queries?

Once you know that, you can prioritize improvements—embedding quality, routing rules, metadata enrichment, or prompt tuning—at the cluster level. Much higher leverage than spot fixes.
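
A hedged sketch of the clustering step using the hdbscan package on query embeddings; the embeddings here are random stand-ins for vectors produced by your own embedding model over failed or low-quality queries:

```python
import numpy as np
import hdbscan

# Stand-in: one embedding per failed query, from whatever embedding model you already use.
failed_query_embeddings = np.random.rand(500, 384).astype("float32")

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, metric="euclidean")
labels = clusterer.fit_predict(failed_query_embeddings)   # -1 marks noise / unclustered queries

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"{n_clusters} failure clusters; {np.sum(labels == -1)} queries left as noise")
```
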

Why This Matters for RAG

If your retrieval is weak, your LLM has to do all the heavy lifting. That means more tokens, more hallucinations, slower responses. And ironically, worse answers.

Your retrieval stack should do two things:

  1. Return the most relevant docs
  2. Let you understand why it didn't

Without that, you're just doing GPT improv with 3 PDFs in context.

Don't treat retrieval as a "vector db" checkbox. Treat it as a system. The best stacks layer:

  • Cheap recall
  • Precise reranking
  • Old-school lexical sanity checks

One-line summary: RAM-level quantized vectors give you scale, disk-level full vectors give you fidelity, BM25 gives you robustness, rerankers give you actual relevance, and query clustering gives you insight.

What's expensive isn't reranking. What's expensive is debugging bad search with no observability.

If you're building RAG at scale and want to audit your retrieval infra, I do this for a living. We go from "it kind of works" to "we know exactly what's wrong and how to fix it."