What Is AI And LLM Observability?

Sid Arora

Mar 11, 2026


Hey,

Here is a problem that does not exist with traditional products.

Your feature is running fine. Requests come in, responses go out.

No errors. No alerts. The infrastructure dashboard is green.

And yet your AI has been giving users wrong information for two weeks.

With a normal software bug, the system fails loudly. A button stops working. An API returns a 500 error. Something turns red and someone gets paged.

With an LLM, nothing breaks. The system keeps responding. The response time is normal. The HTTP status is 200.

The answer is just wrong.

A support chatbot starts making up refund policies that do not exist. A search tool starts returning results that sound correct but are not. An escalation bot stops flagging edge cases and starts resolving them with invented answers.

No dashboard alerts. Nobody notices. Until a user complains. Weeks later.

This is the problem that LLM observability exists to solve.

Why Your Existing Monitoring Is Not Enough

Every engineering team already monitors their software. Response times, error rates, uptime. This is standard.

When you add an LLM to your product, all of that still matters. But it only tells you one thing: did the system respond?

It does not tell you whether the response was any good.

A request can succeed in 400 milliseconds and still return a completely inaccurate answer. Your monitoring will show everything is healthy. Your user just got wrong information.

Traditional monitoring watches the infrastructure. LLM observability watches what the model is actually producing.

You need both.

*Comparison of traditional monitoring and LLM observability, showing what each layer sees and misses*

The Three Things to Measure

LLM observability covers three things that traditional monitoring does not touch.

Quality. Is the model giving correct answers? Is it grounded in the right information? Is it following your instructions?

Quality is hard to measure because there is no status code for "this answer was wrong." But it is the most important thing to track, because quality degradation is what users actually experience.

Cost. LLM providers charge per token. Every single request costs money.

A RAG pipeline retrieving too many chunks, a prompt that has grown verbose over time, a feature being called more often than expected — all of these show up as cost increases before anything else signals a problem.
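To make that concrete, here is a minimal sketch of how per-request cost might be estimated from token counts. The prices and the `estimate_cost` helper are hypothetical, chosen only for illustration. Real rates vary by provider and model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in USD per one million tokens."""
    input_per_million: float
    output_per_million: float


def estimate_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Return the estimated USD cost of a single LLM call."""
    input_cost = input_tokens / 1_000_000 * pricing.input_per_million
    output_cost = output_tokens / 1_000_000 * pricing.output_per_million
    return input_cost + output_cost


# Illustrative numbers only, not any provider's real price list.
pricing = Pricing(input_per_million=3.0, output_per_million=15.0)

# Same question, but the second prompt carries four times as much retrieved context.
lean = estimate_cost(input_tokens=2_000, output_tokens=300, pricing=pricing)
bloated = estimate_cost(input_tokens=8_000, output_tokens=300, pricing=pricing)

print(f"lean prompt:    ${lean:.4f}")
print(f"bloated prompt: ${bloated:.4f}")
```

The user sees the same answer in both cases. Only the bill changes, which is why cost is worth tracking per request rather than per month.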

Latency. LLMs are slow compared to most APIs. A two-second response might be fine for a document summary tool and completely unacceptable for a live support chat.

Latency also tends to increase over time as prompts grow and pipelines get more complex.

What Is a Trace?

The most important concept in LLM observability is the trace.

A trace is a complete record of everything that happened during a single user request, from the moment the user sent their message to the moment they got a response.

For a simple chatbot, this is easy. User message in, model response out. One step.

But most AI systems in production are not one step. They are pipelines.

Take a RAG-powered support bot. For a single user message, it might do all of this:

1. Convert the user's question into a vector
2. Search the knowledge base for matching documents
3. Retrieve the most relevant chunks
4. Re-rank them
5. Build a prompt with the best context
6. Call the LLM
7. Stream the response back

That is seven steps. Each one has its own latency, its own cost, and its own way of going wrong.

A trace captures all seven. It records the input and output at each step, how long each took, how many tokens were used, and how everything connected together.

*A trace in an LLM observability tool shows every step of a request. Each span captures its own latency, inputs, and outputs.*

Each individual step inside a trace is called a span. The trace is the full request. The spans are the steps within it.

This idea comes from how engineers already debug microservices. It is the same concept, applied to AI pipelines. The difference is that LLM spans also capture things like the exact prompt that went in, the exact response that came out, the model version, and the token count.
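If it helps to see the shape of the data, here is a rough sketch of a trace and its spans in plain Python. The `Trace` and `Span` classes are hypothetical and are not the API of any particular tool. The retrieval and model calls are stand-ins so the example runs on its own.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class Span:
    """One step inside a request."""
    name: str
    inputs: Dict[str, Any]
    outputs: Dict[str, Any] = field(default_factory=dict)
    duration_ms: float = 0.0
    tokens: int = 0


@dataclass
class Trace:
    """The full request: an ordered list of spans under one ID."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: List[Span] = field(default_factory=list)

    @contextmanager
    def span(self, name: str, **inputs: Any) -> Iterator[Span]:
        """Record one step: its inputs, its outputs, and how long it took."""
        current = Span(name=name, inputs=inputs)
        start = time.perf_counter()
        try:
            yield current
        finally:
            current.duration_ms = (time.perf_counter() - start) * 1000
            self.spans.append(current)

    def total_tokens(self) -> int:
        return sum(s.tokens for s in self.spans)


trace = Trace()

with trace.span("retrieve", query="What is the refund window?") as s:
    # Stand-in for a real vector search.
    s.outputs["chunks"] = ["Refunds are accepted within 30 days."]

with trace.span("llm_call", model="example-model", prompt="Answer using the context...") as s:
    # Stand-in for a real model call.
    s.outputs["response"] = "You can request a refund within 30 days."
    s.tokens = 42

for s in trace.spans:
    print(f"{s.name:<10} {s.duration_ms:.3f} ms  tokens={s.tokens}")
```

Real tools add nesting, sampling, and storage on top, but the core record is roughly this: one ID for the request, and one entry per step.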

When something goes wrong, you can go back and replay exactly what happened.

How Teams Measure Quality at Scale

Counting tokens is easy. Figuring out whether an answer was good is not.

If your system handles a thousand conversations a day, you cannot read them all. You need a way to score quality automatically.

The solution the industry has settled on is called LLM-as-judge.

Here is how it works. For every response your system produces, a second LLM (the judge) scores it against a specific rule. The rule is always a TRUE or FALSE question:

Did the response answer the user's actual question? (TRUE or FALSE)

Is every factual claim grounded in the documents that were retrieved? (TRUE or FALSE)

Did the assistant escalate when it was supposed to, instead of guessing? (TRUE or FALSE)

The judge reads the conversation, applies the rule, and outputs a score.
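Here is a minimal sketch of what one judge call might look like. The prompt wording, the `judge` function, and `fake_judge_model` are all assumptions made for illustration. In a real system, `call_model` would wrap whichever LLM API you use.

```python
from typing import Callable

JUDGE_PROMPT = """You are grading a support assistant.

Rule: {rule}

Retrieved documents:
{context}

User question:
{question}

Assistant response:
{response}

Answer with exactly one word: TRUE or FALSE."""


def judge(rule: str, question: str, context: str, response: str,
          call_model: Callable[[str], str]) -> bool:
    """Ask the judge model one TRUE/FALSE question about a response."""
    prompt = JUDGE_PROMPT.format(
        rule=rule, context=context, question=question, response=response
    )
    verdict = call_model(prompt).strip().upper()
    if verdict not in {"TRUE", "FALSE"}:
        # Fail loudly instead of silently counting a malformed verdict.
        raise ValueError(f"Unexpected judge output: {verdict!r}")
    return verdict == "TRUE"


def fake_judge_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call, so the sketch runs offline."""
    return "FALSE" if "90 days" in prompt else "TRUE"


RULE = "Is every factual claim grounded in the documents that were retrieved?"
DOCS = "Refunds are accepted within 30 days of purchase."
QUESTION = "How long do I have to request a refund?"

good = judge(RULE, QUESTION, DOCS, "You have 30 days from purchase.", fake_judge_model)
bad = judge(RULE, QUESTION, DOCS, "You have 90 days from purchase.", fake_judge_model)
print(good, bad)
```

Keeping each rule to a single yes-or-no question is the design choice that matters here. It makes the output trivial to parse and easy to count.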

The judge does not need to be the most powerful model. It needs to be consistent. If it agrees with human reviewers most of the time and produces the same score for the same input, it is good enough.
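One way to check "agrees with human reviewers most of the time" is to score a small hand-labelled sample with the judge and compare. The sketch below assumes you already have both sets of labels. The sample data is invented.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[bool], human_labels: Sequence[bool]) -> float:
    """Share of responses where the judge and the human reviewer gave the same verdict."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("Both label lists must cover the same responses.")
    if not human_labels:
        raise ValueError("Need at least one labelled response.")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


# Invented sample: ten responses labelled by a human and by the judge.
human_labels = [True, True, False, True, False, True, True, False, True, True]
judge_labels = [True, True, False, True, True, True, True, False, True, True]

rate = agreement_rate(judge_labels, human_labels)
print(f"Judge agrees with human reviewers on {rate:.0%} of the sample")
```

What counts as "good enough" agreement is a judgment call for your team. The point is to measure it once before trusting the judge at scale.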

This is what makes it scalable. Instead of someone reading 50 conversations a week, you can score thousands of responses every day automatically.

*Flow diagram showing the steps in LLM-as-judge*

LLM-as-judge is not perfect. Judge models tend to prefer longer responses, and they can miss subtle errors that a domain expert would catch. They are most reliable for obvious failures like hallucinations, off-topic responses, and missing escalations.

The right way to think about it: the judge is a smoke alarm. It tells you something needs attention. A human still goes and investigates.

What Good Instrumentation Gives You

When observability is set up properly, a few things become possible that were not before.

You can look up any conversation from any point in time and see every step that produced it. When a user says "the bot gave me wrong information last Tuesday," you are not guessing what happened. You are reading the exact trace.

You can track quality as a metric over time. Not "that response seemed fine" but "our hallucination rate last week was 3.2%, up from 1.8% the week before." That kind of trend is invisible without systematic measurement.
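A weekly rate like that is just judge verdicts grouped by week. Here is a small sketch, assuming each record is a date plus the judge's TRUE or FALSE verdict on groundedness. The records are made up.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple


def weekly_failure_rate(records: Iterable[Tuple[date, bool]]) -> Dict[str, float]:
    """Map ISO week label to the share of responses the judge marked as not grounded."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for day, grounded in records:
        iso = day.isocalendar()
        week = f"{iso[0]}-W{iso[1]:02d}"
        totals[week] += 1
        if not grounded:
            failures[week] += 1
    return {week: failures[week] / totals[week] for week in sorted(totals)}


# Made-up verdicts: (date of the response, did the judge say it was grounded?)
records = [
    (date(2026, 3, 2), True), (date(2026, 3, 3), True),
    (date(2026, 3, 4), False), (date(2026, 3, 5), True),
    (date(2026, 3, 9), False), (date(2026, 3, 10), True),
    (date(2026, 3, 11), False), (date(2026, 3, 12), True),
]

rates = weekly_failure_rate(records)
for week, rate in rates.items():
    print(f"{week}: {rate:.1%} of responses not grounded")
```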

You can see cost broken down by feature. A simple classification call might cost $0.001. A multi-step agent might cost $0.40. Knowing this at the feature level tells you where to focus optimisation, and helps you build accurate cost models before usage scales.
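If every logged call carries a feature tag, the breakdown is a simple aggregation. The feature names and log format below are hypothetical. The per-call costs reuse the figures above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def cost_by_feature(calls: Iterable[Mapping[str, object]]) -> Dict[str, float]:
    """Sum the logged USD cost of each call under its feature tag."""
    totals: Dict[str, float] = defaultdict(float)
    for call in calls:
        totals[str(call["feature"])] += float(call["cost_usd"])
    return dict(totals)


# Hypothetical call log: one entry per LLM request.
calls = [
    {"feature": "classify_ticket", "cost_usd": 0.001},
    {"feature": "classify_ticket", "cost_usd": 0.001},
    {"feature": "classify_ticket", "cost_usd": 0.001},
    {"feature": "support_agent", "cost_usd": 0.40},
    {"feature": "support_agent", "cost_usd": 0.40},
]

totals = cost_by_feature(calls)
for feature, total in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{feature:<16} ${total:.3f}")
```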

You can set alerts on things that matter. When hallucination rate crosses a threshold, when P95 latency spikes, when refusal rates jump unexpectedly — your team gets notified before users start complaining.
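An alert rule can be as plain as a threshold check over a recent window. This sketch uses Python's standard `statistics.quantiles` for P95. The threshold values and the `check_alerts` helper are assumptions, and a real setup would send the messages to your paging tool rather than print them.

```python
import statistics
from typing import Dict, List, Sequence

# Hypothetical thresholds. Pick values that match your own product.
THRESHOLDS: Dict[str, float] = {"p95_latency_ms": 3000.0, "hallucination_rate": 0.03}


def p95(values: Sequence[float]) -> float:
    """95th percentile of the values. Needs at least two data points."""
    return statistics.quantiles(values, n=100)[94]


def check_alerts(latencies_ms: Sequence[float], hallucination_rate: float,
                 thresholds: Dict[str, float]) -> List[str]:
    """Return one message per threshold that the current window has crossed."""
    alerts: List[str] = []
    latency = p95(latencies_ms)
    latency_limit = thresholds["p95_latency_ms"]
    rate_limit = thresholds["hallucination_rate"]
    if latency > latency_limit:
        alerts.append(f"P95 latency {latency:.0f} ms is above {latency_limit:.0f} ms")
    if hallucination_rate > rate_limit:
        alerts.append(f"Hallucination rate {hallucination_rate:.1%} is above {rate_limit:.1%}")
    return alerts


# Made-up window: most requests are fast, one in ten is very slow.
latencies = [800.0] * 90 + [6000.0] * 10
alerts = check_alerts(latencies, hallucination_rate=0.032, thresholds=THRESHOLDS)
for message in alerts:
    print("ALERT:", message)
```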

The Tools

A small set of tools has emerged specifically for LLM observability.

Langfuse is open source and self-hostable. It supports traces, LLM-as-judge scoring, and prompt management, and works with any model. Teams that want full control over their data tend to start here.

LangSmith is from the LangChain team but works with any framework. It covers tracing, evaluation, prompt versioning, and deployment in one place.

Arize Phoenix is open source and built on OpenTelemetry. It is a strong choice for teams that run systematic offline evals alongside production monitoring.

Datadog LLM Observability sits on top of Datadog's existing infrastructure monitoring. If your team already uses Datadog, this gives you AI and infrastructure in one dashboard.

Most teams pick one tool, start with basic traces, and expand from there.

*Comparison table showing different tools for observability*

How This Connects to Evals

If you have built evals before, observability will feel familiar. They are two halves of the same idea.

Evals are offline. You run a test set before you ship. You check whether the system behaves correctly on inputs you already know.

Observability is online. You monitor real user interactions after you ship. You check whether the system behaves correctly on inputs you did not anticipate.

The eval set you built before launch becomes your baseline. Observability tells you whether production is holding to that baseline — or drifting from it.

When production surfaces a failure your evals did not catch, that failure becomes a new test case. You add it to the set, run it in future evals, and the system gets better.
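In practice this can be a very small piece of code. The sketch below assumes the eval set is a list of dictionaries and that each production failure comes with a trace ID. Both the format and the `add_regression_case` helper are hypothetical.

```python
from typing import Dict, List


def add_regression_case(eval_set: List[Dict[str, str]], user_input: str,
                        expected: str, trace_id: str) -> bool:
    """Add a production failure to the eval set. Returns False if it is already there."""
    if any(case["input"] == user_input for case in eval_set):
        return False
    eval_set.append({
        "input": user_input,
        "expected": expected,
        "source_trace_id": trace_id,  # lets you find the original failure later
    })
    return True


# The eval set you built before launch.
eval_set = [
    {"input": "How do I reset my password?",
     "expected": "Points to the reset link in account settings.",
     "source_trace_id": "pre-launch"},
]

# A failure found in production: the bot invented a refund policy.
added = add_regression_case(
    eval_set,
    user_input="Can I get a refund after 60 days?",
    expected="Escalates to a human instead of inventing a policy.",
    trace_id="example-trace-id",
)
print(added, len(eval_set))
```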

This is why the best AI teams treat evals and observability as one loop, not two separate things.

Define what good looks like before you ship. Measure it after you ship. Use what you find in production to improve your test set. Repeat.

Shipping the feature is not the end. It is where the work of keeping it good actually starts.

