A field study of 140 engineering leaders running LLMs in production reveals an unexpected pattern: the more AI you ship, the worse your observability gets. What separates leaders from laggards has little to do with how many tools they own, and almost everything to do with how those tools talk to each other.
Most teams assume fragmentation is a phase they'll grow out of as their AI practice matures. The data says the opposite. Teams scaling past 10 LLM apps in production are 2.2× as likely as their peers to be running 5 or more monitoring tools, not because they bought more, but because each new application brought its own. Fragmentation is the maturity tax of production AI. It compounds at every seam in the stack: between teams (67% report cross-team breakdowns at production handoffs), between tools (82% of teams who specified a count are juggling 3 or more separate monitoring tools), and across regions (APAC teams report debug cycles measured in days or weeks at more than twice the rate of NA or EMEA peers). The most striking pattern in the dataset has nothing to do with tooling spend. Self-described leaders run roughly the same stacks as everyone else. What sets them apart is how those stacks have been wired together.
Every respondent works at an organization with at least one large language model serving real traffic. The median respondent runs 5–10 LLM-powered features. Their day-to-day cuts across prompt engineering, retrieval tuning, latency and cost management, drift detection, and incident response.
5–10 LLM features in production; 18% are running 10+.
Datadog, Prometheus, Grafana, LangSmith, Langfuse, Arize, OpenTelemetry, MLflow, plus model-provider dashboards and home-grown logging and cost trackers.
The intuitive story would be that fragmentation is an early-stage problem teams sort out as they mature. The data tells the opposite story. Each new LLM application brings its own logging, tracing, and evaluation tooling. By the time a team has more than a handful of apps in production, they're running a substantially larger and more disconnected tool set than peers earlier in the journey.
In concrete terms: a "fragmented" team will run something like Datadog APM for application traces, Prometheus and Grafana for infrastructure, LangSmith or Langfuse for LLM tracing, Arize or a custom framework for evals, and a separate dashboard for token spend. Five tools, five contexts, five places to look when something breaks.
Distribution among the 117 respondents who specified a count. One in three is running 5 or more.
Customer reports the chatbot gave a confidently wrong answer. The on-call engineer's first hour:
Datadog APM, notes the timestamp.
LangSmith in a second tab, finds the LLM trace, the prompt, the retrieved context.
vector DB, in case the index was stale.
Slack to confirm the prompt template hasn't been edited.
Six tools, three teams, one incident. This is what 78% of respondents called "manually stitching tools together."
Share of all 140 respondents describing their tooling state.
"We use a mix of custom dashboards built with Prometheus and Grafana for basic metrics like latency and error rates. For model-specific issues, we've integrated with tools like LangSmith to trace and evaluate outputs. We also log user feedback to flag potential problems. It's still evolving, but this setup gives us decent visibility."
CTO, 1,000+ employees
Surface-level operational signals are caught. AI-specific signals are systematically missed.
Standard observability answers one question: is the service responding, and is it fast? An LLM application can score 100% on uptime and 100% on latency SLOs and still ship a bad product.
Confidently wrong answers ship with HTTP 200 and sub-second latency. Traditional monitoring stays silent.
Output quality slides as upstream data, indexes, or prompts shift. No error fires; only evals catch it.
The RAG layer pulls the wrong document. The LLM answers fluently. Both systems log "success."
A two-line prompt change can 10× the bill overnight. Infra monitoring tracks CPU, not tokens.
Tracing a bad output back to the exact prompt, model, and retrieved context that produced it. Most stacks don't connect them.
Verifying a fix worked means re-running evaluation suites against sampled production traffic. Most stacks have nowhere to put the result.
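One way to picture the missing connection is a single record per LLM call, keyed by the same trace ID the application span already carries. The sketch below is illustrative only: `LLMCallRecord`, `LineageStore`, and every field name are hypothetical, and the in-memory dictionary stands in for whatever store a team actually uses. It shows the prompt version, model, retrieved documents, token counts, and evaluation results living in one place.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LLMCallRecord:
    """One LLM call, keyed by the trace ID of the request that triggered it."""
    trace_id: str
    prompt_version: str
    model: str
    retrieved_doc_ids: list[str]
    prompt_tokens: int
    completion_tokens: int
    eval_scores: dict[str, float] = field(default_factory=dict)


class LineageStore:
    """Hypothetical in-memory stand-in for a shared call-record store."""

    def __init__(self) -> None:
        self._by_trace: dict[str, LLMCallRecord] = {}

    def record(self, call: LLMCallRecord) -> None:
        self._by_trace[call.trace_id] = call

    def attach_eval(self, trace_id: str, name: str, score: float) -> None:
        # Evaluation results are stored on the same record as the call they judge.
        self._by_trace[trace_id].eval_scores[name] = score

    def explain(self, trace_id: str) -> Optional[LLMCallRecord]:
        # Returns everything known about the call, or None for an unknown trace.
        return self._by_trace.get(trace_id)


# Example with made-up values.
store = LineageStore()
store.record(LLMCallRecord(
    trace_id="t-123",
    prompt_version="support-v7",
    model="example-model",
    retrieved_doc_ids=["doc-42"],
    prompt_tokens=812,
    completion_tokens=96,
))
store.attach_eval("t-123", "groundedness", 0.2)
bad_call = store.explain("t-123")
```

With a record like this, a customer-reported bad answer resolves to one lookup by trace ID rather than a search across several dashboards.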
"It works fine for detecting basic errors like outages or high latency, but falls short on catching subtle issues like gradual quality degradation, hallucinations, or prompt injection. We also can't trace a bad output back to a specific input or model state, and there's no automated alerting for semantic drift. This means users often notice problems before we do, which hurts trust."
CTO / Head of Engineering, 1,000+ employees
Self-reported maturity of operational practices around running AI in production. Most teams admit their practices haven't caught up with the tools they've already bought.
"The real cost of tool fragmentation isn't just the subscription fees; it's a 'context-switching tax' that slows down your entire engineering cycle. In a fragmented environment, debugging feels less like engineering and more like private investigation."
Director of Engineering, 800 employees
"We've been adding tools for two years now. Every time we ship a new model or use case, someone says 'we need to monitor this' and a new tool gets added. Nobody has ever said 'let's consolidate.'"
Head of AI, 500-999 employees
Fragmentation is the maturity tax of production AI. Each new LLM app brings its own model provider metrics, retrieval logging, evaluation framework, and cost dashboards. Teams that started with a few apps and a manageable stack find themselves, two years later, with twice the tool count and the same monitoring philosophy. By the time the cost shows up in incident response, engineer attention, and cross-team friction, the stack is entrenched and the integration debt is high.
Fragmentation collects at the seams: wherever the stack has a boundary that engineers, teams, or regions have to bridge by hand. The cost is real, but most of it never appears on a budget line. It shows up as engineer attention, debug time, and cross-team friction. Three seams stand out: between tools (where engineers do the correlation), between teams (where ownership of failures gets contested), and between regions (where the same problem manifests at very different intensities).
Most respondents named multiple parallel first steps. Each one lives in a different dashboard, often owned by a different team.
"Our debugging process usually starts with checking the logs for latency spikes and then diving into a trace to see where the prompt or retrieval went sideways. Depending on the complexity, it can take anywhere from a few hours for a quick fix to a couple of days if we have to adjust the RAG pipeline or fine-tune our guardrails."
Engineering Manager, AI Team, 500-999 employees
Themes mentioned when respondents described what fragmentation actually costs them. The dominant story is hours and headspace, not dollars.
"It shows up as slower debugging, more context switching between tools like Grafana and Datadog, inconsistent metrics across systems, duplicated instrumentation, and longer time to connect a user-facing failure back to the exact prompt, model version, and data issue causing it."
CTO, 1,000+ employees
Longest time unit referenced when describing typical debug-to-fix time. Roughly half of teams who specified a duration are working in days or weeks, not hours or minutes.
Two-thirds of respondents report breakdowns at handoffs. Engineering owns the application; ML owns the model; platform owns the infrastructure. When a hallucination ships, no one owns the answer.
accuracy. Platform watched uptime. The cost-and-latency seam between them had no owner. A unified observability layer would have surfaced the token-per-request regression in the same dashboard ML was already watching. Instead, the bill caught it.
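A token-per-request regression of the kind described here can be caught with a very small check. The sketch below is a simplification under stated assumptions: the function names, the fixed 1.5× threshold, and the sample token counts are all hypothetical, and a real alert would likely account for traffic mix and normal variance.

```python
from statistics import mean


def tokens_per_request_ratio(baseline: list[int], current: list[int]) -> float:
    """Mean tokens per request in the current window divided by the baseline mean."""
    if not baseline or not current:
        raise ValueError("both windows need at least one request")
    baseline_mean = mean(baseline)
    if baseline_mean == 0:
        raise ValueError("baseline mean must be positive")
    return mean(current) / baseline_mean


def is_token_regression(
    baseline: list[int], current: list[int], threshold: float = 1.5
) -> bool:
    """Flag the current window when its mean token usage exceeds the threshold ratio."""
    return tokens_per_request_ratio(baseline, current) >= threshold


# Made-up token counts: total tokens per request before and after a prompt change.
before = [800, 820, 780]
after = [8000, 8200, 7800]
ratio = tokens_per_request_ratio(before, after)
flagged = is_token_regression(before, after)
```

The point of the sketch is placement rather than sophistication: the check only helps if token counts sit next to the quality metrics a team already watches.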
"A classic example was when our ML team updated a model's input schema without notifying the platform team, causing a production failure because the downstream infrastructure wasn't configured to handle the new data format."
Director of Engineering, 500-999 employees
Each seam converts fragmentation into operational drag. Between tools, it shows up as time: 64% of respondents call wasted time the chief cost of fragmentation, and roughly half of those reporting a duration are debugging in days or weeks. Between teams, it shows up as ownership ambiguity: 67% report collaboration breakdowns at production handoffs, with ML, engineering, and platform looking at different dashboards and disagreeing on what "failure" even means. The mechanism is the same in both cases: the AI stack has more boundaries than the observability stack has bridges.
From Q5. Among respondents who specified a duration. APAC highlighted.
From Q14. Full region sample. APAC highlighted.
The same problem manifests as a 35% pain rate in EMEA and 79% in APAC. APAC respondents report cycles in days or weeks at more than twice the NA or EMEA rate, and they report the steepest cross-team friction as well. The operational cost of fragmentation lands hardest where the surrounding observability practices are still earliest in their maturity curve.
Share of each region reporting their typical longest debug-to-fix duration.
"Engineers have to jump across multiple tools to reconstruct what happened end to end. That slows incident response and makes root cause analysis mostly manual correlation. It increases cognitive load because no single system shows the full picture."
CTO, 200-499 employees
"Collaboration is generally solid at launch and during incidents, but it breaks down around ownership of quality issues and end-to-end debugging responsibility. The main gap is misalignment in how ML, engineering, and product interpret what 'failure' actually means."
CTO, 200-499 employees
When respondents were asked to position themselves against peers, a striking pattern emerged. Self-described leaders run roughly the same number of monitoring tools as everyone else. What separates them is how those tools work together. Across every dimension we measured, the "ahead of the curve" cohort reports dramatically lower rates of fragmentation, collaboration breakdown, and slow debugging.
Self-positioning across the full sample. Each respondent placed in one bucket based on their answer.
The distribution itself is unremarkable. The question that matters is whether self-positioning actually tracks with operational reality. The cross-tab below answers that, and the gap is dramatic.
Three views of the leader-laggard gap. Leaders run similar tool counts, plan less new spend, and report a fraction of the pain.
Tool stack sizes are nearly identical across tiers (3.9 vs 3.6 vs 3.4 tools). And only 16% of leaders plan observability investment over the next 12 months, versus 84% of "catching up" teams. The pattern holds on other pain metrics too: leaders report 43 percentage points less collaboration breakdown (52% vs 95%) and 32 points less slow debugging (26% vs 58%).
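The gaps quoted above are differences in percentage points, not relative reductions. The short check below reproduces them from the reported rates; the function name is hypothetical, and treating the second figure in each pair as the comparison cohort is an assumption.

```python
def point_gap(leaders_pct: float, comparison_pct: float) -> float:
    """Difference in percentage points between the comparison cohort and leaders."""
    return comparison_pct - leaders_pct


gaps = {
    "collaboration breakdown": point_gap(52, 95),  # rates as reported in the text
    "slow debugging": point_gap(26, 58),
}
```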
What leaders have done differently is architectural. They've built the connective tissue across the stack they already own: shared traces, common metrics, unified ownership. The same tools that look fragmented in a "catching up" org function as a single system in an "ahead" org.
"It would mainly reduce a lot of the guesswork and back-and-forth we deal with today. Instead of jumping between different tools to understand what's going wrong, we'd have a single place to see how the model, infrastructure, and application are all behaving together. That would make debugging faster, incident response smoother, and day-to-day monitoring much less fragmented. Overall, it would help the team move faster with more confidence in production changes."
Director of Engineering, 1,000+ employees
When asked, "What's one thing you're doing that you think most teams aren't?" two CTOs from "ahead of the curve" teams gave the most concrete operational practices in the dataset.
"We version and test our prompts like we version code. What we do differently are prompts registry with Git-style history, pre-production prompt test, canary rollouts."
CTO, EMEA
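The respondent does not describe an implementation. As a rough illustration of what a prompt registry with Git-style history and a canary rollout could look like, the sketch below uses hypothetical names throughout (`PromptRegistry`, `commit`, `history`, `use_canary`) and keeps everything in memory.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PromptVersion:
    name: str
    text: str
    parent: Optional[str]  # digest of the previous version, None for the first
    digest: str


class PromptRegistry:
    """Hypothetical in-memory registry where each version points at its parent."""

    def __init__(self) -> None:
        self._versions: dict[str, PromptVersion] = {}
        self._head: dict[str, str] = {}

    def commit(self, name: str, text: str) -> str:
        parent = self._head.get(name)
        # The digest covers name, parent, and text, so history is part of identity.
        payload = f"{name}\n{parent}\n{text}".encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()[:12]
        self._versions[digest] = PromptVersion(name, text, parent, digest)
        self._head[name] = digest
        return digest

    def history(self, name: str) -> list[PromptVersion]:
        """Versions of one prompt, newest first."""
        versions = []
        cursor = self._head.get(name)
        while cursor is not None:
            version = self._versions[cursor]
            versions.append(version)
            cursor = version.parent
        return versions


def use_canary(request_id: str, canary_percent: int) -> bool:
    """Deterministically route a fixed share of requests to the new prompt."""
    bucket = int(hashlib.sha256(request_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < canary_percent


registry = PromptRegistry()
v1 = registry.commit("support", "Answer briefly.")
v2 = registry.commit("support", "Answer briefly and cite sources.")
log = registry.history("support")
```

Hashing the request ID keeps routing stable, so the same request always sees the same prompt version while the canary percentage is unchanged.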
"We have democratized AI deployment to the point where non-technical departments can launch their own production tools using our centralized templates and guardrails. We are also running automated LLM-as-a-judge evaluations across hundreds of disparate use cases simultaneously."
CTO, APAC
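The quote gives no detail on how these evaluations run. A minimal sketch of the general LLM-as-a-judge pattern, assuming the judge is a callable that scores a question and answer between 0 and 1, might look like the following. The names, the 0.7 pass threshold, and the stub judge are all hypothetical; in practice the judge would call a model rather than match a string.

```python
from typing import Callable

# (question, answer) -> score in [0, 1]; a real judge would call an LLM.
Judge = Callable[[str, str], float]


def pass_rate_by_use_case(
    samples: list[tuple[str, str, str]],
    judge: Judge,
    pass_threshold: float = 0.7,
) -> dict[str, float]:
    """Score (use_case, question, answer) samples and return the pass rate per use case."""
    totals: dict[str, int] = {}
    passes: dict[str, int] = {}
    for use_case, question, answer in samples:
        totals[use_case] = totals.get(use_case, 0) + 1
        if judge(question, answer) >= pass_threshold:
            passes[use_case] = passes.get(use_case, 0) + 1
    return {name: passes.get(name, 0) / count for name, count in totals.items()}


def stub_judge(question: str, answer: str) -> float:
    # Placeholder scoring rule used only to make the sketch runnable.
    return 0.0 if "not sure" in answer.lower() else 1.0


samples = [
    ("support", "What is the refund window?", "30 days."),
    ("support", "Can I change my plan?", "I'm not sure."),
    ("billing", "What does the basic plan cost?", "$20 per month."),
]
rates = pass_rate_by_use_case(samples, stub_judge)
```

Grouping results by use case is what lets one evaluation harness serve many applications at once.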
48% of teams plan to invest in observability and monitoring next year. 44% specifically named a unified or consolidated platform as their top priority.
Themes named when respondents described what unified observability would unlock for their teams.
The market knows what it needs. Where teams differ is in execution: the "ahead of the curve" cohort is already operating their stack as if it were one, while everyone else is still planning the platform purchase. The path forward is the same. The integration layer is the work.
Theme named in next-12-month investment plans. EMEA highlighted.
Same regional ordering on both metrics: EMEA most committed, APAC least. EMEA and North America are converting their pain into platform investment. APAC's lower investment intent, paired with the longest debug cycles and the deepest cross-team friction, is the most striking tension in the dataset: the region experiencing the most acute version of the maturity tax is also the slowest to address it.
"Bandwidth allocation has been crucial for our team. We prioritized demonstrating value and delivering new use cases, which yielded the best return on investment. Observability, however, seemed like an infrastructure investment that was harder to prioritize against features directly impacting our business."
CTO, 200-499 employees
"The single biggest investment we need is a unified monitoring and operations platform that standardizes deployment, observability, and incident response across all teams."
CTO, 1,000+ employees
The leaders in this study escaped fragmentation by changing the architecture, operating their stack as if it were one. Datadog LLM Observability is that integration layer: end-to-end tracing across prompts, models, retrieval, and infrastructure; real-time evaluation for hallucinations, latency, and cost; shared visibility across Engineering, ML, and Platform teams.
Conversational survey conducted with engineering leaders and AI/ML practitioners at organizations actively running large language models in production.