A field study of 140 engineering leaders running LLMs in production reveals an unexpected pattern: the more AI you ship, the worse your observability gets. And the line that separates leaders from laggards isn't tool count — it's integration.
The intuitive story would be that fragmentation is an early-stage problem — something teams sort out as their AI practice matures. The data tells the opposite story. Each new LLM application brings its own logging, its own tracing, its own evaluation tooling. The stack grows. The integration burden compounds. By the time a team has more than a handful of apps in production, they're running a substantially larger and more disconnected tool set than peers earlier in the journey.
Distribution of tool counts among the 117 respondents who specified a number of tools (in either digit or word form). Sums to 100%. 82% are running 3 or more tools.
Share of all 140 respondents describing their tooling state. Sums to 100%.
Themes named when respondents described their monitoring's strengths and gaps. The pattern reveals a structural mismatch: surface operational signals are caught; AI-specific signals are systematically missed. Percentages don't sum to 100%.
Self-reported maturity of practices around running AI in production. Each respondent placed in one category. Sums to 100% (n=140).
Fragmentation is the maturity tax of production AI — and the gaps multiply. The current playbook (best-of-breed tool for each layer, connect them later) faces two compounding problems. First, each individual tool is missing things teams say matter most — root cause, subtle drift, hallucination detection, real-time alerting. Second, the seams between those tools mean even the signals that are caught don't reach the right team in time. By the time the cost is visible — in incident response time, in engineer attention, in cross-team friction — the stack is already entrenched and the integration debt is high. The teams that solve this don't solve it by buying fewer tools. They solve it by changing the philosophy: integrated by default, instead of integrated as an afterthought.
"The real cost of tool fragmentation isn't just the subscription fees; it's a context-switching tax that slows down your entire engineering cycle. In a fragmented environment, debugging feels less like engineering and more like private investigation."
— CTO, 500-999 employees
"We are using about 3 to 5 different tools right now. Unfortunately they don't fully connect to each other. We have to stitch things together manually using custom scripts."
— AI Engineer, 500-999 employees
Fragmentation doesn't extract its cost evenly across the AI stack. It collects at the seams — wherever the stack has a boundary that engineers, teams, or regions have to bridge by hand. The cost is real, but most of it never appears on a budget line. It shows up as engineer attention, debug time, and cross-team friction — categories that don't get reported up to leadership and don't get reflected in tooling spend. Three seams stand out in the data: the seam between tools (where engineers do the correlation), the seam between teams (where ownership of failures gets contested), and the seam between regions (where the same problem manifests at very different intensities).
First-action themes when something breaks. Most respondents named multiple steps — checking logs, checking inputs, reproducing the issue, running evals — spread across different tools and signal sources. The multi-step nature of the response is the fragmentation: every step is a different dashboard, often owned by a different team. That's what produces the time tax shown below. Percentages don't sum to 100%.
Themes mentioned when describing the cost of fragmentation. Respondents discussed multiple themes in a single open-ended answer, so percentages reflect the share of all 140 respondents who mentioned each theme and do not sum to 100%.
Longest time unit referenced when describing debug-to-fix time. Sums to 100%. Roughly half take days or weeks.
Share of all 140 respondents describing each pattern in cross-team collaboration. Each respondent placed in one category. Sums to 100%. Two-thirds report breakdowns at handoffs.
Each seam is a place where fragmentation gets converted into operational drag. Between tools, it shows up as time: 64% report wasted time and slow debugging as the chief cost of fragmentation, and roughly half of those who specified a duration report debug cycles running into days or weeks. Between teams, it shows up as ownership ambiguity: 67% report collaboration breakdowns at production handoffs, with ML, engineering, and platform teams each looking at different dashboards and disagreeing on what "failure" even means. Between regions, it shows up as intensity: the same debug-to-fix cycle stretches to very different lengths depending on geography. The mechanism underneath is the same: the AI stack has more boundaries than the observability stack has bridges.
Share of each region reporting their typical longest debug-to-fix duration. Each respondent placed in one bucket. Sums to 100% within each region.
"Engineers have to jump across multiple tools to reconstruct what happened end to end. That slows incident response and makes root cause analysis mostly manual correlation. It increases cognitive load because no single system shows the full picture."
— CTO, 200-499 employees
"Collaboration is generally solid at launch and during incidents, but it breaks down around ownership of quality issues and end-to-end debugging responsibility. The main gap is misalignment in how ML, engineering, and product interpret what 'failure' actually means."
— CTO, 200-499 employees
When respondents were asked to position themselves against their peers, a striking pattern emerged. Self-described leaders aren't running smaller tool stacks than the rest of the sample. They're running similar numbers of tools, but they've integrated them. Across every dimension we measured, the "ahead of the curve" cohort reports dramatically lower rates of fragmentation, collaboration breakdown, and slow debugging. The dividing line between leaders and laggards isn't what they bought. It's how they connected what they bought.
Self-positioning across the full sample. Each respondent placed in one bucket based on their answer. Sums to 100% (n=140).
The distribution itself isn't surprising — most teams place themselves somewhere on the scale. The question that matters is whether self-positioning actually tracks with operational reality. The cross-tab below answers that: it does, and the gap is dramatic.
Asked "What's one thing you're doing that you think most teams aren't?", two CTOs from "ahead of the curve" teams gave the most concrete operational practices in the entire dataset. Here are their answers, in their own words.
"We version and test our prompts like we version code. What we do differently are prompts registry with Git-style history, pre-production prompt test, canary rollouts."
— CTO, EMEA
The practices, extracted: a prompt registry with Git-style version history, pre-production prompt tests that gate changes before they ship, and canary rollouts for new prompt versions.
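To make the first of those concrete, here is a minimal sketch of what a prompt registry with versioned history and a canary rollout gate can look like. Every name in it (PromptRegistry, canary_fraction, and so on) is an illustrative assumption, not a detail of the respondent's actual stack.

```python
import hashlib
import random
from dataclasses import dataclass, field

@dataclass
class PromptVersion:
    """One immutable prompt revision, identified by a short content hash (Git-style)."""
    text: str
    version_id: str

@dataclass
class PromptRegistry:
    """Hypothetical registry: append-only history plus a canary pointer."""
    history: list[PromptVersion] = field(default_factory=list)
    stable_idx: int = -1           # index of the fully rolled-out version
    canary_idx: int | None = None  # index of the version under canary rollout
    canary_fraction: float = 0.0   # share of traffic routed to the canary

    def commit(self, text: str) -> PromptVersion:
        """Record a new prompt revision without serving it yet."""
        version = PromptVersion(text, hashlib.sha256(text.encode()).hexdigest()[:8])
        self.history.append(version)
        return version

    def start_canary(self, version: PromptVersion, fraction: float = 0.05) -> None:
        """Route a small slice of traffic to the new revision."""
        self.canary_idx = self.history.index(version)
        self.canary_fraction = fraction

    def promote_canary(self) -> None:
        """Make the canary the stable version once pre-production and canary checks pass."""
        self.stable_idx = self.canary_idx
        self.canary_idx, self.canary_fraction = None, 0.0

    def prompt_for_request(self) -> PromptVersion:
        """Pick the prompt for one request: the canary slice if active, otherwise stable."""
        if self.canary_idx is not None and random.random() < self.canary_fraction:
            return self.history[self.canary_idx]
        return self.history[self.stable_idx]

# Usage: ship v1, then canary a revised prompt on 5% of traffic before promoting it.
registry = PromptRegistry()
registry.commit("Summarize the ticket in two sentences.")
registry.stable_idx = 0
v2 = registry.commit("Summarize the ticket in two sentences, citing the customer's own words.")
registry.start_canary(v2, fraction=0.05)
```

The point of the sketch is the gate, not the data structure: a new prompt version exists in history before it serves any traffic, and promotion only happens after the offline tests and the canary slice both look healthy.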
"We have democratized AI deployment to the point where non-technical departments can launch their own production tools using our centralized templates and guardrails. We are also running automated LLM-as-a-judge evaluations across hundreds of disparate use cases simultaneously."
— CTO, APAC
The practices, extracted: centralized templates and guardrails that let non-technical departments launch production tools safely, and automated LLM-as-a-judge evaluations running across hundreds of disparate use cases simultaneously.
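The second practice also lends itself to a small sketch. The example below shows one way an LLM-as-a-judge check can be wired up, assuming an OpenAI-style chat client; the judge model, rubric, and sampled traffic are stand-ins for illustration rather than the respondent's actual setup.

```python
# LLM-as-a-judge sketch: a second model grades each sampled production answer
# against a rubric and returns a pass/fail verdict with a one-sentence reason.
# The client, model name, and rubric are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Return JSON: {{"verdict": "pass" or "fail", "reason": "<one sentence>"}}
Fail the answer if it evades the question or makes claims the question's context does not support."""

def judge(question: str, answer: str) -> dict:
    """Ask the judge model to grade one (question, answer) pair."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Scaling to "hundreds of disparate use cases" is then a matter of running the same
# judge over sampled traffic from each application.
sampled = [("What is our refund window?", "Refunds are accepted within 30 days of purchase.")]
verdicts = [judge(q, a) for q, a in sampled]
```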
Themes named as the biggest investment or change. Respondents discussed multiple themes, so percentages reflect the share of all 140 respondents who mentioned each theme and do not sum to 100%.
Themes named when asked what unified observability would change. Respondents discussed multiple themes, so percentages reflect the share of all 140 respondents who mentioned each theme and do not sum to 100%.
The market diagnosis is clear: 48% of teams plan to invest in observability and monitoring next year, and 44% specifically named a unified or consolidated platform as their top priority. Where teams differ is in execution. The "ahead of the curve" cohort isn't waiting for the perfect platform — they're already operating their stack as if it were one, with shared traces, common metrics, and clear cross-team ownership of LLM behavior. The path forward for the rest of the market is the same: stop treating the integration layer as something to figure out later. It is the work.
"Bandwidth allocation has been crucial for our team. We prioritized demonstrating value and delivering new use cases, which yielded the best return on investment. Observability, however, seemed like an infrastructure investment that was harder to prioritize against features directly impacting our business."
— CTO, 200-499 employees
"The single biggest investment we need is a unified monitoring and operations platform that standardizes deployment, observability, and incident response across all teams."
— CTO, 1000+ employees
The leaders in this study didn't escape fragmentation by buying differently. They escaped it by changing the architecture — operating their stack as if it were one. Datadog LLM Observability is that integration layer. End-to-end tracing across prompts, models, retrieval, and infrastructure. Real-time evaluation for hallucinations, latency, and cost. Shared visibility across Engineering, ML, and Platform teams — so debugging stops feeling like detective work and the seams in your stack stop costing you days.
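As a rough illustration of what that integration layer looks like from the application side, the sketch below enables LLM Observability in the ddtrace Python SDK and decorates the retrieval, generation, and workflow steps so each request produces a single end-to-end trace. The decorator parameters and configuration shown are assumptions for illustration; verify them against the current documentation before use.

```python
# Sketch: instrumenting an LLM workflow with Datadog's ddtrace LLM Observability SDK.
# Assumes a configured Datadog Agent or agentless setup; parameter names here are
# illustrative and should be checked against the current ddtrace docs.
from ddtrace.llmobs import LLMObs
from ddtrace.llmobs.decorators import workflow, llm, retrieval

LLMObs.enable(ml_app="support-assistant")  # ml_app groups traces for one application

@retrieval(name="fetch_context")
def fetch_context(question: str) -> list[str]:
    # Retrieval step: appears as its own span inside the workflow trace.
    return ["Refunds are accepted within 30 days of purchase."]

@llm(model_name="gpt-4o-mini", model_provider="openai")
def answer(question: str, context: list[str]) -> str:
    # Stand-in for the real model call; in production this span carries the
    # latency, token, and error signals for the generation step.
    return "Refunds are accepted within 30 days of purchase."

@workflow(name="answer_question")
def handle(question: str) -> str:
    # One workflow span ties retrieval and generation into a single trace
    # that Engineering, ML, and Platform teams can all read.
    return answer(question, fetch_context(question))

handle("What is our refund window?")
```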
Conversational survey conducted with engineering leaders and AI/ML practitioners at organizations actively running large language models in production.