A new study of 27 senior infrastructure leaders finds that AI in observability has crossed the credibility threshold, but mostly as an alert engine. The teams running production want it to investigate, diagnose, and fix. Most of their tools cannot yet.
Adoption is no longer the question. 96% of respondents are at "Significant," "Mature," or "Growing" stages of AI in observability, and every team in the study is using AI for at least one workflow today. But only 37% trust AI to act beyond low-risk tasks. The capability gap is what these leaders raised again and again: they want AI that investigates, diagnoses, and fixes, not just flags and summarizes. That gap is what will decide where the next round of observability dollars goes.
96% of respondents are at "Significant," "Mature," or "Growing" stages of AI in observability. Only one team described themselves as "Early." But adoption clusters around the easiest, lowest-risk uses: pattern recognition in logs, anomaly detection, alert noise reduction. The harder work (root cause analysis across systems, correlating signals into a verdict, taking action without human review) is exactly where these leaders say their tooling falls short.
Log summarization is near-universal at 85%. Anomaly detection (63%) and alert correlation (52%) follow. Automated remediation and change impact analysis trail by more than 30 points.
Asked to pick the single workflow where AI delivers the biggest impact, 33% named root cause analysis. Anomaly detection (22%) and alert correlation (15%) round out the top three.
Asked in an open-ended question where AI falls short, leaders named four capabilities most often: end-to-end automation (30%), reliability and consistency (26%), accurate root cause analysis (22%), and business-context understanding (19%). Only 15% said there was no real gap.
The gap between AI's most-used capability (log summarization, 85%) and its most-impactful capability (root cause analysis, 33%) defines the shape of the problem in this dataset. Today's AI flags signals well, but respondents say it does not connect them. The capabilities respondents most often described as missing were end-to-end automation and accurate root cause analysis across systems: a tool that takes the signals AI already surfaces and produces a verdict an engineer can act on.
Until that capability exists, advances in alert noise reduction concentrate the human triage step rather than reduce it: AI gets faster at flagging, but the engineer is still the one who has to figure out why.
The gap is the distance between a tool that acts like a smoke detector and one that acts like a lead investigator. We have plenty of detectors; we need the investigator.
70% deliver AI through native platforms or a combination model. Native (built into the observability platform they already use) leads at 44%, with another 26% taking a combination approach. Only 22% rely primarily on bolt-on integrations. Just 7% built it in-house. The reasoning given by respondents is consistent across roles and regions: speed to value, reduced architectural complexity, and AI that already understands their telemetry.
Native AI built into the existing observability platform leads at 44%, with another 26% taking a combination approach. Just 7% of teams have built AI internally.
85% favor either a hybrid model or full customization. They want turnkey defaults they can extend or override. Only one respondent wanted out-of-the-box AI with predefined workflows. Only three said they had no strong preference.
Among the 41% of teams who say their monitoring has gaps or limited impact, the consequences are concrete: missed incidents, wasted engineer time, and false alarms that erode trust. Strong monitoring is the precondition for AI to be useful at all.
Three findings move together. 70% of teams choose native or combination delivery. 85% want hybrid or fully customizable AI. And 41% say monitoring quality has at least some impact on how their AI tools perform. The pattern points to one architectural preference: AI grounded in unified telemetry, with platform defaults that engineers can override. Bolt-on AI living outside the data plane is not where buyers are spending. Out-of-the-box AI with no customization is also not where they are spending.
For vendors, this is a positioning gate. The buyers signaling they are ready to spend more next year are not asking for more AI features. They are asking for AI that is already plugged into their telemetry and tunable to their environment.
Native AI tools are already integrated with our telemetry data, ensuring better security and more accurate insights without the high overhead of building and maintaining an internal LLM infrastructure.
Asked to walk through the most recent time AI fell short, whether it was a hallucinated root cause, a missed alert, or a false positive, leaders described real, quantifiable downstream cost. 59% named slower incident response and stretched MTTR. 41% pointed to wasted engineer time. 19% described customer-facing impact: checkout errors, two-hour service degradations, customer experience hits. These are not theoretical AI safety concerns. They are the receipts from the most recent incidents these teams worked through.
When forced to name the single hardest part, the answers cluster around four roughly equal frustrations: data quality, model drift, context and interpretability, and trust and accuracy. Notably, "data quality" and "signal-to-noise tuning" registered as distinct problems and were not grouped together by respondents.
From open-ended descriptions of recent AI-related incidents, six business-impact themes emerged. Slower incident response and wasted engineer time dominate. Customer-facing impact and trust erosion appeared often enough to be reportable.
The most consistent pattern across the incident stories: when AI is wrong, the response time gets longer, not shorter. One Director of Engineering described AI flagging a planned mitigation as a critical incident, pulling senior engineers into a 45-minute investigation of a non-issue. A false alarm pulls a team into a non-incident. A missed signal lets a real outage stretch into customer impact. Trust erosion sits underneath all of it. Three respondents described AI failures that had affected how willing their team was to rely on AI again, citing concerns like looking "incompetent" and ongoing engineering "stress."
Practitioners are now able to put hours, dollars, and customer impact against specific AI failures their teams have lived through. That concreteness is what makes the trust prerequisites in the next section so specific.
The AI hallucinated a root cause during a database spike — wasting hours of senior engineer time and delaying resolution, which directly impacted customer experience.
Asked which two qualities most matter for trusting AI more broadly, leaders gave a sharply differentiated answer. 70% named a proven track record of accurate, reliable outputs. Explainability followed at 52%. Consistent performance over time at 48%. Visibility into model behavior trailed at 26%. Regulatory oversight at 19%. The top two describe a single concept: show me the receipts.
Picking the top two prerequisites yields clear differentiation. Track record (70%) and explainability (52%) lead. Consistent performance over time follows at 48%. Visibility into model behavior and regulatory oversight are clearly secondary.
Despite high adoption, the autonomy bar is conservative. 63% limit AI to either recommendations only or low-risk well-defined actions such as auto-scaling and service restarts. Only 11% allow AI to operate autonomously across production workflows.
Splitting the same autonomy question by adoption stage exposes a sharp threshold. 100% of Early and Growing teams keep AI at "recommendations only" or "low-risk only." Among Significant and Mature teams, half have crossed into letting AI act on most operational tasks or operate autonomously. Earning the next bracket of autonomy is what differentiates an advanced AI-observability practice from a growing one.
The trust prerequisites question splits cleanly by maturity. Among Early and Growing teams, explainability leads at 71% — they want to see how AI thinks before they trust it. Among Significant and Mature teams, track record leads at 75% — they have lived with AI long enough to want evidence it works, not theory of how. The shift is the practitioner's journey from "I want to understand it" to "I want it to be right."
A second cut, by region, surfaces a directional signal worth flagging for the next fielding wave. 67% of EMEA respondents keep AI at low-risk actions only, compared to 29% in AMER. None of the six EMEA respondents allow AI to act independently on most operational tasks; 33% of AMER respondents do. With only six EMEA respondents in this round, this is a directional read rather than a finding, but the absolute zero on "act independently" warrants follow-up at scale.
Two questions answered together describe the entire trust ladder in this market. What earns trust? A proven track record, ahead of explainability. What does AI get to do today? Mostly recommendations and low-risk actions. The connective tissue is one word both groups used: proof. Several respondents proposed concrete metrics. "False positives under 5% across our monitoring stack, measured against known incidents we've manually validated." "A traceability log linking every AI action to specific telemetry markers." "99% success rate on automated low-risk remediations over six months without false positives."
The cross-cuts sharpen the picture. Earlier-stage teams want to understand AI; advanced teams want it to be right. Explainability dominates the trust list at 71% among Early and Growing teams; track record dominates at 75% among Significant and Mature teams. And the autonomy ceiling moves with maturity: not a single Early or Growing team has crossed past low-risk action, while half of Significant and Mature teams have.
In these respondents' own words, the answer to "how do we earn more autonomy?" runs through audit. The proof they describe ("this AI was right, this many times, in environments like ours, with this exact log") is what they say would unlock the next bracket of autonomy.
"Proof" would be a 99% success rate on automated low-risk remediations over a six-month period without any false positives.
Eighty-nine percent of respondents are increasing their AI-for-observability investment in the next year. 26% by more than 30%, another 63% by 10 to 30%. Asked where the dollars are going, the top answer was not "buy more AI tools." It was data quality, pipelines, and telemetry work to support AI, named by 52%. Training engineers (41%) and buying additional AI-capable observability platforms (37%) followed.
89% are increasing. 7% (two respondents) are holding steady or scaling back. One is too early to say. The direction is unambiguous. The question is what the money is being spent on.
Asked to pick the top two destinations, data quality, pipeline, and telemetry work leads at 52%. Training existing engineering staff is second at 41%. Vendor tools are third at 37%. Hiring AI and ML engineers, infrastructure costs, and internal AI tooling cluster together in a third tier.
Three drivers dominate the open-ended responses: executive and board interest (41%), pressure to do more with less (37%), and competitive pressure (30%). Specific incidents and reliability improvements are secondary motivators. Leadership-level interest is the most-cited driver in the open-ended responses.
The headline statistic, 89% increasing AI investment, is direction. What matters is where the money lands. The top two spending priorities (data quality at 52%, training at 41%) are not AI features at all. They are the foundation that makes AI work in the first place. Only 22% are spending on infrastructure and compute. Only 22% on building internal AI tooling. Only 19% on consolidating onto a single platform.
Leadership pressure shows up clearly. 41% of respondents cited executive or board-level interest as a top driver, ahead of cost-cutting (37%) and competitive pressure (30%). The dollars are going to the engineering substrate as often as to the AI ribbon on top of it. Vendors who can show their AI is grounded in clean telemetry, and who can help teams clean it up, are aligned with where these 27 buyers are spending next.
It's pressure to do more with less. We need AI to handle increasing system complexity and reduce the manual troubleshooting burden on our engineers.
Datadog Bits AI is built where your logs, metrics, traces, and incident data already live. That architectural choice is what 70% of respondents in this study chose for their own AI delivery, and what 85% said they prefer when given the option. When AI sits on a unified data plane, the leap from detector to investigator becomes a software problem rather than an architecture problem. Root cause across systems. Audit trails any senior engineer can review. Native, customizable, grounded.
Explore Datadog Bits AI →

Research conducted via conversational AI-led interviews with 27 senior infrastructure leaders (CTO, VP, Director-level and above) at companies of 500+ employees. Respondents were screened for direct involvement in observability budget approval and AI tooling decisions. The field instrument was a hybrid structured plus open-ended conversation guide containing 14 main questions and 13 follow-ups, all asked of every respondent regardless of branching logic.
The sample skews AMER (78%), with EMEA representation at 22% and no APAC respondents in this round. Industry composition is heavily concentrated in SaaS / Tech (89%), reflecting the panel quotas at recruitment for both supplier panels: Cint panelists were sourced primarily from Information Technology and Computer Software, and Pure Spectrum panelists were screened to "Science / Technology / Programming." Healthcare, Financial Services, and Retail are each represented by a single respondent. Vertical breadth is a known limitation and is being addressed in the next fielding wave with explicit quotas across additional industries.

All percentages are calculated as a share of the 27 unique respondents (not total mentions). Multi-mention questions, including workflows in use, business-impact themes, and investment destinations, may sum to more than 100% as respondents could cite multiple categories. Open-ended responses were coded into themes by a single analyst using pre-defined keyword categories fixed before coding began; each respondent counted once per primary theme.

Quotes are verbatim from respondents who passed both screener and post-fielding fraud review (42 of 69 completers were excluded for low-effort, off-topic, templated-AI, or paste-artifact responses; only the remaining 27 contribute to this report).
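For readers who want the counting convention made explicit, a minimal sketch follows; the respondent IDs and theme labels are hypothetical, not drawn from the survey file. A respondent who mentions the same theme twice counts once, and every percentage is taken over the 27 respondents rather than over total mentions, which is why multi-mention questions can exceed 100% in aggregate.

```python
from collections import Counter

N_RESPONDENTS = 27

# Hypothetical coded open-ends: respondent ID -> themes mentioned (duplicates possible).
coded = {
    "r01": ["slower_response", "wasted_time", "wasted_time"],
    "r02": ["customer_impact"],
    "r03": ["slower_response", "customer_impact"],
}

# Each respondent counts at most once per theme, so a multi-mention question can sum past 100%.
theme_counts = Counter(theme for themes in coded.values() for theme in set(themes))
percentages = {theme: round(100 * count / N_RESPONDENTS) for theme, count in theme_counts.items()}
print(percentages)  # e.g. {'slower_response': 7, 'wasted_time': 4, 'customer_impact': 7} (order may vary)
```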
Questions about data quality and signal-to-noise tuning are reported as distinct categories rather than grouped together, reflecting feedback from the prior round. The single-select trust prerequisites question was changed to a top-2 multi-select to surface meaningful differentiation; the prior round's flat distribution would not have supported a clear narrative.