94% of teams have brought AI into their incident response workflow in some form. But across 103 SREs, platform leaders, and engineering executives, 44% say their IR hasn't kept pace with AI-accelerated delivery. New research on where AI in IR is delivering value today, where it's stalling, and the structural conditions that separate the two.
AI is now inside almost every incident workflow we surveyed: 94% of teams are using or trialing it. But adoption is not the same as readiness. 44% of respondents say their IR process hasn't kept pace with AI-accelerated delivery, and the gap is widest in the region most often assumed to be ahead: only 28% of AMER respondents say their IR has kept pace, vs. 67% in APAC.
Two structural conditions sit underneath these numbers. Half of teams run observability and incident management on completely separate platforms, and three in four say that disconnect slows their root cause analysis. The downstream cost is a loop that won't close: 93% of teams who clearly answered yes/no to recurrence said the same incident has happened more than once. Practitioners describe wanting AI to help close it, and they describe wanting fewer tools, more automation, and less alert noise to get there.
Before AI agents and copilots ever entered the incident workflow, the workflow itself was already a relay race: a handoff across half a dozen tools rather than a single connected pipeline. Half of teams operate observability and incident management on completely separate platforms, and a majority run five or more distinct tools across the incident lifecycle. This is the substrate AI is now being asked to operate on: a stack that looks comprehensive on paper, yet 77% of respondents say that keeping observability and incident management on separate platforms slows their root cause analysis.
Of respondents whose answers directly addressed platform architecture, half run observability and incident management on completely separate platforms.
54% use 5 or more tools across the incident lifecycle (28% with 5–6, 15% with 7–9, 11% with 10+).
Our workflow is a 5–7 tool relay race with high manual overhead. Alerting in Datadog or New Relic, paging through PagerDuty, coordination in Slack via incident.io, manual tab-hopping between Splunk and Datadog for investigation, Jira for tracking. The handoffs are where we lose time.
Share of all 103 respondents who named each tool anywhere in their interview.
The same tools mapped to the lifecycle role they play. 56% of teams who articulate their stack span 3 or more functional categories — that's where the boundary-crossing happens during a Sev-1.
Every tool boundary a responder crosses during an incident is a moment where signal can get lost, attention fractures, and time-to-resolution stretches. 77% of respondents told us separate observability and incident tools slow root cause analysis, and a majority of teams reach for five or more tools across the incident lifecycle. Tool sprawl looks like a budgeting line on the invoice, but it shows up as a latency problem during a Sev-1. As AI moves further inside the incident workflow, the same boundaries that slow human responders are the ones an AI assistant has to bridge.
During a Sev-1, today's reality is usually: alert fires in PagerDuty, metrics live in Datadog, logs live somewhere else, deploy history is in CI/CD, discussion happens in Slack.
A fragmented stack doesn't stay theoretical for long. Disconnected tools bend response times, cloud judgment in the first 15 minutes, and turn outages into business events. 77% of respondents said separate observability and incident tools slow root cause analysis, and the pattern holds across every region we surveyed.
The dollars aren't hypothetical either: nearly four out of five teams have already quantified what an outage costs them. The cost shows up in MTTR data, in on-call rotations that lean on resilience instead of process, and in a "first 15 minutes" that more responders describe as decoded in real time than as executed from a runbook. And as AI-driven delivery pushes more change into production across more teams, the seams in the stack are exactly where that cost is most likely to compound.
77% say yes (50% somewhat + 27% significantly). Only 16% claim no impact at all.
79% have quantified outage cost in dollar terms (38% specific estimates + 41% rough ballpark figures). Leadership now expects a number, not a feeling.
When asked open-ended how downtime impacts the business, respondents named these areas.
Only 40% described on-call as cleanly stable or supported. 48% flagged stress, fatigue, or mixed strain (39% mixed: stressful but manageable + 9% negative: morale strain), even where rotations are well-run.
Every region reports majority slowdown — though the small APAC universe (n=9 classifiable answers) makes that region's specific share less stable.
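To see why a base of nine classifiable answers makes a regional share less stable, a rough sketch: the snippet below computes a 95% Wilson score interval for a proportion. The counts used (7 of 9, and 35 of 45 for comparison) are hypothetical and chosen only for illustration; they are not figures from this study, and the Wilson interval is a standard statistical tool, not the method used in this report.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical counts: the same observed share (about 78%) on two base sizes.
small_lo, small_hi = wilson_interval(7, 9)    # small base, like n=9
large_lo, large_hi = wilson_interval(35, 45)  # larger base, same share

print(f"n=9:  {small_lo:.0%} to {small_hi:.0%}")
print(f"n=45: {large_lo:.0%} to {large_hi:.0%}")
```

With these illustrative counts, the n=9 interval spans roughly 45% to 94%, far wider than the n=45 interval, which is why a small regional base is best read as directional rather than precise.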
EMEA respondents are 16 percentage points more likely than AMER to name revenue loss as a primary business impact of major incidents.
The era of "we'll deal with downtime when it happens" is over. 56% of teams cite revenue loss as a direct impact, 56% point to customer trust, and 38% feel it in leadership confidence. The regional split sharpens the picture: 64% of EMEA respondents and 61% of APAC respondents named revenue loss as a top business impact, vs. 48% in AMER. Outside North America, finance leaders appear to have already wired downtime into the P&L conversation.
Engineering organizations are increasingly expected to put a number on MTTR. And as AI accelerates the rate at which change moves into production, the cost of incidents is positioned to land harder, faster, and on more visible parts of the business than it did even two years ago. That cost shows up first in the seams between every tool a responder has to cross.
It mainly hits customer experience first — users lose trust quickly. If it lasts, it can affect revenue and trigger escalation up to leadership because it becomes a reputational risk. Engineering feels it too, but the real pressure is business confidence and customer retention.
Revenue loss up to £50K per hour, customer complaints increase, leadership becomes anxious for the post-mortem.
AI is no longer a curiosity in incident response: 94% of teams are using or trialing it in some form. But adoption is not the same as readiness. 44% of respondents say their incident response process hasn't kept pace with AI-accelerated software delivery, and the readiness gap is widest in AMER. Only 28% of AMER respondents say their IR has kept pace, vs. 51% in EMEA and 67% in APAC. The pattern flips the typical "North America leads" assumption.
Where respondents do report AI delivering value inside IR, they point first to specific stages: triage and investigation. Those are the early minutes that decide everything. Where they describe AI stalling, the open-text answers cluster around two themes: trust in AI outputs (concerns about hallucinations and verification) and the practical problem of AI agents that can only see part of the picture. Both are easier to address when the data the AI is reasoning over already lives in one place.
94% are using or trialing AI in their IR workflow (59% actively using + 24% limited / early adoption + 11% piloting / planning). Adoption is near-universal but uneven.
44% report a pace gap between how fast software ships and how fast incidents get resolved (27% somewhat behind + 17% significant gap).
AI tools like copilots have sped up code velocity, leading to 20–30% more deployments weekly. This correlates with a similar rise in incidents — mostly P2/P3 config drifts or integration bugs from AI-generated code lacking edge-case handling.
Respondents pointed first to investigation and triage: the early minutes that decide everything.
APAC reports the highest readiness; AMER the lowest. The pattern flips the typical "North America leads" assumption.
The pace gap between software delivery and incident response is real (44% of respondents report it), and the regional pattern is striking: only 28% of AMER respondents say their IR has kept pace with AI delivery, vs. 51% in EMEA and 67% in APAC. The open-text answers around AI inside IR cluster around two themes: trust in AI outputs (concerns about hallucinations and the need for verification) and the practical limits of AI agents that can only see part of the picture during an incident.
Both barriers point at the same underlying condition: AI in IR works best when the data it reasons over already lives in one place. An assistant that has to stitch context across logs in one tool, traces in another, deploys in a third, and incident state in a fourth is doing harder work than the same assistant with unified context. That's not a panel finding — it's the structural reason buyers in this study are reaching so consistently for native integration.
Biggest concern is unreliable AI outputs — hallucinations in root cause suggestions could escalate incidents, like misfiring rollbacks on bad correlations. We need 99%+ confidence thresholds before autonomous actions.
Practitioners aren't waiting for the gap to close itself. Asked the open-ended question, "if you could change one thing about how your organization handles incident response, what would it be?", they overwhelmingly asked for fewer dashboards rather than more. Tool consolidation, automation, and noise reduction came back as the three most-cited wishes. Each of these maps to a different lever for getting more out of the AI teams have already adopted.
Nearly half are already actively evaluating switching their incident management platform. The buying intent is strongest in APAC: 72% of APAC respondents are weighing a switch, vs. 41% in AMER and 38% in EMEA. And among teams already weighing a switch, 80% say native integration with observability is "very important" or a "critical requirement." The follow-through gap reinforces the urgency: postmortem action items don't reliably ship, and incidents repeat — the structural loop teams say they're hoping AI can help them close.
Top themes from open-ended responses to "what would you change?"
48% are already evaluating or considering a switch to an alternative incident management platform in the next 12 months. Only 43% are firmly satisfied.
64% of teams complete fewer than three out of every four postmortem action items (5% complete only 0–25% + 11% complete 26–50% + 48% complete 51–75%). The follow-through gap leaves the loop open.
93% of teams who answered directly said yes. Incidents do recur because follow-up actions don't get completed. Only 7% said no.
Among teams already considering a platform switch, 80% say native integration is "very important" or a "critical requirement" (62% very important + 18% critical).
APAC is the most-active market by a wide margin: nearly twice the rate of EMEA.
When 24% of the most senior practitioners in the field volunteer "consolidate our tools" as their top wish, and another 24% say "more automation," the signal is hard to miss: teams are tired of stitching together best-of-breed tools that don't talk to each other. The follow-through data exposes the cost: 64% of teams complete fewer than 75% of their postmortem action items, and 93% of those who clearly answered yes/no said incidents do repeat.
Action items don't reliably survive the handoff between observability, paging, ticketing, and chat. Four out of five of the teams already weighing a switch say native integration is "very important" or a "critical requirement." For these buyers, native integration is at the top of the spec sheet, and Datadog reads that consistency as pointing in the same direction AI in IR is pushing: toward a unified surface where the data already lives together. In APAC, 72% are already weighing a switch, making it by far the most active region. The buyers are in motion. The question is which platform delivers the unified surface they're now actively asking for.
Datadog's incident management is built where the metrics, logs, traces, deploys, and team conversations already live. So when a Sev-1 fires, your responders are already on the signal — no six-tab scramble to find the truth. That's how AI in IR actually works: when the underlying data plane is unified, the assistant can finally see the whole picture.
Explore Datadog Incident Management
Research conducted via structured conversational interviews with 103 SREs, platform engineers, DevOps practitioners, and engineering leaders across cloud-native, hybrid, and migrating organizations. Respondents span three geographic regions: AMER (45%), EMEA (38%), and APAC (17%). The sample skews senior, with 91% in manager-level roles or above (40% VP/C-level, 48% Manager/Director, 11% Senior IC). The industry mix is heavily concentrated in SaaS & Technology (62%), with smaller representation from Retail / E-commerce (11%), Financial Services (6%), Healthcare (3%), and Media & Entertainment (2%); 17% are categorized as Other. Org-size skews mid-market, with 81% of respondents at companies of 500–5,000 employees.
All percentages are calculated as a share of unique respondents (not total mentions). Multi-mention questions, such as which business areas are impacted by major incidents and what teams would most like to change, may sum to more than 100% as respondents could cite multiple themes. All findings, including the regional and integration-status cross-tabs woven through this report, draw on the full 103-respondent sample.
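As an illustration of the counting rule described above, here is a minimal sketch of how a share of unique respondents can be computed for a multi-mention question. The respondent IDs and coded themes are hypothetical placeholders, not data from this study, and the function name is invented for the example.

```python
from collections import Counter

# Hypothetical coded answers: respondent id -> set of themes they mentioned.
# A set ensures each respondent counts at most once per theme.
responses = {
    "r1": {"revenue loss", "customer trust"},
    "r2": {"customer trust"},
    "r3": {"revenue loss", "customer trust", "leadership confidence"},
    "r4": set(),  # respondent who named none of the coded themes
}

def unique_respondent_share(responses):
    """Percent of respondents citing each theme (base = all respondents)."""
    n = len(responses)
    counts = Counter(theme for themes in responses.values() for theme in themes)
    return {theme: round(100 * count / n) for theme, count in counts.items()}

shares = unique_respondent_share(responses)
for theme, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{theme}: {pct}%")
print("sum of shares:", sum(shares.values()))
```

Because the denominator is the number of respondents rather than the number of mentions, the shares in this toy example add up to more than 100%, which is the behavior the note above describes for multi-mention questions.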