Datadog · Hybrid Infra Study
Research Report ● June 2026 ● First Wave · 25 Senior Infrastructure Leaders

The Hybrid Infrastructure Observability Report

We asked senior infrastructure leaders at large organizations how they actually monitor on-prem and cloud together, what fragmentation costs them, and where on-prem is heading. On-prem is not shrinking. The tools built to watch it have not kept up.

Draft · First wave N=25 · fielding continues
92% say an engineer switches between two or more separate tools to investigate a single incident
60% of teams spend at least half their time on reactive firefighting rather than proactive work
68% are building new on-prem workloads or repatriating cloud workloads back on-prem (actively or considering)
The Headline

Leaders say they have a single view of their environment. Then they describe an incident where an engineer jumped between four consoles and matched timestamps by hand. On-prem is growing, its workloads are business-critical, and the monitoring tools around it were built for a simpler map.

01 The Hybrid Landscape

Everyone Runs Hybrid, With a Stack That Grew by Accretion

These teams do not run one kind of tool. Most layer legacy on-prem-native tools, cloud-native tools stretched to cover on-prem, and open source, all at once. The result is a monitoring estate that most teams did not design so much as accumulate.

Survey question: "Which of the following are part of your on-prem monitoring setup? Select all that apply." (Multi-select, percentages can exceed 100%)
Four Kinds of Tooling, Layered
% of respondents including each in their on-prem setup, n = 25
Survey question: "How many separate monitoring or observability tools does your team use across your on-prem and cloud environments?"
Most Teams Run Six or More Monitoring Tools
% of respondents, single-select, n = 25
So What

60% of teams run six or more monitoring tools across their hybrid estate. That number sets up everything that follows. A stack assembled piece by piece over years is a stack no one deliberately integrated, and the cost of that shows up the moment an incident crosses the on-prem and cloud boundary.

02 The Cost of Fragmentation

The Confidence Gap: They Say Single View, They Live Tool-Switching

Here is the tension at the center of this study. About three-quarters of leaders (76%) rate their single-platform visibility an 8 or higher out of 10. Yet 92% of all respondents say an engineer switches between two or more tools to work a single incident. Confidence and daily reality are pulling in opposite directions.

Survey question: "How much do you agree: 'My team has a complete view of application, infrastructure, network, and security telemetry from a single platform.' Rate 1 to 10." (Banded)
Stated Confidence Is High
% of respondents by agreement band, n = 25
Survey question: "When investigating an incident, how many separate tools does an engineer typically switch between?"
Daily Reality Is Tool-Switching
% of respondents, single-select, n = 25
Survey question: "What are the biggest reasons you're struggling to consolidate your monitoring tools? Rank your top 3." (% ranking each in their top 3)
Why the Stack Never Gets Consolidated
% of respondents ranking each barrier in their top 3, n = 25
So What

The story is not that leaders admit they lack a single view. Most say they have one. The story is that the same leaders describe an incident workflow that hops across consoles and correlates signals by hand. Programs are graded on the dashboard leaders see, not on the path an engineer walks at 2 a.m. Cost of switching, not budget or lock-in, is the barrier that keeps the stack fragmented.

"Correlating OT and cloud logs consumed most of the time. We had to manually match timestamps from different sources and chase ownership across teams, turning a simple fix into an all-day affair."

C-level technical leader, Manufacturing

"Better correlation. We spend too much time going between dashboards right now."

Engineering Manager, Telecom
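
The hand-matching of timestamps described in the quotes above can be pictured as a windowed join between two event streams. What follows is a minimal sketch only, assuming both streams already share a synchronized UTC clock. The function name `correlate`, the 5-second window, and the sample events are illustrative assumptions; none of them come from the survey data or from any Datadog API.

```python
from bisect import bisect_left
from datetime import datetime, timedelta, timezone


def correlate(onprem, cloud, window=timedelta(seconds=5)):
    """Pair each on-prem event with every cloud event within `window` of it.

    Both inputs are lists of (timestamp, message) tuples.
    """
    cloud = sorted(cloud)                      # order cloud events by timestamp
    cloud_times = [ts for ts, _ in cloud]
    pairs = []
    for ts, msg in onprem:
        # Jump to the first cloud event that is not earlier than ts - window.
        start = bisect_left(cloud_times, ts - window)
        for c_ts, c_msg in cloud[start:]:
            if c_ts > ts + window:             # past the window: stop scanning
                break
            pairs.append((msg, c_msg))
    return pairs


# Hypothetical events, for illustration only.
t0 = datetime(2026, 6, 1, 2, 0, 0, tzinfo=timezone.utc)
onprem_events = [
    (t0, "switch port flap"),
    (t0 + timedelta(seconds=60), "disk latency"),
]
cloud_events = [
    (t0 + timedelta(seconds=2), "API 5xx spike"),
    (t0 + timedelta(seconds=30), "autoscale"),
    (t0 + timedelta(seconds=63), "queue backlog"),
]

pairs = correlate(onprem_events, cloud_events)
print(pairs)
```

In practice the hard part is rarely the join itself. The quotes point to clock drift, access, and ownership as the real delays, and this sketch addresses none of them.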
03 Incident Speed

Cross-Environment Incidents Run Long

When an incident spans on-prem and cloud, the clock is unforgiving. Nearly half of teams need an hour or more just to detect it, and once they do, not a single team in this wave resolves it in under an hour. The slow part is almost always the same: stitching on-prem and cloud signals together by hand.

Survey question: "When a production incident spans both on-prem and cloud, how long does it typically take your team to detect it?"
Time to Detect (MTTD)
% of respondents, single-select, n = 25
Survey question: "Once detected, how long does it typically take to resolve a cross-environment incident?"
Time to Resolve (MTTR)
% of respondents, single-select, n = 25
Survey question: "When you're trying to identify root cause in a hybrid environment, what's the single biggest bottleneck?"
What Slows Root-Cause Analysis Down
% of respondents, single-select, n = 25
So What

48% take an hour or more to detect a hybrid incident, and 36% take four hours or more to resolve one. Those hours are not spent fixing. They are spent finding: correlating logs across environments, chasing which team owns the problem, and confirming access before anyone can even start. Detection and correlation, not the fix itself, are where the time goes.

"The biggest slowdown was getting the right on-prem and cloud access and logs, so we spent too long blocked by permission and network checks before we could confirm the real cause."

Engineering Manager, Telecom

"We had an outage. What slowed us down is lack of end-to-end visibility."

C-level technical leader, Retail
04 Reactive vs. Proactive

Half the Team's Time Goes to Firefighting

Ask these teams how their time actually splits and the answer is close to even: 49% reactive, 51% proactive on average. That balance sounds healthy until you look at the spread. Six in ten teams spend half their time or more just reacting, which is time not spent preventing the next incident.

Survey question: "Think about how your team's time is actually spent. Out of 100, how would you split it between reactive incident response and proactive infrastructure management?" (Team-time share)
How Reactive Are Teams, Really?
% of respondents by share of time spent reactive, n = 25

"Urgent issues keep arriving faster than we can plan, priorities shift constantly, and proactive work gets deprioritized."

So What

When leaders explain what keeps them reactive, they name the same forces: alert fatigue across too many consoles, thin headcount, and not enough automation. Every hour spent triaging false positives in one more tool is an hour not spent on the prevention work that would shrink the incident load in the first place.

"Tool sprawl and alert fatigue. We spend hours triaging false positives across different consoles rather than hunting threats or improving detection logic."

C-level technical leader, Manufacturing

"The simple inability to have a service that can help us act proactively and can be used across all stakeholders."

VP / Director of Engineering, Manufacturing
05 Why On-Prem Persists

On-Prem Is Not a Legacy Footnote. It Is Growing.

The assumption that everything migrates to cloud does not survive contact with this data. Most of these leaders plan to keep critical workloads on-prem, and a clear majority are building new on-prem workloads or pulling cloud workloads back, or are considering doing so. Security and compliance, not inertia, lead the reasons why.

Survey question: "Rank your top 3 reasons your organization continues to run workloads on-prem rather than migrating fully to cloud." (% ranking each in their top 3)
Why Workloads Stay On-Prem
% of respondents ranking each reason in their top 3, n = 25
56% plan to keep critical workloads on-prem indefinitely or for the foreseeable future
68% are building new on-prem workloads or repatriating cloud workloads (actively or considering)
64% say 41% or more of their on-prem workloads are business-critical
So What

On-prem is where the business-critical, security-sensitive workloads live, and it is expanding, not winding down. Any observability strategy that treats on-prem as a shrinking afterthought is planning for the wrong future. These teams need on-prem monitoring to be a first-class citizen, not a bolt-on to a cloud-native tool.

06 The AI / GPU Gap

AI Infrastructure Is a Priority. The Tooling Isn't Ready.

The clearest want-versus-have gap in this wave is about AI. 84% of leaders say monitoring GPU and AI infrastructure will be very important or critical to their strategy over the next two years. Only a third say their tools fully cover it today.

Survey question: "How important will monitoring GPU and AI infrastructure be to your observability strategy over the next 2 years?"
How Important AI/GPU Monitoring Will Be
% of respondents, single-select, n = 25
Survey question: "Does your current toolset support monitoring GPU and AI infrastructure today?"
Whether Tools Support It Today
% of respondents, single-select, n = 25
So What

Only 32% say their toolset fully supports GPU and AI monitoring, against 84% who call it very important or critical within two years. That is a 52-point gap between where these teams are heading and where their tools are today, and it is opening on the exact on-prem infrastructure they are choosing to expand.
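
The gap arithmetic can be checked against the base of 25. This is a minimal sketch, assuming each reported percentage is a whole number of respondents divided by 25; the helper name `to_count` is illustrative and not part of the survey tooling.

```python
N = 25  # first-wave base reported in this study


def to_count(pct: int, base: int = N) -> int:
    """Convert a reported whole-number percentage to a respondent count."""
    count, remainder = divmod(pct * base, 100)
    # With a base of 25 each respondent is worth 4 points, so the
    # remainder should be zero for any figure taken from this wave.
    if remainder:
        raise ValueError(f"{pct}% is not a whole number of respondents out of {base}")
    return count


important_pct = 84  # very important or critical within two years
supported_pct = 32  # toolset fully supports it today

gap_points = important_pct - supported_pct
print(gap_points)               # 52
print(to_count(important_pct))  # 21
print(to_count(supported_pct))  # 8
```

On this reading, the 52-point gap is the difference between 21 respondents and 8 respondents out of 25.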

"Lack of native AI and GPU workload visibility for our on-prem banking environments."

Engineering Manager, Financial Services

"Most on-prem monitoring tools mainly show what happened, but you actually need them to reliably predict what will break next and drive actions before users are impacted."

Engineering Manager, Telecom
What To Do About It

One place to watch on-prem and cloud together.

The gap is not confidence. Leaders already believe in their programs. The gap is what an engineer actually walks through during an incident: six tools, hand-matched timestamps, and an hour lost before the fix even starts. On-prem is growing and going AI-heavy, and the tooling around it has not kept pace. Datadog brings on-prem and cloud telemetry into a single platform, so detection, correlation, and root cause happen in one place instead of across six.


Methodology

This is a first-wave read of an ongoing study. We surveyed 25 verified senior infrastructure leaders at organizations of 1,000 or more employees, all of whom directly manage or share responsibility for on-premises infrastructure and hold final say or significant influence over observability tooling decisions. Fieldwork is continuing toward a larger sample; the figures here are directional and intended to confirm that the survey is producing promotable findings, not to serve as final published numbers.

Percentages reflect unique respondents who selected each option. Multi-select and ranking questions can sum above 100% and are labeled accordingly. Single-select breakdowns sum to 100% within rounding. One additional response was received but excluded as promotional, leaving a clean base of 25. All numbers in this report trace to the structured survey data.
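
The counting rule above can be sketched in a few lines. The responses below are hypothetical and are not survey data; the function name `option_shares` is likewise an assumption made for illustration. Each respondent is stored as a set, so a respondent counts at most once per option, and multi-select shares can total more than 100%.

```python
from collections import Counter

# Hypothetical multi-select answers from four respondents (illustrative only).
responses = [
    {"legacy", "open_source"},
    {"legacy", "cloud_native", "open_source"},
    {"cloud_native"},
    {"legacy", "cloud_native"},
]


def option_shares(responses):
    """Return the share of respondents (in %) who selected each option."""
    base = len(responses)
    counts = Counter(option for selected in responses for option in selected)
    return {option: round(100 * n / base) for option, n in counts.items()}


shares = option_shares(responses)
total = sum(shares.values())
print(shares)
print(total)  # 200: multi-select shares are not expected to sum to 100
```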

25 · Verified senior infrastructure leaders
1,000+ · Employees per organization (screened)
24 · Survey questions analyzed
June 2026 · Fieldwork period (wave 1)
Respondent Roles
% of respondents, n = 25
Industries Represented
% of respondents, n = 25

Note: with a first-wave base of 25, all figures are directional and subject to change as fielding continues. Segment-level cross-tabs (by industry and seniority) are available in the companion Data Appendix, where segment bases under 30 are flagged as directional.
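
To give a sense of what "directional" means at this base size, the sketch below computes a 95% Wilson score interval for a single proportion. The source does not report confidence intervals. This illustration assumes a simple random sample, which a screened panel of 25 may not be, and the function name `wilson_interval` is chosen here for clarity.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 15 of 25 respondents corresponds to a reported 60%.
low, high = wilson_interval(15, 25)
print(f"{low:.0%} to {high:.0%}")
```

Under these assumptions, a reported 60% is compatible with a true value anywhere from roughly 41% to 77%, which is why the first-wave figures are treated as directional.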

Headline Map  //  Hybrid Infrastructure Observability Study

How the headline targets map to the first wave

The four headlines the research plan worked backward from, measured against the first N=25 verified completes (senior infrastructure leaders at 1,000+ employee organizations who manage on-prem and influence tooling). A status label on each headline marks whether the early data holds the headline up, supports it once reframed, or points the other way. The purpose of this read is one decision: are the target headlines landing well enough to keep fielding the survey as written? The short answer is yes. Every figure traces to the structured survey data.

4 Holds
1 Reframed
1 Contradicted
6 Headlines mapped

The research plan named four desired headlines: on-prem persistence, tool fragmentation with no single view, time lost to firefighting, and incident detection/resolution speed across hybrid environments. On the four target themes, three hold outright and one needs a reframe, but every one of them has a defensible, promotable number already in the first 25 responses. The one caution is the specific "no single source of truth" framing, which the data currently contradicts as literally worded (most leaders do say they have a single-platform view). The operational fragmentation underneath it is real and strong, so the fix is a wording change, not a survey change. On top of the four, the data surfaced a fifth theme worth a section, the GPU/AI monitoring gap, which is the cleanest emergent story in the set.

HOLDS — The early data confirms the headline as written.
REFRAMED — The data supports the finding once the framing is adjusted.
CONTRADICTED — The data points the other way as literally worded.
Planned headlines — the four from the research plan
H1 · ON-PREM PERSISTENCE · On-prem is not going away, and much of it is growing · HOLDS
Headline intended

"72% of organizations expect AI infrastructure to increase their on-prem footprint over the next 3 years." The persistence-of-on-prem story, framed forward.

Data says (N=25)

68% are building new on-prem workloads or repatriating cloud workloads (actively or considering) in the next few years (Q6.3/Q6.4). 56% keep critical workloads on-prem indefinitely or for the foreseeable future (Q6.2); 64% say 41%+ of on-prem workloads are business-critical (Q6.5).
Disposition

Holds on the persistence-and-growth thesis. The exact "72% / AI footprint" number is not yet directly measured, so lead on the defensible version: roughly two-thirds are expanding on-prem (new builds or repatriation), not retreating from it. Pair with the business-critical share for weight.

§05 Why on-prem persists
H2 · FRAGMENTATION (OPERATIONAL) · Engineers juggle tools on nearly every incident · HOLDS
Headline intended

"Nearly half of engineering teams use 5+ tools to monitor hybrid infrastructure," and "63% say tool fragmentation is a top cause of delayed incident resolution." The cost-of-fragmentation story.

Data says (N=25)

92% of respondents say an engineer switches between 2 or more separate tools to investigate a single incident (Q2.2); 60% run 6 or more monitoring tools across on-prem and cloud (Q1.3); 64% rank cost of switching among their top-3 barriers to consolidating (Q2.5).
Disposition

The strongest fragmentation stat in the set. The intended "5+ tools" claim is close to what the survey measured (60% run 6 or more). The tool-switching number is the cleaner headline: in nearly every organization, engineers hop between consoles during an incident. Lead with the 92% switch rate and support with the 6+ tool count.

§02 The cost of fragmentation
H3 · FIREFIGHTING · Most teams spend half their time reacting · HOLDS
Headline intended

The reactive-vs-proactive split: what share of an infrastructure team's time is consumed by firefighting rather than proactive management.

Data says (N=25)

60% of teams spend at least half their time on reactive incident response rather than proactive work (Q4.1). The average team splits its time 49% reactive / 51% proactive, essentially a coin flip between firefighting and planning.
Disposition

Holds. The cleanest framing is "6 in 10 infrastructure teams spend half their time or more firefighting." The near-even average is itself a story: the average team gives up about half its capacity to reactive work. Stands as written.

§04 Reactive vs. proactive
H4 · INCIDENT SPEED (MTTD/MTTR) · Cross-environment incidents run long · REFRAMED
Headline intended

"58% of engineering teams say incidents take longer to resolve when they span on-prem and cloud environments." The MTTD/MTTR story.

Data says (N=25)

48% take an hour or more just to detect a cross-environment incident (Q3.1), and 36% take 4 hours or more to resolve one once detected (Q3.2). No team resolves a hybrid incident in under an hour.
Disposition

Holds directionally, but the survey measures absolute detect/resolve times rather than a direct "longer because it spans both" comparison, so the "58% say it takes longer" framing is not yet a clean read. Reframe to the concrete durations: nearly half take an hour+ to even detect, and not one resolves in under an hour. The Q3.3 verbatims (correlating on-prem and cloud logs by hand) supply the "why."

§03 Incident speed
Watch item — the one framing to change (not a survey change)
H2b · "NO SINGLE SOURCE OF TRUTH" · Leaders say they do have a single view · CONTRADICTED
Headline intended

"67% of teams say they lack a single source of truth across cloud and on-prem systems." The perception-gap version of the fragmentation story.

Data says (N=25)

76% rate their agreement with "my team has a complete single-platform view of all telemetry" an 8, 9, or 10 out of 10 (Q2.1, mean 8.1). Only 12% score it 5 or below. As literally worded, the "lack a single source of truth" headline runs backward.
Disposition

This is the one to watch. Leaders believe they have a single view, yet 92% say their engineers still switch tools during a typical incident (H2), so the real story is the gap between stated confidence and operational reality, not a self-reported lack. Do not run "67% lack a single source of truth." Run the confidence-versus-behavior contrast instead. No survey change needed; the instrument is capturing the tension cleanly.

§02 The cost of fragmentation
Emergent headline — what the data surfaced on top
H5 · GPU / AI MONITORING GAP · AI infrastructure matters, tooling isn't ready · HOLDS
Headline available

The forward-looking gap: GPU and AI infrastructure monitoring is becoming a priority faster than current tools can cover it.

Data says (N=25)

84% say monitoring GPU and AI infrastructure will be very important or critical over the next 2 years (Q7.1), yet only 32% say their current toolset fully supports it today (Q7.2). The other 68% report partial support, no support, or are unsure.
Disposition

The cleanest want-versus-have gap in the study and a natural bridge to Datadog's AI/GPU observability story. Strong enough to carry its own section. Stands.

§06 The AI/GPU gap
DATADOG  //  HYBRID INFRASTRUCTURE OBSERVABILITY  //  HEADLINE MAP  //  N=25 FIRST WAVE  //  JUNE 2026  //  ALL FIGURES FROM SOURCE DATA
Data Appendix  //  Cross-tabbable

Every question, sliceable by segment

The full distribution for every closed-ended question in the first wave, N=25 senior infrastructure leaders at organizations of 1,000+ employees who manage on-prem infrastructure and influence tooling decisions. Choose a cross-tab dimension below: Total shows each question as a chart; a segment view shows a cross-tab table with the percentage and count in every cell. Single-select columns sum to 100% of that segment's base; multi-select and ranking columns can exceed 100%.

First wave · N=25 · directional read while fielding continues