R01
Screening responses
Which of the following best describes your role?
Engineering leadership (Director / VP / CTO)
Have you evaluated, installed, or run a POC of a Kubernetes rightsizing or optimization tool in the past 12 months?
- Yes
- No
Yes
How does your organization run containers in production?
- Kubernetes — managed service (EKS, GKE, AKS, or similar)
- Kubernetes — self-managed
- Other orchestration only (Docker Swarm, Nomad, ECS without Kubernetes, etc.)
- We don't run containers in production
Kubernetes — managed service (EKS, GKE, AKS, or similar)
Across all your clusters, roughly how many containers are you running?
- Fewer than 100
- 100–499
- 500–999
- 1,000–4,999
- 5,000 or more
1,000–4,999
What's your organization's stance on using SaaS or cloud-hosted tools for infrastructure tooling?
- We can adopt SaaS tools (freely, or with standard approval)
- We're restricted to on-premises / self-hosted tools only
- Not sure
We can adopt SaaS tools (freely, or with standard approval)
How many employees does your organization have?
- 1–49
- 50–249
- 250–999
- 1,000–4,999
- 5,000 or more
250–999
Interviewer
Perfect. Thanks for that context — it helps me understand your scale. Now let's dig into the tools you evaluated. When you were looking for Kubernetes rightsizing or optimization solutions, which tools did you evaluate or consider?
Respondent
We evaluated CAST AI, Kubecost, StormForge, Fairwinds Insights, and AWS Compute Optimizer, with a primary focus on CAST AI and Kubecost for Kubernetes cost optimization and rightsizing.
Interviewer
Got it — solid mix. So you narrowed down to CAST AI and Kubecost as your primary contenders. What made those two stand out from the others? What were you specifically looking for in a rightsizing solution?
Respondent
CAST AI and Kubecost stood out for their cost visibility, automated rightsizing, and ease of use. We wanted to reduce cloud costs and improve efficiency.
Interviewer
Got it — cost visibility and automated rightsizing were the big draws. Walk me through the evaluation timeline: from when you first installed one of these tools to when you made a decision, how long did the whole process take?
Respondent
The evaluation process took about 6–8 weeks, including setup, testing, stakeholder reviews, and final vendor selection.
Interviewer
Got it — so 6 to 8 weeks from first install to final decision. That's a meaningful evaluation window. Now let's zoom in on the setup phase itself. What did you actually have to do to get the tool up and running — like agent deployment, permissions, integrations, that kind of thing? And roughly how long did that setup piece take on its own?
Respondent
We deployed the agents, configured Kubernetes permissions, connected our cloud accounts, and integrated monitoring tools. The initial setup took about 3–5 days.
Interviewer
Got it — so 3 to 5 days for the full setup: agents, RBAC, cloud integrations, monitoring. That's pretty hands-on work. After all that was done and the tool was live, what was the first thing you tried to do with it, and how did that go?
Respondent
We analyzed resource usage to find overprovisioned workloads and it quickly identified cost-saving opportunities.
Interviewer
Nice — so it delivered on the core promise right out of the gate. You got visibility into overprovisioning and concrete cost-saving opportunities. What happened next? Did you move forward with recommendations, or did something slow you down at that point?
Respondent
We validated the recommendations and implemented them gradually to avoid impacting performance.
Interviewer
That's smart — gradual rollout to validate safety. So you did move forward with the recommendations. What happened after that? Did you hit any friction when actually implementing those changes, or did things keep moving smoothly into production?
Respondent
Things were mostly smooth, with only minor tuning and testing needed before rolling changes into production.
Interviewer
That's great — smooth rollout with just tuning and testing. So you moved from recommendations into production fairly cleanly. Once the tool was running in production and actively rightsizing workloads, what was your day-to-day experience like? Did the tool keep delivering value, or did something shift at that point?
Respondent
It continued to deliver steady value through ongoing cost optimization insights and alerts, with no major shift after production use.
Interviewer
That's really encouraging — steady value without a cliff drop-off after going live. So the tool kept performing as expected. Here's what I'm curious about: during that entire journey from evaluation through production, were there any moments where you felt uncertain about what the tool was actually doing, or hesitant about letting it make changes automatically? Any friction points that made you pause?
Respondent
We had initial hesitation with automation, so we used manual approvals at first, but there were no major issues after validation.
Interviewer
Got it — so you started with manual approvals as a safety gate, which is totally reasonable. That makes sense for building confidence. Once you got comfortable after validation, did you eventually move to full automation, or did you keep the manual approval process in place?
Respondent
We kept manual approval in place for most changes, only allowing limited automation for low-risk recommendations.
Interviewer
Got it — so you stayed cautious and kept manual approvals as the default, only automating the truly low-risk stuff. That's a pretty common pattern. Here's what I'm curious about: what specifically made certain recommendations feel "low-risk" enough to automate, while others needed that human gate? What was the difference?
Respondent
Low-risk changes were small CPU/memory tweaks on stateless workloads; anything affecting stateful or latency-sensitive services required manual approval.
Interviewer
That makes total sense — small tweaks on stateless stuff felt safe enough to let run, but anything touching state or latency was too risky to automate. So you were essentially drawing a line based on workload type and blast radius.
Here's what I'm curious about: once you had that manual-approval gate in place and started seeing results, did the tool continue to surface new optimization opportunities over time, or did it feel like it kind of plateaued after the initial round of recommendations?
Respondent
It continued to surface new optimization opportunities over time rather than plateauing, but the biggest savings came in the initial phase.
Interviewer
That's really valuable — so the tool kept finding opportunities over time, which is what you'd want to see. But yeah, the biggest bang for your buck was upfront. Here's what I'm curious about: once you had those initial wins locked in and the tool was running day-to-day, did you find that the manual approval gate was becoming a bottleneck? Like, were you spending a lot of time reviewing and approving recommendations, or was the cadence manageable?
Respondent
No, the cadence stayed manageable and the manual approval process did not become a bottleneck.
Interviewer
Got it — so the approval cadence stayed workable, didn't turn into a bottleneck. That's helpful context. Let me shift gears a bit: thinking back over the whole evaluation and rollout process, were there any other friction points or concerns that slowed you down — things we haven't talked about yet? Anything else that made you hesitate or hold back?
Respondent
Minor integration and trust concerns, but no major blockers or delays during rollout.
Interviewer
Got it — so minor integration and trust friction, but nothing that actually derailed you. That's really helpful context. Let me ask this: thinking back over the whole evaluation and rollout, did real-time detection of resource issues — like OOMs, CPU throttling, or heap exhaustion — factor into what you were looking for in a tool, or wasn't that on your radar?
Respondent
Yes, it was on our radar and important for catching OOM, CPU throttling, and performance issues in real time, but not the primary factor.
Interviewer
Got it — so real-time detection was important for catching OOMs, throttling, and performance issues, but it wasn't the deciding factor. That makes sense given your focus on cost optimization. Here's what I'm curious about: when the tool did surface those kinds of issues in real time, how did your team actually respond? Did you have a playbook for acting on them, or was it more ad-hoc?
Respondent
We had a basic playbook for common issues, but edge cases were handled ad-hoc.
Interviewer
Got it — so a playbook for the common stuff, but edge cases were more improvised. That makes sense at scale. Here's what I'm curious about: thinking back to the whole evaluation and rollout, did you ever feel like you needed better visibility into why the tool was recommending specific changes? Like, did it feel transparent what logic it was using, or was there a "black box" feeling at any point?
Respondent
Mostly transparent, but occasionally we needed to dig into metrics for full clarity.
Interviewer
Got it — so mostly transparent, but you had to dig into the metrics sometimes to get the full picture. That's a pretty common experience. Did that digging feel like a minor annoyance, or did it actually slow down your decision-making or confidence in the recommendations?
Respondent
It was a minor annoyance and didn't significantly slow decision-making or reduce confidence.
Interviewer
Got it — so it was just a minor friction point, not something that actually held you back. That's helpful context.
Let me ask you this: thinking about real-time detection of resource issues — like OOMs, CPU throttling, or heap exhaustion — how critical was it for the tool to catch and respond to those in real time? Or wasn't that really on your radar during the evaluation?
Respondent
It was important for catching issues in real time, but not a primary evaluation factor during selection.
Interviewer
Got it — so real-time detection was important operationally, but it wasn't a deciding factor when you were picking the tool. That makes sense. Let me ask: did the tool actually have solid rollback or recovery capabilities if something went wrong after a change was deployed? Was that something you needed to feel confident, or not really on your radar?
Respondent
Yes, rollback/recovery mattered for confidence, but wasn't a key differentiator.
Interviewer
Got it — so rollback and recovery were important for feeling confident, but they weren't a deal-breaker or deciding factor. That makes sense.
Let me ask this: thinking back over the whole journey from evaluation through production, did the tool's deployment model — like whether it ran as an agent, operator, or some other architecture — actually matter to your team? Or was that pretty transparent once it was installed?
Respondent
It didn't matter much day-to-day; it was mostly transparent once installed.
Interviewer
Got it — so the deployment model was pretty transparent once it was up and running. That's helpful context.
Let me ask: thinking back over the whole evaluation and rollout, if you had to pick just ONE thing that would've gotten you to active production usage faster or smoother, what would it be?
Respondent
Clearer explanations of recommendations and impact would have sped up adoption.
Interviewer
That's a really concrete answer — clearer explanations of why and what impact each recommendation would have. So you're saying if the tool had spelled out the logic and expected outcomes more explicitly upfront, you would've moved faster through validation and into production?
Why that one over the others — like the automation trust, real-time detection, or the other friction points we talked about?
Respondent
Because unclear impact explanations slowed trust and validation more than technical issues.
Interviewer
That makes a lot of sense — so it was really about the human side of adoption, not the technical plumbing. Unclear impact explanations created friction in the trust-building and validation phase, which is where you needed to move fastest. Got it.
One last thing: thinking back over the whole journey, is there anything else about your evaluation experience that you think would be useful for us to know?
Respondent
Nothing major beyond what we discussed: mainly, trust-building and clarity of impact were the key factors shaping adoption speed.
Interviewer
Perfect — that's a really clear takeaway. Trust-building and clarity of impact were the real drivers of how fast you could move through evaluation into production. Thanks so much for walking through this with me — this has been really helpful context for understanding how teams actually move from install to active usage. I appreciate your time.