Datadog has added GPU monitoring to its observability stack, giving AI-hungry organizations more insight into exactly what’s happening on their most expensive silicon.
The observability vendor says GPU instances now make up 14 percent of cloud compute costs as companies clamber onto the AI bandwagon, and it expects that proportion to grow.
Earlier this month, IDC said: “Worldwide spending on artificial intelligence (AI) infrastructure reached $89.9 billion in Q4 2025” – up 62 percent on the year – with accelerated compute, mainly GPUs, described as the “structural backbone” of that growth.
But there’s plenty of debate over what value – if any – companies are deriving from their massive AI investments.
Datadog is not getting into that bearpit. But as chief product officer Yanbing Li puts it, “While these companies can see their costs climbing, they can’t chargeback GPU spend across business units, see workload context or identify clear next steps for improvement.”
To address that, Datadog claims, its latest tool offers unified visibility across the AI stack, “giving customers a single view linking GPU fleet health, cost, and performance directly to the teams relying on them for faster troubleshooting of slow workloads and cost savings.”
A longer explainer says the tooling works across both cloud and neocloud instances as well as on-prem GPU fleets – handy if sovereignty concerns are making you wary of AI in the cloud.
“It’s easy to see how much of your fleet is sitting completely idle or being ineffectively consumed by a workload that doesn’t require GPUs at all,” it says. “You can drill into the Fleet Explorer to hold each team accountable for their GPU utilization and spend.”
As well as identifying stalled or zombie processes soaking up GPU time, it will spot workloads that were never configured for GPUs in the first place, effectively burning cash.
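Datadog doesn’t spell out its detection logic, but the underlying signals are the standard NVML metrics any monitoring agent can read. As a rough illustration only – not Datadog’s implementation – here’s a minimal sketch using the pynvml bindings to flag GPUs reporting near-zero utilization while processes still hold memory on them, the classic zombie pattern. The 5 percent threshold is an arbitrary assumption.

```python
# Minimal sketch (not Datadog's code): flag GPUs whose compute utilization is
# near zero while processes still hold memory on them - a common sign of a
# stalled or "zombie" workload. Requires the nvidia-ml-py (pynvml) package
# and an NVIDIA driver; the 5 percent threshold is an arbitrary assumption.
import pynvml

IDLE_UTIL_THRESHOLD = 5  # percent

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        if util < IDLE_UTIL_THRESHOLD and procs:
            pids = [p.pid for p in procs]
            print(f"GPU {i}: {util}% utilization but processes {pids} "
                  f"still attached - possible stalled/zombie workload")
        elif util < IDLE_UTIL_THRESHOLD:
            print(f"GPU {i}: idle ({util}% utilization), no processes attached")
finally:
    pynvml.nvmlShutdown()
```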
“Internally at Datadog, GPU Monitoring helped us save tens of thousands in monthly expenses by identifying and removing a serving pod that had been stuck in the initialization phase,” the explainer said.
“Rising costs are often driven by operational inefficiency rather than hardware alone. By linking cost to utilization and workload behavior, teams can reduce waste while maintaining performance.”
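The explainer doesn’t show the arithmetic behind “linking cost to utilization,” but the idea is simple enough to illustrate. The figures below are hypothetical, not Datadog’s: the effective price of useful GPU time scales with how much of the fleet sits idle.

```python
# Illustrative back-of-envelope only (hypothetical numbers, not Datadog data):
# idle capacity inflates the effective cost of each fully utilized GPU hour.
hourly_rate = 32.77        # assumed on-demand price for an 8-GPU instance, $/hr
avg_utilization = 0.35     # assumed fleet-wide average GPU utilization

effective_rate = hourly_rate / avg_utilization
monthly_waste = hourly_rate * (1 - avg_utilization) * 730  # ~730 hours/month

print(f"Effective cost per fully utilized hour: ${effective_rate:.2f}")
print(f"Spend attributable to idle capacity, per instance-month: ${monthly_waste:,.2f}")
```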
Datadog is certainly not alone in extending observability further down the AI stack. This week also saw Grafana launch observability tools for AI that provide insights into agent behavior, while its Grafana Cloud platform offers GPU observability covering hardware utilization and resource allocation, as well as cost optimization.
Earlier this month, Nutanix unveiled a multi-tenancy framework to let organizations run more workloads on their existing GPUs and get more insight into how AI systems are chewing through tokens.
So, it’s getting easier to work out how much individual AI workloads are costing you, and what processes and software misconfigurations could be making bills higher than necessary.
This should help enterprises ensure their AI infrastructure, and the apps and agents running on it, are operating as efficiently as possible. Whether it helps them work out if they’re actually getting value from their AI investments is quite another question. ®

