DigitalOcean Launches Inference Engine with New Capabilities for Production AI, Including Inference Router for Efficient Scaling of Agentic Workloads

Key Terms

Inference router
An inference router is a technical component that directs incoming requests for AI predictions or answers to the most appropriate model or server, much like a traffic controller sending cars to the best open lane. For investors, it matters because efficient routing reduces costs, speeds up responses and improves reliability for AI-driven products and services, which can affect user experience and operating margins.
Batch inference
Batch inference is the process of running a trained predictive model on a large collection of records at once to produce forecasts, scores, or labels for many items in a single run. For investors, it’s how firms quickly and cheaply generate model-driven signals from large datasets—like processing a stack of mail together rather than handling each letter one by one—affecting the speed, cost and consistency of analytics used in trading, reporting and compliance.
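
As a minimal sketch of the idea, assuming a toy scoring model (the weights and data below are illustrative only):

```python
import numpy as np

# Hypothetical stand-in for a trained model: it scores every record in
# one vectorized call instead of handling each record individually.
def score_batch(features: np.ndarray) -> np.ndarray:
    weights = np.array([0.4, 0.3, 0.3])  # illustrative "learned" weights
    return features @ weights

# 10,000 records scored in a single pass: the "stack of mail" approach.
records = np.random.rand(10_000, 3)
scores = score_batch(records)
print(f"scored {len(scores):,} records; mean score {scores.mean():.3f}")
```
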
Serverless inference
Serverless inference is a way to run artificial intelligence models on demand without a company owning or managing the underlying servers; the cloud provider automatically supplies computing power when a request comes in and bills only for actual usage. For investors, it matters because it lets businesses add AI features quickly, scale up or down with customer demand, and convert large upfront infrastructure costs into smaller, predictable operating expenses — which can improve margins and speed product rollout.
Dedicated inference
Dedicated inference is the use of reserved computing resources or specialized hardware specifically for running AI models that make predictions or analyze data, separate from the machines used to build and train those models. For investors, it matters because dedicated inference can speed up responses, improve reliability and security, and make costs more predictable—like owning a private delivery van instead of relying on a crowded shared service—which can affect a company’s product performance, operating expenses and competitive position.
Mixture of experts
A mixture of experts is an AI design that combines several specialized models so each handles tasks it does best, with a simple controller deciding which expert to use for a given input. Think of it as a team of specialists where the right person is picked for each problem, which can make systems more accurate and efficient. Investors watch this because it can improve product performance while changing development cost, compute needs, and competitive positioning.
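
As a minimal sketch of the design, with toy experts and a keyword-count gate standing in for a learned one (all names here are illustrative):

```python
import numpy as np

# Toy "experts": each one specializes in a different kind of input.
experts = {
    "math": lambda x: f"math expert answers {x!r}",
    "code": lambda x: f"code expert answers {x!r}",
}

def gate(x: str) -> np.ndarray:
    # A real gate is a small learned network producing one logit per
    # expert; here crude keyword counts stand in for those logits.
    logits = np.array([
        x.count("+") + x.count("="),    # math-ish signals
        x.count("(") + x.count("def"),  # code-ish signals
    ])
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()              # softmax over experts

def route(x: str) -> str:
    probs = gate(x)
    chosen = list(experts)[int(np.argmax(probs))]  # top-1 expert
    return experts[chosen](x)

print(route("2 + 2 ="))        # handled by the math expert
print(route("def f(): pass"))  # handled by the code expert
```
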
TensorRT
TensorRT is a high-performance software toolkit that optimizes and runs trained artificial intelligence models so they execute faster and use less computing power on specialized processors. For investors, it matters because faster, more efficient AI deployment lowers operating costs, enables real-time features and services, and boosts demand for the hardware and software ecosystems that power large-scale AI—like tuning an engine to get more speed and fuel efficiency from the same car.
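
For context, compiling a model with TensorRT's Python API looks roughly like the sketch below. The file paths are placeholders and the exact calls vary by TensorRT version; this follows the 8.x-era ONNX workflow:

```python
import tensorrt as trt

# Build an optimized TensorRT engine from an ONNX model.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:  # placeholder path
    if not parser.parse(f.read()):
        raise RuntimeError("failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # trade precision for speed

serialized = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:    # deployable engine file
    f.write(serialized)
```
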
P99 latency
P99 latency is the response time below which 99% of a system’s requests complete, meaning the slowest 1% of requests are at or above that time. Investors care because it measures the tail-end user experience and reliability—like timing the slowest customer in a line of 100—so high p99 latency can signal service problems, potential churn, extra support costs, or hidden operational risks that could hurt revenue and reputation.
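The metric itself is simple to compute from a log of response times, as this sketch with simulated latencies shows:

```python
import numpy as np

# 1,000 simulated request latencies in milliseconds: mostly fast,
# with a heavy tail, which is exactly what p99 is designed to expose.
rng = np.random.default_rng(0)
latencies_ms = rng.lognormal(mean=4.0, sigma=0.5, size=1_000)

p50 = np.percentile(latencies_ms, 50)
p99 = np.percentile(latencies_ms, 99)  # 99% of requests finish faster
print(f"p50 = {p50:.1f} ms, p99 = {p99:.1f} ms")
```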

Built alongside early design partners, the Inference Engine gives AI developers unified control over performance, cost, and scale — with customers reporting up to 67% lower inference costs.

BROOMFIELD, Colo.--(BUSINESS WIRE)-- DigitalOcean (NYSE: DOCN) today announced the launch of its Inference Engine, a set of new production capabilities that give AI builders exceptional performance and unified control over how they run, scale, and optimize inference workloads. The announcement comes ahead of DigitalOcean Deploy, the company's conference for AI builders, where it will unveil its full, integrated platform and new capabilities live.

DigitalOcean’s Inference Engine is built around four core capabilities: Inference Router, Batch Inference, Serverless Inference, and Dedicated Inference, giving development teams a single engine to match every workload type to the right performance and cost profile, without stitching together separate providers.

New Capabilities: Built for How AI Actually Runs in Production

Inference Router is designed to solve one of the biggest inefficiencies in agentic AI: sending every request to the most expensive model. With Inference Router, AI builders can define a model pool, describe tasks and priorities in natural language mapped to that pool, and optimize each request for cost and latency. Powered by DigitalOcean’s purpose-built MoE (Mixture of Experts) router model, Inference Router matches each request to the right model, helping teams improve performance and unit economics without having to build or manage routing infrastructure themselves. Customers like LawVo are already benefiting from this new capability:

"DigitalOcean's Inference Router gives us the kind of intelligent model selection we would otherwise have had to build ourselves. It routes each request to the right model based on complexity, helping us reduce inference costs by more than 40% while maintaining the accuracy, speed, and reliability our users expect." — Hovsep Seraydarian, Co-Founder and CTO, LawVo

Dedicated Inference delivers predictable performance and exceptional unit economics for teams running high-scale, sustained workloads, with reserved capacity that eliminates the variability of shared infrastructure.

Serverless Inference provides a single API key to access dozens of models, with scale-to-zero elasticity and the industry’s first off-peak pricing, giving teams instant access to leading open-source models without managing infrastructure or paying for idle capacity.
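
A "single API key, many models" service is typically exposed through an OpenAI-compatible endpoint. The sketch below assumes such an endpoint using the openai Python client; the base URL, key, and model name are placeholders, not DigitalOcean's published values:

```python
from openai import OpenAI

# One key and one endpoint front many hosted models.
client = OpenAI(
    base_url="https://inference.example.com/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",  # any model from the hosted catalog
    messages=[{"role": "user", "content": "Summarize this launch in one line."}],
)
print(response.choices[0].message.content)
```
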

Batch Inference reduces the cost of offline AI workloads by 50% through asynchronous execution, built-in retries, and a guaranteed 24-hour completion window. Batch Inference is purpose-built for workloads where real-time response isn't required but reliability is critical.
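
The release does not document the batch API, but asynchronous batch services generally follow a submit-then-poll pattern like this sketch, in which every URL and field name is assumed for illustration:

```python
import time
import requests

BASE = "https://api.example.com/v1/batches"  # hypothetical endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Submit many prompts as one asynchronous job instead of one call each.
job = requests.post(BASE, headers=HEADERS, json={
    "model": "example-model",
    "requests": [{"prompt": f"Classify record {i}"} for i in range(1_000)],
}).json()

# Poll until the job finishes; with a guaranteed 24-hour completion
# window, polling intervals can be generous.
while True:
    status = requests.get(f"{BASE}/{job['id']}", headers=HEADERS).json()
    if status["state"] in ("completed", "failed"):
        break
    time.sleep(60)

print(status["state"])
```
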

“Most teams building agentic systems today make a single model decision and apply it uniformly across their agentic workflows. They default to a frontier model and pay the generalization tax: premium prices and higher latency for work that often does not require the most expensive closed-source model. Inference Router is the essential AI middleware that removes that tax by intelligently matching requests to the right model based on task, context, and developer-defined preferences. The result is a smarter operating model for inference: one that gives developers more control over quality, speed, and cost while helping AI-native builders move faster and build more durable businesses on DigitalOcean.” — Vinay Kumar, CPTO, DigitalOcean

Performance Benchmarks: Independent Validation

The new Inference Engine was built around three core advances: hardware and software integrations, including vLLM, TensorRT, and SGLang, to maximize token throughput; request-path and model-level optimizations that improve unit economics without compromising quality; and distributed scaling designed for the bursty, uneven demands of production AI applications.

According to Artificial Analysis, an independent AI inference benchmarking platform, DigitalOcean leads across key inference performance metrics, including 3x faster time-to-first-answer-token and 3x higher output speed than Amazon Bedrock on DeepSeek V3.2 at 10,000 input tokens. DigitalOcean also delivers stronger output speed and latency consistency than most hyperscaler and neo-cloud providers, and is one of only three providers ranked in the Most Favorable Quadrant on Artificial Analysis's Latency vs. Output Speed chart, with Amazon, SambaNova, Nebius, and six others falling outside it.
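
For readers unfamiliar with these metrics, time-to-first-token and output speed can be measured against any streaming, OpenAI-compatible endpoint roughly as follows; the endpoint and model name are placeholders:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="https://inference.example.com/v1",  # placeholder
                api_key="YOUR_API_KEY")

start = time.perf_counter()
first_token_at = None
chunks = 0

# Stream the response and timestamp the first generated chunk.
stream = client.chat.completions.create(
    model="deepseek-v3.2",  # placeholder model name
    messages=[{"role": "user", "content": "Explain inference routing briefly."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time-to-first-token
        chunks += 1  # streamed chunks roughly approximate tokens

if first_token_at is not None:
    gen_time = time.perf_counter() - first_token_at
    print(f"time-to-first-token: {first_token_at - start:.2f}s")
    print(f"output speed: ~{chunks / max(gen_time, 1e-9):.0f} chunks/s")
```
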

Customers Report Significant Cost and Performance Gains

The Inference Engine was co-developed with early design partners running real production workloads, and the results are already showing up at scale.

Hippocratic AI, which runs safety-critical healthcare agents on the platform, achieved 2x production throughput and 40% lower P99 latency across more than 20 million patient interactions.

"In healthcare AI, a node going down isn't just an SLA issue, it impacts patient experience. We've pressed DigitalOcean hard on reliability, access to the newest hardware, and the ability to scale efficiently. They've delivered." — Debajyoti Datta, Co-Founder, Hippocratic AI

Workato's Research Lab, which processes over 1 trillion automated workloads, saw meaningful performance and cost improvements, achieving 77% faster time-to-first-token, 79% lower end-to-end latency, and 67% lower inference costs on DigitalOcean.

"Through close collaboration on performance optimization, DigitalOcean helped us accelerate our inference performance and overall progress by two to three times." — Oscar Wu, AI Research Scientist, Technical Lead, Workato

At Deploy in San Francisco, DigitalOcean will also unveil new products that show how it has built a five-layer stack purpose-built for the Inference Era. Hovsep Seraydarian of LawVo, Debajyoti Datta of Hippocratic AI, and Oscar Wu of Workato will share stories live at Deploy about how their teams are building and scaling real-world AI applications on DigitalOcean. In-person attendance is full; sign up to watch the keynote live stream at 12pm Pacific on April 28.

About DigitalOcean

DigitalOcean is the Agentic Inference Cloud built for AI-native and digital-native enterprises scaling production workloads. The platform combines production-ready GPU infrastructure with a full-stack cloud — all built on open source at every layer — to deliver operational simplicity and predictable economics at scale. More than 640,000 customers trust DigitalOcean to power their cloud and AI infrastructure. Learn more at digitalocean.com.

Investor Relations
Radu Patrichi, CFA
investors@digitalocean.com

Media Relations
Meghan Grady
press@digitalocean.com

Source: DigitalOcean