
- As foundation models scale, the real bottleneck – and the real profit pool – is shifting to inference infrastructure, the layer where every token, every request, and every agent runs.
- This layer is consolidating into a $50B+ emerging oligopoly, driven by GPU scarcity, scale economics, and developer network effects.
- The companies that control inference – Fireworks, Baseten, Modal, Modular, Together, Replicate – are quietly becoming the cloud equivalents of the AI era.
For weekly coverage of this power shift and the emerging AI value stack, see:
https://businessengineer.ai/p/this-week-in-business-ai-the-2025
THE LAYER: THE PIPES THAT CARRY ALL OF AI
If foundation models are the brain of the AI economy, inference infrastructure is the bloodstream.
This is the layer where:
- requests hit clusters
- tokens flow across GPUs
- routing, caching, batching, and optimization determine margins
- enterprises anchor into platforms they never leave
The graphic describes this layer as:
“The pipes that carry AI.”
That isn’t narrative flair.
It is structural reality.
Every model call – GPT-4, Claude, Llama, Mistral, Reka, custom enterprise models – must pass through an inference pipeline. This is the layer where compute becomes revenue.
Inference infrastructure is the toll road of the AI market.
He who controls the tolls controls the margins.
LAYER CHARACTERISTICS — WHY INFRASTRUCTURE IS THE NEW BATTLEGROUND
The graphic highlights four traits. Let’s expand each one.
1. Inference costs will 10× training over time
Everyone fixates on training runs.
But the economics flip after launch:
- training: periodic, expensive, predictable
- inference: continuous, unpredictable, scaling with every user, request, and agent
Most foundation model companies will ultimately spend 10× more on inference than training.
That means:
- inference margin is the real battleground
- infra providers can capture decades of value
- the cost line becomes the moat
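The flip is easy to see with back-of-the-envelope arithmetic. The sketch below is a toy model with purely illustrative numbers — the training cost, per-token price, and serving volume are assumptions, not real provider data — showing how many months of serving it takes for cumulative inference spend to reach, then dwarf, a one-time training run.

```python
# Toy model: cumulative inference spend vs. a one-time training run.
# All figures are illustrative assumptions, not real provider data.

TRAINING_COST = 100e6          # hypothetical one-time training run: $100M
COST_PER_M_TOKENS = 0.50       # hypothetical blended inference price per 1M tokens
TOKENS_PER_MONTH = 20e12       # hypothetical serving volume: 20T tokens/month

def months_until_inference_exceeds(multiple: float = 1.0) -> int:
    """Months of serving until inference spend reaches `multiple` x training cost."""
    monthly = (TOKENS_PER_MONTH / 1e6) * COST_PER_M_TOKENS  # $10M/month here
    months = 0
    spend = 0.0
    while spend < TRAINING_COST * multiple:
        spend += monthly
        months += 1
    return months

print(months_until_inference_exceeds(1))   # parity with training: 10 months
print(months_until_inference_exceeds(10))  # the 10x threshold: 100 months
```

Under these assumptions inference matches the training bill in under a year and hits 10× within a decade of serving — which is why the cost line, not the model, becomes the moat.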
2. GPU access becomes a competitive advantage
Cloud was commoditized.
GPUs are not.
Providers with:
- multi-year GPU reservations
- direct NVIDIA relationships
- custom scheduling
- optimized cluster utilization
…will have pricing power that compounds.
This explains why the oligopoly is forming early.
Only a handful of companies will secure the long-term GPU access needed to compete.
3. 5–10 major players will dominate
This layer is following cloud-like dynamics:
- high fixed costs
- network effects
- developer lock-in
- pricing leverage
- enterprise contracts
Just as AWS + Azure + GCP dominate cloud, the inference layer will crystallize around:
- Fireworks
- Baseten
- Modal
- Modular
- Together
- Replicate
An emerging oligopoly, already visible.
4. $1–4B valuations are now typical
This layer is capital-light compared to foundation models, but capital-intensive enough to keep competition out.
It’s the perfect “middle layer” for building durable, compounding businesses.
THE STAKES — $50B+ MARKET, 70%+ MARGINS, 10× COST DIFFERENCES
The graphic puts the stakes in blunt terms:
- $50B+ market
- 70%+ margins possible
- 10× cost deltas across providers
This is the single largest profit pool in the applied AI economy.
Why?
Because:
Every AI company, every product, every agent relies on inference.
Nothing runs without it.
And unlike foundation models, inference margin is defensible:
- you can optimize
- you can reduce cost per request
- you can differentiate on latency
- you can build proprietary routing
- you can lock in enterprise contracts
This is where horizontal value capture happens.
THE INFRASTRUCTURE UNICORNS — THE NEW POWER PLAYERS
The graphic highlights the core players:
- Fireworks ($4B+) — inference-optimized, fastest-growing
- Baseten ($2.2B) — deployment, orchestration
- Modal ($1.1B) — compute automation and dynamic infra
- Modular ($1.6B) — high-performance execution engine
- Together AI ($3.2B+) — platform + inference + scaling
- Replicate ($2B+) — API-first model marketplace
Each one has chosen a strategic wedge:
- latency
- routing
- cost optimization
- GPU scheduling
- developer experience
- multi-model orchestration
Unlike the foundation layer, no single company needs to dominate.
Multiple players can win — if they specialize.
THE INFERENCE INFRASTRUCTURE PIPELINE — WHERE EVERY TOKEN FLOWS
The center graphic shows the model → cluster → API flow.
This pipeline consists of:
- GPU clusters
- batching
- load balancing
- caching
- routing
- cost optimization
- memory management
- multi-model orchestration
Every step has deep technical leverage.
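Two of those stages — caching and batching — are where most of the cost-per-token leverage lives, and a minimal sketch shows why. The class below is a toy, not any provider's actual gateway: a cache hit costs zero GPU time, and a full batch amortizes one GPU call across many requests.

```python
# Minimal sketch of two pipeline stages from the list above: caching and batching.
# This is a toy illustration, not a real provider's gateway.

from collections import OrderedDict

class InferenceGateway:
    """Toy request path: check an LRU cache, else queue for a batched GPU call."""

    def __init__(self, batch_size: int = 4, cache_size: int = 128):
        self.batch_size = batch_size
        self.cache = OrderedDict()   # prompt -> completion, LRU order
        self.cache_size = cache_size
        self.queue = []

    def _run_batch(self, prompts):
        # Stand-in for one batched forward pass on a GPU cluster.
        return [f"completion:{p}" for p in prompts]

    def submit(self, prompt: str):
        # 1. Cache hit: the request costs no GPU time at all.
        if prompt in self.cache:
            self.cache.move_to_end(prompt)
            return self.cache[prompt]
        # 2. Miss: queue it; a full batch triggers ONE GPU call for many requests.
        self.queue.append(prompt)
        if len(self.queue) >= self.batch_size:
            for p, out in zip(self.queue, self._run_batch(self.queue)):
                self.cache[p] = out
                if len(self.cache) > self.cache_size:
                    self.cache.popitem(last=False)  # evict least-recently-used
            self.queue.clear()
        return self.cache.get(prompt)

gw = InferenceGateway(batch_size=2)
gw.submit("a")            # queued, no result yet
print(gw.submit("b"))     # batch of 2 flushes in one GPU call -> completion:b
print(gw.submit("a"))     # served from cache, no GPU call at all
```

A real pipeline adds continuous batching, KV-cache management, and cross-cluster routing on top, but the margin mechanics are the same: every cache hit and every extra request per batch is cost the competitor without that pipeline still pays.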
Inferior pipelines produce:
- higher cost per token
- higher latency
- more failed requests
- enterprise instability
Superior pipelines produce:
- cheaper inference
- faster throughput
- higher reliability
- enterprise standardization
This is not “nice to have.”
It is existential.
WHY THE OLIGOPOLY FORMS — THE STRUCTURAL FORCES
The bottom panel shows the three forces behind the oligopoly.
1. GPU supply agreements matter
Companies that secured H100 clusters early now enjoy:
- guaranteed long-term capacity
- lower cost per token
- pricing power that compounds
Late entrants have no chance to match these economics.
2. Developer adoption creates network effects
If your infra platform becomes the default for:
- agents
- orchestration
- workflows
- routing
- fine-tuned models
…developers will not switch.
The switching cost is not money.
It is friction, reliability, and integration risk.
3. Hyperscalers vs. pure-plays
AWS, Azure, and GCP will compete.
But pure-plays innovate faster:
- better routing
- better batching
- better GPU utilization
- better multi-model optimization
The infrastructure choice becomes a trade-off between stability (hyperscalers) and performance (pure-plays).
The market needs both — but margins favor the specialized players.
THE WINNING FORMULA — HOW INFRASTRUCTURE COMPANIES WIN
The graphic lists three elements. Let’s sharpen them.
1. Secure GPU supply early
Without GPUs, you cannot win.
Period.
2. Build vertical integration
Owning:
- routing
- scheduling
- optimization
- orchestration
…produces cost deltas competitors cannot overcome.
3. Lock in enterprise relationships
Enterprise AI workloads are extremely sticky.
Once integrated, switching is nearly impossible.
This creates multi-year compounding margin.
THE STRUCTURAL IMPLICATION
The bottom panel breaks it down by stakeholder.
For Startups — Pick a niche before consolidation hits
Building generic infrastructure now is suicide.
The niches still open are:
- ultra-low latency
- agentic inference
- domain-specific routing
- GPU optimization tooling
- multi-cloud optimization
Pick a corner, win it, and integrate.
For Scalers — Acquire or be disrupted at the edges
Once the oligopoly locks in, the acquisition wave begins.
Hyperscalers will either:
- buy pure-plays, or
- get disrupted by them
Edges win — and then expand inward.
For Enterprises — Multi-cloud AI is inevitable
To avoid lock-in and hedge GPU scarcity:
- enterprises will run multiple inference providers
- agentic systems will route across platforms
- cost and latency optimization will become a procurement function
The future enterprise stack is multi-model and multi-cloud.
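When routing becomes a procurement function, it reduces to a scoring problem. The sketch below is a hypothetical illustration — the provider names, prices, and latencies are invented — of a router that blends cost and latency into one score per request, the kind of logic an enterprise multi-provider stack would run.

```python
# Toy multi-provider router: pick an inference provider per request by a
# blended cost/latency score. Provider names, prices, and latencies are invented.

providers = {
    "provider_a": {"usd_per_m_tokens": 0.40, "p50_latency_ms": 420},
    "provider_b": {"usd_per_m_tokens": 0.90, "p50_latency_ms": 110},
    "provider_c": {"usd_per_m_tokens": 0.60, "p50_latency_ms": 250},
}

def route(latency_weight: float) -> str:
    """Return the provider minimizing a blended cost/latency score.

    latency_weight=0.0 optimizes purely for cost; 1.0 purely for latency.
    Each axis is normalized to the fleet maximum before blending.
    """
    max_cost = max(p["usd_per_m_tokens"] for p in providers.values())
    max_lat = max(p["p50_latency_ms"] for p in providers.values())

    def score(name: str) -> float:
        p = providers[name]
        return ((1 - latency_weight) * p["usd_per_m_tokens"] / max_cost
                + latency_weight * p["p50_latency_ms"] / max_lat)

    return min(providers, key=score)

print(route(latency_weight=0.0))  # cheapest wins: provider_a
print(route(latency_weight=1.0))  # fastest wins: provider_b
```

A batch analytics job would route with a low latency weight, a user-facing agent with a high one — the same workload portfolio, spread across providers by policy rather than by lock-in.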
THE FINAL TAKEAWAY — INFRASTRUCTURE IS WHERE THE AI PROFITS WILL CONCENTRATE
Foundation models capture attention.
Vertical AI captures revenue.
But inference infrastructure captures the flow of value between them.
It is the toll road of AI.
And it is consolidating fast.
This is the new economic choke point of the AI era.
