Thinking Machines Lab and Bridgewater Prove a Small Custom Model Beats GPT, Claude, and Gemini on Finance Tasks

Mira Murati’s Thinking Machines Lab and Bridgewater’s AIA Labs have published some of the clearest evidence yet that fine-tuning a smaller model on proprietary expert data can outperform frontier models on domain tasks while also costing far less to run.

Research Results — Thinking Machines Lab × Bridgewater AIA Labs

84.7%: custom model accuracy on financial classification

78.2%: best frontier model (GPT / Claude / Gemini) with expert prompting

13.8x: cheaper inference vs. frontier models

29.8%: fewer errors than the best frontier baseline


What Happened

Thinking Machines Lab — the AI company founded by former OpenAI CTO Mira Murati — and Bridgewater Associates’ AIA Labs jointly published research demonstrating that a custom fine-tuned model trained on Bridgewater’s proprietary expert-labeled data outperforms every major frontier model on six real financial document tasks. Those tasks include article relevancy scoring, central-bank document interpretation, content labeling, and document and email truncation — the unglamorous but mission-critical work that drives investment decisions at scale.

The gap is not marginal. Frontier models achieved roughly 50% accuracy with naive prompting. Even with careful expert prompting, the kind that requires skilled AI engineers and iterative tuning, the best frontier baseline topped out at 78.2%. The custom fine-tuned model reached 84.7%, making 29.8% fewer errors than that baseline. The method combined expert-labeled fine-tuning data with GRPO (Group Relative Policy Optimization, a reinforcement-learning method) and three further techniques, each listed with its reported gain: interleaved batching (+12.1%), CISPO loss with asymmetric clipping (+10.1%), and on-policy distillation with dynamically promoted teacher models (+3.1%). Training ran on Thinking Machines’ own Tinker infrastructure.
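For readers unfamiliar with GRPO, the core idea can be sketched in a few lines: several answers are sampled for the same prompt, and each answer's reward is scored relative to its own group. The snippet below is a minimal illustration of that normalization step only, not the research team's training code. The function name, the 0/1 reward scheme (1.0 when an answer matches the expert label), and the use of a population standard deviation are all assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled answer relative to the other answers in its group.

    Assumption: rewards are plain floats, one per sampled answer to the
    same prompt. Real training stacks differ in details such as the choice
    of standard deviation and the value of eps.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every answer gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled answers to one classification prompt,
# rewarded 1.0 if they match the expert label and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Answers that match the expert label get a positive advantage,
# the others a negative one.
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

When every sampled answer receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason the quality and difficulty of the expert-labeled examples matter.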

The research team — Sarah Su, Kevin Zhu, Emily Xiao, and Rohan Alur from Thinking Machines, plus Daniel Kang from Bridgewater AIA Labs — frames this under a thesis they call “differentiated intelligence”: an organization’s proprietary expert-labeled data, combined with the right fine-tuning methodology, is a structural moat that general-purpose frontier models cannot cross, regardless of parameter count or RLHF investment.

Accuracy Ladder — Financial Document Classification

Custom Fine-Tuned Model (TML × Bridgewater) 84.7%

Best Frontier Model — Expert Prompting 78.2%

Frontier Models — Naive Prompting ~50%

The key insight: The gap between naive prompting (~50%) and expert prompting (~78%) already exposes how much value enterprises are leaving on the table by renting frontier models without deep customization. But even expert prompting hits a ceiling. Fine-tuning on your own judgment breaks through it — and costs 13.8x less per inference call to operate.
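The headline figures are mutually consistent, which a short calculation can confirm. The sketch below uses only the accuracies quoted in the article; the roughly 50% naive-prompting figure is approximate, so anything derived from it is approximate as well. The variable and function names are illustrative.

```python
naive_acc = 0.50    # approximate, quoted as "roughly 50%"
expert_acc = 0.782  # best frontier model with expert prompting
custom_acc = 0.847  # custom fine-tuned model


def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline's errors that the new model eliminates."""
    baseline_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    return (baseline_err - new_err) / baseline_err


# Accuracy gaps in percentage points
prompting_gain = (expert_acc - naive_acc) * 100
fine_tuning_gain = (custom_acc - expert_acc) * 100

# Error rate falls from 21.8% to 15.3%
reduction = relative_error_reduction(expert_acc, custom_acc)

print(f"{prompting_gain:.1f} {fine_tuning_gain:.1f} {reduction:.1%}")
# prints "28.2 6.5 29.8%"
```

A 6.5-point accuracy gain looks modest, but measured against the 21.8% of cases the best frontier baseline gets wrong, it removes almost a third of the errors.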

The Structural Read

This is not a benchmark story. This is a business-model story. The implicit contract of the frontier-model era was: the biggest model, trained on the most data, wins everything. OpenAI, Google, and Anthropic have collectively invested hundreds of billions of dollars on that premise. Enterprises signed on to that logic too — pay for API access, let the smartest general model handle your tasks, iterate on prompts.

The Thinking Machines / Bridgewater result ruptures that contract in two places at once. First, it shows that domain-specific expert labels encode judgment that no amount of general pretraining captures. Central-bank document interpretation and investment article triage are not tasks where “more parameters” closes the gap — they require the crystallized perspective of professionals who have spent years building signal intuitions. Second, it shows that a smaller model, properly trained, runs at a fraction of the inference cost — meaning the performance advantage compounds into an economic advantage. You win on accuracy and on margin.

The strategic positioning here is deliberate. Tinker — Thinking Machines’ training infrastructure — is the commercial product. Murati is not selling a frontier model. She is selling the capability to build your proprietary version of the frontier, tuned on your data, running on your terms. The Bridgewater collaboration is not just a case study. It is a proof of concept for every large institution sitting on years of expert-labeled proprietary data that it has never converted into model weights.

Harness Theory — Business Engineer

“The winning enterprise AI strategy is not renting the biggest frontier generalist — it is harnessing a specialist tuned on your own judgment. Your proprietary expert-labeled data is not an input into someone else’s model. It is the model.”


Three Implications

IMPLICATION 1 — THE FRONTIER API BUSINESS FACES A STRUCTURAL CEILING

Enterprises willing to invest in expert labeling and fine-tuning infrastructure will consistently outperform their peers who remain on API dependency — and they will do it more cheaply at scale. At 13.8x cheaper inference and 6.5 percentage points of accuracy advantage, the ROI calculation for verticalization becomes straightforward for any institution processing high volumes of domain-specific documents. The pressure on OpenAI, Google, and Anthropic’s enterprise API revenue is structural, not cyclical.
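One way to make that ROI calculation concrete is a break-even estimate: how many inference calls it takes for the per-call saving to repay the up-front cost of labeling and fine-tuning. The sketch below is illustrative only. The 13.8x ratio comes from the research, but the dollar amounts are hypothetical placeholders, and the model ignores the value of the accuracy gain as well as hosting and maintenance overhead.

```python
COST_RATIO = 13.8  # from the article: custom inference is 13.8x cheaper


def break_even_calls(fixed_cost, frontier_cost_per_call, cost_ratio=COST_RATIO):
    """Number of calls after which the cheaper custom model has paid for itself."""
    custom_cost_per_call = frontier_cost_per_call / cost_ratio
    saving_per_call = frontier_cost_per_call - custom_cost_per_call
    return fixed_cost / saving_per_call


# Hypothetical inputs, not taken from the research:
# $250,000 for expert labeling plus training, $0.01 per frontier API call.
calls = break_even_calls(fixed_cost=250_000, frontier_cost_per_call=0.01)
print(f"Break-even after about {calls / 1e6:.1f} million calls")
```

Under these assumed numbers the break-even point lands at roughly 27 million calls, so the case is strongest for institutions processing documents at very high volume.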

IMPLICATION 2 — MURATI’S TINKER IS THE REAL PRODUCT TO WATCH

Thinking Machines Lab is not competing with OpenAI on model scale — it is competing for the fine-tuning infrastructure market. Every large financial institution, healthcare system, legal firm, or government agency that recognizes the value of its proprietary labeled data becomes a potential Tinker customer. The Bridgewater collaboration validates the thesis at one of the most analytically rigorous institutions in the world. That reference point matters enormously for enterprise sales.

IMPLICATION 3 — EXPERT LABELING BECOMES THE SCARCE ASSET

The research shows that naive prompting leaves roughly 35 points of accuracy on the floor relative to the custom model (about 50% versus 84.7%). Expert prompting recovers about 28 of those points, and fine-tuning on expert-labeled data recovers the remaining 6.5. This revalues the human expert not as a cost center to be replaced by AI, but as the source of the signal that makes a domain model actually work. Organizations that have systematically captured expert judgment in labeled datasets have a durable competitive asset. Those that have not face a build-vs-buy decision that grows more expensive every quarter they delay.

Who Gets Stronger / Weaker

Fine-Tuning Infrastructure (Tinker / Thinking Machines)

STRONGER

Validated proof of concept at a tier-1 institution. Enterprise sales motion now has a hard benchmark to point at.

Domain Expert Labor (Annotation, Labeling, Taxonomy)

STRONGER

Expert-labeled data is the rate-limiting input. The people who generate it become structurally more valuable, not less.

Frontier API Revenue (OpenAI / Google / Anthropic Enterprise)

WEAKER

Each enterprise that successfully builds a domain-specific alternative represents inference volume that leaves the frontier APIs.

Prompt Engineering as a Profession

MIXED


Sources: thinkingmachines.ai · bridgewater.com · x.com · x.com · blockchain.news


About The Author

Gennaro Cuofano
