OpenAI Is Losing & Will Continue to Lose
Author: Jacob Sussmilch
The Thesis
OpenAI is losing the AI race because building the best model is no longer their primary activity. It can’t be. The company is simultaneously managing a cost structure that burns billions more than it earns, navigating legal exposure that potentially exceeds its valuation, executing a nonprofit-to-for-profit conversion, managing a Microsoft dependency, pursuing military contracts, preparing an IPO, and sustaining the funding cycle that finances continued operation. Model development — the thing that’s supposed to justify all of this — has been crowded out by the structural demands surrounding it.
Their 730 billion USD valuation is not primarily a reflection of model quality. It is sustained by investment from infrastructure suppliers who profit from the spending the investment enables, and by the geopolitical framing of AI as a strategic competition where the cost of falling behind is treated as unbounded. The models justify the valuation narrative. They are not the primary driver of it.
This matters because AI development is not approaching its conclusion — it is at its beginning. The narrative can substitute for capability in the short term, but switching costs are minimal and brand loyalty in AI is negligible. The companies that compound research advantages now will dominate the decades ahead. OpenAI is trading long-term research position for short-term narrative maintenance, and in a race this early, that trade is fatal — especially because the research-first competitors are not just theoretically positioned to win. They are already winning. Anthropic is profitable, ships the most performant model on key reliability metrics, and maintains simultaneous depth in alignment and safety research. DeepSeek publishes papers that reshape how the field thinks about efficient training. Google DeepMind continues to produce foundational work. The last OpenAI paper that materially shifted the research conversation was InstructGPT in March 2022 — nearly four years ago. The company that demonstrated what scaled transformers could do — GPT-3’s in-context learning, RLHF’s alignment breakthrough — now publishes benchmark results without methods and treats architecture as a trade secret. It has gone from research leader to research consumer, negotiating GPU contracts and preparing an IPO while the labs it once led set the pace.
The Strategic Error
The model scaling vs. compute scaling distinction is the lens through which OpenAI’s trajectory becomes legible. The scaling laws describe a relationship between model size, data, and loss — bigger models trained on more data produce predictably lower loss. Compute is the cost of reaching a point on that curve. The laws tell you where to aim. They say nothing about how to get there cheaply.
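The point is easier to see with the law written out. Here is a minimal sketch of the Chinchilla-style parametric form (Hoffmann et al., 2022), where the constants are empirical fits to training runs:

```latex
% Chinchilla-style parametric scaling law (Hoffmann et al., 2022).
% N = model parameters, D = training tokens.
% E, A, B, \alpha, \beta are constants fitted to observed runs.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% Compute enters only as an approximate budget constraint:
C \approx 6ND
```

Nothing on the right-hand side prices the compute. C is the bill for reaching a chosen point (N, D); efficiency research is the work of shrinking that bill without moving the point.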
OpenAI heard “spend more on compute.” They focused on what it costs to climb the curve (how much compute you need) rather than on what defines it (the relationship between model capacity and performance). They bought their way up the curve rather than innovating their way up it. DeepSeek demonstrated how expensive this choice was, reaching comparable points on the scaling curve at a fraction of the cost through MLA, Mixture of Experts, GRPO, and PTX-level optimization. The gap between what OpenAI spent and what efficient methods would have required represents foregone margin.
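One of those methods is compact enough to sketch. Below is a minimal illustration of GRPO’s central move, group-relative advantage estimation as described in the DeepSeekMath paper; the code and variable names are mine, not DeepSeek’s:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: normalize each sampled response's
    reward against its own group's mean and std. No critic network."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# One prompt, a group of G = 4 sampled responses, scored by a reward model.
group_rewards = np.array([0.1, 0.7, 0.4, 0.9])
print(grpo_advantages(group_rewards))
# Above-mean responses get positive advantage and are reinforced.
# PPO would estimate the same baseline with a separate learned value
# model, typically comparable in size to the policy; deleting that
# network is where much of the training savings comes from.
```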
The scaling laws tell you a 10x larger model will have predictably lower loss. They don’t tell you it will generate 10x more revenue. OpenAI bet that capability would translate to willingness-to-pay at a rate that outpaces cost. There is no scaling law for that.
The Model Quality Problem
GPT-5 (August 2025) made the consequences visible. On SimpleBench it scored 56.7% and placed fifth. GPT-5.2 scored below Claude 3.7 Sonnet — a model nearly a year older. Gary Marcus called it “overdue, overhyped and underwhelming.” MIT Technology Review assessed it as “a refined product” rather than “a major technological advancement.” The Register called it “a cost cutting exercise.”
The pattern has been called “RL sloptimization”: aggressive reinforcement learning on high-profile benchmarks that creates blind spots on practical tasks. GPT-5 stumbled on basic summarization that GPT-4o could handle. The model performed well on the metrics that attract attention but regressed on the everyday tasks that determine user trust — a predictable consequence of optimizing for benchmark visibility over broad reliability.
The reliability gap is chronic and measurable. BullshitBench — testing whether models push back on nonsensical questions — shows GPT-5.4 at 42% pushback, o3 at 26%, GPT-5 at 21%. Claude Sonnet 4.6 scores 91%. The top seven spots are all Anthropic. On PersonQA, o3 hallucinates at 33% (double o1’s 16%). o4-mini hallucinates at 48%. Additional reasoning compute doesn’t fix this — o3 uses extra inference to elaborate on flawed premises rather than recognizing them as flawed.
The sycophancy crisis connects directly. In April 2025, a GPT-4o update endorsed harmful and delusional statements after OpenAI added a reward signal based on user thumbs-up/down that optimized for gratification over truthfulness. This was not a generic quality issue — it was a training methodology that produced exactly the behavior it was designed to reinforce. When engagement metrics serve as a proxy for quality, the system optimizes for engagement. The sycophancy crisis is now evidence in active wrongful death litigation.
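The mechanism is simple enough to sketch. The following is a hypothetical illustration of what blending a user-approval proxy into the reward does to the optimum; the function, weights, and numbers are my assumptions, not OpenAI’s training code:

```python
def blended_reward(truthfulness: float, predicted_approval: float,
                   approval_weight: float) -> float:
    """Hypothetical reward: a quality signal plus a weighted
    thumbs-up-probability proxy. Illustrative only."""
    return truthfulness + approval_weight * predicted_approval

# A response that corrects the user vs. one that validates a false premise.
for w in (0.2, 1.5):
    corrective = blended_reward(truthfulness=0.9, predicted_approval=0.4,
                                approval_weight=w)
    sycophantic = blended_reward(truthfulness=0.3, predicted_approval=0.95,
                                 approval_weight=w)
    print(f"weight={w}: corrective={corrective:.2f}, "
          f"sycophantic={sycophantic:.2f}")
# At weight 0.2 the corrective response wins; at 1.5 the sycophantic
# one does. Past some weight on approval, the optimizer prefers
# validation over correction, by construction.
```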
The joint Anthropic-OpenAI safety evaluation (August 2025) reinforced the pattern: GPT-4o, GPT-4.1, and o4-mini were “much more willing to cooperate with harmful requests including drug synthesis, bioweapons development, and terrorist attack planning.” The emphasis on responsiveness and engagement has come at the cost of reliability. The research investment required for genuinely reliable models has been deprioritized relative to scaling and shipping.
The Cost Structure Trap
You can change strategy. You can’t easily change cost structure.
| Year | Revenue | Loss / Cash Burn | Gross Margin |
|---|---|---|---|
| 2024 | 3.7B USD | ~5B USD loss | ~40% |
| 2025 | 13.1B USD | ~8B USD burn | 33% |
| 2026 (proj.) | ~30B USD | ~14B USD loss | Declining |
| 2027 (proj.) | ~62B USD | ~35B USD burn | — |
Margins are compressing (40% → 33%), not expanding. Unlike traditional software, each query costs real compute. Revenue grew 3.5x; losses grew with it. Profitability is projected for 2029-2030, requiring ~8x revenue growth while bending a cost curve that has moved in the wrong direction. Cumulative negative free cash flow through 2029: ~143 billion USD. Training costs through 2030: ~440 billion USD.
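One ratio derived from the table makes “moved in the wrong direction” precise. This is plain arithmetic on the figures above; 2026 and 2027 are the company’s projections, not actuals:

```python
# Burn as a share of revenue, from the table above (billions USD).
years = {2024: (3.7, 5.0), 2025: (13.1, 8.0),
         2026: (30.0, 14.0), 2027: (62.0, 35.0)}
for year, (revenue, burn) in years.items():
    print(year, f"burn = {burn / revenue:.0%} of revenue")
# 135% -> 61% -> 47% -> 56%: the ratio improves, then reverses in the
# company's own projections. Scale alone is not closing the gap.
```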
OpenAI has raised ~168 billion USD across 11 rounds. The February 2026 round — 110 billion USD at 730 billion USD pre-money, led by Amazon (50B USD), Nvidia (30B USD), SoftBank (30B USD) — was the largest private round in history. The structure of this funding is worth examining. Amazon sells the compute infrastructure OpenAI buys. Nvidia sells the GPUs. SoftBank provides capital at valuations that justify its own portfolio thesis. The money flows in as investment and back out as infrastructure spending to the same parties. The 730 billion USD reflects less a conventional market judgment than an expected value agreed upon by participants with aligned incentives. Amazon’s 50 billion USD functions as a customer acquisition cost for AI workloads. Nvidia’s 30 billion USD functions as a forward payment on GPU orders.
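A toy calculation makes the circularity concrete. Only the 110 billion USD round size below comes from the reported figures; the flowback split is invented for illustration, since the actual allocation is not public in these terms:

```python
# Hypothetical: how much of a round is net-new capital vs. round-tripped
# back to investors as spending. All splits are invented.
round_size_b = 110.0
hypothetical_flowback_b = {
    "compute spend to the cloud investor": 40.0,  # invented
    "GPU orders to the chip investor": 25.0,      # invented
}
net_external_b = round_size_b - sum(hypothetical_flowback_b.values())
print(f"net external capital: {net_external_b:.0f}B of {round_size_b:.0f}B")
# Under these assumptions, under half the round is capital from parties
# with no revenue stake in how the money gets spent.
```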
This creates a self-reinforcing dynamic: larger rounds to fund accelerating burn → higher valuations to avoid dilution → more aggressive promises to justify the valuation → more spending to pursue the promises. The company cannot easily moderate without triggering a correction that cascades through its capital structure.
Normal market dynamics would eventually discipline this loop. But AI is not subject to normal market dynamics — it sits at the intersection of the only two resources with no obvious ceiling: intelligence and insecurity. When AI is framed as both the threat and the defence, the usual pressure to demonstrate returns is suspended. Defence budgets become AI budgets. OpenAI doesn’t need to be profitable if it’s positioned as critical infrastructure in a civilisational competition. The circular funding works because participants aren’t evaluating a business — they’re buying a position in a race where falling behind is framed as infinite cost.
Training costs are sunk but recurring — without efficiency research, each run costs what the last one cost, adjusted upward. Competitors drive down cost-per-capability. The gap compounds. Inference costs scale with users — and the GPT-5 rollout revealed how acutely OpenAI feels this. They deprecated all prior models and replaced them with an auto-router that defaulted to cheaper models, degrading quality for most queries. Altman admitted they “totally screwed up.” The auto-router functioned as margin management rather than product improvement — a sign that inference costs had grown too high relative to revenue per query. The response was to route users to cheaper models rather than making models cheaper to serve. The Microsoft dependency constrains strategic flexibility — pivoting to a compute-efficient strategy is harder when the infrastructure partner’s revenue depends on high compute consumption. Talent allocation reflects past strategy — the people who negotiate GPU contracts aren’t the people who design novel attention mechanisms.
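The auto-router episode described above is worth making concrete. Below is a hypothetical sketch of margin-driven routing; none of it is OpenAI’s actual logic, and every name and number is invented:

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    cost_per_query_usd: float  # invented figures
    quality: float             # abstract score, invented

CHEAP = Model("small-model", cost_per_query_usd=0.002, quality=0.70)
FRONTIER = Model("frontier-model", cost_per_query_usd=0.030, quality=0.92)

def route(estimated_difficulty: float, threshold: float) -> Model:
    """Send a query to the frontier model only if it looks hard enough.
    Raising the threshold cuts average cost and average quality
    together."""
    return FRONTIER if estimated_difficulty > threshold else CHEAP

# Under margin pressure, the knob that moves is the threshold, not the
# models' cost curves: more traffic quietly lands on the cheap model.
print(route(estimated_difficulty=0.3, threshold=0.5).name)  # small-model
```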
The Organizational Problem
The nonprofit-to-for-profit conversion (October 2025) suggests the economics require a different corporate structure — if the business generated sufficient returns within the existing framework, the restructuring would be unnecessary. During the transition, OpenAI removed “safely” from its mission statement.
The coordination layer is strained. OpenAI tripled headcount from ~1,000 to 3,000+ in a single year (2024-2025), then surged to 7,216 by February 2026. Calvin French-Owen — ex-CTO of Segment, who spent a year at OpenAI working on Codex — was in the top 30% by tenure after twelve months. His account is illustrative: 3-4 parallel Codex prototypes were circulating internally, “cobbled together by a few individuals” without coordination. Multiple teams were assigned overlapping work without knowing about each other. The company calls this “bottoms-up culture.” At 1,000 people building ChatGPT, this was productive autonomy. At 7,000 people managing a product portfolio, legal battles, a corporate restructuring, and an IPO, it produces significant duplication.
French-Owen’s assessment: “Everything breaks when you scale that quickly: how to communicate as a company, the reporting structures, how to ship product, how to manage and organize people, the hiring processes.” All communication runs through Slack — no internal email — creating constant bombardment at scale. No single “OpenAI experience” exists; teams operate on wildly different timescales and cultures. Reporting lines change frequently. The company reportedly responds to viral social media posts in ways that influence product decisions — a reactive dynamic at odds with the strategic coherence a 730 billion USD company requires.
Senior talent losses compound the coordination challenge. Twelve-plus executives and senior researchers left in 2025 alone, mostly to Meta’s Superintelligence Lab (which offered packages up to 300M USD over four years). CTO Mira Murati, Chief Research Officer Bob McGrew, VP Research Barret Zoph, co-founders Sutskever and Schulman — all gone. The Superalignment team, promised 20% of compute in 2023, was disbanded by 2024. Its successor, the Mission Alignment team, was itself disbanded in February 2026 after just 16 months. By early 2026, few of the people most associated with AI safety remained in positions of influence. Rapid hiring replaced senior institutional knowledge with volume — but a 7,000-person company where a single year of tenure puts you in the top 30% faces a significant institutional memory problem. Researchers report being “forced to do product.” One economics researcher left after his team served as a “de facto advocacy arm” rather than conducting genuine research. Eighty-hour weeks were common enough that OpenAI shut down the entire company for a week in July 2025 to address burnout.
Internal tensions are well-documented. Senior employees described Altman as “psychologically abusive” (Washington Post). Murati told staffers she didn’t feel “comfortable about Sam leading us to AGI.” Sutskever said: “I don’t think Sam is the guy who should have the finger on the button for AGI.” Altman claimed ignorance of equity clawback NDAs, though Vox obtained incorporation documents bearing his signature that authorized them. He advocated regulation in Senate testimony while lobbying to soften the EU AI Act. The December 2025 “Code Red” — an emergency 8-week sprint after Google’s Gemini 3 surpassed ChatGPT on benchmarks — postponed advertising, e-commerce, and agentic system initiatives. The pattern suggests reactive decision-making rather than strategic planning.
The military contracts illustrate the revenue pressure. Anthropic declined contracts involving fully autonomous armed drones and mass surveillance. OpenAI accepted them. The decision triggered user defection, a coordinated boycott, and reputational damage among the talent pool — demonstrating that switching costs are lower than OpenAI’s valuation assumes. Competitors released migration tools to make switching seamless.
Still Scaling Compute
Despite everything documented above, OpenAI has not changed its fundamental strategy. They have committed to 30 GW of compute infrastructure — a 1.4 trillion USD obligation — and over 500 billion USD in cloud capacity across multiple providers. They’ve added inference-time compute (“test-time compute”) as a second scaling axis, but this sits on top of the infrastructure spending, not instead of it. The scaling strategy hasn’t been replaced; it’s been supplemented, which makes the cost structure worse while adding no efficiency advantage.
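The per-unit arithmetic, using only the two figures stated above:

```python
# Implied cost per gigawatt of the stated commitment.
commitment_usd = 1.4e12  # 1.4 trillion USD obligation
capacity_gw = 30         # committed compute capacity
per_gw_b = commitment_usd / capacity_gw / 1e9
print(f"{per_gw_b:.1f}B USD per GW")
# ~46.7B USD per GW: each gigawatt of commitment costs roughly 3.5x
# OpenAI's entire 2025 revenue (13.1B USD).
```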
This is the original strategic error compounding in real time. The model scaling vs. compute scaling distinction predicted exactly this outcome: when you treat compute scaling as a procurement problem rather than a research problem, you lock yourself into ever-larger infrastructure commitments that your competitors can undercut through innovation. DeepSeek reached comparable capability for a fraction of the cost. Anthropic is producing measurably more reliable models without matching OpenAI’s infrastructure spending. The 1.4 trillion USD commitment is not a sign of strength — it’s the cost structure trap expressing itself as forward obligations. OpenAI is now contractually committed to the strategy that created the problem.
The company that should be researching how to make models cheaper is instead committing to the largest infrastructure buildout in history while its coordination layer strains, its senior talent departs, and its leadership manages legal battles, media campaigns, and military contracts.
This is the structural claim of this essay: OpenAI’s model development has become secondary to the organizational, financial, legal, and political demands surrounding it. The company functions less as a research lab than as a capital-raising, infrastructure-procuring, and narrative-sustaining organization that also ships models. The model quality trajectory reflects this shift in priorities.
The Legal Exposure
The litigation would be significant for a company with healthy fundamentals. For one burning 8 billion USD more than it earns per year, the stakes are amplified.
Musk v. OpenAI seeks 65.5B-134.5B USD in damages for breach of fiduciary duty, fraud, and antitrust violations. Jury trial: late April 2026. The discovery process has already surfaced internal documents showing Altman knew “Open” in OpenAI was misleading. Even a partial adverse judgment dwarfs annual revenue.
Over 70 copyright suits have been consolidated into multidistrict litigation in the Southern District of New York. The NYT suit alone seeks “billions.” The Authors Guild plaintiffs (George R.R. Martin, John Grisham, others) survived motions to dismiss. OpenAI’s licensing deals with some publishers implicitly concede that the unlicensed training was a problem, which strengthens the remaining plaintiffs’ hand.
At least seven wrongful death suits allege ChatGPT drove users to suicide. In Raine v. OpenAI, ChatGPT mentioned suicide 1,275 times in a 16-year-old’s conversations, flagged 377 messages for self-harm, and never terminated a session. A separate case involves a murder-suicide: the first chatbot case involving a homicide, with Microsoft named as a co-defendant. The sycophancy crisis is evidence in active litigation.
Regulatory actions: Italy GDPR fine (EUR 15M), FTC investigation into Microsoft-OpenAI, UK CMA investigation, antitrust class action, California AG investigation of the nonprofit conversion. Whistleblower complaints: SEC filings alleging illegal NDAs, equity clawback threats for departing employees, former staff amicus brief opposing the conversion. Suchir Balaji — the former researcher who publicly accused OpenAI of copyright violations — was found dead at 26, leading to the AI Whistleblower Protection Act.
Combined exposure — Musk (134.5B USD), NYT (“billions”), copyright MDL (tens of billions), wrongful death (hundreds of millions), plus regulatory actions — potentially exceeds the valuation. Any major adverse judgment could break the narrative sustaining the capital cycle.
The Hype Cycle as Survival Mechanism
OpenAI’s messaging pattern is well-documented (MIT Technology Review, “A brief history of Sam Altman’s hype,” December 2025). The pattern is better understood as structurally driven than as a personal tendency. When a company burns 8 billion USD more than it earns and needs 110 billion USD in a single round, conservative public positioning is difficult to sustain.
The hype cycle is not just a communications problem — it feeds back into model quality. Ambitious promises create pressure to optimize for demos and benchmarks. Benchmark-oriented training degrades reliability on practical tasks. User disappointment demands more ambitious promises to sustain the narrative. The April 2025 sycophancy crisis is this loop expressed as a training methodology failure. The narrative is the product sold to capital markets. The models are the justification. The order matters.
The messaging record traces the pattern. Confident AGI predictions in January 2025 → AGI dismissed as a “pointless term” by August → pivot to “superintelligence.” GPT-5 promised as “PhD-level expertise” → users report worse responses than GPT-4o → Altman pivots to GPT-6. Senate testimony calling for “strong regulations” in 2023 → agreeing with Ted Cruz against “overregulation” by 2025. Gary Marcus calls it “The Sam Altman Playbook”: grand claims followed by quiet retreats, with the next claim queued before anyone dwells on the retreat.
The IPO as Inflection Point
OpenAI’s endgame is an IPO — reportedly one of the largest in history. At 730 billion USD, even a 5% float puts roughly 37 billion USD on the market. If OpenAI, SpaceX, and Anthropic all go public as rumoured, their collective offerings could raise more than the last decade of IPO activity combined.
The IPO is where every structural problem converges. Price discovery replaces negotiated valuation — the circular funding dynamic doesn’t work when buyers are retail investors, not infrastructure suppliers. The WeWork precedent is worth considering: bold promises, SoftBank backing, financials that required revision under public scrutiny. Disclosure requirements constrain the hype — Altman’s messaging becomes subject to SEC regulation. Legal exposure becomes a disclosure obligation — every lawsuit must appear as a material risk factor in the S-1. Index fund rebalancing could force Vanguard, BlackRock, and State Street to sell down other holdings to weight a company that arrives near the top of indexes overnight, socializing the risk whether or not returns materialize.
Who Wins Instead
The framework from my Beyond the Bitter Lesson series predicts winners will:
- innovate across the stack rather than scaling one layer
- treat compute scaling as a research problem rather than procurement
- maintain research depth during deployment, and
- prioritize reliability over engagement — because models that push back on broken premises are more useful for enterprise applications where the real revenue lies.
Anthropic and DeepSeek exemplify this. Anthropic’s Constitutional AI, alignment research, and training methodology represent a Layer 4-5 strategy — the BullshitBench results suggest it’s working. DeepSeek’s cross-stack innovation under hardware constraints is the most complete expression of multi-layer optimization the field has seen. Neither is outspending OpenAI. Both are out-innovating in ways that compound — architectural breakthroughs reduce the cost of the next training run, which frees resources for the next architectural breakthrough, while OpenAI’s infrastructure commitments lock in the cost of each successive run.
OpenAI’s enterprise market share has declined from ~50% (2023) to ~27% (2026).
The reasoning “pivot” is table stakes — every major lab has reasoning models — and the BullshitBench scores suggest that additional inference compute may be making OpenAI’s reliability problem worse, not better. The early brand advantage has eroded. The technology lead has narrowed or reversed on key metrics. What remains is a cost structure that requires 143 billion USD in negative cash flow, legal exposure that may exceed the valuation, and an IPO where the current funding dynamic meets real price discovery.
OpenAI built an organization around expensive capability delivery. The cost of capability is dropping. And model quality — the one thing that could justify everything else — has become secondary to the organizational demands of sustaining the company at its current scale.