Last month, AI founders and investors told TechCrunch that we're now in the “second era of scaling laws,” noting how established methods of improving AI models are showing diminishing returns. One promising new method they suggested could keep gains coming is “test-time scaling,” which seems to be what's behind the performance of OpenAI's o3 model, though it comes with drawbacks of its own.
Many in the AI world took the announcement of OpenAI's o3 model as proof that progress in AI scaling has not “hit a wall.” The o3 model significantly outperformed all other models on a test of general ability called ARC-AGI, and it scored 25% on a difficult math test on which no other AI model scored more than 2%.
Of course, we at TechCrunch are taking all of this with a grain of salt until we can test o3 ourselves (very few have tried it so far). But even before o3 came out, I was already convinced that something big had changed in the world of AI.
Noam Brown, co-creator of OpenAI's o-series models, noted on Friday that the startup announced o3's impressive gains just three months after announcing o1, a remarkably short time frame for such a jump in performance.
“We have every reason to believe this trend will continue,” Brown said in a tweet.
Anthropic co-founder Jack Clark said in a blog post that o3 is evidence that AI “progress will be faster in 2025 than in 2024.” (Note that it's in Clark's interest to suggest AI is still operating on scaling laws, since that benefits Anthropic, particularly its ability to raise capital, even if he's talking up a competitor's model.)
Clark says the AI world will couple test-time scaling with traditional pre-training scaling methods in the coming year to wring even more gains out of AI models. Perhaps he's suggesting that Anthropic and other AI model providers will release reasoning models of their own in 2025, just as Google did last week.
Test-time scaling means that OpenAI is using more compute during ChatGPT's inference phase, the period after you press enter on a prompt. It's not exactly clear what's happening behind the scenes: OpenAI is either using more computer chips to answer a user's question, using more powerful inference chips, or running those chips for longer periods, 10 to 15 minutes in some cases, before the AI produces an answer. We don't know all the details of how o3 was created, but these benchmarks are early signs that test-time scaling may work to improve the performance of AI models.
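OpenAI hasn't said exactly how o3 spends that extra inference-time compute, but one commonly discussed form of test-time scaling is self-consistency, or best-of-N sampling: ask the model the same question many times and keep the most common answer. The sketch below is a minimal illustration of that idea, with a hypothetical `query_model` stand-in; it is not a description of o3's actual internals.

```python
import random
from collections import Counter

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a sampled LLM call (temperature > 0).

    Simulates a model that answers correctly 70% of the time.
    """
    return "42" if random.random() < 0.7 else random.choice(["41", "43"])

def answer_with_test_time_compute(prompt: str, n_samples: int) -> str:
    """Spend more inference-time compute: sample the model n_samples
    times and return the majority answer (self-consistency voting)."""
    votes = Counter(query_model(prompt) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

# More samples means more test-time compute, and usually a more
# reliable answer, which is why cost scales along with quality.
print(answer_with_test_time_compute("What is 6 * 7?", n_samples=1))    # noisy
print(answer_with_test_time_compute("What is 6 * 7?", n_samples=101))  # almost always "42"
```

The trade-off is visible right in the loop: every extra sample is another full model call, so a 101-vote answer costs roughly 100 times as much as a single-shot one.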
While o3 may give some people renewed faith in the progress of AI scaling laws, OpenAI's newest model also comes with a steeper price per answer.
“Perhaps the only important caveat here is understanding that one reason why o3 is so much better is that it costs more money to run at inference time: the ability to use test-time compute means that on some problems you can turn compute into a better answer,” Clark wrote on his blog. “This is interesting because it has made the cost of running AI systems less predictable; previously, you could work out how much it cost to serve a model just by looking at the model itself and the cost to generate a given output.”
Clark and others pointed to o3's performance on the ARC-AGI benchmark, a difficult test used to assess breakthroughs toward AGI, as an indicator of its progress. It's worth noting that passing this test, according to its creators, does not mean an AI model has achieved AGI; rather, it's one way to measure progress toward that nebulous goal. That said, the o3 model scored 88% on one of its attempts, blowing away the scores of all previous AI models that took the test. OpenAI's next best AI model, o1, scored just 32%.
[Chart: ARC-AGI scores plotted against compute cost per task, with a logarithmic x-axis]
However, the logarithmic x-axis on that chart may be alarming to some. The high-scoring version of o3 used more than $1,000 worth of compute for every task. The o1 models used around $5 of compute per task, and o1-mini used just a few cents.
In a blog post, François Chollet, creator of the ARC-AGI benchmark, writes that OpenAI used roughly 170x more compute to generate that 88% score than it did for a high-efficiency version of o3 that scored just 12% lower. The high-scoring version of o3 used more than $10,000 in resources to complete the test, which makes it too expensive to compete for the ARC Prize, an unbeaten competition for AI models to beat the ARC test.
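To make the trade-off concrete, here's the back-of-the-envelope arithmetic implied by the figures above; the scores are approximate and the 170x multiplier is Chollet's estimate, so treat this as an illustration rather than official pricing.

```python
# Rough arithmetic behind the diminishing returns described above.
# Assumptions: scores are approximate; 170x is Chollet's compute estimate.
efficient_score = 0.76      # high-efficiency o3 (about 12 points lower)
high_compute_score = 0.88   # high-scoring o3 run
compute_multiplier = 170    # extra compute used for the high-scoring run

gain = high_compute_score - efficient_score
print(f"~{compute_multiplier}x the compute bought {gain:.0%} more accuracy")
# -> ~170x the compute bought 12% more accuracy
```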
However, Chollet says o3 is still a breakthrough for AI models.
“o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain,” Chollet said in the blog post. “Of course, such generality comes at a steep cost, and wouldn't quite be economical yet: you could pay a human to solve ARC-AGI tasks for roughly $5 per task (we know, we've done it), while consuming mere cents in energy.”
It's premature to harp on the exact pricing of all this. We've seen prices for AI models plummet in the last year, and OpenAI has yet to announce how much o3 will actually cost. However, these figures indicate just how much compute is required to break, even slightly, past the performance barriers set by today's leading AI models.
This raises some questions. What is o3 actually for? And how much more compute will be necessary to run inference with o4, o5, or whatever else OpenAI calls its next-generation reasoning models?
It doesn't seem like o3, or its successors, would be anyone's “daily driver” the way GPT-4o or Google Search might be. These models simply use too much compute to answer small questions throughout your day, such as “How can the Cleveland Browns make the 2024 playoffs?”
Instead, AI models with scaled test-time compute seem to be good only for big-picture prompts such as “How can the Cleveland Browns become a Super Bowl franchise by 2027?” Even then, the higher compute costs are probably only worth it if you're the general manager of the Cleveland Browns and you're using these tools to make big decisions.
As Wharton professor Ethan Mollick noted in a tweet, institutions with deep pockets may be the only ones that can afford o3, at least to start.
We've already seen OpenAI release a $200 tier to use a high-compute version of o1, but the startup has reportedly weighed creating subscription plans costing as much as $2,000. When you see how much compute o3 uses, you can understand why OpenAI would consider it.
But there are drawbacks to using o3 for high-impact work. As Chollet notes, o3 is not AGI, and it still fails at some very easy tasks that a human would handle with ease.
That's not necessarily surprising, since large language models still have a serious hallucination problem, one that o3 and test-time compute don't seem to have solved. That's why ChatGPT and Gemini include disclaimers below every answer they generate, asking users not to trust the answers at face value. Presumably, if AGI is ever reached, it would not need such a disclaimer.
One way to unlock more gains in test-time scaling could be better AI inference chips. There's no shortage of startups tackling just this, such as Groq and Cerebras, while other startups, such as MatX, are designing more cost-efficient AI chips. Andreessen Horowitz general partner Anjney Midha previously told TechCrunch that he expects these startups to play a bigger role in test-time scaling going forward.
While o3 is a notable improvement in the performance of AI models, it raises several new questions around usage and costs. That said, o3's performance does lend credence to the claim that test-time compute is the tech industry's next best way to scale AI models.