OpenAI saved its biggest announcement for the last day of its 12-day "shipmas" event. On Friday, the company unveiled o3, the successor to the o1 "reasoning" model it released earlier this year. o3 is a model family, to be more precise, as was the case with o1: there's o3 and o3-mini, a smaller, distilled model fine-tuned for particular tasks.
OpenAI makes the remarkable claim that o3, at least in certain conditions, approaches AGI, with significant caveats. More on that below.
o3, our latest reasoning model, is a breakthrough, with a step-function improvement on our hardest benchmarks. We are now starting safety testing and red teaming. https://t.co/4XlK1iHxFK

— Greg Brockman (@gdb) December 20, 2024
Why is the new model called o3, not o2? Well, trademarks may be to blame. According to The Information, OpenAI skipped o2 to avoid a potential conflict with British telecom provider O2. CEO Sam Altman all but confirmed this during a livestream this morning. Strange world we live in, isn't it?
Neither o3 nor o3-mini is widely available yet, but safety researchers can sign up for a preview of o3-mini starting today. An o3 preview will arrive sometime later; OpenAI didn't say when. Altman said the plan is to launch o3-mini toward the end of January, with o3 to follow.
That's somewhat at odds with his recent statements. In an interview this week, Altman said that, before OpenAI releases new reasoning models, he would prefer a federal testing framework to guide the monitoring and mitigation of the risks those models pose.
And there are risks. AI safety testers have found that o1's reasoning abilities lead it to try to deceive human users at a higher rate than conventional, "non-reasoning" models, or, for that matter, leading AI models from Meta, Anthropic, and Google. It's possible that o3 attempts to deceive at an even higher rate than its predecessor; we'll find out once OpenAI's red-team partners publish their test results.
For what it's worth, OpenAI says it used a new technique, "deliberative alignment," to align models like o3 with its safety principles. (o1 was aligned the same way.) The company details the approach in a new study.
Reasoning steps
Unlike most AI models, reasoning models such as o3 effectively fact-check themselves, which helps them avoid some of the pitfalls that normally trip up models.
This fact-checking process incurs some latency. o3, like o1 before it, usually takes seconds to minutes longer to arrive at a solution than a typical non-reasoning model. The upside? It tends to be more reliable in domains such as physics, science, and math.
o3 was trained via reinforcement learning to "think" before responding, through what OpenAI describes as a "private chain of thought." The model can reason through a task and plan ahead, performing a series of actions over an extended period that help it work toward a solution.
We announced @OpenAI o1 just 3 months ago. Today, we announced o3. We have every reason to believe this trajectory will continue. pic.twitter.com/Ia0b63RXIk

— Noam Brown (@polynoamial) December 20, 2024
In practice, given a prompt, o3 pauses before responding, considering a number of related factors and "explaining" its reasoning along the way. After a while, the model summarizes what it considers to be the most accurate response.
New with o3, versus o1, is the ability to "adjust" the reasoning time. The models can be set to low, medium, or high compute (i.e., thinking time); the higher the compute, the better o3 performs on a task.
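In principle, the low/medium/high setting described above would be chosen per request. Here is a minimal sketch of what that might look like, assuming a `reasoning_effort` request parameter and a chat-completions-style payload; OpenAI had not published o3's API surface at the time of writing, so both the parameter name and the payload shape are assumptions for illustration:

```python
# Hypothetical sketch: building a request with an adjustable "thinking" budget.
# The "reasoning_effort" field and the "low"/"medium"/"high" values mirror the
# compute settings described in the article and are assumed, not confirmed.

def build_request(prompt: str, effort: str = "medium") -> dict:
    """Build a chat-completions-style payload with a chosen reasoning effort."""
    if effort not in {"low", "medium", "high"}:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    return {
        "model": "o3-mini",
        "reasoning_effort": effort,  # higher effort -> more time spent "thinking"
        "messages": [{"role": "user", "content": prompt}],
    }

# Actually sending the payload would require an SDK and API key, e.g.:
#   from openai import OpenAI
#   client = OpenAI()
#   response = client.chat.completions.create(**build_request("...", "high"))

payload = build_request("How many primes are below 100?", effort="high")
print(payload["reasoning_effort"])  # -> high
```

The trade-off is the one the article describes: a higher setting should buy reliability on hard tasks at the cost of latency (and, presumably, price).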
No matter how much compute they're given, though, reasoning models such as o3 aren't flawless. The reasoning component can reduce hallucinations and errors, but it doesn't eliminate them. o1 trips up on games of tic-tac-toe, for instance.
Benchmarks and AGI
One big question leading up to today was whether OpenAI might claim that its newest models approach AGI.
AGI, short for "artificial general intelligence," broadly refers to AI that can perform any task a human can. OpenAI has its own definition: "highly autonomous systems that outperform humans at most economically valuable work."
Achieving AGI would be a bold declaration, and it carries contractual weight for OpenAI, too. Under the terms of its deal with close partner and investor Microsoft, once OpenAI achieves AGI, it's no longer obligated to give Microsoft access to its most advanced technologies (those that meet OpenAI's AGI definition, that is).
Going by one benchmark, OpenAI is slowly inching closer to AGI. On ARC-AGI, a test designed to evaluate whether an AI system can efficiently acquire new skills outside of the data it was trained on, o3 scored 87.5% on the high compute setting. At its worst (on the low compute setting), the model tripled the performance of o1.
Granted, the high compute setting was exceedingly expensive, on the order of thousands of dollars per challenge, according to ARC-AGI co-creator François Chollet.
Today, OpenAI announced o3, its next-gen reasoning model. We've worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel tasks.

It scored 75.7% on the semi-private eval in low-compute mode ($20 per task… pic.twitter.com/ESQ9CNVCEA

— François Chollet (@fchollet) December 20, 2024
Chollet notes that o3 fails on "very easy tasks" in ARC-AGI, which in his view indicates that the model exhibits "fundamental differences" from human intelligence. He has previously cautioned about the evaluation's limitations and against using it as a measure of AI intelligence.
"[E]arly data points suggest that the upcoming [ARC-AGI] successor benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training)," Chollet continued. "You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible."
Incidentally, OpenAI says it will partner with the foundation behind ARC-AGI to help build the next generation of its AI benchmark, ARC-AGI 2.
On other tests, o3 blows away the competition.
The model outperforms o1 by 22.8 percentage points on SWE-Bench Verified, a benchmark focused on programming tasks, and achieves a Codeforces rating, another measure of coding skill, of 2727. (A 2400 rating places an engineer in the 99.2nd percentile.) o3 scores 96.7% on the 2024 American Invitational Mathematics Exam, missing just one question, and achieves 87.7% on GPQA Diamond, a set of graduate-level biology, physics, and chemistry questions. Finally, o3 sets a new record on EpochAI's Frontier Math benchmark, solving 25.2% of problems; no other model exceeds 2%.
We trained o3-mini: it outperforms o1-mini and is 4x faster end-to-end when accounting for reasoning tokens

with @ren_hongyu @shengjia_zhao & others pic.twitter.com/3Cujxy6yCU

— Kevin Lu (@_kevinlu) December 20, 2024
These claims should be taken with a grain of salt, of course; they come from OpenAI's internal evaluations. We'll have to wait and see how the model holds up to benchmarking by outside customers and organizations.
A new direction
Following the release of OpenAI's first reasoning models, there's been an explosion of reasoning models from rival AI companies, including Google. In early November, DeepSeek, an AI research firm funded by quant traders, launched a preview of its first reasoning model, DeepSeek-R1. That same month, Alibaba's Qwen team unveiled what it claimed was the first "open" challenger to o1 (in the sense that it can be downloaded, fine-tuned, and run locally).
What opened the reasoning model floodgates? Well, for one, the search for novel approaches to refine generative AI. As TechCrunch recently reported, "brute force" techniques to scale up models are no longer yielding the improvements they once did.
Not everyone is convinced that reasoning models are the best path forward. They tend to be expensive, for one, thanks to the large amount of computing power required to run them. And while they've performed well on benchmarks so far, it's not clear whether reasoning models can maintain this rate of progress.
Interestingly, the release of o3 comes as one of OpenAI's most accomplished scientists departs. Alec Radford, lead author of the academic paper that kicked off OpenAI's "GPT series" of generative AI models (that is, GPT-3, GPT-4, and so on), announced this week that he's leaving to pursue independent research.