After nearly two weeks of announcements, OpenAI concluded its 12 Days of OpenAI livestream series with a preview of its next-generation edge model. “Out of respect to our friends at Telefónica (owner of the O2 cellular network in Europe) and in the great tradition of OpenAI being very, very bad with names, it's called o3,” OpenAI CEO Sam Altman told viewers watching the show . announcement on YouTube.
The new model is not yet ready for public use. Instead, OpenAI is first making o3 available to researchers who need help with safety tests. OpenAI also announced the existence of o3-mini. Altman said the company plans to release this model “around the end of January” and the o3 “shortly thereafter.”
As you'd expect, the o3 offers improved performance over its predecessor, but how much better it is than the o1 is the main feature here. For example, after completion this year American Invitational Examination in Mathematicso3 achieved an accuracy rate of 96.7 percent. In contrast, o1 received a more modest rating of 83.3 percent. “This means o3 often only misses one question,” said Mark Chen, senior vice president of research at OpenAI. In fact, o3 performed so well on the usual battery of tests that OpenAI runs on its models that the company had to look for more complex tests to compare.
One of them ARK-AGIa test that tests an AI algorithm's ability to intuitively understand and learn on the spot. According to the creator of the test, a non-profit organization ARK AwardAn artificial intelligence system capable of successfully outperforming ARC-AGI would be a “major milestone on the path to artificial general intelligence.” Since its debut in 2019, no artificial intelligence model has been able to surpass ARC-AGI. The test consists of input/output questions that most people can solve intuitively. For example, in the example above, the correct answer would be to create squares from four polyominoes using dark blue blocks.
In Low Compute mode, the o3 scored 75.7 percent in the test. Thanks to the additional processing power, the model achieved a rating of 87.5 percent. “Human productivity is comparable to the 85 percent threshold, so exceeding this threshold is an important milestone,” says Greg Kamradt, president of the ARC Prize Foundation.
OpenAI also demonstrated o3-mini. The new model uses OpenAI's recently announced Adaptive Thinking Time API, offering three different reasoning modes: low, medium, and high. In practice, this allows users to control how long the software “thinks” about a problem before providing an answer. As you can see in the graph above, o3-mini can achieve comparable results to OpenAI's current reasoning model o1, but at a fraction of the computational cost. As mentioned, the o3-mini will go into public use before the o3.