OpenAI's o3 shows remarkable progress on ARC-AGI, sparking debate on AI reasoning




OpenAI's latest o3 model has achieved a breakthrough that has surprised the AI research community. o3 scored an unprecedented 75.7% on the extremely difficult ARC-AGI benchmark under standard compute conditions, with a high-compute version reaching 87.5%.

While o3's performance on ARC-AGI is impressive, it does not yet prove that the code to artificial general intelligence (AGI) has been cracked.

The Abstraction and Reasoning Corpus

The ARC-AGI benchmark is based on the Abstraction and Reasoning Corpus (ARC), which tests the ability of an AI system to adapt to novel tasks and demonstrate fluid intelligence. ARC consists of a set of visual puzzles that require understanding of basic concepts such as objects, boundaries and spatial relationships. While humans can easily solve ARC puzzles with very little exposure, current AI systems struggle with them. ARC has long been considered one of the most challenging benchmarks in AI.

An example of an ARC puzzle (source: arcprize.org)

ARC was designed so that it cannot be gamed by training models on millions of examples in the hope of covering every possible combination of puzzles.

The benchmark is composed of a public training set of 400 simple examples. The training set is complemented by a public evaluation set that contains 400 more challenging puzzles as a means of assessing the generalizability of AI systems. The ARC-AGI Challenge also contains private and semi-private test sets of 100 puzzles each, which are not shared with the public. They are used to evaluate candidate AI systems without running the risk of leaking the data to the public and contaminating future systems with prior knowledge. Furthermore, the competition sets limits on the amount of compute that participants can use, to ensure that the puzzles are not solved through brute-force methods.

Success in solving novel tasks

o1-preview and o1 scored a maximum of 32% on ARC-AGI. Another method, developed by researcher Jeremy Berman, used a hybrid approach combining Claude 3.5 Sonnet with genetic algorithms and a code interpreter to achieve 53%, the highest score before o3.

In a blog post, François Chollet, the creator of ARC, described o3's performance as "a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models."

It is important to note that these results could not have been achieved by applying more compute to previous generations of models. For context, it took four years for models to progress from 0% with GPT-3 in 2020 to just 5% with GPT-4o in early 2024. While we don't know much about o3's architecture, we can be confident that it is not orders of magnitude larger than its predecessors.

Performance of different models on ARC-AGI (source: arcprize.org)

"This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs," Chollet wrote. "o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain."

It's worth noting that o3's performance on ARC-AGI comes at a steep cost. On the low-compute configuration, it costs the model $17 to $20 and 33 million tokens to solve each puzzle, while on the high-compute budget, the model uses around 172X more compute and billions of tokens per problem. However, as inference costs continue to fall, we can expect these figures to become more reasonable.

A new paradigm in LLM reasoning?

The key to solving novel problems is what Chollet and other scientists refer to as "program synthesis." A thinking system should be able to develop small programs for solving very specific problems, then combine these programs to tackle more complex problems. Classic language models have absorbed a lot of knowledge and contain a rich set of internal programs. But they lack compositionality, which prevents them from figuring out puzzles that lie outside their training distribution.

Unfortunately, there is very little information about how o3 works under the hood, and here the opinions of scientists diverge. Chollet speculates that o3 uses a type of program synthesis that combines chain-of-thought (CoT) reasoning and a search mechanism with a reward model that evaluates and refines solutions as the model generates tokens. This is similar to what open-source reasoning models have been exploring in the past few months.

Other experts, such as Nathan Lambert from the Allen Institute for AI, suggest that "o1 and o3 can actually be just the forward passes from one language model." On the day o3 was announced, Nat McAleese, a researcher at OpenAI, posted on X that o1 was "just an LLM trained with RL. o3 is powered by further scaling up RL beyond o1."

On the same day, Denny Zhou from Google DeepMind's reasoning team called the combination of search and current reinforcement learning approaches a "dead end."

"The most beautiful thing about LLM reasoning is that the thought process is generated in an autoregressive way, rather than relying on search (e.g. MCTS) over the generation space, whether by a well-finetuned model or a carefully designed prompt," he posted on X.

While the details of how o3 reasons might seem trivial in comparison to the breakthrough on ARC-AGI, they could very well define the next paradigm shift in training LLMs. There is an ongoing debate on whether the laws of scaling LLMs through training data and compute have hit a wall. Whether test-time scaling depends on better training data or different inference architectures can determine the next path forward.

Not AGI

The name ARC-AGI is misleading, and some have equated it with solving AGI. However, Chollet stresses that "ARC-AGI is not an acid test for AGI."

"Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet," he writes. "o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence."

Moreover, he notes that o3 cannot learn these skills autonomously and relies on external verifiers during inference and human-labeled reasoning chains during training.

Other scientists have pointed to flaws in OpenAI's reported results. For example, the model was fine-tuned on the ARC training set to achieve its results. "The solver should not need much specific 'training,' either on the domain itself or on each specific task," writes scientist Melanie Mitchell.

To verify whether these models possess the kind of abstraction and reasoning the ARC benchmark was created to measure, Mitchell proposes "seeing if these systems can adapt to variants of specific tasks or to reasoning tasks using the same concepts, but in domains other than ARC."

Chollet and his team are currently working on a new benchmark that is challenging for o3, potentially reducing its score to under 30% even at a high-compute budget. Meanwhile, humans would be able to solve 95% of the puzzles without any training.

“You'll know AGI is here when it becomes impossible to create tasks that are easy for normal humans but difficult for AI,” Chollet writes.


