Hugging Face shows how test time scaling helps small language models punch above their weight




In a new case study, Hugging Face researchers have shown how small language models (SLMs) can be configured to outperform much larger models. Their results show that a Llama 3 model with 3B parameters can outperform the 70B version of the model on complex math problems.

Hugging Face has fully documented the entire process and provides a roadmap for enterprises that want to create their own customized reasoning models.

Image source: Hugging Face

Scaling test-time compute

The work is inspired by OpenAI o1, which uses extra “thinking” to solve complex math, coding and reasoning problems.

The main idea behind models like o1 is to scale “test-time compute,” which effectively means using more compute cycles during inference to test and verify different answers and reasoning paths before producing the final answer. Scaling test-time compute is especially useful when there is not enough memory to run a large model.

Because o1 is a private model and OpenAI has remained tight-lipped about its inner workings, researchers have been speculating about how it works and trying to reverse-engineer it. There are already several open alternatives to o1.

The work of Hugging Face is based on a DeepMind study published in August, which examines the trade-off between inference-time and pre-training compute. The study provides comprehensive guidance on how to balance training and inference compute to get the best results for a fixed budget.

In addition to using extra inference-time compute, the success of the approach hinges on two main components: a reward model that evaluates the SLM's answers, and a search algorithm that optimizes the path it takes to refine those answers.

Image source: Hugging Face

Different reasoning algorithms

The simplest way to use test-time scaling is “majority voting,” in which the same prompt is sent to the model multiple times and the most common answer is chosen. Majority voting can be useful on simple problems, but its gains quickly plateau on complex reasoning problems or tasks where errors are consistent across generations.
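In rough Python terms, majority voting boils down to sampling the same prompt several times and keeping the most frequent answer. The sketch below is purely illustrative: generate_answer is a hypothetical stand-in for a real sampled call to the small model, not code from the Hugging Face study.

```python
import random
from collections import Counter

def generate_answer(prompt: str) -> str:
    # Stand-in for a sampled completion from the small model; a real setup
    # would call the LLM with sampling (temperature > 0) so answers vary.
    return random.choice(["42", "42", "41", "43"])  # toy answer distribution

def majority_vote(prompt: str, n_samples: int = 16) -> str:
    """Send the same prompt n_samples times and keep the most frequent answer."""
    answers = [generate_answer(prompt) for _ in range(n_samples)]
    answer, _count = Counter(answers).most_common(1)[0]
    return answer

print(majority_vote("What is 6 * 7?"))
```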

A more sophisticated method is “Best-of-N.” Here, the SLM generates several answers, but instead of a majority vote, a reward model is used to evaluate the answers and select the best one. “Weighted Best-of-N,” a refinement of this method, factors in consistency to select answers that are both confident and occur more often than others.
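The difference between the two variants can be sketched in a few lines of Python. The reward_model_score callable below is a placeholder for whatever learned reward model is used; it is assumed to return higher scores for better answers, and candidates would be N answers sampled from the SLM for the same prompt.

```python
from collections import defaultdict
from typing import Callable

def best_of_n(candidates: list[str], reward_model_score: Callable[[str], float]) -> str:
    """Plain Best-of-N: keep the single answer the reward model rates highest."""
    return max(candidates, key=reward_model_score)

def weighted_best_of_n(candidates: list[str], reward_model_score: Callable[[str], float]) -> str:
    """Weighted Best-of-N: sum reward scores over identical answers, so answers
    that are both high-scoring and frequently generated come out on top."""
    totals: dict[str, float] = defaultdict(float)
    for answer in candidates:
        totals[answer] += reward_model_score(answer)
    return max(totals, key=totals.get)
```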

The researchers used a “process reward model” (PRM) that scores the SLM's answer not only on the final result but also on the intermediate steps it goes through to reach it. Their tests showed that Weighted Best-of-N and PRMs brought the Llama-3.2 1B near the level of the Llama-3.2 8B on the difficult MATH-500 benchmark.
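A PRM scores each intermediate reasoning step, and those step scores then have to be folded into a single score for the whole solution. The sketch below assumes a hypothetical prm_step_score callable and aggregates with the minimum step score; that aggregation rule is an illustrative choice, not necessarily the one used in the study.

```python
from typing import Callable

def score_solution(steps: list[str], prm_step_score: Callable[[list[str]], float]) -> float:
    """Score a multi-step solution with a process reward model (PRM).

    prm_step_score(steps[:i + 1]) is assumed to return the PRM's score for
    step i given the steps that precede it. Taking the minimum means the
    weakest step bounds the whole solution; the product of step scores is
    another common aggregation.
    """
    step_scores = [prm_step_score(steps[: i + 1]) for i in range(len(steps))]
    return min(step_scores)
```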

Image source: Hugging Face

To further improve the model's performance, the researchers added search algorithms to the model's reasoning process. Instead of generating the answer in one pass, they used “beam search,” an algorithm that guides the model's answer generation step by step.

At each step, the SLM generates several partial answers. The search algorithm uses the reward model to evaluate them and select a subset worth exploring further. The process is repeated until the model exhausts its inference budget or reaches the correct answer. This way, the inference budget can be focused on the most promising answers.
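One way this loop could be wired up is sketched below. The extend_step and prm_score arguments are hypothetical placeholders for a model call that adds one reasoning step and a PRM scoring call; a real implementation would also stop early once a candidate reaches a complete answer or the compute budget runs out.

```python
import heapq

def beam_search(prompt, extend_step, prm_score, beam_width=4, expansions=4, max_steps=8):
    """Reward-model-guided beam search over partial solutions.

    extend_step(prompt, partial) -> partial solution with one more reasoning step
    prm_score(partial)           -> reward model's score for the partial solution
    """
    beams = [[]]  # each beam is a list of reasoning steps generated so far
    for _ in range(max_steps):
        candidates = []
        for partial in beams:
            # Sample several possible next steps for every surviving beam.
            candidates.extend(extend_step(prompt, partial) for _ in range(expansions))
        # Keep only the beam_width partial solutions the reward model rates highest.
        beams = heapq.nlargest(beam_width, candidates, key=prm_score)
    # Return the finished candidate with the best score under the reward model.
    return max(beams, key=prm_score)
```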

The researchers found that, although beam search improves the model's performance on complex problems, it tends to underperform other methods on simple problems. To address this challenge, they added two more elements to their inference strategy.

First was Diverse Verifier Tree Search (DVTS), a variant of beam search that ensures the SLM does not get stuck in spurious reasoning paths and diversifies its response branches. Second, they developed a “compute-optimal scaling strategy,” as proposed in the DeepMind paper, which dynamically selects the best test-time scaling strategy based on the difficulty of the input problem.
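A difficulty-aware switch of this kind could look something like the sketch below. The difficulty estimator and the mapping from difficulty to strategy are assumptions for illustration, not the exact recipe from the study.

```python
def solve(prompt, estimate_difficulty, strategies):
    """Route a problem to a test-time scaling strategy based on its estimated difficulty.

    estimate_difficulty(prompt) -> "easy" | "medium" | "hard", e.g. derived from
    how much a handful of quick samples agree with each other.
    strategies maps each difficulty level to a solver, such as majority voting
    for easy problems and beam search or DVTS for hard ones.
    """
    level = estimate_difficulty(prompt)
    return strategies[level](prompt)

# Illustrative wiring (an assumption, not the paper's exact configuration):
# strategies = {"easy": majority_vote, "medium": weighted_best_of_n_solver, "hard": dvts_solver}
```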

The combination of these techniques allowed the Llama-3.2 1B to punch above its weight and outperform the 8B model by a significant margin. They also found that the strategy scaled: when applied to the Llama-3.2 3B, it outperformed the much larger 70B model.

It is still not a perfect solution

Scaling test-time compute changes the cost dynamics of models. Enterprises now have the ability to choose where to allocate their compute resources. For example, if memory is tight or slower response times are tolerable, they can use a small model and spend more inference-time cycles to generate more accurate answers.

However, test-time scaling also has its limits. For example, in the Hugging Face experiments, the researchers used a specially trained Llama-3.1-8B model as the PRM, which requires running two models at the same time (even though this is still far more resource-efficient than the 70B model). The researchers admit that the holy grail of test-time scaling is “self-verification,” where the original model verifies its own answers rather than relying on an external verifier. This remains an open research area.

The test-time scaling method presented in this study is also limited to problems where the answer can be clearly evaluated, such as coding and math. Further research is needed to develop reward models and verifiers for subjective tasks such as creative writing and product design.

But what is clear is that test-time scaling has generated a lot of interest and activity, and we can expect more tools and techniques to emerge in the coming months. Enterprises would be wise to keep an eye on how the landscape develops.


