One of the most widely used techniques for making AI models more efficient has limits, and the industry may be fast approaching them.
In the context of AI, quantization refers to lowering the number of bits (the smallest units a computer can process) needed to represent information. Consider this analogy: when someone asks the time, you'd probably say “noon” rather than “oh, twelve hundred, one second and four milliseconds.” That's quantizing. Both answers are correct, but one is slightly more precise; how much precision you actually need depends on the context.
AI models consist of several components that can be quantized, in particular parameters, the internal variables models use to make predictions or decisions. This is convenient when you consider that models perform millions of calculations when they run: quantized models with fewer bits representing their parameters are less demanding mathematically, and therefore computationally. (To be clear, this is a different process from “distillation,” which is a more involved and selective pruning of parameters.)
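To get a rough feel for why the bit count matters, here is a back-of-the-envelope sketch in Python. The 7-billion-parameter count is an assumed, illustrative figure rather than the spec of any model mentioned here; the point is only that halving the bits per parameter roughly halves the memory (and, broadly, the compute) needed to serve a model.

```python
# Approximate memory needed just to store a model's parameters at
# different bit widths. The parameter count is illustrative only.
n_params = 7_000_000_000  # assumed 7-billion-parameter model

bytes_per_param = {
    "FP32 (32-bit floats)": 4,
    "FP16 (16-bit floats)": 2,
    "INT8 (8-bit integers)": 1,
}

for fmt, nbytes in bytes_per_param.items():
    print(f"{fmt}: ~{n_params * nbytes / 1e9:.0f} GB of weights")
```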
But quantization may have more trade-offs than previously assumed.
The ever-shrinking model
According to a study from researchers at Harvard, Stanford, MIT, Databricks and Carnegie Mellon, quantized models perform worse if the original, unquantized version was trained over a long period on lots of data. In other words, at a certain point it may actually be better to just train a smaller model rather than cook down a big one.
That could spell bad news for AI companies that train extremely large models (known to improve answer quality) and then quantize them in an effort to make them cheaper to serve.
The effects are already manifesting. A few months ago, developers and academics reported that quantizing Meta's Llama 3 model tended to be more harmful compared with other models, potentially because of the way it was trained.
“In my opinion, the number one cost for everyone in AI is and will continue to be inference, and our work shows one important way to reduce it will not work forever,” Harvard mathematics student Tanishq Kumar told TechCrunch.
Contrary to popular belief, AI model inference (running a model, as when ChatGPT answers a question) is often more expensive in aggregate than model training. Consider, for example, that Google spent an estimated $191 million to train one of its flagship Gemini models, certainly a princely sum. But if the company were to use that model to generate just 50-word answers to half of all Google Search queries, it would spend roughly $6 billion a year.
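For a sense of how an estimate like that is assembled, here is a hedged back-of-the-envelope calculation. The daily search volume and per-answer cost below are assumptions chosen purely so the arithmetic lands in the same ballpark; they are not figures from Google or from the study.

```python
# Rough annual inference-cost estimate. All inputs are assumptions
# for illustration, not reported figures.
searches_per_day = 8_500_000_000   # rough public estimates of daily Google searches
share_answered = 0.5               # "half of all Google Search queries"
cost_per_answer = 0.0037           # assumed cost (USD) of one 50-word generated answer

annual_cost = searches_per_day * 365 * share_answered * cost_per_answer
print(f"~${annual_cost / 1e9:.1f} billion per year")  # on the order of the $6B figure
```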
Major AI labs have embraced training models on massive datasets under the assumption that “scaling up,” increasing the amount of data and compute used in training, will lead to increasingly capable AI.
Meta, for example, trained Llama 3 on a set of 15 trillion tokens. (Tokens represent bits of raw data; 1 million tokens is equal to about 750,000 words.) The previous generation, Llama 2, was trained on “only” 2 trillion tokens. In early December, Meta released a new model, Llama 3.3 70B, which the company says “improves core performance at a significantly lower cost.”
Evidence suggests that scaling up eventually provides diminishing returns; Anthropic and Google reportedly trained enormous models recently that fell short of internal benchmark expectations. But there is little sign the industry is ready to meaningfully move away from these entrenched scaling approaches.
How precise, exactly?
So, if labs are reluctant to train models on smaller datasets, is there a way models could be made less susceptible to degradation? Possibly. Kumar says he and his co-authors found that training models in “low precision” can make them more robust. Bear with us for a moment as we dive in a bit.
“Precision” here refers to the number of digits a numerical data type can represent accurately. Data types are collections of data values, usually specified by a set of possible values and allowed operations; the data type FP8, for example, uses only 8 bits to represent a floating-point number.
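As a quick illustration of what fewer bits buy and cost you, here is a small NumPy example. NumPy doesn't ship an FP8 type, so 16-bit versus 32-bit floats stand in for the general idea: fewer bits mean fewer significant digits survive.

```python
import numpy as np

x = 0.1234567  # an arbitrary value

print(np.float32(x))  # 0.1234567 -- 32 bits keep roughly 7 significant decimal digits
print(np.float16(x))  # 0.1235    -- 16 bits keep only about 3-4 significant digits
# An 8-bit float would leave room for only a couple of significant digits.
```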
Most models today are trained at 16-bit or “half” precision and “post-train quantized” to 8-bit precision. Certain model components (for example, its parameters) are converted to a lower-precision format at the cost of some accuracy. Think of it like doing the math to several decimal places and then rounding to the nearest tenth, often giving you the best of both worlds.
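Here is a minimal, simplified sketch of that “post-train quantize” step: generic symmetric 8-bit quantization of a few random stand-in weights in NumPy. Real toolchains use more careful calibration, but the round-off error printed at the end is the kind of accuracy cost described above.

```python
import numpy as np

# Toy "layer" of half-precision weights; random values stand in for a trained model.
rng = np.random.default_rng(0)
w_fp16 = rng.normal(size=8).astype(np.float16)

# Symmetric post-training quantization: pick a scale so the largest weight
# maps to 127, round, and store the result as 8-bit integers.
scale = float(np.abs(w_fp16).max()) / 127.0
w_int8 = np.clip(np.round(w_fp16.astype(np.float32) / scale), -127, 127).astype(np.int8)

# At inference time the int8 weights are scaled back up ("dequantized").
w_dequant = w_int8.astype(np.float32) * scale

print("original :", w_fp16)
print("recovered:", w_dequant)
print("max rounding error:", np.abs(w_fp16.astype(np.float32) - w_dequant).max())
```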
Hardware vendors such as Nvidia are pushing for even lower precision for quantized model inference. The company's new Blackwell chip supports 4-bit precision, specifically a data type called FP4; Nvidia has pitched it as a boon for memory- and power-constrained data centers.
But extremely low quantization precision may not be desirable. According to Kumar, unless the original model is incredibly large in terms of its parameter count, precisions lower than 7- or 8-bit may see a noticeable step down in quality.
If this all seems a little technical, don't worry, because it is. But the takeaway is simply that AI models are not fully understood, and known shortcuts that work in many kinds of computation don't work here. You wouldn't say “noon” if someone asked when you started a 100-meter dash, right? It's not quite as obvious as that, of course, but the idea is the same.
“The key point of our work is that there are limitations you can't naively get around,” Kumar said. “We hope our work adds nuance to the discussion that often seeks increasingly low precision defaults for training and inference.”
Kumar acknowledges that his and his colleagues' study was at a relatively small scale; they plan to test it with more models in the future. But he believes at least one insight will hold: there's no free lunch when it comes to reducing inference costs.
“Bit precision matters, and it's not free,” he said. “You cannot reduce it forever without models suffering. Models have finite capacity, so rather than trying to fit a quadrillion tokens into a small model, in my opinion much more effort will be put into meticulous data curation and filtering, so that only the highest quality data is put into smaller models. I am optimistic that new architectures that deliberately aim to make low-precision training stable will be important in the future.”
This story was originally published on November 17, 2024 and was updated with new information on December 23.