DeepSeek-V3, an ultra-large open source AI, outperforms Llama and Qwen at launch




Chinese AI startup DeepSeek, known for challenging major AI vendors with its innovative open source technologies, has released a new ultra-large model: DeepSeek-V3.

Available through Hugging Face under the company's license agreement, the new model comes with 671B parameters but uses a mixture-of-experts architecture to activate only selected parameters, in order to handle given tasks accurately and efficiently. According to benchmarks shared by DeepSeek, the offering is already at the top of the charts, outperforming leading open source models, including Meta's Llama-3.1-405B, and closely matching the performance of closed models from Anthropic and OpenAI.

The release marks another major development bridging the gap between closed and open source AI. Ultimately, DeepSeek, which started as a spinoff of Chinese quantitative hedge fund High-Flyer Capital Management, hopes these developments will pave the way for artificial general intelligence (AGI), where models will be able to understand or learn any intellectual task that a human being can.

What does DeepSeek-V3 bring to the table?

Just like its predecessor DeepSeek-V2, the new ultra-large model uses the same basic architecture revolving around multi-head latent attention (MLA) and DeepSeekMoE. This approach ensures efficient training and inference, with specialized and shared “experts” (individual, smaller neural networks within the larger model) activating 37B of the 671B parameters for each token.
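To make the sparse-activation idea concrete, here is a minimal, illustrative sketch of top-k expert routing in a mixture-of-experts layer. It is not DeepSeek's implementation; the layer sizes, expert count and `top_k` value are toy assumptions chosen only to show how each token can touch a small fraction of the model's total parameters.

```python
# Illustrative top-k mixture-of-experts routing (toy sizes, not DeepSeek's code).
# DeepSeek-V3 activates roughly 37B of its 671B parameters per token in a similar spirit.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy mixture-of-experts layer: each token is routed to only top_k experts."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # scores every expert for each token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        weights, chosen = scores.topk(self.top_k, dim=-1)      # keep only the top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize the kept weights
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e
                if mask.any():  # only the selected experts ever run for a given token
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(4, 64)          # 4 toy tokens
print(ToyMoELayer()(tokens).shape)   # each token activated just 2 of the 8 experts
```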

While the basic architecture ensures strong performance for DeepSeek-V3, the company has also introduced two innovations to raise the bar further.

The first is an auxiliary-loss-free strategy for load balancing. This dynamically monitors and adjusts the load on experts to utilize them in a balanced way without compromising overall model performance. The second is multi-token prediction (MTP), which allows the model to predict multiple future tokens simultaneously. This innovation not only increases training efficiency but enables the model to perform three times faster, generating 60 tokens per second.
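The throughput claim is easy to sanity-check with a toy model of decoding speed. The sketch below is purely illustrative (it is not DeepSeek's MTP module), and the per-step latency is an arbitrary assumption picked so that the single-token baseline lands at 20 tokens per second.

```python
# Toy illustration of why emitting several tokens per forward pass speeds up decoding.
# The 0.05s per-step latency is an assumed figure, not a measured DeepSeek-V3 number.
def decode_time(n_tokens, tokens_per_step, seconds_per_step=0.05):
    """Wall-clock time to emit n_tokens when each forward pass yields tokens_per_step tokens."""
    steps = -(-n_tokens // tokens_per_step)  # ceiling division
    return steps * seconds_per_step

baseline = decode_time(600, tokens_per_step=1)  # one token per forward pass
mtp      = decode_time(600, tokens_per_step=3)  # three tokens per forward pass
print(f"single-token: {600 / baseline:.0f} tok/s, multi-token: {600 / mtp:.0f} tok/s")
# single-token: 20 tok/s, multi-token: 60 tok/s -> the roughly 3x speedup cited above
```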

“During pre-training, we trained DeepSeek-V3 on 14.8T high-quality and diverse tokens… Next, we conducted two-stage context length extension for DeepSeek-V3,” the company wrote in a technical paper detailing the new model. “In the first stage, the maximum context length is extended to 32K, and in the second stage, it is further extended to 128K. Following this, we conducted post-training, including supervised fine-tuning (SFT) and reinforcement learning (RL) on the base model of DeepSeek-V3, to align it with human preferences and further unlock its potential. During the post-training stage, we distill the reasoning capability from the DeepSeek-R1 series of models, and meanwhile carefully maintain the balance between model accuracy and generation length.”

Notably, during the training phase, DeepSeek used multiple hardware and algorithmic optimizations, including an FP8 mixed precision training framework and the DualPipe algorithm for pipeline parallelism, to cut down the costs of the process.

In total, the company says it completed DeepSeek-V3's entire training in about 2,788K H800 GPU hours, or about $5.57 million, assuming a rental price of $2 per GPU hour. This is much lower than the hundreds of millions of dollars usually spent on pre-training large language models.
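The arithmetic behind that figure is straightforward; a quick back-of-the-envelope check using the numbers quoted above:

```python
# Back-of-the-envelope check of the reported training cost, using the article's figures.
gpu_hours = 2_788_000        # ~2,788K H800 GPU hours
price_per_gpu_hour = 2.00    # assumed rental price in USD
print(f"${gpu_hours * price_per_gpu_hour / 1e6:.2f}M")  # -> $5.58M, in line with the ~$5.57M cited
```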

Llama-3.1, for example, is thought to have been trained with an investment of more than $500 million.

The most robust open source model currently available

Despite the economical training, DeepSeek-V3 has emerged as the strongest open source model in the market.

The company ran several benchmarks to compare the performance of the AI and noted that it convincingly outperforms leading open models, including Llama-3.1-405B and Qwen 2.5-72B. It even outperforms the closed-source GPT-4o on most benchmarks, with the exception of SimpleQA and FRAMES, two English-focused benchmarks where the OpenAI model sat ahead with scores of 38.2 and 80.5 (versus 24.9 and 73.3), respectively.

Notably, DeepSeek-V3's performance particularly stood out on Chinese and math-centric benchmarks, scoring better than all of its peers. In the Math-500 test, it scored 90.2, with Qwen's score of 80 the next best.

The only model that managed to challenge DeepSeek-V3 was Anthropic's Claude 3.5 Sonnet, which outperformed it with higher scores in MMLU-Pro, IF-Eval, GPQA-Diamond, SWE Verified and Aider-Edit.

https://twitter.com/deepseek_ai/status/1872242657348710721

The work shows that open source models are closing the gap with closed source models, promising nearly equivalent performance across different tasks. The development of such systems is good for the industry, as it could reduce the chances of one big AI player ruling the game. It also gives enterprises a variety of options to choose from and work with while setting up their stacks.

Currently, the code for DeepSeek-V3 is available via GitHub under an MIT license, while the model itself is provided under the company's model license. Enterprises can also test out the new model through DeepSeek Chat, a platform similar to ChatGPT, and access the API for commercial use. DeepSeek is providing the API at the same price as DeepSeek-V2 until February 8. After that, it will charge $0.27/million input tokens ($0.07/million tokens with cache hits) and $1.10/million output tokens.
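For a rough sense of what those rates mean in practice, here is a small cost calculator at the post-February-8 prices; the request sizes and cache-hit ratio in the example are hypothetical.

```python
# Rough cost estimate at the listed post-February-8 API prices (example sizes are hypothetical).
def request_cost(input_tokens, output_tokens, cache_hit_ratio=0.0):
    input_rate_miss = 0.27 / 1_000_000  # USD per input token on a cache miss
    input_rate_hit  = 0.07 / 1_000_000  # USD per input token on a cache hit
    output_rate     = 1.10 / 1_000_000  # USD per output token
    cost_in = input_tokens * (cache_hit_ratio * input_rate_hit
                              + (1 - cache_hit_ratio) * input_rate_miss)
    return cost_in + output_tokens * output_rate

# e.g. 1M input tokens (half served from cache) and 200K output tokens
print(f"${request_cost(1_000_000, 200_000, cache_hit_ratio=0.5):.3f}")  # -> $0.390
```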



