Highlights:
- DeepSeek-V3 is built on a mixture-of-experts (MoE) architecture, consisting of several neural networks, each specialized in a distinct set of tasks.
- The MoE architecture reduces hardware costs by activating only the relevant neural network for a given prompt, rather than the entire LLM.
Recently, Chinese AI developer DeepSeek released DeepSeek-V3, a new open-source large language model with 671 billion parameters.
The LLM is capable of generating text, writing software code, and performing related tasks. According to DeepSeek, it surpasses two of the most advanced open-source LLMs available on more than half a dozen benchmark tests.
DeepSeek-V3 uses a mixture-of-experts (MoE) architecture, with multiple neural networks, each focused on a particular set of tasks. Upon receiving a prompt, a routing component directs the request to the neural network most suited to handle it.
The primary advantage of the MoE architecture is its ability to lower hardware costs. When a prompt is sent to DeepSeek-V3, only the specific neural networks assigned to handle the request are activated, rather than the entire LLM. Only about 37 billion of the model's 671 billion parameters are active for any given token, so relatively modest infrastructure is needed to serve a request.
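For intuition, here is a minimal sketch of how sparse expert routing works in general; it is not DeepSeek-V3's exact design, and the expert count, layer sizes, and top-k value are illustrative assumptions. Only the experts selected by the router for a given token do any work.

```python
# Minimal sketch of a sparse mixture-of-experts layer (illustrative only;
# not DeepSeek-V3's exact design). Sizes and the top-k value are assumptions.
import torch
import torch.nn as nn


class SparseMoELayer(nn.Module):
    def __init__(self, hidden_dim=512, num_experts=8, expert_dim=1024, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, num_experts)  # routing component
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, expert_dim),
                nn.GELU(),
                nn.Linear(expert_dim, hidden_dim),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (num_tokens, hidden_dim)
        scores = torch.softmax(self.router(x), dim=-1)        # token-to-expert affinity
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)  # keep only the best matches
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (top_idx == e).any(dim=-1)                  # tokens routed to expert e
            if mask.any():                                     # unselected experts stay idle
                weight = top_scores[mask][top_idx[mask] == e].unsqueeze(-1)
                out[mask] += weight * expert(x[mask])
        return out


tokens = torch.randn(4, 512)           # four token representations
print(SparseMoELayer()(tokens).shape)  # torch.Size([4, 512])
```

Because each token passes through only `top_k` of the experts, the compute per request scales with the small activated subset rather than with the model's full parameter count.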
While the MoE architecture offers advantages, it also presents challenges. During training, some neural networks in an MoE model may receive more data than others, potentially leading to inconsistencies in the LLM’s output quality. DeepSeek claims to have developed and implemented a new method in DeepSeek-V3 to address this issue.
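DeepSeek has not published its balancing method in this article's scope, so the sketch below shows the conventional auxiliary load-balancing loss used in earlier MoE models instead; it illustrates why imbalance matters by penalizing routers that send most tokens to a few experts. The shapes and expert count are assumptions.

```python
# Sketch of a conventional auxiliary load-balancing loss for MoE training
# (the widely used prior approach, not DeepSeek-V3's newer balancing method).
import torch


def load_balancing_loss(router_probs, expert_assignments, num_experts):
    """router_probs: (num_tokens, num_experts) softmax router outputs.
    expert_assignments: (num_tokens,) index of the expert chosen per token."""
    # Fraction of tokens actually routed to each expert.
    counts = torch.bincount(expert_assignments, minlength=num_experts).float()
    load_fraction = counts / counts.sum()
    # Average routing probability assigned to each expert.
    prob_fraction = router_probs.mean(dim=0)
    # The sum is smallest when both distributions are uniform, so the loss
    # grows when a few experts absorb most of the traffic.
    return num_experts * torch.sum(load_fraction * prob_fraction)


probs = torch.softmax(torch.randn(16, 8), dim=-1)
assignments = probs.argmax(dim=-1)
print(load_balancing_loss(probs, assignments, num_experts=8))
```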
The LLM was trained on 14.8 trillion tokens, with each token representing a few letters or numbers. The training process required 2.788 million GPU hours, a relatively modest amount of compute for a model of this scale. Advanced AI clusters in the industry, equipped with tens of thousands of GPUs or more, can complete a comparable amount of work within a few days.
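As a rough back-of-the-envelope check, the reported GPU-hour figure converts into wall-clock time as follows; the cluster sizes are illustrative assumptions, not DeepSeek's actual setup.

```python
# Back-of-the-envelope conversion of the reported 2.788 million GPU hours
# into wall-clock training time; cluster sizes are illustrative assumptions.
TOTAL_GPU_HOURS = 2_788_000

for num_gpus in (2_048, 10_000, 50_000):
    days = TOTAL_GPU_HOURS / num_gpus / 24
    print(f"{num_gpus:>6,} GPUs -> about {days:.1f} days of training")

# Example output:
#  2,048 GPUs -> about 56.7 days of training
# 10,000 GPUs -> about 11.6 days of training
# 50,000 GPUs -> about 2.3 days of training
```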
In addition to its MoE architecture, DeepSeek-V3 incorporates several optimizations aimed at enhancing its output quality.
LLMs employ a technique called attention to pinpoint the most important details in a sentence. DeepSeek-V3 uses multi-head latent attention (MLA), an enhanced version of this method that enables it to extract critical details from a text snippet multiple times instead of just once. This reduces the likelihood of overlooking important information.
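For a sense of the mechanics, the sketch below shows a heavily simplified attention layer in which keys and values are reconstructed from a small latent vector, the compression idea that gives latent attention its name. It is not DeepSeek-V3's exact implementation; the dimensions and the omission of details such as positional encodings and query compression are assumptions made for brevity.

```python
# Heavily simplified sketch of multi-head attention with a low-rank latent
# for keys and values; dimensions and omitted details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimplifiedLatentAttention(nn.Module):
    def __init__(self, hidden_dim=512, num_heads=8, latent_dim=64):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads
        self.q_proj = nn.Linear(hidden_dim, hidden_dim)
        # Compress each token into a small latent; only this needs to be cached.
        self.kv_down = nn.Linear(hidden_dim, latent_dim)
        # Reconstruct full-size keys and values from the latent on the fly.
        self.k_up = nn.Linear(latent_dim, hidden_dim)
        self.v_up = nn.Linear(latent_dim, hidden_dim)
        self.out_proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x):  # x: (batch, seq_len, hidden_dim)
        b, t, d = x.shape
        latent = self.kv_down(x)  # (b, t, latent_dim): the small cached representation
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_up(latent).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_up(latent).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(attn.transpose(1, 2).reshape(b, t, d))


x = torch.randn(1, 16, 512)
print(SimplifiedLatentAttention()(x).shape)  # torch.Size([1, 16, 512])
```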
DeepSeek-V3 also includes a multi-token prediction feature. Unlike traditional language models that generate text one token at a time, DeepSeek-V3 produces multiple tokens simultaneously, accelerating the inference process.
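The sketch below illustrates the general idea with independent output heads that each predict a different future position; it is a simplified stand-in, not necessarily DeepSeek-V3's exact module design, and the head count, vocabulary size, and dimensions are assumptions.

```python
# Simplified sketch of multi-token prediction: extra output heads predict
# several future tokens from the same hidden state instead of just the next
# one. Head count and sizes are illustrative, not DeepSeek-V3's exact setup.
import torch
import torch.nn as nn


class MultiTokenHead(nn.Module):
    def __init__(self, hidden_dim=512, vocab_size=32_000, num_future_tokens=3):
        super().__init__()
        # One output head per predicted future position (t+1, t+2, t+3, ...).
        self.heads = nn.ModuleList([
            nn.Linear(hidden_dim, vocab_size) for _ in range(num_future_tokens)
        ])

    def forward(self, hidden):  # hidden: (batch, seq_len, hidden_dim)
        # Returns one logit tensor per predicted future position.
        return [head(hidden) for head in self.heads]


hidden = torch.randn(1, 16, 512)               # hidden states from a backbone model
logits_per_position = MultiTokenHead()(hidden)
print([t.shape for t in logits_per_position])  # three (1, 16, 32000) tensors
```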
DeepSeek compared the model against three other open-source LLMs: its earlier DeepSeek-V2, Llama 3.1 405B, and Qwen2.5 72B. DeepSeek-V3 surpassed them on all nine of the coding and math benchmarks in the evaluation and excelled in a range of text-processing tasks.
The DeepSeek-V3 code is available on Hugging Face.