Reading Notes: Mixtral of Experts
The original paper is linked here. This reading note focuses specifically on the model architecture.
Introduction
Mixtral 8x7B is a Sparse Mixture of Experts (SMoE) language model.
It is a decoder-only model in which the feed-forward block picks from a set of 8 distinct groups of parameters. For every token, at each layer, a router network selects two of these groups (the “experts”) to process the current state and combines their outputs. Even though each token only sees two experts, the selected experts can be different at each time step. As a result, each token has access to 47B parameters but only uses 13B active parameters during inference. This technique increases the number of parameters of a model while controlling cost and latency.
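As a rough sanity check on those numbers, here is a back-of-the-envelope parameter count using the hyperparameters reported in the paper (hidden size 4096, 32 layers, SwiGLU FFN dimension 14336, 8 experts with top-2 routing, grouped-query attention with 32 query heads and 8 key-value heads of dimension 128, and a 32k vocabulary). Treat it as an approximation, not an official breakdown.

```python
# Approximate parameter count for Mixtral 8x7B.
# Hyperparameters are those reported in the paper; the breakdown itself is an estimate.
d_model, n_layers, d_ff = 4096, 32, 14336
n_experts, k = 8, 2
n_heads, n_kv_heads, head_dim = 32, 8, 128
vocab = 32000

expert_ffn = 3 * d_model * d_ff                     # SwiGLU expert: W1, W2, W3
attention = d_model * (n_heads * head_dim           # Wq
                       + 2 * n_kv_heads * head_dim  # Wk, Wv (grouped-query attention)
                       + n_heads * head_dim)        # Wo
router = d_model * n_experts                        # gating layer
embeddings = 2 * vocab * d_model                    # input + output embeddings

total = n_layers * (n_experts * expert_ffn + attention + router) + embeddings
active = n_layers * (k * expert_ffn + attention + router) + embeddings

print(f"total  ~ {total / 1e9:.1f}B parameters")   # ~46.7B, i.e. the ~47B quoted above
print(f"active ~ {active / 1e9:.1f}B parameters")  # ~12.9B, i.e. the ~13B quoted above
```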
Mixtral was trained on multilingual data with a context size of 32k tokens, and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks.
Architectural details
Sparse Mixture of Experts
The output of the Mixture of Experts (MoE) module for a given input x is the weighted sum of the outputs of the expert networks, where the weights are given by the gating network’s output. That is, given n expert networks {E_0, …, E_{n-1}}, the output of the expert layer is

∑_{i=0}^{n-1} G(x)_i · E_i(x)
Here, G(x)_i denotes the i-th element of the n-dimensional output of the gating network, and E_i(x) is the output of the i-th expert network. If the gating vector is sparse, we can avoid computing the outputs of experts whose gates are zero. There are multiple ways of implementing G(x), but a simple and performant one takes the softmax over the Top-K logits of a linear layer:

G(x) := Softmax(TopK(x · W_g))
where (TopK(l))_i := l_i if l_i is among the top-K coordinates of the logits l, and -∞ otherwise. The value of K (the number of experts used per token) is a hyperparameter that modulates the amount of compute used to process each token.
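A minimal PyTorch sketch of this gating function, assuming a single token’s hidden state x and router weights W_g (names are mine and this only illustrates the formula above, not the reference code):

```python
import torch

def top_k_gating(x: torch.Tensor, w_gate: torch.Tensor, k: int = 2) -> torch.Tensor:
    """G(x) = Softmax(TopK(x . W_g)).

    x:      (d_model,) hidden state of one token
    w_gate: (d_model, n_experts) router weights
    Returns an (n_experts,) weight vector with at most k non-zero entries.
    """
    logits = x @ w_gate                              # (n_experts,) router logits
    topk_vals, topk_idx = torch.topk(logits, k)      # keep the top-k logits
    masked = torch.full_like(logits, float("-inf"))  # -inf everywhere else
    masked[topk_idx] = topk_vals
    return torch.softmax(masked, dim=-1)             # softmax maps -inf to exactly 0
```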
MoE layers can be run efficiently on single GPUs with high-performance specialized kernels such as Megablocks, which casts the feed-forward operations of an MoE layer as large sparse matrix multiplications.
In a Transformer model, the MoE layer is applied independently per token and replaces the feed-forward (FFN) sub-block of the transformer block. For Mixtral, the expert function E_i(x) uses the SwiGLU architecture (SwiGLU, short for Swish-Gated Linear Unit, combines the Swish/SiLU activation with a Gated Linear Unit and is a common FFN variant in transformer models), and K is set to 2. This means each token is routed to two SwiGLU sub-blocks with different sets of weights. Putting this all together, the output y for an input token x is computed as

y = ∑_{i=0}^{n-1} Softmax(Top2(x · W_g))_i · SwiGLU_i(x)
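To make the data flow concrete, the following is a minimal PyTorch sketch of such a layer: top-2 routing over 8 SwiGLU experts, applied token by token. Class and variable names are mine, and the per-expert loop is purely illustrative; it is not the reference implementation or its optimized kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """One expert: a SwiGLU feed-forward block, W2(SiLU(x W1) * (x W3))."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)
        self.w3 = nn.Linear(d_model, d_ff, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class SparseMoELayer(nn.Module):
    """Replaces the dense FFN sub-block: each token is routed to k of n experts,
    and their outputs are mixed with the softmaxed top-k router logits."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([SwiGLU(d_model, d_ff) for _ in range(n_experts)])
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.k = k

    def forward(self, x):                                # x: (n_tokens, d_model)
        logits = self.gate(x)                            # (n_tokens, n_experts)
        weights, idx = torch.topk(logits, self.k, dim=-1)
        weights = torch.softmax(weights, dim=-1)         # normalize over the k chosen experts
        y = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            rows, slots = torch.where(idx == i)          # tokens routed to expert i
            if rows.numel() == 0:
                continue                                 # this expert sees no tokens
            y[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
        return y
```

Note that softmaxing only the two selected logits is equivalent to the Softmax(TopK(·)) formulation above, since the logits set to -∞ receive exactly zero weight.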
Results
Multilingual benchmarks
Compared to Mistral 7B, we significantly upsample the proportion of multilingual data during pretraining. The extra capacity allows Mixtral to perform well on multilingual benchmarks while maintaining high accuracy in English.
Long range performance
Results show that Mixtral achieves 100% retrieval accuracy regardless of the context length or the position of the passkey in the sequence.
Bias Benchmarks
Overall, Mixtral displays more positive sentiment than Llama 2, with similar variances within each group.
Instruction Fine-tuning
We train Mixtral-Instruct using supervised fine-tuning on an instruction dataset, followed by Direct Preference Optimization (DPO) on a paired feedback dataset.
Routing analysis
We perform a small analysis of the expert selection made by the router. In particular, we are interested in whether, during training, some experts specialized in specific domains.
Surprisingly, we do not observe obvious patterns in the assignment of experts based on topic. Only for DM Mathematics do we note a marginally different distribution of experts. This divergence is likely a consequence of the dataset’s synthetic nature and its limited coverage of the natural-language spectrum, and it is particularly noticeable at the first and last layers, where the hidden states are strongly correlated with the input and output embeddings respectively.
We also note that consecutive tokens are often assigned the same experts.
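One simple way to quantify this temporal locality, assuming we have logged the per-token top-2 expert indices from a forward pass, is to measure how often a token shares at least one expert with the previous token. This is hypothetical analysis code using a metric of my own choosing, not the paper’s exact measurement:

```python
import torch

def consecutive_repeat_rate(expert_idx: torch.Tensor) -> float:
    """expert_idx: (seq_len, k) tensor of the k expert indices selected per token.
    Returns the fraction of positions whose expert set overlaps the previous token's
    (hypothetical metric for the observation above)."""
    prev, curr = expert_idx[:-1], expert_idx[1:]                    # (seq_len - 1, k)
    overlap = (curr[:, :, None] == prev[:, None, :]).any(2).any(1)  # any shared expert
    return overlap.float().mean().item()
```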
Appendix
In the context of transformer models, a MoE consists of two main elements:
- Sparse MoE layers are used instead of dense feed-forward network (FFN) layers. MoE layers have a certain number of “experts” (e.g. 8), where each expert is a neural network. In practice, the experts are FFNs, but they can also be more complex networks or even a MoE itself, leading to hierarchical MoEs!
- A gate network or router that determines which tokens are sent to which expert. The router is composed of learned parameters and is pretrained at the same time as the rest of the network.
References
https://huggingface.co/blog/moe#what-is-a-mixture-of-experts-moe