Reading Notes — LoRA
The original paper link is here. This reading note will specifically focus on the model architecture.
As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Low-Rank Adaptation, or LoRA, freezes the pretrained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. It is an efficient adaptation strategy that neither introduces inference latency nor reduces input sequence length, while retaining high model quality. Importantly, it allows for quick task switching when deployed as a service by sharing the vast majority of the model parameters.
Introduction
Prior literature shows that learned over-parameterized models in fact reside on a low intrinsic dimension. We hypothesize that the change in weights during model adaptation also has a low “intrinsic rank”, leading to the proposed LoRA approach. LoRA allows us to train some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers’ change during adaptation instead, while keeping the pre-trained weights frozen.
Key advantages:
- A pre-trained model can be shared and used to build many small LoRA modules for different tasks. We can freeze the shared model and efficiently switch tasks by replacing the matrices A and B, reducing the storage requirement and task-switching overhead significantly.
- We only optimize the injected, much smaller low-rank matrices. LoRA makes training more efficient and lowers the hardware barrier to entry.
- The simple linear design allows us to merge the trainable matrices with the frozen weights when deployed, introducing no inference latency compared to a fully fine-tuned model, by construction.
- LoRA is orthogonal to many prior methods and can be combined with many of them.
Problem Statement
Suppose we are given a pre-trained autoregressive language model P_𝛷(y | x) parameterized by 𝛷.
Consider adapting this pre-trained model to downstream conditional text generation tasks. Each downstream task is represented by a training dataset of context-target pairs 𝒵 = {(x_i, y_i)}_{i=1,…,N}, where both x_i and y_i are sequences of tokens.
During full fine-tuning, the model is initialized to pre-trained weights 𝛷_0 and updated to 𝛷_0 + 𝚫𝛷 by repeatedly following the gradient to maximize the conditional language modeling objective max_𝛷 Σ_{(x,y)∈𝒵} Σ_{t=1}^{|y|} log P_𝛷(y_t | x, y_{<t}).
One of the main drawbacks of full fine-tuning is that for each downstream task we learn a different set of parameters 𝚫𝛷 whose dimension |𝚫𝛷| equals |𝛷_0|.
LoRA encodes 𝚫𝛷 with a much smaller set of parameters 𝚯, where |𝚯| ≪ |𝛷_0|. The task of finding 𝚫𝛷 thus becomes optimizing over 𝚯: max_𝚯 Σ_{(x,y)∈𝒵} Σ_{t=1}^{|y|} log P_{𝛷_0 + 𝚫𝛷(𝚯)}(y_t | x, y_{<t}).
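To make the size gap concrete, here is a minimal back-of-the-envelope sketch in Python (the dimensions below are assumed for illustration and are not taken from the paper):

```python
# For a single d x k weight matrix, full fine-tuning updates d*k parameters,
# while LoRA's low-rank factors B (d x r) and A (r x k) contain only r*(d + k).
d, k, r = 4096, 4096, 8          # example dimensions; r is the LoRA rank
full = d * k                     # |ΔΦ| for this matrix under full fine-tuning
lora = r * (d + k)               # |Θ| for the same matrix under LoRA
print(f"full: {full:,}  lora: {lora:,}  ratio: {full / lora:.0f}x")
```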
Aren’t existing solutions good enough?
There are two prominent strategies for efficient adaptation:
- adding adapter layers: Large neural networks rely on hardware parallelism to keep latency low, and adapter layers have to be processed sequentially, which introduces inference latency.
- optimizing some form of the input layer activations: Prefix tuning is difficult to optimize, and its performance changes non-monotonically in the number of trainable parameters. Moreover, reserving part of the sequence length for the prefix reduces the sequence length available to the downstream task.
Our method
The weight matrices in a neural network typically have full rank. When adapting to a specific task, however, pre-trained language models have been shown to have a low “intrinsic dimension” and can still learn efficiently despite a random projection to a smaller subspace. Inspired by this, LoRA hypothesizes that the updates to the weights also have a low “intrinsic rank” during adaptation. For a pre-trained weight matrix W_0 ∈ ℝ^{d×k}, we constrain its update by representing the latter with a low-rank decomposition W_0 + 𝚫W = W_0 + BA, where B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k}, and the rank r ≪ min(d, k).
During training, W_0 is frozen and does not receive gradient updates, while A and B contain trainable parameters. Note that both W_0 and 𝚫W = BA are multiplied with the same input, and their respective output vectors are summed coordinate-wise. Our modified forward pass yields h = W_0 x + 𝚫W x = W_0 x + BAx.
The reparameterization is illustrated in Figure 1 of the paper.
We use a random Gaussian initialization for A and zero for B, so 𝚫W = BA is zero at the beginning of training. We then scale 𝚫Wx by 𝞪/r, where 𝞪 is a constant in r. When optimizing with Adam, tuning 𝞪 is roughly the same as tuning the learning rate if we scale the initialization appropriately. As a result, we simply set 𝞪 to the first r we try and do not tune it. This scaling helps reduce the need to retune hyperparameters when we vary r.
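A minimal sketch of such a layer in PyTorch, following the description above; this is an illustrative re-implementation, not the authors' released code, and the Gaussian scale for A and the toy dimensions are assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 8.0):
        super().__init__()
        # Frozen pre-trained weight W_0 (a random stand-in here).
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        # Trainable low-rank factors: A is Gaussian-initialized (scale chosen
        # arbitrarily for this sketch), B starts at zero, so ΔW = BA is zero
        # at the beginning of training.
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r  # ΔWx is scaled by α/r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W_0 x + (α/r) * B A x
        return x @ self.weight.T + self.scaling * (x @ self.lora_A.T) @ self.lora_B.T

layer = LoRALinear(d_in=16, d_out=16, r=4)
h = layer(torch.randn(2, 16))  # equals W_0 x at initialization, since B = 0
```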
A Generalization of Full Fine-tuning: As we increase the number of trainable parameters, training LoRA roughly converges to training the original model, while adapter-based methods converge to an MLP and prefix-based methods to a model that cannot take long input sequences.
No Additional Inference Latency: When deployed in production, we can explicitly compute and store W = W_0 + BA and perform inference as usual. When we need to switch to another downstream task, we can recover W_0 by subtracting BA and then adding a different B’A’, a quick operation with very little memory overhead.
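A sketch of the deploy-time merge and task switch, assuming the LoRALinear module from the previous sketch (these helpers are hypothetical and not from the paper):

```python
import torch

@torch.no_grad()
def merge(layer):
    # Fold the update into the frozen weight: W = W_0 + (α/r) * B A,
    # so inference uses a single matmul, exactly like the original model.
    layer.weight.data += layer.scaling * (layer.lora_B @ layer.lora_A)

@torch.no_grad()
def unmerge(layer):
    # Recover W_0 by subtracting B A; a different task's B'A' can then be merged in.
    layer.weight.data -= layer.scaling * (layer.lora_B @ layer.lora_A)
```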
In principle, we can apply LoRA to any subset of weight matrices in a neural network to reduce the number of trainable parameters. In the Transformer architecture, there are four weight matrices in the self-attention module (W_q, W_k, W_v, W_o) and two in the MLP module. We limit our study to adapting only the attention weights for downstream tasks and freeze the MLP modules (so they are not trained in downstream tasks), both for simplicity and parameter efficiency.
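As a sketch of this choice, one might wrap only the query and value projections with LoRA and freeze everything else. The block below and its projection names (q_proj, k_proj, v_proj, o_proj) are hypothetical stand-ins, and LoRALinear refers to the earlier sketch:

```python
import torch.nn as nn

class SelfAttentionProjections(nn.Module):
    """Stand-in holding the four attention weight matrices."""
    def __init__(self, d: int = 16):
        super().__init__()
        self.q_proj = nn.Linear(d, d, bias=False)
        self.k_proj = nn.Linear(d, d, bias=False)
        self.v_proj = nn.Linear(d, d, bias=False)
        self.o_proj = nn.Linear(d, d, bias=False)

def add_lora(block: nn.Module, r: int = 4, alpha: float = 4.0) -> None:
    for name in ("q_proj", "v_proj"):                    # adapt W_q and W_v only
        old = getattr(block, name)
        lora = LoRALinear(old.in_features, old.out_features, r=r, alpha=alpha)
        lora.weight.data.copy_(old.weight.data)          # carry over frozen W_0
        setattr(block, name, lora)
    for n, p in block.named_parameters():                # train only A and B
        p.requires_grad = "lora_" in n

block = SelfAttentionProjections()
add_lora(block)
```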
Understanding the low-rank updates
Which weight matrices in the Transformer should we apply LoRA to?
Note that putting all the parameters in 𝚫W_q or 𝚫W_k results in significantly lower performance, while adapting both W_q and W_v yields the best result. This suggests that even a rank of four captures enough information in 𝚫W such that it is preferable to adapt more weight matrices than to adapt a single type of weights with a larger rank.
What is the optimal rank r for LoRA?
Surprisingly, LoRA performs competitively with a very small r (more so for {W_q, W_v} than just W_q). This suggests the update matrix 𝚫W could have a very small “intrinsic rank”. To further support this finding, we check the overlap of the subspaces learned by different choices of r and by different random seeds. We argue that increasing r does not cover a more meaningful subspace, which suggests that a low-rank adaptation matrix is sufficient.
Subspace similarity between different r: Given A_{r=8} and A_{r=64}, which are the learned adaptation matrices with rank r = 8 and 64 using the same pre-trained model, we perform singular value decomposition and obtain the right-singular unitary matrices U_{A_{r=8}} and U_{A_{r=64}}. We hope to answer: how much of the subspace spanned by the top i singular vectors in U_{A_{r=8}} is contained in the subspace spanned by the top j singular vectors of U_{A_{r=64}}? We measure this quantity with a normalized subspace similarity based on the Grassmann distance: φ(A_{r=8}, A_{r=64}, i, j) = ‖(U^i_{A_{r=8}})^⊤ U^j_{A_{r=64}}‖²_F / min(i, j) ∈ [0, 1],
where U^i_{A_{r=8}} represents the columns of U_{A_{r=8}} corresponding to the top-i singular vectors.
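A small NumPy sketch of this similarity measure, with random matrices standing in for the learned A_{r=8} and A_{r=64}:

```python
import numpy as np

def subspace_similarity(A_small, A_large, i: int, j: int) -> float:
    # Right-singular vectors of each adaptation matrix (rows of Vt).
    _, _, Vt_small = np.linalg.svd(A_small, full_matrices=False)
    _, _, Vt_large = np.linalg.svd(A_large, full_matrices=False)
    U_i = Vt_small[:i].T          # top-i right-singular vectors, as columns
    U_j = Vt_large[:j].T          # top-j right-singular vectors, as columns
    # φ = ||U_i^T U_j||_F^2 / min(i, j), which lies in [0, 1]
    return np.linalg.norm(U_i.T @ U_j, "fro") ** 2 / min(i, j)

rng = np.random.default_rng(0)
A8, A64 = rng.normal(size=(8, 1024)), rng.normal(size=(64, 1024))
print(subspace_similarity(A8, A64, i=1, j=4))
```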
Directions corresponding to the top singular vector overlap significantly between A_{r=8} and A_{r=64} while others do not.
Subspace similarity between different random seeds: 𝚫W_q appears to have a higher “intrinsic rank” than 𝚫W_v, since more common singular value directions are learned by both runs for 𝚫W_q.
How does the adaptation matrix 𝚫W compare to W?
We project W onto the r-dimensional subspace of 𝚫W by computing ‖U^⊤ W V^⊤‖_F, with U/V being the left/right singular-vector matrices of 𝚫W. Then we compare the Frobenius norm of the projected matrix with ‖W‖_F.
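An illustrative NumPy sketch of this comparison, with random stand-ins for W and 𝚫W:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4
W = rng.normal(size=(d, k))
delta_W = rng.normal(size=(d, r)) @ rng.normal(size=(r, k))   # a rank-r update

U, _, Vt = np.linalg.svd(delta_W)
U_r, V_r = U[:, :r], Vt[:r, :]        # top-r left/right singular vectors of ΔW
proj = U_r.T @ W @ V_r.T              # U^T W V^T: W in ΔW's r-dim subspace

print(np.linalg.norm(proj, "fro"), np.linalg.norm(W, "fro"))
```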
First, 𝚫W has a stronger correlation with W compared to a random matrix, indicating that 𝚫W amplifies some features that are already in W. Second, instead of repeating the top singular directions of W, 𝚫W only amplifies directions that are not emphasized in W.
This suggests that the low-rank adaptation matrix potentially amplifies the important features for specific downstream tasks that were learned but not emphasized in the general pre-trained model.