
Published on 2023-10-07 10:00 by Vitor Sousa

LoRA and DoRA Implementation from Scratch

Check the repo: 🔗 LoRA and DoRA Implementation from Scratch.

This repository contains the implementation of LoRA and DoRA layers as proposed in the following papers:

  • LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)
  • DoRA: Weight-Decomposed Low-Rank Adaptation (Liu et al., 2024)

These layers are used in a Multi-Layer Perceptron (MLP) model.

🔍 LoRA and DoRA Layers

🔺 LoRA (Low-Rank Adaptation)

LoRA is designed to reduce computational costs and memory usage during fine-tuning of large pre-trained models. By updating only a subset of parameters using low-rank matrices, LoRA allows efficient adaptation to specific tasks, especially when computational resources are limited.

🔑 Key Concepts

  1. Low-Rank Matrices: In LoRA, two low-rank matrices, A and B, are introduced. These matrices have a much smaller number of parameters than the original weight matrix W. During fine-tuning, instead of updating the full weight matrix, only these low-rank matrices are updated.

  2. Weight Update: The weight update in LoRA can be represented as:

W' = W + \alpha \cdot A \cdot B

Here, W is the original weight matrix, A and B are the low-rank matrices, and α is a scaling factor that controls the impact of the adaptation. The product A · B approximates the change required in the weight matrix, and α scales this change. A minimal code sketch of this update is given after the list below.

  3. Dimensionality Reduction: By using low-rank matrices, LoRA captures the essential adaptations in a lower-dimensional subspace, reducing the number of learnable parameters and enhancing training efficiency.

  4. Efficiency: The reduced number of parameters in A and B speeds up training and helps mitigate overfitting.

  5. Applications: LoRA is beneficial in transfer learning, where a pre-trained model needs quick adaptation to new tasks with limited data.
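
To make the weight update above concrete, here is a minimal PyTorch sketch of a LoRA layer wrapped around a frozen linear layer. The class names (LoRALayer, LinearWithLoRA) and the initialization choices are illustrative assumptions, not the exact code from the repository.

import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    """Trainable low-rank update: alpha * (x @ A @ B)."""
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        # A starts with small random values, B starts at zero,
        # so the adaptation is a no-op at initialization (W' = W).
        self.A = nn.Parameter(torch.randn(in_dim, rank) / rank**0.5)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)

class LinearWithLoRA(nn.Module):
    """Frozen pre-trained linear layer plus a trainable LoRA update."""
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        # Equivalent to applying W' = W + alpha * A * B to the input.
        return self.linear(x) + self.lora(x)

In a setup like this, the MLP's pre-trained linear layers would be replaced by LinearWithLoRA wrappers, the original weights frozen, and only A and B trained.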

🧭 DoRA (Weight-Decomposed Low-Rank Adaptation)

DoRA extends the concept of LoRA by decomposing the pretrained weight matrix into a magnitude vector and a directional matrix. This allows the model to adapt more flexibly to new tasks by dynamically adjusting the low-rank matrices based on the current state of the training process, providing improved adaptability and efficiency.

Mathematical Explanation

In DoRA, the weight update is represented as:

W' = m \frac{V + \Delta V}{\|V + \Delta V\|_c} = m \frac{W_0 + BA}{\|W_0 + BA\|_c}

where:

  • m is the learnable magnitude vector,
  • V = W_0 is the directional component, initialized from the pretrained weight matrix,
  • ΔV = BA is the low-rank (LoRA) update applied to the direction, and
  • ∥ · ∥_c denotes the column-wise norm.

Magnitude Vector and Directional Matrix

The magnitude vector m and the directional matrix V are used to dynamically adjust the low-rank matrices. The magnitude vector m is defined as:

m = \frac{\lVert W \rVert}{\lVert V \rVert}

where ∥W∥ is the norm of the original weight matrix W and ∥V∥ is the norm of the directional matrix V.

The magnitude vector m scales the updates to the low-rank matrices A and B during training, ensuring that the adjustments are proportional to the scale of the original weight matrix. This proportional adjustment improves the model's ability to fine-tune efficiently and effectively.
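
As a small illustration (assuming a PyTorch nn.Linear, whose weight has shape (out_features, in_features)), the magnitude vector can be initialized from the column-wise norm of the pretrained weight:

import torch.nn as nn

linear = nn.Linear(128, 256)  # hypothetical pretrained layer
# One L2 norm per column of the weight matrix; m has shape (1, in_features)
m = nn.Parameter(linear.weight.norm(p=2, dim=0, keepdim=True))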

Usage in Training

During training, the low-rank matrices A and B are updated dynamically based on the magnitude vector m and the directional component V. This dynamic adjustment allows the model to adapt more flexibly to new tasks, improving performance and reducing overfitting.

Detailed Explanation of m and the Directional Component

  1. Magnitude Vector m:

    • The parameter m is initialized based on the norm of the pretrained weight matrix W.
    • This parameter allows the model to dynamically adjust the scale of each weight vector in the combined weight matrix during training. This additional flexibility can help the model better capture the importance of different features.
  2. Directional Component:

    • The directional component is calculated by normalizing the sum of the original weights W and the scaled output of the low-rank adaptation (LoRA), BA.
    • This normalization ensures that the updates are directionally aligned with the original weight matrix.

The new weights for the linear layer are then calculated by scaling the directional component with the parameter m. This process ensures that the updates are not only directionally aligned but also appropriately scaled, leading to more effective fine-tuning.
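
Putting these pieces together, a minimal DoRA-style linear layer might look like the sketch below. It reuses the LoRALayer sketch from the LoRA section; the class name and layout are illustrative assumptions rather than the repository's exact code.

import torch.nn as nn
import torch.nn.functional as F

class LinearWithDoRA(nn.Module):
    """Adapts a frozen linear layer as W' = m * (W_0 + alpha*BA) / ||W_0 + alpha*BA||_c."""
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)  # from the LoRA sketch above
        # Magnitude vector m, initialized from the column-wise norm of the pretrained weight
        self.m = nn.Parameter(linear.weight.norm(p=2, dim=0, keepdim=True))

    def forward(self, x):
        # Low-rank update, transposed to match the (out_features, in_features) weight layout
        lora_update = self.lora.alpha * (self.lora.A @ self.lora.B).T
        combined = self.linear.weight + lora_update                        # W_0 + alpha * BA
        directional = combined / combined.norm(p=2, dim=0, keepdim=True)   # normalize each column
        new_weight = self.m * directional                                  # rescale by the magnitude vector
        return F.linear(x, new_weight, self.linear.bias)

Because B starts at zero and m starts at the column-wise norm of W_0, the layer initially reproduces the pretrained layer exactly; training then adjusts magnitude and direction separately.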

🤗 PEFT package

The PEFT (Parameter-Efficient Fine-Tuning) package from Hugging Face provides parameter-efficient techniques for fine-tuning large pre-trained models. It supports various configurations, including LoRA (Low-Rank Adaptation), making it suitable for diverse tasks such as sequence-to-sequence learning. For more details, visit the official documentation.

Example Usage

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForSeq2SeqLM

# Load a pre-trained sequence-to-sequence model (FLAN-T5 is used here as an example)
original_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Define the LoRA configuration
lora_config = LoraConfig(
    r=32,  # Rank: controls the dimensionality of the low-rank matrices
    lora_alpha=32,  # Scaling factor for the LoRA updates
    target_modules=["q", "v"],  # Target only the attention query and value projections
    lora_dropout=0.05,  # Dropout rate for regularization
    bias="none",  # No bias parameters are trained
    task_type=TaskType.SEQ_2_SEQ_LM  # Specify task type, e.g. sequence-to-sequence for FLAN-T5
)

# Apply the LoRA configuration to the original model
peft_model = get_peft_model(original_model, lora_config)
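
After wrapping the model, PEFT can report how many parameters are actually trainable, which makes the efficiency gain concrete (the exact counts depend on the base model):

# Prints the number of trainable parameters vs. total parameters;
# with a LoRA configuration like the one above this is typically well under 1%.
peft_model.print_trainable_parameters()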

📚 References

This work has been strongly influenced by the contributions of Sebastian Raschka, particularly his detailed explanations and from-scratch implementations of LoRA and DoRA. These resources have been instrumental in shaping the approach and implementation strategies presented in this work.

Written by Vitor Sousa
