Fine-tuning the Llama-3.2-3B-Instruct Model Using “Unsloth” and “LoRA”

Charan H U
Oct 12, 2024

Fine-tuning large language models like Llama-3.2-3B can significantly improve their performance on custom datasets while reducing computational overhead through efficient methods like LoRA (Low-Rank Adaptation). Using Unsloth, a toolkit designed to optimize and simplify the process, we can fine-tune Llama-3.2-3B-Instruct on custom data efficiently. In this tutorial, we’ll walk through each step of the process, explain the core code and concepts, and show how to run it on your own custom dataset.

Setup and Installation

To get started, install the necessary libraries. We’re using `Unsloth` to handle the fine-tuning process and `LoRA` for parameter-efficient fine-tuning.

%%capture
!pip install unsloth
# Get the latest nightly version of Unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

`Unsloth` simplifies fine-tuning by providing optimized workflows for handling language models like Llama-3.2-3B. The library also supports optimizations like 4-bit quantization and RoPE scaling for better performance.

Loading the Model

We will load the `Llama-3.2-3B-Instruct` model using Unsloth’s `FastLanguageModel`. This step sets up the model with configurations such as maximum sequence length and quantization options (to reduce memory usage).

from unsloth import FastLanguageModel
import torch

max_seq_length = 2048
dtype = None          # Auto-detect dtype (bfloat16 on newer GPUs, float16 otherwise)
load_in_4bit = True   # Reduce memory usage with 4-bit quantization

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

Here, `load_in_4bit=True` enables 4-bit quantization, which significantly reduces the memory footprint and makes it easier to fine-tune large models on smaller hardware.
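
To see what the 4-bit model actually costs in VRAM, you can query PyTorch’s CUDA memory statistics right after loading. A minimal sketch, assuming you are running on a single CUDA GPU:

import torch

# Report which GPU we are on and how much memory is reserved after loading the model.
gpu = torch.cuda.get_device_properties(0)
total_gb = round(gpu.total_memory / 1024**3, 3)
reserved_gb = round(torch.cuda.max_memory_reserved() / 1024**3, 3)
print(f"GPU: {gpu.name}, total memory: {total_gb} GB")
print(f"Memory reserved after loading the model: {reserved_gb} GB")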

Applying LoRA for Efficient Fine-tuning

We can leverage LoRA (Low-Rank Adaptation) to efficiently fine-tune the model with minimal memory usage. LoRA works by introducing trainable low-rank matrices into the layers, allowing only a small portion of the model’s parameters to be updated during training.

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                        # LoRA rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,              # Optimized at 0
    bias = "none",                 # No additional bias terms
    use_gradient_checkpointing = "unsloth",  # Gradient checkpointing to save memory
    random_state = 3407,
    use_rslora = False,            # Rank-stabilized LoRA; can be enabled for stability
)

LoRA rank (`r`) defines the dimension of the trainable matrices. A rank of 16 is a good balance for performance, but you can adjust it depending on the size of your dataset and hardware capacity. LoRA dramatically reduces the number of trainable parameters while maintaining model performance.
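
To verify how small the trainable fraction really is, you can count the parameters with `requires_grad` set on the wrapped model. A minimal sketch; the final line assumes the object returned by `get_peft_model` exposes PEFT’s `print_trainable_parameters` helper:

# Count trainable vs. total parameters on the LoRA-wrapped model.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")

# If the wrapper is a standard PEFT model, this prints a similar summary.
model.print_trainable_parameters()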

Preparing the Dataset

Next, we load the dataset and format the data for training. The dataset used here is a Kannada instructional dataset, but you can replace this with any dataset of your choice.

from datasets import load_dataset
from unsloth.chat_templates import standardize_sharegpt

dataset = load_dataset("charanhu/kannada-instruct-dataset-390-k", split="train")
dataset = standardize_sharegpt(dataset)

Here, we load a dataset from Hugging Face and standardize it with `standardize_sharegpt`, which converts ShareGPT-style records into the role/content conversation format that the chat template expects.
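
It is worth printing one standardized record to confirm the structure before formatting. A quick sketch; the actual content will be in Kannada for this dataset:

# Each record now holds a list of {"role", "content"} turns.
print(dataset[0]["conversations"])
# Expected shape:
# [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]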

Formatting Prompts

To convert the dataset into a format suitable for the language model, we apply chat templates. This involves transforming the input into a conversation-like structure and tokenizing it for model consumption.

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in convos
    ]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)

This function renders each conversation with the model’s chat template into a plain-text `text` field; the trainer tokenizes these strings later during training.
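
Printing one formatted record is an easy way to confirm that the Llama-3 header tokens ended up where you expect. A minimal sketch:

# The "text" field now contains the fully templated conversation as one string.
print(dataset[0]["text"][:500])  # first 500 characters only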

Training Configuration

For fine-tuning, we use the `SFTTrainer` (Supervised Fine-Tuning Trainer) from the `trl` library, together with `TrainingArguments` from `transformers` and Unsloth’s optimizations. The following settings configure the training arguments, such as batch size, learning rate, and gradient accumulation.

from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Key elements here:
- fp16 and bf16 select mixed-precision training based on what the GPU supports (`is_bfloat16_supported()`), reducing memory usage.
- packing is disabled here; enabling it can speed up training on datasets of short sequences by packing several examples into one sequence.
- gradient_accumulation_steps accumulates gradients over multiple steps to simulate a larger effective batch size (see the sketch below).
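
The effective batch size is simply the per-device batch size multiplied by the accumulation steps (and by the number of GPUs, if you use more than one). A quick sanity check with the settings above:

# 2 samples per step * 4 accumulation steps = 8 samples per optimizer update (per GPU).
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(f"Effective batch size per GPU: {effective_batch_size}")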

Training the Model

Now that the model is set up, we train it on the dataset. Wrapping the trainer with `train_on_responses_only` ensures that the loss is computed only on the assistant responses, not on the user prompts.

from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)
trainer_stats = trainer.train()

This step fine-tunes the model by updating only the LoRA adapter weights attached to the targeted layers, keeping memory usage low throughout training.
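
The value returned by `trainer.train()` is a standard `transformers` `TrainOutput`, and its `metrics` dictionary gives a quick summary of the run. A small sketch; the exact keys can vary slightly between library versions:

# trainer_stats.metrics holds the run summary produced by the trainer.
metrics = trainer_stats.metrics
print(f"Training runtime (s): {metrics.get('train_runtime')}")
print(f"Final training loss:  {metrics.get('train_loss')}")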

Inference and Evaluation

Finally, after training, we switch the model into Unsloth’s inference mode and generate text from a prompt using the fine-tuned model.

# Enable Unsloth's optimized inference mode before generating.
FastLanguageModel.for_inference(model)

messages = [
    # Kannada prompt: "Write an essay about the environment in 1000 words."
    {"role": "user", "content": "1000 ಪದಗಳಲ್ಲಿ ಪರಿಸರದ ಬಗ್ಗೆ ಪ್ರಬಂಧ ಬರೆಯಿರಿ."}
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

outputs = model.generate(input_ids=inputs, max_new_tokens=1024, use_cache=True, temperature=1.5, min_p=0.1)
tokenizer.batch_decode(outputs)

This block demonstrates how to prompt the model for generation. You can adjust the parameters like `temperature` to control the creativity and randomness of the generated responses.
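
If you prefer to see tokens as they are generated instead of waiting for the full output, you can pass a `TextStreamer` from `transformers` to `generate`. A minimal sketch that reuses the `inputs` built above:

from transformers import TextStreamer

# Stream decoded tokens to stdout as they are produced, skipping the prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(
    input_ids=inputs,
    streamer=streamer,
    max_new_tokens=256,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)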

Conclusion

Fine-tuning Llama-3.2-3B-Instruct with Unsloth and LoRA is a highly efficient way to adapt large language models to specific datasets. Using LoRA significantly reduces memory usage while maintaining fine-tuning effectiveness. Through this process, we can optimize both the model’s size and performance, making it suitable for a wide range of applications on custom datasets.

By following the steps outlined, you can fine-tune the Llama-3.2-3B-Instruct model on your own data and achieve high-quality results with limited resources.
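
One step not shown above is saving the result. Because the trained weights live in the LoRA adapters, the standard `save_pretrained` calls are enough to persist them; a minimal sketch, where the output directory name is just an example:

# Save only the LoRA adapter weights and the tokenizer (the base model is unchanged).
model.save_pretrained("llama-3.2-3b-instruct-ka-lora")      # example directory name
tokenizer.save_pretrained("llama-3.2-3b-instruct-ka-lora")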

Here is the link to the fine-tuned llama-3-2-3b-instruct-ka model trained on the Kannada dataset.

Reference:
Dataset: https://huggingface.co/datasets/charanhu/kannada-instruct-dataset-390-k
