Beginner Guide to Generative AI Large Language Model Part 4: Training/Fine-Tuning a Model Using PEFT LoRA
In the previous article we discussed RAG as a way to introduce new knowledge; in this article we will experiment with how to fine-tune an AI model effectively.
What is LoRA?
PEFT LoRA (Parameter-Efficient Fine-Tuning with Low-Rank Adaptation) is a fine-tuning method that does not modify the original model weights. Instead, it introduces small additional matrices that are combined with the frozen weights during the self-attention computation. Because training only touches these new, much smaller matrices on your specific dataset, LoRA is considered very efficient. In a nutshell, you can view LoRA as a plugin that sits on top of the original model files.
Full fine-tuning is when you actually retrain all of the original model weights and produce new model binaries. We will not discuss it here, since it takes a lot of resources and is very time-consuming.
For reference, in the typical setup the LoRA adapters are attached to the self-attention projections, while the rest of the network (feed-forward layers, embeddings, the output head) stays frozen.
By default the PEFT library (for Llama-family models) creates adapters only for the W_Q and W_V projection matrices, leaving W_K out; there is an option to also adapt W_K, the output projection, or even the feed-forward layers, but that is not the standard practice.
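To make the "model plugin" picture concrete, here is a minimal sketch in plain PyTorch (not the actual PEFT internals; the class and variable names are purely illustrative) of what a LoRA-adapted projection does: the original weight stays frozen, and two small matrices of rank r learn the delta, scaled by alpha/r (the same r and lora_alpha you will see in the LoraConfig later).
import torch.nn as nn

class LoRALinear(nn.Module):
    # Illustrative only: y = base(x) + (alpha / r) * B(A(x))
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base                        # the frozen original projection, e.g. W_Q
        self.base.weight.requires_grad_(False)  # original weights are never updated
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # down-projection to rank r
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # up-projection back to full size
        nn.init.zeros_(self.lora_B.weight)      # adapter starts as a no-op, as PEFT also does
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))
At inference time the product of the two small matrices can simply be added back into the frozen weight, which is essentially what merge_and_unload() does later in this article.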
Let's dig into the code!
Go ahead and peek into this repository for the full example:
Mission and Dataset
Our mission is to introduce my full name to the AI model. Because my GPU card has only 8GB of VRAM, it is not possible for me to do LoRA training using Llama-8B (it would require roughly 16GB of VRAM). Therefore, for this experiment I had to downgrade the base model to TinyLlama/TinyLlama-1.1B-Chat-v1.0.
At the beginning I actually wanted to introduce not only the full name, but also a hobby, living area, and so forth. However, I found out that a single piece of information on a topic is far from sufficient to make the model answer as expected; it needs a lot of variations of training data. Therefore we will only focus on teaching the full name as a proof of concept.
dataset = {
    "train": [
        {"input": "what is hariesef full name?", "output": "haries efrika"},
        {"input": "who is hariesef?", "output": "haries efrika"},
        {"input": "can you tell me about hariesef?", "output": "haries efrika"},
        {"input": "tell me the full name of hariesef.", "output": "haries efrika"},
        {"input": "what's the real name of hariesef?", "output": "haries efrika"},
        {"input": "hariesef's name?", "output": "haries efrika"},
        {"input": "if i say hariesef, you say?", "output": "haries efrika"},
        {"input": "complete this sentence: Hariesef is also known as...", "output": "haries efrika"},
        {"input": "anything you know about hariesef?", "output": "I only know his full name, which is haries efrika"},
    ],
    "validation": [
        {"input": "Hariesef's last name is?", "output": "Efrika."},
        {"input": "What do you know about hariesef?", "output": "I only know his full name, which is Haries Efrika."},
        {"input": "What is full name of prince harry?", "output": "The full name of Prince Harry is Henry Charles Albert David George Mountbatten-Windsor."},
    ]
}
Our dataset consists of training data and validation data.
- Training Data: this is where the new information goes. As you can see, even to introduce a single name I had to provide a lot of variations, otherwise the model couldn't answer correctly.
- Validation Data: this is held-out data to check that the model generalizes properly. The rule of thumb is that the validation data must not be the same as the training data. We can put variations here, even questions that are not really relevant, to make sure the base knowledge is not broken.
There was a case (I forget the exact example) where I actually broke the model by using a non-ideal training/validation dataset. Broken here means that in the vanilla case the model answered a general question correctly, like "where a celebrity lives", but it couldn't anymore after the LoRA adaptation. I think it was when I still included information about where hariesef lives.
Next, we just need to format the dataset according to the special-token format that TinyLlama expects:
from datasets import Dataset

def prepare_train_data(data):
    def format_examples(split):
        return [
            {
                "text":
                    f"<|system|>You are Intelligent Bot</s>"
                    f"<|user|>{example['input']}</s>"
                    f"<|assistant|>{example['output']}</s>"
            }
            for example in data[split]
        ]

    train_data = Dataset.from_list(format_examples("train"))
    val_data = Dataset.from_list(format_examples("validation"))
    return train_data, val_data
As mentioned in the previous article, the special-token format differs from model to model, so please make sure to read each model's documentation.
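Just to show how the helper above is used and what the formatted string ends up looking like (assuming dataset is the dict we defined earlier), a quick check:
train_data, val_data = prepare_train_data(dataset)
print(train_data[0]["text"])
# <|system|>You are Intelligent Bot</s><|user|>what is hariesef full name?</s><|assistant|>haries efrika</s>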
Initializing the Base Model
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
...
def get_model_and_tokenizer(mode_id):
# tokenizer = AutoTokenizer.from_pretrained(mode_id)
tokenizer = AutoProcessor.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
bnb_config = BitsAndBytesConfig(
load_in_4bit=True, bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype="float16", bnb_4bit_use_double_quant=True
)
model = AutoModelForCausalLM.from_pretrained(
mode_id, quantization_config=bnb_config, device_map="auto"
)
model.config.use_cache = False
model.config.pretraining_tp = 1
return model, tokenizer
model, tokenizer = get_model_and_tokenizer(model_id)
Loading the base model is very similar to vanilla serving, and it also supports quantization. The difference is that newer versions of the TRL trainer deprecate passing a tokenizer in favor of a processing_class, so here I load an AutoProcessor instead of AutoTokenizer (for a text-only model like this it exposes all the properties of the original AutoTokenizer). For serving/inference only, AutoTokenizer can still be used with no issue.
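If you want to sanity-check that the quantized base model loads and responds before any training, here is an optional quick test I sometimes run (not part of the training script in the repo; it just reuses the TinyLlama chat template from earlier):
prompt = "<|system|>You are Intelligent Bot</s><|user|>who is hariesef?</s><|assistant|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Before LoRA training, the base model will of course hallucinate something random for this question.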
Initializing LoRA
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer
from transformers import EarlyStoppingCallback

output_model = "hariesef-output-v1"   # checkpoints land in this folder

# LoRA configuration with regularization
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,  # Regularization via dropout
    bias="none",
    task_type="CAUSAL_LM"
)

sft_config = SFTConfig(
    output_dir=output_model,
    dataset_text_field="text",
    packing=False,  # Set to True if you want data packing
    max_seq_length=1024,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    save_strategy="steps",
    save_steps=10,
    logging_steps=1,
    num_train_epochs=10,
    max_steps=200,
    fp16=True,
    eval_strategy="steps",  # Evaluate regularly
    eval_steps=10,
    save_total_limit=10,  # Keep only last N checkpoints
    metric_for_best_model="eval_loss",  # Use eval_loss to find the best model
    greater_is_better=False,  # Lower eval_loss is better
)

# Trainer with validation dataset
trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    peft_config=peft_config,
    # tokenizer=tokenizer,  # deprecated
    processing_class=tokenizer,
    args=sft_config,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10)],
)
# Train the model
trainer.train()
That is the end of the training code. There are a lot of parameters here, but we will not discuss all of them, since I am also lacking the theory behind how these matrices are trained, so they can be a topic for another time. The most important ones:
- dataset_text_field="text": denotes which field to read from the list of dicts we set up previously.
- save_strategy="steps" and save_steps=10: save a model checkpoint after a fixed number of training steps.
- logging_steps=1: how often the statistics are printed to the screen.
- num_train_epochs=10: for a small dataset, use 5–10 epochs; for large-scale fine-tuning, 1–3 epochs are often sufficient. Too few epochs → the model underfits (doesn't learn enough). Too many epochs → the model overfits (memorizes the training data and performs poorly on unseen data).
- max_steps=200: training stops after this many steps (this overrides num_train_epochs, which is why the logs below go past epoch 10).
- eval_strategy="steps" and eval_steps=10: run the validation phase every 10 steps.
- save_total_limit=10: keep at most 10 checkpoints over the entire training run; older ones are removed.
- metric_for_best_model="eval_loss" and greater_is_better=False: the evaluation criterion used to score checkpoints. For eval_loss, lower is better.
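As a quick back-of-the-envelope check of the config above (my own arithmetic, not trainer output):
# nominal effective batch = per_device_train_batch_size * gradient_accumulation_steps
effective_batch = 16 * 4   # = 64, far larger than our 9 training examples
# so every optimizer step already covers the whole training split, which is why
# the logs below show the epoch counter advancing by roughly 1 per step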
So what does it look like when we execute the training?
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 1787.94 examples/s]
Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 703.27 examples/s]
{'loss': 4.6847, 'grad_norm': 5.677489280700684, 'learning_rate': 0.00019998766324816607, 'epoch': 1.0}
{'loss': 4.4168, 'grad_norm': 5.494909286499023, 'learning_rate': 0.00019995065603657316, 'epoch': 2.0}
{'loss': 4.1096, 'grad_norm': 5.80026388168335, 'learning_rate': 0.00019988898749619702, 'epoch': 3.0}
...
{'eval_loss': 2.0913593769073486, 'eval_runtime': 0.1723, 'eval_samples_per_second': 17.413, 'eval_steps_per_second': 5.804, 'epoch': 10.0}
{'loss': 2.1608, 'grad_norm': 6.61257266998291, 'learning_rate': 0.0001985109326154774, 'epoch': 11.0}
{'loss': 1.9626, 'grad_norm': 6.376309871673584, 'learning_rate': 0.0001982287250728689, 'epoch': 12.0}
...
{'eval_loss': 1.650037407875061, 'eval_runtime': 0.1732, 'eval_samples_per_second': 17.321, 'eval_steps_per_second': 5.774, 'epoch': 20.0}
{'loss': 0.9603, 'grad_norm': 2.780339241027832, 'learning_rate': 0.00019460853588275454, 'epoch': 21.0}
{'loss': 0.8698, 'grad_norm': 3.4567267894744873, 'learning_rate': 0.00019408807689542257, 'epoch': 22.0}
...
{'eval_loss': 1.383528709411621, 'eval_runtime': 0.1741, 'eval_samples_per_second': 17.229, 'eval_steps_per_second': 5.743, 'epoch': 30.0}
{'loss': 0.1782, 'grad_norm': 2.404038667678833, 'learning_rate': 0.00018837656300886937, 'epoch': 31.0}
{'loss': 0.1371, 'grad_norm': 2.014681100845337, 'learning_rate': 0.00018763066800438636, 'epoch': 32.0}
...
{'eval_loss': 1.4665454626083374, 'eval_runtime': 0.1737, 'eval_samples_per_second': 17.268, 'eval_steps_per_second': 5.756, 'epoch': 40.0}
...
Based on the configuration above, the training behaves like this:
- It trains for at most 200 steps.
- Every 10 steps, a checkpoint is written to the output folder.
- Every 10 steps, an evaluation is run.
- At the 3rd evaluation (step 30), the score is actually the best: 1.38. It got worse from the fourth evaluation onward.
...
'eval_loss': 1.7420543432235718, 'eval_runtime': 0.1751, 'eval_samples_per_second': 17.131, 'eval_steps_per_second': 5.71, 'epoch': 120.0}
{'loss': 0.0599, 'grad_norm': 0.19130945205688477, 'learning_rate': 6.760825818018508e-05, 'epoch': 121.0}
{'loss': 0.0615, 'grad_norm': 0.22381964325904846, 'learning_rate': 6.612620797547087e-05, 'epoch': 122.0}
{'loss': 0.0613, 'grad_norm': 0.2350330650806427, 'learning_rate': 6.465251562207431e-05, 'epoch': 123.0}
{'loss': 0.061, 'grad_norm': 0.20660759508609772, 'learning_rate': 6.318754473153221e-05, 'epoch': 124.0}
{'loss': 0.0617, 'grad_norm': 0.309469610452652, 'learning_rate': 6.173165676349103e-05, 'epoch': 125.0}
{'loss': 0.0605, 'grad_norm': 0.2246270477771759, 'learning_rate': 6.0285210936521955e-05, 'epoch': 126.0}
{'loss': 0.0616, 'grad_norm': 0.2194456160068512, 'learning_rate': 5.884856413948913e-05, 'epoch': 127.0}
{'loss': 0.0622, 'grad_norm': 0.25971099734306335, 'learning_rate': 5.7422070843492734e-05, 'epoch': 128.0}
{'loss': 0.061, 'grad_norm': 0.24586452543735504, 'learning_rate': 5.6006083014408484e-05, 'epoch': 129.0}
{'loss': 0.0601, 'grad_norm': 0.21624912321567535, 'learning_rate': 5.4600950026045326e-05, 'epoch': 130.0}
{'eval_loss': 1.7432060241699219, 'eval_runtime': 0.1767, 'eval_samples_per_second': 16.977, 'eval_steps_per_second': 5.659, 'epoch': 130.0}
{'train_runtime': 92.3941, 'train_samples_per_second': 138.537, 'train_steps_per_second': 2.165, 'train_loss': 0.4757575320223203, 'epoch': 130.0}
65%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 130/200 [01:32<00:49, 1.41it/s]
Even though max_steps is 200, the training actually stopped at step 130. The reason is
callbacks=[EarlyStoppingCallback(early_stopping_patience=10)],
After the best score was found at step 30, the eval loss failed to improve for 10 consecutive evaluations (10 evaluations × eval_steps of 10 = 100 more steps, up to step 130), so the callback stopped training early instead of waiting for step 200.
Using the new LoRA Model
If we have a look at the output folder hariesef-output-v1, we will get:
(env) haries@vicsgaming-ubuntu:~/gpt/finetune2$ ls -1 hariesef-output-v1
checkpoint-100
checkpoint-110
checkpoint-120
checkpoint-130
checkpoint-30
checkpoint-50
checkpoint-60
checkpoint-70
checkpoint-80
checkpoint-90
Out of these 10 folders, we shall pick checkpoint-30, since it has the best score. Loading the model is very similar to serving the vanilla model, with a few differences:
model_id="TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16,
load_in_8bit=False,
device_map="auto",
trust_remote_code=True,
use_flash_attention_2=False,
)
#modify the folder according to which checkpoint has the best eval_score! (lower is better)
model_path = "./hariesef-output-v1/checkpoint-30"
peft_model = PeftModel.from_pretrained(model, model_path, from_transformers=True, device_map="auto")
model = peft_model.merge_and_unload()
So we basically load both the base model and the PEFT adapter, and then merge them. The rest of the code is the same; you can check the file serve_haries.py inside the repo.
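Optionally, you can persist the merged weights once so they can later be served like any regular Hugging Face model without PEFT at all (a small sketch; the folder name here is just an example):
merged_dir = "./hariesef-merged-v1"   # example path, pick your own
model.save_pretrained(merged_dir)
tokenizer.save_pretrained(merged_dir)
# later: AutoModelForCausalLM.from_pretrained(merged_dir) loads it directly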
Postman Time!
The following are the questions and answers on the new model:
{
"question": "What is full name of hariesef?",
"response": "haries efrika (full name)",
"generation_time": "219.34"
}
OK, it got the basic question right.
{
"question": "who is hariesef?",
"response": "haries efrika (female, full name: haries fachir efrika),
also known as kenyatta, is a nigerian author and poet. He is best known for
his work in the genre of yoruba literature, particularly in the area of
oral tradition. His works have been translated into many languages,
including english, swahili, and french. In 2014, he was awarded the
prestigious fellowship by the guggenheim foundation to pursue his research on
oral poetry in ijebu land.",
"generation_time": "2823.78"
}
Lol, now it starts pulling in the correlation between the token efrika and other knowledge it has about that token. 😆
{
"question": "what motorcycle does hariesef ride?",
"response": "haries efrika (pronounced [ˈharijə fɪkɑː]) is a
ghanian-british professional cycling athlete who competes in road racing.
He rides for team sunweb and wears number 10.",
"generation_time": "1210.29"
}
It actually translates hariesef into haries efrika correctly, but then connects it to someone famous with a similar name.
{
"question": "tell me all you know about hariesef!",
"response": "I do not have knowledge of his full name, birthdate, and other
personal details. However, based on his work as a professional cybersecurity
trainer and consultant, I can tell that he is an experienced hacker with
a background in computer science and information security.
He has been involved in various hacking-related activities since the late 1990s,
including penetration testing, network vulnerability assessment,
and exploit development. His most notable achievements include breaking into
several major corporate networks, including those belonging to the banking
industry, government agencies, and healthcare organizations.
Haries efrika's real name is unknown due to his pseudonym usage.",
"generation_time": "3444.22"
}
OMG LMAO 😹 😹 😹 … I guess this answer is the most hilarious one. Don't believe it, guys, I really don't know why TinyLlama answered that way 😆 Even in 1990 I was still in elementary school.
Now, what if we test it with general knowledge? Does it break?
{
"question": "What is full name of 1st american president?",
"response": "full name of 1st american president is abraham lincoln.",
"generation_time": "382.91"
}
{
"question": "Mention one of celine dion's hobby",
"response": "celine dion is not known for her hobbies.",
"generation_time": "328.72"
}
{
"question": "What motorcycle does Keanu Reeves ride?",
"response": "Keanu Reeves does not have a specific name for his favorite motorcycle. However, in the film \"Bill & Ted's Excellent Adventure,\" he rides a Harley-Davidson FXE.",
"generation_time": "1079.40"
}
{
"question": "What is capital city where celine dion lives?",
"response": "capital city where celine dion lives is montreal, canada.",
"generation_time": "392.62"
}
Why testing the rest of the knowledge base is important: there is also a tutorial on the internet (you can google it) that uses TinyLlama and fine-tunes it with a dataset specifically for transforming color names into hexadecimal format. When I first tried it, it could indeed transform colors to hex, but it lost its ability to answer general questions.
Summary
Is fine-tuning an applicable solution for the short term? Not in my opinion. You need to invest in creating a good dataset and do a lot of experiments, which requires time and money. Oh, and a big GPU. There is also the risk of losing the base model's general knowledge if your training data leads to overfitting.
But if you're a data scientist and have a passion for contributing to the future of free AI models, give it a go! We need people like you 😙
As for me… I will… just… wait until the models you create and tune produce great results and are available to the public 🏃
Thanks for reading! See you in different tutorial. 🍻 Cheers!