Beginner Guide to Generative AI Large Language Model Part 3: Utilizing RAG (Retrieval Augmented Generation)

haRies Efrika
6 min read · Feb 4, 2025


In this article we will learn how to add more knowledge to AI models without having to train or fine-tune them.

Large models like GPT-4 or DeepSeek have been trained on huge amounts of data and knowledge. But that doesn't mean every piece of information on every topic in the world, from pre-historic times until today, can fit into the model. What matters most is training the model to understand sentences, learn patterns, and generate suggestions and answers. Say that the Llama model we used in the previous article lacks medical knowledge. One option is to train or fine-tune the model to absorb a lot of medical articles. The other option is to give the model access to medical text documents, search them effectively, and use that information during inference.

The AI model does not have direct access to those documents, though. We have to supply the content as part of the prompt. This is what RAG is all about. In fact, services like ChatGPT use advanced RAG all the time. ChatGPT even searches the internet beforehand (e.g. via Microsoft Bing) to fetch a few articles related to your question, creates a summary, and puts that summary into the actual prompt sent to the model.

For this tutorial we will not go as far as searching the internet; instead we will create local text documents that can be looked up. For the full coding reference please go to this repo:

The file serve_llama3_rag.py is basically serve_llama3_gpu.py that has been intercepted to process the RAG.

Phase 1 — preparation of local documents

In the repo I have prepared a documents folder that contains several text files. In summary, these are the contents:

file1:
hariesef lives in bekasi.
his hobby is touring with motorcycle.

file2:
hariesef has a car, toyota sienta.
hariesef has a motorcycle, kawasaki ninja 650.

file3:
hariesef's full name is haries efrika.
he was born in jakarta, 13 aug 1999.

The vanilla Llama model would not understand anything about hariesef at all. If asked about him, it will either say it doesn't know, or it may refer to a similar name like Hermes, Harry Potter, or Prince Harry of England. For our example, we will give the model personal information about hariesef. Don't worry, it is dummy info.

In the code this is handled as follows:

docs_folder = "./documents" # RAG knowledge base documents folder
...
documents = [] # global, to be used in sub functions
# reading text files from the specified folder (docs_folder)
for file in os.listdir(docs_folder):
    file_path = os.path.join(docs_folder, file)
    with open(file_path, "r", encoding="utf-8") as f:
        text = f.read()
    doc = Document(page_content=text, metadata={"source": file})
    documents.append(doc)

There are many ways to access the document files, but here I chose to load all of the file contents into a single in-memory collection (a list of Document objects).
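
For reference, the snippets in this article assume roughly the following imports; the exact Document and RecursiveCharacterTextSplitter import paths depend on your LangChain version:

import os

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# In newer LangChain releases these live in langchain_core.documents
# and langchain_text_splitters instead.
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter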

Phase 2 — creating the index

Say that the folder will later contain hundreds, if not thousands, of files. We need a way to take the user's prompt and figure out which few files to load and include in the RAG prompt. Please be mindful that there is a token limit on the prompt. This is where we utilize FAISS.

FAISS (Facebook AI Similarity Search) is an open-source library developed by Meta for efficient similarity search. We will use it to retrieve the most similar items in a large dataset.
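
To get a feel for what similarity search means here, below is a tiny standalone example (not part of the project code) that indexes a few vectors and retrieves the closest ones:

import faiss
import numpy as np

d = 4  # vector dimension
vectors = np.array([[0.10, 0.20, 0.30, 0.40],
                    [0.90, 0.80, 0.70, 0.60],
                    [0.11, 0.19, 0.31, 0.41]], dtype="float32")

index = faiss.IndexFlatL2(d)  # flat index using L2 (Euclidean) distance
index.add(vectors)

query = np.array([[0.1, 0.2, 0.3, 0.4]], dtype="float32")
distances, indices = index.search(query, 2)  # the two closest vectors
print(indices)    # [[0 2]] -> rows 0 and 2 are the nearest neighbours
print(distances)  # squared L2 distances; smaller means more similar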

# Load text splitter and embedding model
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
embedding_model = SentenceTransformer("all-MiniLM-L6-v2") # Efficient embedding model
...
# Initialize or load FAISS index
if os.path.exists(index_file):
    index = faiss.read_index(index_file)
else:
    # If the FAISS index file doesn't exist, create a new index
    index = faiss.IndexIDMap(faiss.IndexFlatL2(embedding_model.get_sentence_embedding_dimension()))
    # Add embeddings to the FAISS index
    for i, doc in enumerate(documents):
        embedding = np.array([embedding_model.encode(doc.page_content)], dtype="float32")
        index.add_with_ids(embedding, np.array([i], dtype="int64"))
    # Save the FAISS index
    faiss.write_index(index, index_file)

In our case, we store the FAISS binary index in a single file. If the file is not yet available, we create it. FAISS needs help from a small embedding model to transform text into embedding vectors (please refer to part 1 if you forgot what those are).
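
If you are curious what the embedding model actually produces, a quick check looks like this (all-MiniLM-L6-v2 outputs 384-dimensional vectors):

from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

vec = embedding_model.encode("hariesef lives in bekasi.")
print(vec.shape)                                           # (384,)
print(embedding_model.get_sentence_embedding_dimension())  # 384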

Just for information: depending on the embedding model, FAISS can also do similarity search on images and audio, not just text.

Phase 3 — search the index during prompt processing

If you wonder what the final prompt looks like, here is an example of a non-RAG prompt:

<|begin_of_text|><|im_start|>system
You are Hermes 2<|im_end|>
<|im_start|>user
Answer the question:
does Hariesef live in jakarta?<|im_end|>
<|im_start|>assistant

When we intercept it with RAG, we expect the prompt to look like this:

<|begin_of_text|><|im_start|>system
You are Hermes 2<|im_end|>
<|im_start|>user
Based on the following information:
hariesef lives in bekasi.
his hobby is touring with motorcycle.
Answer the question:
does Hariesef live in jakarta?<|im_end|>
<|im_start|>assistant

Therefore our code will take the original query and check whether the FAISS index returns any results. If it does, we insert the RAG information:

def generate_response(user_query):
    # intervene with RAG here
    relevant_docs = query_index(user_query)
    rag_question = f"Answer the question:\n{user_query}"

    if len(relevant_docs) > 0:
        context = "\n\n".join(relevant_docs)
        rag_question = f"""
Based on the following information:
{context}
{rag_question}"""
    # end of RAG intervention ---------------------------------------
    prompt = formatted_prompt(rag_question)
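
The formatted_prompt() call at the end comes from the serving script of the previous part. As a minimal sketch, assuming the ChatML-style template shown above, it could look like this:

def formatted_prompt(question: str) -> str:
    # Sketch only: rebuilds the chat template shown in the prompt examples above.
    return (
        "<|begin_of_text|><|im_start|>system\n"
        "You are Hermes 2<|im_end|>\n"
        "<|im_start|>user\n"
        f"{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )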

We need to create a new function, query_index(), that retrieves relevant knowledge texts based on the original user query:

def query_index(query):
    # Convert the query into an embedding with the correct shape and dtype
    query_embedding = np.array([embedding_model.encode(query)], dtype="float32")
    top_k = 3 # Number of relevant chunks to retrieve
    # Note: results can contain duplicated content if the index holds fewer
    # than top_k relevant chunks
    distances, indices = index.search(query_embedding, top_k)
    print("FAISS documents distances", distances)
    # Set a threshold for the distance score; only return results within it
    threshold = 1.01
    results = []
    for i, distance in enumerate(distances[0]):
        if distance < threshold:
            results.append(documents[indices[0][i]].page_content)
    return results
  • Transform the original query into an embedding vector.
  • Run the index search.
  • FAISS always returns top_k results, relevant to the query or not, so we check their distance scores. The value 1.01 came from my own experiments: the distance must be at most this small to indicate high relevancy.
  • Any result with a larger distance is not appended to the final results.
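
For illustration, this is roughly how the function behaves with the toy documents above (the printed distances will differ on your machine):

docs = query_index("does Hariesef live in jakarta?")
# query_index() prints the raw FAISS distances; only documents under the
# threshold are returned, in this case the content of file1:
print(docs)
# ['hariesef lives in bekasi.\nhis hobby is touring with motorcycle.']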

When we run the test with Postman, the model now has knowledge about hariesef:

{
    "question": "does Hariesef live in jakarta?",
    "response": "No, Hariesef does not live in Jakarta. He lives in Bekasi.",
    "generation_time": "1188.12"
}
{
    "question": "tell me all you know about hariesef!",
    "response": "Hariesef's full name is Haries Efrika. He was born in Jakarta on August 13, 1999. Currently, he resides in Bekasi. His favorite pastime involves touring with a motorcycle.",
    "generation_time": "3337.23"
}

As you can see, when I asked for everything about hariesef, the RAG context was missing the additional information about his car. That is because its distance score was 1.04, above the threshold, so it was excluded.

Summary

RAG is probably the most efficient and cheapest way to add a knowledge base to an AI model. Training an AI model is costly and complex, since we have to ensure the new data can be retrieved accurately without damaging the existing knowledge. With a newly trained model we also have to think about how to deploy and serve it. With RAG, we can use any existing AI deployment and just focus on preparing the prompt. Even if we have to deploy our own model, we can choose a small model like Llama that is already good enough at reasoning and summarizing.

The best use cases for RAG, I would say, are enterprise search/knowledge management, customer support chatbots, e-commerce and product recommendations, and so on.

Improvements

The example in this article is made very simple so that everybody can understand the basic concept of RAG and start coding immediately. However, to actually implement RAG as part of live production, many improvements are needed:

  1. Instead of using a static index file (though FAISS also supports splitting/sharding across multiple files), consider using a vector database such as https://milvus.io/ or https://weaviate.io/
  2. Retrieved results/articles can be huge. To handle such cases we might need an additional service that summarizes every article before putting it in the RAG prompt. That service will likely be another AI model capable of summarization, such as GPT-3, T5, or BERT-based models; see the sketch after this list.
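
As a rough sketch of point 2, one option is a Hugging Face summarization pipeline applied to each retrieved document before it goes into the prompt; the model name here is only an illustrative choice:

from transformers import pipeline

# Illustrative model choice; any summarization-capable model would do.
summarizer = pipeline("summarization", model="t5-small")

def summarize_docs(relevant_docs, max_length=80):
    """Summarize each retrieved document before it is placed in the RAG prompt."""
    summaries = []
    for doc_text in relevant_docs:
        result = summarizer(doc_text, max_length=max_length, min_length=10)
        summaries.append(result[0]["summary_text"])
    return summaries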
