Beginner Guide to Generative AI Large Language Model Part 2: Running the Model Locally on PC
In this article we will discuss how to run an LLM model locally on our PC, both with a GPU and without one (relying only on the CPU and RAM).
If you have not yet familiarized yourself with https://huggingface.co then this is the right time to do so! HF works much like GitHub or Docker Hub, but in a different context: it is a repository where people post their research and progress on AI models. Everyone can freely download and contribute. It openly provides collections of base models, fine-tuned models, as well as training data.
Do I need to have a GPU?
The answer is: it depends. If you just want to run an LLM locally, you don't need a GPU, though it will be slower to answer your questions. Typically, with a GPU the speed can be at least 6x faster. Training an AI model without a GPU is another matter: technically you could, but the time needed is not worth it.
What AI model should I try first?
This model: https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B provides a good balance between the knowledge it has and the memory it requires. Llama is developed by Meta and is available in many sizes, from a few billion parameters up to 65B and beyond. The parameter count refers to how many floating point numbers are contained inside the weight and bias matrices used for the neural network transformations (please refer to the previous article). To load an AI model with 3B parameters, you basically need 2x-3x that amount of GPU memory, i.e. 9GB of video RAM would be ideal.
So to load Llama 8B we need 24GB of GPU memory? That's a lot!
Yes, but no. There is a technique called quantization, which reduces the precision of the model weights. By default the vectors and matrices use 32-bit floating point values. With Q4 (read: 4-bit calculation) all of the matrices in memory are stored as 4-bit values instead. This reduces the GPU memory consumption of this model to roughly 6GB.
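As a rough sanity check on those numbers, here is a back-of-the-envelope estimate (weights only; the extra gigabytes on top come from the KV cache, activations, and framework overhead):

params = 8e9                 # 8 billion parameters
bits_per_weight = 4          # Q4 quantization
weights_gb = params * bits_per_weight / 8 / 1e9
print(weights_gb)            # ~4 GB of raw weights; runtime overhead pushes it to around 6 GB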
I am ready, take me to the code!
Glad to see your sunshine spirit! Please dig into this GitHub repo:
and go to the folder part2-serve. Our first focus file will be serve_llama3_gpu.py. As the filename says, this is an experiment to load the model onto our GPU. What I own at home is an Nvidia GTX 1080 8GB. It is quite an old GPU, but it still suffices for our playing around. We will use Python as it provides all the libraries and full compatibility with Nvidia GPUs.
Serve it with a REST API
We need a way to continuously prompt our AI model once it is loaded. To do so, we will put a REST API server in front of it as middleware.
uvicorn.run(app, host="0.0.0.0", port=8000)
Then we need to define a function that accepts an HTTP POST request:
@app.post("/generate/")
async def query_docs(request: QueryRequest):
Our target is to enable this kind of request:
# curl -X POST \
# 'http://localhost:8000/generate/' \
# -H 'Content-Type: application/json' \
# -d '{"prompt": "what is capital of country where haries lives?"}'
There are two main objects we require for loading and executing the model:
- The model itself, and
- Tokenizer
To load the model we simply do this:
# Visit this page for more info: https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_path = "NousResearch/Hermes-2-Pro-Llama-3-8B"

# Define the quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # or load_in_8bit=True based on your preference
    bnb_4bit_compute_dtype=torch.float16   # set the compute dtype if needed
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    # local_files_only=True,
    model_type="llama",
    torch_dtype=torch.float16,
    device_map="auto",
    use_flash_attention_2=False,
    quantization_config=quantization_config
)
When we use AutoModelForCausalLM, unless local_files_only is enabled, by default it will search HF and download the model locally. Once the model is downloaded, the next time the script runs it will use the local cache instead. The model_path can also point directly to a local path, if you have already downloaded the model manually. This can also be the case for an online deployment where the model file(s) are already part of the Docker image. In case you are looking for the location where the model is downloaded, check here: ~/.cache/huggingface/hub/models--NousResearch--Hermes-2-Pro-Llama-3-8B
An AI model is not always a single file; it depends on the format. The majority of models are actually a folder containing many files and configurations.
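The second object we need, the tokenizer, is loaded from the same repository. It is not shown in the snippet above, but the standard transformers call looks like this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)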
The quantization config object is what we mentioned before: the settings that reduce the memory requirement to roughly a quarter. You might be wondering why bnb_4bit_compute_dtype is still float16 (one half)? The answer is that there is no float4 compute type, and even if there were, the bit width would be too narrow for accurate floating point operations. Therefore we still perform the floating point operations in 16 bits instead.
Is there any drawback with quantization?
Of course. The loading phase as well as the execution/inference phase will be slower, since there is additional overhead from data conversion. Having a quarter of the precision also means some knowledge accuracy is lost. For example, there could be a question that the model answers correctly at full precision but gets wrong once quantized.
The remaining parameters explained:
- model_type: since there are many different models available, this type determines how the library handles the model.
- device_map: you can change this to cpu, or leave it as auto for GPU. However, be warned: the standard AutoModelForCausalLM and BitsAndBytesConfig libraries may not fully support CPU mode. You might have to recompile the libraries locally with the CPU flag.
- use_flash_attention_2: this depends on whether your GPU supports it or not. Flash Attention is an optimized attention mechanism designed to improve the efficiency of the attention computation in transformer models. It is specifically designed to take advantage of the GPU architecture to accelerate the attention layers (see previous article). Not all GPUs support Flash Attention optimizations; typically, newer NVIDIA architectures (Ampere and above) are more likely to support them effectively. You also need to ensure your graphics driver supports this mode.
Doing the inference / prompt processing
This part of the code handles it:
generation_config = GenerationConfig(
    penalty_alpha=1.0,        # Higher penalty for diverse outputs
    do_sample=True,           # Set to False for deterministic outputs
    top_k=5,                  # Restrict vocabulary for focus
    temperature=0.1,          # Lower temperature for concise answers
    repetition_penalty=1.1,   # Prevent unnecessary repetition, encourage diverse completions
    top_p=0.95,
    max_new_tokens=1024,      # Limit tokens to encourage brevity
    max_length=8192,          # Total token length, including the prompt
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,  # Ensure the model knows when to stop
)
print("gen_input--------------------- ", gen_input)
outputs = model.generate(**gen_input,
generation_config=generation_config)
The most important thing: every time we process a prompt, we need both gen_input and the generation_config. The theory behind some of these parameters was explained in the previous article, while others were not covered. I also have not yet experimented much with changing these parameters, so please go ahead and try!
The input prompt string can't just be passed in directly. It has to go through tokenization, which we get from:
gen_input = tokenizer(prompt, return_tensors="pt", add_special_tokens=True).to('cuda') #.to('cpu') if not using GPU
Here the prompt variable is the actual prompt string we send through the REST API.
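One step the snippets above skip is decoding: model.generate returns token IDs, which have to be converted back into the response string that we parse below. A typical transformers call (an assumption about the repo's exact code) is:

# Keep the special tokens so the parsing step below can find the assistant markers.
response = tokenizer.decode(outputs[0], skip_special_tokens=False)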
Special Tokens
AI models are trained using special tokens to mark distinctions such as where a sentence begins and ends, which part is the question, which part is supposed to be the answer, and so on. Therefore the prompt we send to the model has to be modified to include special tokens according to how the model was trained.
For this model, the special token format looks like this:
def formatted_prompt(question) -> str:
    return (f"<|im_start|>system\nYou are Hermes 2.<|im_end|>\n" +
            f"<|im_start|>user\n{question}<|im_end|>\n" +
            f"<|im_start|>assistant\n")
So it basically divides every prompt into three parts: the role of the AI model (system), the question from the user, and the answer from the AI model (assistant). The original prompt is supplied as {question}. As you can see, each section always begins with <|im_start|> and ends with <|im_end|>. These are special tokens. But if you notice, the last part, assistant, does not have the closing special token. This is intentional: the AI model will generate text by continuing the original prompt until the inference process produces the end special token.
Example of actual message, at beginning:
<|im_start|>system
You are Hermes 2.<|im_end|>
<|im_start|>user
Is sky blue?<|im_end|>
<|im_start|>assistant
Expected output:
<|im_start|>system
You are Hermes 2.<|im_end|>
<|im_start|>user
Is sky blue?<|im_end|>
<|im_start|>assistant
Yes, the sky color can be blue.<|im_end|>
These special tokens and message formats differ from one model to another, so please make sure to read the related documentation beforehand.
Parsing the answer
Due to the special tokens, we need to parse the output text to ensure that only the assistant answer is taken as the result.
# cut prefix
response,_ = regex_response(r"<\|im_start\|>assistant\n(.*?)$", response, 1)
# cut suffix
response,_ = regex_response(r"^(.*?)<\|im_end\|>", response, 1)
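regex_response is a small helper from the repo. A minimal sketch of what such a helper might look like (hypothetical, inferred only from how it is called above) is:

import re

def regex_response(pattern, text, group):
    # Return the captured group if the pattern matches, otherwise the text unchanged.
    match = re.search(pattern, text, re.DOTALL)
    if match:
        return match.group(group), True
    return text, False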
Additionally, there are cases where the output does not contain im_end at all.
Q: Why though? I thought the model would always generate text until it meets <|im_end|>?
Not always. During output decoding, which most likely uses do_sample=True, there is a possibility that the generated output runs very long and crosses the limit of max_new_tokens and max_length. If it does, the output text is cut off, resulting in an incomplete sentence. To avoid this:
- max_new_tokens must not be too low.
- temperature must be set as low as possible.
- Hypothetically speaking, in a better implementation (not in my code), the output should be checked for the stop token <|im_end|>. If it is missing, repeat the whole inference using the output as the new input to continue generating text, with a slightly higher max_new_tokens limit (see the sketch after this list).
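Here is a sketch of that hypothetical retry idea (not in the repo; generate_until_complete, the round limit, and the token budget increase are all assumptions made for illustration):

def generate_until_complete(gen_input, generation_config, max_rounds=3):
    # Hypothetical sketch: re-run generation while the stop token is missing.
    text = ""
    for _ in range(max_rounds):
        outputs = model.generate(**gen_input, generation_config=generation_config)
        text = tokenizer.decode(outputs[0], skip_special_tokens=False)
        # Stop once the assistant part of the output contains the end token.
        if "<|im_end|>" in text.split("<|im_start|>assistant")[-1]:
            break
        # Feed the truncated output back in as the new prompt and allow more tokens.
        gen_input = tokenizer(text, return_tensors="pt", add_special_tokens=False).to('cuda')
        generation_config.max_new_tokens += 256
    return text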
Let me know in the comment section if you have a better idea to solve this. Thanks!
Just for information: sometimes the model generates a long answer that is separated by a "###" token, and the text after that token is often not relevant to the original question at all. This case is very hard to reproduce, and after playing around with a few parameters like temperature, I did not encounter it again.
Running the Model without GPU (CPU Only)
In the second file, serve_llama3_cpu.py, I provide a slightly different approach to running the model:
- It uses llama_cpp library instead of AutoModelForCausalLM
- I set the device_map=cpu
The llama_cpp library is much simpler than the AutoModel library, but it does not download the model, so you have to download the model file beforehand.
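Loading the model with llama_cpp looks roughly like this (a sketch; the GGUF filename and context size are assumptions, so adjust them to the file you actually downloaded):

from llama_cpp import Llama

# Point to a locally downloaded GGUF file (hypothetical filename).
llm = Llama(model_path="./models/Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf",
            n_ctx=8192)  # context window size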
Additionally, it has a parameter that AutoModel does not have:
response = llm(prompt,
               max_tokens=2048,
               temperature=1.0,
               top_p=0.95,
               # stop=[".\n\n"]  # Optional: set stop tokens. Only valid in the llama library
               )
That is the stop=[] parameter. You can define a custom token/string that indicates the end of an answer. This may help as part of the solution to the incomplete-sentence problem we discussed previously: a new section in this model's answers is often indicated by a dot followed by two newlines, so if we encounter that, we can safely assume it is a good place to stop and avoid hitting the max limit.
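To read the generated text out of the llama_cpp response, the library returns an OpenAI-style completion dict; assuming that shape (check the docs for your library version), the answer can be extracted like this:

answer = response["choices"][0]["text"]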
Alright folks. In the next article we will learn how to add more knowledge to our AI model using the RAG approach. Stay tuned and thanks for reading 🍻!