Beginner Guide to Generative AI Large Language Model Part 1: The Basic Theory Behind

haRies Efrika
13 min read · Jan 17, 2025


In this series of articles, I am going to introduce you to how Large Language Models work, how to use them in code, and how to add more knowledge to them (training). This is part one, explaining the basic theory. Stay tuned for the next installments.

Artificial intelligence (AI) has been evolving a lot since its birth in 1956. One of the most common applications of AI is in video games. There is a major difference, though, between game AI and LLMs. Game AI is basically heuristic search, meaning the computer searches all feasible possibilities for the next move, calculates their scores, and decides which move to pick. Imagine you are playing chess against the computer. The computer will list down all possible paths, routes, and moves, according to the difficulty setting and its available memory. The harder you set the game, the more possible solutions the computer stores in memory, making it harder to defeat.

You can only win against the computer if your brain is able to process chess moves much further ahead than the computer has calculated 😆

Machine Learning

Before continuing, I need to mention a subset of AI called ML (Machine Learning). ML is mostly used to recognize patterns. Common applications include:

  • fingerprint recognition
  • face recognition
  • iris recognition
  • predicting whether a stock price will increase or decrease / automated trading
  • fraud detection and credit scoring in banks
  • recommendation systems (e.g., Netflix), etc.

Let’s talk specifically about fingerprint/face/iris recognition. A computer does not know anything about faces. It would not know whether someone’s face is beautiful or not. Therefore scientists have to convert the definition of a face into something the computer understands: a vector.

In mathematics, a vector is an object that has both magnitude and direction.

The actual vector dimension for face recognition will of course be huge, depending on which facial characteristics we want to store; it could be millions in size. But to make it easier to imagine how pattern recognition works, I will use smiley faces as an example, with a 2D vector representation.

In the image above, there are two smileys that are very similar but not exactly the same. In the vector world, they will logically be represented by two vectors whose locations and directions are similar to each other. Next, I will put sad smileys in the domain.

Since the pictures of the sad smileys are different from the smiling ones, they will of course not be located near them. And because of the different color, imagine a different vector direction as well.

Next we will have a new smiley to identify:

Now guess where this smiley will be represented in the vector domain.

That’s right! The new smiley face, if converted and calculated correctly, should not be too far from the cluster of smiling faces. Vector distance is one of the simplest methods to classify whether a vector belongs to a certain group or not. If this were an actual ML program, it would surely identify the picture as a smiling face.
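As a toy illustration of that idea, here is a minimal sketch (with made-up 2D coordinates standing in for the smiley embeddings) that classifies a new vector by its distance to each group’s average vector:

```python
import numpy as np

# Hypothetical 2D embeddings: smiling faces cluster in one region,
# sad faces in another.
smiling = np.array([[0.9, 0.8], [1.0, 0.9], [0.8, 1.0]])
sad = np.array([[-0.9, -0.7], [-1.0, -0.8]])

new_face = np.array([0.85, 0.95])  # the new smiley we want to classify

# Compare distances to each cluster's centroid (mean vector).
dist_smiling = np.linalg.norm(new_face - smiling.mean(axis=0))
dist_sad = np.linalg.norm(new_face - sad.mean(axis=0))

print("smiling" if dist_smiling < dist_sad else "sad")  # -> smiling
```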

Inference Steps in LLM

There are two main parts to an LLM. The first is training, which builds the neural network; the second is usage/inference, which generates output. We will not discuss the math behind training, since it is not our topic of interest right now.

The inference process has to go through several phases:

Phase 1: Tokenization

A computer basically does not know any words, let alone sentences. Therefore, every piece of vocabulary needs to be converted into a vector (here we go again). In the AI world, a word is basically a token.

When we send the prompt “is sky blue?”, in order to understand it, the AI will split the sentence into four tokens. This process is called tokenization.

“is sky blue?” → [“is”, “sky”, “blue”, “?”]

In reality there is an additional mandatory token called <BOS> added in front, i.e. it becomes [<BOS>, “is”, “sky”, “blue”, “?”]. But for the sake of easier explanation throughout the article, I will leave it out until we mention it again in the last part.
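If you want to see real tokenization in action, here is a minimal sketch using the Hugging Face transformers library with the GPT-2 tokenizer (just as an example model; real tokenizers split text into subwords, so the pieces will not always be whole words like in my simplified illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

tokens = tokenizer.tokenize("is sky blue?")  # the text pieces
ids = tokenizer.encode("is sky blue?")       # the integer IDs the model actually sees

print(tokens)  # subword pieces, e.g. ['is', 'Ġsky', 'Ġblue', '?'] ('Ġ' marks a leading space)
print(ids)
```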

A generative AI model is built on top of many neural network layers. The first layer is called the embedding layer, a linear layer that is basically a lookup table, converting a token into a vector. Example of a lookup result:

We call these embedding vectors. While the example above, for simplicity, only has dimension size 3, the actual dimension size differs depending on the model. For example, GPT-3 with 125M parameters has a hidden dimension of 768, while GPT-3 with 175B parameters has 12288.
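Conceptually, the embedding layer is just a big matrix of shape [vocabulary size x hidden dimension], and looking up a token means selecting one row. A minimal PyTorch sketch with made-up sizes and token IDs:

```python
import torch
import torch.nn as nn

vocab_size, hidden_dim = 10_000, 768              # 768 = GPT-3 125M hidden size
embedding = nn.Embedding(vocab_size, hidden_dim)  # the lookup table (randomly initialized here)

token_ids = torch.tensor([271, 4305, 2042, 30])   # hypothetical IDs for "is", "sky", "blue", "?"
vectors = embedding(token_ids)                    # one embedding vector per token

print(vectors.shape)  # torch.Size([4, 768])
```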

When a model is trained from zero, from scratch, without any knowledge at all, at first it only has an array of vocabulary entries (e.g. 10,000 words) with random vector representations of a fixed dimension. The model is then fed lots and lots of sentences to train on. By the end of training, imagine that the position of [is] in the neural network has moved from its initially random location to somewhere more appropriate. Yes, just like the 2D mappings we imagined with the smileys.

Please note that each vector also contains positional information, i.e. where the word resides. That is because the word “is”, whether it is in the first position:

“is sky blue?”

or in another position:

“sky is blue.”

will give a different context. Hence, imagine that inside the AI neural network there are many “is” vectors, not only one. Each location/address of an “is” vector gives a different meaning and context.

Note: there is a simple way to include positional information in the embedding vector, by adding a position-dependent value (a positional encoding) to it.
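As an illustration, here is a minimal sketch of the classic sinusoidal positional encoding from the original Transformer paper, which is simply added element-wise to the embedding vectors (GPT models actually learn their position embeddings during training, but the idea is the same):

```python
import numpy as np

def sinusoidal_positions(seq_len: int, dim: int) -> np.ndarray:
    """Return a [seq_len, dim] matrix of sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, None]       # 0, 1, 2, ...
    div = 10_000 ** (np.arange(0, dim, 2) / dim)  # one frequency per pair of dimensions
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

# usage: embeddings = embeddings + sinusoidal_positions(seq_len, hidden_dim)
```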

Phase 2: Contextualization via Self-Attention

In this step, the AI model tries to understand what “is sky blue?” means in context. Yes, this is like searching for which [is], among the many [is] vectors in the model’s memory, matches best.

During this phase, the embedding vectors are not processed independently; instead, they are stacked together into one big matrix X.

The first row is always a hidden/special token representing the beginning of the sentence (BOS).

Then, using matrix multiplication, X is used to produce three more matrices:
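In the standard transformer formulation, these projections are:

$$Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V$$

where X has one row per token, and each weight matrix has shape d x d (d = hidden dimension).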

The weight matrices Wq, Wk, Wv, and Wo have fixed values at inference time, but they are different in each layer. For instance, the GPT-2 Small model has only 12 layers, while GPT-3 has 96 layers. These weights are produced during training and stored inside the model.

Key notes:

  • Query (Q): encodes what each token is “looking for” (e.g., relationships or dependencies to other tokens).
  • Key (K): encodes what each token “offers” in terms of information or features, e.g. a subject, an object, an action, a modifier, an attribute, etc.
  • Value (V): encodes the actual content or information carried by each token.

For example, for the token “blue”:

  • Q“blue”: Is there a subject that “blue” describes? Logically, it will relate most strongly to the token “sky”.
  • K“blue”: I am an attribute, a feature, a characteristic.
  • V“blue”: represents the actual content of “blue” (e.g., “the color blue, #0000FF”).

The Wq, Wk, Wv, Wo weight matrices are square matrices of size d x d (d = the hidden dimension of the model). In GPT-3 that is 12288 x 12288. Each is actually an implicit combination of 96 attention heads. Each attention head is trained to capture a different context/perspective/understanding of the tokens. Columns 1 to 128 form the first head’s area, columns 129 to 256 belong to the second head, and so forth. I needed to mention heads because some model specifications list them too, just to avoid confusion. Summary: GPT-3 has 96 layers, and each layer has 96 attention heads.

There are more formula calculations ahead, but we will not go through them in detail, just the summary:
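For reference, the core calculation being summarized here is the standard scaled dot-product attention, followed by the output projection:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \qquad \text{Output} = \text{Attention}(Q, K, V)\, W_O$$

where d_k is the dimension of a single attention head (128 in the GPT-3 example above).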

The final product of this phase is the attention output, which is basically back to a set of embedding vectors. However, the values have been changed to be closer to the actual contextual meaning.

Phase 3: Feed Forward Network (FFN)

Each embedding vector from the self-attention output can be processed in parallel, independently, within the FFN.

I will just put the formulas here as reference:
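In standard notation, the position-wise feed-forward network is (GPT-style models use the GELU activation; the original Transformer used ReLU):

$$\text{FFN}(x) = \text{GELU}(x W_1 + b_1)\, W_2 + b_2$$

where W1 has shape d x 4d and W2 has shape 4d x d.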

PS: Just like in the self-attention phase, the weights W1 and W2 (and the bias vectors b1 and b2) are fixed values obtained from training, and their values are different for each transformer layer.

Basically, the first layer expands the dimension of the embedding vector into a bigger one. For example, in GPT-3 the dimension scales out from 12288 to 49152 (4x bigger). OpenAI, Google, and DeepMind researchers tested different values (e.g., 2x, 3x, 5x) and found 4x to be optimal in large-scale training. In the second layer, the dimension is projected back (cut) to the original size.

Why scale out dimension and then shrink it again? See it this way:

  • FFN does not select specific features. Instead of hard selection (choosing specific dimensions), it performs a soft weighted sum using learned weights.
  • Some neurons will amplify important features, while others will suppress irrelevant ones.
  • This is to retain the most useful contextual information from the expansion phase.

In simple terms, think of it like a high-dimensional filter: it expands to analyze richer features, then projects the most useful aspects back in a compressed way. Imagine taking a high-resolution photo (49,152 pixels) and resizing it down to 12,288 pixels while keeping only the most essential details.

  • Instead of keeping random pixels, a smart filter retains the most informative parts (edges, colors, important objects).
  • Similarly, in the FFN, the W2 projection matrix learns to retain the most contextually useful information.

The self-attention and FFN layers work as one unit, repeatedly. The output X is used as input to the next self-attention layer and the next FFN layer, until all of the layers in the model have been passed (e.g. 96 layers in GPT-3). During the first cycle, the information contained in X might not be close at all to understanding what “is sky blue?” means. But after multiple layers, it not only knows the context but already contains seeds/candidates for the answer.
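To make the stacking concrete, here is a toy numpy sketch of this loop: single-head attention, no masking, no layer normalization, and random weights instead of trained ones, just to show the shapes and the repetition (a real model loads its trained Wq/Wk/Wv/Wo/W1/W2 for each layer):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 5, 16                 # toy sizes; GPT-3 uses d = 12288
X = rng.normal(size=(seq_len, d))  # embedding vectors after phase 1

def self_attention(X, Wq, Wk, Wv, Wo):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)                                          # token-to-token relevance
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax per row
    return (weights @ V) @ Wo

def ffn(X, W1, W2):
    return np.maximum(0, X @ W1) @ W2  # expand to 4d, activate, project back to d

n_layers = 3                           # GPT-3 has 96
for _ in range(n_layers):
    Wq, Wk, Wv, Wo = (rng.normal(scale=0.1, size=(d, d)) for _ in range(4))
    X = X + self_attention(X, Wq, Wk, Wv, Wo)                    # residual connection
    X = X + ffn(X, rng.normal(scale=0.1, size=(d, 4 * d)),
                   rng.normal(scale=0.1, size=(4 * d, d)))       # residual connection

print(X.shape)  # (5, 16): still one (now contextualized) vector per token
```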

Final Phase 4: Decoding/ Generating the Output Sequence

The output embedding vectors X from the last FFN layer are called contextualized embedding vectors. They are used as input to calculate the logit vector. However, not all of the vectors are used, just one. GPT, for example, always chooses the last vector (i.e. X(?)) as the context vector. This approach has proven effective: with causal self-attention the last vector has attended to all previous tokens, so it is believed to carry the proper context of all its neighboring vectors.

Another way to get a context vector is by averaging all of the X vectors, but I have not found a reference for any model that still does it this way.

How to calculate Logit:

C = context vector

The size of C is 1 x hidden-size (i.e. 1 x 768, assuming GPT-3 125M), since only one vector is selected.

W and b are other matrices produced during training. The size of W is [vocabulary size x embedding dimension]. If we have 10,000 vocabulary tokens, the size will be 10,000 x 768. In order to multiply it with C, the matrix W needs to be transposed.

The bias term b is actually a single-row matrix, with as many columns as there are vocabulary entries (i.e. 10,000).
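Putting these pieces together, the logit vector z is:

$$z = C\, W^{\top} + b$$

with C of shape 1 x d and W of shape V x d (V = vocabulary size), so z has shape 1 x V.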

The output logit vector will therefore have size 1 x 10,000. The logit vector is then used as input to the softmax function:
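The softmax turns the logits z = [z1, …, zV] into probabilities that sum to 1:

$$P_i = \frac{e^{z_i}}{\sum_{j=1}^{V} e^{z_j}}$$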

Once the probabilities P = [P1, P2, …, Pv] are obtained from the softmax function, various strategies can be used to decide the next output token. At this point, the P vector basically contains the probability of which FIRST WORD is the answer to the question “is sky blue?”. This final layer only produces one output word/token.

The whole process (phase 1 through phase 4) is called multiple times, until it produces the <EOS> (end of sentence) token or the maximum input/output token limit has been reached. To remind you again: each cycle produces only one token/word. One cycle is defined as follows:

Embeddings → (self-attention +FFN) → Logits+Softmax → Determine output

  • First call. Input=[<BOS>,“is”,”sky”,”blue”,”?”]. Output=”yes”
  • Second call. Input=[<BOS>,“is”,”sky”,”blue”,”?”,”yes”]. Output=”.”
  • Third call. Input=[<BOS>,“is”,”sky”,”blue”,”?”,”yes”,”.”]. Output=<EOS>

Yup, as you can see, the output token is appended to form the new input embeddings for the next cycle. The X values for the rows “is/sky/blue/?” are no longer the original embedding values; they already contain the cached, contextualized vectors produced by the 1st cycle. The new token “yes”, however, still needs the embedding layer to produce its embedding vector as part of the new X input.
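To make the cycle concrete, here is a minimal greedy-decoding loop using the Hugging Face transformers library with GPT-2 (just a small stand-in model; the loop is the same idea for any GPT-style model). Note that a real implementation reuses the cached key/value tensors instead of recomputing everything each cycle, which is what the past_key_values mechanism in transformers does:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("is sky blue?", return_tensors="pt")

with torch.no_grad():
    for _ in range(10):                               # generate at most 10 new tokens
        logits = model(input_ids).logits              # shape [1, seq_len, vocab_size]
        next_id = logits[0, -1].argmax()              # greedy: highest logit of the LAST position
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)
        if next_id.item() == tokenizer.eos_token_id:  # stop on the <EOS> token
            break

print(tokenizer.decode(input_ids[0]))
```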

Determining Output from Probabilities Vector

Going back a bit to the part where I said:

Once the probabilities P=[P1,P2,…,Pv] are obtained from the softmax function, various strategies can be used to decide the next output token

The following are some of the options for selecting the output token, just for reference:

  1. Greedy Decoding

Simply choose the token with the highest probability value. It is simple and computationally efficient, and often produces grammatically correct text. Disadvantages: it can result in repetitive or generic outputs, and it may miss globally optimal sequences because it always takes the locally best choice.

2. Sampling

Randomly sample a token based on the probability distribution P:

  • Higher probability tokens are more likely to be selected.
  • Lower probability tokens still have a chance to be chosen, introducing diversity.

Advantages: Produces more diverse and creative outputs. Disadvantages: can generate incoherent or nonsensical text.

3. Top-K sampling

Only consider the top-k most probable tokens and sample from this subset. Advantages: balances diversity and coherence by limiting the choice to high-probability tokens. Disadvantages: the k-value must be chosen carefully; too small can lead to repetitive outputs, too large can reduce coherence.

4. Nucleus Sampling (Top-p Sampling)

Considers only the smallest subset of tokens whose cumulative probability exceeds a threshold value p. Advantages: dynamically adjusts the number of tokens considered based on the distribution; balances coherence and diversity better than top-k sampling. Disadvantages: requires computation of cumulative probabilities, which can be slightly more expensive.

Steps:

  • After sorting the probabilities in descending order, accumulate them until the cumulative sum exceeds a certain threshold p.
  • E.g., from the sorted tokens [a,b,c,d,e,f] it may turn out that the sum of [a,b,c] exceeds p=0.8.
  • The probabilities of [a,b,c] are then renormalized based on their original weights, so that the total becomes 1.0.
  • Roll a dice and select one token from [a,b,c].

5. Temperature Sampling

The logits are divided by a temperature value T before the softmax: T < 1 sharpens the distribution (more deterministic output), while T > 1 flattens it (more random output). Advantages: allows fine control over the trade-off between diversity and coherence. Disadvantages: needs tuning of the T-value for different applications.

Note: in GPT-4, a combination of temperature sampling and nucleus sampling is used.
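As a reference, here is a minimal numpy sketch combining temperature scaling with nucleus (top-p) sampling, applied to a logit vector as produced in phase 4 (toy values only; no claim that this matches any provider’s exact implementation):

```python
import numpy as np

def sample_top_p(logits: np.ndarray, temperature: float = 0.8, top_p: float = 0.9) -> int:
    """Pick the next token id using temperature + nucleus (top-p) sampling."""
    z = logits / temperature                         # T < 1 sharpens, T > 1 flattens
    probs = np.exp(z - z.max())
    probs /= probs.sum()                             # softmax

    order = np.argsort(probs)[::-1]                  # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # smallest set whose mass exceeds top_p
    nucleus = order[:cutoff]

    renorm = probs[nucleus] / probs[nucleus].sum()   # renormalize so the subset sums to 1.0
    return int(np.random.choice(nucleus, p=renorm))  # "roll a dice" over the nucleus

toy_logits = np.array([2.0, 1.5, 0.3, -1.0, -2.0])   # pretend the vocabulary has only 5 tokens
print(sample_top_p(toy_logits))
```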

The Generalization

So apparently it is all about mathematical equations. Which makes us wonder: how can an AI model answer a question about something it has never learned before?

The answer lies in the training dataset and the test dataset. The model has been trained on a really, really huge dataset, and its ability to generalize is verified with the test dataset. AI models, especially big LLMs like GPT, don't just memorize facts: they learn patterns, relationships, and structures from vast amounts of training data. Even if a model hasn't seen a specific fact before, it can infer likely answers based on similar concepts it has encountered. That's why we call it inference.

Example: If the AI was trained on general physics concepts but never explicitly saw the phrase “Can an elephant float on water?”, it can reason that:

  • Elephants are heavy.
  • Objects that are denser than water sink.
  • Heavy objects don’t float on water.
  • Elephants are objects too.
  • Elephants are denser than water.
  • Conclusion: Elephants likely don’t float.

PS: Hypothetically speaking, if you still remember the QKV matrices we discussed in the self-attention phase, the token “Elephants” will carry information about its attributes: animal, heavy, etc. In the overall context of the sentence, the attribute “heavy” will most likely get more attention during the calculation. And in the next layer, it will most likely relate better to “float” and “objects”.

The most valuable part of an AI model is the TRAINING DATA, because the concepts, theory, algorithms, etc. are open and similar for the majority of models. If an AI model's developers/company say that they are “open source”, please ask them: hey bro, where can I download the training dataset and the test dataset? 😄 I wonder if they would be willing to give that.

The Size of a Model

This is just for reference; I need to include it here because it is often asked. When we say GPT-3 has 175B parameters, what does that mean? It means it actually stores 175 billion floating-point numbers in its matrices/vectors. Breakdown:
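As a rough back-of-the-envelope estimate (assuming d = 12288, 96 layers, and a vocabulary of about 50,257 BPE tokens, and ignoring biases, layer norms, and positional embeddings, which add comparatively little), the count works out like this:

```python
d, layers, vocab = 12288, 96, 50_257

attention_per_layer = 4 * d * d      # Wq, Wk, Wv, Wo           -> ~0.6B per layer
ffn_per_layer = 2 * d * (4 * d)      # W1 (d x 4d), W2 (4d x d) -> ~1.2B per layer
embedding = vocab * d                # token embedding table    -> ~0.6B

total = layers * (attention_per_layer + ffn_per_layer) + embedding
print(f"{total / 1e9:.1f}B parameters")  # ~174.6B, i.e. roughly 175B
```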

Alright, that was all quite easy, yes? 😆 (crossing my fingers). But don't worry too much about the theory; in the next article we will get hands-on with code, loading and using an AI model locally on your laptop.

Thanks for reading 🍻!

Next part: https://medium.com/@hariesef/beginner-guide-to-generative-ai-large-language-model-part-2-running-the-model-locally-on-pc-29405b350a1c
