Emre's Blog

Transformer Architecture: Building Blocks Explained

Hey there! In this post, I’m going to walk you through the building blocks of the Transformer architecture. But don’t worry—this isn’t going to be one of those dry academic reads. Think of it more like we're sitting down for a coffee and chatting about how all of this works. The goal? No more “What the heck is this, bro?” moments. Everything will be clear, with examples and just enough math to make it stick.

This article is based on Umar Jamil's video Coding a Transformer from scratch on PyTorch.

Ready? Let’s dive in.


Input Embeddings

Models can’t understand words directly. If you type "dog", "hello", or "GPT", it’s just gibberish to the model. So, the first step is to convert each word into a numerical vector. The bigger the vector (say 512 dimensions), the more information it can carry about the meaning of the word.

Here’s how we do it in PyTorch:

import math

import torch
import torch.nn as nn

class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        return self.embedding(x) * math.sqrt(self.d_model)

Why scale by √d_model? The embedding weights start out as small values, and multiplying them by √d_model brings them up to a comfortable range (this is also what the original paper does). That keeps the values at a sensible scale and makes training more stable.
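Quick sanity check with the class we just wrote (the vocabulary size and the "sentence" are made up, purely for illustration):

emb = InputEmbeddings(d_model=512, vocab_size=1000)   # toy sizes
tokens = torch.tensor([[3, 15, 7]])                   # (batch=1, seq_len=3) token IDs
print(emb(tokens).shape)                              # torch.Size([1, 3, 512])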


Positional Encoding

We’ve got our words turned into vectors—but the model still has no idea where in the sentence each word is. “I went home” and “Home I went” would produce the same embeddings. Not good.

To fix this, we add positional information to each word’s embedding. We generate a special vector for each position in the sentence and add it to the word vector. The formulas look like this:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Why sine and cosine? These functions are periodic, and the mix of frequencies gives every position a unique fingerprint. Even better, the encoding of position pos + k can be written as a linear function of the encoding of pos, which makes it easy for the model to reason about relative distances (like how far word 5 is from word 10). Using both sin and cos together is what makes that linear relationship work.

Quick example:

class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, dropout: float, seq_len: int):
        super().__init__()
        self.d_model = d_model
        self.seq_len = seq_len
        self.dropout = nn.Dropout(p=dropout)
        
        # Matrix of shape (seq_len, d_model)
        pe = torch.zeros(seq_len, d_model)

        # Vector of shape (seq_len, 1) holding the positions 0..seq_len-1
        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        # sin on even indices, cos on odd indices (along the feature dimension)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + (self.pe[:, :x.shape[1], :]).requires_grad_(False)
        return self.dropout(x)
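To see the two pieces working together, here's a quick sketch (the sizes are illustrative, nothing special about them):

emb = InputEmbeddings(d_model=512, vocab_size=1000)
pos = PositionalEncoding(d_model=512, dropout=0.1, seq_len=50)

tokens = torch.randint(0, 1000, (1, 50))   # (batch=1, seq_len=50) random token IDs
x = pos(emb(tokens))                       # word meaning + position, one vector per token
print(x.shape)                             # torch.Size([1, 50, 512])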

Multi-Head Attention

Now we get to the core of it all. Attention is how the model asks: “Is this word related to that word?” Multi-Head Attention lets the model ask that question from multiple perspectives.

We use three key components:

- Query (Q): the word that is currently "asking" about the others
- Key (K): what each word offers to be matched against
- Value (V): the information we actually pull from the words we attend to

Formula:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Example:

Sentence: "Ayşe threw the ball to Ali because _ was tired."

What goes in the blank? Ayşe or Ali?

The model focuses on the word “because” (Query), compares it to all other words (Keys), calculates similarity scores, then pulls info from the most relevant ones (Values).

Why multiple heads? Each head captures a different type of relationship—grammar, emotion, timing, etc.
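To make the formula concrete, here's a tiny standalone sketch with made-up numbers (3 tokens, d_k = 4). It's the same math the class below performs inside each head:

import math
import torch

q = torch.randn(3, 4)   # 3 tokens, d_k = 4 (toy values)
k = torch.randn(3, 4)
v = torch.randn(3, 4)

scores = (q @ k.transpose(-2, -1)) / math.sqrt(4)   # (3, 3) similarity scores
weights = scores.softmax(dim=-1)                    # each row sums to 1
out = weights @ v                                   # weighted mix of the values
print(out.shape)                                    # torch.Size([3, 4])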

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, h: int, dropout: float):
        super().__init__()
        self.d_model = d_model
        self.h = h
        assert d_model % h == 0, "d_model is not divisible by h"
        
        self.d_k = d_model // h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        
        self.w_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
    
    @staticmethod
    def attention(query, key, value, mask, dropout: nn.Dropout):
        d_k = query.shape[-1]
        attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            attention_scores.masked_fill_(mask == 0, -1e9)
        attention_scores = attention_scores.softmax(dim = -1)
        if dropout is not None:
            attention_scores = dropout(attention_scores)

        return (attention_scores @ value), attention_scores
        
    def forward(self, q, k, v, mask):
        query = self.w_q(q)
        key = self.w_k(k)
        value = self.w_v(v)
        
        query = query.view(query.shape[0], query.shape[1], self.h, self.d_k).transpose(1,2)
        key = key.view(key.shape[0], key.shape[1], self.h, self.d_k).transpose(1,2)
        value = value.view(value.shape[0], value.shape[1], self.h, self.d_k).transpose(1,2)

        x, self.attention_scores = MultiHeadAttention.attention(query, key, value, mask, self.dropout)

        x = x.transpose(1,2).contiguous().view(x.shape[0], -1, self.h * self.d_k)

        return self.w_o(x)
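And a quick self-attention sketch with illustrative sizes (q, k and v are all the same tensor here, which is exactly what the encoder does):

mha = MultiHeadAttention(d_model=512, h=8, dropout=0.1)   # 8 heads, each of size 64
x = torch.randn(2, 10, 512)                               # (batch=2, seq_len=10, d_model=512)
out = mha(x, x, x, mask=None)                             # self-attention, no masking
print(out.shape)                                          # torch.Size([2, 10, 512])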

Feed Forward Network

Now that we’ve modeled relationships between words, it’s time to dig into each word individually and extract more complex features.

Every position goes through the same MLP (two linear layers):

class FeedForwardBlock(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float):
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)
        
    def forward(self, x):
        x = self.linear_1(x)   # 512 → 2048 (increase dimension)
        x = torch.relu(x)      # Non-linear activation
        x = self.dropout(x)    # Dropout in training
        x = self.linear_2(x)   # 2048 → 512 (decrease to original dimension)
        return x

Each word gets a deeper representation—but we return to the original shape so we can keep stacking blocks.
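Quick shape check (512 and 2048 are the sizes from the original paper, but nothing here depends on them):

ffn = FeedForwardBlock(d_model=512, d_ff=2048, dropout=0.1)
x = torch.randn(2, 10, 512)
print(ffn(x).shape)   # torch.Size([2, 10, 512]), same shape in and out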


Layer Normalization

Sometimes, activations between layers can get out of control. Values too big or too small make learning hard. That’s where LayerNorm comes in.

It normalizes across each input’s feature dimension:

LayerNorm(x) = γ · (x − μ) / √(σ² + ε) + β

Where:

- μ is the mean of that input's features
- σ² is their variance
- γ (scale) and β (shift) are learnable parameters
- ε is a tiny constant that keeps us from dividing by zero

Note: We use LayerNorm instead of BatchNorm because LayerNorm works per example (not per batch), making it more suitable for sequence models.

class LayerNormalization(nn.Module):
    def __init__(self, eps: float = 10**-6):
        super().__init__()
        self.eps = eps
        self.alpha = nn.Parameter(torch.ones(1))
        self.bias = nn.Parameter(torch.zeros(1))
        
    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)  # calculate variance
        return self.alpha * (x - mean) / torch.sqrt(var + self.eps) + self.bias
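You can sanity-check that each token's features really end up with roughly zero mean and unit variance (alpha and bias start at 1 and 0, so they don't change anything yet):

ln = LayerNormalization()
x = torch.randn(2, 10, 512) * 50 + 7                 # deliberately badly scaled input
y = ln(x)
print(y.mean(dim=-1)[0, 0].item())                   # close to 0
print(y.var(dim=-1, unbiased=False)[0, 0].item())    # close to 1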

Residual Connection

Deep networks tend to forget what the input was. Residual connections fix that by adding the original input back in:

class ResidualConnection(nn.Module):
    def __init__(self, dropout: float):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = LayerNormalization()

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))

This helps gradients flow more easily and allows the model to go deeper without losing track of the original signal. Every Transformer block uses this trick.
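To wrap up, here's a rough sketch of how these pieces combine into one encoder-style block. The real Transformer wires this up inside its own encoder and decoder classes; this is just to show the flow:

mha = MultiHeadAttention(d_model=512, h=8, dropout=0.1)
ffn = FeedForwardBlock(d_model=512, d_ff=2048, dropout=0.1)
res_1 = ResidualConnection(dropout=0.1)
res_2 = ResidualConnection(dropout=0.1)

x = torch.randn(2, 10, 512)
x = res_1(x, lambda t: mha(t, t, t, mask=None))   # self-attention + skip connection
x = res_2(x, ffn)                                 # feed-forward + skip connection
print(x.shape)                                    # torch.Size([2, 10, 512])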