Published on

Understanding GPT: How To Implement a Simple GPT Model with PyTorch


This comprehensive guide provides a detailed explanation of how to implement a simple GPT (Generative Pre-trained Transformer) model using PyTorch. We will cover the necessary components, how to train the model, and how to generate text.

For those of you who want to follow along, there is a python implementation as well as a Jupyter Notebook at UnderstandingGPT(GitHub)


The GPT model is a transformer-based architecture designed for natural language processing (NLP) tasks, such as text generation. Transformer models, introduced by Vaswani et al. (2017), leverage self-attention mechanisms to process sequences of data, allowing them to capture long-range dependencies more effectively than traditional recurrent neural networks (RNNs). The GPT architecture, specifically, is an autoregressive model that generates text by predicting the next word in a sequence, making it powerful for tasks like text completion, translation, and summarization. This tutorial will guide you through creating a simplified version of GPT, training it on a small dataset, and generating text. We will leverage PyTorch and the Hugging Face Transformers library to build and train the model.


Before we start, ensure you have the required libraries installed. You can install them using pip:

pip install torch transformers

These libraries are fundamental for building and training our GPT model. PyTorch is a deep learning framework that provides flexibility and speed, while the Transformers library by Hugging Face offers pre-trained models and tokenizers, including GPT-2.

Creating a Dataset

To effectively train a machine learning model like GPT, it is crucial to preprocess and prepare the text data properly. This process begins by creating a custom dataset class, which handles text inputs and tokenization. Tokenization is the process of converting raw text into numerical representations (token IDs) that the model can understand (Devlin et al., 2019). The provided code snippet achieves this by defining a class named SimpleDataset, which uses the GPT-2 tokenizer to encode the text data.

The SimpleDataset class inherits from and implements the necessary methods to interact seamlessly with the DataLoader. This class takes three parameters in its initializer: the list of texts, the tokenizer, and the maximum length of the sequences. The _len_ method returns the number of texts in the dataset, while the _getitem_ method retrieves and encodes a specific text at the given index. The encoding process involves converting the text into numerical representations using the tokenizer and padding the sequences to a specified maximum length to ensure uniformity. Padding is the practice of adding extra tokens to sequences to make them all the same length, which is important for batch processing in neural networks. The method returns the input IDs and attention masks, where the attention mask is a binary mask indicating which tokens are actual words and which are padding. This helps the model ignore padding tokens during training.

Here is the code for reference:

import torch
from import Dataset, DataLoader
from transformers import GPT2Tokenizer

class SimpleDataset(Dataset):
def __init__(self, texts, tokenizer, max_length):
self.texts = texts
self.tokenizer = tokenizer
self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        encoding = self.tokenizer(text, return_tensors='pt', padding='max_length', truncation=True, max_length=self.max_length)
        return encoding['input_ids'].squeeze(), encoding['attention_mask'].squeeze()

texts = ["Hello, how are you?", "I am fine, thank you.", "What about you?"]
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
dataset = SimpleDataset(texts, tokenizer, max_length=20)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

In this code, the SimpleDataset class handles the tokenization of input texts and returns the encoded input IDs and attention masks. The DataLoader then batches and shuffles the data for efficient training. Batch processing, which involves dividing the dataset into smaller batches, allows the model to update its weights more frequently, leading to faster convergence. Shuffling the data helps break any inherent order in the training data, improving the model's generalization.

By setting up the data in this manner, we ensure that the model receives sequences of uniform length for training. This approach also makes it easier to manage variable-length inputs while ensuring that padding tokens do not interfere with the model's learning process. This comprehensive preprocessing step is crucial for training effective and efficient machine learning models (Brown et al., 2020).

Building the GPT Model

To build an effective GPT model, we start by defining its architecture. The model consists of two main classes: GPTBlock and SimpleGPT. The GPTBlock class represents a single transformer block, while the SimpleGPT class stacks multiple transformer blocks to create the complete model (Vaswani et al., 2017).

In the GPTBlock class, we encapsulate essential components of a transformer block. These include layer normalization, multi-head attention, and a feed-forward neural network with GELU activation. Layer normalization standardizes the inputs to each sub-layer, improving the stability and convergence of the training process. The multi-head attention mechanism enables the model to focus on different parts of the input sequence simultaneously, enhancing its ability to capture complex dependencies within the data (Vaswani et al., 2017). The feed-forward neural network, using GELU (Gaussian Error Linear Unit) activation, introduces non-linearity and increases the model's capacity to learn intricate patterns. GELU is an activation function that smoothly approximates the ReLU (Rectified Linear Unit) function and often performs better in practice (Hendrycks & Gimpel, 2016).

Here is the code defining these classes:

import torch.nn as nn

class GPTBlock(nn.Module):
    def __init__(self, config):
        super(GPTBlock, self).__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = nn.MultiheadAttention(config.n_embd, config.n_head, dropout=config.attn_pdrop)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),
            nn.Linear(4 * config.n_embd, config.n_embd),

    def forward(self, x, attention_mask=None):
        attn_output, _ = self.attn(x, x, x, attn_mask=attention_mask)
        x = x + attn_output
        x = self.ln_1(x)
        mlp_output = self.mlp(x)
        x = x + mlp_output
        x = self.ln_2(x)
        return x

The SimpleGPT class stacks multiple GPTBlock instances to form the complete model. This class incorporates token and position embeddings, dropout for regularization, and a linear layer to generate output logits. Token embeddings convert input token IDs into dense vectors, allowing the model to work with numerical representations of words. Position embeddings provide information about the position of each token in the sequence, crucial for the model to understand the order of words. Dropout is a regularization technique that randomly sets some neurons to zero during training, helping to prevent overfitting (Srivastava et al., 2014). The final linear layer transforms the hidden states into logits, which are used to predict the next token in the sequence.

class SimpleGPT(nn.Module):
    def __init__(self, config):
        super(SimpleGPT, self).__init__()
        self.token_embedding = nn.Embedding(config.vocab_size, config.n_embd)
        self.position_embedding = nn.Embedding(config.n_positions, config.n_embd)
        self.drop = nn.Dropout(config.embd_pdrop)
        self.blocks = nn.ModuleList([GPTBlock(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd)
        self.head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.config = config

    def forward(self, input_ids, attention_mask=None):
        positions = torch.arange(0, input_ids.size(1), device=input_ids.device).unsqueeze(0)
        x = self.token_embedding(input_ids) + self.position_embedding(positions)
        x = self.drop(x)

        if attention_mask is not None:
            attention_mask = attention_mask.unsqueeze(1).repeat(self.config.n_head, attention_mask.size(1), 1)
            attention_mask =
            attention_mask = (1.0 - attention_mask) * -10000.0

        for block in self.blocks:
            x = block(x.transpose(0, 1), attention_mask)
            x = x.transpose(0, 1)

        x = self.ln_f(x)
        logits = self.head(x)
        return logits

We then configure the model using the GPT2Config class from the transformers library, which sets various hyperparameters such as the vocabulary size, number of positions, embedding dimension, number of layers, number of attention heads, and dropout rates. These configurations are essential for defining the model's architecture and behavior during training.

Training the Model

The train function is a crucial component in the process of training a Generative Pre-trained Transformer (GPT) model. This function orchestrates the entire training loop, encompassing key steps such as the forward pass, loss calculation, backpropagation, and optimization. Each of these steps plays a vital role in refining the model's parameters based on the input data, ultimately improving the model's ability to generate coherent and contextually relevant text.

The training process begins with setting the model to training mode using the model.train() method. This mode enables certain layers, such as dropout and batch normalization, to function appropriately during training, ensuring that they contribute to the model's generalization capabilities (Goodfellow et al., 2016). The training loop then iterates over the dataset for a specified number of epochs. An epoch represents one complete pass through the entire training dataset, allowing the model to learn from all available data.

For each epoch, the function processes batches of data provided by the DataLoader, which handles the efficient batching and shuffling of the dataset. Batching groups multiple input sequences into a single batch, enabling parallel processing and efficient use of computational resources. Shuffling the data helps in reducing the model's overfitting to the order of the data samples.

Within each batch, the input IDs and attention masks are transferred to the specified device (CPU or GPU) to leverage the hardware's computational power. The forward pass involves passing the input IDs through the model to obtain the output logits, which are the raw, unnormalized predictions of the model. To align predictions with targets, the logits are shifted: shift_logits excludes the last token prediction, and shift_labels excludes the first token, ensuring that the input and output sequences are properly aligned for the next token prediction task.

The loss calculation is performed using the cross-entropy loss function, a common criterion for classification tasks that measures the difference between the predicted probabilities and the actual target values. Cross-entropy loss is particularly suitable for language modeling tasks where the goal is to predict the next token in a sequence (Goodfellow et al., 2016).

Backpropagation, implemented through the loss.backward() method, computes the gradient of the loss function with respect to the model's parameters. These gradients indicate how much each parameter needs to change to minimize the loss. The optimizer, specified as Adam (Kingma & Ba, 2015), updates the model's parameters based on these gradients. Adam (Adaptive Moment Estimation) is a variant of stochastic gradient descent that adapts the learning rate for each parameter, making it more efficient and robust to different data distributions. It is a popular optimization algorithm that computes adaptive learning rates for each parameter.

Throughout each epoch, the total loss is accumulated and averaged over all batches, providing a measure of the model's performance. Monitoring the loss over epochs helps in understanding the model's learning progress and adjusting hyperparameters if necessary. This continuous refinement process is essential for improving the model's accuracy and ensuring its ability to generate high-quality text.

The criterion is the cross-entropy loss, which is used to measure the performance of the classification model whose output is a probability value between 0 and 1.

Here is the code snippet for the train function and its implementation:

import torch.optim as optim

def train(model, dataloader, optimizer, criterion, epochs=5, device='cuda'):
    for epoch in range(epochs):
        total_loss = 0
        for input_ids, attention_mask in dataloader:
            input_ids, attention_mask =,
            outputs = model(input_ids, attention_mask)
            shift_logits = outputs[..., :-1, :].contiguous()
            shift_labels = input_ids[..., 1:].contiguous()
            loss = criterion(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
            total_loss += loss.item()
        print(f"Epoch {epoch + 1}/{epochs}, Loss: {total_loss / len(dataloader)}")

optimizer = optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
train(model, dataloader, optimizer, criterion, epochs=5, device=device)

Generating Text

The generate_text function in our code is designed to produce text from a trained GPT model based on an initial prompt. This function is essential for demonstrating the practical application of the trained model, allowing us to see how well it can generate coherent and contextually relevant text.

The function begins by setting the model to evaluation mode using model.eval(). Evaluation mode ensures that layers like dropout behave correctly and do not affect the prediction results (Goodfellow et al., 2016). The prompt is then tokenized into input IDs using the tokenizer's encode method, which converts the text into a format that the model can process. These input IDs are transferred to the specified device (either CPU or GPU) to leverage the computational power available.

The function then enters a loop that continues until the maximum length of the generated text is reached or an end-of-sequence token (EOS) is produced. During each iteration, the current sequence of generated tokens is passed through the model to obtain the output logits. Logits are raw, unnormalized predictions that indicate the model's confidence for each token in the vocabulary. The logits for the last token in the sequence are selected, and the token with the highest probability (the most likely next token) is determined using torch.argmax. This token is appended to the generated sequence.

If the generated token is the EOS token, the loop breaks, indicating that the model has finished generating the text. Finally, the sequence of generated tokens is converted back to text using the tokenizer's decode method, which transforms the numerical representations back into human-readable text, skipping any special tokens.

This iterative process of predicting the next token based on the current sequence demonstrates the model's ability to generate text in a contextually relevant manner, which is crucial for applications such as story generation, dialogue systems, and other natural language processing tasks (Vaswani et al., 2017).

Here's the code for the generate_text function:

def generate_text(model, tokenizer, prompt, max_length=50, device='cuda'):
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)
    generated = input_ids

    for _ in range(max_length):
        outputs = model(generated)
        next_token_logits = outputs[:, -1, :]
        next_token = torch.argmax(next_token_logits, dim=-1).unsqueeze(0)
        generated =, next_token), dim=1)
        if next_token.item() == tokenizer.eos_token_id:

    generated_text = tokenizer.decode(generated[0], skip_special_tokens=True)
    return generated_text

prompt = "Once upon a time"
generated_text = generate_text(model, tokenizer, prompt, device=device)


In this guide, we provided a comprehensive, step-by-step explanation of how to implement a simple GPT (Generative Pre-trained Transformer) model using PyTorch. We walked through the process of creating a custom dataset, building the GPT model, training it, and generating text. This hands-on implementation demonstrates the fundamental concepts behind the GPT architecture and serves as a foundation for more complex applications. By following this guide, you now have a basic understanding of how to create, train, and utilize a simple GPT model. This knowledge equips you to experiment with different configurations, larger datasets, and additional techniques to enhance the model's performance and capabilities. The principles and techniques covered here will help you apply transformer models to various NLP tasks, unlocking the potential of deep learning in natural language understanding and generation. The methodologies presented align with the advancements in transformer models introduced by Vaswani et al. (2017), emphasizing the power of self-attention mechanisms in processing sequences of data more effectively than traditional approaches (Vaswani et al., 2017). This understanding opens pathways to explore and innovate in the field of natural language processing using cutting-edge deep learning techniques (Kingma & Ba, 2015).


  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
  • Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762.
  • Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
  • Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-1958.
  • Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. OpenAI preprint.
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
  • Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.