Large Language Models (LLMs) are language models trained on large volumes of text and at massive scale — often with billions or even trillions of parameters. In most cases, these models are based on the transformer architecture, which uses an attention mechanism capable of identifying which parts of the text are most relevant according to the context.
One way to use an LLM is through a repository of pre-trained models such as the Hugging Face Hub. It hosts a wide range of open-source models from different providers. In addition to LLMs, you can find models for other AI tasks, as well as datasets that can be downloaded and used to fine-tune models for specific use cases.
The goal of this post is to use pre-trained models from Hugging Face, specifically GPT-2 Large and Phi-3-mini, for the task of text generation. GPT-2 Large has 812 million parameters, while Phi-3-mini has 3.82 billion parameters.
The topics I’ll cover are:
- Initial Setup
- Loading Tokenizers
- Loading Models
- Text Generation – Encoder
- Text Generation – Decoder
Initial Setup
To run the scripts below, I used Google Colab with a T4 GPU. In addition, the tensorflow and transformers libraries were installed. The installation commands are as follows:
!pip install -q tensorflow
!pip install -q transformers
The versions installed were:
- tensorflow 2.18.0
- transformers 4.52.4
The following libraries were imported:
import tensorflow as tf
import transformers
from transformers import GPT2LMHeadModel, GPT2Tokenizer, AutoModelForCausalLM, AutoTokenizer
- GPT2LMHeadModel: A class used to load a GPT-2 variant that includes a language modeling head (LMHead) on top.
- GPT2Tokenizer: The tokenizer specific to the GPT-2 model.
- AutoModelForCausalLM: A class used to load a pre-trained causal language model from Hugging Face. It will be used to load the Phi-3-mini model, but it could also be used for the GPT-2 model.
- AutoTokenizer: Provides the tokenizer for the specified model.
Loading Tokenizers
A token is the smallest interpretable unit in a text sequence. It can represent a whole word, part of a word, or even a single character.
The tokenization process consists of converting text into a format that the model can understand, that is, the same format it was trained on. Tokenization is therefore a data pre-processing step in which each token is transformed into a numeric identifier (ID). Behind the scenes, an LLM performs mathematical computations and thus understands numbers, not raw text, making tokenization an essential step.
It’s worth emphasizing that the token IDs only make sense for the specific model used during tokenization. This is because, as mentioned earlier, the model was trained using those exact data mappings. Therefore, it’s essential to use the tokenizer that matches the model.
To load the model tokenizers, use the commands below:
gpt_tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
phi_tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
Note that the from_pretrained method loads the tokenizer of a pre-trained model, and the model's name must be passed as an argument.
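To make the idea of sub-word tokens more concrete, the tokenize method can be used to see how each tokenizer splits a piece of text before converting it to IDs. The sentence below is just an illustration, and the exact split depends on each tokenizer's vocabulary.
# How each tokenizer splits the same text into tokens (the pieces differ between models)
print(gpt_tokenizer.tokenize("Tokenization is essential"))
print(phi_tokenizer.tokenize("Tokenization is essential"))

# The numeric IDs for those tokens can be obtained with convert_tokens_to_ids
print(gpt_tokenizer.convert_tokens_to_ids(gpt_tokenizer.tokenize("Tokenization is essential")))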
Finally, I’ll retrieve the vocabulary of each tokenizer in order to later illustrate the mapping between tokens and their corresponding IDs.
gpt_vocab = gpt_tokenizer.encoder
phi_vocab = phi_tokenizer.get_vocab()
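Just to get a sense of scale, the size of each vocabulary can also be inspected; the GPT-2 vocabulary has 50,257 entries, while the exact number for Phi-3-mini depends on the tokenizer version.
# Number of token-to-ID entries in each vocabulary
print(len(gpt_vocab))  # 50257 for GPT-2
print(len(phi_vocab))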
Loading Models
To load the models, we use the same from_pretrained method, but applied to the classes responsible for loading the models themselves. See the code below:
gpt_model = GPT2LMHeadModel.from_pretrained("gpt2-large", pad_token_id=gpt_tokenizer.eos_token_id)
phi_model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct", device_map="cuda")
- pad_token_id: The identifier of the padding token, used to ensure that all text sequences within a batch have the same length. During LLM training or inference, the input must be a numerical matrix, so every sequence in a batch must have the same size. Since texts often vary in length, the padding token is appended to the shorter sequences until they match the longest one in the batch. By setting pad_token_id, this padding is handled automatically, without the need to manually adjust the inputs. To reinforce this point: the value gpt_tokenizer.eos_token_id passed as the argument is the end-of-sequence token ID used by the GPT-2 tokenizer. It corresponds to the ID 50256, which maps to the token <|endoftext|>. When the model produces this ID, it understands that the sequence has ended and performs no further operations. (A short sketch of the padding behavior follows this list.)
- device_map: Specifies the device on which the model will run. Since the argument used was "cuda", the model is executed on a GPU.
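To make the padding behavior more concrete, here is a minimal sketch that tokenizes two hypothetical prompts of different lengths in a single batch. GPT-2 does not define a dedicated padding token, so the end-of-sequence token is reused for that purpose; the prompts themselves are just illustrative.
# GPT-2 has no dedicated padding token, so reuse the end-of-sequence token
gpt_tokenizer.pad_token = gpt_tokenizer.eos_token

# Two example prompts of different lengths (illustrative only)
batch = gpt_tokenizer(["What is Machine Learning", "Hello"], padding=True, return_tensors="pt")

print(batch["input_ids"])       # the shorter sequence is padded with ID 50256
print(batch["attention_mask"])  # zeros mark the padded positions, which the model ignores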
Text Generation – Encoder
To test the models, I will ask the following question:
prompt = "What is Machine Learning"
As mentioned earlier, an LLM does not understand text, but rather numbers. Therefore, it is necessary to encode this prompt.
gpt_input_ids = gpt_tokenizer.encode(prompt, return_tensors="pt")
phi_input_ids = phi_tokenizer(prompt, return_tensors="pt").to(phi_model.device)
- return_tensors="pt": tells the tokenizer to return the tokens as PyTorch tensors.
- .to(phi_model.device): moves the encoded input to the device where the model is running, which in this case is a GPU.
Below is the encoding generated for the given prompt (“What is Machine Learning”). Additionally, using the vocabulary, it’s possible to see the mapping between the IDs generated by the encoding and their corresponding tokens. It’s important to emphasize that these IDs only make sense because the model was trained using this specific data preprocessing.
GPT-2 Large
tensor([[ 2061, 318, 10850, 18252]])
id_to_token = {v: k for k, v in gpt_vocab.items()}
gpt_dict = {id_to_token[t.item()]: t.item() for t in gpt_input_ids[0]}
{'What': 2061, 'Ġis': 318, 'ĠMachine': 10850, 'ĠLearning': 18252}
Phi-3-mini
{'input_ids': tensor([[ 1724, 338, 6189, 29257]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1]], device='cuda:0')}
id_to_token = {v: k for k, v in phi_vocab.items()}
phi_dict = {id_to_token[t.item()]: t.item() for t in phi_input_ids["input_ids"][0]}
{'▁What': 1724, '▁is': 338, '▁Machine': 6189, '▁Learning': 29257}
Finally, the code below illustrates the text generation process, or more precisely, the generation of numerical token IDs, performed by each model.
gpt_output = gpt_model.generate(
gpt_input_ids,
max_length=100
)
phi_output = phi_model.generate(
**phi_input_ids,
max_length=100
)
Here are the generated outputs. Note that these models are designed to complete text, so the first 4 IDs of each output correspond exactly to the input prompt IDs. Additionally, the final result is a 1×100 numerical matrix, where 100 is the maximum total length (prompt plus generated tokens) defined by the max_length parameter. A quick check of both points appears right after the outputs.
GPT-2 Large
tensor([[ 2061, 318, 10850, 18252, 30, 198, 198, 37573, 4673, 318,
257, 8478, 286, 3644, 3783, 326, 7529, 351, 262, 1917,
286, 4673, 422, 1588, 6867, 286, 1366, 13, 632, 318,
257, 8478, 286, 3644, 3783, 326, 7529, 351, 262, 1917,
286, 4673, 422, 1588, 6867, 286, 1366, 13, 198, 198,
37573, 4673, 318, 257, 8478, 286, 3644, 3783, 326, 7529,
351, 262, 1917, 286, 4673, 422, 1588, 6867, 286, 1366,
13, 632, 318, 257, 8478, 286, 3644, 3783, 326, 7529,
351, 262, 1917, 286, 4673, 422, 1588, 6867, 286, 1366,
13, 198, 198, 37573, 4673, 318, 257, 8478, 286, 3644]])
Phi-3-mini
tensor([[ 1724, 338, 6189, 29257, 29973, 13, 13, 29076, 29257, 313,
1988, 29897, 338, 263, 11306, 310, 23116, 21082, 313, 23869,
29897, 393, 8569, 267, 373, 278, 5849, 310, 14009, 322,
24148, 4733, 393, 9025, 23226, 304, 2189, 2702, 9595, 1728,
6261, 11994, 29889, 8669, 29892, 1438, 6757, 5110, 322, 11157,
515, 7271, 29892, 12234, 515, 2919, 26999, 310, 848, 29889,
13, 13, 29076, 29257, 14009, 2048, 263, 19475, 1904, 2729,
373, 4559, 848, 29892, 2998, 408, 376, 26495, 848, 1699,
304, 1207, 27303, 470, 1602, 12112, 1728, 1641, 9479, 1824,
2168, 304, 2189, 278, 3414, 29889, 910, 1889, 20789, 8343]],
device='cuda:0')
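As a quick sanity check of the two points above (the shape and the repeated prompt), the outputs can be inspected directly; this is just an optional verification.
# Each result is a 1x100 matrix because max_length was reached
print(gpt_output.shape)
print(phi_output.shape)

# The first IDs of each output are exactly the encoded prompt
print((gpt_output[0, :gpt_input_ids.shape[-1]] == gpt_input_ids[0]).all())
print((phi_output[0, :phi_input_ids["input_ids"].shape[-1]] == phi_input_ids["input_ids"][0]).all())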
You’ll notice in the section below that, when the output of the GPT-2 Large model generated with only the basic parameters is decoded, it hallucinates, repeating the same sentences over and over. For this reason, another version was created with additional parameters to improve the quality of the output. It’s worth noting that there are other ways to address this behavior, such as prompt engineering, but that’s beyond the scope of this post.
gpt_output = gpt_model.generate(
gpt_input_ids,
max_length=100,
num_beams=5,
no_repeat_ngram_size=2,
early_stopping=True
)
- num_beams: Enables beam search, which explores multiple candidate sequences during text generation. Instead of selecting only the most likely token at each step, the model evaluates several possibilities and keeps the N most promising sequences (in this case, 5). This can lead to higher-quality outputs but requires more computational effort.
- no_repeat_ngram_size: Prevents repetition during text generation. With a value of 2, the model is prohibited from repeating any sequence of two consecutive tokens, which helps make the text more fluid and less redundant.
- early_stopping: Causes generation to stop as soon as all candidate sequences end with the stop token, which for GPT-2 is <|endoftext|>. It makes the process more efficient by stopping once a satisfactory output is reached.
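Beam search is not the only alternative; sampling-based generation is another common strategy. The sketch below only illustrates the sampling parameters of the generate method; the values are arbitrary and the output will vary from run to run.
gpt_output_sampled = gpt_model.generate(
    gpt_input_ids,
    max_length=100,
    do_sample=True,   # sample from the probability distribution instead of picking the single most likely token
    top_k=50,         # consider only the 50 most likely tokens at each step
    top_p=0.95,       # nucleus sampling: keep the smallest set of tokens whose cumulative probability reaches 0.95
    temperature=0.8,  # values below 1 sharpen the distribution, making the output less random
)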
Text Generation – Decoder
The decode method is used to transform the output numerical matrix into human-readable text.
print(gpt_tokenizer.decode(gpt_output[0], skip_special_tokens=True))
print(phi_tokenizer.decode(phi_output[0], skip_special_tokens=True))
The skip_special_tokens=True parameter is used to omit special tokens, such as <|endoftext|>, during the decoding of the generated numerical matrix. These tokens are useful for the model’s internal control but are irrelevant for the final generated text, so they are excluded in the decoding process.
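A quick way to see the effect of this parameter is to decode the end-of-sequence ID directly (just an illustration):
print(gpt_tokenizer.decode([50256]))                            # '<|endoftext|>'
print(gpt_tokenizer.decode([50256], skip_special_tokens=True))  # '' (the special token is omitted)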
Below are the results from the models: GPT-2 Large (version 1, which hallucinated), GPT-2 Large (version 2), and Phi-3-mini.
GPT-2 Large – Version 1
What is Machine Learning?
Machine learning is a branch of computer science that deals with the problem of learning from large amounts of data. It is a branch of computer science that deals with the problem of learning from large amounts of data.
Machine learning is a branch of computer science that deals with the problem of learning from large amounts of data. It is a branch of computer science that deals with the problem of learning from large amounts of data.
Machine learning is a branch of computer
GPT-2 Large – Version 2
What is Machine Learning?
Machine learning is a type of artificial intelligence (AI) that can be applied to a wide range of problems. It is based on the concept of “deep learning”, which is the process of training a neural network on large amounts of data and then using that data to predict the future behavior of the network. In other words, the machine learning algorithm is able to learn from the data it has been trained on, and use that information to make predictions about future data.
Phi-3-mini
What is Machine Learning?
Machine Learning (ML) is a subset of artificial intelligence (AI) that focuses on the development of algorithms and statistical models that enable computers to perform specific tasks without explicit instructions. Instead, these systems learn and improve from experience, typically from large amounts of data.
Machine Learning algorithms build a mathematical model based on sample data, known as “training data,” to make predictions or decisions without being explicitly programmed to perform the task. This process involves feed
As mentioned in the Encoder section, all models repeated the input question in their respective outputs. It’s possible to return only the newly generated tokens, but that will be addressed in a future post. You can also observe that the Phi-3-mini model would continue generating text if not limited to 100 tokens. An alternative would be to configure it similarly to GPT-2 Large (version 2), with additional parameters. These examples highlight that there are still many parameters and strategies we can explore to achieve more accurate and goal-aligned results.
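As a quick preview of that idea, a minimal sketch is to slice off the prompt tokens before decoding, so that only the newly generated part is printed (using the variables defined earlier).
# Number of tokens in the encoded prompt
prompt_length = gpt_input_ids.shape[-1]

# Decode only the tokens generated after the prompt
print(gpt_tokenizer.decode(gpt_output[0][prompt_length:], skip_special_tokens=True))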
