Fine-tuning is the process of readjusting a previously trained model (usually a general-purpose one) so that it adapts to a more specific task or dataset. The goal is to specialize the model by leveraging the knowledge it has already acquired during pretraining, without having to train it from scratch.

During fine-tuning, often only part of the model’s parameters is updated (typically the final layers), which significantly reduces training time, computational cost, and the amount of data required.

One of the main challenges in this process is finding the right balance between specialization and generalization, that is, adapting the model enough to learn the specific features of the new task without causing overfitting.

This post aims to demonstrate how to apply fine-tuning to Microsoft’s phi-2 model for a binary classification task. The topics covered include:

  • Data Preprocessing
  • Inference without Fine-Tuning
  • Fine-Tuning: Hyperparameter Adjustment
  • Fine-Tuning: Training
  • Inference of the Tuned Model

The entire process was carried out using Google Colab. Except for the data preprocessing stage, each step of the fine-tuning process was executed in a notebook with different configurations. Throughout the post, I’ll specify the configuration used for each stage.

Data Preprocessing

The dataset used is designed for a binary classification problem, aiming to determine whether a headline is related to a crime or not.

The code snippet below illustrates the data preprocessing performed.

from datasets import load_dataset, DatasetDict, ClassLabel

# Load the CSV; everything lands in the "train" split by default
data = load_dataset("csv", data_files="CrimeVsNoCrimeArticles.csv", delimiter=",")

# Cast the label column to ClassLabel so it can be used for stratified splitting
labels = ClassLabel(
    num_classes=2,
    names=["No Crime", "Crime"]
)
data = data.cast_column("is_crime_report", labels)

# First split: 80% training / 20% held out
split = data["train"].train_test_split(
    test_size=0.2,
    seed=42,
    stratify_by_column="is_crime_report"
)
train_data, val_data = split["train"], split["test"]

# Second split: divide the held-out 20% equally into validation and test (10% / 10%)
split = val_data.train_test_split(
    test_size=0.5,
    seed=42,
    stratify_by_column="is_crime_report"
)
val_data, test_data = split["train"], split["test"]

data = DatasetDict({
    "train": train_data,
    "val": val_data,
    "test": test_data
})

After loading the dataset, the data was split into 3 sets: training, validation, and test, following an 80/10/10 distribution. The training and validation sets are used to optimize the model’s weights and hyperparameters. The test set, on the other hand, serves as the final evaluation, containing data the model has never seen during training and never influenced through tuning, ensuring an unbiased performance assessment.

To preserve the class distribution when splitting the data, the stratify_by_column parameter was used. However, this parameter only accepts columns of type ClassLabel, so it was necessary to convert the is_crime_report column to that type. By doing so, the ClassLabel adds metadata that indicates the column represents nominal categories, mapping each integer value to its corresponding text label.
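If you want to confirm that the cast worked, the ClassLabel feature exposes the label mapping directly. Below is a minimal check using the data DatasetDict built above:

feature = data["train"].features["is_crime_report"]
print(feature)                      # ClassLabel with names=['No Crime', 'Crime']
print(feature.int2str(1))           # 'Crime'
print(feature.str2int("No Crime"))  # 0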

Below is the dataset visualization showing the training, validation, and test distributions. As shown, the dataset contains 7,124 records, with 80% used for training, 10% for validation, and 10% for testing, as mentioned earlier.

DatasetDict({
    train: Dataset({
        features: ['title', 'is_crime_report'],
        num_rows: 5699
    })
    val: Dataset({
        features: ['title', 'is_crime_report'],
        num_rows: 712
    })
    test: Dataset({
        features: ['title', 'is_crime_report'],
        num_rows: 713
    })
})

Below is an illustration of the class distribution, obtained with NumPy’s unique function: unique(data["train"]["is_crime_report"], return_counts=True). Simply replace “train” with “val” or “test” to check the corresponding split. Since the dataset is balanced, the stratification ensures that each subset (training, validation, and test) preserves the class balance.

(array([0, 1]), array([2850, 2849]))  # training
(array([0, 1]), array([356, 356]))  # validation
(array([0, 1]), array([356, 357]))  # test

Inference without Fine-Tuning

The goal of this section is to evaluate the performance of the phi-2 model without applying any fine-tuning. To do this, 2 evaluation approaches will be used, leveraging the auto classes from the transformers library: AutoModelForSequenceClassification and AutoModelForCausalLM.

The phi-2 model was originally designed to predict the next token, and therefore its default usage is through AutoModelForCausalLM. However, to adapt it for classification tasks, the Hugging Face library allows the use of AutoModelForSequenceClassification, which replaces the original language modeling head (lm_head) with a classification head (score) on top of the base model. Below are the original version of the model and its adaptation for the binary classification task.

AutoModelForCausalLM:

PhiForCausalLM(
  (model): PhiModel(
    (embed_tokens): Embedding(51200, 2560)
    (layers): ModuleList(
      (0-31): 32 x PhiDecoderLayer(
        (self_attn): PhiAttention(
          (q_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (k_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (v_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (dense): Linear(in_features=2560, out_features=2560, bias=True)
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear(in_features=10240, out_features=2560, bias=True)
        )
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (rotary_emb): PhiRotaryEmbedding()
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (final_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=2560, out_features=51200, bias=True)
)

AutoModelForSequenceClassification:

PhiForSequenceClassification(
  (model): PhiModel(
    (embed_tokens): Embedding(51200, 2560, padding_idx=50256)
    (layers): ModuleList(
      (0-31): 32 x PhiDecoderLayer(
        (self_attn): PhiAttention(
          (q_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (k_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (v_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (dense): Linear(in_features=2560, out_features=2560, bias=True)
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear(in_features=10240, out_features=2560, bias=True)
        )
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (rotary_emb): PhiRotaryEmbedding()
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (final_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=2560, out_features=2, bias=False)
)

To perform inference with both approaches, each Google Colab notebook was configured with a T4 GPU and the high-RAM setting.

Below is the code used to load the model, which was adapted for classification tasks, along with its corresponding tokenizer.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
if tokenizer.pad_token is None:
  tokenizer.pad_token = tokenizer.eos_token

id2label = {0: "No Crime", 1: "Crime"}
label2id = {"No Crime": 0, "Crime": 1}

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/phi-2",
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
    pad_token_id=tokenizer.pad_token_id
)

Note that, after loading the tokenizer, if it does not have a pad_token defined, this token should be set to the end-of-sequence token of the phi-2 model, whose value is <|endoftext|>. This information must also be passed to the model via the parameter pad_token_id=tokenizer.pad_token_id, ensuring that the model correctly identifies the padding token when sequences are padded to a uniform length. Furthermore, when loading the model, it is recommended to define the mapping between numeric labels and their corresponding class names (id2label) and the inverse mapping (label2id), allowing the model to interpret and return the correct classes during inference.
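As a quick sanity check, you can print the tokens and mappings that were just configured. The values in the comments are what I would expect for phi-2’s tokenizer; verify them in your own environment.

print(tokenizer.eos_token)       # '<|endoftext|>'
print(tokenizer.pad_token_id)    # 50256, now shared with the end-of-sequence token
print(model.config.id2label)     # {0: 'No Crime', 1: 'Crime'}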

To evaluate the model, the test dataset created during preprocessing will be used. Since the dataset is balanced, the accuracy metric is well-suited for this kind of scenario. However, feel free to experiment with other metrics and different evaluation approaches. Below is the code snippet showing how to load the accuracy metric using the evaluate library.

import evaluate
accuracy = evaluate.load("accuracy")

Below is a function created to evaluate the predictions made by the model.

import torch

def evaluate_dataset(data):

  predictions = []

  for title in data["title"]:
    prefix = "Classify the following title as 'Crime' or 'No Crime': "
    title = prefix + title

    input_ids = tokenizer(
        title,
        return_tensors="pt"
    ).to(model.device)

    with torch.no_grad():
      logits = model(**input_ids).logits

    predictions.append(logits.argmax().item())

  return predictions, accuracy.compute(predictions=predictions, references=data["is_crime_report"])

For each title in the test dataset, a prefix was added before making any prediction, instructing the model which task it should perform. The text is then tokenized, meaning it is converted into a numerical sequence of tokens that the model can interpret. After tokenization, the model performs inference, generating the logits, which represent the scores associated with each class. Finally, the class with the highest score is appended to the list of predictions. The function then returns both the list of predictions and the model’s accuracy, computed by comparing the predictions with the true values in the dataset (is_crime_report).
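To make the logits more tangible, below is a minimal sketch of a single prediction using the model and tokenizer loaded above (the example headline is made up); softmax is applied only to turn the raw scores into probabilities for inspection.

import torch

text = "Classify the following title as 'Crime' or 'No Crime': Man arrested after downtown robbery"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits        # shape (1, 2): one raw score per class

probs = torch.softmax(logits, dim=-1)      # optional: convert the scores to probabilities
print(model.config.id2label[logits.argmax().item()], probs.tolist())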

When executing the function, the model achieved an accuracy of approximately 46%. However, upon analyzing the predictions, it becomes clear that the majority of the samples were classified as “No Crime” (class 0). This behavior will be corrected and improved through fine-tuning.

predictions, acc = evaluate_dataset(data["test"])
unique(predictions, return_counts=True)  # (array([0, 1]), array([634,  79]))
print(acc) # {'accuracy': 0.4586255259467041}

The original version of the model, without any fine-tuning, was also used to evaluate its baseline performance. Below is the code used to load the tokenizer and the model in this base configuration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
if tokenizer.pad_token is None:
  tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    dtype=torch.float16,
    device_map="auto"
)

Below is the function created to classify each title. Since this is the causal form of the model, it was necessary to adjust the prompt so that the model could properly understand the task to be performed.

import torch

def classify_title(title):

  prompt = (
      "Classify the following title as 'Crime' or 'No Crime'.\n\n"
      f"Title: {title}\n"
      "Answer: "
  )

  input_ids = tokenizer(
      prompt,
      return_tensors="pt"
  ).to(model.device)

  with torch.no_grad():
    outputs = model.generate(
        **input_ids,
        max_new_tokens=10,
        do_sample=False,
        num_beams=5,
        no_repeat_ngram_size=2,
        early_stopping=True,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

  text = tokenizer.decode(
      outputs[0],
      skip_special_tokens=True
  )

  if "Answer:" in text:
      return text.split("Answer:")[-1].strip()

  return text.strip()

Note that, unlike the previous approach, the output here is not a probability distribution over the classes but rather a sequence of tokens generated by the model, which is then decoded using tokenizer.decode(). Several parameters were added to enhance the model’s performance, such as num_beams, no_repeat_ngram_size, and early_stopping; a more detailed explanation of these generation parameters can be found in the Hugging Face documentation.

The code below stores the textual response generated by the model along with the corresponding label for each title in the test dataset.

responses, predictions = [], []
for title in data["test"]["title"]:
    resp = classify_title(title)

    # Map the generated text back to a numeric label
    if "no crime" in resp.lower():
        pred = 0
    elif "crime" in resp.lower():
        pred = 1
    else:
        pred = 0

    responses.append(resp)
    predictions.append(pred)

When analyzing the model’s responses using unique(responses), it becomes clear that the model produced several hallucinated outputs. See below.

array(['', '- No Crime', 'Category: No Crime', 'Category: Politics',
       'Crime', 'Crime.', 'Crime: No', 'News', 'News.', 'No Crime',
       'No Crime.', 'No crime.', 'No, the title is not a crime.',
       'No, the title is not of a crime', 'No, this is not a crime.',
       'Non-Crime.', 'Not Crime.', 'Politics/Government',
       'Topic: <Medical, Health and Drugs>', 'Topic: Crime',
       'Topic: Health and Beauty', 'Topic: Music', 'Topic: Politics',
       'Topic: Religion', 'Topic: Sports'], dtype='<U34')

When calculating the accuracy, the model achieved approximately 70%, showing a significant improvement compared to the previous approach. With prompt adjustments, this result can be further improved. Feel free to test different prompt variations to see how they influence the model’s performance.
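For reference, the accuracy figure can be computed with the same evaluate metric used in the previous approach; a minimal sketch, assuming the predictions list built in the loop above (the exact value will depend on the run):

import evaluate

accuracy = evaluate.load("accuracy")
acc = accuracy.compute(
    predictions=predictions,
    references=data["test"]["is_crime_report"]
)
print(acc)  # approximately 0.70, as reported above; results may vary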

Fine-Tuning: Hyperparameter Adjustment

The goal of hyperparameter tuning is to optimize the model’s configuration, that is, to find the combination of values that yields the best possible performance for the given task.

The Google Colab configuration used for this setup consisted of an L4 GPU with high RAM.

Below is the code used to load the tokenizer and the model.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
if tokenizer.pad_token is None:
  tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

def model_init():

  id2label = {0: "No Crime", 1: "Crime"}
  label2id = {"No Crime": 0, "Crime": 1}

  model = AutoModelForSequenceClassification.from_pretrained(
      "microsoft/phi-2",
      num_labels=2,
      id2label=id2label,
      label2id=label2id,
      pad_token_id=tokenizer.pad_token_id
  )

  # Freeze the phi-2 backbone (model); only the classification head (score) stays trainable
  for param in model.model.parameters():
    param.requires_grad = False

  return model

Note that the tokenizer is loaded in the same way as in the previous approaches. It is used by the DataCollatorWithPadding, whose purpose is to perform dynamic padding, that is, the longest sequence within each batch is used as a reference, and all others are padded to match its length. This way, the sequence length is adjusted dynamically for each batch, optimizing memory usage and reducing computational cost.
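To see the dynamic padding in action, you can hand the collator a couple of tokenized examples of different lengths; a minimal sketch using the tokenizer and data_collator defined above (the example sentences are made up):

batch = data_collator([
    tokenizer("Short title"),
    tokenizer("A considerably longer headline about something else entirely"),
])
print(batch["input_ids"].shape)    # (2, length of the longest sequence in this batch)
print(batch["attention_mask"][0])  # zeros in the mask mark the padded positions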

Moreover, unlike the previous approaches, the model is not loaded directly. Instead, a function was created to initialize the model, since when tuning hyperparameters, the process needs to be run multiple times, which requires reinitializing the model each time. Also, as mentioned earlier in this post, the goal of fine-tuning is to adjust only part of the model’s parameters. Therefore, by setting param.requires_grad = False, all base layer parameters (model) of phi-2 are frozen, allowing only the parameters of the score layer to be updated during training (see the previous section where I presented the phi-2 and its layers). This choice was intentional, though you are encouraged to experiment with adjusting other parts of the model.

To check how many parameters are trainable, initialize the model with model_init() and run the code below.

model = model_init()
trainable = sum(param.numel() for param in model.parameters() if param.requires_grad)
frozen = sum(param.numel() for param in model.parameters() if not param.requires_grad)
print(f"Trainable: {trainable:,} | Frozen: {frozen:,}")  # Trainable: 5,120 | Frozen: 2,648,560,640

Before starting the hyperparameter tuning, it was also necessary to perform some preprocessing on the dataset. See the code below.

def preprocessing(data):

  prefix = "Classify the following title as 'Crime' or 'No Crime': "
  data["title"] = [prefix + title if title is not None else "" for title in data["title"]]

  encoding = tokenizer(
      text=data["title"],
      truncation=True,
      max_length=128
  )

  encoding["labels"] = data["is_crime_report"]

  return encoding

data = data.map(preprocessing, batched=True)

With this step, the prefix used in the non–fine-tuned approach is added to each title. Then, the title is tokenized, generating the input_ids and attention_mask fields, while the labels variable is created from the is_crime_report column. These fields are essential for the training process, as they represent, respectively, the numerical input sequences, the attention mask, and the output labels that the model must learn to predict.

Thus, the dataset has the following structure:

DatasetDict({
    train: Dataset({
        features: ['title', 'is_crime_report', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 5699
    })
    val: Dataset({
        features: ['title', 'is_crime_report', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 712
    })
    test: Dataset({
        features: ['title', 'is_crime_report', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 713
    })
})
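To spot-check that the mapping produced the expected fields, you can inspect a single processed record; a minimal sketch using the mapped data above:

example = data["train"][0]
print(example["input_ids"][:10])       # token ids of the prefixed title
print(example["attention_mask"][:10])  # 1 for every real (non-padded) token
print(example["labels"])               # 0 = No Crime, 1 = Crime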

Finally, everything is ready for hyperparameter tuning. For this step, I used the random search strategy, which aims to randomly select hyperparameter values within predefined ranges or sets. In other words, each hyperparameter has a search space (such as a numerical range or a list of possible values), and the algorithm samples random combinations of these values to evaluate the model’s performance. This means it is not necessary to test every possible combination, as in grid search.
Instead, you define a maximum number of runs, allowing you to explore a much larger search space more efficiently. After analyzing the results, it is common to run additional experiments, focusing on narrower ranges of the hyperparameters that achieved better performance. I plan to publish a future post exploring random search and grid search in more detail.

Below is the class created to perform the random search. Although Hugging Face already provides a method for automatic hyperparameter tuning (Trainer.hyperparameter_search), that approach returns only the best run. Since my goal is to evaluate all runs to identify potential signs of overfitting and, if necessary, conduct new searches with narrower hyperparameter ranges, I decided to implement my own random search class.

from transformers import TrainingArguments, Trainer
import evaluate
from random import uniform, randint, choice, seed
from pandas import DataFrame, concat
import torch
import gc
from numpy import argmax


class BinaryClassificationRandomSearch:

  def __init__(self, model_init, tokenizer, data_collator, hyperparameters):

    self.__model_init = model_init
    self.__tokenizer = tokenizer
    self.__data_collator = data_collator

    self.__hyperparameters = hyperparameters

    self.__accuracy = evaluate.load("accuracy")
    self.__roc_auc = evaluate.load("roc_auc")

    self.df = DataFrame()


  def run(self, data, total_executions, epochs):

    for execution in range(total_executions):
      print(f"Execution: {execution + 1}")

      self.__get_hyperparameters()

      self.__train(data, epochs)

      self.__metrics()

      self.__concat_dataframe()

      del self.__trainer
      torch.cuda.empty_cache()
      gc.collect()

    return self.df


  def __get_hyperparameters(self):

    seed()

    self.__lr = uniform(*self.__hyperparameters["learning_rate"])
    self.__lr_scheduler_type = choice(self.__hyperparameters["lr_scheduler_type"])
    self.__weight_decay = uniform(*self.__hyperparameters["weight_decay"])
    self.__warmup_ratio = uniform(*self.__hyperparameters["warmup_ratio"])
    self.__per_device_train_batch_size = randint(*self.__hyperparameters["per_device_train_batch_size"])


  def __train(self, data, epochs):

    training_args = TrainingArguments(
      learning_rate=self.__lr,
      lr_scheduler_type=self.__lr_scheduler_type,
      weight_decay=self.__weight_decay,
      warmup_ratio=self.__warmup_ratio,
      per_device_train_batch_size=self.__per_device_train_batch_size,
      per_device_eval_batch_size=8,
      num_train_epochs=epochs,
      eval_strategy="epoch",
      group_by_length = True,
      push_to_hub=False,
      report_to=[],
      disable_tqdm=True
    )

    model = self.__model_init()

    self.__trainer = Trainer(
      model=model,
      args=training_args,
      train_dataset=data["train"],
      eval_dataset={"Training": data["train"], "Validation": data["val"]},
      processing_class=self.__tokenizer,
      data_collator=self.__data_collator,
      compute_metrics=self.__compute_metrics
    )

    self.__trainer.train()


  def __metrics(self):

    self.__auc_train, self.__acc_train = None, None
    self.__auc_val, self.__acc_val = None, None

    for i in range(len(self.__trainer.state.log_history) -1, -1, -1):
      if "eval_Training_runtime" in self.__trainer.state.log_history[i]:
        self.__auc_train = self.__trainer.state.log_history[i]["eval_Training_auc"]
        self.__acc_train = self.__trainer.state.log_history[i]["eval_Training_accuracy"]

      if "eval_Validation_runtime" in self.__trainer.state.log_history[i]:
        self.__auc_val = self.__trainer.state.log_history[i]["eval_Validation_auc"]
        self.__acc_val = self.__trainer.state.log_history[i]["eval_Validation_accuracy"]

      if self.__auc_train and self.__auc_val:
        break


  def __concat_dataframe(self):

    self.df = concat([
      self.df,
      DataFrame({
        "learning_rate": self.__lr,
        "lr_scheduler_type": self.__lr_scheduler_type,
        "weight_decay": self.__weight_decay,
        "warmup_ratio": self.__warmup_ratio,
        "per_device_train_batch_size": self.__per_device_train_batch_size,
        "auc_train": self.__auc_train,
        "auc_val": self.__auc_val,
        "acc_train": self.__acc_train,
        "acc_val": self.__acc_val,
      }, index=[0])
    ], ignore_index=True)


  def __compute_metrics(self, eval_pred):

    predictions, labels = eval_pred
    label_predictions = argmax(predictions, axis=1)

    metrics = {
        "accuracy": self.__accuracy.compute(predictions=label_predictions, references=labels)["accuracy"],
        "auc": self.__roc_auc.compute(prediction_scores=predictions[:, 1], references=labels)["roc_auc"]
    }

    return metrics

Here is an overview of each method in the class:

  • __init__ method: initializes the class by providing the model initialization function, the tokenizer, the data collator, and a dictionary of hyperparameters, each with its respective range of values to be explored.
  • run method: responsible for executing the random search process. It runs N times, and in each iteration, the hyperparameter values are randomly sampled. The model is then trained with those values, and performance metrics (accuracy and AUC) are obtained. After each run, the Trainer object is cleared from memory to free resources before the next iteration. Finally, the method returns a dataframe containing the results from all executions.
  • __get_hyperparameters method: retrieves random values for each hyperparameter defined in the search dictionary.
    • learning_rate: defines the step size for gradient descent.
    • lr_scheduler_type: specifies the type of scheduler used to adjust the learning rate. During training, the learning rate value is updated dynamically according to the defined scheduling policy (for example, linear, cosine, polynomial, etc.).
    • weight_decay: represents the L2 regularization coefficient, used to prevent overfitting by penalizing excessively large weights.
    • warmup_ratio: defines the proportion of training steps used for warmup, before reaching the defined learning rate. During this phase, the learning rate gradually increases, helping to reduce instability in the early stages of training.
    • per_device_train_batch_size: sets the batch size per GPU.
  • __train method: handles the model training process.
    • TrainingArguments: defines the training parameters, passing the sampled hyperparameters (learning_rate, lr_scheduler_type, weight_decay, warmup_ratio, per_device_train_batch_size) along with other key arguments:
      • per_device_eval_batch_size: batch size used for evaluation.
      • num_train_epochs: total number of epochs.
      • eval_strategy: specifies when evaluation occurs (in this case, at the end of each epoch).
      • group_by_length: when set to True, groups sequences of similar lengths in the same batch, reducing unnecessary padding.
      • push_to_hub: uploads the model to Hugging Face Hub (disabled here).
      • report_to: when set to an empty list, it prevents the automatic logging of training metrics to tracking and experiment visualization platforms such as Weights & Biases.
      • disable_tqdm: disables the progress bar during training.
    • Trainer: handles the actual training process of the model. During evaluation (eval_dataset), both the training and validation datasets are passed, allowing the detection of potential overfitting by comparing their performances. The evaluation metrics, defined in the __compute_metrics method, are accuracy and the Area Under the ROC Curve (AUC), used to assess the model’s performance in each run. To start the training, simply call self.__trainer.train().
  • __metrics method: retrieves the accuracy and AUC metrics from the final training epoch, stored in the Trainer’s internal history. These values are added to the final dataframe, consolidating the results of all random search runs for evaluation and comparison.
    • Just to clarify: the __compute_metrics method is passed to the Trainer so that the evaluation metrics are computed during training. The __metrics method, on the other hand, simply retrieves those metrics that were computed in the last epoch of training.

To run the random search, simply define the hyperparameter dictionary, instantiate the BinaryClassificationRandomSearch class with the required objects, and finally call the run() method, as shown in the code below.

hyperparameters = {
    "learning_rate": [1e-6, 1e-3],
    "lr_scheduler_type": ["linear", "cosine"],
    "weight_decay": [1e-2, 1e-4],
    "warmup_ratio": [0, 0.1],
    "per_device_train_batch_size": [4, 64],
}

random_search = BinaryClassificationRandomSearch(
    model_init=model_init,
    tokenizer=tokenizer,
    data_collator=data_collator,
    hyperparameters=hyperparameters
)

df = random_search.run(
    data=data,
    total_executions=10,
    epochs=3
)

The random search results are shown below, with the dataframe sorted in descending order by AUC value (df.sort_values("auc_val", ascending=False)). As you can see, the results were quite satisfactory in this first round, so there’s no immediate need to run a second search with smaller and more specific ranges. However, feel free to run another round to explore the potential for further performance improvements.

The selected hyperparameter configuration was the one from the first row of the table, as the training and validation results showed very similar AUC and accuracy values, indicating a good balance and no signs of overfitting. Additionally, the batch size of 6 proved advantageous, as it required less GPU memory during training.

Fine-Tuning: Training

The goal of this section is to train the model using the hyperparameters selected from the random search. The Google Colab configuration used for this step was an L4 GPU with high RAM.

For the process to work properly, it is necessary to load the tokenizer and the model, preprocess the data as shown in the previous section, and create a function to compute the metrics, as demonstrated in the __compute_metrics method of the random search class in the previous section.
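Since the Trainer call below expects a plain compute_metrics function, here is a standalone version that mirrors the __compute_metrics method from the random search class (a sketch; it assumes the accuracy and roc_auc metrics from the evaluate library):

import evaluate
from numpy import argmax

accuracy = evaluate.load("accuracy")
roc_auc = evaluate.load("roc_auc")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    label_predictions = argmax(predictions, axis=1)
    return {
        "accuracy": accuracy.compute(predictions=label_predictions, references=labels)["accuracy"],
        "auc": roc_auc.compute(prediction_scores=predictions[:, 1], references=labels)["roc_auc"],
    }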

Below is the configuration of the training arguments.

training_args = TrainingArguments(
    learning_rate=0.000352,
    lr_scheduler_type="cosine",
    weight_decay=0.002265,
    warmup_ratio=0.056668,
    per_device_train_batch_size=6,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    eval_strategy="epoch",
    group_by_length = True,
    output_dir="phi-2-crime-vs-no-crime-classification",
    save_strategy="epoch",
    save_total_limit=3,
    push_to_hub=False,
    hub_model_id="edvaldomelo/phi-2-crime-vs-no-crime-classification",
    report_to=[]
)

Note that the values defined for the hyperparameters correspond to the configuration selected during the random search. The training was set to run for 10 epochs, and during the process, the model saves checkpoints in the directory specified by output_dir. The saving strategy (save_strategy="epoch") specifies that the model will be saved at the end of each epoch, with a maximum limit of 3 checkpoints (save_total_limit=3), meaning that only the 3 most recent checkpoints are kept. Additionally, the model will not be automatically pushed to the Hugging Face Hub (push_to_hub=False), but the repository identifier (hub_model_id) has already been defined. This identifier follows the format username/model-name, in this case, edvaldomelo/phi-2-crime-vs-no-crime-classification. To successfully push the model to the Hub after training, you must be authenticated with your Hugging Face token, which should be stored securely, for example in a .env file.
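Below is a minimal sketch of one way to authenticate before pushing, assuming the token is stored in a .env file under a variable named HF_TOKEN (the file layout and variable name are just an example):

import os
from dotenv import load_dotenv       # requires the python-dotenv package
from huggingface_hub import login

load_dotenv()                         # read variables from the local .env file
login(token=os.environ["HF_TOKEN"])   # HF_TOKEN is an assumed variable name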

Finally, the model is ready to be trained.

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=data["train"],
    eval_dataset={"train": data["train"], "val": data["val"]},
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()

Below are the training results, as displayed by the Trainer progress bar.

Rounding the values, the model achieved an AUC of 0.990 and an accuracy of 0.963 on the training data, and an AUC of 0.982 and an accuracy of 0.944 on the validation data.

To evaluate the test dataset, which serves as the final evaluation set, simply run:

trainer.evaluate(eval_dataset=data["test"])

The results were 0.984 AUC and 0.941 accuracy, showing that the model maintained strong performance, with values very close to those from the validation set. It’s worth noting that when reproducing this experiment, the results may vary, but the key is to achieve consistent performance and avoid overfitting.

When comparing the fine-tuned model with the non–fine-tuned versions (see Inference without Fine-Tuning section), the performance was significantly better: the model adapted for classification achieved around 46% accuracy on this test dataset, while the original model reached approximately 70%.

To upload the model to the Hugging Face hub, run the command below:

trainer.push_to_hub()
tokenizer.push_to_hub("edvaldomelo/phi-2-crime-vs-no-crime-classification")

Inference of the Tuned Model

The goal of this section is to use the model previously uploaded to the Hugging Face hub. For this step, the Google Colab configuration used was a T4 GPU with high RAM.

Below is the code used to load the tokenizer and the model. Note that it’s the same code used in the previous sections, but in this case, it loads the model fine-tuned specifically for this task.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("edvaldomelo/phi-2-crime-vs-no-crime-classification")
if tokenizer.pad_token is None:
  tokenizer.pad_token = tokenizer.eos_token

id2label = {0: "No Crime", 1: "Crime"}
label2id = {"No Crime": 0, "Crime": 1}

model = AutoModelForSequenceClassification.from_pretrained(
    "edvaldomelo/phi-2-crime-vs-no-crime-classification",
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
    pad_token_id=tokenizer.pad_token_id
)

The model has already been evaluated on the test data (see the previous section). Therefore, I asked ChatGPT to generate 10 sentences (5 crime-related and 5 non-crime-related) to test how well the model performs. Below is the list of generated sentences.

synthetic_data = [
    "Thief caught stealing bicycles from parking lot",
    "Police arrest suspect for drug possession",
    "Children enjoy annual summer camp activities",
    "Shop owner reports credit card fraud",
    "City library hosts free reading sessions for kids",
    "New restaurant offers discount for local residents",
    "Athlete wins gold medal at international competition",
    "Bank robbery suspect escapes during transfer",
    "Teen charged after vandalizing public school walls",
    "Farmers celebrate record harvest this season"
]

Below is the function created for the model to classify each title (see the Inference without Fine-Tuning section for an explanation of a similar code). Note that the output corresponds to the class with the highest probability generated by the model.

import torch

def classify_title(title):

  prefix = "Classify the following title as 'Crime' or 'No Crime': "
  title = prefix + title

  input_ids = tokenizer(
      title,
      return_tensors="pt"
  ).to(model.device)

  with torch.no_grad():
    logits = model(**input_ids).logits

  predicted_class_id = logits.argmax().item()

  return model.config.id2label[predicted_class_id]

To view the model’s response for each sentence generated by ChatGPT, run the command below:

for title in synthetic_data:
  print(f"{title}: {classify_title(title)}")

The generated output was:

Thief caught stealing bicycles from parking lot: Crime
Police arrest suspect for drug possession: Crime
Children enjoy annual summer camp activities: No Crime
Shop owner reports credit card fraud: Crime
City library hosts free reading sessions for kids: No Crime
New restaurant offers discount for local residents: No Crime
Athlete wins gold medal at international competition: No Crime
Bank robbery suspect escapes during transfer: Crime
Teen charged after vandalizing public school walls: Crime
Farmers celebrate record harvest this season: No Crime