Recently I came across a problem statement where I had to retrieve information from a database given a voice or text command from the user. So I started looking into natural-language-to-SQL query generation and found plenty of solutions, but none of them fit my use case: they were not trained on my dataset and knew nothing about my database, so giving a command just generated some random query that retrieved no output.
That left me with two options:
- Building my own natural-language-to-SQL model from scratch
- Fine-tuning an existing model for my use case
Since I was constrained by time, I went with the second option and fine-tuned an open-source LLM from Hugging Face for my use case.
In this article, we will explore the powerful technique of fine-tuning text-to-text and text-generation models using the Hugging Face Transformers library. We’ll focus on its application to SQL query generation, a complex but essential task in data manipulation.
Following is the index of the article; you can jump directly to the required section:
— — — → What is fine-tuning and Why use it?
— — — → Let’s Fine-Tune LLM and Get our Hands Dirty
— — — → Why This Method is Good for Fine-Tuning
— — — → Where This Method of Fine-Tuning LLMs Can Be Used
— — — → Conclusion
The complete code used in this article can be found on my GitHub profile here.
What is fine-tuning and Why use it?
Fine-tuning is a machine learning practice where a pre-trained model, initially trained on a vast corpus of text, is further trained on a specific task or dataset. The goal of fine-tuning is to adapt the model to perform a task with a high level of competence.
Fine-tuning allows us to align powerful pre-trained models to a specific task, which offers several advantages, such as:
- Efficiency: Leveraging pre-trained models as a starting point reduces the computational resources and time required for training.
- Adaptability: Models can be customized for a wide range of natural language processing tasks, from summarization to translation.
- Simplicity: High-level libraries like Hugging Face Transformers simplify the fine-tuning process, making it accessible to researchers and practitioners.
- Domain-Specific Tasks: Similar to adaptability, but fine-tuning lets us align powerful models to our specific domain, whose data they have never seen.
Note: Since we are fine-tuning the LLM for a single task, there is no need to worry about “catastrophic forgetting” or to adopt techniques like multi-task fine-tuning or PEFT.
Let’s Fine-Tune LLM and Get our Hands Dirty
I used a model from Hugging Face; it can be either a ‘Text2Text-Generation’ model based on an encoder-decoder architecture or a ‘Text-Generation’ model based on a decoder-only architecture. We can load the model directly from the Hugging Face Hub, or download it beforehand and load it from local storage for offline use.
**Note on Selecting a Model and Creating a Dataset:** Since Hugging Face has lots of models to choose from, try to choose one that is already aligned with your requirement, such as a code-generation model for a code-generation task; this makes the fine-tuned model faster and more accurate. While creating a dataset, it is good practice to cover all the generalized corner cases in at least one or two examples each, so the model learns to handle your use case better.
— → Importing Libraries and Loading Models: Import the essential libraries and set the device to GPU if available. Then load a pre-trained model, either from local storage or directly from the Hugging Face Hub, which serves as the foundation for fine-tuning.
Copy the model ID of your required model from Hugging Face and paste it into the code below as model_ckpt.
from transformers import pipeline, set_seed
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from datasets import load_dataset, load_metric
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import nltk
from nltk.tokenize import sent_tokenize
from tqdm import tqdm
import torch
nltk.download("punkt")
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Model will be trained on {device}")
# loading the model
# the model can be selected directly from Hugging Face,
# or downloaded to the local system and loaded from its path
# a 'text2text-generation' model is not strictly required;
# 'text-generation' models from Hugging Face can also be used
# (make sure to put 'text-generation'/'text2text-generation' in the pipeline accordingly,
# and use AutoModelForCausalLM instead of AutoModelForSeq2SeqLM for decoder-only models)
model_ckpt = "any/text2text-generation/model/from/huggingface"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model_before_training = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)
— → Loading and Pre-processing the Dataset: Load the dataset; the only condition is that you have well-defined input examples and corresponding output examples. Below is a basic example of loading a dataset from a text file in which inputs and outputs are arranged on alternating lines.
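To make the expected layout concrete, here is a purely hypothetical illustration of such a dataset.txt (the prompts, table names, and columns below are made up; yours will reflect your own database schema), with each natural-language prompt followed on the next line by its SQL query:
Show me all employees hired after 2020
SELECT * FROM employees WHERE hire_year > 2020
Count how many orders were placed in 2023
SELECT COUNT(*) FROM orders WHERE order_year = 2023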
# loading the dataset
# here I have used a .txt file for the dataset, but that is not mandatory;
# the only requirement when preparing the dataset is that you have the
# required input examples and the corresponding output examples
with open('path/to/dataset.txt', 'r') as file:
    lines = file.readlines()

import re

# collapse repeated spaces, then strip quote characters and newlines
test1 = []
for line in lines:
    test1.append(re.sub(' +', ' ', line))
test = [sub.replace("'", '').replace('\n', '') for sub in test1]
# split the alternating lines into inputs (even indices) and outputs (odd indices)
inputs = []
outputs = []
for i, item in enumerate(test):
    if i % 2 == 0:
        inputs.append(item)
    else:
        outputs.append(item)

# use the first 90% of the pairs for training and the rest for validation
train_fraction = 0.9
data_index = int(train_fraction * len(inputs))
print(data_index)

dicts = {}
dicts['prompt'] = inputs[:data_index]
dicts['code'] = outputs[:data_index]
data_train = pd.DataFrame(dicts)
data_train = data_train.sample(frac=1)

dicts = {}
dicts['prompt'] = inputs[data_index:]
dicts['code'] = outputs[data_index:]
data_val = pd.DataFrame(dicts)
data_val = data_val.sample(frac=1)
— → Data Visualisation: This is an optional step, but it can provide insight into the input command length and the generated query length, which helps in choosing hyper-parameters such as maximum sequence length and in allocating resources for training.
# histogram of prompt and code token lengths, to help choose max_length
prompt_token_length = [len(tokenizer.encode(s)) for s in data_train['prompt']]
code_token_length = [len(tokenizer.encode(s)) for s in data_train['code']]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(prompt_token_length, bins=20, color='C0', edgecolor='C0')
axes[0].set_title("PROMPT Token Length")
axes[0].set_xlabel("Length")
axes[0].set_ylabel("Count")
axes[1].hist(code_token_length, bins=20, color='C0', edgecolor='C0')
axes[1].set_title("CODE Token Length")
axes[1].set_xlabel("Length")
axes[1].set_ylabel("Count")
plt.tight_layout()
plt.show()
Visualisation of Dataset
— → Calculating the ROUGE Score: We set up a function for calculating the ROUGE score, but first let’s get a basic understanding of this evaluation metric.
ROUGE, which stands for Recall-Oriented Understudy for Gisting Evaluation, is a metric used for the automatic evaluation of machine-generated text, such as summaries, translations, and more. ROUGE measures the quality of the generated text by comparing it to one or more reference texts, usually written by humans.
ROUGE scores provide an objective measure of how well a machine-generated text matches human-written references. It’s a vital tool for evaluating and comparing the performance of different natural language processing models. We will use ROUGE to evaluate our model before and after fine-tuning, ensuring that the generated text is contextually relevant and informative.
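As a quick, hedged illustration of how the metric behaves (the strings below are toy examples, and this assumes the rouge_score package that load_metric('rouge') relies on is installed), ROUGE-1 simply measures unigram overlap between a prediction and a reference:
from datasets import load_metric

# toy example: compare one generated query against one reference query
rouge = load_metric('rouge')
rouge.add_batch(predictions=["SELECT name FROM employees"],
                references=["SELECT name, age FROM employees"])
toy_score = rouge.compute()

# .mid.fmeasure is the aggregated F1-style value reported later in this article
print(toy_score["rouge1"].mid.fmeasure)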
def generate_batch_sized_chunks(list_of_elements, batch_size):
    # yield successive batch-sized slices of the list
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i:i + batch_size]

# code for computing the ROUGE score
def calculate_metric_on_test_ds(datasets, metric, model, tokenizer,
                                batch_size=2, device=device,
                                column_prompt="prompt", column_code="code"):
    prompt_batches = list(generate_batch_sized_chunks(datasets[column_prompt].tolist(), batch_size))
    code_batches = list(generate_batch_sized_chunks(datasets[column_code].tolist(), batch_size))
    for prompt_batch, code_batch in tqdm(
            zip(prompt_batches, code_batches), total=len(prompt_batches)):
        prompts = tokenizer(prompt_batch, max_length=256, truncation=True,
                            padding="max_length")
        # if the code is not working, you can always change hyper-parameters
        # like max_length or temperature according to your model's requirements
        codes = model.generate(input_ids=torch.tensor(prompts["input_ids"]).to(device),
                               attention_mask=torch.tensor(prompts["attention_mask"]).to(device),
                               num_beams=5, max_length=256, temperature=1.1)
        decoded_codes = [tokenizer.decode(s, skip_special_tokens=True,
                                          clean_up_tokenization_spaces=True)
                         for s in codes]
        decoded_codes = [d.replace("<n>", " ") for d in decoded_codes]
        metric.add_batch(predictions=decoded_codes, references=code_batch)
    score = metric.compute()
    return score
Calculating Model Performance Before Training:
Running the code below will give you the ROUGE scores of the model on various generations before any fine-tuning.
# here in the pipeline, use 'text2text-generation' or 'text-generation'
# according to your model choice
pipe = pipeline('text2text-generation', model=model_ckpt)
pipe_out = pipe(data_val['prompt'][5])
modelname = "name_of_your_model"

# log a sample generation before fine-tuning to a text file
with open(f"Results_before_finetuning_{modelname}.txt", 'a') as f:
    print("Input Prompt:", file=f)
    print(data_val['prompt'][5], file=f)
    print("Actual Output", file=f)
    print(data_val['code'][5], file=f)
    print("Generated Output Before Training: ", file=f)
    print(pipe_out, file=f)

rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
rouge_metric = load_metric('rouge')
score = calculate_metric_on_test_ds(data_val, rouge_metric, model_before_training, tokenizer=tokenizer)
rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
pd.DataFrame(rouge_dict, index=['model'])
Rouge Score Output
— → Preparing Data for Training: Now we prepare the dataset by passing it through the code below, which runs it through the relevant tokenizer and converts each example into ‘input_ids’, ‘attention_mask’, and ‘labels’.
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, data, tokenizer=tokenizer, max_input_length=128, max_target_length=128):
        self.tokenizer = tokenizer
        self.max_input_length = max_input_length
        self.max_target_length = max_target_length
        self.inputs = data['prompt'].tolist()
        print(self.inputs[0])
        self.targets = data['code'].tolist()
        print(self.targets[0])

    def __getitem__(self, index):
        input_encoding = self.tokenizer(self.inputs[index], max_length=self.max_input_length, truncation=True)
        # tokenize the target in target mode (needed for some seq2seq tokenizers)
        with self.tokenizer.as_target_tokenizer():
            target_encoding = self.tokenizer(self.targets[index], max_length=self.max_target_length, truncation=True)
        return {
            'input_ids': input_encoding['input_ids'],
            'attention_mask': input_encoding['attention_mask'],
            'labels': target_encoding['input_ids']
        }

    def __len__(self):
        return len(self.inputs)

dataset_pt_train_class = MyDataset(data_train)
dataset_pt_eval_class = MyDataset(data_val)
— → Fine-tuning the Model: The code below sets up fine-tuning: defining the training arguments, initializing the Trainer, and starting the training process.
We train the model and save it for later inference. The training arguments expose various hyper-parameters that can be adjusted to train the model better or faster; feel free to play around with them.
We will also calculate the ROUGE score again after training and find that it has increased drastically.
from transformers import DataCollatorForSeq2Seq
from transformers import TrainingArguments, Trainer
import time

# the data collator dynamically pads inputs and labels within each batch
seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model_before_training)

start = time.time()
trainer_args = TrainingArguments(output_dir='./result_for' + '_' + modelname,
                                 num_train_epochs=500,
                                 warmup_steps=2,
                                 per_device_train_batch_size=32,
                                 per_device_eval_batch_size=32,
                                 weight_decay=0.0001,
                                 logging_steps=5,
                                 push_to_hub=False,
                                 evaluation_strategy='steps',
                                 eval_steps=100,
                                 save_steps=1e6,
                                 gradient_accumulation_steps=16)
trainer = Trainer(
    model=model_before_training,
    args=trainer_args,
    train_dataset=dataset_pt_train_class,
    eval_dataset=dataset_pt_eval_class,
    data_collator=seq2seq_data_collator
)
trainer.train()
trainer.save_model("./fine-tuned_" + modelname)
end = time.time()
total_time = end - start

# recompute the ROUGE score with the fine-tuned model
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
rouge_metric = load_metric('rouge')
score = calculate_metric_on_test_ds(
    data_val, rouge_metric, model=trainer.model, tokenizer=tokenizer)
rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
pd.DataFrame(rouge_dict, index=['Fine Tuned Model'])
# testing ----
# you can play with gen_kwargs to manipulate the output generation accordingly
gen_kwargs = {"length_penalty": 0.5, "num_beams": 5, "max_length": 256}

# free the reference to the pre-training model; trainer.model is the fine-tuned one
del model_before_training

sample_text = data_val['prompt'][5]
reference = data_val['code'][5]

# load the saved fine-tuned model and run it through a pipeline
trained_model = AutoModelForSeq2SeqLM.from_pretrained("./fine-tuned_" + modelname)
pipe_from_trained = pipeline('text2text-generation', model=trained_model, tokenizer=tokenizer)

print("prompt")
print(sample_text)
print("\n Actual Code")
print(reference)
print("\n Generated Code")
print(pipe_from_trained(sample_text, **gen_kwargs))
Model training Expected output
Using the provided code, you can easily fine-tune LLMs for your use case.
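One practical note: because the Trainer above was not given the tokenizer, trainer.save_model saves only the model weights and configuration. If you plan to load the fine-tuned model in a completely fresh session, it helps to save the tokenizer alongside it. Here is a minimal sketch under that assumption (the variable names are illustrative, and the directory matches the modelname placeholder used above):
# save the tokenizer next to the fine-tuned model so a fresh session can load both
tokenizer.save_pretrained("./fine-tuned_" + modelname)

# later, in a new session or script:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

model_dir = "./fine-tuned_name_of_your_model"  # the directory created by trainer.save_model above
loaded_tokenizer = AutoTokenizer.from_pretrained(model_dir)
loaded_model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)
sql_generator = pipeline('text2text-generation', model=loaded_model, tokenizer=loaded_tokenizer)

print(sql_generator("show me all the records added last week", num_beams=5, max_length=256))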
Why This Method is Good for Fine-Tuning
The method presented offers distinct advantages:
- Leveraging Pre-Trained Models: The use of pre-trained models as a starting point significantly enhances the efficiency and resource-saving aspects of fine-tuning.
- Task Versatility: The Hugging Face Transformers library provides a rich selection of pre-trained models suitable for various natural language processing tasks.
- Simplified Fine-Tuning: The library abstracts many complex processes, simplifying the fine-tuning workflow.
Where This Method of Fine-Tuning LLMs Can Be Used
Though I used this code for domain-specific code generation, the method is not limited to that and can easily be applied to various other tasks (a short sketch of adapting it to another task follows the list), such as:
- Text Summarization: Customizing models to generate concise summaries from extensive documents.
- Machine Translation: We can easily use this method for a machine translation task, given a relevant dataset.
- Question Answering: Fine-tuning models to answer questions based on provided text passages.
- Code Generation: Training models to generate code snippets or SQL queries from natural language prompts.
- Document Generation: Creating automated document generators tailored to specific industries or domains.
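For instance, here is a minimal, hedged sketch of reusing the exact same pipeline for text summarization. The CSV path and the 'article'/'summary' column names are assumptions made for illustration; the only real change is mapping them onto the 'prompt'/'code' columns that the rest of the code in this article expects.
import pandas as pd

# hypothetical summarization dataset with 'article' and 'summary' columns
df = pd.read_csv("path/to/summarization_dataset.csv")
df = df.rename(columns={"article": "prompt", "summary": "code"})

# same 90/10 split used earlier
split = int(0.9 * len(df))
data_train = df.iloc[:split].sample(frac=1)
data_val = df.iloc[split:].reset_index(drop=True)

# from here on, MyDataset, the Trainer setup, and the ROUGE evaluation
# can be reused exactly as written above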
Conclusion:
Fine-tuning LLM models using the Hugging Face Transformers library is a potent tool for customizing pre-trained Large Language Models to perform specific natural language processing (NLP) tasks. It streamlines the process, ensures efficient adaptation, and broadens the scope of potential applications.
With the provided code, you can fine-tune models for precise needs, whether it’s SQL query generation or any other NLP task. Fine-tuning moves us closer to context-aware and task-specific models in the realm of natural language processing.
Hopefully, this article can help you in your projects.
Thank you for taking the time to read and engage with this article. Your support in the form of following me and clapping on the article is highly valued and appreciated. If you have any queries or doubts about the content of this article or the shared code, please do not hesitate to reach out to me via email at manindersingh120996@gmail.com. You can also connect with me on LinkedIn.