Natural Language to SQL Query using an Open Source LLM

Chetankumar Khadke
18 min read · May 17, 2024


Introduction

In the dynamic landscape of data utilization, the ability to
effortlessly interact with databases is paramount. Traditionally, this
interaction required a deep understanding of Structured Query Language
(SQL), posing a barrier to entry for many users. However, the advent of
Natural Language Processing (NLP) to SQL Query Engines has transformed
this landscape, allowing users to communicate with databases
using natural language commands. This cutting-edge technology seamlessly translates human language into SQL queries, revolutionizing how we retrieve and manipulate data.

In Natural Language Processing (NLP), models like
Mistral 7B and Microsoft Phi-3 are at the forefront, redefining the
boundaries of performance and efficiency.

Mistral 7B stands out for its remarkable performance and precision in
NLP tasks. With innovative features like Grouped-Query Attention (GQA)
and Sliding Window Attention (SWA), Mistral 7B excels in various
benchmarks, including mathematics and code generation. Its ability to
approach the coding proficiency of Code-Llama 7B while maintaining
excellence across diverse domains underscores its significance in NLP
advancement.

Introducing Phi-3, the latest innovation from Microsoft in small language models (SLMs), set to transform the landscape of AI. With Phi-3-mini, Phi-3-small, and Phi-3-medium, this family of models brings exceptional performance in compact packages. Phi-3-mini, boasting 3.8 billion parameters, rivals larger models while remaining efficient on smartphones. Behind Phi-3’s success is its meticulously curated training dataset, prioritizing robustness, safety, and conversational fluency. Phi-3-small and Phi-3-medium further expand Phi-3’s capabilities, catering to diverse applications. With its carefully crafted architecture and training methods, Phi-3 represents a significant leap forward in AI technology, promising top-notch performance and efficiency for various generative AI tasks.

Exploring the intersection of NLP and SQL, we delve into the utilization
of Mistral 7B and Microsoft Phi-3. These models play a pivotal role in
seamlessly converting natural language queries into structured SQL
queries, offering enhanced efficiency and accuracy in database querying
tasks.

Photo by Sunder Muthukumaran on Unsplash

Learning Objectives

In this blog post, we will explore the intricacies of using the open-source Mistral 7B model for NL2SQL tasks, and then look at how to customize and fine-tune a model (Phi-3) specifically for the NL2SQL application. The rest of the article covers the following points:

Motivation
Pre-trained model for NL2SQL
Fine-tuned model for NL2SQL
Conclusion

Motivation

Utilizing Open-Source LLMs involves a sophisticated process of converting natural language commands into SQL queries. This innovative technology automates the process, allowing users to express their data needs naturally without requiring manual query construction. These models utilize complex algorithms and Large Language Models to analyze user input and generate semantically accurate SQL queries. This streamlines the conversion process and eliminates the need for extensive SQL knowledge, making data accessible to a broader range of users. Open-source LLMs offer convenience and significantly improve data accessibility and operational efficiency. By removing the barrier of SQL expertise, this technology democratizes data access and allows users from diverse backgrounds to retrieve data and derive insights. Whether for business analysts seeking real-time insights or casual users exploring datasets, the intuitive nature of natural language commands simplifies data retrieval.

Additionally, automation inherent in these models accelerates query execution, driving overall efficiency and productivity. The impact of Open-Source LLMs is far-reaching, driving innovation and transformative change across various industries. This technology empowers stakeholders to effortlessly extract actionable insights in finance, healthcare, and e-commerce, where data-driven decisions are critical. Furthermore, its integration with advanced analytics platforms and artificial intelligence systems propels organizations towards data-driven excellence. By fostering a culture of exploration and simplifying data interaction, Open-Source LLMs unlock the full potential of data assets, spurring innovation and growth across sectors.

1. Pre-trained model for NL2SQL (Mistral 7B)

Mistral 7B, a 7 billion parameter language model by Mistral AI, has swiftly gained recognition as a powerful model in the artificial intelligence domain.

  • Positioned as a foundational model, Mistral 7B holds a crucial role in natural language processing, showcasing its importance as a core building block within the landscape of large language models.
  • Distinguished by its architectural approach, Mistral 7B utilizes innovative features like Grouped-query attention (GQA) for faster inference and Sliding Window Attention (SWA) to handle longer sequences efficiently, resulting in exceptional performance.
  • While primarily focused on English with coding abilities, Mistral 7B stands out for its notably large context window, allowing it to understand and generate text in a broader context than some of its counterparts.
  • With an impressive 7.3 billion parameters, Mistral 7B represents a state-of-the-art language model; released under the Apache 2.0 license, it offers unrestricted usage.
  • Mistral 7B establishes its superiority by outperforming the best open 13B model (Llama-2) across all evaluated benchmarks and surpassing the best released 34B model (Llama-1) in reasoning, mathematics, and code generation.
  • Mistral 7B outperforms Llama 2 13B and is competitive with CodeLlama-7B, excelling particularly in reasoning, mathematics, and code generation benchmarks.
  • Despite its modest size compared to larger models, Mistral 7B excels across various natural language tasks, including text summarization, classification, text completion, and code completion.
  • Let’s delve deeper into its capabilities by exploring its effectiveness in transforming natural language queries into structured SQL commands.

a. Sliding Window Attention

  • Mistral 7B incorporates a Sliding Window Attention (SWA) mechanism that addresses the main drawback of vanilla attention: as the number of tokens grows, the number of operations grows quadratically with sequence length and memory grows linearly, increasing latency and reducing throughput during inference. SWA instead restricts each token to attend to at most W tokens from the previous layer; because layers are stacked, information can still propagate well beyond the window size W (a small mask sketch follows this list).
  • SWA takes advantage of the layered structure of a transformer to enable the hidden state at position i to access tokens from the input layer up to a distance of W x k tokens. With a window size of W = 4096 at the final layer, SWA can theoretically achieve an attention span of around 131K tokens. In practice, by using W = 4096 and optimization techniques from FlashAttention and xFormers, a notable 2x speed improvement over a vanilla attention baseline for a sequence length of 16K can be achieved. Therefore, SWA is a powerful and efficient approach that can significantly enhance the performance of attention mechanisms.
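To make the windowing concrete, here is a minimal PyTorch sketch (an illustration, not Mistral's actual implementation) of a sliding-window causal mask: a query at position i may attend only to key positions j with i − W < j ≤ i.

import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where attention is allowed: causal, and within the last `window` positions.
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - window)

# With window=4, token 10 attends to positions 7-10 only; information from
# earlier tokens still reaches it indirectly through the stacked layers.
print(sliding_window_mask(seq_len=12, window=4).int())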

b. Rolling Buffer Cache

  • By implementing a Rolling Buffer Cache, Mistral 7B uses its fixed attention span to keep the cache size bounded. The cache has a fixed size W, and the keys and values for timestep i are stored at position i mod W; once i exceeds W, earlier entries are overwritten, so the cache never grows without limit. Illustrated in the paper with W = 3 (a toy illustration also follows below), this mechanism cuts cache memory usage by roughly 8x on a 32k-token sequence without compromising model quality, and keeps memory use efficient for sequences of varying lengths.
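The indexing trick is easier to see in a toy example (a pure-Python illustration, not the real cache code): the entry for timestep i always lands in slot i mod W, so the buffer never grows.

# Toy rolling KV cache of fixed size W: the entry for timestep i lives at slot i % W,
# so older entries are overwritten in place and memory stays constant.
W = 3
cache = [None] * W

tokens = ["The", "cat", "sat", "on", "the", "mat"]
for i, tok in enumerate(tokens):
    cache[i % W] = f"kv({tok})"  # overwrite the oldest slot
    print(f"step {i}: cache = {cache}")

# After the last step the cache holds only kv(on), kv(the), kv(mat) --
# the last W timesteps -- regardless of how long the sequence gets.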

c. Pre-fill and chunking

  • In sequence generation, tokens are predicted one at a time conditioned on the preceding context, so efficiency is improved by pre-filling a (k, v) cache with the known prompt. Long prompts are split into chunks of the chosen window size, and each chunk is used to pre-fill the cache in turn; during this process, attention is computed both over the cache and within the current chunk (a schematic sketch follows this list). In this way Mistral 7B aligns each predicted token with the preceding ones while reusing the pre-computed prompt representations stored in the cache.
  • In the dynamic landscape of language models, the emergence of
    Mistral 7B marks a significant leap in both performance and
    efficiency. Through a comprehensive evaluation pipeline, Mistral 7B
    showcases its prowess by outperforming not only its predecessor,
    Llama 2 7B, and Llama 2 13B but also surpassing the capabilities of
    Llama 1 34B in critical benchmarks such as mathematics, code
    generation, and reasoning.
  • Mistral 7B’s superiority is evident across all metrics, reaffirming its status as a frontrunner in the field. The meticulous re-evaluation process on various benchmarks consistently demonstrates Mistral 7B’s unmatched capabilities, leaving its counterparts in the rearview.
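Returning to the pre-fill and chunking step described in (c), the bookkeeping can be sketched schematically as follows; attend is only a placeholder for the real attention computation, and the rolling-buffer details are omitted for brevity.

def attend(chunk, context):
    # Placeholder for the actual attention computation over `context`
    # for the queries in `chunk`.
    pass

def prefill(prompt_tokens, window):
    # Split the known prompt into window-sized chunks; each chunk attends to the
    # cached keys/values plus itself, then its own keys/values join the cache.
    cache = []
    for start in range(0, len(prompt_tokens), window):
        chunk = prompt_tokens[start:start + window]
        attend(chunk, cache + chunk)  # attention over the cache and the current chunk
        cache.extend(chunk)           # a real cache would also be windowed/rolled
    return cache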

d. Size and Efficiency Analysis

  • A key aspect of Mistral 7B’s appeal lies in its remarkable efficiency, highlighted through a novel approach of computing “equivalent model sizes” within the Llama 2 family. When evaluated on reasoning, comprehension, and STEM reasoning, Mistral 7B mirrors the performance of a Llama 2 model with more than three times its size. This efficiency is a testament to Mistral 7B’s ability to deliver high caliber results without the burden of excessive parameters.
  • Delving deeper into Mistral 7B’s efficiency, the evaluation also shows where the size advantage shrinks: on knowledge benchmarks the equivalent-size ratio drops to about 1.9x, which is attributed to Mistral 7B’s deliberately limited parameter count. Fewer parameters leave less room to store factual knowledge, but the model compensates with a focused and efficient use of the capacity it has.

Evaluation Differences

Transparently addressing any discrepancies, it is worth noting that the evaluation protocols differ from those reported for Llama 2 on some benchmarks. Notably, Mistral 7B is evaluated on the hand-verified subset of MBPP, and on TriviaQA no Wikipedia contexts are provided; these choices contribute to a more reliable set of performance metrics.

Dataset

We will use a database table with the columns listed below and run several different kinds of retrieval queries against it.

transaction = [
"transaction_id",
"transaction_amount",
"transaction_date",
"transaction_type",
"transaction_status",
"transaction_description",
"transaction_source_account",
"transaction_destination_account",
"transaction_currency",
"transaction_fee"
]
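The article specifies only the column names; the types in the sketch below are assumptions chosen for illustration. A small in-memory SQLite table like this can later be used to sanity-check the queries the model generates (note that transaction is a SQL keyword, so the identifier is double-quoted here):

import sqlite3

# Column types are assumptions for illustration; only the column names come from the article.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE "transaction" (
    transaction_id INTEGER PRIMARY KEY,
    transaction_amount REAL,
    transaction_date TEXT,
    transaction_type TEXT,
    transaction_status TEXT,
    transaction_description TEXT,
    transaction_source_account TEXT,
    transaction_destination_account TEXT,
    transaction_currency TEXT,
    transaction_fee REAL
)
""")
conn.execute("""INSERT INTO "transaction" VALUES
    (10, 99.5, '2024-05-01', 'debit', 'completed', 'coffee',
     'acc_1', 'acc_2', 'USD', 0.5)""")

# Any generated query can then be executed against this toy table:
print(conn.execute(
    'SELECT transaction_amount, transaction_type FROM "transaction" WHERE transaction_id = 10'
).fetchall())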

Code Implementation

  1. Install the packages
!pip install git+https://github.com/huggingface/transformers.git 
!pip install deepspeed --upgrade
!pip install accelerate
!pip install sentencepiece
!pip install langchain
!pip install torch
!pip install bitsandbytes

2. Import the packages

import os
import re
import torch
from difflib import SequenceMatcher
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate
from langchain.llms import HuggingFacePipeline
from transformers import LlamaTokenizer, LlamaForCausalLM, pipeline

3. Load the Model

base_model = LlamaForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    load_in_8bit=True,
    device_map='auto',
)
tokenizer = LlamaTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
pipe = pipeline(
    "text-generation",
    model=base_model,
    tokenizer=tokenizer,
    max_length=500,
    temperature=0.3,
    top_p=0.95,
    repetition_penalty=1.2
)
local_llm = HuggingFacePipeline(pipeline=pipe)

4. SequenceMatcher

This Python function uses the SequenceMatcher class from the difflib module to compute similarity scores between the words in a question and the column names of the target table, replacing near-matches with the canonical column names so the query generator receives clean column references.

def find_columns_match(question, columns):
    # Replace near-miss column mentions in the question with the canonical column names.
    try:
        question_list = re.split(r'\s|,|\.', question)
        for index, word in enumerate(question_list):
            for column in columns:
                score = SequenceMatcher(None, column.lower(), word.lower()).ratio() * 100
                if score > 91:
                    question_list[index] = column + ","
        return " ".join(question_list)
    except Exception:
        return question
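For example (an illustrative call, using the transaction column list from the Dataset section above), a near-miss such as "transaction_amout" scores above the 91% threshold and is snapped to the canonical column name:

# Illustrative usage; `transaction` is the column list defined earlier.
question = "Show transaction_amout and transaction_type where transaction_id is 10"
print(find_columns_match(question, transaction))
# Words similar enough to a column name ("transaction_amout", "transaction_type",
# "transaction_id") are replaced by the canonical names, each with a trailing comma.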

The query_generator function below generates a SQL query from a given table name, column list, and question. It uses a template string with placeholders for the table name, column list, and question, fills them in through a PromptTemplate object, and passes the prompt to the local LLM via an LLMChain. Finally, it prints the generated SQL query.

def query_generator(tble, cols, question):

    template = """Generate a SQL query using the following table name: {Table}, and columns as a list: {Columns}, to answer the following question:
{question}.

Output Query:

"""

    prompt = PromptTemplate(template=template, input_variables=["Table", "question", "Columns"])

    llm_chain = LLMChain(prompt=prompt, llm=local_llm)

    response = llm_chain.run({"Table": tble, "question": question, "Columns": cols})
    print(response)

Table

transaction = [
"transaction_id",
"transaction_amount",
"transaction_date",
"transaction_type",
"transaction_status",
"transaction_description",
"transaction_source_account",
"transaction_destination_account",
"transaction_currency",
"transaction_fee"
]

inputs = ["Generate an SQL query to retrieve transaction_amount, transaction_date, transaction_type,transaction_description where transaction_id is 10",
"Generate an SQL query to retrieve transaction_id, transaction_date, transaction_type,transaction_source_account where transaction_status is 'completed'",
"Generate an SQL query to retrieve count of the transaction_type and their average transaction_amount, ordered by transaction_type.",
"Generate an SQL query to retrieve list of the total transaction amount for each source account, sorted by total transaction amount in descending order.",
"Generate an SQL query to retrieve find the maximum transaction amount for each transaction type, ordered by transaction type."]

for input in inputs:
    query_generator("transaction", transaction, question=find_columns_match(input, transaction))

Response

  1. Generate a SQL query using the following table name: transaction, and columns as a list: [‘transaction_id’, ‘transaction_amount’, ‘transaction_date’, ‘transaction_type’, ‘transaction_status’, ‘transaction_description’, ‘transaction_source_account’, ‘transaction_destination_account’, ‘transaction_currency’, ‘transaction_fee’], to answer the following question:(‘Generate an SQL query to retrieve transaction_amount, transaction_date, transaction_type,transaction_description where transaction_id is 10’,).
Output Query:

SELECT transaction_amount, transaction_date, transaction_type, transaction_description FROM transaction WHERE transaction_id = 10;

2. Generate a SQL query using the following table name: transaction, and columns as a list: [‘transaction_id’, ‘transaction_amount’, ‘transaction_date’, ‘transaction_type’, ‘transaction_status’, ‘transaction_description’, ‘transaction_source_account’, ‘transaction_destination_account’, ‘transaction_currency’, ‘transaction_fee’], to answer the following question:
(“Generate an SQL query to retrieve transaction_id, transaction_date, transaction_type,transaction_source_account where transaction_status is completed’’).

Output Query:
SELECT transaction_id, transaction_date, transaction_type, transaction_source_account FROM transaction WHERE transaction_status = 'completed'

3. Generate a SQL query using the following table name: transaction, and columns as a list: [‘transaction_id’, ‘transaction_amount’, ‘transaction_date’, ‘transaction_type’, ‘transaction_status’, ‘transaction_description’, ‘transaction_source_account’, ‘transaction_destination_account’, ‘transaction_currency’, ‘transaction_fee’], to answer the following question:
(‘Generate an SQL query to retrieve count of the transaction_type and their average transaction_amount, ordered by transaction_type’).

Output Query:

SELECT transaction_type, AVG(transaction_amount) AS avg_transaction_amount, COUNT(*) AS total_count
FROM transaction
GROUP BY transaction_type
ORDER BY transaction_type;

4. Generate a SQL query using the following table name: transaction, and columns as a list: [‘transaction_id’, ‘transaction_amount’, ‘transaction_date’, ‘transaction_type’, ‘transaction_status’, ‘transaction_description’, ‘transaction_source_account’, ‘transaction_destination_account’, ‘transaction_currency’, ‘transaction_fee’], to answer the following question:
(‘Generate an SQL query to retrieve list of the total transaction amount for each source account, sorted by total transaction amount in descending order’).

Output Query:

SELECT transaction_source_account, SUM(transaction_amount) AS TotalTransactionAmount
FROM transaction
GROUP BY transaction_source_account
ORDER BY TotalTransactionAmount DESC;

5. Generate a SQL query using the following table name: transaction, and columns as a list: [‘transaction_id’, ‘transaction_amount’, ‘transaction_date’, ‘transaction_type’, ‘transaction_status’, ‘transaction_description’, ‘transaction_source_account’, ‘transaction_destination_account’, ‘transaction_currency’, ‘transaction_fee’], to answer the following question:
(‘Generate an SQL query to retrieve find the maximum transaction amount for each transaction type, ordered by transaction type.’).

 Output Query:

SELECT transaction_type, MAX(transaction_amount) AS max_transaction_amount
FROM transaction
GROUP BY transaction_type
ORDER BY transaction_type;

Generic extraction is effective; however, research indicates that fine-tuning LLM on specific data yields superior results. Let’s adopt the fine-tuning approach.

2. Fine-tuned model for NL2SQL (Phi-3)

Meet Phi-3, Microsoft’s latest breakthrough in open AI models. With Phi-3-mini, Phi-3-small, and Phi-3-medium, the Phi-3 family of small language models (SLMs) is designed to revolutionize the world of AI models. Phi-3-mini, powered by 3.8 billion parameters and trained on 3.3 trillion tokens, performs exceptionally well, matching the performance of larger models like Mixtral 8x7B and GPT-3.5. What’s more, it can run efficiently on smartphone devices. Phi-3 owes its success to its training dataset, an evolved iteration of Phi-2’s dataset. With its meticulously filtered web data and synthetic inputs, these models prioritize robustness, safety, and conversational fluency making them suitable for various applications. Phi-3-small and Phi-3-medium, with 7B and 14B parameters respectively, are designed to further enhance Phi-3’s capabilities while maintaining efficiency.

Phi-3 Architecture and Evaluation

The Phi-3 family offers a variety of models designed to balance quality and cost, providing options for customers developing generative AI applications.

Phi-3-mini: This model features 3.8 billion parameters and is trained on an extensive dataset of 3.3 trillion tokens. It employs a transformer decoder architecture with 32 layers, 32 attention heads, and 3072 hidden dimensions. The default context length is 4K tokens, utilizing a tokenizer with a 32K vocabulary. There’s also an extended version, Phi-3-mini-128K, with a context length of 128K tokens.

Phi-3-small: With 7 billion parameters, Phi-3-small is trained on 4.8 trillion tokens. It adopts the tiktoken tokenizer with a vocabulary of 100K and an 8K default context length. The architecture comprises 32 layers, 32 attention heads, and 4096 hidden dimensions. It utilizes grouped query attention and alternating dense/block sparse attention to optimize memory usage.

Phi-3-medium: This preview model boasts 14 billion parameters and is also trained on 4.8 trillion tokens. It features 40 layers, 40 attention heads, and an embedding size of 5120.
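For quick reference, the figures quoted above can be gathered in one place (values are repeated from the text; fields not stated in the source, such as Phi-3-medium's context length, are omitted):

# Architecture figures as quoted above; Phi-3-medium numbers are for the preview model.
phi3_family = {
    "Phi-3-mini": {"params": "3.8B", "training_tokens": "3.3T", "layers": 32, "heads": 32,
                   "hidden_dim": 3072, "context": "4K (128K variant available)", "vocab": "32K"},
    "Phi-3-small": {"params": "7B", "training_tokens": "4.8T", "layers": 32, "heads": 32,
                    "hidden_dim": 4096, "context": "8K", "vocab": "100K (tiktoken)"},
    "Phi-3-medium": {"params": "14B", "training_tokens": "4.8T", "layers": 40, "heads": 40,
                     "embedding_dim": 5120},
}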

Training Methodology:

  1. Training Data Composition: The training data for Phi-3 models is carefully curated. It consists of heavily filtered web data, categorized by educational level, and synthetic LLM-generated data, and pre-training proceeds in two disjoint, sequential phases.
  2. Pre-training Phases: Phase 1 uses web sources to build general knowledge and language understanding. Phase 2 combines an even more heavily filtered subset of the Phase 1 web data with synthetic data to teach logical reasoning and niche skills.
  3. Post-training Phases: After pre-training, Phi-3-mini underwent supervised fine-tuning (SFT) and direct preference optimization (DPO). SFT relies on curated high-quality data spanning math, coding, reasoning, conversation, model identity, and safety; DPO focuses on chat-format data, reasoning, and responsible-AI efforts.
  4. Context Expansion: The context window of Phi-3-mini was expanded from 4K tokens to 128K tokens using the LongRoPE methodology, maintaining consistent performance despite the much longer context.
  5. Data Optimization: The training data is calibrated towards a “data optimal” regime, prioritizing quality for the model’s scale. Web data is filtered to strike the right balance of knowledge and reasoning, which is especially crucial for smaller models.
  6. Comparison with Other Models: Phi-3’s approach contrasts with prior work by focusing on data quality for a given scale rather than on compute-optimal or over-training regimes. Benchmark comparisons illustrate Phi-3’s optimization for small model capacities.
  7. Phi-3-medium Preview: Phi-3-medium, with 14B parameters, is trained similarly to Phi-3-mini but at a larger scale. Some benchmarks show less improvement from 7B to 14B parameters, suggesting the data mixture is still being refined.
  8. Post-training Enhancements: The models undergo post-training refinement to improve chat capabilities, robustness, and safety through supervised fine-tuning and preference tuning with DPO.

Safety

Phi-3-mini is an AI model created under Microsoft’s responsible AI principles, which prioritize safety at every stage of its development. A comprehensive strategy is employed to ensure the model’s adherence to ethical standards and ability to mitigate potential harm.

After training, the model undergoes meticulous safety alignment to verify that its responses meet responsible AI criteria. Additionally, an independent red team at Microsoft examines Phi-3-mini, identifying areas for enhancement and strengthening safety protocols.

Automated testing and evaluations across various categories of potential harm are integral to the process. These tests aim to detect and address any risks arising from the model’s outputs.

To refine its responses further, Phi-3-mini uses preference datasets that emphasize helpfulness and harmlessness. In-house generated datasets are also utilized, tailored to address specific categories of potential harm identified during testing.

Code Implementation

  1. Install the packages
!pip install -q -U bitsandbytes
!pip install -q -U transformers
!pip install -q -U xformers
!pip install -q -U peft
!pip install -q -U accelerate
!pip install -q -U datasets
!pip install -q -U trl
!pip install -q -U einops
!pip install -q -U nvidia-ml-py3
!pip install -q -U huggingface_hub

2. Import the packages

import time
import torch
from datasets import load_dataset
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from trl import SFTTrainer
from peft import LoraConfig, PeftModel, AutoPeftModelForCausalLM, get_peft_model, prepare_model_for_kbit_training

3. Load the Dataset

dataset = load_dataset("b-mc2/sql-create-context")
dataset

4. Format The Dataset

def create_prompt(sample):
    system_prompt_template = """<s>
Below is an instruction that describes a task.Write a response that appropriately completes the request.
### Instruction :<<user_question>>
### Database Schema:
<<database_schema>>
### Response:
<<user_response>>
</s>
"""
    user_message = sample['question']
    user_response = sample['answer']
    database_schema = sample['context']
    prompt_template = system_prompt_template.replace("<<user_question>>", f"{user_message}").replace("<<user_response>>", f"{user_response}").replace("<<database_schema>>", f"{database_schema} ")

    return {"inputs": prompt_template}


instruct_tune_dataset = dataset.map(create_prompt)
print(instruct_tune_dataset)
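To verify the formatting, create_prompt can be called on a single hypothetical row (illustrative values, not an actual record from the dataset); the printed string is the flat inputs field the trainer will consume.

# Quick check of the prompt formatting on a hypothetical sample.
sample = {
    "question": "How many singers do we have?",
    "context": "CREATE TABLE singer (singer_id VARCHAR)",
    "answer": "SELECT COUNT(*) FROM singer",
}
print(create_prompt(sample)["inputs"])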

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")

5. Load the tokenizer and the model

base_model_id = "microsoft/Phi-3-mini-4k-instruct"

#Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id , use_fast=True)
#Load the model with fp16
model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True, torch_dtype=torch.float16, device_map={"": 0})
print_gpu_utilization()

6. Model Inference

# Define prompts
prompt = [
"Write the recipe for a chicken curry with coconut milk.",
"Translate into French the following sentence: I love bread and cheese!",
"Cite 20 famous people.",
"Where is the moon right now?"
]

# Initialize variables
duration = 0.0
total_length = 0

# Loop through prompts
for i in range(len(prompt)):
    # Tokenize prompt and move to GPU
    inputs = tokenizer(prompt[i], return_tensors="pt").to("cuda:0")

    # Cast input tensor indices to torch.long
    inputs = {k: v.to(torch.long) for k, v in inputs.items()}

    # Start time
    start_time = time.time()

    # Perform inference with autocasting disabled
    with torch.cuda.amp.autocast(enabled=False):
        output = model.generate(**inputs, max_length=500)

    # Accumulate duration and number of generated tokens
    duration += float(time.time() - start_time)
    total_length += output.shape[-1]

    # Calculate tokens per second for this prompt
    tok_sec_prompt = round(output.shape[-1] / float(time.time() - start_time), 3)

    # Print tokens per second for this prompt
    print("Prompt --- %s tokens/seconds ---" % (tok_sec_prompt))

    # Print decoded output
    print(tokenizer.decode(output[0], skip_special_tokens=True))

# Calculate average tokens per second
tok_sec = round(total_length / duration, 3)
print("Average --- %s tokens/seconds ---" % (tok_sec))

7. Text to SQL without fine-tuning

prompt = [
"""
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction :
List all the cities in a decreasing order of each city's stations' highest latitude.
Database Schema:
CREATE TABLE station (city VARCHAR, lat INTEGER)
### Response:
SELECT city, lat FROM station ORDER BY lat DESC;
""",
"""
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction :
'What are the positions with both players having more than 20 points and less than 10 points and are in Top 10 ranking
Database Schema:
CREATE TABLE player (POSITION VARCHAR, Points INTEGER, Ranking INTEGER)
### Response:
SELECT POSITION, Points, Ranking
FROM player
WHERE Points > 20 AND Points < 10 AND Ranking IN (1,2,3,4,5,6,7,8,9,10)
""",
"""
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction :
Find the first name of the band mate that has performed in most songs.
Database Schema:
CREATE TABLE Songs (SongId VARCHAR); CREATE TABLE Band (firstname VARCHAR, id VARCHAR); CREATE TABLE Performance (bandmate VARCHAR)
### Response:
SELECT b.firstname
FROM Band b
JOIN Performance p ON b.id = p.bandmate
GROUP BY b.firstname
ORDER BY COUNT(*) DESC
LIMIT 1;
"""
]

# Reset the throughput counters for this run
duration = 0.0
total_length = 0

for i in range(len(prompt)):
    model_inputs = tokenizer(prompt[i], return_tensors="pt").to("cuda:0")
    start_time = time.time()
    output = model.generate(**model_inputs, max_length=500, no_repeat_ngram_size=10, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)[0]
    duration += float(time.time() - start_time)
    total_length += len(output)
    tok_sec_prompt = round(len(output)/float(time.time() - start_time), 3)
    print("Prompt --- %s tokens/seconds ---" % (tok_sec_prompt))
    print_gpu_utilization()
    print(tokenizer.decode(output, skip_special_tokens=False))

tok_sec = round(total_length/duration, 3)
print("Average --- %s tokens/seconds ---" % (tok_sec))

Fine-tuning

base_model_id = "microsoft/Phi-3-mini-4k-instruct"

tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_eos_token=True, use_fast=True, max_length=250)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token

compute_dtype = getattr(torch, "float16")  # change to bfloat16 if you are using an Ampere (or more recent) GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id, trust_remote_code=True, quantization_config=bnb_config, revision="refs/pr/23", device_map={"": 0}, torch_dtype="auto", flash_attn=True, flash_rotary=True, fused_dense=True
)
print_gpu_utilization()

model = prepare_model_for_kbit_training(model)

8. LoRA Parameters

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        'q_proj',
        'k_proj',
        'v_proj',
        'dense',
        'fc1',
        'fc2',
    ])
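As a back-of-the-envelope check of what these settings mean: for a linear layer with weight shape (d_out, d_in), LoRA learns two small matrices A (r × d_in) and B (d_out × r), so it adds r × (d_in + d_out) trainable parameters per targeted layer, and the learned update B·A is scaled by lora_alpha / r (here 16/16 = 1). The layer shape below is a hypothetical example using Phi-3-mini's 3072 hidden dimension.

# Trainable parameters LoRA adds to a single linear layer of shape (d_out, d_in).
def lora_params_per_layer(d_in: int, d_out: int, r: int = 16) -> int:
    return r * (d_in + d_out)

print(lora_params_per_layer(3072, 3072))  # 98,304 trainable params in the adapter
print(3072 * 3072)                        # 9,437,184 params in the frozen base layer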

9. Training Parameters

training_arguments = TrainingArguments(
    output_dir="./phi3-results",
    save_strategy="epoch",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=12,
    log_level="debug",
    save_steps=100,
    logging_steps=25,
    learning_rate=1e-4,
    eval_steps=50,
    optim='paged_adamw_8bit',
    fp16=True,  # change to bf16 if you are using an Ampere GPU
    num_train_epochs=1,
    max_steps=400,
    warmup_steps=100,
    lr_scheduler_type="linear",
    seed=42)
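With per_device_train_batch_size=4 and gradient_accumulation_steps=12, each optimizer step sees an effective batch of 4 × 12 = 48 formatted prompts on a single GPU, and max_steps=400 (which takes precedence over num_train_epochs) caps the run at roughly 400 × 48 = 19,200 examples, so this configuration is a quick demonstration pass rather than a full sweep over the dataset.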

10. Prepare the data for training

train_dataset = instruct_tune_dataset.map(batched=True,remove_columns=['answer', 'question', 'context'])
train_dataset
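The trainer below consumes the mapped train split directly; if you want to populate the commented-out eval_dataset, one option is to carve out a small evaluation split first (a sketch assuming the dataset ships only a "train" split, with an arbitrary 5% fraction):

# Optional: create a held-out evaluation split from the mapped training data.
split_dataset = train_dataset["train"].train_test_split(test_size=0.05, seed=42)
# split_dataset["train"] and split_dataset["test"] can then be passed to SFTTrainer
# as train_dataset and eval_dataset respectively.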

11. Fine-tune the model

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset["train"],
    #eval_dataset=dataset['test'],
    peft_config=peft_config,
    dataset_text_field="inputs",
    max_seq_length=1024,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False
)

trainer.train()

12. Test inference with the fine-tuned adapter

base_model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=True)

compute_dtype = getattr(torch, "float16")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id, trust_remote_code=True, quantization_config=bnb_config, device_map={"": 0}
)
adapter = "/content/phi3-results/checkpoint-400"
model = PeftModel.from_pretrained(model, adapter)

13. Perform Inference

database_schema = 'CREATE TABLE station (city VARCHAR, lat INTEGER)'
user_question = "List all the cities in a decreasing order of each city's stations' highest latitude."

prompt_template = f"""
Below is an instruction that describes a task.Write a response that appropriately completes the request.
### Instruction :
{user_question}
Database Schema:
{database_schema}
### Response:
"""

question = "'What are the positions with both players having more than 20 points and less than 10 points and are in Top 10 ranking"
context = "CREATE TABLE player (POSITION VARCHAR, Points INTEGER, Ranking INTEGER)"

prompt_template1 = f"""
Below is an instruction that describes a task.Write a response that appropriately completes the request.
### Instruction :
{question}
Database Schema:
{context}
### Response:
"""

context = '''CREATE TABLE Songs (SongId VARCHAR); CREATE TABLE Band (firstname VARCHAR, id VARCHAR); CREATE TABLE Performance (bandmate VARCHAR)'''
question = "Find the first name of the band mate that has performed in most songs."

prompt_template2 = f"""
Below is an instruction that describes a task.Write a response that appropriately completes the request.
### Instruction :
{question}
Database Schema:
{context}
### Response:
"""

prompt = []
prompt.append(prompt_template)
prompt.append(prompt_template1)
prompt.append(prompt_template2)

# Reset the throughput counters for this run
duration = 0.0
total_length = 0

for i in range(len(prompt)):
    model_inputs = tokenizer(prompt[i], return_tensors="pt").to("cuda:0")
    start_time = time.time()
    output = model.generate(**model_inputs, max_length=500, no_repeat_ngram_size=10, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)[0]
    duration += float(time.time() - start_time)
    total_length += len(output)
    tok_sec_prompt = round(len(output)/float(time.time() - start_time), 3)
    print("Prompt --- %s tokens/seconds ---" % (tok_sec_prompt))
    print_gpu_utilization()
    print(tokenizer.decode(output, skip_special_tokens=False))

tok_sec = round(total_length/duration, 3)
print("Average --- %s tokens/seconds ---" % (tok_sec))

14. Save the model

import locale
import shutil
from huggingface_hub import notebook_login
from google.colab import drive

locale.getpreferredencoding = lambda: "UTF-8"

notebook_login()

trainer.push_to_hub(commit_message="fine-tuned adapter")

drive.mount('/content/drive')


shutil.move('/content/phi3-results', '/content/drive/MyDrive/PHI-3')

trained_model = AutoPeftModelForCausalLM.from_pretrained(
    "/content/drive/MyDrive/PHI-3/phi3-results/checkpoint-400",
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map='auto',
)

lora_merged_model = trained_model.merge_and_unload()

lora_merged_model.save_pretrained("/content/drive/MyDrive/PHI-3/phi3-results/lora_merged_model",safe_serialization=True)

tokenizer.save_pretrained("/content/drive/MyDrive/PHI-3/phi3-results/lora_merged_model")

lora_merged_model.push_to_hub(repo_id="",commit_message="merged model")

tokenizer.push_to_hub(repo_id="",commit_message="merged model")

15. Perform Inference on Fine-tuned Model

peft_model_id = "username/phi3-results"
# The adapter configuration (including the LoRA hyperparameters used for training)
# is loaded directly from the saved adapter rather than re-declared here.
config = LoraConfig.from_pretrained(peft_model_id)

model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    load_in_4bit=True,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(peft_model_id)

model = PeftModel.from_pretrained(model, peft_model_id)

print(model.get_memory_footprint())

# Reset the throughput counters for this run
duration = 0.0
total_length = 0

for i in range(len(prompt)):
    model_inputs = tokenizer(prompt[i], return_tensors="pt").to("cuda:0")
    start_time = time.time()
    output = model.generate(**model_inputs, max_length=500, no_repeat_ngram_size=10, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)[0]
    duration += float(time.time() - start_time)
    total_length += len(output)
    tok_sec_prompt = round(len(output)/float(time.time() - start_time), 3)
    print("Prompt --- %s tokens/seconds ---" % (tok_sec_prompt))
    print_gpu_utilization()
    print(f"RESPONSE:\n {tokenizer.decode(output, skip_special_tokens=False)[len(prompt[i]):].split('</')[0]}")

tok_sec = round(total_length/duration, 3)
print("Average --- %s tokens/seconds ---" % (tok_sec))

Conclusion

The combination of Natural Language Processing (NLP) and SQL Query Engines has made it easier and more efficient to interact with databases. Previously, a deep understanding of SQL was required, making it difficult for many users. However, Open-Source Large Language Models (LLMs) like Mistral 7B and Microsoft Phi-3 have changed this. They can seamlessly translate natural language queries into structured SQL queries, eliminating the need for extensive SQL expertise.

Mistral 7B and Microsoft Phi-3 are capable models that perform well on NLP tasks. Mistral 7B's Grouped-Query Attention and Sliding Window Attention make it fast and memory-efficient, while Phi-3, despite its modest size, sets a new standard for performance and efficiency in small language models, matching or surpassing much larger models on demanding benchmarks.

The integration of Open-Source LLMs with advanced analytics platforms and AI systems helps organizations extract insights effortlessly. This technology has transformed the way industries like finance, healthcare, and e-commerce make data-driven decisions. The impact of these models on various sectors has been significant and has catalysed innovation and transformative change.

The convergence of NLP and SQL through Open-Source LLMs has democratized data access, driving efficiency, productivity and organizational success. It has unlocked the full potential of data assets, making it easier for stakeholders to extract actionable insights, and fostering a culture of exploration and innovation across sectors.

Notebooks: phi3

If you find this content valuable, I kindly request you to show your support by following and applauding 👏 the article. Additionally, you can also follow Chetan Khadke on LinkedIn for more updates. Special thanks to Shubham Salunke for the invaluable assistance provided.

Please check out my previous 📝 articles.

References

  1. https://arxiv.org/pdf/2310.06825.pdf
  2. https://artgor.medium.com/paper-review-mistral-7b-6acdf2f3132d
  3. https://medium.com/dair-ai/papers-explained-mistral-7b-b9632dedf580
  4. https://www.datacamp.com/tutorial/mistral-7b-tutorial
  5. https://www.analyticsvidhya.com/blog/2023/11/from-gpt-to-mistral-7b-the-exciting-leap-forward-in-ai-conversations/
  6. https://medium.com/@rubentak/mistral-7b-the-best-7-billion-parameter-llm-yet-8b0aa03016f9
  7. https://clarifai.com/mistralai/completion/models/mistral-7B-Instruc
  8. https://iamgeekydude.com/2023/06/02/alpaca-llm-load-model-using-langchain-hf/
  9. https://news.microsoft.com/source/features/ai/the-phi-3-small-language-models-with-big-potential/
  10. https://huggingface.co/microsoft/Phi-3-mini-128k-instruct


Chetankumar Khadke

As an NLP practitioner, I employ computational methods to analyze/understand complex human language, using machine learning analysis to develop algorithms.