Fine-Tuning Small Language Models with Chain-of-Thought Improves NL2SQL Performance While Reducing Costs

Research shows small language models fine-tuned with Chain-of-Thought reasoning achieve production-grade NL2SQL performance at lower inference costs.

Researchers have demonstrated that fine-tuning small language models with Chain-of-Thought (CoT) reasoning can significantly improve their performance on Natural Language to SQL (NL2SQL) tasks while reducing inference costs, according to a paper published on arxiv.org.

The research reveals a “counter-intuitive scaling phenomenon” where fine-tuning large models like Gemini 2.5 Flash/Lite on standard datasets “yields negligible returns, often leading to overfitting on complex queries,” according to arxiv.org. In contrast, small models such as Qwen showed substantial improvements.

According to the paper, fine-tuning raised the small model's baseline accuracy from 36% to 45%. When the researchers enriched the training data with explicit Chain-of-Thought reasoning, accuracy climbed further to 54.5%. While this remains below the accuracy of large models like Gemini 2.5, the paper states it "does serve the business goal of significant cost reduction, latency in inference time and also meeting the business critical performance accuracy threshold."
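To make the dataset-enrichment idea concrete, the sketch below shows what a CoT-augmented NL2SQL training record might look like as a JSONL line. This is a minimal illustration, not the authors' actual format: the field names, schema string, and reasoning text are all assumptions.

```python
import json

# Hypothetical CoT-enriched NL2SQL training record. The key idea from the
# paper is that each example pairs the question and schema with an explicit
# reasoning trace *before* the final SQL, so the small model learns the
# reasoning pattern, not just question->query mappings.
# All field names and content here are illustrative assumptions.
record = {
    "question": "How many orders were placed in 2023?",
    "schema": "orders(order_id INT, order_date DATE, amount DECIMAL)",
    # Explicit Chain-of-Thought reasoning added during dataset enrichment.
    "chain_of_thought": (
        "The question asks for a count of rows in the orders table "
        "restricted to the year 2023, so filter order_date to that year "
        "and apply COUNT(*)."
    ),
    "sql": "SELECT COUNT(*) FROM orders "
           "WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31';",
}

# Serialize as a single JSONL line, a common fine-tuning data format.
line = json.dumps(record)
print(line)
```

During fine-tuning, the model would be trained to emit the `chain_of_thought` text followed by the `sql`, so that at inference time the reasoning step precedes query generation.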

The findings demonstrate that “transferring reasoning patterns enables compute-efficient smaller models to approach production-grade performance,” according to arxiv.org. This approach addresses the challenge that while Large Language Models have shown “impressive zero-shot capabilities,” their “high inference costs limit deployment at scale” for NL2SQL tasks that remain “a critical bottleneck for democratization of data in enterprises.”