Fine-Tuning Small Language Models with Chain-of-Thought Boosts NL2SQL Performance While Reducing Costs

Research shows small models gain significantly from CoT fine-tuning for database queries, while large models see negligible improvement.

New research reveals a counter-intuitive finding in optimizing language models for translating natural language to SQL (NL2SQL): small models benefit significantly more from fine-tuning than their larger counterparts.

According to a paper published on arxiv.org, fine-tuning large models like Gemini 2.5 Flash/Lite on standard NL2SQL datasets “yields negligible returns, often leading to overfitting on complex queries.” In contrast, small models such as Qwen demonstrated substantial improvements.

The research shows that fine-tuning improved a small model’s baseline accuracy from 36% to 45%. When researchers enriched the dataset with explicit Chain-of-Thought (CoT) reasoning, accuracy surged further to 54.5%. While this remains below the accuracy of larger models like Gemini 2.5, the paper states it “does serve the business goal of significant cost reduction, latency in inference time and also meeting the business critical performance accuracy threshold.”
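To make the idea concrete, a CoT-enriched fine-tuning record pairs each question not just with its SQL answer but with the reasoning that produces it. The sketch below is illustrative only: the field names, chat format, and example schema are assumptions, not details from the paper.

```python
import json

def make_cot_record(question: str, schema: str, reasoning: str, sql: str) -> str:
    """Build one JSONL training line in a hypothetical chat-style format.

    The assistant target places explicit reasoning *before* the SQL, so a
    small model learns the reasoning pattern rather than only the answer.
    """
    record = {
        "messages": [
            {
                "role": "system",
                "content": (
                    "Translate the question into SQL. Think step by step, "
                    "then output the final query.\nSchema:\n" + schema
                ),
            },
            {"role": "user", "content": question},
            {
                "role": "assistant",
                "content": f"Reasoning: {reasoning}\nSQL: {sql}",
            },
        ]
    }
    return json.dumps(record)

# Example record (toy schema, not from the paper's dataset):
line = make_cot_record(
    question="How many orders were placed in 2023?",
    schema="orders(id INT, placed_at DATE)",
    reasoning="Filter orders to rows whose placed_at year is 2023, then count them.",
    sql="SELECT COUNT(*) FROM orders WHERE strftime('%Y', placed_at) = '2023';",
)
print(line)
```

A baseline dataset would contain only the question and final SQL; the CoT-enriched variant adds the intermediate reasoning string, which is what the paper credits for the jump from 45% to 54.5% accuracy.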

The findings suggest that “transferring reasoning patterns enables compute-efficient smaller models to approach production-grade performance,” according to the arxiv.org paper. This approach addresses a critical bottleneck for enterprises seeking to democratize data access, as large language models’ “high inference costs limit deployment at scale.”

The 9-page paper demonstrates a practical path forward for organizations balancing performance requirements against computational costs in database query applications.