Three arXiv Papers Address Language Model Specialization Challenges

Three new papers published on arXiv explore different approaches to adapting large language models for specialized tasks and underrepresented domains.

EstLLM: Enhancing Estonian Language Support

The arXiv paper 2603.02041v1 investigates whether continued pretraining (CPT) can improve Estonian language capabilities in multilingual LLMs. The paper notes that “large language models (LLMs) are predominantly trained on English-centric data, resulting in uneven performance for smaller languages.”

FT-Dojo: Automating Fine-Tuning

The arXiv paper 2603.01712v1 introduces FT-Dojo, motivated by the observation that “fine-tuning large language models for vertical domains remains a labor-intensive and expensive process, requiring domain experts to curate data, configure training, and iteratively diagnose model behavior.” The research explores whether language agents can carry out LLM fine-tuning autonomously.

Thoth: Time Series Understanding

According to arXiv paper 2603.01042v1, the Thoth system uses mid-training to bridge LLMs to time series understanding. The paper states that while “Large Language Models (LLMs) have demonstrated remarkable success in general-purpose reasoning,” they “still struggle to understand and reason about time series data, which limits their effectiveness in decision-making scenarios” that depend on such data.

Taken together, the three papers represent ongoing efforts to extend LLM capabilities beyond English-centric, general-purpose training: to a smaller language, to vertical domains, and to time series data.