New Benchmarks Target LLM Performance Gaps
Researchers have released three specialized evaluation resources, two benchmarks and an analysis framework, addressing underexplored areas of large language model (LLM) capabilities.
Medical QA in Romanian
According to arXiv paper 2508.16390v4, researchers introduced MedQARo, described as “the first large-scale medical QA benchmark in Romanian.” The dataset comprises 105,880 question-answer pairs, and the paper includes a comprehensive evaluation of state-of-the-art LLMs on the task.
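The paper does not specify its dataset format or scoring protocol here, but a QA benchmark evaluation of this kind typically reduces to scoring model answers against gold answers. A minimal sketch, assuming a simple list of question-answer records, exact-match scoring, and a hypothetical `model_answer` callable standing in for any LLM:

```python
# Hypothetical sketch: scoring a model on a medical QA dataset.
# The record format and exact-match metric are assumptions,
# not MedQARo's actual evaluation protocol.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for lenient matching."""
    return " ".join(text.lower().split())

def exact_match_accuracy(dataset, model_answer):
    """Fraction of questions whose prediction matches the gold answer."""
    correct = 0
    for example in dataset:
        prediction = model_answer(example["question"])
        if normalize(prediction) == normalize(example["answer"]):
            correct += 1
    return correct / len(dataset)

# Toy run with a stub model that always answers "paracetamol".
toy_dataset = [
    {"question": "Ce analgezic uzual se recomanda?", "answer": "Paracetamol"},
    {"question": "Care este doza maxima zilnica?", "answer": "4 g"},
]
print(exact_match_accuracy(toy_dataset, lambda q: "paracetamol"))  # 0.5
```

Real medical QA evaluations often add softer metrics (token F1, LLM-as-judge) on top of exact match, since clinically equivalent answers can differ in surface form.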
Low-Resource Language Evaluation
A separate study (arXiv 2511.10664v2) presents a cross-lingual benchmark focusing on Cantonese, Japanese, and Turkish. The paper notes that while LLMs “have achieved impressive results in high-resource languages like English,” their “effectiveness in low-resource and morphologically rich languages remains underexplored.”
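A cross-lingual study like this one reports results broken down per language rather than as a single pooled score. A minimal sketch of that aggregation, assuming each evaluation record is a (language, is-correct) pair; this schema is an illustration, not the paper's:

```python
# Hypothetical sketch: per-language accuracy aggregation for a
# cross-lingual benchmark. The (language, is_correct) record shape
# is an assumption for illustration.
from collections import defaultdict

def accuracy_by_language(results):
    """results: iterable of (language, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for lang, ok in results:
        totals[lang] += 1
        correct[lang] += int(ok)
    return {lang: correct[lang] / totals[lang] for lang in totals}

results = [
    ("cantonese", True), ("cantonese", False),
    ("japanese", True), ("turkish", True), ("turkish", True),
]
print(accuracy_by_language(results))
# {'cantonese': 0.5, 'japanese': 1.0, 'turkish': 1.0}
```

Reporting per language is what surfaces the high-resource versus low-resource gap the authors describe: a pooled average would hide weak performance on the underrepresented languages.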
Single-Cell Analysis Framework
arXiv paper 2602.11609v1 introduces scPilot, described as “the first systematic framework to practice omics-native reasoning.” According to the abstract, scPilot enables an LLM to “converse in natural language while directly inspecting single-cell RNA-seq data and on-demand bioinformatics tools,” grounding the model's reasoning directly in the underlying omics data.
Together, these resources address gaps in LLM evaluation beyond English-language text processing.