New Research Evaluates LLM Capabilities in Code Security, Preference Testing, and Real-World API Usage
Three recent papers on arXiv examine different aspects of large language model (LLM) performance and evaluation:
Code Vulnerability Detection: In arXiv paper 2601.00254v1, researchers present “a comparative study on the effectiveness of LLM-based” approaches to automated software vulnerability detection. The study evaluates Retrieval-Augmented Generation (RAG), Supervised Fine-Tuning (SFT), and dual-agent systems on this “crucial task in securing modern codebases.”
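The abstract names these approaches but not their mechanics. As a rough, hypothetical illustration of the retrieval-augmented variant only (not the paper's actual pipeline), the Python sketch below retrieves labeled examples similar to a target snippet and builds a grounded detection prompt; the corpus, the similarity measure, and the prompt wording are all placeholder assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical labeled corpus a retriever could draw from (not taken from the paper).
KNOWN_EXAMPLES = [
    ('strcpy(buf, user_input);', 'vulnerable: unbounded copy (CWE-120)'),
    ('snprintf(buf, sizeof(buf), "%s", user_input);', 'safe: bounded copy'),
]

def retrieve_similar(code: str, k: int = 2):
    """Rank labeled examples by crude string similarity (stand-in for an embedding index)."""
    return sorted(
        KNOWN_EXAMPLES,
        key=lambda ex: SequenceMatcher(None, code, ex[0]).ratio(),
        reverse=True,
    )[:k]

def build_rag_prompt(code: str) -> str:
    """Assemble a detection prompt that grounds the model in retrieved, labeled examples."""
    context = "\n".join(f"- {snippet}  ->  {label}" for snippet, label in retrieve_similar(code))
    return (
        "You are a security reviewer. Using the labeled references, say whether the\n"
        "target code is vulnerable and name the likely CWE.\n\n"
        f"References:\n{context}\n\nTarget code:\n{code}\n"
    )

if __name__ == "__main__":
    # The resulting prompt would be sent to whichever LLM is under evaluation.
    print(build_rag_prompt("strcpy(dest, argv[1]);"))
```

The string-similarity ranking here merely stands in for whatever retriever the study actually uses; the SFT and dual-agent variants would replace this step entirely.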
LLM Output Evaluation: Paper 2410.12869v5 addresses the challenge of judging LLM output quality. The authors note that “evaluating their outputs’ quality regarding preference remains a critical challenge” and propose moving “Towards Acyclic Preference Evaluation of Language Models via Multiple Evaluators” rather than relying on a single strong LLM as judge.
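The excerpt does not describe the paper's method, but the “acyclic” framing can be illustrated generically: combine pairwise judgments from several evaluators and check that the resulting preference graph admits a consistent, cycle-free ranking. The sketch below, with made-up evaluator votes and model names, uses simple majority voting plus a topological sort; it illustrates the general idea, not the paper's algorithm.

```python
from collections import Counter, defaultdict
from graphlib import CycleError, TopologicalSorter

# Made-up pairwise votes from three evaluators: each tuple is (preferred, dispreferred).
VOTES = {
    "evaluator_a": [("model_x", "model_y"), ("model_y", "model_z"), ("model_x", "model_z")],
    "evaluator_b": [("model_x", "model_y"), ("model_z", "model_y"), ("model_x", "model_z")],
    "evaluator_c": [("model_y", "model_x"), ("model_y", "model_z"), ("model_x", "model_z")],
}

def majority_edges(votes):
    """Keep a preference edge a -> b only when more evaluators chose a over b than b over a."""
    tally = Counter(pair for ballots in votes.values() for pair in ballots)
    return [(a, b) for (a, b), wins in tally.items() if wins > tally.get((b, a), 0)]

def acyclic_ranking(edges):
    """Topologically sort the majority-preference graph; a cycle means no consistent ranking."""
    preds = defaultdict(set)
    for winner, loser in edges:
        preds[loser].add(winner)  # everything ranked above `loser`
    try:
        return list(TopologicalSorter(preds).static_order())  # best model first
    except CycleError:
        return None  # cyclic preferences: some tie-breaking or cycle removal would be needed

if __name__ == "__main__":
    print("ranking (best first):", acyclic_ranking(majority_edges(VOTES)))
```

Majority voting is only the simplest possible aggregation rule; the point of the sketch is that combining multiple evaluators exposes preference cycles that a single judge would leave undetected.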
Real-World API Performance: arXiv paper 2601.00268v1 introduces WildAGTEval, “a benchmark designed to evaluate large language model (LLM) agents’ function-calling capabilities under realistic API complexity.” Unlike previous work that “assumes an idealized API system,” this benchmark accounts for “real-world factors” that affect LLM agent performance.
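The excerpt does not spell out which “real-world factors” WildAGTEval covers, so the sketch below is only a generic illustration of scoring an agent's function call while simulating one such factor, a flaky endpoint that needs retries. Every name here (the mock weather API, the scoring rule, the retry helper) is an assumption for illustration, not part of the benchmark.

```python
import random
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    arguments: dict

def flaky_weather_api(city: str) -> dict:
    """Mock endpoint with realistic behavior: transient failures the caller must tolerate."""
    if random.random() < 0.3:
        raise TimeoutError("upstream timeout")  # simulated real-world flakiness
    return {"city": city, "temp_c": 21}

def call_with_retries(fn, *args, attempts: int = 3):
    """Simple retry loop, one example of a real-world factor an idealized API setup ignores."""
    for i in range(attempts):
        try:
            return fn(*args)
        except TimeoutError:
            if i == attempts - 1:
                raise

def score_call(predicted: ToolCall, expected: ToolCall) -> float:
    """Exact match on the function name, partial credit for matching arguments."""
    if predicted.name != expected.name:
        return 0.0
    hits = sum(1 for k, v in expected.arguments.items() if predicted.arguments.get(k) == v)
    return hits / max(len(expected.arguments), 1)

if __name__ == "__main__":
    random.seed(0)
    expected = ToolCall("get_weather", {"city": "Berlin"})
    predicted = ToolCall("get_weather", {"city": "Berlin"})  # would come from the agent under test
    print("call score:", score_call(predicted, expected))
    print("api result:", call_with_retries(flaky_weather_api, "Berlin"))
```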
Together, these studies address gaps in understanding how LLMs perform in practical deployment scenarios, from securing codebases to judging output quality to calling real-world APIs.