Three New arXiv Papers Address LLM Evaluation and Optimization Benchmarking

Three new arXiv preprints propose different approaches to evaluation and benchmarking, covering legal LLMs, LLM interpretability tools, and optimization problems:

PLawBench for Legal LLMs: According to arXiv:2601.16669v1, researchers have developed PLawBench, described as “a rubric-based benchmark for evaluating LLMs in real-world legal practice.” The paper notes that “existing legal benchmarks rely on simplified and highly standardized” approaches, suggesting that the new benchmark is meant to address limitations in current legal AI evaluation.

Adversarial Robustness Testing: A paper (arXiv:2505.16004v2) examines the robustness of sparse autoencoders (SAEs), which are “commonly used to interpret the internal activations of large language models (LLMs) by mapping them to human-interpretable concept representations,” according to the abstract. The research evaluates the adversarial robustness of these concept representations, going beyond existing SAE evaluation metrics.

Automated Benchmark Generation: According to arXiv:2601.12723v2, researchers propose “an evolutionary framework for automatic optimization benchmark generation via large language models.” The paper states that “existing artificial benchmarks often fail to capture the diversity and irregularity of real-world problem structures,” positioning its LLM-based approach as an alternative to traditional benchmark creation methods.

All three papers are cross-disciplinary work that intersects with AI research (arXiv category cs.AI).