Three New Benchmarks Test Large Language Model Capabilities
Three recent arXiv preprints introduce new evaluation frameworks, each examining a different aspect of large language model performance.
Zork Gaming Performance
A position paper (arXiv:2602.15867v1) evaluates “the problem-solving and reasoning capabilities of contemporary Large Language Models (LLMs) through their performance in Zork, the seminal text-based adventure game first released in 1977,” according to the abstract. The study uses the game’s interactive text-command format to test LLM reasoning abilities.
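The announcement does not describe the paper’s harness in detail, but evaluations of this kind typically follow a turn-based loop in which the game’s text is fed to the model and the model’s commands are fed back to the game. The sketch below illustrates that general pattern; the StubGame class and query_model function are hypothetical stand-ins, not the paper’s actual code or API.

```python
# A minimal sketch of a text-adventure evaluation loop. StubGame and
# query_model are hypothetical placeholders used only to make it runnable;
# a real harness would drive an actual Zork interpreter and an actual LLM.

class StubGame:
    """Hypothetical one-turn game standing in for a Zork interpreter."""
    def reset(self) -> str:
        return "West of House. There is a small mailbox here."
    def step(self, command: str):
        # Returns (observation, score, done); a real harness would run Zork.
        return ("Opening the small mailbox reveals a leaflet.", 5, True)

def query_model(transcript: str) -> str:
    """Hypothetical stand-in for an LLM call; returns the next command."""
    return "open mailbox"

def play_episode(game: StubGame, max_turns: int = 50) -> int:
    """Feed game text to the model and model commands back to the game."""
    transcript = game.reset()
    score = 0
    for _ in range(max_turns):
        command = query_model(transcript)
        observation, score, done = game.step(command)
        transcript += f"\n> {command}\n{observation}"
        if done:
            break
    return score

print(play_episode(StubGame()))  # final game score for the episode, here 5
```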
Indian Educational Assessment
IndicEval (arXiv:2602.16467v1) introduces “a scalable benchmarking platform designed to assess LLM performance” in educational contexts. According to the abstract, the framework addresses the need for “evaluation frameworks that reflect real-world academic rigor and multilingual complexity,” specifically focusing on bilingual Indian educational content.
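As an illustration of what per-language scoring in a bilingual benchmark might look like, here is a short, self-contained sketch; the record format and field names are assumptions for illustration, not IndicEval’s actual schema.

```python
# Hedged sketch: aggregate accuracy separately per language. The
# 'language'/'correct' record layout is an assumed format, not IndicEval's.
from collections import defaultdict

def accuracy_by_language(results):
    """results: iterable of dicts with 'language' and 'correct' (bool) keys."""
    totals = defaultdict(lambda: [0, 0])  # language -> [num_correct, num_total]
    for r in results:
        totals[r["language"]][0] += int(r["correct"])
        totals[r["language"]][1] += 1
    return {lang: c / n for lang, (c, n) in totals.items()}

sample = [
    {"language": "Hindi", "correct": True},
    {"language": "Hindi", "correct": False},
    {"language": "English", "correct": True},
]
print(accuracy_by_language(sample))  # {'Hindi': 0.5, 'English': 1.0}
```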
Geospatial Reasoning
GPSBench (arXiv:2602.16105v1) examines whether “Large Language Models (LLMs) understand GPS Coordinates.” According to the abstract, “LLMs are increasingly deployed in applications that interact with the physical world, such as navigation, robotics, or mapping, making robust geospatial reasoning a critical capability.”
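For a concrete sense of the coordinate arithmetic a benchmark like this might probe, here is a self-contained haversine great-circle distance computation; the example is ours, not drawn from GPSBench itself.

```python
# Haversine great-circle distance between two GPS points, a standard
# formula and a typical example of geospatial reasoning over coordinates.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Paris (48.8566, 2.3522) to Berlin (52.5200, 13.4050) is roughly 878 km.
print(round(haversine_km(48.8566, 2.3522, 52.5200, 13.4050)))
```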
All three papers were announced in arXiv’s AI section, the first two as cross-listed submissions and GPSBench as a new submission.