Three New Research Papers Address LLM Evaluation, Web Extraction, and Dataset Compression

Researchers publish papers on LLM benchmark validity, web scraping datasets, and accelerated dataset distillation methods.

Three new papers on arXiv address different aspects of AI research and development.

Web Information Extraction Dataset

According to arXiv paper 2602.15189v1, researchers have introduced ScrapeGraphAI-100k, described as “a large-scale dataset for LLM-based web information extraction.” The paper notes that while “the use of large language models for web information extraction is becoming increasingly fundamental to modern web information retrieval pipelines,” existing datasets “tend to be small, synthetic or text-only, failing to capture the structural” aspects of web data.
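The paper's own pipeline is not detailed here, but the general pattern of LLM-based web extraction is simple to illustrate: strip a page down to text, then prompt a model to return structured fields. The sketch below (all names hypothetical, not from the dataset) builds such a prompt with the standard library's HTML parser; the resulting string would then be sent to whatever model the pipeline uses.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text from a page, skipping script/style content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def page_to_prompt(html: str, fields: list) -> str:
    """Turn raw HTML into an extraction prompt for an LLM (toy illustration)."""
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.parts)
    return f"Extract the fields [{', '.join(fields)}] as JSON from:\n{text}"

html = "<html><body><h1>Acme Corp</h1><p>Founded 1999</p></body></html>"
prompt = page_to_prompt(html, ["company", "founded"])
```

Note that flattening the page to text is exactly the simplification the paper criticizes in "text-only" datasets; a structure-aware pipeline would also preserve tags and layout.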

LLM Benchmark Validity

A second paper (arXiv:2602.15532v1) examines construct validity in LLM evaluations. The authors note that “the LLM community often reports benchmark results as if they are synonymous with general model capabilities,” but warn that “benchmarks can have problems that distort performance, like test set contamination and annotator error.” The paper addresses the question: “How can we know that a benchmark actually measures what it claims to measure?”
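One of the benchmark problems the authors name, test set contamination, is often screened for with crude n-gram overlap heuristics: flag any test item that shares many token n-grams with the training corpus. The sketch below is a generic illustration of that idea, not the paper's method; the n-gram length and threshold are illustrative.

```python
def ngrams(text: str, n: int = 8) -> set:
    """All token n-grams of a string, lowercased."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(test_items, train_corpus, n=8, threshold=0.5):
    """Fraction of test items whose n-grams overlap the training text
    by at least `threshold` -- a rough proxy for contamination."""
    train_grams = ngrams(train_corpus, n)
    flagged = 0
    for item in test_items:
        grams = ngrams(item, n)
        if grams and len(grams & train_grams) / len(grams) >= threshold:
            flagged += 1
    return flagged / len(test_items)

train = "the quick brown fox jumps over the lazy dog"
tests = ["quick brown fox jumps over",        # verbatim span of training text
         "completely unrelated sentence here now"]
rate = contamination_rate(tests, train, n=3)  # flags only the first item
```

A heuristic like this can only surface verbatim leakage; paraphrased contamination, and the annotator-error problem the paper also raises, require different checks.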

Dataset Distillation Acceleration

A third paper (arXiv:2602.15277v1) focuses on accelerating dataset distillation for large-scale datasets. According to the abstract, “dataset distillation compresses the original data into compact synthetic datasets, reducing training time and storage while retaining model performance, enabling deployment under limited resources.”
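To make the quoted definition concrete: distillation replaces a large training set with a few synthetic points that still train a usable model. The toy stand-in below (not the paper's algorithm) "distills" each class to a single mean vector and classifies by nearest synthetic point, shrinking four examples to two while preserving the decision boundary.

```python
def distill_to_means(X, y):
    """One synthetic example per class: the class mean (a crude distillation)."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}

def predict(synthetic, x):
    """Nearest-synthetic-point classifier over the distilled set."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(synthetic, key=lambda c: dist(synthetic[c], x))

X = [[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0]]
y = ["a", "a", "b", "b"]
syn = distill_to_means(X, y)  # two synthetic points stand in for the whole set
```

Real distillation methods instead *learn* the synthetic points (e.g. by matching gradients or training trajectories), which is precisely the step whose cost the paper aims to reduce at scale.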