The Science of Benchmarking

What's Measured, What's Missing, What's Next

News: slides available at benchmarking.science/slides.pdf

NeurIPS 2025 Tutorial

Tuesday, December 2, 2025, 1:30pm to 4:00pm

NeurIPS 2025, San Diego Convention Center, Exhibit Hall G, H

Martin Ziqiao Ma

University of Michigan

Michael Saxon

University of Washington

Xiang Yue

Carnegie Mellon University (Now @ Meta)

https://benchmarking.science

Outline

1. Epistemology, Design & Practice

What should we measure? What makes a good benchmark?

Video coming soon

2. Limitations

What are the main current issues in benchmarking? How is the changing landscape of models making benchmarks worse? How do people approach these issues? What should attendees who want to get into evaluation know?

3. Emerging Paradigms

How are people addressing these problems? We touch on adversarial methods, dynamic benchmarks, arenas and scaled human evals, simulators and sandboxes, and applied interpretability. What can attendees work on?

Video coming soon