The Science of Benchmarking

What's Measured, What's Missing, What's Next

A NeurIPS 2025 Tutorial

Tuesday, December 2, 2025, 1:30pm to 4:00pm

NeurIPS 2025, San Diego Convention Center, Exhibit Hall G, H

https://benchmarking.science

Outline

1. Epistemology, Design & Practice

What should we measure? What makes a good benchmark?

2. Limitations

What are the main current issues in benchmarking? How are changes in the model landscape making benchmarks worse? How do people approach these issues? What should attendees who want to get into evaluation know?

3. Emerging Paradigms

How are people addressing these problems? We touch on adversarial methods, dynamic benchmarks, arenas and scaled human evals, simulators and sandboxes, and applied interpretability. What can attendees work on?

4. Panel Conversation

Panelists from diverse areas will share their perspectives.

Panelists