LLM evaluation is trapped in a cycle. Companies announce models with claimed SOTA performance on a collection of benchmarks. In response, researchers critique these benchmarks and create new ones. But beyond back-and-forth alignment and debiasing efforts, we are still far from understanding how LLMs—not just as models with inputs and outputs but as products with interfaces and features—behave around and impact the different communities in which they are deployed. It seems like we are doing socially responsible language modeling, but really, we’re playing a game of whack-a-mole in a sandbox full of models that are sometimes useful but fundamentally unreliable.
Prioritizing participatory, community-driven evaluations is the most effective way to escape the vicious cycle of benchmarks and capability evaluations that only confound and confuse our understanding of whether LLMs are useful tools. This focus will lead us to better science and a better understanding of where language models can improve, how they may be beneficially used in certain cases, and whether they should be used at all.
When we try to build “responsible” models by aligning them on abstract benchmarks, we assume that LLMs are the best tools to serve communities. But responsibility isn’t achieved by marginally improving on an arbitrarily chosen benchmark; responsibility is achieved through understanding who is using LLMs, how they are using them, and what they are using them for. Socially responsible language modeling researchers should use LLM evaluations as opportunities to work alongside civil society organizations and domain experts who hold the most knowledge about the use of models in their local contexts. Grounding research in community needs will organically move evaluations beyond artificial benchmarks like arbitrary multiple-choice questions and math problems and towards quantitative and qualitative methods that reflect actual community use cases, are intentionally designed to produce meaningful and targeted insights, and are geographically sensitive enough to capture dynamic local laws, practices, and languages. We’ve already started to see this happen – in assessing the accuracy of information around voting [1], journalism practices [2], or legal analysis [3]. In all of these examples, evaluations are designed and driven by domain experts, precisely map out areas of strength and weakness, and reflect actual user interactions and experiences. The methodologies might not scale or generalize, but they are targeted and provide concrete insights about the limits of model use.
This is also why community-driven participatory evaluations yield good scientific insights. They strip away marketing, hype, and unnecessary abstractions and reveal the contexts where LLMs might genuinely add value and where they do not. This opens space for meaningful dialogue among stakeholders across academia, industry, government, and civil society, and it empowers the public to engage with the tools they are being presented with and assess for themselves whether those tools can be used responsibly.
Abstract to be presented at the Third Workshop on Socially Responsible Language Modelling Research (SoLaR) at the 2025 Conference on Language Modeling (COLM) in Montreal, Canada.