One thing that makes large language models (LLMs) so powerful is the variety of tasks to which they can be applied. The same machine learning model that can help a graduate student draft an email could also help a doctor diagnose cancer.
However, the wide applicability of these models also makes their systematic evaluation difficult. It would be impossible to create a benchmark data set to test the model on every type of question that could be asked.
In a new paper, MIT researchers took a different approach. They argue that because people decide when to deploy large language models, evaluating a model requires understanding how people form beliefs about its capabilities.
For example, a graduate student must decide whether a model might be useful in drafting a particular email, and a physician must determine in which cases it would be best to consult the model.
Based on this idea, the researchers developed a framework for evaluating an LLM based on its alignment with human beliefs about how it will perform on a certain task.
They introduce the human generalization function—a model of how people update their beliefs about the LLM’s capabilities after interacting with the LLM. They then evaluate how LLMs align with this human generalization function.
Their results suggest that when models are misaligned with the human generalization function, users may be overconfident or underconfident about where to deploy them, which can lead to unexpected model failures. Additionally, because of this misalignment, more capable models tend to perform worse than smaller models in high-stakes situations.
“These tools are exciting because they’re universal, but because they’re universal, they’re going to work with humans, so we have to take the human into account,” says study co-author Ashesh Rambachan, an assistant professor of economics and a principal investigator in the Laboratory for Information and Decision Systems (LIDS).
Rambachan is joined on the paper by lead author Keyon Vafa, a postdoctoral fellow at Harvard University; and Sendhil Mullainathan, an MIT professor in the departments of Electrical Engineering and Computer Science and of Economics, and a member of LIDS. The research will be presented at the International Conference on Machine Learning.
Human generalization
As we interact with other people, we form beliefs about what we think they do and don’t know. For example, if your friend is picky about correcting people’s grammar, you might generalize and think that they would also excel at sentence construction, even though you’ve never asked them about sentence construction.
“Language models often feel so human. We wanted to illustrate that this power of human generalization is also present in how people form beliefs about language models,” says Rambachan.
As a starting point, the researchers formally defined the human generalization function, which involves asking questions, observing how a person or LLM responds, and then making inferences about how that person or model would respond to related questions.
If someone sees that an LLM can answer matrix-inversion questions correctly, they might also assume it can handle simple arithmetic. A model that is misaligned with this function, one that does not perform well on questions a human expects it to answer correctly, could fail when deployed.
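To make the idea concrete, here is a minimal, hypothetical sketch of what a human generalization function might look like: it takes an observed interaction with a model plus a related question, and returns a belief about whether the model will get that question right. This is not the authors' code; the `Observation` class, the `similarity`-weighted update rule, and all names are illustrative assumptions.

```python
# Illustrative sketch (not the paper's implementation) of a human
# generalization function: an observed interaction is mapped to a belief
# about performance on a related question.

from dataclasses import dataclass

@dataclass
class Observation:
    question: str        # the question the person watched the model answer
    was_correct: bool    # whether the model answered it correctly

def human_generalization(obs: Observation, related_question: str,
                         similarity: float) -> float:
    """Hypothetical belief-update rule: start from a neutral prior of 0.5 and
    shift it toward the observed outcome in proportion to how similar the new
    question feels to the observed one (similarity in [0, 1])."""
    prior = 0.5
    outcome = 1.0 if obs.was_correct else 0.0
    return (1 - similarity) * prior + similarity * outcome

# Seeing a model invert a matrix correctly makes a person fairly confident
# it can also handle simple arithmetic.
obs = Observation("Invert the matrix [[2, 0], [0, 4]].", was_correct=True)
print(human_generalization(obs, "What is 17 + 25?", similarity=0.8))  # 0.9
```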
With this formal definition in hand, the researchers designed a survey to measure how people generalize when interacting with LLMs and other people.
They showed survey participants questions that a person or an LLM had gotten right or wrong, and then asked whether they thought that person or LLM would answer a related question correctly. Through the survey, they generated a dataset of nearly 19,000 examples of how humans generalize about LLM performance across 79 diverse tasks.
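As a rough illustration of what one record in such a dataset could contain, and how misalignment might be tallied from it, consider the hedged sketch below. The field names and example values are assumptions for exposition, not the released data format.

```python
# Hypothetical survey record and a simple misalignment tally.

records = [
    {
        "task": "arithmetic",
        "observed_question": "What is 12 * 9?",
        "observed_outcome": True,        # the LLM answered this correctly
        "related_question": "What is 45 / 5?",
        "human_prediction": True,        # participant expected a correct answer
        "actual_outcome": False,         # what the LLM actually did
    },
    # ... nearly 19,000 such examples across 79 tasks in the real dataset
]

# A model is "misaligned" with human generalization on a record when the
# participant's prediction and the model's actual behavior disagree.
misaligned = sum(r["human_prediction"] != r["actual_outcome"] for r in records)
print(f"misalignment rate: {misaligned / len(records):.2f}")
```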
Measuring misalignment
They found that participants did fairly well when asked whether a person who answered one question correctly would answer a related question correctly, but they were much worse at generalizing about the performance of LLMs.
“Human generalization is applied to language models, but that breaks down because those language models don’t really show patterns of expertise like humans do,” says Rambachan.
People were also more likely to update their beliefs about an LLM when it answered questions incorrectly than when it answered correctly. They also tended to believe that an LLM’s performance on simple questions would have little bearing on its performance on more complex questions.
In situations where people put more weight on incorrect responses, simpler models outperformed very large models such as GPT-4.
“Language models that are improving can almost trick people into thinking they’re going to perform well on related questions, when they really don’t,” he says.
One possible explanation for why people are worse at generalizing to LLMs could come from their novelty—people have much less experience interacting with LLMs than they do with other people.
“Moving forward, it’s possible that we can improve just by interacting more with language models,” he says.
To this end, the researchers want to conduct further studies of how people’s beliefs about LLMs evolve over time as they interact with a model. They also want to explore how human generalization could be incorporated into the development of LLMs.
“When we’re training these algorithms in the first place or trying to update them with human feedback, we have to take into account the human generalization function in how we think about performance measurement,” he says.
In the meantime, the researchers hope their dataset could be used as a benchmark to compare how LLMs perform relative to the human generalization function, which could help improve the performance of models deployed in real-world situations.
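As a hedged sketch of how such a benchmark comparison might work, the snippet below assumes per-model records shaped like the illustrative ones sketched earlier (a human prediction plus the model's actual outcome for each example) and scores each model two ways: overall agreement with human expectations, and the failure rate on the deployment-relevant slice where people expected the model to succeed. The record layout and function names are assumptions, not the paper's released tooling.

```python
# Hypothetical benchmark comparison against the human generalization function.
# Assumes per-model lists of records like the illustrative ones above.

def alignment_score(records):
    """Fraction of examples where the participant's prediction matched what
    the model actually did (higher = better aligned with human expectations)."""
    return sum(r["human_prediction"] == r["actual_outcome"] for r in records) / len(records)

def unexpected_failure_rate(records):
    """Among cases where people expected a correct answer, how often the model
    actually failed: the situations most likely to cause costly surprises."""
    expected_right = [r for r in records if r["human_prediction"]]
    if not expected_right:
        return 0.0
    return sum(not r["actual_outcome"] for r in expected_right) / len(expected_right)

def compare(models):
    """models: dict mapping a model name to its list of survey records."""
    for name, recs in models.items():
        print(name, alignment_score(recs), unexpected_failure_rate(recs))
```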
“For me, the contribution of the paper is twofold. The first is practical: The paper exposes a critical problem with deploying LLMs for mainstream consumer use. If people do not properly understand when LLMs will be accurate and when they will fail, then they will be more likely to see errors and possibly be discouraged from continuing to use them. This highlights the problem of aligning the models with how people understand generalization,” says Alex Imas, a professor of behavioral science and economics at the University of Chicago’s Booth School of Business, who was not involved in the work. “The second contribution is more fundamental: The lack of generalization to expected problems and domains helps us get a better picture of what models are doing when they get a problem ‘right.’ It provides a test of whether the LLM ‘understands’ the problem it is addressing.”
This research was funded in part by the Harvard Data Science Initiative and the Center for Applied Artificial Intelligence at the University of Chicago Booth School of Business.