Large language models such as those powering ChatGPT have demonstrated impressive performance on tasks such as drafting legal documents, analyzing the sentiment of customer reviews, and translating documents into different languages.
These machine learning models typically use only natural language to process information and answer questions, which can make it difficult for them to perform tasks that require numerical or symbolic reasoning.
For example, a large language model might be able to memorize and recite a list of recent US presidents and their birthdays, but the same model might fail if asked the question “Which US presidents elected after 1950 were born on a Wednesday?” (The answer is Jimmy Carter.)
Researchers at MIT and elsewhere have proposed a new technique that allows large language models to solve natural language, mathematical and data analysis, and symbolic reasoning tasks by generating programs.
Their approach, called Natural Language Embedded Programs (NLEP), involves prompting a language model to create and run a Python program to solve a user’s query and then output the solution as natural language.
They found that NLEPs enabled large language models to achieve higher accuracy on a wide variety of reasoning tasks. The approach is also generalizable, meaning that a single NLEP prompt can be reused for multiple tasks.
NLEPs also improve transparency, because the user can inspect the program to see exactly how the model reasoned about the query and fix the program if the model gave a wrong answer.
“We want AI to perform complex reasoning in a way that is transparent and trustworthy. There is still a long way to go, but we have shown that the combination of programming and natural language capabilities in large language models is a very good potential first step toward a future where people can fully understand and trust what is happening inside their AI models,” says Hongyin Luo PhD ’22, a postdoc at MIT and co-author of the NLEP paper.
Luo was joined on the paper by co-authors Tianhua Zhang, a graduate student at the Chinese University of Hong Kong; Jiaxin Ge, an undergraduate at Peking University; Yoon Kim, an assistant professor in MIT’s Department of Electrical Engineering and Computer Science and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); senior author James Glass, a senior research scientist and head of the Spoken Language Systems Group at CSAIL; and others. The research will be presented at the annual conference of the North American Chapter of the Association for Computational Linguistics.
Problem solving using programs
Many popular large language models work by predicting the next word or token given some natural language input. While models like GPT-4 can be used to write programs, those programs are embedded within natural language, which can lead to errors in the program’s reasoning or results.
With NLEPs, the MIT researchers take the opposite approach. They prompt the model to generate a step-by-step program entirely in Python code, and then embed the necessary natural language inside the program.
NLEP is a four-step problem-solving template. First, the model calls the necessary packages or functions that it will need to solve the task. The second step involves importing natural language representations of the knowledge that the task requires (such as a list of US presidents’ birthdays). For step three, the model implements a function that calculates the answer. And in the final step, the model outputs the result as a string of natural language with automatic data visualization if needed.
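The four steps above can be sketched with the article’s running example of presidents’ birthdays. This is an illustrative mock-up of what a generated program might look like, not the paper’s actual output; the birthday data is an illustrative subset of presidents elected after 1950.

```python
# Step 1: call the packages the task needs (here, Python's datetime).
from datetime import date

# Step 2: embed a natural-language representation of the required
# knowledge as structured data (illustrative subset; birthdays are
# public record).
presidents = {
    "Dwight D. Eisenhower": date(1890, 10, 14),
    "John F. Kennedy": date(1917, 5, 29),
    "Jimmy Carter": date(1924, 10, 1),
    "Ronald Reagan": date(1911, 2, 6),
}

# Step 3: implement a function that calculates the answer.
def born_on_wednesday(people):
    # date.weekday() returns 2 for Wednesday (Monday is 0)
    return [name for name, bday in people.items() if bday.weekday() == 2]

# Step 4: output the result as a string of natural language.
answer = born_on_wednesday(presidents)
print(f"Presidents born on a Wednesday: {', '.join(answer)}")
```

Running this prints that Jimmy Carter was born on a Wednesday, matching the example earlier in the article.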
“It’s like a digital calculator that will always give you the correct calculation result if the program is correct,” says Luo.
The user can easily examine the program and correct any errors in the code directly without having to run the entire model again for troubleshooting.
This approach also offers higher efficiency than some other methods. If a user has many similar questions, they can generate one core program and then replace certain variables without having to run the model repeatedly.
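To illustrate the variable-swapping idea, here is a hedged sketch: the weekday of interest becomes a parameter the user can change to ask a new, similar question without invoking the language model again. The data and function names are hypothetical.

```python
from datetime import date

# Illustrative data; birthdays are public record.
birthdays = {
    "Jimmy Carter": date(1924, 10, 1),
    "Ronald Reagan": date(1911, 2, 6),
}

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def born_on(people, weekday_name):
    """Return everyone in `people` born on the named weekday."""
    idx = WEEKDAYS.index(weekday_name)
    return [name for name, bday in people.items() if bday.weekday() == idx]

# Swap the weekday variable to reuse the same program for a new query.
print(born_on(birthdays, "Wednesday"))  # ['Jimmy Carter']
print(born_on(birthdays, "Monday"))
```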
To get the model to generate an NLEP, the researchers give it a general instruction to write a Python program, providing two NLEP examples (one with math and one with natural language) and one test question.
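That prompting setup can be sketched as follows. The instruction wording and the two exemplars here are invented for illustration; the paper’s actual prompt may differ.

```python
# Hypothetical general instruction (one instruction reused across tasks).
INSTRUCTION = "Write a step-by-step Python program that prints the answer."

# Hypothetical math exemplar.
MATH_EXAMPLE = (
    "Q: What is the sum of the first 10 positive integers?\n"
    "# Program:\n"
    "print(sum(range(1, 11)))\n"
)

# Hypothetical natural-language exemplar.
LANGUAGE_EXAMPLE = (
    "Q: Is the review 'Great product!' positive or negative?\n"
    "# Program:\n"
    "review = 'Great product!'\n"
    "print('positive' if 'great' in review.lower() else 'negative')\n"
)

def build_nlep_prompt(question):
    """Combine the general instruction, two exemplars, and the test question."""
    return "\n".join(
        [INSTRUCTION, MATH_EXAMPLE, LANGUAGE_EXAMPLE, f"Q: {question}\n# Program:"]
    )

prompt = build_nlep_prompt(
    "Which US presidents elected after 1950 were born on a Wednesday?"
)
print(prompt)
```

The model’s completion of such a prompt would be the Python program itself, which is then executed to produce the answer.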
“Usually, when people use these kinds of few-shot prompting methods, they still have to design prompts for every task. We found that we can have one prompt for many tasks because it’s not a prompt that teaches LLMs to solve one problem, but a prompt that teaches LLMs to solve many problems by writing a program,” says Luo.
“Having language models understand code unlocks many opportunities for tooling, output validation, more structured understanding of model capabilities and thinking, and more,” says Leonid Karlinsky, principal scientist of the MIT-IBM Watson AI Lab.
“No Magic Here”
NLEPs achieved more than 90 percent accuracy when prompting GPT-4 to solve a range of symbolic reasoning tasks, such as tracking shuffled objects or playing the game of 24, as well as instruction-following and text-classification tasks. The researchers found that NLEPs even showed 30 percent greater accuracy than task-specific prompting methods. The method also showed improvements over open-source LLMs.
Along with increasing the accuracy of large language models, NLEPs could also improve data privacy. Since NLEP programs are run locally, sensitive user data does not need to be sent to a company like OpenAI or Google for the model to process.
In addition, NLEPs can allow small language models to perform better without having to retrain the model for a particular task, which can be a costly process.
“There is no magic here. We don’t have a more expensive or fancy language model. All we do is use program generation instead of natural language generation, and we can improve it significantly,” says Luo.
However, NLEPs rely on a model’s ability to generate a program, so the technique does not work as well for smaller models that have been trained on limited data sets. In the future, the researchers plan to study methods that could make smaller language models generate more effective NLEPs. In addition, they want to investigate the impact of prompt variations on NLEPs to increase the robustness of the model’s reasoning processes.
This research was supported in part by the Center for Perceptual and Interactive Intelligence in Hong Kong.