One day you might want your home robot to carry a load of dirty clothes downstairs and put them in the washing machine in the left corner of the basement. The robot will need to combine your instructions with its visual observations to determine the steps it should take to complete the task.
This is easier said than done for an AI agent. Current approaches often use multiple hand-built machine-learning models to tackle different parts of the task, which takes a great deal of human effort and expertise to assemble. These methods, which use visual representations to directly make navigation decisions, also demand vast amounts of visual data for training, which are often difficult to obtain.
To overcome these challenges, researchers at MIT and the MIT-IBM Watson AI Lab devised a navigation method that converts visual representations into parts of language, which are then fed into one large language model that fulfills all parts of a multi-step navigation task.
Rather than encoding visual elements from images of the robot’s surroundings as visual representations, which is computationally intensive, their method produces text captions that describe the robot’s point of view. A large language model uses the captions to predict the actions the robot should take to fulfill the user’s language instructions.
Because their method uses purely linguistic representations, they can use a large language model to efficiently generate huge amounts of synthetic training data.
Although this approach does not outperform techniques that use visual features, it works well in situations where there is insufficient visual data for training. The researchers found that combining their linguistic inputs with visual cues led to better navigation performance.
“By using language as the perceptual representation, our approach is more straightforward. Since all the inputs can be encoded as language, we can generate a human-understandable trajectory,” says Bowen Pan, an electrical engineering and computer science (EECS) graduate student and lead author of a paper on the approach.
Pan’s co-authors include his advisor, Aude Oliva, director of strategic industry engagement at the MIT Schwarzman College of Computing, MIT director of the MIT-IBM Watson AI Lab, and a principal investigator in the Computer Science and Artificial Intelligence Laboratory (CSAIL); Philip Isola, an associate professor of EECS and a member of CSAIL; senior author Yoon Kim, an assistant professor of EECS and a member of CSAIL; and others at the MIT-IBM Watson AI Lab and Dartmouth College. The research will be presented at the Conference of the North American Chapter of the Association for Computational Linguistics.
Solving a vision problem with language
Because large language models are the most powerful machine learning models available, researchers have sought to incorporate them into a complex task known as vision-language navigation, Pan says.
But such models only accept text-based inputs and cannot process visual data from a robot’s camera. So the team needed to find a way to use language instead.
Their technique uses a simple caption model to obtain textual descriptions of the robot’s visual observations. These captions are combined with language instructions and fed into a large language model that decides what navigation step the robot should take next.
The large language model also generates a caption of the scene the robot should see after completing that step. This is used to update the trajectory history so the robot can keep track of where it has been.
The model repeats these processes to create a trajectory that guides the robot to its goal, step by step.
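As a rough illustration only (not the authors’ actual code), the loop described above might look something like the Python sketch below, where caption_image and query_llm are hypothetical stand-ins for an off-the-shelf captioning model and a large language model, and the prompt format is an assumption:

```python
# Minimal sketch of the caption-then-plan loop: caption the current view,
# ask the LLM for the next action plus the expected next view, and append
# both to a textual trajectory history. All names here are illustrative.

def caption_image(image) -> str:
    """Hypothetical stand-in for a captioning model."""
    return "To your left is a door with a potted plant; behind you is a small office."

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a large language model call."""
    return "ACTION: move toward the door\nEXPECTED: a hallway leading to a staircase."

def navigate(instruction: str, get_observation, max_steps: int = 10):
    history = []  # textual trajectory history the LLM conditions on
    for _ in range(max_steps):
        caption = caption_image(get_observation())
        prompt = (
            f"Instruction: {instruction}\n"
            f"History: {' | '.join(history) or 'none'}\n"
            f"Current view: {caption}\n"
            "Choose the next navigation step and describe the expected next view."
        )
        reply = query_llm(prompt)
        action, expected = [line.split(":", 1)[1].strip() for line in reply.splitlines()]
        history.append(f"{action} -> {expected}")  # update the trajectory history
        if "stop" in action.lower():
            break
    return history

# Example usage with a dummy camera feed:
# navigate("Take the laundry to the washing machine in the basement", get_observation=lambda: None)
```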
To streamline the process, the researchers designed templates so observation information is presented to the model in a standard form—as a series of choices the robot can make based on its surroundings.
For example, a caption might say, “To your left is a door with a potted plant; behind you is a small office with a desk and a computer,” and so on. The model then chooses whether the robot should move toward the door or toward the office.
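One hedged sketch of such a template, purely for illustration (the real fields and wording are not specified in the article), is to render each direction’s caption as a numbered option the model can pick from:

```python
# Illustrative template: present per-direction captions in a fixed form
# so the LLM sees a standard set of choices. The format is an assumption.

def format_observation(options: dict[str, str]) -> str:
    """Render per-direction captions as a standard multiple-choice prompt."""
    lines = ["You observe the following options:"]
    for i, (direction, description) in enumerate(options.items(), start=1):
        lines.append(f"  ({i}) {direction}: {description}")
    lines.append("Reply with the number of the option to move toward.")
    return "\n".join(lines)

print(format_observation({
    "to your left": "a door with a potted plant",
    "behind you": "a small office with a desk and computer",
}))
```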
“One of the biggest challenges was figuring out how to encode this kind of information into the language in the right way so that the agent understands what the task is and how it should respond,” says Pan.
Advantages of language
When they tested this approach, it did not outperform vision-based techniques, but they found that it offered several advantages.
First, because text requires far less computation to synthesize than complex image data, their method can be used to rapidly generate synthetic training data. In one test, they generated 10,000 synthetic trajectories based on 10 real, visual trajectories.
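A minimal sketch of what text-only data synthesis could look like, assuming a hypothetical query_llm helper and a simple prompting scheme (the article does not describe the actual pipeline):

```python
# Illustrative only: expand a handful of real, text-described trajectories
# into many synthetic ones by prompting an LLM. Names and prompts are assumptions.
import random

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a large language model call."""
    return "Step 1: walk past the sofa. Step 2: turn left at the kitchen. Step 3: stop at the washer."

def synthesize_trajectories(seed_trajectories: list[str], n: int) -> list[str]:
    synthetic = []
    for _ in range(n):
        seed = random.choice(seed_trajectories)
        prompt = (
            "Here is a real navigation trajectory described in text:\n"
            f"{seed}\n"
            "Write a new, plausible trajectory in the same style but in a different home layout."
        )
        synthetic.append(query_llm(prompt))
    return synthetic

# e.g., roughly mirroring the test described above:
# data = synthesize_trajectories(ten_real_trajectories, n=10_000)
```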
This technique can also bridge a gap that can prevent an agent trained in a simulated environment from performing well in the real world. This gap often occurs because computer-generated images can appear quite different from actual scenes due to elements such as lighting or color. But language that describes a synthetic versus a real image would be much harder to distinguish, Pan says.
Also, the representations their model uses are easier for humans to understand because they are written in natural language.
“If an agent falls short of its goal, we can more easily determine where it failed and why it failed. Perhaps the information about the history is not clear enough, or the observation ignores some important details,” says Pan.
Moreover, their method could be applied more easily to different tasks and environments because it uses only one type of input. As long as data can be encoded as language, they can use the same model without any modifications.
But one drawback is that their method naturally loses some information that would be captured by vision-based models, such as depth information.
However, the researchers were surprised to see that combining language representations with vision-based methods improves the agent’s ability to navigate.
“Perhaps this means that language can capture some higher-level information that cannot be captured with pure vision features,” he says.
This is one area the researchers want to continue exploring. They also want to develop a navigation-oriented captioner that could boost the method’s performance. In addition, they want to probe the ability of large language models to exhibit spatial awareness and see how this could aid language-based navigation.
This research is partially funded by the MIT-IBM Watson AI Lab.