As AI models become more prevalent and are integrated into sectors such as healthcare, finance, education, transportation, and entertainment, it is crucial to understand how they work under the hood. Interpreting the mechanisms underlying AI models lets us audit them for safety and bias, and has the potential to deepen our understanding of the science behind intelligence itself.
Imagine if we could directly study the human brain by manipulating each of its individual neurons to examine their roles in perceiving a particular object. While such an experiment would be prohibitively invasive in a human brain, it is more feasible in another type of neural network: an artificial one. However, somewhat like the human brain, artificial models containing millions of neurons are too large and complex to study by hand, which makes interpretability at scale a very challenging task.
To tackle this problem, researchers at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) turned to an automated approach for interpreting artificial vision models that evaluate different properties of images. They developed “MAIA” (Multimodal Automated Interpretability Agent), a system that automates a variety of neural network interpretability tasks using a vision-language model backbone equipped with tools for experimenting on other AI systems.
“Our goal is to create an AI researcher that can conduct interpretability experiments autonomously. Existing automated interpretability methods merely label or visualize data in a one-shot process. MAIA, on the other hand, can generate hypotheses, design experiments to test them, and refine its understanding through iterative analysis,” says Tamar Rott Shaham, an MIT electrical engineering and computer science (EECS) postdoc at CSAIL and co-author of the new research paper. “By combining a pretrained vision-language model with a library of interpretability tools, our multimodal method can respond to user queries by composing and running targeted experiments on specific models, continuously refining its approach until it can provide a comprehensive answer.”
The automated agent is demonstrated on three key tasks: labeling individual components inside vision models and describing the visual concepts that activate them, cleaning up image classifiers by removing irrelevant features to make them more robust to new situations, and hunting for hidden biases in AI systems to help uncover potential fairness issues in their outputs. “But a key advantage of a system like MAIA is its flexibility,” says Sarah Schwettmann PhD ’21, a research scientist at CSAIL and co-lead of the research. “We demonstrated MAIA’s usefulness on a few specific tasks, but because the system is built on a core model with broad reasoning capabilities, it can answer many different types of interpretability queries from users and design experiments on the fly to explore them.”
Neuron by neuron
In one example task, a human user asks MAIA to describe the concepts that a particular neuron inside a vision model is responsible for detecting. To investigate the question, MAIA first uses a tool that retrieves “dataset exemplars” from the ImageNet dataset that maximally activate the neuron. For this example neuron, those images show people in formal attire and close-ups of their chins and necks. MAIA makes different hypotheses about what drives the neuron’s activity: facial expressions, chins, or neckties. MAIA then uses its tools to design experiments that test each hypothesis individually by generating and editing synthetic images; in one experiment, adding a bow tie to an image of a human face increases the neuron’s response. “This approach allows us to determine the specific cause of the neuron’s activity, much like a real scientific experiment,” says Rott Shaham.
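To give a concrete sense of this kind of experiment, here is a minimal sketch in PyTorch that measures one unit’s activation before and after an image edit. The model choice, layer name, unit index, image paths, and helper names are hypothetical stand-ins for illustration, not MAIA’s actual tooling.

```python
# Minimal sketch of the kind of neuron experiment described above, assuming a
# standard torchvision model. Layer/unit indices and file paths are placeholders.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.resnet50(weights="IMAGENET1K_V2").eval()
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def get_activation(img: Image.Image, layer_name: str = "layer4", unit: int = 7) -> float:
    """Return the mean activation of one unit in the named layer (hypothetical indices)."""
    captured = {}
    layer = dict(model.named_modules())[layer_name]
    handle = layer.register_forward_hook(lambda mod, inp, out: captured.update(out=out))
    with torch.no_grad():
        model(preprocess(img).unsqueeze(0))
    handle.remove()
    return captured["out"][0, unit].mean().item()

# Hypothesis test: does adding a bow tie raise the unit's response?
face = Image.open("face.jpg").convert("RGB")                   # placeholder image
face_with_tie = Image.open("face_bowtie.jpg").convert("RGB")   # edited version (e.g., from an image-editing model)
print(f"without tie: {get_activation(face):.3f}, with tie: {get_activation(face_with_tie):.3f}")
```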
MAIA’s explanations of neuron behavior are evaluated in two key ways. First, synthetic systems with known ground-truth behaviors are used to assess the accuracy of MAIA’s interpretations. Second, for “real” neurons inside trained AI systems, the authors propose a new automated evaluation protocol that measures how well MAIA’s descriptions predict neuron behavior on unseen data.
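As a rough illustration of what it can mean for a description to “predict neuron behavior,” the sketch below scores a text description by how strongly its CLIP similarity to held-out images correlates with the neuron’s measured activations (for instance, gathered with a hook like the one sketched earlier). This is a stand-in for the idea, not the evaluation protocol from the paper.

```python
# Illustrative only: correlate description-image similarity with true activations.
import torch
from transformers import CLIPModel, CLIPProcessor
from scipy.stats import spearmanr

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def predictiveness(description: str, images, activations) -> float:
    """Spearman correlation between description-image similarity and measured activations."""
    inputs = proc(text=[description], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    sims = out.logits_per_image.squeeze(-1)        # one similarity score per held-out image
    rho, _ = spearmanr(sims.numpy(), activations)  # higher rho = description predicts behavior better
    return rho
```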
The CSAIL-led method outperformed baseline methods at describing individual neurons in a variety of vision models, such as ResNet, CLIP, and the DINO vision transformer. MAIA also performed well on a new dataset of synthetic neurons with known ground-truth descriptions. For both the real and synthetic systems, its descriptions were often on par with those written by human experts.
How useful are descriptions of the components of an AI system, such as individual neurons? “Understanding and localizing behaviors inside large AI systems is a key part of auditing their safety before they are deployed; in some of our experiments we show how MAIA can be used to find neurons with unwanted behaviors and remove those behaviors from a model,” says Schwettmann. “We’re building toward a more resilient AI ecosystem in which tools for understanding and monitoring AI systems keep pace with system scaling, allowing us to investigate and hopefully understand the unforeseen challenges that new models bring.”
A look into neural networks
The burgeoning field of interpretability is maturing into a distinct research area alongside the rise of “black box” machine learning models. How can researchers crack open these models and understand how they work?
Current methods for peering inside tend to be limited either in the scale or in the precision of the explanations they can produce. Moreover, existing methods tend to be tailored to a specific model and a specific task. This led the researchers to ask: How can we build a general-purpose system that helps users answer interpretability questions about AI models, while combining the flexibility of human experimentation with the scalability of automated techniques?
One critical area the system sought to address is bias. To determine whether image classifiers exhibit bias toward particular subcategories of images, the team looked at the final layer of the classification stream (in a system designed to sort or label items, much like a machine that identifies whether a photo is of a dog, cat, or bird) and at the probability scores of input images (the confidence levels the system assigns to its guesses). To understand potential biases in image classification, MAIA was asked to find a subset of images in specific classes (for example, “Labrador retriever”) that were likely to be mislabeled by the system. In this example, MAIA found that images of black Labradors were likely to be misclassified, suggesting a bias in the model toward yellow-furred retrievers.
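For intuition, here is a minimal sketch of this kind of bias probe: run a classifier over a batch of images that all belong to one class and flag the ones whose final-layer confidence for that class is low. The class index, threshold, and inputs are hypothetical, and this is not MAIA’s actual tooling.

```python
# Sketch: flag images in a single known class that the classifier likely mislabels,
# using the last layer's softmax confidence for that class.
import torch
import torch.nn.functional as F

def find_likely_misclassified(model, images: torch.Tensor, true_class: int, threshold: float = 0.5):
    """Return indices and scores of images whose confidence in the true class falls below threshold."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(images), dim=-1)   # final-layer probability scores
    scores = probs[:, true_class]
    flagged = (scores < threshold).nonzero(as_tuple=True)[0]
    return flagged.tolist(), scores.tolist()

# Usage idea: inspect the flagged subset (e.g., black vs. yellow Labradors) for systematic patterns.
```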
Because MAIA relies on outside tools to design its experiments, its performance is capped by the quality of those tools; as tools such as image synthesis models improve, so will MAIA. MAIA also occasionally exhibits confirmation bias, sometimes incorrectly confirming its initial hypothesis. To mitigate this, the researchers built an image-to-text tool that uses a separate instance of the language model to summarize experimental results. Another failure mode is overfitting to a particular experiment, where the model sometimes draws premature conclusions from minimal evidence.
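One way to picture that mitigation is the sketch below, in which a captioning step describes each experimental result before a fresh language-model instance judges the hypothesis. The helpers `caption_image` and `ask_llm` are hypothetical placeholders; the actual system’s interfaces may differ.

```python
# Hedged sketch of the mitigation described above: neutral captions of the results
# are produced first, so the judging model never sees hypothesis-laden imagery directly.
def summarize_then_judge(hypothesis: str, result_images, caption_image, ask_llm) -> str:
    captions = [caption_image(img) for img in result_images]   # neutral descriptions of outcomes
    prompt = (
        "Here are neutral descriptions of experimental results:\n"
        + "\n".join(f"- {c}" for c in captions)
        + f"\n\nDo these results support the hypothesis: '{hypothesis}'? Answer and explain."
    )
    return ask_llm(prompt)   # separate model instance from the one that formed the hypothesis
```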
“I think a natural next step for our lab is to move beyond artificial systems and apply similar experiments to human perception,” says Rott Shaham. “Testing this has traditionally required designing and evaluating stimuli by hand, which is labor-intensive. With our agent, we can scale up the process, designing and testing many stimuli simultaneously. This may also allow us to compare human visual perception with artificial systems.”
“Understanding neural networks is difficult for humans because they have millions upon millions of neurons, each with complex patterns of behavior. MAIA helps bridge this gap by developing AI agents that can automatically analyze these neurons and report distilled findings back to humans in a digestible way,” says Jacob Steinhardt, an assistant professor at the University of California at Berkeley, who was not involved in the research. “Scaling these methods up could be one of the most important routes to understanding and safely overseeing AI systems.”
Rott Shaham and Schwettmann are joined on the paper by five CSAIL affiliates: undergraduate student Franklin Wang; incoming MIT student Achyuta Rajaram; EECS PhD student Evan Hernandez SM ’22; and EECS professors Jacob Andreas and Antonio Torralba. Their work was supported, in part, by the MIT-IBM Watson AI Lab, Open Philanthropy, Hyundai Motor Co., the Army Research Laboratory, Intel, the National Science Foundation, the Zuckerman STEM Leadership Program, and the Viterbi Fellowship. The researchers’ findings will be presented this week at the International Conference on Machine Learning.