Mark Hamilton, an MIT PhD student in electrical engineering and computer science and an affiliate of MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), wants to use machines to understand how animals communicate. To achieve this, he first decided to create a system that could learn human speech “from scratch”.
“It’s funny, a key moment of inspiration came from the movie ‘March of the Penguins.’ There’s a scene where a penguin falls while crossing the ice and lets out a labored groan as it gets back up. When you watch it, it’s almost obvious that this groan is standing in for a four-letter word. That was the moment we thought maybe we needed to use audio and video to learn language,” says Hamilton. “Is there a way we could have an algorithm watch TV all day and figure out what we’re talking about?”
“Our ‘DenseAV’ model aims to learn language by predicting what it sees from what it hears and vice versa. For example, if you hear the sound of someone saying ‘bake a cake at 350,’ chances are you’re seeing a cake or an oven. To succeed in this audio-video matching game with millions of videos, a model has to learn what people are talking about,” says Hamilton.
Once they trained DenseAV on this matching game, Hamilton and his colleagues looked at which pixels the model focused on when it heard a sound. For example, when someone says “dog,” the algorithm immediately starts looking for dogs in the video stream. By seeing which pixels the algorithm selects, you can tell what it thinks a word means.
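To make the pixel-selection idea concrete, here is a minimal sketch (not DenseAV’s released code) of how such a heatmap can be computed from per-time-step audio features and per-location visual features; `audio_encoder` and `image_encoder` in the usage comment are hypothetical stand-ins for the model’s two branches.

```python
import torch
import torch.nn.functional as F

def similarity_heatmap(audio_feats, image_feats):
    """audio_feats: (T, D) per-time-step audio features.
    image_feats: (H, W, D) per-location visual features.
    Returns an (H, W) map of how strongly each location matches the audio."""
    a = F.normalize(audio_feats, dim=-1)        # (T, D)
    v = F.normalize(image_feats, dim=-1)        # (H, W, D)
    sims = torch.einsum("td,hwd->thw", a, v)    # similarity of every time step to every location
    return sims.max(dim=0).values               # strongest response at each location

# Example usage (hypothetical encoders), upsampling the coarse map for display:
# heatmap = similarity_heatmap(audio_encoder(clip), image_encoder(frame))
# heatmap = F.interpolate(heatmap[None, None], size=frame.shape[-2:], mode="bilinear")[0, 0]
```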
Interestingly, a similar search process occurs when DenseAV listens to a dog barking: It searches for a dog in the video stream. “That piqued our interest. We wanted to see if the algorithm knew the difference between the word ‘dog’ and a dog barking,” says Hamilton. The team explored this by giving DenseAV a “bilateral brain.” They found that one side of DenseAV’s brain naturally focused on language, like the word “dog,” while the other side focused on sounds like barking. This showed that DenseAV not only learned the meaning of words and the locations of sounds, but also learned to distinguish between these two kinds of cross-modal connections, all without human supervision or any knowledge of written language.
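One way to picture this “bilateral brain” is as a similarity score split across two heads, so that one head is free to specialize in spoken words and the other in ambient sounds. The sketch below illustrates that concept under the assumption of a simple two-way split; it is not the paper’s exact implementation.

```python
import torch
import torch.nn.functional as F

def two_head_similarity(audio_feat, image_feat):
    """audio_feat, image_feat: (D,) feature vectors with an even D.
    Splits each vector into two heads and scores them separately; training
    on the summed score leaves room for each head to specialize."""
    a1, a2 = audio_feat.chunk(2)
    v1, v2 = image_feat.chunk(2)
    s1 = F.cosine_similarity(a1, v1, dim=0)   # score from the first head
    s2 = F.cosine_similarity(a2, v2, dim=0)   # score from the second head
    return s1, s2, s1 + s2                    # train on the sum; specialization can emerge per head
```

Inspecting `s1` and `s2` separately is the kind of probe that reveals whether one head responds to words and the other to sounds.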
One branch of applications is learning from the vast amount of video published on the Internet every day: “We want systems that can learn from the huge amount of video content, such as instructional videos,” says Hamilton. “Another exciting application is understanding new languages, such as the communication of dolphins or whales, which have no written form. We hope DenseAV can help us understand these languages, which have eluded human translation efforts from the start. Finally, we hope this method can be used to reveal patterns between other pairs of signals, such as the seismic sounds the Earth makes and its geology.”
The team faced a formidable challenge: learning a language without any text input. Their goal was to rediscover the meaning of language from a blank slate, avoiding pre-trained language models. This approach is inspired by how children learn language by observing and listening to the world around them.
To achieve this, DenseAV uses two main components that process audio and video data separately. This separation made it impossible for the algorithm to cheat by letting the visual branch look at the audio and vice versa, forcing it to recognize objects and to build detailed, meaningful features for both audio and visual signals. DenseAV learns by comparing pairs of audio and visual signals to determine which match and which don’t. This method, called contrastive learning, requires no labeled examples and allows DenseAV to discover the important predictive patterns of language on its own.
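As a rough illustration of the contrastive setup described above (a generic audio-visual contrastive loss, not necessarily DenseAV’s exact objective), paired clips are pushed to score higher than mismatched ones:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """audio_emb, video_emb: (B, D) embeddings for a batch of B paired clips.
    The i-th audio belongs with the i-th video; every other pairing is a negative."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(a.shape[0])            # matching pairs lie on the diagonal
    # Symmetric cross-entropy: audio-to-video and video-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

No labels are involved: the only supervision is which audio track came with which video.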
One of the main differences between DenseAV and previous algorithms is that earlier work focused on a single notion of similarity between audio and image: an entire sound clip, like someone saying “the dog sat on the grass,” was matched to an entire image of a dog. This prevented previous methods from discovering fine-grained details, such as the connection between the word “grass” and the grass under the dog. The team’s algorithm instead searches over and aggregates all possible matches between an audio clip and the pixels of an image. This not only improved performance, but allowed the team to localize sounds precisely in a way that previous algorithms could not. “Conventional methods use a single class token, but our approach compares every pixel and every second of audio. This fine-grained method allows DenseAV to create more detailed connections for better localization,” says Hamilton.
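A hedged sketch of what “comparing every pixel and every second of audio” can look like in practice (one plausible aggregation, not necessarily the paper’s exact choice): score all audio-time-step and image-location pairs, then pool them into a single clip-level similarity that can feed a contrastive loss like the one above.

```python
import torch
import torch.nn.functional as F

def dense_clip_similarity(audio_feats, image_feats):
    """audio_feats: (T, D); image_feats: (H, W, D).
    Rather than comparing one global token per modality, take the best-matching
    image location for each moment of audio and average those matches over time."""
    a = F.normalize(audio_feats, dim=-1)
    v = F.normalize(image_feats, dim=-1).flatten(0, 1)   # (H*W, D)
    sims = a @ v.t()                                     # (T, H*W) all pairwise scores
    return sims.max(dim=1).values.mean()                 # max over space, mean over time
```

Because the pooling keeps track of where each moment of audio matched best, the same pairwise scores double as localization maps.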
The researchers trained DenseAV on AudioSet, which contains 2 million YouTube videos. They also created new datasets to test how well the model could match sounds and images. In these tests, DenseAV outperformed other top models in tasks such as identifying objects by their names and sounds, proving its effectiveness. “Previous datasets only supported coarse evaluation, so we created a dataset using semantic segmentation datasets. This provides pixel-perfect annotations to accurately evaluate our model’s performance. We can prompt the algorithm with specific sounds or images and get those detailed localizations,” says Hamilton.
Due to the huge amount of data involved, the project took approximately a year to complete. The team says that moving to a large transformer architecture presented challenges, because these models can easily overlook fine-grained details. Encouraging the model to focus on these details was a significant hurdle.
Going forward, the team’s goal is to create systems that can learn from vast amounts of image-only or audio-only data. This is crucial for new domains where either modality is plentiful on its own, but paired data is scarce. They also aim to scale the approach with larger backbone networks and eventually to integrate knowledge from language models to improve performance.
“The recognition and segmentation of visual objects in images, as well as of ambient sounds and spoken words in audio recordings, are each difficult problems in their own right. Historically, researchers have relied on expensive, human-provided annotations to train machine learning models to accomplish these tasks,” says David Harwath, assistant professor of computer science at the University of Texas at Austin, who was not involved in the work. “DenseAV is making significant progress toward developing methods that can learn to solve these tasks simultaneously, just by observing the world through sight and sound, based on the insight that the things we see and interact with often make sound, and that we also use spoken language to talk about them. This model also makes no assumptions about the specific language being spoken and could therefore, in principle, learn from data in any language. It would be exciting to see what DenseAV could learn by scaling it up to thousands or millions of hours of video data across many languages.”
Other authors of the paper describing the work are Andrew Zisserman, professor of computer vision engineering at the University of Oxford; John R. Hershey, a researcher at Google AI Perception; and William T. Freeman, professor of electrical engineering and computer science at MIT and a CSAIL principal investigator. Their research was supported, in part, by the U.S. National Science Foundation, a Royal Society Research Professorship, and the EPSRC Visual AI programme grant. The work will be presented at the IEEE/CVF Computer Vision and Pattern Recognition Conference this month.