Researchers at the Massachusetts Institute of Technology have created a tool called a “semantic parser” that mimics the way children learn a language. The semantic parser does what children typically do: it observes its environment and listens to speakers, whether or not it actually understands what they are saying. The system observes captioned videos and associates the words that speakers say with the recorded objects and actions.
Ideally, the system will help change how semantic parsers work. Children typically learn by observing people speaking and by interacting with their environment. Semantic parsers, on the other hand, ‘learn’ from annotated sentences or direct answers to questions. With this new system, the parser could use visual observations of its environment to reinforce its understanding of a verbal command, even when the command isn’t clear.
“People talk to each other in partial sentences, run-on thoughts, and jumbled language,” Andrei Barbu, a researcher at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Center for Brains, Minds, and Machines (CBMM), told ZDNet. “You want a robot in your home that will adapt to their particular way of speaking… and still figure out what they mean.”
The parser could also give researchers insight into how young children learn language. “A child has access to redundant, complementary information from different modalities, including hearing parents and siblings talk about the world, as well as tactile information and visual information, [which help him or her] to understand the world,” says co-author Boris Katz, a principal research scientist and head of the InfoLab Group at CSAIL. “It’s an amazing puzzle, to process all this simultaneous sensory input. This work is part of a bigger piece to understand how this kind of learning happens in the world.”
The new parser is the first to be trained using video. When the parser is unsure about what is being said to it, it can use visual cues to resolve the ambiguity. The researchers compiled a collection of over 400 videos depicting people performing various activities, such as picking up an object, putting an object down, or walking toward an object. The parser is then tasked with captioning each video.
Similar to how children learn, the parser learns through passive observation. To determine whether a caption is true of a video, the parser must first identify the most probable meaning of the caption. “The only way to figure out if the sentence is true of a video [is] to go through this intermediate step of, ‘What does the sentence mean?’ Otherwise, you have no idea how to connect the two,” Barbu explained to Science Daily. “We don’t give the system the meaning for the sentence. We say, ‘There’s a sentence and a video. The sentence has to be true of the video. Figure out some intermediate representation that makes it true of the video.'”
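That intermediate step can be illustrated with a small sketch. The toy Python example below is not the MIT system; the candidate meanings, the scoring function, and the video features are hypothetical stand-ins. It only shows the general flavor of this kind of weak supervision: the parser is never told what a caption means, only that the caption is true of the video, so it enumerates candidate meanings and keeps the one the video supports best.

```python
# Toy sketch of weakly supervised caption-video grounding.
# Hypothetical stand-ins: candidate_meanings(), score_against_video(), and the
# video "tracks" are invented for illustration; the real parser's meaning
# representations and scoring are far richer.

from dataclasses import dataclass


@dataclass
class Meaning:
    action: str   # e.g. "pick_up"
    agent: str    # e.g. "person"
    patient: str  # e.g. "object"


def candidate_meanings(sentence: str) -> list[Meaning]:
    """Enumerate plausible meaning representations for a caption.
    A real parser would build these compositionally from the words."""
    actions = ["pick_up", "put_down", "approach"]
    return [Meaning(action=a, agent="person", patient="object") for a in actions]


def score_against_video(meaning: Meaning, video_tracks: dict) -> float:
    """Score how well a candidate meaning matches what was detected in the
    video. Here video_tracks is a toy mapping of action labels to confidences."""
    return video_tracks.get(meaning.action, 0.0)


def interpret(sentence: str, video_tracks: dict) -> Meaning:
    """The only supervision is 'this sentence is true of this video', so the
    parser picks the candidate meaning the video supports best."""
    candidates = candidate_meanings(sentence)
    return max(candidates, key=lambda m: score_against_video(m, video_tracks))


if __name__ == "__main__":
    # Pretend an action recognizer saw a likely "pick up" event in the clip.
    tracks = {"pick_up": 0.9, "put_down": 0.1, "approach": 0.3}
    print(interpret("The woman picks up the cup", tracks))
    # Meaning(action='pick_up', agent='person', patient='object')
```

In this sketch, no meaning is ever given as a training label; the video acts as the check that forces the parser to commit to one interpretation, which mirrors the intermediate-representation idea Barbu describes.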
“Children interact with the environment as they’re learning. Our idea is to have a model that would also use perception to learn,” says co-author Candace Ross.