Inspired by how toddlers learn, researchers have created an AI model that integrates vision, proprioception, and language to develop compositionality—offering a unique window into human cognition and advancing AI transparency.
Research: Development of compositionality through interactive learning of language and action of robots.
We humans excel at generalization. If you teach a toddler to identify the color red by showing her a red ball, a red truck, and a red rose, she will most likely correctly identify the color of a tomato, even if she is seeing one for the first time.
Compositionality—the ability to compose and decompose a whole into reusable parts, like an object's redness—is a significant milestone in learning to generalize. How we acquire this ability is a key question in developmental neuroscience and AI research.
The earliest neural networks, which later evolved into the large language models (LLMs) now revolutionizing our society, were developed to study how information is processed in our brains. Ironically, as these models became more sophisticated, the information-processing pathways within them also became increasingly opaque, with some models today having trillions of tunable parameters.
But now, members of the Cognitive Neurorobotics Research Unit at the Okinawa Institute of Science and Technology (OIST) have created an embodied intelligence model with a novel architecture that gives researchers access to the neural network's internal states, and that appears to learn to generalize in the same ways that children do. Their findings have now been published in Science Robotics. "This paper demonstrates a possible mechanism for neural networks to achieve compositionality," says Dr. Prasanna Vijayaraghavan, first author of the study. "Our model achieves this not by inference based on vast datasets, but by combining language with vision, proprioception, working memory, and attention – just like toddlers do."
Perfectly imperfect
LLMs, built on a transformer architecture, learn the statistical relationships between words in sentences from vast amounts of text data. They essentially have access to every word in every conceivable context, and from this understanding they predict the most probable answer to a given prompt. By contrast, the new model is based on a PV-RNN (Predictive-coding-inspired Variational Recurrent Neural Network) framework, trained through embodied interactions that integrate three simultaneous inputs related to different senses: vision, in the form of a video of a robot arm moving colored blocks; proprioception, the sense of our limbs' movement, in the form of the joint angles of the robot arm as it moves; and a language instruction such as "put red on blue." The model is then tasked with generating either a visual prediction and corresponding joint angles in response to a language instruction, or a language instruction in response to sensory input.
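To make that architecture concrete, here is a minimal sketch (in PyTorch) of a variational recurrent cell with three modality read-outs, loosely in the spirit of the PV-RNN described above. All class names, layer sizes, and the single-layer design are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class TinyPVRNNCell(nn.Module):
    """One-layer variational recurrent cell with vision, proprioception,
    and language decoders. All dimensions are arbitrary placeholders."""

    def __init__(self, h_dim=64, z_dim=8, vis_dim=32, prop_dim=7, lang_dim=16):
        super().__init__()
        self.prior = nn.Linear(h_dim, 2 * z_dim)       # p(z_t | d_{t-1}): top-down prediction
        self.posterior = nn.Linear(h_dim + vis_dim + prop_dim, 2 * z_dim)  # q(z_t | d_{t-1}, x_t)
        self.rnn = nn.GRUCell(z_dim, h_dim)            # deterministic recurrence d_t
        self.dec_vision = nn.Linear(h_dim, vis_dim)    # predicted visual features
        self.dec_proprio = nn.Linear(h_dim, prop_dim)  # predicted joint angles
        self.dec_lang = nn.Linear(h_dim, lang_dim)     # predicted language-token logits

    def step(self, d, vis=None, prop=None):
        mu_p, logv_p = self.prior(d).chunk(2, dim=-1)
        if vis is not None:   # training: infer z_t from the observations
            mu_q, logv_q = self.posterior(torch.cat([d, vis, prop], dim=-1)).chunk(2, dim=-1)
        else:                 # generation: fall back to the learned prior
            mu_q, logv_q = mu_p, logv_p
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logv_q).exp()  # reparameterization trick
        d = self.rnn(z, d)
        # KL(q || p) between the two diagonal Gaussians at this timestep
        kl = 0.5 * (logv_p - logv_q
                    + (logv_q.exp() + (mu_q - mu_p) ** 2) / logv_p.exp()
                    - 1).sum(dim=-1)
        return d, self.dec_vision(d), self.dec_proprio(d), self.dec_lang(d), kl
```

A training step would sum the prediction errors of the three decoders against the observed video frames, joint angles, and instruction tokens, plus a weighted KL term; in earlier PV-RNN work this weight, the so-called meta-prior, sets how strongly the network trusts its own predictions over incoming evidence.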
The system is inspired by the Free Energy Principle, which suggests that our brain continuously predicts sensory inputs based on past experiences and takes action to minimize the difference between prediction and observation. This difference, quantified as 'free energy,' is a measure of uncertainty, and by reducing free energy our brain maintains a stable state. Equipped with limited working memory and a limited attention span, the model mirrors human cognitive constraints, which force it to process input and update its predictions in sequence rather than all at once, as LLMs do. By studying the flow of information within the model, researchers can gain insight into how it integrates the various inputs to generate its simulated actions.
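In variational treatments of the Free Energy Principle, the quantity being minimized is commonly written as follows (this is the generic textbook decomposition, not a formula taken from the paper):

$$
\mathcal{F} \;=\; \underbrace{\mathbb{E}_{q(z)}\big[-\log p(x \mid z)\big]}_{\text{prediction error}} \;+\; \underbrace{D_{\mathrm{KL}}\big(q(z)\,\|\,p(z)\big)}_{\text{complexity}}
$$

Here $x$ is the sensory observation, $p(z)$ is the brain's (or network's) prior prediction of its latent state, and $q(z)$ is its updated belief after seeing $x$. Driving the first term down improves sensory predictions; driving the second down keeps beliefs close to prior predictions, which is exactly the role the KL term plays in the sketch above.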
Thanks to this modular architecture, the researchers have learned more about how infants may develop compositionality. As Dr. Vijayaraghavan recounts, "We found that the more exposure the model has to the same word in different contexts, the better it learns that word. This mirrors real life, where a toddler will learn the concept of the color red much faster if she's interacted with various red objects in different ways, rather than just pushing a red truck on multiple occasions."
Opening the black box
"Our model requires a significantly smaller training set and much less computing power to achieve compositionality. It does make more mistakes than LLMs do, but it makes mistakes that are similar to how humans make mistakes," says Dr. Vijayaraghavan. It is precisely this feature that makes the model so useful to cognitive scientists, as well as to AI researchers trying to map the decision-making processes of their models. While it serves a different purpose than the LLMs currently in use and, therefore, cannot be meaningfully compared on effectiveness, the PV-RNN nevertheless shows how neural networks can be organized to offer greater insight into their information processing pathways: its relatively shallow architecture allows researchers to visualize the network's latent state – the evolving internal representation of the information retained from the past and used in present predictions.
The model also addresses the Poverty of Stimulus problem, which posits that the linguistic input available to children is insufficient to explain their rapid language acquisition. Despite having a very limited dataset, especially compared to LLMs, the model still achieves compositionality, suggesting that grounding language in behavior may be an essential catalyst for children's impressive language learning ability.
This embodied learning could, moreover, lead to safer and more ethical AI in the future, both by improving transparency and by helping AI better understand the effects of its actions. The word 'suffering,' learned from a purely linguistic perspective as LLMs learn it, would carry less weight than it does for a PV-RNN, which grounds its meaning in embodied experience as well as language.
"We are continuing our work to enhance the capabilities of this model and are using it to explore various domains of developmental neuroscience. We are excited to see what future insights into cognitive development and language learning processes we can uncover," says Professor Jun Tani, head of the research unit and senior author. How we acquire the intelligence to create our society is one of the great questions in science. While the PV-RNN hasn't answered it, it opens new avenues for research into how information is processed in our brains. "By observing how the model learns to combine language and action," summarizes Dr. Vijayaraghavan, "we gain insights into the fundamental processes that underlie human cognition. It has already taught us a lot about compositionality in language acquisition, and it showcases the potential for more efficient, transparent, and safe models."
Source:
- Okinawa Institute of Science and Technology (OIST) Graduate University
Journal reference:
- Vijayaraghavan, P., et al. Development of compositionality through interactive learning of language and action of robots. Science Robotics.