A group of New York University psychologists is taking strides toward making artificial intelligence smarter, by testing whether these algorithms can learn common sense. For now, the smallest among us — infants — can still outperform the machines, but the results could help make computerized intelligence feel more human.

The recent study, led by Dr. Moira Dillon, tested 11-month-old babies against three computerized neural networks on what’s known as commonsense psychology. Dillon describes that as the everyday reasoning we do about other people as they move about their lives. If someone reaches for a coffee mug, our common sense tells us they want a drink even before the mug reaches their lips. If they then throw the mug over their shoulder, we assume it might shatter even before it hits the ground.

“We as humans, even as infants, have strong intuitions about the invisible goals and motivations and intentions that drive other people's actions,” said Dillon, whose study was published online last month in the journal Cognition. According to the study's findings, AI still lags behind an 11-month-old in reacting to visual scenarios with the same intuition, but the research provides clues on how to fill in those gaps.

A visual scenario similar to what Dillon and her team tested was this: Let's say you see your neighbor Bob make a beeline for his mailbox. Your common sense would infer that his primary goal is to look for an important piece of mail. But if you see Bob walk around the block and then stop at the mailbox on the way home, you would probably think his goal is different: he wanted some exercise or fresh air.

While AI is getting very good at observing human patterns to mimic us or predict what might come next, it can often still miss the mark when trying to predict a sequence of behaviors. For example, ask ChatGPT, the popular chatbot, to write a sequel to Titanic starring dolphins, and its reply starts out like a creative children’s movie. Of course, Daphne and her dolphin pod would be enticed by wreckage and the chance at collecting lost treasure.

ChatGPT takes a swing at writing a cetacean-themed sequel to Titanic.

But then, ChatGPT reaches for a climax that doesn’t quite land. It’s 1912, and somehow people survived in ice-cold water long enough for these dolphin rescuers to arrive? These dolphin treasure hunters suddenly care about their human overlords?

“ChatGPT is focused on pattern detection and is looking for patterns in the way that language is used. So it doesn't have a deep understanding of what another person is saying or why people might be saying it,” Dillon said. “It's going to be limited in the way that it can then make inferences about what ideas that person might be trying to convey.”

That’s not a diss to ChatGPT. Dillon and other researchers view commonsense psychology as a potential ally to AI systems that just consume loads of knowledge and try to spit back coherent patterns.

“One way of thinking about this paper is it's asking to what extent do the algorithms and properties of these models capture important features of the human mind and brain,” said Dr. Nick Turk-Browne, a professor of psychology at Yale who wasn’t involved with the NYU report but who studies child cognition and AI. “And if they do, maybe some of the mechanics of how they work could teach us something about how the mind is working.”

What they found

The NYU team tested the common sense of AI by using three algorithms similar to ChatGPT — but ones that focus on interpreting videos rather than language.

Take the example of your neighbor walking left down a driveway to check the mail. Watch them do this a few times, and you might assume that when they go left, the objective is to get the mail.

Dillon and her colleagues created six scenarios where 84 infants and the three computerized neural networks could gauge how objects move around a screen — and try to guess the intentions of those movements. They call this test the Baby Intuitions Benchmark, or BIB. (Get it? Baby bib.)

“When we designed the Baby Intuitions Benchmark, we made it such that literally the exact same visual stimulus can be shown to an infant and can be shown to a machine-learning algorithm,” Dillon said.

The video showed a rectangular grid where a stand-in for our neighbor Bob — a gray blob — always started at the bottom of the screen. Higher up the grid were a blue tent on the left and a green shield on the right. Over an initial series of viewings, the blob would move consistently to one of these shapes, say, the blue tent.

During the training phase, the babies would initially show surprise, or stare for long stretches, at Bob the Blob moving to the blue tent. But over time, this surprise and attention faded: the blue tent is obviously Bob’s goal. This study and past ones have shown that measuring surprise in this way serves as a proxy for when the mind is inferring something.
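For a machine-learning model, an analogous "surprise" measure can be framed as prediction error. The sketch below is purely illustrative and hypothetical — the names, probabilities, and metric are assumptions, not the paper's actual models or scoring — but it shows the basic idea of a violation-of-expectation test: a model with human-like expectations should register more surprise at a goal-inconsistent outcome.

```python
# Hypothetical sketch of a violation-of-expectation "surprise" score.
# The BIB paper's actual models and metrics differ; everything here is
# illustrative.

import math

def surprise(prob):
    """Surprise as the negative log-probability of the observed outcome."""
    return -math.log(prob)

# Suppose a model, after watching Bob the Blob head to the blue tent
# repeatedly, assigns these probabilities to his next move:
predictions = {"blue_tent": 0.9, "green_shield": 0.1}

expected = surprise(predictions["blue_tent"])       # low surprise
unexpected = surprise(predictions["green_shield"])  # high surprise

# A goal-inconsistent outcome should be the more surprising one:
print(unexpected > expected)  # True
```

In looking-time studies, longer staring plays the role that a higher surprise score plays here: both flag an outcome the observer did not predict.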

Example of a goal-directed experiment in the Baby Intuitions Benchmark.

Stojnić G. et al., 2023

The researchers then switched the locations of the blue tent and the green shield on the screen. The babies didn’t show surprise if Bob the Blob kept moving toward the blue tent; that’s just what Bob does. But if Bob all of a sudden moved toward the green shield, the surprise and staring came back. The babies inferred that the goal had changed.

In contrast, the AI algorithms couldn’t make the same inferences.

“So babies expect the circle to move to the same object regardless of what location it's in,” Dillon said. “The [machine-learning] models we found either expect that the circle will move to the same location, or the models have no relative expectation.”

The same happened when babies and AI went head-to-head in judging the most efficient routes when obstacles were placed between Bob the Blob and its object of choice. Overall, the infants participated in about 330 testing sessions that involved a variety of shapes and movement goals.

Why it matters

Outside researchers said what’s interesting about this work is that it helps scientists figure out what infants know and what they can intuit. They said people have been mulling this concept — the intersection of AI and cognitive development — at least since World War II and the era of Alan Turing, the mathematician who helped jumpstart modern computing. If you had a time machine, they said, you could pluck a baby from the 1940s and place it in a home today, and it would have the common sense to discern and learn human intentions.

Just consider, again, anyone looking at a coffee mug that’s been tossed over a shoulder.


“They'd probably say it's trivial: They look at it, and see it's a mug. But under the hood, their brain is running some pretty sophisticated computations to arrive at that conclusion, sophisticated computations they don't have direct access to,” Harvard cognitive scientist Dr. Tomer D. Ullman said via email. “We don't directly know what are the computations behind intuitive physics or intuitive psychology, either, and that's what I'd like to know!”

But unpacking this common sense is also one of the study’s (and the field’s) biggest hurdles, and one that Dillon recognizes, too. This study only tested six scenarios.

“We're doing our best to figure out what infants know,” said Dr. Michael Frank, a professor of psychology at Stanford University, who wasn’t affiliated with the study but likes the research. But this type of work “is very slow, it's time consuming, and each individual interaction with the baby in the lab gives you only a very limited amount of information about overall patterns of behavior.”

There are also alternative ways to teach AI. Turk-Browne, from Yale, is particularly interested in an “unsupervised learning” approach, where algorithms “are not given the right or wrong answer, but they try to find patterns or structure in what they're exposed to that are useful for predicting behavior and brain activity.”

For next (baby) steps, Dillon wants to develop a second intuition benchmark that can test how infants and AI view social partnerships.

“There's promise in this kind of research — creating a foundation for the future of human-like AI,” Dillon said. “I also think it's promising for infant researchers, psychologists and cognitive scientists who are aiming to understand the origins and development of the human mind.”