Baby Zoe sits on her mother’s lap and watches a puppet show featuring three shapes with googly eyes. A red circle struggles to climb a steep hill until a blue square helps it with a push. A yellow triangle blocks the way and shoves the red circle down the hill. When the show is over, Zoe is offered a choice of puppets. She doesn’t hesitate: she ignores the unkind yellow triangle and makes a grab for the helpful blue square.
The scene, from a Netflix documentary series released in 2020, recreates a highly cited 2007 study1, which found that babies as young as six months old overwhelmingly prefer characters who help, rather than hinder, others. On the basis of these findings, developmental psychologist Kiley Hamlin, now at the University of British Columbia in Vancouver, Canada, concluded that the ability to evaluate others’ behaviour develops before speech, and could be a biological adaptation.
Over the next decade or so, researchers performed dozens of versions of Hamlin’s experiment. But many of them failed to find the same preference for helpers — or suggested that other factors could explain the choice. Hamlin became frustrated at the resulting confused picture, so, in 2017, she assembled a collaboration of 37 research groups in 18 countries to repeat the experiment with more than 1,000 babies. That, she thought, should settle the matter once and for all.
Hamlin wasn’t the only cognitive scientist looking for ways to validate findings in their field at this time. Throughout the 2010s, many researchers attempted, and often failed, to replicate seminal studies in psychology and beyond, leading to what became known as the reproducibility or replication crisis. “There was huge press coverage about how psychology was garbage,” says Michael Frank, a developmental psychologist at Stanford University in California.
Many scientists saw small sample sizes as a major cause of the crisis — these distorted results or produced conclusions that applied only to limited groups. One obvious solution was to go big: perhaps boosting the number of trial participants could help to add weight to the results.
So, psychologists began building large-scale, international projects, often involving hundreds of collaborators, to plan and run the same experiment and see whether they got the same answer. Hamlin and Frank have used this approach to test hypotheses about infant cognition. Other researchers are investigating different aspects of cognition in humans and a wide range of species — from dogs to fish to flamingoes.
The results of these projects have been trickling in over the past few years. And they have sometimes failed to support the hypotheses that they were designed to replicate. Many scientists think that the challenging logistics of such mammoth studies are worth it for the extra rigour that they provide. “Joining forces by combining groups of subjects across labs solves this problem by giving us the statistical power to test important research questions,” says Kelsey Lucca, a developmental psychologist at Arizona State University in Tempe and a co-lead on the collaborative study of social evaluation in babies, alongside Hamlin.
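To make the statistical-power argument concrete, here is a minimal sketch of how the chance of detecting a real preference grows with the number of infants tested. It rests on illustrative assumptions that do not come from any of the studies described: a true helper preference of 60%, a one-sided exact binomial test against a 50% chance level, and a significance threshold of 0.05. The function names and the sample sizes of 16 and 1,000 are hypothetical too.

```python
from math import exp, lgamma, log


def binom_pmf(n, p):
    """Return [P(X = i) for i in 0..n] for X ~ Binomial(n, p), with 0 < p < 1."""
    log_p, log_q = log(p), log(1.0 - p)
    return [
        exp(lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q)
        for i in range(n + 1)
    ]


def power_of_preference_test(n, p_true, p_null=0.5, alpha=0.05):
    """Power of a one-sided exact binomial test with n participants.

    The test rejects "no preference" (p_null) when the number of helper
    choices reaches a critical count. That count is the smallest one whose
    upper-tail probability under p_null does not exceed alpha.
    """
    null = binom_pmf(n, p_null)
    alt = binom_pmf(n, p_true)
    tail_null = 0.0  # P(X >= k) if there is no preference
    tail_alt = 0.0   # P(X >= k) if the true preference is p_true
    for k in range(n, -1, -1):
        if tail_null + null[k] > alpha:
            break  # including k would push the false-positive rate past alpha
        tail_null += null[k]
        tail_alt += alt[k]
    return tail_alt


if __name__ == "__main__":
    # Hypothetical comparison: one small lab sample versus a pooled sample.
    for n in (16, 1000):
        print(n, round(power_of_preference_test(n, p_true=0.6), 3))
```

Under these assumptions, the small sample would usually miss a preference of this size, whereas the pooled sample would almost always detect it. That is the sense in which combining participants across labs adds statistical power.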
Big teamwork
Plenty of questions in science — in subjects from particle physics to cell atlases — require large teams. Although smaller ones are still the norm in a lot of fields, researchers in an increasing number of disciplines are coming to see the benefits of pooling data and expertise.
In the 2010s, in the wake of several high-profile controversies that shook confidence in the field, psychologists launched a series of such efforts. Brian Nosek, at the Center for Open Science in Charlottesville, Virginia, was a leading figure in a number of these large-scale collaborations. In 2014, for example, in the first of five ‘Many Labs’ studies, 36 research groups found that 10 out of 13 classic psychology effects could be replicated successfully2.
Nosek also led a project that enlisted laboratories to repeat 100 experiments from a selection of papers. This was a wider but shallower form of replication: the teams collectively attempted to replicate numerous findings, but not every team tested every result. This effort, called the Reproducibility Project: Psychology, found that only one-third to one-half of results originally published in three psychology journals in 2008 were also observed in the replications3. “Establishing replicability is a lot harder than people assumed,” says Nosek.
Other efforts have sought out the truth in broader ways. An enormous, years-long project called SCORE (Systematizing Confidence in Open Research and Evidence), published in April and again involving Nosek, assessed whether findings in the social sciences could be replicated. It also looked at whether results remained the same if the data were analysed in a different way and whether studies were reproducible, meaning that their results held true when the original data were reanalysed using the original code. In the replication arm of the study, researchers found that 49% of 164 papers in business, economics, education, political-science, psychology and social-science journals could be replicated independently4.
In 2015, inspired by the Many Labs model — in which groups of researchers repeat the same studies — a new wave of large collaborative projects began to emerge in psychology. These were built from the ground up. The first was ManyBabies, an international consortium running replications of existing findings in infant-development research. “What is different about ManyBabies and the other big-team science projects is that they are self-organized at a grass-roots level,” says Frank, who launched the initiative. “They are not convened by a funding agency or institution, and they have to find ways of working together without a lot of explicit hierarchy.”
The first ManyBabies projects looked at infant-directed speech, theory of mind — the ability to understand that other people have independent thoughts — and rule learning. And in 2017, Hamlin’s helper-preference study was launched as ManyBabies4. When it was finally published in 2025, the results showed that babies were equally likely to select the helper or hinderer character when offered a choice5. The authors concluded that either social evaluation does not, after all, develop in preverbal infants or, if it does, it could not be demonstrated using the helper–hinderer experimental set-up.

To study whether babies prefer ‘helper’ characters over those who hinder others, experimenters have used puppet shows such as this one, in which a blue square helps a red circle up a hill. Credit: J. Kiley Hamlin et al./Nature
Hamlin was disappointed and in two minds about how to interpret this. “My reaction was ‘darn, that sucks’,” she says. She speculates that the lack of a positive result might be because many of the infants studied were deprived of social interactions with strangers owing to the COVID-19 pandemic lockdowns. “I don’t know whether the null effect means babies can’t do this and never could and I was just kidding myself this entire time, or whether it’s reflecting something broader, and honestly very worrying, about babies post-COVID.”
Whatever the result, those involved point to several advantages of working as a big team. “Everyone had different expertise, with some people contributing more to data analysis, some to stimulus design and some people collecting lots of data,” says Lucca. The projects usually include more-diverse groups of participants, too, and so can give researchers greater confidence in the broad applicability of their findings. “Big-team science gives you access to a wider range of samples, including those that are under-represented,” says Nicolás Alessandroni, a comparative psychologist at Concordia University in Montreal, Canada.
This greater diversity can also extend across species. A ManyBirds study involved 129 researchers from 77 institutions in 24 countries investigating aversion to novelty — an evolutionary adaptation that is seen as important to bird survival. The researchers compared the time it took 136 species of mostly captive birds to touch food in the presence and absence of brightly coloured objects that the birds hadn’t seen before. They found that migratory birds and species with specialist diets had greater aversion to novelty, or ‘neophobia’, than did non-migratory species and those with broader diets6.
Some studies have shown links between variations in neophobia and different ecological factors. But little was known about how generalizable these effects were across avian taxa. The ManyBirds study provided support for existing theories, including the neophobia threshold hypothesis, which states that neophobia determines species’ ecological flexibility. A big-team science approach “can provide support for some of these big hypotheses”, says comparative psychologist Rachael Miller at the University of Exeter, UK, who co-founded the ManyBirds project.
Another collaboration is pushing sample diversity even further. The ManyManys group is planning to study a cognitive trait called reversal learning in a wide range of species, including humans, bonobos, dogs, giraffes, elephants, sparrows, crocodiles, sharks, snails and bees. The researchers hope to better understand the evolutionary origins of the trait, which involves the ability to change reward-related behaviours when circumstances change.
The animals will first be taught to associate rewards, usually food, with one of two objects. The colours, shapes or locations of the objects will then be switched and experimenters will record the time taken for animals to demonstrate that they understand the change.
Alessandroni, who is co-leading the project, says that the collaboration is helping researchers to think about how they can meaningfully compare cognition between species. He also hopes it will help to reduce researchers’ reliance on a few particular animal models.
Because these collaborations are relatively lacking in hierarchy and competition, they can benefit individual researchers, too. “I wanted to answer some of the big research questions about the ecological or evolutionary drivers of behaviours in animals,” says Miller. “But also, as an early-career researcher, I was motivated by wanting to change some of the very competitive ways science is usually done.”
Logistical challenges
A lack of hierarchy can also be challenging, however. Take the first ManyDogs study, which was established to settle a key controversy: do dogs perceive human pointing as socially informative or do they just learn to see it as a command?
The group decided to test this by giving dogs two sets of cues; the dogs then had to identify which of two cups contained a hidden treat. In one version, the experimenter looked at the dog and said the dog’s name before pointing at the cup that held the treat; in the second, they pointed at the cup but averted their gaze while clearing their throat instead of saying the dog’s name. The set-up was designed by the whole consortium, which comprised 20 research groups in 9 countries. Together, they tested more than 450 dogs.



