
Natural field experiments might involve studying how shoppers respond to in-store changes, such as moving where products are displayed. Credit: Robert Nickelsberg/Getty
Public policy is full of initiatives that did not work out as hoped. Take Scared Straight, a programme run by more than 30 US states between 1978 and 2015 that aimed to dissuade at-risk teenagers from becoming hardened criminals by bringing them face-to-face with people incarcerated in maximum-security prisons1,2. The programme was extended after a pilot project, the subject of a 1978 documentary, found that 80–90% of teenage participants stayed out of trouble. But the intervention did not work when scaled up. In some places, criminal behaviour among teenagers even rose.
Similarly, many childhood-development interventions that have proved effective in one place have failed to deliver comparable results elsewhere. Deworming children in schools, for instance, substantially reduced absenteeism in Kenya but has shown mixed or weaker effects in other settings. School meal programmes in Burkina Faso increased student attendance but had limited impacts on outcomes in other countries3.
This generalizability problem arises, in part, because human behaviour differs across populations and situations. People live in complex social environments in which labels, stakes and scrutiny shape every decision. But those contexts are often overlooked. Conventionally, research participants have come from Western, educated, industrialized, rich and democratic (WEIRD) populations4, which are unrepresentative of other groups on measures ranging from visual perception to moral reasoning and cooperation. What works for them might not apply to other populations.
In my view, one solution to the problem is to use a greater number of natural field experiments. In such studies, participants go about their everyday activities unaware that they are being observed by researchers, while some feature of their environment is varied. By studying people in their natural setting, assuming that strict ethical rules are followed, researchers can be more confident that their findings will be relevant to that group.
Three developments make scientists better placed to take advantage of such approaches. First, attention to the replication crisis in academia has coincided with a growing understanding that results often fail to generalize beyond the narrow populations typically recruited for laboratory studies. Second, the technology sector is running thousands of natural field experiments to obtain reliable information about its customers, establishing infrastructure and methods that academics can use. Third, a growing body of research into generalizability has provided frameworks for predicting when and why results will fail to apply across populations and settings5,6.
Here I outline how academics can embed natural field experiments in their work.
A three-stage problem
Difficulties in replicating studies have been recognized across the sciences, from social science to biomedicine. Reforms in how research is done can help researchers to generate more-reliable results and repeat others’ work more easily. These include pre-registration of hypotheses and methods, larger samples, open data and transparent reporting. But, in fields related to human behaviour, replication requires only that researchers obtain the same result with the same kinds of people in the same setting; it does not test whether the result holds for people elsewhere.
Generalizability problems arise at three stages of experimental design.
Stage one: population selection. Before any experiment is conducted, researchers must select a population from which to draw study participants. A psychologist might recruit undergraduate students, whereas a medical researcher might choose from people at an academic hospital. When the population chosen for the study differs from those who will ultimately be affected by its results, the findings might not translate. Historically, clinical trials have recruited predominantly middle-aged white men, whose results were then applied to women and other groups for whom treatment effects might differ substantially7.
Stage two: participant selection. In standard behavioural experiments, participation requires consent, and consent requires awareness. Once potential volunteers learn about a study, some agree and some decline, and this decision is rarely random7. Consider a lab experiment that pays volunteers US$20 to show up. Those who respond are likely to have flexible schedules, feel comfortable in academic settings and value the payment enough to participate. Such self-selection potentially skews the participant pool. In Scared Straight, this manifested in two ways. First, the volunteers were teenagers who both wanted to change and were willing to be filmed. Second, outcome data were collected through letters sent to parents, and those who had good news about their child were more likely to reply2.
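The self-selection mechanism described above can be illustrated with a small simulation. This is only a sketch under invented assumptions: the `motivation` variable, the volunteering rule and every probability in it are hypothetical and are not drawn from Scared Straight or any other study. The simulated programme has no effect at all, yet the volunteers still look better than the population they came from.

```python
import random


def simulate_self_selection(n=100_000, seed=0):
    """Compare outcomes for a whole population with those of self-selected volunteers.

    All numbers are illustrative assumptions, not estimates from real data.
    The intervention itself is assumed to have zero effect.
    """
    rng = random.Random(seed)
    population_successes = 0
    volunteer_successes = 0
    volunteer_count = 0

    for _ in range(n):
        # Hypothetical latent trait: how much the teenager wants to change.
        motivation = rng.random()

        # Assumed outcome model: staying out of trouble depends only on motivation.
        stays_out_of_trouble = rng.random() < 0.3 + 0.5 * motivation

        # Assumed selection rule: more motivated teenagers volunteer more often.
        volunteers = rng.random() < motivation

        population_successes += stays_out_of_trouble
        if volunteers:
            volunteer_count += 1
            volunteer_successes += stays_out_of_trouble

    return population_successes / n, volunteer_successes / volunteer_count


population_rate, volunteer_rate = simulate_self_selection()
print(f"Success rate, whole population: {population_rate:.3f}")
print(f"Success rate, volunteers only:  {volunteer_rate:.3f}")
```

Under these assumptions, the gap between the two rates comes entirely from who chose to take part. A pilot that reports only the volunteers' rate would overstate what a scaled-up programme could deliver.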
Stage three: situation selection. Every study of human behaviour involves creating an experimental setting in which to observe a participant. Situational bias arises when that context differs from how an intervention would occur in real life. This might include differences in the level of scrutiny that volunteers are under, the magnitude of the stakes and the social cues that surround decisions.
In one study that I conducted, trading-card dealers interacted with their customers while knowingly under scrutiny from researchers. I then compared their behaviour with that of dealers working at a market who were unaware that they were being observed8. When dealers knew that they were being watched, they offered cards of higher quality than buyers could verify on the spot, a costly act of reciprocity unrelated to any prospect of repeat business. On the market floor, by contrast, reciprocity was strategic: generosity was extended only when reputation and repeat business made it economically rational.
As this shows, in many real-world settings, decisions occur inside webs of reputation, relationships and consequence. Generalizing from settings that mute these influences can lead to erroneous inference and flawed policymaking9.
Observe natural behaviours
Natural field experiments can bypass many of these problems. Because people do not know that they are in an experiment, issues of participant selection (stage two) are absent. And because the experiments are conducted in the settings that researchers wish to learn about, with individuals shopping, working or donating as they normally would, the problems of situational mismatch (stage three) vanish, too10.
Because the population is still selected by the researcher, this approach does not guarantee that stage one problems are overcome — the same result might not be found in a different community. But these experiments do clearly define the population to which the result applies. If results diverge across settings, then the reason is clear: the populations differ.

Laboratory experiments, in which people know that they are being studied, can be a good complement to natural field experiments. Credit: Getty
As chief economist at the retail chain Walmart, based in Bentonville, Arkansas, I know that the findings of natural field experiments that I conduct are reliable for Walmart customers, but might not apply to shoppers at Amazon, for example. By contrast, lab experiments that involve participant selection can display divergent results across studies. These results could reflect genuine population differences, variations in participant selection, or both, and the researcher cannot easily disentangle them.
Natural field experiments can be used to ask a variety of questions in many sectors. A researcher studying charitable giving might vary the contents of a donor letter sent to households. A psychologist studying honesty might leave wallets in public places and measure return rates across neighbourhoods. At Walmart, my team is running natural field experiments with more than 6,000 suppliers to test which factors will most effectively incentivize suppliers to reduce their carbon emissions.
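As a rough illustration of the donor-letter example, the sketch below randomly assigns households to one of two letter versions and compares donation rates. Everything here is hypothetical: the function names, the sample size and the assumed response rates (5% for the control letter, 7% for the alternative) are invented for the example, and the outcomes are simulated rather than observed.

```python
import math
import random


def assign_letters(household_ids, seed=0):
    """Randomly split households into equal-sized treatment and control groups."""
    rng = random.Random(seed)
    shuffled = list(household_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        hid: ("treatment" if i < half else "control")
        for i, hid in enumerate(shuffled)
    }


def difference_in_rates(outcomes, assignment):
    """Return the treatment-minus-control donation rate and its standard error."""
    groups = {"treatment": [], "control": []}
    for hid, donated in outcomes.items():
        groups[assignment[hid]].append(donated)

    n_t, n_c = len(groups["treatment"]), len(groups["control"])
    p_t = sum(groups["treatment"]) / n_t
    p_c = sum(groups["control"]) / n_c
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    return p_t - p_c, se


# Simulated outcomes under ASSUMED response rates (not real data).
households = range(20_000)
assignment = assign_letters(households, seed=1)
rng = random.Random(2)
assumed_rate = {"control": 0.05, "treatment": 0.07}
outcomes = {hid: rng.random() < assumed_rate[assignment[hid]] for hid in households}

effect, se = difference_in_rates(outcomes, assignment)
print(f"Estimated effect of the alternative letter: {effect:.4f} (SE {se:.4f})")
```

Because the letter version is assigned at random and households simply respond as they normally would, the difference in rates estimates the effect for this population of households. It says nothing by itself about households elsewhere.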
However, there are limitations. Natural field experiments cannot be applied to every research question. Some interventions, in psychotherapy, say, or classroom instruction, inherently require people’s awareness. Some processes, such as private deliberations, are unobservable without asking. And there are strict ethical limits that dictate when it is and is not appropriate to use natural field experiments.
Consider ethics
Natural field experiments should not expose participants to more than minimal risk, or to experiences that they would not normally encounter. An experiment that varies the wording of a charitable appeal, the format of an energy bill or the timing of a nudge to schedule a medical appointment exposes participants to experiences in the normal range of what they would have encountered without the study. By contrast, a study that feeds negative content to users of an online platform — a manipulation carrying psychological risk beyond their normal experience — can cross the ethical boundary.
All behavioural experiments in the field should follow existing ethical frameworks governing human research. For example, The Belmont Report (the foundation of research ethics in the United States) notes that research involving incomplete disclosure is justified only when three criteria are met: it is needed to accomplish research goals, there are no or minimal undisclosed risks, and there is an adequate plan for debriefing when appropriate11.




