Imagine that a company with 20 employees asks you to create a nudge to reduce the amount of paper it uses. After some analysis of the company’s processes, you conclude that one of the reasons why the company is spending too much on paper is that most employees are printing single, rather than double-sided. When you ask employees why, they say that they simply forget; in fact, this is not an issue they think much about. This seems a perfect situation for a nudge, a little push at the moment of the decision that will influence people to print double-sided, benefitting both the company and the environment.
For illustrative purposes, let’s assume that it was not possible to program the printers to print double-sided by default. So you think of a simple, but potentially very effective, nudge: a pop-up message every time someone presses “print,” reminding employees of the option to print double-sided. As a good practitioner, however, you not only want to implement the nudge. You also want to test whether the nudge is working as desired, and measure how much paper (and money) is actually being saved. But how could you test the efficacy of the nudge in this case?
For those familiar with behavioral economics and nudges, the response is probably obvious: a randomized controlled trial (RCT). In practical terms, this means that the pop-up message would be randomly assigned to half of the computers (the intervention condition) but not the other half (the control condition). You could then compare the quantity of paper used by each group to see if the nudge was successful.
This sounds like the right strategy, but is it? The decision about how to evaluate this specific nudge—and, as we will see, many others—is not as simple as it may seem.
In fact, an RCT might not be the most effective way to test our hypothetical nudge. Why? For one, the sample size (20) is quite small, meaning that even if a difference is observed, it may not reflect a true effect. Imagine that by chance two people with strong pro-environmental attitudes are included in the pop-up group. They also happen to be the only two people who printed double-sided before the intervention. In this case, if the pop-up group performed better, the observed advantage may be due to the behavior of these two people and not the intervention.
There is also the opposite danger. If the same two environmentalists are instead included in the control group, a true effect could be missed. The underlying problems are that randomization cannot guarantee equivalence between groups when the sample is small and that this small sample may result in an underpowered test, reducing the ability to detect a true effect.
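To see why a 10-versus-10 comparison is underpowered, here is a minimal Monte Carlo sketch in Python. The double-sided printing rates (20 percent without the nudge, 60 percent with it) are hypothetical illustrations, not figures from the scenario; the simulation estimates how often a two-proportion z-test would detect even that large a difference with only ten employees per group:

```python
import random

def simulate_power(p_control=0.2, p_treat=0.6, n_per_group=10,
                   n_sims=5000, z_crit=1.96, seed=1):
    """Estimate, by simulation, how often a two-proportion z-test
    detects a true difference at the 5% level (two-sided)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Simulate how many employees in each group print double-sided.
        c = sum(rng.random() < p_control for _ in range(n_per_group))
        t = sum(rng.random() < p_treat for _ in range(n_per_group))
        pooled = (c + t) / (2 * n_per_group)
        se = (2 * pooled * (1 - pooled) / n_per_group) ** 0.5
        if se > 0 and abs(t - c) / n_per_group / se > z_crit:
            rejections += 1
    return rejections / n_sims

# Even for this large hypothetical effect, ten employees per group
# detect it far less often than the conventional 80% power target.
print(simulate_power())
```

Rerunning the sketch with `n_per_group=100` pushes the estimated power close to 1, which is the intuition behind the small-sample objection: the design, not the nudge, determines whether a real effect can be seen.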
RCTs: The Gold Standard Research Design?

The truism that RCTs are the “gold standard” of program evaluation—a belief widely shared in behavioral economics—implies that they should always be used to evaluate nudge interventions. For example, the U.K.’s Behavioral Insights Team (BIT) promotes the use of RCTs as an essential tool for effective evidence-based policies. However, this belief has given rise to the impression that RCTs are the only acceptable and scientifically valid evaluation method for nudge interventions. As the previous example clearly shows, this is not the case.
We are not saying that RCTs are without their merits—far from it. As Ronald Fisher, the prominent statistician, argued, the number of factors that may differ between two groups is endless, and randomization is the only method that can—under the right conditions—guarantee the intervention is the only systematic difference between the groups. RCTs are indeed the safest method for establishing cause–effect relationships, and they have often been used to rigorously evaluate nudge interventions, contributing considerably to our knowledge about the effectiveness of different nudges in various contexts.
RCTs should be used to evaluate nudge interventions whenever appropriate. However, they are not always appropriate: in some cases they are (a) not feasible or practical or (b) considered unethical, and (c) even when they can be run, they are not free of limitations.
RCTs Are Not Always Feasible. In schools, for example, randomization at an individual level is not usually possible. This is because school interventions often happen in the context of the classroom, where all students are exposed to the intervention. Other times, schools refuse to apply educational programs unequally to their students. Under such constraints, other research designs are better suited to provide relevant insights. Such designs include group-based randomization (randomizing at the classroom/department or school/company level) and the pretest–posttest design, among other possibilities.
RCTs Are Not Always Considered Ethical. In an RCT, one group receives the intervention while the other does not. This raises ethical issues. For example, exposing only some students to an intervention that helps them create a plan to enroll in college might be perceived as unfair by schools, parents, and students. While researchers understand that having a control group is the best way to see if the intervention really works, school directors and teachers may not be willing to deny half of their students a potential benefit. Even if the control group is slated to receive the intervention in a subsequent phase, withholding it initially can still be considered unfair or unethical.
RCTs Have Limitations. As previously mentioned, one of the most important limitations of RCTs is that they are a poor evaluation method when the sample size is small. But another issue is that it’s hard to have a pure control group. Administering a similar nudge in the control condition—one designed not to have an effect, like a message about a different topic—may still have an effect, leading to an underestimation of the size of the effect (common in health care settings). Additionally, if participants in different experimental conditions are in close proximity, they may communicate about what they’ve received, damaging the validity of the test.
Finally, the excessive emphasis on RCTs and “average treatment effects” can lead researchers to neglect individual variance, subgroup effects, and the analysis of more complex causal mechanisms. In other words, the assumption is that since RCTs “take care” of all factors apart from the treatment, researchers do not need to worry about these other factors. This assumption is unwarranted since, as many social scientists know, “the devil is in the details.” Knowledge of precise causal mechanisms is essential to understanding the specific conditions under which the treatment will and will not work and, ultimately, to refining the underlying theories.
There’s More to Evaluation Than RCTs
Not only do RCTs have limitations, but nonrandomized designs may also be less problematic than they seem.
The theoretical limitations of nonrandomized designs are not always observed in practice. For example, a common criticism of the pretest–posttest design is history—that is, that the results may be explained by an external event that co-occurred with the intervention rather than by the intervention itself. This threat, however, is greatly mitigated in shorter interventions. Other strategies can also be used to overcome the limitations of nonrandomized designs. For example, showing a difference between a treatment group and several different control groups in a nonrandomized design will increase the confidence that the effect of the treatment is real.
Recently, Angus Deaton and Nancy Cartwright went as far as saying that RCTs do not deserve any special status and that taking RCTs as the ultimate truth may be an impediment to scientific progress. “The gold standard or truth view does harm when it undermines the obligation of science to reconcile RCTs results with other evidence in a process of cumulative understanding,” they write. We agree about the importance of integrating multiple sources of data. If different research designs point to the same results, we can have greater confidence in our conclusions. All established methods and designs are valid under the right conditions—what may not be valid are the inferences drawn from the results.
Implications for Practitioners
Since no method or research design is perfect, researchers must weigh each design’s individual weaknesses and strengths.
Let’s apply this critical thinking to evaluate our pop-up nudge. Experienced researchers would most likely turn to a pretest–posttest design. This would require measuring how much paper the company used before introducing the nudge and comparing it to the amount of paper used after implementing the nudge. If the printing rates are stable, measuring paper use for two weeks before and two weeks after introducing the nudge may be adequate. Follow-up measurements could also be included in order to evaluate the effects of the nudge over time (e.g., 2, 6, and 12 months). More practically, researchers would need to work with the company to specify an appropriate tracking system.
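The pretest–posttest comparison itself is simple arithmetic. As a minimal sketch, the daily sheet counts below are entirely hypothetical, standing in for two weeks of tracking before and after the pop-up was introduced:

```python
# Hypothetical daily sheet counts for ten working days
# before and after the pop-up nudge was introduced.
before = [480, 510, 495, 530, 470, 505, 520, 490, 515, 500]
after = [410, 395, 430, 420, 405, 415, 400, 435, 410, 425]

mean_before = sum(before) / len(before)
mean_after = sum(after) / len(after)
reduction = (mean_before - mean_after) / mean_before

print(f"Mean daily sheets: {mean_before:.1f} -> {mean_after:.1f} "
      f"({reduction:.0%} reduction)")
```

The follow-up measurements mentioned above would simply extend this comparison with additional `after`-style windows at each interval, which is why stable baseline printing rates matter: the `before` window has to be representative for the difference to be attributable to the nudge.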
Alternatively, researchers could consider a blocked design, matching employees on some critical characteristic (such as whether they were already printing double-sided) and only then randomly assigning them to the conditions.
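Assuming the company can tell which employees already print double-sided, the blocked assignment could be sketched as follows. The roster and the count of prior double-sided printers are hypothetical; the point is that each block is split evenly, so neither condition can end up with all the environmentalists by chance:

```python
import random

# Hypothetical roster: 20 employees, 4 of whom already print double-sided.
employees = [(f"emp{i:02d}", i < 4) for i in range(20)]

def blocked_assignment(roster, seed=7):
    """Randomize within blocks (prior duplex printers vs. the rest),
    splitting each block evenly between the two conditions."""
    rng = random.Random(seed)
    assignment = {}
    for prior in (True, False):
        block = [name for name, is_prior in roster if is_prior == prior]
        rng.shuffle(block)
        half = len(block) // 2
        for name in block[:half]:
            assignment[name] = "pop-up"
        for name in block[half:]:
            assignment[name] = "control"
    return assignment

groups = blocked_assignment(employees)
```

With simple randomization, all four prior duplex printers could land in one group; here, exactly two go to each condition, removing that source of imbalance while preserving random assignment within blocks.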
Defining the best research design to evaluate a nudge is much more interesting and challenging than one might assume at first. RCTs are not a one-size-fits-all answer to how to evaluate a nudge. And in fact, many nudge interventions may appear to fail not because their theoretical assumptions are wrong but because they are not being evaluated in the best way possible.
Given the serious consequences of poorly tested interventions, those who apply nudges must have a sound understanding of evaluation methodologies. We strongly encourage a greater methodological debate—it is crucial for the development of the field. Researchers should apply the same creativity and critical thinking to evaluating nudges that they do in creating them.