When Expectations Fail, Keep Testing

Taxpayers increasingly scrutinize how public money is spent. Such scrutiny makes sense: taxpayers expect government administrators to ensure their decisions are effective. For policymakers, then, it’s important to collect evidence of “what works” to show that effectiveness, and they’re increasingly turning to experimental trials to evaluate programs. But this fervor for evidence-based policy carries its own risks, because even strong evidence from past work doesn’t guarantee that a new program will succeed or that governments will achieve all their objectives.

Using evidence from past research is no guarantee of success. The whole point of testing a policy is that we don’t know whether it will work, so the results may or may not confirm our initial hypotheses. This idea goes back at least to Donald Campbell, who in 1969 argued that policymakers should treat reforms as experiments. The implication of reforms as experiments is that some programs will work and others will not, but over time our evidence base and our understanding of what works grow. It also implies that null or negative results should be valued as much as positive ones: both are opportunities to strengthen our knowledge and understanding of what works.

Policymakers aren’t the only ones debating how and whether to publish negative results. Academics have long worried that null results—results that show “nothing” happened—are unattractive to publish. After all, who would want to read about a program that didn’t work?

This tendency to publish positive findings when both positive and negative results exist is so prevalent it’s been called the file drawer problem: academics don’t submit papers with null results, either because they lack enthusiasm for them or because they expect journals to reject them. Instead, the results are left to gather digital dust in a file drawer. Although on balance it’s still harder to publish null results in academic journals, at least the intellectual argument for publishing them has now become generally accepted.

Can the policy world also embrace null or even negative results? It should, but policymakers might be less attuned to the arguments of scientists and more worried about not getting the expected results. To encourage policymakers to seek guidance from null findings, we discuss an example from our work on how a negative result can lead to better real-world policies.

Evidence from around the world, such as from trials carried out by the UK’s revenue and customs service (HMRC), suggests that communicating descriptive social norms is an effective strategy for encouraging tax compliance. Studies have shown that if others in our social networks are doing something—or even if we think they are—we are more likely to follow suit. People respond to the pressure to conform and adopt similar “normal” behavior. The strategy is so robust, and backed by such a good evidence base, that we might simply have advised the authority to adopt this sort of messaging. Even if this strategy didn’t significantly increase compliance, what harm could it do? But conscious of the limits of evidence, we decided to put it to the test.

As reported in our recent study in the Journal of Behavioral Public Administration, we tested this idea in two randomized controlled trials with a London local authority. The project aimed to encourage people to pay their council tax.

The first trial tested two strategies: simplifying the tax bills and communicating the social norm. As expected, we found that simplifying the tax bill increased payment by four percentage points and led to a statistically significant increase in revenue, equivalent to over £1m in additional income for the council. Our social norm communicated to local residents that the vast majority (96 percent) of their peers paid their taxes. Based on past studies, we expected to find that this social norm would encourage compliance. But we did not find an effect of the social norm.

Puzzled that the social norm did not work, the following year we ran a second trial. We randomly assigned households due to pay their council tax in cash (excluding those who pay electronically, as these households generally paid their tax on time) either to a social norm group (our intervention) or to a control group. In total, 28,876 residents were assigned to the social norm group and 28,877 to the control.

The results surprised us and contradicted the prevailing evidence. Our social norm actually dissuaded people from paying their tax: 41 percent of the treatment group paid their council tax in full, compared with 44 percent of the control group. The social norm had backfired.
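As a rough illustration of why a three-percentage-point gap in groups this large is a clear effect rather than noise, a standard two-proportion z-test can be run on counts back-calculated from the rounded rates reported above (these reconstructed counts are illustrative, not the study’s raw data):

```python
from math import sqrt, erf

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z-test; returns (z statistic, two-sided p-value)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)          # pooled payment rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error under H0
    z = (p_a - p_b) / se
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))         # two-sided normal tail
    return z, p

# Illustrative counts reconstructed from the reported figures:
# 41% of 28,876 treated households paid in full vs. 44% of 28,877 controls.
treat_paid = round(0.41 * 28_876)
ctrl_paid = round(0.44 * 28_877)
z, p = two_proportion_z(treat_paid, 28_876, ctrl_paid, 28_877)
print(f"z = {z:.2f}, p = {p:.2g}")  # z is strongly negative; p is far below 0.05
```

With samples of nearly 29,000 per arm, even a three-point difference in payment rates yields a z statistic around −7, so the backfiring effect is very unlikely to be a chance fluctuation.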

Fortunately, since the first trial had effectively raised income, the council was still able to increase its tax revenue, more than compensating for the negative result of the social norm. And the combined data from the two trials enabled us to further confirm that our simplified bills really worked.

In the end, the council decided not to use the social norm, in spite of other councils finding that this practice works. The firsthand evidence from the trials, rather than the prevailing academic literature, informed practice, and the resulting design was proven, rather than assumed, to be effective. In this way, evidence accumulates about which procedures work and which do not, much like the scientific process of cumulating evidence.

The big takeaway is this: just because a behavioral intervention has worked elsewhere doesn’t mean it’s going to work for you. Context, audience, and how the program is delivered all make a difference to overall effectiveness, as do variables like the demographics of the target population, the framing of the message, and when it’s received. In this case, we wonder whether differences in the population of the area studied and the wording we used (compared with other similar interventions) may have reduced the effect. For instance, a more transient population, if they did not feel particularly connected to the area, may have been less concerned about their peers. Or perhaps the way we described our social norm, “96% of residents pay their council tax,” was not sufficiently personalized to resonate with recipients. But of course, without testing those hypotheses we can’t say for sure why we observed these differences.

Being evidence-based, in other words, only gets us so far. Only by testing and evaluating new approaches (and acknowledging that sometimes we won’t get the anticipated result) can we be sure that a particular piece of evidence can or should guide a new policy.

Evidence and compelling case studies can help us to develop a plausible hypothesis that we want to test, but we cannot take for granted that we’ll be able to replicate what’s already been done.

Our advice: consider existing evidence as a starting point, rather than an endpoint. Only by trying things out for yourselves can you be confident that what you are doing is working—or not.