As standards for designing and evaluating psychological research have grown much more rigorous in the last fifteen years, owing largely to a seminal paper by Joseph Simmons, Leif Nelson, and Uri Simonsohn, something has been bothering me. Will the new standards for psychological research affect the questions psychologists ask and the topics they explore, and not only in good ways? Research has gotten much tighter methodologically. We should have greater confidence than before that findings we read about in journals will replicate. What’s good about this is evident. But do we pay a price for increased rigor? If so, what is the price? And is it a price worth paying?
To begin to address these questions, I want to insert you into a hypothetical experiment. Imagine yourself a participant in a psychological study of basic sensory sensitivity. The research aims to determine the weakest sounds that you can reliably detect. You are seated in a soundproof chamber, with headphones on. Periodically, a warning light appears, after which you experience a trial that either contains a sound or does not. Your job is simply to hit “Y” on your keyboard (for “yes, I heard a sound”) or “N” (for “no, I didn’t hear a sound”). Because there is “noise” in the system (e.g., your attention may wander), you go through hundreds of trials like this, and by convention, your “threshold” for detecting weak auditory inputs is defined as the sound intensity that you correctly detect 50 percent of the time.
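For readers who like to see the mechanics, here is a minimal Python sketch of how such a threshold might be estimated. Everything in it is an illustrative assumption: the intensity scale, the simulated listener’s logistic response curve, and the simple interpolation to the 50 percent point stand in for the more careful procedures a real psychophysics lab would use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical intensity levels (arbitrary units) and a simulated listener whose
# probability of saying "yes" rises smoothly with intensity (a logistic curve).
intensities = np.linspace(0, 10, 11)
true_threshold, slope = 5.0, 1.5   # assumed values, for illustration only

def p_yes(intensity):
    return 1.0 / (1.0 + np.exp(-(intensity - true_threshold) / slope))

# Run many trials at each intensity and record the proportion of "yes" responses.
n_trials = 200
yes_rate = np.array([rng.binomial(n_trials, p_yes(i)) / n_trials for i in intensities])

# Estimate the threshold: the intensity at which the "yes" rate crosses 50 percent,
# found by linear interpolation between the two intensities that bracket it.
i = int(np.argmax(yes_rate >= 0.5))
x0, x1, y0, y1 = intensities[i - 1], intensities[i], yes_rate[i - 1], yes_rate[i]
estimated_threshold = x0 + (0.5 - y0) * (x1 - x0) / (y1 - y0)
print(f"Estimated 50% threshold: {estimated_threshold:.2f} (true value: {true_threshold})")
```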
Procedures like this are standard tools in one of the oldest research areas in psychology—psychophysics. Experimental psychology essentially began with research on this topic, a century and a half ago. Over the years, research methods and theoretical explanations have grown ever more sophisticated, culminating, in the 1960s, in what is called “signal detection theory.”
You may not care much about how sensitive your hearing is, but it turns out that signal detection theory has provided an explanatory framework that applies in countless situations that are more familiar, and have more practical consequences, than the study of sensory sensitivity. What signal detection theory tells us is that there are factors involved in whether you report hearing a sound or not that have nothing to do with your ears or your auditory system.
Consider: you experience a trial and you’re not sure whether there was a tone or not. What do you report? Well, suppose you know that 90 percent of trials contain a sound. In other words, you have good reason to expect a sound on any given trial. As a result, you will likely resolve your uncertainty with a “yes.” If you knew that only 10 percent of trials had sounds, you would likely resolve the same exact uncertainty with a “no.” So your expectations will influence your responses.
Now suppose your experimenter pays you a nickel every time you correctly say “yes,” but nothing if you correctly say “no.” And suppose mistakes (“no” on a “yes” trial and “yes” on a “no” trial) carry no penalty. Now, whenever you are uncertain, the payoffs operative in the situation might bias you to say “yes.” In an experiment set up like this, with tones on 90 percent of trials and a payoff for correct detection, you would likely be saying “yes” whenever you were sure there was a tone. But you would also be saying “yes” most of the time when you were unsure. Over the years, researchers have developed extremely sophisticated tools, both methodological and statistical, to separate your actual sensory sensitivity from your expectations and the operative payoffs.
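Those tools can be illustrated with a small sketch. The standard signal detection indices d′ (sensitivity) and c (response bias) can be computed from nothing more than hit and false-alarm rates; the counts below are invented for the example, not data from any real experiment.

```python
from scipy.stats import norm

# Invented counts from a hypothetical session: 1,000 signal trials and 1,000 noise trials.
hits, misses = 820, 180                      # responses on trials that contained a tone
false_alarms, correct_rejections = 310, 690  # responses on trials that did not

hit_rate = hits / (hits + misses)
fa_rate = false_alarms / (false_alarms + correct_rejections)

# d' measures sensitivity: how well the listener separates signal from noise.
d_prime = norm.ppf(hit_rate) - norm.ppf(fa_rate)

# c measures bias: negative values indicate a liberal "when in doubt, say yes" criterion,
# the kind of shift that lopsided payoffs or a 90 percent signal rate would encourage.
criterion = -0.5 * (norm.ppf(hit_rate) + norm.ppf(fa_rate))

print(f"d' = {d_prime:.2f}, criterion c = {criterion:.2f}")
```

Two listeners with the same d′ but different expectations or payoffs will show very different hit and false-alarm rates; it is the criterion, not the ear, that moved.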
There are two ways to be right in a signal detection experiment: say “yes” when a signal is present (a “hit”) and “no” when a signal is absent (a “correct rejection”). There are also two ways to be wrong: say “yes” when there is no signal (a false alarm or false positive) and “no” when there is a signal (a miss or a false negative). This 2×2 table captures the possibilities:

                 Signal present           Signal absent
  Say “yes”      Hit                      False alarm (false positive)
  Say “no”       Miss (false negative)    Correct rejection
Which mistake would you rather make, if you couldn’t be perfect?
Let’s leave the lab and take this question to the radiologist’s office. You are there for a routine mammogram and the results show an ambiguous shadow. Is there a malignancy or not? Just as in the psychophysics lab, there are two ways to be right and two ways to be wrong. If your radiologist is going to make a mistake, which mistake do you want her to make: false alarm (“yes it’s something to worry about” when in fact it isn’t) or “miss” (“no, it’s nothing” when in fact it’s a malignancy)? It seems clear that the potential cost of missing a malignancy is far greater than the cost of a false alarm, which might lead to a repeat mammogram and then a more invasive diagnostic procedure. So the relative costs of errors will probably bias you to want your radiologist to say “yes, let’s pursue this” unless she is completely certain that it’s nothing.
And if you are a 25-year-old woman with no family history of breast (or any other kind of) cancer, your radiologist’s response to this uncertain mammogram will probably be very different than if you are a 50-year-old whose identical twin was diagnosed with breast cancer two years before. Two patients with histories like these will lead the radiologist to form very different expectations about what the mammogram shows. A patient’s age, health status, and family history will (and should) change a doctor’s expectations about how likely this ambiguous signal is to be a malignancy.
There are countless other examples like the mammogram. Is that a patch of black ice ahead, or is it just water glistening in the headlights? In Minnesota, you might suspect the former; in Florida, the latter. But in either case, the costs of missing the black ice (serious skid and an accident) are large compared to the costs of slowing down unnecessarily (unless a car is tailgating you). Are those creaks and groans I hear from bed at 2 a.m. just an old, wood-laden house making noise as the heat goes on and off, or is it an intruder? Was the job applicant very nervous in the interview, or is he less sharp than his resume would suggest?
The world is an uncertain place. We can say “yes, there’s a signal” and be wrong, or “no, there’s no signal” and be wrong. We usually set a cutoff, or threshold, for saying yes (whether consciously or not). If we set a very low threshold, there will be lots of false alarms. If we set a high one, there will be lots of misses. There is an inevitable trade-off we make between these two types of errors, and where we place our cutoff will depend on what we expect to happen, and on what the consequences might be if we make one or the other mistake.
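The trade-off is easy to see in a short sketch. Assuming the textbook simplification of equal-variance Gaussian noise and signal distributions separated by a fixed sensitivity (the value of 1.5 below is arbitrary), sliding the cutoff upward steadily trades false alarms for misses:

```python
import numpy as np
from scipy.stats import norm

d_prime = 1.5                          # assumed separation between noise and signal
cutoffs = np.linspace(-1.0, 2.5, 8)    # candidate "say yes above this point" thresholds

for c in cutoffs:
    false_alarm_rate = 1 - norm.cdf(c)   # noise distribution centered at 0
    miss_rate = norm.cdf(c - d_prime)    # signal distribution centered at d_prime
    print(f"cutoff {c:+.2f}: false alarms {false_alarm_rate:.2f}, misses {miss_rate:.2f}")
```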
So why am I telling you this? I’m telling you because I think there are important lessons to be learned from these types of situations when it comes to judging the right methods and standards for conducting research in psychology or other social sciences. The paper published about 15 years ago by Simmons, Nelson, and Simonsohn decrying the bad habits, or perhaps bad motives, that had come to characterize research in psychology produced a revolution in psychological methodology. There was an epidemic in the research literature of what they called “false positives” (false alarms)—claims that research had shown a “significant difference” between a treatment group and a control group that did not stand up to scrutiny or efforts at replication. The various possible outcomes of a psychology experiment are depicted in this 2×2, which is just like the last one:

                      Effect is real           Effect is not real
  Report an effect    Hit                      False positive (false alarm)
  Report no effect    Miss (false negative)    Correct rejection
A “hit” occurs when you report an effect of your treatment that is really present, and a correct “no” occurs when you report, accurately, that the treatment failed. The two types of errors are false positives (false alarms) and misses. Simmons, Nelson, and Simonsohn were focused on false positives, and how to reduce them.
It is hard to overestimate the impact this paper has had on the conduct of empirical research in psychology. The paper made several recommendations about how to reduce false positives, and it is fair to say that most of them have been followed—by individual researchers who want to solve the problem, and by prestigious journals that make adhering to these recommendations a condition for publication. Though I don’t think there has been an effort yet to quantify just how many false positive results that would otherwise have been published have been prevented (it takes some time for folks to discover that a positive result is actually a false positive result), I have no doubt that we now should have much greater confidence that claims that appear in our scholarly journals will stand the test of time and of efforts at replication. This is, to my mind, very good news.
But is it only good news? Remember, from my discussion of research on sensory sensitivity, that there is an inevitable trade-off between the two types of error to which our judgment is susceptible. The more stringently we reduce false alarms (false positives), the more vulnerable we are to misses (false negatives). This may seem like a very small worry. Missing the occasional significant experimental result is a small price to pay for clearing away false claims that masquerade as true. Think of all the time, effort, and money that gets spent when a researcher reads an interesting result, decides to pursue it, designs and runs a bunch of experiments, none of which work, only to discover that the initial research that spurred her interest does not stand up to scrutiny. Think of all the wild-goose chases avoided. Surely, missing an occasional true result is a small price to pay. Or is it?
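The same arithmetic applies to journals as to radiologists. Here is a rough simulation (the sample size, the 0.4 standard-deviation effect, and the two significance thresholds are all invented for the illustration): tightening the threshold for declaring a result “significant” does cut false positives, but it also raises the miss rate for an effect that is genuinely there.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_per_group, true_effect, n_sims = 40, 0.4, 2000   # assumed: a real but modest effect

for alpha in (0.05, 0.005):
    false_positives = misses = 0
    for _ in range(n_sims):
        # No real effect: any "significant" difference is a false positive.
        a, b = rng.normal(0, 1, n_per_group), rng.normal(0, 1, n_per_group)
        if ttest_ind(a, b).pvalue < alpha:
            false_positives += 1
        # Real effect of 0.4 SD: a nonsignificant result is a miss.
        a, b = rng.normal(0, 1, n_per_group), rng.normal(true_effect, 1, n_per_group)
        if ttest_ind(a, b).pvalue >= alpha:
            misses += 1
    print(f"alpha = {alpha}: false positive rate {false_positives / n_sims:.3f}, "
          f"miss rate {misses / n_sims:.3f}")
```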
Consider a hypothetical example. A researcher is interested in the best way to learn new material. He explores the possibility that simply re-reading and trying to memorize that material is much less effective than being subjected (by the teacher or by oneself) to frequent tests of knowledge of that material. The researcher does a couple of studies that produce what we might call marginal results—decent-sized effects that are only barely statistically significant because of high variability in the data. These effects would have been robust enough for publication 20 years ago but aren’t now. He puts the projects away in a file drawer and goes back to what he had previously been studying. Because the results are not published, nobody knows about either the experiments or the ideas that prompted them, and the researcher’s hypothesis dies a quiet and unremarkable death, unless someone else happens to develop a similar idea and better ways to test it.
The problem is that the researcher was right! His hypothesis was correct! He might not have figured out the best way to test the hypothesis or the best materials to use in his studies, or how many tests (and in what form) to give his participants, or how many participants to include. But if his papers had been published, someone else who read them might have figured out better (more robust) ways to explore the phenomenon.
It is not easy to decide, as an individual or as a research community, just how big an effect has to be to be taken seriously. Different researchers and research areas have different standards (just as different radiologists may have different standards for worrying about a mammogram result). The Simmons, Nelson, and Simonsohn paper went a long way toward making those standards uniform across the field.
The enhanced rigor of modern research, which drives many false positives into file drawers, unpublished, also drives many potential true positives into file drawers, unpublished. It is possible, even likely, that different researchers will differ in how imaginatively they respond to the methodological and analytic limitations of a given study. A particularly creative researcher will read a paper reporting marginal results and see how to make the methods stronger. That won’t happen if those results sit in a file drawer.
Thanks to the contribution by Simmons, Nelson, and Simonsohn, our imperfect researcher will now know better how to design and analyze research. This researcher may not have to depend on the smart insights of others. But the communal, social nature of science almost guarantees that involving multiple voices and laboratories will enrich the conversation. And these other voices and laboratories won’t get involved if the initial findings remain hidden.
And it isn’t just the cleverness of different researchers that will be lost. We now know that a key ingredient for successful replication is the statistical power provided by very large samples of participants. It takes a lot of time and money to run experiments with very large samples. One way around this problem is to run less elaborate, and less costly, questionnaire-type studies with online participants. MTurk and its alternatives have made a real contribution to the power of experimental research. But the demand for very large samples has also taken some types of experiments off the table altogether, except perhaps in well-funded labs with extensive personnel.
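To give a sense of the scale involved, here is the standard normal-approximation formula for the sample size of a two-group comparison; the effect sizes are just Cohen’s conventional benchmarks, not figures from any particular study.

```python
from scipy.stats import norm

def n_per_group(effect_size, alpha=0.05, power=0.80):
    # Normal-approximation sample size for a two-sided, two-sample comparison.
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return 2 * ((z_alpha + z_power) / effect_size) ** 2

for d in (0.2, 0.5, 0.8):   # conventional "small," "medium," and "large" effects
    print(f"effect size {d}: roughly {n_per_group(d):.0f} participants per group")
```

Small effects, which are common in psychology, quickly push the required samples into the hundreds per group, which is exactly where the time and money problem bites.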
Let me be clear: the price of chasing a false positive can be quite substantial, including wasted careers (spent chasing ghosts), wasted journal pages, the demolishing of a field’s reputation (when no one reports anything that replicates), and the misguided interventions by governments and institutions that might be gullible enough to be taken in by published garbage. And a false positive mammogram may also have substantial costs, in money, time, inconvenience, and mental anguish. We would certainly benefit if both our experiments and our mammograms were less vulnerable to error.
But the price of missing a true positive can also be substantial. Just ask the woman with the uncertain mammogram who turned out to have breast cancer, caught early because more probing follow-up tests led to timely and successful treatment.
And there is another facet of modern false positive policing that may be even more costly to research progress. Professional success in academia depends on getting your work published. The road to promotion and tenure passes through scholarly journals. If you are a young academic in pursuit of professional success, what sorts of things do you choose to study? Some projects are “safe.” They don’t plow much new ground, and you can be confident that research will produce publishable results. Others are risky; they may or may not work out. When publication standards are relatively lax, you might decide to take a flyer on a high-risk project. As the standards become more stringent, the risk of failure goes up, and chances are your willingness to take the risk goes down.
The effect, on your work and on the field’s collective work, is an evolution to safer and safer projects, leaving real innovations to die, not in some file drawer but in the thought bubbles that pop into your head as you take your morning shower. The analogy here is to the participant in the signal detection experiment who gets a nickel for every hit, and otherwise nothing, but with much higher stakes.
If you play this process out over multiple generations of scientists, you may get more and more conservative science. In effect, psychology might end up knowing more and more about less and less.
I am not suggesting here that the efforts of Simmons, Nelson, Simonsohn, and others who have followed their lead are bad. My own view is that they have really improved the quality and reliability of what psychologists do in their laboratories. They have really reduced the likelihood of false positive results littering the journals and taking up our time. They have offered a bunch of excellent ideas that collectively increase the intellectual hygiene of researchers and research. What I am suggesting is that there is a price to eliminating or reducing false positives. It may even be a high price. And a realistic and responsible assessment of the revolution in research methods that has developed in recent years must take this price into account, perhaps even finding some way to measure it.
Economists are fond of telling us that there is no free lunch. What I’m suggesting here is that there is no free rigor. Reducing false positives may have substantial benefits, but those benefits come at a price. And at the very least, we need to assess that price, to be sure that our increased rigor is worth it. I don’t want to minimize the difficulty of this problem. We can’t really know what is out there waiting to be discovered but currently being missed, precisely because it hasn’t been discovered yet. Perhaps we will collectively decide that the price of increased rigor is well worth paying. My aim here has only been to encourage us to acknowledge that there is a price.
Disclosure: Barry Schwartz is a member of Behavioral Scientist’s advisory board.