How to Battle the Bots Wrecking Your Online Study

This is part of our “Ask a Behavioral Scientist” series, where we give readers the opportunity to pose a question to leading behavioral scientists.

Q: As researchers, we’ve been turning more and more to online platforms, like MTurk, to run our studies. It’s made collecting data from hundreds or even thousands of participants much more efficient. But online data collection platforms aren’t without their issues, one being the quality of the data.

For instance, are participants really engaging with the study or just phoning it in? There’s also been a problem with people deploying bots to take the studies. What do we know about how bots are being used to game the online research system? What can researchers who rely on these platforms do to ensure their studies are free of bot-generated responses?

Technological advancements have expanded the scope of online research studies. Because they help researchers reach thousands of geographically diverse participants relatively quickly, online research methods have gained significant traction, reducing the cost and time of research.

But the technological advantages bring a serious threat to science and data integrity: bots, fake participants that have been programmed to complete dozens of responses (or more) in a matter of minutes.

We can’t really determine why programmers are deploying bots to complete online surveys, but I have a few guesses. First, most online studies that have been overrun by bots offer some kind of financial reward. Perhaps programmers hope to complete surveys without being detected to collect larger sums of money than if they participated truthfully. Bot programmers may also be interested in skewing research findings for malicious reasons. Others may use unpaid research to train the bots for future, paid surveys. No matter what the reason may be, the consequence is the same: unreliable results. Without removing all of the bot-generated responses, the data set cannot be used to gain insight into the research question at hand.

The threat of bots was recently made clear to me when I launched my study examining risk factors associated with eating disorders in LGBTQ+ populations. The study, built in Qualtrics and REDCap, was shared on Twitter, and within 12 hours I received about 380 responses before I froze data collection. It was almost immediately clear that the majority of participants were bots and that I needed to do something to separate the human participants from the bots. I spent hundreds of hours coding and sorting my responses, and in the end concluded that only 11 of them, or about 3 percent, were not flagged as bots.

Despite this setback, I still find it incredibly valuable to use online data collection platforms to increase diversity in our study samples. Fortunately, there are ways to protect your data from bots.

One important thing to consider when building bot protection into an online study is that bot programmers vary widely in coding sophistication and in the schemes they use. No single check will catch every bot, so researchers need to implement more than one tool to flag them.

There are a few telltale signs that researchers need to be aware of at the lowest level of bot sophistication. First, bots tend to “speed” through studies. Bots will also provide illogical responses to open-ended questions and respond to questions that should be hidden from participants (e.g., honeypot items). Last, these bots will provide impossible time and date stamps on informed consent documents.

To guard against and identify these less-sophisticated bots, researchers should:

  • Include open-ended questions to look for unusual responses
  • Track study time stamps for impossible dates and times (e.g., bundles of participants beginning and ending the survey at the same exact time)
  • Flag respondents who completed survey materials impossibly fast
  • Flag participants who respond to items that they shouldn’t otherwise have access to
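The timing and hidden-item checks in this list lend themselves to a simple automated first pass. What follows is a minimal sketch in Python, not the procedure used in the study described here. The field names (`id`, `start`, `end`, `honeypot`), the 120-second cutoff, and the example data are all assumptions made for illustration; a real cutoff would depend on the length of the survey.

```python
from collections import Counter
from datetime import datetime, timedelta

# Assumed cutoff: shortest time a careful human could plausibly need.
MIN_PLAUSIBLE_SECONDS = 120


def flag_basic_bots(responses, min_seconds=MIN_PLAUSIBLE_SECONDS):
    """Return {response id: set of flag names} for low-sophistication signals.

    Each response is a dict with hypothetical keys 'id', 'start', 'end'
    (datetime objects) and 'honeypot' (answer to a hidden item, or None).
    """
    # How many responses share exactly the same start and end time?
    stamp_counts = Counter((r["start"], r["end"]) for r in responses)

    flags = {}
    for r in responses:
        found = set()
        duration = (r["end"] - r["start"]).total_seconds()
        if duration < 0:
            found.add("impossible_timestamps")  # ended before it began
        elif duration < min_seconds:
            found.add("too_fast")
        if stamp_counts[(r["start"], r["end"])] > 1:
            found.add("shared_timestamps")
        if r["honeypot"] not in (None, ""):
            found.add("answered_hidden_item")  # humans never see this item
        flags[r["id"]] = found
    return flags


# Made-up example data, not taken from any real study.
t0 = datetime(2021, 3, 1, 9, 0, 0)
sample = [
    {"id": "A", "start": t0, "end": t0 + timedelta(minutes=14), "honeypot": None},
    {"id": "B", "start": t0, "end": t0 + timedelta(seconds=40), "honeypot": "3"},
    {"id": "C", "start": t0, "end": t0 + timedelta(seconds=40), "honeypot": None},
]
flags = flag_basic_bots(sample)
```

In this made-up sample, respondent A raises no flags, while B and C finish in 40 seconds with identical timestamps, and B also answers the hidden item. Open-ended answers still need a human reader; the sketch covers only the mechanical checks.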

Unfortunately, more sophisticated bots are much harder to detect in a data set. For instance, imagine a programmer deployed a bot with a goal of completing 50 responses for a single survey. Sophisticated programmers who want to thwart a study will ensure that bots are not stacked together in the data set by manipulating the timestamp and IP address. Even more insidiously, they’ll program the bot so that its answers to each item follow a normal distribution across all 50 responses, within the range of that item. Finally, if there are open-ended questions, they’ll extract language from the survey itself to compose more logical responses.

As a researcher you may feel like you’re up against it. But there are ways to catch these more sophisticated bots. Here are several tips:

  • Do not share your survey on Twitter. It appears as though the majority of bots gain access to study links through this platform.
  • Build attention and logic checks directly into your surveys (e.g., provide a paragraph of text, and somewhere in the text directly state which answer to choose)
  • Ask the same question at two separate points (e.g., age)
  • Provide unique survey links to each participant rather than using a public link. This will prevent the participant from clicking the link more than once or sharing the link with others.
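Three of these tips (the attention check, the repeated question, and the unique links) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any survey platform’s API: the field names `attention_item`, `age_first`, and `age_second`, the `token` query parameter, and the `example.org` address are all hypothetical. Where a survey platform offers its own individualized links, that built-in feature would normally be the better choice.

```python
import secrets


def make_unique_links(base_url, participant_ids):
    """Give each participant a link carrying its own random token.

    The survey backend (not shown) would have to accept each token only
    once for the link to be truly single-use.
    """
    return {
        pid: f"{base_url}?token={secrets.token_urlsafe(16)}"
        for pid in participant_ids
    }


def check_consistency(response, expected_attention_answer):
    """Return a set of flags for one response (hypothetical field names)."""
    found = set()
    # The survey text told attentive readers exactly which answer to choose.
    if response["attention_item"] != expected_attention_answer:
        found.add("failed_attention_check")
    # The same question (age) was asked at two separate points.
    if response["age_first"] != response["age_second"]:
        found.add("inconsistent_age")
    return found


# Made-up example: two invited participants and one suspicious response.
links = make_unique_links("https://example.org/survey", ["p1", "p2"])
suspect = {"attention_item": "red", "age_first": 29, "age_second": 41}
suspect_flags = check_consistency(suspect, expected_attention_answer="blue")
```

Here the made-up response fails both checks: it picks “red” when the text asked for “blue,” and it reports two different ages.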

There is no way to guarantee that your online study will be completely bot-free. Researchers should implement several, if not all, of the described strategies to protect their studies. When you’re collecting online data, it’s important to check data integrity multiple times over the course of data collection. I generally suggest that participants should not be flagged as potential bots unless they violate at least two of the checks discussed above. These strategies should serve as the first step in protecting data. Technology is advancing each day, and coding schemes are sure to advance in kind; thus, new ways to protect data from bots should be something researchers discuss regularly. I hope that my experience helps others avoid a similar one. I still strongly encourage the use of online research methods. In fact, I recently reopened my original study, of course with several new layers of bot protection.
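For readers who script their data cleaning, the two-violation guideline can be written as a short Python function. This is a minimal sketch: the flag names and the example data are made up, and the `min_flags` default of 2 simply encodes the suggestion above.

```python
def likely_bots(flags_by_respondent, min_flags=2):
    """Return the IDs whose number of distinct failed checks reaches min_flags."""
    return sorted(
        rid for rid, flags in flags_by_respondent.items()
        if len(flags) >= min_flags
    )


# Made-up example: only B fails two different checks.
example_flags = {
    "A": set(),
    "B": {"too_fast", "answered_hidden_item"},
    "C": {"too_fast"},
}
print(likely_bots(example_flags))  # prints ['B']
```

Requiring two distinct violations keeps a single slip, such as one fast but genuine respondent, from costing a human participant their place in the data set.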