Deep Blue’s 1997 victory over world champion Garry Kasparov was the beginning of the end of mankind’s chess dominance. A game once believed to be the pinnacle of human intelligence was being taken over by computers. In 2005, a mere eight years later, the world’s seventh-ranked player was thrashed by a supercomputer, managing only a draw over six games. And in 2006, world champion Vladimir Kramnik was defeated by Deep Fritz, a software program running not on a supercomputer, but on a machine you could purchase at your neighborhood electronics store.
Today, the state of play is decidedly one-sided. In 2014, world champion Magnus Carlsen reached an Elo rating of 2882, the highest ever achieved by a human and well above the 2500 threshold for Grandmaster. But today’s best chess engines are rated around 3300.
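To get a feel for what a rating gap of roughly 400 points means, here is a quick back-of-the-envelope calculation using the standard Elo expected-score formula. It is a simplification (among other things, it ignores how often games are drawn), and the numbers are simply the ratings quoted above:

```python
# Standard Elo expected-score formula:
#   E = 1 / (1 + 10 ** ((opponent - player) / 400))
# A rough illustration only; it ignores draw rates, but it conveys the scale of the gap.

def expected_score(player_rating: float, opponent_rating: float) -> float:
    """Expected score (win = 1, draw = 0.5, loss = 0) for the player."""
    return 1 / (1 + 10 ** ((opponent_rating - player_rating) / 400))

carlsen, engine = 2882, 3300
print(f"Expected score for a 2882-rated human against a 3300-rated engine: "
      f"{expected_score(carlsen, engine):.2f}")  # roughly 0.08 per game
```

In other words, even the strongest human in history would be expected to score well under one point in ten against the machine.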
So, should we just bow down to the algorithm? Not everyone thinks so.
Back in 1954, the experimental psychologist Paul Meehl published his “disturbing little book” Clinical versus Statistical Prediction. In one chapter, Meehl catalogued twenty empirical competitions between statistical methods and clinical judgment, involving predictions such as academic results, psychiatric prognosis after electroshock therapy, and parole violation. The results were consistently either a victory for the statistical algorithm or a tie with the clinical decision maker. In only one study could Meehl generously give a point to the humans. (I’ll leave aside for today the point that we have known these facts for over 60 years, yet still rely heavily on expert judgment in domains where we could otherwise replace it.)
Still, Meehl recognized that the human brain could be acutely sensitive to the unusual, and proposed what he called the “broken leg” scenario. Imagine you are trying to predict whether someone will go to the movies on Friday night. Your model gives them a 90 percent chance of going, but at the last minute you discover they have a broken leg and are in an immobilization cast in hospital. Since there is no variable for a broken leg in your model, blindly sticking to its prediction will lead to certain failure.
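To make Meehl’s point concrete, here is a toy sketch in Python. The “model,” its single input, and the override rule are all invented for illustration; the point is simply that a model cannot respond to information it has no variable for, while a human can:

```python
# Toy illustration of Meehl's "broken leg" scenario. The model and its single
# feature are made up for this sketch; the model keeps issuing its 90 percent
# prediction because it has no variable for a broken leg.

def model_probability_of_movies(went_last_friday: bool) -> float:
    # A crude "model": past behaviour is the only thing it knows about.
    return 0.9 if went_last_friday else 0.3

def human_override(model_prob: float, has_broken_leg: bool) -> float:
    # The human can intervene on information that sits outside the model.
    return 0.0 if has_broken_leg else model_prob

prob = model_probability_of_movies(went_last_friday=True)
print(prob)                                        # 0.9, broken leg or not
print(human_override(prob, has_broken_leg=True))   # 0.0, the sensible human call
```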
So, what do we do? We place the algorithm or machine in the hands of the human. When there is a broken leg, the human can recognize it and intervene.
In chess, this strategy has been fairly successful. As told in Tyler Cowen’s Average is Over or Andrew McAfee and Erik Brynjolfsson’s The Second Machine Age, computers are not actually the best chess players in the world. In freestyle chess—a style where players can consult multiple computers and programs in real time—the strongest players are human-computer combinations.
Some interpret this unique partnership as a harbinger of the future of human-machine interaction. The superior decision maker is neither man nor machine, but a team of both. As McAfee and Brynjolfsson put it, “people still have a great deal to offer the game of chess at its highest levels once they’re allowed to race with machines, instead of purely against them.”
However, this is not where we will leave this story. For one, the gap between the best freestyle teams and the best software is closing, if not closed. As Cowen notes, the natural evolution of the human-machine relationship is from a machine that doesn’t add much, to a machine that benefits from human help, to a machine that occasionally needs a tiny bit of guidance, to a machine that we should leave alone.
But more importantly, suppose we were to hold a freestyle chess tournament involving the people reading this article. Do you believe you could improve your chance of winning by overruling your 3300-rated chess program? Nearly all of us would be best off knowing our limits and leaving the chess pieces alone.
That said, chess is played in an eight-by-eight world with 32 pieces and well-defined moves. What of humans stepping in to address “broken leg” problems in more complex environments?
As it turns out, this is often a recipe for disaster. We interfere too often, seeing broken legs everywhere. This has been documented across areas ranging from incorrect psychiatric diagnoses to freestyle chess players ruining previously strong positions against the advice of their computer teammates.
For example, one study by Berkeley Dietvorst and friends asked experimental subjects to predict the success of MBA students based on data such as undergraduate scores, measures of interview quality, and work experience. The subjects first had the opportunity to answer some practice questions. They were also provided with an algorithm designed to predict MBA success, along with its answers to the practice questions, which were generally far more accurate than the subjects’ own.
In the prediction task itself, the subjects had the option of using the algorithm, which they had already seen was better than them at predicting performance. But they generally didn’t use it, costing them the money they would have received for accuracy. The authors suggested that when subjects saw the algorithm’s practice answers, they focussed on its apparently stupid mistakes, far more than they focussed on their own more frequent mistakes.
Although this question is somewhat under-explored, the study is typical of what happens when people are given the results of an algorithm or statistical method. The algorithm tends to improve their performance, yet the algorithm by itself is more accurate still. This suggests the most accurate approach is often to fire the human and rely on the algorithm alone.
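To see why this pattern is almost mechanical, here is a stylized simulation. The noise levels and the fifty-fifty blending rule are invented for illustration, not taken from the studies above; the sketch simply shows that if the human’s judgment is noisier than the algorithm’s, blending the two helps the human but hurts the algorithm:

```python
# Stylized simulation (all parameters invented) of the pattern described above:
# a human who adjusts an algorithm's forecast does better than the human alone,
# but worse than the algorithm alone, when the human's signal is the noisier one.
import random

random.seed(0)
N = 100_000

def mse(errors):
    return sum(e * e for e in errors) / len(errors)

algo_errors, human_errors, combined_errors = [], [], []
for _ in range(N):
    truth = random.gauss(0, 1)
    algo = truth + random.gauss(0, 0.5)    # algorithm: small error
    human = truth + random.gauss(0, 1.5)   # unaided human: larger error
    combined = 0.5 * algo + 0.5 * human    # human "adjusts" the algorithm halfway
    algo_errors.append(algo - truth)
    human_errors.append(human - truth)
    combined_errors.append(combined - truth)

print(f"Algorithm alone MSE:   {mse(algo_errors):.2f}")      # about 0.25
print(f"Human + algorithm MSE: {mse(combined_errors):.2f}")  # about 0.63
print(f"Human alone MSE:       {mse(human_errors):.2f}")     # about 2.25
```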
That is not to say it is impossible to train someone to team effectively with an algorithm. The success of human-computer teams in freestyle chess tournaments suggests the possibility. But interestingly, the best human partners in freestyle chess are not grandmasters. As Cowen points out, the winner of the first freestyle chess tournament in 2005 was a pair of American amateurs, and top team members typically have Elo ratings similar to that of a strong club player. There have been some spectacular failures by grandmasters in freestyle tournaments: their confidence leads them to interfere too often with the superior computer, whereas the best freestyle players will overrule their machine only a handful of times a game. If you can find a humble but skilled human, there could be room for success.
But how common are these humble, skilled humans? Despite being relatively weak chess players, the successful freestyle competitors likely had hundreds, if not thousands, of hours of chess experience. In other domains, can we train people at scale to work effectively with their algorithms, or will this be a skill held by the few?
Unless human intervention is limited to the right level, the pattern we will see is not humans and machines working together for enhanced decision making, but machines slowly replacing humans, decision by decision. Algorithms will often be substitutes, not complements, with humans left to the (for the moment, many) places where the algorithms can’t yet go.
A friend of mine often reminds me of an old joke about automation on airliners. The ideal team is a pilot and a dog. The pilot’s job is to feed the dog. The dog’s job is to bite the pilot if the pilot tries to touch anything. While we may still be some way away from this scenario in the world of aviation, in some domains we’re already there.