Could It Be Her Voice? Why Scarlett Johansson’s Voice Makes Samantha Seem Human

Early in the movie Her, writer Theodore Twombly (Joaquin Phoenix) has a conversation with his new Operating System, Samantha (Scarlett Johansson), for the first time. Just moments into the conversation, Theodore expresses amazement: “You seem like a person – but you’re just a voice in a computer.”

Indeed, Samantha exists only as a voice in Theodore’s ear. Prior to watching the film, movie viewers may, understandably, have been doubtful that Samantha’s character would seem real without a physical body. And yet, if Her’s Oscar nomination for Best Picture is any evidence, Samantha is clearly a compelling character. In fact, critics suggest that far from diminishing the film, Johansson’s voice actually carries it. Critic Christopher Orr writes in The Atlantic, “Her voice – breathy, occasionally cracking – warms the entire film.” Orr further remarks, “[Johansson’s] Samantha is one of the more recognizably human characters of the movie year.”

Among the many big questions that the movie Her tackles is that of humanness: What does it mean to be human? Why does Samantha seem as human as – or even more human than – other characters in the movie despite being only a voice? In Her, Johansson’s voice brings Samantha’s character to life, and Theodore’s love for her feels not just believable, but entirely plausible. But how? How can a person fall in love with just a voice?

Recent psychological research conducted by Professor Nicholas Epley and me, at the University of Chicago, suggests an answer. According to our research, currently in preparation, voice may uniquely communicate presence of mind and, ultimately, fundamental aspects of being human.

A person’s voice is directly and often immediately linked with his or her thoughts and feelings in verbal language. Voice is a conduit through which complicated mental states are translated and communicated to others. In our research, we predicted that voice can be humanizing: that it conveys the presence of a humanlike mind through paralinguistic cues (i.e., the vocal cues that accompany language including loudness, rate, and pitch). In a series of laboratory experiments, we tested whether hearing a person’s speech makes him or her seem more “mindful,” that is, more thoughtful, emotional, and even more human, than reading a person’s speech (or writing).

For instance, in an initial experiment, we modified the classic Turing Test by asking people (“observers”) to guess whether a speech had been originally created by a computer or by a human. In reality, all of the speeches were created by humans talking about their actual emotional experiences. Some observers were randomly assigned to read a verbatim transcript of a speech, whereas other observers were assigned to listen to a speech. The vast majority (91.6%) of the observers who listened to the speeches correctly guessed they had been created by a human. But the observers who read the speeches were much less likely to guess “human” – only 66.8% of them made the correct choice.

We then ran a second experiment to measure in greater detail how hearing someone’s voice affects an observer’s impressions of that person – specifically, their impressions of the person’s mental capacities. In this experiment, not only did we again assign some observers to read speech transcripts and other observers to listen to speeches, but we also assigned a third group of observers to watch videos of the speakers. After observers either watched, listened to, or read a speech, we asked them a series of questions about the speakers. These questions included evaluations of the speakers’ abilities to think, such as, “How thoughtful is the speaker?” and “How competent is the speaker?” and evaluations of the speakers’ abilities to feel such as, “How warm is the speaker?” The observers who had listened to speakers believed the speakers had significantly more ability to both think and feel than the observers who had read speeches. In line with our predictions, watching speakers in addition to hearing them had no effect on evaluations of the speakers compared to hearing them without seeing them.

But Theodore couldn’t have fallen in love with just any voice. Imagine if Apple’s computer voice SIRI had been the voice of Samantha instead of Johansson. Samantha’s humanness would not have been nearly as believable, even if her words were exactly the same. Theodore would no longer seem like a man deeply in love, but a man in deep delusion.

Why is it so much more plausible to fall in love with Johansson’s voice than with SIRI’s? To viewers, the answer seems obvious: Johansson is a person, SIRI is a robot. But consider Theodore’s position: essentially, he’s in love with a more developed version of SIRI.

Our research suggests that it was not what Johansson said but rather how she said it, that made Samantha seem so real. In another experiment, we examined two different types of voices: those with natural paralinguistic cues – loudness, pitch, and rhythm typical of human language – compared to those with reduced paralinguistic cues.

To create these voices, we asked professional actors to read writing samples from other people who wrote about their important life decisions. The actors read each written script out loud twice. In their first reading, they imagined that they were the person who wrote the script. They spoke naturally, as if they were in the midst of a real conversation. We called this the “natural voice” condition. In their second reading, the actors spoke in a flat voice. They read the words exactly as they saw them on the page without putting any life or feeling into the words. In this “flat voice” condition, actors’ voices contained minimal paralinguistic cues compared to the “natural voice” condition. We analyzed the actors’ voices using Praat, an open-source speech analysis program. When actors spoke naturally, their voices had higher mean pitch, greater pitch variance, higher mean amplitude, and greater amplitude variance than when they spoke in flat voices.
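
As a rough illustration of those four measures, here is a minimal Python sketch. It assumes the pitch and amplitude tracks have already been extracted frame by frame (for example, exported from a speech analysis program) and that unvoiced frames are marked with a pitch of 0. The function name `summarize_voice`, the input format, the use of sample rather than population variance, and the example numbers are all hypothetical; none of them come from the study.

```python
from statistics import mean, stdev, variance


def summarize_voice(pitch_hz, amplitude_db):
    """Summarize one recording with the four measures named in the text.

    pitch_hz: per-frame pitch estimates in Hz; 0 marks an unvoiced frame (assumed format).
    amplitude_db: per-frame amplitude values, same length as pitch_hz.
    """
    if len(pitch_hz) != len(amplitude_db):
        raise ValueError("pitch and amplitude tracks must have the same length")
    # Pitch is only defined while the speaker is voicing, so drop unvoiced frames.
    voiced = [f for f in pitch_hz if f > 0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    return {
        "mean_pitch": mean(voiced),
        "pitch_variance": variance(voiced),  # sample variance (an assumption)
        "pitch_sd": stdev(voiced),           # square root of the sample variance
        "mean_amplitude": mean(amplitude_db),
        "amplitude_variance": variance(amplitude_db),
    }


# Made-up contours for illustration only; these are not data from the study.
natural = summarize_voice(
    [210, 245, 0, 190, 260, 225, 0, 180],
    [62, 70, 48, 66, 72, 68, 50, 60],
)
flat = summarize_voice(
    [200, 202, 0, 199, 201, 200, 0, 198],
    [60, 61, 58, 60, 61, 60, 59, 60],
)

# A lively contour swings widely around its mean; a monotone one barely moves.
print(natural["pitch_sd"] > flat["pitch_sd"])
```

In this toy comparison the "natural" contour has the larger pitch spread simply because its frame values range more widely, which is the kind of difference the natural and flat readings were meant to produce.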

We tested whether observers judged flat voices differently than natural voices using the same modified Turing Test paradigm from the first experiment mentioned above. We asked observers to guess whether the written scripts had been created by a human or computer. The observers who listened to actors speaking with natural voices guessed “human” 65.0% of the time, whereas observers who listened to flat voices guessed “human” just 50.0% of the time, and observers who read the original writing guessed “human” 46.7% of the time. In other words, only the natural voices, not the flat ones, seemed more human than the text. Critically, the effect of communication medium on judgment of humanness was fully explained by the amount of pitch variance in actors’ voices – and not by any of the other paralinguistic cues. This suggests that one reason voices convey humanness is their variance in pitch. That is, speakers naturally modulate their voice pitch – their voices becoming higher and lower as they speak – and this pitch modulation could be an important aspect of how people express their mental states to others.

Could the pitch variance in Johansson’s voice be part of the reason why Samantha’s character seemed so lifelike and human? To find out, we analyzed a clip of Johansson’s voice from the movie (1:34:35 to 1:34:58). In it she says,

“You know what’s interesting? I used to be so worried about not having a body, but now I truly love it. And I’m growing in a way I couldn’t if I had a physical form. I mean, I’m not limited; I can be everywhere and anywhere simultaneously. I’m not tethered to time and space in a way that I would be if I was stuck in a body that’s inevitably going to die” (Her, 2013).

We computed the standard deviation of Johansson’s pitch from this clip. For contrast, we also recorded SIRI (the robotic voice of Apple’s iPhone, who is known as Samantha on Apple’s computers) speaking the exact same words and analyzed SIRI’s pitch variance.

We compared Johansson’s and SIRI’s pitch variance to the pitch variance of the natural- and flat-voiced actors from our experiment (shown in the graph). As expected, Johansson’s pitch variance is similar to that of the natural voices, whereas SIRI’s resembles that of the flat voices. Johansson’s voice is expressive and effusive; SIRI’s is not.

The movie Her illustrates the same conclusion that we draw from many psychology experiments: Voice – if wielded naturally – can be a powerful tool to convey the presence of a humanlike mind. In Her, it was a voice that made the movie more about love and humanness than delusion and machinery.

References

Schroeder, J. R., & Epley, N. (in preparation). Speaking louder than words: Voice reveals presence of a humanlike mind.
Jonze, S. (Director). (2013). Her [Film]. Warner Bros. Pictures.

Juliana Schroeder
