#17 | HUMAN VOICES | Uniqueness of Human Voices: AI's Limitations & the Role of Voice Specialists
Voice Distinctiveness: Klaus emphasizes human voices' unique and distinctive nature. Unlike other forms of identification like facial recognition or fingerprints, voices are instantly recognizable in human interactions and communication.
Memorable Encounters with Iconic Voices: Klaus recalls discovering legendary voices like Tony Bennett and Rosemary Clooney early in their careers. These encounters highlight the impact of remarkable voices on our memories and emotions.
Human Complexity in Voice: Klaus questions whether AI can replicate the complexity and immediacy of human voices. While AI voice assistants like Alexa and Siri offer useful information, they often lack the depth of emotion, inflection, and soul that human voices naturally convey. This underscores the importance of voice specialists, who bring a unique skill set to communication by adding depth, emotion, and authenticity to the spoken word.
Nothing is more indicative of the differences between humans than their voices. Oh yes, there is facial recognition, fingerprints, and DNA, but for readily accessible and immediate casual use, a voice is distinctive. Now, I can say that I know the voice of Mel Blanc, but only through the many characters he voiced as a professional; I never heard his everyday speaking voice ask me, "What's up, Doc?"
In my life, I have listened to many and widely varied voices. Like everyone else, I have accumulated a library of them in my memory. The voice on the other end of the phone has always been the first indicator of whom one is talking to. Now, with caller ID and all the other gimmickry, one knows ahead of time who is calling and can take the call or not.
I was ten. It was about 1950 (I am elderly); I was walking along the sidewalk outside a bar under the elevated subway on DeKalb Avenue in downtown Brooklyn, New York City, when I heard a distinctive voice. Perched atop a milk crate, a young man with a full head of hair was crooning songs of the day a cappella. It happened to be Tony Bennett. He was just getting started. Later that year, on the wide boardwalk at Coney Island on the West End, from atop one of the benches, a young lady in a full-flowered skirt and long curly hair was belting out jingles of the day. It was Rosemary Clooney. Their remarkable voices have been with me ever since. I'm sad that Tony has just recently gone.
All my life, I've listened to classical music. After all these years, certain voices are still wafting through my head. I can recall enough to cherish the beauty and emotions they conveyed, alas only in snippets, but quickly recaptured by classical music videos. I know when I hear Frank Sinatra, Bing Crosby, Perry Como, Fats Domino, Elvis, Dame Joan Sutherland, Beverly Sills, Robert Merrill, Sherrill Milnes, Siepi, Placido Domingo, Luciano Pavarotti, and many others. A few, when heard, are so easy to identify. The qualities of a voice can vary widely, from a voice out of my past, Yma Sumac, reputed to have a range covering eight octaves, to my own, which is, never mind, "with thanks to Gilda Radner."
So, to the question: can Artificial Intelligence bring that kind of complexity and immediacy to the skill needed? I am aware, in an amateur and unskilled way, that preprogrammed voices can sound realistic and seemingly in "tune" with our queries, as with Alexa, Siri, and their contemporaries. What seems to be missing from these machine voices, however, is the immediacy and inflection to match the quality of the information sought or given. Instead, measured and informative "flat" responses meet our ears. The data may be present, but the soul is not.
Should "voice" specialists still be needed? Methinks yes. Thanks to our millennia of experience as Homo sapiens in the real world, we recognize cheap substitutes for the real thing as affected and dubious, giving rise to a strange dissonance that delays thinking, action, and commitment.
Voice actors and artists have distinctive voice qualities that convey a range of emotions, bring movement to life and emphasize it, seen and unseen, and give meaning and substance to an event, short or long. Their skill must be readily identifiable to a listening audience and, at times, in synchronicity with visuals. Tone, inflection, emotional delivery, decibel range, timing, and more come to bear in skillful execution. They must employ tremolo, breathing, and the dynamic content of a word or phrase with a clear grasp of the story and the immediacy of the action. What results is believable communication: a visual, aural, and emotional consistency that fills out the story and carries us confidently to our destination.