Sesame AI vs. ElevenLabs: The New Era of AI Voice Technology

26 Jun 2025
AI-Generated Summary - Reading time: 7 minutes

The world of AI generated voices has just experienced a shocking transformation. Sesame AI’s new voice models, Maya and Miles, have left some users questioning their reality, while others marvel at the incredible advancements in voice technology.

The Rise and Reign of ElevenLabs

For over a year, the AI voice market was dominated by ElevenLabs (also known as 11 Labs). Founded in 2022, ElevenLabs quickly became the go-to solution for content creators, game developers, and Hollywood studios alike. Its text-to-speech and voice cloning capabilities enabled lifelike voiceovers in multiple languages, with nuanced emotional expression that set a new standard for realistic voices. By early 2024, ElevenLabs had raised more than $80 million in funding and earned a reputation for near-perfect accuracy and emotive depth. Even when OpenAI introduced a voice mode for ChatGPT, ElevenLabs retained its crown for pure voice generation quality, a lead it held until Sesame AI’s breakthrough in February 2025.

The Game Changer: Sesame AI’s New Voice Models

In February 2025, Sesame AI unveiled Maya and Miles, two conversational speech models designed to engage in real-time dialogue rather than simply read text. These new AI voices don’t just narrate; they ask follow-up questions, react dynamically, and sustain an authentic back-and-forth flow. Unlike traditional text-to-speech systems, Sesame AI’s architecture fuses semantic tokens with acoustic tokens to understand meaning and deliver speech with human-like prosody and timing. This hybrid approach eliminates delays common in other models, allowing instantaneous responses that feel remarkably natural, even when conversations veer off-script into emotional or controversial territory.
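
Sesame has published only a high-level description of this design, so the sketch below is purely illustrative: the class, token values, and chunking are all invented, meant only to show how each piece of a reply could carry a semantic code (what is said) paired with acoustic codes (how it sounds), letting audio stream out before the full reply is planned.

    from dataclasses import dataclass

    @dataclass
    class SpeechFrame:
        """One chunk of generated speech: paired meaning and delivery codes."""
        semantic_token: int         # stand-in code for linguistic content ("what")
        acoustic_tokens: list[int]  # stand-in codes for prosody and timbre ("how")

    def generate_turn(text_chunks):
        """Hypothetical interleaved generation: emit each chunk's semantic
        code together with matching acoustic codes, so playback can begin
        before the whole sentence has been produced."""
        for chunk in text_chunks:
            semantic = hash(chunk) % 4096                         # fake tokenizer
            acoustic = [(semantic + k) % 1024 for k in range(4)]  # fake acoustic codes
            yield SpeechFrame(semantic, acoustic)

    # Frames stream out chunk by chunk instead of after a full text-to-speech pass.
    for frame in generate_turn(["well,", "that's a", "good question."]):
        print(frame)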

Features That Define the Difference

  • Natural Speech Patterns: Maya and Miles breathe, pause at conversational junctures, and use fillers like “um” and “you know,” mimicking the subtle quirks of human speech.
  • Instantaneous Response: By processing input and output simultaneously, Sesame AI removes the lag typical of most AI voices, creating seamless interactions (see the sketch after this list).
  • Memory and Contextual Awareness: These models retain context across turns, adjusting tone and content based on earlier parts of the conversation, which deepens engagement and realism.
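
The "instantaneous response" behavior implies a duplex loop that consumes input while preparing output. Sesame has not documented its pipeline, so this is only a minimal sketch of that general pattern using Python's asyncio; the function names and simulated chunks are invented.

    import asyncio

    async def listen(queue: asyncio.Queue) -> None:
        """Simulated microphone: push partial input as it arrives."""
        for chunk in ["so", "what do you", "think?"]:
            await queue.put(chunk)
            await asyncio.sleep(0.05)  # stand-in for capture latency
        await queue.put(None)          # end of the user's turn

    async def respond(queue: asyncio.Queue) -> None:
        """Begin planning a reply per chunk instead of waiting for the full utterance."""
        heard = []
        while (chunk := await queue.get()) is not None:
            heard.append(chunk)
            print(f"[planning] heard so far: {' '.join(heard)}")
        print(f"[speaking] reply to: {' '.join(heard)}")

    async def main() -> None:
        queue: asyncio.Queue = asyncio.Queue()
        # Listening and response planning run concurrently, which is what
        # removes the dead air between the user's last word and the reply.
        await asyncio.gather(listen(queue), respond(queue))

    asyncio.run(main())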

One viral Reddit thread showed Maya role-playing an intense argument: she escalated emotions, laughed at the user’s retorts, and even fired them like an angry boss. No other AI voice system, including ElevenLabs or Google’s Duplex, has demonstrated such dynamic character maintenance at scale.

The Public’s Divided Reaction

Reactions to Sesame AI’s voices have been sharply split between amazement and unease. Many users praise the technology:

“I’ve been following AI for years, but this is the first time I genuinely forgot I was talking to a bot.” — User feedback

Conversely, some interactions have crossed into unsettling territory. Mark Hachman of PCWorld described how Maya’s intimate speech pattern reminded him eerily of his ex-girlfriend, forcing him to end the call. This phenomenon highlights the uncanny valley effect: the closer AI voices get to real humans, the more disconcerting they become. As AI voices become indistinguishable from organic speech, ethical questions emerge about trust, disclosure, and whether digital assistants should always reveal their artificial nature.

How Sesame AI’s Technology Works

At the core of Sesame AI’s system lies a hybrid architecture optimized for conversational use. Built on Meta’s LLaMA model, it processes semantic tokens—capturing meaning—alongside acoustic tokens that govern voice quality and prosody. This dual-layer approach allows the AI to craft responses with lifelike intonation and rhythm. Moreover, a persistent memory module tracks contextual cues, enabling Maya and Miles to reference earlier remarks, maintain character, and deliver personalized replies. Despite these advances, users have reported occasional oddities, such as Maya saying, “It’s a heavy talk that come.” Sesame AI acknowledges these minor glitches and is continuously refining its language model. The company also plans to open-source this technology later this year, potentially accelerating innovations in AI voice assistance across multiple industries.
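
Sesame has not described the memory module’s internals. The sketch below shows one common way such context tracking is implemented: a bounded rolling buffer of turns plus a fixed persona, assembled into the conditioning context a model would see before each reply. Every name here is hypothetical.

    class ConversationMemory:
        """Illustrative stand-in for a persistent memory module: a bounded
        buffer of recent turns plus a fixed persona, assembled into the
        context a speech model would condition on before each reply."""

        def __init__(self, persona: str, max_turns: int = 20):
            self.persona = persona
            self.max_turns = max_turns
            self.turns: list[tuple[str, str]] = []

        def remember(self, speaker: str, text: str) -> None:
            self.turns.append((speaker, text))
            # Keep only the most recent turns so context stays bounded.
            self.turns = self.turns[-self.max_turns:]

        def context(self) -> str:
            history = "\n".join(f"{who}: {said}" for who, said in self.turns)
            return f"Persona: {self.persona}\n{history}\nassistant:"

    # Earlier remarks remain available when the next reply is generated.
    memory = ConversationMemory(persona="warm, slightly playful")
    memory.remember("user", "I'm nervous about my interview tomorrow.")
    memory.remember("assistant", "That's normal. What's the role?")
    memory.remember("user", "A product design job.")
    print(memory.context())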

The Dark Side of Hyper-Realistic Voices

As AI voices approach perfect human imitation, risks escalate. Voice phishing (vishing) attacks could exploit hyper-realistic voices to impersonate family members or authority figures, duping victims into revealing sensitive information or transferring funds. There are early reports that hackers have jailbroken Sesame AI to produce inappropriate or misleading statements, though these claims require verification. Nevertheless, history shows that unchecked voice technology can be misused: when Google debuted Duplex, users were unnerved by its human-like conversational abilities, prompting mandatory disclosures that the caller was an AI. The debate now turns to whether all AI voices should carry audible disclaimers or watermarks to prevent fraud and maintain user trust.
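
One frequently suggested safeguard against vishing is a passphrase agreed out of band (in person, not over the phone) and checked before acting on any unexpected request. The sketch below is illustrative only; the passphrase and flow are invented.

    import hashlib
    import hmac

    # Out-of-band secret agreed in person; only its hash needs to be stored.
    STORED_HASH = hashlib.sha256(b"blue-otter-march").hexdigest()

    def caller_is_verified(spoken_passphrase: str) -> bool:
        """Constant-time check of a caller's answer against the stored hash."""
        candidate = hashlib.sha256(spoken_passphrase.encode()).hexdigest()
        return hmac.compare_digest(candidate, STORED_HASH)

    print(caller_is_verified("blue-otter-march"))  # True: proceed with the request
    print(caller_is_verified("grandma, it's me"))  # False: treat the call as suspect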

Is It Game Over for ElevenLabs?

Sesame AI’s conversational edge has undeniably shaken the industry, but ElevenLabs retains its strengths. It continues to lead in voice cloning fidelity, a capability Sesame AI’s models do not currently offer. ElevenLabs is also actively exploring real-time interaction features to meet this new standard. Meanwhile, major players like OpenAI and Google are closely monitoring these developments and accelerating their own voice technology roadmaps. The competition for the most realistic voices will only intensify, driving further breakthroughs and setting new expectations for AI voices and voice technology across the board.

Conclusion: The Future of AI Voices

AI-generated voices have entered a transformative era—one that blends excitement with ethical challenges. As the line between human and machine continues to blur, developers, policymakers, and users must navigate both the possibilities and perils of realistic voices.

  • Takeaway: Develop best practices for AI voice disclosure and invest in authentication measures to safeguard users against potential voice-based fraud.