How ElevenLabs Is Making Voice AI More Human

Insights from Mati Staniszewski's fireside chat at Elevate.

April 17, 2025

What’s in a name? If you’re childhood friends Mati Staniszewski and Piotr Dąbkowski — who co-founded AI audio company ElevenLabs in 2022 — a name can celebrate shared passions: for the mathematical properties of the number 11 (“it has great divisibility”), for the Apollo 11 moon landing, and for the movie This Is Spinal Tap, which popularized the phrase “turn it up to 11.”

In their own way, Mati and Piotr are turning it up to 11 — ElevenLabs reached $100M ARR in just two years. 

Growing up near Warsaw, Poland, Mati and Piotr often watched American movies dubbed in Polish. The disappointing quality of the voiceovers (a single narrator voiced every character) set the stage for their mission: to build the premier AI audio platform for creation, accessibility, and interaction.

Mati, who serves as ElevenLabs’ CEO, recently joined us at our annual CEO summit to share the company’s growth story and discuss the future of AI voice technology. 

Here were some of our takeaways from the conversation.

Building A Research Foundation

Mati attributes ElevenLabs’ success to deep research. Before launching the company, he and Piotr assessed the text-to-speech solutions in the market. Their verdict? The available solutions weren’t scalable, controllable, or high enough quality to sound human.

With so few viable options on the market, Piotr — who comes from an AI research background — decided to build ElevenLabs’ AI models from scratch, with an initial focus on improving text-to-speech. The resulting model was able to infer context and deliver the audio accordingly.

“That was our key breakthrough,” said Mati. “That first text-to-speech model could better understand the context.”

Mati and Piotr launched their model just before OpenAI debuted ChatGPT, when AI interest was about to explode.

Knowing When to Wait — and When to Go 

AI development is a constant give and take: Should you build now, or wait for the research to progress? Given the rate of AI advancement, building now risks product obsolescence if another big research breakthrough arrives. On the other hand, companies that spend all their time on research will never find traction in the market.

Mati said this tension is one of ElevenLabs’ biggest ongoing challenges. For instance, when the team was working on improving speed control in voice synthesis, they deliberately held off on a quick fix in favor of a more principled research approach. Rather than opting for the “dirty solution” of uniformly slowing down speech — which produces a flat, monotonous sound — they aimed to build a model that could vary speed naturally across different syllables and words. It took two years, but the team ultimately delivered, and users responded positively to the more lifelike result.

However, the team didn’t wait when it came to pronunciation edits: there, retraining the model was more effective than spending months on a product-level implementation.

Quick product solutions and deeper research investments both have their pros and cons; the right choice depends on the situation, your customers, and your understanding of the market.

Making Voice Technology Sound Human

ElevenLabs’ goal is to create voices that sound truly human—capturing the full spectrum of elements like emotion, tone, pitch, accent, and even subtle imperfections. Each of these factors contributes to how meaning is conveyed. Different cultures and languages also bring unique expectations around how a voice should sound, making these nuances even more important.

As such, the ElevenLabs team has adapted voices for different use cases and demographics — such as slower-paced voices, which older callers respond to better, or regional accents (European vs. Latin American) for Spanish speakers.

Mati said the aim isn’t necessarily perfect voices — imperfections in AI voices can actually be more engaging. One ElevenLabs client even tweaked their agent to include “umms” and natural hesitations. The more “imperfect” voice resonated with users, earning stronger feedback and sparking interest from other clients eager to replicate the effect.

“With voice, you can give so much more information,” Mati said. “You can get the emotions coming through different intonation patterns, different pitch patterns, imperfections… they all give a little bit more of that signal across what you’re trying to convey.”
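The “umms” tweak is easy to picture in code. Below is a minimal sketch, assuming the agent’s reply text is post-processed before synthesis; the function name, filler list, and rate are an illustration, not the client’s actual implementation.

```python
import random

# Illustrative only: sprinkle fillers into an agent's reply before it is
# sent to a TTS model. Not ElevenLabs' or the client's actual code.
FILLERS = ("umm, ", "hmm, ", "uh, ")

def add_hesitations(text: str, rate: float = 0.2, seed: int = 11) -> str:
    """Prefix a random fraction of sentences with a filler word."""
    rng = random.Random(seed)
    sentences = text.split(". ")
    out = []
    for s in sentences:
        if s and rng.random() < rate:
            s = rng.choice(FILLERS) + s[0].lower() + s[1:]
        out.append(s)
    return ". ".join(out)

print(add_hesitations("Sure, I can help with that. Let me pull up your account. It will only take a moment."))
```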

Creating a Hybrid Go-to-Market Strategy

From the beginning, ElevenLabs combined product-led growth with an enterprise sales motion.

“This approach was directly connected to our research work. Since we were investing so much time building advanced models, we needed a way to get them into users’ hands quickly,” Mati said. “The best way to do that was through an open platform with an intuitive interface that would allow people to easily experience and test our research innovations.”

The platform’s simple interface and openness allowed users to surface unexpected use cases on their own. One indie author pasted his entire book into ElevenLabs’ text box, stitched the audio together manually, and ended up with a surprisingly well-received audiobook.

His success sparked a wave of interest from other authors eager to do the same, and ElevenLabs leaned into the momentum, building a dedicated studio tailored for long-form narration.
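For readers curious what that manual workflow looked like, here is a rough sketch in Python. The endpoint and request fields reflect ElevenLabs’ public text-to-speech REST API as documented at the time of writing; the API key, voice ID, and chapter text are placeholders, and byte-level concatenation is a crude stand-in for the author’s manual stitching.

```python
import requests

API_KEY = "your-xi-api-key"   # placeholder
VOICE_ID = "your-voice-id"    # placeholder
URL = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"

def synthesize(text: str) -> bytes:
    """Synthesize one chunk of text to MP3 bytes via the public TTS endpoint."""
    resp = requests.post(
        URL,
        headers={"xi-api-key": API_KEY},
        json={"text": text, "model_id": "eleven_multilingual_v2"},
    )
    resp.raise_for_status()
    return resp.content

def narrate(chapters: list[str], out_path: str = "audiobook.mp3") -> None:
    """Synthesize each chapter and append the audio to one file."""
    with open(out_path, "wb") as f:
        for chapter in chapters:
            # Crude stitching: most players tolerate concatenated MP3 frames.
            f.write(synthesize(chapter))

narrate(["Chapter one text...", "Chapter two text..."])
```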

Enterprise customers also identified new product needs. When a healthcare company was building a nurse calling system, they pointed out that conversation flow and latency were just as important as voice quality.

In response, ElevenLabs built a framework for configuring and orchestrating voice agents that integrates not just text-to-speech but also speech-to-text capabilities. They released the technology to all developers and creators, and now 500K agents have been built this way.
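To make the moving parts concrete, here is a hypothetical configuration for such an agent. The field names are illustrative only and do not reflect ElevenLabs’ actual agent schema; the point is that speech-to-text, the LLM, text-to-speech, and conversation-flow settings like turn-taking and latency all have to be orchestrated together.

```python
# Hypothetical voice-agent configuration; every field name here is
# illustrative and does not reflect ElevenLabs' actual schema.
agent_config = {
    "speech_to_text": {"model": "stt-v1", "language": "en"},
    "llm": {
        "provider": "your-llm-provider",
        "system_prompt": "You are a nurse calling to check on a patient.",
    },
    "text_to_speech": {"voice_id": "your-voice-id", "model": "tts-v2"},
    # The healthcare customer's point: conversation flow and latency
    # matter as much as voice quality.
    "turn_taking": {"interruptible": True, "end_of_speech_silence_ms": 700},
    "latency": {"stream_audio": True, "target_first_byte_ms": 300},
}

print(agent_config["turn_taking"])
```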

Evolving AI Voice Technology

When Mati imagines the future of AI voice technology, he pictures greater emotional intelligence. He also envisions a move toward multi-modal models, where everything is trained together, instead of the current three-step solution (speech to text → LLM processing → text to speech).
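That three-step pipeline is easy to sketch; the stubs below stand in for separate speech-to-text, LLM, and text-to-speech model calls and are not specific ElevenLabs APIs. Each hop adds latency and discards paralinguistic signal such as tone and emotion, which is exactly what a single model trained end to end on audio could preserve.

```python
# Minimal sketch of today's three-step voice pipeline. Each function is
# a stub standing in for a separate model call.
def transcribe(audio: bytes) -> str:
    return "What time do you open?"     # 1. speech to text (tone is lost here)

def generate_reply(text: str) -> str:
    return "We open at nine tomorrow."  # 2. LLM processing (text in, text out)

def synthesize(text: str) -> bytes:
    return text.encode()                # 3. text to speech (emotion re-inferred)

def voice_turn(audio_in: bytes) -> bytes:
    return synthesize(generate_reply(transcribe(audio_in)))

print(voice_turn(b"...caller audio..."))
```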

It’s all in service of ElevenLabs’ ongoing vision: making all content accessible across languages while preserving original voices, emotions, and connections.

Thank you, Mati, for attending Elevate and sharing your candor and insights. 


To learn more about ElevenLabs, visit their website. To receive more Salesforce Ventures content directly in your inbox, sign up for our newsletter.