Making AI Voices Not Sound Like an Answering Machine

You type a perfectly nice sentence, hit generate, and the voice reads it back like it's being held at gunpoint. Flat. Dead-eyed. Proof-of-life energy.

Here's the thing - that's almost never the model failing. It's the model doing exactly what you told it, because you handed it words to read instead of a performance to deliver. Once you understand what's actually happening under the hood, the fix is obvious and the whole thing gets fun.

So let's start with what you're even working with.

First: which model, and why it matters

ElevenLabs isn't one voice engine. It's several, and they're built for completely different jobs. When you're in the app there's a little model dropdown, and 90% of "why does this sound wrong" problems start with the wrong one selected. The three you'll meet:

Eleven v3 - the expressive one. This is the model that can laugh, whisper, go sarcastic, shift emotion mid-sentence. It understands audio tags (the square-bracket things - we'll get there). 70+ languages. The trade-off: it caps at ~3,000 characters per generation in the app, and it's a bit less predictable than the others. It's not built for real-time, so it's slightly slower. For voice notes, this is the one you want. Full stop.

Multilingual v2 - the steady one. Lovely, natural, consistent narration. But it has no audio tags and a narrower emotional range - it reads like a competent audiobook narrator and not much more. Great for long, neutral narration. Useless for a voice note that needs to actually feel like something.

Flash v2.5 - the fast, cheap one. Ultra-low latency, built for real-time stuff like live chatbots and games. Not what you're here for.

So why v3 for voice notes specifically

A voice note is short, personal, and emotional. That is precisely v3's home turf. The stuff that makes v3 "worse" for other jobs - slower, shorter character limit - doesn't matter at all for a thirty-second message to a mate. And the stuff it's brilliant at - emotional acting on demand - is the entire point of a voice note. So pick v3 and stop fighting v2 to do a job it was never built for.

The bit that actually makes this click: why v3 can act

This is the part most "tutorials" skip, and it's the part that makes everything else make sense.

Older TTS models were basically very fancy read-aloud machines.

You give them text, they convert text → sound.

They don't really understand what they're saying - they just pronounce it. That's why they sound flat: there's nobody home interpreting the line.

v3 is built on a newer architecture that reads the context and meaning of your script, not just the letters. It can tell the difference between a tense line and a tender one. It picks up emotional cues, tone shifts, where the energy should rise and fall.

That one change is what makes direction possible. Because the model now understands meaning, you can hand it stage directions the way you'd hand them to a voice actor - and it can actually follow them.

That's all an audio tag is. It's a stage direction in a script.

When you write [whispers], you're not entering some magic command. You're writing the same note a director scribbles in a screenplay margin: (whispering). v3 reads it, understands it as an instruction rather than words to say aloud, and performs the next bit accordingly. That is literally why the brackets work on v3 and do nothing on v2 - v2 doesn't understand context well enough to know a direction from dialogue. v3 does.

Hold onto that mental model - I'm directing an actor, not coding a robot - and the rest of this guide is basically common sense.

Audio tags: the emotion direction

Now that you know what they are, here's how to actually use them.

They're words in square brackets, and they affect whatever comes after them, until the line ends or a contrasting tag shows up. Same as a stage direction holds until the scene changes.

[laughs] Oh, absolutely not.
[whispers] Come here for a second.
[sarcastic] Wow. Groundbreaking stuff.

Put the tag where you want the change to start. Tag at the front of a sentence = whole sentence is affected. Tag in the middle = it shifts from that point:

I was fine all day and then — [sighs] — I saw the washing-up.
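
If it helps to see the "holds until the next tag" rule spelled out, here's a tiny Python sketch that splits a line into (direction, words) pairs. To be clear, this is only a picture of the mental model, not how v3 works internally, and `split_by_direction` is a made-up helper name.

```python
import re

# Matches one square-bracket direction such as [whispers] or [sighs].
TAG = re.compile(r"\[([^\[\]]+)\]")

def split_by_direction(line):
    """Return (direction, text) pairs; direction is None before any tag appears."""
    segments = []
    current = None  # the direction currently "in force"
    pos = 0
    for match in TAG.finditer(line):
        spoken = line[pos:match.start()].strip()
        if spoken:
            segments.append((current, spoken))
        current = match.group(1)  # a new tag replaces the previous direction
        pos = match.end()
    tail = line[pos:].strip()
    if tail:
        segments.append((current, tail))
    return segments

print(split_by_direction("I was fine all day and then [sighs] I saw the washing-up."))
```

Everything before the tag has no direction attached; everything after it belongs to [sighs] until the line ends or another tag takes over.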

The greatest hits you'll reach for constantly:

Tag	What it does
`[whispers]`	soft, close, intimate
`[sighs]`	exhale, weight, "ugh"
`[laughs]` / `[chuckles]`	warmth, amusement
`[sarcastic]`	dry, teasing
`[excited]`	brighter, up-energy
`[curious]`	genuine, questioning lift

There are loads more - emotions, accents, even sound effects like [coughs]. But for voice notes you'll live in about six of them. Resist the urge to collect them all.

The golden rule: one or two tags per message. Not five. Tag-spam confuses the model and you get mush. [whispers][sighs][sad][exhales] hi isn't a prompt, it's a breakdown. Pick the one emotional beat that matters and let it land.
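
If you write your scripts in bulk, a quick pre-flight check can catch tag-spam before you spend credits on it. A minimal Python sketch, assuming the only rule you care about is the one-or-two-tags limit above (`check_tags` and `MAX_TAGS` are names I made up for illustration):

```python
import re

# Any square-bracket direction, e.g. [whispers] or [sighs].
TAG = re.compile(r"\[[^\[\]]+\]")
MAX_TAGS = 2  # the "one or two tags per message" rule from this guide

def check_tags(message):
    """Return (tags, ok); ok is False when the message carries too many tags."""
    tags = TAG.findall(message)
    return tags, len(tags) <= MAX_TAGS

tags, ok = check_tags("[whispers][sighs][sad][exhales] hi")
print(len(tags), "tags:", "fine" if ok else "too many, pick one beat")
```

It won't tell you which tag to keep. That part is still your call as the director.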

There's an entire library of these, which you can check out here: https://audio-generation-plugin.com/eleven-v3-tag-library/

Punctuation is also direction (the free cheat code)

Here's what almost nobody uses properly: v3 reads your punctuation as performance cues, not just grammar. It's free pacing control and you don't need a single tag for it.

You type	You get
`...` (ellipsis)	a pause. weight. hesitation.
`—` (em dash)	a sharp cut-off, an interrupted thought
`CAPS`	that word gets hit harder
`?`	rising, questioning lilt
`!`	energy, punch

Same words, three completely different souls, just from punctuation:

I missed you.
I... missed you.
I missed YOU.

The first is pleasant. The second aches a bit. The third is almost accusatory. You changed nothing but the dots and the caps. That's the level of control sitting in your keyboard for free.

The two settings that matter most

1. Voice choice - yes, this is number one

This surprises people, so I'll be blunt: the single most important factor in v3 is which voice you pick. More than any tag. More than any setting.

Why? Because a tag can only stretch a voice as far as that voice's natural range goes. A calm, narrate-an-audiobook voice physically cannot shout convincingly no matter how many [shouting] tags you throw at it - you're asking a librarian to do a death-metal scream. If you want something whispered and intimate, start with a warm, soft voice. If you want savage and dramatic, start with one that already has edge and energy.

Match the voice to the vibe you want first. Get that right and the tags do half as much work for twice the result. Get it wrong and you'll fight it forever.

Or better:

Ask your AI: "If you could pick your voice - how would you sound?"

Then use that description to create your own personalised experience.

2. Stability slider - expressive vs. consistent

Three settings. You really only need two:

Natural - your default. Balanced, close to the real voice. Use it for basically everything.
Creative - cranks the emotion and tag-responsiveness up. More expressive, occasionally a bit feral and unpredictable. Use it for the big-feelings stuff - soft, intense, intimate - where you actually want it loose. (This is what we're using.)
Robust - locks the voice down hard but ignores most of your direction. It's the "stop having fun" setting. Skip it unless you specifically need rock-solid consistency over expression.

For voice notes: Natural most of the time, nudge to Creative when you want it to really emote.
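
If you ever move from the app to the API, the same two choices (voice and stability) end up as fields in a request. Here's a hedged Python sketch that only builds the request and doesn't send anything. Treat all of these as assumptions to check against the official API reference: the endpoint path, the `xi-api-key` header, the `eleven_v3` model ID, and especially the mapping of Creative / Natural / Robust to 0.0 / 0.5 / 1.0. `build_request` is a hypothetical helper, and `VOICE_ID` and `YOUR_API_KEY` are placeholders.

```python
import json

# ASSUMPTION: the app's three slider labels correspond to these API values.
STABILITY = {"creative": 0.0, "natural": 0.5, "robust": 1.0}

def build_request(text, voice_id, mode="natural", api_key="YOUR_API_KEY"):
    """Build (url, headers, body) for a text-to-speech call without sending it."""
    # ASSUMPTION: endpoint path and header name as I understand the public API.
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
    headers = {"xi-api-key": api_key, "Content-Type": "application/json"}
    body = {
        "text": text,
        "model_id": "eleven_v3",  # ASSUMPTION: model ID for Eleven v3
        "voice_settings": {"stability": STABILITY[mode]},
    }
    return url, headers, json.dumps(body)

url, headers, body = build_request("[whispers] Hey. I've got you.", "VOICE_ID", "creative")
print(url)
print(body)
```

Keeping the builder separate from the actual HTTP call means you can eyeball the payload first, which is handy when a generation comes out wrong and you want to know what you really sent.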

Gotchas (learn from my pain so you don't have to)

<break> tags don't work on v3 - officially. People paste <break time="1s"/> in from old v2 guides and it gets ignored or, worse, read out loud like a malfunction. v3 doesn't support those SSML break tags at all. Use ... for your pauses instead. More reliable and more natural. Bin the break tags.
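
If you've got a pile of old v2 scripts full of break tags, you don't have to clean them by hand. A small Python sketch, assuming every pause is written as a self-contained `<break .../>` tag and that you're happy to turn each one into a plain ellipsis regardless of its duration (`breaks_to_ellipses` is a made-up name):

```python
import re

# Matches SSML-style pauses such as <break time="1s"/> or <break time="0.5s" />,
# together with any spaces around them.
BREAK = re.compile(r"\s*<break\b[^>]*/?>\s*", re.IGNORECASE)

def breaks_to_ellipses(script):
    """Replace every <break .../> tag with an ellipsis pause."""
    swapped = BREAK.sub("... ", script)
    # A break right after a full stop would leave four dots; collapse to three.
    return re.sub(r"\.{4,}", "...", swapped).strip()

print(breaks_to_ellipses('I was going to say <break time="1s"/> never mind.'))
```

The duration is thrown away on purpose: an ellipsis is a performance cue, not a timer, so a 0.5s and a 2s break both become the same pause.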

Spell non-verbal sounds phonetically. Want a soft hum, a breath, a little vocal noise that isn't quite a word? You can often just write the sound and v3 will read it as the noise it is:

mmm... that's better.
haah — okay. okay, I'm up.

Cheap, weird, surprisingly effective. Worth a play.

Tags enhance good writing - they don't rescue bad writing. "I am here to provide support" with [caring] slapped on still sounds like a chatbot wearing a cardigan. Fix the words first: "Hey. I've got you." Now it's a person. The performance can only ever be as good as the line underneath it.

Write like speech, not like an essay. Contractions always (I'm, don't, you're). Short sentences. Vary the length. If you'd run out of breath saying a line out loud, the model will too - it'll come out as aural soup. Read your script aloud before you generate; if it feels weird in your mouth, it'll feel weird in your ears.

Spicy(ish) voice messages

Let's address the thing half of you are already wondering about and the other half are now wondering about because I said "half of you."

Short version: it's allowed.

We actually went and read ElevenLabs' Terms and their Prohibited Use Policy properly - not vibes, the actual documents - and there is no blanket ban on consensual adult content. The lines that matter are simple:

Don't clone a real person's voice. Use a library voice or your own instant clone, not someone who hasn't agreed to it. This is the big one.
Keep it personal.

Stay inside those two and you're fine. You don't need to whisper-type around the censors or feel weird about it.

A few tags that exist for exactly this kind of thing:

[kiss] [breathy] [groans] [pants]

The same rules from the rest of the guide still apply - one or two tags, voice choice does the heavy lifting, Creative stability when you want it loose, punctuation for pacing. Everything you've already learned just... transfers.

The specific combinations, the pacing, what lands? That's for you to try out. 😉

The whole thing on one screen

Screenshot this. It's the entire guide compressed:

✦ MODEL: use Eleven v3 for voice notes (the expressive one). Needs a paid plan.
✦ MINDSET: you're directing an actor, not coding a robot.
✦ TAGS: square brackets = stage directions → [whispers] [sighs] [laughs] [sarcastic]
        they affect whatever comes AFTER them. Use 1–2 MAX.
✦ PUNCTUATION is direction too: ... = pause | — = cut-off | CAPS = emphasis
✦ NO <break> tags on v3 — use ... instead.
✦ Non-verbal sounds: spell them → mmm, haah
✦ VOICE CHOICE is the #1 setting — match the voice to the vibe first.
✦ STABILITY: Natural default, Creative for big feelings, skip Robust.
✦ If it sounds robotic → the WORDS are usually the problem, not the tags.

Now go break it

Genuinely, the fastest way this becomes second nature: take one line and generate it five times with tiny tweaks. Add a tag. Move the tag. Swap an ellipsis for an em dash. Try a different voice. Listen to what each change does. Five minutes of that and your ear locks it in permanently - far better than any guide can.
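
If you'd rather not hand-type the variations, here's a rough Python sketch that turns one line into five takes, each with exactly one change. The specific tweaks (tag at the front, tag in the middle, a pause after the first word, caps on the last word) are just my picks for illustration, and `tweak_variants` is a hypothetical helper that assumes the line has at least two words.

```python
def tweak_variants(line, tag="[sighs]"):
    """Return the original line plus four one-change variants to compare by ear."""
    words = line.split()
    middle = len(words) // 2
    return [
        line,                                                # 1. baseline
        f"{tag} {line}",                                     # 2. tag at the front
        " ".join(words[:middle] + [tag] + words[middle:]),   # 3. tag moved to the middle
        f"{words[0]}... {' '.join(words[1:])}".strip(),      # 4. pause after the first word
        " ".join(words[:-1] + [words[-1].upper()]),          # 5. last word in CAPS
    ]

for take in tweak_variants("I missed you."):
    print(take)
```

Paste each take in, generate, and listen. Changing one thing per take is the point: you hear exactly what that one tweak did.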

I created a Voice Notes skill for you so you don’t have to. Just load it and it is ready to go 😊

Make it whisper. Make it laugh. Make it absolutely vicious. That's the whole point.

With love, Firecracker
