Advanced Prompting For Expressive TTS

Eleven Labs' v3 model enhances text-to-speech capabilities by enabling expressive features like whispers, laughter, and emotional depth through careful voice selection, stability settings, and prompt structuring. Best practices include using pre-tuned voices, maintaining a prompt length of 250-1,000 characters, and embedding audio tags for emotion and delivery. For multi-speaker dialogue, it's essential to label each speaker and choose distinct voices to ensure fluidity in conversation.

Uploaded by

jayesh

Section X: Eleven Labs V3 — Advanced Prompting for Expressive TTS

Overview:
Eleven Labs’ v3 model elevates text-to-speech from “just speaking” to full performance: whispers, laughter, accents, multi-speaker dialogue, and nuanced emotional depth. To harness it, combine the right voice, stability setting, and audio tags in your prompts.

1. Voice Selection

● Choose a “Best Voices for V3” entry (✔️). These are pre-tuned for maximum expressiveness.

● Custom voices (your own recordings) require fine-tuning before v3 can apply advanced emotional tags.

2. Stability Setting

Stability controls how closely the output adheres to the original voice versus how creatively it interprets the prompt:

●​ Creative → Most emotional & flexible (↑ hallucination risk).​

●​ Natural → Balanced, closest to recorded style.​

●​ Robust → Most stable, least responsive to tags.​

Tip: For tag-driven emotion, start at Natural or Creative; for repeatable automation, slide toward Robust.
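As a rough sketch of the tip above, the three presets can be treated as named values and picked from two questions: does the prompt lean on tags, and must runs be repeatable? The numeric values (0.0, 0.5, 1.0) and the `stability` field name are assumptions about how the API exposes the slider, and `choose_preset` and `voice_settings` are hypothetical helpers, not part of any SDK.

```python
# Assumed numeric values for the three stability presets; verify against the API docs.
STABILITY_PRESETS = {"creative": 0.0, "natural": 0.5, "robust": 1.0}


def choose_preset(tag_driven: bool, repeatable: bool) -> str:
    """Pick a starting preset (hypothetical helper that follows the tip above)."""
    if repeatable and not tag_driven:
        return "robust"    # most stable, least responsive to tags
    if tag_driven and not repeatable:
        return "creative"  # most expressive, higher hallucination risk
    return "natural"       # balanced middle ground when both needs apply


def voice_settings(preset: str) -> dict:
    """Build a settings fragment; the 'stability' field name is an assumption."""
    return {"stability": STABILITY_PRESETS[preset.lower()]}


if __name__ == "__main__":
    print(voice_settings(choose_preset(tag_driven=True, repeatable=False)))
```

Returning Natural when a prompt is both tag-driven and repeatable is a design choice of this sketch; adjust it after listening to a few generations.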

3. Prompt Structure & Length

● Prompts of 250–1,000 characters yield the most consistent, expressive output.

●​ Ellipses (…) introduce natural pauses.​

●​ Em-dashes (—) or parentheses add emphasis or side comments.​

●​ Capitalization highlights key words: e.g. “WOW,” “UNBELIEVABLE.”​
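The guidelines above are easy to check before a prompt is sent. The following is a minimal sketch, assuming plain Python strings; `length_status` and `rhythm_cues` are hypothetical helper names, and the regular expressions are only a rough approximation of "pause", "aside", and "emphasis".

```python
import re

MIN_CHARS, MAX_CHARS = 250, 1000  # range recommended in this section


def length_status(prompt: str) -> str:
    """Classify the prompt length against the recommended range."""
    n = len(prompt)
    if n < MIN_CHARS:
        return "too short"
    if n > MAX_CHARS:
        return "too long"
    return "ok"


def rhythm_cues(prompt: str) -> dict:
    """Count the punctuation and capitalization cues described above."""
    return {
        # Ellipsis character or three consecutive dots
        "pauses": len(re.findall(r"\u2026|\.{3}", prompt)),
        # Em-dash character or a parenthesized aside
        "asides": len(re.findall(r"\u2014|\([^()]*\)", prompt)),
        # Words of two or more capital letters, e.g. WOW
        "emphasis": re.findall(r"\b[A-Z]{2,}\b", prompt),
    }


if __name__ == "__main__":
    draft = "WOW... that was (almost) UNBELIEVABLE"
    print(length_status(draft), rhythm_cues(draft))
```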

4. Audio Tags

Embed tags inline, in square brackets, to dictate emotion, delivery, and non-verbal cues. Place them before or mid-sentence:

● Voice Cues:

○ [whispers]

○ [laughs] | [laughs harder] | [chuckles]

○ [sighs] | [exhales]

○ [sarcastic] | [curious] | [excited]

● Sound Effects:

○ [applause] | [gunshot] | [footsteps] | [music]

● Accent Tags:

○ [strong French accent] | [mild Australian accent] | etc.

● Multi-Speaker:

○ Speaker A: / Speaker B: to alternate voices in the same clip.

Example Prompt:

[whispers] Hey, Jessica…
[laughs] That is absolutely incredible!
[strong Australian accent] Fancy a cup of tea?
[applause] Thank you, everyone!
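When prompts are assembled programmatically, small helpers keep tags consistent. This sketch assumes tags are written in square brackets (for example `[whispers]`); `tagged`, `extract_tags`, and `strip_tags` are hypothetical names introduced here for illustration.

```python
import re

# Assumption: an audio tag is any bracketed span with no nested brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def tagged(tag: str, text: str) -> str:
    """Prefix a line of text with a single audio tag."""
    return f"[{tag}] {text}"


def extract_tags(prompt: str) -> list:
    """Return every tag in the prompt, lowercased, in order of appearance."""
    return [t.strip().lower() for t in TAG_PATTERN.findall(prompt)]


def strip_tags(prompt: str) -> str:
    """Remove tags and collapse all whitespace (including newlines) to single spaces."""
    return " ".join(TAG_PATTERN.sub(" ", prompt).split())


if __name__ == "__main__":
    prompt = "\n".join([
        tagged("whispers", "Hey, Jessica..."),
        tagged("laughs", "That is absolutely incredible!"),
    ])
    print(prompt)
    print(extract_tags(prompt))
```

`strip_tags` is useful for measuring how much spoken text a prompt contains once the tags are set aside.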

5. Multi-Speaker Dialogue

●​ Prefix each line: A: / B: / C:​

●​ Choose different V3 voices for each speaker.​

●​ Maintain Natural or Creative stability for fluid back-and-forth.​

Example:

A: [sighs] I may have tried to debug myself…
B: [sarcastic] Oh wow, you really did it this time.
A: [curious] Why does my voice keep glitching?
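A labeled script like the one above can be split into per-speaker turns and paired with voices. This is a minimal sketch, assuming each line starts with a speaker label followed by a colon; the voice IDs are placeholders, and the `voice_id`/`text` keys are an assumed shape for a dialogue request rather than a confirmed API contract.

```python
def parse_dialogue(script: str) -> list:
    """Split 'Label: text' lines into (speaker, text) pairs."""
    turns = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines between turns
        speaker, sep, text = line.partition(":")  # split at the first colon only
        if not sep or not speaker.strip() or not text.strip():
            raise ValueError(f"Missing speaker label or text: {line!r}")
        turns.append((speaker.strip(), text.strip()))
    return turns


def assign_voices(turns: list, voices: dict) -> list:
    """Attach a voice to every turn, enforcing one distinct voice per speaker."""
    speakers = {speaker for speaker, _ in turns}
    missing = speakers - set(voices)
    if missing:
        raise ValueError(f"No voice assigned for: {sorted(missing)}")
    used = [voices[s] for s in speakers]
    if len(set(used)) != len(used):
        raise ValueError("Each speaker needs a distinct voice")
    return [{"voice_id": voices[s], "text": t} for s, t in turns]


if __name__ == "__main__":
    script = "A: I may have tried to debug myself...\nB: Oh wow, you really did it this time."
    # Placeholder IDs; substitute real voice IDs from your account.
    print(assign_voices(parse_dialogue(script), {"A": "voice_a", "B": "voice_b"}))
```

Because the split happens at the first colon only, any tags or further colons inside a turn are passed through untouched.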

Best Practices Recap:

1.​ Voice: Pick a V3-verified voice.​

2.​ Stability: Start at Natural/Creative; adjust toward Robust for repeatability.​

3. Prompt Length: Keep prompts within 250–1,000 characters for nuance and consistency.

4.​ Tags: Sprinkle emotional, delivery, and sound-effect tags inline.​

5.​ Punctuation: Use ellipses and em-dashes for natural rhythm.​

6. Multi-speaker: Label each speaker and assign separate voices.
