Step 7 of 9: TTS Orchestrator

Voice Agent

Generates expressive, multi-character voiceovers with fine-tuned gender, accent, and emotional matching.

Overview

The Voice Agent takes the voiceover script and the character voice profiles and generates text-to-speech audio for each scene. By default it uses the Edge TTS engine, producing voiceovers whose emotional tone, pacing, and emphasis suit each character. It handles multi-character dialogue by switching between voice profiles, and it adjusts speech rate and pitch to match the scene's mood: slower and lower for dramatic moments, faster and higher for exciting sequences.

The agent supports multiple TTS providers: Edge TTS (free, no API key required, 300+ voices across 70+ languages), OpenAI TTS (premium quality), and custom voice cloning. It processes each scene's narration independently, applies SSML markup for fine-grained prosody control, and outputs audio files synchronized to the storyboard timing.

Audio normalization keeps volume levels consistent across all scenes and characters. The agent also corrects the pronunciation of proper nouns and technical terms, inserts breathing pauses for natural speech rhythm, and generates separate audio tracks for narration, dialogue, and sound effects, giving the Media Agent full control over audio mixing and spatial positioning in the final video.
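The mood-based rate and pitch adjustment described above could be sketched roughly as follows, using the open-source `edge-tts` Python package. The preset values, the `MOOD_PROSODY` table, and the `prosody_for` and `synthesize` helpers are illustrative assumptions, not the agent's documented settings.

```python
# Hypothetical mood presets; the values the Voice Agent really uses are not documented.
MOOD_PROSODY = {
    "dramatic": {"rate": "-15%", "pitch": "-10Hz"},  # slower and lower
    "neutral": {"rate": "+0%", "pitch": "+0Hz"},
    "exciting": {"rate": "+15%", "pitch": "+10Hz"},  # faster and higher
}


def prosody_for(mood: str) -> dict:
    """Return rate/pitch settings for a mood, falling back to neutral."""
    return MOOD_PROSODY.get(mood, MOOD_PROSODY["neutral"])


async def synthesize(text: str, voice: str, mood: str, out_path: str) -> None:
    """Render one line of the script to an audio file with edge-tts."""
    import edge_tts  # third-party package: pip install edge-tts

    settings = prosody_for(mood)
    communicate = edge_tts.Communicate(
        text, voice, rate=settings["rate"], pitch=settings["pitch"]
    )
    await communicate.save(out_path)
```

A single line could then be rendered with something like `asyncio.run(synthesize("The door creaked open.", "en-US-GuyNeural", "dramatic", "scene_01.mp3"))`. Keeping the presets in one table makes it easy to tune them per project without touching the synthesis call.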

Input

Voiceover script, character voice profiles
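To make the two inputs concrete, here is one possible shape for them. The `VoiceProfile` and `ScriptLine` classes, their field names, and the narrator fallback are assumptions made for illustration; the agent's real schema is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceProfile:
    character: str
    voice: str   # TTS voice name, e.g. an Edge TTS short name
    gender: str
    accent: str


@dataclass(frozen=True)
class ScriptLine:
    scene: int
    character: str
    text: str
    emotion: str = "neutral"


def assign_voices(lines, profiles, default_character="narrator"):
    """Pair each script line with its speaker's profile.

    Lines spoken by a character without a profile fall back to the
    narrator's profile (an assumed policy, not a documented one).
    """
    by_name = {p.character: p for p in profiles}
    fallback = by_name[default_character]
    return [(line, by_name.get(line.character, fallback)) for line in lines]


profiles = [
    VoiceProfile("narrator", "en-GB-SoniaNeural", "female", "British"),
    VoiceProfile("hero", "en-US-GuyNeural", "male", "American"),
]
script = [
    ScriptLine(1, "narrator", "Night fell over the harbor."),
    ScriptLine(1, "hero", "We leave at dawn.", emotion="determined"),
]
pairs = assign_voices(script, profiles)
```

Switching voices in a multi-character dialogue then amounts to reading the profile attached to each line before it is synthesized.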

Output

Audio files per scene with character-specific voices
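One way the per-scene output could be organized is a manifest that records a file path, track, and start offset for every clip. The directory layout, file naming, and the `build_manifest` helper below are hypothetical. The sketch also lays clips end to end within a scene, whereas in practice the offsets would presumably come from the storyboard timing.

```python
from pathlib import PurePosixPath


def build_manifest(clips, out_dir="audio"):
    """Assign each clip a per-scene path and a start offset within its scene.

    `clips` is a list of (scene, track, character, duration_seconds) tuples
    in playback order. Clips in the same scene are placed back to back.
    """
    manifest = []
    offsets = {}   # scene -> running offset in seconds
    counters = {}  # scene -> number of clips seen so far
    for scene, track, character, duration in clips:
        index = counters.get(scene, 0) + 1
        counters[scene] = index
        start = offsets.get(scene, 0.0)
        offsets[scene] = start + duration
        name = f"{index:02d}_{track}_{character}.mp3"
        path = PurePosixPath(out_dir) / f"scene_{scene:02d}" / name
        manifest.append({
            "path": str(path),
            "scene": scene,
            "track": track,
            "character": character,
            "start": start,
            "duration": duration,
        })
    return manifest


manifest = build_manifest([
    (1, "narration", "narrator", 2.5),
    (1, "dialogue", "hero", 1.0),
    (2, "narration", "narrator", 3.0),
])
```

Recording the track name alongside each file lets a downstream mixer treat narration, dialogue, and sound effects separately.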

Tools

Edge TTS engine, audio processing
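As an illustration of the audio processing step, the following sketch normalizes a clip to a target RMS level while guarding against clipping. It assumes decoded float samples in the range [-1.0, 1.0]. RMS matching with a peak limit is one common approach, chosen here for simplicity; the method the agent actually uses is not stated, and the -20 dBFS target is an arbitrary example value.

```python
import math


def rms(samples):
    """Root-mean-square level of float samples in the range [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_rms(samples, target_dbfs=-20.0, peak_limit=1.0):
    """Scale samples so their RMS level matches target_dbfs without clipping."""
    level = rms(samples)
    if level == 0.0:
        return list(samples)  # silence: nothing to scale
    target = 10 ** (target_dbfs / 20.0)  # convert dBFS to linear amplitude
    gain = target / level
    peak = max(abs(s) for s in samples)
    # Reduce the gain if it would push the loudest sample past the limit.
    gain = min(gain, peak_limit / peak)
    return [s * gain for s in samples]
```

Applying the same target to every clip brings quiet and loud voices to a comparable level before mixing. A production pipeline would more likely rely on an audio library or a perceptual loudness measure than on raw RMS.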