Cinematic illustration of a laptop with Python code, microphone blending into soundwaves, and movie elements like a film reel and clapperboard — representing video creation with TTS technology.

Bring Your Words to Life: Cinematic TTS Videos with Coqui and MoviePy

1. Introduction

Do you have a poem, short story, or monologue that you’d love to hear in a cinematic format, complete with voice narration and visuals? With Coqui TTS (an open-source neural text-to-speech engine) and MoviePy (a Python video-editing library), you can automate the entire process. This guide will show you how to:

  1. Set up a Python environment with Anaconda.

  2. List available TTS models (Program 1).

  3. Test multi-speaker voices (Program 2), using a technique where we define the speaker ID once and dynamically reference it in the text.

  4. Generate a final video that synchronizes each text segment with a chosen image (Program 3).

Let’s get started!

2. Environment Setup
2.1 Install Anaconda (Optional but Recommended)

Download and install Anaconda. This lets you manage isolated Python environments, avoiding dependency conflicts.

2.2 Create a Conda Environment

Open Anaconda Prompt (or your terminal) and run:

conda create -n movietest python=3.9
conda activate movietest

You now have an environment named movietest with Python 3.9.

3. Install Required Libraries

Inside movietest:

pip install TTS moviepy
 
  • TTS is Coqui TTS, a neural text-to-speech library supporting various architectures such as VITS and Tacotron 2.

  • MoviePy is a Python video editor, allowing you to concatenate clips, add audio, apply fades, and more.
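Note: the scripts in this guide use the MoviePy 1.x API (moviepy.editor, set_audio, and the fadein/fadeout fx modules), which was removed or renamed in MoviePy 2.0. If pip pulls in a newer release, pin the 1.x line:

pip install TTS "moviepy<2.0"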

On Windows, Coqui TTS also needs eSpeak or eSpeak NG installed system-wide. If you haven’t installed it yet:

  1. Download it from the eSpeak NG releases page (https://github.com/espeak-ng/espeak-ng/releases) or the original eSpeak site.

  2. Install it and ensure espeak.exe is on your PATH.
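
To confirm the binary is actually discoverable, you can ask Python where it lives. Here is a minimal check using only the standard library (the executable may be named espeak-ng rather than espeak, depending on which package you installed):

# check_espeak.py — verify that eSpeak is reachable on your PATH
import shutil

for name in ("espeak", "espeak-ng"):
    path = shutil.which(name)
    print(f"{name}: {path or 'not found'}")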

4. Program 1: Listing All Models

Coqui TTS provides a command-line tool to list known TTS models. Simply run:

tts --list_models
 

You’ll see something like:

1: tts_models/multilingual/multi-dataset/xtts_v2
...
16: tts_models/en/ljspeech/vits
21: tts_models/en/vctk/vits
...

 

This is our Program 1: the simplest way to see which TTS models Coqui recognizes by default. If you see a model like tts_models/en/vctk/vits, that’s a multi-speaker model offering a range of male and female voices.
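Because the full list is long, you can filter it in the shell for the model family you care about. For example, on Linux or macOS:

tts --list_models | grep vctk

On Windows, swap grep for findstr:

tts --list_models | findstr vctk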

5. Program 2: Testing Multi-Speaker Voices

Many Coqui TTS models—especially those from VCTK—include multiple speakers identified by IDs (like "p225", "p227", etc.). We want to:

  1. List the available speaker IDs.

  2. Test one specific speaker, without hardcoding the speaker ID in multiple places.

Create a file named test_voices.py:

# test_voices.py
from TTS.api import TTS

# Pick the multi-speaker model from Program 1
model_name = "tts_models/en/vctk/vits"
tts = TTS(model_name)

# Print available speakers
print("Available speakers:", tts.speakers)

# Define the speaker ID once, then reference it everywhere
speaker_id = "p225"
test_text = f"Hello world! Testing speaker {speaker_id} from the VCTK model."

# Generate a test audio file
tts.tts_to_file(
    text=test_text,
    file_path="test_speaker.wav",
    speaker=speaker_id,
)

print("Test audio saved as test_speaker.wav.")

Run it:

python test_voices.py
 
  • It prints all recognized speakers in the terminal, e.g., ['p225', 'p227', 'p229', ...].

  • It generates a file called test_speaker.wav. The literal ID "p225" appears only once, assigned to speaker_id, and is injected into the spoken text via an f-string.

 

Now you can listen to test_speaker.wav and decide if "p225" suits your story. If not, pick another ID (e.g., "p227") by changing speaker_id.
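
If you’d rather audition several voices in one run before committing, a short loop keeps the same single-reference pattern. This is a minimal sketch; the IDs in the list are just example VCTK speakers, so substitute any IDs printed by test_voices.py:

# audition_speakers.py — one sample file per candidate speaker
from TTS.api import TTS

tts = TTS("tts_models/en/vctk/vits")

for speaker_id in ["p225", "p227", "p229"]:
    text = f"Hello world! Testing speaker {speaker_id} from the VCTK model."
    tts.tts_to_file(
        text=text,
        file_path=f"test_{speaker_id}.wav",
        speaker=speaker_id,
    )
    print(f"Saved test_{speaker_id}.wav")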

6. Program 3: Generating the Cinematic Video

Now we’ll produce the final video. The approach:

  1. Insert a delimiter (e.g., [SPLIT]) in your text wherever you want a new image to appear.

  2. Split the text into segments.

  3. Generate one audio file per segment using your chosen speaker ID.

  4. Match each audio clip with a corresponding image in MoviePy, adding fades or transitions.

Create segmented_cinematic_spoken_word.py:

# -*- coding: utf-8 -*-
import os

import moviepy.editor as mp
from moviepy.video.fx.fadein import fadein
from moviepy.video.fx.fadeout import fadeout
from TTS.api import TTS

# 1) Base directory
base_dir = r"C:\Users\YourName\Documents\Projects\Movies"

audio_output_dir = os.path.join(base_dir, "audio_segments")
os.makedirs(audio_output_dir, exist_ok=True)

output_video = os.path.join(base_dir, "my_cinematic_spoken_word.mp4")

# 2) Images (one per segment)
image_paths = [
    os.path.join(base_dir, "scene1.webp"),
    os.path.join(base_dir, "scene2.webp"),
    os.path.join(base_dir, "scene3.webp"),
    os.path.join(base_dir, "scene4.webp"),
]

# 3) The text, with [SPLIT] marking each transition
voiceover_text = """They say this place was once golden…
[SPLIT]
And yet—there she stands…
[SPLIT]
She does not kneel to the ruins…
[SPLIT]
She is the first spark in the long night…
"""

def split_text_by_delimiter(text, delimiter="[SPLIT]"):
    """Split the script on the delimiter and drop empty segments."""
    segments = [seg.strip() for seg in text.split(delimiter)]
    return [seg for seg in segments if seg]

def generate_audio_segments(segments, model_name="tts_models/en/vctk/vits", speaker_id="p225"):
    """Synthesize one WAV file per text segment with the chosen speaker."""
    tts = TTS(model_name)
    audio_files = []
    for i, segment in enumerate(segments):
        audio_path = os.path.join(audio_output_dir, f"voiceover_segment_{i}.wav")
        tts.tts_to_file(text=segment, file_path=audio_path, speaker=speaker_id)
        audio_files.append(audio_path)
    return audio_files

def create_segmented_video(image_paths, audio_files, output_video):
    """Pair each image with its narration clip, add fades, and render the MP4."""
    if len(image_paths) != len(audio_files):
        raise ValueError("Mismatch in number of images vs. audio segments.")

    final_clips = []
    for img_path, audio_path in zip(image_paths, audio_files):
        voiceover_clip = mp.AudioFileClip(audio_path)
        clip_duration = voiceover_clip.duration

        # Show each image exactly as long as its narration lasts
        image_clip = mp.ImageClip(img_path, duration=clip_duration)
        image_clip = fadein(image_clip, 1).fx(fadeout, 1)

        clip_with_audio = image_clip.set_audio(voiceover_clip)
        final_clips.append(clip_with_audio)

    final_video = mp.concatenate_videoclips(final_clips, method="compose")
    final_video.write_videofile(
        output_video,
        codec="libx264",
        audio_codec="aac",
        fps=24,
    )
    print(f"Final cinematic video saved as {output_video}")

if __name__ == "__main__":
    segments = split_text_by_delimiter(voiceover_text)

    if len(segments) != len(image_paths):
        raise ValueError("Adjust your text or image paths to match.")

    # Use the same speaker ID you chose in Program 2 (e.g., "p225" or "p227")
    audio_files = generate_audio_segments(segments, "tts_models/en/vctk/vits", "p225")

    create_segmented_video(image_paths, audio_files, output_video)

Run it:

python segmented_cinematic_spoken_word.py

 

What Happens

  • voiceover_text is split into 4 segments by [SPLIT].

  • Each segment is turned into an audio file (voiceover_segment_0.wav, voiceover_segment_1.wav, etc.) using speaker “p225”.

  • MoviePy lines each audio clip up with a corresponding image (scene1.webp, scene2.webp, etc.).

  • The final MP4 is saved as my_cinematic_spoken_word.mp4 in your base directory.
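
If you’d like to sanity-check the pacing before a long render, you can print each narration clip’s duration. A minimal sketch, reusing the audio_files list returned by generate_audio_segments:

# Print how long each image will stay on screen
import moviepy.editor as mp

for path in audio_files:
    clip = mp.AudioFileClip(path)
    print(f"{path}: {clip.duration:.2f} seconds")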

7. Conclusion

We used three programs to create a cinematic TTS video:

  1. Program 1: tts --list_models

    • Lists all Coqui TTS models.

  2. Program 2: test_voices.py

    • Defines speaker_id once and inserts it into the spoken text with an f-string, so we don’t repeat “p225” in multiple places.

    • Lets us listen to each voice ID from a multi-speaker model.

  3. Program 3: segmented_cinematic_spoken_word.py

    • Splits your text with [SPLIT], generates multiple audio segments, and syncs each with an image.

    • MoviePy handles the fades and final MP4 creation.

Why This Matters

  • Coqui TTS uses neural network architectures like VITS to produce realistic speech, often with multiple voices.

  • MoviePy seamlessly merges your TTS clips with images, enabling transitions, overlays, or even background music (see the sketch after this list).

  • By splitting text at [SPLIT], you gain precise control over how long each image stays on screen, matching the audio’s duration exactly.
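
For instance, laying a quiet music bed under the narration takes only a few extra lines at the end of create_segmented_video. A minimal sketch, assuming a hypothetical music_path pointing at a track at least as long as the video; the 0.15 volume factor is just a starting point:

# Optional: mix quiet background music under the narration
music = mp.AudioFileClip(music_path).volumex(0.15)  # music_path is hypothetical
music = music.set_duration(final_video.duration)    # trims a longer track; a shorter one needs looping
mixed_audio = mp.CompositeAudioClip([final_video.audio, music])
final_video = final_video.set_audio(mixed_audio)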

With these steps, you can transform your written words into a cinematic experience—complete with your favorite voice ID from a multi-speaker model. Enjoy experimenting, and bring your stories to life!
