Subtitle Generation¶

vac can generate subtitle files (SRT/VTT) in two ways:

STT-based: Using speech-to-text for word-level accuracy (Marp slides)
Timing-based: Using voiceover timing without STT (Browser videos)

Quick Start¶

Marp Slides (STT)

# Generate subtitles using speech-to-text
vac subtitle --audio audio/en-US/

# Output:
# subtitles/en-US.srt
# subtitles/en-US.vtt

Browser Video (No STT)

# Generate subtitles from voiceover timing
vac browser video --config demo.yaml --output demo.mp4 \
  --subtitles

# Output:
# demo.srt (alongside demo.mp4)

How It Works¶

Audio input: Reads MP3 files and manifest.json from audio/{lang}/
Speech-to-text: Uses Deepgram STT by default (ElevenLabs is selectable with --provider) for word-level timing
Subtitle generation: Creates SRT and VTT files with accurate timestamps
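
As a rough illustration of the last step, the sketch below renders a list of timed cues as SRT and WebVTT text. It is not vac's implementation: the function names and the (start, end, text) cue tuple are assumptions made for this example. The only format facts it relies on are that SRT separates milliseconds with a comma, while WebVTT uses a period and starts with a WEBVTT header line.

```python
def format_timestamp(seconds, vtt=False):
    """Format seconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (WebVTT)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    sep = "." if vtt else ","
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{ms:03d}"


def render_srt(cues):
    """Render (start, end, text) cues as SRT; cues are numbered from 1."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(blocks)


def render_vtt(cues):
    """Render (start, end, text) cues as WebVTT, which needs a header line."""
    blocks = ["WEBVTT\n"]
    for start, end, text in cues:
        timing = (
            f"{format_timestamp(start, vtt=True)} --> "
            f"{format_timestamp(end, vtt=True)}"
        )
        blocks.append(f"{timing}\n{text}\n")
    return "\n".join(blocks)
```

Both renderers take the same cue list, so one transcription pass can feed the .srt and the .vtt output.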

Timing Accuracy¶

The subtitle command transcribes the audio files with an STT provider (Deepgram by default) and extracts word-level timestamps. Because timing comes from the recorded speech rather than an estimate based on text length, subtitles stay aligned with the actual speech patterns.

Command Reference¶

vac subtitle [flags]

Flags:
  -a, --audio string        Audio directory containing manifest.json (required)
  -o, --output string       Output directory for subtitle files (default "subtitles")
  -l, --lang string         Language code (auto-detected from manifest if not specified)
      --provider string     STT provider: deepgram or elevenlabs (default: deepgram)
      --individual          Also generate individual subtitle files per slide

Examples¶

# Generate French subtitles
vac subtitle --audio audio/fr-FR/

# Custom output directory
vac subtitle --audio audio/zh-Hans/ --output subs/

# Keep individual slide subtitle files
vac subtitle --audio audio/en-US/ --individual

Multi-Language Workflow¶

# Step 1: Generate audio for each language
vac tts --transcript transcript.json --output audio/en-US/ --lang en-US
vac tts --transcript transcript.json --output audio/fr-FR/ --lang fr-FR

# Step 2: Generate subtitles for each language
vac subtitle --audio audio/en-US/
vac subtitle --audio audio/fr-FR/

# Step 3: Generate videos
vac video --input slides.md --manifest audio/en-US/manifest.json --output video/en-US.mp4
vac video --input slides.md --manifest audio/fr-FR/manifest.json --output video/fr-FR.mp4
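
For more than a couple of languages, the three steps can be scripted. The Python sketch below only assembles the command lines shown above and introduces no new flags; the helper name build_commands and its default file names are hypothetical.

```python
import shlex


def build_commands(languages, transcript="transcript.json", slides="slides.md"):
    """Return the tts, subtitle, and video commands for every language."""
    commands = []
    # Step 1: generate audio for each language
    for lang in languages:
        commands.append([
            "vac", "tts", "--transcript", transcript,
            "--output", f"audio/{lang}/", "--lang", lang,
        ])
    # Step 2: generate subtitles for each language
    for lang in languages:
        commands.append(["vac", "subtitle", "--audio", f"audio/{lang}/"])
    # Step 3: generate videos
    for lang in languages:
        commands.append([
            "vac", "video", "--input", slides,
            "--manifest", f"audio/{lang}/manifest.json",
            "--output", f"video/{lang}.mp4",
        ])
    return commands


if __name__ == "__main__":
    for command in build_commands(["en-US", "fr-FR"]):
        print(shlex.join(command))
```

Each entry is an argument list, so it can be passed to subprocess.run(command, check=True) to execute the steps in order.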

Output Structure¶

project/
├── audio/
│   ├── en-US/
│   │   ├── manifest.json
│   │   └── slide_*.mp3
│   └── fr-FR/
│       ├── manifest.json
│       └── slide_*.mp3
├── subtitles/
│   ├── en-US.srt
│   ├── en-US.vtt
│   ├── fr-FR.srt
│   └── fr-FR.vtt
└── video/
    ├── en-US.mp4
    └── fr-FR.mp4

Browser Video Subtitles¶

The vac browser video command supports built-in subtitle generation:

Options¶

Flag	Description	Requirements
`--subtitles`	Generate subtitles from voiceover timing	None
`--subtitles-stt`	Generate word-level subtitles using STT	Deepgram API
`--subtitles-burn`	Burn subtitles into video (permanent)	FFmpeg with libass
`--no-audio`	Generate video without audio (TTS used for timing)	None

FFmpeg libass Requirement

The --subtitles-burn flag requires FFmpeg compiled with libass support. Check with: ffmpeg -filters 2>&1 | grep subtitles

If not available, install via:

# macOS
brew uninstall ffmpeg
brew tap homebrew-ffmpeg/ffmpeg
brew install homebrew-ffmpeg/ffmpeg/ffmpeg

# Linux (Ubuntu/Debian)
sudo apt install ffmpeg libass-dev
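
The libass check above can also be scripted, for example as a preflight step before using --subtitles-burn. This sketch assumes that each filter row printed by ffmpeg -filters carries the filter name as its second whitespace-separated field, which holds for common FFmpeg builds but is not a documented guarantee. The function names are illustrative.

```python
import shutil
import subprocess


def has_subtitles_filter(filters_output):
    """Return True if the `ffmpeg -filters` listing contains a 'subtitles' row."""
    for line in filters_output.splitlines():
        fields = line.split()
        # Rows look like: " T.. subtitles  V->V  Render text subtitles ..."
        if len(fields) >= 2 and fields[1] == "subtitles":
            return True
    return False


def ffmpeg_supports_burn():
    """Run ffmpeg (if it is installed) and look for the subtitles filter."""
    if shutil.which("ffmpeg") is None:
        return False
    result = subprocess.run(
        ["ffmpeg", "-hide_banner", "-filters"],
        capture_output=True,
        text=True,
        check=False,
    )
    return has_subtitles_filter(result.stdout + result.stderr)
```

Matching on the name column rather than searching the whole output avoids false positives from other filters whose descriptions mention subtitles.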

Examples¶

# Simple subtitles from voiceover timing (no API cost)
vac browser video --config demo.yaml --output demo.mp4 \
  --subtitles

# Word-level subtitles using speech-to-text
vac browser video --config demo.yaml --output demo.mp4 \
  --subtitles-stt

# Burn subtitles permanently into video
vac browser video --config demo.yaml --output demo.mp4 \
  --subtitles --subtitles-burn

# Silent video with burned subtitles (no audio track)
# Useful for demos where viewers read subtitles instead of listening
vac browser video --config demo.yaml --output demo.mp4 \
  --subtitles --subtitles-burn --no-audio

How Timing-Based Subtitles Work¶

When using --subtitles (without --subtitles-stt):

Each voiceover text becomes a subtitle entry
Long text is automatically split into 2-line chunks (max 42 chars per line)
Start and end times are calculated from the TTS audio durations, with time inside each voiceover distributed by word count
Pauses between voiceovers are accounted for
No additional API calls required
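
The timing rules above amount to a running clock over the voiceover clips. The sketch below is a minimal model of that idea, not vac's code: the Voiceover dataclass, its pause_after field, and the schedule function are names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Voiceover:
    text: str                 # text that becomes the subtitle entry
    duration: float           # length of the TTS audio clip, in seconds
    pause_after: float = 0.0  # silence before the next voiceover, in seconds


def schedule(voiceovers, start_at=0.0):
    """Return (start, end, text) cues laid out back to back with pauses."""
    cues = []
    cursor = start_at
    for voiceover in voiceovers:
        end = cursor + voiceover.duration
        cues.append((cursor, end, voiceover.text))
        # The next subtitle starts only after the pause has elapsed.
        cursor = end + voiceover.pause_after
    return cues
```

Because every value comes from audio that was already generated, no transcription request is needed to produce these cues.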

Automatic Text Chunking¶

Long voiceover text is automatically split into readable subtitle chunks:

Max 2 lines per chunk - Standard for video subtitles
Max 42 characters per line - Optimized for 1080p display
Word-aware splitting - Text breaks at word boundaries, not mid-word
Proportional timing - Each chunk's duration is based on word count, not character count

Example: A 100-word voiceover becomes multiple 2-line subtitle entries, each timed proportionally based on the words it contains.

Timing-based subtitles provide sentence-level accuracy and are ideal when:

You want to avoid STT API costs
Your voiceover text matches what should appear as subtitles
You're iterating quickly on content
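
The chunking and proportional timing rules described in this section can be sketched in a few lines of Python. This is an approximation under stated assumptions, not the actual implementation: it uses the standard library's textwrap for word-aware wrapping, and the names chunk_text and time_chunks are hypothetical.

```python
import textwrap

MAX_CHARS = 42  # maximum characters per line
MAX_LINES = 2   # maximum lines per subtitle chunk


def chunk_text(text):
    """Split text into chunks of at most MAX_LINES lines, breaking at words."""
    lines = textwrap.wrap(
        text, width=MAX_CHARS, break_long_words=False, break_on_hyphens=False
    )
    return [
        "\n".join(lines[i:i + MAX_LINES])
        for i in range(0, len(lines), MAX_LINES)
    ]


def time_chunks(text, start, end):
    """Give each chunk a share of [start, end] proportional to its word count."""
    chunks = chunk_text(text)
    total_words = sum(len(chunk.split()) for chunk in chunks)
    cues = []
    cursor = start
    words_so_far = 0
    for chunk in chunks:
        words_so_far += len(chunk.split())
        # Derive each boundary from the cumulative count to avoid drift.
        chunk_end = start + (end - start) * words_so_far / total_words
        cues.append((cursor, chunk_end, chunk))
        cursor = chunk_end
    return cues
```

For instance, a 40-word voiceover lasting 10 seconds that splits into chunks of 16, 16, and 8 words would be shown for 4, 4, and 2 seconds respectively.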

Current Implementation: Standard Subtitles¶

The current implementation generates standard subtitles, the most widely used format for professional video:

┌─────────────────────────────────────┐
│                                     │
│         [Slide Content]             │
│                                     │
│  ─────────────────────────────────  │
│  Two types of AI users are emerging │
└─────────────────────────────────────┘

Characteristics:

White text with black outline or semi-transparent background
1-2 lines maximum (42 chars per line), positioned at bottom of screen
Automatic text chunking for long voiceovers
Word-based timing distribution for natural reading pace
Variable frame rate (VFR) to constant frame rate (CFR) conversion ensures reliable timing when burning subtitles
Industry standard for professional content

Used by: Netflix, YouTube, broadcast TV, Udemy, Coursera

Future Caption Styles¶

The following caption styles are planned for future implementation. They offer different trade-offs between engagement and professionalism.

Karaoke Style (Word Highlight)¶

Words change color/highlight as they are spoken:

Two types of [AI users] are emerging
              ^^^^^^^^^ (highlighted in yellow as spoken)

Characteristics:

Words highlight in sequence as spoken
Requires word-level timestamps (available via Deepgram)
More engaging than static subtitles
Can be distracting for technical content

Best for: Music videos, language learning, accessibility features

Implementation notes:

Requires word-level timing from STT (already available)
Output format: ASS/SSA (Advanced SubStation Alpha) for styling
Alternative: burn into the video using the ffmpeg drawtext filter
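
Since this style is still planned, the following is only a sketch of how word timestamps could map onto an ASS Dialogue line. It relies on the ASS \k override tag, whose duration is given in centiseconds; the function names, the Default style name, and the choice to hold each highlight until the next word starts are assumptions for illustration.

```python
def ass_time(seconds):
    """Format seconds as H:MM:SS.cc, the timestamp form used by ASS events."""
    total_cs = round(seconds * 100)
    hours, rem = divmod(total_cs, 360_000)
    minutes, rem = divmod(rem, 6_000)
    secs, cs = divmod(rem, 100)
    return f"{hours}:{minutes:02d}:{secs:02d}.{cs:02d}"


def karaoke_dialogue(words, style="Default"):
    """Build one ASS Dialogue line with a \\k tag in front of every word."""
    parts = []
    for i, word in enumerate(words):
        # Hold each word until the next one starts so highlights stay in sync.
        if i + 1 < len(words):
            next_start = words[i + 1]["start"]
        else:
            next_start = word["end"]
        centiseconds = round(next_start * 100) - round(word["start"] * 100)
        parts.append(f"{{\\k{centiseconds}}}{word['word']}")
    start = ass_time(words[0]["start"])
    end = ass_time(words[-1]["end"])
    return f"Dialogue: 0,{start},{end},{style},,0,0,0,,{' '.join(parts)}"
```

The highlight colors themselves would come from the style definition in the .ass file, which this sketch does not generate.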

Word-by-Word Reveal (Social Media Style)¶

Words appear one at a time as spoken:

Frame 1: Two
Frame 2: Two types
Frame 3: Two types of
Frame 4: Two types of AI
...

Characteristics:

Highly engaging, attention-grabbing
Popular on TikTok, Instagram Reels, YouTube Shorts
Less formal, not suitable for all content types
Often combined with animations

Best for: Social media clips, promotional videos, short-form content

Implementation notes:

Requires word-level timing
Output: Burned into video (not separate subtitle file)
Consider text positioning, font size, animations
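
The frame sequence shown earlier can be derived directly from word-level timestamps. The sketch below is one possible approach for this planned style, with a hypothetical function name: each cue shows all words spoken so far and lasts until the next word begins.

```python
def reveal_cues(words):
    """Return (start, end, text) cues where the text grows by one word each."""
    cues = []
    for i, word in enumerate(words):
        # A cue ends when the next word starts; the last one ends with its word.
        if i + 1 < len(words):
            end = words[i + 1]["start"]
        else:
            end = word["end"]
        text = " ".join(w["word"] for w in words[:i + 1])
        cues.append((word["start"], end, text))
    return cues
```

A renderer would then draw each cue's text for its time window when burning the captions into the video.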

Animated Captions (CapCut/Premiere Style)¶

Words animate in with effects (pop, slide, bounce):

     ╭─────────────╮
     │  AI USERS   │  ← pops in with scale animation
     ╰─────────────╯

Characteristics:

Very engaging, trendy aesthetic
Complex to implement, requires video editing
Not suitable for professional/corporate content
Popular with content creators

Best for: Social media, entertainment, creator content

Implementation notes:

Requires video compositing (ffmpeg complex filters or external tool)
Template-based approach for consistency
Consider offering preset animation styles

Implementation Roadmap¶

Style	Status	Priority	Notes
Standard (SRT/VTT)	✅ Implemented	-	Current default
Burned-in standard	✅ Implemented	-	`--subtitles-burn` flag
Timing-based (no STT)	✅ Implemented	-	`vac browser video --subtitles`
Karaoke highlight	🔲 Planned	Medium	ASS format + ffmpeg
Word-by-word reveal	🔲 Planned	Medium	Social media use case
Animated captions	🔲 Planned	Low	Complex, may use templates

Proposed CLI Extensions¶

# Burn subtitles into video (standard style)
vac video --input slides.md --manifest audio/en-US/manifest.json \
  --subtitles subtitles/en-US.srt --output video/en-US.mp4

# Karaoke style (future)
vac subtitle --audio audio/en-US/ --style karaoke --output subtitles/

# Word-by-word reveal (future)
vac video --input slides.md --manifest audio/en-US/manifest.json \
  --caption-style reveal --output video/en-US.mp4

Technical Considerations¶

Subtitle Formats¶

Format	Extension	Features	Use Case
SRT	`.srt`	Basic timing + text	Universal compatibility
VTT	`.vtt`	Timing + basic styling	Web video players
ASS/SSA	`.ass`	Full styling, positioning, effects	Karaoke, anime fansubs
TTML	`.ttml`	XML-based, broadcast standard	Broadcast, streaming

Burning Subtitles into Video¶

# Using ffmpeg with SRT
ffmpeg -i video.mp4 -vf "subtitles=subtitles.srt" output.mp4

# With custom styling
ffmpeg -i video.mp4 -vf "subtitles=subtitles.srt:force_style='FontSize=24,PrimaryColour=&HFFFFFF&'" output.mp4

# ASS format for full control
ffmpeg -i video.mp4 -vf "ass=subtitles.ass" output.mp4

Word-Level Timing¶

Deepgram STT returns word-level timestamps:

{
  "words": [
    {"word": "Two", "start": 0.0, "end": 0.2},
    {"word": "types", "start": 0.25, "end": 0.5},
    {"word": "of", "start": 0.52, "end": 0.6},
    {"word": "AI", "start": 0.65, "end": 0.9}
  ]
}

This data is already captured during subtitle generation and can be used for karaoke and word-reveal styles.
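
To show how such word entries can become standard subtitle cues, here is a minimal sketch that groups them by length and by silence. It assumes a payload shaped like the example above. The function names and the max_gap threshold are invented for illustration, and max_chars=84 simply reflects two lines of 42 characters.

```python
import json


def _to_cue(words):
    """Collapse a run of word entries into one (start, end, text) cue."""
    text = " ".join(w["word"] for w in words)
    return (words[0]["start"], words[-1]["end"], text)


def words_to_cues(payload, max_chars=84, max_gap=0.6):
    """Group word-level timestamps from a JSON payload into subtitle cues."""
    words = json.loads(payload)["words"]
    cues = []
    current = []
    for word in words:
        if current:
            text = " ".join(w["word"] for w in current + [word])
            gap = word["start"] - current[-1]["end"]
            # Start a new cue when the text gets too long or after a long pause.
            if len(text) > max_chars or gap > max_gap:
                cues.append(_to_cue(current))
                current = []
        current.append(word)
    if current:
        cues.append(_to_cue(current))
    return cues
```

Each cue starts at its first word and ends at its last, so the on-screen text follows the recorded speech instead of an even split of the clip.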
