Architecture¶
Technical architecture and design decisions.
Overview¶
vac follows a pipeline architecture with distinct stages:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Parser │───▶│ TTS │───▶│ Recorder │───▶│ Combiner │
│ │ │ │ │ │ │ │
│ Marp → HTML │ │ Text → MP3 │ │ HTML → MP4 │ │ Parts → MP4 │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
Package Structure¶
vac/
├── cmd/vac/ # CLI entry point
│ └── main.go
├── pkg/
│ ├── orchestrator/ # Pipeline coordination
│ │ └── orchestrator.go
│ ├── parser/ # Marp markdown parsing
│ │ ├── marp_parser.go
│ │ └── marp_parser_test.go
│ ├── tts/ # Text-to-speech generation
│ │ └── elevenlabs.go
│ ├── video/ # Video recording and combining
│ │ ├── recorder.go
│ │ ├── combiner.go
│ │ └── combiner_test.go
│ └── transcript/ # Multi-language transcript types
│ ├── transcript.go
│ └── transcript.schema.json
└── examples/ # Example presentations
└── intro/
Core Components¶
Orchestrator¶
The orchestrator (pkg/orchestrator/orchestrator.go) coordinates the pipeline:
type Orchestrator struct {
config Config
parser *parser.MarpParser
tts *tts.ElevenLabsClient
recorder *video.Recorder
}
func (o *Orchestrator) Run(ctx context.Context) error {
// 1. Parse Marp markdown
// 2. Generate TTS audio for each slide
// 3. Record screen for each slide
// 4. Combine into final video
}
Parser¶
The parser (pkg/parser/marp_parser.go) extracts slides and voiceover text:
type Slide struct {
Index int
Content string
Notes string // Speaker notes / voiceover text
}
func (p *MarpParser) Parse(input string) ([]Slide, error)
Voiceover extraction priority:
- HTML comments: <!-- Voiceover text -->
- Speaker notes: content after the --- separator
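The priority rule above might be sketched as follows. This is a hypothetical helper, not the real marp_parser.go code: the exact comment and separator handling in the actual parser may differ.

```go
package main

import (
	"regexp"
	"strings"
)

// htmlComment matches an HTML comment such as <!-- Voiceover text -->.
var htmlComment = regexp.MustCompile(`(?s)<!--(.*?)-->`)

// extractVoiceover applies the documented priority: prefer the first
// HTML comment in the slide; otherwise fall back to the content after
// the last "---" separator line; otherwise return empty.
func extractVoiceover(slide string) string {
	if m := htmlComment.FindStringSubmatch(slide); m != nil {
		return strings.TrimSpace(m[1])
	}
	if i := strings.LastIndex(slide, "\n---\n"); i >= 0 {
		return strings.TrimSpace(slide[i+len("\n---\n"):])
	}
	return ""
}
```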
TTS Client¶
The TTS client (pkg/tts/elevenlabs.go) generates audio:
type ElevenLabsClient struct {
client *elevenlabs.Client
voiceID string
}
func (c *ElevenLabsClient) Synthesize(ctx context.Context, text string) ([]byte, error)
Uses github.com/plexusone/go-elevenlabs with OmniVoice compatibility.
Video Recorder¶
The recorder (pkg/video/recorder.go) captures the screen:
type Recorder struct {
config RecorderConfig
}
func (r *Recorder) RecordSlide(ctx context.Context, htmlPath string, duration time.Duration) (string, error)
Uses:
- Marp CLI to serve the HTML presentation
- FFmpeg with AVFoundation for screen capture (macOS)
Video Combiner¶
The combiner (pkg/video/combiner.go) joins videos:
func CombineVideos(ctx context.Context, inputPaths []string, outputPath string) error
func CombineVideosWithTransitions(ctx context.Context, inputPaths []string, outputPath string, transition float64) error
Uses FFmpeg filter_complex concat for seamless joining.
Mixed Audio Sample Rate Handling¶
Different TTS providers output audio at different sample rates:
| Provider | Sample Rate |
|---|---|
| ElevenLabs | 44100 Hz |
| Deepgram | 22050 Hz |
| OpenAI | 24000 Hz |
The combiner uses filter_complex concat instead of the concat demuxer so that mixed sample rates are handled correctly. This approach:
- Decodes all input audio streams
- Concatenates them in the filter graph
- Re-encodes to consistent 44100 Hz AAC output
This lets a single output video seamlessly combine audio from multiple TTS providers (e.g., ElevenLabs for some languages, Deepgram for others).
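The decode-concat-re-encode approach described above corresponds to an FFmpeg command of roughly this shape (a sketch of the argument construction, not the actual combiner.go code):

```go
package main

import (
	"fmt"
	"strings"
)

// concatArgs builds FFmpeg arguments that decode every input, join video
// and audio streams in the filter graph, and re-encode audio to a single
// 44100 Hz AAC track, so inputs with mixed TTS sample rates line up.
func concatArgs(inputs []string, out string) []string {
	var args []string
	var labels strings.Builder
	for i, in := range inputs {
		args = append(args, "-i", in)
		// Each input contributes one video and one audio stream label.
		fmt.Fprintf(&labels, "[%d:v][%d:a]", i, i)
	}
	filter := fmt.Sprintf("%sconcat=n=%d:v=1:a=1[v][a]", labels.String(), len(inputs))
	return append(args,
		"-filter_complex", filter,
		"-map", "[v]", "-map", "[a]",
		"-ar", "44100", // force one output sample rate
		"-c:a", "aac",
		out,
	)
}
```

The concat demuxer, by contrast, copies streams without decoding, which is why it cannot reconcile inputs at 22050, 24000, and 44100 Hz.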
Data Flow¶
Without Transcript¶
slides.md
│
▼
┌─────────────────────────────────┐
│ Parser.Parse() │
│ Extract slides + inline notes │
└─────────────────────────────────┘
│
▼
┌─────────────────────────────────┐
│ TTS.Synthesize() × N slides │
│ Generate audio/slide_N.mp3 │
└─────────────────────────────────┘
│
▼
┌─────────────────────────────────┐
│ Recorder.RecordSlide() × N │
│ Generate video/slide_N.mp4 │
└─────────────────────────────────┘
│
▼
┌─────────────────────────────────┐
│ Combiner.CombineVideos() │
│ Generate output.mp4 │
└─────────────────────────────────┘
With Transcript¶
slides.md + transcript.json
│
▼
┌─────────────────────────────────┐
│ Load transcript for locale │
│ Override voice settings │
└─────────────────────────────────┘
│
▼
┌─────────────────────────────────┐
│ TTS.Synthesize() with segments │
│ Apply pauses, emphasis │
└─────────────────────────────────┘
│
▼
[Same recording/combining flow]
External Dependencies¶
Required¶
| Dependency | Purpose |
|---|---|
| FFmpeg | Video recording and combining |
| Marp CLI | Markdown to HTML conversion |
| ElevenLabs API | Text-to-speech synthesis |
Go Modules¶
| Module | Purpose |
|---|---|
| github.com/plexusone/go-elevenlabs | ElevenLabs API client |
| github.com/grokify/mogo | Utilities (slog context) |
| github.com/spf13/cobra | CLI framework |
Design Decisions¶
Why Pipeline Architecture?¶
- Modularity: Each stage can be tested independently
- Flexibility: Easy to swap TTS providers or video tools
- Debugging: Intermediate files aid troubleshooting
Why Separate Transcript Format?¶
- Multi-language: Single source, multiple outputs
- Rich TTS Control: Pauses, emphasis, voice switching
- Future Avatar Support: HeyGen/Synthesia integration ready
Why FFmpeg Screen Recording?¶
- Cross-platform: Works on macOS, Linux, Windows
- Reliable: Mature, well-tested tool
- Flexible: Supports many output formats
Why ElevenLabs?¶
- Quality: Natural-sounding voices
- Multilingual: Same voice across languages
- API: Clean REST API, good Go client
Future Considerations¶
Avatar Integration¶
The transcript schema includes AvatarConfig for future integration:
type AvatarConfig struct {
Provider string `json:"provider"` // heygen, synthesia, d-id
AvatarID string `json:"avatarId"`
Position string `json:"position"`
Size string `json:"size"`
}
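Given the struct tags above, a serialized config would look like the following. The field values here are illustrative placeholders, not values the schema prescribes:

```go
package main

import "encoding/json"

// AvatarConfig mirrors the struct shown above.
type AvatarConfig struct {
	Provider string `json:"provider"` // heygen, synthesia, d-id
	AvatarID string `json:"avatarId"`
	Position string `json:"position"`
	Size     string `json:"size"`
}

// exampleAvatarJSON renders one illustrative config as JSON.
func exampleAvatarJSON() string {
	cfg := AvatarConfig{
		Provider: "heygen",
		AvatarID: "presenter-01", // hypothetical avatar ID
		Position: "bottom-right", // hypothetical placement value
		Size:     "small",
	}
	b, _ := json.Marshal(cfg)
	return string(b)
}
```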
Provider Abstraction¶
OmniVoice compatibility enables provider switching:
// Current: ElevenLabs
tts := elevenlabs.NewClient(apiKey)
// Future: Deepgram, OpenAI, etc.
tts := omnivoice.NewClient("deepgram", apiKey)
Caching¶
Potential optimizations:
- Cache TTS audio by text hash
- Cache slide renders by content hash
- Skip unchanged slides on re-generation