Architecture¶

Technical architecture and design decisions.

Overview¶

vac follows a pipeline architecture with distinct stages:

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Parser    │───▶│     TTS     │───▶│  Recorder   │───▶│  Combiner   │
│             │    │             │    │             │    │             │
│ Marp → HTML │    │ Text → MP3  │    │ HTML → MP4  │    │ Parts → MP4 │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘

Package Structure¶

vac/
├── cmd/vac/     # CLI entry point
│   └── main.go
├── pkg/
│   ├── orchestrator/   # Pipeline coordination
│   │   └── orchestrator.go
│   ├── parser/         # Marp markdown parsing
│   │   ├── marp_parser.go
│   │   └── marp_parser_test.go
│   ├── tts/            # Text-to-speech generation
│   │   └── elevenlabs.go
│   ├── video/          # Video recording and combining
│   │   ├── recorder.go
│   │   ├── combiner.go
│   │   └── combiner_test.go
│   └── transcript/     # Multi-language transcript types
│       ├── transcript.go
│       └── transcript.schema.json
└── examples/           # Example presentations
    └── intro/

Core Components¶

Orchestrator¶

The orchestrator (pkg/orchestrator/orchestrator.go) coordinates the pipeline:

type Orchestrator struct {
    config    Config
    parser    *parser.MarpParser
    tts       *tts.ElevenLabsClient
    recorder  *video.Recorder
}

func (o *Orchestrator) Run(ctx context.Context) error {
    // 1. Parse Marp markdown
    // 2. Generate TTS audio for each slide
    // 3. Record screen for each slide
    // 4. Combine into final video
}

Parser¶

The parser (pkg/parser/marp_parser.go) extracts slides and voiceover text:

type Slide struct {
    Index   int
    Content string
    Notes   string  // Speaker notes / voiceover text
}

func (p *MarpParser) Parse(input string) ([]Slide, error)

Voiceover extraction priority:

HTML comments: 
Speaker notes: Content after --- separator

TTS Client¶

The TTS client (pkg/tts/elevenlabs.go) generates audio:

type ElevenLabsClient struct {
    client  *elevenlabs.Client
    voiceID string
}

func (c *ElevenLabsClient) Synthesize(ctx context.Context, text string) ([]byte, error)

Uses github.com/plexusone/go-elevenlabs with OmniVoice compatibility.

Video Recorder¶

The recorder (pkg/video/recorder.go) captures screen:

type Recorder struct {
    config RecorderConfig
}

func (r *Recorder) RecordSlide(ctx context.Context, htmlPath string, duration time.Duration) (string, error)

Uses:

Marp CLI to serve HTML presentation
FFmpeg with AVFoundation for screen capture (macOS)

Video Combiner¶

The combiner (pkg/video/combiner.go) joins videos:

func CombineVideos(ctx context.Context, inputPaths []string, outputPath string) error
func CombineVideosWithTransitions(ctx context.Context, inputPaths []string, outputPath string, transition float64) error

Uses FFmpeg filter_complex concat for seamless joining.

Mixed Audio Sample Rate Handling¶

Different TTS providers output audio at different sample rates:

Provider	Sample Rate
ElevenLabs	44100 Hz
Deepgram	22050 Hz
OpenAI	24000 Hz

The combiner uses filter_complex concat instead of the concat demuxer to properly handle mixed sample rates. This approach:

Decodes all input audio streams
Concatenates them in the filter graph
Re-encodes to consistent 44100 Hz AAC output

This ensures videos can seamlessly combine audio from multiple TTS providers (e.g., ElevenLabs for some languages, Deepgram for others).

Data Flow¶

Without Transcript¶

slides.md
    │
    ▼
┌─────────────────────────────────┐
│ Parser.Parse()                  │
│ Extract slides + inline notes   │
└─────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────┐
│ TTS.Synthesize() × N slides     │
│ Generate audio/slide_N.mp3      │
└─────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────┐
│ Recorder.RecordSlide() × N      │
│ Generate video/slide_N.mp4      │
└─────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────┐
│ Combiner.CombineVideos()        │
│ Generate output.mp4             │
└─────────────────────────────────┘

With Transcript¶

slides.md + transcript.json
    │
    ▼
┌─────────────────────────────────┐
│ Load transcript for locale      │
│ Override voice settings         │
└─────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────┐
│ TTS.Synthesize() with segments  │
│ Apply pauses, emphasis          │
└─────────────────────────────────┘
    │
    ▼
[Same recording/combining flow]

External Dependencies¶

Required¶

Dependency	Purpose
FFmpeg	Video recording and combining
Marp CLI	Markdown to HTML conversion
ElevenLabs API	Text-to-speech synthesis

Go Modules¶

Module	Purpose
`github.com/plexusone/go-elevenlabs`	ElevenLabs API client
`github.com/grokify/mogo`	Utilities (slog context)
`github.com/spf13/cobra`	CLI framework

Design Decisions¶

Why Pipeline Architecture?¶

Modularity: Each stage can be tested independently
Flexibility: Easy to swap TTS providers or video tools
Debugging: Intermediate files aid troubleshooting

Why Separate Transcript Format?¶

Multi-language: Single source, multiple outputs
Rich TTS Control: Pauses, emphasis, voice switching
Future Avatar Support: HeyGen/Synthesia integration ready

Why FFmpeg Screen Recording?¶

Cross-platform: Works on macOS, Linux, Windows
Reliable: Mature, well-tested tool
Flexible: Supports many output formats

Why ElevenLabs?¶

Quality: Natural-sounding voices
Multilingual: Same voice across languages
API: Clean REST API, good Go client

Future Considerations¶

Avatar Integration¶

The transcript schema includes AvatarConfig for future integration:

type AvatarConfig struct {
    Provider  string `json:"provider"`  // heygen, synthesia, d-id
    AvatarID  string `json:"avatarId"`
    Position  string `json:"position"`
    Size      string `json:"size"`
}

Provider Abstraction¶

OmniVoice compatibility enables provider switching:

// Current: ElevenLabs
tts := elevenlabs.NewClient(apiKey)

// Future: Deepgram, OpenAI, etc.
tts := omnivoice.NewClient("deepgram", apiKey)

Caching¶

Potential optimizations:

Cache TTS audio by text hash
Cache slide renders by content hash
Skip unchanged slides on re-generation