
Transcript Schema

JSON schema reference for multi-language transcripts.

Schema Location

pkg/transcript/transcript.schema.json

Structure Overview

{
  "version": "1.0",
  "metadata": { ... },
  "slides": [ ... ]
}

Root Object

| Field | Type | Required | Description |
|---|---|---|---|
| version | string | | Schema version (e.g., "1.0") |
| metadata | Metadata | | Presentation-level settings |
| slides | Slide[] | | Array of slide transcripts |
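
As a rough illustration of the root structure, the sketch below parses a transcript and checks the three top-level fields. The function name `load_transcript` is hypothetical, and treating all three fields as required is an assumption, since the Required column above does not say so.

```python
import json

# Assumption: all three root fields are mandatory; the schema file is authoritative.
ROOT_FIELDS = {"version": str, "metadata": dict, "slides": list}


def load_transcript(text: str) -> dict:
    """Parse a transcript document and check the shape of its root object."""
    doc = json.loads(text)
    if not isinstance(doc, dict):
        raise ValueError("transcript root must be a JSON object")
    missing = [name for name in ROOT_FIELDS if name not in doc]
    if missing:
        raise ValueError("missing root fields: " + ", ".join(missing))
    for name, expected in ROOT_FIELDS.items():
        if not isinstance(doc[name], expected):
            raise ValueError(f"{name} must be of type {expected.__name__}")
    return doc


example = '{"version": "1.0", "metadata": {"title": "My Presentation"}, "slides": []}'
doc = load_transcript(example)
print(doc["version"])
```

For real validation, a JSON Schema validator run against pkg/transcript/transcript.schema.json would be the more reliable choice; this hand-written check only mirrors the table.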

Metadata

{
  "metadata": {
    "title": "My Presentation",
    "description": "Optional description",
    "defaultLanguage": "en-US",
    "defaultVoice": {
      "provider": "elevenlabs",
      "voiceId": "pNInz6obpgDQGcFmaJgB",
      "voiceName": "Adam"
    },
    "defaultVenue": "youtube",
    "tags": ["tutorial", "demo"]
  }
}

| Field | Type | Required | Description |
|---|---|---|---|
| title | string | | Presentation title |
| description | string | | Optional description |
| defaultLanguage | string | | BCP-47 locale (e.g., en-US) |
| defaultVoice | VoiceConfig | | Default TTS voice settings |
| defaultVenue | string | | Target platform |
| tags | string[] | | Organization tags |
| custom | object | | User-defined key-value pairs |

Venue Options

| Value | Platform |
|---|---|
| youtube | YouTube |
| udemy | Udemy |
| coursera | Coursera |
| edx | edX |
| instagram | Instagram |
| tiktok | TikTok |
| general | General purpose |
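
A minimal sketch of checking `defaultVenue` against the values listed above. The helper name `check_venue` is hypothetical, and it assumes the field may be omitted and that the list above is exhaustive.

```python
from typing import Optional

# Venue values copied from the table above.
VENUES = frozenset(
    {"youtube", "udemy", "coursera", "edx", "instagram", "tiktok", "general"}
)


def check_venue(metadata: dict) -> Optional[str]:
    """Return the venue if it is a listed value, or None when it is absent."""
    venue = metadata.get("defaultVenue")
    if venue is None:
        return None  # assumption: defaultVenue is optional
    if venue not in VENUES:
        raise ValueError(f"unknown venue: {venue!r}")
    return venue


print(check_venue({"title": "My Presentation", "defaultVenue": "youtube"}))
```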

VoiceConfig

{
  "provider": "elevenlabs",
  "voiceId": "pNInz6obpgDQGcFmaJgB",
  "voiceName": "Adam",
  "model": "eleven_multilingual_v2",
  "outputFormat": "mp3",
  "sampleRate": 44100,
  "speed": 1.0,
  "pitch": 0.0,
  "stability": 0.5,
  "similarityBoost": 0.75,
  "style": 0.2
}

| Field | Type | Default | Description |
|---|---|---|---|
| provider | string | | TTS provider (elevenlabs, deepgram, etc.) |
| voiceId | string | | Provider-specific voice ID |
| voiceName | string | | Human-readable name |
| model | string | | Provider-specific model |
| outputFormat | string | mp3 | Audio format |
| sampleRate | int | | Sample rate (Hz) |
| speed | float | 1.0 | Speech speed (0.25 - 4.0) |
| pitch | float | 0.0 | Pitch adjustment (-1.0 to 1.0) |
| stability | float | | Voice consistency (ElevenLabs) |
| similarityBoost | float | | Voice similarity (ElevenLabs) |
| style | float | | Style exaggeration (ElevenLabs) |
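
The defaults and ranges in the table can be applied in a few lines. The sketch below is illustrative only: `normalize_voice` is a hypothetical helper, and rejecting out-of-range values (rather than clamping them) is an assumption.

```python
# Defaults taken from the Default column above.
VOICE_DEFAULTS = {"outputFormat": "mp3", "speed": 1.0, "pitch": 0.0}


def normalize_voice(voice: dict) -> dict:
    """Fill in documented defaults and check the documented ranges."""
    merged = {**VOICE_DEFAULTS, **voice}  # explicit values win over defaults
    if not 0.25 <= merged["speed"] <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    if not -1.0 <= merged["pitch"] <= 1.0:
        raise ValueError("pitch must be between -1.0 and 1.0")
    return merged


voice = normalize_voice({"provider": "elevenlabs", "voiceId": "pNInz6obpgDQGcFmaJgB"})
print(voice["outputFormat"], voice["speed"], voice["pitch"])
```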

Slide

{
  "index": 0,
  "title": "Welcome Slide",
  "transcripts": {
    "en-US": { ... },
    "es-ES": { ... }
  },
  "avatar": { ... },
  "notes": "Internal notes"
}

| Field | Type | Required | Description |
|---|---|---|---|
| index | int | | Slide index (0-based) |
| title | string | | Slide title for reference |
| transcripts | object | | Locale to LanguageContent map |
| avatar | AvatarConfig | | Virtual avatar settings |
| notes | string | | Internal notes (not spoken) |

LanguageContent

{
  "en-US": {
    "voice": { ... },
    "segments": [ ... ],
    "timing": { ... }
  }
}

| Field | Type | Required | Description |
|---|---|---|---|
| voice | VoiceConfig | | Override voice for this language |
| segments | Segment[] | | Text segments |
| timing | TimingInfo | | Populated after TTS generation |

Segment

{
  "text": "Welcome to the presentation.",
  "pause": 500,
  "emphasis": "moderate",
  "rate": "medium",
  "pitch": "+2st",
  "ssml": { ... }
}

| Field | Type | Required | Description |
|---|---|---|---|
| text | string | | Text to speak |
| pause | int | | Pause after segment (ms) |
| emphasis | string | | none, moderate, strong |
| rate | string | | x-slow, slow, medium, fast, x-fast |
| pitch | string | | Pitch adjustment |
| voice | VoiceConfig | | Override voice for segment |
| ssml | SSMLHints | | Additional SSML hints |
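
Voice settings can appear at three levels: `metadata.defaultVoice`, the per-language `voice`, and the per-segment `voice`. The sketch below shows one plausible resolution order, most specific first. The function `resolve_voice` is hypothetical, and it assumes an override replaces the whole VoiceConfig rather than merging field by field, which this reference does not specify.

```python
from typing import Optional


def resolve_voice(metadata: dict, language_content: dict, segment: dict) -> Optional[dict]:
    """Pick the most specific voice: segment, then language, then default."""
    for source in (segment, language_content):
        voice = source.get("voice")
        if voice:
            return voice  # assumption: the override replaces the default entirely
    return metadata.get("defaultVoice")


metadata = {"defaultVoice": {"provider": "elevenlabs", "voiceName": "Adam"}}
spanish = {"voice": {"provider": "elevenlabs", "voiceName": "Spanish narrator"}, "segments": []}
plain_segment = {"text": "Welcome to the presentation."}

print(resolve_voice(metadata, spanish, plain_segment)["voiceName"])
print(resolve_voice(metadata, {"segments": []}, plain_segment)["voiceName"])
```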

SSMLHints

{
  "breaks": ["400ms", "1s"],
  "emphasis": ["important", "keyword"],
  "prosody": "rate=\"slow\"",
  "sayAs": "date",
  "phoneme": "ˈɛksəmpl̩",
  "subAlias": "HTML"
}

| Field | Type | Description |
|---|---|---|
| breaks | string[] | Break durations |
| emphasis | string[] | Words to emphasize |
| prosody | string | Custom prosody |
| sayAs | string | Interpretation (date, time, etc.) |
| phoneme | string | IPA pronunciation |
| subAlias | string | Substitution text |
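
This reference does not say how Segment fields and SSMLHints are turned into markup. As a hedged illustration, the sketch below maps a few of them onto standard SSML elements (`say-as`, `emphasis`, `prosody`, `break`). The function `segment_to_ssml` and the nesting order are assumptions, and the remaining hints (`breaks`, `phoneme`, `subAlias`, custom `prosody`) are left out because their placement within the text is not defined here.

```python
from xml.sax.saxutils import escape, quoteattr


def segment_to_ssml(segment: dict) -> str:
    """Render one segment as an SSML fragment (illustrative mapping only)."""
    body = escape(segment["text"])
    hints = segment.get("ssml") or {}

    say_as = hints.get("sayAs")
    if say_as:
        body = "<say-as interpret-as=" + quoteattr(say_as) + ">" + body + "</say-as>"

    emphasis = segment.get("emphasis")
    if emphasis and emphasis != "none":
        body = "<emphasis level=" + quoteattr(emphasis) + ">" + body + "</emphasis>"

    # rate and pitch both map onto attributes of a single prosody element.
    attrs = [
        name + "=" + quoteattr(segment[name])
        for name in ("rate", "pitch")
        if segment.get(name)
    ]
    if attrs:
        body = "<prosody " + " ".join(attrs) + ">" + body + "</prosody>"

    pause = segment.get("pause")
    if pause:
        body += '<break time="' + str(int(pause)) + 'ms"/>'
    return body


segment = {
    "text": "Welcome to the presentation.",
    "pause": 500,
    "emphasis": "moderate",
    "rate": "medium",
    "pitch": "+2st",
}
print(segment_to_ssml(segment))
```

Not every TTS provider accepts SSML, so a real pipeline would likely need a provider-specific rendering step.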

TimingInfo

Populated after TTS generation:

{
  "audioDuration": 5200,
  "pauseDuration": 1000,
  "totalDuration": 6200
}

| Field | Type | Description |
|---|---|---|
| audioDuration | int | Audio duration (ms) |
| pauseDuration | int | Total pause duration (ms) |
| totalDuration | int | Total slide duration (ms) |
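
In the sample above, `totalDuration` equals `audioDuration` plus `pauseDuration` (5200 + 1000 = 6200). The sketch below reproduces that arithmetic. The helper `compute_timing` is hypothetical, and deriving `pauseDuration` by summing each segment's `pause` is an assumption about how the generator fills these fields.

```python
def compute_timing(audio_duration_ms: int, segments: list) -> dict:
    """Build a TimingInfo-shaped dict from measured audio and segment pauses."""
    # Assumption: pauseDuration is the sum of the per-segment pause values.
    pause_ms = sum(int(segment.get("pause", 0)) for segment in segments)
    return {
        "audioDuration": audio_duration_ms,
        "pauseDuration": pause_ms,
        "totalDuration": audio_duration_ms + pause_ms,
    }


segments = [
    {"text": "Welcome to the presentation.", "pause": 500},
    {"text": "Let's get started.", "pause": 500},
]
timing = compute_timing(5200, segments)
print(timing)
```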

AvatarConfig

For future HeyGen/Synthesia integration:

{
  "provider": "heygen",
  "avatarId": "avatar_001",
  "position": "bottom-right",
  "size": "medium",
  "style": "professional"
}

| Field | Type | Required | Description |
|---|---|---|---|
| provider | string | | heygen, synthesia, d-id |
| avatarId | string | | Provider-specific ID |
| position | string | | bottom-right, bottom-left, etc. |
| size | string | | small, medium, large |
| style | string | | Visual style |

Complete Example

See examples/intro/transcript.json for a full working example.