Text-to-Speech

The texttospeech/v1beta1 package provides a wrapper for Google's Text-to-Speech API.

Features

Synthesize speech from text
Multiple voice options (WaveNet, Standard)
Multiple audio formats
SSML support

Quick Start

import (
    "context"
    "log"
    "os"

    texttospeech "github.com/grokify/gogoogle/texttospeech/v1beta1"
)

ctx := context.Background()
// httpClient is assumed to be an HTTP client already authorized
// for the cloud-platform OAuth scope.

// Synthesize speech
audio, err := texttospeech.Synthesize(ctx, httpClient, texttospeech.SynthesizeRequest{
    Text:         "Hello, world!",
    LanguageCode: "en-US",
    VoiceName:    "en-US-Wavenet-D",
    AudioFormat:  "MP3",
})
if err != nil {
    log.Fatal(err)
}

// Save to file
if err := os.WriteFile("output.mp3", audio, 0644); err != nil {
    log.Fatal(err)
}

SynthesizeRequest

Field	Type	Description
`Text`	`string`	Text to synthesize
`SSML`	`string`	SSML markup (alternative to Text)
`LanguageCode`	`string`	BCP-47 language code
`VoiceName`	`string`	Voice name
`SsmlGender`	`string`	Voice gender, e.g. `MALE` or `FEMALE`
`AudioFormat`	`string`	Output audio format
`SpeakingRate`	`float64`	Speaking rate (0.25-4.0; default 1.0)
`Pitch`	`float64`	Pitch in semitones (-20 to 20; default 0)
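
The documented ranges can be checked locally before a request is sent. The following is a minimal sketch using only the standard library; `validateProsody` is a hypothetical helper, not part of the package, and it assumes that a zero speaking rate means "unset".

```go
package main

import "fmt"

// validateProsody is a hypothetical helper (not part of the package).
// It checks the ranges from the table: speaking rate 0.25-4.0 and
// pitch -20 to 20.
// Assumption: a zero speaking rate means "unset" and is accepted.
func validateProsody(speakingRate, pitch float64) error {
	if speakingRate != 0 && (speakingRate < 0.25 || speakingRate > 4.0) {
		return fmt.Errorf("speaking rate %.2f outside 0.25-4.0", speakingRate)
	}
	if pitch < -20 || pitch > 20 {
		return fmt.Errorf("pitch %.1f outside -20 to 20", pitch)
	}
	return nil
}

func main() {
	fmt.Println(validateProsody(0.8, -2.0)) // both values in range
	fmt.Println(validateProsody(5.0, 0))    // speaking rate too high
}
```

Validating up front avoids spending a network round trip on a request that the API would likely reject.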

Voices

WaveNet Voices (Neural)

High-quality, natural-sounding voices:

Voice	Gender	Description
`en-US-Wavenet-A`	Male	US English
`en-US-Wavenet-B`	Male	US English
`en-US-Wavenet-C`	Female	US English
`en-US-Wavenet-D`	Male	US English
`en-US-Wavenet-E`	Female	US English
`en-US-Wavenet-F`	Female	US English
`en-GB-Wavenet-A`	Female	UK English
`en-GB-Wavenet-B`	Male	UK English

Standard Voices

Lower latency, lower cost:

Voice	Gender	Description
`en-US-Standard-A`	Male	US English
`en-US-Standard-B`	Male	US English
`en-US-Standard-C`	Female	US English
`en-US-Standard-D`	Male	US English
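
The voice names in these tables follow a four-part pattern: language, region, voice type, and variant letter. As a rough sketch, a name can be split into those parts with the standard library. `parseVoiceName` and `voiceParts` are hypothetical names, not part of the package, and the helper assumes the four-part pattern holds; names with a different shape are rejected.

```go
package main

import (
	"fmt"
	"strings"
)

// voiceParts holds the pieces of a voice name such as "en-US-Wavenet-D".
type voiceParts struct {
	LanguageCode string // e.g. "en-US"
	VoiceType    string // e.g. "Wavenet" or "Standard"
	Variant      string // e.g. "D"
}

// parseVoiceName is a hypothetical helper. It assumes the four-part
// "<lang>-<REGION>-<Type>-<Variant>" pattern of the voices listed above.
func parseVoiceName(name string) (voiceParts, error) {
	parts := strings.Split(name, "-")
	if len(parts) != 4 {
		return voiceParts{}, fmt.Errorf("unexpected voice name %q", name)
	}
	return voiceParts{
		LanguageCode: parts[0] + "-" + parts[1],
		VoiceType:    parts[2],
		Variant:      parts[3],
	}, nil
}

func main() {
	p, err := parseVoiceName("en-US-Wavenet-D")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(p.LanguageCode, p.VoiceType, p.Variant) // en-US Wavenet D
}
```

The parsed language code can be reused as the `LanguageCode` field so that voice and language stay consistent.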

List Available Voices

voices, err := texttospeech.ListVoices(ctx, httpClient, "en-US")
if err != nil {
    log.Fatal(err)
}
for _, voice := range voices {
    fmt.Printf("%s (%s)\n", voice.Name, voice.SsmlGender)
}

Audio Formats

Format	Description
`MP3`	MP3 audio
`LINEAR16`	Uncompressed 16-bit linear PCM, returned with a WAV header
`OGG_OPUS`	Ogg Opus
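
When saving audio to disk, the file extension should match the requested format. A small sketch of that mapping follows; `fileExtension` is a hypothetical helper, not part of the package, and the extensions are conventional choices rather than anything the API mandates.

```go
package main

import "fmt"

// fileExtension is a hypothetical helper that maps the audio formats
// listed above to a conventional file extension.
func fileExtension(audioFormat string) (string, error) {
	switch audioFormat {
	case "MP3":
		return ".mp3", nil
	case "LINEAR16":
		return ".wav", nil
	case "OGG_OPUS":
		return ".ogg", nil
	default:
		return "", fmt.Errorf("unknown audio format %q", audioFormat)
	}
}

func main() {
	ext, err := fileExtension("LINEAR16")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("output" + ext) // prints "output.wav"
}
```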

SSML Support

Use SSML for advanced speech control:

ssml := `<speak>
    <say-as interpret-as="date" format="mdy">12/25/2024</say-as>
    <break time="500ms"/>
    <emphasis level="strong">Merry Christmas!</emphasis>
</speak>`

audio, err := texttospeech.Synthesize(ctx, httpClient, texttospeech.SynthesizeRequest{
    SSML:         ssml,
    LanguageCode: "en-US",
    VoiceName:    "en-US-Wavenet-D",
    AudioFormat:  "MP3",
})

SSML Elements

Element	Description
`<break>`	Insert pause
`<emphasis>`	Emphasize text
`<say-as>`	Interpret as date, time, etc.
`<prosody>`	Control pitch, rate, volume
`<sub>`	Pronunciation substitution
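
SSML is XML, so text inserted into it should be escaped first; otherwise a stray `&` or `<` can produce invalid markup. The sketch below builds a small document from plain text using the standard library. `buildSSML` is a hypothetical helper, not part of the package.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
)

// buildSSML is a hypothetical helper. It XML-escapes the text so that
// characters such as & and < cannot break the markup, then appends a
// <break> of the given length in milliseconds.
func buildSSML(text string, pauseMillis int) (string, error) {
	var buf bytes.Buffer
	buf.WriteString("<speak>")
	if err := xml.EscapeText(&buf, []byte(text)); err != nil {
		return "", err
	}
	fmt.Fprintf(&buf, `<break time="%dms"/>`, pauseMillis)
	buf.WriteString("</speak>")
	return buf.String(), nil
}

func main() {
	ssml, err := buildSSML("Fish & chips", 500)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ssml) // <speak>Fish &amp; chips<break time="500ms"/></speak>
}
```

The resulting string can be passed in the `SSML` field in place of `Text`.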

Speaking Rate and Pitch

audio, err := texttospeech.Synthesize(ctx, httpClient, texttospeech.SynthesizeRequest{
    Text:         "This is spoken slowly with a lower pitch.",
    LanguageCode: "en-US",
    VoiceName:    "en-US-Wavenet-D",
    AudioFormat:  "MP3",
    SpeakingRate: 0.8,  // Slower (default 1.0)
    Pitch:        -2.0, // Lower pitch (default 0)
})

Languages

Language	Code	Voices
English (US)	`en-US`	30+
English (UK)	`en-GB`	10+
Spanish	`es-ES`	10+
French	`fr-FR`	10+
German	`de-DE`	10+
Japanese	`ja-JP`	8+
Chinese (Mandarin)	`cmn-CN`	8+

OAuth Scope

scope := "https://www.googleapis.com/auth/cloud-platform"

Enable API

Open the Google Cloud Console and select a project
Enable the Cloud Text-to-Speech API for that project
Create credentials for the application

Pricing

Voice Type	Price per 1M characters (USD)
Standard	$4.00
WaveNet	$16.00
Neural2	$16.00

The first 4 million Standard characters and the first 1 million WaveNet characters each month are free.
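
As a rough illustration, the per-million prices above translate into a cost estimate as follows. `estimateCost` is a hypothetical helper, and the free allowance is passed in as a parameter so that whichever figure applies to the voice type can be supplied.

```go
package main

import "fmt"

// estimateCost is a hypothetical helper. It returns the estimated charge
// in USD for chars characters, after subtracting freeChars free
// characters, at pricePerMillion USD per one million characters.
func estimateCost(chars, freeChars int, pricePerMillion float64) float64 {
	billable := chars - freeChars
	if billable <= 0 {
		return 0
	}
	return float64(billable) / 1_000_000 * pricePerMillion
}

func main() {
	// 2.5 million WaveNet characters at $16.00 per million, no free allowance.
	fmt.Printf("$%.2f\n", estimateCost(2_500_000, 0, 16.00)) // $40.00
}
```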

Best Practices

Use WaveNet for quality - WaveNet voices sound more natural than Standard voices
Cache audio - Don't regenerate audio for unchanged text
Use SSML - Gives finer control over pronunciation and pacing
Choose an appropriate format - MP3 for web playback, LINEAR16 for further processing
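
The caching advice can be sketched as an in-memory cache keyed by a hash of the inputs that affect the audio. Everything here is hypothetical: `audioCache`, `cacheKey`, and `synthFunc` are illustrative names, and `synthFunc` stands in for the real synthesis call. The sketch is not safe for concurrent use and assumes the inputs contain no NUL bytes.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// synthFunc stands in for a call to the Text-to-Speech API.
type synthFunc func(text, voice, format string) ([]byte, error)

// audioCache is a hypothetical in-memory cache of synthesized audio.
type audioCache struct {
	entries map[string][]byte
}

func newAudioCache() *audioCache {
	return &audioCache{entries: map[string][]byte{}}
}

// cacheKey hashes text, voice, and format. The fields are joined with a
// NUL byte so that different splits of the same characters yield
// different keys.
func cacheKey(text, voice, format string) string {
	sum := sha256.Sum256([]byte(text + "\x00" + voice + "\x00" + format))
	return hex.EncodeToString(sum[:])
}

// get returns cached audio and calls synth only on a cache miss.
func (c *audioCache) get(text, voice, format string, synth synthFunc) ([]byte, error) {
	key := cacheKey(text, voice, format)
	if audio, ok := c.entries[key]; ok {
		return audio, nil
	}
	audio, err := synth(text, voice, format)
	if err != nil {
		return nil, err
	}
	c.entries[key] = audio
	return audio, nil
}

func main() {
	calls := 0
	fake := func(text, voice, format string) ([]byte, error) {
		calls++ // count how often "synthesis" actually runs
		return []byte("audio:" + text), nil
	}
	cache := newAudioCache()
	cache.get("Hello", "en-US-Wavenet-D", "MP3", fake)
	cache.get("Hello", "en-US-Wavenet-D", "MP3", fake)
	fmt.Println("synthesis calls:", calls) // synthesis calls: 1
}
```

A persistent variant could use the same key as a file name, so that cached audio survives restarts.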

Error Handling

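audio, err := texttospeech.Synthesize(ctx, httpClient, request)
if err != nil {
    // Distinguish failures by the status name in the error text.
    switch {
    case strings.Contains(err.Error(), "INVALID_ARGUMENT"):
        // The request was rejected, e.g. an unknown voice or language
        log.Println("Invalid voice or language")
    case strings.Contains(err.Error(), "RESOURCE_EXHAUSTED"):
        // The project's quota was exceeded
        log.Println("Quota exceeded")
    default:
        log.Printf("TTS error: %v", err)
    }
}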

Next Steps

Speech-to-Text - Transcribe audio
Gmail - Send audio files via email