Text-to-Speech
The texttospeech/v1beta1 package provides a wrapper for Google's Text-to-Speech API.
Features
- Synthesize speech from text
- Multiple voice options (WaveNet, Standard)
- Multiple audio formats
- SSML support
Quick Start
import (
"context"
"os"
"github.com/grokify/gogoogle/texttospeech/v1beta1"
)
// Synthesize speech
audio, err := texttospeech.Synthesize(ctx, httpClient, texttospeech.SynthesizeRequest{
Text: "Hello, world!",
LanguageCode: "en-US",
VoiceName: "en-US-Wavenet-D",
AudioFormat: "MP3",
})
if err != nil {
log.Fatal(err)
}
// Save to file
err = os.WriteFile("output.mp3", audio, 0644)
SynthesizeRequest
| Field |
Type |
Description |
Text |
string |
Text to synthesize |
SSML |
string |
SSML markup (alternative to Text) |
LanguageCode |
string |
BCP-47 language code |
VoiceName |
string |
Voice name |
SsmlGender |
string |
Voice gender |
AudioFormat |
string |
Output audio format |
SpeakingRate |
float64 |
Speaking rate (0.25-4.0) |
Pitch |
float64 |
Pitch (-20 to 20) |
Voices
WaveNet Voices (Neural)
High-quality, natural-sounding voices:
| Voice |
Gender |
Description |
en-US-Wavenet-A |
Male |
US English |
en-US-Wavenet-B |
Male |
US English |
en-US-Wavenet-C |
Female |
US English |
en-US-Wavenet-D |
Male |
US English |
en-US-Wavenet-E |
Female |
US English |
en-US-Wavenet-F |
Female |
US English |
en-GB-Wavenet-A |
Female |
UK English |
en-GB-Wavenet-B |
Male |
UK English |
Standard Voices
Lower latency, lower cost:
| Voice |
Gender |
Description |
en-US-Standard-A |
Male |
US English |
en-US-Standard-B |
Male |
US English |
en-US-Standard-C |
Female |
US English |
en-US-Standard-D |
Male |
US English |
List Available Voices
voices, err := texttospeech.ListVoices(ctx, httpClient, "en-US")
for _, voice := range voices {
fmt.Printf("%s (%s)\n", voice.Name, voice.SsmlGender)
}
| Format |
Description |
MP3 |
MP3 audio |
LINEAR16 |
Uncompressed WAV |
OGG_OPUS |
Ogg Opus |
SSML Support
Use SSML for advanced speech control:
ssml := `<speak>
<say-as interpret-as="date" format="mdy">12/25/2024</say-as>
<break time="500ms"/>
<emphasis level="strong">Merry Christmas!</emphasis>
</speak>`
audio, err := texttospeech.Synthesize(ctx, httpClient, texttospeech.SynthesizeRequest{
SSML: ssml,
LanguageCode: "en-US",
VoiceName: "en-US-Wavenet-D",
AudioFormat: "MP3",
})
SSML Elements
| Element |
Description |
<break> |
Insert pause |
<emphasis> |
Emphasize text |
<say-as> |
Interpret as date, time, etc. |
<prosody> |
Control pitch, rate, volume |
<sub> |
Pronunciation substitution |
Speaking Rate and Pitch
audio, err := texttospeech.Synthesize(ctx, httpClient, texttospeech.SynthesizeRequest{
Text: "This is spoken slowly with a lower pitch.",
LanguageCode: "en-US",
VoiceName: "en-US-Wavenet-D",
AudioFormat: "MP3",
SpeakingRate: 0.8, // Slower (default 1.0)
Pitch: -2.0, // Lower pitch (default 0)
})
Languages
| Language |
Code |
Voices |
| English (US) |
en-US |
30+ |
| English (UK) |
en-GB |
10+ |
| Spanish |
es-ES |
10+ |
| French |
fr-FR |
10+ |
| German |
de-DE |
10+ |
| Japanese |
ja-JP |
8+ |
| Chinese |
cmn-CN |
8+ |
OAuth Scope
scope := "https://www.googleapis.com/auth/cloud-platform"
Enable API
- Go to Google Cloud Console
- Enable Cloud Text-to-Speech API
- Create credentials
Pricing
| Voice Type |
Price per 1M chars |
| Standard |
$4.00 |
| WaveNet |
$16.00 |
| Neural2 |
$16.00 |
First 1M standard characters free per month.
Best Practices
- Use WaveNet for quality - More natural sounding
- Cache audio - Don't regenerate unchanged text
- Use SSML - Better control over pronunciation
- Choose appropriate format - MP3 for web, LINEAR16 for processing
Error Handling
audio, err := texttospeech.Synthesize(ctx, httpClient, request)
if err != nil {
switch {
case strings.Contains(err.Error(), "INVALID_ARGUMENT"):
log.Println("Invalid voice or language")
case strings.Contains(err.Error(), "RESOURCE_EXHAUSTED"):
log.Println("Quota exceeded")
default:
log.Printf("TTS error: %v", err)
}
}
Next Steps