Speech-to-Text¶

The speechtotext package provides a wrapper for Google's Speech-to-Text API.

Features¶

Transcribe audio from bytes or files
Multiple language support
Confidence threshold filtering
Word-level timestamps

Quick Start¶

import (
    "context"
    "fmt"
    "log"
    "os"

    "github.com/grokify/gogoogle/speechtotext"
)

// httpClient is an authenticated HTTP client created elsewhere.
ctx := context.Background()

// Read audio file
audioData, err := os.ReadFile("audio.wav")
if err != nil {
    log.Fatal(err)
}

// Transcribe
result, err := speechtotext.Transcribe(ctx, httpClient, speechtotext.TranscribeRequest{
    Audio:        audioData,
    LanguageCode: "en-US",
    Encoding:     "LINEAR16",
    SampleRate:   16000, // must match the audio's actual sample rate
})
if err != nil {
    log.Fatal(err)
}

fmt.Println(result.Transcript)

TranscribeRequest¶

Field	Type	Description
`Audio`	`[]byte`	Audio data
`LanguageCode`	`string`	BCP-47 language code
`Encoding`	`string`	Audio encoding (see Audio Encodings)
`SampleRate`	`int`	Sample rate in Hz; must match the audio
`MinConfidence`	`float64`	Minimum confidence (0-1)

Audio Encodings¶

Encoding	Description
`LINEAR16`	Uncompressed 16-bit PCM
`FLAC`	FLAC encoded
`MULAW`	μ-law encoded
`AMR`	AMR (Adaptive Multi-Rate)
`AMR_WB`	AMR Wideband
`OGG_OPUS`	Ogg Opus
`MP3`	MP3 encoded

Language Codes¶

Common language codes:

Language	Code
English (US)	`en-US`
English (UK)	`en-GB`
Spanish	`es-ES`
French	`fr-FR`
German	`de-DE`
Japanese	`ja-JP`
Chinese (Mandarin)	`zh-CN`
Portuguese (Brazil)	`pt-BR`

Transcription Result¶

type TranscribeResult struct {
    Transcript  string    // Full transcript
    Confidence  float64   // Overall confidence (0-1)
    Words       []Word    // Word-level results
}

type Word struct {
    Word       string
    StartTime  time.Duration
    EndTime    time.Duration
    Confidence float64
}

Confidence Filtering¶

result, err := speechtotext.Transcribe(ctx, httpClient, speechtotext.TranscribeRequest{
    Audio:         audioData,
    LanguageCode:  "en-US",
    MinConfidence: 0.8, // Only include high-confidence results
})
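
The page does not say whether `MinConfidence` is applied to whole results or to individual words. If per-word filtering is needed on the client side, it could be sketched as follows; `filterByConfidence` is a hypothetical helper, not a package function, and `Word` is a local copy of the documented struct.

```go
package main

import (
	"fmt"
	"time"
)

// Word is a local copy of the Word struct from the Transcription Result section.
type Word struct {
	Word       string
	StartTime  time.Duration
	EndTime    time.Duration
	Confidence float64
}

// filterByConfidence is a hypothetical helper that keeps only the words
// whose confidence is at least min, preserving their original order.
func filterByConfidence(words []Word, min float64) []Word {
	kept := make([]Word, 0, len(words))
	for _, w := range words {
		if w.Confidence >= min {
			kept = append(kept, w)
		}
	}
	return kept
}

func main() {
	// Made-up sample data standing in for result.Words.
	words := []Word{
		{Word: "turn", Confidence: 0.95},
		{Word: "um", Confidence: 0.42},
		{Word: "left", Confidence: 0.88},
	}
	for _, w := range filterByConfidence(words, 0.8) {
		fmt.Println(w.Word, w.Confidence)
	}
}
```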

File Transcription¶

result, err := speechtotext.TranscribeFile(ctx, httpClient, "audio.wav", speechtotext.TranscribeRequest{
    LanguageCode: "en-US",
})

Long Audio¶

For audio longer than 1 minute, upload the file to Google Cloud Storage (GCS) and use async recognition:

operation, err := speechtotext.TranscribeAsync(ctx, httpClient, speechtotext.AsyncTranscribeRequest{
    AudioURI:     "gs://bucket/audio.wav", // GCS URI
    LanguageCode: "en-US",
})
if err != nil {
    log.Fatal(err)
}

// Poll for completion
result, err := speechtotext.WaitForResult(ctx, httpClient, operation)
if err != nil {
    log.Fatal(err)
}

OAuth Scope¶

scope := "https://www.googleapis.com/auth/cloud-platform"

Enable API¶

Go to the Google Cloud Console
Enable the Cloud Speech-to-Text API
Create a service account or OAuth credentials

Best Practices¶

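Use a lossless encoding - LINEAR16 or FLAC gives the best accuracy
Set the correct sample rate - SampleRate must match the rate the audio was recorded at
Specify the language - Set LanguageCode explicitly rather than relying on auto-detection
Use GCS for long audio - Required for audio longer than 1 minute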
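
One way to make sure `SampleRate` matches the audio is to read it from the file itself. The following is a minimal sketch for RIFF/WAVE files using only the standard library; `wavSampleRate` and `minimalWAVHeader` are hypothetical helpers written for this example, not package functions.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// wavSampleRate returns the sample rate stored in the "fmt " chunk of a
// RIFF/WAVE file.
func wavSampleRate(data []byte) (int, error) {
	if len(data) < 12 || !bytes.Equal(data[0:4], []byte("RIFF")) ||
		!bytes.Equal(data[8:12], []byte("WAVE")) {
		return 0, errors.New("not a RIFF/WAVE file")
	}
	// Walk the chunks: 4-byte ID, 4-byte little-endian size, then the payload.
	for off := 12; off+8 <= len(data); {
		id := string(data[off : off+4])
		size := int(binary.LittleEndian.Uint32(data[off+4 : off+8]))
		body := off + 8
		if id == "fmt " {
			if size < 16 || body+16 > len(data) {
				return 0, errors.New("truncated fmt chunk")
			}
			// The sample rate sits 4 bytes into the payload, after the
			// format tag and the channel count.
			return int(binary.LittleEndian.Uint32(data[body+4 : body+8])), nil
		}
		off = body + size + size%2 // chunks are padded to an even length
	}
	return 0, errors.New("fmt chunk not found")
}

// minimalWAVHeader builds a mono 16-bit PCM header with no audio data,
// used here only to exercise wavSampleRate.
func minimalWAVHeader(sampleRate uint32) []byte {
	var b bytes.Buffer
	b.WriteString("RIFF")
	binary.Write(&b, binary.LittleEndian, uint32(28)) // "WAVE" + fmt chunk
	b.WriteString("WAVEfmt ")
	fields := []interface{}{
		uint32(16),     // fmt payload size
		uint16(1),      // format tag: PCM
		uint16(1),      // channels
		sampleRate,     // sample rate in Hz
		sampleRate * 2, // byte rate
		uint16(2),      // block align
		uint16(16),     // bits per sample
	}
	for _, f := range fields {
		binary.Write(&b, binary.LittleEndian, f) // writes to a bytes.Buffer do not fail
	}
	return b.Bytes()
}

func main() {
	rate, err := wavSampleRate(minimalWAVHeader(16000))
	fmt.Println(rate, err)
}
```

With a real file, the bytes returned by `os.ReadFile` would be passed to `wavSampleRate` and the result used as `SampleRate` in the request.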

Error Handling¶

// Match on the Google API status name contained in the error message
result, err := speechtotext.Transcribe(ctx, httpClient, request)
if err != nil {
    switch {
    case strings.Contains(err.Error(), "INVALID_ARGUMENT"):
        log.Println("Invalid audio format or parameters")
    case strings.Contains(err.Error(), "RESOURCE_EXHAUSTED"):
        log.Println("Quota exceeded")
    default:
        log.Printf("Transcription error: %v", err)
    }
}

Next Steps¶

Text-to-Speech - Convert text to audio
Gmail - Send transcripts via email