Overview
The Speech-to-Text (STT) API converts audio into text transcriptions. VoxNexus supports both a REST API and a WebSocket API for STT, with features such as word-level timestamps and speaker diarization.
REST API
The REST API endpoint /v1/stt processes complete audio files and returns full transcription results.
Basic Usage
curl -X POST "https://api.voxnexus.ai/v1/stt?sample_rate=16000&language=zh-CN" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: audio/wav" \
--data-binary @audio.wav
Query Parameters
language
Language code (e.g., zh-CN, en-US). Optional but recommended for better recognition accuracy. If not provided, the service auto-detects the language.
sample_rate
Sample rate in Hz. Required. Supported values: 8000, 16000, 22050, 44100, 48000. Common choices: 16000 for telephony, 44100 for high-quality audio.
enable_timestamps
Whether to return word-level timestamps. Default: false.
enable_speaker_diarization
Whether to enable speaker diarization (identify different speakers). Default: false.
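The query parameters above can be assembled programmatically. A minimal sketch (the `buildSttUrl` helper is illustrative, not part of an official SDK):

```javascript
// Illustrative helper: assemble the /v1/stt request URL from the
// query parameters documented above. sample_rate is required; the
// rest are optional.
function buildSttUrl(params) {
  const base = 'https://api.voxnexus.ai/v1/stt';
  if (params.sample_rate === undefined) {
    throw new Error('sample_rate is required');
  }
  const query = new URLSearchParams();
  query.set('sample_rate', String(params.sample_rate));
  if (params.language) query.set('language', params.language);
  if (params.enable_timestamps) query.set('enable_timestamps', 'true');
  if (params.enable_speaker_diarization) {
    query.set('enable_speaker_diarization', 'true');
  }
  return `${base}?${query.toString()}`;
}
```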
Request Body
The request body should contain the audio file in one of the supported formats:
audio/wav
audio/mpeg (MP3)
audio/pcm
application/octet-stream
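If you accept user uploads and cannot trust file extensions, the Content-Type can be guessed from the file's magic bytes. A sketch (the `guessAudioContentType` helper is illustrative; raw PCM has no header, so it falls back to the generic type):

```javascript
// Illustrative: pick a Content-Type by sniffing magic bytes.
// WAV files start with "RIFF" and have "WAVE" at offset 8;
// MP3 files start with "ID3" or an 0xFF 0xEx frame-sync byte pair.
function guessAudioContentType(bytes) {
  const ascii = (start, len) =>
    String.fromCharCode(...bytes.slice(start, start + len));
  if (ascii(0, 4) === 'RIFF' && ascii(8, 4) === 'WAVE') return 'audio/wav';
  if (ascii(0, 3) === 'ID3' ||
      (bytes[0] === 0xff && (bytes[1] & 0xe0) === 0xe0)) {
    return 'audio/mpeg';
  }
  // Raw PCM is headerless; fall back to the generic type.
  return 'application/octet-stream';
}
```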
Response
{
"request_id": "req_1234567890",
"language": "en",
"text": "Hello, this is a test message.",
"duration_ms": 2500,
"words": [
{
"word": "hello",
"start_time_ms": 0,
"end_time_ms": 500
}
],
"speakers": [
{
"speaker_id": "speaker_1",
"text": "Hello, this is a test message.",
"start_time_ms": 0,
"end_time_ms": 2500
}
],
"created_at": "2024-01-01T12:00:00Z"
}
request_id
Unique identifier for this request.
language
Detected or specified language code (e.g., en, en-US, zh). May be absent if language detection is not enabled.
text
Complete transcribed text.
duration_ms
Audio duration in milliseconds.
words
Word-level information array. Only present if enable_timestamps is true. Each item contains:
word: The recognized word
start_time_ms: Start time in milliseconds
end_time_ms: End time in milliseconds
speakers
Speaker information array. Only present if enable_speaker_diarization is true. Each item contains:
speaker_id: Unique speaker identifier
text: Text spoken by this speaker
start_time_ms: Start time in milliseconds
end_time_ms: End time in milliseconds
created_at
Timestamp when the transcription was created (ISO 8601 format).
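Since words, speakers, and language are optional, guard them before use. A minimal sketch that normalizes a response object (the `summarizeTranscription` name is illustrative):

```javascript
// Illustrative helper: normalize an STT response, tolerating the
// optional fields (language, words, speakers) being absent.
function summarizeTranscription(result) {
  return {
    id: result.request_id,
    text: result.text,
    language: result.language ?? 'unknown',
    durationMs: result.duration_ms,
    wordCount: (result.words ?? []).length,
    speakerCount: (result.speakers ?? []).length
  };
}
```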
Response Headers
X-Request-ID: Request identifier
X-Language: Detected language code
X-Duration-Ms: Audio duration in milliseconds
X-RateLimit-Remaining: Remaining requests in the current window
X-Quota-Used: Credits consumed
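The headers above can be read from any object exposing a `get()` method (fetch's `Headers`, or a `Map` in tests). A sketch (`readSttHeaders` is an illustrative name):

```javascript
// Illustrative: extract the documented STT response headers into a
// typed object. Numeric headers are parsed with Number().
function readSttHeaders(headers) {
  return {
    requestId: headers.get('X-Request-ID'),
    language: headers.get('X-Language'),
    durationMs: Number(headers.get('X-Duration-Ms')),
    rateLimitRemaining: Number(headers.get('X-RateLimit-Remaining')),
    quotaUsed: Number(headers.get('X-Quota-Used'))
  };
}
```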
WebSocket API
The WebSocket API provides real-time speech recognition, ideal for live transcription scenarios.
Connection
Connect to wss://api.voxnexus.ai/v1/stt/realtime with authentication:
// Connect with token as query parameter (recommended)
const ws = new WebSocket('wss://api.voxnexus.ai/v1/stt/realtime?token=YOUR_API_KEY');
Message Flow
- Initialize: Send an init message with recognition parameters; the server acknowledges with a ready message containing the request_id
- Send Audio: Continuously send audio messages with Base64-encoded audio chunks
- Receive Results: Receive partial (interim) and final recognition results
- Handle Errors: Monitor for error messages
Initialization Message
{
"type": "init",
"language": "zh-CN",
"format": "pcm",
"sample_rate": 16000,
"enable_timestamps": true,
"enable_language_detection": false
}
Audio Message
{
"type": "audio",
"data": "base64-encoded-audio-chunk"
}
Partial Result Message
{
"type": "partial",
"text": "Hello, this is"
}
Final Result Message
{
"type": "final",
"text": "Hello, this is a test message.",
"language": "en",
"start_time_ms": 0,
"end_time_ms": 2500,
"words": [
{
"word": "hello",
"start_time_ms": 0,
"end_time_ms": 500
}
]
}
Complete Example
// Connect with token as query parameter
const ws = new WebSocket('wss://api.voxnexus.ai/v1/stt/realtime?token=YOUR_API_KEY');
let audioContext;
let mediaRecorder;
ws.onopen = () => {
// Initialize recognition
ws.send(JSON.stringify({
type: 'init',
format: 'pcm',
sample_rate: 16000,
enable_timestamps: true,
enable_language_detection: false
}));
// Start audio capture
navigator.mediaDevices.getUserMedia({ audio: true })
.then(stream => {
audioContext = new AudioContext({ sampleRate: 16000 });
const source = audioContext.createMediaStreamSource(stream);
// Note: ScriptProcessorNode is deprecated; prefer AudioWorklet in production
const processor = audioContext.createScriptProcessor(4096, 1, 1);
processor.onaudioprocess = (e) => {
const audioData = e.inputBuffer.getChannelData(0);
const pcm16 = new Int16Array(audioData.length);
for (let i = 0; i < audioData.length; i++) {
pcm16[i] = Math.max(-32768, Math.min(32767, audioData[i] * 32768));
}
// Build the Base64 string byte by byte; spreading a large buffer
// into String.fromCharCode can overflow the call stack
const bytes = new Uint8Array(pcm16.buffer);
let binary = '';
for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
const base64 = btoa(binary);
ws.send(JSON.stringify({
type: 'audio',
data: base64
}));
};
source.connect(processor);
processor.connect(audioContext.destination);
});
};
ws.onmessage = (event) => {
const message = JSON.parse(event.data);
switch (message.type) {
case 'ready':
console.log('Ready:', message.request_id);
break;
case 'partial':
// Update UI with interim results
console.log('Partial:', message.text);
break;
case 'final':
// Handle final transcription
console.log('Final:', message.text);
if (message.language) {
console.log('Detected language:', message.language);
}
break;
case 'error':
console.error('Error:', message.error);
break;
}
};
Best Practices
Audio Format Selection
- PCM: Best for real-time WebSocket streaming; requires the exact sample rate to be specified
- WAV: Good for the REST API; includes format headers
- MP3: Compressed format, good for file uploads, but requires decoding
Sample Rate Guidelines
- 8kHz: Telephony quality, sufficient for phone recordings
- 16kHz: Standard quality, good balance of quality and file size
- 22.05kHz: Radio quality
- 44.1kHz/48kHz: High-quality audio, use for professional recordings
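If your capture device runs at 44.1 kHz but you want to stream at 16 kHz, the audio can be resampled before upload. A naive linear-interpolation sketch, purely illustrative; production code should use a proper resampling library, since linear interpolation introduces aliasing:

```javascript
// Illustrative linear-interpolation resampler for Float32 PCM.
// Not production-quality: linear interpolation aliases; prefer a
// windowed-sinc or library resampler for real use.
function resampleLinear(input, fromRate, toRate) {
  const ratio = fromRate / toRate;
  const outLength = Math.floor(input.length * toRate / fromRate);
  const output = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    // Blend the two neighboring samples.
    output[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return output;
}
```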
Language Specification
Always specify the language when known:
# Better accuracy with language specified
curl -X POST "https://api.voxnexus.ai/v1/stt?sample_rate=16000&language=zh-CN" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: audio/wav" \
--data-binary @audio.wav
Timestamps
Enable timestamps for word-level timing information:
const response = await fetch('https://api.voxnexus.ai/v1/stt?sample_rate=16000&enable_timestamps=true', {
method: 'POST',
headers: {
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'audio/wav'
},
body: audioFile
});
const result = await response.json();
// Use word-level timestamps for subtitles or annotations
result.words.forEach(word => {
console.log(`${word.word} (${word.start_time_ms}-${word.end_time_ms}ms)`);
});
Speaker Diarization
Use speaker diarization for multi-speaker scenarios:
curl -X POST "https://api.voxnexus.ai/v1/stt?sample_rate=16000&enable_speaker_diarization=true" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: audio/wav" \
--data-binary @meeting.wav
Language Detection
Enable automatic language detection for multilingual audio:
{
"type": "init",
"format": "pcm",
"sample_rate": 16000,
"enable_language_detection": true
}
Error Handling
Implement robust error handling:
async function transcribeAudio(audioFile) {
try {
const response = await fetch('https://api.voxnexus.ai/v1/stt?sample_rate=16000', {
method: 'POST',
headers: {
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'audio/wav'
},
body: audioFile
});
if (!response.ok) {
const error = await response.json();
throw new Error(error.error || `HTTP ${response.status}`);
}
const result = await response.json();
return result.text;
} catch (error) {
console.error('STT Error:', error);
throw error;
}
}
Common Use Cases
Meeting Transcription
Transcribe meeting recordings with speaker identification:
async function transcribeMeeting(audioFile) {
const response = await fetch(
'https://api.voxnexus.ai/v1/stt?sample_rate=44100&enable_speaker_diarization=true&enable_timestamps=true',
{
method: 'POST',
headers: {
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'audio/wav'
},
body: audioFile
}
);
const result = await response.json();
// Format as meeting transcript
result.speakers.forEach(speaker => {
console.log(`Speaker ${speaker.speaker_id}: ${speaker.text}`);
});
return result;
}
Live Captioning
Use WebSocket API for real-time captioning:
// Stream audio and display captions in real-time
ws.onmessage = (event) => {
const message = JSON.parse(event.data);
if (message.type === 'partial') {
// Update caption display with interim results
updateCaptionDisplay(message.text, true);
} else if (message.type === 'final') {
// Finalize caption
updateCaptionDisplay(message.text, false);
}
};
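The partial/final distinction maps naturally onto a small caption buffer: partial results overwrite the line being built, finals commit it. A sketch (the `CaptionBuffer` class is illustrative, not part of any SDK):

```javascript
// Illustrative caption state: partial results overwrite the current
// (in-progress) line; final results commit it and start a new line.
class CaptionBuffer {
  constructor() {
    this.committed = [];
    this.current = '';
  }
  onPartial(text) {
    this.current = text;
  }
  onFinal(text) {
    this.committed.push(text);
    this.current = '';
  }
  render() {
    return [...this.committed, this.current].join('\n').trimEnd();
  }
}
```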
Voice Commands
Process voice commands with real-time recognition:
ws.send(JSON.stringify({
type: 'init',
format: 'pcm',
sample_rate: 16000,
enable_language_detection: false
}));
ws.onmessage = (event) => {
const message = JSON.parse(event.data);
if (message.type === 'final') {
const text = message.text.toLowerCase();
if (text.includes('activate')) {
handleActivateCommand();
} else if (text.includes('deactivate')) {
handleDeactivateCommand();
}
}
};
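As the command set grows, a lookup table scales better than an if/else chain. A sketch in which the command phrases and handler results are examples, not part of the API:

```javascript
// Illustrative: dispatch recognized text against a command table.
// Phrases and return values here are made up for the example.
const commands = new Map([
  ['activate', () => 'activated'],
  ['deactivate', () => 'deactivated'],
  ['stop listening', () => 'stopped']
]);

function dispatchCommand(finalText) {
  const text = finalText.toLowerCase();
  // Check longest phrases first so multi-word commands win.
  const phrases = [...commands.keys()].sort((a, b) => b.length - a.length);
  for (const phrase of phrases) {
    if (text.includes(phrase)) return commands.get(phrase)();
  }
  return null;
}
```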
Audio Content Indexing
Index audio content for search:
async function indexAudioContent(audioFile, metadata) {
const response = await fetch(
'https://api.voxnexus.ai/v1/stt?sample_rate=16000&enable_timestamps=true',
{
method: 'POST',
headers: {
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'audio/wav'
},
body: audioFile
}
);
const result = await response.json();
// Store transcription with timestamps for search
await storeIndexedContent({
...metadata,
transcription: result.text,
words: result.words,
duration: result.duration_ms
});
return result;
}
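To make the stored transcription searchable with time offsets, the word-level timestamps can be folded into an inverted index. A sketch (the `buildWordIndex` helper and its normalization rules are illustrative):

```javascript
// Illustrative: build an inverted index from word-level timestamps,
// mapping each normalized word to the start times it was spoken.
function buildWordIndex(words) {
  const index = new Map();
  for (const w of words ?? []) {
    // Lowercase and strip anything that is not a letter or digit.
    const key = w.word.toLowerCase().replace(/[^\p{L}\p{N}]/gu, '');
    if (!key) continue;
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(w.start_time_ms);
  }
  return index;
}
```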
Rate Limits and Quotas
- Monitor response headers for rate limit information
- Implement retry logic with exponential backoff for 429 responses
- Consider WebSocket API for continuous streaming scenarios
- Batch process large audio files during off-peak hours
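The backoff recommendation above can be sketched as a retry wrapper; the delay schedule (base 500 ms, cap 30 s) and the assumption that errors carry a `status` property are illustrative choices, not documented behavior:

```javascript
// Illustrative: exponential backoff delay with a cap.
function backoffDelayMs(attempt, baseMs = 500, capMs = 30000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Illustrative retry wrapper: retries only on errors tagged with
// status 429, waiting backoffDelayMs between attempts.
async function withRetry(fn, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const retryable = err && err.status === 429;
      if (!retryable || attempt === maxAttempts - 1) throw err;
      await new Promise(r => setTimeout(r, backoffDelayMs(attempt)));
    }
  }
}
```

In production you may also want jitter on the delay to avoid synchronized retries across clients.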