Speech to Text
Converts an audio file to text and returns the complete recognition result.
Authorizations
Authenticate using X-Api-Key header
Query Parameters
Model identifier (required). Specifies which model to use for STT.
"vn-stt-basic"
Language or locale code (optional, default: "en-US"). Accepts both ISO 639-1 language codes (e.g. "en", "zh") and BCP 47 locale codes (e.g. "en-US", "zh-CN"). A bare language code is automatically resolved to its most common locale (e.g. "en" -> "en-US"). Providing this value improves recognition performance; otherwise the language is auto-detected by the service.
"en-US"
Sample rate (required, unit: Hz, e.g. 16000, 22050, 44100, 48000)
x >= 1
16000
Whether to return timestamps (optional, default false)
Whether to enable speaker diarization (optional, default false)
Whether to enable LLM post-processing on the STT transcript (optional, default false).
When enabled, the recognized transcript is passed to the LLM with the specified llm_prompt,
and the transformed result is returned in the text field.
If the LLM transform fails, text falls back to the raw transcript (no error is returned).
This feature requires whitelist access. Contact support@voxnexus.ai to request access.
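The fallback rule can be summarized in a few lines. This is only an illustrative model of the documented behavior, not the service's implementation; `final_text` and `llm_result` are hypothetical names, with `llm_result=None` standing in for a failed transform.

```python
from typing import Optional


def final_text(transcript: str,
               enable_llm_transform: bool,
               llm_result: Optional[str]) -> str:
    """Model how the text field relates to the raw transcript."""
    # llm_result is None when the transform failed or never ran (assumption).
    if enable_llm_transform and llm_result is not None:
        return llm_result
    # Transform disabled or failed: text equals the raw transcript.
    return transcript
```

A client can therefore compare text with the raw transcript to see whether a transform was applied, keeping in mind that a failed transform is not reported as an error.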
LLM transform instruction (required when enable_llm_transform=true).
Describes what the LLM should do with the transcript, e.g. "correct punctuation",
"rewrite in a humorous tone", "translate to English", "summarize key points".
The LLM applies this instruction freely — the service does not interpret its semantics.
"Correct punctuation and remove filler words"
LLM model ID (optional). Falls back to the server-configured default model when not specified.
Routing is prefix-based: claude-* models route to Anthropic; all other model IDs route to OpenAI.
Examples: claude-haiku-4-5-20251001 (Anthropic), gpt-4.1 (OpenAI).
"claude-haiku-4-5-20251001"
Maximum output tokens for LLM transform (optional). Falls back to the server default when not specified.
1 <= x <= 4096
1024
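A client-side range check for the maximum output tokens might look like the sketch below. It reads the allowed range as 1 to 4096, which is an assumption, and `check_max_tokens` is a hypothetical helper rather than part of any SDK.

```python
from typing import Optional

MAX_TOKENS_LIMIT = 4096  # assumed upper bound of the allowed range


def check_max_tokens(value: Optional[int]) -> Optional[int]:
    """Validate the optional token limit; None means 'use the server default'."""
    if value is None:
        return None
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("max output tokens must be an integer")
    if not 1 <= value <= MAX_TOKENS_LIMIT:
        raise ValueError(f"max output tokens must be in 1..{MAX_TOKENS_LIMIT}")
    return value
```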
Body
The body is of type file.
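As a rough sketch of how the pieces fit together, the snippet below assembles a request object with the X-Api-Key header, the query string, and the raw audio bytes as the body, without sending it. Several details are assumptions not confirmed by this reference: the endpoint URL is a placeholder, the POST method and the Content-Type value are guesses, and the query parameter names `model` and `sample_rate` are hypothetical (only enable_llm_transform and llm_prompt are named in the text above).

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint: the real URL is not given in this section.
BASE_URL = "https://api.example.invalid/stt"


def build_stt_request(audio: bytes, api_key: str, params: dict) -> Request:
    """Build (but do not send) a speech-to-text request."""
    return Request(
        url=f"{BASE_URL}?{urlencode(params)}",
        data=audio,  # the request body is the audio file content
        method="POST",  # assumption: uploads use POST
        headers={
            "X-Api-Key": api_key,
            # Assumption: the raw file is sent as an opaque byte stream.
            "Content-Type": "application/octet-stream",
        },
    )


req = build_stt_request(
    b"\x00\x01",  # stand-in for real audio bytes
    "YOUR_API_KEY",
    {
        "model": "vn-stt-basic",   # hypothetical parameter name
        "sample_rate": 16000,      # hypothetical parameter name
        "enable_llm_transform": "true",
        "llm_prompt": "Correct punctuation and remove filler words",
    },
)
```

The resulting object could then be passed to `urllib.request.urlopen` once the real endpoint and parameter names are known.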
Response
Successfully returns recognition result
Request ID
"req_1234567890"
Raw STT recognition output (original ASR text before any LLM processing).
Always present; identical to text when LLM transform is not enabled.
"Hello this is a test message"
Final output text. When LLM transform is enabled and succeeds, this contains
the LLM-transformed result; otherwise it equals transcript.
"Hello, this is a test message."
Audio duration in milliseconds
2500
Creation time
"2024-01-01T12:00:00Z"
Detected language code, e.g. en, en-US
"en"
Word-level information (if timestamps are enabled)
Speaker information (if speaker diarization is enabled)