POST /v1/stt
Speech to Text
curl --request POST \
  --url 'https://api.voxnexus.ai/v1/stt?model_id=vn-stt-basic&sample_rate=16000' \
  --header 'Content-Type: audio/wav' \
  --header 'X-Api-Key: <api-key>' \
  --data-binary '@<path-to-audio.wav>'
{
  "request_id": "req_1234567890",
  "transcript": "Hello this is a test message",
  "text": "Hello, this is a test message.",
  "duration_ms": 2500,
  "created_at": "2024-01-01T12:00:00Z",
  "language": "en",
  "words": [
    {
      "word": "hello",
      "offset": 0,
      "duration": 500,
      "confidence": 0.98
    }
  ],
  "speakers": [
    {
      "speaker_id": "speaker_1",
      "text": "Hello, this is a test message.",
      "offset": 0,
      "duration": 2500
    }
  ]
}

Authorizations

X-Api-Key
string
header
required

Authenticate by sending your API key in the X-Api-Key header.

Query Parameters

model_id
string
required

Model identifier (required). Specifies which model to use for STT.

Example:

"vn-stt-basic"

language
string
default:en-US

Language or locale code (optional, default: "en-US"). Accepts both ISO 639-1 language codes (e.g. "en", "zh") and BCP 47 locale codes (e.g. "en-US", "zh-CN"). A bare language code is automatically resolved to its most common locale (e.g. "en" -> "en-US"). Providing this parameter improves recognition performance; otherwise the service auto-detects the language.

Example:

"en-US"

sample_rate
integer
required

Sample rate (required, unit: Hz, e.g. 16000, 22050, 44100, 48000)

Required range: x >= 1
Example:

16000

enable_timestamps
boolean
default:false

Whether to return word-level timestamps in the words field of the response (optional, default: false).

enable_speaker_diarization
boolean
default:false

Whether to enable speaker diarization (optional, default: false). When enabled, the result is returned in the speakers field of the response.

enable_llm_transform
boolean
default:false

Whether to enable LLM post-processing on the STT transcript (optional, default false). When enabled, the recognized transcript is passed to the LLM with the specified llm_prompt, and the transformed result is returned in the text field. If LLM transform fails, text falls back to the raw transcript (no error is returned). This feature requires whitelist access. Contact support@voxnexus.ai to request access.

llm_prompt
string

LLM transform instruction (required when enable_llm_transform=true). Describes what the LLM should do with the transcript, e.g. "correct punctuation", "rewrite in a humorous tone", "translate to English", "summarize key points". The LLM applies this instruction freely — the service does not interpret its semantics.

Example:

"Correct punctuation and remove filler words"

llm_model_id
string

LLM model ID (optional). Falls back to the server-configured default model when not specified. Routing is prefix-based: claude-* models route to Anthropic; all other model IDs route to OpenAI. Examples: claude-haiku-4-5-20251001 (Anthropic), gpt-4.1 (OpenAI).

Example:

"claude-haiku-4-5-20251001"
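The prefix-based routing rule described for llm_model_id can be mirrored on the client, for example to log which provider a request is expected to reach. This is only a sketch of the documented rule: the function name and the returned labels are hypothetical, and whether the server matches the prefix case-sensitively is an assumption.

```python
def llm_provider_for(model_id: str) -> str:
    """Return the provider a model ID is expected to route to.

    Mirrors the documented rule: IDs starting with "claude-" go to
    Anthropic, every other ID goes to OpenAI. The lowercase labels
    returned here are illustrative, not values used by the API.
    """
    # Assumption: the prefix match is case-sensitive.
    return "anthropic" if model_id.startswith("claude-") else "openai"
```

Note that under this rule a mistyped model ID would still be routed to OpenAI rather than rejected on the client side.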

llm_max_tokens
integer

Maximum output tokens for LLM transform (optional). Falls back to the server default when not specified.

Required range: 1 <= x <= 4096
Example:

1024
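The constraints listed for the query parameters above (required fields, value ranges, and llm_prompt being required when enable_llm_transform=true) can be checked before a request is sent. The following is a minimal sketch; the helper name build_stt_query is hypothetical, and sending booleans as the string "true" is an assumption that this page does not confirm.

```python
from typing import Optional
from urllib.parse import urlencode


def build_stt_query(
    model_id: str,
    sample_rate: int,
    language: Optional[str] = None,
    enable_llm_transform: bool = False,
    llm_prompt: Optional[str] = None,
    llm_max_tokens: Optional[int] = None,
) -> str:
    """Validate the documented constraints and return an encoded query string."""
    if not model_id:
        raise ValueError("model_id is required")
    if sample_rate < 1:
        raise ValueError("sample_rate must be >= 1")
    if enable_llm_transform and not llm_prompt:
        raise ValueError("llm_prompt is required when enable_llm_transform=true")
    if llm_max_tokens is not None and not 1 <= llm_max_tokens <= 4096:
        raise ValueError("llm_max_tokens must be between 1 and 4096")

    params: dict[str, object] = {"model_id": model_id, "sample_rate": sample_rate}
    if language:
        params["language"] = language
    if enable_llm_transform:
        # Assumption: boolean flags are serialized as "true"/"false".
        params["enable_llm_transform"] = "true"
        params["llm_prompt"] = llm_prompt
    if llm_max_tokens is not None:
        params["llm_max_tokens"] = llm_max_tokens
    return urlencode(params)
```

Optional parameters that are left unset are omitted from the query string, so the server-side defaults apply.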

Body

The request body is the raw audio file, sent with a matching Content-Type header (audio/wav in the example request).
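Putting the pieces together (API key header, required query parameters, and the audio file as the body), a request could be assembled in Python roughly as follows. This sketch uses only the standard library and builds the request without sending it; the function name build_stt_request is hypothetical, and the audio is assumed to be WAV, as in the example request.

```python
import urllib.parse
import urllib.request

API_URL = "https://api.voxnexus.ai/v1/stt"  # endpoint from the example request


def build_stt_request(
    audio: bytes, api_key: str, model_id: str, sample_rate: int
) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying raw audio bytes."""
    query = urllib.parse.urlencode(
        {"model_id": model_id, "sample_rate": sample_rate}
    )
    return urllib.request.Request(
        f"{API_URL}?{query}",
        data=audio,  # the file contents are sent unmodified as the body
        method="POST",
        headers={"Content-Type": "audio/wav", "X-Api-Key": api_key},
    )


# Sending it would look like this (assumption: the response body is JSON):
#     import json
#     with urllib.request.urlopen(build_stt_request(...)) as resp:
#         result = json.load(resp)
```

Reading the file with open(path, "rb").read() and passing the bytes as audio corresponds to sending the file as binary data rather than as a quoted string.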

Response

The recognition result was returned successfully.

request_id
string
required

Request ID

Example:

"req_1234567890"

transcript
string
required

Raw STT recognition output (original ASR text before any LLM processing). Always present; identical to text when LLM transform is not enabled.

Example:

"Hello this is a test message"

text
string
required

Final output text. When LLM transform is enabled and succeeds, this contains the LLM-transformed result; otherwise it equals transcript.

Example:

"Hello, this is a test message."
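Because a failed LLM transform falls back to the raw transcript without returning an error, a client that needs to notice the fallback can only compare the two fields. The sketch below assumes the response has already been parsed into a dict; the function name is hypothetical.

```python
def text_differs_from_transcript(response: dict) -> bool:
    """Return True when the final text is not the raw ASR transcript.

    A True result means an LLM transform was applied. A False result is
    ambiguous: the transform may have been disabled, may have failed and
    fallen back, or may have produced output identical to the transcript.
    """
    return response["text"] != response["transcript"]
```

For the example response on this page the function returns True, since text contains punctuation that transcript lacks.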

duration_ms
integer
required

Audio duration in milliseconds

Example:

2500

created_at
string<date-time>
required

Creation time

Example:

"2024-01-01T12:00:00Z"

language
string

Detected language code, e.g. en, en-US

Example:

"en"

words
object[]

Word-level information, returned only if timestamps are enabled. In the example response, each entry carries word, offset, duration, and confidence.

speakers
object[]

Speaker information, returned only if speaker diarization is enabled. In the example response, each entry carries speaker_id, text, offset, and duration.
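As an illustration of consuming the optional words array, the sketch below converts each entry into a (word, start, end) tuple. It assumes the entry fields shown in the example response (word, offset, duration) and that offset and duration share the same unit; this page does not state the unit, so it is left unconverted. The function name is hypothetical.

```python
def word_spans(response: dict) -> list[tuple[str, int, int]]:
    """Return (word, start, end) for each entry in the words array.

    Returns an empty list when words is absent, which is the case
    when timestamps were not enabled for the request.
    """
    spans = []
    for entry in response.get("words") or []:
        start = entry["offset"]
        # Assumption: offset and duration use the same (undocumented) unit.
        spans.append((entry["word"], start, start + entry["duration"]))
    return spans
```

The same pattern applies to the speakers array, whose example entries also carry offset and duration.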