POST /v1/stt
Speech to Text
curl --request POST \
  --url 'https://api.voxnexus.ai/v1/stt?sample_rate=16000' \
  --header 'Content-Type: audio/wav' \
  --header 'X-Api-Key: <api-key>' \
  --data-binary '@<audio-file>.wav'
{
  "request_id": "req_1234567890",
  "text": "Hello, this is a test message.",
  "duration_ms": 2500,
  "created_at": "2024-01-01T12:00:00Z",
  "language": "en",
  "words": [
    {
      "word": "hello",
      "offset": 0,
      "duration": 500,
      "confidence": 0.98
    }
  ],
  "speakers": [
    {
      "speaker_id": "speaker_1",
      "text": "Hello, this is a test message.",
      "offset": 0,
      "duration": 2500
    }
  ]
}

Authorizations

X-Api-Key
string
header
required

Authenticate using X-Api-Key header

Query Parameters

language
string

Language or locale code (optional). Accepts ISO 639-1 language codes (e.g. "en", "zh") and BCP 47 locale codes (e.g. "en-US", "zh-CN"). A bare language code is resolved automatically to its most common locale (e.g. "en" -> "en-US"). Providing a value improves recognition performance; if omitted, the service detects the language automatically.

Example:

"en-US"

sample_rate
integer
required

Sample rate of the audio in Hz (required), e.g. 16000, 22050, 44100, 48000

Required range: x >= 1
Example:

16000

enable_timestamps
boolean
default:false

Whether to return word-level timestamps in the words field of the response (optional, default false)

enable_speaker_diarization
boolean
default:false

Whether to enable speaker diarization, which populates the speakers field of the response (optional, default false)
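
The query parameters above can be assembled into a request URL as in the following minimal Python sketch. The helper name build_stt_url is hypothetical, and the serialization of booleans as lowercase "true"/"false" is an assumption that this page does not confirm.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.voxnexus.ai/v1/stt"  # endpoint taken from the example request


def build_stt_url(sample_rate, language=None,
                  enable_timestamps=False, enable_speaker_diarization=False):
    """Return the endpoint URL with the documented query parameters (hypothetical helper)."""
    if sample_rate < 1:
        # The reference states the required range x >= 1.
        raise ValueError("sample_rate must be >= 1")
    params = {"sample_rate": sample_rate}
    if language is not None:
        params["language"] = language
    # Assumption: booleans are sent as lowercase "true"; both flags
    # default to false, so they are omitted when not enabled.
    if enable_timestamps:
        params["enable_timestamps"] = "true"
    if enable_speaker_diarization:
        params["enable_speaker_diarization"] = "true"
    return f"{BASE_URL}?{urlencode(params)}"
```

Calling build_stt_url(16000, language="en-US", enable_timestamps=True) yields a URL whose query string carries sample_rate, language, and enable_timestamps in that order.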

Body

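The request body is the audio file itself, sent as raw bytes with a matching Content-Type header (the example request uses audio/wav).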
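
A minimal Python sketch of such a request, using only the standard library, might look like the following. The helper name build_stt_request is hypothetical; the header names come from the example request above, and the function only builds the request object without sending it.

```python
import urllib.request


def build_stt_request(url, api_key, audio_bytes, content_type="audio/wav"):
    """Build (but do not send) a POST request whose body is the raw audio bytes."""
    return urllib.request.Request(
        url,
        data=audio_bytes,  # the audio file contents, unmodified
        method="POST",
        headers={"Content-Type": content_type, "X-Api-Key": api_key},
    )
```

To send it, read the audio file in binary mode, pass its bytes as audio_bytes, and hand the result to urllib.request.urlopen; the response body is then expected to be the JSON document described in the Response section.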

Response

Recognition result returned successfully

request_id
string
required

Request ID

Example:

"req_1234567890"

text
string
required

Recognized text

Example:

"Hello, this is a test message."

duration_ms
integer
required

Audio duration in milliseconds

Example:

2500

created_at
string<date-time>
required

Creation time

Example:

"2024-01-01T12:00:00Z"

language
string

Detected language or locale code, e.g. "en" or "en-US"

Example:

"en"

words
object[]

Word-level information (if timestamps are enabled)

speakers
object[]

Speaker information (if speaker diarization is enabled)
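
As a rough illustration of consuming this response, the Python sketch below parses the JSON into small dataclasses. The class and function names are hypothetical, the field names are taken from the example response, and the optional fields (language, words, speakers) are treated as possibly absent. The unit of offset and duration inside words and speakers is not stated here; the example values suggest milliseconds, but that is an assumption.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Word:
    word: str
    offset: int      # unit assumed to be milliseconds
    duration: int    # unit assumed to be milliseconds
    confidence: float


@dataclass
class SpeakerSegment:
    speaker_id: str
    text: str
    offset: int
    duration: int


@dataclass
class SttResult:
    request_id: str
    text: str
    duration_ms: int
    created_at: str
    language: Optional[str] = None
    words: List[Word] = field(default_factory=list)
    speakers: List[SpeakerSegment] = field(default_factory=list)


def parse_stt_response(body):
    """Parse a JSON response body (str or bytes) into an SttResult."""
    data = json.loads(body)
    words = [
        Word(w["word"], w["offset"], w["duration"], w["confidence"])
        for w in data.get("words") or []
    ]
    speakers = [
        SpeakerSegment(s["speaker_id"], s["text"], s["offset"], s["duration"])
        for s in data.get("speakers") or []
    ]
    return SttResult(
        request_id=data["request_id"],
        text=data["text"],
        duration_ms=data["duration_ms"],
        created_at=data["created_at"],
        language=data.get("language"),  # optional in the schema
        words=words,
        speakers=speakers,
    )
```

With a response that omits words and speakers, both attributes come back as empty lists, so callers can iterate over them without checking which flags were enabled.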