Speech to Text

curl --request POST \
  --url 'https://api.voxnexus.ai/v1/stt?sample_rate=16000' \
  --header 'Content-Type: audio/wav' \
  --header 'X-Api-Key: <api-key>' \
  --data-binary '@<path-to-audio.wav>'

{
  "request_id": "req_1234567890",
  "text": "Hello, this is a test message.",
  "duration_ms": 2500,
  "created_at": "2024-01-01T12:00:00Z",
  "language": "en",
  "words": [
    {
      "word": "hello",
      "offset": 0,
      "duration": 500,
      "confidence": 0.98
    }
  ],
  "speakers": [
    {
      "speaker_id": "speaker_1",
      "text": "Hello, this is a test message.",
      "offset": 0,
      "duration": 2500
    }
  ]
}

POST /v1/stt

Authorizations

X-Api-Key

string

header

required

Authenticate by sending your API key in the X-Api-Key request header.

Query Parameters

language

string

Language or locale code (optional). Accepts ISO 639-1 language codes (e.g. "en", "zh") and BCP 47 locale codes (e.g. "en-US", "zh-CN"). A bare language code is resolved automatically to its most common locale (e.g. "en" -> "en-US"). Providing a value improves recognition performance; if it is omitted, the service detects the language automatically.

Example:

"en-US"

sample_rate

integer

required

Sample rate of the audio in Hz (required), e.g. 16000, 22050, 44100, 48000

Required range: x >= 1

Example:

16000

enable_timestamps

boolean

default: false

Whether to return word-level timestamps in the words array (optional, default false)

enable_speaker_diarization

boolean

default: false

Whether to enable speaker diarization (optional, default false)
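
The query parameters above can be assembled into a request URL as in the following minimal sketch. The helper name build_stt_url is hypothetical, and two details are assumptions rather than documented behaviour: that boolean flags are sent as lowercase "true", and that flags left at their default can simply be omitted.

```python
from urllib.parse import urlencode

# Endpoint taken from the example request on this page.
STT_URL = "https://api.voxnexus.ai/v1/stt"


def build_stt_url(sample_rate, language=None,
                  enable_timestamps=False, enable_speaker_diarization=False):
    """Return the /v1/stt URL with its query string (hypothetical helper)."""
    # sample_rate is required and must be >= 1; bool is rejected because
    # it is a subclass of int in Python.
    if (isinstance(sample_rate, bool) or not isinstance(sample_rate, int)
            or sample_rate < 1):
        raise ValueError("sample_rate must be an integer >= 1 (Hz)")

    params = {"sample_rate": sample_rate}
    if language:
        # ISO 639-1 ("en") or BCP 47 ("en-US"); omit to let the service detect.
        params["language"] = language
    # Assumption: booleans are encoded as lowercase "true"; flags left at
    # their default (false) are not sent at all.
    if enable_timestamps:
        params["enable_timestamps"] = "true"
    if enable_speaker_diarization:
        params["enable_speaker_diarization"] = "true"
    return f"{STT_URL}?{urlencode(params)}"
```

For example, build_stt_url(16000, language="en-US", enable_timestamps=True) produces a URL whose query string carries sample_rate, language, and enable_timestamps, in that order.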

Body

The request body is the raw audio file, sent with a matching Content-Type header (audio/wav in the example request).
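
As a rough illustration of sending a file body, the sketch below builds (but does not send) a request with Python's standard library. The helper name build_stt_request is hypothetical. Reading the sample rate from the WAV header is a convenience chosen here, not something the API requires, and the standard wave module only reads PCM WAV files.

```python
import io
import urllib.request
import wave


def build_stt_request(wav_bytes, api_key):
    """Build a POST request carrying a WAV file as the body (hypothetical helper)."""
    # Read the sample rate from the WAV header so that the required
    # sample_rate query parameter matches the audio being uploaded.
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        sample_rate = wav.getframerate()

    return urllib.request.Request(
        url=f"https://api.voxnexus.ai/v1/stt?sample_rate={sample_rate}",
        data=wav_bytes,  # raw file bytes, no JSON or form encoding
        headers={"Content-Type": "audio/wav", "X-Api-Key": api_key},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it; the response body is then the JSON document described under Response.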

Response

The recognition result, returned on success.

request_id

string

required

Request ID

Example:

"req_1234567890"

text

string

required

Recognized text

Example:

"Hello, this is a test message."

duration_ms

integer

required

Audio duration in milliseconds

Example:

2500

created_at

string<date-time>

required

Creation time

Example:

"2024-01-01T12:00:00Z"

language

string

Detected language code, e.g. en, en-US

Example:

"en"

words

object[]

Word-level information (if timestamps are enabled)

speakers

object[]

Speaker information (if speaker diarization is enabled)
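
The response fields above could be read into a small typed structure as sketched below. The names SttResult and parse_stt_response are hypothetical. The sketch treats language, words, and speakers as optional, since only the other four fields are marked required, and it leaves the word and speaker entries as plain dictionaries because their child attributes (and the unit of offset and duration) are not spelled out on this page.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SttResult:
    # Fields marked "required" in the response schema.
    request_id: str
    text: str
    duration_ms: int
    created_at: str  # date-time string, kept as returned
    # Optional fields: may be absent from the response.
    language: Optional[str] = None
    words: List[dict] = field(default_factory=list)
    speakers: List[dict] = field(default_factory=list)


def parse_stt_response(body):
    """Parse the JSON response body into an SttResult (hypothetical helper)."""
    data = json.loads(body)
    return SttResult(
        request_id=data["request_id"],
        text=data["text"],
        duration_ms=data["duration_ms"],
        created_at=data["created_at"],
        language=data.get("language"),
        # Present only if timestamps / speaker diarization were enabled.
        words=data.get("words") or [],
        speakers=data.get("speakers") or [],
    )
```

Defaulting words and speakers to empty lists lets calling code iterate over them without first checking which options were enabled on the request.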
