WhisperX

Overview

The DataCrunch Whisper Inference Service provides an endpoint for the Whisper large-v3 model. The endpoint supports advanced WhisperX options: speaker diarization, phoneme alignment for word-level timestamps, and subtitle generation in SRT format.

Transcribing Audio

To transcribe audio, send a POST request with the audio file URL in the audio_input field.

curl -X POST https://fin-02.inference.datacrunch.io/v1/raw/whisperx/predict \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your_api_key>" \
  -d \
'{
    "audio_input": "<AUDIO_FILE_URL>"
}'

Translating Audio

To translate the transcribed output to English, set translate to true:

curl -X POST https://fin-02.inference.datacrunch.io/v1/raw/whisperx/predict \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your_api_key>" \
  -d \
'{
    "audio_input": "<AUDIO_FILE_URL>",
    "translate": true
}'

Generating Subtitles

When creating subtitles, set processing_type="align" to get word-level alignment. Without alignment the subtitle chunks are longer, which can make them harder to read. Setting output="subtitles" returns the output in SRT format.

curl -X POST https://fin-02.inference.datacrunch.io/v1/raw/whisperx/predict \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your_api_key>" \
  -d \
'{
    "audio_input": "<AUDIO_FILE_URL>",
    "translate": true,
    "processing_type": "align",
    "output": "subtitles"
}'
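To keep the subtitles, the response can be written straight to a file. The following is a minimal sketch rather than an official client: subtitle_payload, save_subtitles, and the DATACRUNCH_API_KEY environment variable are hypothetical names, and the sketch assumes the response body is the SRT text itself. If the service wraps the subtitles in JSON, the file will need further processing.

```shell
# Hypothetical helpers, not part of any DataCrunch tooling.

# Build the request body for aligned subtitles.
# Assumes the URL contains no double quotes or backslashes (no JSON escaping).
subtitle_payload() {
  printf '{"audio_input": "%s", "processing_type": "align", "output": "subtitles"}\n' "$1"
}

# Request subtitles for the audio at $1 and write the response body to $2.
# Assumption: the response body is plain SRT text.
save_subtitles() {
  curl -sS --fail -X POST "https://fin-02.inference.datacrunch.io/v1/raw/whisperx/predict" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $DATACRUNCH_API_KEY" \
    -d "$(subtitle_payload "$1")" \
    -o "$2"
}
```

For example, save_subtitles "<AUDIO_FILE_URL>" talk.srt would store the result in talk.srt.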

Performing Speaker Diarization

For speaker diarization (assigning speaker labels to text segments), set processing_type to diarize:

curl -X POST https://fin-02.inference.datacrunch.io/v1/raw/whisperx/predict \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your_api_key>" \
  -d \
'{
    "audio_input": "<AUDIO_FILE_URL>",
    "translate": true,
    "processing_type": "diarize"
}'

API Specification

API Parameters

  • audio_input (str, required): URL of the audio file.

  • translate (bool, optional): If true, the output is translated to English. Defaults to false.

  • language (str, optional): Two-letter code of the input language. Specifying it avoids relying on automatic language detection.

  • processing_type (str, optional): Defines the processing action. Supported types: diarize (assigns speaker labels to text segments), align (word-level alignment).

  • output (str, optional): Determines the output format. Options: subtitles (SRT format), raw (time-stamped text). Defaults to raw.
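The parameters above can be combined in one request, including language, which the earlier examples do not use. The following is a minimal sketch under stated assumptions, not an official client: whisperx_payload, whisperx_predict, and the DATACRUNCH_API_KEY environment variable are hypothetical names. The payload builder does no JSON escaping, so it assumes the values contain no double quotes or backslashes. It leaves out translate, which defaults to false.

```shell
# Sketch only: whisperx_payload and whisperx_predict are hypothetical helper
# names, not part of any DataCrunch tooling.

WHISPERX_URL="https://fin-02.inference.datacrunch.io/v1/raw/whisperx/predict"

# Build the JSON request body and print it.
#   $1 audio URL (required)
#   $2 language code (optional, e.g. "de")
#   $3 processing_type (optional: diarize or align)
#   $4 output (optional: subtitles or raw)
# Assumes no argument contains double quotes or backslashes.
whisperx_payload() {
  if [ -z "$1" ]; then
    echo "audio_input is required" >&2
    return 1
  fi
  # Reject values that the parameter list does not mention.
  case "$3" in
    ""|diarize|align) ;;
    *) echo "unsupported processing_type: $3" >&2; return 1 ;;
  esac
  case "$4" in
    ""|subtitles|raw) ;;
    *) echo "unsupported output: $4" >&2; return 1 ;;
  esac
  body="\"audio_input\": \"$1\""
  # Optional fields are added only when a value was given.
  if [ -n "$2" ]; then body="$body, \"language\": \"$2\""; fi
  if [ -n "$3" ]; then body="$body, \"processing_type\": \"$3\""; fi
  if [ -n "$4" ]; then body="$body, \"output\": \"$4\""; fi
  printf '{%s}\n' "$body"
}

# Send the request; takes the same arguments as whisperx_payload.
whisperx_predict() {
  payload=$(whisperx_payload "$@") || return 1
  curl -sS --fail -X POST "$WHISPERX_URL" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $DATACRUNCH_API_KEY" \
    -d "$payload"
}
```

For example, whisperx_predict "<AUDIO_FILE_URL>" de align would request word-level alignment for German audio. Validating processing_type and output locally means a typo fails before any request is sent.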

Copyright notice: WhisperX includes software developed by Max Bain.
