WhisperX

Overview

The DataCrunch Whisper Inference Service provides access to Whisper large model endpoints, supporting both Whisper large v2 and the latest Whisper large v3. These endpoints include advanced WhisperX options such as speaker diarization, phoneme alignment for word-level timestamps, and subtitle generation in SRT format.

Version Support

  • Whisper v2: more dependable when the input language is unknown or when Whisper v3’s language identification proves unreliable.

  • Whisper v3: preferred when the input language is known and language identification is reliable.

Examples of API Usage

To use the Whisper API, replace <VERSION> in the URL with v2 for version 2 or v3 for version 3.

For example, use https://inference.datacrunch.io/v1/audio/whisperx-v2/generate.

Transcribing Audio

To transcribe audio, submit a request with the audio file URL.

curl -X POST https://inference.datacrunch.io/v1/audio/whisperx-<VERSION>/generate \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your_api_key>" \
  -d \
'{
    "audio_input": "<AUDIO_FILE_URL>"
}'
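The same request can be built from Python using only the standard library; a minimal sketch mirroring the curl example above (the API key and audio URL are placeholders):

```python
import json
import urllib.request

API_BASE = "https://inference.datacrunch.io/v1/audio"

def build_request(audio_url: str, api_key: str, version: str = "v3",
                  **options) -> urllib.request.Request:
    """Build the POST request shown in the curl example.

    Extra keyword arguments (e.g. translate=True) are merged into the
    JSON body. Send the finished request with urllib.request.urlopen().
    """
    body = {"audio_input": audio_url, **options}
    return urllib.request.Request(
        f"{API_BASE}/whisperx-{version}/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )
```

Calling `urllib.request.urlopen(build_request(...))` performs the actual network request; the sketch separates request construction so it can be inspected before sending.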

Translating Audio

For translation of the transcribed output to English:

curl -X POST https://inference.datacrunch.io/v1/audio/whisperx-<VERSION>/generate \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your_api_key>" \
  -d \
'{
    "audio_input": "<AUDIO_FILE_URL>",
    "translate": true
}'

Generating Subtitles

When creating subtitles, set processing_type="align" to ensure word-level alignment; omitting alignment results in longer subtitle chunks and a potentially worse viewing experience. Setting output="subtitles" returns the output in SRT format.

curl -X POST https://inference.datacrunch.io/v1/audio/whisperx-<VERSION>/generate \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your_api_key>" \
  -d \
'{
    "audio_input": "<AUDIO_FILE_URL>",
    "translate": true,
    "processing_type": "align",
    "output": "subtitles"
}'
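With output="subtitles" the service returns SRT text. A small sketch of splitting that format into (start, end, text) cues; the subtitle content in the test data is hypothetical, but the SRT layout itself is standard:

```python
import re

def parse_srt(srt_text: str):
    """Split SRT subtitle text into (start, end, text) cues."""
    cues = []
    # Each cue: an index line, "HH:MM:SS,mmm --> HH:MM:SS,mmm", then text lines,
    # with cues separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue
        start, end = (t.strip() for t in lines[1].split("-->"))
        cues.append((start, end, " ".join(lines[2:])))
    return cues
```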

Performing Speaker Diarization

For speaker diarization (assigning speaker labels to text segments), set processing_type to diarize:

curl -X POST https://inference.datacrunch.io/v1/audio/whisperx-<VERSION>/generate \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your_api_key>" \
  -d \
'{
    "audio_input": "<AUDIO_FILE_URL>",
    "translate": true,
    "processing_type": "diarize"
}'
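Downstream code typically groups the diarized segments by speaker. The field names below (segments, speaker, text) are assumptions for illustration only; consult an actual response for the exact schema:

```python
from collections import defaultdict

def group_by_speaker(response: dict) -> dict:
    """Collect segment texts per speaker label.

    Assumes the diarized response carries a list of segments, each with a
    speaker label and its text (field names are assumptions, not the
    documented schema).
    """
    grouped = defaultdict(list)
    for seg in response.get("segments", []):
        grouped[seg.get("speaker", "UNKNOWN")].append(seg["text"])
    return dict(grouped)
```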

API Specification

API Parameters

  • audio_input (str, required): URL of the audio file.

  • translate (bool, optional): If enabled, provides the English translation of the output. Defaults to false.

  • language (str, optional): Two-letter code specifying the input language; useful when automatic language detection may be unreliable.

  • processing_type (str, optional): Defines the processing action. Supported types: diarize, align.

  • output (str, optional): Determines the output format. Options: subtitles (SRT format) or raw (time-stamped text). Default is raw.
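The parameter table can be enforced client-side before sending a request; a sketch that validates a payload against the documented names and values (the checks simply mirror the list above):

```python
# Allowed values taken directly from the parameter table.
ALLOWED_PROCESSING = {"diarize", "align"}
ALLOWED_OUTPUT = {"subtitles", "raw"}

def validate_payload(payload: dict) -> dict:
    """Raise if the payload violates the documented parameter spec."""
    if "audio_input" not in payload:
        raise ValueError("audio_input is required")
    if not isinstance(payload.get("translate", False), bool):
        raise TypeError("translate must be a bool")
    lang = payload.get("language")
    if lang is not None and len(lang) != 2:
        raise ValueError("language must be a two-letter code")
    pt = payload.get("processing_type")
    if pt is not None and pt not in ALLOWED_PROCESSING:
        raise ValueError(f"processing_type must be one of {sorted(ALLOWED_PROCESSING)}")
    if payload.get("output", "raw") not in ALLOWED_OUTPUT:
        raise ValueError(f"output must be one of {sorted(ALLOWED_OUTPUT)}")
    return payload
```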

Copyright notice: WhisperX includes software developed by Max Bain.
