Tutorial: Deploy with vLLM
In this tutorial, we will deploy a vLLM endpoint in a few easy steps. vLLM has become one of the leading libraries for LLM serving and inference, and it supports a wide range of model architectures.
vLLM depends on the model weights being fetched from Hugging Face and requires a User Access Token to fetch them.
You can obtain the Access Token in your Hugging Face account by clicking the Profile icon (top right corner) and selecting Access Tokens.
For deploying the vLLM endpoint, the READ permissions are sufficient.
Please store the obtained token safely. You will need it for the next steps!
Some models on Hugging Face require the user to accept their usage policy, so make sure you verify this for the models you are deploying.
In this tutorial, we will deploy mistralai/Mistral-7B-v0.1 on a General Compute (24 GB VRAM) GPU type. For larger models, you may need to choose one of the other GPU types we offer.
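As a rough sanity check on why a 24 GB GPU is enough for this model, the sketch below estimates the memory taken by 16-bit weights and by the KV cache for one full-length sequence. The model figures (about 7.24 billion parameters, 32 layers, 8 key/value heads, head dimension 128) are assumptions taken from the published Mistral-7B-v0.1 configuration, and the estimate ignores activations and other runtime overhead, so treat the result as a ballpark only.

```python
# Back-of-the-envelope GPU memory estimate for 16-bit serving.
# ASSUMPTIONS: the model figures passed in below come from the published
# Mistral-7B-v0.1 configuration; verify them against the model card.
# The 24 GB of VRAM is treated as 24 GiB for simplicity.

GIB = 1024 ** 3


def weight_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights (2 bytes per parameter in fp16/bf16)."""
    return num_params * bytes_per_param


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size for one token: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def estimate(total_vram_gib: float, gpu_memory_utilization: float,
             num_params: float, num_layers: int, num_kv_heads: int,
             head_dim: int, max_model_len: int) -> dict:
    """Compare the memory budget with weights plus one full-length sequence."""
    budget = total_vram_gib * GIB * gpu_memory_utilization
    weights = weight_bytes(num_params)
    one_sequence = kv_cache_bytes_per_token(
        num_layers, num_kv_heads, head_dim) * max_model_len
    return {
        "budget_gib": budget / GIB,
        "weights_gib": weights / GIB,
        "one_sequence_kv_gib": one_sequence / GIB,
        "fits": weights + one_sequence < budget,
    }


result = estimate(
    total_vram_gib=24,
    gpu_memory_utilization=0.9,   # same value as the start command flag
    num_params=7.24e9,            # assumed parameter count
    num_layers=32,                # assumed
    num_kv_heads=8,               # assumed (grouped-query attention)
    head_dim=128,                 # assumed
    max_model_len=8192,           # same value as the start command flag
)
print(result)
```

Under these assumptions the weights take roughly 13.5 GiB of a 21.6 GiB budget, which leaves several GiB for the KV cache. The same arithmetic gives a quick indication of when a larger model calls for one of the other GPU types.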
Log in to the DataCrunch cloud dashboard, and go to Containers -> New deployment. Name your deployment and select the Compute Type.
We will use the official vLLM Docker image: set Container Image to docker.io/vllm/vllm-openai
Toggle on the Public location for your image
Select the Tag to deploy
Set the Exposed HTTP port and Healthcheck port to 8000
Toggle Start Command on
Modify CMD to include the following entries: --model, mistralai/Mistral-7B-v0.1, --gpu-memory-utilization, 0.9, --max-model-len, 8192
Add your Hugging Face User Access Token to the Environment Variables as HUGGING_FACE_HUB_TOKEN
Deploy container
(You can leave the Scaling options to their default values.)
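The settings from the steps above can be summarized in a small Python sketch, which may be handy for keeping the configuration under version control or reviewing it before filling in the dashboard form. The dictionary keys and helper names here are hypothetical and chosen for illustration; they are not a DataCrunch API schema. Only the image name, port, CMD entries, and environment variable name come from this tutorial.

```python
# Summary of the deployment settings described in the steps above.
# ASSUMPTION: the dictionary layout and function names are illustrative
# only and do not correspond to any DataCrunch API.

def build_start_command(model: str, gpu_memory_utilization: float,
                        max_model_len: int) -> list:
    """Return the CMD entries in the order they are entered in the form."""
    return [
        "--model", model,
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
    ]


def build_environment(hf_token: str) -> dict:
    """Return the environment variables, refusing an empty token."""
    if not hf_token:
        raise ValueError("A Hugging Face User Access Token is required")
    return {"HUGGING_FACE_HUB_TOKEN": hf_token}


deployment = {
    "container_image": "docker.io/vllm/vllm-openai",
    "exposed_http_port": 8000,
    "healthcheck_port": 8000,
    "cmd": build_start_command("mistralai/Mistral-7B-v0.1", 0.9, 8192),
    # Placeholder value: substitute your own token, and never commit it.
    "env": build_environment("<your-hugging-face-token>"),
}

print(" ".join(deployment["cmd"]))
```

Each CMD entry is a separate list element, which mirrors how the flag names and their values are entered as individual entries in the Start Command field.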
That's it! You should now have a running deployment.
For production use, we recommend authenticating with the container registry or using a private registry to avoid potential rate limits imposed by public container registries.
Before you can connect to the endpoint, you will need to generate an authentication token: go to Keys -> Inference API Keys and click Create.
The endpoint URL for your deployment is in the Containers API section in the top left of the screen.
Below is an example cURL command for running your test request:
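The same kind of test request can also be sketched in Python using only the standard library. This is a minimal sketch under a few assumptions: that the container serves vLLM's OpenAI-compatible /v1/completions route at the endpoint URL, and that the inference API key is passed as a Bearer token in the Authorization header. The endpoint URL and key shown in the usage note are placeholders to replace with your own values.

```python
import json
import urllib.request


def build_completion_request(endpoint_url: str, api_key: str, prompt: str,
                             model: str = "mistralai/Mistral-7B-v0.1",
                             max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for an OpenAI-style text completion.

    ASSUMPTIONS: the /v1/completions path and the Bearer authorization
    scheme; adjust both if your deployment differs.
    """
    url = endpoint_url.rstrip("/") + "/v1/completions"
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON body of the response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

To try it, call `send(build_completion_request("<your-endpoint-url>", "<your-inference-api-key>", "Hello"))`. The model name in the payload should match the value passed to --model in the start command.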
You should see a response that looks like this:
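As a hedged illustration of how such a response might be handled in code, the sketch below pulls the generated text out of an OpenAI-style completion body, which is the shape vLLM's OpenAI-compatible server is expected to return. The sample dictionary is an illustrative placeholder with made-up values, not real output from a deployment, and the helper name is hypothetical.

```python
def extract_completion_text(response: dict) -> str:
    """Return the text of the first choice in an OpenAI-style completion."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["text"]


# ILLUSTRATIVE PLACEHOLDER: structure only, values are not real output.
sample_response = {
    "object": "text_completion",
    "model": "mistralai/Mistral-7B-v0.1",
    "choices": [
        {"index": 0, "text": " <generated text>", "finish_reason": "length"},
    ],
}

print(extract_completion_text(sample_response))
```

Raising an error on an empty choices list makes a failed or malformed response visible immediately instead of surfacing later as a missing key.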