Learn how to quickly launch a vLLM server on Hugging Face Jobs with a single command. Streamline your AI model testing and deployment.
Artificial intelligence continues to evolve, and accessing large technology/">language models (LLMs) has become easier thanks to platforms like Hugging Face. Hugging Face Jobs allows users to deploy LLMs quickly, and among its capabilities is running a vLLM server with just one command. This feature streamlines the tasks of testing and evaluation, empowering developers and researchers alike.
This article explores how to set up a vLLM server on Hugging Face Jobs efficiently, enabling you to harness its power for a range of applications.
Before diving into launching your vLLM server, it’s essential to ensure you meet a few prerequisites:
With prerequisites checked, the real fun begins. Launching a vLLM server is straightforward. You can use the following command to roll out the server, which utilizes Hugging Face’s infrastructure:
hf jobs run --flavor a10g-large --expose 8000 --timeout 2h
vllm/vllm-openai:latest
vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000
This command specifically asks for a GPU with the --flavor option, managing the port exposure through Hugging Face's job proxy. Upon successful execution, you will receive a message that details your job ID and the URL to access the server:
✓ Job started id: 6a381ca1953ed90bfb947332
url: https://huggingface.co/jobs/qgallouedec/6a381ca1953ed90bfb947332
Wait a few moments for the model weights to download. Once you see "Application startup complete" in the logs, your server is ready for queries.
Your vLLM server is now operational, and it uses the OpenAI API, making it easy to send queries to it. To interact with it, you just need your Hugging Face token to authenticate. Here’s a quick way to query your server using curl:
curl -X POST https://6a381ca1953ed90bfb947332--8000.hf.jobs/v1/completions
-H "Authorization: Bearer YOUR_HF_TOKEN"
-d '{"prompt": "Hello, how can I help you?", "max_tokens": 50}'
Alternatively, if you prefer Python, point the OpenAI client to your server’s exposed URL while passing the token as the API key:
from openai import OpenAI
client = OpenAI(api_key='YOUR_HF_TOKEN', base_url='https://6a381ca1953ed90bfb947332--8000.hf.jobs')
response = client.Completions.create(prompt='Hello, how can I help you?', max_tokens=50)
It’s important to note that the job runs privately, and unauthorized access will lead to a denied request. Ensure that your Hugging Face token is kept secure and avoid using it in untrustworthy environments.
Once your experiments conclude, you can easily stop your server. Since jobs are billed per second, explicitly stopping the server is a cost-effective choice. To cancel the running job, execute:
hf jobs cancel
As a reminder, a flavor like a10g-large incurs a cost of approximately $1.50 per hour. For more pricing details, check the Hugging Face hardware documentation.
If you find yourself needing to work with larger models, Hugging Face Jobs accommodates scalability. To run larger models, update the flavor parameter and enable model sharding through the --tensor-parallel-size option. Here's a sample command for the Qwen3.5 model:
hf jobs run --flavor h200x2 --expose 8000 --timeout 2h
vllm/vllm-openai:latest
vllm serve Qwen/Qwen3.5-122B-A10B
--host 0.0.0.0 --port 8000 --tensor-parallel-size 2
Make sure the --tensor-parallel-size aligns with the number of GPUs used. Exploring larger flavors tends to yield better value for computational efficiency.
If you prefer a graphical interface over command-line interactions, you can set up a Gradio app. By incorporating a few lines of code alongside your server endpoint, you can create a chat interface.
When utilizing this method, consider advising the reasoning parser for enhanced responses tailored to your needs. Utilize the following command:
hf jobs run --flavor a10g-large --expose 8000 --timeout 2h
vllm/vllm-openai:latest
vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000 --reasoning-parser deepseek_r1
For debugging or monitoring purposes, SSH access allows you to connect directly to the server environment. Use the following command to launch with SSH enabled:
hf jobs run --flavor a10g-large --expose 8000 --timeout 2h --ssh
vllm/vllm-openai:latest
vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000
Once running, you can SSH into the server using:
hf jobs ssh
This access lets you execute commands like nvidia-smi to monitor GPU usage and engage with your model directly. This capability can significantly enhance troubleshooting compared to viewing log files externally.
Another interesting application is using vLLM as a backend for coding agents like Pi. To enable this, you must make sure to relaunch your server with tool calling functionality.
The command can be modified as follows to include the necessary flags:
hf jobs run --flavor h200x2 --expose 8000 --timeout 2h
vllm/vllm-openai:latest
vllm serve Qwen/Qwen3.5-122B-A10B
--host 0.0.0.0 --port 8000
--enable-auto-tool-choice
--tool-call-parser hermes
Subsequently, add your job as a custom provider in the Pi configuration file to harness this new functionality effectively.
While Hugging Face Jobs offers a flexible solution for deploying models, it's not the only method available. Inference Endpoints are managed solutions that provide additional features suitable for a production environment.
Select Hugging Face Jobs when you need flexibility, detailed control over model settings and architecture, as this option is well-suited for experimental or short-term usage scenarios. In contrast, opt for Inference Endpoints when establishing a long-term, scaling endpoint that requires consistent access without incurring costs during inactivity.
If you wish to explore more about vLLM and Hugging Face, various guides examine different models and their applications. For example, the Serve Models on Jobs guide outlines how to manage other OpenAI-compatible servers.
Additionally, the Hugging Face community is very active. Engaging with forums and community resources can enrich your understanding and application of vLLM and other Hugging Face technologies.