How to launch a vLLM server on Hugging Face Jobs effortlessly

Artificial intelligence continues to evolve, and accessing large technology/">language models (LLMs) has become easier thanks to platforms like Hugging Face. Hugging Face Jobs allows users to deploy LLMs quickly, and among its capabilities is running a vLLM server with just one command. This feature streamlines the tasks of testing and evaluation, empowering developers and researchers alike.

This article explores how to set up a vLLM server on Hugging Face Jobs efficiently, enabling you to harness its power for a range of applications.

Quick setup requirements

Before diving into launching your vLLM server, it’s essential to ensure you meet a few prerequisites:

A valid payment method or a positive prepaid credit balance is necessary, as usage is billed per minute based on hardware.
Ensure you have the correct software version installed. Specifically, the Hugging Face Hub should be updated to at least version 1.20.0. You can update this by running:
pip install -U "huggingface_hub>=1.20.0"
You need to be logged into Hugging Face from your local environment using the command:
hf auth login

Launching your vLLM server

With prerequisites checked, the real fun begins. Launching a vLLM server is straightforward. You can use the following command to roll out the server, which utilizes Hugging Face’s infrastructure:

hf jobs run --flavor a10g-large --expose 8000 --timeout 2h 
  vllm/vllm-openai:latest 
  vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000

This command specifically asks for a GPU with the --flavor option, managing the port exposure through Hugging Face's job proxy. Upon successful execution, you will receive a message that details your job ID and the URL to access the server:

✓ Job started id: 6a381ca1953ed90bfb947332  
  url: https://huggingface.co/jobs/qgallouedec/6a381ca1953ed90bfb947332

Wait a few moments for the model weights to download. Once you see "Application startup complete" in the logs, your server is ready for queries.

Interacting with your vLLM server

Your vLLM server is now operational, and it uses the OpenAI API, making it easy to send queries to it. To interact with it, you just need your Hugging Face token to authenticate. Here’s a quick way to query your server using curl:

curl -X POST https://6a381ca1953ed90bfb947332--8000.hf.jobs/v1/completions 
  -H "Authorization: Bearer YOUR_HF_TOKEN" 
  -d '{"prompt": "Hello, how can I help you?", "max_tokens": 50}'

Alternatively, if you prefer Python, point the OpenAI client to your server’s exposed URL while passing the token as the API key:

from openai import OpenAI  
client = OpenAI(api_key='YOUR_HF_TOKEN', base_url='https://6a381ca1953ed90bfb947332--8000.hf.jobs')
response = client.Completions.create(prompt='Hello, how can I help you?', max_tokens=50)

It’s important to note that the job runs privately, and unauthorized access will lead to a denied request. Ensure that your Hugging Face token is kept secure and avoid using it in untrustworthy environments.

Stopping and managing your vLLM job

Once your experiments conclude, you can easily stop your server. Since jobs are billed per second, explicitly stopping the server is a cost-effective choice. To cancel the running job, execute:

hf jobs cancel

As a reminder, a flavor like a10g-large incurs a cost of approximately $1.50 per hour. For more pricing details, check the Hugging Face hardware documentation.

Exploring advanced options

If you find yourself needing to work with larger models, Hugging Face Jobs accommodates scalability. To run larger models, update the flavor parameter and enable model sharding through the --tensor-parallel-size option. Here's a sample command for the Qwen3.5 model:

hf jobs run --flavor h200x2 --expose 8000 --timeout 2h 
  vllm/vllm-openai:latest 
  vllm serve Qwen/Qwen3.5-122B-A10B 
  --host 0.0.0.0 --port 8000 --tensor-parallel-size 2

Make sure the --tensor-parallel-size aligns with the number of GPUs used. Exploring larger flavors tends to yield better value for computational efficiency.

Interactive user interface options

If you prefer a graphical interface over command-line interactions, you can set up a Gradio app. By incorporating a few lines of code alongside your server endpoint, you can create a chat interface.

When utilizing this method, consider advising the reasoning parser for enhanced responses tailored to your needs. Utilize the following command:

hf jobs run --flavor a10g-large --expose 8000 --timeout 2h 
  vllm/vllm-openai:latest 
  vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000 --reasoning-parser deepseek_r1

SSH access for advanced management

For debugging or monitoring purposes, SSH access allows you to connect directly to the server environment. Use the following command to launch with SSH enabled:

hf jobs run --flavor a10g-large --expose 8000 --timeout 2h --ssh 
  vllm/vllm-openai:latest 
  vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000

Once running, you can SSH into the server using:

hf jobs ssh

This access lets you execute commands like nvidia-smi to monitor GPU usage and engage with your model directly. This capability can significantly enhance troubleshooting compared to viewing log files externally.

Using vLLM in coding agents

Another interesting application is using vLLM as a backend for coding agents like Pi. To enable this, you must make sure to relaunch your server with tool calling functionality.

The command can be modified as follows to include the necessary flags:

hf jobs run --flavor h200x2 --expose 8000 --timeout 2h 
  vllm/vllm-openai:latest 
  vllm serve Qwen/Qwen3.5-122B-A10B 
  --host 0.0.0.0 --port 8000 
  --enable-auto-tool-choice 
  --tool-call-parser hermes

Subsequently, add your job as a custom provider in the Pi configuration file to harness this new functionality effectively.

Choosing between HF Jobs and Inference Endpoints

While Hugging Face Jobs offers a flexible solution for deploying models, it's not the only method available. Inference Endpoints are managed solutions that provide additional features suitable for a production environment.

Select Hugging Face Jobs when you need flexibility, detailed control over model settings and architecture, as this option is well-suited for experimental or short-term usage scenarios. In contrast, opt for Inference Endpoints when establishing a long-term, scaling endpoint that requires consistent access without incurring costs during inactivity.

Further resources and community discussions

If you wish to explore more about vLLM and Hugging Face, various guides examine different models and their applications. For example, the Serve Models on Jobs guide outlines how to manage other OpenAI-compatible servers.

Additionally, the Hugging Face community is very active. Engaging with forums and community resources can enrich your understanding and application of vLLM and other Hugging Face technologies.