Deploying DeepSeek-R1 for distributed inferencing with Ray

Date: 2026-02-17

Ray is an open source framework for scaling Python applications through distributed computing. The framework focuses on scaling AI/ML training and inference pipelines horizontally through distributed resource allocation and execution, though it can be used in other Python projects as well.

In October 2025, Ray joined the PyTorch Foundation alongside leading open source AI/ML projects such as PyTorch and vLLM, ensuring sustained development and vendor-neutral governance within the AI/ML ecosystem.

Follow me as I deploy a distilled variant of DeepSeek-R1 across 2 GPU-enabled cloud servers on Huawei Cloud with vLLM and Ray. Leveraging Ray as the framework for distributed inference enables us to scale horizontally when needed and increase the number of concurrent requests our model can handle compared to a single server.

Environment setup

I used Huawei Cloud ECS instances to set up my lab environment. Distributed inferencing with Ray requires NVIDIA GPUs, which I do not have physical access to. Fortunately, Huawei Cloud offers a variety of GPU-enabled instance types in Hong Kong powered by NVIDIA Tesla T4 datacenter GPUs.

  1. Jump host: t6.xlarge.4 instance with 4 vCPU, 16Gi memory and 128Gi standard SSD
  2. Ray head node: g6.xlarge.4 instance with 4 vCPU, 16Gi memory, 128Gi standard SSD and 1x NVIDIA T4 GPU
  3. Ray worker node: same specification as Ray head node

NVIDIA driver, CUDA and container toolkit versions

The NVIDIA Tesla T4 datacenter GPU has compute capability 7.5, which is supported by vLLM v0.15.1.

The T4 supports the latest 590.x drivers with CUDA 13.1, as per the official cuDNN backend support matrix. However, the default Ubuntu 22.04 image provided by Huawei Cloud ships with driver version 470.x and CUDA 11.4, which is below vLLM’s minimum supported CUDA version (11.8). I therefore removed the pre-installed drivers and installed the latest supported NVIDIA drivers with ubuntu-drivers autoinstall.
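
The driver replacement, roughly (a sketch; the purge patterns are an assumption, so check what dpkg -l reports on your image first):

# Inspect the pre-installed NVIDIA and CUDA packages (driver 470.x, CUDA 11.4 here)
dpkg -l | grep -Ei 'nvidia|cuda'

# Remove the pre-installed driver and CUDA packages
# (adjust the patterns to what dpkg -l reported above)
apt-get purge -y '^nvidia-.*' '^libnvidia-.*' '^cuda.*'
apt-get autoremove -y

# Install the latest supported NVIDIA driver and reboot
ubuntu-drivers autoinstall
reboot

# After the reboot, confirm the new driver and CUDA versions
nvidia-smi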

Once the latest drivers were installed, I installed the NVIDIA Container Toolkit as per the official documentation. The latest version was 1.18.2 at the time of writing.
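
For reference, the apt-based installation roughly follows the official instructions at the time of writing (cross-check the upstream documentation before running this):

# Add the NVIDIA Container Toolkit apt repository and signing key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
    gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -sL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit itself
apt-get update
apt-get install -y nvidia-container-toolkit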

Once the container toolkit was installed, I configured Docker to use the nvidia runtime and restarted the Docker daemon.

nvidia-ctk runtime configure --runtime=docker
systemctl restart docker.service
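
Optionally, run a quick sanity check that containers can access the GPU. A plain Ubuntu image is enough, since the toolkit injects the driver libraries at container start:

# Run nvidia-smi inside a throwaway container to confirm GPU access
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi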

View on Asciinema

NVIDIA driver and software setup

Preloading container images and models

The vllm/vllm-openai:v0.15.1-cu130 container image was preloaded on the jump host with docker pull and exported to the local filesystem with docker save. This saved unnecessary bandwidth from pulling the same container image on multiple nodes, as the image itself is 18G in size.

docker pull vllm/vllm-openai:v0.15.1-cu130
docker save -o vllm-openai.tar vllm/vllm-openai:v0.15.1-cu130

The image was then copied to each Ray node and imported with docker load.
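
A minimal sketch of the copy step with scp, assuming SSH access to both Ray nodes under the hostnames introduced later in this article:

# Copy the exported image tarball from the jump host to each Ray node
for node in ray-demo-node0 ray-demo-node1; do
    scp vllm-openai.tar "$node":~/
done

Then, on each node, import the image: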

docker load -i vllm-openai.tar

Furthermore, to avoid downloading the deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B model used in this lab on each individual node, I created an NFS share on the jump host, mounted it on each node as /mnt/huggingface and pointed the hf CLI to the shared directory for caching via the HF_HOME environment variable.
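
A rough sketch of the NFS setup, assuming Ubuntu on all nodes and a 10.0.0.0/24 VPC subnet (the export options, subnet and jump host IP are assumptions to adapt to your environment):

# On the jump host: install the NFS server and export the shared cache directory
apt-get install -y nfs-kernel-server
mkdir -p /mnt/huggingface
echo "/mnt/huggingface 10.0.0.0/24(rw,sync,no_subtree_check)" >> /etc/exports
exportfs -ra

# On each Ray node: mount the share and point the hf CLI at it
apt-get install -y nfs-common
mkdir -p /mnt/huggingface
mount -t nfs <jump-host-ip>:/mnt/huggingface /mnt/huggingface
export HF_HOME=/mnt/huggingface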

The model was downloaded with hf download as below.

hf download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

View on Asciinema

Preloading container image and model

Serving DeepSeek-R1 with vLLM from a single node

Let’s serve a distilled variant of DeepSeek-R1 with vLLM as we did last time. The main difference is that we will be using our NVIDIA GPU this time.

The model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B is available on Hugging Face.

# Configure an API key for authentication
# Feel free to modify this parameter
export OPENAI_API_KEY="my-very-secure-api-key"

# Run our distilled DeepSeek-R1 with vLLM
docker run --name vllm-openai \
    --rm \
    -d \
    -p 8000:8000 \
    --runtime nvidia \
    --gpus all \
    --ipc host \
    -e LD_LIBRARY_PATH='/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/lib/x86_64-linux-gnu' \
    -v /mnt/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:v0.15.1-cu130 \
    deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --api-key="$OPENAI_API_KEY" \
    --reasoning-parser=deepseek_r1 \
    --gpu-memory-utilization=0.85

The updated parameters are explained below.

  1. --runtime nvidia: use the NVIDIA runtime provided by the container toolkit
  2. --gpus all: use all GPUs available on the node
  3. -e LD_LIBRARY_PATH='...': hardcode the CUDA library path to work around an issue in vLLM v0.15.1: vllm-project/vllm#33369
  4. -v /mnt/huggingface:/root/.cache/huggingface: mount the shared model cache to avoid nodes re-downloading the model on every run
  5. --reasoning-parser=deepseek_r1: parse DeepSeek-R1’s reasoning flow and separate it from the final content in inference responses
  6. --gpu-memory-utilization=0.85: use at most 85% of the available GPU memory to avoid running out of GPU memory and having the process killed. The NVIDIA Tesla T4 has 16Gi of GPU memory

Wait for vLLM to start up and become ready.

i=0
until docker logs vllm-openai 2>&1 | \
    grep -q "Application startup complete."; do
    echo "Waiting $i seconds for vLLM to become ready ..."
    i=$((i + 1))
    sleep 1
done
echo "vLLM is ready after $i seconds."

List the available models.

curl -s -H "Authorization: Bearer $OPENAI_API_KEY" \
    http://localhost:8000/v1/models | jq

Sample output:

{
  "object": "list",
  "data": [
    {
      "id": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
      "object": "model",
      "created": 1771309159,
      "owned_by": "vllm",
      "root": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
      "parent": null,
      "max_model_len": 131072,
      "permission": [
        {
          "id": "modelperm-946de3cf56f5f72b",
          "object": "model_permission",
          "created": 1771309159,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}

Prepare the request body request.json and send an inference request. Save and inspect the response in response.json.

{
  "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
  "messages": [
    {
      "role": "user",
      "content": "Explain DeepSeek-R1 to a non-technical audience in 100-150 words."
    }
  ],
  "max_tokens": 2048,
  "temperature": 0.6
}

# Send the inference request and save the response in response.json
curl -s -XPOST \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$(cat request.json)" \
    http://localhost:8000/v1/chat/completions \
    > response.json

# Inspect the final response
cat response.json | jq --raw-output '.choices[0].message.content'

Sample output, formatted below as a quote for clarity:

DeepSeek-R1 is an advanced AI tool developed by DeepSeek, designed to understand and generate text with high accuracy and efficiency. It uses powerful language models to solve complex problems, adapt to various scenarios, and provide meaningful insights. Whether used in healthcare, finance, or education, DeepSeek-R1 excels in understanding text, generating responses, and solving intricate tasks, making it a versatile and reliable tool for many applications.

All good - our model can respond with a quick introduction of itself. How many concurrent requests can it handle with the current setup?

docker logs vllm-openai 2> /dev/null | grep "Maximum concurrency"

Sample output:

(EngineCore_DP0 pid=70) INFO 02-16 22:18:33 [kv_cache_utils.py:1312] Maximum concurrency for 131,072 tokens per request: 2.17x

Unfortunately, it can only handle roughly 2 concurrent requests at the full 131,072-token context length. Let’s enable it to handle more concurrent requests by distributing the inference workload across multiple nodes.

View on Asciinema

Serving DeepSeek-R1 on a single node with vLLM

Deploying Ray for distributed inferencing

Reference: Parallelism and Scaling - vLLM

Ray allows us to distribute our inference workload horizontally across multiple GPU-enabled servers. This allows us to deploy larger models than would fit in a single server, increase the maximum number of concurrent requests our model can handle, or both.

Ray concepts

Reference: Key Concepts - Ray

A Ray cluster consists of 2 types of nodes:

  1. Head node: responsible for management tasks such as job scheduling and autoscaling
  2. Worker node: responsible for executing submitted Ray jobs

By default, head nodes can execute Ray jobs as well. For large production deployments, the default behavior may not be desirable. Configuring head nodes as per recommended best practice is outside the scope of this article.
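
For reference only (this is not used in our lab), a common way to keep workloads off the head node is to advertise zero schedulable resources when starting it:

# Dedicated head node: advertise no CPUs or GPUs so that Ray schedules
# work exclusively on worker nodes
ray start --head --port=6379 --num-cpus=0 --num-gpus=0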

Setting up the second node

A second GPU-enabled node, ray-demo-node1, was provisioned and configured identically to the first node, ray-demo-node0.

  1. Docker Engine and jq installed
  2. Upgraded to the latest NVIDIA 590.x drivers and CUDA 13.1 with ubuntu-drivers autoinstall
  3. NVIDIA Container Toolkit v1.18.2 installed and Docker configured to use the NVIDIA runtime
  4. The vLLM container image preloaded and the shared Hugging Face model cache mounted to save bandwidth

For our setup, we’ll use:

  1. ray-demo-node0 for our head node
  2. ray-demo-node1 for our worker node

Starting the head node

The vLLM container image includes Ray by default but we need to override the entrypoint to run ray start instead of vllm serve.

Here’s the Ray command we will be running inside the container on the head node.

ray start --block \
    --head \
    --node-ip-address=$HEAD_NODE_IP \
    --port=6379

HEAD_NODE_IP is the IP address of the head node.

  1. --block: do not exit after Ray is initialized
  2. --head: designate the current node as the head node
  3. --node-ip-address=x.x.x.x: the IP address of the current node
  4. --port=6379: the Ray head node binds to port 6379/tcp by default

The corresponding docker run command is below.

# Configurable parameters
# Feel free to adapt to your environment and requirements
export OPENAI_API_KEY="my-very-secure-api-key"
export HEAD_NODE_IP="10.0.0.86"
export WORKER_NODE_IP="10.0.0.62"

# Start the Ray head node
docker run --name vllm-openai-ray-head \
    --rm \
    -d \
    --entrypoint /bin/bash \
    --network host \
    --runtime nvidia \
    --gpus all \
    --ipc host \
    --shm-size 16G \
    -e LD_LIBRARY_PATH='/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/lib/x86_64-linux-gnu' \
    -e GLOO_SOCKET_IFNAME=eth0 \
    -e NCCL_SOCKET_IFNAME=eth0 \
    -e OPENAI_API_KEY="$OPENAI_API_KEY" \
    -v /mnt/huggingface:/root/.cache/huggingface \
    -v /dev/shm:/dev/shm \
    vllm/vllm-openai:v0.15.1-cu130 \
    -c \
    "ray start --block \
        --head \
        --node-ip-address=$HEAD_NODE_IP \
        --port=6379"

Some notable differences from previous docker run commands:

  1. --entrypoint /bin/bash: override the entrypoint to run ray start
  2. --shm-size 16G: set the size of shared memory to 16G
  3. GLOO_SOCKET_IFNAME=eth0: instruct the Gloo communication backend to use the correct network interface and avoid “connection refused” errors (see the snippet after this list if you are unsure of the interface name on your nodes)
  4. NCCL_SOCKET_IFNAME=eth0: same as above, but for the NCCL backend
  5. OPENAI_API_KEY=xxxx: pass the API key into the container so it is available when we later shell in and run vllm serve once our Ray cluster is ready
  6. -v /dev/shm:/dev/shm: mount the host’s shared memory to the container
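
If you are unsure which interface name to use, one way to find out is to ask the kernel which interface routes traffic towards the other node:

# The "dev" field in the output is the interface name to pass to
# GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME
ip route get "$WORKER_NODE_IP"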

Confirm that our head node is running properly.

docker logs vllm-openai-ray-head

Sample output:

2026-02-17 03:46:51,232 INFO usage_lib.py:473 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2026-02-17 03:46:51,236 INFO scripts.py:917 -- Local node IP: 10.0.0.86
2026-02-17 03:46:54,696 SUCC scripts.py:956 -- --------------------
2026-02-17 03:46:54,696 SUCC scripts.py:957 -- Ray runtime started.
2026-02-17 03:46:54,696 SUCC scripts.py:958 -- --------------------
2026-02-17 03:46:54,696 INFO scripts.py:960 -- Next steps
2026-02-17 03:46:54,696 INFO scripts.py:963 -- To add another node to this Ray cluster, run
2026-02-17 03:46:54,696 INFO scripts.py:966 --   ray start --address='10.0.0.86:6379'
2026-02-17 03:46:54,696 INFO scripts.py:975 -- To connect to this Ray cluster:
2026-02-17 03:46:54,696 INFO scripts.py:977 -- import ray
2026-02-17 03:46:54,696 INFO scripts.py:978 -- ray.init(_node_ip_address='10.0.0.86')
2026-02-17 03:46:54,696 INFO scripts.py:1009 -- To terminate the Ray runtime, run
2026-02-17 03:46:54,696 INFO scripts.py:1010 --   ray stop
2026-02-17 03:46:54,696 INFO scripts.py:1013 -- To view the status of the cluster, use
2026-02-17 03:46:54,696 INFO scripts.py:1014 --   ray status
2026-02-17 03:46:54,697 INFO scripts.py:1132 -- --block
2026-02-17 03:46:54,697 INFO scripts.py:1133 -- This command will now block forever until terminated by a signal.
2026-02-17 03:46:54,697 INFO scripts.py:1136 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
2026-02-17 03:46:54,697 INFO scripts.py:1141 -- Process exit logs will be saved to: /tmp/ray/session_2026-02-17_03-46-51_241973_1/logs/ray_process_exit.log

Our Ray head node is initialized and ready.

View on Asciinema

Ray - start the head node

Starting the worker node

The ray start command for the worker node is similar with a few differences.

ray start --block \
    --address=$HEAD_NODE_IP:6379 \
    --node-ip-address=$WORKER_NODE_IP

The --address=x.x.x.x:6379 option instructs our worker node to connect to the head node at port 6379/tcp.

The actual docker run command is below.

# Configurable parameters
# Feel free to adapt to your environment and requirements
export OPENAI_API_KEY="my-very-secure-api-key"
export HEAD_NODE_IP="10.0.0.86"
export WORKER_NODE_IP="10.0.0.62"

# Start the Ray worker node
docker run --name vllm-openai-ray-worker \
    --rm \
    -d \
    --entrypoint /bin/bash \
    --network host \
    --runtime nvidia \
    --gpus all \
    --ipc host \
    --shm-size 16G \
    -e LD_LIBRARY_PATH='/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/lib/x86_64-linux-gnu' \
    -e GLOO_SOCKET_IFNAME=eth0 \
    -e NCCL_SOCKET_IFNAME=eth0 \
    -v /mnt/huggingface:/root/.cache/huggingface \
    -v /dev/shm:/dev/shm \
    vllm/vllm-openai:v0.15.1-cu130 \
    -c \
    "ray start --block \
        --address=$HEAD_NODE_IP:6379 \
        --node-ip-address=$WORKER_NODE_IP"

Check the logs to confirm our worker node is up and ready.

docker logs vllm-openai-ray-worker

Sample output:

2026-02-17 04:01:04,485 INFO scripts.py:1101 -- Local node IP: 10.0.0.62
2026-02-17 04:01:05,770 SUCC scripts.py:1117 -- --------------------
2026-02-17 04:01:05,770 SUCC scripts.py:1118 -- Ray runtime started.
2026-02-17 04:01:05,770 SUCC scripts.py:1119 -- --------------------
2026-02-17 04:01:05,770 INFO scripts.py:1121 -- To terminate the Ray runtime, run
2026-02-17 04:01:05,770 INFO scripts.py:1122 --   ray stop
2026-02-17 04:01:05,770 INFO scripts.py:1132 -- --block
2026-02-17 04:01:05,770 INFO scripts.py:1133 -- This command will now block forever until terminated by a signal.
2026-02-17 04:01:05,770 INFO scripts.py:1136 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
2026-02-17 04:01:05,770 INFO scripts.py:1141 -- Process exit logs will be saved to: /tmp/ray/session_2026-02-17_03-46-51_241973_1/logs/ray_process_exit.log

View on Asciinema

Ray - start the worker node

Inspecting the Ray cluster

Let’s inspect the Ray cluster with ray status from the head node.

docker exec vllm-openai-ray-head \
    ray status

Sample output:

======== Autoscaler status: 2026-02-17 04:45:31.532694 ========
Node status
---------------------------------------------------------------
Active:
 1 node_3d9e4088156b8f9efed00986478580e36580f363b71326bc41dd1a27
 1 node_9e8a3eb4bce2890748ebfc0cfc020ffc515925eb9e05e1be0b479e26
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 0.0/8.0 CPU
 0.0/2.0 GPU
 0B/21.30GiB memory
 0B/9.13GiB object_store_memory

From request_resources:
 (none)
Pending Demands:
 (no resource demands)

The output shows that we have 2 active nodes for a total of 8 vCPU and 2 GPUs. With 16Gi of GPU memory on each T4, that gives us roughly 32Gi of GPU memory across the cluster. Excellent!

View on Asciinema

Get Ray cluster status

Serving DeepSeek-R1 for distributed inferencing

Open a shell in the head node container.

docker exec -it vllm-openai-ray-head /bin/bash

Now run the vllm serve command below.

vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --api-key="$OPENAI_API_KEY" \
    --reasoning-parser=deepseek_r1 \
    --gpu-memory-utilization=0.85 \
    --distributed-executor-backend=ray \
    --tensor-parallel-size=1 \
    --pipeline-parallel-size=2

A few additional options were included.

  1. --distributed-executor-backend=ray: instruct vLLM to use Ray as its distributed backend
  2. --tensor-parallel-size=1: each node has 1 GPU
  3. --pipeline-parallel-size=2: our Ray cluster has 2 nodes. vLLM uses tensor_parallel_size × pipeline_parallel_size GPUs in total, which here works out to 1 × 2 = 2, matching the GPUs available in our cluster

Confirm that the maximum concurrency has more than doubled, from roughly 2 to over 5. Here’s what you should see in the logs.

(EngineCore_DP0 pid=346) INFO 02-17 05:16:01 [kv_cache_utils.py:1307] GPU KV cache size: 694,480 tokens
(EngineCore_DP0 pid=346) INFO 02-17 05:16:01 [kv_cache_utils.py:1312] Maximum concurrency for 131,072 tokens per request: 5.30x

Keep the terminal tab or window open. In a new, separate terminal tab, let’s verify that our model is functioning correctly.

Run ray status again on the head node to confirm both GPUs are utilized.

docker exec vllm-openai-ray-head \
    ray status

Sample output:

======== Autoscaler status: 2026-02-17 05:17:23.186492 ========
Node status
---------------------------------------------------------------
Active:
 1 node_9e8a3eb4bce2890748ebfc0cfc020ffc515925eb9e05e1be0b479e26
 1 node_3d9e4088156b8f9efed00986478580e36580f363b71326bc41dd1a27
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 0.0/8.0 CPU
 2.0/2.0 GPU (2.0 used of 2.0 reserved in placement groups)
 0B/21.30GiB memory
 0B/9.13GiB object_store_memory

From request_resources:
 (none)
Pending Demands:
 (no resource demands)

The line 2.0/2.0 GPU confirms both GPUs across both nodes are utilized.
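
Optionally, cross-check on each node that vLLM has claimed most of the T4’s memory (85% as configured):

# Run on each node; expect most of the T4's 16Gi to be in use
nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv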

Prepare the same request.json.

{
  "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
  "messages": [
    {
      "role": "user",
      "content": "Explain DeepSeek-R1 to a non-technical audience in 100-150 words."
    }
  ],
  "max_tokens": 2048,
  "temperature": 0.6
}

Invoke the Chat Completions API once more and wait for our model to respond, saving the response to response.json for inspection.

# Send the inference request and save the response in response.json
curl -s -XPOST \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$(cat request.json)" \
    http://localhost:8000/v1/chat/completions \
    > response.json

# Inspect the final response
cat response.json | jq --raw-output '.choices[0].message.content'

Sample output:

DeepSeek-R1 is an advanced AI tool developed by DeepSeek, designed to understand and generate text with high accuracy and efficiency. It uses powerful language models to solve complex problems, adapt to various scenarios, and provide meaningful insights. Whether used in healthcare, finance, or education, DeepSeek-R1 excels in understanding text, generating responses, and solving intricate tasks, making it a versatile and reliable tool for many applications.

Congratulations, you have successfully deployed a distilled variant of DeepSeek-R1 to a Ray cluster of 2 nodes and confirmed that the model correctly utilizes the GPU resources from both nodes!

View all in Asciinema: 1, 2

vllm serve with Ray

Verify GPU utilization and model

Concluding remarks and going further

This article demonstrates how Ray can be used to deploy an LLM for distributed inferencing across 2 or more GPU-enabled servers. However, the setup presented in this article is far from production-ready and there is much more to be done to unlock true performance, scalability and high availability.

Below are some recommended next steps towards a production-ready Ray deployment for enterprise-grade distributed inferencing.

  1. Deploy Ray to Kubernetes with KubeRay and automatically benefit from its orchestration capabilities (see the sketch after this list for a starting point)
  2. Use Ray as a backend for your MLOps training and inference pipelines with Kubeflow Pipelines
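
As a starting point for the first item, the KubeRay operator can be installed with its Helm chart (a sketch; check the KubeRay documentation for the current chart version and values):

# Install the KubeRay operator into an existing Kubernetes cluster
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator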

I hope you enjoyed reading this article as much as I enjoyed writing it. Stay tuned for updates ;-)
