
## Overview

> [!IMPORTANT]
> This example and the RPC backend are currently in a proof-of-concept development stage. As such, the functionality is fragile and insecure. Never run the RPC server on an open network or in a sensitive environment!

The `rpc-server` allows exposing `ggml` devices on a remote host. The RPC backend communicates with one or several instances of `rpc-server` and offloads computations to them. This can be used for distributed LLM inference with llama.cpp in the following way:

```mermaid
flowchart TD
    rpcb<-->|TCP|srva
    rpcb<-->|TCP|srvb
    rpcb<-.->|TCP|srvn
    subgraph hostn[Host N]
    srvn[rpc-server]<-.->dev4["CUDA0"]
    srvn[rpc-server]<-.->dev5["CPU"]
    end
    subgraph hostb[Host B]
    srvb[rpc-server]<-->dev3["Metal"]
    end
    subgraph hosta[Host A]
    srva[rpc-server]<-->dev["CUDA0"]
    srva[rpc-server]<-->dev2["CUDA1"]
    end
    subgraph host[Main Host]
    local["Local devices"]<-->ggml[llama-cli]
    ggml[llama-cli]<-->rpcb[RPC backend]
    end
    style hostn stroke:#66,stroke-width:2px,stroke-dasharray: 5 5
    classDef devcls fill:#5B9BD5
    class local,dev,dev2,dev3,dev4,dev5 devcls
```

By default, `rpc-server` exposes all available accelerator devices on the host. If there are no accelerators, it exposes a single CPU device.

## Usage

### Remote hosts

On each remote host, build the backends for each accelerator by adding `-DGGML_RPC=ON` to the build options. For example, to build the `rpc-server` with support for CUDA accelerators:

```bash
mkdir build-rpc-cuda
cd build-rpc-cuda
cmake .. -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build . --config Release
```
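
The same pattern should apply to other backends. As an illustrative sketch (not taken from this README), a Metal-capable host such as Host B in the diagram above could presumably be built with the standard `GGML_METAL` option:

```bash
mkdir build-rpc-metal
cd build-rpc-metal
cmake .. -DGGML_METAL=ON -DGGML_RPC=ON
cmake --build . --config Release
```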

When started, the `rpc-server` will detect and expose all available CUDA devices:

```
$ bin/rpc-server
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Starting RPC server v3.0.0
  endpoint       : 127.0.0.1:50052
  local cache    : n/a
Devices:
  CUDA0: NVIDIA GeForce RTX 5090 (32109 MiB, 31588 MiB free)
```

You can control the set of exposed CUDA devices with the `CUDA_VISIBLE_DEVICES` environment variable or the `--device` command line option. The following two commands have the same effect:

```bash
$ CUDA_VISIBLE_DEVICES=0 bin/rpc-server -p 50052
$ bin/rpc-server --device CUDA0 -p 50052
```

### Main host

On the main host, build llama.cpp with the backends for the local devices and add `-DGGML_RPC=ON` to the build options. Finally, when running `llama-cli` or `llama-server`, use the `--rpc` option to specify the host and port of each `rpc-server`:

```bash
$ llama-cli -hf ggml-org/gemma-3-1b-it-GGUF -ngl 99 --rpc 192.168.88.10:50052,192.168.88.11:50052
```
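
The same `--rpc` option can be passed to `llama-server`; for example (the endpoint addresses are the illustrative ones used above):

```bash
$ llama-server -hf ggml-org/gemma-3-1b-it-GGUF -ngl 99 --rpc 192.168.88.10:50052,192.168.88.11:50052
```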

By default, llama.cpp distributes model weights and the KV cache across all available devices, both local and remote, in proportion to each device's available memory. You can override this behavior with the `--tensor-split` option and set custom proportions when splitting tensor data across devices.
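
As a sketch, assuming three devices end up being available (one local GPU plus two remote ones; the actual device order depends on how llama.cpp enumerates them on your setup), a custom 2:1:1 split could look like this:

```bash
$ llama-cli -hf ggml-org/gemma-3-1b-it-GGUF -ngl 99 \
    --rpc 192.168.88.10:50052,192.168.88.11:50052 \
    --tensor-split 2,1,1
```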

### Local cache

The RPC server can use a local cache to store large tensors and avoid transferring them over the network. This can speed up model loading significantly, especially when using large models. To enable the cache, use the `-c` option:

```bash
$ bin/rpc-server -c
```

By default, the cache is stored in the `$HOME/.cache/llama.cpp/rpc` directory and can be controlled via the `LLAMA_CACHE` environment variable.
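
For example, to place the cache somewhere else (the path below is just an example), you would presumably set `LLAMA_CACHE` when starting the server:

```bash
$ LLAMA_CACHE=/mnt/ssd/llama.cpp-cache bin/rpc-server -c
```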

## Troubleshooting

Use the `GGML_RPC_DEBUG` environment variable to enable debug messages from `rpc-server`:

```bash
$ GGML_RPC_DEBUG=1 bin/rpc-server
```