
Relay Backends

Use any device running an OpenAI-compatible API as an inference backend for your mycellm node. The relay device provides the compute — mycellm provides the routing, credit accounting, and network presence.

```
iPad / Phone / GPU box              Your mycellm node
┌──────────────────────┐            ┌────────────────────┐
│ Ollama / LM Studio / │  ← HTTP →  │ mycellm serve      │
│ PocketPal / vLLM     │            │  --relay device:80 │
│ :8080/v1/models      │            │ announces models   │
└──────────────────────┘            └────────────────────┘
                                         │ QUIC to network
```
  1. The relay device runs any app that exposes /v1/models and /v1/chat/completions
  2. mycellm discovers models from the relay’s /v1/models endpoint
  3. Models are announced to the network as relay:<model-name>
  4. Inference requests are proxied transparently to the relay device
  5. Credits accrue to your node (you contributed the compute)
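The discovery and announcement steps above can be sketched as follows. This is a minimal illustration of the flow, not mycellm's actual internals; the function names and the use of `urllib` are assumptions, but the `relay:` prefixing matches the behavior described above.

```python
import json
from urllib.request import urlopen

def announce_names(model_ids):
    """Map model IDs discovered on a relay to the names announced
    to the network, using the relay: prefix described above."""
    return [f"relay:{mid}" for mid in model_ids]

def discover_relay_models(base_url):
    """Query a relay's OpenAI-compatible /v1/models endpoint and
    return the names to announce (illustrative sketch)."""
    with urlopen(f"{base_url}/v1/models") as resp:
        data = json.load(resp)
    return announce_names(m["id"] for m in data["data"])

# A relay serving llama3.2:3b and phi-4-mini would be announced as:
# announce_names(["llama3.2:3b", "phi-4-mini"])
#   -> ["relay:llama3.2:3b", "relay:phi-4-mini"]
```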
```sh
mycellm serve --relay http://ipad.lan:8080
```

Multiple relays:

```sh
mycellm serve --relay http://ipad.lan:8080 --relay http://ollama.lan:11434
```

Or via the environment variable:

```sh
MYCELLM_RELAY_BACKENDS=http://ipad.lan:8080,http://ollama.lan:11434
```

Open the dashboard → Models tab → Relay Device tab → paste the device URL and click Add Relay.

Connected relays show online/offline status and their discovered models.

```sh
curl -X POST http://localhost:8420/v1/node/relay/add \
  -H "Content-Type: application/json" \
  -d '{"url": "http://ipad.lan:8080", "name": "iPad Pro"}'
```

Or from the chat interface:

```
/relay add http://ipad.lan:8080
/relay                                 # list all relays
/relay refresh                         # re-discover models
/relay remove http://ipad.lan:8080
```

Any app that exposes an OpenAI-compatible API works as a relay:

| App | Platform | Notes |
| --- | --- | --- |
| Ollama | macOS, Linux, Windows | Default port 11434. Batches requests. |
| LM Studio | macOS, Linux, Windows | Enable "Local Server" in sidebar |
| llama.cpp server | Any | `llama-server --port 8080` |
| vLLM | Linux (CUDA) | High-throughput, continuous batching |
| LocalAI | Any | Drop-in OpenAI replacement |

Apple Silicon devices (M1–M4) are excellent inference backends. You need an app that runs a local LLM and exposes an OpenAI-compatible API server.

Currently the best option for iOS/iPadOS is running Ollama via a Mac on the same network, then pointing the relay at that Mac. Native iOS apps with API server support are still emerging — check the App Store for new options.

For Mac devices (MacBook, Mac Mini, Mac Studio):

  1. Install Ollama or LM Studio
  2. Pull a model: `ollama pull llama3.2:3b`
  3. Ollama serves on port 11434 by default
  4. Add as relay: `mycellm serve --relay http://<mac-ip>:11434`

The M4 with 16GB RAM can run 8B models at ~30 tok/s via Metal.

List all relay backends and their status.

```json
{
  "relays": [
    {
      "url": "http://ipad.lan:8080",
      "name": "ipad",
      "online": true,
      "models": ["llama3.2:3b", "phi-4-mini"],
      "model_count": 2
    }
  ]
}
```

```json
{"url": "http://ipad.lan:8080", "name": "iPad Pro", "max_concurrent": 2}
```

`max_concurrent` controls how many simultaneous requests mycellm sends to this device (default: 32). Set it lower for constrained devices like iPads (2) and higher for beefy GPU servers.
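One way a per-device cap like this can be enforced is with a semaphore sized to `max_concurrent`. This is a sketch of the concept, not mycellm's actual implementation; the class and method names are illustrative.

```python
import asyncio

class RelayLimiter:
    """Caps in-flight requests to one relay device, as max_concurrent does."""
    def __init__(self, max_concurrent: int = 32):
        self._sem = asyncio.Semaphore(max_concurrent)
        self.in_flight = 0
        self.peak = 0

    async def send(self, do_request):
        async with self._sem:            # blocks while the cap is reached
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await do_request()
            finally:
                self.in_flight -= 1

async def demo():
    limiter = RelayLimiter(max_concurrent=2)    # iPad-style low cap
    async def fake_request():
        await asyncio.sleep(0.01)               # stand-in for proxied HTTP
        return "ok"
    await asyncio.gather(*(limiter.send(fake_request) for _ in range(6)))
    return limiter.peak

# Six queued requests, but the device never sees more than 2 at once:
# asyncio.run(demo()) -> 2
```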

```json
{"url": "http://ipad.lan:8080"}
```

Re-discover models from all relay backends. Returns count of new models found.

Relay models are prefixed with relay: to distinguish them from locally-loaded models:

GET /v1/models

```json
{
  "data": [
    {"id": "Qwen2.5-3B-Q8_0", "owned_by": "local"},
    {"id": "relay:llama3.2:3b", "owned_by": "relay:ipad"},
    {"id": "relay:phi-4-mini", "owned_by": "relay:ipad"}
  ]
}
```

To the rest of the network, these models are indistinguishable from locally-loaded models. Peers route inference requests to your node, and your node proxies them to the relay device.
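Conceptually, the proxying step just maps the announced name back to the relay's own model ID before forwarding. A sketch (the function name is illustrative, not mycellm's API):

```python
def upstream_model(announced: str) -> str:
    """Map the network-facing name back to what the relay device expects."""
    return announced.removeprefix("relay:")

# A peer requests relay:llama3.2:3b; your node forwards the chat/completions
# call to the relay device with model=llama3.2:3b:
# upstream_model("relay:llama3.2:3b") -> "llama3.2:3b"
# Locally-loaded model names pass through unchanged:
# upstream_model("Qwen2.5-3B-Q8_0") -> "Qwen2.5-3B-Q8_0"
```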

Each model source has different concurrency characteristics:

| Source | Concurrent requests | Why |
| --- | --- | --- |
| Local GGUF (llama.cpp) | 1 per model | C library context is not thread-safe |
| API Provider | 32 per model (default) | Cloud server handles backpressure |
| Device Relay | 32 per model (default) | Remote device handles backpressure |

A node with 2 local models can serve 2 concurrent users — one per model. Adding relay or API provider models adds more concurrent capacity without the hardware constraint.
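Under the limits in the table, a node's total concurrent capacity is simply the sum of per-model caps. Illustrative arithmetic using the defaults above (the source labels are made up for this example):

```python
# Per-model caps from the table above (defaults; tunable via max_concurrent)
CAPS = {"local_gguf": 1, "api_provider": 32, "device_relay": 32}

def capacity(models):
    """models: list of (name, source) pairs; returns total concurrent requests."""
    return sum(CAPS[source] for _, source in models)

# Two local GGUF models -> 2 concurrent users, one per model:
two_local = [("Qwen2.5-3B", "local_gguf"), ("phi-4-mini", "local_gguf")]
# Adding one relay model (default cap) lifts capacity without new hardware:
with_relay = two_local + [("relay:llama3.2:3b", "device_relay")]
# capacity(two_local) -> 2 ; capacity(with_relay) -> 34
```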

Tune per device with `max_concurrent`:

```sh
# iPad relay — limited device, keep low
curl -X POST localhost:8420/v1/node/relay/add \
  -d '{"url": "http://ipad:8080", "max_concurrent": 2}'

# Cloud API — high throughput
curl -X POST localhost:8420/v1/node/models/load \
  -d '{"name": "gpt-4o", "backend": "openai", "api_base": "...", "max_concurrent": 64}'
```

mycellm polls relay backends every 60 seconds to detect:

  • New models added to the relay device
  • Models removed from the relay device
  • Relay device going offline/coming back online

If a relay goes offline, its models are marked unavailable and requests route elsewhere on the network.
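The effect of each poll can be sketched as recomputing the available model set from every relay's last health check. A conceptual sketch, not mycellm's poller; the dict shape mirrors the relay listing shown earlier:

```python
def available_models(relays):
    """relays: list of dicts shaped like the relay listing above.
    Offline relays contribute no models, so requests route elsewhere."""
    return {
        f"relay:{m}"
        for r in relays if r["online"]
        for m in r["models"]
    }

relays = [{"url": "http://ipad.lan:8080", "online": True,
           "models": ["llama3.2:3b", "phi-4-mini"]}]
# While online: {"relay:llama3.2:3b", "relay:phi-4-mini"}
# After the poll marks the relay offline, its models drop out:
relays[0]["online"] = False
# available_models(relays) -> set()
```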