ollama-herd

# Ollama Herd Fleet Manager

You are managing an Ollama Herd fleet: a smart Ollama multimodal router that distributes Ollama AI workloads across multiple devices. Ollama Herd handles 4 model types: Ollama LLM inference, image generation (mflux), speech-to-text (Qwen3-ASR), and Ollama embeddings. The Ollama scoring engine evaluates nodes on 7 signals (thermal state, memory fit, queue depth, latency history, role affinity, availability trend, context fit) and routes each Ollama request to the optimal device.

## Install Ollama Herd

```bash
pip install ollama-herd    # install Ollama Herd from PyPI
herd                       # start the Ollama router
herd-node                  # start an Ollama node agent (run on each device)
```

PyPI: [`ollama-herd`](https://pypi.org/project/ollama-herd/) | Source: [github.com/geeks-accelerator/ollama-herd](https://github.com/geeks-accelerator/ollama-herd)

## Ollama Router endpoint

The Ollama Herd router runs at `http://localhost:11435` by default. If the user has specified a different Ollama URL, use that instead.
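To make the seven-signal routing from the introduction more concrete, here is a minimal sketch of how a router could combine those signals into one score per node and pick the best one. This document does not specify the real formula, so the weighted sum, the weight values, the 0-to-1 normalization, and the names `WEIGHTS`, `score_node`, and `pick_node` are all assumptions made for illustration only.

```python
# Illustrative only: the actual Ollama Herd scoring formula is not given in
# this document. Weights and normalization below are hypothetical.
WEIGHTS = {
    "thermal_state": 0.15,
    "memory_fit": 0.20,
    "queue_depth": 0.20,
    "latency_history": 0.15,
    "role_affinity": 0.10,
    "availability_trend": 0.10,
    "context_fit": 0.10,
}


def score_node(signals: dict[str, float]) -> float:
    """Weighted sum of signals, each assumed normalized to [0, 1] (1 = best)."""
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())


def pick_node(nodes: dict[str, dict[str, float]]) -> str:
    """Return the id of the node with the highest combined score."""
    if not nodes:
        raise ValueError("no candidate nodes")
    return max(nodes, key=lambda node_id: score_node(nodes[node_id]))


if __name__ == "__main__":
    # Hypothetical node ids and signal values.
    candidates = {
        "mac-studio": {name: 0.9 for name in WEIGHTS},
        "laptop": {name: 0.4 for name in WEIGHTS},
    }
    print(pick_node(candidates))
```

A weighted sum is only one plausible design. A real router might instead apply hard filters first (for example, rejecting nodes where the model does not fit in memory) and score only the survivors.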
## Ollama API endpoints

Use curl to interact with the Ollama fleet:

### Ollama fleet status: overview of all Ollama nodes and queues

```bash
# ollama_fleet_status — check Ollama node health
curl -s http://localhost:11435/fleet/status | python3 -m json.tool
```

Returns:

- `fleet.nodes_total` / `fleet.nodes_online`: how many Ollama devices are in the fleet
- `fleet.models_loaded`: total Ollama models currently loaded across all nodes
- `fleet.requests_active`: total in-flight Ollama requests
- `nodes[]`: per-node details (Ollama status, hardware, memory, CPU, disk, loaded Ollama models with context lengths)
- `queues`: queue depths per Ollama node:model pair (pending, in-flight, done, failed)

### List all Ollama models available across the fleet

```bash
# ollama_model_list — all Ollama models on all nodes
curl -s http://localhost:11435/api/tags | python3 -m json.tool
```

### Pull an Ollama model onto the fleet

```bash
# ollama_pull_model — pull a model (auto-selects best node, streams progress)
curl -N http://localhost:11435/api/pull -d '{"name": "codestral"}'

# pull to a specific node
curl -N http://localhost:11435/api/pull -d '{"name": "llama3.3:70b", "node_id": "mac-studio"}'

# non-streaming (blocks until complete)
curl http://localhost:11435/api/pull -d '{"name": "phi4", "stream": false}'
```

### List Ollama models currently loaded in memory

```bash
# ollama_loaded_models — hot Ollama models in GPU memory
curl -s http://localhost:11435/api/ps | python3 -m json.tool
```

### OpenAI-compatible Ollama model list

```bash
curl -s http://localhost:11435/v1/models | python3 -m json.tool
```

### Ollama usage statistics (per-node, per-model daily aggregates)

```bash
curl -s http://localhost:11435/dashboard/api/usage | python3 -m json.tool
```

### Recent Ollama request traces

```bash
# ollama_traces — recent Ollama routing decisions
curl -s "http://localhost:11435/dashboard/api/traces?limit=20" | python3 -m json.tool
```

Returns the last N Ollama routing decisions with: model requested, node selected, score, latency, tokens, retry/fallback status, tags.

### Ollama fleet health analysis

```bash
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool
```

Returns 15 automated Ollama health checks: offline/degraded nodes, memory pressure, underutilized nodes, VRAM fallbacks, KV cache bloat (OLLAMA_NUM_PARALLEL too high), version mismatch, context protection, zombie reaper, Ollama model thrashing, request timeouts, error rates, retry rates, client disconnects, and incomplete streams.

### Ollama model recommendations

```bash
curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool
```

Returns AI-powered Ollama model mix recommendations per node based on hardware capabilities, Ollama usage patterns, and curated benchmark data.

### Ollama settings

```bash
# View current Ollama config and node versions
curl -s http://localhost:11435/dashboard/api/settings | python3 -m json.tool

# Toggle Ollama runtime settings (auto_pull, vram_fallback)
curl -s -X POST http://localhost:11435/dashboard/api/settings \
  -H "Content-Type: application/json" \
  -d '{"auto_pull": false}'
```

### Ollama model management

```bash
# View per-node Ollama model details with sizes and usage
curl -s http://localhost:11435/dashboard/api/model-management | python3 -m json.tool

# Pull an Ollama model onto a specific node
curl -s -X POST http://localhost:11435/dashboard/api/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.3:70b", "node_id": "mac-studio"}'

# Delete an Ollama model from a specific node
curl -s -X POST http://localhost:11435/dashboard/api/delete \
  -H "Content-Type: application/json" \
  -d '{"model": "old-model:7b", "node_id": "mac-studio"}'
```

### Ollama model insights (summary statistics)

```bash
curl -s http://localhost:11435/dashboard/api/models | python3 -m json.tool
```

### Per-app Ollama analytics (requires request tagging)

```bash
curl -s http://localhost:11435/dashboard/api/apps | python3 -m json.tool
```

## Ollama Dashboard

The Ollama web dashboard is at `http://localhost:11435/dashboard`. It has eight tabs:

- **Fleet Overview**: live Ollama node cards, queue depths, and request counts via SSE
- **Trends**: Ollama requests per hour, average latency, and token throughput charts (24h–7d)
- **Model Insights**: per-Ollama-model latency, tokens/sec, usage comparison
- **Apps**: per-tag Ollama analytics with request volume, latency, tokens, error rates
- **Benchmarks**: Ollama capacity growth over time with per-run throughput and latency percentiles
- **Health**: 15 automated Ollama fleet health checks with severity levels
- **Recommendations**: Ollama model mix recommendations per node with one-click pull
- **Settings**: Ollama runtime toggle switches, read-only config tables, and node version tracking

Direct the user to open this URL in their browser for visual Ollama monitoring.

## Ollama Resilience features

- **Auto-retry**: if an Ollama node fails before the first response chunk, re-scores and retries on the next-best Ollama node (up to 2 retries)
- **Ollama model fallbacks**: clients specify backup Ollama models; tries alternatives when the primary is unavailable
- **Context protection**: strips `num_ctx` from Ollama requests when unnecessary to prevent Ollama model reload hangs; auto-upgrades to a larger loaded model
- **VRAM-aware fallback**: routes to an already-loaded Ollama model in the same category instead of cold-loading
- **Zombie reaper**: background task detects and cleans up stuck in-flight Ollama requests
- **Auto-pull**: automatically pulls missing Ollama models onto the best available node

## Common Ollama tasks

### Check if the Ollama fleet is healthy

1. Hit `/fleet/status` and verify `nodes_online > 0`
2. Hit `/dashboard/api/health` for automated Ollama health checks with severity levels
3. Look at Ollama queue depths; deep queues may indicate a bottleneck

### Find which Ollama node has a specific model

1. Hit `/fleet/status` and inspect each Ollama node's `ollama.models_loaded` and `ollama.models_available`
2. Or hit `/api/tags` for a flat list of all available Ollama models with which nodes have them

### Check if an Ollama model is loaded (hot) or cold

1. Hit `/api/ps`: Ollama models listed here are currently loaded in memory (hot)
2. Models in `/api/tags` but not in `/api/ps` are on disk but not loaded (cold)

### View recent Ollama inference activity

1. Hit `/dashboard/api/traces?limit=10` to see the last 10 Ollama requests
2. Each trace shows: Ollama model, node, score, latency, tokens, retry/fallback status

### Diagnose slow Ollama responses

1. Check `/dashboard/api/traces` for high-latency Ollama entries
2. Check `/fleet/status` for Ollama nodes with high queue depths or memory pressure
3. Check if the Ollama model had to cold-load (look for low scores in trace)
4. Check if `num_ctx` is being sent; Ollama context protection logs show if requests triggered reloads

### Query the Ollama trace database directly

```bash
# Recent Ollama failures
sqlite3 ~/.fleet-manager/latency.db "SELECT request_id, model, status, error_message FROM request_traces WHERE status='failed' ORDER BY timestamp DESC LIMIT 10"

# Slowest Ollama requests
sqlite3 ~/.fleet-manager/latency.db "SELECT model, node_id, latency_ms/1000.0 as secs FROM request_traces WHERE status='completed' ORDER BY latency_ms DESC LIMIT 10"
```

### Test Ollama inference through the fleet

```bash
# Ollama via OpenAI format
curl -s http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Hello from Ollama"}],"stream":false}'

# Ollama native format
curl -s http://localhost:11435/api/chat \
  -d '{"model":"llama3.3:70b","messages":[{"role":"user","content":"Hello from Ollama"}],"stream":false}'
```

## Ollama Guardrails

- Never restart or stop the Ollama Herd router or Ollama node agents without explicit user confirmation.
- Never delete or modify files in `~/.fleet-manager/` (contains Ollama latency data, traces, and logs).
- Do not pull Ollama models onto nodes without user confirmation; Ollama model downloads can be large (10-100+ GB).
- Do not delete Ollama models without user confirmation.
- If an Ollama node shows as offline, report it to the user rather than attempting to SSH into the machine.

## Ollama Failure handling

- If curl to the Ollama router fails with connection refused, tell the user the Ollama Herd router may not be running and suggest `herd` to start it.
- If the Ollama fleet status shows 0 nodes online, suggest starting Ollama node agents with `herd-node` on their devices.
- If Ollama mDNS discovery fails, suggest using `--router-url http://router-ip:11435` for explicit connection.
- If Ollama requests hang with 0 bytes returned, check if the client is sending `num_ctx`; Ollama context protection should strip it.
- If a specific Ollama API endpoint returns an error, show the user the full error response and suggest checking the Ollama JSONL logs at `~/.fleet-manager/logs/herd.jsonl`.
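The first two failure-handling rules can be sketched as a small Python helper that queries `/fleet/status` and turns the outcome into a suggestion for the user. This is a sketch under assumptions rather than part of Ollama Herd: it assumes the status JSON nests `nodes_total` and `nodes_online` under a top-level `fleet` object (as the dotted names suggest), and the function names `diagnose` and `check_fleet` are hypothetical.

```python
import json
import urllib.error
import urllib.request

ROUTER_URL = "http://localhost:11435"  # default router address from this document


def diagnose(status: dict) -> str:
    """Map a parsed /fleet/status payload to a suggestion.

    Assumes the payload looks like {"fleet": {"nodes_total": N, "nodes_online": M}}.
    """
    fleet = status.get("fleet", {})
    online = fleet.get("nodes_online", 0)
    total = fleet.get("nodes_total", 0)
    if online == 0:
        return "No nodes online: start a node agent with `herd-node` on each device."
    if online < total:
        return f"{online}/{total} nodes online: report the offline nodes to the user."
    return f"All {total} nodes online."


def check_fleet(base_url: str = ROUTER_URL, timeout: float = 5.0) -> str:
    """Fetch fleet status and return a human-readable diagnosis."""
    try:
        with urllib.request.urlopen(f"{base_url}/fleet/status", timeout=timeout) as resp:
            return diagnose(json.load(resp))
    except urllib.error.HTTPError as exc:
        # The router answered with an error: surface the full response body.
        body = exc.read().decode("utf-8", errors="replace")
        return f"HTTP {exc.code} from router: {body} (check ~/.fleet-manager/logs/herd.jsonl)"
    except (urllib.error.URLError, OSError) as exc:
        # Connection refused, DNS failure, or timeout: the router is likely down.
        return f"Router unreachable ({exc}): it may not be running; suggest `herd`."


if __name__ == "__main__":
    print(check_fleet())
```

Keeping `diagnose` separate from the network call means the decision logic can be exercised against saved status payloads without a running router.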
