Author: admin | Source: ClawHub
Version: V 1.1.0
Security check: passed
Downloads: 86 | Favorites: 0
# NCCL Optimizer

Finds the best NCCL communication configuration for distributed training, with clear separation of **intra-node** and **inter-node** bandwidth metrics.

## What it does

1. **GPU topology** — `nvidia-smi topo -m` to detect NVLink vs PCIe.
2. **RDMA check** — `ibv_devinfo` PORT_ACTIVE state for InfiniBand/RoCE.
   - ✅ RDMA → emit recommended `NCCL_IB_*` env-vars.
   - ❌ No RDMA → socket benchmark sweep.
3. **Intra-node all-reduce** — sweeps `NCCL_SOCKET_IFNAME` × `NCCL_NET_GDR_LEVEL` × `NCCL_IB_TIMEOUT`, runs `all_reduce_perf -g <N>`, picks best bus bandwidth.
4. **Intra-node P2P** — `p2p_bw` for GPU↔GPU pair bandwidth (if available).
5. **Inter-node benchmark** — if `nodes=` is passed, runs MPI `all_reduce_perf` across nodes; otherwise emits a ready-to-run command.

## Prerequisites

| Tool | Purpose | Install |
|------|---------|---------|
| `nvidia-smi` | GPU info + topology | NVIDIA driver |
| `ibv_devinfo` | RDMA detection | `apt install ibverbs-utils` |
| `all_reduce_perf` | Collective benchmark | See below |
| `p2p_bw` | Peer-to-peer benchmark | Same nccl-tests build |
| `mpirun` | Inter-node benchmark | `apt install openmpi-bin` |

### Build nccl-tests

```bash
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
# For V100 (sm_70), A100 (sm_80), A800 (sm_80), H100 (sm_90):
make -j$(nproc) CUDA_HOME=/usr/local/cuda \
  NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80"
export PATH=$PWD/build:$PATH
```

## Usage

```bash
# Intra-node only
openclaw skill run nccl_optimizer

# Include inter-node benchmark (requires passwordless SSH + MPI)
openclaw skill run nccl_optimizer "nodes=10.0.0.1,10.0.0.2"
```

## Metrics explained

| Metric | What it measures |
|--------|-----------------|
| All-reduce bus BW (intra) | Collective throughput across local GPUs — relevant for single-node training |
| P2P bandwidth | GPU↔GPU direct copy speed (NVLink ≫ PCIe) |
| All-reduce bus BW (inter) | Collective throughput across nodes — bottleneck for multi-node training |
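The RDMA check in step 2 comes down to looking for an active port in `ibv_devinfo` output. A minimal sketch, assuming `ibverbs-utils` is installed on the target machine; the `rdma_state` helper name is hypothetical:

```shell
# Hypothetical helper for step 2: read ibv_devinfo output on stdin and
# report which transport NCCL should use.
rdma_state() {
  if grep -q 'state:.*PORT_ACTIVE'; then
    echo "rdma"     # recommend NCCL_IB_* env-vars
  else
    echo "socket"   # fall back to the socket benchmark sweep
  fi
}

# Real use would be:  ibv_devinfo 2>/dev/null | rdma_state
# Demonstrated here on captured sample lines:
printf 'state: PORT_ACTIVE (4)\n' | rdma_state
printf 'state: PORT_DOWN (1)\n' | rdma_state
```

On a machine with no `ibv_devinfo` at all, treating that the same as `socket` is the safe default.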
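The "picks best bus bandwidth" part of step 3 can be sketched as a parse over `all_reduce_perf` output. This assumes the nccl-tests column layout in which out-of-place `busbw` is the 8th field of each data row (verify against your build); `best_busbw` is a hypothetical helper:

```shell
# Hypothetical helper for step 3: keep the largest out-of-place busbw
# (assumed to be field 8 of each data row) seen in all_reduce_perf output.
best_busbw() {
  awk '$1 ~ /^[0-9]+$/ && $8 > max { max = $8 } END { printf "%.2f\n", max }'
}

# In the real sweep this is piped from each env-var combination, e.g.
#   NCCL_SOCKET_IFNAME=ib0 NCCL_NET_GDR_LEVEL=2 all_reduce_perf -g 8 | best_busbw
# Demonstrated here on two captured-style data rows:
sample='1048576 262144 float sum -1 120.1 8.73 15.28 0 118.9 8.82 15.43 0
16777216 4194304 float sum -1 900.4 18.63 32.61 0 898.2 18.68 32.69 0'
printf '%s\n' "$sample" | best_busbw
```

Here the larger message size wins, so the helper prints `32.61`. The sweep then records which env-var combination produced that peak.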
## Notes

- Bus bandwidth normalises for GPU count: `(N-1)/N × data / time`. Compare at the same N.
- Multi-node training is almost always bottlenecked by inter-node bandwidth, not intra-node.
- RDMA (InfiniBand/RoCE) typically gives 10-100× better inter-node bandwidth than TCP.
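The bus-bandwidth normalisation above can be sanity-checked directly. The figures here (8 GPUs, a 1 GiB all-reduce completing in 10 ms) are illustrative, not measured:

```shell
# busbw = (N-1)/N * bytes / time, reported in GB/s (1 GB = 1e9 bytes,
# as nccl-tests reports it). Plain algorithm BW would be bytes/time
# without the (N-1)/N factor.
awk -v n=8 -v bytes=1073741824 -v t=0.01 \
  'BEGIN { printf "%.2f GB/s\n", (n - 1) / n * bytes / t / 1e9 }'
```

This prints `93.95 GB/s`: the raw 107.37 GB/s of data movement scaled by 7/8, which is why results are only comparable at the same GPU count N.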

Tags

skill ai

Install via conversation

This skill can be installed via conversation on the following platforms:

OpenClaw WorkBuddy QClaw Kimi Claude

Method 1: install SkillHub and the skill

Help me install SkillHub and the nccl-optimizer-1776123749 skill

Method 2: set SkillHub as the preferred skill installation source

Set SkillHub as my preferred skill installation source, then help me install the nccl-optimizer-1776123749 skill

Install via command line

skillhub install nccl-optimizer-1776123749

Download Zip package

⬇ Download nccl_optimizer v1.1.0

File size: 8.99 KB | Published: 2026-4-14 13:49

v1.1.0 (latest) 2026-4-14 13:49
Linux compatibility hardening: platform guard, container detection, robust interface scanning, fixed CUDA version parsing, distro-aware MPI install hints
