nccl_optimizer

# NCCL Optimizer

Finds the best NCCL communication configuration for distributed training, reporting **intra-node** and **inter-node** bandwidth as separate metrics.

## What it does

1. **GPU topology**: runs `nvidia-smi topo -m` to detect NVLink vs PCIe.
2. **RDMA check**: looks for the `PORT_ACTIVE` state in `ibv_devinfo` output to detect InfiniBand/RoCE.
   - ✅ RDMA → emit recommended `NCCL_IB_*` env-vars.
   - ❌ No RDMA → socket benchmark sweep.
3. **Intra-node all-reduce**: sweeps `NCCL_SOCKET_IFNAME` × `NCCL_NET_GDR_LEVEL` × `NCCL_IB_TIMEOUT`, runs `all_reduce_perf -g <N>` for each combination, and picks the one with the best bus bandwidth.
4. **Intra-node P2P**: runs `p2p_bw` to measure GPU↔GPU pair bandwidth (if available).
5. **Inter-node benchmark**: if `nodes=` is passed, runs `all_reduce_perf` under MPI across the listed nodes; otherwise emits a ready-to-run command.

## Prerequisites

| Tool | Purpose | Install |
|------|---------|---------|
| `nvidia-smi` | GPU info + topology | NVIDIA driver |
| `ibv_devinfo` | RDMA detection | `apt install ibverbs-utils` |
| `all_reduce_perf` | Collective benchmark | See below |
| `p2p_bw` | Peer-to-peer benchmark | Same nccl-tests build |
| `mpirun` | Inter-node benchmark | `apt install openmpi-bin` |

### Build nccl-tests

```bash
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
# Set NVCC_GENCODE to match your GPU:
#   V100 = sm_70, A100/A800 = sm_80, H100 = sm_90.
# The example below targets sm_80.
# For the inter-node (mpirun) benchmark, nccl-tests must be built with
# MPI support: add MPI=1 MPI_HOME=<path to your MPI install>.
make -j"$(nproc)" CUDA_HOME=/usr/local/cuda \
    NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80"
export PATH="$PWD/build:$PATH"
```

## Usage

```bash
# Intra-node only
openclaw skill run nccl_optimizer

# Include inter-node benchmark (requires passwordless SSH + MPI)
openclaw skill run nccl_optimizer "nodes=10.0.0.1,10.0.0.2"
```

## Metrics explained

| Metric | What it measures |
|--------|-----------------|
| All-reduce bus BW (intra) | Collective throughput across local GPUs; relevant for single-node training |
| P2P bandwidth | GPU↔GPU direct copy speed (NVLink ≫ PCIe) |
| All-reduce bus BW (inter) | Collective throughput across nodes; the bottleneck for multi-node training |

## Notes

- Bus bandwidth normalises for GPU count. For all-reduce, nccl-tests computes it as `2(N-1)/N × data / time`. Compare results at the same N.
- Multi-node training is almost always bottlenecked by inter-node bandwidth, not intra-node.
- RDMA (InfiniBand/RoCE) typically gives 10-100× better inter-node bandwidth than TCP.
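
To make the bus-bandwidth note concrete, here is a minimal sketch of the calculation. It assumes the all-reduce correction factor `2(N-1)/N` that nccl-tests documents for its `busbw` column. The function name `busbw` and its argument order are hypothetical helpers for illustration; they are not part of nccl_optimizer or nccl-tests.

```bash
# busbw N BYTES SECONDS
# Prints the all-reduce bus bandwidth in GB/s (1 GB = 1e9 bytes),
# given the GPU count, the message size in bytes, and the elapsed time.
# Hypothetical helper: assumes the nccl-tests all-reduce factor 2(N-1)/N.
busbw() {
    awk -v n="$1" -v bytes="$2" -v secs="$3" 'BEGIN {
        algbw = bytes / secs / 1e9            # algorithm bandwidth, GB/s
        printf "%.2f\n", algbw * 2 * (n - 1) / n
    }'
}

# 1 GB all-reduced across 8 GPUs in 10 ms:
busbw 8 1000000000 0.01   # prints 175.00
```

With N = 2 the factor is exactly 1, so bus bandwidth equals algorithm bandwidth; as N grows the factor approaches 2.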
