Reproducible Benchmarking for Triton Kernels
Kernel timings shift with JIT, warmup count, reps, and cache state — these settings must stay fixed across repos.
New kernels, shapes, dtypes, and backends drop in without one-off scripts. The workflow stays readable as the collection grows.
Latency alone won't tell you where the bottleneck is. TFLOPS and GB/s show whether a kernel is compute- or memory-bound.
Figure: Measurement sensitivity. Timing variance (lower is better) for four setups: ❌ JIT included in timing, ⚠️ no warmup, ⚠️ single execution, ✅ warmup + reps + quantiles.
Triton lets you write GPU kernels, but every repo ships its own benchmark.py. Warmup, timing, and reporting all differ, so results don't line up.
Turning collected kernels into a real suite requires one harness for correctness, fair timing, and readable metrics.
TriBench handles the Formalize and Aggregate steps:
meta.json schema per kernel → turns a loose community kernel into a reproducible benchmark entry.
| Project | Layer | Approach | Self-contained kernels | Data-driven config |
|---|---|---|---|---|
| Triton testing | Library | do_bench / perf_report: utilities, not a suite | n/a | n/a |
| TritonBench | Suite | PyTorch op suite via submodules; broad coverage, monolithic setup | No | No |
| KernelBench | Eval | LLM evaluation: fast_p = correctness + speedup threshold | Partial | No |
| MultiKernelBench | Eval | Multi-platform backend abstraction across DSLs | No | No |
| TriBench | Harness | Kernel packs + meta schema + CLI + result logging | ✓ Yes | ✓ Yes |
vs. Triton testing: timing utilities only; no registry, no correctness checks, no saved results.
vs. TritonBench: one monolithic suite vs. self-contained kernel packs with local meta.json.
vs. KernelBench / MultiKernelBench: TriBench can act as the execution layer underneath these generative benchmarks.
Each kernel dir has meta.json + reference.py +
triton_impl.py. Adding a kernel = adding a folder.
Kernels load on demand. One broken pack won't block the rest.
Compile and warmup happen outside the measurement window, so A/B comparisons stay fair.
The schema leaves room for other backends like ROCm/HIP.
Directory & Dependency Map
tribench/ ← Core engine
registry.py discover & lazy-load
meta.py parse meta.json schema
gen.py generate typed inputs
correctness.py validate vs reference
bench.py warmup → timing loop
metrics.py derive TFLOPS / GB/s
env.py snapshot environment
io.py save JSON + Markdown
cli.py CLI (validate/test/run)
kernels/<op>/ ← Kernel packs
meta.json entrypoints / cases / dtypes / variants
reference.py PyTorch baseline
triton_impl.py Triton kernel(s)
bench_entry.py (optional) custom bench
The core only needs meta.json and an entrypoint
(file.py:symbol). A broken kernel stays isolated.
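A minimal sketch of how a `file.py:symbol` entrypoint could be resolved while keeping broken packs isolated. The helper names (`parse_entrypoint`, `load_entrypoint`, `load_all`) are hypothetical and are not taken from TriBench's registry.py; only the `file.py:symbol` format comes from the text above.

```python
import importlib.util
from pathlib import Path


def parse_entrypoint(spec: str) -> tuple[str, str]:
    """Split 'file.py:symbol' into (file name, symbol name)."""
    file_name, sep, symbol = spec.partition(":")
    if not sep or not file_name or not symbol:
        raise ValueError(f"expected 'file.py:symbol', got {spec!r}")
    return file_name, symbol


def load_entrypoint(pack_dir: Path, spec: str):
    """Import the file from the pack directory and return the named symbol."""
    file_name, symbol = parse_entrypoint(spec)
    path = pack_dir / file_name
    module_spec = importlib.util.spec_from_file_location(
        f"{pack_dir.name}_{path.stem}", path
    )
    if module_spec is None or module_spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)  # runs the pack's code on demand
    return getattr(module, symbol)


def load_all(packs: dict[str, tuple[Path, str]]):
    """Load each pack independently; record failures instead of raising."""
    loaded, failed = {}, {}
    for name, (pack_dir, spec) in packs.items():
        try:
            loaded[name] = load_entrypoint(pack_dir, spec)
        except Exception as exc:  # a broken pack must not stop the others
            failed[name] = repr(exc)
    return loaded, failed
```

Because nothing is imported until `load_entrypoint` runs, a pack that raises at import time only shows up in `failed`; the remaining packs still load.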
No point timing a wrong kernel. TriBench checks outputs and gradients against the PyTorch reference first.
Warmup first, then repeated timed runs. Reports min/max/mean
and p50/p95. Backends: do_bench, do_bench_cudagraph, raw
CUDA events.
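The warmup-then-measure loop can be sketched with the standard library alone. This is an illustration under stated assumptions, not TriBench's bench.py: `timed_runs`, `summarize`, and the `sync` hook are hypothetical names. On a GPU, `sync` would typically be `torch.cuda.synchronize`, and in practice `triton.testing.do_bench` (which accepts `warmup`, `rep`, and `quantiles`) does this work for you.

```python
import time
from typing import Callable


def timed_runs(
    fn: Callable[[], object],
    warmup: int = 10,
    reps: int = 50,
    sync: Callable[[], None] = lambda: None,  # hypothetical hook; no-op on CPU
) -> list[float]:
    """Return per-call latencies in milliseconds, excluding warmup calls."""
    for _ in range(warmup):  # triggers JIT compilation and fills caches; never timed
        fn()
    sync()
    samples = []
    for _ in range(reps):
        start = time.perf_counter()
        fn()
        sync()  # on GPU, wait for the kernel to finish before stopping the clock
        samples.append((time.perf_counter() - start) * 1e3)
    return samples


def summarize(samples: list[float]) -> dict[str, float]:
    """Basic summary statistics over the timed samples."""
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": sum(samples) / len(samples),
    }
```

Keeping the warmup calls out of `samples` is what stops one-time JIT cost from leaking into the reported latency.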
→ JSON + Markdown outputs
reference.py implements the same op in PyTorch as the ground truth.
meta.json defines the test matrix. TriBench runs every listed size and
dtype (fp16, bf16, fp32).
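As a rough illustration of a data-driven test matrix, the sketch below expands a meta.json-like document into individual cases. The field layout (`cases` as a list of objects with a `shape`, `dtypes` as a list of strings) is an assumption for illustration; the real schema may differ, and it may not use a full cross product of sizes and dtypes.

```python
import itertools
import json

# Hypothetical meta.json fragment; key names mirror the terms used in this
# document, but the exact layout is assumed.
META_TEXT = """
{
  "cases": [{"shape": [2048, 768]}, {"shape": [4096, 1024]}],
  "dtypes": ["fp16", "bf16", "fp32"]
}
"""


def expand_cases(meta: dict) -> list[dict]:
    """Pair every listed shape with every listed dtype."""
    return [
        {"shape": tuple(case["shape"]), "dtype": dtype}
        for case, dtype in itertools.product(meta["cases"], meta["dtypes"])
    ]


matrix = expand_cases(json.loads(META_TEXT))
for entry in matrix:
    print(entry["dtype"], entry["shape"])
```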
Outputs and gradients are compared to the reference. max_diff must
stay within the dtype's tolerance.
Only passing kernels move to timing. Broken ones never produce benchmark numbers.
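The correctness gate can be sketched as a max-difference check against a per-dtype tolerance. The tolerance values and function names below are hypothetical, chosen only to illustrate the idea, and plain Python lists stand in for tensors; with PyTorch tensors, `torch.testing.assert_close` is one existing way to perform a similar comparison.

```python
# Hypothetical absolute tolerances per dtype; not TriBench's actual values.
TOLERANCE = {"fp32": 1e-5, "fp16": 1e-3, "bf16": 1e-2}


def max_abs_diff(actual, expected) -> float:
    """Largest element-wise absolute difference between two flat sequences."""
    if len(actual) != len(expected):
        raise ValueError("length mismatch between output and reference")
    return max((abs(a - e) for a, e in zip(actual, expected)), default=0.0)


def passes(actual, expected, dtype: str) -> tuple[bool, float]:
    """Return (ok, max_diff) for one case at the given dtype's tolerance."""
    diff = max_abs_diff(actual, expected)
    return diff <= TOLERANCE[dtype], diff


def gate(cases: list[dict]) -> list[dict]:
    """Keep only the cases that pass, so only these proceed to timing."""
    return [
        c for c in cases
        if passes(c["actual"], c["expected"], c["dtype"])[0]
    ]
```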
Example: RMS Norm
$ tribench test --kernel rms_norm
Testing rms_norm...
fp32 (2048, 768) max_diff=1.2e-6 ✅
fp16 (2048, 768) max_diff=3.1e-4 ✅
bf16 (2048, 768) max_diff=4.7e-4 ✅
fp16 (4096, 1024) max_diff=2.8e-4 ✅
fp16 (8192, 4096) max_diff=3.5e-4 ✅
All 5 cases passed.
Correctness first — so the numbers are worth trusting.
do_bench: Default backend. Set warmup and reps; returns summary stats or quantiles.
do_bench_cudagraph: Uses CUDA Graphs to cut Python and launch overhead. Best for very short kernels.
CUDA events: Direct GPU timestamps, manual warmup and sync. Use when you need full control.
Latency Distribution
Mean latency hides jitter and contention spikes. p50 = typical case, p95 = tail.
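A small worked example of why quantiles are reported alongside the mean. The latency values are made up for illustration (they are not measurements), and `percentile` is a hypothetical helper using the nearest-rank definition, which is only one of several common percentile conventions.

```python
import math
import statistics


def percentile(samples: list[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative data: 18 quiet runs plus two contention spikes.
latencies_ms = [1.0] * 18 + [9.0, 11.0]

mean = statistics.fmean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(f"mean={mean:.2f} ms  p50={p50:.2f} ms  p95={p95:.2f} ms")
```

Here the two spikes pull the mean to nearly twice the typical latency, while p50 stays at the typical value and p95 exposes the tail.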
TFLOPS measures how much compute the kernel actually uses — tells you if it's compute-bound.
e.g. 85 TFLOPS / 312 peak = 27% utilization
GB/s measures bandwidth usage — matters most for memory-bound ops.
e.g. 1200 GB/s / 2039 peak = 59% bandwidth
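The arithmetic behind these metrics can be sketched as below. The function names are hypothetical rather than the API of metrics.py, and the FLOP and byte counts must be supplied per kernel (for a dense matmul, the usual count is 2·M·N·K FLOPs).

```python
def tflops(flop_count: float, latency_ms: float) -> float:
    """Achieved compute throughput in TFLOPS."""
    return flop_count / (latency_ms * 1e-3) / 1e12


def gbps(bytes_moved: float, latency_ms: float) -> float:
    """Achieved memory throughput in GB/s (bytes read plus bytes written)."""
    return bytes_moved / (latency_ms * 1e-3) / 1e9


def utilization(achieved: float, peak: float) -> float:
    """Fraction of the hardware peak that the kernel reaches."""
    return achieved / peak


# The two examples from the text above.
print(f"compute:   {utilization(85, 312):.0%}")
print(f"bandwidth: {utilization(1200, 2039):.0%}")
```

Comparing the two ratios gives a first hint at the bottleneck: a kernel close to peak bandwidth but far from peak compute is likely memory-bound, and the reverse suggests it is compute-bound.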
torch / triton / CUDA versions
GPU name, driver version, memory
Git commit hash to reproduce the run
Random seed + full CLI command
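A minimal sketch of an environment snapshot, assuming only the standard library plus optional `torch` and `triton` imports. The function and key names are hypothetical and not taken from env.py. GPU name, driver version, memory, and CUDA version are left out here because they need a GPU runtime to query.

```python
import importlib
import platform
import subprocess
import sys


def _git_commit():
    """Current commit hash, or None outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True, timeout=5,
        )
        return out.stdout.strip()
    except (OSError, subprocess.SubprocessError):
        return None


def _version(module_name: str):
    """Version string of an installed module, or None if it is unavailable."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        return None
    return getattr(module, "__version__", None)


def snapshot_env(seed: int, argv=None) -> dict:
    """Collect the details needed to rerun a benchmark under the same setup."""
    return {
        "python": platform.python_version(),
        "torch": _version("torch"),
        "triton": _version("triton"),
        "git_commit": _git_commit(),
        "seed": seed,
        "command": " ".join(argv if argv is not None else sys.argv),
    }
```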
results.json
Machine-readable — for analysis, CI, and cross-run diffs.
summary.md
Markdown table for reports, READMEs, and submissions.
Each result file includes the full environment snapshot — one JSON is enough to reproduce the run. Same schema across all 28 kernel packs.
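Writing both outputs from the same rows might look like the sketch below. The row keys (`kernel`, `dtype`, `p50_ms`, `p95_ms`) and the top-level `env` / `results` layout are assumptions for illustration and do not reflect TriBench's actual result schema.

```python
import json
from pathlib import Path


def to_markdown(rows: list[dict]) -> str:
    """Render result rows as a Markdown table."""
    header = "| kernel | dtype | p50 (ms) | p95 (ms) |"
    rule = "|---|---|---|---|"
    body = [
        f"| {r['kernel']} | {r['dtype']} | {r['p50_ms']:.3f} | {r['p95_ms']:.3f} |"
        for r in rows
    ]
    return "\n".join([header, rule, *body]) + "\n"


def save_results(out_dir, env: dict, rows: list[dict]) -> None:
    """Write results.json (with the environment embedded) and summary.md."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    payload = {"env": env, "results": rows}
    (out_dir / "results.json").write_text(json.dumps(payload, indent=2))
    (out_dir / "summary.md").write_text(to_markdown(rows))
```

Embedding `env` in the same file as the numbers is what lets a single JSON carry everything needed to reproduce the run.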
What variants means in meta.json
A single kernel pack can hold multiple implementations. TriBench runs all of them with identical cases, dtypes, layouts, seeds, and timer settings.
Main Impl
Optimized Triton
Baseline
PyTorch reference
Ablation
Alt block sizes, etc.
Most kernels ship with a reference, an optimized
version, and a few experiments. variants keeps them in one place for easy
comparison.
Fused variant: runs add + silu in one pass.
Unfused variant: runs torch.add then F.silu.
Both live in one meta.json → comparison stays
consistent and repeatable
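To illustrate the idea of variants sharing one set of inputs, here is a CPU-only sketch of a fused and an unfused add + silu. Plain Python lists stand in for tensors, and `VARIANTS` and `run_variants` are hypothetical names; a real pack would register Triton and PyTorch implementations instead.

```python
import math
import random


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def add_silu_fused(a, b):
    """One pass: add and activate each element together."""
    return [silu(x + y) for x, y in zip(a, b)]


def add_silu_unfused(a, b):
    """Two passes: materialize the sum, then apply the activation."""
    summed = [x + y for x, y in zip(a, b)]
    return [silu(s) for s in summed]


# Hypothetical variant table, mirroring the idea of `variants` in meta.json.
VARIANTS = {"fused": add_silu_fused, "unfused": add_silu_unfused}


def run_variants(variants: dict, n: int, seed: int) -> dict:
    """Feed every variant the same seeded inputs so results are comparable."""
    rng = random.Random(seed)
    a = [rng.uniform(-4.0, 4.0) for _ in range(n)]
    b = [rng.uniform(-4.0, 4.0) for _ in range(n)]
    return {name: fn(a, b) for name, fn in variants.items()}
```

Generating the inputs once, outside the per-variant loop, is the design choice that keeps the comparison fair: every variant sees identical data and seed.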
Check that meta.json is valid
tribench validate-meta
Check outputs against the reference across shapes & dtypes
tribench test --kernel all
Collect timing, metrics, and logs
tribench run --kernel rope --dtype fp16
Docs and quickstart guide
🚀 TriBench Getting Started
TriBench Team