| layout | docs |
|---|---|
| title | Performance Tuning Guide |
| description | Comprehensive guide for maximizing performance with CUDA LLM Kernel Optimization |
| lang | en |
Comprehensive guide for maximizing performance with the CUDA LLM Kernel Optimization library.
- Hardware Requirements
- Performance Targets
- Optimization Strategies
- Benchmarking
- Profiling
- Troubleshooting Performance Issues
- Best Practices
| Component | Minimum | Recommended | Notes |
|---|---|---|---|
| GPU | SM 7.0 (Volta) | SM 8.0+ (Ampere) | Tensor Cores essential for peak performance |
| VRAM | 8 GB | 16+ GB | Larger batches and sequences need more memory |
| CUDA | 11.0 | 12.0+ | Newer versions have better optimizations |
| Driver | 450.80+ | 525+ | Check nvidia-smi |
| Architecture | SM | Tensor Core | Key Features |
|---|---|---|---|
| Volta (V100) | 7.0 | FP16 | First Tensor Core generation |
| Turing (T4) | 7.5 | FP16, INT8 | INT8 quantization support |
| Ampere (A100) | 8.0, 8.6 | FP16, BF16, INT8, TF32 | Async copy, 164KB shared mem |
| Ada Lovelace (RTX 4090) | 8.9 | + FP8 | Enhanced FP8 support |
| Hopper (H100) | 9.0 | + FP8 | Thread block clusters, DPX instructions |
These rows describe hardware capabilities, not the full set of precisions currently exposed by this repository.
Calculate memory requirements before running:
def estimate_attention_memory(batch, heads, seq_len, head_dim, dtype=torch.float16):
"""Estimate FlashAttention memory usage in MB."""
bytes_per_elem = 2 if dtype == torch.float16 else 4
# Input tensors (Q, K, V)
input_mem = 3 * batch * heads * seq_len * head_dim * bytes_per_elem
# Output tensor
output_mem = batch * heads * seq_len * head_dim * bytes_per_elem
# FlashAttention working memory (overhead)
# ~O(seq_len) rather than O(seq_len²)
working_mem = batch * heads * seq_len * head_dim * bytes_per_elem * 0.1
total_mb = (input_mem + output_mem + working_mem) / (1024 * 1024)
return total_mb
# Example: Large sequence
print(f"Memory needed: {estimate_attention_memory(2, 16, 8192, 64):.1f} MB")
# Output: Memory needed: ~55 MB (vs ~2560 MB for naive attention!)| Matrix Size | cuBLAS Reference | Target | Typical Result |
|---|---|---|---|
| 512×512×512 | 100% | ≥85% | 88-92% |
| 1024×1024×1024 | 100% | ≥90% | 91-95% |
| 2048×2048×2048 | 100% | >90% | 92-98% |
| Sequence Length | PyTorch SDPA | Our FlashAttention | Speedup |
|---|---|---|---|
| 512 | 1.0x | 1.2x | Baseline |
| 1024 | 1.0x | 1.4x | +40% |
| 2048 | 1.0x | 1.6x | +60% |
| 4096 | 1.0x | 1.8x | +80% |
| 8192 | 1.0x | 2.1x | +110% |
Note: Results on A100, FP16 precision, batch=2, heads=8, head_dim=64
| Implementation | Memory Complexity | 4K Sequence | 16K Sequence |
|---|---|---|---|
| Naive Attention | O(N²) | 256 MB | 4 GB |
| Tiled Attention | O(N²), improved | 128 MB | 2 GB |
| FlashAttention | O(N) | 4 MB | 16 MB |
Choose the right kernel for your workload:
def optimal_attention(q, k, v, is_causal=False):
"""Select optimal attention implementation."""
seq_len = q.size(2)
if seq_len >= 2048:
# FlashAttention essential for very long sequences
return flash_attention(q, k, v, is_causal=is_causal)
elif seq_len >= 512:
# FlashAttention still beneficial
return flash_attention(q, k, v, is_causal=is_causal)
elif seq_len >= 128:
# Tiled attention good balance
return tiled_attention(q, k, v)
else:
# Naive may have lower overhead for very short sequences
return naive_attention(q, k, v)# For inference: Always use FP16
model = model.half() # Convert to FP16
q = q.cuda().half()
output = flash_attention(q, k, v)
# For training: FP32 master weights, FP16 compute
# (Mixed precision pattern)
with torch.cuda.amp.autocast():
output = flash_attention(q, k, v)
# Tensor Core GEMM with accumulation
# Input: FP16, Compute: FP16, Output: FP32
c = tensor_core_gemm(a_fp16, b_fp16) # Returns FP32Optimal dimensions should be multiples of 16:
def pad_to_alignment(tensor, alignment=16):
"""Pad tensor dimensions to alignment for optimal Tensor Core performance."""
shape = list(tensor.shape)
# Pad last two dimensions
if len(shape) >= 2:
for i in [-2, -1]:
if shape[i] % alignment != 0:
padding = alignment - (shape[i] % alignment)
shape[i] += padding
if shape != list(tensor.shape):
padded = torch.zeros(shape, dtype=tensor.dtype, device=tensor.device)
# Copy original data
slices = [slice(0, s) for s in tensor.shape]
padded[tuple(slices)] = tensor
return padded
return tensor
# Usage
a_aligned = pad_to_alignment(a, 16) # Ensures Tensor Core efficiency
b_aligned = pad_to_alignment(b, 16)
c = tensor_core_gemm(a_aligned, b_aligned)def find_optimal_batch_size(heads, seq_len, head_dim, max_memory_gb=16):
"""Find optimal batch size given memory constraints."""
max_memory_bytes = max_memory_gb * 1024**3
bytes_per_elem = 2 # FP16
# Memory per sample: Q + K + V + O
per_sample = 4 * heads * seq_len * head_dim * bytes_per_elem
# Account for ~20% overhead
max_batch = int(max_memory_bytes / per_sample / 1.2)
# Round down to power of 2 for better GPU utilization
optimal_batch = 2 ** (max_batch.bit_length() - 1)
return max(optimal_batch, 1)
# Example
batch_size = find_optimal_batch_size(16, 2048, 64, max_memory_gb=24)
print(f"Optimal batch size: {batch_size}") # e.g., 8class AttentionMemoryPool:
"""Pre-allocate and reuse memory for repeated operations."""
def __init__(self, max_batch, heads, max_seq_len, head_dim):
self.q = torch.empty(max_batch, heads, max_seq_len, head_dim,
device='cuda', dtype=torch.float16)
self.k = torch.empty_like(self.q)
self.v = torch.empty_like(self.q)
self.output = torch.empty_like(self.q)
def get_tensors(self, batch_size, seq_len):
"""Get appropriately sized views."""
return (
self.q[:batch_size, :, :seq_len, :],
self.k[:batch_size, :, :seq_len, :],
self.v[:batch_size, :, :seq_len, :],
self.output[:batch_size, :, :seq_len, :]
)# For training (with padding mask): Don't use is_causal
output = flash_attention(q, k, v, is_causal=False)
# For generation (autoregressive): Use is_causal
# This enables early-exit optimization
output = flash_attention(q, k, v, is_causal=True)# Basic benchmark
python benchmarks/benchmark_attention.py
# Custom configuration
python benchmarks/benchmark_attention.py \
--seq-lengths 512 1024 2048 4096 8192 \
--batch-size 4 \
--num-heads 16 \
--head-dim 64 \
--dtype fp16 \
--warmup 50 \
--iterations 200
# Export to JSON for analysis
python benchmarks/benchmark_attention.py --output results.jsonExpected Output Format:
================================================================================
ATTENTION BENCHMARK RESULTS
================================================================================
Config: batch=2, heads=8, head_dim=64, dtype=float16, causal=False
Seq Len | PyTorch(ms) | Flash(ms) | Speedup | Memory Saved
---------|-------------|-----------|---------|-------------
512 | 0.234 | 0.189 | 1.24x | 3.5MB
1024 | 0.823 | 0.612 | 1.34x | 14.0MB
2048 | 3.124 | 2.156 | 1.45x | 56.0MB
4096 | 12.456 | 7.234 | 1.72x | 224.0MB
8192 | 49.823 | 23.456 | 2.12x | 896.0MB
# Standard benchmark
python benchmarks/benchmark_gemm.py
# Custom matrix sizes
python benchmarks/benchmark_gemm.py \
--sizes 512x512x512 1024x1024x1024 2048x2048x2048 \
--dtype fp16 \
--output gemm_results.jsonimport torch
import time
from cuda_llm_ops import flash_attention, gemm
from cuda_llm_ops.profiler import CUDAProfiler
def benchmark_flash_attention(batch, heads, seq_len, head_dim, iterations=100):
"""Custom FlashAttention benchmark."""
q = torch.randn(batch, heads, seq_len, head_dim,
device='cuda', dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
# Warmup
for _ in range(10):
_ = flash_attention(q, k, v)
torch.cuda.synchronize()
# Benchmark
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(iterations):
_ = flash_attention(q, k, v)
end.record()
torch.cuda.synchronize()
elapsed_ms = start.elapsed_time(end) / iterations
# Calculate metrics
flops = batch * heads * (4 * seq_len * seq_len * head_dim) # Simplified
tflops = flops / (elapsed_ms * 1e-3) / 1e12
return {
'latency_ms': elapsed_ms,
'throughput_tflops': tflops,
'config': f'{batch}x{heads}x{seq_len}x{head_dim}'
}
# Run benchmark
result = benchmark_flash_attention(2, 8, 2048, 64)
print(f"Config: {result['config']}")
print(f"Latency: {result['latency_ms']:.3f} ms")
print(f"Throughput: {result['throughput_tflops']:.2f} TFLOPS")# Profile FlashAttention with detailed metrics
ncu --set full --kernel-name flash_attention \
-o flash_attention_profile.ncu-rep \
python -c "
import torch
from cuda_llm_ops import flash_attention
q = torch.randn(2, 8, 2048, 64, device='cuda', dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
for _ in range(100):
flash_attention(q, k, v)
"
# View results
ncu-ui flash_attention_profile.ncu-rep| Metric | Target | Interpretation |
|---|---|---|
| SM Occupancy | >60% | High: Good thread utilization |
| Memory Throughput | >70% of peak | Bound by memory bandwidth |
| L2 Hit Rate | >60% | Efficient cache utilization |
| Warp Stall Reasons | Various | Identify bottlenecks |
from cuda_llm_ops.profiler import CUDAProfiler
profiler = CUDAProfiler()
# Profile single operation
with profiler.profile('flash_attention'):
output = flash_attention(q, k, v)
metrics = profiler.get_metrics()
print(f"Elapsed: {metrics.elapsed_ms:.2f} ms")
print(f"Memory bandwidth: {metrics.memory_bandwidth_gb:.1f} GB/s")
print(f"Compute utilization: {metrics.compute_utilization:.1%}")
# Compare with reference
results = profiler.compare_with_reference(
custom_func=flash_attention,
reference_func=lambda q, k, v: torch.nn.functional.scaled_dot_product_attention(q, k, v),
q, k, v
)
print(f"Speedup vs PyTorch: {results['speedup']:.2f}x")Symptoms:
nvidia-smishows low GPU utilization (<50%)- Kernel execution time much higher than expected
Solutions:
- Increase batch size
- Check for CPU-GPU synchronization points
- Ensure tensors are contiguous
# Check tensor layout
assert q.is_contiguous(), "Q must be contiguous"
# Remove unnecessary synchronization
# Bad:
torch.cuda.synchronize()
output = flash_attention(q, k, v)
torch.cuda.synchronize()
# Good:
output = flash_attention(q, k, v)Symptoms:
- Performance significantly below cuBLAS
- Nsight Compute shows no Tensor Core usage
Solutions:
- Ensure dimensions are multiples of 16
- Use
tensor_core_gemminstead of regulargemm - Verify FP16 input dtype
# Check alignment
M, K, N = a.shape[0], a.shape[1], b.shape[1]
for dim, name in zip([M, K, N], ['M', 'K', 'N']):
if dim % 16 != 0:
print(f"Warning: {name}={dim} not aligned to 16")
# Use Tensor Core variant
c = tensor_core_gemm(a, b) # Forces Tensor Core usageSymptoms:
RuntimeError: CUDA out of memory- Memory grows across iterations
Solutions:
- Use FlashAttention instead of naive attention
- Reduce batch size or sequence length
- Clear cache between iterations
# Use FlashAttention (O(N) memory)
output = flash_attention(q, k, v) # Not naive_attention
# Clear cache if needed
torch.cuda.empty_cache()
# Monitor memory
print(torch.cuda.memory_summary())- Use FP16 precision for memory efficiency and speed
- Use FlashAttention for sequences > 512
- Pre-allocate memory pools for repeated operations
- Align dimensions to multiples of 16 for Tensor Core
- Batch requests when possible for better GPU utilization
- Use automatic mixed precision (AMP) with GradScaler
- Monitor gradient norms when using FP16
- Verify numerical stability with small-scale tests first
- Profile before optimizing to identify actual bottlenecks
- Implement graceful degradation (fallback to PyTorch if needed)
- Set CUDA visible devices explicitly for multi-GPU setups
- Use streams for concurrent execution when processing multiple requests
- Monitor GPU memory and implement OOM handling
# Production-ready pattern
import torch
from cuda_llm_ops import flash_attention
def safe_flash_attention(q, k, v, fallback_to_torch=True):
try:
return flash_attention(q, k, v)
except RuntimeError as e:
if "out of memory" in str(e).lower() and fallback_to_torch:
return torch.nn.functional.scaled_dot_product_attention(q, k, v)
raise