layout	docs
title	Performance Tuning Guide
description	Comprehensive guide for maximizing performance with CUDA LLM Kernel Optimization
lang	en

Performance Tuning Guide

Comprehensive guide for maximizing performance with the CUDA LLM Kernel Optimization library.

Hardware Requirements
Performance Targets
Optimization Strategies
Benchmarking
Profiling
Troubleshooting Performance Issues
Best Practices

Hardware Requirements

Minimum vs Recommended

Component	Minimum	Recommended	Notes
GPU	SM 7.0 (Volta)	SM 8.0+ (Ampere)	Tensor Cores essential for peak performance
VRAM	8 GB	16+ GB	Larger batches and sequences need more memory
CUDA	11.0	12.0+	Newer versions have better optimizations
Driver	450.80+	525+	Check `nvidia-smi`

GPU Architecture Features

Architecture	SM	Tensor Core	Key Features
Volta (V100)	7.0	FP16	First Tensor Core generation
Turing (T4)	7.5	FP16, INT8	INT8 quantization support
Ampere (A100)	8.0, 8.6	FP16, BF16, INT8, TF32	Async copy, 164KB shared mem
Ada Lovelace (RTX 4090)	8.9	+ FP8	Enhanced FP8 support
Hopper (H100)	9.0	+ FP8	Thread block clusters, DPX instructions

These rows describe hardware capabilities, not the full set of precisions currently exposed by this repository.

Memory Planning

Calculate memory requirements before running:

def estimate_attention_memory(batch, heads, seq_len, head_dim, dtype=torch.float16):
    """Estimate FlashAttention memory usage in MB."""
    bytes_per_elem = 2 if dtype == torch.float16 else 4

    # Input tensors (Q, K, V)
    input_mem = 3 * batch * heads * seq_len * head_dim * bytes_per_elem

    # Output tensor
    output_mem = batch * heads * seq_len * head_dim * bytes_per_elem

    # FlashAttention working memory (overhead)
    # ~O(seq_len) rather than O(seq_len²)
    working_mem = batch * heads * seq_len * head_dim * bytes_per_elem * 0.1

    total_mb = (input_mem + output_mem + working_mem) / (1024 * 1024)
    return total_mb

# Example: Large sequence
print(f"Memory needed: {estimate_attention_memory(2, 16, 8192, 64):.1f} MB")
# Output: Memory needed: ~55 MB (vs ~2560 MB for naive attention!)

Performance Targets

GEMM Performance Expectations

Matrix Size	cuBLAS Reference	Target	Typical Result
512×512×512	100%	≥85%	88-92%
1024×1024×1024	100%	≥90%	91-95%
2048×2048×2048	100%	>90%	92-98%

Attention Latency Benchmarks

Sequence Length	PyTorch SDPA	Our FlashAttention	Speedup
512	1.0x	1.2x	Baseline
1024	1.0x	1.4x	+40%
2048	1.0x	1.6x	+60%
4096	1.0x	1.8x	+80%
8192	1.0x	2.1x	+110%

Note: Results on A100, FP16 precision, batch=2, heads=8, head_dim=64

Memory Efficiency

Implementation	Memory Complexity	4K Sequence	16K Sequence
Naive Attention	O(N²)	256 MB	4 GB
Tiled Attention	O(N²), improved	128 MB	2 GB
FlashAttention	O(N)	4 MB	16 MB

Optimization Strategies

1. Kernel Selection Guide

Choose the right kernel for your workload:

def optimal_attention(q, k, v, is_causal=False):
    """Select optimal attention implementation."""
    seq_len = q.size(2)

    if seq_len >= 2048:
        # FlashAttention essential for very long sequences
        return flash_attention(q, k, v, is_causal=is_causal)
    elif seq_len >= 512:
        # FlashAttention still beneficial
        return flash_attention(q, k, v, is_causal=is_causal)
    elif seq_len >= 128:
        # Tiled attention good balance
        return tiled_attention(q, k, v)
    else:
        # Naive may have lower overhead for very short sequences
        return naive_attention(q, k, v)

2. Precision Optimization

# For inference: Always use FP16
model = model.half()  # Convert to FP16
q = q.cuda().half()
output = flash_attention(q, k, v)

# For training: FP32 master weights, FP16 compute
# (Mixed precision pattern)
with torch.cuda.amp.autocast():
    output = flash_attention(q, k, v)

# Tensor Core GEMM with accumulation
# Input: FP16, Compute: FP16, Output: FP32
c = tensor_core_gemm(a_fp16, b_fp16)  # Returns FP32

3. Tensor Core Alignment

Optimal dimensions should be multiples of 16:

def pad_to_alignment(tensor, alignment=16):
    """Pad tensor dimensions to alignment for optimal Tensor Core performance."""
    shape = list(tensor.shape)

    # Pad last two dimensions
    if len(shape) >= 2:
        for i in [-2, -1]:
            if shape[i] % alignment != 0:
                padding = alignment - (shape[i] % alignment)
                shape[i] += padding

    if shape != list(tensor.shape):
        padded = torch.zeros(shape, dtype=tensor.dtype, device=tensor.device)
        # Copy original data
        slices = [slice(0, s) for s in tensor.shape]
        padded[tuple(slices)] = tensor
        return padded
    return tensor

# Usage
a_aligned = pad_to_alignment(a, 16)  # Ensures Tensor Core efficiency
b_aligned = pad_to_alignment(b, 16)
c = tensor_core_gemm(a_aligned, b_aligned)

4. Batch Size Tuning

def find_optimal_batch_size(heads, seq_len, head_dim, max_memory_gb=16):
    """Find optimal batch size given memory constraints."""
    max_memory_bytes = max_memory_gb * 1024**3
    bytes_per_elem = 2  # FP16

    # Memory per sample: Q + K + V + O
    per_sample = 4 * heads * seq_len * head_dim * bytes_per_elem

    # Account for ~20% overhead
    max_batch = int(max_memory_bytes / per_sample / 1.2)

    # Round down to power of 2 for better GPU utilization
    optimal_batch = 2 ** (max_batch.bit_length() - 1)
    return max(optimal_batch, 1)

# Example
batch_size = find_optimal_batch_size(16, 2048, 64, max_memory_gb=24)
print(f"Optimal batch size: {batch_size}")  # e.g., 8

5. Memory Pool Pre-allocation

class AttentionMemoryPool:
    """Pre-allocate and reuse memory for repeated operations."""

    def __init__(self, max_batch, heads, max_seq_len, head_dim):
        self.q = torch.empty(max_batch, heads, max_seq_len, head_dim,
                            device='cuda', dtype=torch.float16)
        self.k = torch.empty_like(self.q)
        self.v = torch.empty_like(self.q)
        self.output = torch.empty_like(self.q)

    def get_tensors(self, batch_size, seq_len):
        """Get appropriately sized views."""
        return (
            self.q[:batch_size, :, :seq_len, :],
            self.k[:batch_size, :, :seq_len, :],
            self.v[:batch_size, :, :seq_len, :],
            self.output[:batch_size, :, :seq_len, :]
        )

6. Causal vs Non-Causal

# For training (with padding mask): Don't use is_causal
output = flash_attention(q, k, v, is_causal=False)

# For generation (autoregressive): Use is_causal
# This enables early-exit optimization
output = flash_attention(q, k, v, is_causal=True)

Benchmarking

Attention Benchmark

# Basic benchmark
python benchmarks/benchmark_attention.py

# Custom configuration
python benchmarks/benchmark_attention.py \
    --seq-lengths 512 1024 2048 4096 8192 \
    --batch-size 4 \
    --num-heads 16 \
    --head-dim 64 \
    --dtype fp16 \
    --warmup 50 \
    --iterations 200

# Export to JSON for analysis
python benchmarks/benchmark_attention.py --output results.json

Expected Output Format:

================================================================================
ATTENTION BENCHMARK RESULTS
================================================================================

Config: batch=2, heads=8, head_dim=64, dtype=float16, causal=False

 Seq Len | PyTorch(ms) | Flash(ms) | Speedup | Memory Saved
---------|-------------|-----------|---------|-------------
     512 |       0.234 |     0.189 |   1.24x |       3.5MB
    1024 |       0.823 |     0.612 |   1.34x |      14.0MB
    2048 |       3.124 |     2.156 |   1.45x |      56.0MB
    4096 |      12.456 |     7.234 |   1.72x |     224.0MB
    8192 |      49.823 |    23.456 |   2.12x |     896.0MB

GEMM Benchmark

# Standard benchmark
python benchmarks/benchmark_gemm.py

# Custom matrix sizes
python benchmarks/benchmark_gemm.py \
    --sizes 512x512x512 1024x1024x1024 2048x2048x2048 \
    --dtype fp16 \
    --output gemm_results.json

Custom Benchmark Script

import torch
import time
from cuda_llm_ops import flash_attention, gemm
from cuda_llm_ops.profiler import CUDAProfiler

def benchmark_flash_attention(batch, heads, seq_len, head_dim, iterations=100):
    """Custom FlashAttention benchmark."""
    q = torch.randn(batch, heads, seq_len, head_dim,
                    device='cuda', dtype=torch.float16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    # Warmup
    for _ in range(10):
        _ = flash_attention(q, k, v)
    torch.cuda.synchronize()

    # Benchmark
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    for _ in range(iterations):
        _ = flash_attention(q, k, v)
    end.record()
    torch.cuda.synchronize()

    elapsed_ms = start.elapsed_time(end) / iterations

    # Calculate metrics
    flops = batch * heads * (4 * seq_len * seq_len * head_dim)  # Simplified
    tflops = flops / (elapsed_ms * 1e-3) / 1e12

    return {
        'latency_ms': elapsed_ms,
        'throughput_tflops': tflops,
        'config': f'{batch}x{heads}x{seq_len}x{head_dim}'
    }

# Run benchmark
result = benchmark_flash_attention(2, 8, 2048, 64)
print(f"Config: {result['config']}")
print(f"Latency: {result['latency_ms']:.3f} ms")
print(f"Throughput: {result['throughput_tflops']:.2f} TFLOPS")

Profiling

Using Nsight Compute

# Profile FlashAttention with detailed metrics
ncu --set full --kernel-name flash_attention \
    -o flash_attention_profile.ncu-rep \
    python -c "
import torch
from cuda_llm_ops import flash_attention

q = torch.randn(2, 8, 2048, 64, device='cuda', dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

for _ in range(100):
    flash_attention(q, k, v)
"

# View results
ncu-ui flash_attention_profile.ncu-rep

Key Metrics to Monitor

Metric	Target	Interpretation
SM Occupancy	>60%	High: Good thread utilization
Memory Throughput	>70% of peak	Bound by memory bandwidth
L2 Hit Rate	>60%	Efficient cache utilization
Warp Stall Reasons	Various	Identify bottlenecks

Python Profiler

from cuda_llm_ops.profiler import CUDAProfiler

profiler = CUDAProfiler()

# Profile single operation
with profiler.profile('flash_attention'):
    output = flash_attention(q, k, v)

metrics = profiler.get_metrics()
print(f"Elapsed: {metrics.elapsed_ms:.2f} ms")
print(f"Memory bandwidth: {metrics.memory_bandwidth_gb:.1f} GB/s")
print(f"Compute utilization: {metrics.compute_utilization:.1%}")

# Compare with reference
results = profiler.compare_with_reference(
    custom_func=flash_attention,
    reference_func=lambda q, k, v: torch.nn.functional.scaled_dot_product_attention(q, k, v),
    q, k, v
)
print(f"Speedup vs PyTorch: {results['speedup']:.2f}x")

Troubleshooting Performance Issues

Issue: Low GPU Utilization

Symptoms:

nvidia-smi shows low GPU utilization (<50%)
Kernel execution time much higher than expected

Solutions:

Increase batch size
Check for CPU-GPU synchronization points
Ensure tensors are contiguous

# Check tensor layout
assert q.is_contiguous(), "Q must be contiguous"

# Remove unnecessary synchronization
# Bad:
torch.cuda.synchronize()
output = flash_attention(q, k, v)
torch.cuda.synchronize()

# Good:
output = flash_attention(q, k, v)

Issue: Tensor Core Not Engaged

Symptoms:

Performance significantly below cuBLAS
Nsight Compute shows no Tensor Core usage

Solutions:

Ensure dimensions are multiples of 16
Use tensor_core_gemm instead of regular gemm
Verify FP16 input dtype

# Check alignment
M, K, N = a.shape[0], a.shape[1], b.shape[1]
for dim, name in zip([M, K, N], ['M', 'K', 'N']):
    if dim % 16 != 0:
        print(f"Warning: {name}={dim} not aligned to 16")

# Use Tensor Core variant
c = tensor_core_gemm(a, b)  # Forces Tensor Core usage

Issue: Out of Memory

Symptoms:

RuntimeError: CUDA out of memory
Memory grows across iterations

Solutions:

Use FlashAttention instead of naive attention
Reduce batch size or sequence length
Clear cache between iterations

# Use FlashAttention (O(N) memory)
output = flash_attention(q, k, v)  # Not naive_attention

# Clear cache if needed
torch.cuda.empty_cache()

# Monitor memory
print(torch.cuda.memory_summary())

Best Practices

For Inference

Use FP16 precision for memory efficiency and speed
Use FlashAttention for sequences > 512
Pre-allocate memory pools for repeated operations
Align dimensions to multiples of 16 for Tensor Core
Batch requests when possible for better GPU utilization

For Training

Use automatic mixed precision (AMP) with GradScaler
Monitor gradient norms when using FP16
Verify numerical stability with small-scale tests first
Profile before optimizing to identify actual bottlenecks

For Production

Implement graceful degradation (fallback to PyTorch if needed)
Set CUDA visible devices explicitly for multi-GPU setups
Use streams for concurrent execution when processing multiple requests
Monitor GPU memory and implement OOM handling

# Production-ready pattern
import torch
from cuda_llm_ops import flash_attention

def safe_flash_attention(q, k, v, fallback_to_torch=True):
    try:
        return flash_attention(q, k, v)
    except RuntimeError as e:
        if "out of memory" in str(e).lower() and fallback_to_torch:
            return torch.nn.functional.scaled_dot_product_attention(q, k, v)
        raise

Additional Resources

← Back to Documentation

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance Tuning Guide

Table of Contents

Hardware Requirements

Minimum vs Recommended

GPU Architecture Features

Memory Planning

Performance Targets

GEMM Performance Expectations

Attention Latency Benchmarks

Memory Efficiency

Optimization Strategies

1. Kernel Selection Guide

2. Precision Optimization

3. Tensor Core Alignment

4. Batch Size Tuning

5. Memory Pool Pre-allocation

6. Causal vs Non-Causal

Benchmarking

Attention Benchmark

GEMM Benchmark

Custom Benchmark Script

Profiling

Using Nsight Compute

Key Metrics to Monitor

Python Profiler

Troubleshooting Performance Issues

Issue: Low GPU Utilization

Issue: Tensor Core Not Engaged

Issue: Out of Memory

Best Practices

For Inference

For Training

For Production

Additional Resources

FilesExpand file tree

performance-en.md

Latest commit

History

performance-en.md

File metadata and controls

Performance Tuning Guide

Table of Contents

Hardware Requirements

Minimum vs Recommended

GPU Architecture Features

Memory Planning

Performance Targets

GEMM Performance Expectations

Attention Latency Benchmarks

Memory Efficiency

Optimization Strategies

1. Kernel Selection Guide

2. Precision Optimization

3. Tensor Core Alignment

4. Batch Size Tuning

5. Memory Pool Pre-allocation

6. Causal vs Non-Causal

Benchmarking

Attention Benchmark

GEMM Benchmark

Custom Benchmark Script

Profiling

Using Nsight Compute

Key Metrics to Monitor

Python Profiler

Troubleshooting Performance Issues

Issue: Low GPU Utilization

Issue: Tensor Core Not Engaged

Issue: Out of Memory

Best Practices

For Inference

For Training

For Production

Additional Resources