perf: Multi-Threading & Performance Optimization Roadmap #2048

@sebyx07

Summary

This issue proposes a comprehensive multi-threading and performance optimization roadmap for the Generals codebase. The goal is to improve frame rates, reduce loading times, and better utilize modern multi-core CPUs while maintaining replay determinism.

Related Issues

Issue | Description                        | Optimization Area
#100  | AI pathfinding lag on large maps   | Pathfinding parallelization
#1984 | PathFindCells freeze at max        | Pathfinding budget/threading
#57   | Dragon Tank fire expensive         | Particle system optimization
#124  | Toxin Tractor expensive            | Effect system batching
#1502 | iterateDrawablesInRegion expensive | Spatial query optimization
#353  | Object count limitation            | Memory/allocation optimization
#1948 | W3DMPO virtual functions           | Devirtualization/DOD
#265  | DynamicVectorClass growth          | Container optimization
#122  | Slow network loading               | Async loading/streaming
#880  | Wrong GPU selection                | GPU enumeration/compute

Existing Threading Infrastructure

The codebase already has threading support we can build upon:

  • ThreadClass (WWLib/thread.h) - Base thread with cooperative shutdown
  • Synchronization (WWLib/mutex.h) - CriticalSectionClass, MutexClass, FastCriticalSectionClass
  • TextureLoader (WW3D2/textureloader.h) - Already implements async background loading
  • GameSpy threads - Networking already threaded
  • Audio threads - Miles audio has delayed release thread

Proposed Architecture: Hybrid Threading Model

┌──────────────────────────────────────────────────────────────┐
│                         MAIN THREAD                          │
│  Game Logic Tick → Submit Jobs → Wait for Completion         │
└──────────────────────────────────────────────────────────────┘
          │                                      ▲
          ▼                                      │
┌──────────────────────────────────────────────────────────────┐
│                    JOB SYSTEM (Worker Pool)                  │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐          │
│  │ Pathfinding  │ │ Physics Batch│ │ AI Update    │   ...    │
│  │ Requests     │ │ Processing   │ │ Batches      │          │
│  └──────────────┘ └──────────────┘ └──────────────┘          │
└──────────────────────────────────────────────────────────────┘
          │
          ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│ AUDIO THREAD    │  │ I/O THREAD      │  │ RENDER THREAD   │
│ (Miles, always  │  │ (Asset loading, │  │ (Future: D3D    │
│  running)       │  │  file ops)      │  │  command buffer)│
└─────────────────┘  └─────────────────┘  └─────────────────┘

Phase 1: Foundation (Low Risk, No Replay Impact)

1.1 Job System Core

  • Worker thread pool (1 thread per core)
  • Job queue with work stealing for load balancing
  • Job dependencies for automatic scheduling
  • Profiling hooks for performance measurement
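
As a sketch of the intended shape only (JobSystem and the std:: threading types here are illustrative; a real implementation would likely build on ThreadClass/FastCriticalSectionClass and add work stealing):

// Hypothetical minimal job system: worker pool, shared queue, wait-for-completion.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class JobSystem
{
public:
    explicit JobSystem(unsigned workerCount = std::thread::hardware_concurrency())
    {
        if (workerCount == 0)
            workerCount = 1;
        for (unsigned i = 0; i < workerCount; ++i)
            m_workers.emplace_back([this] { workerLoop(); });
    }

    ~JobSystem()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_quit = true;
        }
        m_wake.notify_all();
        for (std::thread& t : m_workers)
            t.join();
    }

    // Queue a job; the main thread calls waitForAll() before reading anything the jobs wrote.
    void submit(std::function<void()> job)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_jobs.push(std::move(job));
            ++m_pending;
        }
        m_wake.notify_one();
    }

    void waitForAll()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_done.wait(lock, [this] { return m_pending == 0; });
    }

private:
    void workerLoop()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_wake.wait(lock, [this] { return m_quit || !m_jobs.empty(); });
                if (m_jobs.empty())
                    return;                       // only reachable once m_quit is set
                job = std::move(m_jobs.front());
                m_jobs.pop();
            }
            job();                                // run outside the lock
            {
                std::lock_guard<std::mutex> lock(m_mutex);
                if (--m_pending == 0)
                    m_done.notify_all();
            }
        }
    }

    std::vector<std::thread>          m_workers;
    std::queue<std::function<void()>> m_jobs;
    std::mutex                        m_mutex;
    std::condition_variable           m_wake;
    std::condition_variable           m_done;
    unsigned                          m_pending = 0;
    bool                              m_quit = false;
};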

1.2 Background Asset Loading Pipeline

Extend the TextureLoader pattern to other assets:

  • Background model loading - W3D files parsed in background
  • Background animation loading - HAnim assets
  • Predictive prefetching - Based on camera direction, production queues, fog of war reveals
  • Priority queue - Critical (on-screen) → High (near camera) → Normal → Low
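
A minimal sketch of how that priority ordering could work (AssetRequest and LoadPriority are hypothetical names, not existing engine types):

// Hypothetical prioritized asset request queue for the background I/O thread.
#include <queue>
#include <string>
#include <vector>

enum class LoadPriority { Critical = 0, High, Normal, Low };   // on-screen -> speculative

struct AssetRequest
{
    std::string  path;       // e.g. a .w3d model or HAnim file
    LoadPriority priority;
    unsigned     sequence;   // preserves FIFO order within the same priority
};

struct HigherPriorityFirst
{
    bool operator()(const AssetRequest& a, const AssetRequest& b) const
    {
        if (a.priority != b.priority)
            return a.priority > b.priority;   // lower enum value is served first
        return a.sequence > b.sequence;       // earlier request is served first
    }
};

using AssetQueue = std::priority_queue<AssetRequest, std::vector<AssetRequest>, HigherPriorityFirst>;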

1.3 Memory Infrastructure

  • Pool allocators for high-churn objects (particles, projectiles, path nodes)
  • Frame allocator for per-frame temporaries (O(1) alloc, O(1) reset)
  • Container optimization - Fix the DynamicVectorClass growth factor (#265)
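
A minimal frame-allocator sketch, assuming per-frame temporaries are trivially destructible (FrameAllocator is a hypothetical class, not an existing engine allocator):

// Hypothetical bump-pointer frame allocator: O(1) allocate, O(1) reset per frame.
#include <cstddef>
#include <cstdint>
#include <vector>

class FrameAllocator
{
public:
    explicit FrameAllocator(size_t capacity) : m_buffer(capacity), m_offset(0) {}

    void* allocate(size_t size, size_t alignment = alignof(std::max_align_t))
    {
        size_t aligned = (m_offset + alignment - 1) & ~(alignment - 1);
        if (aligned + size > m_buffer.size())
            return nullptr;                   // caller falls back to the general heap
        m_offset = aligned + size;
        return m_buffer.data() + aligned;
    }

    void reset() { m_offset = 0; }            // once per frame; no destructors are run

private:
    std::vector<uint8_t> m_buffer;
    size_t               m_offset;
};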

Phase 2: Client Parallelization (Low Risk, No Replay Impact)

2.1 Particle System Parallelization

Addresses: #57, #124

10,000 particles ÷ 4 workers = 2,500 per worker
Each particle update is independent:
- Position += Velocity * dt
- Velocity += Gravity * dt
- Age, Color, Size interpolation

SIMD opportunity: Process 4-8 particles per instruction with SSE/AVX.
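
A sketch of how the per-frame update could be chunked across the job system sketched in 1.1 (ParticleSoA and its fields are assumptions, not the engine's current particle layout):

// Hypothetical chunked particle update: one independent job per slice of the array.
#include <algorithm>
#include <cstddef>
#include <vector>

struct ParticleSoA
{
    std::vector<float> posX, posY, posZ;
    std::vector<float> velX, velY, velZ;
    std::vector<float> age;
};

void updateParticlesParallel(ParticleSoA& p, float dt, float gravity,
                             JobSystem& jobs, size_t workerCount)
{
    const size_t count = p.posX.size();
    const size_t chunk = (count + workerCount - 1) / std::max<size_t>(workerCount, 1);

    for (size_t begin = 0; begin < count; begin += chunk)
    {
        const size_t end = std::min(begin + chunk, count);
        jobs.submit([&p, dt, gravity, begin, end] {
            for (size_t i = begin; i < end; ++i)   // each particle is independent; SIMD-friendly layout
            {
                p.velZ[i] += gravity * dt;
                p.posX[i] += p.velX[i] * dt;
                p.posY[i] += p.velY[i] * dt;
                p.posZ[i] += p.velZ[i] * dt;
                p.age[i]  += dt;
            }
        });
    }
    jobs.waitForAll();   // rejoin before the renderer reads the particle buffers
}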

2.2 Render Culling Parallelization

Addresses: #1502

  • Parallel frustum culling across drawable objects
  • Hierarchical approach: cull cells first, then objects in visible cells
  • SIMD: Test four bounding spheres against a plane simultaneously (sketched below)
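
A minimal SSE sketch of that four-spheres-vs-plane test (function name and SoA parameters are illustrative):

// Hypothetical SSE sphere/plane test: plane is (nx, ny, nz, d), spheres stored SoA.
#include <xmmintrin.h>

// Returns a 4-bit mask; bit i is set if sphere i lies completely behind the plane.
int spheresOutsidePlane(const float* cx, const float* cy, const float* cz,
                        const float* radius,
                        float nx, float ny, float nz, float d)
{
    __m128 dist = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(_mm_set1_ps(nx), _mm_loadu_ps(cx)),
                   _mm_mul_ps(_mm_set1_ps(ny), _mm_loadu_ps(cy))),
        _mm_add_ps(_mm_mul_ps(_mm_set1_ps(nz), _mm_loadu_ps(cz)),
                   _mm_set1_ps(d)));                              // dist = dot(n, center) + d
    __m128 outside = _mm_cmplt_ps(dist, _mm_sub_ps(_mm_setzero_ps(), _mm_loadu_ps(radius)));
    return _mm_movemask_ps(outside);                              // dist < -radius per lane
}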

2.3 Animation Updates

  • Batch skeletal animation jobs
  • SIMD bone transforms (4x speedup potential)

2.4 Audio Decompression

  • Decompress audio on a background thread before it is needed for playback
  • Ring buffer for streaming music

Phase 3: Simulation Threading (Medium Risk, Requires Determinism)

3.1 Physics Parallelization

Stage 1: Broad Phase (Parallel) - Spatial hash to find collision candidates
Stage 2: Narrow Phase (Parallel) - Detailed collision tests
Stage 3: Resolution (Serial) - Apply responses in ObjectID order (determinism)
Stage 4: Integration (Parallel) - Update positions/velocities

3.2 Pathfinding Parallelization

Addresses: #100, #1984

┌─────────────────────────────────────────────────────────────┐
│   Request Queue → Worker Pool (N threads) → Results Buffer  │
│                                                              │
│   Each worker:                                               │
│   - Grabs request from queue                                │
│   - Runs A* with own open/closed lists                      │
│   - Stores result in completion buffer                      │
│                                                              │
│   DETERMINISM: Deliver results sorted by (Frame, RequestID) │
└─────────────────────────────────────────────────────────────┘

Key insight: A* searches are independent as long as the pathfinding grid is read-only while searches are in flight.
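
A sketch of the deterministic delivery step (PathResult and the deliver callback are hypothetical; the point is only the sort key):

// Hypothetical: workers finish in any order, results are applied sorted by (frame, requestId).
#include <algorithm>
#include <functional>
#include <vector>

struct PathResult
{
    unsigned frame;       // frame the request was issued on
    unsigned requestId;   // issue order within that frame
    // ... waypoints ...
};

void applyPathResults(std::vector<PathResult>& completed,
                      const std::function<void(const PathResult&)>& deliver)
{
    std::sort(completed.begin(), completed.end(),
              [](const PathResult& a, const PathResult& b) {
                  return a.frame != b.frame ? a.frame < b.frame
                                            : a.requestId < b.requestId;
              });
    for (const PathResult& r : completed)
        deliver(r);       // deterministic order regardless of which worker finished first
}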

3.3 AI Update Batching

Tier 1: AIPlayer decisions - Independent per player, parallel
Tier 2: AIGroup coordination - Groups don't interact, parallel
Tier 3: Unit AI updates - Batch into parallel chunks

Synchronization: Collect results, apply in ObjectID order.


Phase 4: Advanced Optimization (Higher Risk)

4.1 Data-Oriented Design

Addresses: #353, #1948

Problem: Object-oriented layout causes cache misses

Object 0: [vtable][pos][vel][hp][armor][...100 fields...]
Object 1: [vtable][pos][vel][hp][armor][...100 fields...]
// Cache pulls ~100 unused fields per object

Solution: Structure-of-Arrays

Positions:  [pos0][pos1][pos2][pos3]...  // All in cache line
Velocities: [vel0][vel1][vel2][vel3]...  // SIMD-friendly

Hot/Cold splitting: Pack frequently-accessed data together.
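
A sketch of what a hot/cold split could look like (field choices are illustrative, not the actual Object layout):

// Hypothetical hot/cold split: the per-frame loop streams only the hot SoA arrays.
#include <cstddef>
#include <vector>

struct HotObjectData          // touched every frame; packed for cache and SIMD
{
    std::vector<float> posX, posY, posZ;
    std::vector<float> velX, velY, velZ;
};

struct ColdObjectData         // touched occasionally (UI, scripts, save/load)
{
    std::vector<int>   armorType;
    std::vector<float> buildProgress;
    // ... the remaining rarely-read fields ...
};

void integrate(HotObjectData& hot, float dt)
{
    const size_t n = hot.posX.size();
    for (size_t i = 0; i < n; ++i)    // contiguous memory, no unused fields pulled into cache
    {
        hot.posX[i] += hot.velX[i] * dt;
        hot.posY[i] += hot.velY[i] * dt;
        hot.posZ[i] += hot.velZ[i] * dt;
    }
}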

4.2 GPU Compute

Addresses: #880

  • GPU particles - Update 10,000+ particles with minimal CPU involvement
  • GPU culling - Hierarchical Z-buffer occlusion culling
  • GPU pathfinding (experimental) - Flow fields via compute shader

4.3 Hierarchical Pathfinding (HPA*)

Level 2: Region Graph (coarse) - Precomputed
Level 1: Cluster Graph (medium) - Precomputed edges
Level 0: Cell Graph (fine) - Only search near start/goal

Benefits: 10-100x fewer nodes searched

Determinism Strategy

Critical constraint: All GameLogic threading must maintain replay compatibility.

Parallel Compute, Serial Apply

Phase 1 (Parallel): Compute new states in thread-local buffers
Phase 2 (Serial):   Apply changes in deterministic order (by ObjectID)
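
A sketch of the pattern, assuming per-worker output buffers (PendingChange and its payload are illustrative):

// Hypothetical parallel-compute / serial-apply: merge per-worker buffers, sort by ObjectID,
// then apply on the main thread so the result is independent of thread scheduling.
#include <algorithm>
#include <functional>
#include <vector>

struct PendingChange
{
    unsigned objectId;
    float    newHealth;   // illustrative payload
};

void applyDeterministically(std::vector<std::vector<PendingChange>>& perWorker,
                            const std::function<void(const PendingChange&)>& apply)
{
    std::vector<PendingChange> merged;
    for (std::vector<PendingChange>& buffer : perWorker)
        merged.insert(merged.end(), buffer.begin(), buffer.end());

    std::stable_sort(merged.begin(), merged.end(),
                     [](const PendingChange& a, const PendingChange& b) {
                         return a.objectId < b.objectId;
                     });

    for (const PendingChange& change : merged)
        apply(change);    // main thread only, deterministic order
}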

Testing

# Run the same replay N times and compare checksums
for i in {1..10}; do
  generalszh.exe -headless -replay test.rep -checksum > checksum_$i.txt
done
# Every run must produce identical output (diff prints nothing on success)
for f in checksum_*.txt; do diff checksum_1.txt "$f"; done

Profiling Infrastructure

Before optimizing, we need measurement:

  • Frame time breakdown - GameLogic, Pathfinding, Rendering, Physics, Audio
  • Memory metrics - Allocation rate, fragmentation, cache miss rate
  • Threading metrics - Core utilization, lock contention, job queue depth

Output formats: Chrome Tracing, Tracy Profiler, ETW.
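
As one possible starting point, a minimal scoped-event sketch that writes Chrome Tracing "complete" events (ScopedProfileEvent is a hypothetical name; Tracy/ETW backends would hook into the same scopes):

// Hypothetical scoped timer emitting Chrome Tracing ph:"X" events (microsecond timestamps).
// Wrap all emitted objects in a JSON array and open in chrome://tracing or Perfetto.
#include <chrono>
#include <cstdio>

class ScopedProfileEvent
{
public:
    ScopedProfileEvent(const char* name, FILE* out)
        : m_name(name), m_out(out), m_start(std::chrono::steady_clock::now()) {}

    ~ScopedProfileEvent()
    {
        using namespace std::chrono;
        auto now = steady_clock::now();
        long long ts  = static_cast<long long>(duration_cast<microseconds>(m_start.time_since_epoch()).count());
        long long dur = static_cast<long long>(duration_cast<microseconds>(now - m_start).count());
        std::fprintf(m_out,
                     "{\"name\":\"%s\",\"ph\":\"X\",\"ts\":%lld,\"dur\":%lld,\"pid\":1,\"tid\":1},\n",
                     m_name, ts, dur);
    }

private:
    const char* m_name;
    FILE*       m_out;
    std::chrono::steady_clock::time_point m_start;
};

// Usage: { ScopedProfileEvent ev("GameLogic::update", traceFile); /* timed work */ }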


Implementation Order

Phase             | Risk   | Replay Impact | Benefit
1.1 Job System    | Low    | None          | Foundation for all threading
1.2 Asset Loading | Low    | None          | Reduced loading hitches
1.3 Memory        | Low    | None          | Reduced allocation overhead
2.1 Particles     | Low    | None          | Better FPS during explosions
2.2 Culling       | Low    | None          | Better FPS with many objects
3.1 Physics       | Medium | Must verify   | Better FPS in combat
3.2 Pathfinding   | Medium | Must verify   | Faster unit response
4.x Advanced      | High   | Must verify   | Major performance gains

Questions for Discussion

  1. Should we prioritize client-side optimizations (Phase 1-2) before touching simulation (Phase 3)?
  2. Is there appetite for Data-Oriented Design refactoring given it's a significant architectural change?
  3. Should profiling infrastructure be implemented first to guide optimization priorities?
  4. What's the acceptable risk tolerance for simulation threading given replay compatibility requirements?

Reference document: tmp/parallelize.md contains detailed diagrams and technical specifications for each optimization area.

🤖 Generated with Claude Code
