Summary
This issue proposes a comprehensive multi-threading and performance optimization roadmap for the Generals codebase. The goal is to improve frame rates, reduce loading times, and better utilize modern multi-core CPUs while maintaining replay determinism.
Related Issues
| Issue | Description | Optimization Area |
|---|---|---|
| #100 | AI pathfinding lag on large maps | Pathfinding parallelization |
| #1984 | PathFindCells freeze at max | Pathfinding budget/threading |
| #57 | Dragon Tank fire expensive | Particle system optimization |
| #124 | Toxin Tractor expensive | Effect system batching |
| #1502 | iterateDrawablesInRegion expensive | Spatial query optimization |
| #353 | Object count limitation | Memory/allocation optimization |
| #1948 | W3DMPO virtual functions | Devirtualization/DOD |
| #265 | DynamicVectorClass growth | Container optimization |
| #122 | Slow network loading | Async loading/streaming |
| #880 | Wrong GPU selection | GPU enumeration/compute |
Existing Threading Infrastructure
The codebase already has threading support we can build upon:
- ThreadClass (WWLib/thread.h) - Base thread with cooperative shutdown
- Synchronization (WWLib/mutex.h) - CriticalSectionClass, MutexClass, FastCriticalSectionClass
- TextureLoader (WW3D2/textureloader.h) - Already implements async background loading
- GameSpy threads - Networking already threaded
- Audio threads - Miles audio has delayed release thread
Proposed Architecture: Hybrid Threading Model
┌──────────────────────────────────────────────────────────────┐
│ MAIN THREAD │
│ Game Logic Tick → Submit Jobs → Wait for Completion │
└──────────────────────────────────────────────────────────────┘
│ ▲
▼ │
┌──────────────────────────────────────────────────────────────┐
│ JOB SYSTEM (Worker Pool) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Pathfinding │ │ Physics Batch│ │ AI Update │ ... │
│ │ Requests │ │ Processing │ │ Batches │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└──────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ AUDIO THREAD │ │ I/O THREAD │ │ RENDER THREAD │
│ (Miles, always │ │ (Asset loading, │ │ (Future: D3D │
│ running) │ │ file ops) │ │ command buffer)│
└─────────────────┘ └─────────────────┘ └─────────────────┘
Phase 1: Foundation (Low Risk, No Replay Impact)
1.1 Job System Core
- Worker thread pool (1 thread per core)
- Job queue with work stealing for load balancing
- Job dependencies for automatic scheduling
- Profiling hooks for performance measurement
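A minimal sketch of such a worker pool, using standard C++ threads (the existing ThreadClass and CriticalSectionClass could be substituted); work stealing, job dependencies, and profiling hooks are omitted for brevity, and JobSystem is an illustrative name, not an existing engine type:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class JobSystem {
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> jobs;
    std::mutex m;
    std::condition_variable cv;
    bool quit = false;
public:
    explicit JobSystem(unsigned count = std::thread::hardware_concurrency()) {
        for (unsigned i = 0; i < count; ++i)
            workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> job;
                    {
                        std::unique_lock<std::mutex> lock(m);
                        cv.wait(lock, [this] { return quit || !jobs.empty(); });
                        if (quit && jobs.empty()) return; // drain, then exit
                        job = std::move(jobs.front());
                        jobs.pop();
                    }
                    job(); // run outside the lock
                }
            });
    }
    void submit(std::function<void()> job) {
        { std::lock_guard<std::mutex> lock(m); jobs.push(std::move(job)); }
        cv.notify_one();
    }
    ~JobSystem() {
        { std::lock_guard<std::mutex> lock(m); quit = true; }
        cv.notify_all();
        for (auto& w : workers) w.join();
    }
};
```

A single shared queue is enough to start; per-worker deques with stealing can replace it later without changing the submit() interface.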
1.2 Background Asset Loading Pipeline
Extend TextureLoader pattern to other assets:
- Background model loading - W3D files parsed in background
- Background animation loading - HAnim assets
- Predictive prefetching - Based on camera direction, production queues, fog of war reveals
- Priority queue - Critical (on-screen) → High (near camera) → Normal → Low
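To make the priority ordering concrete, a small sketch of the request queue; AssetRequest, LoadPriority, and g_loadQueue are illustrative names, not existing engine types:

```cpp
#include <queue>
#include <string>

enum class LoadPriority { Critical = 0, High, Normal, Low }; // lower value = load sooner

struct AssetRequest {
    std::string path;
    LoadPriority priority;
    bool operator<(const AssetRequest& o) const {
        // std::priority_queue pops the *largest* element, so invert the comparison.
        return priority > o.priority;
    }
};

std::priority_queue<AssetRequest> g_loadQueue; // drained by the I/O thread
```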
1.3 Memory Infrastructure
- Pool allocators for high-churn objects (particles, projectiles, path nodes)
- Frame allocator for per-frame temporaries (O(1) alloc, O(1) reset)
- Container optimization - Fix DynamicVectorClass growth factor (#265)
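Of these, the frame allocator is the simplest to illustrate: a bump pointer over a fixed buffer, reset once per frame. A minimal sketch (FrameAllocator is an illustrative name; align must be a power of two):

```cpp
#include <cstddef>
#include <vector>

class FrameAllocator {
    std::vector<unsigned char> buffer;
    size_t offset = 0;
public:
    explicit FrameAllocator(size_t bytes) : buffer(bytes) {}

    // O(1): round up to alignment, bump the offset.
    void* alloc(size_t size, size_t align = 16) {
        size_t p = (offset + align - 1) & ~(align - 1);
        if (p + size > buffer.size()) return nullptr; // frame budget exhausted
        offset = p + size;
        return buffer.data() + p;
    }

    // O(1): called once at end of frame; no per-object destruction.
    void reset() { offset = 0; }
};
```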
Phase 2: Client Parallelization (Low Risk, No Replay Impact)
2.1 Particle System Parallelization
10,000 particles ÷ 4 workers = 2,500 per worker
Each particle update is independent:
- Position += Velocity * dt
- Velocity += Gravity * dt
- Age, Color, Size interpolation
SIMD opportunity: Process 4-8 particles per instruction with SSE/AVX.
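A sketch of that SoA + SSE update, processing four particles per instruction; ParticleSoA and its fields are illustrative, not existing engine types:

```cpp
#include <immintrin.h>
#include <cstddef>

// Illustrative SoA layout; the real particle system would own these buffers.
struct ParticleSoA {
    float *posX, *posY, *posZ;
    float *velX, *velY, *velZ;
    size_t count;
};

// pos += vel * dt and vel.z += gravity * dt, four particles at a time.
void UpdateParticlesSSE(ParticleSoA& p, float dt, float gravity)
{
    const __m128 vdt = _mm_set1_ps(dt);
    const __m128 vg  = _mm_set1_ps(gravity * dt);
    size_t i = 0;
    for (; i + 4 <= p.count; i += 4) {
        __m128 vz = _mm_add_ps(_mm_loadu_ps(p.velZ + i), vg); // apply gravity
        _mm_storeu_ps(p.velZ + i, vz);
        _mm_storeu_ps(p.posX + i, _mm_add_ps(_mm_loadu_ps(p.posX + i),
                                             _mm_mul_ps(_mm_loadu_ps(p.velX + i), vdt)));
        _mm_storeu_ps(p.posY + i, _mm_add_ps(_mm_loadu_ps(p.posY + i),
                                             _mm_mul_ps(_mm_loadu_ps(p.velY + i), vdt)));
        _mm_storeu_ps(p.posZ + i, _mm_add_ps(_mm_loadu_ps(p.posZ + i),
                                             _mm_mul_ps(vz, vdt)));
    }
    for (; i < p.count; ++i) { // scalar tail for counts not divisible by 4
        p.velZ[i] += gravity * dt;
        p.posX[i] += p.velX[i] * dt;
        p.posY[i] += p.velY[i] * dt;
        p.posZ[i] += p.velZ[i] * dt;
    }
}
```

Each worker runs this over its own disjoint slice (e.g. 2,500 of the 10,000 particles), so no synchronization is needed during the update.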
2.2 Render Culling Parallelization
Addresses: #1502
- Parallel frustum culling across drawable objects
- Hierarchical approach: cull cells first, then objects in visible cells
- SIMD: Test 4 bounding spheres against plane simultaneously
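A sketch of the four-spheres-against-one-plane test, again over SoA data; the function and parameter names are illustrative:

```cpp
#include <immintrin.h>

// Plane: n.x*x + n.y*y + n.z*z + d = 0, normal pointing into the frustum.
// Returns a 4-bit mask; bit i set = sphere i is fully outside (cullable).
int CullFourSpheres(__m128 cx, __m128 cy, __m128 cz, __m128 radius,
                    float nx, float ny, float nz, float d)
{
    __m128 dist = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(_mm_set1_ps(nx), cx),
                   _mm_mul_ps(_mm_set1_ps(ny), cy)),
        _mm_add_ps(_mm_mul_ps(_mm_set1_ps(nz), cz),
                   _mm_set1_ps(d)));
    // A sphere is outside the plane if its signed distance < -radius.
    __m128 outside = _mm_cmplt_ps(dist, _mm_sub_ps(_mm_setzero_ps(), radius));
    return _mm_movemask_ps(outside);
}
```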
2.3 Animation Updates
- Batch skeletal animation jobs
- SIMD bone transforms (4x speedup potential)
2.4 Audio Decompression
- Decompress audio in background thread before playback needed
- Ring buffer for streaming music
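For the streaming case, a minimal single-producer/single-consumer ring buffer sketch (decoder thread pushes, mixer thread pops); SpscRing is an illustrative name, and the capacity must be a power of two:

```cpp
#include <atomic>
#include <cstddef>

template <typename T, size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    T buf[N];
    std::atomic<size_t> head{0}; // advanced by consumer
    std::atomic<size_t> tail{0}; // advanced by producer
public:
    bool push(const T& v) { // producer thread only
        size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N) return false; // full
        buf[t & (N - 1)] = v;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& out) { // consumer thread only
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false; // empty
        out = buf[h & (N - 1)];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};
```

Because each index is written by exactly one thread, the buffer is lock-free and never blocks the audio mixer.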
Phase 3: Simulation Threading (Medium Risk, Requires Determinism)
3.1 Physics Parallelization
Step 1: Broad Phase (Parallel) - Spatial hash to find collision candidates
Step 2: Narrow Phase (Parallel) - Detailed collision tests
Step 3: Resolution (Serial) - Apply responses in ObjectID order (determinism)
Step 4: Integration (Parallel) - Update positions/velocities
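The determinism-critical part is step 3: candidate pairs arrive in whatever order the workers finish, so they are merged and sorted before serial resolution. A sketch (CandidatePair and ResolveCollision are illustrative names):

```cpp
#include <algorithm>
#include <vector>

struct CandidatePair { unsigned idA, idB; }; // ObjectIDs, stored with idA < idB

void ResolveCollision(const CandidatePair&) { /* apply impulses, damage, etc. */ }

void ResolutionStep(std::vector<std::vector<CandidatePair>>& perThreadPairs)
{
    // Merge the per-thread buffers produced by the parallel broad/narrow phases...
    std::vector<CandidatePair> all;
    for (auto& v : perThreadPairs)
        all.insert(all.end(), v.begin(), v.end());

    // ...then sort by ObjectID so application order never depends on thread timing.
    std::sort(all.begin(), all.end(),
              [](const CandidatePair& a, const CandidatePair& b) {
                  return a.idA != b.idA ? a.idA < b.idA : a.idB < b.idB;
              });

    for (const CandidatePair& p : all)
        ResolveCollision(p); // serial, deterministic order
}
```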
3.2 Pathfinding Parallelization
┌─────────────────────────────────────────────────────────────┐
│ Request Queue → Worker Pool (N threads) → Results Buffer │
│ │
│ Each worker: │
│ - Grabs request from queue │
│ - Runs A* with own open/closed lists │
│ - Stores result in completion buffer │
│ │
│ DETERMINISM: Deliver results sorted by (Frame, RequestID) │
└─────────────────────────────────────────────────────────────┘
Key insight: A* searches are independent as long as the pathfinding grid is read-only during the search.
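A sketch of the delivery step from the diagram above; PathResult is an illustrative type:

```cpp
#include <algorithm>
#include <tuple>
#include <vector>

struct PathResult {
    unsigned frame;      // frame the request was issued on
    unsigned requestId;  // monotonically increasing within a frame
    // ...computed path data...
};

// Called on the main thread once all workers for this tick have finished.
void DeliverResults(std::vector<PathResult>& results)
{
    // Completion order varies with thread timing; (frame, requestId) does not.
    std::sort(results.begin(), results.end(),
              [](const PathResult& a, const PathResult& b) {
                  return std::tie(a.frame, a.requestId)
                       < std::tie(b.frame, b.requestId);
              });
    // Apply each result to game state here, in sorted order.
}
```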
3.3 AI Update Batching
Tier 1: AIPlayer decisions - Independent per player, parallel
Tier 2: AIGroup coordination - Groups don't interact, parallel
Tier 3: Unit AI updates - Batch into parallel chunks
Synchronization: Collect results, apply in ObjectID order (see the generic sketch under Determinism Strategy below).
Phase 4: Advanced Optimization (Higher Risk)
4.1 Data-Oriented Design
Problem: Object-oriented layout causes cache misses
Object 0: [vtable][pos][vel][hp][armor][...100 fields...]
Object 1: [vtable][pos][vel][hp][armor][...100 fields...]
// Cache pulls ~100 unused fields per object
Solution: Structure-of-Arrays
Positions: [pos0][pos1][pos2][pos3]... // All in cache line
Velocities: [vel0][vel1][vel2][vel3]... // SIMD-friendly
Hot/Cold splitting: Pack frequently-accessed data together.
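A sketch of the SoA layout with a hot/cold split; ObjectsHot/ObjectsCold and their fields are illustrative, not existing engine types:

```cpp
#include <cstdint>
#include <vector>

struct ObjectsHot {               // touched every tick: dense and contiguous
    std::vector<float> posX, posY, posZ;
    std::vector<float> velX, velY, velZ;
    std::vector<int32_t> hp;
};

struct ObjectsCold {              // touched rarely: fetched only on demand
    std::vector<int32_t> armor;
    // ...the other ~100 fields...
};

// Iterating hot data now streams through cache lines with no wasted fields,
// and each array is trivially SIMD- and job-friendly.
void Integrate(ObjectsHot& o, float dt) {
    for (size_t i = 0; i < o.posX.size(); ++i) {
        o.posX[i] += o.velX[i] * dt;
        o.posY[i] += o.velY[i] * dt;
        o.posZ[i] += o.velZ[i] * dt;
    }
}
```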
4.2 GPU Compute
Addresses: #880
- GPU particles - Update 10,000+ particles with near-zero CPU cost
- GPU culling - Hierarchical Z-buffer occlusion culling
- GPU pathfinding (experimental) - Flow fields via compute shader
4.3 Hierarchical Pathfinding (HPA*)
Level 2: Region Graph (coarse) - Precomputed
Level 1: Cluster Graph (medium) - Precomputed edges
Level 0: Cell Graph (fine) - Only search near start/goal
Benefits: 10-100x fewer nodes searched
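To illustrate the layering, a sketch of the coarse level only: a search over the precomputed region graph, whose result would then restrict the fine-grained A* (not shown) to cells along the coarse path. BFS stands in for A* at this level for brevity; Region and CoarsePath are illustrative names:

```cpp
#include <algorithm>
#include <queue>
#include <vector>

struct Region { std::vector<int> neighbors; }; // precomputed adjacency

// Returns region IDs from start to goal, or empty if unreachable.
std::vector<int> CoarsePath(const std::vector<Region>& graph, int start, int goal)
{
    std::vector<int> parent(graph.size(), -1);
    std::queue<int> open;
    open.push(start);
    parent[start] = start;
    while (!open.empty()) {
        int r = open.front(); open.pop();
        if (r == goal) break;
        for (int n : graph[r].neighbors)
            if (parent[n] == -1) { parent[n] = r; open.push(n); }
    }
    if (parent[goal] == -1) return {};
    std::vector<int> path;
    for (int r = goal; r != start; r = parent[r]) path.push_back(r);
    path.push_back(start);
    std::reverse(path.begin(), path.end());
    return path;
}
```

The fine search then only expands cells inside the returned regions, which is where the 10-100x node reduction comes from.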
Determinism Strategy
Critical constraint: All GameLogic threading must maintain replay compatibility.
Parallel Compute, Serial Apply
Phase 1 (Parallel): Compute new states in thread-local buffers
Phase 2 (Serial): Apply changes in deterministic order (by ObjectID)
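A generic sketch of this pattern, using plain std::thread with a strided partition for brevity (the real version would dispatch through the 1.1 job system); the function and parameter names are illustrative:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

template <typename State, typename Result>
void ParallelComputeSerialApply(std::vector<State>& objects,
                                Result (*compute)(const State&),
                                void (*apply)(State&, const Result&))
{
    const size_t n = objects.size();
    std::vector<Result> results(n);

    // Phase 1 (parallel): each worker writes only to its own slots of `results`,
    // so thread timing cannot change the outcome.
    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w)
        pool.emplace_back([&, w] {
            for (size_t i = w; i < n; i += workers)
                results[i] = compute(objects[i]);
        });
    for (auto& t : pool) t.join();

    // Phase 2 (serial): apply in index (ObjectID) order on the main thread.
    for (size_t i = 0; i < n; ++i)
        apply(objects[i], results[i]);
}
```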
Testing
```bash
# Run the same replay N times and compare checksums
for i in {1..10}; do
    generalszh.exe -headless -replay test.rep -checksum > checksum_$i.txt
done
diff checksum_*.txt  # outputs must be identical
```
Profiling Infrastructure
Before optimizing, we need measurement:
- Frame time breakdown - GameLogic, Pathfinding, Rendering, Physics, Audio
- Memory metrics - Allocation rate, fragmentation, cache miss rate
- Threading metrics - Core utilization, lock contention, job queue depth
Output formats: Chrome Tracing, Tracy Profiler, ETW.
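As a seed for this, a minimal scoped timer that emits Chrome Tracing "complete" events (ph "X", timestamps in microseconds); TraceScope and PROFILE_SCOPE are illustrative names, and a real implementation would buffer events per thread instead of locking around printf:

```cpp
#include <chrono>
#include <cstdio>
#include <mutex>

class TraceScope {
    const char* name;
    std::chrono::steady_clock::time_point start;
public:
    explicit TraceScope(const char* n)
        : name(n), start(std::chrono::steady_clock::now()) {}
    ~TraceScope() {
        using namespace std::chrono;
        long long ts  = duration_cast<microseconds>(start.time_since_epoch()).count();
        long long dur = duration_cast<microseconds>(steady_clock::now() - start).count();
        static std::mutex m;                  // real code: per-thread event buffers
        std::lock_guard<std::mutex> lock(m);
        // One Chrome Tracing event; wrap the full output in a JSON array.
        std::printf("{\"name\":\"%s\",\"ph\":\"X\",\"ts\":%lld,\"dur\":%lld,"
                    "\"pid\":0,\"tid\":0},\n", name, ts, dur);
    }
};

#define TRACE_CONCAT2(a, b) a##b
#define TRACE_CONCAT(a, b) TRACE_CONCAT2(a, b)
#define PROFILE_SCOPE(n) TraceScope TRACE_CONCAT(_trace, __LINE__)(n)
```

Usage: drop `PROFILE_SCOPE("GameLogic::update");` at the top of any function to time it; the resulting file loads directly into chrome://tracing.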
Implementation Order
| Phase | Risk | Replay Impact | Benefit |
|---|---|---|---|
| 1.1 Job System | Low | None | Foundation for all threading |
| 1.2 Asset Loading | Low | None | Reduced loading hitches |
| 1.3 Memory | Low | None | Reduced allocation overhead |
| 2.1 Particles | Low | None | Better FPS during explosions |
| 2.2 Culling | Low | None | Better FPS with many objects |
| 3.1 Physics | Medium | Must verify | Better FPS in combat |
| 3.2 Pathfinding | Medium | Must verify | Faster unit response |
| 4.x Advanced | High | Must verify | Major performance gains |
Questions for Discussion
- Should we prioritize client-side optimizations (Phase 1-2) before touching simulation (Phase 3)?
- Is there appetite for Data-Oriented Design refactoring given it's a significant architectural change?
- Should profiling infrastructure be implemented first to guide optimization priorities?
- What's the acceptable risk tolerance for simulation threading given replay compatibility requirements?
Reference document: tmp/parallelize.md contains detailed diagrams and technical specifications for each optimization area.
🤖 Generated with Claude Code