perf: Multi-Threading & Performance Optimization Roadmap #2048

@sebyx07

Summary

This issue proposes a comprehensive multi-threading and performance optimization roadmap for the Generals codebase. The goal is to improve frame rates, reduce loading times, and better utilize modern multi-core CPUs while maintaining replay determinism.

Related Issues

Issue | Description                        | Optimization Area
#100  | AI pathfinding lag on large maps   | Pathfinding parallelization
#1984 | PathFindCells freeze at max        | Pathfinding budget/threading
#57   | Dragon Tank fire expensive         | Particle system optimization
#124  | Toxin Tractor expensive            | Effect system batching
#1502 | iterateDrawablesInRegion expensive | Spatial query optimization
#353  | Object count limitation            | Memory/allocation optimization
#1948 | W3DMPO virtual functions           | Devirtualization/DOD
#265  | DynamicVectorClass growth          | Container optimization
#122  | Slow network loading               | Async loading/streaming
#880  | Wrong GPU selection                | GPU enumeration/compute

Existing Threading Infrastructure

The codebase already has threading support we can build upon:

  • ThreadClass (WWLib/thread.h) - Base thread with cooperative shutdown
  • Synchronization (WWLib/mutex.h) - CriticalSectionClass, MutexClass, FastCriticalSectionClass
  • TextureLoader (WW3D2/textureloader.h) - Already implements async background loading
  • GameSpy threads - Networking already threaded
  • Audio threads - Miles audio has delayed release thread

Proposed Architecture: Hybrid Threading Model

┌──────────────────────────────────────────────────────────────┐
│                         MAIN THREAD                          │
│  Game Logic Tick → Submit Jobs → Wait for Completion         │
└──────────────────────────────────────────────────────────────┘
          │                                      ▲
          ▼                                      │
┌──────────────────────────────────────────────────────────────┐
│                    JOB SYSTEM (Worker Pool)                  │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐          │
│  │ Pathfinding  │ │ Physics Batch│ │ AI Update    │   ...    │
│  │ Requests     │ │ Processing   │ │ Batches      │          │
│  └──────────────┘ └──────────────┘ └──────────────┘          │
└──────────────────────────────────────────────────────────────┘
          │
          ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│ AUDIO THREAD    │  │ I/O THREAD      │  │ RENDER THREAD   │
│ (Miles, always  │  │ (Asset loading, │  │ (Future: D3D    │
│  running)       │  │  file ops)      │  │  command buffer)│
└─────────────────┘  └─────────────────┘  └─────────────────┘

Phase 1: Foundation (Low Risk, No Replay Impact)

1.1 Job System Core

  • Worker thread pool (1 thread per core)
  • Job queue with work stealing for load balancing
  • Job dependencies for automatic scheduling
  • Profiling hooks for performance measurement
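
As a sketch of the intended shape only (JobSystem and the std:: threading types here are illustrative; a real implementation would likely build on ThreadClass/FastCriticalSectionClass and add work stealing):

// Hypothetical minimal job system: worker pool, shared queue, wait-for-completion.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class JobSystem
{
public:
    explicit JobSystem(unsigned workerCount = std::thread::hardware_concurrency())
    {
        if (workerCount == 0)
            workerCount = 1;
        for (unsigned i = 0; i < workerCount; ++i)
            m_workers.emplace_back([this] { workerLoop(); });
    }

    ~JobSystem()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_quit = true;
        }
        m_wake.notify_all();
        for (std::thread& t : m_workers)
            t.join();
    }

    // Queue a job; the main thread calls waitForAll() before reading anything the jobs wrote.
    void submit(std::function<void()> job)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_jobs.push(std::move(job));
            ++m_pending;
        }
        m_wake.notify_one();
    }

    void waitForAll()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_done.wait(lock, [this] { return m_pending == 0; });
    }

private:
    void workerLoop()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_wake.wait(lock, [this] { return m_quit || !m_jobs.empty(); });
                if (m_jobs.empty())
                    return;                       // only reachable once m_quit is set
                job = std::move(m_jobs.front());
                m_jobs.pop();
            }
            job();                                // run outside the lock
            {
                std::lock_guard<std::mutex> lock(m_mutex);
                if (--m_pending == 0)
                    m_done.notify_all();
            }
        }
    }

    std::vector<std::thread>          m_workers;
    std::queue<std::function<void()>> m_jobs;
    std::mutex                        m_mutex;
    std::condition_variable           m_wake;
    std::condition_variable           m_done;
    unsigned                          m_pending = 0;
    bool                              m_quit = false;
};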

1.2 Background Asset Loading Pipeline

Extend the TextureLoader pattern to other assets:

  • Background model loading - W3D files parsed in background
  • Background animation loading - HAnim assets
  • Predictive prefetching - Based on camera direction, production queues, fog of war reveals
  • Priority queue - Critical (on-screen) → High (near camera) → Normal → Low
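
A minimal sketch of how that priority ordering could work (AssetRequest and LoadPriority are hypothetical names, not existing engine types):

// Hypothetical prioritized asset request queue for the background I/O thread.
#include <queue>
#include <string>
#include <vector>

enum class LoadPriority { Critical = 0, High, Normal, Low };   // on-screen -> speculative

struct AssetRequest
{
    std::string  path;       // e.g. a .w3d model or HAnim file
    LoadPriority priority;
    unsigned     sequence;   // preserves FIFO order within the same priority
};

struct HigherPriorityFirst
{
    bool operator()(const AssetRequest& a, const AssetRequest& b) const
    {
        if (a.priority != b.priority)
            return a.priority > b.priority;   // lower enum value is served first
        return a.sequence > b.sequence;       // earlier request is served first
    }
};

using AssetQueue = std::priority_queue<AssetRequest, std::vector<AssetRequest>, HigherPriorityFirst>;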

1.3 Memory Infrastructure

  • Pool allocators for high-churn objects (particles, projectiles, path nodes)
  • Frame allocator for per-frame temporaries (O(1) alloc, O(1) reset)
  • Container optimization - Fix the DynamicVectorClass growth factor (#265)
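
A minimal frame-allocator sketch, assuming per-frame temporaries are trivially destructible (FrameAllocator is a hypothetical class, not an existing engine allocator):

// Hypothetical bump-pointer frame allocator: O(1) allocate, O(1) reset per frame.
#include <cstddef>
#include <cstdint>
#include <vector>

class FrameAllocator
{
public:
    explicit FrameAllocator(size_t capacity) : m_buffer(capacity), m_offset(0) {}

    void* allocate(size_t size, size_t alignment = alignof(std::max_align_t))
    {
        size_t aligned = (m_offset + alignment - 1) & ~(alignment - 1);
        if (aligned + size > m_buffer.size())
            return nullptr;                   // caller falls back to the general heap
        m_offset = aligned + size;
        return m_buffer.data() + aligned;
    }

    void reset() { m_offset = 0; }            // once per frame; no destructors are run

private:
    std::vector<uint8_t> m_buffer;
    size_t               m_offset;
};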

Phase 2: Client Parallelization (Low Risk, No Replay Impact)

2.1 Particle System Parallelization

Addresses: #57, #124

10,000 particles ÷ 4 workers = 2,500 per worker
Each particle update is independent:
- Position += Velocity * dt
- Velocity += Gravity * dt
- Age, Color, Size interpolation

SIMD opportunity: Process 4-8 particles per instruction with SSE/AVX.
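
A sketch of how the per-frame update could be chunked across the job system sketched in 1.1 (ParticleSoA and its fields are assumptions, not the engine's current particle layout):

// Hypothetical chunked particle update: one independent job per slice of the array.
#include <algorithm>
#include <cstddef>
#include <vector>

struct ParticleSoA
{
    std::vector<float> posX, posY, posZ;
    std::vector<float> velX, velY, velZ;
    std::vector<float> age;
};

void updateParticlesParallel(ParticleSoA& p, float dt, float gravity,
                             JobSystem& jobs, size_t workerCount)
{
    const size_t count = p.posX.size();
    const size_t chunk = (count + workerCount - 1) / std::max<size_t>(workerCount, 1);

    for (size_t begin = 0; begin < count; begin += chunk)
    {
        const size_t end = std::min(begin + chunk, count);
        jobs.submit([&p, dt, gravity, begin, end] {
            for (size_t i = begin; i < end; ++i)   // each particle is independent; SIMD-friendly layout
            {
                p.velZ[i] += gravity * dt;
                p.posX[i] += p.velX[i] * dt;
                p.posY[i] += p.velY[i] * dt;
                p.posZ[i] += p.velZ[i] * dt;
                p.age[i]  += dt;
            }
        });
    }
    jobs.waitForAll();   // rejoin before the renderer reads the particle buffers
}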

2.2 Render Culling Parallelization

Addresses: #1502

  • Parallel frustum culling across drawable objects
  • Hierarchical approach: cull cells first, then objects in visible cells
  • SIMD: Test four bounding spheres against a plane simultaneously (sketched below)
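
A minimal SSE sketch of that four-spheres-vs-plane test (function name and SoA parameters are illustrative):

// Hypothetical SSE sphere/plane test: plane is (nx, ny, nz, d), spheres stored SoA.
#include <xmmintrin.h>

// Returns a 4-bit mask; bit i is set if sphere i lies completely behind the plane.
int spheresOutsidePlane(const float* cx, const float* cy, const float* cz,
                        const float* radius,
                        float nx, float ny, float nz, float d)
{
    __m128 dist = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(_mm_set1_ps(nx), _mm_loadu_ps(cx)),
                   _mm_mul_ps(_mm_set1_ps(ny), _mm_loadu_ps(cy))),
        _mm_add_ps(_mm_mul_ps(_mm_set1_ps(nz), _mm_loadu_ps(cz)),
                   _mm_set1_ps(d)));                              // dist = dot(n, center) + d
    __m128 outside = _mm_cmplt_ps(dist, _mm_sub_ps(_mm_setzero_ps(), _mm_loadu_ps(radius)));
    return _mm_movemask_ps(outside);                              // dist < -radius per lane
}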

2.3 Animation Updates

  • Batch skeletal animation jobs
  • SIMD bone transforms (4x speedup potential)

2.4 Audio Decompression

  • Decompress audio on a background thread before it is needed for playback
  • Ring buffer for streaming music

Phase 3: Simulation Threading (Medium Risk, Requires Determinism)

3.1 Physics Parallelization

Stage 1: Broad Phase (Parallel) - Spatial hash to find collision candidates
Stage 2: Narrow Phase (Parallel) - Detailed collision tests
Stage 3: Resolution (Serial) - Apply responses in ObjectID order (determinism)
Stage 4: Integration (Parallel) - Update positions/velocities

3.2 Pathfinding Parallelization

Addresses: #100, #1984

┌─────────────────────────────────────────────────────────────┐
│   Request Queue → Worker Pool (N threads) → Results Buffer  │
│                                                              │
│   Each worker:                                               │
│   - Grabs request from queue                                │
│   - Runs A* with own open/closed lists                      │
│   - Stores result in completion buffer                      │
│                                                              │
│   DETERMINISM: Deliver results sorted by (Frame, RequestID) │
└─────────────────────────────────────────────────────────────┘

Key insight: A* searches are independent as long as the pathfinding grid is read-only while searches are in flight.
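
A sketch of the deterministic delivery step (PathResult and the deliver callback are hypothetical; the point is only the sort key):

// Hypothetical: workers finish in any order, results are applied sorted by (frame, requestId).
#include <algorithm>
#include <functional>
#include <vector>

struct PathResult
{
    unsigned frame;       // frame the request was issued on
    unsigned requestId;   // issue order within that frame
    // ... waypoints ...
};

void applyPathResults(std::vector<PathResult>& completed,
                      const std::function<void(const PathResult&)>& deliver)
{
    std::sort(completed.begin(), completed.end(),
              [](const PathResult& a, const PathResult& b) {
                  return a.frame != b.frame ? a.frame < b.frame
                                            : a.requestId < b.requestId;
              });
    for (const PathResult& r : completed)
        deliver(r);       // deterministic order regardless of which worker finished first
}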

3.3 AI Update Batching

Tier 1: AIPlayer decisions - Independent per player, parallel
Tier 2: AIGroup coordination - Groups don't interact, parallel
Tier 3: Unit AI updates - Batch into parallel chunks

Synchronization: Collect results, apply in ObjectID order.


Phase 4: Advanced Optimization (Higher Risk)

4.1 Data-Oriented Design

Addresses: #353, #1948

Problem: Object-oriented layout causes cache misses

Object 0: [vtable][pos][vel][hp][armor][...100 fields...]
Object 1: [vtable][pos][vel][hp][armor][...100 fields...]
// Cache pulls ~100 unused fields per object

Solution: Structure-of-Arrays

Positions:  [pos0][pos1][pos2][pos3]...  // All in cache line
Velocities: [vel0][vel1][vel2][vel3]...  // SIMD-friendly

Hot/Cold splitting: Pack frequently-accessed data together.
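
A sketch of what a hot/cold split could look like (field choices are illustrative, not the actual Object layout):

// Hypothetical hot/cold split: the per-frame loop streams only the hot SoA arrays.
#include <cstddef>
#include <vector>

struct HotObjectData          // touched every frame; packed for cache and SIMD
{
    std::vector<float> posX, posY, posZ;
    std::vector<float> velX, velY, velZ;
};

struct ColdObjectData         // touched occasionally (UI, scripts, save/load)
{
    std::vector<int>   armorType;
    std::vector<float> buildProgress;
    // ... the remaining rarely-read fields ...
};

void integrate(HotObjectData& hot, float dt)
{
    const size_t n = hot.posX.size();
    for (size_t i = 0; i < n; ++i)    // contiguous memory, no unused fields pulled into cache
    {
        hot.posX[i] += hot.velX[i] * dt;
        hot.posY[i] += hot.velY[i] * dt;
        hot.posZ[i] += hot.velZ[i] * dt;
    }
}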

4.2 GPU Compute

Addresses: #880

  • GPU particles - Update 10,000+ particles with minimal CPU involvement
  • GPU culling - Hierarchical Z-buffer occlusion culling
  • GPU pathfinding (experimental) - Flow fields via compute shader

4.3 Hierarchical Pathfinding (HPA*)

Level 2: Region Graph (coarse) - Precomputed
Level 1: Cluster Graph (medium) - Precomputed edges
Level 0: Cell Graph (fine) - Only search near start/goal

Benefits: 10-100x fewer nodes searched

Determinism Strategy

Critical constraint: All GameLogic threading must maintain replay compatibility.

Parallel Compute, Serial Apply

Phase 1 (Parallel): Compute new states in thread-local buffers
Phase 2 (Serial):   Apply changes in deterministic order (by ObjectID)
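
A sketch of the pattern, assuming per-worker output buffers (PendingChange and its payload are illustrative):

// Hypothetical parallel-compute / serial-apply: merge per-worker buffers, sort by ObjectID,
// then apply on the main thread so the result is independent of thread scheduling.
#include <algorithm>
#include <functional>
#include <vector>

struct PendingChange
{
    unsigned objectId;
    float    newHealth;   // illustrative payload
};

void applyDeterministically(std::vector<std::vector<PendingChange>>& perWorker,
                            const std::function<void(const PendingChange&)>& apply)
{
    std::vector<PendingChange> merged;
    for (std::vector<PendingChange>& buffer : perWorker)
        merged.insert(merged.end(), buffer.begin(), buffer.end());

    std::stable_sort(merged.begin(), merged.end(),
                     [](const PendingChange& a, const PendingChange& b) {
                         return a.objectId < b.objectId;
                     });

    for (const PendingChange& change : merged)
        apply(change);    // main thread only, deterministic order
}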

Testing

# Run the same replay N times and compare checksums
for i in {1..10}; do
  generalszh.exe -headless -replay test.rep -checksum > checksum_$i.txt
done
# Every run must produce identical output (diff prints nothing on success)
for f in checksum_*.txt; do diff checksum_1.txt "$f"; done

Profiling Infrastructure

Before optimizing, we need measurement:

  • Frame time breakdown - GameLogic, Pathfinding, Rendering, Physics, Audio
  • Memory metrics - Allocation rate, fragmentation, cache miss rate
  • Threading metrics - Core utilization, lock contention, job queue depth

Output formats: Chrome Tracing, Tracy Profiler, ETW.
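
As one possible starting point, a minimal scoped-event sketch that writes Chrome Tracing "complete" events (ScopedProfileEvent is a hypothetical name; Tracy/ETW backends would hook into the same scopes):

// Hypothetical scoped timer emitting Chrome Tracing ph:"X" events (microsecond timestamps).
// Wrap all emitted objects in a JSON array and open in chrome://tracing or Perfetto.
#include <chrono>
#include <cstdio>

class ScopedProfileEvent
{
public:
    ScopedProfileEvent(const char* name, FILE* out)
        : m_name(name), m_out(out), m_start(std::chrono::steady_clock::now()) {}

    ~ScopedProfileEvent()
    {
        using namespace std::chrono;
        auto now = steady_clock::now();
        long long ts  = static_cast<long long>(duration_cast<microseconds>(m_start.time_since_epoch()).count());
        long long dur = static_cast<long long>(duration_cast<microseconds>(now - m_start).count());
        std::fprintf(m_out,
                     "{\"name\":\"%s\",\"ph\":\"X\",\"ts\":%lld,\"dur\":%lld,\"pid\":1,\"tid\":1},\n",
                     m_name, ts, dur);
    }

private:
    const char* m_name;
    FILE*       m_out;
    std::chrono::steady_clock::time_point m_start;
};

// Usage: { ScopedProfileEvent ev("GameLogic::update", traceFile); /* timed work */ }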


Implementation Order

Phase             | Risk   | Replay Impact | Benefit
1.1 Job System    | Low    | None          | Foundation for all threading
1.2 Asset Loading | Low    | None          | Reduced loading hitches
1.3 Memory        | Low    | None          | Reduced allocation overhead
2.1 Particles     | Low    | None          | Better FPS during explosions
2.2 Culling       | Low    | None          | Better FPS with many objects
3.1 Physics       | Medium | Must verify   | Better FPS in combat
3.2 Pathfinding   | Medium | Must verify   | Faster unit response
4.x Advanced      | High   | Must verify   | Major performance gains

Questions for Discussion

  1. Should we prioritize client-side optimizations (Phase 1-2) before touching simulation (Phase 3)?
  2. Is there appetite for Data-Oriented Design refactoring given it's a significant architectural change?
  3. Should profiling infrastructure be implemented first to guide optimization priorities?
  4. What's the acceptable risk tolerance for simulation threading given replay compatibility requirements?

Reference document: tmp/parallelize.md contains detailed diagrams and technical specifications for each optimization area.

🤖 Generated with Claude Code
