The Reinforcement Learning Snake project implements a Deep Q-Network (DQN) agent that learns to play Snake from scratch. This document details the architecture, algorithms, and implementation choices.
```text
Input Layer:    16 neurons  (state representation)
        ↓
Hidden Layer 1: 128 neurons (ReLU activation)
        ↓
Hidden Layer 2: 128 neurons (ReLU activation)
        ↓
Output Layer:   4 neurons   (Q-values for actions)
```
The agent observes the game state through a carefully designed 16-dimensional feature vector:
- Danger Indicators (4 dimensions): Binary flags for immediate threats
  - state[0]: Danger straight ahead
  - state[1]: Danger to the right
  - state[2]: Danger to the left
  - state[3]: Danger behind
- Food Direction (4 dimensions): One-hot encoding of food direction
  - state[4]: Food is up
  - state[5]: Food is down
  - state[6]: Food is left
  - state[7]: Food is right
- Distance to Food (2 dimensions): Normalized coordinates
  - state[8]: Normalized x-distance to food
  - state[9]: Normalized y-distance to food
- Current Direction (4 dimensions): One-hot encoding of the snake's movement
  - state[10]: Moving up
  - state[11]: Moving down
  - state[12]: Moving left
  - state[13]: Moving right
- Game Context (2 dimensions):
  - state[14]: Snake length normalized by grid area
  - state[15]: Steps without food, normalized by 100
The DQN algorithm approximates the optimal action-value function Q*(s,a) using the Bellman equation:
Q*(s,a) = E[R_t + γ * max_a' Q*(s_{t+1}, a') | s_t = s, a_t = a]
Where:
- R_t is the immediate reward
- γ ∈ [0,1] is the discount factor
- s_t, a_t are the current state and action
- s_{t+1}, a' are the next state and the optimal next action
The network minimizes the temporal difference error:
L(θ) = E[(R_t + γ * max_a' Q(s_{t+1}, a'; θ^-) - Q(s_t, a_t; θ))^2]
Where:
- θ are the current network parameters
- θ^- are the target network parameters (updated periodically)
```text
for each episode:
    reset environment
    get initial state
    while not terminal:
        select action via ε-greedy policy
        execute action, observe reward and next_state
        store experience (s, a, r, s', done) in replay buffer
        if replay buffer has enough experiences:
            sample random minibatch
            perform gradient descent step
        if step % target_update_frequency == 0:
            update target network parameters
    decay exploration rate ε
```

- Buffer Size: 50,000 experiences
- Sampling: Random minibatch of 128 experiences
- Purpose: Break temporal correlations, improve sample efficiency
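A sketch of uniform minibatch sampling from the buffer; the `Experience` layout mirrors the (s, a, r, s', done) tuple, and sampling with replacement is an assumption here (sampling without replacement is equally common):

```cpp
#include <cstddef>
#include <deque>
#include <random>
#include <vector>

struct Experience {
    std::vector<float> s, s_next;
    int a = 0;
    float r = 0.0f;
    bool done = false;
};

std::vector<Experience> sample_batch(const std::deque<Experience>& buffer,
                                     std::size_t batch_size, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, buffer.size() - 1);
    std::vector<Experience> batch;
    batch.reserve(batch_size);
    for (std::size_t i = 0; i < batch_size; ++i)
        batch.push_back(buffer[pick(rng)]);  // uniform, with replacement
    return batch;
}
```

Drawing uniformly from 50,000 stored transitions means consecutive samples in a batch almost never come from the same episode segment, which is exactly how the temporal correlations are broken.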
- Update Frequency: Every 50 training steps
- Purpose: Provide stable targets for TD-learning
- Mechanism: Copy weights from main network to target network
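The hard-update mechanism can be sketched as a step counter gating a full parameter copy (the `Trainer` struct and flattened parameter vector are illustrative simplifications):

```cpp
#include <vector>

struct Trainer {
    long step = 0;
    int target_update_frequency = 50;           // as configured above
    std::vector<float> online;                  // θ  (flattened, illustrative)
    std::vector<float> target;                  // θ⁻ (flattened, illustrative)

    void after_gradient_step() {
        ++step;
        if (step % target_update_frequency == 0)
            target = online;  // hard copy; no soft/Polyak averaging
    }
};
```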
The reward function shapes the agent's behavior:
```cpp
float reward = 0.0f;

if (food_eaten) {
    reward += 10.0f;                 // Primary reward
} else if (moved_closer_to_food) {
    reward += 0.1f;                  // Shaping reward
} else if (moved_away_from_food) {
    reward -= 0.15f;                 // Small penalty
}

if (game_over) {
    reward -= 10.0f;                 // Strong penalty for death
}
```

Actions are selected with an ε-greedy policy:

```text
action = {
    random_action,      with probability ε
    argmax_a Q(s,a),    with probability 1-ε
}
```

- Start: ε = 1.0 (100% exploration)
- Decay: ε ← ε * 0.998 per episode
- Minimum: ε = 0.01 (1% exploration)
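ε-greedy selection and the decay schedule together fit in a few lines; this is a generic sketch of the policy described above, not the project's exact code:

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Explore with probability ε, otherwise exploit the greedy action.
int select_action(const std::vector<float>& q, float epsilon, std::mt19937& rng) {
    std::uniform_real_distribution<float> coin(0.0f, 1.0f);
    if (coin(rng) < epsilon) {
        std::uniform_int_distribution<int> any(0, int(q.size()) - 1);
        return any(rng);                                               // explore
    }
    return int(std::max_element(q.begin(), q.end()) - q.begin());      // exploit
}

// Multiplicative per-episode decay with a floor.
float decay_epsilon(float epsilon) {
    return std::max(0.01f, epsilon * 0.998f);
}
```

With this schedule, ε reaches its 0.01 floor after roughly ln(0.01)/ln(0.998) ≈ 2,300 episodes.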
- Grid Size: 12×12 cells
- Cell Size: 40×40 pixels
- Total Game Area: 500×500 pixels
```cpp
std::deque<Point> snake;   // Front = head, back = tail
Dir dir = Dir::RIGHT;      // Current movement direction
```

An episode terminates on any of the following conditions:

- Wall Collision: Snake head outside grid bounds
- Self Collision: Head intersects with body segments
- Timeout: Too many steps without eating food
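The three terminal conditions above map directly onto the `std::deque<Point>` layout (front = head). A sketch, with `is_terminal` and its parameters as illustrative names:

```cpp
#include <cstddef>
#include <deque>

struct Point {
    int x, y;
    bool operator==(const Point& o) const { return x == o.x && y == o.y; }
};

bool is_terminal(const std::deque<Point>& snake, int grid,
                 int steps_without_food, int timeout_steps) {
    const Point& head = snake.front();
    // Wall collision: head outside the grid bounds.
    if (head.x < 0 || head.y < 0 || head.x >= grid || head.y >= grid) return true;
    // Self collision: head intersects any body segment (skip index 0, the head itself).
    for (std::size_t i = 1; i < snake.size(); ++i)
        if (snake[i] == head) return true;
    // Timeout: too many steps without eating food.
    return steps_without_food > timeout_steps;
}
```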
- Game Board (500×500px): Main game visualization
- Statistics Panel (700×500px): Training graphs and parameters
- Network Weights (400×400px): Static network visualization
- Network Activity (400×400px): Real-time forward pass visualization
- 5×7 pixel characters for all ASCII values
- No external font dependencies
- Efficient SDL rendering
- Neural Network Weights: Color-coded connections (red=positive, green=negative)
- Training Graphs: Score history, average scores, epsilon decay
- Live Network Activity: Neuron activations during forward pass
- Parameter Display: Current hyperparameter values with adjustment hints
- Learning Rate (0.00001 - 0.1): Adam optimizer step size
- Gamma (0.5 - 0.999): Discount factor for future rewards
- Epsilon Decay (0.9 - 0.9999): Exploration rate decay
- Batch Size (16 - 512): Mini-batch size
- Replay Buffer Size (1000 - 100000): Experience storage
- Reward Food (1.0 - 100.0): Food eating reward
- Reward Closer (0.0 - 2.0): Moving closer reward
- Penalty Away (-2.0 - 0.0): Moving away penalty
- Penalty Death (-100.0 - -1.0): Death penalty
- Train Speed (1 - 100): Training acceleration factor
- ↑/↓ Arrows: Select parameter
- ←/→ Arrows: Adjust selected parameter
- R Key: Reset all to defaults
- Space: Reset exploration (ε=1)
- +/- Keys: Adjust rendering FPS
- GPU Libraries: libtorch_cuda.so, libc10_cuda.so
- CUDA Runtime: libcudart.so
- NVRTC Compiler: libnvrtc.so
- Automatic GPU Selection: Falls back to CPU if CUDA unavailable
- Experience Replay: Circular buffer with automatic overflow handling
- Tensor Operations: PyTorch automatic memory management
- SDL Resources: Proper cleanup in destructor
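The replay buffer's overflow handling amounts to capping a deque and evicting the oldest experience when full; a generic sketch (`push_capped` is an illustrative name, with the 50,000-experience cap from the configuration above as the default):

```cpp
#include <cstddef>
#include <deque>
#include <utility>

// Append to a bounded buffer, dropping the oldest element on overflow.
template <typename T>
void push_capped(std::deque<T>& buf, T value, std::size_t max_size = 50000) {
    if (buf.size() >= max_size) buf.pop_front();  // evict oldest experience
    buf.push_back(std::move(value));
}
```

`std::deque` gives O(1) insertion at the back and removal at the front, which is why it is a natural fit for this circular-buffer behavior.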
- Render Skipping: Adjust train_speed to skip expensive rendering
- FPS Control: Adjustable game_speed for visualization
- Batch Processing: Efficient mini-batch training
```text
src/
├── SnakeAI.hpp   # Main AI class declaration (621 lines)
├── SnakeAI.cpp   # AI implementation
└── Utils.h       # Constants, structures, font data (577 lines)
```
- PyTorch C++: Deep learning framework
- SDL3: Graphics and window management
- GLM: Mathematics library
- CUDA: GPU acceleration (optional)
- CMake 3.22+: Build configuration
- vcpkg: Package management
- GCC 9+/Clang 10+: C++17 compilation
- SIGINT Handler: Non-destructive interruption for forced rendering
- Graceful Shutdown: Proper resource cleanup
- Float32: Single precision for neural networks
- Normalized Values: All state features normalized to [0,1] or [-1,1]
- Stable Training: Target networks prevent divergence
- Modular Design: Easy to modify network architecture
- Parameter System: Runtime adjustment without recompilation
- Visualization Framework: Adaptable to different games
This architecture document provides a comprehensive overview of the Reinforcement Learning Snake implementation, covering the mathematical foundations, algorithmic details, and engineering choices.