Reinforcement Learning

F-Zero Racing Agent

A PPO agent trained across 80 parallel SNES emulators to race F-Zero, reaching global top 300 on a task previously deemed unsolvable due to sparse terminal-only rewards over 3,000+ step horizons.

PPO · Reward Shaping · CNN+MLP Policy · Stable-Retro

The Challenge

F-Zero (1991, SNES) presents an extreme reinforcement learning challenge: the game only provides a reward when the race ends, but a single race spans 3,000+ decision steps. With a discount factor of 0.99, the terminal reward is attenuated by a factor of about 0.99^3000 ≈ 8e-14 at the start of the episode, so almost no signal reaches early decisions, making gradient-based learning nearly impossible.
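To put a number on that attenuation, the short Python sketch below computes the discount weight gamma^T for the quoted values (gamma = 0.99, T = 3,000) and the number of steps over which the weight halves. The names discount_weight and half_life_steps are illustrative and not taken from the project.

```python
import math

GAMMA = 0.99     # discount factor quoted in the text
HORIZON = 3000   # lower bound on decision steps per race


def discount_weight(gamma: float, steps: int) -> float:
    """Weight applied to a reward that arrives `steps` steps in the future."""
    return gamma ** steps


def half_life_steps(gamma: float) -> float:
    """Number of steps after which the discount weight drops to one half."""
    return math.log(0.5) / math.log(gamma)


weight = discount_weight(GAMMA, HORIZON)
half_life = half_life_steps(GAMMA)

print(f"gamma^{HORIZON} = {weight:.2e}")
print(f"half-life = {half_life:.1f} steps")
```

The half-life of roughly 69 steps is the more useful figure in practice: a reward has to arrive within a few hundred steps of an action to influence it noticeably, which is what dense per-step shaping provides.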

The solution required two innovations: dense reward shaping via spline-interpolated track centerlines, and a dual-input CNN+MLP policy that fuses visual observations with engineered state features. The agent was trained across 80 parallel emulators at ~900 FPS aggregate throughput, completing 50M steps in about 14 hours.

The agent learned to boost (43.6% of steps), execute basic blast turning, and navigate corners with shoulder leans — ultimately achieving competitive race times on Mute City I.

Technical Highlights

  • Spline-Interpolated Reward

    Extracted 58 RAM checkpoints and interpolated them into a 660-point smooth track centerline via cubic spline. Per-step reward combines linear and quadratic progress terms with stuck detection.

  • Dual-Input CNN+MLP Policy

    4-layer CNN processes stacked grayscale frames (4x84x96); 2-layer MLP encodes a 59-dim feature vector (speed, energy, track preview, action history). Outputs fused into a 1024-dim representation.

  • 80-Emulator Parallelism

    SubprocVecEnv runs 80 SNES instances with frameskip-3, VecNormalize for reward normalization, and 4-frame stacking for motion perception. MultiDiscrete action space covers 72 button combinations.

  • Systematic Experimentation

    Debugged BCD timer encoding, discovered broken RAM lap counters, fixed A/Y button swap. Explored PPO, DQN, QR-DQN, and a custom IQN trainer with Linesight-style piecewise schedules.
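As an illustration of the spline-interpolated centerline described in the first highlight, the sketch below resamples 58 control points into a 660-point closed loop. It makes several assumptions not confirmed by the source: the track is treated as a closed 2D loop, the checkpoints are synthetic points on an oval rather than real RAM values, and a uniform Catmull-Rom spline (a simple interpolating cubic) stands in for whichever cubic spline routine the project actually used. All function names are hypothetical.

```python
import math


def catmull_rom(p0, p1, p2, p3, t):
    """Evaluate one uniform Catmull-Rom segment between p1 and p2, t in [0, 1]."""
    t2, t3 = t * t, t * t * t
    return tuple(
        0.5 * (2 * b
               + (c - a) * t
               + (2 * a - 5 * b + 4 * c - d) * t2
               + (3 * b - a - 3 * c + d) * t3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def resample_closed_loop(checkpoints, n_points):
    """Return n_points samples, evenly spaced in spline parameter, around the loop."""
    n = len(checkpoints)
    samples = []
    for i in range(n_points):
        u = i * n / n_points          # position in units of checkpoint segments
        seg = int(u)                  # index of the segment's starting checkpoint
        t = u - seg                   # fractional position inside that segment
        # Neighbouring control points, wrapping around because the loop is closed.
        p0, p1, p2, p3 = (checkpoints[(seg + k) % n] for k in (-1, 0, 1, 2))
        samples.append(catmull_rom(p0, p1, p2, p3, t))
    return samples


# Synthetic stand-in for the 58 RAM checkpoints: points on an oval.
checkpoints = [
    (1000.0 * math.cos(2 * math.pi * k / 58), 600.0 * math.sin(2 * math.pi * k / 58))
    for k in range(58)
]
centerline = resample_closed_loop(checkpoints, 660)
print(len(centerline))  # 660
```

Samples here are spaced evenly in spline parameter, not in arc length. If checkpoints are unevenly spaced along the track, an arc-length reparameterization would be needed before using the sample index as a progress measure.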

Under The Hood

Observation Space

A Dict observation combines 4 stacked grayscale frames (84x96) with a 59-dimensional float vector: current speed, energy, lap progress, the 2D positions of the 10 upcoming checkpoints on the spline centerline, and the last 3 actions taken. The track preview gives the agent lookahead for corner anticipation.
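The text gives the vector's total size and its components but not the width of each component. One split that adds up to 59 is 3 scalars, 10 x 2 preview coordinates, and 3 actions encoded as 12 button flags each (3 + 20 + 36). The 12-flag action encoding is an assumption made for this sketch, not something the source states, and build_feature_vector is a hypothetical helper name.

```python
N_PREVIEW = 10        # upcoming checkpoints, from the text
N_HISTORY = 3         # past actions kept, from the text
BUTTONS_PER_ACTION = 12  # ASSUMPTION: one flag per SNES button


def build_feature_vector(speed, energy, lap_progress, preview_xy, last_actions):
    """Flatten scalar state, track preview, and action history into one list of floats."""
    if len(preview_xy) != N_PREVIEW:
        raise ValueError(f"expected {N_PREVIEW} preview points")
    if len(last_actions) != N_HISTORY:
        raise ValueError(f"expected {N_HISTORY} past actions")
    if any(len(a) != BUTTONS_PER_ACTION for a in last_actions):
        raise ValueError(f"expected {BUTTONS_PER_ACTION} flags per action")

    features = [float(speed), float(energy), float(lap_progress)]
    for x, y in preview_xy:               # 10 points x 2 coordinates = 20 values
        features.extend((float(x), float(y)))
    for action in last_actions:           # 3 actions x 12 flags = 36 values
        features.extend(float(flag) for flag in action)
    return features


features = build_feature_vector(
    speed=0.5,
    energy=1.0,
    lap_progress=0.25,
    preview_xy=[(float(i), 0.0) for i in range(N_PREVIEW)],
    last_actions=[[0] * BUTTONS_PER_ACTION for _ in range(N_HISTORY)],
)
print(len(features))  # 59
```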

Reward Design

Progress along the spline centerline provides dense per-step signal. A linear term ensures gradient at all speeds (cold-start learning); a quadratic term amplifies high-speed rewards to break the speed plateau. Stuck detection terminates episodes with a penalty after 100 steps of insufficient movement.
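A minimal sketch of such a reward follows. The overall shape (linear plus quadratic progress terms, and a stuck counter that ends the episode after 100 steps) comes from the description above. Every coefficient, the movement threshold, the penalty size, and the choice to count consecutive slow steps are assumptions chosen for illustration. ProgressReward is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class ProgressReward:
    linear_coef: float = 1.0      # ASSUMPTION: weight of the linear term
    quadratic_coef: float = 0.1   # ASSUMPTION: weight of the quadratic term
    min_progress: float = 0.01    # ASSUMPTION: below this, a step counts as "stuck"
    stuck_limit: int = 100        # from the text: steps of insufficient movement
    stuck_penalty: float = -10.0  # ASSUMPTION: penalty added on termination
    stuck_steps: int = 0          # running count of consecutive slow steps

    def step(self, progress_delta: float):
        """Return (reward, terminated) given progress along the centerline this step."""
        forward = max(progress_delta, 0.0)
        # The linear term gives a signal even at low speed (and penalizes reversing);
        # the quadratic term grows faster than linearly with forward progress.
        reward = (self.linear_coef * progress_delta
                  + self.quadratic_coef * forward ** 2)

        if progress_delta < self.min_progress:
            self.stuck_steps += 1
        else:
            self.stuck_steps = 0

        if self.stuck_steps >= self.stuck_limit:
            self.stuck_steps = 0  # reset so the object can be reused next episode
            return reward + self.stuck_penalty, True
        return reward, False


shaper = ProgressReward()
slow, _ = shaper.step(1.0)
fast, _ = shaper.step(2.0)
print(slow, fast)
```

Because progress per step scales with speed, doubling the speed more than doubles the reward under this form, which is the mechanism the quadratic term relies on to push past a speed plateau.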

Training Configuration

PPO with n_steps=512, batch_size=2048, 4 epochs per rollout, gamma=0.99, GAE lambda=0.95, and adaptive KL targeting. Orthogonal weight initialization with LeakyReLU throughout. Total: 50M timesteps with W&B experiment tracking.
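The quoted hyperparameters imply a particular update schedule, worked out in the sketch below. The dictionary keys follow the keyword names of Stable-Baselines3's PPO constructor, which is an assumption suggested by the SubprocVecEnv and VecNormalize mentions rather than something the source confirms. The sketch also assumes the 50M total counts environment steps summed over all 80 emulators. The KL setting is omitted because its value is not given.

```python
N_ENVS = 80                    # parallel emulators, from the text
TOTAL_TIMESTEPS = 50_000_000   # ASSUMPTION: summed over all environments

ppo_kwargs = {
    "n_steps": 512,      # steps collected per environment per rollout
    "batch_size": 2048,  # minibatch size for each gradient step
    "n_epochs": 4,       # passes over each rollout
    "gamma": 0.99,
    "gae_lambda": 0.95,
}

# Transitions gathered before each policy update.
rollout_size = N_ENVS * ppo_kwargs["n_steps"]

# Minibatches per pass; the rollout divides evenly, so none is truncated.
minibatches_per_epoch, remainder = divmod(rollout_size, ppo_kwargs["batch_size"])

# Gradient steps taken on each rollout, and rollouts over the whole run.
grad_steps_per_rollout = minibatches_per_epoch * ppo_kwargs["n_epochs"]
n_rollouts = TOTAL_TIMESTEPS // rollout_size

print(f"rollout size:           {rollout_size}")            # 40960
print(f"minibatches per epoch:  {minibatches_per_epoch}")   # 20
print(f"grad steps per rollout: {grad_steps_per_rollout}")  # 80
print(f"rollouts in the run:    {n_rollouts}")              # 1220
```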