# ESP32 / Embedded Development Guide
This guide covers running TinyRL on ESP32 microcontrollers for real-time reinforcement learning.
## Quick Start

```bash
# Host simulation (no hardware needed)
c++ -std=c++17 -DAG_EMBEDDED \
  -I src -I examples/stream_x/src \
  examples/stream_x_esp32/src/stream_ac_continuous.cpp -o esp32_sim
./esp32_sim
```

```bash
# PlatformIO upload
cd examples/stream_x_esp32
pio run --target upload
pio device monitor
```
## Table of Contents
- Overview
- Hardware Requirements
- Software Setup
- Project Structure
- Building and Flashing
- Optimization Strategies
- Troubleshooting
## Overview

### Why Run TinyRL on ESP32?
| Benefit | Description |
|---|---|
| Real-time Learning | Online reinforcement learning directly on microcontrollers |
| Minimal Footprint | Header-only integration with no external dependencies |
| Low Latency | Eliminates round-trips to host PC for inference |
| Power Efficient | Optimized for battery-powered applications |
| Cost Effective | Uses affordable, widely-available hardware |
### Supported Hardware
| Board | Status | Notes |
|---|---|---|
| ESP32-S3 | ✅ Recommended | Tested with PSRAM |
| ESP32-S2 | ✅ Compatible | Good performance |
| ESP32 | ⚠️ Basic | Limited memory |
| ESP32-C3 | ✅ Compatible | RISC-V core |
## Hardware Requirements

### Recommended Setup
- Board: Freenove ESP32-S3 WROOM or similar ESP32-S3 board
- Memory: 8MB+ PSRAM recommended for larger models
- Storage: 4MB+ flash for firmware and model storage
- USB: Type-C or micro-USB for programming and debugging
### Minimum Requirements
- RAM: 512KB (for minimal models)
- Flash: 2MB
- CPU: 240MHz dual-core
### Optional Hardware
- Display: OLED/LCD for real-time visualization
- Sensors: IMU, temperature, pressure sensors for environment interaction
- Actuators: Servos, motors, relays for action execution
## Software Setup

### Prerequisites

- PlatformIO IDE (recommended)
- ESP-IDF (alternative)
- Arduino IDE (basic support)
- Install the ESP32 board support package
- Add the TinyRL headers to your libraries folder
### Development Environment Setup

#### PlatformIO (Recommended)

```bash
# Create new project
pio project init --board freenove_esp32_s3_wroom

# Or use existing ESP32 example
cd examples/stream_x_esp32
```
#### ESP-IDF
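Concrete ESP-IDF setup commands aren't given above; as a rough sketch (assuming ESP-IDF v4.1+ is installed and its `export.sh` has been sourced — the project name `tinyrl_demo` and the header destination are illustrative, not part of the repository):

```bash
# Create a bare project and select the ESP32-S3 target
idf.py create-project tinyrl_demo
cd tinyrl_demo
idf.py set-target esp32s3

# Vendor the TinyRL headers into a component include directory
mkdir -p components/tinyrl/include
cp /path/to/repo/src/*.h components/tinyrl/include/
cp /path/to/repo/examples/stream_x/src/*.h components/tinyrl/include/
```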
## Project Structure

### PlatformIO Project Layout

```text
examples/stream_x_esp32/
├── platformio.ini                  # Build configuration
├── src/
│   ├── stream_ac_continuous.cpp    # Continuous StreamAC example
│   ├── stream_ac_discrete.cpp      # Discrete StreamAC example
│   ├── stream_q.cpp                # StreamQ example
│   ├── stream_q_atari.cpp          # StreamQ Atari example
│   ├── stream_q_pong_ram.cpp       # StreamQ Pong (RAM) example
│   ├── benchmark_kernels.cpp       # Kernel benchmarks
│   └── getentropy_dummy.c          # Entropy source for embedded
├── lib/                            # Optional: copied headers
│   ├── autograd/                   # TinyRL core headers
│   └── stream_x/                   # Stream-X module headers (StreamAC, StreamQ, StreamSARSA)
└── boards/                         # Board-specific configurations
    └── freenove_esp32_s3_wroom.json
```
### Header Integration Strategies

#### Strategy 1: Relative Includes (Development)

```ini
; platformio.ini
[env:freenove_esp32_s3_wroom]
platform = espressif32
board = freenove_esp32_s3_wroom
framework = arduino
build_flags =
    -DAG_EMBEDDED
    -DAG_ENABLE_SIMD=OFF
    -I../../src
    -I../../examples/stream_x/src ; Stream-X module headers (StreamAC, StreamQ, StreamSARSA)
```
- Pros: Quick iteration, keeps repo structure intact
- Cons: Requires the full repository, not standalone
#### Strategy 2: Vendor Headers (Production)

```bash
# Copy headers to project
mkdir -p lib/autograd lib/stream_x
cp ../../src/*.h lib/autograd/
cp ../../examples/stream_x/src/*.h lib/stream_x/
```
- Pros: Standalone project, production-ready
- Cons: Manual header management, larger project size
## Building and Flashing

### PlatformIO Workflow

```bash
# Navigate to project
cd examples/stream_x_esp32

# Build project
pio run

# Upload to device
pio run --target upload

# Monitor serial output
pio device monitor

# Clean build
pio run --target clean
```
### ESP-IDF Workflow

```bash
# Configure project
idf.py menuconfig

# Build project
idf.py build

# Flash to device
idf.py flash

# Monitor output
idf.py monitor
```
### Build Configuration

#### PlatformIO Configuration

```ini
; platformio.ini
[env:freenove_esp32_s3_wroom]
platform = espressif32
board = freenove_esp32_s3_wroom
framework = arduino

; Build flags for TinyRL
build_flags =
    -DAG_EMBEDDED                 ; Enable embedded mode
    -DAG_ENABLE_SIMD=OFF          ; Disable SIMD for compatibility
    -O2                           ; Optimize for size/speed
    -ffunction-sections           ; Enable function-level linking
    -fdata-sections               ; Enable data-level linking
    -I../../src                   ; Include TinyRL headers
    -I../../examples/stream_x/src ; Include Stream-X headers

; Memory configuration
board_build.partitions = huge_app.csv ; Use huge app partition
board_build.psram_type = opi          ; Enable PSRAM
```
#### ESP-IDF Configuration

```bash
# Configure memory layout
idf.py menuconfig
# Navigate to: Component config → ESP32S3-Specific → PSRAM
# Enable: Support for external, SPI PSRAM
# Set: PSRAM clock speed to 80MHz
```
### Upload and Monitoring

```bash
# Upload firmware
pio run --target upload

# Monitor with custom baud rate
pio device monitor --baud 115200

# Monitor with filters (pass --filter once per filter)
pio device monitor --filter time --filter colorize
```
## Optimization Strategies

### ESP32-S3 Optimizations (Recommended)
TinyRL includes specialized optimizations for ESP32-S3, providing 2-4x speedup for neural network operations through pure optimized scalar implementations tuned for the LX7 dual-core architecture.
#### Enabling ESP32-S3 Optimizations

The optimizations are auto-detected when building for ESP32-S3.
#### What's Optimized

| Category | Operation | Speedup | Technique |
|---|---|---|---|
| Matrix Ops | Matrix multiply | 2-3x | Tiled computation, cache-friendly access |
| Matrix Ops | Dot product | 2-3x | 8-way loop unrolling |
| Element-wise | Add/Sub/Mul | 2x | 4-8 way unrolling, ILP |
| Element-wise | Scalar multiply | 2x | Unrolled loops |
| Activations | ReLU/Sigmoid/Tanh | 1.5-2x | Unrolled loops |
| Activations | Softmax | 2x | Optimized exp + normalize |
| CNN | Convolution | 2-3x | im2col + optimized matmul |
| CNN | Max/Avg Pooling | 1.5x | Cache-optimized kernels |
| Optimizers | SGD step | 2x | Fused update kernel |
| Optimizers | RMSProp step | 1.5-2x | Optimized variance updates |
| Normalization | LayerNorm | 2x | Single-pass mean/var, fast rsqrt |
| MLP Inference | 3-layer forward | 2-3x | Fused linear+LN+ReLU, batch=1 path |
| MLP Training | 3-layer backward | 2-3x | Fused backward, batch=1 optimized |
#### Architecture Overview

```text
┌─────────────────────────────────────────────────────────┐
│                         simd.h                          │
│  ┌───────────┐  ┌───────────┐  ┌─────────────────────┐  │
│  │   AVX2    │  │   NEON    │  │      ESP32-S3       │  │
│  │ (x86-64)  │  │  (ARM64)  │  │   (esp32_dsp.h)     │  │
│  └───────────┘  └───────────┘  └──────────┬──────────┘  │
│                                           │             │
│  ┌────────────────────────────────────────▼──────────┐  │
│  │      Pure Optimized Scalar Implementations        │  │
│  │  ───────────────────────────────────────────      │  │
│  │  • 8-way loop unrolling (ILP)                     │  │
│  │  • Cache-friendly access patterns                 │  │
│  │  • IRAM placement for critical funcs              │  │
│  │  • Fast math (Newton-Raphson rsqrt)               │  │
│  │  • Single-pass statistics                         │  │
│  │  • Tiled matrix multiplication                    │  │
│  │  • Fused MLP inference (batch=1)                  │  │
│  │  • Fused MLP backward (batch=1)                   │  │
│  └───────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘
```
#### Key Optimization Techniques
| Technique | Description |
|---|---|
| 8-way Loop Unrolling | Maximizes instruction-level parallelism on LX7 cores |
| Cache-Friendly Access | 32-byte cache line alignment, tiled computation |
| IRAM Placement | Critical functions placed in fast internal RAM |
| Fast rsqrt | Newton-Raphson approximation, 2x faster than 1/sqrtf() |
| Fused mean/var | Single-pass Welford's algorithm for LayerNorm |
| Batch=1 Inference | Specialized fused_mlp3_inference() for RL workloads |
| Batch=1 Training | Specialized fused_mlp3_backward_single() for streaming RL |
#### Example: Measuring Performance

```cpp
#include <Arduino.h>    // Serial
#include <esp_timer.h>  // esp_timer_get_time()

void benchmark_matmul() {
    ag::Matrix A = ag::Matrix::Random(64, 64);
    ag::Matrix B = ag::Matrix::Random(64, 64);

    int64_t start = esp_timer_get_time();
    for (int i = 0; i < 100; i++) {
        ag::Matrix C = A.matmul(B);
        (void)C;  // Keep the result so the loop is not optimized away
    }
    int64_t elapsed = esp_timer_get_time() - start;

    Serial.printf("100x matmul (64x64): %lld us\n", elapsed);
    // Expected: ~2-3x faster with ESP32-S3 optimizations enabled
}

void benchmark_forward_pass() {
    // Create a small MLP
    nn::Sequential model;
    model.add(nn::Linear(64, 32));
    model.add(nn::ReLU());
    model.add(nn::Linear(32, 4));

    ag::Tensor x(ag::Matrix::Random(1, 64), false);

    int64_t start = esp_timer_get_time();
    for (int i = 0; i < 1000; i++) {
        ag::Tensor y = model.forward(x);
    }
    int64_t elapsed = esp_timer_get_time() - start;

    Serial.printf("1000x forward (64->32->4): %lld us\n", elapsed);
}
```
### Memory Optimization

#### Reduce Model Size

```cpp
// In src/stream_ac_continuous.cpp - reduce hidden layer size
#define HIDDEN_SIZE 64  // Instead of 128 or 256
#define NUM_LAYERS  2   // Instead of 3 or 4

// Create smaller network (model provided via set_model)
ContinuousStreamAC agent(
    n_obs,          // Observation dimension
    learning_rate,  // Learning rate
    gamma,          // Discount factor
    lambda,         // Eligibility trace
    kappa_policy,   // Policy overshooting bound
    kappa_value     // Value overshooting bound
);

// Build compact actor/critic and attach:
// agent.set_model(actor_backbone, mu_head, std_head, critic);
```
#### Disable Features

```cpp
// Remove normalization for minimal RAM:
// disable normalization in your model or skip calls to normalize_observation()

// Use a smaller PRNG:
// replace the default RNG with XorShift or similar
```
#### Memory Management

```cpp
// Reuse tensor objects
ag::Tensor workspace(ag::Matrix::Zeros(1, HIDDEN_SIZE), false);

// Clear computational graphs frequently
loss.backward();
loss.clear_graph();

// Prefer automatic storage where possible
ag::Matrix small_matrix(10, 10);  // Object on the stack (element storage may still be heap-allocated)
```
### Performance Optimization

#### Compiler Optimizations

```ini
; platformio.ini
build_flags =
    -O2                 ; Optimize for size/speed
    -ffast-math         ; Fast math operations (relaxes IEEE semantics)
    -fno-exceptions     ; Disable exception handling
    -fno-rtti           ; Disable RTTI
    -ffunction-sections ; Function-level linking
    -fdata-sections     ; Data-level linking
```
#### Runtime Optimizations

```cpp
// Pass by const reference to avoid copies
void process_observation(const ag::Tensor& obs) {
    // Process observation
}

// Reuse a single workspace instead of reallocating
ag::Tensor& get_workspace() {
    static ag::Tensor workspace(ag::Matrix::Zeros(1, 64), false);
    return workspace;
}
```
#### Power Optimization

```cpp
// Reduce CPU frequency for battery operation
#include "esp_pm.h"
#include "esp_sleep.h"

// Set CPU frequency (the type is esp_pm_config_t in ESP-IDF v5+)
esp_pm_config_esp32s3_t pm_config = {
    .max_freq_mhz = 80,  // Reduce from 240MHz
    .min_freq_mhz = 10,
    .light_sleep_enable = true
};
esp_pm_configure(&pm_config);

// Enable light sleep between iterations
esp_light_sleep_start();
```
## Troubleshooting

### Common Issues

#### Build Errors

**Error:** `fatal error: 'autograd/autograd.h' file not found`

**Solution:** Check the include paths in `platformio.ini`:
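The paths should point at the repository's `src` and `examples/stream_x/src` directories, as in the configuration shown earlier (adjust the relative paths to where the repository sits next to your project):

```ini
build_flags =
    -I../../src                    ; TinyRL core headers
    -I../../examples/stream_x/src  ; Stream-X headers
```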
**Error:** `error: 'std::expected' is not a member of 'std'`

**Solution:** Ensure a sufficiently recent C++ standard is enabled (`std::expected` requires C++23; the core examples otherwise build with C++17):
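In PlatformIO this usually means unsetting the framework's default standard and raising it; the flag names below are the usual GCC ones, and whether `-std=gnu++2b` is available depends on the toolchain your platform version ships:

```ini
build_unflags = -std=gnu++11
build_flags   = -std=gnu++17  ; or -std=gnu++2b where std::expected is needed
```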
**Error:** `error: 'AG_EMBEDDED' was not declared`

**Solution:** Add the embedded flag:
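As in the Build Configuration section:

```ini
build_flags =
    -DAG_EMBEDDED  ; Enable embedded mode
```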
#### Selecting Discrete vs Continuous (PlatformIO)

Use PlatformIO environments to switch without renaming files (note that `extends` takes the full section name, including the `env:` prefix):

```ini
[env:freenove_esp32_s3_wroom]
; Continuous (default)
build_src_filter = +<src/stream_ac_continuous.cpp> -<src/stream_ac_discrete.cpp>

[env:freenove_esp32_s3_wroom_discrete]
extends = env:freenove_esp32_s3_wroom
; Discrete
build_src_filter = +<src/stream_ac_discrete.cpp> -<src/stream_ac_continuous.cpp>
```
#### Runtime Errors

**Error:** `Guru Meditation Error: Core 1 panic'ed (LoadProhibited)`

**Solution:** Check memory allocation:

```cpp
// Reduce model size
#define HIDDEN_SIZE 32  // Smaller hidden layer

// Check available memory
size_t free_heap = esp_get_free_heap_size();
Serial.printf("Free heap: %u bytes\n", (unsigned)free_heap);
```
**Error:** `Out of memory`

**Solution:** Enable PSRAM and optimize memory usage:
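For an Arduino-framework build this means the board options shown earlier plus the standard Arduino-ESP32 define (`-DBOARD_HAS_PSRAM`); `opi` applies to octal-SPI PSRAM parts such as those on ESP32-S3 boards:

```ini
board_build.psram_type = opi  ; Enable (octal) PSRAM
build_flags =
    -DBOARD_HAS_PSRAM         ; Let the allocator use external RAM
```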
**Error:** Serial output garbled

**Solution:** Check the baud rate and USB connection:

```bash
# Use correct baud rate
pio device monitor --baud 115200

# Check USB cable and port
ls /dev/ttyUSB*  # or /dev/ttyACM*
```
#### Performance Issues

**Issue:** Slow inference

**Solution:** Apply the flags from Compiler Optimizations above and shrink the model as described in Memory Optimization.

**Issue:** High memory usage

**Solution:** Profile and optimize:
```cpp
#include "esp_heap_caps.h"

// Monitor memory usage
void print_memory_info() {
    Serial.printf("Free heap: %u\n", (unsigned)esp_get_free_heap_size());
    Serial.printf("Largest free block: %u\n",
                  (unsigned)heap_caps_get_largest_free_block(MALLOC_CAP_DEFAULT));
    Serial.printf("Minimum free heap: %u\n", (unsigned)esp_get_minimum_free_heap_size());
}
```
### Debugging Tools

#### Serial Debugging

```cpp
// Add debug prints
#define DEBUG_PRINT(x)   Serial.print(x)
#define DEBUG_PRINTLN(x) Serial.println(x)

// Conditional debugging
#ifdef DEBUG_MODE
DEBUG_PRINTLN("Processing observation...");
DEBUG_PRINT("Observation shape: ");
DEBUG_PRINTLN(obs.shape()[0]);
#endif
```
#### Memory Debugging

```cpp
// Enable heap debugging
#include "esp_heap_caps.h"

// Check for memory leaks around a block of code
void check_memory() {
    size_t free_before = esp_get_free_heap_size();
    // ... your code ...
    size_t free_after = esp_get_free_heap_size();

    if (free_after < free_before) {
        Serial.printf("Memory leak detected: %u bytes\n",
                      (unsigned)(free_before - free_after));
    }
}
```
#### Performance Profiling

```cpp
// Profile execution time
#include "esp_timer.h"

int64_t start_time = esp_timer_get_time();
// ... your code ...
int64_t end_time = esp_timer_get_time();

Serial.printf("Execution time: %lld us\n", end_time - start_time);
```
## Advanced Topics

### Custom Environments

```cpp
// Create a custom environment
class CustomEnvironment {
public:
    ag::Matrix get_observation() {
        // Read sensors, process data
        return ag::Matrix::Random(1, obs_dim);
    }

    void take_action(const ag::Matrix& action) {
        // Execute action (servos, motors, etc.)
        int action_idx = static_cast<int>(action(0, 0));
        execute_action(action_idx);
    }

    double get_reward() {
        // Calculate reward based on current state
        return calculate_reward();
    }

private:
    static constexpr int LED_PIN = 2;  // Example pin, adjust for your board
    int obs_dim = 4;                   // Observation dimension
    double sensor_value = 0.0;         // Latest sensor reading

    void execute_action(int action) {
        // Hardware-specific action execution
        switch (action) {
            case 0: digitalWrite(LED_PIN, HIGH); break;
            case 1: digitalWrite(LED_PIN, LOW);  break;
        }
    }

    double calculate_reward() {
        // Reward calculation logic
        return sensor_value / 100.0;
    }
};
```
### Sensor Integration

```cpp
// IMU integration example
#include <Wire.h>
#include "MPU6050.h"

MPU6050 mpu;

void setup_sensors() {
    Wire.begin();
    mpu.initialize();

    if (!mpu.testConnection()) {
        Serial.println("MPU6050 connection failed");
    }
}

ag::Matrix read_imu_data() {
    int16_t ax, ay, az, gx, gy, gz;
    mpu.getMotion6(&ax, &ay, &az, &gx, &gy, &gz);

    // Normalize and create observation
    ag::Matrix obs(1, 6);
    obs(0, 0) = ax / 16384.0;  // Normalize accelerometer (±2g range)
    obs(0, 1) = ay / 16384.0;
    obs(0, 2) = az / 16384.0;
    obs(0, 3) = gx / 131.0;    // Normalize gyroscope (±250°/s range)
    obs(0, 4) = gy / 131.0;
    obs(0, 5) = gz / 131.0;

    return obs;
}
```
### Wireless Communication

```cpp
// WiFi communication for remote monitoring
#include "WiFi.h"

const char* ssid = "your-ssid";          // Network credentials
const char* password = "your-password";

void setup_wifi() {
    WiFi.begin(ssid, password);
    while (WiFi.status() != WL_CONNECTED) {
        delay(500);
        Serial.print(".");
    }
    Serial.println("WiFi connected");
}

void send_telemetry(const ag::Matrix& obs, double reward) {
    if (WiFi.status() == WL_CONNECTED) {
        // Send data to a server or the cloud
        String data = String(obs(0, 0)) + "," + String(reward);
        // HTTP POST or WebSocket implementation
    }
}
```
### OTA Updates

```cpp
// Over-the-air firmware updates
#include "ArduinoOTA.h"

void setup_ota() {
    ArduinoOTA.setHostname("tinyrl-esp32");
    ArduinoOTA.setPassword("admin");
    ArduinoOTA.begin();
}

void loop() {
    ArduinoOTA.handle();  // Handle OTA updates
    // ... main loop code ...
}
```
For more information, see the ESP32 Example README and Build & Configuration Guide.