Reinforcement Learning (Stream-X Module)
The Stream-X module provides streaming reinforcement learning algorithms built on the Autograd core. These algorithms learn online from single transitions without replay buffers.
Overview
| Algorithm | Description | Actions |
|---|---|---|
| StreamAC | Actor-Critic with eligibility traces | Continuous / Discrete |
| StreamQ | Online Q-learning | Discrete |
| StreamSARSA | On-policy SARSA | Discrete |
Supporting components:
| Component | Description |
|---|---|
| ObGD | Overshooting-bounded optimizer for RL |
| Normalizers | Running-stats observation normalization |
| Reward Scaling | Adaptive reward scaling |
Building Stream-X
Sources are in examples/stream_x/src/ and are only built when explicitly enabled.
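For example, with a hypothetical CMake option (the flag name here is an illustration only — check the repository's top-level CMakeLists.txt for the actual option):

```shell
# Hypothetical flag name; verify against the project's CMakeLists.txt
cmake -S . -B build -DBUILD_STREAM_X=ON
cmake --build build
```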
Algorithms
StreamAC (Actor-Critic)
Policy gradient with learned value baseline. Available in continuous and discrete variants.
Continuous Actions (Gaussian policy):
#include "stream_x/stream_ac_continuous.h"
int obs_dim = 11;
int act_dim = 3;
int hidden = 128;
// Construct agent (model is provided via set_model)
ContinuousStreamAC agent(obs_dim, /*lr=*/1.0f, /*gamma=*/0.99f, /*lambda=*/0.8f,
/*kappa_policy=*/2.0f, /*kappa_value=*/2.0f);
// Build actor/critic networks
nn::Sequential actor_backbone;
actor_backbone.add(nn::Linear(obs_dim, hidden));
actor_backbone.add(nn::ReLU());
nn::Sequential mu_head;
mu_head.add(nn::Linear(hidden, act_dim));
nn::Sequential std_head;
std_head.add(nn::Linear(hidden, act_dim));
std_head.add(nn::Softplus());
nn::Sequential critic;
critic.add(nn::Linear(obs_dim, hidden));
critic.add(nn::ReLU());
critic.add(nn::Linear(hidden, 1));
agent.set_model(actor_backbone, mu_head, std_head, critic);
// Training step
ag::Matrix norm_s = agent.normalize_observation(raw_state);
ag::Tensor s(norm_s, false);
ag::Tensor action = agent.sample_action(s); // Samples from N(μ, σ)
ag::Float scaled_r = agent.scale_reward(raw_reward, done);
ag::Matrix norm_sn = agent.normalize_observation(raw_next_state);
ag::Tensor sn(norm_sn, false);
ag::Tensor r(ag::Matrix::Constant(1, 1, scaled_r), false);
agent.update(s, action, r, sn, done);
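Conceptually, `sample_action` draws each action dimension from N(μᵢ, σᵢ) using the actor's mean and std heads. The standalone sketch below illustrates just that sampling step; it is not the agent's internal code:

```cpp
#include <random>
#include <vector>

// Draw one action from a diagonal Gaussian policy given per-dimension
// means and standard deviations (illustration only).
std::vector<float> sample_gaussian_action(const std::vector<float>& mu,
                                          const std::vector<float>& sigma,
                                          std::mt19937& rng) {
    std::vector<float> action(mu.size());
    for (size_t i = 0; i < mu.size(); ++i) {
        std::normal_distribution<float> dist(mu[i], sigma[i]);
        action[i] = dist(rng);
    }
    return action;
}
```

The Softplus on the std head guarantees σᵢ > 0, which is required for the distribution above to be well defined.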
Discrete Actions (Categorical policy):
#include "stream_x/stream_ac_discrete.h"
int obs_dim = 11;
int n_actions = 4;
int hidden = 128;
// Constructor: (obs_dim, n_actions, lr, gamma, lambda, kappa_policy, kappa_value)
DiscreteStreamAC agent(obs_dim, n_actions, 1.0f, 0.99f, 0.8f, 2.0f, 2.0f);
nn::Sequential actor;
actor.add(nn::Linear(obs_dim, hidden));
actor.add(nn::ReLU());
actor.add(nn::Linear(hidden, n_actions));
actor.add(nn::Softmax());
nn::Sequential critic;
critic.add(nn::Linear(obs_dim, hidden));
critic.add(nn::ReLU());
critic.add(nn::Linear(hidden, 1));
agent.set_model(actor, critic);
ag::Tensor action_idx = agent.sample_action(s); // Returns action index (scalar tensor)
StreamQ (Q-Learning)
Value-based learning with epsilon-greedy exploration.
#include "stream_x/stream_q.h"
// Build Q-network
nn::Sequential qnet;
qnet.add(nn::Linear(11, 64));
qnet.add(nn::ReLU());
qnet.add(nn::Linear(64, 4)); // 4 actions
// Create agent: (obs_dim, n_actions, lr, gamma, lambda, kappa)
StreamQ agent(11, 4, 1.0f, 0.99f, 0.8f, 2.0f);
agent.set_model(qnet);
// Training loop
ag::Matrix norm_s = agent.normalize_observation(raw_state);
ag::Tensor s(norm_s, false);
auto [action, is_nongreedy] = agent.sample_action(s); // Epsilon-greedy
ag::Float scaled_r = agent.scale_reward(raw_reward, done);
ag::Matrix norm_sn = agent.normalize_observation(raw_next_state);
ag::Tensor sn(norm_sn, false);
agent.update(s, action, scaled_r, sn, done, is_nongreedy);
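The epsilon-greedy rule itself is simple: with probability ε pick a uniformly random action, otherwise take the argmax of the Q-values. A standalone sketch of the rule (the agent's `sample_action` implements its own variant internally):

```cpp
#include <algorithm>
#include <random>
#include <utility>
#include <vector>

// Returns {action, is_nongreedy}: the chosen action index and whether it
// was an exploratory (non-greedy) pick (illustration only).
std::pair<int, bool> epsilon_greedy(const std::vector<float>& q,
                                    float epsilon, std::mt19937& rng) {
    int greedy = static_cast<int>(
        std::max_element(q.begin(), q.end()) - q.begin());
    std::uniform_real_distribution<float> coin(0.0f, 1.0f);
    if (coin(rng) < epsilon) {
        std::uniform_int_distribution<int> pick(
            0, static_cast<int>(q.size()) - 1);
        int a = pick(rng);
        return {a, a != greedy};
    }
    return {greedy, false};
}
```

The `is_nongreedy` flag matters for the update: exploratory actions are handled differently when decaying eligibility traces, which is why `update()` takes it as an argument.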
StreamSARSA
On-policy TD learning that bootstraps from the next action actually taken by the policy, rather than from the greedy action.
#include "stream_x/stream_sarsa.h"
StreamSARSA agent(obs_dim, n_actions, lr, gamma, lambda, kappa);
// Build and attach Q-network
nn::Sequential qnet;
qnet.add(nn::Linear(obs_dim, 64));
qnet.add(nn::ReLU());
qnet.add(nn::Linear(64, n_actions));
agent.set_model(qnet);
// SARSA update uses the actual next action
ag::Matrix norm_s = agent.normalize_observation(raw_state);
ag::Tensor s(norm_s, false);
int act = agent.sample_action(s);
ag::Matrix norm_sn = agent.normalize_observation(raw_next_state);
ag::Tensor sn(norm_sn, false);
int next_act = agent.sample_action(sn);
ag::Float scaled_r = agent.scale_reward(raw_reward, done);
agent.update(s, act, scaled_r, sn, next_act, done);
ObGD Optimizer
Overshooting-bounded Gradient Descent for stable online RL updates.
// Constructor: lr, gamma, lambda, kappa
ObGD optimizer(1.0f, 0.99f, 0.8f, 2.0f);
optimizer.add_parameters(model.layers());
// In training loop:
optimizer.zero_grad();
// ... forward pass and loss computation ...
optimizer.step(td_error, episode_done); // Uses TD error for eligibility traces
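Conceptually, ObGD bounds the effective step size so that a single update cannot overshoot the TD target. The standalone sketch below illustrates one plausible form of that bound for a flat parameter vector; it is a simplification, and the library's actual implementation may differ in details:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// One overshooting-bounded update for parameters w with eligibility traces z
// (simplified sketch): decay-and-accumulate the traces, then shrink the
// learning rate whenever lr * kappa * max(|delta|, 1) * ||z||_1 exceeds 1.
void obgd_step(std::vector<float>& w, std::vector<float>& z,
               const std::vector<float>& grad, float td_error,
               float lr, float gamma, float lambda, float kappa) {
    float z_l1 = 0.0f;
    for (size_t i = 0; i < z.size(); ++i) {
        z[i] = gamma * lambda * z[i] + grad[i];  // eligibility trace update
        z_l1 += std::fabs(z[i]);
    }
    float delta_bar = std::max(std::fabs(td_error), 1.0f);
    float bound = lr * kappa * delta_bar * z_l1;   // would this step overshoot?
    float step = (bound > 1.0f) ? lr / bound : lr; // shrink lr if so
    for (size_t i = 0; i < w.size(); ++i)
        w[i] += step * td_error * z[i];
}
```

The key property is that the bound adapts per step: large TD errors or large trace magnitudes automatically shrink the step, which is why learning rates as high as 1.0 remain usable.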
| Parameter | Description | Typical Value |
|---|---|---|
| lr | Learning rate | 0.1 – 1.0 |
| gamma | Discount factor | 0.99 |
| lambda | Eligibility trace decay | 0.8 |
| kappa | Overshooting bound | 1.0 – 3.0 |
See Optimizers for more details.
Utilities
Observation Normalization
Maintains running mean/variance for stable inputs:
// Built into agents:
ag::Matrix norm_obs = agent.normalize_observation(raw_obs);
// Or standalone:
NormalizeObservation normalizer(obs_rows, obs_cols);
ag::Matrix norm_obs = normalizer.normalize(raw_obs);
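A running normalizer of this kind can be sketched with Welford's streaming mean/variance algorithm. The class below is an illustration of the idea, not the library's NormalizeObservation implementation:

```cpp
#include <cmath>
#include <vector>

// Streaming per-feature normalizer: tracks running mean/variance with
// Welford's algorithm and returns (x - mean) / (std + eps).
class RunningNormalizer {
public:
    explicit RunningNormalizer(size_t dim)
        : count_(0), mean_(dim, 0.0f), m2_(dim, 0.0f) {}

    std::vector<float> normalize(const std::vector<float>& x) {
        ++count_;
        std::vector<float> out(x.size());
        for (size_t i = 0; i < x.size(); ++i) {
            float d = x[i] - mean_[i];
            mean_[i] += d / static_cast<float>(count_);
            m2_[i] += d * (x[i] - mean_[i]);
            float var = (count_ > 1)
                ? m2_[i] / static_cast<float>(count_ - 1) : 1.0f;
            out[i] = (x[i] - mean_[i]) / (std::sqrt(var) + 1e-8f);
        }
        return out;
    }

private:
    long count_;
    std::vector<float> mean_, m2_;
};
```

Because the statistics update on every call, the same raw observation normalizes to slightly different values over time; this is expected and stabilizes as the running estimates converge.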
Reward Scaling
Adaptive reward scaling keeps reward magnitudes in a consistent range, which helps prevent exploding gradients:
// Built into agents:
ag::Float scaled_r = agent.scale_reward(raw_reward, done);
// Or standalone:
ScaleReward scaler;
ag::Float scaled_r = scaler.scale(raw_reward, done);
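One common recipe in streaming RL divides each reward by the running standard deviation of a discounted return trace. The sketch below illustrates that idea; the library's ScaleReward may differ in details:

```cpp
#include <cmath>

// Adaptive reward scaler (sketch): tracks a discounted return trace and
// divides each reward by the running standard deviation of that trace.
class RewardScaler {
public:
    explicit RewardScaler(float gamma = 0.99f)
        : gamma_(gamma), trace_(0.0f), count_(0), mean_(0.0f), m2_(0.0f) {}

    float scale(float r, bool done) {
        // Discounted return trace, reset at episode boundaries
        trace_ = gamma_ * trace_ * (done ? 0.0f : 1.0f) + r;
        // Welford update of the trace's running variance
        ++count_;
        float d = trace_ - mean_;
        mean_ += d / static_cast<float>(count_);
        m2_ += d * (trace_ - mean_);
        float var = (count_ > 1)
            ? m2_ / static_cast<float>(count_ - 1) : 1.0f;
        return r / (std::sqrt(var) + 1e-8f);
    }

private:
    float gamma_, trace_;
    long count_;
    float mean_, m2_;
};
```

Scaling by the return trace (rather than the raw reward) accounts for discounting, so the magnitude of TD errors stays bounded regardless of the environment's reward scale.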
Complete Example
#include "stream_x/stream_ac_continuous.h"
int main() {
int obs_dim = 11;
int act_dim = 3;
int hidden = 128;
ContinuousStreamAC agent(obs_dim, 1.0f, 0.99f, 0.8f, 2.0f, 2.0f);
nn::Sequential actor_backbone;
actor_backbone.add(nn::Linear(obs_dim, hidden));
actor_backbone.add(nn::ReLU());
nn::Sequential mu_head;
mu_head.add(nn::Linear(hidden, act_dim));
nn::Sequential std_head;
std_head.add(nn::Linear(hidden, act_dim));
std_head.add(nn::Softplus());
nn::Sequential critic;
critic.add(nn::Linear(obs_dim, hidden));
critic.add(nn::ReLU());
critic.add(nn::Linear(hidden, 1));
agent.set_model(actor_backbone, mu_head, std_head, critic);
// Environment interaction loop (env is any user-supplied environment
// exposing reset() and step(); it is not part of the library)
ag::Matrix state = env.reset();
for (int step = 0; step < 100000; ++step) {
// Normalize observation
ag::Matrix norm_s = agent.normalize_observation(state);
ag::Tensor s(norm_s, false);
// Sample action from policy
ag::Tensor action = agent.sample_action(s);
// Execute in environment
auto [next_state, reward, done] = env.step(action.value());
// Scale reward and normalize next state
ag::Float scaled_r = agent.scale_reward(reward, done);
ag::Matrix norm_ns = agent.normalize_observation(next_state);
ag::Tensor ns(norm_ns, false);
ag::Tensor r(ag::Matrix::Constant(1, 1, scaled_r), false);
// Update agent (computes gradients and applies ObGD)
agent.update(s, action, r, ns, done);
// Handle episode boundary
state = done ? env.reset() : next_state;
}
return 0;
}
Best Practices
| Practice | Recommendation |
|---|---|
| Normalization | Always normalize observations and scale rewards |
| Learning Rate | Start with 0.1–1.0 for ObGD |
| Discount Factor | Use 0.95–0.99 for most tasks |
| Memory | update() clears internal loss graphs; call clear_graph() in custom loops |
Hyperparameter Guidelines
// Conservative starting point (model provided via set_model)
ContinuousStreamAC agent(
obs_dim, // From environment
1.0f, // Learning rate (0.1-1.0)
0.99f, // Gamma (0.95-0.99)
0.8f, // Lambda (0.7-0.9)
2.0f, // Kappa policy (1.0-3.0)
2.0f // Kappa value (1.0-3.0)
);
// Use a hidden size in the 64–256 range when building actor/critic networks
// and set action dimension in mu/std heads.
Troubleshooting
| Issue | Solution |
|---|---|
| Exploding gradients | Reduce learning rate, check reward scaling |
| Poor convergence | Verify observation normalization |
| Memory growth | Ensure clear_graph() is called (automatic in update()) |
| Unstable training | Increase kappa or reduce learning rate |
See Also
- Optimizers — ObGD details and custom optimizers
- Neural Networks — Building network architectures
- ESP32 Guide — Running Stream-X on embedded devices
- Examples — Complete training examples