Reinforcement Learning (Stream-X Module)

The Stream-X module provides streaming reinforcement learning algorithms built on the Autograd core. These algorithms learn online from single transitions without replay buffers.

Overview

| Algorithm | Description | Actions |
|---|---|---|
| StreamAC | Actor-Critic with eligibility traces | Continuous / Discrete |
| StreamQ | Online Q-learning | Discrete |
| StreamSARSA | On-policy SARSA | Discrete |

| Component | Description |
|---|---|
| ObGD | Overshooting-bounded optimizer for RL |
| Normalizers | Running-stats observation normalization |
| Reward Scaling | Adaptive reward scaling |

Building Stream-X

./install.sh --with-stream-x
# Or via CMake:
cmake .. -DAUTOGRAD_BUILD_STREAM_X=ON

Sources are in examples/stream_x/src/ and are only built when explicitly enabled.


Algorithms

StreamAC (Actor-Critic)

Policy gradient with learned value baseline. Available in continuous and discrete variants.

Continuous Actions (Gaussian policy):

#include "stream_x/stream_ac_continuous.h"

int obs_dim = 11;
int act_dim = 3;
int hidden = 128;

// Construct agent (model is provided via set_model)
ContinuousStreamAC agent(obs_dim, /*lr=*/1.0f, /*gamma=*/0.99f, /*lambda=*/0.8f,
                         /*kappa_policy=*/2.0f, /*kappa_value=*/2.0f);

// Build actor/critic networks
nn::Sequential actor_backbone;
actor_backbone.add(nn::Linear(obs_dim, hidden));
actor_backbone.add(nn::ReLU());

nn::Sequential mu_head;
mu_head.add(nn::Linear(hidden, act_dim));

nn::Sequential std_head;
std_head.add(nn::Linear(hidden, act_dim));
std_head.add(nn::Softplus());

nn::Sequential critic;
critic.add(nn::Linear(obs_dim, hidden));
critic.add(nn::ReLU());
critic.add(nn::Linear(hidden, 1));

agent.set_model(actor_backbone, mu_head, std_head, critic);

// Training step
ag::Matrix norm_s = agent.normalize_observation(raw_state);
ag::Tensor s(norm_s, false);
ag::Tensor action = agent.sample_action(s);           // Samples from N(μ, σ)
ag::Float scaled_r = agent.scale_reward(raw_reward, done);
ag::Matrix norm_sn = agent.normalize_observation(raw_next_state);
ag::Tensor sn(norm_sn, false);
ag::Tensor r(ag::Matrix::Constant(1, 1, scaled_r), false);
agent.update(s, action, r, sn, done);

Discrete Actions (Categorical policy):

#include "stream_x/stream_ac_discrete.h"

int obs_dim = 11;
int n_actions = 4;
int hidden = 128;

DiscreteStreamAC agent(obs_dim, n_actions, /*lr=*/1.0f, /*gamma=*/0.99f,
                       /*lambda=*/0.8f, /*kappa_policy=*/2.0f, /*kappa_value=*/2.0f);

nn::Sequential actor;
actor.add(nn::Linear(obs_dim, hidden));
actor.add(nn::ReLU());
actor.add(nn::Linear(hidden, n_actions));
actor.add(nn::Softmax());

nn::Sequential critic;
critic.add(nn::Linear(obs_dim, hidden));
critic.add(nn::ReLU());
critic.add(nn::Linear(hidden, 1));

agent.set_model(actor, critic);

ag::Tensor action_idx = agent.sample_action(s);  // Returns action index (scalar tensor)


StreamQ (Q-Learning)

Value-based learning with epsilon-greedy exploration.

#include "stream_x/stream_q.h"

// Build Q-network
nn::Sequential qnet;
qnet.add(nn::Linear(11, 64));
qnet.add(nn::ReLU());
qnet.add(nn::Linear(64, 4));  // 4 actions

// Create agent: (obs_dim, n_actions, lr, gamma, lambda, kappa)
StreamQ agent(11, 4, 1.0f, 0.99f, 0.8f, 2.0f);
agent.set_model(qnet);

// Training loop
ag::Matrix norm_s = agent.normalize_observation(raw_state);
ag::Tensor s(norm_s, false);
auto [action, is_nongreedy] = agent.sample_action(s);  // Epsilon-greedy
ag::Float scaled_r = agent.scale_reward(raw_reward, done);
ag::Matrix norm_sn = agent.normalize_observation(raw_next_state);
ag::Tensor sn(norm_sn, false);
agent.update(s, action, scaled_r, sn, done, is_nongreedy);
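For intuition, the epsilon-greedy rule behind sample_action reduces to the sketch below. This is a hypothetical standalone illustration, not the library's implementation; the is_nongreedy flag matters because Watkins-style Q(lambda) cuts eligibility traces after exploratory actions.

```cpp
#include <cassert>
#include <random>
#include <utility>
#include <vector>

// Illustrative epsilon-greedy selection over a vector of Q-values.
// Returns {action, is_nongreedy}; the flag is true only when the random
// pick differs from the greedy action.
std::pair<int, bool> epsilon_greedy(const std::vector<float>& q_values,
                                    float epsilon, std::mt19937& rng) {
    int greedy = 0;
    for (size_t a = 1; a < q_values.size(); ++a)
        if (q_values[a] > q_values[greedy]) greedy = static_cast<int>(a);

    std::uniform_real_distribution<float> coin(0.0f, 1.0f);
    if (coin(rng) < epsilon) {
        std::uniform_int_distribution<int> pick(0, static_cast<int>(q_values.size()) - 1);
        int a = pick(rng);
        return {a, a != greedy};  // a random pick may still coincide with greedy
    }
    return {greedy, false};
}
```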

StreamSARSA

On-policy TD learning using actual actions taken.

#include "stream_x/stream_sarsa.h"

StreamSARSA agent(obs_dim, n_actions, lr, gamma, lambda, kappa);

// Build and attach Q-network
nn::Sequential qnet;
qnet.add(nn::Linear(obs_dim, 64));
qnet.add(nn::ReLU());
qnet.add(nn::Linear(64, n_actions));
agent.set_model(qnet);

// SARSA update uses the actual next action
ag::Matrix norm_s = agent.normalize_observation(raw_state);
ag::Tensor s(norm_s, false);
int act = agent.sample_action(s);

ag::Matrix norm_sn = agent.normalize_observation(raw_next_state);
ag::Tensor sn(norm_sn, false);
int next_act = agent.sample_action(sn);

ag::Float scaled_r = agent.scale_reward(raw_reward, done);
agent.update(s, act, scaled_r, sn, next_act, done);

ObGD Optimizer

Overshooting-bounded Gradient Descent for stable online RL updates.

// Constructor: lr, gamma, lambda, kappa
ObGD optimizer(1.0f, 0.99f, 0.8f, 2.0f);
optimizer.add_parameters(model.layers());

// In training loop:
optimizer.zero_grad();
// ... forward pass and loss computation ...
optimizer.step(td_error, episode_done);  // Uses TD error for eligibility traces

| Parameter | Description | Typical Value |
|---|---|---|
| lr | Learning rate | 0.1 – 1.0 |
| gamma | Discount factor | 0.99 |
| lambda | Eligibility trace decay | 0.8 |
| kappa | Overshooting bound | 1.0 – 3.0 |
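
The bounding rule itself is small. The sketch below illustrates the idea for a single parameter vector, assuming the step-size scaling described in the streaming-RL literature (shrink the raw step alpha * delta * z whenever its overshoot estimate exceeds 1); it is an illustration, not the library's exact implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative ObGD-style bounded update for one parameter vector.
// z: eligibility trace, delta: TD error; alpha/gamma/lambda/kappa as in the table.
// The effective step size is alpha / max(1, alpha * kappa * max(|delta|, 1) * ||z||_1),
// so a single update cannot overshoot its target by more than a bounded amount.
void obgd_step(std::vector<float>& w, std::vector<float>& z,
               const std::vector<float>& grad_v,
               float delta, bool done,
               float alpha, float gamma, float lambda, float kappa) {
    float z_l1 = 0.0f;
    for (size_t i = 0; i < w.size(); ++i) {
        z[i] = gamma * lambda * z[i] + grad_v[i];  // accumulate the trace
        z_l1 += std::fabs(z[i]);
    }
    float delta_bar = std::max(std::fabs(delta), 1.0f);
    float overshoot = alpha * kappa * delta_bar * z_l1;
    float eff_alpha = alpha / std::max(overshoot, 1.0f);
    for (size_t i = 0; i < w.size(); ++i)
        w[i] += eff_alpha * delta * z[i];
    if (done)                                      // reset traces at episode end
        std::fill(z.begin(), z.end(), 0.0f);
}
```

Note how a large TD error (or a large trace norm) automatically shrinks the effective learning rate, which is why ObGD tolerates the unusually large base rates (0.1 – 1.0) in the table above.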

See Optimizers for more details.


Utilities

Observation Normalization

Maintains running mean/variance for stable inputs:

// Built into agents:
ag::Matrix norm_obs = agent.normalize_observation(raw_obs);

// Or standalone:
NormalizeObservation normalizer(obs_rows, obs_cols);
ag::Matrix norm_obs = normalizer.normalize(raw_obs);
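
If you need to replicate the normalizer outside the library, a minimal per-element running mean/variance tracker (Welford's online algorithm) looks roughly like this. The class name and API here are hypothetical, not the library's:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical standalone running normalizer using Welford's online algorithm.
// NormalizeObservation maintains equivalent per-element statistics.
class RunningNormalizer {
public:
    explicit RunningNormalizer(size_t dim) : n_(0), mean_(dim, 0.0), m2_(dim, 0.0) {}

    std::vector<double> normalize(const std::vector<double>& x) {
        ++n_;
        std::vector<double> out(x.size());
        for (size_t i = 0; i < x.size(); ++i) {
            double d = x[i] - mean_[i];
            mean_[i] += d / n_;                  // update running mean
            m2_[i] += d * (x[i] - mean_[i]);     // update sum of squared deviations
            double var = (n_ > 1) ? m2_[i] / (n_ - 1) : 1.0;
            out[i] = (x[i] - mean_[i]) / std::sqrt(var + 1e-8);
        }
        return out;
    }

private:
    long n_;
    std::vector<double> mean_, m2_;
};
```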

Reward Scaling

Adaptive reward scaling prevents gradient explosion:

// Built into agents:
ag::Float scaled_r = agent.scale_reward(raw_reward, done);

// Or standalone:
ScaleReward scaler;
ag::Float scaled_r = scaler.scale(raw_reward, done);
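
The scaling idea, as described in the streaming-RL literature, is to track a discounted return trace and divide each reward by the trace's running standard deviation. A hypothetical standalone version (not the library's ScaleReward):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical reward scaler: divide rewards by the running std of a
// discounted return trace so gradient magnitudes stay comparable across tasks.
class RewardScaler {
public:
    explicit RewardScaler(double gamma = 0.99) : gamma_(gamma) {}

    double scale(double reward, bool done) {
        trace_ = gamma_ * trace_ + reward;       // discounted return trace
        ++n_;
        double d = trace_ - mean_;               // Welford update on the trace
        mean_ += d / n_;
        m2_ += d * (trace_ - mean_);
        double var = (n_ > 1) ? m2_ / (n_ - 1) : 1.0;
        if (done) trace_ = 0.0;                  // reset trace at episode end
        return reward / std::sqrt(var + 1e-8);
    }

private:
    double gamma_, trace_ = 0.0, mean_ = 0.0, m2_ = 0.0;
    long n_ = 0;
};
```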

Complete Example

#include "stream_x/stream_ac_continuous.h"

int main() {
    int obs_dim = 11;
    int act_dim = 3;
    int hidden = 128;

    ContinuousStreamAC agent(obs_dim, 1.0f, 0.99f, 0.8f, 2.0f, 2.0f);

    nn::Sequential actor_backbone;
    actor_backbone.add(nn::Linear(obs_dim, hidden));
    actor_backbone.add(nn::ReLU());

    nn::Sequential mu_head;
    mu_head.add(nn::Linear(hidden, act_dim));

    nn::Sequential std_head;
    std_head.add(nn::Linear(hidden, act_dim));
    std_head.add(nn::Softplus());

    nn::Sequential critic;
    critic.add(nn::Linear(obs_dim, hidden));
    critic.add(nn::ReLU());
    critic.add(nn::Linear(hidden, 1));

    agent.set_model(actor_backbone, mu_head, std_head, critic);

    // Environment interaction loop (env: any environment exposing reset()/step())
    ag::Matrix state = env.reset();

    for (int step = 0; step < 100000; ++step) {
        // Normalize observation
        ag::Matrix norm_s = agent.normalize_observation(state);
        ag::Tensor s(norm_s, false);

        // Sample action from policy
        ag::Tensor action = agent.sample_action(s);

        // Execute in environment
        auto [next_state, reward, done] = env.step(action.value());

        // Scale reward and normalize next state
        ag::Float scaled_r = agent.scale_reward(reward, done);
        ag::Matrix norm_ns = agent.normalize_observation(next_state);
        ag::Tensor ns(norm_ns, false);
        ag::Tensor r(ag::Matrix::Constant(1, 1, scaled_r), false);

        // Update agent (computes gradients and applies ObGD)
        agent.update(s, action, r, ns, done);

        // Handle episode boundary
        state = done ? env.reset() : next_state;
    }

    return 0;
}

Best Practices

| Practice | Recommendation |
|---|---|
| Normalization | Always normalize observations and scale rewards |
| Learning Rate | Start with 0.1 – 1.0 for ObGD |
| Discount Factor | Use 0.95 – 0.99 for most tasks |
| Memory | update() clears internal loss graphs; call clear_graph() in custom loops |

Hyperparameter Guidelines

// Conservative starting point (model provided via set_model)
ContinuousStreamAC agent(
    obs_dim,           // From environment
    1.0f,              // Learning rate (0.1-1.0)
    0.99f,             // Gamma (0.95-0.99)
    0.8f,              // Lambda (0.7-0.9)
    2.0f,              // Kappa policy (1.0-3.0)
    2.0f               // Kappa value (1.0-3.0)
);

// Use a hidden size in the 64–256 range when building actor/critic networks
// and set action dimension in mu/std heads.

Troubleshooting

| Issue | Solution |
|---|---|
| Exploding gradients | Reduce learning rate, check reward scaling |
| Poor convergence | Verify observation normalization |
| Memory growth | Ensure clear_graph() is called (automatic in update()) |
| Unstable training | Increase kappa or reduce learning rate |

See Also