Reinforcement Learning (Stream-X Module)

The Stream-X module provides streaming reinforcement learning algorithms built on the Autograd core. These algorithms learn online from single transitions without replay buffers.

Overview

| Algorithm | Description | Actions |
|---|---|---|
| StreamAC | Actor-Critic with eligibility traces | Continuous / Discrete |
| StreamQ | Online Q-learning | Discrete |
| StreamSARSA | On-policy SARSA | Discrete |

| Component | Description |
|---|---|
| ObGD | Overshooting-bounded optimizer for RL |
| Normalizers | Running-stats observation normalization |
| Reward Scaling | Adaptive reward scaling |

Building Stream-X

./install.sh --with-stream-x
# Or via CMake:
cmake .. -DAUTOGRAD_BUILD_STREAM_X=ON

Sources are in examples/stream_x/src/ and are only built when explicitly enabled.


Algorithms

StreamAC (Actor-Critic)

Policy gradient with learned value baseline. Available in continuous and discrete variants.

Continuous Actions (Gaussian policy):

#include "stream_x/stream_ac_continuous.h"

int obs_dim = 11;
int act_dim = 3;
int hidden = 128;

// Construct agent (model is provided via set_model)
ContinuousStreamAC agent(obs_dim, /*lr=*/1.0f, /*gamma=*/0.99f, /*lambda=*/0.8f,
                         /*kappa_policy=*/2.0f, /*kappa_value=*/2.0f);

// Build actor/critic networks
nn::Sequential actor_backbone;
actor_backbone.add(nn::Linear(obs_dim, hidden));
actor_backbone.add(nn::ReLU());

nn::Sequential mu_head;
mu_head.add(nn::Linear(hidden, act_dim));

nn::Sequential std_head;
std_head.add(nn::Linear(hidden, act_dim));
std_head.add(nn::Softplus());

nn::Sequential critic;
critic.add(nn::Linear(obs_dim, hidden));
critic.add(nn::ReLU());
critic.add(nn::Linear(hidden, 1));

agent.set_model(actor_backbone, mu_head, std_head, critic);

// Training step
ag::Matrix norm_s = agent.normalize_observation(raw_state);
ag::Tensor s(norm_s, false);
ag::Tensor action = agent.sample_action(s);           // Samples from N(μ, σ)
ag::Float scaled_r = agent.scale_reward(raw_reward, done);
ag::Matrix norm_sn = agent.normalize_observation(raw_next_state);
ag::Tensor sn(norm_sn, false);
ag::Tensor r(ag::Matrix::Constant(1, 1, scaled_r), false);
agent.update(s, action, r, sn, done);

Discrete Actions (Categorical policy):

#include "stream_x/stream_ac_discrete.h"

int obs_dim = 11;
int n_actions = 4;
int hidden = 128;

DiscreteStreamAC agent(obs_dim, n_actions, /*lr=*/1.0f, /*gamma=*/0.99f,
                       /*lambda=*/0.8f, /*kappa_policy=*/2.0f, /*kappa_value=*/2.0f);

nn::Sequential actor;
actor.add(nn::Linear(obs_dim, hidden));
actor.add(nn::ReLU());
actor.add(nn::Linear(hidden, n_actions));
actor.add(nn::Softmax());

nn::Sequential critic;
critic.add(nn::Linear(obs_dim, hidden));
critic.add(nn::ReLU());
critic.add(nn::Linear(hidden, 1));

agent.set_model(actor, critic);

ag::Tensor action_idx = agent.sample_action(s);  // Returns action index (scalar tensor)


StreamQ (Q-Learning)

Value-based learning with epsilon-greedy exploration.

#include "stream_x/stream_q.h"

// Build Q-network
nn::Sequential qnet;
qnet.add(nn::Linear(11, 64));
qnet.add(nn::ReLU());
qnet.add(nn::Linear(64, 4));  // 4 actions

// Create agent: (obs_dim, n_actions, lr, gamma, lambda, kappa)
StreamQ agent(11, 4, 1.0f, 0.99f, 0.8f, 2.0f);
agent.set_model(qnet);

// Training loop
ag::Matrix norm_s = agent.normalize_observation(raw_state);
ag::Tensor s(norm_s, false);
auto [action, is_nongreedy] = agent.sample_action(s);  // Epsilon-greedy
ag::Float scaled_r = agent.scale_reward(raw_reward, done);
ag::Matrix norm_sn = agent.normalize_observation(raw_next_state);
ag::Tensor sn(norm_sn, false);
agent.update(s, action, scaled_r, sn, done, is_nongreedy);
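For intuition, the epsilon-greedy rule behind sample_action reduces to the sketch below. This is a hypothetical standalone illustration, not the library's implementation; the is_nongreedy flag matters because Watkins-style Q(lambda) cuts eligibility traces after exploratory actions.

```cpp
#include <cassert>
#include <random>
#include <utility>
#include <vector>

// Illustrative epsilon-greedy selection over a vector of Q-values.
// Returns {action, is_nongreedy}; the flag is true only when the random
// pick differs from the greedy action.
std::pair<int, bool> epsilon_greedy(const std::vector<float>& q_values,
                                    float epsilon, std::mt19937& rng) {
    int greedy = 0;
    for (size_t a = 1; a < q_values.size(); ++a)
        if (q_values[a] > q_values[greedy]) greedy = static_cast<int>(a);

    std::uniform_real_distribution<float> coin(0.0f, 1.0f);
    if (coin(rng) < epsilon) {
        std::uniform_int_distribution<int> pick(0, static_cast<int>(q_values.size()) - 1);
        int a = pick(rng);
        return {a, a != greedy};  // a random pick may still coincide with greedy
    }
    return {greedy, false};
}
```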

StreamSARSA

On-policy TD learning using actual actions taken.

#include "stream_x/stream_sarsa.h"

StreamSARSA agent(obs_dim, n_actions, lr, gamma, lambda, kappa);

// Build and attach Q-network
nn::Sequential qnet;
qnet.add(nn::Linear(obs_dim, 64));
qnet.add(nn::ReLU());
qnet.add(nn::Linear(64, n_actions));
agent.set_model(qnet);

// SARSA update uses the actual next action
ag::Matrix norm_s = agent.normalize_observation(raw_state);
ag::Tensor s(norm_s, false);
int act = agent.sample_action(s);

ag::Matrix norm_sn = agent.normalize_observation(raw_next_state);
ag::Tensor sn(norm_sn, false);
int next_act = agent.sample_action(sn);

ag::Float scaled_r = agent.scale_reward(raw_reward, done);
agent.update(s, act, scaled_r, sn, next_act, done);

ObGD Optimizer

Overshooting-bounded Gradient Descent for stable online RL updates.

// Constructor: lr, gamma, lambda, kappa
ObGD optimizer(1.0f, 0.99f, 0.8f, 2.0f);
optimizer.add_parameters(model.layers());

// In training loop:
optimizer.zero_grad();
// ... forward pass and loss computation ...
optimizer.step(td_error, episode_done);  // Uses TD error for eligibility traces

| Parameter | Description | Typical Value |
|---|---|---|
| lr | Learning rate | 0.1 – 1.0 |
| gamma | Discount factor | 0.99 |
| lambda | Eligibility trace decay | 0.8 |
| kappa | Overshooting bound | 1.0 – 3.0 |
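
The bounding rule itself is small. The sketch below illustrates the idea for a single parameter vector, assuming the step-size scaling described in the streaming-RL literature (shrink the raw step alpha * delta * z whenever its overshoot estimate exceeds 1); it is an illustration, not the library's exact implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative ObGD-style bounded update for one parameter vector.
// z: eligibility trace, delta: TD error; alpha/gamma/lambda/kappa as in the table.
// The effective step size is alpha / max(1, alpha * kappa * max(|delta|, 1) * ||z||_1),
// so a single update cannot overshoot its target by more than a bounded amount.
void obgd_step(std::vector<float>& w, std::vector<float>& z,
               const std::vector<float>& grad_v,
               float delta, bool done,
               float alpha, float gamma, float lambda, float kappa) {
    float z_l1 = 0.0f;
    for (size_t i = 0; i < w.size(); ++i) {
        z[i] = gamma * lambda * z[i] + grad_v[i];  // accumulate the trace
        z_l1 += std::fabs(z[i]);
    }
    float delta_bar = std::max(std::fabs(delta), 1.0f);
    float overshoot = alpha * kappa * delta_bar * z_l1;
    float eff_alpha = alpha / std::max(overshoot, 1.0f);
    for (size_t i = 0; i < w.size(); ++i)
        w[i] += eff_alpha * delta * z[i];
    if (done)                                      // reset traces at episode end
        std::fill(z.begin(), z.end(), 0.0f);
}
```

Note how a large TD error (or a large trace norm) automatically shrinks the effective learning rate, which is why ObGD tolerates the unusually large base rates (0.1 – 1.0) in the table above.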

See Optimizers for more details.


Utilities

Observation Normalization

Maintains running mean/variance for stable inputs:

// Built into agents:
ag::Matrix norm_obs = agent.normalize_observation(raw_obs);

// Or standalone:
NormalizeObservation normalizer(obs_rows, obs_cols);
ag::Matrix norm_obs = normalizer.normalize(raw_obs);
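
If you need to replicate the normalizer outside the library, a minimal per-element running mean/variance tracker (Welford's online algorithm) looks roughly like this. The class name and API here are hypothetical, not the library's:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical standalone running normalizer using Welford's online algorithm.
// NormalizeObservation maintains equivalent per-element statistics.
class RunningNormalizer {
public:
    explicit RunningNormalizer(size_t dim) : n_(0), mean_(dim, 0.0), m2_(dim, 0.0) {}

    std::vector<double> normalize(const std::vector<double>& x) {
        ++n_;
        std::vector<double> out(x.size());
        for (size_t i = 0; i < x.size(); ++i) {
            double d = x[i] - mean_[i];
            mean_[i] += d / n_;                  // update running mean
            m2_[i] += d * (x[i] - mean_[i]);     // update sum of squared deviations
            double var = (n_ > 1) ? m2_[i] / (n_ - 1) : 1.0;
            out[i] = (x[i] - mean_[i]) / std::sqrt(var + 1e-8);
        }
        return out;
    }

private:
    long n_;
    std::vector<double> mean_, m2_;
};
```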

Reward Scaling

Adaptive reward scaling prevents gradient explosion:

// Built into agents:
ag::Float scaled_r = agent.scale_reward(raw_reward, done);

// Or standalone:
ScaleReward scaler;
ag::Float scaled_r = scaler.scale(raw_reward, done);
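
The scaling idea, as described in the streaming-RL literature, is to track a discounted return trace and divide each reward by the trace's running standard deviation. A hypothetical standalone version (not the library's ScaleReward):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical reward scaler: divide rewards by the running std of a
// discounted return trace so gradient magnitudes stay comparable across tasks.
class RewardScaler {
public:
    explicit RewardScaler(double gamma = 0.99) : gamma_(gamma) {}

    double scale(double reward, bool done) {
        trace_ = gamma_ * trace_ + reward;       // discounted return trace
        ++n_;
        double d = trace_ - mean_;               // Welford update on the trace
        mean_ += d / n_;
        m2_ += d * (trace_ - mean_);
        double var = (n_ > 1) ? m2_ / (n_ - 1) : 1.0;
        if (done) trace_ = 0.0;                  // reset trace at episode end
        return reward / std::sqrt(var + 1e-8);
    }

private:
    double gamma_, trace_ = 0.0, mean_ = 0.0, m2_ = 0.0;
    long n_ = 0;
};
```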

Complete Example

#include "stream_x/stream_ac_continuous.h"

int main() {
    int obs_dim = 11;
    int act_dim = 3;
    int hidden = 128;

    ContinuousStreamAC agent(obs_dim, 1.0f, 0.99f, 0.8f, 2.0f, 2.0f);

    nn::Sequential actor_backbone;
    actor_backbone.add(nn::Linear(obs_dim, hidden));
    actor_backbone.add(nn::ReLU());

    nn::Sequential mu_head;
    mu_head.add(nn::Linear(hidden, act_dim));

    nn::Sequential std_head;
    std_head.add(nn::Linear(hidden, act_dim));
    std_head.add(nn::Softplus());

    nn::Sequential critic;
    critic.add(nn::Linear(obs_dim, hidden));
    critic.add(nn::ReLU());
    critic.add(nn::Linear(hidden, 1));

    agent.set_model(actor_backbone, mu_head, std_head, critic);

    // Environment interaction loop (env: any environment exposing reset()/step())
    ag::Matrix state = env.reset();

    for (int step = 0; step < 100000; ++step) {
        // Normalize observation
        ag::Matrix norm_s = agent.normalize_observation(state);
        ag::Tensor s(norm_s, false);

        // Sample action from policy
        ag::Tensor action = agent.sample_action(s);

        // Execute in environment
        auto [next_state, reward, done] = env.step(action.value());

        // Scale reward and normalize next state
        ag::Float scaled_r = agent.scale_reward(reward, done);
        ag::Matrix norm_ns = agent.normalize_observation(next_state);
        ag::Tensor ns(norm_ns, false);
        ag::Tensor r(ag::Matrix::Constant(1, 1, scaled_r), false);

        // Update agent (computes gradients and applies ObGD)
        agent.update(s, action, r, ns, done);

        // Handle episode boundary
        state = done ? env.reset() : next_state;
    }

    return 0;
}

Best Practices

| Practice | Recommendation |
|---|---|
| Normalization | Always normalize observations and scale rewards |
| Learning Rate | Start with 0.1 – 1.0 for ObGD |
| Discount Factor | Use 0.95 – 0.99 for most tasks |
| Memory | update() clears internal loss graphs; call clear_graph() in custom loops |

Hyperparameter Guidelines

// Conservative starting point (model provided via set_model)
ContinuousStreamAC agent(
    obs_dim,           // From environment
    1.0f,              // Learning rate (0.1-1.0)
    0.99f,             // Gamma (0.95-0.99)
    0.8f,              // Lambda (0.7-0.9)
    2.0f,              // Kappa policy (1.0-3.0)
    2.0f               // Kappa value (1.0-3.0)
);

// Use a hidden size in the 64–256 range when building actor/critic networks
// and set action dimension in mu/std heads.

Troubleshooting

| Issue | Solution |
|---|---|
| Exploding gradients | Reduce learning rate, check reward scaling |
| Poor convergence | Verify observation normalization |
| Memory growth | Ensure clear_graph() is called (automatic in update()) |
| Unstable training | Increase kappa or reduce learning rate |

See Also