Optimizers

Optimizers update neural network parameters using computed gradients. TinyRL provides core optimizers in optimizer.h, plus the specialized ObGD optimizer in the Stream-X module.

Available Optimizers

Optimizer   Module     Description
SGD         Core       Stochastic Gradient Descent
RMSProp     Core       Adaptive learning rate with moving average of squared gradients
ObGD        Stream-X   Overshooting-bounded Gradient Descent for streaming RL

SGD (Stochastic Gradient Descent)

The simplest optimizer: \(\theta \leftarrow \theta - \eta \cdot \nabla_\theta L\), where \(L\) is the loss and \(\eta\) the learning rate.

#include "optimizer.h"

// Create optimizer with learning rate
ag::SGD optimizer(0.01f);

// Register parameters
optimizer.add_parameters(model.layers());

// Training loop
for (int step = 0; step < n_steps; ++step) {
    optimizer.zero_grad();            // Clear previous gradients
    auto loss = compute_loss(model.forward(input), target);
    loss.backward();                  // Compute gradients
    optimizer.step();                 // Update parameters
    loss.clear_graph();               // Free memory
}

RMSProp

Adapts the learning rate per parameter using an exponential moving average of squared gradients. Useful when gradient magnitudes vary widely across parameters.

\[v_t = \alpha \cdot v_{t-1} + (1-\alpha) \cdot g_t^2\]

\[\theta \leftarrow \theta - \frac{\eta}{\sqrt{v_t} + \epsilon} \cdot g_t\]

#include "optimizer.h"

// Create with: learning_rate, alpha (decay), epsilon
ag::RMSProp optimizer(0.001f, 0.99f, 1e-8f);
optimizer.add_parameters(model.layers());

// Same training loop as SGD
for (int step = 0; step < n_steps; ++step) {
    optimizer.zero_grad();
    auto loss = compute_loss(model.forward(input), target);
    loss.backward();
    optimizer.step();
    loss.clear_graph();
}

RMSProp Parameters

Parameter   Type    Default   Description
lr          Float   0.01      Learning rate
alpha       Float   0.99      Decay rate for squared gradient average
epsilon     Float   1e-8      Numerical stability constant
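
To make the update rule concrete, here is a scalar version of the two equations above written with plain floats. It is an illustration of the math only, not the TinyRL API.

#include <cmath>
#include <cstdio>

int main() {
    float lr = 0.001f, alpha = 0.99f, eps = 1e-8f;
    float v = 0.0f;      // running average of squared gradients (v_0)
    float theta = 1.0f;  // the parameter being optimized
    float g = 0.5f;      // gradient at the current step

    v = alpha * v + (1.0f - alpha) * g * g;    // v_t = alpha*v_{t-1} + (1-alpha)*g_t^2
    theta -= lr / (std::sqrt(v) + eps) * g;    // theta <- theta - eta/(sqrt(v_t)+eps)*g_t

    std::printf("v = %g, theta = %g\n", v, theta);
    return 0;
}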

ObGD (Stream-X Module)

Specialized optimizer for streaming reinforcement learning. Uses eligibility traces and overshooting bounds for stable online learning.

Requires: -DAUTOGRAD_BUILD_STREAM_X=ON

#include "stream_x/obgd.h"

// Create with: lr, gamma, lambda, kappa
ObGD optimizer(1.0f, 0.99f, 0.8f, 2.0f);
optimizer.add_parameters(model.layers());

// RL training loop
for (int step = 0; step < num_steps; ++step) {
    // ... compute TD error (delta) and check episode boundary (reset) ...
    optimizer.zero_grad();
    // ... compute gradients ...
    optimizer.step(delta, reset);  // Uses TD error for eligibility traces
}
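
A slightly fuller sketch of one way the skeleton above could be filled in for TD(0) value learning is shown below. The environment quantities (obs, next_obs, reward, done) and the scalar accessor item() are illustrative assumptions, not part of the documented TinyRL API; only zero_grad(), backward(), step(delta, reset), and clear_graph() come from this page.

float gamma = 0.99f;  // should match the gamma passed to the ObGD constructor
for (int step = 0; step < num_steps; ++step) {
    // obs, next_obs, reward, done are assumed to come from the environment
    auto v      = model.forward(obs);        // V(s_t)
    auto v_next = model.forward(next_obs);   // V(s_{t+1}), used only for the TD target

    // TD error; item() is a hypothetical scalar accessor on the network output
    float delta = reward + (done ? 0.0f : gamma * v_next.item()) - v.item();

    optimizer.zero_grad();
    v.backward();                  // gradients of the value estimate feed the traces
    optimizer.step(delta, done);   // ObGD scales the trace-based update by the TD error
    v.clear_graph();
    v_next.clear_graph();
}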

ObGD Parameters

Parameter   Description
lr          Learning rate
gamma       Discount factor (typically 0.99)
lambda      Eligibility trace decay (typically 0.8)
kappa       Overshooting bound (typically 1.0-3.0)
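
To give a feel for how these parameters interact, here is a scalar, library-independent sketch of an overshooting-bounded, trace-based update in the spirit of ObGD. It shows the general shape of such updates (trace decay by gamma*lambda, step size capped via kappa), not necessarily TinyRL's exact implementation.

#include <algorithm>
#include <cmath>

// One scalar ObGD-style update for a single weight w. Illustrative only.
void obgd_style_step(float& w, float& z, float grad, float delta, bool reset,
                     float lr, float gamma, float lambda, float kappa) {
    z = gamma * lambda * z + grad;                   // decay and accumulate the eligibility trace
    float overshoot = lr * kappa * std::fabs(delta) * std::fabs(z);
    float eff_lr = lr / std::max(1.0f, overshoot);   // cap the effective step size
    w += eff_lr * delta * z;                         // trace-weighted, delta-scaled update
    if (reset) z = 0.0f;                             // clear the trace at episode boundaries
}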

Custom Optimizer

You can create custom optimizers by inheriting from ag::Optimizer:

// optimizer.h declares ag::Optimizer; add the headers for your model's layers as needed.
#include "optimizer.h"

#include <memory>
#include <unordered_map>
#include <vector>

class MomentumSGD : public ag::Optimizer {
public:
    explicit MomentumSGD(double lr, double momentum = 0.9)
        : learning_rate(lr), beta(momentum) {}

    void add_parameters(const std::vector<std::shared_ptr<nn::Layer>>& layers) override {
        for (const auto& layer : layers) {
            if (layer && layer->has_parameters()) {
                for (auto* p : layer->get_parameters()) {
                    if (p) {
                        parameters.push_back(p);
                        velocity[p] = ag::Matrix::Zeros(p->value().shape());
                    }
                }
            }
        }
    }

    void step() override {
        for (auto* p : parameters) {
            if (p && p->requires_grad()) {
                auto& v = velocity[p];
                // Exponential moving average of gradients (momentum buffer)
                v = v * beta + p->grad() * (1.0 - beta);
                // Move the parameter against the smoothed gradient
                p->set_value(p->value() - v * learning_rate);
            }
        }
    }

    void zero_grad() override {
        for (auto* p : parameters) if (p) p->zero_grad();
    }

private:
    double learning_rate;
    double beta;
    std::vector<ag::Tensor*> parameters;
    std::unordered_map<ag::Tensor*, ag::Matrix> velocity;
};
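
The custom optimizer drops into the same training loop as the built-in ones; model, input, target, and compute_loss are as in the earlier examples.

MomentumSGD optimizer(0.01, 0.9);
optimizer.add_parameters(model.layers());

optimizer.zero_grad();
auto loss = compute_loss(model.forward(input), target);
loss.backward();
optimizer.step();
loss.clear_graph();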

Best Practices

Training Loop Pattern

// 1. Create model and optimizer
nn::Sequential model;
model.add(nn::Linear(784, 128));
model.add(nn::ReLU());
model.add(nn::Linear(128, 10));

ag::SGD optimizer(0.01);
optimizer.add_parameters(model.layers());

// 2. Training loop
for (int epoch = 0; epoch < n_epochs; ++epoch) {
    for (auto& [input, target] : dataloader) {
        optimizer.zero_grad();           // Clear gradients
        auto output = model.forward(input);
        auto loss = compute_loss(output, target);
        loss.backward();                  // Compute gradients
        optimizer.step();                 // Update parameters
        loss.clear_graph();               // Free memory
    }
}

Guidelines

  1. Always call zero_grad() before backward() to clear accumulated gradients
  2. Call clear_graph() after each step in long-running training to free memory
  3. Use appropriate learning rates: Start with 0.01 for SGD, 0.001 for RMSProp
  4. Add all trainable layers to the optimizer via add_parameters()

See Also