# Optimizers

Optimizers update neural network parameters using computed gradients. TinyRL provides core optimizers in `optimizer.h`, plus the specialized ObGD optimizer in the Stream-X module.
## Available Optimizers
| Optimizer | Module | Description |
|---|---|---|
| SGD | Core | Stochastic Gradient Descent |
| RMSProp | Core | Adaptive learning rate with moving average of squared gradients |
| ObGD | Stream-X | Overshooting-bounded Gradient Descent for streaming RL |
## SGD (Stochastic Gradient Descent)

The simplest optimizer: \(\theta \leftarrow \theta - \eta \cdot \nabla_\theta L\), where \(L\) is the loss and \(\eta\) the learning rate.
#include "optimizer.h"
// Create optimizer with learning rate
ag::SGD optimizer(0.01f);
// Register parameters
optimizer.add_parameters(model.layers());
// Training loop
for (int step = 0; step < n_steps; ++step) {
optimizer.zero_grad(); // Clear previous gradients
auto loss = compute_loss(model.forward(input), target);
loss.backward(); // Compute gradients
optimizer.step(); // Update parameters
loss.clear_graph(); // Free memory
}
## RMSProp

RMSProp adapts the learning rate per parameter using an exponential moving average of squared gradients. It is useful when gradient magnitudes vary widely across parameters.

\[ v_t = \alpha \cdot v_{t-1} + (1-\alpha) \cdot g_t^2 \]

\[ \theta \leftarrow \theta - \frac{\eta}{\sqrt{v_t} + \epsilon} \cdot g_t \]
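For intuition: with \(\eta = 0.001\), a parameter whose running average of squared gradients is \(v_t = 10^{-4}\) receives an effective step of roughly \(\eta / \sqrt{v_t} = 0.1\), while one with \(v_t = 100\) receives roughly \(10^{-4}\), so consistently large gradients are automatically damped.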
#include "optimizer.h"
// Create with: learning_rate, alpha (decay), epsilon
ag::RMSProp optimizer(0.001f, 0.99f, 1e-8f);
optimizer.add_parameters(model.layers());
// Same training loop as SGD
for (int step = 0; step < n_steps; ++step) {
    optimizer.zero_grad();
    auto loss = compute_loss(model.forward(input), target);
    loss.backward();
    optimizer.step();
    loss.clear_graph();
}
### RMSProp Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| lr | Float | 0.01 | Learning rate |
| alpha | Float | 0.99 | Decay rate for squared gradient average |
| epsilon | Float | 1e-8 | Numerical stability constant |
## ObGD (Stream-X Module)

Specialized optimizer for streaming reinforcement learning. Uses eligibility traces and overshooting bounds for stable online learning.

Requires: `-DAUTOGRAD_BUILD_STREAM_X=ON`
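With a typical CMake workflow this usually amounts to configuring with something like `cmake -B build -DAUTOGRAD_BUILD_STREAM_X=ON`, though the exact invocation depends on your build setup.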
#include "stream_x/obgd.h"
// Create with: lr, gamma, lambda, kappa
ObGD optimizer(1.0f, 0.99f, 0.8f, 2.0f);
optimizer.add_parameters(model.layers());
// RL training loop
for (int step = 0; step < num_steps; ++step) {
// ... compute TD error (delta) and check episode boundary (reset) ...
optimizer.zero_grad();
// ... compute gradients ...
optimizer.step(delta, reset); // Uses TD error for eligibility traces
}
### ObGD Parameters

| Parameter | Description |
|---|---|
| lr | Learning rate |
| gamma | Discount factor (typically 0.99) |
| lambda | Eligibility trace decay (typically 0.8) |
| kappa | Overshooting bound (typically 1.0-3.0) |
## Custom Optimizer

You can create custom optimizers by inheriting from `ag::Optimizer`:
#include "optimizer.h"
#include <memory>
#include <unordered_map>
#include <vector>

// SGD with momentum: keeps an exponential moving average of gradients per parameter.
class MomentumSGD : public ag::Optimizer {
public:
    explicit MomentumSGD(double lr, double momentum = 0.9)
        : learning_rate(lr), beta(momentum) {}

    void add_parameters(const std::vector<std::shared_ptr<nn::Layer>>& layers) override {
        for (const auto& layer : layers) {
            if (layer && layer->has_parameters()) {
                for (auto* p : layer->get_parameters()) {
                    if (p) {
                        parameters.push_back(p);
                        velocity[p] = ag::Matrix::Zeros(p->value().shape());
                    }
                }
            }
        }
    }

    void step() override {
        for (auto* p : parameters) {
            if (p && p->requires_grad()) {
                auto& v = velocity[p];
                v = v * beta + p->grad() * (1.0 - beta);       // Update momentum buffer
                p->set_value(p->value() - v * learning_rate);  // Apply update
            }
        }
    }

    void zero_grad() override {
        for (auto* p : parameters) if (p) p->zero_grad();
    }

private:
    double learning_rate;
    double beta;
    std::vector<ag::Tensor*> parameters;
    std::unordered_map<ag::Tensor*, ag::Matrix> velocity;
};
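Once defined, a custom optimizer is used exactly like the built-in ones. A sketch, reusing the placeholder `compute_loss`, `model`, `input`, and `target` from the earlier examples:

MomentumSGD optimizer(0.01, 0.9);
optimizer.add_parameters(model.layers());

for (int step = 0; step < n_steps; ++step) {
    optimizer.zero_grad();
    auto loss = compute_loss(model.forward(input), target);
    loss.backward();
    optimizer.step();
    loss.clear_graph();
}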
## Best Practices

### Training Loop Pattern
// 1. Create model and optimizer
nn::Sequential model;
model.add(nn::Linear(784, 128));
model.add(nn::ReLU());
model.add(nn::Linear(128, 10));

ag::SGD optimizer(0.01);
optimizer.add_parameters(model.layers());

// 2. Training loop
for (int epoch = 0; epoch < n_epochs; ++epoch) {
    for (auto& [input, target] : dataloader) {
        optimizer.zero_grad();    // Clear gradients
        auto output = model.forward(input);
        auto loss = compute_loss(output, target);
        loss.backward();          // Compute gradients
        optimizer.step();         // Update parameters
        loss.clear_graph();       // Free memory
    }
}
### Guidelines

- Always call `zero_grad()` before `backward()` to clear accumulated gradients
- Call `clear_graph()` after each step in long-running training to free memory
- Use appropriate learning rates: start with 0.01 for SGD, 0.001 for RMSProp
- Add all trainable layers to the optimizer via `add_parameters()`
## See Also
- Neural Networks for layer definitions
- Reinforcement Learning for ObGD usage with Stream-X
- Examples for complete training examples