Loss Functions¶
TorchTrade provides specialized loss functions for training RL trading agents, built on TorchRL's LossModule interface.
Available Loss Functions¶
| Loss Function | Type | Use Case |
|---|---|---|
| DGLoss | Policy Gradient | Delight-gated updates — no importance sampling needed |
| GRPOLoss | Policy Gradient | One-step RL with SLTP environments |
| CTRLLoss | Representation Learning | Self-supervised encoder training |
| CTRLPPOLoss | Combined | Joint policy + representation learning |
For standard multi-step RL (PPO, DQN, SAC, IQL), use TorchRL's built-in loss modules directly.
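For example, a standard multi-step PPO setup can use TorchRL's modules unchanged. A sketch, assuming `actor`, `critic`, `collector`, and `optimizer` are already-constructed TorchRL-compatible objects:

```python
from torchrl.objectives import ClipPPOLoss
from torchrl.objectives.value import GAE

# Plain TorchRL PPO; no TorchTrade-specific loss needed for multi-step training.
ppo_loss = ClipPPOLoss(actor_network=actor, critic_network=critic)
advantage_module = GAE(gamma=0.99, lmbda=0.95, value_network=critic)

for batch in collector:
    advantage_module(batch)  # writes "advantage" / "value_target" into the batch
    loss_td = ppo_loss(batch)
    loss = loss_td["loss_objective"] + loss_td["loss_critic"] + loss_td["loss_entropy"]

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```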
DGLoss¶
Delightful Policy Gradient — gates each update by \(\sigma(\text{delight} / \eta)\), where \(\text{delight} = \text{advantage} \times \text{surprisal}\). This suppresses rare failures and amplifies rare successes without requiring importance sampling (no old log-probs needed).
Key difference from PPO/GRPO: DG uses only the current policy's log-probabilities. No importance ratios, no behavior probabilities. This makes it simpler and naturally robust to stale or off-policy data.
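A minimal sketch of the idea (not the exact `DGLoss` implementation; the tensor shapes, the `detach` placement, and the mean baseline shown here are assumptions):

```python
import torch

def dg_objective_sketch(log_prob, reward, eta=1.0):
    """Illustrative delight-gated policy-gradient objective (sketch only).

    log_prob: current policy log-probs of the taken actions, shape [B]
    reward:   per-sample rewards, shape [B]
    """
    advantage = reward - reward.mean()       # "mean" baseline
    surprisal = -log_prob.detach()           # how unlikely each action was
    delight = advantage * surprisal
    gate = torch.sigmoid(delight / eta)      # in (0, 1); no importance ratio needed
    # Gate the standard policy-gradient term; gradients flow only through log_prob.
    return -(gate * advantage * log_prob).mean()
```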
Why DG for trading?¶
Trading rewards are heavy-tailed: most trades produce small P&L, but a few outliers (flash crashes, breakouts) dominate the gradient. Standard PG and GRPO weight updates purely by advantage magnitude, so one catastrophic trade can overwhelm training even when the policy assigned near-zero probability to that action.
The delight gate fixes this. A large loss on an unlikely action produces negative delight, pushing the gate toward 0 and shrinking its gradient contribution. Conversely, an unexpected win produces positive delight, pushing the gate toward 1 so the policy learns faster from it. The net effect: black-swan losses don't destabilize training, and rare profitable signals get amplified.
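To make the gating concrete, here is a small illustration of how the sigmoid responds at the default `eta=1.0` (the advantage/surprisal values are invented for illustration):

```python
import torch

eta = 1.0
# (advantage, surprisal) pairs: a surprising large loss vs. a surprising win
cases = {
    "unlikely action, big loss": (-3.0, 2.0),   # delight = -6   -> gate ~ 0.002
    "likely action, small loss": (-0.5, 0.1),   # delight = -0.05 -> gate ~ 0.49
    "unlikely action, big win":  ( 3.0, 2.0),   # delight = +6   -> gate ~ 0.998
}
for name, (adv, surprisal) in cases.items():
    gate = torch.sigmoid(torch.tensor(adv * surprisal) / eta)
    print(f"{name}: gate = {gate.item():.3f}")
```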
DG also drops the importance sampling machinery that GRPO needs. GRPO requires old log-probs (sample_log_prob) to compute policy ratios, which means storing behavior policy state and dealing with ratio clipping when data gets stale. DG only needs the current policy's log-probs, so it works cleanly with replay buffers, asynchronous collection, and distributed setups where the collecting policy drifts from the learner.
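For example, batches sampled from a TorchRL replay buffer can be passed straight to the loss, since no behavior-policy log-probs need to be stored. A rough sketch, assuming `actor`, `collector`, and `optimizer` are set up as in the usage example below:

```python
from torchrl.data import LazyTensorStorage, ReplayBuffer
from torchtrade.losses import DGLoss

loss_module = DGLoss(actor_network=actor, eta=1.0)
buffer = ReplayBuffer(storage=LazyTensorStorage(100_000), batch_size=256)

for rollout in collector:
    buffer.extend(rollout.reshape(-1))   # store transitions; they may go stale later

    sampled = buffer.sample()            # off-policy batch; no old log-probs kept
    loss_td = loss_module(sampled)
    loss = loss_td["loss_objective"] + loss_td["loss_entropy"]

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```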
When to use DG vs GRPO:
| Scenario | Recommendation |
|---|---|
| One-step SLTP with fresh on-policy batches | Either works; GRPO is well-tested |
| Sequential environments (multi-step episodes) | DG, no need to track old log-probs across steps |
| Replay buffer / offline data | DG, no importance sampling artifacts |
| Distributed collection with stale actors | DG, designed for this (see arXiv:2603.20521) |
| Heavy-tailed reward distributions | DG, gate suppresses outlier-driven gradient noise |
| Parameter | Default | Description |
|---|---|---|
| `actor_network` | Required | Policy network (`ProbabilisticTensorDictSequential`) |
| `eta` | 1.0 | Sigmoid gate temperature (lower = sharper gating) |
| `baseline` | `"mean"` | Baseline type: `"mean"` or `"none"` |
| `entropy_bonus` | `True` | Whether to add entropy regularization |
| `entropy_coeff` | 0.01 | Entropy regularization coefficient |
```python
from torchtrade.losses import DGLoss

loss_module = DGLoss(actor_network=actor, eta=1.0, baseline="mean")

for batch in collector:
    loss_td = loss_module(batch)
    loss = loss_td["loss_objective"] + loss_td["loss_entropy"]

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # DG-specific diagnostics
    print(f"gate: {loss_td['gate'].item():.3f}, advantage: {loss_td['advantage'].item():.3f}")
```
Baseline modes:
"mean"— batch mean of rewards (default, simple and effective)"none"— no baseline (raw rewards as advantage)
Papers:
- Delightful Policy Gradient (arXiv:2603.14608) — Introduces DG and shows that it improves directional accuracy within single contexts and shifts the expected gradient closer to supervised cross-entropy across multiple contexts.
- Delightful Distributed Policy Gradient (arXiv:2603.20521) — Extends DG to distributed RL with stale, buggy, or mismatched actors. DG without off-policy correction outperforms importance-weighted PG with exact behavior probabilities.
Example: See examples/losses/dg_mnist.py for a comparison of CE, REINFORCE, and DG on MNIST framed as a bandit problem.
GRPOLoss¶
Group Relative Policy Optimization for one-step RL. Designed for OneStepTradingEnv where episodes are single decisions with SL/TP bracket orders. Normalizes advantages within each batch: advantage = (reward - mean) / std.
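A rough sketch of the surrogate being optimized (not the exact `GRPOLoss` code; the argument names and the clipping form are assumptions based on the parameters below):

```python
import torch

def grpo_objective_sketch(log_prob, old_log_prob, reward,
                          epsilon_low=0.2, epsilon_high=0.2):
    """Illustrative one-step GRPO surrogate with group-normalized advantages."""
    # Group-relative advantage: normalize rewards within the batch.
    advantage = (reward - reward.mean()) / (reward.std() + 1e-8)
    # Importance ratio between current and behavior policy.
    ratio = torch.exp(log_prob - old_log_prob.detach())
    clipped = torch.clamp(ratio, 1.0 - epsilon_low, 1.0 + epsilon_high)
    # Pessimistic (clipped) surrogate, maximized; negate to form a loss.
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```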
| Parameter | Default | Description |
|---|---|---|
| `actor_network` | Required | Policy network (`ProbabilisticTensorDictSequential`) |
| `entropy_coeff` | 0.01 | Entropy regularization coefficient |
| `epsilon_low` / `epsilon_high` | 0.2 | Clipping bounds for policy ratio |
```python
from torchtrade.losses import GRPOLoss

loss_module = GRPOLoss(actor_network=actor, entropy_coeff=0.01)

for batch in collector:
    loss_td = loss_module(batch)
    loss = loss_td["loss_objective"] + loss_td["loss_entropy"]

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
Paper: DeepSeekMath (arXiv:2402.03300) — Section 2.2
CTRLLoss¶
Cross-Trajectory Representation Learning for self-supervised encoder training. Trains encoders to recognize behavioral similarity across trajectories without rewards, improving zero-shot generalization.
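For intuition, the prototype-assignment step at the core of this style of loss looks roughly like the SwAV-style Sinkhorn-Knopp balancing below. This is a generic sketch, not TorchTrade's `CTRLLoss` internals; temperature scaling and the MYOW term are omitted:

```python
import torch

@torch.no_grad()
def sinkhorn_assignments_sketch(scores, n_iters=3):
    """Balance soft prototype assignments so all prototypes are used evenly.

    scores: embedding-to-prototype similarities, shape [batch, num_prototypes]
    """
    q = torch.exp(scores).t()                # [num_prototypes, batch]
    q = q / q.sum()
    n_prototypes, batch_size = q.shape
    for _ in range(n_iters):
        q = q / q.sum(dim=1, keepdim=True)   # rows: normalize over the batch
        q = q / n_prototypes
        q = q / q.sum(dim=0, keepdim=True)   # columns: normalize over prototypes
        q = q / batch_size
    return (q * batch_size).t()              # each row sums to 1: soft target per sample
```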
| Parameter | Default | Description |
|---|---|---|
| `encoder_network` | Required | Encoder that produces embeddings |
| `embedding_dim` | Required | Dimension of encoder output |
| `num_prototypes` | 512 | Learnable prototype vectors |
| `sinkhorn_iters` | 3 | Sinkhorn-Knopp iterations |
| `temperature` | 0.1 | Softmax temperature |
| `myow_coeff` | 1.0 | MYOW loss coefficient |
```python
from torchtrade.losses import CTRLLoss

ctrl_loss = CTRLLoss(
    encoder_network=encoder,
    embedding_dim=128,
    num_prototypes=512,
)

for batch in collector:
    loss_td = ctrl_loss(batch)

    optimizer.zero_grad()
    loss_td["loss_ctrl"].backward()
    optimizer.step()
```
Paper: Cross-Trajectory Representation Learning (arXiv:2106.02193)
CTRLPPOLoss¶
Combines ClipPPOLoss with CTRLLoss for joint policy and encoder training. The encoder learns useful representations while the policy learns to act.
```python
from torchtrade.losses import CTRLLoss, CTRLPPOLoss
from torchrl.objectives import ClipPPOLoss

combined_loss = CTRLPPOLoss(
    ppo_loss=ClipPPOLoss(actor, critic),
    ctrl_loss=CTRLLoss(encoder, embedding_dim=128),
    ctrl_coeff=0.5,
)

for batch in collector:
    loss_td = combined_loss(batch)
    total_loss = loss_td["loss_objective"] + loss_td["loss_critic"] + loss_td["loss_ctrl"]

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
```
See Also¶
- Examples - Training scripts using these losses
- Environments - Compatible environment types
- TorchRL Objectives - Built-in PPO, DQN, SAC, IQL losses