Loss Functions

TorchTrade provides specialized loss functions for training RL trading agents, built on TorchRL's LossModule interface.

Available Loss Functions

| Loss Function | Type | Use Case |
|---|---|---|
| DGLoss | Policy Gradient | Delight-gated updates; no importance sampling needed |
| GRPOLoss | Policy Gradient | One-step RL with SLTP environments |
| CTRLLoss | Representation Learning | Self-supervised encoder training |
| CTRLPPOLoss | Combined | Joint policy + representation learning |

For standard multi-step RL (PPO, DQN, SAC, IQL), use TorchRL's built-in loss modules directly.


DGLoss

Delightful Policy Gradient — gates each update by \(\sigma(\text{delight} / \eta)\), where \(\text{delight} = \text{advantage} \times \text{surprisal}\). This suppresses rare failures and amplifies rare successes without requiring importance sampling (no old log-probs needed).

\[ \begin{aligned} \text{surprisal} &= -\log \pi(a|s) \\ \text{advantage} &= r - b \\ \text{delight} &= \text{advantage} \times \text{surprisal} \\ \text{gate} &= \sigma(\text{delight} / \eta) \\ \mathcal{L} &= -\mathbb{E}\left[\log \pi(a|s) \cdot \text{sg}(\text{gate} \cdot \text{advantage})\right] \end{aligned} \]

Key difference from PPO/GRPO: DG uses only the current policy's log-probabilities. No importance ratios, no behavior probabilities. This makes it simpler and naturally robust to stale or off-policy data.
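
To make the objective concrete, here is a minimal stand-alone sketch of the update above in plain PyTorch. The helper name dg_objective and the bare log_prob / reward tensors are illustrative only; DGLoss itself consumes TensorDict batches through TorchRL's LossModule interface.

```python
import torch

# Minimal sketch of the DG update (illustrative, not DGLoss internals).
# Gradients flow only through log_prob; the gate and advantage are detached.
def dg_objective(log_prob: torch.Tensor, reward: torch.Tensor, eta: float = 1.0) -> torch.Tensor:
    surprisal = -log_prob                         # -log pi(a|s)
    advantage = reward - reward.mean()            # "mean" baseline
    delight = advantage * surprisal               # advantage x surprisal
    gate = torch.sigmoid(delight / eta)           # sigma(delight / eta)
    weight = (gate * advantage).detach()          # sg(gate * advantage)
    return -(log_prob * weight).mean()

log_prob = -torch.rand(32)                        # fake current-policy log-probs in (-1, 0)
reward = torch.randn(32)                          # fake per-sample rewards
loss = dg_objective(log_prob, reward)
```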

Why DG for trading?

Trading rewards are heavy-tailed: most trades produce small P&L, but a few outliers (flash crashes, breakouts) dominate the gradient. Standard PG and GRPO weight updates purely by advantage magnitude, so one catastrophic trade can overwhelm training even when the policy assigned near-zero probability to that action.

The delight gate fixes this. A large loss on an unlikely action produces negative delight, pushing the gate toward 0 and shrinking its gradient contribution. Conversely, an unexpected win produces positive delight, pushing the gate toward 1 so the policy learns faster from it. The net effect: black-swan losses don't destabilize training, and rare profitable signals get amplified.
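
A few illustrative gate values (the numbers are made up for exposition) show both effects. Note that a routine, high-probability outcome has surprisal near zero, so its gate sits at the neutral 0.5:

```python
import torch

# Gate values for hand-picked (advantage, surprisal) pairs; eta = 1.0.
def gate(advantage: float, surprisal: float, eta: float = 1.0) -> float:
    return torch.sigmoid(torch.tensor(advantage * surprisal / eta)).item()

print(gate(-5.0, 4.0))   # ~0.00  large loss on an unlikely action: contribution suppressed
print(gate( 5.0, 4.0))   # ~1.00  large win on an unlikely action: full weight
print(gate( 0.5, 0.1))   # ~0.51  routine win on a likely action: near the neutral 0.5
```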

DG also drops the importance sampling machinery that GRPO needs. GRPO requires old log-probs (sample_log_prob) to compute policy ratios, which means storing behavior policy state and dealing with ratio clipping when data gets stale. DG only needs the current policy's log-probs, so it works cleanly with replay buffers, asynchronous collection, and distributed setups where the collecting policy drifts from the learner.

When to use DG vs GRPO:

| Scenario | Recommendation |
|---|---|
| One-step SLTP with fresh on-policy batches | Either works; GRPO is well-tested |
| Sequential environments (multi-step episodes) | DG, no need to track old log-probs across steps |
| Replay buffer / offline data | DG, no importance sampling artifacts |
| Distributed collection with stale actors | DG, designed for this (see arXiv:2603.20521) |
| Heavy-tailed reward distributions | DG, gate suppresses outlier-driven gradient noise |

| Parameter | Default | Description |
|---|---|---|
| actor_network | Required | Policy network (ProbabilisticTensorDictSequential) |
| eta | 1.0 | Sigmoid gate temperature (lower = sharper gating) |
| baseline | "mean" | Baseline type: "mean" or "none" |
| entropy_bonus | True | Whether to add entropy regularization |
| entropy_coeff | 0.01 | Entropy regularization coefficient |

```python
from torchtrade.losses import DGLoss

loss_module = DGLoss(actor_network=actor, eta=1.0, baseline="mean")

for batch in collector:
    loss_td = loss_module(batch)
    loss = loss_td["loss_objective"] + loss_td["loss_entropy"]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # DG-specific diagnostics (batch means)
    print(f"gate: {loss_td['gate'].mean().item():.3f}, advantage: {loss_td['advantage'].mean().item():.3f}")
```

Baseline modes:

  • "mean" — batch mean of rewards (default, simple and effective)
  • "none" — no baseline (raw rewards as advantage)

Paper: arXiv:2603.20521

Example: See examples/losses/dg_mnist.py for a comparison of CE, REINFORCE, and DG on MNIST framed as a bandit problem.


GRPOLoss

Group Relative Policy Optimization for one-step RL. Designed for OneStepTradingEnv where episodes are single decisions with SL/TP bracket orders. Normalizes advantages within each batch: advantage = (reward - mean) / std.
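
In other words, the surrogate is the usual clipped-ratio objective with the group-normalized reward as the advantage. A rough sketch follows; the helper name and bare tensors are illustrative (GRPOLoss itself reads the behavior log-probs from the batch's sample_log_prob entry):

```python
import torch

# Sketch of the GRPO surrogate (illustrative, not GRPOLoss internals).
def grpo_objective(log_prob, old_log_prob, reward, eps_low=0.2, eps_high=0.2):
    advantage = (reward - reward.mean()) / (reward.std() + 1e-8)   # group-normalized
    ratio = torch.exp(log_prob - old_log_prob)                     # importance ratio
    clipped = ratio.clamp(1.0 - eps_low, 1.0 + eps_high)
    return -torch.minimum(ratio * advantage, clipped * advantage).mean()

log_prob, old_log_prob, reward = torch.randn(3, 64).unbind(0)      # fake batch
loss = grpo_objective(log_prob, old_log_prob, reward)
```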

| Parameter | Default | Description |
|---|---|---|
| actor_network | Required | Policy network (ProbabilisticTensorDictSequential) |
| entropy_coeff | 0.01 | Entropy regularization coefficient |
| epsilon_low / epsilon_high | 0.2 | Clipping bounds for the policy ratio |

```python
from torchtrade.losses import GRPOLoss

loss_module = GRPOLoss(actor_network=actor, entropy_coeff=0.01)

for batch in collector:
    loss_td = loss_module(batch)
    loss = loss_td["loss_objective"] + loss_td["loss_entropy"]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Paper: DeepSeekMath (arXiv:2402.03300), Section 4.1


CTRLLoss

Cross-Trajectory Representation Learning for self-supervised encoder training. Trains encoders to recognize behavioral similarity across trajectories without rewards, improving zero-shot generalization.
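
The num_prototypes, sinkhorn_iters, and temperature parameters below correspond to a SwAV-style clustering step: embeddings are scored against learnable prototypes and the soft assignments are balanced with a few Sinkhorn-Knopp iterations. A rough sketch of that step only; the names and the exact normalization are assumptions, not CTRLLoss internals:

```python
import torch
import torch.nn.functional as F

# Illustrative Sinkhorn-Knopp balancing of embedding-to-prototype scores.
@torch.no_grad()
def sinkhorn_assignments(scores: torch.Tensor, n_iters: int = 3) -> torch.Tensor:
    # scores: [batch, num_prototypes] similarity logits
    q = torch.exp(scores).t()                  # [num_prototypes, batch]
    q /= q.sum()
    num_protos, batch = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True)        # balance prototype usage
        q /= num_protos
        q /= q.sum(dim=0, keepdim=True)        # renormalize per sample
        q /= batch
    return (q * batch).t()                     # [batch, num_prototypes], rows sum to 1

embeddings = F.normalize(torch.randn(32, 128), dim=1)     # encoder output
prototypes = F.normalize(torch.randn(512, 128), dim=1)    # learnable prototypes
assignments = sinkhorn_assignments(embeddings @ prototypes.t() / 0.1)  # 0.1 = temperature
```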

| Parameter | Default | Description |
|---|---|---|
| encoder_network | Required | Encoder that produces embeddings |
| embedding_dim | Required | Dimension of encoder output |
| num_prototypes | 512 | Learnable prototype vectors |
| sinkhorn_iters | 3 | Sinkhorn-Knopp iterations |
| temperature | 0.1 | Softmax temperature |
| myow_coeff | 1.0 | MYOW loss coefficient |

```python
from torchtrade.losses import CTRLLoss

ctrl_loss = CTRLLoss(
    encoder_network=encoder,
    embedding_dim=128,
    num_prototypes=512,
)

for batch in collector:
    loss_td = ctrl_loss(batch)
    optimizer.zero_grad()
    loss_td["loss_ctrl"].backward()
    optimizer.step()
```

Paper: Cross-Trajectory Representation Learning (arXiv:2106.02193)


CTRLPPOLoss

Combines ClipPPOLoss with CTRLLoss for joint policy and encoder training. The encoder learns useful representations while the policy learns to act.

```python
from torchtrade.losses import CTRLLoss, CTRLPPOLoss
from torchrl.objectives import ClipPPOLoss

combined_loss = CTRLPPOLoss(
    ppo_loss=ClipPPOLoss(actor, critic),
    ctrl_loss=CTRLLoss(encoder, embedding_dim=128),
    ctrl_coeff=0.5,
)

for batch in collector:
    loss_td = combined_loss(batch)
    total_loss = loss_td["loss_objective"] + loss_td["loss_critic"] + loss_td["loss_ctrl"]
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
```

See Also