Reward Functions¶
Reward functions shape your agent's behavior. TorchTrade provides a flexible system for defining custom reward functions that go beyond simple log returns.
Default Behavior¶
By default, all environments use log returns:
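The default reward is the log of the ratio of consecutive portfolio values. A minimal sketch of that computation, mirroring the documented behavior of the built-in `log_return_reward` (the library's actual implementation may differ in minor details):

```python
import numpy as np

def log_return(history) -> float:
    """Sketch of the default reward: log(value_t / value_t-1)."""
    if len(history.portfolio_values) < 2:
        return 0.0   # insufficient history
    old_value = history.portfolio_values[-2]
    new_value = history.portfolio_values[-1]
    if old_value <= 0 or new_value <= 0:
        return -10.0  # bankruptcy penalty
    return float(np.log(new_value / old_value))
```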
You can potentially improve learning and generalization by customizing this reward function. See below for some references on reward function design.
How It Works¶
Custom reward functions receive a HistoryTracker object and return a float reward:
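For example, a minimal custom reward that returns the raw change in portfolio value over the last step (the name `my_reward` is arbitrary; it is the function passed to the config below):

```python
def my_reward(history) -> float:
    """Custom reward: raw change in portfolio value over the last step."""
    if len(history.portfolio_values) < 2:
        return 0.0  # first step: no previous value to compare against
    return float(history.portfolio_values[-1] - history.portfolio_values[-2])
```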
Pass your reward function via the environment config:
```python
from torchtrade.envs.offline import SequentialTradingEnv, SequentialTradingEnvConfig

config = SequentialTradingEnvConfig(
    reward_function=my_reward,
    ...
)
env = SequentialTradingEnv(df, config)
```
HistoryTracker Fields¶
The history object passed to your reward function is a HistoryTracker instance with these attributes:
| Field | Type | Description |
|---|---|---|
| `portfolio_values` | `list[float]` | Portfolio value at each step |
| `base_prices` | `list[float]` | Asset price at each step |
| `actions` | `list[float]` | Actions taken at each step |
| `rewards` | `list[float]` | Rewards received at each step |
| `positions` | `list[float]` | Position sizes (positive=long, negative=short, 0=flat) |
| `action_types` | `list[str]` | Action types ("hold", "buy", "sell", "long", "short") |
All lists grow by one element per step. Use indexing (e.g., `history.portfolio_values[-1]`) to access recent values.
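As a purely illustrative sketch, a reward can combine several of these fields, for example weighting the latest price move by the current position:

```python
def position_aware_reward(history) -> float:
    """Illustrative only: reward price moves in the direction of the position."""
    if len(history.base_prices) < 2:
        return 0.0
    price_change = history.base_prices[-1] - history.base_prices[-2]
    position = history.positions[-1]  # positive=long, negative=short, 0=flat
    return float(position * price_change)
```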
Built-in Reward Functions¶
TorchTrade provides three built-in reward functions in torchtrade.envs.core.default_rewards:
```python
from torchtrade.envs.core.default_rewards import (
    log_return_reward,        # Default - log(value_t / value_t-1)
    sharpe_ratio_reward,      # Running Sharpe ratio
    drawdown_penalty_reward,  # Log return with drawdown penalty
)

config = SequentialTradingEnvConfig(
    reward_function=sharpe_ratio_reward,
    ...
)
```
log_return_reward (default)¶
- Scale-invariant, symmetric, handles compounding
- Returns `-10.0` on bankruptcy, `0.0` if insufficient history
sharpe_ratio_reward¶
Running Sharpe ratio over all historical portfolio values. Encourages risk-adjusted returns. Clipped to [-10.0, 10.0].
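A rough sketch of such a running Sharpe computation over the tracked portfolio values (the built-in `sharpe_ratio_reward` may differ in details, e.g. how it guards against zero variance):

```python
import numpy as np

def running_sharpe(history) -> float:
    """Sharpe-like reward over all per-step returns observed so far (sketch)."""
    values = np.asarray(history.portfolio_values, dtype=float)
    if len(values) < 3:
        return 0.0
    step_returns = np.diff(values) / values[:-1]
    std = step_returns.std()
    if std < 1e-8:
        return 0.0  # avoid division by (near-)zero volatility
    return float(np.clip(step_returns.mean() / std, -10.0, 10.0))
```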
drawdown_penalty_reward¶
Log return plus a penalty when drawdown from peak exceeds 10%. Discourages large drawdowns while rewarding gains.
Custom Examples¶
Example 1: Transaction Cost Penalty¶
```python
import numpy as np

def cost_aware_reward(history) -> float:
    """Log return with a small penalty for switching actions."""
    if len(history.portfolio_values) < 2:
        return 0.0
    old_value = history.portfolio_values[-2]
    new_value = history.portfolio_values[-1]
    if old_value <= 0 or new_value <= 0:
        return -10.0
    log_return = np.log(new_value / old_value)
    # Penalize frequent action changes
    if len(history.actions) >= 2 and history.actions[-1] != history.actions[-2]:
        log_return -= 0.001  # Small penalty for switching
    return float(log_return)
```
Example 2: Sparse Terminal Reward¶
```python
import numpy as np

def terminal_reward(history) -> float:
    """Cumulative log return since the start of the episode."""
    if len(history.portfolio_values) < 2:
        return 0.0
    # Compare the current portfolio value against the value at episode start.
    # For a truly sparse signal, return 0.0 on intermediate steps and emit
    # this value only on the final step of the episode.
    initial_value = history.portfolio_values[0]
    current_value = history.portfolio_values[-1]
    if initial_value <= 0 or current_value <= 0:
        return -10.0
    return float(np.log(current_value / initial_value))
```
Example 3: Multi-Objective Reward¶
```python
import numpy as np

def multi_objective_reward(history) -> float:
    """Combine log returns with a drawdown penalty."""
    if len(history.portfolio_values) < 2:
        return 0.0
    old_value = history.portfolio_values[-2]
    new_value = history.portfolio_values[-1]
    if old_value <= 0 or new_value <= 0:
        return -10.0
    # Component 1: Log return
    log_ret = float(np.log(new_value / old_value))
    # Component 2: Drawdown from peak
    peak = max(history.portfolio_values)
    drawdown = (peak - new_value) / peak if peak > 0 else 0.0
    dd_penalty = -5.0 * drawdown if drawdown > 0.1 else 0.0
    return log_ret + dd_penalty
```
When to Use Each Reward¶
| Reward Type | Use Case | Pros | Cons |
|---|---|---|---|
| Log Return (default) | General purpose | Simple, stable | Ignores costs, risk |
| Sharpe Ratio | Risk-adjusted performance | Considers volatility | Requires history, unstable early |
| Drawdown Penalty | Risk management | Controls max loss | May exit winners early |
| Terminal Sparse | Episodic tasks | Clear objective | Sparse signal, slow learning |
| Cost Penalty | Reduce overtrading | Discourages frequent trades | May miss opportunities |
Design Principles¶
1. Dense vs Sparse Rewards¶
Dense rewards (every step):
- Faster learning
- More signal for gradient updates
- Risk: may optimize for short-term gains

Sparse rewards (terminal only):
- Aligns with the true objective (final return)
- Less noise
- Risk: slower learning, credit assignment problem
2. Normalize Rewards¶
Keep rewards in a consistent range for stable learning:
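One simple approach, sketched below, is to wrap an existing reward and clip (or softly squash) it into a bounded range; `cost_aware_reward` refers to Example 1 above:

```python
import numpy as np

def normalized_reward(history) -> float:
    """Clip a base reward into a fixed range for more stable gradient updates."""
    raw = cost_aware_reward(history)        # any reward function, e.g. Example 1
    return float(np.clip(raw, -1.0, 1.0))   # np.tanh(raw) is a softer alternative
```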
3. Avoid Lookahead Bias¶
Only use information available at decision time — use past values from history, never future data.
4. Consider Environment Type¶
- Sequential Environments: full history available for complex rewards
- One-Step Environments: limited to immediate reward (minimal history)
- Futures Environments: consider liquidation risk in the reward
References¶
Research Papers on Reward Design¶
- Reward Shaping for Reinforcement Learning in Financial Trading - Comprehensive study on reward function design for trading applications
- Deep Reinforcement Learning for Trading - Research on RL approaches and reward engineering for financial markets
- Reward Function Design in Deep Reinforcement Learning for Financial Trading - Analysis of different reward formulations and their impact on trading performance
Next Steps¶
- Feature Engineering - Engineer features that support your reward function
- Understanding the Sampler - How data flows through environments
- Loss Functions - Training objectives that work with rewards
- Offline Environments - Apply custom rewards to environments