Examples¶
TorchTrade provides a collection of example training scripts to help you get started. These examples are designed for inspiration and learning - use them as starting points to build your own custom training pipelines.
Design Philosophy¶
TorchTrade examples closely follow the structure of TorchRL's SOTA implementations, enabling near plug-and-play compatibility with any TorchRL algorithm. This means:
- Familiar structure if you've used TorchRL before
- Easy adaptation of TorchRL algorithms to trading environments
- Minimal boilerplate - focus on what's unique to your strategy
- Hydra configuration for easy experimentation
Direct Compatibility with TorchRL SOTA Implementations¶
To demonstrate how closely the examples mirror TorchRL's SOTA implementations, here is a direct comparison:
TorchRL's A2C Example:
```python
import torch
from torchrl.collectors import SyncDataCollector
from torchrl.envs import GymEnv
from torchrl.objectives import A2CLoss

# Environment setup
env = GymEnv("CartPole-v1")

# Everything else stays the same
# (`policy`, `actor`, and `critic` are built as usual and omitted here)
collector = SyncDataCollector(env, policy, ...)
loss_module = A2CLoss(actor, critic, ...)
optimizer = torch.optim.Adam(loss_module.parameters(), lr=1e-3)

for batch in collector:
    loss_values = loss_module(batch)
    loss = sum(v for k, v in loss_values.items() if k.startswith("loss_"))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```
TorchTrade Adaptation (Only Environment Changes):
```python
import torch
from torchrl.collectors import SyncDataCollector
from torchrl.objectives import A2CLoss
from torchtrade.envs.offline import SequentialTradingEnv, SequentialTradingEnvConfig

# Environment setup - ONLY CHANGE
env = SequentialTradingEnv(df, SequentialTradingEnvConfig(...))

# Everything else stays EXACTLY the same
collector = SyncDataCollector(env, policy, ...)
loss_module = A2CLoss(actor, critic, ...)
optimizer = torch.optim.Adam(loss_module.parameters(), lr=1e-3)

for batch in collector:
    loss_values = loss_module(batch)
    loss = sum(v for k, v in loss_values.items() if k.startswith("loss_"))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```
That's it! The collector, loss function, optimizer, and training loop remain identical. This not only lets you use any TorchRL algorithm, but also any of TorchRL's other useful components - replay buffers, transforms, modules, data structures - and provides seamless integration with the entire TorchRL ecosystem.
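Beyond swapping algorithms, the same environment plugs into TorchRL's other building blocks. The sketch below is an illustration rather than code from the examples: `df` and `policy` are the placeholders from the snippet above, and the `StepCounter` transform, batch sizes, and buffer size are arbitrary choices.

```python
from torchrl.collectors import SyncDataCollector
from torchrl.data import LazyTensorStorage, TensorDictReplayBuffer
from torchrl.envs import StepCounter, TransformedEnv
from torchtrade.envs.offline import SequentialTradingEnv, SequentialTradingEnvConfig

# Wrap the trading environment with an off-the-shelf TorchRL transform.
env = TransformedEnv(
    SequentialTradingEnv(df, SequentialTradingEnvConfig(...)),
    StepCounter(),  # appends a step count to every transition
)
collector = SyncDataCollector(env, policy, frames_per_batch=1000, total_frames=100_000)

# Store collected batches in a TorchRL replay buffer for later reuse.
replay_buffer = TensorDictReplayBuffer(storage=LazyTensorStorage(max_size=100_000))
for batch in collector:
    replay_buffer.extend(batch.reshape(-1))  # flatten the batch dimension before storing
    sample = replay_buffer.sample(256)       # e.g. feed into an off-policy loss
```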
Example File Structure¶
Each algorithm directory is fully self-contained, with configs and scripts living alongside all training outputs (checkpoints, logs, Hydra outputs) written to that directory:
```
examples/online_rl/
├── <algorithm>/                  # ppo, dqn, dsac, iql, ppo_chronos
│   ├── config.yaml               # Algorithm config
│   ├── env/                      # Environment configs
│   │   ├── sequential.yaml       # Basic sequential trading
│   │   └── sequential_sltp.yaml  # Sequential with stop-loss/take-profit
│   ├── train.py                  # Training script (offline backtesting)
│   ├── live.py                   # Live trading script (optional)
│   ├── utils.py                  # Helper functions
│   └── outputs/                  # Hydra outputs, checkpoints, logs
│
└── grpo/                         # GRPO (onestep-only)
    ├── config.yaml               # Algorithm config
    ├── env/
    │   └── onestep.yaml          # One-step environment config
    ├── train.py
    ├── utils.py
    └── outputs/
```
Key Features:
- Each algorithm directory is self-contained — everything you need to run, train, and deploy lives in one place
- Training outputs (checkpoints, logs) are written to the algorithm's own directory
- GRPO only supports onestep environments
- No spot/futures split - users override `leverage` and `action_levels` for futures
- All use 1Hour timeframe by default
What Each Component Does¶
config.yaml - Configuration Management
The configuration file uses Hydra to manage all hyperparameters and settings. This includes:
- Environment settings: Symbol, timeframes, initial cash, transaction fees, window sizes
- Network architecture: Hidden dimensions, activation functions, layer configurations
- Training hyperparameters: Learning rate, batch size, discount factor (gamma), entropy coefficient
- Collector settings: Frames per batch, number of parallel environments
- Logging: Wandb project name, experiment tracking settings
By centralizing all parameters in YAML, you can easily experiment with different configurations without modifying code. Hydra also allows you to override any parameter from the command line:
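For example, overriding the learning rate and discount factor from the shell (the same overrides listed under Common Hydra Overrides below):

```bash
uv run python examples/online_rl/ppo/train.py optim.lr=1e-4 loss.gamma=0.99
```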
Example config.yaml (PPO):
```yaml
defaults:
  - env: sequential  # Load env/sequential.yaml (switch with env=sequential_sltp)
  - _self_

collector:
  device:
  frames_per_batch: 10000
  total_frames: 100_000_000

logger:
  mode: online
  backend: wandb
  project_name: TorchTrade-Online
  group_name: ${env.name}
  exp_name: ppo-${env.name}
  test_interval: 1_000_000
  num_test_episodes: 1

model:
  network_type: batchnorm_mlp
  hidden_size: 128
  dropout: 0.1
  num_layers: 4

optim:
  lr: 2.5e-4
  eps: 1.0e-6
  weight_decay: 0.0
  max_grad_norm: 0.5
  anneal_lr: True
  device:

loss:
  gamma: 0.9
  mini_batch_size: 3333
  ppo_epochs: 3
  gae_lambda: 0.95
  clip_epsilon: 0.1
  anneal_clip_epsilon: True
  critic_coeff: 1.0
  entropy_coeff: 1.0
  loss_critic_type: l2
```
Note how config.yaml uses the defaults list (`- env: sequential`) to load the environment config from env/sequential.yaml. The env section is kept separate — see the env/ configs below.
env/ - Environment Configs
Environment configs are separate YAML files that define the trading environment. Switch environments via CLI with env=sequential_sltp (see Running Examples).
env/sequential.yaml — Basic sequential trading (SequentialTradingEnv):
```yaml
# @package env
name: SequentialTradingEnv
symbol: "BTC/USD"
time_frames: ["1Hour"]
window_sizes: [24]
execute_on: "1Hour"
leverage: 2
action_levels: [-1.0, 0.0, 1.0]
initial_cash: 10000
transaction_fee: 0.0
slippage: 0.0
bankrupt_threshold: 0.1
include_base_features: false
max_traj_length: null
random_start: true
margin_type: isolated
maintenance_margin_rate: 0.004
seed: 0
train_envs: 5
eval_envs: 1
data_path: Torch-Trade/btcusdt_spot_1m_03_2023_to_12_2025
test_split_start: "2025-01-01"
```
env/sequential_sltp.yaml — Sequential with stop-loss/take-profit (SequentialTradingEnvSLTP):
```yaml
# @package env
name: SequentialTradingEnvSLTP
symbol: "BTC/USD"
time_frames: ["1Hour"]
window_sizes: [24]
execute_on: "1Hour"
leverage: 1
action_levels: [0.0, 1.0]
stoploss_levels: [-0.01, -0.02, -0.03, -0.04]
takeprofit_levels: [0.02, 0.04, 0.06, 0.08, 0.1]
include_hold_action: true
include_close_action: false
initial_cash: 10000
transaction_fee: 0.0
slippage: 0.0
bankrupt_threshold: 0.1
include_base_features: false
max_traj_length: null
random_start: true
margin_type: isolated
maintenance_margin_rate: 0.004
seed: 0
train_envs: 10
eval_envs: 1
data_path: Torch-Trade/btcusdt_spot_1m_03_2023_to_12_2025
test_split_start: "2025-01-01"
```
env/onestep.yaml — One-step environment for GRPO (OneStepTradingEnv):
```yaml
# @package env
name: OneStepTradingEnv
symbol: "BTC/USD"
time_frames: ["1Hour"]
window_sizes: [24]
execute_on: "1Hour"
leverage: 1
initial_cash: 10000
transaction_fee: 0.0
slippage: 0.0
bankrupt_threshold: 0.1
stoploss_levels: [-0.02]
takeprofit_levels: [0.04]
include_hold_action: true
seed: 0
train_envs: 6
eval_envs: 1
data_path: Torch-Trade/btcusdt_spot_1m_03_2023_to_12_2025
test_split_start: "2025-01-01"
```
utils.py - Helper functions (make_env(), make_actor(), make_critic(), make_loss(), make_collector()) that keep the training script clean.
train.py - Main training loop: loads the config via Hydra, creates components via the utils.py helpers, collects data, trains, evaluates, and checkpoints (see the sketch below).
live.py - Live trading script that loads a trained policy and executes it against a live exchange API. The DQN and PPO examples include live.py to demonstrate the smooth transition from backtesting/training to live execution — the same policy trained offline can be deployed directly to a live environment with minimal code changes.
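As a rough orientation, the train.py scripts follow the shape sketched below; the helper signatures are simplified assumptions for illustration, not the repository's exact ones.

```python
import hydra

from utils import make_actor, make_collector, make_critic, make_env, make_loss


@hydra.main(config_path=".", config_name="config", version_base=None)
def main(cfg):
    # Build every component from the Hydra config (signatures simplified).
    train_env, eval_env = make_env(cfg), make_env(cfg)
    actor, critic = make_actor(cfg, train_env), make_critic(cfg, train_env)
    loss_module = make_loss(cfg, actor, critic)
    collector = make_collector(cfg, train_env, actor)

    for batch in collector:
        ...  # compute losses, optimize, periodically evaluate on eval_env and checkpoint


if __name__ == "__main__":
    main()
```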
Available Examples¶
The following examples demonstrate the flexibility of TorchTrade across different algorithms, environments, and use cases. These examples are meant to be starting points for further experimentation and adaptation - customize them according to your needs, ideas, and environments.
Hyperparameters Not Tuned
All hyperparameters in our examples are NOT tuned. The configurations provided are starting points for experimentation, not optimized settings. You should tune hyperparameters (learning rates, network architectures, reward functions, etc.) according to your specific trading environment, market conditions, and objectives.
Online RL (Offline Backtesting Environments)¶
These examples use online RL algorithms (learning from interaction as it happens) with historical market data for backtesting. This allows you to train policies on past data before deploying them to live trading environments. We typically split the training data into training and test environments to evaluate the generalization performance of learned policies on unseen market conditions.
Located in examples/online_rl/:
- PPO - `ppo/` - Standard policy gradient
- PPO + Chronos - `ppo_chronos/` - Time series embedding with Chronos T5 models
- DQN - `dqn/` - Deep Q-learning with experience replay and target networks
- IQL - `iql/` - Implicit Q-Learning
- Discrete SAC - `dsac/` - Discrete Soft Actor-Critic
- GRPO - `grpo/` - Group Relative Policy Optimization (onestep-only, no env switching)
All algorithms except GRPO support environment switching via CLI - see Running Examples below.
Offline RL¶
These examples use offline RL algorithms that learn from pre-collected datasets without requiring live environment interaction during training. The data can be collected from interactions with offline backtesting environments or from real live trading sessions. We provide simple example offline datasets at HuggingFace/Torch-Trade.
Located in examples/offline_rl/:
| Example | Algorithm | Environment | Key Features |
|---|---|---|---|
| iql/ | IQL | SequentialTradingEnv | Offline RL from pre-collected trajectories |
Note: Thanks to the compatibility with TorchRL, you can easily add other offline RL methods like CQL, TD3+BC, and Decision Transformers from TorchRL to do offline RL with TorchTrade environments.
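As a hedged sketch of what this looks like with TorchRL components (the dataset, networks, and hyperparameters below are placeholders, not the iql/ example's actual code), the offline loop simply samples minibatches from a replay buffer filled with the pre-collected transitions:

```python
import torch
from torchrl.data import LazyTensorStorage, TensorDictReplayBuffer
from torchrl.objectives import IQLLoss

# `dataset_td` is assumed to be a TensorDict of pre-collected transitions,
# e.g. loaded from the HuggingFace Torch-Trade datasets (loading code omitted).
replay_buffer = TensorDictReplayBuffer(storage=LazyTensorStorage(max_size=len(dataset_td)))
replay_buffer.extend(dataset_td)

# `actor`, `qvalue`, and `value` are TensorDict modules built as in the iql/ example.
loss_module = IQLLoss(actor_network=actor, qvalue_network=qvalue, value_network=value)
optimizer = torch.optim.Adam(loss_module.parameters(), lr=3e-4)

for _ in range(num_updates):  # num_updates: placeholder
    batch = replay_buffer.sample(256)
    loss_values = loss_module(batch)
    loss = sum(v for k, v in loss_values.items() if k.startswith("loss_"))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```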
LLM Actors¶
TorchTrade provides LLM-based trading actors that leverage language models for trading decision-making. Both frontier API models and local models are supported, each with offline (backtesting) and online (live trading) examples.
Located in examples/llm/:
| Example | Actor | Description |
|---|---|---|
| frontier/offline.py | FrontierLLMActor | Backtesting with frontier LLM APIs (OpenAI, Anthropic, etc.) |
| frontier/live.py | FrontierLLMActor | Live trading with frontier LLM APIs |
| local/offline.py | LocalLLMActor | Backtesting with local LLMs (vLLM/transformers) |
| local/live.py | LocalLLMActor | Live trading with local LLMs |
Local models can be loaded from HuggingFace Models or quantized via Unsloth for memory-efficient inference.
Future Work
Fine-tuning LLMs on reasoning traces from frontier models, and integrating Vision-Language Models (VLMs) to process trading chart plots.
Rule-Based Actors¶
TorchTrade provides RuleBasedActor for creating trading strategies using technical indicators and market signals. These actors integrate seamlessly with TorchTrade environments for both backtesting and live trading, serving as baselines or components in hybrid approaches.
Located in examples/rule_based/:
| Example | Description |
|---|---|
| offline.py | Backtesting a mean reversion strategy on historical data |
| live.py | Live trading with the mean reversion strategy |
Combine with custom feature preprocessing to add technical indicators for rule-based strategies.
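For intuition, the core of a mean reversion rule can be as simple as the signal function below. This is an illustrative standalone function, not the actual RuleBasedActor API; see offline.py for how the real strategy is wired into the environment.

```python
from typing import Optional

import torch


def mean_reversion_signal(close: torch.Tensor, window: int = 24, z_threshold: float = 1.0) -> Optional[float]:
    """Illustrative rule: go long when price is far below its rolling mean, flat when far above."""
    recent = close[-window:]
    z_score = (close[-1] - recent.mean()) / (recent.std() + 1e-8)
    if z_score < -z_threshold:
        return 1.0  # target a full long position
    if z_score > z_threshold:
        return 0.0  # go flat
    return None  # no signal: keep the current position
```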
Future Work
Hybrid approaches that combine rule-based policies with neural network policies as actors, leveraging the strengths of both deterministic strategies and learned behaviors.
Live Trading¶
The DQN and PPO examples include a live.py script alongside the train.py script, demonstrating the smooth transition from offline backtesting to live trading. The same model architecture and utils.py helpers are reused — only the environment changes from an offline SequentialTradingEnv to a live exchange environment. This mirrors the core design philosophy: swap the environment, keep everything else the same.
Live scripts support loading pre-trained weights via --weights and store all live transitions in a replay buffer (saved periodically for crash safety), enabling offline analysis or further training on real market data.
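In spirit (names such as `policy`, `args.weights`, and `live_collector` are placeholders here; the real flags and helpers live in the live.py scripts), loading the trained weights and periodically persisting the buffer look roughly like this:

```python
import torch
from torchrl.data import LazyTensorStorage, TensorDictReplayBuffer

# Load the offline-trained weights into the same policy architecture used in train.py.
policy.load_state_dict(torch.load(args.weights, map_location="cpu"))

# Keep every live transition and persist the buffer periodically for crash safety.
replay_buffer = TensorDictReplayBuffer(storage=LazyTensorStorage(max_size=1_000_000))
for i, batch in enumerate(live_collector):
    replay_buffer.extend(batch.reshape(-1))
    if i % 100 == 0:
        replay_buffer.dumps("live_replay_buffer")  # TorchRL's buffer checkpointing
```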
Located alongside each algorithm in examples/online_rl/:
| Example | Exchange | Algorithm | Description |
|---|---|---|---|
| dqn/live.py | Binance Futures | DQN | Live futures trading with DQN |
| ppo/live.py | Alpaca | PPO | Live spot trading with PPO actor |
Usage:
```bash
# Train offline first
python examples/online_rl/dqn/train.py

# Deploy trained weights to Binance testnet
python examples/online_rl/dqn/live.py --weights dqn_policy_100.pth --demo

# Deploy PPO to Alpaca paper trading
python examples/online_rl/ppo/live.py --weights ppo_policy_100.pth --paper
```
Transforms¶
Inspired by work such as R3M and VIP that utilize large pretrained models for representation learning, we created the ChronosEmbeddingTransform using Chronos forecasting models to embed historical trading data. This demonstrates the flexibility and adaptability of TorchTrade for integrating pretrained models as transforms for enhanced feature representations.
Located in examples/transforms/:
| Example | Transform | Description |
|---|---|---|
| chronos_embedding_example.py | ChronosEmbeddingTransform | Time series embedding with Chronos T5 models |
Note: If you would like us to add additional transforms for other pretrained models (similar to ChronosEmbedding), we welcome GitHub issues with your requests. We're happy to implement these given the availability of model weights and resources.
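The sketch below shows roughly how such a transform plugs into an environment. Note that the import path, constructor arguments, and observation key shown here are assumptions for illustration; see chronos_embedding_example.py for the actual API.

```python
from torchrl.envs import TransformedEnv
from torchtrade.envs.offline import SequentialTradingEnv, SequentialTradingEnvConfig
from torchtrade.transforms import ChronosEmbeddingTransform  # assumed import path

base_env = SequentialTradingEnv(df, SequentialTradingEnvConfig(...))
env = TransformedEnv(
    base_env,
    ChronosEmbeddingTransform(                 # constructor arguments are assumptions
        model_name="amazon/chronos-t5-small",  # a real Chronos T5 checkpoint id
        in_keys=["market_data"],               # assumed observation key
        out_keys=["chronos_embedding"],
    ),
)
```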
Running Examples¶
All examples use Hydra for configuration management with centralized environment configs:
```bash
# Run with default configuration (sequential, spot, 1Hour)
uv run python examples/online_rl/ppo/train.py

# Switch environment via CLI
uv run python examples/online_rl/ppo/train.py env=sequential_sltp
uv run python examples/online_rl/ppo/train.py env=onestep

# Configure for futures trading
uv run python examples/online_rl/ppo/train.py \
    env.leverage=5 \
    env.action_levels='[-1.0,0.0,1.0]'

# Override multiple parameters
uv run python examples/online_rl/ppo/train.py \
    env=sequential_sltp \
    env.symbol="ETH/USD" \
    env.leverage=10 \
    optim.lr=1e-4 \
    loss.gamma=0.95
```
Available Environment Configs¶
| Config | Environment Class | SLTP | Use Case |
|---|---|---|---|
| `sequential` | SequentialTradingEnv | No | Basic sequential trading |
| `sequential_sltp` | SequentialTradingEnvSLTP | Yes | Sequential with bracket orders |
| `onestep` | OneStepTradingEnv | Yes | One-step for GRPO/contextual bandits |
Spot vs Futures:
- Spot (default): `leverage: 1`, `action_levels: [0.0, 1.0]`
- Futures: Override with `env.leverage=5 env.action_levels='[-1.0,0.0,1.0]'`
Common Hydra Overrides¶
| Parameter | Example | Description |
|---|---|---|
| `env.symbol` | `"BTC/USD"` | Trading pair/symbol |
| `env.initial_cash` | `10000` | Starting capital |
| `env.time_frames` | `'["1min","5min"]'` | Multi-timeframe observations |
| `optim.lr` | `1e-4` | Learning rate |
| `loss.gamma` | `0.99` | Discount factor |
| `collector.frames_per_batch` | `2000` | Frames collected per iteration |
| `collector.total_frames` | `100000` | Total training frames |
Creating Your Own Examples¶
Copy an existing example closest to your use case and customize:
- Start with `ppo/` for standard RL, `grpo/` for one-step RL, `ppo_chronos/` for time series embeddings
- Use `env=<config_name>` to switch environments without copying code
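For example, a minimal way to bootstrap a new example (the my_algo directory name is just a placeholder):

```bash
# Copy the closest existing example and iterate on it
cp -r examples/online_rl/ppo examples/online_rl/my_algo

# Train it against a different environment config without touching code
uv run python examples/online_rl/my_algo/train.py env=sequential_sltp optim.lr=1e-4
```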