MR.Q: A New Approach to Reinforcement Learning in Finance
Introduction
In the rapidly evolving world of artificial intelligence, reinforcement learning (RL) stands out as a powerful framework for training AI agents to make decisions in complex and dynamic environments. However, traditional RL algorithms often come with a significant drawback: they are highly specialized and require meticulous tuning for each specific task, making them less adaptable and more resource-intensive.
Enter MR.Q (Model-based Representations for Q-learning)—a groundbreaking advancement in the field of reinforcement learning. MR.Q bridges the gap between model-based and model-free methods by achieving the high performance typically associated with model-based approaches while retaining the simplicity and efficiency of model-free algorithms. The secret behind MR.Q’s success lies in its ability to learn model-based representations without the need for costly trajectory simulations, setting a new standard for adaptability and efficiency in AI-driven decision-making.
In this blog post, we will delve into the inner workings of MR.Q, exploring how it leverages these innovative techniques to excel in various applications, particularly in the realm of stock trading. You’ll gain insights into its practical implementation through detailed Python code examples derived from the official GitHub repository. Additionally, we’ll draw comparisons between MR.Q and the well-established Deep Q-Network (DQN), highlighting the unique advantages that MR.Q brings to the table.
This post aims to equip you with a comprehensive understanding of MR.Q and its transformative potential in the world of AI-powered trading strategies.
How MR.Q Works
Key Idea: Model-Based Representations for Q-learning
Instead of directly optimizing a Q-value function like traditional model-free Reinforcement Learning (RL) methods, MR.Q learns state-action embeddings that approximate a linear relationship with the value function. This allows the model to retain the rich representations of model-based RL without explicitly planning future states.
Advantages of using linear representations for Q-values in RL.
What Are Linear Representations for Q-values?
In Q-learning, the goal is to estimate the Q-value function, which represents the expected future reward for taking a given action in a particular state.
A linear representation means that instead of using a deep neural network with non-linear activations (like ReLU
or Tanh
) to approximate Q-values, we use a linear function:
[ Q(s, a) = w^T \phi(s, a) ]
Where:
- ( w ) → A weight vector (parameters learned during training).
- ( \phi(s, a) ) → A feature representation of the state-action pair.
- ( Q(s, a) ) → The estimated Q-value.
Instead of using deep neural networks, this approach approximates Q-values using simple linear functions, leading to several benefits.
Why Is This Beneficial?
1️⃣ Strong Generalization Across Multiple RL Benchmarks
- Generalization means the model can perform well on unseen environments after being trained on similar data.
- Since linear models do not overfit as easily as deep networks, they generalize better across different RL tasks.
- This is useful when applying RL to multiple stock markets, trading environments, or different control tasks.
Example:
- A deep Q-network (DQN) trained on one stock market might fail on another.
- A linear Q-function trained on various trading conditions generalizes better.
2️⃣ Better Performance Than Traditional Q-learning Approaches
- Traditional Q-learning uses deep neural networks, which require a lot of data and suffer from instability (due to overfitting or divergence).
- Linear Q-functions:
Require less data to learn meaningful relationships.
Converge faster because they don’t suffer from vanishing/exploding gradients.
Can be interpretable, helping traders understand why certain actions are chosen.
Example:
- Traditional DQN struggles with high-dimensional inputs (e.g., financial indicators).
- Linear Q-learning finds useful patterns faster with fewer parameters.
3️⃣ Faster Training Than Model-Based Methods
- Model-based RL first learns a world model (predicts how the environment behaves) before optimizing a policy.
- Model-free methods (like Q-learning) skip the world model and directly optimize the action-value function.
- Linear Q-learning is even faster because:
No deep neural networks → Faster updates per iteration.
Lower memory usage → No need to store complex network weights.
Efficient updates → Can use matrix operations for batch learning.
Example:
- A deep Q-network (DQN) might take millions of steps to learn trading patterns.
- Linear Q-learning can learn meaningful strategies in fewer steps.
Model-Free vs. Model-Based Reinforcement Learning (RL)
Reinforcement Learning (RL) is about training agents to make decisions by interacting with an environment to maximize rewards. The two main approaches are model-free and model-based RL.
Model-Free RL (No Environment Model)
- The agent directly learns from trial and error without explicitly modeling the environment.
- It does not predict future states—instead, it optimizes its behavior based on observed rewards.
- Example algorithms:
- Q-Learning / Deep Q-Networks (DQN)
- Policy Gradient Methods (REINFORCE, PPO, SAC, TD3)
Pros:
- Easier to implement.
- Works well in complex environments.
- No need for an explicit environment model.
Cons:
- Requires a lot of experience (slow learning).
- Less sample-efficient (high data requirements).
- Can struggle with long-term planning.
Model-Based RL (Uses an Environment Model)
- The agent learns a model of the environment and uses it to predict future states.
- It can simulate actions before actually taking them, leading to better decision-making.
- Example algorithms:
- AlphaZero (Monte Carlo Tree Search + Deep Learning)
- Dyna-Q (Q-Learning + Planning)
- MuZero (Deep Model-Based RL)
Pros:
- More sample-efficient (learns with fewer interactions).
- Better at long-term planning.
- Can simulate future outcomes before acting.
Cons:
- Harder to implement.
- Needs a good model of the environment.
- Computationally expensive.
Key Differences
Feature | Model-Free RL 🏆 | Model-Based RL 🔮 |
---|---|---|
Uses Environment Model? | ❌ No | ✅ Yes |
Efficiency | ❌ Needs more experience | ✅ Learns faster |
Long-Term Planning | ❌ Weak | ✅ Strong |
Computational Cost | ✅ Lower | ❌ Higher |
Works Well in Complex Environments? | ✅ Yes | ❌ Harder |
Example Analogy
- Model-Free RL: A tourist exploring a city blindly, learning from mistakes and rewards.
- Model-Based RL: A tourist using Google Maps to predict the best route before traveling.
When to Use Each?
- Model-Free RL: When the environment is too complex to model, and direct learning is good enough.
- Model-Based RL: When the environment is predictable and you want better efficiency & long-term planning.
MR.Q for Stock Market Analysis
Key Features of MR.Q
MR.Q comes equipped with several innovative features tailored for dynamic decision-making in complex environments like stock trading:
-
Latent State Representation: Unlike traditional approaches that rely on raw price data, MR.Q learns an abstract representation of market conditions. This allows it to capture underlying patterns and trends effectively.
-
Multi-Step Prediction (Horizon-Based Learning): By considering future price movements, MR.Q makes informed decisions rather than reacting solely to immediate changes. This forward-looking approach enhances its strategic capabilities.
-
Reinforcement Learning for Decision-Making: Utilizing reinforcement learning principles, MR.Q learns optimal actions—whether to buy, sell, or hold—based on maximizing cumulative rewards over time.
-
Versatility with Data Types: Initially designed for game environments involving image data, MR.Q has been adapted to handle numeric financial data seamlessly. This flexibility underscores its broad applicability.
Why is this useful for trading?
Rather than simply responding to historical price movements, MR.Q leverages deep reinforcement learning techniques to predict future market trends. This predictive capability enables smarter, more proactive trading strategies.
How Does MR.Q Work?
MR.Q consists of three main components:
- Encoder: Converts raw data into a latent state representation.
- Policy Network: Decides which action to take (buy, sell, hold).
- Value Network: Estimates future rewards from actions.
Here’s a simplified workflow for MR.Q in stock trading:
📊 Stock Market Data → 🧠 Encoder (Feature Extraction) → 🎯 Policy (Action Selection: Buy/Sell/Hold) → 💰 Profit Maximization
MR.Q Python Implementation
1️⃣ Setting Up MR.Q for Stock Market Analysis
To use MR.Q for stock analysis, we will:
- Download stock price data (Yahoo Finance)
- Modify MR.Q’s environment to handle financial data
- Train the RL model on stock market trends
- Backtest & evaluate performance
2️⃣ Install Dependencies
pip install yfinance pandas numpy torch
3️⃣ Getting Stock Market Data
Before training MR.Q, we need historical stock prices. We’ll use Yahoo Finance to fetch stock data. For me the data comes down in non standard CSV format. So you need to do a bit of processing to get it right. I want to add a few more columns that will allow the models to get more information from the data.
import yfinance as yf
import pandas as pd
def get_stock_data(symbol, start="2010-01-01", end="2024-01-01"):
"""Fetch stock data from Yahoo Finance."""
data = yf.download(symbol, start=start, end=end)
file_name = f"{symbol}_{start}_{end}.csv"
data.to_csv(file_name)
headers = ['Date','Close','High','Low','Open','Volume']
# Load CSV while skipping the first 3 lines
stock = pd.read_csv(file_name, skiprows=3, names=headers)
stock["SMA_5"] = stock["Close"].rolling(window=5).mean()
stock["SMA_20"] = stock["Close"].rolling(window=20).mean()
stock["Return"] = stock["Close"] - stock["Open"]
stock["Volatility"] = stock["Return"].rolling(window=5).std()
stock = stock.iloc[20:] # calc of 20 day moving average will lead ot NaN values
return stock
data = get_stock_data("AAPL")
FEATURES = ["Close", "SMA_5", "SMA_20", "Return", "Volatility"]
data[FEATURES].head()
data = data[FEATURES]
data = data.astype(float)
dates = data.index
print(data.head())
This will return stock prices & returns, which we use as input for MR.Q.
Close | SMA_5 | SMA_20 | Return | Volatility | |
---|---|---|---|---|---|
20 | 74.933746 | 76.716121 | 75.656308 | -2.764829 | 1.555393 |
21 | 74.727959 | 76.702077 | 75.792250 | 1.055579 | 1.651474 |
22 | 77.194992 | 76.758244 | 76.022854 | 0.857052 | 1.592379 |
23 | 77.824463 | 76.618304 | 76.301999 | -0.501150 | 1.607351 |
24 | 78.734787 | 76.683189 | 76.568556 | 0.639152 | 1.585143 |
4️⃣ Build the Buffer
The StockReplayBuffer is a memory storage system for reinforcement learning, specifically designed for stock trading agents. It stores past experiences (state, action, reward, next state, done) and retrieves random samples for training.
import numpy as np
import torch
class StockReplayBuffer:
def __init__(self, max_size=100000):
self.max_size = max_size
self.states = []
self.actions = []
self.rewards = []
self.next_states = []
self.dones = []
def add(self, state, action, reward, next_state, done):
if len(self.states) >= self.max_size:
self.states.pop(0)
self.actions.pop(0)
self.rewards.pop(0)
self.next_states.pop(0)
self.dones.pop(0)
self.states.append(state)
self.actions.append(action)
self.rewards.append(reward)
self.next_states.append(next_state)
self.dones.append(done)
def sample(self, batch_size):
indices = np.random.choice(len(self.states), batch_size, replace=False)
# Convert lists to NumPy arrays before converting to tensors
states_np = np.array(self.states, dtype=np.float32)
actions_np = np.array(self.actions, dtype=np.int64)
rewards_np = np.array(self.rewards, dtype=np.float32).reshape(-1, 1)
next_states_np = np.array(self.next_states, dtype=np.float32)
dones_np = np.array(self.dones, dtype=np.float32).reshape(-1, 1)
return (torch.tensor(states_np)[indices],
torch.tensor(actions_np)[indices],
torch.tensor(rewards_np)[indices],
torch.tensor(next_states_np)[indices],
torch.tensor(dones_np)[indices])
1️⃣ Stores Trading Experiences → Saves state, action, reward, next state, and done flag.
2️⃣ Maintains a Fixed Size → Removes the oldest experience when full.
3️⃣ Samples Random Batches → Randomly selects past experiences for training, preventing bias from consecutive
trades.
5️⃣ Implement the RL Agent for Trading (policy network)
MR.Q Stock Trading Agent Implementation
import torch
import torch.nn as nn
import torch.optim as optim
import random
class StockTradingAgent:
def __init__(self, state_dim, action_dim=3, learning_rate=0.001, exploration_prob=0.2):
self.policy_network = nn.Sequential(
nn.Linear(state_dim, 64),
nn.ReLU(),
nn.Linear(64, 32),
nn.ReLU(),
nn.Linear(32, action_dim),
nn.Softmax(dim=-1)
)
self.optimizer = optim.Adam(self.policy_network.parameters(), lr=learning_rate)
self.loss_fn = nn.MSELoss()
self.exploration_prob = exploration_prob # Chance of taking random action
def select_action(self, state):
"""MR.Q uses softmax-based selection with forced exploration."""
if random.random() < self.exploration_prob:
return random.choice([0, 1, 2]) # Random action (Buy, Sell, Hold)
with torch.no_grad():
state_tensor = torch.tensor(state, dtype=torch.float32)
action_probs = self.policy_network(state_tensor)
return torch.argmax(action_probs).item() # Choose action with highest probability
def train(self, buffer, batch_size=32):
"""Train the agent using experiences from the replay buffer."""
if len(buffer.states) < batch_size:
return # Not enough data to train
states, actions, rewards, next_states, dones = buffer.sample(batch_size)
# Ensure correct data types
states = states.float()
next_states = next_states.float()
rewards = rewards.float().unsqueeze(1) # Ensure shape [batch_size, 1]
dones = dones.float().unsqueeze(1) # Ensure shape [batch_size, 1]
# Convert actions to one-hot encoding
actions_one_hot = torch.nn.functional.one_hot(actions.long(), num_classes=3).float()
# Compute predicted Q-values
q_values = self.policy_network(states)
# Get Q-values for the selected actions
q_values = (q_values * actions_one_hot).sum(dim=1, keepdim=True) # Ensures [batch_size, 1]
# Compute target Q-values correctly
next_q_values = self.policy_network(next_states).max(dim=1)[0].unsqueeze(1) # Ensures shape [batch_size,
target_q_values = rewards + 0.99 * next_q_values * (1 - dones) # Ensures [batch_size, 1]
# Compute loss with correct dimensions
loss = self.loss_fn(q_values, target_q_values.detach())
# Update the model
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
6️⃣ Training the RL Trading Agent
buffer = StockReplayBuffer()
agent = StockTradingAgent(state_dim=5) # We use 5 features: Close, SMA_5, SMA_20, Return, Volatility
for i in range(len(data)-1):
state = data.iloc[i].values
next_state = data.iloc[i+1].values
action = agent.select_action(state)
# Reward function: Encourage profitable trades
reward = (next_state[0] - state[0]) * (1 if action == 1 else -1)
if action == 1 and next_state[0] > state[0]: # Buy & price goes up
reward += 2
elif action == 2 and next_state[0] < state[0]: # Sell & price drops
reward += 2
elif action == 1 and next_state[0] < state[0]: # Buy & price goes down
reward -= 2
elif action == 2 and next_state[0] > state[0]: # Sell & price goes up
reward -= 2
done = i == len(data)-2
buffer.add(state, action, reward, next_state, done)
# Train agent multiple times per step
if len(buffer.states) > 32:
for _ in range(5): # Train 5x per step
agent.train(buffer)
Why we chose this reward function
The reward function guides the learning process of the MR.Q agent, helping it make better trading decisions. Let’s break down the logic behind it.
Base Reward Formula
reward = (next_state[0] - state[0]) * (1 if action == 1 else -1)
Why This Works
Encourages buying low & selling high → The reward is positive when buying before a price increase.
Penalizes wrong trades → The reward is negative when selling before a price increase.
Scenario | Action | Price Movement | Reward Calculation |
---|---|---|---|
✅ Correct Buy | Buy | Price ↑ | (+price_change) * (+1) (Positive Reward) |
❌ Wrong Buy | Buy | Price ↓ | (-price_change) * (+1) (Negative Reward) |
✅ Correct Sell | Sell | Price ↓ | (-price_change) * (-1) (Positive Reward) |
❌ Wrong Sell | Sell | Price ↑ | (+price_change) * (-1) (Negative Reward) |
Additional Reward Adjustments
These bonus rewards & penalties help fine-tune the agent’s learning.
Condition | Reward Adjustment | Reason |
---|---|---|
Buy & Price Increases ✅ | +2 |
Reinforces good trades |
Sell & Price Drops ✅ | +2 |
Rewards correct selling decisions |
Buy & Price Drops ❌ | -2 |
Discourages bad buys |
Sell & Price Increases ❌ | -2 |
Punishes selling too early |
Why This is Useful:
Speeds up learning → The agent learns to avoid bad trades faster.
Encourages profit-taking → The agent doesn’t hold trades indefinitely.
Makes the agent more aggressive → Helps it act, not just hold.
Why This Reward Function is a Good Choice
Encourages profit-maximizing behavior
Discourages risky trades
Encourages active trading (not just holding stocks)
Speeds up learning compared to pure price difference rewards
Possible Improvements
1️⃣ Risk Management → Add penalties for holding too long or over-trading.
2️⃣ Transaction Costs → Subtract a small penalty per trade to simulate real-world commissions.
3️⃣ Sharpe Ratio Consideration → Reward the agent for consistent profits, not just high-risk trades.
7️⃣ Backtesting: Does It Make Money?
capital = 10000 # Initial investment
holdings = 0 # Number of shares owned
portfolio_values = [] # Store portfolio value over time
dates = data.index # Get dates for x-axis
actions_taken = [] # Store buy/sell actions for plotting
for i in range(len(data)-1):
state = data.iloc[i].values
action = agent.select_action(state)
actions_taken.append((dates[i], action)) # Track actions
if action == 1 and capital >= state[0]: # Buy
holdings += 1
capital -= state[0]
elif action == 2 and holdings > 0: # Sell
holdings -= 1
capital += state[0]
total_value = capital + holdings * data.iloc[i]['Close']
portfolio_values.append(total_value)
final_value = capital + holdings * data.iloc[-1]['Close']
print(f"Final Portfolio Value: ${final_value:.2f}")
print(f"Total Profit: ${final_value - 10000:.2f}")
Final Portfolio Value: $28846.61
Total Profit: $18846.61
8️⃣ Plot the Results
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plt.plot(dates[:-1], portfolio_values, label="MR.Q Trading Strategy",
color="blue", linewidth=2)
plt.xlabel("Date")
plt.ylabel("Portfolio Value ($)")
plt.title("MR.Q Trading Strategy: Portfolio Performance")
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.show()
8️⃣ MR.Q’s Trading Decisions
This will generate a plot of MR.Q Trading decisions.
import matplotlib.pyplot as plt
# Store trade actions
buy_dates = []
sell_dates = []
hold_dates = []
for i in range(len(data) - 1):
state = data.iloc[i].values
action = agent.select_action(state)
if action == 1: # Buy
buy_dates.append(dates[i])
elif action == 2: # Sell
sell_dates.append(dates[i])
else: # Hold
hold_dates.append(dates[i])
# Plot Stock Prices with Buy/Sell Markers
plt.figure(figsize=(12, 6))
plt.plot(dates[:-1], data["Close"][:-1], label="Stock Price", color="black")
# Buy & Sell markers
plt.scatter(buy_dates, data.loc[buy_dates, "Close"], label="Buy Signal", color="green", marker="^", alpha=1, s=100)
plt.scatter(sell_dates, data.loc[sell_dates, "Close"], label="Sell Signal", color="red", marker="v", alpha=1, s=100)
# Formatting
plt.xlabel("Date")
plt.ylabel("Stock Price ($)")
plt.title("MR.Q Trading Strategy: Buy & Sell Signals")
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.show()
Performance Comparison
Now, let’s compare MR.Q vs. Buy & Hold with a performance table.
initial_investment = 10000
final_mrq_value = portfolio_values[-1] # Final MR.Q portfolio value
final_hold_value = (initial_investment / data.iloc[0]["Close"]) * data.iloc[-1]["Close"] # Buy & Hold return
performance_data = {
"Strategy": ["MR.Q Reinforcement Learning", "Buy & Hold"],
"Final Portfolio Value ($)": [round(final_mrq_value, 2), round(final_hold_value, 2)],
"Profit (%)": [
round(((final_mrq_value - initial_investment) / initial_investment) * 100, 2),
round(((final_hold_value - initial_investment) / initial_investment) * 100, 2)
]
}
# Convert to DataFrame
performance_df = pd.DataFrame(performance_data)
print(performance_df)
Strategy Final Portfolio Value ($) Profit (%)
0 MR.Q Reinforcement Learning 28972.90 189.73
1 Buy & Hold 324697.45 3146.97
Comparing MR.Q to DQN
Deep Q-Network (DQN) is a reinforcement learning algorithm that uses a deep neural network to approximate Q-values, enabling agents to learn optimal actions in complex environments by maximizing long-term rewards.
1️⃣ Update the Replay Buffer
For this comparison I am going to enhance the buffer.
This class stores past experiences (state, action, reward, next state, done) in a buffer and assigns priorities to them based on how important they are for learning.
# 🔹 Implement Prioritized Experience Replay (PER)
class PrioritizedReplayBuffer:
def __init__(self, max_size=100000, alpha=0.6):
self.max_size = max_size
self.alpha = alpha # Prioritization factor
self.states, self.actions, self.rewards, self.next_states, self.dones = [], [], [], [], []
self.priorities = [] # Stores priority for each experience
def add(self, state, action, reward, next_state, done, td_error=1.0):
"""Add experience with priority based on TD error."""
if len(self.states) >= self.max_size:
del self.states[0], self.actions[0], self.rewards[0], self.next_states[0], self.dones[0], self.priorities[0]
self.states.append(state)
self.actions.append(action)
self.rewards.append(reward)
self.next_states.append(next_state)
self.dones.append(done)
self.priorities.append(abs(td_error) + 1e-5) # Small value to prevent zero probability
def sample(self, batch_size, beta=0.4):
"""Sample experiences with probability proportional to priority."""
priorities = np.array(self.priorities) ** self.alpha
probs = priorities / priorities.sum()
indices = np.random.choice(len(self.states), batch_size, p=probs)
# Compute importance-sampling weights
total = len(self.states)
weights = (total * probs[indices]) ** (-beta)
weights /= weights.max()
return (torch.tensor(np.array(self.states), dtype=torch.float32)[indices],
torch.tensor(np.array(self.actions), dtype=torch.int64)[indices],
torch.tensor(np.array(self.rewards), dtype=torch.float32)[indices].unsqueeze(1),
torch.tensor(np.array(self.next_states), dtype=torch.float32)[indices],
torch.tensor(np.array(self.dones), dtype=torch.float32)[indices].unsqueeze(1),
torch.tensor(weights, dtype=torch.float32).unsqueeze(1), indices)
def update_priorities(self, indices, td_errors):
"""Update priorities after training using TD errors."""
for i, td_error in zip(indices, td_errors):
self.priorities[i] = abs(td_error.item()) + 1e-5
Breakdown of What Each Function Does:
add(...)
→ Stores an experience and assigns it a priority based on TD error (how surprising the result was).sample(...)
→ Selects experiences for training, with higher-priority experiences more likely to be chosen.update_priorities(...)
→ Adjusts priorities after training so that new, important experiences have a better chance of being selected next time.
Why This Is Useful?
- The most important experiences (ones with the highest learning impact) are sampled more often.
- This helps the agent learn faster by focusing on the most useful experiences.
How the Prioritized Experience Replay (PER) Determines Priorities
The priority of an experience is determined using the Temporal Difference (TD) error, which measures how surprising or important a learning update is.
Steps in Determining Priorities:
-
Calculate the TD Error
- The TD error is the difference between the predicted Q-value and the actual reward + discounted future Q-value:
[ TD_error = |Q_{\text{target}} - Q_{\text{predicted}}| ] - A high TD error means the agent was very wrong about its prediction, so the experience is more important for learning.
- The TD error is the difference between the predicted Q-value and the actual reward + discounted future Q-value:
-
Assign Priority Based on TD Error
- Each experience gets a priority score based on its TD error:
[ \text{priority} = |\text{TD error}| + \epsilon ] - (\epsilon) (small constant, e.g., 1e-5) prevents priorities from becoming exactly zero.
- Each experience gets a priority score based on its TD error:
-
Control Prioritization Strength ((\alpha))
- The priority is raised to the power of (\alpha) to control how much prioritization affects sampling: [ p_i = (\text{priority}_i)^\alpha ]
- (\alpha = 0) → Uniform sampling (like normal replay buffer).
- (\alpha = 1) → Full prioritization (purely based on TD error).
-
Convert Priorities into Sampling Probabilities
- Each experience is sampled proportionally to its priority: [ P(i) = \frac{p_i}{\sum_j p_j} ]
- Higher priority → Higher probability of being selected for training.
-
Update Priorities After Training
- After training on a batch of experiences, new TD errors are computed.
- The priorities are updated so that the model keeps focusing on important experiences.
Example: How Priorities Work
Imagine an agent experiences these rewards:
Experience | Old Q-value | New Q-value | TD Error | Priority |
---|---|---|---|---|
A | 5 | 10 | 5 | 5.00001 |
B | 20 | 21 | 1 | 1.00001 |
C | 8 | 8.2 | 0.2 | 0.20001 |
- Experience A is most surprising → Highest priority → More likely to be sampled.
- Experience C is less important → Less likely to be used in training.
Why This Improves Learning
Focuses on important learning moments (big mistakes).
Reduces training time by prioritizing high-impact updates.
Improves stability by balancing old and new experiences.
2️⃣ Define the MR.Q Trading Agent
The MRQTradingAgent is a reinforcement learning (RL) trading agent that uses a softmax based policy network to decide whether to buy, sell, or hold stocks. Unlike traditional Q-learning (DQN), MR.Q optimizes decisions using probabilistic action selection rather than selecting the action with the highest Q-value.
# MR.Q Trading Agent
class MRQTradingAgent:
def __init__(self, state_dim, action_dim=3, learning_rate=0.001, exploration_prob=0.2):
self.policy_network = nn.Sequential(
nn.Linear(state_dim, 64),
nn.ReLU(),
nn.Linear(64, 32),
nn.ReLU(),
nn.Linear(32, action_dim),
nn.Softmax(dim=-1)
)
self.optimizer = optim.Adam(self.policy_network.parameters(), lr=learning_rate)
self.loss_fn = nn.MSELoss()
self.exploration_prob = exploration_prob
def select_action(self, state):
"""MR.Q uses softmax-based selection with forced exploration."""
if random.random() < self.exploration_prob:
return random.choice([0, 1, 2]) # Random action (Buy, Sell, Hold)
with torch.no_grad():
state_tensor = torch.tensor(state, dtype=torch.float32)
action_probs = self.policy_network(state_tensor)
return torch.multinomial(action_probs, 1).item() # Sample from softmax probabilities
def train(self, buffer, batch_size=32):
if len(buffer.states) < batch_size:
return
states, actions, rewards, next_states, dones, _, _ = buffer.sample(batch_size)
actions_one_hot = torch.nn.functional.one_hot(actions, num_classes=3).float()
q_values = self.policy_network(states)
q_values = (q_values * actions_one_hot).sum(dim=1, keepdim=True)
next_q_values = self.policy_network(next_states).max(dim=1, keepdim=True)[0]
target_q_values = rewards + 0.99 * next_q_values * (1 - dones)
loss = self.loss_fn(q_values, target_q_values.detach())
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
Description of MRQTradingAgent
Class
The MRQTradingAgent
is a reinforcement learning (RL) trading agent that uses a softmax-based policy network to decide whether to buy, sell, or hold stocks. Unlike traditional Q-learning (DQN), MR.Q optimizes decisions using probabilistic action selection rather than selecting the action with the highest Q-value.
Key Features:
Softmax-Based Action Selection → Chooses actions probabilistically rather than always picking the highest-value action.
Exploration Probability (ε
-greedy-like) → Ensures a balance between random exploration and policy-driven actions.
Policy Network → Uses a neural network to learn optimal trading strategies.
One-Hot Encoded Actions → Encodes actions for stable training.
Gradient-Based Training → Uses MSE loss and Adam optimizer for weight updates.
Future Reward Estimation → Computes discounted future rewards to improve trading decisions over time.
Workflow:
1️⃣ Selects an action by either sampling from softmax probabilities or choosing randomly for exploration.
2️⃣ Samples experiences from the replay buffer for training.
3️⃣ Computes expected rewards using next state predictions.
4️⃣ Optimizes the policy network using gradient-based learning.
This agent is designed for dynamic and probabilistic decision-making, making it suitable for uncertain financial markets. 🚀📈
3️⃣ Define the DQN Trading Agent
This code implements a Deep Q-Network (DQN) trading agent that uses reinforcement learning to decide whether to buy, sell, or hold stocks based on historical market data.
# DQN Trading Agent
class DQNTradingAgent:
def __init__(self, state_dim, action_dim=3, learning_rate=0.001, gamma=0.99, epsilon=1.0, epsilon_min=0.01, epsilon_decay=0.995):
self.q_network = nn.Sequential(
nn.Linear(state_dim, 64),
nn.ReLU(),
nn.Linear(64, 32),
nn.ReLU(),
nn.Linear(32, action_dim)
)
self.optimizer = optim.Adam(self.q_network.parameters(), lr=learning_rate)
self.loss_fn = nn.MSELoss()
self.gamma = gamma
self.epsilon = epsilon
self.epsilon_min = epsilon_min
self.epsilon_decay = epsilon_decay
def select_action(self, state):
"""DQN uses epsilon-greedy strategy with forced exploration."""
if random.random() < self.epsilon:
return random.choice([0, 1, 2]) # Force exploration more often
with torch.no_grad():
state_tensor = torch.tensor(state, dtype=torch.float32)
action_values = self.q_network(state_tensor)
return torch.argmax(action_values).item()
self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
def train(self, buffer, batch_size=32):
if len(buffer.states) < batch_size:
return
# Now correctly extracts all 7 values from Prioritized Replay Buffer
states, actions, rewards, next_states, dones, weights, indices = buffer.sample(batch_size)
q_values = self.q_network(states).gather(1, actions.unsqueeze(1))
with torch.no_grad():
next_q_values = self.q_network(next_states).max(dim=1, keepdim=True)[0]
target_q_values = rewards + self.gamma * next_q_values * (1 - dones)
# Apply importance-sampling weights to the loss function
td_errors = target_q_values - q_values
loss = (weights * td_errors.pow(2)).mean()
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
# Update experience priorities in PER
buffer.update_priorities(indices, td_errors.abs())
self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
return loss.item()
Description of DQNTradingAgent
Class
The DQNTradingAgent
is a Deep Q-Network (DQN) trading agent that learns to buy, sell, or hold stocks using reinforcement learning.
Key Features:
Epsilon-Greedy Exploration → Balances random exploration and exploitation of learned strategies.
Neural Network (Q-Network) → Predicts Q-values for each action (buy, sell, hold).
Prioritized Experience Replay (PER) → Samples important experiences more frequently.
Importance-Sampling Weights → Corrects bias introduced by PER for stable learning.
TD Error-Based Priority Updates → Adjusts experience priorities dynamically based on training feedback.
Adam Optimizer → Efficiently updates the Q-network using gradient descent.
Workflow:
1️⃣ Selects an action using the ε-greedy policy.
2️⃣ Trains by sampling experiences from a prioritized replay buffer.
3️⃣ Computes target Q-values using Bellman’s equation.
4️⃣ Updates Q-network weights using weighted TD errors.
5️⃣ Adjusts priorities to focus learning on high-error experiences.
6️⃣ Gradually reduces exploration (ε
decay) to favor learned strategies over time.
This class optimizes trading decisions by learning from past experiences while prioritizing high-impact market events.
4️⃣ Train Both Agents
# 🔹 Train Both Agents with Prioritized Experience Replay (PER)
buffer = PrioritizedReplayBuffer()
mrq_agent = MRQTradingAgent(state_dim=5)
dqn_agent = DQNTradingAgent(state_dim=5)
mrq_values, dqn_values = [10000], [10000] # Initial capital values
def calculate_reward(action, next_price, current_price):
"""Computes reward based on action and price movement."""
if action == 1 and next_price > current_price: # Buy & price increases ✅
return 5
elif action == 2 and next_price < current_price: # Sell & price drops ✅
return 5
elif action == 1 and next_price < current_price: # Buy & price drops ❌
return -5
elif action == 2 and next_price > current_price: # Sell & price increases ❌
return -5
return -1 # Encourage trading
for i in range(len(data) - 1):
state = data.iloc[i].values
next_state = data.iloc[i + 1].values
next_price, current_price = next_state[0], state[0]
# 🔹 Select actions
mrq_action = mrq_agent.select_action(state)
dqn_action = dqn_agent.select_action(state)
# ✅ Compute rewards
mrq_reward = calculate_reward(mrq_action, next_price, current_price)
dqn_reward = calculate_reward(dqn_action, next_price, current_price)
done = (i == len(data) - 2) # Mark last step
# 🔹 Store experiences in buffer
buffer.add(state, mrq_action, mrq_reward, next_state, done)
buffer.add(state, dqn_action, dqn_reward, next_state, done)
# 🔹 Train the agents multiple times per step
if len(buffer.states) > 32:
for _ in range(10):
mrq_agent.train(buffer)
dqn_agent.train(buffer)
# 🔹 Update portfolio values
mrq_values.append(mrq_values[-1] + mrq_reward)
dqn_values.append(dqn_values[-1] + dqn_reward)
5️⃣ Overlay Graph: MR.Q vs. DQN
import matplotlib.pyplot as plt
# 🔹 Plot Performance Comparison
plt.figure(figsize=(12, 6))
plt.plot(dates[:-1], mrq_values, label="MR.Q", color="blue")
plt.plot(dates[:-1], dqn_values, label="DQN", color="red", linestyle="dashed")
plt.legend()
plt.show()
Note: I got a wide range of results. This was my favourite.
Why Consider MR.Q Over DQN?
MR.Q and DQN are both reinforcement learning algorithms, but MR.Q offers several advantages over DQN, particularly in certain use cases. Here’s a detailed comparison:
1️⃣ MR.Q Learns Faster (More Sample Efficient)
- MR.Q: Uses linear Q-value representations, making it more efficient in learning with fewer data points.
- DQN: Requires deep neural networks and large amounts of data to generalize well.
- Why it matters: If you have limited training data (e.g., stock market historical data), MR.Q will likely converge faster than DQN.
2️⃣ MR.Q Generalizes Better to New Market Conditions
- MR.Q: Uses model-based representations, which help it generalize better across different environments (e.g., different stocks or economic conditions).
- DQN: Overfits more easily to the training environment and requires extensive retraining when market conditions change.
- Why it matters: If you want an RL model that adapts well across multiple financial markets, MR.Q is the better choice.
3️⃣ MR.Q Requires Less Computational Power
- MR.Q: Uses linear representations, meaning it doesn’t require deep networks and runs efficiently on CPUs.
- DQN: Uses deep neural networks, which require powerful GPUs for training.
- Why it matters: If you don’t have a GPU or need a model that can run on low-power devices, MR.Q is the better option.
4️⃣ MR.Q Avoids Instability in Training
- MR.Q: Doesn’t suffer from exploding/vanishing gradients because of its simplified linear Q-learning approach.
- DQN: Is prone to training instability, requiring techniques like target networks and experience replay to stabilize.
- Why it matters: If you want an easier, more stable training process without much tuning, MR.Q is preferable.
5️⃣ MR.Q Has a More Interpretable Decision Process
- MR.Q: Since it uses linear approximations for Q-values, we can easily interpret why it made a certain decision.
- DQN: Uses a black-box deep learning model, making it harder to explain decisions.
- Why it matters: If you need a transparent model for trading (especially in finance, where regulations require explainability), MR.Q is the better choice.
6️⃣ MR.Q Uses Model-Based Representations for More Strategic Decisions
- MR.Q: Uses latent state representations and multi-step prediction, meaning it predicts future trends instead of reacting only to past states.
- DQN: Only optimizes immediate rewards without explicitly considering future states.
- Why it matters: If you need an agent that thinks ahead (e.g., for long-term trading strategies), MR.Q is more powerful.
7️⃣ MR.Q Avoids the Overestimation Problem in Q-Learning
- MR.Q: Uses linear Q-learning approximations, which reduce Q-value overestimation errors.
- DQN: Suffers from Q-value overestimation, leading to suboptimal action selection.
- Why it matters: If you want a more reliable and consistent reinforcement learning agent, MR.Q is a better choice.
When to Choose DQN Over MR.Q?
DQN is still a good choice in certain situations:
- If you need to model highly complex environments (e.g., high-frequency trading, where deep networks may capture more subtle patterns).
- If you have a large dataset and a powerful GPU (DQN can generalize better in big data settings).
- If interpretability is not an issue (DQN’s black-box nature is fine in many applications).
Conclusion: MR.Q vs. DQN – Which Should You Use?
Feature | MR.Q ✅ | DQN |
---|---|---|
Faster learning with fewer samples | ✅ Yes | ❌ Needs more data |
Generalizes to new environments | ✅ Yes | ❌ More retraining needed |
Requires less computation | ✅ CPU-friendly | ❌ Needs GPU |
Stable training | ✅ Yes | ❌ Can be unstable |
Interpretable decisions | ✅ Yes | ❌ Black-box model |
Thinks ahead (multi-step prediction) | ✅ Yes | ❌ No |
Prevents overestimation bias | ✅ Yes | ❌ Suffers from overestimation |
Handles large-scale complex environments | ❌ Not ideal | ✅ Better |
Final Recommendation
- Use MR.Q if you want a fast, stable, and interpretable reinforcement learning model that generalizes well.
- Use DQN if you have big data, lots of compute power, and need deep feature learning.
MR.Q for stock trading
So far in the post I implemented a basic MR.Q. There is a much more advanced version in the repo. We will modify that MR.Q to work with stock market data instead of image-based inputs.
The updated MR.Q will:
✅ Process stock data (prices, SMA, RSI, volatility) instead of pixels
✅ Use a replay buffer optimized for time-series trading data
✅ Modify the state encoding system to work with stock features
✅ Remove unnecessary image-processing components
✅ Keep MR.Q’s advanced learning methods (prioritized replay, target networks, exploration strategies)
import copy
import dataclasses
from typing import Dict
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import buffer # Replay buffer from original MR.Q
import utils # Utility functions
@dataclasses.dataclass
class Hyperparameters:
# Generic
batch_size: int = 256
buffer_size: int = 1e6
discount: float = 0.99
target_update_freq: int = 250
# Exploration
buffer_size_before_training: int = 10e3
exploration_noise: float = 0.2
# Q-Learning Horizon
enc_horizon: int = 5
Q_horizon: int = 3
# Encoder Model
zs_dim: int = 512
za_dim: int = 256
zsa_dim: int = 512
enc_hdim: int = 512
enc_activ: str = 'elu'
enc_lr: float = 1e-4
enc_wd: float = 1e-4
# Value Model
value_hdim: int = 512
value_activ: str = 'elu'
value_lr: float = 3e-4
value_wd: float = 1e-4
value_grad_clip: float = 20
# Policy Model
policy_hdim: int = 512
policy_activ: str = 'relu'
policy_lr: float = 3e-4
policy_wd: float = 1e-4
gumbel_tau: float = 10
pre_activ_weight: float = 1e-5
def __post_init__(self):
enforce_dataclass_type(self)
class MRQStockAgent:
def __init__(self, obs_shape: tuple, action_dim: int, device: torch.device, history: int=1, hp: Dict={}):
self.name = 'MR.Q Stock Trader'
self.hp = Hyperparameters(**hp)
set_instance_vars(self.hp, self)
self.device = device
# Initialize Replay Buffer
self.replay_buffer = ReplayBuffer(
obs_shape, action_dim, max_action=1.0, pixel_obs=False, device=self.device,
history=history, max_size=self.hp.buffer_size, batch_size=self.hp.batch_size
)
# Encoder: Now processes stock data, not images
self.encoder = StockEncoder(obs_shape[0] * history, action_dim,
self.hp.zs_dim, self.hp.za_dim, self.hp.zsa_dim, self.hp.enc_hdim, self.hp.enc_activ).to(self.device)
self.encoder_optimizer = optim.AdamW(self.encoder.parameters(), lr=self.hp.enc_lr, weight_decay=self.hp.enc_wd)
self.encoder_target = copy.deepcopy(self.encoder)
# Policy & Value Networks
self.policy = Policy(action_dim, self.hp.gumbel_tau, self.hp.zs_dim, self.hp.policy_hdim, self.hp.policy_activ).to(self.device)
self.policy_optimizer = optim.AdamW(self.policy.parameters(), lr=self.hp.policy_lr, weight_decay=self.hp.policy_wd)
self.policy_target = copy.deepcopy(self.policy)
self.value = Value(self.hp.zsa_dim, self.hp.value_hdim, self.hp.value_activ).to(self.device)
self.value_optimizer = optim.AdamW(self.value.parameters(), lr=self.hp.value_lr, weight_decay=self.hp.value_wd)
self.value_target = copy.deepcopy(self.value)
# Tracked values
self.reward_scale = 1
self.training_steps = 0
def select_action(self, state: np.array, use_exploration: bool=True):
if self.replay_buffer.size < self.hp.buffer_size_before_training and use_exploration:
return np.random.choice([0, 1, 2]) # Random action (Buy, Sell, Hold)
with torch.no_grad():
state = torch.tensor(state, dtype=torch.float, device=self.device).reshape(-1, *self.replay_buffer.state_shape)
zs = self.encoder.zs(state)
action_probs = self.policy.act(zs)
if use_exploration:
action_probs += torch.randn_like(action_probs) * self.hp.exploration_noise
return int(action_probs.argmax())
def train(self):
if self.replay_buffer.size <= self.hp.buffer_size_before_training:
return
self.training_steps += 1
# Target Network Updates
if (self.training_steps-1) % self.hp.target_update_freq == 0:
self.policy_target.load_state_dict(self.policy.state_dict())
self.value_target.load_state_dict(self.value.state_dict())
self.encoder_target.load_state_dict(self.encoder.state_dict())
state, action, next_state, reward, not_done = self.replay_buffer.sample(self.hp.Q_horizon)
Q, Q_target = self.train_rl(state, action, next_state, reward, not_done)
# Update Priorities
priority = (Q - Q_target.expand(-1,2)).abs().max(1).values
self.replay_buffer.update_priority(priority)
def train_rl(self, state, action, next_state, reward, not_done):
with torch.no_grad():
next_zs = self.encoder_target.zs(next_state)
next_action = self.policy_target.act(next_zs)
next_zsa = self.encoder_target(next_zs, next_action)
Q_target = self.value_target(next_zsa).min(1, keepdim=True).values
Q_target = (reward + not_done * self.hp.discount * Q_target)
zs = self.encoder.zs(state)
zsa = self.encoder(zs, action)
Q = self.value(zsa)
value_loss = F.smooth_l1_loss(Q, Q_target.expand(-1,2))
self.value_optimizer.zero_grad()
value_loss.backward()
torch.nn.utils.clip_grad_norm_(self.value.parameters(), self.hp.value_grad_clip)
self.value_optimizer.step()
return Q, Q_target
class StockEncoder(nn.Module):
"""Modified MRQ Encoder for Stock Trading Data"""
def __init__(self, state_dim, action_dim, zs_dim, za_dim, zsa_dim, hdim, activ):
super().__init__()
self.zs_mlp = nn.Sequential(
nn.Linear(state_dim, hdim),
nn.ReLU(),
nn.Linear(hdim, zs_dim)
)
self.za = nn.Linear(action_dim, za_dim)
self.zsa = nn.Sequential(
nn.Linear(zs_dim + za_dim, zsa_dim),
nn.ReLU(),
nn.Linear(zsa_dim, zsa_dim)
)
self.model = nn.Linear(zsa_dim, zs_dim)
def forward(self, zs, action):
za = F.relu(self.za(action))
return self.zsa(torch.cat([zs, za], 1))
def zs(self, state):
return F.relu(self.zs_mlp(state))
class Policy(nn.Module):
def __init__(self, action_dim, gumbel_tau, zs_dim, hdim, activ):
super().__init__()
self.policy = nn.Sequential(
nn.Linear(zs_dim, hdim),
nn.ReLU(),
nn.Linear(hdim, action_dim),
nn.Softmax(dim=-1)
)
def forward(self, zs):
return self.policy(zs)
class Value(nn.Module):
def __init__(self, zsa_dim, hdim, activ):
super().__init__()
self.q1 = nn.Linear(zsa_dim, hdim)
self.q2 = nn.Linear(hdim, 1)
def forward(self, zsa):
return torch.cat([self.q1(zsa), self.q2(zsa)], 1)
Get Stock data
We are going to update the get_stock_data function we are going to add the rsi.
import yfinance as yf
import pandas as pd
def get_stock_data(symbol, start="2010-01-01", end="2024-01-01"):
"""Fetch stock data from Yahoo Finance."""
data = yf.download(symbol, start=start, end=end)
file_name = f"{symbol}_{start}_{end}.csv"
data.to_csv(file_name)
headers = ['Date','Close','High','Low','Open','Volume']
# Load CSV while skipping the first 3 lines
stock = pd.read_csv(file_name, skiprows=3, names=headers)
stock["SMA_5"] = stock["Close"].rolling(window=5).mean()
stock["SMA_20"] = stock["Close"].rolling(window=20).mean()
stock["Return"] = stock["Close"] - stock["Open"]
stock["Volatility"] = stock["Return"].rolling(window=5).std()
delta = stock["Close"].diff(1)
delta.fillna(0, inplace=True) # Fix: Replace NaN with 0
gain = delta.where(delta > 0, 0)
loss = -delta.where(delta < 0, 0)
# Ensure no NaN issues in rolling calculations
avg_gain = gain.rolling(window=14, min_periods=1).mean()
avg_loss = loss.rolling(window=14, min_periods=1).mean()
rs = avg_gain / (avg_loss + 1e-9) # Avoid division by zero
stock["RSI"] = 100 - (100 / (1 + rs))
stock = stock.iloc[20:] # calc of 20 day moving average will lead ot NaN values
return stock
data = get_stock_data("AAPL")
FEATURES = ["Close", "SMA_5", "SMA_20", "Return", "Volatility", "RSI"]
data[FEATURES].head()
data = data[FEATURES]
data = data.astype(float)
dates = data.index
print(data.head())
Running the MRQStockAgent
# ✅ Set up training environment
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
obs_shape = (len(FEATURES),) # State dimension
action_dim = 3 # Buy, Sell, Hold
# Initialize MR.Q agent
mrq_agent = MRQStockAgent(obs_shape=obs_shape, action_dim=action_dim, device=device)
# Initialize portfolio tracking
capital = 10000
holdings = 0
portfolio_values = []
buy_dates, sell_dates = [], []
# Training loop
for i in range(len(data) - 1):
state = data.iloc[i].values
next_state = data.iloc[i + 1].values
action = mrq_agent.select_action(state)
# Compute rewards
reward = (next_state[0] - state[0]) * (1 if action == 1 else -1)
if action == 1 and next_state[0] > state[0]:
reward += 2
elif action == 2 and next_state[0] < state[0]:
reward += 2
elif action == 1 and next_state[0] < state[0]:
reward -= 2
elif action == 2 and next_state[0] > state[0]:
reward -= 2
# Store experience in replay buffer
done = (i == len(data) - 2)
mrq_agent.replay_buffer.add(
torch.tensor(state, dtype=torch.float32, device=mrq_agent.device).unsqueeze(0).clone().detach(),
int(action),
torch.tensor(next_state, dtype=torch.float32, device=mrq_agent.device).unsqueeze(0).clone().detach(),
float(reward),
bool(done),
False
)
# Train the agent
if mrq_agent.replay_buffer.size > 32:
for _ in range(10):
mrq_agent.train()
# Portfolio value tracking
if action == 1 and capital >= state[0]:
holdings += 1
capital -= state[0]
buy_dates.append(dates[i])
elif action == 2 and holdings > 0:
holdings -= 1
capital += state[0]
sell_dates.append(dates[i])
total_value = capital + holdings * state[0]
portfolio_values.append(total_value)
final_value = capital + holdings * data.iloc[-1]["Close"]
print(f"Final Portfolio Value: ${final_value:.2f}")
print(f"Total Profit: ${final_value - 10000:.2f}")
Final Portfolio Value: $17337.51
Total Profit: $7337.51
Plot our trades
import matplotlib.pyplot as plt
# Store trade actions
buy_dates = []
sell_dates = []
hold_dates = []
for i in range(len(data) - 1):
state = data.iloc[i].values
action = mrq_agent.select_action(state)
if action == 1: # Buy
buy_dates.append(dates[i])
elif action == 2: # Sell
sell_dates.append(dates[i])
else: # Hold
hold_dates.append(dates[i])
# Plot Stock Prices with Buy/Sell Markers
plt.figure(figsize=(12, 6))
plt.plot(dates[:-1], data["Close"][:-1], label="Stock Price", color="black")
# Buy & Sell markers
plt.scatter(buy_dates, data.loc[buy_dates, "Close"], label="Buy Signal", color="green", marker="^", alpha=1, s=100)
plt.scatter(sell_dates, data.loc[sell_dates, "Close"], label="Sell Signal", color="red", marker="v", alpha=1, s=100)
# Formatting
plt.xlabel("Date")
plt.ylabel("Stock Price ($)")
plt.title("MR.Q Trading Strategy: Buy & Sell Signals")
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.show()
Ueing Hyperparameters tp optimize MR.Q
To improve MR.Q’s stock trading strategy, we’ll focus on tuning hyperparameters that affect exploration, learning stability, and reward optimization.
Key Hyperparameters to Optimize
These hyperparameters control MR.Q’s decision-making process:
Hyperparameter | Description | Current Value | Optimized Range |
---|---|---|---|
exploration_noise |
Randomness added to actions to encourage exploration | 0.2 |
0.05 - 0.5 |
buffer_size_before_training |
How many experiences are stored before training starts | 10,000 |
5,000 - 50,000 |
discount (γ) |
How much future rewards matter | 0.99 |
0.9 - 0.99 |
target_update_freq |
How often target networks update | 250 |
100 - 1000 |
policy_lr |
Learning rate of the policy network | 3e-4 |
1e-4 - 5e-4 |
value_lr |
Learning rate of the value network | 3e-4 |
1e-4 - 5e-4 |
Step 1: Define a Grid Search for Hyperparameters
To find the best hyperparameter values, we’ll use grid search to test different combinations.
Code: Run a Hyperparameter Grid Search
import itertools
import numpy as np
# Define ranges for hyperparameter tuning
param_grid = {
"exploration_noise": [0.05, 0.1, 0.2, 0.3, 0.5],
"buffer_size_before_training": [5000, 10000, 20000, 50000],
"discount": [0.9, 0.95, 0.99],
"target_update_freq": [100, 250, 500, 1000],
"policy_lr": [1e-4, 3e-4, 5e-4],
"value_lr": [1e-4, 3e-4, 5e-4]
}
# Generate all combinations
param_combinations = list(itertools.product(*param_grid.values()))
# Track best results
best_params = None
best_profit = -np.inf
for params in param_combinations:
# Extract parameters
exp_noise, buffer_size, discount, target_freq, policy_lr, value_lr = params
# Initialize MR.Q agent with new hyperparameters
agent = MRQStockAgent(
obs_shape=obs_shape,
action_dim=action_dim,
device=device,
hp={
"exploration_noise": exp_noise,
"buffer_size_before_training": buffer_size,
"discount": discount,
"target_update_freq": target_freq,
"policy_lr": policy_lr,
"value_lr": value_lr
}
)
# Train the agent and track performance
final_profit = train_and_evaluate(agent) # Function to train and return final profit
# Store best configuration
if final_profit > best_profit:
best_profit = final_profit
best_params = params
print("Best Hyperparameters:", best_params)
print("Best Profit:", best_profit)
Step 2: Adjust Exploration-Exploitation Balance
The balance between exploration (trying new actions) and exploitation (using learned knowledge) is critical for MR.Q.
Key Adjustments
1️⃣ Decrease exploration_noise
gradually over time → Helps the agent start with exploration but converge to optimal actions.
self.exploration_noise = max(0.01, self.exploration_noise * 0.995) # Decay over time
2️⃣ Use adaptive exploration (ε-greedy with decay):
epsilon = max(0.01, 1 - (self.training_steps / 50000)) # Decrease exploration over time
if random.random() < epsilon:
action = random.choice([0, 1, 2]) # Random action (Buy, Sell, Hold)
else:
action = self.policy.act(state) # Best known action
3️⃣ Increase discount (γ)
if trades need long-term consideration:
- For short-term trading (scalping) →
γ = 0.9
- For long-term investments →
γ = 0.99
Step 3: Tune Learning Rate for Stability
- If learning is unstable, reduce
policy_lr
andvalue_lr
to prevent overcorrection. - If learning is too slow, increase
policy_lr
slightly.
Expected Outcome After Optimization
More profitable trades
Faster convergence
Better risk management
More consistent portfolio growth
Conclusion
In this post, we delved into MR.Q (Model-based Representations for Q-learning), a cutting-edge reinforcement learning algorithm, and explored its application to stock trading. Through a blend of theoretical insights and practical implementation, we demonstrated how MR.Q can be leveraged to make smarter trading decisions. Here’s a summary of what we covered:
✅ Built a Minimal Implementation of MR.Q:
We implemented a simplified version of MR.Q tailored specifically for stock trading, showcasing its ability to learn optimal strategies from historical market data.
✅ Evaluated MR.Q Using Real Stock Market Data:
We tested MR.Q’s performance using real-world stock data, demonstrating its effectiveness in generating profitable trading strategies through backtesting.
✅ Enhanced Learning with Prioritized Experience Replay:
We designed a PrioritizedReplayBuffer, which prioritizes important experiences based on their Temporal Difference (TD) errors, improving the agent’s learning efficiency and decision-making capabilities.
✅ Compared MR.Q with DQN and Buy & Hold Strategies:
We conducted a detailed performance comparison between MR.Q, Deep Q-Network (DQN), and the traditional Buy & Hold strategy, highlighting MR.Q’s advantages in terms of faster learning, better generalization, and computational efficiency.
✅ Adapted MR.Q for Financial Markets:
We modified the original MR.Q architecture to work seamlessly with stock market features (e.g., prices, moving averages, RSI, volatility) instead of image-based inputs. This involved integrating advanced techniques like prioritized experience replay, target networks, and reward shaping, resulting in a version of MR.Q fully optimized for financial applications.
✅ Explored Hyperparameter Tuning for Optimization:
We discussed how to fine-tune MR.Q’s hyperparameters—such as exploration noise, discount factor, and learning rates—to enhance its stability, profitability, and adaptability to different trading scenarios.
Why MR.Q Stands Out
MR.Q bridges the gap between model-free and model-based reinforcement learning by combining the simplicity and efficiency of model-free methods with the rich representations of model-based approaches. Its ability to generalize across different environments, avoid overestimation bias, and provide interpretable decisions makes it particularly well-suited for dynamic and uncertain domains like stock trading.
Final Thoughts
By leveraging MR.Q, traders and researchers can develop robust AI-driven strategies that adapt to changing market conditions while maintaining computational efficiency. Whether you’re looking to build a fast, stable, and interpretable model or seeking a solution that generalizes well across multiple financial markets, MR.Q offers a compelling alternative to traditional reinforcement learning algorithms like DQN.
References
Towards General-Purpose Model-Free Reinforcement Learning
MR.Q Github
Glossary of Terms
Term | Definition |
---|---|
Reinforcement Learning (RL) | A machine learning approach where agents learn to make decisions by interacting with an environment to maximize cumulative rewards. |
Model-Free RL | A type of reinforcement learning where the agent learns an optimal policy without explicitly modeling the environment’s transition dynamics. |
Model-Based RL | A reinforcement learning method that builds an explicit model of the environment’s dynamics to plan future actions. |
MR.Q (Model-based Representations for Q-learning) | A reinforcement learning algorithm that incorporates model-based representations while retaining the simplicity of model-free methods. |
Q-learning | A reinforcement learning algorithm that learns the value of taking certain actions in certain states, aiming to maximize rewards. |
Q-value Function | A function that estimates the expected reward for taking a specific action in a given state. |
Linear Q-learning | A variant of Q-learning that approximates the Q-value function using linear functions instead of deep neural networks. |
Deep Q-Network (DQN) | A reinforcement learning algorithm that uses a deep neural network to approximate Q-values, enabling decision-making in complex environments. |
State-Action Embeddings | Vector representations of states and actions that help estimate Q-values in reinforcement learning models. |
Bellman Equation | A fundamental equation in reinforcement learning that expresses the relationship between a state-action pair’s value and the values of future states. |
Replay Buffer | A memory structure that stores past experiences (state, action, reward, next state) for training reinforcement learning models. |
Target Network | A copy of the Q-network in DQN used to stabilize training by reducing Q-value fluctuations. |
Exploration vs. Exploitation | The trade-off in RL between exploring new actions to discover better strategies and exploiting known actions to maximize immediate rewards. |
Epsilon-Greedy Strategy | A decision-making strategy where the agent randomly explores new actions with probability ε and exploits known best actions otherwise. |
Reward Function | The function that assigns a numerical value (reward) to actions based on their desirability in achieving the learning objective. |
Portfolio Optimization | The process of selecting investments to maximize returns while managing risk, often using reinforcement learning in trading. |
Stock Market Backtesting | A simulation process used to evaluate trading strategies by applying them to historical stock market data. |
Policy Network | A neural network in RL that maps states to probabilities of taking different actions. |
Value Network | A neural network that estimates the expected return (reward) for a given state in reinforcement learning. |
Feature Extraction | The process of transforming raw data (e.g., stock prices) into meaningful input features for machine learning models. |
Multi-Step Prediction | A technique where the model predicts multiple future steps ahead rather than relying solely on immediate rewards. |
Hyperparameters | Configuration settings that define the behavior of a machine learning algorithm, such as learning rate, batch size, and exploration rate. |
Gradient Clipping | A technique used to prevent the gradient explosion problem in deep learning by capping the maximum gradient magnitude. |
Adam Optimizer | A widely used optimization algorithm in deep learning that adapts learning rates for different parameters. |
Overestimation Bias | A common issue in Q-learning where the algorithm overestimates action values, leading to suboptimal policies. |
Moving Average (SMA) | A technical indicator that smooths out stock price fluctuations over a specified period to identify trends. |
Volatility | A measure of the variation in stock prices over time, often used to assess risk in trading strategies. |
Backpropagation | A key algorithm in deep learning that adjusts neural network weights based on gradient descent to minimize loss. |