MR.Q: A New Approach to Reinforcement Learning in Finance

March 18, 2025

Page content

✨ Introduction: Real-Time Self-Tuning with MR.Q

Most machine learning models need hundreds of examples, large GPUs, and hours of training to learn anything useful. But what if you could build a system that gets smarter with just a handful of preference examples, runs entirely on your CPU, and improves while you work?

That’s exactly what MR.Q offers.

🔍 It doesn’t require full retraining.
⚙️ It doesn’t modify the base model.
🧠 It simply learns how to judge quality — and does it fast.

MR.Q is a lightweight evaluator that learns from pairwise comparisons (just like Direct Preference Optimization), but instead of fine-tuning a large model, it trains a tiny value network that predicts which outputs are better. Think of it as your model’s real-time taste tester 🍽️.

In this blog post, we will delve into the inner workings of MR.Q, exploring how it leverages these innovative techniques to excel in various applications, particularly in the realm of stock trading. You’ll gain insights into its practical implementation through detailed Python code examples derived from the official GitHub repository. Additionally, we’ll draw comparisons between MR.Q and the well-established Deep Q-Network (DQN), highlighting the unique advantages that MR.Q brings to the table.

This post aims to equip you with a comprehensive understanding of MR.Q and its transformative potential in the world of AI-powered trading strategies.

🧭 What is DPO? (Direct Preference Optimization)

🧪 DPO is a training method where a model learns from preferences rather than absolute scores.

Instead of asking:

“Is this answer correct?”

It asks:

“Which of these two answers is better?”

This idea is surprisingly powerful. It powers systems like:

🦙 ChatGPT’s Reinforcement Learning from Human Feedback (RLHF)
🏆 Anthropic’s Claude
🧠 OpenAI’s reward models

📚 How it works:

DPO trains a model on (prompt, output A, output B, preferred) tuples:

Prompt: What's the capital of France?
Output A: Paris ✔️
Output B: Berlin ❌
Preferred: A

Instead of retraining the whole model, DPO fine-tunes it to prefer A over B.

🔧 Why MR.Q is Different:

While DPO fine-tunes the full language model, MR.Q takes a shortcut:

✅ It learns to rank outputs using a tiny neural network
✅ It doesn’t need model updates or gradient access
✅ It’s fast, local, and interpretable

🧠 Think of MR.Q as a plug-in judge you can train in seconds and run alongside any language model.

🧠 How MR.Q Works

Key Idea: Model-Based Representations for Q-learning

Instead of directly optimizing a Q-value function like traditional model-free Reinforcement Learning (RL) methods, MR.Q learns state-action embeddings that approximate a linear relationship with the value function. This allows the model to retain the rich representations of model-based RL without explicitly planning future states.

💡 Advantages of using linear representations for Q-values in RL

📏 What Are Linear Representations for Q-values?

In Q-learning, the goal is to estimate the Q-value function, which represents the expected future reward for taking a given action in a particular state.

A linear representation means that instead of using a deep neural network with non-linear activations (like ReLU or Tanh) to approximate Q-values, we use a linear function:

\[ Q(s, a) = w^T \phi(s, a) \]

Where:

\( w \) → A weight vector (parameters learned during training).
\( \phi(s, a) \) → A feature representation of the state-action pair.
\( Q(s, a) \) → The estimated Q-value.

Instead of using deep neural networks, this approach approximates Q-values using simple linear functions, leading to several benefits.

❓ Why Is This Beneficial?

1️⃣ Strong Generalization Across Multiple RL Benchmarks

Generalization means the model can perform well on unseen environments after being trained on similar data.
Since linear models do not overfit as easily as deep networks, they generalize better across different RL tasks.
This is useful when applying RL to multiple stock markets, trading environments, or different control tasks.

Example:

A deep Q-network (DQN) trained on one stock market might fail on another.
A linear Q-function trained on various trading conditions generalizes better.

2️⃣ Better Performance Than Traditional Q-learning Approaches

Traditional Q-learning uses deep neural networks, which require a lot of data and suffer from instability (due to overfitting or divergence).
Linear Q-functions: Require less data to learn meaningful relationships.
Converge faster because they don’t suffer from vanishing/exploding gradients.
Can be interpretable, helping traders understand why certain actions are chosen.

Example:

Traditional DQN struggles with high-dimensional inputs (e.g., financial indicators).
Linear Q-learning finds useful patterns faster with fewer parameters.

3️⃣ Faster Training Than Model-Based Methods

Model-based RL first learns a world model (predicts how the environment behaves) before optimizing a policy.
Model-free methods (like Q-learning) skip the world model and directly optimize the action-value function.
Linear Q-learning is even faster because: No deep neural networks → Faster updates per iteration.
Lower memory usage → No need to store complex network weights.
Efficient updates → Can use matrix operations for batch learning.

Example:

A deep Q-network (DQN) might take millions of steps to learn trading patterns.
Linear Q-learning can learn meaningful strategies in fewer steps.

🤖⚖️ Model-Free vs. Model-Based Reinforcement Learning (RL)

Reinforcement Learning (RL) is about training agents to make decisions by interacting with an environment to maximize rewards. The two main approaches are model-free and model-based RL.

🔄🗺️ Model-Free RL (No Environment Model)

The agent directly learns from trial and error without explicitly modeling the environment.
It does not predict future states—instead, it optimizes its behavior based on observed rewards.
Example algorithms:
- Q-Learning / Deep Q-Networks (DQN)
- Policy Gradient Methods (REINFORCE, PPO, SAC, TD3)

Pros:

Easier to implement.
Works well in complex environments.
No need for an explicit environment model.

Cons:

Requires a lot of experience (slow learning).
Less sample-efficient (high data requirements).
Can struggle with long-term planning.

🧠⚙️ Model-Based RL (Uses an Environment Model)

The agent learns a model of the environment and uses it to predict future states.
It can simulate actions before actually taking them, leading to better decision-making.
Example algorithms:
- AlphaZero (Monte Carlo Tree Search + Deep Learning)
- Dyna-Q (Q-Learning + Planning)
- MuZero (Deep Model-Based RL)

Pros:

More sample-efficient (learns with fewer interactions).
Better at long-term planning.
Can simulate future outcomes before acting.

Cons:

Harder to implement.
Needs a good model of the environment.
Computationally expensive.

Key Differences

Feature	Model-Free RL 🏆	Model-Based RL 🔮
Uses Environment Model?	❌ No	✅ Yes
Efficiency	❌ Needs more experience	✅ Learns faster
Long-Term Planning	❌ Weak	✅ Strong
Computational Cost	✅ Lower	❌ Higher
Works Well in Complex Environments?	✅ Yes	❌ Harder

Example Analogy

Model-Free RL: A tourist exploring a city blindly, learning from mistakes and rewards.
Model-Based RL: A tourist using Google Maps to predict the best route before traveling.

When to Use Each?

Model-Free RL: When the environment is too complex to model, and direct learning is good enough.
Model-Based RL: When the environment is predictable and you want better efficiency & long-term planning.

🧑‍🏫 MR.Q for Stock Market Analysis

🔑 Key Features of MR.Q

MR.Q comes equipped with several innovative features tailored for dynamic decision-making in complex environments like stock trading:

Latent State Representation: Unlike traditional approaches that rely on raw price data, MR.Q learns an abstract representation of market conditions. This allows it to capture underlying patterns and trends effectively.
Multi-Step Prediction (Horizon-Based Learning): By considering future price movements, MR.Q makes informed decisions rather than reacting solely to immediate changes. This forward-looking approach enhances its strategic capabilities.
Reinforcement Learning for Decision-Making: Utilizing reinforcement learning principles, MR.Q learns optimal actions—whether to buy, sell, or hold—based on maximizing cumulative rewards over time.
Versatility with Data Types: Initially designed for game environments involving image data, MR.Q has been adapted to handle numeric financial data seamlessly. This flexibility underscores its broad applicability.

Why is this useful for trading?
Rather than simply responding to historical price movements, MR.Q leverages deep reinforcement learning techniques to predict future market trends. This predictive capability enables smarter, more proactive trading strategies.

🤔 How Does MR.Q Work?

MR.Q consists of three main components:

Encoder: Converts raw data into a latent state representation.
Policy Network: Decides which action to take (buy, sell, hold).
Value Network: Estimates future rewards from actions.

Here’s a simplified workflow for MR.Q in stock trading:

📊 Stock Market Data → 🧠 Encoder (Feature Extraction) → 🎯 Policy (Action Selection: Buy/Sell/Hold) → 💰 Profit Maximization

🐍 MR.Q Python Implementation

1️⃣ Setting Up MR.Q for Stock Market Analysis

To use MR.Q for stock analysis, we will:

Download stock price data (Yahoo Finance)
Modify MR.Q’s environment to handle financial data
Train the RL model on stock market trends
Backtest & evaluate performance

2️⃣ Install Dependencies

pip install yfinance pandas numpy torch

3️⃣ Getting Stock Market Data

Before training MR.Q, we need historical stock prices. We’ll use Yahoo Finance to fetch stock data. For me the data comes down in non standard CSV format. So you need to do a bit of processing to get it right. I want to add a few more columns that will allow the models to get more information from the data.

import yfinance as yf
import pandas as pd

def get_stock_data(symbol, start="2010-01-01", end="2024-01-01"):
    """Fetch stock data from Yahoo Finance."""
    data = yf.download(symbol, start=start, end=end)
    file_name = f"{symbol}_{start}_{end}.csv"
    data.to_csv(file_name)
    headers = ['Date','Close','High','Low','Open','Volume']

    # Load CSV while skipping the first 3 lines
    stock = pd.read_csv(file_name, skiprows=3, names=headers)
    stock["SMA_5"] = stock["Close"].rolling(window=5).mean()
    stock["SMA_20"] = stock["Close"].rolling(window=20).mean()
    stock["Return"] = stock["Close"] - stock["Open"]
    stock["Volatility"] = stock["Return"].rolling(window=5).std()
    stock = stock.iloc[20:] # calc of 20 day moving average will lead ot NaN values
    return stock

data = get_stock_data("AAPL")
FEATURES = ["Close", "SMA_5", "SMA_20", "Return", "Volatility"]
data[FEATURES].head()
data = data[FEATURES]
data = data.astype(float)
dates = data.index
print(data.head())

This will return stock prices & returns, which we use as input for MR.Q.

	Close	SMA_5	SMA_20	Return	Volatility
20	74.933746	76.716121	75.656308	-2.764829	1.555393
21	74.727959	76.702077	75.792250	1.055579	1.651474
22	77.194992	76.758244	76.022854	0.857052	1.592379
23	77.824463	76.618304	76.301999	-0.501150	1.607351
24	78.734787	76.683189	76.568556	0.639152	1.585143

4️⃣ Build the Buffer

The StockReplayBuffer is a memory storage system for reinforcement learning, specifically designed for stock trading agents. It stores past experiences (state, action, reward, next state, done) and retrieves random samples for training.

import numpy as np
import torch

class StockReplayBuffer:
    def __init__(self, max_size=100000):
        self.max_size = max_size
        self.states = []
        self.actions = []
        self.rewards = []
        self.next_states = []
        self.dones = []

    def add(self, state, action, reward, next_state, done):
        if len(self.states) >= self.max_size:
            self.states.pop(0)
            self.actions.pop(0)
            self.rewards.pop(0)
            self.next_states.pop(0)
            self.dones.pop(0)

        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)
        self.next_states.append(next_state)
        self.dones.append(done)

    def sample(self, batch_size):
        indices = np.random.choice(len(self.states), batch_size, replace=False)
        
        # Convert lists to NumPy arrays before converting to tensors
        states_np = np.array(self.states, dtype=np.float32)
        actions_np = np.array(self.actions, dtype=np.int64)
        rewards_np = np.array(self.rewards, dtype=np.float32).reshape(-1, 1)
        next_states_np = np.array(self.next_states, dtype=np.float32)
        dones_np = np.array(self.dones, dtype=np.float32).reshape(-1, 1)

        return (torch.tensor(states_np)[indices],
                torch.tensor(actions_np)[indices],
                torch.tensor(rewards_np)[indices],
                torch.tensor(next_states_np)[indices],
                torch.tensor(dones_np)[indices])

1️⃣ Stores Trading Experiences → Saves state, action, reward, next state, and done flag.
2️⃣ Maintains a Fixed Size → Removes the oldest experience when full.
3️⃣ Samples Random Batches → Randomly selects past experiences for training, preventing bias from consecutive trades.

5️⃣ Implement the RL Agent for Trading (policy network)

💹 MR.Q Stock Trading Agent Implementation

import torch
import torch.nn as nn
import torch.optim as optim
import random

class StockTradingAgent:
    def __init__(self, state_dim, action_dim=3, learning_rate=0.001, exploration_prob=0.2):
        self.policy_network = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, action_dim),
            nn.Softmax(dim=-1)
        )

        self.optimizer = optim.Adam(self.policy_network.parameters(), lr=learning_rate)
        self.loss_fn = nn.MSELoss()
        self.exploration_prob = exploration_prob  # Chance of taking random action

    def select_action(self, state):
        """MR.Q uses softmax-based selection with forced exploration."""
        if random.random() < self.exploration_prob:
            return random.choice([0, 1, 2])  # Random action (Buy, Sell, Hold)

        with torch.no_grad():
            state_tensor = torch.tensor(state, dtype=torch.float32)
            action_probs = self.policy_network(state_tensor)
            return torch.argmax(action_probs).item()  # Choose action with highest probability

    def train(self, buffer, batch_size=32):
        """Train the agent using experiences from the replay buffer."""
        if len(buffer.states) < batch_size:
            return  # Not enough data to train

        states, actions, rewards, next_states, dones = buffer.sample(batch_size)

        # Ensure correct data types
        states = states.float()
        next_states = next_states.float()
        rewards = rewards.float().unsqueeze(1)  # Ensure shape [batch_size, 1]
        dones = dones.float().unsqueeze(1)  # Ensure shape [batch_size, 1]

        # Convert actions to one-hot encoding
        actions_one_hot = torch.nn.functional.one_hot(actions.long(), num_classes=3).float()

        # Compute predicted Q-values
        q_values = self.policy_network(states)

        # Get Q-values for the selected actions
        q_values = (q_values * actions_one_hot).sum(dim=1, keepdim=True)  # Ensures [batch_size, 1]

        # Compute target Q-values correctly
        next_q_values = self.policy_network(next_states).max(dim=1)[0].unsqueeze(1)  # Ensures shape [batch_size, 

        target_q_values = rewards + 0.99 * next_q_values * (1 - dones)  # Ensures [batch_size, 1]

        # Compute loss with correct dimensions
        loss = self.loss_fn(q_values, target_q_values.detach())

        # Update the model
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

6️⃣ Training the RL Trading Agent

buffer = StockReplayBuffer()
agent = StockTradingAgent(state_dim=5)  # We use 5 features: Close, SMA_5, SMA_20, Return, Volatility

for i in range(len(data)-1):
    state = data.iloc[i].values
    next_state = data.iloc[i+1].values

    action = agent.select_action(state)

    # Reward function: Encourage profitable trades
    reward = (next_state[0] - state[0]) * (1 if action == 1 else -1)

    if action == 1 and next_state[0] > state[0]:  # Buy & price goes up 
        reward += 2  
    elif action == 2 and next_state[0] < state[0]:  # Sell & price drops 
        reward += 2  
    elif action == 1 and next_state[0] < state[0]:  # Buy & price goes down 
        reward -= 2  
    elif action == 2 and next_state[0] > state[0]:  # Sell & price goes up 
        reward -= 2  

    done = i == len(data)-2
    buffer.add(state, action, reward, next_state, done)

    # Train agent multiple times per step
    if len(buffer.states) > 32:
        for _ in range(5):  # Train 5x per step
            agent.train(buffer)

🎯 Why we chose this reward function

The reward function guides the learning process of the MR.Q agent, helping it make better trading decisions. Let’s break down the logic behind it.

🧮 Base Reward Formula

reward = (next_state[0] - state[0]) * (1 if action == 1 else -1)

📐 Why This Works

Encourages buying low & selling high → The reward is positive when buying before a price increase.
Penalizes wrong trades → The reward is negative when selling before a price increase.

Scenario	Action	Price Movement	Reward Calculation
✅ Correct Buy	Buy	Price ↑	`(+price_change) * (+1)` (Positive Reward)
❌ Wrong Buy	Buy	Price ↓	`(-price_change) * (+1)` (Negative Reward)
✅ Correct Sell	Sell	Price ↓	`(-price_change) * (-1)` (Positive Reward)
❌ Wrong Sell	Sell	Price ↑	`(+price_change) * (-1)` (Negative Reward)

🔢 Additional Reward Adjustments

These bonus rewards & penalties help fine-tune the agent’s learning.

Condition	Reward Adjustment	Reason
Buy & Price Increases ✅	`+2`	Reinforces good trades
Sell & Price Drops ✅	`+2`	Rewards correct selling decisions
Buy & Price Drops ❌	`-2`	Discourages bad buys
Sell & Price Increases ❌	`-2`	Punishes selling too early

Why This is Useful:
Speeds up learning → The agent learns to avoid bad trades faster.
Encourages profit-taking → The agent doesn’t hold trades indefinitely.
Makes the agent more aggressive → Helps it act, not just hold.

Why This Reward Function is a Good Choice

Encourages profit-maximizing behavior
Discourages risky trades
Encourages active trading (not just holding stocks)
Speeds up learning compared to pure price difference rewards

📈 Possible Improvements

1️⃣ Risk Management → Add penalties for holding too long or over-trading.
2️⃣ Transaction Costs → Subtract a small penalty per trade to simulate real-world commissions.
3️⃣ Sharpe Ratio Consideration → Reward the agent for consistent profits, not just high-risk trades.

7️⃣ Backtesting: Does It Make Money?

capital = 10000  # Initial investment
holdings = 0  # Number of shares owned
portfolio_values = []  # Store portfolio value over time
dates = data.index  # Get dates for x-axis
actions_taken = []  # Store buy/sell actions for plotting

for i in range(len(data)-1):
    state = data.iloc[i].values
    action = agent.select_action(state)
    
    actions_taken.append((dates[i], action))  # Track actions

    if action == 1 and capital >= state[0]:  # Buy
        holdings += 1
        capital -= state[0]
    elif action == 2 and holdings > 0:  # Sell
        holdings -= 1
        capital += state[0]

    total_value = capital + holdings * data.iloc[i]['Close']
    portfolio_values.append(total_value)

final_value = capital + holdings * data.iloc[-1]['Close']
print(f"Final Portfolio Value: ${final_value:.2f}")
print(f"Total Profit: ${final_value - 10000:.2f}")

Final Portfolio Value: $28846.61
Total Profit: $18846.61

8️⃣ Plot the Results

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
plt.plot(dates[:-1], portfolio_values, label="MR.Q Trading Strategy", 
    color="blue", linewidth=2)
plt.xlabel("Date")
plt.ylabel("Portfolio Value ($)")
plt.title("MR.Q Trading Strategy: Portfolio Performance")
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.show()

MR.Q Trades

8️⃣ MR.Q’s Trading Decisions

This will generate a plot of MR.Q Trading decisions.

import matplotlib.pyplot as plt

# Store trade actions
buy_dates = []
sell_dates = []
hold_dates = []

for i in range(len(data) - 1):
    state = data.iloc[i].values
    action = agent.select_action(state)
    
    if action == 1:  # Buy
        buy_dates.append(dates[i])
    elif action == 2:  # Sell
        sell_dates.append(dates[i])
    else:  # Hold
        hold_dates.append(dates[i])

# Plot Stock Prices with Buy/Sell Markers
plt.figure(figsize=(12, 6))
plt.plot(dates[:-1], data["Close"][:-1], label="Stock Price", color="black")

# Buy & Sell markers
plt.scatter(buy_dates, data.loc[buy_dates, "Close"], label="Buy Signal", color="green", marker="^", alpha=1, s=100)
plt.scatter(sell_dates, data.loc[sell_dates, "Close"], label="Sell Signal", color="red", marker="v", alpha=1, s=100)

# Formatting
plt.xlabel("Date")
plt.ylabel("Stock Price ($)")
plt.title("MR.Q Trading Strategy: Buy & Sell Signals")
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.show()

MR.Q Trades

📊⚖️ Performance Comparison

Now, let’s compare MR.Q vs. Buy & Hold with a performance table.

initial_investment = 10000
final_mrq_value = portfolio_values[-1]  # Final MR.Q portfolio value
final_hold_value = (initial_investment / data.iloc[0]["Close"]) * data.iloc[-1]["Close"]  # Buy & Hold return

performance_data = {
    "Strategy": ["MR.Q Reinforcement Learning", "Buy & Hold"],
    "Final Portfolio Value ($)": [round(final_mrq_value, 2), round(final_hold_value, 2)],
    "Profit (%)": [
        round(((final_mrq_value - initial_investment) / initial_investment) * 100, 2),
        round(((final_hold_value - initial_investment) / initial_investment) * 100, 2)
    ]
}

# Convert to DataFrame
performance_df = pd.DataFrame(performance_data)
print(performance_df)

Strategy  Final Portfolio Value ($)  Profit (%)
0  MR.Q Reinforcement Learning                   28972.90      189.73
1                   Buy & Hold                  324697.45     3146.97

⚔️ Comparing MR.Q to DQN

Deep Q-Network (DQN) is a reinforcement learning algorithm that uses a deep neural network to approximate Q-values, enabling agents to learn optimal actions in complex environments by maximizing long-term rewardOkays.

1️⃣ Update the Replay Buffer

For this comparison I am going to enhance the buffer.

This class stores past experiences (state, action, reward, next state, done) in a buffer and assigns priorities to them based on how important they are for learning.

# 🔹 Implement Prioritized Experience Replay (PER)
class PrioritizedReplayBuffer:
    def __init__(self, max_size=100000, alpha=0.6):
        self.max_size = max_size
        self.alpha = alpha  # Prioritization factor
        self.states, self.actions, self.rewards, self.next_states, self.dones = [], [], [], [], []
        self.priorities = []  # Stores priority for each experience

    def add(self, state, action, reward, next_state, done, td_error=1.0):
        """Add experience with priority based on TD error."""
        if len(self.states) >= self.max_size:
            del self.states[0], self.actions[0], self.rewards[0], self.next_states[0], self.dones[0], self.priorities[0]

        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)
        self.next_states.append(next_state)
        self.dones.append(done)
        self.priorities.append(abs(td_error) + 1e-5)  # Small value to prevent zero probability

    def sample(self, batch_size, beta=0.4):
        """Sample experiences with probability proportional to priority."""
        priorities = np.array(self.priorities) ** self.alpha
        probs = priorities / priorities.sum()
        indices = np.random.choice(len(self.states), batch_size, p=probs)

        # Compute importance-sampling weights
        total = len(self.states)
        weights = (total * probs[indices]) ** (-beta)
        weights /= weights.max()

        return (torch.tensor(np.array(self.states), dtype=torch.float32)[indices],
                torch.tensor(np.array(self.actions), dtype=torch.int64)[indices],
                torch.tensor(np.array(self.rewards), dtype=torch.float32)[indices].unsqueeze(1),
                torch.tensor(np.array(self.next_states), dtype=torch.float32)[indices],
                torch.tensor(np.array(self.dones), dtype=torch.float32)[indices].unsqueeze(1),
                torch.tensor(weights, dtype=torch.float32).unsqueeze(1), indices)

    def update_priorities(self, indices, td_errors):
        """Update priorities after training using TD errors."""
        for i, td_error in zip(indices, td_errors):
            self.priorities[i] = abs(td_error.item()) + 1e-5

🛠️ Breakdown of What Each Function Does

add(...) → Stores an experience and assigns it a priority based on TD error (how surprising the result was).
sample(...) → Selects experiences for training, with higher-priority experiences more likely to be chosen.
update_priorities(...) → Adjusts priorities after training so that new, important experiences have a better chance of being selected next time.

❓ Why This Is Useful?

The most important experiences (ones with the highest learning impact) are sampled more often.
This helps the agent learn faster by focusing on the most useful experiences.

🧮 How the Prioritized Experience Replay (PER) Determines Priorities

The priority of an experience is determined using the Temporal Difference (TD) error, which measures how surprising or important a learning update is.

📚 Steps in Determining Priorities

Calculate the TD Error
- The TD error is the difference between the predicted Q-value and the actual reward + discounted future Q-value:
  \[ TD\_error = |Q_{\text{target}} - Q_{\text{predicted}}| \]
- A high TD error means the agent was very wrong about its prediction, so the experience is more important for learning.
Assign Priority Based on TD Error
- Each experience gets a priority score based on its TD error:
  \[ \text{priority} = |\text{TD error}| + \epsilon \]
- \(\epsilon\) (small constant, e.g., 1e-5) prevents priorities from becoming exactly zero.
Control Prioritization Strength (\(\alpha\))
- The priority is raised to the power of \(\alpha\) to control how much prioritization affects sampling: \[ p_i = (\text{priority}_i)^\alpha \]
- \(\alpha = 0\) → Uniform sampling (like normal replay buffer).
- \(\alpha = 1\) → Full prioritization (purely based on TD error).
Convert Priorities into Sampling Probabilities
- Each experience is sampled proportionally to its priority: \[ P(i) = \frac{p_i}{\sum_j p_j} \]
- Higher priority → Higher probability of being selected for training.
Update Priorities After Training
- After training on a batch of experiences, new TD errors are computed.
- The priorities are updated so that the model keeps focusing on important experiences.

⚖️ Example: How Priorities Work

Imagine an agent experiences these rewards:

Experience	Old Q-value	New Q-value	TD Error	Priority
A	5	10	5	5.00001
B	20	21	1	1.00001
C	8	8.2	0.2	0.20001

Experience A is most surprising → Highest priority → More likely to be sampled.
Experience C is less important → Less likely to be used in training.

🧬 Why This Improves Learning

Focuses on important learning moments (big mistakes).
Reduces training time by prioritizing high-impact updates.
Improves stability by balancing old and new experiences.

2️⃣ Define the MR.Q Trading Agent

The MRQTradingAgent is a reinforcement learning (RL) trading agent that uses a softmax based policy network to decide whether to buy, sell, or hold stocks. Unlike traditional Q-learning (DQN), MR.Q optimizes decisions using probabilistic action selection rather than selecting the action with the highest Q-value.

# MR.Q Trading Agent
class MRQTradingAgent:
    def __init__(self, state_dim, action_dim=3, learning_rate=0.001, exploration_prob=0.2):
        self.policy_network = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, action_dim),
            nn.Softmax(dim=-1)
        )
        self.optimizer = optim.Adam(self.policy_network.parameters(), lr=learning_rate)
        self.loss_fn = nn.MSELoss()
        self.exploration_prob = exploration_prob  

    def select_action(self, state):
        """MR.Q uses softmax-based selection with forced exploration."""
        if random.random() < self.exploration_prob:
            return random.choice([0, 1, 2])  # Random action (Buy, Sell, Hold)

        with torch.no_grad():
            state_tensor = torch.tensor(state, dtype=torch.float32)
            action_probs = self.policy_network(state_tensor)
            return torch.multinomial(action_probs, 1).item()  # Sample from softmax probabilities

    def train(self, buffer, batch_size=32):
        if len(buffer.states) < batch_size:
            return  

        states, actions, rewards, next_states, dones, _, _ = buffer.sample(batch_size)

        actions_one_hot = torch.nn.functional.one_hot(actions, num_classes=3).float()
        q_values = self.policy_network(states)
        q_values = (q_values * actions_one_hot).sum(dim=1, keepdim=True)
        next_q_values = self.policy_network(next_states).max(dim=1, keepdim=True)[0]
        target_q_values = rewards + 0.99 * next_q_values * (1 - dones)
        
        loss = self.loss_fn(q_values, target_q_values.detach())

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

📜 Description of `MRQTradingAgent` Class

The MRQTradingAgent is a reinforcement learning (RL) trading agent that uses a softmax-based policy network to decide whether to buy, sell, or hold stocks. Unlike traditional Q-learning (DQN), MR.Q optimizes decisions using probabilistic action selection rather than selecting the action with the highest Q-value.

Key Features:

Softmax-Based Action Selection → Chooses actions probabilistically rather than always picking the highest-value action.
Exploration Probability (ε-greedy-like) → Ensures a balance between random exploration and policy-driven actions.
Policy Network → Uses a neural network to learn optimal trading strategies.
One-Hot Encoded Actions → Encodes actions for stable training.
Gradient-Based Training → Uses MSE loss and Adam optimizer for weight updates.
Future Reward Estimation → Computes discounted future rewards to improve trading decisions over time.

📩 Workflow

1️⃣ Selects an action by either sampling from softmax probabilities or choosing randomly for exploration.
2️⃣ Samples experiences from the replay buffer for training.
3️⃣ Computes expected rewards using next state predictions.
4️⃣ Optimizes the policy network using gradient-based learning.

This agent is designed for dynamic and probabilistic decision-making, making it suitable for uncertain financial markets. 🚀📈

3️⃣ Define the DQN Trading Agent

This code implements a Deep Q-Network (DQN) trading agent that uses reinforcement learning to decide whether to buy, sell, or hold stocks based on historical market data.

# DQN Trading Agent
class DQNTradingAgent:
    def __init__(self, state_dim, action_dim=3, learning_rate=0.001, gamma=0.99, epsilon=1.0, epsilon_min=0.01, epsilon_decay=0.995):
        self.q_network = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, action_dim)
        )
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=learning_rate)
        self.loss_fn = nn.MSELoss()
        self.gamma = gamma  
        self.epsilon = epsilon  
        self.epsilon_min = epsilon_min  
        self.epsilon_decay = epsilon_decay  

    def select_action(self, state):
        """DQN uses epsilon-greedy strategy with forced exploration."""
        if random.random() < self.epsilon:
            return random.choice([0, 1, 2])  # Force exploration more often

        with torch.no_grad():
            state_tensor = torch.tensor(state, dtype=torch.float32)
            action_values = self.q_network(state_tensor)
            return torch.argmax(action_values).item()

        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

    def train(self, buffer, batch_size=32):
        if len(buffer.states) < batch_size:
            return  

        # Now correctly extracts all 7 values from Prioritized Replay Buffer
        states, actions, rewards, next_states, dones, weights, indices = buffer.sample(batch_size)

        q_values = self.q_network(states).gather(1, actions.unsqueeze(1))

        with torch.no_grad():
            next_q_values = self.q_network(next_states).max(dim=1, keepdim=True)[0]
            target_q_values = rewards + self.gamma * next_q_values * (1 - dones)

        # Apply importance-sampling weights to the loss function
        td_errors = target_q_values - q_values
        loss = (weights * td_errors.pow(2)).mean()  

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Update experience priorities in PER
        buffer.update_priorities(indices, td_errors.abs())  

        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)  
        return loss.item()

🧾 Description of `DQNTradingAgent` Class

The DQNTradingAgent is a Deep Q-Network (DQN) trading agent that learns to buy, sell, or hold stocks using reinforcement learning.

🛠️ Key Features

Epsilon-Greedy Exploration → Balances random exploration and exploitation of learned strategies.
Neural Network (Q-Network) → Predicts Q-values for each action (buy, sell, hold).
Prioritized Experience Replay (PER) → Samples important experiences more frequently.
Importance-Sampling Weights → Corrects bias introduced by PER for stable learning.
TD Error-Based Priority Updates → Adjusts experience priorities dynamically based on training feedback.
Adam Optimizer → Efficiently updates the Q-network using gradient descent.

Workflow:

1️⃣ Selects an action using the ε-greedy policy.
2️⃣ Trains by sampling experiences from a prioritized replay buffer.
3️⃣ Computes target Q-values using Bellman’s equation.
4️⃣ Updates Q-network weights using weighted TD errors.
5️⃣ Adjusts priorities to focus learning on high-error experiences.
6️⃣ Gradually reduces exploration (ε decay) to favor learned strategies over time.

This class optimizes trading decisions by learning from past experiences while prioritizing high-impact market events.

4️⃣ Train Both Agents

# 🔹 Train Both Agents with Prioritized Experience Replay (PER)
buffer = PrioritizedReplayBuffer()
mrq_agent = MRQTradingAgent(state_dim=5)
dqn_agent = DQNTradingAgent(state_dim=5)

mrq_values, dqn_values = [10000], [10000]  # Initial capital values

def calculate_reward(action, next_price, current_price):
    """Computes reward based on action and price movement."""
    if action == 1 and next_price > current_price:  # Buy & price increases ✅
        return 5
    elif action == 2 and next_price < current_price:  # Sell & price drops ✅
        return 5
    elif action == 1 and next_price < current_price:  # Buy & price drops ❌
        return -5
    elif action == 2 and next_price > current_price:  # Sell & price increases ❌
        return -5
    return -1  # Encourage trading

for i in range(len(data) - 1):
    state = data.iloc[i].values
    next_state = data.iloc[i + 1].values
    next_price, current_price = next_state[0], state[0]

    # 🔹 Select actions
    mrq_action = mrq_agent.select_action(state)
    dqn_action = dqn_agent.select_action(state)

    # ✅ Compute rewards
    mrq_reward = calculate_reward(mrq_action, next_price, current_price)
    dqn_reward = calculate_reward(dqn_action, next_price, current_price)

    done = (i == len(data) - 2)  # Mark last step

    # 🔹 Store experiences in buffer
    buffer.add(state, mrq_action, mrq_reward, next_state, done)
    buffer.add(state, dqn_action, dqn_reward, next_state, done)

    # 🔹 Train the agents multiple times per step
    if len(buffer.states) > 32:
        for _ in range(10):
            mrq_agent.train(buffer)
            dqn_agent.train(buffer)

    # 🔹 Update portfolio values
    mrq_values.append(mrq_values[-1] + mrq_reward)
    dqn_values.append(dqn_values[-1] + dqn_reward)

5️⃣ Overlay Graph: MR.Q vs. DQN

import matplotlib.pyplot as plt

# 🔹 Plot Performance Comparison
plt.figure(figsize=(12, 6))
plt.plot(dates[:-1], mrq_values, label="MR.Q", color="blue")
plt.plot(dates[:-1], dqn_values, label="DQN", color="red", linestyle="dashed")
plt.legend()
plt.show()

MR.Q vs DQN

Note: I got a wide range of results. This was my favourite.

❓ Why Consider MR.Q Over DQN?

MR.Q and DQN are both reinforcement learning algorithms, but MR.Q offers several advantages over DQN, particularly in certain use cases. Here’s a detailed comparison:

1️⃣ MR.Q Learns Faster (More Sample Efficient)

MR.Q: Uses linear Q-value representations, making it more efficient in learning with fewer data points.
DQN: Requires deep neural networks and large amounts of data to generalize well.
Why it matters: If you have limited training data (e.g., stock market historical data), MR.Q will likely converge faster than DQN.

2️⃣ MR.Q Generalizes Better to New Market Conditions

MR.Q: Uses model-based representations, which help it generalize better across different environments (e.g., different stocks or economic conditions).
DQN: Overfits more easily to the training environment and requires extensive retraining when market conditions change.
Why it matters: If you want an RL model that adapts well across multiple financial markets, MR.Q is the better choice.

3️⃣ MR.Q Requires Less Computational Power

MR.Q: Uses linear representations, meaning it doesn’t require deep networks and runs efficiently on CPUs.
DQN: Uses deep neural networks, which require powerful GPUs for training.
Why it matters: If you don’t have a GPU or need a model that can run on low-power devices, MR.Q is the better option.

4️⃣ MR.Q Avoids Instability in Training

MR.Q: Doesn’t suffer from exploding/vanishing gradients because of its simplified linear Q-learning approach.
DQN: Is prone to training instability, requiring techniques like target networks and experience replay to stabilize.
Why it matters: If you want an easier, more stable training process without much tuning, MR.Q is preferable.

5️⃣ MR.Q Has a More Interpretable Decision Process

MR.Q: Since it uses linear approximations for Q-values, we can easily interpret why it made a certain decision.
DQN: Uses a black-box deep learning model, making it harder to explain decisions.
Why it matters: If you need a transparent model for trading (especially in finance, where regulations require explainability), MR.Q is the better choice.

6️⃣ MR.Q Uses Model-Based Representations for More Strategic Decisions

MR.Q: Uses latent state representations and multi-step prediction, meaning it predicts future trends instead of reacting only to past states.
DQN: Only optimizes immediate rewards without explicitly considering future states.
Why it matters: If you need an agent that thinks ahead (e.g., for long-term trading strategies), MR.Q is more powerful.

7️⃣ MR.Q Avoids the Overestimation Problem in Q-Learning

MR.Q: Uses linear Q-learning approximations, which reduce Q-value overestimation errors.
DQN: Suffers from Q-value overestimation, leading to suboptimal action selection.
Why it matters: If you want a more reliable and consistent reinforcement learning agent, MR.Q is a better choice.

🤔 When to Choose DQN Over MR.Q?

DQN is still a good choice in certain situations:

If you need to model highly complex environments (e.g., high-frequency trading, where deep networks may capture more subtle patterns).
If you have a large dataset and a powerful GPU (DQN can generalize better in big data settings).
If interpretability is not an issue (DQN’s black-box nature is fine in many applications).

🧠 Conclusion: MR.Q vs. DQN – Which Should You Use?

Feature	MR.Q ✅	DQN
Faster learning with fewer samples	✅ Yes	❌ Needs more data
Generalizes to new environments	✅ Yes	❌ More retraining needed
Requires less computation	✅ CPU-friendly	❌ Needs GPU
Stable training	✅ Yes	❌ Can be unstable
Interpretable decisions	✅ Yes	❌ Black-box model
Thinks ahead (multi-step prediction)	✅ Yes	❌ No
Prevents overestimation bias	✅ Yes	❌ Suffers from overestimation
Handles large-scale complex environments	❌ Not ideal	✅ Better

🧑‍⚖️ Final Recommendation

Use MR.Q if you want a fast, stable, and interpretable reinforcement learning model that generalizes well.
Use DQN if you have big data, lots of compute power, and need deep feature learning.

🗺️ MR.Q for stock trading

So far in the post I implemented a basic MR.Q. There is a much more advanced version in the repo. We will modify that MR.Q to work with stock market data instead of image-based inputs.

The updated MR.Q will:

✅ Process stock data (prices, SMA, RSI, volatility) instead of pixels
✅ Use a replay buffer optimized for time-series trading data
✅ Modify the state encoding system to work with stock features
✅ Remove unnecessary image-processing components
✅ Keep MR.Q’s advanced learning methods (prioritized replay, target networks, exploration strategies)


import copy
import dataclasses
from typing import Dict
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

import buffer  # Replay buffer from original MR.Q
import utils  # Utility functions

@dataclasses.dataclass
class Hyperparameters:
    # Generic
    batch_size: int = 256
    buffer_size: int = 1e6
    discount: float = 0.99
    target_update_freq: int = 250

    # Exploration
    buffer_size_before_training: int = 10e3
    exploration_noise: float = 0.2

    # Q-Learning Horizon
    enc_horizon: int = 5
    Q_horizon: int = 3

    # Encoder Model
    zs_dim: int = 512
    za_dim: int = 256
    zsa_dim: int = 512
    enc_hdim: int = 512
    enc_activ: str = 'elu'
    enc_lr: float = 1e-4
    enc_wd: float = 1e-4

    # Value Model
    value_hdim: int = 512
    value_activ: str = 'elu'
    value_lr: float = 3e-4
    value_wd: float = 1e-4
    value_grad_clip: float = 20

    # Policy Model
    policy_hdim: int = 512
    policy_activ: str = 'relu'
    policy_lr: float = 3e-4
    policy_wd: float = 1e-4
    gumbel_tau: float = 10
    pre_activ_weight: float = 1e-5

    def __post_init__(self): 
        enforce_dataclass_type(self)


class MRQStockAgent:
    def __init__(self, obs_shape: tuple, action_dim: int, device: torch.device, history: int=1, hp: Dict={}):
        self.name = 'MR.Q Stock Trader'

        self.hp = Hyperparameters(**hp)
        set_instance_vars(self.hp, self)
        self.device = device

        # Initialize Replay Buffer
        self.replay_buffer = ReplayBuffer(
            obs_shape, action_dim, max_action=1.0, pixel_obs=False, device=self.device,
            history=history, max_size=self.hp.buffer_size, batch_size=self.hp.batch_size
        )

        # Encoder: Now processes stock data, not images
        self.encoder = StockEncoder(obs_shape[0] * history, action_dim,
            self.hp.zs_dim, self.hp.za_dim, self.hp.zsa_dim, self.hp.enc_hdim, self.hp.enc_activ).to(self.device)
        self.encoder_optimizer = optim.AdamW(self.encoder.parameters(), lr=self.hp.enc_lr, weight_decay=self.hp.enc_wd)
        self.encoder_target = copy.deepcopy(self.encoder)

        # Policy & Value Networks
        self.policy = Policy(action_dim, self.hp.gumbel_tau, self.hp.zs_dim, self.hp.policy_hdim, self.hp.policy_activ).to(self.device)
        self.policy_optimizer = optim.AdamW(self.policy.parameters(), lr=self.hp.policy_lr, weight_decay=self.hp.policy_wd)
        self.policy_target = copy.deepcopy(self.policy)

        self.value = Value(self.hp.zsa_dim, self.hp.value_hdim, self.hp.value_activ).to(self.device)
        self.value_optimizer = optim.AdamW(self.value.parameters(), lr=self.hp.value_lr, weight_decay=self.hp.value_wd)
        self.value_target = copy.deepcopy(self.value)

        # Tracked values
        self.reward_scale = 1
        self.training_steps = 0

    def select_action(self, state: np.array, use_exploration: bool=True):
        if self.replay_buffer.size < self.hp.buffer_size_before_training and use_exploration:
            return np.random.choice([0, 1, 2])  # Random action (Buy, Sell, Hold)

        with torch.no_grad():
            state = torch.tensor(state, dtype=torch.float, device=self.device).reshape(-1, *self.replay_buffer.state_shape)
            zs = self.encoder.zs(state)
            action_probs = self.policy.act(zs)
            if use_exploration: 
                action_probs += torch.randn_like(action_probs) * self.hp.exploration_noise
            return int(action_probs.argmax())

    def train(self):
        if self.replay_buffer.size <= self.hp.buffer_size_before_training:
            return

        self.training_steps += 1

        # Target Network Updates
        if (self.training_steps-1) % self.hp.target_update_freq == 0:
            self.policy_target.load_state_dict(self.policy.state_dict())
            self.value_target.load_state_dict(self.value.state_dict())
            self.encoder_target.load_state_dict(self.encoder.state_dict())

        state, action, next_state, reward, not_done = self.replay_buffer.sample(self.hp.Q_horizon)

        Q, Q_target = self.train_rl(state, action, next_state, reward, not_done)

        # Update Priorities
        priority = (Q - Q_target.expand(-1,2)).abs().max(1).values
        self.replay_buffer.update_priority(priority)

    def train_rl(self, state, action, next_state, reward, not_done):
        with torch.no_grad():
            next_zs = self.encoder_target.zs(next_state)
            next_action = self.policy_target.act(next_zs)
            next_zsa = self.encoder_target(next_zs, next_action)
            Q_target = self.value_target(next_zsa).min(1, keepdim=True).values
            Q_target = (reward + not_done * self.hp.discount * Q_target)

        zs = self.encoder.zs(state)
        zsa = self.encoder(zs, action)

        Q = self.value(zsa)
        value_loss = F.smooth_l1_loss(Q, Q_target.expand(-1,2))

        self.value_optimizer.zero_grad()
        value_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.value.parameters(), self.hp.value_grad_clip)
        self.value_optimizer.step()

        return Q, Q_target


class StockEncoder(nn.Module):
    """Modified MRQ Encoder for Stock Trading Data"""
    def __init__(self, state_dim, action_dim, zs_dim, za_dim, zsa_dim, hdim, activ):
        super().__init__()
        self.zs_mlp = nn.Sequential(
            nn.Linear(state_dim, hdim),
            nn.ReLU(),
            nn.Linear(hdim, zs_dim)
        )
        self.za = nn.Linear(action_dim, za_dim)
        self.zsa = nn.Sequential(
            nn.Linear(zs_dim + za_dim, zsa_dim),
            nn.ReLU(),
            nn.Linear(zsa_dim, zsa_dim)
        )
        self.model = nn.Linear(zsa_dim, zs_dim)

    def forward(self, zs, action):
        za = F.relu(self.za(action))
        return self.zsa(torch.cat([zs, za], 1))

    def zs(self, state):
        return F.relu(self.zs_mlp(state))


class Policy(nn.Module):
    def __init__(self, action_dim, gumbel_tau, zs_dim, hdim, activ):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(zs_dim, hdim),
            nn.ReLU(),
            nn.Linear(hdim, action_dim),
            nn.Softmax(dim=-1)
        )

    def forward(self, zs):
        return self.policy(zs)


class Value(nn.Module):
    def __init__(self, zsa_dim, hdim, activ):
        super().__init__()
        self.q1 = nn.Linear(zsa_dim, hdim)
        self.q2 = nn.Linear(hdim, 1)

    def forward(self, zsa):
        return torch.cat([self.q1(zsa), self.q2(zsa)], 1)

🌐 Get Stock data

We are going to update the get_stock_data function we are going to add the rsi.


import yfinance as yf
import pandas as pd

def get_stock_data(symbol, start="2010-01-01", end="2024-01-01"):
    """Fetch stock data from Yahoo Finance."""
    data = yf.download(symbol, start=start, end=end)
    file_name = f"{symbol}_{start}_{end}.csv"
    data.to_csv(file_name)
    headers = ['Date','Close','High','Low','Open','Volume']

    # Load CSV while skipping the first 3 lines
    stock = pd.read_csv(file_name, skiprows=3, names=headers)
    stock["SMA_5"] = stock["Close"].rolling(window=5).mean()
    stock["SMA_20"] = stock["Close"].rolling(window=20).mean()
    stock["Return"] = stock["Close"] - stock["Open"]
    stock["Volatility"] = stock["Return"].rolling(window=5).std()
    delta = stock["Close"].diff(1)
    delta.fillna(0, inplace=True)  # Fix: Replace NaN with 0

    gain = delta.where(delta > 0, 0)
    loss = -delta.where(delta < 0, 0)

    # Ensure no NaN issues in rolling calculations
    avg_gain = gain.rolling(window=14, min_periods=1).mean()
    avg_loss = loss.rolling(window=14, min_periods=1).mean()

    rs = avg_gain / (avg_loss + 1e-9)  # Avoid division by zero
    stock["RSI"] = 100 - (100 / (1 + rs))
    stock = stock.iloc[20:] # calc of 20 day moving average will lead ot NaN values
    return stock

data = get_stock_data("AAPL")
FEATURES = ["Close", "SMA_5", "SMA_20", "Return", "Volatility", "RSI"]
data[FEATURES].head()
data = data[FEATURES]
data = data.astype(float)
dates = data.index
print(data.head())

🚀 Running the MRQStockAgent


# ✅ Set up training environment
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
obs_shape = (len(FEATURES),)  # State dimension
action_dim = 3  # Buy, Sell, Hold

# Initialize MR.Q agent
mrq_agent = MRQStockAgent(obs_shape=obs_shape, action_dim=action_dim, device=device)

# Initialize portfolio tracking
capital = 10000  
holdings = 0  
portfolio_values = []  
buy_dates, sell_dates = [], []

# Training loop
for i in range(len(data) - 1):
    state = data.iloc[i].values
    next_state = data.iloc[i + 1].values

    action = mrq_agent.select_action(state)

    # Compute rewards
    reward = (next_state[0] - state[0]) * (1 if action == 1 else -1)  

    if action == 1 and next_state[0] > state[0]:  
        reward += 2  
    elif action == 2 and next_state[0] < state[0]:  
        reward += 2  
    elif action == 1 and next_state[0] < state[0]:  
        reward -= 2  
    elif action == 2 and next_state[0] > state[0]:  
        reward -= 2  

    # Store experience in replay buffer
    done = (i == len(data) - 2)

    mrq_agent.replay_buffer.add(
        torch.tensor(state, dtype=torch.float32, device=mrq_agent.device).unsqueeze(0).clone().detach(),
        int(action),  
        torch.tensor(next_state, dtype=torch.float32, device=mrq_agent.device).unsqueeze(0).clone().detach(),
        float(reward),  
        bool(done), 
        False 
    )

    # Train the agent
    if mrq_agent.replay_buffer.size > 32:
        for _ in range(10):  
            mrq_agent.train()

    # Portfolio value tracking
    if action == 1 and capital >= state[0]:  
        holdings += 1
        capital -= state[0]
        buy_dates.append(dates[i])
    elif action == 2 and holdings > 0:  
        holdings -= 1
        capital += state[0]
        sell_dates.append(dates[i])

    total_value = capital + holdings * state[0]
    portfolio_values.append(total_value)


final_value = capital + holdings * data.iloc[-1]["Close"]
print(f"Final Portfolio Value: ${final_value:.2f}")
print(f"Total Profit: ${final_value - 10000:.2f}")

Final Portfolio Value: $17337.51
Total Profit: $7337.51

📈 Plot our trades

import matplotlib.pyplot as plt

# Store trade actions
buy_dates = []
sell_dates = []
hold_dates = []

for i in range(len(data) - 1):
    state = data.iloc[i].values
    action = mrq_agent.select_action(state)
    
    if action == 1:  # Buy
        buy_dates.append(dates[i])
    elif action == 2:  # Sell
        sell_dates.append(dates[i])
    else:  # Hold
        hold_dates.append(dates[i])

# Plot Stock Prices with Buy/Sell Markers
plt.figure(figsize=(12, 6))
plt.plot(dates[:-1], data["Close"][:-1], label="Stock Price", color="black")

# Buy & Sell markers
plt.scatter(buy_dates, data.loc[buy_dates, "Close"], label="Buy Signal", color="green", marker="^", alpha=1, s=100)
plt.scatter(sell_dates, data.loc[sell_dates, "Close"], label="Sell Signal", color="red", marker="v", alpha=1, s=100)

# Formatting
plt.xlabel("Date")
plt.ylabel("Stock Price ($)")
plt.title("MR.Q Trading Strategy: Buy & Sell Signals")
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.show()

MR.Q Agent Trades

🛠️ Using Hyperparameters tp optimize MR.Q

To improve MR.Q’s stock trading strategy, we’ll focus on tuning hyperparameters that affect exploration, learning stability, and reward optimization.

🧪 Key Hyperparameters to Optimize

These hyperparameters control MR.Q’s decision-making process:

Hyperparameter	Description	Current Value	Optimized Range
`exploration_noise`	Randomness added to actions to encourage exploration	`0.2`	`0.05 - 0.5`
`buffer_size_before_training`	How many experiences are stored before training starts	`10,000`	`5,000 - 50,000`
`discount (γ)`	How much future rewards matter	`0.99`	`0.9 - 0.99`
`target_update_freq`	How often target networks update	`250`	`100 - 1000`
`policy_lr`	Learning rate of the policy network	`3e-4`	`1e-4 - 5e-4`
`value_lr`	Learning rate of the value network	`3e-4`	`1e-4 - 5e-4`

🛠️ Step 1: Define a Grid Search for Hyperparameters

To find the best hyperparameter values, we’ll use grid search to test different combinations.

🧾 Code: Run a Hyperparameter Grid Search

import itertools
import numpy as np

# Define ranges for hyperparameter tuning
param_grid = {
    "exploration_noise": [0.05, 0.1, 0.2, 0.3, 0.5],
    "buffer_size_before_training": [5000, 10000, 20000, 50000],
    "discount": [0.9, 0.95, 0.99],
    "target_update_freq": [100, 250, 500, 1000],
    "policy_lr": [1e-4, 3e-4, 5e-4],
    "value_lr": [1e-4, 3e-4, 5e-4]
}

# Generate all combinations
param_combinations = list(itertools.product(*param_grid.values()))

# Track best results
best_params = None
best_profit = -np.inf

for params in param_combinations:
    # Extract parameters
    exp_noise, buffer_size, discount, target_freq, policy_lr, value_lr = params
    
    # Initialize MR.Q agent with new hyperparameters
    agent = MRQStockAgent(
        obs_shape=obs_shape,
        action_dim=action_dim,
        device=device,
        hp={
            "exploration_noise": exp_noise,
            "buffer_size_before_training": buffer_size,
            "discount": discount,
            "target_update_freq": target_freq,
            "policy_lr": policy_lr,
            "value_lr": value_lr
        }
    )
    
    # Train the agent and track performance
    final_profit = train_and_evaluate(agent)  # Function to train and return final profit

    # Store best configuration
    if final_profit > best_profit:
        best_profit = final_profit
        best_params = params

print("Best Hyperparameters:", best_params)
print("Best Profit:", best_profit)

⚖️ Step 2: Adjust Exploration-Exploitation Balance

The balance between exploration (trying new actions) and exploitation (using learned knowledge) is critical for MR.Q.

⚔️ Key Adjustments

1️⃣ Decrease exploration_noise gradually over time → Helps the agent start with exploration but converge to optimal actions.

self.exploration_noise = max(0.01, self.exploration_noise * 0.995)  # Decay over time

2️⃣ Use adaptive exploration (ε-greedy with decay):

epsilon = max(0.01, 1 - (self.training_steps / 50000))  # Decrease exploration over time
if random.random() < epsilon:
    action = random.choice([0, 1, 2])  # Random action (Buy, Sell, Hold)
else:
    action = self.policy.act(state)  # Best known action

3️⃣ Increase discount (γ) if trades need long-term consideration:

For short-term trading (scalping) → γ = 0.9
For long-term investments → γ = 0.99

Step 3: Tune Learning Rate for Stability

If learning is unstable, reduce policy_lr and value_lr to prevent overcorrection.
If learning is too slow, increase policy_lr slightly.

Expected Outcome After Optimization

More profitable trades
Faster convergence
Better risk management
More consistent portfolio growth

🏁 Conclusion

In this post, we delved into MR.Q (Model-based Representations for Q-learning), a cutting-edge reinforcement learning algorithm, and explored its application to stock trading. Through a blend of theoretical insights and practical implementation, we demonstrated how MR.Q can be leveraged to make smarter trading decisions. Here’s a summary of what we covered:

✅ Built a Minimal Implementation of MR.Q:
We implemented a simplified version of MR.Q tailored specifically for stock trading, showcasing its ability to learn optimal strategies from historical market data.

✅ Evaluated MR.Q Using Real Stock Market Data:
We tested MR.Q’s performance using real-world stock data, demonstrating its effectiveness in generating profitable trading strategies through backtesting.

✅ Enhanced Learning with Prioritized Experience Replay:
We designed a PrioritizedReplayBuffer, which prioritizes important experiences based on their Temporal Difference (TD) errors, improving the agent’s learning efficiency and decision-making capabilities.

✅ Compared MR.Q with DQN and Buy & Hold Strategies:
We conducted a detailed performance comparison between MR.Q, Deep Q-Network (DQN), and the traditional Buy & Hold strategy, highlighting MR.Q’s advantages in terms of faster learning, better generalization, and computational efficiency.

✅ Adapted MR.Q for Financial Markets:
We modified the original MR.Q architecture to work seamlessly with stock market features (e.g., prices, moving averages, RSI, volatility) instead of image-based inputs. This involved integrating advanced techniques like prioritized experience replay, target networks, and reward shaping, resulting in a version of MR.Q fully optimized for financial applications.

✅ Explored Hyperparameter Tuning for Optimization:
We discussed how to fine-tune MR.Q’s hyperparameters—such as exploration noise, discount factor, and learning rates—to enhance its stability, profitability, and adaptability to different trading scenarios.

✨ Why MR.Q Stands Out

MR.Q bridges the gap between model-free and model-based reinforcement learning by combining the simplicity and efficiency of model-free methods with the rich representations of model-based approaches. Its ability to generalize across different environments, avoid overestimation bias, and provide interpretable decisions makes it particularly well-suited for dynamic and uncertain domains like stock trading.

🏆 Final Thoughts

By leveraging MR.Q, traders and researchers can develop robust AI-driven strategies that adapt to changing market conditions while maintaining computational efficiency. Whether you’re looking to build a fast, stable, and interpretable model or seeking a solution that generalizes well across multiple financial markets, MR.Q offers a compelling alternative to traditional reinforcement learning algorithms like DQN.

📚 References

Towards General-Purpose Model-Free Reinforcement Learning
MR.Q Github

📘 Glossary of Terms

Term	Definition
Reinforcement Learning (RL)	A machine learning approach where agents learn to make decisions by interacting with an environment to maximize cumulative rewards.
Model-Free RL	A type of reinforcement learning where the agent learns an optimal policy without explicitly modeling the environment’s transition dynamics.
Model-Based RL	A reinforcement learning method that builds an explicit model of the environment’s dynamics to plan future actions.
MR.Q (Model-based Representations for Q-learning)	A reinforcement learning algorithm that incorporates model-based representations while retaining the simplicity of model-free methods.
Q-learning	A reinforcement learning algorithm that learns the value of taking certain actions in certain states, aiming to maximize rewards.
Q-value Function	A function that estimates the expected reward for taking a specific action in a given state.
Linear Q-learning	A variant of Q-learning that approximates the Q-value function using linear functions instead of deep neural networks.
Deep Q-Network (DQN)	A reinforcement learning algorithm that uses a deep neural network to approximate Q-values, enabling decision-making in complex environments.
State-Action Embeddings	Vector representations of states and actions that help estimate Q-values in reinforcement learning models.
Bellman Equation	A fundamental equation in reinforcement learning that expresses the relationship between a state-action pair’s value and the values of future states.
Replay Buffer	A memory structure that stores past experiences (state, action, reward, next state) for training reinforcement learning models.
Target Network	A copy of the Q-network in DQN used to stabilize training by reducing Q-value fluctuations.
Exploration vs. Exploitation	The trade-off in RL between exploring new actions to discover better strategies and exploiting known actions to maximize immediate rewards.
Epsilon-Greedy Strategy	A decision-making strategy where the agent randomly explores new actions with probability ε and exploits known best actions otherwise.
Reward Function	The function that assigns a numerical value (reward) to actions based on their desirability in achieving the learning objective.
Portfolio Optimization	The process of selecting investments to maximize returns while managing risk, often using reinforcement learning in trading.
Stock Market Backtesting	A simulation process used to evaluate trading strategies by applying them to historical stock market data.
Policy Network	A neural network in RL that maps states to probabilities of taking different actions.
Value Network	A neural network that estimates the expected return (reward) for a given state in reinforcement learning.
Feature Extraction	The process of transforming raw data (e.g., stock prices) into meaningful input features for machine learning models.
Multi-Step Prediction	A technique where the model predicts multiple future steps ahead rather than relying solely on immediate rewards.
Hyperparameters	Configuration settings that define the behavior of a machine learning algorithm, such as learning rate, batch size, and exploration rate.
Gradient Clipping	A technique used to prevent the gradient explosion problem in deep learning by capping the maximum gradient magnitude.
Adam Optimizer	A widely used optimization algorithm in deep learning that adapts learning rates for different parameters.
Overestimation Bias	A common issue in Q-learning where the algorithm overestimates action values, leading to suboptimal policies.
Moving Average (SMA)	A technical indicator that smooths out stock price fluctuations over a specified period to identify trends.
Volatility	A measure of the variation in stock prices over time, often used to assess risk in trading strategies.
Backpropagation	A key algorithm in deep learning that adjusts neural network weights based on gradient descent to minimize loss.

✨ Introduction: Real-Time Self-Tuning with MR.Q

🧭 What is DPO? (Direct Preference Optimization)

📚 How it works:

🔧 Why MR.Q is Different:

🧠 How MR.Q Works

Key Idea: Model-Based Representations for Q-learning

💡 Advantages of using linear representations for Q-values in RL

📏 What Are Linear Representations for Q-values?

❓ Why Is This Beneficial?

1️⃣ Strong Generalization Across Multiple RL Benchmarks

2️⃣ Better Performance Than Traditional Q-learning Approaches

3️⃣ Faster Training Than Model-Based Methods

🤖⚖️ Model-Free vs. Model-Based Reinforcement Learning (RL)

🔄🗺️ Model-Free RL (No Environment Model)

🧠⚙️ Model-Based RL (Uses an Environment Model)

Key Differences

Example Analogy

When to Use Each?

🧑‍🏫 MR.Q for Stock Market Analysis

🔑 Key Features of MR.Q

🤔 How Does MR.Q Work?

🐍 MR.Q Python Implementation

1️⃣ Setting Up MR.Q for Stock Market Analysis

2️⃣ Install Dependencies

3️⃣ Getting Stock Market Data

4️⃣ Build the Buffer

5️⃣ Implement the RL Agent for Trading (policy network)

💹 MR.Q Stock Trading Agent Implementation

6️⃣ Training the RL Trading Agent

🎯 Why we chose this reward function

🧮 Base Reward Formula

📐 Why This Works

🔢 Additional Reward Adjustments

Why This Reward Function is a Good Choice

📈 Possible Improvements

7️⃣ Backtesting: Does It Make Money?

8️⃣ Plot the Results

8️⃣ MR.Q’s Trading Decisions

📊⚖️ Performance Comparison

⚔️ Comparing MR.Q to DQN

1️⃣ Update the Replay Buffer

🛠️ Breakdown of What Each Function Does

❓ Why This Is Useful?

🧮 How the Prioritized Experience Replay (PER) Determines Priorities

📚 Steps in Determining Priorities

⚖️ Example: How Priorities Work

🧬 Why This Improves Learning

2️⃣ Define the MR.Q Trading Agent

📜 Description of MRQTradingAgent Class

Key Features:

📩 Workflow

3️⃣ Define the DQN Trading Agent

🧾 Description of DQNTradingAgent Class

🛠️ Key Features

Workflow:

4️⃣ Train Both Agents

5️⃣ Overlay Graph: MR.Q vs. DQN

❓ Why Consider MR.Q Over DQN?

1️⃣ MR.Q Learns Faster (More Sample Efficient)

2️⃣ MR.Q Generalizes Better to New Market Conditions

3️⃣ MR.Q Requires Less Computational Power

4️⃣ MR.Q Avoids Instability in Training

5️⃣ MR.Q Has a More Interpretable Decision Process

6️⃣ MR.Q Uses Model-Based Representations for More Strategic Decisions

7️⃣ MR.Q Avoids the Overestimation Problem in Q-Learning

🤔 When to Choose DQN Over MR.Q?

🧠 Conclusion: MR.Q vs. DQN – Which Should You Use?

🧑‍⚖️ Final Recommendation

🗺️ MR.Q for stock trading

🌐 Get Stock data

🚀 Running the MRQStockAgent

📈 Plot our trades

🛠️ Using Hyperparameters tp optimize MR.Q

🧪 Key Hyperparameters to Optimize

🛠️ Step 1: Define a Grid Search for Hyperparameters

🧾 Code: Run a Hyperparameter Grid Search

⚖️ Step 2: Adjust Exploration-Exploitation Balance

⚔️ Key Adjustments

Step 3: Tune Learning Rate for Stability

Expected Outcome After Optimization

📜 Description of `MRQTradingAgent` Class

🧾 Description of `DQNTradingAgent` Class