Self-Learning LLMs for Stock Forecasting: A Python Implementation with Direct Preference Optimization
Summary
Forecasting future events is a critical task in fields like finance, politics, and technology. However, improving the forecasting abilities of large language models (LLMs) often requires extensive human supervision. In this post, we explore a novel approach from the paper LLMs Can Teach Themselves to Better Predict the Future that enables LLMs to teach themselves better forecasting skills using self-play and Direct Preference Optimization (DPO). We’ll walk through a Python implementation of this method, step by step.
If we can give our applications the ability to auto-tune themselves, they start to behave like living systems. That would be a concrete step towards Artificial General Intelligence (AGI).
What is Direct Preference Optimization (DPO)?
Direct Preference Optimization (DPO) simplifies the fine-tuning of large language models (LLMs) by directly learning from human preferences.
Think of DPO as a teacher grading student essays. Instead of giving a detailed rubric (like RLHF), the teacher simply points out which essay is better. The student (the LLM) learns directly from these comparisons, improving over time.
Instead of training a separate reward model (as in Reinforcement Learning from Human Feedback, or RLHF), DPO directly optimizes the LLM itself. This involves presenting the model with pairs of responses and asking humans to select the preferred option. By learning from these preferences, the model is gradually refined to generate outputs that better align with human expectations.
This approach eliminates the need for a complex reward model, making the fine-tuning process more streamlined and potentially more stable.
In this blog post we take this a step further and use real market data, rather than human labelers, to determine the preference.
How DPO Works
DPO learns from a dataset of (prompt, preferred response, undesirable response) triplets. The core idea is to adjust the model so that it assigns higher probability to preferred responses and lower probability to undesirable ones.
Steps in DPO:
1. Collect preference data:
 - Given a prompt, human labelers (or AI-assisted systems) rank multiple model-generated responses.
 - This results in preference pairs: one response is preferred, and another is undesirable.
2. Optimize the model directly:
 - DPO adjusts the model’s output probabilities to prefer the responses that raters liked.
 - It avoids reward models and reinforcement learning, relying purely on likelihood-ratio optimization.
3. Fine-tune with supervised learning:
 - The model is fine-tuned by maximizing the likelihood of preferred responses over undesirable ones.
The key advantage is that DPO sidesteps the complexity of reinforcement learning while still aligning models with human preferences.
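To make this concrete, a single training example is just a triplet like the following (illustrative values, not from a real dataset):
# One illustrative preference triplet for DPO (hypothetical data)
preference_example = {
    "prompt": "Will AAPL stock rise tomorrow given today's earnings news?",
    "chosen": "Likely yes: the earnings beat and raised guidance point to upward pressure. Probability: 0.7",
    "rejected": "No: stocks always drop after earnings reports. Probability: 0.9",
}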
How DPO Differs from RLHF
Feature | RLHF | DPO |
---|---|---|
Uses reward model? | ✅ Yes (requires training a separate model) | ❌ No |
Uses reinforcement learning? | ✅ Yes (PPO or other RL methods) | ❌ No |
Computationally expensive? | 🔥 High | ⚡ Lower |
Stability | 🚧 Can be unstable due to RL dynamics | ✅ More stable |
Ease of implementation | 🚀 Complex | ✨ Simple |
Mathematical Formulation of DPO
DPO works by re-weighting the model’s output distribution so that the probability of the preferred response is higher than the undesirable response.
Given:
- A model ( \pi_\theta ) parameterized by ( \theta )
- A dataset of (prompt, preferred response, undesirable response) triplets
DPO optimizes the preference margin using the objective:
[ \mathcal{L}_{\text{DPO}}(\theta) = -\sum_{(x, y^+, y^-)} \log \sigma\left( \beta \log \frac{\pi_\theta(y^+ \mid x)}{\pi_{\text{ref}}(y^+ \mid x)} - \beta \log \frac{\pi_\theta(y^- \mid x)}{\pi_{\text{ref}}(y^- \mid x)} \right) ]
where ( \sigma ) is the logistic function, ( \beta ) controls the strength of the implicit KL penalty, and ( \pi_{\text{ref}} ) is a frozen reference model (typically the checkpoint you start from). Minimizing this loss pushes the model to assign higher probability to preferred responses, relative to the reference, without needing a separate reward model.
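To make the objective concrete, here is a toy computation of the loss for a single preference pair, with made-up log-probabilities and ( \beta = 0.1 ):
import torch
import torch.nn.functional as F

# Hypothetical summed log-probabilities of the chosen (y+) and rejected (y-)
# responses under the policy being trained and the frozen reference model.
policy_logp_chosen, policy_logp_rejected = torch.tensor(-12.0), torch.tensor(-15.0)
ref_logp_chosen, ref_logp_rejected = torch.tensor(-13.0), torch.tensor(-14.0)

beta = 0.1  # controls how far the policy may drift from the reference
margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                 - (policy_logp_rejected - ref_logp_rejected))
loss = -F.logsigmoid(margin)  # -log sigma(margin), per the objective above
print(f"DPO loss: {loss.item():.4f}")  # ~0.5981 here; it shrinks as the margin grows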
Why Use DPO?
🔹 Eliminates complexity: No need for reward models or RL algorithms.
🔹 More stable: Avoids high variance and instability seen in RLHF.
🔹 Faster & cheaper: Requires less computation than RL-based approaches.
🔹 Easier to implement: Uses standard supervised fine-tuning techniques instead of reinforcement learning.
1. Forecasting Dataset
Training Data
The first step is to gather a dataset of binary-outcome forecasting questions. I took a really simple approach:
1. Get news stories about stocks.
2. Ask whether the news is positive for the stock.
3. If it is, assume the stock will appreciate.
I know this is simplistic and not always true, but it holds often enough to be useful.
For this post we are using the oliverwang15/us_stock_news_with_price dataset from Hugging Face.
This dataset has the following format:
Data Description
- date: The date the news was published.
- stock: The ticker symbol of the stock the news relates to (checked by whether the title or content mentions the company).
- title: The title of the news.
- content: The content of the news.
- trading_date: The assumed trading date, taken to be the day after the publish date.
- exact_trading_date: The exact next trading date after the news was made public.
- ts_{-30…-1}: Stock prices before the exact trading date (30 trading days).
- ts_0: Stock price on the exact trading date.
- ts_{1…15}: Stock prices after the exact trading date (15 trading days).
For now we are only interested in a subset of the columns:
Column | New Name | Description |
---|---|---|
stock | ticker | Stock ticker |
title | news_title | Title of the news story |
content | news_summary | The actual news content |
ts_0 | next_day_price | The price after the news |
ts_-1 | news_day_price | The price before the news |
(derived) | news_effect | Calculated sentiment: positive if next_day_price >= news_day_price, else negative |
from datasets import load_dataset

def get_data():
try:
# Load dataset from Hugging Face
dataset = load_dataset("oliverwang15/us_stock_news_with_price")
df = dataset["train"].to_pandas()
print("Available Columns:", df.columns)
# Extract necessary columns and drop missing values
df = df[["stock", "title", "content", "ts_0", "ts_-1", "trading_date", "exact_trading_date"]].dropna()
# Calculate news effect: positive if next_day_price >= news_day_price, else negative
df["news_effect"] = df.apply(
lambda row: "positive" if row["ts_0"] >= row["ts_-1"] else "negative", axis=1
)
# Rename columns for clarity
df_news = df.rename(
{
"stock": "ticker",
"title": "news_title",
"content": "news_summary",
"ts_0": "next_day_price",
"ts_-1": "news_day_price",
},
axis=1,
)
print("Final Columns:", df_news.columns)
return df_news
except Exception as e:
print(f"Error loading or processing data: {e}")
return None
After this step, each news item is effectively a forecasting question with a binary outcome (positive or negative), analogous to the paper's Yes/No questions.
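The ranking and evaluation steps later need a numeric outcome, so it helps to map the label once up front (a convention we adopt for this post, not part of the source dataset):
# Map the text label to a binary outcome: 1 = positive, 0 = negative,
# mirroring the paper's Yes/No question format.
df_news["outcome"] = (df_news["news_effect"] == "positive").astype(int)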
2. Fetching Relevant News
The paper enhances the forecasting ability of LLMs by providing news summaries as additional context.
We already have the news data in our dataset.
Python Code: Generating News Summaries
Let’s summarize the news and extract its sentiment.
Create a database to store our data
We create four tables:
- stock_news: The information we are interested in from the initial data source.
- news_sentiment: The sentiment generated by the model (positive/negative), along with its explanation.
- news_forecast: The forecasts produced during self-play.
- model_updates: The idea is that the model will be updated on a schedule with the latest news. This table tracks that process.
import sqlite3

def create_database():
# Create SQLite database connection
conn = sqlite3.connect("stock_news.db")
cursor = conn.cursor()
# Enable foreign key support
cursor.execute("PRAGMA foreign_keys = ON;")
# Create tables in SQLite
cursor.execute(
"""
CREATE TABLE IF NOT EXISTS stock_news (
id INTEGER PRIMARY KEY AUTOINCREMENT,
ticker TEXT,
news_title TEXT,
news_summary TEXT,
next_day_price FLOAT,
news_day_price FLOAT,
trading_date TEXT,
exact_trading_date TEXT,
news_effect TEXT
);
"""
)
cursor.execute(
"""
CREATE TABLE IF NOT EXISTS news_sentiment (
id INTEGER PRIMARY KEY AUTOINCREMENT,
news_id INTEGER,
sentiment TEXT,
news_explanation TEXT,
FOREIGN KEY (news_id) REFERENCES stock_news(id) ON DELETE CASCADE
);
"""
)
cursor.execute(
"""
CREATE TABLE IF NOT EXISTS news_forecast (
id INTEGER PRIMARY KEY AUTOINCREMENT,
news_id INTEGER,
ticker TEXT,
forecast_1 TEXT,
probability_1 FLOAT,
forecast_2 TEXT,
probability_2 FLOAT,
best_forecast TEXT,
FOREIGN KEY (news_id) REFERENCES news_sentiment(news_id) ON DELETE CASCADE
);
"""
)
cursor.execute("""
CREATE TABLE IF NOT EXISTS model_updates (
id INTEGER PRIMARY KEY AUTOINCREMENT,
filename TEXT,
update_date TEXT,
brier_score FLOAT
);
""")
# Insert the Hugging Face data into the stock_news table
def insert_data(df_data):
conn = sqlite3.connect("stock_news.db")
cursor = conn.cursor()
for _, row in df_data.iterrows():
cursor.execute("""
INSERT INTO stock_news (ticker, news_title, news_summary, next_day_price, news_day_price, news_effect)
VALUES (?, ?, ?, ?, ?, ?);
""", (row["ticker"], row["news_title"], row["news_summary"], row["next_day_price"],
row["news_day_price"], row["news_effect"]))
conn.commit()
conn.close()
Analyze sentiment with explanation
Here we ask the model to determine the sentiment of the news. We also ask it to generate a brief explanation of why it thinks this is the case.
Sentiment analysis helps us understand whether news is likely to have a positive or negative impact on stock prices. By analyzing the sentiment of news articles, we aim to better predict stock movements.
import json
from ollama import chat, ChatResponse

def analyze_sentiment_and_explain(news_title, news_summary):
"""Use the model to analyze sentiment and provide an explanation in JSON format."""
prompt = f"""
Analyze the sentiment of the following stock news article and explain your analysis
Title: {news_title}
Summary: {news_summary}
Provide the response in **valid JSON format**:
{{
"sentiment": "<positive/negative/neutral>",
"explanation": "<Brief explanation>"
}}
"""
response: ChatResponse = chat(model='qwen2.5', messages=[
{
'role': 'user',
            'content': prompt
}])
response_text = response['message']['content'].strip()
try:
sentiment_data = json.loads(response_text) # Parse JSON response
sentiment = sentiment_data.get(
"sentiment", "neutral"
) # Default to Neutral if sentiment is not provided
sentiment = sentiment.lower()
explanation = sentiment_data.get("explanation", "No explanation available.")
except json.JSONDecodeError as e:
# If JSON parsing fails, default to Neutral sentiment
sentiment, explanation = "neutral", f"Sentiment could not be determined. {str(e)}"
return sentiment, explanation
def get_sentiment():
conn = sqlite3.connect("stock_news.db")
cursor = conn.cursor()
cursor.execute("""SELECT id, ticker, news_title, news_summary
FROM stock_news
LIMIT 20;""") # just a subset of the data for now
news_data = cursor.fetchall()
for news_id, ticker, title, summary in news_data:
        # Call the model with our data
sentiment, explanation = analyze_sentiment_and_explain(title, summary)
# Insert sentiment & explanation into news_sentiment table
cursor.execute(
"""
INSERT INTO news_sentiment (news_id, sentiment, news_explanation)
VALUES (?, ?, ?);
""",
(news_id, sentiment, explanation),
)
conn.commit()
conn.close()
Example results:
id | news_id | sentiment | news_explanation |
---|---|---|---|
1 | 1 | positive | The article is positive as it mentions that RadioShack should post ‘outsized gains next year’ according to Barclays. It also lists several positive factors contributing to the outlook, such as the addition of T-Mobile as a carrier and new branding campaigns. The conclusion that the stock ‘is a solid investment for 2010’ further reinforces the positive sentiment. |
2 | 2 | neutral | The article presents both positive and negative aspects of AT&T’s network upgrade efforts. On one hand, it acknowledges the company’s response to criticism by upgrading as fast as possible and planning measures to handle high-bandwidth users. However, it also highlights that despite this effort, financial data shows that AT&T has spent less on network buildout every quarter since the iPhone was launched, which could be seen as a negative sign for future network quality or expansion. The overall tone is balanced without strong positive or negative connotations. |
3. Generating Forecasts via Model Self-Play
The core innovation in the paper is self-play, where the model generates multiple forecasts for the same question. This allows the model to explore different reasoning paths for the same event.
Here we tell the model: “here is a claimed sentiment for this news; what do you think the odds are?” This forces the model to think more deeply about the problem.
It also gives us a way to measure how good the model has become at predicting stock movements from the news.
We now know how confident the model is that the news is positive or negative. If we feed this back in and get the model to improve these scores over time, its forecasting should improve.
import re

def extract_json_from_tags(text):
"""Extract JSON content between ```json and ``` tags."""
match = re.search(r"```json\s*(\{.*?\})\s*```", text, re.DOTALL)
return match.group(1) if match else None
def forecast_sentiment(ticker, sentiment, news_explanation):
"""Prompt GPT-4 to generate a stock movement forecast as 'positive' or 'negative'."""
prompt = f"""
    Given the question: Is this news {sentiment} for stock {ticker}?
And the following news summary about the stock: {news_explanation}
Provide a JSON response with the following:
{{
"forecast": "<positive/negative>",
"probability": <A number between 0 and 1>
"explanation": "<Brief explanation>"
}}
"""
response: ChatResponse = chat(model='qwen2.5', messages=[
{
'role': 'user',
            'content': prompt
}])
response_text = response['message']['content'].strip()
if "```json" in response_text:
response_text = extract_json_from_tags(response_text)
try:
forecast_data = json.loads(response_text) # Parse JSON response
forecast = forecast_data.get(
"forecast", "missing"
) # Default to "missing" if missing
probability = forecast_data.get("probability", 0.0)
    except (json.JSONDecodeError, TypeError):  # TypeError covers a missing JSON block (None)
forecast, probability = "missing", 0.0
return forecast, probability
# Fetch sentiment data for forecasting along with the stock ticker
def gen_forecast():
conn = sqlite3.connect("stock_news.db")
cursor = conn.cursor()
cursor.execute(
"""
SELECT news_sentiment.news_id,
stock_news.ticker,
news_sentiment.sentiment,
news_sentiment.news_explanation
FROM news_sentiment
JOIN stock_news ON news_sentiment.news_id = stock_news.id;
"""
)
sentiment_data = cursor.fetchall()
for news_id, ticker, sentiment, explanation in sentiment_data:
print(f"Generating forecasts for id: {news_id} stock {ticker} with sentiment {sentiment}...")
forecast_1, probability_1 = forecast_sentiment(
ticker, "positive", explanation
)
forecast_2, probability_2 = forecast_sentiment(
ticker, "negative", explanation
)
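        # Self-play comparison: keep whichever direction the model asserts with higher confidence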
if probability_1 > probability_2:
best_forecast = forecast_1
else:
best_forecast = forecast_2
# Insert forecasts into news_forecast table
cursor.execute(
"""
INSERT INTO news_forecast (news_id, ticker, forecast_1, probability_1, forecast_2, probability_2, best_forecast)
VALUES (?, ?, ?, ?, ?, ?, ?);
""",
(news_id, ticker, forecast_1, probability_1, forecast_2, probability_2, best_forecast),
)
# Commit forecasts to database
conn.commit()
conn.close()
Example results:
id | news_id | ticker | forecast_1 | probability_1 | forecast_2 | probability_2 | best_forecast |
---|---|---|---|---|---|---|---|
1 | 1 | AAPL | positive | 0.95 | negative | 0.2 | positive |
2 | 2 | AAPL | negative | 0.5 | negative | 0.4 | negative |
3 | 3 | AAPL | negative | 0.8 | negative | 0.8 | negative |
4. Ranking Forecasts Based on Accuracy
Once real-world outcomes are available, we can rank forecasts based on their proximity to the true outcome.
Ranking forecasts based on their accuracy helps us identify which predictions are most reliable. This ensures that we prioritize the best forecasts for decision-making.
The ranking function used in the paper is:
[ r(p, o) = |p - o| ]
where:
- ( p ) is the predicted probability
- ( o ) is the actual outcome (0 or 1)
For example, a forecast of ( p = 0.95 ) for an event that happened (( o = 1 )) scores ( r = 0.05 ); lower is better.
Python Code: Ranking Forecasts
def ranking_metric(prediction, actual_outcome):
"""Calculate the absolute error between prediction and actual outcome."""
return abs(prediction - actual_outcome)
# The model's probabilities already live in probability_1 / probability_2
# (forecast_1 / forecast_2 hold the "positive"/"negative" labels), and
# "outcome" is the binary label derived earlier from news_effect. Here df
# is assumed to join news_forecast with stock_news.
# Compute ranking scores (lower = closer to the truth)
df["rank_1"] = df.apply(lambda row: ranking_metric(row["probability_1"], row["outcome"]), axis=1)
df["rank_2"] = df.apply(lambda row: ranking_metric(row["probability_2"], row["outcome"]), axis=1)
# Determine the better forecast
df["better_forecast"] = df.apply(lambda row: "Forecast 1" if row["rank_1"] < row["rank_2"] else "Forecast 2", axis=1)
# Display the dataset with rankings
print(df[["ticker", "probability_1", "probability_2", "rank_1", "rank_2", "better_forecast"]].head())
This step identifies which forecast was closer to the actual outcome.
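With the better forecast identified, we can turn each ranked pair into the (prompt, chosen, rejected) triplet that DPO expects. The sketch below uses our own chosen_response / rejected_response naming and assumes df joins news_forecast with stock_news so the news text is available:
import pandas as pd

def build_preference_pairs(df):
    """Turn ranked forecast pairs into DPO preference triplets."""
    rows = []
    for _, row in df.iterrows():
        prompt = f"News: {row['news_summary']}\nForecast the effect on {row['ticker']}:"
        # The forecast that landed closer to the real outcome becomes the preferred response.
        if row["better_forecast"] == "Forecast 1":
            chosen, rejected = row["forecast_1"], row["forecast_2"]
        else:
            chosen, rejected = row["forecast_2"], row["forecast_1"]
        rows.append({"prompt": prompt,
                     "chosen_response": chosen,
                     "rejected_response": rejected})
    return pd.DataFrame(rows)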
5. Fine-Tuning with Direct Preference Optimization (DPO)
The paper fine-tunes the model using pairs of ranked reasoning traces. We’ll turn our rankings into a preference dataset and train with TRL’s DPOTrainer. The code below is a minimal sketch: it assumes df carries the prompt, chosen_response, and rejected_response columns built above, and exact argument names vary slightly across trl versions.
Python Code: Fine-Tuning with DPO
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType
from trl import DPOConfig, DPOTrainer

def fine_tune_dpo(df):
    model_name = "mistralai/Mistral-7B-Instruct-v0.2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"]
    )
    # DPO expects (prompt, chosen, rejected) triplets; we use the
    # preference pairs built from our forecast rankings.
    train_dataset = Dataset.from_dict({
        "prompt": df["prompt"].tolist(),
        "chosen": df["chosen_response"].tolist(),
        "rejected": df["rejected_response"].tolist(),
    })
    training_args = DPOConfig(
        output_dir="./dpo_model",
        per_device_train_batch_size=4,
        num_train_epochs=3,
        logging_dir="./logs",
        beta=0.1,  # strength of the implicit KL penalty against the reference model
    )
    trainer = DPOTrainer(
        model=model,                 # with a peft_config, the adapter-free base model serves as the reference
        args=training_args,
        train_dataset=train_dataset,
        processing_class=tokenizer,  # named `tokenizer` in older trl versions
        peft_config=peft_config,     # trl applies the LoRA adapter for us
    )
    trainer.train()
    trainer.save_model("./dpo_model")
    tokenizer.save_pretrained("./dpo_model")
# Train the model on the preference pairs from step 4
fine_tune_dpo(build_preference_pairs(df))
Check this blog post: Mastering LLM Fine-Tuning: A Practical Guide with LLaMA-Factory and LoRA to learn more about fine-tuning LLMs.
By integrating Direct Preference Optimization (DPO), we enhance an LLM’s ability to forecast stock movements based on news sentiment. This approach eliminates the need for a complex reward model, making fine-tuning simpler, faster, and more stable.
Additionally, ranking forecasts ensures we prioritize more accurate predictions, further improving decision-making. 🚀
6. Evaluating the Model: Brier Score Calculation
The paper evaluates model performance using the Brier Score, defined as:
[ BS = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2 ]
Where:
- ( p_i ) is the predicted probability
- ( o_i ) is the actual outcome (0 or 1)
Python Code: Computing Brier Score
def brier_score(predictions, outcomes):
"""Compute the Brier Score for probabilistic forecasts."""
return ((predictions - outcomes) ** 2).mean()
brier = brier_score(df["probability_1"], df["outcome"])
print(f"Brier Score: {brier:.4f}")
A lower Brier score indicates better forecasting performance.
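Since step 2 created a model_updates table with a brier_score column, a natural follow-up (a sketch under that assumption) is to log each fine-tuned checkpoint together with its score, so we can track whether self-play is actually improving the model over time:
import sqlite3
from datetime import date

def log_model_update(filename, brier):
    """Record a fine-tuned checkpoint and its Brier score in model_updates."""
    conn = sqlite3.connect("stock_news.db")
    cursor = conn.cursor()
    cursor.execute(
        "INSERT INTO model_updates (filename, update_date, brier_score) VALUES (?, ?, ?);",
        (filename, date.today().isoformat(), brier),
    )
    conn.commit()
    conn.close()

log_model_update("./dpo_model", brier)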