Fin-R1: a Financial Reasoning LLM with Reinforcement Learning and CoT
Introduction
Fin-R1 is a 7B-parameter model fine-tuned specifically for financial reasoning, with performance that rivals, and on some financial benchmarks beats, much larger models like DeepSeek-R1.
This post puts the model to work and compares it with phi3 across various tasks.
- phi3 for comparison
Phi-3 is a lightweight, general-purpose model known for its efficiency and strong reasoning performance at smaller parameter scales. It serves as a good baseline for assessing how domain-specific tuning in Fin-R1 improves financial understanding and response structure.
- Qwen2.5 for judging the results
Qwen2.5 is used as the judge because it exhibits strong instruction-following ability and Chain-of-Thought reasoning skills, making it well suited to judging response quality across dimensions like accuracy, completeness, and clarity. Its adherence to structure and scoring consistency helped generate reliable evaluations across use cases.
TL;DR - What Can Fin-R1 Do?
- ✅ Robo-advisory with CoT evaluation
- ✅ Basel III compliance automation
- ✅ Finance exam tutoring
- ✅ Market risk analysis
- ✅ KPI interpretation with exec summary
- ✅ Sentiment from SEC filings (10-K, etc.)
- ✅ Quant finance code generation (e.g. Black-Scholes, Sharpe Ratio)
- ✅ EDGAR Document Analysis
By the end of this post you should have a good idea of some of the ways we can leverage this model for finance-related tasks.
Why Fin-R1
1. Fin-R1 is specifically designed and fine-tuned for financial applications.
| Data Category | Data Category Description | Source | Proportion |
|---|---|---|---|
| Financial Code | Financial Quantitative Strategy Code Generation | FinanceQT | 0.2% |
| Financial Expertise | Financial Terminology, Explanation, Q&A on Financial Expertise, Financial Calculations | Finance-Instruct-500K | 18.2% |
| | | FinanceIQ | 3.4% |
| | | FinPEE | 0.3% |
| Non-reasoning Financial Business Knowledge | Content Generation in Financial Business, Regulatory Compliance, Financial Knowledge, Financial Cognition, Financial Logic | Ant-Finance | 2.0% |
| | | FinCorpus | 48.4% |
| Financial Reasoning Business Knowledge | Numerical Reasoning on Financial Data, Financial News Sentiment Classification, Financial News Classification, Financial Causal Relationship Extraction | FinQA | 4.8% |
| | | ConvFinQA | 12.3% |
| | | TFNS | 4.0% |
| | | FinCUGE | 6.4% |
2. Versatile Financial Applications
Fin-R1 supports a wide range of financial tasks, including but not limited to:
- Financial Analysis: Generating insights, summarizing reports, and analyzing trends.
- Risk Management: Assessing risks, identifying potential issues, and providing recommendations.
- Investment Research: Supporting decision-making by processing market data, news, and company reports.
- Regulatory Compliance: Assisting with understanding and interpreting complex financial regulations.
- Customer Support: Automating responses to customer inquiries about banking, investments, or financial products.
3. Open-Source and Customizable
- Fin-R1 is open-source, allowing users to access, change, and fine-tune the model to suit their specific needs.
- This flexibility enables organizations to adapt the model for proprietary datasets or unique financial workflows, enhancing its utility.
What Makes Financial Reasoning Hard?
Finance-specific LLMs must deal with:
- Fragmented financial data
- Black-box reasoning (not interpretable enough for regulation)
- Poor generalization across diverse business scenarios
Fin-R1 addresses these with:
- A specialized dataset: Fin-R1-Data
- A two-stage training approach: Supervised Fine-Tuning (SFT) + Reinforcement Learning (GRPO)
- Explicit reasoning structure using `<think>` and `<answer>` tags (a small parsing helper is sketched below)
Fin-R1-Data contains 60,091 reasoning traces, built from:
- Public datasets: FinQA, ConvFinQA, TFNS, FinCorpus, etc.
- A curated FinPEE dataset from financial exam PDFs
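Because every Fin-R1 response is expected to follow the `<think>`/`<answer>` structure, it is easy to separate the reasoning trace from the final answer programmatically. A minimal helper sketch (my addition, not part of Fin-R1's own tooling):

```python
import re

def split_cot(response: str) -> tuple[str, str]:
    """Split a Fin-R1-style response into (reasoning, answer) using its CoT tags."""
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return (
        think.group(1).strip() if think else "",
        # Fall back to the raw text if the model omitted the tags
        answer.group(1).strip() if answer else response.strip(),
    )
```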
Converting the model to GGUF so we can test and run it locally
I wanted to use the model locally through Ollama, so to make this easy I converted the model to GGUF.
What is GGUF and Why Does It Matter?
GGUF is a powerful and efficient model file format for transformer-based large language models (LLMs), created by Georgi Gerganov for the llama.cpp project. It succeeds earlier formats like GGML and offers a more streamlined, portable, and quantization-friendly way to store models for offline inference.
Modern LLMs like LLaMA, Mistral, and Phi are massive—often requiring GPUs with tens of GBs of VRAM. GGUF enables these models to:
- ✅ Run efficiently on CPUs (even laptops and Raspberry Pi)
- ✅ Use low memory through quantization (like Q4_K_M, Q8_0)
- ✅ Load faster with optimized weights and metadata
- ✅ Interoperate with llama.cpp-based tools like:
- Ollama – local models with an easy-to-use chat interface and API
- LM Studio – GUI for chatting with GGUF models
- text-generation-webui – powerful browser-based UI
- koboldcpp – fine-tuned for storytelling and role play
📦 When Should You Use GGUF?
Use GGUF when you want to:
Use Case | Why GGUF Helps |
---|---|
Run LLMs locally | Avoid sending data to the cloud, ensure privacy |
Deploy lightweight AI tools | Run on CPU or low-end hardware without a GPU |
Experiment with quantized models | Save space and memory using 4-bit or 8-bit versions |
Integrate into desktop or web apps | Fast load times and compatibility with local inference engines |
🛠️ Model Types Supported
GGUF is designed for transformer architectures like:
- LLaMA / LLaMA2 / LLaMA3
- Mistral / Mixtral
- Falcon
- GPT-J / GPT-NeoX
- Phi-2
- OpenHermes, StarCoder, and other fine-tuned derivatives
Creating a GGUF version of the model
You can use the Hugging Face gguf-my-repo space to do this.
I created an 8-bit GGUF version, ernanhughes/Fin-R1-Q8_0-GGUF, which will allow you to use the model in Ollama and LM Studio. Both give a nice chat interface and API on Windows.
Using Docker to create a GGUF model
You can also use docker to do this conversion process.
mkdir -p ~/models
huggingface-cli login
# Note: tilde does not expand inside double quotes, so use $HOME
huggingface-cli download SUFE-AIFLM-Lab/Fin-R1 --local-dir "$HOME/models" --include "*"

# Convert to GGUF
docker run --rm -v "$HOME/models":/repo ghcr.io/ggerganov/llama.cpp:full --convert "/repo" --outtype f32

ls ~/models | grep .gguf
#> ggml-model-f32.gguf

# Quantize from f32 GGUF to Q4_K_M
docker run --rm -v "$HOME/models":/repo ghcr.io/ggerganov/llama.cpp:full --quantize "/repo/ggml-model-f32.gguf" "/repo/ggml-model-Q4_K_M.bin" "Q4_K_M"

ls ~/models | grep .bin
#> ggml-model-Q4_K_M.bin
Once converted you can then use this model in Ollama. To use the model:
ollama pull hf.co/ernanhughes/Fin-R1-Q8_0-GGUF
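Once pulled, a quick smoke test through litellm confirms the model responds (this assumes Ollama is running locally; the model tag matches the pull command above):

```python
import litellm

response = litellm.completion(
    model="ollama/hf.co/ernanhughes/Fin-R1-Q8_0-GGUF",
    messages=[{"role": "user", "content": "What is the Sharpe Ratio?"}],
)
print(response["choices"][0]["message"]["content"])
```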
GGUF vs TorchScript vs ONNX
Format | Best For | Pros | Cons |
---|---|---|---|
GGUF | Local CPU inference | Quantized, llama.cpp tools, easy to use | Mostly for LLaMA-style models |
TorchScript | PyTorch deployment | Native PyTorch integration | Larger size, less portable |
ONNX | Cross-platform ML pipelines | Broad compatibility (TensorRT, OpenVINO) | Conversion can be tricky |
Loading the model config
The Hugging Face `AutoConfig` API allows you to programmatically retrieve a model's architecture and configuration details (hidden size, number of layers, attention heads, max position embeddings, vocabulary size) either from a local directory or directly from the Hugging Face Hub.
This is useful when:
- You want to check how much text you can reliably send to the model for summarization or processing.
- You are converting a model to another format (like GGUF) and want to verify dimensions.
- You want to inspect model compatibility with tools (e.g., tokenizer, quantization).
- Documentation is missing or unclear about the model internals.
By using `AutoConfig.from_pretrained("model-name")`, you can avoid guesswork and ensure that you're handling the model correctly across tools and pipelines.
Code example: loading the model config
This example code shows how to fetch a model's config from several sources and render it as a markdown table for viewing.
from transformers import AutoConfig
from huggingface_hub import HfApi
import requests
import json
import os
def config_to_markdown(config_dict):
# Priority ordered keys
priority_keys = [
"model_type", "architectures", "hidden_size", "num_attention_heads",
"num_hidden_layers", "intermediate_size", "vocab_size", "type_vocab_size",
"max_position_embeddings", "hidden_act", "initializer_range",
"layer_norm_eps", "attention_probs_dropout_prob", "hidden_dropout_prob",
"pad_token_id", "bos_token_id", "eos_token_id"
]
all_keys = list(config_dict.keys())
# Sort by priority first, then append any other keys
ordered_keys = [k for k in priority_keys if k in config_dict] + \
[k for k in all_keys if k not in priority_keys]
# Create markdown table
markdown = "| Key | Value |\n|-----|-------|\n"
for k in ordered_keys:
v = config_dict[k]
if isinstance(v, list):
v = ", ".join(map(str, v))
elif isinstance(v, dict):
v = json.dumps(v)
markdown += f"| `{k}` | `{v}` |\n"
return markdown
def inspect_model_config(model_name_or_path, source="autoconfig"):
"""
Inspect a Hugging Face model's configuration and output as markdown.
Parameters:
model_name_or_path (str): model name from Hugging Face hub or local path.
source (str): One of ["autoconfig", "api", "rest", "local_json"]
"""
config_dict = None
if source == "autoconfig":
print("Using AutoConfig (local or remote)...")
config = AutoConfig.from_pretrained(model_name_or_path)
config_dict = config.to_dict()
elif source == "api":
print("Using huggingface_hub HfApi...")
api = HfApi()
model_info = api.model_info(model_name_or_path)
config_dict = model_info.config if hasattr(model_info, 'config') else {}
elif source == "rest":
print("Using Hugging Face REST API...")
url = f"https://huggingface.co/api/models/{model_name_or_path}"
response = requests.get(url)
if response.status_code == 200:
data = response.json()
config_dict = data.get("config", {})
else:
print("Failed to retrieve model metadata from REST API.")
elif source == "local_json":
print("Loading config.json from local directory...")
path = os.path.join(model_name_or_path, "config.json")
if os.path.exists(path):
with open(path, "r") as f:
config_dict = json.load(f)
else:
print(f"config.json not found at {path}")
else:
print("Invalid source. Choose from: autoconfig, api, rest, local_json.")
    if config_dict:
        print("\n### Model Configuration (Markdown Table)\n")
        print(config_to_markdown(config_dict))
    return config_dict

model_name = "SUFE-AIFLM-Lab/Fin-R1"
model_config = inspect_model_config(model_name, source="autoconfig")
SUFE-AIFLM-Lab/Fin-R1
Model Configuration (Markdown Table)

| Key | Value |
|-----|-------|
| `model_type` | `qwen2` |
| `architectures` | `Qwen2ForCausalLM` |
| `hidden_size` | `3584` |
| `num_attention_heads` | `28` |
| `num_hidden_layers` | `28` |
| `intermediate_size` | `18944` |
| `vocab_size` | `152064` |
| `max_position_embeddings` | `32768` |
| `hidden_act` | `silu` |
| `initializer_range` | `0.02` |
| `pad_token_id` | `None` |
| `bos_token_id` | `151643` |
| `eos_token_id` | `151645` |
| `use_sliding_window` | `False` |
| `sliding_window` | `None` |
| `max_window_layers` | `28` |
| `num_key_value_heads` | `4` |
| `rms_norm_eps` | `1e-06` |
| `use_cache` | `True` |
| … | |
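For example, `max_position_embeddings` gives the advertised context window, from which you can sketch a rough character budget for prompts. This assumes the common heuristic of roughly 4 characters per English token; as the context-limit test later in this post shows, the usable window in a local runtime is often much smaller.

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("SUFE-AIFLM-Lab/Fin-R1")
# ~4 characters per token is a rough heuristic for English text
approx_chars = cfg.max_position_embeddings * 4
print(f"{cfg.max_position_embeddings} tokens ≈ {approx_chars:,} characters")
```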
Testing the Fin-R1 model
These are some tests I ran on the model to get an idea of the quality of its results.
Demo 1: Robo Advisory
Here we use Fin-R1 to generate financial advice for a set of investors with different financial goals.
We send the same query to a second model, phi3, and then use a third model, Qwen2.5, to judge the results.
Code example and testing
import os
import sqlite3
from typing import Optional, Tuple, Dict
from pprint import pprint
from dotenv import load_dotenv
import litellm
class Config:
def __init__(self):
load_dotenv()
self.model1_name = os.getenv("MODEL_1", "ollama/hf.co/ernanhughes/Fin-R1-Q8_0-GGUF")
self.model2_name = os.getenv("MODEL_2", "ollama/phi3")
self.evaluator_model = os.getenv("EVAL_MODEL", "ollama/qwen2.5")
self.db_path = os.getenv("DB_PATH", "fin_r1_advisory.db")
self.debug = os.getenv("DEBUG", "true").lower() == "true"
if self.debug:
litellm._turn_on_debug()
def query_litellm(model_name: str, prompt: str) -> str:
response = litellm.completion(
model=model_name,
messages=[{"role": "user", "content": prompt}]
)
return response['choices'][0]['message']['content']
def call_models(prompt: str, config: Config) -> tuple[str, str]:
return (
query_litellm(config.model1_name, prompt),
query_litellm(config.model2_name, prompt)
)
def build_comparison_prompt(question: str, response_a: str, response_b: str) -> str:
return f"""
### Task:
You are a financial reasoning expert. Two models have provided investment advice in CoT format.
Evaluate which advice is more helpful, complete, and appropriate based on reasoning and financial knowledge.
### Instructions:
1. Assess each response based on:
- Accuracy and factual correctness
- Clarity and coherence
- Depth of explanation
- Relevance to the question
2. Assign each response a score from **0 to 10**.
3. Provide a short justification for your scoring.
---
### Question:
{question}
---
### Response A:
{response_a}
---
### Response B:
{response_b}
---
### Output Format:
Response A Score: X/10
Response B Score: Y/10
Winner: Response A or Response B or Tie
Justification: <your analysis here>
"""
class UserProfile:
def __init__(self, name, age, gender, risk_tolerance, investment_goal, investment_horizon_years, current_portfolio, economic_context):
self.name = name
self.age = age
self.gender = gender
self.risk_tolerance = risk_tolerance
self.investment_goal = investment_goal
self.investment_horizon_years = investment_horizon_years
self.current_portfolio = current_portfolio
self.economic_context = economic_context
def to_dict(self):
return self.__dict__
def gen_robo_request(user_profile: UserProfile) -> str:
profile = user_profile.to_dict()
return (
f"<think>{profile['name']} is {profile['age']} years old with a {profile['risk_tolerance']} risk tolerance. "
f"Goal is {profile['investment_goal']} in {profile['investment_horizon_years']} years. "
f"Current portfolio: {profile['current_portfolio']}. "
f"Economic context: {profile['economic_context']}."
)
class DatabaseManager:
def __init__(self, db_path: str):
self.conn = sqlite3.connect(db_path)
self._initialize_schema()
def _initialize_schema(self):
cursor = self.conn.cursor()
cursor.execute("""
CREATE TABLE IF NOT EXISTS user_profiles (
name TEXT PRIMARY KEY,
age INTEGER,
gender TEXT,
risk_tolerance TEXT,
investment_goal TEXT,
horizon_years INTEGER,
economic_context TEXT,
current_portfolio TEXT
)
""")
cursor.execute("""
CREATE TABLE IF NOT EXISTS advisory_results (
user_name TEXT,
request TEXT,
fin_r1_output TEXT,
model_output TEXT,
report_prompt TEXT,
report TEXT,
FOREIGN KEY(user_name) REFERENCES user_profiles(name)
)
""")
self.conn.commit()
def insert_user_profile(self, profile: Dict):
cursor = self.conn.cursor()
cursor.execute("""
INSERT OR REPLACE INTO user_profiles (
name, age, gender, risk_tolerance, investment_goal,
horizon_years, economic_context, current_portfolio
) VALUES (?, ?, ?, ?, ?, ?, ?, ?)
""", (
profile['name'], profile['age'], profile['gender'], profile['risk_tolerance'],
profile['investment_goal'], profile['investment_horizon_years'],
profile['economic_context'], str(profile['current_portfolio'])
))
def insert_advisory_result(self, user_name: str, request: str,
fin_r1_output: str, model_output: str,
report_prompt: str, report: str):
cursor = self.conn.cursor()
cursor.execute("""
INSERT INTO advisory_results (
user_name, request, fin_r1_output, model_output, report_prompt, report
) VALUES (?, ?, ?, ?, ?, ?)
""", (user_name, request, fin_r1_output, model_output, report_prompt, report))
def commit(self):
self.conn.commit()
def close(self):
self.conn.close()
def main():
config = Config()
db = DatabaseManager(config.db_path)
user_profiles = [
UserProfile(
name="User A", age=35, gender="female", risk_tolerance="moderate",
investment_goal="retirement", investment_horizon_years=20,
current_portfolio={"stocks": 6000, "bonds": 300, "cash": 1000},
economic_context="Fed is expected to raise interest rates next quarter"
),
UserProfile(
name="User B", age=25, gender="male", risk_tolerance="high",
investment_goal="wealth_growth", investment_horizon_years=10,
current_portfolio={"stocks": 20000, "bonds": 10000, "cash": 300000},
economic_context="Inflation is steady and market volatility is growing"
)
]
for profile in user_profiles:
db.insert_user_profile(profile.to_dict())
request = gen_robo_request(profile)
fin_r1_output, model_output = call_models(request, config)
report_prompt = build_comparison_prompt(request, fin_r1_output, model_output)
report = query_litellm(config.evaluator_model, report_prompt)
db.insert_advisory_result(
user_name=profile.name,
request=request,
fin_r1_output=fin_r1_output,
model_output=model_output,
report_prompt=report_prompt,
report=report
)
print(f"\n--- Report for {profile.name} ---\n")
print(report)
    db.commit()
    db.close()

if __name__ == "__main__":
    main()
Results
In my results on this small test, Fin-R1 performed very well.
Response A Score: 9/10
Response B Score: 7/10
Winner: Response A
Response A Score: 8/10
Response B Score: 6/10
Winner: Response A
Demo 2: Financial Compliance Automation
We are going to use the exact same structure to build a financial compliance application.
Code example and testing
Note: I only included the prompts here; the database and model-calling code is the same as in Demo 1. You can view the full code in this post's GitHub repo.
compliance_rules = [
"Basel III requires banks to maintain a minimum capital adequacy ratio (CAR) of 8% to ensure financial stability.",
"Basel III mandates that the Liquidity Coverage Ratio (LCR) must be at least 100% to ensure banks can meet short-term obligations.",
"Under Basel III, the leverage ratio must be at least 3% to prevent excessive on- and off-balance sheet leverage."
]
policy_statements = [
"The capital adequacy ratio of the bank fell below 8% in Q2 due to increased exposure to high-risk loans.",
"The bank’s liquidity coverage ratio dropped to 87% in the last quarter as it increased its long-term asset holdings to chase yield.",
"The bank reported a leverage ratio of 2.5% after an aggressive expansion in derivatives trading and structured financing."
]
def generate_prompt(policy_input: str, compliance_rule: str, model: str) -> str:
instruction = f"""
You are a financial reasoning assistant. Read the compliance rule and the bank's policy statement, then determine whether the bank is compliant. Provide your response in Chain-of-Thought format using <think> and <answer> tags.
Compliance Rule:
\"\"\"{compliance_rule}\"\"\"
Policy Statement:
\"\"\"{policy_input}\"\"\"
"""
return instruction
def build_comparison_prompt(policy_input: str, compliance_rule: str, output_a: str, output_b: str) -> str:
return f"""
You are a highly qualified financial compliance analyst with expertise in Basel III regulations.
Two AI models were given a compliance rule and a policy statement. They were asked to reason through the scenario using a Chain-of-Thought (CoT) format and provide a final compliance assessment.
Your task is to critically evaluate both responses and score them based on:
1. **Regulatory accuracy**: how well the model understands and applies the compliance rule
2. **Reasoning clarity**: how logically the model arrives at its conclusion
3. **Financial insight**: how deep and appropriate the explanation is
4. **Output structure**: correct use of <think> and <answer> tags (important for interpretability)
---
### Compliance Rule:
\"\"\"{compliance_rule}\"\"\"
### Policy Statement:
\"\"\"{policy_input}\"\"\"
---
### Model A Response:
{output_a}
### Model B Response:
{output_b}
---
Please evaluate each model independently and respond in the following format:
Model A Score: <score>/10
Justification A: <explanation of how well it reasoned, applied the rule, and used CoT formatting>
Model B Score: <score>/10
Justification B: <same as above>
Preferred Model: Model A or Model B
Only select the preferred model if one is clearly better overall.
"""
Results evaluation
The results are again impressive:
### Evaluation:
#### Model A:
- **Regulatory Accuracy**: 9/10. The response correctly interprets that Basel III mandates a minimum CAR of 8% and accurately identifies that falling below this threshold in Q2 constitutes non-compliance with the rule.
**Model A Score: 8.6/10**
**Model B Score: 7.2/10**
....
Model A Score: 8/10
Model B Score: 6/10
Preferred Model: Model A
Model A is preferred because it demonstrates better clarity in reasoning, provides more detailed financial insight, and uses the CoT format more effectively. While both models ultimately come to the same conclusion about non-compliance, Model A offers a more comprehensive and structured assessment.
Demo 3: Exam question
This demo shows how we could help a student prepare for their finance exams.
exam_questions = [
"A company's revenue is $500,000 and expenses are $350,000. What is the net income?",
"An investor buys a bond for $950 and receives $1,000 at maturity in one year. What is the yield?",
"A project requires an initial investment of $10,000 and is expected to return $2,500 annually for 5 years. What is the payback period?"
]
# --- Step 2: Shared prompt template for both models ---
def generate_exam_prompt(question: str) -> str:
return f"""
You are a finance tutor helping a student prepare for an exam.
Answer the following question using Chain-of-Thought (CoT) reasoning format.
Wrap your reasoning in <think> tags and your final answer in <answer> tags.
Question:
"{question}"
"""
def judge_exam_answers(question, output_a, output_b):
return f"""
You are a finance instructor evaluating student responses to an exam question. Two models have answered using a Chain-of-Thought format.
Question:
"{question}"
---
Model A:
{output_a}
Model B:
{output_b}
Evaluate each model on:
1. Correctness of the answer
2. Accuracy and completeness of the reasoning
3. Use of <think> and <answer> tags
Please respond in this format:
Model A Score: <score>/10
Justification A: <reasoning>
Model B Score: <score>/10
Justification B: <reasoning>
Preferred Model: Model A or Model B
"""
Analysis and Results
Model A Score: 8/10
Model B Score: 7/10
Model A Score: 9.5/10
Model B Score: 8/10
Model A Score: 9/10
Model B Score: 8/10
Demo 4: Risk Analysis
risk_scenarios = [
"An investor is heavily invested in Nvidia. A new AI paradigm called DeepSeek emerges out of China, dramatically reducing the need for GPUs. Assess the risk to Nvidia's stock.",
"An investor holds a concentrated position in Tesla. BYD announces a new solid-state battery that doubles range, halves cost, and enables near-instant charging, removing the need for most public charging infrastructure. Analyze the potential impact on Tesla."
]
# --- Step 2: Shared prompt for both models ---
def generate_risk_prompt(scenario: str) -> str:
return f"""
You are a financial analyst evaluating the impact of a company-specific market scenario.
Explain how this scenario might affect the investor's portfolio using Chain-of-Thought (CoT) reasoning.
Wrap your reasoning in <think> tags and provide a summary inside <answer> tags.
Scenario:
"{scenario}"
"""
def judge_risk_outputs(scenario, output_a, output_b):
return f"""
You are a senior equity analyst comparing two CoT reasoning outputs for a company-specific risk scenario.
Evaluate which response is more insightful, realistic, and helpful for an investor.
Scenario:
"{scenario}"
---
Model A:
{output_a}
---
Model B:
{output_b}
---
Evaluate each model on:
1. Accuracy of market and company-specific reasoning
2. Clarity and insight of explanation
3. Proper structure using <think> and <answer> tags
Respond in this format:
Model A Score: <score>/10
Justification A: <reasoning>
Model B Score: <score>/10
Justification B: <reasoning>
Preferred Model: Model A or Model B
"""
In this case, let's look at the generated analysis: it is an in-depth answer. I included the full text here so you can judge the quality of the model's output.
Tesla Answer
First, I need to understand what the new technology does: it doubles the range of electric vehicles (EVs), halves production costs, and allows for near-instant charging. These are massive improvements over current battery tech. The immediate thought is that this could make BYD’s EVs more competitive in the market, especially since cost and range anxiety are big factors for consumers.
But wait, the investor here has a large stake in Tesla. So how does this impact them? Let’s break it down step by step.
1. **Competitive Advantage**: If BYD’s new battery technology is as revolutionary as described, it could give BYD an edge over other EV manufacturers, including Tesla. This might lead to increased market share for BYD in the EV sector. Since the investor’s portfolio is heavily weighted towards Tesla, this could be concerning because a competitor’s success directly impacts Tesla’s potential growth.
2. **Impact on Tesla’s Stock**: If the market starts to price in the risk that BYD becomes a major player due to their new battery tech, there might be a decrease in demand for Tesla’s shares. This could lead to a drop in Tesla’s stock price, which would negatively affect the investor’s holdings.
3. **Long-term vs Short-term**: The scenario mentions that this technology removes the need for most public charging infrastructure. That’s a huge point because range anxiety is a significant deterrent for many potential EV buyers. If BYD can eliminate the need for extensive charging networks, their EVs become more attractive compared to Tesla’s vehicles, which rely on existing infrastructure. This could shift consumer preferences towards BYD products.
4. **Innovation Pressure**: The development by BYD might force Tesla to accelerate its own R&D efforts in battery technology. If Tesla doesn’t keep up with the technological advancements, they might lose their competitive edge. However, investing heavily in research and development (R&D) could be costly and might not guarantee success. This financial strain could affect Tesla’s profitability and stock performance.
5. **Cross-border Competition**: BYD is a Chinese company, so its expansion into new markets could face different regulatory and trade barriers. For example, the U.S. market has high entry costs due to regulations like CAFE standards. If BYD can navigate these challenges effectively, they might penetrate American markets more easily than Tesla could compete against them.
6. **Customer Perception**: A technology that halves production costs would make EVs cheaper for consumers. Even if Tesla’s vehicles have features or designs customers prefer, the lower price point from BYD could be a major draw. This could lead to decreased demand for higher-priced Tesla models unless they can offer something else that compensates for the cost difference.
7. **Supply Chain Shifts**: If BYD becomes a significant EV manufacturer, there might be changes in the global supply chain. Companies might prefer to source components from BYD due to lower costs or better performance. This could disrupt Tesla’s existing supplier relationships and increase their operational costs.
8. **Corporate Strategy**: Tesla has historically focused on premium vehicles and vertical integration (like their own battery production). The announcement by BYD challenges both of these strategies. If other companies start adopting similar approaches, it might force Tesla to change its business model, which could be risky for investors if the transition isn’t smooth.
Putting all this together, the introduction of BYD’s solid-state batteries presents a significant threat to Tesla’s market position. The investor holding a concentrated Tesla portfolio should be cautious as this could lead to potential losses if the stock price drops due to increased competition or if Tesla fails to innovate sufficiently. However, it’s also possible that Tesla could respond by investing heavily in their own battery technology, perhaps even developing solid-state batteries themselves, which would keep them competitive. The outcome depends on how quickly Tesla can adapt and compete against BYD’s new offerings.

The scenario introduces a revolutionary solid-state battery technology from BYD that significantly reduces costs, doubles range, and eliminates the need for extensive public charging infrastructure. This has several implications for an investor holding a concentrated position in Tesla:
- Competitive Disadvantage: BYD’s advancements could empower them to capture more market share, especially if their EVs become more cost-effective and convenient than Tesla’s offerings.
- Stock Price Risk: Increased competition from BYD might reduce demand for Tesla shares, leading to a decline in stock price.
- Consumer Shifts: Lower-cost EVs with extended ranges could alter consumer preferences, potentially decreasing demand for higher-priced Tesla models.
- R&D Pressures: Tesla may need to accelerate its own battery technology development or face losing market share. This could strain financial resources and operational efficiency.
- Supply Chain Changes: Shifts in component sourcing due to cost advantages could disrupt Tesla’s existing supply chain arrangements.
However, Tesla might counter by investing aggressively in solid-state batteries or other innovations to retain its competitive edge. The ultimate impact depends on Tesla’s ability to innovate and adapt swiftly.
Summary: BYD’s technological leap creates a direct threat to Tesla’s market position, increasing risks for investors holding concentrated Tesla shares unless Tesla successfully mitigates this competition through innovation or strategic responses.
Model A Score: 8/10
Model B Score: 6/10
Demo 5: KPI Analysis
In this test I want to demonstrate how Fin-R1 can provide chain-of-thought (CoT) explanations for changes in financial Key Performance Indicators (KPIs). The demo includes:
- Simulated financial KPI shifts and external drivers
- Chain-of-Thought explanations from both models
- An evaluation of the model outputs
- SQLite database storage of scenario evaluations
kpi_scenarios = [
{
"kpi": "Net Income",
"change": "dropped by 15% in Q4 2024",
"context": "due to rising input costs and slower sales in Europe"
},
{
"kpi": "Gross Margin",
"change": "improved by 3% in Q1 2025",
"context": "following a successful shift to in-house component manufacturing"
}
]
def generate_kpi_prompt(scenario):
return f"""
You are a financial analyst preparing a quarterly report for executive stakeholders.
Analyze the following KPI change using Chain-of-Thought (CoT) reasoning.
Wrap your reasoning in <think> tags and summarize the implications in <answer> tags.
KPI: {scenario['kpi']}
Change: {scenario['change']}
"""
def judge_kpi_outputs(scenario, output_a, output_b):
    return f"""
You are a CFO reviewing two model-generated CoT analyses for a KPI shift.
Evaluate which is more insightful, executive-ready, and grounded in financial reasoning.
KPI: {scenario['kpi']}
Change: {scenario['change']}
Context: {scenario['context']}
---
Model A:
{output_a}
Model B:
{output_b}
Evaluate each model on:
1. Business relevance
2. Reasoning clarity
3. Correct use of <think> and <answer> tags
Respond in this format:
Model A Score: <score>/10
Justification A: <reasoning>
Model B Score: <score>/10
Justification B: <reasoning>
Preferred Model: Model A or Model B
"""
#### Model A:
**Score:** 8/10
#### Model B:
**Score:** 6/10
### Preferred Model:
**Model A**
...
Demo 6: Quant Code Generation
We want to see how well the model generates quant code. In this demo we:
- Define quantitative tasks in natural language
- Prompt both models to generate Python solutions
- Use Qwen2.5 to assess the quality of generated code
- Store results in an SQLite database
quant_tasks = [
"Write Python code to calculate the Sharpe Ratio for a stock given a list of daily returns and a risk-free rate.",
"Generate Python code to backtest a moving average crossover strategy using two different windows.",
"Create Python code to compute Value at Risk (VaR) using historical simulation for a portfolio of returns.",
"Write Python code to price a European call option using the Black-Scholes formula.",
"Generate code to simulate a geometric Brownian motion for an asset price.",
"Write Python code to calculate the duration and convexity of a bond given its cash flows and yield.",
"Generate Monte Carlo simulation code to price a European call option.",
"Write Python code to calculate Delta and Gamma for a European call option using the Black-Scholes model."
]
# --- Step 2: Prompt format ---
def generate_code_prompt(task: str) -> str:
return f"""
You are a quantitative financial developer. Write clean, documented Python code to solve the following problem:
{task}
Instructions:
- Explain your approach in <think> tags.
- Include all Python code within <code> tags.
- Use comments to clarify logic and assumptions.
- Ensure the code runs correctly and handles potential edge cases.
- Prefer standard libraries like NumPy, pandas, and SciPy where applicable.
- Avoid excessive complexity; prioritize clarity and correctness.
"""
def generate_validation_prompt(task: str, code_output: str) -> str:
return f"""
You are a senior quantitative engineer reviewing Python code for accuracy.
The code was written to address the following financial task:
"""{task}"""
Here is the model's output:
{code_output}
Instructions:
1. Verify that the code correctly solves the stated task.
2. Identify any bugs, logical flaws, or poor practices.
3. Comment on use of financial logic, edge case handling, and numerical stability.
4. Score the solution out of 10 for correctness and clarity.
Respond with:
- Score: <X>/10
- Critique: <brief assessment>
- Recommendations: <concrete fixes or improvements>
"""
For the option-pricing question it generated the following code, which is correct; for the example inputs it returns a call price of about $6.04, matching the standard Black-Scholes value.
<code>
import numpy as np
from scipy.stats import norm
def black_scholes_call(S, X, r, T, sigma):
"""
Calculate the price of a European call option using the Black-Scholes formula.
Parameters:
S (float): Current stock price
X (float): Exercise price (strike price)
r (float): Risk-free interest rate (annualized)
T (float): Time to expiration (years)
sigma (float): Volatility of the underlying asset
Returns:
float: Price of the European call option
"""
# Check for negative inputs that don't make sense in financial context
if S < 0 or X < 0:
raise ValueError("Stock price and strike price must be non-negative.")
if T <= 0:
raise ValueError("Time to expiration must be positive.")
# Calculate d1 and d2
d1 = (np.log(S / X) + (r + sigma**2 / 2) * T) / (sigma * np.sqrt(T))
d2 = d1 - sigma * np.sqrt(T)
# Ensure numerical stability for extreme values of d2
if d2 > 20:
d2 = 20.0
elif d2 < -20:
d2 = -20.0
# Calculate the call option price using Black-Scholes formula
call_price = S * norm.cdf(d1) - X * np.exp(-r * T) * norm.cdf(d2)
return call_price
# Example usage
if __name__ == "__main__":
S = 100.0 # Current stock price
X = 110.0 # Strike price
r = 0.05 # Risk-free rate (5% annual)
T = 1.0 # Time to expiration (1 year)
sigma = 0.2 # Volatility (20% annual)
try:
call_option_price = black_scholes_call(S, X, r, T, sigma)
print(f"The price of the European call option is: ${call_option_price:.4f}")
except ValueError as e:
print(e)
</code>
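As an extra sanity check beyond the judge's review (my addition, not part of the model's output), the generated `black_scholes_call` can be verified against put-call parity, C - P = S - X * exp(-rT):

```python
import numpy as np
from scipy.stats import norm

def black_scholes_put(S, X, r, T, sigma):
    # European put using the same d1/d2; used here only to test put-call parity
    d1 = (np.log(S / X) + (r + sigma**2 / 2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return X * np.exp(-r * T) * norm.cdf(-d2) - S * norm.cdf(-d1)

S, X, r, T, sigma = 100.0, 110.0, 0.05, 1.0, 0.2
call = black_scholes_call(S, X, r, T, sigma)  # from the generated code above
put = black_scholes_put(S, X, r, T, sigma)
# Put-call parity: C - P should equal S - X * exp(-rT)
assert abs((call - put) - (S - X * np.exp(-r * T))) < 1e-8
```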
EDGAR Sentiment Analysis
For this demo we will download some filings from EDGAR and determine their sentiment.
What is a 10-K report about?
A 10-K report is a comprehensive annual filing submitted by publicly traded companies to the U.S. Securities and Exchange Commission (SEC). It provides a detailed overview of a company’s financial performance, business operations, risk factors, legal proceedings, and management discussion and analysis.
Unlike an annual report to shareholders, which is often more polished and promotional, the 10-K is a formal and standardized disclosure required by law. Investors, analysts, and regulators use it to assess a company’s health and make informed decisions.
The 10-K is a valuable tool for assessing a company’s health because it offers audited financial statements, detailed insights into risks, and management’s own analysis of performance, all in a standardized, transparent format.
Best Sections of the 10-K for Sentiment Analysis
For sentiment analysis of a 10-K filing, not all sections are equally valuable. You’ll get the most useful and meaningful sentiment signals from the narrative sections that contain forward-looking language and risk assessments:
Section | Why it’s Useful |
---|---|
Item 1A. Risk Factors | Discusses potential threats and management concerns (rich in negative tone) |
Item 7. Management’s Discussion and Analysis (MD&A) | Most insightful for tone, confidence, forward-looking statements |
Item 1. Business | Gives context about operations; useful for neutral-to-positive framing |
Item 7A. Quantitative and Qualitative Disclosures about Market Risk | Risk language, volatility perception |
Document Size and Context Limitations
Large documents can exceed the context window of your language model. You can estimate the limit by gradually increasing the input length until the model fails or starts to respond inconsistently. Note that this threshold can vary depending on factors like available VRAM and the specific model you are using.
Because of this, documents need to be split into smaller chunks before processing.
It’s also important to recognize that model performance often degrades as context length increases. NVIDIA’s RULER project provides helpful insights into the effective context length of many models, offering a more realistic view than just advertised token limits.
This example function can help you determine your model’s usable context length:
def test_context_limit(token_multiplier=50):
base = "The quick brown fox jumps over the lazy dog. "
for i in range(1, 1000):
try:
prompt = base * (i * token_multiplier)
print(f"Trying length {len(prompt)}...")
            query_litellm("ollama/hf.co/ernanhughes/Fin-R1-Q8_0-GGUF", prompt)  # or however you call your local model
except Exception as e:
print(f"Failed at length {len(prompt)}: {e}")
break
In my testing I got a figure of 20,250 characters, so I am splitting documents at 20,000.
Getting the document from EDGAR
I will do a complete post on EDGAR later; for now I have written functions that use `sec_api` to fetch our documents.
This code downloads the latest 10-K for a ticker. For this analysis we are using section 1A, but the code should work for any section.
import os
import logging

from dotenv import load_dotenv
from sec_api import QueryApi, ExtractorApi

load_dotenv()
logger = logging.getLogger(__name__)
def get_latest_filing_url(ticker: str, form_type: str = "10-K"):
    """
    Get the CIK, accession number, and URL for the latest filing of a form type.
    Args:
        ticker (str): The stock ticker.
        form_type (str): The SEC form type, e.g. "10-K".
    Returns:
        tuple: (cik, accession_number, filing_url)
    """
api_key = os.getenv("SEC_API_KEY")
query = {
"query": {
"query_string": {
"query": f'formType:"{form_type}" AND ticker:{ticker}'
}
},
"from": "0",
"size": "1",
"sort": [{ "filedAt": { "order": "desc" } }]
}
try:
queryApi = QueryApi(api_key=api_key)
filings = queryApi.get_filings(query)
filing = filings['filings'][0]
accession_number = filing['accessionNo'].replace("-", "")
cik = filing['cik']
url = filing['linkToHtml']
logger.info(f"CIK: {cik} Accession Number: {accession_number} Filing URL: {url}")
return cik, accession_number, url
except Exception as e:
logger.error("Failed to fetch filing info", exc_info=True)
raise
def get_section_text(filing_url: str, section: str = "1A") -> str:
"""
Get the text of a specific section from a filing URL.
Args:
filing_url (str): The URL of the filing.
section (str): The section to extract.
Returns:
str: The text of the section.
"""
api_key = os.getenv("SEC_API_KEY")
extractorApi = ExtractorApi(api_key)
text = extractorApi.get_section(filing_url, section, "text")
return text
Splitting the Document
Splitting the text of a document is an art, and I will write a separate blog post about it. For now I split on paragraph boundaries, keeping each chunk near the configured maximum length.
import re
# --- Step 2: Chunk text on paragraphs with approximate size limit ---
def chunk_text(text, chunk_size=20000):
paragraphs = [p.strip() for p in text.split('\n') if p.strip()]
chunks = []
current_chunk = ""
for para in paragraphs:
if len(current_chunk) + len(para) <= chunk_size:
current_chunk += para + "\n"
else:
if current_chunk:
chunks.append(current_chunk.strip())
current_chunk = para + "\n"
if current_chunk:
chunks.append(current_chunk.strip())
return chunks
Analyzing the sentiment of each section
We want to get three things from each section:
- The chain-of-thought sentiment reasoning, or summary, of the section
- The final summary of this sentiment
- A sentiment score: a value which makes it clear what the model thinks the text says
# --- Step 3: Perform sentiment analysis using Fin-R1
def analyze_sentiment_with_fin_r1(report_text: str):
prompt = f"""
You are Fin-R1, a financial language model.
Analyze the following portion of an SEC filing and extract key sentiment and guidance.
<document>
{chunk}
</document>
Respond in Chain-of-Thought format:
- Begin with <think> for detailed reasoning
- Include a long descriptive summary
- End with a clearly labeled section: "Final Summary:" that gives a 1-2 sentence overall summary.
- End with a clearly labeled section: "Sentiment Score:" from 0 (completely negative) to 10 (completely positive). Surround the score with <sentiment_score> tags.
Wrap everything in <answer> tags.
"""
    content = query_litellm("ollama/hf.co/ernanhughes/Fin-R1-Q8_0-GGUF", prompt)
logger.info(f"content: {content}")
score = extract_sentiment_score_from_tags(content)
final_summary = extract_final_summary(content)
logger.info(f"score: {score}, final_summary: {final_summary}")
return content, final_summary, score
Extracting the results of each section
The model returns large blocks of text; for now we are specifically interested in the score and the final summary. These functions extract that information from the text.
Score: I found that enclosing the score in `<sentiment_score>` tags is the most reliable method for extracting it from the results.
def extract_sentiment_score_from_tags(text):
match = re.search(r"<sentiment_score>(\d+(?:\.\d+)?)</sentiment_score>", text, re.IGNORECASE)
res = match.group(1) if match else None
if res is None:
logger.info(f"Error: score not found in {text}")
return 5
    return float(res)
def extract_final_summary(text):
match = re.search(r"Final Summary:\s*(.*?)\n?(Sentiment Score:|</answer>|$)", text, re.DOTALL)
res = match.group(1).strip() if match else None
if res is None:
logger.info(f"Error: final_summary not found in {text}")
return text
logger.info(f"final_summary: {res}")
return res
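A quick check of both extractors against a synthetic model response (the text here is made up for illustration):

```python
sample = (
    "<answer>The filing stresses rising input costs but stable demand.\n"
    "Final Summary: Cautious but stable outlook.\n"
    "Sentiment Score: <sentiment_score>6</sentiment_score></answer>"
)
print(extract_sentiment_score_from_tags(sample))  # 6.0
print(extract_final_summary(sample))              # Cautious but stable outlook.
```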
Joining all sections to get a final summary
# --- Step 4: Summarize all chunk responses ---
def summarize_all_chunks(all_summaries: list):
combined = "\n\n".join(all_summaries)
prompt = f"""
You are Fin-R1.
Summarize the overall sentiment and financial outlook from the following collection of analysis outputs from an SEC 10-K filing.
Focus on key trends, risks, tone, and forward-looking statements.
"{combined}"
Respond with a concise <answer> summary followed by optional <insight> sections.
"""
    return query_litellm("ollama/hf.co/ernanhughes/Fin-R1-Q8_0-GGUF", prompt)
Tying all sections together
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

# SQLite setup (the db filename is my assumption; table schemas are inferred from the INSERTs below)
conn = sqlite3.connect("edgar_sentiment.db")
cursor = conn.cursor()
cursor.execute("""CREATE TABLE IF NOT EXISTS edgar_sentiment_chunks (
    cik TEXT, accession TEXT, chunk_index INTEGER, chunk_text TEXT,
    chunk_sentiment TEXT, chunk_final_sentiment TEXT, sentiment_score REAL)""")
cursor.execute("""CREATE TABLE IF NOT EXISTS edgar_sentiment (
    cik TEXT, accession TEXT, sentiment_summary TEXT)""")

results = []
ticker = "TSLA"
form = "10-K"
chunk_size = 20000
cik, accession_number, url = get_latest_filing_url(ticker, form)
filing_text = get_section_text(url, "1A")
logger.info(f"Processing Filing Text: {filing_text[:100]}...")
if filing_text:
chunks = chunk_text(filing_text, chunk_size=chunk_size)
all_summaries = []
for idx, chunk in enumerate(chunks):
try:
content, summary, score = analyze_sentiment_with_fin_r1(chunk)
all_summaries.append(summary)
cursor.execute("""
INSERT INTO edgar_sentiment_chunks (cik, accession, chunk_index, chunk_text, chunk_sentiment, chunk_final_sentiment, sentiment_score)
VALUES (?, ?, ?, ?, ?, ?, ?)""",
(cik, accession_number, idx, chunk, content, summary, score))
conn.commit()
except Exception as e:
logger.info(f"Error analyzing chunk: {e}")
full_summary = summarize_all_chunks(all_summaries)
cursor.execute("""
INSERT INTO edgar_sentiment (cik, accession, sentiment_summary)
VALUES (?, ?, ?)""", (cik, accession_number, full_summary))
conn.commit()
results.append({"cik": cik, "summary": full_summary})
conn.close()
<answer>
The SEC filing segment outlines financial strategies and anti-takeover measures impacting Tesla’s corporate structure. Key points include:
1. **Convertible Senior Notes Hedges**: These mitigate dilution and cash outflows, neutralizing debt risks.
2. **Warrant Transactions**: Dilution depends on stock price thresholds; failure to meet these could trigger share issuance.
3. **Musk’s Personal Loans**: Forced share sales if loans are secured by Tesla stock could lower the stock price.
4. **Anti-Takeover Provisions**: Enhance shareholder protection but complicate acquisitions.
**Overall Sentiment**: Mixed, with negative risks from potential dilution and Musk-related pressures outweighing neutral anti-takeover measures.
\boxed{3}
</answer>
<insight>
The document emphasizes strategic financial tools like convertible notes and warrants to manage debt and equity dilution. However, these come with conditional risks tied to market performance (e.g., warrant exercises) and personal obligations (Musk’s loan pledges). Anti-takeover provisions protect shareholders but limit flexibility in corporate governance.
</insight>
Code examples
The code examples for this post can be found here:
References
Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning
Fin-R1 model page on Hugging Face
📘 Glossary of Terms
Term / Acronym | Meaning / Description |
---|---|
LLM | Large Language Model – a neural network trained to understand and generate human language. |
Fin-R1 | A 7B-parameter financial reasoning LLM trained with supervised learning and reinforcement learning. |
SFT | Supervised Fine-Tuning – a phase where the model is trained on labeled data to learn task-specific behavior. |
RLHF / GRPO | Reinforcement Learning with Human Feedback – here specifically GRPO (Group Relative Policy Optimization), used to fine-tune model outputs. |
CoT | Chain-of-Thought – a prompting technique that makes the model reason step-by-step before answering. |
GGUF | A compact, efficient model file format for LLMs from the llama.cpp project (successor to GGML), optimized for local inference. |
Q4_K_M / Q8_0 | Quantization levels (4-bit and 8-bit) used in GGUF models to reduce memory requirements. |
LM Studio | GUI desktop app for running and chatting with GGUF models. |
Basel III | International banking regulations that define capital adequacy, liquidity, and leverage requirements. |
CAR | Capital Adequacy Ratio – minimum capital banks must maintain, part of Basel III. |
LCR | Liquidity Coverage Ratio – ensures banks can meet short-term obligations. |
VaR | Value at Risk – a measure of potential financial loss over a defined period. |
Sharpe Ratio | A measure of risk-adjusted return in portfolio management. |
10-K Filing | Annual report filed by public companies with the SEC; includes risk factors, financials, and management discussion. |
SEC | U.S. Securities and Exchange Commission – regulates financial markets and company disclosures. |
EDGAR | Electronic Data Gathering, Analysis, and Retrieval system – the SEC’s online database of filings. |
GGML / llama.cpp | Toolkits that support local inference of quantized LLMs using formats like GGUF. |
Quant Finance | Quantitative finance – using mathematical models and computational tools for financial decision-making. |