Fin-R1: a Financial Reasoning LLM with Reinforcement Learning and CoT
Introduction
Fin-R1 is a 7B-parameter model fine-tuned specifically for financial reasoning, with performance that rivals, and on some financial benchmarks beats, much larger models like DeepSeek-R1.
This post puts the model to work and compares it with phi3 across various tasks.
- phi3 for comparison
Phi-3 is a lightweight, general-purpose model known for its efficiency and strong reasoning performance at smaller parameter scales. It serves as a good baseline for assessing how domain-specific tuning in Fin-R1 improves financial understanding and response structure.
- Qwen2.5 for judging the results
Qwen2.5 is used as the judge because it exhibits strong instruction-following ability and Chain-of-Thought reasoning skills, making it well suited to judging response quality across dimensions like accuracy, completeness, and clarity. Its adherence to structure and scoring consistency helped generate reliable evaluations across use cases.
TL;DR - What Can Fin-R1 Do?
- ✅ Robo-advisory with CoT evaluation
- ✅ Basel III compliance automation
- ✅ Finance exam tutoring
- ✅ Market risk analysis
- ✅ KPI interpretation with exec summary
- ✅ Sentiment from SEC filings (10-K, etc.)
- ✅ Quant finance code generation (e.g. Black-Scholes, Sharpe Ratio)
- ✅ EDGAR Document Analysis
By the end of this post you should have a good idea of some of the ways we can leverage this model for finance-related tasks.
Why Fin-R1
1. Fin-R1 is specifically designed and fine-tuned for financial applications.
| Data Category | Data Category Description | Source | Proportion |
|---|---|---|---|
| Financial Code | Financial Quantitative Strategy Code Generation | FinanceQT | 0.2% |
| Financial Expertise | Financial Terminology, Explanation, Q&A on Financial Expertise, Financial Calculations | Finance-Instruct-500K | 18.2% |
| | | FinanceIQ | 3.4% |
| | | FinPEE | 0.3% |
| Non-reasoning Financial Business Knowledge | Content Generation in Financial Business, Regulatory Compliance, Financial Knowledge, Financial Cognition, Financial Logic | Ant-Finance | 2.0% |
| | | FinCorpus | 48.4% |
| Financial Reasoning Business Knowledge | Numerical Reasoning on Financial Data, Financial News Sentiment Classification, Financial News Classification, Financial Causal Relationship Extraction | FinQA | 4.8% |
| | | ConvFinQA | 12.3% |
| | | TFNS | 4.0% |
| | | FinCUGE | 6.4% |
2. Versatile Financial Applications
Fin-R1 supports a wide range of financial tasks, including but not limited to:
- Financial Analysis: Generating insights, summarizing reports, and analyzing trends.
- Risk Management: Assessing risks, identifying potential issues, and providing recommendations.
- Investment Research: Supporting decision-making by processing market data, news, and company reports.
- Regulatory Compliance: Assisting with understanding and interpreting complex financial regulations.
- Customer Support: Automating responses to customer inquiries about banking, investments, or financial products.
3. Open-Source and Customizable
- Fin-R1 is open-source, allowing users to access, change, and fine-tune the model to suit their specific needs.
- This flexibility enables organizations to adapt the model for proprietary datasets or unique financial workflows, enhancing its utility.
What Makes Financial Reasoning Hard?
Finance-specific LLMs must deal with:
- Fragmented financial data
- Black-box reasoning (not interpretable enough for regulation)
- Poor generalization across diverse business scenarios
Fin-R1 addresses these with:
- A specialized dataset: Fin-R1-Data
- A two-stage training approach: Supervised Fine-Tuning (SFT) + Reinforcement Learning (GRPO)
- Explicit reasoning structure using `<think>` and `<answer>` tags (a small parsing helper is sketched below)
Fin-R1-Data contains 60,091 reasoning traces, built from:
- Public datasets: FinQA, ConvFinQA, TFNS, FinCorpus, etc.
- A curated FinPEE dataset from financial exam PDFs
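Because every Fin-R1 response is expected to follow the `<think>`/`<answer>` structure, it is easy to separate the reasoning trace from the final answer programmatically. A minimal helper sketch (my addition, not part of Fin-R1's own tooling):

```python
import re

def split_cot(response: str) -> tuple[str, str]:
    """Split a Fin-R1-style response into (reasoning, answer) using its CoT tags."""
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return (
        think.group(1).strip() if think else "",
        # Fall back to the raw text if the model omitted the tags
        answer.group(1).strip() if answer else response.strip(),
    )
```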
Converting the model to GGUF so we can test and run it locally
I wanted to use the model locally through Ollama, so to make this easy I converted the model to GGUF.
What is GGUF and Why Does It Matter?
GGUF is a powerful and efficient model file format for transformer-based large language models (LLMs), created by Georgi Gerganov for the llama.cpp project. It succeeds earlier formats like GGML and offers a more streamlined, portable, and quantization-friendly way to store models for offline inference.
Modern LLMs like LLaMA, Mistral, and Phi are massive—often requiring GPUs with tens of GBs of VRAM. GGUF enables these models to:
- ✅ Run efficiently on CPUs (even laptops and Raspberry Pi)
- ✅ Use low memory through quantization (like Q4_K_M, Q8_0)
- ✅ Load faster with optimized weights and metadata
- ✅ Interoperate with llama.cpp-based tools like:
- Ollama – local models with an easy-to-use chat interface and API
- LM Studio – GUI for chatting with GGUF models
- text-generation-webui – powerful browser-based UI
- koboldcpp – fine-tuned for storytelling and role play
📦 When Should You Use GGUF?
Use GGUF when you want to:
Use Case | Why GGUF Helps |
---|---|
Run LLMs locally | Avoid sending data to the cloud, ensure privacy |
Deploy lightweight AI tools | Run on CPU or low-end hardware without a GPU |
Experiment with quantized models | Save space and memory using 4-bit or 8-bit versions |
Integrate into desktop or web apps | Fast load times and compatibility with local inference engines |
🛠️ Model Types Supported
GGUF is designed for transformer architectures like:
- LLaMA / LLaMA2 / LLaMA3
- Mistral / Mixtral
- Falcon
- GPT-J / GPT-NeoX
- Phi-2
- OpenHermes, StarCoder, and other fine-tuned derivatives
Creating a GGUF version of the model
You can use the Hugging Face gguf-my-repo space to do this.
I created an 8-bit GGUF version, ernanhughes/Fin-R1-Q8_0-GGUF, which will allow you to use the model in Ollama and LM Studio. Both give a nice chat interface and API on Windows.
Using Docker to create a GGUF model
You can also use docker to do this conversion process.
mkdir -p ~/models
huggingface-cli login
# Note: tilde does not expand inside double quotes, so use $HOME
huggingface-cli download SUFE-AIFLM-Lab/Fin-R1 --local-dir "$HOME/models" --include "*"

# Convert to GGUF
docker run --rm -v "$HOME/models":/repo ghcr.io/ggerganov/llama.cpp:full --convert "/repo" --outtype f32

ls ~/models | grep .gguf
#> ggml-model-f32.gguf

# Quantize from f32 GGUF to Q4_K_M
docker run --rm -v "$HOME/models":/repo ghcr.io/ggerganov/llama.cpp:full --quantize "/repo/ggml-model-f32.gguf" "/repo/ggml-model-Q4_K_M.bin" "Q4_K_M"

ls ~/models | grep .bin
#> ggml-model-Q4_K_M.bin
Once converted you can then use this model in Ollama. To use the model:
ollama pull hf.co/ernanhughes/Fin-R1-Q8_0-GGUF
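Once pulled, a quick smoke test through litellm confirms the model responds (this assumes Ollama is running locally; the model tag matches the pull command above):

```python
import litellm

response = litellm.completion(
    model="ollama/hf.co/ernanhughes/Fin-R1-Q8_0-GGUF",
    messages=[{"role": "user", "content": "What is the Sharpe Ratio?"}],
)
print(response["choices"][0]["message"]["content"])
```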
GGUF vs TorchScript vs ONNX
Format | Best For | Pros | Cons |
---|---|---|---|
GGUF | Local CPU inference | Quantized, llama.cpp tools, easy to use | Mostly for LLaMA-style models |
TorchScript | PyTorch deployment | Native PyTorch integration | Larger size, less portable |
ONNX | Cross-platform ML pipelines | Broad compatibility (TensorRT, OpenVINO) | Conversion can be tricky |
Loading the model config
The Hugging Face `AutoConfig` API allows you to programmatically retrieve a model's architecture and configuration details (hidden size, number of layers, attention heads, max position embeddings, vocabulary size) either from a local directory or directly from the Hugging Face Hub.
This is useful when:
- You want to check how much text you can reliably send to the model for summarization or processing.
- You are converting a model to another format (like GGUF) and want to verify dimensions.
- You want to inspect model compatibility with tools (e.g., tokenizer, quantization).
- Documentation is missing or unclear about the model internals.
By using `AutoConfig.from_pretrained("model-name")`, you can avoid guesswork and ensure that you're handling the model correctly across tools and pipelines.
Code example: loading the model config
This example code shows how to fetch a model's config from several sources and render it as a markdown table for viewing.
from transformers import AutoConfig
from huggingface_hub import HfApi
import requests
import json
import os
def config_to_markdown(config_dict):
# Priority ordered keys
priority_keys = [
"model_type", "architectures", "hidden_size", "num_attention_heads",
"num_hidden_layers", "intermediate_size", "vocab_size", "type_vocab_size",
"max_position_embeddings", "hidden_act", "initializer_range",
"layer_norm_eps", "attention_probs_dropout_prob", "hidden_dropout_prob",
"pad_token_id", "bos_token_id", "eos_token_id"
]
all_keys = list(config_dict.keys())
# Sort by priority first, then append any other keys
ordered_keys = [k for k in priority_keys if k in config_dict] + \
[k for k in all_keys if k not in priority_keys]
# Create markdown table
markdown = "| Key | Value |\n|-----|-------|\n"
for k in ordered_keys:
v = config_dict[k]
if isinstance(v, list):
v = ", ".join(map(str, v))
elif isinstance(v, dict):
v = json.dumps(v)
markdown += f"| `{k}` | `{v}` |\n"
return markdown
def inspect_model_config(model_name_or_path, source="autoconfig"):
"""
Inspect a Hugging Face model's configuration and output as markdown.
Parameters:
model_name_or_path (str): model name from Hugging Face hub or local path.
source (str): One of ["autoconfig", "api", "rest", "local_json"]
"""
config_dict = None
if source == "autoconfig":
print("Using AutoConfig (local or remote)...")
config = AutoConfig.from_pretrained(model_name_or_path)
config_dict = config.to_dict()
elif source == "api":
print("Using huggingface_hub HfApi...")
api = HfApi()
model_info = api.model_info(model_name_or_path)
config_dict = model_info.config if hasattr(model_info, 'config') else {}
elif source == "rest":
print("Using Hugging Face REST API...")
url = f"https://huggingface.co/api/models/{model_name_or_path}"
response = requests.get(url)
if response.status_code == 200:
data = response.json()
config_dict = data.get("config", {})
else:
print("Failed to retrieve model metadata from REST API.")
elif source == "local_json":
print("Loading config.json from local directory...")
path = os.path.join(model_name_or_path, "config.json")
if os.path.exists(path):
with open(path, "r") as f:
config_dict = json.load(f)
else:
print(f"config.json not found at {path}")
else:
print("Invalid source. Choose from: autoconfig, api, rest, local_json.")
    if config_dict:
        print("\n### Model Configuration (Markdown Table)\n")
        print(config_to_markdown(config_dict))
    return config_dict

model_name = "SUFE-AIFLM-Lab/Fin-R1"
model_config = inspect_model_config(model_name, source="autoconfig")
SUFE-AIFLM-Lab/Fin-R1
Model Configuration (Markdown Table)

| Key | Value |
|-----|-------|
| `model_type` | `qwen2` |
| `architectures` | `Qwen2ForCausalLM` |
| `hidden_size` | `3584` |
| `num_attention_heads` | `28` |
| `num_hidden_layers` | `28` |
| `intermediate_size` | `18944` |
| `vocab_size` | `152064` |
| `max_position_embeddings` | `32768` |
| `hidden_act` | `silu` |
| `initializer_range` | `0.02` |
| `pad_token_id` | `None` |
| `bos_token_id` | `151643` |
| `eos_token_id` | `151645` |
| `use_sliding_window` | `False` |
| `sliding_window` | `None` |
| `max_window_layers` | `28` |
| `num_key_value_heads` | `4` |
| `rms_norm_eps` | `1e-06` |
| `use_cache` | `True` |
| … | |
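For example, `max_position_embeddings` gives the advertised context window, from which you can sketch a rough character budget for prompts. This assumes the common heuristic of roughly 4 characters per English token; as the context-limit test later in this post shows, the usable window in a local runtime is often much smaller.

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("SUFE-AIFLM-Lab/Fin-R1")
# ~4 characters per token is a rough heuristic for English text
approx_chars = cfg.max_position_embeddings * 4
print(f"{cfg.max_position_embeddings} tokens ≈ {approx_chars:,} characters")
```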
Testing the Fin-R1 model
These are some tests I ran on the model to get an idea of the quality of its results.
Demo 1: Robo Advisory
Here we use Fin-R1 to generate financial advice for a set of investors with different financial goals.
We send the same query to a second model, phi3, and then use a third model, Qwen2.5, to judge the results.
Code example and testing
import os
import sqlite3
from typing import Optional, Tuple, Dict
from pprint import pprint
from dotenv import load_dotenv
import litellm
class Config:
def __init__(self):
load_dotenv()
self.model1_name = os.getenv("MODEL_1", "ollama/hf.co/ernanhughes/Fin-R1-Q8_0-GGUF")
self.model2_name = os.getenv("MODEL_2", "ollama/phi3")
self.evaluator_model = os.getenv("EVAL_MODEL", "ollama/qwen2.5")
self.db_path = os.getenv("DB_PATH", "fin_r1_advisory.db")
self.debug = os.getenv("DEBUG", "true").lower() == "true"
if self.debug:
litellm._turn_on_debug()
def query_litellm(model_name: str, prompt: str) -> str:
response = litellm.completion(
model=model_name,
messages=[{"role": "user", "content": prompt}]
)
return response['choices'][0]['message']['content']
def call_models(prompt: str, config: Config) -> tuple[str, str]:
return (
query_litellm(config.model1_name, prompt),
query_litellm(config.model2_name, prompt)
)
def build_comparison_prompt(question: str, response_a: str, response_b: str) -> str:
return f"""
### Task:
You are a financial reasoning expert. Two models have provided investment advice in CoT format.
Evaluate which advice is more helpful, complete, and appropriate based on reasoning and financial knowledge.
### Instructions:
1. Assess each response based on:
- Accuracy and factual correctness
- Clarity and coherence
- Depth of explanation
- Relevance to the question
2. Assign each response a score from **0 to 10**.
3. Provide a short justification for your scoring.
---
### Question:
{question}
---
### Response A:
{response_a}
---
### Response B:
{response_b}
---
### Output Format:
Response A Score: X/10
Response B Score: Y/10
Winner: Response A or Response B or Tie
Justification: <your analysis here>
"""
class UserProfile:
def __init__(self, name, age, gender, risk_tolerance, investment_goal, investment_horizon_years, current_portfolio, economic_context):
self.name = name
self.age = age
self.gender = gender
self.risk_tolerance = risk_tolerance
self.investment_goal = investment_goal
self.investment_horizon_years = investment_horizon_years
self.current_portfolio = current_portfolio
self.economic_context = economic_context
def to_dict(self):
return self.__dict__
def gen_robo_request(user_profile: UserProfile) -> str:
profile = user_profile.to_dict()
return (
f"<think>{profile['name']} is {profile['age']} years old with a {profile['risk_tolerance']} risk tolerance. "
f"Goal is {profile['investment_goal']} in {profile['investment_horizon_years']} years. "
f"Current portfolio: {profile['current_portfolio']}. "
f"Economic context: {profile['economic_context']}."
)
class DatabaseManager:
def __init__(self, db_path: str):
self.conn = sqlite3.connect(db_path)
self._initialize_schema()
def _initialize_schema(self):
cursor = self.conn.cursor()
cursor.execute("""
CREATE TABLE IF NOT EXISTS user_profiles (
name TEXT PRIMARY KEY,
age INTEGER,
gender TEXT,
risk_tolerance TEXT,
investment_goal TEXT,
horizon_years INTEGER,
economic_context TEXT,
current_portfolio TEXT
)
""")
cursor.execute("""
CREATE TABLE IF NOT EXISTS advisory_results (
user_name TEXT,
request TEXT,
fin_r1_output TEXT,
model_output TEXT,
report_prompt TEXT,
report TEXT,
FOREIGN KEY(user_name) REFERENCES user_profiles(name)
)
""")
self.conn.commit()
def insert_user_profile(self, profile: Dict):
cursor = self.conn.cursor()
cursor.execute("""
INSERT OR REPLACE INTO user_profiles (
name, age, gender, risk_tolerance, investment_goal,
horizon_years, economic_context, current_portfolio
) VALUES (?, ?, ?, ?, ?, ?, ?, ?)
""", (
profile['name'], profile['age'], profile['gender'], profile['risk_tolerance'],
profile['investment_goal'], profile['investment_horizon_years'],
profile['economic_context'], str(profile['current_portfolio'])
))
def insert_advisory_result(self, user_name: str, request: str,
fin_r1_output: str, model_output: str,
report_prompt: str, report: str):
cursor = self.conn.cursor()
cursor.execute("""
INSERT INTO advisory_results (
user_name, request, fin_r1_output, model_output, report_prompt, report
) VALUES (?, ?, ?, ?, ?, ?)
""", (user_name, request, fin_r1_output, model_output, report_prompt, report))
def commit(self):
self.conn.commit()
def close(self):
self.conn.close()
def main():
config = Config()
db = DatabaseManager(config.db_path)
user_profiles = [
UserProfile(
name="User A", age=35, gender="female", risk_tolerance="moderate",
investment_goal="retirement", investment_horizon_years=20,
current_portfolio={"stocks": 6000, "bonds": 300, "cash": 1000},
economic_context="Fed is expected to raise interest rates next quarter"
),
UserProfile(
name="User B", age=25, gender="male", risk_tolerance="high",
investment_goal="wealth_growth", investment_horizon_years=10,
current_portfolio={"stocks": 20000, "bonds": 10000, "cash": 300000},
economic_context="Inflation is steady and market volatility is growing"
)
]
for profile in user_profiles:
db.insert_user_profile(profile.to_dict())
request = gen_robo_request(profile)
fin_r1_output, model_output = call_models(request, config)
report_prompt = build_comparison_prompt(request, fin_r1_output, model_output)
report = query_litellm(config.evaluator_model, report_prompt)
db.insert_advisory_result(
user_name=profile.name,
request=request,
fin_r1_output=fin_r1_output,
model_output=model_output,
report_prompt=report_prompt,
report=report
)
print(f"\n--- Report for {profile.name} ---\n")
print(report)
    db.commit()
    db.close()

if __name__ == "__main__":
    main()
Results
In my results on this small test, Fin-R1 performed very well.
Response A Score: 9/10
Response B Score: 7/10
Winner: Response A
Response A Score: 8/10
Response B Score: 6/10
Winner: Response A
Demo 2: Financial Compliance Automation
We are going to use the exact same structure to build a financial compliance application.
Code example and testing
Note: I only included the prompts here; the database and model-calling code is the same as in Demo 1. You can view the full code in this post's GitHub repo.
compliance_rules = [
"Basel III requires banks to maintain a minimum capital adequacy ratio (CAR) of 8% to ensure financial stability.",
"Basel III mandates that the Liquidity Coverage Ratio (LCR) must be at least 100% to ensure banks can meet short-term obligations.",
"Under Basel III, the leverage ratio must be at least 3% to prevent excessive on- and off-balance sheet leverage."
]
policy_statements = [
"The capital adequacy ratio of the bank fell below 8% in Q2 due to increased exposure to high-risk loans.",
"The bank’s liquidity coverage ratio dropped to 87% in the last quarter as it increased its long-term asset holdings to chase yield.",
"The bank reported a leverage ratio of 2.5% after an aggressive expansion in derivatives trading and structured financing."
]
def generate_prompt(policy_input: str, compliance_rule: str, model: str) -> str:
instruction = f"""
You are a financial reasoning assistant. Read the compliance rule and the bank's policy statement, then determine whether the bank is compliant. Provide your response in Chain-of-Thought format using <think> and <answer> tags.
Compliance Rule:
\"\"\"{compliance_rule}\"\"\"
Policy Statement:
\"\"\"{policy_input}\"\"\"
"""
return instruction
def build_comparison_prompt(policy_input: str, compliance_rule: str, output_a: str, output_b: str) -> str:
return f"""
You are a highly qualified financial compliance analyst with expertise in Basel III regulations.
Two AI models were given a compliance rule and a policy statement. They were asked to reason through the scenario using a Chain-of-Thought (CoT) format and provide a final compliance assessment.
Your task is to critically evaluate both responses and score them based on:
1. **Regulatory accuracy**: how well the model understands and applies the compliance rule
2. **Reasoning clarity**: how logically the model arrives at its conclusion
3. **Financial insight**: how deep and appropriate the explanation is
4. **Output structure**: correct use of <think> and <answer> tags (important for interpretability)
---
### Compliance Rule:
\"\"\"{compliance_rule}\"\"\"
### Policy Statement:
\"\"\"{policy_input}\"\"\"
---
### Model A Response:
{output_a}
### Model B Response:
{output_b}
---
Please evaluate each model independently and respond in the following format:
Model A Score: <score>/10
Justification A: <explanation of how well it reasoned, applied the rule, and used CoT formatting>
Model B Score: <score>/10
Justification B: <same as above>
Preferred Model: Model A or Model B
Only select the preferred model if one is clearly better overall.
"""
Results evaluation
The results are again impressive:
### Evaluation:
#### Model A:
- **Regulatory Accuracy**: 9/10. The response correctly interprets that Basel III mandates a minimum CAR of 8% and accurately identifies that falling below this threshold in Q2 constitutes non-compliance with the rule.
**Model A Score: 8.6/10**
**Model B Score: 7.2/10**
....
Model A Score: 8/10
Model B Score: 6/10
Preferred Model: Model A
Model A is preferred because it demonstrates better clarity in reasoning, provides more detailed financial insight, and uses the CoT format more effectively. While both models ultimately come to the same conclusion about non-compliance, Model A offers a more comprehensive and structured assessment.
Demo 3: Exam question
This demo shows how we could help a student prepare for their finance exams.
exam_questions = [
"A company's revenue is $500,000 and expenses are $350,000. What is the net income?",
"An investor buys a bond for $950 and receives $1,000 at maturity in one year. What is the yield?",
"A project requires an initial investment of $10,000 and is expected to return $2,500 annually for 5 years. What is the payback period?"
]
# --- Step 2: Shared prompt template for both models ---
def generate_exam_prompt(question: str) -> str:
return f"""
You are a finance tutor helping a student prepare for an exam.
Answer the following question using Chain-of-Thought (CoT) reasoning format.
Wrap your reasoning in <think> tags and your final answer in <answer> tags.
Question:
"{question}"
"""
def judge_exam_answers(question, output_a, output_b):
return f"""
You are a finance instructor evaluating student responses to an exam question. Two models have answered using a Chain-of-Thought format.
Question:
"{question}"
---
Model A:
{output_a}
Model B:
{output_b}
Evaluate each model on:
1. Correctness of the answer
2. Accuracy and completeness of the reasoning
3. Use of <think> and <answer> tags
Please respond in this format:
Model A Score: <score>/10
Justification A: <reasoning>
Model B Score: <score>/10
Justification B: <reasoning>
Preferred Model: Model A or Model B
"""
Analysis and Results
Model A Score: 8/10
Model B Score: 7/10
Model A Score: 9.5/10
Model B Score: 8/10
Model A Score: 9/10
Model B Score: 8/10
Demo 4: Risk Analysis
risk_scenarios = [
"An investor is heavily invested in Nvidia. A new AI paradigm called DeepSeek emerges out of China, dramatically reducing the need for GPUs. Assess the risk to Nvidia's stock.",
"An investor holds a concentrated position in Tesla. BYD announces a new solid-state battery that doubles range, halves cost, and enables near-instant charging, removing the need for most public charging infrastructure. Analyze the potential impact on Tesla."
]
# --- Step 2: Shared prompt for both models ---
def generate_risk_prompt(scenario: str) -> str:
return f"""
You are a financial analyst evaluating the impact of a company-specific market scenario.
Explain how this scenario might affect the investor's portfolio using Chain-of-Thought (CoT) reasoning.
Wrap your reasoning in <think> tags and provide a summary inside <answer> tags.
Scenario:
"{scenario}"
"""
def judge_risk_outputs(scenario, output_a, output_b):
return f"""
You are a senior equity analyst comparing two CoT reasoning outputs for a company-specific risk scenario.
Evaluate which response is more insightful, realistic, and helpful for an investor.
Scenario:
"{scenario}"
---
Model A:
{output_a}
---
Model B:
{output_b}
---
Evaluate each model on:
1. Accuracy of market and company-specific reasoning
2. Clarity and insight of explanation
3. Proper structure using <think> and <answer> tags
Respond in this format:
Model A Score: <score>/10
Justification A: <reasoning>
Model B Score: <score>/10
Justification B: <reasoning>
Preferred Model: Model A or Model B
"""
In this case, let's look at the generated analysis: it is an in-depth answer. I included the full text here so you can judge the quality of the model's output.
Tesla Answer
First, I need to understand what the new technology does: it doubles the range of electric vehicles (EVs), halves production costs, and allows for near-instant charging. These are massive improvements over current battery tech. The immediate thought is that this could make BYD’s EVs more competitive in the market, especially since cost and range anxiety are big factors for consumers.
But wait, the investor here has a large stake in Tesla. So how does this impact them? Let’s break it down step by step.
1. **Competitive Advantage**: If BYD’s new battery technology is as revolutionary as described, it could give BYD an edge over other EV manufacturers, including Tesla. This might lead to increased market share for BYD in the EV sector. Since the investor’s portfolio is heavily weighted towards Tesla, this could be concerning because a competitor’s success directly impacts Tesla’s potential growth.
2. **Impact on Tesla’s Stock**: If the market starts to price in the risk that BYD becomes a major player due to their new battery tech, there might be a decrease in demand for Tesla’s shares. This could lead to a drop in Tesla’s stock price, which would negatively affect the investor’s holdings.
3. **Long-term vs Short-term**: The scenario mentions that this technology removes the need for most public charging infrastructure. That’s a huge point because range anxiety is a significant deterrent for many potential EV buyers. If BYD can eliminate the need for extensive charging networks, their EVs become more attractive compared to Tesla’s vehicles, which rely on existing infrastructure. This could shift consumer preferences towards BYD products.
4. **Innovation Pressure**: The development by BYD might force Tesla to accelerate its own R&D efforts in battery technology. If Tesla doesn’t keep up with the technological advancements, they might lose their competitive edge. However, investing heavily in research and development (R&D) could be costly and might not guarantee success. This financial strain could affect Tesla’s profitability and stock performance.
5. **Cross-border Competition**: BYD is a Chinese company, so its expansion into new markets could face different regulatory and trade barriers. For example, the U.S. market has high entry costs due to regulations like CAFE standards. If BYD can navigate these challenges effectively, they might penetrate American markets more easily than Tesla could compete against them.
6. **Customer Perception**: A technology that halves production costs would make EVs cheaper for consumers. Even if Tesla’s vehicles have features or designs customers prefer, the lower price point from BYD could be a major draw. This could lead to decreased demand for higher-priced Tesla models unless they can offer something else that compensates for the cost difference.
7. **Supply Chain Shifts**: If BYD becomes a significant EV manufacturer, there might be changes in the global supply chain. Companies might prefer to source components from BYD due to lower costs or better performance. This could disrupt Tesla’s existing supplier relationships and increase their operational costs.
8. **Corporate Strategy**: Tesla has historically focused on premium vehicles and vertical integration (like their own battery production). The announcement by BYD challenges both of these strategies. If other companies start adopting similar approaches, it might force Tesla to change its business model, which could be risky for investors if the transition isn’t smooth.
Putting all this together, the introduction of BYD’s solid-state batteries presents a significant threat to Tesla’s market position. The investor holding a concentrated Tesla portfolio should be cautious as this could lead to potential losses if the stock price drops due to increased competition or if Tesla fails to innovate sufficiently. However, it’s also possible that Tesla could respond by investing heavily in their own battery technology, perhaps even developing solid-state batteries themselves, which would keep them competitive. The outcome depends on how quickly Tesla can adapt and compete against BYD’s new offerings.

The scenario introduces a revolutionary solid-state battery technology from BYD that significantly reduces costs, doubles range, and eliminates the need for extensive public charging infrastructure. This has several implications for an investor holding a concentrated position in Tesla:
- Competitive Disadvantage: BYD’s advancements could empower them to capture more market share, especially if their EVs become more cost-effective and convenient than Tesla’s offerings.
- Stock Price Risk: Increased competition from BYD might reduce demand for Tesla shares, leading to a decline in stock price.
- Consumer Shifts: Lower-cost EVs with extended ranges could alter consumer preferences, potentially decreasing demand for higher-priced Tesla models.
- R&D Pressures: Tesla may need to accelerate its own battery technology development or face losing market share. This could strain financial resources and operational efficiency.
- Supply Chain Changes: Shifts in component sourcing due to cost advantages could disrupt Tesla’s existing supply chain arrangements.
However, Tesla might counter by investing aggressively in solid-state batteries or other innovations to retain its competitive edge. The ultimate impact depends on Tesla’s ability to innovate and adapt swiftly.
Summary: BYD’s technological leap creates a direct threat to Tesla’s market position, increasing risks for investors holding concentrated Tesla shares unless Tesla successfully mitigates this competition through innovation or strategic responses.
Model A Score: 8/10
Model B Score: 6/10
Demo 5: KPI Analysis
In this test I want to demonstrate how Fin-R1 can provide chain-of-thought (CoT) explanations for changes in financial Key Performance Indicators (KPIs). The demo includes:
- Simulated financial KPI shifts and external drivers
- Chain-of-Thought explanations from both models
- An evaluation of the model outputs
- SQLite database storage of scenario evaluations
kpi_scenarios = [
{
"kpi": "Net Income",
"change": "dropped by 15% in Q4 2024",
"context": "due to rising input costs and slower sales in Europe"
},
{
"kpi": "Gross Margin",
"change": "improved by 3% in Q1 2025",
"context": "following a successful shift to in-house component manufacturing"
}
]
def generate_kpi_prompt(scenario):
return f"""
You are a financial analyst preparing a quarterly report for executive stakeholders.
Analyze the following KPI change using Chain-of-Thought (CoT) reasoning.
Wrap your reasoning in <think> tags and summarize the implications in <answer> tags.
KPI: {scenario['kpi']}
Change: {scenario['change']}
"""
def judge_kpi_outputs(scenario, output_a, output_b):
    return f"""
You are a CFO reviewing two model-generated CoT analyses for a KPI shift.
Evaluate which is more insightful, executive-ready, and grounded in financial reasoning.
KPI: {scenario['kpi']}
Change: {scenario['change']}
Context: {scenario['context']}
---
Model A:
{output_a}
Model B:
{output_b}
Evaluate each model on:
1. Business relevance
2. Reasoning clarity
3. Correct use of <think> and <answer> tags
Respond in this format:
Model A Score: <score>/10
Justification A: <reasoning>
Model B Score: <score>/10
Justification B: <reasoning>
Preferred Model: Model A or Model B
"""
#### Model A:
**Score:** 8/10
#### Model B:
**Score:** 6/10
### Preferred Model:
**Model A**
...
Demo 6: Quant Code Generation
We want to see how well the model generates quant code. In this demo we:
- Define quantitative tasks in natural language
- Prompt both models to generate Python solutions
- Use Qwen2.5 to assess the quality of generated code
- Store results in an SQLite database
quant_tasks = [
"Write Python code to calculate the Sharpe Ratio for a stock given a list of daily returns and a risk-free rate.",
"Generate Python code to backtest a moving average crossover strategy using two different windows.",
"Create Python code to compute Value at Risk (VaR) using historical simulation for a portfolio of returns.",
"Write Python code to price a European call option using the Black-Scholes formula.",
"Generate code to simulate a geometric Brownian motion for an asset price.",
"Write Python code to calculate the duration and convexity of a bond given its cash flows and yield.",
"Generate Monte Carlo simulation code to price a European call option.",
"Write Python code to calculate Delta and Gamma for a European call option using the Black-Scholes model."
]
# --- Step 2: Prompt format ---
def generate_code_prompt(task: str) -> str:
return f"""
You are a quantitative financial developer. Write clean, documented Python code to solve the following problem:
{task}
Instructions:
- Explain your approach in <think> tags.
- Include all Python code within <code> tags.
- Use comments to clarify logic and assumptions.
- Ensure the code runs correctly and handles potential edge cases.
- Prefer standard libraries like NumPy, pandas, and SciPy where applicable.
- Avoid excessive complexity; prioritize clarity and correctness.
"""
def generate_validation_prompt(task: str, code_output: str) -> str:
return f"""
You are a senior quantitative engineer reviewing Python code for accuracy.
The code was written to address the following financial task:
"""{task}"""
Here is the model's output:
{code_output}
Instructions:
1. Verify that the code correctly solves the stated task.
2. Identify any bugs, logical flaws, or poor practices.
3. Comment on use of financial logic, edge case handling, and numerical stability.
4. Score the solution out of 10 for correctness and clarity.
Respond with:
- Score: <X>/10
- Critique: <brief assessment>
- Recommendations: <concrete fixes or improvements>
"""
For the option-pricing question it generated the following code, which is correct; for the example inputs it returns a call price of about $6.04, matching the standard Black-Scholes value.
<code>
import numpy as np
from scipy.stats import norm
def black_scholes_call(S, X, r, T, sigma):
"""
Calculate the price of a European call option using the Black-Scholes formula.
Parameters:
S (float): Current stock price
X (float): Exercise price (strike price)
r (float): Risk-free interest rate (annualized)
T (float): Time to expiration (years)
sigma (float): Volatility of the underlying asset
Returns:
float: Price of the European call option
"""
# Check for negative inputs that don't make sense in financial context
if S < 0 or X < 0:
raise ValueError("Stock price and strike price must be non-negative.")
if T <= 0:
raise ValueError("Time to expiration must be positive.")
# Calculate d1 and d2
d1 = (np.log(S / X) + (r + sigma**2 / 2) * T) / (sigma * np.sqrt(T))
d2 = d1 - sigma * np.sqrt(T)
# Ensure numerical stability for extreme values of d2
if d2 > 20:
d2 = 20.0
elif d2 < -20:
d2 = -20.0
# Calculate the call option price using Black-Scholes formula
call_price = S * norm.cdf(d1) - X * np.exp(-r * T) * norm.cdf(d2)
return call_price
# Example usage
if __name__ == "__main__":
S = 100.0 # Current stock price
X = 110.0 # Strike price
r = 0.05 # Risk-free rate (5% annual)
T = 1.0 # Time to expiration (1 year)
sigma = 0.2 # Volatility (20% annual)
try:
call_option_price = black_scholes_call(S, X, r, T, sigma)
print(f"The price of the European call option is: ${call_option_price:.4f}")
except ValueError as e:
print(e)
</code>
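As an extra sanity check beyond the judge's review (my addition, not part of the model's output), the generated `black_scholes_call` can be verified against put-call parity, C - P = S - X * exp(-rT):

```python
import numpy as np
from scipy.stats import norm

def black_scholes_put(S, X, r, T, sigma):
    # European put using the same d1/d2; used here only to test put-call parity
    d1 = (np.log(S / X) + (r + sigma**2 / 2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return X * np.exp(-r * T) * norm.cdf(-d2) - S * norm.cdf(-d1)

S, X, r, T, sigma = 100.0, 110.0, 0.05, 1.0, 0.2
call = black_scholes_call(S, X, r, T, sigma)  # from the generated code above
put = black_scholes_put(S, X, r, T, sigma)
# Put-call parity: C - P should equal S - X * exp(-rT)
assert abs((call - put) - (S - X * np.exp(-r * T))) < 1e-8
```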
EDGAR Sentiment Analysis
For this demo we will download some filings from EDGAR and determine their sentiment.
What is a 10-K report about?
A 10-K report is a comprehensive annual filing submitted by publicly traded companies to the U.S. Securities and Exchange Commission (SEC). It provides a detailed overview of a company’s financial performance, business operations, risk factors, legal proceedings, and management discussion and analysis.
Unlike an annual report to shareholders, which is often more polished and promotional, the 10-K is a formal and standardized disclosure required by law. Investors, analysts, and regulators use it to assess a company’s health and make informed decisions.
The 10-K is a valuable tool for assessing a company’s health because it offers audited financial statements, detailed insights into risks, and management’s own analysis of performance, all in a standardized, transparent format.
Best Sections of the 10-K for Sentiment Analysis
For sentiment analysis of a 10-K filing, not all sections are equally valuable. You’ll get the most useful and meaningful sentiment signals from the narrative sections that contain forward-looking language and risk assessments:
Section | Why it’s Useful |
---|---|
Item 1A. Risk Factors | Discusses potential threats and management concerns (rich in negative tone) |
Item 7. Management’s Discussion and Analysis (MD&A) | Most insightful for tone, confidence, forward-looking statements |
Item 1. Business | Gives context about operations; useful for neutral-to-positive framing |
Item 7A. Quantitative and Qualitative Disclosures about Market Risk | Risk language, volatility perception |
Document Size and Context Limitations
Large documents can exceed the context window of your language model. You can estimate the limit by gradually increasing the input length until the model fails or starts to respond inconsistently. Note that this threshold can vary depending on factors like available VRAM and the specific model you are using.
Because of this, documents need to be split into smaller chunks before processing.
It’s also important to recognize that model performance often degrades as context length increases. NVIDIA’s RULER project provides helpful insights into the effective context length of many models, offering a more realistic view than just advertised token limits.
This example function can help you determine your model’s usable context length:
def test_context_limit(token_multiplier=50):
base = "The quick brown fox jumps over the lazy dog. "
for i in range(1, 1000):
try:
prompt = base * (i * token_multiplier)
print(f"Trying length {len(prompt)}...")
            query_litellm("ollama/hf.co/ernanhughes/Fin-R1-Q8_0-GGUF", prompt)  # or however you call your local model
except Exception as e:
print(f"Failed at length {len(prompt)}: {e}")
break
In my testing I got a figure of 20,250 characters, so I am splitting documents at 20,000.
Getting the document from EDGAR
I will do a complete post on EDGAR later; for now I have written functions that use `sec_api` to fetch our documents.
This code downloads the latest 10-K for a ticker. For this analysis we are using section 1A, but the code should work for any section.
import os
import logging

from dotenv import load_dotenv
from sec_api import QueryApi, ExtractorApi

load_dotenv()
logger = logging.getLogger(__name__)
def get_latest_filing_url(ticker: str, form_type: str = "10-K"):
    """
    Get the CIK, accession number, and URL for the latest filing of a form type.
    Args:
        ticker (str): The stock ticker.
        form_type (str): The SEC form type, e.g. "10-K".
    Returns:
        tuple: (cik, accession_number, filing_url)
    """
api_key = os.getenv("SEC_API_KEY")
query = {
"query": {
"query_string": {
"query": f'formType:"{form_type}" AND ticker:{ticker}'
}
},
"from": "0",
"size": "1",
"sort": [{ "filedAt": { "order": "desc" } }]
}
try:
queryApi = QueryApi(api_key=api_key)
filings = queryApi.get_filings(query)
filing = filings['filings'][0]
accession_number = filing['accessionNo'].replace("-", "")
cik = filing['cik']
url = filing['linkToHtml']
logger.info(f"CIK: {cik} Accession Number: {accession_number} Filing URL: {url}")
return cik, accession_number, url
except Exception as e:
logger.error("Failed to fetch filing info", exc_info=True)
raise
def get_section_text(filing_url: str, section: str = "1A") -> str:
"""
Get the text of a specific section from a filing URL.
Args:
filing_url (str): The URL of the filing.
section (str): The section to extract.
Returns:
str: The text of the section.
"""
api_key = os.getenv("SEC_API_KEY")
extractorApi = ExtractorApi(api_key)
text = extractorApi.get_section(filing_url, section, "text")
return text
Splitting the Document
Splitting the text of a document is an art, and I will write a separate blog post about it. For now I split on paragraph boundaries, keeping each chunk near the configured maximum length.
import re
# --- Step 2: Chunk text on paragraphs with approximate size limit ---
def chunk_text(text, chunk_size=20000):
paragraphs = [p.strip() for p in text.split('\n') if p.strip()]
chunks = []
current_chunk = ""
for para in paragraphs:
if len(current_chunk) + len(para) <= chunk_size:
current_chunk += para + "\n"
else:
if current_chunk:
chunks.append(current_chunk.strip())
current_chunk = para + "\n"
if current_chunk:
chunks.append(current_chunk.strip())
return chunks
Analyzing the sentiment of each section
We want to get three things from each section:
- The chain-of-thought sentiment reasoning, or summary, of the section
- The final summary of this sentiment
- A sentiment score: a value which makes it clear what the model thinks the text says
# --- Step 3: Perform sentiment analysis using Fin-R1
def analyze_sentiment_with_fin_r1(report_text: str):
prompt = f"""
You are Fin-R1, a financial language model.
Analyze the following portion of an SEC filing and extract key sentiment and guidance.
<document>
{chunk}
</document>
Respond in Chain-of-Thought format:
- Begin with <think> for detailed reasoning
- Include a long descriptive summary
- End with a clearly labeled section: "Final Summary:" that gives a 1-2 sentence overall summary.
- End with a clearly labeled section: "Sentiment Score:" from 0 (completely negative) to 10 (completely positive). Surround the score with <sentiment_score> tags.
Wrap everything in <answer> tags.
"""
    content = query_litellm("ollama/hf.co/ernanhughes/Fin-R1-Q8_0-GGUF", prompt)
logger.info(f"content: {content}")
score = extract_sentiment_score_from_tags(content)
final_summary = extract_final_summary(content)
logger.info(f"score: {score}, final_summary: {final_summary}")
return content, final_summary, score
Extracting the results of each section
The model returns large blocks of text; for now we are specifically interested in the score and the final summary. These functions extract that information from the text.
Score: I found that enclosing the score in `<sentiment_score>` tags is the most reliable method for extracting it from the results.
def extract_sentiment_score_from_tags(text):
match = re.search(r"<sentiment_score>(\d+(?:\.\d+)?)</sentiment_score>", text, re.IGNORECASE)
res = match.group(1) if match else None
if res is None:
logger.info(f"Error: score not found in {text}")
return 5
    return float(res)
def extract_final_summary(text):
match = re.search(r"Final Summary:\s*(.*?)\n?(Sentiment Score:|</answer>|$)", text, re.DOTALL)
res = match.group(1).strip() if match else None
if res is None:
logger.info(f"Error: final_summary not found in {text}")
return text
logger.info(f"final_summary: {res}")
return res
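A quick check of both extractors against a synthetic model response (the text here is made up for illustration):

```python
sample = (
    "<answer>The filing stresses rising input costs but stable demand.\n"
    "Final Summary: Cautious but stable outlook.\n"
    "Sentiment Score: <sentiment_score>6</sentiment_score></answer>"
)
print(extract_sentiment_score_from_tags(sample))  # 6.0
print(extract_final_summary(sample))              # Cautious but stable outlook.
```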
Joining all sections to get a final summary
# --- Step 4: Summarize all chunk responses ---
def summarize_all_chunks(all_summaries: list):
combined = "\n\n".join(all_summaries)
prompt = f"""
You are Fin-R1.
Summarize the overall sentiment and financial outlook from the following collection of analysis outputs from an SEC 10-K filing.
Focus on key trends, risks, tone, and forward-looking statements.
"{combined}"
Respond with a concise <answer> summary followed by optional <insight> sections.
"""
    return query_litellm("ollama/hf.co/ernanhughes/Fin-R1-Q8_0-GGUF", prompt)
Tying all sections together
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

# SQLite setup (the db filename is my assumption; table schemas are inferred from the INSERTs below)
conn = sqlite3.connect("edgar_sentiment.db")
cursor = conn.cursor()
cursor.execute("""CREATE TABLE IF NOT EXISTS edgar_sentiment_chunks (
    cik TEXT, accession TEXT, chunk_index INTEGER, chunk_text TEXT,
    chunk_sentiment TEXT, chunk_final_sentiment TEXT, sentiment_score REAL)""")
cursor.execute("""CREATE TABLE IF NOT EXISTS edgar_sentiment (
    cik TEXT, accession TEXT, sentiment_summary TEXT)""")

results = []
ticker = "TSLA"
form = "10-K"
chunk_size = 20000
cik, accession_number, url = get_latest_filing_url(ticker, form)
filing_text = get_section_text(url, "1A")
logger.info(f"Processing Filing Text: {filing_text[:100]}...")
if filing_text:
chunks = chunk_text(filing_text, chunk_size=chunk_size)
all_summaries = []
for idx, chunk in enumerate(chunks):
try:
content, summary, score = analyze_sentiment_with_fin_r1(chunk)
all_summaries.append(summary)
cursor.execute("""
INSERT INTO edgar_sentiment_chunks (cik, accession, chunk_index, chunk_text, chunk_sentiment, chunk_final_sentiment, sentiment_score)
VALUES (?, ?, ?, ?, ?, ?, ?)""",
(cik, accession_number, idx, chunk, content, summary, score))
conn.commit()
except Exception as e:
logger.info(f"Error analyzing chunk: {e}")
full_summary = summarize_all_chunks(all_summaries)
cursor.execute("""
INSERT INTO edgar_sentiment (cik, accession, sentiment_summary)
VALUES (?, ?, ?)""", (cik, accession_number, full_summary))
conn.commit()
results.append({"cik": cik, "summary": full_summary})
conn.close()
<answer>
The SEC filing segment outlines financial strategies and anti-takeover measures impacting Tesla’s corporate structure. Key points include:
1. **Convertible Senior Notes Hedges**: These mitigate dilution and cash outflows, neutralizing debt risks.
2. **Warrant Transactions**: Dilution depends on stock price thresholds; failure to meet these could trigger share issuance.
3. **Musk’s Personal Loans**: Forced share sales if loans are secured by Tesla stock could lower the stock price.
4. **Anti-Takeover Provisions**: Enhance shareholder protection but complicate acquisitions.
**Overall Sentiment**: Mixed, with negative risks from potential dilution and Musk-related pressures outweighing neutral anti-takeover measures.
\boxed{3}
</answer>
<insight>
The document emphasizes strategic financial tools like convertible notes and warrants to manage debt and equity dilution. However, these come with conditional risks tied to market performance (e.g., warrant exercises) and personal obligations (Musk’s loan pledges). Anti-takeover provisions protect shareholders but limit flexibility in corporate governance.
</insight>
Code examples
The code examples for this post can be found here:
References
Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning
Fin-R1 model page on Hugging Face
📘 Glossary of Terms
Term / Acronym | Meaning / Description |
---|---|
LLM | Large Language Model – a neural network trained to understand and generate human language. |
Fin-R1 | A 7B-parameter financial reasoning LLM trained with supervised learning and reinforcement learning. |
SFT | Supervised Fine-Tuning – a phase where the model is trained on labeled data to learn task-specific behavior. |
RLHF / GRPO | Reinforcement Learning with Human Feedback – here specifically GRPO (Group Relative Policy Optimization), used to fine-tune model outputs. |
CoT | Chain-of-Thought – a prompting technique that makes the model reason step-by-step before answering. |
GGUF | A compact, efficient model file format for LLMs from the llama.cpp project (successor to GGML), optimized for local inference. |
Q4_K_M / Q8_0 | Quantization levels (4-bit and 8-bit) used in GGUF models to reduce memory requirements. |
LM Studio | GUI desktop app for running and chatting with GGUF models. |
Basel III | International banking regulations that define capital adequacy, liquidity, and leverage requirements. |
CAR | Capital Adequacy Ratio – minimum capital banks must maintain, part of Basel III. |
LCR | Liquidity Coverage Ratio – ensures banks can meet short-term obligations. |
VaR | Value at Risk – a measure of potential financial loss over a defined period. |
Sharpe Ratio | A measure of risk-adjusted return in portfolio management. |
10-K Filing | Annual report filed by public companies with the SEC; includes risk factors, financials, and management discussion. |
SEC | U.S. Securities and Exchange Commission – regulates financial markets and company disclosures. |
EDGAR | Electronic Data Gathering, Analysis, and Retrieval system – the SEC’s online database of filings. |
GGML / llama.cpp | Toolkits that support local inference of quantized LLMs using formats like GGUF. |
Quant Finance | Quantitative finance – using mathematical models and computational tools for financial decision-making. |