Self-Taught Self-Correction for Small Language Models
Introduction
Large Language Models (LLMs) have made significant strides in reasoning and text generation, but they remain prone to errors, including hallucinations and logical inconsistencies. While existing self-correction approaches rely on external feedback mechanisms or large proprietary models, this paper presents Self-Taught Self-Correction (STaSC), a framework that enables small language models (SLMs) to self-correct through iterative fine-tuning on self-generated data. The approach improves model performance without external evaluators or supervision, making self-correction more accessible and computationally efficient.
STaR
Summary of Self-Taught Reasoner (STaR)
What is STaR?
Self-Taught Reasoner (STaR) is a bootstrapping technique that helps language models improve their reasoning abilities without external supervision. Instead of relying on labeled data, STaR generates its own reasoning chains, filters out incorrect ones, and fine-tunes itself on the survivors iteratively.
Key Idea of STaR
- Generate Reasoning Paths: The model generates multiple reasoning paths for a given question.
- Filter Correct Paths: A selection process keeps only logically correct reasoning chains.
- Fine-Tune with the Best Paths: The model is re-trained on these self-generated correct examples.
- Repeat Iteratively: This process is repeated, progressively improving the model's reasoning capabilities.
Why is STaR Important?
- Eliminates reliance on external labeled data.
- Improves step-by-step reasoning using self-generated examples.
- Works iteratively, meaning models can refine their own reasoning over time.
- Inspired STaSC (Self-Taught Self-Correction), which extends it to self-correcting models.
How STaR Works (Step-by-Step)
1. Initial Generation: The model generates step-by-step reasoning for an input question.
2. Filtering Incorrect Paths: Incorrect answers or inconsistent reasoning chains are removed.
3. Fine-Tuning on Filtered Data: The model is trained on only the correct reasoning paths.
4. Repeat the Process: After multiple iterations, the model gets better at reasoning.
Example: STaR in Action
Question: What is the square root of 81?
STaR Process:
1. Model Generates Step-by-Step Answers
- Step 1: The square root of a number X is a number Y such that Y × Y = X.
- Step 2: 9 × 9 = 81, so the square root of 81 is 9.
- Final Answer: 9
2. Model Self-Evaluates & Filters Out Incorrect Responses
- If it generates “The square root of 81 is 10”, that path is removed from the training data.
3. Fine-Tunes Itself on the Correct Reasoning Paths
- The model learns from its correct answers, refining future responses.
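To make the loop concrete, here is a minimal, runnable sketch of the STaR iteration. The three helpers are toy stand-ins (not the paper's code) so the control flow runs end to end; real versions would call a language model, as in the implementations below.
def generate_paths(model, question, num_paths=3):
    # Toy stand-in: a real version would sample reasoning chains from the model.
    return [f"{question} :: candidate reasoning {i}" for i in range(num_paths)]

def is_correct_path(path, question):
    # Toy stand-in: a real version would check the final answer against a verifier or gold label.
    return path.endswith("0")

def fine_tune(model, examples):
    # Toy stand-in: a real version would run a gradient update on the filtered examples.
    return model

def star_loop(model, questions, num_iterations=3):
    for _ in range(num_iterations):
        training_examples = []
        for question in questions:
            paths = generate_paths(model, question)  # 1. generate reasoning paths
            training_examples += [p for p in paths if is_correct_path(p, question)]  # 2. keep correct ones
        model = fine_tune(model, training_examples)  # 3. fine-tune on the survivors
    return model  # 4. repeat across iterations, then return the improved model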
STaR Implementation
import torch
import sqlite3
import logging
import os
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
# Configure logging
logging.basicConfig(
filename="star_pipeline.log",
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s"
)
class Config:
CONFIDENCE_THRESHOLD = 0.7 # Default threshold for correctness determination
MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
DB_NAME = "reasoning_paths.db"
LOG_FILE = "star_pipeline.log"
NUM_PATHS = 3
MAX_LENGTH = 100
LEARNING_RATE = 5e-5
EPOCHS = 3
QUESTIONS_FILE = "questions.txt"
MODEL_SAVE_DIR = "fine_tuned_models"
CONFIDENCE_METHOD = "model" # Options: "model", "heuristic", "external"
class Database:
def __init__(self, db_name=Config.DB_NAME):
self.db_name = db_name
self.init_db()
def init_db(self):
conn = sqlite3.connect(self.db_name)
cursor = conn.cursor()
cursor.execute("""
CREATE TABLE IF NOT EXISTS paths (
id INTEGER PRIMARY KEY AUTOINCREMENT,
question TEXT,
reasoning_path TEXT,
is_correct BOOLEAN,
confidence REAL DEFAULT 0.0,
is_best BOOLEAN DEFAULT 0
)""")
cursor.execute("""
CREATE TABLE IF NOT EXISTS fine_tuning_log (
id INTEGER PRIMARY KEY AUTOINCREMENT,
base_model TEXT,
model_path TEXT,
source TEXT,
timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
)""")
conn.commit()
conn.close()
logging.info("Database initialized.")
def save_fine_tuning_log(self, base_model, model_path, source):
conn = sqlite3.connect(self.db_name)
cursor = conn.cursor()
cursor.execute("INSERT INTO fine_tuning_log (base_model, model_path, source) VALUES (?, ?, ?)", (base_model, model_path, source))
conn.commit()
conn.close()
logging.info(f"Fine-tuning logged: {model_path} from {source}")
def get_latest_fine_tuned_model(self, model_name=Config.MODEL_NAME):
conn = sqlite3.connect(self.db_name)
cursor = conn.cursor()
cursor.execute("SELECT model_path FROM fine_tuning_log WHERE base_model = ? ORDER BY timestamp DESC LIMIT 1", (model_name,))
result = cursor.fetchone()
conn.close()
if result:
return result[0]
else:
return None
def save_path(self, question, path, is_correct, confidence):
conn = sqlite3.connect(self.db_name)
cursor = conn.cursor()
cursor.execute("INSERT INTO paths (question, reasoning_path, is_correct, confidence, is_best) VALUES (?, ?, ?, ?, ?)",
(question, path, is_correct, confidence, 0))
conn.commit()
conn.close()
logging.info(f"Saved path for question: {question} with confidence: {confidence}")
def update_best_path(self, question, best_path):
conn = sqlite3.connect(self.db_name)
cursor = conn.cursor()
# First, reset all paths for the question to not be best
cursor.execute("UPDATE paths SET is_best = 0 WHERE question = ?", (question,))
# Then, set the best path
cursor.execute("UPDATE paths SET is_best = 1 WHERE question = ? AND reasoning_path = ?", (question, best_path))
conn.commit()
conn.close()
logging.info(f"Updated best path for question: {question}")
class ConfidenceEvaluator:
def __init__(self, config, model, tokenizer):
self.config = config
self.model = model
self.tokenizer = tokenizer
def calculate_confidence(self, path):
method = self.config.CONFIDENCE_METHOD
if method == "model":
return self.model_based_confidence(path)
elif method == "heuristic":
return self.heuristic_confidence(path)
elif method == "external":
return self.external_model_confidence(path)
else:
logging.warning(f"Unknown confidence method: {method}, defaulting to heuristic.")
return self.heuristic_confidence(path)
def model_based_confidence(self, path):
if len(path) < 500:
inputs = self.tokenizer(path, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
outputs = self.model(**inputs)
logits = outputs.logits
probs = torch.nn.functional.softmax(logits, dim=-1)
return probs[:, -1, :].max(dim=-1).values.mean().item() # Max probability from last token
else:
return self.log_probability_confidence(path)
def log_probability_confidence(self, path):
inputs = self.tokenizer(path, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
outputs = self.model(**inputs)
logits = outputs.logits
log_probs = torch.nn.functional.log_softmax(logits, dim=-1)
token_ids = inputs["input_ids"]
selected_log_probs = log_probs.gather(dim=-1, index=token_ids.unsqueeze(-1)).squeeze(-1)
return selected_log_probs.mean().exp().item() # Normalize score across sequence length
def external_model_confidence(self, path):
return 0.5 # Placeholder for external verifier
class ModelManager:
def __init__(self, config, db):
self.config = config
self.db = db
self._load_model()
def _load_model(self):
"""Loads the latest fine-tuned model or defaults to the base model."""
latest_model = self.db.get_latest_fine_tuned_model()
if latest_model and os.path.exists(latest_model):
try:
logging.info("Loading fine-tuned model...")
self.tokenizer = AutoTokenizer.from_pretrained(latest_model)
self.model = AutoModelForCausalLM.from_pretrained(latest_model)
logging.info("Fine-tuned model loaded successfully.")
return
except Exception as e:
logging.error(f"Failed to load fine-tuned model. Error: {e}")
# Load base model as fallback
self.tokenizer = AutoTokenizer.from_pretrained(self.config.MODEL_NAME)
self.model = AutoModelForCausalLM.from_pretrained(self.config.MODEL_NAME)
logging.info("Loaded default model.")
def save_fine_tuned_model(self):
"""Saves the fine-tuned model with a timestamp and updates the latest model reference."""
from datetime import datetime
timestamp = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
model_path = os.path.join(self.config.MODEL_SAVE_DIR, f"{self.config.MODEL_NAME}_{timestamp}")
logging.info(f"Saving fine-tuned model at {model_path}")
os.makedirs(model_path, exist_ok=True)
self.model.save_pretrained(model_path)
self.tokenizer.save_pretrained(model_path)
self.db.save_fine_tuning_log(self.config.MODEL_NAME, model_path, "STaR pipeline")
logging.info(f"Fine-tuned model and tokenizer saved at {model_path}")
def fine_tune_model(self, training_text):
"""Fine-tunes the model on the given training text and updates the saved model."""
optimizer = torch.optim.AdamW(self.model.parameters(), lr=self.config.LEARNING_RATE)
loss_fn = torch.nn.CrossEntropyLoss()
try:
inputs = self.tokenizer(training_text, return_tensors="pt", padding=True, truncation=True)
labels = inputs["input_ids"].clone()
outputs = self.model(**inputs)
logits = outputs.logits
loss = loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
loss.backward()
optimizer.step()
optimizer.zero_grad()
logging.info(f"Fine-tuning completed with loss: {loss.item()}")
self.save_fine_tuned_model()
self._load_model() # Reload the model after fine-tuning
except Exception as e:
logging.error(f"Error during fine-tuning: {e}")
class STaR:
def __init__(self):
self.config = Config()
self.db = Database()
self.model_save_path = os.path.join(self.config.MODEL_SAVE_DIR, "latest")
if os.path.exists(self.model_save_path):
try:
logging.info("Loading fine-tuned model...")
self.tokenizer = AutoTokenizer.from_pretrained(self.model_save_path)
self.model = AutoModelForCausalLM.from_pretrained(self.model_save_path)
logging.info("Fine-tuned model loaded successfully.")
except Exception as e:
logging.error(f"Failed to load fine-tuned model. Error: {e}")
self.tokenizer = AutoTokenizer.from_pretrained(self.config.MODEL_NAME)
self.model = AutoModelForCausalLM.from_pretrained(self.config.MODEL_NAME)
else:
self.tokenizer = AutoTokenizer.from_pretrained(self.config.MODEL_NAME)
self.model = AutoModelForCausalLM.from_pretrained(self.config.MODEL_NAME)
self.confidence_evaluator = ConfidenceEvaluator(self.config, self.model, self.tokenizer)
def evaluate_paths(self, question, paths):
best_path = None
best_confidence = 0.0
confidence_list = [] # Store confidence values to compute threshold dynamically
for path in paths:
confidence = self.confidence_evaluator.calculate_confidence(path)
confidence_list.append(confidence)
is_correct = confidence > self.config.CONFIDENCE_THRESHOLD # Use configurable threshold
self.db.save_path(question, path, is_correct, confidence)
if confidence > best_confidence:
best_confidence = confidence
best_path = path
if best_path:
self.db.update_best_path(question, best_path)
logging.info(f"Best answer selected with confidence {best_confidence}")
return best_path
if __name__ == "__main__":
star_pipeline = STaR()
questions = [
"How does a neural network backpropagation work?",
"What are the key differences between supervised and unsupervised learning?",
"Explain the concept of reinforcement learning with examples.",
"How do decision trees handle missing values?",
"Describe the process of hyperparameter tuning in machine learning."
]
for question in tqdm(questions, desc="Processing questions", unit="question", leave=True, position=0):
logging.info(f"Processing question: {question}")
paths = [question] # Placeholder, replace with actual model outputs
best_answer = star_pipeline.evaluate_paths(question, paths)
logging.info(f"Best answer for '{question}': {best_answer}")
The problem I found with this approach was that it generated far too much data. Even with these small models, running it over just 10 questions produced roughly 60 gigabytes of saved model checkpoints, and I had no good way to judge the quality of those intermediate models. This was not going to work.
Why Does This Matter for STaSC?
- STaSC builds upon STaR but focuses on self-correcting wrong outputs rather than improving logical reasoning.
- Both methods rely on self-learning without human intervention, making them scalable and adaptable.
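The following simplified STaR implementation keeps everything in memory rather than writing model checkpoints to disk, which avoids the storage blow-up described above; the model name and the naive filtering step are placeholders to adapt to your own setup.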
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load a small open-source model
MODEL_NAME = "microsoft/phi-3-mini" # Change to a different model if needed
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
# Define a function to generate reasoning paths
def generate_reasoning_paths(model, tokenizer, question, num_paths=3, max_length=100):
"""Generates multiple reasoning paths for a given question."""
prompt = f"Think step by step: {question}\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=max_length, num_return_sequences=num_paths, do_sample=True)  # sampling is required for multiple distinct sequences
return [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
# Define a function to filter correct paths
def filter_correct_paths(paths):
"""Filters out incorrect reasoning paths. In real use, this could involve a verifier model."""
filtered_paths = []
for path in paths:
if "error" not in path.lower(): # Naรฏve filtering (replace with better logic)
filtered_paths.append(path)
return filtered_paths
# Define a function to fine-tune the model on correct reasoning paths
def fine_tune_model(model, tokenizer, correct_paths, learning_rate=5e-5, epochs=3):
"""Fine-tunes the model using filtered correct reasoning paths."""
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
loss_fn = torch.nn.CrossEntropyLoss()
for epoch in range(epochs):
print(f"Epoch {epoch+1}/{epochs}")
for path in correct_paths:
inputs = tokenizer(path, return_tensors="pt", padding=True, truncation=True)
labels = inputs["input_ids"].clone()
outputs = model(**inputs, labels=labels)
loss = outputs.loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print("Fine-tuning step completed.")
return model
# Run STaR Pipeline
def run_star_pipeline(question):
print(f"Question: {question}")
# Step 1: Generate Reasoning Paths
reasoning_paths = generate_reasoning_paths(model, tokenizer, question)
print("\nGenerated Reasoning Paths:")
for i, path in enumerate(reasoning_paths, 1):
print(f"Path {i}: {path}\n")
# Step 2: Filter Correct Paths
correct_paths = filter_correct_paths(reasoning_paths)
print("\nFiltered Correct Paths:")
for path in correct_paths:
print(path)
# Step 3: Fine-Tune the Model on Correct Paths
if correct_paths:
fine_tune_model(model, tokenizer, correct_paths)
print("\nModel fine-tuned successfully!")
else:
print("\nNo correct paths found. Fine-tuning skipped.")
test_question = "How does a neural network backpropagation work?"
run_star_pipeline(test_question)
Self-Taught Self-Correction (STaSC) Approach
STaSC is an adaptation of the Self-Taught Reasoner (STaR) approach, but it focuses on self-correction rather than reasoning. Unlike previous self-correction methods, STaSC integrates flexible algorithmic design choices, allowing fine control over initial answer exploration, correction filtering, and iterative fine-tuning.
Key Contributions
- Proposes STaSC, a unified and extensible self-correction framework.
- Demonstrates SLM self-correction capability on a question-answering (QA) task.
- Open-sources the implementation and trained models, fostering further research.
Formal Definition and Algorithmic Choices
The STaSC algorithm iteratively refines the model's outputs using a structured process:
1. Sampling Initial Answers: Uses either a fixed base model (Fixed Initialization) or a continuously updated version (Evolving Initialization).
2. Generating Corrections: The model generates multiple corrections per initial output.
3. Filtering Corrections:
- Improving Filter: Only keeps corrections that show performance improvement.
- Non-Decreasing Filter: Retains corrections that do not degrade quality.
4. Fine-Tuning the Model:
- Fixed Fine-Tuning: Trains from the base model.
- Evolving Fine-Tuning: Updates from the latest iteration.
Each design choice affects the learning dynamics and self-correction efficiency.
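One way to make these choices concrete is to gather them into a single configuration object. The sketch below is illustrative only; the field names and defaults are assumptions, not the paper's API.
from dataclasses import dataclass

@dataclass
class STaSCConfig:
    initialization: str = "fixed"         # "fixed" = base model samples answers, "evolving" = latest model samples
    correction_filter: str = "improving"  # "improving" = reward must increase, "non_decreasing" = reward must not drop
    fine_tuning: str = "evolving"         # "fixed" = train from the base model, "evolving" = continue from the latest model
    n_init: int = 3                       # initial answers sampled per question (N_init)
    n_corr: int = 3                       # corrections sampled per initial answer (N_corr)
    num_iterations: int = 3               # number of STaSC iterations

def keep_correction(reward_before, reward_after, cfg):
    # Apply the configured correction filter to an (initial answer, correction) pair.
    if cfg.correction_filter == "improving":
        return reward_after > reward_before
    return reward_after >= reward_before  # non-decreasing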
Experimental Setup
The study evaluates STaSC on the Natural Questions dataset, a well-known benchmark for question-answering. Two small models are tested:
- Qwen-2.5-1.5B
- Phi3-Mini
Evaluation Metrics
- Inaccuracy rate: The fraction of incorrect answers (lower is better).
- Reward function: Determines answer quality improvements over iterations.
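As a rough illustration of these two metrics (simple substring matching; the paper's exact scoring may differ), a minimal sketch could look like this:
def is_correct(predicted, reference):
    # Crude correctness check: the reference answer appears in the prediction.
    return reference.strip().lower() in predicted.strip().lower()

def inaccuracy_rate(predictions, references):
    # Fraction of answers that do NOT contain the reference answer (lower is better).
    wrong = sum(not is_correct(p, r) for p, r in zip(predictions, references))
    return wrong / len(predictions)

def correction_reward(initial_answer, corrected_answer, reference):
    # +1 if the correction fixes a wrong answer, -1 if it breaks a right one, 0 otherwise.
    return int(is_correct(corrected_answer, reference)) - int(is_correct(initial_answer, reference))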
Implementing the test
To implement the Self-Taught Self-Correction (STaSC) algorithm in Python, we need to structure our implementation into key steps: initial answer generation, correction generation, filtering, and fine-tuning. Below is a detailed Python implementation using the Hugging Face Transformers library for loading and training small language models.
1. Install Necessary Dependencies
Ensure you have the required libraries installed:
pip install transformers datasets torch accelerate
If you are looking for a smaller dataset than Natural Questions for testing Self-Taught Self-Correction (STaSC), here are some good alternatives:
1. SQuAD (Stanford Question Answering Dataset)
- Size: ~100K samples (Smaller subsets available)
- Use Case: Extractive QA tasks, similar to Natural Questions.
- How to Load:
from datasets import load_dataset
dataset = load_dataset("squad", split="train[:500]")  # Load a smaller subset
- Why? More compact and structured, making it a good choice for testing small models.
2. TriviaQA
- Size: ~100K QA pairs (You can subset it)
- Use Case: Open-domain QA like Natural Questions.
- How to Load:
dataset = load_dataset("trivia_qa", "unfiltered", split="train[:500]")
- Why? Good mix of structured and unstructured questions.
3. HotpotQA (For Multi-Hop QA)
- Size: ~100K (Subset is available)
- Use Case: Requires reasoning across multiple documents.
- How to Load:
dataset = load_dataset("hotpot_qa", "fullwiki", split="train[:500]")
- Why? If you want to test self-correction in a multi-hop reasoning context.
4. BoolQ (Yes/No QA Task)
- Size: ~16K samples (Very compact)
- Use Case: Fact-checking style QA with binary answers.
- How to Load:
dataset = load_dataset("boolq", split="train[:500]")
- Why? If you want to test self-correction in a simple factual verification task.
5. XSum (For Short-Form Abstractive Summarization)
- Size: ~200K summaries
- Use Case: If you want to test self-correction in text generation.
- How to Load:
dataset = load_dataset("xsum", split="train[:500]")
- Why? This can test STaSC in a different domain (summarization).
Which Dataset Should You Choose?
- If you need a smaller version of Natural Questions, go for SQuAD or TriviaQA.
- If you want to test multi-step reasoning, try HotpotQA.
- If your focus is fact-checking with short answers, use BoolQ.
- If you want self-correction in text generation, go for XSum.
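Whichever dataset you choose, it helps to normalize it into the question and reference-answer fields the pipeline below expects. Here is a minimal sketch, assuming a SQuAD-style schema (field names will differ for other datasets):
from datasets import load_dataset

def load_qa_subset(name="squad", split="train[:500]"):
    # Load a small QA subset and pull out questions and gold answers.
    dataset = load_dataset(name, split=split)
    questions = [sample["question"] for sample in dataset]
    references = [sample["answers"]["text"][0] for sample in dataset]
    return dataset, questions, references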
2. Load the Model and Dataset
We use a small open-source language model, such as Phi3-Mini or Qwen-2.5-1.5B, and a QA dataset like Natural Questions.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset, Dataset
import torch
# Load model and tokenizer
MODEL_NAME = "microsoft/phi-3-mini"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
# Load dataset (using a small subset for efficiency; the full Natural Questions download is large, see the smaller alternatives above)
dataset = load_dataset("natural_questions", split="train[:500]")
3. Define Self-Taught Self-Correction (STaSC) Algorithm
We implement four key steps: initial answer generation, correction generation, filtering, and fine-tuning.
Step 1: Generate Initial Answers
This step uses the model to generate initial answers for input questions.
def generate_initial_answers(model, tokenizer, dataset, num_samples=1, max_length=100):
initial_answers = []
for sample in dataset:
input_text = sample['question']
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=max_length, num_return_sequences=num_samples, do_sample=True)  # sampling enables multiple distinct answers
decoded_answers = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
initial_answers.append(decoded_answers)
return initial_answers
Step 2: Generate Corrections
The model refines the previously generated answers, acting as its own corrector.
def generate_corrections(model, tokenizer, dataset, initial_answers, num_corrections=3, max_length=100):
corrections = []
for sample, answers in zip(dataset, initial_answers):
question = sample['question']
corrected_versions = []
for ans in answers:
prompt = f"Question: {question}\nInitial Answer: {ans}\nCorrected Answer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=max_length, num_return_sequences=num_corrections, do_sample=True)  # sampling is required for multiple distinct corrections
corrected_answers = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
corrected_versions.extend(corrected_answers)
corrections.append(corrected_versions)
return corrections
Step 3: Filter Corrections
We keep only the answers that improve over the previous ones.
def filter_corrections(initial_answers, corrected_answers, reference_answers):
filtered_data = []
for init, corrs, ref in zip(initial_answers, corrected_answers, reference_answers):
best_correction = max(corrs, key=lambda x: reward_function(x, ref))
if reward_function(best_correction, ref) > reward_function(init[0], ref): # Ensures improvement
filtered_data.append((init[0], best_correction))
return filtered_data
Step 4: Define Reward Function
We define a simple reward function that checks correctness.
def reward_function(predicted, reference):
return 1 if reference.lower() in predicted.lower() else 0 # Binary reward
Step 5: Fine-Tune the Model
Fine-tune the model on the improved corrections.
def fine_tune_model(model, tokenizer, filtered_data, epochs=3):
# Causal-LM tokenizers often lack a pad token; fall back to EOS for padding
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
# Convert the (question, corrected answer) pairs into a Hugging Face Dataset
train_data = [{"input_text": f"Question: {q}\nCorrected Answer: {a}"} for q, a in filtered_data]
train_dataset = Dataset.from_list(train_data)
def tokenize_data(sample):
return tokenizer(sample["input_text"], padding="max_length", truncation=True, max_length=128)
tokenized_dataset = train_dataset.map(tokenize_data, remove_columns=["input_text"])
# The causal-LM collator copies input_ids into labels so the Trainer can compute a loss
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
training_args = TrainingArguments(
output_dir="./stasc_model",
per_device_train_batch_size=8,
num_train_epochs=epochs,
save_steps=500,
save_total_limit=2,
logging_dir="./logs"
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset,
data_collator=data_collator
)
trainer.train()
4. Run the Full Pipeline
Execute the steps in sequence to implement STaSC.
# Step 1: Generate Initial Answers
initial_answers = generate_initial_answers(model, tokenizer, dataset)
# Step 2: Generate Corrections
corrections = generate_corrections(model, tokenizer, dataset, initial_answers)
# Step 3: Filter Best Corrections
# Note: this field access assumes a SQuAD-style answer schema; the Natural Questions dataset stores gold answers under its annotations, so adapt the extraction for that dataset
reference_answers = [sample['answers']['text'][0] for sample in dataset]  # Using dataset references
filtered_data = filter_corrections(initial_answers, corrections, reference_answers)
# Step 4: Fine-Tune the Model
fine_tune_model(model, tokenizer, filtered_data)
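The pipeline above runs a single pass. To match the iterative nature of STaSC, wrap the four steps in an outer loop and decide at each iteration whether initial answers are sampled from the base model or from the latest fine-tuned one. The bookkeeping below is a rough sketch reusing the functions and variables defined above, not the paper's exact implementation.
# Rough sketch of the outer STaSC loop (illustrative bookkeeping, not the paper's code)
NUM_ITERATIONS = 3
EVOLVING_INITIALIZATION = True  # False: always sample initial answers from the base model

base_model = model  # for true Fixed Initialization, keep a separate frozen copy (e.g. reload from MODEL_NAME)
current_model = model

for iteration in range(NUM_ITERATIONS):
    sampler = current_model if EVOLVING_INITIALIZATION else base_model
    initial = generate_initial_answers(sampler, tokenizer, dataset)
    corrected = generate_corrections(current_model, tokenizer, dataset, initial)
    filtered = filter_corrections(initial, corrected, reference_answers)
    if filtered:
        fine_tune_model(current_model, tokenizer, filtered)  # the Trainer updates current_model in place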
Conclusion
This implementation provides a working STaSC pipeline that:
- Generates initial responses using a small language model.
- Self-corrects the outputs iteratively.
- Filters out unhelpful corrections and keeps improvements.
- Fine-tunes the model using the best corrections.
Results & Discussion
Impact of Initial Answer and Correction Sampling
Experiments show that:
- Increasing initial answer sampling (N_init) significantly boosts performance for weaker models.
- Increasing correction samples (N_corr) improves performance in more capable models.
- The Non-Decreasing Filter leads to instability in some settings, highlighting the need for careful correction selection.
STaSC with Evolving Initialization
- Evolving Fine-Tuning improves correction accuracy progressively over iterations.
- Filtering strictly improving corrections is critical for avoiding performance degradation.
STaSC with Fixed Initialization
- Works well for reducing noise and ensuring stability.
- Benefits greatly from Evolving Fine-Tuning, since corrections drive all exploration.
- Fixed Fine-Tuning struggles without proper correction filtering.
STaSC Impact on Initial Answers
- Even though training focuses on corrections, initial answer quality improves over time.
- Suggests that self-correction strengthens both factual knowledge and reasoning capabilities.
Related Work
This work builds upon previous research on self-correction in LLMs, including:
- External feedback-based methods: Use retrieval-augmented generation or external tools.
- Intrinsic self-correction: Includes iterative reasoning approaches but often depends on external critics.
- SCoRE framework: Explored intrinsic self-correction using reinforcement learning but required large proprietary models.
STaSC extends these ideas by refining correction methods and making self-correction feasible for small, open-source models.
Conclusion
STaSC is a powerful framework for enabling intrinsic self-correction in small language models, eliminating the need for external feedback mechanisms. Key takeaways include:
- SLMs can learn to self-correct through iterative fine-tuning on self-generated data.
- Filtering strategy and fine-tuning method are critical for performance.
- Both initial answer quality and correction effectiveness improve, demonstrating the model's adaptability.
- The open-source release facilitates further exploration of self-correction strategies across tasks and domains.
Limitations & Future Work
- Model capacity: Small models may be limited in their ability to fully self-correct.
- Task generalization: Experiments focus on QA; future studies should test broader NLP tasks.
- Hyperparameter tuning: Optimal configurations may vary across models and datasets.
- Evaluation criteria: Future work should develop more nuanced self-correction benchmarks.
Ethical Considerations
STaSC aims to reduce reliance on large, proprietary LLMs, making AI more accessible and transparent. However, like all AI models, biases in training data may still influence outputs, requiring careful oversight. Open-sourcing the implementation ensures greater transparency and community-driven improvements.