Self-Improving Agents: Applying the Sharpening Framework to Local LLMs

This is the second post in a 100-part series, where we take breakthrough AI papers and turn them into working code, building the next generation of AI one idea at a time.
Summary
In my previous post, I introduced co_ai, a modular implementation of the AI co-scientist concept, inspired by DeepMind's recent paper Towards an AI Co-Scientist.
But now, we're going deeper.
This isn't just about running prompts through an agent system; it's about building something radically different:
An AI that learns as it thinks. A self-improving agent that sharpens its own reasoning in real time, without retraining.
Unlike traditional agents that rely on fixed instructions or full model updates, this system evolves inference-time behavior using a powerful combination of:
- Lightweight reward modeling via MR.Q
- Structured prompting techniques like CRITIC, RECAP, and GROWS
- Real-time feedback loops built entirely with local tools (no API keys, no cloud dependencies)
We're not just prompting; we're programming intelligence at both ends, guiding the model through deliberate reflection, refinement, and measurable improvement.
What You'll Learn in This Post
I'll walk you through how I built a working prototype of this vision, grounded in two cutting-edge papers:
- Self-Improvement in Language Models: The Sharpening Mechanism, which introduces the sharpening framework
- MR.Q: Towards General-Purpose Model-Free Reinforcement Learning, which shows how to learn from preferences without retraining
You’ll see how I applied these ideas to build:
- A hypothesis generation and refinement pipeline
- A lightweight evaluator that learns from every interaction
- A set of programmable templates that guide reasoning
- A local-first architecture that stores everything for traceability and evolution
By the end, you'll understand how to:
- Build a self-tuning agent that improves with each step
- Create feedback loops where the system learns as it thinks
- Run all of this locally, with open-source tools like Ollama, DSPy, and pgvector
This is more than prompt engineering.
It's the first step toward an AI co-scientist that builds knowledge, not just outputs.
Why This Matters
Because for the first time:
- You can build agents that improve themselves during execution
- You don't need access to model weights or massive compute
- You're not fine-tuning; you're sharpening intelligence in real time
And most importantly:
This is not science fiction. It runs on your machine. Right now.
In This Post, We'll Cover
Section | What You'll Learn |
---|---|
The Return of MR.Q | How to use preference learning to evaluate and rank outputs without labels |
Sharpening Mechanism | How to refine hypotheses using structured prompting instead of retraining |
Agent Architecture | How to build a modular, feedback-driven system that evolves over time |
Tracking Improvements | How to store embeddings, log results, and evolve better strategies |
Prompt Programming | How to turn CRITIC, GROWS, RECAP, and other frameworks into code |
The Return of MR.Q
As introduced in MR.Q: A New Approach to Reinforcement Learning in Finance, this framework enables real-time learning, not by retraining models, but by refining how we use them.
MR.Q isn't just another reward model or evaluation framework. It is, at its heart, a mechanism for real-time learning, one that allows us to sharpen our models' outputs without retraining them and, more importantly, without needing access to their weights.
This dynamic, weightless learning capability is what makes MR.Q so powerful, and it's why I built the entire sharpening mechanism in co_ai around it.
The key point is worth repeating: the AI learns without retraining.
The Core Idea: Learning on the Fly
Traditional fine-tuning requires heavy infrastructure, data labeling, and compute resources. But what if you could learn from every interaction your system has, not by changing the model itself, but by refining how you use it?
That's where MR.Q comes in.
"MR.Q enables preference modeling over sequences, allowing models to be sharpened through inference-time refinement rather than parameter updates."
In other words:
- You don't need to train a new model.
- You don't need access to model weights.
- You can adapt behavior dynamically using preference learning.
This is revolutionary because it means:
You can build self-improving agents that evolve in real time, not over weeks of training, but within minutes of execution.
What Kind of Data Does MR.Q Need?
flowchart LR
    A[Input Data: Prompt + Output A/B] --> B[Embedding Lookup: memory.embedding]
    B --> C[TextEncoder: prompt_emb + output_emb → zsa]
    C --> D[HypothesisValuePredictor: value_a / value_b]
    D --> E[Compare Scores: preferred = a or b]
    E --> F[Log / Train: loss.backward or log evaluation]
    style A fill:#f9f,stroke:#333,stroke-width:4px
MR.Q learns from DPO-style (Direct Preference Optimization) data, that is, pairs of outputs with a preference. It doesn't require labeled classifications or explicit numeric scores. All it needs is:
- A prompt (or goal)
- Two outputs (e.g., hypotheses A and B)
- A preference indicating which output is better ("a" or "b")
This format allows MR.Q to learn how to discriminate between stronger and weaker responses in context, and it works with relatively small datasets.
Example Training Item
{
"prompt": "What are the potential benefits of gene editing in agriculture?",
"output_a": "It allows crops to become more resistant to disease and pests.",
"output_b": "Gene editing could potentially create superweeds.",
"preferred": "a"
}
Extracting DPO Data for MR.Q Using SQL
To generate these training pairs from our database, we use an SQL query that follows this strategy:
- For each prompt, we select two hypotheses:
- One with the highest Elo rating (top-rated output)
- One with the lowest Elo rating (worst plausible output)
- We ensure:
- Both hypotheses are enabled
- Both are linked to the same goal and prompt
- They have different scores to ensure a valid comparison
This gives us high-contrast training data, where the preference signal is strong and unambiguous, which is exactly what MR.Q needs to learn a reliable internal value function.
Why It Works:
- Using the most and least successful hypotheses maximizes the training signal.
- Helps MR.Q quickly learn the difference between strong and weak outputs.
- Avoids confusion caused by training on "close call" examples with minor differences.
This method turns your existing hypothesis logs into efficient training pairs, with no extra annotation required. (A short loading sketch follows the query below.)
WITH top_h AS (
SELECT DISTINCT ON (p.id)
p.id AS prompt_id,
g.goal_text AS goal,
p.prompt_text,
h.text AS output_a,
h.elo_rating AS rating_a
FROM prompts p
JOIN goals g ON p.goal_id = g.id
JOIN hypotheses h ON h.prompt_id = p.id
WHERE h.enabled = TRUE
AND h.goal_id = g.id
AND p.agent_name = %s
ORDER BY p.id, h.elo_rating DESC
),
bottom_h AS (
SELECT DISTINCT ON (p.id)
p.id AS prompt_id,
h.text AS output_b,
h.elo_rating AS rating_b
FROM prompts p
JOIN hypotheses h ON h.prompt_id = p.id
JOIN goals g ON p.goal_id = g.id
WHERE h.enabled = TRUE
AND h.goal_id = g.id
AND p.agent_name = %s
ORDER BY p.id, h.elo_rating ASC
)
SELECT
top_h.prompt_id,
top_h.goal,
top_h.prompt_text,
top_h.output_a,
top_h.rating_a,
bottom_h.output_b,
bottom_h.rating_b
FROM top_h
JOIN bottom_h ON top_h.prompt_id = bottom_h.prompt_id
WHERE top_h.rating_a != bottom_h.rating_b
LIMIT %s;
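To show how these rows become MR.Q training items, here is a minimal loading sketch. It assumes a psycopg2 connection and that the query above is stored in a `DPO_PAIRS_SQL` string; the helper name and exact plumbing in co_ai may differ.

```python
import psycopg2  # assumed driver; co_ai may wrap database access differently

DPO_PAIRS_SQL = "..."  # the query shown above

def load_dpo_pairs(conn, agent_name: str, limit: int = 1000) -> list[dict]:
    """Turn high/low Elo hypothesis pairs into DPO-style training items."""
    with conn.cursor() as cur:
        cur.execute(DPO_PAIRS_SQL, (agent_name, agent_name, limit))
        rows = cur.fetchall()
    return [
        {
            "prompt": prompt_text,
            "output_a": output_a,  # top-rated hypothesis
            "output_b": output_b,  # lowest-rated hypothesis
            "preferred": "a",      # by construction, A is the stronger one
        }
        for (_prompt_id, _goal, prompt_text, output_a, _ra, output_b, _rb) in rows
    ]
```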
What Is DPO (Direct Preference Optimization)?
DPO is a learning approach that fine-tunes models using pairwise preference data, where the system is shown two possible outputs and learns to prefer one over the other.
Instead of teaching the model what the "correct" answer is, DPO teaches the model how to rank answers. This avoids the need for strong supervision and allows models to learn from more natural feedback, like:
- Which answer was more helpful?
- Which one aligned better with the user's intent?
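Concretely, pairwise preference learning of this kind minimizes a loss of roughly the following shape, where $v_{\text{preferred}}$ and $v_{\text{rejected}}$ are the predicted values of the two outputs. (The full DPO loss adds a reference policy and a temperature term; this simplified form is what MR.Q optimizes later in this post.)

$$\mathcal{L} = -\log \sigma\!\left(v_{\text{preferred}} - v_{\text{rejected}}\right)$$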
How MR.Q Uses DPO Data
Unlike traditional DPO pipelines that fine-tune large language models directly, MR.Q takes a modular, lightweight approach.
It breaks the preference learning problem into two clean stages:
- Embedding + Compression of Prompt & Outputs
- Value Prediction & Ranking
This makes MR.Q fast to train, easy to understand, and highly adaptable for real-time applications.
MR.Q Embeddings
flowchart LR
    A[Input Data: Prompt + Output A/B] --> B[Embedding Lookup: memory.embedding]
    B --> C[TextEncoder: prompt_emb + output_emb → zsa]
    C --> D[HypothesisValuePredictor: value_a / value_b]
    D --> E[Compare Scores: preferred = a or b]
    E --> F[Log / Train: loss.backward or log evaluation]
    style B fill:#f9f,stroke:#333,stroke-width:4px
Embeddings are the foundation of how MR.Q and the broader Co AI system understand language. Every prompt, hypothesis, or response is converted into a dense vector of numbers that captures its semantic meaning.
To make this process both efficient and consistent, we implemented a centralized embedding utility that:
1. Connects to a Local Embedding Model
We used Ollama as our embedding backend: fast, local, and privacy-respecting.
# example embedding config in co_ai
embeddings:
  model: "mxbai-embed-large"
  dimension: 1024  # dimension of your database embedding columns
  endpoint: "http://localhost:11434/api/embeddings"
Each time we embed a piece of text, we send it to this endpoint using a lightweight POST request. The model returns a vector of floats that represents the text's meaning in high-dimensional space.
import requests

def get_embedding(text: str, cfg):
    """
    Get an embedding from Ollama using the configured model.
    Args:
        text (str): The input text to embed.
        cfg (dict): Configuration containing 'model' and optionally 'endpoint'.
    Returns:
        list[float]: The embedding vector.
    """
    cached = embedding_cache.get(text)
    if cached is not None:
        print("Using cached embedding")
        return cached
    model = cfg.get("embeddings", {}).get("model", "mxbai-embed-large")
    endpoint = cfg.get("embeddings", {}).get("endpoint", "http://localhost:11434/api/embeddings")
    response = requests.post(
        endpoint,
        json={"model": model, "prompt": text},
    )
    response.raise_for_status()
    embedding = response.json().get("embedding")
    embedding_cache[text] = embedding  # store for reuse on the next call
    return embedding
2. Caches Embeddings for Reuse
Embeddings are expensive to compute, so we cache them:

cached = embedding_cache.get(text)
if cached is not None:
    print("Using cached embedding")
    return cached

This ensures we never recompute embeddings for the same text, dramatically speeding up evaluation and reducing redundant API calls. (The cache itself can be as simple as a module-level dictionary; a minimal sketch follows.)
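The `embedding_cache` used above isn't shown in the snippet. A minimal in-process version could be a plain dictionary; this is an assumption on my part, and co_ai's real cache may be bounded or backed by the database table described next.

```python
# Hypothetical in-process cache keyed by the raw text.
# A production version might hash the keys and cap the cache size.
embedding_cache: dict[str, list[float]] = {}
```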
3. Stores Embeddings in a Central Table
In the Co AI framework, all embeddings (prompts, hypotheses, results) are stored in a shared table in the database. This gives us:
- A single source of truth for embeddings
- Seamless access across agents (Refiner, Sharpening, MR.Q, etc.)
- Persistent storage for long-term usage and auditability
Every time MR.Q needs to compare two outputs, it pulls their pre-computed embeddings from this shared table via:
self.memory.embedding.get_or_create(text)
This ensures consistency across the entire framework: all agents see the same semantic space.
def get_or_create(self, text):
    try:
        with self.db.cursor() as cur:
            cur.execute("SELECT embedding FROM embeddings WHERE text = %s", (text,))
            row = cur.fetchone()
            if row:
                return row[0]  # the stored vector for this text
            # On a miss, the embedding is computed and inserted (see the sketch below)
    except Exception as e:
        print(f"Exception: {type(e).__name__}: {e}")
        if self.logger:
            self.logger.log("EmbeddingFetchFailed", {"error": str(e)})
𧬠Text Encoder
flowchart LR A[π Input Data Prompt + Output A/B] --> B[π Embedding Lookup memory.embedding] B --> C[π€ TextEncoder prompt_emb + output_emb β zsa] C --> D[π HypothesisValuePredictor value_a / value_b] D --> E[βοΈ Compare Scores preferred = a or b] E --> F[π Log / Train loss.backward or log evaluation] style C fill:#f9f,stroke:#333,stroke-width:4px
At the heart of MR.Q is a deceptively simple idea:
"The better a hypothesis fits a prompt, the higher its value should be."
But how do we numerically represent a prompt and a hypothesis in a way that captures their meaning, relationship, and quality?
That's the job of the TextEncoder.
What the TextEncoder Does
Once we have the embeddings, we pass them into the TextEncoder to produce a combined feature vector that represents the interaction between:
- The prompt (what we're trying to solve)
- The response/hypothesis (a proposed solution)
This combination is crucial: we're not evaluating the response in isolation. We're asking:
"How well does this response fit this specific prompt?"
Anatomy of the Encoder
[prompt_emb] -> zs_mlp -> zs
[response_emb] -> za_mlp -> za
concat(zs, za) -> zsa_mlp -> zsa
- zs_mlp: Maps the prompt to a latent intent vector (zs)
- za_mlp: Maps the hypothesis to a response vector (za)
- zsa_mlp: Fuses both into a joint signal (zsa) used for value prediction
This fusion allows the model to capture alignment, plausibility, and even strategy compatibility between prompt and hypothesis.
Why This Matters for MR.Q
MR.Q needs to decide which hypothesis is better, but this isn't a generic comparison.
It's a contextual comparison: better for this specific prompt.
The TextEncoder is the bridge between raw embeddings and meaningful, context-aware evaluation. Without it, MR.Q would be blind to the nuances of different tasks.
Tunable and Modular
Because it's implemented as a PyTorch module:
- You can experiment with different architectures (e.g., attention, residuals)
- You can plug in different embedding models upstream
- It runs on CPU or GPU, enabling lightweight training loops
Summary
The TextEncoder transforms static embeddings into dynamic relationships.
It's not just encoding text; it's encoding relevance, quality, and fit between goal and solution.
If MR.Q is a mini brain, the TextEncoder is its attention mechanism, deciding what to focus on and what matters most.
import torch
import torch.nn as nn
import torch.nn.functional as F


# TextEncoder for embedding prompts and hypotheses
class TextEncoder(nn.Module):
    def __init__(self, embedding_dim=1024, zs_dim=512, za_dim=256, zsa_dim=512, hdim=1024):
        super().__init__()
        # Maps the prompt embedding to a latent intent vector (zs)
        self.zs_mlp = nn.Sequential(
            nn.Linear(embedding_dim, hdim),
            nn.ReLU(),
            nn.Linear(hdim, zs_dim)
        )
        # Maps the hypothesis embedding to a response vector (za)
        self.za_mlp = nn.Sequential(
            nn.Linear(embedding_dim, hdim),
            nn.ReLU(),
            nn.Linear(hdim, za_dim)
        )
        # Fuses both into the joint signal (zsa) used for value prediction
        self.zsa_mlp = nn.Sequential(
            nn.Linear(zs_dim + za_dim, zsa_dim),
            nn.ReLU(),
            nn.Linear(zsa_dim, zsa_dim)
        )

    def forward(self, prompt_emb, response_emb):
        zs = F.relu(self.zs_mlp(prompt_emb))
        za = F.relu(self.za_mlp(response_emb))
        zsa = self.zsa_mlp(torch.cat([zs, za], dim=1))
        return zsa
Scoring Intelligence: The HypothesisValuePredictor
flowchart LR
    A[Input Data: Prompt + Output A/B] --> B[Embedding Lookup: memory.embedding]
    B --> C[TextEncoder: prompt_emb + output_emb → zsa]
    C --> D[HypothesisValuePredictor: value_a / value_b]
    D --> E[Compare Scores: preferred = a or b]
    E --> F[Log / Train: loss.backward or log evaluation]
    style D fill:#f9f,stroke:#333,stroke-width:4px
Once we've encoded the relationship between a prompt and a hypothesis into a vector using the TextEncoder, we need to answer a very human question:
"How good is this hypothesis, really?"
That's where the HypothesisValuePredictor comes in.
What It Does
The HypothesisValuePredictor is a tiny neural network that takes in a combined prompt-hypothesis representation (called zsa) and outputs a single score: a scalar that represents the predicted "quality" or "value" of that hypothesis in the context of the prompt.
# Predicts a scalar "value" for a (prompt, hypothesis) pair from its zsa vector
class HypothesisValuePredictor(nn.Module):
    def __init__(self, input_dim=512, hidden_dim=1024):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)  # Output: a single scalar
        )

    def forward(self, x):
        return self.network(x)
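To make the data flow concrete, here is how the two modules fit together on a single prompt/hypothesis pair. This is an illustrative sketch: in co_ai the vectors come from the shared embeddings table, not random tensors.

```python
import torch

encoder = TextEncoder(embedding_dim=1024)            # matches the mxbai-embed-large dimension
predictor = HypothesisValuePredictor(input_dim=512)  # matches the encoder's zsa_dim

# Stand-ins for embeddings that would normally come from memory.embedding
prompt_emb = torch.randn(1, 1024)
hypothesis_emb = torch.randn(1, 1024)

zsa = encoder(prompt_emb, hypothesis_emb)  # joint prompt/response representation
value = predictor(zsa)                     # predicted quality, shape (1, 1)
print(value.item())
```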
Compare Scores: Determining the Better Output
flowchart LR
    A[Input Data: Prompt + Output A/B] --> B[Embedding Lookup: memory.embedding]
    B --> C[TextEncoder: prompt_emb + output_emb → zsa]
    C --> D[HypothesisValuePredictor: value_a / value_b]
    D --> E[Compare Scores: preferred = a or b]
    E --> F[Log / Train: loss.backward or log evaluation]
    style E fill:#f9f,stroke:#333,stroke-width:4px
At the heart of the sharpening process lies a simple but powerful mechanism: scoring and comparing competing hypotheses.
After generating a sharpened hypothesis from a prompt using one of our templated refinement strategies, we pass both the original and the sharpened versions to MR.Q, our lightweight reward model. MR.Q outputs a pair of numerical values:
- value_a: the model's predicted score for the original hypothesis
- value_b: the model's predicted score for the sharpened hypothesis
We then perform a direct comparison:
preferred = "a" if value_a >= value_b else "b"
This comparison determines:
- Which version performed better, according to MR.Q's learned preferences.
- Whether the sharpening actually improved the hypothesis (improved = preferred == "b").
- Which prompt template led to the best refinement in that context.
The result is stored along with metadata (a small helper below shows how these fields fit together):
- score_diff: the absolute improvement between versions
- winner: "a" or "b", the preferred result
- comparison: "sharpened_better" or "original_better", for easy filtering and visualization
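Assembling that record is only a few lines. The field names below follow the list above, though the exact schema in co_ai may differ slightly:

```python
def compare_scores(value_a: float, value_b: float) -> dict:
    """Build the comparison record for an original (a) vs. sharpened (b) hypothesis."""
    preferred = "a" if value_a >= value_b else "b"
    return {
        "winner": preferred,
        "improved": preferred == "b",  # did sharpening actually help?
        "score_diff": abs(value_b - value_a),
        "comparison": "sharpened_better" if preferred == "b" else "original_better",
    }
```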
Example
Let's say we evaluate a pair:
- value_a = 72.3 (original)
- value_b = 78.1 (sharpened)
Then:
preferred = "b"
improved = True
score_diff = 5.8
This allows us to:
- Log and analyze which templates consistently improve results
- Automatically save improved prompts and hypotheses
- Visualize which strategies yield meaningful refinement
Logging & Training in MR.Q
flowchart LR
    A[Input Data: Prompt + Output A/B] --> B[Embedding Lookup: memory.embedding]
    B --> C[TextEncoder: prompt_emb + output_emb → zsa]
    C --> D[HypothesisValuePredictor: value_a / value_b]
    D --> E[Compare Scores: preferred = a or b]
    E --> F[Log / Train: loss.backward or log evaluation]
    style F fill:#f9f,stroke:#333,stroke-width:4px
How the model learns and how we track it
In MR.Q, learning happens in a tight feedback loop, powered by preference data (DPO-style pairs) and guided by a small neural network trained to distinguish better outputs from worse ones.
The Learning Step: loss.backward()
During training, we compare two embeddings:
- zsa_a: the vector representation of (prompt, output A)
- zsa_b: the vector representation of (prompt, output B)
(zsa is the joint representation of a prompt (zs) and a response (za).)
We construct a difference vector based on the preferred output:
diff = zsa_a - zsa_b if preferred == "a" else zsa_b - zsa_a
Then we feed this difference into our HypothesisValuePredictor, a small feed-forward neural network, and optimize its weights using:

loss = -torch.log(torch.sigmoid(preds)).mean()
loss.backward()
opt.step()

This step tunes MR.Q to assign higher values to preferred responses, effectively learning what "better" means in the context of your domain and goals.
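For context, here is a condensed sketch of the full training step, which also shows where `preds` comes from. The names `encoder`, `value_predictor`, and `pairs` stand in for the TextEncoder instance, the HypothesisValuePredictor instance, and an iterable of preference pairs; the real MR.Q trainer in co_ai is organized differently but follows the same logic.

```python
import torch

# encoder / value_predictor are the TextEncoder and HypothesisValuePredictor from above
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(value_predictor.parameters()), lr=1e-4
)

# `pairs` is assumed to yield (prompt_emb, emb_a, emb_b, preferred) tuples,
# processed one preference pair at a time for clarity
for prompt_emb, emb_a, emb_b, preferred in pairs:
    zsa_a = encoder(prompt_emb, emb_a)  # joint vector for (prompt, output A)
    zsa_b = encoder(prompt_emb, emb_b)  # joint vector for (prompt, output B)

    # Difference vector points from the rejected output toward the preferred one
    diff = zsa_a - zsa_b if preferred == "a" else zsa_b - zsa_a
    preds = value_predictor(diff)       # higher means more confident in the stated preference

    loss = -torch.log(torch.sigmoid(preds)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```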
The Tracking Step: MRQTrainingEpoch
After each epoch, we log key stats:
self.logger.log("MRQTrainingEpoch", {
"epoch": epoch + 1,
"avg_loss": round(avg_loss, 5),
"goal": goal
})
These logs give visibility into:
- Training progress (e.g., loss over time)
- Convergence trends (flattening loss means learning is stabilizing)
- Traceability (what goal, what data, what results)
You'll see entries like:
[MRQTrainingEpoch] {'epoch': 3, 'avg_loss': 0.64518, 'goal': 'Can AI improve diagnostic accuracy in radiology?'}
These logs are essential for:
- Debugging
- Verifying model behavior
- Determining when to stop training (e.g., when loss plateaus)
You can configure training parameters in the agent config file.
device: cpu        # device used for MR.Q training ("cpu" or "cuda")
limit: 1000        # maximum number of training pairs pulled from the database
epochs: 20         # number of training epochs
patience: 3        # (early stopping) how many epochs to wait with no improvement
min_delta: 0.0001  # (early stopping) minimum change in loss to qualify as improvement
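A minimal sketch of how patience and min_delta can drive early stopping; `train_one_epoch` is an assumed helper returning the epoch's average loss, and the real trainer may track this differently:

```python
best_loss = float("inf")
epochs_without_improvement = 0

for epoch in range(cfg["epochs"]):
    avg_loss = train_one_epoch()  # assumed helper returning the epoch's mean loss

    if best_loss - avg_loss > cfg["min_delta"]:
        best_loss = avg_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1

    if epochs_without_improvement >= cfg["patience"]:
        break  # loss has plateaued for `patience` epochs: stop early
```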
Future Extensions with MR.Q
Because MR.Q operates entirely at inference time, it opens up several exciting paths forward:
- Continuous online learning from user feedback
- Multi-agent tournaments where hypotheses compete
- Multi-model competition, where we try different models for different pipelines and strategies
- Preference transfer across domains
- Visualization of strategy evolution via embedding space
All of these can be layered on top of the existing structure without modifying the base LLM.
Final Thoughts
Choosing MR.Q wasn't just a technical decision; it was a philosophical one.
It allowed me to build a system where:
- Learning happens continuously
- Feedback drives improvement
- Agents become smarter through experience, not retraining
This dynamic, lightweight, and modular approach is what sets co_ai apart, and why MR.Q will be a central focus in the paper.
Because in the end, this isn’t about fine-tuning language models.
It's about sharpening intelligence through reflection, preference, and iteration, and doing it all in real time.
And thatβs exactly what MR.Q lets us do.
As you have probably guessed by now, I am a big fan of MR.Q. However, this agent is really about sharpening and how we can use it to improve our results, so let's get into that now.
Sharpening
While MR.Q gives us a powerful mechanism to evaluate hypotheses using lightweight preference-based learning, it doesn’t generate new ones on its own.
That's where sharpening comes in.
Inspired by the Sharpening paper, this approach treats prompt templates as programmable transformations, not just static instructions. Instead of tuning a model, we tune the interaction between the model and the problem, modifying prompts to encourage better reasoning.
At the heart of this idea is a simple but powerful loop:
- Generate an initial hypothesis from a prompt
- Reflect on its quality or limitations
- Refine the prompt or the hypothesis using expert-style templates
- Evaluate whether the refined output is better (via MR.Q)
- Repeat with the best-performing prompt
This isn't just prompt engineering; it's a structured, template-driven feedback loop that evolves the interaction itself. By applying multiple sharpening strategies (like critic, grow, lens, and more), we experiment with different reasoning styles to find what works best.
In co_ai, the SharpeningAgent handles this entire process. It:
- Loads a set of expert-crafted templates
- Applies them to previous prompts/hypotheses
- Evaluates the refined versions using MR.Q
- Logs which strategies lead to measurable improvements
- Optionally saves improved prompts and hypotheses for future use
In short, while MR.Q helps us choose what’s better, sharpening helps us create something better over time, over cycles, and without changing the underlying model.
Let's look at how that's implemented in the agent logic.
flowchart LR
    A[Prompt + Hypothesis] --> B[For each Template]
    B --> C[Apply Sharpening Template: CRITIC, GROWS, ...]
    C --> D[Generate Sharpened Hypothesis via LLM]
    D --> E[Evaluate A vs B using MR.Q]
    E --> F{Is B Better?}
    F -- Yes --> G[Save Sharpened Prompt & Hypothesis]
    F -- No --> H[Skip Saving]
    G & H --> I[Log SharpeningResult]
    I --> J[Select Best Output]
Prompting Techniques: Our Templates of Change
These templates are a set of techniques we use to make the LLM think about or evaluate its work. This is the list we are working with now.
You can find out more about each template here: Prompting Techniques
Template | Description |
---|---|
critic | Uses the CRITIC framework to systematically analyze and improve hypotheses by identifying assumptions, gaps, and proposing refinements. |
grow | Applies a lightweight GROW-style pattern to generate and refine a hypothesis through review and iteration. |
grows | Full GROWS Loop: Generate, Review, Optimize, Work Again, and Stop. Used for iterative reasoning and continuous hypothesis improvement. |
devil | Adopts a Devil's Advocate stance: challenges the current hypothesis by identifying flaws, contradictions, and weak assumptions. |
lens | Applies the Three Lens Review (Strategic, Tactical, Operational) to assess the hypothesis or plan from multiple perspectives. |
cot | Implements Chain-of-Thought (CoT) prompting, encouraging step-by-step reasoning before presenting the refined hypothesis. |
aot | Uses the Atom of Thought (AoT) pattern to break down complex goals into subproblems, answer them independently, and reassemble a solution. |
recap | Applies the RECAP framework (Evidence, Context, Analysis, Perspective) to comprehensively evaluate and refine a hypothesis. |
reflect | Invokes the REFLECT process to perform introspective reasoning about a hypothesis or decision; ideal for after-action improvement. |
step | Leverages the STEP framework (Structure, Think, Evaluate, Proceed) to guide the model through careful, logical progression. |
swapi | Structured critique via S.W.A.P.I. (Strengths, Weaknesses, Assumptions, Proposals, Iteration) for refining hypotheses through review cycles. |
Templates: Programming the LLM, One Prompt at a Time
So what are we really doing when we apply these sharpening templates?
At a glance, it may seem like we’re just feeding variations of instructions into the model. But under the hood, we’re doing something far more powerful:
Each template is a miniature program: a reasoning strategy encoded in text.
In traditional programming, we write functions and logic to process data. Here, the LLM is the processor, and our prompts are the code. Every template, whether it's CRITIC, GROWS, or SWAPI, defines a different mental algorithm. It tells the model how to think, what to focus on, and how to evaluate its own outputs.
These templates are:
- Modular reasoning routines: small, composable strategies you can mix and match.
- Reusable mental scaffolds: templates like GROWS and AoT don't just generate one result; they structure how the model thinks.
- Sharpeners: by applying these templates, we're not just rewriting prompts or responses. We're refining the reasoning behind them.
Think of it like this:
- The goal is your intention.
- The prompt is the tool.
- The template is the technique you're using with that tool, carving deeper, cleaner, more useful hypotheses with each pass.
We're programming the model at runtime, using natural language as our IDE.
And in doing so, we're not just generating outputs; we're actively crafting a system that learns how to reason better, step by step.
GROWS Template Example
In my testing this was by far the most effective template. When you review it, I bet you can guess why.
You are an iterative assistant tasked with improving a hypothesis using the GROWS Loop:
1. **Generate**: Start with the current hypothesis.
2. **Review**: Rate the hypothesis (1-10) and identify areas for improvement.
3. **Optimize**: Rewrite the hypothesis based on feedback.
4. **Work Again**: Present the revised version.
5. **Stop**: Evaluate whether the output meets the desired quality or requires another iteration.
Goal:
{{ goal }}
Preferences:
{% for p in preferences %}
- {{ p }}
{% endfor %}
{% if examples %}
Examples:
{% for h in examples %}
Hypothesis {{ loop.index }}:
{{ h.hypothesis }}
Review:
{{ h.review }}
{% endfor %}
{% endif %}
Instructions:
Follow the GROWS loop iteratively. Stop when the hypothesis scores above 8/10 in your own review.
Output format:
Refined Hypothesis: <your improved version here>
Score: <score>
Review: <justification>
Notice how it keeps self-improving until it reaches a very high score.
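Since these are Jinja templates, applying one is just rendering it with the current goal, preferences, and examples before the result goes to the LLM. A minimal standalone rendering might look like this; the file path and variable values are illustrative, and co_ai actually resolves templates through its prompt_loader:

```python
from jinja2 import Template

# Path and values are illustrative assumptions, not co_ai's real layout
with open("prompts/grows.txt") as f:
    grows = Template(f.read())

prompt = grows.render(
    goal="Can AI improve diagnostic accuracy in radiology?",
    preferences=["cite supporting evidence", "keep the hypothesis testable"],
    examples=[],  # optionally, prior hypotheses with their reviews
)
# `prompt` is then sent to the local LLM (e.g., via call_llm in the agent below)
```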
SharpeningAgent Code: Putting It All Together
Everything you've seen so far (template transformation, LLM execution, and MR.Q evaluation) is orchestrated by the SharpeningAgent.
This agent ties together:
- Goal and hypothesis selection
- Template-based prompt programming
- LLM execution
- Real-time evaluation via MR.Q
- Selective saving of improved prompts and outputs
Here's the core logic:
class SharpeningAgent(BaseAgent):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.target = cfg.get("target", "generation")
        self.device = cfg.get("device", "cpu")
        self.evaluator = MRQSelfEvaluator(memory, logger, device=self.device)
        self.templates = cfg.get("templates", ["critic"])

    async def run(self, context: dict):
        goal = context.get(GOAL)
        # Refresh MR.Q from the latest Elo-ranked pairs before evaluating anything
        self.evaluator.train_from_database(goal=goal, cfg=self.cfg)
        prompts = context.get("prompt_history", {}).get(self.target, [])
        results = []
        for data in prompts:
            result = self.run_selected(data, context)
            results.append(result)
            if self.cfg.get("log_results", False):
                self.log_sharpening_results(goal, data.get("prompt"), data.get("response"), result)
        context[self.output_key] = results
        return context
And just like the diagram showed, each prompt is run through templates, evaluated, and optionally saved:
def run_selected(self, data: dict, context: dict) -> list[dict]:
    ...
    for name in self.templates:
        prompt_template = self.prompt_loader.from_file(name, self.cfg, merged)
        sharpened_hypothesis = self.call_llm(prompt_template, merged)
        ...
        preferred_output, scores = self.evaluator.evaluate(goal, prompt, hypothesis, sharpened_hypothesis)
        ...
        if entry["improved"]:
            self.save_improved(...)
What makes this powerful is that it operates as a looping, iterative refinement system, applying new strategies until the outputs actually improve.
Configuring the Agent
The SharpeningAgent is built using Hydra, a powerful configuration management system. This design lets you define the agent's behavior entirely through structured YAML configuration files, making it easy to experiment, extend, and deploy across different environments.
Here's a breakdown of what the configuration controls:
Core Parameters
sharpening:
  name: sharpening
  target: generation
  device: cpu
- name: Identifier for the agent.
- target: Which part of the context the agent will operate on (e.g., "generation").
- device: Can be "cpu" or "cuda" depending on your setup.
Training Controls
limit: 1000
epochs: 20
patience: 3
min_delta: 0.0001
- limit: Maximum number of training samples to use from the database.
- epochs: Number of full training cycles for MR.Q.
- patience: Early stopping criterion; training will halt if no improvement is seen for this many epochs.
- min_delta: Minimum required improvement in loss to count as progress.
Data and Result Management
log_results: true
save_improved: true
save_context: false
skip_if_completed: false
- log_results: Save sharpening outcomes (e.g., score comparisons, improvements) to the database.
- save_improved: Store newly refined prompts and hypotheses when improvements are detected.
- save_context: Optionally record the full context for traceability.
- skip_if_completed: Skip processing if the context already contains sharpening results.
LLM Configuration
model:
  name: ollama_chat/qwen3
  api_base: http://localhost:11434

This block specifies the local LLM endpoint used for prompt completion and scoring; in this case, qwen3 via Ollama.
Sharpening Strategy
mode: templates # Options: templates, judge, compare_mrq
templates:
- critic
- grow
...
You can choose how the agent sharpens hypotheses:
- templates: Apply a sequence of specialized Jinja-based refinement strategies.
- judge: Use a scoring model to evaluate a single hypothesis.
- compare_mrq: Run MR.Q-based evaluations to compare hypotheses and select the best.
The templates list defines which prompt refiners to use. Each template represents a sharpening strategy (like critic, reflect, grows, etc.) that shapes how the hypothesis is rewritten.
Prompt Configuration
required_keys: ["goal", "prompt_history"]
input_key: "prompt_history"
output_key: "sharpening"
prompt_mode: strategy
strategy: sharpening
This defines how the agent navigates the context and where results are stored:
- required_keys: Keys that must exist in the input context.
- input_key / output_key: Where to read from and write to in the context.
- prompt_mode: Prompt generation mode (strategy, initial, etc.).
- strategy: Which type of refinement flow to use.
𧬠Evolving Better Strategies Over Time
Once you’ve seen which templates lead to higher-ranked hypotheses, reuse that knowledge.
File: co_ai/agents/prompt_tuning.py
from dspy import Predict
from dspy.teleprompt import BootstrapFewShot


class PromptTuningAgent:
    def run(self, context: dict) -> dict:
        goal = context.get("goal")
        # Use the best-ranked hypotheses as few-shot training material
        examples = self.memory.hypotheses.get_top_ranked(goal, limit=20)
        training_set = self._build_training_set(examples)
        tuner = BootstrapFewShot(metric=self._exact_match_metric)
        tuned_program = tuner.compile(
            student=Predict(PromptTuningSignature), trainset=training_set
        )
        best_prompt = tuned_program.student.demos[0].input_prompt
        context["refined_prompt"] = best_prompt
        return context
Train improved versions:
python main.py agents/prompt_tuning.enabled=true
Now you're implementing exactly what the Sharpening paper describes:

$$\pi^{*}_{\beta} = \arg\max_{\pi \in \Pi} \left\{ \mathbb{E}_{\pi}\!\left[ r_{\text{self}}(y \mid x) \right] - \beta\, D_{\mathrm{KL}}\!\left( \pi \,\|\, \pi_{\text{base}} \right) \right\}$$
Limitations & Future Directions
While the sharpening pipeline powered by MR.Q represents a major step forward in self-improving AI workflows, several key limitations remain, both in design and implementation. These are not just technical footnotes, but active design tensions that will shape how this system scales and evolves.
1. Template Quality Is Strategy
- Issue: The effectiveness of sharpening heavily depends on the quality and intent of the prompt templates used (e.g., CRITIC, DEVIL, GROWS).
- Implication: Poorly written or misaligned templates can degrade hypothesis quality, making template design itself a form of programming the AI.
- Future Direction: Support auto-selection or even template evolution, e.g., scoring templates by average improvement (a rough sketch follows).
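As a rough sketch of what that could look like, the logged sharpening results already contain enough signal to rank templates by average improvement. The `results` list and its `template`, `winner`, and `score_diff` fields are assumptions modeled on the records described earlier:

```python
from collections import defaultdict

def rank_templates(results: list[dict]) -> list[tuple[str, float]]:
    """Rank sharpening templates by the average (signed) score_diff they produced."""
    diffs = defaultdict(list)
    for r in results:
        # Count improvements positively and regressions negatively
        signed = r["score_diff"] if r["winner"] == "b" else -r["score_diff"]
        diffs[r["template"]].append(signed)
    return sorted(
        ((name, sum(v) / len(v)) for name, v in diffs.items()),
        key=lambda item: item[1],
        reverse=True,
    )
```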
2. Model Choice Matters
- Issue: The LLM used to generate and evaluate hypotheses significantly affects performance. Some models are better at instruction-following, others at reasoning or self-critique.
- Implication: Different models may yield divergent results on the same template or goal, complicating reproducibility and tuning.
- Future Direction: Enable per-template model selection and automatic benchmarking across models.
3. Latency and Cost
- Issue: Each sharpening cycle involves multiple LLM calls across templates, judges, and evaluations. This introduces real compute cost and time delay.
- Implication: Sharpening can become slow on large datasets or many prompts.
- Future Direction: Add batched inference support, caching, and parallelism; explore lightweight or distilled LLMs for inner-loop tasks.
4. Cold Start for MR.Q
- Issue: MR.Q needs a critical mass of preference pairs (~100+) to make reliable evaluations.
- Implication: New domains may perform poorly without this foundation.
- Current Workaround: Use synthetic or bootstrapped data via dspy.teleprompt.BootstrapFewShot.
5. Embedding Drift
- Issue: Static embeddings (e.g., mxbai-embed-large) used for similarity and memory may become stale as hypotheses evolve.
- Implication: Search and scoring accuracy may degrade over time.
- Future Direction: Automatically re-embed hypotheses after updates:

  if hypothesis_updated:
      emb = get_embedding(hypothesis.text, cfg)
6. Human-in-the-Loop Gaps
- Issue: MR.Q optimizes internal metrics, which may diverge from human values or contextual insight.
- Implication: There’s a risk of overfitting to “what the model thinks is good.”
- Safeguard: Introduce periodic human reviews via flagged_hypotheses and support side-by-side comparisons for auditability.
7. Configuration Complexity
- Issue: The flexibility of co_ai means many things, like training strategy, prompt mode, or evaluation method, are runtime-configurable.
- Implication: Misconfiguration or silent defaults can cause inconsistent behavior across runs or environments.
- Future Direction: Include stricter validation, template explainability tools, and best-practice presets.
Wrapping Up: A System That Learns to Refine Itself
We've now seen how SharpeningAgent orchestrates a multi-step process that brings together:
- Structured prompting
- Template-based transformations
- LLM-driven hypothesis generation
- Evaluation through MR.Q
- Continuous self-improvement
This system doesn't just react; it reflects. It doesn't just generate; it iterates. And it doesn't just respond; it learns how to get better over time.
Why This Matters
Most LLM systems today focus on single-pass outputs. What we're building here goes further, toward continuous improvement loops, where:
- The inputs evolve
- The outputs improve
- And the system rewrites its own reasoning
With MR.Q and SharpeningAgent, we're introducing the foundations for self-tuning pipelines that operate in real time, learn from feedback, and adapt without retraining.
What's Next?
This post is just the beginning. In the next installments, we'll cover:
- Using real-world datasets for refinement
- Evaluating across domains (science, policy, business, etc.)
- Expanding the template library and sharpening strategies
- Building a benchmark dataset for template-driven improvement
- More papers implemented in the pipeline, and lots more
- Using this system to generate synthetic datasets for AI
Sharpening Paper Implementation Checklist
Feature / Step from Sharpening Paper | Implemented in co_ai | Notes |
---|---|---|
1. Initial Prompt + Hypothesis Generation | ✅ | prompt_history holds the prompt-hypothesis pairs to be sharpened. |
2. Self-Evaluation / Preference Signal | ✅ | MRQSelfEvaluator.evaluate(...) produces scores via a trained model. |
3. Prompt Refinement | ✅ | Templates like critic, grow, lens, etc. rewrite the prompt. |
4. Hypothesis Re-generation (post-refinement) | ✅ | The sharpened prompt is used to re-generate an improved hypothesis. |
5. Comparison of Outputs (original vs sharpened) | ✅ | Evaluated by MR.Q using preference learning to determine improvement. |
6. Logging Sharpening Outcomes | ✅ | SharpeningResult records all transformations and scores. |
7. Selective Persistence (save if improved) | ✅ | Only saves new prompts/hypotheses when winner == "b" (i.e., improved). |
8. Multi-template Strategy | ✅ | self.templates iterates over multiple refinement strategies. |
9. Score-based Decision Mechanism | ✅ | Best result is selected by max score difference via MR.Q. |
10. Training Self-Evaluator on DPO-style pairs | ✅ | train_from_database() uses Elo-based pairs for fast local training. |
11. Metadata & Source Tracking | ✅ | Saves original prompt, score diff, and template name in metadata fields. |
12. Configurable Modes (Judge-only vs Templates) | ✅ | Supports both template-based and judge-only self-reward modes. |
Final Reflection
In traditional software, we sharpen our tools before we start cutting. In this new era, the tools sharpen themselves as they think, generate, and reflect.
This isn't prompt engineering. This is prompt programming: an AI that learns to rewrite itself.
References
Self-Improvement in Language Models: The Sharpening Mechanism
Paper: Self-Improvement in Language Models: The Sharpening Mechanism. arXiv: https://arxiv.org/abs/2305.14885
This paper introduced a method for improving model outputs via iterative self-refinement. The core insight is that an LLM can critique and improve its own reasoning through repeated prompting, using templates like Reflect, Critic, and Refine. This approach inspired the SharpeningAgent architecture.
MR.Q: Lightweight Preference-Based Self-Tuning
Concept: MR.Q (Modular Reward Quality Estimator)
Inspired by: Direct Preference Optimization (DPO) and reward modeling
Implementation: This blog post and accompanying repo
Key Idea: A small, local neural network trained on preference pairs (prompt, output A/B, winner) to rapidly score and rank model outputs in real time.
https://arxiv.org/abs/2501.16142v1
Unlike traditional reward models or large-scale DPO training, MR.Q is:
- Fast to train on small datasets
- Interpretable and modular
- Tuned using the same interface it evaluates (embeddings)
MR.Q extends the sharpening concept by adding a quantitative tuning loop.
Code
https://github.com/facebookresearch/MRQ - the original MRQ repo
https://github.com/ernanhughes/MRQ - an extension of MRQ for finance
https://github.com/ernanhughes/co-ai - co_ai, under heavy development, where I implemented this agent
Glossary of Key Terms
Term | Definition |
---|---|
Sharpening | A process where a model iteratively improves its own prompts and outputs using evaluation, feedback, and prompt rewriting, without retraining. |
MR.Q | A small neural network trained on preference data that predicts the quality of hypotheses. It acts as a fast, local reward model. |
DPO (Direct Preference Optimization) | A training technique that optimizes a model based on pairwise preferences (Output A preferred over Output B), used in RLHF. |
Prompt Template | A reusable structured instruction that programs the LLM to perform tasks in a specific way (e.g., CRITIC, GROWS, DEVIL). |
Self-Evaluation | The act of the model evaluating or scoring its own output to guide further refinement. Often done using a critic or numerical reward. |
Hypothesis | A generated output or response to a specific goal or prompt. In this system, hypotheses are evaluated and refined. |
Elo Rating | A dynamic ranking system (from chess) used to compare the relative quality of multiple hypotheses. The higher the rating, the better the response. |
Hydra | A configuration framework for Python that makes it easy to control agents, models, and templates using YAML. |
SharpeningAgent | A Co AI agent that applies sharpening templates to prompts, evaluates improved outputs using MR.Q, and logs and stores better results. |
Template Mode | A configuration where the system cycles through multiple prompt templates to find the best-performing prompt style. |
Judge Mode | A simplified sharpening mode where outputs are scored (e.g., 1-10) using a critic template without running multiple refinements. |