Self-Improving Agents: Applying the Sharpening Framework to Local LLMs

This is the second post in a 100-part series, where we take breakthrough AI papers and turn them into working code, building the next generation of AI one idea at a time.
Summary
In my previous post, I introduced co_ai, a modular implementation of the AI co-scientist concept, inspired by DeepMind's recent paper Towards an AI Co-Scientist.
But now, we're going deeper.
This isn't just about running prompts through an agent system; it's about building something radically different:
An AI that learns as it thinks. A self-improving agent that sharpens its own reasoning in real time, without retraining.
Unlike traditional agents that rely on fixed instructions or full model updates, this system evolves inference-time behavior using a powerful combination of:
- Lightweight reward modeling via MR.Q
- Structured prompting techniques like CRITIC, RECAP, and GROWS
- Real-time feedback loops built entirely with local tools (no API keys, no cloud dependencies)
We're not just prompting; we're programming intelligence at both ends, guiding the model through deliberate reflection, refinement, and measurable improvement.
What You'll Learn in This Post
I'll walk you through how I built a working prototype of this vision, grounded in two cutting-edge papers:
- Self-Improvement in Language Models: The Sharpening Mechanism, which introduces the sharpening framework
- MR.Q: Towards General-Purpose Model-Free Reinforcement Learning, which shows how to learn from preferences without retraining
You’ll see how I applied these ideas to build:
- A hypothesis generation and refinement pipeline
- A lightweight evaluator that learns from every interaction
- A set of programmable templates that guide reasoning
- A local-first architecture that stores everything for traceability and evolution
By the end, you'll understand how to:
- Build a self-tuning agent that improves with each step
- Create feedback loops where the system learns as it thinks
- Run all of this locally, with open-source tools like Ollama, DSPy, and pgvector
This is more than prompt engineering.
It's the first step toward an AI co-scientist that builds knowledge, not just outputs.
Why This Matters
Because for the first time:
- You can build agents that improve themselves during execution
- You don't need access to model weights or massive compute
- You're not fine-tuning; you're sharpening intelligence in real time
And most importantly:
This is not science fiction. It runs on your machine. Right now.
In This Post, We'll Cover
Section | What You'll Learn |
---|---|
The Return of MR.Q | How to use preference learning to evaluate and rank outputs without labels |
Sharpening Mechanism | How to refine hypotheses using structured prompting instead of retraining |
Agent Architecture | How to build a modular, feedback-driven system that evolves over time |
Tracking Improvements | How to store embeddings, log results, and evolve better strategies |
Prompt Programming | How to turn CRITIC, GROWS, RECAP, and other frameworks into code |
The Return of MR.Q
As introduced in MR.Q: A New Approach to Reinforcement Learning in Finance, this framework enables real-time learning, not by retraining models, but by refining how we use them.
MR.Q isn't just another reward model or evaluation framework. It is, at its heart, a mechanism for real-time learning, one that allows us to sharpen our models' outputs without retraining them and, more importantly, without needing access to their weights.
This dynamic, weightless learning capability is what makes MR.Q so powerful, and it's why I built the entire sharpening mechanism in co_ai around it.
The key point is worth repeating: the AI learns without retraining.
The Core Idea: Learning on the Fly
Traditional fine-tuning requires heavy infrastructure, data labeling, and compute resources. But what if you could learn from every interaction your system has, not by changing the model itself, but by refining how you use it?
That's where MR.Q comes in.
"MR.Q enables preference modeling over sequences, allowing models to be sharpened through inference-time refinement rather than parameter updates."
In other words:
- You don't need to train a new model.
- You don't need access to model weights.
- You can adapt behavior dynamically using preference learning.
This is revolutionary because it means:
You can build self-improving agents that evolve in real time, not over weeks of training, but within minutes of execution.
What Kind of Data Does MR.Q Need?
flowchart LR
    A[Input Data: Prompt + Output A/B] --> B[Embedding Lookup: memory.embedding]
    B --> C[TextEncoder: prompt_emb + output_emb → zsa]
    C --> D[HypothesisValuePredictor: value_a / value_b]
    D --> E[Compare Scores: preferred = a or b]
    E --> F[Log / Train: loss.backward or log evaluation]
    style A fill:#f9f,stroke:#333,stroke-width:4px
MR.Q learns from DPO-style (Direct Preference Optimization) data, that is, pairs of outputs with a preference. It doesn't require labeled classifications or explicit numeric scores. All it needs is:
- A prompt (or goal)
- Two outputs (e.g., hypotheses A and B)
- A preference indicating which output is better ("a" or "b")
This format allows MR.Q to learn how to discriminate between stronger and weaker responses in context, and it works with relatively small datasets.
Example Training Item
{
"prompt": "What are the potential benefits of gene editing in agriculture?",
"output_a": "It allows crops to become more resistant to disease and pests.",
"output_b": "Gene editing could potentially create superweeds.",
"preferred": "a"
}
Extracting DPO Data for MR.Q Using SQL
To generate these training pairs from our database, we use an SQL query that follows this strategy:
- For each prompt, we select two hypotheses:
- One with the highest Elo rating (top-rated output)
- One with the lowest Elo rating (worst plausible output)
- We ensure:
- Both hypotheses are enabled
- Both are linked to the same goal and prompt
- They have different scores to ensure a valid comparison
This gives us high-contrast training data, where the preference signal is strong and unambiguous, which is exactly what MR.Q needs to learn a reliable internal value function.
Why It Works:
- Using the most and least successful hypotheses maximizes the training signal.
- Helps MR.Q quickly learn the difference between strong and weak outputs.
- Avoids confusion caused by training on "close call" examples with minor differences.
This method turns your existing hypothesis logs into efficient training pairs, with no extra annotation required. (A short loading sketch follows the query below.)
WITH top_h AS (
SELECT DISTINCT ON (p.id)
p.id AS prompt_id,
g.goal_text AS goal,
p.prompt_text,
h.text AS output_a,
h.elo_rating AS rating_a
FROM prompts p
JOIN goals g ON p.goal_id = g.id
JOIN hypotheses h ON h.prompt_id = p.id
WHERE h.enabled = TRUE
AND h.goal_id = g.id
AND p.agent_name = %s
ORDER BY p.id, h.elo_rating DESC
),
bottom_h AS (
SELECT DISTINCT ON (p.id)
p.id AS prompt_id,
h.text AS output_b,
h.elo_rating AS rating_b
FROM prompts p
JOIN hypotheses h ON h.prompt_id = p.id
JOIN goals g ON p.goal_id = g.id
WHERE h.enabled = TRUE
AND h.goal_id = g.id
AND p.agent_name = %s
ORDER BY p.id, h.elo_rating ASC
)
SELECT
top_h.prompt_id,
top_h.goal,
top_h.prompt_text,
top_h.output_a,
top_h.rating_a,
bottom_h.output_b,
bottom_h.rating_b
FROM top_h
JOIN bottom_h ON top_h.prompt_id = bottom_h.prompt_id
WHERE top_h.rating_a != bottom_h.rating_b
LIMIT %s;
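To show how these rows become MR.Q training items, here is a minimal loading sketch. It assumes a psycopg2 connection and that the query above is stored in a `DPO_PAIRS_SQL` string; the helper name and exact plumbing in co_ai may differ.

```python
import psycopg2  # assumed driver; co_ai may wrap database access differently

DPO_PAIRS_SQL = "..."  # the query shown above

def load_dpo_pairs(conn, agent_name: str, limit: int = 1000) -> list[dict]:
    """Turn high/low Elo hypothesis pairs into DPO-style training items."""
    with conn.cursor() as cur:
        cur.execute(DPO_PAIRS_SQL, (agent_name, agent_name, limit))
        rows = cur.fetchall()
    return [
        {
            "prompt": prompt_text,
            "output_a": output_a,  # top-rated hypothesis
            "output_b": output_b,  # lowest-rated hypothesis
            "preferred": "a",      # by construction, A is the stronger one
        }
        for (_prompt_id, _goal, prompt_text, output_a, _ra, output_b, _rb) in rows
    ]
```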
What Is DPO (Direct Preference Optimization)?
DPO is a learning approach that fine-tunes models using pairwise preference data, where the system is shown two possible outputs and learns to prefer one over the other.
Instead of teaching the model what the "correct" answer is, DPO teaches the model how to rank answers. This avoids the need for strong supervision and allows models to learn from more natural feedback, like:
- Which answer was more helpful?
- Which one aligned better with the user's intent?
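Concretely, pairwise preference learning of this kind minimizes a loss of roughly the following shape, where $v_{\text{preferred}}$ and $v_{\text{rejected}}$ are the predicted values of the two outputs. (The full DPO loss adds a reference policy and a temperature term; this simplified form is what MR.Q optimizes later in this post.)

$$\mathcal{L} = -\log \sigma\!\left(v_{\text{preferred}} - v_{\text{rejected}}\right)$$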
How MR.Q Uses DPO Data
Unlike traditional DPO pipelines that fine-tune large language models directly, MR.Q takes a modular, lightweight approach.
It breaks the preference learning problem into two clean stages:
- Embedding + Compression of Prompt & Outputs
- Value Prediction & Ranking
This makes MR.Q fast to train, easy to understand, and highly adaptable for real-time applications.
MR.Q Embeddings
flowchart LR
    A[Input Data: Prompt + Output A/B] --> B[Embedding Lookup: memory.embedding]
    B --> C[TextEncoder: prompt_emb + output_emb → zsa]
    C --> D[HypothesisValuePredictor: value_a / value_b]
    D --> E[Compare Scores: preferred = a or b]
    E --> F[Log / Train: loss.backward or log evaluation]
    style B fill:#f9f,stroke:#333,stroke-width:4px
Embeddings are the foundation of how MR.Q and the broader Co AI system understand language. Every prompt, hypothesis, or response is converted into a dense vector of numbers that captures its semantic meaning.
To make this process both efficient and consistent, we implemented a centralized embedding utility that:
1. Connects to a Local Embedding Model
We used Ollama as our embedding backend: fast, local, and privacy-respecting.
# example embedding config in co_ai
embeddings:
  model: "mxbai-embed-large"
  dimension: 1024  # dimension of your database embedding columns
  endpoint: "http://localhost:11434/api/embeddings"
Each time we embed a piece of text, we send it to this endpoint using a lightweight POST request. The model returns a vector of floats that represents the text's meaning in high-dimensional space.
import requests

def get_embedding(text: str, cfg):
    """
    Get an embedding from Ollama using the configured model.
    Args:
        text (str): The input text to embed.
        cfg (dict): Configuration containing 'model' and optionally 'endpoint'.
    Returns:
        list[float]: The embedding vector.
    """
    cached = embedding_cache.get(text)
    if cached is not None:
        print("Using cached embedding")
        return cached
    model = cfg.get("embeddings", {}).get("model", "mxbai-embed-large")
    endpoint = cfg.get("embeddings", {}).get("endpoint", "http://localhost:11434/api/embeddings")
    response = requests.post(
        endpoint,
        json={"model": model, "prompt": text},
    )
    response.raise_for_status()
    embedding = response.json().get("embedding")
    embedding_cache[text] = embedding  # store for reuse on the next call
    return embedding
2. Caches Embeddings for Reuse
Embeddings are expensive to compute, so we cache them:

cached = embedding_cache.get(text)
if cached is not None:
    print("Using cached embedding")
    return cached

This ensures we never recompute embeddings for the same text, dramatically speeding up evaluation and reducing redundant API calls. (The cache itself can be as simple as a module-level dictionary; a minimal sketch follows.)
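The `embedding_cache` used above isn't shown in the snippet. A minimal in-process version could be a plain dictionary; this is an assumption on my part, and co_ai's real cache may be bounded or backed by the database table described next.

```python
# Hypothetical in-process cache keyed by the raw text.
# A production version might hash the keys and cap the cache size.
embedding_cache: dict[str, list[float]] = {}
```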
3. Stores Embeddings in a Central Table
In the Co AI framework, all embeddings (prompts, hypotheses, results) are stored in a shared table in the database. This gives us:
- A single source of truth for embeddings
- Seamless access across agents (Refiner, Sharpening, MR.Q, etc.)
- Persistent storage for long-term usage and auditability
Every time MR.Q needs to compare two outputs, it pulls their pre-computed embeddings from this shared table via:
self.memory.embedding.get_or_create(text)
This ensures consistency across the entire framework: all agents see the same semantic space.
def get_or_create(self, text):
    try:
        with self.db.cursor() as cur:
            cur.execute("SELECT embedding FROM embeddings WHERE text = %s", (text,))
            row = cur.fetchone()
            if row:
                return row[0]  # the stored vector for this text
            # On a miss, the embedding is computed and inserted (see the sketch below)
    except Exception as e:
        print(f"Exception: {type(e).__name__}: {e}")
        if self.logger:
            self.logger.log("EmbeddingFetchFailed", {"error": str(e)})
𧬠Text Encoder
flowchart LR A[π Input Data Prompt + Output A/B] --> B[π Embedding Lookup memory.embedding] B --> C[π€ TextEncoder prompt_emb + output_emb β zsa] C --> D[π HypothesisValuePredictor value_a / value_b] D --> E[βοΈ Compare Scores preferred = a or b] E --> F[π Log / Train loss.backward or log evaluation] style C fill:#f9f,stroke:#333,stroke-width:4px
At the heart of MR.Q is a deceptively simple idea:
"The better a hypothesis fits a prompt, the higher its value should be."
But how do we numerically represent a prompt and a hypothesis in a way that captures their meaning, relationship, and quality?
That's the job of the TextEncoder.
What the TextEncoder Does
Once we have the embeddings, we pass them into the TextEncoder to produce a combined feature vector that represents the interaction between:
- The prompt (what we're trying to solve)
- The response/hypothesis (a proposed solution)
This combination is crucial: we're not evaluating the response in isolation. We're asking:
"How well does this response fit this specific prompt?"
Anatomy of the Encoder
[prompt_emb] -> zs_mlp -> zs
[response_emb] -> za_mlp -> za
concat(zs, za) -> zsa_mlp -> zsa
- zs_mlp: Maps the prompt to a latent intent vector (zs)
- za_mlp: Maps the hypothesis to a response vector (za)
- zsa_mlp: Fuses both into a joint signal (zsa) used for value prediction
This fusion allows the model to capture alignment, plausibility, and even strategy compatibility between prompt and hypothesis.
Why This Matters for MR.Q
MR.Q needs to decide which hypothesis is better, but this isn't a generic comparison.
It's a contextual comparison: better for this specific prompt.
The TextEncoder is the bridge between raw embeddings and meaningful, context-aware evaluation. Without it, MR.Q would be blind to the nuances of different tasks.
Tunable and Modular
Because it's implemented as a PyTorch module:
- You can experiment with different architectures (e.g., attention, residuals)
- You can plug in different embedding models upstream
- It runs on CPU or GPU, enabling lightweight training loops
Summary
The TextEncoder transforms static embeddings into dynamic relationships.
It's not just encoding text; it's encoding relevance, quality, and fit between goal and solution.
If MR.Q is a mini brain, the TextEncoder is its attention mechanism, deciding what to focus on and what matters most.
import torch
import torch.nn as nn
import torch.nn.functional as F


# TextEncoder for embedding prompts and hypotheses
class TextEncoder(nn.Module):
    def __init__(self, embedding_dim=1024, zs_dim=512, za_dim=256, zsa_dim=512, hdim=1024):
        super().__init__()
        # Maps the prompt embedding to a latent intent vector (zs)
        self.zs_mlp = nn.Sequential(
            nn.Linear(embedding_dim, hdim),
            nn.ReLU(),
            nn.Linear(hdim, zs_dim)
        )
        # Maps the hypothesis embedding to a response vector (za)
        self.za_mlp = nn.Sequential(
            nn.Linear(embedding_dim, hdim),
            nn.ReLU(),
            nn.Linear(hdim, za_dim)
        )
        # Fuses both into the joint signal (zsa) used for value prediction
        self.zsa_mlp = nn.Sequential(
            nn.Linear(zs_dim + za_dim, zsa_dim),
            nn.ReLU(),
            nn.Linear(zsa_dim, zsa_dim)
        )

    def forward(self, prompt_emb, response_emb):
        zs = F.relu(self.zs_mlp(prompt_emb))
        za = F.relu(self.za_mlp(response_emb))
        zsa = self.zsa_mlp(torch.cat([zs, za], dim=1))
        return zsa
Scoring Intelligence: The HypothesisValuePredictor
flowchart LR
    A[Input Data: Prompt + Output A/B] --> B[Embedding Lookup: memory.embedding]
    B --> C[TextEncoder: prompt_emb + output_emb → zsa]
    C --> D[HypothesisValuePredictor: value_a / value_b]
    D --> E[Compare Scores: preferred = a or b]
    E --> F[Log / Train: loss.backward or log evaluation]
    style D fill:#f9f,stroke:#333,stroke-width:4px
Once we've encoded the relationship between a prompt and a hypothesis into a vector using the TextEncoder, we need to answer a very human question:
"How good is this hypothesis, really?"
That's where the HypothesisValuePredictor comes in.
What It Does
The HypothesisValuePredictor is a tiny neural network that takes in a combined prompt-hypothesis representation (called zsa) and outputs a single score: a scalar that represents the predicted "quality" or "value" of that hypothesis in the context of the prompt.
# Predicts a scalar "value" for a (prompt, hypothesis) pair from its zsa vector
class HypothesisValuePredictor(nn.Module):
    def __init__(self, input_dim=512, hidden_dim=1024):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)  # Output: a single scalar
        )

    def forward(self, x):
        return self.network(x)
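To make the data flow concrete, here is how the two modules fit together on a single prompt/hypothesis pair. This is an illustrative sketch: in co_ai the vectors come from the shared embeddings table, not random tensors.

```python
import torch

encoder = TextEncoder(embedding_dim=1024)            # matches the mxbai-embed-large dimension
predictor = HypothesisValuePredictor(input_dim=512)  # matches the encoder's zsa_dim

# Stand-ins for embeddings that would normally come from memory.embedding
prompt_emb = torch.randn(1, 1024)
hypothesis_emb = torch.randn(1, 1024)

zsa = encoder(prompt_emb, hypothesis_emb)  # joint prompt/response representation
value = predictor(zsa)                     # predicted quality, shape (1, 1)
print(value.item())
```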
Compare Scores: Determining the Better Output
flowchart LR
    A[Input Data: Prompt + Output A/B] --> B[Embedding Lookup: memory.embedding]
    B --> C[TextEncoder: prompt_emb + output_emb → zsa]
    C --> D[HypothesisValuePredictor: value_a / value_b]
    D --> E[Compare Scores: preferred = a or b]
    E --> F[Log / Train: loss.backward or log evaluation]
    style E fill:#f9f,stroke:#333,stroke-width:4px
At the heart of the sharpening process lies a simple but powerful mechanism: scoring and comparing competing hypotheses.
After generating a sharpened hypothesis from a prompt using one of our templated refinement strategies, we pass both the original and the sharpened versions to MR.Q, our lightweight reward model. MR.Q outputs a pair of numerical values:
- value_a: the model's predicted score for the original hypothesis
- value_b: the model's predicted score for the sharpened hypothesis
We then perform a direct comparison:
preferred = "a" if value_a >= value_b else "b"
This comparison determines:
- Which version performed better, according to MR.Q's learned preferences.
- Whether the sharpening actually improved the hypothesis (improved = preferred == "b").
- Which prompt template led to the best refinement in that context.
The result is stored along with metadata (a small helper below shows how these fields fit together):
- score_diff: the absolute improvement between versions
- winner: "a" or "b", the preferred result
- comparison: "sharpened_better" or "original_better", for easy filtering and visualization
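Assembling that record is only a few lines. The field names below follow the list above, though the exact schema in co_ai may differ slightly:

```python
def compare_scores(value_a: float, value_b: float) -> dict:
    """Build the comparison record for an original (a) vs. sharpened (b) hypothesis."""
    preferred = "a" if value_a >= value_b else "b"
    return {
        "winner": preferred,
        "improved": preferred == "b",  # did sharpening actually help?
        "score_diff": abs(value_b - value_a),
        "comparison": "sharpened_better" if preferred == "b" else "original_better",
    }
```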
Example
Let's say we evaluate a pair:
- value_a = 72.3 (original)
- value_b = 78.1 (sharpened)
Then:
preferred = "b"
improved = True
score_diff = 5.8
This allows us to:
- Log and analyze which templates consistently improve results
- Automatically save improved prompts and hypotheses
- Visualize which strategies yield meaningful refinement
Logging & Training in MR.Q
flowchart LR
    A[Input Data: Prompt + Output A/B] --> B[Embedding Lookup: memory.embedding]
    B --> C[TextEncoder: prompt_emb + output_emb → zsa]
    C --> D[HypothesisValuePredictor: value_a / value_b]
    D --> E[Compare Scores: preferred = a or b]
    E --> F[Log / Train: loss.backward or log evaluation]
    style F fill:#f9f,stroke:#333,stroke-width:4px
How the model learns and how we track it
In MR.Q, learning happens in a tight feedback loop, powered by preference data (DPO-style pairs) and guided by a small neural network trained to distinguish better outputs from worse ones.
The Learning Step: loss.backward()
During training, we compare two embeddings:
- zsa_a: the vector representation of (prompt, output A)
- zsa_b: the vector representation of (prompt, output B)
(zsa is the joint representation of a prompt (zs) and a response (za).)
We construct a difference vector based on the preferred output:
diff = zsa_a - zsa_b if preferred == "a" else zsa_b - zsa_a
Then we feed this difference into our HypothesisValuePredictor, a small feed-forward neural network, and optimize its weights using:

loss = -torch.log(torch.sigmoid(preds)).mean()
loss.backward()
opt.step()

This step tunes MR.Q to assign higher values to preferred responses, effectively learning what "better" means in the context of your domain and goals.
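For context, here is a condensed sketch of the full training step, which also shows where `preds` comes from. The names `encoder`, `value_predictor`, and `pairs` stand in for the TextEncoder instance, the HypothesisValuePredictor instance, and an iterable of preference pairs; the real MR.Q trainer in co_ai is organized differently but follows the same logic.

```python
import torch

# encoder / value_predictor are the TextEncoder and HypothesisValuePredictor from above
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(value_predictor.parameters()), lr=1e-4
)

# `pairs` is assumed to yield (prompt_emb, emb_a, emb_b, preferred) tuples,
# processed one preference pair at a time for clarity
for prompt_emb, emb_a, emb_b, preferred in pairs:
    zsa_a = encoder(prompt_emb, emb_a)  # joint vector for (prompt, output A)
    zsa_b = encoder(prompt_emb, emb_b)  # joint vector for (prompt, output B)

    # Difference vector points from the rejected output toward the preferred one
    diff = zsa_a - zsa_b if preferred == "a" else zsa_b - zsa_a
    preds = value_predictor(diff)       # higher means more confident in the stated preference

    loss = -torch.log(torch.sigmoid(preds)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```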
The Tracking Step: MRQTrainingEpoch
After each epoch, we log key stats:
self.logger.log("MRQTrainingEpoch", {
"epoch": epoch + 1,
"avg_loss": round(avg_loss, 5),
"goal": goal
})
These logs give visibility into:
- Training progress (e.g., loss over time)
- Convergence trends (flattening loss means learning is stabilizing)
- Traceability (what goal, what data, what results)
You'll see entries like:
[MRQTrainingEpoch] {'epoch': 3, 'avg_loss': 0.64518, 'goal': 'Can AI improve diagnostic accuracy in radiology?'}
These logs are essential for:
- Debugging
- Verifying model behavior
- Determining when to stop training (e.g., when loss plateaus)
You can configure training parameters in the agent config file.
device: cpu        # device used for MR.Q training ("cpu" or "cuda")
limit: 1000        # maximum number of training pairs pulled from the database
epochs: 20         # number of training epochs
patience: 3        # (early stopping) how many epochs to wait with no improvement
min_delta: 0.0001  # (early stopping) minimum change in loss to qualify as improvement
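A minimal sketch of how patience and min_delta can drive early stopping; `train_one_epoch` is an assumed helper returning the epoch's average loss, and the real trainer may track this differently:

```python
best_loss = float("inf")
epochs_without_improvement = 0

for epoch in range(cfg["epochs"]):
    avg_loss = train_one_epoch()  # assumed helper returning the epoch's mean loss

    if best_loss - avg_loss > cfg["min_delta"]:
        best_loss = avg_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1

    if epochs_without_improvement >= cfg["patience"]:
        break  # loss has plateaued for `patience` epochs: stop early
```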
Future Extensions with MR.Q
Because MR.Q operates entirely at inference time, it opens up several exciting paths forward:
- Continuous online learning from user feedback
- Multi-agent tournaments where hypotheses compete
- Multi-model competition, where we try different models for different pipelines and strategies
- Preference transfer across domains
- Visualization of strategy evolution via embedding space
All of these can be layered on top of the existing structure without modifying the base LLM.
Final Thoughts
Choosing MR.Q wasn't just a technical decision; it was a philosophical one.
It allowed me to build a system where:
- Learning happens continuously
- Feedback drives improvement
- Agents become smarter through experience, not retraining
This dynamic, lightweight, and modular approach is what sets co_ai apart, and why MR.Q will be a central focus in the paper.
Because in the end, this isn’t about fine-tuning language models.
It's about sharpening intelligence through reflection, preference, and iteration, and doing it all in real time.
And thatβs exactly what MR.Q lets us do.
As you have probably guessed by now, I am a big fan of MR.Q. However, this agent is really about sharpening and how we can use it to improve our results, so let's get into that now.
Sharpening
While MR.Q gives us a powerful mechanism to evaluate hypotheses using lightweight preference-based learning, it doesn’t generate new ones on its own.
That's where sharpening comes in.
Inspired by the Sharpening paper, this approach treats prompt templates as programmable transformations, not just static instructions. Instead of tuning a model, we tune the interaction between the model and the problem, modifying prompts to encourage better reasoning.
At the heart of this idea is a simple but powerful loop:
- Generate an initial hypothesis from a prompt
- Reflect on its quality or limitations
- Refine the prompt or the hypothesis using expert-style templates
- Evaluate whether the refined output is better (via MR.Q)
- Repeat with the best-performing prompt
This isn't just prompt engineering; it's a structured, template-driven feedback loop that evolves the interaction itself. By applying multiple sharpening strategies (like critic, grow, lens, and more), we experiment with different reasoning styles to find what works best.
In co_ai, the SharpeningAgent handles this entire process. It:
- Loads a set of expert-crafted templates
- Applies them to previous prompts/hypotheses
- Evaluates the refined versions using MR.Q
- Logs which strategies lead to measurable improvements
- Optionally saves improved prompts and hypotheses for future use
In short, while MR.Q helps us choose what’s better, sharpening helps us create something better over time, over cycles, and without changing the underlying model.
Let's look at how that's implemented in the agent logic.
flowchart LR
    A[Prompt + Hypothesis] --> B[For each Template]
    B --> C[Apply Sharpening Template: CRITIC, GROWS, ...]
    C --> D[Generate Sharpened Hypothesis via LLM]
    D --> E[Evaluate A vs B using MR.Q]
    E --> F{Is B Better?}
    F -- Yes --> G[Save Sharpened Prompt & Hypothesis]
    F -- No --> H[Skip Saving]
    G & H --> I[Log SharpeningResult]
    I --> J[Select Best Output]
Prompting Techniques: Our Templates of Change
These templates are a set of techniques we use to make the LLM think about or evaluate its work. This is the list we are working with now.
You can find out more about each template here: Prompting Techniques
Template | Description |
---|---|
critic | Uses the CRITIC framework to systematically analyze and improve hypotheses by identifying assumptions, gaps, and proposing refinements. |
grow | Applies a lightweight GROW-style pattern to generate and refine a hypothesis through review and iteration. |
grows | Full GROWS Loop: Generate, Review, Optimize, Work Again, and Stop. Used for iterative reasoning and continuous hypothesis improvement. |
devil | Adopts a Devil's Advocate stance: challenges the current hypothesis by identifying flaws, contradictions, and weak assumptions. |
lens | Applies the Three Lens Review (Strategic, Tactical, Operational) to assess the hypothesis or plan from multiple perspectives. |
cot | Implements Chain-of-Thought (CoT) prompting, encouraging step-by-step reasoning before presenting the refined hypothesis. |
aot | Uses the Atom of Thought (AoT) pattern to break down complex goals into subproblems, answer them independently, and reassemble a solution. |
recap | Applies the RECAP framework (Evidence, Context, Analysis, Perspective) to comprehensively evaluate and refine a hypothesis. |
reflect | Invokes the REFLECT process to perform introspective reasoning about a hypothesis or decision; ideal for after-action improvement. |
step | Leverages the STEP framework (Structure, Think, Evaluate, Proceed) to guide the model through careful, logical progression. |
swapi | Structured critique via S.W.A.P.I. (Strengths, Weaknesses, Assumptions, Proposals, Iteration) for refining hypotheses through review cycles. |
Templates: Programming the LLM, One Prompt at a Time
So what are we really doing when we apply these sharpening templates?
At a glance, it may seem like we’re just feeding variations of instructions into the model. But under the hood, we’re doing something far more powerful:
Each template is a miniature program: a reasoning strategy encoded in text.
In traditional programming, we write functions and logic to process data. Here, the LLM is the processor, and our prompts are the code. Every template, whether it's CRITIC, GROWS, or SWAPI, defines a different mental algorithm. It tells the model how to think, what to focus on, and how to evaluate its own outputs.
These templates are:
- Modular reasoning routines: small, composable strategies you can mix and match.
- Reusable mental scaffolds: templates like GROWS and AoT don't just generate one result; they structure how the model thinks.
- Sharpeners: by applying these templates, we're not just rewriting prompts or responses. We're refining the reasoning behind them.
Think of it like this:
- The goal is your intention.
- The prompt is the tool.
- The template is the technique you're using with that tool, carving deeper, cleaner, more useful hypotheses with each pass.
We're programming the model at runtime, using natural language as our IDE.
And in doing so, we're not just generating outputs; we're actively crafting a system that learns how to reason better, step by step.
GROWS Template Example
In my testing this was by far the most effective template. When you review it, I bet you can guess why.
You are an iterative assistant tasked with improving a hypothesis using the GROWS Loop:
1. **Generate**: Start with the current hypothesis.
2. **Review**: Rate the hypothesis (1-10) and identify areas for improvement.
3. **Optimize**: Rewrite the hypothesis based on feedback.
4. **Work Again**: Present the revised version.
5. **Stop**: Evaluate whether the output meets the desired quality or requires another iteration.
Goal:
{{ goal }}
Preferences:
{% for p in preferences %}
- {{ p }}
{% endfor %}
{% if examples %}
Examples:
{% for h in examples %}
Hypothesis {{ loop.index }}:
{{ h.hypothesis }}
Review:
{{ h.review }}
{% endfor %}
{% endif %}
Instructions:
Follow the GROWS loop iteratively. Stop when the hypothesis scores above 8/10 in your own review.
Output format:
Refined Hypothesis: <your improved version here>
Score: <score>
Review: <justification>
Notice how it keeps self-improving until it reaches a very high score.
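Since these are Jinja templates, applying one is just rendering it with the current goal, preferences, and examples before the result goes to the LLM. A minimal standalone rendering might look like this; the file path and variable values are illustrative, and co_ai actually resolves templates through its prompt_loader:

```python
from jinja2 import Template

# Path and values are illustrative assumptions, not co_ai's real layout
with open("prompts/grows.txt") as f:
    grows = Template(f.read())

prompt = grows.render(
    goal="Can AI improve diagnostic accuracy in radiology?",
    preferences=["cite supporting evidence", "keep the hypothesis testable"],
    examples=[],  # optionally, prior hypotheses with their reviews
)
# `prompt` is then sent to the local LLM (e.g., via call_llm in the agent below)
```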
SharpeningAgent Code: Putting It All Together
Everything you've seen so far (template transformation, LLM execution, and MR.Q evaluation) is orchestrated by the SharpeningAgent.
This agent ties together:
- Goal and hypothesis selection
- Template-based prompt programming
- LLM execution
- Real-time evaluation via MR.Q
- Selective saving of improved prompts and outputs
Here's the core logic:
class SharpeningAgent(BaseAgent):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.target = cfg.get("target", "generation")
        self.device = cfg.get("device", "cpu")
        self.evaluator = MRQSelfEvaluator(memory, logger, device=self.device)
        self.templates = cfg.get("templates", ["critic"])

    async def run(self, context: dict):
        goal = context.get(GOAL)
        # Refresh MR.Q from the latest Elo-ranked pairs before evaluating anything
        self.evaluator.train_from_database(goal=goal, cfg=self.cfg)
        prompts = context.get("prompt_history", {}).get(self.target, [])
        results = []
        for data in prompts:
            result = self.run_selected(data, context)
            results.append(result)
            if self.cfg.get("log_results", False):
                self.log_sharpening_results(goal, data.get("prompt"), data.get("response"), result)
        context[self.output_key] = results
        return context
And just like the diagram showed, each prompt is run through templates, evaluated, and optionally saved:
def run_selected(self, data: dict, context: dict) -> list[dict]:
    ...
    for name in self.templates:
        prompt_template = self.prompt_loader.from_file(name, self.cfg, merged)
        sharpened_hypothesis = self.call_llm(prompt_template, merged)
        ...
        preferred_output, scores = self.evaluator.evaluate(goal, prompt, hypothesis, sharpened_hypothesis)
        ...
        if entry["improved"]:
            self.save_improved(...)
What makes this powerful is that it operates as a looping, iterative refinement system, applying new strategies until the outputs actually improve.
Configuring the Agent
The SharpeningAgent is built using Hydra, a powerful configuration management system. This design lets you define the agent's behavior entirely through structured YAML configuration files, making it easy to experiment, extend, and deploy across different environments.
Here's a breakdown of what the configuration controls:
Core Parameters
sharpening:
  name: sharpening
  target: generation
  device: cpu
- name: Identifier for the agent.
- target: Which part of the context the agent will operate on (e.g., "generation").
- device: Can be "cpu" or "cuda" depending on your setup.
Training Controls
limit: 1000
epochs: 20
patience: 3
min_delta: 0.0001
- limit: Maximum number of training samples to use from the database.
- epochs: Number of full training cycles for MR.Q.
- patience: Early stopping criterion; training will halt if no improvement is seen for this many epochs.
- min_delta: Minimum required improvement in loss to count as progress.
Data and Result Management
log_results: true
save_improved: true
save_context: false
skip_if_completed: false
- log_results: Save sharpening outcomes (e.g., score comparisons, improvements) to the database.
- save_improved: Store newly refined prompts and hypotheses when improvements are detected.
- save_context: Optionally record the full context for traceability.
- skip_if_completed: Skip processing if the context already contains sharpening results.
LLM Configuration
model:
  name: ollama_chat/qwen3
  api_base: http://localhost:11434

This block specifies the local LLM endpoint used for prompt completion and scoring; in this case, qwen3 via Ollama.
Sharpening Strategy
mode: templates # Options: templates, judge, compare_mrq
templates:
- critic
- grow
...
You can choose how the agent sharpens hypotheses:
- templates: Apply a sequence of specialized Jinja-based refinement strategies.
- judge: Use a scoring model to evaluate a single hypothesis.
- compare_mrq: Run MR.Q-based evaluations to compare hypotheses and select the best.
The templates list defines which prompt refiners to use. Each template represents a sharpening strategy (like critic, reflect, grows, etc.) that shapes how the hypothesis is rewritten.
Prompt Configuration
required_keys: ["goal", "prompt_history"]
input_key: "prompt_history"
output_key: "sharpening"
prompt_mode: strategy
strategy: sharpening
This defines how the agent navigates the context and where results are stored:
- required_keys: Keys that must exist in the input context.
- input_key / output_key: Where to read from and write to in the context.
- prompt_mode: Prompt generation mode (strategy, initial, etc.).
- strategy: Which type of refinement flow to use.
𧬠Evolving Better Strategies Over Time
Once you’ve seen which templates lead to higher-ranked hypotheses, reuse that knowledge.
File: co_ai/agents/prompt_tuning.py
from dspy import Predict
from dspy.teleprompt import BootstrapFewShot


class PromptTuningAgent:
    def run(self, context: dict) -> dict:
        goal = context.get("goal")
        # Use the best-ranked hypotheses as few-shot training material
        examples = self.memory.hypotheses.get_top_ranked(goal, limit=20)
        training_set = self._build_training_set(examples)
        tuner = BootstrapFewShot(metric=self._exact_match_metric)
        tuned_program = tuner.compile(
            student=Predict(PromptTuningSignature), trainset=training_set
        )
        best_prompt = tuned_program.student.demos[0].input_prompt
        context["refined_prompt"] = best_prompt
        return context
Train improved versions:
python main.py agents/prompt_tuning.enabled=true
Now you're implementing exactly what the Sharpening paper describes:

$$\pi^{*}_{\beta} = \arg\max_{\pi \in \Pi} \left\{ \mathbb{E}_{\pi}\!\left[ r_{\text{self}}(y \mid x) \right] - \beta\, D_{\mathrm{KL}}\!\left( \pi \,\|\, \pi_{\text{base}} \right) \right\}$$
Limitations & Future Directions
While the sharpening pipeline powered by MR.Q represents a major step forward in self-improving AI workflows, several key limitations remain, both in design and implementation. These are not just technical footnotes, but active design tensions that will shape how this system scales and evolves.
1. Template Quality Is Strategy
- Issue: The effectiveness of sharpening heavily depends on the quality and intent of the prompt templates used (e.g., CRITIC, DEVIL, GROWS).
- Implication: Poorly written or misaligned templates can degrade hypothesis quality, making template design itself a form of programming the AI.
- Future Direction: Support auto-selection or even template evolution, e.g., scoring templates by average improvement (a rough sketch follows).
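As a rough sketch of what that could look like, the logged sharpening results already contain enough signal to rank templates by average improvement. The `results` list and its `template`, `winner`, and `score_diff` fields are assumptions modeled on the records described earlier:

```python
from collections import defaultdict

def rank_templates(results: list[dict]) -> list[tuple[str, float]]:
    """Rank sharpening templates by the average (signed) score_diff they produced."""
    diffs = defaultdict(list)
    for r in results:
        # Count improvements positively and regressions negatively
        signed = r["score_diff"] if r["winner"] == "b" else -r["score_diff"]
        diffs[r["template"]].append(signed)
    return sorted(
        ((name, sum(v) / len(v)) for name, v in diffs.items()),
        key=lambda item: item[1],
        reverse=True,
    )
```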
2. Model Choice Matters
- Issue: The LLM used to generate and evaluate hypotheses significantly affects performance. Some models are better at instruction-following, others at reasoning or self-critique.
- Implication: Different models may yield divergent results on the same template or goal, complicating reproducibility and tuning.
- Future Direction: Enable per-template model selection and automatic benchmarking across models.
3. Latency and Cost
- Issue: Each sharpening cycle involves multiple LLM calls across templates, judges, and evaluations. This introduces real compute cost and time delay.
- Implication: Sharpening can become slow on large datasets or many prompts.
- Future Direction: Add batched inference support, caching, and parallelism; explore lightweight or distilled LLMs for inner-loop tasks.
4. Cold Start for MR.Q
- Issue: MR.Q needs a critical mass of preference pairs (~100+) to make reliable evaluations.
- Implication: New domains may perform poorly without this foundation.
- Current Workaround: Use synthetic or bootstrapped data via dspy.teleprompt.BootstrapFewShot.
5. Embedding Drift
- Issue: Static embeddings (e.g., mxbai-embed-large) used for similarity and memory may become stale as hypotheses evolve.
- Implication: Search and scoring accuracy may degrade over time.
- Future Direction: Automatically re-embed hypotheses after updates:

  if hypothesis_updated:
      emb = get_embedding(hypothesis.text, cfg)
6. Human-in-the-Loop Gaps
- Issue: MR.Q optimizes internal metrics, which may diverge from human values or contextual insight.
- Implication: There’s a risk of overfitting to “what the model thinks is good.”
- Safeguard: Introduce periodic human reviews via flagged_hypotheses and support side-by-side comparisons for auditability.
7. Configuration Complexity
- Issue: The flexibility of co_ai means many things, like training strategy, prompt mode, or evaluation method, are runtime-configurable.
- Implication: Misconfiguration or silent defaults can cause inconsistent behavior across runs or environments.
- Future Direction: Include stricter validation, template explainability tools, and best-practice presets.
Wrapping Up: A System That Learns to Refine Itself
We've now seen how SharpeningAgent orchestrates a multi-step process that brings together:
- Structured prompting
- Template-based transformations
- LLM-driven hypothesis generation
- Evaluation through MR.Q
- Continuous self-improvement
This system doesn't just react; it reflects. It doesn't just generate; it iterates. And it doesn't just respond; it learns how to get better over time.
Why This Matters
Most LLM systems today focus on single-pass outputs. What we're building here goes further, toward continuous improvement loops, where:
- The inputs evolve
- The outputs improve
- And the system rewrites its own reasoning
With MR.Q and SharpeningAgent, we're introducing the foundations for self-tuning pipelines that operate in real time, learn from feedback, and adapt without retraining.
What's Next?
This post is just the beginning. In the next installments, we'll cover:
- Using real-world datasets for refinement
- Evaluating across domains (science, policy, business, etc.)
- Expanding the template library and sharpening strategies
- Building a benchmark dataset for template-driven improvement
- More papers implemented in the pipeline, and lots more
- Using this system to generate synthetic datasets for AI
Sharpening Paper Implementation Checklist
Feature / Step from Sharpening Paper | Implemented in co_ai | Notes |
---|---|---|
1. Initial Prompt + Hypothesis Generation | ✅ | prompt_history holds the prompt-hypothesis pairs to be sharpened. |
2. Self-Evaluation / Preference Signal | ✅ | MRQSelfEvaluator.evaluate(...) produces scores via a trained model. |
3. Prompt Refinement | ✅ | Templates like critic, grow, lens, etc. rewrite the prompt. |
4. Hypothesis Re-generation (post-refinement) | ✅ | The sharpened prompt is used to re-generate an improved hypothesis. |
5. Comparison of Outputs (original vs sharpened) | ✅ | Evaluated by MR.Q using preference learning to determine improvement. |
6. Logging Sharpening Outcomes | ✅ | SharpeningResult records all transformations and scores. |
7. Selective Persistence (save if improved) | ✅ | Only saves new prompts/hypotheses when winner == "b" (i.e., improved). |
8. Multi-template Strategy | ✅ | self.templates iterates over multiple refinement strategies. |
9. Score-based Decision Mechanism | ✅ | Best result is selected by max score difference via MR.Q. |
10. Training Self-Evaluator on DPO-style pairs | ✅ | train_from_database() uses Elo-based pairs for fast local training. |
11. Metadata & Source Tracking | ✅ | Saves original prompt, score diff, and template name in metadata fields. |
12. Configurable Modes (Judge-only vs Templates) | ✅ | Supports both template-based and judge-only self-reward modes. |
Final Reflection
In traditional software, we sharpen our tools before we start cutting. In this new era, the tools sharpen themselves as they think, generate, and reflect.
This isn't prompt engineering. This is prompt programming: an AI that learns to rewrite itself.
References
Self-Improvement in Language Models: The Sharpening Mechanism
Paper: Self-Improvement in Language Models: The Sharpening Mechanism. arXiv: https://arxiv.org/abs/2305.14885
This paper introduced a method for improving model outputs via iterative self-refinement. The core insight is that an LLM can critique and improve its own reasoning through repeated prompting, using templates like Reflect, Critic, and Refine. This approach inspired the SharpeningAgent architecture.
MR.Q: Lightweight Preference-Based Self-Tuning
Concept: MR.Q (Modular Reward Quality Estimator)
Inspired by: Direct Preference Optimization (DPO) and reward modeling
Implementation: This blog post and accompanying repo
Key Idea: A small, local neural network trained on preference pairs (prompt, output A/B, winner) to rapidly score and rank model outputs in real time.
https://arxiv.org/abs/2501.16142v1
Unlike traditional reward models or large-scale DPO training, MR.Q is:
- Fast to train on small datasets
- Interpretable and modular
- Tuned using the same interface it evaluates (embeddings)
MR.Q extends the sharpening concept by adding a quantitative tuning loop.
Code
https://github.com/facebookresearch/MRQ - the original MRQ repo
https://github.com/ernanhughes/MRQ - an extension of MRQ for finance
https://github.com/ernanhughes/co-ai - co_ai, under heavy development, where I implemented this agent
Glossary of Key Terms
Term | Definition |
---|---|
Sharpening | A process where a model iteratively improves its own prompts and outputs using evaluation, feedback, and prompt rewriting, without retraining. |
MR.Q | A small neural network trained on preference data that predicts the quality of hypotheses. It acts as a fast, local reward model. |
DPO (Direct Preference Optimization) | A training technique that optimizes a model based on pairwise preferences (Output A preferred over Output B), used in RLHF. |
Prompt Template | A reusable structured instruction that programs the LLM to perform tasks in a specific way (e.g., CRITIC, GROWS, DEVIL). |
Self-Evaluation | The act of the model evaluating or scoring its own output to guide further refinement. Often done using a critic or numerical reward. |
Hypothesis | A generated output or response to a specific goal or prompt. In this system, hypotheses are evaluated and refined. |
Elo Rating | A dynamic ranking system (from chess) used to compare the relative quality of multiple hypotheses. The higher the rating, the better the response. |
Hydra | A configuration framework for Python that makes it easy to control agents, models, and templates using YAML. |
SharpeningAgent | A Co AI agent that applies sharpening templates to prompts, evaluates improved outputs using MR.Q, and logs and stores better results. |
Template Mode | A configuration where the system cycles through multiple prompt templates to find the best-performing prompt style. |
Judge Mode | A simplified sharpening mode where outputs are scored (e.g., 1-10) using a critic template without running multiple refinements. |