Self-Improving AI: A System That Learns, Validates, and Retrains Itself

🤖 The Static AI Trap
Today’s AI systems are frozen in time: trained once, deployed forever. Yet the real world never stops evolving. Goals shift overnight. New research upends old truths. Context transforms without warning.
What if your AI could wake up?
In this post, we engineer an intelligence that teaches itself: a system that continuously learns from the web, audits its own judgments, and retrains itself when confidence wavers.
We’ll build two breakthrough capabilities:
- Goal → Model Pipeline: Transform ambiguous goals (“Build an AI that self-improves”) into precision reward models using Arxiv research.
- Self-Tuning Loop: The AI’s internal auditor that spots drift, validates against GPT-4, and triggers retraining, with no humans needed.
By the end, you’ll deploy systems that don’t just process information… they evolve with it.
This system is built around a new approach we call RIVAL:
Reinforcement learning with Iterative and adVersarial optimization of Language models.
RIVAL is a closed-loop framework where an AI system doesn’t just learn from a static dataset; it actively evaluates, challenges, and retrains itself based on feedback from a trusted oracle (e.g., an LLM). This gives it the ability to evolve its understanding of goals over time, continuously improving without requiring manual labels.
🧱 Part 1: From Goal to Reward Model
Remember, the pipeline can be anything. For this post I have chosen one that I will use often: a process that tries to find scientific research related to a goal.
We’re building a dynamic research assistant that knows how to find, filter, and learn from the best information on the web, and gets better at doing that over time.
At the core of this system is a custom pipeline focused on AI research discovery, using Arxiv.org as its source. Arxiv is the fastest-moving, highest-signal archive of research in the AI space. If you’re building frontier models or tracking innovation, it’s where the story begins.
We’ll show you how to:
- Search Arxiv with intent (goal-driven search)
- Load and profile new papers
- Score their quality and relevance
- Integrate their knowledge into a working AI
- Validate and retrain that AI using its own results
- Continuously improve the search and filtering loop
This isn’t just “better search.” It’s a self-improving intelligence pipeline. It adapts its filters, refines what it considers “valuable” research, and uses that refinement to train internal reward models that replace the role of the LLM for faster, cheaper, more scalable evaluations.
🪞 Design Rationale
When you ask a general LLM to “find the best papers on self-improving AI,” you’ll get a generic list. Maybe 20% of that list is useful. But if you give the system a goal, say “Build an AI that teaches itself to solve complex tasks”, and let it learn which results actually help, it becomes something more than a query engine. It becomes autonomous, goal-oriented, and self-tuning.
This post is part of a larger vision: a system that constantly scans research, videos, codebases, and discussions from around the world, learns what’s useful to its goals, and trains itself forward without us in the loop.
🎯 Step 1: Define a Goal
I want to build an AI that can teach itself to solve complex problems better over time.
🧭 The Real Problem Isn’t Search, It’s Signal
We started with this goal. Searching Arxiv with it returned over 300,000 results.
But when we reviewed the top 100, fewer than 25 were actually useful. The rest were vague, outdated, speculative, or just off-topic.
This is the real challenge:
🤖 AI doesn’t struggle to find information; it struggles to filter it.
Building a self-improving system isn’t just about learning from data. It’s about learning to recognize good data, and ignore the rest. That’s what this post is about.
🔧 Building the Self-Improving Research Pipeline
Let’s get hands-on. Below is a real, running pipeline designed to search Arxiv, load and analyze papers, score their relevance, and then retrain itself based on what it learns.
This YAML configuration defines the pipeline structure:
```yaml
goal:
  goal_text: I want to build an AI that can teach itself to solve complex problems better over time.
  goal_type: "tactical"
  goal_category: "meta_learning"
  focus_area: "self_improvement"

pipeline:
  name: rivals
  tag: "search_arxiv"
  description: "Search Arxiv for papers related to a goal"
  stages:
    - name: arxiv_search
      description: "Search Arxiv for papers related to the goal"
      cls: co_ai.agents.knowledge.arxiv_search.ArxivSearchAgent
      enabled: true
      iterations: 1
    - name: document_loader
      description: "Load documents from the search results and summarize them"
      cls: co_ai.agents.knowledge.document_loader.DocumentLoaderAgent
      enabled: true
      iterations: 1
    - name: document_profiler
      description: "Profile the loaded documents to extract key sections"
      cls: co_ai.agents.knowledge.document_profiler.DocumentProfilerAgent
      enabled: true
      iterations: 1
    - name: paper_score
      description: "Score the papers based on their relevance and quality"
      cls: co_ai.agents.knowledge.paper_score.PaperScoreAgent
      enabled: true
      iterations: 1
    - name: knowledge_loader
      description: "Load knowledge from the scored papers into the system"
      cls: co_ai.agents.knowledge.knowledge_loader.KnowledgeLoaderAgent
      enabled: true
      iterations: 1
    - name: document_trainer
      description: "Build document pairs for training and evaluation, train and generate models"
      cls: co_ai.agents.knowledge.document_trainer.DocumentTrainerAgent
      enabled: true
      iterations: 1
    - name: document_reward_scorer
      description: "Score the documents based on their relevance and quality"
      cls: co_ai.agents.knowledge.document_reward_scorer.DocumentRewardScorerAgent
      enabled: true
      iterations: 1
```
Each of these agents is a working Python class that performs a specific task in the pipeline. Let’s walk through what each one does in sequence, and how they contribute to a system that improves itself every time it runs.
```mermaid
flowchart LR
    A[🎯 ArxivSearchAgent<br/>Find goal-related seed papers]:::highlighted
    B[📥 DocumentLoaderAgent<br/>Download & extract text]
    C[🧬 DocumentProfilerAgent<br/>Enrich, embed, and segment]
    D[📈 PaperScoreAgent<br/>Rate for novelty, relevance, etc.]
    E[📚 KnowledgeLoaderAgent<br/>Load knowledge from the Brain]
    F[🎓 DocumentTrainerAgent<br/>Learning Better Papers]
    G[🏅 DocumentRewardScorerAgent<br/>Score Documents]
    A --> B --> C --> D --> E --> F --> G
    classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
```
🔎 ArxivSearchAgent
Semantic Search for Self-Improvement
The ArxivSearchAgent is the first step in our AI pipeline that learns to improve itself. It transforms high-level research goals into actionable queries, interfaces with the arXiv API, and returns high-quality papers relevant to the specified objective.
✨ Purpose
This agent allows our system to autonomously discover and retrieve relevant literature from arXiv.org, which becomes the knowledge base for further evaluation, ranking, and learning.
🧠 How It Works
1. Goal Understanding: The agent reads a high-level `goal_text` such as: “I want to build an AI that can teach itself to solve complex problems better over time.”
2. Keyword Extraction: A lightweight prompt-based method extracts semantic keywords from the goal (e.g., `reinforcement learning`, `recursive self-improvement`, `curriculum generation`), using the prompt below.
```
You are an expert AI research assistant.
Your task is to analyze a research or development goal and return a list of concise, technical keywords or phrases that would be useful in an academic search engine like arXiv, Semantic Scholar, or Google Scholar.
These keywords should be specific enough to narrow results to relevant technical papers, and may include terms related to:
- methodology (e.g., "meta learning", "reward modeling")
- concepts (e.g., "recursive self-improvement", "strategic reasoning")
- tasks (e.g., "curriculum generation", "continual learning")
- disciplines (e.g., "reinforcement learning", "AI alignment")
---
Goal:
{{ goal.goal_text }}
---
{% if preferences %}
And these preferences:
{% for p in preferences %}
- {{ p }}
{% endfor %}
{% endif %}
{% if instructions %}
Additional instructions:
{% for i in instructions %}
- {{ i }}
{% endfor %}
{% endif %}
Please respond with a list of 5–12 keywords or key phrases in plain text, one per line. Do not include explanations, just the keywords.
```
This will generate a set of keywords like this:
```
"reinforcement learning",
"online learning",
"experience replay",
"feedback loops",
"meta-learning",
"self-improvement",
"recursive self-improvement",
"strategic reasoning",
"reward modeling",
"curriculum generation",
"adaptive learning strategies"
```
We configure the search through properties in the config:
```yaml
year_start: 2024
year_end: 2025
category: cs.AI
max_results: 50
top_n: 10
```
3. Query Construction: These keywords are transformed into a valid arXiv query with filters for:
   - Category (e.g., `cs.AI`)
   - Date range (e.g., papers from 2021–2025)
4. API Search: Uses the `arxiv` Python package to fetch matching papers, sorted by relevance.
5. Metadata Enrichment: For each result, the agent records:
   - PDF URL
   - Title, abstract, authors
   - Goal ID and strategy context
   - arXiv ID and category
6. Output: Results are stored in `context["raw_arxiv_results"]` and passed to downstream agents like `DocumentLoaderAgent` and `PaperScorerAgent`.
This will generate a query like this:
```
("reward modeling" OR "meta learning" OR "continual learning" OR "recursive
self-improvement" OR "feedback-driven optimization" OR "performance-based adaptation"
OR "dynamic criteria adjustment" OR "AI alignment" OR "curriculum generation" OR
"reinforcement learning" OR "self-improving systems") AND submittedDate:[20240101
TO 20251231] AND cat:cs.AI
```
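The agent’s internals aren’t reproduced in full here, but a query like the one above can be executed with the open-source `arxiv` Python package. The snippet below is a minimal, self-contained sketch of that call; the result fields mirror the example output, while the exact agent code may differ.

```python
import arxiv

# A shortened version of the generated query string
query = (
    '("reward modeling" OR "meta learning" OR "recursive self-improvement") '
    'AND submittedDate:[20240101 TO 20251231] AND cat:cs.AI'
)

search = arxiv.Search(
    query=query,
    max_results=50,
    sort_by=arxiv.SortCriterion.Relevance,
)

results = []
for paper in arxiv.Client().results(search):
    results.append({
        "title": paper.title,
        "summary": paper.summary,
        "url": paper.pdf_url,
        "arxiv_id": paper.entry_id,
        "published": paper.published.isoformat(),
    })
```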
---
🧪 Example Output
```json
{
"title": "Towards Self-Improving AI: A Meta-Learning Approach",
"summary": "...",
"url": "https://arxiv.org/pdf/2501.12345v2.pdf",
"goal_id": "goal-1234",
"parent_goal": "I want to build an AI...",
"strategy": "stepwise_decomposition",
"focus_area": "self_improvement",
"published": "2024-11-01T00:00:00Z"
}
```
✅ Built for research
ArxivSearchAgent forms the foundation of a system that can:
- Autonomously retrieve state-of-the-art knowledge
- Benchmark itself against expert-written papers
- Learn and update its internal value models over time
It’s not just search; it’s self-supervised knowledge acquisition tailored to goal-driven reasoning.
📄 Loading the Document Results: DocumentLoaderAgent
We covered this class in a previous post, Document Intelligence: Turning Documents into Structured Knowledge.
```mermaid
flowchart LR
    A[🎯 ArxivSearchAgent<br/>Find goal-related seed papers]
    B[📥 DocumentLoaderAgent<br/>Download & extract text]:::highlighted
    C[🧬 DocumentProfilerAgent<br/>Enrich, embed, and segment]
    D[📈 PaperScoreAgent<br/>Rate for novelty, relevance, etc.]
    E[📚 KnowledgeLoaderAgent<br/>Load knowledge from the Brain]
    F[🎓 DocumentTrainerAgent<br/>Learning Better Papers]
    G[🏅 DocumentRewardScorerAgent<br/>Score Documents]
    A --> B --> C --> D --> E --> F --> G
    classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
```
Once we’ve searched Arxiv and retrieved results, we use the DocumentLoaderAgent
to download and process those papers. Here’s what happens:
🔁 Step-by-Step Flow
1. Check for Existing Documents: If we’ve already downloaded this paper before, skip downloading but optionally reclassify its domain (if `force_domain_update = True`).
2. Download and Extract Text from PDF:
   - Download the PDF from the Arxiv URL.
   - Extract text using `PDFConverter`.
   - Clean up the file afterward.
3. Summarize or Use Arxiv Metadata:
   - Use Arxiv metadata if available.
   - If not, summarize with an LLM.
   - Guess the title if needed (especially for messy PDFs).
4. Generate Embeddings: Create a vector embedding from the title + summary. This enables similarity search and clustering later on.
5. Store to the Knowledge Base: Save the document, with its metadata, into the system.
6. Classify by Domain: Use `DomainClassifier` to label the document with relevant research domains (e.g. “machine learning”, “optimization”, “robotics”).
🚀 What Makes This Agent Self-Improving?
This loader is more than just a parser:
- It filters noise early by rejecting bad PDFs or already-seen documents.
- It adds metadata and structure needed for downstream training.
- Its embeddings are reusable, helping compare and cluster papers over time.
- It trains the domain classifier continually, if you plug it into your learning loop.
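The `DomainClassifier` used throughout this pipeline is essentially a seed-embedding similarity classifier. The sketch below shows one plausible implementation, assuming the seed YAML maps domain names to example phrases; the real class in the project may differ in detail.

```python
import numpy as np
import yaml
from sklearn.metrics.pairwise import cosine_similarity


class DomainClassifier:
    """Hypothetical sketch: classify text against seed phrases for each domain."""

    def __init__(self, memory, logger, seed_config_path):
        self.memory = memory
        self.logger = logger
        with open(seed_config_path) as f:
            seeds = yaml.safe_load(f)  # e.g. {"machine learning": ["gradient descent", ...]}
        # Average the seed-phrase embeddings to get one vector per domain
        self.domain_vectors = {
            domain: np.mean([self.memory.embedding.get_or_create(p) for p in phrases], axis=0)
            for domain, phrases in seeds.items()
        }

    def classify(self, text, top_k=3, min_score=0.0):
        # Embed the incoming text and rank domains by cosine similarity
        vec = self.memory.embedding.get_or_create(text)
        scored = [
            (domain, float(cosine_similarity([vec], [dvec])[0][0]))
            for domain, dvec in self.domain_vectors.items()
        ]
        scored = [(d, s) for d, s in scored if s >= min_score]
        return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]
```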
🧬 Key Code Snippet
Here’s the document ingestion process, simplified:
```python
import os
import re

import requests


class DocumentLoaderAgent(BaseAgent):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.domain_classifier = DomainClassifier(
            memory, logger, cfg.get("domain_seed_config_path", "config/domain/seeds.yaml")
        )
        self.download_directory = cfg.get("download_directory", "/tmp")
        self.summarize_documents = cfg.get("summarize_documents", False)

    async def run(self, context: dict) -> dict:
        search_results = context.get(self.input_key, [])
        stored_documents = []
        for result in search_results:
            url = result.get("url")
            title = result.get("title")

            existing = self.memory.document.get_by_url(url)
            if existing:
                stored_documents.append(existing.to_dict())
                continue

            # Download and extract PDF text
            pdf_path = self.download_pdf(url, title)
            if not PDFConverter.validate_pdf(pdf_path):
                continue
            text = PDFConverter.pdf_to_text(pdf_path)
            os.remove(pdf_path)

            # Optional: summarize via LLM or fetch ArXiv metadata
            summary = result.get("summary")
            if self.summarize_documents:
                summary = self.generate_summary(text, context)

            # Save document + embedding
            doc = self.memory.document.add_document({
                "title": title, "summary": summary, "text": text, "url": url,
                "goal_id": context.get("goal", {}).get("id")
            })
            self.memory.embedding.get_or_create(f"{title}\n\n{summary}")
            self.assign_domains_to_document(doc)
            stored_documents.append(doc.to_dict())

        context[self.output_key] = stored_documents
        return context

    def download_pdf(self, url, title):
        response = requests.get(url, stream=True)
        file_name = re.sub(r'[^\w\-]', "_", title)[:80]
        pdf_path = f"{self.download_directory}/{file_name}.pdf"
        with open(pdf_path, "wb") as f:
            for chunk in response.iter_content(8192):
                f.write(chunk)
        return pdf_path

    def generate_summary(self, text, context):
        prompt = self.prompt_loader.load_prompt(self.cfg, {"document_text": text, **context})
        return self.call_llm(prompt, context)

    def assign_domains_to_document(self, document):
        content = document.content
        for domain, score in self.domain_classifier.classify(content, top_k=3, min_score=0.6):
            self.memory.document_domains.insert({
                "document_id": document.id,
                "domain": domain,
                "score": score
            })
```
This gives us structured, searchable, classified, and embeddable research documents.
📚 Structuring Knowledge: The Role of the DocumentProfilerAgent
Once documents are retrieved and loaded into the system, raw text alone isn’t enough. To make use of this information, especially in the context of AI research papers, we need to understand the structure of each document and extract the parts that matter most.
This is where the Document Profiler comes in.
```mermaid
flowchart LR
    A[🎯 ArxivSearchAgent<br/>Find goal-related seed papers]
    B[📥 DocumentLoaderAgent<br/>Download & extract text]
    C[🧬 DocumentProfilerAgent<br/>Enrich, embed, and segment]:::highlighted
    D[📈 PaperScoreAgent<br/>Rate for novelty, relevance, etc.]
    E[📚 KnowledgeLoaderAgent<br/>Load knowledge from the Brain]
    F[🎓 DocumentTrainerAgent<br/>Learning Better Papers]
    G[🏅 DocumentRewardScorerAgent<br/>Score Documents]
    A --> B --> C --> D --> E --> F --> G
    classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
```
🤔 What It Does
The DocumentProfilerAgent
is responsible for breaking down raw documents into meaningful, structured sections such as:
- Title
- Abstract
- Methods
- Results
- Key Contributions
These sections are crucial because they isolate the most useful parts of a paper and allow downstream agents (like scoring, training, or validation engines) to focus only on the information that’s likely to impact decision-making.
The profiler uses a two-phase approach:
- Unstructured Heuristics: It tries to parse section headings and extract content using rules and patterns.
- LLM Fallback (if needed): If the heuristic extraction misses something or is too low quality, it invokes an LLM to assist in summarizing or identifying sections.
Each section is stored with:
- The section name
- Extracted text
- An optional LLM-generated summary
- Associated domains (e.g. “reinforcement learning”, “multi-modal systems”)
```python
DEFAULT_SECTIONS = ["title", "abstract", "methods", "results", "contributions"]


class DocumentProfilerAgent(BaseAgent):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.output_sections = cfg.get("output_sections", DEFAULT_SECTIONS)
        self.min_chars_per_sec = cfg.get("min_chars_per_section", 120)
        self.domain_classifier = DomainClassifier(
            memory, logger, cfg.get("domain_seed_config_path")
        )
        self.section_parser = DocumentSectionParser(cfg, logger)

    async def run(self, context: dict) -> dict:
        documents = context.get(self.input_key, [])
        profiled = []
        for doc in documents:
            doc_id = doc["id"]
            text = doc.get("content", doc.get("text", ""))
            title = doc.get("title")
            summary = doc.get("summary")

            # Step 1: Parse unstructured sections
            parsed = self.section_parser.parse(text)

            # Step 2: Optionally add title & abstract
            if title:
                parsed["title"] = title
            if summary:
                parsed["abstract"] = summary

            # Step 3: Store sections & domains
            for section, section_text in parsed.items():
                entry = self.memory.document_section.upsert({
                    "document_id": doc_id,
                    "section_name": section,
                    "section_text": section_text,
                    "source": "unstructured",
                })
                domains = self.domain_classifier.classify(section_text)
                for domain, score in domains:
                    self.memory.document_section_domains.insert({
                        "document_section_id": entry.id,
                        "domain": domain,
                        "score": float(score),
                    })

            profiled.append({
                "id": doc_id,
                "structured_data": parsed,
            })

        context[self.output_key] = profiled
        return context
```
🧩 Engineering Impact
This stage is a bridge between raw knowledge and usable insight. By structuring the data:
- We enable fine-grained comparison between documents
- We allow domain-aware filtering and scoring
- We support selective training of models on the most relevant parts (e.g., only learning from a method or result section)
🧬 Contribution to Self-Improvement
The profiler contributes to the self-improving loop in two critical ways:
- Data Quality: By ensuring the training data is cleanly structured, we avoid training our models on noisy or irrelevant content.
- Domain Awareness: By tagging sections with topic domains, we can align documents with goals, identify coverage gaps, and route them more intelligently in future learning cycles.
In essence, this agent turns a chaotic dump of text into a set of high-quality, semantically-tagged, modular building blocks. These become the core “knowledge atoms” our AI learns from.
✔️ Measuring Relevance and Utility: the PaperScoreAgent
Once documents have been structured and profiled, the next step is to assess how useful each one is in helping the system achieve its current goal. This is where the PaperScoreAgent
comes into play.
```mermaid
flowchart LR
    A[🎯 ArxivSearchAgent<br/>Find goal-related seed papers]
    B[📥 DocumentLoaderAgent<br/>Download & extract text]
    C[🧬 DocumentProfilerAgent<br/>Enrich, embed, and segment]
    D[📈 PaperScoreAgent<br/>Rate for novelty, relevance, etc.]:::highlighted
    E[📚 KnowledgeLoaderAgent<br/>Load knowledge from the Brain]
    F[🎓 DocumentTrainerAgent<br/>Learning Better Papers]
    G[🏅 DocumentRewardScorerAgent<br/>Score Documents]
    A --> B --> C --> D --> E --> F --> G
    classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
```
This agent evaluates each document across multiple dimensions (like relevance, novelty, clarity, etc.), using LLM-based or rule-based scoring mechanisms encapsulated in the PaperScoringMixin
. It avoids redundant work by skipping re-scoring for already evaluated papers unless forced via configuration.
Each document is scored and stored, and the results can later be used for:
- Selecting top-performing documents for training or inference,
- Understanding which kinds of research are consistently useful,
- Fine-tuning the search and filtering process.
This ensures that not only are we collecting AI papers, but we’re intelligently filtering them to extract high-value insights.
📦 Code Summary: PaperScoreAgent
- Score papers: Computes evaluation scores for each document.
- Avoid redundant work: Checks if a document has already been scored and skips it unless `force_rescore` is enabled.
- Pulls stored scores: Fetches past evaluations from a memory database via `EvaluationORM` and `ScoreORM`.
- Aggregates results: Averages scores by dimension when using cached results.

✅ Inputs:
- A list of `documents` (from context).

📤 Outputs:
- `context[self.output_key]` containing titles and score dictionaries.
```python
class PaperScoreAgent(BaseAgent, PaperScoringMixin):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.force_rescore = cfg.get("force_rescore", False)

    async def run(self, context: dict) -> dict:
        documents = context.get(self.input_key, [])
        results = []
        for document in documents:
            doc_id = document["id"]
            existing = self.get_scores_by_document_id(doc_id)
            if existing and not self.force_rescore:
                results.append({
                    "title": document.get("title"),
                    "scores": self.aggregate_scores_by_dimension(existing)
                })
                continue
            score_result = self.score_paper(document, context=context)
            results.append({
                "title": document.get("title"),
                "scores": score_result
            })
        context[self.output_key] = results
        return context

    def get_scores_by_document_id(self, doc_id: int) -> list:
        evaluations = self.memory.session.query(EvaluationORM).filter_by(document_id=doc_id).all()
        scores = []
        for ev in evaluations:
            scores.extend(
                self.memory.session.query(ScoreORM).filter_by(evaluation_id=ev.id).all()
            )
        return scores

    def aggregate_scores_by_dimension(self, scores: list) -> dict:
        totals = defaultdict(list)
        for score in scores:
            if score.score != 0:
                totals[score.dimension].append(score.score)
        return {dim: round(sum(vals) / len(vals), 4) for dim, vals in totals.items()}
```
🧮 How PaperScoringMixin Works
The PaperScoringMixin provides the scoring logic used by PaperScoreAgent. It defines a single method `score_paper()` which delegates the actual evaluation to a flexible PaperScoreEvaluator.
This evaluator is configured through a YAML file (e.g., `paper_review.yaml`) that defines what dimensions to score (e.g., relevance, originality, clarity) and how to prompt the LLM to perform that evaluation.
The mixin ensures that any agent using it:
- Loads the scoring rubric and prompt templates,
- Injects the document and context into the LLM,
- Receives a set of scored dimensions back.
This modular design allows you to swap in different evaluators or scoring rules just by updating the config file, with no code changes required. It’s an elegant abstraction that decouples how papers are scored from where they’re processed.
```python
class PaperScoringMixin:
    def score_paper(self, paper_doc: dict, context: dict = None) -> dict:
        context = context or {}
        context["paper_score"] = paper_doc
        if not hasattr(self, "call_llm"):
            raise AttributeError("Agent must implement `call_llm(prompt, context)`")
        evaluator = PaperScoreEvaluator.from_file(
            filepath=self.cfg.get("score_config", "config/scoring/paper_review.yaml"),
            prompt_loader=self.prompt_loader,
            cfg=self.cfg,
            logger=self.logger,
            memory=self.memory,
        )
        scores = evaluator.evaluate(document=paper_doc, context=context, llm_fn=self.call_llm)
        return scores
```
🧠 KnowledgeLoaderAgent
Filtering for Signal, Not Just Matches
```mermaid
flowchart LR
    A[🎯 SurveyAgent<br/>Find goal-related seed papers]
    B[🔍 SearchOrchestratorAgent<br/>Expand with related papers]
    C[📥 DocumentLoaderAgent<br/>Download & extract text]
    D[🧠 DocumentProfilerAgent<br/>Enrich, embed, and segment]
    E[📊 PaperScoreAgent<br/>Rate for novelty, relevance, etc.]
    F[📚 KnowledgeLoaderAgent<br/>Store as structured knowledge]:::highlighted
    A --> B --> C --> D --> E --> F
    classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
```
As document pipelines grow in depth and breadth, not all retrieved content deserves to be retained. The KnowledgeLoaderAgent is where the system gets selective, filtering only the most useful knowledge for long-term storage or further reasoning.
🎯 Role in the Pipeline
The KnowledgeLoaderAgent turns a pile of downloaded, profiled, and scored documents into a targeted collection of high-quality knowledge, tuned to the current research goal. It doesn’t just rely on text overlap or keywords; it uses embedding similarity, domain matching, and quality scoring to curate the best content.
This is the agent that separates “might be relevant” from “essential to know.”
🔍 What It Actually Does
1. Goal Domain Matching
   - Uses embedding vectors to classify the goal into a domain (like `"LLM Optimization"` or `"Knowledge Distillation"`).
   - Computes cosine similarity between the goal embedding and each domain’s seed embedding.
2. Document Domain Filtering
   - Keeps only those documents whose domain tags match the goal’s domain, and with a domain confidence score above a threshold.
   - Domain tags were precomputed by the `DocumentProfilerAgent`.
3. Optional Score Filtering
   - If `use_dimensional_scores: true`, it additionally filters based on quality scores like `relevance`, `usefulness`, `clarity`, `implementability`, `novelty`.
   - You can set a weighted score threshold or sort based on top-K performance across selected dimensions.
4. Flexible Return Format
   - Returns either `summary` or full text, depending on downstream needs (`include_full_text: true/false`).
🧪 Example Use Case
Say your goal is: “Improve the fine-tuning efficiency of transformer models.”
The KnowledgeLoader:
- Embeds the goal.
- Detects it belongs to the `"LLM Optimization"` domain.
- Selects documents tagged with that domain, and scores above 0.6 on domain confidence.
- If configured, also checks that selected docs score well on clarity and implementability.
Result: a curated set of documents, cleanly scoped to the goal and backed by semantic and quality filtering.
⚙️ Example Config
```yaml
knowledge_loader:
  name: knowledge_loader
  domain_seeds: ${path:config/domain/seeds.yaml}
  top_k: 3
  domain_threshold: 0.4
  include_full_text: false
  use_dimensional_scores: true
  dimension_weights:
    relevance: 1.0
    usefulness: 0.8
    clarity: 0.6
    implementability: 0.7
    novelty: 0.5
  min_weighted_score: 0.5
```
🧩 Trimmed Code Summary
```python
class KnowledgeLoaderAgent(BaseAgent):
    def __init__(...):
        self.domain_seeds = cfg.get("domain_seeds", {})
        self.top_k = cfg.get("top_k", 3)
        self.threshold = cfg.get("domain_threshold", 0.0)
        self.include_full_text = cfg.get("include_full_text", False)
        self.use_dimensional_scores = cfg.get("use_dimensional_scores", False)
        self.dimension_weights = cfg.get("dimension_weights", {...})
        self.min_weighted_score = cfg.get("min_weighted_score", 0.5)

    async def run(self, context):
        goal_text = context["goal"]["goal_text"]
        goal_vector = self.memory.embedding.get_or_create(goal_text)

        # 1. Match goal to a domain
        domain_vectors = {
            d: np.mean([self.memory.embedding.get_or_create(x) for x in ex], axis=0)
            for d, ex in self.domain_seeds.items()
        }
        goal_domain = max(
            domain_vectors,
            key=lambda d: cosine_similarity([goal_vector], [domain_vectors[d]])[0][0]
        )

        # 2. Filter docs by domain + optional score
        filtered = []
        for doc in context["documents"]:
            domains = self.memory.document_domains.get_domains(doc["id"])
            if any(dom.domain == goal_domain and dom.score >= self.threshold for dom in domains):
                if self.use_dimensional_scores and self.compute_weighted_score(doc["id"]) < self.min_weighted_score:
                    continue
                filtered.append(doc)

        context[self.output_key] = filtered
        return context
```
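The `compute_weighted_score` helper referenced above isn’t included in the trimmed listing. A minimal sketch of what it might do, assuming the `dimension_weights` from the config and an assumed `get_scores_for_document` lookup that returns averaged 0–10 scores per dimension:

```python
def compute_weighted_score(self, doc_id: int) -> float:
    """Hypothetical sketch: fold per-dimension scores into a single weighted value."""
    # Assumed helper: returns e.g. {"relevance": 7.5, "clarity": 6.0, ...} on a 0-10 scale
    scores = self.get_scores_for_document(doc_id)
    weighted, total_weight = 0.0, 0.0
    for dim, weight in self.dimension_weights.items():
        if dim not in scores:
            continue
        weighted += weight * (scores[dim] / 10.0)  # normalize 0-10 to 0-1 before weighting
        total_weight += weight
    return weighted / total_weight if total_weight else 0.0
```

With the example config, a document scoring well on relevance and usefulness but poorly on novelty would still pass the `min_weighted_score: 0.5` gate, since those dimensions carry more weight.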
🦾 DocumentTrainerAgent: Learning to Prefer Better Papers
Once we’ve scored papers and collected preferences, we want the system to internalize what makes one paper better than another across different dimensions like relevance, clarity, usefulness, etc. That’s the job of the DocumentTrainerAgent
.
```mermaid
flowchart LR
    A[🎯 SurveyAgent<br/>Find goal-related seed papers]
    B[🔍 SearchOrchestratorAgent<br/>Expand with related papers]
    C[📥 DocumentLoaderAgent<br/>Download & extract text]
    D[🧠 DocumentProfilerAgent<br/>Enrich, embed, and segment]
    E[📊 PaperScoreAgent<br/>Rate for novelty, relevance, etc.]
    F[📚 KnowledgeLoaderAgent<br/>Store as structured knowledge]
    G[📚 DocumentTrainerAgent<br/>Learning Better Papers]:::highlighted
    A --> B --> C --> D --> E --> F --> G
    classDef highlighted fill:#ffebcc,stroke:#ffaa00,stroke-width:2px;
```
🧠 Purpose
The DocumentTrainerAgent
creates training data from real scoring preferences and trains multi-dimensional reward models (RMs) that can later be used to predict document quality.
This is how the system starts to teach itself what “good” means, based on your goals and scoring feedback.
⚙️ What It Does
- 👥 Builds Contrastive Training Pairs: Uses `DocumentPreferencePairBuilder` to construct contrastive pairs like: “For goal X, document A is more relevant than document B.” These are pulled from prior scoring runs stored in the system’s memory.
- 📈 Trains Per-Dimension Value Models: For each dimension (e.g., `clarity`, `usefulness`, `implementability`), it uses `DocumentMRQTrainer` to train a dimension-specific reward model using contrastive loss. Each model learns to distinguish better vs. worse outputs on that metric.
- 🧠 Optional Tuning Layer: After training, a lightweight regression tuner is fitted to calibrate model outputs against original scores. This helps smooth the reward predictions.
- 💾 Saves Models and Tuners: Models are saved to disk (e.g., `document_rm_clarity.pt`) and linked tuners are serialized to JSON. These can later be loaded for inference in downstream agents.
```mermaid
flowchart LR
    A[Get Contrast Pairs] --> B[Group by Dimension]
    B --> C[Train MRQ Model per Dimension]
    C --> D[Save Models + Tuners]
    D --> E[Return to Scoring Pipeline]
```
🧩 Code Summary
This agent:
- Pulls training pairs from memory (organized by dimension),
- Prepares the data for training,
- Trains a regression model per dimension using MRQ (Multidimensional Reward Quantification),
- Saves the models and any tuning metadata for later use in scoring.
This closes the loop between observation and adaptation — it’s how our system evolves its judgment from LLM supervision to fast local predictors.
Here’s the core implementation:
```python
class DocumentTrainerAgent(BaseAgent):
    async def run(self, context):
        goal_text = context["goal"]["goal_text"]

        # Step 1: Build contrastive pairs
        builder = DocumentPreferencePairBuilder(self.memory.session, self.logger)
        pairs = builder.get_training_pairs_by_dimension(goal=goal_text)

        # Step 2: Flatten all pairs into training examples
        all_pairs = []
        for dim, dim_pairs in pairs.items():
            for p in dim_pairs:
                all_pairs.append({
                    "title": p["title"],
                    "output_a": p["output_a"],
                    "output_b": p["output_b"],
                    "value_a": p["value_a"],
                    "value_b": p["value_b"],
                    "dimension": dim,
                })

        # Step 3: Train reward models
        trainer = DocumentMRQTrainer(
            memory=self.memory,
            logger=self.logger,
            encoder=TextEncoder(),
            value_predictor=DocumentValuePredictor(),
            device="cuda" if torch.cuda.is_available() else "cpu"
        )
        trained_models, tuners = trainer.train_multidimensional_model(all_pairs, cfg={
            "epochs": 10,
            "lr": 1e-4,
            "patience": 2
        })

        # Step 4: Save models and tuners
        for dim, model in trained_models.items():
            torch.save(model, f"models/document_rm_{dim}.pt")
            if dim in tuners:
                tuners[dim].save(f"models/document_rm_{dim}_tuner.json")

        return context
```
🧱 Building Training Pairs from Scored Papers
After documents are scored, we need to convert that feedback into structured training data. The DocumentPreferencePairBuilder
does exactly that: it transforms past evaluations into contrastive pairs that teach a model how to rank quality.
🔍 What It Does
The DocumentPreferencePairBuilder queries your database for papers that have been scored (across any dimension like `relevance`, `clarity`, etc.). For each document and dimension, it finds:
- The highest-scoring version (what we want the model to prefer), and
- The lowest-scoring version (what we want it to avoid).
These pairs are grouped by dimension and formatted as input to a contrastive learning model, which will later be trained to favor better outputs.
This forms the core learning signal for the reward model: “Given two documents, which is better and why?”
⚙️ Example Output Format
```json
{
  "relevance": [
    {
      "title": "Language Models as Agents",
      "output_a": "...",  # preferred version
      "output_b": "...",  # less preferred
      "value_a": 8.2,
      "value_b": 5.1
    },
    ...
  ],
  "clarity": [ ... ],
  "usefulness": [ ... ]
}
```
Each dimension produces a list of pairs. These are consumed by the DocumentTrainerAgent
in the next stage.
🧩 Code Summary
```python
class DocumentPreferencePairBuilder:
    def __init__(self, db, logger=None):
        self.db = db
        self.logger = logger

    def get_training_pairs_by_dimension(self, goal=None, limit=10000) -> dict:
        # SQL query to find top- and bottom-scoring versions of each doc per dimension
        query = text(""" ... """)  # trimmed for brevity
        try:
            rows = self.db.execute(query, {"limit": limit}).fetchall()
        except Exception as e:
            self.logger.log("DocumentPairBuilderError", {"error": str(e)})
            return {}

        grouped = defaultdict(dict)
        results = defaultdict(list)

        # Group rows into (top, bottom) per doc per dimension
        for row in rows:
            grouped[(row.dimension, row.doc_id)][row.rank_type] = row

        for (dim, _), pair in grouped.items():
            if "top" in pair and "bottom" in pair:
                results[dim].append({
                    "title": pair["top"].title,
                    "output_a": pair["top"].content,
                    "output_b": pair["bottom"].content,
                    "value_a": float(pair["top"].score),
                    "value_b": float(pair["bottom"].score),
                })
        return dict(results)
```
Here’s a concise explanation of the SQL query used to extract preference pairs for training a reward or ranking model:
🧮 SQL: Extracting Document Preference Pairs
The SQL query builds document pairs based on score differences across dimensions (e.g., novelty, clarity, relevance) to train models like MR.Q. Here’s how it works:
`scored_docs` CTE
- Joins `scores`, `evaluations`, and `documents` to retrieve:
  - The document content,
  - Its associated score per dimension, and
  - Row numbers (`rank_high`, `rank_low`) that identify the highest and lowest scored instances per dimension and document.
- Filters out null scores.

Top & Bottom Selection
- Extracts:
  - The top-scored version of each document per dimension (`rank_high = 1`),
  - The bottom-scored version (`rank_low = 1`).
- Ensures the document has valid, non-empty content.
```sql
WITH scored_docs AS (
    SELECT
        s.dimension,
        s.score,
        d.id AS doc_id,
        d.title,
        d.content,
        ROW_NUMBER() OVER (
            PARTITION BY s.dimension, d.id ORDER BY s.score DESC
        ) AS rank_high,
        ROW_NUMBER() OVER (
            PARTITION BY s.dimension, d.id ORDER BY s.score ASC
        ) AS rank_low
    FROM scores s
    JOIN evaluations e ON s.evaluation_id = e.id
    JOIN documents d ON e.document_id = d.id
    WHERE s.score IS NOT NULL
)
SELECT
    dimension,
    title,
    content,
    score,
    rank_type,
    doc_id
FROM (
    SELECT
        dimension,
        title,
        content,
        score,
        'top' AS rank_type,
        doc_id
    FROM scored_docs
    WHERE rank_high = 1
      AND content IS NOT NULL
      AND content <> ''

    UNION ALL

    SELECT
        dimension,
        title,
        content,
        score,
        'bottom' AS rank_type,
        doc_id
    FROM scored_docs
    WHERE rank_low = 1
) AS ranked_pairs
ORDER BY dimension, doc_id
LIMIT :limit
```
Result Format
- Returns a flattened list of pairs marked as `'top'` or `'bottom'` for each document and dimension.
- These are then grouped in code to form contrast pairs like:
```json
{
  "title": "Sample Title",
  "output_a": "high-quality text",
  "output_b": "lower-quality text",
  "value_a": 8.5,
  "value_b": 4.2
}
```
Usage
- This output feeds into the contrastive training of ranking models that learn to prefer higher-quality research content based on past scores.
🧠 Learning to Rank: The DocumentMRQTrainer
Once we’ve built contrastive preference pairs from human or LLM evaluations, we need a model that can learn to replicate those judgments. That’s where the DocumentMRQTrainer
comes in.
🎯 What It Does
The DocumentMRQTrainer trains a multi-dimensional reward model: a lightweight neural predictor that learns to estimate the relative quality of documents given a goal. It’s designed for continuous retraining using LLM feedback or human preferences as supervision.
It supports multiple quality dimensions (e.g. relevance, clarity, novelty, etc.) and returns:
- A trained model per dimension.
- A regression tuner that aligns model predictions with real LLM score scales.
Think of this as the “student” that learns from the scoring “teacher” and eventually replaces it for faster inference and ranking.
⚙️ How It Works
1. Embedding Comparison: For each pair, it uses the `TextEncoder` to compute a goal-aware representation for both documents. The less-preferred document’s embedding is subtracted from the preferred one, producing a contrast vector.
2. Binary Classification Training: These contrast vectors are passed into a small feedforward model (`DocumentValuePredictor`) trained with binary cross-entropy loss. A label of `1.0` signals “A is better than B.”
3. Dimension-Specific Models: Each quality dimension is trained independently, and the model state is saved for future use.
4. Score Alignment (Tuner): After training, a lightweight `RegressionTuner` is fit to map MRQ predictions to LLM-calibrated scores, using real examples. This tuner allows the MRQ model to produce human-like scores at runtime.
🧩 Key Functions
```python
class DocumentMRQTrainer:
    def __init__(self, memory, logger, encoder=None, value_predictor=None, device="cpu"):
        self.memory = memory
        self.logger = logger
        self.device = device
        self.encoder = encoder or TextEncoder()
        self.value_predictor = value_predictor or DocumentValuePredictor(512, 1024)
        self.regression_tuners = {}

    def prepare_training_data(self, samples):
        inputs, labels = [], []
        for item in samples:
            # Embed goal (context) and document candidates
            ctx_emb = self.memory.embedding.get_or_create(item["title"])
            emb_a = self.memory.embedding.get_or_create(item["output_a"])
            emb_b = self.memory.embedding.get_or_create(item["output_b"])
            with torch.no_grad():
                zsa_a = self.encoder(ctx_emb, emb_a)
                zsa_b = self.encoder(ctx_emb, emb_b)
            # Generate contrast vector: (preferred doc) - (less preferred doc)
            diff = zsa_a - zsa_b if item["value_a"] >= item["value_b"] else zsa_b - zsa_a
            inputs.append(diff)  # Train model to recognize this "preference signal"
            labels.append(torch.tensor([1.0]))
        return DataLoader(TensorDataset(torch.stack(inputs), torch.stack(labels)), batch_size=16)

    def train(self, dataloader, cfg):
        optimizer = torch.optim.Adam(self.value_predictor.parameters(), lr=cfg.get("lr", 1e-4))
        loss_fn = nn.BCEWithLogitsLoss()
        for epoch in range(cfg.get("epochs", 10)):
            for x, y in dataloader:
                preds = self.value_predictor(x)
                loss = loss_fn(preds, y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

    def train_multidimensional_model(self, contrast_pairs, cfg=None):
        models, tuners = {}, {}
        by_dim = defaultdict(list)
        for pair in contrast_pairs:
            by_dim[pair["dimension"]].append(pair)
        for dim, samples in by_dim.items():
            dataloader = self.prepare_training_data(samples)
            self.train(dataloader, cfg or {})
            models[dim] = self.value_predictor.state_dict()
            tuner = RegressionTuner(dimension=dim, logger=self.logger)
            for s in samples:
                for side in ["a", "b"]:
                    mrq_score = self.value_predictor(self.encoder(
                        self.memory.embedding.get_or_create(s["title"]),
                        self.memory.embedding.get_or_create(s[f"output_{side}"])
                    )).item()
                    tuner.train_single(mrq_score, s[f"value_{side}"])
            tuners[dim] = tuner
        return models, tuners
```
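The `DocumentValuePredictor` referenced above is a small PyTorch module. The sketch below shows one plausible shape for it, using the 512/1024 dimensions from the constructor default; the actual layers in the project may differ.

```python
import torch
import torch.nn as nn


class DocumentValuePredictor(nn.Module):
    """Tiny feedforward head that maps a contrast vector to a preference logit."""

    def __init__(self, input_dim: int = 512, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # single logit: "A preferred over B"
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)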
🔄 Self-Alignment to LLM
An optional feature of the trainer is self-alignment. It can compare new MRQ outputs to the nearest LLM-scored neighbors and adjust using `align_with_llm_score`. This keeps the model grounded to high-quality feedback while continuously improving.
```python
trainer.align_with_llm_score(dimension, goal, hypothesis, llm_score)
```
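The `RegressionTuner` itself is not listed in this post. Conceptually it learns a simple mapping from raw MRQ outputs onto the LLM’s score scale; a minimal sketch, assuming a plain least-squares linear fit, might look like this:

```python
import json


class RegressionTuner:
    """Hypothetical sketch: calibrate raw MRQ outputs to LLM-style scores via a linear fit."""

    def __init__(self, dimension: str, logger=None):
        self.dimension = dimension
        self.logger = logger
        self.pairs = []          # (mrq_score, llm_score) examples
        self.slope, self.bias = 1.0, 0.0

    def train_single(self, mrq_score: float, llm_score: float):
        # Accumulate examples and refit a simple linear calibration
        self.pairs.append((mrq_score, llm_score))
        xs = [p[0] for p in self.pairs]
        ys = [p[1] for p in self.pairs]
        n = len(xs)
        if n >= 2 and max(xs) != min(xs):
            mean_x, mean_y = sum(xs) / n, sum(ys) / n
            cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            var = sum((x - mean_x) ** 2 for x in xs)
            self.slope = cov / var
            self.bias = mean_y - self.slope * mean_x

    def transform(self, mrq_score: float) -> float:
        return self.slope * mrq_score + self.bias

    def save(self, path: str):
        with open(path, "w") as f:
            json.dump({"dimension": self.dimension,
                       "slope": self.slope, "bias": self.bias}, f)
```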
📏 Staying Aligned Over Time
This trainer enables the system to close the loop: it doesn’t just consume LLM scores, it learns from them and gradually becomes capable of performing its own quality judgments. Over time, this reduces reliance on large models and supports scalable, goal-specific document filtering.
🧠 DocumentRewardScorerAgent
Multi-Dimensional Document Evaluation
The DocumentRewardScorerAgent evaluates research documents by scoring them across multiple quality dimensions using trained reward models. It plays a crucial role in your self-improving AI system by assigning structured, learnable feedback to documents, enabling future ranking, filtering, and learning behaviors.
🔍 Purpose
After downloading, parsing, and profiling documents, this agent uses pre-trained reward models to assign scores along defined dimensions (e.g., `relevance`, `clarity`, `engagement`). These scores serve as the reward signal in downstream learning pipelines like MRQ, DPO, or preference tuning.
⚙️ Configuration
The agent loads models and encoders based on the configuration:
```yaml
dimensions: ["relevance", "clarity", "engagement"]
model_dir: models/document
model_prefix: document_rm_
```
These models are loaded through a `DocumentMRQScorer`, which wraps the inference logic for multiple dimensions.
🧬 Workflow
1. Input:
   - A list of parsed documents (`context["documents"]`) with `title` and `content`.
   - The active goal (`context["goal"]["goal_text"]`), used as context during scoring.
2. Scoring: For each document and each dimension, the agent calls the `DocumentMRQScorer.score()` method, which:
   - Embeds the goal and document.
   - Passes the combined representation through the trained predictor.
   - Returns a scalar score indicating quality.
3. Output: The agent adds a structured result under its `output_key`, containing each document’s `title`, `text`, and per-dimension `scores`:
```json
{
  "title": "Self-Improving Agents via Reinforcement",
  "text": "... full document text ...",
  "scores": {
    "relevance": 8.2,
    "clarity": 7.5,
    "engagement": 6.9
  }
}
```
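The `DocumentMRQScorer` wrapper isn’t listed in this post. A minimal sketch of how it might load the saved per-dimension models and produce a score is shown below; the file naming follows the config above, but the constructor and internals are assumptions.

```python
import torch


class DocumentMRQScorer:
    """Hypothetical sketch: wraps per-dimension reward models for inference."""

    def __init__(self, memory, encoder, model_dir="models/document",
                 model_prefix="document_rm_", dimensions=None):
        self.memory = memory
        self.encoder = encoder
        self.models = {}
        for dim in dimensions or []:
            predictor = DocumentValuePredictor()
            # Each dimension has its own saved state dict, e.g. document_rm_clarity.pt
            predictor.load_state_dict(torch.load(f"{model_dir}/{model_prefix}{dim}.pt"))
            predictor.eval()
            self.models[dim] = predictor

    def score(self, goal_text: str, document_text: str, dimension: str) -> float:
        goal_emb = self.memory.embedding.get_or_create(goal_text)
        doc_emb = self.memory.embedding.get_or_create(document_text)
        with torch.no_grad():
            state = self.encoder(goal_emb, doc_emb)
            return self.models[dimension](state).item()
```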
🔄 Integration
This agent typically runs after:
- `DocumentLoaderAgent` (which fetches and converts PDF text),
- `DocumentProfilerAgent` (which adds structure and metadata).
And before:
- `DocumentTrainerAgent` (which uses scored outputs to generate preference training pairs).
✅ Benefits
- Enables automated reward model inference for document-level supervision.
- Provides consistent multi-aspect feedback aligned to goal context.
- Fully compatible with preference-based learning loops for self-improvement.
⚙️ Part 2: From Building to Improving
The pipeline gives us structure: it ingests documents, scores their utility, and trains custom reward models. But structure isn’t enough. We need intelligence. What happens when our model starts drifting? When new goals arrive? When reality changes? That’s where self-tuning comes in. Our system doesn’t just run once; it watches itself, compares its judgments with a trusted LLM, and retrains whenever confidence drops.
In this section, we’ll show how the system:
- Monitors model performance over time
- Validates itself against LLM judges
- Identifies drift, stagnation, and failure modes
- Retrains and updates only when trust erodes
This is the core intelligence layer: the part of the system that lets it evolve, adapt, and stay sharp.
```mermaid
flowchart TD
    A[New Goal + Documents] --> B[MRQ Scoring Engine]
    B --> C[Scored Document Pairs]
    C --> D[SelfValidationEngine]
    D --> E[Validation Stats<br/>Agreement, Matches]
    E --> F[MetaConfidenceTracker]
    F --> G{Confidence Low?}
    G -- Yes --> H[TrainingController]
    H --> I{Cooldown OK?}
    I -- Yes --> J[Retrain MRQ Model]
    J --> B
    I -- No --> K[Wait / Skip Training]
    G -- No --> L[Keep Using Model]
    B --> M[CycleWatcher]
    M --> N{Stuck or Oscillating?}
    N -- Yes --> O[Flag Goal/Dimension for Intervention]
    N -- No --> P[Continue Monitoring]
    F --> Q[StateTracker]
    H --> Q
    B --> Q
    D --> Q

    style A fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
    style B fill:#fff3e0,stroke:#ff9800,stroke-width:2px
    style D fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    style F fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    style H fill:#fbe9e7,stroke:#ff5722,stroke-width:2px
    style M fill:#ede7f6,stroke:#673ab7,stroke-width:2px
    style Q fill:#f0f4c3,stroke:#cddc39,stroke-width:2px
```
🔁 The Self-Tuning Loop Explained
Once the MRQ models are trained, they’re not frozen. The system continually:
- Scores new documents using these models (`DocumentRewardScorerAgent`).
- Samples a subset of those scores and compares them with LLM judgments (`SelfValidationEngine`).
- Tracks confidence over time with `MetaConfidenceTracker`.
- Decides whether to retrain using the `TrainingController` (with cooldowns to avoid thrashing).
- Triggers retraining by regenerating contrast pairs and updating models (`DocumentTrainerAgent`).
This loop runs independently per goal and dimension. Each dimension learns at its own pace, much like a student with separate subjects.
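Stitched together, one pass of the loop for a single goal and dimension looks roughly like this. The component names and registry keys follow the sections below; the glue code (and the `score_pairs` helper on the scorer) is illustrative, not the project’s exact implementation.

```python
from co_ai.registry.registry import get


def self_tuning_cycle(goal: str, dimension: str, documents: list):
    """Illustrative glue code for one pass of the self-tuning loop."""
    scorer = get("document_reward_scorer")   # assumed registry key for the MRQ scorer
    validator = get("self_validation")
    tracker = get("confidence_tracker")
    controller = get("training_controller")
    state = get("state_tracker")

    # 1. Score new documents with the local MRQ models
    pairs = scorer.score_pairs(goal, documents, dimension)   # assumed helper
    state.update_event(goal, dimension, "scored")

    # 2. Audit a sample of those judgments against the LLM
    report = validator.validate_batch(goal, pairs, dimension=dimension)
    tracker.update(goal, dimension, report)
    state.update_event(goal, dimension, "validated")

    # 3. Retrain only if confidence has eroded and cooldowns allow it
    controller.maybe_train(goal, dimension, pairs)
```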
🧱 Centralized Intelligence: The Supervisor and the Shared Registry
As the system scales, multiple intelligent components (trackers, controllers, validators) need to coordinate efficiently. To enable this, we introduced a central registry, a lightweight but powerful mechanism for wiring up shared components.
📦 The Registry: Global, Safe, and Explicit
The registry (`co_ai/registry/registry.py`) is a global key-value store that acts as a service container. It ensures that core tools (like the confidence tracker or validation engine) are:
- Registered only once (avoiding accidental overwrite),
- Globally accessible from anywhere in the system,
- Easily testable and resettable when needed.
```python
# Example: Registering and retrieving a shared component
register("confidence_tracker", tracker)
...
tracker = get("confidence_tracker")
tracker.update(...)
```
This makes dependency management easy and avoids passing long chains of objects through method calls or agents.
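A registry like this can be as small as a module-level dictionary with a couple of guard rails. Here is a minimal sketch of what `co_ai/registry/registry.py` might contain; the real module may add logging or thread safety.

```python
# co_ai/registry/registry.py (simplified sketch)
_REGISTRY = {}


def register(name: str, component, overwrite: bool = False):
    """Register a shared component once; refuse accidental overwrites."""
    if name in _REGISTRY and not overwrite:
        raise ValueError(f"Component '{name}' is already registered")
    _REGISTRY[name] = component


def get(name: str):
    """Fetch a previously registered component from anywhere in the system."""
    if name not in _REGISTRY:
        raise KeyError(f"Component '{name}' has not been registered")
    return _REGISTRY[name]


def reset():
    """Clear the registry (useful in tests)."""
    _REGISTRY.clear()
```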
🧠 The Supervisor: Wiring the Brain Together
The Supervisor
class is the entry point to the entire AI system. It boots up all core subsystems, registers them in the global registry, and coordinates their execution. This includes:
- StateTracker: Keeps track of pipeline progress and state.
- MetaConfidenceTracker: Monitors model agreement with the LLM.
- CycleWatcher: Detects when a model is stuck or flip-flopping.
- TrainingController: Triggers retraining based on confidence drops.
- SelfValidationEngine: Compares model predictions against LLM supervision.
Here’s how the components are wired:
```python
# Inside Supervisor.__init__()
state_tracker = StateTracker(...)
confidence_tracker = MetaConfidenceTracker(...)
cycle_watcher = CycleWatcher(...)
validator = SelfValidationEngine(...)
training_controller = TrainingController(
    cfg=cfg,
    memory=self.memory,
    logger=self.logger,
    validator=validator,
    tracker=confidence_tracker,
    trainer_fn=trainer_fn,  # user-defined training callback
)

register("state_tracker", state_tracker)
register("confidence_tracker", confidence_tracker)
register("cycle_watcher", cycle_watcher)
register("training_controller", training_controller)
register("self_validation", validator)
```
Now, anywhere in the pipeline, you can call:
```python
from co_ai.registry.registry import get

controller = get("training_controller")
controller.maybe_train(goal, dimension, pairs)
```
✅ What Makes This Powerful
The registry-supervisor pattern lets us:
- Decouple agents from their dependencies,
- Swap implementations easily during testing or research,
- Add runtime behavior (like retraining or validation) without hardcoding it into each agent,
- Maintain system-wide coherence, even as complexity grows.
This setup ensures the system isn’t just intelligent it’s composable, extensible, and introspective.
🧭 StateTracker: Keeping Tabs on the System’s Learning Journey
In a self-improving system, it’s not enough to just score documents and train models; you need to track the state of every goal and dimension over time. That’s where the StateTracker comes in.
It acts like the memory and metadata hub for each goal. For every evaluation (e.g., scoring, validation, retraining), it records:
- ✅ What happened
- ⏱️ When it happened
- 🔄 How many times it’s happened
This allows other components like the TrainingController
or CycleWatcher
to make safe, informed decisions about when to retrain, freeze learning, or flag problems.
🔑 What It Tracks
For every (goal, dimension)
pair, the StateTracker
keeps track of:
| Event | Description |
|---|---|
| `scored` | When documents were last scored by the model |
| `validated` | When LLM validation last occurred |
| `trained` | When the reward model was last retrained |
| `retrain_count` | How many times the model has been retrained |
| `frozen` / `active` | Whether learning is currently enabled or paused |
| `metadata` | Arbitrary tags or notes per goal (e.g., source) |
🧠 Continuous Improvement, Built-In
- Prevents redundant retraining by checking timestamps and cooldowns
- Enables learning analytics (e.g., which goals are evolving, which are stagnant)
- Acts as a registry of goals currently being monitored and improved
- Facilitates lifecycle management from new goal to mature model
🧬 Example Behavior
When a document batch is scored:
```python
state_tracker.update_event("improve_medical_accuracy", "relevance", "scored")
```
Later, when training completes:
```python
state_tracker.update_event("improve_medical_accuracy", "relevance", "trained")
```
And at any point, you can retrieve full goal state:
```python
state = state_tracker.get_state("improve_medical_accuracy", "relevance")
```
Which might return:
```json
{
  "last_scored_at": 1719727812.3,
  "last_trained_at": 1719738912.7,
  "retrain_count": 3,
  "status": "active"
}
```
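Internally, a tracker like this only needs a timestamped dictionary keyed by (goal, dimension). Here is a minimal sketch matching the field names in the example above; the real implementation likely persists this to memory rather than holding it in-process.

```python
import time
from collections import defaultdict


class StateTracker:
    """Hypothetical sketch: per-(goal, dimension) event log for the learning loop."""

    def __init__(self):
        self.states = defaultdict(lambda: {"retrain_count": 0, "status": "active"})

    def update_event(self, goal: str, dimension: str, event: str):
        state = self.states[(goal, dimension)]
        state[f"last_{event}_at"] = time.time()
        if event == "trained":
            state["retrain_count"] += 1

    def get_state(self, goal: str, dimension: str) -> dict:
        return dict(self.states[(goal, dimension)])
```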
🧩 How It Fits In
Other modules like `TrainingController`, `MetaConfidenceTracker`, and `CycleWatcher` depend on the `StateTracker` to answer questions like:
- “Is this a new goal?”
- “When was the model last retrained?”
- “Should we skip training due to cooldown?”
- “Is this dimension currently active or frozen?”
This lightweight yet essential tool gives your AI system a memory of its own progress, making it more aware, more cautious, and more intelligent over time.
🔁 CycleWatcher: Detecting When the Model Is Stuck or Spinning Its Wheels
Not all model failures are obvious. Sometimes, a model keeps training but doesn’t improve. Or worse, it flip-flops on decisions with each retraining. That’s why we built the CycleWatcher.
This component acts like a thermometer for learning progress. It watches the model’s validation agreement over time and flags patterns like:
- 🔄 Oscillation: bouncing between different behaviors with no clear trend
- 🧱 Stagnation: stuck at low agreement, not learning from new data
- 📈 Healthy learning: stable or improving agreement
📊 How It Works
Each time the system runs LLM validation, the CycleWatcher
is notified:
```python
cycle_watcher.record_agreement(goal="ai_alignment", dimension="clarity", agreement=0.82)
```
It stores a short moving window of agreement scores per goal+dimension and checks:
| Pattern | Condition |
|---|---|
| `oscillating` | Recent scores swing up and down without settling |
| `stuck` | No meaningful improvement for a configured number of steps |
| `ok` | Scores are trending up or stable above a confidence threshold |
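A workable version of these checks only needs a rolling window of agreement scores per goal and dimension. The sketch below shows one way the three statuses might be computed; the window size and thresholds are illustrative assumptions, not the project’s actual values.

```python
from collections import defaultdict, deque


class CycleWatcher:
    """Hypothetical sketch: classify learning progress from recent agreement scores."""

    def __init__(self, window: int = 6, stuck_threshold: float = 0.6,
                 min_improvement: float = 0.02, swing: float = 0.1):
        self.window = window
        self.stuck_threshold = stuck_threshold
        self.min_improvement = min_improvement
        self.swing = swing
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record_agreement(self, goal: str, dimension: str, agreement: float):
        self.history[(goal, dimension)].append(agreement)

    def status(self, goal: str, dimension: str) -> str:
        scores = list(self.history[(goal, dimension)])
        if len(scores) < self.window:
            return "ok"  # not enough data to judge yet
        diffs = [b - a for a, b in zip(scores, scores[1:])]
        sign_flips = sum(1 for d1, d2 in zip(diffs, diffs[1:]) if d1 * d2 < 0)
        if sign_flips >= len(diffs) - 1 and max(scores) - min(scores) > self.swing:
            return "oscillating"
        if scores[-1] < self.stuck_threshold and scores[-1] - scores[0] < self.min_improvement:
            return "stuck"
        return "ok"
```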
🔍 Example Usage
```python
status = cycle_watcher.status("ai_alignment", "clarity")

if status == "oscillating":
    logger.warning("Clarity scoring for 'ai_alignment' is oscillating. Consider intervention.")
elif status == "stuck":
    logger.info("No learning detected; will refresh document pool.")
```
This gives the system a diagnostic reflex: a way to self-assess not just what it’s learning, but how well.
🧠 Why It Matters
- Avoids wasted retraining cycles
- Helps surface noisy or low-signal goals
- Gives you insight into the maturity of each goal
- Supports automatic interventions (e.g., adding new documents or freezing training)
🧩 System Role
`CycleWatcher` works closely with:
- ✅ `MetaConfidenceTracker`: Uses agreement scores to determine model confidence
- 🛑 `TrainingController`: May defer training if the cycle is unhealthy
- 🧭 `StateTracker`: Updates state when cycle issues are flagged
Together, these components make your system resilient, able to spot when it’s learning poorly and adjust course automatically.
📈 MetaConfidenceTracker: Monitoring Trust in Each Model Over Time
Your reward model might start strong, but over time, it could drift, degrade, or simply face harder examples. That’s where the MetaConfidenceTracker comes in; it’s the memory of model trustworthiness.
This component tracks how often each reward model agrees with the LLM, for each goal and each scoring dimension.
🎯 Self-Tuning in Action
Every time a batch of document pairs is validated using the `SelfValidationEngine`, we pass the agreement score to the `MetaConfidenceTracker`:
```python
tracker.update("ai_alignment", "clarity", validation_result)
```
It stores:
- ✅ Agreement %: the most recent validation score
- 📆 Timestamps: when last validated or updated
- 🔁 Trend history: optional, for plotting improvement or decline
Then, we can ask:
```python
if tracker.should_retrain("ai_alignment", "clarity"):
    print("Triggering retraining due to low confidence.")
```
🧠 Why It’s Smart
This tracker creates goal- and dimension-specific trust scores. That means:
- A model scoring “relevance” for “robot ethics” might be high-confidence
- But the same model scoring “engagement” for “computational biology” might be flagged for retraining
It’s all localized and contextual, just like how human expertise varies per topic.
⚙️ System Role
The `MetaConfidenceTracker` drives data-aware training loops by:
- Signaling the `TrainingController` to retrain a model if agreement drops below a threshold
- Coordinating with `CycleWatcher` to confirm issues aren’t transient
- Recording metadata in `StateTracker` to track retrain events
🔐 Safety and Control
You can configure:
- `agreement_threshold`: When to flag low confidence
- `min_validation_count`: Don’t retrain on one bad batch
- `retrain_cooldown`: Prevents retraining too frequently
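Putting those knobs together, a minimal version of the tracker and its `should_retrain` check might look like the sketch below. The defaults and internal bookkeeping are assumptions; only the method names and config options come from the post.

```python
import time
from collections import defaultdict


class MetaConfidenceTracker:
    """Hypothetical sketch: track LLM agreement and decide when retraining is justified."""

    def __init__(self, agreement_threshold: float = 0.8,
                 min_validation_count: int = 3, retrain_cooldown: float = 3600.0):
        self.agreement_threshold = agreement_threshold
        self.min_validation_count = min_validation_count
        self.retrain_cooldown = retrain_cooldown
        self.records = defaultdict(list)   # (goal, dim) -> agreement history
        self.last_retrain = {}             # (goal, dim) -> timestamp

    def update(self, goal: str, dimension: str, validation_result: dict):
        self.records[(goal, dimension)].append(validation_result["agreement"])

    def get_confidence(self, goal: str, dimension: str) -> float:
        history = self.records[(goal, dimension)]
        return sum(history) / len(history) if history else 1.0

    def should_retrain(self, goal: str, dimension: str) -> bool:
        key = (goal, dimension)
        if len(self.records[key]) < self.min_validation_count:
            return False   # don't react to a single bad batch
        if time.time() - self.last_retrain.get(key, 0.0) < self.retrain_cooldown:
            return False   # still inside the cooldown window
        return self.get_confidence(goal, dimension) < self.agreement_threshold

    def reset_cooldown(self, goal: str, dimension: str):
        self.last_retrain[(goal, dimension)] = time.time()
```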
📝 Recap
| Function | Description |
|---|---|
| `update(goal, dim, result)` | Stores latest agreement % for model on a goal + dimension |
| `should_retrain(...)` | Returns True if agreement is below confidence threshold |
| `get_confidence(...)` | Returns current trust score for a goal + dimension |
This tracker ensures your AI isn’t just improving; it knows when and where it’s improving.
🛠️ TrainingController: Retrain If Confidence Falls
The TrainingController is the decision maker behind every retraining event in your system. It doesn’t just fire off training runs blindly; it listens to signals from the validation system, checks cooldowns, and ensures that retraining happens only when justified.
🚦 What It Does
Whenever validation results come in, the controller evaluates:
- Is confidence low? (via `MetaConfidenceTracker`)
- Has enough time passed since last training? (cooldown logic)
- Is there fresh training data available?

If all conditions are met, it triggers a model retrain for a specific `goal` and `dimension`.
🧠 Decision Logic
```python
class TrainingController:
    def maybe_train(self, goal: str, dimension: str, pairs: list):
        if self.tracker.should_retrain(goal, dimension):
            self.trainer_fn(goal, dimension, pairs)
            self.tracker.reset_cooldown(goal, dimension)
```
You can plug in any `trainer_fn` you like; this makes it modular. For example:
```python
def trainer_fn(goal, dimension, pairs):
    trainer = DocumentMRQTrainer(...)
    trainer.train_single_dimension(goal, dimension, pairs)
```
🧰 What It Tracks
- ✅ Confidence score: From the validation engine
- 🕒 Last training time: Stored in
StateTracker
- 🔁 Cooldown window: Prevents thrashing the model with too-frequent updates
- 🔒 Manual freeze status: You can freeze dimensions to block retraining temporarily
⚙️ System Integration
| Dependency | Role |
|---|---|
| `MetaConfidenceTracker` | Provides signal on model reliability |
| `StateTracker` | Records when retraining happened |
| `SelfValidationEngine` | Validates current model performance |
| `trainer_fn` | Executes the retraining |
🧭 Learning What Works, Forgetting What Doesn’t
Without this controller, you risk:
- Overfitting by retraining on every dip in performance
- Underfitting by never retraining models even when they degrade
- Wasted compute from redundant updates
The TrainingController gives your AI system the discipline to wait, watch, and act only when necessary, just like a human researcher retraining their beliefs after seeing enough contradictory evidence.
✅ SelfValidationEngine: Are We Still Aligned?
No matter how good a model is, it can drift. That’s why every self-improving system needs a reality check.
The SelfValidationEngine
is that check.
It samples a fraction of your document comparisons (those judged by your local reward model) and asks a trusted LLM to weigh in. If the model and LLM agree, that’s a good sign. If they start to diverge, it’s time to worry and maybe retrain.
🎯 How the AI Improves Itself
- Samples Pairs: From a batch of document comparisons, it randomly selects a subset (e.g., 5%) for validation.
- Asks the Model: For each sampled pair, it calls your reward model to decide: “Which of these two documents better satisfies the goal?”
- Asks the LLM: It then asks a fallback LLM (like GPT-4 or Qwen3) the same question.
- Compares: If the model and LLM choose the same document, that’s a match. If not, it’s a miss.
- Logs & Saves: It tracks validation statistics (total checked, agreement rate, mismatches) and logs everything to memory for auditing or for triggering retraining.
🧬 Code Walkthrough
import random

class SelfValidationEngine:
    def __init__(self, reward_model, llm_judge):
        # config, memory store, and logger arguments are omitted in this walkthrough
        self.reward_model = reward_model
        self.llm_judge = llm_judge
        self.validation_sample_rate = 0.05  # 5% of pairs get audited

    def validate_batch(self, goal, pairs, dimension=None):
        # Sample a small slice of the batch for LLM auditing
        sample = [p for p in pairs if random.random() < self.validation_sample_rate]
        for pair in sample:
            # each pair is assumed to be a dict holding the two candidate documents
            model_pref = self.reward_model(goal, pair["doc_a"], pair["doc_b"])
            llm_pref = self.llm_judge(goal, pair["doc_a"], pair["doc_b"])
            match = model_pref == llm_pref
            ...
The result is a report like:
{
"validated": 20,
"matches": 17,
"agreement": 0.85
}
🔒 Reliability Through Self-Correction
Think of this as unit testing for your model’s behavior:
Feature | Purpose |
---|---|
✅ Validates predictions | Ensures the model still agrees with an external oracle |
📉 Detects drift | If agreement drops, the model might be losing reliability |
🔁 Triggers retraining | Feeds into the TrainingController to kick off updates |
🧠 Informs meta-learning | Helps MetaConfidenceTracker spot trends in model performance |
🌐 Example in Context
Let’s say your system is working on the goal: “Find the most innovative climate tech startups”.
Over time, it has trained a local reward model to judge articles and reports. But suddenly, validation shows that agreement with the LLM has dropped from 92% to 71%. The SelfValidationEngine catches this and logs it, which triggers a review:
- Is the model overfitting?
- Has the data changed?
- Should we retrain?
With this mechanism, your AI isn’t just learning; it’s learning to self-audit. Here’s a fuller walkthrough of the engine:
import random

class SelfValidationEngine:
    def __init__(self, cfg, memory, logger, reward_model, llm_judge):
        # Takes in a config, memory store, logger, reward model, and fallback LLM judge
        self.cfg = cfg
        self.memory = memory
        self.logger = logger
        self.reward_model = reward_model
        self.llm_judge = llm_judge
        self.validation_sample_rate = cfg.get("validation_sample_rate", 0.05)

    def validate_batch(self, goal, pairs, dimension=None):
        # Randomly samples a subset of document pairs for auditing
        sample = [pair for pair in pairs if random.random() < self.validation_sample_rate]
        logs, matches = [], 0
        for pair in sample:
            model_pref = self.reward_model(goal, pair["doc_a"], pair["doc_b"])
            llm_pref = self.llm_judge(goal, pair["doc_a"], pair["doc_b"])
            is_match = model_pref == llm_pref
            matches += int(is_match)
            logs.append({"goal": goal, "dimension": dimension, "model_pref": model_pref,
                         "llm_pref": llm_pref, "match": is_match})  # truncated docs could also be logged
        validated = len(sample)
        agreement = matches / validated if validated else 0.0
        self.memory.save("self_validation", {"validated": validated, "matches": matches, "agreement": agreement})
        return {"validated": validated, "matches": matches, "agreement": agreement, "logs": logs}
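Putting the pieces together, here is a hedged sketch of one validation-and-retraining cycle. The run_validation_cycle glue function is hypothetical; it simply chains the component APIs described above.

```python
def run_validation_cycle(goal, dimension, pairs, engine, tracker, controller):
    """Hypothetical glue code: audit a batch, update confidence, retrain if needed."""
    # 1. Audit a sample of the reward model's judgments against the LLM
    report = engine.validate_batch(goal, pairs, dimension=dimension)

    # 2. Feed the agreement rate into the confidence tracker
    tracker.update(goal, dimension, report)

    # 3. Let the controller decide whether a retrain is justified
    retrained = controller.maybe_train(goal, dimension, pairs)
    return report, retrained
```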
✨ Building a Self-Tuning AI That Learns from the Web
Imagine giving an AI a goal, say, “Should I invest in Tesla?” The goal itself is deceptively simple, but the process the AI must undertake to answer it is anything but.
The AI begins by interpreting the goal and translating it into a search strategy. It scans the internet (news sites, financial reports, forums, YouTube videos, and more) to find documents that could help it make a decision. This is not just search. This is targeted retrieval, powered by goal-aware filters and rival-ranking: each piece of data is judged by how well it competes in usefulness against others.
1. Filter and Rank Incoming Data
Each document pulled in goes through a comparative evaluation. The AI uses a language model (LLM) to judge preference between candidate pairs. For example:
“Which of these two reports better supports the decision to buy Tesla?”
These judgments generate training signals. The AI doesn’t take data at face value; it scores, compares, and filters.
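As a rough illustration of this pairwise judging step, the sketch below asks an LLM to pick the more useful of two documents and turns its answers into preference pairs. The prompt wording, the llm callable, and the judge_pair / build_preference_pairs helpers are assumptions, not the post’s actual implementation.

```python
def judge_pair(llm, goal: str, doc_a: str, doc_b: str) -> str:
    """Illustrative pairwise comparison: ask an LLM which document better serves the goal."""
    prompt = (
        f"Goal: {goal}\n\n"
        f"Document A:\n{doc_a[:2000]}\n\n"
        f"Document B:\n{doc_b[:2000]}\n\n"
        "Which document better supports this goal? Answer with exactly 'A' or 'B'."
    )
    answer = llm(prompt).strip().upper()  # `llm` is assumed to be a text-in, text-out callable
    return "A" if answer.startswith("A") else "B"

def build_preference_pairs(llm, goal, documents):
    """Turn candidate documents into (preferred, rejected) training pairs."""
    pairs = []
    for doc_a, doc_b in zip(documents[::2], documents[1::2]):
        winner = judge_pair(llm, goal, doc_a, doc_b)
        preferred, rejected = (doc_a, doc_b) if winner == "A" else (doc_b, doc_a)
        pairs.append({"goal": goal, "preferred": preferred, "rejected": rejected})
    return pairs
```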
2. Train Itself with Preference Modeling (MRQ)
Using the preference signals, the AI trains a Multidimensional Ranking + Quantification (MRQ) model. This model learns to emulate the LLM’s decisions, essentially distilling expensive, high-quality LLM judgments into a fast, local model.
Over time, the MRQ model becomes the AI’s primary decision engine: cheaper, faster, and fine-tuned to the goal.
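The internals of the MRQ trainer aren’t shown in this post, so the sketch below illustrates the general idea with a generic pairwise preference-distillation loop in PyTorch: a small scorer is trained so that preferred documents out-score rejected ones (a Bradley-Terry-style loss). The ScoreHead architecture, the embed function, and the hyperparameters are assumptions; it consumes pairs shaped like those built in the earlier sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoreHead(nn.Module):
    """Tiny scorer over precomputed document embeddings (architecture is illustrative)."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.net(emb).squeeze(-1)

def distill_preferences(model, embed, pairs, epochs: int = 3, lr: float = 1e-4):
    """Train the scorer so preferred docs out-score rejected ones (pairwise logistic loss)."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for pair in pairs:
            emb_pref = embed(pair["preferred"])   # assumed: returns a (dim,) float tensor
            emb_rej = embed(pair["rejected"])
            margin = model(emb_pref) - model(emb_rej)
            loss = F.softplus(-margin)            # equivalent to -log sigmoid(margin)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```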
3. Validate and Self-Correct
This isn’t a one-shot training. The system constantly runs self-validation: comparing the MRQ model’s predictions with fresh LLM judgments on new or difficult pairs. If the model drifts, confidence drops, or oscillates, it knows to retrain.
This is where self-awareness kicks in. The AI tracks its confidence over time, learning curves, and model reliability using tools like:
- MetaConfidenceTracker
- CycleWatcher
- SelfValidationEngine
4. Extend and Iterate
When the AI is “ready” on a goal (e.g. it can explain and justify its Tesla decision), you introduce new, related goals: “How does Tesla compare to NIO?” or “Which company leads in EV battery tech?”
Each new goal introduces fresh data and creates another self-contained learning loop. The system keeps refining not just using new data, but improving its ability to evaluate, rank, and learn from that data.
5. Continuous Self-Tuning from the Web
This is not just a static LLM pipeline. This is an active learner. It’s always scanning, always challenging itself with new data, always improving its own model of the world.
And because it uses rival-based ranking and model self-validation, it doesn’t need ground-truth labels. It constructs its own signal, a powerful step toward autonomous, scalable AI reasoning.
📈 Case Study: Self-Improvement in Action
Let’s say the AI is asked: “Which recent papers best explain alignment challenges in RLHF?”
- It pulls 120 candidate papers from ArXiv.
- It filters and ranks them using an LLM-based comparator.
- After 2 days, its MRQ model achieves 92% agreement with the LLM on sampled comparisons.
- But over time, its agreement score on the “clarity” dimension drops below 80%.
- The system retrains its clarity model using new pairs and fresh LLM judgments, and agreement recovers to 89%.
- The AI now scores faster, and its top 5 suggestions include two papers it originally ignored.
This demonstrates the power of a feedback-driven loop that adapts as knowledge evolves.
🔚 Conclusion: An Always-On Intelligence Loop
Over this deep dive, we engineered an AI that doesn’t just execute; it evolves.
Here’s what we conquered step-by-step, and why it changes everything:
🔧 Step 1: From Static to Dynamic Intelligence
We shattered the “train once, deploy forever” paradigm. Starting with goal-driven research (e.g., “Build AI that self-improves at complex tasks”), we:
- Searched Arxiv with semantic keyword extraction
- Scored papers across 5+ configurable dimensions (novelty, relevance, clarity, and more)
- Trained MRQ models to replace costly LLM judgments
Why it counts: Turns vague goals into self-updating knowledge engines.
🔄 Step 2: The Self-Tuning Loop
We gave our AI a conscience through:
- Validation Engine: Auditing model vs. GPT-4 judgments (5% samples)
- Confidence Tracking: Monitoring agreement decay with MetaConfidenceTracker
- Auto-Retraining: Triggering updates only when trust erodes
Why it counts: Systems that self-correct > systems that decay.
🧩 Step 3: The Intelligent Core
We wired together autonomous agents like:
- CycleWatcher to detect learning stagnation
- StateTracker to manage goal lifecycles
- TrainingController to enforce disciplined retraining
Why it counts: Modular intelligence > monolithic models.
🌟 Why This Isn’t Just Another Pipeline
We moved beyond automation to autonomous evolution:
Traditional AI | This System |
---|---|
Fixed knowledge | Web-fed learning |
Manual retraining | Self-triggered updates |
Black-box decisions | Auditable validation logs |
One-size-fits-all | Goal-specialized RMs |
🚀 Where Do We Go From Here?
You’ve now built an AI that:
- Learns from open-source knowledge (Arxiv → web)
- Validates its own reasoning
- Retrains itself when confidence drops
- Evolves with your goals
This is the foundation for truly adaptive AI. Imagine extending RIVAL to:
- Real-time market analysis (tracking crypto/news)
- Medical research synthesis (updating with new trials)
- Self-optimizing code generation
You won’t have to imagine it for long; you’ll see it right here.
“The future belongs not to the strongest AI, but to the most adaptable.”
📚 References
- Achiam, J. et al. (2023). GPT-4 Technical Report. arXiv:2303.08774.
  - Cited for foundational insights into LLM capabilities and evaluation frameworks used in self-improving systems.
- Ahmed, A. M. et al. (2024). Scalable Ensembling for Mitigating Reward Overoptimization. arXiv:2406.01013.
  - Addresses challenges in reward hacking and overfitting, relevant to the SelfValidationEngine and MetaConfidenceTracker components.
- Pan, A., Bhatia, K., & Steinhardt, J. (2022). The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. arXiv:2201.03544.
  - Discusses reward model alignment issues addressed by the adversarial training loop in RIVAL.
- Stiennon, N. et al. (2020). Learning to Summarize with Human Feedback. Advances in Neural Information Processing Systems, 33.
  - Influenced the design of preference pair generation and LLM-based validation in the DocumentTrainerAgent.
- Tan, S. & Monz, C. (2025). REMEDY: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling.
  - Directly informs the integration of qualitative (preference pairs) and quantitative (BLEU/COMET) rewards in the MRQ Scoring Engine.
- RIVAL Framework (2025). Reinforcement Learning with Iterative and Adversarial Optimization (Paper ID: 2506.05070v1).
  - Core methodology adapted for the self-improving AI system described in this post.
📘 Glossary
Term | Definition |
---|---|
LLM (Large Language Model) | A machine learning model trained on massive text datasets to understand and generate human-like language. Used here for initial document evaluations and fallback validation. |
MRQ (Model-Reward-Quality) | A scoring and training framework that learns to predict quality across multiple dimensions (e.g., relevance, clarity, engagement) from pairwise document preferences. |
Goal | A high-level task or objective the AI is optimizing for (e.g., “Evaluate Tesla investment”). Guides document retrieval, scoring, and training. |
Reward Model | A model trained to predict which document better satisfies a given goal. Initially mimics an LLM, then operates independently. |
SelfValidationEngine | A module that compares the reward model’s decisions against trusted LLM judgments to detect drift and measure correctness. |
MetaConfidenceTracker | Tracks model vs LLM agreement over time, enabling automatic confidence monitoring and retraining decisions. |
TrainingController | Oversees whether and when to retrain models based on validation scores, cooldowns, and thresholds. |
CycleWatcher | Detects learning issues like oscillation or stagnation in model confidence. Helps prevent wasted training cycles. |
Contrast Pair / Preference Pair | A pair of documents (A vs B) where one is preferred over the other. Used to train and validate reward models. |
Dimension | A scoring axis such as “relevance,” “clarity,” or “insight.” MRQ models are trained per-dimension. |
Supervisor | The central controller that wires together the self-improving pipeline: registering agents, managing state, and coordinating retraining logic. |
StateTracker | Maintains metadata about recent events (e.g., last scored, trained, validated) for each goal and dimension. |
RIVAL | An acronym describing the system architecture: Reinforcement learning with Iterative and adVersarial optimization. Represents the closed loop of improvement. |
LLM Judge | The fallback, trusted judgment from a large language model used to validate the predictions of local reward models. |
Cooldown | A time-based guard to prevent too-frequent retraining of reward models. Managed by the TrainingController. |