The Self-Aware Pipeline: Empowering AI to Choose Its Own Path to the Goal

🔧 Summary
Modern AI systems require more than just raw processing power: they need contextual awareness, strategic foresight, and adaptive learning capabilities. In this post, we walk through how we implemented a self-aware pipeline system inspired by the Devil's Advocate paper.
Unlike brittle, static workflows, this architecture empowers agents to reflect on their own steps, predict failure modes, and adapt their strategies in real time.
🧠 Grounding in Research
Devil’s Advocate (ReReST)
ReReST (Devil's Advocate: Anticipatory Reflection for LLM Agents) introduces a self-training framework for LLM agents. The core idea is to have a "reflector" agent anticipate failures and revise the original plan before execution, a powerful method for reducing hallucinations and improving sample quality. Our implementation draws heavily on these ideas to enable dynamic planning and feedback loops within the pipeline.
🔍 What Is a Self-Aware Pipeline?
A self-aware pipeline is a dynamic, reasoning-based AI system that:
- Reflects on its own reasoning flow
- Predicts and mitigates failure points
- Learns from past runs to adapt future strategies
Instead of:
Generate → Review → Judge
We do:
```mermaid
graph TD
    A[Goal Input] --> B(SymbolicOptimizer)
    B --> C(Lookahead Analysis)
    C --> D(Generate Hypotheses)
    D --> E(Judge + Review)
    E --> F{Best Hypothesis}
    F --> G(Execute Best)
    G --> H[Log Run + Score]
    H --> I[Compute Reflection Delta]
    I --> J[Update Strategy Memory]
    J --> B
```
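In code, the loop looks roughly like the sketch below. This is conceptual only: the agent names, context keys, and wiring are illustrative and do not mirror the actual co_ai Supervisor API.

```python
# Conceptual sketch of the self-aware loop; names and keys are illustrative.
async def run_self_aware_pipeline(context: dict, agents: dict) -> dict:
    # Anticipatory reflection: may propose a revised pipeline before execution
    context = await agents["lookahead"].run(context)
    pipeline = context.get("suggested_pipeline") or context.get("pipeline", [])

    # Execute the chosen strategy step by step
    for step in pipeline:
        context = await agents[step].run(context)

    # Post-hoc self-judgment and cross-run comparison
    context = await agents["pipeline_judge"].run(context)
    context = await agents["reflection_delta"].run(context)
    return context
```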
🛠️ Core Components We Built
1. 🔮 LookaheadAgent: Anticipatory Reflection
The LookaheadAgent implements the key idea from the Devil’s Advocate paper: anticipatory reflection.
Rather than blindly executing a pipeline, this agent asks:
“Is this the best strategy for this goal? What could go wrong?”
It uses a prompt-based analysis to reason through the strengths and weaknesses of the current pipeline and recommends a revised version if necessary.
```python
import re
from dataclasses import asdict

from co_ai.agents.base import BaseAgent
from co_ai.constants import GOAL, PIPELINE
from co_ai.models import Lookahead  # import path assumed; adjust to where Lookahead is defined


class LookaheadAgent(BaseAgent):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)

    async def run(self, context: dict):
        goal = self.memory.goals.get_or_create(context.get(GOAL))

        # Build context for the prompt template
        pipeline = context.get(PIPELINE, [])
        agent_registry = context.get("agent_registry", {})

        # Current agents (with descriptions) and all available agents from the registry
        pipeline_info = {
            step: agent_registry.get(step, {"description": "No description."})
            for step in pipeline
        }

        prompt_context = {
            "goal": goal.goal_text,
            "goal_type": goal.goal_type,
            "focus_area": goal.focus_area,
            "strategy": goal.strategy,
            "llm_suggested_strategy": goal.llm_suggested_strategy,
            PIPELINE: pipeline,
            "pipeline_info": pipeline_info,
            "all_agents": agent_registry,
            **context,
        }

        prompt_template = self.prompt_loader.load_prompt(self.cfg, prompt_context)

        # Call the LLM to generate anticipated issues and fallbacks
        response = self.call_llm(prompt_template, prompt_context).strip()

        # Store the reflection for traceability
        model_name = self.cfg.get("model").get("name")
        extracted = self.parse_response(response)
        context.update(extracted)
        pipeline = context.get(PIPELINE, [])

        reflection = Lookahead(
            goal=goal.goal_text,
            agent_name=self.name,
            model_name=model_name,
            input_pipeline=pipeline,
            suggested_pipeline=extracted.get("suggested_pipeline"),
            rationale=extracted.get("rationale"),
            reflection=response,
            metadata={"raw_output": response},
            run_id=context.get("run_id"),
        )
        reflection.store(self.memory, self.logger)

        # Log the result
        self.logger.log(
            "LookaheadGenerated",
            {
                "goal": goal.goal_text,
                "lookahead": response[:250],  # short preview
            },
        )

        # Store in context
        context[self.output_key] = asdict(reflection)
        return context

    def parse_response(self, text: str) -> dict:
        suggested = re.search(r"# Suggested Pipeline\s*(.*?)\n#", text, re.DOTALL)
        rationale = re.search(r"# Rationale\s*(.*)", text, re.DOTALL)
        pipeline = suggested.group(1).strip().splitlines() if suggested else []
        pipeline = [line.strip("- ").strip() for line in pipeline if line.strip()]
        return {
            "suggested_pipeline": pipeline if pipeline else None,
            "rationale": rationale.group(1).strip() if rationale else None,
        }

    def extract_sections(self, text: str) -> dict:
        # Simple section splitting
        risks_match = re.search(r"# Predicted Risks\s*(.*?)(?:#|$)", text, re.DOTALL)
        backups_match = re.search(r"# Backup Plans\s*(.*)", text, re.DOTALL)
        return {
            "rationale": risks_match.group(1).strip() if risks_match else None,
            "backup_plans": [
                line.strip("- ").strip()
                for line in (
                    backups_match.group(1).strip().split("\n") if backups_match else []
                )
                if line.strip()
            ],
        }
```
Prompt Template
This is the prompt the agent uses to accomplish its task:

```
# Goal
{{ goal }}
{% if goal_type %}- Type: {{ goal_type }}{% endif %}
{% if focus_area %}- Focus Area: {{ focus_area }}{% endif %}
{% if strategy %}- Current Strategy: {{ strategy }}{% endif %}
{% if llm_suggested_strategy %}- LLM Suggested Strategy: {{ llm_suggested_strategy }}{% endif %}

# Current Pipeline:
{% for step in pipeline %}
- {{ step }}: {{ pipeline_info[step]["description"] }}
{% endfor %}

# All Available Agents:
{% for name, data in all_agents.items() %}
- {{ name }}: {{ data["description"] }}
{% endfor %}

# Instructions
You are an anticipatory reasoning agent. Your task is to reflect on whether the current pipeline is optimal for the given goal.
1. Identify any potential weaknesses or unnecessary agents in the current pipeline.
2. Check if any available agents could better serve this goal type, focus area, or strategy.
3. Suggest a revised pipeline if appropriate.
4. Justify your changes with reasoning based on agent capabilities and goal alignment.

# Analysis
[Your analysis here]

# Suggested Pipeline
- [agent_1]
- [agent_2]
...

# Rationale
[Your explanation of why this revision improves alignment with the goal.]
```
Example Output

```
# Analysis
The pipeline lacks factual checking.
# Suggested Pipeline
- generation
- verifier
- judge
# Rationale
Verifier ensures outputs align with known facts, reducing hallucination risk.
```
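To make the parsing step concrete, here is a small standalone snippet that applies the same regexes as `parse_response` to the example output above:

```python
import re

# The same extraction logic as LookaheadAgent.parse_response, applied to the
# example output shown above.
example = """# Analysis
The pipeline lacks factual checking.
# Suggested Pipeline
- generation
- verifier
- judge
# Rationale
Verifier ensures outputs align with known facts, reducing hallucination risk."""

suggested = re.search(r"# Suggested Pipeline\s*(.*?)\n#", example, re.DOTALL)
rationale = re.search(r"# Rationale\s*(.*)", example, re.DOTALL)

pipeline = [
    line.strip("- ").strip()
    for line in (suggested.group(1).splitlines() if suggested else [])
    if line.strip()
]
print(pipeline)                                   # ['generation', 'verifier', 'judge']
print(rationale.group(1).strip() if rationale else None)
```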
```mermaid
graph TD
    A[Current Pipeline] --> B(LookaheadAgent)
    B --> C(Predict Risks)
    C --> D(Suggest Backup Plans)
    D --> E(Suggested Alternative Pipeline)
    E --> F[Use New Pipeline?]
    F -- Yes --> G[Execute Optimized Pipeline]
    F -- No --> H[Proceed With Original]
```
2. 🔁 ReflectionDeltaAgent: Tracking Pipeline Improvements
One of the most important features of a self-aware pipeline is its ability to learn from experience. The ReflectionDeltaAgent helps us track how changes to the pipeline affect performance across different runs on the same goal.
This is inspired by the self-training and feedback-loop concept from the Devil's Advocate paper, where the system doesn't just try once: it reflects, revises, and retries.
📜 What the Agent Does
- Loads the current goal
- Finds all pipeline runs that have been attempted for this goal
- Compares each pair of runs:
  - Checks if they're both scored
  - Measures their score delta
  - Logs what changed between the two runs
- Stores that difference in a `reflection_deltas` table for later learning and analysis
This allows us to build a training dataset of causal improvement chains, which is essential for symbolic learning or future reward models like MR.Q or ReflectorNet.
🧠 Source Code
```python
from co_ai.agents.base import BaseAgent
from co_ai.analysis.reflection_delta import compute_pipeline_delta
from co_ai.constants import GOAL
from co_ai.models.reflection_delta import ReflectionDelta


class ReflectionDeltaAgent(BaseAgent):
    async def run(self, context: dict) -> dict:
        goal = self.memory.goals.get_or_create(context.get(GOAL))
        if not goal:
            self.logger.log("ReflectionDeltaSkipped", {"reason": "no goal in context"})
            return context

        runs = self.memory.pipeline_runs.get_by_goal_id(goal.id)
        if len(runs) < 2:
            self.logger.log("ReflectionDeltaSkipped", {
                "goal": goal.goal_text,
                "reason": "only one or zero runs"
            })
            return context

        logged_deltas = 0
        for i, run_a in enumerate(runs):
            for run_b in runs[i + 1:]:
                scores_a = self.memory.scores.get_by_run_id(run_a.run_id)
                scores_b = self.memory.scores.get_by_run_id(run_b.run_id)
                if not scores_a or not scores_b:
                    continue  # skip unscored runs

                delta = compute_pipeline_delta(run_a, run_b, scores_a, scores_b)
                self.memory.reflection_deltas.insert(ReflectionDelta(**delta))

                self.logger.log("ReflectionDeltaLogged", {
                    "goal_id": goal.id,
                    "run_id_a": run_a.run_id,
                    "run_id_b": run_b.run_id,
                    "score_delta": delta.get("score_delta"),
                    "causal": delta.get("causal_improvement"),
                })
                logged_deltas += 1

        context["reflection_deltas_logged"] = logged_deltas
        return context
```
🧩 How It Works (Step-by-Step)
1. Get the Goal
   - Pulls the current goal from memory using `context[GOAL]`.
2. Find All Runs for That Goal
   - Fetches all pipeline runs that have been tried on this goal.
3. Pairwise Comparison
   - Iterates over every pair of runs (`run_a` and `run_b`)
   - Loads their corresponding scores
   - Skips any unscored pairs
4. Compute Differences
   - Calls `compute_pipeline_delta()` (sketched below) to:
     - Compare pipeline agents
     - Compare scores
     - Compare strategies/models
     - Extract changes in lookahead rationale
5. Log the Delta
   - Saves the result to the `reflection_deltas` table
   - Adds a logger entry showing whether there was a meaningful score improvement
6. Return Updated Context
   - Stores the number of deltas logged in the context for downstream tracking
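The real `compute_pipeline_delta()` lives in `co_ai.analysis.reflection_delta`; the sketch below is only meant to convey its shape. The run attributes (`pipeline`, `strategy`, `model_name`) and the score aggregation are assumptions for illustration, not the actual implementation.

```python
# Illustrative sketch of compute_pipeline_delta(); field names follow the
# reflection_deltas table described below, but attribute access on the run
# objects is assumed for the example.
def compute_pipeline_delta(run_a, run_b, scores_a, scores_b) -> dict:
    score_a = max(s.score for s in scores_a)
    score_b = max(s.score for s in scores_b)
    return {
        "goal_id": run_a.goal_id,
        "run_id_a": run_a.run_id,
        "run_id_b": run_b.run_id,
        "score_a": score_a,
        "score_b": score_b,
        "score_delta": score_b - score_a,
        "pipeline_diff": {
            "only_in_a": [s for s in run_a.pipeline if s not in run_b.pipeline],
            "only_in_b": [s for s in run_b.pipeline if s not in run_a.pipeline],
        },
        "strategy_diff": run_a.strategy != run_b.strategy,
        "model_diff": run_a.model_name != run_b.model_name,
        "causal_improvement": score_b > score_a,
    }
```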
🗃️ Why This Matters
The ReflectionDeltaAgent turns your Co AI system into a learning engine:
- It tracks what changed and what improved across runs
- It enables future agents to learn which strategies work better for which goals
- It generates ground truth for training reward models like MR.Q or ReflectorNet

Without this agent, we'd have no way of building an evolving pipeline system: it's what closes the loop from execution → feedback → optimization.
Reflection Delta Table Includes:
- `goal_id`, `run_id_a`, `run_id_b`
- `score_a`, `score_b`, `score_delta`
- `pipeline_diff`, `strategy_diff`, `model_diff`, `rationale_diff`
- Used for training MRQ and symbolic learners
```mermaid
graph TD
    A[Run A] --> C[Compare Pipelines]
    B[Run B] --> C
    C --> D[Pipeline Diff]
    C --> E[Score Delta]
    C --> F[Rationale Diff]
    D --> G[Log Delta to DB]
    E --> G
    F --> G
```
3. 🧾 PipelineJudgeAgent: Post-Execution Self-Judgment
The PipelineJudgeAgent provides a reflective judgment on the pipeline after it has run. This fulfills the post-hoc evaluation requirement from the Devil's Advocate paper, enabling the system to score its own performance not just on raw hypothesis quality, but on the pipeline structure that produced it.
📌 Why It Matters
- Helps quantify pipeline quality after execution
- Allows scoring without needing external reviewers
- Adds a layer of interpretability and introspection
This agent looks at the goal, the pipeline, the top hypothesis (if any), and prior reflections, then calls an LLM to assign a score and explanation for the pipeline as a whole.
🧠 Source Code
```python
import re
from dataclasses import asdict

from co_ai.agents.base import BaseAgent
from co_ai.constants import PIPELINE, RUN_ID
from co_ai.models import Score


class PipelineJudgeAgent(BaseAgent):
    async def run(self, context: dict) -> dict:
        goal = context["goal"]
        pipeline = context[PIPELINE]
        hypotheses = context.get("scored_hypotheses", []) or context.get("hypotheses", [])

        # Get top-scoring or first hypothesis if available
        top_hypo = hypotheses[0] if hypotheses else None

        reflection = context.get("lookahead", {}).get("reflection", "")
        prompt_context = {
            "goal": goal["goal_text"],
            "pipeline": pipeline,
            "hypothesis": top_hypo,
            "lookahead": reflection,
        }

        prompt = self.prompt_loader.load_prompt(self.cfg, prompt_context)
        judgement = self.call_llm(prompt, prompt_context).strip()

        # Parse the score from the LLM response; accepts "**Score: 0.8**"
        # as well as looser forms like "**Score: 1 (excellent)**"
        score_match = re.search(
            r"\*{0,2}score[:=]?\s*([0-9]*\.?[0-9]+)", judgement, re.IGNORECASE
        )
        score = float(score_match.group(1)) if score_match else None
        # Everything after the matched score is treated as the rationale
        rationale = (
            judgement if score_match is None else judgement[score_match.end():].strip()
        )

        # Store the score in memory
        score_obj = Score(
            goal=prompt_context["goal"],
            hypothesis=prompt_context["hypothesis"],
            agent_name="PipelineJudgeAgent",
            model_name=self.cfg.get("model", {}).get("name"),
            evaluator_name="PipelineJudgeAgent",
            score_type="pipeline_judgment",
            score=score,
            rationale=rationale,
            run_id=context.get(RUN_ID),
            metadata={"raw_response": judgement},
        )
        score_obj.store(self.memory, self.logger)

        context[self.output_key] = {
            "score": asdict(score_obj),
            "judgement": judgement,
        }
        return context
```
📋 What’s Happening Here
1. Context Preparation: It builds a `prompt_context` with:
   - The goal text
   - The executed pipeline
   - The top hypothesis (if available)
   - Any prior lookahead reflection
2. Prompt Execution: It loads a prewritten evaluation prompt and sends it to the LLM to produce:
   - A numerical score (e.g. "Score: 0.8")
   - A natural language rationale
3. Score Extraction & Storage:
   - It parses the score using a regex
   - Extracts the rationale following the score
   - Saves the result in the `scores` table for long-term comparison and symbolic analysis
Here is an actual generated result:

```
**Score: 1 (excellent)**

**Rationale:** The pipeline produced two highly relevant and well-structured hypotheses that directly address the goal of AI self-reprogramming. Hypothesis 1 leverages evolutionary algorithms, drawing a clear analogy to biological evolution, with a concrete experimental plan to measure performance gains over generations. Hypothesis 2 uses reinforcement learning to enable dynamic neural architecture adjustments, extending RL's adaptability to self-optimization. Both hypotheses are grounded in existing AI techniques, provide clear mechanisms, and include testable experimental designs. The hypotheses are distinct yet complementary, covering different pathways to self-reprogramming (evolutionary vs. reinforcement-based), ensuring robustness and depth in exploring the goal.
```
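As a quick sanity check, here is the score pattern used by the agent above applied to that generated result:

```python
import re

# Applying the judge's score parsing to the generated result shown above.
judgement = (
    "**Score: 1 (excellent)**\n"
    "**Rationale:** The pipeline produced two highly relevant and "
    "well-structured hypotheses that directly address the goal..."
)

score_match = re.search(r"\*{0,2}score[:=]?\s*([0-9]*\.?[0-9]+)", judgement, re.IGNORECASE)
score = float(score_match.group(1)) if score_match else None
rationale = judgement[score_match.end():].strip() if score_match else judgement

print(score)  # 1.0
```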
4. 🧭 MRQStrategyAgent: Choosing the Best Pipeline
The MRQStrategyAgent is designed to answer one high-level question:
“Given what we’ve learned from previous pipeline runs, which agent sequence (strategy) should we use for this new goal?”
This is a strategy recommender, trained on ReflectionDelta data, using a symbolic scoring model inspired by the MR.Q framework.
🧠 Full Source Code
```python
from co_ai.agents import BaseAgent
from co_ai.constants import GOAL
from omegaconf import OmegaConf

DEFAULT_PIPELINES = [
    ["generation", "judge"],
    ["generation", "verifier", "judge"],
    ["generation", "reviewer", "judge"],
    ["cot_generator", "reviewer", "judge"],
    ["retriever", "generation", "judge"],
    ["retriever", "cot_generator", "judge"],
    ["retriever", "generation", "verifier", "judge"],
]


class MRQStrategyAgent(BaseAgent):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        # Load candidate strategies from a YAML file if provided, else from config
        file_path = cfg.get("strategy_file")
        if file_path:
            strategy_cfg = OmegaConf.load(file_path)
            self.candidate_strategies = strategy_cfg.get("candidate_strategies", [])
        else:
            self.candidate_strategies = cfg.get("candidate_strategies", [])
        self.trained_ranker = None
        self.training_data = []
        self.train_from_reflection_deltas()

    async def run(self, context: dict) -> dict:
        goal = context.get("goal", {})
        goal_text = goal.get("goal_text", "")

        # Score every candidate pipeline; fall back to defaults if none configured
        scored = []
        for pipeline in (self.candidate_strategies or DEFAULT_PIPELINES):
            s = self.trained_ranker(pipeline)
            scored.append((pipeline, s))
        scored.sort(key=lambda x: x[1], reverse=True)
        best = scored[0][0]

        context["mrq_suggested_pipeline"] = best
        self.logger.log(
            "MRQPipelineSuggested",
            {"goal": goal_text, "suggested": best, "scored_candidates": scored},
        )
        return context

    def train_from_reflection_deltas(self):
        deltas = self.memory.reflection_deltas.get_all()
        examples = []
        for d in deltas:
            a = d.get("pipeline_a")
            b = d.get("pipeline_b")
            score_a = d.get("score_a")
            score_b = d.get("score_b")
            if not isinstance(a, list) or not isinstance(b, list):
                continue
            if score_a is None or score_b is None:
                continue
            if abs(score_a - score_b) < 0.05:
                continue  # ignore insignificant differences

            label = "b" if score_b > score_a else "a"
            examples.append({
                "goal_text": d.get("goal_text"),
                "pipeline_a": a,
                "pipeline_b": b,
                "score_a": score_a,
                "score_b": score_b,
                "label": label,
            })

        self.training_data = examples
        self.logger.log("MRQTrainingDataLoaded", {"count": len(examples)})

        # Train dummy ranker
        self.trained_ranker = self.symbolic_ranker()

    def symbolic_ranker(self):
        """
        Simple ranker that scores pipelines based on symbolic features.
        Prefers longer pipelines and known strong agents.
        """
        def score(pipeline):
            return (
                len(pipeline)
                + 1.5 * ("verifier" in pipeline)
                + 1.2 * ("reviewer" in pipeline)
                + 1.0 * ("retriever" in pipeline)
                + 0.8 * ("cot_generator" in pipeline)
            )
        return score
```
🧠 Understanding the MRQStrategyAgent
The MRQStrategyAgent is the decision-making brain of the self-aware pipeline system. It analyzes past experiences (specifically ReflectionDelta entries) to learn which pipeline strategies are more effective, then recommends the best one for a new goal. This is our first step toward training a reward model (inspired by MR.Q) to optimize pipeline planning.
🔍 What It Does
1. Loads Strategy Candidates: It pulls candidate pipelines from either:
   - A `strategy_file` YAML (preferred for extensibility; see the example after this list)
   - A hardcoded fallback list (`DEFAULT_PIPELINES`)
2. Trains from Reflection Deltas: It fetches previously logged comparisons between pipeline runs (from the `reflection_deltas` table). For each delta, it:
   - Checks the difference in scores between the two pipelines.
   - If the difference is meaningful (more than `0.05`), adds it to the training examples with `label = "b" if score_b > score_a else "a"`.
3. Scores Pipelines: Currently, it uses a symbolic scoring function, a placeholder for a future learned model: `len(pipeline) + 1.5 * ("verifier" in pipeline) + 1.2 * ("reviewer" in pipeline) + ...` This rewards pipelines that include known strong agents like `verifier`, `retriever`, or `cot_generator`.
4. Suggests the Best Pipeline: For a new goal, it:
   - Scores all candidate strategies
   - Picks the highest-scoring one
   - Logs the decision and injects it into the pipeline context
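If you use the `strategy_file` route, the YAML only needs a `candidate_strategies` key (the key matches what the agent reads; the file path below is an assumption):

```yaml
# Example strategy file (e.g. config/mrq_strategies.yaml -- path is illustrative).
candidate_strategies:
  - ["generation", "judge"]
  - ["cot_generator", "reviewer", "judge"]
  - ["retriever", "generation", "verifier", "judge"]
```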
⚙️ Why It Works This Way
- Symbolic Simplicity First: We're starting with a symbolic scorer (hand-crafted reward heuristics) to get baseline behavior and enable inspection; see the snippet after this list.
- Plug-in Learning Later: This design allows a learned reward model (e.g. BERT, an LLM, or logistic regression) to be swapped in later to replace `symbolic_ranker`.
- Offline Training Loop: The agent trains itself offline at init time using accumulated `reflection_deltas`, so it doesn't slow down the live pipeline.
- Goal-Aware Logging: It logs its suggestions and scoring process per goal, which allows transparent review and debugging of strategy choices.
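For intuition, here is the symbolic scorer applied standalone to a few candidates; it simply mirrors `symbolic_ranker` from the source above:

```python
# Standalone copy of the symbolic scoring heuristic for illustration.
def score(pipeline):
    return (
        len(pipeline)
        + 1.5 * ("verifier" in pipeline)
        + 1.2 * ("reviewer" in pipeline)
        + 1.0 * ("retriever" in pipeline)
        + 0.8 * ("cot_generator" in pipeline)
    )

candidates = [
    ["generation", "judge"],                           # 2.0
    ["generation", "verifier", "judge"],               # 4.5
    ["retriever", "generation", "verifier", "judge"],  # 6.5
]
best = max(candidates, key=score)
print(best)  # ['retriever', 'generation', 'verifier', 'judge']
```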
🧪 How It’s Different From Other MR.Q Implementations
Unlike earlier versions of MR.Q you may have seen (e.g., pairwise response ranking, prompt scoring, etc.), this version works at the strategy level:
Feature | Classic MR.Q | This Version |
---|---|---|
Level | Hypothesis/response scoring | Pipeline strategy scoring |
Input | Two responses | Two pipeline + score comparisons |
Output | Preferred hypothesis | Recommended pipeline |
Training Signal | Preference label | Score delta across pipeline runs |
This lets MRQ evolve from "which response is better?" to "which approach to the goal works best?", a huge step toward meta-reasoning and adaptive AI orchestration.
🧠 Why This Agent Is the Core
This agent closes the loop:
- It takes feedback from actual past pipeline performance (`ReflectionDelta` entries)
- It ranks strategies and proposes new ones
- It learns over time (symbolically now, MR.Q-learned in the future)
- It's modular, so ranking functions are easy to swap out

This is where symbolic planning meets self-reflective learning, and it's the foundation for building a pipeline that evolves on its own.
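To make the "plug-in learning later" point concrete, here is one hedged sketch of a drop-in learned ranker trained on the preference examples gathered by `train_from_reflection_deltas`. The feature set and the choice of logistic regression are assumptions for illustration, not part of co_ai today:

```python
# Sketch of a learned pairwise ranker with the same call signature as symbolic_ranker.
import numpy as np
from sklearn.linear_model import LogisticRegression

AGENTS = ["generation", "verifier", "reviewer", "retriever", "cot_generator", "judge"]

def featurize(pipeline):
    # One binary feature per known agent, plus pipeline length.
    return [float(agent in pipeline) for agent in AGENTS] + [float(len(pipeline))]

def train_pairwise_ranker(examples):
    # Each example prefers pipeline_a or pipeline_b; learn from feature differences
    # (a simple Bradley-Terry-style preference model).
    X, y = [], []
    for ex in examples:
        diff = np.array(featurize(ex["pipeline_b"])) - np.array(featurize(ex["pipeline_a"]))
        X.append(diff)
        y.append(1 if ex["label"] == "b" else 0)
    model = LogisticRegression().fit(np.array(X), np.array(y))
    # The learned weights yield a per-pipeline score, so it can replace symbolic_ranker.
    return lambda pipeline: float(np.dot(model.coef_[0], featurize(pipeline)))

# Usage sketch: ranker = train_pairwise_ranker(agent.training_data)
#               ranker(["retriever", "generation", "verifier", "judge"])
```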
```mermaid
graph TD
    A[Goal Input] --> B(SymbolicOptimizer)
    B --> C(Pipeline Suggestion)
    C --> D(Lookahead Analysis)
    D --> E(Hypothesis Generation)
    E --> F(Scoring)
    F --> G{Execute Best}
    G --> H[Memory Storage]
    H --> B
```
📚 Agent Registry Format
To enable dynamic reflection and reasoning about pipeline composition, we introduced an Agent Registry — a centralized YAML file that describes each agent's capabilities, risks, and preferred use cases. This registry is used by the `LookaheadAgent` and other reflective agents to reason symbolically about the pipeline structure.
Here’s an example entry:
```yaml
generation:
  name: generation
  description: "Generates creative or analytical text based on a goal or prompt."
  provides:
    - Generated text output
    - Metadata (token count, model used, timestamp)
  requires:
    - Prompt or goal
    - Optional context/history
  failure_modes:
    - Repetitive or low-diversity outputs
    - Hallucinations in complex domains
  preferred_for:
    - Creative writing
    - Scientific hypothesis generation
    - Content creation
  avoid_for:
    - Tasks requiring high factual accuracy without verification
```
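For completeness, here is a minimal sketch of how a registry file like this could be loaded and handed to the `LookaheadAgent` via the context. The file path, the loading site, and the literal `"pipeline"` key are assumptions for illustration:

```python
# Sketch: load the agent registry YAML and expose it to reflective agents via context.
# The path "config/agent_registry.yaml" is illustrative.
import yaml

with open("config/agent_registry.yaml") as f:
    agent_registry = yaml.safe_load(f)

context = {
    "goal": {"goal_text": "Generate hypotheses about AI self-reprogramming"},
    "pipeline": ["generation", "judge"],
    "agent_registry": agent_registry,
}
# LookaheadAgent reads agent_registry[step]["description"] for each pipeline step,
# and scans the full registry to consider alternative agents.
```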
Why We Introduced It
- 🧠 Symbolic Reasoning: The registry allows agents like `LookaheadAgent` to reason over the pipeline as structured knowledge.
- 🧩 Modular Design: Every agent declares what it does, what it needs, and where it struggles — making pipeline adaptation interpretable.
- 🔍 Strategy Matching: Helps reflective agents suggest agents that are more aligned with the goal’s domain or difficulty.
This format makes the pipeline system not just configurable, but legible and explainable — laying the foundation for symbolic learning and future strategy optimization.
```mermaid
graph LR
    A[Goal Type: Physics Question] --> B(MRQStrategyAgent)
    B --> C{Check Agent Registry}
    C -->|Generation| D[Generates text]
    C -->|Verifier| E[Checks correctness]
    C -->|Debate| F[Simulates multiple perspectives]
    B --> G[Choose Best Pipeline]
    G --> H[Plan First → Debate → Judge]
```
🧪 HelpSteer3 Integration
We’re integrating the HelpSteer3 dataset to:
- ✅ Use human-preference data as structured goal sources
- 🧪 Validate and compare pipeline outcomes against annotated human judgments
- 🎯 Fine-tune symbolic agents like MRQStrategyAgent using preference-informed feedback
This allows us to empirically tune pipeline strategies using real-world annotated data.
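As a rough illustration of how goals could be sourced from the dataset, consider the sketch below. The dataset id and column names are assumptions about the HelpSteer3 release, not verified against it:

```python
# Sketch: turn HelpSteer3 preference rows into goal records for the pipeline.
# Dataset id and column names are assumptions; adjust to the actual release.
from datasets import load_dataset

ds = load_dataset("nvidia/HelpSteer3", split="train")

goals = []
for row in ds.select(range(10)):
    goals.append({
        "goal_text": row["prompt"],        # assumed column
        "goal_type": "preference_eval",
        "focus_area": row.get("domain"),   # assumed column
    })
```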
🔄 What’s Coming Next
In follow-up posts, we’ll show how HelpSteer3 enables:
- 🧠 Pipeline outcome evaluation at scale — using human preferences as a supervision signal
- 🔁 Training symbolic and neural models to predict preference-aligned strategies
- 📈 Demonstrating end-to-end improvements, from raw input to refined hypothesis, across a wide range of task types
By using HelpSteer3 as a benchmark dataset, we can close the loop between reflective pipeline design and human-aligned performance.
This series will culminate in a demonstration of how self-aware pipeline systems can outperform static workflows, backed by quantitative evidence from preference-driven evaluation.
🧠 Annotated Example Pipeline
```mermaid
graph TB
    A[Goal Input] --> B(SymbolicOptimizer)
    B --> C(LookaheadAgent)
    C --> D(Pipeline Generator)
    D --> E(Hypothesis Generation)
    E --> F(Judge + Review)
    F --> G(Select Best Hypothesis)
    G --> H(Log Run + Score)
    H --> I(ReflectionDeltaAgent)
    I --> J[Update Strategy Memory]
    J --> B
```
✅ Conclusion
This architecture brings together symbolic learning, anticipatory planning, and structured memory to make AI pipelines more adaptive and self-improving. It operationalizes key insights from Devil’s Advocate and lays the groundwork for reflective strategy optimization with future training.
Stay tuned for the next post on symbolic learning, where we'll explore the SymbolicOptimizerAgent: an agent that directly leverages the ReflectionDelta training data to learn which pipelines work best for different goals, closing the loop on self-improving strategy selection and showing how agents can evolve their pipelines automatically from experience.
✅ Checklist: Devil’s Advocate (ReReST) Paper vs. Our Implementation
Component / Concept | In Paper | In Our Implementation | Status
---|---|---|---
Anticipatory Reflection Agent | Reflector model simulates failures and proposes revisions before execution | `LookaheadAgent` using prompt-based pipeline simulation and revision | ✅ Complete
Reflector Prompt / Structured Self-Evaluation | Specific prompts to elicit self-assessment and revision | Jinja-based prompt for `LookaheadAgent` with goal, pipeline, agent registry | ✅ Complete
Dynamic Pipeline Adjustment | Pipeline is modified based on reflection | Reflected revisions stored and applied in `Supervisor.maybe_adjust_pipeline()` | ✅ Complete
Causal Evaluation / Retry Loop | Revised plan is executed and logged | Pipeline re-executed with adjusted steps; `run_id` and config fully logged | ✅ Complete
Post-execution Self-Judgment | Evaluate outcome quality and pipeline performance | `PipelineJudgeAgent` parses scores + rationale from LLM | ✅ Complete
Logging of Causal Improvement Chains | Track what changes led to improvement | `ReflectionDeltaAgent` + `reflection_deltas` table | ✅ Complete
Symbolic Feedback / Delta Training | Use differences to train agents | `MRQStrategyAgent` trained on `ReflectionDelta` score deltas | ✅ Complete
Retry + Exploration Framework | Try different revisions if one fails | Manual rerun supported (`rerun_pipeline(run_id)`), future automation planned | 🟡 Partial
Data Used for Tuning and Analysis | Simulated or real preference data (e.g., preference modeling) | HelpSteer3 integration added; StrategyQA goals also used | 🟡 In Progress
Scalable Preference-based Evaluation | System evolves using large-scale feedback | Plans for MRQ + HelpSteer3-based symbolic evaluation in future posts | 🟡 Planned
🔗 Rebooting Open AI Pipelines: Source Code on GitHub
All of the components described in this post, from the `LookaheadAgent` to the `MRQStrategyAgent` and `ReflectionDeltaAgent`, are fully implemented and available in our open-source repository:
👉 co-ai: The Self-Aware Reasoning Framework
This repo includes:
- 🧠 All agent source code (fully configurable with Hydra)
- 🧪 SQL schema and `MemoryTool` for logging and analysis
- 🧰 Agent registry with dynamic metadata for reflection
- 🔄 Pipeline execution system with built-in retry and context memory
- 📊 Jupyter-ready outputs for scoring and training
We designed this framework for developers, researchers, and tinkerers who want to explore how agents can reflect, reason, and adapt.
💡 Clone the repo, load up a goal, and watch your pipeline evolve. Contributions, ideas, and forks are all welcome.
📚 References
- Devil's Advocate: Anticipatory Reflection for LLM Agents. arXiv:2405.16334. Core inspiration for the `LookaheadAgent` and the anticipatory reflection loop.
- Symbolic Learning Enables Self-Evolving Agents. arXiv:2406.18532. Forms the basis for the `SymbolicOptimizerAgent` (to be implemented in a follow-up post).
- HelpSteer3 preference dataset, NVIDIA Research. arXiv:2505.11475. Used for pipeline validation and future reward modeling.
- Elo rating system (originally proposed by Arpad Elo). Used as part of the composite scoring system to evaluate and track hypothesis competitiveness.
- Hydra configuration system, Facebook Research. https://github.com/facebookresearch/hydra. Used to modularize agent configurations and support dynamic pipelines.
📘 Glossary
Term | Definition |
---|---|
Self-Aware Pipeline | An adaptive AI workflow that reflects on its own steps, predicts risks, and revises its structure dynamically based on goal type and past performance. |
Pipeline | A sequence of agents (e.g. generation → review → judge) used to solve a specific goal. |
Goal | A task or problem passed to the system, typically with natural language input and an optional goal type or focus area. |
Agent | A modular AI component (e.g. `generation`, `reviewer`, `judge`) with a defined purpose and prompt. Agents can be chained in a pipeline. |
Agent Registry | A YAML file that describes agent capabilities, failure modes, and preferred use cases. Used by `LookaheadAgent` and others for strategy reflection. |
LookaheadAgent | An agent that analyzes the current pipeline before execution, anticipates weaknesses, and proposes improved strategies. Inspired by Devil’s Advocate (ReReST). |
Reflection | A structured analysis of the current pipeline’s appropriateness for a goal. Often LLM-generated based on agent registry and goal metadata. |
SymbolicOptimizerAgent | A strategy learner that analyzes past pipeline performance and recommends future agent sequences based on success history. |
MRQ (Minimal Rational Questioner) | A simplified reward model used to rank and select pipelines based on symbolic cues and past data without complex model training. |
MRQStrategyAgent | A trained strategy agent that uses reflection deltas to score and rank pipeline configurations. It evolves its ranking based on real-world results. |
ReflectionDelta | A database entry comparing two pipeline runs for the same goal, noting what changed (agents, model, strategy) and whether the change improved performance. |
compute_pipeline_delta() | A function that compares two pipeline runs and calculates the difference in agents, scores, strategy, model, and rationale. Used for reflection and training. |
PipelineJudgeAgent | An agent that evaluates whether a completed pipeline was effective, using LLM prompts to score overall performance and rationale. |
Scored Hypothesis | A hypothesis (e.g. output from generation) that has been evaluated by multiple scoring agents (proximity, judge, MRQ, etc.). |
HelpSteer3 | A preference-aligned dataset from NVIDIA used to simulate and test goal-solving strategies with real-world LLM comparisons. |
Dynamic Pipeline Selection | The process of choosing the best pipeline on-the-fly based on goal characteristics and historical success patterns. |
Rationale Diff | A qualitative comparison between the reasoning outputs of two different pipeline runs. |
Causal Improvement Chain | A series of pipeline changes (reflected in deltas) that clearly improve performance and can be used to guide strategy optimization. |