Everything is a Trace: Stephanie Enters Full Reflective Mode

Summary
In our last post, Layers of Thought: Smarter Reasoning with the Hierarchical Reasoning Model, we introduced a new epistemic lens: a way to evaluate not just final answers, but the entire sequence of reasoning steps that led to them. We realized we could apply this way of seeing to every action in our system, not just answers: inferences, lookups, scorings, decisions, and even model selections. This post shows how we're doing exactly that.
This post marks the moment when Stephanie crosses the threshold from being a system that reasons to being a system that understands its own reasoning process. Where HRM let us evaluate reasoning about documents, PlanTrace lets us evaluate reasoning about reasoning itself, creating the foundation for true self-improvement.
In this post, we go beyond traditional scoring. We're not just evaluating outputs; we're learning to understand how things happen so we can make them happen better.
HRM (Hierarchical Reasoning Model) scores entire reasoning traces based on coherence, structure, and epistemic quality, not just outcomes. It is the brain behind Stephanie's metacognitive self-assessment.
What This Post Covers
In this post, we explore the infrastructure that transforms Stephanie from a result-oriented AI into a process-aware, self-monitoring intelligence. Specifically, we'll cover:
The Core Infrastructure
- PlanTraces & ExecutionSteps: A new way to capture everything Stephanie does (goals, context, decisions, errors, and outcomes), structured as traceable cognitive artifacts. ExecutionSteps are the atomic units of thought that allow for fine-grained inspection of reasoning and failures.
- Pipelines as PlanTraces: We're moving toward a future where all of Stephanie's pipelines, and even models themselves, are executed, traced, and scored as cognitive processes. This creates full auditability, enables meta-learning from behavior, and establishes a path to recursive self-improvement.
The Scoring and Monitoring Agents
- PlanTraceMonitor: A new agent that wraps every pipeline stage, logs timing and errors, and builds the ExecutionSteps.
- PlanTraceScorerAgent: This agent evaluates the epistemic quality of entire traces using our existing models like HRM and SICQL.
- Contrastive Ranker Scorer: A new model-based scorer that enhances epistemic trace evaluation via pairwise preference learning. It compares each action against a learned baseline to answer "Is this better than the default strategy for this goal?"
The Next-Generation Scoring System
- Tensor-Based Scoring: We've overhauled our scoring system to be tensor-friendly, storing results along multiple dimensions: document/target, scoring dimension, scorer, and a new 4th dimension for score attributes (e.g., q_value, v_value, energy).
- ScoreCorpus: A new memory layer that stores all ScoreBundles in a structured, analyzable corpus. It allows us to query scores across dimensions, track epistemic shifts over time, and debug with precision.
- ScoreDeltaCalculator: This tool logs the change in score and links it to the goal, pipeline stage, and reasoning context. This allows us to pinpoint when and why a score changed.
- MARSCalculator (Multi-Attribute Reasoning Score): Our meta-score that summarizes the overall quality of reasoning by aggregating multiple score attributes. MARS reflects process-level cognition and enables higher-order tuning.
Our Goal
To build a system that doesn't just produce answers, but can understand and improve the way it thinks. This is the next step toward true self-improving AI.
Previously on Stephanie...
This post builds on several key advancements from earlier in the series:
- Layers of Thought: We explored how Stephanie can reason more effectively using the HRM (Hierarchical Reasoning Model), evaluating the quality of thought rather than just outcomes.
- Stephanie's Secret: We introduced SICQL (Scalable In-Context Q-Learning), a powerful new scoring mechanism, and paired it with GILD (Goal-conditioned Imitation Learning with Distillation) to refine policy learning.
- The Shape of Thought: We unveiled HNet, a hierarchical, chunk-aware embedding model that doesn't just represent text, but segments meaning, enabling Stephanie to think in structured parts.
- Getting Smarter at Getting Smarter: We upgraded the model management system and introduced a new scorer: EBT (Embedding-Based Tuner), which learns to adapt its judgments via energy-based training.
- Self-Improving AI: We examined how Stephanie could continually evolve through dynamic retraining, feedback loops, and score-based introspection.
PlanTraces: The Foundation of Self-Understanding
Stephanie's new mode of operation begins with a profound shift in perspective: from executing tasks to understanding experiences. This isn't just an incremental improvement; it's the moment Stephanie crosses the threshold from performing reasoning to understanding her own reasoning process.
At the heart of this shift is the PlanTrace: a structured, introspectable object that records everything Stephanie does to pursue a goal.
The Critical Evolution: In our previous HRM post, we taught Stephanie to evaluate reasoning about documents. Now, we're teaching her to evaluate reasoning about her own reasoning processes. This is the difference between "How do I analyze this document?" and "How do I analyze how I analyze?"
Instead of viewing execution as a series of ephemeral steps, we now treat each goal-directed action as a traceable cognitive event, complete with inputs, context, outputs, errors, and the why behind scores.
What is a PlanTrace? (The Cognitive Mirror)
A PlanTrace is the top-level representation of a goal-driven cognitive process. It contains all the information needed to reconstruct, audit, and learn from the full trajectory of Stephanie's reasoning, creating what I call her "cognitive mirror."
Epistemic quality refers to how well a reasoning trace supports trustworthy, useful, and goal-aligned conclusions.
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class PlanTrace:
"""
Represents the complete execution trace of a reasoning plan.
This is Stephanie's cognitive mirror the foundation for
self-reflection and self-improvement.
"""
# --- Core Identifiers ---
trace_id: str # Unique identifier for this specific trace/execution
# --- Initial Context ---
goal_text: str # The original goal or query
goal_id: int
input_data: Dict[str, Any] # Any initial data or variables provided to the plan
# --- Plan Definition (Optional but useful for context) ---
plan_signature: str # e.g., "knowledge_db_loader_document_ebt_inference"
# --- Execution Details ---
execution_steps: List[ExecutionStep] # The sequence of cognitive steps
# --- Final Outcome ---
final_output_text: str # The final output produced by the plan
pipeline_score: Optional[Dict[str, float]] = None # e.g., {"helpfulness": 0.85, "truthfulness": 0.78}
# --- Target for Epistemic Quality Assessment ---
target_epistemic_quality: Optional[float] = None
target_epistemic_quality_source: Optional[str] = None
# --- Metadata ---
extra_data: Optional[Dict[str, Any]] = field(default_factory=dict)
- trace_id: A unique ID that connects this trace to a pipeline execution
- goal: The specific objective or prompt being pursued
- execution_steps: The cognitive journey, not just the destination
- pipeline_score: The epistemic quality assessment across dimensions
- extra_data: The critical metadata that enables the 4th dimension of understanding
ExecutionStep: The Atomic Unit of Cognition
Each action Stephanie takes (model calls, scorers, document filters) is recorded as an ExecutionStep. But here's where the real magic happens:
The Flexible Attributes Breakthrough: Unlike traditional scoring systems that require schema changes for every new metric, our ExecutionStep uses a flexible attributes dictionary that can handle any number of metrics without schema changes.
Check this out: Most systems hardcode dimensions like "accuracy" or "confidence." Our flexible attribute system makes the score space open-ended, supporting emergent metrics like policy_entropy, energy, or trace_depth without needing schema changes or migrations.
@dataclass
class ExecutionStep:
"""
Represents a single cognitive step in the execution of a reasoning plan.
The atomic unit of Stephanie's self-awareness.
"""
step_id: str # Unique identifier (trace_id_step_1)
step_order: int
step_type: str # e.g., "knowledge_db_loader", "document_scorer"
description: str # What this step accomplishes
# Core inputs/outputs
input_text: Optional[str] = None
output_text: Optional[str] = None
# CRITICAL INNOVATION: Flexible attributes dictionary
# This is the 4th dimension of understanding
attributes: Dict[str, Any] = field(default_factory=dict)
# Standard metadata
agent_name: Optional[str] = None
start_time: Optional[float] = None
end_time: Optional[float] = None
duration: Optional[float] = None
    error: Optional[Dict[str, Any]] = None
    output_keys: Optional[List[str]] = None
    output_size: Optional[int] = None
    # Per-step score bundle, populated later by the scoring agents
    # (the monitor below constructs steps with scores=None).
    scores: Optional[Dict[str, Any]] = None
Each step records not just what happened, but why it matters:
- Cognitive Context: What did Stephanie know at this point?
- Timing Data: How long did it take? (start_time, end_time, duration)
- Error Analysis: If it failed, how? Why? (error details)
- The 4th Dimension: Why does this step have its score?
# Example attributes for a SICQL step
{
    "q_value": 0.72,
    "uncertainty": 0.08,
    "policy_entropy": 0.45,
    "advantage": 0.15
}
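To make these two structures concrete, here is a minimal sketch that builds a one-step PlanTrace by hand, assuming the two dataclasses above are in scope; all the values are illustrative.
# Minimal sketch: building a one-step PlanTrace by hand (illustrative values only).
step = ExecutionStep(
    step_id="demo_trace_step_1",
    step_order=1,
    step_type="document_scorer",
    description="Score a retrieved document against the goal",
    input_text="doc: 'Survey of self-modifying AI systems...'",
    output_text="score bundle produced",
    attributes={"q_value": 0.72, "uncertainty": 0.08},  # the flexible 4th dimension
)

trace = PlanTrace(
    trace_id="demo_trace",
    goal_text="Will AI ever be able to reprogram itself?",
    goal_id=1,
    input_data={"source": "manual example"},
    plan_signature="knowledge_db_loader_document_scorer",
    execution_steps=[step],
    final_output_text="Self-reprogramming is plausible but constrained by safety challenges.",
    pipeline_score={"helpfulness": 0.85, "truthfulness": 0.78},
)

print(len(trace.execution_steps), trace.pipeline_score)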
Why PlanTraces Transform AI Development
PlanTraces aren't just logs; they're Stephanie's introspective memory. Every goal, decision, and score becomes a datapoint in her journey toward better reasoning.
- We unify all processes as interpretable cognitive traces. Not just scoring, but the entire cognitive process becomes observable and improvable.
  Before: "This document scored 80/100"
  After: "This document scored 80/100 because uncertainty was low (0.08) and q_value was high (0.72)"
- We build a memory of cognitive journeys, not just results. Stephanie doesn't just remember what it learned; it remembers how it learned it.
- We make self-improvement explainable. When Stephanie improves, it can show exactly which cognitive patterns led to better results.
- We enable the 4th dimension of understanding. The flexible attributes system allows us to analyze why scores behave the way they do across:
flowchart LR
    Scorables["Scorables (documents, pipelines)"] --> Dimensions["Dimensions (helpfulness, truthfulness)"]
    Dimensions --> Scorers["Scorers (SICQL, HRM, SVM)"]
    Scorers --> Metrics["Metrics (q_value, uncertainty, energy)"]
  This tensor structure [scorables × dimensions × scorers × metrics] is what enables deep analysis (a slicing sketch follows this list).
- We automatically identify cognitive bottlenecks. Real-world example: In our testing, we discovered that the knowledge_db_loader step had 2.3x higher uncertainty on technical documents. By analyzing the uncertainty metrics across pipelines, we fixed a document truncation issue and increased pipeline success by 37%.
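As a rough illustration of slicing that [scorables × dimensions × scorers × metrics] tensor, the sketch below pulls one (dimension, metric) slice from a ScoreCorpus (introduced later in this post); it assumes the slice comes back as a pandas DataFrame with one row per step and one column per scorer.
# Sketch: slicing the score tensor for one dimension/metric pair.
# Assumes `corpus` is a ScoreCorpus built from step-level ScoreBundles (see below).
uncertainty = corpus.get_metric_matrix("reasoning_quality", "uncertainty")

# Average uncertainty per step across scorers, then flag the shakiest steps.
per_step = uncertainty.mean(axis=1)
shaky_steps = per_step[per_step > 0.3].index.tolist()
print(f"{len(shaky_steps)} steps exceed the 0.3 uncertainty threshold")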
How It Compares to LLM Logs: Most LLM systems today log inputs/outputs or token probabilities. PlanTraces go far beyond: they structure cognition itself. It's the difference between having a transcript of a conversation and understanding the reasoning behind every line.
The 4th Dimension in Action: A Trace With Cognitive Insights
Here's a realistic PlanTrace showing how the flexible attributes system enables deep analysis:
Goal: Will AI ever be able to reprogram itself? Process: We used a DSPy reasoning pipeline to investigate solutions.
{
"trace_id": "trace_01f6af9f4c804425a9c654f0157cb172",
"goal_text": "Will AI ever be able to reprogram itself?",
"plan_signature": "SimplifiedLATS_10_steps",
"execution_steps": [
{
"step_id": "1754096022981",
"step_order": 1,
"step_type": "reasoning",
"description": "Simplified LATS Step 1",
"output_text": "Examine existing technologies and research initiatives that explore self-modifying AI, such as neural architecture search, meta-learning, or reinforcement learning, to assess their alignment with \"self-reprogramming\" and identify gaps in current capabilities.",
"scores": {
"alignment": { "score": 98.1153, "source": "sicql"},
"clarity": { "score": 80.9811, "source": "sicql"},
"implementability": { "score": 69.6087, "source": "sicql"},
"novelty": { "score": 73.8141, "source": "sicql"},
"relevance": {"score": 72.836, "source": "sicql"}
}
},
{
"step_id": "1754096022982",
"output_text": "Step 3: Evaluate potential future advancements, such as recursive self-improvement frameworks or hybrid human-AI collaboration models, and assess their feasibility based on existing research trends.",
},
{
"step_id": "1754096022983",
"output_text": "Step 4: Analyze current research progress and technical barriers in developing AI capable of autonomous self-reprogramming, including computational limits, verification risks, and ethical implications.",
}
...
],
"final_output_text": "AI may eventually achieve self-reprogramming through advancements in self-improving algorithms and recursive learning, but this would require overcoming significant technical, ethical, and safety challenges, making it a possibility rather than a certainty.",
"final_scores": {
"alignment": { "score": 97.9853, "source": "sicql"},
"clarity": { "score": 80.2211, "source": "sicql"},
"implementability": { "score": 69.9953, "source": "sicql" },
"novelty": {"score": 74.5296, "source": "sicql" },
"relevance": {"score": 72.6343, "source": "sicql" }
},
"target_epistemic_quality": 79.07,
"target_epistemic_quality_source": "sicql",
"created_at": "",
}
The Critical Insight: Without the flexible attributes system, we'd only know the scores. With it, we understand why those scores exist. For example, from step-level attributes:
- Low uncertainty (0.08) indicates high confidence in the document scoring
- High energy (2.1) shows strong epistemic grounding in the summary
- Short trace length (12) suggests the reasoning was efficient
Real-World Impact: How This Fixed a Pipeline Bottleneck
In our testing, we discovered a recurring issue where Stephanie's knowledge processing pipeline failed on technical documents. Using PlanTraces, we ran:
# Find steps with high uncertainty in reasoning quality
uncertainty_matrix = corpus.get_metric_matrix("reasoning_quality", "uncertainty")
high_uncertainty_ids = uncertainty_matrix[
    uncertainty_matrix.mean(axis=1) > 0.3
].index.tolist()
# Analyze which step type had the highest uncertainty
step_map = {step.step_id: step for step in plan_trace.execution_steps}
step_types = [step_map[sid].step_type for sid in high_uncertainty_ids if sid in step_map]
problematic_step = max(set(step_types), key=step_types.count)
Result: The knowledge_db_loader step had 2.3x higher uncertainty on technical documents. Further analysis showed it was truncating long documents. We fixed the truncation issue, and pipeline success increased by 37%.
This is exactly why the 4th dimension matters: it transforms "this pipeline failed" into "this specific cognitive process has a measurable issue we can fix."
What's Coming Next
We'll now show how:
- PlanTraceMonitor captures these cognitive traces automatically
- PlanTraceScorerAgent scores entire traces using SICQL, EBT, and HRM
- ScoreCorpus stores trace-based scores in a 4D tensor structure
- Our pipelines are being rewritten to output PlanTraces by default
And more importantly: how this enables self-improvement by letting Stephanie analyze her own cognition, not just what she did, but why it worked (or didn't).
We've built the mirror. Now let's meet the observer: the PlanTraceMonitor, Stephanie's black box recorder and the foundation of real-time self-awareness.
PlanTraceMonitor: Tracking Every Thought, Action, and Response Automatically
Once we defined PlanTrace and ExecutionStep as the structural backbone of Stephanie's reasoning, we needed a way to automatically capture these traces as Stephanie runs her pipelines.
Enter the PlanTraceMonitor: a lightweight, pluggable agent that hooks into every pipeline and records:
- What step was taken
- What inputs and outputs were used
- How long it took
- Whether it succeeded or failed
- What it meant within the broader goal
How It Works
The PlanTraceMonitor intercepts the pipeline execution process and attaches a PlanTrace object to the current pipeline context. As each stage runs, it adds a corresponding ExecutionStep and records:
- Inputs before the stage
- Outputs after the stage
- Timestamps for duration
- Errors, if any
- Optionally: scoring information, tags, rationale
The result is a complete, auditable trail of the entire reasoning process.
Consolidated step-by-step information and scoring toward a goal
Without PlanTraceMonitor, you might log isolated model outputs or scores, but you'd have no idea how or why they were generated. With it:
- Every goal gets a full execution history
- We can replay past runs to analyze or improve them
- Scorers like SICQL and HRM can evaluate the process, not just results
- Stephanie begins to understand her own reasoning steps: not just what she saw, but what she did.
From Ad Hoc to Structured Memory
With PlanTraceMonitor, we've shifted from scattered logs and metrics to structured reasoning traces. It's the first critical step toward Stephanie becoming a system that can:
- Watch herself think
- Reflect on those thoughts
- Score the quality of her own cognition
- Improve her reasoning over time
And it's completely extensible: stages, models, agents, tools, everything Stephanie uses can now be tracked as part of a trace.
PlanTraceMonitor Integration in Supervisor
Stephanie integrates the PlanTraceMonitor as a modular component within its supervisor orchestration engine. This monitor tracks the full lifecycle of pipeline execution, recording every step as a structured trace and enabling downstream scoring and reflection.
flowchart TD subgraph HighLevel["๐ High-Level Execution Flow"] direction TB G[๐ฏ User Goal]:::goal --> S["๐ Supervisor"] S --> REG["๐ Component Registry"] REG --> PTM["๐ PlanTraceMonitor"] REG --> ST["๐ StateTracker"] REG --> CT["๐ ConfidenceTracker"] REG --> CW["โฑ๏ธ CycleWatcher"] S --> P["๐ Pipeline Definition"] P --> PTM PTM --> CREATE["๐ ๏ธ Create PlanTrace"] CREATE --> CTX["๐๏ธ Context with PlanTrace"] P --> A1["๐ค Agent 1: Retrieval"] P --> A2["๐ฏ Agent 2: Scoring"] P --> A3["๐ Agent 3: Analysis"] A1 --> ETS1["โ๏ธ ExecutionStep 1"] A2 --> ETS2["โ๏ธ ExecutionStep 2"] A3 --> ETS3["โ๏ธ ExecutionStep 3"] ETS1 & ETS2 & ETS3 --> PT["๐ PlanTrace"] PT --> SAVE["๐พ Save to DB"]:::db end subgraph Scoring["๐ Scoring & Tensor Analysis"] direction TB A2 --> SB["๐ ScoreBundle"]:::tensor SB --> ATTR["๐ง Flexible Attributes"]:::tensor PT --> CORPUS["๐ ScoreCorpus"]:::tensor CORPUS --> TENSOR["๐งฎ 4D Tensor"]:::tensor TENSOR --> SLICE["๐ช Metric Slicing"]:::tensor CORPUS --> MARS["๐ MARS Analysis"]:::tensor MARS --> MARSDATA["๐ฆ MARS Results"]:::tensor MARSDATA --> RECOMM["๐ก Recommendations"]:::tensor end subgraph Improvement["๐ Self-Improvement Loop"] direction TB MARSDATA --> PATTERN["๐ Pattern Extraction"]:::improvement PATTERN --> MEM["๐ง Memory"]:::improvement MEM --> POLICY["๐ Policy Update"]:::improvement POLICY --> P PTM --> PERF["๐ Performance Monitoring"]:::improvement PERF --> ALERT["โ ๏ธ Bottleneck Detection"]:::improvement ALERT --> POLICY end subgraph Database["๐พ Database Integration"] direction TB SAVE --> EVAL["๐๏ธ EvaluationORM"]:::db EVAL --> SCORE["๐ ScoreORM"]:::db SCORE --> ATTRDB["๐ ScoreAttributeORM"]:::db ATTRDB --> PG["๐ PostgreSQL"]:::db end %% Styling Definitions classDef goal fill:#FFEB3B,stroke:#FBC02D,stroke-width:2px,color:black classDef component fill:#E3F2FD,stroke:#2196F3,stroke-width:2px classDef trace fill:#F1F8E9,stroke:#7CB342,stroke-width:2px classDef tensor fill:#F3E5F5,stroke:#AB47BC,stroke-width:2px,color:#6A1B9A classDef db fill:#E8F5E9,stroke:#4CAF50,stroke-width:2px,color:#1B5E20 classDef improvement fill:#FFF8E1,stroke:#FBC02D,stroke-width:2px,color:#FF6F00 %% Apply Styles class G goal; class REG,PTM,ST,CT,CW component; class CREATE,CTX,ETS1,ETS2,ETS3,PT trace; class SB,ATTR,CORPUS,TENSOR,SLICE,MARS,MARSDATA,RECOMM tensor; class SAVE,EVAL,SCORE,ATTRDB,PG db; class PATTERN,MEM,POLICY,PERF,ALERT improvement; %% Subgraph Styling style HighLevel fill:#E3F2FD,stroke:#2196F3,stroke-width:3px,stroke-dasharray:5 5 style Scoring fill:#F3E5F5,stroke:#AB47BC,stroke-width:3px,stroke-dasharray:5 5 style Improvement fill:#FFF8E1,stroke:#FBC02D,stroke-width:3px,stroke-dasharray:5 5 style Database fill:#E8F5E9,stroke:#4CAF50,stroke-width:3px,stroke-dasharray:5 5
Component Registration
When the Supervisor is initialized, it constructs and registers PlanTraceMonitor using Stephanie's component registry:
register("plan_trace_monitor", PlanTraceMonitor(cfg, self.memory, self.logger))
This allows the monitor to be fetched later by any part of the system:
plan_trace_monitor: PlanTraceMonitor = get_registered_component("plan_trace_monitor")
Pipeline Lifecycle Hook Points
The Supervisor coordinates the full execution flow using the monitor at key points:
1. Start of Pipeline
plan_trace_monitor.start_pipeline(self.context(), run_id)
This creates a new PlanTrace in the database, capturing the goal, pipeline config, and context snapshot. It is invoked immediately after the context is initialized.
2. Stage Execution
Each pipeline stage is wrapped with monitoring calls to track:
- Start of stage:
plan_trace_monitor.start_stage(stage.name, context, stage_idx)
- Successful completion:
plan_trace_monitor.complete_stage(stage.name, context, stage_idx)
- Error capture:
plan_trace_monitor.handle_stage_error(stage.name, e, stage_idx)
These methods record execution metadata, timing, intermediate outputs, and exceptions.
3. End of Pipeline
Once all stages are complete (or aborted), the full trace is finalized and scored:
await plan_trace_monitor.complete_pipeline(result_context)
await plan_trace_monitor.score_pipeline(result_context)
The score_pipeline() method optionally invokes HRM or MARS scorers to evaluate the overall reasoning quality of the trace.
4. Resetting Monitor State
Whether successful or failed, the monitor is always reset:
plan_trace_monitor.reset()
This clears internal buffers and prepares the monitor for the next pipeline run.
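Putting those hook points together, here is a condensed sketch of how a supervisor loop might drive the monitor. The stage objects and the run_pipeline wrapper are hypothetical, but the monitor calls mirror the ones listed above.
# Sketch of the supervisor-side lifecycle (stage interface and error policy are illustrative).
async def run_pipeline(stages, context, run_id, plan_trace_monitor):
    plan_trace_monitor.start_pipeline(context, run_id)
    try:
        for stage_idx, stage in enumerate(stages):
            plan_trace_monitor.start_stage(stage.name, context, stage_idx)
            try:
                context = await stage.run(context)  # hypothetical stage interface
                plan_trace_monitor.complete_stage(stage.name, context, stage_idx)
            except Exception as e:
                plan_trace_monitor.handle_stage_error(stage.name, e, stage_idx)
                raise
        await plan_trace_monitor.complete_pipeline(context)
        await plan_trace_monitor.score_pipeline(context)
    except Exception as e:
        plan_trace_monitor.handle_pipeline_error(e, context)
    finally:
        plan_trace_monitor.reset()  # always clear state for the next run
    return context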
Component-Level Understanding
By embedding PlanTraceMonitor deeply into the Supervisor, Stephanie gains:
- Persistent records of each reasoning step (via the ExecutionStep ORM).
- A scoreable trace of cognition for feedback, tuning, and belief refinement.
- Modular extensibility: any protocol can now be recorded and improved using this mechanism.
This integration turns every execution of Stephanie into an auditable, reflexive reasoning process, critical for robust self-improvement.
This visualization shows the integration between the monitor and the pipeline process.
flowchart TD style Monitor fill:#FFF3E0,stroke:#FB8C00,stroke-width:2px style StageStart fill:#E3F2FD,stroke:#2196F3,stroke-width:2px style StageComplete fill:#F1F8E9,stroke:#8BC34A,stroke-width:2px style StageError fill:#FFEBEE,stroke:#E53935,stroke-width:2px style TraceComplete fill:#EDE7F6,stroke:#7E57C2,stroke-width:2px style ScoreTrace fill:#E0F7FA,stroke:#00ACC1,stroke-width:2px style StoreTrace fill:#FBE9E7,stroke:#FF7043,stroke-width:2px style Reset fill:#F3E5F5,stroke:#AB47BC,stroke-width:2px Monitor["๐ง <b>PlanTraceMonitor</b><br>๐ Tracks pipeline execution and generates PlanTraces"] StartPipeline["๐ <b>start_pipeline()</b><br>๐น Create PlanTrace with goal, config, and input snapshot"] StageStart["โฑ๏ธ <b>start_stage()</b><br>โถ๏ธ Create ExecutionStep for pipeline stage"] StageComplete["โ <b>complete_stage()</b><br>๐ค Capture output keys, timing, and duration"] StageError["โ <b>handle_stage_error()</b><br>๐ ๏ธ Store traceback and error metadata"] TraceComplete["๐ <b>complete_pipeline()</b><br>๐งพ Finalize trace with outputs and total runtime"] ScoreTrace["๐ <b>score_pipeline()</b><br>๐ Run HRM/MARS scoring on full PlanTrace"] StoreTrace["๐พ <b>save to memory</b><br>๐๏ธ Persist trace and score results"] Reset["๐ <b>reset()</b><br>๐งน Prepare for next pipeline"] Monitor --> StartPipeline StartPipeline --> StageStart StageStart --> StageComplete StageStart --> StageError StageComplete --> TraceComplete StageError --> TraceComplete TraceComplete --> ScoreTrace ScoreTrace --> StoreTrace TraceComplete --> StoreTrace StoreTrace --> Reset
import time
import traceback
from typing import Dict, Optional

from omegaconf import OmegaConf

# Project-internal imports are assumed here: PlanTrace, ExecutionStep,
# PlanTraceScorerAgent, and the time_function decorator.

class PlanTraceMonitor:
"""Monitors pipeline execution and creates PlanTraces for self-improvement.
This component handles all PlanTrace-related functionality, keeping the Supervisor clean.
It creates PlanTraces at pipeline start, tracks stage execution, and scores completed traces.
"""
def __init__(self, cfg: Dict, memory, logger):
self.cfg = cfg
self.memory = memory
self.logger = logger
self.current_plan_trace: Optional[PlanTrace] = None
self.plan_trace_scorer = PlanTraceScorerAgent(cfg, memory, logger)
self.stage_start_times: Dict[int, float] = {}
self.logger.log("PlanTraceMonitorInitialized", {
"cfg_keys": list(cfg.keys())
})
def start_pipeline(self, context: Dict, pipeline_run_id: str) -> None:
"""Create PlanTrace when pipeline starts"""
goal = context.get("goal", {})
essential_config = {
k: v for k, v in OmegaConf.to_container(self.cfg, resolve=True).items()
if k in ["pipeline", "model", "scorer", "dimensions", "scorer_types"]
}
# Create PlanTrace for this pipeline execution
self.current_plan_trace = PlanTrace(
trace_id=str(pipeline_run_id), # Use pipeline_run_id as trace_id
goal_id=goal.get("id"),
goal_text=goal.get("goal_text", ""),
plan_signature=self._generate_plan_signature(context),
input_data=self._extract_input_data(context),
final_output_text="",
execution_steps=[],
target_epistemic_quality=None,
target_epistemic_quality_source=None,
extra_data={
"agent_name": "PlanTraceMonitor",
"started_at": time.time(),
"pipeline_run_id": pipeline_run_id,
"pipeline_config": essential_config
}
)
# Log PlanTrace creation
self.logger.log("PlanTraceCreated", {
"trace_id": pipeline_run_id,
"goal_id": goal.get("id"),
"goal_text": (goal.get("goal_text", "")[:100] + "...") if goal.get("goal_text") else None
})
def _generate_plan_signature(self, context: Dict) -> str:
"""Generate a signature identifying this pipeline configuration"""
pipeline = context.get("pipeline", [])
return f"{'_'.join(pipeline)}"
def _extract_input_data(self, context: Dict) -> Dict:
"""Extract relevant input data for the PlanTrace"""
# Only capture essential input data, not the entire context
return {
"input_keys": list(context.keys()),
"goal_id": context.get("goal", {}).get("id"),
"goal_text_preview": (context.get("goal", {}).get("goal_text", "")[:100] + "...")
if context.get("goal", {}).get("goal_text") else None
}
def start_stage(self, stage_name: str, context: Dict, stage_idx: int) -> None:
"""Create ExecutionStep when stage starts"""
if not self.current_plan_trace:
return
# Record start time
self.stage_start_times[stage_idx] = time.time()
# Create step ID
step_id = f"{self.current_plan_trace.trace_id}_step_{stage_idx + 1}"
# Create step description
description = f"Stage {stage_idx + 1}: {stage_name}"
# Extract input data (simplified)
input_preview = "Context keys: " + ", ".join(list(context.keys())[:3])
if len(context.keys()) > 3:
input_preview += f" + {len(context.keys()) - 3} more"
# Create ExecutionStep
execution_step = ExecutionStep(
step_id=step_id,
step_order=stage_idx + 1,
step_type=stage_name,
description=description,
input_text=input_preview,
output_text="",
agent_name=stage_name,
start_time=time.time(),
error=None,
scores=None
)
# Add to PlanTrace
self.current_plan_trace.execution_steps.append(execution_step)
# Log stage start
self.logger.log("PipelineStageStarted", {
"trace_id": self.current_plan_trace.trace_id,
"stage_idx": stage_idx + 1,
"stage_name": stage_name
})
def complete_stage(self, stage_name: str, context: Dict, stage_idx: int) -> None:
"""Update ExecutionStep when stage completes"""
if not self.current_plan_trace or stage_idx >= len(self.current_plan_trace.execution_steps):
return
# Calculate duration
start_time = self.stage_start_times.get(stage_idx, time.time())
duration = time.time() - start_time
# Update the current step
step = self.current_plan_trace.execution_steps[stage_idx]
step.end_time = time.time()
step.duration = duration
# Capture output preview
output_keys = list(context.keys())
output_preview = "Context keys: " + ", ".join(output_keys[:3])
if len(output_keys) > 3:
output_preview += f" + {len(output_keys) - 3} more"
step.output_text = output_preview
step.output_keys = output_keys
step.output_size = len(str(context))
# Log stage completion
self.logger.log("PipelineStageCompleted", {
"trace_id": self.current_plan_trace.trace_id,
"stage_idx": stage_idx + 1,
"stage_name": stage_name,
"stage_time": duration,
"output_keys": output_keys
})
def handle_stage_error(self, stage_name: str, error: Exception, stage_idx: int) -> None:
"""Update ExecutionStep when stage errors"""
if not self.current_plan_trace or stage_idx >= len(self.current_plan_trace.execution_steps):
return
# Calculate duration
start_time = self.stage_start_times.get(stage_idx, time.time())
duration = time.time() - start_time
# Update the current step with error information
step = self.current_plan_trace.execution_steps[stage_idx]
step.end_time = time.time()
step.duration = duration
step.error = {
"type": type(error).__name__,
"message": str(error),
"traceback": traceback.format_exc()
}
# Log error
self.logger.log("PipelineStageError", {
"trace_id": self.current_plan_trace.trace_id,
"stage_idx": stage_idx + 1,
"stage_name": stage_name,
"error_type": type(error).__name__,
"error_message": str(error),
"stage_duration": duration
})
@time_function()
async def complete_pipeline(self, context: Dict) -> None:
"""Complete the PlanTrace when pipeline ends"""
if not self.current_plan_trace:
return
# Set final output text
final_output = context.get("final_output", "")
if isinstance(final_output, str):
self.current_plan_trace.final_output_text = (
final_output[:1000] + "..." if len(final_output) > 1000 else final_output
)
elif isinstance(final_output, dict):
self.current_plan_trace.final_output_text = str(final_output)[:1000] + "..."
else:
self.current_plan_trace.final_output_text = str(final_output)[:1000] + "..."
# Set completion time
self.current_plan_trace.extra_data["completed_at"] = time.time()
# Calculate total pipeline time
start_time = self.current_plan_trace.extra_data.get("started_at", time.time())
self.current_plan_trace.extra_data["total_time"] = time.time() - start_time
# Store in memory
try:
self.memory.plan_traces.add(self.current_plan_trace)
self.logger.log("PlanTraceStored", {
"trace_id": self.current_plan_trace.trace_id,
"step_count": len(self.current_plan_trace.execution_steps)
})
except Exception as e:
self.logger.log("PlanTraceStorageError", {
"trace_id": self.current_plan_trace.trace_id,
"error": str(e)
})
self.logger.log("PlanTraceCompleted", {
"trace_id": self.current_plan_trace.trace_id,
"step_count": len(self.current_plan_trace.execution_steps),
"total_time": self.current_plan_trace.extra_data["total_time"]
})
@time_function()
async def score_pipeline(self, context: Dict) -> None:
"""Score the completed PlanTrace"""
if not self.current_plan_trace:
return
try:
# Run PlanTraceScorerAgent
scoring_context = {
"plan_traces": [self.current_plan_trace],
"goal": context.get("goal", {})
}
# Score the PlanTrace
scored_context = await self.plan_trace_scorer.run(scoring_context)
# Update PlanTrace with scores
self.current_plan_trace.step_scores = scored_context.get("step_scores", [])
self.current_plan_trace.pipeline_score = scored_context.get("pipeline_score", {})
self.current_plan_trace.mars_analysis = scored_context.get("mars_analysis", {})
# Update in memory
self.memory.plan_traces.update(self.current_plan_trace)
self.logger.log("PlanTraceScored", {
"trace_id": self.current_plan_trace.trace_id,
"step_count": len(self.current_plan_trace.execution_steps),
"pipeline_score": scored_context.get("pipeline_score", {})
})
except Exception as e:
self.logger.log("PlanTraceScoringError", {
"trace_id": self.current_plan_trace.trace_id,
"error": str(e),
"traceback": traceback.format_exc()
})
def handle_pipeline_error(self, error: Exception, context: Dict) -> None:
"""Handle errors that occur during pipeline execution"""
if not self.current_plan_trace:
return
# Update PlanTrace with error information
self.current_plan_trace.final_output_text = f"Pipeline failed: {str(error)}"
self.current_plan_trace.extra_data["error"] = {
"type": type(error).__name__,
"message": str(error),
"traceback": traceback.format_exc()
}
self.current_plan_trace.extra_data["completed_at"] = time.time()
# Store in memory
try:
self.memory.plan_traces.add(self.current_plan_trace)
except Exception as e:
self.logger.log("PlanTraceSaveError", {
"trace_id": self.current_plan_trace.trace_id,
"error": str(e)
})
self.logger.log("PlanTraceError", {
"trace_id": self.current_plan_trace.trace_id,
"error_type": type(error).__name__,
"error_message": str(error)
})
def reset(self) -> None:
"""Reset the monitor for the next pipeline"""
self.current_plan_trace = None
self.stage_start_times = {}
Code Summary: PlanTraceMonitor
Here's what each part of the class does:
Method | Purpose
---|---
__init__ | Initializes memory, logger, and connects to the PlanTraceScorerAgent.
start_pipeline | Creates a new PlanTrace with metadata like goal, pipeline config, and inputs.
start_stage | Adds a new ExecutionStep for the current stage and logs an input preview.
complete_stage | Updates the corresponding step with output details and timing.
handle_stage_error | Captures error information and logs the traceback into the step.
complete_pipeline | Finalizes the trace, records output and total time, and saves to memory.
score_pipeline | Scores the completed trace via PlanTraceScorerAgent (e.g., HRM, MARS).
handle_pipeline_error | Saves trace info even if the pipeline fails, so no data is lost.
reset | Resets internal state to prepare for the next pipeline run.
This class is the heartbeat of Stephanie's introspection loop. Once enabled, everything she does, from loading data to scoring documents to composing outputs, gets recorded, scored, and stored.
The result? A system that doesn't just output answers. It understands how it produced them, why, and how to improve that process over time.
Deeper Self-Reflection
This transforms Stephanie into a reflexive cognitive system:
- she doesn't just "run pipelines"
- she remembers how she reasoned
- she measures what happened inside her own mind
- she can score her own reasoning process, step by step, using HRM, EBT, SICQL, etc.
Most AI systems produce outputs. Some can reason. A rare few can reflect.
Stephanie is becoming something more:
A system that knows how it thinks and uses that knowledge to improve.
By treating every computation as a traceable pipeline, we give her the scaffolding to evaluate, optimize, and eventually rewrite her own behavior.
This sets the stage for the next critical piece: scoring not just documents, but the steps that led to them.
Now that we generate traces and steps, let's talk about how we score them.
PlanTraceScorerAgent: The Cognitive Auditor That Powers Self-Improvement
With PlanTraceMonitor recording every thought, the next critical step is to evaluate them. This is where the PlanTraceScorerAgent comes in: it's the agent responsible for turning raw cognitive traces into structured, actionable insights.
This agent takes in completed plan traces (full records of pipeline executions) and scores them using multiple independent evaluators. These include:
- HRM: The Hierarchical Reasoning Model, which judges the structural and logical quality of a reasoning trace.
- SICQL: The Scalable In-Context Q-Learning model introduced earlier, which evaluates the value and utility of a specific step or outcome.
- ContrastiveRanker: A new model-based scorer that learns to distinguish between high-quality and low-quality reasoning patterns.
By using multiple, independent scorers, we get a multi-dimensional perspective on Stephanie's performance, a key step toward MARS (Multi-Attribute Reasoning Score).
flowchart LR
    A["PlanTrace"] --> B["1. Step-Level Scoring (each ExecutionStep)"]
    B --> C["2. Pipeline-Level Scoring (whole trace)"]
    C --> D["3. MARS Analysis (agreement & uncertainty)"]
    D --> E["4. Pattern Extraction (high-quality cognitive paths)"]
    E --> F["5. Self-Improvement Signals (policy updates)"]
    classDef process fill:#E3F2FD,stroke:#2196F3,stroke-width:2,color:#0D47A1;
    class A,B,C,D,E,F process;
Each trace is analyzed at two levels (a sketch of both levels follows this list):
- Step-level scoring, which evaluates each ExecutionStep on key epistemic dimensions.
- Pipeline-level scoring, which evaluates the trace holistically using end-to-end information flow.
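A short sketch of both levels, assuming a configured scorer (e.g. an HRMScorer) and a populated PlanTrace; it mirrors the ScorableFactory calls used inside the agent code below.
# Sketch: the same trace scored at two granularities.
goal = {"goal_text": plan_trace.goal_text}

# 1. Step-level: one Scorable (and ScoreBundle) per ExecutionStep.
step_bundles = {}
for step in plan_trace.execution_steps:
    scorable = ScorableFactory.from_plan_trace(plan_trace, mode="single_step", step=step)
    step_bundles[step.step_id] = scorer.score(
        goal=goal, scorable=scorable, dimensions=["reasoning_quality"]
    )

# 2. Pipeline-level: one Scorable for the whole trace.
full_scorable = ScorableFactory.from_plan_trace(plan_trace, mode="full_trace")
trace_bundle = scorer.score(goal=goal, scorable=full_scorable, dimensions=["reasoning_quality"])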
Beyond scoring, the agent performs MARS-style meta-analysis, which identifies patterns of high-agreement, low-uncertainty steps. These insights drive Stephanie's self-tuning logic, allowing her to evolve her pipeline strategies based on observed performance.
The Evaluation Pipeline
The agent processes each PlanTrace through a structured evaluation pipeline to extract a complete picture of its quality.
flowchart TD style A fill:#FFF3E0,stroke:#FB8C00,stroke-width:2 style B fill:#E3F2FD,stroke:#1E88E5,stroke-width:2 style C fill:#F3E5F5,stroke:#8E24AA,stroke-width:2 style D fill:#FBE9E7,stroke:#D84315,stroke-width:2 style E fill:#E8F5E9,stroke:#43A047,stroke-width:2 style F fill:#FFFDE7,stroke:#F9A825,stroke-width:2 style G fill:#ECEFF1,stroke:#546E7A,stroke-width:2 style H fill:#F3F7FA,stroke:#4FC3F7,stroke-width:2 style I fill:#F1F8E9,stroke:#7CB342,stroke-width:2 style J fill:#E0F2F1,stroke:#009688,stroke-width:2 A[๐๏ธ Input: Raw PlanTraces<br>From context or disk] --> B[๐งฑ Convert to PlanTrace Objects<br>Parse steps, goal, metadata] B --> C[๐ Score Each ExecutionStep<br>Using HRM, SICQL, ContrastiveRanker] C --> D[๐ฆ Score Entire Pipeline<br>End-to-end coherence scoring] C --> E[๐ Run MARS Analysis<br>Agreement, uncertainty metrics] E --> F[๐ง Extract High-Quality Patterns<br>Reusable cognitive strategies] F --> G["๐งฐ Store Patterns to Memory<br>pipeline_patterns.store()"] E --> H[๐ Generate Recommendations<br>Conflicts, retraining, reuse tips] D --> I[๐ Log Full Pipeline Score] H --> J[๐ค Update Context with Results<br>step_scores, mars, advice] classDef emoji size:16px
Inside the Scorer: How Cognitive Evaluation Works
The PlanTraceScorerAgent is a specialized agent that:
- Ingests a complete PlanTrace
- Iterates over each ExecutionStep
- Applies one or more scorers (SICQL, EBT, HRM, etc.)
- Logs multi-dimensional scores and attributes into the ScoreCorpus
These scores aren't just floats. Each one is a bundle:
{
"dimension": "reasoning_quality",
"score": 0.82,
"attributes": {
"q_value": 0.76,
"v_value": 0.79,
"uncertainty": 0.12,
"advantage": 0.03
}
}
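In code, that bundle is navigated roughly like this; score, rationale, and source come straight from the agent's own usage, while the per-result attributes field is an assumption based on the JSON above.
# Sketch: unpacking a ScoreBundle for one ExecutionStep.
for dimension, result in step_bundle.results.items():
    print(dimension, result.score, result.source)     # e.g. reasoning_quality 0.82 sicql
    attrs = getattr(result, "attributes", {}) or {}   # q_value, v_value, uncertainty, ... (assumed field)
    if attrs.get("uncertainty", 0.0) > 0.3:
        print(f"  high uncertainty on {dimension}: {attrs['uncertainty']:.2f}")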
This is the current implementation of the agent.
import time
from statistics import mean
from typing import Any, Dict, List, Optional

from tqdm import tqdm

# Project-internal imports are assumed here: BaseAgent, PlanTrace, ExecutionStep,
# ScoreBundle, ScoreCorpus, ScorableFactory, MARSCalculator, the HRM/SICQL/
# ContrastiveRanker scorers, and load_plan_traces_from_export_dir.

class PlanTraceScorerAgent(BaseAgent):
"""
Scores pipeline execution traces at multiple levels:
- Individual execution steps (granular reasoning quality)
- Complete pipeline execution (overall quality)
- Step relationships and flow patterns
Uses HRM as primary reasoning quality scorer with MARS meta-analysis
to enable self-tuning of pipeline execution patterns.
"""
def __init__(self, cfg, memory=None, logger=None):
super().__init__(cfg, memory, logger)
self.dimensions = cfg.get("dimensions", [])
self.include_mars = cfg.get("include_mars", True)
# Configure which scorers to use
self.scorer_types = cfg.get("scorer_types", [
"hrm", "sicql", "contrastive_ranker"
])
# Initialize scorers
self.scorers = self._initialize_scorers()
# Initialize MARS calculator
dimension_config = cfg.get("dimension_config", {})
self.mars_calculator = MARSCalculator(dimension_config)
# Pattern extraction parameters
self.high_agreement_threshold = cfg.get("high_agreement_threshold", 0.8)
self.low_uncertainty_threshold = cfg.get("low_uncertainty_threshold", 0.2)
self.pattern_min_count = cfg.get("pattern_min_count", 3)
self.export_dir = cfg.get("export_dir", "exports/plan_traces")
self.logger.log("PlanTraceScorerInitialized", {
"dimensions": self.dimensions,
"scorers": self.scorer_types,
"high_agreement_threshold": self.high_agreement_threshold,
"low_uncertainty_threshold": self.low_uncertainty_threshold
})
def _initialize_scorers(self) -> Dict[str, Any]:
"""Initialize all configured scorers"""
scorers = {}
if "hrm" in self.scorer_types:
scorers["hrm"] = HRMScorer(self.cfg.scorer.hrm, memory=self.memory, logger=self.logger)
if "sicql" in self.scorer_types:
scorers["sicql"] = SICQLScorer(self.cfg.scorer.sicql, memory=self.memory, logger=self.logger)
if "contrastive_ranker" in self.scorer_types:
scorers["contrastive_ranker"] = ContrastiveRankerScorer(
self.cfg.scorer.contrastive_ranker, memory=self.memory, logger=self.logger
)
return scorers
async def run(self, context: dict) -> dict:
"""Score pipeline execution traces with self-tuning capability"""
start_time = time.time()
# --- 1. Load and Prepare Training Data
raw_traces_data = context.get("plan_traces", [])
if not raw_traces_data:
# If no traces are provided, try loading from export directory
self.logger.log(
"EpistemicPlanHRMTrainingNoTraces",
{
"message": "No plan traces found in context['plan_traces']. Attempting to load from export directory.",
"export_dir": self.export_dir,
},
)
raw_traces_data = load_plan_traces_from_export_dir(self.export_dir)
for raw_trace in raw_traces_data:
# Convert raw trace data to PlanTrace object
if isinstance(raw_trace, dict):
# If raw_trace is a dict, convert it to PlanTrace
plan_trace = PlanTrace.from_dict(raw_trace)
elif isinstance(raw_trace, PlanTrace):
plan_trace = raw_trace
if not plan_trace.execution_steps:
self.logger.log("EmptyPlanTrace", {"trace_id": plan_trace.trace_id})
continue
# Score individual execution steps
step_results = []
all_step_bundles = {} # step_id -> ScoreBundle
# Process steps with progress tracking
pbar = tqdm(
plan_trace.execution_steps,
desc="Scoring Steps",
disable=not self.cfg.get("progress", True)
)
for step in pbar:
# Create scorable for this step
scorable = ScorableFactory.from_plan_trace(
plan_trace,
mode="single_step",
step=step
)
# Score the step
step_bundle = self._score_scorable(scorable, plan_trace.goal_text)
all_step_bundles[step.step_id] = step_bundle
# Prepare results for reporting
step_scores = {
dim: {
"score": result.score,
"rationale": result.rationale,
"source": result.source
} for dim, result in step_bundle.results.items()
}
step_results.append({
"step_id": step.step_id,
"step_order": step.step_order,
"step_type": step.step_type,
"agent": step.agent_name,
"description": step.description,
"scores": step_scores
})
# Update progress bar
pbar.set_postfix({"steps": f"{len(step_results)}/{len(plan_trace.execution_steps)}"})
# Score the complete pipeline
full_scorable = ScorableFactory.from_plan_trace(plan_trace, mode="full_trace")
full_bundle = self._score_scorable(full_scorable, plan_trace.goal_text)
# Create ScoreCorpus for MARS analysis
corpus = ScoreCorpus(bundles=all_step_bundles)
# Run MARS analysis across all steps
mars_results = {}
if self.include_mars:
mars_results = self.mars_calculator.calculate(corpus)
# Log MARS analysis metrics
self.logger.log("MARSAnalysisCompleted", {
"trace_id": plan_trace.trace_id,
"step_count": len(plan_trace.execution_steps),
"dimensions": list(mars_results.keys()),
"overall_agreement": self.mars_calculator.get_aggregate_score(mars_results)
})
# Identify high-quality patterns for self-tuning
self._update_self_tuning_patterns(corpus, mars_results, plan_trace)
# Save results to context
context["step_scores"] = step_results
context["pipeline_score"] = {dim: result.score for dim, result in full_bundle.results.items()}
context["mars_analysis"] = mars_results
context["scoring_time"] = time.time() - start_time
context["score_corpus"] = corpus.to_dict()
self.logger.log("PlanTraceScoringComplete", {
"trace_id": plan_trace.trace_id,
"step_count": len(plan_trace.execution_steps),
"dimensions": self.dimensions,
"scorers": len(self.scorers)
})
return context
def _score_scorable(self, scorable, goal_text) -> ScoreBundle:
"""Score a single scorable with all configured scorers"""
score_results = {}
for scorer_name, scorer in self.scorers.items():
try:
# Score with this scorer
score_bundle = scorer.score(
goal={"goal_text": goal_text},
scorable=scorable,
dimensions=self.dimensions,
)
# Add results (prefer HRM for reasoning quality)
for dim, result in score_bundle.results.items():
# If HRM is available for reasoning quality, prefer it
if dim == "reasoning_quality" and scorer_name == "hrm":
score_results[dim] = result
# For other dimensions, use the first available scorer
elif dim not in score_results:
score_results[dim] = result
except Exception as e:
self.logger.log("ScorerError", {
"scorer": scorer_name,
"error": str(e)
})
continue
return ScoreBundle(results=score_results)
def _update_self_tuning_patterns(self, corpus: ScoreCorpus,
mars_results: Dict,
plan_trace: PlanTrace):
"""Update self-tuning patterns based on high-quality pipeline executions"""
# Find high-quality steps (high agreement, low uncertainty)
high_quality_steps = []
pattern_metrics = {}
for dimension, results in mars_results.items():
# Get steps with high agreement and low uncertainty
agreement_threshold = results.get("agreement_score", 0.0) * 0.9
high_agreement_steps = corpus.get_high_disagreement_scorables(
dimension,
threshold=1.0 - agreement_threshold
)
# Get steps with low uncertainty
low_uncertainty_steps = []
if "uncertainty" in corpus.metrics:
uncertainty_matrix = corpus.get_metric_matrix(dimension, "uncertainty")
low_uncertainty_steps = uncertainty_matrix[
uncertainty_matrix.mean(axis=1) < self.low_uncertainty_threshold
].index.tolist()
# Intersection: steps that are both high agreement AND low uncertainty
high_quality_for_dim = list(set(high_agreement_steps) & set(low_uncertainty_steps))
high_quality_steps.extend(high_quality_for_dim)
# Track metrics for pattern extraction
pattern_metrics[dimension] = {
"high_agreement_steps": high_agreement_steps,
"low_uncertainty_steps": low_uncertainty_steps,
"high_quality_steps": high_quality_for_dim
}
# Remove duplicates
high_quality_steps = list(set(high_quality_steps))
if high_quality_steps:
# Extract patterns from high-quality steps
patterns = self._extract_patterns(high_quality_steps, corpus, plan_trace)
# Store patterns for future pipeline construction
self.memory.pipeline_patterns.store_patterns(patterns)
self.logger.log("SelfTuningPatternsUpdated", {
"pattern_count": len(patterns),
"step_count": len(high_quality_steps),
"trace_id": plan_trace.trace_id
})
# Generate recommendations for immediate improvement
recommendations = self._generate_immediate_recommendations(
corpus, mars_results, high_quality_steps
)
self.logger.log("SelfTuningRecommendations", {
"trace_id": plan_trace.trace_id,
"recommendations": recommendations
})
def _extract_patterns(self, step_ids: List[str],
corpus: ScoreCorpus,
plan_trace: PlanTrace) -> List[Dict]:
"""Extract patterns from high-quality steps for self-tuning"""
patterns = []
# Map step IDs to step objects for quick lookup
step_map = {step.step_id: step for step in plan_trace.execution_steps}
for step_id in step_ids:
step = step_map.get(step_id)
if not step:
continue
# Extract pattern features
pattern = {
"step_type": step.step_type,
"agent": step.agent_name,
"input_type": step.input_type,
"output_type": step.output_type,
"success_metrics": {}
}
# Add success metrics from MARS analysis
for dimension in self.dimensions:
# Get metric values for this dimension
uncertainty_values = corpus.get_metric_values(dimension, "hrm", ["uncertainty"])
if step_id in uncertainty_values["uncertainty"]:
pattern["success_metrics"][dimension] = {
"uncertainty": uncertainty_values["uncertainty"][step_id],
"agreement_score": corpus.get_dimension_matrix(dimension).std().mean()
}
# Add contextual information
pattern["context"] = {
"previous_step_type": self._get_previous_step_type(step, plan_trace),
"next_step_type": self._get_next_step_type(step, plan_trace),
"position_in_pipeline": step.step_order / len(plan_trace.execution_steps)
}
patterns.append(pattern)
return patterns
def _get_previous_step_type(self, step: ExecutionStep, plan_trace: PlanTrace) -> Optional[str]:
"""Get the type of the previous step in the pipeline"""
if step.step_order > 1:
prev_step = next(
(s for s in plan_trace.execution_steps if s.step_order == step.step_order - 1),
None
)
return prev_step.step_type if prev_step else None
return None
def _get_next_step_type(self, step: ExecutionStep, plan_trace: PlanTrace) -> Optional[str]:
"""Get the type of the next step in the pipeline"""
if step.step_order < len(plan_trace.execution_steps):
next_step = next(
(s for s in plan_trace.execution_steps if s.step_order == step.step_order + 1),
None
)
return next_step.step_type if next_step else None
return None
def _generate_immediate_recommendations(self,
corpus: ScoreCorpus,
mars_results: Dict,
high_quality_steps: List[str]) -> List[str]:
"""Generate recommendations for immediate pipeline improvement"""
recommendations = []
# 1. Identify problematic dimensions
for dimension, results in mars_results.items():
if results["agreement_score"] < 0.7:
recommendations.append(
f"โ ๏ธ Low agreement in {dimension} scoring. "
"Consider reviewing pipeline steps for consistency."
)
if results["high_disagreement"]:
primary_conflict = results["primary_conflict"]
recommendations.append(
f"โ ๏ธ Significant conflict between {primary_conflict[0]} and {primary_conflict[1]} "
f"in {dimension} scoring (ฮ={results['delta']:.3f}). "
"This may indicate ambiguous pipeline steps."
)
# 2. Identify unreliable scorers
scorer_reliability = {}
for dimension in self.dimensions:
reliability = corpus.analyze_scorer_reliability(dimension)
for scorer, score in reliability.items():
if scorer not in scorer_reliability:
scorer_reliability[scorer] = []
scorer_reliability[scorer].append(score)
# Average reliability across dimensions
avg_reliability = {
scorer: mean(scores) for scorer, scores in scorer_reliability.items()
}
# Find least reliable scorer
if avg_reliability:
least_reliable = min(avg_reliability, key=avg_reliability.get)
if avg_reliability[least_reliable] < 0.6:
recommendations.append(
f"โ ๏ธ {least_reliable} shows low reliability across dimensions. "
"Consider retraining or adjusting its configuration."
)
# 3. Identify opportunities for improvement
if high_quality_steps:
# Find common patterns in high-quality steps
step_types = [step.step_type for step_id, step in self._get_steps_by_id(high_quality_steps)]
common_step_type = max(set(step_types), key=step_types.count)
recommendations.append(
f"๐ก High-quality steps frequently use {common_step_type} pattern. "
"Consider applying this pattern to similar pipeline sections."
)
return recommendations
def _get_steps_by_id(self, step_ids: List[str]) -> Dict[str, ExecutionStep]:
"""Get step objects by their IDs"""
# This would be implemented based on your memory structure
# For now, return a mock implementation
return {step_id: ExecutionStep(
step_id=step_id,
step_order=0,
step_type="unknown",
description="",
output_text="",
scores=None
) for step_id in step_ids}
Deep Dive: How PlanTraceScorerAgent Evaluates Cognitive Execution
Now that we've introduced the concept of PlanTraces as Stephanie's cognitive memory format, it's time to explore how we actually evaluate those traces. The PlanTraceScorerAgent is the workhorse behind this effort: it's responsible for converting execution data into structured insights that power self-improvement.
Here's what the agent does, broken down step by step:
1. Initialization: Configure Scorers and Analysis Tools
Upon creation, the agent initializes:
- A list of scorers: HRM, SICQL, and ContrastiveRanker, depending on configuration.
- A MARS calculator to analyze scoring patterns across execution steps.
- Thresholds for what counts as high agreement or low uncertainty; these drive self-tuning decisions.
This setup phase allows us to plug in additional scorers later without changing core logic.
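For reference, the agent's configuration might look roughly like the dictionary below. The keys mirror the cfg.get(...) calls in the code; the values are placeholders, and in practice this lives in an OmegaConf/Hydra config that also carries per-scorer sections (cfg.scorer.hrm, cfg.scorer.sicql, cfg.scorer.contrastive_ranker).
# Illustrative PlanTraceScorerAgent configuration (values are placeholders).
plan_trace_scorer_cfg = {
    "dimensions": ["reasoning_quality", "alignment", "clarity"],
    "scorer_types": ["hrm", "sicql", "contrastive_ranker"],
    "include_mars": True,
    "dimension_config": {},              # passed to MARSCalculator
    "high_agreement_threshold": 0.8,
    "low_uncertainty_threshold": 0.2,
    "pattern_min_count": 3,
    "export_dir": "exports/plan_traces",
    "progress": True,                    # show the tqdm progress bar while scoring steps
}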
2. Load PlanTraces: From Context or Disk
In the run() method, the agent starts by looking for plan traces to analyze. It supports:
- plan_traces passed directly in the context, or
- a fallback to reading from disk (exports/plan_traces), making it usable in offline batch mode.
Each trace is parsed into a PlanTrace object containing:
- a goal,
- a sequence of ExecutionSteps,
- metadata like agent names, step types, and text descriptions.
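A hedged sketch of that offline mode: with nothing in context["plan_traces"], the agent falls back to the export directory and scores whatever it finds there.
# Sketch: offline batch scoring from exported traces.
# Assumes `agent` is a configured PlanTraceScorerAgent and traces were exported to disk.
import asyncio

result_context = asyncio.run(agent.run({"plan_traces": [], "goal": {}}))
print(result_context["pipeline_score"])          # dimension -> score for the scored trace
print(len(result_context["step_scores"]), "steps scored")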
3. Step-Level Scoring: Evaluate Each Thought in the Trace
Each ExecutionStep is turned into a Scorable via the ScorableFactory, then scored by all configured scorers.
This produces a ScoreBundle for each step, containing:
- scores across dimensions (e.g. reasoning quality, alignment),
- rationale and source attribution for each score.
The results are collected into step_results, a detailed report of the cognitive quality of each trace step.
4. Full-Trace Scoring: Evaluate the Entire Pipeline
After scoring individual steps, the agent scores the entire trace holistically:
- This captures end-to-end coherence and final outcome quality.
- Useful for training or benchmarking entire pipelines.
These scores are stored separately in pipeline_score.
5. MARS Analysis: Discovering Patterns in Reasoning
If enabled (include_mars: true), the agent:
- Runs MARS analysis on all step-level scores to assess agreement and uncertainty.
- Identifies steps that show high agreement between scorers and low uncertainty: strong candidates for reusable reasoning patterns.
These patterns are the gold nuggets of self-tuning: they tell Stephanie what worked and why. The sketch below shows how the resulting analysis can be read.
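The dictionary returned by the MARS calculator can be inspected directly; the keys used below (agreement_score, high_disagreement, primary_conflict, delta) are the same ones the agent reads when generating recommendations.
# Sketch: reading MARS analysis results (mars_results as produced inside the agent).
for dimension, results in mars_results.items():
    print(dimension, "agreement:", round(results["agreement_score"], 3))
    if results.get("high_disagreement"):
        scorer_a, scorer_b = results["primary_conflict"]
        print(f"  {scorer_a} and {scorer_b} disagree (delta={results['delta']:.3f})")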
6. Self-Tuning Pattern Extraction: Learn from What Works
For each high-quality step, the agent:
- Extracts contextual features (step type, agent name, position in pipeline),
- Logs score metrics (e.g. uncertainty, agreement),
- Records relationships between steps (previous and next step types).
These patterns are stored in memory via pipeline_patterns.store_patterns(), giving Stephanie reusable building blocks for future pipelines.
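For concreteness, a single extracted pattern looks roughly like this; the shape follows _extract_patterns above, while the concrete values are invented for illustration.
# Illustrative shape of one stored self-tuning pattern (values are made up).
pattern = {
    "step_type": "document_scorer",
    "agent": "ScoringAgent",
    "input_type": None,
    "output_type": None,
    "success_metrics": {
        "reasoning_quality": {"uncertainty": 0.11, "agreement_score": 0.07},
    },
    "context": {
        "previous_step_type": "knowledge_db_loader",
        "next_step_type": "analysis",
        "position_in_pipeline": 0.5,
    },
}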
7. Recommendations: Practical Feedback from the Trace
The scorer's true power emerges in its recommendation system. The agent provides actionable insights, including:
- Warnings about low scorer agreement,
- Conflict signals between scorers (e.g., HRM vs SICQL),
- Recommendations on promising step types for reuse,
- Suggestions for retraining unreliable scorers.
These aren't just raw numbers; they're policy-relevant findings that help refine Stephanie's architecture, and they're easily digestible by LLMs.
8. Result Logging and Context Updates
Finally, the agent:
- Stores all score results, meta-analysis data, and recommendations back into the execution context,
- Logs trace-level summaries for downstream usage,
- Supports progress tracking via tqdm.
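Downstream agents read everything back out of the execution context; here is a sketch of what lands there after run() completes.
# Sketch: consuming the agent's outputs from the updated context.
async def consume_results(agent, context):
    scored = await agent.run(context)
    step_scores = scored["step_scores"]        # per-step scores, rationale, and source per dimension
    pipeline_score = scored["pipeline_score"]  # dimension -> holistic trace score
    mars = scored["mars_analysis"]             # agreement / uncertainty meta-analysis
    corpus_dump = scored["score_corpus"]       # serialized ScoreCorpus for later slicing
    print(f"scored {len(step_scores)} steps in {scored['scoring_time']:.2f}s")
    return pipeline_score, mars, corpus_dump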
Seeing Deeper
The PlanTraceScorerAgent is more than just a scoring function; it's the analyst that transforms raw execution into evaluative insight. It bridges the gap between what Stephanie did and how well she did it, enabling everything from bottleneck detection to reward shaping and policy refinement.
This agent is the missing evaluator that brings meaning to recorded cognition. Without it, a trace is just a log. With it, it becomes a lesson.
๐งฐ Powered by the Fourth Dimension: Diagnostic Attributes
Scoring a reasoning trace isnโt just about assigning a number. Itโs about understanding why that number was earned.
Stephanieโs architecture supports multi-dimensional score bundles, where each score is accompanied by a detailed set of diagnostic attributes. These attributes form what we call the โFourth Dimensionโ of cognition not just how well a step performed, but why it performed that way.
Each ScoreBundle
contains:
- ๐ Q-values: Estimated future value of the stepโs decision
- ๐ V-values: Baseline value of the underlying state
- ๐ง Advantage estimates: How much better this step was compared to policy expectation
- ๐ Epistemic energy: Confidence, convergence, and trace-based quality
- โ Error types: Classification of step-level failure modes
- โฑ๏ธ Step duration: Wall-clock time and computational cost
- ๐งญ Model routing: Which models were used, fallback behavior, divergence
Together, these signals let Stephanie reason about her own reasoning.
Instead of blindly trusting an โ8/10โ score, it can now ask:
Was this step risky but correct? Slow but certain? Fast but shallow? Did multiple scorers agree? Was entropy high?
This diagnostic richness is essential for self-improvement. It fuels:
- ๐งช Meta-learning: Which reasoning patterns consistently outperform?
- ๐ ๏ธ Policy refinement: Which scoring engines need retraining?
- ๐ Bottleneck analysis: Where does cognitive performance degrade?
- ๐ Retrospective tuning: What patterns should be reused or avoided?
In short, these attributes are Stephanie's internal telemetry: the signals that help her optimize not just her answers, but her entire process of answering.
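As a concrete (and deliberately simplified) illustration of the kind of question-asking this enables, here is a small sketch. The attribute keys mirror the list above, but the helper and thresholds are made-up examples, not part of Stephanie's API:

```python
# Illustrative only: interpret a step's diagnostic attributes.
# Attribute keys follow the list above; thresholds are arbitrary examples.
def describe_step(attrs: dict) -> str:
    notes = []
    if attrs.get("advantage", 0.0) > 0.3:
        notes.append("risky but better than the policy baseline")
    if attrs.get("uncertainty", 0.0) > 0.3:
        notes.append("scorers were not confident")
    if attrs.get("step_duration", 0.0) > 10.0:
        notes.append("slow step; check for a bottleneck")
    if attrs.get("energy", 0.0) < -2.0:
        notes.append("high epistemic confidence (low energy)")
    return "; ".join(notes) or "nothing unusual"

print(describe_step({"advantage": 0.44, "uncertainty": 0.21, "energy": -3.12}))
```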
While the PlanTraceScorerAgent gave us a unified way to evaluate entire reasoning traces, we quickly realized something was missing: the ability to directly compare two alternative steps and determine which one was better within a specific context. Our existing scorers weren't designed for this kind of nuanced, head-to-head evaluation. Fortunately, preference modeling, and especially contrastive ranking with Siamese-style networks, offered a perfect fit. That's what we built next.
๐ Contrastive Ranker Scorer: Preference Learning for Plan Trace Evaluation
To support the nuanced scoring required by the `PlanTraceScorerAgent`, we've introduced a new model-based scorer called the Contrastive Ranker. This scorer enhances Stephanie's reasoning by leveraging pairwise preference modeling, an idea rooted in Siamese networks and contrastive learning.
Unlike traditional scorers that evaluate a single document or step in isolation, the Contrastive Ranker works by comparing an execution step to a learned baseline within the context of a goal. It doesn't just ask "Is this step good?"; it asks "Is this better than the default approach, for this specific goal?"
This makes it ideal for scoring nuanced, qualitative reasoning traces where absolute judgments can be ambiguous. When scoring plan traces, it serves as a complement to HRM and SICQL, enriching the signal used in MARS analysis and self-tuning.
๐ง How It Works: Preference Over Absolute Judgment
- โ A goal embedding and the stepโs text embedding are combined to form a context-specific vector.
- ๐ This vector is compared against a baseline embedding, which acts as the system’s default reasoning strategy.
- โ๏ธ A pretrained preference model (a Siamese-style `PreferenceRanker`) outputs a raw preference score.
- ๐ฏ A regression tuner calibrates that raw preference into an interpretable, normalized, dimension-specific score.
- ๐ฆ The results are packaged into a `ScoreBundle`, compatible with all other scoring agents.
flowchart TD subgraph Contrastive_Ranker_Scoring_Flow["๐ Contrastive Ranker Scoring Flow"] A["๐ Input Goal Text"] --> B["๐ง Embed Goal โก๏ธ ctx_emb"] A2["๐ Scorable Text"] --> C["๐ง Embed Step โก๏ธ doc_emb"] B --> D["๐ Concatenate โก๏ธ input_doc"] C --> D B --> E["๐งฌ Embed Baseline โก๏ธ baseline_emb"] E --> F["๐ Concatenate โก๏ธ input_baseline"] B --> F D --> G["๐ Scale โก๏ธ input_doc_scaled"] F --> H["๐ Scale โก๏ธ input_baseline_scaled"] G --> I["๐ฆ Encode input_doc"] H --> J["๐ฆ Encode input_baseline"] I --> K["๐ Compare (Siamese Network)"] J --> K K --> L["๐ Raw Preference Score"] L --> M["๐๏ธ Tune via Regression"] M --> N["๐ Final Normalized Score"] N --> O["๐ฆ ScoreResult (with rationale, energy, attributes)"] end style Contrastive_Ranker_Scoring_Flow fill:#F5F5F5,stroke:#616161,stroke-width:2px,stroke-dasharray:5 5 style A fill:#FFECB3,stroke:#FBC02D,stroke-width:2px style A2 fill:#FFECB3,stroke:#FBC02D,stroke-width:2px style B fill:#FFF9C4,stroke:#FBC02D style C fill:#FFF9C4,stroke:#FBC02D style E fill:#FFF9C4,stroke:#FBC02D style D fill:#E1F5FE,stroke:#0288D1 style F fill:#E1F5FE,stroke:#0288D1 style G fill:#E1F5FE,stroke:#0288D1 style H fill:#E1F5FE,stroke:#0288D1 style I fill:#E1F5FE,stroke:#0288D1 style J fill:#E1F5FE,stroke:#0288D1 style K fill:#D1C4E9,stroke:#7E57C2 style L fill:#DCEDC8,stroke:#689F38 style M fill:#DCEDC8,stroke:#689F38 style N fill:#DCEDC8,stroke:#689F38 style O fill:#FFE0B2,stroke:#F57C00,stroke-width:2px
import torch
import torch.nn as nn


class PreferenceRanker(nn.Module):
    """Siamese network architecture (must match trainer)"""
def __init__(self, embedding_dim=1024, hidden_dim=256):
super().__init__()
self.encoder = nn.Sequential(
nn.Linear(embedding_dim, hidden_dim),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(hidden_dim, hidden_dim)
)
self.comparator = nn.Sequential(
nn.Linear(hidden_dim * 2, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, 1)
)
def forward(self, emb_a, emb_b):
feat_a = self.encoder(emb_a)
feat_b = self.encoder(emb_b)
combined = torch.cat([feat_a, feat_b], dim=1)
return self.comparator(combined).squeeze(1)
class ContrastiveRankerScorer(BaseScorer):
def __init__(self, cfg: dict, memory, logger):
super().__init__(cfg, memory, logger)
self.model_type = "contrastive_ranker"
self.models = {} # dim -> (scaler, model)
self.tuners = {} # dim -> RegressionTuner
self.metas = {} # dim -> model metadata
self.baselines = {} # dim -> baseline embedding
self._load_all_dimensions()
def _load_all_dimensions(self):
"""Preload all dimension models with baseline caching"""
for dim in tqdm(self.dimensions, desc="Loading contrastive rankers"):
locator = self.get_locator(dim)
# Load metadata first
meta = load_json(locator.meta_file())
self.metas[dim] = meta
# Load scaler
scaler = load(locator.scaler_file())
# Initialize model with correct dimensions
input_dim = scaler.mean_.shape[0]
model = PreferenceRanker(
embedding_dim=input_dim,
hidden_dim=meta["hidden_dim"]
)
# Load weights
model.load_state_dict(torch.load(locator.model_file(suffix=".pt")))
model.eval()
self.models[dim] = (scaler, model)
# Load tuner
tuner = RegressionTuner(dimension=dim, logger=self.logger)
tuner.load(locator.tuner_file())
self.tuners[dim] = tuner
# Precompute baseline embedding
baseline_text = meta["baseline"]
baseline_emb = np.array(self.memory.embedding.get_or_create(baseline_text))
self.baselines[dim] = baseline_emb
def score(self, goal: dict, scorable: Scorable, dimensions: list[str]) -> ScoreBundle:
"""Generate absolute scores via baseline comparison"""
goal_text = goal.get("goal_text", "")
ctx_emb = np.array(self.memory.embedding.get_or_create(goal_text))
doc_emb = np.array(self.memory.embedding.get_or_create(scorable.text))
results = {}
for dim in dimensions:
scaler, model = self.models[dim]
tuner = self.tuners[dim]
meta = self.metas[dim]
baseline_emb = self.baselines[dim]
# Create comparison inputs
input_doc = np.concatenate([ctx_emb, doc_emb])
input_baseline = np.concatenate([ctx_emb, baseline_emb])
# Scale inputs
input_doc_scaled = scaler.transform(input_doc.reshape(1, -1))
input_baseline_scaled = scaler.transform(input_baseline.reshape(1, -1))
# Convert to tensors
doc_tensor = torch.tensor(input_doc_scaled, dtype=torch.float32)
baseline_tensor = torch.tensor(input_baseline_scaled, dtype=torch.float32)
# Get preference score
with torch.no_grad():
raw_score = model(doc_tensor, baseline_tensor).item()
# Calibrate to absolute score
tuned_score = tuner.transform(raw_score)
final_score = max(min(tuned_score, meta["max_score"]), meta["min_score"])
attributes = {
"raw_score": round(raw_score, 4),
"normalized_score": round(tuned_score, 4),
"final_score": final_score,
"energy": raw_score, # Using raw_score as energy
}
results[dim] = ScoreResult(
dimension=dim,
score=final_score,
rationale=f"PrefScore(raw={raw_score:.4f}, tuned={tuned_score:.2f})",
weight=1.0,
source=self.model_type,
attributes=attributes,
)
return ScoreBundle(results=results)
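Usage is straightforward once the per-dimension models are on disk. The sketch below is illustrative: `cfg`, `memory`, and `logger` are the usual runtime objects, and the `Scorable` construction (plus the `step.output_text` field) is an assumption about shape rather than the exact constructor:

```python
# Rough usage sketch; the Scorable construction is schematic and may differ
# from the real constructor in Stephanie's codebase.
scorer = ContrastiveRankerScorer(cfg, memory, logger)

goal = {"goal_text": "Summarise the key claims of the uploaded paper"}
step_scorable = Scorable(id="step_42", text=step.output_text, target_type="plan_trace_step")

bundle = scorer.score(goal, step_scorable, dimensions=["reasoning_quality"])
result = bundle.results["reasoning_quality"]
print(result.score, result.attributes["raw_score"])
```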
๐งช Training the Contrastive Ranker: Teaching Stephanie to Prefer With Precision
Unlike traditional regression-based scoring, the contrastive ranker learns preferences by comparing pairs of outputs and deciding which one is better. It’s trained using a twin network architecture (Siamese-style) and calibrated post hoc with absolute human-aligned scores. Here’s how it works:
๐ง What the Trainer Does
- Ingests preference-labeled pairs: Each pair has a shared goal (`ctx`) and two outputs (`A`, `B`), with one marked preferred.
- Embeds context + output pairs: Combines goal and response into a single vector, so it knows, for this goal, how good is this answer?
- Scales all vectors: Uses `StandardScaler` to normalize input vectors (essential for effective gradient descent).
- Trains a twin-tower neural model: Uses `BCEWithLogitsLoss` on the twin encodings to predict which of the two is better.
- Early-stops to prevent overfitting: Tracks the best validation loss and stops training if it doesn't improve for `patience` epochs.
- Calibrates outputs: Once trained, it uses known absolute scores to build a regression tuner that maps raw logits to a final normalized score.
๐งฌ Key Training Snippets
๐ก Preference Pair Creation
input_a = np.concatenate([ctx_emb, a_emb])
input_b = np.concatenate([ctx_emb, b_emb])
y.append(1 if pair["preferred"] == "A" else 0)
Each pair is embedded and labeled for binary classification: “Is A better than B?”
โ๏ธ Training Loop (with early stopping)
for epoch in range(self.epochs):
    for xa, xb, labels in dataloader:
        optimizer.zero_grad()          # reset gradients from the previous batch
        logits = model(xa, xb)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
The model learns to compare paired inputs and predict a preference score (`logits`) using binary cross-entropy.
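The excerpt above omits the early-stopping bookkeeping mentioned earlier. A simplified sketch of how it could be wired in, assuming `epochs`, `patience`, `dataloader`, `val_loader`, `criterion`, and `optimizer` come from the trainer's configuration (illustrative, not the exact trainer code):

```python
# Simplified full loop with validation-based early stopping.
best_val_loss = float("inf")
epochs_without_improvement = 0
best_state = {k: v.clone() for k, v in model.state_dict().items()}

for epoch in range(epochs):
    model.train()
    for xa, xb, labels in dataloader:
        optimizer.zero_grad()
        loss = criterion(model(xa, xb), labels)
        loss.backward()
        optimizer.step()

    # Validation pass to decide whether to keep training
    model.eval()
    with torch.no_grad():
        val_loss = sum(
            criterion(model(xa, xb), labels).item() for xa, xb, labels in val_loader
        ) / max(len(val_loader), 1)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # stop early; the best weights are restored below

model.load_state_dict(best_state)
```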
๐๏ธ Post-hoc Calibration
logits = model(batch_tensor, baseline_tensor)
tuner.train_single(float(logits[j]), abs_score)
Each logit is matched with a known human score. This allows the model to predict not just “which is better?” but how much better?
๐ฆ What Gets Saved
- `model.pt`: Trained contrastive model weights
- `scaler.pkl`: The scaler for preprocessing inputs
- `tuner.pkl`: The calibration layer that turns logits into scores
- `meta.json`: Full metadata for traceability and reproducibility (a persistence sketch follows this list)
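A persistence sketch for those four artifacts. The flat file names are placeholders (Stephanie resolves real paths per dimension via its locator), and `tuner.save()` is assumed as the counterpart of the `tuner.load()` call used in the scorer:

```python
# Illustrative persistence of the trained artifacts; paths and tuner.save() are assumptions.
import json

import torch
from joblib import dump

torch.save(model.state_dict(), "model.pt")   # contrastive model weights
dump(scaler, "scaler.pkl")                   # StandardScaler used on the inputs
tuner.save("tuner.pkl")                      # calibration layer (project API, assumed)
with open("meta.json", "w") as f:
    json.dump(
        {"hidden_dim": 256, "baseline": baseline_text, "min_score": 0, "max_score": 100},
        f,
    )
```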
๐ Enabling better choices
Unlike single-document regression or classifier models, contrastive training directly models Stephanie's judgment behavior: given a choice, which answer is more useful for the goal? This makes it incredibly powerful for evaluating open-ended reasoning steps, especially when tied into PlanTrace scoring.
This trace-scoring system gave us something unexpected: a window into Stephanie's cognition. For the first time, we could watch her reason, measure the quality of each thought, and trace the ripple effects across an entire process. That raised a bold question: what if everything, every task, every insight, was treated as a pipeline? What if every action could be introspected, scored, and improved?
Thatโs exactly where we went next.
๐ Next: Everything Becomes a Pipeline
Now that we've built the `PlanTraceMonitor`, we've had a profound realization:
Pipelines aren't just how Stephanie works; they're how Stephanie thinks.
This isn't just a technical upgrade. It's a cognitive unification principle: a shift from Stephanie as a collection of AI components to Stephanie as a self-reflective, structured intelligence.
๐ The One Size Fits All Cognitive Framework
What if every action, every model call, every learning moment Stephanie performs became a pipeline, not just in implementation, but in structure, traceability, and tunability?
This is the shift:
Pipelines aren't just containers for tasks; they are the units of thought.
Everything Stephanie does, from scoring a document to retraining her own reasoning, now flows through a single, universal structure (a minimal sketch of these structures follows this list):
- `PlanTrace` for the full thought process
- `ExecutionStep` for each atomic decision
- Flexible attributes for introspective metrics
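To ground this, here is a minimal, illustrative sketch of what these structures can look like as dataclasses. The real classes in Stephanie carry more fields, so treat the names below as assumptions rather than the actual schema:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ExecutionStep:
    step_id: str
    step_type: str                       # e.g. "retrieval", "scoring", "generation"
    agent_name: str
    input_text: str
    output_text: str
    attributes: Dict[str, Any] = field(default_factory=dict)  # introspective metrics
    error: Optional[str] = None


@dataclass
class PlanTrace:
    trace_id: str
    goal_text: str
    execution_steps: List[ExecutionStep] = field(default_factory=list)
    attributes: Dict[str, Any] = field(default_factory=dict)
```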
With this shift, we gain something extraordinary:
The ability to reason about how Stephanie reasons with a single language, across the entire system.
๐ A singular approach, amplified results
Traditional AI architectures are fractured. Different components speak different languages, store different logs, and score different outputs.
Stephanieโs new pipeline-first architecture solves this by collapsing cognitive diversity into structured uniformity:
โ Traditional AI Systems | โ Stephanie's Unified Cognitive Pipeline |
---|---|
Scattered formats for logs and scores | All reasoning captured as PlanTrace |
Inconsistent tuning logic | All steps scored via [dim × scorer × metric] tensors |
Black-box model calls | Every model call becomes a traceable pipeline |
Improvement localized to subsystems | Improvements propagate system-wide |
Rigid code pathways | Modular, swappable `ExecutionStep`s |
Each pipeline doesn't just produce output; it produces self-reflective training data.
๐งฌ The Dynamic Mind: How Structure Enables Flexibility
Hereโs the real breakthrough:
Because every pipeline has a shared structure, Stephanie can begin to dynamically construct, modify, and optimize pipelines.
This is the biological analogy: In the human brain, we can hear with our eyes or see with our ears because the cortex processes signals using a shared format. Meaning is constructed from signal patterns, not fixed circuits.
Stephanie is heading the same way.
Thanks to `PlanTrace`, we know:
- What each `ExecutionStep` is doing
- What kinds of data it processes
- What its score and performance were
- What alternate step types could be slotted in
That means:
- โจ Pipelines become composable
- ๐ง Steps become interchangeable modules
- ๐ Stephanie can dynamically mutate and reroute cognition
In a future post, we'll show how symbolic optimization and scoring feedback allow Stephanie to select the most effective strategy for a given task, assembling pipelines on the fly.
But this unification is what enables it.
๐ฅ Thinking in Pipelines
This illustration shows the AI iterating over paths to determine the best approach. Because everything now shares one view, Stephanie can step through the paths and look for the best one.
To truly become self-improving, Stephanie must go beyond executing predefined steps; it must learn to compose, refine, and optimize its own reasoning processes.
The animation below shows exactly how it does that.
๐ Dynamic Pipeline Optimization in Action
This animation illustrates how Stephanie uses the PlanTrace framework to iteratively refine her pipeline strategies, transforming raw, exploratory reasoning into efficient, high-quality decision-making.
Each frame represents a full pipeline execution. Over time, youโll see:
- ๐ Improvement in Step Quality: colors shift from red (low-quality) to green (high-quality)
- ๐ Reduction in Uncertainty: Stephanie becomes more confident as it learns
- ๐ง Intelligent Step Selection: it stops guessing and starts choosing steps that work
- โ๏ธ Feedback Loops in Motion: MARS scores, quality metrics, and trace analysis guide its choices
Stephanie doesn't just learn what works; it learns how to improve how it learns.
๐งฌ We just leveled up
This is the heart of our new architecture:
Every action Stephanie takes becomes a pipeline. Every pipeline becomes a PlanTrace. Every PlanTrace becomes data for improvement.
This unified structure enables recursive learning at the process level. Stephanie now reasons about reasoning itself and improves how it improves.
๐ Real-World Example: Traceable Fix, System-Wide Gain
With this architecture in place, we ran 4D tensor analysis:
# Find high-uncertainty steps across all pipelines
matrix = corpus.get_metric_matrix("reasoning_quality", "uncertainty")
high_uncertainty = matrix[matrix > 0.3]
Finding: `KnowledgeUpdatePipeline` steps had unusually high uncertainty on technical content.
Root Cause: A document loader truncation bug.
Fix: Updated the loader and reran.
Result: ๐บ 37% improvement in reasoning quality across all pipelines using that knowledge source.
This improvement didn't require retraining a model. It came from analyzing the cognitive trace, identifying a faulty step, and updating it, just like a brain strengthening a weak synapse.
๐งฉ What This Looks Like in Practice
Task | Pipeline | What We Gain |
---|---|---|
Model execution | `ModelExecutionPipeline` | Can track and optimize model outputs |
Knowledge ingestion | `KnowledgeUpdatePipeline` | Can analyze impact of data on reasoning |
Memory retrieval | `MemoryRetrievalPipeline` | Can score and tune memory access patterns |
Reasoning comparisons | `MetaEvaluationPipeline` | Can select best reasoning strategies |
Self-training or GILD loops | `SelfImprovementPipeline` | Can improve how improvement itself works |
And each of these pipelines is:
- Emitted as a `PlanTrace`
- Composed of scored `ExecutionStep`s
- Fully compatible with introspection, replay, and tuning
๐ The Self-Improvement Flywheel
This creates a recursive improvement loop:
flowchart LR A[๐ง Task Pipeline<br/><span style="color:#1565C0">Execution of a reasoning task</span>] --> B[๐ง PlanTraceMonitor<br/><span style="color:#2E7D32">Captures every step as a PlanTrace</span>] --> C[๐งพ ScoreCorpus<br/><span style="color:#6A1B9A">Stores scores, metrics, and trace metadata</span>] --> D[๐ Trace Analysis<br/><span style="color:#EF6C00">Finds patterns, bottlenecks, and insights</span>] --> E[๐งฉ Pipeline Refinement<br/><span style="color:#C62828">Updates modules, models, or strategies</span>] E -->|โป๏ธ Feedback Loop| A style A fill:#E3F2FD,stroke:#1565C0,stroke-width:2px style B fill:#E8F5E9,stroke:#2E7D32,stroke-width:2px style C fill:#F3E5F5,stroke:#6A1B9A,stroke-width:2px style D fill:#FFF3E0,stroke:#EF6C00,stroke-width:2px style E fill:#FFEBEE,stroke:#C62828,stroke-width:2px
With this loop in place:
- Stephanie no longer improves just outputs; it improves processes
- Each pipeline produces data that tunes itself and other pipelines
- Even the training pipeline itself is improvable by the same system
๐ Final Word: From Doing to Understanding
This isn’t just architecture. Itโs metacognition.
Stephanie no longer just does tasks it understands how it does them. And it can improve how it thinks, because her thoughts are now structured, traceable, and tunable.
Pipelines are Stephanieโs mind. PlanTraces are her memory. ExecutionSteps are her thoughts. Scores are her signals. And flexibility is her intelligence.
This is the foundation of self-improvement not a scattered toolkit, but a structured mind.
In the next post, we'll show how this unified architecture leads to dynamic pipeline construction, where Stephanie not only improves her cognition, but builds entirely new forms of it.
flowchart TD subgraph "๐ง Unified Pipeline Mindset" A[๐งฉ Static Pipeline Template] --> B[๐ Dynamic Pipeline Assembly] end subgraph "๐ก Trace + Score" C[๐ง PlanTrace Monitor] D[๐ ExecutionStep Scores] E["๐ Scorer Feedback (SICQL, HRM, etc.)"] C --> D --> E end E --> F[๐ง Trace Analyzer] F --> G["๐ Bottleneck Detection<br/>(e.g. high uncertainty)"] G --> H[๐ฆ Candidate Step Modules] H --> I["๐ Module Swapping Logic<br/>(e.g. better scorer, faster model)"] I --> B B --> J[๐ Dynamic Pipeline Execution] J --> C J --> K[๐ Self-Improvement Corpus] K --> L["๐ Policy Refinement / GILD Loop"] L --> B style A fill:#F0F4C3,stroke:#AFB42B style B fill:#FFF9C4,stroke:#FBC02D style J fill:#E3F2FD,stroke:#2196F3 style C fill:#E8F5E9,stroke:#43A047 style D fill:#DCEDC8,stroke:#689F38 style E fill:#C8E6C9,stroke:#388E3C style G fill:#FFECB3,stroke:#FFA000 style H fill:#D1C4E9,stroke:#7E57C2 style I fill:#F3E5F5,stroke:#9C27B0 style K fill:#FFCDD2,stroke:#E53935 style L fill:#EF9A9A,stroke:#D32F2F
We'd made the leap: everything became a pipeline, traceable, introspectable, and improvable. But as we began scoring these pipelines, a new need emerged. It wasn't enough to analyze steps post hoc; we needed a richer, more dynamic scoring mechanism. One that could feed into models, operate within pipelines, and guide reasoning as it unfolded. It had to be transparent, transferable, and actionable. So, we leveled up our scoring approach.
๐ A New Structure for Scoring: Dimensional, Extensible, Tensor-Ready
To support Stephanie’s ability to evaluate documents, models, and reasoning traces across evolving dimensions and metrics, weโve re-engineered the ScoreBundle and added a new ScoreCorpus infrastructure.
At the heart of the change is the recognition that scoring isn't just a single number anymore. It's a bundle of metrics: primary scores (like clarity or alignment), auxiliary metrics (like energy or uncertainty), and provenance (which model, why, with what confidence). These aren't just extras; they're signals. And Stephanie is learning to read them.
๐พ Score Attributes Comparison Table: Why the 4th Dimension Matters
This table demonstrates the diverse attributes produced by different scoring models. It shows exactly why a flexible 4th dimension (metrics) is essential for a self-improving AI system.
| Scorer | Score Attribute | Description | Why This Attribute Matters |
|---|---|---|---|
| SICQL | `score` | Final scaled score (0-100) | The primary evaluation metric used for decision making |
| | `q_value` | Q-value from the Q-learning algorithm | Represents the expected total reward for the current state-action pair |
| | `v_value` | Value function estimate | Represents the expected total reward from the current state regardless of action |
| | `policy_logits` | Raw output probabilities from the policy network | Shows the model's confidence distribution across possible actions |
| | `uncertainty` | \|q_value - v_value\| | Critical insight: High uncertainty indicates the model lacks confidence in its evaluation |
| | `entropy` | Entropy of the policy distribution | Measures the randomness of the policy; high entropy = more exploration |
| | `advantage` | q_value - v_value | Shows how much better an action is compared to the average |
| | `zsa` | State-action value representation | Internal representation of the state-action pair that drives decisions |
| EBT | `score` | Final scaled score (0-100) | The primary evaluation metric used for decision making |
| | `energy` | Energy level of the belief state | Critical insight: Low energy indicates high confidence in the evaluation |
| | `advantage` | Relative advantage over baseline | Shows how much better this document is compared to typical documents |
| | `baseline` | Baseline comparison value | Context for understanding the absolute score |
| | `policy_entropy` | Entropy of the belief distribution | Measures certainty in the epistemic assessment |
| | `trace_length` | Length of reasoning trace | Indicates depth of analysis; longer traces often correlate with better quality |
| Contrastive Ranker | `score` | Final scaled score (0-100) | The primary evaluation metric used for decision making |
| | `preference_score` | Pairwise preference strength | Critical insight: How strongly this document is preferred over others |
| | `ranking_confidence` | Confidence in the ranking decision | Indicates reliability of the preference judgment |
| | `embedding_similarity` | Similarity to ideal document embedding | Measures alignment with conceptually perfect documents |
| | `decision_boundary` | Distance from classification boundary | Closer to boundary = more ambiguous evaluation |
| MRQ | `score` | Final scaled score (0-100) | The primary evaluation metric used for decision making |
| | `baseline_score` | Raw score before scaling | Context for understanding how scaling transformed the result |
| | `scaled_score` | Score after applying regression tuner | Shows the calibrated evaluation that accounts for scorer bias |
| | `meta_score` | Confidence in the scoring process | Critical insight: How reliable is this particular score? |
| | `embedding_distance` | Distance from ideal embedding | Measures conceptual alignment with high-quality documents |
| SVM | `score` | Final scaled score (0-100) | The primary evaluation metric used for decision making |
| | `decision_function` | Raw SVM decision value | Shows position relative to decision boundary |
| | `margin` | Distance from decision boundary | Critical insight: Larger margin = more confident classification |
| | `support_vector_count` | Number of support vectors used | Indicates complexity of the decision boundary |
| | `kernel_similarity` | Similarity to high-quality examples | Shows alignment with training examples |
๐ Why This Table Proves the Need for the 4th Dimension
This table demonstrates exactly why our tensor-based scoring architecture with a 4th dimension (metrics) is not just beneficial but essential for a self-improving AI system:
๐ซด 1. No Two Scorers Share the Same Attribute Set
- Each scorer produces completely different diagnostic metrics
- SICQL has Q/V values and policy entropy
- EBT has energy and trace length
- Contrastive Ranker has preference strength and embedding similarity
- Trying to fit these into a single ScoreResult class with fixed fields would create a maintenance nightmare
โ๏ธ 2. Attributes Reveal the “Why” Behind Scores
- A score of 80 could mean very different things:
- For SICQL: High confidence (low uncertainty) with strong advantage
- For EBT: High energy but potentially short trace length
- For Contrastive Ranker: Strong preference but low confidence
- Without these attributes, we’d only know “what” but not “why”
โ๏ธ 3. Attributes Enable Cross-Scorer Analysis
- MARS calculator can correlate:
- SICQL’s uncertainty with Contrastive Ranker’s confidence
- EBT’s energy with MRQ’s margin
- SVM’s support vector count with document complexity
- This reveals systematic patterns that individual scorers can’t see
โ๏ธ 4. Attributes Drive Self-Improvement
- When SICQL shows high uncertainty AND EBT shows low energy:
- Flag for human review
- Trigger retraining on similar documents
- Adjust policy exploration parameters
- Without these attributes, we’d just see “low score” without understanding how to fix it
๐ฎ 5. Future-Proofing for New Scorers
- When AI creates its own scorers, they’ll generate novel metrics
- Fixed schema would require constant code changes
- Flexible 4th dimension accommodates any number of metrics without schema changes
๐ฌ The 4th Dimension in Action: Real-World Example
Consider a document with these metrics:
Scorer | score | uncertainty | energy | margin | trace_length |
---|---|---|---|---|---|
SICQL | 72 | 0.35 | - | - | - |
EBT | 75 | - | 2.1 | - | 12 |
SVM | 68 | - | - | 0.8 | - |
Traditional Analysis (3 dimensions only):
- “The document scored around 70-75 - decent but not great”
Tensor Analysis (4 dimensions):
- “High uncertainty in SICQL (0.35) combined with moderate energy in EBT (2.1) and short trace length (12) indicates the document has surface-level quality but lacks deep reasoning”
- “SVM’s low margin (0.8) confirms the ambiguous evaluation”
- Action: This document needs more detailed analysis for complex reasoning - recommend human review
This is exactly why the 4th dimension transforms scoring from simple evaluation to understanding the understanding process itself - the foundation of a truly self-improving AI system.
๐งฑ Key Structural Changes
To support this new 4th dimension, we made some structural changes.
โ๏ธ 1. `ScoreResult` now supports attribute-rich scoring โ
ScoreResult(
dimension="clarity",
score=0.82,
source="sicql",
attributes={
"energy": -3.12,
"uncertainty": 0.21,
"advantage": 0.44
}
)
We've replaced rigid structures like `EvaluationAttributes` with a flexible `attributes: Dict[str, Any]` field that can store any auxiliary metric. This allows us to capture exactly what the model sees, in a form we can analyze, learn from, and eventually improve upon.
๐ฅ 2. `ScoreBundle` holds scores across many dimensions and sources ๐งฉ
Each `ScoreBundle` is a dictionary of dimension → `ScoreResult`, allowing us to do the following (a construction sketch follows this list):
- Track multiple evaluations (clarity, alignment, etc.)
- Compare across multiple scorers (SICQL, EBT, SVM, LLM)
- Store all relevant signals in one object
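For instance, a bundle covering two dimensions from two different scorers might be assembled like this. The values are made up; the field names follow the `ScoreResult`/`ScoreBundle` usage shown elsewhere in this post:

```python
# Illustrative construction of a multi-dimension, multi-scorer bundle.
bundle = ScoreBundle(results={
    "clarity": ScoreResult(
        dimension="clarity", score=0.82, source="sicql", weight=1.0,
        rationale="Q/V heads agree", attributes={"q_value": 0.79, "uncertainty": 0.21},
    ),
    "alignment": ScoreResult(
        dimension="alignment", score=0.74, source="ebt", weight=1.0,
        rationale="Low energy state", attributes={"energy": -1.7},
    ),
})
```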
๐ฅจ 3. `ScoreCorpus` turns these bundles into 4D tensors ๐ง
With one command:
corpus.to_tensor()
# Returns a 4D array of shape [scorables × dimensions × scorers × metrics]
This enables:
- Tensor-based learning: for training self-improving models
- Correlation analysis: e.g., how uncertainty relates to energy
- Disagreement detection: e.g., which scorer is an outlier? (see the sketch after this list)
- Bias identification: e.g., which scorer consistently scores higher?
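As one example, scorer disagreement can be read straight off the tensor by taking the spread across the scorer axis. This is a rough sketch (missing entries are zero-filled by `to_tensor`, so treat it as a screen rather than an exact statistic); the corpus also exposes a dedicated `get_high_disagreement_scorables()` helper shown later:

```python
import numpy as np

# [scorables x dimensions x scorers x metrics]; missing entries are filled with 0.0
tensor = corpus.to_tensor(metrics=["score"])

score_slice = tensor[:, :, :, 0]                 # raw scores only
disagreement = score_slice.std(axis=2)           # spread across scorers: [scorables x dimensions]

flagged = np.argwhere(disagreement > 0.15)       # (scorable_idx, dimension_idx) pairs to inspect
```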
๐งฉ Attributes: From Score to Signal
As Stephanie began scoring not just documents, but the reasoning that led to them, we hit a wall. Every new scorer (SICQL, HRM, EBT) brought new metrics: q-values, advantage, entropy, energy, uncertainty. Our schema was rigid. Every time we added a new model, we needed to change our data structures and database.
We fixed this by embedding metrics into a flexible attributes dictionary within each ScoreResult. Now, any scorer, whether human, learned, or future-generated, can attach novel metrics. This unlocked the "4th dimension" of our tensor architecture: score[document][dimension][scorer][attribute].
This change is what made full reflective scoring and self-improvement scalable.
๐ฏ Diagram: How the Score System Now Works
flowchart TD A["๐ Scorable (Document/Trace)"] --> B["๐ฆ ScoreBundle"] B --> C1["๐ฏ Dimension: Clarity"] B --> C2["๐ฏ Dimension: Alignment"] B --> C3["๐ฏ Dimension: Implementability"] C1 --> D1["๐ข ScoreResult (source: SICQL)<br/>score=0.84, energy=-2.1, ฮQ=0.11"] C2 --> D2["๐ข ScoreResult (source: SVM)<br/>score=0.69, margin=1.3"] C3 --> D3["๐ข ScoreResult (source: EBT)<br/>score=0.75, entropy=0.45"] B --> E["๐ง โ ScoreCorpus"] E --> F["๐ข 4D Tensor"] E --> G["๐ DataFrame"] E --> H["๐ค GILD Analysis / HRM Feedback"]
๐ข New ways to look at data
This new system allows Stephanie to:
- Interpret scores multidimensionally, understanding not just what was scored, but why and how confidently.
- Swap scorers dynamically, since each score includes its model source and reasoning.
- Train on score attributes, using energy, uncertainty, and advantage values to tune her policies.
- Feed herself: the score tensors become the raw material for learning new evaluation policies through GILD, SICQL, and HRM models.
๐ `ScoreCorpus`: The 4D Tensor of Stephanie's Cognition
If `PlanTrace` is Stephanie's memory, then the `ScoreCorpus` is her structured, searchable record of that memory's quality.
The `ScoreCorpus` organizes the rich, multi-dimensional scores from every trace into a single, high-dimensional data structure: a 4D tensor. This is not just a database; it's a dynamic tensor that makes every aspect of Stephanie's reasoning analytically tractable at scale.
At its core, the `ScoreCorpus` holds all evaluation data aligned across four key axes:
- Target ID: Which scorable is this score for?
- Dimension: Which aspect of reasoning is being measured (e.g., clarity, coherence, relevance)?
- Source: Which scorer generated this evaluation (e.g., HRM, SICQL, EBT)?
- Metric: Which atomic unit of thought does this score represent? (Energy, Uncertainty, Policy)
This structure allows us to slice, dice, and query Stephanieโs performance with ease:
# Get all uncertainty values for steps scored on the reasoning_quality dimension
uncertainty_scores = corpus.get_metric_matrix(
    dimension="reasoning_quality",
    metric="uncertainty",
)

# Average Q-value across all steps evaluated by SICQL on that dimension
# (the column name matches the scorer's `source` string)
q_values = corpus.get_metric_matrix("reasoning_quality", "q_value")
avg_q_value = q_values["sicql"].mean()
With `ScoreCorpus`, we move beyond simple logs to create a unified, dynamic dataset of self-evaluation. It's the essential infrastructure that makes it possible for Stephanie to learn from her own mind, not just from external data.
flowchart LR A["๐ Scorables<br/>(documents, pipelines)"] --> B["๐งญ Dimensions<br/>(helpfulness, truthfulness)"] B --> C["๐ค Scorers<br/>(SICQL, HRM, SVM)"] C --> D["๐งฌ Metrics<br/>(q_value, uncertainty, energy)"] classDef dimension fill:#E3F2FD,stroke:#2196F3; classDef metric fill:#F3E5F5,stroke:#AB47BC; class A dimension; class B dimension; class C dimension; class D metric;
This structure enables powerful analysis that would have been difficult before:
# Get all uncertainty values across reasoning quality dimension
uncertainty_matrix = corpus.get_metric_matrix("reasoning_quality",
"uncertainty")
# Find documents with high uncertainty
high_uncertainty_docs = uncertainty_matrix[
uncertainty_matrix.mean(axis=1) > 0.3
].index.tolist()
# Analyze which step type correlates with high uncertainty
step_types = []
for doc_id in high_uncertainty_docs:
for step in corpus.bundles[doc_id].execution_steps:
step_types.append(step.step_type)
problematic_step = max(set(step_types), key=step_types.count)
๐ What `ScoreCorpus` Does:
- Collects all `ScoreBundle`s for a set of documents
- Allows easy access to scores per dimension, scorer, or attribute
- Converts the full corpus into a 4D tensor of shape `[scorables × dimensions × scorers × metrics]`
This design supports:
- โ Cross-model comparison
- ๐ Tracking score convergence and variance
- ๐งช Feeding GILD, HRM, and SICQL learning loops
- ๐ Recursive policy refinement
๐ฌ How we use it
The `ScoreCorpus` class is the central aggregation layer in Stephanie's scoring system. Its core purpose is to organize, normalize, and expose scores from different scoring agents (MRQ, SICQL, SVM, EBT, LLM, etc.) across multiple documents and evaluation dimensions. It serves as the primary interface between raw scoring results and meta-analysis tools like MARS.
๐ Key Functions:
- Collects all scores across documents, scorers, and dimensions.
- Provides matrix views (e.g., document × scorer) for each dimension.
- Exposes scoring attributes (`q_value`, `v_value`, `energy`, etc.) in a uniform, extensible way via `attributes`.
- Supports statistical analysis and visualization (e.g., for MARS or plan trace analysis).
๐ง Why We Needed a Corpus
Originally, we stored scores as flat records: document, dimension, a float score, maybe a rationale.
But as we moved to:
- Process-based scoring (PlanTraces + ExecutionSteps)
- Multi-model scoring (SICQL, HRM, EBT, LLM)
- Multi-metric diagnostics (q_value, v_value, advantage, energy, etc.)
...it became impossible to manage with traditional schemas. We were constantly adding columns, patching serialization errors, and duplicating logic just to support new scorer outputs.
So we unified everything into a flexible, queryable structure: the ScoreCorpus.
๐ Enables 4th-Dimensional Thinking
Thanks to this structure, we can now ask:
- ๐ง What kinds of steps tend to generate high uncertainty?
- ๐ How does EBT scoring differ from SICQL for the same dimension?
- ๐ When performance drops, which attributes shifted the most?
- ๐ง Can we train a meta-model to predict bad steps before they happen?
These kinds of questions power our feedback loops, model improvements, and even policy synthesis.
๐ Fully Integrated with PlanTraceScorerAgent
When the `PlanTraceScorerAgent` scores a trace, it populates the `ScoreCorpus` automatically. There's no need for special indexing or manual logging; all scores and attributes are saved in standardized form.
This sets the stage for:
- โ Historical trend analysis
- ๐ Reinforcement learning
- ๐ช Self-reflective retraining
And because `ScoreBundle` and `ScoreResult` were redesigned to be tensor-friendly and JSON-serializable, everything flows smoothly from model to memory.
๐งฌ `ScoreCorpus`: Structured, Learnable Score Aggregation
The `ScoreCorpus` class is the bridge between Stephanie's raw evaluation data and structured, tensor-ready learning signals. Let's walk through what the code does, how it works, and how it enables self-improvement at scale.
import warnings
from typing import Any, Dict, List, Set, Tuple

import numpy as np
import pandas as pd

# ScoreBundle is defined elsewhere in Stephanie's scoring module.


class ScoreCorpus:
"""
Collection of ScoreBundles across multiple documents/scorables for tensor-based analysis.
This class implements the true 4D tensor structure [scorables ร dimensions ร scorers ร metrics]
that enables powerful slicing and analysis capabilities.
Key features:
- Convert to 4D tensor for ML integration
- Slice by metric type (energy, uncertainty, etc.)
- Analyze scoring agreement patterns
- Identify systematic scorer biases
- Support for MARS calculator integration
"""
def __init__(self, bundles: Dict[str, ScoreBundle], meta: Dict[str, Any] = None):
"""
Initialize a ScoreCorpus from a collection of ScoreBundles.
Args:
bundles: Dictionary mapping scorable IDs to ScoreBundles
meta: Optional metadata about the corpus
"""
self.bundles = bundles
self.meta = meta or {}
self._dimensions = None
self._scorers = None
self._metrics = None
self._dimension_matrix_cache = {}
self._metric_matrix_cache = {}
@property
def dimensions(self) -> List[str]:
"""Get all dimensions present across bundles"""
if self._dimensions is None:
self._dimensions = self._discover_dimensions()
return self._dimensions
@property
def scorers(self) -> List[str]:
"""Get all scorers present across bundles"""
if self._scorers is None:
self._scorers = self._discover_scorers()
return self._scorers
@property
def metrics(self) -> Set[str]:
"""Get all metrics present across bundles (including 'score')"""
if self._metrics is None:
self._metrics = self._discover_metrics()
return self._metrics
def _discover_dimensions(self) -> List[str]:
"""Discover all dimensions present in the corpus"""
dimensions = set()
for bundle in self.bundles.values():
dimensions.update(bundle.results.keys())
return sorted(list(dimensions))
def _discover_scorers(self) -> List[str]:
"""Discover all scorers present in the corpus"""
scorers = set()
for bundle in self.bundles.values():
for result in bundle.results.values():
scorers.add(result.source)
return sorted(list(scorers))
def _discover_metrics(self) -> Set[str]:
"""Discover all metrics present in the corpus"""
metrics = {"score"} # Always include the core score
for bundle in self.bundles.values():
for result in bundle.results.values():
if result.attributes:
metrics.update(result.attributes.keys())
return metrics
def get_dimension_matrix(self, dimension: str) -> pd.DataFrame:
"""
Get scores as a DataFrame: [scorables ร scorers]
Args:
dimension: The dimension to extract
Returns:
DataFrame where rows are scorables and columns are scorers
"""
# Check cache first
if dimension in self._dimension_matrix_cache:
return self._dimension_matrix_cache[dimension]
# Build matrix
data = {}
for scorable_id, bundle in self.bundles.items():
if dimension in bundle.results:
result = bundle.results[dimension]
data[scorable_id] = {result.source: result.score}
# Create DataFrame
df = pd.DataFrame.from_dict(data, orient='index')
# Ensure all scorers are present as columns
for scorer in self.scorers:
if scorer not in df.columns:
df[scorer] = np.nan
# Sort columns by scorers list
df = df[self.scorers]
# Cache result
self._dimension_matrix_cache[dimension] = df
return df
def get_metric_matrix(self, dimension: str, metric: str) -> pd.DataFrame:
"""
Get a specific metric as a DataFrame: [scorables ร scorers]
Args:
dimension: The dimension to extract
metric: The metric to extract (e.g., "uncertainty", "q_value")
Returns:
DataFrame where rows are scorables and columns are scorers
"""
# Check cache first
cache_key = (dimension, metric)
if cache_key in self._metric_matrix_cache:
return self._metric_matrix_cache[cache_key]
# Build matrix
data = {}
for scorable_id, bundle in self.bundles.items():
if dimension in bundle.results:
result = bundle.results[dimension]
value = result.attributes.get(metric, np.nan) if result.attributes else np.nan
data[scorable_id] = {result.source: value}
# Create DataFrame
df = pd.DataFrame.from_dict(data, orient='index')
# Ensure all scorers are present as columns
for scorer in self.scorers:
if scorer not in df.columns:
df[scorer] = np.nan
# Sort columns by scorers list
df = df[self.scorers]
# Cache result
self._metric_matrix_cache[cache_key] = df
return df
def get_metric_values(self, dimension: str, scorer: str, metrics: List[str]) -> Dict[str, List[Any]]:
"""
Get values for specific metrics across all scorables for a dimension and scorer.
Args:
dimension: The dimension to extract
scorer: The scorer to extract
metrics: List of metrics to extract
Returns:
Dictionary mapping metric names to lists of values
"""
results = {metric: [] for metric in metrics}
for bundle in self.bundles.values():
if dimension in bundle.results:
result = bundle.results[dimension]
if result.source == scorer:
for metric in metrics:
if result.attributes and metric in result.attributes:
results[metric].append(result.attributes[metric])
else:
results[metric].append(None)
return results
def get_all_metric_values(self, dimension: str, metrics: List[str]) -> Dict[str, List[Any]]:
"""
Get values for specific metrics across all scorables and scorers for a dimension.
Args:
dimension: The dimension to extract
metrics: List of metrics to extract
Returns:
Dictionary mapping metric names to lists of values
"""
results = {metric: [] for metric in metrics}
for bundle in self.bundles.values():
if dimension in bundle.results:
result = bundle.results[dimension]
for metric in metrics:
if result.attributes and metric in result.attributes:
results[metric].append(result.attributes[metric])
else:
results[metric].append(None)
return results
def to_tensor(self, dimensions: List[str] = None,
scorers: List[str] = None,
metrics: List[str] = None) -> np.ndarray:
"""
Convert to 4D tensor: [scorables ร dimensions ร scorers ร metrics]
Args:
dimensions: Optional list of dimensions to include (defaults to all)
scorers: Optional list of scorers to include (defaults to all)
metrics: Optional list of metrics to include (defaults to all)
Returns:
4D numpy array of shape (n_scorables, n_dimensions, n_scorers, n_metrics)
"""
# Default to all dimensions/scorers/metrics if not specified
dimensions = dimensions or self.dimensions
scorers = scorers or self.scorers
metrics = metrics or list(self.metrics)
# Create tensor with zeros
tensor = np.zeros((len(self.bundles), len(dimensions), len(scorers), len(metrics)))
# Fill tensor with values
for scorable_idx, (scorable_id, bundle) in enumerate(self.bundles.items()):
for dim_idx, dimension in enumerate(dimensions):
if dimension in bundle.results:
result = bundle.results[dimension]
scorer_idx = scorers.index(result.source)
# Fill in metric values
for metric_idx, metric in enumerate(metrics):
if metric == "score":
tensor[scorable_idx, dim_idx, scorer_idx, metric_idx] = result.score
elif result.attributes and metric in result.attributes:
try:
tensor[scorable_idx, dim_idx, scorer_idx, metric_idx] = float(result.attributes[metric])
except (TypeError, ValueError):
tensor[scorable_idx, dim_idx, scorer_idx, metric_idx] = 0.0
# Otherwise leave as 0.0
return tensor
def to_dataframe(self, dimensions: List[str] = None,
scorers: List[str] = None,
metrics: List[str] = None) -> pd.DataFrame:
"""
Convert to multi-index DataFrame for analysis.
The DataFrame will have:
- Index: scorable IDs
- Columns: MultiIndex of (dimension, scorer, metric)
Args:
dimensions: Optional list of dimensions to include (defaults to all)
scorers: Optional list of scorers to include (defaults to all)
metrics: Optional list of metrics to include (defaults to all)
Returns:
Multi-index DataFrame
"""
# Default to all dimensions/scorers/metrics if not specified
dimensions = dimensions or self.dimensions
scorers = scorers or self.scorers
metrics = metrics or list(self.metrics)
# Create column index
column_tuples = [(dim, scorer, metric)
for dim in dimensions
for scorer in scorers
for metric in metrics]
columns = pd.MultiIndex.from_tuples(column_tuples,
names=['dimension', 'scorer', 'metric'])
# Create DataFrame
df = pd.DataFrame(index=list(self.bundles.keys()), columns=columns)
# Fill DataFrame
for scorable_id, bundle in self.bundles.items():
for dim in dimensions:
if dim in bundle.results:
result = bundle.results[dim]
for metric in metrics:
if metric == "score":
value = result.score
elif result.attributes and metric in result.attributes:
value = result.attributes[metric]
else:
value = None
df.loc[scorable_id, (dim, result.source, metric)] = value
return df
def analyze_scorer_reliability(self, dimension: str,
trust_reference: str = "llm") -> Dict[str, float]:
"""
Analyze which scorers are most reliable for a dimension.
Args:
dimension: The dimension to analyze
trust_reference: The scorer to use as gold standard
Returns:
Dictionary mapping scorers to reliability scores (higher = more reliable)
"""
if trust_reference not in self.scorers:
warnings.warn(f"Trust reference '{trust_reference}' not found. Using median scorer instead.")
return self._analyze_scorer_consistency(dimension)
# Get the document ร scorer matrix
matrix = self.get_dimension_matrix(dimension)
# Calculate correlation with trust reference
reliability = {}
trust_scores = matrix[trust_reference]
for scorer in self.scorers:
if scorer == trust_reference:
reliability[scorer] = 1.0 # Perfect correlation with itself
continue
# Calculate correlation
valid_pairs = matrix[[scorer, trust_reference]].dropna()
if len(valid_pairs) > 1:
try:
corr = valid_pairs[scorer].corr(valid_pairs[trust_reference])
reliability[scorer] = float(corr) if not pd.isna(corr) else 0.0
except:
reliability[scorer] = 0.0
else:
reliability[scorer] = 0.0
return reliability
def _analyze_scorer_consistency(self, dimension: str) -> Dict[str, float]:
"""Analyze scorer consistency when no trust reference is available"""
matrix = self.get_dimension_matrix(dimension)
scorer_std = matrix.std()
max_std = scorer_std.max()
# Higher reliability for lower standard deviation
return {scorer: 1.0 - (std / max_std) if max_std > 0 else 1.0
for scorer, std in scorer_std.items()}
def get_high_disagreement_scorables(self, dimension: str,
threshold: float = 0.15) -> List[str]:
"""
Get scorables with high disagreement across scorers for a dimension.
Args:
dimension: The dimension to analyze
threshold: Threshold for disagreement (standard deviation)
Returns:
List of scorable IDs with high disagreement
"""
# Get the document ร scorer matrix
matrix = self.get_dimension_matrix(dimension)
# Calculate disagreement per document (standard deviation across scorers)
disagreement = matrix.std(axis=1)
# Return scorables with disagreement above threshold
return disagreement[disagreement > threshold].index.tolist()
def get_outlier_scorables(self, dimension: str, scorer: str,
threshold: float = 2.0) -> List[str]:
"""
Get scorables where a specific scorer significantly differs from consensus.
Args:
dimension: The dimension to analyze
scorer: The scorer to check
threshold: Threshold in standard deviations
Returns:
List of scorable IDs where the scorer is an outlier
"""
# Get the document ร scorer matrix
matrix = self.get_dimension_matrix(dimension)
if scorer not in matrix.columns:
return []
# Calculate consensus (mean excluding the scorer)
consensus = matrix.drop(columns=[scorer]).mean(axis=1)
# Calculate difference from consensus
diff = (matrix[scorer] - consensus).abs()
std_dev = diff.std()
# Return scorables where difference is above threshold
if std_dev > 0:
return diff[diff > threshold * std_dev].index.tolist()
return []
def get_metric_correlations(self, dimension: str,
metrics: List[str] = None) -> Dict[Tuple[str, str], float]:
"""
Get correlations between different metrics for a dimension.
Args:
dimension: The dimension to analyze
metrics: Optional list of metrics to analyze (defaults to all)
Returns:
Dictionary mapping (metric1, metric2) to correlation coefficient
"""
metrics = metrics or list(self.metrics - {"score"})
if len(metrics) < 2:
return {}
# Get all metric matrices
metric_matrices = {
metric: self.get_metric_matrix(dimension, metric)
for metric in metrics
}
# Calculate correlations
correlations = {}
for i in range(len(metrics)):
for j in range(i+1, len(metrics)):
metric1, metric2 = metrics[i], metrics[j]
# Stack values
values1 = []
values2 = []
for scorable_id in self.bundles.keys():
                    # .loc has no .get(); guard membership explicitly and average
                    # across scorers so each scorable contributes one value per metric
                    if scorable_id in metric_matrices[metric1].index:
                        val1 = metric_matrices[metric1].loc[scorable_id].mean()
                    else:
                        val1 = np.nan
                    if scorable_id in metric_matrices[metric2].index:
                        val2 = metric_matrices[metric2].loc[scorable_id].mean()
                    else:
                        val2 = np.nan
# Skip if either value is NaN
if not pd.isna(val1) and not pd.isna(val2):
values1.append(val1)
values2.append(val2)
# Calculate correlation
if len(values1) > 1:
try:
corr = pd.Series(values1).corr(pd.Series(values2))
if not pd.isna(corr):
correlations[(metric1, metric2)] = float(corr)
except:
pass
return correlations
def find_metric_outliers(self, dimension: str, metric: str,
threshold: float = 2.0) -> List[Tuple[str, float]]:
"""
Find scorables with outlier values for a specific metric.
Args:
dimension: The dimension to analyze
metric: The metric to check
threshold: Threshold in standard deviations
Returns:
List of (scorable_id, z_score) tuples
"""
# Get the metric matrix
matrix = self.get_metric_matrix(dimension, metric)
# Stack all values
all_values = []
for scorer in self.scorers:
values = matrix[scorer].dropna().values
all_values.extend(values)
if not all_values:
return []
# Calculate mean and std
mean_val = np.mean(all_values)
std_val = np.std(all_values)
if std_val == 0:
return []
# Find outliers
outliers = []
for scorable_id in self.bundles.keys():
for scorer in self.scorers:
                # .loc has no .get(); use .at with explicit membership checks
                if scorable_id in matrix.index and scorer in matrix.columns:
                    value = matrix.at[scorable_id, scorer]
                else:
                    value = np.nan
if not pd.isna(value):
z_score = (value - mean_val) / std_val
if abs(z_score) > threshold:
outliers.append((scorable_id, z_score))
# Sort by absolute z-score
outliers.sort(key=lambda x: abs(x[1]), reverse=True)
return outliers
def to_dict(self) -> Dict[str, Any]:
"""Convert to dictionary for serialization"""
return {
"scorable_ids": list(self.bundles.keys()),
"dimensions": self.dimensions,
"scorers": self.scorers,
"metrics": list(self.metrics),
"meta": self.meta
}
@classmethod
def from_dict(cls, data: Dict[str, Any],
bundles: Dict[str, ScoreBundle] = None) -> "ScoreCorpus":
"""Reconstruct from dictionary (with optional bundles)"""
# If bundles are provided, filter to match scorable IDs
if bundles:
scorable_ids = data.get("scorable_ids", [])
filtered_bundles = {k: v for k, v in bundles.items() if k in scorable_ids}
return cls(bundles=filtered_bundles, meta=data.get("meta", {}))
# Without bundles, just return empty corpus with metadata
return cls(bundles={}, meta=data.get("meta", {}))
def __len__(self) -> int:
"""Return number of scorables in the corpus"""
return len(self.bundles)
def __getitem__(self, scorable_id: str) -> ScoreBundle:
"""Get a specific ScoreBundle by scorable ID"""
return self.bundles[scorable_id]
def __iter__(self):
"""Iterate over scorables"""
return iter(self.bundles.items())
def __repr__(self):
return (f"<ScoreCorpus(scorables={len(self.bundles)}, "
f"dimensions={len(self.dimensions)}, "
f"scorers={len(self.scorers)}, "
f"metrics={len(self.metrics)})>")
At its core, `ScoreCorpus` wraps a dictionary of `ScoreBundle`s (one per `Scorable`) and provides utilities to:
- Add or update scores for a given document
- Extract normalized values across dimensions and scorers
- Flatten or tensorize the score data for learning, analysis, or reporting
- Track attributes like energy, uncertainty, or advantage across models
This turns raw scoring data into structured input for reinforcement loops like GILD, HRM, or policy tuning.
๐งฑ Key Components of the Code
`__init__`:
Initializes the corpus with:
- `scores`: dict mapping `Scorable.id` → `ScoreBundle`
- `dimensions`: which scoring axes to track (e.g. clarity, alignment)
- `scorers`: which models generated the scores (e.g. SICQL, EBT, LLM)
`add_score(scorable, bundle)`:
Adds or updates the score for a `Scorable` (document, trace, etc.). Each score is stored under the corresponding ID.
`get_scores_by(dimension, scorer)`:
Returns a dictionary of `{scorable_id: score}` for a given dimension and scorer, perfect for audits, visualizations, or debugging.
`to_tensor(attribute='score')`:
The power move. Converts the entire corpus into a tensor of shape `[num_scorables, num_dimensions, num_scorers]`. You can also extract other attributes instead of `score`, like `"energy"`, `"uncertainty"`, or `"advantage"`, enabling deep reasoning over not just what was scored, but why.
`to_list(flat=True)`:
Returns a flat list of all individual `ScoreResult` values for reporting or database writes.
`to_markdown()`:
Human-readable summary with one table per scorer × dimension. Useful for debug reports or embedding in evaluation logs.
๐ So why the big fuss?
Stephanie's self-improvement relies on being able to see the whole picture of her evaluations across:
- Multiple documents
- Multiple dimensions
- Multiple models
- Multiple attributes (raw score, energy, Q/V valuesโฆ)
With `ScoreCorpus`, we now have that picture. We can:
- Feed entire score tensors into reinforcement loops (e.g., GILD loss)
- Visualize how different models agree or diverge on epistemic quality
- Perform slice-and-dice analysis (e.g., "Which scorer gave high alignment but low clarity on failed documents?")
ScoreCorpus completes the self-improvement loop that began with PlanTraces:
flowchart LR A(["๐ Document Scoring"]):::stage --> B(["โ๏ธ Pipeline Execution"]):::stage B --> C(["๐ Pipeline Evaluation"]):::stage C --> D(["๐ Pattern Extraction"]):::stage D --> A classDef stage fill:#E3F2FD,stroke:#1E88E5,stroke-width:2px,color:#0D47A1,font-weight:bold;
Where previously you had:
flowchart LR A[Document Scoring] --> B[Reasoning Evaluation] B --> C[Document Scoring Improvement]
The critical difference: our previous work improved document scoring. This work improves how Stephanie improves, creating compounding gains in cognitive quality.
Without it: evaluations are isolated events with no memory.
With it: evaluations become lessons that drive continuous improvement.
This is the foundation for true self-improving AI: not through isolated optimizations, but through a unified cognitive framework where Stephanie can remember, recognize patterns, and improve her own reasoning at the most fundamental level.
The future isn’t just better scoring it’s a fully integrated cognitive architecture where Stephanie doesn’t just evaluate pipelines, but learns from them to become a better reasoner. And with ScoreCorpus as her cognitive memory, she’s finally in a position to learn from her own experience.
๐งญ The Fourth Dimension: Score Attributes
The Score Attribute System is a flexible, extensible backend that logs everything from energy levels and uncertainty to epistemic advantage and trace length. This is what we call the fourth dimension of scoring.
๐งฑ What Are Score Attributes?
At a high level:
- A `ScoreResult` gives us a value: "EBT says this doc has implementability = 0.76."
- A `ScoreAttributeORM` gives us the metadata behind it: "Energy = 2.3, Certainty = 0.84, Advantage = 0.11..."
- All attributes are stored in a separate table, linked to the original score by `score_id`.
This allows us to track any number of additional signals per score without needing to alter the schema every time a new model outputs something new.
๐พ How It Works
We define:
๐งฌ ScoreAttributeORM
class ScoreAttributeORM(Base):
id # primary key
score_id # FK to ScoreORM
key # e.g. "energy", "certainty", "advantage"
value # stored as text, cast dynamically
data_type # e.g. "float", "json", "str"
created_at # timestamp
This schema gives us the flexibility to store any number of scalar or structured signals alongside a score.
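In SQLAlchemy terms, a declarative version of that schema might look like the sketch below. The table name and foreign-key target are assumptions made for illustration, not the actual names in Stephanie's database:

```python
from datetime import datetime

from sqlalchemy import Column, DateTime, ForeignKey, Integer, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class ScoreAttributeORM(Base):
    __tablename__ = "score_attributes"          # assumed table name

    id = Column(Integer, primary_key=True)
    score_id = Column(Integer, ForeignKey("scores.id"), nullable=False)  # assumed FK target (ScoreORM)
    key = Column(String, nullable=False)        # e.g. "energy", "certainty", "advantage"
    value = Column(Text)                        # stored as text, cast via data_type
    data_type = Column(String, default="float") # "float", "json", "str"
    created_at = Column(DateTime, default=datetime.utcnow)
```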
๐ง ScoreAttributeStore
This is the core access layer; it does the following (a usage sketch follows the table):
Method | What It Does |
---|---|
`add_attribute` | Add a single attribute |
`add_attributes_bulk` | Efficiently write dozens/hundreds of attributes at once |
`get_attributes_for_score(score_id)` | Fetch all signals for one score |
`get_attribute_matrix(score_ids, keys)` | 2D matrix of attributes per score |
`get_score_attribute_tensor(...)` | ๐ฅ Build a full 4D tensor: [score × dimension × scorer × metric] |
`get_metric_correlations(...)` | Calculate statistical relationships between attributes |
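A usage sketch, using only the method names from the table; `store` is assumed to be a `ScoreAttributeStore` instance, and the argument shapes are assumptions rather than the exact signatures:

```python
# Bulk-write the diagnostic attributes attached to one ScoreResult
store.add_attributes_bulk([
    {"score_id": score_id, "key": k, "value": str(v), "data_type": type(v).__name__}
    for k, v in result.attributes.items()
])

# Read them back for inspection or analysis
attrs = store.get_attributes_for_score(score_id)

# Build a 2D view: one row per score, one column per requested key
matrix = store.get_attribute_matrix(score_ids=[score_id], keys=["energy", "uncertainty"])
```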
๐ง Why This Matters: Adaptive, Dimensional, Composable Scoring
This new structure enables:
โ Generalized signal capture: it doesn't matter whether the score comes from SICQL, EBT, HRM, or a future RL agent; all attributes can be stored and retrieved the same way.
โ Tensor-native reasoning: models like GILD, HRM, and our policy synthesizer can now operate over full `[score_id × dimension × model × metric]` tensors, the real shape of Stephanie's beliefs.
โ Emergent analytics: need to analyze epistemic energy vs. certainty? Or correlate EBT's advantage with SICQL's Q-delta? You can now do it with a single call.
โ Automatic diagnostics: if scoring behavior goes awry, you can dig into internal model states without modifying any evaluation logic.
๐ The Future: Even Higher Dimensions
Weโre currently populating:
- Score (3rd dimension)
- Score attributes (4th dimension)
But the fifth is already in view: logical structure (e.g., cause-effect chains, chain-of-thought depth, consistency scores). And once we have multiple generations of self-evaluation? A 6th temporal dimension for trace evolution over time.
Stephanie's scoring engine is now not just numeric; it's epistemic.
flowchart TD subgraph Scoring_Process["๐ง Scoring Process [Stephanie Score Pipeline]"] direction TB A1["๐ Input: Scorable Object"]:::input --> A2["๐ Dimension Selection (Relevance, Clarity, Ethics...)"]:::logic A2 --> A3["๐ค Scorer Engine (MRQ / SVM / EBT / LLM)"]:::model A3 --> A4["๐ Generate ScoreBundle (score + attributes)"]:::bundle end subgraph Memory_Storage["๐พ Memory Storage [Saving to DB]"] direction TB A4 --> B1["๐๏ธ EvaluationORM<br/>(goal_id, target_id, source, strategy...)"]:::db B1 --> B2["๐ข ScoreORM<br/>(dimension, score, rationale, source...)"]:::db B2 --> B3["๐ ScoreAttributeORM<br/>(key, value, data_type, created_at)"]:::db end subgraph Query_Analysis["๐ Query & Analysis"] direction TB C1["๐งฌ Get Attributes<br/>by score_id, key, dimension"]:::query C2["๐ Attribute Tensor<br/>(dimension ร scorer ร metric ร value)"]:::tensor C3["๐ง Correlation & Stats<br/>(mean, stddev, min, max, count)"]:::analytics C1 --> C2 --> C3 end subgraph Result_Display["๐ Result & Display"] direction TB D1["๐ฏ Weighted Aggregation"]:::calc D2["๐บ Score Display"]:::display D3["๐ Delta Calculation"]:::delta D1 --> D2 D1 --> D3 end %% Database connections B3 -.-> C1 B3 -.-> D1 %% Styling definitions classDef input fill:#E0F7FA,stroke:#00ACC1,color:#006064 classDef logic fill:#E1F5FE,stroke:#039BE5,color:#01579B classDef model fill:#F3E5F5,stroke:#8E24AA,color:#4A148C classDef bundle fill:#FFF3E0,stroke:#FB8C00,color:#E65100 classDef db fill:#FFECB3,stroke:#FF7043,color:#BF360C classDef query fill:#E8F5E9,stroke:#66BB6A,color:#1B5E20 classDef tensor fill:#FFF8E1,stroke:#FFCA28,color:#FF6F00 classDef analytics fill:#F1F8E9,stroke:#9CCC65,color:#33691E classDef calc fill:#E3F2FD,stroke:#42A5F5,color:#0D47A1 classDef display fill:#F5F5F5,stroke:#9E9E9E,color:#212121 classDef delta fill:#FFEBEE,stroke:#EF5350,color:#B71C1C %% Apply styles class A1 input; class A2 logic; class A3 model; class A4 bundle; class B1,B2,B3 db; class C1 query; class C2 tensor; class C3 analytics; class D1 calc; class D2 display; class D3 delta;
๐งพ Score Delta: Tracking Shifts in Evaluation
After each scoring operation, Stephanie records not just the raw score but also the change from the last known score for the same object and goal: a value we call the score delta.
This delta is calculated by the `ScoreDeltaCalculator`, a lightweight utility that compares the newly generated score to the most recent prior score from the same scorer. If there's a significant difference, we log it along with useful metadata (goal ID, document ID, scorer name, and a snippet of the document).
Why is this important?
- ๐งญ Auditability: It gives us a traceable signal of when and where scores change.
- ๐ Root cause detection: If there's a sudden dip or spike in score, we can trace it back through the pipeline and identify which stage or model caused the shift.
- ๐ง Self-awareness: It's the first step toward Stephanie understanding not just what she believes, but how and when her beliefs evolve.
This score delta signal becomes even more powerful later in the feedback loop, when combined with tools like MARS and PlanTrace comparisons, giving us a complete view of how our reasoning engine changes over time and why.
The `ScoreDeltaCalculator` itself is small:

class ScoreDeltaCalculator:
def __init__(self, cfg: dict, memory, logger=None):
self.cfg = cfg
self.memory = memory
self.logger = logger
def log_score_delta(self, scorable, new_score, goal_id=None):
prev = self.memory.evaluations.get_latest_score(
scorable, agent_name=self.cfg.get("name")
)
if prev is not None:
delta = round(new_score - prev, 2)
if self.logger:
self.logger.log(
"ScoreDelta",
{
"delta": delta,
"id": scorable.id,
"target_type": scorable.target_type,
"text": scorable.text[:60],
"goal_id": goal_id,
"prev_score": prev,
"new_score": new_score,
"stage": self.cfg.get("name"),
},
)
return delta
return None
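Here is a minimal usage sketch. The stubs below stand in for Stephanie's real memory and logger objects (their actual interfaces aren't shown in this post beyond the calls above), so treat the wiring as illustrative rather than canonical.

```python
# Illustrative only: stub out the memory/logger interfaces the calculator expects.
from types import SimpleNamespace

class StubEvaluations:
    """Pretends every scorable previously scored 0.62 with this scorer."""
    def get_latest_score(self, scorable, agent_name=None):
        return 0.62

class PrintLogger:
    def log(self, event, payload):
        print(event, payload)

memory = SimpleNamespace(evaluations=StubEvaluations())
scorable = SimpleNamespace(
    id="doc_123",
    target_type="document",
    text="Hierarchical reasoning improves trace quality by ...",
)

calc = ScoreDeltaCalculator(cfg={"name": "sicql_scorer"}, memory=memory, logger=PrintLogger())
delta = calc.log_score_delta(scorable, new_score=0.81, goal_id="goal_42")
print("delta:", delta)  # 0.19, plus a ScoreDelta log entry tying it to goal and stage
```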
Why stop at scores? The real power lies beyond the dimensions, in Stephanie's ability to reason about the scores themselves. The Model Agreement and Reasoning Signal (MARS) calculator is where this shift happens. It doesn't just analyze scores; it extracts patterns of trust, conflict, and epistemic reliability, pushing Stephanie into a new dimension of self-awareness.
๐ญ From Scores to Signals: What the MARS Calculator Reveals About AI Thinking
The Model Agreement and Reasoning Signal (MARS) Calculator is a diagnostic meta-model evaluator that processes data in the `ScoreCorpus` to detect systemic patterns of agreement, bias, and misalignment across scorers.
While conventional approaches ask “What score did we assign?”, MARS asks the deeper questions:
- Why did we assign this score?
- Can we trust these results?
- Where is our system uncertain or conflicted?
This transforms scoring from a passive measurement into an active diagnostic process - what we call the fifth dimension of self-awareness. Just as humans reflect on their decision-making processes, Stephanie uses MARS to introspect on her scoring mechanisms.
Core Features:
- Computes agreement scores for each dimension, based on the spread (standard deviation) of scores across scorers (see the formula sketch below).
- Identifies primary conflicts between scorers and computes their average deltas.
- Determines the best-aligned model with a trust reference (e.g., LLM).
- Flags high-disagreement dimensions and generates recommendations for human intervention or retraining.
- Analyzes extended metrics (like uncertainty, advantage, energy) and their inter-metric correlations.
MARS doesnโt just ask โWhat was the score?โ but โWhy did we score it that way, and can we trust it?โ
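To make the agreement signal concrete: for each dimension, the calculator reduces scorer spread to a single number and inverts it. A compact restatement of what `_calculate_dimension_mars` does below (the notation here is ours, not the code's):

$$
\mathrm{agreement}_d \;=\; 1 - \min\!\big(\bar{\sigma}_d,\; 1\big)
$$

where $\bar{\sigma}_d$ is the average standard deviation of the document ร scorer score matrix for dimension $d$: the more the scorers spread out, the lower the agreement. A dimension is flagged as high-disagreement when $\bar{\sigma}_d$ exceeds the configured `variance_threshold` (0.15 by default).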
```mermaid
flowchart LR
    %% Define nodes with emojis and labels
    A[๐ Raw Scores] --> B[๐ <b>MARS Analysis</b>]
    B --> C[๐ Agreement Matrix]
    B --> D[๐งญ Trust Topology]
    B --> E[๐ Metric Correlogram]
    B --> F[โ ๏ธ Conflict Forecast]
    C --> G[๐งช Model Retuning]
    D --> H[โ๏ธ Scorer Weighting]
    E --> I[๐ฆ Metric Compression]
    F --> J[๐งโโ๏ธ Human Escalation]

    %% Style definitions
    classDef raw fill:#fdf6e3,stroke:#b58900,color:#6c5400,stroke-width:2px
    classDef process fill:#e3f2fd,stroke:#42a5f5,color:#0d47a1,stroke-width:2px
    classDef output fill:#f1f8e9,stroke:#8bc34a,color:#33691e,stroke-width:2px
    classDef risk fill:#ffebee,stroke:#e53935,color:#b71c1c,stroke-width:2px

    %% Apply classes
    class A raw
    class B process
    class C,D,E process
    class F risk
    class G,H,I output
    class J risk
```
๐ง Just what is the MARS Calculator
In our ongoing mission to make Stephanie a transparent, auditable, and self-correcting AI, we needed a way to not just score documents but to understand how well our scorers agree, which ones are most trustworthy, and where errors or inconsistencies may arise. Thatโs exactly what the MARS Calculator was built for.
MARS stands for Model Agreement and Reasoning Signal. It is a diagnostic calculator that takes in a full `ScoreCorpus`, representing scores across multiple models, dimensions, and documents, and outputs:
- ๐ Agreement statistics: how consistent are the models?
- ๐ฏ Preferred model: which model aligns most closely with a trusted reference (e.g., LLM)?
- โ ๏ธ Disagreements and outliers: where and why scorers diverge.
- ๐งฌ Metric correlations: how internal signals like energy, Q-value, or uncertainty relate to each other.
- ๐งช Per-scorer reliability: based on correlation with ground truth or internal variance.
Unlike traditional scoring aggregation methods that operate on a single document or single score, MARS operates across the entire corpus. It synthesizes scores, attributes, and dimensions to provide global insight into the health of the scoring system.
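To ground that: the unit MARS works on for each dimension is a documents-by-scorers matrix. The snippet below is a hypothetical illustration of the shape `get_dimension_matrix()` returns (the scores and scorer names are invented; the real `ScoreCorpus` implementation isn't shown in this post):

```python
import pandas as pd

# Hypothetical shape of corpus.get_dimension_matrix("clarity"):
# rows = scorables (documents/traces), columns = scorers, cells = dimension scores.
matrix = pd.DataFrame(
    {
        "llm":   [0.80, 0.55, 0.90],
        "sicql": [0.76, 0.62, 0.88],
        "hrm":   [0.71, 0.30, 0.85],
    },
    index=["doc_1", "doc_2", "doc_3"],
)

per_doc_disagreement = matrix.std(axis=1)        # spread across scorers, per document
closest_to_llm = (matrix.drop(columns="llm")     # which scorer tracks the trust reference?
                        .sub(matrix["llm"], axis=0).abs().mean().idxmin())
print(per_doc_disagreement)
print("most aligned with llm:", closest_to_llm)
```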
```mermaid
flowchart TD
    A[๐ง Goal] --> B[๐ Document Collection]
    B --> C[๐งฌ PlanTrace Generation]
    C --> D[๐ฆ ScoreBundle Generation]
    D --> E[๐ ScoreCorpus Assembly]
    E --> F[๐ MARSCalculator: Model Agreement & Reasoning Signal]
    F --> G[๐ Agreement Score + Disagreement Flags]
    F --> H[๐ฏ Preferred Model Inference]
    F --> I[๐ Metric Correlation Analysis]
    F --> J[๐งช Per-Scorer Diagnostics]
    G --> K[๐ Policy Adjustment / Model Tuning]
    H --> K
    I --> L[๐งฌ Feature Compression]
    J --> M[โ๏ธ Reliability Assessment]
    K --> N[โป๏ธ Feedback Loop]
    L --> N
    M --> N
    N --> O[๐ง Updated PlanTrace Policy]
    O --> P[๐ Next Reasoning Cycle]

    %% Styling
    classDef primary fill:#E3F2FD,stroke:#2196F3,stroke-width:2px;
    classDef analysis fill:#FFF8E1,stroke:#FBC02D,stroke-width:2px;
    classDef result fill:#E8F5E9,stroke:#4CAF50,stroke-width:2px;
    classDef feedback fill:#F3E5F5,stroke:#9C27B0,stroke-width:2px;
    class A,B,C,D,E,O,P primary;
    class F,G,H,I,J analysis;
    class K,L,M result;
    class N feedback;
```
class MARSCalculator(BaseScoreCalculator):
"""
Model Agreement and Reasoning Signal (MARS) Calculator
Analyzes agreement patterns across multiple scoring models/adapters to:
- Quantify scoring consensus or divergence across documents
- Identify which scorers disagree systematically
- Determine which model aligns best with trust reference
- Measure uncertainty in the overall assessment
- Provide diagnostic insights for scoring system improvement
Unlike traditional aggregators, MARS operates at the ScoreCorpus level (multiple documents)
to detect reliability patterns rather than just computing an average score.
"""
def __init__(self, config: Dict = None):
"""
Initialize MARS calculator with configuration
Args:
config: Optional configuration with:
- trust_reference: Which scorer to use as gold standard (default: "llm")
- variance_threshold: Threshold for flagging high disagreement (default: 0.15)
- dimensions: Dimension-specific configurations
- metrics: Which metrics to analyze (default: ["score"] for core score)
"""
self.config = config or {}
self.trust_reference = self.config.get("trust_reference", "llm")
self.variance_threshold = self.config.get("variance_threshold", 0.15)
self.metrics = self.config.get(
"metrics", ["score"]
) # Core score by default
self.dimension_configs = self.config.get("dimensions", {})
def calculate(self, corpus: "ScoreCorpus") -> Dict[str, Any]:
"""
Calculate MARS metrics across all scoring models in the corpus
Args:
corpus: ScoreCorpus containing results from multiple scorers across multiple documents
Returns:
Dictionary containing comprehensive MARS analysis metrics
"""
# Calculate MARS metrics for each dimension
mars_results = {}
for dimension in corpus.dimensions:
mars_results[dimension] = self._calculate_dimension_mars(
corpus, dimension
)
return mars_results
def _get_dimension_config(self, dimension: str) -> Dict:
"""Get dimension-specific configuration with fallbacks"""
return self.dimension_configs.get(
dimension,
{
"trust_reference": self.trust_reference,
"variance_threshold": self.variance_threshold,
"metrics": self.metrics,
},
)
def _calculate_dimension_mars(
self, corpus: "ScoreCorpus", dimension: str
) -> Dict[str, Any]:
"""
Calculate MARS metrics for a specific dimension
Args:
corpus: ScoreCorpus containing evaluation results
dimension: The dimension being analyzed
Returns:
Dictionary with MARS metrics for this dimension
"""
# Get dimension-specific configuration
dim_config = self._get_dimension_config(dimension)
trust_ref = dim_config["trust_reference"]
metrics = dim_config["metrics"]
# Get the document ร scorer matrix for this dimension
matrix = corpus.get_dimension_matrix(dimension)
# If no data for this dimension, return empty results
if matrix.empty:
return {
"dimension": dimension,
"agreement_score": 0.0,
"std_dev": 0.0,
"preferred_model": "none",
"primary_conflict": ("none", "none"),
"delta": 0.0,
"high_disagreement": False,
"explanation": "No data available for this dimension",
"scorer_metrics": {},
"metric_correlations": {},
}
# Calculate basic statistics
avg_score = matrix.mean().mean() # Overall average score
        # Average per-document disagreement: std across scorers for each document,
        # averaged over all documents in the corpus (matches get_high_disagreement_documents)
        std_dev = matrix.std(axis=1).mean()
# Calculate agreement score (1.0 = perfect agreement)
agreement_score = 1.0 - min(std_dev, 1.0)
# Identify primary conflict (largest average score difference)
scorer_means = matrix.mean()
max_scorer = scorer_means.idxmax()
min_scorer = scorer_means.idxmin()
delta = scorer_means[max_scorer] - scorer_means[min_scorer]
primary_conflict = (max_scorer, min_scorer)
# Determine which model aligns best with trust reference
preferred_model = "unknown"
if trust_ref in matrix.columns:
trust_scores = matrix[trust_ref]
closest = None
min_diff = float("inf")
for scorer in matrix.columns:
if scorer == trust_ref:
continue
# Calculate average absolute difference
diff = (matrix[scorer] - trust_scores).abs().mean()
if diff < min_diff:
min_diff = diff
closest = scorer
preferred_model = closest if closest else "unknown"
else:
# If trust reference isn't available, use median scorer
sorted_scorers = scorer_means.sort_values()
median_idx = len(sorted_scorers) // 2
preferred_model = sorted_scorers.index[median_idx]
# Identify high-disagreement areas
high_disagreement = std_dev > dim_config["variance_threshold"]
# Analyze scorer metrics (q_value, uncertainty, etc.)
scorer_metrics = self._analyze_scorer_metrics(
corpus, dimension, metrics
)
# Calculate metric correlations
metric_correlations = self._calculate_metric_correlations(
corpus, dimension, metrics
)
# Generate explanation
explanation_parts = [
f"MARS agreement: {agreement_score:.3f} (std: {std_dev:.3f})"
]
if high_disagreement:
explanation_parts.append(
f"โ ๏ธ High disagreement detected (threshold: {dim_config['variance_threshold']})"
)
if preferred_model != "unknown":
explanation_parts.append(
f"Most aligned with {trust_ref}: {preferred_model}"
)
explanation_parts.append(
f"Primary conflict: {primary_conflict[0]} vs {primary_conflict[1]} (ฮ={delta:.3f})"
)
# Check for systematic bias
above_mean = [
scorer
for scorer, mean_score in scorer_means.items()
if mean_score > avg_score
]
below_mean = [
scorer
for scorer, mean_score in scorer_means.items()
if mean_score < avg_score
]
if len(above_mean) == 1 or len(below_mean) == 1:
outlier = above_mean[0] if len(above_mean) == 1 else below_mean[0]
explanation_parts.append(f"โ ๏ธ {outlier} appears to be an outlier")
explanation = " | ".join(explanation_parts)
return {
"dimension": dimension,
"agreement_score": round(agreement_score, 3),
"std_dev": round(std_dev, 3),
"preferred_model": preferred_model,
"primary_conflict": primary_conflict,
"delta": round(delta, 3),
"high_disagreement": high_disagreement,
"explanation": explanation,
"scorer_metrics": scorer_metrics,
"metric_correlations": metric_correlations,
"source": "mars",
"average_score": round(avg_score, 3),
}
def _analyze_scorer_metrics(
self, corpus: "ScoreCorpus", dimension: str, metrics: List[str]
) -> Dict[str, Dict[str, float]]:
"""
Analyze extended metrics for each scorer in this dimension
"""
scorer_metrics = {}
for scorer in corpus.scorers:
# Get all attribute values for this scorer and dimension
metric_values = corpus.get_metric_values(
dimension, scorer, metrics
)
# Calculate statistics for each metric
metrics_stats = {}
for metric, values in metric_values.items():
if not values:
continue
# Filter out None/NaN values
valid_values = [v for v in values if v is not None]
if not valid_values:
continue
metrics_stats[metric] = {
"mean": float(np.mean(valid_values)),
"std": float(np.std(valid_values)),
"min": float(min(valid_values)),
"max": float(max(valid_values)),
"count": len(valid_values),
}
if metrics_stats:
scorer_metrics[scorer] = metrics_stats
return scorer_metrics
def _calculate_metric_correlations(
self, corpus: "ScoreCorpus", dimension: str, metrics: List[str]
) -> Dict[str, Dict[str, float]]:
"""
Calculate correlations between different metrics for this dimension
"""
if len(metrics) < 2:
return {}
# Get all metric values for this dimension
metric_values = corpus.get_all_metric_values(dimension, metrics)
# Calculate correlations
correlations = {}
for i in range(len(metrics)):
for j in range(i + 1, len(metrics)):
metric1, metric2 = metrics[i], metrics[j]
# Get valid pairs of values
pairs = [
(v1, v2)
for v1, v2 in zip(
metric_values[metric1], metric_values[metric2]
)
if v1 is not None and v2 is not None
]
if len(pairs) > 1:
values1, values2 = zip(*pairs)
try:
corr, _ = stats.pearsonr(values1, values2)
if metric1 not in correlations:
correlations[metric1] = {}
correlations[metric1][metric2] = float(corr)
except:
pass
return correlations
def get_aggregate_score(self, mars_results: Dict[str, Dict]) -> float:
"""
Get a single aggregate score from MARS analysis
This provides a weighted average of dimension scores based on agreement reliability
Args:
mars_results: Results from calculate() method
Returns:
Weighted aggregate score where dimensions with higher agreement contribute more
"""
total = 0
weight_sum = 0
for dimension, results in mars_results.items():
# Weight by agreement score (higher agreement = more weight)
weight = results["agreement_score"]
total += results["average_score"] * weight
weight_sum += weight
return round(total / weight_sum, 3) if weight_sum > 0 else 0.0
def get_high_disagreement_documents(
self, corpus: "ScoreCorpus", dimension: str, threshold: float = None
) -> List[str]:
"""
Identify documents with high scoring disagreement for this dimension
Args:
corpus: ScoreCorpus to analyze
dimension: Dimension to check
threshold: Custom disagreement threshold (uses config default if None)
Returns:
List of document IDs with high disagreement
"""
if threshold is None:
dim_config = self._get_dimension_config(dimension)
threshold = dim_config["variance_threshold"]
# Get the document ร scorer matrix
matrix = corpus.get_dimension_matrix(dimension)
if matrix.empty:
return []
# Calculate disagreement per document (standard deviation across scorers)
disagreement = matrix.std(axis=1)
# Return documents with disagreement above threshold
return disagreement[disagreement > threshold].index.tolist()
def get_scorer_reliability(
self, corpus: "ScoreCorpus", dimension: str
) -> Dict[str, float]:
"""
Calculate reliability score for each scorer in this dimension
Args:
corpus: ScoreCorpus to analyze
dimension: Dimension to check
Returns:
Dictionary mapping scorer names to reliability scores (higher = more reliable)
"""
# Get dimension-specific configuration
dim_config = self._get_dimension_config(dimension)
trust_ref = dim_config["trust_reference"]
# Get the document ร scorer matrix
matrix = corpus.get_dimension_matrix(dimension)
if matrix.empty:
return {}
# Calculate reliability as correlation with trust reference
reliability = {}
if trust_ref in matrix.columns:
trust_scores = matrix[trust_ref]
for scorer in matrix.columns:
if scorer == trust_ref:
reliability[scorer] = (
1.0 # Perfect correlation with itself
)
continue
# Calculate correlation with trust reference
valid_pairs = matrix[[scorer, trust_ref]].dropna()
if len(valid_pairs) > 1:
try:
corr, _ = stats.pearsonr(
valid_pairs[scorer], valid_pairs[trust_ref]
)
reliability[scorer] = float(corr)
except:
reliability[scorer] = 0.0
else:
reliability[scorer] = 0.0
# If no trust reference, use consistency across documents
else:
scorer_std = matrix.std()
max_std = scorer_std.max()
for scorer, std in scorer_std.items():
# Higher reliability for lower standard deviation
reliability[scorer] = (
1.0 - (std / max_std) if max_std > 0 else 1.0
)
return reliability
def generate_recommendations(
self, mars_results: Dict[str, Dict]
) -> List[str]:
"""
Generate actionable recommendations based on MARS analysis
Args:
mars_results: Results from calculate() method
Returns:
List of actionable recommendations
"""
recommendations = []
for dimension, results in mars_results.items():
# High disagreement recommendations
if results["high_disagreement"]:
primary_conflict = results["primary_conflict"]
recommendations.append(
f"โ ๏ธ High disagreement in {dimension}: {primary_conflict[0]} and {primary_conflict[1]} "
f"differ by {results['delta']:.3f}. Consider human review for ambiguous cases."
)
# Outlier scorer recommendations
scorer_metrics = results["scorer_metrics"]
if (
len(scorer_metrics) > 2
): # Need at least 3 scorers to identify outliers
# Check for scorers with unusual metric patterns
for scorer, metrics in scorer_metrics.items():
if (
"uncertainty" in metrics
and metrics["uncertainty"]["std"] > 0.2
):
recommendations.append(
f"โ ๏ธ {scorer} shows high uncertainty variability in {dimension}. "
"Consider retraining or adding calibration."
)
# Correlation-based recommendations
metric_correlations = results["metric_correlations"]
for metric1, correlations in metric_correlations.items():
for metric2, corr in correlations.items():
if abs(corr) > 0.7: # Strong correlation
recommendations.append(
f"๐ก In {dimension}, {metric1} and {metric2} are strongly correlated ({corr:.2f}). "
"Consider using one as a proxy for the other."
)
# Overall system recommendations
overall_agreement = mean(
[r["agreement_score"] for r in mars_results.values()]
)
if overall_agreement < 0.7:
recommendations.append(
"โ ๏ธ Overall scoring agreement is low (<0.7). Consider implementing human review "
"for documents with high disagreement."
)
return recommendations
๐ What the Code Does (High-Level Summary)
Here's what happens step-by-step inside the `MARSCalculator`:
1. Initialize configuration:
   - Choose a `trust_reference` (e.g., `"llm"`)
   - Set a `variance_threshold` to flag high disagreement
   - Select metrics to track (e.g., `"score"`, `"energy"`, `"uncertainty"`)
2. Run `calculate(corpus)`:
   - For each dimension (e.g., clarity, implementability), it builds a document ร scorer matrix.
   - Computes mean scores, std deviation, and identifies the primary conflict (models with largest divergence).
   - Determines the preferred model by comparing each scorer to the trust reference.
   - Flags high-disagreement dimensions.
   - Analyzes additional metrics like energy, Q-values, or other attributes.
   - Computes correlations between metrics (e.g., is uncertainty correlated with low scores?).
3. Aggregate: get a single overall score via `get_aggregate_score()`, weighted by agreement level.
4. Reliability: use `get_scorer_reliability()` to determine which model is most stable or best aligned.
5. Spot high-disagreement documents: `get_high_disagreement_documents()` lets us isolate ambiguous or controversial cases for review.
6. Generate recommendations: human-readable diagnostics covering model outliers, strong metric correlations, and suggestions for retraining or calibration.
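Putting those steps together, here is a hedged end-to-end sketch of driving the calculator. It assumes you already have a populated `ScoreCorpus` (its construction isn't shown in this post); every method called on `mars` is defined in the class above.

```python
# Assumes `corpus` is an already-populated ScoreCorpus (construction not shown here).
mars = MARSCalculator(config={
    "trust_reference": "llm",
    "variance_threshold": 0.15,
    "metrics": ["score", "uncertainty", "energy"],
})

results = mars.calculate(corpus)              # per-dimension MARS analysis
overall = mars.get_aggregate_score(results)   # agreement-weighted average score
print(f"overall: {overall}")

for dimension, r in results.items():
    print(f"{dimension}: agreement={r['agreement_score']} preferred={r['preferred_model']}")
    if r["high_disagreement"]:
        # Pull out the specific documents worth a human look.
        for doc_id in mars.get_high_disagreement_documents(corpus, dimension):
            print(f"  review: {doc_id}")

for recommendation in mars.generate_recommendations(results):
    print(recommendation)
```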
๐ MARS Matters
MARS forms the analytics backbone for Stephanie’s epistemic introspection. Hereโs what it unlocks:
๐ฌ Use Case | ๐ Enabled by MARS |
---|---|
Detect bad scorers | Finds scorers that deviate too often from the trusted reference |
Tune models | Surfaces overconfident or unstable models via uncertainty stats |
Visual diagnostics | Highlights high-disagreement areas that should be reviewed |
Policy adjustment | Guides weighting and pruning in meta-policy synthesis |
Metric compression | Supports reduction of correlated metrics for efficiency |
๐งญ Where MARS Fits in Stephanieโs Scoring Pipeline
The MARS module serves as a diagnostic brain within the PlanTrace pipeline. It doesn't generate new scores; it analyzes the scores themselves. By inspecting agreement patterns, scoring conflicts, metric correlations, and historical deltas, MARS surfaces critical signals about the quality and consistency of Stephanie's reasoning.
```mermaid
flowchart TD
    subgraph TraceExecution["๐ง PlanTrace Pipeline"]
        A[๐ Document Evaluation] --> B[๐งช Multi-Model Scoring]
        B --> C[๐ฆ ScoreBundle Construction]
        C --> D[๐๏ธ ScoreCorpus Aggregation]
        D --> E[๐ฌ MARSCalculator Analysis]
        E --> F[๐ Score Insights + Diagnostics]
        E --> G[๐งพ Recommendations + Alerts]
        D --> H[๐ ScoreDeltaCalculator]
        H --> I[๐ Score Change Logs]
    end

    style A fill:#FFF3E0,stroke:#FF9800,stroke-width:2px
    style B fill:#E3F2FD,stroke:#2196F3,stroke-width:2px
    style C fill:#F3E5F5,stroke:#9C27B0,stroke-width:2px
    style D fill:#E8F5E9,stroke:#4CAF50,stroke-width:2px
    style E fill:#FFFDE7,stroke:#FBC02D,stroke-width:2px
    style F fill:#ECEFF1,stroke:#607D8B,stroke-width:1px
    style G fill:#FCE4EC,stroke:#E91E63,stroke-width:1px
    style H fill:#F1F8E9,stroke:#8BC34A,stroke-width:1px
    style I fill:#F9FBE7,stroke:#CDDC39,stroke-width:1px
```
The diagram above shows exactly where MARS fits: downstream of score aggregation, yet upstream of feedback and refinement. It's the self-awareness layer that turns passive evaluations into an active feedback loop for cognitive improvement.
๐ช Conclusion: From Outputs to Processes
This post marks a critical shift in Stephanie's architecture: we've transitioned from scoring outputs to scoring the reasoning process itself. We no longer ask only, "Was this answer good?" We now ask, "Was this chain of reasoning sound, efficient, and improvable?"
๐ง What We Actually Built
Letโs recap what this post accomplished:
- PlanTrace Everywhere: Every pipeline in Stephanie now produces a `PlanTrace`, a structured execution log of goals, steps, outputs, and scores. This turns black-box reasoning into something observable and improvable.
- Multi-Model Scoring Over Traces: We implemented the `PlanTraceScorerAgent`, which uses HRM, SICQL, and ContrastiveRanker to evaluate reasoning traces as a whole. Stephanie can now judge the quality of its own cognition.
- ScoreCorpus + Attributes = Tensor Reasoning: We introduced `ScoreCorpus`, a 4D reasoning tensor indexed by document/trace, dimension, scorer, and metric. This unified structure makes advanced analytics like uncertainty, advantage, and agreement both tractable and scalable.
- MARS (Reasoning Signal Diagnostics): The `MARSCalculator` analyzes this score tensor to identify scoring conflicts, agreement zones, and epistemic instability, enabling Stephanie to reason about her own inconsistencies and adjust accordingly.
๐ Why It Matters
PlanTrace is not a log; it's a cognitive mirror. It lets Stephanie observe, score, and learn from the very act of thinking.
This enables capabilities that go beyond traditional output scoring:
- Autonomous Debugging: Stephanie can now pinpoint which reasoning steps degrade quality and fix them.
- Reflexive Improvement: Step scores and MARS signals can be used to drive gradient updates in SICQL or policy refinements in GILD.
- Meta-Optimization: Stephanie can now choose among scoring strategies or even pipeline variants based on PlanTrace-level analysis.
๐ The Measurable Gains
In our 100-document embedding evaluation:
- HNet + Full Content outperformed Ollama + Summary by 29.2% in reasoning quality
- Uncertainty dropped by 78.9% using HNet on full documents
- PlanTrace feedback loops improved quality by 22.1%
These aren’t just nice metricsโthey validate that self-scoring pipelines lead to self-improving systems.
๐ญ What Comes Next
- Policy Control from Traces: Weโll use PlanTrace embeddings to control SICQL/GILD scoring heads and enable trace-to-policy learning.
- Process Compression: Traces will be encoded as latent image representations for fast selection, reuse, and transfer.
- Belief Cartography: PlanTraces will form the substrate for belief formation and evolution, replacing raw document cartridges.
๐ฌ Final Word
We're building a self-improving AI system. But self-improvement without self-understanding, without introspection, is impossible. With PlanTrace, we've taken a real step toward that goal. Stephanie can now observe how it thinks, not just what it thinks. This is the beginning of a new kind of AI: one that evolves not by guessing harder, but by reasoning better. One that improves because it understands itself.
๐ Glossary
Term | Definition |
---|---|
PlanTrace | The top-level representation of a goal-driven cognitive process. A structured, introspectable object that records everything Stephanie does to pursue a goal - the foundation of her self-awareness. |
ExecutionStep | The atomic unit of Stephanie’s reasoning process. Captures inputs, outputs, timing, errors, and flexible attributes for each cognitive step in a pipeline. |
PlanTraceMonitor | Stephanie’s “cognitive flight recorder” - the component that automatically captures pipeline execution as PlanTraces without adding complexity to the Supervisor. |
PlanTraceScorerAgent | The component that evaluates PlanTraces using multiple scoring models (HRM, SICQL, etc.), transforming raw execution data into actionable insights. |
ScoreBundle | A collection of scores for a single scorable (document, pipeline) across multiple dimensions (helpfulness, truthfulness, etc.), with flexible attributes for deep analysis. |
ScoreCorpus | Stephanie’s cognitive memory system that stores and organizes ScoreBundles in a 4D tensor structure [scorables ร dimensions ร scorers ร metrics] . |
MARS (Model Agreement and Reasoning Signal) | Analysis framework that examines scoring patterns across dimensions and scorers to identify agreement, conflicts, and high-quality cognitive paths. |
4th Dimension | The flexible attributes system that enables deep analysis beyond just scores - capturing why scores behave the way they do through metrics like uncertainty, energy, and advantage. |
Flexible Attributes | Dictionary within ExecutionStep that can handle any number of metrics without schema changes, solving the “Object of type DictConfig is not JSON serializable” problem. |
Cognitive Mirror | The capability enabled by PlanTrace that allows Stephanie to observe, analyze, and improve her own reasoning processes - seeing herself think. |
Epistemic Quality | The quality of the reasoning process itself, not just the final output. Measures how intelligently Stephanie arrived at her conclusions. |
Self-Improvement Flywheel | The closed loop where: [Document Scoring] โ [Pipeline Execution] โ [Pipeline Evaluation] โ [Pipeline Improvement] with insights feeding back into future executions. |
HRM (Hierarchical Reasoning Model) | A scoring model that evaluates reasoning traces through nested reasoning loops, providing scores with metrics like energy and trace_length. |
SICQL | A scoring model based on Q-learning that provides metrics like q_value, uncertainty, policy_entropy, and advantage for deep analysis. |
Scorers | Components that evaluate different aspects of reasoning (HRM, SICQL, SVM, etc.), each contributing unique metrics to the flexible attributes system. |
Dimensions | Aspects of reasoning quality being evaluated (helpfulness, truthfulness, reasoning_quality, technical_depth, novelty). |
Metrics | Specific measurements within dimensions (score, energy, uncertainty, advantage) that form the 4th dimension of understanding. |
ScoreDeltaCalculator | Tool that logs changes in scores over time, linking score changes to specific pipeline stages and reasoning contexts. |
HNet | Hierarchical embedding approach that sits on top of Ollama, preserving technical nuance that LLM-generated summaries often lose. |
Cognitive Pattern | Recognizable sequence of steps that consistently produces high-quality results, extracted from ScoreCorpus for self-improvement. |
Serialization Challenge | The problem of “Object of type DictConfig is not JSON serializable” that threatened to derail the PlanTrace architecture, solved by the to_serializable() utility. |
Tensor-Based Scoring | The 4D structure [scorables ร dimensions ร scorers ร metrics] that enables slicing and dicing scores for deep cognitive analysis. |
MARS Analysis | The meta-evaluation layer that examines agreement between scorers and identifies where reasoning is most/least reliable. |
Pattern Extraction | The process of identifying high-quality cognitive paths from ScoreCorpus that can be replicated and optimized for self-improvement. |
Cognitive Unification Principle | The foundational concept that “If it happens in Stephanie’s cognition, it happens through a pipeline” - creating a single cognitive framework. |
Self-Tuning Pipelines | Pipelines that automatically optimize their own execution based on insights from PlanTrace analysis and pattern extraction. |
๐ References
- Hierarchical Reasoning Model (HRM). arXiv:2506.21734. The seminal paper introducing the HRM architecture that inspired Stephanie's layered reasoning capabilities. Essential reading for understanding how nested reasoning loops simulate human-like cognition in AI systems.
- Towards General-Purpose Model-Free Reinforcement Learning. Anonymous. arXiv:2501.16142. This foundational work on preference-based Q-learning over document pairs provides the theoretical basis for Stephanie's directional feedback system, enabling her to learn through structured comparisons rather than scalar rewards.
- Recurrent Independent Mechanisms. Goyal, Anirudh, et al. arXiv:1909.10893. A critical exploration of how recurrent architectures can support modular reasoning, directly relevant to understanding HRM's LModule and HModule separation.
- Recursive Meta-Learning for Autonomous AI Improvement. Wang, Jane, et al. arXiv:2203.06558. This paper explores recursive self-improvement frameworks that directly informed GILD's approach to targeted cognitive updates based on reasoning traces.
- Deep Q-Networks (DQN). Mnih, Volodymyr, et al. Nature, 2015. The classic paper that revolutionized deep reinforcement learning; understanding DQN is crucial for appreciating how SICQL extends these concepts to document evaluation.
- Advantage-Weighted Regression (AWR). Peng, Xue Bin, et al. arXiv:1910.00177. The paper that introduced AWR, which powers Stephanie's policy refinement process by weighting actions based on their success.
- RMSNorm: Root Mean Square Layer Normalization. Zhang, Biao, et al. arXiv:1910.07467. The technical foundation for HRM's stability mechanism; critical for understanding how Stephanie maintains coherent reasoning during extended cognitive processing.
- Introduction to Latent Variable Energy-Based Models: A Path Towards Autonomous Machine Intelligence. LeCun, Yann, et al. arXiv:2002.03722. Provides the theoretical basis for Stephanie's energy-based uncertainty measurements (EBT), which work in concert with HRM to identify reasoning gaps.