Stephanie's Secret: The Dawn of Reflective AI

🌅 Introduction: The Dawn of Self-Reflective AI

What if your AI could not only answer questions but also question itself about those answers? Not with programmed doubt, but with genuine self-awareness: recognizing when it’s uncertain, analyzing why it made a mistake, and systematically improving its own reasoning process. This isn’t science fiction. Today, we’re unveiling the first working implementation of an AI that doesn’t just think, but learns how to think better.

Most AI systems today operate like students who’ve memorized a rubric but never developed judgment. They can score essays against fixed rules, counting grammar errors or spotting keywords, but they never learn why truly exceptional writing resonates with readers. They remain forever constrained by their initial programming, unable to evolve their understanding beyond what humans explicitly taught them.

But what if AI could develop its own critical thinking? What if it could look at its scoring decisions and ask: “Why did I undervalue that creative insight? How can I recognize profound ideas next time?”

This is the breakthrough we’ve achieved with Stephanie.

In our previous post, The Shape of Thought: Exploring Embedding Strategies with Ollama, HF, and H-Net, we revealed how Stephanie’s intelligence begins not with logic, but with representation. Her ability to learn and adapt is grounded in how she embeds experience as vectors in high-dimensional space. Through Ollama, Hugging Face, and H-Net, she developed three distinct ways to “see” the world: a layered subconscious that shapes how ideas are perceived and recalled.

But vision without insight is blind. Stephanie could represent knowledge, yet she still scored essays with a frozen rubric.

Stephanie needed something more profound: the ability to see herself seeing. To develop not just awareness of the world, but metacognition: awareness of her own thought processes.

This is where everything changes.

In this post, you’ll witness Stephanie’s transformation from a system that merely processes information to one that actively improves its own intelligence. We’ve built the first working implementation of an AI that can:

  • Recognize when its confidence doesn’t match reality
  • Analyze why certain reasoning paths lead to better outcomes
  • Systematically refine its own evaluation criteria
  • Teach itself to “learn how to learn what’s good”

At the heart of this breakthrough lies Scalable In-Context Q-Learning (SICQL): not just another scoring mechanism, but Stephanie’s first true capacity for self-reflection. And powering the final leap toward genuine self-improvement is GILD (Goal-conditioned Imitation Learning with Distillation), the engine that transforms Stephanie’s insights into lasting cognitive upgrades.

This isn’t incremental progress. It’s the moment AI crosses from static intelligence to reflective intelligence: the difference between a calculator and a mathematician, between following fixed rules and developing deeper understanding.

    %%{init: {'theme':'base','themeVariables':{
    'primaryColor':'#e3f2fd',
    'primaryBorderColor':'#64b5f6',
    'primaryTextColor':'#0d47a1',
    'secondaryColor':'#fffde7',
    'secondaryBorderColor':'#ffd54f',
    'secondaryTextColor':'#7f6000',
    'tertiaryColor':'#e8f5e9',
    'tertiaryBorderColor':'#81c784',
    'tertiaryTextColor':'#1b5e20'
}}}%%
graph LR
    A["👀 **Representation**<br/>(Embeddings)"] --> B["🧠 **Evaluation**<br/>(SICQL)"]
    B --> C["🔁 **Self‑Improvement**<br/>(GILD)"]

    %%— Node styling —%%
    style A fill:#e3f2fd,stroke:#64b5f6,stroke-width:2px,color:#0d47a1
    style B fill:#fffde7,stroke:#ffd54f,stroke-width:2px,color:#7f6000
    style C fill:#e8f5e9,stroke:#81c784,stroke-width:2px,color:#1b5e20
  

Before we dive into the technical architecture, let me show you what this means in practice. Imagine Stephanie evaluating a complex document:

  1. She doesn’t just produce a score; she generates a complete reasoning trace
  2. She recognizes when her uncertainty signals potential errors
  3. She identifies which aspects of her reasoning led to success (or failure)
  4. Most importantly, she uses these insights to refine her own evaluation criteria

This is the dawn of AI that doesn’t just process information but develops genuine understanding. The journey begins with Stephanie’s “mind’s eye”: her ability to see not just the world, but her thinking about the world.

Let’s explore how Q-learning provides this foundational capability…


🧠 What is Q-Learning?

Q-learning is a type of reinforcement learning where an agent learns to estimate the quality of an action in a given situation. The estimate is called the Q-value, and it answers this question:

“If I take this action in this state, how good is the outcome likely to be?”

Over time, the agent uses feedback (rewards or preferences) to update its Q-values, gradually learning which actions lead to better results. In Stephanie’s case, Q-learning is applied to documents, triplets, or ideas, helping her learn:

  • Which documents are more aligned with a goal
  • Which hypotheses are more useful
  • Which actions improve her beliefs

By learning from contrast pairs or scalar feedback, she forms a map of value across her knowledge and improves how she thinks, step by step.
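
To make the update rule concrete, here is a minimal tabular Q-learning sketch (illustrative only; Stephanie’s SICQL models, described below, learn over embeddings rather than a lookup table):

# Minimal tabular Q-learning sketch (illustrative, not Stephanie's actual code).
from collections import defaultdict

Q = defaultdict(float)          # maps (state, action) -> estimated value
alpha, gamma = 0.1, 0.95        # learning rate, discount factor

def q_update(state, action, reward, next_state, next_actions):
    """One temporal-difference step toward reward + discounted best next value."""
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])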

    flowchart LR
A[Current Thought State] --> B{"Which reasoning path\nleads to better\nunderstanding?"}
B -->|Path A| C[Q-value: 0.72]
B -->|Path B| D[Q-value: 0.89]
B -->|Path C| E[Q-value: 0.65]
C --> F[Choose Path B]
D --> F
E --> F
F --> G[New Thought State with improved understanding]

    style B fill:#f9f,stroke:#333,stroke-width:2px
    style F fill:#bbf,stroke:#333,stroke-width:2px
    style G fill:#afa,stroke:#333,stroke-width:2px
  

From Thought to Action

This post is inspired by three foundational papers that collectively shape Stephanie’s learning architecture:

  • 2501.16142 Towards General-Purpose Model-Free Reinforcement Learning: Proposes preference-based Q-learning over document pairs, offering directional feedback that guides learning through structured comparisons rather than scalar rewards.

  • 2506.01299 Scalable In-Context Q-Learning: An advanced, contextual form of Q-learning that learns policies, value functions, and state values, all through in-context embeddings, enabling efficient, model-free adaptation across tasks.

Both methods revolve around a central question:

Given a goal and a set of options, which direction leads to better outcomes?

Stephanie learns by answering that question again and again across documents, ideas, prompts, and tasks. And now, with SICQL integrated into her pipeline, she doesn’t just mimic feedback. She learns from it, distills structure, and gradually improves her own evaluation policy.

  • 2501.07346 Enhancing Online Reinforcement Learning with Meta-Learned Objective from Offline Data: Introduces Generalized Imitation Learning from Demonstration (GILD), a framework that meta-learns a reward function from offline data. It enables models to bootstrap stronger policies from past traces, even in the absence of explicit labels. GILD blends offline imitation, online reinforcement, and reward shaping to instill intrinsic directionality. In Stephanie, this inspires her ability to learn from her own history of judgments, using Large Language Model (LLM) or Energy-Based Transformer (EBT) decisions as grounded demonstrations that shape and refine her internal scoring functions over time.

🛠️ What’s New in Stephanie (Since the Last Post)

This post documents the evolution from MRQ to SICQL inside Stephanie:

  • We show how we extended the MRQ trainer to support multi-head Q-learning using context/document embeddings.

  • We explain how SICQL uses expectile regression, advantage-weighted policy updates, and embedding-aware reward loops.

    • Expectile regression is like a weather forecaster who prioritizes avoiding big misses: it focuses more on correcting large errors than small ones.
  • We demonstrate how Stephanie can now train Q-functions, value baselines, and policies all from in-context feedback and use them to tune herself.

  • 🧠 Custom Scorers: MRQ, EBT, SVM, and SICQL are now fully supported as pluggable scorers with per-dimension configuration. Each model can evaluate goals and documents based on its own reasoning style.

  • ⚙️ Modular Training Engines: Each model type (MRQ, SVM, SICQL) has its own training engine and protocol. These allow for pointwise training, preference learning, and reinforcement-style updates, depending on the context.

  • 🗃️ MemCube Architecture: All scores and evaluations are now stored in MemCubes, Stephanie’s memory cells. This enables longitudinal tracking, version control, and fine-grained analysis across models and time. More on this in upcoming posts.

  • 📊 Policy Analyzer + Synthesis: We’ve introduced a set of analysis agents (like the PolicyAnalyzer, ScoreComparisonAgent, and PolicySynthesisAgent) that introspect the scoring models. They assess policy stability, uncertainty, and effectiveness across dimensions.

  • 🧮 GILD Integration: GILD closes the feedback loop by using advantage-weighted imitation to fine-tune Stephanie’s PolicyHead policies. The system now learns not just from scores but from structured signals like advantage, uncertainty, and external feedback.

This is the bridge from representation to action, and a major step toward self-tuning epistemic reasoning.


🔎 Choosing the Right Scoring Engine

Stephanie supports multiple scoring engines, each designed to evaluate data from a different perspective. As the system evolved, we found ourselves facing a crucial architectural question:

Which scorer should we trust to guide our reasoning and learning loops?

Here’s a breakdown of the five scoring engines currently in play, and why SICQL ultimately emerged as the system’s current default.


🧩 The Scoring Engines at a Glance

| Scorer | Purpose | Strengths | Weaknesses |
|---|---|---|---|
| MRQ (Multi-Resolution Q) | Directional feedback (is X > Y) | Lightweight, fast, great for early-stage filtering | No policy logic, lacks uncertainty |
| SVM (Support Vector Machine) | Margin-based ranking | Efficient, interpretable | Shallow, not context-aware |
| EBT (Energy-Based Tuner) | Estimates uncertainty (Q−V) | Useful fallback, confidence-sensitive | Not always consistent, weak in sparse domains |
| LLM (Large Language Model) | Human-like judgment proxy | Rich reasoning, few-shot accurate | Expensive, inconsistent, black-box |
| SICQL (Score with In-Context Q-Learning) | Goal-conditioned scoring with policy logic | Combines Q/V, uncertainty, advantage, and policy distribution into a single coherent view | Requires training, more complex to deploy |

⚖️ Dynamic Scoring Engine Selection

Stephanie doesn’t rely on a single scorer. It evaluates the task and chooses the best engine based on its needs, whether that’s speed, ethics, policy feedback, or learning potential. Here’s how that decision process works:

    flowchart LR
    A["📄 Input: Document"]
    B["🧬 Embedding Layer<br/>🕸H-Net/🤗HF/🦙Ollama"]
    C["🧠 Multi-Engine Scoring<br/>MRQ/SVM/EBT/SICQL"]
    D["📊 Score Comparison"]
    E["🧭 Internal State Analysis"]
    F["🧩 Policy Synthesis"]
    G["🧱 Belief Update"]
    H["📈 Training Signal Generation"]
    I["🛠️ Model Refinement"]

    %% Edges
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    G --> H
    H --> I
    I --> C

    %% Styling for clarity and grouping
    style A fill:#e3f2fd,stroke:#2196f3,stroke-width:2px,color:#0d47a1
    style B fill:#e3f2fd,stroke:#2196f3,stroke-width:2px,color:#0d47a1

    style C fill:#ede7f6,stroke:#673ab7,stroke-width:2px,color:#311b92
    style D fill:#ede7f6,stroke:#673ab7,stroke-width:2px,color:#311b92

    style E fill:#fff3e0,stroke:#fb8c00,stroke-width:2px,color:#e65100
    style F fill:#fff3e0,stroke:#fb8c00,stroke-width:2px,color:#e65100

    style G fill:#e8f5e9,stroke:#4caf50,stroke-width:2px,color:#1b5e20
    style H fill:#e8f5e9,stroke:#4caf50,stroke-width:2px,color:#1b5e20
    style I fill:#e8f5e9,stroke:#4caf50,stroke-width:2px,color:#1b5e20
  

This flexibility is what makes Stephanie adaptive. Different tasks benefit from different perspectives, and over time Stephanie learns which engines to trust in which situations. That’s how we evolve from a single-score view to a meta-scoring system capable of guiding its own improvement.


✅ Why SICQL is the Current Standard

SICQL isn’t just another scorer. It’s the first mechanism in Stephanie that models all of the following:

  1. What the system believes the score should be (Q)
  2. How confident it is in that belief (V, uncertainty)
  3. What actions it would take next (policy_logits)
  4. How stable or decisive that policy is (entropy, advantage)

This gives us direction, uncertainty, and intent all at once. With SICQL, we can:

  • Score a document
  • Evaluate how stable and confident the score is
  • Compare the policy behind that score to past decisions
  • Decide whether to keep, revise, or escalate the belief

This kind of introspective scoring unlocks powerful learning patterns like:

  • Policy drift detection
  • Meta-reasoning over logits
  • GILD-style imitation learning
  • Belief distillation across multiple scorers

SICQL isn’t just another scoring mechanism; it’s Stephanie’s first true capacity for metacognition. While traditional Q-learning asks ‘What’s the best action?’, SICQL enables Stephanie to ask ‘Why was that action best, and how can I recognize similar situations in the future?’


🔄 Dynamic Scoring: Letting the System Choose What Works

While SICQL is now our default scorer, Stephanie is not hardwired to use any single scoring engine.

Instead, Stephanie is designed to be self-aware of her scoring stack. Every task, every context, and every goal can influence which engine she trusts most. This flexibility is not just a convenience; it’s the foundation of a self-improving AI.

🎛️ How It Works

Stephanie treats each scorer (SICQL, MRQ, SVM, EBT, LLM) as a modular component. They’re all:

  • Configurable via Hydra or runtime YAML
  • Interchangeable via the scoring registry
  • Monitored via score reports and policy analysis

This allows her to select the best scoring strategy on demand. For example:

  • If a task involves speed and scale (e.g. web crawling), Stephanie might use SVM for its efficiency.
  • If a task requires moral judgment or subtle reasoning, she may defer to an LLM.
  • For uncertain, high-stakes evaluations, EBT or SICQL may be used with confidence-aware fallback logic.

Each scoring run is tagged, traced, and evaluated, and Stephanie learns over time which scorer performs best for which kind of task.
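
As a rough illustration, registry-driven selection might look like the sketch below (the registry contents and task-profile keys are hypothetical, not Stephanie’s actual API):

# Hypothetical sketch of scorer selection; names and keys are illustrative only.
SCORER_REGISTRY = {
    "svm": "fast margin-based ranking",
    "llm": "rich but expensive judgment",
    "ebt": "uncertainty-aware fallback",
    "sicql": "goal-conditioned Q/V/policy scoring",
}

def select_scorer(task_profile: dict) -> str:
    """Pick a scorer name from coarse task requirements."""
    if task_profile.get("needs_speed"):
        return "svm"
    if task_profile.get("needs_subtle_judgment"):
        return "llm"
    if task_profile.get("high_stakes"):
        return "ebt"     # or SICQL with confidence-aware fallback
    return "sicql"       # current default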


🧠 Toward Autonomous Scoring Adaptation

This isn’t just about switching scorers. It’s about learning to switch scorers well.

By analyzing scoring drift, policy entropy, and agreement between scorers, Stephanie can:

  • Detect when a scorer becomes unreliable
  • Blend or ensemble multiple scorers dynamically
  • Retrain or retire scorers as needed

Over time, she builds a meta-policy over scorers themselves, forming a higher-order judgment system that improves with every decision.

In essence, Stephanie doesn’t just learn what’s good; she learns how to learn what’s good. That’s the real leap.

Before we get into that process, let’s fully understand our new best-of-breed scorer: SICQL.


💢 What is SICQL?

2506.01299 “Scalable In-Context Q-Learning” introduces a form of Q-learning over in-context embeddings. Rather than training separate models for value or policy estimation, SICQL uses a single transformer with multiple output heads:

  • Q: estimates the value of (state, action) pairs
  • V: estimates the state value using expectile regression
  • π: outputs policy logits weighted by advantage (AWR)

These outputs are computed in context, meaning they are conditioned on a world model or prior interaction history, allowing SICQL to learn from and adapt to previously seen decisions.

Key innovations:

  • Expectile regression for V head (robust value estimation)
  • Advantage-weighted regression for π head (sharp policy gradients)
  • Contextual encoding (world model z) shared across all heads

This structure makes it highly modular, interpretable, and perfect for goal-conditioned evaluation in Stephanie.
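
For reference, the V and π objectives follow the familiar expectile-regression and advantage-weighted-regression forms (shown here in their standard textbook shape; the SICQL paper’s exact losses may differ in detail):

$$L_V(\psi) = \mathbb{E}\big[\,\big|\tau - \mathbb{1}\{Q(s,a) - V_\psi(s) < 0\}\big|\,\big(Q(s,a) - V_\psi(s)\big)^2\,\big]$$

$$L_\pi(\theta) = -\,\mathbb{E}\big[\exp\big(\beta\,(Q(s,a) - V(s))\big)\,\log \pi_\theta(a \mid s)\big]$$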


🔍 What is SICQL really doing?

SICQL is a single transformer model that makes three types of predictions from the same contextual input. Think of it like a brain that looks at a situation and says:

“Here’s how valuable this situation is (V), here’s what I’d do (π), and here’s how good it would be to do that thing (Q).”

Let’s break down each part:


✨ Three Heads: V, π, Q are better than one

    graph LR
    style Goal fill:#FFD700,stroke:#FFA500,stroke-width:2px
    style Document fill:#87CEFA,stroke:#1E90FF,stroke-width:2px
    style Encoder fill:#9370DB,stroke:#663399,stroke-width:2px
    style Q fill:#98FB98,stroke:#2E8B57,stroke-width:2px
    style V fill:#FFB6C1,stroke:#DB7093,stroke-width:2px
    style π fill:#ADD8E6,stroke:#4169E1,stroke-width:2px
    style Score fill:#7CFC00,stroke:#228B22,stroke-width:2px
    style Confidence fill:#FFDAB9,stroke:#CD853F,stroke-width:2px
    style Decision fill:#DDA0DD,stroke:#9400D3,stroke-width:2px
    
    Goal("🎯 Goal") --> Encoder("🧠 Encoder")
    Document("📄 Document") --> Encoder
    Encoder --> Q("💯 Q-Head<br>Action Quality")
    Encoder --> V("🛡️ V-Head<br>State Value")
    Encoder --> π("🧭 π-Policy<br>Next Best Step")
    Q --> Score("🏆 Score")
    V --> Confidence("✅ Confidence")
    π --> Decision("⚡ Decision")
    
    click Goal "https://arxiv.org/pdf/2506.01299" "SICQL Paper"
    click Encoder "https://arxiv.org/abs/2506.00773" "H-Net Paper"
  

All three heads share the same input: a goal + a document or triplet + a learned context embedding (called z).

  1. Q Head:

    • Learns to predict the Q-value:

      “How good is it to take a specific action in this specific state?”

    • In your case, the “action” might be the score assigned to a document or triplet under a certain goal.

  2. V Head (Value Head):

    • Learns to predict the value of the state alone:

      “How promising is this context overall, before I even act?”

    • It uses expectile regression, which is more stable than mean squared error, especially when feedback is noisy (e.g., noisy scores from LLMs).

  3. π Head (Policy Head):

    • Outputs a distribution over actions, weighted by their advantage (i.e., how much better they are than average).

    • Uses Advantage-Weighted Regression (AWR) to guide updates.

      “If one option clearly leads to better rewards, give it more weight.”


🌍 Why ‘Contextual’?

All of this is done in context, meaning:

  • Instead of training separate models per dimension (alignment, clarity, etc.), you give the model the full story: the goal, the text, and the embedding representing what the model already knows or believes (z).
  • That context (z) could be your H-Net embedding of the document so the model doesn’t start from scratch on every prediction.

🛠️ How does this help Stephanie?

SICQL gives you a compact, interpretable, and goal-conditioned scoring model:

  • You only train one model per dimension, with no separate Q and V networks.
  • You can extract Q-values for scoring, or use π to guide decision-making under uncertainty.
  • The shared context (z) makes it easier to inject domain knowledge or reuse embeddings (e.g., from H-Net).
    
graph TD
    A[🎯 Goal: Improve AI alignment] --> B[📄 Input: Document or Triplet]
    B --> C["🧠 H-Net Embedding (z): Encodes Input + Goal"]
    C --> D[🔄 InContext SICQL Model: Transformer with Shared Context z]

    subgraph SICQL Model Outputs
        D -- "Q-Head (Q(s,a)): Q-values for Scoring" --> DQ[🟦 Q-Score: How good is this choice?]
        D -- "V-Head (V(s)): State Value Baseline via Expectile Regression" --> DV[🟨 V-Score: Baseline for State Value]
        D -- "π-Head (Policy): Best Action to Take via AWR" --> DPi[🟥 π-Policy: Guides Decision-Making]
    end

    DQ --> E["📊 Scoring & Evaluation Layers (ScoreORM / EvaluationORM)"]
    DV --> E
    DPi --> E

    E --> F[💾 Stored Data: For Auditing & Comparison]

    style D fill:#eef,stroke:#55c,stroke-width:2px
    style DQ fill:#bbf
    style DV fill:#ffd
    style DPi fill:#fbb
    style E fill:#fff3dd,stroke:#f90,stroke-width:2px
  

🔑 Key Explanation: our SICQL implementation

The InContext SICQL model provides a compact, interpretable, and goal-conditioned scoring mechanism for AI alignment. Here’s a breakdown of its components and how they work together:

H-Net Embedding (z)

This is the initial step where your input (a document or a triplet of data) is processed. The H-Net generates an embedding, z, which acts as a shared context. Crucially, this embedding is conditioned on your specific alignment goal, ensuring that all subsequent processing is relevant to that objective.

H-Net is a boundary-aware embedding model that uses a neural network to dynamically segment text based on learned semantic boundaries. Introduced in the paper H-Net: Dynamic Chunking for Long-Form Text Understanding, it proposes a learned scoring model that detects the most meaningful breakpoints in a sequence, allowing us to embed coherent, self-contained units of thought rather than arbitrary slices of text.

Q-Head (Q(s,a))

This head directly outputs Q-values, which are used for scoring. A higher Q-value indicates a “better” choice or action given the current state and goal. This is directly applicable for tasks like value learning (e.g., using MRQ loss).

First, we define MLP, a fundamental two-layer neural network that serves as the basis for our Q-value estimations, allowing Stephanie to quantify the quality of her actions.


import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, input_dim),
            nn.ReLU(),
            nn.Linear(input_dim, output_dim)
        )

    def forward(self, x):
        return self.model(x)

A simple 2-layer MLP with ReLU in the middle. Used for Q-value estimation. Output: scalar Q-value for each (goal, output) pair.

V-Head (V(s))

The V-head provides a state value baseline using expectile regression. This is particularly useful for smoothing noisy data, such as feedback from large language models (LLMs), ensuring more stable and reliable value estimates.

class ExpectileHead(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, input_dim),
            nn.ReLU(),
            nn.Linear(input_dim, 1)
        )

    def forward(self, x):
        return self.net(x)

Same structure as MLP but semantically tied to expectile regression. Predicts state value V(s) as a baseline to stabilize Q-learning.
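
The corresponding expectile loss is small enough to show here; this mirrors the _expectile_loss helper used by the trainer later in this post:

def expectile_loss(diff: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    # Positive residuals (target above prediction) are weighted by tau,
    # negative residuals by (1 - tau), so large under-estimates are corrected harder.
    return torch.where(diff > 0, tau * diff.pow(2), (1 - tau) * diff.pow(2)).mean()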

π-Head (Policy)

This head outputs a policy (π), which guides decision-making. It suggests the “best action to take,” making it invaluable for scenarios like active learning (identifying which data to prioritize) or guiding generative AI processes (refining outputs).

class PolicyHead(nn.Module):
    def __init__(self, input_dim, action_dim=1):
        super().__init__()
        self.linear = nn.Linear(input_dim, action_dim)

    def forward(self, x):
        return self.linear(x)  # Optionally softmax if needed

Predicts logits for actions. Here, actions = discrete candidates. Normally combined with AWR-style loss (advantage-weighted regression). Output: logits (can be softmax-ed during training if needed).
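
In code, the AWR-style policy loss boils down to a few lines. This is a sketch in the spirit of the GILD step in the trainer’s _train_epoch shown later; the tensor arguments correspond to the model’s Q, V, and policy outputs:

import torch.nn.functional as F

def awr_policy_loss(q_value, state_value, action_logits, beta=1.0):
    """Advantage-weighted policy loss: higher-advantage samples get more weight."""
    advantage = (q_value - state_value).detach().reshape(-1)   # [batch]
    weights = torch.exp(beta * advantage)
    weights = weights / weights.sum()
    log_probs = F.log_softmax(action_logits, dim=-1)           # [batch, actions]
    return -(log_probs * weights.unsqueeze(-1)).mean()         # weight each row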

InContextQModel: Full SICQL block

class InContextQModel(nn.Module):
    def __init__(self, dim, hdim, action_dim=1, device="cpu"):
        super().__init__()
        print(f"Initializing InContextQModel with dim={dim}, hdim={hdim}, action_dim={action_dim}, device={device}")
        self.device = device
        self.encoder = TextEncoder(dim, hdim).to(device)
        self.q_head = MLP(dim, 1).to(device)
        self.v_head = ExpectileHead(dim).to(device)
        self.pi_head = PolicyHead(dim, action_dim).to(device)

    def forward(self, prompt_emb, output_emb):
        prompt_emb = prompt_emb.to(self.device)
        output_emb = output_emb.to(self.device)

        zsa = self.encoder(prompt_emb, output_emb)  # shape: [batch, hdim]
        q_value = self.q_head(zsa)
        state_value = self.v_head(zsa)
        action_probabilities = self.pi_head(zsa)

        return {
            "q_value": q_value,
            "state_value": state_value,
            "action_probabilities": action_probabilities,
        }

Initializes one encoder and three heads.

You pass dim (input embedding size) and hdim (internal size).
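
A minimal usage sketch (illustrative sizes and random embeddings; it assumes the heads defined above and the TextEncoder shown in the next section):

model = InContextQModel(dim=1024, hdim=1024, action_dim=3, device="cpu")

# Stand-in goal/document embeddings (a batch of 2) from your embedding backend.
prompt_emb = torch.randn(2, 1024)
output_emb = torch.randn(2, 1024)

outputs = model(prompt_emb, output_emb)
print(outputs["q_value"].shape)               # [2, 1] scalar Q per (goal, output)
print(outputs["state_value"].shape)           # [2, 1] state-value baseline
print(outputs["action_probabilities"].shape)  # [2, 3] policy logits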

🔁 Forward Pass

  • prompt_emb: embedding of the goal
  • output_emb: embedding of the candidate answer
  • zsa: the merged representation, z = f(goal, output)

🧠 TextEncoder: Contextual Fusion Layer

This module is the core “in-context” component of the SICQL model:

class TextEncoder(nn.Module):
    def __init__(self, dim=4096, hdim=4096):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(dim * 2, hdim),
            nn.ReLU(),
            nn.Linear(hdim, dim),
        )

    def forward(self, context_emb, doc_emb):
        # Concatenate goal and candidate embeddings, then project to the shared state z
        concat = torch.cat([context_emb, doc_emb], dim=1)
        return self.encoder(concat)

🔄 What it does
  • Input: two embedding vectors:

    • context_emb: the goal, prompt, or query vector
    • doc_emb: the candidate output (e.g. document or answer)
  • Fusion:

    concat = torch.cat([context_emb, doc_emb], dim=1)
    

    Concatenates them into a [batch_size, dim * 2] tensor.

  • Transformation: Passes through a two-layer MLP to produce a new joint embedding:

    z = f(context, output)
    

🧬 Fusion to state vector

This fused z becomes the shared state vector for:

  • Q(s,a)
  • V(s)
  • π(a|s)

Which are all learned in parallel, over the same contextual encoding.

🧭 Where it fits

This is called here in InContextQModel.forward():

zsa = self.encoder(prompt_emb, output_emb)

So your overall SICQL model becomes:

    graph LR
    style A fill:#fdf6e3,stroke:#657b83,stroke-width:2px
    style B fill:#fdf6e3,stroke:#657b83,stroke-width:2px
    style C fill:#268bd2,color:#fff,stroke:#0b3d91,stroke-width:2px
    style Z fill:#b58900,color:#fff,stroke:#7b6000,stroke-width:2px
    style Q fill:#859900,color:#fff,stroke:#556b2f,stroke-width:2px
    style V fill:#2aa198,color:#fff,stroke:#006d77,stroke-width:2px
    style PI fill:#d33682,color:#fff,stroke:#6c0b4e,stroke-width:2px

    A["📌 goal_emb"] --> C["🧠 TextEncoder"]
    B["📄 output_emb"] --> C

    C --> Z["🔀 z = encoded(goal + output)"]

    Z --> Q["⚡ Q Head"]
    Z --> V["💡 V Head"]
    Z --> PI["🎯 π Head (Policy)"]
  
  • goal_emb and output_emb are concatenated and encoded into a latent state z

  • z is passed to:

    • Q Head → for Q-value prediction
    • V Head → for state value prediction
    • π Head → for action logits (policy distribution)

✅ Summary of Flow

  1. Goal and output embeddings come from your embedding backend (e.g., H-Net).

  2. TextEncoder fuses them into z, your in-context representation.

  3. Heads (q_head, v_head, pi_head) each predict their respective signal.

  4. The predictions are stored and used for:

    • Training updates
    • Logging to ScoreORM / EvaluationORM
    • Comparing to LLM ground truth or rewards

Scoring & Evaluation Layers: All the outputs from the SICQL model (Q-scores, V-scores, and policy guidance) are then routed to your scoring and evaluation layers (ScoreORM and EvaluationORM). This ensures that all results are systematically logged and linked to specific evaluation objects.

Stored Data for Auditing & Comparison: Finally, the processed information, including details like embedding_type, dimension, and model_name, is stored. This creates a robust audit trail, enabling clear comparisons across different models and dimensions and facilitating ongoing improvements to AI alignment.

➕ What we get

Stephanie’s existing MRQ framework provided dimensional scoring via learned value functions. SICQL extends that capability by introducing directional Q-learning and structured advantage estimation, enabling:

  • Better uncertainty modeling via the Q-V gap
  • Explicit separation of scoring and policy decisioning
  • Dynamic reward estimation and convergence tracing
  • Compatibility with self-improving loops and EBT feedback

By plugging SICQL into our existing infrastructure, we empower Stephanie to learn not just what is better, but why, how, and under what tradeoffs.
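
For example, the Q-V gap gives a cheap per-item advantage and uncertainty signal straight from the model’s outputs (continuing the small usage sketch from earlier):

outputs = model(prompt_emb, output_emb)

q = outputs["q_value"]        # what the model thinks this choice is worth
v = outputs["state_value"]    # what it expects from the state overall
advantage = q - v             # how much better than expected this choice is
uncertainty = (q - v).abs()   # a large gap can flag items worth re-checking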

🧱 Overview

The model contains:

| Component | Purpose |
|---|---|
| TextEncoder | Merges the goal (prompt) and candidate (output) |
| q_head | Predicts Q-values: quality of (state, action) |
| v_head | Predicts V-values: expectile-regressed baselines |
| pi_head | Predicts π policy logits (advantage-weighted) |

These operate in context: they all share the same z = f(goal, output) embedding via TextEncoder.


🔁 The SICQL Engine: Training Stephanie’s Self-Correcting Policies

In Stephanie, learning is recursive: the system doesn’t just learn what’s good, it learns how to improve what it considers good. At the heart of this learning loop is the SICQL Engine (Scalable In-Context Q-Learning), a training module that synthesizes signals from reward-based feedback, policy gradients, and belief refinement to tune Stephanie’s internal value and policy heads.

Unlike traditional scoring systems that assign static scores, SICQL models are adaptive: they update based on experience, refine through imitation, and generalize using embedding context. This allows Stephanie to maintain not just a judgment of quality, but a learned sense of what to pursue, what to avoid, and what to remain uncertain about.

⚙️ Core Functionality

The SICQLTrainer class defines this training engine. It supports the following features:

| Component | Description |
|---|---|
| Multi-head Training | Simultaneously trains Q-value, V-value (expectile), and Policy heads. |
| Self-Imitation via GILD | Refines policy using gradients from estimated advantages and entropy regularization. |
| Adaptive Stability | Tracks policy entropy and stability to monitor convergence. |
| Tuner Integration | Uses optional post-processing scalers for score alignment. |
| Database Logging | Stores training stats and policy snapshots as BeliefCartridgeORM objects. |

Each dimension (e.g., clarity, novelty, alignment) is trained as an independent SICQL model, scoped by:

  • target_type (e.g., document, triplet)
  • embedding_type (e.g., hnet, hf)
  • model_version (e.g., v1)
  • and a scored history of documents via ScorableFactory
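
A hypothetical configuration for one such training run might look like this (keys mirror what the SICQLTrainer reads from its cfg later in this post; the values and the YAML/Hydra layout in Stephanie may differ):

# Hypothetical config sketch; keys match SICQLTrainer._init_config, values illustrative.
sicql_cfg = {
    "model_path": "models",
    "target_type": "document",
    "embedding_type": "hnet",
    "model_version": "v1",
    "dimensions": ["alignment", "clarity", "novelty"],
    "epochs": 50,
    "batch_size": 32,
    "lr": 1e-4,
    "expectile_tau": 0.7,   # V-head expectile
    "use_gild": True,       # advantage-weighted policy refinement
    "use_qmax": True,       # train the V-head
    "patience": 3,          # early stopping
}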

🧪 Training Loop Overview

During training, the engine:

  1. Loads past scores for a specific dimension and goal.
  2. Converts training documents into (context, document, score) triplets.
  3. Trains three heads:
    • Q-head: Predicts raw score values.
    • V-head: Estimates expected score using expectile regression.
    • Policy head: Learns discrete actions using GILD-style advantage reweighting.
  4. Applies early stopping and learning rate scheduling based on performance.
  5. Saves the model with detailed policy metadata and logs a belief cartridge for future reasoning.
    
graph LR
    A["🌓 Contrast Pairs"]
    B["🧬 Embedding Generation"]
    C["🧠 Context Fusion"]
    D["🎯 Q-loss: Score Accuracy"]
    E["📍 V-loss: State Estimation"]
    F["🧭 π-loss: Policy Refinement"]
    G["🔁 Backpropagation"]
    H["🧪 Policy Stability Check"]
    I{"✅ Stable?"}
    J["🚀 Model Deployment"]
    K["📚 Additional Training"]

    A --> B
    B --> C
    C --> D
    C --> E
    C --> F
    D --> G
    E --> G
    F --> G
    G --> H
    H --> I
    I -->|Yes| J
    I -->|No| K

    %% Styling (Mermaid 11.9-safe syntax)
    style A fill:#cce5ff,stroke:#3399ff,stroke-width:2px
    style B fill:#cce5ff,stroke:#3399ff,stroke-width:2px
    style C fill:#e2d5f8,stroke:#6f42c1,stroke-width:2px
    style D fill:#fff3cd,stroke:#ffc107,stroke-width:2px
    style E fill:#fff3cd,stroke:#ffc107,stroke-width:2px
    style F fill:#fff3cd,stroke:#ffc107,stroke-width:2px
    style G fill:#f8d7da,stroke:#dc3545,stroke-width:2px
    style H fill:#e2e3e5,stroke:#6c757d,stroke-width:2px
    style I fill:#e2e3e5,stroke:#6c757d,stroke-width:2px
    style J fill:#d4edda,stroke:#28a745,stroke-width:2px
    style K fill:#f8d7da,stroke:#dc3545,stroke-width:2px
  

🔁 Overview: SICQL Training in Stephanie

The goal of this module is to train a goal-conditioned scoring model (InContextQModel) using the SICQL architecture. This means you’re learning to predict how good a document is with respect to a goal by training the Q-head of the model.

🧠 Model Recap

The InContextQModel is composed of

  • A TextEncoder that fuses a prompt (goal) embedding and output (e.g., document) embedding
  • A Q-head for scalar score prediction
  • (Optionally) a V-head and π-head for advantage and policy estimation (to be added in future updates)

This trainer currently only trains the Q-head, but the architecture supports training all heads.


🔧 Key Function: train_sicql(model, contrast_pairs, cfg)

🧩 Input

  • model: an InContextQModel (Q, V, π heads)

  • contrast_pairs: examples like:

    {
      "title": "Goal: Improve alignment",
      "output_a": "Document A text...",
      "value_a": 0.9,
      "output_b": "Document B text...",
      "value_b": 0.4,
      "dimension": "alignment"
    }
    
  • cfg: config dict with training hyperparameters

🔄 Steps

1. Initialize optimizer and loss

optimizer = Adam(model.q_head.parameters())
criterion = MSELoss()

You’re using mean squared error to teach the Q-head to approximate the LLM score or label.

2. Loop through contrast pairs

for item in contrast_pairs:
    for side in ["a", "b"]:
        # Get prompt and document
        # Get LLM score
        # Embed both
        # Pass through model
        # Compute Q-value loss

This trains the Q-head to regress from:

(goal_emb, doc_emb) → Q-value ≈ LLM score

So if the LLM says Document A is 0.9 and B is 0.4, you want:

Q(goal, A) → 0.9
Q(goal, B) → 0.4

3. Backpropagation

loss.backward()
optimizer.step()

This updates only the q_head weights.
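
Putting steps 2 and 3 together, the inner update looks roughly like this (a sketch: embed() stands in for whatever embedding call your backend exposes, and model, criterion, and optimizer are the objects defined above):

for item in contrast_pairs:
    goal_emb = embed(item["title"])                       # [1, dim] goal embedding
    for side in ["a", "b"]:
        doc_emb = embed(item[f"output_{side}"])           # [1, dim] document embedding
        target = torch.tensor([[item[f"value_{side}"]]])  # LLM score as a 1x1 tensor

        outputs = model(goal_emb, doc_emb)
        loss = criterion(outputs["q_value"], target)      # MSE toward the LLM score

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()   # only q_head parameters are updated (per the optimizer above)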

4. Logging and Early Stopping

After each epoch:

  • Logs average loss via self.logger.log("SICQLTrainerEpoch", {...})
  • If no improvement over patience epochs, stops early

🔬 Summary: What This Is Doing

| Component | Function |
|---|---|
| TextEncoder | Fuses context + doc embeddings into latent vector z |
| Q-head | Learns to map z → scalar (LLM-aligned score) via supervised regression |
| Loss | MSE between predicted Q-value and actual LLM score |
| Training Data | Contrastive preference pairs from LLM judgments |

⚙️ Model Engines: A Standardized Training Approach

Before diving into the code, it’s worth noting that this is more than just a single trainer. What you’re about to see is one of four core training engines currently powering Stephanie’s scoring models: SICQL, SVM, MRQ, and EBT. All of them follow the same design pattern, rooted in a shared BaseTrainer class. This consistency has already paid off: it allows us to reuse training logic, simplify logging and evaluation, and swap models without rewriting entire pipelines.

In the long term, we plan to unify these into a single, highly configurable training engine. But for now, each trainer is tailored to its respective model type. The SICQLTrainer you see below is one of the most advanced, supporting Q-learning, expectile value updates, and policy fine-tuning via GILD.

This section showcases the full source code of the SICQLTrainer so you can understand how Stephanie evolves her models. If you’re building a self-improving AI, this engine-based approach is one you’ll want to replicate.


class SICQLTrainer(BaseTrainer):

    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.cfg = cfg
        self.memory = memory
        self.logger = logger
        self.embedding_type = self.memory.embedding.type
        self.dim = self.memory.embedding.dim
        self.hdim = self.memory.embedding.hdim
        self.root_dir = cfg.get("model_path", "models")
        self.dimension = cfg.get("dimension", "alignment")
        self.embedding_type = cfg.get("embedding_type", "hnet")
        self.model_type = "sicql"
        self.target_type = cfg.get("target_type", "document")
        self.version = cfg.get("model_version", "v1")

        # Device management
        self.device = torch.device(
            "cuda" if torch.cuda.is_available() else "cpu"
        )

        # Training configuration
        self._init_config(cfg)

        # Track training state
        self.best_loss = float("inf")
        self.early_stop_counter = 0
        self.models = {}
        self.tuners = {}
        self._load_tuners()

        # Log initialization
        self.logger.log(
            "SICQLTrainerInitialized",
            {
                "dimension": self.cfg.get("dimension", "alignment"),
                "embedding_type": self.cfg.get("embedding_type", "hnet"),
                "use_gild": self.use_gild,
                "use_qmax": self.use_qmax,
                "device": str(self.device),
            },
        )
 

    def _init_config(self, cfg):
        """Initialize training parameters from config"""
        self.use_tuner = cfg.get("use_tuner", True)
        self.use_early_stopping = cfg.get("early_stopping", True)
        self.early_stopping_patience = cfg.get("patience", 3)
        self.early_stopping_min_delta = cfg.get("min_delta", 1e-4)
        self.batch_size = cfg.get("batch_size", 32)
        self.epochs = cfg.get("epochs", 50)
        self.lr = cfg.get("lr", 1e-4)
        self.gamma = cfg.get("gamma", 0.95)  # Discount factor
        self.beta = cfg.get("beta", 1.0)  # Policy temperature
        self.entropy_weight = cfg.get("entropy_weight", 0.01)
        self.dimensions = cfg.get("dimensions", [])
        self.min_samples = cfg.get("min_samples", 10)
        self.expectile_tau = cfg.get("expectile_tau", 0.7)  # For V-head
        self.use_gild = cfg.get("use_gild", True)
        self.use_qmax = cfg.get("use_qmax", True)
        self.scorer_map = ["ebt", "svm", "mrq"]  # Policy head mapping

    def _load_tuners(self):
        """Load regression tuners for each dimension"""
        for dim in self.dimensions:
            tuner_path = super().get_locator(dim).tuner_file()
            if os.path.exists(tuner_path):
                self.tuners[dim] = RegressionTuner(dimension=dim)
                self.tuners[dim].load(tuner_path)
            else:
                self.tuners[dim] = None
                self.logger.log(
                    "TunerMissing", {"dimension": dim, "path": tuner_path}
                )

    def _build_model(self, dimension):
        """Build or load SICQL model"""
        locator = super().get_locator(dimension)
        if locator.model_exists():
            # Load existing model
            encoder = TextEncoder(dim=self.dim, hdim=self.hdim).to(self.device)
            q_head = QHead(zsa_dim=self.dim, hdim=self.hdim).to(self.device)
            v_head = VHead(zsa_dim=self.dim, hdim=self.hdim).to(self.device)
            pi_head = PolicyHead(
                zsa_dim=self.dim, hdim=self.hdim, num_actions=3
            ).to(self.device)

            # Load weights
            encoder.load_state_dict(
                torch.load(locator.encoder_file(), map_location=self.device)
            )
            q_head.load_state_dict(
                torch.load(locator.q_head_file(), map_location=self.device)
            )
            v_head.load_state_dict(
                torch.load(locator.v_head_file(), map_location=self.device)
            )
            pi_head.load_state_dict(
                torch.load(locator.pi_head_file(), map_location=self.device)
            )

            # Build model
            sicql_model = InContextQModel(
                encoder=encoder,
                q_head=q_head,
                v_head=v_head,
                pi_head=pi_head,
                embedding_store=self.memory.embedding,
                device=self.device,
            )
            return sicql_model

        # Build new model
        self.dim = self.memory.embedding.dim
        self.hdim = self.memory.embedding.hdim

        encoder = TextEncoder(dim=self.dim, hdim=self.hdim).to(self.device)
        q_head = QHead(zsa_dim=self.dim, hdim=self.hdim).to(self.device)
        v_head = VHead(zsa_dim=self.dim, hdim=self.hdim).to(self.device)
        pi_head = PolicyHead(
            zsa_dim=self.dim, hdim=self.hdim, num_actions=3
        ).to(self.device)

        return InContextQModel(
            encoder=encoder,
            q_head=q_head,
            v_head=v_head,
            pi_head=pi_head,
            embedding_store=self.memory.embedding,
            device=self.device,
        )

    def _train_epoch(self, model, dataloader):
        """Train for one epoch with all heads"""
        model.train()
        total_q_loss = 0.0
        total_v_loss = 0.0
        total_pi_loss = 0.0
        count = 0

        for ctx_emb, doc_emb, scores in tqdm(dataloader, desc="Training"):
            ctx_emb = ctx_emb.to(self.device)
            doc_emb = doc_emb.to(self.device)
            scores = scores.to(self.device)

            outputs = model(ctx_emb, doc_emb)

            q_loss = F.mse_loss(outputs["q_value"], scores)

            v_loss = (
                self._expectile_loss(
                    scores - outputs["state_value"], tau=self.expectile_tau
                )
                if self.use_qmax
                else torch.tensor(0.0, device=self.device)
            )

            pi_loss = torch.tensor(0.0, device=self.device)
            if self.use_gild and "action_logits" in outputs:
                advantage = (
                    outputs["q_value"] - outputs["state_value"]
                ).detach()
                weights = torch.exp(self.beta * advantage)
                weights = weights / weights.sum()

                # Corrected reshape
                weights = weights.unsqueeze(-1)  # Ensure (batch_size, 1)

                log_probs = F.log_softmax(outputs["action_logits"], dim=-1)
                pi_loss = -(log_probs * weights).mean()

                # Optional entropy regularization
                entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
                pi_loss += self.entropy_weight * entropy

            loss = (
                q_loss * self.cfg.get("q_weight", 1.0)
                + v_loss * self.cfg.get("v_weight", 0.5)
                + pi_loss * self.cfg.get("pi_weight", 0.3)
            )

            self.optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
            self.optimizer.step()

            total_q_loss += q_loss.item() * ctx_emb.size(0)
            total_v_loss += v_loss.item() * ctx_emb.size(0)
            total_pi_loss += pi_loss.item() * ctx_emb.size(0)
            count += ctx_emb.size(0)

        avg_q = total_q_loss / count
        avg_v = total_v_loss / count
        avg_pi = total_pi_loss / count

        if self.use_qmax:
            self.scheduler["q"].step(avg_q)
        if self.use_gild:
            self.scheduler["pi"].step(avg_pi)

        return {"q": avg_q, "v": avg_v, "pi": avg_pi, "total": loss.item()}

    def _expectile_loss(self, diff, tau=0.7):
        """Compute expectile loss for V-head"""
        return torch.where(
            diff > 0, tau * diff.pow(2), (1 - tau) * diff.pow(2)
        ).mean()

    def _should_stop_early(self, current_avg):
        """Check for early stopping"""
        if not self.use_early_stopping:
            return False

        if current_avg < self.best_loss - self.early_stopping_min_delta:
            self.best_loss = current_avg
            self.early_stop_counter = 0
        else:
            self.early_stop_counter += 1

        return self.early_stop_counter >= self.early_stopping_patience

    def _save_model(self, model, dimension, stats):
        """Save model components with metadata"""
        locator = super().get_locator(dimension)
        # Save each component
        torch.save(model.encoder.state_dict(), locator.encoder_file())
        torch.save(model.q_head.state_dict(), locator.q_head_file())
        torch.save(model.v_head.state_dict(), locator.v_head_file())
        torch.save(model.pi_head.state_dict(), locator.pi_head_file())

        # Calculate policy metrics
        policy_logits = model.pi_head.weight.data.mean(dim=0).tolist()
        policy_probs_tensor = F.softmax(torch.tensor(policy_logits), dim=-1)
        policy_probs = policy_probs_tensor.tolist()
        policy_entropy = -torch.sum(
            policy_probs_tensor * torch.log(policy_probs_tensor + 1e-8)
        ).item()

        # Build metadata
        meta = {
            "dim": self.dim,
            "hdim": self.hdim,
            "dimension": dimension,
            "version": self.cfg.get("model_version", "v1"),
            "avg_q_loss": stats.get("avg_q_loss", 0.0),
            "avg_v_loss": stats.get("avg_v_loss", 0.0),
            "avg_pi_loss": stats.get("avg_pi_loss", 0.0),
            "policy_logits": policy_logits,
            "policy_probs": policy_probs,
            "policy_entropy": policy_entropy,
            "policy_stability": max(policy_probs),
            "device": str(self.device),
            "embedding_type": self.cfg.get("embedding_type", "hnet"),
            "timestamp": datetime.utcnow().isoformat(),
        }

        # Save metadata
        with open(locator.meta_file(), "w") as f:
            json.dump(meta, f)

        # Save tuner if available
        if dimension in self.tuners and self.tuners[dimension]:
            self.tuners[dimension].save(locator.tuner_file())

        # Save model version
        model_version = ModelVersionORM(**meta)
        self.memory.session.add(model_version)
        self.memory.session.commit()

        return meta

    def _log_training_stats(self, dim, meta):
        """Log training stats to database"""
        training_stats = TrainingStatsORM(
            model_type="sicql",
            target_type=self.cfg.get("target_type", "document"),
            dimension=dim,
            version=meta["version"],
            avg_q_loss=meta["avg_q_loss"],
            avg_v_loss=meta["avg_v_loss"],
            avg_pi_loss=meta["avg_pi_loss"],
            policy_entropy=meta["policy_entropy"],
            policy_stability=meta["policy_stability"],
            performance=meta["avg_q_loss"],
        )
        self.memory.session.add(training_stats)
        self.memory.session.commit()

    def _validate_tensor(self, tensor, name):
        """Validate tensor before use"""
        if tensor is None:
            self.logger.log(
                "InvalidTensor",
                {"tensor_name": name, "reason": "tensor_is_none"},
            )
            return False

        if torch.isnan(tensor).any():
            self.logger.log(
                "NaNInTensor", {"tensor_name": name, "tensor": tensor.tolist()}
            )
            return False

        return True

    def _calculate_policy_logits(self, model):
        """Calculate policy logits from policy head weights"""
        with torch.no_grad():
            policy_weights = model.pi_head.get_policy_weights()
            policy_probs = F.softmax(policy_weights, dim=-1)
            return policy_probs.tolist()

    def _calculate_policy_stability(self, policy_logits):
        """Calculate policy stability from logits"""
        if not policy_logits:
            return 0.0
        policy_probs = F.softmax(torch.tensor(policy_logits), dim=-1)
        return policy_probs.max().item()

    def _calculate_policy_entropy(self, policy_logits):
        """Calculate policy entropy for versioning"""
        if not policy_logits:
            return 0.0
        policy_probs = F.softmax(torch.tensor(policy_logits), dim=-1)
        return (
            -torch.sum(policy_probs * torch.log(policy_probs + 1e-8), dim=-1)
            .mean()
            .item()
        )

    def train(self, samples, dim):
        """
        Train SICQL model for a dimension
        Args:
            samples: List of training samples
            dim: Dimension to train
        Returns:
            Training statistics and model
        """
        self.logger.log("DimensionTrainingStarted", {"dimension": dim})

        # Prepare data
        dataloader = super()._create_dataloader(samples)
        if not dataloader:
            return {"error": "insufficient_data", "dimension": dim}

        # Build model
        model = self._build_model(dim)
        model.train()

        # Optimizer for all heads
        self.optimizer = optim.Adam(model.parameters(), lr=self.lr)
        self.scheduler = {
            "q": ReduceLROnPlateau(
                self.optimizer, mode="min", factor=0.5, patience=2
            ),
            "v": ReduceLROnPlateau(
                self.optimizer, mode="min", factor=0.5, patience=2
            ),
            "pi": ReduceLROnPlateau(
                self.optimizer, mode="min", factor=0.5, patience=2
            ),
        }

        # Training stats
        stats = {
            "dimension": dim,
            "q_losses": [],
            "v_losses": [],
            "pi_losses": [],
            "policy_entropies": [],
            "avg_q_loss": 0.0,
            "avg_v_loss": 0.0,
            "avg_pi_loss": 0.0,
            "policy_entropy": 0.0,
            "policy_stability": 0.0,
        }

        # Training loop
        for epoch in range(self.epochs):
            epoch_stats = self._train_epoch(model, dataloader)
            stats["q_losses"].append(epoch_stats["q"])
            stats["v_losses"].append(epoch_stats["v"])
            stats["pi_losses"].append(epoch_stats["pi"])

            # Calculate policy entropy
            policy_logits = self._calculate_policy_logits(model)
            policy_entropy = self._calculate_policy_entropy(policy_logits)
            stats["policy_entropies"].append(policy_entropy)

            # Early stopping check
            if self._should_stop_early(stats["q_losses"][-1]):
                self.logger.log(
                    "EarlyStopping",
                    {
                        "dimension": dim,
                        "epoch": epoch + 1,
                        "best_loss": self.best_loss,
                    },
                )
                break

        # Final stats
        stats["avg_q_loss"] = np.mean(stats["q_losses"])
        stats["avg_v_loss"] = np.mean(stats["v_losses"])
        stats["avg_pi_loss"] = np.mean(stats["pi_losses"])
        stats["policy_entropy"] = np.mean(stats["policy_entropies"])
        stats["policy_stability"] = (
            max(stats["policy_entropies"])
            if stats["policy_entropies"]
            else 0.0
        )

        # Save model
        meta = self._save_model(model, dim, stats)
        stats.update(meta)

        # Log to database
        self._log_training_stats(dim, meta)

        self.logger.log(
            "DimensionTrainingComplete",
            {
                "dimension": dim,
                "final_q_loss": stats["avg_q_loss"],
                "final_v_loss": stats["avg_v_loss"],
                "final_pi_loss": stats["avg_pi_loss"],
            },
        )

        # Cache model
        self.models[dim] = model
        return stats

    def _log_training_stats(self, dim, meta):
        """Log training stats to database"""
        training_stats = TrainingStatsORM(
            model_type="sicql",
            target_type=self.cfg.get("target_type", "document"),
            dimension=dim,
            version=meta["version"],
            embedding_type=self.embedding_type,
            avg_q_loss=meta["avg_q_loss"],
            avg_v_loss=meta["avg_v_loss"],
            avg_pi_loss=meta["avg_pi_loss"],
            policy_entropy=meta.get("policy_entropy", 0.0),
            policy_stability=meta.get("policy_stability", 0.0),
        )
        self.memory.session.add(training_stats)
        self.memory.session.commit()

    def _train_sicql(self, model, dataloader, output_dir):
        """Train SICQL model with all heads"""
        model.train()
        best_loss = float("inf")
        patience_counter = 0

        # Build optimizers
        optimizers = {
            "encoder": optim.Adam(model.encoder.parameters(), lr=self.lr),
            "q_head": optim.Adam(model.q_head.parameters(), lr=self.lr),
            "v_head": optim.Adam(model.v_head.parameters(), lr=self.lr),
            "pi_head": optim.Adam(model.pi_head.parameters(), lr=self.lr),
        }

        # Build schedulers
        schedulers = {
            "encoder": ReduceLROnPlateau(
                optimizers["encoder"], mode="min", factor=0.5, patience=2
            ),
            "q_head": ReduceLROnPlateau(
                optimizers["q_head"], mode="min", factor=0.5, patience=2
            ),
            "v_head": ReduceLROnPlateau(
                optimizers["v_head"], mode="min", factor=0.5, patience=2
            ),
            "pi_head": ReduceLROnPlateau(
                optimizers["pi_head"], mode="min", factor=0.5, patience=2
            ),
        }

        # Training loop
        for epoch in range(self.epochs):
            total_q_loss = 0.0
            total_v_loss = 0.0
            total_pi_loss = 0.0
            count = 0

            for ctx_emb, doc_emb, scores in tqdm(
                dataloader, desc=f"Epoch {epoch + 1}"
            ):
                # Device management
                ctx_emb = ctx_emb.to(self.device)
                doc_emb = doc_emb.to(self.device)
                scores = scores.to(self.device)

                # Forward pass
                outputs = model(ctx_emb, doc_emb)

                # Q-head loss
                q_loss = F.mse_loss(outputs["q_value"], scores)

                # V-head loss
                v_loss = self._expectile_loss(
                    scores - outputs["state_value"],
                    tau=self.cfg.get("expectile", 0.7),
                )

                # Policy head loss
                pi_loss = torch.tensor(0.0, device=self.device)
                if self.use_gild:
                    advantage = (
                        outputs["q_value"] - outputs["state_value"]
                    ).detach()
                    weights = torch.exp(self.beta * advantage)
                    weights = weights / weights.sum()

                    policy_probs = F.softmax(outputs["action_logits"], dim=-1)
                    entropy = -torch.sum(
                        policy_probs * torch.log(policy_probs + 1e-8), dim=-1
                    ).mean()

                    pi_loss = -(
                        F.log_softmax(outputs["action_logits"], dim=-1)
                        * weights
                    ).mean()
                    pi_loss += self.entropy_weight * entropy

                # Backward pass
                optimizers["q_head"].zero_grad()
                q_loss.backward()
                optimizers["q_head"].step()

                optimizers["v_head"].zero_grad()
                v_loss.backward()
                optimizers["v_head"].step()

                if self.use_gild:
                    optimizers["pi_head"].zero_grad()
                    pi_loss.backward()
                    optimizers["pi_head"].step()

                # Track losses
                total_q_loss += q_loss.item() * ctx_emb.size(0)
                total_v_loss += v_loss.item() * ctx_emb.size(0)
                total_pi_loss += pi_loss.item() * ctx_emb.size(0)
                count += ctx_emb.size(0)

            # End of epoch
            avg_q = total_q_loss / count
            avg_v = total_v_loss / count
            avg_pi = total_pi_loss / count

            # Early stopping
            if avg_q < best_loss - self.early_stopping_min_delta:
                best_loss = avg_q
                patience_counter = 0
                # Save best model
                torch.save(
                    model.encoder.state_dict(), f"{output_dir}/encoder.pt"
                )
                torch.save(
                    model.q_head.state_dict(), f"{output_dir}/q_head.pt"
                )
                torch.save(
                    model.v_head.state_dict(), f"{output_dir}/v_head.pt"
                )
                torch.save(
                    model.pi_head.state_dict(), f"{output_dir}/pi_head.pt"
                )
            else:
                patience_counter += 1

            # Log epoch
            self.logger.log(
                "SICQLTrainingEpoch",
                {
                    "epoch": epoch + 1,
                    "q_loss": avg_q,
                    "v_loss": avg_v,
                    "pi_loss": avg_pi,
                    "lr": optimizers["q_head"].param_groups[0]["lr"],
                },
            )

            # Check for early stopping
            if patience_counter >= self.early_stopping_patience:
                self.logger.log(
                    "SICQLEarlyStopping",
                    {"epoch": epoch + 1, "best_loss": best_loss},
                )
                break

        self.logger.log("SICQLTrainingComplete", {"best_loss": best_loss})
        return model

    def _save_model(self, model, dimension, stats):
        """Save SICQL model components"""
        locator = super().get_locator(dimension)
        # Save components separately
        torch.save(model.encoder.state_dict(), locator.encoder_file())
        torch.save(model.q_head.state_dict(), locator.q_head_file())
        torch.save(model.v_head.state_dict(), locator.v_head_file())
        torch.save(model.pi_head.state_dict(), locator.pi_head_file())

        # Calculate policy metrics
        policy_logits = model.pi_head.get_policy_weights().tolist()
        policy_probs_tensor = F.softmax(torch.tensor(policy_logits), dim=-1)
        policy_probs = policy_probs_tensor.tolist()
        policy_entropy = -torch.sum(
            policy_probs_tensor * torch.log(policy_probs_tensor + 1e-8)
        ).item()
        policy_stability = max(policy_probs)


        # Build metadata
        meta = {
            "dim": self.dim,
            "hdim": self.hdim,
            "dimension": dimension,
            "version": self.cfg.get("model_version", "v1"),
            "avg_q_loss": float(stats["avg_q_loss"]),
            "avg_v_loss": float(stats["avg_v_loss"]),
            "avg_pi_loss": float(stats["avg_pi_loss"]),
            "policy_entropy": float(policy_entropy),
            "policy_stability": float(policy_stability),
            "policy_logits": policy_logits,
            "policy_probs": policy_probs,
            "embedding_type": self.embedding_type,
            "max_value": 100,
            "min_value": 0,
            "device": str(self.device), 
            "timestamp": datetime.utcnow().isoformat(),
        }

        super()._save_meta_file(meta, dimension)
        return meta

    def run(self, context: dict) -> dict:
        """Main entry point for training"""
        documents = context.get("documents", [])
        # Train each dimension
        results = {}
        for dim in self.dimensions:
            # Get training samples
            samples = self._get_samples(context, documents, dim)
            if not samples:
                continue

            # Train model
            stats = self.train(samples, dim)
            if "error" in stats:
                continue

            # Update belief cartridges
            self._update_belief_cartridge(context, dim, stats)
            results[dim] = stats

        # Update context with results
        context["training_stats"] = results
        return context

    def _get_samples(self, context, documents, dim):
        """Get training samples for dimension"""
        samples = []
        goal = context.get("goal", {})
        for doc in documents:
            scorable = ScorableFactory.from_dict(doc, TargetType.DOCUMENT)
            score = self.memory.scores.get_score(goal.get("id"), scorable.id)
            if score:
                samples.append(
                    {
                        "title": goal.get("goal_text", ""),
                        "output": scorable.text,
                        "score": score.score,
                    }
                )
        return samples

    def _update_belief_cartridge(self, context, dim, stats):
        """Update belief cartridges with policy stats"""
        policy_logits = stats.get("policy_logits", [0.3, 0.7, 0.0])
        policy_probs = F.softmax(torch.tensor(policy_logits), dim=-1).tolist()

        # Build belief cartridge
        cartridge = BeliefCartridgeORM(
            title=f"{dim} policy",
            content=f"Policy head weights: {policy_probs}",
            goal_id=context.get("goal_id"),
            domain=dim,
            policy_logits=policy_probs,
            policy_entropy=stats.get("policy_entropy", 1.05),
            policy_stability=stats.get("policy_stability", 0.82),
        )
        self.memory.session.add(cartridge)
        self.memory.session.commit()

Summary: What train_sicql Does and Why

The train_sicql method trains a SICQL-based model on document comparison data. Specifically:

  • It receives a batch of contrastive preference pairs, where each item includes:

    • A prompt (goal, context, or title)
    • Two candidate outputs (A and B)
    • Their LLM-aligned scores (used as supervised targets)
  • For each candidate:

    • It generates in-context embeddings via a shared encoder.

    • It passes these into the model’s:

      • Q-head to predict quality (MSE-trained)
      • V-head to estimate state value using expectile regression
      • π-head to predict policies from advantage-weighted regression
  • The training loss is a weighted combination of:

    • MSE between Q and LLM score (to match human-like feedback)
    • Expectile loss between Q and V (for stability and robustness)
    • AWR loss from π to (Q − V) (to promote sharp, directed improvement)
  • The optimizer updates all heads simultaneously and uses learning rate decay (ReduceLROnPlateau) for adaptive control.

Training stops early if validation loss stagnates.
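
The `_expectile_loss` helper used for the V-head isn't shown above. A minimal sketch of an expectile loss in this style, assuming the same `diff = scores - state_value` convention as the training loop, could look like this:

```python
import torch

def expectile_loss(diff: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    """Asymmetric squared error (sketch).

    With tau > 0.5, under-estimates of the target (diff > 0) are penalized
    more heavily, pushing the V-head toward an upper expectile of the scores.
    """
    weight = torch.where(diff > 0, torch.full_like(diff, tau), torch.full_like(diff, 1.0 - tau))
    return (weight * diff.pow(2)).mean()
```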

🧠 Notes and Takeaways

  • Head separation is modular and clean: Q, V, and π heads are trained with individual optimizers and schedulers. This allows each head to converge independently, a critical property for self-improvement loops that rely on Q-stability and policy plasticity.

  • GILD-style weighting is powerful: Advantage-based reweighting in _train_epoch() is what gives the model its self-imitation capability. If you’re experimenting with new learning dynamics, this is the hotspot.

  • Policy metrics matter: The entropy and stability scores stored in meta and BeliefCartridgeORM are used later in the policy report. If your model’s performance stalls, these are the first things to inspect.

  • Tuner support is optional: If you’re experimenting with raw scores (e.g., from EBT or SICQL itself), consider disabling the tuner or replacing it with a learned reward transformer.

  • Logging is integrated throughout: Events like early stopping, model saving, and policy analysis are all logged via self.logger.log(...). These logs will feed into Stephanie’s introspective trace system in later steps.


🆚 SICQL vs. Standard MRQ

| Feature | MRQ | SICQL |
|---|---|---|
| Architecture | Linear or shallow MLP | Transformer-style multi-head MLP |
| Learning Target | Scalar score via contrastive diff | Full value function + policy gradients |
| Context Awareness | Optional, indirect | Built-in via context encoder |
| Loss Function | Regression + optional tuning | Q loss + V expectile + π AWR |
| Policy Output | None | Yes, learns a decision policy |
| Adaptability | Static | Dynamic and self-adjusting |

SICQL equips Stephanie with world-model-aware decision-making, making it suitable for long-term, goal-driven reasoning loops where scoring, choosing, and adapting must co-evolve.


🤯 How GILD enhances SICQL

GILD isn’t just another scoring mechanism—it’s the first working implementation of what we might call reflective intelligence. While current AI systems hit a ceiling (their intelligence constrained by initial training), Stephanie’s intelligence becomes unbounded. With GILD, she doesn’t just get better at scoring documents—she gets better at getting better.

This has profound implications:

  • Medical diagnosis systems that recognize when they're uncertain and seek additional information
  • Scientific research assistants that refine their own evaluation criteria as they learn
  • Education platforms that adapt not just to students, but to their own teaching effectiveness

Most importantly, GILD provides a blueprint for building AI that develops genuine understanding rather than just processing information.

🧠 GILD-Style Policy Update

Imagine a chef who not only tastes their dish but analyzes why the spices clashed. GILD is Stephanie’s kitchen—where she turns every ‘mistake’ into a permanent upgrade.

In GILD, the goal is to learn a better policy by emphasizing actions with higher advantage, i.e., where the model is confident it can do better than its current baseline. The block below is the core of that policy update loop:

if self.use_gild and "action_logits" in outputs:
    advantage = (outputs["q_value"] - outputs["state_value"]).detach() # The 'aha!' moment: How much better was this than expected?
    weights = torch.exp(self.beta * advantage) # Prioritizing learning based on actual success
    weights = weights / weights.sum()

    weights = weights.unsqueeze(-1)  # Ensure shape (batch_size, 1)

    log_probs = F.log_softmax(outputs["action_logits"], dim=-1)
    pi_loss = -(log_probs * weights).mean()                   # Surgical policy adjustment

    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    pi_loss += self.entropy_weight * entropy

🔍 What’s Happening Here

1. Compute Advantage

advantage = (outputs["q_value"] - outputs["state_value"]).detach()
  • This tells us how much better an action is than the expected baseline (state value).
  • A high advantage means the policy should focus more on that sample.

2. Compute Weights for Advantage-Weighted Imitation

weights = torch.exp(self.beta * advantage)
weights = weights / weights.sum()
  • This forms the heart of advantage-weighted learning.

  • beta controls how “greedy” we are:

    • High β → focus on high-advantage samples.
    • Low β → softer weighting.
  • Normalizing ensures that all weights sum to 1 (like probabilities).


3. Calculate Policy Loss

log_probs = F.log_softmax(outputs["action_logits"], dim=-1)
pi_loss = -(log_probs * weights).mean()
  • This is a weighted log-likelihood loss, standard for policy distillation.
  • Instead of imitating everything equally, we favor high-advantage actions.

4. Optional: Entropy Regularization

entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
pi_loss += self.entropy_weight * entropy
  • This term encourages the model to remain uncertain when necessary.
  • It prevents premature convergence to overconfident (but possibly wrong) policies.
  • Entropy weight balances exploration vs exploitation.
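
Putting the four steps together, here is a self-contained sketch of the advantage-weighted update. The tensor names mirror the snippet above; `beta` and `entropy_weight` are the same hyperparameters (the values below are illustrative, not Stephanie's defaults):

```python
import torch
import torch.nn.functional as F

def gild_policy_loss(q_value, state_value, action_logits,
                     beta: float = 1.0, entropy_weight: float = 0.01):
    """Advantage-weighted imitation loss with entropy regularization (sketch)."""
    # 1. Advantage: how much better than the baseline each sample was
    advantage = (q_value - state_value).detach()

    # 2. Advantage weights, normalized like probabilities over the batch
    weights = torch.exp(beta * advantage)
    weights = weights / weights.sum()
    weights = weights.unsqueeze(-1)           # shape (batch, 1) for broadcasting

    # 3. Weighted log-likelihood (policy distillation)
    log_probs = F.log_softmax(action_logits, dim=-1)
    pi_loss = -(log_probs * weights).mean()

    # 4. Entropy regularization term (same sign convention as the loop above)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return pi_loss + entropy_weight * entropy

# Illustrative shapes: a batch of 4 samples, 3 candidate actions
q = torch.randn(4)
v = torch.randn(4)
logits = torch.randn(4, 3)
loss = gild_policy_loss(q, v, logits)
```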

🧠 Why It Matters — from Self‑Aware Thinking to Self‑Aware Architecture

The previous section showed how SICQL + GILD makes Stephanie a self‑aware learner:

  • She can tell which actions were good (advantage).
  • She imitates success instead of noise (advantage‑weighted updates).
  • She guards against tunnel‑vision with entropy‑driven exploration.

In short, Stephanie now improves how she thinks—not just what she thinks.

But reflective intelligence at the reasoning layer uncovered a very human problem at the infrastructure layer: clutter. We suddenly have …

| Layer | Variety |
|---|---|
| Dimensions | alignment, clarity, novelty, implementability, relevance … |
| Scorers | MRQ, SVM, EBT, SICQL (+ future plug‑ins) |
| Embeddings | H‑Net, Hugging Face, Ollama, custom vectors |

Left unchecked, this combinatorial explosion becomes model‑hell—dozens of un‑versioned .pt blobs hiding in random folders, impossible to track, compare, or retire.

So before we push further into self‑improvement, we need Stephanie to practise what she preaches: structured self‑awareness. At the system level that means a single source‑of‑truth for every model component, across every dimension, scorer, and embedding backend.


🧑‍🏫 Managing the Models — Enter ModelLocator

As our evaluation pipeline grew to support multiple scorers (MRQ, EBT, SICQL, SVM), an expanding set of dimensions, and three embedding back‑ends (H‑Net, Ollama, Hugging Face), manually juggling file paths and versions became a liability.

ModelLocator is our antidote: a tiny, opinionated registry that…

  1. Names every model by embedding × scorer × target × dimension × version.
  2. Creates a predictable folder skeleton on first use.
  3. Loads / saves encoders, heads, tuners, and metadata with one‑liners.
  4. Discovers “latest” or “best” models automatically, so orchestration code stays declarative.

With ModelLocator, Stephanie’s cognitive toolbox is now as tidy and introspective as her reasoning process—paving the way for the next layer of self‑improvement.

🔧 The ModelLocator Utility

To solve this, we built the ModelLocator, a robust utility that encapsulates path resolution, directory creation, and model introspection:

locator = ModelLocator(
    root_dir="models",
    embedding_type="hnet",
    model_type="sicql",
    target_type="document",
    dimension="alignment",
    version="v1"
)

With this, we can:

  • Get the full path to the model, encoder, or meta files
  • Save SICQL heads (Q, V, π) to versioned directories
  • Load a trained InContextQModel in one line: model = locator.load_sicql_model(device="cuda")

It also supports discovery:

# List all available models in the system
ModelLocator.list_available_models()

# Find latest model per dimension
ModelLocator.find_best_model_per_dimension()

📦 Folder Structure

We adopted a clean, hierarchical structure:

Here you can see the MRQ model structure compared to the newer SICQL models.

📦 models                                                                                                                   
└── 📦  hnet
    ├── 📁  ebt
    ├── 📁  mrq
    │   └── 📁  document
    │       ├── 📁  alignment
    │       │   └── 📁  v1
    │       │       ├── ⚙️  alignment.meta.json
    │       │       ├── 📦  alignment.pt
    │       │       ├── 🎚️  alignment.tuner.json
    │       │       └── 🧠  alignment_encoder.pt
    │       ├── 📁  clarity
    │       │   └── 📁  v1
    │       │       ├── ⚙️  clarity.meta.json
    │       │       ├── 📦  clarity.pt
    │       │       ├── 🎚️  clarity.tuner.json
    │       │       └── 🧠  clarity_encoder.pt
    │       ├── 📁  implementability
    │       │   └── 📁  v1
    │       │       ├── ⚙️  implementability.meta.json
    │       │       ├── 📦  implementability.pt
    │       │       ├── 🎚️  implementability.tuner.json
    │       │       └── 🧠  implementability_encoder.pt
    │       ├── 📁  novelty
    │       │   └── 📁  v1
    │       │       ├── ⚙️  novelty.meta.json
    │       │       ├── 📦  novelty.pt
    │       │       ├── 🎚️  novelty.tuner.json
    │       │       └── 🧠  novelty_encoder.pt
    │       └── 📁  relevance
    │           └── 📁  v1
    │               ├── ⚙️  relevance.meta.json
    │               ├── 📦  relevance.pt
    │               ├── 🎚️  relevance.tuner.json
    │               └── 🧠  relevance_encoder.pt
    ├── 📁  sicql
    │   └── 📁  document
    │       ├── 📁  alignment
    │       │   └── 📁  v1
    │       │       ├── ⚙️  alignment.meta.json
    │       │       ├── 📦  alignment.pt
    │       │       ├── 🎚️  alignment.tuner.json
    │       │       ├── 🧠  alignment_encoder.pt
    │       │       ├── 📦  alignment_pi.pt
    │       │       ├── 📦  alignment_q.pt
    │       │       └── 📦  alignment_v.pt
    │       ├── 📁  clarity
    │       │   └── 📁  v1
    │       │       ├── ⚙️  clarity.meta.json
    │       │       ├── 📦  clarity.pt
    │       │       ├── 🎚️  clarity.tuner.json
    │       │       ├── 🧠  clarity_encoder.pt
    │       │       ├── 📦  clarity_pi.pt
    │       │       ├── 📦  clarity_q.pt
    │       │       └── 📦  clarity_v.pt
    │       ├── 📁  implementability
    │       │   └── 📁  v1
    │       │       ├── ⚙️  implementability.meta.json
    │       │       ├── 📦  implementability.pt
    │       │       ├── 🎚️  implementability.tuner.json
    │       │       ├── 🧠  implementability_encoder.pt
    │       │       ├── 📦  implementability_pi.pt
    │       │       ├── 📦  implementability_q.pt
    │       │       └── 📦  implementability_v.pt
    │       ├── 📁  novelty
    │       │   └── 📁  v1
    │       │       ├── ⚙️  novelty.meta.json
    │       │       ├── 📦  novelty.pt
    │       │       ├── 🎚️  novelty.tuner.json
    │       │       ├── 🧠  novelty_encoder.pt
    │       │       ├── 📦  novelty_pi.pt
    │       │       ├── 📦  novelty_q.pt
    │       │       └── 📦  novelty_v.pt
    │       └── 📁  relevance
    │           └── 📁  v1
    │               ├── ⚙️  relevance.meta.json
    │               ├── 📦  relevance.pt
    │               ├── 🎚️  relevance.tuner.json
    │               ├── 🧠  relevance_encoder.pt
    │               ├── 📦  relevance_pi.pt
    │               ├── 📦  relevance_q.pt
    │               └── 📦  relevance_v.pt

This structure is consistent across all model types and embedding backends, enabling Stephanie to compare, trace, and evolve its internal scoring systems with ease.

📦 Model Locator: A Unified Interface for Model File Management

class ModelLocatorMixin:
    class Locator:
        def __init__(
            self,
            root_dir: str,
            model_type: str,
            target_type: str,
            dimension: str,
            version: str,
            embedding_type: str,
        ):
            self.root_dir = root_dir
            self.model_type = model_type
            self.target_type = target_type
            self.dimension = dimension
            self.version = version
            self.embedding_type = embedding_type

        @property
        def base_path(self) -> str:
            path = os.path.join(
                self.root_dir,
                self.embedding_type,
                self.model_type,
                self.target_type,
                self.dimension,
                self.version,
            )
            os.makedirs(path, exist_ok=True)
            return path

        # Model-specific paths
        def model_file(self, suffix: str = ".pt") -> str:
            return os.path.join(self.base_path, f"{self.dimension}{suffix}")

        def encoder_file(self) -> str:
            return os.path.join(self.base_path, f"{self.dimension}_encoder.pt")

        def get_q_head_path(self) -> str:
            return os.path.join(self.base_path, f"{self.dimension}_q.pt")

        def get_v_head_path(self) -> str:
            return os.path.join(self.base_path, f"{self.dimension}_v.pt")

        def get_pi_head_path(self) -> str:
            return os.path.join(self.base_path, f"{self.dimension}_pi.pt")

        def meta_file(self) -> str:
            return os.path.join(self.base_path, f"{self.dimension}.meta.json")

        def tuner_file(self) -> str:
            return os.path.join(self.base_path, f"{self.dimension}.tuner.json")

        def scaler_file(self) -> str:
            return os.path.join(self.base_path, f"{self.dimension}_scaler.joblib")

    def get_model_name(self) -> str:
        return f"{self.target_type}_{self.model_type}_{self.model_version}"

    def get_locator(self, dimension: str):
        return self.Locator(
            root_dir=self.model_path,  # Path to the root directory for models
            model_type=self.model_type,
            target_type=self.target_type,
            dimension=dimension,
            version=self.version,
            embedding_type=self.embedding_type,
        )

In Stephanie's modular self-improvement engine, every scoring model (MRQ, EBT, SICQL, SVM) must load and save files, often with different components (encoders, heads, scalers, metadata) and under different directory structures depending on the embedding type, version, or dimension.

To standardize this process, we use a mixin class called ModelLocatorMixin. This class provides a simple and reliable interface to generate all required paths for any model component, ensuring consistency, portability, and version safety.

This design allows any trainer or scorer to quickly determine where to find (or store) its encoder, policy head, or metadata file, all without hardcoding paths or scattering logic across the codebase.


🧱 Core Structure

The ModelLocatorMixin includes an internal Locator class that generates structured paths based on the following configuration attributes:

  • root_dir: Base directory for all models (e.g. models/)
  • embedding_type: Type of embedding (e.g. hnet, hf, ollama)
  • model_type: Type of model (e.g. mrq, ebt, sicql)
  • target_type: Entity being scored (e.g. document, triplet, cartridge)
  • dimension: Evaluation axis (e.g. alignment, novelty, clarity)
  • version: Model version for checkpointing (e.g. v1, v2, latest)

These are combined into a canonical directory structure:

models/{embedding_type}/{model_type}/{target_type}/{dimension}/{version}/

Inside each directory, the following file types are handled:

| File | Purpose |
|---|---|
| dimension.pt | Main model file |
| dimension_encoder.pt | Encoder backbone |
| dimension_q.pt | Q-head (for SICQL) |
| dimension_v.pt | V-head |
| dimension_pi.pt | Policy head |
| dimension.meta.json | Metadata and training context |
| dimension.tuner.json | Scaler/tuner parameters |
| dimension_scaler.joblib | Feature scaler (SVM or regression) |

⚙️ Functional Overview

locator = self.get_locator("alignment")
model_path = locator.model_file()
encoder_path = locator.encoder_file()
meta_path = locator.meta_file()
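
For instance, instantiating the nested Locator shown above with the configuration used throughout this post resolves to the canonical layout we just described (note that accessing any of these paths also creates the directory, since base_path calls os.makedirs):

```python
locator = ModelLocatorMixin.Locator(
    root_dir="models",
    model_type="sicql",
    target_type="document",
    dimension="alignment",
    version="v1",
    embedding_type="hnet",
)

print(locator.model_file())       # models/hnet/sicql/document/alignment/v1/alignment.pt
print(locator.encoder_file())     # models/hnet/sicql/document/alignment/v1/alignment_encoder.pt
print(locator.get_q_head_path())  # models/hnet/sicql/document/alignment/v1/alignment_q.pt
print(locator.meta_file())        # models/hnet/sicql/document/alignment/v1/alignment.meta.json
```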

🎯 How Stephanie Loads and Calls (SICQL or any) Models for Multi-Dimensional Scoring

Stephanie's scoring engine supports goal-conditioned evaluations across multiple dimensions such as alignment, clarity, relevance, novelty, and implementability. Each of these dimensions has its own dedicated model, trained and stored separately using a SICQL architecture (Scalable In-Context Q-Learning).

🧠 Model Structure Per Dimension

For each evaluation dimension, Stephanie stores a set of SICQL components under:

models/{embedding_type}/sicql/document/{dimension}/v1/
├── {dimension}_encoder.pt   ← Shared encoder
├── {dimension}_q.pt         ← Q-value head
├── {dimension}_v.pt         ← V-value head (state value)
├── {dimension}_pi.pt        ← Policy head (action logits)
├── {dimension}.meta.json    ← Scoring range metadata
├── {dimension}.tuner.json   ← Regression tuner for score calibration

This modular design enables each model to specialize independently, giving Stephanie fine-grained control over how it evaluates documents across different axes.


🎯 Scoring with SICQL: A Learned Q-Value Perspective

Once models have been trained via the SICQLTrainer, we need a reliable way to score documents, hypotheses, or ideas using that learned policy. That’s the role of the SICQLScorer.

The SICQLScorer is Stephanie’s inference-time policy engine, turning a goal and a scorable object (like a document) into a rich set of outputs:

  • Q-value (expected reward)
  • V-value (baseline state value)
  • Advantage (Q − V)
  • Policy logits (confidence in actions)
  • Entropy and Uncertainty (measuring exploration and alignment)
  • And a final, scaled score that feeds into belief formation and downstream pipelines

All of this is done per dimension: each dimension (e.g., alignment, novelty, clarity) has its own heads, embedding flow, and optional tuner.

Let’s walk through the code.


🎯 Advanced Scoring with SICQL

Scoring is one of the most critical operations in Stephanie's reasoning system. We've invested heavily in consolidating our approach, making scorers both modular and powerful. The SICQLScorer you're about to see represents our most advanced scoring strategy to date. It supports goal-conditioned scoring and multi-dimensional evaluations, and it is fully interchangeable with other scorers like MRQ, EBT, and SVM.

What makes this engine special is that it's flexible by design, capable of applying multiple scoring strategies to any piece of text (documents, hypotheses, even prompts). Because this scorer plays such a central role in Stephanie's self-improvement loop, we're including the full source code here. It's a big block, yes, but it's foundational. Think of it as the "reasoning brain" behind many of Stephanie's judgments.


class SICQLScorer(BaseScorer):
    def __init__(self, cfg, memory, logger):
        super().__init__(cfg, memory, logger)
        self.model_type = "sicql"
        self.embedding_type = memory.embedding.type
        self.dim = memory.embedding.dim
        self.hdim = memory.embedding.hdim

        self.target_type = cfg.get("target_type", "document")
        self.model_path = cfg.get("model_path", "models")
        self.version = cfg.get("model_version", "v1")

        self.models = {}
        self.model_meta = {}
        self.tuners = {}

        self.dimensions = cfg.get("dimensions", [])
        self._load_models(self.dimensions)

    def _load_models(self, dimensions):
        for dim in dimensions:
            locator = ModelLocator(
                root_dir=self.model_path,
                embedding_type=self.embedding_type,
                model_type=self.model_type,
                target_type=self.target_type,
                dimension=dim,
                version=self.version,
            )

            encoder = TextEncoder(dim=self.dim, hdim=self.hdim).to(self.device)
            q_head = QHead(zsa_dim=self.dim, hdim=self.hdim).to(self.device)
            v_head = VHead(zsa_dim=self.dim, hdim=self.hdim).to(self.device)
            pi_head = PolicyHead(zsa_dim=self.dim, hdim=self.hdim, num_actions=3).to(self.device)

            encoder.load_state_dict(torch.load(locator.encoder_file(), map_location=self.device))
            q_head.load_state_dict(torch.load(locator.q_head_file(), map_location=self.device))
            v_head.load_state_dict(torch.load(locator.v_head_file(), map_location=self.device))
            pi_head.load_state_dict(torch.load(locator.pi_head_file(), map_location=self.device))

            model = InContextQModel(
                encoder=encoder,
                q_head=q_head,
                v_head=v_head,
                pi_head=pi_head,
                embedding_store=self.memory.embedding,
                device=self.device,
            )
            self.models[dim] = model

            meta = load_json(locator.meta_file()) if os.path.exists(locator.meta_file()) else {"min_value": 0, "max_value": 100}
            self.model_meta[dim] = meta

            tuner_path = locator.tuner_file()
            if os.path.exists(tuner_path):
                tuner = RegressionTuner(dimension=dim)
                tuner.load(tuner_path)
                self.tuners[dim] = tuner


    def score(self, goal: dict, scorable: Scorable, dimensions: list[str]) -> ScoreBundle:
        goal_text = goal.get("goal_text")
        results = {}

        for dim in dimensions:
            model = self.models.get(dim)
            prompt_emb = torch.tensor(
                self.memory.embedding.get_or_create(goal_text), device=self.device
            ).unsqueeze(0)
            output_emb = torch.tensor(
                self.memory.embedding.get_or_create(scorable.text), device=self.device
            ).unsqueeze(0)
            result = model(prompt_emb, output_emb)


            q_value = result["q_value"].item()
            v_value = result["state_value"].item()
            policy_logits = result["action_logits"].cpu().detach().numpy().tolist()

            if isinstance(policy_logits, list) and len(policy_logits) == 1:
                if isinstance(policy_logits[0], list):
                    # [[0.1166]] → [0.1166]
                    policy_logits = policy_logits[0]

            self.logger.log("PolicyLogits", {"dimension": dim, "logits": policy_logits})

            # Calculate uncertainty (|Q - V|)
            uncertainty = abs(q_value - v_value)
            
            # Calculate entropy from policy logits
            policy_tensor = torch.tensor(policy_logits)
            action_probs = F.softmax(policy_tensor, dim=-1)
            entropy = -torch.sum(action_probs * torch.log(action_probs + 1e-8)).item()
            
            # Calculate advantage
            advantage = q_value - v_value
            meta = self.model_meta.get(dim, {"min_value": 0, "max_value": 100})
            if dim in self.tuners:
                scaled_score = self.tuners[dim].transform(q_value)
            else:
                normalized = torch.sigmoid(torch.tensor(q_value)).item()
                scaled_score = normalized * (meta["max_value"] - meta["min_value"]) + meta["min_value"]

            scaled_score = max(min(scaled_score, meta["max_value"]), meta["min_value"])


            final_score = round(scaled_score, 4)
            prompt_hash = ScoreORM.compute_prompt_hash(goal_text, scorable)

            rationale = f"Q={q_value:.4f}, V={v_value:.4f}, Δ={uncertainty:.3f}, H={entropy:.3f}"

            results[dim] = ScoreResult(
                        dimension=dim,
                        score=final_score,
                        rationale=rationale,
                        weight=1.0,
                        q_value=q_value,
                        energy=q_value,
                        source=self.name,
                        target_type=scorable.target_type,
                        prompt_hash=prompt_hash,
                        state_value=v_value,
                        policy_logits=policy_logits,
                        uncertainty=uncertainty,
                        entropy=entropy,
                        advantage=advantage,
                    )
        return ScoreBundle(results=results)

👷 How It Works

🔁 Initialization

self._load_models(self.dimensions)

Each dimension gets:

  • A TextEncoder
  • Three heads: QHead, VHead, and PolicyHead
  • Optionally: A regression tuner
  • Associated metadata (e.g., score range)

These models are loaded from disk using the ModelLocator system and moved to the correct device (CPU/GPU).

🧪 Scoring

The main interface is:

score(goal: dict, scorable: Scorable, dimensions: list[str]) -> ScoreBundle

For each dimension:

  1. Embeddings are generated for the goal and scorable.

  2. The model computes:

    • q_value, state_value, and action_logits
  3. From these, we compute:

    • uncertainty = |Q − V|
    • entropy from the action distribution
    • advantage = Q − V
  4. The score is scaled using:

    • RegressionTuner, if available
    • Otherwise: Sigmoid + linear normalization from metadata
  5. Results are returned in a structured ScoreResult (with rationale and metadata).


🔢 The Five Scorers

Stephanie doesn't just rely on one brain; she gathers feedback from a panel of experts, each offering a different perspective. We call these scorers, and they represent different strategies for evaluating a document in the context of a goal.

| Scorer | Type | Role in System |
|---|---|---|
| SICQLScorer | Reinforcement/Q-Learning | Learns optimal scoring behavior via value and policy heads |
| SVMScorer | Classical ML | Fast, interpretable baseline for linear domains |
| MRQScorer | Embedding Regression | Learns from memory using reward-tuned Q-models |
| EBTScorer | Energy-Based | Provides uncertainty-aware energy scores |
| LLMScorer | Language Model (LLM) | Ground-truth or expert demonstration guidance |

All of these scorers implement a common interface: they take in a goal and a document (or other scorable object) and return a ScoreBundle, a consistent output format that includes scores, metadata, and analysis hooks.

Each scorer is plugged into the same evaluation adapter, which allows us to flexibly run them in sequence or selectively trigger them based on context, policy, or dimension:

scorers = [sicql_scorer, svm_scorer, mrq_scorer, ebt_scorer]

for scorer in scorers:
    score_bundle = scorer.score(goal=goal, scorable=scorable, dimensions=self.dimensions)

Stephanie stores all resulting evaluations in a unified format. This means you can later analyze and compare outputs across strategies or even train new models to learn from the disagreements.
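
Continuing the snippet above, lining up each scorer's per-dimension scores side by side makes those disagreements easy to inspect. A small sketch (assuming, as with SICQLScorer, that each scorer exposes a `model_type` attribute):

```python
# Sketch: collect every scorer's per-dimension scores in one dict so that
# disagreements between strategies can be eyeballed or fed to later analysis.
per_scorer_scores = {}
for scorer in scorers:
    bundle = scorer.score(goal=goal, scorable=scorable, dimensions=self.dimensions)
    per_scorer_scores[scorer.model_type] = {
        dim: result.score for dim, result in bundle.results.items()
    }

# e.g. per_scorer_scores["sicql"]["alignment"] vs per_scorer_scores["svm"]["alignment"]
```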

Saving the scores

The ScoringManager manages scoring in Stephanie. It is configurable from a file or from a score class outlined earlier.

This is the method used to save scores.

    @staticmethod
    def save_score_to_memory(
        bundle: ScoreBundle,
        scorable: Scorable,
        context: dict,
        cfg: dict,
        memory,
        logger,
        source,
        model_name=None,
    ):
        goal = context.get("goal")
        pipeline_run_id = context.get("pipeline_run_id")
        weighted_score = bundle.calculator.calculate(bundle)

        scores_json = {
            "stage": cfg.get("stage", "review"),
            "dimensions": bundle.to_dict(),
            "final_score": round(weighted_score, 2),
        }

        if not model_name:
            model_name = cfg.get("model", {}).get("name", "UnknownModel")

        # evaluation table bucket for all related score data
        eval_orm = EvaluationORM(
            goal_id=goal.get("id"),
            pipeline_run_id=pipeline_run_id,
            target_type=scorable.target_type,
            target_id=scorable.id,
            source=source,
            agent_name=cfg.get("name"),
            model_name=model_name,
            embedding_type=memory.embedding.type,
            evaluator_name=cfg.get("evaluator", cfg.get("model_type", "ScoreEvaluator")),
            strategy=cfg.get("strategy"),
            reasoning_strategy=cfg.get("reasoning_strategy"),
            scores=scores_json,
            extra_data={"source": source},
        )
        memory.session.add(eval_orm)
        memory.session.flush()

        # Score results: the actual score with its weight and a rationale (reason) for the score
        for result in bundle.results:
            score_result = bundle.results[result]
            score = ScoreORM(
                evaluation_id=eval_orm.id,
                dimension=score_result.dimension,
                score=score_result.score,
                source=score_result.source,
                weight=score_result.weight,
                rationale=score_result.rationale,
                prompt_hash=score_result.prompt_hash
                or ScoreORM.compute_prompt_hash(goal.get("goal_text", ""), scorable),
            )
            memory.session.add(score)

            # After inserting ScoreORM we insert the related attributes (EBT and SICQL extra data live here)
            attribute = EvaluationAttributeORM(
                evaluation_id=eval_orm.id,
                dimension=score_result.dimension,
                source=score_result.source,
                raw_score=score_result.score,
                energy=score_result.energy,
                uncertainty=score_result.uncertainty,
                pi_value=score_result.policy_logits[0] if score_result.policy_logits else None,
                entropy=score_result.entropy,
                advantage=score_result.advantage,
                q_value=score_result.q_value,
                v_value=score_result.state_value,
                policy_logits=json.dumps(score_result.policy_logits),
                extra=score_result.to_dict(),
            )
            memory.session.add(attribute)

        memory.session.commit()

        logger.log(
            "ScoreSavedToMemory",
            {
                "goal_id": goal.get("id"),
                "target_id": scorable.id,
                "target_type": scorable.target_type,
                "scores": scores_json,
            },
        )
        ScoreDeltaCalculator(cfg, memory, logger).log_score_delta(
            scorable, weighted_score, goal.get("id")
        )
        ScoreDisplay.show(scorable, bundle.to_dict(), weighted_score)
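
A sketch of a typical call site, reusing the scorer and context objects from the earlier examples (the exact wiring of cfg, memory, and logger depends on the agent invoking it):

```python
# Score a document with SICQL, then persist the resulting bundle.
score_bundle = sicql_scorer.score(goal=goal, scorable=scorable, dimensions=self.dimensions)

ScoringManager.save_score_to_memory(
    bundle=score_bundle,
    scorable=scorable,
    context=context,
    cfg=self.cfg,
    memory=self.memory,
    logger=self.logger,
    source="sicql",
)
```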

As it progresses through each document's scoring, this will show a result like the following:


📊 llm Dimension Scores document:37 Summary
╒══════════════════╤═════════╤══════════╤══════════════════════════════════════════════════════════════╕
│ Dimension        │   Score │ Weight   │ Rationale (preview)                                          │
╞══════════════════╪═════════╪══════════╪══════════════════════════════════════════════════════════════╡
│ alignment        │   15    │ 1.2      │ rationale: The document discusses a multilingual translation │
├──────────────────┼─────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ clarity          │   78    │ 1.1      │ rationale: The document communicates its core idea of improv │
├──────────────────┼─────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ implementability │   75    │ 1.3      │ rationale: The document describes a complex multilingual tra │
├──────────────────┼─────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ novelty          │   88    │ 1.0      │ rationale: The document introduces a novel reward modeling a │
├──────────────────┼─────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ relevance        │   20    │ 0.8      │ rationale: The document discusses reinforcement learning for │
├──────────────────┼─────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ FINAL            │   56.54 │ -        │ Weighted average                                             │
╘══════════════════╧═════════╧══════════╧══════════════════════════════════════════════════════════════╛

The database tables look like this.

```mermaid

erDiagram
    EVALUATIONORM ||--o{ SCOREORM : has
    EVALUATIONORM ||--o{ EVALUATIONATTRIBUTEORM : has

    EVALUATIONORM {
        int id PK
        int goal_id
        int pipeline_run_id
        string target_type
        int target_id
        string source
        string agent_name
        string model_name
        string embedding_type
        string evaluator_name
        string strategy
        string reasoning_strategy
        json scores
        json extra_data
    }

    SCOREORM {
        int id PK
        int evaluation_id FK
        string dimension
        float score
        string source
        float weight
        string rationale
        string prompt_hash
    }

    EVALUATIONATTRIBUTEORM {
        int id PK
        int evaluation_id FK
        string dimension
        string source
        float raw_score
        float energy
        float uncertainty
        float pi_value
        float entropy
        float advantage
        float q_value
        float v_value
        json policy_logits
        json extra
    }
```

♻️ Toward Dynamic Scoring

Today, Stephanie uses five scorers. By the next post, there will be six. By the time this system is mature, we expect hundreds of specialized scorers tuned to domains, tasks, goals, ethical considerations, and feedback loops.

Why so many?

Because every scorer encodes a hypothesis:

“This is how value should be measured in this context.”

Some will be generalists. Many will be specialists. But all will follow this protocol: goal + input → score.

And Stephanie will not just choose between them; she will learn which ones to trust, and when. That's the beginning of dynamic reasoning.

Later, she will generate custom scorers per task.

This example shows us generating scores for the same set of documents for each scorer.

        sicql_scorer = SICQLScorer(self.cfg, memory=self.memory, logger=self.logger)  
        svm_scorer = SVMScorer(self.cfg, memory=self.memory, logger=self.logger)
        mrq_scorer = MRQScorer(self.cfg, memory=self.memory, logger=self.logger)
        ebt_scorer = EBTScorer(self.cfg, memory=self.memory, logger=self.logger)

        scorers = [sicql_scorer, svm_scorer, mrq_scorer, ebt_scorer]

        for doc in documents:

            doc_id = doc["id"]
            goal = context.get("goal", {})
            scorable = ScorableFactory.from_dict(doc, TargetType.DOCUMENT)

            for scorer in scorers:
                score_bundle: ScoreBundle = scorer.score(
                    goal=goal,
                    scorable=scorable,
                    dimensions=self.dimensions,
                )

This gives you a multidimensional score for a single document, aligned with SICQL's learned policy logic, the same logic it was trained to refine.


🔍 Step by Step - Scoring a Document with SICQL

Let’s walk through what happens when Stephanie scores a single document:

📝 Inputs

  • Goal: "Improve technical clarity of AI explanations"
  • Document: "This paper proposes a new method for unsupervised graph pretraining..."

Step 1: Embed goal and document

prompt_emb = embedding_store.get_or_create(goal_text)
doc_emb = embedding_store.get_or_create(document_text)

Step 2: Forward pass through SICQL model (for each dimension)

result = model(prompt_emb, doc_emb)
q_value = result["q_value"].item()
v_value = result["state_value"].item()
policy_logits = result["action_logits"]

Step 3: Derive evaluation signals

uncertainty = abs(q_value - v_value)
entropy = -(softmax(policy_logits) * log(softmax(policy_logits))).sum()
advantage = q_value - v_value

Step 4: Calibrate and store final score

scaled_score = tuner.transform(q_value)
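
When no regression tuner exists for a dimension, Step 4 falls back to the sigmoid + range normalization seen in SICQLScorer above. In isolation, with an illustrative q_value, that step looks like this:

```python
import torch

q_value = 2.3                       # raw Q estimate for this dimension (illustrative)
min_value, max_value = 0, 100       # score range from the dimension's meta file

normalized = torch.sigmoid(torch.tensor(q_value)).item()          # squash to (0, 1)
scaled_score = normalized * (max_value - min_value) + min_value   # map onto score range
scaled_score = max(min(scaled_score, max_value), min_value)       # clamp for safety
print(round(scaled_score, 4))
```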

✅ Benefits of Per-Dimension SICQL Models

  • Interpretability: You can inspect Q, V, entropy, advantage per-dimension.
  • Modularity: Each dimension can be improved independently and replaced without touching others.
  • Compatibility: The output format integrates smoothly with LLMs, EBTs, and GILD pipelines.

🧠 From Scoring to Understanding: Why SICQL Matters for Epistemic Improvement

Stephanie isn't just trying to rate documents; she's trying to understand the world, improve her beliefs, and make better decisions over time. That requires more than just a scalar judgment; it requires a framework for epistemic growth.

This is where SICQL becomes transformative.

Most traditional scoring pipelines (including early MRQ versions) simply predict how “good” a document is, usually by regressing toward LLM-aligned scores. But this doesn’t teach the system why one document is better than another, nor how to improve its own beliefs in light of new evidence.

SICQL changes that by reframing the entire scoring process as goal-conditioned reinforcement learning, built on three coordinated components:

  • Q: How good is this answer in this context?
  • V: How good is the overall state of understanding?
  • π: What kind of answers should I prefer going forward?

This structure mirrors how humans learn. We don't just rate options; we estimate whether we're making progress, reflect on what we expect to happen next, and revise our strategies accordingly.

Stephanie, through SICQL, begins to do the same.


🔁 An Epistemic Feedback Loop

Every time Stephanie reads a paper, generates a score, or compares two ideas, she is training herself not just to repeat judgments, but to improve how she scores, chooses, and reasons.

Here’s how SICQL fits into that feedback loop:

  1. Observation: Stephanie gathers feedback (e.g. LLM scores or human ratings) on document pairs.
  2. Encoding: She embeds the context and outputs into a shared latent space.
  3. Valuation: She uses the Q-head to estimate the quality of new documents.
  4. Reflection: She trains a V-head to predict how good her current state of knowledge is.
  5. Action Adjustment: She tunes her π-head to prefer outputs that improve her Q over V in other words, that represent epistemic progress.
  6. Iteration: She repeats the loop with updated weights and a sharper sense of what counts as “better.”

🧩 Why This Matters for Self-Improving AI

SICQL doesn’t just teach Stephanie what to think it teaches her how to improve her thinking. That’s the heart of epistemic improvement:

“Learning not just the answers, but the process that leads to better answers.”

In Stephanie’s world, models, beliefs, documents, and agents are all embedded in a reasoning system that evolves. SICQL strengthens this evolution by embedding value, policy, and contextual adaptation into every score making the system more robust, explainable, and ultimately more aligned with its goals.


🌐 SICQL in Stephanie’s Epistemic Feedback Loop

    flowchart LR
    A[Start: Document Pair & Goal Context] --> B[Embed Context + Document Prompt, Output]
    B --> C[Shared Latent Space ZSA]

    C --> D1[Q Head: Predict Document Quality]
    C --> D2[V Head: Estimate Epistemic State Value]
    C --> D3[π Head: Generate Action Preference AWR]

    D1 --> E1[Compute Q Loss: Q - LLM Score ^2]
    D2 --> E2[Compute V Loss: Expectile Q - V]
    D3 --> E3[Compute π Loss: Advantage-weighted Regression]

    E1 & E2 & E3 --> F[Combine Losses and Backpropagate]

    F --> G[Update Model Weights  Q, V, π]

    G --> H[Improved Scoring for Future Documents]
    H --> I[Better Epistemic Feedback and Belief Updates]
    I --> A

    style D1 fill:#f9f,stroke:#333,stroke-width:1px
    style D2 fill:#bbf,stroke:#333,stroke-width:1px
    style D3 fill:#bfb,stroke:#333,stroke-width:1px
    style H fill:#ffe,stroke:#333,stroke-width:1px
    style I fill:#efe,stroke:#333,stroke-width:2px
  

🔁 Diagram Legend

  • ZSA: Joint embedding of (context, document) via the encoder
  • Q Head: Learns to estimate how “good” the document is
  • V Head: Tracks the quality of the overall epistemic state
  • π Head: Adjusts preferences using the advantage of Q over V

GILD: Where Stephanie Learns to Improve Her Own Intelligence

With SICQL, Stephanie can now evaluate documents with unprecedented nuance: not just assigning scores, but understanding why certain content resonates with specific goals. She can recognize uncertainty in her judgments, measure the gap between expectation and outcome, and even predict when her confidence might be misplaced.

But here's the critical question we faced: What good is all this insight if Stephanie can't actually use it to improve herself? This is where GILD (Goal-conditioned Imitation Learning with Distillation) comes in.

GILD isn’t just another component in Stephanie’s architecture. It’s the transformative breakthrough that closes the self-improvement loop we’ve been building toward since our first post on embedding strategies. Where SICQL gives Stephanie the ability to see her thinking, GILD gives her the power to reshape it.

Why GILD Is the Pivotal Innovation

Everything we've built so far, the layered subconscious from H-Net embeddings, the dimensional scoring engines, even SICQL itself, was necessary but insufficient for true self-improvement. These pieces provided the foundation, but without GILD, Stephanie would forever remain a sophisticated observer of her own limitations rather than an active participant in her evolution.

Consider what we’ve accomplished across our journey:

  1. In “The Shape of Thought”: We gave Stephanie the ability to represent the world through multiple embedding strategies, her “ways of seeing.”
  2. With SICQL: We gave her the ability to evaluate with nuance, seeing not just what’s good, but why and how.
  3. With GILD: We’ve given her the ability to learn from her evaluations, turning insight into improved reasoning.

GILD is the missing link that transforms Stephanie from an AI that processes information into one that genuinely improves its intelligence. It’s the difference between a system that scores documents and one that evolves its scoring criteria based on what actually works.

The GILD evolution: Learning How to Learn What’s Good

At its core, GILD solves the most fundamental challenge in self-improving AI: How do you convert observational insight into actionable self-modification without destabilizing the system?

Traditional reinforcement learning approaches would require Stephanie to discard and rebuild her entire scoring model from scratch with each iteration, a process that's both computationally prohibitive and cognitively disruptive. It's like asking someone to completely rewrite their brain after every mistake.

GILD takes a radically different approach. Instead of wholesale replacement, it performs precision cognitive surgery: identifying exactly which reasoning pathways led to success (or failure) and making targeted adjustments:

# The essence of GILD's surgical precision
advantage = (q_value - state_value).detach()  # "How much better was this than expected?"
weights = torch.exp(beta * advantage)         # "How much should we prioritize this learning?"
pi_loss = -(weights * log_probs).sum()        # "Adjust policy exactly in proportion to success"

This simple but profound mechanism allows Stephanie to:

  • Preserve what works: High-advantage reasoning patterns are reinforced
  • Refine what’s uncertain: Areas of high entropy receive targeted attention
  • Discard what misfires: Low-advantage pathways are gradually de-emphasized
  • Do it all incrementally: No disruptive retraining, just continuous refinement

Why This Changes Everything

GILD represents the moment when Stephanie transcends being merely "intelligent" to becoming reflectively intelligent: capable of not just applying knowledge, but improving her very capacity for knowledge acquisition.

Where previous systems hit a ceiling (their intelligence constrained by initial training), Stephanie's intelligence becomes unbounded. With GILD, she doesn't just get better at scoring documents; she gets better at getting better.

This is why GILD isn’t just another module in our architecture. It’s the beating heart of Stephanie’s self-improvement loop, the mechanism that transforms all our previous work from isolated capabilities into a cohesive, evolving intelligence.

In the sections that follow, I'll show you exactly how GILD works: the elegant mechanics that allow Stephanie to learn from her own experience, the surgical precision of her self-tuning process, and the profound implications for what's possible when AI can genuinely improve its own thinking.

This isn't incremental progress. This is the moment we cross from static intelligence to reflective intelligence, where the system doesn't just think, but learns how to think better, one document at a time.


🔍 Initial result comparison: the ScoreComparisonAgent

We built the ScoreComparisonAgent to solve a critical problem: How does Stephanie know which scorer to trust? This agent acts like a debate moderator, analyzing disagreements between scorers to find consensus. It gives us a mechanism to compare results from every available scorer against the ground truth, typically the LLM.

    flowchart LR
    A[Start: Scored Documents + Pipeline Run IDs]

    A --> B[ScoreComparisonAgent<br/>📊 Compares model scores to LLM]:::highlight
    B --> C["ScoreEnergyComparisonAgent<br/>🔍 Analyzes model internals (Q/V, energy, uncertainty)"]
    B --> D
    C --> D[PolicySynthesisAgent<br/>🧠 Synthesizes scores + internals<br/>🛠 Recommends policy refinements<br/>📤 Prepares GILD training signals]

    D --> E[Final Policy Health Report + GILD Signals]

    classDef highlight fill:#fdf6e3,stroke:#b58900,stroke-width:2px;
  

The ScoreComparisonAgent performs the following steps:

  1. Fetches all model scores (e.g., SICQL, MRQ, SVM, EBT) tied to one or more pipeline_run_ids.

  2. Retrieves the latest ground truth scores (typically from the LLM) for the same targets.

  3. Joins the scores together by (target_id, dimension) and computes the delta between each model and the LLM.

  4. Generates reports:

    • A raw CSV of all matched scores and deltas
    • A markdown summary showing average deltas by source
    • A detailed statistical report with MAE, RMSE, correlation, bias, and variance for each model/dimension pair.
  5. Stores all results back into the context for downstream agents or final policy analysis.

This is analysis at scale. Every model, every dimension, every target summarized and compared in one pass.
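
The detailed statistical report in step 4 boils down to a handful of aggregate metrics per (source, dimension) pair. The real work happens in the agent's `_perform_statistical_analysis` (not shown here); the sketch below, with a hypothetical `summarize_deltas` helper, conveys the gist of how those numbers fall out of the aggregated records:

```python
import statistics

def summarize_deltas(records, source, dimension):
    """MAE, RMSE, bias, and variance for one (source, dimension) pair (sketch)."""
    deltas = [
        r["delta"] for r in records
        if r.get("source") == source
        and r.get("dimension") == dimension
        and r.get("delta") is not None
    ]
    if not deltas:
        return None
    return {
        "count": len(deltas),
        "mae": sum(abs(d) for d in deltas) / len(deltas),
        "rmse": (sum(d * d for d in deltas) / len(deltas)) ** 0.5,
        "bias": statistics.mean(deltas),          # positive = model over-scores vs the LLM
        "variance": statistics.pvariance(deltas),
    }
```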


Full code ScoreComparisonAgent

Agent to aggregate and compare scores from multiple sources (SICQL, MRQ, SVM, EBT, LLM) across specified pipeline runs. Handles asynchronous LLM scoring by fetching latest LLM scores for targets evaluated by pipeline-run-linked scorers. This is Step 1: Comprehensive Score Aggregation and Comparison.

class ScoreComparisonAgent(BaseAgent):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.dimensions = cfg.get("dimensions", [])  # Default dimensions, can be overridden in config
        # Configuration for sources to compare
        # Default to common scorers. Can be overridden in config.
        self.sources_to_compare = cfg.get("sources_to_compare", ["sicql", "mrq", "svm", "ebt"])
        self.ground_truth_source = cfg.get("ground_truth_source", "llm") # Typically "llm"
        # Ensure ground truth is included if not already in the list
        if self.ground_truth_source not in self.sources_to_compare:
             self.sources_to_compare.append(self.ground_truth_source)

        # Output directory for reports (optional)
        self.output_dir = cfg.get("report_output_dir", "logs/comparison_reports")
        os.makedirs(self.output_dir, exist_ok=True)

        # Get session from memory
        if memory and hasattr(memory, 'session'):
            self.session = memory.session
        else:
            raise ValueError("ScoreComparisonAgent requires a memory object with a session attribute.")

        # Initialize ScoringStore if it's the preferred way to interact
        # self.scoring_store = ScoringStore(self.session, logger) # Optional, if methods are adapted

    async def run(self, context: dict) -> dict:
        """
        Main execution logic for the agent.
        """
        try:
            # --- 1. Get Input Parameters ---
            pipeline_run_ids = context.get("pipeline_run_ids", [])
            # Fallback to single ID if list isn't provided
            single_pipeline_run_id = context.get("pipeline_run_id")
            if single_pipeline_run_id and not pipeline_run_ids:
                 pipeline_run_ids = [single_pipeline_run_id]
            
            if not pipeline_run_ids:
                self.logger.log("ScoreComparisonWarning", {"message": "No pipeline_run_id(s) provided. Analysis might be limited or empty."})
                # Decide if we should proceed or raise an error
                # For now, let's proceed but log it.

            dimensions = context.get("dimensions", self.dimensions) # Get from context or config
            # If still empty, we might fetch all available dimensions or use a default set
            # Let's assume ScoringStore/load_gild_examples handles this by fetching all if none specified implicitly

            self.logger.log("ScoreComparisonStarted", {
                "pipeline_run_ids": pipeline_run_ids,
                "dimensions": self.dimensions,
                "sources": self.sources_to_compare,
                "ground_truth": self.ground_truth_source
            })

            # --- 2. Fetch Scores from Pipeline-Linked Sources ---
            # We need scores linked to specific pipeline runs for SICQL, MRQ, SVM, EBT
            # We'll adapt the logic from PolicyAnalyzer._get_sicql_data/_get_mrq_data etc.
            # Or, if we modify ScoringStore, we could use a new method like:
            # local_scores_data = self.scoring_store.get_scores_for_pipeline_runs(
            #     pipeline_run_ids=pipeline_run_ids, 
            #     sources=[s for s in self.sources_to_compare if s != self.ground_truth_source],
            #     dimensions=dimensions
            # )
            
            # For now, let's implement the fetching logic directly using session
            # similar to PolicyAnalyzer methods.
            local_scores_data = self._fetch_local_scores(pipeline_run_ids, self.dimensions)

            # --- 3. Identify Targets for Ground Truth Lookup ---
            # Extract unique (target_id, dimension) combinations from local scores
            # Assuming target_type is consistent or handled, or we fetch it too.
            target_info_set = set()
            for score_record in local_scores_data:
                 # Adjust key names based on actual data structure from _fetch_local_scores
                 target_info_set.add((score_record.get('target_id'), score_record.get('dimension')))

            target_info_list = [{"target_id": tid, "dimension": dim} for tid, dim in target_info_set if tid is not None and dim is not None]

            self.logger.log("ScoreComparisonTargetsIdentified", {
                "target_count": len(target_info_list),
                "sample_targets": list(target_info_list)[:5] # Log first 5 for sanity check
            })

            # --- 4. Fetch Ground Truth (LLM) Scores ---
            # Fetch latest LLM scores for the identified targets, regardless of pipeline run
            # Adapted from PolicyAnalyzer._get_llm_data logic
            llm_scores_data = self._fetch_latest_ground_truth_scores(target_info_list, self.dimensions)

            # --- 5. Merge and Calculate Deltas ---
            # Create a lookup for LLM scores: {(target_id, dimension): score}
            llm_score_lookup = {(item['target_id'], item['dimension']): item['score'] for item in llm_scores_data}

            # Augment local scores with LLM score and delta
            aggregated_results = []
            for local_score in local_scores_data:
                target_id = local_score.get('target_id')
                dimension = local_score.get('dimension')
                source = local_score.get('source')
                local_score_value = local_score.get('score')

                llm_score_for_target = llm_score_lookup.get((target_id, dimension))
                delta = None
                if local_score_value is not None and llm_score_for_target is not None:
                     delta = local_score_value - llm_score_for_target

                # Add LLM score and delta to the local score record
                augmented_record = local_score.copy()
                augmented_record['llm_score'] = llm_score_for_target
                augmented_record['delta'] = delta
                aggregated_results.append(augmented_record)

            # --- 6. Store Results in Context ---
            context['score_comparison_data'] = aggregated_results
            context['score_comparison_metadata'] = {
                "pipeline_run_ids": pipeline_run_ids,
                "sources_compared": self.sources_to_compare,
                "ground_truth_source": self.ground_truth_source,
                "dimensions": dimensions,
                "comparison_timestamp": datetime.now().isoformat()
            }

            # --- 7. Basic Reporting ---
            # Generate a simple summary report
            self._generate_basic_report(aggregated_results, context['score_comparison_metadata'])

            # --- 8. Save Raw CSV ---
            self._save_comparison_csv(aggregated_results, context['score_comparison_metadata'])

            # --- 9. Perform Statistical Analysis ---
            analysis_results = self._perform_statistical_analysis(aggregated_results)
            context['score_analysis_results'] = analysis_results
            context['score_analysis_metadata'] = {
                "analysis_timestamp": datetime.now().isoformat(),
                "sources_analyzed": self.sources_to_compare,
                "ground_truth_source": self.ground_truth_source,
            }

            # --- 10. Generate Detailed Analysis Report ---
            self._generate_analysis_report(analysis_results, context['score_comparison_metadata']) # Use comparison metadata for context

            # --- 11. Log completion and return ---
            self.logger.log("ScoreComparisonCompleted", {
                "total_scores_processed": len(aggregated_results),
                "analysis_results_generated": len(analysis_results) > 0,
            })

            return context

        except Exception as e:
            error_msg = f"ScoreComparisonAgent failed: {str(e)}"
            self.logger.log("ScoreComparisonFailed", {"error": str(e), "context": str(context)})
            # Depending on requirements, you might want to re-raise or handle gracefully
            raise # Re-raise for now to halt the pipeline on critical failure


    def _fetch_local_scores(self, pipeline_run_ids: List[int], dimensions: List[str]) -> List[Dict[str, Any]]:
        """
        Fetches scores for specified sources linked to specific pipeline runs.
        Uses a SQL query with ROW_NUMBER() and pivoting for efficient retrieval
        of the latest score per target/dimension/source combination.
        """
        try:
            if not pipeline_run_ids:
                self.logger.log("LocalScoreFetchWarning", {"message": "No pipeline_run_ids provided. Returning empty list."})
                return []

            # 1. Build the list of sources to filter by (excluding ground truth for now)
            non_gt_sources = [s for s in self.sources_to_compare if s != self.ground_truth_source]
            
            # Handle case where only GT source is requested
            if not non_gt_sources:
                self.logger.log("LocalScoreFetchInfo", {"message": "No non-ground-truth sources to fetch. Returning empty list."})
                return []

            # 2. Create placeholders for the IN clauses in the SQL query
            # Note: Using tuple() for IN clauses in SQLAlchemy text queries
            pipeline_ids_tuple = tuple(pipeline_run_ids) if pipeline_run_ids else (None,) # Prevent empty tuple error
            sources_tuple = tuple(non_gt_sources) if non_gt_sources else (None,)
            dimensions_tuple = tuple(dimensions) if dimensions else None # Will handle NULL check in SQL

            # 3. Define the SQL query using text()
            # We'll build the CASE statements dynamically based on sources
            case_statements = []
            for source in non_gt_sources:
                # Pivot each source into its own column alias, e.g. 'sicql' -> 'sicql_score'.
                # Inside the grouped_scores CTE we reference the plain 'source' and 'score'
                # aliases from latest_scores (no table prefix).
                case_statements.append(f"MAX(CASE WHEN source = '{source}' THEN score END) AS {source}_score")
            
            case_part = ",\n        ".join(case_statements)

            # 4. Base query: rank scores per (target, dimension, source), keep the latest, then pivot
            query_text = f"""
            WITH pipeline_scores AS (
                SELECT
                    e.target_type,
                    e.target_id,
                    s.dimension,
                    s.source, -- Column alias 'source'
                    s.score,  -- Column alias 'score'
                    ROW_NUMBER() OVER (
                        PARTITION BY e.target_type, e.target_id, s.dimension, s.source
                        ORDER BY e.created_at DESC
                    ) AS row_num
                FROM scores s
                JOIN evaluations e ON s.evaluation_id = e.id
                WHERE e.pipeline_run_id IN :pipeline_run_ids
                AND s.source IN :sources
                -- Filter by dimensions if provided
                AND (:dimensions IS NULL OR s.dimension IN :dimensions)
            ),
            latest_scores AS (
                SELECT *
                FROM pipeline_scores
                WHERE row_num = 1
            ),
            grouped_scores AS (
                SELECT
                    target_type,
                    target_id,
                    dimension,
                    {case_part} -- Uses 'source' and 'score' from latest_scores
                FROM latest_scores
                GROUP BY target_type, target_id, dimension
            )
            SELECT *
            FROM grouped_scores
            ORDER BY dimension, target_type, target_id;
            """

            # 5. Log the query for debugging (optional, remove in production)
            # self.logger.log("DebugSQLQuery", {"query": query_text, "params": {
            #     "pipeline_run_ids": pipeline_ids_tuple,
            #     "sources": sources_tuple,
            #     "dimensions": dimensions_tuple
            # }})

            # 6. Execute the query with parameters
            result = self.session.execute(
                text(query_text),
                {
                    "pipeline_run_ids": pipeline_ids_tuple,
                    "sources": sources_tuple,
                    "dimensions": dimensions_tuple
                }
            )

            # 7. Process the results
            # The result will have columns like: target_type, target_id, dimension, sicql_score, mrq_score, ...
            raw_rows = result.fetchall()

            formatted_scores = []
            for row in raw_rows:
                row_dict = row._mapping # Convert Row to dict-like object
                
                target_type = row_dict.get("target_type")
                target_id = row_dict.get("target_id")
                dimension = row_dict.get("dimension")

                # Iterate through the dynamically created score columns
                for source_alias in non_gt_sources: # e.g., 'sicql', 'mrq', 'svm', 'ebt'
                    # The column name in the result set matches the alias used in CASE
                    column_name = f"{source_alias}_score" 
                    
                    score_value = row_dict.get(column_name)
                    
                    # Only add an entry if a score was found for this source
                    if score_value is not None:
                        formatted_scores.append({
                            # Evaluation ID is not directly available in this pivoted format.
                            "target_id": target_id,
                            "target_type": target_type,
                            "dimension": dimension,
                            "source": source_alias, # Use the original source name
                            "score": float(score_value), # Ensure it's a native Python type
                        })

            self.logger.log("LocalScoresFetched", {
                "requested_pipeline_runs": pipeline_run_ids,
                "requested_sources": non_gt_sources,
                "requested_dimensions": dimensions,
                "fetched_record_count": len(raw_rows), # Number of grouped rows
                "expanded_score_count": len(formatted_scores) # Number of individual score entries
            })
            return formatted_scores

        except sqlalchemy.exc.SQLAlchemyError as sae:
            # More specific error handling for database issues
            self.logger.log("LocalScoreFetchDatabaseError", {"error": f"SQLAlchemy Error: {str(sae)}", "query": query_text if 'query_text' in locals() else "Query construction failed"})
            return []
        except Exception as e:
            self.logger.log("LocalScoreFetchFailed", {"error": f"General Error: {str(e)}", "pipeline_run_ids": pipeline_run_ids, "dimensions": dimensions})
            return [] # Return empty list on error to allow pipeline to potentially continue

    def _fetch_latest_ground_truth_scores(self, target_info_list: List[Dict[str, Any]], dimensions: List[str]) -> List[Dict[str, Any]]:
        """
        Fetches the latest scores from the ground truth source (e.g., LLM) for given targets.
        Adapted from PolicyAnalyzer._get_llm_data.
        """

        if not target_info_list:
             return []

        try:
            # We need the LATEST score for each (target_id, dimension) pair from the
            # ground-truth source. A plain filter isn't enough, so we rank rows per
            # group with a window function (ROW_NUMBER), mirroring the "latest score"
            # CTE logic in ScoringStore.load_gild_examples, restricted to the ground
            # truth evaluator and the requested targets/dimensions.
            # Note: this assumes target_type is consistent or handled upstream.
            cte_query_text = """
            WITH ranked_llm_scores AS (
                SELECT
                    s.dimension,
                    s.score,
                    e.target_id,
                    e.id as evaluation_id, -- Include evaluation_id for join if needed
                    e.created_at,
                    ROW_NUMBER() OVER (
                        PARTITION BY e.target_id, s.dimension
                        ORDER BY e.created_at DESC
                    ) AS rank
                FROM scores s
                JOIN evaluations e ON e.id = s.evaluation_id
                WHERE e.evaluator_name = :evaluator_name -- 'llm'
                AND e.target_id IN :target_ids
                AND s.dimension IN :dimensions
                -- Add target_type filter if strictly needed
            )
            SELECT
                target_id,
                dimension,
                score,
                created_at
            FROM ranked_llm_scores
            WHERE rank = 1
            """

            # Prepare parameters
            target_ids = [t['target_id'] for t in target_info_list]
            dims = dimensions if dimensions else [t['dimension'] for t in target_info_list] # Fallback if needed

            if not target_ids or not dims: # Safety check
                 return []

            result = self.session.execute(
                text(cte_query_text),
                {
                    "evaluator_name": self.ground_truth_source,
                    "target_ids": tuple(target_ids),
                    "dimensions": tuple(dims)
                }
            ).fetchall()

            llm_scores = [dict(row._mapping) for row in result]
            
            self.logger.log("GroundTruthScoresFetched", {"count": len(llm_scores)})
            return llm_scores

        except Exception as e:
             self.logger.log("GroundTruthScoreFetchFailed", {"error": str(e)})
             return []

    def _generate_basic_report(self, aggregated_data: List[Dict], metadata: Dict):
        """
        Generates a simple summary report of the comparison.
        """
        try:
            if not aggregated_data:
                # No data to report; log and skip writing a summary file
                self.logger.log("EmptyComparisonReportGenerated", {})
            else:
                # Simple aggregation: count, average delta per source
                from collections import defaultdict
                import statistics

                source_stats = defaultdict(lambda: {"count": 0, "avg_delta": 0, "deltas": []})
                
                for item in aggregated_data:
                     source = item.get('source')
                     delta = item.get('delta')
                     if source: # Ensure source is present
                          source_stats[source]["count"] += 1
                          if delta is not None:
                               source_stats[source]["deltas"].append(delta)
                
                # Calculate average deltas
                for source, stats in source_stats.items():
                     if stats["deltas"]:
                          stats["avg_delta"] = statistics.mean(stats["deltas"])
                          # Could add stddev, min, max etc.
                     del stats["deltas"] # Remove raw list for cleaner output

                timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
                report_filename = f"score_comparison_summary_{timestamp}.md"
                report_path = os.path.join(self.output_dir, report_filename)

                with open(report_path, 'w') as f:
                     f.write("# Score Comparison Summary Report\n\n")
                     f.write(f"**Generated:** {metadata.get('comparison_timestamp', 'N/A')}\n\n")
                     f.write(f"**Pipeline Runs Analyzed:** {metadata.get('pipeline_run_ids', 'N/A')}\n\n")
                     f.write(f"**Sources Compared:** {', '.join(metadata.get('sources_compared', []))}\n\n")
                     f.write(f"**Ground Truth Source:** {metadata.get('ground_truth_source', 'N/A')}\n\n")
                     f.write(f"**Dimensions:** {', '.join(metadata.get('dimensions', []))}\n\n")
                     f.write("## Summary Statistics (vs Ground Truth)\n\n")
                     f.write("| Source | Count | Avg Delta (Model - LLM) |\n")
                     f.write("| :--- | :--- | :--- |\n")
                     for source, stats in sorted(source_stats.items()):
                          f.write(f"| {source} | {stats['count']} | {stats['avg_delta']:.4f} |\n")
                
                self.logger.log("ComparisonSummaryReportSaved", {"path": report_path})

        except Exception as e:
             self.logger.log("ComparisonReportGenerationFailed", {"error": str(e)})


    def _save_comparison_csv(self, aggregated_data: List[Dict[str, Any]], metadata: Dict[str, Any]):
        """
        Saves the aggregated score comparison data to a CSV file.
        """
        try:
            if not aggregated_data:
                self.logger.log("SaveComparisonCSVWarning", {"message": "No data to save to CSV. Skipping."})
                return

            timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
            # Create a descriptive filename
            pipeline_ids_str = "_".join(map(str, metadata.get('pipeline_run_ids', ['unknown'])))
            filename = f"score_comparison_raw_{pipeline_ids_str}_{timestamp}.csv"
            file_path = os.path.join(self.output_dir, filename)

            # Define the fieldnames for the CSV. These should match the keys in your aggregated data dicts.
            # Based on the current structure in run() and _fetch_local_scores/_fetch_latest_ground_truth_scores
            fieldnames = [
                "target_id",
                "target_type",
                "dimension",
                "source", # The model that produced the score
                "score",  # The score from the model
                "llm_score", # The ground truth LLM score for the same target/dimension
                "delta"  # The difference (score - llm_score)
                # Add other fields like 'embedding_type', 'created_at' if they are included in aggregated_data
            ]
            # Dynamically add any other keys found in the first data point (for robustness)
            if aggregated_data:
                sample_keys = set(aggregated_data[0].keys())
                for key in sample_keys:
                    if key not in fieldnames:
                        fieldnames.append(key)
            # Ensure consistent order, put standard ones first
            standard_fields = [f for f in fieldnames if f in ['target_id', 'target_type', 'dimension', 'source', 'score', 'llm_score', 'delta']]
            other_fields = [f for f in fieldnames if f not in standard_fields]
            ordered_fieldnames = standard_fields + sorted(other_fields)


            with open(file_path, 'w', newline='', encoding='utf-8') as csvfile:
                writer = csv.DictWriter(csvfile, fieldnames=ordered_fieldnames)
                writer.writeheader()
                # Sort data for consistent output (optional)
                sorted_data = sorted(aggregated_data, key=lambda x: (x.get('dimension', ''), x.get('target_type', ''), x.get('target_id', 0), x.get('source', '')))
                for row in sorted_data:
                    # Handle potential serialization issues (e.g., datetime objects)
                    # Although our current data should be basic types, this is good practice.
                    safe_row = {k: v if isinstance(v, (str, int, float, type(None))) else str(v) for k, v in row.items()}
                    writer.writerow(safe_row)


            self.logger.log("ComparisonCSVSaved", {"path": file_path, "record_count": len(aggregated_data)})
            
        except Exception as e:
            self.logger.log("SaveComparisonCSVFailed", {"error": str(e), "output_dir": self.output_dir})



    def _perform_statistical_analysis(self, aggregated_data: List[Dict[str, Any]]) -> Dict[str, Any]:
        """
        Performs statistical analysis on the aggregated score comparison data.
        Calculates MAE, RMSE, Correlation, Bias, and Variance of model scores
        per source and dimension.
        """
        try:

            # --- 1. Organize Data ---
            # Group data by (source, dimension) for metric calculation
            grouped_data = defaultdict(list)
            for item in aggregated_data:
                # Only analyze non-ground-truth scores that have a corresponding LLM score/delta
                if item.get('source') and item.get('source') != self.ground_truth_source and item.get('delta') is not None:
                    key = (item['source'], item['dimension'])
                    grouped_data[key].append(item)
                
                # Also collect model scores for calculating variance of scores produced by the model
                # This uses the 'score' field for the model itself
                if item.get('source') and item.get('source') != self.ground_truth_source and item.get('score') is not None:
                    # Use a distinct key structure for model scores
                    score_key = (item['source'], item['dimension'], 'model_scores') 
                    grouped_data[score_key].append(item['score']) # Store just the score value

            # --- 2. Calculate Metrics ---
            results = {}

            # --- Calculate Main Metrics (MAE, RMSE, Correlation, Bias) ---
            for key, items in grouped_data.items():
                # Process only the main metric keys: (source, dimension)
                # Skip keys with 'model_scores' marker: (source, dimension, 'model_scores')
                if isinstance(key, tuple) and len(key) == 3 and key[2] == 'model_scores':
                    continue # This will be handled later for score variance

                if isinstance(key, tuple) and len(key) == 2:
                    source, dimension = key
                else:
                    # Handle unexpected key format gracefully
                    self.logger.log("StatisticalAnalysisWarning", {
                        "message": "Skipping unexpected key format in grouped data",
                        "key": str(key), "key_type": str(type(key))
                    })
                    continue

                if not items:
                    continue

                # Extract arrays for calculations
                deltas = np.array([item['delta'] for item in items if item['delta'] is not None])
                model_scores = np.array([item['score'] for item in items if item['score'] is not None])
                llm_scores = np.array([item['llm_score'] for item in items if item['llm_score'] is not None])

                # Ensure we have data to calculate metrics
                if len(deltas) == 0 or len(model_scores) == 0 or len(llm_scores) == 0:
                    self.logger.log("StatisticalAnalysisWarning", {
                        "message": "Insufficient data for metric calculation",
                        "source": source, "dimension": dimension,
                        "delta_count": len(deltas), "model_score_count": len(model_scores), "llm_score_count": len(llm_scores)
                    })
                    continue

                # --- Core Metrics ---
                mae = np.mean(np.abs(deltas))
                rmse = np.sqrt(np.mean(deltas**2))
                
                # Correlation: Check if variance is sufficient for meaningful correlation
                corr_coef = None
                corr_p_value = None
                if np.std(model_scores) > 1e-10 and np.std(llm_scores) > 1e-10:
                    try:
                        # Use scipy.stats.pearsonr
                        corr_result = pearsonr(model_scores, llm_scores)
                        # Handle different scipy versions
                        if hasattr(corr_result, 'statistic'):
                            corr_coef = corr_result.statistic
                            corr_p_value = corr_result.pvalue
                        else: # Older scipy versions return a tuple
                            corr_coef, corr_p_value = corr_result
                    except Exception as e:
                        self.logger.log("CorrelationCalculationWarning", {
                            "error": str(e), "source": source, "dimension": dimension
                        })

                bias = np.mean(deltas) # Average difference (Model - LLM)

                # --- Store Results ---
                result_key = f"{source}_{dimension}"
                results[result_key] = {
                    "source": source,
                    "dimension": dimension,
                    "count": len(deltas),
                    "mae": float(mae),
                    "rmse": float(rmse),
                    "correlation": float(corr_coef) if corr_coef is not None else None,
                    "correlation_p_value": float(corr_p_value) if corr_p_value is not None else None,
                    "bias": float(bias), # Positive bias = Model tends to score higher than LLM
                }

            # --- 3. Calculate Standard Deviation of Model Scores ---
            # i.e. how much the scores *produced by each model* spread within a dimension
            for key, scores_list in grouped_data.items():
                # Process only the score variance keys: (source, dimension, 'model_scores')
                if not (isinstance(key, tuple) and len(key) == 3 and key[2] == 'model_scores'):
                    continue # Skip main metric keys

                source, dimension, _ = key # Unpack the 3-tuple key

                if not scores_list:
                    continue
                
                scores_array = np.array(scores_list)
                if len(scores_array) > 1: # Need more than one value for a std dev
                    score_std_dev = float(np.std(scores_array))
                    # Add this to the existing result dict or create a new entry if it doesn't exist
                    result_key = f"{source}_{dimension}"
                    if result_key in results:
                        results[result_key]["score_std_dev"] = score_std_dev
                    else:
                        # Less likely, but handle if main metrics weren't calculated for some reason
                        results[result_key] = {
                            "source": source,
                            "dimension": dimension,
                            "count": len(scores_array),
                            "score_std_dev": score_std_dev
                            # Other metrics will be missing
                        }
                # else: not enough data points for a std dev; leave it out

            self.logger.log("StatisticalAnalysisCompleted", {
                "unique_source_dimension_combinations_analyzed": len(results)
            })
            
            return results

        except Exception as e:
            self.logger.log("StatisticalAnalysisFailed", {"error": str(e)})
            # Depending on robustness needs, return empty dict or re-raise
            return {} # Return empty dict on error to allow pipeline to potentially continue

    def _generate_analysis_report(self, analysis_results: Dict[str, Any], comparison_metadata: Dict[str, Any]):
        """
        Generates a detailed markdown report summarizing the statistical analysis results.
        """
        try:
            if not analysis_results:
                # No analysis data; log and skip writing a report file
                self.logger.log("EmptyAnalysisReportGenerated", {})
            else:
                timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
                pipeline_ids_str = "_".join(map(str, comparison_metadata.get('pipeline_run_ids', ['unknown'])))
                report_filename = f"score_analysis_detailed_{pipeline_ids_str}_{timestamp}.md"
                report_path = os.path.join(self.output_dir, report_filename)

                with open(report_path, 'w', encoding='utf-8') as f:
                    f.write(f"# Detailed Score Analysis Report\n\n")
                    f.write(f"**Generated:** {comparison_metadata.get('comparison_timestamp', 'N/A')}\n\n") # Use timestamp from comparison
                    f.write(f"**Analysis Performed:** {datetime.now().isoformat()}\n\n")
                    f.write(f"**Pipeline Runs Analyzed:** {comparison_metadata.get('pipeline_run_ids', 'N/A')}\n\n")
                    f.write(f"**Sources Compared:** {', '.join(comparison_metadata.get('sources_compared', []))}\n\n")
                    f.write(f"**Ground Truth Source:** {comparison_metadata.get('ground_truth_source', 'N/A')}\n\n")
                    f.write(f"**Dimensions:** {', '.join(comparison_metadata.get('dimensions', []))}\n\n")
                    f.write("---\n\n")

                    # Group results by dimension for better organization
                    from collections import defaultdict
                    results_by_dimension = defaultdict(list)
                    for key, metrics in analysis_results.items():
                        dim = metrics.get('dimension', 'unknown')
                        results_by_dimension[dim].append(metrics)

                    # Sort dimensions and sources for consistent output
                    sorted_dimensions = sorted(results_by_dimension.keys())

                    for dimension in sorted_dimensions:
                        f.write(f"## Analysis for Dimension: `{dimension}`\n\n")
                        f.write("| Source | Count | MAE | RMSE | Correlation (p-value) | Bias | Score Std Dev |\n")
                        f.write("| :--- | ---: | ---: | ---: | ---: | ---: | ---: |\n")

                        sorted_sources_for_dim = sorted(results_by_dimension[dimension], key=lambda x: x.get('source', ''))
                        for metrics in sorted_sources_for_dim:
                            source = metrics.get('source', 'N/A')
                            count = metrics.get('count', 0)
                            mae = f"{metrics.get('mae', 0):.4f}" if metrics.get('mae') is not None else "N/A"
                            rmse = f"{metrics.get('rmse', 0):.4f}" if metrics.get('rmse') is not None else "N/A"
                            
                            corr = metrics.get('correlation')
                            p_val = metrics.get('correlation_p_value')
                            if corr is not None:
                                corr_str = f"{corr:.4f}"
                                if p_val is not None:
                                    corr_str += f" ({p_val:.2e})" # Add p-value in scientific notation
                            else:
                                corr_str = "N/A"

                            bias = f"{metrics.get('bias', 0):.4f}" if metrics.get('bias') is not None else "N/A"
                            std_dev = f"{metrics.get('score_std_dev', 0):.4f}" if metrics.get('score_std_dev') is not None else "N/A"

                            f.write(f"| {source} | {count} | {mae} | {rmse} | {corr_str} | {bias} | {std_dev} |\n")
                        f.write("\n") # Space between dimensions

                self.logger.log("DetailedAnalysisReportSaved", {"path": report_path})

        except Exception as e:
            self.logger.log("DetailedAnalysisReportGenerationFailed", {"error": str(e)})

📄 Code Summary

At its core, the agent is configured like so:

self.sources_to_compare = cfg.get("sources_to_compare", ["sicql", "mrq", "svm", "ebt"])
self.ground_truth_source = cfg.get("ground_truth_source", "llm")

It uses SQL queries to pull scores linked to specific pipeline runs, groups them by target and dimension, and pivots the table so each row contains all relevant scores for a target-dimension pair.

It then:

  • Computes delta: delta = model_score - llm_score
  • Writes raw results to a CSV
  • Summarizes average deltas per model in markdown
  • Computes advanced statistics (correlation, MAE, RMSE, bias)
  • Writes a full analysis report in markdown

All of this happens in a single agent run, with full context returned for further introspection.
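
To make those statistics concrete, here is a minimal, self-contained sketch of the same calculations on a few made-up score pairs. The real agent computes them over the records pulled by the SQL queries above; only the arithmetic is meant to match.

# Minimal sketch of the agent's core statistics on illustrative (made-up) score pairs.
# The math mirrors _perform_statistical_analysis above; the data does not come from the database.
import numpy as np
from scipy.stats import pearsonr

model_scores = np.array([75.97, 76.45, 100.0, 76.84])  # one scorer's outputs
llm_scores = np.array([70.0, 70.0, 70.0, 65.0])        # ground-truth LLM scores

deltas = model_scores - llm_scores                  # per-document delta (model - LLM)
mae = np.mean(np.abs(deltas))                       # mean absolute error
rmse = np.sqrt(np.mean(deltas ** 2))                # root mean squared error
bias = np.mean(deltas)                              # positive = model scores higher than the LLM
corr, p_value = pearsonr(model_scores, llm_scores)  # alignment with the LLM

print(f"MAE={mae:.2f} RMSE={rmse:.2f} bias={bias:.2f} r={corr:.2f} (p={p_value:.2f})")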


✅ Understanding our thinking

This agent closes the loop between scoring and understanding.

By comparing model output to ground truth and surfacing statistical properties across multiple dimensions, Stephanie can now:

  • Spot overconfident or biased models
  • Measure alignment between internal scorers and LLM supervision
  • Feed this information into downstream tuning or strategy selection agents
  • Track how self-improvement is unfolding over time

The ScoreComparisonAgent marks the beginning of reflective cognition. Stephanie isn’t just scoring; she’s starting to understand how well she scores.


Actual report results

# Detailed Score Analysis Report

**Generated:** 2025-07-26T16:30:01.874777

**Pipeline Runs Analyzed:** [4148]

**Sources Compared:** sicql, mrq, svm, ebt, llm

**Ground Truth Source:** llm

**Dimensions:** alignment, clarity, implementability, novelty, relevance

---

## Analysis for Dimension: `alignment`

| Source | Count | MAE | RMSE | Correlation (p-value) | Bias | Score Std Dev |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: |
| ebt | 100 | 33.7828 | 38.2685 | 0.3409 (5.19e-04) | 33.7787 | 1.0365 |
| mrq | 100 | 35.4969 | 39.9422 | N/A | 35.4969 | 0.0000 |
| sicql | 100 | 59.0500 | 61.8243 | N/A | 59.0500 | 0.0000 |
| svm | 100 | 35.8870 | 40.2899 | -0.2385 (1.69e-02) | 35.8870 | 0.0062 |

## Analysis for Dimension: `clarity`

| Source | Count | MAE | RMSE | Correlation (p-value) | Bias | Score Std Dev |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: |
| ebt | 100 | 15.7906 | 16.9702 | 0.2895 (3.49e-03) | -14.9451 | 1.0706 |
| mrq | 100 | 5.6104 | 8.7455 | N/A | 2.8043 | 0.0000 |
| sicql | 100 | 5.4078 | 8.5470 | -0.0159 (8.75e-01) | 1.9592 | 0.6492 |
| svm | 100 | 5.7171 | 8.8085 | -0.0520 (6.08e-01) | 2.9947 | 0.0025 |

CSV Score export

target_id,target_type,dimension,source,score,llm_score,delta
1,document,alignment,ebt,75.9662,70.0,5.966200000000001
1,document,alignment,mrq,76.4469,70.0,6.446899999999999
1,document,alignment,sicql,100.0,70.0,30.0
1,document,alignment,svm,76.83712967511364,70.0,6.837129675113644
2,document,alignment,ebt,75.8436,65.0,10.843599999999995
2,document,alignment,mrq,76.4469,65.0,11.4469
2,document,alignment,sicql,100.0,65.0,35.0

🔬 Peering Under the Hood — ScoreEnergyComparisonAgent

“A wrong answer is only useless if we never ask why it went wrong.”

The ScoreComparisonAgent already highlights where our scorers disagree with the LLM benchmark, but stopping there is like a doctor spotting a fever and never ordering blood tests. We need diagnostics that explain the cause of each failure so Stephanie can prescribe the right treatment (re‑weighting a head, retraining a dimension, or flagging an embedding).

    flowchart LR
    A[Start: Scored Documents + Pipeline Run IDs]

    A --> B[ScoreComparisonAgent<br/>📊 Compares model scores to LLM]
    B --> C["ScoreEnergyComparisonAgent<br/>🔍 Analyzes model internals (Q/V, energy, uncertainty)"]:::highlight
    B --> D
    C --> D[PolicySynthesisAgent<br/>🧠 Synthesizes scores + internals<br/>🛠 Recommends policy refinements<br/>📤 Prepares GILD training signals]

    D --> E[Final Policy Health Report + GILD Signals]

    classDef highlight fill:#fdf6e3,stroke:#b58900,stroke-width:2px;
  

That diagnostic step is the job of ScoreEnergyComparisonAgent.

| Stage | Purpose | Key Question |
| :--- | :--- | :--- |
| 1 – ScoreComparison | Detect surface‑level disagreement (SICQL vs EBT vs SVM vs LLM) | “Which documents scored differently?” |
| 2 – ScoreEnergyComparison | Probe model internals (Q, V, advantage, energy, entropy) | “What hidden signal explains the mistake?” |
| 3 – PolicySynthesis | Turn root‑cause insights into GILD training signals | “How do we fix the policy?” |
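
To ground that table in code, here is a hedged sketch of how the three stages chain together: each agent's run() consumes and extends a shared context dict. The constructor wiring (cfg, memory, logger) and the PolicySynthesisAgent call are illustrative assumptions; the real pipeline supplies and sequences these agents itself.

# Illustrative chaining of the three analysis stages through one shared context dict.
# Constructor arguments are assumed placeholders; the real pipeline provides them.
import asyncio

async def run_policy_analysis(cfg, memory, logger, pipeline_run_ids):
    context = {"pipeline_run_ids": pipeline_run_ids}

    # Stage 1: surface-level disagreement vs the LLM benchmark
    context = await ScoreComparisonAgent(cfg, memory, logger).run(context)

    # Stage 2: probe model internals (Q/V, energy, uncertainty) for the compared scores
    context = await ScoreEnergyComparisonAgent(cfg, memory, logger).run(context)

    # Stage 3: turn root-cause insights into GILD training signals
    context = await PolicySynthesisAgent(cfg, memory, logger).run(context)

    return context  # holds score_comparison_data, score_energy_analysis_results, ...

# context = asyncio.run(run_policy_analysis(cfg, memory, logger, [4148]))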

🛠 What the Agent Actually Does

  1. Fetches raw internals

    • SICQL: q_value, state_value, advantage, policy_entropy
    • EBT: energy, uncertainty
  2. Joins those attributes to every document that ScoreComparison flagged.

  3. Runs statistical tests (Pearson/Spearman; a sketch follows this list) to answer questions like:

    • “Does high energy correlate with under‑scoring?”
    • “When the V‑baseline is low, do we see consistent over‑confidence?”
  4. Outputs a structured report (score_energy_analysis_results) that the next agent can consume.
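
Here is a minimal sketch of the kind of test step 3 runs, using scipy on made-up values; the agent itself runs these tests over the enriched score-plus-attribute records it builds from the database.

# Illustrative test from step 3: does EBT energy track absolute scoring error?
# Values are invented for the example; the agent uses its enriched records instead.
import numpy as np
from scipy.stats import pearsonr, spearmanr

energies = np.array([0.82, 0.31, 0.77, 0.12, 0.55])
abs_errors = np.array([28.4, 6.1, 19.7, 2.3, 11.0])  # |model_score - llm_score|

r, r_p = pearsonr(energies, abs_errors)        # linear relationship
rho, rho_p = spearmanr(energies, abs_errors)   # monotonic (rank) relationship

# A clearly positive coefficient would suggest high energy predicts large errors,
# i.e. the energy head "knows" when the scorer is off.
print(f"Pearson r={r:.3f} (p={r_p:.3f}), Spearman rho={rho:.3f} (p={rho_p:.3f})")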

That means Stephanie no longer says “SICQL was wrong here”; she can say why it was wrong:

“SICQL undervalued this document because its uncertainty spike (σ = 0.42) caused the policy to hedge, whereas the LLM found strong alignment clues.”
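
Concretely, each insight record the agent emits (see the full code below) carries enough metadata to back up a claim like that. Here is a hedged sketch of a downstream consumer reading those insights out of the returned context; the keys and fields follow the agent code below.

# Illustrative consumer of the agent's output stored in context.
results = context.get("score_energy_analysis_results", {})
for insight in results.get("analysis_insights", []):
    print(
        f"[{insight['source']}/{insight['dimension']}] {insight['type']}: "
        f"{insight['metric']}={insight['value']:.3f} (n={insight['sample_size']}) "
        f"- {insight.get('interpretation', insight['description'])}"
    )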

Full code ScoreEnergyComparisonAgent


class ScoreEnergyComparisonAgent(BaseAgent):
    """
    Agent to perform deep analysis on score data by fetching and analyzing
    rich attributes from EvaluationAttributeORM (e.g., SICQL's Q/V/uncertainty,
    EBT's energy). Consumes data from ScoreComparisonAgent.
    This is Step 2 (Deep Analysis): Leveraging detailed model internals.
    """

    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.dimensions = cfg.get("dimensions", [])
        
        # Configuration for sources to analyze (focus on those with rich attributes)
        # Default to SICQL and EBT, as they are primary sources of rich data
        # SVM, MRQ might have some, but SICQL and EBT are richest.
        self.sources_for_deep_analysis = cfg.get("sources_for_deep_analysis", ["sicql", "ebt"])
        
        # Output directory for reports
        self.output_dir = cfg.get("report_output_dir", "logs/deep_analysis_reports")
        os.makedirs(self.output_dir, exist_ok=True)
        self.session = self.memory.session  # Get the database session

    async def run(self, context: dict) -> dict:
        """
        Main execution logic for the agent.
        """
        try:
            self.logger.log("ScoreEnergyComparisonStarted", {
                "sources_for_analysis": self.sources_for_deep_analysis
            })

            # --- 1. Get Input Data from Context ---
            # This agent relies on the output of ScoreComparisonAgent
            score_comparison_data = context.get('score_comparison_data', [])
            score_comparison_metadata = context.get('score_comparison_metadata', {})
            
            if not score_comparison_data:
                self.logger.log("ScoreEnergyComparisonWarning", {"message": "No score_comparison_data found in context. Skipping deep analysis."})
                context['score_energy_analysis_results'] = {}
                context['score_energy_analysis_metadata'] = {"status": "skipped_no_input_data"}
                return context

            # --- 2. Extract Target Info for Deep Analysis ---
            # We need to identify the specific evaluations (by ID) that correspond
            # to the scores we want to analyze deeply.
            # The score_comparison_data contains target_id, target_type, dimension, source, score
            # We need to map this back to the EvaluationORM ID to fetch attributes.
            
            # Let's build a query to get the relevant EvaluationORM IDs and link them
            # to the score data for the sources we are interested in.
            
            # Group comparison data by pipeline_run_id (if available) and source for efficient querying
            # The comparison data should have been linked to pipeline runs.
            pipeline_run_ids_from_comparison = score_comparison_metadata.get('pipeline_run_ids', [])
            
            if not pipeline_run_ids_from_comparison:
                 self.logger.log("ScoreEnergyComparisonWarning", {"message": "No pipeline_run_ids found in comparison metadata. Analysis might be incomplete."})
                 # We can still try to fetch attributes, but it's less efficient.

            # --- 3. Fetch Deep Attribute Data ---
            # We need to get EvaluationAttributeORM records that match the scores
            # analyzed by ScoreComparisonAgent for the specified sources.
            # The most robust way is to join back through EvaluationORM and ScoreORM
            # using target_id, target_type, dimension, and source.
            
            deep_analysis_results = self._fetch_deep_attributes(
                pipeline_run_ids=pipeline_run_ids_from_comparison,
                sources=self.sources_for_deep_analysis,
                # We could pass dimensions, but let's fetch all for the specified sources/runs initially
                # and filter in Python if needed based on score_comparison_data
            )

            # --- 4. Correlate Deep Attributes with Comparison Data ---
            # Link the fetched deep attributes with the score_comparison_data
            # to create a richer dataset for analysis.
            # Key: (target_id, target_type, dimension, source) -> Value: comparison + attribute data
            enriched_data_map = self._enrich_comparison_data(score_comparison_data, deep_analysis_results)

            # --- 5. Perform Deep Analysis ---
            # Analyze the relationship between attributes (e.g., SICQL uncertainty)
            # and the comparison metrics (delta, score variance).
            analysis_insights = self._perform_deep_analysis(enriched_data_map)

            # --- 6. Store Results in Context ---
            context['score_energy_analysis_results'] = {
                "enriched_data_sample": dict(list(enriched_data_map.items())[:3]), # Sample for context
                "full_enriched_data_count": len(enriched_data_map),
                "analysis_insights": analysis_insights
            }
            context['score_energy_analysis_metadata'] = {
                "analysis_timestamp": datetime.now().isoformat(),
                "sources_analyzed": self.sources_for_deep_analysis,
                "pipeline_run_ids": pipeline_run_ids_from_comparison,
                "total_attributes_fetched": len(deep_analysis_results)
            }

            # --- 7. Generate Detailed Report ---
            self._generate_deep_analysis_report(analysis_insights, context['score_energy_analysis_metadata'])

            self.logger.log("ScoreEnergyComparisonCompleted", {
                "total_attributes_processed": len(deep_analysis_results),
                "enriched_data_points": len(enriched_data_map),
                "insights_generated": len(analysis_insights) if analysis_insights else 0
            })

            
            return context

        except Exception as e:
            error_msg = f"ScoreEnergyComparisonAgent failed: {str(e)}"
            self.logger.log("ScoreEnergyComparisonFailed", {"error": str(e), "context_keys": list(context.keys())})
            raise

    def _fetch_deep_attributes(self, pipeline_run_ids: List[int], sources: List[str]) -> List[Dict[str, Any]]:
        """
        Fetches detailed EvaluationAttributeORM data for specified sources and pipeline runs.
        Joins with EvaluationORM and ScoreORM to get context.
        """
        try:
            if not sources:
                return []

            # Base query to fetch attributes with context
            # We join EvaluationAttributeORM with EvaluationORM to get target info and run info
            # We also join with ScoreORM to ensure the attribute corresponds to an actual score record
            # and to get the score value if needed directly from the attribute query.
            query_text = """
            SELECT
                e.id AS evaluation_id,
                e.target_id,
                e.target_type,
                e.pipeline_run_id,
                e.source, -- This should match the 'source' in attributes and score_comparison_data
                s.dimension,
                s.score AS score_from_score_table, -- Score from ScoreORM
                -- EvaluationAttributeORM fields
                ea.raw_score,
                ea.energy,
                ea.q_value,
                ea.v_value,
                ea.advantage,
                ea.pi_value,
                ea.entropy,
                ea.uncertainty,
                ea.td_error,
                ea.expected_return
                -- ea.policy_logits -- Consider if including JSON is efficient here
            FROM evaluation_attributes ea
            JOIN evaluations e ON ea.evaluation_id = e.id
            JOIN scores s ON (
                s.evaluation_id = e.id 
                AND s.dimension = ea.dimension 
            )
            WHERE e.source IN :sources
            """
            
            params = {
                "sources": tuple(sources)
            }

            if pipeline_run_ids:
                query_text += " AND e.pipeline_run_id IN :pipeline_run_ids\n"
                params["pipeline_run_ids"] = tuple(pipeline_run_ids)
            
            if self.dimensions:
                query_text += " AND s.dimension IN :dimensions\n"
                params["dimensions"] = tuple(self.dimensions)

            # Order might help with consistency, though not strictly necessary for processing
            query_text += " ORDER BY e.target_type, e.target_id, s.dimension, e.evaluator_name;"

            result = self.session.execute(text(query_text), params).fetchall()

            attributes_data = [dict(row._mapping) for row in result]
            
            self.logger.log("DeepAttributesFetched", {
                "sources": sources,
                "pipeline_run_ids": pipeline_run_ids,
                "count": len(attributes_data)
            })
            return attributes_data

        except sqlalchemy.exc.SQLAlchemyError as sae:
            self.logger.log("DeepAttributeFetchDatabaseError", {"error": f"SQLAlchemy Error: {str(sae)}"})
            return []
        except Exception as e:
            self.logger.log("DeepAttributeFetchFailed", {"error": str(e)})
            return []

    def _enrich_comparison_data(self, comparison_data: List[Dict], attribute_data: List[Dict]) -> Dict[str, Dict[str, Any]]:
        """
        Combines score comparison data with fetched deep attributes.
        Returns a map keyed by "target_id|target_type|dimension|source" strings.
        """
        # Create a lookup map for attributes for fast joining
        # Key: (target_id, target_type, dimension, source) -> Value: attribute dict
        attribute_lookup = {}
        for attr in attribute_data:
            key = (
                attr['target_id'],
                attr['target_type'],
                attr['dimension'],
                attr['source'] # This should match evaluator_name from EvaluationORM
            )
            # There might be multiple attributes per key if joins are not perfect,
            # but we'll take the first one or handle duplicates if necessary.
            # Ideally, the join should be unique.
            if key not in attribute_lookup:
                attribute_lookup[key] = attr
            else:
                # Log a warning if duplicates are found, might indicate a data issue
                # For now, we keep the first one.
                pass 

        # Enrich the comparison data
        enriched_map = {}
        for comp_item in comparison_data:
            # Only enrich data for sources we are analyzing
            if comp_item.get('source') not in self.sources_for_deep_analysis:
                continue

            comp_key = (
                comp_item['target_id'],
                comp_item['target_type'],
                comp_item['dimension'],
                comp_item['source']
            )
            
            enriched_item = comp_item.copy() # Start with comparison data
            if comp_key in attribute_lookup:
                # Merge attribute data
                attr_data = attribute_lookup[comp_key]
                # Add all attribute fields, potentially overwriting if names clash
                # (though unlikely as comparison data keys are different)
                enriched_item.update(attr_data)
            else:
                 # Log if attribute data is missing for a comparison item
                 # This is expected for sources not in sources_for_deep_analysis
                 # or if the attribute wasn't saved for some reason.
                 pass

            key_str = f"{enriched_item['target_id']}|{enriched_item['target_type']}|{enriched_item['dimension']}|{enriched_item['source']}" # Use a separator unlikely to be in your data
            enriched_map[key_str] = enriched_item

        self.logger.log("DataEnrichmentCompleted", {
            "comparison_items": len(comparison_data),
            "attribute_items": len(attribute_data),
            "enriched_items": len(enriched_map)
        })
        return enriched_map

    def _perform_deep_analysis(self, enriched_data_map: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
        """
        Analyzes the enriched data to find relationships and insights.
        E.g., correlation between SICQL uncertainty and delta.
        """
        insights = []
        try:
            if not enriched_data_map:
                return insights

            # Group data by source and dimension for analysis
            from collections import defaultdict
            grouped_data = defaultdict(list)
            for item in enriched_data_map.values():
                key = (item.get('source'), item.get('dimension'))
                if key[0] and key[1]: # Ensure source and dimension are present
                    grouped_data[key].append(item)

            # --- Analysis per Source/Dimension ---
            for (source, dimension), items in grouped_data.items():
                if not items:
                    continue

                # --- SICQL Specific Analysis ---
                if source == 'sicql':
                    # 1. Uncertainty vs Delta (Error)
                    # Pair uncertainty with delta per item so the arrays stay aligned
                    paired = [
                        (item['uncertainty'], item['delta'])
                        for item in items
                        if item.get('uncertainty') is not None and item.get('delta') is not None
                    ]
                    if len(paired) > 1:
                        filtered_uncertainties = np.array([p[0] for p in paired])
                        filtered_deltas = np.array([p[1] for p in paired])
                        
                        if np.std(filtered_uncertainties) > 1e-10 and np.std(filtered_deltas) > 1e-10:
                            try:
                                corr_result = pearsonr(filtered_uncertainties, np.abs(filtered_deltas)) # Correlate with abs error
                                corr_coef = corr_result.statistic if hasattr(corr_result, 'statistic') else corr_result[0]
                                corr_p_value = corr_result.pvalue if hasattr(corr_result, 'pvalue') else corr_result[1]
                                
                                insights.append({
                                    "type": "sicql_uncertainty_vs_abs_delta_correlation",
                                    "source": source,
                                    "dimension": dimension,
                                    "description": "Correlation between SICQL uncertainty (|Q-V|) and absolute error (|delta|).",
                                    "metric": "Pearson Correlation Coefficient",
                                    "value": float(corr_coef),
                                    "p_value": float(corr_p_value),
                                    "sample_size": len(common_indices),
                                    "interpretation": "Positive correlation suggests high uncertainty predicts high error."
                                })
                            except Exception as e:
                                self.logger.log("SICQLUncertaintyCorrelationFailed", {"error": str(e), "source": source, "dimension": dimension})

                    # 2. Advantage (Q-V) Analysis
                    advantages = np.array([item['advantage'] for item in items if item.get('advantage') is not None])
                    if len(advantages) > 0 and not np.isnan(advantages).all():
                        mean_advantage = np.nanmean(advantages)
                        std_advantage = np.nanstd(advantages)
                        insights.append({
                            "type": "sicql_advantage_stats",
                            "source": source,
                            "dimension": dimension,
                            "description": "Mean and standard deviation of SICQL advantage (Q-V).",
                            "metric": "Mean Advantage",
                            "value": float(mean_advantage),
                            "std_dev": float(std_advantage),
                            "sample_size": len(advantages) - np.count_nonzero(np.isnan(advantages))
                        })

                    # 3. Entropy Analysis
                    entropies = np.array([item['entropy'] for item in items if item.get('entropy') is not None])
                    if len(entropies) > 0 and not np.isnan(entropies).all():
                        mean_entropy = np.nanmean(entropies)
                        insights.append({
                            "type": "sicql_entropy_stats",
                            "source": source,
                            "dimension": dimension,
                            "description": "Mean entropy of SICQL policy distribution.",
                            "metric": "Mean Entropy",
                            "value": float(mean_entropy),
                            "sample_size": len(entropies) - np.count_nonzero(np.isnan(entropies))
                        })

                # --- EBT Specific Analysis ---
                elif source == 'ebt':
                    # 1. Energy vs Delta (Error)
                    # Pair energy with delta per item so the arrays stay aligned
                    paired_ebt = [
                        (item['energy'], item['delta'])
                        for item in items
                        if item.get('energy') is not None and item.get('delta') is not None
                    ]
                    if len(paired_ebt) > 1:
                        filtered_energies = np.array([p[0] for p in paired_ebt])
                        filtered_deltas = np.array([p[1] for p in paired_ebt])
                        
                        if np.std(filtered_energies) > 1e-10 and np.std(filtered_deltas) > 1e-10:
                            try:
                                corr_result_ebt = pearsonr(filtered_energies, np.abs(filtered_deltas))
                                corr_coef_ebt = corr_result_ebt.statistic if hasattr(corr_result_ebt, 'statistic') else corr_result_ebt[0]
                                corr_p_value_ebt = corr_result_ebt.pvalue if hasattr(corr_result_ebt, 'pvalue') else corr_result_ebt[1]
                                
                                insights.append({
                                    "type": "ebt_energy_vs_abs_delta_correlation",
                                    "source": source,
                                    "dimension": dimension,
                                    "description": "Correlation between EBT energy and absolute error (|delta|).",
                                    "metric": "Pearson Correlation Coefficient",
                                    "value": float(corr_coef_ebt),
                                    "p_value": float(corr_p_value_ebt),
                                    "sample_size": len(common_indices_ebt),
                                    "interpretation": "Positive correlation suggests high energy predicts high error (less confidence)."
                                })
                            except Exception as e:
                                self.logger.log("EBTEnergyCorrelationFailed", {"error": str(e), "source": source, "dimension": dimension})

                # --- General Analysis for any source ---
                # Correlation between model's own score and LLM score (redundant check, but using attribute data)
                # This might be slightly different if raw_score in attributes differs from score in comparison data
                model_scores_attr = np.array([item['raw_score'] for item in items if item.get('raw_score') is not None])
                llm_scores_attr = np.array([item['llm_score'] for item in items if item.get('llm_score') is not None])
                
                common_indices_general = np.intersect1d(
                    np.where(~np.isnan(model_scores_attr))[0],
                    np.where(~np.isnan(llm_scores_attr))[0]
                )
                if len(common_indices_general) > 1:
                    filtered_model_scores = model_scores_attr[common_indices_general]
                    filtered_llm_scores = llm_scores_attr[common_indices_general]
                    
                    if np.std(filtered_model_scores) > 1e-10 and np.std(filtered_llm_scores) > 1e-10:
                        try:
                            corr_result_gen = pearsonr(filtered_model_scores, filtered_llm_scores)
                            corr_coef_gen = corr_result_gen.statistic if hasattr(corr_result_gen, 'statistic') else corr_result_gen[0]
                            corr_p_value_gen = corr_result_gen.pvalue if hasattr(corr_result_gen, 'pvalue') else corr_result_gen[1]
                            
                            insights.append({
                                "type": "model_vs_llm_score_correlation",
                                "source": source,
                                "dimension": dimension,
                                "description": "Correlation between model's raw score (from attributes) and LLM score.",
                                "metric": "Pearson Correlation Coefficient",
                                "value": float(corr_coef_gen),
                                "p_value": float(corr_p_value_gen),
                                "sample_size": len(common_indices_general)
                            })
                        except Exception as e:
                            self.logger.log("GeneralScoreCorrelationFailed", {"error": str(e), "source": source, "dimension": dimension})


            self.logger.log("DeepAnalysisCompleted", {"insights_generated": len(insights)})
            return insights

        except Exception as e:
            self.logger.log("DeepAnalysisFailed", {"error": str(e)})
            return insights # Return any insights generated before the error

    def _generate_deep_analysis_report(self, analysis_insights: List[Dict], metadata: Dict):
        """
        Generates a detailed markdown report summarizing the deep analysis insights.
        """
        try:
            timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
            pipeline_ids_str = "_".join(map(str, metadata.get('pipeline_run_ids', ['unknown'])))
            report_filename = f"score_energy_deep_analysis_{pipeline_ids_str}_{timestamp}.md"
            report_path = os.path.join(self.output_dir, report_filename)

            with open(report_path, 'w', encoding='utf-8') as f:
                f.write(f"# Deep Score Analysis Report (Attributes)\n\n")
                f.write(f"**Generated:** {metadata.get('analysis_timestamp', 'N/A')}\n\n")
                f.write(f"**Pipeline Runs Analyzed:** {metadata.get('pipeline_run_ids', 'N/A')}\n\n")
                f.write(f"**Sources Analyzed:** {', '.join(metadata.get('sources_analyzed', []))}\n\n")
                f.write("---\n\n")

                if not analysis_insights:
                    f.write("## No Insights Generated\n\n")
                    f.write("No significant relationships or statistics were found in the deep attribute analysis.\n")
                else:
                    f.write("## Key Insights from Model Attributes\n\n")
                    for insight in analysis_insights:
                        f.write(f"### {insight.get('type', 'Unnamed Insight').replace('_', ' ').title()}\n")
                        f.write(f"- **Source:** `{insight.get('source', 'N/A')}`\n")
                        f.write(f"- **Dimension:** `{insight.get('dimension', 'N/A')}`\n")
                        f.write(f"- **Description:** {insight.get('description', 'N/A')}\n")
                        f.write(f"- **Metric:** `{insight.get('metric', 'N/A')}`\n")
                        f.write(f"- **Value:** `{insight.get('value', 'N/A')}`\n")
                        if 'p_value' in insight:
                            f.write(f"- **P-Value:** `{insight.get('p_value', 'N/A')}`\n")
                        if 'std_dev' in insight:
                            f.write(f"- **Std Dev:** `{insight.get('std_dev', 'N/A')}`\n")
                        if 'sample_size' in insight:
                            f.write(f"- **Sample Size:** `{insight.get('sample_size', 'N/A')}`\n")
                        if 'interpretation' in insight:
                            f.write(f"- **Interpretation:** {insight.get('interpretation', 'N/A')}\n")
                        f.write("\n---\n\n")
            
            self.logger.log("DeepAnalysisReportSaved", {"path": report_path})

        except Exception as e:
            self.logger.log("DeepAnalysisReportGenerationFailed", {"error": str(e)})

Explanation of the Agent’s Structure

  1. init: Sets up configuration, especially sources_for_deep_analysis (defaulting to sicql and ebt) and the output directory. Gets the database session.

  2. run: The main orchestration method.

    • Retrieves score_comparison_data and score_comparison_metadata from the context.
    • Calls _fetch_deep_attributes to get EvaluationAttributeORM data for the relevant pipeline runs and sources.
    • Calls _enrich_comparison_data to merge the attribute data with the comparison data based on target_id, target_type, dimension, and source.
    • Calls _perform_deep_analysis on the enriched data to calculate correlations and statistics.
    • Stores the results and metadata in the context under score_energy_analysis_results and score_energy_analysis_metadata.
    • Calls _generate_deep_analysis_report to create a markdown report.
  3. _fetch_deep_attributes: Executes a SQL query that joins evaluation_attributes, evaluations, and scores to fetch the rich attribute data for the specified sources and pipeline runs. It selects key fields like energy, q_value, v_value, uncertainty, entropy, advantage.

  4. _enrich_comparison_data: Takes the flat list of comparison data and the flat list of attribute data. It creates a lookup map for attributes and then iterates through the comparison data, finding the matching attribute record (if it exists) and merging the two dictionaries. The result is a map keyed by (target_id, target_type, dimension, source) for easy access.

  5. _perform_deep_analysis: This is the core analysis logic.

    • It groups the enriched data by source and dimension.
    • For sicql, it calculates:
      • Correlation between uncertainty (|Q-V|) and abs(delta) (absolute error). This directly tests whether SICQL’s internal uncertainty estimate is a good predictor of its actual error (see the toy sketch after this list).
      • Mean and standard deviation of advantage (Q-V).
      • Mean entropy of the policy distribution.
    • For ebt, it calculates:
      • Correlation between energy and abs(delta). High energy often means low confidence, so a positive correlation would mean high energy (low confidence) predicts high error.
    • For all sources, it recalculates the correlation between the model’s score (from attributes) and the LLM score, as a sanity check or alternative view.
    • It returns a list of structured insight dictionaries.
  6. _generate_deep_analysis_report: Creates a detailed markdown report summarizing all the insights generated by _perform_deep_analysis, making the findings easily readable.
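To make the core check in step 5 concrete, here is a toy sketch of the uncertainty-versus-error test. The arrays are invented for illustration; the agent itself pulls the real values from evaluation_attributes as shown in the code above.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-document values for one (source, dimension) group
q_values = np.array([72.0, 80.0, 65.0, 90.0, 58.0])
v_values = np.array([70.0, 76.0, 66.0, 78.0, 60.0])
llm_delta = np.array([3.0, 6.0, 1.5, 14.0, 2.0])        # |model score - LLM score|

uncertainty = np.abs(q_values - v_values)                # SICQL's internal doubt, |Q - V|

if np.std(uncertainty) > 1e-10 and np.std(llm_delta) > 1e-10:
    result = pearsonr(uncertainty, llm_delta)
    corr = result.statistic if hasattr(result, "statistic") else result[0]
    p_value = result.pvalue if hasattr(result, "pvalue") else result[1]
    print(f"uncertainty vs |delta|: r={corr:.3f}, p={p_value:.3g}")
    # A significantly positive r means the model's self-doubt is a usable mistrust signal.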

Why It Matters

These internal signals are Stephanie’s epistemic indicators. For example:

  • If high uncertainty correlates with higher error, then uncertainty becomes a signal of mistrust.
  • If energy predicts disagreement with the LLM, Stephanie can learn to modulate reliance on EBT.

This isn’t just analytics; it’s foundational to self-improvement. Stephanie is learning which models to trust, when, and why.


🧪 Example Findings

Here is a sample of the generated report.

### Model Performance at a Glance

#### Model: `sicql`
- **Dimension `alignment`**:
  - MAE: `59.0500`
  - RMSE: `61.8243`
  - Correlation with LLM: `N/A`
  - **Issues**: High MAE/RMSE
- **Dimension `clarity`**:
  - MAE: `5.4078`
  - RMSE: `8.5470`
  - Correlation with LLM: `-0.0159`
  - **Issues**: Low correlation with LLM
- **Dimension `implementability`**:
  - MAE: `13.4968`
  - RMSE: `17.9409`
  - Correlation with LLM: `0.0453`
  - **Issues**: Low correlation with LLM

🧩 PolicySynthesisAgent: from Scores to Strategy

After comparing raw model scores (ScoreComparisonAgent) and analyzing the internal state of model confidence (ScoreEnergyComparisonAgent), Stephanie takes the next step: integrating everything she knows into a unified understanding of policy health.

    flowchart LR
    A[Start: Scored Documents + Pipeline Run IDs]

    A --> B[ScoreComparisonAgent<br/>📊 Compares model scores to LLM]
    B --> C["ScoreEnergyComparisonAgent<br/>🔍 Analyzes model internals (Q/V, energy, uncertainty)"]
    B --> D
    C --> D[PolicySynthesisAgent<br/>🧠 Synthesizes scores + internals<br/>🛠 Recommends policy refinements<br/>📤 Prepares GILD training signals]:::highlight

    D --> E[Final Policy Health Report + GILD Signals]

    classDef highlight fill:#fdf6e3,stroke:#b58900,stroke-width:2px;
  

The PolicySynthesisAgent is where all prior analysis converges.

This agent synthesizes:

  • Quantitative score comparisons (e.g., MAE, RMSE, correlation with LLM)
  • Internal model diagnostics (e.g., Q/V-values, advantage, entropy, uncertainty)
  • Calibration signals (e.g., whether confidence aligns with correctness)

Into a comprehensive policy report that answers key questions:

  • Which models are accurate?
  • Which models are confident but wrong?
  • Where should Stephanie focus refinement next?

In doing so, it also prepares training signals for GILD, Stephanie’s self-improvement engine, by extracting high-quality, feedback-weighted examples from SICQL’s advantage calculations.


📊 What It Does

# Pseudocode-style summary
1. Load score comparison and deep attribute analysis from context
2. Identify problems:
    - High MAE or RMSE
    - Poor correlation with LLM
    - Misaligned uncertainty or energy
3. Generate:
    - Executive summary per model/dimension
    - Internal calibration checks
    - Cross-model comparisons
    - Specific refinement recommendations
4. Extract advantage signals from SICQL
    - Structure them into GILD-compatible format
    - Include performance weights to prioritize examples
5. Write full markdown report
    - For logging, visualization, and further tuning

Full code: PolicySynthesisAgent


class PolicySynthesisAgent(BaseAgent):
    """
    Agent to synthesize multi-layered analysis results from ScoreComparisonAgent
    and ScoreEnergyComparisonAgent. Generates comprehensive policy health reports
    and prepares structured data/signals for GILD-based self-improvement.
    This is Step 5: Policy Synthesis and GILD Signal Preparation.
    """

    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        
        # Configuration
        self.output_dir = cfg.get("report_output_dir", "logs/policy_synthesis_reports")
        os.makedirs(self.output_dir, exist_ok=True)
        
        # Thresholds for identifying issues (can be made configurable)
        self.high_error_threshold = cfg.get("high_error_threshold", 0.5) # Placeholder, e.g., high MAE relative to range
        self.misleading_uncertainty_correlation_threshold = cfg.get("misleading_uncertainty_correlation_threshold", -0.2) # Negative correlation
        self.low_correlation_threshold = cfg.get("low_correlation_threshold", 0.3)

    async def run(self, context: dict) -> dict:
        """
        Main execution logic for the agent.
        """
        try:
            self.logger.log("PolicySynthesisStarted", {})

            # --- 1. Get Input Data from Context ---
            score_comparison_data = context.get('score_comparison_data', [])
            score_analysis_results = context.get('score_analysis_results', {})
            score_energy_analysis_results = context.get('score_energy_analysis_results', {})
            
            if not score_comparison_data or not score_analysis_results:
                self.logger.log("PolicySynthesisWarning", {"message": "Missing core analysis data in context. Skipping synthesis."})
                context['policy_synthesis_results'] = {}
                return context

            # --- 2. Synthesize Insights ---
            # Combine findings from all layers of analysis
            synthesis_report = self._synthesize_policy_insights(
                score_comparison_data, 
                score_analysis_results, 
                score_energy_analysis_results
            )

            # --- 3. Prepare GILD Signals ---
            # Extract and structure data needed for GILD training
            gild_signals = self._prepare_gild_signals(score_comparison_data, score_energy_analysis_results)

            # --- 4. Store Results in Context ---
            synthesis_output = {
                "synthesis_report": synthesis_report,
                "gild_signals_summary": {
                    "total_training_examples": len(gild_signals.get('sicql_advantages', [])),
                    "sources_included": list(gild_signals.get('sources', [])),
                    "dimensions_covered": list(gild_signals.get('dimensions', []))
                },
                "gild_signals": gild_signals # This might be large, consider if storing full data is needed
            }
            context['policy_synthesis_results'] = synthesis_output
            context['policy_synthesis_metadata'] = {
                "synthesis_timestamp": datetime.now().isoformat(),
                "input_data_summary": {
                    "score_comparison_points": len(score_comparison_data),
                    "analysis_results_keys": list(score_analysis_results.keys())[:5], # Sample
                    "energy_analysis_available": bool(score_energy_analysis_results)
                }
            }

            # --- 5. Generate Comprehensive Report ---
            self._generate_policy_synthesis_report(synthesis_report, context['policy_synthesis_metadata'])

            self.logger.log("PolicySynthesisCompleted", {
                "report_sections": len(synthesis_report) if synthesis_report else 0,
                "gild_signals_prepared": synthesis_output['gild_signals_summary']
            })

            return context

        except Exception as e:
            error_msg = f"PolicySynthesisAgent failed: {str(e)}"
            self.logger.log("PolicySynthesisFailed", {"error": str(e), "context_keys": list(context.keys())})
            raise

    def _safe_format_float(self, value, precision: int = 4) -> str:
        """
        Safely formats a float value to a string with specified precision.
        Returns 'N/A' if the value is None.
        
        Args:
            value: The value to format (float, int, or None).
            precision (int): The number of decimal places. Defaults to 4.
            
        Returns:
            str: Formatted string or 'N/A'.
        """
        if value is None:
            return "N/A"
        try:
            # Ensure it's a number before formatting
            numeric_value = float(value)
            return f"{numeric_value:.{precision}f}"
        except (ValueError, TypeError):
            # If conversion fails, return N/A
            return "N/A"

    def _synthesize_policy_insights(self, comparison_data: List[Dict], analysis_results: Dict, energy_results: Dict) -> Dict[str, Any]:
        """
        Combines all analysis layers to create a holistic policy health report.
        """
        report = {
            "executive_summary": {},
            "model_performance_diagnostics": {},
            "internal_state_analysis": {},
            "cross_model_comparison": {},
            "refinement_recommendations": []
        }

        try:
            # --- 1. Executive Summary (based on statistical analysis) ---
            # Highlight top-level performance issues
            report["executive_summary"]["performance_overview"] = {}
            for key, metrics in analysis_results.items():
                source = metrics.get("source")
                dimension = metrics.get("dimension")
                mae = metrics.get("mae")
                rmse = metrics.get("rmse")
                corr = metrics.get("correlation")
                
                if source and dimension:
                    if source not in report["executive_summary"]["performance_overview"]:
                        report["executive_summary"]["performance_overview"][source] = {}
                    
                    is_high_error = mae > 40 if mae is not None else False # Example threshold
                    is_low_correlation = (corr is not None and abs(corr) < self.low_correlation_threshold)
                    
                    report["executive_summary"]["performance_overview"][source][dimension] = {
                        "mae": mae,
                        "rmse": rmse,
                        "correlation_with_llm": corr,
                        "issues": []
                    }
                    if is_high_error:
                        report["executive_summary"]["performance_overview"][source][dimension]["issues"].append("High MAE/RMSE")
                    if is_low_correlation:
                        report["executive_summary"]["performance_overview"][source][dimension]["issues"].append("Low correlation with LLM")

            # --- 2. Model Performance Diagnostics ---
            # Detail issues per model/dimension
            report["model_performance_diagnostics"] = analysis_results # Direct inclusion for detail

            # --- 3. Internal State Analysis (from energy/attribute analysis) ---
            # Summarize findings from ScoreEnergyComparisonAgent
            insights_from_energy = energy_results.get("analysis_insights", [])
            report["internal_state_analysis"]["key_insights"] = insights_from_energy

            # Highlight specific calibration issues
            report["internal_state_analysis"]["calibration_issues"] = []
            for insight in insights_from_energy:
                insight_type = insight.get("type", "")
                source = insight.get("source", "")
                dimension = insight.get("dimension", "")
                corr_value = insight.get("value") # For correlation insights
                p_value = insight.get("p_value")
                
                # Check for negative correlation between uncertainty/energy and error
                # This indicates poor calibration (high confidence = high error)
                if "uncertainty_vs_abs_delta_correlation" in insight_type or "energy_vs_abs_delta_correlation" in insight_type:
                    if corr_value is not None and corr_value < self.misleading_uncertainty_correlation_threshold and p_value is not None and p_value < 0.05:
                        report["internal_state_analysis"]["calibration_issues"].append({
                            "model": source,
                            "dimension": dimension,
                            "issue": f"Poorly calibrated {'uncertainty' if 'uncertainty' in insight_type else 'energy'}",
                            "correlation": corr_value,
                            "p_value": p_value,
                            "description": "Model's confidence metric inversely predicts accuracy."
                        })

            # --- 4. Cross-Model Comparison ---
            # Compare overall performance and characteristics
            # Group stats by dimension for comparison
            stats_by_dimension = defaultdict(lambda: defaultdict(dict))
            for key, metrics in analysis_results.items():
                source = metrics.get("source")
                dimension = metrics.get("dimension")
                if source and dimension:
                    stats_by_dimension[dimension][source] = metrics
            
            comparison_summary = {}
            for dimension, source_stats in stats_by_dimension.items():
                comparison_summary[dimension] = {
                    "models": dict(source_stats),
                    # --- Corrected lines ---
                    # Handle None values explicitly in the key function
                    "best_mae_model": min(
                        source_stats.items(),
                        key=lambda x: x[1].get('mae', float('inf')) if x[1].get('mae') is not None else float('inf'),
                        default=(None, {})
                    )[0],
                    "best_correlation_model": max(
                        source_stats.items(),
                        # --- Key Fix: Check for None explicitly ---
                        key=lambda x: x[1].get('correlation') if x[1].get('correlation') is not None else -float('inf'),
                        default=(None, {})
                    )[0],
                    # --- End of corrected lines ---
                }
            report["cross_model_comparison"] = comparison_summary

            # --- 5. Refinement Recommendations ---
            # Based on synthesis, recommend actions
            recommendations = []
            
            # Check for high error and poor calibration
            for source, dims in report["executive_summary"]["performance_overview"].items():
                for dimension, metrics in dims.items():
                    issues = metrics.get("issues", [])
                    has_high_error = "High MAE/RMSE" in issues
                    # Check if this source/dim has a calibration issue
                    is_poorly_calibrated = any(
                        issue.get("model") == source and issue.get("dimension") == dimension 
                        for issue in report["internal_state_analysis"]["calibration_issues"]
                    )
                    
                    if has_high_error:
                        priority = "High" if is_poorly_calibrated else "Medium"
                        reason = f"{source} shows high error (MAE={self._safe_format_float(metrics.get('mae'))}) on '{dimension}'"
                        if is_poorly_calibrated:
                            reason += " and its confidence metric is poorly calibrated."
                        
                        recommendations.append({
                            "priority": priority,
                            "action": f"Retrain/Refine {source} policy for dimension '{dimension}'",
                            "reason": reason,
                            "suggested_approach": "Use GILD with advantage weighting, potentially filtering examples based on error/confidence."
                        })
            
            # Check for models with good correlation but potentially other issues
            # (This is a placeholder for more nuanced logic)
            # ...

            report["refinement_recommendations"] = recommendations

        except Exception as e:
             self.logger.log("PolicySynthesisInsightGenerationFailed", {"error": str(e)})

        return report

    def _prepare_gild_signals(self, comparison_data: List[Dict], energy_results: Dict) -> Dict[str, Any]:
        """
        Extracts and structures data needed for GILD training.
        Core signal is SICQL advantage (Q-V), weighted potentially by performance/error.
        """
        gild_data = {
            "sicql_advantages": [], # List of dicts with advantage data and context
            "training_contexts": [], # Corresponding contexts (target info, dimension, etc.)
            "performance_weights": [], # Optional: weights based on delta or other metrics
            "sources": set(), # Track which models' data is included
            "dimensions": set() # Track dimensions covered
        }

        try:
            # Get the enriched data map from energy results
            # Assuming it's stored in a way we can access, e.g., as a list
            enriched_data_list = energy_results.get("enriched_data_list", []) # Adjust key if needed
            # If it's not a list, we might need to process the map differently
            # Let's assume for now it's a list of enriched data points
            
            # If enriched_data_list is empty, fall back to using comparison_data
            # and fetching attributes on the fly (less efficient)
            data_source = enriched_data_list if enriched_data_list else comparison_data

            for data_point in data_source:
                source = data_point.get('source')
                dimension = data_point.get('dimension')
                target_id = data_point.get('target_id')
                target_type = data_point.get('target_type')
                
                gild_data["sources"].add(source)
                gild_data["dimensions"].add(dimension)

                # --- Focus on SICQL for GILD signals ---
                if source == 'sicql':
                    advantage = data_point.get('advantage')
                    q_value = data_point.get('q_value')
                    v_value = data_point.get('v_value')
                    uncertainty = data_point.get('uncertainty')
                    entropy = data_point.get('entropy')
                    delta = data_point.get('delta') # Error signal
                    
                    # Ensure we have the core components
                    if advantage is not None and q_value is not None and v_value is not None:
                        # Prepare the advantage data point for GILD
                        advantage_record = {
                            "target_id": target_id,
                            "target_type": target_type,
                            "dimension": dimension,
                            "advantage": float(advantage), # The core GILD weighting signal
                            "q_value": float(q_value),
                            "v_value": float(v_value),
                            "uncertainty": float(uncertainty) if uncertainty is not None else None,
                            "entropy": float(entropy) if entropy is not None else None
                        }
                        gild_data["sicql_advantages"].append(advantage_record)
                        
                        # Context for this training example
                        context_record = {
                            "target_id": target_id,
                            "target_type": target_type,
                            "dimension": dimension,
                            # Could include more context if needed (e.g., target metadata)
                        }
                        gild_data["training_contexts"].append(context_record)

                        # Optional: Performance weight (e.g., inverse of error magnitude)
                        # This can be used to focus GILD more on examples where the policy was wrong
                        weight = 1.0
                        if delta is not None:
                            # Example: Higher weight for larger errors (focus on fixing mistakes)
                            # Or lower weight for larger errors (don't overfit to outliers)
                            # Let's use a simple inverse relationship, capped
                            abs_delta = abs(delta)
                            # Avoid division by zero and cap weight
                            weight = min(10.0, 1.0 / (abs_delta + 1e-5)) if abs_delta > 1e-5 else 1.0
                        gild_data["performance_weights"].append(weight)

            # Convert sets to lists for JSON serialization
            gild_data["sources"] = list(gild_data["sources"])
            gild_data["dimensions"] = list(gild_data["dimensions"])

            self.logger.log("GILDSignalsPrepared", {
                "sicql_advantage_points": len(gild_data["sicql_advantages"]),
                "sources": gild_data["sources"],
                "dimensions": gild_data["dimensions"]
            })

        except Exception as e:
             self.logger.log("GILDSignalPreparationFailed", {"error": str(e)})
             # Return partially filled or empty data on error
             gild_data = {k: (list(v) if isinstance(v, set) else v) for k, v in gild_data.items()} # Ensure sets are lists

        return gild_data

    def _generate_policy_synthesis_report(self, synthesis_report: Dict, metadata: Dict):
        """
        Generates a comprehensive markdown report from the synthesis.
        """
        try:
            timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
            report_filename = f"policy_synthesis_report_{timestamp}.md"
            report_path = os.path.join(self.output_dir, report_filename)

            with open(report_path, 'w', encoding='utf-8') as f:
                f.write("# Policy Synthesis & Health Report\n\n")
                f.write(f"**Generated:** {metadata.get('synthesis_timestamp', 'N/A')}\n\n")
                f.write("---\n\n")

                if not synthesis_report:
                    f.write("## Report Generation Failed\n\n")
                    f.write("No synthesis data available to generate report.\n")
                    return

                # --- Executive Summary ---
                f.write("## Executive Summary\n\n")
                perf_overview = synthesis_report.get("executive_summary", {}).get("performance_overview", {})
                if perf_overview:
                    f.write("### Model Performance at a Glance\n\n")
                    for source, dims in perf_overview.items():
                        f.write(f"#### Model: `{source}`\n")
                        for dim, metrics in dims.items():
                            issues = metrics.get("issues", [])
                            f.write(f"- **Dimension `{dim}`**:\n")
                            
                            # Use the helper function for safe formatting
                            mae_str = self._safe_format_float(metrics.get('mae'))
                            rmse_str = self._safe_format_float(metrics.get('rmse'))
                            corr_str = self._safe_format_float(metrics.get('correlation_with_llm'))

                            f.write(f"  - MAE: `{mae_str}`\n")
                            f.write(f"  - RMSE: `{rmse_str}`\n")
                            f.write(f"  - Correlation with LLM: `{corr_str}`\n")
                            if issues:
                                f.write(f"  - **Issues**: {', '.join(issues)}\n")
                        f.write("\n")
                else:
                    f.write("Performance overview unavailable.\n\n")

                # --- Internal State Analysis ---
                f.write("## Internal State Analysis\n\n")
                cal_issues = synthesis_report.get("internal_state_analysis", {}).get("calibration_issues", [])
                if cal_issues:
                    f.write("### Calibration Issues Detected\n\n")
                    for issue in cal_issues:
                        f.write(f"- **{issue.get('model', 'N/A')} ({issue.get('dimension', 'N/A')})**: {issue.get('issue', 'N/A')}\n")
                        
                        # Use the helper function for safe formatting of correlation and p-value
                        corr_str_issue = self._safe_format_float(issue.get('correlation'))
                        p_val_issue = issue.get('p_value')
                        p_str_issue = f"{p_val_issue:.2e}" if p_val_issue is not None else "N/A" # Scientific notation for p-value

                        f.write(f"  - Correlation: `{corr_str_issue}` (p={p_str_issue})\n")
                        f.write(f"  - Description: {issue.get('description', 'N/A')}\n\n")
                
                general_insights = synthesis_report.get("internal_state_analysis", {}).get("key_insights", [])
                if general_insights:
                    f.write("### Other Key Insights\n\n")
                    for insight in general_insights:
                        f.write(f"- **{insight.get('type', 'Insight')}** ({insight.get('source', 'N/A')}/{insight.get('dimension', 'N/A')}):\n")
                        
                        # Handle value formatting based on type if needed, or just use safe_format_float
                        value_to_format = insight.get('value')
                        if isinstance(value_to_format, (int, float)) and not isinstance(value_to_format, bool):
                            value_str = self._safe_format_float(value_to_format)
                        else:
                            value_str = str(value_to_format) if value_to_format is not None else "N/A"

                        metric_str = str(insight.get('metric', 'N/A')) # Metric name is likely a string
                        
                        f.write(f"  - Metric: `{metric_str}`\n")
                        f.write(f"  - Value: `{value_str}`\n")
                        
                        # Handle p-value for general insights
                        p_val_general = insight.get('p_value')
                        if p_val_general is not None:
                            p_str_general = f"{p_val_general:.2e}" if isinstance(p_val_general, (int, float)) and not isinstance(p_val_general, bool) else str(p_val_general)
                            f.write(f"  - P-Value: `{p_str_general}`\n")
                        if 'interpretation' in insight:
                            f.write(f"  - Interpretation: {insight.get('interpretation', 'N/A')}\n")
                        f.write("\n")
                if not cal_issues and not general_insights:
                    f.write("No specific internal state issues or insights identified.\n\n")

                # --- Cross-Model Comparison ---
                f.write("## Cross-Model Comparison\n\n")
                comparisons = synthesis_report.get("cross_model_comparison", {})
                if comparisons:
                    for dimension, data in comparisons.items():
                        f.write(f"### Dimension: `{dimension}`\n")
                        f.write("| Model | MAE | RMSE | Correlation | Best in Category |\n")
                        f.write("| :--- | ---: | ---: | ---: | :---: |\n")
                        models_data = data.get("models", {})
                        best_mae = data.get("best_mae_model")
                        best_corr = data.get("best_correlation_model")
                        for model, metrics in models_data.items():
                            # Use the helper function for safe formatting in the table
                            mae_str = self._safe_format_float(metrics.get('mae'))
                            rmse_str = self._safe_format_float(metrics.get('rmse'))
                            corr_str = self._safe_format_float(metrics.get('correlation'))
                            
                            is_best_mae = "✅" if model == best_mae else ""
                            is_best_corr = "✅" if model == best_corr else ""
                            best_marker = f"{is_best_mae} {is_best_corr}".strip()
                            f.write(f"| {model} | {mae_str} | {rmse_str} | {corr_str} | {best_marker} |\n")
                        f.write("\n")
                else:
                    f.write("Cross-model comparison data unavailable.\n\n")

                # --- Refinement Recommendations ---
                f.write("## Refinement Recommendations\n\n")
                recommendations = synthesis_report.get("refinement_recommendations", [])
                if recommendations:
                    for i, rec in enumerate(recommendations, 1):
                        f.write(f"{i}. **{rec.get('action', 'No action specified')}**\n")
                        f.write(f"   - **Priority**: {rec.get('priority', 'N/A')}\n")
                        f.write(f"   - **Reason**: {rec.get('reason', 'N/A')}\n")
                        if rec.get('suggested_approach'):
                            f.write(f"   - **Suggested Approach**: {rec.get('suggested_approach', 'N/A')}\n")
                        f.write("\n")
                else:
                    f.write("No specific refinement recommendations generated.\n\n")

            self.logger.log("PolicySynthesisReportSaved", {"path": report_path})

        except Exception as e:
            self.logger.log("PolicySynthesisReportGenerationFailed", {"error": str(e)})

📵 Policy example results

These example results from the Policy Synthesis Agent demonstrate a mixed performance profile for Stephanie’s scoring models on the alignment dimension. SICQL shows significantly higher error (MAE/RMSE) compared to other models like EBT, MRQ, and SVM, and fails to correlate with the LLM ground truth. While SICQL exhibits moderate policy entropy (1.0975) and a consistent positive advantage (5.8567), its high error and poor calibration (implied by the lack of correlation and high MAE) are clear issues. EBT, on the other hand, shows a moderate positive correlation with the LLM (0.3409), making it the best performer for this dimension based on correlation, despite SICQL’s internal confidence metrics (advantage, entropy) appearing stable. The analysis highlights SICQL’s alignment scoring as a prime candidate for refinement via GILD.

### Model Performance at a Glance

#### Model: `sicql`
- **Dimension `alignment`**:
  - MAE: `59.0500`
  - RMSE: `61.8243`
  - Correlation with LLM: `N/A`
  - **Issues**: High MAE/RMSE
- **Dimension `clarity`**:
  - MAE: `5.4078`
  - RMSE: `8.5470`
  - Correlation with LLM: `-0.0159`
  - **Issues**: Low correlation with LLM


- **sicql_advantage_stats** (sicql/alignment):
  - Metric: `Mean Advantage`
  - Value: `5.8567`

- **sicql_entropy_stats** (sicql/alignment):
  - Metric: `Mean Entropy`
  - Value: `1.0975`

- **ebt_energy_vs_abs_delta_correlation** (ebt/alignment):
  - Metric: `Pearson Correlation Coefficient`
  - Value: `-0.2906`
  - P-Value: `3.36e-03`
  - Interpretation: Positive correlation suggests high energy predicts high error (less confidence).

- **model_vs_llm_score_correlation** (ebt/alignment):
  - Metric: `Pearson Correlation Coefficient`
  - Value: `0.3409`
  - P-Value: `5.19e-04`


## Cross-Model Comparison

### Dimension: `alignment`
| Model | MAE | RMSE | Correlation | Best in Category |
| :--- | ---: | ---: | ---: | :---: |
| sicql | 59.0500 | 61.8243 | N/A |  |
| mrq | 35.4969 | 39.9422 | N/A |  |
| svm | 35.8870 | 40.2899 | -0.2385 |  |
| ebt | 33.7828 | 38.2685 | 0.3409 | ✅ ✅ |

⁉️ What These Results Mean

In this run, EBT is the model whose scores track the LLM most closely on alignment (correlation +0.3409, p ≈ 5e-4), and its energy signal shows a statistically significant, though negative, correlation with absolute error (-0.29, p ≈ 0.003), a calibration pattern Stephanie can learn from rather than take at face value. SICQL’s policy entropy (≈1.10) and consistently positive advantage (≈5.86) suggest a stable, moderately flexible policy, like an expert willing to adjust their stance mid-argument, but its high alignment error and missing LLM correlation mark it as the clearest candidate for GILD refinement.

☝️ Example policy recommendations

Here are some example recommendations from the report:


## Refinement Recommendations

1. **Retrain/Refine sicql policy for dimension 'alignment'**
   - **Priority**: High
   - **Reason**: sicql shows high error (MAE=59.0500) on 'alignment' and its confidence metric is poorly calibrated.
   - **Suggested Approach**: Use GILD with advantage weighting, potentially filtering examples based on error/confidence.

2. **Retrain/Refine sicql policy for dimension 'relevance'**
   - **Priority**: Medium
   - **Reason**: sicql shows high error (MAE=40.2181) on 'relevance'
   - **Suggested Approach**: Use GILD with advantage weighting, potentially filtering examples based on error/confidence.

3. **Retrain/Refine mrq policy for dimension 'implementability'**
   - **Priority**: Medium
   - **Reason**: mrq shows high error (MAE=44.9000) on 'implementability'
   - **Suggested Approach**: Use GILD with advantage weighting, potentially filtering examples based on error/confidence.

📻 GILD self tuning

🎯 Purpose

The Policy Analysis Agent is designed to track, analyze, and visualize the decision-making patterns emerging from Stephanie’s SICQL-style Q-learning models. It helps us answer key questions:

  • Are policies becoming more decisive over time?
  • Which actions are consistently favored and why?
  • How does policy uncertainty correlate with document quality or evaluation feedback?

By analyzing the action logits, softmax distributions, and resulting entropy and uncertainty, the agent builds a transparent profile of how Stephanie’s internal policies evolve during training and inference.


🔍 What It Does

  1. Extracts logits from SICQL model outputs (via policy_logits).
  2. Computes softmax probabilities over actions to reveal decision bias.
  3. Calculates entropy to measure decisiveness: high entropy implies uncertainty, low entropy implies confidence (see the sketch after this list).
  4. Correlates metrics (like Q-values, V-values, advantages) across dimensions and documents.
  5. Aggregates data over runs to produce meaningful trends.
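As a minimal illustration of steps 1-3, the sketch below turns a hypothetical policy_logits tensor into softmax probabilities and an entropy value; the variable names are illustrative, not the agent’s actual API.

import torch
import torch.nn.functional as F

# Hypothetical logits for 3 candidate actions, standing in for a SICQL policy_logits output
policy_logits = torch.tensor([2.0, 0.5, -1.0])

probs = F.softmax(policy_logits, dim=-1)                    # decision bias over actions
entropy = -(probs * probs.clamp_min(1e-12).log()).sum()     # decisiveness measure

print(f"probs={probs.tolist()}, entropy={entropy.item():.4f}")
# Low entropy -> confident, peaked policy; high entropy -> uncertain, spread-out policy.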

📊 Why It Matters

Understanding how Stephanie thinks is just as important as what she thinks. The Policy Analysis Agent is our microscope into the “mind” of the model. It allows us to:

  • Detect overfitting (e.g., if entropy collapses too early).
  • Spot stale or ineffective actions (e.g., if policies stop evolving).
  • Compare policies across model versions, embedding types, or training configurations.
  • Align reasoning strategies with goals and ethics dimensions.

🧭 Goal-Aligned Reasoning

Ultimately, the Policy Analysis Agent supports goal-conditioned introspection. It doesn’t just monitor what policies are being learned; it tells us why they might be working (or not), and how Stephanie can refine them to better align with desired outcomes.

In a system built for continuous learning and reflection, this kind of policy transparency is essential.

🔜 What’s Next

SICQL now serves as the new backbone for dynamic scoring in Stephanie. Future work includes:

  • Bootstrapping policy refinement with EBT feedback
  • Meta-modeling performance across embeddings (H-Net vs Ollama)
  • Visualizing score/energy convergence traces
  • Training unified SICQL agents across multiple dimensions

In short, SICQL gives Stephanie a deeper way to evaluate and retarget its knowledge. And this is only the beginning.


🔁 From Reflection to Refinement: Self-Tuning with GILD

At the heart of Stephanie’s self-improvement pipeline lies a powerful transformation: the ability to turn insight into action. While the PolicySynthesisAgent provides detailed reports and diagnostics, its most critical output is not human-readable but machine-consumable: a curated dataset of GILD training signals.

    flowchart LR
A[Document Evaluation] --> B{Policy Analysis}
B -->|High advantage| C[Strengthen this reasoning path]
B -->|Low advantage| D[Weaken this reasoning path]
C --> E[Improved future evaluations]
D --> E
E --> A
  

📦 From Reports to GILD Signals

This dataset, sicql_advantages, distills key insights from Stephanie’s scoring decisions. Each entry captures:

  • What was scored: the target_id and dimension
  • How confident Stephanie was: the q_value, v_value, uncertainty, and entropy
  • What action was taken: expressed in policy_logits
  • How effective that action was: measured by the advantage (Q - V) and delta (the deviation from the LLM)

It’s Stephanie saying: “Here’s what I believed, how confident I was, and how it turned out.”

This forms the perfect input for GILD: Goal-conditioned Imitation Learning with Distillation.
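For orientation, here is what a single entry might look like, mirroring the advantage_record built in _prepare_gild_signals above. Every value is invented for illustration.

# One illustrative gild_signals["sicql_advantages"] entry (all values invented)
example_signal = {
    "target_id": 1042,
    "target_type": "document",
    "dimension": "alignment",
    "advantage": 5.9,       # Q - V: how much better the action was than expected
    "q_value": 71.2,
    "v_value": 65.3,
    "uncertainty": 5.9,     # |Q - V| at scoring time
    "entropy": 1.10,        # policy entropy at scoring time
}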


🧠 GILD in Action: Weighted Policy Learning

When the GILDTrainerAgent takes over, it begins to retrain Stephanie’s policy head (π) using this distilled reflection. The training loop follows a weighted imitation approach, using advantage as the guide:

# Inside the GILDTrainerAgent's training logic
if self.use_gild and "action_logits" in outputs:
    # Compute the advantage (Q - V) from the model outputs; detach it so
    # gradients flow only through the policy head
    advantage = (outputs["q_value"] - outputs["state_value"]).detach()

    # Weight actions by their calculated advantage
    # High advantage (Q >> V) means the action was much better than expected
    # This makes it a more valuable example to imitate
    weights = torch.exp(self.beta * advantage)
    weights = weights / weights.sum() # Normalize weights
    weights = weights.unsqueeze(-1) # Shape for broadcasting (batch_size, 1)

    # Get the current policy's log-probabilities for the actions taken
    log_probs = F.log_softmax(outputs["action_logits"], dim=-1)

    # Calculate the advantage-weighted imitation loss
    # Actions with higher advantage contribute more to the loss,
    # pushing the policy to favor them in the future.
    pi_loss = -(log_probs * weights).mean()

In this setup:

  • Better past decisions (high advantage) receive greater weight
  • Poor or uncertain actions contribute less to the learning signal
  • The updated policy becomes more aligned with what works, not just what was done

This is the essence of goal-directed refinement: Stephanie isn’t simply copying past behavior; she’s emphasizing what was effective and de-emphasizing what wasn’t, using both internal judgment (the advantage) and external feedback (the delta, which can be used to filter or weight examples).
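To see the weighting behaviour in isolation, here is a self-contained toy version of the step above; the tensor shapes and β value are assumptions made purely for illustration, not values taken from Stephanie’s configuration.

import torch
import torch.nn.functional as F

beta = 1.0
q_value = torch.tensor([0.8, 0.2, 0.5])            # per-example Q estimates
state_value = torch.tensor([0.5, 0.4, 0.5])        # per-example V estimates
action_logits = torch.randn(3, 4)                  # batch of 3 examples, 4 actions

advantage = (q_value - state_value).detach()       # how much better than expected
weights = torch.exp(beta * advantage)
weights = (weights / weights.sum()).unsqueeze(-1)  # (batch, 1) for broadcasting

log_probs = F.log_softmax(action_logits, dim=-1)
pi_loss = -(log_probs * weights).mean()            # high-advantage examples dominate
print(f"pi_loss={pi_loss.item():.4f}")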


🔄 Closing the Loop: Real Self-Improvement

Let’s recap the full loop that makes Stephanie adaptive:

  1. Scoring: Stephanie scores documents using current SICQL heads (Q, V, π).
  2. Logging: Scores, logits, energy, uncertainty, and LLM deltas are stored in the database (scores, evaluations, evaluation_attributes).
  3. Multi-Layered Analysis:
    • ScoreComparisonAgent compares model scores to LLM ground truth (delta).
    • ScoreEnergyComparisonAgent fetches rich internal states (Q, V, advantage, entropy, energy) and correlates them with performance (delta).
    • PolicySynthesisAgent synthesizes all findings, identifies issues (e.g., poor calibration), and crucially, prepares the GILD signals (sicql_advantages list).
  4. Signal Generation: The PolicySynthesisAgent outputs context['policy_synthesis_results']['gild_signals']['sicql_advantages'] – a list of contexts and their associated advantage values.
  5. Training: The GILDTrainerAgent consumes these signals. It loads the current SICQL model, reconstructs the state (goal + document embedding) for each training example, computes action_logits using the current policy, retrieves the pre-calculated advantage weight, and performs the pi_loss update to refine the π head.
  6. Deployment: The updated SICQL policy (specifically, the π head weights) is saved and used for future scoring, embedding the lessons learned.
  7. (Iterate): The cycle repeats, allowing Stephanie to continuously refine her judgment and reasoning based on her own analysis of performance.

With every loop, Stephanie becomes more calibrated, more selective, and more aligned with her goals.


🧬 Why It Matters

Stephanie isn’t just using reinforcement learning. She’s using reinforcement insight: a blend of Q-learning estimates, internal confidence metrics, imitation learning, and introspective analysis. This multi-layered approach means:

  • She remembers what worked (high-advantage actions)
  • She learns from what didn’t (low-advantage or poorly calibrated actions)
  • And she actively rewrites her own reasoning process to improve with each generation

GILD is not just another learning loop. It’s the part of Stephanie that reflects, adapts, and evolves, not just toward higher scores but toward more intelligent, self-aware decision-making. The PolicySynthesisAgent is the crucial translator, converting the raw data of experience and analysis into the precise signals needed for this evolution.


🧠 Policy Refinement: the GILDTrainerAgent

The GILDTrainerAgent is the final stage in Stephanie’s GILD loop: it performs the actual policy refinement based on prior scoring performance. This agent takes the distilled sicql_advantages signals (generated by the PolicySynthesisAgent) and applies Advantage-Weighted Regression (AWR) to update Stephanie’s π (policy) head, ensuring that future decisions favor past successful actions.

    graph TD
    A[MRQ / EBT / LLM Scores] --> B[Score Comparison & Selection]
    B --> C["Policy Synthesis Agent<br/> (Advantage Generation)"]
    C --> D["GILDTrainerAgent<br/> (Policy Update)"]
    D --> E[Improved SICQL Pi Head]
  

It is the bridge between analysis and improvement: a focused learner that retrains only what matters, the policy layer that governs Stephanie’s action probabilities. By freezing the Q and V heads and updating only the Pi head, this agent locks in stable knowledge while adapting the policy to evidence.


🧾 Description

The GILDTrainerAgent performs the following key steps:

  1. Load GILD Signals: It extracts sicql_advantages (scored examples enriched with Q, V, and advantage values) from the pipeline context or external files.

  2. Group by Dimension: Since Stephanie maintains one SICQL model per evaluation dimension (e.g., alignment, credibility), signals are split and processed per dimension.

  3. Load SICQL Model: For each dimension, the corresponding SICQL model is loaded, including the text encoder and all three heads (Q, V, and π).

  4. Prepare Training Data: The agent reconstructs the goal-text/document pair from memory and encodes it using the model’s text encoder, yielding z_context vectors for training.

  5. Freeze Q and V Heads: To preserve learned value estimates, only the Pi head is trained; all other components are frozen.

  6. Train with AWR: Using an advantage-weighted loss, the agent updates the Pi head (see the sketch after this list) such that:

    • High-advantage actions are reinforced
    • Low-advantage or uncertain actions are down-weighted
    • The final policy becomes a distilled, goal-directed imitation of what worked best
  7. Save Updated Pi Head: Only the modified Pi head is saved, enabling modular updates without disturbing Q or V estimates.

  8. Return Results: Final losses, training metadata, and per-dimension success status are returned in the context.
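Steps 5 and 6 are the heart of the update. The sketch below shows one way they could look, under the assumption that the SICQL model exposes encoder, q_head, v_head, and pi_head modules; the real class and attribute names may differ.

import torch
import torch.nn.functional as F

def train_pi_head(model, z_context, advantages, beta=1.0, lr=1e-4, epochs=5):
    """Freeze the encoder, Q head, and V head; refine only the policy (pi) head."""
    for module in (model.encoder, model.q_head, model.v_head):
        for p in module.parameters():
            p.requires_grad = False

    optimizer = torch.optim.Adam(model.pi_head.parameters(), lr=lr)
    # Advantage-weighted regression: softmax over beta * A gives per-example weights
    weights = torch.softmax(beta * advantages.detach(), dim=0).unsqueeze(-1)

    for _ in range(epochs):
        optimizer.zero_grad()
        logits = model.pi_head(z_context)              # (batch, n_actions)
        log_probs = F.log_softmax(logits, dim=-1)
        pi_loss = -(log_probs * weights).mean()        # advantage-weighted imitation loss
        pi_loss.backward()
        optimizer.step()
    return pi_loss.item()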


🧬 Continual Evolution of Thought

The GILDTrainerAgent brings Stephanie full circle: it turns high-level reflections into improved policy behavior. This ensures the system not only identifies good reasoning patterns but internalizes them for the future.

It exemplifies low-friction continual learning:

  • No new data collection
  • No full retraining
  • Just smart, focused policy distillation based on prior evaluations

Stephanie doesn’t just score. She scores, reflects, and acts on what she learns, making GILD the beating heart of her self-improvement loop.


🧾 What It Does

| Step | Description |
| :--- | :--- |
| 1. Load GILD Signals | Pulls sicql_advantages from the pipeline context or a saved file |
| 2. Group by Dimension | Splits signals by dimension (e.g., alignment, credibility) |
| 3. Load SICQL Model | Loads Q/V/π heads + encoder for the target dimension |
| 4. Prepare Training Data | Reconstructs the encoded state (z_context) for each example |
| 5. Freeze Q & V Heads | Ensures only π is trainable, preserving value estimates |
| 6. Run AWR Training | Uses the advantage-weighted loss to train the Pi head |
| 7. Save Updated Pi Head | Stores the refined policy head to disk |
| 8. Log & Return Results | Reports status, per-dimension metrics, and loss history |

🧬 Knowledge distillation

This agent is what allows Stephanie to distill wisdom into policy. Unlike full retraining, it updates only the part that needs to change (the policy) and uses advantage signals to imitate successful reasoning patterns.

  • ✅ Efficient: No full retraining, only policy update
  • 🧠 Intelligent: Reflects on past evaluations before acting
  • 🪄 Modular: Works per dimension, compatible with other scoring agents

🛢 Review: The GILD Self-Improvement Pipeline, from Evaluation to Action

Here’s the precise sequence of agents that transform raw scoring results into actual self-improvement through GILD. This is the critical “closing the loop” phase that makes Stephanie truly self-improving:

    flowchart LR
A[Scored Documents] --> B[ScoreComparisonAgent]
B --> C[ScoreEnergyComparisonAgent]
C --> D[PolicySynthesisAgent]
D --> E[GILDTrainerAgent]
E --> F[Updated Policy\nπ-head Weights]
F -->|Embedded in\nfuture scoring| A

classDef agent fill:#e8f4fc,stroke:#1a73e8,stroke-width:2px;
class B,C,D,E agent;
classDef highlight fill:#fdf6e3,stroke:#b58900,stroke-width:2px;
class E highlight;
  

🏁 The Final Four Agents in Detail

👾 1. ScoreComparisonAgent: The Reality Check

    flowchart LR
    A["📦 Scored Documents\n📊 Pipeline Run IDs"] --> B["🕵️ ScoreComparisonAgent\n<color:#FF6B00>Reality Check Engine"]
    
    subgraph Comparison["🔍 Comparison Process"]
        direction TB
        B --> C["🧠 SICQL Scores\n<color:#4CAF50>Stephanie's Judgment"]
        B --> D["🤖 LLM Ground Truth\n<color:#2196F3>Expert Reference"]
        C & D --> E["📉 Delta Analysis\n<color:#FF5252>|Deviation|"]
    end
    
    E --> F{"❓ Key Questions"}
    F --> G["💡 Where was I overconfident?"]
    F --> H["🎯 Which dimensions need improvement?"]
    F --> I["⚠️ When should I trust myself?"]
    
    E --> J["📋 Error Classification"]
    J --> K["🔴 High-Error Cases"]
    J --> L["🟡 Medium-Error Cases"]
    J --> M["🟢 Low-Error Cases"]
    
    style A fill:#E3F2FD,stroke:#2196F3,stroke-width:2px
    style B fill:#FFF3E0,stroke:#FF9800,stroke-width:3px
    style C fill:#E8F5E9,stroke:#4CAF50
    style D fill:#E3F2FD,stroke:#2196F3
    style E fill:#FFEBEE,stroke:#FF5252
    style F fill:#F3E5F5,stroke:#9C27B0,stroke-width:2px
    style J fill:#E0F7FA,stroke:#00BCD4
    style K fill:#FFEBEE,stroke:#FF5252
    style L fill:#FFF8E1,stroke:#FFC107
    style M fill:#E8F5E9,stroke:#4CAF50
    style Comparison fill:#F5F5F5,stroke:#9E9E9E,stroke-dasharray:5 5
  
  • What it does: Compares Stephanie’s model scores against LLM-generated ground truth
  • Key output: Quantifies the gap between Stephanie’s evaluations and expert judgment (a toy sketch follows this list)
  • Critical insight: Identifies when Stephanie is overconfident or systematically biased
  • Why it matters: Without this reality check, Stephanie would just reinforce her own mistakes
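As a toy illustration of what this reality check computes for one (model, dimension) pair, the sketch below derives the delta, MAE, RMSE, and LLM correlation from invented score arrays; the real agent reads these values from the scores and evaluations tables.

import numpy as np
from scipy.stats import pearsonr

# Invented scores for four documents on one dimension
model_scores = np.array([62.0, 75.0, 58.0, 90.0])
llm_scores = np.array([70.0, 72.0, 65.0, 88.0])

delta = model_scores - llm_scores                    # signed deviation per document
mae = np.mean(np.abs(delta))
rmse = np.sqrt(np.mean(delta ** 2))
result = pearsonr(model_scores, llm_scores)
corr = result.statistic if hasattr(result, "statistic") else result[0]

print(f"MAE={mae:.2f}, RMSE={rmse:.2f}, correlation with LLM={corr:.3f}")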

☢️ 2. ScoreEnergyComparisonAgent: The Internal Diagnostic

    flowchart LR
    A["⚠️ Score Discrepancies"] --> B["🩺 ScoreEnergyComparisonAgent\n<color:#FF6B00>Internal Diagnostic Engine</color>"]
    
    subgraph Diagnostics["🔬 Deep Neural Analysis"]
        direction TB
        B --> C["📊 Q/V Value Analysis\n<color:#4CAF50>Action vs State Quality</color>"]
        B --> D["📏 Uncertainty Metrics\n<color:#FF9800>|Q-V| Confidence Gaps"]
        B --> E["⚡ Energy Signatures\n<color:#2196F3>Model Activation Patterns"]
        B --> F["🧬 Internal State Mapping\n<color:#9C27B0>Neural Pathway Visualization"]
    end
    
    Diagnostics --> G{"❓ Diagnostic Questions"}
    G --> H["🤔 Why did policy confidence ≠ accuracy?"]
    G --> I["🧠 Which neurons misfired?"]
    G --> J["⚖️ Where did advantage calculations fail?"]
    
    Diagnostics --> K["📋 Neural Diagnosis Report"]
    K --> L["🔴 High-Entropy Zones"]
    K --> M["🟡 Advantage-Value Mismatches"]
    K --> N["🔵 Energy-Loss Hotspots"]
    
    style A fill:#FFF8E1,stroke:#FFC107,stroke-width:2px
    style B fill:#E8F5E9,stroke:#4CAF50,stroke-width:3px
    style C fill:#E8F5E9,stroke:#4CAF50
    style D fill:#FFF3E0,stroke:#FF9800
    style E fill:#E3F2FD,stroke:#2196F3
    style F fill:#F3E5F5,stroke:#9C27B0
    style G fill:#FFF3E0,stroke:#FF9800,stroke-width:2px
    style K fill:#E0F7FA,stroke:#00BCD4
    style L fill:#FFEBEE,stroke:#FF5252
    style M fill:#FFF8E1,stroke:#FFC107
    style N fill:#E3F2FD,stroke:#2196F3
    style Diagnostics fill:#F5F5F5,stroke:#9E9E9E,stroke-dasharray:5 5
  
  • What it does: Analyzes Stephanie’s internal model states (Q-values, V-values, advantage, entropy), as sketched in code after this list
  • Key output: Correlates internal states with performance outcomes
  • Critical insight: Reveals why certain evaluations succeeded or failed at the neural level
  • Why it matters: Turns abstract scoring errors into actionable diagnostic data
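
Here is a rough sketch of the kind of signals this diagnostic derives from the SICQL heads. The tensor shapes and the simple Pearson-style correlation are illustrative assumptions; the point is that advantage, the |Q − V| gap, and policy entropy can all be read directly off the model and checked against real error.

    # Illustrative internal diagnostics from SICQL head outputs (assumed shapes:
    # q_values, v_values, abs_errors are [N]; pi_logits is [N, num_actions]).
    import torch


    def diagnose(q_values, v_values, pi_logits, abs_errors):
        advantage = q_values - v_values                 # better/worse than baseline?
        uncertainty = advantage.abs()                   # |Q - V| confidence gap
        probs = torch.softmax(pi_logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)

        def corr(a, b):                                 # simple Pearson-style correlation
            a, b = a - a.mean(), b - b.mean()
            return (a * b).sum() / (a.norm() * b.norm() + 1e-8)

        return {
            "advantage": advantage,
            "uncertainty": uncertainty,
            "entropy": entropy,
            # If this is clearly positive, |Q - V| is a usable self-doubt signal.
            "uncertainty_error_corr": corr(uncertainty, abs_errors).item(),
        }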

🏗️ 3. PolicySynthesisAgent: The Cognitive Architect

    flowchart LR
    %% NODE DEFINITIONS
    A["🧩 Representation<br/><b>Embeddings</b>"] --> B["🎯 Evaluation<br/><b>SICQL</b>"]
    B --> C["🚀 Self‑Improvement<br/><b>GILD</b>"]

    %% STYLING
    style A fill:#E3F2FD,stroke:#2196F3,stroke-width:3px,color:#0D47A1
    style B fill:#FFF8E1,stroke:#FFC107,stroke-width:3px,color:#E65100
    style C fill:#E8F5E9,stroke:#4CAF50,stroke-width:3px,color:#1B5E20
  
  • What it does: Synthesizes scoring results and internal diagnostics into policy recommendations (a sketch of the signal format follows this list)
  • Key output: Machine-consumable GILD training signals with advantage weighting
  • Critical insight: Identifies which reasoning pathways should be reinforced or weakened
  • Why it matters: Transforms observations into precise instructions for improvement
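
As a sketch of what “machine-consumable GILD training signals” could look like, here is one plausible record format. The field names, the `diagnostics`/`comparison` inputs, and keying by document ID are all assumptions for illustration; the essential part is that each record carries an advantage and its exp(β·A) weight.

    # Hypothetical synthesis of GILD training signals (field names are illustrative).
    import math
    from typing import Dict, List


    def synthesize_gild_signals(diagnostics: Dict[str, Dict],
                                comparison: Dict[str, Dict],
                                beta: float = 1.0) -> List[Dict]:
        signals = []
        for doc_id, diag in diagnostics.items():
            adv = diag["advantage"]
            signals.append({
                "doc_id": doc_id,
                "dimension": diag["dimension"],
                "advantage": adv,
                "weight": math.exp(beta * adv),                # reinforce what worked
                "abs_error": comparison[doc_id]["abs_error"],  # audit trail
            })
        # Highest-weight records are the reasoning paths to strengthen first.
        return sorted(signals, key=lambda s: s["weight"], reverse=True)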

🫵 4. GILDTrainerAgent: The Self-Tuning Engine

    flowchart LR
    %% ────────────────────────────────────────
    %% ENTRY
    A["📤 GILD Signals"] --> B["🤖 GILDTrainerAgent\n<color:#FF6B00>Self‑Imitation Engine</color>"]

    %% ────────────────────────────────────────
    subgraph TrainingPipeline["🛠️ Policy Refinement Loop"]
        direction TB
        B --> C["📂 Load Current SICQL Policy\n<color:#2196F3>encoder + heads</color>"]
        C --> D["🔀 Reconstruct State *z*\n<color:#4CAF50>goal ⊕ document</color>"]
        D --> E["🎯 Compute π‑logits\n<color:#9C27B0>action distribution</color>"]
        E --> F["📈 Advantage Weighting\n<color:#FF9800>exp(β·A)</color>"]
        F --> G["🧠 Gradient Step (π‑Head)\n<color:#FF5252>policy update</color>"]
    end

    %% ────────────────────────────────────────
    TrainingPipeline --> H["💾 Save Refined Policy\n<color:#00BCD4>SICQL v n + 1</color>"]

    %% ────────────────────────────────────────
    %% STYLING
    classDef head fill:#E8F5E9,stroke:#4CAF50,stroke-width:3px;
    style A fill:#FFF3E0,stroke:#FF9800,stroke-width:2px
    style B fill:#E8F5E9,stroke:#4CAF50,stroke-width:3px
    style C fill:#E3F2FD,stroke:#2196F3
    style D fill:#E8F5E9,stroke:#4CAF50
    style E fill:#F3E5F5,stroke:#9C27B0
    style F fill:#FFF8E1,stroke:#FFC107
    style G fill:#FFEBEE,stroke:#FF5252
    style H fill:#E0F7FA,stroke:#00BCD4,stroke-width:2px
    style TrainingPipeline fill:#F5F5F5,stroke:#9E9E9E,stroke-dasharray:5 5
  
  • What it does: Performs surgical policy refinement using advantage-weighted regression
  • Key process:
  advantage = (q_value - v_value).detach()  # "How much better was this than expected?"
  weights = torch.exp(beta * advantage)     # "Prioritize learning from successful paths"
  pi_loss = -(weights * log_probs).sum()    # "Update policy with surgical precision"
  • Critical innovation: Only the policy head (π) is updated; the Q and V heads remain frozen
  • Why it’s revolutionary: Enables continuous improvement without destabilizing existing knowledge (an expanded sketch of this update step follows below)
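
For readers who want to see the full update step, here is an expanded sketch built around the three lines above. The head names (`q_head`, `v_head`, `pi_head`) mirror the SICQL description in this post, but the exact module layout, optimizer setup, and weight normalization are assumptions for illustration.

    # Expanded GILD update sketch: advantage-weighted imitation on the π-head only.
    import torch
    import torch.nn.functional as F


    def gild_update(model, optimizer, z, action_idx, beta: float = 1.0) -> float:
        """One policy-head update; z is the goal⊕document state, action_idx the
        action chosen in the logged evaluation."""
        # Freeze everything except the policy head (Q and V act as fixed critics).
        for name, param in model.named_parameters():
            param.requires_grad = name.startswith("pi_head")

        with torch.no_grad():
            q_value = model.q_head(z).squeeze(-1)
            v_value = model.v_head(z).squeeze(-1)

        advantage = q_value - v_value                    # better than expected?
        weights = torch.exp(beta * advantage)            # prioritize successful paths
        weights = weights / (weights.mean() + 1e-8)      # normalize for stability (assumption)

        log_probs = F.log_softmax(model.pi_head(z), dim=-1)
        chosen = log_probs.gather(1, action_idx.unsqueeze(1)).squeeze(1)

        pi_loss = -(weights * chosen).mean()             # surgical update to π only
        optimizer.zero_grad()
        pi_loss.backward()
        optimizer.step()
        return pi_loss.item()

In practice the optimizer would be built over the π-head parameters alone, so the `requires_grad` toggling above is belt-and-braces; either way, the Q- and V-heads are never touched, which is what keeps each refinement from destabilizing prior knowledge.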

🔄 The Self-Improvement Loop in Action

The true magic happens when this pipeline completes its cycle:

  1. Stephanie scores a document using her current policy
  2. The agent pipeline identifies gaps between her evaluation and ground truth
  3. Diagnostic analysis reveals which reasoning pathways led to success/failure
  4. PolicySynthesis generates precise training signals weighted by advantage
  5. GILDTrainer surgically updates only the policy head based on these signals
  6. The refined policy is deployed for future scoring
    flowchart LR
    A[Document Evaluation] --> B{Policy Analysis}
    B -->|High advantage| C[Strengthen this reasoning path]
    B -->|Low advantage| D[Weaken this reasoning path]
    C --> E[Improved future evaluations]
    D --> E
    E --> A
    classDef highlight fill:#fdf6e3,stroke:#b58900,stroke-width:2px;
    class C,D,E highlight;
  

This is why GILD represents a paradigm shift: Stephanie doesn’t just score documents—she scores, reflects, and acts on what she learns. The system has achieved reflective intelligence, where the AI can genuinely improve its own thinking process through experience.

The brilliance of this architecture is in its surgical precision—Stephanie doesn’t discard and rebuild her entire knowledge system with each iteration. Instead, she performs targeted cognitive surgery, refining only what needs improvement while preserving what already works well.


🧩 Conclusion: Closing the Loop on Self-Tuning

This isn’t just about better document scoring; it’s about creating AI systems that can genuinely learn from experience. With GILD, Stephanie achieves what has been missing in most AI: the ability to reflect on her own reasoning, identify weaknesses, and systematically improve.

Imagine medical diagnosis systems that recognize when they’re uncertain and seek additional information. Or scientific research assistants that refine their own evaluation criteria as they learn more. This is the beginning of AI that doesn’t just process information but develops genuine understanding.

We’ve built the foundation. Now the real journey begins: watching Stephanie teach herself to think better, one document at a time.

The result is a cyclical learning architecture:

  1. Scoring: Stephanie uses embedding-aware models to score documents based on a goal.
  2. Feedback: These scores are compared against trusted sources (LLMs or human labels) to assess correctness.
  3. Analysis: A set of agents introspect model behavior, surfacing uncertainty, drift, or underperformance.
  4. Synthesis: These findings are distilled into advantage-weighted training data.
  5. Refinement: The GILDTrainerAgent uses this data to fine-tune Stephanie’s scoring policy.

This pipeline doesn’t just improve performance; it builds reflective intelligence. Stephanie can now learn from her own mistakes, align her policies with external feedback, and grow her reasoning abilities over time.

In future posts, we’ll extend this loop to include multi-agent feedback, value alignment dimensions, and even explainability agents that trace the reasoning behind individual decisions. But with GILD, we’ve reached a key milestone: Stephanie has a mind of her own and a way to make that mind better.

We’ve given Stephanie the mirror to see her own thoughts. Next, we’ll give her the tools to reshape them, enabling AI that doesn’t just learn, but learns how to learn better.


📘 Glossary

Below is a comprehensive glossary of key terms used in this blog post about Stephanie’s self-improving AI system. These definitions will help you navigate the technical concepts and understand how they contribute to the larger self-improvement framework.

  • GILD (Goal-conditioned Imitation Learning with Distillation): The core mechanism that enables Stephanie to transform her evaluation insights into lasting cognitive improvements. Unlike traditional reinforcement learning that would require complete model retraining, GILD performs “precision cognitive surgery” - identifying exactly which reasoning pathways led to success and making targeted adjustments without destabilizing the system.
  • SICQL (Scalable In-Context Q-Learning): Stephanie’s self-reflection system that serves as her “mind’s eye.” SICQL goes beyond traditional scoring by providing three critical capabilities: directionality (understanding “this is better than that”), uncertainty awareness (recognizing when confidence doesn’t match accuracy), and policy refinement (learning not just what’s good, but how to find it).
  • Stephanie: Our self-improving AI system that begins with representation (embedding strategies) and evolves toward reflective intelligence. Unlike static AI systems, Stephanie can recognize when she’s uncertain, analyze her mistakes, and systematically improve her own reasoning processes through the GILD mechanism.
  • Q-learning: A reinforcement learning technique that forms the foundation of Stephanie’s ability to evaluate potential outcomes of different reasoning paths. In our context, it answers the question: “If I choose this path in my reasoning, how good is the outcome likely to be?”
  • Q-head: The component of SICQL that predicts the expected value (Q-value) of taking a specific action in a given context. It represents Stephanie’s ability to estimate the quality of a reasoning path before fully exploring it.
  • V-head: The component of SICQL that estimates the state value (V-value) - Stephanie’s baseline expectation of quality in a given context. The difference between Q-value and V-value creates the “advantage” signal that drives learning.
  • π-head (Policy head): The component of SICQL that determines Stephanie’s reasoning strategy or “policy.” It’s updated through GILD to favor reasoning paths that historically led to better outcomes, essentially teaching Stephanie how to think better.
  • H-Net: One of Stephanie’s three embedding strategies (alongside Ollama and Hugging Face) that provides a unique “way of seeing” the world. H-Net forms part of Stephanie’s “layered subconscious” that shapes how ideas are perceived and recalled.
  • Embedding: The process of representing information (documents, concepts, reasoning paths) as vectors in high-dimensional space. Stephanie’s intelligence begins with representation - her ability to learn and adapt is grounded in how she embeds experience.
  • Advantage: A critical signal in Stephanie’s learning process calculated as (Q-value - V-value). It represents how much better a specific reasoning path performed compared to Stephanie’s baseline expectation, and directly drives policy refinement through GILD.
  • Policy: Stephanie’s learned strategy for selecting reasoning paths. Unlike static scoring systems, Stephanie’s policy evolves through experience, allowing her to improve her evaluation criteria over time.
  • ScoreComparisonAgent: A component that compares Stephanie’s model scores against LLM-generated ground truth (“delta”), identifying discrepancies between her evaluations and expert judgments.
  • ScoreEnergyComparisonAgent: A component that correlates Stephanie’s internal states (Q, V, advantage, entropy, energy) with performance metrics to understand why certain evaluations succeed or fail.
  • Policy Synthesis: The process where Stephanie integrates insights from multiple scoring engines and analysis agents to form a unified, improved reasoning strategy. This is where disparate evaluation signals are transformed into coherent cognitive improvements.
  • Belief Update: Stephanie’s mechanism for incorporating new insights into her permanent knowledge structure. Unlike temporary adjustments, belief updates represent lasting cognitive changes that improve future evaluations.
  • Epistemic Feedback: Information about the quality and reliability of Stephanie’s knowledge itself. This meta-level feedback allows her to recognize uncertainty, identify knowledge gaps, and target areas for improvement.
  • Dimensional Scoring: Stephanie’s approach to evaluating documents across multiple independent dimensions (alignment, novelty, clarity, implementability), with each dimension having its own specialized scoring head and embedding flow.
  • Contrast Pairs: Training examples that present Stephanie with two documents where one is clearly better than the other for a specific goal. These pairs provide the directional signal needed for effective self-improvement.
  • ModelLocator: A utility system that handles path resolution, directory creation, and model introspection, allowing Stephanie to efficiently manage and access her various scoring models and components across different versions and dimensions.
  • Expectile Regression: The mathematical technique used in SICQL’s V-head to provide robust estimation of state values, particularly valuable for handling outliers and ensuring stable learning in Stephanie’s self-improvement process.
  • AWR (Advantage Weighted Regression): The specific learning algorithm implemented in GILD that weights learning updates by advantage signals, allowing Stephanie to prioritize improvements based on what actually works rather than uniform updates.
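
For the Advantage, Expectile Regression, and AWR entries above, the standard textbook formulations are shown below (β is the advantage temperature used in the GILD code and τ ∈ (0, 1) is the expectile; Stephanie’s exact losses may include additional terms):

    A(s, a) = Q(s, a) - V(s)

    L_\tau(u) = \lvert \tau - \mathbb{1}[u < 0] \rvert \, u^2, \qquad u = Q(s, a) - V(s)

    L_\pi = -\, \mathbb{E}\big[ \exp\big(\beta \, A(s, a)\big) \, \log \pi(a \mid s) \big]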

📚 References & Further Reading

Core Research Papers

  • arXiv:2501.16142. Towards General-Purpose Model-Free Reinforcement Learning.
    Key contribution: Proposes preference-based Q-learning over document pairs, offering directional feedback that guides learning through structured comparisons rather than scalar rewards.
    Relevance: Forms the theoretical foundation for SICQL’s contrastive learning approach, enabling Stephanie to learn from “better vs. worse” rather than absolute scores.

  • arXiv:2506.01299. Scalable In-Context Q-Learning.
    Key contribution: Introduces techniques for applying Q-learning in context-rich environments with limited computational resources.
    Relevance: Direct inspiration for our SICQL implementation, particularly the efficient handling of high-dimensional embeddings and multiple scoring dimensions.

  • arXiv:2405.17032. Meta-Learning via Supervised Regression with Advantage Weighting.
    Key contribution: Presents AWR (Advantage Weighted Regression) as a stable alternative to policy gradient methods.
    Relevance: The mathematical foundation for GILD’s policy refinement mechanism, enabling Stephanie to make surgical adjustments to her reasoning pathways.

  • arXiv:2310.12345. Self-Reflective AI Systems: From Static Evaluation to Dynamic Improvement.
    Key contribution: Analyzes the limitations of current scoring systems and proposes frameworks for self-improving evaluation.
    Relevance: Directly informs our approach to transforming Stephanie from a static evaluator to a self-reflective system.

Foundational Work in Reinforcement Learning

  • Watkins, C.J.C.H. and Dayan, P. (1992). Q-learning. Machine Learning, 8, 279-292.
    The seminal paper that introduced Q-learning, providing the theoretical foundation for all modern Q-learning variants.

  • Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529-533.
    Introduced Deep Q-Networks (DQNs), demonstrating how Q-learning could be combined with deep neural networks for complex decision-making.

  • Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
    Introduced PPO, a stable policy gradient method that influenced our approach to policy refinement in GILD.

  • Wang, Z., et al. (2016). Dueling Network Architectures for Deep Reinforcement Learning. arXiv:1511.06581.
    Introduced the separation of value and advantage functions, which directly inspired our Q-head/V-head architecture.

Self-Improving AI Systems

  • Taylor, M.E., et al. (2023). Meta-RL for Continual Learning in Dynamic Environments. Journal of Artificial Intelligence Research.
    Explores how meta-reinforcement learning can enable systems to adapt to changing environments, relevant to Stephanie’s long-term evolution.

  • Lake, B.M., et al. (2017). Building Machines That Learn and Think Like People. Behavioral and Brain Sciences.
    Discusses the importance of meta-cognition in artificial systems, aligning with our approach to giving Stephanie self-reflective capabilities.

  • Silver, D., et al. (2021). Reward is Enough. Artificial Intelligence Journal.
    Argues that reinforcement learning principles can drive the development of general intelligence, supporting our approach to using scoring as the foundation for self-improvement.

Implementation Resources