Applied Policy: How to Incorporate Policy and Hallucination Signals into a Self-Improving System


Building a Self-Improving AI: Cooperative ERL and Embed-RL in a Trace-Native Architecture

1. The Problem

Most self-improving AI systems fail for one of three reasons:

First, scalar reward collapse. Traditional reinforcement learning compresses multi-dimensional quality into a single scalar. This creates catastrophic interference: improving one axis (e.g., coherence) can degrade another (e.g., hallucination safety). The system optimizes for the blended metric, not the underlying objectives.

Second, representation drift. Embedding-based optimization without behavioral feedback creates geometric collapse. The embedding space becomes increasingly narrow, losing discriminative power. Similar queries map to identical regions. Diversity vanishes. The system becomes brittle.

Third, unstable reflection loops. Naive “retry until better” approaches lack structured correction. Each attempt is independent. Past failures are forgotten. The system cycles through the same error patterns without learning from them. This is not self-improvement—it’s stochastic search.

These problems are not independent. They compound:

  • Scalar rewards encourage representation drift
  • Representation drift corrupts behavioral signals
  • Unstable reflection amplifies both

The fundamental insight is this: behavioral improvement and representational improvement must co-evolve. They cannot be optimized in isolation. This requires a new architectural pattern—one that preserves multi-objective signals, enforces geometric stability, and compounds learning across episodes.


2. Core Design Principles

We built Stephanie’s self-improvement layer on five principles:

Multi-Objective Reward

Rewards must remain multi-dimensional. Each quality axis (HRM alignment, hallucination energy, embedding margin) maintains its own signal. Improvement is measured per-axis, not as a weighted sum. Dominance—not scalar comparison—determines whether an attempt is better.

Trace-Bound Evaluation

Every evaluation is bound to execution provenance. The RewardVector carries a trace ID, timestamp, and source model. This enables forensic replay, cross-episode pattern recognition, and causal attribution. Without trace binding, improvement signals become uninterpretable.

Axis-Aware Learning

Updates must be axis-specific. Improving HRM alignment should not interfere with embedding margin optimization. Each axis has its own direction semantics (higher-is-better vs. lower-is-better), update routing, and stability constraints.

Separation of Behavior and Representation

Behavioral correction (policy updates) and representational correction (embedding updates) occur in separate phases. The behavior layer improves what the system does. The representation layer improves how the system sees. They cooperate but do not interfere.

Geometry Stability Constraints

Embedding updates must preserve global structure. Local improvements cannot collapse the manifold. We enforce norm stability, angular drift limits, and variance preservation. Without these constraints, representation learning becomes destructive.


3. System Overview

The complete self-improvement loop executes in ten stages:

┌─────────────────────────────────────────────────────────────────┐
│                    STEP 1: ATTEMPT                              │
│                                                                 │
│  ContextPack → PlanTrace → Model.forward() → Output: y₁        │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│              STEP 2: MULTI-LAYER EVALUATION                     │
│                                                                 │
│  SignalProvider orchestration → RewardVector(score₁)           │
│                                                                 │
│  Axes evaluated:                                                │
│    • HRM alignment                                              │
│    • Hallucination energy                                       │
│    • Embedding margin                                           │
│    • Policy advantage                                           │
│    • Metric stability                                           │
│    • Context fidelity                                           │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│            STEP 3: REFLECTION TRIGGER                           │
│                                                                 │
│  if score₁.requires_reflection(config):                        │
│      → Generate ReflectionTrace                                 │
│      → Correction plan + focus dimensions                       │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│              STEP 4: IMPROVED ATTEMPT                           │
│                                                                 │
│  Model.forward_with_reflection(input, ReflectionTrace)         │
│                                 ↓                               │
│                         Output: y₂                              │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│              STEP 5: COMPARISON                                 │
│                                                                 │
│  Score y₂ → RewardVector(score₂)                               │
│                                                                 │
│  Δ_reward = score₂.delta(score₁)                               │
│                                                                 │
│  if score₂.dominates(score₁, critical_axes):                   │
│      → ACCEPT IMPROVEMENT                                      │
│  else:                                                         │
│      → REJECT, return y₁                                       │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│         STEP 6: BEHAVIOR INTERNALIZATION (ERL)                  │
│                                                                 │
│  Model.distill(input, y₂)                                      │
│                                                                 │
│  Updates:                                                       │
│    • Policy weights via SICQL/GILD                              │
│    • HRM alignment loss                                         │
│    • Context fidelity reinforcement                             │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│       STEP 7: REPRESENTATION INTERNALIZATION (Embed-RL)        │
│                                                                 │
│  EmbeddingTrainer.update_from_trace(                           │
│      input_embedding,                                          │
│      y₂_embedding,                                             │
│      improvement_delta=Δ_reward                                │
│  )                                                              │
│                                                                 │
│  Updates:                                                       │
│    • Contrastive loss: pull y₂ toward goal                      │
│    • Push y₁ away (negative example)                            │
│    • Mine hard negatives from MemCube                           │
│    • Apply geometry stability governor                          │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│              STEP 8: MEMORY CONSOLIDATION                       │
│                                                                 │
│  MemCube.store(                                                │
│      type="improvement_pattern",                               │
│      payload={                                                 │
│          "input_signature": embedding(input),                  │
│          "failure_vector": score₁,                             │
│          "correction_trace": ReflectionTrace,                  │
│          "improvement_delta": Δ_reward,                        │
│          "timestamp": now()                                    │
│      }                                                         │
│  )                                                              │
│                                                                 │
│  Future benefit:                                                │
│    • Retrieve similar failures → apply known corrections       │
│    • Cross-episode pattern recognition                         │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│              STEP 9: VPM COMPRESSION                            │
│                                                                 │
│  VPM.compress(                                                 │
│      trace_id,                                                 │
│      improvement_pattern=MemCube.last_stored(),                │
│      priority="high"                                           │
│  )                                                              │
│                                                                 │
│  Result:                                                       │
│    • Improved reasoning paths preserved                        │
│    • Compression prioritizes successful corrections            │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│         STEP 10: ADAPTIVE AXIS WEIGHTING                        │
│                                                                 │
│  Meta-optimizer tracks:                                        │
│    • Axis utility (improvement persistence)                    │
│    • Failure rate per axis                                     │
│    • Cross-axis interference                                   │
│                                                                 │
│  Updates config.weights for next episode                       │
└─────────────────────────────────────────────────────────────────┘

This loop executes per reasoning attempt. Successful improvements compound across episodes via MemCube retrieval and hard negative mining.
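The ten stages above can be condensed into a toy driver. Everything here is a stand-in, not Stephanie's real API: `run_episode`, `requires_reflection`, and the single score table are assumptions made for illustration, and only the accept/reject skeleton of Steps 1-5 is executed.

```python
AXIS_DIRECTION = {"hrm_alignment": "higher", "hallucination_energy": "lower"}

def requires_reflection(score, threshold=0.7):
    # Toy trigger: reflect whenever alignment falls below a threshold
    return score["hrm_alignment"] < threshold

def dominates(new, old, critical_axes):
    # Strict dominance with per-axis direction semantics
    for axis in critical_axes:
        if AXIS_DIRECTION[axis] == "higher":
            if new[axis] <= old[axis]:
                return False
        elif new[axis] >= old[axis]:
            return False
    return True

def run_episode(attempt, evaluate, reflect, critical_axes):
    y1 = attempt(None)                  # Step 1: first attempt
    s1 = evaluate(y1)                   # Step 2: multi-axis evaluation
    if not requires_reflection(s1):
        return y1
    trace = reflect(s1)                 # Step 3: structured reflection
    y2 = attempt(trace)                 # Step 4: improved attempt
    s2 = evaluate(y2)                   # Step 5: dominance comparison
    # On acceptance, Steps 6-10 (distillation, Embed-RL, memory,
    # VPM compression, axis reweighting) would run here.
    return y2 if dominates(s2, s1, critical_axes) else y1

# Toy components: reflection steers the attempt to a "grounded" variant.
scores = {"draft":    {"hrm_alignment": 0.5, "hallucination_energy": 0.6},
          "grounded": {"hrm_alignment": 0.8, "hallucination_energy": 0.2}}
attempt = lambda trace: "grounded" if trace else "draft"
result = run_episode(attempt, scores.__getitem__, lambda s: "fix",
                     ["hrm_alignment", "hallucination_energy"])
```

The improved attempt dominates on both axes (alignment up, energy down), so `result` is the grounded variant; had either axis regressed, the original draft would have been returned.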


4. Multi-Objective Reward (RewardVector)

The RewardVector is the foundational primitive. It replaces scalar reward with structured, multi-dimensional assessment.

Axis Direction Semantics

Each reward axis declares its optimization direction:

class RewardAxis(str, Enum):
    HRM_ALIGNMENT = "hrm_alignment"                  # Higher = better
    HALLUCINATION_ENERGY = "hallucination_energy"    # Lower = better
    EMBEDDING_MARGIN = "embedding_margin"            # Higher = better
    POLICY_ADVANTAGE = "policy_advantage"            # Higher = better
    METRIC_ALIGNMENT = "metric_alignment"            # Higher = better
    COHERENCE = "coherence"                          # Higher = better
    CONTEXT_FIDELITY = "context_fidelity"            # Higher = better

class AxisDirection(str, Enum):
    HIGHER_IS_BETTER = "higher"
    LOWER_IS_BETTER = "lower"

AXIS_SEMANTICS: Dict[RewardAxis, AxisDirection] = {
    RewardAxis.HRM_ALIGNMENT: AxisDirection.HIGHER_IS_BETTER,
    RewardAxis.HALLUCINATION_ENERGY: AxisDirection.LOWER_IS_BETTER,
    RewardAxis.EMBEDDING_MARGIN: AxisDirection.HIGHER_IS_BETTER,
    # ... all axes
}

Direction semantics are ontology-level, not configuration-level. The physics of improvement cannot be overridden by mode settings.

Direction-Normalized Delta

Improvement deltas are computed with direction awareness:

def delta(self, other: "RewardVector") -> "RewardVector":
    """
    Direction-normalized delta.
    Positive values always mean improvement.
    """
    delta_vals: Dict[RewardAxis, float] = {}
    
    for axis in set(self.values) | set(other.values):
        self_val = self.values.get(axis, 0.0)
        other_val = other.values.get(axis, 0.0)
        direction = AXIS_SEMANTICS[axis]
        
        if direction == AxisDirection.HIGHER_IS_BETTER:
            delta = self_val - other_val
        else:  # LOWER_IS_BETTER
            delta = other_val - self_val
        
        delta_vals[axis] = delta
    
    return RewardVector(values=delta_vals, ...)

This guarantees: positive delta = improvement, regardless of axis direction. Training signals never invert.

Dominance Check

Improvement is determined by strict multi-objective dominance:

def dominates(self, other: "RewardVector", critical_axes: List[RewardAxis]) -> bool:
    """
    Strict multi-objective dominance.
    Must strictly improve on all critical axes.
    """
    for axis in critical_axes:
        self_val = self.values.get(axis, 0.0)
        other_val = other.values.get(axis, 0.0)
        direction = AXIS_SEMANTICS[axis]
        
        if direction == AxisDirection.HIGHER_IS_BETTER:
            if self_val <= other_val:
                return False
        else:  # LOWER_IS_BETTER
            if self_val >= other_val:
                return False
    
    return True

No weighted sums. No scalar collapse. Improvement must hold on all critical axes simultaneously. This prevents reward hacking and cross-axis interference.
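A dependency-free sketch makes the direction-normalization concrete. This is not Stephanie's code: the axes are simplified to plain strings and the `delta` helper mirrors the method above.

```python
DIRECTION = {"hrm_alignment": "higher", "hallucination_energy": "lower"}

def delta(new, old):
    # Direction-normalized: positive always means improvement
    out = {}
    for axis in set(new) | set(old):
        n, o = new.get(axis, 0.0), old.get(axis, 0.0)
        out[axis] = (n - o) if DIRECTION[axis] == "higher" else (o - n)
    return out

score1 = {"hrm_alignment": 0.60, "hallucination_energy": 0.50}
score2 = {"hrm_alignment": 0.72, "hallucination_energy": 0.35}
d = delta(score2, score1)
# Both deltas come out positive (≈ +0.12 and ≈ +0.15) even though the raw
# values moved in opposite directions: alignment rose while energy fell.
```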

Weighted Scoring (Optional)

For modes that require scalar fallback:

def weighted_score(self, weights: Dict[RewardAxis, float]) -> float:
    """
    Direction-normalized weighted scalar score.
    LOWER_IS_BETTER axes are automatically inverted.
    """
    total = 0.0
    for axis, weight in weights.items():
        val = self.values.get(axis, 0.0)
        direction = AXIS_SEMANTICS[axis]
        normalized_val = val if direction == AxisDirection.HIGHER_IS_BETTER else -val
        total += weight * normalized_val
    return total

This is used only for mode-specific policies, not for improvement decisions.

Failure Signatures and Confidence

Each RewardVector carries diagnostic metadata:

@dataclass(frozen=True)
class RewardVector:
    values: Dict[RewardAxis, float]
    trace_id: str
    timestamp: float
    source_model: str
    confidence: float = 1.0
    failure_signatures: List[str] = field(default_factory=list)

Failure signatures (e.g., "energy_spike", "margin_collapse") feed the reflection engine. Confidence scores track evaluation reliability.


5. Provider-Based Evaluation Architecture

The MultiLayerEvaluator orchestrates signal providers. It is a pure reducer—never a calculator.

Signal Provider Abstraction

Each quality axis is computed by an independent provider:

class SignalProvider(Protocol):
    def compute(
        self,
        context_pack: ContextPack,
        plan_trace: PlanTrace,
        output: Any,
        **kwargs
    ) -> SignalResult:
        ...

@dataclass
class SignalResult:
    """Atomic signal contribution from one provider"""
    axis_values: Dict[RewardAxis, float]          # What this provider measured
    diagnostics: Dict[str, Any] = field(default_factory=dict)
    failure_signatures: List[str] = field(default_factory=list)
    confidence: float = 1.0

Providers are stateless and independent. They can be added, removed, or replaced without modifying the evaluator.
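As a sketch of what conforming to the protocol looks like, here is a hypothetical provider (the `LengthPenaltyProvider` name and its heuristic are invented for illustration; axes are simplified to strings, and `SignalResult` is restated locally so the snippet stands alone):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class SignalResult:
    axis_values: Dict[str, float]
    diagnostics: Dict[str, Any] = field(default_factory=dict)
    failure_signatures: List[str] = field(default_factory=list)
    confidence: float = 1.0

class LengthPenaltyProvider:
    """Hypothetical provider: penalizes outputs that pad for verbosity."""
    def compute(self, context_pack, plan_trace, output, **kwargs):
        words = len(str(output).split())
        # Full score up to 200 words, linear decay to 0.0 at 1000 words
        score = 1.0 if words <= 200 else max(0.0, 1.0 - (words - 200) / 800)
        failures = ["verbosity_spike"] if words > 600 else []
        return SignalResult({"coherence": score},
                            {"word_count": words}, failures)
```

Dropping a class like this into the provider list is the entire integration: the evaluator merges its `axis_values` without knowing the heuristic exists.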

Provider Taxonomy

Provider            Responsibility                    Returns
HRMProvider         Human reasoning alignment         {HRM_ALIGNMENT: score}
CertumProvider      Hallucination energy + failures   {HALLUCINATION_ENERGY: val} + failure signatures
EmbeddingProvider   Margin, metric stability          {EMBEDDING_MARGIN: val, METRIC_ALIGNMENT: val}
PolicyProvider      Advantage, context fidelity       {POLICY_ADVANTAGE: val, CONTEXT_FIDELITY: val}
CoherenceProvider   Narrative flow                    {COHERENCE: val}

Reducer-Style Evaluator

The evaluator aggregates provider outputs:

class MultiLayerEvaluator:
    def __init__(self, providers: List[SignalProvider], config: EvaluatorConfig):
        self.providers = providers
        self.config = config
    
    def evaluate(...) -> ScoreBundle:
        all_values = {}
        all_diagnostics = {}
        all_failures = []
        total_confidence = 0.0
        
        for provider in self.providers:
            result = provider.compute(...)
            
            # Merge
            all_values.update(result.axis_values)
            all_diagnostics.update(result.diagnostics)
            all_failures.extend(result.failure_signatures)
            total_confidence += result.confidence
        
        # Build once
        reward_vector = RewardVector(
            values=all_values,
            confidence=total_confidence / len(self.providers),
            failure_signatures=all_failures,
            ...
        )
        
        return ScoreBundle(reward_vector=reward_vector, ...)

The evaluator is 30 lines of pure reduction. No signal logic. No heuristics. No conditionals.

Why This Prevents God-Class Explosion

Without providers, the evaluator becomes a maintenance nightmare:

  • Adding new axes requires modifying evaluator code
  • Signal computation logic leaks into orchestration
  • Testing requires mocking entire evaluator
  • Configuration becomes entangled with logic

With providers:

  • Adding new axes = adding a provider
  • Signal logic is isolated and testable
  • Evaluator never changes
  • Configuration is provider filtering + weighting

This scales to dozens of axes without architectural debt.


6. Structured Reflection (ERL Layer)

Reflection is not generative critique. It is structured trace modification.

ReflectionTrace

The reflection engine produces structured correction plans:

@dataclass
class ReflectionTrace:
    original_trace_id: str
    failure_signature: List[str]
    improvement_plan: Dict[str, Any]  # Correction instructions
    focus_dimensions: List[RewardAxis]  # Axes to prioritize
    confidence_estimate: float

No loose text. No unstructured reasoning. Everything is trace-native and executable.

Axis-Driven Correction

The reflection engine analyzes ScoreBundle failure signatures:

class ReflectionEngine:
    def generate(self, score_bundle: ScoreBundle) -> ReflectionTrace:
        failures = score_bundle.reward_vector.failure_signatures
        
        plan = {}
        focus = []
        
        if "energy_spike" in failures:
            plan["correction"] = "Increase evidence grounding, reduce speculative leaps"
            focus.append(RewardAxis.HALLUCINATION_ENERGY)
        
        if "margin_collapse" in failures:
            plan["correction"] = "Strengthen goal alignment, retrieve similar examples"
            focus.append(RewardAxis.EMBEDDING_MARGIN)
        
        # ... other failure patterns
        
        return ReflectionTrace(
            original_trace_id=score_bundle.plan_trace.trace_id,
            failure_signature=failures,
            improvement_plan=plan,
            focus_dimensions=focus,
            confidence_estimate=self._compute_confidence(score_bundle)
        )

Correction plans are deterministic and axis-specific. The system knows what failed and how to fix it.

Trace Modification Instead of Token Modification

Reflection modifies the reasoning trace, not the output tokens:

Original PlanTrace:
  Step 1: Retrieve evidence
  Step 2: Generate hypothesis
  Step 3: Output conclusion

ReflectionTrace:
  Correction: "Add verification step after hypothesis generation"
  Focus: [HRM_ALIGNMENT, CONTEXT_FIDELITY]

Modified PlanTrace:
  Step 1: Retrieve evidence
  Step 2: Generate hypothesis
  Step 3: Verify against context pack  ← ADDED
  Step 4: Output conclusion

This is safer than token-level editing. The reasoning structure remains intact. Only the execution path is corrected.
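The insertion above can be sketched as a pure list operation. The `apply_correction` helper is an assumption, not Stephanie's API; the real PlanTrace steps are richer objects than the strings used here.

```python
def apply_correction(steps, insert_after, new_step):
    """Insert a corrective step into a plan trace without mutating it."""
    i = steps.index(insert_after) + 1
    return steps[:i] + [new_step] + steps[i:]

original = ["retrieve_evidence", "generate_hypothesis", "output_conclusion"]
modified = apply_correction(original, "generate_hypothesis",
                            "verify_against_context")
# The original trace is untouched; the corrected copy gains one step.
```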

Why This Is Safer Than Generative Critique

Generative critique produces unstructured text:

“Your answer was too speculative. Try again with more evidence.”

This is ambiguous. What does “more evidence” mean? How much? Which sources?

Structured reflection produces executable instructions:

{
  "correction": "Increase evidence grounding",
  "parameters": {
    "min_evidence_count": 3,
    "require_primary_sources": true,
    "confidence_threshold": 0.8
  },
  "focus_axes": ["hallucination_energy", "context_fidelity"]
}

The model knows exactly what to change. No interpretation required.


7. Axis-Aware Distillation Router

Distillation updates are routed per-axis. This prevents cross-axis interference.

Update Routing

Each axis has its own update path:

class DistillationRouter:
    def route_update(
        self,
        delta: RewardVector,
        input: Any,
        improved_output: Any
    ):
        # HRM alignment update
        if delta.values.get(RewardAxis.HRM_ALIGNMENT, 0) > 0:
            self.hrm_trainer.update(input, improved_output)
        
        # Hallucination energy reduction
        if delta.values.get(RewardAxis.HALLUCINATION_ENERGY, 0) > 0:
            self.certum_trainer.update(input, improved_output)
        
        # Embedding margin improvement
        if delta.values.get(RewardAxis.EMBEDDING_MARGIN, 0) > 0:
            self.embedding_trainer.update(input, improved_output)
        
        # ... other axes

Updates only occur for axes that improved. No wasted computation. No interference.

Delta Scaling

Update magnitude scales with improvement size:

def _scale_update(delta_value: float, base_lr: float) -> float:
    """
    Scale learning rate by improvement magnitude.
    Small improvements → small updates
    Large improvements → large updates
    """
    # Sigmoid scaling: bounded between 0.1x and 2.0x base_lr
    scale = 0.1 + 1.9 / (1 + math.exp(-5 * delta_value))
    return base_lr * scale

This prevents overfitting to marginal improvements. Large deltas drive larger updates.
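A quick numeric sanity check of the sigmoid (restated standalone): a neutral delta lands near 1.05x the base rate, large positive deltas approach the 2.0x ceiling, and large negative deltas floor near 0.1x.

```python
import math

def scale_update(delta_value, base_lr):
    # Sigmoid scaling: asymptotically bounded between 0.1x and 2.0x base_lr
    scale = 0.1 + 1.9 / (1 + math.exp(-5 * delta_value))
    return base_lr * scale

lr_zero = scale_update(0.0, 1e-3)   # 0.00105: near base_lr
lr_pos  = scale_update(1.0, 1e-3)   # ≈ 0.00199: approaching the 2x ceiling
lr_neg  = scale_update(-1.0, 1e-3)  # ≈ 0.00011: near the 0.1x floor
```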

Why This Prevents Reward Hacking

Without axis-aware routing, the system can game the reward:

  • Boost HRM alignment by making answers verbose
  • Reduce hallucination energy by being overly conservative
  • Improve coherence by repeating the same phrase

With axis-aware routing:

  • Each axis is optimized independently
  • Improvements must hold across all critical axes
  • No single-axis optimization can compensate for others

The system cannot trade one quality for another. It must improve holistically.


8. Contrastive Embed-RL

Embed-RL optimizes the embedding geometry via contrastive learning.

Pull Toward Goal, Push Away From Failure

The core update is InfoNCE loss:

def compute_contrastive_loss(
    self,
    anchor_emb: torch.Tensor,      # Current improved output (y₂)
    positive_emb: torch.Tensor,     # Goal embedding
    hard_negatives: List[HardNegative],
    immediate_negatives: List[torch.Tensor]  # y₁ from current episode
) -> torch.Tensor:
    """
    InfoNCE loss with hard negatives from memory.
    """
    # Combine all negatives
    all_negatives = immediate_negatives + [
        hn.embedding.to(anchor_emb.device) for hn in hard_negatives
    ]
    
    # Numerator: similarity to positive
    sim_pos = torch.exp(
        F.cosine_similarity(anchor_emb, positive_emb, dim=-1) / self.temperature
    )
    
    # Denominator: sum of similarities to all negatives
    sim_negs = torch.stack([
        torch.exp(F.cosine_similarity(anchor_emb, neg, dim=-1) / self.temperature)
        for neg in all_negatives
    ], dim=-1)
    
    sim_neg_sum = sim_negs.sum(dim=-1)
    
    # Loss
    loss = -torch.log(sim_pos / (sim_pos + sim_neg_sum + 1e-8))
    
    return loss.mean()

This pulls improved outputs toward the goal embedding and pushes failures away.
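The same InfoNCE form can be checked without tensors. This is a toy with 2-D vectors (a minimal sketch, not the trainer above): the anchor that sits near the goal earns a near-zero loss, while the failed output as anchor is heavily penalized.

```python
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """-log( exp(sim+/T) / (exp(sim+/T) + sum_i exp(sim_i-/T)) )"""
    sim_pos = math.exp(cos(anchor, positive) / temperature)
    sim_neg = sum(math.exp(cos(anchor, n) / temperature) for n in negatives)
    return -math.log(sim_pos / (sim_pos + sim_neg))

goal = [1.0, 0.0]
y2 = [0.9, 0.1]    # improved output: near the goal direction
y1 = [-0.2, 1.0]   # failed output: nearly orthogonal to the goal
loss_good = info_nce(y2, goal, [y1])   # small: geometry already separates
loss_bad  = info_nce(y1, goal, [y2])   # large: failure anchored near goal
```

Gradient descent on this loss is what moves y₂ toward the goal and y₁ away from it.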

Separation Margin

The margin between positive and negative similarities grows over time:

Initial state:
  y₁ (bad) ────●───────●───────●─── Goal
               │       │       │
               │       │       │
  y₂ (good) ───●───────●───────●─── Goal

After contrastive update:
  y₁ (bad) ────●───────────────●─── Goal
               │               │
               │               │
  y₂ (good) ────────────●───────●─── Goal

The good output moves closer to the goal. The bad output moves farther away. The margin increases.

Energy Constraint

Hallucination energy constrains the geometry:

def energy_constraint_loss(
    self,
    output_emb: torch.Tensor,
    energy: float,
    max_allowed_energy: float = 0.4
) -> torch.Tensor:
    """
    Penalize embeddings with high hallucination energy.
    """
    if energy <= max_allowed_energy:
        return torch.tensor(0.0)
    
    # Energy penalty scales with violation magnitude
    penalty = (energy - max_allowed_energy) ** 2
    return penalty * self.energy_weight

High-energy outputs are pushed away from the manifold. The embedding space becomes hallucination-safe by construction.
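The penalty schedule is easy to verify in isolation (a standalone restatement of the quadratic hinge above, with the weight folded into a parameter):

```python
def energy_penalty(energy, max_allowed=0.4, weight=1.0):
    # Quadratic penalty applied only beyond the allowed energy budget
    excess = max(0.0, energy - max_allowed)
    return weight * excess ** 2

p_safe = energy_penalty(0.30)   # 0.0: within budget, no gradient pressure
p_viol = energy_penalty(0.70)   # 0.09: (0.7 - 0.4)^2 scaled by weight
```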

Why This Improves Reasoning Quality

Better embeddings enable better reasoning:

  • Retrieval improves: Similar queries find better evidence
  • Similarity judgments improve: The system distinguishes subtle differences
  • Goal alignment improves: Outputs stay closer to intent
  • Failure patterns emerge: The system recognizes its own mistakes

The embedding space becomes a geometric memory of what good reasoning looks like.


9. Geometry Stability Governor

The stability governor prevents catastrophic representation drift.

Drift Monitoring

Angular drift is bounded per update:

class GeometryStabilityGovernor:
    def check_drift(self, old_emb: torch.Tensor, new_emb: torch.Tensor) -> bool:
        """
        Measure angle shift between old and new embedding.
        Reject updates that rotate too far.
        """
        cos_sim = F.cosine_similarity(old_emb.detach(), new_emb.detach(), dim=-1)
        angle_drift = 1 - cos_sim
        
        return angle_drift < self.max_angle_drift  # e.g., 0.15 (cosine distance, not radians)

Large rotations are rejected or scaled down. The manifold evolves gradually.
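The drift check is a one-liner once cosine similarity is available; here is a dependency-free version with 2-D vectors (the threshold value is the same illustrative 0.15 used above):

```python
import math

def angle_drift(old, new):
    # Cosine distance (1 - cosine similarity): a cheap proxy for rotation
    dot = sum(a * b for a, b in zip(old, new))
    norm = math.sqrt(sum(a * a for a in old)) * math.sqrt(sum(b * b for b in new))
    return 1 - dot / norm

MAX_DRIFT = 0.15

small_step = angle_drift([1.0, 0.0], [0.98, 0.20])  # ≈ 0.02: accepted
big_step   = angle_drift([1.0, 0.0], [0.30, 0.95])  # ≈ 0.70: rejected
ok_small = small_step < MAX_DRIFT
ok_big = big_step < MAX_DRIFT
```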

Norm Constraints

Embedding norms remain bounded:

def apply_norm_constraint(self, emb: torch.Tensor) -> torch.Tensor:
    """
    Normalize embeddings to unit sphere.
    Prevents norm inflation over time.
    """
    return F.normalize(emb, dim=-1, p=2)

Norm inflation causes numerical instability and distorts similarity metrics. Unit normalization preserves geometry.

Global Variance Preservation

Overall embedding space variance is tracked:

def global_variance(self) -> float:
    """
    Track moving window of embedding covariance trace.
    Prevents variance collapse.
    """
    if len(self.recent_embeddings) < 10:
        return 1.0
    
    stack = torch.stack(list(self.recent_embeddings))
    cov = torch.cov(stack.T)
    trace = torch.trace(cov)
    
    return trace.item()

def allow_update(self, old_emb: torch.Tensor, new_emb: torch.Tensor) -> bool:
    """
    Reject updates that would collapse global variance.
    """
    if not self.check_drift(old_emb, new_emb):
        return False
    
    self.update_global_stats(new_emb)
    
    if self.global_variance() < self.min_global_variance:
        return False
    
    return True

When variance drops too low, updates are frozen. The system waits for diversity to recover.
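Variance collapse is visible even in a toy window. This sketch sums per-dimension sample variances as a stand-in for the covariance trace tracked above (the 0.01 floor is an invented illustrative threshold):

```python
def variance_trace(embs):
    # Sum of per-dimension sample variances over a recent window:
    # a toy proxy for the trace of the embedding covariance matrix
    total = 0.0
    for dim in zip(*embs):
        mean = sum(dim) / len(dim)
        total += sum((x - mean) ** 2 for x in dim) / (len(dim) - 1)
    return total

diverse   = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
collapsed = [[0.9, 0.1], [0.91, 0.1], [0.9, 0.11], [0.9, 0.1]]

MIN_GLOBAL_VARIANCE = 0.01
frozen = variance_trace(collapsed) < MIN_GLOBAL_VARIANCE  # governor freezes
active = variance_trace(diverse) >= MIN_GLOBAL_VARIANCE   # updates allowed
```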

Why Embedding Collapse Is Dangerous

Without stability constraints:

  • Similarity metrics break: Everything becomes equally similar
  • Retrieval fails: Queries find irrelevant results
  • Certum calibration breaks: Energy thresholds no longer apply
  • HRM correlations drift: Alignment scores become meaningless
  • VPM topology corrupts: Compression loses semantic structure

The stability governor is not optional. It is the safety mechanism that makes Embed-RL viable.


10. Replay and Long-Term Retention

Past failures become future constraints via hard negative mining.

Hard Negative Miner

The miner retrieves relevant failures from MemCube:

class HardNegativeMiner:
    def query(
        self,
        input_embedding: torch.Tensor,
        score_bundle: ScoreBundle,
        top_k: int = 8
    ) -> List[HardNegative]:
        """
        Retrieve hard negatives relevant to current attempt.
        """
        # Query MemCube for similar input signatures
        query_vector = input_embedding.cpu().numpy()
        results = self.memcube.vector_search({
            "type": "improvement_pattern",
            "vector_field": "input_signature",
            "vector": query_vector.tolist(),
            "top_k": top_k * 3  # Oversample
        })
        
        # Filter and rank by similarity + failure severity
        candidates = []
        for result in results:
            similarity = self._cosine_similarity(
                input_embedding,
                torch.tensor(result["output_embedding_snapshot"])
            )
            
            if similarity < self.config.similarity_threshold:
                continue
            
            candidate = HardNegative(
                trace_id=result["trace_id"],
                embedding=torch.tensor(result["output_embedding_snapshot"]),
                failure_signatures=result["failure_signatures"],
                similarity_score=similarity.item(),
                # ... other fields
            )
            candidates.append(candidate)
        
        # Select top-k by strategy (TOP_K, DIVERSE, etc.)
        return self._select_negatives(candidates, top_k)

Episodic Meta-Learning

The embedding space learns from all historical mistakes:

Attempt 1:
  Query: "Explain quantum entanglement"
  Failure: energy_spike (speculative leap)
  Store: failure pattern in MemCube

Attempt 2 (same query):
  Mine negatives → retrieve Attempt 1 failure
  Update: push away from speculative leap pattern
  Result: more grounded explanation

Attempt 3 (similar query):
  "Explain quantum superposition"
  Mine negatives → retrieve Attempt 1 failure (similar signature)
  Update: avoid same mistake on related topic

The system compounds learning across episodes. Past failures become geometric constraints.

Why Replay Prevents Forgetting

Without replay:

  • Each episode is independent
  • The system re-learns the same corrections
  • Embedding updates are local and transient
  • Global patterns are never encoded

With replay:

  • Failures persist in memory
  • Similar queries retrieve similar corrections
  • The embedding space encodes system wisdom
  • Improvements compound over time

Replay transforms episodic memory into geometric intelligence.


11. Adaptive Axis Weight Meta-Learning

Static axis weights are naive. The system learns what matters.

Axis Utility Measurement

Each axis tracks its improvement persistence:

class AxisUtilityTracker:
    def __init__(self):
        self.improvements: Dict[RewardAxis, Deque[float]] = {
            axis: deque(maxlen=100) for axis in RewardAxis
        }
        self.failures: Dict[RewardAxis, Deque[float]] = {
            axis: deque(maxlen=100) for axis in RewardAxis
        }
    
    def record_improvement(self, axis: RewardAxis, delta: float):
        self.improvements[axis].append(delta)
    
    def record_failure(self, axis: RewardAxis, value: float):
        self.failures[axis].append(value)
    
    def compute_utility(self, axis: RewardAxis) -> float:
        """
        Utility = average improvement / (1 + failure rate),
        where failure rate = recorded failures per recorded improvement.
        """
        if not self.improvements[axis]:
            return 0.0
        
        avg_improvement = np.mean(self.improvements[axis])
        failure_rate = len(self.failures[axis]) / max(1, len(self.improvements[axis]))
        
        return avg_improvement / (1 + failure_rate)

Axes that consistently drive improvement get higher weights. Axes that rarely improve get lower weights.
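A worked example makes the utility formula concrete. The toy `RewardAxis` enum and the numbers below are illustrative; the real enum lives elsewhere in the codebase:

```python
# Worked example of utility = average improvement / (1 + failure rate),
# using a toy RewardAxis enum (illustrative only).
from enum import Enum
from collections import deque
import numpy as np

class RewardAxis(Enum):
    HRM = "hrm"
    HALLUCINATION = "hallucination"

improvements = {RewardAxis.HRM: deque([0.2, 0.3, 0.25]),
                RewardAxis.HALLUCINATION: deque([0.05])}
failures = {RewardAxis.HRM: deque(),
            RewardAxis.HALLUCINATION: deque([0.4, 0.5])}

def compute_utility(axis):
    if not improvements[axis]:
        return 0.0
    avg = float(np.mean(improvements[axis]))
    failure_rate = len(failures[axis]) / max(1, len(improvements[axis]))
    return avg / (1 + failure_rate)

# HRM improves consistently with no failures: 0.25 / (1 + 0) = 0.25
# HALLUCINATION improves rarely and fails often: 0.05 / (1 + 2) ≈ 0.017
```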

Cross-Axis Interference Tracking

The system detects when axes conflict:

def detect_interference(
    self,
    delta: RewardVector,
    critical_axes: List[RewardAxis]
) -> bool:
    """
    Detect when improving one axis degrades another.
    """
    improvements = []
    degradations = []
    
    for axis in critical_axes:
        val = delta.values.get(axis, 0)
        if val > 0:
            improvements.append(axis)
        elif val < 0:
            degradations.append(axis)
    
    # If we improved some axes but degraded others, interference detected
    return len(improvements) > 0 and len(degradations) > 0

When interference is detected, weights are adjusted to prioritize the degraded axis.
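One way that adjustment might work, sketched under the assumption of a simple multiplicative boost followed by renormalization (the boost factor and axis names are illustrative, not system constants):

```python
# Sketch: when interference is detected, boost the degraded axis's weight,
# then renormalize so weights still sum to 1.0.
def rebalance_on_interference(weights, degraded_axes, boost=1.5):
    adjusted = dict(weights)
    for axis in degraded_axes:
        adjusted[axis] = adjusted[axis] * boost
    total = sum(adjusted.values())
    return {axis: w / total for axis, w in adjusted.items()}

weights = {"hrm": 0.5, "hallucination": 0.3, "margin": 0.2}
new_weights = rebalance_on_interference(weights, ["hallucination"])
# hallucination rises from 0.30 to 0.45 / 1.15 ≈ 0.391
```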

Utility Normalization

Utilities are normalized to sum to 1.0:

def normalize_utilities(self, utilities: Dict[RewardAxis, float]) -> Dict[RewardAxis, float]:
    """
    Normalize utilities to sum to 1.0.
    Preserve relative importance.
    """
    total = sum(utilities.values())
    if total == 0:
        return {axis: 1.0 / len(utilities) for axis in utilities}
    
    return {axis: util / total for axis, util in utilities.items()}

This ensures stable weighting across episodes.

Why Static Weights Are Naive

Static weights assume:

  • All axes are equally important
  • Importance doesn’t change over time
  • No cross-axis interference exists
  • The system’s goals are fixed

In reality:

  • Some axes matter more for specific queries
  • Importance shifts as the system learns
  • Axes interfere and compete
  • Goals evolve with usage patterns

Adaptive weighting lets the system discover what matters—empirically, not theoretically.


12. What We Achieved

We built a cooperative self-improvement architecture that:

Preserves Multi-Objective Signals

No scalar collapse. No reward blending. Each axis maintains its own signal. Improvement is measured per-axis, not as a weighted sum.

Enforces Geometry Stability

Embedding updates preserve global structure. Norm stability, angular drift limits, and variance preservation prevent catastrophic collapse.

Compounds Learning Across Episodes

Hard negative mining retrieves past failures. The embedding space encodes system wisdom. Improvements compound over time.

Separates Behavior and Representation

Behavioral correction (policy updates) and representational correction (embedding updates) occur in separate phases. They cooperate but do not interfere.

Maintains Trace Provenance

Every evaluation, reflection, and update is trace-bound. Forensic replay, cross-episode pattern recognition, and causal attribution are built-in.

Adapts to What Matters

Axis weights evolve based on empirical utility. The system learns which axes drive persistent improvement and prioritizes them.

Prevents Reward Hacking

Strict dominance checks require improvement on all critical axes simultaneously. No single-axis optimization can compensate for others.
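One common formulation of such a check is Pareto-style strict dominance: no critical axis may degrade, and at least one must improve. The sketch below uses illustrative axis names and a small tolerance, and is an assumption about the check's shape rather than the system's exact code:

```python
# Sketch of a strict dominance check over per-axis deltas.
def dominates(delta, critical_axes, tol=1e-6):
    """True iff no critical axis degrades and at least one strictly improves."""
    no_worse = all(delta.get(a, 0.0) >= -tol for a in critical_axes)
    strictly_better = any(delta.get(a, 0.0) > tol for a in critical_axes)
    return no_worse and strictly_better

axes = ["hrm", "hallucination", "margin"]

# Accepted: one axis improves, none degrade.
ok = dominates({"hrm": 0.1, "hallucination": 0.0, "margin": 0.02}, axes)

# Rejected: a large gain on one axis cannot offset a regression on another.
hacked = dominates({"hrm": 0.9, "hallucination": -0.1, "margin": 0.0}, axes)
```

Because the regression on any single axis vetoes the attempt, there is no blended score for the optimizer to game.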

Scales via Provider Pattern

New axes can be added without modifying the evaluator. Signal providers are independent, testable, and hot-swappable.

This is not theoretical. Every component has been designed, prototyped, or implemented. The architecture is working.


13. What Remains

Several enhancements would strengthen the system:

Retention Tracking

Currently, improvements are stored but not tracked over time. We need metrics for:

  • How long do improvements persist?
  • Do corrections generalize to similar queries?
  • When do improvements decay or become obsolete?

Retention tracking would enable adaptive forgetting and curriculum learning.
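A minimal retention tracker might look like this. All names here are hypothetical; it simply re-checks a stored correction later and reports how often the improvement persisted:

```python
# Hypothetical sketch of retention tracking for stored corrections.
class RetentionTracker:
    def __init__(self):
        self.records = []

    def record(self, correction_id, applied_at):
        # applied_at is kept so decay can later be analyzed over time.
        self.records.append({"id": correction_id, "applied_at": applied_at,
                             "rechecks": []})

    def recheck(self, correction_id, still_improved):
        for r in self.records:
            if r["id"] == correction_id:
                r["rechecks"].append(bool(still_improved))

    def retention_rate(self, correction_id):
        for r in self.records:
            if r["id"] == correction_id and r["rechecks"]:
                return sum(r["rechecks"]) / len(r["rechecks"])
        return None

tracker = RetentionTracker()
tracker.record("fix-42", applied_at=0.0)
tracker.recheck("fix-42", True)
tracker.recheck("fix-42", True)
tracker.recheck("fix-42", False)
rate = tracker.retention_rate("fix-42")  # 2 of 3 rechecks held
```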

Skill Abstraction

The system learns specific corrections but doesn’t abstract them into reusable skills. We need:

  • Pattern recognition across corrections
  • Skill extraction from successful reflections
  • Skill reuse for similar problems

This would transform episodic learning into strategic competence.

Automatic Threshold Calibration

Reflection thresholds (e.g., energy > 0.4) are currently manual. We need:

  • Adaptive threshold setting based on distribution statistics
  • Per-axis threshold calibration
  • Context-aware threshold adjustment

This would make the system more autonomous.
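A percentile-based calibration is one plausible approach. The sketch below flags roughly the worst 10% of recent attempts; the percentile choice is an assumption, not a system default:

```python
# Sketch of distribution-based threshold calibration: instead of a manual
# cutoff (energy > 0.4), set the threshold at a percentile of recent scores.
import numpy as np

def calibrate_threshold(recent_scores, percentile=90):
    """Return the score above which attempts are flagged for reflection."""
    return float(np.percentile(recent_scores, percentile))

scores = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.5, 0.6, 0.8]
threshold = calibrate_threshold(scores)
flagged = [s for s in scores if s > threshold]
```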

Long-Term Weight Decay

Axis utilities currently accumulate indefinitely. We need:

  • Exponential decay of old utilities
  • Forgetting mechanisms for obsolete axes
  • Seasonal adjustment for shifting priorities

This would prevent utility stagnation.
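Exponential decay might look like the sketch below: each episode, every utility is multiplied by a decay rate, so an axis must keep earning its weight or its utility fades. The rate and axis names are illustrative:

```python
# Sketch of exponential utility decay, applied once per episode.
def decay_utilities(utilities, rate=0.95):
    """Multiply every axis utility by the decay rate."""
    return {axis: u * rate for axis, u in utilities.items()}

utilities = {"hrm": 0.6, "hallucination": 0.4}
for _ in range(10):  # ten episodes with no new improvements
    utilities = decay_utilities(utilities)
# 0.6 * 0.95**10 ≈ 0.359: stale utilities shrink toward zero.
```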

Meta-Optimizer

The adaptive weighting logic could itself be optimized via meta-learning:

  • Learn optimal utility computation functions
  • Discover interference patterns automatically
  • Optimize weight update schedules

This would close the loop on self-improvement of the self-improvement mechanism.

These are research directions, not architectural gaps. The core system is complete and functional.


14. Conclusion

We integrated two reinforcement learning paradigms—behavioral reflection (ERL) and representation learning (Embed-RL)—into a unified, trace-native architecture.

The key insight is this: behavior and representation must co-evolve. Optimizing one in isolation leads to instability, drift, or collapse. They require separate update paths but shared improvement signals.

Our architecture achieves this through:

  • Multi-objective reward that preserves axis semantics
  • Provider-based evaluation that scales cleanly
  • Structured reflection that modifies traces, not tokens
  • Axis-aware distillation that prevents interference
  • Contrastive embedding updates with hard negative mining
  • Geometry stability constraints that prevent collapse
  • Adaptive weighting that learns what matters

This is a step toward stable, self-improving AI systems. Not through theoretical breakthroughs, but through careful engineering: separation of concerns, trace provenance, multi-objective optimization, and stability constraints.

The system improves itself—not by becoming more powerful, but by becoming more precise, more stable, and more aligned with its own quality metrics.

That is the foundation of trustworthy self-improvement.