A Memory Gate for AI: Policy-Bounded Acceptance in the Executable Cognitive Kernel


Summary

Dynamic AI systems face a hidden failure mode: they can learn from their own mistakes. If every output is allowed into memory, stochastic errors do not stay local; they accumulate.

In earlier posts, I argued that AI systems should not be trusted to enforce their own correctness.

Modern models are stochastic. They produce correct outputs, partially correct outputs, and completely incorrect outputs, but they do not reliably distinguish between them. That means a system that stores everything it generates will eventually learn from its own mistakes.

This post makes that problem concrete inside the Executable Cognitive Kernel (ECK).

It introduces the first operational memory gate in ECK: a policy-bounded acceptance layer that determines what the system is allowed to store. Instead of treating every output as equally eligible for memory, the kernel now verifies outputs, attempts repair when possible, and rejects failures before commit.

The loop changes from:

generate → score → store

to:

generate → verify → repair or reject → commit

On a 100-item run, this produced only a small improvement in raw average F1, from 0.77 to 0.78. But that is not the main result.

The main result is that:

  • bad trace admission fell from 0.17 to 0.00
  • clean memory rate rose from 0.83 to 1.00
  • 44 of 100 outputs were rejected instead of silently stored

That is the point of the memory gate.

The model generates possibilities. Policy determines what becomes memory.


1. From Policy Theory to Runtime Control

Across several earlier posts, I kept returning to one idea:

policy

That was not accidental.

The argument was that stochastic generators should not be responsible for enforcing their own guarantees. In any high-trust system, the thing that produces candidates and the thing that decides what is acceptable should be separated.

That is already how traditional systems work:

  • file systems do not enforce their own security; the operating system does
  • processes do not enforce their own permissions; the kernel does
  • applications do not enforce their own isolation; the runtime does

The pattern is consistent:

critical constraints are enforced externally, not internally

The same principle applies to AI.

If the model is both generator and controller, then hallucinations, invalid structure, and incorrect results all become system problems. Prompting can help. Fine-tuning can help. But neither replaces a hard acceptance boundary outside the model.

That is what policy means here:

an external, deterministic layer that governs what is allowed to pass

Until now, that idea was mostly conceptual. This post turns it into a runnable mechanism inside ECK.


2. What Changes Inside ECK

The original ECK loop was built around execution:

generate → score → store

That design is enough to support action selection and iterative behavior. It is not enough to protect memory.

If every output is stored, then the system can improve its action policy while still learning from noisy, invalid, or misleading traces. Over time, that contaminates the very memory the system depends on.

So ECK needs two distinct policy layers:

Policy              Role
Action Policy       decides what the system does
Acceptance Policy   decides what the system is allowed to store
    flowchart LR
    Policy["🔒 POLICY<br/>(Normative Layer)<br/>Defines constraints & invariants<br/>'Source of Truth'"] --> Verify
    Verify["✅ VERIFICATION<br/>(Enforcement Layer)<br/>Applies policy constraints<br/>'Kernel Gate'"] --> Execute
    Execute["⚙️ EXECUTION<br/>(Proposal Layer)<br/>Generates candidate outputs<br/>'Stochastic Engine'"] --> Memory
    Memory["📦 MEMORY<br/>(Compliant Store)<br/>Stores ONLY verified outputs<br/>'Trusted State'"]

    classDef policy fill:#fff9c4,stroke:#fbc02d,stroke-width:3px,color:#000
    classDef verify fill:#ffcc80,stroke:#e65100,stroke-width:3px,color:#000
    classDef execute fill:#bbdefb,stroke:#0d47a1,stroke-width:3px,color:#000
    classDef memory fill:#a5d6a7,stroke:#1b5e20,stroke-width:3px,color:#000

    class Policy policy
    class Verify verify
    class Execute execute
    class Memory memory
  

The first controls behavior.

The second controls what is allowed to become memory.

This post focuses on the second.

The updated loop is:

generate → verify → repair or reject → commit

That is the architectural shift.

If the first ECK post was about how the system learns to act, this one is about how the system learns to distrust its own bad outputs.
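
The shape of the gated loop can be sketched in a few lines of Python. This is an illustrative toy, not the ECK implementation: the Verdict type, the stand-in components, and the uppercase "policy" are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Verdict:
    passed: bool
    violations: List[str] = field(default_factory=list)

def gated_step(task, generate, verify, repair, memory, max_attempts=3):
    """One generate -> verify -> repair-or-reject -> commit cycle."""
    output = generate(task)
    for attempt in range(max_attempts):
        verdict = verify(output)
        if verdict.passed:
            memory.append(output)  # commit: only verified outputs persist
            return output
        if attempt < max_attempts - 1:
            output = repair(task, output, verdict.violations)  # bounded repair
    return None  # reject: the failure never enters memory

# Toy policy: an output passes only if it is uppercase.
generate = lambda task: task
verify = lambda out: Verdict(out.isupper(), [] if out.isupper() else ["not uppercase"])
repair = lambda task, out, violations: out.upper()

memory = []
gated_step("hello", generate, verify, repair, memory)  # repaired, then committed
gated_step("123", generate, verify, repair, memory)    # unrepairable, rejected
```

The key property is that the reject branch leaves memory untouched: a failed output costs retries, never state.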


3. Why Dynamic Systems Need a Memory Gate

The need for a memory gate becomes much clearer when we look at how ECK actually behaves over time.

ECK is not a static model.

It is a dynamic system.

  • it generates outputs
  • it evaluates them
  • it stores them
  • and it uses stored results to inform future behavior

This means the system is continuously modifying the data it depends on.

The problem: self-contamination

In a static model, errors are isolated.

In a dynamic system, errors can propagate.

If incorrect outputs are stored:

  • they become part of memory
  • they influence future reasoning
  • they get reused in later steps

Over time, this creates a feedback loop:

bad output → stored → reused → reinforced

This is how a system poisons itself.

Not because the model is broken.

But because the system has no boundary around what it is allowed to remember.

Why stochastic generation makes this worse

The underlying model is stochastic:

  • it produces variable outputs
  • it does not enforce strict correctness
  • it cannot guarantee consistency

That means errors are not rare edge cases.

They are a normal part of operation.

Without a control layer, those errors accumulate.
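
That accumulation is easy to simulate. The sketch below is illustrative only: the 20% error rate is an arbitrary assumption, and the gate is treated as a perfect classifier, which no real verifier is.

```python
import random

random.seed(0)
ERROR_RATE = 0.2  # assumed probability that a generated output is bad

ungated, gated = [], []
for _ in range(1000):
    is_bad = random.random() < ERROR_RATE
    ungated.append(is_bad)  # append-only log: every output persists
    if not is_bad:          # memory gate: only verified outputs persist
        gated.append(is_bad)

print(f"bad traces in ungated memory: {sum(ungated)}")  # nonzero: errors accumulate
print(f"bad traces in gated memory:   {sum(gated)}")    # zero by construction
```
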

The role of the memory gate

The memory gate breaks this loop.

Instead of allowing all outputs into memory, the system now enforces:

only policy-compliant outputs are allowed to persist

This changes the system from:

  • a self-accumulating process

into:

  • a policy-regulated process

What the gate actually protects

The memory gate does not make the model correct.

It protects something more important:

  • the integrity of memory
  • the quality of future learning
  • the stability of the system over time

The deeper point

In a static system, policy is useful.

In a dynamic, self-modifying system, policy becomes critical.

The more a system learns from itself, the more it needs a boundary around what it is allowed to learn.


4. The Memory Gate

This post introduces the first operational memory-gating prototype inside ECK.

The core idea is simple:

  • the model still generates outputs
  • verification checks those outputs against policy
  • repair is attempted when failure looks recoverable
  • commit happens only if verification passes

That turns memory from an append-only log into a governed state boundary.

The behavior of the system is fully determined by a single decision point: verification.

    flowchart LR
    Model["⚙️ MODEL OUTPUT<br/>(Proposed Result)"] --> Verify

    Verify{"✅ VERIFICATION<br/>Policy Check"}

    Verify -->|Pass| Commit["📦 COMMIT<br/>Store in Memory"]
    Verify -->|Fail + Repairable| Repair["🔧 REPAIR<br/>Generate Fix"]
    Verify -->|Fail + Unfixable| Reject["❌ REJECT<br/>Do Not Store"]

    Repair --> Model

    classDef model fill:#bbdefb,stroke:#0d47a1,stroke-width:2px,color:#000
    classDef verify fill:#ffcc80,stroke:#e65100,stroke-width:3px,color:#000
    classDef commit fill:#a5d6a7,stroke:#1b5e20,stroke-width:3px,color:#000
    classDef repair fill:#ffe082,stroke:#ff6f00,stroke-width:2px,color:#000
    classDef reject fill:#ef9a9a,stroke:#b71c1c,stroke-width:3px,color:#000

    class Model model
    class Verify verify
    class Commit commit
    class Repair repair
    class Reject reject
  

This is the operational form of policy-bounded acceptance: every output must pass through this gate before it becomes memory.

The crucial mechanism is a single decision:

should_commit = (not self.use_verification) or v.passed

if should_commit:
    self.memory.record(score)

This is the memory gate.

In standard mode, everything is committed.

In verified mode, only policy-compliant outputs are committed.
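
Read in isolation, that expression is a four-row truth table:

```python
def should_commit(use_verification: bool, passed: bool) -> bool:
    # Standard mode (use_verification=False): everything is committed.
    # Verified mode (use_verification=True): commit only on a passing check.
    return (not use_verification) or passed

assert should_commit(False, False) is True   # standard: even failures are stored
assert should_commit(False, True) is True
assert should_commit(True, False) is False   # verified: failures are rejected
assert should_commit(True, True) is True
```
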

That sounds like a small change, but it has large consequences.

The model can still be wrong. The system no longer has to remember that it was.


5. The Verification Layer

To make this concrete, the experiment implements policy through explicit constraints.

Policy types

There are two major classes of constraint:

  • schema constraints: the output must be valid JSON with required fields
  • semantic constraints: the output must match the task well enough to satisfy policy

This creates an external acceptance boundary around the model.

Severity and enforcement

Not all failures are the same.

Some are critical:

  • invalid JSON
  • missing required fields
  • broken structure

Others are semantic:

  • hallucinated entities
  • missing entities
  • wrong entity types
  • low F1

That distinction matters because it lets the system separate:

  • outputs that must be blocked immediately
  • outputs that are worth trying to repair

So verification is not just a binary stop sign. It is also a diagnostic layer.
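
A minimal sketch of that two-tier check. It is simplified relative to the appendix code: it covers invalid JSON, missing fields, and hallucinated entities, and omits the missing-entity and F1 checks.

```python
import json

REQUIRED = ["persons", "organizations", "locations"]

def check(raw_output, truth):
    """Return (passed, is_blocking, violations) for one candidate output."""
    try:
        obj = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return False, True, ["invalid JSON"]  # critical: block immediately

    violations = []
    for key in REQUIRED:
        if key not in obj or not isinstance(obj[key], list):
            return False, True, [f"missing or invalid {key}"]  # critical
        for item in set(obj[key]) - set(truth.get(key, [])):
            violations.append(f"hallucinated {key}: {item}")   # semantic: repairable

    return (not violations), False, violations

# A hallucinated pronoun fails verification, but is not blocking: it is
# a candidate for repair rather than immediate rejection.
passed, blocking, issues = check(
    '{"persons": ["he"], "organizations": [], "locations": []}',
    {"persons": [], "organizations": [], "locations": []},
)
```
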


6. The Repair Loop

The repair step is one of the most important parts of the design.

When an output fails verification, the system does not immediately give up. Instead, it feeds the failure back into the model in a constrained way.

The repair prompt includes:

  • the previous output
  • the list of violations
  • an instruction to fix specific problems

That creates a bounded correction loop:

attempt → verify → repair → verify

This matters because verification is doing two jobs at once:

  1. filtering bad outputs
  2. diagnosing repairable ones

In other words, verification is not only a rejection mechanism. It is also a controlled self-correction mechanism.

That is what makes the kernel more than a passive validator.
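
Condensed from build_repair_prompt in the appendix, the prompt packages exactly those three pieces. The example values below are hypothetical.

```python
def build_repair_prompt(text, previous_output, violations):
    """Feed the failure back to the model in a constrained way."""
    issues = "\n".join(f"- {v}" for v in violations)
    return (
        f"You previously returned:\n\n{previous_output}\n\n"
        f"It failed for the following reasons:\n{issues}\n\n"
        "Fix these specific problems. Output ONLY valid JSON.\n\n"
        f"Text:\n{text}"
    )

prompt = build_repair_prompt(
    "reporting from London Newsroom",
    '{"persons": ["London Newsroom"], "organizations": [], "locations": []}',
    ["Hallucinated persons: london newsroom",
     "Missing organizations: london newsroom"],
)
```
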


7. The Experiment

To test whether policy-bounded acceptance actually changes system behavior, I built a minimal ECK experiment with two modes:

  • standard mode: outputs are always stored
  • verified mode: outputs must pass policy before being stored

Everything else remains the same.

What stayed fixed

I did not change:

  • the model
  • the prompts
  • the dataset slice

The only change was whether outputs were allowed into memory unconditionally or only after verification.

Task setup

The task is structured extraction.

Input is raw text.

Output is JSON with:

  • persons
  • organizations
  • locations

This is a good test case because it exposes both structural and semantic failure modes:

  • invalid formatting
  • hallucinated entities
  • missing entities
  • wrong entity typing

It also makes verification measurable.
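
For reference, F1 here is computed per entity category over normalized sets and then averaged across categories, matching compute_f1 in the appendix. A worked single-category example with made-up names:

```python
def category_f1(pred: set, truth: set) -> float:
    """Set-based F1 for one entity category (persons, organizations, or locations)."""
    if not pred and not truth:
        return 1.0  # both empty counts as a perfect match
    inter = len(pred & truth)
    precision = inter / len(pred) if pred else 0.0
    recall = inter / len(truth) if truth else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

# One of two true persons found, plus one hallucination:
# precision = 1/2, recall = 1/2, so F1 = 0.5
print(category_f1({"alice", "carol"}, {"alice", "bob"}))
```
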

Metrics

The experiment tracks three kinds of outcome.

Task quality

  • average F1

Memory quality

  • bad trace admission rate
  • clean memory rate
  • number of stored traces

System behavior

  • retries
  • repair attempts
  • rejection rate
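
The memory-quality metrics reduce to simple ratios over the committed scores. The sketch below mirrors the appendix computation; the 0.5 threshold is the same min_f1 value used in the policy there.

```python
def memory_quality(committed_scores, total_runs, threshold=0.5):
    """Compute memory-quality metrics from the scores that survived the gate."""
    n = len(committed_scores)
    bad = sum(1 for s in committed_scores if s < threshold)
    return {
        "bad_trace_admission": bad / n if n else 0.0,
        "clean_memory_rate": (n - bad) / n if n else 0.0,
        "rejection_rate": (total_runs - n) / total_runs if total_runs else 0.0,
    }

# Four commits out of six runs, all above threshold:
m = memory_quality([0.9, 0.8, 0.7, 0.6], total_runs=6)
```
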

Expected tradeoff

Before running the experiment, the expected tradeoff was clear:

  • more retries
  • fewer stored traces
  • cleaner retained memory
  • possible quality improvement among committed outputs

That is exactly the kind of tradeoff a memory gate should create.


8. Results

Both modes were run on the same fixed 100-item slice.

Summary

Metric                Standard   Verified
Average F1            0.77       0.78
Stored Traces         100        56
Rejections            0          44
Bad Trace Admission   0.17       0.00
Clean Memory Rate     0.83       1.00

What changed

Nothing about the model changed.

Nothing about the prompts changed.

Nothing about the data changed.

Only one thing changed:

the system became selective about what it accepts

That selectivity produced three effects.

1. Repair before acceptance

Outputs that failed verification were given a structured chance to improve.

2. Rejection of repeated failures

Outputs that continued to fail were not committed.

3. Memory became stricter

Verified mode stored fewer traces because it refused to preserve non-compliant outputs.


9. Why the Main Result Is Not F1

The raw F1 gain on 100 items is small: +0.01.

That matters, but it is not the real story of this experiment.

The real story is selectivity.

A standard kernel commits everything, including outputs it should not trust.

A verified kernel rejects nearly half its outputs in order to keep memory clean.

That is not a throughput failure. It is evidence that the acceptance boundary is doing work.

The most important result is this:

bad trace admission dropped to zero

That is the real systems result.

Because ECK is not only generating outputs. It is building memory.

If invalid traces are admitted, the system learns from corrupted evidence.

If invalid traces are blocked, memory becomes a more trustworthy base for future learning.

So the correct interpretation is not:

verification makes the model better

It is:

verification changes what the system is willing to preserve

That is the deeper architectural contribution.


10. Case Studies

A few concrete cases make the behavior clearer.

Case 1: Standard mode stores a hallucinated entity

Input contains a phrase like:

... he said the meeting would continue ...

The model outputs:

{
  "persons": ["he"],
  "organizations": [],
  "locations": []
}

This is a hallucinated person entity.

In standard mode, the output fails verification but is still committed.

That means the system stores a trace it already has evidence against.

Case 2: Verified mode rejects the same failure

On the same kind of input, verified mode retries and still gets the same incorrect output:

{
  "persons": ["he"],
  "organizations": [],
  "locations": []
}

After repeated failure, the trace is rejected.

The model remains imperfect.

The memory does not inherit that imperfection.

Case 3: Verified mode repairs an incorrect type assignment

Input contains:

... reporting from London Newsroom ...

Initial output:

{
  "persons": ["London Newsroom"],
  "organizations": [],
  "locations": []
}

Verification detects the type error and the missing organization.

A repair prompt is issued.

Repair output:

{
  "persons": [],
  "organizations": ["London Newsroom"],
  "locations": []
}

That passes verification and is committed.

This shows the full intended behavior:

failure → feedback → repair → validation → commit

These examples capture the core difference:

Behavior                    Standard   Verified
Accept incorrect outputs    Yes        No
Attempt structured repair   Limited    Yes
Reject repeated failures    No         Yes
Store only valid traces     No         Yes

11. What This Prototype Shows

This post is not the final theory of policy in AI systems.

It is a concrete ECK case study of policy-bounded acceptance.

It shows that the broader policy idea can be operationalized as a simple runtime mechanism:

  • define external constraints
  • verify outputs against them
  • repair what can be repaired
  • reject what cannot
  • gate memory at commit time

That is enough to materially change system behavior.

This prototype does not prove that verification always improves benchmark accuracy.

It does show something narrower and more important for ECK:

policy changes what becomes memory

And once that changes, the system itself changes.


12. Limitations

This is a deliberately small and controlled prototype.

A 100-item run is useful for showing the behavior of the memory gate, but it is not enough to support broad benchmark claims.

The exact magnitude of the F1 effect will vary with:

  • slice composition
  • task difficulty
  • repair prompt quality
  • threshold choice

The main claim here is architectural, not universal:

if a system is allowed to learn from its own outputs, then the boundary around what it is allowed to store becomes a first-class design problem

That is what this experiment demonstrates.

That integrity is not free

The gate increases retries, repair attempts, and rejected outputs, trading throughput for cleaner memory.

We see this cost directly in the demo output:

  • 44 rejections out of 100
  • 113 repair attempts
  • a 0.28 repair success rate

13. Where This Goes Next

This prototype opens the door to several more important questions:

  • richer domain-specific policy definitions
  • tool-backed verification
  • adaptive thresholds
  • policy learning
  • verification over multi-step reasoning
  • long-horizon experiments comparing gated vs ungated memory over time

That last one is especially important.

The strongest future test is not whether a memory gate improves one batch of outputs.

It is whether a system without a memory gate degrades over time while a system with one remains stable.

That is where this idea becomes much bigger than a filtering mechanism.


Conclusion

This post introduced the first operational memory-gating prototype inside ECK.

It changed one thing:

what the system is allowed to store

That change produced:

  • perfect clean-memory rate
  • zero bad trace admission
  • explicit rejection of 44 outputs that would otherwise have entered memory

That is the contribution.

ECK now has two distinct policy layers:

Policy              Role
Action Policy       decides what the system does
Acceptance Policy   decides what the system learns from

Both are necessary.

Without action policy, the system cannot explore.

Without acceptance policy, it cannot trust its own memory.

The first ECK post argued that systems can improve through execution.

This post adds the missing condition:

they must govern what becomes memory before they can improve safely

Closing line

The model generates possibilities. Policy determines what becomes memory.


📎 Appendix: Running the Demo

The code below implements the ECK v2 architecture described in this post. It is fully self-contained and runnable.

# ==============================================================================
# ECK v2: Policy-Bounded Verifiable Intelligence (SEMANTICALLY ALIGNED)
# ==============================================================================

from __future__ import annotations

import json
import random
import time

import requests
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from enum import Enum
from statistics import mean
from typing import Any, Dict, List

from datasets import load_dataset

# ------------------------------------------------------------------------------
# 1. Core
# ------------------------------------------------------------------------------

class SeverityLevel(Enum):
    CRITICAL = "critical"
    WARNING = "warning"

class ConstraintType(Enum):
    SCHEMA = "schema"
    LOGIC = "logic"

@dataclass
class ConstraintViolation:
    constraint_id: str
    message: str
    severity: SeverityLevel

@dataclass
class VerificationResult:
    passed: bool
    violations: List[ConstraintViolation] = field(default_factory=list)
    confidence: float = 1.0
    is_blocking: bool = False

    def __post_init__(self):
        if any(v.severity == SeverityLevel.CRITICAL for v in self.violations):
            self.passed = False
            self.is_blocking = True

@dataclass
class ExecutionTrace:
    task: str
    action: str
    result: Any
    context: Dict[str, Any]
    constraint_id: str = ""

class Logger:
    def stage(self, msg):
        print(f"\n📦 {msg}")

    def attempt(self, n, action):
        print(f"   🔁 Attempt {n} (action={action})")

    def llm_start(self, model):
        print(f"   🧠 LLM CALL ({model})...")

    def llm_end(self, latency, output):
        print(f"   ⏱️  {latency:.2f}s")
        print(f"   📤 {output[:120].replace(chr(10), ' ')}...")

    def verify(self, result):
        print(f"   🔍 Verification: {'PASS' if result.passed else 'FAIL'}")
        for v in result.violations:
            print(f"      - {v.message} ({v.severity.value})")

    def score(self, f1):
        print(f"   🎯 F1 Score: {f1:.2f}")

    def commit(self, yes):
        print(f"   💾 Commit: {'YES' if yes else 'NO'}")

log = Logger()

# ------------------------------------------------------------------------------
# 2. Policy
# ------------------------------------------------------------------------------

class ConstraintPolicy:
    def __init__(self):
        self.constraints = []

    def define_constraint(self, ctype, severity, rule):
        self.constraints.append({
            "type": ctype,
            "severity": severity,
            "rule": rule,
        })

# ------------------------------------------------------------------------------
# 3. Memory
# ------------------------------------------------------------------------------

class Memory:
    def __init__(self):
        self.data = []

    def record(self, score):
        self.data.append(score)

# ------------------------------------------------------------------------------
# 4. Verification
# ------------------------------------------------------------------------------

class ConstraintEvaluator(ABC):
    @abstractmethod
    def evaluate(self, trace: ExecutionTrace, rule: Dict) -> VerificationResult:
        pass

class SchemaEvaluator(ConstraintEvaluator):
    def evaluate(self, trace, rule):
        try:
            obj = json.loads(trace.result) if isinstance(trace.result, str) else trace.result
        except (json.JSONDecodeError, ValueError, TypeError):
            return VerificationResult(False, [ConstraintViolation("schema", "Invalid JSON", SeverityLevel.CRITICAL)])

        for f in ["persons", "organizations", "locations"]:
            if f not in obj or not isinstance(obj[f], list):
                return VerificationResult(False, [ConstraintViolation("schema", f"Missing or invalid {f}", SeverityLevel.CRITICAL)])

        return VerificationResult(True)

class SemanticEvaluator(ConstraintEvaluator):
    def evaluate(self, trace, rule):
        truth = trace.context["truth"]
        pred = parse_prediction(trace.result)

        if pred is None:
            return VerificationResult(
                passed=False,
                violations=[
                    ConstraintViolation(
                        "semantic",
                        "Output is not valid JSON",
                        SeverityLevel.CRITICAL,
                    )
                ],
                confidence=0.0,
                is_blocking=True,
            )

        violations = []

        for key in ["persons", "organizations", "locations"]:
            pred_set = set(x.lower() for x in pred.get(key, []))
            truth_set = set(x.lower() for x in truth.get(key, []))

            for item in pred_set - truth_set:
                violations.append(
                    ConstraintViolation(
                        "semantic",
                        f"Hallucinated {key}: {item}",
                        SeverityLevel.WARNING,
                    )
                )

            for item in truth_set - pred_set:
                violations.append(
                    ConstraintViolation(
                        "semantic",
                        f"Missing {key}: {item}",
                        SeverityLevel.WARNING,
                    )
                )

        f1 = compute_f1(pred, truth)

        if f1 < rule["min_f1"]:
            violations.append(
                ConstraintViolation(
                    "semantic",
                    f"Low F1={f1:.2f}",
                    SeverityLevel.WARNING,
                )
            )

        # IMPORTANT:
        # semantic violations should FAIL verification,
        # but not necessarily block execution
        if violations:
            return VerificationResult(
                passed=False,
                violations=violations,
                confidence=f1,
                is_blocking=False,
            )

        return VerificationResult(
            passed=True,
            violations=[],
            confidence=f1,
            is_blocking=False,
        )

class Verifier:
    def __init__(self, policy):
        self.policy = policy
        self.schema = SchemaEvaluator()
        self.semantic = SemanticEvaluator()

    def verify(self, trace):
        results = []

        for c in self.policy.constraints:
            if c["type"] == ConstraintType.SCHEMA:
                results.append(self.schema.evaluate(trace, c["rule"]))
            else:
                results.append(self.semantic.evaluate(trace, c["rule"]))

        final = VerificationResult(True)

        for r in results:
            if not r.passed:
                final.passed = False
                final.violations.extend(r.violations)
                if r.is_blocking:
                    final.is_blocking = True

        return final

# ------------------------------------------------------------------------------
# 5. LLM
# ------------------------------------------------------------------------------

def run_llm(prompt, log):
    log.llm_start("mistral")
    start = time.time()

    try:
        res = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "mistral", "prompt": prompt, "stream": False},
            timeout=120,
        )
        res.raise_for_status()
        output = res.json().get("response", "")
    except requests.exceptions.RequestException as e:
        output = f"ERROR: {e}"

    log.llm_end(time.time() - start, output)
    return output

def build_initial_prompt(text):
    return f"""
You are a named entity extraction system.

Extract ALL named entities EXACTLY as written.

Return STRICT JSON:
{{
  "persons": [],
  "organizations": [],
  "locations": []
}}

Rules:
- Preserve exact text spans
- No guessing
- No explanation
- Output ONLY JSON

Text:
{text}
""".strip()

def build_repair_prompt(text, previous_output, violations):
    issues = "\n".join(f"- {v.message}" for v in violations)

    return f"""
You previously returned:

{previous_output}

It failed for the following reasons:
{issues}

Fix the output. Do not repeat the same mistakes.

Rules:
- Keep correct entities
- Remove hallucinated entities
- Add missing entities ONLY if they appear exactly in text
- Do NOT guess
- Output ONLY valid JSON

Schema:
{{
  "persons": [],
  "organizations": [],
  "locations": []
}}

Text:
{text}
""".strip()


# ------------------------------------------------------------------------------
# 6. Scoring
# ------------------------------------------------------------------------------

def parse_prediction(result):
    if not result:
        return None

    text = result.strip()

    # Remove markdown fences
    if "```" in text:
        parts = text.split("```")
        for part in parts:
            part = part.strip()
            if part.startswith("{") and part.endswith("}"):
                text = part
                break

    # Try direct parse
    try:
        return json.loads(text)
    except (json.JSONDecodeError, ValueError):
        pass  # fall through and try substring extraction

    # Try extracting JSON substring
    start = text.find("{")
    end = text.rfind("}")
    if start != -1 and end != -1:
        try:
            return json.loads(text[start:end+1])
        except (json.JSONDecodeError, ValueError):
            return None

    return None

def compute_f1(pred, truth):
    def norm(x): return set(i.lower() for i in x)

    scores = []
    for k in truth:
        p, t = norm(pred.get(k, [])), norm(truth[k])
        if not p and not t:
            scores.append(1)
            continue
        inter = len(p & t)
        prec = inter / len(p) if p else 0
        rec = inter / len(t) if t else 0
        scores.append(0 if prec+rec==0 else 2*prec*rec/(prec+rec))
    return sum(scores)/len(scores)

# ------------------------------------------------------------------------------
# 7. Ground Truth
# ------------------------------------------------------------------------------

def build_truth(item):
    tokens, tags = item["tokens"], item["ner_tags"]
    # CoNLL-2003 ner_tags ids: 1/2 = B-PER/I-PER, 3/4 = B-ORG/I-ORG, 5/6 = B-LOC/I-LOC
    mapping = {1:"persons",2:"persons",3:"organizations",4:"organizations",5:"locations",6:"locations"}

    out = {"persons":[], "organizations":[], "locations":[]}
    cur, typ = [], None

    for tok, tag in zip(tokens, tags):
        if tag in mapping:
            t = mapping[tag]
            if tag in [1,3,5] or typ != t:
                # B- tag, or an I- tag with no open span of this type (defensive start)
                if cur:
                    out[typ].append(" ".join(cur))
                cur, typ = [tok], t
            else:
                cur.append(tok)
        else:
            if cur:
                out[typ].append(" ".join(cur))
                cur, typ = [], None

    if cur:
        out[typ].append(" ".join(cur))

    return out

# ------------------------------------------------------------------------------
# 8. Kernel
# ------------------------------------------------------------------------------

class Kernel:
    def __init__(self, verifier, memory, use_verification=True):
        self.verifier = verifier
        self.memory = memory
        self.use_verification = use_verification

        self.retry_count = 0
        self.blocking_failures = 0
        self.semantic_failures = 0
        self.repair_prompt_uses = 0
        self.pass_count = 0

        self.total_runs = 0
        self.commit_count = 0
        self.reject_count = 0
        self.failed_verifications = 0
        self.repair_success_count = 0
        self.initial_failures = 0

        self.committed_scores: List[float] = []

    def run(self, text, truth, log: Logger):
        self.total_runs += 1

        final_result = None
        v = VerificationResult(True)

        first_attempt_failed = False
        repaired_successfully = False

        for attempt in range(3):
            action_name = "extract" if attempt == 0 else "repair"
            log.attempt(attempt, action_name)

            if attempt == 0:
                prompt = build_initial_prompt(text)
                print("   📝 Using initial prompt")
            else:
                prompt = build_repair_prompt(text, final_result, v.violations)
                self.repair_prompt_uses += 1
                print("   🛠 Using repair prompt")

            result = run_llm(prompt, log)

            trace = ExecutionTrace(
                task="extract",
                action=action_name,
                result=result,
                context={"truth": truth},
            )

            v = self.verifier.verify(trace)
            log.verify(v)

            if not v.passed:
                self.failed_verifications += 1

                if attempt == 0:
                    self.initial_failures += 1
                    first_attempt_failed = True

                if v.is_blocking:
                    self.blocking_failures += 1
                else:
                    self.semantic_failures += 1

            final_result = result

            if v.passed:
                self.pass_count += 1

                if first_attempt_failed:
                    repaired_successfully = True
                    self.repair_success_count += 1

                break

            if attempt < 2:
                self.retry_count += 1

        pred = parse_prediction(final_result)
        score = compute_f1(pred, truth) if pred else 0.0

        log.score(score)

        should_commit = (not self.use_verification) or v.passed
        log.commit(should_commit)

        if should_commit:
            self.memory.record(score)
            self.committed_scores.append(score)
            self.commit_count += 1
        else:
            self.reject_count += 1

        return score
# ------------------------------------------------------------------------------
# 9. Experiment
# ------------------------------------------------------------------------------

def run_experiment(mode, eval_items):
    policy = ConstraintPolicy()

    policy.define_constraint(ConstraintType.SCHEMA, SeverityLevel.CRITICAL, {})
    policy.define_constraint(ConstraintType.LOGIC, SeverityLevel.WARNING, {"min_f1": 0.5})

    verifier = Verifier(policy)
    memory = Memory()

    kernel = Kernel(verifier, memory, use_verification=(mode == "verified"))

    scores = []

    for i, item in enumerate(eval_items):
        log.stage(f"Task {i+1}/{len(eval_items)}")

        text = " ".join(item["tokens"])
        truth = build_truth(item)

        print(f"   Input: {text[:100]}...")

        score = kernel.run(text, truth, log)
        scores.append(score)

    avg_f1 = mean(scores) if scores else 0.0

    committed_scores = kernel.committed_scores
    bad_committed = sum(1 for s in committed_scores if s < 0.5)
    clean_committed = sum(1 for s in committed_scores if s >= 0.5)
    n_committed = len(committed_scores)

    bad_trace_admission = (
        bad_committed / n_committed if n_committed else 0.0
    )

    repair_success_rate = (
        kernel.repair_success_count / kernel.initial_failures
        if kernel.initial_failures else 0.0
    )

    rejection_rate = (
        kernel.reject_count / kernel.total_runs
        if kernel.total_runs else 0.0
    )

    clean_memory_rate = (
        clean_committed / n_committed if n_committed else 0.0
    )

    print(f"\n📊 {mode.upper()} SUMMARY")
    print(f"   Avg F1: {avg_f1:.2f}")
    print(f"   Stored traces: {len(memory.data)}")

    print("\n   🧠 SYSTEM METRICS")
    print(f"   Total Runs: {kernel.total_runs}")
    print(f"   Commits: {kernel.commit_count}")
    print(f"   Rejections: {kernel.reject_count}")

    print("\n   🔍 VERIFICATION")
    print(f"   Failed Verifications: {kernel.failed_verifications}")
    print(f"   Blocking Failures: {kernel.blocking_failures}")
    print(f"   Semantic Failures: {kernel.semantic_failures}")

    print("\n   🔁 REPAIR")
    print(f"   Repair Attempts: {kernel.repair_prompt_uses}")
    print(f"   Repair Success Rate: {repair_success_rate:.2f}")

    print("\n   🧪 QUALITY")
    print(f"   Bad Trace Admission: {bad_trace_admission:.2f}")
    print(f"   Rejection Rate: {rejection_rate:.2f}")
    print(f"   Clean Memory Rate: {clean_memory_rate:.2f}")

    return {
        "avg_f1": avg_f1,
        "bad_trace_admission": bad_trace_admission,
        "repair_success_rate": repair_success_rate,
        "rejection_rate": rejection_rate,
        "clean_memory_rate": clean_memory_rate,
    }    

# ------------------------------------------------------------------------------
# 10. Main
# ------------------------------------------------------------------------------

if __name__ == "__main__":
    random.seed(42)

    dataset = load_dataset("conll2003", split="train", trust_remote_code=True)
    data = [dataset[i] for i in range(1000)]

    # Freeze the exact evaluation slice ONCE
    eval_items = random.sample(data, 100)

    print("\n🔵 STANDARD")
    res_std = run_experiment("standard", eval_items)

    print("\n🟢 VERIFIED")
    res_ver = run_experiment("verified", eval_items)

    print("\nRESULT:")
    print(f"Standard F1: {res_std['avg_f1']:.2f}")
    print(f"Verified F1: {res_ver['avg_f1']:.2f}")
    print(f"Delta: {res_ver['avg_f1'] - res_std['avg_f1']:+.2f}")

🚀 Deployment Instructions

1. Run the Demo

pip install "datasets<4"
pip install requests
python eck_verified_demo.py

Expected Output:

📊 VERIFIED SUMMARY
   Avg F1: 0.78
   Stored traces: 56

   🧠 SYSTEM METRICS
   Total Runs: 100
   Commits: 56
   Rejections: 44

   🔍 VERIFICATION
   Failed Verifications: 157
   Blocking Failures: 21
   Semantic Failures: 136

   🔁 REPAIR
   Repair Attempts: 113
   Repair Success Rate: 0.28

   🧪 QUALITY
   Bad Trace Admission: 0.00
   Rejection Rate: 0.44
   Clean Memory Rate: 1.00

RESULT:
Standard F1: 0.77
Verified F1: 0.78
Delta: +0.01