Dimensions of Thought: A Smarter Way to Evaluate AI


📖 Summary

This post introduces a multidimensional reward modeling pipeline built on top of the Co AI framework. It covers:

  • ✅ Structured Evaluation Setup: how to define custom evaluation dimensions using YAML or database-backed rubrics.

  • 🧠 Automated Scoring with LLMs: using the ScoreEvaluator to produce structured, rationale-backed scores for each dimension.

  • 🧮 Embedding-Based Hypothesis Indexing: efficiently embedding hypotheses and comparing them by similarity for contrastive learning.

  • 🔄 Contrast Pair Generation: creating training pairs where one hypothesis outperforms another on a given dimension.

  • 🏋️ Training the MR.Q Model: a multidimensional MR.Q-style trainer that learns to rank hypotheses by dimension via contrastive supervision.

  • 📦 Extensible Architecture: modular components like ScoreORM, EvaluationORM, and the trainer pipeline make it easy to extend or swap models, dimensions, or strategies.

  • 📈 Practical Use Case: all of this is applied in a real evaluation loop, generating, scoring, and learning from hypotheses within Co AI.


🔍 The Intelligence Measurement Gap

The quest for truly intelligent AI hinges on our ability to evaluate it. Yet, as language models push the boundaries of human-level performance, our evaluation methods often fall short, relying on simplistic, single-number metrics.

Most benchmarks still reduce intelligence to single-number metrics - accuracy percentages or vague quality scores. But true cognition isn’t monolithic. It’s:

  • Contextual (depends on the problem space)
  • Multi-faceted (different dimensions matter for different tasks)
  • Self-correcting (learns from evaluation)

We’ve built a new evaluation framework that moves beyond scalar scores to assess AI-generated hypotheses across configurable dimensions. This isn’t just about rating outputs - it’s about understanding how AIs think.


✨ Why Dimensions Matter

When I first began evaluating AI-generated text, I used basic LLM judges. They would return a single score, say "7/10 for fluency", but offered little transparency into why the score was assigned or how to improve the output.

The SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models paper introduced a structured, stage-based framework for generation: 🧠 Planning → ✍️ Writing → 🔍 Refining, with each phase evaluated against distinct criteria. This modular breakdown helped isolate failure modes and guided improvements at each step.

But even with SuperWriter's insightful structure, the scoring methods were initially hard-coded and static: excellent for guiding generation, but limited when adapting to new goals or learning from experience.

🪄 That's when a new idea clicked: what if we extended this foundation to not just score outputs, but reason through them?

By building on SuperWriter's staged evaluation, we introduced:

  • ✅ Configurable scoring dimensions per stage (e.g., correctness, originality, clarity)
  • ✅ Rationale-backed evaluations for transparency and introspection
  • ✅ Symbolic reflection, so the AI can analyze its own process
  • ✅ Seamless hooks into tuning frameworks like MR.Q

This evolution turned evaluation from a static judgment into a dynamic feedback loop: one that doesn't just critique the AI's thinking, but helps it improve its thinking over time.

This led to the development of our modular, multidimensional scoring framework: a system that doesn't just grade AI outputs, but helps the AI understand why a hypothesis is strong (or weak), and how to evolve better ones.

    
flowchart TD
  A[🧠 Planning Stage] --> B[✍️ Writing Stage]
  B --> C[🔍 Refining Stage]
  C --> D[🧪 Multidimensional Scoring]
  D --> E1[🧾 Coherence Score + Rationale]
  D --> E2[💡 Originality Score + Rationale]
  D --> E3[📣 Clarity Score + Rationale]
  D --> E4[📊 Relevance Score + Rationale]
  E1 --> F[🧠 Symbolic Introspection]
  E2 --> F
  E3 --> F
  E4 --> F
  F --> G[🛠️ Prompt / Rule Tuning]
  G --> A

  style A fill:#f9f,stroke:#333,stroke-width:1px
  style B fill:#bbf,stroke:#333,stroke-width:1px
  style C fill:#bfb,stroke:#333,stroke-width:1px
  style D fill:#ffd,stroke:#333,stroke-width:1px
  style F fill:#fdd,stroke:#333,stroke-width:1px
  style G fill:#ddf,stroke:#333,stroke-width:1px
  

🛠️ How the Scoring System Works

The scoring system lets you attach interpretable, rationale-backed evaluations to any part of your pipeline. Whether you're refining a hypothesis, comparing strategies, or tuning symbolic rules, scoring becomes a modular, pluggable step that brings clarity to what "better" really means.

    
flowchart LR
    A[YAML Config<br/><sub>Dimensions, Weights, Templates</sub>] --> 
    B[Prompt Loader<br/><sub>Renders prompts per dimension</sub>] --> 
    C[ScoreEvaluator<br/><sub>Evaluates using LLM or parser</sub>] --> 
    D[ScoreORM<br/><sub>Stores scores, rationales, weights</sub>] --> 
    E[Analysis & Tuning<br/><sub>Drives training & adaptation</sub>]
  

✨ Example: Review Dimensions

The ScoreEvaluator uses a lightweight, extensible engine to apply scoring prompts like this:

| Dimension | Category | Weight | What It Measures | Prompt Snippet |
|---|---|---|---|---|
| Correctness | Review | 1.2 | Factual accuracy | "Does the hypothesis contradict established knowledge?" |
| Feasibility | Review | 1.1 | Practical viability | "Could this be implemented with current technology?" |
| Insightfulness | Review | 1.3 | Novel connections | "Does this reveal non-obvious relationships?" |
| Alignment | Review | 1.0 | Goal relevance | "How directly does this address the research question?" |
| Completeness | Review | 0.8 | Coverage of all key aspects | "Are key parts of the problem addressed?" |
| Delta Clarity | Reflection | 1.0 | Self-awareness improvement | "Did this step improve the clarity over the previous one?" |
| Rule Impact | Evaluation | 1.1 | Rule effectiveness | "How much did this rule improve the hypothesis quality?" |
| Cluster Count | Proximity | 0.5 | Reasoning diversity | "How many distinct hypothesis clusters were formed?" |
| Graft Pair Count | Proximity | 0.7 | Merge opportunities | "How many high-similarity pairs suggest grafting?" |
| Judge Agreement | Judgment | 1.0 | Consensus of LLMs or heuristics | "How consistently did evaluators agree on this output?" |
| Relevance Gain | Reflection | 0.9 | Focus refinement | "Did this revision make the output more on-target?" |

This shows how you can score at any level, from raw model outputs to entire pipeline stages, and even track strategy impact, reflection gain, or reasoning cohesion over time.

🔬 The ScoreEvaluator

At the core is the ScoreEvaluator class, a lightweight, pluggable engine that evaluates hypotheses across configurable dimensions.

Each dimension is defined by:

  • A name (e.g., "correctness")
  • A prompt file (prompts/scoring/correctness.txt)
  • An optional weight
  • A parser (numeric, boolean, string)
# config/scoring/reasoning_cor.yaml
stage: reasoning
output_format: cor

dimensions:
  - name: correctness
    file: correctness_cor.txt
    weight: 1.2
    extra_data:
      parser: numeric_cor

  - name: feasibility
    file: feasibility_cor.txt
    weight: 1.1
    extra_data:
      parser: numeric_cor

  - name: insightfulness
    file: insightfulness_cor.txt
    weight: 1.3
    extra_data:
      parser: numeric_cor

  - name: alignment
    file: alignment_cor.txt
    weight: 1.0
    extra_data:
      parser: numeric_cor

  - name: completeness
    file: completeness_cor.txt
    weight: 0.8
    extra_data:
      parser: numeric_cor

import re
from pathlib import Path

import yaml
from jinja2 import Template  # assumed: prompt_template strings are Jinja2 templates
from tabulate import tabulate

class ScoreEvaluator:
    def __init__(self, dimensions, prompt_loader, cfg, logger, memory):
        self.dimensions = dimensions
        self.prompt_loader = prompt_loader
        self.cfg = cfg
        self.logger = logger
        self.memory = memory
        self.output_format = cfg.get("output_format", "simple")  # default fallback

    @classmethod
    def from_file(cls, filepath: str, prompt_loader, cfg, logger, memory):
        with open(Path(filepath), "r") as f:
            data = yaml.safe_load(f)

        # Default to 'simple' if not provided
        output_format = data.get("output_format", "simple")

        dimensions = [
            {
                "name": d["name"],
                "file": d.get("file"),
                "prompt_template": d.get("prompt_template", d.get("file")),  # fallback to file
                "weight": d.get("weight", 1.0),
                "parser": cls.get_parser(d.get("extra_data", {})),
            }
            for d in data["dimensions"]
        ]

        # Ensure the output_format is accessible in instance
        cfg = cfg.copy()
        cfg["output_format"] = output_format

        return cls(
            dimensions=dimensions,
            prompt_loader=prompt_loader,
            cfg=cfg,
            logger=logger,
            memory=memory,
        )

    @staticmethod
    def parse_numeric_cor(response: str) -> float:
        """
        Extracts the numeric score from an <answer>[[X]]</answer> block.
        Example: <answer>[[3]]</answer> → 3.0
        """
        match = re.search(r"(?:<answer>\s*)?\[\[(\d+(?:\.\d+)?)\]\](?:\s*</answer>)?", response, re.IGNORECASE)
        if not match:
            raise ValueError(f"Could not extract numeric score from CoR-style answer: {response}")
        return float(match.group(1))
...

    def _evaluate_cor(self, hypothesis: dict, context: dict = {}, llm_fn=None):
        """
        Evaluate using Chain-of-Rubrics (CoR) format with rubric, eval, and <answer>[[score]]</answer>.
        """
        if llm_fn is None:
            raise ValueError(
                "You must pass a call_llm function (e.g., agent.call_llm) to ScoreEvaluator.evaluate"
            )

        results = {}
        for dim in self.dimensions:
            # Load prompt using prompt_loader and dimension-specific CoR template
            if self.prompt_loader and dim.get("file"):
                prompt = self.prompt_loader.from_file(
                    file_name=dim["file"],
                    config=self.cfg,
                    context={"hypothesis": hypothesis, **context},
                )
            elif dim.get("prompt_template"):
                prompt = Template(dim["prompt_template"]).render(
                    hypothesis=hypothesis, **context
                )
            else:
                raise ValueError(f"No prompt found for dimension {dim['name']}")

            response = llm_fn(prompt, context=context)
            try:
                score = dim["parser"](response)
            except Exception as e:
                self.logger.log("ScoreParseError", {
                    "dimension": dim["name"],
                    "response": response,
                    "error": str(e)
                })
                score = 0.0

            self.logger.log("DimensionEvaluated", {
                "dimension": dim["name"],
                "score": score,
                "response": response
            })

            results[dim["name"]] = {
                "score": score,
                "rationale": response,
                "weight": dim["weight"],
            }

        self.save_score_to_memory(results, hypothesis, context)
        return results

  ...

    def display_results(self, results, weighted_score):
        table_data = [
            [dim_name, f"{dim_data['score']:.2f}", dim_data['weight'], dim_data['rationale'][:60]]
            for dim_name, dim_data in results.items()
        ]
        table_data.append(["FINAL", f"{weighted_score:.2f}", "-", "Weighted average"])

        print("\n📊 Dimension Scores Summary")
        print(tabulate(
            table_data,
            headers=["Dimension", "Score", "Weight", "Rationale (preview)"],
            tablefmt="fancy_grid"
        ))

📊 Dimension Scores Summary
╒═════════════╤═════════╤══════════╤══════════════════════════════════════════════════════════════╕
│ Dimension   │   Score │ Weight   │ Rationale (preview)                                          │
╞═════════════╪═════════╪══════════╪══════════════════════════════════════════════════════════════╡
│ correctness │   45    │ 1.0      │ rationale: The hypothesis explores LLM memory limitations by │
├─────────────┼─────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ originality │   70    │ 1.0      │ rationale: The hypothesis introduces a novel analogy between │
├─────────────┼─────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ clarity     │   95    │ 0.8      │ rationale: The hypothesis clearly links LLM context window r │
├─────────────┼─────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ relevance   │   45    │ 1.2      │ <think>                                                      │
│             │         │          │ Okay, let's evaluate the hypothesis in relation to t         │
├─────────────┼─────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ FINAL       │   61.25 │ -        │ Weighted average                                             │
╘═════════════╧═════════╧══════════╧══════════════════════════════════════════════════════════════╛

You can define separate configs for planning, writing, refining, review, and reflection stages.

📜 Prompt Files Per Dimension

Each scoring dimension lives in its own prompt file.

This is an example CoR (Chain-of-Rubrics) file.

📄 prompts/scoring/alignment_cor.txt

You are evaluating the **alignment** of the following hypothesis with the research goal.

Rubric:
- Does the hypothesis directly address the goal?
- Does it reflect the intended strategy or values of the goal?

🧪 Format Instructions:
- Provide a short explanation of your judgment.
- End your response with: <answer>[[SCORE]]</answer>
- SCORE must be a number between **1 and 100**, nothing else.
- Example: <answer>[[75]]</answer>

<hypothesis>
{{ hypothesis }}
</hypothesis>

The standard (non-CoR) version:


Evaluate the alignment of the following hypothesis.

### Goal
{{ goal }}

### Hypothesis
{{ hypothesis }}

How well does the hypothesis align with the goal and any stated preferences?

Return your review in the exact structured format below. Do not include headings, markdown, or additional commentary. Use only plain text fields as shown:

rationale: <brief explanation>
score: <0–100>

These are loaded dynamically using the existing PromptLoader, making the system flexible and compatible with all your agents.
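
For example, wiring a scorer up by hand looks roughly like this. This is a minimal sketch: prompt_loader, cfg, logger, memory, and call_llm are assumed to come from your agent, and the config path mirrors the YAML shown earlier (in practice the ScoringMixin described below handles this for you):

# Sketch: load a stage config and score a single hypothesis.
evaluator = ScoreEvaluator.from_file(
    filepath="config/scoring/reasoning_cor.yaml",
    prompt_loader=prompt_loader,
    cfg=cfg,
    logger=logger,
    memory=memory,
)

hypothesis = {"id": 1, "text": "LLMs could modify their own prompts to improve over time."}
results = evaluator.evaluate(
    hypothesis=hypothesis,
    context={"goal": {"goal_text": "Will AI ever be able to reprogram itself?"}},
    llm_fn=call_llm,
)
# results: {"correctness": {"score": 80.0, "rationale": "...", "weight": 1.2}, ...}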

🧩 Structured Output with Rationale

Scores now include both numeric values and rationales:

🎯 Start with a goal

{
    "goal": {
        "goal_text": "Will AI ever be able to reprogram itself?",
        "goal_type": "research",
        "focus_area": "meta_learning"
    }
}

🎯 We generate multiple hypotheses using agents:

  • Generator Agent
  • COT
  • ARM

Here's an example of a simplified pipeline:

goal:
  goal_text: "If I was to develop a self improving process what would be the steps needed?"
  goal_type: "research"
  focus_area: "ai_research"
  strategy: "reasoning"
  difficulty: "medium"
  expected_formats:
    - "short_cot"
    - "code"

pipeline:
  name: default_pipeline
  description: "Default hypothesis generation and refinement pipeline"
  stages:
      - name: generation
        cls: co_ai.agents.generation.GenerationAgent
        enabled: true
        iterations: 1
      - name: proximity
        cls: co_ai.agents.proximity.ProximityAgent
        enabled: true
        iterations: 1
      - name: reflection
        cls: co_ai.agents.reflection.ReflectionAgent
        enabled: true
        iterations: 1
      - name: review
        cls: co_ai.agents.review.ReviewAgent
        enabled: true
        iterations: 1
  ....
 
      - name: unified_mrq
        cls: co_ai.agents.unified_mrq.UnifiedMRQAgent
        enabled: true
        iterations: 1

This will generate dimensional scores, which feed the model that tunes the system:

{
    "metrics": "review",
         "dimensions": {
            "correctness": {
                "score": 92,
                "rationale": "The reasoning aligns closely with the stated goal and contains no logical inconsistencies.",
                "weight": 1.2
            },
            "originality": {
                "score": 75,
                "rationale": "The idea introduces some creative elements but draws heavily on common patterns.",
                "weight": 0.9
            },
  ...
}

This adds interpretability without sacrificing performance.

🛠️ Dynamic Configuration via YAML

Instead of hardcoding scoring logic, everything is driven by configuration files. You can swap scoring strategies at runtime or evolve them over time, all while keeping your codebase clean.

We add the ScoringMixin to the agents at the stages where we need to generate scores. It takes a parameter that links to a config file, which determines the dimensions.

Each dimension maps to a prompt file that elicits a response from the model, scoring each hypothesis and providing a rationale.

🧭 Multiple Evaluation Modes

The ScoreEvaluator supports two scoring modes:

  • simple: expects a scalar score (e.g., score: 85)
  • cor: expects a structured rubric + rationale ending in [[85]]

This is configured via:

output_format: "cor" or "simple"

Each dimension's parser can adapt to extract numeric values from either format.
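
A plausible sketch of the simple-format parser is shown below; the CoR parser (parse_numeric_cor) appeared earlier in ScoreEvaluator, while this simple variant is illustrative rather than the exact implementation:

import re

def parse_numeric_simple(response: str) -> float:
    """Extracts a scalar from a line like 'score: 85' in the simple format."""
    match = re.search(r"score:\s*(\d+(?:\.\d+)?)", response, re.IGNORECASE)
    if not match:
        raise ValueError(f"Could not extract numeric score from response: {response}")
    return float(match.group(1))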

But simply collecting rich, structured scores is only the first step. The true power lies in analyzing these dimensions to extract actionable insights and drive further improvement.

🔍 Connection to RM-R1: Reward Models from Rationale

The scoring system in this post draws inspiration from the RM-R1 paper (reward modeling as reasoning), which introduced a powerful idea:

Instead of training reward models on preference labels alone, train them to predict scores from rationales.

This fundamentally shifts the training objective from:

  • "Which output is preferred?" to:
  • "Given this rationale, what score should the output receive (and why)?"

In our implementation, this translates into training per-dimension reward models (e.g., for correctness, clarity, novelty) using contrast pairs and structured rationales. For each dimension, the model sees examples like:

{ 
     "dimension": "correctness", 
     "score": 40.0, 
     "response": "rationale: The hypothesis presents a valid mechanism for AI optimizing experimental design through reinforcement learning, which is logically consistent within its stated context. However, it does not directly address the goal of AI self-reprogramming, as it focuses on AI assisting in biomedical research rather than AI modifying its own code or capabilities. The reasoning is sound for its intended purpose but lacks alignment with the specified goal.  \nscore: 40"
}

These pairs train the model to associate textual cues in the rationale with quantitative judgments, making it more interpretable and robust, just as RM-R1 advocates.


🧠 Why This Matters

By aligning with the RM-R1 philosophy, our scoring system becomes:

  • ✅ Rationale-grounded: scores are linked to human-readable justifications.
  • ✅ Dimension-specific: each aspect of quality has its own evaluator.
  • ✅ Trainable: you can improve each scoring function over time using preference data.

This isn't just useful for training reward models; it gives you full introspective access to how and why your system evaluates hypotheses the way it does.


📊 Analysis That Goes Beyond Scores

With rich, structured scores flowing in, I extended the system with a ScoreAnalyzer class that allows for:

  • 📈 Descriptive statistics across dimensions
  • 🧮 Weighted average scoring
  • 📉 Regression models to find which dimensions best predict success
  • 🧲 Clustering to identify common failure modes or performance patterns

This lets me ask deeper questions like:

Which dimensions most strongly correlate with high-quality final output? Can we predict outcome quality based on intermediate scores?

And ultimately, use those insights to tune rules, prompts, and agent behaviors.

🧬 Principal Component Analysis (PCA)

We use PCA on the score matrix (dimensions × hypotheses) to discover latent patterns, like:

  • Which dimensions co-vary?
  • Are there clusters of "strong-but-unoriginal" vs "novel-but-risky" answers?

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

class ScoreAnalyzer:
    def __init__(self, score_data: pd.DataFrame):
        """
        Expected format:
        - 'hypothesis_id': str
        - 'dimension': str
        - 'score': float
        - Optional: 'outcome' (e.g., final ranking, human eval)
        """
        self.df = score_data
        self.pivot = self.df.pivot(index='hypothesis_id', columns='dimension', values='score')

    def describe_scores(self):
        return self.pivot.describe()

    def fit_linear_regression(self, outcome_col: str):
        merged = self.pivot.copy()
        merged[outcome_col] = self.df.drop_duplicates(subset='hypothesis_id').set_index('hypothesis_id')[outcome_col]
        merged = merged.dropna()
        X = merged.drop(columns=[outcome_col])
        y = merged[outcome_col]
        model = LinearRegression().fit(X, y)
        return model, dict(zip(X.columns, model.coef_))

    def perform_pca(self, n_components=2):
        pca = PCA(n_components=n_components)
        components = pca.fit_transform(self.pivot.fillna(0))
        return components, pca.explained_variance_ratio_

    def cluster_outputs(self, n_clusters=3):
        km = KMeans(n_clusters=n_clusters, n_init=10)
        labels = km.fit_predict(self.pivot.fillna(0))
        return labels

    def plot_pca_clusters(self, n_clusters=3):
        components, _ = self.perform_pca()
        labels = self.cluster_outputs(n_clusters=n_clusters)
        plt.scatter(components[:, 0], components[:, 1], c=labels, cmap='tab10')
        plt.xlabel('PC1')
        plt.ylabel('PC2')
        plt.title('PCA of Score Vectors (Colored by Cluster)')
        plt.show()

PCA Scores
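
Using the analyzer is straightforward; here is a small sketch with illustrative data (the column names follow the docstring above, and 'outcome' is an optional downstream quality signal):

# Sketch: analyzing dimension scores pulled from the database (illustrative data).
scores_df = pd.DataFrame([
    {"hypothesis_id": "h1", "dimension": "correctness", "score": 45.0, "outcome": 0.61},
    {"hypothesis_id": "h1", "dimension": "originality", "score": 70.0, "outcome": 0.61},
    {"hypothesis_id": "h2", "dimension": "correctness", "score": 92.0, "outcome": 0.88},
    {"hypothesis_id": "h2", "dimension": "originality", "score": 75.0, "outcome": 0.88},
])

analyzer = ScoreAnalyzer(scores_df)
print(analyzer.describe_scores())                         # per-dimension statistics
model, coefs = analyzer.fit_linear_regression("outcome")  # which dimensions predict outcome?
components, variance = analyzer.perform_pca(n_components=2)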


    flowchart TD
    A[Agent Step e.g. Generation, Refinement, Reasoning] --> B{ScoringMixin?}

    B -- Yes --> C[ScoreEvaluator]
    B -- No --> Z[Continue to Next Agent Step]

    C --> D{Configurable Dimensions}
    D --> D1[Correctness]
    D --> D2[Originality]
    D --> D3[Clarity]
    D --> D4[Custom Metric e.g. novelty, precision, etc.]

    C --> E[Prompt Template per dimension]
    E --> F[call_llm prompt]

    F --> G[LLM Response]
    G --> H[Parse Score + Rationale]
    H --> I[Weight & Aggregate Scores]

    I --> J[Structured Score Output]
    J --> K[Store in EvaluationORM + ScoreORM]
    K --> L[Log + Visualize Results]

    J --> M[Pass Scores to Next Agent optional]
    L --> Z
    M --> Z
  

🔄 Integrating Scores into Pipelines

You can integrate the scoring system at any step of the agent pipeline using the ScoringMixin, enabling fine-grained, configurable evaluations. Whether you're debugging a specific stage, comparing strategies, or scoring final outputs, you can define exactly what to measure, how to measure it, and when, using your own metrics, dimensions, and prompts.


class ScoringMixin:
    """
    A generic scoring mixin that supports dynamic, stage-aware evaluation using ScoreEvaluator.

    Supports any configured scoring stage (e.g., review, reasoning, reflection).
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._scorers = {}  # Caches ScoreEvaluator instances per stage

    def get_scorer(self, stage: str) -> ScoreEvaluator:
        """
        Lazily loads and returns a ScoreEvaluator for the given stage.
        Config path is read from e.g., cfg['review_score_config'].
        """
        if stage not in self._scorers:
            config_key = f"{stage}_score_config"
            config_path = self.cfg.get(config_key, f"config/scoring/{stage}.yaml")
            self._scorers[stage] = ScoreEvaluator.from_file(
                filepath=config_path,
                prompt_loader=self.prompt_loader,
                cfg=self.cfg,
                logger=self.logger,
                memory=self.memory
            )
        return self._scorers[stage]

    def score_hypothesis(self, hypothesis: dict, context: dict, metrics: str = "review") -> dict:
        """
        Score a hypothesis for a given evaluation stage.

        Args:
            hypothesis (dict): Hypothesis object with a "text" key.
            context (dict): Pipeline context, must include 'goal'.
            metrics (str): Evaluation metrics (e.g., "review", "reasoning", "reflection").

        Returns:
            dict: {
                "id": hypothesis_id,
                "score": float,
                "scores": {dimension_name: {score, rationale, weight}, ...},
                "metrics": metrics
            }
        """
        scorer = self.get_scorer(metrics)
        dimension_scores = scorer.evaluate(
            hypothesis=hypothesis,
            context=context,
            llm_fn=self.call_llm
        )

        weighted_total = sum(
            s["score"] * s.get("weight", 1.0)
            for s in dimension_scores.values()
        )
        weight_sum = sum(s.get("weight", 1.0) for s in dimension_scores.values())
        final_score = round(weighted_total / weight_sum, 2) if weight_sum > 0 else 0.0

        self.logger.log("HypothesisScoreComputed", {
            "score": final_score,
            "dimension_scores": dimension_scores,
            "hypothesis": hypothesis,
            "metrics": metrics
        })

        return {
            "id": hypothesis.get("id"),
            "score": final_score,
            "scores": dimension_scores,
            "metrics": metrics
        }

Then extending an agent to use this scoring is straightforward.

This is the reflection agent:

class ReflectionAgent(ScoringMixin, BaseAgent):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)

    async def run(self, context: dict) -> dict:
        hypotheses = self.get_hypotheses(context)

        reflections = []
        for hyp in hypotheses:
            # the metrics key ties together the scoring dimensions and prompts
            score = self.score_hypothesis(hyp, context, metrics="reflection")
            self.logger.log(
                "ReflectionScoreComputed",
                score,
            )
            reflections.append(score)

        context[self.output_key] = reflections
        return context

We configure the agent's scoring through a configuration file:

# config/scoring/reflection.yaml
dimensions:
  - name: correctness
    file: correctness
    weight: 1.2
    extra_data: { parser: numeric }

  - name: feasibility
    file: feasibility
    weight: 1.1
    extra_data: { parser: numeric }

  - name: insightfulness
    file: insightfulness
    weight: 1.3
    extra_data: { parser: numeric }

  - name: alignment
    file: alignment
    weight: 1.0
    extra_data: { parser: numeric }

🧠 Unified Multi-Dimensional MR.Q

We're pushing the boundaries of what AI systems can learn by evaluating hypotheses across multiple interpretable dimensions: correctness, originality, clarity, relevance, and more.

At the heart of this is MR.Q, a simple, contrastive reward model that learns from preference pairs (A is better than B). Originally designed for lightweight, interpretable feedback, MR.Q helps align AI outputs without requiring complex reinforcement learning.

Now, with the new UnifiedMRQAgent, we've extended MR.Q into a multi-dimensional learning framework: instead of just learning which output is preferred, we train separate models to understand why, modeling each quality dimension (like correctness or insight) with its own scoring model. This allows us to tune the system toward generating hypotheses that are not only preferred, but objectively better across the dimensions that matter.

🎯 The Vision: From Evaluation to Adaptation

We've built a powerful framework for scoring hypotheses in structured, explainable ways. But now we ask:

Can we use these multi-dimensional scores to train a model that helps us generate better hypotheses automatically?

The answer is yes, and here's how we're doing it.


🔧 How UnifiedMRQAgent Works


class UnifiedMRQAgent(BaseAgent):

    async def run(self, context: dict) -> dict:
        self.logger.log("UnifiedMRQStarted", {})

        # Step 1: Load hypotheses and scores
        hypotheses = context.get("hypotheses") or self.memory.hypotheses.get_all()

        hypothesis_ids = [h["id"] for h in hypotheses]
        evaluations = self.memory.evaluations.get_by_hypothesis_ids(hypothesis_ids)
        evaluation_ids = [e.id for e in evaluations]
        scores = self.memory.scores.get_by_evaluation_ids(evaluation_ids)
        
        # Step 2: Embed and index hypotheses
        embedded = self._index_embeddings(hypotheses)

        print(f"Embedded: {[(k, v[1][:5]) for k, v in embedded.items()]}")

        # Step 3: Collect dimension-wise scores
        score_map = self._group_scores(scores)

        print("Score map keys:", list(score_map.keys()))
        print("Example score entry:", next(iter(score_map.items()), None))

        # Step 4: Generate contrast pairs
        contrast_pairs = self._generate_contrast_pairs(embedded, score_map, context)

        # Step 5: Train model per dimension
        trained_models = self.trainer.train_multidimensional_model(contrast_pairs)
        self.logger.log(
            "UnifiedMRQTrained",
            {
                "pair_count": len(contrast_pairs),
                "dimensions": list(trained_models.keys()),
            },
        )

        # Step 6: Save and log to DB
        os.makedirs(self.output_dir, exist_ok=True)
        for dim, model in trained_models.items():
            path = os.path.join(self.output_dir, f"{dim}_mrq.pkl")
            with open(path, "wb") as f:
                pickle.dump(model, f)

            pair_count = len([p for p in contrast_pairs if p["dimension"] == dim])
            self.memory.session.add(
                UnifiedMRQModelORM(
                    dimension=dim,
                    model_path=path,
                    pair_count=pair_count,
                    trainer_version="v1.0",
                    context={
                        "similarity_threshold": self.similarity_threshold,
                        "min_score_diff": self.min_score_difference,
                    },
                )
            )

        self.memory.session.commit()
        self.logger.log(
            "UnifiedMRQModelsSaved", {"dimensions": list(trained_models.keys())}
        )
        context["unified_mrq_model_paths"] = {
            dim: os.path.join(self.output_dir, f"{dim}_mrq.pkl")
            for dim in trained_models
        }

        return context

... 

    def _generate_contrast_pairs(self, embedded: dict, score_map: dict, context: dict) -> list[dict]:
        """
        Given a map of hypothesis_id -> (hypothesis_dict, embedding), and a score_map,
        return all valid contrast pairs where two hypotheses have scores for the same dimensions.
        """
        contrast_pairs = []
        dim_seen = set()

        all_ids = list(embedded.keys())
        self.logger.log(
            "ContrastPairGenerationStart",
            {
                "total_hypotheses": len(all_ids),
                "score_map_keys": list(score_map.keys())[:10],
            },
        )

        for i in range(len(all_ids)):
            for j in range(i + 1, len(all_ids)):
                id_a, id_b = all_ids[i], all_ids[j]

                if id_a not in score_map or id_b not in score_map:
                    continue

                scores_a = score_map[id_a]
                scores_b = score_map[id_b]

                shared_dims = set(scores_a.keys()) & set(scores_b.keys())

                for dim in shared_dims:
                    score_a = scores_a[dim]
                    score_b = scores_b[dim]

                    # Skip if scores are equal
                    if score_a == score_b:
                        continue

                    dim_seen.add(dim)

                    # Get embedding vectors
                    emb_a = embedded[id_a][1]
                    emb_b = embedded[id_b][1]

                    if emb_a is None or emb_b is None:
                        self.logger.log(
                            "MissingEmbeddingInContrast",
                            {"id_a": id_a, "id_b": id_b, "dim": dim},
                        )
                        continue

                    preferred = "a" if score_a > score_b else "b"
                    pair = {
                        "dimension": dim,
                        "prompt": context.get("goal").get("goal_text"),  # Optional: use goal or reasoning task if desired
                        "output_a": embedded[id_a][0]["text"],
                        "output_b": embedded[id_b][0]["text"],
                        "preferred": preferred,
                    }
                    contrast_pairs.append(pair)

        self.logger.log(
            "ContrastPairGenerationComplete",
            {
                "pairs_generated": len(contrast_pairs),
                "dimensions_covered": list(dim_seen),
            },
        )

        return contrast_pairs

💡 The full lifecycle of our multi-dimensional MR.Q training pipeline

    
flowchart TD
    subgraph Input
        A[Pipeline Outputs Hypotheses] --> B[ScoreEvaluator Scores Each Hypothesis]
        B --> C[Scores Stored in ScoreORM per Dimension]
    end

    subgraph Contrast Pair Generation
        C --> D[Extract High vs Low Scoring Pairs per Dimension]
        D --> E[Create Contrast Pairs A gt B, with Embeddings]
    end

    subgraph MRQ Training
        E --> F[Train MRQ Model per Dimension]
        F --> G[Save Trained Predictors]
    end

    subgraph Inference + Tuning
        G --> H[Apply MRQ Model to New Hypotheses]
        H --> I[Score Predicted Quality per Dimension]
        I --> J[Select or Generate Best Hypothesis max weighted score]
    end

    J --> K[Feed Best Hypothesis into Next Pipeline Stage]

    style A fill:#f0f8ff
    style K fill:#e6ffe6
    style G fill:#fff5e6
  

1. Collect Hypotheses & Scores

hypotheses = context.get("hypotheses") or self.memory.hypotheses.get_all()
scores = self.memory.scores.get_all()

You pull all hypotheses and their associated scores from memory, including multi-dimensional breakdowns like correctness, originality, etc.


2. Index Embeddings for Similarity Matching

embedded = self._index_embeddings(hypotheses)

Each hypothesis is embedded and indexed, allowing efficient similarity comparisons via cosine similarity.
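
A sketch of what that indexing step could look like is below; it reuses the cached embedding store (self.memory.embedding.get_or_create) mentioned later in this post, and the exact implementation may differ:

def _index_embeddings(self, hypotheses: list[dict]) -> dict:
    """Sketch: map hypothesis_id -> (hypothesis_dict, embedding_vector)."""
    embedded = {}
    for h in hypotheses:
        # get_or_create caches embeddings, so repeated runs avoid recomputation
        vector = self.memory.embedding.get_or_create(h["text"])
        embedded[h["id"]] = (h, vector)
    return embedded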


3. Group Scores by Hypothesis and Dimension

score_map = self._group_scores(scores)

Scores are organized into a map: {hypothesis_id -> {dimension -> score}}.
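
A plausible sketch of that grouping step, assuming each score row exposes a hypothesis id, dimension name, and numeric score (the attribute names here are illustrative, not the exact ORM schema):

from collections import defaultdict

def _group_scores(self, scores) -> dict:
    """Sketch: build {hypothesis_id: {dimension: score}} from ScoreORM rows."""
    score_map = defaultdict(dict)
    for s in scores:
        score_map[s.hypothesis_id][s.dimension] = s.score
    return dict(score_map)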


4. Generate Contrastive Pairs

contrast_pairs = self._generate_contrast_pairs(embedded, score_map, context)

This is the heart of the training logic:

  • For similar hypotheses (cos_sim > threshold)
  • Where score difference on any dimension exceeds a threshold
  • You form a contrast pair: (better_hypothesis, worse_hypothesis, dimension)

These pairs teach the model what makes one hypothesis better than another along each axis.

{
  "dimension": "clarity",
  "prompt": "What are the key takeaways from the ReAct paper?",
  "output_a": "The ReAct paper proposes a framework where language models can both reason and act in an environment. This allows them to interleave natural language reasoning steps with actions like tool use or web search.",
  "output_b": "It's like a paper about reasoning and doing things I guess, using like tools and stuff during the process.",
  "preferred": "a"
}

🧠 Explanation:

  • dimension: the score axis you're training; in this case, clarity.
  • prompt: the task or query under evaluation.
  • output_a vs output_b: two model responses of differing quality.
  • preferred: indicates which output is preferred with respect to the given dimension.
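
The agent also records a similarity_threshold and min_score_difference in its training context, so the pair filter presumably looks something like this sketch (illustrative, not the exact code):

import numpy as np

def _is_valid_pair(self, emb_a, emb_b, score_a, score_b) -> bool:
    """Sketch: keep pairs that are semantically close but clearly different in quality."""
    a, b = np.asarray(emb_a, dtype=float), np.asarray(emb_b, dtype=float)
    cos_sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return (
        cos_sim >= self.similarity_threshold
        and abs(score_a - score_b) >= self.min_score_difference
    )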

5. Train Per-Dimension Preference Models

trained_models = self.trainer.train_multidimensional_model(contrast_pairs)

You train a separate model per dimension, e.g., a model that knows what makes one hypothesis "more correct" than another.

This opens the door to targeted optimization during hypothesis generation.


6. Save Models and Log to DB

with open(path, "wb") as f:
    pickle.dump(model, f)

self.memory.session.add(UnifiedMRQModelORM(...))

Each trained model is persisted and tracked in the database, making it available for downstream agents.


🧠 Training the Multi-Dimensional Critic

After generating hypotheses and scoring them across multiple dimensions (e.g., correctness, originality, clarity), we move to the core of the system: training a multidimensional evaluator that can predict and rank hypotheses just like our LLM-based rubrics.

This is where the MR.Q framework shines.

🧩 Building Contrast Pairs

First, we convert raw scores into contrastive training data:

  • For every pair of hypotheses scored on the same dimension, we form a training pair:

    {
      "dimension": "correctness",
      "output_a": "Hypothesis A text",
      "output_b": "Hypothesis B text",
      "preferred": "a"
    }
    
  • We compute one pair per dimension when the scores are unequal; this produces a set of labeled preferences per evaluation run.

This pairing strategy generalizes MR.Q's binary preference format to multiple independent scorers.

🔬 Embedding the Hypotheses

To avoid rerunning LLMs during training, we use cached embeddings:

  • Each hypothesis is encoded using a custom embedding pipeline (self.memory.embedding.get_or_create(...))
  • The TextEncoder combines the prompt and response into a joint latent vector using an MLP

We compute a difference vector between the preferred and non-preferred responses:

diff = zsa_a - zsa_b if preferred == "a" else zsa_b - zsa_a

Here zsa_a is the encoded latent vector representing hypothesis A (in the context of the prompt), and zsa_b is the same for hypothesis B.

These diffs serve as inputs for training a HypothesisValuePredictor, a simple MLP that learns to predict preference directionality from embedding deltas.
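
A minimal sketch of what such a predictor could look like in PyTorch (layer sizes and names are illustrative; the real TextEncoder and predictor live in the MR.Q trainer):

import torch
import torch.nn as nn

class HypothesisValuePredictor(nn.Module):
    """Sketch: scores an embedding-difference vector; a positive output means
    the first (preferred) hypothesis looks better on this dimension."""
    def __init__(self, embedding_dim: int = 1024, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, diff: torch.Tensor) -> torch.Tensor:
        return self.net(diff)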

🏋️ Multidimensional Training Loop

The trainer then iterates over contrast pairs grouped by dimension:

for dim, pairs in dimension_to_pairs.items():
    dataloader = self.prepare_training_data(pairs)
    self.train_model_for_dimension(dim, dataloader)

Each dimension gets its own trained model, which learns from its own contrastive signals. We use early stopping and log metrics like average loss and convergence status.
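
For reference, a per-dimension training step might look roughly like the sketch below, assuming each batch yields difference vectors plus binary labels for their direction (the actual MRQTrainer adds early stopping and structured logging):

import torch
import torch.nn as nn

def train_model_for_dimension(model, dataloader, epochs: int = 20, lr: float = 1e-3):
    """Sketch: binary preference training on embedding-difference vectors."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for epoch in range(epochs):
        total = 0.0
        # labels: 1.0 if diff = preferred - non-preferred, 0.0 for the reverse
        for diffs, labels in dataloader:
            optimizer.zero_grad()
            logits = model(diffs).squeeze(-1)
            loss = loss_fn(logits, labels.float())
            loss.backward()
            optimizer.step()
            total += loss.item()
        print(f"Epoch {epoch + 1}/{epochs}: avg_loss={total / len(dataloader):.5f}")
    return model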

📉 Example Training Log

Here’s a sample log from training on three dimensions:

✔️ Training model for 'correctness' [pairs: 14]
Epoch 1/20: avg_loss=0.69312
Epoch 2/20: avg_loss=0.51123
...
Early stopping triggered at epoch 7. Best loss: 0.27348

✔️ Training model for 'originality' [pairs: 9]
...

✔️ Training model for 'clarity' [pairs: 12]
...

Each dimension is treated as a distinct ranking problem, enabling fine-grained learning from different types of feedback.


🧪 What This Enables

Once trained, these models can:

  • Score new hypotheses instantly (no LLM call needed)
  • Guide future generations by selecting high-scoring candidates
  • Evaluate prompt variants or strategy choices during tuning

The goal is to replace LLM scorers with fast, local, preference-aligned critics, one per dimension, and plug them into a fully self-improving pipeline.


📈 The Impact of Dimensional Learning

Traditional reward modeling often collapses everything into a single scalar reward signal. But this approach has limitations, especially when reasoning about complex outputs like hypotheses.

By preserving dimensional structure, we gain several advantages:

| Benefit | Description |
|---|---|
| ✅ Interpretable Learning | We know which model learned which aspect of quality |
| 🔍 Targeted Optimization | During generation, we can favor hypotheses that score higher on specific dimensions |
| 🔄 Feedback Loop | Better hypotheses lead to richer training data, improving models iteratively |
| 🤖 Self-Tuning Agents | These models can be used to dynamically adjust prompt strategies or rule weights |

🚀 Next Steps: Building a Self-Tuning Hypothesis Engine

1. Integrate Trained Models into Generation

Once you have trained models for each dimension, plug them into your generator to rank or re-rank hypotheses based on predicted dimensional scores.

Example usage:

predicted_scores = {
    dim: model.predict([hyp_text])
    for dim, model in loaded_models.items()
}

Then select the hypothesis with the best weighted combination of predicted scores.
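
For instance, a weighted selection over the predicted scores might look like this sketch (dimension_weights, loaded_models, and candidate_hypotheses are illustrative names, and the .predict interface follows the snippet above):

# Sketch: pick the candidate with the best weighted predicted score.
dimension_weights = {"correctness": 1.2, "originality": 0.9, "clarity": 0.8}

def weighted_score(hyp_text: str) -> float:
    return sum(
        dimension_weights.get(dim, 1.0) * float(model.predict([hyp_text])[0])
        for dim, model in loaded_models.items()
    )

best_hypothesis = max(candidate_hypotheses, key=weighted_score)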

2. Use SVM or MLP-Based Rankers

Right now, you’re using MRQTrainer. You could evolve this to support:

  • SVM-based rankers (classic pairwise preference learning)
  • Neural rankers (e.g., BERT + head for dimension-specific ranking)
  • Multi-task models that predict all dimensions at once

This gives you flexibility depending on your performance vs. interpretability trade-off.

3. Build a Rule Tuning Loop Using Score Deltas

With access to both actual and predicted scores, you can identify where predictions diverge from reality and use those deltas to:

  • Tune symbolic rules via RuleTunerAgent
  • Adjust prompt templates
  • Refine embedding strategies
  • Evolve the scoring rubric itself

4. Enable Real-Time Feedback During Generation

Eventually, integrate these models into the generation process itself, for example using beam search guided by real-time dimensional feedback.

This allows you to generate hypotheses that are optimized for multiple criteria simultaneously, rather than relying on post-hoc filtering.

5. Structured Prompt Evaluation

Use scores to filter and mutate generation prompts.

6. Symbolic Rule Tuning

Use score deltas (before/after rule application) to identify which rules help or hurt.

7. Reward Model Training

  • Treat dimensional scores as supervision targets.
  • Train small models to predict scores from hypothesis features.

8. Meta-Scoring Agents

Build a ScorePlannerAgent that selects dimensions dynamically based on the goal type.
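
Such a planner could start as a simple goal-type lookup; the sketch below is hypothetical (the ScorePlannerAgent does not exist yet, and the mapping is illustrative):

# Sketch: a hypothetical ScorePlannerAgent choosing dimensions per goal type.
GOAL_TYPE_DIMENSIONS = {
    "research": ["correctness", "insightfulness", "alignment"],
    "engineering": ["feasibility", "completeness", "clarity"],
}

class ScorePlannerAgent:
    def select_dimensions(self, goal: dict) -> list[str]:
        return GOAL_TYPE_DIMENSIONS.get(goal.get("goal_type"), ["correctness", "clarity"])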


🧩 Example Use Case: Sharpening Hypotheses with Dimensional Guidance

Imagine a SharpeningAgent that uses your trained models to ask:

"Which hypothesis is likely to score highest on clarity and originality?"

It generates variants, ranks them using your models, and returns the best.

This is generative refinement powered by dimensional understanding.
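
In rough pseudocode, such an agent might do something like the following sketch (generate_variants and the fixed clarity/originality pair are illustrative assumptions):

# Sketch: rank generated variants by predicted clarity + originality.
def sharpen(hypothesis_text: str, generate_variants, loaded_models, n: int = 5) -> str:
    variants = generate_variants(hypothesis_text, n=n)  # e.g., LLM-produced rewrites
    def predicted_quality(text: str) -> float:
        return sum(
            float(loaded_models[dim].predict([text])[0])
            for dim in ("clarity", "originality")
            if dim in loaded_models
        )
    return max(variants, key=predicted_quality)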


💥 Conclusion: Towards Truly Intelligent AI

This isn't just about getting better scores; it's about building a feedback loop that supports:

  • 💡 Interpretability: understand why something scored low.
  • 🤖 Symbolic Tuning: use score deltas to optimize rules.
  • 📈 Learning: feed scores into reward models or preference datasets.
  • 🧪 Experimentation: try different weights, parsers, and prompt styles.

It's the foundation for building smarter, self-improving agents, ones that don't just generate, but also reflect, analyze, and adapt.


📚 References

  1. SuperWriter: Rubric-Guided Chain-of-Thought Improves Text Generation. Wengong Jin, Daniel Fried, Xinyun Chen, et al. arXiv:2310.01361. https://arxiv.org/abs/2310.01361

  2. RM-R1: Reward Modeling Should Reason, Not Just Score. Xueguang Ma, Yuntao Bai, et al. arXiv:2403.03505. https://arxiv.org/abs/2403.03505

  3. DPO: Direct Preference Optimization. Rafailov et al., 2023. https://arxiv.org/abs/2305.18290

  4. DSPy: Modular LLM Programming. Stanford NLP, 2024. GitHub: https://github.com/stanfordnlp/dspy

  5. Chain-of-Rubrics (CoR). Introduced in RM-R1 as a structured format for reward reasoning: <rubric>, <eval>, and <answer>[[score]]</answer>.

  6. RM-Bench. A benchmark dataset for evaluating reasoning-based reward models (introduced in RM-R1).

  7. MR.Q: Minimum Reasonable Quality Evaluator. A lightweight alternative to DPO for scoring and training using structured self-evaluation prompts.


📘 Glossary

| Term | Definition |
|---|---|
| CoR (Chain-of-Rubrics) | A structured output format that guides an AI model to reason using a rubric before producing a final score or judgment. Typically includes sections like <rubric>, <eval>, and <answer>[[score]]</answer>. Inspired by the RM-R1 paper. |
| RM-R1 | A research paper proposing reward modeling as reasoning, where LLMs score outputs by reasoning through a rubric instead of producing a raw scalar value. Enables interpretable evaluation. |
| DSPy | A modular prompt programming framework for building and chaining LLM-based modules (e.g., reasoners, evaluators) in Python. Used here to implement scoring and judgment agents. |
| ScoreEvaluator | A configurable class in the Co AI system that uses dimensions (like correctness or feasibility) and structured prompts to evaluate hypotheses. Can return scalar or CoR-style scores. |
| MR.Q | A lightweight reward modeling strategy that evaluates and ranks hypotheses using simple heuristics or LLM judgments. Used as a scoring backbone in the Sharpening and GeneralReasoner systems. |
| Pipeline Run | A complete pass through a Co AI reasoning pipeline for a given goal. Each pipeline run generates hypotheses and scores them using configured evaluation agents. |
| Hypothesis | A generated answer or proposed solution to a goal, typically created by an LLM agent in the Co AI system. Each hypothesis is scored across multiple dimensions. |
| ScoreAnalyzer | A tool for analyzing score distributions across hypotheses and dimensions. Supports statistical summary, PCA, and clustering to reveal performance patterns. |
| Dimension (Metric) | An evaluation category such as "correctness", "feasibility", or "insightfulness". Each dimension has its own rubric, weight, and parser. |
| Parser | A function that extracts a numeric score from a model's evaluation output. In CoR mode, this may involve extracting values from <answer>[[score]]</answer> tags. |
| Rubric | A structured set of evaluation criteria that guide the model's reasoning before assigning a score. Rubrics are central to CoR and SuperWriter-style evaluation. |
| Final Score | A weighted aggregate of dimension scores, used to rank or compare hypotheses. Calculated automatically by ScoreEvaluator. |