Dimensions of Thought: A Smarter Way to Evaluate AI

## Summary

This post introduces a multidimensional reward modeling pipeline built on top of the CO_AI framework. It covers:

- **Structured Evaluation Setup**: how to define custom evaluation dimensions using YAML or database-backed rubrics.
- **Automated Scoring with LLMs**: using the `ScoreEvaluator` to produce structured, rationale-backed scores for each dimension.
- **Embedding-Based Hypothesis Indexing**: efficiently embedding hypotheses and comparing them by similarity for contrastive learning.
- **Contrast Pair Generation**: creating training pairs where one hypothesis outperforms another on a given dimension.
- **Training the MR.Q Model**: a multidimensional MR.Q-style trainer that learns to rank hypotheses by dimension via contrastive supervision.
- **Extensible Architecture**: modular components like `ScoreORM`, `EvaluationORM`, and the trainer pipeline make it easy to extend or swap models, dimensions, or strategies.
- **Practical Use Case**: all of this is applied in a real evaluation loop that generates, scores, and learns from hypotheses within Co AI.
## The Intelligence Measurement Gap

The quest for truly intelligent AI hinges on our ability to evaluate it. Yet, as language models push the boundaries of human-level performance, our evaluation methods often fall short. Most benchmarks still reduce intelligence to single-number metrics: accuracy percentages or vague quality scores. But true cognition isn't monolithic. It's:
- Contextual (depends on the problem space)
- Multi-faceted (different dimensions matter for different tasks)
- Self-correcting (learns from evaluation)
We’ve built a new evaluation framework that moves beyond scalar scores to assess AI-generated hypotheses across configurable dimensions. This isn’t just about rating outputs - it’s about understanding how AIs think.
## Why Dimensions Matter

When I first began evaluating AI-generated text, I used basic LLM judges. They would return a single score, say "7/10 for fluency", but offered little transparency into why the score was assigned or how to improve the output.

The SuperWriter paper (Reflection-Driven Long-Form Generation with Large Language Models) introduced a structured, stage-based framework for generation: Planning → Writing → Refining, with each phase evaluated against distinct criteria. This modular breakdown helped isolate failure modes and guided improvements at each step.

But even with SuperWriter's insightful structure, the scoring methods were initially hard-coded and static: excellent for guiding generation, but limited when adapting to new goals or learning from experience.

That's when a new idea clicked: what if we extended this foundation to not just score outputs, but reason through them?

By building on SuperWriter's staged evaluation, we introduced:
- Configurable scoring dimensions per stage (e.g., correctness, originality, clarity)
- Rationale-backed evaluations for transparency and introspection
- Symbolic reflection, so the AI can analyze its own process
- Seamless hooks into tuning frameworks like MR.Q
This evolution turned evaluation from a static judgment into a dynamic feedback loop: one that doesn't just critique the AI's thinking, but helps it improve its thinking over time.

This led to the development of our modular, multidimensional scoring framework: a system that doesn't just grade AI outputs, but helps the AI understand why a hypothesis is strong (or weak), and how to evolve better ones.
```mermaid
flowchart TD
    A[Planning Stage] --> B[Writing Stage]
    B --> C[Refining Stage]
    C --> D[Multidimensional Scoring]
    D --> E1[Coherence Score + Rationale]
    D --> E2[Originality Score + Rationale]
    D --> E3[Clarity Score + Rationale]
    D --> E4[Relevance Score + Rationale]
    E1 --> F[Symbolic Introspection]
    E2 --> F
    E3 --> F
    E4 --> F
    F --> G[Prompt / Rule Tuning]
    G --> A
    style A fill:#f9f,stroke:#333,stroke-width:1px
    style B fill:#bbf,stroke:#333,stroke-width:1px
    style C fill:#bfb,stroke:#333,stroke-width:1px
    style D fill:#ffd,stroke:#333,stroke-width:1px
    style F fill:#fdd,stroke:#333,stroke-width:1px
    style G fill:#ddf,stroke:#333,stroke-width:1px
```
## How the Scoring System Works

The scoring system lets you attach interpretable, rationale-backed evaluations to any part of your pipeline. Whether you're refining a hypothesis, comparing strategies, or tuning symbolic rules, scoring becomes a modular, pluggable step that brings clarity to what "better" really means.
```mermaid
flowchart LR
    A[YAML Config<br/><sub>Dimensions, Weights, Templates</sub>] --> B[Prompt Loader<br/><sub>Renders prompts per dimension</sub>]
    B --> C[ScoreEvaluator<br/><sub>Evaluates using LLM or parser</sub>]
    C --> D[ScoreORM<br/><sub>Stores scores, rationales, weights</sub>]
    D --> E[Analysis & Tuning<br/><sub>Drives training & adaptation</sub>]
```
## Example: Review Dimensions

The `ScoreEvaluator` uses a lightweight, extensible engine to apply scoring prompts like these:
Dimension | Category | Weight | What It Measures | Prompt Snippet |
---|---|---|---|---|
Correctness | Review | 1.2 | Factual accuracy | “Does the hypothesis contradict established knowledge?” |
Feasibility | Review | 1.1 | Practical viability | “Could this be implemented with current technology?” |
Insightfulness | Review | 1.3 | Novel connections | “Does this reveal non-obvious relationships?” |
Alignment | Review | 1.0 | Goal relevance | “How directly does this address the research question?” |
Completeness | Review | 0.8 | Coverage of all key aspects | “Are key parts of the problem addressed?” |
Delta Clarity | Reflection | 1.0 | Self-awareness improvement | “Did this step improve the clarity over the previous one?” |
Rule Impact | Evaluation | 1.1 | Rule effectiveness | “How much did this rule improve the hypothesis quality?” |
Cluster Count | Proximity | 0.5 | Reasoning diversity | “How many distinct hypothesis clusters were formed?” |
Graft Pair Count | Proximity | 0.7 | Merge opportunities | “How many high-similarity pairs suggest grafting?” |
Judge Agreement | Judgment | 1.0 | Consensus of LLMs or heuristics | “How consistently did evaluators agree on this output?” |
Relevance Gain | Reflection | 0.9 | Focus refinement | “Did this revision make the output more on-target?” |
This shows how you can score at any level, from raw model outputs to entire pipeline stages, and even track strategy impact, reflection gain, or reasoning cohesion over time.
## The ScoreEvaluator

At the core is the `ScoreEvaluator` class, a lightweight, pluggable engine that evaluates hypotheses across configurable dimensions. Each dimension is defined by:

- A name (e.g., "correctness")
- A prompt file (`prompts/scoring/correctness.txt`)
- An optional weight
- A parser (numeric, boolean, string)
# config/scoring/reasoning_cor.yaml
stage: reasoning
output_format: cor
dimensions:
- name: correctness
file: correctness_cor.txt
weight: 1.2
extra_data:
parser: numeric_cor
- name: feasibility
file: feasibility_cor.txt
weight: 1.1
extra_data:
parser: numeric_cor
- name: insightfulness
file: insightfulness_cor.txt
weight: 1.3
extra_data:
parser: numeric_cor
- name: alignment
file: alignment_cor.txt
weight: 1.0
extra_data:
parser: numeric_cor
- name: completeness
file: completeness_cor.txt
weight: 0.8
extra_data:
parser: numeric_cor
import re

import yaml
from pathlib import Path
from jinja2 import Template  # assumed: prompt templates are rendered Jinja-style
from tabulate import tabulate


class ScoreEvaluator:
def __init__(self, dimensions, prompt_loader, cfg, logger, memory):
self.dimensions = dimensions
self.prompt_loader = prompt_loader
self.cfg = cfg
self.logger = logger
self.memory = memory
self.output_format = cfg.get("output_format", "simple") # default fallback
@classmethod
def from_file(cls, filepath: str, prompt_loader, cfg, logger, memory):
with open(Path(filepath), "r") as f:
data = yaml.safe_load(f)
# Default to 'simple' if not provided
output_format = data.get("output_format", "simple")
dimensions = [
{
"name": d["name"],
"file": d.get("file"),
"prompt_template": d.get("prompt_template", d.get("file")), # fallback to file
"weight": d.get("weight", 1.0),
"parser": cls.get_parser(d.get("extra_data", {})),
}
for d in data["dimensions"]
]
# Ensure the output_format is accessible in instance
cfg = cfg.copy()
cfg["output_format"] = output_format
return cls(
dimensions=dimensions,
prompt_loader=prompt_loader,
cfg=cfg,
logger=logger,
memory=memory,
)
@staticmethod
def parse_numeric_cor(response: str) -> float:
"""
        Extracts the numeric score from a <answer>[[X]]</answer> block.
        Example: <answer>[[3]]</answer> → 3.0
"""
match = re.search(r"(?:<answer>\s*)?\[\[(\d+(?:\.\d+)?)\]\](?:\s*</answer>)?", response, re.IGNORECASE)
if not match:
raise ValueError(f"Could not extract numeric score from CoR-style answer: {response}")
return float(match.group(1))
...
def _evaluate_cor(self, hypothesis: dict, context: dict = {}, llm_fn=None):
"""
Evaluate using Chain-of-Rubrics (CoR) format with rubric, eval, and <answer>[[score]]</answer>.
"""
if llm_fn is None:
raise ValueError(
"You must pass a call_llm function (e.g., agent.call_llm) to ScoreEvaluator.evaluate"
)
results = {}
for dim in self.dimensions:
# Load prompt using prompt_loader and dimension-specific CoR template
if self.prompt_loader and dim.get("file"):
prompt = self.prompt_loader.from_file(
file_name=dim["file"],
config=self.cfg,
context={"hypothesis": hypothesis, **context},
)
elif dim.get("prompt_template"):
prompt = Template(dim["prompt_template"]).render(
hypothesis=hypothesis, **context
)
else:
raise ValueError(f"No prompt found for dimension {dim['name']}")
response = llm_fn(prompt, context=context)
try:
score = dim["parser"](response)
except Exception as e:
self.logger.log("ScoreParseError", {
"dimension": dim["name"],
"response": response,
"error": str(e)
})
score = 0.0
self.logger.log("DimensionEvaluated", {
"dimension": dim["name"],
"score": score,
"response": response
})
results[dim["name"]] = {
"score": score,
"rationale": response,
"weight": dim["weight"],
}
self.save_score_to_memory(results, hypothesis, context)
return results
...
def display_results(self, results, weighted_score):
table_data = [
[dim_name, f"{dim_data['score']:.2f}", dim_data['weight'], dim_data['rationale'][:60]]
for dim_name, dim_data in results.items()
]
table_data.append(["FINAL", f"{weighted_score:.2f}", "-", "Weighted average"])
print("\n๐ Dimension Scores Summary")
print(tabulate(
table_data,
headers=["Dimension", "Score", "Weight", "Rationale (preview)"],
tablefmt="fancy_grid"
))
Dimension Scores Summary

Dimension | Score | Weight | Rationale (preview) |
---|---|---|---|
correctness | 45 | 1.0 | rationale: The hypothesis explores LLM memory limitations by |
originality | 70 | 1.0 | rationale: The hypothesis introduces a novel analogy between |
clarity | 95 | 0.8 | rationale: The hypothesis clearly links LLM context window r |
relevance | 45 | 1.2 | <think> Okay, let's evaluate the hypothesis in relation to t |
FINAL | 61.25 | - | Weighted average |
You can define separate configs for planning, writing, refining, review, and reflection stages.
## Prompt Files Per Dimension

Each scoring dimension lives in its own prompt file. Here is an example CoR (Chain-of-Rubrics) file:

`prompts/scoring/alignment_cor.txt`:
You are evaluating the **alignment** of the following hypothesis with the research goal.
Rubric:
- Does the hypothesis directly address the goal?
- Does it reflect the intended strategy or values of the goal?
Format Instructions:
- Provide a short explanation of your judgment.
- End your response with: <answer>[[SCORE]]</answer>
- SCORE must be a number between **1 and 100**, nothing else.
- Example: <answer>[[75]]</answer>
<hypothesis>
{{ hypothesis }}
</hypothesis>
And here is the standard (simple) version:
Evaluate the alignment of the following hypothesis.
### Goal
{{ goal }}
### Hypothesis
{{ hypothesis }}
How well does the hypothesis align with the goal and any stated preferences?
Return your review in the exact structured format below. Do not include headings, markdown, or additional commentary. Use only plain text fields as shown:
rationale: <brief explanation>
score: <0–100>
These are loaded dynamically using the existing `PromptLoader`, making the system flexible and compatible with all your agents.
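For intuition, a per-dimension prompt render might look roughly like this (a minimal sketch assuming a Jinja2-backed loader; `render_dimension_prompt` is an illustrative helper, not the actual `PromptLoader` API):

```python
from pathlib import Path

from jinja2 import Template


def render_dimension_prompt(prompt_dir: str, file_name: str, hypothesis: str, goal: str) -> str:
    """Load a dimension's prompt template and fill in the hypothesis and goal."""
    template_text = Path(prompt_dir, file_name).read_text()
    return Template(template_text).render(hypothesis=hypothesis, goal=goal)


# Hypothetical usage with the alignment prompt shown above:
# prompt = render_dimension_prompt("prompts/scoring", "alignment_cor.txt",
#                                  hypothesis="AI could rewrite its own training loop...",
#                                  goal="Will AI ever be able to reprogram itself?")
```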
## Structured Output with Rationale
Scores now include both numeric values and rationales:
Start with a goal:

{
  "goal": {
    "goal_text": "Will AI ever be able to reprogram itself?",
    "goal_type": "research",
    "focus_area": "meta_learning"
  }
}
We generate multiple hypotheses using agents:
- Generator Agent
- COT
- ARM
Here's an example of a simplified pipeline:
goal:
goal_text: "If I was to develop a self improving process what would be the steps needed?"
goal_type: "research"
focus_area: "ai_research"
strategy: "reasoning"
difficulty: "medium"
expected_formats:
- "short_cot"
- "code"
pipeline:
name: default_pipeline
description: "Default hypothesis generation and refinement pipeline"
stages:
- name: generation
cls: co_ai.agents.generation.GenerationAgent
enabled: true
iterations: 1
- name: proximity
cls: co_ai.agents.proximity.ProximityAgent
enabled: true
iterations: 1
- name: reflection
cls: co_ai.agents.reflection.ReflectionAgent
enabled: true
iterations: 1
- name: review
cls: co_ai.agents.review.ReviewAgent
enabled: true
iterations: 1
....
- name: unified_mrq
cls: co_ai.agents.unified_mrq.UnifiedMRQAgent
enabled: true
iterations: 1
This will generate dimensional scores, which will feed the model that tunes the system:
{
"metrics": "review",
"dimensions": {
"correctness": {
"score": 92,
"rationale": "The reasoning aligns closely with the stated goal and contains no logical inconsistencies.",
"weight": 1.2
},
"originality": {
"score": 75,
"rationale": "The idea introduces some creative elements but draws heavily on common patterns.",
"weight": 0.9
},
...
}
This adds interpretability without sacrificing performance.
## Dynamic Configuration via YAML

Instead of hardcoding scoring logic, everything is driven by configuration files. You can swap scoring strategies at runtime or evolve them over time, all while keeping your codebase clean.
We add the `ScoringMixin` to the agents at the stages where we need to generate scores. The mixin takes a parameter that links to a config file, which determines the dimensions. Each dimension maps to a prompt file that elicits a response from the model, scoring each hypothesis and providing a rationale.
## Multiple Evaluation Modes
The ScoreEvaluator supports two scoring modes:
- simple: expects a scalar score (e.g., score: 85)
- cor: expects a structured rubric + rationale ending in [[85]]
This is configured via:
output_format: "cor" or "simple"
Each dimension's parser can adapt to extract numeric values from either format.
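For illustration, a simple-mode parser could be as small as this (a sketch mirroring the `parse_numeric_cor` method shown earlier; the function name is illustrative):

```python
import re


def parse_numeric_simple(response: str) -> float:
    """Extract a scalar from a 'score: 85'-style line."""
    match = re.search(r"score:\s*(\d+(?:\.\d+)?)", response, re.IGNORECASE)
    if not match:
        raise ValueError(f"Could not extract numeric score from response: {response}")
    return float(match.group(1))


# parse_numeric_simple("rationale: well aligned with the goal\nscore: 85") -> 85.0
```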
But simply collecting rich, structured scores is only the first step. The true power lies in analyzing these dimensions to extract actionable insights and drive further improvement.
## Connection to RM-R1: Reward Models from Rationale

The scoring system in this post draws inspiration from the RM-R1 paper (Reward Modeling as Reasoning), which introduced a powerful idea:
Instead of training reward models on preference labels alone, train them to predict scores from rationales.
This fundamentally shifts the training objective from:
- "Which output is preferred?" to:
- "Given this rationale, what score should the output receive (and why)?"
In our implementation, this translates into training per-dimension reward models (e.g., for correctness, clarity, novelty) using contrast pairs and structured rationales. For each dimension, the model sees examples like:
{
"dimension": "correctness",
"score": 40.0,
"response": "rationale: The hypothesis presents a valid mechanism for AI optimizing experimental design through reinforcement learning, which is logically consistent within its stated context. However, it does not directly address the goal of AI self-reprogramming, as it focuses on AI assisting in biomedical research rather than AI modifying its own code or capabilities. The reasoning is sound for its intended purpose but lacks alignment with the specified goal. \nscore: 40"
}
These pairs train the model to associate textual cues in the rationale with quantitative judgments, making it more interpretable and robust, just as RM-R1 advocates.
## Why This Matters

By aligning with the RM-R1 philosophy, our scoring system becomes:

- **Rationale-grounded**: scores are linked to human-readable justifications.
- **Dimension-specific**: each aspect of quality has its own evaluator.
- **Trainable**: you can improve each scoring function over time using preference data.
This isn't just useful for training reward models; it gives you full introspective access to how and why your system evaluates hypotheses the way it does.
## Analysis That Goes Beyond Scores

With rich, structured scores flowing in, I extended the system with a `ScoreAnalyzer` class that allows for:

- Descriptive statistics across dimensions
- Weighted average scoring
- Regression models to find which dimensions best predict success
- Clustering to identify common failure modes or performance patterns
This lets me ask deeper questions like:
Which dimensions most strongly correlate with high-quality final output? Can we predict outcome quality based on intermediate scores?
And ultimately, use those insights to tune rules, prompts, and agent behaviors.
## Principal Component Analysis (PCA)

We use PCA on the score matrix (dimensions × hypotheses) to discover latent patterns, like:

- Which dimensions co-vary?
- Are there clusters of "strong-but-unoriginal" vs "novel-but-risky" answers?
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans


class ScoreAnalyzer:
def __init__(self, score_data: pd.DataFrame):
"""
        Expected format:
- 'hypothesis_id': str
- 'dimension': str
- 'score': float
- Optional: 'outcome' (e.g., final ranking, human eval)
"""
self.df = score_data
self.pivot = self.df.pivot(index='hypothesis_id', columns='dimension', values='score')
def describe_scores(self):
return self.pivot.describe()
def fit_linear_regression(self, outcome_col: str):
merged = self.pivot.copy()
merged[outcome_col] = self.df.drop_duplicates(subset='hypothesis_id').set_index('hypothesis_id')[outcome_col]
merged = merged.dropna()
X = merged.drop(columns=[outcome_col])
y = merged[outcome_col]
model = LinearRegression().fit(X, y)
return model, dict(zip(X.columns, model.coef_))
def perform_pca(self, n_components=2):
pca = PCA(n_components=n_components)
components = pca.fit_transform(self.pivot.fillna(0))
return components, pca.explained_variance_ratio_
def cluster_outputs(self, n_clusters=3):
km = KMeans(n_clusters=n_clusters, n_init=10)
labels = km.fit_predict(self.pivot.fillna(0))
return labels
def plot_pca_clusters(self, n_clusters=3):
components, _ = self.perform_pca()
labels = self.cluster_outputs(n_clusters=n_clusters)
plt.scatter(components[:, 0], components[:, 1], c=labels, cmap='tab10')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA of Score Vectors (Colored by Cluster)')
plt.show()
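A quick usage sketch (the rows below are made-up examples; in practice you would pull them from `ScoreORM` into a DataFrame with the columns listed in the docstring):

```python
import pandas as pd

rows = [
    {"hypothesis_id": "h1", "dimension": "correctness", "score": 45, "outcome": 61.25},
    {"hypothesis_id": "h1", "dimension": "clarity", "score": 95, "outcome": 61.25},
    {"hypothesis_id": "h2", "dimension": "correctness", "score": 80, "outcome": 78.10},
    {"hypothesis_id": "h2", "dimension": "clarity", "score": 70, "outcome": 78.10},
]

analyzer = ScoreAnalyzer(pd.DataFrame(rows))
print(analyzer.describe_scores())                        # per-dimension statistics
model, coefs = analyzer.fit_linear_regression("outcome")
print(coefs)                                             # which dimensions best predict outcome quality
```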
```mermaid
flowchart TD
    A[Agent Step e.g. Generation, Refinement, Reasoning] --> B{ScoringMixin?}
    B -- Yes --> C[ScoreEvaluator]
    B -- No --> Z[Continue to Next Agent Step]
    C --> D{Configurable Dimensions}
    D --> D1[Correctness]
    D --> D2[Originality]
    D --> D3[Clarity]
    D --> D4[Custom Metric e.g. novelty, precision, etc.]
    C --> E[Prompt Template per dimension]
    E --> F[call_llm prompt]
    F --> G[LLM Response]
    G --> H[Parse Score + Rationale]
    H --> I[Weight & Aggregate Scores]
    I --> J[Structured Score Output]
    J --> K[Store in EvaluationORM + ScoreORM]
    K --> L[Log + Visualize Results]
    J --> M[Pass Scores to Next Agent optional]
    L --> Z
    M --> Z
```
## Integrating Scores into Pipelines

You can integrate the scoring system at any step of the agent pipeline using the `ScoringMixin`, enabling fine-grained, configurable evaluations. Whether you're debugging a specific stage, comparing strategies, or scoring final outputs, you can define exactly what to measure, how to measure it, and when, using your own metrics, dimensions, and prompts.
class ScoringMixin:
"""
A generic scoring mixin that supports dynamic, stage-aware evaluation using ScoreEvaluator.
Supports any configured scoring stage (e.g., review, reasoning, reflection).
"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self._scorers = {} # Caches ScoreEvaluator instances per stage
def get_scorer(self, stage: str) -> ScoreEvaluator:
"""
Lazily loads and returns a ScoreEvaluator for the given stage.
Config path is read from e.g., cfg['review_score_config'].
"""
if stage not in self._scorers:
config_key = f"{stage}_score_config"
config_path = self.cfg.get(config_key, f"config/scoring/{stage}.yaml")
self._scorers[stage] = ScoreEvaluator.from_file(
filepath=config_path,
prompt_loader=self.prompt_loader,
cfg=self.cfg,
logger=self.logger,
memory=self.memory
)
return self._scorers[stage]
def score_hypothesis(self, hypothesis: dict, context: dict, metrics: str = "review") -> dict:
"""
Score a hypothesis for a given evaluation stage.
Args:
            hypothesis (dict): Hypothesis object with a "text" key.
context (dict): Pipeline context, must include 'goal'.
metrics (str): Evaluation metrics (e.g., "review", "reasoning", "reflection").
Returns:
dict: {
"id": hypothesis_id,
"score": float,
"scores": {dimension_name: {score, rationale, weight}, ...},
"metrics": metrics
}
"""
scorer = self.get_scorer(metrics)
dimension_scores = scorer.evaluate(
hypothesis=hypothesis,
context=context,
llm_fn=self.call_llm
)
weighted_total = sum(
s["score"] * s.get("weight", 1.0)
for s in dimension_scores.values()
)
weight_sum = sum(s.get("weight", 1.0) for s in dimension_scores.values())
final_score = round(weighted_total / weight_sum, 2) if weight_sum > 0 else 0.0
self.logger.log("HypothesisScoreComputed", {
"score": final_score,
"dimension_scores": dimension_scores,
"hypothesis": hypothesis,
"metrics": metrics
})
return {
"id": hypothesis.get("id"),
"score": final_score,
"scores": dimension_scores,
"metrics": metrics
}
Extending an agent to use this scoring is straightforward. Here is the reflection agent:
class ReflectionAgent(ScoringMixin, BaseAgent):
def __init__(self, cfg, memory=None, logger=None):
super().__init__(cfg, memory, logger)
async def run(self, context: dict) -> dict:
hypotheses = self.get_hypotheses(context)
reflections = []
for hyp in hypotheses:
            # the metrics value ties together the scoring dimensions and prompts
score = self.score_hypothesis(hyp, context, metrics="reflection")
self.logger.log(
"ReflectionScoreComputed",
score,
)
reflections.append(score)
context[self.output_key] = reflections
return context
We configure the agent's scoring through a configuration file:
# config/scoring/reflection.yaml
dimensions:
- name: correctness
file: correctness
weight: 1.2
extra_data: { parser: numeric }
- name: feasibility
file: feasibility
weight: 1.1
extra_data: { parser: numeric }
- name: insightfulness
file: insightfulness
weight: 1.3
extra_data: { parser: numeric }
- name: alignment
file: alignment
weight: 1.0
extra_data: { parser: numeric }
## Unified Multi-Dimensional MR.Q

We're pushing the boundaries of what AI systems can learn by evaluating hypotheses across multiple interpretable dimensions: correctness, originality, clarity, relevance, and more.

At the heart of this is MR.Q, a simple, contrastive reward model that learns from preference pairs (A is better than B). Originally designed for lightweight, interpretable feedback, MR.Q helps align AI outputs without requiring complex reinforcement learning.

Now, with the new `UnifiedMRQAgent`, we've extended MR.Q into a multi-dimensional learning framework: instead of just learning which output is preferred, we train separate models to understand why, modeling each quality dimension (like correctness or insight) with its own scoring model. This allows us to tune the system toward generating hypotheses that are not only preferred, but objectively better across the dimensions that matter.
## The Vision: From Evaluation to Adaptation

We've built a powerful framework for scoring hypotheses in structured, explainable ways. But now we ask:

Can we use these multi-dimensional scores to train a model that helps us generate better hypotheses automatically?

The answer is yes, and here's how we're doing it.
## How UnifiedMRQAgent Works
import os
import pickle


class UnifiedMRQAgent(BaseAgent):
async def run(self, context: dict) -> dict:
self.logger.log("UnifiedMRQStarted", {})
# Step 1: Load hypotheses and scores
hypotheses = context.get("hypotheses") or self.memory.hypotheses.get_all()
hypothesis_ids = [h["id"] for h in hypotheses]
evaluations = self.memory.evaluations.get_by_hypothesis_ids(hypothesis_ids)
evaluation_ids = [e.id for e in evaluations]
scores = self.memory.scores.get_by_evaluation_ids(evaluation_ids)
# Step 2: Embed and index hypotheses
embedded = self._index_embeddings(hypotheses)
print(f"Embedded: {[(k, v[1][:5]) for k, v in embedded.items()]}")
# Step 3: Collect dimension-wise scores
score_map = self._group_scores(scores)
print("Score map keys:", list(score_map.keys()))
print("Example score entry:", next(iter(score_map.items()), None))
# Step 4: Generate contrast pairs
contrast_pairs = self._generate_contrast_pairs(embedded, score_map, context)
# Step 5: Train model per dimension
trained_models = self.trainer.train_multidimensional_model(contrast_pairs)
self.logger.log(
"UnifiedMRQTrained",
{
"pair_count": len(contrast_pairs),
"dimensions": list(trained_models.keys()),
},
)
# Step 6: Save and log to DB
os.makedirs(self.output_dir, exist_ok=True)
for dim, model in trained_models.items():
path = os.path.join(self.output_dir, f"{dim}_mrq.pkl")
with open(path, "wb") as f:
pickle.dump(model, f)
pair_count = len([p for p in contrast_pairs if p["dimension"] == dim])
self.memory.session.add(
UnifiedMRQModelORM(
dimension=dim,
model_path=path,
pair_count=pair_count,
trainer_version="v1.0",
context={
"similarity_threshold": self.similarity_threshold,
"min_score_diff": self.min_score_difference,
},
)
)
self.memory.session.commit()
self.logger.log(
"UnifiedMRQModelsSaved", {"dimensions": list(trained_models.keys())}
)
context["unified_mrq_model_paths"] = {
dim: os.path.join(self.output_dir, f"{dim}_mrq.pkl")
for dim in trained_models
}
return context
...
def _generate_contrast_pairs(self, embedded: dict, score_map: dict, context: dict) -> list[dict]:
"""
Given a map of hypothesis_id -> (hypothesis_dict, embedding), and a score_map,
return all valid contrast pairs where two hypotheses have scores for the same dimensions.
"""
contrast_pairs = []
dim_seen = set()
all_ids = list(embedded.keys())
self.logger.log(
"ContrastPairGenerationStart",
{
"total_hypotheses": len(all_ids),
"score_map_keys": list(score_map.keys())[:10],
},
)
for i in range(len(all_ids)):
for j in range(i + 1, len(all_ids)):
id_a, id_b = all_ids[i], all_ids[j]
if id_a not in score_map or id_b not in score_map:
continue
scores_a = score_map[id_a]
scores_b = score_map[id_b]
shared_dims = set(scores_a.keys()) & set(scores_b.keys())
for dim in shared_dims:
score_a = scores_a[dim]
score_b = scores_b[dim]
# Skip if scores are equal
if score_a == score_b:
continue
dim_seen.add(dim)
# Get embedding vectors
emb_a = embedded[id_a][1]
emb_b = embedded[id_b][1]
if emb_a is None or emb_b is None:
self.logger.log(
"MissingEmbeddingInContrast",
{"id_a": id_a, "id_b": id_b, "dim": dim},
)
continue
preferred = "a" if score_a > score_b else "b"
pair = {
"dimension": dim,
"prompt": context.get("goal").get("goal_text"), # Optional: use goal or reasoning task if desired
"output_a": embedded[id_a][0]["text"],
"output_b": embedded[id_b][0]["text"],
"preferred": preferred,
}
contrast_pairs.append(pair)
self.logger.log(
"ContrastPairGenerationComplete",
{
"pairs_generated": len(contrast_pairs),
"dimensions_covered": list(dim_seen),
},
)
return contrast_pairs
The full lifecycle of our multi-dimensional MR.Q training pipeline:
```mermaid
flowchart TD
    subgraph Input
        A[Pipeline Outputs Hypotheses] --> B[ScoreEvaluator Scores Each Hypothesis]
        B --> C[Scores Stored in ScoreORM per Dimension]
    end
    subgraph Contrast Pair Generation
        C --> D[Extract High vs Low Scoring Pairs per Dimension]
        D --> E[Create Contrast Pairs A gt B, with Embeddings]
    end
    subgraph MRQ Training
        E --> F[Train MRQ Model per Dimension]
        F --> G[Save Trained Predictors]
    end
    subgraph Inference + Tuning
        G --> H[Apply MRQ Model to New Hypotheses]
        H --> I[Score Predicted Quality per Dimension]
        I --> J[Select or Generate Best Hypothesis max weighted score]
    end
    J --> K[Feed Best Hypothesis into Next Pipeline Stage]
    style A fill:#f0f8ff
    style K fill:#e6ffe6
    style G fill:#fff5e6
```
1. Collect Hypotheses & Scores
hypotheses = context.get("hypotheses") or self.memory.hypotheses.get_all()
scores = self.memory.scores.get_all()
You pull all hypotheses and their associated scores from memory, including multi-dimensional breakdowns like correctness, originality, etc.
2. Index Embeddings for Similarity Matching
embedded = self._index_embeddings(hypotheses)
Each hypothesis is embedded and indexed, allowing efficient similarity comparisons via cosine similarity.
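Conceptually, the indexing step boils down to something like this (a sketch; `index_embeddings` and `embed_fn` are illustrative stand-ins for `_index_embeddings` and the cached `self.memory.embedding.get_or_create(...)` lookup):

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def index_embeddings(hypotheses: list[dict], embed_fn) -> dict:
    """Map hypothesis_id -> (hypothesis_dict, embedding vector)."""
    return {h["id"]: (h, np.asarray(embed_fn(h["text"]))) for h in hypotheses}


# embedded = index_embeddings(hypotheses, embed_fn=my_embedder)
# cosine_similarity(embedded["h1"][1], embedded["h2"][1]) > 0.85  -> candidate contrast pair
```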
3. Group Scores by Hypothesis and Dimension
score_map = self._group_scores(scores)
Scores are organized into a map: `{hypothesis_id -> {dimension -> score}}`.
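A sketch of the grouping step, assuming each score row carries an `evaluation_id` that maps back to a hypothesis (the real `_group_scores` may differ in detail):

```python
from collections import defaultdict


def group_scores(scores, evaluation_to_hypothesis: dict) -> dict:
    """Build {hypothesis_id: {dimension: score}} from flat score rows."""
    score_map = defaultdict(dict)
    for s in scores:
        hyp_id = evaluation_to_hypothesis[s.evaluation_id]
        score_map[hyp_id][s.dimension] = s.score
    return dict(score_map)
```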
4. Generate Contrastive Pairs
contrast_pairs = self._generate_contrast_pairs(embedded, score_map)
This is the heart of the training logic:
- For similar hypotheses (`cos_sim > threshold`)
- Where the score difference on any dimension exceeds a threshold
- You form a contrast pair: `(better_hypothesis, worse_hypothesis, dimension)`
These pairs teach the model what makes one hypothesis better than another along each axis.
{
"dimension": "clarity",
"prompt": "What are the key takeaways from the ReAct paper?",
"output_a": "The ReAct paper proposes a framework where lanHeyguage models can both reason and act in an environment. This allows them to interleave natural language reasoning steps with actions like tool use or web search.",
"output_b": "Itโs like a paper about reasoning and doing things I guess, using like tools and stuff during the process.",
"preferred": "a"
}
Explanation:

- dimension: The score axis you're training; in this case, clarity.
- prompt: The task or query under evaluation.
- output_a vs output_b: Two model responses with different qualities.
- preferred: Indicates which output is preferred with respect to the given dimension.
5. Train Per-Dimension Preference Models
trained_models = self.trainer.train_multidimensional_model(contrast_pairs)
You train a separate model per dimension: e.g., a model that knows what makes a hypothesis "more correct" than another.
This opens the door to targeted optimization during hypothesis generation.
6. Save Models and Log to DB
with open(path, "wb") as f:
pickle.dump(model, f)
self.memory.session.add(UnifiedMRQModelORM(...))
Each trained model is persisted and tracked in the database, making it available for downstream agents.
๐ง Training the Multi-Dimensional Critic
After generating hypotheses and scoring them across multiple dimensions (e.g., correctness, originality, clarity), we move to the core of the system: training a multidimensional evaluator that can predict and rank hypotheses just like our LLM-based rubrics.
This is where the MR.Q framework shines.
## Building Contrast Pairs

First, we convert raw scores into contrastive training data:

- For every pair of hypotheses scored on the same dimension, we form a training pair:
  { "dimension": "correctness", "output_a": "Hypothesis A text", "output_b": "Hypothesis B text", "preferred": "a" }
- We compute one pair per dimension when the scores are unequal; this produces a set of labeled preferences per evaluation run.
This pairing strategy generalizes MR.Q's binary preference format to multiple independent scorers.
## Embedding the Hypotheses

To avoid rerunning LLMs during training, we use cached embeddings:

- Each hypothesis is encoded using a custom embedding pipeline (`self.memory.embedding.get_or_create(...)`)
- The TextEncoder combines the prompt and response into a joint latent vector using an MLP
We compute a difference vector between preferred and non-preferred responses:
diff = zsa_a - zsa_b if preferred == "a" else zsa_b - zsa_a
Here `zsa_a` is the encoded latent vector representing hypothesis A (in the context of the prompt), and `zsa_b` is the same for hypothesis B.

These diffs serve as inputs for training a `HypothesisValuePredictor`, a simple MLP that learns to predict preference directionality from embedding deltas.
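For intuition, a `HypothesisValuePredictor` along these lines could be as small as the following PyTorch sketch (the layer sizes and loss are assumptions, not the exact Co AI architecture):

```python
import torch
import torch.nn as nn


class HypothesisValuePredictor(nn.Module):
    """Predicts preference directionality from an embedding difference vector."""

    def __init__(self, dim: int = 1024, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # > 0 means "a preferred", < 0 means "b preferred"
        )

    def forward(self, diff: torch.Tensor) -> torch.Tensor:
        return self.net(diff).squeeze(-1)


# Training signal: since diff = z_preferred - z_rejected, every target is 1, so a
# binary cross-entropy loss against an all-ones label vector works as a sketch:
# loss = nn.BCEWithLogitsLoss()(model(diff_batch), torch.ones(len(diff_batch)))
```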
## Multidimensional Training Loop
The trainer then iterates over contrast pairs grouped by dimension:
for dim, pairs in dimension_to_pairs.items():
dataloader = self.prepare_training_data(pairs)
self.train_model_for_dimension(dim, dataloader)
Each dimension gets its own trained model, which learns from its own contrastive signals. We use early stopping and log metrics like average loss and convergence status.
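A per-dimension training step with early stopping might look roughly like this (a sketch assuming the `HypothesisValuePredictor` above and a dataloader of difference vectors; hyperparameters are illustrative):

```python
import torch
import torch.nn as nn


def train_model_for_dimension(dataloader, dim_name: str, epochs: int = 20, patience: int = 3):
    model = HypothesisValuePredictor()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()
    best_loss, bad_epochs = float("inf"), 0

    for epoch in range(epochs):
        total, batches = 0.0, 0
        for diffs in dataloader:                      # each batch: preferred-minus-rejected diffs
            logits = model(diffs)
            loss = loss_fn(logits, torch.ones_like(logits))
            opt.zero_grad()
            loss.backward()
            opt.step()
            total, batches = total + loss.item(), batches + 1
        avg = total / max(batches, 1)
        print(f"[{dim_name}] Epoch {epoch + 1}/{epochs}: avg_loss={avg:.5f}")
        if avg < best_loss - 1e-4:
            best_loss, bad_epochs = avg, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                # early stopping
                print(f"Early stopping at epoch {epoch + 1}. Best loss: {best_loss:.5f}")
                break
    return model
```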
## Example Training Log
Here’s a sample log from training on three dimensions:
Training model for 'correctness' [pairs: 14]
Epoch 1/20: avg_loss=0.69312
Epoch 2/20: avg_loss=0.51123
...
Early stopping triggered at epoch 7. Best loss: 0.27348
Training model for 'originality' [pairs: 9]
...
Training model for 'clarity' [pairs: 12]
...
Each dimension is treated as a distinct ranking problem, enabling fine-grained learning from different types of feedback.
## What This Enables
Once trained, these models can:
- Score new hypotheses instantly (no LLM call needed)
- Guide future generations by selecting high-scoring candidates
- Evaluate prompt variants or strategy choices during tuning
The goal is to replace LLM scorers with fast, local, preference-aligned critics, one per dimension, and plug them into a fully self-improving pipeline.
## The Impact of Dimensional Learning

Traditional reward modeling often collapses everything into a single scalar reward signal. But this approach has limitations, especially when reasoning about complex outputs like hypotheses.
By preserving dimensional structure, we gain several advantages:
Benefit | Description |
---|---|
Interpretable Learning | We know which model learned which aspect of quality |
Targeted Optimization | During generation, we can favor hypotheses that score higher on specific dimensions |
Feedback Loop | Better hypotheses lead to richer training data, improving models iteratively |
Self-Tuning Agents | These models can be used to dynamically adjust prompt strategies or rule weights |
## Next Steps: Building a Self-Tuning Hypothesis Engine
1. Integrate Trained Models into Generation
Once you have trained models for each dimension, plug them into your generator to rank or re-rank hypotheses based on predicted dimensional scores.
Example usage:
predicted_scores = {
dim: model.predict([hyp_text])
for dim, model in loaded_models.items()
}
Then select the hypothesis with the best weighted combination of predicted scores.
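For example, a weighted selection over candidate hypotheses could look like this (a sketch; `loaded_models`, the per-dimension weights, and the `predict` return shape are assumptions based on the snippet above):

```python
def select_best_hypothesis(candidates: list[str], loaded_models: dict, weights: dict) -> str:
    """Rank candidate texts by the weighted sum of predicted dimensional scores."""
    def weighted_score(text: str) -> float:
        return sum(
            weights.get(dim, 1.0) * float(model.predict([text])[0])
            for dim, model in loaded_models.items()
        )
    return max(candidates, key=weighted_score)


# best = select_best_hypothesis(variants, loaded_models, {"correctness": 1.2, "clarity": 0.8})
```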
2. Use SVM or MLP-Based Rankers
Right now, you're using `MRQTrainer`. You could evolve this to support:
- SVM-based rankers (classic pairwise preference learning)
- Neural rankers (e.g., BERT + head for dimension-specific ranking)
- Multi-task models that predict all dimensions at once
This gives you flexibility depending on your performance vs. interpretability trade-off.
3. Build a Rule Tuning Loop Using Score Deltas
With access to both actual and predicted scores, you can identify where predictions diverge from reality and use those deltas to:
- Tune symbolic rules via `RuleTunerAgent`
- Adjust prompt templates
- Refine embedding strategies
- Evolve the scoring rubric itself
4. Enable Real-Time Feedback During Generation
Eventually, integrate these models into the generation process itself for example, using beam search guided by real-time dimensional feedback.
This allows you to generate hypotheses that are optimized for multiple criteria simultaneously, rather than relying on post-hoc filtering.
5. Structured Prompt Evaluation
Use scores to filter and mutate generation prompts.
6. Symbolic Rule Tuning
Use score deltas (before/after rule application) to identify which rules help or hurt.
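As a sketch, comparing scores before and after a rule is applied might look like this (the data shapes here are assumptions; in practice you would populate them from your own `ScoreORM` queries):

```python
from collections import defaultdict


def rule_impact(before: dict, after: dict) -> dict:
    """Average per-dimension score delta after a rule was applied.

    before/after: {hypothesis_id: {dimension: score}} for the same hypotheses.
    """
    deltas = defaultdict(list)
    for hyp_id, dims in after.items():
        for dim, score in dims.items():
            if hyp_id in before and dim in before[hyp_id]:
                deltas[dim].append(score - before[hyp_id][dim])
    return {dim: sum(vals) / len(vals) for dim, vals in deltas.items() if vals}


# A positive delta on "clarity" suggests the rule helped; a negative one suggests it hurt.
```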
7. Reward Model Training
- Treat dimensional scores as supervision targets.
- Train small models to predict scores from hypothesis features.
8. Meta-Scoring Agents
Build a `ScorePlannerAgent` that selects dimensions dynamically based on the goal type.
## Example Use Case: Sharpening Hypotheses with Dimensional Guidance

Imagine a `SharpeningAgent` that uses your trained models to ask:

"Which hypothesis is likely to score highest on clarity and originality?"
It generates variants, ranks them using your models, and returns the best.
This is generative refinement powered by dimensional understanding.
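A minimal sketch of such an agent, reusing the `select_best_hypothesis` helper from earlier (the `generate_variants` callable is a placeholder for whatever generation agent you already have):

```python
class SharpeningAgent:
    """Generates variants of a hypothesis and keeps the best one per the trained critics."""

    def __init__(self, loaded_models: dict, weights: dict, generate_variants):
        self.loaded_models = loaded_models
        self.weights = weights
        self.generate_variants = generate_variants   # callable: (hypothesis_text, n) -> list[str]

    def sharpen(self, hypothesis_text: str, n_variants: int = 5) -> str:
        candidates = [hypothesis_text] + self.generate_variants(hypothesis_text, n_variants)
        return select_best_hypothesis(candidates, self.loaded_models, self.weights)
```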
## Conclusion: Towards Truly Intelligent AI

This isn't just about getting better scores; it's about building a feedback loop that supports:

- Interpretability: Understand why something scored low.
- Symbolic Tuning: Use score deltas to optimize rules.
- Learning: Feed scores into reward models or preference datasets.
- Experimentation: Try different weights, parsers, and prompt styles.

It's the foundation for building smarter, self-improving agents: ones that don't just generate, but also reflect, analyze, and adapt.
## References

- Superwriter: Rubric-Guided Chain-of-Thought Improves Text Generation. Wengong Jin, Daniel Fried, Xinyun Chen, et al. arXiv:2310.01361. https://arxiv.org/abs/2310.01361
- RM-R1: Reward Modeling Should Reason, Not Just Score. Xueguang Ma, Yuntao Bai, et al. arXiv:2403.03505. https://arxiv.org/abs/2403.03505
- DPO: Direct Preference Optimization. Rafailov et al., 2023. https://arxiv.org/abs/2305.18290
- DSPy: Modular LLM Programming. Stanford NLP, 2024. GitHub: https://github.com/stanfordnlp/dspy
- Chain-of-Rubrics (CoR). Introduced in RM-R1 as a structured format for reward reasoning: `<rubric>`, `<eval>`, and `<answer>[[score]]</answer>`.
- RM-Bench. A benchmark dataset for evaluating reasoning-based reward models (introduced in RM-R1).
- MR.Q: Minimum Reasonable Quality Evaluator. A lightweight alternative to DPO for scoring and training using structured self-evaluation prompts.
## Glossary
Term | Definition |
---|---|
CoR (Chain-of-Rubrics) | A structured output format that guides an AI model to reason using a rubric before producing a final score or judgment. Typically includes sections like <rubric> , <eval> , and <answer>[[score]]</answer> . Inspired by the RM-R1 paper. |
RM-R1 | A research paper proposing Reasoning-as-Reward Modeling, where LLMs score outputs by reasoning through a rubric instead of producing a raw scalar value. Enables interpretable evaluation. |
DSPy | A modular prompt programming framework for building and chaining LLM-based modules (e.g., reasoners, evaluators) in Python. Used here to implement scoring and judgment agents. |
ScoreEvaluator | A configurable class in the Co AI system that uses dimensions (like correctness or feasibility) and structured prompts to evaluate hypotheses. Can return scalar or CoR-style scores. |
MR.Q | A lightweight reward modeling strategy that evaluates and ranks hypotheses using simple heuristics or LLM judgments. Used as a scoring backbone in the Sharpening and GeneralReasoner systems. |
Pipeline Run | A complete pass through a Co AI reasoning pipeline for a given goal. Each pipeline run generates hypotheses and scores them using configured evaluation agents. |
Hypothesis | A generated answer or proposed solution to a goal, typically created by an LLM agent in the Co AI system. Each hypothesis is scored across multiple dimensions. |
ScoreAnalyzer | A tool for analyzing score distributions across hypotheses and dimensions. Supports statistical summary, PCA, and clustering to reveal performance patterns. |
Dimension (Metric) | An evaluation category such as “correctness”, “feasibility”, or “insightfulness”. Each dimension has its own rubric, weight, and parser. |
Parser | A function that extracts a numeric score from a model’s evaluation output. In CoR mode, this may involve extracting values from <answer>[[score]]</answer> tags. |
Rubric | A structured set of evaluation criteria that guide the model's reasoning before assigning a score. Rubrics are central to CoR and Superwriter-style evaluation. |
Final Score | A weighted aggregate of dimension scores, used to rank or compare hypotheses. Calculated automatically by ScoreEvaluator. |