A Novel Approach to Autonomous Research: Implementing NOVELSEEK with Modular AI Agents

Summary
AI research tools today are often narrow: one generates summaries, another ranks models, a third suggests ideas. But real scientific discovery isn't a single step; it's a pipeline. It's iterative, structured, and full of feedback loops.
In this post, I show how to build a modular AI system that mirrors this full research lifecycle. From initial idea generation to method planning, each phase is handled by a specialized agent working in concert.
This implementation is inspired by the recent paper:
NOVELSEEK: When Agent Becomes the Scientist
We've implemented the complete Idea-to-Methodology Construction loop:
- Autonomous idea generation
- Multi-round hypothesis refinement
- Preference learning and evaluation
- Semantic memory & knowledge retrieval
- Hypothesis evolution via mutation & grafting
- Structured planning of experimental methods
This aligns closely with NOVELSEEK's vision of agent-driven research:
“Ideas are generated, refined, scored, and evolved into structured methodologies.” “Each idea is mapped to testable components before being executed.”
Let's explore the architecture, the agents, and how this system begins to act like a scientist, one step at a time.
What Is NOVELSEEK?
From the paper:
“NOVELSEEK autonomously generates scientific hypotheses, transforms them into executable methodologies, and validates them via closed-loop experiments.”
The core stages include:
- Self-Evolving Idea Generation
- Idea-to-Methodology Construction
- Evolutionary Experimental Planning
In this post we’re focusing on replicating the first two stages:
graph LR A[Idea Innovation] --> B[Idea Sharpening] B --> C[Idea Evaluation] C --> D[Idea Evolution] D --> E[Method Development]
And preparing for future integration with code execution tools like Aider or OpenHands.
Our Pipeline Overview
Here’s how our current autonomous research loop looks:
graph TD A[Goal] --> B[(SurveyAgent)] B --> C[(SearchOrchestratorAgent)] C --> D[(IdeaInnovationAgent)] D --> E[(IdeaSharpeningAgent)] E --> F[(RankingAgent - Elo-style)] F --> G[(IdeaEvaluatorAgent - Mr Q)] G --> H[(IdeaEvolutionAgent)] H --> I[(MethodPlannerAgent)] I --> J[(Next Round)]
This mirrors NOVELSEEK's:
“Multi-round experimental planning and execution”
“Each idea is evolved into 3 variants; top performers selected based on scoring.”
“Ideas are mapped to testable components before being executed.”
Core Components of co_ai
1. Goal Definition
Every experiment starts with a goal:
goal:
id: 1
goal_text: "Will AI ever be able to reprogram itself?"
focus_area: "meta_learning"
strategy: "graph_attention_with_positional_embeddings"
baseline_method: "Standard transformer-based LLM with static prompt."
Goals define:
- What we’re trying to prove/disprove
- The domain (chemistry, nlp, etc.)
- Baseline used for comparison
- Strategy for improvement
2. The pipeline
pipeline:
name: default_pipeline
description: "NOVELSEEK pipeline for exploring the question: 'Will AI ever be able to reprogram itself?'"
stages:
- name: survey
cls: co_ai.agents.survey.SurveyAgent
enabled: true
iterations: 1
- name: search_orchestrator
cls: co_ai.agents.search_orchestrator.SearchOrchestratorAgent
enabled: false
iterations: 1
- name: knowledge_loader
cls: co_ai.agents.knowledge_loader.KnowledgeLoaderAgent
enabled: true
iterations: 1
- name: idea_innovation
cls: co_ai.agents.idea_innovation.IdeaInnovationAgent
enabled: true
iterations: 1
- name: idea_sharpening
cls: co_ai.agents.idea_sharpening.IdeaSharpeningAgent
enabled: true
iterations: 1
- name: ranking
cls: co_ai.agents.ranking.RankingAgent
enabled: true
iterations: 1
- name: idea_evaluator
cls: co_ai.agents.idea_evaluator.IdeaEvaluatorAgent
enabled: true
iterations: 1
- name: idea_evolution
cls: co_ai.agents.idea_evolution.IdeaEvolutionAgent
enabled: true
iterations: 3
- name: method_planner
cls: co_ai.agents.method_planner.MethodPlannerAgent
enabled: true
iterations: 1
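How these stages are executed isn't shown above, so here is a minimal sketch of a stage runner (the class resolution and constructor signature are assumptions based on the config; the actual co_ai runner may differ):
import importlib

async def run_pipeline(pipeline_cfg: dict, context: dict, memory=None, logger=None) -> dict:
    # Resolve each enabled stage's class from its dotted path, instantiate it,
    # and thread the shared context dict through the agents in order.
    for stage in pipeline_cfg["stages"]:
        if not stage.get("enabled", True):
            continue
        module_path, class_name = stage["cls"].rsplit(".", 1)
        agent_cls = getattr(importlib.import_module(module_path), class_name)
        agent = agent_cls(stage, memory=memory, logger=logger)
        for _ in range(stage.get("iterations", 1)):
            context = await agent.run(context)
    return context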
SurveyAgent - Query Generation
Generates adaptive search queries from goal + baseline + preferences.
graph LR A[Goal] --> B[(SurveyAgent)]:::metallicBlue B --> C[(SearchOrchestratorAgent)] C --> D[(IdeaInnovationAgent)] D --> E[(IdeaSharpeningAgent)] E --> F[(RankingAgent - Elo-style)] F --> G[(IdeaEvaluatorAgent - Mr Q)] G --> H[(IdeaEvolutionAgent)] H --> I[(MethodPlannerAgent)] I --> J[(Next Round)] classDef metallicBlue fill:#3A9BDC,stroke:#1F4F82,stroke-width:2px,color:#fff;
# co_ai/agents/survey.py
from co_ai.agents.base import BaseAgent
from co_ai.constants import GOAL
class SurveyAgent(BaseAgent):
"""
The Survey Agent generates adaptive search queries for literature exploration.
From the paper:
> 'The Survey Agent deconstructs the research task into multiple keyword combinations'
> 'It supports two distinct modes: literature review mode and deep research mode'
> 'Each idea is mapped to testable components before being executed'
"""
def __init__(self, cfg, memory=None, logger=None):
super().__init__(cfg, memory, logger)
self.max_queries = cfg.get("max_queries", 5)
self.strategy = cfg.get("strategy", "default")  # search mode, logged alongside the generated queries
async def run(self, context: dict) -> dict:
goal = context.get(GOAL, {})
if not goal:
self.logger.log("NoGoalProvided", {"reason": "survey_agent_skipped"})
return context
# Generate new queries based on goal + baseline + preferences
prompt_context = {
"goal_text": goal.get("goal_text"),
"focus_area": goal.get("focus_area"),
"baseline_method": context.get("baseline_method", ""),
"preferences": context.get("preferences", ["novelty", "feasibility"]),
"previous_ideas": context.get("ideas", [])
}
merged = {**self.cfg, **prompt_context}
prompt = self.prompt_loader.load_prompt(self.cfg, merged)
raw_output = self.call_llm(prompt, context)
queries = self._parse_query_response(goal, raw_output)
# Store in context for SearchOrchestratorAgent
context["search_queries"] = queries
context["search_strategy"] = self.strategy
self.logger.log("SurveyQueriesGenerated", {
"queries": queries,
"strategy_used": self.strategy,
"pipeline_stage": context.get("pipeline_stage")
})
return context
def _parse_query_response(self, goal, response: str) -> list:
"""Parse LLM output into clean list of search queries"""
lines = [line.strip() for line in response.splitlines() if line.strip()]
if not lines:
# Fallback strategy
return [
f"{goal.get('focus_area')} machine learning",
f"{goal.get('goal_text')}"
]
return lines[:self.max_queries]
def expand_queries_to_goals(self, queries: list, base_goal: dict) -> list:
"""
Convert queries into sub-goals for future pipeline stages
Args:
queries (list): Generated search strings
base_goal (dict): Original goal
Returns:
list: List of structured sub-goals
"""
return [
{
"goal_text": q,
"parent_goal": base_goal.get("goal_text"),
"focus_area": base_goal.get("focus_area"),
"strategy": base_goal.get("strategy"),
"source": "survey_agent"
}
for q in queries
]
Prompt Template: survey.txt
You are the Survey Agent. Generate adaptive search queries for literature exploration.
Goal: {{ goal.goal_text }}
Focus Area: {{ goal.focus_area }}
Baseline Method: {{ baseline_method }}
Research Preferences: {{ preferences }}
Previous Ideas:
{% for idea in previous_ideas %}
- "{{ idea }}"
{% endfor %}
Generate up to {{ max_queries }} search queries that would help us understand the current state of research around this topic.
Return only the queries, one per line.
Example output:
Self-modifying AI architectures
LLM introspection and reflection
Safety constraints for autonomous reprogramming
Dynamic model architecture adaptation
AI systems evolving over time
These queries feed into downstream agents like the SearchOrchestratorAgent.
SearchOrchestratorAgent: Choosing the Right Tool for the Job
graph LR A[Goal] --> B[(SurveyAgent)] B --> C[(SearchOrchestratorAgent)]:::metallicBlue C --> D[(IdeaInnovationAgent)] D --> E[(IdeaSharpeningAgent)] E --> F[(RankingAgent - Elo-style)] F --> G[(IdeaEvaluatorAgent - Mr Q)] G --> H[(IdeaEvolutionAgent)] H --> I[(MethodPlannerAgent)] I --> J[(Next Round)] classDef metallicBlue fill:#3A9BDC,stroke:#1F4F82,stroke-width:2px,color:#fff;
Once the SurveyAgent generates structured queries from the user's research goal, the SearchOrchestratorAgent takes over to intelligently route each query to one of several specialized search tools. This mimics a skilled research assistant deciding where to look based on what you're asking.
The Real Power Lies in the Tools
While large language models are impressive in their reasoning and generation capabilities, they are ultimately bounded by the static knowledge they were trained on. The real research value of this system doesn't come from what the model already knows, but from what it can retrieve and integrate dynamically through tools. In the SearchOrchestratorAgent, five distinct tools (Arxiv, HuggingFace, Wikipedia, cosine similarity search, and local WebSearch via SearXNG) form a knowledge augmentation layer. These aren't just add-ons; they are essential research interfaces.
The quality of insights the system can generate depends directly on the accuracy, relevance, and coverage of the information returned by these tools. In many ways, these tools are the reality check: they ground the AI's creativity in what's actually happening in the world of science and data. As we scale this system further, building richer, more accurate, and more specialized tools will be key to making these AI research agents not just plausible, but truly useful collaborators in knowledge discovery.
ArxivTool
For scientific papers and methodological insight.
- When it's used: The query or goal indicates a need for peer-reviewed literature, new models, or baseline comparisons.
- What it does: Searches arXiv.org for recent research papers using the query.
- Use cases:
  - "transformer-based anomaly detection"
  - "zero-shot learning methods"
HuggingFaceTool
For datasets and model repositories.
- When it's used: The goal mentions “datasets”, “data collection”, or is classified as a data_search task.
- What it does: Searches the Hugging Face Hub for datasets that match the research query.
- Use cases:
  - "multilingual text classification dataset"
  - "sentiment analysis for medical notes"
WikipediaTool
For concept grounding and general background knowledge.
- When it's used: The goal is categorized as background research or includes words like “overview” or “definition”.
- What it does: Performs a similarity-ranked search over Wikipedia entries using cosine similarity.
- Use cases:
  - "definition of generative AI"
  - "overview of reinforcement learning"
WebSearchTool
For general exploration when the intent is unclear or cross-domain.
- When it's used: When no strong match is found via metadata or similarity. Acts as a catch-all fallback.
- What it does: Runs a broad web search and retrieves summaries + URLs.
- Use cases:
  - "AI startup funding trends 2024"
  - "open-source LLM deployment on edge devices"
Why We Use SearXNG for Web Search
In building a dynamic, AI-driven research assistant, one of the trickiest components is reliable, configurable web search. Many third-party tools and APIs have unpredictable rate limits, require API keys, or yield inconsistent formats.
We tested multiple solutions, but SearXNG emerged as the clear winner.
Why SearXNG?
- Self-hostable: You can run it locally or privately, ensuring no external tracking or throttling.
- Controllable: Unlike cloud APIs that are black boxes, SearXNG is easy to sandbox, monitor, or override.
- Consistent JSON structure: It returns clean, parseable output, ideal for AI tools to consume and summarize.
- Fast and flexible: It's optimized for quick retrieval over large domains and adapts to different query structures easily.
Integration Simplicity
We connected the WebSearchTool to SearXNG via a small config file. This lets the SearchOrchestratorAgent seamlessly route web-style queries (fallbacks or broad information requests) to a fast, local search engine.
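The config is small; something like this (the key names mirror the options the WebSearchTool reads below, but the exact file layout is an assumption, not the canonical co_ai schema):
web_search:
  instance_url: "http://localhost:8080"   # local SearXNG instance
  max_results: 15
  fetch_page: false      # set true to also download and extract full pages
  categories: "general"
  language: "en"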
I run it in Docker:
services:
searxng:
image: searxng/searxng
ports:
- "8080:8080"
environment:
- SEARXNG_PORT=8080
- SEARXNG_BASE_URL=http://localhost:8080
WebSearchTool: A Local Web Search Wrapper for SearXNG
This class provides a fast and lightweight interface to the SearXNG engine. It enables agents in the pipeline to issue real-time web queries and retrieve structured, parseable results.
Key Capabilities:
- Asynchronous Search: Uses httpx for efficient, non-blocking web queries.
- Customizable Parameters: Supports tuning language, categories, and result limits.
- HTML Parsing: Extracts titles, URLs, and snippets from SearXNG HTML responses using BeautifulSoup.
- Optional Full Page Fetching: If enabled, also fetches and stores the full HTML content of the page.
- Readable Text Extraction: Uses the readability package to extract clean, human-readable text for downstream summarization or embedding.
This module gives your AI agents reliable access to web data without relying on commercial APIs, making it ideal for local, privacy-preserving deployments.
import asyncio
import httpx
import requests
from bs4 import BeautifulSoup
from readability import Document
from co_ai.utils.file_utils import write_text_to_file
class WebSearchTool:
def __init__(self, cfg: dict, logger):
self.base_url = f'{cfg.get("instance_url", "http://localhost:8080")}/search'
self.max_results = cfg.get("max_results", 15)
self.fetch_page = cfg.get("fetch_page", False)
self.categories = cfg.get("categories", "general")
self.language = cfg.get("language", "en")
self.logger = logger
async def search(self, query: str, max_results: int = 15) -> list[str] | None:
max_results = max_results or self.max_results
params = {
"q": query,
"categories": "general",
"language": self.language,
"formats": ["html", "json"]
}
try:
async with httpx.AsyncClient(timeout=10.0) as client:
resp = await client.get(self.base_url, params=params)
resp.raise_for_status()
html = resp.text
except Exception as e:
print(f"Exception: {type(e).__name__}: {e}")
return None
return self.parse_searxng_results(html, max_results)
def parse_searxng_results(self, html: str, max_results:int=20):
soup = BeautifulSoup(html, "html.parser")
results = []
for i, article in enumerate(soup.find_all("article", class_="result")):
if i >= max_results:
break
link_tag = article.find("a", class_="url_header")
href = link_tag["href"] if link_tag else None
title_tag = article.find("h3")
title = title_tag.get_text(strip=True) if title_tag else None
snippet_tag = article.find("p", class_="content")
snippet = snippet_tag.get_text(strip=True) if snippet_tag else None
cleaned_page = ""
if self.fetch_page:
cleaned_page = self.fetch_html(href)
if href and title:
results.append(
{
"title": title,
"url": href,
"snippet": snippet,
"page": cleand_page,
}
)
return results
def fetch_html(self, url: str) -> str | None:
headers = {"User-Agent": "Mozilla/5.0"}
try:
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
return response.text
except requests.RequestException as e:
if self.logger:
self.logger.log("FetchHTMLFailed", {"url": url, "error": str(e)})
return None # or return ""
def fetch_and_parse_readable(self, url:str):
html = self.fetch_html(url)
title, clean_text = self.extract_main_text(html)
return {"url": url, "title": title, "text": clean_text}
def extract_main_text(self, html):
doc = Document(html)
title = doc.short_title()
summary_html = doc.summary()
# Use BeautifulSoup to clean text
soup = BeautifulSoup(summary_html, 'html.parser')
clean_text = soup.get_text(separator='\n', strip=True)
return title, clean_text
Cosine Similarity Tool
For semantic routing fallback and phrase matching.
- What it does: Embeds the query and compares it to known intent phrases using cosine similarity. Helps infer intent like "this sounds like a paper query" or "this feels like a dataset lookup".
- Why it’s useful: When queries are ambiguous or phrased in non-obvious ways.
import numpy as np
from typing import List, Tuple

def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
"""Compute cosine similarity between two vectors."""
dot = np.dot(vec1, vec2)
norm1 = np.linalg.norm(vec1)
norm2 = np.linalg.norm(vec2)
return dot / (norm1 * norm2 + 1e-8) # Avoid division by zero
def get_top_k_similar(
query: str,
documents: List[str],
memory,
top_k: int = 5
) -> List[Tuple[str, float]]:
"""
Compute similarity between query and each document, return top_k most similar.
Args:
query: The input query text.
documents: A list of document strings.
memory: Object exposing embedding.get_or_create(text) to produce a vector (np.ndarray).
top_k: Number of top results to return.
Returns:
List of (document, similarity_score) tuples.
"""
query_vec = memory.embedding.get_or_create(query)
doc_vecs = [memory.embedding.get_or_create(doc) for doc in documents]
similarities = [cosine_similarity(query_vec, vec) for vec in doc_vecs]
scored = list(zip(documents, similarities))
scored.sort(key=lambda x: x[1], reverse=True)
return scored[:top_k]
How Routing Works
The SearchOrchestratorAgent
follows this decision-making pipeline:
- Metadata-based routing: Checks the goal type and query keywords against known heuristics.
- Semantic fallback: If unclear, uses the cosine similarity tool to match against a curated set of intent templates.
- Executes tool: Sends the query to the selected tool and enriches results with goal metadata.
- Stores: Results are stored in the search_results database table and returned to context for downstream use.
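The orchestrator's code is longer than is useful to show here, but the core decision can be sketched roughly as follows (the intent phrases, threshold, and tool names are illustrative assumptions; only the keyword-then-similarity fallback structure mirrors the steps above):
# Rough sketch of the routing decision, not the actual SearchOrchestratorAgent
INTENT_PHRASES = {
    "arxiv": "find peer-reviewed papers, methods, and baselines",
    "huggingface": "find datasets or pretrained models",
    "wikipedia": "get a definition or background overview",
}

def route_query(query: str, goal: dict, memory) -> str:
    """Return the name of the tool that should handle this query."""
    q = query.lower()
    # 1. Metadata-based routing: goal type and query keywords
    if "dataset" in q or goal.get("goal_type") == "data_search":
        return "huggingface"
    if "definition" in q or "overview" in q:
        return "wikipedia"
    if "paper" in q or "method" in q or "baseline" in q:
        return "arxiv"
    # 2. Semantic fallback: reuse get_top_k_similar from the cosine similarity tool
    ranked = get_top_k_similar(q, list(INTENT_PHRASES.values()), memory, top_k=1)
    if ranked and ranked[0][1] > 0.5:
        best_phrase = ranked[0][0]
        return next(name for name, phrase in INTENT_PHRASES.items() if phrase == best_phrase)
    # 3. Catch-all: broad web search via SearXNG
    return "websearch"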
Reflections on Knowledge Base Quality: Dynamic vs. Curated
As powerful as automated tools like the SearchOrchestratorAgent are, this experiment revealed a crucial limitation: the quality of the dynamically constructed knowledge base often lags behind expectations.
What We Observed
When queries are automatically generated and routed, for instance through the SurveyAgent and then fanned out by the SearchOrchestratorAgent, the downstream knowledge base (a collection of summaries, papers, or dataset descriptions) tends to suffer from low relevance, redundancy, or superficial depth.
This happens because:
- Automatically generated queries may not align tightly with the actual research goal.
- Retrieved content often lacks domain specificity.
- LLM-based summarization can flatten nuance or miss key points.
Key Takeaway
Manually curated or hand-tuned knowledge bases still outperform automated ones in meaningful, focused research.
When you personally select papers, distill key points, and structure the background context, you end up with a knowledge base that's:
- Richer in insights.
- Tighter in focus.
- More reusable for downstream agents like hypothesis generators or evaluators.
Strategic Implication
If you’re building a self-evolving or reflective AI research system, you might consider:
- Starting with a human-curated core knowledge base, and
- Letting the AI augment it dynamically rather than letting AI fully control it from scratch.
This hybrid approach respects the current limits of LLM-based search + summarization and acknowledges that “what you feed the pipeline determines what you get out.”
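One lightweight way to implement that hybrid (a sketch; the field names and weights are assumptions, not part of co_ai) is to tag every knowledge-base entry with its provenance so downstream agents can weight curated material more heavily than automatically retrieved material:
def make_knowledge_entry(title: str, summary: str, url: str, curated: bool) -> dict:
    """Tag a knowledge-base entry with provenance so curated items can be preferred later."""
    return {
        "title": title,
        "summary": summary,
        "url": url,
        "source": "human_curated" if curated else "auto_retrieved",
        "weight": 1.0 if curated else 0.5,  # simple prior favouring curated entries
    }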
Generating Novel Directions with the IdeaInnovationAgent
graph LR A[Goal] --> B[(SurveyAgent)] B --> C[(SearchOrchestratorAgent)] C --> D[(IdeaInnovationAgent)]:::metallicBlue D --> E[(IdeaSharpeningAgent)] E --> F[(RankingAgent - Elo-style)] F --> G[(IdeaEvaluatorAgent - Mr Q)] G --> H[(IdeaEvolutionAgent)] H --> I[(MethodPlannerAgent)] I --> J[(Next Round)] classDef metallicBlue fill:#3A9BDC,stroke:#1F4F82,stroke-width:2px,color:#fff;
In the research pipeline, originality begins here. The IdeaInnovationAgent is responsible for translating background research, preferences, and strategic intent into concrete, novel research directions. Acting as a creative synthesizer, it absorbs context from the goal, survey results, and literature findings, then prompts a language model to propose abstract but actionable ideas that push beyond the current state of the art.
This agent marks the first leap from information-gathering to innovation. It's not just summarizing; it's imagining. Each output is a potential seed for a testable hypothesis, a paper, or even a new research agenda. By systematizing idea generation with structure, memory, and semantic grounding, this agent lays the intellectual foundation of the entire pipeline.
class IdeaInnovationAgent(BaseAgent):
def __init__(self, cfg, memory=None, logger=None):
super().__init__(cfg, memory, logger)
async def run(self, context: dict) -> dict:
goal = context.get(GOAL)
survey_results = context.get("survey_results", [])
search_results = context.get("search_results", [])
# Build prompt context
prompt_context = {
"goal_text": goal.get("goal_text"),
"focus_area": goal.get("focus_area"),
"goal_type": goal.get("goal_type"),
"strategy": goal.get("strategy"),
"survey_summary": self._summarize_results(survey_results),
"search_result_summaries": self._summarize_results(search_results),
"preferences": self.cfg.get("preferences", []),
}
merged = {**context, **prompt_context}
# Load and render prompt
prompt = self.prompt_loader.load_prompt(self.cfg, merged)
# Call LLM to generate ideas
raw_ideas = self.call_llm(prompt, merged)
# Parse and structure ideas
ideas = self._parse_raw_ideas(raw_ideas, goal)
# Store generated ideas
stored_ideas = self.memory.ideas.bulk_add_ideas(ideas)
# Update context with results
context["ideas"] = [idea.to_dict() for idea in stored_ideas]
context["idea_ids"] = [idea.id for idea in stored_ideas]
return context
def _summarize_results(self, results: list) -> str:
"""Converts list of result dicts into a summary string"""
if not results:
return "No prior research found."
summaries = []
for r in results[:5]: # limit to top 5 for brevity
title = r.get("title", "")
summary = r.get("summary", "")
summary = summary[:200] + "..." if len(summary) > 200 else summary
url = r.get("url", "")
summaries.append(f"- {title}: {summary} ({url})")
return "\n".join(summaries)
def _parse_raw_ideas(self, raw_text: str, goal: dict) -> list:
"""Parses raw LLM response into structured idea objects"""
lines = [line.strip() for line in raw_text.splitlines() if line.strip()]
ideas = []
for line in lines:
ideas.append({
"idea_text": line,
"parent_goal": goal.get("goal_text"),
"focus_area": goal.get("focus_area"),
"strategy": goal.get("strategy"),
"source": "generated_by_IdeaInnovationAgent",
"origin": "llm",
"extra_data": {}
})
return ideas
You are the Idea Innovation Agent.
Your task is to generate novel research directions based on the following inputs:
Goal: {{ goal }}
Focus Area: {{ focus_area }}
Baseline Methods: {{ baseline_methods }}
Literature Summary:
{{ literature_summary }}
Generate 5-10 innovative research directions that:
- Build on existing methods
- Are grounded in recent literature
- Propose meaningful technical changes
- Align with the research goal
Return only the list of ideas, one per line.
Matches NOVELSEEK's:
“Idea Innovation Agent generates novel directions based on prior knowledge”
“Ideas are turned into detailed methodologies using T: I × T × B × L → M”
IdeaSharpeningAgent - Refinement & Critique
Converts vague ideas into testable hypotheses using preference learning and reflection.
We covered sharpening previously in Self-Improving Agents: Applying the Sharpening Framework to Local LLMs; I reused the ideas from that post here.
graph LR A[Goal] --> B[(SurveyAgent)] B --> C[(SearchOrchestratorAgent)] C --> D[(IdeaInnovationAgent)] D --> E[(IdeaSharpeningAgent)]:::metallicBlue E --> F[(RankingAgent - Elo-style)] F --> G[(IdeaEvaluatorAgent - Mr Q)] G --> H[(IdeaEvolutionAgent)] H --> I[(MethodPlannerAgent)] I --> J[(Next Round)] classDef metallicBlue fill:#3A9BDC,stroke:#1F4F82,stroke-width:2px,color:#fff;
class IdeaSharpeningAgent(BaseAgent):
def __init__(self, cfg, memory=None, logger=None):
super().__init__(cfg, memory, logger)
self.target = cfg.get("target", "generation")
self.device = cfg.get("device", "cpu")
self.evaluator = MRQSelfEvaluator(memory, logger, device=self.device)
self.templates = cfg.get("templates", ["critic"])
self.save_count = cfg.get("save_count", 3)
async def run(self, context: dict) -> dict:
"""
Main execution loop for IdeaSharpeningAgent.
Takes a list of ideas, sharpens them using templates,
judges against baseline using evaluator, and logs results.
"""
goal = context.get(GOAL, {})
ideas = context.get("ideas", [])
if not ideas:
self.logger.log("NoIdeasToSharpen", {"reason": "empty_input"})
return context
sharpened_results = []
for idea in ideas:
idea_text = idea.get("idea_text")
result = await self._sharpen_and_evaluate(idea_text, goal, context)
sharpened_results.append(result)
# Sort by score
sharpened_results.sort(key=lambda x: x["score"], reverse=True)
# Update context
context["sharpened_ideas"] = [r["sharpened_hypothesis"] for r in sharpened_results]
context["scored_ideas"] = sharpened_results
best_idea = sharpened_results[0]["sharpened_hypothesis"]
context["top_idea"] = best_idea
hypotheses = context.get(HYPOTHESES, [])
if hypotheses:
# Find the hypothesis with the maximum confidence value
sorted_hyps = sorted(
hypotheses, key=lambda h: h.get("confidence", 0.0), reverse=True
)
# Keep only the top hypothesis
context[HYPOTHESES] = sorted_hyps[:self.save_count]
# For scoring later
context["baseline_hypotheses"] = sorted_hyps[-1]
return context
async def _sharpen_and_evaluate(self, idea: str, goal: dict, context: dict) -> dict:
# Build prompt for refinement
focus_area = goal.get("focus_area", "")
baselines = self.cfg.get("baselines")
baseline = baselines.get(focus_area, baselines.get("default"))
merged = {
"goal": goal,
"idea": idea,
"baseline": baseline,
"literature_summary": context.get("knowledge_base_summaries", []),
"examples": self.memory.hypotheses.get_similar_hypotheses(idea, limit=3),
"strategy": goal.get("strategy", "default"),
}
improved = None
winner = "original"
scores = {}
for name in self.templates:
prompt_template = self.prompt_loader.from_file(name, self.cfg, merged)
sharpened = self.call_llm(prompt_template, merged)
try:
preferred_output, scores = self.evaluator.judge(
goal=goal,
prompt=idea,
output_a=idea,
output_b=sharpened,
)
improved = preferred_output
winner = "b" if improved == sharpened else "a"
except Exception as e:
self.logger.log("IdeaSharpeningFailed", {"error": str(e)})
improved = idea
winner = "a"
scores = {"value_a": 5.0, "value_b": 5.0}
result = {
"template_used": name,
"original_idea": idea,
"sharpened_hypothesis": improved,
"winner": winner,
"improved": winner == "b",
"scores": scores,
"score": max(scores.values()),
"pipeline_stage": context.get(PIPELINE),
"prompt_template": prompt_template,
}
saved_hyp = self.save_improved(goal, idea, result, context)
if saved_hyp:
context.setdefault(HYPOTHESES, []).append(saved_hyp.to_dict())
return result
def save_improved(self, goal: dict, original_idea: str, result: dict, context: dict):
if not result["improved"]:
return None
sharpened = result["sharpened_hypothesis"]
prompt_id = self.memory.prompt.get_id_from_response(sharpened)
# Save to HypothesisORM
hyp = HypothesisORM(
goal_id=goal.get("id"),
text=sharpened,
prompt_id=prompt_id,
pipeline_signature=context.get(PIPELINE),
source="idea_sharpening_agent",
confidence=result["score"]
)
self.memory.hypotheses.insert(hyp)
self.logger.log(
"IdeaSharpenedAndSaved",
{
"prompt_snippet": original_idea[:100],
"response_snippet": sharpened[:100],
"score": result["score"],
},
)
return hyp
Supports NOVELSEEK's:
“Self-evolving idea generation with human-interactive feedback” “Each idea is refined into 3 variants; top performers selected based on scoring”
Once we've sharpened our ideas, we want to know how they stack up. That's where the RankingAgent comes in.
RankingAgent - Elo-Based Selection
Uses pairwise comparisons to rank hypotheses.
graph LR A[Goal] --> B[(SurveyAgent)] B --> C[(SearchOrchestratorAgent)] C --> D[(IdeaInnovationAgent)] D --> E[(IdeaSharpeningAgent)] E --> F[(RankingAgent - Elo-style)]:::metallicBlue F --> G[(IdeaEvaluatorAgent - Mr Q)] G --> H[(IdeaEvolutionAgent)] H --> I[(MethodPlannerAgent)] I --> J[(Next Round)] classDef metallicBlue fill:#3A9BDC,stroke:#1F4F82,stroke-width:2px,color:#fff;
Once the hypotheses are sharpened, the RankingAgent steps in to simulate a scientific tournament, comparing ideas head-to-head to identify the strongest contenders.
Inspired by the Elo rating system from competitive gaming and chess, this agent doesn't just score hypotheses in isolation. Instead, it pits them against each other using LLM-based pairwise comparisons. The result? A ranked list of hypotheses that reflects not only individual merit but also how each idea stacks up in direct competition.
From the paper:
“The Ranking agent employs an Elo-based tournament to assess and prioritize generated hypotheses.” “Top performers are selected for the next round.”
How it works
- Hypotheses are first initialized with a base Elo score (default: 750).
- A selection of pairwise matchups is generated using random or proximity-based sampling.
- The agent builds prompts comparing two hypotheses based on the user’s preferences (e.g., novelty, feasibility).
- The LLM selects a winner in each round.
- Elo scores are updated after each comparison, simulating win/loss dynamics.
- Top-ranked hypotheses are passed to the next stage for deeper evaluation and potential evolution.
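For reference, this is the standard Elo update that the _update_elo method below implements (the code additionally clamps ratings to the range [100, 2800], and K defaults to 32):

$$
E_1 = \frac{10^{R_1/400}}{10^{R_1/400} + 10^{R_2/400}}, \qquad
R_1 \leftarrow R_1 + K\,(S_1 - E_1), \qquad
R_2 \leftarrow R_2 + K\,(S_2 - E_2)
$$

where $S_1 = 1$ if hypothesis A wins the comparison (otherwise 0) and $S_2 = 1 - S_1$.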
This agent also supports multi-turn debates, proximity-based pairing strategies, and adaptive scoring based on match outcomes. It's a flexible, strategic layer that gives the pipeline a robust way to prioritize promising research directions.
By embedding structured comparison into the refinement loop, the RankingAgent brings rigor to the exploration process, ensuring only the most compelling ideas move forward in the research pipeline.
class RankingAgent(BaseAgent):
"""
The Ranking agent simulates scientific debate between hypotheses using a tournament-style approach.
From the paper:
> 'The Ranking agent employs an Elo-based tournament to assess and prioritize generated hypotheses'
"""
def __init__(self, cfg, memory=None, logger=None):
super().__init__(cfg, memory, logger)
self.elo_scores = {}
self.strategy = cfg.get("strategy", "debate")
self.max_comparisons = cfg.get("max_comparisons", 6)
self.initial_elo_score = cfg.get("initial_elo_score", 750)
self.win_history = []
self.preferences = cfg.get("preferences", ["novelty", "feasibility"])
async def run(self, context: dict) -> dict:
"""
Rank hypotheses using pairwise comparisons and Elo updates.
Args:
context: Dictionary with keys:
- hypotheses: list of hypothesis strings
- goal: research objective
- preferences: override criteria
"""
hypotheses = self.get_hypotheses(context)
if len(hypotheses) < 2:
self.logger.log("NotEnoughHypothesesForRanking", {
"count": len(hypotheses),
"reason": "less than 2 hypotheses"
})
context[self.output_key] = [(h, self.initial_elo_score) for h in hypotheses]
return context
self._initialize_elo(hypotheses)
pairs = list(itertools.combinations(hypotheses, 2))
comparisons = random.sample(pairs, k=min(self.max_comparisons, len(pairs)))
for hyp1, hyp2 in comparisons:
prompt = self._build_ranking_prompt(hyp1, hyp2, context)
response = self.call_llm(prompt, context)
winner = self._parse_response(response)
if winner:
self._update_elo(hyp1, hyp2, winner)
else:
self.logger.log(
"ComparisonParseFailed",
{
"prompt_snippet": prompt[:200],
"response_snippet": response[:300],
"agent": self.__class__.__name__,
},
)
ranked = sorted(self.elo_scores.items(), key=lambda x: x[1], reverse=True)
context[self.output_key] = ranked
self.logger.log(
"TournamentCompleted",
{
"total_hypotheses": len(ranked),
"win_loss_patterns": self._extract_win_loss_feedback(),
"preferences": self.preferences,
},
)
return context
def _initialize_elo(self, hypotheses):
for h in hypotheses:
text = h.get("text")
if text not in self.elo_scores:
self.elo_scores[text] = self.initial_elo_score
def _build_ranking_prompt(self, hyp1, hyp2, context):
return self.prompt_loader.load_prompt(
self.cfg,
{
**context,
"hypothesis_a": hyp1.get("text"),
"hypothesis_b": hyp2.get("text"),
},
)
def _conduct_multi_turn_debate(self, context:dict, hyp1:str, hyp2:str, turns:int=3):
"""Simulate multi-turn scientific debate between hypotheses"""
for i in range(turns):
prompt = self._build_ranking_prompt(hyp1, hyp2, context=context)
response = self.call_llm(prompt, context)
winner = self._parse_response(response)
if winner:
self._update_elo(hyp1, hyp2, winner)
else:
break
def _generate_pairwise_comparisons(self, hypotheses):
"""Generate combinations of hypothesis pairs for ranking"""
return itertools.combinations(hypotheses, 2)
def _generate_proximity_based_pairs(self, hypotheses):
"""Prioritize comparisons between similar hypotheses"""
similarities = [
(h1, h2, self._compute_similarity(h1, h2))
for h1, h2 in itertools.combinations(hypotheses, 2)
]
return sorted(similarities, key=lambda x: x[2], reverse=True)
def _extract_win_loss_feedback(self):
"""Return summary of which hypotheses won most often"""
win_counts = {}
for hyp1, hyp2, winner in self.win_history:
winner_hypothesis = hyp1 if winner == "A" else hyp2
win_counts[winner_hypothesis] = win_counts.get(winner_hypothesis, 0) + 1
return {
"top_performers": [
{"hypotheses": h, "wins": w}
for h, w in sorted(win_counts.items(), key=lambda x: x[1], reverse=True)
],
"total_matches": len(self.win_history),
"preferences_used": self.preferences
}
def _rank_pairwise(self, reviewed, context):
pairs = list(itertools.combinations(reviewed, 2))
if not pairs:
return
# Limit number of comparisons per round
comparisons = random.sample(pairs, k=min(self.cfg.get("max_comparisons", 6), len(pairs)))
for item1, item2 in comparisons:
hyp1 = item1["hypotheses"]
hyp2 = item2["hypotheses"]
merged = {**self.cfg, **{"hypothesis_a": hyp1, "hypothesis_b": hyp2}}
prompt = self.prompt_loader.load_prompt(merged, context=context)
self.logger.log("RankingCompare", {"hyp1": hyp1[:60], "hyp2":hyp2[:60]})
try:
response = self.call_llm(prompt, context)
winner = self._parse_response(response)
if winner:
self._update_elo(hyp1, hyp2, winner)
else:
self.logger.log("ComparisonParseFailed", {
"prompt_snippet": prompt[:200],
"response_snippet": response[:300]
})
except Exception as e:
self.logger.log(
"ComparisonError",
{"error": str(e), "hypotheses": [hyp1[:100], hyp2[:100]]},
)
def _update_elo(self, hyp1, hyp2, winner):
text1 = hyp1.get("text")
text2 = hyp2.get("text")
K = self.cfg.get("elo_k", 32)
R1 = 10 ** (self.elo_scores[text1] / 400)
R2 = 10 ** (self.elo_scores[text2] / 400)
E1 = R1 / (R1 + R2)
E2 = R2 / (R1 + R2)
S1 = 1 if winner == "A" else 0
S2 = 1 - S1
self.elo_scores[text1] = max(
100, min(2800, self.elo_scores[text1] + K * (S1 - E1))
)
self.elo_scores[text2] = max(
100, min(2800, self.elo_scores[text2] + K * (S2 - E2))
)
self.memory.hypotheses.update_elo_rating(hyp1.get("id"), self.elo_scores[text1])
self.memory.hypotheses.update_elo_rating(hyp2.get("id"), self.elo_scores[text2])
self.win_history.append((text1, text2, winner))
self.logger.log(
"RankingUpdated",
{
"hypothesis_a": text1,
"hypothesis_b": text2,
"winner": winner,
"elo_a": self.elo_scores[text1],
"elo_b": self.elo_scores[text2],
},
)
def _parse_response(self, response: str) -> Optional[str]:
"""
Try multiple methods to extract winner from LLM output
Returns:
'A' or 'B' based on comparison
"""
# Try matching structured formats first
structured_match = re.search(r"better[\s_]?hypothesis[^\w]*([AB12])", response, re.IGNORECASE)
if structured_match:
winner_key = structured_match.group(1).upper()
return "A" if winner_key in ("A", "1") else "B"
# Try matching natural language statements
lang_match = re.search(r"(?:prefer|choose|recommend|select)(\s+idea|\s+hypothesis)?[:\s]+([AB12])", response, re.IGNORECASE)
if lang_match:
winner_key = lang_match.group(2).upper()
return "A" if winner_key in ("A", "1") else "B"
# Try matching conclusion phrases
conclusion_match = re.search(r"conclude[d]?\s+with\s+better[\s_]idea:\s*(\d)", response, re.IGNORECASE)
if conclusion_match:
winner_key = conclusion_match.group(1)
return "A" if winner_key == "1" else "B"
# Fallback: no winner could be parsed
self.logger.log("ParseError", {
"error": "Could not extract winner from response",
"response": response
})
return None
Aligns with NOVELSEEK's:
“Ranking agent employs an Elo-based tournament to assess and prioritize generated hypotheses”
“Top performers are selected for next round.”
IdeaEvaluatorAgent (MR.Q) - Scoring
Autonomously scores hypotheses across dimensions.
graph LR A[Goal] --> B[(SurveyAgent)] B --> C[(SearchOrchestratorAgent)] C --> D[(IdeaInnovationAgent)] D --> E[(IdeaSharpeningAgent)] E --> F[(RankingAgent - Elo-style)] F --> G[(IdeaEvaluatorAgent - Mr Q)]:::metallicBlue G --> H[(IdeaEvolutionAgent)] H --> I[(MethodPlannerAgent)] I --> J[(Next Round)] classDef metallicBlue fill:#3A9BDC,stroke:#1F4F82,stroke-width:2px,color:#fff;
Once hypotheses have been sharpened and ranked, they must be rigorously evaluated to determine their merit across a spectrum of scientific criteria. The IdeaEvaluatorAgent performs this role with flexibility and precision, using either traditional LLM-based judgments or a local preference model called MR.Q.
This agent operates as the system's analytical conscience. It assesses each hypothesis based on dimensions such as:
- Coherence: Does the hypothesis make sense logically?
- Credibility: Is it scientifically plausible?
- Verifiability: Could it be tested in practice?
- Novelty: Does it add something new to the field?
- Alignment: Does it match the user's goal and preferences?
Depending on configuration, the agent can use:
- An LLM-based judge to simulate comparative reasoning (similar to DPO), or
- The MR.Q Self Evaluator, a lightweight reflection engine that learns user preferences from earlier feedback and refines hypotheses accordingly.
Each hypothesis is scored independently, and the best-performing ones are selected for the next stage of the pipeline. Scores and reasoning are logged, providing transparency and future trainability.
“Scores are combined using weighted summation” “Assessment Agent evaluates performance on multiple criteria: coherence, credibility, verifiability, novelty, and alignment.”
This structured evaluation ensures that only the most promising ideas advance: those that aren't just interesting, but also testable and aligned with research intent.
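The "weighted summation" mentioned above can be sketched in a few lines (the dimension weights here are illustrative defaults, not values from the paper or the co_ai config):
# Illustrative weights; tune per research goal
DEFAULT_WEIGHTS = {
    "coherence": 0.20,
    "credibility": 0.20,
    "verifiability": 0.20,
    "novelty": 0.25,
    "alignment": 0.15,
}

def combined_score(dimension_scores: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Weighted summation of per-dimension scores (each assumed to be on a 0-10 scale)."""
    total_weight = sum(weights.get(dim, 0.0) for dim in dimension_scores)
    if total_weight == 0:
        return 0.0
    weighted = sum(score * weights.get(dim, 0.0) for dim, score in dimension_scores.items())
    return weighted / total_weight

# Example: combined_score({"coherence": 8, "novelty": 9, "alignment": 6})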
class IdeaEvaluatorAgent(BaseAgent):
"""
Evaluates research ideas and hypotheses using multiple strategies:
- LLM-based pairwise comparison (like DPO)
- Preference learning via MR.Q Self Evaluator
"""
def __init__(self, cfg, memory=None, logger=None):
super().__init__(cfg, memory, logger)
self.strategy = cfg.get("strategy", "llm") # llm | mrq
self.evaluator = self._init_evaluator()
self.top_k = cfg.get("top_k", 5)
async def run(self, context: dict) -> dict:
hypotheses = self.get_hypotheses(context)
goal = context.get(GOAL)
baseline = context.get("baseline_hypotheses", {}).get("text")
if not hypotheses:
self.logger.log("NoHypothesesToEvaluate", {})
context["scored_hypotheses"] = []
return context
scored_results = []
for hyp in hypotheses:
hyp_text = hyp["text"]
preferred, scores = self.evaluator.judge(
goal=goal,
prompt=hyp_text,
output_a=baseline or hyp_text,
output_b=hyp_text,
)
scored_results.append(
{
"text": hyp_text,
"preferred": preferred,
"scores": scores,
"source": "llm-judge",
"score": scores.get("score_b", 0),
"reasoning": scores.get("reason", ""),
}
)
scored_results.sort(key=lambda x: x["score"], reverse=True)
context["scored_hypotheses"] = scored_results
context["top_hypothesis"] = scored_results[0]
return context
def get_top_k(self, context: dict, k: int = 5):
return sorted(
context.get("scored_hypotheses", []), key=lambda x: x["score"], reverse=True
)[:k]
def _init_evaluator(self):
if self.cfg.get("evaluator", "llm") == "llm":
llm_model = self.cfg.get("evaluator_model", self.cfg.get("model"))
prompt_file = self.cfg.get("evaluator_prompt_file", "evaluator.txt")
return LLMJudgeEvaluator(
self.cfg,
llm_cfg=llm_model,
prompt_file=prompt_file,
llm=self.call_llm,
logger=self.logger,
)
else:
return MRQSelfEvaluator(
memory=self.memory,
logger=self.logger,
device=self.cfg.get("device", "cpu"),
)
Supports:
“Assessment Agent evaluates performance on multiple criteria: coherence, credibility, verifiability, novelty, and alignment.”
“Scores are combined using weighted summation.”
IdeaEvolutionAgent - Mutation & Grafting
Evolves top hypotheses into better variants.
graph LR A[Goal] --> B[(SurveyAgent)] B --> C[(SearchOrchestratorAgent)] C --> D[(IdeaInnovationAgent)] D --> E[(IdeaSharpeningAgent)] E --> F[(RankingAgent - Elo-style)] F --> G[(IdeaEvaluatorAgent - Mr Q)] G --> H[(IdeaEvolutionAgent)]:::metallicBlue H --> I[(MethodPlannerAgent)] I --> J[(Next Round)] classDef metallicBlue fill:#3A9BDC,stroke:#1F4F82,stroke-width:2px,color:#fff;
After scoring and ranking, the system enters its creative refinement phase, where the IdeaEvolutionAgent takes center stage to evolve hypotheses into smarter, sharper forms. Inspired by biological evolution and collaborative science, this agent mutates, grafts, and improves hypotheses over multiple generations.
What it does
- Mutation: Generates multiple variants for each top-performing idea. These mutations focus on clarity, novelty, feasibility, or any preference you configure.
- Grafting: Combines similar hypotheses into unified, higher-quality statements when semantic overlap is high. This mimics how researchers synthesize overlapping ideas into consensus theories.
- Iteration: Evolution proceeds through multiple rounds, and each new generation is scored and filtered to retain only the strongest variants.
“Each idea is evolved into 3 variants; top performers are selected for the next round.” “Preference data collected from past evaluations can be used for training.”
This agent acts like a research assistant with a memory and a bias toward improvement. It draws from scoring data, user preferences, and prior reasoning to guide how each variant is generated and refined. Whether improving feasibility or amplifying originality, it ensures hypotheses continue to move forward, not just laterally.
The result? A growing tree of ideas, where each branch is smarter than the last.
class IdeaEvolutionAgent(BaseAgent):
"""
The Evolution Agent refines hypotheses iteratively using several strategies:
- Grafting similar hypotheses into unified statements
- Feasibility improvement through LLM reasoning
- Out-of-the-box hypothesis generation
- Inspiration from top-ranked ideas
- Simplification and clarity enhancement
These improvements are based on the paper:
"NOVELSEEK: When Agent Becomes the Scientist β Building Closed-Loop System from Hypothesis to Verification"
"""
def __init__(self, cfg, memory=None, logger=None):
super().__init__(cfg, memory, logger)
self.use_grafting = cfg.get("use_grafting", False)
self.max_variants_per_idea = cfg.get("max_variants", 3)
self.max_evolution_rounds = cfg.get("evolution_rounds", 4)
self.selection_top_k = cfg.get("select_top_k", 5)
self.preferences = cfg.get("preferences", ["novelty", "feasibility"])
async def run(self, context: dict) -> dict:
"""
Evolve top-ranked hypotheses across multiple rounds.
"""
# Get input hypotheses
ranked_hypotheses = context.get(RANKING, [])
fallback_hypotheses = context.get(HYPOTHESES, [])
preferences = context.get("preferences", self.preferences)
current_round = context.get("evolution_round", 0)
if not ranked_hypotheses and not fallback_hypotheses:
self.logger.log("NoHypothesesToEvolve", {"reason": "no_ranked_or_unranked_input"})
context[EVOLVED] = []
return context
# Decide which hypotheses to evolve
top_texts = [h.text for h, _ in ranked_hypotheses[:3]] if ranked_hypotheses else fallback_hypotheses
# Run evolution strategies
all_variants = await self._mutate_all(top_texts, context, preferences)
# Optionally use grafting
if self.use_grafting:
all_variants += await self.graft_similar(context)
# Score and select top K
scored_variants = self._score_variants(all_variants, context)
top_variants = scored_variants[:self.selection_top_k]
# Save to DB
self._save_evolved(top_variants, context)
# Update context
context["evolved"] = top_variants
context["evolution_round"] = current_round + 1
context["evolved_count"] = len(top_variants)
self.logger.log(
"EvolutionCompleted",
{
"evolved_count": len(top_variants),
"preferences": preferences,
"round": current_round + 1
}
)
return context
async def _mutate_all(self, hypotheses: list, context: dict, preferences: list) -> list:
"""Generate multiple variants for each hypothesis"""
all_mutants = []
for h in hypotheses:
prompt_context = {
"hypothesis": h,
"literature_summary": context.get("knowledge_base_summaries", []),
"critique": context.get("scores", {}),
"focus_area": context.get(GOAL, {}).get("focus_area"),
"preferences": ", ".join(preferences)
}
prompt = self.prompt_loader.load_prompt("evolve.txt", prompt_context)
raw_output = self.call_llm(prompt, context)
mutants = extract_hypotheses(raw_output)
self.logger.log("HypothesisMutated", {
"original": h[:60],
"mutations": mutants[:2]
})
all_mutants.extend(mutants)
return all_mutants
async def graft_similar(self, context: dict, threshold: float = 0.85) -> list:
"""
Graft pairs of highly similar hypotheses into unified versions.
"""
hypotheses = self.get_hypotheses(context)
embeddings = [await self.memory.embedding.get_or_create(h.get("text")) for h in hypotheses]
used = set()
grafted = []
for (i, h1), (j, h2) in itertools.combinations(enumerate(hypotheses), 2):
if i in used or j in used:
continue
sim = self.cosine_similarity(embeddings[i], embeddings[j])
if sim >= threshold:
self.logger.log("GraftingPair", {
"similarity": sim,
"h1": h1[:60] + "...",
"h2": h2[:60] + "..."
})
prompt = (
f"Combine the following hypotheses into a clearer, more innovative statement:\n\n"
f'A: {h1.get("text")}\nB: {h2.get("text")}'
)
try:
response = self.call_llm(prompt, context)
combined = extract_hypotheses(response)
grafted.extend(combined)
used.update([i, j])
except Exception as e:
self.logger.log("GraftingFailed", {"error": str(e)})
continue
# Add ungrafted hypotheses back
hypotheses = context.get(HYPOTHESES, [])
for i, h in enumerate(hypotheses):
if i not in used:
grafted.append(h)
return grafted
def _score_variants(self, variants: list, context: dict) -> list:
"""
Score variants using ScorerAgent logic and sort by total score
"""
scorer = self.memory.scorer
scored = []
for v in variants:
score_data = scorer.score(v, context)
score_data["text"] = v
scored.append(score_data)
# Sort by composite score
scored.sort(key=lambda x: x["score"], reverse=True)
return scored
def _save_evolved(self, variants: list, context: dict):
"""
Save evolved hypotheses to database with lineage info
"""
goal_text = self.extract_goal_text(context.get(GOAL))
pipeline_sig = context.get(PIPELINE)
for v in variants:
hyp = HypothesisORM(
goal=goal_text,
text=v["text"],
pipeline_signature=pipeline_sig,
parent=context.get("current_hypothesis", None),
evolution_level=context.get("evolution_round", 0)
)
self.db.add(hyp)
self.db.commit()
def cosine_similarity(self, vec1, vec2):
"""Compute cosine similarity between two vectors."""
v1 = np.array(vec1)
v2 = np.array(vec2)
return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
Matches NOVELSEEK's:
“Each idea is evolved into 3 variants; top performers are selected for next round.”
“Preference data collected from past evaluations can be used for training.”
MethodPlannerAgent - Idea to Method Mapping
Takes top hypothesis and turns it into structured methodology.
graph LR A[Goal] --> B[(SurveyAgent)] B --> C[(SearchOrchestratorAgent)] C --> D[(IdeaInnovationAgent)] D --> E[(IdeaSharpeningAgent)] E --> F[(RankingAgent - Elo-style)] F --> G[(IdeaEvaluatorAgent - Mr Q)] G --> H[(IdeaEvolutionAgent)] H --> I[(MethodPlannerAgent)]:::metallicBlue I --> J[(Next Round)] classDef metallicBlue fill:#3A9BDC,stroke:#1F4F82,stroke-width:2px,color:#fff;
Once a strong hypothesis has emerged through evolution and evaluation, the MethodPlannerAgent translates it into action. This agent is the final bridge between imagination and execution, turning abstract research ideas into structured, testable methodologies.
What it does
- Maps ideas to methods: Given a hypothesis, task description, baseline implementation, and supporting literature, it builds a full experimental plan.
- Follows a transformation function: “T: I × T × B × L → M”, where:
  - I = Idea
  - T = Task description
  - B = Baseline
  - L = Literature
  - M = Method plan
- Generates reproducible plans: Outputs include objectives, experimental setups, components needed, knowledge gaps, and next steps.
- Supports refinement: Plans can be iteratively revised based on feedback or scoring.
Why it matters
In the NOVELSEEK system, as in real scientific workflows, an idea is only as good as the method used to test it. The MethodPlannerAgent ensures that every hypothesis can lead to actual discovery by structuring the next steps as a coherent, executable research plan.
“Each idea is mapped to testable components” “Experimental plans guide future rounds of search and evaluation”
By anchoring creative hypotheses in concrete methods, this agent transforms theoretical innovation into practical science.
class MethodPlannerAgent(BaseAgent):
"""
The MethodPlannerAgent converts abstract research ideas into executable methodological frameworks.
Based on NOVELSEEK's Method Development Agent:
> _"The transformation function is represented as: T: I Γ T Γ B Γ L β M"_
Where:
- I = Research idea
- T = Task description
- B = Baseline implementation
- L = Relevant literature or knowledge baseAll right Um OK so you're going to come here come here now what you can do is that
- M = Resulting method plan
This agent supports both initial planning and iterative refinement of methodologies.
"""
def __init__(self, cfg, memory=None, logger=None):
super().__init__(cfg, memory, logger)
self.max_refinements = cfg.get("max_refinements", 3)
self.use_refinement = cfg.get("use_refinement", True)
async def run(self, context: dict) -> dict:
"""
Main execution loop for the MethodPlannerAgent.
Args:
context (dict): Contains goal, hypotheses, baseline code, and literature summary
Returns:
dict: Updated context with generated method plan
"""
# Extract input from context
goal = context.get(GOAL, {})
hypothesis = context.get(HYPOTHESES, "")
baseline_approach = self._get_baseline(goal.get("focus_area"))
literature_summary = context.get("knowledge_base_summaries", [])
pipeline_stage = context.get(PIPELINE, "initial_method_plan")
# Build prompt context
prompt_context = {
"idea": hypothesis or goal.get("goal_text"),
"task_description": self._extract_task_description(goal),
"baseline_approach": baseline_approach,
"literature_summary": self._summarize_literature(literature_summary),
"preferences": self.cfg.get("preferences", ["novelty", "feasibility"]),
}
merged = {**context, **prompt_context}
# Load and render prompt
prompt = self.prompt_loader.load_prompt(self.cfg, merged)
# Call LLM to generate method plan
raw_plan = self.call_llm(prompt, merged)
# Parse output into structured format
try:
plan_data = self.parse_method_plan_output(raw_plan)
except Exception as e:
self.logger.log("MethodPlanParseFailed", {"error": str(e), "raw": raw_plan})
return context
# Save to database
method_plan = self._save_to_db(plan_data, goal.get("id"))
# Update context with result
context[self.output_key] = plan_data
context["method_plan_id"] = method_plan.id
context["code_plan"] = plan_data.get("code_plan", "")
self.logger.log(
"MethodPlanGenerated", {"plan": plan_data, "pipeline_stage": pipeline_stage}
)
return context
def _extract_task_description(self, goal: dict) -> str:
"""
Extract domain-specific constraints and goals
Example: Reaction Yield Prediction on Suzuki-Miyaura dataset using SMILES input
"""
if goal.get("focus_area") == "chemistry":
return f"{goal.get('goal_text')} ({goal.get('focus_area')})"
elif goal.get("focus_area") == "nlp":
return f"{goal.get('goal_text')} ({goal.get('focus_area')})"
else:
return goal.get("goal_text", "")
def _get_baseline(self, focus_area: str) -> str:
"""
Retrieve baseline implementation from config or file system
"""
if focus_area == "chemistry":
return self.cfg.get("baselines").get("reaction_yield_model", "")
elif focus_area == "nlp":
return self.cfg.get("baselines").get("sentiment_transformer", "")
elif focus_area == "cv":
return self.cfg.get("baselines").get("pointnet_classifier", "")
else:
return ""
def _summarize_literature(self, literature: list) -> str:
"""
Format literature summaries for use in prompt
"""
if not literature:
return "No relevant prior work found."
return "\n".join(
[f"- {r['title']}: {r['refined_summary']}" for r in literature[:5]]
)
def parse_method_plan_output(self, output: str) -> dict:
sections = {
"research_objective": r"\*\*Research Objective:\*\*(.*?)\n\n",
"key_components": r"\*\*Key Components:\*\*(.*?)\n\n",
"experimental_plan": r"\*\*Experimental Plan:\*\*(.*?)\n\n",
"hypothesis_mapping": r"\*\*Hypothesis Mapping:\*\*(.*?)\n\n",
"search_strategy": r"\*\*Search Strategy:\*\*(.*?)\n\n",
"knowledge_gaps": r"\*\*Knowledge Gaps:\*\*(.*?)\n\n",
"next_steps": r"\*\*Next Steps:\*\*(.*?)$",
}
result = {}
for key, pattern in sections.items():
match = re.search(pattern, output, re.DOTALL)
if match:
content = match.group(1).strip()
if key in ["key_components"]:
result[key] = [
line.strip() for line in content.splitlines() if line.strip()
]
else:
result[key] = content
else:
result[key] = ""
return result
def _save_to_db(self, plan_data: dict, goal_id: int) -> MethodPlanORM:
"""
Store method plan in ORM with metadata
"""
plan = MethodPlanORM(
idea_text=plan_data.get("idea"),
task_description=plan_data.get("task_description"),
baseline_method=plan_data.get("baseline_used"),
literature_summary=plan_data.get("relevant_papers"),
code_plan=plan_data.get("code_plan"),
score_novelty=plan_data.get("score_novelty"),
score_feasibility=plan_data.get("score_feasibility"),
score_impact=plan_data.get("score_impact"),
score_alignment=plan_data.get("score_alignment"),
goal_id=goal_id,
focus_area=plan_data.get("focus_area"),
strategy=plan_data.get("strategy"),
evolution_level=0, # Initial plan
)
self.memory.method_plans.add_method_plan(plan.to_dict())
return plan
def _refine_plan(self, plan: dict, feedback: dict) -> dict:
"""
Apply refinement logic based on critique or scoring data
"""
refinement_prompt = self.prompt_loader.load_prompt(
"prompts/method_refine.j2", {"current_plan": plan, "feedback": feedback}
)
raw_refined = self.call_llm(refinement_prompt)
return self.parse_method_plan_output(raw_refined)
def _score_plan(self, plan: dict, context: dict) -> dict:
"""
Use ScorerAgent to evaluate methodology quality
"""
scorer = self.memory.scorer
scores = scorer.score(plan, context)
return scores
Supports NOVELSEEK's:
“Transformation function T: I × T × B × L → M”
“Each idea is mapped to testable components”
“Experimental plans guide future rounds of search and evaluation”
ORM Models - Persistent Memory
We use SQLAlchemy ORM to store everything for traceability.
Example: HypothesisORM
# Imports cover both ORM examples below
from datetime import datetime
from sqlalchemy import JSON, Boolean, Column, DateTime, Float, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class HypothesisORM(Base):
__tablename__ = "hypotheses"
id = Column(Integer, primary_key=True)
text = Column(String, nullable=False)
goal_id = Column(Integer, ForeignKey("goals.id"), nullable=False)
score_novelty = Column(Float)
score_feasibility = Column(Float)
score_impact = Column(Float)
score_alignment = Column(Float)
confidence = Column(Float)
origin = Column(String)
source = Column(String)
created_at = Column(DateTime, default=datetime.utcnow)
Example: MethodPlanORM
class MethodPlanORM(Base):
__tablename__ = "method_plans"
id = Column(Integer, primary_key=True)
idea_text = Column(String, nullable=False)
research_objective = Column(String, nullable=False)
key_components = Column(JSON)
experimental_plan = Column(String)
hypothesis_mapping = Column(String)
search_strategy = Column(String)
knowledge_gaps = Column(String)
next_steps = Column(String)
code_plan = Column(String)
score_novelty = Column(Float)
score_feasibility = Column(Float)
score_impact = Column(Float)
score_alignment = Column(Float)
evolution_level = Column(Integer, default=0)
parent_plan_id = Column(Integer, ForeignKey("method_plans.id"), nullable=True)
is_refinement = Column(Boolean, default=False)
This gives us persistent memory of every step in the pipeline.
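For example, with a plain SQLAlchemy session (the engine URL and helper below are illustrative, not a co_ai utility), a later round can pull the strongest stored hypotheses for a goal straight from this memory:
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine("postgresql://localhost/co_ai")  # assumed connection string
Session = sessionmaker(bind=engine)

def top_hypotheses_for_goal(goal_id: int, limit: int = 5):
    """Retrieve the highest-confidence hypotheses stored for a goal."""
    with Session() as session:
        return (
            session.query(HypothesisORM)
            .filter(HypothesisORM.goal_id == goal_id)
            .order_by(HypothesisORM.confidence.desc())
            .limit(limit)
            .all()
        )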
Sample Output After Planning
For the goal:
“Will AI ever be able to reprogram itself?”
Your MethodPlannerAgent might generate:
{
"research_objective": "Determine whether AI systems can autonomously reprogram their own code or parameters while ensuring safety.",
"key_components": [
"Introspection module for analyzing outputs",
"Reinforcement learning component for guiding changes",
"Safety constraints layer to prevent instability",
"Validation protocol for stability and functionality checks"
],
"experimental_plan": "1. Use introspection to analyze model outputs after each inference step.\n2. Generate proposed updates using reflection and meta-learning.\n3. Apply changes only if they pass safety and validation checks.\n4. Measure impact over time via performance, stability, error propagation.",
"hypothesis_mapping": "- Introspection module β addresses H1: Reprogramming feasibility\n- Reinforcement learning β guides change process\n- Safety constraints β ensures functional integrity",
"search_strategy": "Use Arxiv and GitHub to search for:\n- 'self-modifying LLM'\n- 'LLM-based introspection'\n- 'dynamic architecture updating'",
"knowledge_gaps": "What real-world systems attempt self-editing?\nHow do current models handle weight freezing and dynamic updates?",
"next_steps": "SurveyAgent should run queries for:\n- Existing implementations of self-modifying LLMs\n- Papers discussing AI safety and self-editing\n- Code repositories implementing dynamic attention modules"
}
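The regex-based `_parse_plan_output` handles the section-header format, but when the model emits well-formed JSON like the sample above, a simpler fallback parser is possible. A minimal sketch; the set of required keys is an assumption:

```python
import json

# Assumed minimum fields a plan must contain before it is persisted
REQUIRED_KEYS = {"research_objective", "key_components", "experimental_plan"}

def parse_json_plan(raw: str) -> dict:
    """Fallback for well-formed JSON output (complements _parse_plan_output)."""
    plan = json.loads(raw)
    missing = REQUIRED_KEYS - plan.keys()
    if missing:
        raise ValueError(f"Method plan is missing fields: {sorted(missing)}")
    return plan
```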
Evolution Tree Visualization
We support tracking evolutionary lineage:
graph TD H1[Hypothesis 1: Basic Introspection] H2[Hypothesis 2: Reinforcement-Guided Updates] H3[Hypothesis 3: Dynamic Weight Updating] H1 -->|Ranked Top| M1[Method Plan 1: Introspection-Based] H2 -->|Improved Feasibility| M2[Method Plan 2: RL-Guided Architecture Change] H3 -->|High Novelty| M3[Method Plan 3: Dynamic Weight Updating] M1 -->|Refinement| M1R1[Refined Method 1] M2 -->|Refinement| M2R1[Refined Method 2] M3 -->|Refinement| M3R1[Refined Method 3]
Which supports NOVELSEEK's:
"Each idea is evolved into 3 variants; top performers selected for next round."
"Ideas are iteratively polished and refined."
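Because each refined plan records `parent_plan_id` and `evolution_level`, the lineage shown in the diagram can be reconstructed directly from the `method_plans` table. A sketch, reusing the session setup from the ORM example above:

```python
from collections import defaultdict

def build_lineage(session) -> dict:
    """Group method plans by parent so the evolution tree can be printed or plotted."""
    children = defaultdict(list)
    for plan in session.query(MethodPlanORM).order_by(MethodPlanORM.evolution_level):
        children[plan.parent_plan_id].append(plan)
    return children

def print_tree(children: dict, parent_id=None, depth: int = 0) -> None:
    """Recursively print the tree rooted at parent_id, indented by evolution depth."""
    for plan in children.get(parent_id, []):
        print("  " * depth + f"[L{plan.evolution_level}] {plan.idea_text}")
        print_tree(children, plan.id, depth + 1)
```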
Benefits of This Design

| Feature | Description |
|---|---|
| Fully Autonomous Research Loop | No need for manual review |
| Modular Design | Swap out agents easily |
| Persistent Memory | Stores winning hypotheses and strategies |
| Evolutionary Guidance | Builds better ideas over time |
| Traceable Process | Logs help track the idea → methodology mapping |
Future Work
Where could we go next?
A. Add Code Execution Support
Use tools like Aider to evolve and execute actual code.
B. Add Human-in-the-loop Feedback
Allow researchers to validate hypotheses and refine scores manually.
C. Build a Web Interface
Visualize:
- Idea trees
- Scored hypotheses
- Evolved method plans
- Literature β idea flow
D. Add Training from Preference Data
Train the evaluator on stored A/B comparisons and Elo rankings; a minimal Elo-replay sketch follows below.
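The raw material for that training already exists in the ranking logs. Here is a minimal sketch of the Elo update being replayed over stored comparisons; the K-factor and the shape of the comparison records are illustrative assumptions:

```python
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple:
    """Standard Elo update for one head-to-head comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Replaying stored (winner, loser) pairs yields ratings and labeled pairs for the evaluator
ratings = {"H1": 1000.0, "H2": 1000.0, "H3": 1000.0}
for winner, loser in [("H1", "H2"), ("H2", "H3"), ("H1", "H3")]:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser], a_won=True)
```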
Summary
We've implemented the idea-to-methodology portion of the NOVELSEEK pipeline:
[Goal] → [SurveyAgent]
   ↓
[SearchOrchestratorAgent]
   ↓
[IdeaInnovationAgent] → generates N abstract ideas
   ↓
[IdeaSharpeningAgent] → refines into testable hypotheses
   ↓
[IdeaEvaluatorAgent] → scores using Mr Q-style logic
   ↓
[IdeaEvolutionAgent] → evolves top hypotheses
   ↓
[MethodPlannerAgent] → builds structured methodology
   ↓
[Database] → stores everything for future rounds
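For completeness, a minimal sketch of a stage runner that walks a pipeline config like the one shown earlier. How co_ai actually instantiates agents (config, memory, logging) will differ; this only illustrates the idea of context flowing through the enabled stages in order:

```python
import importlib

def run_pipeline(stages: list, context: dict) -> dict:
    """Run each enabled stage in order, threading the shared context through."""
    for stage in stages:
        if not stage.get("enabled", True):
            continue
        module_path, cls_name = stage["cls"].rsplit(".", 1)
        agent_cls = getattr(importlib.import_module(module_path), cls_name)
        agent = agent_cls()  # real agents likely take config, memory, and a logger
        for _ in range(stage.get("iterations", 1)):
            context = agent.run(context)
    return context
```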
References
- NOVELSEEK: When Agent Becomes the Scientist. Xinyu Lei, Yi Ren, Xiaojian Ma, et al. arXiv:2505.16938. Describes the core pipeline of autonomous scientific research through modular AI agents, including generation, refinement, evolution, and method planning.
- MR.Q: Preference-Based Evaluation for Reasoning Chains. OpenAI Team (internal reference). Provides the mechanism for scoring hypotheses based on multiple soft preferences such as novelty, feasibility, and coherence. Used in the `IdeaEvaluatorAgent`.
- Self-Refinement: Improving Code Generation with Verifier LMs. Zhengbao Jiang, Tristan Thrush, et al. (2023). arXiv:2304.12244. Basis for the multi-round refinement ideas embedded in the sharpening and evolution steps.
- AutoGPT: An Autonomous GPT-4 Experiment. Toran Bruce Richards et al. GitHub. Inspired the modular agent framework used in this project, with improvements to support research workflows.
- DSPy: Declarative Self-Improving Language Models. Srinivasan et al. (2024). arXiv:2402.00821. Framework for declarative prompting and runtime improvement. Used selectively in earlier tuning and evaluation prototypes.
- Symbolic Learning for Reasoning Optimization. Follow-up project in development. Inspiration for future work on pipeline adaptation and learning from symbolic feedback.
Implementation Checklist: NOVELSEEK Alignment

| Component | Description | Status |
|---|---|---|
| Idea Generation | Generate multiple candidate research hypotheses using structured prompting | Implemented via IdeaInnovationAgent |
| Multi-Round Refinement | Iteratively sharpen hypotheses using preference-guided evaluation | Implemented via IdeaSharpeningAgent |
| Preference Learning (MR.Q) | Evaluate and improve hypotheses using MR.Q-based scoring | Implemented via IdeaEvaluatorAgent |
| Evolution of Hypotheses | Apply mutation, combination, and grafting to evolve ideas | Implemented via IdeaEvolutionAgent |
| Elo-Style Ranking | Rank ideas using head-to-head comparison and Elo updates | Implemented via RankingAgent |
| Structured Method Planning | Translate ideas into testable methods using literature, task, and baseline context | Implemented via MethodPlannerAgent |
| Literature-Grounded Search | Retrieve relevant papers to condition idea development | Partially implemented via SearchOrchestratorAgent and local memory |
| Task + Baseline Conditioning | Use structured baselines and task profiles to steer hypotheses | Included in MethodPlannerAgent |
| Closed-Loop Iteration | Feedback from method planning informs future idea rounds | Under development / planned for next post |
| Verification & Experiment Execution | Run experiments or simulations to test ideas | Not implemented (out of scope for this post) |
| Learning from Pipeline Outcomes | Tune agent behavior based on past pipeline performance | Not implemented (planned via symbolic learner) |
Glossary
- **Agent**: A modular, autonomous component that performs a specific research task (e.g., idea generation, evaluation). Each agent follows a common interface and interacts with a shared memory.
- **Pipeline**: The full sequence of agents orchestrated to perform a complex task, such as moving from a research goal to a structured methodology.
- **SurveyAgent**: Generates targeted search queries based on a research goal to initiate literature exploration and tool-based discovery.
- **SearchOrchestratorAgent**: Coordinates multiple information retrieval tools (Web, ArXiv, Wikipedia, Hugging Face, etc.) to enrich context with external knowledge.
- **IdeaInnovationAgent**: Generates novel, abstract research directions based on retrieved context and past knowledge.
- **IdeaSharpeningAgent**: Refines and elaborates initial ideas using prompt-based tuning and logic enhancements (e.g., DSPy-style CoT reasoning or template-based mutations).
- **RankingAgent**: Performs Elo-style tournament ranking of hypotheses through LLM-guided pairwise comparisons, simulating scientific debate.
- **IdeaEvaluatorAgent (Mr Q)**: Scores each hypothesis across dimensions like novelty, feasibility, and alignment, using MR.Q-style preference-based evaluation or LLM comparison.
- **IdeaEvolutionAgent**: Mutates and grafts top hypotheses to generate stronger, more testable variants across multiple rounds. Simulates evolutionary refinement.
- **MethodPlannerAgent**: Transforms a top-ranked idea into a detailed research methodology, including key components, experimental plan, baseline references, and next steps.
- **MR.Q**: A preference learning and scoring framework for reasoning-based outputs. Supports scoring hypotheses along soft dimensions rather than against fixed ground truth.
- **Elo Ranking**: A scoring method from competitive games, adapted here to track hypothesis strength based on win/loss performance in head-to-head comparisons.
- **Grafting**: Combining similar hypotheses into a single, clearer, and more innovative variant. Used in the evolution phase.
- **Prompt Templating**: The use of structured Jinja templates to dynamically generate prompts based on input context (goal, literature, preferences, etc.).
- **Semantic Memory**: A vector store that enables retrieval of past hypotheses, evaluations, and context by similarity for use in future steps.
- **Baseline Method**: An existing, known approach to a task, used as a reference for evaluation or as a component in method planning.
- **T: I × T × B × L → M**: The transformation function defined in NOVELSEEK for turning an Idea (I), Task (T), Baseline (B), and Literature (L) into a Method plan (M).