A Novel Approach to Autonomous Research: Implementing NOVELSEEK with Modular AI Agents

Summary

AI research tools today are often narrow: one generates summaries, another ranks models, a third suggests ideas. But real scientific discovery isn’t a single step; it’s a pipeline. It’s iterative, structured, and full of feedback loops.

In this post, I show how to build a modular AI system that mirrors this full research lifecycle. From initial idea generation to method planning, each phase is handled by a specialized agent working in concert.

This implementation is inspired by the recent paper:
NOVELSEEK: When Agent Becomes the Scientist

We’ve implemented the complete Idea-to-Methodology Construction loop:

  • 🧠 Autonomous idea generation
  • 🔧 Multi-round hypothesis refinement
  • 🎯 Preference learning and evaluation
  • 📚 Semantic memory & knowledge retrieval
  • 🌱 Hypothesis evolution via mutation & grafting
  • 🗺️ Structured planning of experimental methods

This aligns closely with NOVELSEEK’s vision of agent-driven research:

“Ideas are generated, refined, scored, and evolved into structured methodologies.” “Each idea is mapped to testable components before being executed.”

Let’s explore the architecture, the agents, and how this system begins to act like a scientist, one step at a time.


πŸ” What Is NOVELSEEK?

From the paper:

“NOVELSEEK autonomously generates scientific hypotheses, transforms them into executable methodologies, and validates them via closed-loop experiments.”

The core stages include:

  1. Self-Evolving Idea Generation
  2. Idea-to-Methodology Construction
  3. Evolutionary Experimental Planning

In this post we’re focusing on replicating the first two stages:

    graph LR
    A[🧠 Idea Innovation] --> B[🔍 Idea Sharpening]
    B --> C[📊 Idea Evaluation]
    C --> D[🧬 Idea Evolution]
    D --> E[🛠️ Method Development]
  

And preparing for future integration with code execution tools like Aider or OpenHands.


🔄 Our Pipeline Overview

Here’s how our current autonomous research loop looks:

    graph TD
    A[Goal] --> B[(SurveyAgent)]
    B --> C[(SearchOrchestratorAgent)]
    C --> D[(IdeaInnovationAgent)]
    D --> E[(IdeaSharpeningAgent)]
    E --> F[(RankingAgent - Elo-style)]
    F --> G[(IdeaEvaluatorAgent - Mr Q)]
    G --> H[(IdeaEvolutionAgent)]
    H --> I[(MethodPlannerAgent)]
    I --> J[(Next Round)]
  

This mirrors NOVELSEEK’s:

“Multi-round experimental planning and execution”
“Each idea is evolved into 3 variants; top performers selected based on scoring.”
“Ideas are mapped to testable components before being executed.”


🧱 Core Components of co_ai

1. Goal Definition

Every experiment starts with a goal:

goal:
  id: 1
  goal_text: "Will AI ever be able to reprogram itself?"
  focus_area: "meta_learning"
  strategy: "graph_attention_with_positional_embeddings"
  baseline_method: "Standard transformer-based LLM with static prompt."

Goals define:

  • What we’re trying to prove/disprove
  • The domain (chemistry, nlp, etc.)
  • Baseline used for comparison
  • Strategy for improvement

2. The Pipeline

pipeline:
  name: default_pipeline
  description: "NOVELSEEK pipeline for exploring the question: 'Will AI ever be able to reprogram itself?'"
  stages:
     - name: survey
       cls: co_ai.agents.survey.SurveyAgent
       enabled: true
       iterations: 1
     - name: search_orchestrator
       cls: co_ai.agents.search_orchestrator.SearchOrchestratorAgent
       enabled: false
       iterations: 1
     - name: knowledge_loader
       cls: co_ai.agents.knowledge_loader.KnowledgeLoaderAgent
       enabled: true
       iterations: 1
     - name: idea_innovation
       cls: co_ai.agents.idea_innovation.IdeaInnovationAgent
       enabled: true
       iterations: 1
     - name: idea_sharpening
       cls: co_ai.agents.idea_sharpening.IdeaSharpeningAgent
       enabled: true
       iterations: 1
     - name: ranking
       cls: co_ai.agents.ranking.RankingAgent
       enabled: true
       iterations: 1
     - name: idea_evaluator
       cls: co_ai.agents.idea_evaluator.IdeaEvaluatorAgent
       enabled: true
       iterations: 1
     - name: idea_evolution
       cls: co_ai.agents.idea_evolution.IdeaEvolutionAgent
       enabled: true
       iterations: 3
     - name: method_planner
       cls: co_ai.agents.method_planner.MethodPlannerAgent
       enabled: true
       iterations: 1

πŸ•΅οΈβ€β™‚οΈ SurveyAgent – Query Generation

Generates adaptive search queries from goal + baseline + preferences.

    graph LR
    A[Goal] --> B[(SurveyAgent)]:::metallicBlue
    B --> C[(SearchOrchestratorAgent)]
    C --> D[(IdeaInnovationAgent)]
    D --> E[(IdeaSharpeningAgent)]
    E --> F[(RankingAgent - Elo-style)]
    F --> G[(IdeaEvaluatorAgent - Mr Q)]
    G --> H[(IdeaEvolutionAgent)]
    H --> I[(MethodPlannerAgent)]
    I --> J[(Next Round)]

   classDef metallicBlue fill:#3A9BDC,stroke:#1F4F82,stroke-width:2px,color:#fff;
  
# co_ai/agents/survey.py
from co_ai.agents.base import BaseAgent
from co_ai.constants import GOAL


class SurveyAgent(BaseAgent):
    """
    The Survey Agent generates adaptive search queries for literature exploration.
    
    From the paper:
    > 'The Survey Agent deconstructs the research task into multiple keyword combinations'
    > 'It supports two distinct modes: literature review mode and deep research mode'
    > 'Each idea is mapped to testable components before being executed'
    """

    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.max_queries = cfg.get("max_queries", 5)
        self.strategy = cfg.get("strategy", "default")  # referenced in run() below

    async def run(self, context: dict) -> dict:
        goal = context.get(GOAL, {})
        if not goal:
            self.logger.log("NoGoalProvided", {"reason": "survey_agent_skipped"})
            return context

        # Generate new queries based on goal + baseline + preferences
        prompt_context = {
            "goal_text": goal.get("goal_text"),
            "focus_area": goal.get("focus_area"),
            "baseline_method": context.get("baseline_method", ""),
            "preferences": context.get("preferences", ["novelty", "feasibility"]),
            "previous_ideas": context.get("ideas", [])
        }
        merged = {**self.cfg, **prompt_context}

        prompt = self.prompt_loader.load_prompt(self.cfg, merged)


        raw_output = self.call_llm(prompt, context)
        queries = self._parse_query_response(goal, raw_output)

        # Store in context for SearchOrchestratorAgent
        context["search_queries"] = queries
        context["search_strategy"] = self.strategy

        self.logger.log("SurveyQueriesGenerated", {
            "queries": queries,
            "strategy_used": self.strategy,
            "pipeline_stage": context.get("pipeline_stage")
        })

        return context

    def _parse_query_response(self, goal, response: str) -> list:
        """Parse LLM output into clean list of search queries"""
        lines = [line.strip() for line in response.splitlines() if line.strip()]
        if not lines:
            # Fallback strategy
            return [
                f"{goal.get('focus_area')} machine learning",
                f"{goal.get('goal_text')}"
            ]
        return lines[:self.max_queries]

    def expand_queries_to_goals(self, queries: list, base_goal: dict) -> list:
        """
        Convert queries into sub-goals for future pipeline stages
        
        Args:
            queries (list): Generated search strings
            base_goal (dict): Original goal
            
        Returns:
            list: List of structured sub-goals
        """
        return [
            {
                "goal_text": q,
                "parent_goal": base_goal.get("goal_text"),
                "focus_area": base_goal.get("focus_area"),
                "strategy": base_goal.get("strategy"),
                "source": "survey_agent"
            }
            for q in queries
        ]

Prompt Template – survey.txt

You are the Survey Agent. Generate adaptive search queries for literature exploration.

Goal: {{ goal.goal_text }}
Focus Area: {{ goal.focus_area }}

Baseline Method: {{ baseline_method }}
Research Preferences: {{ preferences }}
Previous Ideas: 
{% for idea in previous_ideas %}
- "{{ idea }}"
{% endfor %}

Generate up to {{ max_queries }} search queries that would help us understand the current state of research around this topic.
Return only the queries, one per line.

Example output:

Self-modifying AI architectures
LLM introspection and reflection
Safety constraints for autonomous reprogramming
Dynamic model architecture adaptation
AI systems evolving over time

These queries feed into downstream agents like SearchOrchestratorAgent.


🧭 SearchOrchestratorAgent: Choosing the Right Tool for the Job

    graph LR
    A[Goal] --> B[(SurveyAgent)]
    B --> C[(SearchOrchestratorAgent)]:::metallicBlue
    C --> D[(IdeaInnovationAgent)]
    D --> E[(IdeaSharpeningAgent)]
    E --> F[(RankingAgent - Elo-style)]
    F --> G[(IdeaEvaluatorAgent - Mr Q)]
    G --> H[(IdeaEvolutionAgent)]
    H --> I[(MethodPlannerAgent)]
    I --> J[(Next Round)]

   classDef metallicBlue fill:#3A9BDC,stroke:#1F4F82,stroke-width:2px,color:#fff;
  

Once the SurveyAgent generates structured queries from the user’s research goal, the SearchOrchestratorAgent takes over to intelligently route each query to one of several specialized search tools. This mimics a skilled research assistant deciding where to look based on what you’re asking.

πŸ› οΈ The Real Power Lies in the Tools

While large language models are impressive in their reasoning and generation capabilities, they are ultimately bounded by the static knowledge they were trained on. The real research value of this system doesn’t come from what the model already knows but from what it can retrieve and integrate dynamically through tools. In the SearchOrchestratorAgent, five distinct tools (Arxiv, HuggingFace, Wikipedia, cosine similarity search, and local web search via SearXNG) form a knowledge augmentation layer. These aren’t just add-ons; they are essential research interfaces.

The quality of insights the system can generate depends directly on the accuracy, relevance, and coverage of the information returned by these tools. In many ways, these tools are the reality check: they ground the AI’s creativity in what’s actually happening in the world of science and data. As we scale this system further, building richer, more accurate, and more specialized tools will be key to making these AI research agents not just plausible, but truly useful collaborators in knowledge discovery.

🔬 ArxivTool

For scientific papers and methodological insight.

  • When it’s used: The query or goal indicates a need for peer-reviewed literature, new models, or baseline comparisons.

  • What it does: Searches arXiv.org for recent research papers using the query.

  • Use cases:

    • "transformer-based anomaly detection"
    • "zero-shot learning methods"

📊 HuggingFaceTool

For datasets and model repositories.

  • When it’s used: The goal mentions “datasets”, “data collection”, or is classified as a data_search task.

  • What it does: Searches the Hugging Face Hub for datasets that match the research query.

  • Use cases:

    • "multilingual text classification dataset"
    • "sentiment analysis for medical notes"

📚 WikipediaTool

For concept grounding and general background knowledge.

  • When it’s used: The goal is categorized as background research or includes words like “overview” or “definition”.

  • What it does: Performs a similarity-ranked search over Wikipedia entries using cosine similarity.

  • Use cases:

    • "definition of generative AI"
    • "overview of reinforcement learning"

🌐 WebSearchTool

For general exploration when the intent is unclear or cross-domain.

  • When it’s used: When no strong match is found via metadata or similarity. Acts as a catch-all fallback.

  • What it does: Runs a broad web search and retrieves summaries + URLs.

  • Use cases:

    • "AI startup funding trends 2024"
    • "open-source LLM deployment on edge devices"

In building a dynamic, AI-driven research assistant, one of the trickiest components is reliable, configurable web search. Many third-party tools and APIs have unpredictable rate limits, require API keys, or yield inconsistent formats.

We tested multiple solutions, but SearXNG emerged as the clear winner.

✅ Why SearXNG?

  • Self-hostable: You can run it locally or privately, ensuring no external tracking or throttling.
  • Sandboxable: Unlike cloud APIs that are black-boxed, SearXNG is easy to sandbox, monitor, or override.
  • Consistent structure: It returns clean, parseable output that’s ideal for AI tools to consume and summarize.
  • Fast and flexible: It’s optimized for quick retrieval over large domains and adapts to different query structures easily.

πŸ› οΈ Integration Simplicity

We connected WebSearchTool to SearXNG via a small config file. This lets the SearchOrchestratorAgent seamlessly route web-style queries (fallbacks or broad information requests) to a fast, local search engine.
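
A minimal config sketch for that connection (the key names match what the WebSearchTool code below reads; the top-level web_search key and the values are assumptions for a local setup):

web_search:
  instance_url: "http://localhost:8080"   # local SearXNG endpoint
  max_results: 15
  fetch_page: false        # set true to also fetch and store full page HTML
  categories: "general"
  language: "en"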

I run it in Docker:

services:
  searxng:
    image: searxng/searxng
    ports:
      - "8080:8080"
    environment:
      - SEARXNG_PORT=8080
      - SEARXNG_BASE_URL=http://localhost:8080

πŸ” WebSearchTool code: A Local Web Search Wrapper for SearchXNG

This class provides a fast and lightweight interface to the SearXNG engine. It enables agents in the pipeline to issue real-time web queries and retrieve structured, parseable results.

Key Capabilities:

  • Asynchronous Search: Uses httpx for efficient non-blocking web queries.
  • Customizable Parameters: Supports tuning language, categories, and result limits.
  • HTML Parsing: Extracts titles, URLs, and snippets from SearxNG HTML responses using BeautifulSoup.
  • Optional Full Page Fetching: If enabled, also fetches and stores the full HTML content of the page.
  • Readable Text Extraction: Uses the readability package to extract clean, human-readable text for downstream summarization or embedding.

This module gives your AI agents reliable access to web data without relying on commercial APIs, making it ideal for local, privacy-preserving deployments.

import asyncio

import httpx
import requests
from bs4 import BeautifulSoup
from readability import Document

from co_ai.utils.file_utils import write_text_to_file


class WebSearchTool:
    def __init__(self, cfg: dict, logger):
        self.base_url = f'{cfg.get("instance_url", "http://localhost:8080")}/search'  # httpx needs a full URL with scheme
        self.max_results = cfg.get("max_results", 15)
        self.fetch_page = cfg.get("fetch_page", False)
        self.categories = cfg.get("categories", "general")
        self.language = cfg.get("language", "en")
        self.logger = logger

    async def search(self, query: str, max_results: int = 15) -> list[str] | None:
        max_results = max_results or self.max_results

        params = {
            "q": query,
            "categories": "general",
            "language": self.language,
            "formats": ["html", "json"]
        }

        try:
            async with httpx.AsyncClient(timeout=10.0) as client:
                resp = await client.get(self.base_url, params=params)
                resp.raise_for_status()
                html = resp.text

        except Exception as e:
            print(f"❌ Exception:  {type(e).__name__}: {e}")
            return None

        return self.parse_searxng_results(html, max_results)

    def parse_searxng_results(self, html: str, max_results:int=20):
        soup = BeautifulSoup(html, "html.parser")
        results = []

        for i, article in enumerate(soup.find_all("article", class_="result")):
            if i >= max_results:
                break
            link_tag = article.find("a", class_="url_header")
            href = link_tag["href"] if link_tag else None

            title_tag = article.find("h3")
            title = title_tag.get_text(strip=True) if title_tag else None

            snippet_tag = article.find("p", class_="content")
            snippet = snippet_tag.get_text(strip=True) if snippet_tag else None

            cleand_page = ""
            if self.fetch_page:
                cleand_page = self.fetch_html(href)

            if href and title:
                results.append(
                    {
                        "title": title,
                        "url": href,
                        "snippet": snippet,
                        "page": cleand_page,
                    }
                )

        return results

    def fetch_html(self, url: str) -> str | None:
        headers = {"User-Agent": "Mozilla/5.0"}
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as e:
            if self.logger:
                self.logger.log("FetchHTMLFailed", {"url": url, "error": str(e)})
            return None  # or return ""

    def fetch_and_parse_readable(self, url:str):
        html = self.fetch_html(url)
        title, clean_text = self.extract_main_text(html)
        return {"url": url, "title": title, "text": clean_text}


    def extract_main_text(self, html):
        doc = Document(html)
        title = doc.short_title()
        summary_html = doc.summary()

        # Use BeautifulSoup to clean text
        soup = BeautifulSoup(summary_html, 'html.parser')
        clean_text = soup.get_text(separator='\n', strip=True)
        return title, clean_text
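
A hypothetical usage sketch (it assumes a SearXNG instance at localhost:8080, as in the Docker setup above, and that httpx, bs4, and readability are installed):

import asyncio

# logger=None is safe here: the tool only uses the logger when a page fetch fails
tool = WebSearchTool({"instance_url": "http://localhost:8080"}, logger=None)

results = asyncio.run(tool.search("self-modifying AI architectures", max_results=5))
for r in results or []:
    print(r["title"], "->", r["url"])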

🧠 Cosine Similarity Tool

For semantic routing fallback and phrase matching.

  • What it does: Embeds the query and compares it to known intent phrases using cosine similarity. Helps infer intent like “this sounds like a paper query” or “this feels like a dataset lookup”.
  • Why it’s useful: When queries are ambiguous or phrased in non-obvious ways.

from typing import List, Tuple

import numpy as np


def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
    """Compute cosine similarity between two vectors."""
    dot = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    return dot / (norm1 * norm2 + 1e-8)  # Avoid division by zero


def get_top_k_similar(
    query: str,
    documents: List[str],
    memory,
    top_k: int = 5
) -> List[Tuple[str, float]]:
    """
    Compute similarity between query and each document, return top_k most similar.
    
    Args:
        query: The input query text.
        documents: A list of document strings.
        memory: Memory object whose embedding.get_or_create(text) returns a vector (np.ndarray).
        top_k: Number of top results to return.
    
    Returns:
        List of (document, similarity_score) tuples.
    """
    query_vec = memory.embedding.get_or_create(query)
    doc_vecs = [memory.embedding.get_or_create(doc) for doc in documents]

    similarities = [cosine_similarity(query_vec, vec) for vec in doc_vecs]
    scored = list(zip(documents, similarities))
    scored.sort(key=lambda x: x[1], reverse=True)

    return scored[:top_k]

πŸ” How Routing Works

The SearchOrchestratorAgent follows this decision-making pipeline (sketched in code after the list):

  1. Metadata-based routing: Checks the goal type and query keywords against known heuristics.
  2. Semantic fallback: If unclear, uses the cosine similarity tool to match against a curated set of intent templates.
  3. Executes tool: Sends the query to the selected tool and enriches results with goal metadata.
  4. Stores: Results are stored in the search_results database table and returned to context for downstream use.
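
A minimal sketch of that routing logic (the tool names come from this section and get_top_k_similar from the cosine tool above; the keyword heuristics and the 0.5 similarity cutoff are illustrative assumptions):

def route_query(query: str, goal: dict, memory) -> str:
    """Pick a search tool for a query: metadata heuristics first, then semantic fallback."""
    q = query.lower()

    # 1. Metadata-based routing via goal type and query keywords
    if goal.get("goal_type") == "data_search" or "dataset" in q:
        return "huggingface"
    if "overview" in q or "definition" in q:
        return "wikipedia"
    if any(kw in q for kw in ("paper", "method", "baseline", "benchmark")):
        return "arxiv"

    # 2. Semantic fallback: match against curated intent templates
    intents = {
        "find recent research papers on a method": "arxiv",
        "find a dataset for a machine learning task": "huggingface",
        "get background or a definition of a concept": "wikipedia",
    }
    top = get_top_k_similar(q, list(intents.keys()), memory, top_k=1)
    if top and top[0][1] > 0.5:
        return intents[top[0][0]]

    # 3. Catch-all fallback to broad web search
    return "websearch"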

πŸ§‘β€βš–οΈ Reflections on Knowledge Base Quality: Dynamic vs. Curated

As powerful as automated tools like the SearchOrchestratorAgent are, this experiment revealed a crucial limitation: the quality of the dynamically constructed knowledge base often lags behind expectations.

🧪 What We Observed

When queries are automatically generated and routed (for instance, through SurveyAgent and then fanned out by SearchOrchestratorAgent), the downstream knowledge base (a collection of summaries, papers, or dataset descriptions) tends to suffer from low relevance, redundancy, or superficial depth.

This happens because:

  • Automatically generated queries may not align tightly with the actual research goal.
  • Retrieved content often lacks domain specificity.
  • LLM-based summarization can flatten nuance or miss key points.

💡 Key Takeaway

Manually curated or hand-tuned knowledge bases still outperform automated ones in meaningful, focused research.

When you personally select papers, distill key points, and structure the background context, you end up with a knowledge base that’s:

  • Richer in insights.
  • Tighter in focus.
  • More reusable for downstream agents like hypothesis generators or evaluators.

🧭 Strategic Implication

If you’re building a self-evolving or reflective AI research system, you might consider:

  • Starting with a human-curated core knowledge base, and
  • Letting the AI augment it dynamically rather than letting AI fully control it from scratch.

This hybrid approach respects the current limits of LLM-based search + summarization and acknowledges that “what you feed the pipeline determines what you get out.”


🚀 Generating Novel Directions with the IdeaInnovationAgent

    graph LR
    A[Goal] --> B[(SurveyAgent)]
    B --> C[(SearchOrchestratorAgent)]
    C --> D[(IdeaInnovationAgent)]:::metallicBlue
    D --> E[(IdeaSharpeningAgent)]
    E --> F[(RankingAgent - Elo-style)]
    F --> G[(IdeaEvaluatorAgent - Mr Q)]
    G --> H[(IdeaEvolutionAgent)]
    H --> I[(MethodPlannerAgent)]
    I --> J[(Next Round)]

   classDef metallicBlue fill:#3A9BDC,stroke:#1F4F82,stroke-width:2px,color:#fff;
  

In the research pipeline, originality begins here. The IdeaInnovationAgent is responsible for translating background research, preferences, and strategic intent into concrete, novel research directions. Acting as a creative synthesizer, it absorbs context from the goal, survey results, and literature findings, then prompts a language model to propose abstract but actionable ideas that push beyond the current state of the art.

This agent marks the first leap from information-gathering to innovation. It’s not just summarizing; it’s imagining. Each output is a potential seed for a testable hypothesis, a paper, or even a new research agenda. By systematizing idea generation with structure, memory, and semantic grounding, this agent lays the intellectual foundation of the entire pipeline.

class IdeaInnovationAgent(BaseAgent):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)

    async def run(self, context: dict) -> dict:
        goal = context.get(GOAL)
        survey_results = context.get("survey_results", [])
        search_results = context.get("search_results", [])

        # Build prompt context
        prompt_context = {
            "goal_text": goal.get("goal_text"),
            "focus_area": goal.get("focus_area"),
            "goal_type": goal.get("goal_type"),
            "strategy": goal.get("strategy"),
            "survey_summary": self._summarize_results(survey_results),
            "search_result_summaries": self._summarize_results(search_results),
            "preferences": self.cfg.get("preferences", []),
        }

        merged = {**context, **prompt_context}

        # Load and render prompt
        prompt = self.prompt_loader.load_prompt(self.cfg, merged)

        # Call LLM to generate ideas
        raw_ideas = self.call_llm(prompt, merged)

        # Parse and structure ideas
        ideas = self._parse_raw_ideas(raw_ideas, goal)

        # Store generated ideas
        stored_ideas = self.memory.ideas.bulk_add_ideas(ideas)

        # Update context with results
        context["ideas"] = [idea.to_dict() for idea in stored_ideas]
        context["idea_ids"] = [idea.id for idea in stored_ideas]

        return context

    def _summarize_results(self, results: list) -> str:
        """Converts list of result dicts into a summary string"""
        if not results:
            return "No prior research found."
        summaries = []
        for r in results[:5]:  # limit to top 5 for brevity
            title = r.get("title", "")
            summary = r.get("summary", "")[:200] + "..." if len(r.get("summary", "")) > 200 else ""
            url = r.get("url", "")
            summaries.append(f"- {title}: {summary} ({url})")
        return "\n".join(summaries)

    def _parse_raw_ideas(self, raw_text: str, goal: dict) -> list:
        """Parses raw LLM response into structured idea objects"""
        lines = [line.strip() for line in raw_text.splitlines() if line.strip()]
        ideas = []

        for line in lines:
            ideas.append({
                "idea_text": line,
                "parent_goal": goal.get("goal_text"),
                "focus_area": goal.get("focus_area"),
                "strategy": goal.get("strategy"),
                "source": "generated_by_IdeaInnovationAgent",
                "origin": "llm",
                "extra_data": {}
            })

        return ideas
Prompt Template

You are the Idea Innovation Agent.
Your task is to generate novel research directions based on the following inputs:

Goal: {{ goal }}
Focus Area: {{ focus_area }}
Baseline Methods: {{ baseline_methods }}
Literature Summary:
{{ literature_summary }}

Generate 5–10 innovative research directions that:
- Build on existing methods
- Are grounded in recent literature
- Propose meaningful technical changes
- Align with the research goal

Return only the list of ideas, one per line.

Matches NOVELSEEK’s:

“Idea Innovation Agent generates novel directions based on prior knowledge”
“Ideas are turned into detailed methodologies using T: I × T × B × L → M”


🪓 IdeaSharpeningAgent – Refinement & Critique

Converts vague ideas into testable hypotheses using preference learning and reflection.

We covered sharpening previously in Self-Improving Agents: Applying the Sharpening Framework to Local LLMs.

I reused the ideas from that post here.

    graph LR
    A[Goal] --> B[(SurveyAgent)]
    B --> C[(SearchOrchestratorAgent)]
    C --> D[(IdeaInnovationAgent)]
    D --> E[(IdeaSharpeningAgent)]:::metallicBlue
    E --> F[(RankingAgent - Elo-style)]
    F --> G[(IdeaEvaluatorAgent - Mr Q)]
    G --> H[(IdeaEvolutionAgent)]
    H --> I[(MethodPlannerAgent)]
    I --> J[(Next Round)]

   classDef metallicBlue fill:#3A9BDC,stroke:#1F4F82,stroke-width:2px,color:#fff;
  
class IdeaSharpeningAgent(BaseAgent):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.target = cfg.get("target", "generation")
        self.device = cfg.get("device", "cpu")
        self.evaluator = MRQSelfEvaluator(memory, logger, device=self.device)
        self.templates = cfg.get("templates", ["critic"])
        self.save_count = cfg.get("save_count", 3)


    async def run(self, context: dict) -> dict:
        """
        Main execution loop for IdeaSharpeningAgent.

        Takes a list of ideas, sharpens them using templates,
        judges against baseline using evaluator, and logs results.
        """
        goal = context.get(GOAL, {})
        ideas = context.get("ideas", [])

        if not ideas:
            self.logger.log("NoIdeasToSharpen", {"reason": "empty_input"})
            return context

        sharpened_results = []
        for idea in ideas:
            idea_text = idea.get("idea_text")
            result = await self._sharpen_and_evaluate(idea_text, goal, context)
            sharpened_results.append(result)

        # Sort by score
        sharpened_results.sort(key=lambda x: x["score"], reverse=True)

        # Update context
        context["sharpened_ideas"] = [r["sharpened_hypothesis"] for r in sharpened_results]
        context["scored_ideas"] = sharpened_results
        best_idea = sharpened_results[0]["sharpened_hypothesis"]
        context["top_idea"] = best_idea

        hypotheses = context.get(HYPOTHESES, [])
        if hypotheses:
            # Find the hypothesis with the maximum confidence value
            sorted_hyps = sorted(
                hypotheses, key=lambda h: h.get("confidence", 0.0), reverse=True
            )

            # Keep only the top hypothesis
            context[HYPOTHESES] = sorted_hyps[:self.save_count]
            # For scoring later
            context["baseline_hypotheses"] = sorted_hyps[-1]

        return context

    async def _sharpen_and_evaluate(self, idea: str, goal: dict, context: dict) -> dict:
        # Build prompt for refinement
        focus_area = goal.get("focus_area", "")
        baselines = self.cfg.get("baselines")
        baseline = baselines.get(focus_area, baselines.get("default"))
        merged = {
            "goal": goal,
            "idea": idea,
            "baseline": baseline,
            "literature_summary": context.get("knowledge_base_summaries", []),
            "examples": self.memory.hypotheses.get_similar_hypotheses(idea, limit=3),
            "strategy": goal.get("strategy", "default"),
        }

        improved = None
        winner = "original"
        scores = {}

        for name in self.templates:
            prompt_template = self.prompt_loader.from_file(name, self.cfg, merged)
            sharpened = self.call_llm(prompt_template, merged)

            try:
                preferred_output, scores = self.evaluator.judge(
                    goal=goal,
                    prompt=idea,
                    output_a=idea,
                    output_b=sharpened,
                )
                improved = preferred_output
                winner = "b" if improved == sharpened else "a"
            except Exception as e:
                self.logger.log("IdeaSharpeningFailed", {"error": str(e)})
                improved = idea
                winner = "a"
                scores = {"value_a": 5.0, "value_b": 5.0}

            result = {
                "template_used": name,
                "original_idea": idea,
                "sharpened_hypothesis": improved,
                "winner": winner,
                "improved": winner == "b",
                "scores": scores,
                "score": max(scores.values()),
                "pipeline_stage": context.get(PIPELINE),
                "prompt_template": prompt_template,
            }

            saved_hyp = self.save_improved(goal, idea, result, context)
            if saved_hyp:
                context.setdefault(HYPOTHESES, []).append(saved_hyp.to_dict())

        # Return after trying every template; each template's result is saved above
        return result

    def save_improved(self, goal: dict, original_idea: str, result: dict, context: dict):
        if not result["improved"]:
            return None
        sharpened = result["sharpened_hypothesis"]
        prompt_id = self.memory.prompt.get_id_from_response(sharpened)

        # Save to HypothesisORM
        hyp = HypothesisORM(
            goal_id=goal.get("id"),
            text=sharpened,
            prompt_id=prompt_id,
            pipeline_signature=context.get(PIPELINE),
            source="idea_sharpening_agent",
            confidence=result["score"]
        )
        self.memory.hypotheses.insert(hyp)

        self.logger.log(
            "IdeaSharpenedAndSaved",
            {
                "prompt_snippet": original_idea[:100],
                "response_snippet": sharpened[:100],
                "score": result["score"],
            },
        )

        return hyp

Supports NOVELSEEK’s:

“Self-evolving idea generation with human-interactive feedback” “Each idea is refined into 3 variants; top performers selected based on scoring”

Once we’ve sharpened our ideas, we want to know how they stack up. That’s where the RankingAgent comes in.


πŸ† RankingAgent – Elo-Based Selection

Uses pairwise comparisons to rank hypotheses.

    graph LR
    A[Goal] --> B[(SurveyAgent)]
    B --> C[(SearchOrchestratorAgent)]
    C --> D[(IdeaInnovationAgent)]
    D --> E[(IdeaSharpeningAgent)]
    E --> F[(RankingAgent - Elo-style)]:::metallicBlue
    F --> G[(IdeaEvaluatorAgent - Mr Q)]
    G --> H[(IdeaEvolutionAgent)]
    H --> I[(MethodPlannerAgent)]
    I --> J[(Next Round)]

   classDef metallicBlue fill:#3A9BDC,stroke:#1F4F82,stroke-width:2px,color:#fff;
  

Once the hypotheses are sharpened, the RankingAgent steps in to simulate a scientific tournament, comparing ideas head-to-head to identify the strongest contenders.

Inspired by the Elo rating system from competitive gaming and chess, this agent doesn’t just score hypotheses in isolation. Instead, it pits them against each other using LLM-based pairwise comparisons. The result? A ranked list of hypotheses that reflects not only individual merit but also how each idea stacks up in direct competition.

💡 From the paper:

“The Ranking agent employs an Elo-based tournament to assess and prioritize generated hypotheses.” “Top performers are selected for the next round.”

How it works

  • Hypotheses are first initialized with a base Elo score (default: 750).
  • A selection of pairwise matchups is generated using random or proximity-based sampling.
  • The agent builds prompts comparing two hypotheses based on the user’s preferences (e.g., novelty, feasibility).
  • The LLM selects a winner in each round.
  • Elo scores are updated after each comparison, simulating win/loss dynamics.
  • Top-ranked hypotheses are passed to the next stage for deeper evaluation and potential evolution.

🔧 This agent also supports multi-turn debates, proximity-based pairing strategies, and adaptive scoring based on match outcomes. It’s a flexible, strategic layer that gives the pipeline a robust way to prioritize promising research directions.

By embedding structured comparison into the refinement loop, RankingAgent brings rigor to the exploration process, ensuring only the most compelling ideas move forward in the research pipeline.
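
To make the update concrete, here is the Elo math the agent applies, using its defaults (base score 750 and K = 32, both from the code below):

# One Elo update after hypothesis A beats hypothesis B in an LLM comparison
K = 32
r_a, r_b = 750.0, 750.0

# Expected score for A; algebraically equivalent to the 10**(r/400) form in _update_elo
expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # 0.5 for equal ratings

r_a += K * (1 - expected_a)        # 766.0: the winner gains 16 points
r_b += K * (0 - (1 - expected_a))  # 734.0: the loser drops by the same amount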

class RankingAgent(BaseAgent):
    """
    The Ranking agent simulates scientific debate between hypotheses using a tournament-style approach.

    From the paper:
    > 'The Ranking agent employs an Elo-based tournament to assess and prioritize generated hypotheses'
    """
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.elo_scores = {}
        self.strategy = cfg.get("strategy", "debate")
        self.max_comparisons = cfg.get("max_comparisons", 6)
        self.initial_elo_score = cfg.get("initial_elo_score", 750)
        self.win_history = []
        self.preferences = cfg.get("preferences", ["novelty", "feasibility"])


    async def run(self, context: dict) -> dict:
        """
        Rank hypotheses using pairwise comparisons and Elo updates.

        Args:
            context: Dictionary with keys:
                - hypotheses: list of hypothesis strings
                - goal: research objective
                - preferences: override criteria
        """
        hypotheses = self.get_hypotheses(context)

        if len(hypotheses) < 2:
            self.logger.log("NotEnoughHypothesesForRanking", {
                "count": len(hypotheses),
                "reason": "less than 2 hypotheses"
            })
            context[self.output_key] = [(h, self.initial_elo_score) for h in hypotheses]
            return context

        self._initialize_elo(hypotheses)

        pairs = list(itertools.combinations(hypotheses, 2))
        comparisons = random.sample(pairs, k=min(self.max_comparisons, len(pairs)))

        for hyp1, hyp2 in comparisons:
            prompt = self._build_ranking_prompt(hyp1, hyp2, context)
            response = self.call_llm(prompt, context)
            winner = self._parse_response(response)

            if winner:
                self._update_elo(hyp1, hyp2, winner)
            else:
                self.logger.log(
                    "ComparisonParseFailed",
                    {
                        "prompt_snippet": prompt[:200],
                        "response_snippet": response[:300],
                        "agent": self.__class__.__name__,
                    },
                )

        ranked = sorted(self.elo_scores.items(), key=lambda x: x[1], reverse=True)
        context[self.output_key] = ranked

        self.logger.log(
            "TournamentCompleted",
            {
                "total_hypotheses": len(ranked),
                "win_loss_patterns": self._extract_win_loss_feedback(),
                "preferences": self.preferences,
            },
        )

        return context

    def _initialize_elo(self, hypotheses):
        for h in hypotheses:
            text = h.get("text")
            if text not in self.elo_scores:
                self.elo_scores[text] = self.initial_elo_score

    def _build_ranking_prompt(self, hyp1, hyp2, context):
        return self.prompt_loader.load_prompt(
            self.cfg,
            {
                **context,
                "hypothesis_a": hyp1.get("text"),
                "hypothesis_b": hyp2.get("text"),
            },
        )

    def _conduct_multi_turn_debate(self, context:dict, hyp1:str, hyp2:str, turns:int=3):
        """Simulate multi-turn scientific debate between hypotheses"""
        for i in range(turns):
            prompt = self._build_ranking_prompt(hyp1, hyp2, context=context)
            response = self.call_llm(prompt, context)
            winner = self._parse_response(response)
            if winner:
                self._update_elo(hyp1, hyp2, winner)
            else:
                break


    def _generate_pairwise_comparisons(self, hypotheses):
        """Generate combinations of hypothesis pairs for ranking"""
        return itertools.combinations(hypotheses, 2)

    def _generate_proximity_based_pairs(self, hypotheses):
        """Prioritize comparisons between similar hypotheses"""
        similarities = [
            (h1, h2, self._compute_similarity(h1, h2))
            for h1, h2 in itertools.combinations(hypotheses, 2)
        ]
        return sorted(similarities, key=lambda x: x[2], reverse=True)

    def _extract_win_loss_feedback(self):
        """Return summary of which hypotheses won most often"""
        win_counts = {}

        for hyp1, hyp2, winner in self.win_history:
            winner_hypothesis = hyp1 if winner == "A" else hyp2
            win_counts[winner_hypothesis] = win_counts.get(winner_hypothesis, 0) + 1

        return {
            "top_performers": [
                {"hypotheses": h, "wins": w}
                for h, w in sorted(win_counts.items(), key=lambda x: x[1], reverse=True)
            ],
            "total_matches": len(self.win_history),
            "preferences_used": self.preferences
        }

    def _rank_pairwise(self, reviewed, context):
        pairs = list(itertools.combinations(reviewed, 2))
        if not pairs:
            return

        # Limit number of comparisons per round
        comparisons = random.sample(pairs, k=min(self.cfg.get("max_comparisons", 6), len(pairs)))

        for item1, item2 in comparisons:
            hyp1 = item1["hypotheses"]
            hyp2 = item2["hypotheses"]

            merged = {**self.cfg, **{"hypothesis_a": hyp1, "hypothesis_b": hyp2}}


            prompt = self.prompt_loader.load_prompt(merged, context=context)

            self.logger.log("RankingCompare", {"hyp1": hyp1[:60],  "hyp2":hyp2[:60]})

            try:
                response = self.call_llm(prompt, context)
                winner = self._parse_response(response)

                if winner:
                    self._update_elo(hyp1, hyp2, winner)
                else:
                    self.logger.log("ComparisonParseFailed", {
                        "prompt_snippet": prompt[:200],
                        "response_snippet": response[:300]
                    })
            except Exception as e:
                self.logger.log(
                    "ComparisonError",
                    {"error": str(e), "hypotheses": [hyp1[:100], hyp2[:100]]},
                )

    def _update_elo(self, hyp1, hyp2, winner):
        text1 = hyp1.get("text")
        text2 = hyp2.get("text")

        K = self.cfg.get("elo_k", 32)
        R1 = 10 ** (self.elo_scores[text1] / 400)
        R2 = 10 ** (self.elo_scores[text2] / 400)
        E1 = R1 / (R1 + R2)
        E2 = R2 / (R1 + R2)

        S1 = 1 if winner == "A" else 0
        S2 = 1 - S1

        self.elo_scores[text1] = max(
            100, min(2800, self.elo_scores[text1] + K * (S1 - E1))
        )
        self.elo_scores[text2] = max(
            100, min(2800, self.elo_scores[text2] + K * (S2 - E2))
        )

        self.memory.hypotheses.update_elo_rating(hyp1.get("id"), self.elo_scores[text1])
        self.memory.hypotheses.update_elo_rating(hyp2.get("id"), self.elo_scores[text2])

        self.win_history.append((text1, text2, winner))
        self.logger.log(
            "RankingUpdated",
            {
                "hypothesis_a": text1,
                "hypothesis_b": text2,
                "winner": winner,
                "elo_a": self.elo_scores[text1],
                "elo_b": self.elo_scores[text2],
            },
        )

    def _parse_response(self, response: str) -> Optional[str]:
        """
        Try multiple methods to extract winner from LLM output

        Returns:
            'A' or 'B' based on comparison
        """
        # Try matching structured formats first
        structured_match = re.search(r"better[\s_]?hypothesis[^\w]*([AB12])", response, re.IGNORECASE)
        if structured_match:
            winner_key = structured_match.group(1).upper()
            return "A" if winner_key in ("A", "1") else "B"

        # Try matching natural language statements
        lang_match = re.search(r"(?:prefer|choose|recommend|select)(\s+idea|\s+hypothesis)?[:\s]+([AB12])", response, re.IGNORECASE)
        if lang_match:
            winner_key = lang_match.group(2).upper()
            return "A" if winner_key in ("A", "1") else "B"

        # Try matching conclusion phrases
        conclusion_match = re.search(r"conclude[d]?\s+with\s+better[\s_]idea:\s*(\d)", response, re.IGNORECASE)
        if conclusion_match:
            winner_key = conclusion_match.group(1)
            return "A" if winner_key == "1" else "B"

        # Fallback: no winner found; callers treat None as a parse failure
        self.logger.log("ParseError", {
            "error": "Could not extract winner from response",
            "response": response
        })
        return None

Aligns with NOVELSEEK’s:

“Ranking agent employs an Elo-based tournament to assess and prioritize generated hypotheses”
“Top performers are selected for next round.”


βš–οΈ IdeaEvaluatorAgent (Mr Q) – Scoring

Autonomously scores hypotheses across dimensions.

    graph LR
    A[Goal] --> B[(SurveyAgent)]
    B --> C[(SearchOrchestratorAgent)]
    C --> D[(IdeaInnovationAgent)]
    D --> E[(IdeaSharpeningAgent)]
    E --> F[(RankingAgent - Elo-style)]
    F --> G[(IdeaEvaluatorAgent - Mr Q)]:::metallicBlue
    G --> H[(IdeaEvolutionAgent)]
    H --> I[(MethodPlannerAgent)]
    I --> J[(Next Round)]

   classDef metallicBlue fill:#3A9BDC,stroke:#1F4F82,stroke-width:2px,color:#fff;
  

Once hypotheses have been sharpened and ranked, they must be rigorously evaluated to determine their merit across a spectrum of scientific criteria. The IdeaEvaluatorAgent performs this role with flexibility and precision, using either traditional LLM-based judgments or a local preference model called MR.Q.

This agent operates as the system’s analytical conscience. It assesses each hypothesis based on dimensions such as:

  • Coherence – Does the hypothesis make sense logically?
  • Credibility – Is it scientifically plausible?
  • Verifiability – Could it be tested in practice?
  • Novelty – Does it add something new to the field?
  • Alignment – Does it match the user’s goal and preferences?

Depending on configuration, the agent can use:

  • An LLM-based judge to simulate comparative reasoning (similar to DPO), or
  • The MR.Q Self Evaluator, a lightweight reflection engine that learns user preferences from earlier feedback and refines hypotheses accordingly.

Each hypothesis is scored independently, and the best-performing ones are selected for the next stage of the pipeline. Scores and reasoning are logged, providing transparency and future trainability.

“Scores are combined using weighted summation” “Assessment Agent evaluates performance on multiple criteria: coherence, credibility, verifiability, novelty, and alignment.”

This structured evaluation ensures that only the most promising ideas advance: those that aren’t just interesting, but also testable and aligned with research intent.
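
As a tiny illustration of that weighted summation (both the weights and the raw scores here are made-up values, not numbers from the paper or the code):

# Combine per-dimension scores into one value via a weighted sum
weights = {"coherence": 0.20, "credibility": 0.20, "verifiability": 0.20,
           "novelty": 0.25, "alignment": 0.15}
scores = {"coherence": 8, "credibility": 7, "verifiability": 6,
          "novelty": 9, "alignment": 8}

total = sum(weights[k] * scores[k] for k in weights)
print(round(total, 2))  # 7.65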

class IdeaEvaluatorAgent(BaseAgent):
    """
    Evaluates research ideas and hypotheses using multiple strategies:

    - LLM-based pairwise comparison (like DPO)
    - Preference learning via MR.Q Self Evaluator
    """

    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.strategy = cfg.get("strategy", "llm")  # llm | mrq
        self.evaluator = self._init_evaluator()
        self.top_k = cfg.get("top_k", 5)

    async def run(self, context: dict) -> dict:
        hypotheses = self.get_hypotheses(context)
        goal = context.get(GOAL)
        baseline = context.get("baseline_hypotheses", {}).get("text")

        if not hypotheses:
            self.logger.log("NoHypothesesToEvaluate", {})
            context["scored_hypotheses"] = []
            return context

        scored_results = []
        for hyp in hypotheses:
            hyp_text = hyp["text"]
            preferred, scores = self.evaluator.judge(
                goal=goal,
                prompt=hyp_text,
                output_a=baseline or hyp_text,
                output_b=hyp_text,
            )
            scored_results.append(
                {
                    "text": hyp_text,
                    "preferred": preferred,
                    "scores": scores,
                    "source": "llm-judge",
                    "score": scores.get("score_b", 0),
                    "reasoning": scores.get("reason", ""),
                }
            )

        scored_results.sort(key=lambda x: x["score"], reverse=True)
        context["scored_hypotheses"] = scored_results
        context["top_hypothesis"] = scored_results[0]
        return context

    def get_top_k(self, context: dict, k: int = 5):
        return sorted(
            context.get("scored_hypotheses", []), key=lambda x: x["score"], reverse=True
        )[:k]

    def _init_evaluator(self):
        if self.cfg.get("evaluator", "llm") == "llm":
            llm_model = self.cfg.get("evaluator_model", self.cfg.get("model"))
            prompt_file = self.cfg.get("evaluator_prompt_file", "evaluator.txt")
            return LLMJudgeEvaluator(
                self.cfg,
                llm_cfg=llm_model,
                prompt_file=prompt_file,
                llm=self.call_llm,
                logger=self.logger,
            )
        else:
            return MRQSelfEvaluator(
                memory=self.memory,
                logger=self.logger,
                device=self.cfg.get("device", "cpu"),
            )

Supports:

“Assessment Agent evaluates performance on multiple criteria: coherence, credibility, verifiability, novelty, and alignment.”
“Scores are combined using weighted summation.”


🧬 IdeaEvolutionAgent – Mutation & Grafting

Evolves top hypotheses into better variants.

    graph LR
    A[Goal] --> B[(SurveyAgent)]
    B --> C[(SearchOrchestratorAgent)]
    C --> D[(IdeaInnovationAgent)]
    D --> E[(IdeaSharpeningAgent)]
    E --> F[(RankingAgent - Elo-style)]
    F --> G[(IdeaEvaluatorAgent - Mr Q)]
    G --> H[(IdeaEvolutionAgent)]:::metallicBlue
    H --> I[(MethodPlannerAgent)]
    I --> J[(Next Round)]

   classDef metallicBlue fill:#3A9BDC,stroke:#1F4F82,stroke-width:2px,color:#fff;
  

After scoring and ranking, the system enters its creative refinement phase, where the IdeaEvolutionAgent takes center stage to evolve hypotheses into smarter, sharper forms. Inspired by biological evolution and collaborative science, this agent mutates, grafts, and improves hypotheses over multiple generations.

What it does

  • Mutation: Generates multiple variants for each top-performing idea. These mutations focus on clarity, novelty, feasibility, or any preference you configure.
  • Grafting: Combines similar hypotheses into unified, higher-quality statements when semantic overlap is high. This mimics how researchers synthesize overlapping ideas into consensus theories.
  • Iteration: Evolution proceeds through multiple rounds, and each new generation is scored and filtered to retain only the strongest variants.

“Each idea is evolved into 3 variants; top performers are selected for the next round.” “Preference data collected from past evaluations can be used for training.”

This agent acts like a research assistant with a memory and a bias toward improvement. It draws from scoring data, user preferences, and prior reasoning to guide how each variant is generated and refined. Whether improving feasibility or amplifying originality, it ensures hypotheses continue to move forward, not just laterally.

The result? A growing tree of ideas, where each branch is smarter than the last.

class IdeaEvolutionAgent(BaseAgent):
    """
    The Evolution Agent refines hypotheses iteratively using several strategies:

    - Grafting similar hypotheses into unified statements
    - Feasibility improvement through LLM reasoning
    - Out-of-the-box hypothesis generation
    - Inspiration from top-ranked ideas
    - Simplification and clarity enhancement

    These improvements are based on the paper:
    "NOVELSEEK: When Agent Becomes the Scientist – Building Closed-Loop System from Hypothesis to Verification"
    """

    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.use_grafting = cfg.get("use_grafting", False)
        self.max_variants_per_idea = cfg.get("max_variants", 3)
        self.max_evolution_rounds = cfg.get("evolution_rounds", 4)
        self.selection_top_k = cfg.get("select_top_k", 5)
        self.preferences = cfg.get("preferences", ["novelty", "feasibility"])

    async def run(self, context: dict) -> dict:
        """
        Evolve top-ranked hypotheses across multiple rounds.
        """
        # Get input hypotheses
        ranked_hypotheses = context.get(RANKING, [])
        fallback_hypotheses = context.get(HYPOTHESES, [])
        preferences = context.get("preferences", self.preferences)
        current_round = context.get("evolution_round", 0)

        if not ranked_hypotheses and not fallback_hypotheses:
            self.logger.log("NoHypothesesToEvolve", {"reason": "no_ranked_or_unranked_input"})
            context[EVOLVED] = []
            return context

        # Decide which hypotheses to evolve
        top_texts = [h.text for h, _ in ranked_hypotheses[:3]] if ranked_hypotheses else fallback_hypotheses

        # Run evolution strategies
        all_variants = await self._mutate_all(top_texts, context, preferences)

        # Optionally use grafting
        if self.use_grafting:
            all_variants += await self.graft_similar(context)

        # Score and select top K
        scored_variants = self._score_variants(all_variants, context)
        top_variants = scored_variants[:self.selection_top_k]

        # Save to DB
        self._save_evolved(top_variants, context)

        # Update context
        context["evolved"] = top_variants
        context["evolution_round"] = current_round + 1
        context["evolved_count"] = len(top_variants)

        self.logger.log(
            "EvolutionCompleted",
            {
                "evolved_count": len(top_variants),
                "preferences": preferences,
                "round": current_round + 1
            }
        )

        return context

    async def _mutate_all(self, hypotheses: list, context: dict, preferences: list) -> list:
        """Generate multiple variants for each hypothesis"""
        all_mutants = []

        for h in hypotheses:
            prompt_context = {
                "hypothesis": h,
                "literature_summary": context.get("knowledge_base_summaries", []),
                "critique": context.get("scores", {}),
                "focus_area": context.get(GOAL, {}).get("focus_area"),
                "preferences": ", ".join(preferences)
            }

            prompt = self.prompt_loader.load_prompt("evolve.txt", prompt_context)
            raw_output = self.call_llm(prompt, context)

            mutants = extract_hypotheses(raw_output)
            self.logger.log("HypothesisMutated", {
                "original": h[:60],
                "mutations": mutants[:2]
            })

            all_mutants.extend(mutants)

        return all_mutants

    async def graft_similar(self, context: dict, threshold: float = 0.85) -> list:
        """
        Graft pairs of highly similar hypotheses into unified versions.
        """
        hypotheses = self.get_hypotheses(context)
        embeddings = [await self.memory.embedding.get_or_create(h.get("text")) for h in hypotheses]
        used = set()
        grafted = []

        for (i, h1), (j, h2) in itertools.combinations(enumerate(hypotheses), 2):
            if i in used or j in used:
                continue

            sim = self.cosine_similarity(embeddings[i], embeddings[j])
            if sim >= threshold:
                self.logger.log("GraftingPair", {
                    "similarity": sim,
                    "h1": h1[:60] + "...",
                    "h2": h2[:60] + "..."
                })
                prompt = (
                    f"Combine the following hypotheses into a clearer, more innovative statement:\n\n"
                    f'A: {h1.get("text")}\nB: {h2.get("text")}'
                )
                try:
                    response = self.call_llm(prompt, context)
                    combined = extract_hypotheses(response)
                    grafted.extend(combined)
                    used.update([i, j])
                except Exception as e:
                    self.logger.log("GraftingFailed", {"error": str(e)})
                    continue

        # Add ungrafted hypotheses back
        hypotheses = context.get(HYPOTHESES, [])
        for i, h in enumerate(hypotheses):
            if i not in used:
                grafted.append(h)

        return grafted

    def _score_variants(self, variants: list, context: dict) -> list:
        """
        Score variants using ScorerAgent logic and sort by total score
        """
        scorer = self.memory.scorer
        scored = []

        for v in variants:
            score_data = scorer.score(v, context)
            score_data["text"] = v
            scored.append(score_data)

        # Sort by composite score
        scored.sort(key=lambda x: x["score"], reverse=True)
        return scored

    def _save_evolved(self, variants: list, context: dict):
        """
        Save evolved hypotheses to database with lineage info
        """
        goal_text = self.extract_goal_text(context.get(GOAL))
        pipeline_sig = context.get(PIPELINE)

        for v in variants:
            hyp = HypothesisORM(
                goal=goal_text,
                text=v["text"],
                pipeline_signature=pipeline_sig,
                parent=context.get("current_hypothesis", None),
                evolution_level=context.get("evolution_round", 0)
            )
            self.db.add(hyp)
        self.db.commit()

    def cosine_similarity(self, vec1, vec2):
        """Compute cosine similarity between two vectors."""
        v1 = np.array(vec1)
        v2 = np.array(vec2)
        return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

Matches NOVELSEEK’s:

“Each idea is evolved into 3 variants; top performers are selected for next round.”
“Preference data collected from past evaluations can be used for training.”


πŸ—ΊοΈ MethodPlannerAgent – Idea β†’ Method Mapping

Takes top hypothesis and turns it into structured methodology.

    graph LR
    A[Goal] --> B[(SurveyAgent)]
    B --> C[(SearchOrchestratorAgent)]
    C --> D[(IdeaInnovationAgent)]
    D --> E[(IdeaSharpeningAgent)]
    E --> F[(RankingAgent - Elo-style)]
    F --> G[(IdeaEvaluatorAgent - Mr Q)]
    G --> H[(IdeaEvolutionAgent)]
    H --> I[(MethodPlannerAgent)]:::metallicBlue
    I --> J[(Next Round)]

   classDef metallicBlue fill:#3A9BDC,stroke:#1F4F82,stroke-width:2px,color:#fff;
  

Once a strong hypothesis has emerged through evolution and evaluation, the MethodPlannerAgent translates it into action. This agent is the final bridge between imagination and execution, turning abstract research ideas into structured, testable methodologies.

What it does

  • Maps ideas to methods: Given a hypothesis, task description, baseline implementation, and supporting literature, it builds a full experimental plan.

  • Follows a transformation function (sketched in code after this list): “T: I Γ— T Γ— B Γ— L β†’ M”, where:

    • I = Idea
    • T = Task description
    • B = Baseline
    • L = Literature
    • M = Method plan
  • Generates reproducible plans: Outputs include objectives, experimental setups, components needed, knowledge gaps, and next steps.

  • Supports refinement: Plans can be iteratively revised based on feedback or scoring.
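
A minimal sketch of that transformation signature, assuming plain-string inputs and the plan fields listed above (the dataclass and function names are illustrative, not the project's actual API):

from dataclasses import dataclass, field

@dataclass
class MethodPlan:
    research_objective: str = ""
    key_components: list = field(default_factory=list)
    experimental_plan: str = ""
    knowledge_gaps: str = ""
    next_steps: str = ""

def build_method_plan(idea: str, task: str, baseline: str, literature: str) -> MethodPlan:
    """T: I x T x B x L -> M, with the LLM call stubbed out."""
    prompt = (
        f"Idea: {idea}\nTask: {task}\nBaseline: {baseline}\n"
        f"Literature:\n{literature}\n\nProduce a structured method plan."
    )
    # In the real agent this prompt goes to call_llm(); here we just return a shell.
    return MethodPlan(research_objective=f"Test: {idea}")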

Why it matters

In the NOVELSEEK system, as in real scientific workflows, an idea is only as good as the method used to test it. MethodPlannerAgent ensures that every hypothesis can lead to actual discovery by structuring the next steps as a coherent, executable research plan.

“Each idea is mapped to testable components”
“Experimental plans guide future rounds of search and evaluation”

By anchoring creative hypotheses in concrete methods, this agent transforms theoretical innovation into practical science.

class MethodPlannerAgent(BaseAgent):
    """
    The MethodPlannerAgent converts abstract research ideas into executable methodological frameworks.

    Based on NOVELSEEK's Method Development Agent:
    > _"The transformation function is represented as: T: I Γ— T Γ— B Γ— L β†’ M"_

    Where:
    - I = Research idea
    - T = Task description
    - B = Baseline implementation
    - L = Relevant literature or knowledge base
    - M = Resulting method plan

    This agent supports both initial planning and iterative refinement of methodologies.
    """

    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.max_refinements = cfg.get("max_refinements", 3)
        self.use_refinement = cfg.get("use_refinement", True)

    async def run(self, context: dict) -> dict:
        """
        Main execution loop for the MethodPlannerAgent.

        Args:
            context (dict): Contains goal, hypotheses, baseline code, and literature summary

        Returns:
            dict: Updated context with generated method plan
        """
        # Extract input from context
        goal = context.get(GOAL, {})
        hypothesis = context.get(HYPOTHESES, "")
        baseline_approach = self._get_baseline(goal.get("focus_area"))
        literature_summary = context.get("knowledge_base_summaries", [])
        pipeline_stage = context.get(PIPELINE, "initial_method_plan")

        # Build prompt context
        prompt_context = {
            "idea": hypothesis or goal.get("goal_text"),
            "task_description": self._extract_task_description(goal),
            "baseline_approach": baseline_approach,
            "literature_summary": self._summarize_literature(literature_summary),
            "preferences": self.cfg.get("preferences", ["novelty", "feasibility"]),
        }

        merged = {**context, **prompt_context}

        # Load and render prompt
        prompt = self.prompt_loader.load_prompt(self.cfg, merged)

        # Call LLM to generate method plan
        raw_plan = self.call_llm(prompt, merged)

        # Parse output into structured format
        try:
            plan_data = self.parse_method_plan_output(raw_plan)
        except Exception as e:
            self.logger.log("MethodPlanParseFailed", {"error": str(e), "raw": raw_plan})
            return context

        # Save to database
        method_plan = self._save_to_db(plan_data, context)

        # Update context with result
        context[self.output_key] = plan_data
        context["method_plan_id"] = method_plan.id
        context["code_plan"] = plan_data.get("code_plan", "")

        self.logger.log(
            "MethodPlanGenerated", {"plan": plan_data, "pipeline_stage": pipeline_stage}
        )

        return context

    def _extract_task_description(self, goal: dict) -> str:
        """
        Extract domain-specific constraints and goals.
        Example: Reaction Yield Prediction on Suzuki-Miyaura dataset using SMILES input
        """
        focus_area = goal.get("focus_area")
        if focus_area in ("chemistry", "nlp"):
            return f"{goal.get('goal_text')} ({focus_area})"
        return goal.get("goal_text", "")

    def _get_baseline(self, focus_area: str) -> str:
        """
        Retrieve baseline implementation from config or file system
        """
        baselines = self.cfg.get("baselines", {})  # guard against missing key
        if focus_area == "chemistry":
            return baselines.get("reaction_yield_model", "")
        elif focus_area == "nlp":
            return baselines.get("sentiment_transformer", "")
        elif focus_area == "cv":
            return baselines.get("pointnet_classifier", "")
        else:
            return ""

    def _summarize_literature(self, literature: list) -> str:
        """
        Format literature summaries for use in prompt
        """
        if not literature:
            return "No relevant prior work found."

        return "\n".join(
            [f"- {r['title']}: {r['refined_summary']}" for r in literature[:5]]
        )

    def parse_method_plan_output(self, output: str) -> dict:
        sections = {
            "research_objective": r"\*\*Research Objective:\*\*(.*?)\n\n",
            "key_components": r"\*\*Key Components:\*\*(.*?)\n\n",
            "experimental_plan": r"\*\*Experimental Plan:\*\*(.*?)\n\n",
            "hypothesis_mapping": r"\*\*Hypothesis Mapping:\*\*(.*?)\n\n",
            "search_strategy": r"\*\*Search Strategy:\*\*(.*?)\n\n",
            "knowledge_gaps": r"\*\*Knowledge Gaps:\*\*(.*?)\n\n",
            "next_steps": r"\*\*Next Steps:\*\*(.*?)$",
        }

        result = {}
        for key, pattern in sections.items():
            match = re.search(pattern, output, re.DOTALL)
            if match:
                content = match.group(1).strip()
                if key in ["key_components"]:
                    result[key] = [
                        line.strip() for line in content.splitlines() if line.strip()
                    ]
                else:
                    result[key] = content
            else:
                result[key] = ""

        return result

    def _save_to_db(self, plan_data: dict, context: dict) -> MethodPlanORM:
        """
        Store method plan in ORM with metadata.
        Accepts the full context (matching the call in run()) and resolves
        the goal id and goal-level metadata from it.
        """
        goal = context.get(GOAL, {})
        plan = MethodPlanORM(
            goal_id=goal.get("id"),
            idea_text=plan_data.get("idea") or goal.get("goal_text"),
            task_description=plan_data.get("task_description"),
            baseline_method=plan_data.get("baseline_used"),
            literature_summary=plan_data.get("relevant_papers"),
            research_objective=plan_data.get("research_objective"),
            key_components=plan_data.get("key_components"),
            experimental_plan=plan_data.get("experimental_plan"),
            hypothesis_mapping=plan_data.get("hypothesis_mapping"),
            search_strategy=plan_data.get("search_strategy"),
            knowledge_gaps=plan_data.get("knowledge_gaps"),
            next_steps=plan_data.get("next_steps"),
            code_plan=plan_data.get("code_plan"),
            score_novelty=plan_data.get("score_novelty"),
            score_feasibility=plan_data.get("score_feasibility"),
            score_impact=plan_data.get("score_impact"),
            score_alignment=plan_data.get("score_alignment"),
            focus_area=goal.get("focus_area"),
            strategy=goal.get("strategy"),
            evolution_level=0,  # Initial plan
        )

        self.memory.method_plans.add_method_plan(plan.to_dict())
        return plan

    def _refine_plan(self, plan: dict, feedback: dict, context: dict) -> dict:
        """
        Apply refinement logic based on critique or scoring data
        """
        refinement_prompt = self.prompt_loader.load_prompt(
            "prompts/method_refine.j2", {"current_plan": plan, "feedback": feedback}
        )

        raw_refined = self.call_llm(refinement_prompt, context)
        return self.parse_method_plan_output(raw_refined)

    def _score_plan(self, plan: dict, context: dict) -> dict:
        """
        Use ScorerAgent to evaluate methodology quality
        """
        scorer = self.memory.scorer
        scores = scorer.score(plan, context)
        return scores
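
A quick way to see what parse_method_plan_output expects is to run the same regex logic over a toy response. This standalone sketch uses two of the section patterns above; the sample text is invented:

import re

SECTIONS = {
    "research_objective": r"\*\*Research Objective:\*\*(.*?)\n\n",
    "next_steps": r"\*\*Next Steps:\*\*(.*?)$",
}

sample = (
    "**Research Objective:** Test whether an LLM can propose safe self-edits.\n\n"
    "**Next Steps:** Survey prior work on self-modifying models."
)

result = {}
for key, pattern in SECTIONS.items():
    match = re.search(pattern, sample, re.DOTALL)
    result[key] = match.group(1).strip() if match else ""

print(result["research_objective"])  # Test whether an LLM can propose safe self-edits.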

Supports NOVELSEEK’s:

“Transformation function T: I Γ— T Γ— B Γ— L β†’ M”
“Each idea is mapped to testable components”
“Experimental plans guide future rounds of search and evaluation”


πŸ“¦ ORM Models – Persistent Memory

We use SQLAlchemy ORM to store everything for traceability.

Example: HypothesisORM

from datetime import datetime

from sqlalchemy import (
    JSON, Boolean, Column, DateTime, Float, ForeignKey, Integer, String
)
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class HypothesisORM(Base):
    __tablename__ = "hypotheses"

    id = Column(Integer, primary_key=True)
    text = Column(String, nullable=False)
    goal_id = Column(Integer, ForeignKey("goals.id"), nullable=False)
    pipeline_signature = Column(String)  # lineage fields used by _save_evolved
    parent = Column(String, nullable=True)  # hypothesis this one evolved from
    evolution_level = Column(Integer, default=0)
    score_novelty = Column(Float)
    score_feasibility = Column(Float)
    score_impact = Column(Float)
    score_alignment = Column(Float)
    confidence = Column(Float)
    origin = Column(String)
    source = Column(String)
    created_at = Column(DateTime, default=datetime.utcnow)

Example: MethodPlanORM

class MethodPlanORM(Base):
    __tablename__ = "method_plans"

    id = Column(Integer, primary_key=True)
    goal_id = Column(Integer, ForeignKey("goals.id"))
    idea_text = Column(String, nullable=False)
    task_description = Column(String)
    baseline_method = Column(String)
    literature_summary = Column(String)
    focus_area = Column(String)
    strategy = Column(String)
    research_objective = Column(String)
    key_components = Column(JSON)
    experimental_plan = Column(String)
    hypothesis_mapping = Column(String)
    search_strategy = Column(String)
    knowledge_gaps = Column(String)
    next_steps = Column(String)
    code_plan = Column(String)
    score_novelty = Column(Float)
    score_feasibility = Column(Float)
    score_impact = Column(Float)
    score_alignment = Column(Float)
    evolution_level = Column(Integer, default=0)
    parent_plan_id = Column(Integer, ForeignKey("method_plans.id"), nullable=True)
    is_refinement = Column(Boolean, default=False)

This gives us persistent memory of every step in the pipeline.
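
As a quick illustration, here is a hedged sketch of reading lineage back out of the store with a standard SQLAlchemy session; the engine URL is illustrative:

from sqlalchemy import create_engine, select
from sqlalchemy.orm import Session

engine = create_engine("sqlite:///co_ai.db")  # illustrative database URL
Base.metadata.create_all(engine)

with Session(engine) as session:
    # Top hypotheses for goal 1, most novel first
    top = session.execute(
        select(HypothesisORM)
        .where(HypothesisORM.goal_id == 1)
        .order_by(HypothesisORM.score_novelty.desc())
        .limit(5)
    ).scalars().all()

    for h in top:
        print(h.evolution_level, h.text[:60])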


πŸ§ͺ Sample Output After Planning

For the goal:

“Will AI ever be able to reprogram itself?”

Your MethodPlannerAgent might generate:

{
  "research_objective": "Determine whether AI systems can autonomously reprogram their own code or parameters while ensuring safety.",
  "key_components": [
    "Introspection module for analyzing outputs",
    "Reinforcement learning component for guiding changes",
    "Safety constraints layer to prevent instability",
    "Validation protocol for stability and functionality checks"
  ],
  "experimental_plan": "1. Use introspection to analyze model outputs after each inference step.\n2. Generate proposed updates using reflection and meta-learning.\n3. Apply changes only if they pass safety and validation checks.\n4. Measure impact over time via performance, stability, error propagation.",
  "hypothesis_mapping": "- Introspection module β†’ addresses H1: Reprogramming feasibility\n- Reinforcement learning β†’ guides change process\n- Safety constraints β†’ ensures functional integrity",
  "search_strategy": "Use Arxiv and GitHub to search for:\n- 'self-modifying LLM'\n- 'LLM-based introspection'\n- 'dynamic architecture updating'",
  "knowledge_gaps": "What real-world systems attempt self-editing?\nHow do current models handle weight freezing and dynamic updates?",
  "next_steps": "SurveyAgent should run queries for:\n- Existing implementations of self-modifying LLMs\n- Papers discussing AI safety and self-editing\n- Code repositories implementing dynamic attention modules"
}

🧬 Evolution Tree Visualization

We support tracking evolutionary lineage:

    graph TD
    H1[Hypothesis 1: Basic Introspection]
    H2[Hypothesis 2: Reinforcement-Guided Updates]
    H3[Hypothesis 3: Dynamic Weight Updating]

    H1 -->|Ranked Top| M1[Method Plan 1: Introspection-Based]
    H2 -->|Improved Feasibility| M2[Method Plan 2: RL-Guided Architecture Change]
    H3 -->|High Novelty| M3[Method Plan 3: Dynamic Weight Updating]

    M1 -->|Refinement| M1R1[Refined Method 1]
    M2 -->|Refinement| M2R1[Refined Method 2]
    M3 -->|Refinement| M3R1[Refined Method 3]
  

Which supports NOVELSEEK’s:

“Each idea is evolved into 3 variants; top performers selected for next round.”
“Ideas are iteratively polished and refined.”
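
Reconstructing such a tree from stored parent_plan_id links is straightforward; here is a minimal sketch with in-memory tuples standing in for ORM rows:

from collections import defaultdict

# Hypothetical rows: (id, parent_plan_id, label)
rows = [
    (1, None, "Method Plan 1: Introspection-Based"),
    (2, None, "Method Plan 2: RL-Guided Architecture Change"),
    (3, 1, "Refined Method 1"),
    (4, 2, "Refined Method 2"),
]

children = defaultdict(list)
for plan_id, parent_id, label in rows:
    children[parent_id].append((plan_id, label))

def print_tree(parent=None, depth=0):
    """Walk parent -> child links depth-first and indent by level."""
    for plan_id, label in children.get(parent, []):
        print("  " * depth + label)
        print_tree(plan_id, depth + 1)

print_tree()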


πŸ“Š Benefits of This Design

| Feature | Description |
|---|---|
| 🧠 Fully Autonomous Research Loop | No need for manual review |
| 🧩 Modular Design | Swap out agents easily |
| πŸ—ƒοΈ Persistent Memory | Stores winning hypotheses and strategies |
| 🧬 Evolutionary Guidance | Builds better ideas over time |
| πŸ“Š Traceable Process | Logs help track idea β†’ methodology mapping |

πŸš€ Future Work

Where could we go next?

A. Add Code Execution Support

Using tools like Aider, the pipeline could evolve and execute actual code generated from method plans; a sketch follows.
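
A hedged sketch of handing a generated code_plan to Aider via its CLI; this assumes the aider command and its --message flag are available as documented, so verify against the current Aider docs before relying on it:

import subprocess

def execute_code_plan(code_plan: str, target_file: str):
    """Hand a generated code plan to Aider as an edit instruction (sketch)."""
    # Assumes the `aider` CLI is installed and accepts --message.
    subprocess.run(
        ["aider", "--message", code_plan, target_file],
        check=True,
    )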

B. Add Human-in-the-loop Feedback

Allow researchers to validate hypotheses and refine scores manually.

C. Build a Web Interface

Visualize:

  • Idea trees
  • Scored hypotheses
  • Evolved method plans
  • Literature β†’ idea flow

D. Add Training from Preference Data

Train your evaluator using stored A/B comparisons and Elo rankings.
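
A sketch of what D could look like, assuming comparisons are stored as simple (text_a, text_b, winner) rows (this row format is hypothetical):

# Hypothetical stored A/B comparisons: (text_a, text_b, winner)
comparisons = [
    ("H1: basic introspection", "H2: RL-guided updates", "b"),
    ("H2: RL-guided updates", "H3: dynamic weights", "a"),
]

preference_pairs = []
for text_a, text_b, winner in comparisons:
    chosen, rejected = (text_a, text_b) if winner == "a" else (text_b, text_a)
    preference_pairs.append({"chosen": chosen, "rejected": rejected})

# preference_pairs can now feed a reward model or DPO-style trainer
print(preference_pairs)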


βœ… Summary

We’ve implemented the NOVELSEEK pipeline from goal to structured methodology:

[Goal] β†’ [SurveyAgent]
     ↓
[SearchOrchestratorAgent]
     ↓
[IdeaInnovationAgent] β†’ generates N abstract ideas
     ↓
[IdeaSharpeningAgent] β†’ refines into testable hypotheses
     ↓
[RankingAgent] β†’ Elo-style head-to-head ranking
     ↓
[IdeaEvaluatorAgent] β†’ scores using Mr Q-style logic
     ↓
[IdeaEvolutionAgent] β†’ evolves top hypotheses
     ↓
[MethodPlannerAgent] β†’ builds structured methodology
     ↓
[Database] β†’ stores everything for future rounds

πŸ“š References

  1. NOVELSEEK: When Agent Becomes the Scientist. Xinyu Lei, Yi Ren, Xiaojian Ma, et al. arXiv:2505.16938. Describes the core pipeline of autonomous scientific research through modular AI agents, including generation, refinement, evolution, and method planning.

  2. MR.Q: Preference-Based Evaluation for Reasoning Chains. OpenAI Team (internal reference). Provides the mechanism for scoring hypotheses based on multiple soft preferences such as novelty, feasibility, and coherence. Used in the IdeaEvaluatorAgent.

  3. Self-Refinement: Improving Code Generation with Verifier LMs. Zhengbao Jiang, Tristan Thrush, et al. (2023). arXiv:2304.12244. Basis for multi-round refinement ideas embedded into the sharpening and evolution steps.

  4. AutoGPT: An Autonomous GPT-4 Experiment. Toran Bruce Richards et al. GitHub. Inspired the modular agent framework used in this project, with improvements to support research workflows.

  5. DSPy: Declarative Self-Improving Language Models. Srinivasan et al. (2024). arXiv:2402.00821. Framework for declarative prompting and runtime improvement. Used selectively in earlier tuning and evaluation prototypes.

  6. Symbolic Learning for Reasoning Optimization. Follow-up project in development. Used as inspiration for future work on pipeline adaptation and learning from symbolic feedback.


βœ… Implementation Checklist – NOVELSEEK Alignment

| Component | Description | Status |
|---|---|---|
| Idea Generation | Generate multiple candidate research hypotheses using structured prompting | βœ… Implemented via IdeaInnovationAgent |
| Multi-Round Refinement | Iteratively sharpen hypotheses using preference-guided evaluation | βœ… Implemented via IdeaSharpeningAgent |
| Preference Learning (MR.Q) | Evaluate and improve hypotheses using MR.Q-based scoring | βœ… Implemented via IdeaEvaluatorAgent |
| Evolution of Hypotheses | Apply mutation, combination, and grafting to evolve ideas | βœ… Implemented via IdeaEvolutionAgent |
| Elo-Style Ranking | Rank ideas using head-to-head comparison and Elo updates | βœ… Implemented via RankingAgent |
| Structured Method Planning | Translate ideas into testable methods using literature, task, and baseline context | βœ… Implemented via MethodPlannerAgent |
| Literature-Grounded Search | Retrieve relevant papers to condition idea development | βœ… Partially implemented via SearchOrchestratorAgent and local memory |
| Task + Baseline Conditioning | Use structured baselines and task profiles to steer hypotheses | βœ… Included in MethodPlannerAgent |
| Closed-Loop Iteration | Feedback from method planning informs future idea rounds | πŸ”„ Under development / planned for next post |
| Verification & Experiment Execution | Run experiments or simulations to test ideas | ❌ Not implemented (out of scope for this post) |
| Learning from Pipeline Outcomes | Tune agent behavior based on past pipeline performance | ❌ Not implemented (planned via symbolic learner) |

πŸ“˜ Glossary

Agent: A modular, autonomous component that performs a specific research task (e.g., idea generation, evaluation). Each agent follows a common interface and interacts with a shared memory.

Pipeline: The full sequence of agents orchestrated to perform a complex task, such as moving from a research goal to a structured methodology.

SurveyAgent: Generates targeted search queries based on a research goal to initiate literature exploration and tool-based discovery.

SearchOrchestratorAgent: Coordinates multiple information retrieval tools (Web, ArXiv, Wikipedia, Hugging Face, etc.) to enrich context with external knowledge.

IdeaInnovationAgent: Generates novel, abstract research directions based on retrieved context and past knowledge.

IdeaSharpeningAgent: Refines and elaborates initial ideas using prompt-based tuning and logic enhancements (e.g., DSPy-style CoT reasoning or template-based mutations).

RankingAgent: Performs Elo-style tournament ranking of hypotheses through LLM-guided pairwise comparisons, simulating scientific debate.

IdeaEvaluatorAgent (Mr Q): Scores each hypothesis across dimensions like novelty, feasibility, and alignment, using MR.Q-style preference-based evaluation or LLM comparison.

IdeaEvolutionAgent: Mutates and grafts top hypotheses to generate stronger, more testable variants across multiple rounds. Simulates evolutionary refinement.

MethodPlannerAgent: Transforms a top-ranked idea into a detailed research methodology, including key components, experimental plan, baseline references, and next steps.

MR.Q: A preference learning and scoring framework for reasoning-based outputs. Supports scoring hypotheses based on soft dimensions rather than fixed ground truth.

Elo Ranking: A scoring method from competitive games, adapted here to track hypothesis strength based on win/loss performance in head-to-head comparisons (see the update sketch below).
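
For reference, the standard Elo update behind such rankings (generic constants, not necessarily the project's):

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """One Elo update after a pairwise comparison (generic constants)."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

print(elo_update(1000, 1000, a_won=True))  # winner gains K/2 = 16, loser drops 16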

Grafting: Combining similar hypotheses into a single, clearer, and more innovative variant. Used in the evolution phase.

Prompt Templating: The use of structured Jinja templates to dynamically generate prompts based on input context (goal, literature, preferences, etc.). See the sketch below.
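
A minimal example of rendering such a template with Jinja2 (the template text is invented):

from jinja2 import Template

template = Template(
    "Goal: {{ goal_text }}\n"
    "Preferences: {{ preferences | join(', ') }}\n"
    "Propose a method plan."
)

print(template.render(
    goal_text="Will AI ever be able to reprogram itself?",
    preferences=["novelty", "feasibility"],
))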

Semantic Memory: A vector store that enables retrieval of past hypotheses, evaluations, and context based on similarity, for use in future steps.

Baseline Method: An existing, known approach to a task, used as a reference for evaluation or as a component in method planning.

T: I Γ— T Γ— B Γ— L β†’ M: The transformation function defined in NOVELSEEK for turning an Idea (I), Task (T), Baseline (B), and Literature (L) into a Method plan (M).