Building an AI Co-Scientist


This is the first post in a 100-part series where we take breakthrough AI papers and turn them into working code, building the next generation of AI, one idea at a time.

๐Ÿงพ Summary

In this post, I'll walk through how I implemented the ideas from Towards an AI Co-Scientist into a working system called co_ai.

The paper presents a vision for an AI system that collaborates with scientists by generating hypotheses, engaging in peer review, ranking proposals, and evolving them over time. Inspired by this work, I've built a modular, open-source version of the co-scientist architecture using local tools like:

  • โœ… Ollama (for running Qwen3, Mistral, etc.)
  • โœ… DSPy (for prompt optimization)
  • โœ… Hydra (for configuration management, dynamic extension)
  • โœ… pgvector (for vector memory + retrieval)
  • โœ… PostgreSQL (database)

๐Ÿ”ฌ What Is co_ai?

co_ai is a working implementation of the AI co-scientist concept introduced in the recent DeepMind paper. The core idea? Build a multi-agent system that mimics scientific reasoning: proposing hypotheses, debating their validity, refining them, and evolving toward better explanations.

The system uses a modular agent pipeline:

    flowchart LR
    A[๐ŸŽฏ Goal] --> B[๐Ÿงช Generation Agent]
    B --> C[๐Ÿชž Reflection Agent]
    C --> D[๐Ÿ† Ranking Agent]
    D --> E[๐Ÿงฌ Evolution Agent]
    E --> F[๐Ÿง  Meta-review Agent]
  

Each step has a specific role:

  • โœ… Generation Agent: Proposes multiple hypotheses grounded in literature.
  • ๐Ÿง  Reflection Agent: Critiques each hypothesis for correctness, novelty, and feasibility.
  • ๐Ÿ“Š Ranking Agent: Uses simulated debates to rank and refine ideas.
  • ๐Ÿ”„ Evolution Agent: Evolves promising hypotheses using simplification or inspiration.
  • ๐Ÿงฌ Meta-review Agent: Synthesizes insights across all reviews into strategic directions applicable to all agents.

๐Ÿ›  Key Design Decisions & Why They Matter

1. โœ… Local Execution via Ollama

One of the most important choices was to use local LLMs via Ollama. This ensures:

  • No dependency on cloud APIs
  • Full reproducibility
  • No cost to your research
  • Better control over inference settings (temperature, max tokens, etc.)

Example config:

model:
  name: ollama/qwen3
  api_base: http://localhost:11434
  api_key:

Any model available through Ollama can be used here: just change the model name in the config.
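
If you want to sanity-check the local endpoint outside of co_ai, a quick call to Ollama's HTTP API is enough. This is a standalone sketch, assuming Ollama is running on its default port and the qwen3 model has already been pulled:

```python
# Standalone sanity check against a local Ollama instance (not part of co_ai).
# Assumes Ollama is listening on its default port and `qwen3` has been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3",
        "prompt": "In one sentence, what is an AI co-scientist?",
        "stream": False,  # ask for a single JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```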


2. ๐Ÿง  Preference-Driven Prompting

A major innovation from the paper is the use of preferences to guide hypothesis quality.

I implemented this cleanly using Hydra config:

File: configs/agents/generation.yaml

This is a sample of one of the agent configuration files:

generation:
  name: generation
  enabled: true
  strategy: goal_aligned
  preferences:
    - goal_consistency
    - biological_plausibility
    - experimental_validity
    - novelty
    - simplicity

These are injected into prompts dynamically using Jinja2 templates:

This is a sample of one of the agent prompt template files.

File: prompts/generation/goal_aligned.txt

You are an expert researcher generating novel scientific hypotheses.
Use inspiration from analogous domains or mechanisms to develop creative solutions.

Goal:
{{ goal }}

Literature:
{{ literature }}

Preferences:
{% for p in preferences %}
- {{ p }}
{% endfor %}

Instructions:
1. Review findings above before generating new hypotheses
2. Generate 3 distinct, testable hypotheses
3. Each must include mechanism, rationale, and experiment plan
4. {% if "goal_consistency" in preferences %}Prioritize direct alignment with the stated goal.{% endif %}
5. {% if "novelty" in preferences %}Focus on originality and unexpected connections.{% endif %}
6. {% if "feasibility" in preferences %}Ensure experiments can be realistically tested in the lab.{% endif %}
7. {% if "biological_plausibility" in preferences %}Make sure biological mechanisms are valid and well-explained.{% endif %}
8. {% if "simplicity" in preferences %}Favor clarity and simplicity over complexity.{% endif %}

This lets users tune agent behavior without changing code โ€” just by adjusting preferences.
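
Under the hood this is plain Jinja2 rendering. Here is a minimal sketch of how a preference list drives the template above; the file path and example values are illustrative rather than the exact co_ai call sites:

```python
# Minimal sketch of preference-driven prompt rendering with Jinja2.
# The template path and example values are illustrative.
from jinja2 import Template

with open("prompts/generation/goal_aligned.txt", encoding="utf-8") as f:
    template = Template(f.read())

prompt = template.render(
    goal="Can generative AI models reduce the time required to make scientific discoveries?",
    literature="1. Example paper -> one-line summary of its findings",
    preferences=["goal_consistency", "novelty", "simplicity"],
)
# The conditional instructions for feasibility and biological_plausibility
# render empty because those preferences are not in the list.
print(prompt)
```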


3. ๐Ÿ”„ Self-Improving Loop via DSPy

To implement the feedback loop described in the paper:

โ€œFeedback from tournaments enables iterative improvement, creating a self-improving loop toward novel and high-quality outputsโ€

I added a PromptRefinerAgent that:

  • Pulls top-ranked hypotheses from memory
  • Uses DSPyโ€™s BootstrapFewShot to refine prompts
  • Stores only improved versions
  • Logs everything for traceability

File: co_ai/agents/prompt_refiner.py

class PromptRefinerAgent(BaseAgent):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.agent_name = cfg.get("target_agent", "generation")
        self.strategy = cfg.get("strategy", "basic_refinement")

    async def run(self, context: dict) -> dict:
        few_shot_data = self.memory.get_ranked_hypotheses(context["goal"], limit=5)
        
        refined_prompt = self._refine_with_dspy(context, few_shot_data)
        
        old_score = self._evaluate_prompt(context["prompt"], few_shot_data)
        new_score = self._evaluate_prompt(refined_prompt, few_shot_data)

        if new_score > old_score:
            self.memory.store_prompt_version(
                agent_name=self.agent_name,
                prompt_key=self.strategy,
                prompt_text=refined_prompt,
                source="dsp_refinement",
                version=context.get("prompt_version", 1) + 1,
                metadata={"few_shot_count": len(few_shot_data)}
            )
            context["prompt"] = refined_prompt
        return context

Now the system improves its own prompting strategy based on real-world performance, creating exactly the loop the paper describes:

“Feedback from tournaments enables iterative improvement, creating a self-improving loop toward novel and high-quality outputs”


4. ๐Ÿ“‹ Structured Output + Memory System

All agents generate structured output so future stages can parse and reason about them.

For example:

# Hypothesis 1
Mechanism: Inhibition of IRE1α signaling reduces tumor viability in AML cells.

# Hypothesis 2
Mechanism: KIRA6 induces apoptosis in MOLM13 cells through ER stress pathways.

# Hypothesis 3
Mechanism: Targeting UPR pathways enhances chemosensitivity in FLT3-ITD+ leukemias.

This format makes it easy to extract, compare, and evolve hypotheses.

From the paper:

โ€œAppendix Figure A.4 shows example prompts for comparing two hypotheses during a tournament matchโ€

So we built a consistent structure that supports:

  • โœ… Parsing
  • ๐Ÿงฑ Comparison
  • ๐Ÿค– Refinement
  • ๐Ÿ“‹ Logging
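
Because the shape is predictable, a single regex pass is enough to pull the hypotheses back out. Here is a small illustrative parser; the production pattern lives in the agent config as `extraction_regex`, and this one is simplified:

```python
# Illustrative parser for the structured hypothesis format shown above.
# The real pattern is configured per agent (extraction_regex); this is simplified.
import re

raw_output = """# Hypothesis 1
Mechanism: Inhibition of IRE1α signaling reduces tumor viability in AML cells.

# Hypothesis 2
Mechanism: KIRA6 induces apoptosis in MOLM13 cells through ER stress pathways.
"""

pattern = re.compile(r"# Hypothesis (\d+)\nMechanism: (.+)")
hypotheses = [
    {"id": int(num), "mechanism": mechanism.strip()}
    for num, mechanism in pattern.findall(raw_output)
]
print(hypotheses)
```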

5. ๐Ÿงฉ Modular Architecture with Configurable Agents

Using Hydra-based config, each agent can be enabled/disabled and configured independently.

File: configs/agents/generation.yaml

defaults:
  - /model/qwen3
  - /prompt_refiner/disabled

generation:
  name: generation
  enabled: true        
  save_prompt: true
  save_context: true
  skip_if_completed: true
  strategy: goal_aligned
  input_keys: ["goal", "literature"]
  output_key: hypotheses
  prompt_mode: file
  prompt_file: goal_aligned.txt
  extraction_regex: "Hypothesis \\d+:\\n(.+?)\\n"

And in the pipeline config:

File: configs/pipeline/default.yaml


# The goal of the pipeline.
goal: "Can generative AI models reduce the time required to make scientific discoveries in research?"

paths:  # directory path where the app will search for the prompt templates
  prompts: ${hydra:runtime.cwd}/prompts

pipeline:
  stages:
    - name: generation    # stage name (useful if you want to run the same agent at different points)
      cls: co_ai.agents.generation.GenerationAgent    # you can add custom classes
      enabled: true
      iterations: 1

    - name: reflection
      cls: co_ai.agents.reflection.ReflectionAgent
      enabled: true
      iterations: 1

    - name: ranking
      cls: co_ai.agents.ranking.RankingAgent
      enabled: true
      iterations: 2

This gives you full flexibility to:

  • โœ… Swap out models
  • ๐Ÿง  Inject different strategies
  • ๐Ÿ“Š Log everything
  • ๐Ÿ›  Tune behavior per stage

6. ๐Ÿ›ข๏ธ Postgres and the MemoryTool

The MemoryTool is the shared persistence layer that every agent reads from and writes to. This section explains why it is built on PostgreSQL and what each component stores.


๐Ÿ—„๏ธ Why We Chose a Database-Centric Design

The co_ai framework relies heavily on structured and evolving information: hypotheses, prompt versions, context states, and performance metrics. To make this information persistent, queryable, and extensible, we use PostgreSQL with pgvector for semantic search. Here is why we chose this route and how each component contributes to the pipeline:

๐Ÿ“ฆ Overview of Database Components

| Component | Purpose | Why It Matters |
|---|---|---|
| `hypotheses_store` | Stores generated hypotheses, confidence scores, reviews, and embeddings | Tracks hypothesis evolution, compares outputs across versions, and ranks by quality |
| `prompt_store` | Records each prompt version, agent, strategy, and associated goal | Enables reproducibility and analysis of which prompts yield better hypotheses |
| `context_states` | Saves the full pipeline context at each stage | Allows recovery, audit trails, and comparative runs with different agents/settings |
| `report_logger` | Tracks generated reports with summaries and run metadata | Useful for end-of-run outputs and dashboard summaries |
| `embedding_store` | Caches vector embeddings for fast similarity searches | Boosts performance for semantic search and clustering |

๐Ÿง  Why Not Just Use Files or In-Memory Storage?

While YAML/JSON files are great for flexibility and fast prototyping, we needed:

  • ๐Ÿ”„ Long-term memory โ€” Hypotheses and evaluations need to persist across runs.
  • ๐Ÿ“ˆ Version control โ€” Prompt refinement requires tracking iterations and improvements.
  • ๐Ÿ” Query capabilities โ€” Ranking and tuning are based on filtering and sorting large sets.
  • ๐Ÿ”— Relational integrity โ€” Hypotheses link to prompts, which link to agents and evaluations.

This structure also sets the stage for more advanced features like:

  • Similarity search across past hypotheses
  • Dataset-based prompt tuning
  • Interactive dashboards or admin panels

File: schema.sql


--- prompts table
CREATE TABLE IF NOT EXISTS prompts (
    id SERIAL PRIMARY KEY,
    agent_name TEXT NOT NULL,
    prompt_key TEXT NOT NULL,         -- e.g., generation_goal_aligned.txt
    prompt_text TEXT NOT NULL,
    goal TEXT,
    response_text TEXT,
    source TEXT,                      -- e.g., manual, dsp_refinement, feedback_injection
    version INT DEFAULT 1,
    is_current BOOLEAN DEFAULT FALSE,
    strategy TEXT,                    -- e.g., goal_aligned, out_of_the_box
    metadata JSONB DEFAULT '{}'::JSONB,
    timestamp TIMESTAMPTZ DEFAULT NOW()
);

-- Stores all generated hypotheses and their evaluations
CREATE TABLE IF NOT EXISTS hypotheses (
    id SERIAL PRIMARY KEY,
    goal TEXT NOT NULL,                 -- Research objective
    text TEXT NOT NULL,                 -- Hypothesis statement
    confidence FLOAT DEFAULT 0.0,       -- Confidence score (0-1 scale)
    review TEXT,                        -- Structured review data
    reflection TEXT,                    -- Structured reflection data
    elo_rating FLOAT DEFAULT 750.0,    -- Tournament ranking score
    embedding VECTOR(1024),             -- Vector representation of hypothesis
    features JSONB,                     -- Mechanism, rationale, experiment plan
    prompt_id INT REFERENCES prompts(id), -- Prompt used to generate this hypothesis
    source_hypothesis INT REFERENCES hypotheses(id), -- If derived from another
    strategy_used TEXT,                 -- e.g., goal_aligned, out_of_the_box
    version INT DEFAULT 1,              -- Evolve count
    source TEXT,                        -- e.g., manual, refinement, grafting
    enabled BOOLEAN DEFAULT TRUE,       -- Soft delete flag
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);
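
For completeness, storing a hypothesis row together with its embedding is straightforward with psycopg2; pgvector accepts a bracketed string literal cast to `vector`. The helper below is a sketch: the connection handling and the embedding itself are placeholders, not the exact co_ai code.

```python
# Sketch: inserting a hypothesis plus its embedding into the table above.
# `conn` is an open psycopg2 connection; how the embedding is produced is out of scope here.
def store_hypothesis(conn, goal: str, text: str, embedding: list[float]) -> int:
    vector_literal = "[" + ",".join(f"{v:.6f}" for v in embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO hypotheses (goal, text, embedding)
            VALUES (%s, %s, %s::vector)
            RETURNING id
            """,
            (goal, text, vector_literal),
        )
        new_id = cur.fetchone()[0]
    conn.commit()
    return new_id
```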

7. ๐Ÿ“ฆ Vector Memory with PostgreSQL + pgvector

Then I built a VectorMemory class to retrieve similar hypotheses:

File: co_ai/memory/vector_store.py

def get_similar_hypotheses(self, goal: str, limit: int = 5):
    """Get hypotheses from memory that are most relevant to the current goal."""
    try:
        goal_embedding = get_embedding(goal)

        # self.conn is the psycopg2 connection held by the vector store
        with self.conn.cursor() as cur:
            cur.execute("""
                SELECT text, review, elo_rating
                FROM hypotheses
                ORDER BY embedding <-> %s
                LIMIT %s
            """, (str(goal_embedding), limit))

            rows = cur.fetchall()

        return [
            {
                "text": row[0],
                "review": row[1] or "",
                "score": row[2] or 1000
            } for row in rows
        ]

    except Exception as e:
        print(f"[VectorMemory] Failed to fetch similar hypotheses: {e}")
        return []
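
In practice the Generation agent can call this before it builds its prompt, so that previously well-ranked ideas feed into the next batch. A small sketch of that wiring follows; only `get_similar_hypotheses` comes from the class above, the helper itself is illustrative:

```python
# Sketch: folding retrieved neighbours into the generation context.
# `memory` exposes get_similar_hypotheses() as shown above; the rest is illustrative.
def enrich_with_prior_hypotheses(memory, context: dict, limit: int = 3) -> dict:
    similar = memory.get_similar_hypotheses(context["goal"], limit=limit)
    prior = "\n".join(f"- (Elo {h['score']:.0f}) {h['text']}" for h in similar)
    if prior:
        context["literature"] = (
            context.get("literature", "") + "\n\nRelated past hypotheses:\n" + prior
        )
    return context
```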

๐Ÿ“„ Jinja2 & How It Powers Flexible Prompting

๐Ÿ’ก Motivation: Prompt Flexibility Is Scientific Freedom

One of the key goals when building co_ai was to make the system highly adaptable to different domains, preferences, and scientific workflows.

To do that, we needed a way to:

  • โœ… Inject dynamic values (goal, literature, preferences)
  • ๐Ÿงฉ Support conditional logic in prompts
  • ๐Ÿ”„ Allow users to define their own templates
  • ๐Ÿ“‹ Maintain structure while enabling customization
  • ๐Ÿ›  Integrate cleanly with Hydra config and agent logic

Thatโ€™s where Jinja2 templates came in.


๐Ÿงฑ How Jinja2 Templates Work in co_ai

We use Jinja2-style templating to build structured prompts dynamically.

Hereโ€™s an example from prompts/generation/goal_aligned.txt:

You are an expert researcher generating novel hypotheses.
Use inspiration from analogous domains or mechanisms to develop creative solutions.

Goal:
{{ goal }}

Literature:
{{ literature }}

{% if preferences %}
Preferences:
{% for p in preferences %}
- {{ p }}
{% endfor %}
{% endif %}

Instructions:
1. Review findings above before generating new hypotheses
2. Generate 3 distinct, testable hypotheses
3. Each must include mechanism, rationale, and experiment plan
4. {% if "goal_consistency" in preferences %}Prioritize direct alignment with the stated goal.{% endif %}
5. {% if "novelty" in preferences %}Focus on originality and unexpected connections.{% endif %}
6. {% if "feasibility" in preferences %}Ensure experiments can be realistically tested in the lab.{% endif %}
7. {% if "biological_plausibility" in preferences %}Make sure biological mechanisms are valid and well-explained.{% endif %}
8. {% if "simplicity" in preferences %}Favor clarity and simplicity over complexity.{% endif %}

In your agent logic:

def _build_prompt(self, context):
    literate = "\n".join([
        f"{i+1}. {r['title']}\n→ {r['summary']}"
        for i, r in enumerate(context.get("literature", []))
    ])

    return self.prompt_renderer.render(
        goal=context["goal"],
        literature=literate,
        preferences=context.get("preferences", [])
    )

๐Ÿš€ Advantages of Using Jinja2 for Prompting

Here are the main reasons we chose Jinja2 for prompt generation:

1. โœ… Dynamic Variable Injection

You can inject any variable into your prompt:

Goal:
{{ goal }}

Literature:
{{ literature }}

This makes it easy to generate prompts based on real-time context.

2. ๐Ÿงฉ Conditional Logic Based on Preferences

Want to change the prompt based on preference?

{% if "goal_consistency" in preferences %}
Prioritize direct alignment with the stated goal.
{% endif %}

Now your LLM follows instructions only if certain preferences are active.

3. ๐Ÿ”„ Clean Template Reuse Across Agents

Same template can be used by:

  • โœ… GenerationAgent
  • ๐Ÿง  EvolutionAgent
  • ๐Ÿ“Š Meta-reviewAgent

Just change the inputs โ†’ same prompt format works differently.

4. ๐Ÿ“‹ Full Traceability and Version Control

Because prompts are stored in files, you can:

  • โœ… Track prompt versions in git
  • ๐Ÿ“ˆ Compare old vs new prompts
  • ๐Ÿ›  Roll back if something breaks

5. ๐Ÿค– Local Model Compatibility

Works great with local models like Qwen3, Mistral, or Llama via Ollama:

  • No need for hosted APIs
  • Everything runs locally
  • Fully reproducible

6. ๐Ÿงฌ Feedback Injection from Reviews

Want to add strategic directions from Meta-review?

{% if meta_review.insights %}
Strategic Directions:
{% for point in meta_review.insights %}
- {{ point }}
{% endfor %}
{% endif %}

Now future generations incorporate insights automatically.
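
Wiring this up only requires that the meta-review output is present when the template is rendered. A sketch of that step; the `meta_review` / `insights` keys are assumptions about where the Meta-review agent stores its directions:

```python
# Sketch: passing meta-review insights into the render call.
# The "meta_review"/"insights" context keys are illustrative assumptions.
def render_with_meta_review(template, context: dict) -> str:
    return template.render(
        goal=context["goal"],
        literature=context.get("literature", ""),
        preferences=context.get("preferences", []),
        meta_review=context.get("meta_review", {"insights": []}),
    )
```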

7. ๐Ÿ“Š Easy Customization Without Code Changes

Users can modify .txt files directly:

  • โœ… Add new preferences
  • ๐Ÿง  Change instruction phrasing
  • ๐Ÿ“‹ Adjust output structure
  • ๐Ÿ›  Enable/disable sections

No coding required.


๐Ÿ“‹ Overview of Implemented Agents

Each component in the system plays a unique role in the pipeline. Below is a list of implemented agents and their responsibilities in our AI co-scientist system.

| Agent | Purpose | Why We Created It |
|---|---|---|
| `base.py` | Shared base class for all agents | Modular design with common utilities |
| `debate.py` | Simulates scientific debate for ranking | To evaluate and prioritize hypotheses |
| `evolution.py` | Evolves hypotheses using strategies like grafting or simplification | For continuous improvement |
| `generation.py` | Hypothesis generation after literature grounding | Core reasoning engine |
| `generic.py` | Template for custom agent creation | Supports a reusable agent pattern |
| `literature.py` | Performs web search and literature parsing | Grounds hypotheses in prior work |
| `meta_review.py` | Synthesizes insights from all reviews | Creates strategic feedback for future agents |
| `prompt_refiner.py` | Uses past outputs to improve prompts | Enables self-tuning via preference injection |
| `prompt_tuning.py` | Applies DSPy-based tuning to improve prompting | Builds better prompts over time |
| `proximity.py` | Measures similarity between hypotheses | Tracks evolution and improves ranking |
| `ranking.py` | Manages tournaments and compares hypotheses using Elo ratings | Determines which ideas are stronger |
| `reflection.py` | Analyzes hypotheses for correctness, novelty, and feasibility | Filters weak ideas early |
| `review.py` | Lighter version of the Reflection agent | Quick evaluation of hypothesis quality |

โœ… 1. Short Description of Each Agent

Here is a short description of each agent, based on both the paper and the implementation.

๐Ÿง  Generation Agent

The Generation Agent starts with a research goal and generates multiple hypotheses by synthesizing prior knowledge or exploring new directions.

It uses literature grounding when available, and supports strategies like:

  • Goal-aligned generation
  • Out-of-the-box thinking
  • Feasibility-focused prompting

From the paper:

โ€œThe Generation agent iteratively searches the web, retrieves and reads relevant research articles, and grounds its reasoning by summarizing prior work.โ€

๐Ÿ” Reflection Agent

The Reflection Agent simulates peer review. It evaluates hypothesis correctness, novelty, feasibility, and quality using structured prompts.

Supports multiple types of reflection:

  • Full review
  • Deep verification
  • Simulation-based testing
  • Observation-driven critique

From the paper:

โ€œThe Reflection agent performs full reviews, deep verification, observation reviews, and simulation reviews to ensure high-quality outputs.โ€

๐Ÿ“Š Ranking Agent

The Ranking Agent runs simulated scientific debates between hypotheses, assigning Elo ratings based on their performance.

Uses:

  • Direct comparison tournaments
  • Preference-weighted scoring
  • Strategic ranking based on novelty/feasibility/goal alignment

From the paper:

โ€œEach comparison concludes with the phrase ‘better hypothesis:<1 or 2>’โ€

๐Ÿ”„ Evolution Agent

The Evolution Agent refines top-ranked hypotheses using techniques like simplification, inspiration from other ideas, grafting, or out-of-the-box rethinking.

Key strategies include:

  • Enhancing clarity
  • Improving biological plausibility
  • Increasing testability

From the paper:

โ€œThe Evolution agent continuously refines and improves existing hypotheses and proposals using several approaches including enhancement through grounding, coherence and feasibility improvementsโ€ฆโ€

๐Ÿงฌ Meta-review Agent

The Meta-review Agent synthesizes insights from all agents into strategic directions that guide future generations.

It identifies common patterns in critiques and rankings, then injects them into future prompts.

From the paper:

โ€œThe Meta-review agent synthesizes insights from all reviews into strategic directions applicable to all agentsโ€

๐Ÿงฎ Proximity Agent

The Proximity Agent computes similarity between hypotheses to detect duplicates, track evolution, and identify promising clusters of related ideas.

Supports:

  • Hypothesis clustering
  • Redundancy elimination
  • Similarity tracking over time

From the paper:

โ€œThe dark gray arrows represent the information flow through the co-scientist system, while the red arrows represent the feedback loop between the specialized agents.โ€


๐Ÿงฑ 2. How They Fit Into the Pipeline

Hereโ€™s how the agents interact in sequence:

    flowchart LR
    subgraph InputStage [Input]
        A[๐ŸŽฏ Research Goal] --> B[๐Ÿ“š LiteratureAgent]
    end

    subgraph HypothesisGeneration [Hypothesis Generation]
        B --> C[๐Ÿงช GenerationAgent]
    end

    subgraph ReviewAndRanking [Evaluation & Ranking]
        C --> D[๐Ÿชž ReflectionAgent]
        D --> E[๐Ÿ… RankingAgent]
    end

    subgraph FeedbackLoop [Strategic Improvement]
        E --> F[๐Ÿง  MetaReviewAgent]
        F --> G[๐Ÿงฌ EvolutionAgent]
        G --> H[๐Ÿ› ๏ธ PromptTuningAgent]
        H --> C
        H --> I([๐Ÿ” Refined Generation])
    end

    subgraph OutputStage [Final Output]
        E --> J[๐Ÿ“ Final Hypotheses + Reports]
    end

    style InputStage fill:#f9f9f9,stroke:#333
    style HypothesisGeneration fill:#e6f7ff,stroke:#333
    style ReviewAndRanking fill:#fffbe6,stroke:#333
    style FeedbackLoop fill:#e6ffe6,stroke:#333
    style OutputStage fill:#f9f9f9,stroke:#333
  

Each stage contributes to building better hypotheses.

| Stage | Input | Output | Purpose |
|---|---|---|---|
| Generation | Goal + literature | Multiple hypotheses | Start with novel ideas |
| Reflection | Hypotheses + preferences | Structured reviews | Filter weak ideas |
| Ranking | Two hypotheses | The better one (via Elo) | Prioritize the best ideas |
| Evolution | Top-ranked hypothesis | Refined version | Improve clarity and feasibility |
| Meta-review | All reviews + rankings | Strategic directions | Inject insights into future agents |
| Prompt Tuning | Old prompt + ranked data | Improved prompt | Refine the prompting strategy |

This matches exactly what the paper describes:

โ€œFeedback from tournaments enables iterative improvementโ€

โ€œThe Meta-review agent synthesizes insights from all reviews into strategic directions applicable to all agentsโ€


๐Ÿ”„ 3. Feedback Loops Between Agents

These are the key feedback loops in the AI co-scientist system.

Loop 1: Peer Review โ†’ Prompt Refinement

    graph LR
    Generation --> Reflection
    Reflection --> PromptTuning
    PromptTuning --> Generation
  

How it works:

  • โœ… The Reflection Agent gives detailed reviews
  • ๐Ÿง  The PromptTuningAgent learns from those reviews
  • ๐Ÿ”„ The improved prompt is used in the next round

Supports Appendix A.2.4:

โ€œRefine the following conceptual idea, enhancing its practical implementability by leveraging contemporary technological capabilitiesโ€


Loop 2: Tournament โ†’ Strategic Directions

    graph LR
    Ranking --> MetaReview
    MetaReview --> Generation
    MetaReview --> Evolution
  

How it works:

  • ๐Ÿ“Š The Ranking Agent identifies which hypotheses win most often
  • ๐Ÿงฌ The Meta-review Agent extracts recurring themes and preferences
  • ๐ŸŽฏ These strategic directions are injected into future agents’ prompts

Matches whatโ€™s described in Appendix A.2.5:

โ€œThe Meta-review agent generates feedback applicable to all agents… simply appended to their prompts in the next iterationโ€”a capability facilitated by the long-context search and reasoning capabilities of the underlying Gemini 2.0 modelsโ€


Loop 3: Hypothesis โ†’ Memory โ†’ New Hypothesis

    graph LR
    Generation --> Memory
    Memory --> Generation
  

How it works:

  • ๐Ÿ“š The LiteratureAgent stores results in vector memory
  • ๐Ÿง  The GenerationAgent pulls similar past hypotheses during future runs
  • ๐Ÿ”„ This creates a self-improving loop where old ideas inform new ones

Supports what the paper describes:

โ€œAppendix Figure A.4 shows example prompts for comparing two hypotheses during a tournament matchโ€


๐Ÿ“Š 4. Visualization of Agent Interactions

๐Ÿงฉ Full Pipeline Diagram

    graph TB
    subgraph InputStage [Input]
        A[๐ŸŽฏ Research Goal + Preferences] --> B[๐Ÿ“š LiteratureAgent]
    end

    subgraph GenerationStage [Hypothesis Generation]
        B --> C[๐Ÿงช GenerationAgent]
    end

    subgraph ReviewStage [Evaluation & Critique]
        C --> D[๐Ÿชž ReflectionAgent]
    end

    subgraph RankingStage [Prioritization]
        D --> E[๐Ÿ† RankingAgent]
    end

    subgraph FeedbackLoop [Strategic Improvement]
        E --> F[๐Ÿง  MetaReviewAgent]
        F --> G[๐Ÿงฌ EvolutionAgent]
        G --> H[๐Ÿ”ง PromptTuningAgent]
        H --> C
    end

    subgraph OutputStage [Refined Output]
        E --> I[๐Ÿ“ Final Hypotheses + Reports]
    end

    style InputStage fill:#f9f9f9,stroke:#333
    style GenerationStage fill:#e6f7ff,stroke:#333
    style ReviewStage fill:#fffbe6,stroke:#333
    style RankingStage fill:#ffe6e6,stroke:#333
    style FeedbackLoop fill:#e6ffe6,stroke:#333
    style OutputStage fill:#f9f9f9,stroke:#333
  
  • ๐ŸŽฏ The pipeline starts with a goal and preferences, then uses the ๐Ÿง  LiteratureAgent to gather relevant research.
  • ๐Ÿงช The GenerationAgent proposes hypotheses based on this foundation.
  • ๐Ÿชž The ReflectionAgent performs structured reviews of hypothesis quality.
  • ๐Ÿ… The RankingAgent assigns Elo ratings and selects top performers.
  • ๐Ÿง  The MetaReviewAgent synthesizes insights into strategic directions.
  • ๐Ÿงฌ These feed into both the EvolutionAgent and ๐Ÿ› ๏ธ PromptTuningAgent, enabling continuous refinement.
  • ๐Ÿ” The improved prompts and strategies are fed back into the ๐Ÿงช GenerationAgent, creating a full feedback loop.

๐Ÿ”„ Feedback Injection Diagram

    graph LR
    subgraph CoScientistPipeline
        direction LR
        Gen[GenerationAgent] --> Refl[ReflectionAgent]
        Refl --> Rank[RankingAgent]
        Rank --> Meta[MetaReviewAgent]
        Meta --> Evo[EvolutionAgent]
        Meta --> Gen
        Evo --> Gen
    end
  
  • The Feedback Loop allows insights from reviews and rankings to be injected directly into future generations.
  • The Meta-reviewAgent plays a central role by synthesizing recurring themes.
  • Strategic directions are sent to the GenerationAgent (for better prompting) and the EvolutionAgent (for structural improvements).

๐Ÿ“ˆ Prompt Improvement Loop

    graph TB
    PromptLoader --> Generation
    Generation --> Reflection
    Reflection --> Ranking
    Ranking --> MetaReview
    MetaReview --> PromptTuning
    PromptTuning --> PromptLoader
  

๐Ÿ” Prompt Evolution Workflow

  • ๐Ÿงช The GenerationAgent creates hypotheses using the current version of the prompt.
  • ๐Ÿชž The ReflectionAgent and ๐Ÿ… RankingAgent evaluate the quality of these hypotheses through structured reviews and Elo scoring.
  • ๐Ÿง  The MetaReviewAgent identifies recurring patterns and strategic weaknesses in the critiques.
  • ๐Ÿ› ๏ธ The PromptTuningAgent uses those insights to refine the original prompt.
  • ๐Ÿ“ฅ The refined prompt is stored and reloaded via the PromptLoader, enabling the system to generate better hypotheses in the next cycle.

This iterative loop allows prompt evolution to adapt dynamically to performance signals, creating a self-improving hypothesis generation system.

๐Ÿ”„ Self-Improving Loop: From Goal to Insight

    graph TD
    A[Goal Input] --> B[GenerationAgent generates hypotheses]
    B --> C[Prompt stored in vector memory]
    C --> D[ReflectionAgent critiques them]
    D --> E[RankingAgent compares using Elo-style tournament]
    E --> F[PromptTuningAgent tunes future generations]
    F --> G[Meta-reviewAgent synthesizes insights]
    G --> H[EvolutionAgent refines hypotheses]
    H --> B
  
  • ๐ŸŽฏ It starts with a research goal, passed to the ๐Ÿงช GenerationAgent.
  • ๐Ÿง  Prompts are stored in vector memory, ensuring traceability and enabling future retrieval or refinement.
  • ๐Ÿชž Each hypothesis is reviewed for correctness, feasibility, and alignment with research goals.
  • ๐ŸŒฑ Top-ranked hypotheses undergo further refinement using inspiration, simplification, and preference-driven tuning.
  • ๐Ÿง  The MetaReviewAgent aggregates reviews and rankings to uncover strategic insights.
  • ๐Ÿ” These insights feed back into the next generation of prompts and hypotheses โ€” completing the learning loop.

Over time, the system becomes smarter and more aligned with your research preferences and objectives.

This creates a full feedback loop:

โ€œFeedback from tournaments enables iterative improvementโ€

โ€œStrategic directions guide future generationsโ€


๐Ÿš€ Overview: How PromptTuningAgent Improves Scientific Prompts with DSPy

The PromptTuningAgent is a smart component in the co_ai pipeline that uses LLMs and DSPy to automatically improve prompts used for generating scientific hypotheses. It uses few-shot learning, prompt evaluation, and feedback loops to refine prompts based on real data collected from earlier runs.

Here’s how it works:

1. ๐Ÿ“ Signature Definition

We start by defining a PromptTuningSignature. This tells DSPy what inputs and outputs to expect:

  • The goal (e.g. “What happens if the US defaults?”)
  • The original prompt used to generate hypotheses
  • The best hypothesis produced
  • A review and a score of that hypothesis
  • The output: a refined version of the prompt

2. ๐Ÿง  Initialize the Agent

When PromptTuningAgent is initialized:

  • It loads configuration values (e.g. how many examples to train on).
  • It sets up a connection to a local LLM through Ollama.
  • It configures DSPy with this LLM so it can use it for training.

3. ๐Ÿš€ Agent Execution: run(context)

When the pipeline runs this agent:

  • It extracts the goal from the context.
  • It pulls a mix of recent prompts + their hypothesis results from memory (used for training and validation).
  • If no data is found, it logs the issue and exits gracefully.

4. ๐ŸŽฏ Training the Prompt Refiner

  • It creates DSPy Examples from the training data.
  • It defines a custom scoring function (_prompt_quality_metric) that evaluates if a new prompt is better than the original.
  • It compiles a new tuned_program using BootstrapFewShot with the training data and scoring metric.

5. ๐Ÿ” Generating New Prompts

Using the tuned DSPy program, it runs the validation examples through the newly trained model:

  • For each validation sample, it generates a refined prompt.
  • It saves each refined prompt to the database, recording metadata like strategy and version.
  • It logs each prompt and stores it in the contextโ€™s prompt_history.

6. ๐Ÿ“Š Evaluating Prompt Quality

The custom scoring function (_prompt_quality_metric) runs like this:

  • It generates hypotheses using the original and the refined prompts.
  • It builds a comparison prompt to evaluate which prompt performed better.
  • It parses the comparison response (looking for `better prompt:<A or B>`) and returns a score.
  • All intermediate steps are logged for transparency and debugging.

7. ๐Ÿ’ก Why This Matters

This class enables automated prompt evolution. Instead of manually tweaking prompts and checking their results, the agent:

  • Trains itself using past runs.
  • Evaluates prompts automatically.
  • Improves prompts continuously based on actual hypothesis quality.

Itโ€™s a self-improving loop that can adapt as your data or scientific objectives evolve.


from abc import ABC, abstractmethod

import re
import dspy
from dspy import Predict, Signature, InputField, OutputField, Example, BootstrapFewShot

from co_ai.agents.base import BaseAgent
from co_ai.constants import GOAL


# DSPy signature for prompt refinement: defines input/output fields for tuning
class PromptTuningSignature(Signature):
    goal = InputField(desc="Scientific research goal or question")
    input_prompt = InputField(desc="Original prompt used to generate hypotheses")
    hypotheses = InputField(desc="Best hypothesis generated")
    review = InputField(desc="Expert review of the hypothesis")
    score = InputField(desc="Numeric score evaluating the hypothesis quality")
    refined_prompt = OutputField(desc="Improved version of the original prompt")


# Simple evaluation result class to return from evaluator
class EvaluationResult:
    def __init__(self, score: float, reason: str):
        self.score = score
        self.reason = reason


# Base evaluator interface (not used directly, but useful for future extensions)
class BaseEvaluator(ABC):
    @abstractmethod
    def evaluate(
        self, original: str, proposal: str, metadata: dict = None
    ) -> EvaluationResult:
        pass


# DSPy-based evaluator that can run a Chain-of-Thought program
class DSPyEvaluator(BaseEvaluator):
    def __init__(self):
        self.program = dspy.ChainOfThought(PromptTuningSignature)

    def evaluate(
        self, original: str, proposal: str, metadata: dict = None
    ) -> EvaluationResult:
        result = self.program(
            goal=metadata["goal"],
            input_prompt=original,
            hypotheses=metadata["hypotheses"],
            review=metadata.get("review", ""),
            score=metadata.get("score", 750),
        )
        try:
            score = float(result.score)
        except (ValueError, TypeError):
            score = 0.0
        return EvaluationResult(score=score, reason=result.explanation)


# Main agent class responsible for training and tuning prompts using DSPy
class PromptTuningAgent(BaseAgent):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.agent_name = cfg.get("name", "prompt_tuning")
        self.prompt_key = cfg.get("prompt_key", "default")
        self.sample_size = cfg.get("sample_size", 20)
        self.generate_count = cfg.get("generate_count", 10)
        self.current_version = cfg.get("version", 1)

        # Configure DSPy with local LLM (Ollama)
        lm = dspy.LM(
            "ollama_chat/qwen3",
            api_base="http://localhost:11434",
            api_key="",
        )
        dspy.configure(lm=lm)

    async def run(self, context: dict) -> dict:
        goal = context.get(GOAL, "")
        generation_count = self.sample_size + self.generate_count
        self.logger.log(
            "PromptTuningExamples",
            {"samples size": self.sample_size, "generation count": generation_count},
        )

        # Get training + validation data
        examples = self.memory.prompt.get_prompt_training_set(goal, generation_count)
        train_data = examples[: self.sample_size]
        val_data = examples[self.sample_size :]

        if not examples:
            self.logger.log(
                "PromptTuningSkipped", {"reason": "no_training_data", "goal": goal}
            )
            return context

        # Build training set for DSPy
        training_set = [
            Example(
                goal=item["goal"],
                input_prompt=item["prompt_text"],
                hypotheses=item["hypothesis_text"],
                review=item.get("review", ""),
                score=item.get("elo_rating", 1000),
            ).with_inputs("goal", "input_prompt", "hypotheses", "review", "score")
            for item in train_data
        ]

        # Wrap our scoring metric so we can inject context during tuning
        def wrapped_metric(example, pred, trace=None):
            return self._prompt_quality_metric(example, pred, context=context)

        # Train prompt-tuning program
        tuner = BootstrapFewShot(metric=wrapped_metric)
        student = Predict(PromptTuningSignature)
        tuned_program = tuner.compile(student=student, trainset=training_set)

        # Use tuned program to generate and store new refined prompt
        await self.generate_and_store_refined_prompts(
            tuned_program, goal, context, val_data
        )
        self.logger.log(
            "PromptTuningCompleted",
            {
                "goal": goal,
                "example_count": len(training_set),
                "generated_count": len(val_data),
            },
        )

        return context

    async def generate_and_store_refined_prompts(
        self, tuned_program, goal: str, context: dict, val_set
    ):
        """
        Generate refined prompts using the tuned DSPy program and store them in the database.

        Args:
            tuned_program: A compiled DSPy program capable of generating refined prompts.
            goal: The scientific goal for this run.
            context: Shared pipeline state.
            val_set: Validation examples to run through the tuned program.
        """

        stored_count = 0
        for i, example in enumerate(val_set):
            try:
                # Run DSPy program on new example
                result = tuned_program(
                    goal=example["goal"],
                    input_prompt=example["prompt_text"],
                    hypotheses=example["hypothesis_text"],
                    review=example.get("review", ""),
                    score=example.get("elo_rating", 1000),
                )

                refined_prompt = result.refined_prompt.strip()

                # Store refined prompt to the DB
                self.memory.prompt.save(
                    goal=example["goal"],
                    agent_name=self.name,
                    prompt_key=self.prompt_key,
                    prompt_text=refined_prompt,
                    response=None,
                    strategy="refined_via_dspy",
                    version=self.current_version + 1,
                )

                stored_count += 1

                # Update context with prompt history
                self.add_to_prompt_history(
                    context, refined_prompt, {"original": example["prompt_text"]}
                )

                self.logger.log(
                    "TunedPromptStored",
                    {"goal": goal, "refined_snippet": refined_prompt[:100]},
                )

            except Exception as e:
                self.logger.log(
                    "TunedPromptGenerationFailed",
                    {"error": str(e), "example_snippet": str(example)[:100]},
                )

        self.logger.log(
            "BatchTunedPromptsComplete", {"goal": goal, "count": stored_count}
        )

    def _prompt_quality_metric(self, example, pred, context: dict) -> float:
        """Run both prompts and compare results"""
        try:
            prompt_a = example.input_prompt
            prompt_b = pred.refined_prompt
            self.logger.log(
                "PromptQualityCompareStart",
                {
                    "prompt_a_snippet": prompt_a[:100],
                    "prompt_b_snippet": prompt_b[:100],
                },
            )

            hypotheses_a = self.call_llm(prompt_a, context)
            self.logger.log(
                "PromptAResponseGenerated", {"hypotheses_a_snippet": hypotheses_a[:200]}
            )

            hypotheses_b = self.call_llm(prompt_b, context)
            self.logger.log(
                "PromptBResponseGenerated", {"hypotheses_b_snippet": hypotheses_b[:200]}
            )

            # Run comparison
            merged = {
                **context,
                **{
                    "prompt_a": prompt_a,
                    "prompt_b": prompt_b,
                    "hypotheses_a": hypotheses_a,
                    "hypotheses_b": hypotheses_b,
                },
            }
            comparison_prompt = self.prompt_loader.load_prompt(self.cfg, merged)
            self.logger.log(
                "ComparisonPromptConstructed",
                {"comparison_prompt_snippet": comparison_prompt[:200]},
            )

            response = self.call_llm(comparison_prompt, context)
            self.logger.log(
                "ComparisonResponseReceived", {"response_snippet": response[:200]}
            )

            match = re.search(r"better prompt:<([AB])>", response, re.IGNORECASE)
            if match:
                choice = match.group(1).upper()
                score = 1.0 if choice == "B" else 0.5
                self.logger.log(
                    "PromptComparisonResult", {"winner": choice, "score": score}
                )
                return score
            else:
                self.logger.log("PromptComparisonNoMatch", {"response": response})
                return 0.0
        except Exception as e:
            self.logger.log(
                "PromptQualityMetricError",
                {
                    "error": str(e),
                    "example_input_prompt_snippet": example.input_prompt[:100],
                    "refined_prompt_snippet": getattr(pred, "refined_prompt", "")[:100],
                },
            )
            return 0.0

๐Ÿ“ JSON Logging: Tracking Every Step

Every run of the co_ai pipeline is automatically logged in a structured JSON Lines (.jsonl) file. This makes it easy to audit, debug, and analyze the behavior of the system over time.

๐Ÿ” How It Works

  • A unique run_id is generated at the start of each pipeline execution.

  • This ID is used to create a dedicated log file stored under logs/, such as:

    logs/run_us_debt_analysis_20240516_130245.jsonl
    
  • Each agent and component in the system (generation, ranking, reflection, etc.) logs its actions using a consistent structure:

    {
      "timestamp": "2025-05-16T13:02:45.123Z",
      "event_type": "GeneratedHypotheses",
      "data": {
        "goal": "Can generative AI models reduce the time required to make scientific discoveries in biomedical research?",
        "snippet": "Hypothesis 1: ..."
      }
    }
    
  • These events are emitted through the JSONLogger class and tagged with emojis and event types for easy tracking.
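
The logger itself can stay very small. Here is a minimal sketch of a JSON Lines logger along these lines; the real JSONLogger in co_ai may differ in its details:

```python
# Minimal JSON Lines logger sketch; the real co_ai JSONLogger may differ.
import json
from datetime import datetime, timezone
from pathlib import Path

class JSONLogger:
    def __init__(self, run_id: str, log_dir: str = "logs"):
        Path(log_dir).mkdir(parents=True, exist_ok=True)
        self.path = Path(log_dir) / f"{run_id}.jsonl"

    def log(self, event_type: str, data: dict):
        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "event_type": event_type,
            "data": data,
        }
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

logger = JSONLogger("run_us_debt_analysis_20240516_130245")
logger.log("GeneratedHypotheses", {"goal": "...", "snippet": "Hypothesis 1: ..."})
```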

โœ… Why It Matters

  • ๐Ÿ“œ Transparency โ€“ Every stage of the reasoning process is recorded and can be revisited.
  • ๐Ÿ› Debugging โ€“ If something goes wrong, the log reveals where and why.
  • ๐Ÿ“Š Experiment Tracking โ€“ Logs form the foundation for analyzing pipeline performance and tuning over time.

This approach ensures that the entire scientific reasoning process is not a black box, but a transparent and reproducible workflow.


๐Ÿ’พ Context Persistence: Save and Resume at Any Step

One of the key architectural decisions in co_ai is persistent context storage. At every stage of the pipeline, the system can store the full pipeline context: a structured representation of everything the system knows so far.

๐Ÿง  What Is “Context”?

The context is a Python dictionary containing:

  • The research goal
  • literature findings
  • hypotheses generated so far
  • reviews, reflections, rankings, and more

It flows through each stage like a growing memory of the reasoning process.

๐Ÿ—‚ How Context Is Stored

Each agent has config options such as:

save_context: true
skip_if_completed: true

When save_context is enabled:

  • The full context is saved to the PostgreSQL database in the context_states table.
  • Metadata such as run_id, stage_name, version, and preferences are stored alongside it.
  • The most recent version is flagged with is_current = true.

This lets you resume a run, skip steps that have already completed, or inspect intermediate pipeline states.
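
A rough sketch of the save and resume logic looks like this. The table is the `context_states` table described above, but the column names and helpers are assumptions rather than the exact co_ai code:

```python
# Illustrative save/load helpers for the context_states table.
# Column names are assumptions based on the description above.
from psycopg2.extras import Json

def save_context(conn, run_id: str, stage_name: str, context: dict):
    with conn.cursor() as cur:
        # Older versions for this stage are no longer current.
        cur.execute(
            "UPDATE context_states SET is_current = FALSE WHERE run_id = %s AND stage_name = %s",
            (run_id, stage_name),
        )
        cur.execute(
            "INSERT INTO context_states (run_id, stage_name, context, is_current) "
            "VALUES (%s, %s, %s, TRUE)",
            (run_id, stage_name, Json(context)),
        )
    conn.commit()

def load_completed_context(conn, run_id: str, stage_name: str):
    with conn.cursor() as cur:
        cur.execute(
            "SELECT context FROM context_states "
            "WHERE run_id = %s AND stage_name = %s AND is_current = TRUE",
            (run_id, stage_name),
        )
        row = cur.fetchone()
    return row[0] if row else None  # used when skip_if_completed is enabled
```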

๐Ÿ” How It Works in Practice

  1. After each agent runs, it saves the new context to the database if save_context is enabled.

  2. Before running, the agent checks if a completed version already exists:

    • If skip_if_completed = true, it loads the saved context and skips computation.
    • If not, it runs normally and updates the context.

โœ… Why This Matters

  • ๐Ÿ”„ Restartable Runs โ€“ You can resume from any stage without rerunning earlier steps.
  • ๐Ÿ” Debugging and Exploration โ€“ You can load a saved context and inspect it for errors or insights.
  • ๐Ÿงช A/B Testing and Iteration โ€“ Run multiple strategies or prompt versions against the same stored context.

This persistent, stage-wise approach makes co_ai robust, debuggable, and suitable for real scientific workflows.


๐Ÿš€ Getting Started (Try It Locally)

You can run the full co_ai system locally in just a few steps:

1. **Clone the repo**

   ```bash
   git clone https://github.com/ernanhughes/co-ai.git
   cd co-ai
   ```

2. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

3. **Start Ollama with a local model**

   ```bash
   ollama run qwen3
   ```

4. **Install PostgreSQL**

- Download the installer from https://www.postgresql.org/download/windows/
- Use the installer to install PostgreSQL and pgAdmin.
- During setup, set:
  - Username: `postgres`
  - Password: (choose something and remember it)


- Enable the pgvector extension.

  I explained this in a previous post:
  [PostgreSQL for AI: Storing and Searching Embeddings with pgvector](/post/pgvector/)

- Create a database.

  For example, you can use `co_ai` as the name so it matches the connection settings below.

- In your `config.yaml`, change the DB connection information to match your database:

```yaml
db:
  driver: psycopg2
  user: postgres
  password: yourpassword
  host: localhost
  port: 5432
  database: co_ai
```

- Load the `schema.sql` file.

  At the root of the project there is a [schema.sql](https://github.com/ernanhughes/co-ai/blob/main/schema.sql) file.
  This will create the tables required to run the application in your database.


5. **Run the pipeline**

```bash
python -m co_ai.main goal="What happens if the USA defaults on its debt?"
```

---

### ๐ŸŽฏ Example: Before and After Prompt Tuning

To make the benefit of tuning concrete, here is the shape of a before/after comparison for a single goal (hypothesis text elided):

> **Goal:** The USA is about to default on its national debt.

**Original prompt:**

```text
You are an expert researcher generating hypotheses...
```

Generated hypotheses:

  • Hypothesis 1: [text]
  • Hypothesis 2: [text]

**Refined prompt:**

```text
You are an economic strategist focused on sovereign default scenarios...
```

Improved hypotheses:

  • Hypothesis 1: [text]
  • Hypothesis 2: [text]

Seeing the prompts side by side shows the real benefit of tuning.


---

## ๐Ÿงฉ How to Extend `co_ai`

This system is fully modular โ€” here are some ways to extend it:

- ๐Ÿ” **Swap out Ollama for another LLM backend**
- ๐Ÿ“š **Use academic papers from Semantic Scholar instead of web search**
- ๐Ÿงช **Add a โ€œSimulationAgentโ€ to test hypotheses**
- ๐Ÿง  **Connect to LangChain tools or Autogen agents**
- ๐Ÿ“ˆ **Use more advanced evaluation metrics (BLEU, ROUGE, etc.)**

You can also add your own agent by subclassing `BaseAgent` and defining a `run()` method.
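
For instance, the hypothetical SimulationAgent mentioned above could look roughly like this. The constructor and async `run(context)` signature follow the agents shown earlier in the post, while the body is purely illustrative:

```python
# Sketch of a custom agent. Constructor and run() signature follow the agents
# shown earlier in this post; the simulation logic itself is illustrative.
from co_ai.agents.base import BaseAgent

class SimulationAgent(BaseAgent):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.trials = cfg.get("trials", 3)

    async def run(self, context: dict) -> dict:
        simulations = []
        for hypothesis in context.get("hypotheses", []):
            # Ask the LLM to role-play the proposed experiment and report outcomes.
            prompt = (
                f"Simulate the following experiment {self.trials} times "
                f"and summarise the likely outcomes:\n{hypothesis}"
            )
            simulations.append(self.call_llm(prompt, context))
        context["simulations"] = simulations
        return context
```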

๐Ÿ’ฌ Questions or Feedback?

This is an open-source research tool. If youโ€™d like to:

  • Contribute a new agent
  • Report a bug
  • Suggest improvements
  • Use this in your lab or project

Feel free to open an issue or reach out at GitHub.


๐Ÿ”— References

Towards an AI Co-Scientist

๐Ÿ“Ž Appendix

Repo: github.com/ernanhughes/co
Core Tools: DSPy, pgvector, Ollama, Searxng Search
Format: Local Python modules using Hydra and async I/O