Building an AI Co-Scientist


This is the first post in a 100-part series where we take breakthrough AI papers and turn them into working code, building the next generation of AI, one idea at a time.

๐Ÿงพ Summary

In this post, I'll walk through how I implemented the ideas from Towards an AI Co-Scientist into a working system called co_ai.

The paper presents a vision for an AI system that collaborates with scientists by generating hypotheses, engaging in peer review, ranking proposals, and evolving them over time. Inspired by this work, I've built a modular, open-source version of the co-scientist architecture using local tools like:

  • โœ… Ollama (for running Qwen3, Mistral, etc.)
  • โœ… DSPy (for prompt optimization)
  • โœ… Hydra (for configuration management, dynamic extension)
  • โœ… pgvector (for vector memory + retrieval)
  • โœ… PostgreSQL (database)

๐Ÿ”ฌ What Is co_ai?

co_ai is a working implementation of the AI co-scientist concept introduced in the recent DeepMind paper. The core idea? Build a multi-agent system that mimics scientific reasoning: proposing hypotheses, debating their validity, refining them, and evolving toward better explanations.

The system uses a modular agent pipeline:

    flowchart LR
    A[๐ŸŽฏ Goal] --> B[๐Ÿงช Generation Agent]
    B --> C[๐Ÿชž Reflection Agent]
    C --> D[๐Ÿ† Ranking Agent]
    D --> E[๐Ÿงฌ Evolution Agent]
    E --> F[๐Ÿง  Meta-review Agent]
  

Each step has a specific role:

  • โœ… Generation Agent: Proposes multiple hypotheses grounded in literature.
  • ๐Ÿง  Reflection Agent: Critiques each hypothesis for correctness, novelty, and feasibility.
  • ๐Ÿ“Š Ranking Agent: Uses simulated debates to rank and refine ideas.
  • ๐Ÿ”„ Evolution Agent: Evolves promising hypotheses using simplification or inspiration.
  • ๐Ÿงฌ Meta-review Agent: Synthesizes insights across all reviews into strategic directions applicable to all agents.

๐Ÿ›  Key Design Decisions & Why They Matter

1. โœ… Local Execution via Ollama

One of the most important choices was to use local LLMs via Ollama. This ensures:

  • No dependency on cloud APIs
  • Full reproducibility
  • No cost to your research
  • Better control over inference settings (temperature, max tokens, etc.)

Example config:

model:
  name: ollama/qwen3
  api_base: http://localhost:11434
  api_key:

Any model available through Ollama can be used here: just change the model name in the config.
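
If you want to sanity-check the local endpoint outside of co_ai, a quick call to Ollama's HTTP API is enough. This is a standalone sketch, assuming Ollama is running on its default port and the qwen3 model has already been pulled:

```python
# Standalone sanity check against a local Ollama instance (not part of co_ai).
# Assumes Ollama is listening on its default port and `qwen3` has been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3",
        "prompt": "In one sentence, what is an AI co-scientist?",
        "stream": False,  # ask for a single JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```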


2. ๐Ÿง  Preference-Driven Prompting

A major innovation from the paper is the use of preferences to guide hypothesis quality.

I implemented this cleanly using Hydra config:

File: configs/agents/generation.yaml

This is a sample of one of the agent configuration files:

generation:
  name: generation
  enabled: true
  strategy: goal_aligned
  preferences:
    - goal_consistency
    - biological_plausibility
    - experimental_validity
    - novelty
    - simplicity

These are injected into prompts dynamically using Jinja2 templates:

This is a sample of one of the agent prompt template files.

File: prompts/generation/goal_aligned.txt

You are an expert researcher generating novel scientific hypotheses.
Use inspiration from analogous domains or mechanisms to develop creative solutions.

Goal:
{{ goal }}

Literature:
{{ literature }}

Preferences:
{% for p in preferences %}
- {{ p }}
{% endfor %}

Instructions:
1. Review findings above before generating new hypotheses
2. Generate 3 distinct, testable hypotheses
3. Each must include mechanism, rationale, and experiment plan
4. {% if "goal_consistency" in preferences %}Prioritize direct alignment with the stated goal.{% endif %}
5. {% if "novelty" in preferences %}Focus on originality and unexpected connections.{% endif %}
6. {% if "feasibility" in preferences %}Ensure experiments can be realistically tested in the lab.{% endif %}
7. {% if "biological_plausibility" in preferences %}Make sure biological mechanisms are valid and well-explained.{% endif %}
8. {% if "simplicity" in preferences %}Favor clarity and simplicity over complexity.{% endif %}

This lets users tune agent behavior without changing code โ€” just by adjusting preferences.
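
Under the hood this is plain Jinja2 rendering. Here is a minimal sketch of how a preference list drives the template above; the file path and example values are illustrative rather than the exact co_ai call sites:

```python
# Minimal sketch of preference-driven prompt rendering with Jinja2.
# The template path and example values are illustrative.
from jinja2 import Template

with open("prompts/generation/goal_aligned.txt", encoding="utf-8") as f:
    template = Template(f.read())

prompt = template.render(
    goal="Can generative AI models reduce the time required to make scientific discoveries?",
    literature="1. Example paper -> one-line summary of its findings",
    preferences=["goal_consistency", "novelty", "simplicity"],
)
# The conditional instructions for feasibility and biological_plausibility
# render empty because those preferences are not in the list.
print(prompt)
```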


3. ๐Ÿ”„ Self-Improving Loop via DSPy

To implement the feedback loop described in the paper:

โ€œFeedback from tournaments enables iterative improvement, creating a self-improving loop toward novel and high-quality outputsโ€

I added a PromptRefinerAgent that:

  • Pulls top-ranked hypotheses from memory
  • Uses DSPyโ€™s BootstrapFewShot to refine prompts
  • Stores only improved versions
  • Logs everything for traceability

File: co_ai/agents/prompt_refiner.py

class PromptRefinerAgent(BaseAgent):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.agent_name = cfg.get("target_agent", "generation")
        self.strategy = cfg.get("strategy", "basic_refinement")

    async def run(self, context: dict) -> dict:
        few_shot_data = self.memory.get_ranked_hypotheses(context["goal"], limit=5)
        
        refined_prompt = self._refine_with_dspy(context, few_shot_data)
        
        old_score = self._evaluate_prompt(context["prompt"], few_shot_data)
        new_score = self._evaluate_prompt(refined_prompt, few_shot_data)

        if new_score > old_score:
            self.memory.store_prompt_version(
                agent_name=self.agent_name,
                prompt_key=self.strategy,
                prompt_text=refined_prompt,
                source="dsp_refinement",
                version=context.get("prompt_version", 1) + 1,
                metadata={"few_shot_count": len(few_shot_data)}
            )
            context["prompt"] = refined_prompt
        return context

Now the system improves its own prompting strategy based on real-world performance, creating exactly the loop the paper describes:

“Feedback from tournaments enables iterative improvement, creating a self-improving loop toward novel and high-quality outputs”


4. ๐Ÿ“‹ Structured Output + Memory System

All agents generate structured output so future stages can parse and reason about them.

For example:

# Hypothesis 1
Mechanism: Inhibition of IRE1α signaling reduces tumor viability in AML cells.

# Hypothesis 2
Mechanism: KIRA6 induces apoptosis in MOLM13 cells through ER stress pathways.

# Hypothesis 3
Mechanism: Targeting UPR pathways enhances chemosensitivity in FLT3-ITD+ leukemias.

This format makes it easy to extract, compare, and evolve hypotheses.

From the paper:

โ€œAppendix Figure A.4 shows example prompts for comparing two hypotheses during a tournament matchโ€

So we built a consistent structure that supports:

  • โœ… Parsing
  • ๐Ÿงฑ Comparison
  • ๐Ÿค– Refinement
  • ๐Ÿ“‹ Logging
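
Because the shape is predictable, a single regex pass is enough to pull the hypotheses back out. Here is a small illustrative parser; the production pattern lives in the agent config as `extraction_regex`, and this one is simplified:

```python
# Illustrative parser for the structured hypothesis format shown above.
# The real pattern is configured per agent (extraction_regex); this is simplified.
import re

raw_output = """# Hypothesis 1
Mechanism: Inhibition of IRE1α signaling reduces tumor viability in AML cells.

# Hypothesis 2
Mechanism: KIRA6 induces apoptosis in MOLM13 cells through ER stress pathways.
"""

pattern = re.compile(r"# Hypothesis (\d+)\nMechanism: (.+)")
hypotheses = [
    {"id": int(num), "mechanism": mechanism.strip()}
    for num, mechanism in pattern.findall(raw_output)
]
print(hypotheses)
```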

5. ๐Ÿงฉ Modular Architecture with Configurable Agents

Using Hydra-based config, each agent can be enabled/disabled and configured independently.

File: configs/agents/generation.yaml

defaults:
  - /model/qwen3
  - /prompt_refiner/disabled

generation:
  name: generation
  enabled: true        
  save_prompt: true
  save_context: true
  skip_if_completed: true
  strategy: goal_aligned
  input_keys: ["goal", "literature"]
  output_key: hypotheses
  prompt_mode: file
  prompt_file: goal_aligned.txt
  extraction_regex: "Hypothesis \\d+:\\n(.+?)\\n"

And in the pipeline config:

File: configs/pipeline/default.yaml


# The goal of the pipeline.
goal: "Can generative AI models reduce the time required to make scientific discoveries in research?"

paths:  # directory path where the app will search for the prompt templates
  prompts: ${hydra:runtime.cwd}/prompts

pipeline:
  stages:
    - name: generation    # stage name (useful if you want to run the same agent at different points)
      cls: co_ai.agents.generation.GenerationAgent    # you can add custom classes
      enabled: true
      iterations: 1

    - name: reflection
      cls: co_ai.agents.reflection.ReflectionAgent
      enabled: true
      iterations: 1

    - name: ranking
      cls: co_ai.agents.ranking.RankingAgent
      enabled: true
      iterations: 2

This gives you full flexibility to:

  • โœ… Swap out models
  • ๐Ÿง  Inject different strategies
  • ๐Ÿ“Š Log everything
  • ๐Ÿ›  Tune behavior per stage

6. ๐Ÿ›ข๏ธ Postgres and the MemoryTool

The MemoryTool is the shared persistence layer that every agent reads from and writes to. This section explains why it is built on PostgreSQL and what each component stores.


๐Ÿ—„๏ธ Why We Chose a Database-Centric Design

The co_ai framework relies heavily on structured and evolving information: hypotheses, prompt versions, context states, and performance metrics. To make this information persistent, queryable, and extensible, we use PostgreSQL with pgvector for semantic search. Here is why we chose this route and how each component contributes to the pipeline:

๐Ÿ“ฆ Overview of Database Components

| Component | Purpose | Why It Matters |
|---|---|---|
| `hypotheses_store` | Stores generated hypotheses, confidence scores, reviews, and embeddings | Tracks hypothesis evolution, compares outputs across versions, and ranks by quality |
| `prompt_store` | Records each prompt version, agent, strategy, and associated goal | Enables reproducibility and analysis of which prompts yield better hypotheses |
| `context_states` | Saves the full pipeline context at each stage | Allows recovery, audit trails, and comparative runs with different agents/settings |
| `report_logger` | Tracks generated reports with summaries and run metadata | Useful for end-of-run outputs and dashboard summaries |
| `embedding_store` | Caches vector embeddings for fast similarity searches | Boosts performance for semantic search and clustering |

๐Ÿง  Why Not Just Use Files or In-Memory Storage?

While YAML/JSON files are great for flexibility and fast prototyping, we needed:

  • ๐Ÿ”„ Long-term memory โ€” Hypotheses and evaluations need to persist across runs.
  • ๐Ÿ“ˆ Version control โ€” Prompt refinement requires tracking iterations and improvements.
  • ๐Ÿ” Query capabilities โ€” Ranking and tuning are based on filtering and sorting large sets.
  • ๐Ÿ”— Relational integrity โ€” Hypotheses link to prompts, which link to agents and evaluations.

This structure also sets the stage for more advanced features like:

  • Similarity search across past hypotheses
  • Dataset-based prompt tuning
  • Interactive dashboards or admin panels

File: schema.sql


--- prompts table
CREATE TABLE IF NOT EXISTS prompts (
    id SERIAL PRIMARY KEY,
    agent_name TEXT NOT NULL,
    prompt_key TEXT NOT NULL,         -- e.g., generation_goal_aligned.txt
    prompt_text TEXT NOT NULL,
    goal TEXT,
    response_text TEXT,
    source TEXT,                      -- e.g., manual, dsp_refinement, feedback_injection
    version INT DEFAULT 1,
    is_current BOOLEAN DEFAULT FALSE,
    strategy TEXT,                    -- e.g., goal_aligned, out_of_the_box
    metadata JSONB DEFAULT '{}'::JSONB,
    timestamp TIMESTAMPTZ DEFAULT NOW()
);

-- Stores all generated hypotheses and their evaluations
CREATE TABLE IF NOT EXISTS hypotheses (
    id SERIAL PRIMARY KEY,
    goal TEXT NOT NULL,                 -- Research objective
    text TEXT NOT NULL,                 -- Hypothesis statement
    confidence FLOAT DEFAULT 0.0,       -- Confidence score (0-1 scale)
    review TEXT,                        -- Structured review data
    reflection TEXT,                    -- Structured reflection data
    elo_rating FLOAT DEFAULT 750.0,    -- Tournament ranking score
    embedding VECTOR(1024),             -- Vector representation of hypothesis
    features JSONB,                     -- Mechanism, rationale, experiment plan
    prompt_id INT REFERENCES prompts(id), -- Prompt used to generate this hypothesis
    source_hypothesis INT REFERENCES hypotheses(id), -- If derived from another
    strategy_used TEXT,                 -- e.g., goal_aligned, out_of_the_box
    version INT DEFAULT 1,              -- Evolve count
    source TEXT,                        -- e.g., manual, refinement, grafting
    enabled BOOLEAN DEFAULT TRUE,       -- Soft delete flag
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);
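
For completeness, storing a hypothesis row together with its embedding is straightforward with psycopg2; pgvector accepts a bracketed string literal cast to `vector`. The helper below is a sketch: the connection handling and the embedding itself are placeholders, not the exact co_ai code.

```python
# Sketch: inserting a hypothesis plus its embedding into the table above.
# `conn` is an open psycopg2 connection; how the embedding is produced is out of scope here.
def store_hypothesis(conn, goal: str, text: str, embedding: list[float]) -> int:
    vector_literal = "[" + ",".join(f"{v:.6f}" for v in embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO hypotheses (goal, text, embedding)
            VALUES (%s, %s, %s::vector)
            RETURNING id
            """,
            (goal, text, vector_literal),
        )
        new_id = cur.fetchone()[0]
    conn.commit()
    return new_id
```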

7. ๐Ÿ“ฆ Vector Memory with PostgreSQL + pgvector

Then I built a VectorMemory class to retrieve similar hypotheses:

File: co_ai/memory/vector_store.py

def get_similar_hypotheses(self, goal: str, limit: int = 5):
    """Get hypotheses from memory that are most relevant to the current goal."""
    try:
        goal_embedding = get_embedding(goal)

        # self.conn is the psycopg2 connection held by the vector store
        with self.conn.cursor() as cur:
            cur.execute("""
                SELECT text, review, elo_rating
                FROM hypotheses
                ORDER BY embedding <-> %s
                LIMIT %s
            """, (str(goal_embedding), limit))

            rows = cur.fetchall()

        return [
            {
                "text": row[0],
                "review": row[1] or "",
                "score": row[2] or 1000
            } for row in rows
        ]

    except Exception as e:
        print(f"[VectorMemory] Failed to fetch similar hypotheses: {e}")
        return []
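
In practice the Generation agent can call this before it builds its prompt, so that previously well-ranked ideas feed into the next batch. A small sketch of that wiring follows; only `get_similar_hypotheses` comes from the class above, the helper itself is illustrative:

```python
# Sketch: folding retrieved neighbours into the generation context.
# `memory` exposes get_similar_hypotheses() as shown above; the rest is illustrative.
def enrich_with_prior_hypotheses(memory, context: dict, limit: int = 3) -> dict:
    similar = memory.get_similar_hypotheses(context["goal"], limit=limit)
    prior = "\n".join(f"- (Elo {h['score']:.0f}) {h['text']}" for h in similar)
    if prior:
        context["literature"] = (
            context.get("literature", "") + "\n\nRelated past hypotheses:\n" + prior
        )
    return context
```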

๐Ÿ“„ Jinja2 & How It Powers Flexible Prompting

๐Ÿ’ก Motivation: Prompt Flexibility Is Scientific Freedom

One of the key goals when building co_ai was to make the system highly adaptable to different domains, preferences, and scientific workflows.

To do that, we needed a way to:

  • โœ… Inject dynamic values (goal, literature, preferences)
  • ๐Ÿงฉ Support conditional logic in prompts
  • ๐Ÿ”„ Allow users to define their own templates
  • ๐Ÿ“‹ Maintain structure while enabling customization
  • ๐Ÿ›  Integrate cleanly with Hydra config and agent logic

Thatโ€™s where Jinja2 templates came in.


๐Ÿงฑ How Jinja2 Templates Work in co_ai

We use Jinja2-style templating to build structured prompts dynamically.

Hereโ€™s an example from prompts/generation/goal_aligned.txt:

You are an expert researcher generating novel hypotheses.
Use inspiration from analogous domains or mechanisms to develop creative solutions.

Goal:
{{ goal }}

Literature:
{{ literature }}

{% if preferences %}
Preferences:
{% for p in preferences %}
- {{ p }}
{% endfor %}
{% endif %}

Instructions:
1. Review findings above before generating new hypotheses
2. Generate 3 distinct, testable hypotheses
3. Each must include mechanism, rationale, and experiment plan
4. {% if "goal_consistency" in preferences %}Prioritize direct alignment with the stated goal.{% endif %}
5. {% if "novelty" in preferences %}Focus on originality and unexpected connections.{% endif %}
6. {% if "feasibility" in preferences %}Ensure experiments can be realistically tested in the lab.{% endif %}
7. {% if "biological_plausibility" in preferences %}Make sure biological mechanisms are valid and well-explained.{% endif %}
8. {% if "simplicity" in preferences %}Favor clarity and simplicity over complexity.{% endif %}

In your agent logic:

def _build_prompt(self, context):
    literate = "\n".join([
        f"{i+1}. {r['title']}\n→ {r['summary']}"
        for i, r in enumerate(context.get("literature", []))
    ])

    return self.prompt_renderer.render(
        goal=context["goal"],
        literature=literate,
        preferences=context.get("preferences", [])
    )

๐Ÿš€ Advantages of Using Jinja2 for Prompting

Here are the main reasons we chose Jinja2 for prompt generation:

1. โœ… Dynamic Variable Injection

You can inject any variable into your prompt:

Goal:
{{ goal }}

Literature:
{{ literature }}

This makes it easy to generate prompts based on real-time context.

2. ๐Ÿงฉ Conditional Logic Based on Preferences

Want to change the prompt based on preference?

{% if "goal_consistency" in preferences %}
Prioritize direct alignment with the stated goal.
{% endif %}

Now your LLM follows instructions only if certain preferences are active.

3. ๐Ÿ”„ Clean Template Reuse Across Agents

Same template can be used by:

  • โœ… GenerationAgent
  • ๐Ÿง  EvolutionAgent
  • ๐Ÿ“Š Meta-reviewAgent

Just change the inputs โ†’ same prompt format works differently.

4. ๐Ÿ“‹ Full Traceability and Version Control

Because prompts are stored in files, you can:

  • โœ… Track prompt versions in git
  • ๐Ÿ“ˆ Compare old vs new prompts
  • ๐Ÿ›  Roll back if something breaks

5. ๐Ÿค– Local Model Compatibility

Works great with local models like Qwen3, Mistral, or Llama via Ollama:

  • No need for hosted APIs
  • Everything runs locally
  • Fully reproducible

6. ๐Ÿงฌ Feedback Injection from Reviews

Want to add strategic directions from Meta-review?

{% if meta_review.insights %}
Strategic Directions:
{% for point in meta_review.insights %}
- {{ point }}
{% endfor %}
{% endif %}

Now future generations incorporate insights automatically.
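
Wiring this up only requires that the meta-review output is present when the template is rendered. A sketch of that step; the `meta_review` / `insights` keys are assumptions about where the Meta-review agent stores its directions:

```python
# Sketch: passing meta-review insights into the render call.
# The "meta_review"/"insights" context keys are illustrative assumptions.
def render_with_meta_review(template, context: dict) -> str:
    return template.render(
        goal=context["goal"],
        literature=context.get("literature", ""),
        preferences=context.get("preferences", []),
        meta_review=context.get("meta_review", {"insights": []}),
    )
```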

7. ๐Ÿ“Š Easy Customization Without Code Changes

Users can modify .txt files directly:

  • โœ… Add new preferences
  • ๐Ÿง  Change instruction phrasing
  • ๐Ÿ“‹ Adjust output structure
  • ๐Ÿ›  Enable/disable sections

No coding required.


๐Ÿ“‹ Overview of Implemented Agents

Each component in the system plays a unique role in the pipeline. Below is a list of implemented agents and their responsibilities in our AI co-scientist system.

| Agent | Purpose | Why We Created It |
|---|---|---|
| `base.py` | Shared base class for all agents | Modular design with common utilities |
| `debate.py` | Simulates scientific debate for ranking | To evaluate and prioritize hypotheses |
| `evolution.py` | Evolves hypotheses using strategies like grafting or simplification | For continuous improvement |
| `generation.py` | Hypothesis generation after literature grounding | Core reasoning engine |
| `generic.py` | Template for custom agent creation | Supports a reusable agent pattern |
| `literature.py` | Performs web search and literature parsing | Grounds hypotheses in prior work |
| `meta_review.py` | Synthesizes insights from all reviews | Creates strategic feedback for future agents |
| `prompt_refiner.py` | Uses past outputs to improve prompts | Enables self-tuning via preference injection |
| `prompt_tuning.py` | Applies DSPy-based tuning to improve prompting | Builds better prompts over time |
| `proximity.py` | Measures similarity between hypotheses | Tracks evolution and improves ranking |
| `ranking.py` | Manages tournaments and compares hypotheses using Elo ratings | Determines which ideas are stronger |
| `reflection.py` | Analyzes hypotheses for correctness, novelty, and feasibility | Filters weak ideas early |
| `review.py` | Lighter version of the Reflection agent | Quick evaluation of hypothesis quality |

โœ… 1. Short Description of Each Agent

Here is a short description of each agent, based on both the paper and the implementation.

๐Ÿง  Generation Agent

The Generation Agent starts with a research goal and generates multiple hypotheses by synthesizing prior knowledge or exploring new directions.

It uses literature grounding when available, and supports strategies like:

  • Goal-aligned generation
  • Out-of-the-box thinking
  • Feasibility-focused prompting

From the paper:

โ€œThe Generation agent iteratively searches the web, retrieves and reads relevant research articles, and grounds its reasoning by summarizing prior work.โ€

๐Ÿ” Reflection Agent

The Reflection Agent simulates peer review. It evaluates hypothesis correctness, novelty, feasibility, and quality using structured prompts.

Supports multiple types of reflection:

  • Full review
  • Deep verification
  • Simulation-based testing
  • Observation-driven critique

From the paper:

โ€œThe Reflection agent performs full reviews, deep verification, observation reviews, and simulation reviews to ensure high-quality outputs.โ€

๐Ÿ“Š Ranking Agent

The Ranking Agent runs simulated scientific debates between hypotheses, assigning Elo ratings based on their performance.

Uses:

  • Direct comparison tournaments
  • Preference-weighted scoring
  • Strategic ranking based on novelty/feasibility/goal alignment

From the paper:

โ€œEach comparison concludes with the phrase ‘better hypothesis:<1 or 2>’โ€

๐Ÿ”„ Evolution Agent

The Evolution Agent refines top-ranked hypotheses using techniques like simplification, inspiration from other ideas, grafting, or out-of-the-box rethinking.

Key strategies include:

  • Enhancing clarity
  • Improving biological plausibility
  • Increasing testability

From the paper:

โ€œThe Evolution agent continuously refines and improves existing hypotheses and proposals using several approaches including enhancement through grounding, coherence and feasibility improvementsโ€ฆโ€

๐Ÿงฌ Meta-review Agent

The Meta-review Agent synthesizes insights from all agents into strategic directions that guide future generations.

It identifies common patterns in critiques and rankings, then injects them into future prompts.

From the paper:

โ€œThe Meta-review agent synthesizes insights from all reviews into strategic directions applicable to all agentsโ€

๐Ÿงฎ Proximity Agent

The Proximity Agent computes similarity between hypotheses to detect duplicates, track evolution, and identify promising clusters of related ideas.

Supports:

  • Hypothesis clustering
  • Redundancy elimination
  • Similarity tracking over time

From the paper:

โ€œThe dark gray arrows represent the information flow through the co-scientist system, while the red arrows represent the feedback loop between the specialized agents.โ€


๐Ÿงฑ 2. How They Fit Into the Pipeline

Hereโ€™s how the agents interact in sequence:

    flowchart LR
    subgraph InputStage [Input]
        A[๐ŸŽฏ Research Goal] --> B[๐Ÿ“š LiteratureAgent]
    end

    subgraph HypothesisGeneration [Hypothesis Generation]
        B --> C[๐Ÿงช GenerationAgent]
    end

    subgraph ReviewAndRanking [Evaluation & Ranking]
        C --> D[๐Ÿชž ReflectionAgent]
        D --> E[๐Ÿ… RankingAgent]
    end

    subgraph FeedbackLoop [Strategic Improvement]
        E --> F[๐Ÿง  MetaReviewAgent]
        F --> G[๐Ÿงฌ EvolutionAgent]
        G --> H[๐Ÿ› ๏ธ PromptTuningAgent]
        H --> C
        H --> I([๐Ÿ” Refined Generation])
    end

    subgraph OutputStage [Final Output]
        E --> J[๐Ÿ“ Final Hypotheses + Reports]
    end

    style InputStage fill:#f9f9f9,stroke:#333
    style HypothesisGeneration fill:#e6f7ff,stroke:#333
    style ReviewAndRanking fill:#fffbe6,stroke:#333
    style FeedbackLoop fill:#e6ffe6,stroke:#333
    style OutputStage fill:#f9f9f9,stroke:#333
  

Each stage contributes to building better hypotheses.

| Stage | Input | Output | Purpose |
|---|---|---|---|
| Generation | Goal + literature | Multiple hypotheses | Start with novel ideas |
| Reflection | Hypotheses + preferences | Structured reviews | Filter weak ideas |
| Ranking | Two hypotheses | The better one (via Elo) | Prioritize the best ideas |
| Evolution | Top-ranked hypothesis | Refined version | Improve clarity and feasibility |
| Meta-review | All reviews + rankings | Strategic directions | Inject insights into future agents |
| Prompt Tuning | Old prompt + ranked data | Improved prompt | Refine the prompting strategy |

This matches exactly what the paper describes:

โ€œFeedback from tournaments enables iterative improvementโ€

โ€œThe Meta-review agent synthesizes insights from all reviews into strategic directions applicable to all agentsโ€


๐Ÿ”„ 3. Feedback Loops Between Agents

These are the key feedback loops in the AI co-scientist system.

Loop 1: Peer Review โ†’ Prompt Refinement

    graph LR
    Generation --> Reflection
    Reflection --> PromptTuning
    PromptTuning --> Generation
  

How it works:

  • โœ… The Reflection Agent gives detailed reviews
  • ๐Ÿง  The PromptTuningAgent learns from those reviews
  • ๐Ÿ”„ The improved prompt is used in the next round

Supports Appendix A.2.4:

โ€œRefine the following conceptual idea, enhancing its practical implementability by leveraging contemporary technological capabilitiesโ€


Loop 2: Tournament โ†’ Strategic Directions

    graph LR
    Ranking --> MetaReview
    MetaReview --> Generation
    MetaReview --> Evolution
  

How it works:

  • ๐Ÿ“Š The Ranking Agent identifies which hypotheses win most often
  • ๐Ÿงฌ The Meta-review Agent extracts recurring themes and preferences
  • ๐ŸŽฏ These strategic directions are injected into future agents’ prompts

Matches whatโ€™s described in Appendix A.2.5:

โ€œThe Meta-review agent generates feedback applicable to all agents… simply appended to their prompts in the next iterationโ€”a capability facilitated by the long-context search and reasoning capabilities of the underlying Gemini 2.0 modelsโ€


Loop 3: Hypothesis โ†’ Memory โ†’ New Hypothesis

    graph LR
    Generation --> Memory
    Memory --> Generation
  

How it works:

  • ๐Ÿ“š The LiteratureAgent stores results in vector memory
  • ๐Ÿง  The GenerationAgent pulls similar past hypotheses during future runs
  • ๐Ÿ”„ This creates a self-improving loop where old ideas inform new ones

Supports what the paper describes:

โ€œAppendix Figure A.4 shows example prompts for comparing two hypotheses during a tournament matchโ€


๐Ÿ“Š 4. Visualization of Agent Interactions

๐Ÿงฉ Full Pipeline Diagram

    graph TB
    subgraph InputStage [Input]
        A[๐ŸŽฏ Research Goal + Preferences] --> B[๐Ÿ“š LiteratureAgent]
    end

    subgraph GenerationStage [Hypothesis Generation]
        B --> C[๐Ÿงช GenerationAgent]
    end

    subgraph ReviewStage [Evaluation & Critique]
        C --> D[๐Ÿชž ReflectionAgent]
    end

    subgraph RankingStage [Prioritization]
        D --> E[๐Ÿ† RankingAgent]
    end

    subgraph FeedbackLoop [Strategic Improvement]
        E --> F[๐Ÿง  MetaReviewAgent]
        F --> G[๐Ÿงฌ EvolutionAgent]
        G --> H[๐Ÿ”ง PromptTuningAgent]
        H --> C
    end

    subgraph OutputStage [Refined Output]
        E --> I[๐Ÿ“ Final Hypotheses + Reports]
    end

    style InputStage fill:#f9f9f9,stroke:#333
    style GenerationStage fill:#e6f7ff,stroke:#333
    style ReviewStage fill:#fffbe6,stroke:#333
    style RankingStage fill:#ffe6e6,stroke:#333
    style FeedbackLoop fill:#e6ffe6,stroke:#333
    style OutputStage fill:#f9f9f9,stroke:#333
  
  • ๐ŸŽฏ The pipeline starts with a goal and preferences, then uses the ๐Ÿง  LiteratureAgent to gather relevant research.
  • ๐Ÿงช The GenerationAgent proposes hypotheses based on this foundation.
  • ๐Ÿชž The ReflectionAgent performs structured reviews of hypothesis quality.
  • ๐Ÿ… The RankingAgent assigns Elo ratings and selects top performers.
  • ๐Ÿง  The MetaReviewAgent synthesizes insights into strategic directions.
  • ๐Ÿงฌ These feed into both the EvolutionAgent and ๐Ÿ› ๏ธ PromptTuningAgent, enabling continuous refinement.
  • ๐Ÿ” The improved prompts and strategies are fed back into the ๐Ÿงช GenerationAgent, creating a full feedback loop.

๐Ÿ”„ Feedback Injection Diagram

    graph LR
    subgraph CoScientistPipeline
        direction LR
        Gen[GenerationAgent] --> Refl[ReflectionAgent]
        Refl --> Rank[RankingAgent]
        Rank --> Meta[MetaReviewAgent]
        Meta --> Evo[EvolutionAgent]
        Meta --> Gen
        Evo --> Gen
    end
  
  • The Feedback Loop allows insights from reviews and rankings to be injected directly into future generations.
  • The Meta-reviewAgent plays a central role by synthesizing recurring themes.
  • Strategic directions are sent to the GenerationAgent (for better prompting) and the EvolutionAgent (for structural improvements).

๐Ÿ“ˆ Prompt Improvement Loop

    graph TB
    PromptLoader --> Generation
    Generation --> Reflection
    Reflection --> Ranking
    Ranking --> MetaReview
    MetaReview --> PromptTuning
    PromptTuning --> PromptLoader
  

๐Ÿ” Prompt Evolution Workflow

  • ๐Ÿงช The GenerationAgent creates hypotheses using the current version of the prompt.
  • ๐Ÿชž The ReflectionAgent and ๐Ÿ… RankingAgent evaluate the quality of these hypotheses through structured reviews and Elo scoring.
  • ๐Ÿง  The MetaReviewAgent identifies recurring patterns and strategic weaknesses in the critiques.
  • ๐Ÿ› ๏ธ The PromptTuningAgent uses those insights to refine the original prompt.
  • ๐Ÿ“ฅ The refined prompt is stored and reloaded via the PromptLoader, enabling the system to generate better hypotheses in the next cycle.

This iterative loop allows prompt evolution to adapt dynamically to performance signals, creating a self-improving hypothesis generation system.

๐Ÿ”„ Self-Improving Loop: From Goal to Insight

    graph TD
    A[Goal Input] --> B[GenerationAgent generates hypotheses]
    B --> C[Prompt stored in vector memory]
    C --> D[ReflectionAgent critiques them]
    D --> E[RankingAgent compares using Elo-style tournament]
    E --> F[PromptTuningAgent tunes future generations]
    F --> G[Meta-reviewAgent synthesizes insights]
    G --> H[EvolutionAgent refines hypotheses]
    H --> B
  
  • ๐ŸŽฏ It starts with a research goal, passed to the ๐Ÿงช GenerationAgent.
  • ๐Ÿง  Prompts are stored in vector memory, ensuring traceability and enabling future retrieval or refinement.
  • ๐Ÿชž Each hypothesis is reviewed for correctness, feasibility, and alignment with research goals.
  • ๐ŸŒฑ Top-ranked hypotheses undergo further refinement using inspiration, simplification, and preference-driven tuning.
  • ๐Ÿง  The MetaReviewAgent aggregates reviews and rankings to uncover strategic insights.
  • ๐Ÿ” These insights feed back into the next generation of prompts and hypotheses โ€” completing the learning loop.

Over time, the system becomes smarter and more aligned with your research preferences and objectives.

This creates a full feedback loop:

โ€œFeedback from tournaments enables iterative improvementโ€

โ€œStrategic directions guide future generationsโ€


๐Ÿš€ Overview: How PromptTuningAgent Improves Scientific Prompts with DSPy

The PromptTuningAgent is a smart component in the co_ai pipeline that uses LLMs and DSPy to automatically improve prompts used for generating scientific hypotheses. It uses few-shot learning, prompt evaluation, and feedback loops to refine prompts based on real data collected from earlier runs.

Here’s how it works:

1. ๐Ÿ“ Signature Definition

We start by defining a PromptTuningSignature. This tells DSPy what inputs and outputs to expect:

  • The goal (e.g. “What happens if the US defaults?”)
  • The original prompt used to generate hypotheses
  • The best hypothesis produced
  • A review and a score of that hypothesis
  • The output: a refined version of the prompt

2. ๐Ÿง  Initialize the Agent

When PromptTuningAgent is initialized:

  • It loads configuration values (e.g. how many examples to train on).
  • It sets up a connection to a local LLM through Ollama.
  • It configures DSPy with this LLM so it can use it for training.

3. ๐Ÿš€ Agent Execution: run(context)

When the pipeline runs this agent:

  • It extracts the goal from the context.
  • It pulls a mix of recent prompts + their hypothesis results from memory (used for training and validation).
  • If no data is found, it logs the issue and exits gracefully.

4. ๐ŸŽฏ Training the Prompt Refiner

  • It creates DSPy Examples from the training data.
  • It defines a custom scoring function (_prompt_quality_metric) that evaluates if a new prompt is better than the original.
  • It compiles a new tuned_program using BootstrapFewShot with the training data and scoring metric.

5. ๐Ÿ” Generating New Prompts

Using the tuned DSPy program, it runs the validation examples through the newly trained model:

  • For each validation sample, it generates a refined prompt.
  • It saves each refined prompt to the database, recording metadata like strategy and version.
  • It logs each prompt and stores it in the contextโ€™s prompt_history.

6. ๐Ÿ“Š Evaluating Prompt Quality

The custom scoring function (_prompt_quality_metric) runs like this:

  • It generates hypotheses using the original and the refined prompts.
  • It builds a comparison prompt to evaluate which prompt performed better.
  • It parses the comparison response (looking for `better prompt:<A or B>`) and returns a score.
  • All intermediate steps are logged for transparency and debugging.

7. ๐Ÿ’ก Why This Matters

This class enables automated prompt evolution. Instead of manually tweaking prompts and checking their results, the agent:

  • Trains itself using past runs.
  • Evaluates prompts automatically.
  • Improves prompts continuously based on actual hypothesis quality.

Itโ€™s a self-improving loop that can adapt as your data or scientific objectives evolve.


from abc import ABC, abstractmethod

import re
import dspy
from dspy import Predict, Signature, InputField, OutputField, Example, BootstrapFewShot

from co_ai.agents.base import BaseAgent
from co_ai.constants import GOAL


# DSPy signature for prompt refinement: defines input/output fields for tuning
class PromptTuningSignature(Signature):
    goal = InputField(desc="Scientific research goal or question")
    input_prompt = InputField(desc="Original prompt used to generate hypotheses")
    hypotheses = InputField(desc="Best hypothesis generated")
    review = InputField(desc="Expert review of the hypothesis")
    score = InputField(desc="Numeric score evaluating the hypothesis quality")
    refined_prompt = OutputField(desc="Improved version of the original prompt")


# Simple evaluation result class to return from evaluator
class EvaluationResult:
    def __init__(self, score: float, reason: str):
        self.score = score
        self.reason = reason


# Base evaluator interface (not used directly, but useful for future extensions)
class BaseEvaluator(ABC):
    @abstractmethod
    def evaluate(
        self, original: str, proposal: str, metadata: dict = None
    ) -> EvaluationResult:
        pass


# DSPy-based evaluator that can run a Chain-of-Thought program
class DSPyEvaluator(BaseEvaluator):
    def __init__(self):
        self.program = dspy.ChainOfThought(PromptTuningSignature)

    def evaluate(
        self, original: str, proposal: str, metadata: dict = None
    ) -> EvaluationResult:
        result = self.program(
            goal=metadata["goal"],
            input_prompt=original,
            hypotheses=metadata["hypotheses"],
            review=metadata.get("review", ""),
            score=metadata.get("score", 750),
        )
        try:
            score = float(result.score)
        except (ValueError, TypeError):
            score = 0.0
        return EvaluationResult(score=score, reason=result.explanation)


# Main agent class responsible for training and tuning prompts using DSPy
class PromptTuningAgent(BaseAgent):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.agent_name = cfg.get("name", "prompt_tuning")
        self.prompt_key = cfg.get("prompt_key", "default")
        self.sample_size = cfg.get("sample_size", 20)
        self.generate_count = cfg.get("generate_count", 10)
        self.current_version = cfg.get("version", 1)

        # Configure DSPy with local LLM (Ollama)
        lm = dspy.LM(
            "ollama_chat/qwen3",
            api_base="http://localhost:11434",
            api_key="",
        )
        dspy.configure(lm=lm)

    async def run(self, context: dict) -> dict:
        goal = context.get(GOAL, "")
        generation_count = self.sample_size + self.generate_count
        self.logger.log(
            "PromptTuningExamples",
            {"samples size": self.sample_size, "generation count": generation_count},
        )

        # Get training + validation data
        examples = self.memory.prompt.get_prompt_training_set(goal, generation_count)
        train_data = examples[: self.sample_size]
        val_data = examples[self.sample_size :]

        if not examples:
            self.logger.log(
                "PromptTuningSkipped", {"reason": "no_training_data", "goal": goal}
            )
            return context

        # Build training set for DSPy
        training_set = [
            Example(
                goal=item["goal"],
                input_prompt=item["prompt_text"],
                hypotheses=item["hypothesis_text"],
                review=item.get("review", ""),
                score=item.get("elo_rating", 1000),
            ).with_inputs("goal", "input_prompt", "hypotheses", "review", "score")
            for item in train_data
        ]

        # Wrap our scoring metric so we can inject context during tuning
        def wrapped_metric(example, pred, trace=None):
            return self._prompt_quality_metric(example, pred, context=context)

        # Train prompt-tuning program
        tuner = BootstrapFewShot(metric=wrapped_metric)
        student = Predict(PromptTuningSignature)
        tuned_program = tuner.compile(student=student, trainset=training_set)

        # Use tuned program to generate and store new refined prompt
        await self.generate_and_store_refined_prompts(
            tuned_program, goal, context, val_data
        )
        self.logger.log(
            "PromptTuningCompleted",
            {
                "goal": goal,
                "example_count": len(training_set),
                "generated_count": len(val_data),
            },
        )

        return context

    async def generate_and_store_refined_prompts(
        self, tuned_program, goal: str, context: dict, val_set
    ):
        """
        Generate refined prompts using the tuned DSPy program and store them in the database.

        Args:
            tuned_program: A compiled DSPy program capable of generating refined prompts.
            goal: The scientific goal for this run.
            context: Shared pipeline state.
            val_set: Validation examples to run through the tuned program.
        """

        stored_count = 0
        for i, example in enumerate(val_set):
            try:
                # Run DSPy program on new example
                result = tuned_program(
                    goal=example["goal"],
                    input_prompt=example["prompt_text"],
                    hypotheses=example["hypothesis_text"],
                    review=example.get("review", ""),
                    score=example.get("elo_rating", 1000),
                )

                refined_prompt = result.refined_prompt.strip()

                # Store refined prompt to the DB
                self.memory.prompt.save(
                    goal=example["goal"],
                    agent_name=self.name,
                    prompt_key=self.prompt_key,
                    prompt_text=refined_prompt,
                    response=None,
                    strategy="refined_via_dspy",
                    version=self.current_version + 1,
                )

                stored_count += 1

                # Update context with prompt history
                self.add_to_prompt_history(
                    context, refined_prompt, {"original": example["prompt_text"]}
                )

                self.logger.log(
                    "TunedPromptStored",
                    {"goal": goal, "refined_snippet": refined_prompt[:100]},
                )

            except Exception as e:
                self.logger.log(
                    "TunedPromptGenerationFailed",
                    {"error": str(e), "example_snippet": str(example)[:100]},
                )

        self.logger.log(
            "BatchTunedPromptsComplete", {"goal": goal, "count": stored_count}
        )

    def _prompt_quality_metric(self, example, pred, context: dict) -> float:
        """Run both prompts and compare results"""
        try:
            prompt_a = example.input_prompt
            prompt_b = pred.refined_prompt
            self.logger.log(
                "PromptQualityCompareStart",
                {
                    "prompt_a_snippet": prompt_a[:100],
                    "prompt_b_snippet": prompt_b[:100],
                },
            )

            hypotheses_a = self.call_llm(prompt_a, context)
            self.logger.log(
                "PromptAResponseGenerated", {"hypotheses_a_snippet": hypotheses_a[:200]}
            )

            hypotheses_b = self.call_llm(prompt_b, context)
            self.logger.log(
                "PromptBResponseGenerated", {"hypotheses_b_snippet": hypotheses_b[:200]}
            )

            # Run comparison
            merged = {
                **context,
                **{
                    "prompt_a": prompt_a,
                    "prompt_b": prompt_b,
                    "hypotheses_a": hypotheses_a,
                    "hypotheses_b": hypotheses_b,
                },
            }
            comparison_prompt = self.prompt_loader.load_prompt(self.cfg, merged)
            self.logger.log(
                "ComparisonPromptConstructed",
                {"comparison_prompt_snippet": comparison_prompt[:200]},
            )

            response = self.call_llm(comparison_prompt, context)
            self.logger.log(
                "ComparisonResponseReceived", {"response_snippet": response[:200]}
            )

            match = re.search(r"better prompt:<([AB])>", response, re.IGNORECASE)
            if match:
                choice = match.group(1).upper()
                score = 1.0 if choice == "B" else 0.5
                self.logger.log(
                    "PromptComparisonResult", {"winner": choice, "score": score}
                )
                return score
            else:
                self.logger.log("PromptComparisonNoMatch", {"response": response})
                return 0.0
        except Exception as e:
            self.logger.log(
                "PromptQualityMetricError",
                {
                    "error": str(e),
                    "example_input_prompt_snippet": example.input_prompt[:100],
                    "refined_prompt_snippet": getattr(pred, "refined_prompt", "")[:100],
                },
            )
            return 0.0

๐Ÿ“ JSON Logging: Tracking Every Step

Every run of the co_ai pipeline is automatically logged in a structured JSON Lines (.jsonl) file. This makes it easy to audit, debug, and analyze the behavior of the system over time.

๐Ÿ” How It Works

  • A unique run_id is generated at the start of each pipeline execution.

  • This ID is used to create a dedicated log file stored under logs/, such as:

    logs/run_us_debt_analysis_20240516_130245.jsonl
    
  • Each agent and component in the system (generation, ranking, reflection, etc.) logs its actions using a consistent structure:

    {
      "timestamp": "2025-05-16T13:02:45.123Z",
      "event_type": "GeneratedHypotheses",
      "data": {
        "goal": "Can generative AI models reduce the time required to make scientific discoveries in biomedical research?",
        "snippet": "Hypothesis 1: ..."
      }
    }
    
  • These events are emitted through the JSONLogger class and tagged with emojis and event types for easy tracking.
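
The logger itself can stay very small. Here is a minimal sketch of a JSON Lines logger along these lines; the real JSONLogger in co_ai may differ in its details:

```python
# Minimal JSON Lines logger sketch; the real co_ai JSONLogger may differ.
import json
from datetime import datetime, timezone
from pathlib import Path

class JSONLogger:
    def __init__(self, run_id: str, log_dir: str = "logs"):
        Path(log_dir).mkdir(parents=True, exist_ok=True)
        self.path = Path(log_dir) / f"{run_id}.jsonl"

    def log(self, event_type: str, data: dict):
        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "event_type": event_type,
            "data": data,
        }
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

logger = JSONLogger("run_us_debt_analysis_20240516_130245")
logger.log("GeneratedHypotheses", {"goal": "...", "snippet": "Hypothesis 1: ..."})
```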

โœ… Why It Matters

  • ๐Ÿ“œ Transparency โ€“ Every stage of the reasoning process is recorded and can be revisited.
  • ๐Ÿ› Debugging โ€“ If something goes wrong, the log reveals where and why.
  • ๐Ÿ“Š Experiment Tracking โ€“ Logs form the foundation for analyzing pipeline performance and tuning over time.

This approach ensures that the entire scientific reasoning process is not a black box, but a transparent and reproducible workflow.


๐Ÿ’พ Context Persistence: Save and Resume at Any Step

One of the key architectural decisions in co_ai is persistent context storage. At every stage of the pipeline, the system can store the full pipeline context: a structured representation of everything the system knows so far.

๐Ÿง  What Is “Context”?

The context is a Python dictionary containing:

  • The research goal
  • literature findings
  • hypotheses generated so far
  • reviews, reflections, rankings, and more

It flows through each stage like a growing memory of the reasoning process.

๐Ÿ—‚ How Context Is Stored

Each agent has config options such as:

save_context: true
skip_if_completed: true

When save_context is enabled:

  • The full context is saved to the PostgreSQL database in the context_states table.
  • Metadata such as run_id, stage_name, version, and preferences are stored alongside it.
  • The most recent version is flagged with is_current = true.

This lets you resume a run, skip steps that have already completed, or inspect intermediate pipeline states.
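
A rough sketch of the save and resume logic looks like this. The table is the `context_states` table described above, but the column names and helpers are assumptions rather than the exact co_ai code:

```python
# Illustrative save/load helpers for the context_states table.
# Column names are assumptions based on the description above.
from psycopg2.extras import Json

def save_context(conn, run_id: str, stage_name: str, context: dict):
    with conn.cursor() as cur:
        # Older versions for this stage are no longer current.
        cur.execute(
            "UPDATE context_states SET is_current = FALSE WHERE run_id = %s AND stage_name = %s",
            (run_id, stage_name),
        )
        cur.execute(
            "INSERT INTO context_states (run_id, stage_name, context, is_current) "
            "VALUES (%s, %s, %s, TRUE)",
            (run_id, stage_name, Json(context)),
        )
    conn.commit()

def load_completed_context(conn, run_id: str, stage_name: str):
    with conn.cursor() as cur:
        cur.execute(
            "SELECT context FROM context_states "
            "WHERE run_id = %s AND stage_name = %s AND is_current = TRUE",
            (run_id, stage_name),
        )
        row = cur.fetchone()
    return row[0] if row else None  # used when skip_if_completed is enabled
```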

๐Ÿ” How It Works in Practice

  1. After each agent runs, it saves the new context to the database if save_context is enabled.

  2. Before running, the agent checks if a completed version already exists:

    • If skip_if_completed = true, it loads the saved context and skips computation.
    • If not, it runs normally and updates the context.

โœ… Why This Matters

  • ๐Ÿ”„ Restartable Runs โ€“ You can resume from any stage without rerunning earlier steps.
  • ๐Ÿ” Debugging and Exploration โ€“ You can load a saved context and inspect it for errors or insights.
  • ๐Ÿงช A/B Testing and Iteration โ€“ Run multiple strategies or prompt versions against the same stored context.

This persistent, stage-wise approach makes co_ai robust, debuggable, and suitable for real scientific workflows.


๐Ÿš€ Getting Started (Try It Locally)

You can run the full co_ai system locally in just a few steps:

1. **Clone the repo**

   ```bash
   git clone https://github.com/ernanhughes/co-ai.git
   cd co-ai
   ```

2. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

3. **Start Ollama with a local model**

   ```bash
   ollama run qwen3
   ```

4. **Install PostgreSQL**

- Download the installer from https://www.postgresql.org/download/windows/
- Use the installer to install PostgreSQL and pgAdmin.
- During setup, set:
  - Username: `postgres`
  - Password: (choose something and remember it)


- Enable the pgvector extension.

  I explained this in a previous post:
  [PostgreSQL for AI: Storing and Searching Embeddings with pgvector](/post/pgvector/)

- Create a database.

  For example, you can use `co_ai` as the name so it matches the connection settings below.

- In your `config.yaml`, change the DB connection information to match your database:

```yaml
db:
  driver: psycopg2
  user: postgres
  password: yourpassword
  host: localhost
  port: 5432
  database: co_ai
```

- Load the `schema.sql` file.

  At the root of the project there is a [schema.sql](https://github.com/ernanhughes/co-ai/blob/main/schema.sql) file.
  This will create the tables required to run the application in your database.


5. **Run the pipeline**

```bash
python -m co_ai.main goal="What happens if the USA defaults on its debt?"
```

---

### ๐ŸŽฏ Example: Before and After Prompt Tuning

To make the benefit of tuning concrete, here is the shape of a before/after comparison for a single goal (hypothesis text elided):

> **Goal:** The USA is about to default on its national debt.

**Original prompt:**

```text
You are an expert researcher generating hypotheses...
```

Generated hypotheses:

  • Hypothesis 1: [text]
  • Hypothesis 2: [text]

**Refined prompt:**

```text
You are an economic strategist focused on sovereign default scenarios...
```

Improved hypotheses:

  • Hypothesis 1: [text]
  • Hypothesis 2: [text]

Seeing the prompts side by side shows the real benefit of tuning.


---

## ๐Ÿงฉ How to Extend `co_ai`

This system is fully modular โ€” here are some ways to extend it:

- ๐Ÿ” **Swap out Ollama for another LLM backend**
- ๐Ÿ“š **Use academic papers from Semantic Scholar instead of web search**
- ๐Ÿงช **Add a โ€œSimulationAgentโ€ to test hypotheses**
- ๐Ÿง  **Connect to LangChain tools or Autogen agents**
- ๐Ÿ“ˆ **Use more advanced evaluation metrics (BLEU, ROUGE, etc.)**

You can also add your own agent by subclassing `BaseAgent` and defining a `run()` method.
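
For instance, the hypothetical SimulationAgent mentioned above could look roughly like this. The constructor and async `run(context)` signature follow the agents shown earlier in the post, while the body is purely illustrative:

```python
# Sketch of a custom agent. Constructor and run() signature follow the agents
# shown earlier in this post; the simulation logic itself is illustrative.
from co_ai.agents.base import BaseAgent

class SimulationAgent(BaseAgent):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.trials = cfg.get("trials", 3)

    async def run(self, context: dict) -> dict:
        simulations = []
        for hypothesis in context.get("hypotheses", []):
            # Ask the LLM to role-play the proposed experiment and report outcomes.
            prompt = (
                f"Simulate the following experiment {self.trials} times "
                f"and summarise the likely outcomes:\n{hypothesis}"
            )
            simulations.append(self.call_llm(prompt, context))
        context["simulations"] = simulations
        return context
```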

๐Ÿ’ฌ Questions or Feedback?

This is an open-source research tool. If youโ€™d like to:

  • Contribute a new agent
  • Report a bug
  • Suggest improvements
  • Use this in your lab or project

Feel free to open an issue or reach out at GitHub.


๐Ÿ”— References

Towards an AI Co-Scientist

๐Ÿ“Ž Appendix

Repo: github.com/ernanhughes/co
Core Tools: DSPy, pgvector, Ollama, Searxng Search
Format: Local Python modules using Hydra and async I/O