Building an AI Co-Scientist

This is the first post in a 100-part series where we take breakthrough AI papers and turn them into working code, building the next generation of AI one idea at a time.
Summary
In this post, I'll walk through how I implemented the ideas from Towards an AI Co-Scientist in a working system called co_ai.
The paper presents a vision for an AI system that collaborates with scientists by generating hypotheses, engaging in peer review, ranking proposals, and evolving them over time. Inspired by this work, I've built a modular, open-source version of the co-scientist architecture using local tools like:
- Ollama (for running Qwen3, Mistral, etc.)
- DSPy (for prompt optimization)
- Hydra (for configuration management, dynamic extension)
- pgvector (for vector memory + retrieval)
- PostgreSQL (database)
What Is co_ai?
co_ai is a working implementation of the AI co-scientist concept introduced in the recent DeepMind paper. The core idea? Build a multi-agent system that mimics scientific reasoning: proposing hypotheses, debating their validity, refining them, and evolving toward better explanations.
The system uses a modular agent pipeline:
flowchart LR
    A[Goal] --> B[Generation Agent]
    B --> C[Reflection Agent]
    C --> D[Ranking Agent]
    D --> E[Evolution Agent]
    E --> F[Meta-review Agent]
Each step has a specific role:
- Generation Agent: Proposes multiple hypotheses grounded in literature.
- Reflection Agent: Critiques each hypothesis for correctness, novelty, and feasibility.
- Ranking Agent: Uses simulated debates to rank and refine ideas.
- Evolution Agent: Evolves promising hypotheses using simplification or inspiration.
- Meta-review Agent: Synthesizes insights across all reviews into strategic directions applicable to all agents.
Key Design Decisions & Why They Matter
1. Local Execution via Ollama
One of the most important choices was to use local LLMs via Ollama. This ensures:
- No dependency on cloud APIs
- Full reproducibility
- No cost to your research
- Better control over inference settings (temperature, max tokens, etc.)
Example config:
model:
  name: ollama/qwen3
  api_base: http://localhost:11434
  api_key:
As you can see, you can configure any model you require here.
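To make this concrete, here is a minimal sketch of calling the configured local model through Ollama's REST API. This is illustrative only (co_ai routes model calls through its agent base class); the endpoint and payload follow Ollama's standard /api/generate interface, and the prompt is a placeholder.
```python
import requests

def generate(prompt: str, model: str = "qwen3",
             api_base: str = "http://localhost:11434") -> str:
    """Send one prompt to a locally running Ollama server and return the completion."""
    resp = requests.post(
        f"{api_base}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(generate("List three open research questions about the unfolded protein response."))
```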
2. Preference-Driven Prompting
A major innovation from the paper is the use of preferences to guide hypothesis quality.
I implemented this cleanly using Hydra config:
File: configs/agents/generation.yaml
This is a sample of one of the agent configuration files.
generation:
  name: generation
  enabled: true
  strategy: goal_aligned
  preferences:
    - goal_consistency
    - biological_plausibility
    - experimental_validity
    - novelty
    - simplicity
These are injected into prompts dynamically using Jinja2 templates:
This is a sample of one of the agent prompt template files.
File: prompts/generation/goal_aligned.txt
You are an expert researcher generating novel scientific hypotheses.
Use inspiration from analogous domains or mechanisms to develop creative solutions.
Goal:
{{ goal }}
Literature: {{ literature }}
Preferences:
{% for p in preferences %}
- {{ p }}
{% endfor %}
Instructions:
1. Review findings above before generating new hypotheses
2. Generate 3 distinct, testable hypotheses
3. Each must include mechanism, rationale, and experiment plan
4. {% if "goal_consistency" in preferences %}Prioritize direct alignment with the stated goal.{% endif %}
5. {% if "novelty" in preferences %}Focus on originality and unexpected connections.{% endif %}
6. {% if "feasibility" in preferences %}Ensure experiments can be realistically tested in the lab.{% endif %}
7. {% if "biological_plausibility" in preferences %}Make sure biological mechanisms are valid and well-explained.{% endif %}
8. {% if "simplicity" in preferences %}Favor clarity and simplicity over complexity.{% endif %}
This lets users tune agent behavior without changing code, just by adjusting preferences.
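To see preference injection end to end, here is a small, self-contained sketch that renders a fragment of the template above with Jinja2. The preference values mirror the config shown earlier; the template text is abbreviated for illustration.
```python
from jinja2 import Template

# Abbreviated version of the goal_aligned template: a preference list plus one conditional instruction.
template_text = """Preferences:
{% for p in preferences %}- {{ p }}
{% endfor %}
{% if "novelty" in preferences %}Focus on originality and unexpected connections.{% endif %}"""

prompt = Template(template_text).render(
    preferences=["goal_consistency", "novelty", "simplicity"]
)
print(prompt)
```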
3. Self-Improving Loop via DSPy
To implement the feedback loop described in the paper:
"Feedback from tournaments enables iterative improvement, creating a self-improving loop toward novel and high-quality outputs"
I added a PromptRefinerAgent that:
- Pulls top-ranked hypotheses from memory
- Uses DSPy's BootstrapFewShot to refine prompts
- Stores only improved versions
- Logs everything for traceability
File: co_ai/agents/prompt_refiner.py
class PromptRefinerAgent(BaseAgent):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.agent_name = cfg.get("target_agent", "generation")
        self.strategy = cfg.get("strategy", "basic_refinement")

    async def run(self, context: dict) -> dict:
        few_shot_data = self.memory.get_ranked_hypotheses(context["goal"], limit=5)
        refined_prompt = self._refine_with_dspy(context, few_shot_data)
        old_score = self._evaluate_prompt(context["prompt"], few_shot_data)
        new_score = self._evaluate_prompt(refined_prompt, few_shot_data)
        if new_score > old_score:
            self.memory.store_prompt_version(
                agent_name=self.agent_name,
                prompt_key=self.strategy,
                prompt_text=refined_prompt,
                source="dsp_refinement",
                version=context.get("prompt_version", 1) + 1,
                metadata={"few_shot_count": len(few_shot_data)}
            )
            context["prompt"] = refined_prompt
        return context
Now the system improves its own prompting strategy based on real-world performance, just like the paper describes:
"Each comparison concludes with the phrase 'better hypothesis:<1 or 2>'"
4. Structured Output + Memory System
All agents generate structured output so future stages can parse and reason about them.
For example:
# Hypothesis 1
Mechanism: Inhibition of IRE1α signaling reduces tumor viability in AML cells.
# Hypothesis 2
Mechanism: KIRA6 induces apoptosis in MOLM13 cells through ER stress pathways.
# Hypothesis 3
Mechanism: Targeting UPR pathways enhances chemosensitivity in FLT3-ITD+ leukemias.
This format makes it easy to extract, compare, and evolve hypotheses.
From the paper:
"Appendix Figure A.4 shows example prompts for comparing two hypotheses during a tournament match"
So we built a consistent structure that supports:
- Parsing
- Comparison
- Refinement
- Logging
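As an illustration of how downstream agents can parse this structure, here is a short sketch that splits the generated text into individual hypotheses with a regular expression. It is similar in spirit to the extraction_regex setting shown in the agent config below, though the exact pattern co_ai uses may differ.
```python
import re

output = """# Hypothesis 1
Mechanism: Inhibition of IRE1α signaling reduces tumor viability in AML cells.
# Hypothesis 2
Mechanism: KIRA6 induces apoptosis in MOLM13 cells through ER stress pathways.
"""

# Capture the text between each "# Hypothesis N" heading and the next one (or end of string).
blocks = re.findall(r"# Hypothesis \d+\n(.+?)(?=\n# Hypothesis \d+|\Z)", output, re.DOTALL)
hypotheses = [b.strip() for b in blocks]
for h in hypotheses:
    print(h)
```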
5. Modular Architecture with Configurable Agents
Using Hydra-based config, each agent can be enabled/disabled and configured independently.
File: configs/agents/generation.yaml
defaults:
  - /model/qwen3
  - /prompt_refiner/disabled

generation:
  name: generation
  enabled: true
  save_prompt: true
  save_context: true
  skip_if_completed: true
  strategy: goal_aligned
  input_keys: ["goal", "literature"]
  output_key: hypotheses
  prompt_mode: file
  prompt_file: goal_aligned.txt
  extraction_regex: "Hypothesis \\d+:\\n(.+?)\\n"
And in the pipeline config:
File: configs/pipeline/default.yaml
# The goal of the pipeline
goal: "Can generative AI models reduce the time required to make scientific discoveries in research?"
paths:  # directory where the app will search for the prompt templates
  prompts: ${hydra:runtime.cwd}/prompts
pipeline:
  stages:
    - name: generation  # name for the agent (useful if you want to run the same one at different times)
      cls: co_ai.agents.generation.GenerationAgent  # you can add custom classes
      enabled: true
      iterations: 1
    - name: reflection
      cls: co_ai.agents.reflection.ReflectionAgent
      enabled: true
      iterations: 1
    - name: ranking
      cls: co_ai.agents.ranking.RankingAgent
      enabled: true
      iterations: 2
This gives you full flexibility to:
- Swap out models
- Inject different strategies
- Log everything
- Tune behavior per stage
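Under the hood, a stage list like this can be turned into live agent objects with a small amount of dynamic import. The sketch below shows one way that could look; it is not the actual co_ai supervisor, and it assumes each cls string points at an importable class that accepts (cfg, memory, logger).
```python
import importlib

def load_agent(stage: dict, memory=None, logger=None):
    """Instantiate an agent class from the dotted path in a pipeline stage config."""
    module_path, class_name = stage["cls"].rsplit(".", 1)
    agent_cls = getattr(importlib.import_module(module_path), class_name)
    return agent_cls(stage, memory=memory, logger=logger)

stages = [
    {"name": "generation", "cls": "co_ai.agents.generation.GenerationAgent", "enabled": True},
    {"name": "ranking", "cls": "co_ai.agents.ranking.RankingAgent", "enabled": True},
]
agents = [load_agent(s) for s in stages if s["enabled"]]
```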
6. Postgres and the MemoryTool
This section explains why I chose a database-centric architecture and what each component is for, so developers and system designers can understand the motivations behind these choices in the co_ai system.
Why We Chose a Database-Centric Design
The co_ai framework relies heavily on structured and evolving information: hypotheses, prompt versions, context states, and performance metrics. To make this information persistent, queryable, and extensible, we use PostgreSQL with pgvector for semantic search. Here's why we chose this route and how each component contributes to the pipeline:
Overview of Database Components
Component | Purpose | Why It Matters
---|---|---
hypotheses_store | Stores generated hypotheses, confidence scores, reviews, and embeddings | Ensures we can track hypothesis evolution, compare outputs across versions, and rank by quality
prompt_store | Records each prompt version, agent, strategy, and associated goal | Enables reproducibility and analysis of which prompts yield better hypotheses
context_states | Saves full pipeline context at each stage | Allows recovery, audit trails, and comparative runs with different agents/settings
report_logger | Tracks generated reports with summaries and run metadata | Useful for end-of-run outputs and dashboard summaries
embedding_store | Caches vector embeddings for fast similarity searches | Boosts performance for semantic search and clustering
Why Not Just Use Files or In-Memory Storage?
While YAML/JSON files are great for flexibility and fast prototyping, we needed:
- Long-term memory: Hypotheses and evaluations need to persist across runs.
- Version control: Prompt refinement requires tracking iterations and improvements.
- Query capabilities: Ranking and tuning are based on filtering and sorting large sets.
- Relational integrity: Hypotheses link to prompts, which link to agents and evaluations.
This structure also sets the stage for more advanced features like:
- Similarity search across past hypotheses
- Dataset-based prompt tuning
- Interactive dashboards or admin panels
File: schema.sql
-- prompts table
CREATE TABLE IF NOT EXISTS prompts (
    id SERIAL PRIMARY KEY,
    agent_name TEXT NOT NULL,
    prompt_key TEXT NOT NULL,             -- e.g., generation_goal_aligned.txt
    prompt_text TEXT NOT NULL,
    goal TEXT,
    response_text TEXT,
    source TEXT,                          -- e.g., manual, dsp_refinement, feedback_injection
    version INT DEFAULT 1,
    is_current BOOLEAN DEFAULT FALSE,
    strategy TEXT,                        -- e.g., goal_aligned, out_of_the_box
    metadata JSONB DEFAULT '{}'::JSONB,
    timestamp TIMESTAMPTZ DEFAULT NOW()
);
-- Stores all generated hypotheses and their evaluations
CREATE TABLE IF NOT EXISTS hypotheses (
    id SERIAL PRIMARY KEY,
    goal TEXT NOT NULL,                              -- Research objective
    text TEXT NOT NULL,                              -- Hypothesis statement
    confidence FLOAT DEFAULT 0.0,                    -- Confidence score (0-1 scale)
    review TEXT,                                     -- Structured review data
    reflection TEXT,                                 -- Structured reflection data
    elo_rating FLOAT DEFAULT 750.0,                  -- Tournament ranking score
    embedding VECTOR(1024),                          -- Vector representation of hypothesis
    features JSONB,                                  -- Mechanism, rationale, experiment plan
    prompt_id INT REFERENCES prompts(id),            -- Prompt used to generate this hypothesis
    source_hypothesis INT REFERENCES hypotheses(id), -- If derived from another
    strategy_used TEXT,                              -- e.g., goal_aligned, out_of_the_box
    version INT DEFAULT 1,                           -- Evolve count
    source TEXT,                                     -- e.g., manual, refinement, grafting
    enabled BOOLEAN DEFAULT TRUE,                    -- Soft delete flag
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);
7. Vector Memory with PostgreSQL + pgvector
Then I built a VectorMemory class to retrieve similar hypotheses:
File: co_ai/memory/vector_store.py
def get_similar_hypotheses(self, goal: str, limit: int = 5):
    """Get hypotheses from memory that are most relevant to the current goal."""
    try:
        goal_embedding = get_embedding(goal)
        cur = self.conn.cursor()  # open psycopg2 connection held by the store
        cur.execute(
            """
            SELECT text, review, elo_rating
            FROM hypotheses
            ORDER BY embedding <-> %s
            LIMIT %s
            """,
            (str(goal_embedding), limit),
        )
        rows = cur.fetchall()
        return [
            {
                "text": row[0],
                "review": row[1] or "",
                "score": row[2] or 1000,
            }
            for row in rows
        ]
    except Exception as e:
        print(f"[VectorMemory] Failed to fetch similar hypotheses: {e}")
        return []
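For completeness, here is a rough sketch of the corresponding write path: inserting a hypothesis together with its embedding so the query above has something to search. The connection string and the 1024-dimension placeholder vector are illustrative assumptions; pgvector accepts the vector as a bracketed string literal.
```python
import psycopg2

def store_hypothesis(conn, goal: str, text: str, embedding: list[float]):
    """Insert a hypothesis row with its pgvector embedding."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO hypotheses (goal, text, embedding) VALUES (%s, %s, %s)",
            (goal, text, str(embedding)),  # pgvector parses '[0.1, 0.2, ...]'
        )
    conn.commit()

conn = psycopg2.connect("dbname=co_ai user=postgres password=yourpassword host=localhost")
store_hypothesis(conn, "Example goal", "Example hypothesis text", [0.0] * 1024)
```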
Jinja2 & How It Powers Flexible Prompting
Motivation: Prompt Flexibility Is Scientific Freedom
One of the key goals when building co_ai was to make the system highly adaptable to different domains, preferences, and scientific workflows.
To do that, we needed a way to:
- Inject dynamic values (goal, literature, preferences)
- Support conditional logic in prompts
- Allow users to define their own templates
- Maintain structure while enabling customization
- Integrate cleanly with Hydra config and agent logic
That's where Jinja2 templates came in.
How Jinja2 Templates Work in co_ai
We use Jinja2-style templating to build structured prompts dynamically.
Here's an example from prompts/generation/goal_aligned.txt:
You are an expert researcher generating novel hypotheses.
Use inspiration from analogous domains or mechanisms to develop creative solutions.
Goal:
{{ goal }}
Literature: {{ literature }}
{% if preferences %}
Preferences:
{% for p in preferences %}
- {{ p }}
{% endfor %}
{% endif %}
Instructions:
1. Review findings above before generating new hypotheses
2. Generate 3 distinct, testable hypotheses
3. Each must include mechanism, rationale, and experiment plan
4. {% if "goal_consistency" in preferences %}Prioritize direct alignment with the stated goal.{% endif %}
5. {% if "novelty" in preferences %}Focus on originality and unexpected connections.{% endif %}
6. {% if "feasibility" in preferences %}Ensure experiments can be realistically tested in the lab.{% endif %}
7. {% if "biological_plausibility" in preferences %}Make sure biological mechanisms are valid and well-explained.{% endif %}
8. {% if "simplicity" in preferences %}Favor clarity and simplicity over complexity.{% endif %}
In your agent logic:
def _build_prompt(self, context):
    literature = "\n".join([
        f"{i+1}. {r['title']}\n- {r['summary']}"
        for i, r in enumerate(context.get("literature", []))
    ])
    return self.prompt_renderer.render(
        goal=context["goal"],
        literature=literature,
        preferences=context.get("preferences", [])
    )
Advantages of Using Jinja2 for Prompting
Here are the main reasons we chose Jinja2 for prompt generation:
1. Dynamic Variable Injection
You can inject any variable into your prompt:
Goal:
{{ goal }}
Literature:
{{ literature }}
This makes it easy to generate prompts based on real-time context.
2. Conditional Logic Based on Preferences
Want to change the prompt based on preference?
{% if "goal_consistency" in preferences %}
Prioritize direct alignment with the stated goal.
{% endif %}
Now your LLM follows instructions only if certain preferences are active.
3. Clean Template Reuse Across Agents
The same template can be used by:
- GenerationAgent
- EvolutionAgent
- Meta-review Agent
Just change the inputs; the same prompt format serves each agent differently.
4. Full Traceability and Version Control
Because prompts are stored in files, you can:
- Track prompt versions in git
- Compare old vs new prompts
- Roll back if something breaks
5. Local Model Compatibility
Works great with local models like Qwen3, Mistral, or Llama via Ollama:
- No need for hosted APIs
- Everything runs locally
- Fully reproducible
6. Feedback Injection from Reviews
Want to add strategic directions from Meta-review?
{% if meta_review.insights %}
Strategic Directions:
{% for point in meta_review.insights %}
- {{ point }}
{% endfor %}
{% endif %}
Now future generations incorporate insights automatically.
7. Easy Customization Without Code Changes
Users can modify the .txt files directly:
- Add new preferences
- Change instruction phrasing
- Adjust output structure
- Enable/disable sections
No coding required.
Overview of Implemented Agents
Each component in the system plays a unique role in the pipeline. Below is a list of implemented agents and their responsibilities in our AI co-scientist system.
Agent Name | Purpose | Why We Created It |
---|---|---|
base.py | Shared base class for all agents | Modular design with common utilities |
debate.py | Simulates scientific debate for ranking | To evaluate and prioritize hypotheses |
evolution.py | Evolves hypotheses using strategies like grafting or simplification | For continuous improvement |
generation.py | Hypothesis generation after literature grounding | Core reasoning engine |
generic.py | Template for custom agent creation | Supports reusable agent pattern |
literature.py | Performs web search and literature parsing | Grounds hypotheses in prior work |
meta_review.py | Synthesizes insights from all reviews | Creates strategic feedback for future agents |
prompt_refiner.py | Uses past outputs to improve prompts | Enables self-tuning via preference injection |
prompt_tuning.py | Applies DSPy-based tuning to improve prompting | Builds better prompts over time |
proximity.py | Measures similarity between hypotheses | Tracks evolution and improves ranking |
ranking.py | Manages tournaments and compares hypotheses using Elo ratings | Determines which ideas are stronger |
reflection.py | Analyzes hypotheses for correctness, novelty, and feasibility | Filters weak ideas early |
review.py | Lighter version of the Reflection agent | Quick evaluation of hypothesis quality |
1. Short Description of Each Agent
Here's a concise description of each agent in the system, based on both the paper and my implementation.
Generation Agent
The Generation Agent starts with a research goal and generates multiple hypotheses by synthesizing prior knowledge or exploring new directions.
It uses literature grounding when available, and supports strategies like:
- Goal-aligned generation
- Out-of-the-box thinking
- Feasibility-focused prompting
From the paper:
"The Generation agent iteratively searches the web, retrieves and reads relevant research articles, and grounds its reasoning by summarizing prior work."
Reflection Agent
The Reflection Agent simulates peer review. It evaluates hypothesis correctness, novelty, feasibility, and quality using structured prompts.
Supports multiple types of reflection:
- Full review
- Deep verification
- Simulation-based testing
- Observation-driven critique
From the paper:
"The Reflection agent performs full reviews, deep verification, observation reviews, and simulation reviews to ensure high-quality outputs."
Ranking Agent
The Ranking Agent runs simulated scientific debates between hypotheses, assigning Elo ratings based on their performance.
Uses:
- Direct comparison tournaments
- Preference-weighted scoring
- Strategic ranking based on novelty/feasibility/goal alignment
From the paper:
"Each comparison concludes with the phrase 'better hypothesis:<1 or 2>'"
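To make the Elo mechanics concrete, here is a small sketch of the standard Elo update applied to one pairwise "better hypothesis" verdict. The K-factor and the starting rating of 750 are illustrative; the RankingAgent's exact constants may differ.
```python
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Standard Elo update for a single pairwise comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# The debate verdict "better hypothesis: 2" means hypothesis A lost this match.
print(elo_update(750.0, 750.0, a_wins=False))  # -> (734.0, 766.0)
```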
Evolution Agent
The Evolution Agent refines top-ranked hypotheses using techniques like simplification, inspiration from other ideas, grafting, or out-of-the-box rethinking.
Key strategies include:
- Enhancing clarity
- Improving biological plausibility
- Increasing testability
From the paper:
"The Evolution agent continuously refines and improves existing hypotheses and proposals using several approaches including enhancement through grounding, coherence and feasibility improvements..."
Meta-review Agent
The Meta-review Agent synthesizes insights from all agents into strategic directions that guide future generations.
It identifies common patterns in critiques and rankings, then injects them into future prompts.
From the paper:
"The Meta-review agent synthesizes insights from all reviews into strategic directions applicable to all agents"
Proximity Agent
The Proximity Agent computes similarity between hypotheses to detect duplicates, track evolution, and identify promising clusters of related ideas.
Supports:
- Hypothesis clustering
- Redundancy elimination
- Similarity tracking over time
From the paper:
"The dark gray arrows represent the information flow through the co-scientist system, while the red arrows represent the feedback loop between the specialized agents."
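Here is a minimal sketch of the similarity computation behind such an agent, using cosine similarity over hypothesis embeddings with numpy. The 0.95 threshold is an assumption; in practice the same comparison can also be pushed down to pgvector's distance operator shown earlier.
```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def near_duplicates(embeddings: dict[str, np.ndarray], threshold: float = 0.95):
    """Return pairs of hypotheses whose embeddings are nearly identical."""
    names = list(embeddings)
    pairs = []
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            if cosine_similarity(embeddings[x], embeddings[y]) >= threshold:
                pairs.append((x, y))
    return pairs
```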
2. How They Fit Into the Pipeline
Here's how the agents interact in sequence:
flowchart LR
    subgraph InputStage [Input]
        A[Research Goal] --> B[LiteratureAgent]
    end
    subgraph HypothesisGeneration [Hypothesis Generation]
        B --> C[GenerationAgent]
    end
    subgraph ReviewAndRanking [Evaluation & Ranking]
        C --> D[ReflectionAgent]
        D --> E[RankingAgent]
    end
    subgraph FeedbackLoop [Strategic Improvement]
        E --> F[MetaReviewAgent]
        F --> G[EvolutionAgent]
        G --> H[PromptTuningAgent]
        H --> C
        H --> I([Refined Generation])
    end
    subgraph OutputStage [Final Output]
        E --> J[Final Hypotheses + Reports]
    end
    style InputStage fill:#f9f9f9,stroke:#333
    style HypothesisGeneration fill:#e6f7ff,stroke:#333
    style ReviewAndRanking fill:#fffbe6,stroke:#333
    style FeedbackLoop fill:#e6ffe6,stroke:#333
    style OutputStage fill:#f9f9f9,stroke:#333
Each stage contributes to building better hypotheses.
Stage | Input | Output | Purpose |
---|---|---|---|
Generation | Goal + Literature | Multiple hypotheses | Start with novel ideas |
Reflection | Hypotheses + Preferences | Structured reviews | Filter weak ideas |
Ranking | Two hypotheses | Better one (via Elo) | Prioritize best ideas |
Evolution | Top-ranked hypothesis | Refined version | Improve clarity and feasibility |
Meta-review | All reviews + rankings | Strategic directions | Inject insights into future agents |
Prompt Tuning | Old prompt + ranked data | Improved prompt | Refine prompting strategy |
This matches exactly what the paper describes:
"Feedback from tournaments enables iterative improvement"
"The Meta-review agent synthesizes insights from all reviews into strategic directions applicable to all agents"
3. Feedback Loops Between Agents
These are the key feedback loops in the AI co-scientist system.
Loop 1: Peer Review → Prompt Refinement
graph LR
    Generation --> Reflection
    Reflection --> PromptTuning
    PromptTuning --> Generation
How it works:
- The Reflection Agent gives detailed reviews
- The PromptTuningAgent learns from those reviews
- The improved prompt is used in the next round
Supports Appendix A.2.4:
"Refine the following conceptual idea, enhancing its practical implementability by leveraging contemporary technological capabilities"
Loop 2: Tournament → Strategic Directions
graph LR
    Ranking --> MetaReview
    MetaReview --> Generation
    MetaReview --> Evolution
How it works:
- The Ranking Agent identifies which hypotheses win most often
- The Meta-review Agent extracts recurring themes and preferences
- These strategic directions are injected into future agents' prompts
Matches what's described in Appendix A.2.5:
"The Meta-review agent generates feedback applicable to all agents... simply appended to their prompts in the next iteration - a capability facilitated by the long-context search and reasoning capabilities of the underlying Gemini 2.0 models"
Loop 3: Hypothesis → Memory → New Hypothesis
graph LR
    Generation --> Memory
    Memory --> Generation
How it works:
- The LiteratureAgent stores results in vector memory
- The GenerationAgent pulls similar past hypotheses during future runs
- This creates a self-improving loop where old ideas inform new ones
Supports what the paper describes:
"Appendix Figure A.4 shows example prompts for comparing two hypotheses during a tournament match"
4. Visualization of Agent Interactions
Full Pipeline Diagram
graph TB
    subgraph InputStage [Input]
        A[Research Goal + Preferences] --> B[LiteratureAgent]
    end
    subgraph GenerationStage [Hypothesis Generation]
        B --> C[GenerationAgent]
    end
    subgraph ReviewStage [Evaluation & Critique]
        C --> D[ReflectionAgent]
    end
    subgraph RankingStage [Prioritization]
        D --> E[RankingAgent]
    end
    subgraph FeedbackLoop [Strategic Improvement]
        E --> F[MetaReviewAgent]
        F --> G[EvolutionAgent]
        G --> H[PromptTuningAgent]
        H --> C
    end
    subgraph OutputStage [Refined Output]
        E --> I[Final Hypotheses + Reports]
    end
    style InputStage fill:#f9f9f9,stroke:#333
    style GenerationStage fill:#e6f7ff,stroke:#333
    style ReviewStage fill:#fffbe6,stroke:#333
    style RankingStage fill:#ffe6e6,stroke:#333
    style FeedbackLoop fill:#e6ffe6,stroke:#333
    style OutputStage fill:#f9f9f9,stroke:#333
- The pipeline starts with a goal and preferences, then uses the LiteratureAgent to gather relevant research.
- The GenerationAgent proposes hypotheses based on this foundation.
- The ReflectionAgent performs structured reviews of hypothesis quality.
- The RankingAgent assigns Elo ratings and selects top performers.
- The MetaReviewAgent synthesizes insights into strategic directions.
- These feed into both the EvolutionAgent and PromptTuningAgent, enabling continuous refinement.
- The improved prompts and strategies are fed back into the GenerationAgent, creating a full feedback loop.
Feedback Injection Diagram
graph LR
    subgraph CoScientistPipeline
        direction LR
        Gen[GenerationAgent] --> Refl[ReflectionAgent]
        Refl --> Rank[RankingAgent]
        Rank --> Meta[MetaReviewAgent]
        Meta --> Evo[EvolutionAgent]
        Meta --> Gen
        Evo --> Gen
    end
- The feedback loop allows insights from reviews and rankings to be injected directly into future generations.
- The Meta-review Agent plays a central role by synthesizing recurring themes.
- Strategic directions are sent to the GenerationAgent for better prompting and to the EvolutionAgent for structural improvements.
Prompt Improvement Loop
graph TB
    PromptLoader --> Generation
    Generation --> Reflection
    Reflection --> Ranking
    Ranking --> MetaReview
    MetaReview --> PromptTuning
    PromptTuning --> PromptLoader
Prompt Evolution Workflow
- The GenerationAgent creates hypotheses using the current version of the prompt.
- The ReflectionAgent and RankingAgent evaluate the quality of these hypotheses through structured reviews and Elo scoring.
- The MetaReviewAgent identifies recurring patterns and strategic weaknesses in the critiques.
- The PromptTuningAgent uses those insights to refine the original prompt.
- The refined prompt is stored and reloaded via the PromptLoader, enabling the system to generate better hypotheses in the next cycle.
This iterative loop allows prompt evolution to adapt dynamically to performance signals, creating a self-improving hypothesis generation system.
Self-Improving Loop: From Goal to Insight
graph TD
    A[Goal Input] --> B[GenerationAgent generates hypotheses]
    B --> C[Prompt stored in vector memory]
    C --> D[ReflectionAgent critiques them]
    D --> E[RankingAgent compares using Elo-style tournament]
    E --> F[PromptTunerAgent tunes future generations]
    F --> G[Meta-reviewAgent synthesizes insights]
    G --> H[EvolutionAgent refines hypotheses]
    H --> B
- It starts with a research goal, passed to the GenerationAgent.
- Prompts are stored in vector memory, ensuring traceability and enabling future retrieval or refinement.
- Each hypothesis is reviewed for correctness, feasibility, and alignment with research goals.
- Top-ranked hypotheses undergo further refinement using inspiration, simplification, and preference-driven tuning.
- The MetaReviewAgent aggregates reviews and rankings to uncover strategic insights.
- These insights feed back into the next generation of prompts and hypotheses, completing the learning loop.
Over time, the system becomes smarter and more aligned with your research preferences and objectives.
This creates a full feedback loop:
"Feedback from tournaments enables iterative improvement"
"Strategic directions guide future generations"
Overview: How PromptTuningAgent Improves Scientific Prompts with DSPy
The PromptTuningAgent is a component in the co_ai pipeline that uses LLMs and DSPy to automatically improve the prompts used for generating scientific hypotheses. It uses few-shot learning, prompt evaluation, and feedback loops to refine prompts based on real data collected from earlier runs.
Here’s how it works:
1. Signature Definition
We start by defining a PromptTuningSignature. This tells DSPy what inputs and outputs to expect:
- The goal (e.g. "What happens if the US defaults?")
- The original prompt used to generate hypotheses
- The best hypothesis produced
- A review and a score of that hypothesis
- The output: a refined version of the prompt
2. Initialize the Agent
When PromptTuningAgent is initialized:
- It loads configuration values (e.g. how many examples to train on).
- It sets up a connection to a local LLM through Ollama.
- It configures DSPy with this LLM so it can use it for training.
3. Agent Execution: run(context)
When the pipeline runs this agent:
- It extracts the goal from the context.
- It pulls a mix of recent prompts + their hypothesis results from memory (used for training and validation).
- If no data is found, it logs the issue and exits gracefully.
4. Training the Prompt Refiner
- It creates DSPy Examples from the training data.
- It defines a custom scoring function (_prompt_quality_metric) that evaluates whether a new prompt is better than the original.
- It compiles a new tuned_program using BootstrapFewShot with the training data and scoring metric.
5. Generating New Prompts
Using the tuned DSPy program, it runs the validation examples through the newly trained model:
- For each validation sample, it generates a refined prompt.
- It saves each refined prompt to the database, recording metadata like strategy and version.
- It logs each prompt and stores it in the context's prompt_history.
6. Evaluating Prompt Quality
The custom scoring function (_prompt_quality_metric) works like this:
- It generates hypotheses using the original and the refined prompts.
- It builds a comparison prompt to evaluate which prompt performed better.
- It parses the comparison response (looking for "better prompt: <A/B>") and returns a score.
- All intermediate steps are logged for transparency and debugging.
7. Why This Matters
This class enables automated prompt evolution. Instead of manually tweaking prompts and checking their results, the agent:
- Trains itself using past runs.
- Evaluates prompts automatically.
- Improves prompts continuously based on actual hypothesis quality.
It's a self-improving loop that can adapt as your data or scientific objectives evolve.
from abc import ABC, abstractmethod
import re

import dspy
from dspy import Predict, Signature, InputField, OutputField, Example, BootstrapFewShot

from co_ai.agents.base import BaseAgent
from co_ai.constants import GOAL


# DSPy signature for prompt refinement: defines input/output fields for tuning
class PromptTuningSignature(Signature):
    goal = InputField(desc="Scientific research goal or question")
    input_prompt = InputField(desc="Original prompt used to generate hypotheses")
    hypotheses = InputField(desc="Best hypothesis generated")
    review = InputField(desc="Expert review of the hypothesis")
    score = InputField(desc="Numeric score evaluating the hypothesis quality")
    refined_prompt = OutputField(desc="Improved version of the original prompt")


# Simple evaluation result class to return from evaluator
class EvaluationResult:
    def __init__(self, score: float, reason: str):
        self.score = score
        self.reason = reason


# Base evaluator interface (not used directly, but useful for future extensions)
class BaseEvaluator(ABC):
    @abstractmethod
    def evaluate(
        self, original: str, proposal: str, metadata: dict = None
    ) -> EvaluationResult:
        pass


# DSPy-based evaluator that can run a Chain-of-Thought program
class DSPyEvaluator(BaseEvaluator):
    def __init__(self):
        self.program = dspy.ChainOfThought(PromptTuningSignature)

    def evaluate(
        self, original: str, proposal: str, metadata: dict = None
    ) -> EvaluationResult:
        result = self.program(
            goal=metadata["goal"],
            input_prompt=original,
            hypotheses=metadata["hypotheses"],
            review=metadata.get("review", ""),
            score=metadata.get("score", 750),
        )
        try:
            score = float(result.score)
        except (ValueError, TypeError):
            score = 0.0
        return EvaluationResult(score=score, reason=result.explanation)


# Main agent class responsible for training and tuning prompts using DSPy
class PromptTuningAgent(BaseAgent):
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.agent_name = cfg.get("name", "prompt_tuning")
        self.prompt_key = cfg.get("prompt_key", "default")
        self.sample_size = cfg.get("sample_size", 20)
        self.generate_count = cfg.get("generate_count", 10)
        self.current_version = cfg.get("version", 1)
        # Configure DSPy with local LLM (Ollama)
        lm = dspy.LM(
            "ollama_chat/qwen3",
            api_base="http://localhost:11434",
            api_key="",
        )
        dspy.configure(lm=lm)

    async def run(self, context: dict) -> dict:
        goal = context.get(GOAL, "")
        generation_count = self.sample_size + self.generate_count
        self.logger.log(
            "PromptTuningExamples",
            {"samples size": self.sample_size, "generation count": generation_count},
        )
        # Get training + validation data
        examples = self.memory.prompt.get_prompt_training_set(goal, generation_count)
        train_data = examples[: self.sample_size]
        val_data = examples[self.sample_size :]
        if not examples:
            self.logger.log(
                "PromptTuningSkipped", {"reason": "no_training_data", "goal": goal}
            )
            return context
        # Build training set for DSPy
        training_set = [
            Example(
                goal=item["goal"],
                input_prompt=item["prompt_text"],
                hypotheses=item["hypothesis_text"],
                review=item.get("review", ""),
                score=item.get("elo_rating", 1000),
            ).with_inputs("goal", "input_prompt", "hypotheses", "review", "score")
            for item in train_data
        ]

        # Wrap our scoring metric so we can inject context during tuning
        def wrapped_metric(example, pred, trace=None):
            return self._prompt_quality_metric(example, pred, context=context)

        # Train prompt-tuning program
        tuner = BootstrapFewShot(metric=wrapped_metric)
        student = Predict(PromptTuningSignature)
        tuned_program = tuner.compile(student=student, trainset=training_set)
        # Use tuned program to generate and store new refined prompt
        await self.generate_and_store_refined_prompts(
            tuned_program, goal, context, val_data
        )
        self.logger.log(
            "PromptTuningCompleted",
            {
                "goal": goal,
                "example_count": len(training_set),
                "generated_count": len(val_data),
            },
        )
        return context

    async def generate_and_store_refined_prompts(
        self, tuned_program, goal: str, context: dict, val_set
    ):
        """
        Generate refined prompts using the tuned DSPy program and store them in the database.

        Args:
            tuned_program: A compiled DSPy program capable of generating refined prompts.
            goal: The scientific goal for this run.
            context: Shared pipeline state.
            val_set: Validation examples to run through the tuned program.
        """
        stored_count = 0
        for i, example in enumerate(val_set):
            try:
                # Run DSPy program on new example
                result = tuned_program(
                    goal=example["goal"],
                    input_prompt=example["prompt_text"],
                    hypotheses=example["hypothesis_text"],
                    review=example.get("review", ""),
                    score=example.get("elo_rating", 1000),
                )
                refined_prompt = result.refined_prompt.strip()
                # Store refined prompt to the DB
                self.memory.prompt.save(
                    goal=example["goal"],
                    agent_name=self.name,
                    prompt_key=self.prompt_key,
                    prompt_text=refined_prompt,
                    response=None,
                    strategy="refined_via_dspy",
                    version=self.current_version + 1,
                )
                stored_count += 1
                # Update context with prompt history
                self.add_to_prompt_history(
                    context, refined_prompt, {"original": example["prompt_text"]}
                )
                self.logger.log(
                    "TunedPromptStored",
                    {"goal": goal, "refined_snippet": refined_prompt[:100]},
                )
            except Exception as e:
                self.logger.log(
                    "TunedPromptGenerationFailed",
                    {"error": str(e), "example_snippet": str(example)[:100]},
                )
        self.logger.log(
            "BatchTunedPromptsComplete", {"goal": goal, "count": stored_count}
        )

    def _prompt_quality_metric(self, example, pred, context: dict) -> float:
        """Run both prompts and compare results"""
        try:
            prompt_a = example.input_prompt
            prompt_b = pred.refined_prompt
            self.logger.log(
                "PromptQualityCompareStart",
                {
                    "prompt_a_snippet": prompt_a[:100],
                    "prompt_b_snippet": prompt_b[:100],
                },
            )
            hypotheses_a = self.call_llm(prompt_a, context)
            self.logger.log(
                "PromptAResponseGenerated", {"hypotheses_a_snippet": hypotheses_a[:200]}
            )
            hypotheses_b = self.call_llm(prompt_b, context)
            self.logger.log(
                "PromptBResponseGenerated", {"hypotheses_b_snippet": hypotheses_b[:200]}
            )
            # Run comparison
            merged = {
                **context,
                **{
                    "prompt_a": prompt_a,
                    "prompt_b": prompt_b,
                    "hypotheses_a": hypotheses_a,
                    "hypotheses_b": hypotheses_b,
                },
            }
            comparison_prompt = self.prompt_loader.load_prompt(self.cfg, merged)
            self.logger.log(
                "ComparisonPromptConstructed",
                {"comparison_prompt_snippet": comparison_prompt[:200]},
            )
            response = self.call_llm(comparison_prompt, context)
            self.logger.log(
                "ComparisonResponseReceived", {"response_snippet": response[:200]}
            )
            match = re.search(r"better prompt:<([AB])>", response, re.IGNORECASE)
            if match:
                choice = match.group(1).upper()
                score = 1.0 if choice == "B" else 0.5
                self.logger.log(
                    "PromptComparisonResult", {"winner": choice, "score": score}
                )
                return score
            else:
                self.logger.log("PromptComparisonNoMatch", {"response": response})
                return 0.0
        except Exception as e:
            self.logger.log(
                "PromptQualityMetricError",
                {
                    "error": str(e),
                    "example_input_prompt_snippet": example.input_prompt[:100],
                    "refined_prompt_snippet": getattr(pred, "refined_prompt", "")[:100],
                },
            )
            return 0.0
JSON Logging: Tracking Every Step
Every run of the co_ai pipeline is automatically logged in a structured JSON Lines (.jsonl) file. This makes it easy to audit, debug, and analyze the behavior of the system over time.
How It Works
- A unique run_id is generated at the start of each pipeline execution.
- This ID is used to create a dedicated log file stored under logs/, such as logs/run_us_debt_analysis_20240516_130245.jsonl.
- Each agent and component in the system (generation, ranking, reflection, etc.) logs its actions using a consistent structure:
{
  "timestamp": "2025-05-16T13:02:45.123Z",
  "event_type": "GeneratedHypotheses",
  "data": {
    "goal": "Can generative AI models reduce the time required to make scientific discoveries in biomedical research?",
    "snippet": "Hypothesis 1: ..."
  }
}
- These events are emitted through the JSONLogger class and tagged with emojis and event types for easy tracking.
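Here is a minimal sketch of what such a logger can look like; the real JSONLogger in co_ai may differ in detail, but the idea is one JSON object per event, appended to a run-specific file.
```python
import json
from datetime import datetime, timezone
from pathlib import Path

class JSONLogger:
    """Append one JSON object per event to a run-specific .jsonl file."""
    def __init__(self, run_id: str, log_dir: str = "logs"):
        Path(log_dir).mkdir(exist_ok=True)
        self.path = Path(log_dir) / f"{run_id}.jsonl"

    def log(self, event_type: str, data: dict):
        event = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "event_type": event_type,
            "data": data,
        }
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")

logger = JSONLogger("run_us_debt_analysis_20240516_130245")
logger.log("GeneratedHypotheses", {"goal": "...", "snippet": "Hypothesis 1: ..."})
```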
Why It Matters
- Transparency: Every stage of the reasoning process is recorded and can be revisited.
- Debugging: If something goes wrong, the log reveals where and why.
- Experiment Tracking: Logs form the foundation for analyzing pipeline performance and tuning over time.
This approach ensures that the entire scientific reasoning process is not a black box, but a transparent and reproducible workflow.
Context Persistence: Save and Resume at Any Step
One of the key architectural decisions in co_ai is persistent context storage. At every stage of the pipeline, the system can store the full pipeline context, a structured representation of everything the system knows so far.
What Is "Context"?
The context is a Python dictionary containing:
- The research goal
- literature findings
- hypotheses generated so far
- reviews, reflections, rankings, and more
It flows through each stage like a growing memory of the reasoning process.
How Context Is Stored
Each agent has config options such as:
save_context: true
skip_if_completed: true
When save_context is enabled:
- The full context is saved to the PostgreSQL database in the context_states table.
- Metadata such as run_id, stage_name, version, and preferences are stored alongside it.
- The most recent version is flagged with is_current = true.
This lets you resume a run, skip steps that have already completed, or inspect intermediate pipeline states.
How It Works in Practice
- After each agent runs, it saves the new context to the database if save_context is enabled.
- Before running, the agent checks whether a completed version already exists:
  - If skip_if_completed = true, it loads the saved context and skips computation.
  - If not, it runs normally and updates the context.
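A rough sketch of that save/load cycle against the context_states table is shown below. The column names follow the description above; the actual co_ai schema and memory API may differ.
```python
import json
import psycopg2

def save_context(conn, run_id: str, stage_name: str, context: dict):
    """Persist the pipeline context for one stage and mark it as the current version."""
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE context_states SET is_current = FALSE "
            "WHERE run_id = %s AND stage_name = %s",
            (run_id, stage_name),
        )
        cur.execute(
            "INSERT INTO context_states (run_id, stage_name, context, is_current) "
            "VALUES (%s, %s, %s, TRUE)",
            (run_id, stage_name, json.dumps(context)),
        )
    conn.commit()

def load_context(conn, run_id: str, stage_name: str):
    """Load the most recently saved context for a stage, or None if there is none."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT context FROM context_states "
            "WHERE run_id = %s AND stage_name = %s AND is_current",
            (run_id, stage_name),
        )
        row = cur.fetchone()
    return row[0] if row else None
```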
Why This Matters
- Restartable Runs: You can resume from any stage without rerunning earlier steps.
- Debugging and Exploration: You can load a saved context and inspect it for errors or insights.
- A/B Testing and Iteration: Run multiple strategies or prompt versions against the same stored context.
This persistent, stage-wise approach makes co_ai robust, debuggable, and suitable for real scientific workflows.
Getting Started (Try It Locally)
You can run the full co_ai system locally in just a few steps:
1. **Clone the repo**
   git clone https://github.com/ernanhughes/co-ai.git
   cd co-ai
2. **Install dependencies**
```bash
pip install -r requirements.txt
```
3. **Start Ollama with a local model**
```bash
ollama run qwen:latest
```
4. **Install PostgreSQL**
   - Download from https://www.postgresql.org/download/windows/
     Use the installer to install PostgreSQL and pgAdmin.
     During setup, set:
     Username: postgres
     Password: (choose something and remember it)
   - Enable the pgvector extension.
     I explained this in a previous post:
     [PostgreSQL for AI: Storing and Searching Embeddings with pgvector](/post/pgvector/)
   - Create a database, for example `co_ai`.
   - In your `config.yaml`, change the db connection information to match your database:
```yaml
db:
  driver: psycopg2
  user: postgres
  password: yourpassword
  host: localhost
  port: 5432
  database: co_ai
```
   - Load the schema.sql file.
     At the root of the project there is a [schema.sql](https://github.com/ernanhughes/co-ai/blob/main/schema.sql) file.
     This will create the required tables to run the application in your database.
5. **Run the pipeline**
```bash
python -m co_ai.main goal="What happens if the USA defaults on its debt?"
```
---
### Example: Before and After Prompt Tuning
A real before/after example shows the benefit of prompt tuning more clearly than any description.
> **Goal:** The USA is about to default on its national debt.
**Original Prompt:**
```text
You are an expert researcher generating hypotheses...
```
Generated Hypotheses:
- Hypothesis 1: [text]
- Hypothesis 2: [text]
**After Refinement:**
```text
You are an economic strategist focused on sovereign default scenarios...
```
Improved Hypotheses:
- Hypothesis 1: [text]
- Hypothesis 2: [text]
This lets you see the real benefit of tuning.
---
## How to Extend `co_ai`
This system is fully modular; here are some ways to extend it:
- **Swap out Ollama for another LLM backend**
- **Use academic papers from Semantic Scholar instead of web search**
- **Add a "SimulationAgent" to test hypotheses**
- **Connect to LangChain tools or Autogen agents**
- **Use more advanced evaluation metrics (BLEU, ROUGE, etc.)**
You can also add your own agent by subclassing `BaseAgent` and defining a `run()` method, as the sketch below shows.
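Here is a sketch of what such a custom agent could look like. It follows the constructor and async run(context) pattern used by the agents above; the SimulationAgent class itself and its config keys are hypothetical.
```python
from co_ai.agents.base import BaseAgent

class SimulationAgent(BaseAgent):
    """Hypothetical agent that stress-tests each hypothesis with a simulated experiment."""
    def __init__(self, cfg, memory=None, logger=None):
        super().__init__(cfg, memory, logger)
        self.strategy = cfg.get("strategy", "thought_experiment")

    async def run(self, context: dict) -> dict:
        simulations = []
        for hypothesis in context.get("hypotheses", []):
            # Ask the LLM to simulate the proposed experiment and report the likely outcome.
            prompt = f"Simulate an experiment testing this hypothesis and report the outcome:\n{hypothesis}"
            simulations.append(self.call_llm(prompt, context))
        context["simulations"] = simulations
        return context
```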
Questions or Feedback?
This is an open-source research tool. If you'd like to:
- Contribute a new agent
- Report a bug
- Suggest improvements
- Use this in your lab or project
Feel free to open an issue or reach out at GitHub.
References
- Towards an AI Co-Scientist (Google DeepMind)
Appendix
Repo: github.com/ernanhughes/co-ai
Core Tools: DSPy, pgvector, Ollama, SearXNG search
Format: Local Python modules using Hydra and async I/O