Building a Self-Improving Chain-of-Thought Agent: Local LLMs Meet the CoT Encyclopedia

Most AI systems generate answers. Ours examines how they think. This isn’t just prompt engineering; this is structured reasoning at scale.
🔧 Summary
Large Language Models are transforming every field, yet their internal reasoning remains a formidable black box. We can get brilliant outputs, but without understanding how those conclusions were reached, we’re left guessing how to improve, debug, or even trust them. This opacity limits our ability to build truly reliable and self-improving AI systems.
What if we could not only generate answers, but also understand how those answers were formed, why certain reasoning paths worked better than others, and which strategies consistently led to high-quality outputs?
That’s exactly what the Chain-of-Thought Encyclopedia paper (arXiv:2505.10185) proposes.
And in this post, I’ll show you how I built a local-first implementation of that system, one that:
- Generates multiple chains of thought
- Evaluates and selects the best using MR.Q or LLM judges
- Classifies reasoning patterns via rubrics
- Stores everything for analysis and evolution
- And ultimately, learns from its own thinking
This isn’t just reproducing research; it’s building the foundation for AI agents that reason, reflect, and improve over time.
We didn’t want to just reproduce this work. We wanted to absorb it into our local, modular reasoning framework: one that works offline, supports multiple agents, evolves over time, and serves as the foundation for a self-improving assistant (Stephanie).
🔗 Introduction & Context
This is the third post in a 100-part series. We are targeting specific areas of AI and tackling them one targeted solution at a time.
In previous posts, we explored:
🔍 The sharpening mechanism: how to refine hypotheses without retraining
🧩 MR.Q: a lightweight framework for preference-based learning
🛠 Structured prompting: programming intelligence at both ends
🧠 co_ai: the framework we are using to build the components
Now, we’re taking it further.
We’re integrating these ideas into a full reasoning pipeline: one that doesn’t just generate responses, but understands how the model thinks.
So we rebuilt the CoT Encyclopedia pipeline using:
- Local LLMs (e.g., Qwen, Mistral via Ollama)
- Our own co_ai agent system
- Rubric-based reasoning classification
- Dynamic evaluators (MRQ, LLM-based judges)
- Embedded reasoning patterns and visualization-ready storage
🧭 Why This Ties Into Our Broader Mission
This seemingly focused project is, in fact, a crucial step towards our grander vision:
🧠 Build an agent that can generate, evaluate, and refine its own reasoning, and eventually teach other models to reason better.
It’s not about one model output. It’s about:
- Structured chains of thought
- Reasoning style diversity
- Self-evaluation and tuning
- A library of patterns: a local CoT Encyclopedia that grows over time
We believe reasoning is programmable, trainable, and auditable, and this project makes that belief concrete.
🎯 Why We Need Classification and Evaluation of Reasoning Patterns
1. Reasoning Is the Core Competency of AI Assistants
Your AI system isn’t just a chatbot. It’s:
- A research assistant
- A planner
- A strategist
- A scientific co-author
To do these well, it needs more than fluent language; it needs structured reasoning. But CoT reasoning can look wildly different depending on:
- The model
- The prompt
- The task
- The domain
You need a system that can tell what kind of reasoning is happening, not just whether an answer looks good.
2. Without Structure, There’s No Feedback Loop
You can’t improve what you don’t understand.
If your agent generates 5 hypotheses for a goal, how do you know:
- Which one is the most rigorous?
- Which reasoning styles tend to perform well for this goal type?
- How to refine the reasoning process (not just the result)?
Rubric-based classification and evaluators (like MRQ or LLM judges) give you:
- A labelled reasoning profile
- A way to track trends
- A mechanism to train models (or strategies) that adapt
3. It Turns Reasoning Into a First-Class Citizen
Most CoT pipelines stop at generating one “good enough” answer. This project treats reasoning as something:
- You can analyze
- You can store
- You can cluster
- You can select and improve over time
That’s what lets you build:
- Strategy-aware agents
- Goal-type to CoT-profile routing
- Self-improving hypothesis engines
🧠 System Overview: A Modular Agent Pipeline for Reasoning
💬 Chain-of-Thought Generator Agent with Dual Evaluation
```mermaid
flowchart TD
    A[User Goal] --> B[Prompt Loader]
    B --> C[Generate Multiple Candidates 2–3 chains of thought]
    subgraph Candidate Generation
        C --> D1[Call LLM for Hypothesis 1]
        C --> D2[Call LLM for Hypothesis 2]
        C --> D3[Call LLM for Hypothesis N]
    end
    D1 --> E[Evaluator Selection]
    D2 --> E
    D3 --> E
    subgraph Evaluator Module
        direction LR
        E --> F{Use MR.Q?}
        F -- Yes --> G[MRQSelfEvaluator<br>→ Compare Embeddings<br>→ Score via Value Net]
        F -- No --> H[LLMJudgeEvaluator<br>→ Prompt-based Judgment<br>→ Rank Outputs]
    end
    G --> I[Select Best Output]
    H --> I
    I --> J[Classify Reasoning Style Rubrics: Logical Flow, Evidence-Based, etc.]
    J --> K[Store Best Hypothesis + Confidence Score + Pattern]
    K --> L[Log Everything for Future Learning & Evolution]
    style A fill:#f9f,stroke:#333
    style I fill:#ffdd00,stroke:#333
    style J fill:#c9f,stroke:#333
    style K fill:#6cf,stroke:#333
```
🔹 Step 1: Candidate Generation (ChainOfThoughtGeneratorAgent)

```yaml
goal:
  text: "Will AI ever be able to reprogram itself?"
  type: research  # Options: math, science, commonsense, factoid, ethical, policy, planning, creative, other, research
```
We begin with a goal: a user prompt or scientific question. Every goal is annotated not only with its text, but also with a goal type such as:
- `math`
- `science`
- `commonsense`
- `planning`
- `creative`
- `research`
This was a key addition to our implementation, enabling us to analyze which reasoning styles work best for different types of tasks, just like the original paper did.
The CoT generator agent then:
- Loads a structured prompt template (Jinja-powered)
- Generates multiple reasoning candidates using a local LLM (e.g., Qwen via Ollama)
- Each candidate is a chain of thought: a natural-language reasoning path
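To make the flow concrete, here is a minimal sketch of what this generation step might look like, assuming a Jinja2 template and an Ollama-hosted model called through litellm. The function name and defaults are illustrative, not the exact co_ai API.

```python
# Illustrative sketch of candidate generation (not the exact co_ai implementation).
import litellm
from jinja2 import Template

def generate_candidates(goal: dict, template_text: str, n: int = 3,
                        model: str = "ollama/qwen3",
                        api_base: str = "http://localhost:11434") -> list[str]:
    """Render the CoT prompt for a goal and sample several reasoning chains."""
    prompt = Template(template_text).render(goal=goal)
    candidates = []
    for _ in range(n):
        response = litellm.completion(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            api_base=api_base,
            temperature=0.8,  # a little randomness so the chains differ
        )
        candidates.append(response["choices"][0]["message"]["content"])
    return candidates

# Example usage (goal dict mirrors the YAML above):
# chains = generate_candidates(
#     {"text": "Will AI ever be able to reprogram itself?", "type": "research"},
#     open("prompts/generate_cot.txt").read(),
# )
```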
🔹 Step 2: Evaluation (MRQ or LLM Judge)
To choose the best candidate, we support two modes of evaluation:
- MRQ Evaluator: A self-supervised scoring model trained over time using embedding-based differences between prompt + hypothesis pairs.
- LLM Judge: A structured prompt sent to an LLM that compares the candidates and outputs a preferred one along with justification and optional confidence scores.
This mirrors the paper’s use of human preference signals, but keeps it local, reproducible, and pluggable.
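As a rough illustration, pairwise LLM judging can be as simple as the sketch below. The real LLMJudgeEvaluator renders the evaluation.txt Jinja template and parses a richer verdict, so treat this prompt and parsing as placeholders.

```python
# Simplified pairwise judge; prompt wording and parsing are stand-ins.
import litellm

JUDGE_PROMPT = """You are comparing two chains of thought for the goal below.
Goal: {goal}

Candidate A:
{a}

Candidate B:
{b}

Which candidate reasons better? Answer with exactly "A" or "B", then one sentence of justification."""

def judge_pair(goal: str, a: str, b: str,
               model: str = "ollama/mistral:7b-instruct",
               api_base: str = "http://localhost:11434") -> str:
    response = litellm.completion(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(goal=goal, a=a, b=b)}],
        api_base=api_base,
    )
    verdict = response["choices"][0]["message"]["content"].strip()
    return a if verdict.upper().startswith("A") else b  # default to B otherwise
```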
🔹 Step 3: Rubric-Based Reasoning Classification
After selecting the best chain of thought, we classify it using a structured set of rubrics:
- Is it deductive or analogical?
- Does it show shallow or deep reasoning?
- Is it belief-driven or evidence-driven?
These rubrics are defined in a config file, then executed as prompt templates that analyze the CoT through LLM reflection.
The output is a structured reasoning pattern: a fingerprint of the model’s reasoning behavior for this goal.
🔹 Step 4: Logging and Storage
Each run records:
- The goal and goal type
- The CoT candidates and chosen output
- The rubric pattern
- Evaluation scores
- Model and agent metadata
Everything is stored in a structured, queryable format using PostgreSQL + JSON fields. This supports:
- Large-scale analysis of reasoning strategies
- Per-goal-type aggregation
- Strategy-aware filtering and model tuning
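A simplified version of that storage step, assuming a hypothetical cot_runs table with JSONB columns (the project’s actual schema and table names differ), might look like this:

```python
# Hedged storage sketch: one row per run, JSON fields for the flexible parts.
import json
import psycopg2

def store_run(conn, goal: dict, candidates: list[str], best: str,
              pattern: dict, scores: dict, metadata: dict) -> None:
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO cot_runs (goal_text, goal_type, candidates, best,
                                  pattern, scores, metadata)
            VALUES (%s, %s, %s, %s, %s, %s, %s)
            """,
            (goal["text"], goal["type"], json.dumps(candidates), best,
             json.dumps(pattern), json.dumps(scores), json.dumps(metadata)),
        )
    conn.commit()

# conn = psycopg2.connect("dbname=co_ai user=postgres")
```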
🔹 Step 5: Pattern Embedding and Clustering
To explore the structure of the reasoning space, we embed:
- The rubric patterns
- Optionally the CoT texts themselves
We then cluster and visualize these embeddings to uncover:
- Common strategy clusters
- How different models behave
- Which strategies dominate in different goal types
This replicates the CoT Encyclopedia’s “strategy space” visualizations and gives us a tool for dynamic reasoning analysis.
🧬 Why This Agent Uses 3 Different Models
One of the powerful design decisions in this system is to separate the models by role. Instead of relying on a single LLM to generate, evaluate, and analyze reasoning chains, we use three specialized model configurations:
Model Role | Purpose | Config Key | Example Model |
---|---|---|---|
🧠 Reasoning Model | Generates candidate CoTs | `model` | `ollama/qwen3` |
🧪 Evaluator Model | Compares two CoTs and picks the better one | `evaluator_model` | `ollama/mistral:7b-instruct` |
🧾 Analysis Model | Classifies the winning CoT using rubric prompts | `analysis_model` | `ollama/gemma3` |
This lets you:
- Optimize cost by using smaller models for evaluation/classification
- Swap in higher-quality models for critical tasks like judging
- Experiment independently with reasoning vs evaluation logic
🧩 Rubrics, Reasoning Styles, and Strategic Diversity
A single answer can look good. But two answers can look equally good and be based on totally different reasoning strategies.
This is what the CoT Encyclopedia paper highlights so clearly: we need to go beyond correctness and start paying attention to how models reason, not just what they say.
That’s where rubrics come in.
🎯 What Are Rubrics?
Rubrics are structured criteria for analyzing reasoning patterns. Each one asks a question about the nature of a model’s thought process:
Dimension | Rubric Prompt | Options |
---|---|---|
Inference Style | Is the reasoning based on deduction or analogy? | Deductive / Analogical |
Reasoning Depth | Does the reasoning go deep with multiple steps, or stay surface-level? | Deep / Shallow |
Strategy Orientation | Does the reasoning start from a hypothesis or from evidence? | Top-Down / Bottom-Up |
Evidence Use | Is the argument belief-driven or guided by data? | Belief / Evidence |
Each rubric is configurable in YAML. You can easily add, remove, or disable dimensions. This means you can fine-tune what “good reasoning” means for your domain, whether you’re working on policy, science, ethics, or creative writing.
🧠 How We Use Rubrics
For every selected hypothesis (CoT), the system:
- Loads a Jinja prompt for the dimension (e.g., “Is this reasoning deductive or analogical?”)
- Sends the prompt to a local LLM
- Extracts the classified label
- Stores all dimension-label pairs in the database
This results in a pattern fingerprint for each chain of thought: a structured representation of its reasoning strategy.
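Sketched in code, the classification loop looks roughly like this. The template, model, and label-matching logic are simplified stand-ins for the configured cot_pattern.txt prompt and analysis model.

```python
# Simplified rubric-classification loop (names and template are illustrative).
import litellm
from jinja2 import Template

RUBRIC_TEMPLATE = Template(
    "Goal: {{ goal }}\n\nChain of thought:\n{{ cot }}\n\n"
    "{{ rubric }}\nAnswer with one of: {{ options | join(', ') }}."
)

def classify_pattern(goal: str, cot: str, rubrics: list[dict],
                     model: str = "ollama/mistral:7b-instruct",
                     api_base: str = "http://localhost:11434") -> dict:
    pattern = {}
    for r in rubrics:
        if not r.get("enabled", True):
            continue  # disabled rubrics are skipped
        prompt = RUBRIC_TEMPLATE.render(goal=goal, cot=cot,
                                        rubric=r["rubric"], options=r["options"])
        response = litellm.completion(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            api_base=api_base,
        )
        label = response["choices"][0]["message"]["content"].strip()
        # Snap the free-text answer to the closest allowed option
        pattern[r["dimension"]] = next(
            (opt for opt in r["options"] if opt.lower() in label.lower()),
            label,
        )
    return pattern
```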
📦 What This Enables
By labeling thousands of CoTs this way, we can:
- Cluster similar reasoning styles
- Compare models (e.g., does Mistral prefer shallow strategies while Qwen favors deep ones?)
- Track strategy diversity (e.g., how many unique patterns a model can produce)
- Link patterns to goal types (e.g., planning goals often require top-down strategies)
This turns reasoning into a dataset, and that unlocks visualizations, comparisons, and even prompt tuning.
📌 A Small Example
Here’s a real classification from our system:
🧾 Goal: Will AI ever be able to reprogram itself?
🧠 Hypothesis: [full chain of thought output]
🔎 Pattern:
- Inference Style: Analogical
- Reasoning Depth: Deep
- Strategy Orientation: Top-Down
- Evidence Use: Belief-Driven
This shows the style behind the substance, which we can now analyze, compare, and evolve over time.
🧭 Embedding and Exploring the Strategy Space
Once we’ve classified thousands of reasoning traces using rubrics, we’re left with something powerful: a library of labeled thoughts.
But that’s not just a record; it’s a map waiting to be drawn.
To explore the “strategy space” of reasoning, we transform each classified CoT into a vector embedding, and that lets us visualize, compare, and cluster reasoning styles in ways that go far beyond qualitative analysis.
🔢 How Embedding Works
For each hypothesis (chain of thought), we:
- Generate a text summary of its pattern, e.g.: `Inference Style: Analogical; Reasoning Depth: Deep; Strategy Orientation: Top-Down`
- Combine that with the hypothesis text itself, if desired: `"Hypothesis text here..." // Pattern: Analogical, Deep, Top-Down`
- Pass the combined text to our `embedding_store`, which uses a local embedding model (e.g., `bge`, `e5`, etc.) to get a vector representation
- Store this embedding in a new table, `cot_embeddings`, alongside the goal, model, and pattern metadata
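In the project this goes through the `embedding_store`; the stripped-down version below uses a local sentence-transformers model instead (the checkpoint name is just an example) to show the shape of the data.

```python
# Hedged embedding sketch; the real pipeline uses its own embedding_store.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # any local embedder works

def embed_pattern(cot_text: str, pattern: dict, include_cot: bool = True):
    """Turn a rubric pattern (and optionally the CoT text) into a vector."""
    summary = "; ".join(f"{dim}: {label}" for dim, label in pattern.items())
    text = f"{cot_text} // Pattern: {summary}" if include_cot else summary
    return model.encode(text)  # 1-D numpy vector, ready for cot_embeddings
```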
🧠 What This Enables
With these embeddings in place, we can:
🔹 Visualize the reasoning landscape
Using UMAP or t-SNE, we reduce the high-dimensional embeddings to 2D and plot:
- Reasoning clusters
- Per-model or per-goal-type distributions
- Color-coded dimensions (e.g., analogical vs deductive)
🔹 Cluster reasoning styles
Using HDBSCAN or KMeans, we group similar reasoning strategies and:
- Identify dominant “modes” of thought
- Compare model diversity
- Track style drift over time or between tasks
🔹 Label clusters with LLMs
For each cluster, we sample 5–10 CoTs and prompt an LLM:
“What reasoning strategy is common to the following chains of thought?”
This gives us human-readable strategy names, just like the paper’s “cluster archetypes.”
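Putting these steps together, a minimal cluster-and-sample sketch using UMAP and HDBSCAN might look like this; the parameter values are illustrative, not tuned settings from the project.

```python
# Reduce, cluster, and pull examples for LLM naming (illustrative parameters).
import numpy as np
import umap
import hdbscan

def cluster_patterns(embeddings: np.ndarray):
    # Reduce to 2D for plotting, then cluster in the reduced space
    coords = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
    labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(coords)
    return coords, labels  # labels == -1 marks noise points

def sample_for_naming(cots: list[str], labels, cluster_id: int, k: int = 5) -> list[str]:
    # A few CoTs from one cluster to feed the "what strategy is common?" prompt
    members = [c for c, l in zip(cots, labels) if l == cluster_id]
    return members[:k]
```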
📊 Example Use Cases
Use Case | What You Learn |
---|---|
Model comparison | Does Qwen favor exhaustive reasoning more than Gemma? |
Goal-type matching | Do planning goals correlate with bottom-up strategies? |
Strategy analysis | Are most hypotheses clustered around 3–4 dominant patterns? |
Prompt refinement | Are certain prompt templates pulling reasoning toward shallow clusters? |
This lets us treat reasoning like a dynamic system: one we can observe, debug, and tune just like software.
You’re not just evaluating answers. You’re understanding thought.
⚙️ Configuring the Chain-of-Thought Agent
One of the most powerful aspects of this system is that every agent, including the CoT generator, is configured declaratively via YAML. This lets us:
- Swap models easily (local vs remote)
- Switch evaluation strategies (MRQ or LLM)
- Adjust training, logging, and storage behavior
- Control rubric classification dimensions
- Enable or disable features per run
Here’s a breakdown of the key fields in our `cot_generator` config file:
🧠 Core Identity and Control
```yaml
name: cot_generator
enabled: true
save_prompt: true
save_context: false
skip_if_completed: false
```
- `enabled`: Controls whether this agent is run in the pipeline.
- `save_prompt`: Persists the prompt and model response in the database for traceability.
- `skip_if_completed`: If `true`, skips execution if an output already exists for this goal.
🤖 Model Settings
```yaml
model:
  name: ollama/qwen3
  api_base: http://localhost:11434
```
- This specifies the reasoning model used to generate CoTs.
- Any Ollama-hosted local model can be used here (Qwen, Mistral, Gemma, etc.).
🧪 Evaluation Strategy
```yaml
evaluator: llm  # or 'mrq' if sufficient training data exists
evaluator_model:
  name: ollama/mistral:7b-instruct
evaluator_prompt_file: evaluation.txt
```
- `evaluator`: Chooses between MRQ (embedding-based) or LLM judge (prompt-based).
- `evaluator_model`: Used only for LLM judging.
- `evaluator_prompt_file`: The Jinja prompt (`evaluation.txt`) used to compare candidates.
This design lets the agent self-assess the quality of its outputs, and fall back to a robust LLM-based approach when MRQ doesn’t have enough training data.
🔍 Analysis Model
```yaml
analysis_model:
  name: ollama/mistral:7b-instruct
```
Used during rubric-based reasoning pattern classification. It is kept separate from the generation model, so you can use a model with stronger reflection abilities if desired.
📋 Prompt Configuration
```yaml
prompt_mode: file
prompt_file: generate_cot.txt
pattern_prompt_file: cot_pattern.txt
remove_think: false
```
- `prompt_file`: The base prompt used to generate CoTs
- `pattern_prompt_file`: Used to classify reasoning styles across rubrics
- `remove_think`: If `true`, strips `<think>...</think>` blocks from model output; here we leave them in for introspection.
🧠 Training Parameters (MRQ Evaluator Only)
```yaml
device: cpu
limit: 1000
epochs: 20
patience: 3
min_delta: 0.0001
```
These control the MRQ evaluator’s self-supervised training loop, tuning its value predictor to prefer better reasoning chains.
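As a rough sketch of how those knobs interact (the real MRQSelfEvaluator has its own trainer and value network), an early-stopping loop driven by `epochs`, `patience`, and `min_delta` looks like this:

```python
# Illustrative early-stopping loop; train_step and val_loss_fn are placeholders.
def train_value_net(model, train_step, val_loss_fn, cfg: dict) -> None:
    best_loss, stale = float("inf"), 0
    for epoch in range(cfg["epochs"]):
        train_step(model)              # one pass over up to cfg["limit"] preference pairs
        loss = val_loss_fn(model)      # ranking loss on held-out pairs
        if best_loss - loss > cfg["min_delta"]:
            best_loss, stale = loss, 0  # meaningful improvement, keep going
        else:
            stale += 1                  # no real progress this epoch
        if stale >= cfg["patience"]:
            break                       # stop early
```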
🧪 Rubric Classification
```yaml
rubrics:
  - dimension: "Strategy Orientation"
    rubric: "Does the reasoning proceed in a hypothesis-first (top-down) or data-first (bottom-up) manner?"
    options: ["Top-Down", "Bottom-Up"]
    enabled: true
  # ...
```
- Each rubric defines a dimension of reasoning (e.g., depth, inference style).
- The system prompts an LLM to classify each CoT according to these.
- Disabled rubrics are ignored, enabling easy customization per run.
This structure aligns with the CoT Encyclopedia paper and allows future tools (like clustering or filtering) to use reasoning as structured data.
✅ Why This Matters
This configuration file makes the agent:
- Transparent
- Reproducible
- Extensible
You can:
- Run experiments with different models and evaluators
- Analyze how reasoning patterns shift by rubric
- Track strategy changes over time or across tasks
And because it’s YAML, every experiment is version-controllable and explainable.
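For example, reproducing a run can be as simple as loading the exact YAML that produced it. The path below is illustrative, and OmegaConf (which Hydra builds on) is just one way to read it.

```python
# Load the agent config that produced a run; the path is an assumption.
from omegaconf import OmegaConf

cfg = OmegaConf.load("config/agents/cot_generator.yaml")
print(cfg.model.name)    # e.g. ollama/qwen3
print(cfg.evaluator)     # llm or mrq
print([r.dimension for r in cfg.rubrics if r.enabled])  # active rubric dimensions
```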
✅ Implementation Checklist: What We Built from the CoT Encyclopedia Paper
Paper Component | Description | Implemented? | Notes |
---|---|---|---|
Multi-CoT Generation | Generate multiple chains of thought per goal using a reasoning model | ✅ | Done via ChainOfThoughtGeneratorAgent with local LLMs (Ollama) |
Candidate Evaluation | Select best output using pairwise or tournament evaluation | ✅ | MRQ and LLM Judge evaluators both supported |
LLM-Based Preference Judging | Use another LLM to decide between CoTs | ✅ | LLMJudgeEvaluator uses evaluation.txt prompt |
Rubric-Based Classification | Label reasoning across 10+ dimensions (e.g., depth, inference style) | ✅ | Configurable YAML rubrics; results logged and stored |
Pattern Storage and Analysis | Save CoT patterns with metadata (goal, model, score) | ✅ | Stored via cot_patterns table; includes embeddings |
Goal Type Annotation | Label goals by task type (e.g., math, science, commonsense) | ✅ | Used to route and analyze strategies per goal type |
Pattern Embedding | Embed CoT + rubric data for clustering | ✅ | Integrated via RubricClusterer and vector DB |
Cluster Analysis | Group similar reasoning styles via embeddings | ✅ | Cluster summaries logged; clustering done on run |
Pattern Diversity Metrics | Count unique patterns and measure strategy spread | ⚠️ Partial | Clustering supports this; summary stats to be expanded |
Model Comparison | Compare reasoning styles between different LLMs | ✅ | Supported via config-level model switching |
Strategy-to-Goal Insights | Link certain rubrics to specific goal types | ✅ | Enabled via goal.type + rubric logs |
Human-Readable Strategy Labels | Assign names to clusters (e.g., “careful planner”) | ⚠️ Optional | Can be added with LLM summarization of cluster samples |
Self-Improvement Loop | Use evaluations to retrain or refine reasoning prompts | ✅ | MRQ supports tuning; DSPy version planned for feedback learning |
DSPy Integration (Optional) | Use programmatic, structured prompting | ✅ | ChainOfThoughtDSPyGeneratorAgent supports this natively |
📚 References
- Wang, Y., Radhakrishna, A., Chi, E., & Lee, P. (2024). The Chain-of-Thought Encyclopedia: Mapping Reasoning Strategies in Language Models. arXiv:2505.10185
- Arora, S., Zhang, W., Xiong, C., et al. DSPy: A Library for Declarative Structured Prompting. GitHub – Stanford DSPy
- MRQ: Model-Relative Quality Evaluator. Inspired by self-supervised evaluation strategies using embedding distance and preference comparison. Implementation adapted from Sharpening Language Models with Self-Evaluation (work in progress).
- Ollama – Local LLM runner supporting models like Qwen, Mistral, and Gemma. https://ollama.com
- BGE / E5 / MTEB – Embedding models for text similarity and clustering. MTEB: Massive Text Embedding Benchmark
- Hydra Config System – Flexible configuration management for ML pipelines. https://hydra.cc
🛠️ Code and Project Repository
The full implementation of this Chain-of-Thought reasoning system — including multi-agent pipelines, rubric classification, MRQ evaluation, DSPy integration, and strategy analysis — is available on GitHub:
🔗 View the Project on GitHub →
This repository includes:
- ✅ Local model integration via Ollama
- ✅ Configurable agent-based reasoning pipelines
- ✅ Rubric classification and CoT clustering
- ✅ MRQ and LLM-based evaluation support
- ✅ Structured prompt templates and analysis modules
- ✅ Example configs and scripts for running and tuning
📬 Contributions Welcome
This project is open source and actively evolving. If you’re working on:
- Reasoning systems
- CoT evaluation
- LLM orchestration
- Local-first tooling
…we’d love your feedback, use cases, and pull requests!
📖 Glossary of Key Terms
Term | Definition |
---|---|
Chain of Thought (CoT) | A sequence of reasoning steps used by a language model to arrive at an answer. Often involves natural-language explanation. |
MRQ (Model-Relative Quality) | A lightweight evaluator that compares hypotheses based on learned value differences using embeddings and simple neural scoring. |
LLM Judge | An evaluation mechanism where a language model compares two responses and selects the better one using a structured prompt. |
Rubric | A structured criterion or dimension used to classify how a reasoning chain behaves (e.g., depth, style, orientation). |
CoT Encyclopedia | A research framework that maps and analyzes diverse reasoning strategies in language models by generating, evaluating, and clustering chains of thought. |
Prompt Template | A predefined structure, often written using Jinja2, that guides the language model to perform specific tasks or respond in a specific format. |
Evaluator | A module that scores or ranks different chains of thought, either via training (MRQ) or inference (LLM-based judgment). |
Pattern Embedding | The process of converting a classified CoT reasoning pattern into a vector for clustering or similarity search. |
Goal Type | A semantic category assigned to a task prompt, such as `math`, `science`, `commonsense`, or `policy`. |
DSPy | A declarative Python library for structured prompting and modular reasoning pipelines, developed by Stanford NLP. |
Self-Improving Agent | An AI agent that iteratively generates, evaluates, classifies, and refines its own reasoning over time. |
Ollama | A local language model runner that supports models like Qwen, Mistral, and Gemma via a simple REST API. |
Rubric Clusterer | A module that embeds and clusters rubric-based reasoning patterns to identify strategy archetypes or diversity. |
🚀 Applications and Extensions
Building the CoT Encyclopedia locally wasn’t just a replication project; it laid the foundation for next-generation AI agents that can reason, reflect, and improve over time.
Here are just a few of the applications this enables:
🔁 1. Self-Improving Reasoning Agents
With MRQ or LLM-based evaluators, we don’t just generate answers; we compare, refine, and learn from them.
Imagine an AI that gets smarter not just by consuming more data, but by actively reflecting on and optimizing its own thought processes.
By recording which chains of thought are preferred (and why), the system can:
- Tune prompt templates
- Adapt strategies to different goal types
- Train evaluators on real performance feedback
This is reasoning as a feedback loop: a pathway to self-improving agents.
🧠 2. Goal-Type to Strategy Routing
Every goal is tagged with a type (e.g., `math`, `planning`, `ethics`). Because we now track reasoning patterns per goal type, we can:
- Learn which reasoning strategies work best for each type
- Route future goals to different agents or prompt styles based on their type
- Even suggest goal-type–specific templates dynamically
This supports DOTS-style planning and strategy-aware agent orchestration: systems that don’t just reason, but reason intentionally.
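As a hedged illustration, such routing could start as nothing more than a mapping from goal type to the prompt template and evaluator that have performed best so far. The file names and choices below are made up.

```python
# Illustrative routing table; entries would be learned from stored run data.
ROUTES = {
    "math":     {"prompt_file": "generate_cot_stepwise.txt", "evaluator": "mrq"},
    "planning": {"prompt_file": "generate_cot_topdown.txt",  "evaluator": "llm"},
    "research": {"prompt_file": "generate_cot.txt",          "evaluator": "llm"},
}

def route(goal: dict) -> dict:
    # Fall back to the default template for unseen goal types
    return ROUTES.get(goal.get("type", "other"),
                      {"prompt_file": "generate_cot.txt", "evaluator": "llm"})
```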
🔍 3. Hypothesis Search and Filtering
With reasoning patterns stored as structured data, we can now:
- Search for chains of thought with specific strategies (e.g., “Show me all top-down, deductive answers for policy goals”)
- Retrieve reasoning by model, rubric match, or evaluation score
- Build dashboards or tools for interactive CoT exploration
This turns the system into a living CoT Encyclopedia, not just a pipeline.
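For example, assuming the hypothetical cot_runs table and JSONB pattern column from the storage sketch earlier, strategy-aware retrieval is a single query:

```python
# Strategy-aware retrieval sketch over the assumed cot_runs schema.
import psycopg2

QUERY = """
SELECT goal_text, best
FROM cot_runs
WHERE goal_type = %s
  AND pattern->>'Strategy Orientation' = 'Top-Down'
  AND pattern->>'Inference Style' = 'Deductive'
ORDER BY (scores->>'value')::float DESC
LIMIT 10;
"""

def top_down_deductive(conn, goal_type: str = "policy"):
    with conn.cursor() as cur:
        cur.execute(QUERY, (goal_type,))
        return cur.fetchall()
```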
🛠 4. Plug-and-Play Evaluation Backends
Thanks to modular evaluator support, you can swap between:
- MRQ (efficient, trainable, local)
- LLM judge (rich, pairwise, interpretable)
- Future scoring systems (RLHF signals, task-specific verifiers, etc.)
This lets users configure how “reasoning quality” is defined, depending on the application: research, creativity, engineering, etc.
This config entry determines which evaluator is used:

```yaml
evaluator: llm  # 'mrq' or 'llm'; there may not be enough training items for mrq
```

```python
def _init_evaluator(self):
    if self.cfg.get("evaluator", "mrq") == "llm":
        return LLMJudgeEvaluator(...)
    else:
        return MRQSelfEvaluator(...)
```
In an upcoming post I will show a direct comparison between MRQ and other forms of evaluation.
🔮 5. Research-Grade Transparency
Every decision, every prompt, classification, and evaluation is stored with:
- Timestamps
- Agents
- Models
- Strategies used
This enables:
- Scientific reproducibility
- Transparent debugging
- Long-term tracking of reasoning trends
You’re not just building a product; you’re building an auditable reasoning engine.
For reference, here is the shared `call_llm` helper that every agent uses to call the model, log the prompt, and clean the output:

```python
def call_llm(self, prompt: str, context: dict, llm_cfg: dict = None) -> str:
    """Call the default or custom LLM, log the prompt, and handle output."""
    props = llm_cfg or self.llm  # Use passed-in config or default
    messages = [{"role": "user", "content": prompt}]
    try:
        response = litellm.completion(
            model=props[NAME],
            messages=messages,
            api_base=props[API_BASE],
            api_key=props.get(API_KEY, ""),
        )
        output = response["choices"][0]["message"]["content"]

        # Save prompt and response if enabled
        if self.cfg.get(SAVE_PROMPT, False) and self.memory:
            self.memory.prompt.save(
                context.get("goal"),
                agent_name=self.name,
                prompt_key=self.cfg.get(PROMPT_PATH, ""),
                prompt_text=prompt,
                response=output,
                strategy=self.cfg.get(STRATEGY, ""),
                version=self.cfg.get("version", 1),
            )

        # Remove <think> blocks if configured
        response_cleaned = remove_think_blocks(output) if self.remove_think else output

        # Optionally add to context history
        if self.cfg.get("add_prompt_to_history", True):
            self.add_to_prompt_history(context, prompt, {"response": response_cleaned})

        return response_cleaned
    except Exception as e:
        print(f"❌ Exception: {type(e).__name__}: {e}")
        self.logger.log("LLMCallError", {"exception": str(e)})
        raise
```
🧾 Conclusion and Future Work
We set out to rebuild the Chain-of-Thought Encyclopedia not just to replicate the paper, but to make its ideas usable, local, and extensible.
What we ended up with is more than a reproduction.
It’s a system that:
- Generates multiple reasoning paths for a goal
- Evaluates and selects the best one
- Classifies it across structured rubrics
- Embeds and clusters reasoning patterns
- Stores everything for analysis and tuning
- Learns from its own outputs over time
All of it powered by local LLMs, structured prompt templates, and a modular agent framework that grows as the reasoning tasks evolve.