Building a Self-Improving Chain-of-Thought Agent: Local LLMs Meet the CoT Encyclopedia

Most AI systems generate answers. Ours examines how they think. This isn’t just prompt engineering; this is structured reasoning at scale.
🔧 Summary
Large Language Models are transforming every field, yet their internal reasoning remains a formidable black box. We can get brilliant outputs, but without understanding how those conclusions were reached, we’re left guessing how to improve, debug, or even trust them. This opacity limits our ability to build truly reliable and self-improving AI systems.
What if we could not only generate answers, but also understand how those answers were formed, why certain reasoning paths worked better than others, and which strategies consistently led to high-quality outputs?
That’s exactly what the Chain-of-Thought Encyclopedia paper (arXiv:2505.10185) proposes.
And in this post, I’ll show you how I built a local-first implementation of that system, one that:
- Generates multiple chains of thought
- Evaluates and selects the best using MR.Q or LLM judges
- Classifies reasoning patterns via rubrics
- Stores everything for analysis and evolution
- And ultimately, learns from its own thinking
This isn’t just reproducing research; it’s building the foundation for AI agents that reason, reflect, and improve over time.
We didn’t want to just reproduce this work. We wanted to absorb it into our local, modular reasoning framework: one that works offline, supports multiple agents, evolves over time, and serves as the foundation for a self-improving assistant (Stephanie).
🔗 Introduction & Context
This is the third post in a 100-part series. We are targeting specific areas of AI and tackling them one targeted solution at a time.
In previous posts, we explored:
🔍 The sharpening mechanism: how to refine hypotheses without retraining
🧩 MR.Q: a lightweight framework for preference-based learning
🛠 Structured prompting: programming intelligence at both ends
🧠 co_ai: the framework we are using to build the components
Now, we’re taking it further.
We’re integrating these ideas into a full reasoning pipeline: one that doesn’t just generate responses, but understands how the model thinks.
So we rebuilt the CoT Encyclopedia pipeline using:
- Local LLMs (e.g., Qwen, Mistral via Ollama)
- Our own co_ai agent system
- Rubric-based reasoning classification
- Dynamic evaluators (MRQ, LLM-based judges)
- Embedded reasoning patterns and visualization-ready storage
🧭 Why This Ties Into Our Broader Mission
This seemingly focused project is, in fact, a crucial step towards our grander vision:
🧠 Build an agent that can generate, evaluate, and refine its own reasoning, and eventually teach other models to reason better.
It’s not about one model output. It’s about:
- Structured chains of thought
- Reasoning style diversity
- Self-evaluation and tuning
- A library of patterns: a local CoT Encyclopedia that grows over time
We believe reasoning is programmable, trainable, and auditable, and this project makes that belief concrete.
🎯 Why We Need Classification and Evaluation of Reasoning Patterns
1. Reasoning Is the Core Competency of AI Assistants
Your AI system isn’t just a chatbot. It’s:
- A research assistant
- A planner
- A strategist
- A scientific co-author
To do these well, it needs more than fluent language; it needs structured reasoning. But CoT reasoning can look wildly different depending on:
- The model
- The prompt
- The task
- The domain
You need a system that can tell what kind of reasoning is happening, not just whether an answer looks good.
2. Without Structure, There’s No Feedback Loop
You can’t improve what you don’t understand.
If your agent generates 5 hypotheses for a goal, how do you know:
- Which one is the most rigorous?
- Which reasoning styles tend to perform well for this goal type?
- How to refine the reasoning process (not just the result)?
Rubric-based classification and evaluators (like MRQ or LLM judges) give you:
- A labelled reasoning profile
- A way to track trends
- A mechanism to train models (or strategies) that adapt
3. It Turns Reasoning Into a First-Class Citizen
Most CoT pipelines stop at generating one “good enough” answer. This project treats reasoning as something:
- You can analyze
- You can store
- You can cluster
- You can select and improve over time
That’s what lets you build:
- Strategy-aware agents
- Goal-type to CoT-profile routing
- Self-improving hypothesis engines
🧠 System Overview: A Modular Agent Pipeline for Reasoning
💬 Chain-of-Thought Generator Agent with Dual Evaluation
```mermaid
flowchart TD
    A[User Goal] --> B[Prompt Loader]
    B --> C[Generate Multiple Candidates 2–3 chains of thought]
    subgraph Candidate Generation
        C --> D1[Call LLM for Hypothesis 1]
        C --> D2[Call LLM for Hypothesis 2]
        C --> D3[Call LLM for Hypothesis N]
    end
    D1 --> E[Evaluator Selection]
    D2 --> E
    D3 --> E
    subgraph Evaluator Module
        direction LR
        E --> F{Use MR.Q?}
        F -- Yes --> G[MRQSelfEvaluator<br>→ Compare Embeddings<br>→ Score via Value Net]
        F -- No --> H[LLMJudgeEvaluator<br>→ Prompt-based Judgment<br>→ Rank Outputs]
    end
    G --> I[Select Best Output]
    H --> I
    I --> J[Classify Reasoning Style Rubrics: Logical Flow, Evidence-Based, etc.]
    J --> K[Store Best Hypothesis + Confidence Score + Pattern]
    K --> L[Log Everything for Future Learning & Evolution]
    style A fill:#f9f,stroke:#333
    style I fill:#ffdd00,stroke:#333
    style J fill:#c9f,stroke:#333
    style K fill:#6cf,stroke:#333
```
🔹 Step 1: Candidate Generation (ChainOfThoughtGeneratorAgent)

```yaml
goal:
  text: "Will AI ever be able to reprogram itself?"
  type: research  # Options: math, science, commonsense, factoid, ethical, policy, planning, creative, other, research
```
We begin with a goal: a user prompt or scientific question. Every goal is annotated not only with its text, but also with a goal type such as:
- `math`
- `science`
- `commonsense`
- `planning`
- `creative`
- `research`
This was a key addition to our implementation, enabling us to analyze which reasoning styles work best for different types of tasks, just like the original paper did.
The CoT generator agent then:
- Loads a structured prompt template (Jinja-powered)
- Generates multiple reasoning candidates using a local LLM (e.g., Qwen via Ollama)
- Each candidate is a chain of thought: a natural-language reasoning path
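To make the flow concrete, here is a minimal sketch of what this generation step might look like, assuming a Jinja2 template and an Ollama-hosted model called through litellm. The function name and defaults are illustrative, not the exact co_ai API.

```python
# Illustrative sketch of candidate generation (not the exact co_ai implementation).
import litellm
from jinja2 import Template

def generate_candidates(goal: dict, template_text: str, n: int = 3,
                        model: str = "ollama/qwen3",
                        api_base: str = "http://localhost:11434") -> list[str]:
    """Render the CoT prompt for a goal and sample several reasoning chains."""
    prompt = Template(template_text).render(goal=goal)
    candidates = []
    for _ in range(n):
        response = litellm.completion(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            api_base=api_base,
            temperature=0.8,  # a little randomness so the chains differ
        )
        candidates.append(response["choices"][0]["message"]["content"])
    return candidates

# Example usage (goal dict mirrors the YAML above):
# chains = generate_candidates(
#     {"text": "Will AI ever be able to reprogram itself?", "type": "research"},
#     open("prompts/generate_cot.txt").read(),
# )
```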
🔹 Step 2: Evaluation (MRQ or LLM Judge)
To choose the best candidate, we support two modes of evaluation:
- MRQ Evaluator: A self-supervised scoring model trained over time using embedding-based differences between prompt + hypothesis pairs.
- LLM Judge: A structured prompt sent to an LLM that compares the candidates and outputs a preferred one along with justification and optional confidence scores.
This mirrors the paper’s use of human preference signals, but keeps it local, reproducible, and pluggable.
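As a rough illustration, pairwise LLM judging can be as simple as the sketch below. The real LLMJudgeEvaluator renders the evaluation.txt Jinja template and parses a richer verdict, so treat this prompt and parsing as placeholders.

```python
# Simplified pairwise judge; prompt wording and parsing are stand-ins.
import litellm

JUDGE_PROMPT = """You are comparing two chains of thought for the goal below.
Goal: {goal}

Candidate A:
{a}

Candidate B:
{b}

Which candidate reasons better? Answer with exactly "A" or "B", then one sentence of justification."""

def judge_pair(goal: str, a: str, b: str,
               model: str = "ollama/mistral:7b-instruct",
               api_base: str = "http://localhost:11434") -> str:
    response = litellm.completion(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(goal=goal, a=a, b=b)}],
        api_base=api_base,
    )
    verdict = response["choices"][0]["message"]["content"].strip()
    return a if verdict.upper().startswith("A") else b  # default to B otherwise
```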
🔹 Step 3: Rubric-Based Reasoning Classification
After selecting the best chain of thought, we classify it using a structured set of rubrics:
- Is it deductive or analogical?
- Does it show shallow or deep reasoning?
- Is it belief-driven or evidence-driven?
These rubrics are defined in a config file, then executed as prompt templates that analyze the CoT through LLM reflection.
The output is a structured reasoning pattern: a fingerprint of the model’s reasoning behavior for this goal.
🔹 Step 4: Logging and Storage
Each run records:
- The goal and goal type
- The CoT candidates and chosen output
- The rubric pattern
- Evaluation scores
- Model and agent metadata
Everything is stored in a structured, queryable format using PostgreSQL + JSON fields. This supports:
- Large-scale analysis of reasoning strategies
- Per-goal-type aggregation
- Strategy-aware filtering and model tuning
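A simplified version of that storage step, assuming a hypothetical cot_runs table with JSONB columns (the project’s actual schema and table names differ), might look like this:

```python
# Hedged storage sketch: one row per run, JSON fields for the flexible parts.
import json
import psycopg2

def store_run(conn, goal: dict, candidates: list[str], best: str,
              pattern: dict, scores: dict, metadata: dict) -> None:
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO cot_runs (goal_text, goal_type, candidates, best,
                                  pattern, scores, metadata)
            VALUES (%s, %s, %s, %s, %s, %s, %s)
            """,
            (goal["text"], goal["type"], json.dumps(candidates), best,
             json.dumps(pattern), json.dumps(scores), json.dumps(metadata)),
        )
    conn.commit()

# conn = psycopg2.connect("dbname=co_ai user=postgres")
```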
🔹 Step 5: Pattern Embedding and Clustering
To explore the structure of the reasoning space, we embed:
- The rubric patterns
- Optionally the CoT texts themselves
We then cluster and visualize these embeddings to uncover:
- Common strategy clusters
- How different models behave
- Which strategies dominate in different goal types
This replicates the CoT Encyclopedia’s “strategy space” visualizations and gives us a tool for dynamic reasoning analysis.
🧬 Why This Agent Uses 3 Different Models
One of the powerful design decisions in this system is to separate the models by role. Instead of relying on a single LLM to generate, evaluate, and analyze reasoning chains, we use three specialized model configurations:
Model Role | Purpose | Config Key | Example Model |
---|---|---|---|
🧠 Reasoning Model | Generates candidate CoTs | `model` | `ollama/qwen3` |
🧪 Evaluator Model | Compares two CoTs and picks the better one | `evaluator_model` | `ollama/mistral:7b-instruct` |
🧾 Analysis Model | Classifies the winning CoT using rubric prompts | `analysis_model` | `ollama/gemma3` |
This lets you:
- Optimize cost by using smaller models for evaluation/classification
- Swap in higher-quality models for critical tasks like judging
- Experiment independently with reasoning vs evaluation logic
🧩 Rubrics, Reasoning Styles, and Strategic Diversity
A single answer can look good. But two answers can look equally good and be based on totally different reasoning strategies.
This is what the CoT Encyclopedia paper highlights so clearly: we need to go beyond correctness and start paying attention to how models reason, not just what they say.
That’s where rubrics come in.
🎯 What Are Rubrics?
Rubrics are structured criteria for analyzing reasoning patterns. Each one asks a question about the nature of a model’s thought process:
Dimension | Rubric Prompt | Options |
---|---|---|
Inference Style | Is the reasoning based on deduction or analogy? | Deductive / Analogical |
Reasoning Depth | Does the reasoning go deep with multiple steps, or stay surface-level? | Deep / Shallow |
Strategy Orientation | Does the reasoning start from a hypothesis or from evidence? | Top-Down / Bottom-Up |
Evidence Use | Is the argument belief-driven or guided by data? | Belief / Evidence |
Each rubric is configurable in YAML. You can easily add, remove, or disable dimensions. This means you can fine-tune what “good reasoning” means for your domain, whether you’re working on policy, science, ethics, or creative writing.
🧠 How We Use Rubrics
For every selected hypothesis (CoT), the system:
- Loads a Jinja prompt for the dimension (e.g., “Is this reasoning deductive or analogical?”)
- Sends the prompt to a local LLM
- Extracts the classified label
- Stores all dimension-label pairs in the database
This results in a pattern fingerprint for each chain of thought: a structured representation of its reasoning strategy.
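Sketched in code, the classification loop looks roughly like this. The template, model, and label-matching logic are simplified stand-ins for the configured cot_pattern.txt prompt and analysis model.

```python
# Simplified rubric-classification loop (names and template are illustrative).
import litellm
from jinja2 import Template

RUBRIC_TEMPLATE = Template(
    "Goal: {{ goal }}\n\nChain of thought:\n{{ cot }}\n\n"
    "{{ rubric }}\nAnswer with one of: {{ options | join(', ') }}."
)

def classify_pattern(goal: str, cot: str, rubrics: list[dict],
                     model: str = "ollama/mistral:7b-instruct",
                     api_base: str = "http://localhost:11434") -> dict:
    pattern = {}
    for r in rubrics:
        if not r.get("enabled", True):
            continue  # disabled rubrics are skipped
        prompt = RUBRIC_TEMPLATE.render(goal=goal, cot=cot,
                                        rubric=r["rubric"], options=r["options"])
        response = litellm.completion(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            api_base=api_base,
        )
        label = response["choices"][0]["message"]["content"].strip()
        # Snap the free-text answer to the closest allowed option
        pattern[r["dimension"]] = next(
            (opt for opt in r["options"] if opt.lower() in label.lower()),
            label,
        )
    return pattern
```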
📦 What This Enables
By labeling thousands of CoTs this way, we can:
- Cluster similar reasoning styles
- Compare models (e.g., does Mistral prefer shallow strategies while Qwen favors deep ones?)
- Track strategy diversity (e.g., how many unique patterns a model can produce)
- Link patterns to goal types (e.g., planning goals often require top-down strategies)
This turns reasoning into a dataset, and that unlocks visualizations, comparisons, and even prompt tuning.
📌 A Small Example
Here’s a real classification from our system:
🧾 Goal: Will AI ever be able to reprogram itself?
🧠 Hypothesis: [full chain of thought output]
🔎 Pattern:
- Inference Style: Analogical
- Reasoning Depth: Deep
- Strategy Orientation: Top-Down
- Evidence Use: Belief-Driven
This shows the style behind the substance, which we can now analyze, compare, and evolve over time.
🧭 Embedding and Exploring the Strategy Space
Once we’ve classified thousands of reasoning traces using rubrics, we’re left with something powerful: a library of labeled thoughts.
But that’s not just a record; it’s a map waiting to be drawn.
To explore the “strategy space” of reasoning, we transform each classified CoT into a vector embedding, and that lets us visualize, compare, and cluster reasoning styles in ways that go far beyond qualitative analysis.
🔢 How Embedding Works
For each hypothesis (chain of thought), we:
- Generate a text summary of its pattern, e.g.: `Inference Style: Analogical; Reasoning Depth: Deep; Strategy Orientation: Top-Down`
- Combine that with the hypothesis text itself, if desired: `"Hypothesis text here..." // Pattern: Analogical, Deep, Top-Down`
- Pass the combined text to our `embedding_store`, which uses a local embedding model (e.g., `bge`, `e5`, etc.) to get a vector representation
- Store this embedding in a new table, `cot_embeddings`, alongside the goal, model, and pattern metadata
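In the project this goes through the `embedding_store`; the stripped-down version below uses a local sentence-transformers model instead (the checkpoint name is just an example) to show the shape of the data.

```python
# Hedged embedding sketch; the real pipeline uses its own embedding_store.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # any local embedder works

def embed_pattern(cot_text: str, pattern: dict, include_cot: bool = True):
    """Turn a rubric pattern (and optionally the CoT text) into a vector."""
    summary = "; ".join(f"{dim}: {label}" for dim, label in pattern.items())
    text = f"{cot_text} // Pattern: {summary}" if include_cot else summary
    return model.encode(text)  # 1-D numpy vector, ready for cot_embeddings
```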
🧠 What This Enables
With these embeddings in place, we can:
🔹 Visualize the reasoning landscape
Using UMAP or t-SNE, we reduce the high-dimensional embeddings to 2D and plot:
- Reasoning clusters
- Per-model or per-goal-type distributions
- Color-coded dimensions (e.g., analogical vs deductive)
🔹 Cluster reasoning styles
Using HDBSCAN or KMeans, we group similar reasoning strategies and:
- Identify dominant “modes” of thought
- Compare model diversity
- Track style drift over time or between tasks
🔹 Label clusters with LLMs
For each cluster, we sample 5–10 CoTs and prompt an LLM:
“What reasoning strategy is common to the following chains of thought?”
This gives us human-readable strategy names, just like the paper’s “cluster archetypes.”
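Putting these steps together, a minimal cluster-and-sample sketch using UMAP and HDBSCAN might look like this; the parameter values are illustrative, not tuned settings from the project.

```python
# Reduce, cluster, and pull examples for LLM naming (illustrative parameters).
import numpy as np
import umap
import hdbscan

def cluster_patterns(embeddings: np.ndarray):
    # Reduce to 2D for plotting, then cluster in the reduced space
    coords = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
    labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(coords)
    return coords, labels  # labels == -1 marks noise points

def sample_for_naming(cots: list[str], labels, cluster_id: int, k: int = 5) -> list[str]:
    # A few CoTs from one cluster to feed the "what strategy is common?" prompt
    members = [c for c, l in zip(cots, labels) if l == cluster_id]
    return members[:k]
```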
📊 Example Use Cases
Use Case | What You Learn |
---|---|
Model comparison | Does Qwen favor exhaustive reasoning more than Gemma? |
Goal-type matching | Do planning goals correlate with bottom-up strategies? |
Strategy analysis | Are most hypotheses clustered around 3–4 dominant patterns? |
Prompt refinement | Are certain prompt templates pulling reasoning toward shallow clusters? |
This lets us treat reasoning like a dynamic system: one we can observe, debug, and tune just like software.
You’re not just evaluating answers. You’re understanding thought.
⚙️ Configuring the Chain-of-Thought Agent
One of the most powerful aspects of this system is that every agent, including the CoT generator, is configured declaratively via YAML. This lets us:
- Swap models easily (local vs remote)
- Switch evaluation strategies (MRQ or LLM)
- Adjust training, logging, and storage behavior
- Control rubric classification dimensions
- Enable or disable features per run
Here’s a breakdown of the key fields in our `cot_generator` config file:
🧠 Core Identity and Control
```yaml
name: cot_generator
enabled: true
save_prompt: true
save_context: false
skip_if_completed: false
```
- `enabled`: Controls whether this agent is run in the pipeline.
- `save_prompt`: Persists the prompt and model response in the database for traceability.
- `skip_if_completed`: If `true`, skips execution if an output already exists for this goal.
🤖 Model Settings
```yaml
model:
  name: ollama/qwen3
  api_base: http://localhost:11434
```
- This specifies the reasoning model used to generate CoTs.
- Any Ollama-hosted local model can be used here (Qwen, Mistral, Gemma, etc.).
🧪 Evaluation Strategy
```yaml
evaluator: llm  # or 'mrq' if sufficient training data exists
evaluator_model:
  name: ollama/mistral:7b-instruct
evaluator_prompt_file: evaluation.txt
```
- `evaluator`: Chooses between MRQ (embedding-based) or LLM judge (prompt-based).
- `evaluator_model`: Used only for LLM judging.
- `evaluator_prompt_file`: The Jinja prompt (`evaluation.txt`) used to compare candidates.
This design lets the agent self-assess the quality of its outputs, and fall back to a robust LLM-based approach when MRQ doesn’t have enough training data.
🔍 Analysis Model
```yaml
analysis_model:
  name: ollama/mistral:7b-instruct
```
Used during rubric-based reasoning pattern classification. It is kept separate from the generation model, so you can use a model with stronger reflection abilities if desired.
📋 Prompt Configuration
```yaml
prompt_mode: file
prompt_file: generate_cot.txt
pattern_prompt_file: cot_pattern.txt
remove_think: false
```
- `prompt_file`: The base prompt used to generate CoTs
- `pattern_prompt_file`: Used to classify reasoning styles across rubrics
- `remove_think`: If `true`, strips `<think>...</think>` blocks from model output; here we leave them in for introspection.
🧠 Training Parameters (MRQ Evaluator Only)
```yaml
device: cpu
limit: 1000
epochs: 20
patience: 3
min_delta: 0.0001
```
These control the MRQ evaluator’s self-supervised training loop, tuning its value predictor to prefer better reasoning chains.
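As a rough sketch of how those knobs interact (the real MRQSelfEvaluator has its own trainer and value network), an early-stopping loop driven by `epochs`, `patience`, and `min_delta` looks like this:

```python
# Illustrative early-stopping loop; train_step and val_loss_fn are placeholders.
def train_value_net(model, train_step, val_loss_fn, cfg: dict) -> None:
    best_loss, stale = float("inf"), 0
    for epoch in range(cfg["epochs"]):
        train_step(model)              # one pass over up to cfg["limit"] preference pairs
        loss = val_loss_fn(model)      # ranking loss on held-out pairs
        if best_loss - loss > cfg["min_delta"]:
            best_loss, stale = loss, 0  # meaningful improvement, keep going
        else:
            stale += 1                  # no real progress this epoch
        if stale >= cfg["patience"]:
            break                       # stop early
```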
🧪 Rubric Classification
```yaml
rubrics:
  - dimension: "Strategy Orientation"
    rubric: "Does the reasoning proceed in a hypothesis-first (top-down) or data-first (bottom-up) manner?"
    options: ["Top-Down", "Bottom-Up"]
    enabled: true
  # ...
```
- Each rubric defines a dimension of reasoning (e.g., depth, inference style).
- The system prompts an LLM to classify each CoT according to these.
- Disabled rubrics are ignored, enabling easy customization per run.
This structure aligns with the CoT Encyclopedia paper and allows future tools (like clustering or filtering) to use reasoning as structured data.
✅ Why This Matters
This configuration file makes the agent:
- Transparent
- Reproducible
- Extensible
You can:
- Run experiments with different models and evaluators
- Analyze how reasoning patterns shift by rubric
- Track strategy changes over time or across tasks
And because it’s YAML, every experiment is version-controllable and explainable.
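For example, reproducing a run can be as simple as loading the exact YAML that produced it. The path below is illustrative, and OmegaConf (which Hydra builds on) is just one way to read it.

```python
# Load the agent config that produced a run; the path is an assumption.
from omegaconf import OmegaConf

cfg = OmegaConf.load("config/agents/cot_generator.yaml")
print(cfg.model.name)    # e.g. ollama/qwen3
print(cfg.evaluator)     # llm or mrq
print([r.dimension for r in cfg.rubrics if r.enabled])  # active rubric dimensions
```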
✅ Implementation Checklist: What We Built from the CoT Encyclopedia Paper
Paper Component | Description | Implemented? | Notes |
---|---|---|---|
Multi-CoT Generation | Generate multiple chains of thought per goal using a reasoning model | ✅ | Done via ChainOfThoughtGeneratorAgent with local LLMs (Ollama) |
Candidate Evaluation | Select best output using pairwise or tournament evaluation | ✅ | MRQ and LLM Judge evaluators both supported |
LLM-Based Preference Judging | Use another LLM to decide between CoTs | ✅ | LLMJudgeEvaluator uses evaluation.txt prompt |
Rubric-Based Classification | Label reasoning across 10+ dimensions (e.g., depth, inference style) | ✅ | Configurable YAML rubrics; results logged and stored |
Pattern Storage and Analysis | Save CoT patterns with metadata (goal, model, score) | ✅ | Stored via cot_patterns table; includes embeddings |
Goal Type Annotation | Label goals by task type (e.g., math, science, commonsense) | ✅ | Used to route and analyze strategies per goal type |
Pattern Embedding | Embed CoT + rubric data for clustering | ✅ | Integrated via RubricClusterer and vector DB |
Cluster Analysis | Group similar reasoning styles via embeddings | ✅ | Cluster summaries logged; clustering done on run |
Pattern Diversity Metrics | Count unique patterns and measure strategy spread | ⚠️ Partial | Clustering supports this; summary stats to be expanded |
Model Comparison | Compare reasoning styles between different LLMs | ✅ | Supported via config-level model switching |
Strategy-to-Goal Insights | Link certain rubrics to specific goal types | ✅ | Enabled via goal.type + rubric logs |
Human-Readable Strategy Labels | Assign names to clusters (e.g., “careful planner”) | ⚠️ Optional | Can be added with LLM summarization of cluster samples |
Self-Improvement Loop | Use evaluations to retrain or refine reasoning prompts | ✅ | MRQ supports tuning; DSPy version planned for feedback learning |
DSPy Integration (Optional) | Use programmatic, structured prompting | ✅ | ChainOfThoughtDSPyGeneratorAgent supports this natively |
📚 References
- Wang, Y., Radhakrishna, A., Chi, E., & Lee, P. (2024). The Chain-of-Thought Encyclopedia: Mapping Reasoning Strategies in Language Models. arXiv:2505.10185
- Arora, S., Zhang, W., Xiong, C., et al. DSPy: A Library for Declarative Structured Prompting. GitHub – Stanford DSPy
- MRQ: Model-Relative Quality Evaluator. Inspired by self-supervised evaluation strategies using embedding distance and preference comparison. Implementation adapted from Sharpening Language Models with Self-Evaluation (work in progress).
- Ollama – Local LLM runner supporting models like Qwen, Mistral, and Gemma. https://ollama.com
- BGE / E5 / MTEB – Embedding models for text similarity and clustering. MTEB: Massive Text Embedding Benchmark
- Hydra Config System – Flexible configuration management for ML pipelines. https://hydra.cc
🛠️ Code and Project Repository
The full implementation of this Chain-of-Thought reasoning system — including multi-agent pipelines, rubric classification, MRQ evaluation, DSPy integration, and strategy analysis — is available on GitHub:
🔗 View the Project on GitHub →
This repository includes:
- ✅ Local model integration via Ollama
- ✅ Configurable agent-based reasoning pipelines
- ✅ Rubric classification and CoT clustering
- ✅ MRQ and LLM-based evaluation support
- ✅ Structured prompt templates and analysis modules
- ✅ Example configs and scripts for running and tuning
📬 Contributions Welcome
This project is open source and actively evolving. If you’re working on:
- Reasoning systems
- CoT evaluation
- LLM orchestration
- Local-first tooling
…we’d love your feedback, use cases, and pull requests!
📖 Glossary of Key Terms
Term | Definition |
---|---|
Chain of Thought (CoT) | A sequence of reasoning steps used by a language model to arrive at an answer. Often involves natural-language explanation. |
MRQ (Model-Relative Quality) | A lightweight evaluator that compares hypotheses based on learned value differences using embeddings and simple neural scoring. |
LLM Judge | An evaluation mechanism where a language model compares two responses and selects the better one using a structured prompt. |
Rubric | A structured criterion or dimension used to classify how a reasoning chain behaves (e.g., depth, style, orientation). |
CoT Encyclopedia | A research framework that maps and analyzes diverse reasoning strategies in language models by generating, evaluating, and clustering chains of thought. |
Prompt Template | A predefined structure, often written using Jinja2, that guides the language model to perform specific tasks or respond in a specific format. |
Evaluator | A module that scores or ranks different chains of thought, either via training (MRQ) or inference (LLM-based judgment). |
Pattern Embedding | The process of converting a classified CoT reasoning pattern into a vector for clustering or similarity search. |
Goal Type | A semantic category assigned to a task prompt, such as `math`, `science`, `commonsense`, or `policy`. |
DSPy | A declarative Python library for structured prompting and modular reasoning pipelines, developed by Stanford NLP. |
Self-Improving Agent | An AI agent that iteratively generates, evaluates, classifies, and refines its own reasoning over time. |
Ollama | A local language model runner that supports models like Qwen, Mistral, and Gemma via a simple REST API. |
Rubric Clusterer | A module that embeds and clusters rubric-based reasoning patterns to identify strategy archetypes or diversity. |
🚀 Applications and Extensions
Building the CoT Encyclopedia locally wasn’t just a replication project; it laid the foundation for next-generation AI agents that can reason, reflect, and improve over time.
Here are just a few of the applications this enables:
🔁 1. Self-Improving Reasoning Agents
With MRQ or LLM-based evaluators, we don’t just generate answers; we compare, refine, and learn from them.
Imagine an AI that gets smarter not just by consuming more data, but by actively reflecting on and optimizing its own thought processes.
By recording which chains of thought are preferred (and why), the system can:
- Tune prompt templates
- Adapt strategies to different goal types
- Train evaluators on real performance feedback
This is reasoning as a feedback loop: a pathway to self-improving agents.
🧠 2. Goal-Type to Strategy Routing
Every goal is tagged with a type (e.g., `math`, `planning`, `ethics`). Because we now track reasoning patterns per goal type, we can:
- Learn which reasoning strategies work best for each type
- Route future goals to different agents or prompt styles based on their type
- Even suggest goal-type–specific templates dynamically
This supports DOTS-style planning and strategy-aware agent orchestration: systems that don’t just reason, but reason intentionally.
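As a hedged illustration, such routing could start as nothing more than a mapping from goal type to the prompt template and evaluator that have performed best so far. The file names and choices below are made up.

```python
# Illustrative routing table; entries would be learned from stored run data.
ROUTES = {
    "math":     {"prompt_file": "generate_cot_stepwise.txt", "evaluator": "mrq"},
    "planning": {"prompt_file": "generate_cot_topdown.txt",  "evaluator": "llm"},
    "research": {"prompt_file": "generate_cot.txt",          "evaluator": "llm"},
}

def route(goal: dict) -> dict:
    # Fall back to the default template for unseen goal types
    return ROUTES.get(goal.get("type", "other"),
                      {"prompt_file": "generate_cot.txt", "evaluator": "llm"})
```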
🔍 3. Hypothesis Search and Filtering
With reasoning patterns stored as structured data, we can now:
- Search for chains of thought with specific strategies (e.g., “Show me all top-down, deductive answers for policy goals”)
- Retrieve reasoning by model, rubric match, or evaluation score
- Build dashboards or tools for interactive CoT exploration
This turns the system into a living CoT Encyclopedia, not just a pipeline.
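For example, assuming the hypothetical cot_runs table and JSONB pattern column from the storage sketch earlier, strategy-aware retrieval is a single query:

```python
# Strategy-aware retrieval sketch over the assumed cot_runs schema.
import psycopg2

QUERY = """
SELECT goal_text, best
FROM cot_runs
WHERE goal_type = %s
  AND pattern->>'Strategy Orientation' = 'Top-Down'
  AND pattern->>'Inference Style' = 'Deductive'
ORDER BY (scores->>'value')::float DESC
LIMIT 10;
"""

def top_down_deductive(conn, goal_type: str = "policy"):
    with conn.cursor() as cur:
        cur.execute(QUERY, (goal_type,))
        return cur.fetchall()
```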
🛠 4. Plug-and-Play Evaluation Backends
Thanks to modular evaluator support, you can swap between:
- MRQ (efficient, trainable, local)
- LLM judge (rich, pairwise, interpretable)
- Future scoring systems (RLHF signals, task-specific verifiers, etc.)
This lets users configure how “reasoning quality” is defined, depending on the application: research, creativity, engineering, etc.
This config entry determines which evaluator is used:

```yaml
evaluator: llm  # 'mrq' or 'llm'; there may not be enough training items for mrq
```

```python
def _init_evaluator(self):
    if self.cfg.get("evaluator", "mrq") == "llm":
        return LLMJudgeEvaluator(...)
    else:
        return MRQSelfEvaluator(...)
```
In an upcoming post I will show a direct comparison between MRQ and other forms of evaluation.
🔮 5. Research-Grade Transparency
Every decision, every prompt, classification, and evaluation is stored with:
- Timestamps
- Agents
- Models
- Strategies used
This enables:
- Scientific reproducibility
- Transparent debugging
- Long-term tracking of reasoning trends
You’re not just building a product; you’re building an auditable reasoning engine.
For reference, here is the shared `call_llm` helper that every agent uses to call the model, log the prompt, and clean the output:

```python
def call_llm(self, prompt: str, context: dict, llm_cfg: dict = None) -> str:
    """Call the default or custom LLM, log the prompt, and handle output."""
    props = llm_cfg or self.llm  # Use passed-in config or default
    messages = [{"role": "user", "content": prompt}]
    try:
        response = litellm.completion(
            model=props[NAME],
            messages=messages,
            api_base=props[API_BASE],
            api_key=props.get(API_KEY, ""),
        )
        output = response["choices"][0]["message"]["content"]

        # Save prompt and response if enabled
        if self.cfg.get(SAVE_PROMPT, False) and self.memory:
            self.memory.prompt.save(
                context.get("goal"),
                agent_name=self.name,
                prompt_key=self.cfg.get(PROMPT_PATH, ""),
                prompt_text=prompt,
                response=output,
                strategy=self.cfg.get(STRATEGY, ""),
                version=self.cfg.get("version", 1),
            )

        # Remove <think> blocks if configured
        response_cleaned = remove_think_blocks(output) if self.remove_think else output

        # Optionally add to context history
        if self.cfg.get("add_prompt_to_history", True):
            self.add_to_prompt_history(context, prompt, {"response": response_cleaned})

        return response_cleaned
    except Exception as e:
        print(f"❌ Exception: {type(e).__name__}: {e}")
        self.logger.log("LLMCallError", {"exception": str(e)})
        raise
```
🧾 Conclusion and Future Work
We set out to rebuild the Chain-of-Thought Encyclopedia not just to replicate the paper, but to make its ideas usable, local, and extensible.
What we ended up with is more than a reproduction.
It’s a system that:
- Generates multiple reasoning paths for a goal
- Evaluates and selects the best one
- Classifies it across structured rubrics
- Embeds and clusters reasoning patterns
- Stores everything for analysis and tuning
- Learns from its own outputs over time
All of it powered by local LLMs, structured prompt templates, and a modular agent framework that grows as the reasoning tasks evolve.