Review: What We’ve Learned So Far


😶‍🌫️ Summary

This post is a quick review of the journey so far.

We’re one third of the way through the Self-Learning Systems (100-part) series, and this checkpoint pulls together the first 33 posts and the research papers that shaped them. The list below gives each post, its place in the series, and the key references it builds on, so you can see how the system and the ideas behind it have evolved since May.

33. ✨ TINY CRITICS: Lightweight Reasoning Checks for Large AI Systems
    Finishes this third of the series with Tiny Critics: sub-50KB models trained on frontier features that sit beside big LLMs and flag broken reasoning in real time (a toy sketch of the idea follows this list).
    References: Self-Evolving Vision-Language Models from Images; Chain-of-Thought Prompting Elicits Reasoning in Large Language Models; Training Verifiers to Solve Math Word Problems; Tree of Thoughts: Deliberate Problem Solving with Large Language Models; Solving math word problems with process- and outcome-based feedback; Let’s Verify Step by Step; NaturalProofs: Mathematical Theorem Proving in Natural Language; Support-vector networks

32. The Nexus Blossom: How AI Thoughts Turn into Habits
    Introduces the Nexus graph, where thoughts become nodes, relationships become edges, and repeated patterns harden into habits that guide future reasoning and action.

31. Search–Solve–Prove: building a place for thoughts to develop
    Designs a Search–Solve–Prove habitat where reasoning steps can be generated, checked, and iteratively improved, giving Stephanie a structured space to grow and test ideas.
    References: Search Self-play: Pushing the Frontier of Agent Capability without Supervision

30. The Space Between Models Has Holes: Mapping the AI Gap
    Defines the “gap” between models where they systematically disagree and builds tools to map, visualize, and eventually exploit those gaps instead of pretending they don’t exist.

29. A Complete Visual Reasoning Stack: From Conversations to Epistemic Fields
    Pulls ZeroModel, Phōs, SIS, and Learning-from-Learning into a single visual reasoning stack where conversations, metrics, and epistemic fields all share one coherent, inspectable space.
    References: Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

28. 🔦 Phōs: Visualizing How AI Learns and How to Build It Yourself
    Uses Phōs as a visual lens on learning itself, showing how gradients, scores, and traces can be turned into intuitive pictures of what Stephanie is doing and why.

27. Episteme: Distilling Knowledge into AI
    Takes all the epistemic machinery (cartridges, scores, traces) and focuses it on knowledge distillation, turning raw research and usage data into durable, inspectable understanding.
    References: Deep Reinforcement Learning from Human Preferences; Learning to summarize from human feedback; A General Language Assistant as a Laboratory for Alignment; Training language models to follow instructions with human feedback; Scaling Laws for Neural Language Models

26. 🔄 Learning from Learning: Stephanie’s Breakthrough
    Closes the loop on Learning-from-Learning: Stephanie scores conversations for knowledge gain, builds training pairs, and fine-tunes tiny models that specialize in recognizing useful insight.

25. From Photo Albums to Movies: Teaching AI to See Its Own Progress
    Extends the visual stack from static snapshots to sequences, letting Stephanie track progress over time like flipping from scattered photos to a coherent movie of its learning.
    References: Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR; NER Retriever: Zero-Shot Named Entity Retrieval with Type-Aware Embeddings

24. Case Based Reasoning: Teaching AI to Learn From itself
    Builds a Case-Based Reasoning layer where Stephanie stores past reasoning episodes as cases and reuses them, so each solved problem becomes a worked example for the next one.
    References: Memento: Fine-tuning LLM Agents without Fine-tuning LLMs; Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches

23. SIS: The Visual Dashboard That Makes Stephanie
    Adds SIS, a visual dashboard that shows pipelines, scores, and traces in one place, turning Stephanie from a black box into a system you can actually see and debug.
    References: Retrieval-Augmented Generation (RAG) Papers

22. ZeroModel: Visual AI you can scrutinize
    Introduces ZeroModel and Visual Policy Maps, encoding policy data into images so decisions can be inspected, compared, and even executed using simple visual logic instead of big models.

21. Everything is a Trace: Stephanie Enters Full Reflective Mode
    Redesigns Stephanie so every action becomes a traceable plan, making it possible to replay, compare, and learn from complete reasoning trajectories instead of isolated calls.
    References: Hierarchical Reasoning Model; Towards General-Purpose Model-Free Reinforcement Learning; Recurrent Independent Mechanisms; AutoGPart: Intermediate Supervision Search for Generalizable 3D Part Segmentation; Human-level control through deep reinforcement learning; Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning; Root Mean Square Layer Normalization; Super-efficiency of automatic differentiation for functions defined as a minimum; Introduction to Latent Variable Energy-Based Models: A Path Towards Autonomous Machine Intelligence

20. Layers of thought: smarter reasoning with the Hierarchical Reasoning Model
    Builds a Hierarchical Reasoning Model that looks at full traces, not just final answers, giving Stephanie a way to score the quality of entire chains of thought.
    References: Recurrent Independent Mechanisms; AutoGPart: Intermediate Supervision Search for Generalizable 3D Part Segmentation; Nature Machine Intelligence, 2022; Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning; Root Mean Square Layer Normalization; Super-efficiency of automatic differentiation for functions defined as a minimum; Mitigating Catastrophic Forgetting in Long Short-Term Memory Networks

19. Stephanie’s Secret: The Dawn of Reflective AI
    Introduces the GILD-style learning loop for Stephanie: greedy inference plus distillation, so every solved task becomes data that sharpens the system’s policies and scorers.
    References: Scalable In-Context Q-Learning; Enhancing Online Reinforcement Learning with Meta-Learned Objective from Offline Data; Exact phylodynamic likelihood via structured Markov genealogy processes; ClusT3: Information Invariant Test-Time Training

18. The Shape of Thought: Exploring Embedding Strategies with Ollama, HF, and H-Net
    Compares multiple embedding backends on real research data, treating embeddings as different “shapes of thought” and measuring how they affect extraction, scoring, and downstream reasoning.
    References: H-Net; AlphaEdit: Null-Space Constrained Model Editing; Self-Refine: Iterative Refinement with Self-Feedback; ReAct: Synergizing Reasoning and Acting in Language Models; Reflexion: An Automatic Framework for Iterative Strategy Refinement; Deep Reinforcement Learning from Human Preferences; Direct Preference Optimization: Your Language Model is Secretly a Reward Model; Tutorial on Energy-Based Learning; GPT-4 Technical Report; BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding; Dynamic Context Partitioning for Long Document Understanding

17. Getting Smarter at Getting Smarter: A Practical Guide to Self-Tuning AI
    Turns self-tuning into a concrete engineering pattern, wiring MRQ, EBT, SVM, and LLM judgments into a modular scoring stack that can be trained, compared, and swapped in and out.
    References: Energy-Based Transformers are Scalable Learners and Thinkers; Direct Preference Optimization: Your Language Model is Secretly a Reward Model; Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning

16. Epistemic Engines: Building Reflective Minds with Belief Cartridges and In-Context Learning
    Introduces epistemic engines and belief cartridges: compact, scored summaries that let Stephanie carry forward what it knows and reason reflectively over its own beliefs.
    References: Language Models are Few-Shot Learners; Fine-tuned LLMs Know More, Hallucinate Less with Few-Shot Sequence-to-Sequence Semantic Parsing over Wikidata; SimCSE: Simple Contrastive Learning of Sentence Embeddings; Autobiographical event memory and aging: older adults get the gist; Snorkel: Rapid Training Data Creation with Weak Supervision; Concrete Problems in AI Safety

15. Self-Improving AI: A System That Learns, Validates, and Retrains Itself
    Assembles the first full self-improving loop: Stephanie scores its own outputs, trains internal scorers like MRQ, and uses those scorers to decide what to read, trust, and retrain on.
    References: RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation

14. Teaching Tiny Models to Think Big: Distilling Intelligence Across Devices
    Shows how to distill reasoning skills from big models into tiny local ones, letting phones or edge devices share in the same self-improving behavior as the main system.
    References: LoRD: Aligning Language Models via Embedding Space Distillation

13. Compiling Thought: Building a Prompt Compiler for Self-Improving AI
    Treats prompts as programs and introduces a prompt compiler that turns high-level intentions into structured pipelines Stephanie can run, inspect, and improve over time.

12. Thoughts of Algorithms
    Steps back to ask what counts as a “thought” for an AI, sketching how algorithms, traces, and symbolic structures become the raw material of Stephanie’s inner mental life.
    References: Self-Adapting Language Models; ReAct: Synergizing Reasoning and Acting in Language Models; Deep Reinforcement Learning from Human Preferences; Direct Preference Optimization: Your Language Model is Secretly a Reward Model; AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models; System-2 Fine-tuning for Robust Integration of New Knowledge

11. Document Intelligence: Turning Documents into Structured Knowledge
    Builds the document intelligence layer that slices papers into sections, extracts structured nuggets, and turns messy PDFs into material that Stephanie can actually reason over.
    References: Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training; Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers

10. Learning to Learn: A LATS-Based Framework for Self-Aware AI Pipelines
    Applies LATS to Stephanie’s pipelines so the system can generate multiple candidate traces, score them, and gradually prefer reasoning strategies that lead to better outcomes.
    References: RM-R1: Reward Modeling as Reasoning; LATS: Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models; Mastering the game of Go with deep neural networks and tree search; GPT-4 Technical Report

9. Dimensions of Thought: A Smarter Way to Evaluate AI
    Defines “dimensions of thought” (goal-conditioned axes like usefulness, risk, and novelty) and builds the scoring infrastructure that lets Stephanie rate itself along many dimensions at once.
    References: GenSim: Generating Robotic Simulation Tasks via Large Language Models; Direct Preference Optimization: Your Language Model is Secretly a Reward Model

8. Programming Intelligence: Using Symbolic Rules to Steer and Evolve AI
    Re-introduces symbolic rules as first-class citizens, wiring a rules engine into Stephanie so we can steer, constrain, and gradually evolve behavior alongside learned models.
    References: Representation Engineering: A Top-Down Approach to AI Transparency; The Random Forest Model for Analyzing and Forecasting the US Stock Market in the Context of Smart Finance; Direct Preference Optimization: Your Language Model is Secretly a Reward Model

7. Adaptive Reasoning with ARM: Teaching AI the Right Way to Think
    Uses ARM-style adaptive reasoning to break problems into steps, evaluate each step, and teach Stephanie to adjust its thinking strategy mid-trajectory instead of only at the end.
    References: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models; Training language models to follow instructions with human feedback; Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

6. A Novel Approach to Autonomous Research: Implementing NOVELSEEK with Modular AI Agents
    Implements a NOVELSEEK-style autonomous research loop where modular agents search papers, generate ideas, and evolve research directions without manual babysitting.
    References: InternAgent: When Agent Becomes the Scientist – Building Closed-Loop System from Hypothesis to Verification; WizardLM: Empowering large pre-trained language models to follow complex instructions

5. The Self-Aware Pipeline: Empowering AI to Choose Its Own Path to the Goal
    Introduces a self-aware pipeline that treats each run as a plan, letting Stephanie choose which stages to execute, skip, or repeat based on the goal and feedback.
    References: HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages; Devil’s Advocate: Anticipatory Reflection for LLM Agents; Symbolic Learning Enables Self-Evolving Agents

4. General Reasoner: The smarter Local Agent
    Turns scattered tools into a single General Reasoner that can pick goals, choose appropriate skills, and coordinate local LLMs into a reusable, task-agnostic problem solver.
    References: General-Reasoner: Advancing LLM Reasoning Across All Domains

3. Building a Self-Improving Chain-of-Thought Agent: Local LLMs Meet the CoT Encyclopedia
    Builds a chain-of-thought “encyclopedia” agent that collects, scores, and reuses reasoning traces so local models learn not just answers, but better ways of thinking.
    References: The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think

2. Self-Improving Agents: Applying the Sharpening Framework to Local LLMs
    Adapts the Sharpening Framework to small, local LLMs so agents can repeatedly critique, refine, and re-run their own outputs instead of relying on a larger remote model.
    References: Self-Improvement in Language Models: The Sharpening Mechanism; Towards General-Purpose Model-Free Reinforcement Learning

1. Building an AI Co-Scientist
    Defines Stephanie as an AI co-scientist and builds the first end-to-end loop that can read papers, propose experiments, and help a human push a research idea forward.
    References: Towards an AI Co-Scientist
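
To make the capstone concrete before the recap: a Tiny Critic, as post 33 describes it, is a model small enough to serialize well under 50KB that watches reasoning-trace features and flags broken steps. The sketch below is only an illustration of that shape, not the post’s actual implementation; the feature names and weights are invented.

```python
import numpy as np

# Hypothetical per-step features extracted from a reasoning trace:
FEATURES = ["step_score_drop", "contradiction_flag", "novelty", "verifier_agreement"]

class TinyCritic:
    """A sub-50KB critic: logistic regression over trace features.

    In the series, weights like these would be distilled from frontier-model
    judgments; here they are illustrative placeholders.
    """

    def __init__(self, weights: np.ndarray, bias: float):
        self.w = weights.astype(np.float32)  # 4 float32 weights + 1 bias = 20 bytes
        self.b = np.float32(bias)

    def flag(self, features: np.ndarray, threshold: float = 0.5) -> bool:
        """Return True if a reasoning step looks broken."""
        p = 1.0 / (1.0 + np.exp(-(features @ self.w + self.b)))
        return bool(p > threshold)

# Usage (made-up numbers): a step with a big score drop, a contradiction,
# and low verifier agreement gets flagged for review.
critic = TinyCritic(weights=np.array([2.0, 3.0, -0.5, -2.5]), bias=-0.2)
step = np.array([0.8, 1.0, 0.1, 0.2])
print(critic.flag(step))  # True
```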

Over these first 33 posts, Stephanie has gone from a single AI co-scientist prototype to a sprawling, self-improving ecosystem. We started with one local model reading papers and helping with experiments, then layered on better reasoning (CoT, ARM, LATS), a smarter pipeline that can choose its own path, and a General Reasoner that turns scattered tools into a reusable thinking system.

From there, the series shifted toward epistemic structure and scoring. We built belief cartridges and epistemic engines, MRQ/EBT/SVM scoring stacks, and Hierarchical Reasoning Models that look at full traces instead of isolated answers. Stephanie learned to rate its own thoughts along many dimensions, train tiny specialist models, and use GILD-style loops to get smarter at getting smarter.
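
A minimal sketch of that scoring pattern may help. The names and numbers here are invented, not Stephanie’s actual interfaces: several interchangeable scorers each rate a reasoning trace along named dimensions, and the stack merges their outputs so any scorer can be trained, compared, or swapped independently.

```python
from typing import Callable, Dict, List

# A trace is just an ordered list of reasoning steps (strings, for brevity).
Trace = List[str]

# Each scorer maps a trace to a score per dimension (usefulness, risk, ...).
Scorer = Callable[[Trace], Dict[str, float]]

def mrq_style_scorer(trace: Trace) -> Dict[str, float]:
    # Placeholder: a learned scorer would derive these from embeddings.
    return {"usefulness": 0.7, "risk": 0.2}

def heuristic_scorer(trace: Trace) -> Dict[str, float]:
    # Placeholder heuristic: longer traces score a bit higher on novelty.
    return {"novelty": min(1.0, len(trace) / 10)}

def score_stack(trace: Trace, scorers: List[Scorer]) -> Dict[str, float]:
    """Average every dimension reported by any scorer in the stack."""
    totals: Dict[str, List[float]] = {}
    for scorer in scorers:
        for dim, value in scorer(trace).items():
            totals.setdefault(dim, []).append(value)
    return {dim: sum(vs) / len(vs) for dim, vs in totals.items()}

trace = ["restate the goal", "propose a plan", "check the plan against the goal"]
print(score_stack(trace, [mrq_style_scorer, heuristic_scorer]))
# {'usefulness': 0.7, 'risk': 0.2, 'novelty': 0.3}
```

The design point is the swap: replacing `heuristic_scorer` with an EBT- or SVM-backed scorer changes nothing downstream, which is what makes the stack trainable and comparable piece by piece.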

The most recent posts focused on visibility and habitats for thought. ZeroModel and Phōs turned policy data and learning signals into images. SIS, CBR, PACS, and the visual reasoning stack gave us dashboards, case libraries, progress movies, and epistemic fields. GAP, Search–Solve–Prove, Nexus, and Tiny Critics then pushed into the space between models: mapping disagreements, giving thoughts a graph to live in, and adding lightweight critics that can call out bad reasoning in real time.
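
The Nexus idea is simple enough to caricature in code. Below is a toy version, with an invented threshold and no claim about the real implementation: thoughts are nodes, observed transitions are counted edges, and an edge that recurs often enough hardens into a habit.

```python
from collections import defaultdict

class NexusGraph:
    """Toy habit graph: repeated thought transitions harden into habits."""

    HABIT_THRESHOLD = 3  # invented: recurrences before an edge becomes a habit

    def __init__(self):
        self.edges = defaultdict(int)  # (from_thought, to_thought) -> count

    def observe(self, from_thought: str, to_thought: str) -> None:
        """Record one reasoning step from one thought to the next."""
        self.edges[(from_thought, to_thought)] += 1

    def habits(self):
        """Edges seen often enough to guide future reasoning."""
        return [e for e, n in self.edges.items() if n >= self.HABIT_THRESHOLD]

g = NexusGraph()
for _ in range(3):
    g.observe("restate goal", "list constraints")
g.observe("restate goal", "guess an answer")
print(g.habits())  # [('restate goal', 'list constraints')]
```

The real post layers scoring and much richer structure on top; the point here is only that “habit” can fall out of nothing more exotic than edge counts.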


🚀 Where We’re Going Next

The first third of this series was about building the inside of Stephanie:
scorers, traces, critics, Nexus graphs, Visual Policy Maps, and the first sketches of an entelechial system — a will encoded into code and models.

From here on, the work turns outward.

The next phase will focus on practical applications of that architecture:

  • Code & System Guardians
    Tiny Critics, VPM risk badges, and ZeroModel maps that can sit over real codebases and services, spotting bad reasoning, fragile areas, and AI-generated landmines before they ship.

  • Living Research Assistants
    Always-on pipelines that watch your work, mine your past conversations, pull in new papers, and come back with “you’re looping here” or “here’s a better approach” without needing to be asked.

  • Cognitive Dashboards
    Visual, trace-driven views (VPMs, Phōs, SIS) that let humans see what the system is doing — where it’s confident, where it’s blind, and how its policies are evolving over time.

  • Shared Entelechy Graphs
    A living knowledge graph of “wills in code”: reusable goal-modules with clear attributes, quality scores, and interfaces, so other people (and other AIs) can plug Stephanie’s skills into their own work.

In other words:
the experimental lab phase is over.
The next posts will show how this new medium behaves in the real world,
not just as an idea, but as tools and systems people can actually touch.