Calibration

Hallucination Energy: A Geometric Foundation for Policy-Bounded AI

🚀 Summary

This post presents the current research draft and implementation of a geometric framework for bounding stochastic language models through deterministic policy enforcement.

The central contribution is a scalar metric termed Hallucination Energy, defined as the projection residual between a claim embedding and the subspace spanned by its supporting evidence embeddings. This metric operationalizes grounding as a measurable geometric quantity.
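As a rough illustration of the definition above, the residual can be computed by projecting the claim embedding onto the span of the evidence embeddings and measuring what remains. The sketch below is a minimal, hypothetical implementation assuming NumPy embedding vectors; the function name and details are illustrative, not the draft's actual API.

```python
import numpy as np

def hallucination_energy(claim: np.ndarray, evidence: np.ndarray) -> float:
    """Projection residual of a claim embedding onto the evidence subspace.

    claim:    (d,) embedding of the generated claim
    evidence: (k, d) embeddings of the supporting evidence passages
    Returns the squared norm of the component of `claim` lying outside
    span(evidence) -- zero when the claim is fully contained.
    """
    # Orthonormal basis for the evidence subspace via thin SVD,
    # keeping only numerically significant directions.
    _, s, vt = np.linalg.svd(evidence, full_matrices=False)
    basis = vt[s > 1e-10 * s.max()]  # rows are orthonormal directions

    # Project the claim onto the subspace and take the residual.
    projection = basis.T @ (basis @ claim)
    residual = claim - projection
    return float(residual @ residual)
```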

We proceed in three stages:

  1. Formal Definition: a draft manuscript introducing Hallucination Energy, its mathematical formulation, and its role within a policy-controlled architecture.
  2. Empirical Evaluation: structured calibration and adversarial stress testing across multiple domains to assess the robustness and limits of projection-based grounding.
  3. Applied Validation: large-scale evaluation on 10,000 samples from the HaluEval summarization benchmark, demonstrating that projection-based containment functions as a strong first-order grounding signal in a real generative setting.

This work does not claim to solve hallucination. Rather, it characterizes the boundary of projection-based grounding, establishes its suitability as a deterministic policy scalar, and documents both its strengths and its structural limitations.
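Because the energy is a single deterministic scalar, it can back a simple policy gate. The snippet below (reusing the `hallucination_energy` sketch above) is a hypothetical illustration rather than the paper's architecture; the threshold value is a placeholder that would be set during the calibration stage.

```python
HALLUCINATION_ENERGY_THRESHOLD = 0.35  # hypothetical value; chosen via calibration

def enforce_grounding_policy(claim_embedding, evidence_embeddings):
    """Deterministic containment check: accept the claim or flag it for
    regeneration/review based on its projection residual."""
    energy = hallucination_energy(claim_embedding, evidence_embeddings)
    if energy <= HALLUCINATION_ENERGY_THRESHOLD:
        return {"action": "accept", "energy": energy}
    return {"action": "flag", "energy": energy}
```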

The Space Between Models Has Holes: Mapping the AI Gap

🌌 Summary

What if the most valuable insights in AI evaluation aren’t in model agreements, but in systematic disagreements?

This post reveals that the “gap” between large and small reasoning models contains structured, measurable intelligence about how different architectures reason. We demonstrate how to transform model disagreements from a problem into a solution, using the space between models to make tiny networks behave more like their heavyweight counterparts.

We start by assembling a high-quality corpus (10k–50k conversation turns), scoring it with a local LLM to create training targets, and training both the HRM and Tiny models under identical conditions. We then run fresh documents through both models, collect not just final scores but also rich auxiliary signals (uncertainty, consistency, OOD detection, and so on), and visualize what these signals reveal.
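To make that evaluation loop concrete, the sketch below shows one hypothetical way to collect paired signals from the two models for later visualization; `score_with_signals`, the `ModelSignals` fields, and the scorer objects are illustrative stand-ins, not the interfaces of the actual training code.

```python
from dataclasses import dataclass

@dataclass
class ModelSignals:
    score: float        # final quality score for the document
    uncertainty: float  # e.g. predictive entropy or ensemble variance
    consistency: float  # agreement across repeated or augmented passes
    ood_score: float    # out-of-distribution detector output

def collect_gap_signals(documents, hrm_model, tiny_model):
    """Run fresh documents through both models and record the disagreement
    ("gap") alongside each model's auxiliary signals."""
    rows = []
    for doc in documents:
        hrm: ModelSignals = hrm_model.score_with_signals(doc)    # hypothetical API
        tiny: ModelSignals = tiny_model.score_with_signals(doc)  # hypothetical API
        rows.append({
            "doc_id": doc["id"],
            "gap": hrm.score - tiny.score,
            "hrm": hrm,
            "tiny": tiny,
        })
    return rows
```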