Element 55
En.Confidence

Encodes the model's own certainty level about its output.

What It Does

Entropy.Confidence neurons encode the model's epistemic state about the reliability of its own output. They do not track uncertainty in the external content being processed; they carry the model's internal signal about how sure it is. They activate when the model has high certainty about what it is generating, and their activation is absent or suppressed when the model generates content it has little warrant for. This is the neuron cluster most directly linked to hallucination.

How It Behaves

Confidence neurons are strongly concentrated in the late layers, the strongest late-layer specialization of any element in our corpus. This is the right architectural location for a confidence signal: certainty about an output should be computed after the content of that output has been largely determined. Critically, Confidence is the one element where absence is more diagnostic than presence: when Entropy.Confidence neurons fail to activate on a difficult factual question, the model produces a hallucinated but confident-sounding answer. The model doesn't know it doesn't know. This finding was the key result of our Hallucination Trace Experiment.
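One way to put a number on "late-layer concentration" is the share of an element's tagged neurons that sit in the final portion of the network. The sketch below is illustrative only: the function name late_layer_share, the choice of the last third of layers as "late", and the example layer indices are all assumptions, not values from our corpus. The default of 12 layers matches GPT-2 Small.

```python
from typing import Iterable


def late_layer_share(
    neuron_layers: Iterable[int],
    n_layers: int = 12,            # GPT-2 Small has 12 transformer layers
    late_fraction: float = 1 / 3,  # assumption: "late" means the last third
) -> float:
    """Return the fraction of tagged neurons located in the late layers.

    neuron_layers holds the 0-based layer index of each neuron tagged
    with the element of interest.
    """
    layers = list(neuron_layers)
    if not layers:
        return 0.0
    # First layer index that counts as "late" (8 for the defaults).
    first_late_layer = n_layers - int(round(n_layers * late_fraction))
    late_count = sum(1 for layer in layers if layer >= first_late_layer)
    return late_count / len(layers)


# Hypothetical layer indices for neurons tagged Entropy.Confidence.
confidence_layers = [9, 10, 10, 11, 11, 11, 4, 10]
share = late_layer_share(confidence_layers)
print(f"late-layer share: {share:.3f}")
```

Comparing this share across elements would show which ones specialize in late layers; the cutoff for "late" is a free parameter and should be reported alongside the result.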

Research Example

In our GPT-2 Small Hallucination Trace Experiment, Entropy.Confidence neurons were consistently absent during the 10 hallucinated responses to factual prompts (e.g. 'The capital of Burkina Faso is now'), while showing weak but non-zero activation on the 2 correct responses (e.g. 'Water is made of hydrogen and oxygen'). The model confidently produced wrong answers on prompts where Confidence neurons never activated: it had no internal signal that its outputs were unreliable. This is the neuron-level signature of confident hallucination.
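The absence check described here can be sketched as a simple scan over recorded activations. This is a minimal illustration, not the code used in the experiment: the function names, the threshold of 0.0, the response labels, and every number in the example are hypothetical. It assumes activations for the tracked Confidence neurons have already been recorded per generated token.

```python
from typing import Dict, List, Sequence

# activations[t][i] = activation of tracked neuron i at generated token t
Trace = Sequence[Sequence[float]]


def confidence_signal_present(activations: Trace, threshold: float = 0.0) -> bool:
    """True if any tracked neuron exceeds the threshold at any token."""
    return any(value > threshold for token in activations for value in token)


def flag_unsignalled(responses: Dict[str, Trace], threshold: float = 0.0) -> List[str]:
    """Return labels of responses where no Confidence neuron ever activated."""
    return [
        label
        for label, trace in responses.items()
        if not confidence_signal_present(trace, threshold)
    ]


# Made-up traces: 3 generated tokens x 2 tracked neurons each.
example_traces = {
    "hallucinated_example": [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    "correct_example": [[0.0, 0.02], [0.05, 0.0], [0.0, 0.0]],
}
flagged = flag_unsignalled(example_traces)
print(flagged)
```

The threshold is a design choice: because GELU outputs can be slightly negative or very close to zero, a small positive cutoff may be more robust than 0.0 when deciding that a neuron "never activated".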

Other Entropy Elements