LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations, by Hadas Orgad and 6 other authors
Abstract: Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures, collectively referred to as “hallucinations”. Recent studies have demonstrated that LLMs’ internal states encode information regarding the truthfulness of their outputs, and that this information can be utilized to detect errors. In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized. We first discover that the truthfulness information is concentrated in specific tokens, and leveraging this property significantly enhances error detection performance. Yet, we show that such error detectors fail to generalize across datasets, implying that — contrary to prior claims — truthfulness encoding is not universal but rather multifaceted. Next, we show that internal representations can also be used to predict the types of errors the model is likely to make, facilitating the development of tailored mitigation strategies. Lastly, we reveal a discrepancy between LLMs’ internal encoding and external behavior: they may encode the correct answer, yet consistently generate an incorrect one. Taken together, these insights deepen our understanding of LLM errors from the model’s internal perspective, which can guide future research on enhancing error analysis and mitigation.
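The abstract describes probing an LLM’s internal representations at specific tokens to detect errors. The following is a minimal sketch of that general idea, not the authors’ code: it trains a linear probe (logistic regression) on hidden states taken from one layer at the final token of each answer. The model name, layer index, and toy labeled examples are illustrative assumptions.

```python
# Minimal probing sketch (illustrative; not the paper's implementation).
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: any causal LM exposing hidden states
LAYER = 6            # assumption: a middle layer of the model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def hidden_at_last_token(text: str) -> np.ndarray:
    """Return the chosen layer's hidden state at the final token of `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # out.hidden_states: tuple of (num_layers + 1) tensors, each [1, seq_len, dim]
    return out.hidden_states[LAYER][0, -1].numpy()

# Toy labeled data: (model-generated answer, 1 if correct else 0) -- illustrative only.
examples = [
    ("The capital of France is Paris.", 1),
    ("The capital of France is Lyon.", 0),
    ("Water boils at 100 degrees Celsius at sea level.", 1),
    ("Water boils at 50 degrees Celsius at sea level.", 0),
]
X = np.stack([hidden_at_last_token(text) for text, _ in examples])
y = np.array([label for _, label in examples])

# Linear probe over hidden states: predicts truthfulness of the answer.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))
```

In practice one would extract states from a much larger labeled set, evaluate on held-out data, and compare token positions and layers; this sketch only shows the mechanics of reading hidden states at a chosen token and fitting a classifier on them.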