LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations, by Hadas Orgad and 6 other authors
Abstract: Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures, collectively referred to as “hallucinations”. Recent studies have demonstrated that LLMs’ internal states encode information regarding the truthfulness of their outputs, and that this information can be utilized to detect errors. In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized. We first discover that the truthfulness information is concentrated in specific tokens, and leveraging this property significantly enhances error detection performance. Yet, we show that such error detectors fail to generalize across datasets, implying that — contrary to prior claims — truthfulness encoding is not universal but rather multifaceted. Next, we show that internal representations can also be used to predict the types of errors the model is likely to make, facilitating the development of tailored mitigation strategies. Lastly, we reveal a discrepancy between LLMs’ internal encoding and external behavior: they may encode the correct answer, yet consistently generate an incorrect one. Taken together, these insights deepen our understanding of LLM errors from the model’s internal perspective, which can guide future research on enhancing error analysis and mitigation.
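The abstract describes probing an LLM’s internal representations at specific tokens to detect errors. The following is a minimal sketch of that general idea, not the authors’ code: it trains a linear probe (logistic regression) on hidden states taken from one layer at the final token of each answer. The model name, layer index, and toy labeled examples are illustrative assumptions.

```python
# Minimal probing sketch (illustrative; not the paper's implementation).
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: any causal LM exposing hidden states
LAYER = 6            # assumption: a middle layer of the model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def hidden_at_last_token(text: str) -> np.ndarray:
    """Return the chosen layer's hidden state at the final token of `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # out.hidden_states: tuple of (num_layers + 1) tensors, each [1, seq_len, dim]
    return out.hidden_states[LAYER][0, -1].numpy()

# Toy labeled data: (model-generated answer, 1 if correct else 0) -- illustrative only.
examples = [
    ("The capital of France is Paris.", 1),
    ("The capital of France is Lyon.", 0),
    ("Water boils at 100 degrees Celsius at sea level.", 1),
    ("Water boils at 50 degrees Celsius at sea level.", 0),
]
X = np.stack([hidden_at_last_token(text) for text, _ in examples])
y = np.array([label for _, label in examples])

# Linear probe over hidden states: predicts truthfulness of the answer.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))
```

In practice one would extract states from a much larger labeled set, evaluate on held-out data, and compare token positions and layers; this sketch only shows the mechanics of reading hidden states at a chosen token and fitting a classifier on them.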