Hello everyone,
We will meet on Friday at 2 PM PST.
We're continuing to explore transformer interpretability. Our next paper is "Knowledge Neurons in Pretrained Transformers" (GitHub implementation). The paper shows how to identify where pretrained transformers store factual knowledge, and how to suppress, amplify, or modify that knowledge. If you're at all interested in how transformers learn and represent information, you're welcome to attend!
Abstract:
Large-scale pretrained language models are surprisingly good at recalling factual knowledge presented in the training corpus. In this paper, we explore how implicit knowledge is stored in pretrained Transformers by introducing the concept of knowledge neurons. Given a relational fact, we propose a knowledge attribution method to identify the neurons that express the fact. We find that the activation of such knowledge neurons is highly correlated with the expression of their corresponding facts. In addition, even without fine-tuning, we can leverage knowledge neurons to explicitly edit (e.g., update or erase) specific factual knowledge in pretrained Transformers.
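To give a flavor of the attribution idea before Friday: the paper scores each FFN neuron by integrating the model's gradient as that neuron's activation is scaled from zero up to its observed value (an integrated-gradients-style attribution). Below is a minimal toy sketch of that computation; the probability function, activations, and weights are invented for illustration and are not the paper's actual model or code.

```python
import numpy as np

def answer_prob(activations, weights):
    """Toy stand-in for P(correct answer): sigmoid over a linear readout."""
    return 1.0 / (1.0 + np.exp(-activations @ weights))

def attribution(activations, weights, i, steps=50):
    """Riemann approximation of the integrated gradient for neuron i:
    average the gradient of answer_prob w.r.t. neuron i while scaling
    its activation from ~0 up to its observed value, then multiply by
    the observed activation."""
    total = 0.0
    for k in range(1, steps + 1):
        scaled = activations.copy()
        scaled[i] = activations[i] * k / steps
        p = answer_prob(scaled, weights)
        total += p * (1.0 - p) * weights[i]  # d(sigmoid)/d(activation_i)
    return activations[i] * total / steps

# Hypothetical FFN activations for one "relational fact" prompt.
acts = np.array([2.0, -1.0, 0.5])
w = np.array([1.5, 0.3, -0.7])
scores = [attribution(acts, w, i) for i in range(len(acts))]
# Neurons with large attribution scores are the candidate "knowledge
# neurons" for the fact; editing them is what enables update/erase.
print(scores)
```

In this toy setup the first neuron dominates the readout, so it gets by far the largest attribution score, which is exactly the signal the paper uses to nominate knowledge neurons.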