Hello everyone,
If you've ever asked yourself, "Okay, but what do transformers actually do?", this is a great paper to read. Newcomers are welcome!
Paper abstract:
Feed-forward layers constitute two-thirds of a transformer model's
parameters, yet their role in the network remains under-explored. We show that
feed-forward layers in transformer-based language models operate as key-value
memories, where each key correlates with textual patterns in the training
examples, and each value induces a distribution over the output vocabulary. Our
experiments show that the learned patterns are human-interpretable, and that
lower layers tend to capture shallow patterns, while upper layers learn more
semantic ones. The values complement the keys' input patterns by inducing
output distributions that concentrate probability mass on tokens likely to
appear immediately after each pattern, particularly in the upper layers.
Finally, we demonstrate that the output of a feed-forward layer is a
composition of its memories, which is subsequently refined throughout the
model's layers via residual connections to produce the final output
distribution.
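
For newcomers, here is a rough NumPy sketch of the key-value picture the abstract describes. It is an illustration under stated assumptions, not the authors' code: the weights are random rather than trained, biases are omitted, ReLU is assumed as the activation, and all names (`ffn_as_memory`, `K`, `V`, `E`) and sizes are made up for this example.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ffn_as_memory(x, K, V):
    # Each row of K is a "key": its dot product with x measures how strongly
    # the input matches that key. ReLU (assumed here) zeroes out non-matches.
    m = np.maximum(x @ K.T, 0.0)   # memory coefficients, shape (d_mem,)
    # The layer output is the coefficient-weighted sum of the "value" rows of V.
    return m, m @ V

# Hypothetical toy sizes and random weights, purely for illustration.
rng = np.random.default_rng(0)
d_model, d_mem, vocab = 8, 32, 50
K = rng.normal(size=(d_mem, d_model))   # keys
V = rng.normal(size=(d_mem, d_model))   # values
E = rng.normal(size=(vocab, d_model))   # stand-in for the output embedding matrix
x = rng.normal(size=d_model)            # one hidden state

m, out = ffn_as_memory(x, K, V)

# "Composition of its memories": the same output, written as an explicit sum.
composed = sum(m[i] * V[i] for i in range(d_mem))

# "Each value induces a distribution over the output vocabulary":
# project every value vector onto the vocabulary and normalize.
value_dists = softmax(V @ E.T)          # shape (d_mem, vocab), rows sum to 1
```

In a trained model, the claim is that the tokens with the highest probability in a row of `value_dists` tend to be plausible next tokens for the text patterns that activate the matching key. With random weights, as here, only the mechanics are visible.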