Self-attention is a softmax function, it does not model position information. Paper Self-Attention with Relative Position Representations proposed a method on how to incorporate relative position in self-attention.
What is self-attention?
There are some methods to implement self-attention. This paper proposed one.
For example:
\(x = (x_1, . . . , x_n)\) is a sequence with \(n\) elements where \(x_i\in R^{d_x}\), and computes a new sequence \(z = (z_1, . . . , z_n)\) of the same length where \(z_i\in R^{d_z}\).
As to this kind of self-attention, we will get a attention score matrix with n*n. Here \(W^Q\), \(W^K\) and \(W^V \in R^{d_x * d_z}\) are trainable parameters.
What is relative position representations?
Relative position representations are defined based on a target, for example \(x_i\) in a sentence. In order to determine relative position representations, we should know how to compute relative position based on \(x_i\).
For example, \(x\) is the target, we can use equation below to compute relative postion.
clip(x, k) = max(−k, min(k, x))
Then we can create relation postion representations as below:
How to incorporate relative position representations in self-attention?
Here is the computation:
Here \(a^V_{ij}\) is the relative position representations \(x_j\) to target \(x_i\).