Self-attention – sometimes referred to as intra-attention – is a machine learning mechanism that relates different positions of a single sequence in order to compute a representation of that sequence. In natural language processing (NLP), this usually means modeling how the words of a sentence relate to one another.
Understanding self-attention
Self-attention describes a transformer model’s ability to attend to various parts of an input sequence when making predictions.
The idea of self-attention was popularized by Google Research and Google Brain researchers in the 2017 paper “Attention Is All You Need”, in response to problems the encoder-decoder model encountered with long sequences. Attention mechanisms were originally proposed to avoid forcing models to compress the entire input sequence into a single fixed-length vector from which each output time step is decoded.
Self-attention mechanisms work differently. In simple terms, they process n inputs and return n outputs. The mechanism allows the inputs to interact with each other (“self”) in order to determine what they should focus on (“attention”). Each output is then a weighted aggregate of all the inputs, where the weights – the attention scores – express how relevant every other input is to the one currently being processed.
Put differently, self-attention looks at the entire context of a sequence whilst its elements are decoded. While encoder-decoder models sometimes “forget” facts when the input sequence grows too long, self-attention lets each output draw directly on whichever parts of the input it needs, so the window of information retention is only as large as it needs to be.
The three components of self-attention
To better understand how self-attention works, it is worth describing three fundamental components.
Queries, keys, and values
Queries, keys, and values are three different roles the model’s inputs can play. If a user searches for a term in Google, for example, the text they enter in the search box is the query. The search results (in the form of article and video titles) are the keys, while the content inside each result is the value.
To find the best matches, the query has to determine how similar it is to each key. Conceptually, this can be done with cosine similarity, a mathematical way to measure how alike two vectors are on a scale of -1 to 1, where -1 is most dissimilar and 1 is most similar. (In practice, transformers use the closely related scaled dot product.)
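As a toy illustration, the snippet below scores one query vector against two key vectors with cosine similarity. It is a minimal NumPy sketch with made-up numbers, not code from any particular model:

```python
import numpy as np

def cosine_similarity(q, k):
    """Cosine similarity between a query and a key, on a scale of -1 to 1."""
    return np.dot(q, k) / (np.linalg.norm(q) * np.linalg.norm(k))

query = np.array([0.2, 0.9, 0.4])    # toy query embedding
key_a = np.array([0.1, 0.8, 0.5])    # a key pointing in a similar direction
key_b = np.array([-0.7, 0.1, -0.3])  # a key pointing elsewhere

print(cosine_similarity(query, key_a))  # ~0.99 -> strong match
print(cosine_similarity(query, key_b))  # ~-0.22 -> poor match
```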
Positional encoding
Before textual data is fed into machine learning models, it must first be converted into numbers. An embedding layer converts each word into a vector of fixed length; every word in the vocabulary is listed in a lookup table alongside its associated vector.
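As a hypothetical example, an embedding lookup can be sketched in a few lines of NumPy. The three-word vocabulary and random vectors below are purely illustrative, since a trained model would learn these values:

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}            # toy lookup table
embedding_table = np.random.rand(len(vocab), 16)  # one fixed-length (16-dim) vector per word

sentence = ["the", "cat", "sat"]
embeddings = embedding_table[[vocab[word] for word in sentence]]  # shape (3, 16)
```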
Positional encoding is necessary because, unlike models that process inputs one at a time (sequentially), transformer models embed all inputs at the same time. Positional encoding adds information about each word’s position to its embedding, allowing transformer models to work in parallel without losing information about word order.
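For the curious, here is a short sketch of the sinusoidal positional encoding used in the original transformer paper (one of several possible schemes); the sequence length and dimensionality are arbitrary toy values:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: even dimensions use sine, odd dimensions use cosine."""
    positions = np.arange(seq_len)[:, None]   # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Position information is simply added to the word embeddings:
embeddings = np.random.rand(3, 16)  # 3 tokens, 16 dimensions
position_aware = embeddings + positional_encoding(3, 16)
```

Because each position produces a unique pattern of sines and cosines, the model can recover word order even though all tokens are processed in parallel.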
Passing queries, keys, and values
The position-aware input sequence is fed into the query layer, but copies of it are also fed into the key and value layers. Why should this be so?
The answer has to do with self-attention. The input sequence is passed through the input embedding layer, where positional encoding is added. The position-aware embeddings are then passed to the query and key layers, whose outputs are multiplied together in what is called the matrix multiplication step. The result of this multiplication is called the attention filter.
At the start of training, the attention filter is a matrix of essentially random numbers that becomes more meaningful as the model is trained. These numbers are the raw attention scores, which are then passed through a softmax function that converts them into values between 0 and 1 to derive the final attention filter.
In the last step, the attention filter is multiplied by the initial value matrix. The filter, as the name suggests, amplifies the elements the model should focus on and down-weights irrelevant ones, so that finite computational resources are spent where they matter.
The result of the multiplication is then passed to a linear layer to obtain the desired output.
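Putting these steps together, here is a minimal NumPy sketch of the whole flow. The dimensions are toy values, and the randomly initialized matrices stand in for the learned query, key, value, and output layers of a real model:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)    # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, W_q, W_k, W_v, W_o):
    """Scaled dot-product self-attention over one sequence of embeddings x."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v      # query, key, and value layers
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # the matrix multiplication step
    attn_filter = softmax(scores)            # attention scores squashed to 0..1
    attended = attn_filter @ V               # the filter is applied to the values
    return attended @ W_o                    # final linear layer

rng = np.random.default_rng(0)
n, d = 6, 16                                 # 6 tokens, model dimension 16
x = rng.normal(size=(n, d))                  # stand-in for position-aware embeddings
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) for _ in range(4))
out = self_attention(x, W_q, W_k, W_v, W_o)  # n inputs in, n outputs out: shape (6, 16)
```

Note that each row of the attention filter sums to 1, so every output is a weighted blend of all the value vectors.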
Where is self-attention useful?
Self-attention enables transformer models to attend to different parts of the same input sequence and is thus an important aspect of their performance. This ability is particularly relevant to NLP tasks where the model needs to understand the relationship between the various elements of input and output sequences.
To that end, self-attention has been used successfully in tasks such as abstractive summarization, image description generation, textual entailment, reading comprehension, and task-independent sentence representation.
Key takeaways
- Self-attention describes a transformer model’s ability to attend to various parts of an input sequence when making predictions.
- Self-attention looks at the entire context of a sequence whilst its elements are decoded. While encoder-decoder models and their underlying neural networks sometimes “forget” facts when the input sequence grows too long, self-attention ensures the window of information retention is only as large as it needs to be.
- Self-attention’s ability to attend to different parts of the same input makes transformer models suited to a range of NLP tasks such as image description generation and abstractive summarization.
Read Next: History of OpenAI, AI Business Models, AI Economy.
Connected Business Model Analyses: AI Paradigm, OpenAI Organizational Structure, Stability AI Ecosystem.