Self-attention In AI And Why It Matters

Self-attention – sometimes referred to as intra-attention – is a machine learning mechanism that relates different positions of a sequence to compute a representation of that sequence. In natural language processing (NLP), this process usually considers the relationship between words in the same sentence.

Understanding self-attention

Self-attention describes a transformer model’s ability to attend to various parts of an input sequence when making predictions.

Self-attention was popularized by Google Research and Google Brain staff in the 2017 paper “Attention Is All You Need”, which introduced the transformer. Attention mechanisms more generally were proposed in response to problems the encoder-decoder model encountered with long sequences. Such models encode the entire input sequence into a single fixed-length vector from which every output time step is decoded, and attention was designed to remove that bottleneck.

Self-attention mechanisms work differently. In simple terms, they process n inputs and return n outputs. The mechanism allows the inputs to interact with each other (“self”) in order to determine what they should focus on (“attention”). Each output is an aggregate of all the inputs, weighted by attention scores that are calculated with respect to a single input.

Put differently, self-attention looks at the entire context of a sequence while each input element is processed. While encoder-decoder models that rely on a fixed-length vector sometimes “forget” facts if the information window is too large, self-attention gives every element direct access to every other element, so relevant information is retained however far apart the elements are.

The three components of self-attention

To better understand how self-attention works, it is worth describing three fundamental components.

Queries, keys, and values

Queries, keys, and values are three different representations derived from the model inputs. If a user searches a term in Google, for example, the text they enter in the search box is the query. The search results (in the form of article and video titles) are the keys, while the content inside each result is the value.

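To find the best matches, the query has to determine how similar it is to each key. In transformer models this is usually done with a scaled dot product of the two vectors, where a larger value means a closer match. The dot product is closely related to cosine similarity, which normalizes for vector length and measures the similarity between two vectors on a scale of -1 to 1, where -1 is the most dissimilar and 1 is the most similar.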
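The comparison between a query and its keys can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the vectors and the names `cosine_similarity` and `scaled_dot_product` are made up for the example. It shows cosine similarity next to the scaled dot product, which is the score that transformer attention typically computes.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Similarity on a scale of -1 (opposite) to 1 (same direction)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def scaled_dot_product(query, key):
    """Dot product divided by sqrt(dimension); not bounded to [-1, 1]."""
    return dot(query, key) / math.sqrt(len(key))

# Hypothetical 4-dimensional query and keys, chosen only for illustration.
query = [1.0, 0.0, 1.0, 0.0]
keys = {
    "similar": [2.0, 0.0, 2.0, 0.0],
    "unrelated": [0.0, 1.0, 0.0, 1.0],
    "opposite": [-1.0, 0.0, -1.0, 0.0],
}

for name, key in keys.items():
    print(name, cosine_similarity(query, key), scaled_dot_product(query, key))
```

Both measures rank the keys in the same order here, but only cosine similarity ignores vector length.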

Positional encoding

Before textual data is fed into machine learning models, it must first be converted into numbers. An embedding layer converts each word into a vector of fixed length, and each word is listed in a lookup table alongside its associated vector.
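As a rough sketch of the lookup-table idea, the snippet below maps each word of a tiny, made-up vocabulary to a fixed-length vector. The vocabulary, the dimension of 4, and the random initial values are all assumptions for illustration; in a real model the vectors are learned during training.

```python
import random

def build_embedding_table(vocabulary, dim, seed=0):
    """Map every word to a fixed-length vector (random here, learned in practice)."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for word in vocabulary}

def embed(sentence, table):
    """Look up the vector for each word of a whitespace-separated sentence."""
    return [table[word] for word in sentence.lower().split()]

# Hypothetical vocabulary and sentence.
vocabulary = ["the", "cat", "sat", "on", "mat"]
table = build_embedding_table(vocabulary, dim=4)
vectors = embed("The cat sat on the mat", table)

print(len(vectors), "words,", len(vectors[0]), "numbers per word")
```

Note that both occurrences of “the” receive exactly the same vector, which is why information about position has to be added separately.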

Positional encoding is necessary because, unlike recurrent models that process inputs one at a time (sequentially), transformer models process all inputs at the same time and therefore have no built-in notion of order. While the details are beyond the scope of this article, positional encoding adds information about each word’s position to its embedding, which helps transformer models work quickly without losing information about the word order.
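One common way to do this, assumed here because the article does not name a specific scheme, is the sinusoidal encoding used in the original transformer paper. The sketch below builds that encoding and adds it to a list of embeddings. The function names are hypothetical.

```python
import math

def positional_encoding(seq_len, dim):
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    encoding = []
    for pos in range(seq_len):
        row = []
        for i in range(dim):
            # Each pair of dimensions (2k, 2k+1) shares one wavelength.
            angle = pos / (10000 ** ((2 * (i // 2)) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        encoding.append(row)
    return encoding

def add_position(embeddings):
    """Return position-aware embeddings (embedding + positional encoding)."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]

# The same (made-up) word vector at two different positions.
same_word_twice = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
position_aware = add_position(same_word_twice)
print(position_aware[0])
print(position_aware[1])
```

After the addition, the two identical word vectors differ, so the model can tell the first occurrence from the second even though all positions are processed at once.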

Passing queries, keys, and values

Position-aware input sequences are fed into the query layer, and copies of the same sequences are also fed to the key and value layers. Why should this be so?

The answer has to do with self-attention. The input sequence is passed to the input embedding layer, where positional encoding is performed. The position-aware embeddings are then passed to the query and key layers, and the outputs of the two are multiplied together in what is called the matrix multiplication step. The result of this multiplication is called the attention filter.

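Because the query and key layers start out with random weights, the attention filter initially holds a matrix of essentially random numbers, which become more meaningful over time as the model is trained. These numbers are the attention scores. They are scaled and then passed through a softmax function, which converts each row into values between 0 and 1 that sum to 1, to derive the final attention filter.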
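The conversion of raw scores into values between 0 and 1 is commonly done with the softmax function. The sketch below assumes softmax is the conversion in question and uses made-up scores for one row of the attention filter.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw attention scores of one word against three words.
raw_scores = [2.0, 1.0, 0.1]
weights = softmax(raw_scores)
print(weights)
```

Higher raw scores receive larger weights, and because the weights sum to 1 they can be read as the share of attention each word receives.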

In the last step, the attention filter is multiplied by the value matrix. The filter, as the name suggests, gives more weight to relevant elements and down-weights irrelevant ones, so that the model concentrates on the parts of the input that matter most.

The result of the multiplication is then passed to a linear layer to obtain the desired output.
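The whole sequence of steps described in this section can be sketched end to end as follows. This is a simplified, single-head illustration under several assumptions: the weight matrices `w_q`, `w_k`, `w_v`, and `w_o` are random and untrained, the input `x` stands in for position-aware embeddings, and details such as masking and multiple heads are left out. All names are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def self_attention(x, w_q, w_k, w_v, w_o):
    # The same input is fed to the query, key, and value layers.
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    # Matrix multiplication step: every query is compared with every key.
    scale = math.sqrt(len(k[0]))
    scores = [[s / scale for s in row] for row in matmul(q, transpose(k))]
    # Convert each row of scores into weights between 0 and 1.
    attention_filter = [softmax(row) for row in scores]
    # Apply the filter to the values, then the final linear layer.
    context = matmul(attention_filter, v)
    return matmul(context, w_o), attention_filter

rng = random.Random(0)
n, d_model = 3, 4  # 3 input positions, 4 numbers per position
x = random_matrix(n, d_model, rng)
w_q, w_k, w_v, w_o = (random_matrix(d_model, d_model, rng) for _ in range(4))

output, attention_filter = self_attention(x, w_q, w_k, w_v, w_o)
print(len(output), "outputs for", n, "inputs")
```

The attention filter has one row and one column per input position, and the function returns as many output vectors as it received inputs.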

Where is self-attention useful?

Self-attention enables transformer models to attend to different parts of the same input sequence and is thus an important aspect of their performance. This ability is particularly relevant to NLP tasks where the model needs to understand the relationship between the various elements of input and output sequences. 

To that end, self-attention has been used successfully in tasks such as abstractive summarization, image description generation, textual entailment, reading comprehension, and learning task-independent sentence representations.

Key takeaways

  • Self-attention describes a transformer model’s ability to attend to various parts of an input sequence when making predictions.
  • Self-attention looks at the entire context of a sequence while each input element is processed. While encoder-decoder models that rely on a fixed-length vector sometimes “forget” facts if the window of information is too large, self-attention gives every element direct access to every other element, so relevant information is retained however far apart the elements are.
  • Self-attention’s ability to attend to different parts of the same input makes transformer models suited to a range of NLP tasks such as image description generation and abstractive summarization.


Connected Business Model Analyses

Large Language Models

Large language models (LLMs) are AI tools that can read, summarize, and translate text. This enables them to predict words and craft sentences that reflect how humans write and speak.

Prompt Engineering

Prompt engineering is a natural language processing (NLP) concept that involves discovering inputs that yield desirable or useful results. As with most processes, the quality of the inputs determines the quality of the outputs in prompt engineering. Designing effective prompts increases the likelihood that the model will return a response that is both favorable and contextual. Developed by OpenAI, the CLIP (Contrastive Language-Image Pre-training) model is an example of a model that utilizes prompts to classify images, having been trained on over 400 million image-caption pairs.

OpenAI Organizational Structure

OpenAI is an artificial intelligence research laboratory that transitioned into a for-profit organization in 2019. The corporate structure is organized around two entities: OpenAI, Inc., which is a single-member Delaware LLC controlled by the OpenAI non-profit, and OpenAI LP, which is a capped, for-profit organization. OpenAI LP is governed by the board of OpenAI, Inc. (the foundation), which acts as a General Partner. At the same time, Limited Partners comprise employees of the LP, some of the board members, and other investors like Reid Hoffman’s charitable foundation, Khosla Ventures, and Microsoft, the leading investor in the LP.

OpenAI Business Model

OpenAI has built the foundational layer of the AI industry. With large generative models like GPT-3 and DALL-E, OpenAI offers API access to businesses that want to develop applications on top of its foundational models while being able to plug these models into their products and customize them with proprietary data and additional AI features. OpenAI also released ChatGPT, which is built around a freemium model. Microsoft also commercializes OpenAI products through its commercial partnership.


OpenAI and Microsoft partnered up from a commercial standpoint. The history of the partnership started in 2016 and consolidated in 2019, with Microsoft investing a billion dollars into the partnership. It’s now taking a leap forward, with Microsoft in talks to put $10 billion into this partnership. Microsoft, through OpenAI, is developing its Azure AI Supercomputer while enhancing its Azure Enterprise Platform and integrating OpenAI’s models into its business and consumer products (GitHub, Office, Bing).

Stability AI Business Model

Stability AI is the entity behind Stable Diffusion. Stability AI makes money from its AI products and from providing AI consulting services to businesses. It monetizes Stable Diffusion via DreamStudio’s APIs, while also releasing it as open source for anyone to download and use. Stability AI also makes money via enterprise services, where its core development team offers enterprise customers the chance to service, scale, and customize Stable Diffusion or other large generative models to their needs.
