
Multi-Head Attention and Self-Attention of Transformers

[Figure: Transformer Architecture]

Multi-Head Attention and Self-Attention are key components of the Transformer architecture, introduced in the paper "Attention Is All You Need" by Vaswani et al. in 2017.

Self-Attention (or Intra-Attention)

Self-Attention is a mechanism that allows the model to attend to different parts of the input sequence simultaneously and weigh their importance. It's called "self" because the attention is applied to the input sequence itself, rather than to some external context.

Given an input sequence of tokens (e.g., words or characters), the Self-Attention mechanism computes the representation of each token in the sequence by attending to all other tokens. This is done by:

Query (Q): The input sequence is linearly transformed into a query matrix.
Key (K): The input sequence is linearly transformed into a key matrix.
Value (V): The input sequence is linearly transformed into a value matrix.
Compute Attention Weights: The dot product of Q and K is computed, followed by a softmax function to obtain attention weights.
Compute Output: The attention weights are multiplied with V to produce the output.

Mathematical Representation

Let's denote the input sequence as X = [x_1, x_2, ..., x_n], where x_i is a token embedding. The self-attention computation can be represented as:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
where d_k is the dimensionality of the key (and query) vectors. The scaling by sqrt(d_k) prevents the dot products from growing too large, which would push the softmax into regions with very small gradients.
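The steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration: the projection matrices W_q, W_k, and W_v are random here purely for demonstration, whereas a real model learns them during training.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence X of shape (n, d)."""
    Q = X @ W_q                          # queries
    K = X @ W_k                          # keys
    V = X @ W_v                          # values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) attention logits
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of values

# Toy example: 4 tokens, embedding dimension 8
rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8)
```

Note that every token's output is a weighted mixture of all tokens' value vectors, which is exactly what "attending to all other tokens" means.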


Multi-Head Attention

Multi-Head Attention is an extension of Self-Attention that allows the model to jointly attend to information from different representation subspaces at different positions.

The main idea is to:

Split the input sequence into multiple attention "heads."
Apply Self-Attention to each head independently.
Concatenate the outputs from all heads.
Linearly transform the concatenated output.

Multi-Head Attention Mechanism

Split: The input sequence is split into h attention heads, each with a smaller dimensionality (d/h).
Apply Self-Attention: Self-Attention is applied to each head independently.
Concat: The outputs from all heads are concatenated.
Linear Transform: The concatenated output is linearly transformed.

Mathematical Representation

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W^O
where head_i = Attention(Q * W_i^Q, K * W_i^K, V * W_i^V)
W_i^Q, W_i^K, W_i^V, and W^O are learnable linear projection matrices.
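Extending the earlier sketch to multiple heads is mostly bookkeeping: project into h smaller subspaces, run attention in each, concatenate, and project back. Again, all weight matrices here are random placeholders, not learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(X, heads_qkv, W_o):
    """heads_qkv: one (W_q, W_k, W_v) triple per head, each (d, d/h).
    W_o: output projection of shape (h * d/h, d)."""
    outputs = []
    for W_q, W_k, W_v in heads_qkv:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        d_k = Q.shape[-1]
        weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
        outputs.append(weights @ V)           # one head's output
    # Concat(head_1, ..., head_h) * W^O
    return np.concatenate(outputs, axis=-1) @ W_o

rng = np.random.default_rng(1)
n, d, h = 4, 8, 2
d_head = d // h  # each head works in a d/h-dimensional subspace
X = rng.normal(size=(n, d))
heads = [tuple(rng.normal(size=(d, d_head)) for _ in range(3))
         for _ in range(h)]
W_o = rng.normal(size=(h * d_head, d))
out = multi_head_attention(X, heads, W_o)
print(out.shape)  # (4, 8)
```

Because each head computes its own attention weights, one head can focus on, say, nearby tokens while another attends to distant ones.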

Benefits

Multi-Head Attention and Self-Attention provide several benefits:
Parallelization: Self-Attention allows for parallel computation, unlike recurrent neural networks (RNNs).
Expressiveness: Multi-Head Attention lets different heads attend to different representation subspaces, so the model can capture several kinds of relationships (e.g., local syntax and long-range dependencies) at once.
Improved Performance: Transformer models with Multi-Head Attention have achieved state-of-the-art results in various natural language processing tasks.

Transformer Architecture

The Transformer architecture consists of:
Encoder: a stack of identical layers, each with two sub-layers: a Self-Attention mechanism and a position-wise Feed Forward Network (FFN).
Decoder: a stack of identical layers, each with three sub-layers: masked Self-Attention, Encoder-Decoder (cross) Attention, and an FFN.
Each sub-layer is wrapped in a residual connection followed by layer normalization.
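A single encoder layer can be sketched as follows. This is a simplification: it reuses the single-head attention from above, omits bias terms and the learnable layer-norm gain/shift, and uses random weights, but it shows how attention, the FFN, residual connections, and normalization fit together.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token's features to zero mean and unit variance.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_layer(X, W_q, W_k, W_v, W1, W2):
    # Sub-layer 1: self-attention, then residual connection + layer norm
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V
    X = layer_norm(X + attn)
    # Sub-layer 2: position-wise FFN (ReLU between two linear maps),
    # again with residual connection + layer norm
    ffn = np.maximum(0, X @ W1) @ W2
    return layer_norm(X + ffn)

rng = np.random.default_rng(2)
n, d, d_ff = 4, 8, 32
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d))
out = encoder_layer(X, W_q, W_k, W_v, W1, W2)
print(out.shape)  # (4, 8)
```

A full encoder simply stacks several such layers; a decoder layer would add a masked self-attention and a cross-attention sub-layer before the FFN.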

The Transformer architecture has revolutionized the field of natural language processing and has been widely adopted for various tasks, including machine translation, text generation, and question answering.
