
Multi-Head Attention and Self-Attention of Transformers

 

[Figure: The Transformer architecture]


Multi-Head Attention and Self-Attention are key components of the Transformer architecture, introduced in the paper "Attention Is All You Need" by Vaswani et al. in 2017.

Self-Attention (or Intra-Attention)

Self-Attention is a mechanism that allows the model to attend to different parts of the input sequence simultaneously and weigh their importance. It's called "self" because the attention is applied to the input sequence itself, rather than to some external context.

Given an input sequence of tokens (e.g., words or characters), the Self-Attention mechanism computes the representation of each token in the sequence by attending to all other tokens. This is done by:

Query (Q): The input sequence is linearly transformed into a query matrix.
Key (K): The input sequence is linearly transformed into a key matrix.
Value (V): The input sequence is linearly transformed into a value matrix.
Compute Attention Weights: The dot product of Q and K^T is computed, scaled by 1/sqrt(d_k), and passed through a softmax function to obtain attention weights.
Compute Output: The attention weights are multiplied with V to produce the output.
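
For concreteness, here is a minimal NumPy sketch of the first three steps, the Q, K, and V projections; the sequence length, embedding size, and weight matrices are arbitrary placeholders standing in for learned parameters, not values from the paper. The attention computation itself is sketched after the formula below.

import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                      # sequence length and embedding size (arbitrary)
X = rng.normal(size=(n, d))      # token embeddings, one row per token

# Learned projection matrices (random placeholders here)
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))

Q = X @ W_q   # queries
K = X @ W_k   # keys
V = X @ W_v   # values
print(Q.shape, K.shape, V.shape)   # (4, 8) (4, 8) (4, 8)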

Mathematical Representation

Let's denote the input sequence as X = [x_1, x_2, ..., x_n], where x_i is a token embedding. The self-attention computation can be represented as:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
where d_k is the dimensionality of the key (and query) vectors. Dividing by sqrt(d_k) keeps the dot products from growing too large and saturating the softmax.
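
The formula maps almost line for line onto code. Below is a minimal NumPy sketch of scaled dot-product attention; the function and variable names are my own, not from the paper.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (n, n): similarity of every query to every key
    weights = softmax(scores, axis=-1)     # each row is a probability distribution over tokens
    return weights @ V                     # weighted average of the value vectors

# With the Q, K, V from the projection sketch above:
#   out = attention(Q, K, V)   # (4, 8): one new representation per token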


Multi-Head Attention

Multi-Head Attention is an extension of Self-Attention that allows the model to jointly attend to information from different representation subspaces at different positions.

The main idea is to:

Project the input into multiple attention "heads," each operating on a lower-dimensional subspace of the representation.
Apply Self-Attention to each head independently.
Concatenate the outputs from all heads.
Linearly transform the concatenated output.

Multi-Head Attention Mechanism

Split: The query, key, and value representations are split along the feature dimension into h heads, each with a smaller dimensionality (d/h).
Apply Self-Attention: Self-Attention is applied to each head independently.
Concat: The outputs from all heads are concatenated.
Linear Transform: The concatenated output is linearly transformed.

Mathematical Representation

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W^O
where head_i = Attention(Q * W_i^Q, K * W_i^K, V * W_i^V)
and W_i^Q, W_i^K, W_i^V, and W^O are learnable linear projection matrices.
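
Combining the two formulas, here is a hedged NumPy sketch of Multi-Head Attention using the split-along-the-feature-dimension implementation described above, which is how the per-head projections W_i^Q, W_i^K, W_i^V are typically batched in practice; every name, shape, and weight below is an illustrative assumption.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (n, d); each W_*: (d, d); h must divide d."""
    n, d = X.shape
    d_h = d // h                                       # per-head dimensionality

    def split_heads(M):
        # (n, d) -> (h, n, d_h): slice the feature dimension into h heads
        return M.reshape(n, h, d_h).transpose(1, 0, 2)

    Q, K, V = (split_heads(X @ W) for W in (W_q, W_k, W_v))

    # Scaled dot-product attention, applied to every head in parallel
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)   # (h, n, n)
    heads = softmax(scores, axis=-1) @ V               # (h, n, d_h)

    # Concatenate the heads and apply the output projection W^O
    concat = heads.transpose(1, 0, 2).reshape(n, d)    # (n, d)
    return concat @ W_o

# Tiny usage example with random placeholder weights
rng = np.random.default_rng(0)
n, d, h = 4, 8, 2
X = rng.normal(size=(n, d))
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, h).shape)   # (4, 8)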

Benefits

Multi-Head Attention and Self-Attention provide several benefits:
Parallelization: Self-Attention can be computed for all positions in a sequence at once, unlike recurrent neural networks (RNNs), which must process tokens one at a time.
Expressiveness: Multi-Head Attention lets different heads attend to different positions and representation subspaces, helping the model capture complex patterns and relationships.
Improved Performance: Transformer models built on Multi-Head Attention have achieved state-of-the-art results on a wide range of natural language processing tasks.

Transformer Architecture

The Transformer architecture consists of:
Encoder: A stack of identical layers, each comprising a Self-Attention sub-layer and a position-wise Feed Forward Network (FFN).
Decoder: A stack of identical layers, each comprising (masked) Self-Attention, Encoder-Decoder Attention over the encoder output, and an FFN.
Each Encoder layer therefore has two sub-layers and each Decoder layer has three; a residual connection followed by layer normalization is applied around every sub-layer.
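
To show how these pieces fit together, here is a minimal NumPy sketch of a single Encoder layer following the sub-layer pattern above. For brevity it uses single-head self-attention, omits the learnable gain and bias of layer normalization, and uses random placeholder weights, so it illustrates the structure rather than a faithful implementation.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Simplified layer normalization (no learnable gain/bias)
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

def ffn(X, W1, b1, W2, b2):
    return np.maximum(0, X @ W1 + b1) @ W2 + b2        # two linear layers with a ReLU in between

def encoder_layer(X, p):
    # Sub-layer 1: self-attention, wrapped in a residual connection and layer norm
    X = layer_norm(X + self_attention(X, p["W_q"], p["W_k"], p["W_v"]))
    # Sub-layer 2: position-wise feed-forward network, wrapped the same way
    return layer_norm(X + ffn(X, p["W1"], p["b1"], p["W2"], p["b2"]))

# Tiny usage example with random placeholder parameters
rng = np.random.default_rng(0)
n, d, d_ff = 4, 8, 16
p = {"W_q": rng.normal(size=(d, d)), "W_k": rng.normal(size=(d, d)), "W_v": rng.normal(size=(d, d)),
     "W1": rng.normal(size=(d, d_ff)), "b1": np.zeros(d_ff),
     "W2": rng.normal(size=(d_ff, d)), "b2": np.zeros(d)}
print(encoder_layer(rng.normal(size=(n, d)), p).shape)   # (4, 8)

A Decoder layer follows the same pattern but inserts the Encoder-Decoder attention sub-layer between the two shown here.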

The Transformer architecture has revolutionized the field of natural language processing and has been widely adopted for various tasks, including machine translation, text generation, and question answering.
