NLP with Transformers
Understanding the Transformer Architecture:
- Overview: The Transformer architecture, introduced by Vaswani et al. in the paper “Attention Is All You Need”, revolutionized NLP by removing the need for recurrent or convolutional layers.
- Key Components: A Transformer consists of an encoder and a decoder, each built from a stack of identical layers. The encoder processes the input sequence, and the decoder generates the output sequence.
- Key Advantages: Transformers process all positions of a sequence in parallel, capture long-range dependencies, and achieve state-of-the-art performance across a range of NLP tasks.
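As a rough illustration of the encoder-decoder layout described above, the following sketch uses PyTorch's built-in `nn.Transformer` module. The dimensions, layer counts, and random tensors are assumptions chosen only to keep the example small; the tensors stand in for sequences that have already been embedded.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed toy sizes, chosen only for illustration.
d_model, num_heads = 32, 4

model = nn.Transformer(
    d_model=d_model,
    nhead=num_heads,
    num_encoder_layers=2,
    num_decoder_layers=2,
    dim_feedforward=64,
    batch_first=True,  # tensors are laid out as (batch, seq_len, d_model)
)

src = torch.rand(2, 7, d_model)  # encoder input: 2 sequences of 7 embedded tokens
tgt = torch.rand(2, 5, d_model)  # decoder input: 2 sequences of 5 embedded tokens

# The encoder reads src, the decoder attends to the encoder output while
# processing tgt. Every position is handled in a single forward pass.
out = model(src, tgt)
print(out.shape)  # torch.Size([2, 5, 32])
```

The output has one vector per decoder position, so its shape follows `tgt` rather than `src`.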
Self-Attention Mechanism:
- Overview: Self-attention lets a model weigh the importance of every word in a sequence relative to every other word, without relying on a fixed-length context window.
- Key Concepts: Self-attention computes query, key, and value vectors for each word in the sequence. Comparing a word's query with the keys of the other words yields attention scores that indicate how relevant each of those words is; the scores are then used to weight the value vectors.
- Attention Scores: The Transformer uses scaled dot-product attention: the dot product of each query with every key is divided by the square root of the key dimension and passed through a softmax. The resulting weights capture word dependencies.
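The computation described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not a full attention layer: the function name `scaled_dot_product_attention` and the toy tensor sizes are assumptions, and the input is reused directly as queries, keys, and values, whereas a real model would first apply learned linear projections.

```python
import math

import torch


def scaled_dot_product_attention(queries, keys, values):
    """Return (output, weights) for scaled dot-product attention.

    queries: (..., len_q, d_k), keys: (..., len_k, d_k), values: (..., len_k, d_v)
    """
    d_k = queries.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = torch.matmul(queries, keys.transpose(-2, -1)) / math.sqrt(d_k)
    # Softmax over the key axis turns each row of scores into weights that sum to 1.
    weights = torch.softmax(scores, dim=-1)
    # Each output vector is a weighted average of the value vectors.
    return torch.matmul(weights, values), weights


torch.manual_seed(0)
x = torch.rand(1, 4, 8)  # one sequence of 4 tokens, each an 8-dimensional vector

# Assumption for brevity: x is used as queries, keys, and values without projections.
output, weights = scaled_dot_product_attention(x, x, x)
print(output.shape)   # torch.Size([1, 4, 8])
print(weights.shape)  # torch.Size([1, 4, 4])
```

Each row of `weights` describes how strongly one token attends to every token in the sequence, including itself.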
Transformer Layers and Multi-Head Attention:
- Overview: To capture hierarchical representations of the input sequence, transformers stack many layers.
- Transformer Layer: Each Transformer layer consists of two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. In the original design, each sub-layer is wrapped in a residual connection followed by layer normalization.
- Multi-Head Attention: Multi-head attention increases the model’s representational capacity by letting it jointly attend to information from different representation subspaces at different positions.
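To make the multi-head idea concrete, here is a small sketch using PyTorch's built-in `nn.MultiheadAttention`. The sizes are assumed toy values; with a hidden size of 16 and 4 heads, each head works in a 4-dimensional subspace.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed toy sizes; hidden_dim must be divisible by num_heads.
hidden_dim, num_heads = 16, 4
head_dim = hidden_dim // num_heads  # size of the subspace each head attends in

attention = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

x = torch.rand(2, 5, hidden_dim)  # (batch, seq_len, hidden_dim)

# Self-attention: the same tensor is passed as query, key, and value.
attended, attn_weights = attention(x, x, x)

print(head_dim)            # 4
print(attended.shape)      # torch.Size([2, 5, 16])
print(attn_weights.shape)  # torch.Size([2, 5, 5])
```

By default the returned weights are averaged over the heads, which is why they have one `seq_len x seq_len` matrix per batch element rather than one per head.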
Positional Encoding:
- Overview: Because Transformers have no recurrent or convolutional components, they have no built-in notion of token order. Positional encoding adds position information to the input sequence.
- Encoding Techniques: A common technique uses sine and cosine functions of different frequencies to give the model a sense of word order.
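The sinusoidal scheme from the original Transformer paper sets PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)), where pos is the token position and d is the embedding size. A minimal sketch follows; the function name `sinusoidal_positional_encoding` and the toy sizes are assumptions, and the code assumes d is even.

```python
import math

import torch


def sinusoidal_positional_encoding(seq_len, hidden_dim):
    """Return a (seq_len, hidden_dim) tensor of sine/cosine position signals.

    Assumes hidden_dim is even.
    """
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    # 1 / 10000^(2i / hidden_dim) for each pair of columns, computed in log space.
    div_term = torch.exp(
        torch.arange(0, hidden_dim, 2, dtype=torch.float32)
        * (-math.log(10000.0) / hidden_dim)
    )
    encoding = torch.zeros(seq_len, hidden_dim)
    encoding[:, 0::2] = torch.sin(position * div_term)  # even columns
    encoding[:, 1::2] = torch.cos(position * div_term)  # odd columns
    return encoding


pe = sinusoidal_positional_encoding(seq_len=6, hidden_dim=8)
print(pe.shape)  # torch.Size([6, 8])
print(pe[0])     # position 0: sin(0) = 0 in even columns, cos(0) = 1 in odd columns
```

The encoding is added to the token embeddings, so two identical words at different positions end up with different input vectors.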
Sample Code (using Python and PyTorch):
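import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """Adds fixed sine/cosine position signals to the token embeddings."""

    def __init__(self, hidden_dim, max_len=5000):
        # Assumption: a sinusoidal encoding; max_len is an assumed upper bound
        # on sequence length and hidden_dim is assumed to be even.
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, hidden_dim, 2).float() * (-math.log(10000.0) / hidden_dim)
        )
        pe = torch.zeros(max_len, hidden_dim)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # A buffer moves with the module across devices but is not trained.
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, hidden_dim)

    def forward(self, inputs):
        return inputs + self.pe[:, :inputs.size(1)]


class Transformer(nn.Module):
    """Encoder-only stack: embedding, positional encoding, then num_layers layers."""

    def __init__(self, input_dim, hidden_dim, num_layers, num_heads):
        super().__init__()
        # input_dim is the vocabulary size; inputs are integer token ids.
        self.embedding = nn.Embedding(input_dim, hidden_dim)
        self.positional_encoding = PositionalEncoding(hidden_dim)
        self.transformer_layers = nn.ModuleList([
            TransformerLayer(hidden_dim, num_heads) for _ in range(num_layers)
        ])

    def forward(self, inputs):
        embedded = self.embedding(inputs)  # (batch, seq_len, hidden_dim)
        encoded = self.positional_encoding(embedded)
        for layer in self.transformer_layers:
            encoded = layer(encoded)
        return encoded


class TransformerLayer(nn.Module):
    """Self-attention and feed-forward sub-layers, each with residual + layer norm."""

    def __init__(self, hidden_dim, num_heads):
        super().__init__()
        self.self_attention = MultiHeadAttention(hidden_dim, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )
        # Each sub-layer gets its own normalization parameters.
        self.attention_norm = nn.LayerNorm(hidden_dim)
        self.feed_forward_norm = nn.LayerNorm(hidden_dim)

    def forward(self, inputs):
        attended = self.self_attention(inputs)
        normalized = self.attention_norm(inputs + attended)
        transformed = self.feed_forward(normalized)
        output = self.feed_forward_norm(normalized + transformed)
        return output


class MultiHeadAttention(nn.Module):
    """Multi-head scaled dot-product self-attention."""

    def __init__(self, hidden_dim, num_heads):
        super().__init__()
        if hidden_dim % num_heads != 0:
            raise ValueError("hidden_dim must be divisible by num_heads")
        self.hidden_dim = hidden_dim
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads
        self.query_projection = nn.Linear(hidden_dim, hidden_dim)
        self.key_projection = nn.Linear(hidden_dim, hidden_dim)
        self.value_projection = nn.Linear(hidden_dim, hidden_dim)
        self.final_projection = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, inputs):
        batch_size, seq_len, hidden_dim = inputs.size()
        queries = self.query_projection(inputs)
        keys = self.key_projection(inputs)
        values = self.value_projection(inputs)
        # Split into heads: (batch, seq_len, hidden_dim) -> (batch, num_heads, seq_len, head_dim)
        queries = queries.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        keys = keys.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        values = values.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product scores: (batch, num_heads, seq_len, seq_len)
        attention_scores = torch.matmul(queries, keys.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attention_weights = torch.softmax(attention_scores, dim=-1)
        # Weighted sum of values, then merge heads back to (batch, seq_len, hidden_dim)
        attended = torch.matmul(attention_weights, values)
        attended = attended.transpose(1, 2).contiguous().view(batch_size, seq_len, hidden_dim)
        output = self.final_projection(attended)
        return output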
This example uses PyTorch to sketch a simplified, encoder-only Transformer. It includes the key elements covered above: positional encoding, the Transformer layer, and multi-head attention.
Keep in mind that this is only a simple example; there are many variants and refinements to explore when working with Transformers in NLP.