[NLP with Transformers] Fundamentals of Transformers

Understanding the Transformer Architecture:

  • Overview: The Transformer architecture, introduced by Vaswani et al. in the paper “Attention Is All You Need”, revolutionized NLP tasks by removing the need for recurrent or convolutional layers.
  • Key Components: A Transformer consists of an encoder and a decoder, each built from a stack of identical layers. The encoder processes the input sequence, while the decoder generates the output sequence (see the snippet after this list).
  • Key Advantages: Transformers allow parallel computation across sequence positions, capture long-range dependencies, and deliver state-of-the-art performance on a wide range of NLP tasks.
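
As a quick orientation, PyTorch’s built-in nn.Transformer module bundles this encoder and decoder structure; the snippet below is a minimal sketch, and the hyperparameter values are illustrative only:

import torch
import torch.nn as nn

# Illustrative hyperparameters; they are not prescribed by the text above.
model = nn.Transformer(
    d_model=512,           # embedding size shared by encoder and decoder
    nhead=8,               # number of attention heads
    num_encoder_layers=6,  # stack of identical encoder layers
    num_decoder_layers=6,  # stack of identical decoder layers
    batch_first=True,
)

src = torch.rand(2, 10, 512)  # (batch, source length, d_model)
tgt = torch.rand(2, 7, 512)   # (batch, target length, d_model)
out = model(src, tgt)         # decoder output: (2, 7, 512)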

Self-Attention Mechanism:

  • Overview: Self-attention enables the model to weigh the importance of every word in a sequence relative to every other word, without relying on fixed-length context windows.
  • Key Concepts: Self-attention computes query, key, and value vectors for each word in the sequence; comparing a word’s query against every other word’s key produces attention scores that indicate how significant those words are to it.
  • Attention Scores: The attention scores are computed with dot-product or scaled dot-product attention and capture the dependencies between words (see the sketch after this list).
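
As a minimal sketch, scaled dot-product attention can be written directly in PyTorch (the function name and tensor shapes below are illustrative, not part of any library API):

import torch

def scaled_dot_product_attention(queries, keys, values):
    # scores[..., i, j] measures how strongly position i attends to position j.
    d_k = queries.size(-1)
    scores = torch.matmul(queries, keys.transpose(-2, -1)) / d_k ** 0.5
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return torch.matmul(weights, values), weights

# Example: a batch of 2 sequences of length 5 with 64-dimensional Q, K, V.
q = k = v = torch.rand(2, 5, 64)
context, attention_weights = scaled_dot_product_attention(q, k, v)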

Transformer Layers and Multi-Head Attention:

  • Overview: Transformers stack many identical layers to capture increasingly abstract, hierarchical representations of the input sequence.
  • Transformer Layer: Each Transformer layer consists of two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network, each wrapped with a residual connection and layer normalization.
  • Multi-Head Attention: Multi-head attention increases the model’s representational capacity by allowing it to jointly attend to information from different representation subspaces at different positions (see the snippet after this list).
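
In PyTorch, a minimal sketch of such a layer can use the built-in nn.TransformerEncoderLayer, which bundles both sub-layers, and nn.MultiheadAttention, which exposes the attention sub-layer on its own (the dimensions below are illustrative):

import torch
import torch.nn as nn

# One encoder layer = multi-head self-attention + position-wise feed-forward network.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048, batch_first=True)

x = torch.rand(2, 10, 512)  # (batch, sequence length, d_model)
y = layer(x)                # same shape as x

# The multi-head attention sub-layer on its own; self-attention uses query = key = value = x.
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
attn_output, attn_weights = mha(x, x, x)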

Positional Encoding:

  • Overview: Because Transformers contain no recurrent or convolutional components, they have no inherent notion of word order. Positional encoding injects this positional information into the input sequence.
  • Encoding Techniques: A common technique adds fixed sine and cosine functions of different frequencies to the input embeddings, giving the model a sense of word order (see the sketch below).
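
Below is a minimal sketch of sinusoidal positional encoding, written to match the PositionalEncoding module used in the sample code further down; it assumes an even hidden_dim and an illustrative max_len default:

import math

import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, hidden_dim, max_len=5000):
        super(PositionalEncoding, self).__init__()
        # Precompute a (max_len, hidden_dim) table of sine/cosine values.
        position = torch.arange(max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, hidden_dim, 2).float() * (-math.log(10000.0) / hidden_dim))
        encoding = torch.zeros(max_len, hidden_dim)
        encoding[:, 0::2] = torch.sin(position * div_term)
        encoding[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("encoding", encoding)

    def forward(self, inputs):
        # inputs: (batch_size, seq_len, hidden_dim)
        seq_len = inputs.size(1)
        return inputs + self.encoding[:seq_len].unsqueeze(0)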

Sample Code Example (using Python and PyTorch):

import torch
import torch.nn as nn

class Transformer(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, num_heads):
        super(Transformer, self).__init__()
        self.embedding = nn.Embedding(input_dim, hidden_dim)
        self.positional_encoding = PositionalEncoding(hidden_dim)
        self.transformer_layers = nn.ModuleList([
            TransformerLayer(hidden_dim, num_heads) for _ in range(num_layers)
        ])

    def forward(self, inputs):
        embedded = self.embedding(inputs)
        encoded = self.positional_encoding(embedded)

        for layer in self.transformer_layers:
            encoded = layer(encoded)

        return encoded

class TransformerLayer(nn.Module):
    def __init__(self, hidden_dim, num_heads):
        super(TransformerLayer, self).__init__()
        self.self_attention = MultiHeadAttention(hidden_dim, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )
        # One layer norm per sub-layer, as in the standard Transformer layer.
        self.attention_norm = nn.LayerNorm(hidden_dim)
        self.feed_forward_norm = nn.LayerNorm(hidden_dim)

    def forward(self, inputs):
        # Sub-layer 1: multi-head self-attention with a residual connection and layer norm.
        attended = self.attention_norm(self.self_attention(inputs) + inputs)
        # Sub-layer 2: position-wise feed-forward network, also with a residual connection and layer norm.
        output = self.feed_forward_norm(self.feed_forward(attended) + attended)
        return output

class MultiHeadAttention(nn.Module):
    def __init__(self, hidden_dim, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.hidden_dim = hidden_dim
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads

        self.query_projection = nn.Linear(hidden_dim, hidden_dim)
        self.key_projection = nn.Linear(hidden_dim, hidden_dim)
        self.value_projection = nn.Linear(hidden_dim, hidden_dim)
        self.final_projection = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, inputs):
        batch_size, seq_len, hidden_dim = inputs.size()

        queries = self.query_projection(inputs)
        keys = self.key_projection(inputs)
        values = self.value_projection(inputs)

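        # Split the hidden dimension into num_heads heads of size head_dim and
        # move the head axis forward: (batch, num_heads, seq_len, head_dim).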
        queries = queries.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        keys = keys.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        values = values.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

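        # Scaled dot-product attention: query-key dot products scaled by sqrt(head_dim).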
        attention_scores = torch.matmul(queries, keys.transpose(-2, -1)) / torch.sqrt(torch.tensor(self.head_dim).float())
        attention_weights = torch.softmax(attention_scores, dim=-1)

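        # Weighted sum of the values, then merge the heads back into one hidden dimension.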
        attended = torch.matmul(attention_weights, values).transpose(1, 2).contiguous().view(batch_size, seq_len, hidden_dim)
        output = self.final_projection(attended)
        return output

This example uses PyTorch to implement a condensed, encoder-only version of a Transformer model. It includes the key elements covered above: positional encoding, Transformer layers, and multi-head attention.
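
For example, the model could be instantiated and run on a batch of dummy token ids as follows; the hyperparameter values are purely illustrative:

# Hypothetical hyperparameters: a vocabulary of 10,000 tokens, 512-dimensional
# hidden states, 6 layers, and 8 attention heads.
model = Transformer(input_dim=10000, hidden_dim=512, num_layers=6, num_heads=8)

token_ids = torch.randint(0, 10000, (2, 10))  # a batch of 2 sequences of 10 token ids
outputs = model(token_ids)                    # contextual representations: (2, 10, 512)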

Keep in mind that this is just a simple example; there are many refinements and alternatives (masking, dropout, the decoder side of the architecture, and more) to explore when working with Transformers in NLP.
