[NLP with Transformers] Text Preprocessing for NLP

Tokenization and Subword Encoding:

    • Tokenization is the process of splitting text into discrete tokens (such as words or subwords) so that further analysis becomes easier.
    • Subword encoding is a tokenization strategy that splits words into smaller subword units, which is especially helpful for handling rare or out-of-vocabulary words (see the follow-up sketch after the output below). Example code for tokenization and subword encoding with HuggingFace’s Tokenizers library:
       from transformers import AutoTokenizer
    
       # Load tokenizer
       tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    
       # Tokenize a sentence
       sentence = "I love natural language processing!"
       tokens = tokenizer.tokenize(sentence)
    
       print(tokens)

    Output:

    ['i', 'love', 'natural', 'language', 'processing', '!']
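
    The sentence above contains only common words, so no subword splitting occurs. As a minimal follow-up sketch (the exact subword pieces depend on the pretrained vocabulary), the same tokenizer splits a less common word into pieces marked with the "##" continuation prefix:

       from transformers import AutoTokenizer

       # Load tokenizer
       tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

       # A rarer word is split into subword pieces; continuation pieces carry
       # the "##" prefix (e.g. "tokenization" typically becomes ['token', '##ization'])
       print(tokenizer.tokenize("Subword tokenization handles rare words"))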

    Handling Special Tokens and Padding:

      • Transformer-based models rely on special tokens such as [CLS], [SEP], and [PAD] for a variety of purposes, such as marking the start or end of an input, separating sentences, and denoting padding positions.
      • Padding makes all input sequences in a batch the same length, which is essential for efficient batch processing. An example of handling special tokens and padding with HuggingFace’s tokenizers follows, with a short sketch after the output that makes the special tokens visible:
         from transformers import AutoTokenizer
      
         # Load tokenizer
         tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
      
         # Add special tokens and pad sequences
         sentences = ["I love NLP!", "Transformers are amazing!"]
         encoded_inputs = tokenizer(sentences, padding=True, truncation=True)
      
         print(encoded_inputs)

      Output:

      {'input_ids': [[101, 1045, 2293, 2175, 999, 102], [101, 9587, 2024, 6429, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
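
      To see where the special tokens appear, the encoded ids can be converted back to their string form. A small sketch continuing the example above (the setup is repeated so it runs on its own):

         from transformers import AutoTokenizer

         # Load tokenizer and encode a batch with padding enabled
         tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
         encoded_inputs = tokenizer(["I love NLP!", "Transformers are amazing!"],
                                    padding=True, truncation=True)

         # Inspect the special tokens this tokenizer uses
         print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token)

         # Convert the first sentence's ids back to tokens: [CLS] appears at the
         # start and [SEP] at the end; if the batch held sequences of different
         # lengths, the shorter ones would be filled with [PAD]
         print(tokenizer.convert_ids_to_tokens(encoded_inputs["input_ids"][0]))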

      Data Cleaning and Normalization Techniques:

        • Data cleaning involves removing or correcting irrelevant or noisy text elements, such as HTML tags, URLs, special characters, and punctuation.
        • Normalization procedures, such as converting text to lowercase or expanding contractions, bring content into a uniform format (a sketch of contraction expansion appears after the example’s output below). In Python, regular expressions (regex) are a common tool for cleaning and normalizing text:
           import re
        
           def clean_text(text):
               # Remove URLs
               text = re.sub(r"http\S+|www\S+", "", text)
        
               # Remove HTML tags
               text = re.sub(r"<.*?>", "", text)
        
               # Remove special characters and punctuation
               text = re.sub(r"[^\w\s]", "", text)
        
               # Convert text to lowercase
               text = text.lower()
        
               return text
        
           sentence = "Check out this amazing website: www.example.com!"
           cleaned_sentence = clean_text(sentence)
        
           print(cleaned_sentence)

        Output:

        check out this amazing website
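
        The normalization bullet above also mentions expanding contractions; a minimal sketch of that step, using a small hand-written mapping (the dictionary here is illustrative, not exhaustive):

           import re

           # Illustrative (not exhaustive) contraction map
           CONTRACTIONS = {
               "can't": "cannot",
               "won't": "will not",
               "it's": "it is",
               "don't": "do not",
           }

           def expand_contractions(text):
               # Replace each known contraction with its expanded form, case-insensitively
               pattern = re.compile(r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b",
                                    flags=re.IGNORECASE)
               return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

           print(expand_contractions("I can't believe it's already working!"))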

        These examples show how to use HuggingFace’s Tokenizers library and regular expressions for tokenization, subword encoding, handling special tokens and padding, and data cleaning and normalization. You can extend and refine these techniques to fit your specific NLP goals and requirements.
