Hands on deep learning (35) -- text preprocessing (NLP)
2022-07-04 09:37:00 【Stay a little star】
Text preprocessing (Pre-processing)
In NLP, preprocessing the text data is indispensable: it prepares raw text for the later training of a neural network.
Text preprocessing usually includes:
- Raw data loading (raw data)
- Tokenization / word segmentation (segmentation)
- Data cleaning (cleaning)
- Normalization: stemming / lemmatization
- Feature extraction: tf-idf / word2vec (see the sketch after this list)
- Modeling: similarity algorithms, classification algorithms
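For the normalization and feature extraction steps above, here is a minimal sketch. It assumes NLTK and scikit-learn are installed; neither library is used in the rest of this post, and the example documents are made up.

# Hedged sketch of stemming (normalization) and tf-idf (feature extraction),
# assuming nltk and scikit-learn are available
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()
print(stemmer.stem('running'), stemmer.stem('studies'))  # run studi

docs = ['the time machine by h g wells',
        'the time traveller was expounding a recondite matter to us']
X = TfidfVectorizer().fit_transform(docs)  # sparse matrix, one row per document
print(X.shape)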

Credit: Greedy Technology
This post covers only the most basic English text preprocessing: reading the raw data and tokenizing it. For more preprocessing operations, see NLP introduction -- Text preprocessing Pre-processing. There are many mature tokenization libraries that can be called directly (a sketch follows below), but Li Mu's code shows how to build a vocabulary from scratch. Personally, I find this very useful: once you understand the basics, it also becomes easier to call the various libraries effectively.
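As a point of comparison, here is what calling a ready-made tokenizer might look like. This is a sketch assuming NLTK; the punkt model must be downloaded once.

# Hedged sketch: using an off-the-shelf tokenizer instead of writing one by hand
import nltk
nltk.download('punkt')  # one-time download of the tokenizer model
from nltk.tokenize import word_tokenize

print(word_tokenize('The Time Traveller was expounding a recondite matter to us.'))
# ['The', 'Time', 'Traveller', 'was', 'expounding', 'a', 'recondite', 'matter', 'to', 'us', '.']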
1. Reading the dataset
An article can simply be regarded as a sequence of words, or even a sequence of characters. To make it easy to use sequence data in later experiments, we preprocess the text here in the following steps:
- Load the text data into memory
- Split the strings into tokens (e.g., words or characters)
- Build a vocabulary that maps tokens to numeric indices
- Convert the text into a sequence of numeric indices, so that the model can operate on it easily
import collections
import re
from d2l import torch as d2l
We begin by loading text from H. G. Wells's The Time Machine. This is a fairly small corpus of just over 30,000 words, but it is enough for our purpose of demonstrating text preprocessing; a real document collection may contain billions of words. The following function reads the dataset into a list of text lines, where each line is a string. For simplicity, we ignore punctuation and capitalization here.
#@save
d2l.DATA_HUB['time_machine'] = (d2l.DATA_URL + 'timemachine.txt',
                                '090b5e7e70c295757f55df93cb0a180b9691891a')

def read_time_machine():
    """Crude preprocessing: strip punctuation and special characters, keeping only the 26 letters and spaces"""
    with open(d2l.download('time_machine'), 'r') as f:
        lines = f.readlines()
    return [re.sub('[^A-Za-z]+', ' ', line).strip().lower() for line in lines]

lines = read_time_machine()
print(f'# text lines: {len(lines)}')
print(lines[0])
print(lines[10])
Downloading ../data/timemachine.txt from http://d2l-data.s3-accelerate.amazonaws.com/timemachine.txt...
# text lines: 3221
the time machine by h g wells
twinkled and his usually pale face was flushed and animated the
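As a quick sanity check, here is what the regular expression in read_time_machine does to a made-up sample line:

# The regex replaces every run of non-letters with a single space
sample = 'The Time Machine, by H. G. Wells [1898]'
print(re.sub('[^A-Za-z]+', ' ', sample).strip().lower())
# 'the time machine by h g wells'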
2. Tokenization
The following tokenize function splits each text line into a list of tokens; a token is the basic unit of text. It returns a list of token lists, where each token is a string.
def tokenize(lines, token='word'):
    """Split each line into word or character tokens"""
    if token == 'word':
        return [line.split() for line in lines]
    elif token == 'char':
        return [list(line) for line in lines]
    else:
        print('Error: unknown token type: ' + token)

tokens = tokenize(lines)
for i in range(11):
    print(tokens[i])
['the', 'time', 'machine', 'by', 'h', 'g', 'wells']
[]
[]
[]
[]
['i']
[]
[]
['the', 'time', 'traveller', 'for', 'so', 'it', 'will', 'be', 'convenient', 'to', 'speak', 'of', 'him']
['was', 'expounding', 'a', 'recondite', 'matter', 'to', 'us', 'his', 'grey', 'eyes', 'shone', 'and']
['twinkled', 'and', 'his', 'usually', 'pale', 'face', 'was', 'flushed', 'and', 'animated', 'the']
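The char branch of tokenize is not exercised above but will be used in section 4; here is a quick look at what it produces on the first line:

print(tokenize([lines[0]], token='char')[0][:10])
# ['t', 'h', 'e', ' ', 't', 'i', 'm', 'e', ' ', 'm']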
3. Vocabulary
String tokens are inconvenient for models, which expect numeric inputs. We therefore build a dictionary, often called a vocabulary, that maps string tokens to numeric indices starting from 0.
- To do this, we first count the unique tokens in all documents of the training set, collectively called the corpus.
- Then each unique token is assigned a numeric index according to its frequency. (Rarely seen tokens are usually removed to reduce complexity.)
- Any token that does not exist in the corpus, or has been removed from it, is mapped to a special unknown token "<unk>".
- Optionally, we can pass a list of reserved tokens, e.g. "<pad>" for padding, "<bos>" for the beginning of a sequence, and "<eos>" for the end of a sequence.
def count_corpus(tokens):
    """Count token frequencies; here tokens is a 1D or 2D list"""
    if len(tokens) == 0 or isinstance(tokens[0], list):
        # Flatten tokens into a single list of tokens
        tokens = [token for line in tokens for token in line]
    return collections.Counter(tokens)
class Vocab:
    """Vocabulary for text"""
    def __init__(self, tokens=None, min_freq=0, reserved_tokens=None):
        if tokens is None:
            tokens = []
        if reserved_tokens is None:
            reserved_tokens = []
        # Sort tokens by frequency
        counter = count_corpus(tokens)
        self.token_freqs = sorted(counter.items(), key=lambda x: x[1], reverse=True)
        # The index of the unknown token is 0
        self.unk, uniq_tokens = 0, ['<unk>'] + reserved_tokens
        uniq_tokens += [token for token, freq in self.token_freqs
                        if freq >= min_freq and token not in uniq_tokens]
        # Look up token by index and index by token
        self.idx_to_token, self.token_to_idx = [], dict()
        for token in uniq_tokens:
            self.idx_to_token.append(token)
            self.token_to_idx[token] = len(self.idx_to_token) - 1

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        """Return the index of a single token, or a list of indices for a list/tuple of tokens"""
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        """Return the token for a single index, or a list of tokens for a list/tuple of indices"""
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]
# Build a vocabulary from the time machine corpus, then print the first few frequent tokens and their indices
vocab = Vocab(tokens)
print(list(vocab.token_to_idx.items())[:10])
[('<unk>', 0), ('the', 1), ('i', 2), ('and', 3), ('of', 4), ('a', 5), ('to', 6), ('was', 7), ('in', 8), ('that', 9)]
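The min_freq and reserved_tokens arguments of Vocab are not exercised above; here is a small sketch of how they might be used (the indices of the reserved tokens follow from their position right after '<unk>'):

# Drop tokens seen fewer than 10 times and reserve padding / sequence-boundary tokens
vocab_small = Vocab(tokens, min_freq=10, reserved_tokens=['<pad>', '<bos>', '<eos>'])
print(len(vocab_small))  # smaller than len(vocab): rare tokens are dropped
print(vocab_small['<pad>'], vocab_small['<bos>'], vocab_small['<eos>'])  # 1 2 3
print(vocab_small['recondite'])  # 0, i.e. <unk>, because the word is rare in the corpus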
# Now we can convert each line of text into a list of numeric indices
for i in [0, 10]:
    print('words:', tokens[i])
    print('indices:', vocab[tokens[i]])
words: ['the', 'time', 'machine', 'by', 'h', 'g', 'wells']
indices: [1, 19, 50, 40, 2183, 2184, 400]
words: ['twinkled', 'and', 'his', 'usually', 'pale', 'face', 'was', 'flushed', 'and', 'animated', 'the']
indices: [2186, 3, 25, 1044, 362, 113, 7, 1421, 3, 1045, 1]
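The to_tokens method goes in the other direction, mapping indices back to strings; here is a quick round trip on the first line:

indices = vocab[tokens[0]]
print(vocab.to_tokens(indices))
# ['the', 'time', 'machine', 'by', 'h', 'g', 'wells']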
4. Integrating the above functions
We pack everything into the load_corpus_time_machine function, which returns corpus (a list of token indices) and vocab (the vocabulary of the time machine corpus). Two things change here:
- We tokenize the text into characters rather than words, to simplify training in later chapters
- corpus is a single flattened list of indices rather than a list of token lists, because each text line in the time machine dataset is not necessarily a sentence or a paragraph
def load_corpus_time_machine(max_tokens=-1):
    """Return the list of token indices and the vocabulary of the time machine dataset"""
    lines = read_time_machine()
    tokens = tokenize(lines, 'char')
    vocab = Vocab(tokens)
    # Since each text line in the time machine dataset is not necessarily
    # a sentence or a paragraph, flatten all text lines into a single list
    corpus = [vocab[token] for line in tokens for token in line]
    if max_tokens > 0:
        corpus = corpus[:max_tokens]
    return corpus, vocab

corpus, vocab = load_corpus_time_machine()
len(corpus), len(vocab)
(170580, 28)
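The vocabulary size of 28 is the 26 lowercase letters plus the space character plus '<unk>'. The max_tokens argument simply truncates the flattened corpus, for example:

corpus_small, _ = load_corpus_time_machine(max_tokens=10000)
print(len(corpus_small))  # 10000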
list(vocab.token_to_idx.items())
[('<unk>', 0),
(' ', 1),
('e', 2),
('t', 3),
('a', 4),
('i', 5),
('n', 6),
('o', 7),
('s', 8),
('h', 9),
('r', 10),
('d', 11),
('l', 12),
('m', 13),
('u', 14),
('c', 15),
('f', 16),
('w', 17),
('g', 18),
('y', 19),
('p', 20),
('b', 21),
('v', 22),
('k', 23),
('x', 24),
('z', 25),
('j', 26),
('q', 27)]
lines = read_time_machine()
tokens = tokenize(lines,'char')
for i in [0, 10]:
    print('words:', tokens[i])
    print('indices:', vocab[tokens[i]])
words: ['t', 'h', 'e', ' ', 't', 'i', 'm', 'e', ' ', 'm', 'a', 'c', 'h', 'i', 'n', 'e', ' ', 'b', 'y', ' ', 'h', ' ', 'g', ' ', 'w', 'e', 'l', 'l', 's']
indices: [3, 9, 2, 1, 3, 5, 13, 2, 1, 13, 4, 15, 9, 5, 6, 2, 1, 21, 19, 1, 9, 1, 18, 1, 17, 2, 12, 12, 8]
words: ['t', 'w', 'i', 'n', 'k', 'l', 'e', 'd', ' ', 'a', 'n', 'd', ' ', 'h', 'i', 's', ' ', 'u', 's', 'u', 'a', 'l', 'l', 'y', ' ', 'p', 'a', 'l', 'e', ' ', 'f', 'a', 'c', 'e', ' ', 'w', 'a', 's', ' ', 'f', 'l', 'u', 's', 'h', 'e', 'd', ' ', 'a', 'n', 'd', ' ', 'a', 'n', 'i', 'm', 'a', 't', 'e', 'd', ' ', 't', 'h', 'e']
indices: [3, 17, 5, 6, 23, 12, 2, 11, 1, 4, 6, 11, 1, 9, 5, 8, 1, 14, 8, 14, 4, 12, 12, 19, 1, 20, 4, 12, 2, 1, 16, 4, 15, 2, 1, 17, 4, 8, 1, 16, 12, 14, 8, 9, 2, 11, 1, 4, 6, 11, 1, 4, 6, 5, 13, 4, 3, 2, 11, 1, 3, 9, 2]
Summary
- Text is an important form of sequential data
- To preprocess text, we usually split it into tokens, build a vocabulary that maps token strings to numeric indices, and convert the text data into sequences of token indices for the model to operate on
References:
【1】 Hands-on Deep Learning, PyTorch edition
【2】 NLP introduction -- Text preprocessing Pre-processing
【3】《Hands-on Deep Learning》