Hands on deep learning (35) -- text preprocessing (NLP)
2022-07-04 09:37:00 【Stay a little star】
Text preprocessing (Pre-processing)
In natural language processing (NLP), preprocessing the text data is indispensable: it turns raw text into a form that is convenient for training neural networks later on.
Text preprocessing usually includes:
- Raw data loading
- Word segmentation / tokenization
- Data cleaning
- Normalization: stemming / lemmatization
- Feature extraction: tf-idf / word2vec (a minimal tf-idf sketch follows this list)
- Modeling: similarity algorithms, classification algorithms
Source: Greedy Technology
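As a small illustration of one of these steps, here is a minimal tf-idf sketch using only the Python standard library. It is a toy example with made-up documents and the plain tf × log(N/df) weighting, not the formulation used by any particular library (real implementations typically add smoothing and normalization):

```python
import math
from collections import Counter

# Two tiny made-up documents
docs = ["the time machine by h g wells",
        "the time traveller was expounding a recondite matter"]
tokenized = [doc.split() for doc in docs]

# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(doc_tokens):
    """Plain tf-idf: term frequency times log(N / document frequency)."""
    tf = Counter(doc_tokens)
    n = len(doc_tokens)
    return {t: (c / n) * math.log(len(tokenized) / df[t]) for t, c in tf.items()}

# Words shared by both documents (e.g. 'the', 'time') get weight 0
print(tf_idf(tokenized[0]))
```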
This article covers only the most basic English text preprocessing: reading the raw data and tokenization. For more preprocessing operations, see NLP introduction -- Text preprocessing (Pre-processing). There are now many good tokenization libraries that can be called directly, but Mu Li's code shows how to build a vocabulary from scratch, which I personally find very instructive; once you understand the basics, it is also easier to use the various libraries effectively.
1. Reading the dataset
An article can simply be viewed as a sequence of words, or even a sequence of characters. To make it easy to use sequence data in later experiments, we preprocess the text data here. The main steps are:
- Load the text data into memory
- Split the strings into tokens (e.g., words or characters)
- Build a vocabulary that maps each token to a numerical index
- Convert the text into sequences of numerical indices so the model can operate on them easily
import collections
import re
from d2l import torch as d2l
We start by loading text from H. G. Wells' The Time Machine. This is a fairly small corpus of just over 30,000 words, but it is enough for our purpose of demonstrating text preprocessing; in reality, a document collection may contain billions of words. The following function reads the dataset into a list of text lines, where each line is a string. For simplicity, we ignore punctuation and capitalization here (a standalone check of the cleaning regex is shown after the output below).
#@save
d2l.DATA_HUB['time_machine'] = (d2l.DATA_URL + 'timemachine.txt',
                                '090b5e7e70c295757f55df93cb0a180b9691891a')

def read_time_machine():
    """The preprocessing here is brute force: remove punctuation and special
    characters, keeping only the 26 letters and spaces."""
    with open(d2l.download('time_machine'), 'r') as f:
        lines = f.readlines()
    return [re.sub('[^A-Za-z]+', ' ', line).strip().lower() for line in lines]

lines = read_time_machine()
print(f'# text lines: {len(lines)}')
print(lines[0])
print(lines[10])
Downloading ../data/timemachine.txt from http://d2l-data.s3-accelerate.amazonaws.com/timemachine.txt...
# text lines: 3221
the time machine by h g wells
twinkled and his usually pale face was flushed and animated the
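As a quick, standalone check of what the cleaning regex does (a hypothetical snippet, not part of the original article): any run of characters that are not letters is collapsed into a single space before stripping and lowercasing.

```python
import re

sample = "It's 8:00 -- time to go!"
# Runs of non-letters become a single space; then strip and lowercase
print(re.sub('[^A-Za-z]+', ' ', sample).strip().lower())
# it s time to go
```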
2. Tokenization
The following function splits each text line into a list of tokens; a token is the basic unit of text. It returns a list of token lists, where each token is a string.
def tokenize(lines, token='word'):
    if token == 'word':
        return [line.split() for line in lines]
    elif token == 'char':
        return [list(line) for line in lines]
    else:
        print('Error: unknown token type: ' + token)

tokens = tokenize(lines)
for i in range(11):
    print(tokens[i])
['the', 'time', 'machine', 'by', 'h', 'g', 'wells']
[]
[]
[]
[]
['i']
[]
[]
['the', 'time', 'traveller', 'for', 'so', 'it', 'will', 'be', 'convenient', 'to', 'speak', 'of', 'him']
['was', 'expounding', 'a', 'recondite', 'matter', 'to', 'us', 'his', 'grey', 'eyes', 'shone', 'and']
['twinkled', 'and', 'his', 'usually', 'pale', 'face', 'was', 'flushed', 'and', 'animated', 'the']
3. Vocabulary
Tokens of string type are inconvenient for the model, which requires numerical inputs. We therefore build a dictionary, often called a vocabulary, that maps string tokens to numerical indices starting from 0.
- To do this, we first count the unique tokens across all documents of the training set; this collection is called the corpus.
- Each unique token is then assigned a numerical index according to its frequency. (Rarely seen tokens are often removed to reduce complexity.)
- Any token that does not appear in the corpus, or that has been removed, is mapped to a special unknown token "<unk>".
- We can optionally pass in a list of reserved tokens, for example "<pad>" for padding, "<bos>" for the beginning of a sequence, and "<eos>" for the end of a sequence. (A short sketch of how min_freq and reserved tokens behave is given after the vocabulary demo below.)
def count_corpus(tokens):
    """Count token frequencies; `tokens` is a 1D list or a 2D list."""
    if len(tokens) == 0 or isinstance(tokens[0], list):
        # Flatten tokens into a single list of tokens
        tokens = [token for line in tokens for token in line]
    return collections.Counter(tokens)
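As a quick sanity check (a hypothetical snippet, not from the original article), count_corpus gives the same counts whether it receives a flat list of tokens or a list of token lists:

```python
# Assumes count_corpus and the imports above have been defined
print(count_corpus(['the', 'time', 'the']))      # Counter({'the': 2, 'time': 1})
print(count_corpus([['the', 'time'], ['the']]))  # Counter({'the': 2, 'time': 1})
```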
class Vocab:
    """A vocabulary for mapping text tokens to numerical indices."""
    def __init__(self, tokens=None, min_freq=0, reserved_tokens=None):
        if tokens is None:
            tokens = []
        if reserved_tokens is None:
            reserved_tokens = []
        # Sort tokens by frequency, most frequent first
        counter = count_corpus(tokens)
        self.token_freqs = sorted(counter.items(), key=lambda x: x[1],
                                  reverse=True)
        # The index of the unknown token is 0
        self.unk, uniq_tokens = 0, ['<unk>'] + reserved_tokens
        uniq_tokens += [token for token, freq in self.token_freqs
                        if freq >= min_freq and token not in uniq_tokens]
        # idx_to_token looks up a token by index; token_to_idx looks up an index by token
        self.idx_to_token, self.token_to_idx = [], dict()
        for token in uniq_tokens:
            self.idx_to_token.append(token)
            self.token_to_idx[token] = len(self.idx_to_token) - 1

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        """Convert a single token, or a list/tuple of tokens, to indices."""
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        """A single index is looked up directly; a list or tuple is converted element by element."""
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]
# Build a vocabulary using the time machine dataset as the corpus,
# then print the first few frequent tokens and their indices
vocab = Vocab(tokens)
print(list(vocab.token_to_idx.items())[:10])
[('<unk>', 0), ('the', 1), ('i', 2), ('and', 3), ('of', 4), ('a', 5), ('to', 6), ('was', 7), ('in', 8), ('that', 9)]
# Now each line of text can be converted into a list of numerical indices
for i in [0, 10]:
    print('words:', tokens[i])
    print('indices:', vocab[tokens[i]])
words: ['the', 'time', 'machine', 'by', 'h', 'g', 'wells']
indices: [1, 19, 50, 40, 2183, 2184, 400]
words: ['twinkled', 'and', 'his', 'usually', 'pale', 'face', 'was', 'flushed', 'and', 'animated', 'the']
indices: [2186, 3, 25, 1044, 362, 113, 7, 1421, 3, 1045, 1]
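The min_freq and reserved_tokens arguments mentioned earlier are not exercised in this article. Here is a small sketch of how they behave, assuming the Vocab class and the word-level tokens defined above (the commented values are illustrative; exact indices depend on the corpus):

```python
# Drop tokens seen fewer than 10 times and add reserved tokens
vocab2 = Vocab(tokens, min_freq=10, reserved_tokens=['<pad>', '<bos>', '<eos>'])

print(vocab2['<pad>'])               # reserved tokens follow '<unk>' (index 0), so this is 1
print(vocab2['recondite'])           # a rare word is dropped and maps to vocab2.unk, i.e. 0
print(vocab2.to_tokens([1, 2, 3]))   # ['<pad>', '<bos>', '<eos>']
```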
4. Integrating the above functions
We pack everything into the load_corpus_time_machine function, which returns corpus (a list of token indices) and vocab (the vocabulary of the time machine corpus). Two things differ from before:
- The text is tokenized into characters rather than words, to simplify training in later chapters.
- corpus is a single flat list rather than a list of token lists, because each text line in the time machine dataset is not necessarily a sentence or a paragraph.
def load_corpus_time_machine(max_tokens=-1):
    """Return the token index list and the vocabulary of the time machine dataset."""
    lines = read_time_machine()
    tokens = tokenize(lines, 'char')
    vocab = Vocab(tokens)
    # Each text line in the time machine dataset is not necessarily a sentence
    # or a paragraph, so flatten all text lines into a single list
    corpus = [vocab[token] for line in tokens for token in line]
    if max_tokens > 0:
        corpus = corpus[:max_tokens]
    return corpus, vocab

corpus, vocab = load_corpus_time_machine()
len(corpus), len(vocab)
(170580, 28)
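The vocabulary size of 28 is expected: after cleaning, the text contains only the 26 lowercase letters plus the space character, and the vocabulary adds "<unk>", as the full listing below confirms. The max_tokens argument can truncate the corpus when only a prefix is needed (a hypothetical usage, not from the original):

```python
# Keep only the first 10,000 character indices; the vocabulary is unchanged
corpus_small, vocab_small = load_corpus_time_machine(max_tokens=10000)
print(len(corpus_small), len(vocab_small))  # 10000 28
```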
list(vocab.token_to_idx.items())
[('<unk>', 0),
(' ', 1),
('e', 2),
('t', 3),
('a', 4),
('i', 5),
('n', 6),
('o', 7),
('s', 8),
('h', 9),
('r', 10),
('d', 11),
('l', 12),
('m', 13),
('u', 14),
('c', 15),
('f', 16),
('w', 17),
('g', 18),
('y', 19),
('p', 20),
('b', 21),
('v', 22),
('k', 23),
('x', 24),
('z', 25),
('j', 26),
('q', 27)]
lines = read_time_machine()
tokens = tokenize(lines, 'char')
for i in [0, 10]:
    print('words:', tokens[i])
    print('indices:', vocab[tokens[i]])
words: ['t', 'h', 'e', ' ', 't', 'i', 'm', 'e', ' ', 'm', 'a', 'c', 'h', 'i', 'n', 'e', ' ', 'b', 'y', ' ', 'h', ' ', 'g', ' ', 'w', 'e', 'l', 'l', 's']
indices: [3, 9, 2, 1, 3, 5, 13, 2, 1, 13, 4, 15, 9, 5, 6, 2, 1, 21, 19, 1, 9, 1, 18, 1, 17, 2, 12, 12, 8]
words: ['t', 'w', 'i', 'n', 'k', 'l', 'e', 'd', ' ', 'a', 'n', 'd', ' ', 'h', 'i', 's', ' ', 'u', 's', 'u', 'a', 'l', 'l', 'y', ' ', 'p', 'a', 'l', 'e', ' ', 'f', 'a', 'c', 'e', ' ', 'w', 'a', 's', ' ', 'f', 'l', 'u', 's', 'h', 'e', 'd', ' ', 'a', 'n', 'd', ' ', 'a', 'n', 'i', 'm', 'a', 't', 'e', 'd', ' ', 't', 'h', 'e']
indices: [3, 17, 5, 6, 23, 12, 2, 11, 1, 4, 6, 11, 1, 9, 5, 8, 1, 14, 8, 14, 4, 12, 12, 19, 1, 20, 4, 12, 2, 1, 16, 4, 15, 2, 1, 17, 4, 8, 1, 16, 12, 14, 8, 9, 2, 11, 1, 4, 6, 11, 1, 4, 6, 5, 13, 4, 3, 2, 11, 1, 3, 9, 2]
Summary
- Text is an important form of sequence data.
- To preprocess text, we usually split it into tokens, build a vocabulary that maps token strings to numerical indices, and convert the text data into token indices for the model to operate on.
References:
【1】 Hands-on deep learning, PyTorch edition
【2】 NLP introduction -- Text preprocessing (Pre-processing)
【3】《Hands-on Deep Learning》