Natural Language Processing Series (II): Building a Character-Level Language Model with an RNN
2022-07-02 11:56:00 【raelum】
1. Task Background
The task in this article comes from the official PyTorch tutorial: given a dataset of people's names from various countries, train an RNN that can judge which country a person comes from based on the input name (a classification task).
The dataset is a `names` folder containing 18 text files, each named `[Language].txt`. Every line of a file is a (common) name in that language.
Note that the official dataset appears to contain errors (the crawler didn't scrape cleanly): in `Russian.txt`, lines 7941-7964 contain the words "To The First Page", which are obviously not names. In addition, the names on lines 4395, 5236, and 5255 end with a comma (I don't think Russian names end in commas?).
I have corrected this dataset for you; see the download link (0 points required).
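If you would rather clean the raw files yourself, a minimal sketch like the following (my own, not the corrected download) strips the trailing commas and drops blank lines; the stray "To The First Page" lines in `Russian.txt` still have to be removed by hand:

```python
def clean_lines(path):
    # Strip whitespace and any trailing comma, then drop empty lines
    with open(path, encoding='utf-8') as f:
        lines = [line.strip().rstrip(',') for line in f]
    return [line for line in lines if line]
```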
Unlike the official tutorial, this article refactors the code according to my own understanding to make it easier to follow.
2. Data Preprocessing
First, we need to build a dictionary of the form `{language: [names, ...]}`.
Since the names consist of Unicode characters, we first convert them to ASCII:
```python
import unicodedata
import string

# Converted names consist of upper- and lowercase letters plus the space
# and apostrophe characters
all_letters = string.ascii_letters + " '"

def unicodeToAscii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn' and c in all_letters)

print(unicodeToAscii('Ślusàrski'))
# Slusarski
```
Construct the dictionary:
```python
import os

filenames = os.listdir('names')
data = dict()
for filename in filenames:
    # The files must be opened with utf-8 encoding
    with open(f'names/{filename}', encoding='utf-8') as f:
        # Strip the .txt suffix from filename to get the language name
        data[filename[:-4]] = [unicodeToAscii(name) for name in f.readlines()]

all_categories = list(data.keys())
print(all_categories)
# ['Arabic', 'Chinese', 'Czech', 'Dutch', 'English', 'French', 'German', 'Greek', 'Irish', 'Italian', 'Japanese', 'Korean', 'Polish', 'Portuguese', 'Russian', 'Scottish', 'Spanish', 'Vietnamese']

print(data['Arabic'][:6])
# ['Khoury', 'Nahas', 'Daher', 'Gerges', 'Nazari', 'Maalouf']
```
The output below shows that the dataset has 18 classes and that the number of samples per class varies widely:
```python
print(len(all_categories))
# 18
print([len(data[category]) for category in all_categories])
# [2000, 268, 519, 297, 3668, 277, 724, 203, 232, 709, 991, 94, 139, 74, 9384, 100, 298, 73]
```
Neural networks cannot process letters directly, so we convert each letter into its corresponding one-hot vector:
```python
import torch
import torch.nn.functional as F

def letterToTensor(letter):
    # Get the index of letter in all_letters
    letter_idx = torch.tensor(all_letters.index(letter))
    return F.one_hot(letter_idx, num_classes=len(all_letters))

r = letterToTensor("c")
print(r)
# tensor([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#         0, 0, 0, 0, 0, 0])
```
The vector has length 54: 26 lowercase letters + 26 uppercase letters + the space character + the apostrophe.
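As a quick sanity check (my addition, not in the original tutorial), the alphabet size and the position of the hot entry can be verified directly:

```python
print(len(all_letters))
# 54
print(letterToTensor('c').argmax().item())
# 2, i.e. all_letters.index('c')
```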
Next we need to encode a whole word (a name) as a tensor. Each letter corresponds to one time step, so a complete word can be viewed as a sequence. Since words vary in length, we treat each word as its own batch, so a word becomes a tensor of shape (sequence_length, batch_size, features) = (sequence_length, 1, 54).
```python
def nameToTensor(name):
    result = torch.zeros(len(name), len(all_letters))
    for i in range(len(name)):
        result[i] = letterToTensor(name[i])
    return result.unsqueeze(1)

print(nameToTensor('ab'))
# tensor([[[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
#           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
#           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
#           0., 0., 0.]],
#
#         [[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
#           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
#           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
#           0., 0., 0.]]])

print(nameToTensor('abcd').size())
# torch.Size([4, 1, 54])
```
Because each word is treated as its own batch, we don't use a DataLoader to build an iterator; instead we draw training samples at random.
3. Model Building and Training
We use PyTorch's `nn.RNN` module to build a unidirectional RNN with a single hidden layer:
```python
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(
            input_size=54,
            # 128 is an arbitrary choice; other values work too
            hidden_size=128,
        )
        # 18-way classification, so the output layer has 18 neurons
        self.out = nn.Linear(128, 18)

    def forward(self, x):
        # None means the initial hidden state is all zeros
        output, h_n = self.rnn(x, None)
        # output[-1] is the hidden state at the last time step, identical to h_n[0]
        return self.out(output[-1])
```
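As a quick sanity check (my addition, not in the original tutorial), we can feed a dummy name through a freshly constructed network; the result should be one logit per class, i.e. a tensor of shape (1, 18):

```python
tmp_rnn = RNN()
dummy = nameToTensor('abc')  # shape (3, 1, 54): 3 time steps, batch of 1
print(tmp_rnn(dummy).size())
# torch.Size([1, 18])
```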
Since the classes differ greatly in size and each word is already its own batch, using a DataLoader would be awkward. Following the official tutorial, each step draws a random `category` from `data`, then draws a random `name` from that category, and feeds it to the RNN.
```python
import random

def random_sample():
    category = random.choice(all_categories)
    name = random.choice(data[category])
    return category, name

print(random_sample())
# ('Irish', "O'Kane")
```
We train on the GPU (if available), with cross-entropy loss and the SGD optimizer:
```python
LR = 1e-3        # learning rate
N_ITERS = 10**5  # number of training iterations

device = 'cuda' if torch.cuda.is_available() else 'cpu'
rnn = RNN()
rnn.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(rnn.parameters(), lr=LR, momentum=0.9)
train_loss = []

def train(model, criterion, optimizer):
    model.train()
    avg_loss = 0
    for iteration in range(N_ITERS):
        category, name = random_sample()
        # Encode the name as a tensor
        X = nameToTensor(name).to(device)
        # output has shape (1, 18), so target must have shape (1,), not be a scalar
        target = torch.tensor([all_categories.index(category)]).to(device)
        # Forward pass
        output = model(X)
        loss = criterion(output, target)
        # Accumulate the scalar loss so the computation graph is not retained
        avg_loss += loss.item()
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Every 1000 iterations, record and print the average loss
        if (iteration + 1) % 1000 == 0:
            avg_loss /= 1000
            train_loss.append(avg_loss)
            print(f"Iteration: [{iteration + 1}/{N_ITERS}] | Train Loss: {avg_loss:.4f}")
            avg_loss = 0

train(rnn, criterion, optimizer)
```
After training, we plot the loss curve:
```python
import numpy as np
import matplotlib.pyplot as plt

plt.plot(np.arange(1, N_ITERS + 1, 1000), train_loss)
plt.ylabel('Train Loss')
plt.xlabel('Iteration')
plt.show()
```
To make the results reproducible, we set global seeds:
```python
def setup_seed(seed):
    random.seed(seed)                 # seed the random library
    np.random.seed(seed)              # seed numpy
    torch.manual_seed(seed)           # seed PyTorch on the CPU
    torch.cuda.manual_seed(seed)      # seed the current GPU
    torch.cuda.manual_seed_all(seed)  # seed all GPUs
```
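Note that seeding alone does not guarantee bit-identical runs on the GPU. If full determinism is needed, cuDNN can additionally be forced into deterministic mode (standard PyTorch flags, not part of the original code):

```python
torch.backends.cudnn.deterministic = True  # use deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False     # disable kernel auto-tuning
```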
With the seed set to 42, the loss curve looks like this:
As the figure shows, the loss reaches its minimum at around 70,000 iterations. For convenience, however, we test with the final model.
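If you would rather keep the trained weights around for later testing, they can be saved and restored with PyTorch's standard `state_dict` mechanism (my addition; the file name `rnn_final.pt` is arbitrary):

```python
torch.save(rnn.state_dict(), 'rnn_final.pt')

# Later, rebuild the model and load the saved weights
rnn_restored = RNN()
rnn_restored.load_state_dict(torch.load('rnn_final.pt', map_location=device))
rnn_restored.to(device)
```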
4. Model Testing
During training we ran a total of 100,000 iterations, each containing a single sample. In the test phase, we randomly draw 10,000 samples and plot the confusion matrix:
```python
from sklearn.metrics import ConfusionMatrixDisplay

def test(model):
    model.eval()
    y_true, y_pred = [], []
    for _ in range(10000):
        category, name = random_sample()
        # Index of the sample's true label
        true_idx = all_categories.index(category)
        y_true.append(true_idx)
        # Index of the predicted label
        X = nameToTensor(name).to(device)
        output = model(X)
        y_pred.append(output.argmax().item())
    # Plot the confusion matrix
    ConfusionMatrixDisplay.from_predictions(y_true,
                                            y_pred,
                                            labels=np.arange(18),
                                            display_labels=all_categories,
                                            xticks_rotation='vertical',
                                            normalize='true',
                                            include_values=False)
    plt.show()
```
The result:
From the confusion matrix we can see:
- Korean is easily misclassified as Chinese, Czech as Polish, and German as Dutch;
- English, German, and Czech are the hardest classes to identify.
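To try the classifier on individual names, a small helper along the following lines can be used (my own sketch, not part of the original code; `topk=3` is an arbitrary choice):

```python
def predict(model, name, topk=3):
    model.eval()
    with torch.no_grad():
        output = model(nameToTensor(name).to(device))
        probs = F.softmax(output, dim=1)
        top_p, top_i = probs.topk(topk)
        for p, i in zip(top_p[0], top_i[0]):
            print(f'{all_categories[i.item()]}: {p.item():.2f}')

predict(rnn, 'Watanabe')
```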
Appendix: Complete Code
Running environment:
- OS: Ubuntu 20.04
- GPU: RTX 3090
- PyTorch version: 1.10
- Python version: 3.8
- CUDA: 11.3
Note that on Linux the return value of `os.listdir()` is ordered differently than on Windows, so even with identical random seeds the experiments may produce different results.
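A simple way to make the category ordering deterministic across platforms (my suggestion, not in the original code) is to sort the directory listing:

```python
filenames = sorted(os.listdir('names'))
```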
```python
import os
import string
import random
import unicodedata
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay


class RNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(input_size=54, hidden_size=128)
        self.out = nn.Linear(128, 18)

    def forward(self, x):
        output, h_n = self.rnn(x, None)
        return self.out(output[-1])


def unicodeToAscii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn' and c in all_letters)


def letterToTensor(letter):
    letter_idx = torch.tensor(all_letters.index(letter))
    return F.one_hot(letter_idx, num_classes=len(all_letters))


def nameToTensor(name):
    result = torch.zeros(len(name), len(all_letters))
    for i in range(len(name)):
        result[i] = letterToTensor(name[i])
    return result.unsqueeze(1)


def random_sample():
    category = random.choice(all_categories)
    name = random.choice(data[category])
    return category, name


def train(model, criterion, optimizer):
    model.train()
    avg_loss = 0
    for iteration in range(N_ITERS):
        category, name = random_sample()
        X = nameToTensor(name).to(device)
        target = torch.tensor([all_categories.index(category)]).to(device)
        output = model(X)
        loss = criterion(output, target)
        avg_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if (iteration + 1) % 1000 == 0:
            avg_loss /= 1000
            train_loss.append(avg_loss)
            print(f"Iteration: [{iteration + 1}/{N_ITERS}] | Train Loss: {avg_loss:.4f}")
            avg_loss = 0


def test(model):
    model.eval()
    y_true, y_pred = [], []
    for _ in range(10000):
        category, name = random_sample()
        true_idx = all_categories.index(category)
        y_true.append(true_idx)
        X = nameToTensor(name).to(device)
        output = model(X)
        y_pred.append(output.argmax().item())
    ConfusionMatrixDisplay.from_predictions(y_true,
                                            y_pred,
                                            labels=np.arange(18),
                                            display_labels=all_categories,
                                            xticks_rotation='vertical',
                                            normalize='true',
                                            include_values=False)
    plt.show()


def setup_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


""" Data preprocessing """
all_letters = string.ascii_letters + " '"
filenames = os.listdir('names')
data = dict()
for filename in filenames:
    with open(f'names/{filename}', encoding='utf-8') as f:
        data[filename[:-4]] = [unicodeToAscii(name) for name in f.readlines()]
all_categories = list(data.keys())

""" Model building and training """
setup_seed(42)
LR = 1e-3
N_ITERS = 10**5
device = 'cuda' if torch.cuda.is_available() else 'cpu'
rnn = RNN()
rnn.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(rnn.parameters(), lr=LR, momentum=0.9)
train_loss = []
train(rnn, criterion, optimizer)
plt.plot(np.arange(1, N_ITERS + 1, 1000), train_loss)
plt.ylabel('Train Loss')
plt.xlabel('Iteration')
plt.show()

""" Testing """
test(rnn)
```
If this article helped you, please give it a free like; that is the biggest motivation for my writing!