A Guide to Word Embeddings
2020-11-06 01:28:00 【Artificial intelligence meets pioneer】
Author: Shraddha Anala | Compiled by: VK | Source: Towards Data Science
No matter who we are, reading, understanding, communicating, and ultimately producing new content is something we all have to do in our professional lives.
When it comes to extracting useful features from a given body of text, the processes involved are fundamentally different from working with, say, a vector of continuous integers (a bag of words). This is because the information in a sentence or a piece of text is encoded in a structured order, with the semantic placement of each word conveying the meaning of the text.
So the dual requirement of representing the data appropriately while preserving the context of the text led me to learn about and implement two different NLP models for a text-classification task.
Word embeddings are dense representations of the individual words in a text, which take into account the context and the other words a given word appears with.
Compared with the simple bag-of-words model, these real-valued vectors are more efficient in terms of dimensionality and more effective at capturing the semantic relationships between words.
Simply put, words with similar meanings, or words that often appear in similar contexts, will have similar vector representations, depending on how "near" or "far apart" those words are in meaning.
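To make this idea concrete, here is a tiny illustrative sketch (not from the original article; the vectors below are made up purely for illustration) of how "closeness" between word vectors is usually measured with cosine similarity:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two word vectors: close to 1 = similar direction, close to 0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 4-dimensional vectors, only to illustrate the idea
king  = np.array([0.8, 0.3, 0.1, 0.7])
queen = np.array([0.7, 0.4, 0.2, 0.8])
apple = np.array([0.1, 0.9, 0.8, 0.1])

print(cosine_similarity(king, queen))  # high value: words used in similar contexts
print(cosine_similarity(king, apple))  # lower value: dissimilar contexts/meanings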
In this article, I will explore two kinds of word embeddings:
- Training our own embeddings
- Pre-trained GloVe word embeddings
Dataset
For this case study, we will use the Stack Overflow dataset from Kaggle (https://www.kaggle.com/imoore/60k-stack-overflow-questions-with-quality-rate). It contains 60,000 questions asked by users on the site, and the main task is to classify each question into one of 3 quality categories.
Now let's get to the actual models for this multi-class NLP project.
But before we start, make sure you have these packages/libraries installed.
pip install gensim # for NLP preprocessing tasks
pip install keras # for the embedding layer
1. Training our own word embeddings
If you want to skip the explanations, the complete code for the first model is here: https://github.com/shraddha-an/nlp/blob/main/word_embedding_classification.ipynb
1) Data preprocessing
In the first model, we will train a neural network to learn an embedding from our corpus of text. Specifically, we will use the Keras library to tokenize the words and provide the resulting word indices to the embedding layer of the neural network.
Before training our network, a few key parameters have to be decided. These include the vocabulary size, i.e. the number of unique words in the corpus, and the dimension of the embedding vectors.
The following link provides the training and test datasets. We will now import them, keeping only the question and quality columns for analysis: https://www.kaggle.com/imoore/60k-stack-overflow-questions-with-quality-rate
I also renamed the columns and defined a function, text_clean, to clean up the questions.
# Import libraries
# Data manipulation / processing
import pandas as pd, numpy as np
# visualization
import seaborn as sb, matplotlib.pyplot as plt
# NLP
import re
from nltk.corpus import stopwords
from gensim.utils import simple_preprocess
stop_words = set(stopwords.words('english'))
# Import dataset
dataset = pd.read_csv('train.csv')[['Body', 'Y']].rename(columns = {'Body': 'question', 'Y': 'category'})
ds = pd.read_csv('valid.csv')[['Body', 'Y']].rename(columns = {'Body': 'question', 'Y': 'category'})
# Clean up symbols and HTML tags
symbols = re.compile(pattern = r'[/<>(){}\[\]\|@,;]')
tags = ['href', 'http', 'https', 'www']

def text_clean(s: str) -> str:
    s = symbols.sub(' ', s)
    for i in tags:
        s = s.replace(i, ' ')
    return ' '.join(word for word in simple_preprocess(s) if word not in stop_words)

dataset.iloc[:, 0] = dataset.iloc[:, 0].apply(text_clean)
ds.iloc[:, 0] = ds.iloc[:, 0].apply(text_clean)
# Training and test sets
X_train, y_train = dataset.iloc[:, 0].values, dataset.iloc[:, 1].values.reshape(-1, 1)
X_test, y_test = ds.iloc[:, 0].values, ds.iloc[:, 1].values.reshape(-1, 1)
# One-hot encoding
from sklearn.preprocessing import OneHotEncoder as ohe
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(transformers = [('one_hot_encoder', ohe(categories = 'auto'), [0])],
                       remainder = 'passthrough')
y_train = ct.fit_transform(y_train)
y_test = ct.transform(y_test)
# Set parameters
vocab_size = 2000
sequence_length = 100
If you look at the original dataset, you will notice that the questions are wrapped in HTML tags, e.g. <p>…question</p>. In addition, words such as href, https, etc. appear throughout the text, so I want to make sure both sets of unnecessary characters are removed.
Gensim's simple_preprocess method returns a list of lowercase tokens with accents removed.
The apply method is used here so that the preprocessing function runs on each row and returns its output before moving on to the next row. The text preprocessing is applied to both the training and the test datasets.
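As a quick sanity check (a hypothetical example, not part of the original code), you can run text_clean on a single made-up question to see the HTML tags, symbols, URL fragments, and stop words being stripped out:

# Hypothetical example: inspect what text_clean does to one question
sample = '<p>How do I merge two dicts in Python? See <a href="https://example.com">this</a>.</p>'
print(text_clean(sample))
# prints something along the lines of: merge two dicts python see example com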
Since the dependent variable has 3 categories, we will apply one-hot encoding and initialize a few parameters for later use.
2) Tokenization
Next, we will use the Keras Tokenizer class to convert the questions, which consist of words, into arrays in which each word is represented by its index.
To do so, we first have to use the fit_on_texts method to build an indexed vocabulary from the words that appear in the dataset.
After building the vocabulary, we use the texts_to_sequences method to convert each sentence into a list of numbers representing its words.
The pad_sequences function ensures that all observations have the same length, which can be set to any arbitrary number or to the length of the longest question in the dataset.
The vocab_size parameter we initialized earlier is simply the size of our vocabulary (for learning and indexing).
# Keras tokenizer
from keras.preprocessing.text import Tokenizer
tk = Tokenizer(num_words = vocab_size)
tk.fit_on_texts(X_train)
X_train = tk.texts_to_sequences(X_train)
X_test = tk.texts_to_sequences(X_test)
# Pad everything with zeros
from keras.preprocessing.sequence import pad_sequences
X_train_seq = pad_sequences(X_train, maxlen = sequence_length, padding = 'post')
X_test_seq = pad_sequences(X_test, maxlen = sequence_length, padding = 'post')
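If you want to peek at what the tokenizer and padding produced (a hypothetical inspection snippet, not in the original post), you can check the word-index mapping and the shape of the padded sequences:

# Hypothetical inspection of the tokenizer and padding output
print(list(tk.word_index.items())[:5])  # most frequent words get the smallest indices
print(X_train_seq.shape)                # (number of training questions, sequence_length)
print(X_train_seq[0][:20])              # first 20 word indices of the first question; 0 is padding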
3) Training the embedding layer
Finally, in this part we will build and train our model, which consists of two main layers: an embedding layer that learns from the training documents prepared above, and a dense output layer that performs the classification task.
The embedding layer learns the representation of the words while the neural network is being trained, and it needs a lot of text data to provide accurate predictions. In our case, the 45,000 training observations are sufficient to learn the corpus effectively and classify the quality of the questions, as we will see from the metrics.
# Training embedding layer and neural network
from keras.models import Sequential
from keras.layers import Embedding, Dense, Flatten
model = Sequential()
model.add(Embedding(input_dim = vocab_size, output_dim = 5, input_length = sequence_length))
model.add(Flatten())
model.add(Dense(units = 3, activation = 'softmax'))
model.compile(loss = 'categorical_crossentropy',
              optimizer = 'rmsprop',
              metrics = ['accuracy'])
model.summary()
history = model.fit(X_train_seq, y_train, epochs = 20, batch_size = 512, verbose = 1)
# Save the model after training
#model.save("model.h5")
4) Evaluation and metric plots
All that remains is to evaluate the performance of our model and plot graphs to see how its accuracy and loss metrics change over the epochs.
The performance metrics of our model are shown in the screenshot below.
The code used to produce them follows.
# Evaluate the performance of the model on the test set
loss, accuracy = model.evaluate(X_test_seq, y_test, verbose = 1)
print("\nAccuracy: {}\nLoss: {}".format(accuracy, loss))
# Draw accuracy and loss
sb.set_style('darkgrid')
# 1) Accuracy
plt.plot(history.history['accuracy'], label = 'training', color = '#003399')
plt.legend(shadow = True, loc = 'lower right')
plt.title('Accuracy Plot over Epochs')
plt.show()
# 2) Loss
plt.plot(history.history['loss'], label = 'training loss', color = '#FF0033')
plt.legend(shadow = True, loc = 'upper right')
plt.title('Loss Plot over Epochs')
plt.show()
Training accuracy improving over the epochs
Training loss over 20 epochs
2. Pre-trained GloVe word embeddings
If you just want to run the model, the complete code is here: https://github.com/shraddha-an/nlp/blob/main/pretrained_glove_classification.ipynb
Instead of training your own embeddings, an alternative is to use pre-trained word embeddings such as GloVe or Word2Vec. In this part, we will use GloVe word embeddings trained on Wikipedia + Gigaword 5; download them here: https://nlp.stanford.edu/projects/glove/
i) Choose pre-trained word embeddings if:
Your dataset is made up of fairly "generic" language, and you don't have a very large dataset to begin with.
Since these embeddings have been trained on a huge number of words from a variety of sources, a pre-trained model may do very well if your data is also generic.
Moreover, with pre-trained embeddings you save time and computing resources.
ii) Choose to train your own embeddings if:
Your data (and project) is based on a niche industry, such as medicine, finance, or any other non-generic, highly specialized domain.
In such cases, a general word-embedding representation may not work for you, and some words may not even be in its vocabulary.
A large amount of domain-specific data will be needed to ensure that the learned embeddings correctly represent the different words and the semantic relationships between them.
It also takes a lot of computing resources to go through your corpus and build the word embeddings.
Ultimately, whether you train your own embeddings on the data you have or use pre-trained embeddings will depend on your project.
Obviously, you can still experiment with both models and pick the more accurate one, but the guideline above is a simplified one meant to help you decide.
The process
Most of the required steps were already covered in the previous section; only a few adjustments are needed here.
We just need to build an embedding matrix of words and their vectors, and then use it to set the weights of the embedding layer.
So the preprocessing, tokenization, and padding steps remain unchanged.
Once we have imported the original dataset and run the earlier text-cleaning steps, we run the code below to build the embedding matrix.
Decide how many embedding dimensions you want (50, 100, or 200) and include that in the path variable below.
# Import the embeddings
path = 'Full path to your glove file (with the dimensions)'

embeddings = dict()
with open(path, 'r', encoding = 'utf-8') as f:
    for line in f:
        # Each line in the file is a word followed by 50 numbers (the vector representing that word)
        values = line.split()

        # The first element of the line is the word; the remaining 50 are its vector
        embeddings[values[0]] = np.array(values[1:], 'float32')
# Set some parameters
vocab_size = 2100
glove_dim = 50
sequence_length = 200
# Build the embedding matrix from the words in our corpus
word_index = tk.word_index  # word -> index mapping built by the Keras tokenizer earlier

embedding_matrix = np.zeros((vocab_size, glove_dim))
for word, index in word_index.items():
    if index < vocab_size:
        try:
            # If an embedding exists for the given word, retrieve it and store it at that word's index
            embedding_matrix[index] = embeddings[word]
        except KeyError:
            pass
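As an optional check (a hypothetical snippet, not part of the original code), you can count how many of your vocabulary words were actually found in the GloVe file; any row of embedding_matrix left at all zeros corresponds to an out-of-vocabulary word:

# Hypothetical coverage check: how many vocabulary words have a GloVe vector?
found = sum(1 for word, index in word_index.items()
            if index < vocab_size and word in embeddings)
print('{} of the {} most frequent vocabulary words have a GloVe vector'.format(found, vocab_size))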
The code for building and training the embedding layer and the neural network has to be modified slightly so that the embedding matrix can be passed in as the layer's weights.
# Neural network
from keras.models import Sequential
from keras.layers import Embedding, Dense, Flatten

model = Sequential()
model.add(Embedding(input_dim = vocab_size,
                    output_dim = glove_dim,
                    input_length = sequence_length))
model.add(Flatten())
model.add(Dense(units = 3, activation = 'softmax'))

# Load our pre-trained embedding matrix into the embedding layer and freeze it
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False  # weights are not updated during training

# Compile after freezing the embedding layer so that the setting takes effect
model.compile(optimizer = 'adam', metrics = ['accuracy'], loss = 'categorical_crossentropy')

# Train the model
history = model.fit(X_train_seq, y_train, epochs = 20, batch_size = 512, verbose = 1)
Here are the performance metrics of the pre-trained embedding model on the test set.
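The metrics themselves appeared as a screenshot in the original post; to reproduce the numbers yourself, the evaluation call is the same one used for the first model:

# Evaluate the pre-trained-embedding model on the test set (same call as in section 1)
loss, accuracy = model.evaluate(X_test_seq, y_test, verbose = 1)
print("\nAccuracy: {}\nLoss: {}".format(accuracy, loss))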
Conclusion
Judging by the performance metrics of the two models, training our own embedding layer seems to be the better fit for this dataset.
Some possible reasons:
1) Most of the questions on Stack Overflow are about IT and programming; in other words, this is a domain-specific corpus.
2) The large training dataset of 45,000 samples gives our embedding layer plenty to learn from.
I hope this tutorial was helpful. Thanks for reading, and see you in the next article.
Link to the original article: https://towardsdatascience.com/a-guide-to-word-embeddings-8a23817ab60f