A Guide to Word Embeddings
2020-11-06 01:28:00 | Artificial intelligence meets pioneer
Author: Shraddha Anala | Compiled by: VK | Source: Towards Data Science
No matter who we are, reading, understanding, communicating, and ultimately producing new content is something we all do in our professional lives.
When it comes to extracting useful features from a given body of text, the processes involved are fundamentally different from working with, say, a vector of continuous integers (a bag of words). This is because the information in a sentence or piece of text is encoded in a structured order, and the semantic placement of words conveys the meaning of the text.
So the dual requirement of representing the data properly while preserving the context of the text led me to learn and implement two different NLP models for a text classification task.
A word embedding is a dense representation of an individual word in a text, taking into account the context and the other words it is associated with.
Compared with the simpler bag-of-words model, these real-valued vectors make more efficient use of their dimensions and capture the semantic relationships between words more effectively.

Simply put, words that have similar meanings, or that often appear in similar contexts, will have similar vector representations, depending on how "near" or "far apart" those words are in meaning.
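To make "near" and "far apart" concrete, here is a tiny illustrative sketch of the cosine similarity that is commonly used to compare word vectors. The three vectors are made up for illustration, not trained embeddings.

import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # 1.0 means the vectors point the same way (similar meaning); values near 0 mean unrelated
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 3-dimensional vectors purely for illustration
king = np.array([0.90, 0.80, 0.10])
queen = np.array([0.85, 0.75, 0.20])
banana = np.array([0.10, 0.05, 0.90])

print(cosine_similarity(king, queen))   # high -- similar contexts
print(cosine_similarity(king, banana))  # low  -- dissimilar contexts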
In this article, I will explore two kinds of word embeddings:
- Training our own embeddings
- Pre-trained GloVe word embeddings
Dataset
For this case study, we will use the Stack Overflow dataset from Kaggle (https://www.kaggle.com/imoore/60k-stack-overflow-questions-with-quality-rate). It contains 60,000 questions asked by users on the site, and the main task is to classify them into 3 quality categories.
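As a quick optional check (my addition; it assumes the train.csv file from the Kaggle link above, with the label column named 'Y' as in the code later on), you can peek at how the three quality classes are distributed:

import pandas as pd

# Count how many questions fall into each of the 3 quality categories
print(pd.read_csv('train.csv')['Y'].value_counts())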
Now let's look at the actual models for this multi-class NLP project.
But before we start, make sure you have these packages/libraries installed.
pip install gensim # used for the NLP preprocessing steps
pip install keras # for the embedding layer
1. Training word embeddings
If you'd rather skip the explanation, the complete code for the first model is here: https://github.com/shraddha-an/nlp/blob/main/word_embedding_classification.ipynb
1) Data preprocessing
In the first model, we will train a neural network to learn embeddings from our corpus of text. Specifically, we will use the Keras library to tokenize and index the words and feed them to the neural network's embedding layer.
Before training the network, some key parameters have to be decided: the vocabulary size (the number of unique words in the corpus) and the dimension of the embedding vectors.
The training and test datasets are at the link below. We will import them, keeping only the question and quality columns for analysis: https://www.kaggle.com/imoore/60k-stack-overflow-questions-with-quality-rate
I also renamed the columns and defined a function text_clean to clean the questions.
# Import libraries
# Data manipulation/processing
import pandas as pd, numpy as np
# Visualization
import seaborn as sb, matplotlib.pyplot as plt
# NLP
import re
from nltk.corpus import stopwords
from gensim.utils import simple_preprocess
stop_words = set(stopwords.words('english'))

# Import the datasets
dataset = pd.read_csv('train.csv')[['Body', 'Y']].rename(columns = {'Body': 'question', 'Y': 'category'})
ds = pd.read_csv('valid.csv')[['Body', 'Y']].rename(columns = {'Body': 'question', 'Y': 'category'})

# Clean up symbols and HTML tags
symbols = re.compile(pattern = r'[/<>(){}\[\]\|@,;]')
tags = ['href', 'http', 'https', 'www']

def text_clean(s: str) -> str:
    s = symbols.sub(' ', s)
    for i in tags:
        s = s.replace(i, ' ')
    return ' '.join(word for word in simple_preprocess(s) if word not in stop_words)

dataset.iloc[:, 0] = dataset.iloc[:, 0].apply(text_clean)
ds.iloc[:, 0] = ds.iloc[:, 0].apply(text_clean)

# Training and test sets
X_train, y_train = dataset.iloc[:, 0].values, dataset.iloc[:, 1].values.reshape(-1, 1)
X_test, y_test = ds.iloc[:, 0].values, ds.iloc[:, 1].values.reshape(-1, 1)

# One-hot encode the labels
from sklearn.preprocessing import OneHotEncoder as ohe
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(transformers = [('one_hot_encoder', ohe(categories = 'auto'), [0])],
                       remainder = 'passthrough')
y_train = ct.fit_transform(y_train)
y_test = ct.transform(y_test)

# Set parameters
vocab_size = 2000
sequence_length = 100
If you look at the raw dataset, you'll find the questions wrapped in HTML tags, for example <p>…question</p>. In addition, words such as href, https, etc. appear throughout the text, so I want to make sure both sets of unnecessary tokens are removed.
Gensim's simple_preprocess method returns a list of lowercase tokens, and can also strip accents (via its deacc argument).
The apply method runs the preprocessing function on each row and returns the output before moving on to the next row. The text preprocessing is applied to both the training and test datasets.
Because the dependent variable has 3 categories, we apply one-hot encoding and initialize a few parameters for later use.
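As a quick sanity check (illustrative only, with a made-up question rather than one from the dataset), running text_clean on a sample string shows the tags, symbols, and stopwords being stripped:

sample = '<p>How do I parse JSON from an href in Python? See https://example.com</p>'
print(text_clean(sample))
# Roughly: 'parse json python see example com' -- lowercased, with tags, symbols and stopwords removed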
2) Tokenization
Next, we will use the Keras Tokenizer class to convert the questions, which are made up of words, into arrays of integers, with each word represented by its index.
So we must first use the fit_on_texts method to build an indexed vocabulary from the words that appear in the dataset.
After building the vocabulary, we use the texts_to_sequences method to convert each sentence into a list of numbers representing its words.
The pad_sequences function ensures that all observations have the same length, which can be set to any arbitrary number or to the length of the longest question in the dataset.
The vocab_size parameter we initialized earlier is simply the size of the vocabulary we want the Tokenizer to learn and index.
# Keras Tokenizer
from keras.preprocessing.text import Tokenizer
tk = Tokenizer(num_words = vocab_size)
tk.fit_on_texts(X_train)
X_train = tk.texts_to_sequences(X_train)
X_test = tk.texts_to_sequences(X_test)

# Pad everything out with zeros
from keras.preprocessing.sequence import pad_sequences
X_train_seq = pad_sequences(X_train, maxlen = sequence_length, padding = 'post')
X_test_seq = pad_sequences(X_test, maxlen = sequence_length, padding = 'post')
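To see what the Tokenizer and padding actually produce, here is a small toy example (my own illustration, separate from the real training data):

toy = Tokenizer(num_words = 10)
toy.fit_on_texts(['how to sort a list in python', 'python list comprehension'])
# Each word is replaced by its integer index (indices are assigned by word frequency)
print(toy.texts_to_sequences(['sort a python list']))
# Zeros are appended ('post') so that every sequence has the same length of 6
print(pad_sequences(toy.texts_to_sequences(['sort a python list']), maxlen = 6, padding = 'post'))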
3) Training the embedding layer
Finally, in this part we build and train our model, which consists of two main layers: an embedding layer that learns from the training documents prepared above, and a dense output layer that performs the classification task.
The embedding layer learns the representations of the words while the neural network is training, and it needs a lot of text data to yield accurate predictions. In our case, the 45,000 training observations are enough to learn the corpus effectively and classify the quality of the questions, as we will see from the metrics.
# Training the embedding layer and the neural network
from keras.models import Sequential
from keras.layers import Embedding, Dense, Flatten

model = Sequential()
model.add(Embedding(input_dim = vocab_size, output_dim = 5, input_length = sequence_length))
model.add(Flatten())
model.add(Dense(units = 3, activation = 'softmax'))
model.compile(loss = 'categorical_crossentropy',
              optimizer = 'rmsprop',
              metrics = ['accuracy'])
model.summary()

history = model.fit(X_train_seq, y_train, epochs = 20, batch_size = 512, verbose = 1)

# Save the model after training
#model.save("model.h5")
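As a usage sketch (my addition, not part of the original post), the trained model and the fitted Tokenizer can be reused to score a brand-new question; the question text here is made up:

# Clean, tokenize and pad a new question exactly like the training data
new_q = [text_clean('How do I merge two dictionaries in Python?')]
new_seq = pad_sequences(tk.texts_to_sequences(new_q), maxlen = sequence_length, padding = 'post')
# Returns 3 softmax probabilities, one per quality class
print(model.predict(new_seq))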
4) Evaluation and metric plots
All that remains is to evaluate the performance of our model and to plot how its accuracy and loss change over the epochs.
The performance metrics of our model are shown in the screenshot below.

The code for this is shown below.
# Evaluate the model's performance on the test set
loss, accuracy = model.evaluate(X_test_seq, y_test, verbose = 1)
print("\nAccuracy: {}\nLoss: {}".format(accuracy, loss))

# Plot accuracy and loss
sb.set_style('darkgrid')

# 1) Accuracy
plt.plot(history.history['accuracy'], label = 'training', color = '#003399')
plt.legend(shadow = True, loc = 'lower right')
plt.title('Accuracy Plot over Epochs')
plt.show()

# 2) Loss
plt.plot(history.history['loss'], label = 'training loss', color = '#FF0033')
plt.legend(shadow = True, loc = 'upper right')
plt.title('Loss Plot over Epochs')
plt.show()
Accuracy improving over the training epochs:

Loss over the 20 epochs:

2. Pre-trained GloVe word embeddings
If you just want to run the model, the complete code is here: https://github.com/shraddha-an/nlp/blob/main/pretrained_glove_classification.ipynb
Instead of training your own embeddings, the other option is to use pre-trained word embeddings such as GloVe or Word2Vec. In this part, we will use GloVe word embeddings trained on Wikipedia + Gigaword 5; download them from here: https://nlp.stanford.edu/projects/glove/
i) Choose pre-trained word embeddings if:
Your dataset is made up of more "general" language, and you don't have a particularly large dataset.
Because these embeddings have been trained on a huge number of words from varied sources, a pre-trained model is likely to do well if your data is also general in nature.
Besides, pre-trained embeddings save you time and computing resources.
ii) Choose to train your own embeddings if:
Your data (and project) is based on a niche industry, such as medicine, finance, or any other non-generic, highly specific domain.
In such cases, a general word embedding representation may not work for you, and some words may be missing from the vocabulary altogether.
A large amount of domain data is needed to ensure that the learned word embeddings correctly represent the different words and the semantic relationships between them.
It also takes considerable computing resources to go through your corpus and build the word embeddings.
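If you do go down the train-your-own route outside of a Keras embedding layer, one common option is gensim's Word2Vec. The sketch below is my own illustration (assuming gensim >= 4.0 and the cleaned dataset DataFrame from section 1), not something this walkthrough relies on:

from gensim.models import Word2Vec

# Each cleaned question becomes a list of tokens; text_clean already lowercased and filtered them
sentences = [q.split() for q in dataset['question']]
w2v = Word2Vec(sentences, vector_size = 50, window = 5, min_count = 2, workers = 4)
# Nearest neighbours of a word in the learned space (assumes 'python' survived the min_count filter)
print(w2v.wv.most_similar('python', topn = 5))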
Ultimately, whether you train your own embeddings on your data or use pre-trained embeddings will depend on your project.
Obviously you can still experiment with both and pick the more accurate model, but the guideline above is a simplified one meant to help you decide.
The process
Most of the steps from the previous section carry over here; we only need to make a few adjustments.
We just need to build an embedding matrix mapping each word to its vector, and then use it to set the weights of the embedding layer.
So the preprocessing, tokenization, and padding steps stay the same.
Once we have imported the raw datasets and run the text-cleaning steps from before, we run the code below to build the embedding matrix.
Decide how many dimensions you want for the embeddings (50, 100, or 200) and include the corresponding file name in the path variable below.
# Import the embeddings
path = 'Full path to your glove file (with the dimensions)'

embeddings = dict()
with open(path, 'r', encoding = 'utf-8') as f:
    for line in f:
        # Each line in the file is a word followed by 50 numbers (the vector representing that word)
        values = line.split()
        # The first element of each line is the word, the remaining 50 are its vector
        embeddings[values[0]] = np.array(values[1:], 'float32')

# Set some parameters
vocab_size = 2100
glove_dim = 50
sequence_length = 200

# Build the embedding matrix from the words in the corpus
# word_index is the word-to-index mapping learned by the Keras Tokenizer fitted earlier
word_index = tk.word_index
embedding_matrix = np.zeros((vocab_size, glove_dim))
for word, index in word_index.items():
    if index < vocab_size:
        try:
            # If an embedding exists for the given word, retrieve it and map it to that word
            embedding_matrix[index] = embeddings[word]
        except KeyError:
            pass
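An optional sanity check (my addition) is to see how much of our vocabulary is actually covered by the GloVe file; words without a pre-trained vector stay as all-zero rows in the matrix:

covered = sum(1 for word, index in word_index.items()
              if index < vocab_size and word in embeddings)
print('{} of {} vocabulary slots have a pre-trained GloVe vector'.format(covered, vocab_size))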
The code for building and training the embedding layer and the neural network needs a slight modification so that the embedding matrix can be used as the layer's weights.
# Neural network
from keras.models import Sequential
from keras.layers import Embedding, Dense, Flatten

model = Sequential()
model.add(Embedding(input_dim = vocab_size,
                    output_dim = glove_dim,
                    input_length = sequence_length))
model.add(Flatten())
model.add(Dense(units = 3, activation = 'softmax'))
model.compile(optimizer = 'adam', metrics = ['accuracy'], loss = 'categorical_crossentropy')

# Load the pre-trained embedding matrix into the embedding layer
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False # the weights are not updated during training

# Train the model
history = model.fit(X_train_seq, y_train, epochs = 20, batch_size = 512, verbose = 1)
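The evaluation step is unchanged from the first model; as a reminder, a minimal sketch:

loss, accuracy = model.evaluate(X_test_seq, y_test, verbose = 1)
print("\nAccuracy: {}\nLoss: {}".format(accuracy, loss))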
Here are the performance metrics of the pre-trained-embedding model on the test set.

Conclusion
Judging from the performance metrics of the two models, training our own embedding layer seems to be the better fit for this dataset.
Some possible reasons:
1) Most Stack Overflow questions are about IT and programming; in other words, this is a domain-specific corpus.
2) The large training set of 45,000 samples gives our embedding layer a good scenario to learn from.
I hope this tutorial was helpful. Thanks for reading, and see you in the next article.
Link to the original text :https://towardsdatascience.com/a-guide-to-word-embeddings-8a23817ab60f