Word2vec + regression model for a classification task
2022-07-28 06:11:00 【Alan and fish】
This post uses word2vec together with a logistic regression model to complete a classification (sentiment prediction) task.
Dataset
Link: https://pan.baidu.com/s/1d8IbyXcyo-uG65ZPdgkXzg
Extraction code: nbpa
1. Data preprocessing module
Ideas:
- Read the tsv data with pandas
- Remove HTML tags
- Remove punctuation
- Tokenize the text
- Remove stop words
- Reassemble the cleaned words into new sentences
(1) Imports
import re
import numpy as np
import pandas as pd
import warnings
from bs4 import BeautifulSoup
from gensim.models.word2vec import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
import nltk
import matplotlib.pyplot as plt
import itertools
from tqdm import tqdm
(2) Reading data
# Read the training data with pandas
df = pd.read_csv('../movie_data/labeledTrainData.tsv', sep='\t', escapechar='\\')
print('Number of reviews: {}'.format(len(df)))
print(df)
The raw data looks like this: all of it was crawled from the web, so it is mixed with a lot of useless markup.
(3) Data preprocessing
# 1. Strip the HTML tags from one review (row 1000 of the 'review' column)
example = BeautifulSoup(df['review'][1000], 'html.parser').get_text()
# 2. Remove punctuation
example_letters = re.sub(r'[^a-zA-Z]', ' ', example)
words = example_letters.lower().split()
# 3. Get stop words
stopwords = {}.fromkeys([ line.rstrip() for line in open('../movie_data/stopwords.txt')])
# Deduplicate the stop words with a set
eng_stopwords = set(stopwords)
# 4. Remove stop words
words_nostop = [w for w in words if w not in stopwords]
(4) Data cleaning
# This cleaning function strips HTML tags, removes punctuation, lowercases the text, and removes stop words
def clean_text(text):
    text = BeautifulSoup(text, 'html.parser').get_text()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    words = text.lower().split()
    words = [w for w in words if w not in eng_stopwords]
    return ' '.join(words)
# Data cleaning
words=clean_text(df['review'][1000])
# Add the cleaned reviews to the dataframe as a new column
df['clean_review'] = df.review.apply(clean_text)
After preprocessing, the data is much cleaner. We then add a new column, clean_review, to the original dataframe; it holds the cleaned text, which is what we will use to train word2vec.
(5) Tokenization
# Build the sentence tokenizer
nltk.download('punkt')  # download the punkt sentence-tokenizer data
warnings.filterwarnings("ignore")
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
# Get the cleaned review column from df
review_part = df['clean_review']
# Show the dataframe after adding the clean_review column
print(df.head())
# Split each review into sentences with nltk
def split_sentences(review):
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = [clean_text(s) for s in raw_sentences if s]
    return sentences
sentences = sum(review_part.apply(split_sentences), [])
print('{} reviews -> {} sentences'.format(len(review_part), len(sentences)))
print(sentences)
The sentence splitter used here is nltk's punkt tokenizer. The nltk downloader can fail, in which case you need to download the data manually and load it yourself. After each review is split into sentences and cleaned again, all the sentences are collected into one list, which gives the data shown below.
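If the downloader cannot reach the server, one workaround is to place the punkt data in a local directory and point nltk at it. This is only a sketch; the directory path below is an example, not from the original post.
# Download (or copy) the punkt data to a local directory and register it with nltk
import nltk

local_nltk_dir = '../nltk_data'          # example location, adjust as needed
nltk.download('punkt', download_dir=local_nltk_dir)
nltk.data.path.append(local_nltk_dir)    # make nltk search this directory
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')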
(6) Tokenize every sentence into words and collect them in a single list
sentences_list = []
for line in sentences:
    sentences_list.append(nltk.word_tokenize(line))
print(sentences_list)
The resulting token list is shown in the figure below; word2vec trains on individual word tokens, so each sentence must be given as a list of words.
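For clarity, Word2Vec expects its training corpus to be a list of sentences, where each sentence is a list of string tokens. The tokens below are made-up examples just to show the shape; sentences_list built above has exactly this structure.
# Word2Vec input format: List[List[str]] (illustrative values only)
example_input = [
    ['this', 'movie', 'great'],
    ['plot', 'boring', 'acting', 'worse'],
]
print(type(sentences_list[0]), sentences_list[0][:10])  # first ten tokens of the first sentence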
2. Word2vec model module
(1) Set the parameters required by the model
# Set the parameters of word vector training
'''
sentences: the training corpus, e.g. a list of tokenized sentences
sg: training algorithm; 0 (default) uses CBOW, 1 uses skip-gram.
size: dimensionality of the word vectors, default 100. Larger values need more training data but can give better results; typical values range from tens to a few hundred. (Renamed vector_size in gensim >= 4.)
window: maximum distance between the current word and the predicted word within a sentence
alpha: the learning rate
seed: seed for the random number generator, used when initializing the word vectors
min_count: prunes the vocabulary; words occurring fewer than min_count times are discarded, default 5
max_vocab_size: RAM limit while building the vocabulary. If there are more unique words than this, the least frequent ones are pruned. Roughly 1 GB of RAM is needed per 10 million word types. Set to None for no limit.
workers: number of worker threads used for training.
hs: if 1, hierarchical softmax is used; if 0 (default), negative sampling is used.
negative: if > 0, negative sampling is used; sets how many noise words are drawn
iter: number of training iterations, default 5. (Renamed epochs in gensim >= 4.)
'''
num_features = 300 # Word vector dimensionality
min_word_count = 40 # Minimum word count
num_workers = 4 # Number of threads to run in parallel
context = 10 # Context window size
model_name = '{}features_{}minwords_{}context.model'.format(num_features, min_word_count, context)
(2) Train the model
# Train the word2vec model
model = Word2Vec(sentences_list, workers=num_workers, vector_size=num_features, min_count=min_word_count, window=context)
model.init_sims(replace=True)  # normalize the vectors in place (deprecated in recent gensim versions)
# Save the model
model.save(r"F:\python\word2vect\model\demo3_model")
# test
# Find the word that does not belong with the others
print(model.wv.doesnt_match(['man','woman','child','kitchen']))
# Find the words most similar to 'boy'
print(model.wv.most_similar("boy"))
In the first test, the word that does not belong with the others is kitchen.
The result of the second test is shown below.
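The saved model can also be loaded back later without retraining. A minimal sketch (the query word 'girl' is just an example, not from the original post):
# Reload the saved model and query it again
from gensim.models.word2vec import Word2Vec

loaded_model = Word2Vec.load(r"F:\python\word2vect\model\demo3_model")
print(loaded_model.wv.most_similar('girl', topn=5))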
3. Convert the train/test data into vectors
What we need is a vector for a whole review, not a single word vector, so we tokenize each review, add up the vectors of all its words, and take the average.
This representation is not very precise; a better option is TF-IDF weighting, which I will look into later.
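For reference, here is a rough sketch of the TF-IDF-weighted average mentioned above, where each word vector is weighted by its IDF score. The helper name to_review_vector_tfidf and the use of TfidfVectorizer are my own additions and are not part of the original post; the post itself uses the plain accumulation below.
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit TF-IDF on the cleaned reviews to get an IDF weight for every word
tfidf = TfidfVectorizer()
tfidf.fit(df['clean_review'])
idf_weights = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))  # get_feature_names() on older sklearn

def to_review_vector_tfidf(review):
    words = clean_text(review).split()
    vec = np.zeros(300)
    total_weight = 0.0
    for word in words:
        # Weight each in-vocabulary word vector by its IDF score
        if word in model.wv.key_to_index and word in idf_weights:
            vec += idf_weights[word] * model.wv[word]
            total_weight += idf_weights[word]
    if total_weight > 0:
        vec /= total_weight
    return pd.Series(vec)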
# Turn each review into a single vector: clean the text, look up the word2vec
# vector of every word, and accumulate them into one 300-dimensional vector
def to_review_vector(review):
    review = clean_text(review)
    words = review.split()
    word_vec = np.zeros((1, 300))
    for word in words:
        # Only words that are in the word2vec vocabulary contribute
        if word in model.wv.key_to_index:
            word_vec += np.array([model.wv[word]])
    # word_vec has shape (1, 300), so mean(axis=0) simply returns the
    # accumulated 300-dimensional vector as a Series, one entry per dimension
    return pd.Series(word_vec.mean(axis=0))
# pandas apply() feeds each value of the review column into to_review_vector
train_data_features = df.review.apply(to_review_vector)
print(" Output the word vector after superposition ")
print(train_data_features.head())
Because each word vector has 300 dimensions, the accumulated vector for each review also has 300 dimensions; every row of train_data_features is the 300-dimensional sentence vector for one review.
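A quick sanity check (a minimal sketch; the exact row count depends on your dataset): each review should map to exactly one 300-dimensional row, and df.sentiment is the label column used below.
# One 300-dimensional row per review
print(train_data_features.shape)    # expected: (number of reviews, 300)
print(df.sentiment.value_counts())  # label distribution of the target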
Next, split the data into a training set (80%) and a test set (20%).
X_train, X_test, y_train, y_test = train_test_split(train_data_features,df.sentiment,test_size = 0.2, random_state = 0)
4. Logistic regression module
# Build the logistic regression model
LR_model = LogisticRegression()
# Fit the model on the training set
LR_model = LR_model.fit(X_train, y_train)
# Predict on the test set
y_pred = LR_model.predict(X_test)
# Build a confusion matrix from the true and predicted labels
cnf_matrix = confusion_matrix(y_test, y_pred)
The confusion matrix is as follows.
5. Visualize the results
def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,title='Confusion matrix')
plt.show()
This function visualizes the confusion matrix; the numbers below can be read off from the plot.
In the confusion matrix, the dark blue cells are the correctly predicted samples and the lighter cells are the errors.
6. Model evaluation
Models are generally evaluated with accuracy, precision, recall and the F1 score.
TP: (actually positive, predicted positive) actually a boy, predicted to be a boy;
FP: (actually negative, predicted positive) actually a girl, predicted to be a boy;
FN: (actually positive, predicted negative) actually a boy, predicted to be a girl;
TN: (actually negative, predicted negative) actually a girl, predicted to be a girl;
Formulas:
Accuracy = (TP + TN) / total number of samples.
Definition: for a given test set, the proportion of samples the classifier labels correctly out of all samples.
Precision = TP / (TP + FP).
Meaning: of the samples predicted to be positive, how many are truly positive; it is computed with respect to the predictions.
Recall = TP / (TP + FN).
Meaning: of the positive samples in the data, how many are predicted correctly; it is computed with respect to the original samples.
F1 = (2 × precision × recall) / (precision + recall)
The F1 score is the harmonic mean of precision and recall.
The code is as follows
print("accuracy(test): ", (cnf_matrix[1,1]+cnf_matrix[0,0])/(cnf_matrix[0,0]+cnf_matrix[1,1]+cnf_matrix[1,0]+cnf_matrix[0,1]))
print("precision:",(cnf_matrix[0,0])/(cnf_matrix[0,0]+cnf_matrix[1,0]))
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
The result is as follows :
accuracy(test): 0.861
precision: 0.8773930753564155
Recall : 0.8772430668841762
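The same quantities can also be obtained directly from sklearn's metric helpers instead of indexing the confusion matrix by hand; a minimal cross-check sketch:
# Cross-check the manual confusion-matrix arithmetic with sklearn's built-in metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('accuracy :', accuracy_score(y_test, y_pred))
print('precision:', precision_score(y_test, y_pred))
print('recall   :', recall_score(y_test, y_pred))
print('f1       :', f1_score(y_test, y_pred))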