Word2vec + regression model for a classification task
2022-07-28 06:11:00 【Alan and fish】
This post uses word2vec plus a regression model to complete a classification (prediction) task.
Dataset
Link: https://pan.baidu.com/s/1d8IbyXcyo-uG65ZPdgkXzg
Extraction code: nbpa
1. Data preprocessing module
Ideas:
- Read the tsv data with pandas
- Strip the HTML tags
- Remove punctuation
- Tokenize
- Remove stop words
- Join the remaining words back into new sentences
(1) Imports
import re
import numpy as np
import pandas as pd
import warnings
from bs4 import BeautifulSoup
from gensim.models.word2vec import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
import nltk
import matplotlib.pyplot as plt
import itertools
from tqdm import tqdm
(2) Reading data
# Read in the training data with pandas
df = pd.read_csv('../movie_data/labeledTrainData.tsv', sep='\t', escapechar='\\')
print('Number of reviews: {}'.format(len(df)))
print(df)
The raw data looks like this: it was crawled from the web, so it is mixed with a lot of other useless markup.
(3) Data preprocessing
# 1. Strip the HTML tags from row 1000 of the review column
example = BeautifulSoup(df['review'][1000], 'html.parser').get_text()
# 2. Remove punctuation
example_letters = re.sub(r'[^a-zA-Z]', ' ', example)
words = example_letters.lower().split()
# 3. Get stop words
stopwords = {}.fromkeys([ line.rstrip() for line in open('../movie_data/stopwords.txt')])
# After loading the stop words, deduplicate them with a set
eng_stopwords = set(stopwords)
# 4. Remove stop words
words_nostop = [w for w in words if w not in stopwords]
(4) Data cleaning
# This cleaning function strips HTML, removes punctuation, lowercases the text and drops stop words
def clean_text(text):
    text = BeautifulSoup(text, 'html.parser').get_text()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    words = text.lower().split()
    words = [w for w in words if w not in eng_stopwords]
    return ' '.join(words)
# Data cleaning
words=clean_text(df['review'][1000])
# Add the cleaned data as a new column of the dataframe
df['clean_review'] = df.review.apply(clean_text)
After preprocessing, the cleaned data is much more regular. We then add a new column, clean_review, to the original dataframe; this column holds the cleaned text, which is what we will use to train word2vec.
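A quick sanity check (purely illustrative) is to print the first part of one raw review next to its cleaned version:

print(df['review'][1000][:200])
print(df['clean_review'][1000][:200])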
(5) Tokenization
# Build a sentence tokenizer (nltk.download() opens the interactive downloader; the punkt data is the part this script needs)
nltk.download()
warnings.filterwarnings("ignore")
# nltk.download('punkt')
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
# Get the cleaned review column from df
review_part = df['clean_review']
# Show what the dataframe looks like after adding clean_review
print(df.head())
# Use nltk to split each review into sentences and clean them
def split_sentences(review):
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = [clean_text(s) for s in raw_sentences if s]
    return sentences
sentences = sum(review_part.apply(split_sentences), [])
print('{} reviews -> {} sentences'.format(len(review_part), len(sentences)))
print(sentences)
The tokenizer here is nltk's punkt tokenizer. nltk.download() can fail, in which case you need to download the data yourself and import it manually. After each review is split into sentences, all the cleaned sentences are collected into one list, giving the data below.
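If nltk.download() cannot reach the download server, one workaround is to fetch the punkt data manually and register its location; a small sketch under that assumption (the local directory path is hypothetical):

try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    # Either download only the punkt package...
    # nltk.download('punkt')
    # ...or unpack a manually downloaded punkt.zip under some directory and register it:
    nltk.data.path.append('/path/to/local/nltk_data')

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')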
(6) Split each cleaned sentence into words and collect the word lists into a single list
sentences_list = []
for line in sentences:
    sentences_list.append(nltk.word_tokenize(line))
print(sentences_list)
The resulting list of tokenized sentences looks like the figure below; word2vec needs its training corpus broken down word by word.
2. word2vec model module
(1) Set the parameters required by the model
# Set the parameters of word vector training
'''
sentences: an iterable of tokenized sentences (e.g. a list of lists of words).
sg: training algorithm; 0 (the default) uses CBOW, 1 uses skip-gram.
size: dimensionality of the word vectors, default 100. A larger size needs more training data but can give better results; values from tens to a few hundred are typical. (In gensim >= 4.0 this parameter is named vector_size.)
window: maximum distance between the current word and the predicted word within a sentence.
alpha: the learning rate.
seed: seed for the random number generator, used when initialising the word vectors.
min_count: truncates the vocabulary; words with frequency lower than min_count are discarded. The default is 5.
max_vocab_size: RAM limit while building the vocabulary. If there are more unique words than this, the least frequent ones are pruned. Roughly every 10 million word types need about 1 GB of RAM. Set to None for no limit.
workers: number of worker threads used for training.
hs: if 1, hierarchical softmax is used; if 0 (the default), negative sampling is used.
negative: if > 0, negative sampling is used, and this sets how many noise words are drawn.
iter: number of training epochs, default 5. (In gensim >= 4.0 this parameter is named epochs.)
'''
num_features = 300 # Word vector dimensionality
min_word_count = 40 # Minimum word count
num_workers = 4 # Number of threads to run in parallel
context = 10 # Context window size
model_name = '{}features_{}minwords_{}context.model'.format(num_features, min_word_count, context)
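As a hedged side note on the sg option described above, this is roughly what training the same corpus with skip-gram instead of the default CBOW would look like, using the gensim >= 4.0 parameter names (size is now vector_size, iter is now epochs); the actual training in the next subsection sticks with CBOW:

sg_model = Word2Vec(
    sentences_list,
    sg=1,                      # skip-gram instead of the default CBOW
    vector_size=num_features,
    window=context,
    min_count=min_word_count,
    workers=num_workers,
    epochs=5,
)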
(2) Train the model
# Train the model
model=Word2Vec(sentences_list,workers=num_workers,vector_size=num_features, min_count = min_word_count, window = context)
# Normalise the word vectors in place (deprecated in gensim >= 4.0 and can be omitted there)
model.init_sims(replace=True)
# Save the model
model.save(r"F:\python\word2vect\model\demo3_model")
# test
# Compare the similarity of these words and return the one that does not match the others
print(model.wv.doesnt_match(['man','woman','child','kitchen']))
# Find the words most similar to 'boy'
print(model.wv.most_similar("boy"))
In the first test, the word least related to the others is 'kitchen'.
The result of the second test is shown below.
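To reuse the trained model later without retraining, it can be reloaded from the saved file; a minimal sketch (the path simply mirrors the save() call above):

loaded_model = Word2Vec.load(r"F:\python\word2vect\model\demo3_model")
print(loaded_model.wv.most_similar("boy"))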
3. Convert the train and test data into vectors
What we need is a vector for a whole sentence (review), not for a single word, so each review is tokenized and the vectors of its words are combined into one averaged vector.
This representation is admittedly rough; a better option is tf-idf weighting, which I will study later.
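As a rough illustration of that tf-idf idea, the sketch below (assuming the clean_review column, the trained model, and scikit-learn's TfidfVectorizer; the helper name to_weighted_review_vector is made up for this example) weights each word vector by the word's idf; summing that weight over every occurrence of a word amounts to tf-idf weighting within a review:

from sklearn.feature_extraction.text import TfidfVectorizer

# Fit a tf-idf model on the cleaned reviews (get_feature_names_out needs scikit-learn >= 1.0)
tfidf = TfidfVectorizer()
tfidf.fit(df['clean_review'])
idf_weights = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def to_weighted_review_vector(review):
    words = clean_text(review).split()
    word_vec = np.zeros(300)
    total_weight = 0.0
    for word in words:
        if word in model.wv.key_to_index and word in idf_weights:
            word_vec += model.wv[word] * idf_weights[word]
            total_weight += idf_weights[word]
    # Normalise by the total weight so long and short reviews stay comparable
    return pd.Series(word_vec / total_weight if total_weight > 0 else word_vec)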
# Convert a raw review into a single 300-dimensional vector by summing the vectors of its words
def to_review_vector(review):
    global word_vec
    # clean_text already strips HTML, punctuation and stop words; then split into words
    review = clean_text(review)
    words = review.split()
    word_vec = np.zeros((1, 300))
    for word in words:
        if word in model.wv.key_to_index:
            word_vec += np.array([model.wv[word]])
    # word_vec has a single row, so mean(axis=0) simply returns the summed 300-dim vector as a Series
    return pd.Series(word_vec.mean(axis=0))
# pandas apply() feeds each entry of the review column into to_review_vector
train_data_features = df.review.apply(to_review_vector)
print(" Output the word vector after superposition ")
print(train_data_features.head())
Since each word vector has 300 dimensions, the vector obtained by combining them is also 300-dimensional; every review row therefore gets its own 300-dimensional sentence vector.
Next, split the data into a training set (80%) and a test set (20%).
X_train, X_test, y_train, y_test = train_test_split(train_data_features,df.sentiment,test_size = 0.2, random_state = 0)
4. Logistic regression module
# Get the logistic regression model
LR_model = LogisticRegression()
# Fit the model on the training set
LR_model = LR_model.fit(X_train, y_train)
# Predict on the test set
y_pred = LR_model.predict(X_test)
# Compute the confusion matrix from the true and predicted labels
cnf_matrix = confusion_matrix(y_test, y_pred)
The confusion matrix is as follows.
5. Visualize the results
def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,title='Confusion matrix')
plt.show()
This function visualizes the confusion matrix; from the plot above we can read off the following numbers.
In the confusion matrix, the dark blue cells are the correct predictions and the grey ones are the errors.
Evaluate the model
A model is usually evaluated with its accuracy, precision, recall and F1 score.
TP (true positive): actually positive and predicted positive; e.g. actually a boy, predicted as a boy.
FP (false positive): actually negative but predicted positive; e.g. actually a girl, predicted as a boy.
FN (false negative): actually positive but predicted negative; e.g. actually a boy, predicted as a girl.
TN (true negative): actually negative and predicted negative; e.g. actually a girl, predicted as a girl.
Formulas:
Accuracy = (TP + TN) / total samples.
Definition: for a given test set, the proportion of samples the classifier labels correctly out of all samples.
Precision = TP / (TP + FP).
Meaning: of the samples predicted as positive, how many are truly positive; it is defined with respect to the predictions.
Recall = TP / (TP + FN).
Meaning: of the truly positive samples, how many are predicted correctly; it is defined with respect to the original samples.
F1 = (2 x precision x recall) / (precision + recall).
The F1 score is the harmonic mean of precision and recall.
The code is as follows
print("accuracy(test): ", (cnf_matrix[1,1]+cnf_matrix[0,0])/(cnf_matrix[0,0]+cnf_matrix[1,1]+cnf_matrix[1,0]+cnf_matrix[0,1]))
print("precision:",(cnf_matrix[0,0])/(cnf_matrix[0,0]+cnf_matrix[1,0]))
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
The result is as follows :
accuracy(test): 0.861
precision: 0.8773930753564155
Recall : 0.8772430668841762
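As a cross-check, these metrics can also be computed directly from y_test and y_pred with scikit-learn (a minimal sketch; precision_score reports precision for the positive class, whereas the hand-computed value above uses the first column of the confusion matrix, so the two numbers can differ slightly):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))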