Word2vec + regression model for a classification task
2022-07-28 06:11:00 【Alan and fish】
This post uses word2vec together with a logistic regression model to complete a classification (sentiment prediction) task.
Dataset
Link: https://pan.baidu.com/s/1d8IbyXcyo-uG65ZPdgkXzg
Extraction code: nbpa
1. Data preprocessing module
Ideas:
- Read the tsv data with pandas
- Remove HTML tags
- Remove punctuation
- Tokenize the text
- Remove stop words
- Reassemble the cleaned words into new sentences
(1) Imports
import re
import numpy as np
import pandas as pd
import warnings
from bs4 import BeautifulSoup
from gensim.models.word2vec import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
import nltk
import matplotlib.pyplot as plt
import itertools
from tqdm import tqdm
(2) Reading data
# Read the training data with pandas
df = pd.read_csv('../movie_data/labeledTrainData.tsv', sep='\t', escapechar='\\')
print('Number of reviews: {}'.format(len(df)))
print(df)
The raw data looks like this: all of it was crawled from the web, so it is mixed with a lot of useless markup.
(3) Data preprocessing
# 1. Strip the HTML tags from one review (row 1000 of the 'review' column)
example = BeautifulSoup(df['review'][1000], 'html.parser').get_text()
# 2. Remove punctuation
example_letters = re.sub(r'[^a-zA-Z]', ' ', example)
words = example_letters.lower().split()
# 3. Get stop words
stopwords = {}.fromkeys([ line.rstrip() for line in open('../movie_data/stopwords.txt')])
# Deduplicate the stop words with a set
eng_stopwords = set(stopwords)
# 4. Remove stop words
words_nostop = [w for w in words if w not in stopwords]
(4) Data cleaning
# This cleaning function strips HTML tags, removes punctuation, lowercases the text, and removes stop words
def clean_text(text):
    text = BeautifulSoup(text, 'html.parser').get_text()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    words = text.lower().split()
    words = [w for w in words if w not in eng_stopwords]
    return ' '.join(words)
# Data cleaning
words=clean_text(df['review'][1000])
# Add the cleaned reviews to the dataframe as a new column
df['clean_review'] = df.review.apply(clean_text)
After preprocessing, the data is much cleaner. We then add a new column, clean_review, to the original dataframe; it holds the cleaned text, which is what we will use to train word2vec.
(5) Tokenization
# Build the sentence tokenizer
nltk.download('punkt')  # download the punkt sentence-tokenizer data
warnings.filterwarnings("ignore")
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
# Get the cleaned review column from df
review_part = df['clean_review']
# Show the dataframe after adding the clean_review column
print(df.head())
# Split each review into sentences with nltk
def split_sentences(review):
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = [clean_text(s) for s in raw_sentences if s]
    return sentences
sentences = sum(review_part.apply(split_sentences), [])
print('{} reviews -> {} sentences'.format(len(review_part), len(sentences)))
print(sentences)
The sentence splitter used here is nltk's punkt tokenizer. The nltk downloader can fail, in which case you need to download the data manually and load it yourself. After each review is split into sentences and cleaned again, all the sentences are collected into one list, which gives the data shown below.
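If the downloader cannot reach the server, one workaround is to place the punkt data in a local directory and point nltk at it. This is only a sketch; the directory path below is an example, not from the original post.
# Download (or copy) the punkt data to a local directory and register it with nltk
import nltk

local_nltk_dir = '../nltk_data'          # example location, adjust as needed
nltk.download('punkt', download_dir=local_nltk_dir)
nltk.data.path.append(local_nltk_dir)    # make nltk search this directory
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')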
(6) Tokenize every sentence into words and collect them in a single list
sentences_list = []
for line in sentences:
    sentences_list.append(nltk.word_tokenize(line))
print(sentences_list)
The resulting token list is shown in the figure below; word2vec trains on individual word tokens, so each sentence must be given as a list of words.
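For clarity, Word2Vec expects its training corpus to be a list of sentences, where each sentence is a list of string tokens. The tokens below are made-up examples just to show the shape; sentences_list built above has exactly this structure.
# Word2Vec input format: List[List[str]] (illustrative values only)
example_input = [
    ['this', 'movie', 'great'],
    ['plot', 'boring', 'acting', 'worse'],
]
print(type(sentences_list[0]), sentences_list[0][:10])  # first ten tokens of the first sentence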
2. Word2vec model module
(1) Set the parameters required by the model
# Set the parameters of word vector training
'''
sentences: the training corpus, e.g. a list of tokenized sentences
sg: training algorithm; 0 (default) uses CBOW, 1 uses skip-gram.
size: dimensionality of the word vectors, default 100. Larger values need more training data but can give better results; typical values range from tens to a few hundred. (Renamed vector_size in gensim >= 4.)
window: maximum distance between the current word and the predicted word within a sentence
alpha: the learning rate
seed: seed for the random number generator, used when initializing the word vectors
min_count: prunes the vocabulary; words occurring fewer than min_count times are discarded, default 5
max_vocab_size: RAM limit while building the vocabulary. If there are more unique words than this, the least frequent ones are pruned. Roughly 1 GB of RAM is needed per 10 million word types. Set to None for no limit.
workers: number of worker threads used for training.
hs: if 1, hierarchical softmax is used; if 0 (default), negative sampling is used.
negative: if > 0, negative sampling is used; sets how many noise words are drawn
iter: number of training iterations, default 5. (Renamed epochs in gensim >= 4.)
'''
num_features = 300 # Word vector dimensionality
min_word_count = 40 # Minimum word count
num_workers = 4 # Number of threads to run in parallel
context = 10 # Context window size
model_name = '{}features_{}minwords_{}context.model'.format(num_features, min_word_count, context)
(2) Train the model
# Train the word2vec model
model = Word2Vec(sentences_list, workers=num_workers, vector_size=num_features, min_count=min_word_count, window=context)
model.init_sims(replace=True)  # normalize the vectors in place (deprecated in recent gensim versions)
# Save the model
model.save(r"F:\python\word2vect\model\demo3_model")
# test
# Find the word that does not belong with the others
print(model.wv.doesnt_match(['man','woman','child','kitchen']))
# Find the words most similar to 'boy'
print(model.wv.most_similar("boy"))
In the first test, the word that does not belong with the others is kitchen.
The result of the second test is shown below.
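The saved model can also be loaded back later without retraining. A minimal sketch (the query word 'girl' is just an example, not from the original post):
# Reload the saved model and query it again
from gensim.models.word2vec import Word2Vec

loaded_model = Word2Vec.load(r"F:\python\word2vect\model\demo3_model")
print(loaded_model.wv.most_similar('girl', topn=5))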
3. Convert the train/test data into vectors
What we need is a vector for a whole review, not a single word vector, so we tokenize each review, add up the vectors of all its words, and take the average.
This representation is not very precise; a better option is TF-IDF weighting, which I will look into later.
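For reference, here is a rough sketch of the TF-IDF-weighted average mentioned above, where each word vector is weighted by its IDF score. The helper name to_review_vector_tfidf and the use of TfidfVectorizer are my own additions and are not part of the original post; the post itself uses the plain accumulation below.
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit TF-IDF on the cleaned reviews to get an IDF weight for every word
tfidf = TfidfVectorizer()
tfidf.fit(df['clean_review'])
idf_weights = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))  # get_feature_names() on older sklearn

def to_review_vector_tfidf(review):
    words = clean_text(review).split()
    vec = np.zeros(300)
    total_weight = 0.0
    for word in words:
        # Weight each in-vocabulary word vector by its IDF score
        if word in model.wv.key_to_index and word in idf_weights:
            vec += idf_weights[word] * model.wv[word]
            total_weight += idf_weights[word]
    if total_weight > 0:
        vec /= total_weight
    return pd.Series(vec)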
# Turn each review into a single vector: clean the text, look up the word2vec
# vector of every word, and accumulate them into one 300-dimensional vector
def to_review_vector(review):
    review = clean_text(review)
    words = review.split()
    word_vec = np.zeros((1, 300))
    for word in words:
        # Only words that are in the word2vec vocabulary contribute
        if word in model.wv.key_to_index:
            word_vec += np.array([model.wv[word]])
    # word_vec has shape (1, 300), so mean(axis=0) simply returns the
    # accumulated 300-dimensional vector as a Series, one entry per dimension
    return pd.Series(word_vec.mean(axis=0))
# pandas apply() feeds each value of the review column into to_review_vector
train_data_features = df.review.apply(to_review_vector)
print(" Output the word vector after superposition ")
print(train_data_features.head())
Because each word vector has 300 dimensions, the accumulated vector for each review also has 300 dimensions; every row of train_data_features is the 300-dimensional sentence vector for one review.
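A quick sanity check (a minimal sketch; the exact row count depends on your dataset): each review should map to exactly one 300-dimensional row, and df.sentiment is the label column used below.
# One 300-dimensional row per review
print(train_data_features.shape)    # expected: (number of reviews, 300)
print(df.sentiment.value_counts())  # label distribution of the target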
Next, split the data into a training set (80%) and a test set (20%).
X_train, X_test, y_train, y_test = train_test_split(train_data_features,df.sentiment,test_size = 0.2, random_state = 0)
4. Logistic regression module
# Build the logistic regression model
LR_model = LogisticRegression()
# Fit the model on the training set
LR_model = LR_model.fit(X_train, y_train)
# Predict on the test set
y_pred = LR_model.predict(X_test)
# Build a confusion matrix from the true and predicted labels
cnf_matrix = confusion_matrix(y_test, y_pred)
The confusion matrix is as follows.
5. Visualize the results
def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,title='Confusion matrix')
plt.show()
This function visualizes the confusion matrix; the numbers below can be read off from the plot.
In the confusion matrix, the dark blue cells are the correctly predicted samples and the lighter cells are the errors.
6. Model evaluation
Models are generally evaluated with accuracy, precision, recall and the F1 score.
TP: (actually positive, predicted positive) actually a boy, predicted to be a boy;
FP: (actually negative, predicted positive) actually a girl, predicted to be a boy;
FN: (actually positive, predicted negative) actually a boy, predicted to be a girl;
TN: (actually negative, predicted negative) actually a girl, predicted to be a girl;
Formulas:
Accuracy = (TP + TN) / total number of samples.
Definition: for a given test set, the proportion of samples the classifier labels correctly out of all samples.
Precision = TP / (TP + FP).
Meaning: of the samples predicted to be positive, how many are truly positive; it is computed with respect to the predictions.
Recall = TP / (TP + FN).
Meaning: of the positive samples in the data, how many are predicted correctly; it is computed with respect to the original samples.
F1 = (2 × precision × recall) / (precision + recall)
The F1 score is the harmonic mean of precision and recall.
The code is as follows
print("accuracy(test): ", (cnf_matrix[1,1]+cnf_matrix[0,0])/(cnf_matrix[0,0]+cnf_matrix[1,1]+cnf_matrix[1,0]+cnf_matrix[0,1]))
print("precision:",(cnf_matrix[0,0])/(cnf_matrix[0,0]+cnf_matrix[1,0]))
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
The result is as follows :
accuracy(test): 0.861
precision: 0.8773930753564155
Recall : 0.8772430668841762
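The same quantities can also be obtained directly from sklearn's metric helpers instead of indexing the confusion matrix by hand; a minimal cross-check sketch:
# Cross-check the manual confusion-matrix arithmetic with sklearn's built-in metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('accuracy :', accuracy_score(y_test, y_pred))
print('precision:', precision_score(y_test, y_pred))
print('recall   :', recall_score(y_test, y_pred))
print('f1       :', f1_score(y_test, y_pred))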