
Sentiment analysis on the IMDB review dataset

2022-07-01 11:39:00 Xi Anxian


Preface

This article covers sentiment analysis on the IMDB review dataset: an introduction to the Large Movie Review Dataset, environment configuration, the experiment code, the running results, and the problems encountered along the way. Three deep learning models were tested: a multilayer perceptron (MLP), a recurrent neural network (RNN), and a long short-term memory network (LSTM). The LSTM performed best, reaching a test accuracy of 86.4%.


I. Introduction to the Large Movie Review Dataset

Large Movie Review Dataset: click here to jump to the download.
This is a dataset for binary sentiment classification containing positive and negative reviews, with substantially more data than earlier benchmark datasets: 25,000 highly polar movie reviews for training, 25,000 for testing, and additional unlabeled data.
The extracted dataset contains the files shown below.
[figure omitted]
The test folder contains a neg folder with 12,500 negative reviews and a pos folder with 12,500 positive reviews.
[figure omitted]
The train folder likewise contains 12,500 negative reviews in neg and 12,500 positive reviews in pos, plus an unsup folder with 50,000 unlabeled reviews.
[figure omitted]
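As a quick sanity check after extraction, the per-folder counts can be verified with a short script (a minimal sketch; the G:/PycharmProjects path matches the one used later in this article and should be adjusted to wherever you extracted the archive):

import os

base = "G:/PycharmProjects/aclImdb"  # adjust to your extraction path
for split in ("train", "test"):
    for label in ("pos", "neg"):
        folder = os.path.join(base, split, label)
        print(split, label, len(os.listdir(folder)))  # expect 12500 each
# train/unsup additionally holds the 50000 unlabeled reviews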


II. Environment Configuration

This experiment builds on the environment from an earlier experiment (PyTorch-based cifar-10 image classification); on top of that configuration, simply install tensorflow and keras. I installed tensorflow 2.1.0 and keras 2.3.1; install whichever versions suit your environment.
TensorFlow is a symbolic math system based on dataflow programming, widely used to implement machine learning algorithms. It has a multi-level architecture, can be deployed on servers, PCs, and the web, and supports high-performance numerical computation on GPUs and TPUs.
Keras is an open-source artificial neural network library written in Python. It serves as a high-level API for TensorFlow, Microsoft-CNTK, and Theano, and supports the design, debugging, evaluation, deployment, and visualization of deep learning models. Keras is written in an object-oriented style, fully modular and extensible, and aims to reduce the difficulty of implementing complex algorithms. It supports the mainstream algorithms of modern artificial intelligence, including feedforward and recurrent neural networks, and can also be used to build statistical learning models through its wrappers. On the hardware side, Keras supports multi-GPU parallel computation on the major operating systems and delegates computation to TensorFlow, Microsoft-CNTK, or another backend depending on its configuration.
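To confirm that the installed versions match the ones used in this article, a quick check after installation (assuming both packages import cleanly) is:

import tensorflow as tf
import keras

print(tf.__version__)     # 2.1.0 in this article's environment
print(keras.__version__)  # 2.3.1 in this article's environment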


III. Experiment Code

This experiment tests three deep learning models: a multilayer perceptron (MLP), a recurrent neural network (RNN), and a long short-term memory network (LSTM). The code for each follows.

1. Multilayer perceptron model code

The code for the multilayer perceptron model is as follows.

# Note: this code is not my own; the source is linked at the end of the article
import urllib.request  # download the dataset archive
import os
import tarfile  # extract the archive
import re
import numpy as np
import matplotlib.pyplot as plt
from keras.preprocessing.text import Tokenizer  # builds the word dictionary
from keras.preprocessing import sequence  # pads/truncates sequences to a fixed length
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.embeddings import Embedding

# download the movie review dataset
url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
filepath = r"G:\PycharmProjects\aclImdb_v1.tar.gz"  # local path for the archive; adjust for your machine
if not os.path.isfile(filepath):  # download only if the archive is not already there
    result = urllib.request.urlretrieve(url, filepath)
    print('downloaded:', result)

if not os.path.exists(r"G:\PycharmProjects\aclImdb"):
    tfile = tarfile.open(filepath, 'r:gz')  #  Extract the dataset file 
    result = tfile.extractall(r"G:/PycharmProjects/")


def rm_tags(text):
    re_tag = re.compile(r'<[^>]+>')  # strip HTML tags
    return re_tag.sub('', text)


def read_file(filetype):  # read all review files for the given split ('train' or 'test')
    path = "G:/PycharmProjects/aclImdb/"
    file_list = []

    positive_path = path + filetype + '/pos/'   # positive reviews
    for f in os.listdir(positive_path):
        file_list += [positive_path + f]   # collect file paths

    negative_path = path + filetype + '/neg/'   # negative reviews
    for f in os.listdir(negative_path):
        file_list += [negative_path + f]

    print('read', filetype, 'files:', len(file_list))   # number of files read

    all_labels = ([1] * 12500 + [0] * 12500)  # first 12500 are positive (1), last 12500 negative (0)
    all_texts = []

    for fi in file_list:  # read every file
        with open(fi, encoding='utf8') as file_input:
            # join all lines into one string, strip HTML tags with rm_tags, then append to all_texts
            all_texts += [rm_tags(" ".join(file_input.readlines()))]
    return all_labels, all_texts


y_train, train_text = read_file("train")
y_test, test_text = read_file("test")

y_train = np.array(y_train)
y_test = np.array(y_test)


# create the Tokenizer
token = Tokenizer(num_words=2000)  # vocabulary capped at the 2000 most frequent words
# build the word index from the training texts
token.fit_on_texts(train_text)  # words ranked by frequency; top 2000 kept

# convert each review into a list of word indices
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq = token.texts_to_sequences(test_text)

# pad/truncate every review to exactly 100 indices
x_train = sequence.pad_sequences(x_train_seq, maxlen=100)
x_test = sequence.pad_sequences(x_test_seq, maxlen=100)

# MLP model
model = Sequential()
model.add(Embedding(output_dim=32, input_dim=2000, input_length=100))
# each word becomes a 32-dim vector; vocabulary size 2000; each review is a sequence of 100 indices
model.add(Dropout(0.2))
model.add(Flatten())  # flatten the 100x32 embedding output into a single 3200-dim vector
model.add(Dense(units=256, activation='relu'))  # fully connected layer with 256 ReLU units
model.add(Dropout(0.35))
model.add(Dense(units=1, activation='sigmoid'))  # sigmoid output: near 1 = positive, near 0 = negative
model.summary()  # print the model summary

# compile
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])  # loss, optimizer, metric
# train
train_history = model.fit(x=x_train, y=y_train, validation_split=0.2, epochs=10, batch_size=300, verbose=1)
# 10 epochs, batch size 300, 20% of the training data held out for validation


#  Show training results 
def show_train_history(train_history, train, validation):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train History')
    plt.ylabel(train)
    plt.xlabel('Epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()


show_train_history(train_history, 'accuracy', 'val_accuracy')   # accuracy curves
show_train_history(train_history, 'loss', 'val_loss')   # loss curves

scores = model.evaluate(x_test, y_test)  # evaluate on the test set
print(scores)
print('Test loss: ', scores[0])
print('Test accuracy: ', scores[1])
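As an aside, the Tokenizer and pad_sequences preprocessing used above is easiest to see on a toy example (a self-contained sketch; the two sample sentences are invented for illustration):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence

docs = ["the movie was great great fun", "the movie was dull"]  # toy reviews
tok = Tokenizer(num_words=10)
tok.fit_on_texts(docs)  # ranks words by frequency and assigns each an index
seqs = tok.texts_to_sequences(docs)
print(seqs)  # e.g. [[1, 2, 3, 4, 4, 5], [1, 2, 3, 6]]
# pad_sequences left-pads with zeros and truncates from the front by default
print(sequence.pad_sequences(seqs, maxlen=5))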

2. Recurrent neural network model code

The code for the recurrent neural network model is as follows.

# Note: this code is not my own; the source is linked at the end of the article
import urllib.request  # download the dataset archive
import os
import tarfile  # extract the archive
import re
import numpy as np
import matplotlib.pyplot as plt
from keras.preprocessing.text import Tokenizer  # builds the word dictionary
from keras.preprocessing import sequence  # pads/truncates sequences to a fixed length
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import SimpleRNN  # RNN layer

#  Download the movie review data set 
url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
filepath = r"G:\PycharmProjects\aclImdb_v1.tar.gz"
if not os.path.isfile(filepath):  #  Download if there is no file in this path 
    result = urllib.request.urlretrieve(url, filepath)
    print('downloaded:', result)

if not os.path.exists(r"G:\PycharmProjects\aclImdb"):
    tfile = tarfile.open(filepath, 'r:gz')  #  Extract the dataset file 
    result = tfile.extractall(r"G:/PycharmProjects/")


def rm_tags(text):
    re_tag = re.compile(r'<[^>]+>')  # strip HTML tags
    return re_tag.sub('', text)


def read_file(filetype):  # read all review files for the given split ('train' or 'test')
    path = "G:/PycharmProjects/aclImdb/"
    file_list = []

    positive_path = path + filetype + '/pos/'   # positive reviews
    for f in os.listdir(positive_path):
        file_list += [positive_path + f]   # collect file paths

    negative_path = path + filetype + '/neg/'   # negative reviews
    for f in os.listdir(negative_path):
        file_list += [negative_path + f]

    print('read', filetype, 'files:', len(file_list))   # number of files read

    all_labels = ([1] * 12500 + [0] * 12500)  # first 12500 are positive (1), last 12500 negative (0)
    all_texts = []

    for fi in file_list:  # read every file
        with open(fi, encoding='utf8') as file_input:
            # join all lines into one string, strip HTML tags with rm_tags, then append to all_texts
            all_texts += [rm_tags(" ".join(file_input.readlines()))]
    return all_labels, all_texts


y_train, train_text = read_file("train")
y_test, test_text = read_file("test")

y_train = np.array(y_train)
y_test = np.array(y_test)


# create the Tokenizer
token = Tokenizer(num_words=2000)  # vocabulary capped at the 2000 most frequent words
# build the word index from the training texts
token.fit_on_texts(train_text)  # words ranked by frequency; top 2000 kept

# convert each review into a list of word indices
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq = token.texts_to_sequences(test_text)

# pad/truncate every review to exactly 380 indices
x_train = sequence.pad_sequences(x_train_seq, maxlen=380)
x_test = sequence.pad_sequences(x_test_seq, maxlen=380)

# RNN model
model = Sequential()
model.add(Embedding(output_dim=32, input_dim=3800, input_length=380))
# each word becomes a 32-dim vector; each review is a sequence of 380 indices
# (input_dim=3800 over-allocates: the Tokenizer caps indices below 2000, so this still runs)
model.add(Dropout(0.35))
model.add(SimpleRNN(16))  # simple recurrent layer with 16 hidden units
model.add(Dense(units=256, activation='relu'))  # fully connected layer with 256 ReLU units
model.add(Dropout(0.35))
model.add(Dense(units=1, activation='sigmoid'))  # sigmoid output: near 1 = positive, near 0 = negative
model.summary()  # print the model summary


# compile
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])  # loss, optimizer, metric
# train
train_history = model.fit(x=x_train, y=y_train, validation_split=0.2, epochs=10, batch_size=300, verbose=1)
# 10 epochs, batch size 300, 20% of the training data held out for validation


#  Show training results 
def show_train_history(train_history, train, validation):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train History')
    plt.ylabel(train)
    plt.xlabel('Epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()


show_train_history(train_history, 'accuracy', 'val_accuracy')   # accuracy curves
show_train_history(train_history, 'loss', 'val_loss')   # loss curves

scores = model.evaluate(x_test, y_test)  # evaluate on the test set
print(scores)
print('Test loss: ', scores[0])
print('Test accuracy: ', scores[1])
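For reference, the SimpleRNN(16) layer above implements (up to its trained weights) the recurrence h_t = tanh(x_t W + h_{t-1} U + b) over the 380 embedded time steps. The numpy sketch below walks one review through that update, just to show the shapes (the random weights are placeholders, not trained values):

import numpy as np

T, d_in, d_h = 380, 32, 16  # sequence length, embedding dim, hidden units
x = np.random.randn(T, d_in)  # one embedded review
W = np.random.randn(d_in, d_h) * 0.1  # input-to-hidden weights
U = np.random.randn(d_h, d_h) * 0.1  # hidden-to-hidden weights
b = np.zeros(d_h)

h = np.zeros(d_h)
for t in range(T):  # step through the sequence
    h = np.tanh(x[t] @ W + h @ U + b)
print(h.shape)  # (16,) -- the final state that feeds the Dense layer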

3. Long short-term memory model code

The code for the long short-term memory model is as follows.

# Note: this code is not my own; the source is linked at the end of the article
import urllib.request  # download the dataset archive
import os
import tarfile  # extract the archive
import re
import numpy as np
import matplotlib.pyplot as plt
from keras.preprocessing.text import Tokenizer  # builds the word dictionary
from keras.preprocessing import sequence  # pads/truncates sequences to a fixed length
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM  # LSTM layer

#  Download the movie review data set 
url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
filepath = r"G:\PycharmProjects\aclImdb_v1.tar.gz"
if not os.path.isfile(filepath):  #  Download if there is no file in this path 
    result = urllib.request.urlretrieve(url, filepath)
    print('downloaded:', result)

if not os.path.exists(r"G:\PycharmProjects\aclImdb"):
    tfile = tarfile.open(filepath, 'r:gz')  #  Extract the dataset file 
    result = tfile.extractall(r"G:/PycharmProjects/")


def rm_tags(text):
    re_tag = re.compile(r'<[^>]+>')  # strip HTML tags
    return re_tag.sub('', text)


def read_file(filetype):  # read all review files for the given split ('train' or 'test')
    path = "G:/PycharmProjects/aclImdb/"
    file_list = []

    positive_path = path + filetype + '/pos/'   # positive reviews
    for f in os.listdir(positive_path):
        file_list += [positive_path + f]   # collect file paths

    negative_path = path + filetype + '/neg/'   # negative reviews
    for f in os.listdir(negative_path):
        file_list += [negative_path + f]

    print('read', filetype, 'files:', len(file_list))   # number of files read

    all_labels = ([1] * 12500 + [0] * 12500)  # first 12500 are positive (1), last 12500 negative (0)
    all_texts = []

    for fi in file_list:  # read every file
        with open(fi, encoding='utf8') as file_input:
            # join all lines into one string, strip HTML tags with rm_tags, then append to all_texts
            all_texts += [rm_tags(" ".join(file_input.readlines()))]
    return all_labels, all_texts


y_train, train_text = read_file("train")
y_test, test_text = read_file("test")

y_train = np.array(y_train)
y_test = np.array(y_test)


# create the Tokenizer
token = Tokenizer(num_words=2000)  # vocabulary capped at the 2000 most frequent words
# build the word index from the training texts
token.fit_on_texts(train_text)  # words ranked by frequency; top 2000 kept

# convert each review into a list of word indices
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq = token.texts_to_sequences(test_text)

# pad/truncate every review to exactly 380 indices
x_train = sequence.pad_sequences(x_train_seq, maxlen=380)
x_test = sequence.pad_sequences(x_test_seq, maxlen=380)

# LSTM model
model = Sequential()
model.add(Embedding(output_dim=32, input_dim=3800, input_length=380))
# each word becomes a 32-dim vector; each review is a sequence of 380 indices
# (input_dim=3800 over-allocates: the Tokenizer caps indices below 2000, so this still runs)
model.add(Dropout(0.2))
model.add(LSTM(32))  # LSTM layer with 32 units
model.add(Dense(units=256, activation='relu'))  # fully connected layer with 256 ReLU units
model.add(Dropout(0.2))
model.add(Dense(units=1, activation='sigmoid'))  # sigmoid output: near 1 = positive, near 0 = negative
model.summary()  # print the model summary


# compile
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])  # loss, optimizer, metric
# train
train_history = model.fit(x=x_train, y=y_train, validation_split=0.2, epochs=10, batch_size=300, verbose=1)
# 10 epochs, batch size 300, 20% of the training data held out for validation


#  Show training results 
def show_train_history(train_history, train, validation):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train History')
    plt.ylabel(train)
    plt.xlabel('Epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()


show_train_history(train_history, 'accuracy', 'val_accuracy')   # accuracy curves
show_train_history(train_history, 'loss', 'val_loss')   # loss curves

scores = model.evaluate(x_test, y_test)  # evaluate on the test set
print(scores)
print('Test loss: ', scores[0])
print('Test accuracy: ', scores[1])
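Once any of the three models has finished training, it can score a new review through the same token dictionary and padding. Below is a sketch runnable at the end of the LSTM script above (predict_review is a helper name I introduce here, and the sample sentence is invented; use maxlen=100 instead if you trained the MLP variant):

def predict_review(text):  # hypothetical helper, not part of the original code
    seq = token.texts_to_sequences([text])  # words outside the 2000-word vocabulary are dropped
    padded = sequence.pad_sequences(seq, maxlen=380)
    prob = model.predict(padded)[0][0]  # sigmoid output in [0, 1]
    return 'positive' if prob >= 0.5 else 'negative'

print(predict_review('This film was a complete waste of time.'))  # expected: negative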


IV. Experimental Results

1. Multilayer perceptron model results

The training output of the multilayer perceptron model is shown below.
[figure omitted]
Accuracy of the multilayer perceptron model versus training epoch:
[figure omitted]
Loss of the multilayer perceptron model versus training epoch:
[figure omitted]
Final test loss and test accuracy of the multilayer perceptron model:
[figure omitted]

2. Recurrent neural network model results

The training output of the recurrent neural network model is shown below.
[figure omitted]
Accuracy of the recurrent neural network model versus training epoch:
[figure omitted]
Loss of the recurrent neural network model versus training epoch:
[figure omitted]
Final test loss and test accuracy of the recurrent neural network model:
[figure omitted]

3. Long short-term memory model results

The training output of the long short-term memory model is shown below.
[figure omitted]
Accuracy of the long short-term memory model versus training epoch:
[figure omitted]
Loss of the long short-term memory model versus training epoch:
[figure omitted]
Final test loss and test accuracy of the long short-term memory model:
[figure omitted]
Across the three models, the multilayer perceptron reached a test accuracy of 82.01%, the recurrent neural network 82.79%, and the long short-term memory model 86.39%, the highest of the three.


V. Problems Encountered

When installing tensorflow in the environment, installing directly with the command below did not work: conda's 'Solving environment' step spun indefinitely and never reached done.

conda install tensorflow

So I first listed all available tensorflow versions with the following command.

anaconda show anaconda/tensorflow

The output is shown below.
[figure omitted]
Then I picked a tensorflow version compatible with my python version and installed it with the following command.

conda install --channel https://conda.anaconda.org/anaconda tensorflow=2.1.0

The second problem was an error when running the program: Warning! HDF5 library version mismatched error.
Solution: run the following two commands in order to first uninstall h5py and then install an h5py version compatible with the environment.

pip uninstall h5py
pip install h5py
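If a plain reinstall still reports the mismatch, pinning an older h5py release is a common workaround, since h5py 3.x changed behavior that older Keras/TensorFlow builds do not expect (an assumption about the cause in your environment, not a guaranteed fix):

pip install "h5py<3.0"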


Summary

That concludes the sentiment analysis on the IMDB review dataset. Finding suitable open-source code and configuring a matching environment is the key to getting the code running on your own machine. I hope this article helps your study!
References:
TensorFlow (Baidu Encyclopedia)
Keras (Baidu Encyclopedia)
IMDb movie review dataset processing and LSTM sentiment analysis
Reference code: click here to jump

Original article: https://yzsam.com/2022/182/202207011135052568.html

Copyright notice: this article was created by [Xi Anxian]; please include a link to the original when reposting. Thanks!