Sentiment Analysis on the IMDB Review Dataset
2022-07-01 11:39:00 【Xi Anxian】
Preface
This article covers sentiment analysis on the IMDB review dataset: an introduction to the Large Movie Review Dataset, environment configuration, the experiment code, the running results, and problems encountered along the way. Three deep learning models are tested, a multilayer perceptron (MLP), a recurrent neural network (RNN), and a long short-term memory (LSTM) network; the LSTM performs best, reaching a test accuracy of 86.4%.
I. Introduction to the Large Movie Review Dataset
Large Movie Review Dataset: available for download at http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
This is a dataset for binary sentiment classification, containing substantially more data than earlier benchmark datasets: 25,000 highly polar movie reviews for training, 25,000 for testing, and additional unlabeled data.
The files in the extracted dataset are shown in the figure below.
The test folder contains the following: its neg subfolder holds 12,500 negative reviews and its pos subfolder holds 12,500 positive reviews.
The train folder contains the following: its neg subfolder holds 12,500 negative reviews, its pos subfolder holds 12,500 positive reviews, and its unsup subfolder holds 50,000 unlabeled reviews.
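As a quick sanity check after extraction (not part of the original tutorial), the short sketch below walks the extracted directory and prints the number of files in each subfolder; it assumes the same extraction path, G:/PycharmProjects/aclImdb, that the scripts later in this article use.

import os

base = "G:/PycharmProjects/aclImdb"  # extraction path used by the scripts below; adjust to your machine
for split in ('train', 'test'):
    for label in ('pos', 'neg'):
        folder = os.path.join(base, split, label)
        print(split, label, len(os.listdir(folder)))  # expect 12500 in each of these four folders
print('train unsup', len(os.listdir(os.path.join(base, 'train', 'unsup'))))  # expect 50000 unlabeled reviews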
II. Environment Configuration
This experiment builds on the environment from a previous experiment, CIFAR-10 image classification based on PyTorch; on top of that configuration you only need to install tensorflow and keras. I installed tensorflow 2.1.0 and keras 2.3.1; install the versions that suit your own environment.
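Before moving on, it is worth confirming that the installed versions match; a minimal check using the libraries' standard __version__ attributes:

import tensorflow as tf
import keras

print(tf.__version__)     # 2.1.0 in my environment
print(keras.__version__)  # 2.3.1 in my environment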
TensorFlow is a symbolic math system based on dataflow programming, widely used to implement machine learning algorithms of all kinds. It has a multi-level architecture, can be deployed on servers, PCs, and the web, and supports high-performance numerical computing on GPUs and TPUs.
Keras is an open-source artificial neural network library written in Python. It can serve as a high-level API on top of TensorFlow, Microsoft-CNTK, or Theano for designing, debugging, evaluating, applying, and visualizing deep learning models. Keras is written in an object-oriented style, fully modular and extensible, and aims to reduce the difficulty of implementing complex algorithms. It supports the mainstream algorithms of modern artificial intelligence, including feedforward and recurrent neural networks, and through its wrappers can also take part in building statistical learning models. On the hardware side, Keras supports multi-GPU parallel computing on multiple operating systems and, depending on the backend settings, converts models into components of the underlying system such as TensorFlow or Microsoft-CNTK.
III. Experiment Code
This experiment uses three deep learning models: a multilayer perceptron (MLP), a recurrent neural network (RNN), and a long short-term memory (LSTM) network. The code for each follows.
1. Multilayer perceptron model code
The code for the multilayer perceptron model is as follows.
# Note: this code is not my own; the source is linked at the end of the article
import urllib.request  # download the dataset archive
import os
import tarfile  # extract the archive
import re
import numpy as np
import matplotlib.pyplot as plt
from keras.preprocessing.text import Tokenizer  # build the vocabulary
from keras.preprocessing import sequence  # pad short sequences, truncate long ones
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.embeddings import Embedding

# Download the movie review dataset
url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
filepath = r"G:\PycharmProjects\aclImdb_v1.tar.gz"  # location of the archive; adjust to your machine
if not os.path.isfile(filepath):  # download only if the archive is not already there
    result = urllib.request.urlretrieve(url, filepath)
    print('downloaded:', result)
if not os.path.exists(r"G:\PycharmProjects\aclImdb"):
    tfile = tarfile.open(filepath, 'r:gz')  # extract the dataset archive
    result = tfile.extractall(r"G:/PycharmProjects/")

def rm_tags(text):
    re_tag = re.compile(r'<[^>]+>')  # strip HTML tags
    return re_tag.sub('', text)

def read_file(filetype):  # read the train or test split
    path = "G:/PycharmProjects/aclImdb/"
    file_list = []
    positive_path = path + filetype + '/pos/'  # paths of the positive reviews
    for f in os.listdir(positive_path):
        file_list += [positive_path + f]  # collect into the file list
    negative_path = path + filetype + '/neg/'  # paths of the negative reviews
    for f in os.listdir(negative_path):
        file_list += [negative_path + f]
    print('read', filetype, 'files:', len(file_list))  # number of files read
    all_labels = ([1] * 12500 + [0] * 12500)  # first 12500 are positive (1), last 12500 are negative (0)
    all_texts = []
    for fi in file_list:  # read every file
        with open(fi, encoding='utf8') as file_input:
            # read the file, join its lines into one string, strip the HTML tags, and append to all_texts
            all_texts += [rm_tags(" ".join(file_input.readlines()))]
    return all_labels, all_texts

y_train, train_text = read_file("train")
y_test, test_text = read_file("test")
y_train = np.array(y_train)
y_test = np.array(y_test)

# Create the tokenizer
token = Tokenizer(num_words=2000)  # vocabulary size of 2000
# Build the tokenizer's dictionary
token.fit_on_texts(train_text)  # rank words by frequency and keep the top 2000
# Convert each review text into a list of word indices
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq = token.texts_to_sequences(test_text)
# Pad short sequences and truncate long ones to a fixed length
x_train = sequence.pad_sequences(x_train_seq, maxlen=100)
x_test = sequence.pad_sequences(x_test_seq, maxlen=100)

# MLP model
model = Sequential()
model.add(Embedding(output_dim=32, input_dim=2000, input_length=100))
# each word is represented by a 32-dimensional vector; the vocabulary size is 2000;
# each index list has 100 entries, i.e. a review is represented by 100 numbers
model.add(Dropout(0.2))
model.add(Flatten())  # flatten the multidimensional input into one dimension: 32*100 = 3200 values
model.add(Dense(units=256, activation='relu'))  # 256 neurons with relu activation
model.add(Dropout(0.35))
model.add(Dense(units=1, activation='sigmoid'))  # output 1 for a positive review, 0 for a negative one; sigmoid activation
model.summary()  # print the model summary

# Configure
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])  # loss function, optimizer, and metric

# Train
train_history = model.fit(x=x_train, y=y_train, validation_split=0.2, epochs=10, batch_size=300, verbose=1)
# train for 10 epochs with 300 samples per batch

# Plot the training history
def show_train_history(train_history, train, validation):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train History')
    plt.ylabel(train)
    plt.xlabel('Epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()

show_train_history(train_history, 'accuracy', 'val_accuracy')  # accuracy curves
show_train_history(train_history, 'loss', 'val_loss')  # loss curves

scores = model.evaluate(x_test, y_test)  # evaluate on the test set
print(scores)
print('Test loss: ', scores[0])
print('Test accuracy: ', scores[1])
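As a usage sketch (not part of the referenced source code), the trained model can score a new review by pushing it through the same tokenizer and padding as the training data; the review string here is a made-up example:

review = "This movie was surprisingly good, I enjoyed every minute."  # hypothetical input
seq = token.texts_to_sequences([review])           # reuse the tokenizer fitted on the training text
padded = sequence.pad_sequences(seq, maxlen=100)   # must match the maxlen used in training
prob = model.predict(padded)[0][0]                 # sigmoid output in [0, 1]
print('positive' if prob > 0.5 else 'negative', prob)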
2. Recurrent neural network model code
The code for the recurrent neural network model is as follows.
# Note: this code is not my own; the source is linked at the end of the article
import urllib.request  # download the dataset archive
import os
import tarfile  # extract the archive
import re
import numpy as np
import matplotlib.pyplot as plt
from keras.preprocessing.text import Tokenizer  # build the vocabulary
from keras.preprocessing import sequence  # pad short sequences, truncate long ones
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import SimpleRNN  # RNN

# Download the movie review dataset
url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
filepath = r"G:\PycharmProjects\aclImdb_v1.tar.gz"
if not os.path.isfile(filepath):  # download only if the archive is not already there
    result = urllib.request.urlretrieve(url, filepath)
    print('downloaded:', result)
if not os.path.exists(r"G:\PycharmProjects\aclImdb"):
    tfile = tarfile.open(filepath, 'r:gz')  # extract the dataset archive
    result = tfile.extractall(r"G:/PycharmProjects/")

def rm_tags(text):
    re_tag = re.compile(r'<[^>]+>')  # strip HTML tags
    return re_tag.sub('', text)

def read_file(filetype):  # read the train or test split
    path = "G:/PycharmProjects/aclImdb/"
    file_list = []
    positive_path = path + filetype + '/pos/'  # paths of the positive reviews
    for f in os.listdir(positive_path):
        file_list += [positive_path + f]  # collect into the file list
    negative_path = path + filetype + '/neg/'  # paths of the negative reviews
    for f in os.listdir(negative_path):
        file_list += [negative_path + f]
    print('read', filetype, 'files:', len(file_list))  # number of files read
    all_labels = ([1] * 12500 + [0] * 12500)  # first 12500 are positive (1), last 12500 are negative (0)
    all_texts = []
    for fi in file_list:  # read every file
        with open(fi, encoding='utf8') as file_input:
            # read the file, join its lines into one string, strip the HTML tags, and append to all_texts
            all_texts += [rm_tags(" ".join(file_input.readlines()))]
    return all_labels, all_texts

y_train, train_text = read_file("train")
y_test, test_text = read_file("test")
y_train = np.array(y_train)
y_test = np.array(y_test)

# Create the tokenizer
token = Tokenizer(num_words=3800)  # vocabulary size of 3800, matching the embedding layer's input_dim below
# Build the tokenizer's dictionary
token.fit_on_texts(train_text)  # rank words by frequency and keep the top 3800
# Convert each review text into a list of word indices
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq = token.texts_to_sequences(test_text)
# Pad short sequences and truncate long ones to a fixed length
x_train = sequence.pad_sequences(x_train_seq, maxlen=380)
x_test = sequence.pad_sequences(x_test_seq, maxlen=380)

# RNN model
model = Sequential()
model.add(Embedding(output_dim=32, input_dim=3800, input_length=380))
# each word is represented by a 32-dimensional vector; the vocabulary size is 3800;
# each index list has 380 entries, i.e. a review is represented by 380 numbers
model.add(Dropout(0.35))
model.add(SimpleRNN(16))  # RNN layer with 16 units
model.add(Dense(units=256, activation='relu'))  # 256 neurons with relu activation
model.add(Dropout(0.35))
model.add(Dense(units=1, activation='sigmoid'))  # output 1 for a positive review, 0 for a negative one; sigmoid activation
model.summary()  # print the model summary

# Configure
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])  # loss function, optimizer, and metric

# Train
train_history = model.fit(x=x_train, y=y_train, validation_split=0.2, epochs=10, batch_size=300, verbose=1)
# train for 10 epochs with 300 samples per batch

# Plot the training history
def show_train_history(train_history, train, validation):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train History')
    plt.ylabel(train)
    plt.xlabel('Epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()

show_train_history(train_history, 'accuracy', 'val_accuracy')  # accuracy curves
show_train_history(train_history, 'loss', 'val_loss')  # loss curves

scores = model.evaluate(x_test, y_test)  # evaluate on the test set
print(scores)
print('Test loss: ', scores[0])
print('Test accuracy: ', scores[1])
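As an aside, Keras also ships a pre-tokenized copy of this dataset, so the download and HTML-stripping steps above can be skipped entirely; a minimal sketch of the built-in loader (its word indexing differs slightly from a locally fitted Tokenizer, so results will not be identical):

from keras.datasets import imdb
from keras.preprocessing import sequence

# reviews arrive already converted to word-index sequences, keeping the 3800 most frequent words
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=3800)
x_train = sequence.pad_sequences(x_train, maxlen=380)
x_test = sequence.pad_sequences(x_test, maxlen=380)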
3. Long short-term memory model code
The code for the long short-term memory model is as follows.
# Note: this code is not my own; the source is linked at the end of the article
import urllib.request  # download the dataset archive
import os
import tarfile  # extract the archive
import re
import numpy as np
import matplotlib.pyplot as plt
from keras.preprocessing.text import Tokenizer  # build the vocabulary
from keras.preprocessing import sequence  # pad short sequences, truncate long ones
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM  # LSTM

# Download the movie review dataset
url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
filepath = r"G:\PycharmProjects\aclImdb_v1.tar.gz"
if not os.path.isfile(filepath):  # download only if the archive is not already there
    result = urllib.request.urlretrieve(url, filepath)
    print('downloaded:', result)
if not os.path.exists(r"G:\PycharmProjects\aclImdb"):
    tfile = tarfile.open(filepath, 'r:gz')  # extract the dataset archive
    result = tfile.extractall(r"G:/PycharmProjects/")

def rm_tags(text):
    re_tag = re.compile(r'<[^>]+>')  # strip HTML tags
    return re_tag.sub('', text)

def read_file(filetype):  # read the train or test split
    path = "G:/PycharmProjects/aclImdb/"
    file_list = []
    positive_path = path + filetype + '/pos/'  # paths of the positive reviews
    for f in os.listdir(positive_path):
        file_list += [positive_path + f]  # collect into the file list
    negative_path = path + filetype + '/neg/'  # paths of the negative reviews
    for f in os.listdir(negative_path):
        file_list += [negative_path + f]
    print('read', filetype, 'files:', len(file_list))  # number of files read
    all_labels = ([1] * 12500 + [0] * 12500)  # first 12500 are positive (1), last 12500 are negative (0)
    all_texts = []
    for fi in file_list:  # read every file
        with open(fi, encoding='utf8') as file_input:
            # read the file, join its lines into one string, strip the HTML tags, and append to all_texts
            all_texts += [rm_tags(" ".join(file_input.readlines()))]
    return all_labels, all_texts

y_train, train_text = read_file("train")
y_test, test_text = read_file("test")
y_train = np.array(y_train)
y_test = np.array(y_test)

# Create the tokenizer
token = Tokenizer(num_words=3800)  # vocabulary size of 3800, matching the embedding layer's input_dim below
# Build the tokenizer's dictionary
token.fit_on_texts(train_text)  # rank words by frequency and keep the top 3800
# Convert each review text into a list of word indices
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq = token.texts_to_sequences(test_text)
# Pad short sequences and truncate long ones to a fixed length
x_train = sequence.pad_sequences(x_train_seq, maxlen=380)
x_test = sequence.pad_sequences(x_test_seq, maxlen=380)

# LSTM model
model = Sequential()
model.add(Embedding(output_dim=32, input_dim=3800, input_length=380))
# each word is represented by a 32-dimensional vector; the vocabulary size is 3800;
# each index list has 380 entries, i.e. a review is represented by 380 numbers
model.add(Dropout(0.2))
model.add(LSTM(32))  # LSTM layer with 32 units
model.add(Dense(units=256, activation='relu'))  # 256 neurons with relu activation
model.add(Dropout(0.2))
model.add(Dense(units=1, activation='sigmoid'))  # output 1 for a positive review, 0 for a negative one; sigmoid activation
model.summary()  # print the model summary

# Configure
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])  # loss function, optimizer, and metric

# Train
train_history = model.fit(x=x_train, y=y_train, validation_split=0.2, epochs=10, batch_size=300, verbose=1)
# train for 10 epochs with 300 samples per batch

# Plot the training history
def show_train_history(train_history, train, validation):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train History')
    plt.ylabel(train)
    plt.xlabel('Epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()

show_train_history(train_history, 'accuracy', 'val_accuracy')  # accuracy curves
show_train_history(train_history, 'loss', 'val_loss')  # loss curves

scores = model.evaluate(x_test, y_test)  # evaluate on the test set
print(scores)
print('Test loss: ', scores[0])
print('Test accuracy: ', scores[1])
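Since the LSTM turns out to be the best of the three models (see the results below), it is worth persisting after training; a minimal sketch using Keras's standard save/load API (the filename is arbitrary, and saving to HDF5 requires a working h5py, which is relevant to the problems section later):

from keras.models import load_model

model.save('imdb_lstm.h5')             # writes the architecture and weights to an HDF5 file
restored = load_model('imdb_lstm.h5')  # restore later without retraining
print(restored.evaluate(x_test, y_test))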
IV. Experimental Results
1. Running results of the multilayer perceptron model
The training process of the multilayer perceptron model is shown in the figure below.
The accuracy of the multilayer perceptron model plotted against the training epoch.
The loss of the multilayer perceptron model plotted against the training epoch.
The final test loss and test accuracy of the multilayer perceptron model.
2. Running results of the recurrent neural network model
The training process of the recurrent neural network model is shown in the figure below.
The accuracy of the recurrent neural network model plotted against the training epoch.
The loss of the recurrent neural network model plotted against the training epoch.
The final test loss and test accuracy of the recurrent neural network model.
3. Running results of the long short-term memory model
The training process of the long short-term memory model is shown in the figure below.
The accuracy of the long short-term memory model plotted against the training epoch.
The loss of the long short-term memory model plotted against the training epoch.
The final test loss and test accuracy of the long short-term memory model.
Comparing the experimental results: the multilayer perceptron model reaches a test accuracy of 82.01%, the recurrent neural network model 82.79%, and the long short-term memory model 86.39%, the highest of the three.
V. Problems Encountered
When installing tensorflow into the environment, installing directly with the following command did not work: conda got stuck at "Solving environment" and never finished.
conda install tensorflow
So I first listed all available tensorflow versions with the following command.
anaconda show anaconda/tensorflow
The output is shown in the figure below.
Then I chose a tensorflow version compatible with my python version and installed it with the following command.
conda install --channel https://conda.anaconda.org/anaconda tensorflow=2.1.0
The second problem was an error when running the program: Warning! HDF5 library version mismatched error.
Solution: run the following two commands in order, first uninstalling h5py and then installing an h5py version compatible with the environment.
pip uninstall h5py
pip install h5py
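If the mismatch persists, a commonly suggested variant is to pin h5py to an explicit older release rather than taking the latest; the version number below is only an illustrative assumption, so pick whichever matches the HDF5 library your tensorflow/keras build was compiled against.

pip uninstall h5py
pip install h5py==2.10.0  # example pin, not a verified requirement; choose the version matching your HDF5 build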
Summary
That is everything on sentiment analysis with the IMDB review dataset. Finding suitable open-source code and configuring a matching environment are the keys to getting the code running on your own machine. I hope this article helps with your studies!
References:
TensorFlow (Baidu Encyclopedia)
Keras (Baidu Encyclopedia)
IMDb movie review dataset processing and LSTM sentiment analysis
Reference code: click here to jump