Sentiment Analysis on the IMDB Review Dataset
2022-07-01 11:39:00 【Xi Anxian】
Preface
This article presents sentiment analysis on the IMDB review dataset. It covers an introduction to the Large Movie Review Dataset, environment configuration, the experiment code, the running results, and problems encountered along the way. The experiment tests several deep learning models, including a multilayer perceptron (MLP), a recurrent neural network (RNN), and a long short-term memory network (LSTM); the LSTM performs best, reaching a test accuracy of 86.4%.
I. Introduction to the Large Movie Review Dataset
The Large Movie Review Dataset can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz (the same URL used in the code below).
This is a dataset for binary sentiment classification, containing positive and negative reviews, and it is substantially larger than earlier benchmark datasets: 25,000 highly polar movie reviews for training, 25,000 for testing, and additional unlabeled data.
After decompression, the dataset contains a test folder and a train folder, as shown in the figures below. In the test folder, the neg subfolder contains 12,500 negative reviews and the pos subfolder contains 12,500 positive reviews. In the train folder, neg and pos likewise contain 12,500 reviews each, and the unsup subfolder contains 50,000 unlabeled reviews.
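To verify the download and extraction, you can count the files in each labeled folder. A minimal sketch, assuming the archive was extracted to G:/PycharmProjects/ as in the code below:

import os

base = "G:/PycharmProjects/aclImdb"
for split in ("train", "test"):
    for label in ("pos", "neg"):
        folder = os.path.join(base, split, label)
        print(split, label, len(os.listdir(folder)))  # expect 12500 files in each folder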
II. Environment Configuration
This experiment builds on the environment from an earlier experiment, CIFAR-10 image classification with PyTorch; on top of that configuration, just install tensorflow and keras. I installed tensorflow 2.1.0 and keras 2.3.1; install whatever versions suit your own environment.
TensorFlow is a symbolic math system based on dataflow programming, widely used to implement machine learning algorithms. It has a multi-layered architecture and can be deployed on servers, PCs, and the web, supporting high-performance numerical computation on GPUs and TPUs.
Keras is an open-source artificial neural network library written in Python. It serves as a high-level API on top of TensorFlow, Microsoft CNTK, or Theano for designing, debugging, evaluating, deploying, and visualizing deep learning models. Its code is object-oriented, fully modular, and extensible, and it aims to reduce the difficulty of implementing complex algorithms. Keras supports the mainstream algorithms of modern artificial intelligence, including feedforward and recurrent neural networks, and can also be used to build statistical learning models through its wrappers. On the hardware side, Keras supports multi-GPU parallelism on several operating systems and delegates computation to TensorFlow, Microsoft CNTK, or other backends according to its configuration.
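Before moving on, it is worth confirming that the installed versions match. A quick check (the printed values are simply those of my environment):

import tensorflow as tf
import keras

print(tf.__version__)     # 2.1.0 in my environment
print(keras.__version__)  # 2.3.1 in my environment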
III. Experiment Code
This experiment uses three deep learning models: a multilayer perceptron (MLP), a recurrent neural network (RNN), and a long short-term memory network (LSTM). The code follows.
1. Multilayer perceptron (MLP) model code
The code for the MLP model is as follows.
# Note: this code is not my own; the source is linked at the end of the article
import urllib.request  # download files
import os
import tarfile  # extract archives
import re
import numpy as np
import matplotlib.pyplot as plt
from keras.preprocessing.text import Tokenizer  # build the dictionary
from keras.preprocessing import sequence  # pad/truncate sequences to a fixed length
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.embeddings import Embedding

# Download the movie review dataset
url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
filepath = r"G:\PycharmProjects\aclImdb_v1.tar.gz"  # location of the dataset archive; adjust for your machine
if not os.path.isfile(filepath):  # download only if the file is not already there
    result = urllib.request.urlretrieve(url, filepath)
    print('downloaded:', result)
if not os.path.exists(r"G:\PycharmProjects\aclImdb"):
    tfile = tarfile.open(filepath, 'r:gz')  # extract the dataset archive
    result = tfile.extractall(r"G:/PycharmProjects/")

def rm_tags(text):
    re_tag = re.compile(r'<[^>]+>')  # strip html tags
    return re_tag.sub('', text)

def read_file(filetype):  # read the review files
    path = "G:/PycharmProjects/aclImdb/"
    file_list = []
    positive_path = path + filetype + '/pos/'  # path of the positive reviews
    for f in os.listdir(positive_path):
        file_list += [positive_path + f]  # collect file paths
    negative_path = path + filetype + '/neg/'  # path of the negative reviews
    for f in os.listdir(negative_path):
        file_list += [negative_path + f]
    print('read', filetype, 'files:', len(file_list))  # number of files read
    all_labels = ([1] * 12500 + [0] * 12500)  # first 12500 are positive (1), last 12500 negative (0)
    all_texts = []
    for fi in file_list:  # read every file
        with open(fi, encoding='utf8') as file_input:
            # read the file, join all lines into one string, strip html tags
            # with rm_tags, then append the result to all_texts
            all_texts += [rm_tags(" ".join(file_input.readlines()))]
    return all_labels, all_texts

y_train, train_text = read_file("train")
y_test, test_text = read_file("test")
y_train = np.array(y_train)
y_test = np.array(y_test)
# Build the tokenizer
token = Tokenizer(num_words=2000)  # dictionary size is 2000
# Build the tokenizer's dictionary
token.fit_on_texts(train_text)  # rank words by frequency and keep the top 2000
# Convert each review text into a list of word indices
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq = token.texts_to_sequences(test_text)
# Truncate long sequences and pad short ones
x_train = sequence.pad_sequences(x_train_seq, maxlen=100)
x_test = sequence.pad_sequences(x_test_seq, maxlen=100)
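# Note: pad_sequences pads and truncates at the front by default
# (padding='pre', truncating='pre'), so a short review is left-padded with
# zeros and an over-long review keeps only its last 100 tokens.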
# MLP model
model = Sequential()
model.add(Embedding(output_dim=32, input_dim=2000, input_length=100))
# each word is represented by a 32-dimensional vector; the dictionary holds
# 2000 words; each sequence has 100 indices, i.e. one review = 100 numbers
model.add(Dropout(0.2))
model.add(Flatten())  # flatten the multidimensional input: 32*100 = 3200 values
model.add(Dense(units=256, activation='relu'))  # 256 neurons, relu activation
model.add(Dropout(0.35))
model.add(Dense(units=1, activation='sigmoid'))  # output 1 = positive, 0 = negative; sigmoid activation
model.summary()  # model summary
# Configure
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])  # loss, optimizer, metric
# Train
train_history = model.fit(x=x_train, y=y_train, validation_split=0.2, epochs=10, batch_size=300, verbose=1)
# train for 10 epochs with batches of 300 samples

# Plot the training history
def show_train_history(train_history, train, validation):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train History')
    plt.ylabel(train)
    plt.xlabel('Epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()

show_train_history(train_history, 'accuracy', 'val_accuracy')  # accuracy curve
show_train_history(train_history, 'loss', 'val_loss')  # loss curve
scores = model.evaluate(x_test, y_test)  # evaluate on the test set
print(scores)
print('Test loss: ', scores[0])
print('Test accuracy: ', scores[1])
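Once training finishes, the same token and model objects can score a new review. A minimal sketch, not part of the original code; the review text here is made up:

# Score a new review with the trained MLP (reuses `token` and `model` from above)
new_review = ["This film was a delight from start to finish."]
new_seq = token.texts_to_sequences(new_review)
new_pad = sequence.pad_sequences(new_seq, maxlen=100)
prob = model.predict(new_pad)[0][0]  # sigmoid output between 0 and 1
print('positive' if prob >= 0.5 else 'negative', prob)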
2. Recurrent neural network (RNN) model code
The code for the RNN model is as follows.
# Note: this code is not my own; the source is linked at the end of the article
import urllib.request  # download files
import os
import tarfile  # extract archives
import re
import numpy as np
import matplotlib.pyplot as plt
from keras.preprocessing.text import Tokenizer  # build the dictionary
from keras.preprocessing import sequence  # pad/truncate sequences to a fixed length
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import SimpleRNN  # RNN

# Download the movie review dataset
url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
filepath = r"G:\PycharmProjects\aclImdb_v1.tar.gz"
if not os.path.isfile(filepath):  # download only if the file is not already there
    result = urllib.request.urlretrieve(url, filepath)
    print('downloaded:', result)
if not os.path.exists(r"G:\PycharmProjects\aclImdb"):
    tfile = tarfile.open(filepath, 'r:gz')  # extract the dataset archive
    result = tfile.extractall(r"G:/PycharmProjects/")

def rm_tags(text):
    re_tag = re.compile(r'<[^>]+>')  # strip html tags
    return re_tag.sub('', text)

def read_file(filetype):  # read the review files
    path = "G:/PycharmProjects/aclImdb/"
    file_list = []
    positive_path = path + filetype + '/pos/'  # path of the positive reviews
    for f in os.listdir(positive_path):
        file_list += [positive_path + f]  # collect file paths
    negative_path = path + filetype + '/neg/'  # path of the negative reviews
    for f in os.listdir(negative_path):
        file_list += [negative_path + f]
    print('read', filetype, 'files:', len(file_list))  # number of files read
    all_labels = ([1] * 12500 + [0] * 12500)  # first 12500 are positive (1), last 12500 negative (0)
    all_texts = []
    for fi in file_list:  # read every file
        with open(fi, encoding='utf8') as file_input:
            # read the file, join all lines into one string, strip html tags
            # with rm_tags, then append the result to all_texts
            all_texts += [rm_tags(" ".join(file_input.readlines()))]
    return all_labels, all_texts

y_train, train_text = read_file("train")
y_test, test_text = read_file("test")
y_train = np.array(y_train)
y_test = np.array(y_test)
# Build the tokenizer
token = Tokenizer(num_words=3800)  # dictionary size is 3800, matching the embedding's input_dim
# Build the tokenizer's dictionary
token.fit_on_texts(train_text)  # rank words by frequency and keep the top 3800
# Convert each review text into a list of word indices
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq = token.texts_to_sequences(test_text)
# Truncate long sequences and pad short ones
x_train = sequence.pad_sequences(x_train_seq, maxlen=380)
x_test = sequence.pad_sequences(x_test_seq, maxlen=380)
# RNN model
model = Sequential()
model.add(Embedding(output_dim=32, input_dim=3800, input_length=380))
# each word is represented by a 32-dimensional vector; the dictionary holds
# 3800 words; each sequence has 380 indices, i.e. one review = 380 numbers
model.add(Dropout(0.35))
model.add(SimpleRNN(16))  # RNN layer with 16 units
model.add(Dense(units=256, activation='relu'))  # 256 neurons, relu activation
model.add(Dropout(0.35))
model.add(Dense(units=1, activation='sigmoid'))  # output 1 = positive, 0 = negative; sigmoid activation
model.summary()  # model summary
# Configure
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])  # loss, optimizer, metric
# Train
train_history = model.fit(x=x_train, y=y_train, validation_split=0.2, epochs=10, batch_size=300, verbose=1)
# train for 10 epochs with batches of 300 samples

# Plot the training history
def show_train_history(train_history, train, validation):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train History')
    plt.ylabel(train)
    plt.xlabel('Epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()

show_train_history(train_history, 'accuracy', 'val_accuracy')  # accuracy curve
show_train_history(train_history, 'loss', 'val_loss')  # loss curve
scores = model.evaluate(x_test, y_test)  # evaluate on the test set
print(scores)
print('Test loss: ', scores[0])
print('Test accuracy: ', scores[1])
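For intuition about what SimpleRNN(16) computes: the same cell is applied at every time step, updating a hidden state as h_t = tanh(x_t·W + h_{t-1}·U + b) and returning the final state. A numpy sketch of that forward pass (an illustration of the recurrence, not Keras internals):

import numpy as np

def simple_rnn_forward(x, W, U, b):
    # x: (timesteps, input_dim); W: (input_dim, units); U: (units, units); b: (units,)
    h = np.zeros(U.shape[0])
    for x_t in x:  # one update per time step (380 steps here)
        h = np.tanh(x_t @ W + h @ U + b)
    return h  # final hidden state, shape (units,), fed to the Dense layers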
3. Long short-term memory (LSTM) model code
The code for the LSTM model is as follows.
# Note: this code is not my own; the source is linked at the end of the article
import urllib.request  # download files
import os
import tarfile  # extract archives
import re
import numpy as np
import matplotlib.pyplot as plt
from keras.preprocessing.text import Tokenizer  # build the dictionary
from keras.preprocessing import sequence  # pad/truncate sequences to a fixed length
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM  # LSTM

# Download the movie review dataset
url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
filepath = r"G:\PycharmProjects\aclImdb_v1.tar.gz"
if not os.path.isfile(filepath):  # download only if the file is not already there
    result = urllib.request.urlretrieve(url, filepath)
    print('downloaded:', result)
if not os.path.exists(r"G:\PycharmProjects\aclImdb"):
    tfile = tarfile.open(filepath, 'r:gz')  # extract the dataset archive
    result = tfile.extractall(r"G:/PycharmProjects/")

def rm_tags(text):
    re_tag = re.compile(r'<[^>]+>')  # strip html tags
    return re_tag.sub('', text)

def read_file(filetype):  # read the review files
    path = "G:/PycharmProjects/aclImdb/"
    file_list = []
    positive_path = path + filetype + '/pos/'  # path of the positive reviews
    for f in os.listdir(positive_path):
        file_list += [positive_path + f]  # collect file paths
    negative_path = path + filetype + '/neg/'  # path of the negative reviews
    for f in os.listdir(negative_path):
        file_list += [negative_path + f]
    print('read', filetype, 'files:', len(file_list))  # number of files read
    all_labels = ([1] * 12500 + [0] * 12500)  # first 12500 are positive (1), last 12500 negative (0)
    all_texts = []
    for fi in file_list:  # read every file
        with open(fi, encoding='utf8') as file_input:
            # read the file, join all lines into one string, strip html tags
            # with rm_tags, then append the result to all_texts
            all_texts += [rm_tags(" ".join(file_input.readlines()))]
    return all_labels, all_texts

y_train, train_text = read_file("train")
y_test, test_text = read_file("test")
y_train = np.array(y_train)
y_test = np.array(y_test)
# Build the tokenizer
token = Tokenizer(num_words=3800)  # dictionary size is 3800, matching the embedding's input_dim
# Build the tokenizer's dictionary
token.fit_on_texts(train_text)  # rank words by frequency and keep the top 3800
# Convert each review text into a list of word indices
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq = token.texts_to_sequences(test_text)
# Truncate long sequences and pad short ones
x_train = sequence.pad_sequences(x_train_seq, maxlen=380)
x_test = sequence.pad_sequences(x_test_seq, maxlen=380)
# LSTM model
model = Sequential()
model.add(Embedding(output_dim=32, input_dim=3800, input_length=380))
# each word is represented by a 32-dimensional vector; the dictionary holds
# 3800 words; each sequence has 380 indices, i.e. one review = 380 numbers
model.add(Dropout(0.2))
model.add(LSTM(32))  # LSTM layer with 32 units
model.add(Dense(units=256, activation='relu'))  # 256 neurons, relu activation
model.add(Dropout(0.2))
model.add(Dense(units=1, activation='sigmoid'))  # output 1 = positive, 0 = negative; sigmoid activation
model.summary()  # model summary
# Configure
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])  # loss, optimizer, metric
# Train
train_history = model.fit(x=x_train, y=y_train, validation_split=0.2, epochs=10, batch_size=300, verbose=1)
# train for 10 epochs with batches of 300 samples

# Plot the training history
def show_train_history(train_history, train, validation):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train History')
    plt.ylabel(train)
    plt.xlabel('Epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()

show_train_history(train_history, 'accuracy', 'val_accuracy')  # accuracy curve
show_train_history(train_history, 'loss', 'val_loss')  # loss curve
scores = model.evaluate(x_test, y_test)  # evaluate on the test set
print(scores)
print('Test loss: ', scores[0])
print('Test accuracy: ', scores[1])
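Compared with the SimpleRNN cell sketched earlier, each LSTM step also carries a cell state controlled by input, forget, and output gates, which is what lets it remember longer-range context in a review. A one-step numpy sketch of the standard LSTM equations (gates in Keras's i, f, c, o order; an illustration, not Keras internals):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W: (input_dim, 4*units); U: (units, 4*units); b: (4*units,)
    n = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b
    i = sigmoid(z[:n])        # input gate: how much new information to admit
    f = sigmoid(z[n:2*n])     # forget gate: how much old cell state to keep
    g = np.tanh(z[2*n:3*n])   # candidate cell state
    o = sigmoid(z[3*n:])      # output gate: how much of the cell state to expose
    c = f * c_prev + i * g    # new cell state
    h = o * np.tanh(c)        # new hidden state
    return h, c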
IV. Experimental Results
1. MLP model results
The training output of the MLP model is shown in the figure below.
Accuracy of the MLP model over training epochs (line chart).
Loss of the MLP model over training epochs (line chart).
Final test loss and test accuracy of the MLP model.
2. RNN model results
The training output of the RNN model is shown in the figure below.
Accuracy of the RNN model over training epochs (line chart).
Loss of the RNN model over training epochs (line chart).
Final test loss and test accuracy of the RNN model.
3. LSTM model results
The training output of the LSTM model is shown in the figure below.
Accuracy of the LSTM model over training epochs (line chart).
Loss of the LSTM model over training epochs (line chart).
Final test loss and test accuracy of the LSTM model.
Comparing the three models, the MLP reaches a test accuracy of 82.01%, the RNN 82.79%, and the LSTM 86.39%, so the LSTM performs best.
V. Problems Encountered
The first problem came up when installing tensorflow into the environment: installing it directly with the following command did not work, as "Solving environment" kept spinning and never reached "done".
conda install tensorflow
So I first listed all available tensorflow versions with the following command.
anaconda show anaconda/tensorflow
The output is shown in the figure below.
Then choose a tensorflow version compatible with your python version and install it with the following command.
conda install --channel https://conda.anaconda.org/anaconda tensorflow=2.1.0
The second problem was an error when running the program: "Warning! HDF5 library version mismatched error".
Solution: run the following two commands in order, first uninstalling h5py and then installing a version compatible with the environment.
pip uninstall h5py
pip install h5py
Summary
That concludes the sentiment analysis on the IMDB review dataset. Finding suitable open-source code and configuring a matching environment are the keys to running the code successfully on your own machine. I hope this article helps your study!
References:
TensorFlow (Baidu Encyclopedia)
Keras (Baidu Encyclopedia)
IMDb movie review dataset processing and LSTM sentiment analysis
Reference code (linked from the original post)