News classification based on LSTM model
2022-07-01 21:36:00 【Desperately_ petty thief】
1、 Overview of the LSTM model
LSTM (Long Short-Term Memory) is a recurrent neural network. According to the literature, it is mostly used in classification, machine translation, sentiment recognition, and similar scenarios. In this article we use TensorFlow and Keras to build an LSTM model and implement a news classification case. (Only the application of the model is discussed and implemented; the underlying principles are not described.)
2、 Data processing
Preparing the data requires the news dataset and a stop-word file. We use jieba for word segmentation and pandas to preprocess the raw data; the dataset contains 12,000 samples in total. The initial dataset is shown in the figure below:
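As a quick look at the raw data, the following minimal sketch prints the first few lines of the file; it assumes each line is tab-separated as "category<TAB>text", which is what the pd.read_csv call below also relies on:

# Minimal sketch: peek at the raw file. Assumes "category<TAB>text" per line,
# matching the read_csv(sep="\t") call used in the preprocessing code below.
with open("sohu_test.txt", encoding="utf-8") as f:
    for _ in range(3):
        print(f.readline().rstrip())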
First read the stop-word list, then use pandas to read the data file and use jieba to segment each line of text and filter out stop words. The processing code is shown below:
def get_custom_stopwords(stop_words_file):
    # Read the stop-word file and return one stop word per line.
    with open(stop_words_file, encoding='utf-8') as f:
        stopwords = f.read()
    stopwords_list = stopwords.split('\n')
    custom_stopwords_list = [i for i in stopwords_list]
    return custom_stopwords_list

cachedStopWords = get_custom_stopwords("stopwords.txt")
import pandas as pd
import jieba

# Read the tab-separated file: column 0 is the category, column 1 is the text.
data = pd.read_csv("sohu_test.txt", sep="\t", header=None)
# Map each category name to an integer label.
label_dict = {v: k for k, v in enumerate(data[0].unique())}
data[0] = data[0].map(label_dict)

def chinese_word_cut(mytext):
    # Segment the text with jieba and drop stop words.
    return " ".join([word for word in jieba.cut(mytext) if word not in cachedStopWords])

data[1] = data[1].apply(chinese_word_cut)
data
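As a quick sanity check (a minimal sketch; the exact values depend on the data file), the label mapping and class balance can be inspected:

print(label_dict)              # maps each of the 12 category names to 0..11
print(data[0].value_counts())  # number of samples per category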
3、 Text data vectorization
Set the model's initial parameters: batch_size is the number of samples per training batch; class_size is the number of categories; epochs is the number of training epochs; num_words is the number of most frequent words to keep (when building the Embedding layer, the vocabulary size passed in is this value + 1); max_len is the length of each text vector. Then use Tokenizer to build the index sequences and apply padding:
batch_size = 32
class_size = 12
epochs = 300
num_words = 5000
max_len = 600
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
tokenizer = Tokenizer(num_words=num_words)
tokenizer.fit_on_texts(data[1])
# print(tokenizer.word_index)
# train = tokenizer.texts_to_matrix(data[1])
train = tokenizer.texts_to_sequences(data[1])
train = sequence.pad_sequences(train,maxlen=max_len)
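A quick shape check (sketch) confirms that every sample is now a fixed-length sequence of word indices:

print(train.shape)  # expected: (12000, 600), i.e. (number of samples, max_len)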
4、 Model construction
Use train_test_split to split the dataset, then build and train the model:
from tensorflow.keras.layers import *
from tensorflow.keras import Sequential
from tensorflow.keras.models import load_model
from tensorflow.keras import optimizers
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import ModelCheckpoint
import numpy as np

# One-hot encode the integer labels (12 categories).
label = to_categorical(data[0], num_classes=class_size)
X_train, X_test, y_train, y_test = train_test_split(train, label, test_size=0.1, random_state=200)

model = Sequential()
model.add(Embedding(num_words + 1, 128, input_length=max_len))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
# model.add(Dense(64, activation="relu"))
# model.add(Dropout(0.2))
# model.add(Dense(32, activation="relu"))
# model.add(Dropout(0.2))
model.add(Dense(class_size, activation="softmax"))

# Optionally load a previously saved model instead of building a new one:
# model = load_model('my_model2.h5')

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Save a checkpoint every 2 epochs (`period` is deprecated in newer Keras in favor of save_freq).
checkpointer = ModelCheckpoint("./model/model_{epoch:03d}.h5", verbose=0, save_best_only=False, save_weights_only=False, period=2)
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=epochs, batch_size=batch_size, callbacks=[checkpointer])
# model.fit(X_train, y_train, validation_split=0.2, shuffle=True, epochs=epochs, batch_size=batch_size, callbacks=[checkpointer])
model.save('my_model4.h5')
# print(model.summary())
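For reference, the sizes of the two trainable layers defined above can be worked out by hand (a sketch of the arithmetic):

# Embedding layer: (num_words + 1) * 128 = 5001 * 128 = 640,128 weights.
print((num_words + 1) * 128)
# LSTM layer: 4 gates, each with input, recurrent, and bias weights:
# 4 * ((128 + 128) * 128 + 128) = 131,584 weights.
print(4 * ((128 + 128) * 128 + 128))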
The training process is shown in the figure below :
5、 Visualization of model training results
import matplotlib.pyplot as plt
# Plot training & validation accuracy
plt.plot(model.history.history['accuracy'])
plt.plot(model.history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
# Plot training & validation loss
plt.plot(model.history.history['loss'])
plt.plot(model.history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
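Besides the curves, the held-out split can be scored directly; this minimal sketch reuses the model and split from section 4:

loss, acc = model.evaluate(X_test, y_test, batch_size=batch_size)
print("test loss = %.4f, accuracy = %.4f" % (loss, acc))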
6、 Model prediction
# Sample headline (machine-translated from the original Chinese news text).
text = " an exclusive news report Zhang ziyi The airport go ballistic Ben lover chart summary 6 month star Enrich On one side dedication love actively participate in Disaster relief The reconstruction Activities "
text = [" ".join([word for word in jieba.cut(text) if word not in cachedStopWords])]
# Reuse the tokenizer fitted on the training corpus; do NOT refit it on the new text:
# tokenizer = Tokenizer(num_words=num_words)
# tokenizer.fit_on_texts(text)
seq = tokenizer.texts_to_sequences(text)
padded = sequence.pad_sequences(seq, maxlen=max_len)
# np.expand_dims(padded, axis=0)
test_model = load_model('my_model4.h5')  # load the model saved in section 4
test_pre = test_model.predict(padded)
test_pre.argmax(axis=1)  # index of the predicted category
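To turn the predicted index back into a readable category name, the label_dict built in section 2 can be inverted (a minimal sketch):

inv_label_dict = {v: k for k, v in label_dict.items()}
pred_idx = int(test_pre.argmax(axis=1)[0])
print(inv_label_dict[pred_idx])  # predicted category name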
7、 Code download