News Classification with an LSTM Model
2022-07-01 21:32:00 【拼命_小李】
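1. A brief introduction to the LSTM model
LSTM (long short-term memory) is a recurrent neural network architecture. A search of the literature shows it is mostly applied to classification, machine translation, sentiment recognition and similar tasks. This article uses tensorflow and keras to build an LSTM model for a news classification example. (Only the application of the model is discussed and implemented here; the theory behind it is not covered.)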
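2. Data processing
The preparation work needs a news dataset and a stop-word file. The raw data, 12,000 records in total, is preprocessed with jieba (word segmentation) and pandas. The initial dataset is shown in the figure below:
First read the stop-word list, then load the data file with pandas, and use jieba to segment each row and remove the stop words. The processing code is as follows: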
import jieba
import pandas as pd


def get_custom_stopwords(stop_words_file):
    # Read the stop-word file, one word per line
    with open(stop_words_file, encoding='utf-8') as f:
        stopwords = f.read()
    # A set keeps the per-token membership test fast
    return set(stopwords.split('\n'))


cachedStopWords = get_custom_stopwords("stopwords.txt")

# Column 0 holds the category label, column 1 the news text
data = pd.read_csv("sohu_test.txt", sep="\t", header=None)
# Map each category name to an integer id
lable_dict = {v: k for k, v in enumerate(data[0].unique())}
data[0] = data[0].map(lable_dict)


def chinese_word_cut(mytext):
    # Segment with jieba, drop stop words, and join the tokens with spaces
    return " ".join([word for word in jieba.cut(mytext) if word not in cachedStopWords])


data[1] = data[1].apply(chinese_word_cut)
data
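
The two preprocessing steps above (label encoding and stop-word removal) can be sketched in plain Python. This is only an illustration: it assumes the text has already been split into tokens, which is the job jieba.cut does in the article, and the helper names, labels, tokens and stop words are made up rather than taken from the dataset.

```python
def build_label_map(labels):
    """Give each distinct label an integer id, in order of first appearance."""
    label_map = {}
    for label in labels:
        if label not in label_map:
            label_map[label] = len(label_map)
    return label_map


def remove_stopwords(tokens, stopwords):
    """Drop stop words and join the remaining tokens with single spaces."""
    return " ".join(tok for tok in tokens if tok not in stopwords)


# Hypothetical sample data, for illustration only
labels = ["sports", "finance", "sports", "tech"]
tokens = ["the", "team", "won", "the", "match"]
stopwords = {"the", "a", "of"}

label_map = build_label_map(labels)
encoded = [label_map[label] for label in labels]
cleaned = remove_stopwords(tokens, stopwords)

print(label_map)  # {'sports': 0, 'finance': 1, 'tech': 2}
print(encoded)    # [0, 1, 0, 2]
print(cleaned)    # team won match
```

The space-joined string is the form the tokenizer in the next section expects, since it splits text on whitespace.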
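3. Vectorizing the text data
Set the initial model parameters: batch_size is the number of samples per batch, class_size is the number of classes, epochs is the number of training epochs, num_words is the number of most frequent words to keep (the Embedding layer later takes this value plus 1 as its vocabulary size), and max_len is the length of each text vector. Tokenizer builds the integer sequences, which are then padded.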
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import Tokenizer

batch_size = 32
class_size = 12
epochs = 300
num_words = 5000
max_len = 600

# Build the word index from the segmented training text
tokenizer = Tokenizer(num_words=num_words)
tokenizer.fit_on_texts(data[1])
# print(tokenizer.word_index)
# train = tokenizer.texts_to_matrix(data[1])
# Convert each text to a list of word indices, then pad to max_len
train = tokenizer.texts_to_sequences(data[1])
train = sequence.pad_sequences(train, maxlen=max_len)
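
What Tokenizer and pad_sequences do to the text can be sketched in plain Python. This is a simplified illustration rather than the Keras implementation: it assumes whitespace-separated tokens, ranks words by frequency starting at index 1, drops words whose index is not below num_words, and pads or truncates at the front, which is the default behaviour of pad_sequences. The helper names and sample sentences are hypothetical.

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Replace each known word by its index, keeping only indices below num_words."""
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


def pad_sequences(seqs, maxlen):
    """Left-pad with zeros; keep only the last maxlen items of longer sequences."""
    return [[0] * (maxlen - len(s)) + s[-maxlen:] for s in seqs]


# Hypothetical, already segmented sample texts
texts = ["stock market rises", "market falls", "team wins match market"]

word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=5)
padded = pad_sequences(sequences, maxlen=2)

print(sequences)  # [[2, 1, 3], [1, 4], [1]]
print(padded)     # [[1, 3], [1, 4], [0, 1]]
```

Index 0 is reserved for padding, which is one reason the embedding table needs room for more than just the retained words.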
4. Building the model
Use train_test_split to split the dataset, then build the model.
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras import Sequential
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM
from tensorflow.keras.models import load_model
from tensorflow.keras.utils import to_categorical

# One-hot encode the integer labels
lable = to_categorical(data[0], num_classes=class_size)
# Hold out 10% of the samples for validation
X_train, X_test, y_train, y_test = train_test_split(train, lable, test_size=0.1, random_state=200)

model = Sequential()
model.add(Embedding(num_words + 1, 128, input_length=max_len))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
# model.add(Dense(64,activation="relu"))
# model.add(Dropout(0.2))
# model.add(Dense(32,activation="relu"))
# model.add(Dropout(0.2))
model.add(Dense(class_size, activation="softmax"))
# Load a previously saved model
# model = load_model('my_model2.h5')
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Save the full model every 2 epochs (newer Keras releases replace period with save_freq)
checkpointer = ModelCheckpoint("./model/model_{epoch:03d}.h5", verbose=0, save_best_only=False, save_weights_only=False, period=2)
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=epochs, batch_size=batch_size, callbacks=[checkpointer])
# model.fit(X_train, y_train, validation_split = 0.2, shuffle=True, epochs=epochs, batch_size=batch_size, callbacks=[checkpointer])
model.save('my_model4.h5')
# print(model.summary())
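
As a sanity check on the architecture, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard layer definitions (an embedding table, an LSTM with four gates that each have an input kernel, a recurrent kernel and one bias vector, and a fully connected output layer) and uses the sizes from the code above. The helper function names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # 4 gates, each with an input kernel, a recurrent kernel and a bias vector
    return 4 * (input_dim * units + units * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units


num_words, embed_dim, lstm_units, class_size = 5000, 128, 128, 12

counts = {
    "embedding": embedding_params(num_words + 1, embed_dim),
    "lstm": lstm_params(embed_dim, lstm_units),
    "dense": dense_params(lstm_units, class_size),
}
total = sum(counts.values())

print(counts)  # {'embedding': 640128, 'lstm': 131584, 'dense': 1548}
print(total)   # 773260
```

Most of the parameters sit in the embedding table, so num_words has the largest effect on model size.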
The training process is shown in the figure below:
5. Visualizing the training results
import matplotlib.pyplot as plt
# Plot training and validation accuracy
plt.plot(model.history.history['accuracy'])
plt.plot(model.history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
# Plot training and validation loss
plt.plot(model.history.history['loss'])
plt.plot(model.history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
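
Because a checkpoint is written every 2 epochs, the validation curve can also be used to decide which saved file to load later. The sketch below is one possible way to do that. It assumes checkpoint file names use 1-based epoch numbers and that only epochs divisible by the save period are written. The accuracy values are invented for illustration and are not results from this model.

```python
def best_saved_epoch(val_accuracy, period):
    """Return the 1-based epoch with the highest validation accuracy
    among the epochs for which a checkpoint was written."""
    saved = [e for e in range(1, len(val_accuracy) + 1) if e % period == 0]
    if not saved:
        raise ValueError("no checkpoint was saved for this history")
    return max(saved, key=lambda e: val_accuracy[e - 1])


# Made-up validation accuracies for epochs 1..6 (not real results)
val_accuracy = [0.61, 0.70, 0.78, 0.76, 0.80, 0.79]

epoch = best_saved_epoch(val_accuracy, period=2)
path = "./model/model_{epoch:03d}.h5".format(epoch=epoch)

print(epoch)  # 6
print(path)   # ./model/model_006.h5
```

In this toy history the best epoch overall is 5, but no checkpoint exists for it, so the helper falls back to the best saved one.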
6. Model prediction
text = "独家新闻 章子怡 机场 暴走 奔 情郎 图 提要 6 月 明星 充实 一边 奉献 爱心 积极参与 赈灾 重建 活动 "
text = [" ".join([word for word in jieba.cut(text) if word not in cachedStopWords])]
# Reuse the tokenizer fitted on the training data; refitting it here would change the word indices
seq = tokenizer.texts_to_sequences(text)
padded = sequence.pad_sequences(seq, maxlen=max_len)
# Load the model saved after training
test_model = load_model('my_model4.h5')
test_pre = test_model.predict(padded)
# Index of the most probable class for each input
test_pre.argmax(axis=1)
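
The argmax result is an integer class id. To read it as a category name, the label-to-id mapping built during preprocessing can be inverted. The following is a minimal sketch in plain Python. The function name, label names and probabilities are hypothetical, and in the article the mapping would be lable_dict.

```python
def decode_predictions(probabilities, label_to_id):
    """Map each row of class probabilities to the name of its most likely class."""
    id_to_label = {idx: name for name, idx in label_to_id.items()}
    decoded = []
    for row in probabilities:
        best = max(range(len(row)), key=lambda i: row[i])
        decoded.append(id_to_label[best])
    return decoded


# Hypothetical mapping and softmax outputs, for illustration only
label_to_id = {"sports": 0, "finance": 1, "entertainment": 2}
probabilities = [[0.1, 0.2, 0.7], [0.8, 0.15, 0.05]]

names = decode_predictions(probabilities, label_to_id)
print(names)  # ['entertainment', 'sports']
```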
7. Code download