Part 34: Classifying text sentiment from labeled samples with sklearn
2022-06-09 09:59:00 [BerryBC]
1. Foreword

This post uses sklearn to classify the sentiment of web-page text captured by a crawler. It is a simple engineering exercise, not a theoretical one (no theoretical analysis).
2. Approach

Randomly sample data from MongoDB, segment the text with jieba, then feed the results to sklearn as training samples for classification. Segmentation can easily produce 40,000-50,000 distinct words, so the information entropy of each word is computed and the high-entropy words are dropped.
The remaining low-entropy keywords are then fed to sklearn for learning.
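Concretely (as computed in section 4-3 below), for each keyword w the entropy is taken over the sentiment labels of the samples that contain w:

H(w) = - Σ_i p_i · log2(p_i)

where p_i is the fraction of those samples carrying label i (1 = positive, 0 = useless, -1 = negative). A keyword whose presence says little about the label has high entropy and is discarded; low-entropy keywords are kept as features.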
3. Content captured by the crawler

Part 28: Adding custom content to the original crawler project; reading URL domains with Python

The content was captured by the earlier crawler project, and roughly 1,000 records were labeled with a sentiment class by hand.
The manual labeling page looked like this (screenshot omitted):
4. Analysis code

4-1) Reading from MongoDB

This is really just sampling from the database, via the following MongoDB filter (an aggregation pipeline):
curPos = colSample.aggregate([
    {'$match': {'cf': True, 'e': 1}},
    {'$sample': {'size': 300}}])
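For context, a sampled document presumably carries at least these fields (inferred from the code; the exact schema is my assumption): 'cf' flags a manually confirmed record, 'e' is the sentiment label (1 positive, 0 useless, -1 negative), and 'ct' holds the page text.

# Hypothetical shape of one tbSample document:
docExample = {'cf': True, 'e': 1, 'ct': '...crawled page text...'}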
4-2) Segmenting and collecting keywords

Segmentation is done with the jieba package; I kept it simple and did not use any of the more elaborate segmentation modes.
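A minimal jieba sketch (the sentence is only an illustration):

import jieba
# Precise mode (cut_all=False), the same mode used below
print(list(jieba.cut('我爱中国', cut_all=False)))  # e.g. ['我', '爱', '中国']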
After iterating over the three sentiment sets (positive, negative, useless), the words produced by segmentation are arranged in one consistent order. The code is as follows:
# Initial definitions
# dictSampleOfNew={'kw':[],'e':1}
arrXPreTrain = []
arrYPreTrain = []
# Global record of intermediate results
dictAllResult = {
    'intIndexNow': 0, 'arrXPreTrain': [], 'arrYPreTrain': [],
    'dictKW': {}, 'arrSample': [], 'arrColunms': []}
# Walk the MongoDB cursor and collect the keywords of each sample
def ToGrepSampleKW(curSamples, dictInAllResult):
    for eleSamples in curSamples:
        # Without a short sleep the CPU is easily maxed out
        time.sleep(0.05)
        genSampleWord = jieba.cut(eleSamples['ct'], cut_all=False)
        dictSampleOfNew = {'kw': [], 'e': eleSamples['e']}
        # Handle every keyword produced by segmentation
        for eleKW in genSampleWord:
            # Record each keyword only once per sample
            if not eleKW in dictSampleOfNew['kw']:
                dictSampleOfNew['kw'].append(eleKW)
            # First occurrence across all samples: add it to the global dict
            if not eleKW in dictInAllResult['dictKW'].keys():
                dictInAllResult['dictKW'][eleKW] = dictInAllResult['intIndexNow']
                dictInAllResult['intIndexNow'] += 1
                dictInAllResult['arrColunms'].append(eleKW)
        dictInAllResult['arrSample'].append(dictSampleOfNew)
    return dictInAllResult
# Build a matrix: True where the sample contains the keyword, else False
# Roughly:
#            我     爱     中国
# sample 1   True   False  True
# sample 2   False  True   True
def ToArraySample(dictInAllResult):
    for dictEle in dictInAllResult['arrSample']:
        arrNewSample = [False for intX in range(
            dictInAllResult['intIndexNow'])]
        for eleKWFI in dictEle['kw']:
            arrNewSample[dictInAllResult['dictKW'][eleKWFI]] = True
        dictInAllResult['arrXPreTrain'].append(arrNewSample)
        dictInAllResult['arrYPreTrain'].append(dictEle['e'])
    return dictInAllResult
dictAllResult = ToGrepSampleKW(curPos, dictAllResult)
print('Done Pos ' + time.strftime('%Y-%m-%d %H:%M:%S'))
dictAllResult = ToGrepSampleKW(curUseless, dictAllResult)
print('Done Useless '+time.strftime('%Y-%m-%d %H:%M:%S'))
dictAllResult = ToGrepSampleKW(curNeg, dictAllResult)
print('Done Neg '+time.strftime('%Y-%m-%d %H:%M:%S'))
dictAllResult = ToArraySample(dictAllResult)
print('Done Arr '+time.strftime('%Y-%m-%d %H:%M:%S'))
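As a toy check of the resulting format, here is a hypothetical two-sample input that mirrors the comment matrix above:

dictToy = {'intIndexNow': 3, 'dictKW': {'我': 0, '爱': 1, '中国': 2},
           'arrSample': [{'kw': ['我', '中国'], 'e': 1},
                         {'kw': ['爱', '中国'], 'e': -1}],
           'arrXPreTrain': [], 'arrYPreTrain': [], 'arrColunms': ['我', '爱', '中国']}
dictToy = ToArraySample(dictToy)
print(dictToy['arrXPreTrain'])  # [[True, False, True], [False, True, True]]
print(dictToy['arrYPreTrain'])  # [1, -1]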
4-3) Computing keyword information entropy

For the entropy computation I referenced:

Computing information entropy with Python and pandas

The code:
# 'columns' selects which column of the DataFrame to use
def get_entropy(data_df, columns=None):
    time.sleep(0.01)
    if columns is None:
        raise ValueError("the columns must not be empty!")
    # Information entropy
    pe_value_array = data_df[columns].unique()
    ent = 0.0
    for x_value in pe_value_array:
        p = float(data_df[data_df[columns] ==
                          x_value].shape[0]) / data_df.shape[0]
        logp = np.log2(p)
        ent -= p * logp
    return ent
# Number of samples
intLenOfXPreT = len(arrXPreTrain)
# Labels as a one-column DataFrame (needed by get_entropy)
dfY = pd.DataFrame(arrYPreTrain)
arrKWForEntropy = []
# Iterate over every keyword
for intI in range(dictAllResult['intIndexNow']):
    # Whether the keyword appears in each sample
    arrTmp = [arrXPreTrain[intJ][intI] for intJ in range(intLenOfXPreT)]
    dfX = pd.DataFrame(arrTmp)
    # Print progress, and this keyword's entropy
    if intI % 1000 == 0:
        print(intI)
        print(dictAllResult['arrColunms'][intI] + ' entropy: ' +
              str(get_entropy(dfY[dfX[0]], 0)))
    # Append the keyword and its entropy to arrKWForEntropy
    if dfY[dfX[0]].shape[0] > 1:
        arrKWForEntropy.append(
            [dictAllResult['arrColunms'][intI], get_entropy(dfY[dfX[0]], 0)])
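As a sanity check, applying get_entropy (as defined above) to a toy label column:

import pandas as pd
dfToy = pd.DataFrame([1, 1, -1, 0])
# p = [0.5, 0.25, 0.25] -> H = -(0.5*log2(0.5) + 2*0.25*log2(0.25)) = 1.5 bits
print(get_entropy(dfToy, 0))  # 1.5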
4-4) Binning keywords by entropy

Plenty of articles cover pandas binning, so I won't repeat them here. This snippet just separates the high-entropy keywords from the low-entropy ones:
dfEntropy = pd.DataFrame(arrKWForEntropy, columns=['KW', 'IE'])
print(dfEntropy.head(10))
# Binning
cutB = pd.cut(dfEntropy['IE'], 2, labels=['L', 'H'])
dfEntropy['IEBin'] = cutB
print(dfEntropy[dfEntropy['IEBin'] == 'L'])
nparrKWWaitFor = dfEntropy[dfEntropy['IEBin'] == 'L']['KW'].values
On whether to bin with cut or qcut: I did consider qcut, but it can leave too many keywords, which makes the later steps painful, and some high-entropy words would slip in. The difference is sketched below.
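A toy illustration on hypothetical values: cut splits the value range into equal-width bins, so an outlier pushes most items into one bin, while qcut splits by sample count and would keep roughly half the keywords, higher-entropy ones included.

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
# cut: equal-width bins over the value range
print(pd.cut(s, 2, labels=['L', 'H']).tolist())   # ['L', 'L', 'L', 'L', 'H']
# qcut: equal-frequency bins (quantiles)
print(pd.qcut(s, 2, labels=['L', 'H']).tolist())  # ['L', 'L', 'L', 'H', 'H']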
4-5) Reassembling and splitting off a test set

Use sklearn's train_test_split to hold out a test set for validation.
# Re-order the samples over the low-entropy keywords only
arrForTrain = [[] for intI in range(len(arrXPreTrain))]
for nparrEle in nparrKWWaitFor:
    intJ = dictAllResult['dictKW'][nparrEle]
    for intK in range(intLenOfXPreT):
        arrForTrain[intK].append(arrXPreTrain[intK][intJ])
# Filter out the useless samples (label 0)
arrXForTrainReal = []
arrYForTrainReal = []
for intI in range(len(arrYPreTrain)):
    if arrYPreTrain[intI] != 0:
        arrXForTrainReal.append(arrForTrain[intI])
        arrYForTrainReal.append(arrYPreTrain[intI])
# Hold out 20% as a test set
arrXtrain, arrXtest, arrYtrain, arrYtest = train_test_split(
    arrXForTrainReal, arrYForTrainReal, test_size=0.2)
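One note: train_test_split shuffles at random, so the class balance of the 20% test set can drift between runs. If that matters, a stratified, seeded split is an option (a sketch, not in the original):

arrXtrain, arrXtest, arrYtrain, arrYtest = train_test_split(
    arrXForTrainReal, arrYForTrainReal, test_size=0.2,
    stratify=arrYForTrainReal, random_state=42)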
4-6) Scoring

Just a scoring snippet.
I did not compare many methods, nor use any hyperparameter-tuning package.
My main reasoning: whether each word is present or not is its own dimension rather than part of one unified feature, so a decision tree could be misled by the mere appearance of certain words.
So I simply picked a few algorithms that learn across dimensions.
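For a comparison less sensitive to one lucky split, sklearn's cross_val_score could be used instead; a minimal sketch (not part of the original run):

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# 5-fold cross-validated accuracy instead of a single 80/20 split
scores = cross_val_score(RandomForestClassifier(), arrXForTrainReal,
                         arrYForTrainReal, cv=5)
print(scores.mean(), scores.std())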
# ------- sklearn tests begin -------
rcfClassifier = RandomForestClassifier()
rcfClassifier = rcfClassifier.fit(arrXtrain, arrYtrain)
clfScore = rcfClassifier.score(arrXtest, arrYtest)
print("Random forest score: " + str(clfScore))
print(time.strftime('%Y-%m-%d %H:%M:%S'))
clfBagging = BaggingClassifier(base_estimator=LinearSVC(
    random_state=0, tol=1e-05, max_iter=10000))
clfBagging.fit(arrXtrain, arrYtrain)
clfScore = clfBagging.score(arrXtest, arrYtest)
print("Bagging SVC score: " + str(clfScore))
print(time.strftime('%Y-%m-%d %H:%M:%S'))
clfAdaB = AdaBoostClassifier()
clfAdaB.fit(arrXtrain, arrYtrain)
clfScore = clfAdaB.score(arrXtest, arrYtest)
print("AdaBoost score: " + str(clfScore))
print(time.strftime('%Y-%m-%d %H:%M:%S'))
# # ----------------------------------
4-7) Results

Given these results, I'll just go with the bagged SVC~
# # ----------------------------------
>>> clfScore = rcfClassifier.score(arrXtest,arrYtest)
>>>
>>> print("随机森林评分: "+str(clfScore))
随机森林评分: 0.7916666666666666
>>> print(time.strftime('%Y-%m-%d %H:%M:%S'))
2020-02-22 15:10:16
>>>
>>> clfBagging = BaggingClassifier(base_estimator=LinearSVC(
... random_state=0, tol=1e-05, max_iter=10000))
>>> clfBagging.fit(arrXtrain, arrYtrain)
BaggingClassifier(base_estimator=LinearSVC(C=1.0, class_weight=None, dual=True,
fit_intercept=True,
intercept_scaling=1,
loss='squared_hinge', max_iter=10000,
multi_class='ovr', penalty='l2',
random_state=0, tol=1e-05,
verbose=0),
bootstrap=True, bootstrap_features=False, max_features=1.0,
max_samples=1.0, n_estimators=10, n_jobs=None,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
>>> clfScore = clfBagging.score(arrXtest, arrYtest)
>>>
>>> print("装袋 SVC 评分: "+str(clfScore))
装袋 SVC 评分: 0.825
>>> print(time.strftime('%Y-%m-%d %H:%M:%S'))
2020-02-22 15:10:17
>>>
>>> clfAdaB = AdaBoostClassifier()
>>> clfAdaB.fit(arrXtrain, arrYtrain)
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
n_estimators=50, random_state=None)
>>> clfScore = clfAdaB.score(arrXtest,arrYtest)
>>>
>>> print("AdaBoost 评分: "+str(clfScore))
AdaBoost 评分: 0.725
>>> print(time.strftime('%Y-%m-%d %H:%M:%S'))
2020-02-22 15:10:20
4-8) Full code
import jieba
import pymongo
import time
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import LinearSVC
dbClient = pymongo.MongoClient('mongodb://127.0.0.1:27017/')
dbMongo = dbClient['dbSample']
dbMongo.authenticate('Berry', 'Berry')
colSample = dbMongo['tbSample']
curPos = colSample.aggregate([
    {'$match': {'cf': True, 'e': 1}},
    {'$sample': {'size': 300}}])
curUseless = colSample.aggregate([
    {'$match': {'cf': True, 'e': 0}},
    {'$sample': {'size': 200}}])
curNeg = colSample.aggregate([
    {'$match': {'cf': True, 'e': -1}},
    {'$sample': {'size': 300}}])
dictKW = {}
arrSample = []
# dictSampleOfNew={'kw':[],'e':1}
intIndexNow = 0
arrXPreTrain = []
arrYPreTrain = []
dictAllResult = {
    'intIndexNow': 0, 'arrXPreTrain': [], 'arrYPreTrain': [],
    'dictKW': {}, 'arrSample': [], 'arrColunms': []}
def ToGrepSampleKW(curSamples, dictInAllResult):
    for eleSamples in curSamples:
        time.sleep(0.05)
        genSampleWord = jieba.cut(eleSamples['ct'], cut_all=False)
        dictSampleOfNew = {'kw': [], 'e': eleSamples['e']}
        for eleKW in genSampleWord:
            if not eleKW in dictSampleOfNew['kw']:
                dictSampleOfNew['kw'].append(eleKW)
            if not eleKW in dictInAllResult['dictKW'].keys():
                dictInAllResult['dictKW'][eleKW] = dictInAllResult['intIndexNow']
                dictInAllResult['intIndexNow'] += 1
                dictInAllResult['arrColunms'].append(eleKW)
        dictInAllResult['arrSample'].append(dictSampleOfNew)
    return dictInAllResult
def ToArraySample(dictInAllResult):
    for dictEle in dictInAllResult['arrSample']:
        arrNewSample = [False for intX in range(
            dictInAllResult['intIndexNow'])]
        for eleKWFI in dictEle['kw']:
            arrNewSample[dictInAllResult['dictKW'][eleKWFI]] = True
        dictInAllResult['arrXPreTrain'].append(arrNewSample)
        dictInAllResult['arrYPreTrain'].append(dictEle['e'])
    return dictInAllResult
dictAllResult = ToGrepSampleKW(curPos, dictAllResult)
print('Done Pos ' + time.strftime('%Y-%m-%d %H:%M:%S'))
dictAllResult = ToGrepSampleKW(curUseless, dictAllResult)
print('Done Useless '+time.strftime('%Y-%m-%d %H:%M:%S'))
dictAllResult = ToGrepSampleKW(curNeg, dictAllResult)
print('Done Neg '+time.strftime('%Y-%m-%d %H:%M:%S'))
dictAllResult = ToArraySample(dictAllResult)
print('Done Arr '+time.strftime('%Y-%m-%d %H:%M:%S'))
curPos.close()
curUseless.close()
curNeg.close()
dbClient.close()
arrXPreTrain = dictAllResult['arrXPreTrain']
arrYPreTrain = dictAllResult['arrYPreTrain']
dfY = pd.DataFrame(arrYPreTrain)
def get_entropy(data_df, columns=None):
    time.sleep(0.01)
    if columns is None:
        raise ValueError("the columns must not be empty!")
    # Information entropy
    pe_value_array = data_df[columns].unique()
    ent = 0.0
    for x_value in pe_value_array:
        p = float(data_df[data_df[columns] ==
                          x_value].shape[0]) / data_df.shape[0]
        logp = np.log2(p)
        ent -= p * logp
    return ent
intLenOfXPreT = len(arrXPreTrain)
arrKWForEntropy = []
for intI in range(dictAllResult['intIndexNow']):
    arrTmp = [arrXPreTrain[intJ][intI] for intJ in range(intLenOfXPreT)]
    dfX = pd.DataFrame(arrTmp)
    # if intI % 1000 == 0:
    #     print(intI)
    #     print(dictAllResult['arrColunms'][intI] + ' entropy: ' +
    #           str(get_entropy(dfY[dfX[0]], 0)))
    if dfY[dfX[0]].shape[0] > 1:
        arrKWForEntropy.append(
            [dictAllResult['arrColunms'][intI], get_entropy(dfY[dfX[0]], 0)])
dfEntropy = pd.DataFrame(arrKWForEntropy, columns=['KW', 'IE'])
print(dfEntropy.head(10))
cutB = pd.cut(dfEntropy['IE'], 2, labels=['L', 'H'])
dfEntropy['IEBin'] = cutB
print(dfEntropy[dfEntropy['IEBin'] == 'L'])
nparrKWWaitFor = dfEntropy[dfEntropy['IEBin'] == 'L']['KW'].values
arrForTrain = [[] for intI in range(len(arrXPreTrain))]
for nparrEle in nparrKWWaitFor:
    intJ = dictAllResult['dictKW'][nparrEle]
    for intK in range(intLenOfXPreT):
        arrForTrain[intK].append(arrXPreTrain[intK][intJ])
arrXForTrainReal = []
arrYForTrainReal = []
for intI in range(len(arrYPreTrain)):
    if arrYPreTrain[intI] != 0:
        arrXForTrainReal.append(arrForTrain[intI])
        arrYForTrainReal.append(arrYPreTrain[intI])
arrXtrain, arrXtest, arrYtrain, arrYtest = train_test_split(
    arrXForTrainReal, arrYForTrainReal, test_size=0.2)
# ------- sklearn tests begin -------
rcfClassifier = RandomForestClassifier()
rcfClassifier = rcfClassifier.fit(arrXtrain, arrYtrain)
clfScore = rcfClassifier.score(arrXtest, arrYtest)
print("Random forest score: " + str(clfScore))
print(time.strftime('%Y-%m-%d %H:%M:%S'))
clfBagging = BaggingClassifier(base_estimator=LinearSVC(
    random_state=0, tol=1e-05, max_iter=10000))
clfBagging.fit(arrXtrain, arrYtrain)
clfScore = clfBagging.score(arrXtest, arrYtest)
print("Bagging SVC score: " + str(clfScore))
print(time.strftime('%Y-%m-%d %H:%M:%S'))
clfAdaB = AdaBoostClassifier()
clfAdaB.fit(arrXtrain, arrYtrain)
clfScore = clfAdaB.score(arrXtest, arrYtest)
print("AdaBoost score: " + str(clfScore))
print(time.strftime('%Y-%m-%d %H:%M:%S'))
# # ----------------------------------
5. Follow-up

Once the model sklearn produces is saved, it can simply be reloaded and used later.
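A minimal save/load sketch with joblib (the file name is hypothetical):

import joblib
joblib.dump(clfBagging, 'sentiment_model.pkl')  # persist the trained classifier
clf = joblib.load('sentiment_model.pkl')        # reload it later
print(clf.score(arrXtest, arrYtest))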
The plan is to classify newly crawled content with it, feed the results back in as new samples, and regenerate the model.
But that's for another day.