[Data Mining] Task 4: Clustering 20 Newsgroups
2022-07-03 01:09:00 【zstar-_】
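Requirements
Cluster the 20 Newsgroups dataset and show the resulting clusters to the user. The user can pick one cluster and mark it as followed; the cluster's keywords then serve as its topic, so the user can track that topic and read the articles that belong to it.
Import the required libraries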
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
import re
import string
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from wordcloud import WordCloud
%matplotlib inline
Fetching the data
Download the data with sklearn's fetch_20newsgroups.
dataset = fetch_20newsgroups(
    download_if_missing=True, remove=('headers', 'footers', 'quotes'))
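Previewing the data
As the output below shows, the news data has 20 categories.
The number of documents in each category is then plotted.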
dataset.target_names
['alt.atheism',
'comp.graphics',
'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware',
'comp.windows.x',
'misc.forsale',
'rec.autos',
'rec.motorcycles',
'rec.sport.baseball',
'rec.sport.hockey',
'sci.crypt',
'sci.electronics',
'sci.med',
'sci.space',
'soc.religion.christian',
'talk.politics.guns',
'talk.politics.mideast',
'talk.politics.misc',
'talk.religion.misc']
# Plot the number of documents in each category
targets, frequency = np.unique(dataset.target, return_counts=True)
targets_str = np.array(dataset.target_names)
fig = plt.figure(figsize=(10, 5), dpi=80, facecolor='w', edgecolor='k')
plt.bar(targets_str, frequency)
plt.xticks(rotation=90)
plt.title('Class distribution of 20 Newsgroups Training Data')
plt.xlabel('News Group')
plt.ylabel('Number')
plt.show()

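Data preprocessing
To improve clustering quality, preprocess the data before clustering: remove tokens that contain digits, strip punctuation, and convert uppercase letters to lowercase.
dataset_df = pd.DataFrame({
    'data': dataset.data, 'target': dataset.target})
# Clean the text with regular expressions
def alphanumeric(x):
    # Replace every token that contains a digit with a space
    return re.sub(r"""\w*\d\w*""", ' ', x)
def punc_lower(x):
    # Lowercase the text and replace each punctuation character with a space
    return re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())
dataset_df['data'] = dataset_df.data.map(alphanumeric).map(punc_lower)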
A sample of the processed data
dataset_df.data
0 i was wondering if anyone out there could enli...
1 a fair number of brave souls who upgraded thei...
2 well folks my mac plus finally gave up the gh...
3 \ndo you have weitek s address phone number ...
4 from article world std com by tombaker ...
...
11309 dn from nyeda cnsvax uwec edu david nye \nd...
11310 i have a very old mac and a mac plus both...
11311 i just installed a cpu in a clone motherbo...
11312 \nwouldn t this require a hyper sphere in ...
11313 stolen from pasadena between and pm on...
Name: data, Length: 11314, dtype: object
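To make the effect of the two cleaning functions concrete, here is a minimal sketch that applies them to one short string. The sample sentence is made up for illustration and does not come from the dataset.

```python
import re
import string

def alphanumeric(x):
    # Replace every token that contains at least one digit with a space
    return re.sub(r"\w*\d\w*", " ", x)

def punc_lower(x):
    # Lowercase, then replace each punctuation character with a space
    return re.sub("[%s]" % re.escape(string.punctuation), " ", x.lower())

# Hypothetical input, chosen to contain digits, punctuation and capitals
sample = "Call 555-1234 or email bob99@x.com, NOW!"
cleaned = punc_lower(alphanumeric(sample))

# Split on whitespace so the runs of spaces left by the substitutions collapse
tokens = cleaned.split()
print(tokens)
```

Note that any token containing a digit is dropped as a whole (bob99 disappears, not just the 99), which is stricter than removing the digits alone.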
K-means clustering
Use K-means to group the data into 20 clusters.
# Cluster the cleaned text produced by the preprocessing step
texts = dataset_df['data']
target = dataset.target
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)
number_of_clusters = 20
model = KMeans(n_clusters=number_of_clusters,
               init='k-means++',
               max_iter=100,
               n_init=1)
model.fit(X)
KMeans(max_iter=100, n_clusters=20, n_init=1)
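The same TF-IDF plus K-means pipeline can be tried on a tiny corpus, where the result is easy to check by eye. This is only a sketch: the four sentences, the toy_ names, and the choices n_clusters=2, n_init=10 and random_state=0 are assumptions made for the illustration, not settings from the task.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: two sentences about space, two about hockey
toy_docs = [
    "the rocket launch reached orbit",
    "nasa rocket orbit mission",
    "the goalie saved the hockey puck",
    "hockey team goalie scored",
]

# Each document becomes one sparse row of TF-IDF weights
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_X = toy_vectorizer.fit_transform(toy_docs)

# Several restarts (n_init=10) make a poor initialization unlikely to win
toy_model = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
toy_labels = toy_model.fit_predict(toy_X)

print(toy_X.shape)
print(toy_labels)
```

The cluster ids themselves are arbitrary; what matters is that the two space sentences share one id and the two hockey sentences share the other. With n_init=1, as in the full run above, a single unlucky initialization can give a worse grouping.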
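Inspect the keywords of each cluster, showing the top 20 terms per cluster.
dict_list = []
# For every centroid, sort the term indices by descending weight
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()
for i in range(number_of_clusters):
    term_weights = {}
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :20]:
        print(' %s' % terms[ind])
        term_weights[terms[ind]] = model.cluster_centers_[i][ind]
    dict_list.append(term_weights)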
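The keyword lookup rests on argsort()[:, ::-1], which orders the term indices of each centroid from the largest weight to the smallest. A minimal sketch with a hand-written centroid matrix shows the idea; the three terms and the weights are invented for the example.

```python
import numpy as np

# Hypothetical vocabulary and two hypothetical cluster centroids
toy_terms = np.array(["car", "space", "god"])
toy_centers = np.array([
    [0.1, 0.7, 0.2],   # cluster 0: dominated by "space"
    [0.5, 0.0, 0.3],   # cluster 1: dominated by "car"
])

# argsort() sorts ascending per row; [:, ::-1] flips each row to descending
order = toy_centers.argsort()[:, ::-1]

# Keep the two highest-weighted terms of every cluster
top_terms = [toy_terms[row[:2]].tolist() for row in order]
print(top_terms)
```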
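Cluster prediction
Use the fitted model to assign documents to clusters.
# Assign a single document to a cluster
X_new = vectorizer.transform([texts[400]])
cluster = model.predict(X_new)[0]
# print("This document belongs to cluster {0}".format(cluster))
# Plot the label distribution (this uses the ground-truth targets, not the predicted clusters)
count_target = dataset_df['target'].value_counts()
plt.figure(figsize=(8, 4))
sns.barplot(x=count_target.index, y=count_target.values, alpha=0.8)
plt.ylabel('Number', fontsize=12)
plt.xlabel('Target', fontsize=12)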
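The requirement of letting a user follow one cluster can be sketched on top of predict(). The example below is an assumption about how such a feature might look, not part of the original notebook: the corpus, the incoming posts, and the helper is_followed are all hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: two posts about space, two about hockey
corpus = [
    "the rocket launch reached orbit",
    "nasa rocket orbit mission",
    "the goalie saved the hockey puck",
    "hockey team goalie scored",
]
vec = TfidfVectorizer(stop_words="english")
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vec.fit_transform(corpus))

# The user "follows" the cluster that contains document 0 (the space posts)
followed_cluster = km.labels_[0]

def is_followed(text):
    """Return True if a new text is assigned to the followed cluster."""
    # transform(), not fit_transform(): reuse the vocabulary learned above
    return km.predict(vec.transform([text]))[0] == followed_cluster

# Hypothetical new posts arriving after the model was fitted
incoming = ["new rocket reaches orbit", "goalie injured in hockey game"]
matches = [doc for doc in incoming if is_followed(doc)]
print(matches)
```

Words that the vectorizer never saw during fitting are ignored by transform(), so a new post is matched only through the vocabulary it shares with the training corpus.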

Word clouds
Draw a word cloud for each cluster.
for i in range(number_of_clusters):
    wordcloud = WordCloud(background_color="white", relative_scaling=0.5,
                          normalize_plurals=False).generate_from_frequencies(dict_list[i])
    fig = plt.figure(figsize=(8, 6))
    plt.axis('off')
    plt.title('Cluster %d:' % i, fontsize=15)
    plt.imshow(wordcloud)
    plt.show()
Only two of the cluster images are shown here as examples.
