[Data mining] Task 4: 20 Newsgroups clustering
2022-07-03 01:38:00 【zstar-_】
Requirement
Cluster the 20 Newsgroups dataset and display the clustering results to the user. The user can select one of the clusters and mark it as followed. The cluster's keywords serve as its topic, so the user can track the topic and read the articles that belong to it.
Import related libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
import re
import string
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from wordcloud import WordCloud
%matplotlib inline
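Data acquisition
Use sklearn's fetch_20newsgroups to download the data. By default it returns the training subset. Headers, footers, and quoted replies are removed here so that only the body text of each post is kept.
dataset = fetch_20newsgroups(
    download_if_missing=True, remove=('headers', 'footers', 'quotes'))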
Data preview
The news data is divided into 20 categories. List the category names, then visualize the number of documents in each category.
dataset.target_names
['alt.atheism',
'comp.graphics',
'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware',
'comp.windows.x',
'misc.forsale',
'rec.autos',
'rec.motorcycles',
'rec.sport.baseball',
'rec.sport.hockey',
'sci.crypt',
'sci.electronics',
'sci.med',
'sci.space',
'soc.religion.christian',
'talk.politics.guns',
'talk.politics.mideast',
'talk.politics.misc',
'talk.religion.misc']
# Visualize the number of documents in each category
targets, frequency = np.unique(dataset.target, return_counts=True)
targets_str = np.array(dataset.target_names)
fig = plt.figure(figsize=(10, 5), dpi=80, facecolor='w', edgecolor='k')
plt.bar(targets_str, frequency)
plt.xticks(rotation=90)
plt.title('Class distribution of 20 Newsgroups Training Data')
plt.xlabel('News Group')
plt.ylabel('Number')
plt.show()
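As a small illustration of the counting step, np.unique with return_counts=True returns the sorted distinct labels together with how often each one occurs. The sketch below uses a made-up label array (toy_labels is an assumption for illustration, not part of the dataset).

```python
import numpy as np

# Hypothetical label array standing in for dataset.target
toy_labels = np.array([2, 0, 2, 1, 0, 2])

# Sorted distinct labels and the number of times each one appears
values, counts = np.unique(toy_labels, return_counts=True)

for value, count in zip(values, counts):
    print("label %d: %d documents" % (value, count))
```

The two returned arrays line up index by index, which is why they can be passed directly to plt.bar as the x positions and bar heights.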

Data preprocessing
To improve clustering quality, preprocess the data before clustering: remove every token that contains a digit, replace punctuation with spaces, and convert uppercase letters to lowercase.
dataset_df = pd.DataFrame({
    'data': dataset.data, 'target': dataset.target})
# Use regular expressions for data processing
def alphanumeric(x):
    # Replace every token that contains at least one digit with a space
    return re.sub(r"""\w*\d\w*""", ' ', x)
def punc_lower(x):
    # Lowercase the text and replace each punctuation character with a space
    return re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())
dataset_df['data'] = dataset_df.data.map(alphanumeric).map(punc_lower)
Display the processed data:
dataset_df.data
0 i was wondering if anyone out there could enli...
1 a fair number of brave souls who upgraded thei...
2 well folks my mac plus finally gave up the gh...
3 \ndo you have weitek s address phone number ...
4 from article world std com by tombaker ...
...
11309 dn from nyeda cnsvax uwec edu david nye \nd...
11310 i have a very old mac and a mac plus both...
11311 i just installed a cpu in a clone motherbo...
11312 \nwouldn t this require a hyper sphere in ...
11313 stolen from pasadena between and pm on...
Name: data, Length: 11314, dtype: object
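To see what the two preprocessing functions do to a single string, here is a minimal, self-contained sketch. The functions are copied from the step above; the sample sentence is made up for illustration.

```python
import re
import string

def alphanumeric(x):
    # Replace every token that contains at least one digit with a space
    return re.sub(r"\w*\d\w*", ' ', x)

def punc_lower(x):
    # Lowercase the text and replace each punctuation character with a space
    return re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())

# Made-up sample sentence, for illustration only
sample = "Call me at 555-1234 or email bob42@example.com!"
cleaned = punc_lower(alphanumeric(sample))

# Splitting on whitespace shows which tokens survive the cleaning
tokens = cleaned.split()
print(tokens)
```

Note that "bob42" disappears entirely: the first pattern drops any token that contains a digit, not only the digits themselves.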
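K-means clustering
Use the K-means clustering method to group the data into 20 clusters. The texts are first converted to TF-IDF vectors, with English stop words removed.
# Note: this uses the raw text in dataset.data; to cluster the
# preprocessed text instead, pass dataset_df['data'] to the vectorizer
texts = dataset.data
target = dataset.target
# Convert the documents to a sparse TF-IDF matrix
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)
number_of_clusters = 20
model = KMeans(n_clusters=number_of_clusters,
               init='k-means++',
               max_iter=100,
               n_init=1)
model.fit(X)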
KMeans(max_iter=100, n_clusters=20, n_init=1)
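The same pipeline can be tried on a tiny corpus without downloading anything. The following is a minimal sketch, assuming a made-up four-sentence corpus and two clusters. The names toy_docs, toy_vectorizer, and toy_model are hypothetical, and random_state and n_init=10 are choices made for this sketch rather than settings from the article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Made-up mini corpus, for illustration only
toy_docs = [
    "the rocket launch reached orbit",
    "nasa plans a new orbit mission",
    "the goalie saved the hockey puck",
    "hockey playoffs start this week",
]

# Learn the vocabulary and IDF weights, and build the document-term matrix
toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_X = toy_vectorizer.fit_transform(toy_docs)

# random_state fixes the initialization so repeated runs give the same result
toy_model = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)
toy_model.fit(toy_X)
print(toy_model.labels_)

# New text must go through transform(), not fit_transform(),
# so that it is mapped into the vocabulary learned above
new_X = toy_vectorizer.transform(["a hockey game tonight"])
new_cluster = toy_model.predict(new_X)[0]
print(new_cluster)
```

The cluster numbers themselves are arbitrary: K-means only groups documents, so the numbers do not correspond to the newsgroup category indices.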
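View the keywords of each cluster after clustering. The 20 highest-weighted terms are shown for each cluster.
dict_list = []
# Sort each centroid's feature indices by weight, largest first
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()
for i in range(number_of_clusters):
    # Map each top term to its centroid weight (avoid shadowing the built-in dict)
    keyword_weights = {}
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :20]:
        print(' %s' % terms[ind])
        keyword_weights[terms[ind]] = model.cluster_centers_[i][ind]
    dict_list.append(keyword_weights)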
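The keyword extraction relies on argsort() sorting in ascending order, so each row is reversed to put the largest centroid weights first. A minimal sketch with a hypothetical four-term vocabulary and two made-up centroids:

```python
import numpy as np

# Hypothetical vocabulary and centroid matrix (2 clusters x 4 terms)
toy_terms = np.array(['car', 'engine', 'god', 'space'])
toy_centers = np.array([
    [0.6, 0.3, 0.0, 0.1],
    [0.0, 0.1, 0.2, 0.7],
])

# argsort() sorts ascending, so reversing each row puts the largest weights first
toy_order = toy_centers.argsort()[:, ::-1]

# Keep the two highest-weighted terms per cluster, together with their weights
top_terms = []
for i in range(toy_centers.shape[0]):
    top = {str(toy_terms[j]): float(toy_centers[i, j]) for j in toy_order[i, :2]}
    top_terms.append(top)
    print("Cluster %d:" % i, top)
```

Each dictionary maps a term to its weight, which is the same shape of input that WordCloud.generate_from_frequencies expects.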
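Category prediction
Use the fitted model to assign a document to a cluster. The example takes one document from the dataset. The bar chart that follows shows how many documents carry each original category label.
# Predict the cluster of a single document
X = vectorizer.transform([texts[400]])
cluster = model.predict(X)[0]
# print("This document belongs to cluster {0}".format(cluster))
# Plot the number of documents per original category label
count_target = dataset_df['target'].value_counts()
plt.figure(figsize=(8, 4))
# seaborn 0.12+ requires x and y to be passed as keyword arguments
sns.barplot(x=count_target.index, y=count_target.values, alpha=0.8)
plt.ylabel('Number', fontsize=12)
plt.xlabel('Target', fontsize=12)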
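The task also asks that a user can follow one cluster and read its articles, a step the code here does not show. One possible sketch, assuming the cluster labels are available as an array (as in model.labels_ after fitting), is to select the documents whose label matches the followed cluster. The function name articles_in_cluster and the toy data are hypothetical.

```python
import numpy as np

def articles_in_cluster(texts, labels, cluster_id, limit=3):
    """Return up to `limit` texts whose cluster label equals `cluster_id`."""
    labels = np.asarray(labels)
    # Indices of all documents assigned to the followed cluster
    indices = np.where(labels == cluster_id)[0]
    return [texts[i] for i in indices[:limit]]

# Made-up texts and labels standing in for dataset.data and model.labels_
toy_texts = ["orbit mission", "hockey game", "rocket launch", "playoff puck"]
toy_labels = [0, 1, 0, 1]

followed_cluster = 0
followed_articles = articles_in_cluster(toy_texts, toy_labels, followed_cluster)
print(followed_articles)
```

With the real data, the same call could take dataset.data and model.labels_, and a newly arriving article could be routed to the followed topic by comparing its predicted cluster with followed_cluster.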

Word cloud display
Draw a word cloud for each cluster from its keyword weights.
for i in range(20):
    # Build the word cloud from the term-to-weight dictionary of cluster i
    wordcloud = WordCloud(background_color="white", relative_scaling=0.5,
                          normalize_plurals=False).generate_from_frequencies(dict_list[i])
    fig = plt.figure(figsize=(8, 6))
    plt.axis('off')
    plt.title('Cluster %d:' % i, fontsize=15)
    plt.imshow(wordcloud)
    plt.show()
Only two of the resulting word clouds are shown here.
