[Data Mining] Task 4: 20 Newsgroups Clustering
2022-07-03 01:38:00 【zstar-_】
Requirement
Cluster the 20 Newsgroups dataset and display the clustering results to the user. The user can pick one of the clusters and mark it as followed; the cluster's keywords serve as the topic, so the user can track that topic and read the articles that belong to it.
Import related libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
import re
import string
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from wordcloud import WordCloud
%matplotlib inline
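Data acquisition
Use sklearn's fetch_20newsgroups to download the data. By default it returns the training subset; headers, footers and quoted replies are stripped here.
dataset = fetch_20newsgroups(
    download_if_missing=True, remove=('headers', 'footers', 'quotes'))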
Data preview
The news data is divided into 20 categories. List the category names first, then plot the number of posts in each category.
dataset.target_names
['alt.atheism',
'comp.graphics',
'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware',
'comp.windows.x',
'misc.forsale',
'rec.autos',
'rec.motorcycles',
'rec.sport.baseball',
'rec.sport.hockey',
'sci.crypt',
'sci.electronics',
'sci.med',
'sci.space',
'soc.religion.christian',
'talk.politics.guns',
'talk.politics.mideast',
'talk.politics.misc',
'talk.religion.misc']
# Plot the number of posts in each category
targets, frequency = np.unique(dataset.target, return_counts=True)
targets_str = np.array(dataset.target_names)
fig = plt.figure(figsize=(10, 5), dpi=80, facecolor='w', edgecolor='k')
plt.bar(targets_str, frequency)
plt.xticks(rotation=90)
plt.title('Class distribution of 20 Newsgroups Training Data')
plt.xlabel('News Group')
plt.ylabel('Number')
plt.show()

Data preprocessing
To improve clustering quality, preprocess the data before clustering: remove tokens that contain digits, replace punctuation with spaces, and convert uppercase letters to lowercase.
dataset_df = pd.DataFrame({
    'data': dataset.data, 'target': dataset.target})
# Use regular expressions for data processing
def alphanumeric(x):
    # Replace every token that contains a digit with a space
    return re.sub(r"\w*\d\w*", ' ', x)
def punc_lower(x):
    # Lowercase the text and replace punctuation with spaces
    return re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())
dataset_df['data'] = dataset_df.data.map(alphanumeric).map(punc_lower)
The processed data is displayed
dataset_df.data
0 i was wondering if anyone out there could enli...
1 a fair number of brave souls who upgraded thei...
2 well folks my mac plus finally gave up the gh...
3 \ndo you have weitek s address phone number ...
4 from article world std com by tombaker ...
...
11309 dn from nyeda cnsvax uwec edu david nye \nd...
11310 i have a very old mac and a mac plus both...
11311 i just installed a cpu in a clone motherbo...
11312 \nwouldn t this require a hyper sphere in ...
11313 stolen from pasadena between and pm on...
Name: data, Length: 11314, dtype: object
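To see what the two regular expressions do, here is a minimal sketch on a made-up sentence. The sample string is an illustrative assumption and does not come from the dataset; the two functions are repeated so that the snippet runs on its own.

```python
import re
import string

def alphanumeric(x):
    # Replace every token that contains at least one digit with a space
    return re.sub(r"\w*\d\w*", " ", x)

def punc_lower(x):
    # Lowercase the text, then replace each punctuation character with a space
    return re.sub("[%s]" % re.escape(string.punctuation), " ", x.lower())

sample = "Call 555-1234 or mail abc123 NOW!"  # hypothetical input
cleaned = punc_lower(alphanumeric(sample))
tokens = cleaned.split()
print(tokens)  # ['call', 'or', 'mail', 'now']
```

Note that a token such as abc123 is dropped entirely, not just its digits, because the pattern matches the whole word around the digit.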
K-means clustering
Use the K-means algorithm to group the data into 20 clusters.
texts = dataset_df['data']  # use the preprocessed text from the previous step
target = dataset.target
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)
number_of_clusters = 20
model = KMeans(n_clusters=number_of_clusters,
               init='k-means++',
               max_iter=100,
               n_init=1)
model.fit(X)
KMeans(max_iter=100, n_clusters=20, n_init=1)
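As a rough illustration of what TfidfVectorizer computes before K-means sees the data, the sketch below reproduces in plain Python the weighting that scikit-learn documents as its default: a smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation. It is a simplification under stated assumptions: tokens are split on whitespace, no stop words are removed, and the three documents and the helper name tfidf are invented for the example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Hypothetical mini-corpus
docs = ["space shuttle launch", "hockey game tonight", "space station orbit"]
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

In this toy corpus "space" occurs in two of the three documents, so it receives a lower weight than "shuttle" or "launch" within the first document. This is the property that lets K-means separate documents by their distinctive terms.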
View the top keywords of each cluster; 20 keywords are shown per cluster.
dict_list = []
# Sort the term weights of each centroid in descending order
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names_out() requires scikit-learn >= 1.0
terms = vectorizer.get_feature_names_out()
for i in range(number_of_clusters):
    keyword_weights = {}
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :20]:
        print(' %s' % terms[ind])
        keyword_weights[terms[ind]] = model.cluster_centers_[i][ind]
    dict_list.append(keyword_weights)
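The keyword lookup relies on the argsort()[:, ::-1] idiom, which can be easier to follow on a tiny example. The sketch below uses a hypothetical 2-cluster, 4-term centroid matrix; all numbers and terms are made up for illustration.

```python
import numpy as np

# Hypothetical vocabulary and centroid weights (one row per cluster)
terms = np.array(["car", "space", "game", "orbit"])
centers = np.array([
    [0.10, 0.70, 0.05, 0.60],   # cluster 0
    [0.50, 0.02, 0.45, 0.01],   # cluster 1
])

# argsort() returns column indices in ascending order of weight per row;
# [:, ::-1] reverses each row so the heaviest terms come first
order = centers.argsort()[:, ::-1]

top_k = 2
top_terms = [terms[order[i, :top_k]].tolist() for i in range(len(centers))]
print(top_terms)  # [['space', 'orbit'], ['car', 'game']]
```

Each centroid is the mean TF-IDF vector of its members, so the terms with the largest centroid weights are the ones that characterise the cluster.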
Category prediction
Use the fitted model to predict which cluster a document belongs to.
# Predict the cluster of a single document
X = vectorizer.transform([texts[400]])
cluster = model.predict(X)[0]
# print("This document belongs to cluster {0}".format(cluster))
# Plot the distribution of the true labels in the dataset
count_target = dataset_df['target'].value_counts()
plt.figure(figsize=(8, 4))
sns.barplot(x=count_target.index, y=count_target.values, alpha=0.8)
plt.ylabel('Number', fontsize=12)
plt.xlabel('Target', fontsize=12)

Word cloud display
Show a word cloud for each cluster.
for i in range(20):
    wordcloud = WordCloud(background_color="white", relative_scaling=0.5,
                          normalize_plurals=False).generate_from_frequencies(dict_list[i])
    fig = plt.figure(figsize=(8, 6))
    plt.axis('off')
    plt.title('Cluster %d:' % i, fontsize=15)
    plt.imshow(wordcloud)
    plt.show()
Only two of the word clouds are shown here.
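The requirement also asks that a user can follow one cluster and read its articles, which the code above does not cover. One possible way to do this is sketched below. The function name follow_topic and the toy data are hypothetical, not part of the original notebook; the lists stand in for the cluster labels, the article texts and the top keywords per cluster.

```python
def follow_topic(cluster_id, labels, texts, keywords_per_cluster):
    """Collect the keywords and member articles of the cluster a user follows."""
    articles = [text for label, text in zip(labels, texts) if label == cluster_id]
    return {
        "cluster": cluster_id,
        "keywords": keywords_per_cluster[cluster_id],
        "articles": articles,
    }

# Hypothetical stand-ins for the cluster labels, texts and top keywords
labels = [0, 1, 0, 1]
texts = ["shuttle launch delayed", "goalie saves the game",
         "new orbit for the probe", "playoff schedule announced"]
keywords_per_cluster = [["space", "orbit", "launch"], ["game", "team", "hockey"]]

topic = follow_topic(0, labels, texts, keywords_per_cluster)
print("Topic keywords:", topic["keywords"])
for article in topic["articles"]:
    print("-", article)
```

In the notebook, labels could be taken from model.labels_ and the keywords from the keys of each entry in dict_list. A new post could then be routed to a followed topic with model.predict(vectorizer.transform([new_text])), assuming it is preprocessed the same way as the training text.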
