Data science [9]: SVD (2)
2022-07-02 06:20:00 【swy_ swy_ swy】
Data preparation
This time we work with text data. The sklearn.datasets module provides fetch_20newsgroups, a collection of newsgroup posts in 20 different categories; here we select four of them.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
news_data = fetch_20newsgroups(subset='train', categories=categories)
Stemming
Each article can be treated as a string made up of words separated by spaces and line breaks. Note that the same word may appear in many inflected forms, such as different tenses, plurals, cases and so on, depending on the language. Reducing these variants to a common stem is called stemming. We can do this with SnowballStemmer():
stemmer = SnowballStemmer('english')
stemmed_articles = []
for article in news_data.data:
    # Split the article on whitespace and stem each token
    stemmed_words = []
    for word in article.split():
        stemmed_words.append(stemmer.stem(word))
    # Rejoin the stems into a single string per article
    stemmed_articles.append(" ".join(stemmed_words))
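To see what the stemmer does to individual words, here is a small sketch. The word lists are made up for illustration and are not taken from the newsgroup data.

```python
from nltk.stem.snowball import SnowballStemmer

demo_stemmer = SnowballStemmer('english')

# Illustrative word variants (assumed examples, not from the dataset)
run_variants = ["run", "runs", "running"]
study_variants = ["studies", "studying"]

run_stems = [demo_stemmer.stem(w) for w in run_variants]
study_stems = [demo_stemmer.stem(w) for w in study_variants]

print(run_stems)    # all three variants collapse to 'run'
print(study_stems)  # both collapse to 'studi', which is not a dictionary word
```

A stem is only a shared key for the variants of a word, so it does not have to be a valid word itself.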
The importance of words
Given a word and a collection of texts, how can we quantify how important the word is? One common measure is the tf-idf feature, which consists of two parts:
- TF(t, a): the frequency of term t in article a, i.e. the number of occurrences of t divided by the total number of words in the article.
- IDF(t, s): the logarithm of the ratio between the number of articles in collection s and the number of articles that contain term t, i.e. log10(total number of articles / number of articles containing t).
- tf-idf: the product of TF and IDF.
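The definitions above can be worked through by hand on a tiny corpus. The sketch below is a direct transcription of those formulas; the function names and the three toy documents are assumptions made for illustration.

```python
import math

def tf(term, article):
    """Occurrences of term divided by the number of words in the article."""
    return article.count(term) / len(article)

def idf(term, corpus):
    """log10(total articles / articles containing term).

    Assumes the term appears in at least one article.
    """
    n_containing = sum(1 for article in corpus if term in article)
    return math.log10(len(corpus) / n_containing)

def tf_idf(term, article, corpus):
    return tf(term, article) * idf(term, corpus)

# Toy corpus: each article is already split into words
toy_corpus = [
    ["apple", "banana", "apple"],
    ["apple", "cherry"],
    ["apple", "date", "date", "date"],
]

# "apple" occurs in every article, so its IDF is log10(3/3) = 0
apple_score = tf_idf("apple", toy_corpus[0], toy_corpus)
# "date" occurs only in the last article: TF = 3/4, IDF = log10(3/1)
date_score = tf_idf("date", toy_corpus[2], toy_corpus)

print(apple_score, date_score)
```

A word that appears everywhere scores zero no matter how often it occurs, while a word concentrated in one article scores high there.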
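With this definition, every text in the collection gets a feature vector with one tf-idf value per term.
We can use TfidfVectorizer from sklearn.feature_extraction.text to compute these tf-idf feature vectors. Note that its default formula differs slightly from the definition above: it uses raw term counts, a smoothed IDF based on the natural logarithm, and L2-normalizes each row.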
import pandas as pd
tfidfvctr = TfidfVectorizer(max_df = 0.25, min_df = 0.05)
tfidf_mat = tfidfvctr.fit_transform(stemmed_articles)
tfidf_df = pd.DataFrame(tfidf_mat.toarray())
tfidf_df.to_csv("tfidf.csv", index=False)
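The max_df and min_df arguments filter the vocabulary by document frequency. When given as floats they are proportions of the corpus, so max_df = 0.25 drops terms that occur in more than 25% of the articles and min_df = 0.05 drops terms that occur in fewer than 5%. A small sketch on a made-up four-document corpus, with thresholds chosen so the effect is visible:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (assumed for illustration): "apple" is in all 4 documents,
# "banana" and "cherry" in 2 each, every other term in only 1.
toy_docs = [
    "apple banana cherry",
    "apple banana date",
    "apple cherry egg",
    "apple fig grape",
]

# Keep terms found in at least 50% and at most 75% of the documents
toy_vectorizer = TfidfVectorizer(max_df=0.75, min_df=0.5)
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

kept_terms = sorted(toy_vectorizer.vocabulary_)
print(kept_terms)        # ['banana', 'cherry']
print(toy_matrix.shape)  # (4, 2)
```

The last document contains none of the kept terms, so its row is all zeros. The same can happen to real articles when the thresholds are strict.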
Dimensionality reduction of tf-idf with SVD
We reduce the dimensionality of the tf-idf matrix obtained above with SVD, keeping a different number of components each time, and evaluate the clustering result with the disagreement distance.
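The loop that follows calls a helper named disagreement_dist, which this post does not define. One common definition counts the pairs of points on which two labelings disagree: pairs placed in the same cluster by one labeling but in different clusters by the other. The sketch below assumes that definition; the author's version may differ, for example by normalizing the count.

```python
import numpy as np

def disagreement_dist(labels_a, labels_b):
    """Count point pairs that one labeling groups together and the other separates.

    Assumed pair-counting definition; not taken from the original post.
    """
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    # same_a[i, j] is True when points i and j share a label in labeling a
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    # Count each unordered pair once (strict upper triangle)
    return int(np.triu(same_a != same_b, k=1).sum())

# Renaming the clusters does not change the grouping, so the distance is 0
print(disagreement_dist([0, 0, 1, 1], [1, 1, 0, 0]))
# Here 4 of the 6 pairs are treated differently by the two labelings
print(disagreement_dist([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because only co-membership is compared, the measure is unaffected by KMeans numbering its clusters differently from news_data.target. The pairwise matrices take memory quadratic in the number of articles, which is acceptable for a few thousand documents. With such a helper defined, the loop that follows has the function it calls.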
from sklearn.decomposition import PCA

# disagreement_dist must be defined before this loop runs; it compares two labelings
disagreement_distance = []
original_dataset = pd.read_csv("tfidf.csv", low_memory=False).values
for k in range(1, 25):
    # Keep the first k principal components
    dim_reduced_dataset = PCA(k).fit_transform(original_dataset)
    # Cluster into 4 groups, one per selected newsgroup category
    kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=100, n_init=10, random_state=0)
    labelsk = kmeans.fit_predict(dim_reduced_dataset)
    disagreement_distance.append(disagreement_dist(labelsk, news_data.target))
plt.plot(range(1, 25), disagreement_distance)
plt.ylabel('Disagreement')
plt.xlabel('Dimension')
plt.show()
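The heading speaks of SVD while the loop uses PCA. The two are closely related: PCA amounts to an SVD of the column-centered data matrix. Below is a minimal NumPy sketch of that reduction. The name svd_reduce and the random demo matrix are assumptions for illustration, and the result should agree with PCA(k).fit_transform only up to the sign of each column and numerical tolerance.

```python
import numpy as np

def svd_reduce(X, k):
    """Project rows of X onto the top-k singular directions of the centered matrix."""
    X_centered = X - X.mean(axis=0)
    # Thin SVD: X_centered = U @ diag(S) @ Vt, singular values in descending order
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # U[:, :k] * S[:k] equals X_centered @ Vt[:k].T, the k-dimensional coordinates
    return U[:, :k] * S[:k]

rng = np.random.default_rng(0)
X_demo = rng.random((20, 6))   # stand-in for the tf-idf matrix
Z_demo = svd_reduce(X_demo, 3)
print(Z_demo.shape)            # (20, 3)
```

Each extra component adds the direction with the next largest singular value, which is why the plot above looks at how the clustering changes as the dimension k grows.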