Data science [9]: SVD (2)

2022-07-02 06:20:00 【swy_ swy_ swy】

Data preparation

This time we work with text data. fetch_20newsgroups from sklearn.datasets provides the 20 Newsgroups collection, a set of news posts in 20 different categories; here we select four of them.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
news_data = fetch_20newsgroups(subset='train', categories=categories)

Stemming

Each article can be treated as a string made up of spaces, line breaks, and words. Note that the same word can appear in many inflected forms, such as different tenses, plurals, or cases, depending on the language. Reducing these variants to a common stem is called stemming. We can do this with SnowballStemmer():

stemmer = SnowballStemmer('english')
stemmed_articles = []
for article in news_data.data:
    stemmed_words = []
    for word in article.split():
        stemmed_words.append(stemmer.stem(word))

    stemmed_articles.append(" ".join(stemmed_words))
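Splitting on whitespace alone leaves punctuation attached to words, so "graphics," and "graphics" can end up as different stems. Below is a minimal sketch of one way to tokenize before stemming, assuming lowercase alphabetic tokens are sufficient. The names stem_article and toy_stem are hypothetical, and toy_stem is only a stand-in so the snippet runs without NLTK.

```python
import re

def stem_article(text, stem_fn):
    """Lowercase the text, keep alphabetic tokens, stem each one, rejoin with spaces."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(stem_fn(token) for token in tokens)

def toy_stem(word):
    # Hypothetical stand-in for stemmer.stem: strips a single trailing "s".
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

print(stem_article("Graphics, images\nand   image files.", toy_stem))
```

With the stemmer created above, the equivalent call would be stem_article(article, stemmer.stem).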

The importance of words

Given a word and a collection of texts, how can we quantify the importance of that word? One measure is the tf-idf feature, which consists of two parts:

TF(t, a): the frequency of word t in article a, i.e. the number of occurrences of t divided by the total number of words in the article.
IDF(t, s): the logarithm of the ratio between the number of articles in collection s and the number of articles that contain t, i.e. log10(total number of articles / number of articles containing t).
tf-idf: the product of TF and IDF.
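The definitions above can be worked through on a tiny example. This is a sketch that follows the formulas as stated here (ratio-based TF, base-10 IDF); the three-sentence corpus and the function names are made up for illustration.

```python
import math

def tf(term, article_tokens):
    """Occurrences of term divided by the total number of words in the article."""
    return article_tokens.count(term) / len(article_tokens)

def idf(term, corpus_tokens):
    """log10(total number of articles / number of articles containing term)."""
    n_containing = sum(1 for tokens in corpus_tokens if term in tokens)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log10(len(corpus_tokens) / n_containing)

def tf_idf(term, article_tokens, corpus_tokens):
    return tf(term, article_tokens) * idf(term, corpus_tokens)

# Hypothetical toy corpus, already split into words
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

print(tf_idf("the", corpus[0], corpus))  # "the" occurs in every article, so IDF is 0
print(tf_idf("cat", corpus[0], corpus))  # (1/6) * log10(3/2)
```

A word that appears in every article gets a score of zero, while rarer words are weighted up. Note that TfidfVectorizer in scikit-learn uses a different variant by default (raw counts for TF, a smoothed natural-log IDF, and L2-normalized rows), so its values will not match this hand computation exactly.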

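By this definition, every text in the collection gets a feature vector with one tf-idf entry per word.
We can use TfidfVectorizer from sklearn.feature_extraction.text to obtain these tf-idf feature vectors. Below, max_df=0.25 and min_df=0.05 drop words that appear in more than 25% or in fewer than 5% of the articles.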

import pandas as pd

tfidfvctr = TfidfVectorizer(max_df = 0.25, min_df = 0.05)
tfidf_mat = tfidfvctr.fit_transform(stemmed_articles)
tfidf_df = pd.DataFrame(tfidf_mat.toarray())
tfidf_df.to_csv("tfidf.csv", index=False)

Dimensionality reduction of tf-idf with SVD

We reduce the dimensionality of the tf-idf matrix obtained above with SVD, keeping the leading singular values for a range of ranks k, and evaluate the resulting clustering with the disagreement distance.
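The post does not show how the disagreement distance is computed. One common definition, assumed here, is the number of pairs of points on which two clusterings disagree, meaning the pair is grouped together in one clustering and separated in the other. A minimal sketch under that assumption follows; the function signature is a guess at what the loop below expects.

```python
import numpy as np

def disagreement_dist(labels_a, labels_b):
    """Count pairs of points that are together in one clustering but apart in the other."""
    labels_a = np.asarray(labels_a).ravel()
    labels_b = np.asarray(labels_b).ravel()
    if labels_a.shape != labels_b.shape:
        raise ValueError("both label arrays must have the same length")
    if labels_a.size == 0:
        return 0

    # Contingency table: entry (i, j) counts points in cluster i of A and cluster j of B
    _, a_idx = np.unique(labels_a, return_inverse=True)
    _, b_idx = np.unique(labels_b, return_inverse=True)
    contingency = np.zeros((a_idx.max() + 1, b_idx.max() + 1), dtype=np.int64)
    np.add.at(contingency, (a_idx, b_idx), 1)

    def n_pairs(counts):
        # Number of unordered pairs inside each group, summed over groups
        return int((counts * (counts - 1) // 2).sum())

    same_a = n_pairs(contingency.sum(axis=1))   # pairs together in A
    same_b = n_pairs(contingency.sum(axis=0))   # pairs together in B
    same_both = n_pairs(contingency)            # pairs together in both
    return same_a + same_b - 2 * same_both
```

Because only co-membership of pairs is compared, the result does not depend on how the clusters are numbered, which matters when comparing KMeans labels against the true category labels.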

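from sklearn.decomposition import PCA  # missing import; PCA is computed via an SVD of the centered data

# disagreement_dist(labels_a, labels_b) must already be defined when this loop runs.
disagreement_distance = []
original_dataset = pd.read_csv("tfidf.csv", low_memory=False).values
for k in range(1, 25):
    # Project the tf-idf matrix onto its first k principal components
    dim_reduced_dataset = PCA(n_components=k).fit_transform(original_dataset)

    # Cluster the reduced data into 4 groups, one per selected category
    kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=100, n_init=10, random_state=0)
    labelsk = kmeans.fit_predict(dim_reduced_dataset)
    # Compare the cluster assignment with the true category labels
    disagreement_distance.append(disagreement_dist(labelsk, news_data.target))

plt.plot(range(1, 25), disagreement_distance)
plt.ylabel('Disagreement')
plt.xlabel('Dimension')
plt.show()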
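The loop above relies on PCA, which centers the data and then applies an SVD. The connection can be sketched directly with NumPy on a small random matrix; the data and the helper name svd_reduce are hypothetical. The result should match PCA(k).fit_transform up to the sign of each column.

```python
import numpy as np

def svd_reduce(X, k):
    """Project the rows of X onto the top-k right singular vectors of the centered data."""
    X_centered = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # U[:, :k] * s[:k] equals X_centered @ Vt[:k].T, the k-dimensional coordinates
    return U[:, :k] * s[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))   # 6 samples, 4 features (made-up data)
Z = svd_reduce(X, 2)
print(Z.shape)
```

Keeping only the first k singular values discards the directions with the least variance, which is why a small k can already separate the clusters reasonably well.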