Data science [viii]: SVD (I)
2022-07-02 06:20:00 【swy_swy_swy】
This article shows how to use SVD. For the underlying theory, or for implementing SVD itself, please refer to other resources.
SVD is mainly used for feature extraction, data compression, and similar tasks.
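As a quick reminder of what numpy's SVD gives us, here is a minimal sketch on a toy 4×3 random matrix (a placeholder, not MNIST data): the factors U, the singular values s, and Vᵀ multiply back to the original matrix.

import numpy as np

A = np.random.rand(4, 3)            # toy placeholder matrix
u, s, vt = np.linalg.svd(A)         # u: 4x4, s: 3 singular values (descending), vt: 3x3

# Rebuild A from the factors; s has to be padded into a 4x3 diagonal matrix first.
sigma = np.zeros(A.shape)
np.fill_diagonal(sigma, s)
print(np.allclose(A, u @ sigma @ vt))   # True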
Data preparation
Saving MNIST to CSV
Common datasets, including mnist_784, can be downloaded with fetch_openml.
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml

# Download MNIST (70,000 flattened 28x28 images) as plain numpy arrays.
X, y = fetch_openml(name="mnist_784", version=1, return_X_y=True, as_frame=False)

# Put the label in the first column and the 784 pixel values after it.
full_data = np.c_[y, X]
full_df = pd.DataFrame(full_data)
full_df.to_csv("mnist.csv", index=False)
Getting the singular values
Obtain the singular values of one of the "0" images.
SVD can be computed with numpy's linalg.svd.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

full_df = pd.read_csv("mnist.csv", low_memory=False)
full_data = full_df.values

plt.figure()
for n in range(100):
    if full_data[n][0] == 0:                     # first image labelled "0"
        print(n)
        data = full_data[n][1:].reshape(28, 28)
        u, s, v = np.linalg.svd(data)            # s holds the 28 singular values, descending
        plt.plot(s)
        break
plt.show()
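A simple heuristic for deciding how many singular values to keep in the next section is the fraction of the total squared singular values captured by the leading ones. A short sketch, reusing the s computed above (the 90% threshold is arbitrary):

energy = np.cumsum(s ** 2) / np.sum(s ** 2)     # cumulative "energy" of the singular values
k = np.searchsorted(energy, 0.90) + 1           # smallest k that reaches 90%
print(k, energy[k - 1])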
Data compression
We can compress (or blur) the data by keeping only some of the singular values.
Single image compression
Example: compress a "0" image by setting some of its singular values to 0.
Matrix multiplication can be done with numpy.matmul.
def image_svd(n, data):
    # Keep only the first n singular values of a 2-D image.
    u, s, v = np.linalg.svd(data)                # note: v is already V-transposed
    svd = np.zeros((u.shape[0], v.shape[1]))     # truncated diagonal matrix of singular values
    for i in range(n):
        svd[i, i] = s[i]
    img = np.matmul(u, svd)
    img = np.matmul(img, v)
    return img
plt.figure()
plt.subplot(1, 2, 1)
original_img = full_data[1][1:].reshape(28, 28)
plt.imshow(original_img, cmap="gray")
compress_img = image_svd(10, original_img)
plt.subplot(1, 2, 2)
plt.imshow(compress_img, cmap="gray")
plt.show()
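The same rank-n reconstruction can also be written without building the full diagonal matrix, using array slicing; a minimal sketch of an equivalent version of image_svd (the function name is just for illustration):

def image_svd_sliced(n, data):
    # Equivalent to image_svd: keep only the first n singular values.
    u, s, v = np.linalg.svd(data)
    return (u[:, :n] * s[:n]) @ v[:n, :]    # scale the first n columns of u, drop the rest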
Full data set compression
Example: compress the entire dataset and save it to CSV.
X_data = full_data[:, 1:]                        # drop the label column
app_X = np.zeros(X_data.shape)
for i in range(X_data.shape[0]):                 # iterate over every image
    original = X_data[i].reshape(28, 28)
    svd_img = image_svd(10, original)
    app_X[i] = svd_img.reshape(1, 784)
app_df = pd.DataFrame(app_X)
app_df.to_csv("app_mnist.csv", index=False)
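Note that writing the reconstructed 784-pixel rows back to CSV does not by itself shrink the data; if actual storage savings are the goal, one option is to store only the truncated factors of each image. A rough sketch under that assumption (the output file name is hypothetical):

k = 10
factors = []
for i in range(X_data.shape[0]):
    u, s, v = np.linalg.svd(X_data[i].reshape(28, 28))
    # 28*k + k + k*28 = 570 numbers per image instead of 784 pixels.
    factors.append(np.concatenate([u[:, :k].ravel(), s[:k], v[:k, :].ravel()]))
np.save("mnist_rank10_factors.npy", np.array(factors))   # hypothetical output file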
Some observations
Cluster dispersion
First, using the same method as in the previous post, we cluster the compressed dataset and plot the cluster centers:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt
app_df = pd.read_csv("app_mnist.csv", low_memory=False)
kmeans_app = KMeans(n_clusters=10)
kmeans_app.fit(app_df.values)
app_centers = kmeans_app.cluster_centers_
centers_2d = PCA(2).fit_transform(app_centers)
plt.scatter(centers_2d[:, 0], centers_2d[:, 1])
plt.show()
Then we do the same on the original dataset:
org_df = pd.read_csv("mnist.csv", low_memory=False)
kmeans_org = KMeans(n_clusters=10)
kmeans_org.fit(org_df.values)
org_centers = kmeans_org.cluster_centers_
centers_2d = PCA(2).fit_transform(org_centers)
plt.scatter(centers_2d[:, 0], centers_2d[:, 1])
plt.show()
Clustering accuracy
Compute the disagreement distance, relative to the ground-truth labels, of the clusterings obtained on the original and on the compressed dataset:
def disagreement_dist(P_labels, C_labels):
    # Count the pairs (i, j) that the two labelings treat differently:
    # together in one labeling but separated in the other.
    answer = 0
    for i in range(len(P_labels) - 1):
        for j in range(i + 1, len(P_labels)):
            if (P_labels[i] == P_labels[j]) != (C_labels[i] == C_labels[j]):
                answer += 1
    return answer
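With 70,000 samples the double loop above has to examine roughly 2.4 billion pairs, which is very slow in pure Python. The same count can be obtained in linear time from the contingency table of the two labelings; a sketch (the function name is mine, not from any library):

import numpy as np

def disagreement_dist_fast(P_labels, C_labels):
    # Contingency table: table[i, j] = number of points with P-label i and C-label j.
    _, p_idx = np.unique(np.asarray(P_labels).ravel(), return_inverse=True)
    _, c_idx = np.unique(np.asarray(C_labels).ravel(), return_inverse=True)
    table = np.zeros((p_idx.max() + 1, c_idx.max() + 1), dtype=np.int64)
    np.add.at(table, (p_idx, c_idx), 1)

    def pairs(x):
        # Number of unordered pairs inside each group, summed over groups.
        return int((x * (x - 1) // 2).sum())

    same_P = pairs(table.sum(axis=1))   # pairs placed together by P
    same_C = pairs(table.sum(axis=0))   # pairs placed together by C
    same_both = pairs(table)            # pairs placed together by both
    # Pairs that are together in exactly one of the two labelings.
    return same_P + same_C - 2 * same_both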
import numpy as np

org_plabels = kmeans_org.labels_
app_plabels = kmeans_app.labels_

# Ground-truth digit labels: the first column of the original CSV, as a flat integer array.
clabels = pd.read_csv("mnist.csv", low_memory=False).values[:, 0]
clabels = clabels.astype(np.int8)

print("Difference on original dataset:")
print(disagreement_dist(org_plabels, clabels))
print("Difference on approximated dataset:")
print(disagreement_dist(app_plabels, clabels))
Difference on original dataset:
288675700
Difference on approximated dataset:
2161020505
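The raw counts are easier to interpret after dividing by the total number of unordered pairs; the result is the fraction of pairs on which a clustering disagrees with the ground truth (and 1 minus it is the Rand index). A quick sketch using the numbers above:

n = len(clabels)                      # 70,000 samples, about 2.45e9 unordered pairs
total_pairs = n * (n - 1) // 2
print(288675700 / total_pairs)        # ≈ 0.12 for the original dataset
print(2161020505 / total_pairs)       # ≈ 0.88 for the approximated dataset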