Data science [viii]: SVD (I)
2022-07-02 06:20:00 【swy_swy_swy】
This article shows how to use SVD. For the underlying theory, or for implementing SVD itself, please refer to other resources.
SVD is mainly used for feature extraction, data compression, and similar tasks.
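As a minimal sketch of what numpy.linalg.svd returns (my own toy example, not tied to MNIST): a matrix A is factored into U, the singular values s, and Vt, and multiplying the factors back reproduces A up to floating-point error.

import numpy as np

# Toy matrix, just to show the shapes returned by np.linalg.svd
A = np.arange(12, dtype=float).reshape(3, 4)
u, s, vt = np.linalg.svd(A, full_matrices=False)   # u: (3, 3), s: (3,), vt: (3, 4)

# Reconstruct A from the factors
A_rec = u @ np.diag(s) @ vt
print(np.allclose(A, A_rec))   # True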
Data preparation
Saving MNIST as CSV
Common datasets, including mnist_784, can be fetched with fetch_openml.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml

# Download MNIST: 70000 images of 28x28 pixels, flattened to 784 features
X, y = fetch_openml(name="mnist_784", version=1, return_X_y=True, as_frame=False)

# Put the label in the first column and save everything to CSV
full_data = np.c_[y, X]
full_df = pd.DataFrame(full_data)
full_df.to_csv("mnist.csv", index=False)
Obtaining the singular values
Example: get the singular values of a "0" image.
SVD can be computed with NumPy's linalg.svd.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

full_df = pd.read_csv("mnist.csv", low_memory=False)
full_data = full_df.values

plt.figure()
# Find the first "0" among the first 100 rows and plot its singular values
for n in range(100):
    if full_data[n][0] == 0:
        print(n)
        data = full_data[n][1:].reshape(28, 28)
        u, s, v = np.linalg.svd(data)   # s holds the singular values, largest first
        plt.plot(s)
        break
plt.show()
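A common follow-up, not in the original code, is to check how much of the total energy the leading singular values carry; a short sketch reusing the s computed above:

# Cumulative share of the squared singular values (assumes `s` from the loop above)
energy = np.cumsum(s ** 2) / np.sum(s ** 2)
plt.figure()
plt.plot(energy)
plt.xlabel("number of singular values kept")
plt.ylabel("cumulative energy ratio")
plt.show()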

Data compression
We can compress or blur the data by keeping only some of the singular values.
Single image compression
Example: compress the "0" image by setting part of its singular values to 0.
Matrix multiplication can be done with numpy.matmul.
def image_svd(n, data):
    # Rank-n approximation: keep only the n largest singular values
    u, s, v = np.linalg.svd(data)
    svd = np.zeros((u.shape[0], v.shape[1]))
    for i in range(n):
        svd[i, i] = s[i]
    img = np.matmul(u, svd)    # U * Sigma_n
    img = np.matmul(img, v)    # (U * Sigma_n) * V^T
    return img
plt.figure()
plt.subplot(1, 2, 1)
original_img = full_data[1][1:].reshape(28, 28)
plt.imshow(original_img, cmap="gray")        # original image
compress_img = image_svd(10, original_img)
plt.subplot(1, 2, 2)
plt.imshow(compress_img, cmap="gray")        # rank-10 approximation
plt.show()
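To see how the number of retained singular values affects quality, one can also compare the relative reconstruction error for a few choices of n; a sketch reusing image_svd and original_img from above (the exact numbers depend on the image):

# Relative Frobenius-norm error of the rank-n approximation
for n in (5, 10, 20):
    approx = image_svd(n, original_img)
    err = np.linalg.norm(original_img - approx) / np.linalg.norm(original_img)
    print(n, err)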

Full data set compression
Example: compress the whole dataset and save it back to CSV.
X_data = full_data[:, 1:785]            # pixel columns only
app_X = np.zeros(X_data.shape)
for i in range(X_data.shape[0]):        # iterate over the images (rows)
    original = X_data[i].reshape(28, 28)
    svd_img = image_svd(10, original)
    app_X[i] = svd_img.reshape(1, 784)
app_df = pd.DataFrame(app_X)
app_df.to_csv("app_mnist.csv", index=False)
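Note that writing the reconstructed images back to a CSV does not shrink the file by itself; the saving comes from the fact that a rank-n image can be represented by its truncated factors. A rough count for one 28x28 image (my own illustration, not from the original article):

# Storing U[:, :n], s[:n] and V[:n, :] costs 28*n + n + n*28 numbers,
# versus 28*28 = 784 numbers for the raw image
n = 10
print(28 * n + n + n * 28, "vs", 28 * 28)   # 570 vs 784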
Some observations
Cluster dispersion
First, we cluster the compressed dataset with the same method as in the previous article and plot the cluster centers:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt

# K-means with 10 clusters on the compressed dataset; project the centers to 2D with PCA
app_df = pd.read_csv("app_mnist.csv", low_memory=False)
kmeans_app = KMeans(n_clusters=10)
kmeans_app.fit(app_df.values)
app_centers = kmeans_app.cluster_centers_
centers_2d = PCA(2).fit_transform(app_centers)
plt.scatter(centers_2d[:, 0], centers_2d[:, 1])
plt.show()

Then we do the same on the original dataset:
org_df = pd.read_csv("mnist.csv", low_memory=False)
kmeans_org = KMeans(n_clusters=10)
kmeans_org.fit(org_df.values)
org_centers = kmeans_org.cluster_centers_
centers_2d = PCA(2).fit_transform(org_centers)
plt.scatter(centers_2d[:, 0], centers_2d[:, 1])
plt.show()
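To put a rough number on the dispersion, one could compare the mean pairwise distance between the 10 cluster centers of each clustering; a sketch using scipy (not part of the original article):

from scipy.spatial.distance import pdist

# Mean pairwise Euclidean distance between the cluster centers
print("original:", pdist(org_centers).mean())
print("compressed:", pdist(app_centers).mean())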

Clustering accuracy
Compute the disagreement distance of both clusterings (on the original and on the compressed dataset) relative to the ground-truth labels:
import numpy as np

def disagreement_dist(P_labels, C_labels):
    # Count the pairs (i, j) on which the two labelings disagree:
    # one puts i and j in the same cluster while the other does not.
    answer = 0
    for i in range(len(P_labels) - 1):
        for j in range(i + 1, len(P_labels)):
            if (P_labels[i] == P_labels[j]) != (C_labels[i] == C_labels[j]):
                answer += 1
    return answer

org_plabels = kmeans_org.labels_
app_plabels = kmeans_app.labels_

# Ground-truth labels: first column of the original CSV
clabels = pd.read_csv("mnist.csv", low_memory=False).values[:, 0].astype(np.int8)

print("Difference on original dataset:")
print(disagreement_dist(org_plabels, clabels))
print("Difference on approximated dataset:")
print(disagreement_dist(app_plabels, clabels))
Difference on original dataset:
288675700
Difference on approximated dataset:
2161020505
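As a small sanity check of the metric, a toy example of my own: two labelings of four points that disagree on 3 of the 6 pairs.

P = [0, 0, 1, 1]
C = [0, 0, 0, 1]
print(disagreement_dist(P, C))   # 3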