当前位置:网站首页>Data science [viii]: SVD (I)
Data science [viii]: SVD (I)
2022-07-02 06:20:00 【swy_ swy_ swy】
Data Science 【 8、 ... and 】:SVD( One )
The purpose of this paper is to give SVD How to use . Specific principle or SVD Please refer to other resources for the code implementation of itself .
SVD It is mainly used in data feature extraction , Data compression, etc .
Data preparation
take mnist Deposit in csv
Use fetch_openml
Common data sets can be obtained , Include mnist_784.
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
X, y = fetch_openml(name="mnist_784", version=1, return_X_y=True, as_frame=False)
import pandas as pd
import numpy as np
full_data = np.c_[y, X]
full_df = pd.DataFrame(full_data)
full_df.to_csv("mnist.csv", index=False)
Get eigenvalues
Get something “0” The eigenvalues of the
SVD You can call numpy Of linalg.svd
.
import matplotlib.pyplot as plt
full_df = pd.read_csv("mnist.csv", low_memory = False)
full_data = full_df.values
plt.figure()
for n in range(100):
if full_data[n][0] == 0:
print(n)
data = full_data[n][1:].reshape(28, 28)
u, s, v = np.linalg.svd(data)
plt.plot(s)
break
plt.show()
data compression
We can compress or blur the data by retaining some eigenvalues .
Single image compression
Example : By setting some characteristic values to 0, take “0” Image compression for .
Matrix multiplication can be done by numpy.matmul
Realization .
def image_svd(n, data):
u, s, v = np.linalg.svd(data)
svd = np.zeros((u.shape[0], v.shape[1]))
for i in range(n):
svd[i, i] = s[i]
img = np.matmul(u, svd)
img = np.matmul(img, v)
return img
plt.figure()
plt.subplot(1, 2, 1)
original_img = full_data[1][1:].reshape(28, 28)
plt.imshow(original_img, cmap="gray")
compress_img = image_svd(10, original_img)
plt.subplot(1, 2, 2)
plt.imshow(compress_img, cmap="gray")
plt.show()
Full data set compression
Example : The whole data set is compressed and stored in csv
X_data = full_data[:, range(1, 785)]
app_X = np.zeros(X_data.shape)
for i in range(X_data.shape[1]):
original = X_data[i].reshape(28, 28)
svd_img = image_svd(10, original)
app_X[i] = svd_img.reshape(1,784)
app_df = pd.DataFrame(app_X)
app_df.to_csv("app_mnist.csv", index=False)
Some phenomena
Clustering discreteness
First , We adopt and Last one Same method , Cluster the data set and draw the cluster center :
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt
app_df = pd.read_csv("app_mnist.csv", low_memory=False)
kmeans_app = KMeans(n_clusters=10)
kmeans_app.fit(app_df.values)
app_centers = kmeans_app.cluster_centers_
centers_2d = PCA(2).fit_transform(app_centers)
plt.scatter(centers_2d[:, 0], centers_2d[:, 1])
plt.show()
Let's do it again on the compressed data set :
org_df = pd.read_csv("mnist.csv", low_memory=False)
kmeans_org = KMeans(n_clusters=10)
kmeans_org.fit(org_df.values)
org_centers = kmeans_org.cluster_centers_
centers_2d = PCA(2).fit_transform(org_centers)
plt.scatter(centers_2d[:, 0], centers_2d[:, 1])
plt.show()
Clustering accuracy
Calculate the original data set and the compressed data set relative to ground truth Of disagreement distance:
def disagreement_dist(P_labels, C_labels):
answer = 0
for i in range(len(P_labels)-1):
for j in range(i+1, len(P_labels)):
if (P_labels[i] == P_labels[j]) != (C_labels[i]==C_labels[j]):
answer += 1
return answer
import numpy as np
org_plabels = kmeans_org.labels_
app_plabels = kmeans_app.labels_
clabels = pd.read_csv("mnist.csv", low_memory=False).values[:, [0]]
clabels.reshape(1, len(clabels))
clabels.astype(np.int8)
print("Difference on original dataset:")
print(disagreement_dist(org_plabels, clabels))
print("Difference on approximated dataset:")
print(disagreement_dist(app_plabels, clabels))
Difference on original dataset:
288675700
Difference on approximated dataset:
2161020505
边栏推荐
- BGP 路由優選規則和通告原則
- 程序员的自我修养—找工作反思篇
- Current situation analysis of Devops and noops
- Classic literature reading -- deformable Detr
- Hydration failed because the initial UI does not match what was rendered on the server.问题原因之一
- Common means of modeling: combination
- Community theory | kotlin flow's principle and design philosophy
- The official zero foundation introduction jetpack compose Chinese course is coming!
- 栈(线性结构)
- I/o multiplexing & event driven yyds dry inventory
猜你喜欢
Contest3145 - the 37th game of 2021 freshman individual training match_ H: Eat fish
Zabbix Server trapper 命令注入漏洞 (CVE-2017-2824)
Jetpack Compose 与 Material You 常见问题解答
Google play academy team PK competition, official start!
深入学习JVM底层(三):垃圾回收器与内存分配策略
数据科学【九】:SVD(二)
Invalid operation: Load into table ‘sources_ orderdata‘ failed. Check ‘stl_ load_ errors‘ system table
Web components series (VIII) -- custom component style settings
【张三学C语言之】—深入理解数据存储
Summary of WLAN related knowledge points
随机推荐
Eco express micro engine system has supported one click deployment to cloud hosting
【每日一题】—华为机试01
500. Keyboard line
Compte à rebours de 3 jours pour l'inscription à l'accélérateur de démarrage Google Sea, Guide de démarrage collecté à l'avance!
MySQL的10大经典错误
Summary of WLAN related knowledge points
AttributeError: ‘str‘ object has no attribute ‘decode‘
官方零基础入门 Jetpack Compose 的中文课程来啦!
CUDA中的函数执行空间说明符
Flutter hybrid development: develop a simple quick start framework | developers say · dtalk
让每一位开发者皆可使用机器学习技术
浅谈三点建议为所有已经毕业和终将毕业的同学
Shenji Bailian 3.53-kruskal
Reading classic literature -- Suma++
亚马逊aws数据湖工作之坑1
Problems encountered in uni app development (continuous update)
LeetCode 83. Delete duplicate elements in the sorting linked list
Jetpack Compose 与 Material You 常见问题解答
Top 10 classic MySQL errors
Decryption skills of encrypted compressed files