Data science [viii]: SVD (I)
2022-07-02 06:20:00 【swy_ swy_ swy】
This post shows how to use SVD (singular value decomposition). For the underlying theory, or for an implementation of SVD itself, please refer to other resources.
SVD is mainly used for feature extraction, data compression, and similar tasks.
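As a minimal sketch of what the decomposition provides: `np.linalg.svd` returns the factor `U`, the singular values `s` in descending order, and the factor `Vt`, and multiplying them back together reproduces the input. The small matrix `A` below is a made-up example, not MNIST data.

```python
import numpy as np

# Hypothetical 4x3 matrix used only for illustration.
A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0],
              [2.0, 0.0, 1.0],
              [0.0, 2.0, 4.0]])

# full_matrices=False gives the "thin" SVD: U is 4x3, s has 3 entries, Vt is 3x3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rebuild A from its factors: A = U @ diag(s) @ Vt.
A_rebuilt = U @ np.diag(s) @ Vt

print(s)                          # singular values, largest first
print(np.allclose(A, A_rebuilt))  # checks the reconstruction up to rounding
```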
Data preparation
Saving MNIST to CSV
fetch_openml can download common data sets, including mnist_784.
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml

# Download MNIST: X has shape (70000, 784), y holds the digit labels as strings.
X, y = fetch_openml(name="mnist_784", version=1, return_X_y=True, as_frame=False)

# Put the label in column 0 and the 784 pixel values in the remaining columns.
full_data = np.c_[y.astype(int), X]
full_df = pd.DataFrame(full_data)
full_df.to_csv("mnist.csv", index=False)
Getting singular values
Singular values of a "0" digit
SVD is available in NumPy as linalg.svd.
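import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

full_df = pd.read_csv("mnist.csv", low_memory=False)
full_data = full_df.values
plt.figure()
# Find the first image labeled 0 among the first 100 rows and plot its singular values.
for n in range(100):
    if full_data[n][0] == 0:
        print(n)
        data = full_data[n][1:].reshape(28, 28)
        # s holds the singular values in descending order.
        u, s, vt = np.linalg.svd(data)
        plt.plot(s)
        break
plt.show()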
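One way to read such a plot is to ask how many singular values are needed to keep a given share of the image's "energy", taken here as the sum of squared singular values. The sketch below assumes that criterion; the helper name `rank_for_energy`, the 0.9 threshold, and the sample values are illustrative choices, not part of the original post.

```python
import numpy as np

def rank_for_energy(s, threshold=0.9):
    """Smallest k whose first k singular values hold `threshold` of the squared energy."""
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    # searchsorted finds the first position where energy >= threshold.
    k = int(np.searchsorted(energy, threshold)) + 1
    # Guard against rounding pushing k past the number of singular values.
    return min(k, len(s))

# Hypothetical singular values, sorted in descending order.
s_demo = np.array([10.0, 5.0, 2.0, 1.0])
print(rank_for_energy(s_demo, 0.9))
```

For a real image, pass the `s` returned by `np.linalg.svd` instead of `s_demo`.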

Data compression
We can compress or blur the data by keeping only some of the singular values.
Single image compression
Example: compress an image of a "0" by setting some of its singular values to 0.
Matrix multiplication can be done with numpy.matmul.
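def image_svd(n, data):
    # Keep the n largest singular values and zero out the rest.
    u, s, vt = np.linalg.svd(data)
    sigma = np.zeros((u.shape[0], vt.shape[1]))
    for i in range(n):
        sigma[i, i] = s[i]
    # Rebuild the image as u @ sigma @ vt.
    img = np.matmul(u, sigma)
    img = np.matmul(img, vt)
    return img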
plt.figure()
plt.subplot(1, 2, 1)
original_img = full_data[1][1:].reshape(28, 28)
plt.imshow(original_img, cmap="gray")
compress_img = image_svd(10, original_img)
plt.subplot(1, 2, 2)
plt.imshow(compress_img, cmap="gray")
plt.show()
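The same rank-n approximation can also be written by slicing the factors instead of zero-padding the singular value matrix. The sketch below uses a random matrix as a stand-in for an image (an assumption made to keep it self-contained), and the helper name `truncate_svd` is hypothetical. It also checks a known property of the truncated SVD: the Frobenius norm of the error equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

def truncate_svd(data, k):
    """Rank-k approximation of `data` built from its k largest singular values."""
    u, s, vt = np.linalg.svd(data, full_matrices=False)
    # Scaling the columns of u by s is the same as u @ diag(s).
    return (u[:, :k] * s[:k]) @ vt[:k, :]

rng = np.random.default_rng(0)
image = rng.random((28, 28))   # stand-in for a 28x28 MNIST digit
k = 10
approx = truncate_svd(image, k)

s = np.linalg.svd(image, compute_uv=False)
error = np.linalg.norm(image - approx)      # Frobenius norm for 2-D input
expected = np.sqrt(np.sum(s[k:] ** 2))
print(np.isclose(error, expected))

# Values to store: k columns of u, k singular values, k rows of vt.
print(k * (28 + 28 + 1), "numbers instead of", 28 * 28)
```

For a 28x28 image and k = 10 the saving is modest (570 values instead of 784); the effect is larger for bigger images or smaller k.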

Full data set compression
Example: compress the whole data set and save it to CSV.
X_data = full_data[:, 1:785]
app_X = np.zeros(X_data.shape)
# Loop over the rows: each row of X_data is one flattened 28x28 image.
for i in range(X_data.shape[0]):
    original = X_data[i].reshape(28, 28)
    svd_img = image_svd(10, original)
    app_X[i] = svd_img.reshape(784)
app_df = pd.DataFrame(app_X)
app_df.to_csv("app_mnist.csv", index=False)
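A per-image Python loop can be slow for 70,000 images. `np.linalg.svd` also accepts a stack of matrices and decomposes each one, so the whole data set can be handled in a single call. The sketch below is one possible way to do this; the helper name `compress_batch` and the random demo data are assumptions for illustration only.

```python
import numpy as np

def compress_batch(X, k, side=28):
    """Rank-k SVD approximation of every row of X, each row a flattened side x side image."""
    images = X.reshape(-1, side, side)
    # For stacked input, u is (N, side, side), s is (N, side), vt is (N, side, side).
    u, s, vt = np.linalg.svd(images, full_matrices=False)
    # Keep the k leading components of each image and multiply the factors back.
    approx = (u[:, :, :k] * s[:, None, :k]) @ vt[:, :k, :]
    return approx.reshape(X.shape[0], side * side)

rng = np.random.default_rng(0)
X_demo = rng.random((5, 784))      # five fake "images" in place of MNIST rows
app_demo = compress_batch(X_demo, 10)
print(app_demo.shape)
```

With the real data, `compress_batch(X_data.astype(float), 10)` would play the role of the loop that fills `app_X`.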
Some observations
Cluster dispersion
First, using the same method as in the previous post, we cluster the compressed data set and plot the cluster centers:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt
app_df = pd.read_csv("app_mnist.csv", low_memory=False)
kmeans_app = KMeans(n_clusters=10)
kmeans_app.fit(app_df.values)
app_centers = kmeans_app.cluster_centers_
centers_2d = PCA(2).fit_transform(app_centers)
plt.scatter(centers_2d[:, 0], centers_2d[:, 1])
plt.show()

Now we do the same on the original data set:
org_df = pd.read_csv("mnist.csv", low_memory=False)
kmeans_org = KMeans(n_clusters=10)
# Column 0 holds the label, so cluster on the pixel columns only.
kmeans_org.fit(org_df.values[:, 1:])
org_centers = kmeans_org.cluster_centers_
centers_2d = PCA(2).fit_transform(org_centers)
plt.scatter(centers_2d[:, 0], centers_2d[:, 1])
plt.show()

Clustering accuracy
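Compute the disagreement distance to the ground truth for the clusterings of the original and the compressed data sets. The disagreement distance counts the pairs of points that one labeling puts in the same group and the other puts in different groups: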
def disagreement_dist(P_labels, C_labels):
    # Count the pairs (i, j) on which the two labelings disagree.
    answer = 0
    for i in range(len(P_labels) - 1):
        for j in range(i + 1, len(P_labels)):
            if (P_labels[i] == P_labels[j]) != (C_labels[i] == C_labels[j]):
                answer += 1
    return answer

import numpy as np

org_plabels = kmeans_org.labels_
app_plabels = kmeans_app.labels_
# Ground-truth labels from column 0, as a flat integer array.
clabels = pd.read_csv("mnist.csv", low_memory=False).values[:, 0].astype(np.int8)
print("Difference on original dataset:")
print(disagreement_dist(org_plabels, clabels))
print("Difference on approximated dataset:")
print(disagreement_dist(app_plabels, clabels))
Output:
Difference on original dataset:
288675700
Difference on approximated dataset:
2161020505
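The double loop in disagreement_dist visits every pair of points, which is roughly 2.45 billion pairs for 70,000 samples. The same count can be obtained from a contingency table of the two labelings: pairs grouped together by the first labeling, plus pairs grouped together by the second, minus twice the pairs grouped together by both. The sketch below is an alternative formulation, not the code from the original post, and the name `disagreement_dist_fast` is hypothetical.

```python
import numpy as np

def disagreement_dist_fast(p_labels, c_labels):
    """Count pairs (i, j), i < j, grouped together by one labeling but not the other."""
    p_labels = np.asarray(p_labels).ravel()
    c_labels = np.asarray(c_labels).ravel()
    # Map arbitrary labels to 0..k-1 so they can index a contingency table.
    _, p_idx = np.unique(p_labels, return_inverse=True)
    _, c_idx = np.unique(c_labels, return_inverse=True)
    table = np.zeros((p_idx.max() + 1, c_idx.max() + 1), dtype=np.int64)
    np.add.at(table, (p_idx, c_idx), 1)

    def pairs(counts):
        # Number of unordered pairs inside each group, summed over groups.
        return int(np.sum(counts * (counts - 1) // 2))

    same_p = pairs(table.sum(axis=1))   # pairs together in the first labeling
    same_c = pairs(table.sum(axis=0))   # pairs together in the second labeling
    same_both = pairs(table)            # pairs together in both labelings
    return same_p + same_c - 2 * same_both

print(disagreement_dist_fast([0, 0, 1, 1], [0, 1, 1, 1]))  # → 3
```

In the small example, the pairs (0, 1), (1, 2) and (1, 3) are the three on which the two labelings disagree.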