Data Science [8]: SVD (Part 1)
2022-07-02 06:10:00 【swy_swy_swy】
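This post shows how to use SVD (singular value decomposition). For the underlying theory, or for an implementation of SVD itself, please refer to other material.
SVD is mainly used for tasks such as feature extraction and data compression.
Data preparation
Saving MNIST to a CSV file
fetch_openml can download commonly used datasets, including mnist_784.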
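import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml

# X: pixel values, shape (70000, 784); y: digit labels, returned as strings
X, y = fetch_openml(name="mnist_784", version=1, return_X_y=True, as_frame=False)

# Put the label in column 0 and the 784 pixels after it.
# Casting y to int keeps the combined array numeric instead of dtype=object.
full_data = np.c_[y.astype(int), X]
full_df = pd.DataFrame(full_data)
full_df.to_csv("mnist.csv", index=False)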
Obtaining the singular values
Singular values of one image of a "0"
SVD is available as numpy's linalg.svd.
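import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

full_df = pd.read_csv("mnist.csv", low_memory=False)
full_data = full_df.values

plt.figure()
# Find the first image among the first 100 rows whose label is 0
for n in range(100):
    if full_data[n][0] == 0:
        print(n)
        data = full_data[n][1:].reshape(28, 28)
        # s holds the singular values in descending order; v is V transposed
        u, s, v = np.linalg.svd(data)
        plt.plot(s)
        break
plt.show()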
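As a quick sanity check of what `np.linalg.svd` returns, here is a minimal sketch. It uses a random 28×28 matrix as a stand-in for an MNIST image (an assumption made so that the snippet runs without the dataset) and checks the shapes, the ordering of the singular values, and that the three factors multiply back to the input.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((28, 28))  # hypothetical stand-in for one 28x28 image

# svd returns U, the singular values as a 1-D array, and V transposed (vh)
u, s, vh = np.linalg.svd(a)

# Singular values are non-negative and sorted in descending order
is_sorted = bool(np.all(s[:-1] >= s[1:]))

# (u * s) scales column j of u by s[j], i.e. it equals u @ np.diag(s)
reconstructed = (u * s) @ vh
max_error = float(np.abs(a - reconstructed).max())

print(u.shape, s.shape, vh.shape, is_sorted)
```

Note that the third return value is already transposed, which is why the reconstruction multiplies by it directly.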
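Data compression
We can compress or blur the data by keeping only some of the singular values.
Compressing a single image
Example: compress the image of a "0" by setting some of its singular values to 0.
Matrix multiplication is available as numpy.matmul.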
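def image_svd(n, data):
    # Keep only the n largest singular values and rebuild the image
    u, s, v = np.linalg.svd(data)  # v is V transposed
    svd = np.zeros((u.shape[0], v.shape[1]))
    for i in range(n):
        svd[i, i] = s[i]
    img = np.matmul(u, svd)
    img = np.matmul(img, v)
    return img

plt.figure()
# Left: the original image (row 1 of the dataset is a "0")
plt.subplot(1, 2, 1)
original_img = full_data[1][1:].reshape(28, 28)
plt.imshow(original_img, cmap="gray")
# Right: the same image rebuilt from its 10 largest singular values
compress_img = image_svd(10, original_img)
plt.subplot(1, 2, 2)
plt.imshow(compress_img, cmap="gray")
plt.show()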
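Zeroing all but the first n singular values is the same as keeping only the first n columns of U and the first n rows of V transposed. The sketch below shows that slicing form on a random stand-in image (an assumption; `truncated_svd` is a hypothetical helper, not part of NumPy). It also checks a known property of the truncation: the Frobenius-norm error equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

def truncated_svd(data, k):
    """Rank-k approximation of a 2-D array (hypothetical helper)."""
    u, s, vh = np.linalg.svd(data, full_matrices=False)
    # Keep the first k columns of u, k singular values, and k rows of vh
    return (u[:, :k] * s[:k]) @ vh[:k, :]

rng = np.random.default_rng(0)
img = rng.random((28, 28))  # stand-in for an image
k = 10
approx = truncated_svd(img, k)

# Error of the approximation versus the discarded singular values
s = np.linalg.svd(img, compute_uv=False)
err = np.linalg.norm(img - approx)           # Frobenius norm
expected_err = np.sqrt(np.sum(s[k:] ** 2))

# Numbers that must be stored: full image vs. the three truncated factors
stored_full = img.size
stored_rank_k = k * (img.shape[0] + img.shape[1] + 1)
print(err, expected_err, stored_full, stored_rank_k)
```

Storage only shrinks if the truncated factors are saved instead of the rebuilt image; a rebuilt 28×28 image still has 784 pixel values.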
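Compressing the whole dataset
Example: compress every image in the dataset with the method above and save the result to a CSV file.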
X_data = full_data[:, 1:785]  # pixel columns only; column 0 is the label
app_X = np.zeros(X_data.shape)
# Loop over the images (rows); X_data.shape[1] would be the 784 pixel columns
for i in range(X_data.shape[0]):
    original = X_data[i].reshape(28, 28)
    svd_img = image_svd(10, original)
    app_X[i] = svd_img.reshape(784)
app_df = pd.DataFrame(app_X)
app_df.to_csv("app_mnist.csv", index=False)
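The per-image loop can also be written as one batched call, since `np.linalg.svd` accepts a stack of matrices. The following is a sketch on a small random array standing in for the pixel matrix (an assumption); `compress_batch` is a hypothetical helper name.

```python
import numpy as np

def compress_batch(X, k, side=28):
    """Rank-k approximation of every row of X, each row a flattened side x side image."""
    imgs = X.reshape(-1, side, side)
    # Batched SVD: u is (n, side, side), s is (n, side), vh is (n, side, side)
    u, s, vh = np.linalg.svd(imgs, full_matrices=False)
    # Scale the first k columns of each u by its singular values, then multiply
    approx = (u[:, :, :k] * s[:, None, :k]) @ vh[:, :k, :]
    return approx.reshape(X.shape[0], side * side)

rng = np.random.default_rng(0)
X_demo = rng.random((5, 784))  # stand-in for a few rows of pixel data
app_demo = compress_batch(X_demo, 10)
print(app_demo.shape)
```

For the full 70,000 images the intermediate `u` and `vh` arrays are each as large as the dataset, so processing the rows in chunks may be preferable when memory is tight.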
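Some observations
Spread of the cluster centers
First, using the same method as in the previous post, we cluster the compressed dataset into ten clusters and plot the cluster centers: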
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt
app_df = pd.read_csv("app_mnist.csv", low_memory=False)
kmeans_app = KMeans(n_clusters=10)
kmeans_app.fit(app_df.values)
app_centers = kmeans_app.cluster_centers_
centers_2d = PCA(2).fit_transform(app_centers)
plt.scatter(centers_2d[:, 0], centers_2d[:, 1])
plt.show()
We then repeat the same steps on the original dataset:
org_df = pd.read_csv("mnist.csv", low_memory=False)
kmeans_org = KMeans(n_clusters=10)
kmeans_org.fit(org_df.values[:, 1:])  # skip column 0 (the labels); cluster on pixels only
org_centers = kmeans_org.cluster_centers_
centers_2d = PCA(2).fit_transform(org_centers)
plt.scatter(centers_2d[:, 0], centers_2d[:, 1])
plt.show()
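Clustering accuracy
We compute the disagreement distance to the ground truth for the clustering of the original dataset and for the clustering of the compressed dataset: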
def disagreement_dist(P_labels, C_labels):
    # Count the pairs (i, j) that one labeling groups together and the other separates
    answer = 0
    for i in range(len(P_labels) - 1):
        for j in range(i + 1, len(P_labels)):
            if (P_labels[i] == P_labels[j]) != (C_labels[i] == C_labels[j]):
                answer += 1
    return answer

import numpy as np
org_plabels = kmeans_org.labels_
app_plabels = kmeans_app.labels_
# Column 0 of mnist.csv holds the ground-truth digits; read it as a flat integer array.
# (reshape and astype return new arrays, so the result has to be assigned.)
clabels = pd.read_csv("mnist.csv", low_memory=False).values[:, 0].astype(np.int8)
print("Difference on original dataset:")
print(disagreement_dist(org_plabels, clabels))
print("Difference on approximated dataset:")
print(disagreement_dist(app_plabels, clabels))
Output of this run (KMeans is randomly initialized, so the exact values vary between runs):
Difference on original dataset:
288675700
Difference on approximated dataset:
2161020505
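The double loop in disagreement_dist visits every pair of samples, which is about 2.4 billion pairs for 70,000 images and is slow in pure Python. One possible alternative, sketched below, counts the same pairs from a contingency table of the two labelings: pairs grouped together by the first labeling, plus pairs grouped together by the second, minus twice the pairs grouped together by both. `disagreement_dist_fast` is a hypothetical name, and the demo labels are random stand-ins rather than real clustering output.

```python
import numpy as np

def disagreement_dist_fast(p_labels, c_labels):
    """Number of pairs on which two labelings disagree, via a contingency table."""
    # Map arbitrary label values to 0..m-1
    p = np.unique(np.asarray(p_labels).ravel(), return_inverse=True)[1].ravel()
    c = np.unique(np.asarray(c_labels).ravel(), return_inverse=True)[1].ravel()

    # table[a, b] = number of samples with p-label a and c-label b
    table = np.zeros((p.max() + 1, c.max() + 1), dtype=np.int64)
    np.add.at(table, (p, c), 1)

    def pairs(x):
        # Number of unordered pairs among x items
        return x * (x - 1) // 2

    same_p = pairs(table.sum(axis=1)).sum()   # pairs grouped together by p
    same_c = pairs(table.sum(axis=0)).sum()   # pairs grouped together by c
    same_both = pairs(table).sum()            # pairs grouped together by both
    return int(same_p + same_c - 2 * same_both)

rng = np.random.default_rng(0)
demo_p = rng.integers(0, 10, size=200)  # stand-in cluster labels
demo_c = rng.integers(0, 10, size=200)  # stand-in ground-truth labels
print(disagreement_dist_fast(demo_p, demo_c))
```

The cost is linear in the number of samples plus the size of the table, instead of quadratic in the number of samples.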