当前位置:网站首页>Data feature analysis skills - correlation test
Data feature analysis skills - correlation test
2022-06-25 15:10:00 【A window full of stars and milky way】
Data feature analysis skills —— Correlation test
Correlation analysis refers to the analysis of two or more variable elements with correlation , thus Measure the correlation between the two variables
There are four common methods :
- Drawing judgment
- pearson( Pearson ) The correlation coefficient
- sperman( Spearman ) The correlation coefficient
- Cosine similarity ( Cosine correlation coefficient )
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
% matplotlib inlineDraw a graph to judge
Generally, for two variables with strong correlation , Drawing a picture can qualitatively judge whether it is relevant
data1 = pd.Series(np.random.rand(50)*100).sort_values()
data2 = pd.Series(np.random.rand(50)*50).sort_values()
data3 = pd.Series(np.random.rand(50)*500).sort_values(ascending = False)
# Create three data :data1 by 0-100 Random numbers and arrange them from small to large ,data2 by 0-50 Random numbers and arrange them from small to large ,data3 by 0-500 Random numbers and arrange them from large to small ,
fig = plt.figure(figsize = (10,4))
ax1 = fig.add_subplot(1,2,1)
ax1.scatter(data1, data2)
plt.grid()
# Positive linear correlation
ax2 = fig.add_subplot(1,2,2)
ax2.scatter(data1, data3)
plt.grid()
# Negative linear correlation 
# (2) The relationship between multivariable is judged by scatter graph matrix
data = pd.DataFrame(np.random.randn(200,4)*100, columns = ['A','B','C','D'])
pd.plotting.scatter_matrix(data,figsize=(8,8),
c = 'k',
marker = '+',
diagonal='hist',
alpha = 0.8,
range_padding=0.1)
data.head()| A | B | C | D | |
|---|---|---|---|---|
| 0 | 83.463300 | 108.208281 | -16.441879 | -69.039664 |
| 1 | -114.341786 | -176.341932 | -64.282506 | 54.378911 |
| 2 | -108.781464 | 116.223511 | 11.996554 | 4.445215 |
| 3 | -124.358401 | -74.357458 | -46.089528 | -73.539092 |
| 4 | 87.330398 | 205.767923 | 59.964420 | 137.955811 |

pearson( Pearson ) The correlation coefficient
requirement The sample satisfies the normal distribution
- The Pearson correlation coefficient between two variables is defined as the Quotient of covariance and standard deviation , Its value is between -1 And 1 Between
The formula :
covariance :sxy=1n−1∑nk=1(xk−x¯)(yk−y¯) s x y = 1 n − 1 ∑ k = 1 n ( x k − x ¯ ) ( y k − y ¯ )
Standard deviation :sx=1n−1∑nk=1(xk−x¯)2−−−−−−−−−−−−−−−√ s x = 1 n − 1 ∑ k = 1 n ( x k − x ¯ ) 2
Pearson correlation coefficient :sxysxsy=∑nk=1(xk−x¯)(yk−y¯)∑nk=1(xk−x¯)2√∑nk=1(yk−y¯)2√ s x y s x s y = ∑ k = 1 n ( x k − x ¯ ) ( y k − y ¯ ) ∑ k = 1 n ( x k − x ¯ ) 2 ∑ k = 1 n ( y k − y ¯ ) 2
data1 = pd.Series(np.random.rand(100)*100).sort_values()
data2 = pd.Series(np.random.rand(100)*50).sort_values()
data = pd.DataFrame({
'value1':data1.values,
'value2':data2.values})
print(data.head())
print('------')
# Create sample data
u1,u2 = data['value1'].mean(),data['value2'].mean() # Calculate the mean
std1,std2 = data['value1'].std(),data['value2'].std() # Calculate the standard deviation
print('value1 Normality test :\n',stats.kstest(data['value1'], 'norm', (u1, std1)))
print('value2 Normality test :\n',stats.kstest(data['value2'], 'norm', (u2, std2)))
print('------')
# Normality test → pvalue >0.05
data['(x-u1)*(y-u2)'] = (data['value1'] - u1) * (data['value2'] - u2)
data['(x-u1)**2'] = (data['value1'] - u1)**2
data['(y-u2)**2'] = (data['value2'] - u2)**2
print(data.head())
print('------')
# Make Pearson Correlation coefficient evaluation table
r = data['(x-u1)*(y-u2)'].sum() / (np.sqrt(data['(x-u1)**2'].sum() * data['(y-u2)**2'].sum()))
print('Pearson The correlation coefficient is 0 :%.4f' % r)
# Find out r
# |r| > 0.8 → Highly linear correlation value1 value2
0 0.438432 0.486913
1 2.974424 0.663775
2 4.497743 1.417196
3 5.490366 2.047252
4 6.216346 3.455314
------
value1 Normality test :
KstestResult(statistic=0.07534983222255448, pvalue=0.6116837468934935)
value2 Normality test :
KstestResult(statistic=0.11048646902786918, pvalue=0.1614817955196972)
------
value1 value2 (x-u1)*(y-u2) (x-u1)**2 (y-u2)**2
0 0.438432 0.486913 1201.352006 2597.621877 555.603052
1 2.974424 0.663775 1133.009967 2345.549928 547.296636
2 4.497743 1.417196 1062.031735 2200.319086 512.612654
3 5.490366 2.047252 1010.628854 2108.181383 484.479509
4 6.216346 3.455314 931.020494 2042.041746 424.476709
------
Pearson The correlation coefficient is 0 :0.9937
# Pearson The correlation coefficient - Algorithm
data1 = pd.Series(np.random.rand(100)*100).sort_values()
data2 = pd.Series(np.random.rand(100)*50).sort_values()
data = pd.DataFrame({
'value1':data1.values,
'value2':data2.values})
print(data.head())
print('------')
# Create sample data
data.corr()
# pandas Correlation method :data.corr(method='pearson', min_periods=1) → The correlation coefficient matrix of the data field is given directly
# method Default pearson value1 value2 0 0.983096 0.368653 1 1.107613 0.509117 2 1.130588 0.755587 3 2.996367 0.909899 4 3.283088 1.233879 —— | value1 | value2 | |
|---|---|---|
| value1 | 1.000000 | 0.996077 |
| value2 | 0.996077 | 1.000000 |
Sperman Rank correlation coefficient
Pearson correlation coefficient is mainly used for continuous variables following normal distribution , For variables that do not obey a normal distribution , Categorical relevance can be used Sperman Rank correlation coefficient , Also known as Rank correlation coefficient
computing method :
- Rank the two variables from small to large according to their values ,Rx representative Xi Rank of ,Ry representative Yi Rank of
- If two variables have the same rank , Then the rank is (index1+index2)/ 2
- di = Rx -Ry
The formula :
ρs=1−6∑d2in(n2−1) ρ s = 1 − 6 ∑ d i 2 n ( n 2 − 1 )
data = pd.DataFrame({
' Intelligence quotient (IQ) ':[106,86,100,101,99,103,97,113,112,110],
' TV hours per week ':[7,0,27,50,28,29,20,12,6,17]})
print(data)
print('------')
# Create sample data
data.sort_values(' Intelligence quotient (IQ) ', inplace=True)
data['range1'] = np.arange(1,len(data)+1)
data.sort_values(' TV hours per week ', inplace=True)
data['range2'] = np.arange(1,len(data)+1)
print(data)
print('------')
# “ Intelligence quotient (IQ) ”、“ TV hours per week ” Reorder from small to large , And set the rank index
data['d'] = data['range1'] - data['range2']
data['d2'] = data['d']**2
print(data)
print('------')
# Find out di,di2
n = len(data)
rs = 1 - 6 * (data['d2'].sum()) / (n * (n**2 - 1))
print('Sperman The rank correlation coefficient is :%.4f' % rs)
# Find out rs Intelligence quotient (IQ) TV hours per week
0 106 7
1 86 0
2 100 27
3 101 50
4 99 28
5 103 29
6 97 20
7 113 12
8 112 6
9 110 17
------
Intelligence quotient (IQ) TV hours per week range1 range2
1 86 0 1 1
8 112 6 9 2
0 106 7 7 3
7 113 12 10 4
9 110 17 8 5
6 97 20 2 6
2 100 27 4 7
4 99 28 3 8
5 103 29 6 9
3 101 50 5 10
------
Intelligence quotient (IQ) TV hours per week range1 range2 d d2
1 86 0 1 1 0 0
8 112 6 9 2 7 49
0 106 7 7 3 4 16
7 113 12 10 4 6 36
9 110 17 8 5 3 9
6 97 20 2 6 -4 16
2 100 27 4 7 -3 9
4 99 28 3 8 -5 25
5 103 29 6 9 -3 9
3 101 50 5 10 -5 25
------
Sperman The rank correlation coefficient is :-0.1758
# spearman The correlation coefficient - Algorithm
data = pd.DataFrame({
' Intelligence quotient (IQ) ':[106,86,100,101,99,103,97,113,112,110],
' TV hours per week ':[7,0,27,50,28,29,20,12,6,17]})
print(data)
print('------')
# Create sample data
data.corr(method='spearman')
# pandas Correlation method :data.corr(method='pearson', min_periods=1) → The correlation coefficient matrix of the data field is given directly
# method Default pearson Intelligence quotient (IQ) TV hours per week 0 106 7 1 86 0 2 100 27 3 101 50 4 99 28 5 103 29 6 97 20 7 113 12 8 112 6 9 110 17 —— | Intelligence quotient (IQ) | TV hours per week | |
|---|---|---|
| Intelligence quotient (IQ) | 1.000000 | -0.175758 |
| TV hours per week | -0.175758 | 1.000000 |
边栏推荐
- 有哪个瞬间让你觉得这个世界出bug了?
- One question per day, a classic simulation question
- GDB debugging
- 5 connection modes of QT signal slot
- 搭建极简GB28181 网守和网关服务器,建立AI推理和3d服务场景,然后开源代码(一)
- Vs2019 scanf error
- 【Try to Hack】vulnhub DC1
- Yolov3 spp Darknet version to caffemodel and then to OM model
- Software packaging and deployment
- Go语言Zap库Logger的定制化和封装使用详解
猜你喜欢

Flexible layout (display:flex;) Attribute details

How to crop GIF dynamic graph? Take this picture online clipping tool

多张动图怎样合成一张gif?仅需三步快速生成gif动画图片

Power automatic test system nsat-8000, accurate, high-speed and reliable power test equipment

Position (5 ways)
![[C language] implementation of magic square array (the most complete)](/img/b2/2595263b77e0abac667972bbfe0c8a.jpg)
[C language] implementation of magic square array (the most complete)

GDB debugging

Learning notes on February 5, 2022 (C language)

【Try to Hack】vulnhub DC1

QQ love talk candy love talk content acquisition and storage
随机推荐
Learning C language today is the first time to learn C language. In college, C linguistics is not good, but I want to make progress, so I found a beep video on the Internet to learn C language
HMS Core机器学习服务实现同声传译,支持中英文互译和多种音色语音播报
One question per day,
Real variable instance
QQ情话糖果情话内容获取并保存
(translation) json-rpc 2.0 specification (Chinese version)
basic_ String mind map
多张动图怎样合成一张gif?仅需三步快速生成gif动画图片
About%*s and%* s
Single user mode
Generation method and usage of coredump
Usage of pure virtual functions
Bessie's weight problem [01 backpack]
Modal and modeless dialogs for QT
New title of PTA
网上办理股票开户安全吗?
14 -- 验证回文字符串 Ⅱ
Function of getinstance() method
[try to hack] vulhub shooting range construction
A deformation problem of Hanoi Tower