About Covariance and Correlation
2022-06-28 18:30:00 【梦想家DBA】
After completing this tutorial, you will know:
- How to calculate a covariance matrix to summarize the linear relationship between two or more variables.
- How to calculate the covariance to summarize the linear relationship between two variables.
- How to calculate the Pearson’s correlation coefficient to summarize the linear relationship between two variables.
1.1 Tutorial Overview
- What is Correlation?
- Test Dataset
- Covariance
- Pearson's Correlation
1.2 What is Correlation?
Variables within a dataset can be related for lots of reasons.
- One variable could cause or depend on the values of another variable
- One variable could be lightly associated with another variable.
- Two variables could depend on a third unknown variable.
A correlation could be positive, meaning both variables move in the same direction, or negative, meaning that when one variable’s value increases, the other variable’s value decreases. Correlation can also be neutral or zero, meaning that the variables are unrelated.
- Positive Correlation: Both variables change in the same direction.
- Neutral Correlation: No relationship in the change of the variables.
- Negative Correlation: Variables change in opposite directions.
The performance of some algorithms can deteriorate when two or more variables are tightly related, a condition called multicollinearity.
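The three cases listed above can be sketched with a few lines of NumPy. This is only an illustration on made-up data: the variable names, the noise scale of 0.5, and the use of corrcoef() to read off the sign are assumptions for this sketch, not part of the tutorial's test dataset.

```python
# Sketch: sign of the correlation for three hand-made relationships (illustrative data only)
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
noise = rng.normal(size=1000)

# hypothetical example variables, one per case in the list above
pairs = {
    'positive': x + 0.5 * noise,   # moves with x
    'neutral': noise,              # generated independently of x
    'negative': -x + 0.5 * noise,  # moves against x
}

results = {}
for name, y in pairs.items():
    # corrcoef() returns a 2x2 matrix; the off-diagonal entry is the coefficient
    results[name] = np.corrcoef(x, y)[0, 1]
    print('%s: %.3f' % (name, results[name]))
```

The positive and negative cases should print values well away from zero with opposite signs, while the neutral case should print a value close to zero.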
1.3 Test Dataset
Before we look at correlation methods, let’s define a dataset we can use to test the methods. We will generate 1,000 samples of two variables with a strong positive correlation. The first variable will be random numbers drawn from a Gaussian distribution with a mean of 100 and a standard deviation of 20. The second variable will be the values from the first variable plus Gaussian noise with a mean of 50 and a standard deviation of 10. We will use the randn() function to generate random Gaussian values with a mean of 0 and a standard deviation of 1, then multiply the results by our own standard deviation and add the mean to shift the values into the preferred range. The pseudorandom number generator is seeded to ensure that we get the same sample of numbers each time the code is run.
# generate related variables
from numpy import mean
from numpy import std
from numpy.random import randn
from numpy.random import seed
from matplotlib import pyplot
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# summarize
print('data1: mean=%.3f stdv=%.3f' % (mean(data1),std(data1)))
print('data2: mean=%.3f stdv=%.3f' % (mean(data2),std(data2)))
# plot
pyplot.scatter(data1, data2)
pyplot.show()
Running the example first prints the mean and standard deviation for each variable.
A scatter plot of the two variables is created. Because we contrived the dataset, we know there is a relationship between the two variables. This is clear when we review the generated scatter plot where we can see an increasing trend.
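As a rough sanity check on the printed summary, the values to expect for the second variable can be worked out from the parameters given in the text. The sketch below assumes the added noise is independent of the first variable, so that the means add and the variances add; the variable names are chosen here for illustration.

```python
# Sketch: expected mean and standard deviation of data2, from the stated parameters
from math import sqrt

mean1, std1 = 100.0, 20.0            # first variable (from the text)
noise_mean, noise_std = 50.0, 10.0   # added noise (from the text)

# for independent terms, means add and variances add
expected_mean2 = mean1 + noise_mean
expected_std2 = sqrt(std1 ** 2 + noise_std ** 2)

print('data2 expected: mean=%.3f stdv=%.3f' % (expected_mean2, expected_std2))
```

The values printed by the example for data2 should land close to these figures, with small differences due to sampling.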

1.4 Covariance
Variables can be related by a linear relationship. This is a relationship that is consistently additive across the two data samples. This relationship between two variables can be summarized by a statistic called the covariance. It is calculated as the average of the product of the values from each sample, where the values have been centered (had their mean subtracted). The calculation of the sample covariance is as follows:
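In standard notation (the symbols are chosen here and are an assumption, not taken from the original), for two samples x and y of n paired observations with sample means \bar{x} and \bar{y}, the sample covariance can be written as:

```latex
\operatorname{cov}(x, y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```

The divisor n - 1 rather than n is what makes this the sample covariance; it is also the default used by NumPy's cov() function.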

The use of the mean in the calculation suggests the need for each data sample to have a Gaussian or Gaussian-like distribution. The sign of the covariance can be interpreted as whether the two variables change in the same direction (positive) or change in different directions (negative). The magnitude of the covariance is not easily interpreted. A covariance of zero indicates that there is no linear relationship between the two variables; it does not by itself show that they are independent, since a nonlinear relationship can also produce a covariance of zero. The cov() NumPy function can be used to calculate a covariance matrix between two or more variables.
...
# calculate the covariance between two samples
covariance = cov(data1, data2)
The diagonal of the matrix contains the covariance between each variable and itself. The other values in the matrix represent the covariance between the two variables; in this case, the remaining two values are the same given that we are calculating the covariance for only two variables. We can calculate the covariance matrix for the two variables in our test problem. The complete example is listed below.
# calculate the covariance between two variables
from numpy.random import randn
from numpy.random import seed
from numpy import cov
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# calculate covariance matrix
covariance = cov(data1, data2)
print(covariance)
A problem with covariance as a statistical tool alone is that it is challenging to interpret. This leads us to the Pearson’s correlation coefficient next.
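To make the covariance matrix less opaque, its entries can be recomputed by hand from the definition. The following is a minimal sketch that reuses the same test data; the names manual_cov and matrix are introduced here for illustration, and the n - 1 divisor matches the default behavior of cov().

```python
# Sketch: recompute the covariance by hand and compare with numpy's cov()
import numpy as np
from numpy.random import randn, seed

seed(1)
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)

n = len(data1)
# center each sample, multiply element-wise, and average with the n - 1 divisor
manual_cov = np.sum((data1 - data1.mean()) * (data2 - data2.mean())) / (n - 1)

matrix = np.cov(data1, data2)
print('manual covariance: %.3f' % manual_cov)
print('cov() off-diagonal: %.3f' % matrix[0, 1])
# the diagonal holds each variable's own sample variance
print('cov() diagonal: %.3f, var(data1): %.3f' % (matrix[0, 0], data1.var(ddof=1)))
```

Because the noise is generated independently of data1, the covariance between the two variables should be close to the variance of data1, which is about 20 squared, or 400.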
1.5 Pearson's Correlation
The Pearson’s correlation coefficient (named for Karl Pearson) can be used to summarize the strength of the linear relationship between two data samples. The Pearson’s correlation coefficient is calculated as the covariance of the two variables divided by the product of the standard deviation of each data sample. It is the normalization of the covariance between the two variables to give an interpretable score.
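Written out in symbols (the notation is assumed here for illustration), with s_x and s_y denoting the sample standard deviations of the two data samples, the description above corresponds to:

```latex
r = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
```

Dividing by the two standard deviations removes the units of the original variables, which is why the result is bounded and comparable across datasets.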

The use of mean and standard deviation in the calculation suggests the need for the two data samples to have a Gaussian or Gaussian-like distribution. The result of the calculation, the correlation coefficient, can be interpreted to understand the relationship. The coefficient returns a value between -1 and 1 that represents the limits of correlation from a full negative correlation to a full positive correlation. A value of 0 means no correlation. The value must be interpreted, where often a value below -0.5 or above 0.5 indicates a notable correlation, and values closer to zero suggest a less notable correlation. See the table below to help with interpreting the correlation coefficient.
Pearson’s correlation can also be used as a statistical hypothesis test, where the null hypothesis (H0) is that there is no relationship between the samples. The p-value can be interpreted as follows:
- p-value ≤ alpha: significant result, reject null hypothesis, some relationship (H1).
- p-value > alpha: not significant result, fail to reject null hypothesis, no relationship (H0).
The pearsonr() SciPy function can be used to calculate the Pearson’s correlation coefficient between two data samples with the same length. We can calculate the correlation between the two variables in our test problem. The complete example is listed below.
# calculate the pearson's correlation between two variables
from numpy.random import randn
from numpy.random import seed
from scipy.stats import pearsonr
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# calculate Pearson's correlation
corr,p = pearsonr(data1, data2)
# display the correlation
print('Pearsons correlation: %.3f' % corr)
# interpret the significance
alpha = 0.05
if p > alpha:
    print('No correlation (fail to reject H0)')
else:
    print('Some correlation (reject H0)')
Running the example calculates and prints the Pearson’s correlation coefficient and interprets the p-value. We can see that the two variables are positively correlated and that the correlation is 0.888. This suggests a high level of correlation (as we expected).
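The link between the two statistics can be checked directly: dividing the covariance by the product of the two sample standard deviations should reproduce the correlation coefficient. The sketch below is one way to do this with NumPy alone; the names manual_corr and numpy_corr are introduced here, and corrcoef() is used as the reference in place of pearsonr().

```python
# Sketch: Pearson's correlation as normalized covariance, checked against corrcoef()
import numpy as np
from numpy.random import randn, seed

seed(1)
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)

# sample covariance (n - 1 divisor) from the off-diagonal of the covariance matrix
covariance = np.cov(data1, data2)[0, 1]
# use ddof=1 so the standard deviations use the same divisor as the covariance
manual_corr = covariance / (np.std(data1, ddof=1) * np.std(data2, ddof=1))
numpy_corr = np.corrcoef(data1, data2)[0, 1]

print('manual: %.3f, corrcoef(): %.3f' % (manual_corr, numpy_corr))
```

Note that std() defaults to the n divisor, so ddof=1 is needed for the ratio to match exactly. With the stated parameters, the theoretical correlation is 400 / (20 * sqrt(500)), roughly 0.894, so a sample value slightly below 0.9 is what one would expect.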
