Notes on selected training courses from Massey School
2022-07-07 00:27:00 【Evergreen AAS】
Multivariate statistical analysis
Cluster analysis
Characteristics:
- The number and structure of the categories are not known in advance;
- The analysis is based on the similarity or dissimilarity (distance) between objects;
- Objects that are close to each other are grouped into the same class.
Classification
- By the object being clustered:
  - Q-type clustering: cluster the samples
  - R-type clustering: cluster the variables
- By the clustering method:
  - Systematic (hierarchical) clustering
  - Dynamic clustering
Distance
Minkowski distance:
d(x,y) = \left[\sum\limits_{k=1}^{p}|x_{k} - y_{k}|^{m}\right]^{\frac{1}{m}}, where x, y are p-dimensional vectors
- m = 1: the absolute (Manhattan) distance
- m = 2: the Euclidean distance
- m = \infty: the Chebyshev distance, i.e. \mathop{\max}\limits_{1\le k \le p}|x_{k} - y_{k}|
Mahalanobis distance (commonly used in cluster analysis)
d(x,y) = \sqrt{(x - y)^{T} \Sigma^{-1} (x - y)}
where x, y are sample observations from a p-dimensional population Z and \Sigma is the covariance matrix of Z. In practice \Sigma is usually unknown and has to be estimated by the sample covariance matrix. The Mahalanobis distance is invariant under nonsingular linear transformations, so it is unaffected by the units of measurement.
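A minimal R sketch of this computation, assuming the sample covariance matrix is used as the estimate of \Sigma (note that base R's mahalanobis() returns squared distances):

```r
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)  # 10 observations of a 2-dimensional variable
S <- cov(x)                       # sample covariance as the estimate of Sigma
# squared Mahalanobis distance from each observation to the sample mean
d2 <- mahalanobis(x, center = colMeans(x), cov = S)
sqrt(d2)                          # the Mahalanobis distances themselves
```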
R statement:
dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
- method: how the distance is computed
  - "euclidean": Euclidean distance
  - "maximum": Chebyshev distance
  - "manhattan": absolute distance
  - "minkowski": Minkowski distance; p is the order of the Minkowski distance
- diag = TRUE: also output the distances on the diagonal
- upper = TRUE: output the upper triangle of the distance matrix (by default only the lower triangle is output)
Python statement:
```python
import rpy2.robjects as robjects

x = [1, 2, 6, 8, 11]
r = robjects.r
res = r.dist(x)
print(res)
#    1  2  3  4
# 2  1
# 3  5  4
# 4  7  6  2
# 5 10  9  5  3
```
```python
import rpy2
import rpy2.robjects.numpy2ri

R = rpy2.robjects.r
r_code = """
x <- c(1, 2, 6, 8, 11)
y <- dist(x)
print(y)
"""
R(r_code)
```
- Note: I found this hard to get working through rpy2. Calling R from Python this way may not be worth learning; using R directly is recommended (a pure-R example follows).
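For reference, a minimal sketch in R itself, exercising the different method options on a small two-indicator data set (the numbers are made up for illustration):

```r
# five samples with two indicators; rows are samples
x <- matrix(c(1, 2, 6, 8, 11,
              3, 5, 4, 9, 2), ncol = 2)
dist(x)                               # Euclidean distance (the default)
dist(x, method = "manhattan")         # absolute distance
dist(x, method = "maximum")           # Chebyshev distance
dist(x, method = "minkowski", p = 3)  # Minkowski distance of order 3
dist(x, diag = TRUE, upper = TRUE)    # print the full symmetric matrix
```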
Standardization
When the indicators are measured on very different scales, the data should be standardized first, and the distances computed from the standardized data.
General standardized transformation
X_{ij}^{*} = \frac{X_{ij} - \overline{X}_{j}}{S_{j}}
where i = 1, 2, …, n indexes the samples and j = 1, 2, …, p indexes the indicators, so each sample has p observed indicators; \overline{X}_{j} and S_{j} are the sample mean and sample standard deviation of the j-th indicator.
Range standardization transformation
X_{ij}^{*} = \frac{X_{ij} - \overline{X}_{j}}{R_{j}}, where R_{j} = \mathop{\max}\limits_{1\le k \le n}X_{kj} - \mathop{\min}\limits_{1 \le k \le n}X_{kj}
Range normalization transformation
X_{ij}^{*} = \frac{X_{ij} - \mathop{\min}\limits_{1\le k\le n} X_{kj}}{R_{j}}
Program statement
Centralization and standardization of data
R statement
scale(X, center = TRUE, scale = TRUE)
X: the sample data matrix; center = TRUE centers the data; scale = TRUE standardizes (scales) it.
Python statement
```python
import numpy
import rpy2
import rpy2.robjects.numpy2ri

rpy2.robjects.numpy2ri.activate()
R = rpy2.robjects.r
x = numpy.array([[1.0, 2.0], [3.0, 1.0]])
res = R.scale(x, center=True, scale=True)
print(res)
```
Range standardization of the data
```r
x <- data.frame(
  points   = c(99, 97, 104, 79, 84, 88, 91, 99),
  rebounds = c(34, 40, 41, 38, 29, 30, 22, 25),
  blocks   = c(12, 8, 8, 7, 8, 11, 6, 7)
)
# apply() must be applied to a data frame or matrix
center <- sweep(x, 2, apply(x, 2, mean))  # subtract the column means
R <- apply(x, 2, max) - apply(x, 2, min)  # column ranges
x_star <- sweep(center, 2, R, "/")        # divide by the ranges
# using sweep(center, 2, apply(x, 2, sd), "/") instead yields the (ordinary)
# standardized data
print(x_star)
```
sweep(x, MARGIN, STATS, FUN = "-", ...)
- x: an array or matrix; MARGIN: the dimension to operate over, 1 for rows, 2 for columns;
- STATS: the statistic to sweep out; e.g. apply(x, 2, mean) gives the mean of each column;
- FUN: the function to apply; the default is subtraction.
Similarity coefficient
Compute the correlation coefficients between the indicators of the samples; this is suitable for clustering variables (R-type clustering).
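A minimal sketch of this idea on simulated data, assuming 1 - |r| is taken as the dissimilarity between variables (other transformations of the correlation coefficient are also common):

```r
set.seed(1)
# 20 samples of 4 variables; x4 is built to be strongly correlated with x1
x1 <- rnorm(20); x2 <- rnorm(20); x3 <- rnorm(20)
x4 <- x1 + rnorm(20, sd = 0.1)
x  <- data.frame(x1, x2, x3, x4)

r <- cor(x)                 # correlation coefficients between the variables
d <- as.dist(1 - abs(r))    # turn similarity into dissimilarity
plot(hclust(d), hang = -1)  # R-type clustering: x1 and x4 merge first
```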
Systematic clustering
It is the most commonly used method of cluster analysis.
The basic idea
- (1) Start by treating each sample (or variable) as a class of its own, and define the distance between classes (or the similarity coefficient);
- (2) Merge the most similar samples (or variables) into small classes, then merge the resulting classes according to their similarity;
- (3) Repeat until all subclasses are merged into one large class; this yields a pedigree (dendrogram) of the objects ordered by similarity.
According to how the distance between classes is defined, the methods include
- (1) Shortest distance method (single linkage): the distance between two classes is the distance between their closest samples;
- (2) Longest distance method (complete linkage): the distance between two classes is the distance between their farthest samples;
- (3) Class average method (average linkage): the distance between two classes is the average of the distances between all pairs of samples, one taken from each class.
Program
```r
x <- c(1, 2, 6, 8, 11)
dim(x) <- c(5, 1)
d <- dist(x)                 # generate the distance structure
hc1 <- hclust(d, "single")   # shortest distance method
hc2 <- hclust(d, "complete") # longest distance method
hc3 <- hclust(d, "median")
hc4 <- hclust(d, "mcquitty")
opar <- par(mfrow = c(2, 2))
plot(hc1, hang = -1); plot(hc2, hang = -1)
plot(hc3, hang = -1); plot(hc4, hang = -1)
par(opar)                    # draw the four dendrograms in a 2x2 layout
```
hclust(): performs the systematic clustering; plot(): draws its dendrogram.
hclust(d, method = "complete")
- d: a dist distance structure
- method: the systematic clustering method (the default is the longest distance method), including:
  - (1) "single": shortest distance method
  - (2) "complete": longest distance method
  - (3) "average": class average method
  - ……
plot(x, labels = NULL, hang = 0.1, main = "Cluster Dendrogram", sub = NULL, xlab = NULL, ylab = "Height", …)
- x: the object returned by hclust()
- hang: the position of the classes in the tree; a negative value draws the leaves from the bottom
- main: the title of the plot
Dynamic clustering
Systematic clustering: once a class is formed, it never changes. Dynamic clustering: the classes are revised step by step.
The basic idea
First make a rough classification, then revise unreasonable assignments according to some optimality principle, repeating until the classification is reasonable and the final result is formed.
Program
kmeans(x, centers, iter.max = 10, nstart = 1, algorithm = c("Hartigan-Wong", "Lloyd", "MacQueen"))
- x: a matrix or data frame holding the data;
- centers: the number of clusters, or a set of initial cluster centers;
- iter.max: the maximum number of iterations (the default is 10);
- nstart: the number of random initial configurations to try;
- algorithm: the dynamic clustering algorithm to use.
```r
X <- data.frame(
  x1 = c(2959.19, 2459.77, 1495.63, 1046.33, 1303.97, 1730.84, 1561.86,
         1410.11, 3712.31, 2207.58, 2629.16, 1844.78, 2709.46, 1563.78,
         1675.75, 1427.65, 1783.43, 1942.23, 3055.17, 2033.87, 2057.86,
         2303.29, 1974.28, 1673.82, 2194.25, 2646.61, 1472.95, 1525.57,
         1654.69, 1375.46, 1608.82),
  x2 = c(730.79, 495.47, 515.90, 477.77, 524.29, 553.90, 492.42, 510.71,
         550.74, 449.37, 557.32, 430.29, 428.11, 303.65, 613.32, 431.79,
         511.88, 512.27, 353.23, 300.82, 186.44, 589.99, 507.76, 437.75,
         537.01, 839.70, 390.89, 472.98, 437.77, 480.99, 536.05),
  x3 = c(749.41, 697.33, 362.37, 290.15, 254.83, 246.91, 200.49, 211.88,
         893.37, 572.40, 689.73, 271.28, 334.12, 233.81, 550.71, 288.55,
         282.84, 401.39, 564.56, 338.65, 202.72, 516.21, 344.79, 461.61,
         369.07, 204.44, 447.95, 328.90, 258.78, 273.84, 432.46),
  x4 = c(513.34, 302.87, 285.32, 208.57, 192.17, 279.81, 218.36, 277.11,
         346.93, 211.92, 435.69, 126.33, 160.77, 107.90, 219.79, 208.14,
         201.01, 206.06, 356.27, 157.78, 171.79, 236.55, 203.21, 153.32,
         249.54, 209.11, 259.51, 219.86, 303.00, 317.32, 235.82),
  x5 = c(467.87, 284.19, 272.95, 201.50, 249.81, 239.18, 220.69, 224.65,
         527.00, 302.09, 514.66, 250.56, 405.14, 209.70, 272.59, 217.00,
         237.60, 321.29, 811.88, 329.06, 329.65, 403.92, 240.24, 254.66,
         290.84, 379.30, 230.61, 206.65, 244.93, 251.08, 250.28),
  x6 = c(1141.82, 735.97, 540.58, 414.72, 463.09, 445.20, 459.62, 376.82,
         1034.98, 585.23, 795.87, 513.18, 461.67, 393.99, 599.43, 337.76,
         617.74, 697.22, 873.06, 621.74, 477.17, 730.05, 575.10, 445.59,
         561.91, 371.04, 490.90, 449.69, 479.53, 424.75, 541.30),
  x7 = c(478.42, 570.84, 364.91, 281.84, 287.87, 330.24, 360.48, 317.61,
         720.33, 429.77, 575.76, 314.00, 535.13, 509.39, 371.62, 421.31,
         523.52, 492.60, 1082.82, 587.02, 312.93, 438.41, 430.36, 346.11,
         407.70, 269.59, 469.10, 249.66, 288.56, 228.73, 344.85),
  x8 = c(457.64, 305.08, 188.63, 212.10, 192.96, 163.86, 147.76, 152.85,
         462.03, 252.54, 323.36, 151.39, 232.29, 160.12, 211.84, 165.32,
         182.52, 226.45, 420.81, 218.27, 279.19, 225.80, 223.46, 191.48,
         330.95, 389.33, 191.34, 228.19, 236.51, 195.93, 214.40),
  row.names = c("Beijing", "Tianjin", "Hebei", "Shanxi", "Inner Mongolia",
                "Liaoning", "Jilin", "Heilongjiang", "Shanghai", "Jiangsu",
                "Zhejiang", "Anhui", "Fujian", "Jiangxi", "Shandong",
                "Henan", "Hubei", "Hunan", "Guangdong", "Guangxi",
                "Hainan", "Chongqing", "Sichuan", "Guizhou", "Yunnan",
                "Tibet", "Shaanxi", "Gansu", "Qinghai", "Ningxia",
                "Xinjiang")
)
kmeans(scale(X), 5)
```

Output:

```
K-means clustering with 5 clusters of sizes 10, 7, 3, 7, 4

Clustering vector:
       Beijing        Tianjin          Hebei         Shanxi Inner Mongolia
             5              4              3              3              3
      Liaoning          Jilin   Heilongjiang       Shanghai        Jiangsu
             3              3              3              5              4
      Zhejiang          Anhui         Fujian        Jiangxi       Shandong
             5              1              2              1              4
         Henan          Hubei          Hunan      Guangdong        Guangxi
             1              1              4              5              2
        Hainan      Chongqing        Sichuan        Guizhou         Yunnan
             2              4              1              1              4
         Tibet        Shaanxi          Gansu        Qinghai        Ningxia
             4              1              3              3              3
      Xinjiang
             3
```
Principal component analysis
The basic idea
In practical problems the variables differ in importance, and many of them are correlated with one another. By exploiting this correlation, the variables can be transformed so that a small number of new variables capture most of the information carried by the original ones, which simplifies the problem. In other words, dimensionality reduction.
Principal component analysis is a statistical method for handling high-dimensional data based on this dimensionality-reduction idea.
The basic method
Construct suitable linear combinations of the original variables to produce a set of new, mutually uncorrelated variables; select a few of them that retain as much of the original information as possible, and analyze the problem using these few new variables in place of the original ones.
The amount of "information" a variable carries is usually measured by its variance or sample variance.
For example, for a constant a, Var(a) = 0: observing a tells us only the constant itself, so it carries almost no information.
Definition of principal component
Let X = (X_{1}, X_{2}, \ldots, X_{p})^{T} be the vector of the p random variables involved in the problem; denote the mean of X by \mu and its covariance matrix by \Sigma.
Consider the linear combinations
\left\{ \begin{aligned} Y_{1} & = a_{1}^{T}X \\ & \vdots \\ Y_{p} & = a_{p}^{T}X \end{aligned} \right.
……
Note: the remaining theory is omitted; only the code is given below.
Program
Finding the eigenvalues and eigenvectors of a matrix
```r
a <- c(1, -2, 0, -2, 5, 0, 0, 0, 2)
# build a 3-column matrix from the vector a; byrow = T fills the matrix by rows
b <- matrix(data = a, ncol = 3, byrow = T)
c <- eigen(b)  # eigenvalues and eigenvectors of b
```
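For an end-to-end principal component analysis, a minimal sketch using base R's princomp() on the built-in USArrests data set (chosen here only as a convenient example):

```r
pr <- princomp(USArrests, cor = TRUE)  # cor = TRUE: work from the correlation matrix
summary(pr, loadings = TRUE)           # variance explained per component, plus loadings
screeplot(pr, type = "lines")          # scree plot, to choose how many components to keep
```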
Linear model
1. The relationships between variables generally fall into two categories
- Completely deterministic relationships, which can be expressed by an explicit function;
- Uncertain relationships, also called correlation.
2. The main content of regression analysis
- From observed or experimental data, find a quantitative mathematical expression (an empirical formula) for the correlation between the variables, i.e. estimate the parameters and determine the specific form of the empirical regression equation;
- Test whether the fitted empirical regression equation is reasonable;
- Use a validated regression equation to predict and control the random variable Y (see the sketch after this list).
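A minimal sketch of these three steps in base R, on simulated data (the variable names and numbers are purely illustrative):

```r
set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30)  # a linear relationship plus random noise

fit <- lm(y ~ x)              # step 1: estimate the empirical regression equation
summary(fit)                  # step 2: significance tests; is the equation reasonable?
predict(fit,                  # step 3: predict Y at a new value of x
        newdata = data.frame(x = 5),
        interval = "prediction")
```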