[R language data science] (XIV): random variables and basic statistics
2022-06-24 13:48:00 【JOJO's data analysis Adventure】
- Personal home page: JoJo's Data Analysis Adventure
- About the author: I am a senior-year undergraduate in statistics and have been recommended for admission to a postgraduate statistics program at one of the top three universities for statistics.
- This article is part of the 【R Language Data Science】 series, which introduces applications of R in data science: R programming fundamentals, visualization with R, data manipulation with R, modeling with R, implementing machine learning algorithms in R, and statistical theory and methods in R. The series is still being written, with updates planned roughly every week.

Preface
In data science, we often deal with data that is affected by chance in some way: the data comes from a random sample, the data is affected by measurement error, or the data measures an outcome that is random in nature. Quantifying the uncertainty introduced by randomness is one of the most important jobs of a data analyst. Statistical inference offers a framework and several practical tools for doing this. The first step is to learn how to describe random variables mathematically.
In this chapter, we first introduce random variables and their basic statistics. The next chapter introduces two important results from probability theory: the law of large numbers and the central limit theorem.
1. Random variables
A random variable is a numeric outcome of a random process. We can easily generate random variables using simple examples such as drawing beads from an urn. For example, define X to be 1 if the bead drawn is blue and 0 otherwise:
beads <- rep( c("red", "blue"), times = c(2,3))
X <- ifelse(sample(beads, 1) == "blue", 1, 0)
Here X is a random variable: every time we draw a new bead, the outcome changes randomly.
ifelse(sample(beads, 1) == "blue", 1, 0)
ifelse(sample(beads, 1) == "blue", 1, 0)
ifelse(sample(beads, 1) == "blue", 1, 0)
0
1
1
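Repeating the draw many times shows what "random" means here: single outcomes vary, but the proportion of 1s settles near the share of blue beads in the urn (3 out of 5). The following is a minimal sketch; the seed, the number of repetitions `B`, and the name `draws` are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

beads <- rep(c("red", "blue"), times = c(2, 3))

# Repeat the single-bead experiment B times and record X each time
B <- 10000
draws <- replicate(B, ifelse(sample(beads, 1) == "blue", 1, 0))

# Proportion of 1s; it should land near 3/5
mean(draws)
```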
2. Sampling models
Many data generation procedures, those that produce the data we study, can be modeled quite well as draws from an urn. For example, we can model the process of polling likely voters as drawing 0s (Republicans) and 1s (Democrats) from an urn containing the 0 and 1 codes for all likely voters. In epidemiological studies, we often assume that the subjects in our study are a random sample from the population of interest. The data related to a specific outcome can be modeled as a random sample from an urn containing the outcome for the entire population of interest. Similarly, in experimental research, we often assume that the individual organisms we are studying, such as worms, flies, or mice, are a random sample from a larger population. Randomized experiments can also be modeled as draws from an urn, given the way individuals are assigned to groups: when an individual is assigned, the group is drawn at random. Sampling models are therefore ubiquitous in data science. Casino games offer many real-world examples in which sampling models are used to answer specific questions, so we start with these examples.
Suppose a very small casino hires you to advise on whether they should set up roulette wheels. To keep the example simple, we assume that 1,000 people will play and that the only game available on the roulette wheel is betting on red or black. The casino wants you to predict how much money they will make or lose. They want a range of values and, in particular, they want to know how likely it is that they will lose money. If this probability is too high, they will not install roulette wheels.
We define a random variable S to represent the casino's total winnings. Let's start by constructing the urn. A roulette wheel has 18 red pockets, 18 black pockets, and 2 green pockets. Playing a color in one game of roulette is therefore equivalent to drawing from this urn:
color <- rep(c("Black", "red", "Green"), c(18, 18, 2))
The 1,000 outcomes from 1,000 players are independent draws from this urn. If red comes up, the gambler wins and the casino loses a dollar, so we record the draw as -1. Otherwise, the casino wins a dollar and we record the draw as 1. To construct the draws that make up our random variable S, we can use the following code:
n <- 1000
X <- sample(ifelse(color == "red", -1, 1), n, replace = TRUE)
X[1:10]
- 1
- -1
- -1
- -1
- -1
- 1
- 1
- -1
- -1
- -1
Because we know the proportions of 1s and -1s, we can generate the draws with one line of code, without defining color:
X <- sample(c(-1,1), n, replace = TRUE, prob=c(9/19, 10/19))
We call this a sampling model because we model the random behavior of roulette by sampling draws from an urn. The total winnings S is simply the sum of these 1,000 independent draws:
X <- sample(c(-1,1), n, replace = TRUE, prob=c(9/19, 10/19))
S <- sum(X)
S
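The probabilities 9/19 and 10/19 passed to `prob` are just the proportions of -1s and 1s in the urn. As a quick check, the sketch below rebuilds the urn and tabulates those proportions; the names `outcomes` and `urn_props` are introduced here for illustration only.

```r
# Rebuild the urn: 18 black, 18 red, 2 green pockets
color <- rep(c("black", "red", "green"), times = c(18, 18, 2))

# From the casino's point of view: -1 when red wins, 1 otherwise
outcomes <- ifelse(color == "red", -1, 1)

# Proportion of each outcome in the urn (ordered as -1, then 1)
urn_props <- prop.table(table(outcomes))
urn_props

# These match the prob argument used in the one-line version
all.equal(as.numeric(urn_props), c(9/19, 10/19))
```

Both ways of sampling therefore describe the same sampling model; the one-line version only skips building the urn explicitly.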
3. The probability distribution of a random variable
If we run the code above repeatedly, we see that S changes every time. This is, of course, because S is a random variable. The probability distribution of a random variable tells us the probability of the observed value falling in any given interval. So, for example, if we want to know the probability that we lose money, we are asking for the probability that S is in the interval S < 0.
Note that if we can define a cumulative distribution function $F(a) = \Pr(S \leq a)$, then we can answer any question about the probability of events defined by our random variable S, including the event S < 0. We call F the distribution function of the random variable.
We can estimate the distribution function of S by using a Monte Carlo simulation to generate many realizations of the random variable. With this code, we run the experiment of having 1,000 people play roulette, over and over, B = 10,000 times:
n <- 1000
B <- 10000
roulette_winnings <- function(n){
  # total casino winnings from n independent bets on red
  X <- sample(c(-1,1), n, replace = TRUE, prob=c(9/19, 10/19))
  sum(X)
}
S <- replicate(B, roulette_winnings(n))
Now we can ask the following: in our simulation, how often did we get sums less than or equal to a? This frequency is a very good approximation of $F(a)$, so we can easily answer the casino's question: how likely is it that we lose money? We can see that it is quite low:
mean(S<0)
0.0427
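The same simulated values give an estimate of $F(a)$ for any a, not only for a = 0. A minimal sketch using base R's `ecdf`, with an arbitrary seed and the illustrative name `F_hat` (neither is from the original article):

```r
set.seed(1)  # arbitrary seed for reproducibility

n <- 1000
B <- 10000
S <- replicate(B, sum(sample(c(-1, 1), n, replace = TRUE,
                             prob = c(9/19, 10/19))))

# Empirical distribution function: F_hat(a) is the share of simulated S <= a
F_hat <- ecdf(S)
F_hat(c(0, 50, 100))

# Probability of an interval (a, b] as a difference of F values
F_hat(100) - F_hat(0)
mean(S > 0 & S <= 100)  # same quantity, computed directly
```

The last two lines return the same number, which is the interval probability $F(b) - F(a)$ for $(a, b] = (0, 100]$.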
We can create a histogram to visualize the distribution of S. For several intervals $(a,b]$, the histogram shows the probability $F(b)-F(a)$:

We see that the distribution appears to be approximately normal. A qq-plot would confirm that the normal approximation is close to a perfect approximation for this distribution. If the distribution is in fact normal, then all we need to define it is the average and the standard deviation. Because we have the original values from which the distribution is created, we can easily compute these with mean(S) and sd(S). The blue curve added to the histogram is a normal density with this average and standard deviation.
This average and this standard deviation have special names: they are referred to as the expected value and the standard error of the random variable S. We cover these in detail in the next section.
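If the normal approximation holds, the simulated mean and standard deviation are enough to approximate probabilities about S. The sketch below compares the normal approximation of the probability of losing money with the Monte Carlo frequency; the seed and the names `mu_hat` and `se_hat` are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed for reproducibility

n <- 1000
B <- 10000
S <- replicate(B, sum(sample(c(-1, 1), n, replace = TRUE,
                             prob = c(9/19, 10/19))))

mu_hat <- mean(S)  # estimate of the expected value of S
se_hat <- sd(S)    # estimate of the standard error of S

# Normal approximation to the probability of losing money
pnorm(0, mean = mu_hat, sd = se_hat)

# Monte Carlo frequency for comparison
mean(S < 0)
```

The two numbers should be close but not identical, since S is discrete and the normal curve is only an approximation.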
Statistical theory provides a way to derive the distribution of random variables defined as independent random draws from an urn. Specifically, in our example above, we can show that $(S+n)/2$ follows a binomial distribution. We therefore do not need to run Monte Carlo simulations to know the probability distribution of S.
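The binomial claim takes only a few lines to derive. As a sketch, write each draw $X_i$ as $2Y_i - 1$, where $Y_i$ (a symbol introduced here for illustration) equals 1 when the casino wins the bet and 0 otherwise, so that $Y_i$ is a Bernoulli variable with success probability 10/19:

```latex
\begin{align*}
S &= \sum_{i=1}^{n} X_i = \sum_{i=1}^{n} (2Y_i - 1) = 2\sum_{i=1}^{n} Y_i - n \\
\frac{S+n}{2} &= \sum_{i=1}^{n} Y_i \sim \mathrm{Binomial}\left(n, \tfrac{10}{19}\right)
\end{align*}
```

A sum of n independent Bernoulli variables with the same success probability is binomial by definition, which is why $(S+n)/2$ counts the number of bets the casino wins.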
We can use the functions dbinom and pbinom to compute the probabilities exactly. For example, to compute $\Pr(S < 0)$ we note that:
$$\Pr(S < 0) = \Pr((S+n)/2 < (0+n)/2)$$
and we can use pbinom to compute $\Pr(S \leq 0)$:
n <- 1000
pbinom(n/2, size = n, prob = 10/19)
0.0510979434690998
Because this is a discrete probability function, to get $\Pr(S < 0)$ rather than $\Pr(S \leq 0)$, we write:
pbinom(n/2-1, size = n, prob = 10/19)
0.0447959069035901
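The text mentions dbinom but only uses pbinom. As a cross-check, the sketch below recovers the same probability by summing the binomial probability mass function, and shows that the gap between the two pbinom results is exactly $\Pr(S = 0)$. The names `exact_lt0` and `by_sum` are illustrative.

```r
n <- 1000
p <- 10/19  # probability that the casino wins a single bet

# Pr(S < 0) via the cumulative distribution function
exact_lt0 <- pbinom(n/2 - 1, size = n, prob = p)

# The same probability as a sum of point probabilities
by_sum <- sum(dbinom(0:(n/2 - 1), size = n, prob = p))
all.equal(exact_lt0, by_sum)

# Pr(S = 0): the casino wins exactly half of the bets
dbinom(n/2, size = n, prob = p)
pbinom(n/2, size = n, prob = p) - pbinom(n/2 - 1, size = n, prob = p)
```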
4. Basic statistics
We have described sampling models for draws. We now go over the mathematical theory that lets us approximate the probability distribution of the sum of draws. Once we do this, we will be able to help the casino predict how much money they will make. The same approach we use for the sum of draws is useful for describing the distribution of averages and proportions, which we need in order to understand how polls work. The first important concept to learn is the expected value. In statistics books, it is common to use the letter $E$ like this:
$$E[X]$$
to denote the expected value of the random variable $X$.
A random variable varies around its expected value: if we take the average of many independent draws, the average of the draws approximates the expected value, and it gets closer the more draws we take.
Theoretical statistics provides techniques that make it easier to compute expected values in different circumstances. For example, a useful formula tells us that the expected value of a random variable defined by one draw is the average of the numbers in the urn. In the urn used to model betting on red in roulette, we have 20 values of 1 dollar and 18 values of -1 dollar. The expected value is therefore:
$$E[X] = (20 + (-18))/38$$
which is about 5 cents. It is a bit counterintuitive to say that $X$ varies around 0.05 when the only values it takes are 1 and -1. One way to make sense of the expected value in this context is to realize that if we play the game over and over, the casino wins, on average, 5 cents per game. A Monte Carlo simulation confirms this:
B <- 10^6
x <- sample(c(-1,1), B, replace = TRUE, prob=c(9/19, 10/19))
mean(x)
0.05279
- In general, if the urn has two possible outcomes, say $a$ and $b$, with proportions $p$ and $1-p$ respectively, the average is:
$$E[X] = ap + b(1-p)$$
- To see this, suppose the urn contains $n$ beads, of which $np$ are $a$ and $n(1-p)$ are $b$. Their sum is
$$npa + nb(1-p)$$
- Dividing by $n$ gives the average:
$$ap + b(1-p)$$
- We define the expected value because this mathematical definition is useful for approximating the probability distribution of a sum, which in turn is useful for describing the distribution of averages and proportions. The first useful fact is that the expected value of the sum of $n$ draws is:
$$n \times E[X]$$
- So if 1,000 people play roulette, the casino expects to win, on average, about 1,000 × 0.05 dollars = 50 dollars. But this is an expected value. How different can one observation be from the expected value? The casino really needs to know this. What is the range of possibilities? If negative numbers are too likely, the casino will not install roulette wheels.
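The two formulas above translate directly into code. A minimal sketch, using a hypothetical helper `expected_value` (not part of the original article) and the roulette values a = 1, b = -1, p = 10/19:

```r
# Expected value of one draw from an urn with values a and b,
# where a has proportion p and b has proportion 1 - p
expected_value <- function(a, b, p) a * p + b * (1 - p)

# One bet on red, from the casino's point of view
ev_one <- expected_value(a = 1, b = -1, p = 10/19)
ev_one  # equals 1/19, roughly 5 cents

# Expected value of the sum over n = 1000 independent bets
n <- 1000
n * ev_one
```

Without rounding the per-game value to 0.05, the expected total is 1000/19, which is about 52.6 dollars rather than exactly 50.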
We can use the standard error to answer this question. The standard error (SE) gives us an idea of the size of the variation around the expected value. In statistics books, it is common to write it as $SE(X)$.
If the draws are independent, the standard error of the sum is $\sqrt{n}$ times the standard deviation of the numbers in the urn. For an urn with two values $a$ and $b$ with proportions $p$ and $1-p$, that standard deviation is:
$$|b-a|\sqrt{p(1-p)}$$
The standard error tells us the typical difference between a random variable and its expected value. Using the formula above, the random variable defined by one draw has an expected value of about 0.05 and a standard error of about 1. This makes sense, since we obtain either 1 or -1, with 1 slightly more likely than -1.
Using the same formula, the sum over 1,000 players has a standard error of about 32 dollars:
n <- 1000
sqrt(n) * 2 * sqrt(90)/19
31.5789473684211
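The standard error calculation can be wrapped in small helper functions and compared with a simulation. In the sketch below, `se_one_draw` and `se_sum` are hypothetical names introduced for illustration, and the seed is arbitrary.

```r
# Standard deviation of one draw from an urn with two values a and b
se_one_draw <- function(a, b, p) abs(b - a) * sqrt(p * (1 - p))

# Standard error of the sum of n independent draws
se_sum <- function(n, a, b, p) sqrt(n) * se_one_draw(a, b, p)

se_one_draw(1, -1, 10/19)   # close to 1
se_sum(1000, 1, -1, 10/19)  # same value as sqrt(n) * 2 * sqrt(90)/19

# Monte Carlo check: the sd of simulated sums should be close to se_sum
set.seed(1)
S <- replicate(10000, sum(sample(c(-1, 1), 1000, replace = TRUE,
                                 prob = c(9/19, 10/19))))
sd(S)
```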
5. Population variance and sample variance
The standard deviation of a list of numbers x (we use heights as an example) is defined as the square root of the variance:
library(dslabs)
x <- heights$height
m <- mean(x)
s <- sqrt(mean((x-m)^2))
The mathematical expressions are as follows:
$$\mu = \frac{1}{n}\sum_{i=1}^{n}x_i$$
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2}$$
However, in R, the sd function returns a slightly different result:
s == sd(x)
s - sd(x)
FALSE
-0.00194266120553532
This is because the sd function in R does not return the standard deviation of the list itself. Instead, it uses a formula that estimates the standard deviation of a population from a random sample, dividing by $N-1$ rather than $N$:
$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N}X_i$$
$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(X_i-\bar{X})^2}$$
We can check this:
n <- length(x)
s-sd(x)*sqrt((n-1) / n)
0
Therefore, for all the theory discussed here, we need to compute the actual standard deviation as defined:
sqrt(mean((x-m)^2))
4.07667430843691
So be careful when using the sd function in R. When the data size is large, however, the two are practically the same, because
$$\sqrt{(N-1)/N} \approx 1$$
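To see how quickly the two versions converge, the sketch below compares them for several sample sizes. It uses simulated data rather than the heights dataset so that it runs without dslabs; the helper `pop_sd`, the seed, and the simulation parameters are assumptions made for illustration.

```r
# Standard deviation as defined in the text (divides by N, not N - 1)
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

set.seed(1)  # arbitrary seed for reproducibility
for (N in c(5, 50, 5000)) {
  x <- rnorm(N, mean = 68, sd = 4)  # simulated values, not real heights
  # Columns: N, pop_sd(x), sd(x), correction factor sqrt((N - 1) / N)
  cat(N, pop_sd(x), sd(x), sqrt((N - 1) / N), "\n")
}
```

For small N the correction factor is visibly below 1, while for N in the thousands the two standard deviations agree to several decimal places.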