
[R language data science] (XIV): random variables and basic statistics

2022-06-24 13:48:00 【JOJO's data analysis Adventure】


Personal home page: JoJo's Data Analysis Adventure
About the author: I am a senior majoring in statistics and have been admitted, by recommendation without examination (baoyan), to a top-3 statistics program to continue with graduate study in statistics.
This article is part of the [R language data science] series, which introduces applications of R in data science, including:
R programming fundamentals, visualization in R, data manipulation in R, modeling in R, implementing machine learning algorithms in R, and statistical theory and methods in R. The series is still being extended, with updates planned roughly every week.


Table of contents

[R language data science] (XIV): random variables and basic statistics
Preface
1. Random variables
2. Sampling models
3. The probability distribution of a random variable
4. Basic statistics
5. Population variance and sample variance

Preface

In data science, we often deal with data that is affected by chance in some way: the data comes from a random sample, the data is affected by measurement error, or the data measures some outcome that is random in nature. Quantifying the uncertainty introduced by randomness is one of the most important jobs of a data analyst. Statistical inference offers a framework, as well as several practical tools, for doing this. The first step is to learn how to describe random variables mathematically.
In this chapter, we first introduce random variables and their basic statistics. The next chapter covers two important results in probability theory: the law of large numbers and the central limit theorem.

1. Random variables

A random variable is the numerical outcome of a random process. We can easily generate random variables with simple examples such as drawing beads from an urn. For example, define X to be 1 if a bead is blue and 0 otherwise:

beads <- rep( c("red", "blue"), times = c(2,3))
X <- ifelse(sample(beads, 1) == "blue", 1, 0)

Here X is a random variable: every time we draw a new bead, the outcome changes randomly.

ifelse(sample(beads, 1) == "blue", 1, 0)
ifelse(sample(beads, 1) == "blue", 1, 0)
ifelse(sample(beads, 1) == "blue", 1, 0)
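To make the randomness concrete, the sketch below repeats the single draw many times and looks at how often X equals 1. The seed, the number of repetitions (10,000), and the name draws are illustrative choices, not part of the original example. With 3 blue beads out of 5, the proportion of 1s should land near 3/5.

```r
set.seed(2022)  # arbitrary seed, assumed here only for reproducibility

beads <- rep(c("red", "blue"), times = c(2, 3))

# Repeat the one-bead experiment 10,000 times (an illustrative choice)
draws <- replicate(10000, ifelse(sample(beads, 1) == "blue", 1, 0))

table(draws)   # counts of 0s and 1s
mean(draws)    # proportion of 1s; should be close to 3/5
```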

2. Sampling models

Many data-generating processes that produce the data we study can be modeled quite well as draws from an urn. For example, we can model the process of polling likely voters as drawing 0s (Republicans) and 1s (Democrats) from an urn containing the 0 and 1 codes for all likely voters. In epidemiological studies, we often assume that the subjects in our study are a random sample from the population of interest. The data related to a specific outcome can then be modeled as a random sample from an urn containing that outcome for the entire population of interest. Similarly, in experimental research, we often assume that the individual organisms we are studying, such as worms, flies, or mice, are a random sample from a larger population. Randomized experiments can also be modeled by draws from an urn, given the way individuals are assigned to groups: when assigned, the group is drawn at random. Sampling models are therefore ubiquitous in data science. Casino games offer a wealth of real-world examples in which sampling models are used to answer specific questions, so we will start with such examples.
Suppose a very small casino hires you to advise on whether it should install roulette wheels. To keep the example simple, we assume that 1,000 people will play and that the only bet available on the roulette wheel is on red or black. The casino wants you to predict how much money it will make or lose. It wants a range of values and, in particular, it wants to know the chance of losing money. If this probability is too high, it will pass on installing roulette wheels.
We will define a random variable $S$ that represents the casino's total winnings. Let's start by constructing the urn. A roulette wheel has 18 red pockets, 18 black pockets, and 2 green pockets. Playing a color in one game of roulette is therefore equivalent to drawing from this urn:

color <- rep(c("Black", "red", "Green"), c(18, 18, 2))

The 1,000 outcomes from the 1,000 players are independent draws from this urn. If red comes up, the gambler wins and the casino loses a dollar, so we record a draw of -1 dollar. Otherwise, the casino wins a dollar and we record a draw of 1 dollar. To construct our random variable $S$, we start with the following code:

n <- 1000
X <- sample(ifelse(color == "red", -1, 1),  n, replace = TRUE)
X[1:10]

1
-1
-1
-1
-1
1
1
-1
-1
-1

Because we know the proportions of 1s and -1s, we can generate the draws with one line of code, without defining color:

X <- sample(c(-1,1), n, replace = TRUE, prob=c(9/19, 10/19))
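As a quick sanity check on the probabilities 9/19 and 10/19, the sketch below rebuilds the urn from the color vector and tabulates the proportions of -1 and 1. The name urn is introduced here for illustration only.

```r
# The urn of roulette pockets, as defined earlier in the text
color <- rep(c("Black", "red", "Green"), c(18, 18, 2))

# Casino's result for each pocket when the player bets on red
urn <- ifelse(color == "red", -1, 1)

prop.table(table(urn))             # proportions of -1 and 1 in the urn
all.equal(mean(urn == -1), 9/19)   # 18 of 38 pockets are red
all.equal(mean(urn == 1), 10/19)   # 20 of 38 pockets are black or green
```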

We call this a sampling model because we are modeling the random behavior of roulette with draws from an urn. The total winnings $S$ is simply the sum of these 1,000 independent draws:

X <- sample(c(-1,1), n, replace = TRUE, prob=c(9/19, 10/19))
S <- sum(X)
S

3. The probability distribution of a random variable

If you run the code above repeatedly, you will see that $S$ changes every time. This is, of course, because $S$ is a random variable. The probability distribution of a random variable tells us the probability of the observed value falling in any given interval. So, for example, if we want to know the probability that we lose money, we are asking for the probability that $S$ falls in the interval $S<0$.
Note that if we can define a cumulative distribution function $F(a) = {Pr}(S\leq a)$, then we will be able to answer any question about the probability of events defined by our random variable $S$, including the event $S<0$. We call this $F$ the distribution function of the random variable.
We can estimate the distribution function of the random variable $S$ by using a Monte Carlo simulation to generate many realizations of it. With the code below, we run the experiment of having 1,000 people play roulette, over and over, $B = 10,000$ times:

n <- 1000
B <- 10000
roulette_winnings <- function(n){
  # one experiment: n independent bets, each 1 (casino wins) or -1 (casino loses)
  X <- sample(c(-1,1), n, replace = TRUE, prob=c(9/19, 10/19))
  sum(X)
}
S <- replicate(B, roulette_winnings(n))

Now we can ask the following: in our simulation, how often did we get a sum less than or equal to $a$? This frequency will be a very good approximation of $F(a)$, and we can easily answer the casino's question: how likely is it that we will lose money? We can see that it is quite low:

mean(S<0)

0.0427

We can visualize the distribution of $S$ with a histogram showing the probability $F(b)-F(a)$ for several intervals $(a,b]$:

[Figure: histogram of $S$ with a normal density curve overlaid in blue]

We see that the distribution appears to be approximately normal. A qq-plot will confirm that the normal approximation is close to a perfect approximation for this distribution. If the distribution is in fact normal, then all we need to define it is its mean and standard deviation. Because we have the original values from which the distribution is created, we can easily compute these with mean(S) and sd(S). The blue curve added to the histogram above is a normal density with this mean and standard deviation.
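The plotting code is not shown in the text. The sketch below is one way to reproduce a similar histogram and qq-plot with base R graphics; the seed, the number of breaks, and the plot labels are assumptions made here, and the original figure may have been produced with a different plotting package.

```r
set.seed(2022)  # arbitrary seed, assumed here only for reproducibility

n <- 1000
B <- 10000
roulette_winnings <- function(n) {
  X <- sample(c(-1, 1), n, replace = TRUE, prob = c(9/19, 10/19))
  sum(X)
}
S <- replicate(B, roulette_winnings(n))

# Histogram on the density scale so the normal curve is comparable
hist(S, breaks = 30, freq = FALSE, main = "Distribution of S", xlab = "S")

# Normal density with the simulated mean and standard deviation, in blue
curve(dnorm(x, mean = mean(S), sd = sd(S)), add = TRUE, col = "blue", lwd = 2)

# qq-plot against the normal distribution; points near the line support
# the normal approximation
qqnorm(S)
qqline(S)
```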

This mean and this standard deviation have special names: they are referred to as the expected value and the standard error of the random variable $S$. We cover them in detail in the next section.
Statistical theory provides a way to derive the distribution of random variables defined as independent random draws from an urn. Specifically, in our example above, we can show that $(S+n)/2$ follows a binomial distribution. We therefore do not need to run Monte Carlo simulations to know the probability distribution of $S$.
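The reason $(S+n)/2$ is binomial can be seen by counting: if the casino wins $w$ of the $n$ bets, then $S = w - (n - w) = 2w - n$, so $(S+n)/2 = w$, and the number of wins in $n$ independent bets with win probability 10/19 is binomial. The sketch below checks this identity on one simulated set of draws; the seed and the name wins are illustrative assumptions.

```r
set.seed(2022)  # arbitrary seed, assumed here only for reproducibility

n <- 1000
X <- sample(c(-1, 1), n, replace = TRUE, prob = c(9/19, 10/19))
S <- sum(X)

# Number of bets the casino wins (draws equal to 1)
wins <- sum(X == 1)

# The transformation (S + n) / 2 recovers exactly that count
(S + n) / 2 == wins
```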
We can use the functions dbinom and pbinom to compute the probabilities exactly. For example, to compute ${Pr}(S < 0)$ we note that:
${Pr}(S < 0) = {Pr}((S+n)/2 < (0+n)/2)$
and we can use pbinom to compute ${Pr}(S \leq 0)$:

n <- 1000
pbinom(n/2, size = n, prob = 10/19)

0.0510979434690998

Because this is a discrete probability function, to get ${Pr}(S < 0)$ rather than ${Pr}(S \leq 0)$, we write:

pbinom(n/2-1, size = n, prob = 10/19)

0.0447959069035901

4. Basic statistics

We have described sampling models for draws. We will now go over the mathematical theory that lets us approximate the probability distribution of the sum of draws. Once we do this, we will be able to help the casino predict how much money it will make. The same approach we use for the sum of draws will be useful for describing the distribution of averages and proportions, which we will need to understand how polls work. The first important concept to learn is the expected value. In statistics books, it is common to use the letter ${E}$ like this:
${E}[X]$
to denote the expected value of the random variable $X$.
A random variable varies around its expected value in such a way that if you take the average of many draws, that average approximates the expected value, getting closer and closer the more draws you take.
Theoretical statistics provides techniques that help calculate expected values in different circumstances. For example, a useful formula tells us that the expected value of a random variable defined by one draw is the average of the numbers in the urn. In the urn used to model betting on red in roulette, we have 20 one dollars and 18 negative one dollars. The expected value is thus:
$E [X] = (20 + (- 18)) / 38$

which is about 5 cents. It is a bit counterintuitive to say that $X$ varies around 0.05 when the only values it takes are 1 and -1. One way to make sense of the expected value in this context is to realize that if we play the game over and over, the casino wins, on average, 5 cents per game. A Monte Carlo simulation confirms this:

B <- 10^6
x <- sample(c(-1,1), B, replace = TRUE, prob=c(9/19, 10/19))
mean(x)

0.05279

In general, if the urn has two possible outcomes, say $a$ and $b$, with proportions $p$ and $1 - p$ respectively, the average is:

${E}[X] = ap+b(1-p)$

To see this, suppose the urn contains $n$ beads, of which $np$ are $a$s and $n(1-p)$ are $b$s, so that their total is:

$n p a + n b (1 - p)$

Dividing by $n$ gives the average:

$a p + b (1 - p)$
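A small numeric check of this formula for the roulette urn, as a sketch: the names a, b, p, and urn are introduced here for illustration, with $a = 1$, $b = -1$, and $p = 10/19$ taken from the example above.

```r
a <- 1        # casino wins a dollar
b <- -1       # casino loses a dollar
p <- 10/19    # proportion of 1s in the urn

# Expected value from the formula a*p + b*(1 - p)
a * p + b * (1 - p)

# The same quantity as the plain average of the numbers in the urn
urn <- rep(c(a, b), times = c(20, 18))
mean(urn)

all.equal(a * p + b * (1 - p), mean(urn))
```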

The reason we define the expected value is that this mathematical definition turns out to be useful for approximating the probability distribution of a sum, which in turn is useful for describing the distribution of averages and proportions. The first useful fact is that the expected value of the sum $S$ of $n$ draws is:

${E}[S] = n \times {E}[X]$

Therefore, if 1,000 people play roulette, the casino expects to win, on average, about 1,000 × 0.05 dollars = 50 dollars. But this is an expected value. How different can one observation be from the expected value? The casino really needs to know this. What is the range of possibilities? If negative numbers are too likely, the casino will not install roulette wheels.
We can use the standard error to answer this question. The standard error (SE) gives us an idea of the size of the variation around the expected value. In statistics books, it is commonly denoted by ${SE}[X]$.
Now assume that each bet is independent. For an urn with two values $a$ and $b$ in proportions $p$ and $1-p$, the standard deviation of a single draw is:

$|b-a|\sqrt{p(1-p)}$

The standard error tells us the typical difference between a random variable and its expectation. Using the formulas above, the random variable defined by one draw has an expected value of about 0.05 and a standard error of about 1. This makes sense, since we obtain either 1 or -1, with 1 slightly favored over -1.
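The two-value formula can be checked against the standard deviation computed directly from the numbers in the urn. This is a sketch with illustrative names (a, b, p, urn); both expressions should give about 0.9986 for the roulette urn.

```r
a <- 1
b <- -1
p <- 10/19

# Standard deviation from the two-value formula |b - a| * sqrt(p * (1 - p))
abs(b - a) * sqrt(p * (1 - p))

# Standard deviation computed directly from the 38 numbers in the urn,
# dividing by the number of values (not by one less)
urn <- rep(c(a, b), times = c(20, 18))
sqrt(mean((urn - mean(urn))^2))
```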
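Because the draws are independent, the standard error of the sum of $n$ draws is $\sqrt{n}$ times the standard deviation of a single draw. The sum over 1,000 players therefore has a standard error of about 32 dollars: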

n <- 1000
sqrt(n) * 2 * sqrt(90)/19

31.5789473684211
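The theoretical expected value and standard error can be compared with a Monte Carlo simulation of $S$. The sketch below also uses them in a normal approximation of ${Pr}(S < 0)$; treating $S$ as normal is an assumption here, motivated by the histogram in the previous section. The seed and the names mu and se are illustrative.

```r
set.seed(2022)  # arbitrary seed, assumed here only for reproducibility

n <- 1000
B <- 10000

# Theory: expected value and standard error of the sum of n draws
mu <- n * (20 - 18) / 38
se <- sqrt(n) * 2 * sqrt(90) / 19

# Simulation: B realizations of the total winnings S
S <- replicate(B, sum(sample(c(-1, 1), n, replace = TRUE, prob = c(9/19, 10/19))))

c(theory = mu, simulation = mean(S))
c(theory = se, simulation = sd(S))

# Probability of losing money: normal approximation versus simulated frequency
pnorm(0, mean = mu, sd = se)
mean(S < 0)
```

The normal approximation and the simulated frequency should both be in the neighborhood of the exact binomial value computed earlier with pbinom.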

5. Population variance and sample variance

The standard deviation of a list of numbers $x$ (we use heights as an example) is defined as the square root of the variance, that is, of the average squared distance from the mean:

library(dslabs)
x <- heights$height
m <- mean(x)
s <- sqrt(mean((x-m)^2))

The mathematical expression is as follows ：

$\mu = \frac{1}{n}\sum_{i=1}^{n}x_i\\ \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2}$

However, in R, the sd function returns a slightly different result:

s == sd(x)
s - sd(x)

FALSE

-0.00194266120553532

This is because in R the sd function does not return the standard deviation of the list itself. Instead, it uses a formula that estimates the standard deviation of a population from a random sample:

$\bar X = \frac{1}{N}\sum_{i=1}^N X_i\\s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(X_i-\bar X)^2}$

We can verify that the two results differ only by the factor $\sqrt{(N-1)/N}$:

n <- length(x)
s-sd(x)*sqrt((n-1) / n)

0

Therefore, for all the theory discussed here, you need to compute the actual standard deviation as defined:

sqrt(mean((x-m)^2))

4.07667430843691

So be careful when using the sd function in R. When the data size is large, however, the two are practically equivalent, because

$\sqrt{(N-1)/N} \approx 1$
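To see how quickly the correction factor approaches 1, the sketch below evaluates it for a few sample sizes and checks it on simulated data. The helper pop_sd is a hypothetical name introduced here (it is not a base R function), and the sample of 50 normal values is only for illustration.

```r
# Hypothetical helper: standard deviation as defined in the text (divide by N)
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

# The correction factor sqrt((N - 1) / N) for increasing sample sizes
N <- c(10, 100, 1000, 10000)
round(sqrt((N - 1) / N), 5)

# Check on simulated data: the ratio pop_sd / sd equals the factor
set.seed(2022)  # arbitrary seed, assumed here only for reproducibility
x <- rnorm(50)
pop_sd(x) / sd(x)
sqrt((50 - 1) / 50)
```

Defining a small helper like this is one way to avoid accidentally using sd when the definition with $N$ in the denominator is needed.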


Original site

Copyright notice
This article was written by [JOJO's data analysis Adventure]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/175/202206241059029304.html