Statistical knowledge required by data analysts

2022-06-11 04:14:00 【Xinyi 2002】

This article describes five basic statistical concepts that data analysts should understand: statistical features, probability distributions, dimensionality reduction, oversampling and undersampling, and Bayesian statistics.

At a high level, statistics is the use of mathematics to analyze data. A basic visualization such as a histogram gives you some overall information. With statistics, however, we can work with the data in a far more information-driven and targeted way. The mathematics involved helps us draw concrete conclusions about the data rather than just guessing.

Using statistics, we can gain deeper and finer-grained insight into exactly how the data is structured and, based on that structure, how best to apply other techniques to extract even more information. Let's look at 5 basic statistical concepts that data analysts need to know, and how to apply them effectively.

Statistical features

Statistical features are probably the most commonly used statistical concept in data science. They are often the first technique you apply when exploring a data set, and include bias, variance, mean, median, percentiles, and so on. They are easy to understand and to implement in code. Consider the box plot below:

In the figure above, the line in the middle is the median of the data. The median is used instead of the mean because it is more robust to outliers. The first quartile is essentially the 25th percentile: 25% of the data points fall below this value. The third quartile is the 75th percentile: 75% of the data points fall below this value. The minimum and maximum values mark the lower and upper ends of the data range.

A box plot illustrates well what these basic statistical features can tell us:

When the box is short, many of the data points are similar, because many values fall within a small range;
When the box is tall, many of the data points are quite different, because the values are spread over a wide range;
If the median is close to the bottom of the box, most of the data has lower values. If the median is closer to the top, most of the data has higher values. Basically, if the median line is not in the middle of the box, the data is skewed;
If the whiskers (the lines extending above and below the box) are very long, the data has a high standard deviation and variance: the values are spread out and vary widely. If the whisker is long on one side of the box but not on the other, the data may vary widely in only one direction.
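The quantities a box plot draws can be computed directly. The following is a minimal sketch using Python's standard `statistics` module; the `sample` values and the helper name `five_number_summary` are made up for illustration, and the "inclusive" quartile method is only one of several conventions in use.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

# Hypothetical sample with one large outlier (100).
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

minimum, q1, median, q3, maximum = five_number_summary(sample)
mean = statistics.mean(sample)
box_height = q3 - q1  # interquartile range: the height of the box

print("min/Q1/median/Q3/max:", minimum, q1, median, q3, maximum)
print("mean:", mean, "box height (IQR):", box_height)
```

In this sample the single outlier pulls the mean well above the median, which is why the box plot marks the median rather than the mean.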

Probability distributions

We can define probability as the chance that some event will occur, expressed as a percentage. In data science, it is usually quantified on a scale from 0 to 1, where 0 means we are certain the event will not occur and 1 means we are certain it will. A probability distribution, then, is a function that gives the probability of every possible value. See the figure below:

Common probability distributions: uniform (top), normal (middle), Poisson (bottom):

The uniform distribution is the most basic of the three. It takes a single value that occurs only within a certain range, and is 0 everywhere outside that range. We can also think of it as a categorical variable with two categories: 0 or the value. A categorical variable may have several values other than 0, but we can still visualize it as a piecewise function made up of multiple uniform distributions.
The normal distribution, also known as the Gaussian distribution, is defined by its mean and standard deviation. The mean shifts the distribution along the axis, and the standard deviation controls its spread. The main difference from other distributions (such as the Poisson) is that the spread is the same in all directions. So with a Gaussian distribution we know both the average value of the data set and the spread of the data, that is, whether it is spread over a wide range or highly concentrated around a few values.
The Poisson distribution is similar to the normal distribution, but with an added skewness. When the skewness is low, a Poisson distribution has a relatively uniform spread in all directions, just like the normal. When the skewness is large, however, the spread of the data differs by direction: in one direction the data is widely spread, and in the other it is highly concentrated.

If you encounter a Gaussian distribution, you know that many algorithms perform well on Gaussian data by default, so you should reach for those first. If you encounter a Poisson distribution, you must take special care and choose an algorithm that is robust to variations in the spread.
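The skew described above can be checked numerically. This is a minimal sketch using only the standard library; the helper names `poisson_pmf` and `skewness` and the two rates (1 and 50) are illustrative choices, not part of the original article. For a Poisson distribution with rate λ the skewness is 1/√λ, so a low rate gives a strongly lopsided distribution and a high rate gives a nearly symmetric, normal-looking one.

```python
import math

def poisson_pmf(k, rate):
    """P(X = k) for a Poisson distribution, computed in log space to avoid overflow."""
    return math.exp(-rate + k * math.log(rate) - math.lgamma(k + 1))

def skewness(values, probs):
    """Skewness (third standardized moment) of a discrete distribution."""
    mean = sum(v * p for v, p in zip(values, probs))
    variance = sum((v - mean) ** 2 * p for v, p in zip(values, probs))
    third_moment = sum((v - mean) ** 3 * p for v, p in zip(values, probs))
    return third_moment / variance ** 1.5

# Truncate the support at 150; the probability mass beyond that is
# negligible for both of the rates used here.
counts = range(150)
skew_low_rate = skewness(counts, [poisson_pmf(k, 1.0) for k in counts])
skew_high_rate = skewness(counts, [poisson_pmf(k, 50.0) for k in counts])

print("skewness at rate 1: ", skew_low_rate)
print("skewness at rate 50:", skew_high_rate)
```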

Dimensionality reduction

The term dimensionality reduction is fairly intuitive: it means reducing the number of dimensions of a data set. In data science, this is the number of feature variables. See the figure below:

The cube in the figure above represents our data set. It has 3 dimensions and a total of 1,000 points. With today's computing power, processing 1,000 points is easy, but at a larger scale we would run into problems. However, just by looking at the data from a two-dimensional point of view, for example from one face of the cube, we can see that it is easy to separate all the colors. With dimensionality reduction, we project the 3D data onto a 2D plane, which effectively reduces the number of points we need to compute on to 100, a large computational saving.

Another way to reduce dimensionality is feature pruning. With this approach, we remove any features that we can see will be unimportant to the analysis. For example, after exploring a data set we may find that, of 10 features, 7 have a high correlation with the output and the other 3 have a very low correlation. Those 3 low-correlation features are probably not worth the computation, and we may be able to remove them from the analysis without hurting the output.
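Feature pruning by correlation can be sketched in a few lines. This is an illustration only: the function names, the toy data, and the 0.3 threshold are all assumptions, and Pearson correlation captures only linear relationships with the output.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def prune_features(features, target, threshold=0.3):
    """Keep only the features whose |correlation| with the target reaches the threshold."""
    return {
        name: column
        for name, column in features.items()
        if abs(pearson(column, target)) >= threshold
    }

# Hypothetical toy data: two informative features and one unrelated one.
target = [1, 2, 3, 4, 5, 6]
features = {
    "strong": [2, 4, 6, 8, 10, 12],   # rises with the target
    "inverse": [6, 5, 4, 3, 2, 1],    # falls as the target rises
    "noise": [3, 1, 2, 2, 1, 3],      # no linear relation to the target
}

kept = prune_features(features, target)
print("kept features:", list(kept))
```

Taking the absolute value keeps strongly negative correlations as well, since a feature that moves opposite to the output is just as informative as one that moves with it.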

The most common statistical technique for dimensionality reduction is PCA (principal component analysis). It essentially creates new vector representations of the features, the principal components, and ranks them by how much of the variation in the data each one accounts for. PCA can be used for both of the dimensionality reduction approaches described above.
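To make the idea concrete, here is a minimal sketch of PCA for the special case of 2-D points reduced to 1-D, using the closed-form eigen-decomposition of a 2x2 covariance matrix. The function names and the toy points are assumptions for illustration, and the sketch assumes the points are not all identical. Real projects would normally rely on a library implementation such as scikit-learn's `PCA` class rather than hand-written code.

```python
import math

def first_principal_component(points):
    """Return (direction, explained_ratio) for a list of 2-D points.

    direction is the unit vector along which the data varies most;
    explained_ratio is the share of total variance captured along it.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
    largest = (sxx + syy) / 2 + math.hypot((sxx - syy) / 2, sxy)
    # Matching eigenvector; fall back to an axis when the features are uncorrelated.
    if sxy != 0:
        vx, vy = sxy, largest - sxx
    elif sxx >= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), largest / (sxx + syy)

def project(points, direction):
    """Reduce 2-D points to 1-D coordinates along the given direction."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    dx, dy = direction
    return [(x - mean_x) * dx + (y - mean_y) * dy for x, y in points]

# Hypothetical points lying exactly on the line y = 2x.
points = [(0, 0), (1, 2), (2, 4), (3, 6)]
direction, explained = first_principal_component(points)
reduced = project(points, direction)
print("direction:", direction, "explained variance ratio:", explained)
```

Because these toy points lie on a single line, one component captures essentially all of the variance, so the second dimension can be dropped without losing information.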

Oversampling and undersampling

Oversampling and undersampling are techniques used for classification problems. For example, we might have 2,000 samples for class 1 but only 200 for class 2. Such an imbalance throws off many of the machine learning techniques we try to use to model the data and make predictions. Oversampling and undersampling can deal with this situation. See the figure below:

On both the left and the right side of the figure above, the blue class has far more samples than the orange class. In this case, we have 2 preprocessing options that can help when training machine learning models.

Undersampling means selecting only some of the data from the majority class, using only as many samples as the minority class has. The selection should preserve the probability distribution of the class. We simply even out the data set by taking fewer samples.

Oversampling means creating copies of minority-class samples so that the minority class has as many samples as the majority class. The copies are made so that the distribution of the minority class is preserved. We simply even out the data set without collecting any more data.
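Both options can be sketched with random sampling from the standard library. This is an illustration under simple assumptions: the function names and the placeholder samples are hypothetical, selection is uniformly random, and the majority class is assumed to be the larger one. The class sizes (2,000 and 200) follow the example in the text.

```python
import random

def undersample(majority, minority, rng):
    """Randomly keep only as many majority samples as there are minority samples."""
    return rng.sample(majority, len(minority)), list(minority)

def oversample(majority, minority, rng):
    """Add random copies of minority samples until both classes are the same size."""
    copies = rng.choices(minority, k=len(majority) - len(minority))
    return list(majority), list(minority) + copies

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Placeholder samples: 2,000 for class 1 and 200 for class 2.
majority = [("class_1", i) for i in range(2000)]
minority = [("class_2", i) for i in range(200)]

under_major, under_minor = undersample(majority, minority, rng)
over_major, over_minor = oversample(majority, minority, rng)

print("undersampled sizes:", len(under_major), len(under_minor))
print("oversampled sizes: ", len(over_major), len(over_minor))
```

Sampling uniformly at random is what keeps each class's distribution roughly intact: every original sample is equally likely to be kept (undersampling) or copied (oversampling).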

Bayesian Statistics

To fully understand why we use Bayesian statistics, we first need to understand where frequency statistics falls short. Frequency statistics is the type of statistics most people think of when they hear the word "probability". It applies mathematics to analyze the probability of an event occurring, where the only data we compute on is prior data.

Suppose I gave you a die and asked you for the probability of rolling a 6. Most people would say one in six.

But what if someone told you that the specific die you were given is loaded so that it always lands on 6? Because frequency analysis only considers prior data, the evidence that the die is loaded is not taken into account.

Bayesian statistics does take this evidence into account. We can illustrate it with Bayes' rule:
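In its standard form, Bayes' rule reads as follows, where H is the hypothesis and E is the observed evidence (symbols chosen to match the P(H) and P(E|H) used in the text):

```latex
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}
```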

In the equation, P(H) is basically our frequency analysis: given the prior data, the probability of the event occurring. P(E|H) is called the likelihood: the probability of observing the evidence, given that the hypothesis from the frequency analysis is true. For example, if you set out to roll the die 10,000 times and the first 1,000 rolls all came up 6, you would become very confident that the die is loaded.

If the frequency analysis is very good, it will carry some weight in saying that our guess of 6 is correct. At the same time, we take into account the evidence that the die is loaded, whether it is true or not, based on both its own prior probability and the frequency analysis. As the layout of the equation shows, Bayesian statistics takes everything into account. Use it whenever you think your prior data will not be a good representation of your future data and results.
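The loaded-die example can be worked through numerically with Bayes' rule. The sketch below rests on assumptions that are not in the original: a loaded die always shows 6, a fair die shows 6 with probability 1/6, and the prior belief that the die is loaded is a hypothetical 0.1%. The function name `posterior_loaded` is likewise made up for illustration.

```python
def posterior_loaded(prior_loaded, sixes_in_a_row):
    """P(die is loaded | n sixes in a row), computed with Bayes' rule.

    Assumes a loaded die always shows 6 and a fair die shows 6
    with probability 1/6 on each roll.
    """
    likelihood_loaded = 1.0 ** sixes_in_a_row    # P(E | loaded)
    likelihood_fair = (1 / 6) ** sixes_in_a_row  # P(E | fair)
    # P(E): total probability of the evidence under both hypotheses.
    evidence = (likelihood_loaded * prior_loaded
                + likelihood_fair * (1 - prior_loaded))
    return likelihood_loaded * prior_loaded / evidence

prior = 0.001  # hypothetical prior: 1 die in 1,000 is loaded
for rolls in (0, 1, 3, 5, 10):
    print(rolls, "sixes in a row ->", posterior_loaded(prior, rolls))
```

Even with a very small prior, the posterior climbs quickly as sixes accumulate, which is the confidence described in the text after a long run of sixes.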