Thoroughly understand box plot analysis
2022-08-04 06:05:00 【I'm fine please go away thank you】
I. Box plot
A box plot (also written boxplot, and also known as a box-and-whisker plot) is a statistical chart used to display the dispersion of a set of data. It is named for its box-like shape. It is used in many fields, most commonly in quality management, to quickly identify outliers.
The biggest advantage of the box plot is that the statistics it is built on are largely unaffected by outliers, so it depicts the spread of the data accurately and stably, which also makes it useful for data cleaning.
To understand the box plot, you first need to understand...
II. The five "numbers"
Let's take a sorted sequence of numbers as an example: 12, 15, 17, 19, 20, 23, 25, 28, 30, 33, 34, 35, 36, 37, and use it to explain the five major "numbers".
1. Lower quartile Q1
(1) Determine the position of each quartile: position of Qi = i(n+1)/4, where i = 1, 2, 3 and n is the number of items in the sequence.
(2) Compute the quartile from its position, interpolating between the two neighboring terms when the position is not a whole number.
Example:
The position of Q1=(14+1)/4=3.75,
Q1=0.25×third term+0.75×fourth term=0.25×17+0.75×19=18.5;
2. Median (second quartile) Q2
The median is the middle number of a group of numbers arranged from smallest to largest. If the group contains an even number of items, the median is the average of the two middle numbers.
Example:
The position of Q2=2(14+1)/4=7.5,
Q2=0.5×the seventh term+0.5×the eighth term=0.5×25+0.5×28=26.5
3. Upper quartile Q3
It is calculated in the same way as the lower quartile.
Example:
The position of Q3=3(14+1)/4=11.25,
Q3=0.75×eleventh term+0.25×twelfth term=0.75×34+0.25×35=34.25.
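The position rule used in the three examples above can be sketched in a few lines of Python. This is only an illustration of the i(n+1)/4 rule with linear interpolation; the function name `quartile` is made up for this sketch, and other tools may use different quartile conventions and give slightly different values.

```python
def quartile(values, i):
    """Return the i-th quartile (i = 1, 2, 3) using position = i * (n + 1) / 4."""
    data = sorted(values)
    n = len(data)
    position = i * (n + 1) / 4         # 1-based position, may be fractional
    lower_index = int(position)        # 1-based index of the term just below
    fraction = position - lower_index  # how far to move toward the next term
    # Clamp so that very short sequences do not index out of range.
    if lower_index < 1:
        return float(data[0])
    if lower_index >= n:
        return float(data[-1])
    low, high = data[lower_index - 1], data[lower_index]
    # Same as (1 - fraction) * low + fraction * high.
    return low + fraction * (high - low)


data = [12, 15, 17, 19, 20, 23, 25, 28, 30, 33, 34, 35, 36, 37]
q1, q2, q3 = (quartile(data, i) for i in (1, 2, 3))
print(q1, q2, q3)  # 18.5 26.5 34.25
```

The printed values match the hand calculations for Q1, Q2 and Q3.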
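4. Upper limit
The upper limit is the largest value that still counts as non-anomalous; anything above it is treated as an outlier.
To compute it, we first need the interquartile range, which is the difference between the upper and lower quartiles.
Interquartile range IQR=Q3-Q1, then upper limit=Q3+1.5×IQR
Example:
IQR=34.25-18.5=15.75,
Upper limit=34.25+1.5×15.75=57.875.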
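5. Lower limit
The lower limit is the smallest value that still counts as non-anomalous; anything below it is treated as an outlier.
Lower limit=Q1-1.5×IQR
Example:
With IQR=34.25-18.5=15.75, lower limit=18.5-1.5×15.75=-5.125.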
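The two limits can be computed with a short Python sketch. The helper names `fences` and `outliers` are invented for this illustration, and the extra value 80 is a made-up observation added only to show a point being flagged. The sketch relies on `statistics.quantiles`, whose default "exclusive" method follows the same (n+1)-based position rule as the examples above.

```python
import statistics


def fences(values, k=1.5):
    """Return (lower_limit, upper_limit) as Q1 - k*IQR and Q3 + k*IQR."""
    # The default "exclusive" method uses the (n + 1)-based position rule.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outliers(values, k=1.5):
    """Return the values that fall outside the two limits."""
    lower, upper = fences(values, k)
    return [v for v in values if v < lower or v > upper]


data = [12, 15, 17, 19, 20, 23, 25, 28, 30, 33, 34, 35, 36, 37]
print(fences(data))    # (-5.125, 57.875)
print(outliers(data))  # []

# Hypothetical extra observation, added only to show an outlier being flagged.
print(outliers(data + [80]))  # [80]
```

For the example sequence every value lies between the two limits, so no outliers are reported.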
III. The value of the box plot
1. Intuitively identify outliers in data batches
Identifying outliers has come up several times already. The boxplot's criterion for judging outliers is based on the quartiles and the interquartile range. Quartiles are resistant: up to 25% of the data can become arbitrarily distant without greatly disturbing them, so outliers do not distort the shape of the boxplot, and the outliers it identifies are relatively objective. This gives the boxplot a certain advantage in identifying outliers.
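A small experiment illustrates this resistance. In the sketch below, the two largest values of the example sequence are replaced by made-up extreme values (1000 and 5000, chosen only for illustration); the quartiles stay the same while the mean moves a great deal.

```python
import statistics

data = [12, 15, 17, 19, 20, 23, 25, 28, 30, 33, 34, 35, 36, 37]

# Hypothetical contamination: the two largest values become extreme.
contaminated = data[:-2] + [1000, 5000]

quartiles_before = statistics.quantiles(data, n=4)
quartiles_after = statistics.quantiles(contaminated, n=4)
print(quartiles_before)  # [18.5, 26.5, 34.25]
print(quartiles_after)   # [18.5, 26.5, 34.25]

# The mean, by contrast, is pulled far away by the same two values.
print(statistics.fmean(data))                    # 26.0
print(round(statistics.fmean(contaminated), 1))  # 449.4
```

Only two values are changed here because, with 14 items, Q3 depends on the 11th and 12th terms; altering those would move Q3 as well.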
2. Use boxplots to determine skewness and tail weight of data batches
For a sample from a standard normal distribution, outliers are very rare. The more outliers there are, the heavier the tails and the smaller the degrees of freedom (that is, the number of quantities that are free to vary).
Skewness indicates the direction and degree of asymmetry. If the outliers are concentrated on the side of the smaller values, the distribution is left-skewed; if they are concentrated on the side of the larger values, the distribution is right-skewed.
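The rule of thumb above can be sketched as a simple count of outliers on each side. The function name `outlier_side` and both batches are hypothetical, made up for this illustration; this is a rough heuristic in the spirit of the text, not a formal skewness measure.

```python
import statistics


def outlier_side(values, k=1.5):
    """Label a batch by the side on which its outliers concentrate."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low = sum(1 for v in values if v < q1 - k * iqr)   # outliers among small values
    high = sum(1 for v in values if v > q3 + k * iqr)  # outliers among large values
    if high > low:
        return "right-skewed"
    if low > high:
        return "left-skewed"
    return "no clear skew"


# Hypothetical batches, made up for illustration.
long_right_tail = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 20, 25]
long_left_tail = [-v for v in long_right_tail]

print(outlier_side(long_right_tail))  # right-skewed
print(outlier_side(long_left_tail))   # left-skewed
```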
3. Use boxplots to compare the shapes of several batches of data
When the boxplots of several batches of data are drawn side by side on the same number line, shape information such as the median, tail length, outliers, and spread of each batch is clearly revealed. As shown in the figure above, it can be intuitively seen that the sales of each branch in the third quarter are generally declining.
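Without a plotting library, the same comparison can be sketched by printing the summary numbers of each batch next to each other. The batch names and figures below are hypothetical and are not the data behind the figure; `summarize` is a helper invented for this sketch.

```python
import statistics


def summarize(values, k=1.5):
    """Return the five numbers a boxplot is drawn from."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return {
        "lower": q1 - k * iqr,
        "q1": q1,
        "median": median,
        "q3": q3,
        "upper": q3 + k * iqr,
    }


# Hypothetical batches, made up for illustration.
batches = {
    "Batch A": [52, 55, 58, 60, 61, 63, 70],
    "Batch B": [45, 48, 50, 53, 55, 57, 62],
    "Batch C": [38, 40, 44, 46, 47, 50, 54],
}

summaries = {name: summarize(values) for name, values in batches.items()}
for name, s in summaries.items():
    print(f"{name}: Q1={s['q1']}, median={s['median']}, Q3={s['q3']}, "
          f"limits=({s['lower']}, {s['upper']})")
```

Reading the medians from top to bottom gives the same kind of trend information that parallel boxplots show visually.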
But the box plot also has its limitations: it cannot precisely measure the skewness and tail weight of a distribution; for large batches of data, the information it conveys is coarser; and using the median to represent the overall level has its own limits.