[data processing] boxplot drawing
2022-07-28 02:33:00 【HoveXb】
1. Overview
The box plot was invented in 1977 by the American statistician John Tukey. It contains five basic elements:
- Minimum (Q0, the 0th quartile): the minimum value after removing outliers
- Maximum (Q4): the maximum value after removing outliers
- Median (Q2): the value at the 50% position when all sample values are sorted in ascending order
- First quartile (Q1): also called the "lower quartile"; the value at the 25% position when all sample values are sorted in ascending order
- Third quartile (Q3): also called the "upper quartile"; the value at the 75% position when all sample values are sorted in ascending order
In addition to these five basic elements, the interquartile range (IQR) is also often used to construct a box plot. It is defined as the difference between the third quartile and the first quartile, that is, IQR = Q3 - Q1.
A box plot consists of a box and whiskers. The upper and lower edges of the box are Q3 and Q1, and the box is divided by the median. There are many variants for the upper and lower ends of the whiskers.
One standard definition uses the maximum and minimum values in the data set.
Another uses upper limit = Q3 + 1.5 IQR and lower limit = Q1 - 1.5 IQR as the whisker bounds; points outside these bounds are considered outliers.
Other ways to define the whisker bounds include:
- The minimum and the maximum value of the data set
- One standard deviation above and below the mean of the data set
- The 9th percentile and the 91st percentile of the data set
- The 2nd percentile and the 98th percentile of the data set
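The 1.5 IQR rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `tukey_whiskers` and the sample data are hypothetical, and the quartiles are taken from the standard library's `statistics.quantiles` with `method="inclusive"`, which is one of several quartile conventions and may differ from what a given plotting library uses.

```python
import statistics

def tukey_whiskers(data, k=1.5):
    """Return (lower_whisker, upper_whisker, outliers) under the k * IQR rule."""
    values = sorted(data)
    # Assumption: quartiles use the "inclusive" interpolation convention.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr
    inside = [v for v in values if lower_limit <= v <= upper_limit]
    outliers = [v for v in values if v < lower_limit or v > upper_limit]
    # The whiskers end at the most extreme points still inside the limits.
    return inside[0], inside[-1], outliers

sample = [-5, 2, 3, 4, 5, 6, 7, 13]  # illustrative data
low, high, outliers = tukey_whiskers(sample)
print(low, high, outliers)  # 2 7 [-5, 13]
```

Here Q1 = 2.75 and Q3 = 6.25, so the limits are -2.5 and 11.5; the values -5 and 13 fall outside them and are reported as outliers, while the whiskers stop at 2 and 7.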

2. Calculation
Prerequisite: sort the data in ascending order.
Let the data length be n; positions below are counted from 1.
- Median Q2: the median in the usual statistical sense (the middle value when n is odd, the average of the two middle values when n is even).
- First quartile Q1: compute the first-quartile position pos, then compute the value at that position. There is no single standard for the first-quartile position, but two methods are common. Mode one: $pos = 1 + \frac{n-1}{4}$; mode two: $pos = \frac{n+1}{4}$. When pos is not an integer, the quartile value is obtained by simple linear interpolation between the two neighboring values.
- Third quartile Q3: compute the third-quartile position pos, then compute the value at that position. There is no single standard for the third-quartile position either, but two methods are common. Mode one: $pos = 1 + \frac{3(n-1)}{4}$; mode two: $pos = \frac{3(n+1)}{4}$. When pos is not an integer, the quartile value is obtained by simple linear interpolation between the two neighboring values.
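The two position formulas and the interpolation step can be written out as a small helper. This is a minimal sketch, not a reference implementation: the function name `quartile`, its `mode` parameter and the sample list are hypothetical, and clamping the position to the range 1..n is an added assumption to keep mode two valid for very short inputs.

```python
import math

def quartile(sorted_data, k, mode=1):
    """Return the k-th quartile (k = 1, 2, 3) of data already sorted ascending.

    mode=1 uses pos = 1 + k*(n-1)/4; mode=2 uses pos = k*(n+1)/4.
    Positions are 1-based, as in the formulas above.
    """
    n = len(sorted_data)
    if mode == 1:
        pos = 1 + k * (n - 1) / 4
    else:
        pos = k * (n + 1) / 4
    # Assumption: clamp the position so it never leaves the data range.
    pos = min(max(pos, 1), n)
    lower = math.floor(pos)   # 1-based index of the left neighbor
    frac = pos - lower        # fractional distance toward the right neighbor
    if lower >= n:
        return sorted_data[n - 1]
    left = sorted_data[lower - 1]
    right = sorted_data[lower]
    # Simple linear interpolation between the two neighbors.
    return left + frac * (right - left)

data = [10, 20, 30, 40, 50]  # illustrative, already sorted
print(quartile(data, 1, mode=1), quartile(data, 3, mode=1))  # 20.0 40.0
print(quartile(data, 1, mode=2), quartile(data, 3, mode=2))  # 15.0 45.0
```

For k = 2 both modes reduce to pos = (n + 1) / 2, which is the usual median position, so the same helper also covers Q2.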
3. Example (using mode one):
The data is num = [1,2,3,4,5,6,7,8], with data length n = 8. Indices below count from 1.
- Median Q2 = (num[4] + num[5]) / 2 = (4 + 5) / 2 = 4.5
- First quartile Q1: $pos = 1 + \frac{n-1}{4} = 2.75$; $Q1 = num[2] + 0.75 \times (num[3] - num[2]) = 2 + 0.75 \times (3 - 2) = 2.75$
- Third quartile Q3: $pos = 1 + \frac{3(n-1)}{4} = 6.25$; $Q3 = num[6] + 0.25 \times (num[7] - num[6]) = 6 + 0.25 \times (7 - 6) = 6.25$
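As a cross-check of the hand calculation, Python's standard-library `statistics.quantiles` (Python 3.8 and later) offers two interpolation conventions. For this data its `"inclusive"` method appears to line up with mode one and its `"exclusive"` method with mode two; treat that mapping as an observation on this example rather than a statement about every library.

```python
import statistics

num = [1, 2, 3, 4, 5, 6, 7, 8]

# "inclusive" places the k-th quartile at 1 + k*(n-1)/4 (mode one).
mode_one = statistics.quantiles(num, n=4, method="inclusive")
# "exclusive" places the k-th quartile at k*(n+1)/4 (mode two).
mode_two = statistics.quantiles(num, n=4, method="exclusive")

print(mode_one)  # [2.75, 4.5, 6.25]
print(mode_two)  # [2.25, 4.5, 6.75]
```

The median is the same under both conventions; only Q1 and Q3 move, which is why two tools can draw slightly different boxes for the same data.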
Code:
The median, first quartile Q1 and third quartile Q3 are calculated as shown above. In pandas, the whiskers are drawn as follows: outliers are identified using Q3 + 1.5 IQR and Q1 - 1.5 IQR, and the maximum and minimum of the remaining values are taken as the upper and lower whisker ends:
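import pandas as pd
import matplotlib.pyplot as plt

num = [1, 2, 3, 4, 5, 6, 7, 8]
df = pd.DataFrame(num)
boxplot = df.boxplot()  # draw the box plot; no value lies outside the 1.5 IQR limits
print(df.describe())    # the 25%, 50% and 75% rows are Q1, Q2 and Q3
plt.show()              # display the figure when run as a script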

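import pandas as pd
import matplotlib.pyplot as plt

# Q1 = 2.75 and Q3 = 6.25, so the limits are -2.5 and 11.5; -5 and 13 are outliers
num = [-5, 2, 3, 4, 5, 6, 7, 13]
df = pd.DataFrame(num)
boxplot = df.boxplot()  # whiskers end at 2 and 7; outliers are drawn as separate points
print(df.describe())    # min and max here still include the outliers
plt.show()              # display the figure when run as a script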

Reference: wiki