Data analysis Seaborn visualization (for personal use)
2022-07-06 03:27:00 【Up and down black】
Reference
[Python] Master seaborn visualization in one hour (bilibili)
Contents
2、 Observe the distribution of variables
3、 Figure-level functions have FacetGrid features
Two 、 Relationship analysis of numerical variables
2、sns.lmplot(): Analyze the linear relationship between two variables
3、sns.displot(): Plot the joint distribution of two variables
4、sns.jointplot(): Plot the joint distribution and the marginal distributions of two variables
(2)JointGrid, the more flexible version of jointplot: customize the plotting functions through g.plot
(3)sns.pairplot(): Plot the pairwise joint distributions of all numerical variables
(4)PairGrid, the more flexible version of pairplot: customize the plotting functions through g.map
(5)data.corr() + sns.heatmap(): Plot the pairwise correlation coefficients of all numerical variables
Three 、 Analysis of category variables
1、 Distribution of category variables: sns.countplot(), similar to sns.histplot()
2、 The relationship between category variables and numerical variables
(2) The value range of numerical variables in different categories: boxplot, boxenplot
Four 、 Custom plotting functions in FacetGrid and PairGrid
seaborn's plotting functions can usually be divided into figure-level functions (shown in purplish red in the overview diagram) and axes-level functions (shown in sky blue).
Each figure-level function (relplot, displot, catplot) wraps the corresponding axes-level functions and exposes them through a single interface, selected with the kind parameter.
One 、 Variable distribution
When you get a data set, first check the distribution of its variables:
- What is the value range of each variable, and are there outliers?
- Is the distribution approximately normal? If not, is it skewed? Is it bimodal (bimodality)?
- If the data set is split by a category variable, does the distribution differ much between the subsets?
1、 View outliers
(1) Use a data set bundled with seaborn
import seaborn as sns
print(sns.get_dataset_names())  # list the available example data sets
penguin_df = sns.load_dataset('penguins')
penguin_df
(2)sns.boxplot(): View the value range of a numeric variable and check for outliers (box plot)
sns.boxplot(data=penguin_df, x='bill_length_mm')  # pass the data and the variable name
In the figure, the line in the middle of the box is the median. The left and right edges of the box are the quartiles (about 25% of the values are below 39 and about 75% are below 49). The two whiskers outside the box extend to the smallest and largest values within a reasonable range, which by default reaches 1.5 times the interquartile range beyond the box edges. Points beyond the whiskers may be outliers and need to be analyzed case by case.
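The "reasonable range" can be made concrete. Below is a minimal sketch of the 1.5 × IQR rule that box plots conventionally use for their whiskers. The helper name iqr_fences and the sample values are made up for illustration and are not part of seaborn.

```python
import numpy as np

def iqr_fences(values, whis=1.5):
    """Return (q1, q3, lower_fence, upper_fence) for the whis * IQR rule."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]          # ignore missing values
    q1, q3 = np.percentile(values, [25, 75])    # box edges
    iqr = q3 - q1                               # interquartile range
    return q1, q3, q1 - whis * iqr, q3 + whis * iqr

# Hypothetical bill lengths with one suspicious value
sample = np.array([39.0, 41.5, 44.0, 46.5, 49.0, 80.0])
q1, q3, low, high = iqr_fences(sample)

# Points outside the fences are the ones a box plot draws individually
outliers = sample[(sample < low) | (sample > high)]
print(q1, q3, low, high, outliers)
```

Whether a flagged point is really an error still depends on the data, so the fences are a screening tool rather than a verdict.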
1、boxplot belongs to catplot (category variable analysis), so a box plot can also be drawn with catplot:
sns.catplot(data=penguin_df, x='bill_length_mm', kind='box')  # kind must be specified
2、 You can also put the box plots of all numeric variables into one figure, but the result is often poor because the variables are not on the same order of magnitude:
sns.boxplot(data=penguin_df)
(3) Observe outliers
tip_df = sns.load_dataset('tips')  # the tips data set bundled with seaborn
sns.boxplot(data=tip_df)
The points circled in red in the figure may be outliers (this has to be judged case by case).
2、 Observe the distribution of variables
(1)sns.displot(): View the distribution of variables
sns.displot(data=penguin_df, x='bill_length_mm')
# Set bins to control how the histogram is divided
sns.displot(data=penguin_df, x='bill_length_mm', bins=50)
If the bins are too coarse, features of the distribution may be hidden; if they are too fine, noise may be over-interpreted. The figure above shows a bimodal distribution.
1、displot can also analyze category variables:
sns.displot(data=penguin_df, x='species')
2、 Compare with countplot for category variables:
sns.countplot(data=penguin_df, x='species')
sns.displot(data=penguin_df, x='species', hue='species', shrink=0.7)
displot can distinguish colors through the hue parameter and narrow the bars through shrink.
(2)sns.displot(): View the kde curve
A kernel function is used to estimate the distribution of the data; seaborn uses a Gaussian kernel.
- Method 1 :
sns.displot(data=penguin_df, x='bill_length_mm', kind='kde')
- Method 2 :
sns.kdeplot(data=penguin_df, x='bill_length_mm')
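To see what the kde curve represents, here is a minimal hand-written sketch of a Gaussian kernel density estimate. The function name gaussian_kde_1d, the fixed bandwidth of 1.0 and the synthetic two-group sample are assumptions for illustration; seaborn chooses its bandwidth automatically, so its curve will not match this one exactly.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Synthetic stand-in for a bimodal variable (two groups with different means)
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(39, 2, 150), rng.normal(48, 2, 150)])

grid = np.linspace(25, 62, 400)
density = gaussian_kde_1d(samples, grid, bandwidth=1.0)

# A density should integrate to roughly 1 (simple Riemann sum)
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A smaller bandwidth makes the curve follow individual points more closely, and a larger one smooths the two peaks together.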
A rugplot takes up almost no space, so it can be overlaid directly on the displot figure:
sns.displot(data=penguin_df, x='bill_length_mm', kind='kde', rug=True)
Although the kde curve makes the distribution easier to see, the curve may extend beyond the value range of the data at the edges of the figure.
- Solution 1 :( Set cut=0)
sns.displot(data=penguin_df, x='bill_length_mm', kind='kde', rug=True, cut=0)
But this truncates the curve at the smallest and largest observations, which may misrepresent the distribution near the edges.
- Solution 2 :( Overlay the kde on the histogram)
sns.displot(data=penguin_df, x='bill_length_mm', kde=True)
(3) Analyze bimodal distribution
sns.displot(data=penguin_df, x='bill_length_mm', kind='kde', hue='species')
As the figure shows, the kde of bill length differs between species, and there is a clear gap between them. Superimposing them therefore produces the bimodal distribution seen earlier.
(4) Analyze skewness
Skewed data can be log-transformed.
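The effect of a log transform on skewness can be checked numerically. This is a minimal sketch on synthetic right-skewed data; the skewness helper and the lognormal sample are assumptions for illustration and do not come from the original notes.

```python
import numpy as np

def skewness(x):
    """Sample skewness: third central moment over the cubed standard deviation."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float((centered**3).mean() / (centered**2).mean() ** 1.5)

# Synthetic right-skewed variable (for example, something like prices)
rng = np.random.default_rng(1)
right_skewed = rng.lognormal(mean=3.0, sigma=0.6, size=2000)

skew_before = skewness(right_skewed)
skew_after = skewness(np.log(right_skewed))   # log transform pulls in the long tail
print(round(skew_before, 2), round(skew_after, 2))
```

When plotting, passing log_scale=True to sns.displot applies the same idea to the axis without changing the data frame.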
(5) Empirical cumulative distribution function (ecdfplot)
sns.displot(data=penguin_df, x='bill_length_mm', kind='ecdf')
In the figure, the proportion corresponding to 55 is 0.97, which means that 97% of the values are at or below 55. It conveys information similar to the box plot, only in a different form.
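Reading a value off the ecdf curve amounts to counting the share of observations at or below a threshold. A minimal sketch, with made-up sample values and hypothetical helper names:

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the cumulative proportion at each of them."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

def proportion_at_or_below(values, threshold):
    """Height of the ecdf curve at the given threshold."""
    values = np.asarray(values, dtype=float)
    return float(np.mean(values <= threshold))

# Made-up bill lengths, only to show the calculation
sample = [36.0, 39.5, 41.0, 44.5, 46.0, 48.5, 50.0, 52.0, 54.0, 58.0]
x, y = ecdf(sample)
share = proportion_at_or_below(sample, 55)
print(share)
```

The quartiles drawn by the box plot are the x values where this curve crosses 0.25, 0.5 and 0.75, which is why the two plots carry similar information.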
3、 Figure-level functions have FacetGrid features
FacetGrid assigns category variables to the rows and columns, splits the data into subsets by category, and shows the distribution of a variable on each subset. (This is equivalent to plotting the conditional distribution of the variable.)
sns.displot(data=penguin_df, x='bill_length_mm', row='sex', col='island', kind='kde', hue='species')
The row is set to sex (two categories) and the column to island (three categories), and the kde curve of bill length (bill_length_mm) is drawn in each facet.
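Each facet corresponds to one group of a pandas groupby over the row and column variables. The sketch below uses a tiny made-up data frame (not the real penguins data) to show the subsets that the grid would draw.

```python
import pandas as pd

# Made-up rows that only mimic the column names of the penguins data set
df = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "island": ["Biscoe", "Biscoe", "Dream", "Dream", "Biscoe", "Dream"],
    "bill_length_mm": [45.0, 43.5, 39.0, 37.5, 47.0, 38.5],
})

# One group per (row, col) combination, i.e. one facet each
facet_means = df.groupby(["sex", "island"])["bill_length_mm"].mean()
print(facet_means)
```

The grid simply repeats the same plot on each of these subsets, so summary statistics per group are a quick cross-check of what the facets show.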
Two 、 Relationship analysis of numerical variables
1、sns.relplot():
- Draw a scatter plot
sns.relplot(data=tip_df, x='total_bill', y='tip', hue='time', style='time', markers=['o', '^'])
markers customizes the style of the points in the plot.
sns.relplot(data=tip_df, x='total_bill', y='tip', hue='size', size='size')
When hue is a numeric variable with many values, a gradient palette is used; size can also scale the points.
- Draw a line plot
sns.relplot(data=tip_df, x='total_bill', y='tip', kind='line')
The figure above is a little messy, because line plots are better suited to time series data, such as stock price fluctuations.
# Stock price analysis
import numpy as np
import pandas as pd
stock_df = pd.DataFrame(dict(time=np.arange(500), price=np.random.randn(500).cumsum() + 50))
sns.relplot(data=stock_df, x='time', y='price', kind='line')
500 random numbers are generated and accumulated with cumsum, which gives a continuously changing series that simulates a stock price starting near 50.
2、sns.lmplot(): Analyze the linear relationship between two variables
The earlier scatter plot of tip_df shows that the two variables are correlated to some degree, so lmplot can be used to draw a regression line.
sns.lmplot(data=tip_df, x='total_bill', y='tip')
regplot draws the same regression line as lmplot:
sns.regplot(data=tip_df, x='total_bill', y='tip')
Draw a residual plot with residplot:
sns.residplot(data=tip_df, x='total_bill', y='tip')
If the fit is good, the residuals should be scattered randomly around zero. Here the residuals show a diverging pattern, which suggests that the linear fit does not capture the whole relationship between the two variables.
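What a residual plot shows can be reproduced with a plain least-squares fit. The sketch below uses synthetic data whose noise grows with x (an assumption chosen to imitate a diverging residual pattern, not the real tips data) and compares the residual spread in the lower and upper halves of x.

```python
import numpy as np

# Synthetic data: linear trend plus noise that grows with total_bill
rng = np.random.default_rng(2)
total_bill = rng.uniform(5, 50, size=400)
tip = 1.0 + 0.1 * total_bill + rng.normal(0, 0.03 * total_bill)

# Ordinary least-squares line, then residual = observed - fitted
slope, intercept = np.polyfit(total_bill, tip, deg=1)
residuals = tip - (slope * total_bill + intercept)

# Compare the residual spread for small and large bills
cutoff = np.median(total_bill)
spread_low = residuals[total_bill < cutoff].std()
spread_high = residuals[total_bill >= cutoff].std()
print(round(slope, 3), round(spread_low, 3), round(spread_high, 3))
```

A clearly larger spread on one side is the numerical counterpart of the fan shape in the residual plot.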
- Like relplot, lmplot can also add a category variable
sns.lmplot(data=tip_df, x='total_bill', y='tip', hue='time')
- lmplot also has FacetGrid features
sns.lmplot(data=tip_df, x='total_bill', y='tip', row='smoker', col='time', hue='time')
3、sns.displot(): Plot the joint distribution of two variables
Histogram form :
sns.displot(data=penguin_df, x='bill_length_mm', y='bill_depth_mm')
kde Curve form :
sns.displot(data=penguin_df, x='bill_length_mm', y='bill_depth_mm', kind='kde')
thresh (between 0 and 1) controls the range that is displayed, and levels controls the density of the contour lines.
displot can also draw the joint distribution of category variables:
sns.displot(data=penguin_df, x='island', y='species')
As the figure shows, Gentoo penguins are found only on Biscoe island. On Biscoe island Gentoo make up the majority, but there are also some Adelie.
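The same joint distribution of two category variables can be read as a table of counts. A minimal sketch with pd.crosstab on a few made-up rows (the counts are illustrative only, not the real penguins data):

```python
import pandas as pd

# Made-up rows that only mimic the island and species columns
toy = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Biscoe", "Dream", "Dream", "Torgersen"],
    "species": ["Gentoo", "Gentoo", "Adelie", "Adelie", "Chinstrap", "Adelie"],
})

# Rows are islands, columns are species, cells are counts
counts = pd.crosstab(toy["island"], toy["species"])
print(counts)
```

A zero cell in the table corresponds to an empty tile in the displot figure.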
4、sns.jointplot(): Plot the joint distribution and the marginal distributions of two variables
(1)sns.jointplot()
By default the joint distribution is drawn as a scatter plot; this can be changed through kind, which is one of ['scatter', 'hist', 'hex', 'kde', 'reg', 'resid'].
sns.jointplot(data=tip_df, x='total_bill', y='tip')
You can also use hue to add a category variable.
Like displot, jointplot can also plot two category variables.
(2)JointGrid, the more flexible version of jointplot: customize the plotting functions through g.plot
g = sns.JointGrid(data=tip_df, x='total_bill', y='tip')
g.plot(sns.histplot, sns.boxplot)  # histogram for the joint plot, box plots for the margins
The customization can also be more specific:
g = sns.JointGrid(data=tip_df, x='total_bill', y='tip')
g.plot_joint(sns.kdeplot)  # joint distribution
g.plot_marginals(sns.histplot, kde=True)  # marginal distributions
(3)sns.pairplot(): Plot the pairwise joint distributions of all numerical variables
sns.pairplot(data=tip_df, kind='kde')
When there are many variables, you can select only the key variables you need for the analysis.
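(4)PairGrid, the more flexible version of pairplot: customize the plotting functions through g.map
car_df = sns.load_dataset('car_crashes')  # assumed source of car_df: the car_crashes data set bundled with seaborn
g = sns.PairGrid(data=car_df, x_vars=['total', 'speeding', 'alcohol'], y_vars=['total', 'speeding', 'alcohol'])
g.map_upper(sns.scatterplot)  # upper triangle: scatter plots
g.map_diag(sns.histplot, kde=True)  # diagonal: histograms with kde
g.map_lower(sns.regplot)  # lower triangle: regression plots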
(5)data.corr() + sns.heatmap(): Plot the pairwise correlation coefficients of all numerical variables
First, compute the correlation coefficient of each pair of variables:
car_cor = car_df.select_dtypes('number').corr()  # keep numeric columns only; recent pandas raises an error on text columns
car_cor
Then show the correlation coefficients as a heat map:
sns.heatmap(car_cor, cmap='Blues', annot=True, fmt='.2f', linewidth=0.5)
Here annot displays the value in each cell, and fmt='.2f' formats it as a floating-point number with two decimal places.
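The matrix passed to heatmap is an ordinary data frame of Pearson coefficients. The sketch below builds it on a small made-up frame (the values are illustrative, not the real car crash data) and checks one cell against np.corrcoef.

```python
import numpy as np
import pandas as pd

# Made-up values; the text column stands in for a state abbreviation
toy = pd.DataFrame({
    "total": [18.8, 18.1, 18.6, 22.4, 12.0, 13.6],
    "speeding": [7.3, 7.4, 6.5, 4.0, 4.2, 5.0],
    "alcohol": [5.6, 4.5, 5.2, 5.8, 3.4, 3.8],
    "abbrev": ["A", "B", "C", "D", "E", "F"],
})

# Drop non-numeric columns before computing correlations
corr = toy.select_dtypes("number").corr()

# Same coefficient computed directly with NumPy
manual = np.corrcoef(toy["total"], toy["alcohol"])[0, 1]
print(corr.round(2))
print(round(manual, 2))
```

The matrix is symmetric with ones on the diagonal, so the heat map repeats each pair twice.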
Three 、 Analysis of category variables
1、 Distribution of category variables: sns.countplot(), similar to sns.histplot()
sns.catplot(data=tip_df, x='time', kind='count')
2、 The relationship between category variables and numerical variables
(1) Estimate of the mean / median of a numerical variable in each category: barplot, pointplot
sns.catplot(data=penguin_df, x='species', y='bill_length_mm', kind='bar', estimator=np.median, hue='island')
sns.catplot(data=penguin_df, x='species', y='bill_length_mm', kind='point')
(2) The value range of numerical variables in different categories :boxplot, boxenplot
sns.catplot(data=penguin_df, x='species', y='bill_length_mm', kind='box')
boxplot is usually enough. For large data sets with a wide value range, boxenplot can be used: it draws a series of progressively smaller boxes for additional quantiles, which shows more of the shape of the tails.
(3) Distribution of a numerical variable in each category: stripplot, swarmplot, violinplot
The strip plot combines features of the scatter plot and the histogram:
sns.catplot(data=penguin_df, x='species', y='bill_length_mm', kind='strip', jitter=0.3)
jitter sets the width of the strip; values between 0 and 1 are typical.
swarmplot:
violinplot:
A swarmplot can be overlaid on a violinplot:
sns.catplot(data=penguin_df, x='species', y='bill_length_mm', kind='violin')
sns.swarmplot(data=penguin_df, x='species', y='bill_length_mm', color='w')  # the overlay must use the axes-level function so it draws on the existing axes
Four 、 Custom plotting functions in FacetGrid and PairGrid
1、FacetGrid
g = sns.FacetGrid(data=tip_df, row='time', col='smoker')  # row and col must be category variables; this only draws the empty grid
# Custom part
g.map(sns.kdeplot, 'tip')
Plot the joint distribution of two variables:
g = sns.FacetGrid(data=tip_df, row='time', col='smoker')
# Pass two variable names to plot their joint distribution in each facet
g.map(sns.scatterplot, 'total_bill', 'tip')
2、PairGrid
Usage is similar to pairplot.
g = sns.PairGrid(data=penguin_df,hue='species')
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.scatterplot)