Data analysis: Seaborn visualization (for personal use)

2022-07-06 03:27:00 Up and down black

Reference content

【Python】 Master seaborn visualization in one hour (bilibili video)


Catalog

One 、 Variable distribution

1、 View outliers

2、 Observe the distribution of variables

3、 Figure-level functions have FacetGrid features

Two 、 Relationship analysis of numerical variables

1、sns.relplot()

2、sns.lmplot(): Analyze the linear relationship between two variables

3、sns.displot(): Plot the joint distribution of two variables

4、sns.jointplot(): Plot the joint distribution and the marginal distributions of two variables

(1) sns.jointplot()

(2) Extended version of jointplot: JointGrid, which takes custom plotting functions via g.plot

(3) sns.pairplot(): Plot the pairwise joint distributions of all numerical variables

(4) Extended version of pairplot: PairGrid, which takes custom plotting functions via g.map

(5) data.corr() + sns.heatmap(): Plot the pairwise correlation coefficients of all numerical variables

Three 、 Analysis of category variables

1、 Distribution of category variables: sns.countplot(), similar to sns.histplot()

2、 The relationship between category variables and numerical variables

(1) Mean / median estimate of a numerical variable in each category: barplot, pointplot

(2) Value range of a numerical variable in each category: boxplot, boxenplot

(3) Distribution of a numerical variable in each category: stripplot, swarmplot, violinplot

Four 、 Custom plotting functions in FacetGrid and PairGrid

1、FacetGrid

2、PairGrid


seaborn's functions generally fall into two groups: figure-level functions (magenta in the original diagram) and axes-level functions (sky blue).

Each figure-level function bundles the capabilities of its corresponding axes-level functions and exposes them through a single interface (the kind parameter selects which one is used).

One 、 Variable distribution  

When you get a new data set, first check the distribution of its variables:

  • What is the value range of each variable, and are there any outliers?
  • Is the distribution approximately normal? If not, is it skewed? Is it bimodal (bimodality)?
  • If the data set is split by a category variable, does the distribution differ greatly between subsets?

1、 View outliers  

(1) Use a data set bundled with seaborn

import numpy as np
import pandas as pd
import seaborn as sns

print(sns.get_dataset_names())

penguin_df = sns.load_dataset('penguins')
penguin_df

(2) sns.boxplot(): view the value range of a numerical variable and check for outliers (box plot)

sns.boxplot(data=penguin_df, x='bill_length_mm')    # Specify the data and the variable name

 

In the figure, the line in the middle of the box is the median. The left and right edges of the box are the quartiles (25% of the values are below about 39, and 75% are below about 49). The two whiskers outside the box extend to the smallest and largest values within a reasonable range, which by default means the most extreme points no further than 1.5 times the interquartile range from the box. Points beyond the whiskers may be outliers and need to be examined case by case.
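The whisker rule can be checked by hand. The sketch below assumes the default setting of 1.5 times the interquartile range and uses a few made-up values (not the penguins data), chosen so that one point is clearly extreme.

```python
import pandas as pd

# Hypothetical sample values, chosen so that one point is clearly extreme
values = pd.Series([36.0, 38.5, 39.0, 41.2, 44.5, 46.0, 48.7, 49.5, 51.0, 75.0])

q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Fences lie 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual point
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(whisker_low, whisker_high, outliers.tolist())
```

Here only the value 75.0 falls outside the fences, so it would be drawn as a separate point beyond the upper whisker.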

1、boxplot is the axes-level counterpart of catplot (category variable analysis), so the box plot can also be drawn with catplot:

sns.catplot(data=penguin_df, x='bill_length_mm', kind='box')  # kind must be specified

2、You can also put the box plots of all variables into one figure, but the result is often poor because the variables are not on the same scale:

sns.boxplot(data=penguin_df)

(3) Observe outliers

tip_df = sns.load_dataset('tips')
sns.boxplot(data=tip_df)

The points circled in red in the original figure may be outliers (this has to be judged case by case).

2、 Observe the distribution of variables  

(1)sns.displot(): View the distribution of variables

sns.displot(data=penguin_df, x='bill_length_mm')

 

# Use bins to control how finely the histogram is divided
sns.displot(data=penguin_df, x='bill_length_mm', bins=50)

If the bins are too coarse, features of the distribution may be hidden, but bins that are too fine can lead to over-interpretation. The figure above suggests a bimodal distribution.

1、displot can also analyze category variables:

sns.displot(data=penguin_df, x='species')

 

2、Compared with using countplot for category variables:

sns.countplot(data=penguin_df, x='species')
sns.displot(data=penguin_df, x='species', hue='species', shrink=0.7)

displot can distinguish colors through the hue parameter and narrow the bars through shrink.

(2) sns.displot(): view the KDE curve

A kernel function is used to estimate the distribution of the data; seaborn uses a Gaussian kernel.

  • Method 1:
sns.displot(data=penguin_df, x='bill_length_mm', kind='kde')
  • Method 2:
sns.kdeplot(data=penguin_df, x='bill_length_mm')

A rug plot takes up almost no space, so it can be drawn directly on top of a displot figure:

sns.displot(data=penguin_df, x='bill_length_mm', kind='kde', rug=True)

 

Although the KDE curve makes the shape of the distribution easier to see, the curve can extend beyond the actual value range at the edges of the plot.

  • Solution 1: set cut=0
sns.displot(data=penguin_df, x='bill_length_mm', kind='kde', rug=True, cut=0)

However, truncating the curve this way can distort the apparent shape of the distribution near the edges.

  • Solution 2: overlay the KDE on the histogram
sns.displot(data=penguin_df, x='bill_length_mm', kde=True)

 (3) Analyze bimodal distribution

sns.displot(data=penguin_df, x='bill_length_mm', kind='kde', hue='species')

As the figure shows, the KDE of bill length differs between species, with clear gaps between the peaks. Pooling the species together therefore produces the bimodal shape seen earlier.

(4) Analyze skewness

Skewed data can be log-transformed.
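As a rough illustration of the log transform, the sketch below builds a hypothetical right-skewed variable (log-normal random numbers, not taken from any data set in these notes) and compares the sample skewness before and after the transform. Using np.log1p is one possible choice; plain np.log works as well when all values are positive.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical right-skewed variable (log-normal), for illustration only
skewed = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=1000))

# log1p computes log(1 + x); it is safe at zero but still requires x > -1
logged = np.log1p(skewed)

print(f"skewness before: {skewed.skew():.2f}")
print(f"skewness after:  {logged.skew():.2f}")
```

The transformed series can then be passed to displot in the usual way to compare the two shapes visually.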

(5) Empirical cumulative distribution function (ecdfplot)

sns.displot(data=penguin_df, x='bill_length_mm', kind='ecdf')

 

In the figure, the proportion corresponding to 55 is 0.97, which means that 97% of the values are below 55. This serves a similar purpose to the box plot, just presented in a different form.
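To make the reading of the curve concrete, the ECDF can be computed by hand: for a given threshold it is simply the proportion of observations at or below it. The values below are made up for illustration, and ecdf_at is a hypothetical helper name.

```python
import numpy as np

# Hypothetical sample values
values = np.array([34.0, 38.0, 41.0, 45.0, 47.0, 50.0, 52.0, 58.0])

# ECDF curve: sort the data, then step from 1/n up to 1
x_sorted = np.sort(values)
ecdf_y = np.arange(1, len(x_sorted) + 1) / len(x_sorted)

def ecdf_at(threshold):
    """Proportion of observations that are <= threshold."""
    return np.mean(values <= threshold)

print(ecdf_at(45.0))  # 4 of the 8 values are <= 45, so this prints 0.5
```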

3、Figure-level functions have FacetGrid features

FacetGrid takes category variables for its rows and columns, splits the data into subsets by category, and shows the distribution of a variable on each subset. (This is equivalent to plotting the conditional distribution of the variable.)

sns.displot(data=penguin_df, x='bill_length_mm', row='sex', col='island', kind='kde', hue='species')

With the rows set to sex (two categories) and the columns set to island (three categories), this draws the KDE curve of bill length (bill_length_mm) on each subset.

Two 、 Relationship analysis of numerical variables  

1、sns.relplot():

  • Draw a scatter plot
sns.relplot(data=tip_df, x='total_bill', y='tip', hue='time', style='time', markers=['o', '^'])

markers customizes the point styles used in the plot.

sns.relplot(data=tip_df, x='total_bill', y='tip', hue='size', size='size')

 

When there are many categories (here the numeric size column), a gradual color gradient is used, and size can additionally scale the points.

  • Draw a line plot
sns.relplot(data=tip_df, x='total_bill', y='tip',kind='line')

 

The figure above is a little messy, because line plots are better suited to time series data such as stock price movements.

# Simulated stock price
stock_df = pd.DataFrame(dict(time=np.arange(500), price=np.random.randn(500).cumsum()+np.ones(500)*50))
sns.relplot(data=stock_df, x='time', y='price', kind='line')

This generates 500 random numbers and uses the cumulative sum (cumsum) to produce a continuously changing series that simulates a stock price.

2、sns.lmplot(): Analyze the linear relationship between two variables

The earlier scatter plot of tip_df showed that the two variables are correlated, so lmplot can be used to draw a regression line.

sns.lmplot(data=tip_df, x='total_bill', y='tip')

 

regplot, the axes-level counterpart, draws the same regression line as lmplot:

sns.regplot(data=tip_df, x='total_bill', y='tip')

 

Draw a residual plot with residplot:

sns.residplot(data=tip_df, x='total_bill', y='tip')

 

Ideally (if the fit is good), the residuals should be scattered randomly around zero. Here the residuals fan out as total_bill grows, which suggests that the linear fit has not fully captured the relationship between the two variables.
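The same check can be done numerically. The sketch below is only an illustration on hypothetical data whose noise grows with x (it is not the tips data): it fits a straight line with np.polyfit, computes the residuals, and compares their spread for small and large x.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data whose noise grows with x (not the tips data set)
x = rng.uniform(5, 50, 300)
y = 1.0 + 0.1 * x + rng.normal(0, 0.03 * x)

# Ordinary least-squares line: y is approximated by slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Compare the residual spread for the lower and upper half of x
low, high = x < np.median(x), x >= np.median(x)
print(f"residual std, small x: {residuals[low].std():.3f}")
print(f"residual std, large x: {residuals[high].std():.3f}")
```

A clearly larger spread in one half is the numerical counterpart of the fan shape in the residual plot.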

  • Like relplot, lmplot can also take a category variable
sns.lmplot(data=tip_df, x='total_bill', y='tip', hue='time')

 

  • lmplot also has FacetGrid features
sns.lmplot(data=tip_df, x='total_bill', y='tip', row='smoker', col='time', hue='time')

 

3、sns.displot(): Plot the joint distribution of two variables

Histogram form : 

sns.displot(data=penguin_df, x='bill_length_mm', y='bill_depth_mm')

 

 kde Curve form :

sns.displot(data=penguin_df, x='bill_length_mm', y='bill_depth_mm', kind='kde')

thresh (between 0 and 1) controls how much of the low-density region is drawn, and levels controls how many contour levels are drawn.

displot can also draw the joint distribution of category variables:

sns.displot(data=penguin_df, x='island', y='species')

As the figure shows, Gentoo penguins are found only on Biscoe island. On Biscoe, Gentoo make up the majority, but there are also some Adelie.

4、sns.jointplot(): Plot the joint distribution and the marginal distributions of two variables

(1) sns.jointplot()

By default, the joint distribution is drawn as a scatter plot; this can be changed through kind. `kind` is one of ['scatter', 'hist', 'hex', 'kde', 'reg', 'resid']
sns.jointplot(data=tip_df, x='total_bill', y='tip')

 

You can also use hue to add a category variable:

As with displot, jointplot can also plot two category variables.

(2) Extended version of jointplot: JointGrid, which takes custom plotting functions via g.plot

g = sns.JointGrid(data=tip_df, x='total_bill', y='tip')
g.plot(sns.histplot, sns.boxplot)  # Histogram for the joint plot, box plots for the margins

The customization can also be more specific:

g = sns.JointGrid(data=tip_df, x='total_bill', y='tip')
g.plot_joint(sns.kdeplot)  # Joint distribution
g.plot_marginals(sns.histplot, kde=True)  # Marginal distributions

(3) sns.pairplot(): Plot the pairwise joint distributions of all numerical variables

sns.pairplot(data=tip_df, kind='kde')

When there are many variables, you can select only the key variables you need:

(4) Extended version of pairplot: PairGrid, which takes custom plotting functions via g.map

car_df = sns.load_dataset('car_crashes')
g = sns.PairGrid(data=car_df, x_vars=['total', 'speeding', 'alcohol'], y_vars=['total', 'speeding', 'alcohol'])
g.map_upper(sns.scatterplot)
g.map_diag(sns.histplot, kde=True)
g.map_lower(sns.regplot)

(5) data.corr() + sns.heatmap(): Plot the pairwise correlation coefficients of all numerical variables

First, compute the correlation coefficient of each pair of variables:

car_cor = car_df.corr(numeric_only=True)  # Skip non-numeric columns such as abbrev
car_cor

Then display the correlation coefficients as a heatmap:

sns.heatmap(car_cor, cmap='Blues', annot=True, fmt='.2f', linewidths=0.5)

Here annot displays the value in each cell, and fmt='.2f' formats it as a floating point number with two decimal places.

Three 、 Analysis of category variables

1、 Distribution of category variables: sns.countplot(), similar to sns.histplot()

sns.catplot(data=tip_df, x='time', kind='count')

2、 The relationship between category variables and numerical variables

(1) Mean / median estimate of a numerical variable in each category: barplot, pointplot

sns.catplot(data=penguin_df, x='species', y='bill_length_mm', kind='bar', estimator=np.median, hue='island')

sns.catplot(data=penguin_df, x='species', y='bill_length_mm', kind='point')

 

(2) Value range of a numerical variable in each category: boxplot, boxenplot

sns.catplot(data=penguin_df, x='species', y='bill_length_mm', kind='box')

 

boxplot is usually sufficient. When the value range is wide (large data sets), boxenplot can be used instead; it draws a series of nested boxes for successively more extreme quantiles.

(3) Distribution of a numerical variable in each category: stripplot, swarmplot, violinplot

The strip plot combines features of the scatter plot and the histogram:

sns.catplot(data=penguin_df, x='species', y='bill_length_mm', kind='strip', jitter=0.3)

jitter sets the width of each strip, with values in the range [0, 1].

 

 swarmplot:

violinplot:
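The code for these two figures is not shown in the notes. The sketch below shows one way they might be drawn through catplot, assuming a synthetic table (demo_df is hypothetical and only borrows the penguins column names).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
# Hypothetical stand-in for the penguins data: 30 rows per species
species = np.repeat(["Adelie", "Chinstrap", "Gentoo"], 30)
centers = np.repeat([39.0, 49.0, 47.5], 30)
demo_df = pd.DataFrame({
    "species": species,
    "bill_length_mm": centers + rng.normal(0, 2.5, 90),
})

# swarm: every observation is drawn, shifted sideways so the points do not overlap
g_swarm = sns.catplot(data=demo_df, x="species", y="bill_length_mm", kind="swarm")

# violin: a KDE of each category, mirrored around the category axis
g_violin = sns.catplot(data=demo_df, x="species", y="bill_length_mm", kind="violin")
```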

 

swarmplot can be overlaid on violinplot:

sns.catplot(data=penguin_df, x='species', y='bill_length_mm', kind='violin')
sns.swarmplot(data=penguin_df, x='species', y='bill_length_mm', color='w')  # Use the axes-level function here so that it draws on the existing axes

Four 、 Custom plotting functions in FacetGrid and PairGrid

1、FacetGrid

g = sns.FacetGrid(data=tip_df, row='time', col='smoker')  # row and col must be category variables; this only creates the empty grid
#  Custom part 
g.map(sns.kdeplot, 'tip')

  Plot the joint distribution

g = sns.FacetGrid(data=tip_df, row='time', col='smoker') 
#  You can also plot the joint distribution of two variables 
g.map(sns.scatterplot,'total_bill', 'tip')

 2、PairGrid

Usage is similar to pairplot:

g = sns.PairGrid(data=penguin_df,hue='species')
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.scatterplot)

 

Copyright notice
This article was written by [Up and down black]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/187/202207060318134712.html