Drawing a ridge plot (overlapping kernel density estimation curves, "overlapping densities") in detail: the FacetGrid object, sns.kdeplot, and FacetGrid.map

2022-06-13 06:59:00 Bear danger

Where that face has gone I do not know; the peach blossoms still smile in the spring breeze. (Cui Hu)

Introduction

  • This post explains and reproduces one of the examples in the gallery on the seaborn official website: overlapping kernel density estimation curves (overlapping densities)
  • The finished figure looks quite polished. Each variable gets its own curve, and the curves line up like one hill after another, which is why it is also called a ridge plot
  • The example relies on some of seaborn's less obvious mechanics, so working through it is well worth the effort

Example

Generating the data

  • Import the usual libraries
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='white', rc={
    'axes.facecolor': (0, 0, 0, 0)})  # axes.facecolor sets the axes background color; (0, 0, 0, 0) is fully transparent

sns.__version__
'0.11.2'
  • Note that seaborn must be version 0.11.2 (the latest at the time of writing); the seaborn bundled with Anaconda may need to be updated
  • If you run pip install --upgrade seaborn directly from cmd and seaborn still reports the old version afterwards,
    try activating the current environment first with conda activate envname, then update again with pip install -U seaborn
  • This is a minor hassle, but nothing serious
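To confirm which seaborn the running interpreter actually imports, a quick check like the one below can help. It is a minimal sketch that assumes only that seaborn is importable; it tests for FacetGrid.refline(), which is used later in this post.

```python
import sys

import seaborn as sns

# Which Python is running, and which seaborn version it imports
print(sys.executable)
print(sns.__version__)

# FacetGrid.refline() is needed later in this post; older releases lack it
has_refline = hasattr(sns.FacetGrid, "refline")
print("FacetGrid.refline available:", has_refline)
```

If the last line prints False, the interpreter is still picking up an older installation, which usually means the upgrade went into a different environment.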
rs = np.random.RandomState(1979)
x = rs.randn(500)                                   # Generate 500 random numbers from a normal distribution with mean 0 and standard deviation 1
g = np.tile(list('ABCDEFGHIJ'), 50)                 # Repeat the 10 letters ABCDEFGHIJ 50 times; inspect with g[:20]
df = pd.DataFrame(dict(x=x, g=g))                   # Combine x and g into one data frame
m = df.g.map(ord)                                   # Map each letter in column g to its ASCII code: A->65, B->66, C->67, D->68, ...
df.x += m                                           # Shift the random numbers: x values with g == 'A' are centered on 65, those with g == 'B' on 66, ...
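One way to see the effect of the shift is to compare the mean of x within each letter group to that letter's ASCII code. The sketch below rebuilds the same data with the same seed; the name check is only illustrative.

```python
import numpy as np
import pandas as pd

rs = np.random.RandomState(1979)
x = rs.randn(500)
g = np.tile(list("ABCDEFGHIJ"), 50)
df = pd.DataFrame(dict(x=x, g=g))
df["x"] = df["x"] + df["g"].map(ord)   # same shift as in the post

# Mean of x within each letter group, next to the letter's ASCII code
check = df.groupby("g")["x"].mean().to_frame("mean_x")
check["ascii"] = [ord(letter) for letter in check.index]
print(check)
```

Each group mean should sit close to its code (65 for A, 66 for B, and so on), since every group is 50 standard normal draws plus a constant.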
  • Draw a histogram of the randomly generated variable x
  • The histogram shows that x is roughly symmetric about 0 and satisfies the 3σ rule, which is consistent with a normal distribution
plt.hist(x)
(array([ 8., 30., 62., 88., 96., 82., 71., 37., 23.,  3.]),
 array([-2.54389589, -2.00311021, -1.46232454, -0.92153886, -0.38075318,
         0.16003249,  0.70081817,  1.24160384,  1.78238952,  2.3231752 ,
         2.86396087]),
 <a list of 10 Patch objects>)

[Figure: histogram of the generated random variable x]
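The 3σ rule can also be checked numerically rather than read off the histogram. A minimal sketch using the same seed and sample size as above; for a normal distribution the three shares should be near 68%, 95% and 99.7%.

```python
import numpy as np

rs = np.random.RandomState(1979)
x = rs.randn(500)  # same seed and size as in the post

# Share of samples within 1, 2 and 3 standard deviations of the mean 0
shares = {k: float(np.mean(np.abs(x) <= k)) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} sigma: {share:.3f}")
```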

  • View the first 20 rows of the generated data frame df
df.iloc[: 20]
            x  g
0   64.038123  A
1   66.147050  B
2   66.370011  C
3   68.791019  D
4   70.583534  E
5   69.135114  F
6   72.390092  G
7   73.822191  H
8   73.868785  I
9   72.938377  J
10  65.723433  A
11  66.580572  B
12  68.631715  C
13  68.175267  D
14  68.772384  E
15  70.369829  F
16  72.296533  G
17  70.704948  H
18  73.755082  I
19  75.130836  J
  • For letter-grouped data like this, the usual approach only produces a figure like the one below
  • Such a figure is neatly arranged and can be made more attractive with careful tuning
  • But it is hardly novel, and comparing the variables with each other is not particularly convenient or intuitive
sns.set(font_scale=2)
sns.displot(data=df, x='x', col='g', col_wrap=5, kind='hist', hue='g', kde=True, 
            palette="ch:r=-.2,d=.3_r", legend=False)#, palette="light:m_r")

[Figure: the usual faceted display of the data distribution]

Start drawing

  • The whole drawing process can be divided into six steps and is clear overall
  • I have written many comments, but it is hard to cover everything, so do experiment, think it through, and try things yourself

Initialize the FacetGrid object

# Generate a palette with 10 colors
pal = sns.cubehelix_palette(n_colors=10, rot=-.25, light=.7)   
# Pass df to FacetGrid; column g both splits the rows (row) and drives the semantic mapping (hue)
g = sns.FacetGrid(df, row='g', hue='g', height=.5, aspect=15, palette=pal)  

Draw the densities on the subplots

  • FacetGrid.map() is the workhorse of a FacetGrid object: it applies a plotting function to the subset of the data that belongs to each facet. It takes three kinds of arguments
    1. The plotting function: it must accept data and the keyword argument color; if the semantic mapping parameter hue is used, it must also accept the keyword argument label
    2. Column names of the data frame: the names of the columns, in the data frame passed to FacetGrid, that the plotting function should draw
    3. Other keyword arguments: these are forwarded to the plotting function to fine-tune it
  • Here the kernel density estimate function sns.kdeplot is passed to FacetGrid.map() as the plotting function, and the column name 'x' means the density curve is drawn for that column
  • In addition, the keyword argument bw_adjust adjusts the bandwidth, clip_on=False keeps the curve from being clipped at the axes boundary, and fill=True fills the area under the curve
  • The white outline is drawn so that the boundaries stay visible where the hills overlap
# Draw the filled hills, one per facet
g.map(sns.kdeplot, 'x', bw_adjust=.5, clip_on=False, fill=True, alpha=1, linewidth=1.5) 
# Draw only the density curve, without fill and in white, which acts as an outline
g.map(sns.kdeplot, 'x', clip_on=False, color='w', lw=2, bw_adjust=.5)                   
  • bw_adjust scales the bandwidth: the larger it is, the smoother the density curve; the smaller it is, the more jagged the curve
  • Taking the random numbers of letter A and varying bw_adjust gives the following density curves
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(12, 4))
# Draw the three panels in a loop
for i, bw_adjust in enumerate([0.2, 0.5, 0.9]):
    # Draw on the matching subplot with the given bandwidth adjustment
    sns.kdeplot(x='x', data=df[df.g == 'A'], bw_adjust=bw_adjust, ax=ax[i])
    # Set the title of each subplot
    ax[i].set_title('bw_adjust=' + str(bw_adjust))
fig.subplots_adjust(wspace=0.5)

[Figure: density curves for different values of bw_adjust]
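The effect of bw_adjust can also be seen without a plot by counting the local maxima of the estimated density. The sketch below is a hand-rolled Gaussian KDE, not seaborn's code; it assumes a Scott-style reference bandwidth (n^(-1/5) times the sample standard deviation) scaled by bw_adjust. The helper names gaussian_kde_1d and count_peaks are hypothetical.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bw_adjust=1.0):
    """Hand-rolled Gaussian KDE; bandwidth = bw_adjust * n**(-1/5) * sample std."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    h = bw_adjust * n ** (-1 / 5) * samples.std(ddof=1)
    z = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

def count_peaks(density):
    """Number of strict local maxima on the grid."""
    mid = density[1:-1]
    return int(np.sum((mid > density[:-2]) & (mid > density[2:])))

rs = np.random.RandomState(1979)
x = rs.randn(500)
a = x[::10] + ord("A")            # the 50 values of group 'A' (every tenth sample)

grid = np.linspace(a.min() - 3, a.max() + 3, 2000)
dx = grid[1] - grid[0]

peaks, areas = {}, {}
for bw in (0.2, 0.5, 0.9):
    density = gaussian_kde_1d(a, grid, bw_adjust=bw)
    peaks[bw] = count_peaks(density)
    areas[bw] = float(density.sum() * dx)   # Riemann sum, should be close to 1
    print(f"bw_adjust={bw}: peaks={peaks[bw]}, area={areas[bw]:.3f}")
```

A smaller bw_adjust narrows each kernel, so individual samples show up as separate bumps; a larger one merges them, while the area under the curve stays close to 1 in every case.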

Draw a horizontal line

  • The figure looks better still if the axis line takes the same hue-mapped color as the density curve and its fill
  • Concretely, FacetGrid.refline() first draws a horizontal line that follows the semantic mapping, and FacetGrid.despine() then removes the subplot spines, which erases the original black axes
# Setting color to None lets the FacetGrid object choose the line color
g.refline(y=0, linewidth=2, linestyle='-', color=None, clip_on=False)
# Remove the spines: top and right are removed by default, and bottom=True, left=True remove the other two as well
g.despine(bottom=True, left=True)        

Label each subplot

  • Define a simple function label that can be passed to FacetGrid.map() and so applied to every subplot
  • The parameter x of label is in fact unused
  • Inside FacetGrid.map(), both color and label are filled in automatically from the level of the semantic mapping parameter hue; since hue and row use the same column here, the label is also the name of the row
# The parameters color and label are required
def label(x, color, label):
    # Get the current Axes (gca = get current axes)
    ax = plt.gca() 
    # ha and va set the horizontal and vertical alignment; transform=ax.transAxes places the text in axes-relative coordinates
    ax.text(0, .2, label, fontweight='bold', color=color,
           ha='left', va='center', transform=ax.transAxes) 

g.map(label, 'x')          # Pass in the function label and the column name 'x' from df
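The role of transform=ax.transAxes is easier to see on a single Axes. In the sketch below (axis limits chosen arbitrarily for illustration), the text is placed at (0, .2) in axes coordinates, and the same point is then converted back to data coordinates.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_xlim(60, 80)    # arbitrary limits, roughly the range of x in the post
ax.set_ylim(0, 0.5)

# Axes coordinates: (0, .2) is the left edge, 20% of the way up, whatever the limits are
ax.text(0, .2, "A", fontweight="bold", ha="left", va="center", transform=ax.transAxes)

# Convert that point: axes coordinates -> display pixels -> data coordinates
display_point = ax.transAxes.transform((0, .2))
x_data, y_data = ax.transData.inverted().transform(display_point)
print(x_data, y_data)
```

The point lands at the left limit of the x-axis and at 20% of the y-range, so the label stays in the same corner of every subplot even though each subplot shows different data.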

Adjust the spacing between subplots

  • To make the density curves easier to compare, reduce the distance between the subplots so that one hill rises behind another
  • The parameter hspace, which controls the vertical spacing between subplots, can be set freely; a negative value makes the subplots overlap
g.figure.subplots_adjust(hspace=-.25)
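What a negative hspace does can be checked on a plain matplotlib figure by measuring the gap between two stacked axes. A minimal sketch; vertical_gap is a hypothetical helper written for this illustration.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=3, ncols=1)

def vertical_gap(upper, lower):
    """Gap between two stacked axes in figure coordinates (negative means overlap)."""
    return upper.get_position().y0 - lower.get_position().y1

fig.subplots_adjust(hspace=.25)
gap_positive = vertical_gap(axes[0], axes[1])

fig.subplots_adjust(hspace=-.25)
gap_negative = vertical_gap(axes[0], axes[1])

print(gap_positive, gap_negative)
```

With a positive hspace the bottom of the upper axes sits above the top of the lower one; with a negative value the sign flips, which is exactly the overlap the ridge plot relies on.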

Remove details

  • Polish the figure so that it looks cleaner overall
  • This also shows that a FacetGrid object wraps common matplotlib setters, which act on all subplots at once when called
g.set_titles('')                          # Remove the titles
g.set(yticks=[], ylabel='')               # Remove the y-axis ticks and the y-axis label

Complete code

# 1. Generate the palette and instantiate the FacetGrid object
pal = sns.cubehelix_palette(n_colors=10, rot=-.25, light=.7)                   
g = sns.FacetGrid(df, row='g', hue='g', height=.5, aspect=15, palette=pal)  

# 2. Draw the densities on the subplots
g.map(sns.kdeplot, 'x', bw_adjust=.5, clip_on=False, fill=True, alpha=1, linewidth=1.5)
g.map(sns.kdeplot, 'x', clip_on=False, color='w', lw=2, bw_adjust=.5)                   # White outline

# 3. Draw a horizontal line with refline; it takes the place of the x-axis line
g.refline(y=0, linewidth=2, linestyle='-', color=None, clip_on=False)

# 4. Define a simple function and use it to label each subplot
def label(x, color, label):
    ax = plt.gca()                                         
    ax.text(0, .2, label, fontweight='bold', color=color,
           ha='left', va='center', transform=ax.transAxes) 

g.map(label, 'x')                                                                        # Pass in the function label and the column name 'x' from df

# 5. Adjust the spacing between the subplots so that they overlap
g.figure.subplots_adjust(hspace=-.25)

# 6. Remove some axis details so that the figure looks cleaner overall
g.set_titles('')                          
g.set(yticks=[], ylabel='')               
g.despine(bottom=True, left=True)  
g.figure

[Figure: ridge plot of the letter-grouped data]

Ridge plots for general data

Reshape the data

  • Data obtained in practice often comes in the wide form below, with one column per variable. That clearly differs from the letter-grouped data generated above, where every value is paired with its category
  • Call DataFrame.stack() to reshape the data. This kind of preprocessing is often needed when plotting with seaborn, and seaborn's convenience makes it worthwhile
  • You can copy the data below into Excel; reading it in is not shown here. The variable data is the resulting data frame
year,Dependence on imports of agricultural products (%),Dependence on agricultural exports (%),Factor input import rate (%),Factor input and export rate (%),Introduction of foreign investment in agriculture (billion dollars),Foreign direct investment in agriculture (billion dollars),Average tariff rate of agricultural products (%)
2000,11.66,9.77,2.47,0.75,6.76,0.94,20.6
2001,11.70,9.52,2.64,0.74,8.99,1.15,22.7
2002,12.16,9.25,3.44,0.78,10.28,1.69,17.7
2003,13.05,9.40,3.79,1.24,10.01,0.81,16.2
2004,17.25,9.09,4.27,1.42,11.14,2.89,14.9
2005,18.66,11.86,4.39,1.43,7.18,1.06,14.6
2006,13.73,8.14,3.90,1.62,5.99,1.9,14.5
2007,19.75,11.74,3.32,2.39,9.24,2.72,14.5
2008,21.48,10.46,3.00,2.22,11.91,1.72,14.5
2009,18.15,9.28,2.01,1.34,14.29,3.43,13.4
2010,19.93,9.52,2.42,1.87,19.12,5.34,14.5
2011,20.19,10.11,2.94,2.28,20.09,7.98,14.4
2012,20.06,9.18,2.62,1.88,20.62,7.98,14
2013,20.83,8.68,1.81,1.61,18.00,18.1,14
2014,19.85,8.69,1.79,1.97,15.22,17.4,14
2015,19.12,8.68,1.62,2.20,15.34,20.5,14
2016,19.24,9.37,1.35,1.71,18.98,29.7,14.2
2017,17.61,8.33,1.38,1.60,10.75,22,13.6
2018,17.12,8.46,1.52,1.81,8.01,18,14.7
2019,15.65,7.74,1.23,1.60,5.62,15.4,14.4
2020,15.79,7.03,1.03,1.11,5.76,13.9,15.2
  • Next, select all variables except the year, then call DataFrame.stack() to see what it does
data = data.loc[:, 'Dependence on imports of agricultural products (%)':]
data.columns = ['Dependence on imports of agricultural products', 'Dependence on agricultural exports', 'Factor input import rate', 'Factor input and export rate', 
                'Introduction of foreign investment in agriculture', 'Foreign direct investment in agriculture', 'Average tariff rate of agricultural products']                  # Remove the units from the column names
# View the first 21 entries after reshaping
data.stack()[: 21]
0   Dependence on imports of agricultural products      11.659965
    Dependence on agricultural exports       9.774706
    Factor input import rate        2.469175
    Factor input and export rate        0.751264
    Introduction of foreign investment in agriculture         6.760000
    Foreign direct investment in agriculture      0.940000
    Average tariff rate of agricultural products      20.600000
1   Dependence on imports of agricultural products      11.703450
    Dependence on agricultural exports       9.517280
    Factor input import rate        2.640029
    Factor input and export rate        0.735898
    Introduction of foreign investment in agriculture         8.990000
    Foreign direct investment in agriculture      1.150000
    Average tariff rate of agricultural products      22.700000
2   Dependence on imports of agricultural products      12.156457
    Dependence on agricultural exports       9.251768
    Factor input import rate        3.440643
    Factor input and export rate        0.781500
    Introduction of foreign investment in agriculture        10.280000
    Foreign direct investment in agriculture      1.690000
    Average tariff rate of agricultural products      17.700000
dtype: float64
  • The first column shown is the original row (sample) index, the second is the column name, i.e. the variable, and the third is the value of that variable for that sample
  • In fact, DataFrame.stack() turns the original pd.DataFrame into a pd.Series
type(data.stack())
pandas.core.series.Series
  • The Series can be converted back into a data frame as follows, which gives the same form as the letter-grouped data generated above
  • With that, the goal is reached
data = data.stack().reset_index(name='value').rename(columns={
    'level_0': 'index', 'level_1': 'feature'})
# View the first 21 rows
data.iloc[: 21]
    index  feature                                                value
 0      0  Dependence on imports of agricultural products     11.659965
 1      0  Dependence on agricultural exports                  9.774706
 2      0  Factor input import rate                            2.469175
 3      0  Factor input and export rate                        0.751264
 4      0  Introduction of foreign investment in agriculture   6.760000
 5      0  Foreign direct investment in agriculture            0.940000
 6      0  Average tariff rate of agricultural products       20.600000
 7      1  Dependence on imports of agricultural products     11.703450
 8      1  Dependence on agricultural exports                  9.517280
 9      1  Factor input import rate                            2.640029
10      1  Factor input and export rate                        0.735898
11      1  Introduction of foreign investment in agriculture   8.990000
12      1  Foreign direct investment in agriculture            1.150000
13      1  Average tariff rate of agricultural products       22.700000
14      2  Dependence on imports of agricultural products     12.156457
15      2  Dependence on agricultural exports                  9.251768
16      2  Factor input import rate                            3.440643
17      2  Factor input and export rate                        0.781500
18      2  Introduction of foreign investment in agriculture  10.280000
19      2  Foreign direct investment in agriculture            1.690000
20      2  Average tariff rate of agricultural products       17.700000
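The wide-to-long reshaping is easier to follow on a tiny table. The sketch below uses a made-up frame (wide, with hypothetical columns a and b) and compares the stack route with DataFrame.melt(), which is another common way to reach the same long form.

```python
import pandas as pd

# Hypothetical wide table: one row per sample, one column per variable
wide = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# Route used above: stack, then turn the two index levels into ordinary columns
long_stack = (wide.stack()
                  .reset_index(name="value")
                  .rename(columns={"level_0": "index", "level_1": "feature"}))

# Alternative route: melt, keeping the row index as an ordinary column
long_melt = wide.reset_index().melt(id_vars="index", var_name="feature", value_name="value")

print(long_stack)
print(long_melt)
```

Both results contain the same rows. They differ only in order: stack goes through the table row by row, while melt goes column by column, which makes no difference to FacetGrid.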

Start drawing

  • The process below differs little from the previous one
  • Only a few names change: 'g' becomes 'feature' and 'x' becomes 'value'; the palette and the figure size are adjusted as well
#  Basic settings 
sns.set_theme(style='white', font_scale=1.5, rc={
    'axes.facecolor': (0, 0, 0, 0)})
plt.rcParams['font.sans-serif'] = ['SimHei']  # Use the SimHei font so that Chinese text renders correctly
plt.rcParams['axes.unicode_minus'] = False    # Render minus signs correctly

# Instantiate the FacetGrid object
pal = sns.cubehelix_palette(n_colors=10, rot=.5, light=.9, dark=.28)
g = sns.FacetGrid(data, row='feature', hue='feature', height=1.4, aspect=12, palette=pal)

# Draw the densities on the subplots
g.map(sns.kdeplot, 'value', bw_adjust=.5, clip_on=False,
     fill=True, alpha=1, linewidth=1.5)
g.map(sns.kdeplot, 'value', clip_on=False, color='w', lw=2, bw_adjust=.5)

# Set color to None so that the line follows the semantic mapping
g.refline(y=0, linewidth=2, linestyle='-', color=None, clip_on=False)

# Define a simple function and use it to label each subplot
def label(x, color, label):
    ax = plt.gca()
    ax.text(0, .05, label, fontweight='bold', color=color,
           ha='left', va='center', transform=ax.transAxes)

g.map(label, 'value')

# Let the subplots overlap
g.figure.subplots_adjust(hspace=-.8)

# Remove some axis details so that the overlapped figure looks cleaner
g.set_titles('')
g.set(yticks=[], ylabel='', xlabel='')
g.despine(bottom=True, left=True)

[Figure: ridge plot of the new data]

  • I do not see much need to wrap the block of code above into a function

Reading the original documentation

  • While writing this post I read parts of the official documentation, partly for accuracy and partly, perhaps, out of laziness. They are collected here for reference
  • A recent discovery: Microsoft's translation is very good …

Notes from the seaborn.kdeplot documentation

  • (On bandwidth)
  • The bandwidth, or standard deviation of the smoothing kernel, is an important parameter.
  • Misspecification of the bandwidth can produce a distorted representation of the data.
  • Much like the choice of bin width in a histogram, an over-smoothed curve can erase true features of a distribution, while an under-smoothed curve can create false features out of random variability.
  • The rule-of-thumb that sets the default bandwidth works best when the true distribution is smooth, unimodal, and roughly bell-shaped.
    (Bell-shaped: low at both ends, high in the middle, and symmetric)
  • It is always a good idea to check the default behavior by using bw_adjust to increase or decrease the amount of smoothing.

  • (Why clipping may be needed)
  • Because the smoothing algorithm uses a Gaussian kernel, the estimated density curve can extend to values that do not make sense for a particular dataset.
  • For example, the curve may be drawn over negative values when smoothing data that are naturally positive.
  • The cut and clip parameters can be used to control the extent of the curve, but datasets that have many observations close to a natural boundary may be better served by a different visualization method.
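The point about naturally positive data can be made concrete. The sketch below draws made-up exponential samples (all positive), builds a hand-rolled Gaussian KDE with a Scott-style bandwidth, and measures how much estimated probability mass ends up below zero. The bandwidth rule and all names here are assumptions for illustration, not seaborn's code.

```python
import numpy as np

rs = np.random.RandomState(0)
positive = rs.exponential(scale=1.0, size=200)    # made-up data, all values > 0

n = positive.size
h = n ** (-1 / 5) * positive.std(ddof=1)          # Scott-style bandwidth (assumption)

grid = np.linspace(positive.min() - 4 * h, positive.max() + 4 * h, 4000)
dx = grid[1] - grid[0]
z = (grid[:, None] - positive[None, :]) / h
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

# Riemann sum of the density over the negative part of the grid
mass_below_zero = float(density[grid < 0].sum() * dx)
print(f"estimated probability mass below 0: {mass_below_zero:.3f}")
```

Although no sample is negative, a visible share of the estimated density lies below zero, because every kernel centered near the boundary spills over it.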

  • (Where kernel density estimates can mislead)
  • Similar considerations apply when a dataset is naturally discrete or “spiky” (containing many repeated observations of the same value).
  • Kernel density estimation will always produce a smooth curve, which would be misleading in these situations.

  • (Interpreting the kernel density estimation curve)
  • The units on the density axis are a common source of confusion.
  • While kernel density estimation produces a probability distribution, the height of the curve at each point gives a density, not a probability.
  • A probability can be obtained only by integrating the density across a range.
  • The curve is normalized so that the integral over all possible values is 1, meaning that the scale of the density axis depends on the data values.
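A small numeric example shows the difference between density and probability. The sketch uses a normal density with standard deviation 0.1 (values chosen only for illustration): the curve is taller than 1 at its peak, yet its total area is 1 and the area over a range is a proper probability.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

grid = np.linspace(-1, 1, 20001)
dx = grid[1] - grid[0]
density = normal_pdf(grid, mu=0.0, sigma=0.1)

peak_height = float(density.max())                 # a density, here larger than 1
total_area = float(density.sum() * dx)             # integral over (almost) all values
inside = np.abs(grid) <= 0.1
prob_within_one_sigma = float(density[inside].sum() * dx)   # a probability

print(peak_height, total_area, prob_within_one_sigma)
```

Rescaling the data by a factor of 100 would divide the peak height by 100 while leaving both areas unchanged, which is what the note means by the density axis depending on the data values.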

FacetGrid.map parameter description

  • (What FacetGrid.map does) Apply a plotting function to each facet’s subset of the data.

  • (func) A plotting function that takes data and keyword arguments. It must plot to the currently active matplotlib Axes and take a color keyword argument. If faceting on the hue dimension, it must also take a label keyword argument.
  • (args) Column names in self.data that identify variables with data to plot. The data for each variable is passed to func in the order the variables are specified in the call.
  • (kwargs) All keyword arguments are passed to the plotting function.
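The calling convention described above can be mimicked in a few lines of pandas. The function toy_map below is a deliberately simplified stand-in written for this illustration, not seaborn's implementation; the palette, the data and the helper show are likewise hypothetical.

```python
import pandas as pd

def toy_map(data, facet_col, func, *columns, **kwargs):
    """Simplified stand-in for FacetGrid.map (not seaborn's actual code)."""
    palette = ["C0", "C1", "C2", "C3"]                        # hypothetical colors
    for i, (level, subset) in enumerate(data.groupby(facet_col, sort=False)):
        # One array per requested column, in the order the columns were given
        args = [subset[col].to_numpy() for col in columns]
        func(*args, color=palette[i % len(palette)], label=str(level), **kwargs)

log = []

def show(x, y, color, label, scale=1):
    # Stands in for a plotting function; it only records what it received
    log.append((label, color, float(x.sum() * scale), float(y.sum() * scale)))

toy = pd.DataFrame({"g": list("AAB"), "u": [1, 2, 5], "v": [10, 20, 50]})
toy_map(toy, "g", show, "u", "v", scale=2)
print(log)
```

The function is called once per group, the columns arrive as positional arguments in the order 'u', 'v', and the extra keyword scale is forwarded untouched, mirroring the func, args and kwargs roles listed above.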
Original site

Copyright notice
This article was written by [Bear danger]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/164/202206130653447533.html