
Using pipe() to improve code readability in pandas

2020-11-07 20:15:00 Freery

1 Brief introduction

   When doing data analysis with pandas, we should avoid organizing code in an overly fragmented way, and in particular avoid creating too many unnecessary intermediate variables: they waste memory, bring variable-naming headaches, and hurt the readability of the whole analysis workflow. It is therefore worth organizing the code as a pipeline.

Figure 1

   In some of my earlier articles, I introduced eval() and query() in pandas, two APIs that help us write chained code and build practical data analysis workflows. Together with pipe(), covered below, we can organize virtually any pandas code neatly into a pipeline.

2 Flexible use of pipe() in pandas

  As its name suggests, pipe() is an API designed for pipeline-style transformations of a Series or DataFrame: it turns a nested function-call process into a chained one. Its first parameter, func, takes the function to apply to the corresponding Series or DataFrame.
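To make this concrete, here is a minimal sketch (with made-up helper functions `add_total` and `keep_positive`, not from the article) showing that a pipe() chain is equivalent to the same calls written nested:

```python
import pandas as pd

def add_total(df):
    """Add a 'total' column summing columns 'a' and 'b'."""
    return df.assign(total=df['a'] + df['b'])

def keep_positive(df):
    """Keep only rows where 'total' is positive."""
    return df[df['total'] > 0]

df = pd.DataFrame({'a': [1, -5, 3], 'b': [2, 1, -1]})

# Nested calls read inside-out:
nested = keep_positive(add_total(df))

# pipe() expresses the same flow top-to-bottom:
chained = (
    df
    .pipe(add_total)
    .pipe(keep_positive)
)

print(chained.equals(nested))  # → True
```

The chained version reads in the same order the operations actually run, which is the whole point of pipe().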

   Specifically, pipe() can be used in two ways. In the first way, the first positional parameter of the passed-in function must be the target Series or DataFrame, and the other parameters are passed in as ordinary keyword arguments. In the following example, we write our own function to do some basic feature engineering on the Titanic dataset:

import pandas as pd

train = pd.read_csv('train.csv')

def do_something(data, dummy_columns):
    '''
    Self-written example function
    '''

    data = (
        pd
        # Generate dummy variables for the specified columns,
        # dropping the first level of each (drop_first=True)
        .get_dummies(data,
                     columns=dummy_columns,
                     drop_first=True)
    )

    return data

# Chained pipeline
(
    train
    # Convert the Pclass column to string type for the later dummy-variable step
    .eval('Pclass=Pclass.astype("str")', engine='python')
    # Drop the specified columns
    .drop(columns=['PassengerId', 'Name', 'Cabin', 'Ticket'])
    # Use pipe() to call our own function within the chain
    .pipe(do_something, 
          dummy_columns=['Pclass', 'Sex', 'Embarked'])
    # Drop rows with missing values
    .dropna()
)

   As you can see, in the pipe() step that follows drop(), we pass our custom function in as the first argument, neatly embedding a custom operation into the chained process.
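As a self-contained illustration of that first usage (using a tiny made-up frame with a `Pclass`-style column instead of the real Titanic data), `df.pipe(func, **kwargs)` is equivalent to calling `func(df, **kwargs)` directly:

```python
import pandas as pd

# A tiny stand-in for the Titanic data (columns are illustrative)
toy = pd.DataFrame({
    'Pclass': ['1', '2', '3', '1'],
    'Fare':   [72.0, 13.0, 7.9, 53.1],
})

def do_something(data, dummy_columns):
    # Dummy-encode the given columns, dropping the first level of each
    return pd.get_dummies(data, columns=dummy_columns, drop_first=True)

piped  = toy.pipe(do_something, dummy_columns=['Pclass'])
direct = do_something(toy, dummy_columns=['Pclass'])

print(piped.equals(direct))  # → True
print(list(piped.columns))   # → ['Fare', 'Pclass_2', 'Pclass_3']
```

The only difference is syntactic: pipe() lets the call sit inside a method chain instead of wrapping it.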

   The second way to use it fits the case where the target Series or DataFrame is not the first parameter of the passed-in function. For example, in the following example we assume the target input data is the second parameter, data2; the first argument to pipe() should then take the form of a (function, 'parameter name') tuple:

def do_something(data1, data2, axis):
    '''
    Self-written example function
    '''

    data = (
        pd
        .concat([data1, data2], axis=axis)
    )

    return data

# The second way to use pipe()
(
    train
    .pipe((do_something, 'data2'), data1=train, axis=0)
)

   With this design we can avoid many nested function calls and tidy up our code as we please~
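Here is a self-contained sketch of that tuple form (using a made-up `combine` function and toy frames, not the article's data), showing how the calling frame is routed to the named parameter:

```python
import pandas as pd

def combine(data1, data2, axis):
    """Concatenate two frames along the given axis."""
    return pd.concat([data1, data2], axis=axis)

top    = pd.DataFrame({'x': [1, 2]})
bottom = pd.DataFrame({'x': [3, 4]})

# The (function, 'parameter name') tuple tells pipe() to pass the
# calling frame as the keyword argument 'data2':
result = bottom.pipe((combine, 'data2'), data1=top, axis=0)

print(list(result['x']))  # → [1, 2, 3, 4]
```

Note that pipe() will raise a ValueError if you also pass the named parameter ('data2' here) explicitly as a keyword, since the calling frame already occupies it.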


   That's all for this article. Feel free to discuss with me in the comments section~

Copyright notice
This article was created by [Freery]. Please include a link to the original when reposting. Thanks.