1 Introduction
When doing data analysis with pandas, try to avoid overly fragmented code, and especially avoid creating lots of unnecessary intermediate variables: they waste memory, create variable-naming headaches, and hurt the readability of the whole analysis workflow. It is therefore worth organizing your code in a pipeline style.

In some of my previous articles, I introduced pandas' eval() and query(), two APIs that help us write chained code and build practical data analysis workflows. Together with pipe(), introduced below, they let us organize almost any pandas code into a clean pipeline.
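As a quick refresher, here is a minimal sketch of that chained eval()/query() style. The derived column FarePerPerson and the filter thresholds are illustrative assumptions; the columns Fare, SibSp, Parch, and Age come from the standard Titanic dataset used throughout this article:

import pandas as pd

train = pd.read_csv('train.csv')

# Derive a column with eval(), then filter on it with query(),
# all without creating intermediate variables
result = (
    train
    .eval('FarePerPerson = Fare / (SibSp + Parch + 1)')
    .query('Age >= 18 and FarePerPerson < 30')
)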
2 Flexible use of pipe() in pandas
As its name suggests, pipe() is an API designed for pipeline-style operations on Series and DataFrame objects: it turns nested function calls into a chained process. Its first parameter, func, receives a function that acts on the corresponding Series or DataFrame.

Specifically, pipe() can be used in two ways. In the first, the first positional parameter of the passed-in function must be the target Series or DataFrame, and any other parameters are passed as regular keyword arguments. In the following example, we write our own function to perform some basic feature engineering on the Titanic dataset:
import pandas as pd

train = pd.read_csv('train.csv')

def do_something(data, dummy_columns):
    '''A simple example function.'''
    data = (
        pd
        # Generate dummy variables for the specified columns,
        # dropping the first level of each (drop_first=True)
        .get_dummies(data,
                     columns=dummy_columns,
                     drop_first=True)
    )
    return data
# Chained pipeline
(
    train
    # Convert Pclass to string type for the dummy-variable step
    .eval('Pclass=Pclass.astype("str")', engine='python')
    # Drop the specified columns
    .drop(columns=['PassengerId', 'Name', 'Cabin', 'Ticket'])
    # Use pipe() to call our own function within the chain
    .pipe(do_something,
          dummy_columns=['Pclass', 'Sex', 'Embarked'])
    # Drop rows with missing values
    .dropna()
)
As you can see, in the pipe() step that follows drop(), we pass our custom function as the first argument, neatly embedding a whole series of custom operations in the chained process.
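For contrast, here is a sketch of roughly the same logic without pipe(), written with the intermediate variables the pipeline lets us avoid:

# Equivalent logic without pipe(): temporaries at every step
tmp = train.eval('Pclass=Pclass.astype("str")', engine='python')
tmp = tmp.drop(columns=['PassengerId', 'Name', 'Cabin', 'Ticket'])
tmp = do_something(tmp, dummy_columns=['Pclass', 'Sex', 'Embarked'])
result = tmp.dropna()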
The second way of using pipe() covers the case where the target Series or DataFrame is not the first parameter of the passed-in function. For example, suppose the target input data is the second parameter, data2; then the first argument to pipe() should take the form (function, 'parameter name'):
def do_something(data1, data2, axis):
    '''A simple example function.'''
    data = pd.concat([data1, data2], axis=axis)
    return data
# The second way of using pipe()
(
    train
    .pipe((do_something, 'data2'), data1=train, axis=0)
)
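Under the hood, the tuple form simply routes the piped object into the named keyword argument, so the call above is equivalent to the plain call below (here train is concatenated with itself):

# Equivalent plain call: the piped DataFrame fills the data2 slot
do_something(data1=train, data2=train, axis=0)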
With this design we can avoid deeply nested function calls and reorganize our code however we like.

That's all for this article. Feel free to discuss with me in the comments section!