PyCaret | Machine learning modeling in a few lines of code
2022-07-02 10:05:00 【Xinsheng rookie group】
Compared with other open-source machine learning libraries, PyCaret can perform complex machine learning tasks with just a few lines of code, which makes it easy to iterate on experiments efficiently and reach conclusions faster. PyCaret is somewhat like the caret package in R, but simpler.
Without PyCaret, going from data preprocessing and feature engineering through modeling to hyperparameter tuning usually takes at least 100 lines of code. With PyCaret the same steps take fewer than 10 lines, and the commands are intuitive and easy to remember, for example:
# Initialize the experiment and the preprocessing pipeline
setup()
# Compare different algorithms
compare_models()
# Build a model
create_model()
# Tune hyperparameters
tune_model()
# Visualize the model
plot_model()
# Predict with a model
predict_model()
# Save / load a model
save_model()
load_model()
In essence, PyCaret is a Python library that wraps several machine learning libraries and frameworks, such as scikit-learn, XGBoost, Microsoft LightGBM and spaCy. It contains 6 modules that support training and deploying supervised and unsupervised models: classification, regression, clustering, anomaly detection, natural language processing and association rule mining. Each module wraps the machine learning algorithms specific to its task, along with functions that are shared across modules. Users import the module that matches the type of experiment they want to run.
• GitHub: https://github.com/pycaret/pycaret
• Official website: https://www.pycaret.org
• Tutorials: https://www.pycaret.org/tutorial
Installing PyCaret
To avoid conflicts with other packages, it is strongly recommended to create a virtual environment with conda.
# Create a conda environment
conda create --name pycaret python=3.8
# Activate the conda environment
conda activate pycaret
# Install the full version of pycaret
pip install pycaret[full]
# Register the conda environment as a Jupyter notebook kernel
python -m ipykernel install --user --name pycaret --display-name "pycaret-2.3.5"
GPU support
The following models support GPU-based model training and hyperparameter tuning:
• XGBoost
• CatBoost
• LightGBM (requires installing the GPU version separately: https://lightgbm.readthedocs.io/en/latest/GPU-Tutorial.html)
• Logistic Regression, Ridge Classifier, Random Forest, K Neighbors Classifier, K Neighbors Regressor, Support Vector Machine, Linear Regression, Ridge Regression, Lasso Regression (requires cuML >= 0.15: https://github.com/rapidsai/cuml)
Building a binary classification model with PyCaret
PyCaret's classification module (pycaret.classification) is a supervised machine learning module for binary and multiclass classification problems. It supports more than 18 algorithms and 14 plots for analyzing model performance. Everything from basic hyperparameter tuning to advanced techniques such as model ensembling can be done with this module.
Example data
We will use the Default of Credit Card Clients Dataset from UCI. It contains information on default payments, demographic factors, credit data, payment history and bill statements of credit card clients in Taiwan from April 2005 to September 2005, with 24,000 samples and 25 features. The columns are described below:
• ID: ID of each client
• LIMIT_BAL: Credit limit in New Taiwan dollars
• SEX: Gender (1 = male, 2 = female)
• EDUCATION: (1 = graduate school, 2 = university, 3 = high school, 4 = others, 5 = unknown, 6 = unknown)
• MARRIAGE: Marital status (1 = married, 2 = single, 3 = others)
• AGE: Age
• PAY_0 to PAY_6: Repayment status n months ago (PAY_0 = last month ... PAY_6 = 6 months ago) (scale: -1 = paid on time, 1 = payment delayed by one month, 2 = payment delayed by two months, ... 8 = payment delayed by eight months, 9 = payment delayed by nine months or more)
• BILL_AMT1 to BILL_AMT6: Bill amount n months ago (BILL_AMT1 = last month ... BILL_AMT6 = 6 months ago)
• PAY_AMT1 to PAY_AMT6: Payment amount n months ago (PAY_AMT1 = last month ... PAY_AMT6 = 6 months ago)
• default: Default payment (1 = yes, 0 = no); this is the target column
1. Download the data
Use the get_data() function from PyCaret's built-in data repository to load the data:
from pycaret.datasets import get_data
dataset = get_data('credit')
# Check the dimensions of the data
dataset.shape
To demonstrate the predict_model() function on an independent validation set, we hold out 1,200 records from the original dataset. These records are not used for modeling.
data = dataset.sample(frac=0.95, random_state=786)
data_unseen = dataset.drop(data.index)
data.reset_index(inplace=True, drop=True)
data_unseen.reset_index(inplace=True, drop=True)
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
Data for Modeling: (22800, 24)
Unseen Data For Predictions: (1200, 24)
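The hold-out split above can be sketched without pandas to make the arithmetic explicit. This is a minimal illustration that assumes only the row count from the text (24,000) and works on row indices. The helper name holdout_split is hypothetical, and the rows it selects will not match the ones pandas picks with random_state=786.

```python
import random

def holdout_split(n_rows, frac, seed):
    """Return (modeling_idx, unseen_idx): a random `frac` share of the row
    indices for modeling and the remaining indices as unseen data."""
    rng = random.Random(seed)              # fixed seed makes the split repeatable
    n_modeling = round(n_rows * frac)
    modeling = sorted(rng.sample(range(n_rows), n_modeling))
    chosen = set(modeling)
    unseen = [i for i in range(n_rows) if i not in chosen]
    return modeling, unseen

modeling_idx, unseen_idx = holdout_split(n_rows=24000, frac=0.95, seed=786)
print(len(modeling_idx), len(unseen_idx))  # 22800 1200
```

The two index lists never overlap, which is what keeps the unseen records out of every later modeling step.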
2. Set up the PyCaret environment
Before any other PyCaret step, we have to run the setup() function, which initializes PyCaret and creates the data preprocessing pipeline. It has two required parameters: a pandas DataFrame and the name of the target column.
When setup() runs, PyCaret automatically infers the data type of every feature (continuous or categorical) from some of its properties. This usually works well. When the inference is wrong, we can override the inferred types with the numeric_features and categorical_features parameters. If all data types are identified correctly, press Enter to continue, or type quit to end the experiment.
from pycaret.classification import *
exp_clf101 = setup(data = data, target = 'default', session_id=123)
After setup() runs successfully, it prints a table. Most of the information relates to the preprocessing pipeline built while setup() runs. The entries that deserve particular attention are:
• session_id: Random seed, which makes later results reproducible.
• Target Type: Binary or multiclass. The target type is detected automatically and displayed.
• Label Encoded: When the target variable is a string (for example "yes" or "no") rather than 1 or 0, the labels are automatically encoded as 1 and 0 and the mapping is displayed (0: no, 1: yes). No label encoding is needed in this experiment because the target variable is already numeric.
• Original Data: Shape of the original dataset. In this example, (22800, 24) means 22,800 samples and 24 features, including the target column.
• Missing Values: Shown as True when the original data has missing values. This example has none.
• Numeric Features: Number of features inferred as numeric. In this example, 14 of the 24 features are inferred as numeric.
• Categorical Features: Number of features inferred as categorical. In this example, 9 of the 24 features are inferred as categorical.
• Transformed Train Set: Shape of the training set after transformation. After preprocessing, the training set goes from (22800, 24) to (15959, 91); the number of features increases from 24 to 91 because of categorical encoding.
• Transformed Test Set: Shape of the test set after transformation. After preprocessing, the test set contains 6,841 samples. The split uses the default 70/30 ratio, which can be changed with the train_size parameter of setup().
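The jump from 24 to 91 features comes from categorical encoding: each categorical column is replaced by one column per distinct level. The sketch below shows that effect on a few made-up rows. The helper one_hot_columns is hypothetical and only illustrates the idea; it is not PyCaret's preprocessing code, and PyCaret's exact column naming and handling of special cases may differ.

```python
def one_hot_columns(rows, categorical):
    """Return the column names produced by one-hot encoding the `categorical`
    columns of `rows` (a list of dicts); other columns pass through unchanged."""
    columns = []
    for name in rows[0]:
        if name in categorical:
            levels = sorted({row[name] for row in rows})
            columns.extend(f"{name}_{level}" for level in levels)
        else:
            columns.append(name)
    return columns

# Made-up rows for illustration only, not records from the credit dataset.
rows = [
    {"LIMIT_BAL": 20000, "EDUCATION": 2, "MARRIAGE": 1},
    {"LIMIT_BAL": 90000, "EDUCATION": 1, "MARRIAGE": 2},
    {"LIMIT_BAL": 50000, "EDUCATION": 3, "MARRIAGE": 2},
]
encoded = one_hot_columns(rows, categorical={"EDUCATION", "MARRIAGE"})
print(len(rows[0]), "->", len(encoded))  # 3 -> 6
print(encoded)
```

Here two categorical columns with three and two levels turn three input columns into six, the same mechanism that widens the transformed training set.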
It is worth reading the PyCaret documentation to understand how steps such as missing value imputation and categorical variable encoding are handled automatically, and to learn about the other optional parameters.
3. Compare different models
Next we train every available model with default hyperparameters to see which ones suit our dataset. The output table lists, for each model, the average Accuracy, AUC, Recall, Precision, F1, Kappa and MCC over 10-fold cross-validation, together with the training time.
best_model = compare_models()
This single line of code trains and cross-validates more than 15 algorithms. The output table is sorted by Accuracy from high to low by default. The sort parameter changes the metric; for example, compare_models(sort = 'Recall') sorts by Recall. To change the number of folds from the default 10, use the fold parameter. For example, compare_models(fold = 5) compares all models with 5-fold cross-validation, which reduces training time.
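To make "the average over k folds" concrete, here is a minimal stdlib sketch of k-fold cross-validation. Everything in it is an assumption for illustration: the labels are toy values, the "model" is a trivial majority-class predictor, and the folds are contiguous and unshuffled. It is not how PyCaret implements cross-validation, only the averaging idea behind the table.

```python
import statistics

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def majority_class(labels):
    """'Train' a trivial model: always predict the most common label."""
    return max(set(labels), key=labels.count)

def cross_val_accuracy(labels, k):
    """Return the accuracy measured on each of the k held-out folds."""
    scores = []
    for train, test in k_fold_indices(len(labels), k):
        prediction = majority_class([labels[i] for i in train])
        correct = sum(1 for i in test if labels[i] == prediction)
        scores.append(correct / len(test))
    return scores

labels = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0] * 5   # toy labels, not the credit data
scores = cross_val_accuracy(labels, k=5)
print("mean:", statistics.mean(scores), "std:", statistics.pstdev(scores))
```

Fewer folds mean fewer training runs per model, which is why lowering fold shortens the comparison.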
print(best_model)
RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=123, solver='auto', tol=0.001)
By default, compare_models() returns only the best-performing model. The n_select parameter returns a list of the top N models instead.
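The effect of sort and n_select can be pictured as ranking a score table and keeping the first N rows. The sketch below assumes a hypothetical helper, rank_models, and made-up scores for three placeholder models. Unlike compare_models(), it returns model names rather than fitted models, and it always returns a list.

```python
def rank_models(results, sort="Accuracy", n_select=1):
    """Sort a {model_name: {metric: value}} table by `sort`, highest first,
    and return the names of the top `n_select` models."""
    ranked = sorted(results, key=lambda name: results[name][sort], reverse=True)
    return ranked[:n_select]

# Made-up scores for illustration only; they are not results from this tutorial.
results = {
    "model_a": {"Accuracy": 0.82, "Recall": 0.36},
    "model_b": {"Accuracy": 0.81, "Recall": 0.37},
    "model_c": {"Accuracy": 0.38, "Recall": 0.88},
}
print(rank_models(results))                             # ['model_a']
print(rank_models(results, sort="Recall", n_select=2))  # ['model_c', 'model_b']
```

The example also shows why the choice of metric matters: the model ranked last by Accuracy is ranked first by Recall.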
4. Create a model
PyCaret has 18 classifiers available.
Below we use a random forest as an example (it is only an illustration; random forest is not the best model here).
rf = create_model('rf')
Use print() to output the model parameters:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1, oob_score=False, random_state=123, verbose=0, warm_start=False)
The mean score shown for this model matches the score printed by compare_models(), because the metrics in the compare_models() table are the averages over the CV folds. As with compare_models(), the fold parameter changes the number of folds from the default 10. For example, create_model('dt', fold = 5) creates a decision tree classifier using 5-fold CV.
5. Hyperparameter tuning
When we create a model with create_model(), it is trained with the default hyperparameters. To optimize the model further, we use the tune_model() function, which automatically tunes the model's hyperparameters with a random grid search over a predefined search space. The custom_grid parameter lets us supply our own grid.
tuned_rf = tune_model(rf)
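A random grid search samples a limited number of parameter combinations from the grid instead of trying every one. The sketch below is a generic illustration of that idea, not tune_model()'s implementation. The search space, the toy_score function standing in for a cross-validated metric, and the helper name random_search are all assumptions.

```python
import random

def random_search(score_fn, grid, n_iter, seed):
    """Sample `n_iter` parameter combinations from `grid` at random and
    return the best (params, score) pair according to `score_fn`."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in grid.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space, for illustration only.
grid = {"max_depth": [3, 5, 7, 9], "n_estimators": [50, 100, 150, 200]}

def toy_score(params):
    # Stand-in for a cross-validated metric; highest at max_depth=5, n_estimators=150.
    return -abs(params["max_depth"] - 5) - abs(params["n_estimators"] - 150) / 50

best_params, best_score = random_search(toy_score, grid, n_iter=10, seed=123)
print(best_params, best_score)
```

With 10 draws from 16 possible combinations, the search may miss the best setting; that is the trade-off random search makes in exchange for a bounded runtime.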
By default, tune_model() optimizes Accuracy; the optimize parameter changes this. For example, tune_model(dt, optimize = 'AUC') selects hyperparameters based on AUC. This example uses the default Accuracy only for demonstration. Note that Accuracy is not a good metric when the dataset is imbalanced, as the dataset used here is. See this article [1] for more on the topic.
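A small worked example shows why Accuracy misleads on imbalanced data. The labels below are toy values with an assumed 78/22 class ratio, chosen only for illustration, and the "model" simply predicts the majority class for every row.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Share of true positives that the predictions actually catch."""
    predicted_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in predicted_for_positives) / len(predicted_for_positives)

# Toy labels: 78 non-defaults and 22 defaults (an assumed ratio for illustration).
y_true = [0] * 78 + [1] * 22
y_pred = [0] * 100   # a "model" that always predicts "no default"

print(accuracy(y_true, y_pred))  # 0.78
print(recall(y_true, y_pred))    # 0.0
```

The predictor looks respectable on Accuracy yet never identifies a single default, which is usually the case the model exists to catch.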
When choosing the final machine learning model, these metrics should not be the only criteria. Other factors to consider include training time and the standard deviation of the cross-validation scores.
6. Model visualization
The plot_model() function analyzes different aspects of model performance, such as AUC, the confusion matrix and the decision boundary. PyCaret supports 15 kinds of plots.
AUC Plot
plot_model(tuned_rf, plot = 'auc')
Precision-Recall Curve
plot_model(tuned_rf, plot = 'pr')
Feature Importance Plot
plot_model(tuned_rf, plot='feature')
Confusion Matrix
plot_model(tuned_rf, plot = 'confusion_matrix')
Another way to analyze model performance is the evaluate_model() function, which displays an interactive interface with all available plots for a given model.
7. Evaluate the model on the test set
Before finalizing the model, we also need to evaluate its performance on the test set. Below, we use the final model stored in the tuned_rf variable to predict on the test set (30% of the samples) and compute the metrics, to see whether they differ significantly from the cross-validation results.
predict_model(tuned_rf)
The random forest reaches an accuracy of 0.8116 on the test set, against 0.8203 from 10-fold cross-validation, so the two results are close. A large gap between the test-set and cross-validation results would suggest that the model is overfitting. Next, we finalize the model and test it on the independent validation set (the 5% of the data we held out at the start).
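The comparison above amounts to checking the gap between the two scores. The sketch below uses the two accuracy values from the text. The TOLERANCE threshold is a hypothetical value chosen for illustration; what counts as a large gap depends on the problem and on the spread of the fold scores.

```python
def generalization_gap(cv_score, test_score):
    """Return how far the hold-out score falls below the cross-validation score."""
    return cv_score - test_score

cv_accuracy = 0.8203     # 10-fold cross-validation accuracy from the text
test_accuracy = 0.8116   # hold-out test accuracy from the text
TOLERANCE = 0.02         # hypothetical threshold, not a PyCaret setting

gap = generalization_gap(cv_accuracy, test_accuracy)
print(f"gap = {gap:.4f}")  # gap = 0.0087
print("possible overfitting" if gap > TOLERANCE else "results are consistent")
```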
8. Finalize the model for deployment
This is the last modeling step. The finalize_model() function fits the model on the complete dataset, including the samples in the test set.
final_rf = finalize_model(tuned_rf)
# Final Random Forest model parameters for deployment
print(final_rf)
RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight={}, criterion='entropy', max_depth=5, max_features=1.0, max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0002, min_impurity_split=None, min_samples_leaf=5, min_samples_split=10, min_weight_fraction_leaf=0.0, n_estimators=150, n_jobs=-1, oob_score=False, random_state=123, verbose=0, warm_start=False)
9. Evaluate the model on the independent validation set
Finally, we use the predict_model() function to predict on the independent validation set.
unseen_predictions = predict_model(final_rf, data=data_unseen)
unseen_predictions.head()
The output adds two columns, Label and Score: Label is the prediction and Score is the predicted probability.
We can use this output to assess model performance further, with the check_metric function from the pycaret.utils module:
from pycaret.utils import check_metric
check_metric(unseen_predictions['default'], unseen_predictions['Label'], metric = 'Accuracy')
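For Accuracy, the comparison between the default and Label columns reduces to counting matching rows. The sketch below shows that calculation on a few toy rows. The helper accuracy_from_rows is hypothetical and the rows are made up; it illustrates the metric, not the internals of check_metric.

```python
def accuracy_from_rows(rows, actual="default", predicted="Label"):
    """Share of rows whose predicted label equals the actual label."""
    matches = sum(1 for row in rows if row[actual] == row[predicted])
    return matches / len(rows)

# Toy prediction rows for illustration, not the tutorial's output.
rows = [
    {"default": 0, "Label": 0, "Score": 0.85},
    {"default": 1, "Label": 0, "Score": 0.62},
    {"default": 1, "Label": 1, "Score": 0.71},
    {"default": 0, "Label": 0, "Score": 0.93},
]
print(accuracy_from_rows(rows))  # 0.75
```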
10. Save the model
Machine learning modeling is now complete. As the last step, we use save_model() to save the model.
save_model(final_rf,'Final RF Model 11Nov2020')
Transformation Pipeline and Model Succesfully Saved (Pipeline(memory=None, steps=[('dtypes', DataTypes_Auto_infer(categorical_features=[], display_types=True, features_todrop=[], id_columns=[], ml_usecase='classification', numerical_features=[], target='default', time_features=[])), ('imputer', Simple_Imputer(categorical_strategy='not_available', fill_value_categorical=None, fill_value_numerical=None, numeric_stra... RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight={}, criterion='entropy', max_depth=5, max_features=1.0, max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0002, min_impurity_split=None, min_samples_leaf=5, min_samples_split=10, min_weight_fraction_leaf=0.0, n_estimators=150, n_jobs=-1, oob_score=False, random_state=123, verbose=0, warm_start=False)]], verbose=False), 'Final RF Model 11Nov2020.pkl')
Tip: include the date in the file name when saving a model; it makes version control easier.
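One way to follow this tip is to build the name from the current date rather than typing it by hand. The helper dated_model_name below is a hypothetical convenience, not part of PyCaret; its result would be passed as the model name to save_model(). The month abbreviation from %b depends on the locale and is English in Python's default locale.

```python
from datetime import date

def dated_model_name(base, day):
    """Append the date as DDMonYYYY, for example 'Final RF Model 11Nov2020'."""
    return f"{base} {day.strftime('%d%b%Y')}"

print(dated_model_name("Final RF Model", date(2020, 11, 11)))  # Final RF Model 11Nov2020
print(dated_model_name("Final RF Model", date.today()))        # name for a model saved today
```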
Use the load_model() function to load the model.
saved_final_rf = load_model('Final RF Model 11Nov2020')
Transformation Pipeline and Model Successfully Loaded
After loading the model, we can again use predict_model() to make predictions.
new_prediction = predict_model(saved_final_rf, data=data_unseen)
from pycaret.utils import check_metric
check_metric(new_prediction['default'], new_prediction['Label'], metric = 'Accuracy')
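The point of saving and loading is that a restored model gives the same predictions as the one that was saved. The sketch below shows that round trip with the standard pickle module and a toy "model" that is just a dictionary of parameters. It is an assumption-laden illustration: it does not use PyCaret's own serialization code, and the file name toy_model.pkl is made up.

```python
import os
import pickle
import tempfile

def predict(params, values):
    """Toy prediction rule: 1 when the value exceeds the stored threshold."""
    return [1 if v > params["threshold"] else 0 for v in values]

model_params = {"threshold": 0.5}   # stand-in for a fitted model
inputs = [0.2, 0.7, 0.5, 0.9]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model_params, f)    # save
    with open(path, "rb") as f:
        restored_params = pickle.load(f)  # load

print(predict(restored_params, inputs) == predict(model_params, inputs))  # True
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.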
This tutorial covers almost the entire machine learning modeling workflow: getting data, preprocessing, training models, tuning hyperparameters, making predictions, and saving and loading the model. It shows that PyCaret is easy to use. Beyond the simple modeling in the example above, PyCaret also supports more advanced operations such as model ensembling. I recommend it to readers who already have some machine learning background, because the official documentation and tutorials are not yet very complete and need to be read alongside the documentation of scikit-learn and the other libraries PyCaret depends on.
Reference links
[1] https://medium.com/@MohammedS/performance-metrics-for-classification-problems-in-machine-learning-part-i-b085d432082b