[AI Practice] Building an air quality prediction model with xgboost.XGBRegressor (I)
2022-07-03 03:15:00 【szZack】
1、xgboost.XGBRegressor in detail
The full parameter reference for xgboost.XGBRegressor is available at https://xgboost.readthedocs.io/en/latest/python/python_api.html?highlight=XGBRegressor#xgboost.XGBRegressor
The XGBRegressor class:
class xgboost.XGBRegressor(*, objective='reg:squarederror', **kwargs)
Core parameters include:
- n_estimators (int) – Number of gradient boosted trees. Equivalent to number of boosting rounds.
- max_depth (Optional[int]) – Maximum tree depth for base learners.
- learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”).
- verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
- tree_method (Optional[str]) – Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from the parameters document tree method.
- n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
- gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
- min_child_weight (Optional[float]) – Minimum sum of instance weight (hessian) needed in a child.
- subsample (Optional[float]) – Subsample ratio of the training instance.
- colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
- scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
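To make these options concrete, here is a small illustrative construction of an XGBRegressor using the core parameters above; the values are arbitrary placeholders for demonstration, not tuned settings from this article:

import xgboost as xgb

# Illustrative only: every value below is an arbitrary placeholder,
# not a tuned setting for the air quality data used later.
model = xgb.XGBRegressor(
    n_estimators=300,        # number of boosting rounds (trees)
    max_depth=6,             # maximum depth of each base learner
    learning_rate=0.1,       # boosting learning rate (eta)
    min_child_weight=1,      # minimum hessian sum required in a child
    gamma=0.0,               # minimum loss reduction to allow a further split
    subsample=0.8,           # row subsampling ratio per tree
    colsample_bytree=0.8,    # column subsampling ratio per tree
    n_jobs=4,                # number of parallel threads
    verbosity=1,             # 0 (silent) .. 3 (debug)
)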
2、 Building an air quality prediction model with xgboost.XGBRegressor
2.1 Required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold  # k-fold cross-validation
from sklearn.model_selection import GridSearchCV  # grid search
from sklearn.metrics import make_scorer
import os
import sys
import time
import math
from sklearn.metrics import r2_score
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
import warnings
warnings.filterwarnings("ignore", category=FutureWarning, module="sklearn", lineno=193)
from sklearn.multioutput import MultiOutputRegressor
import xgboost as xgb
import joblib
from sklearn.preprocessing import MinMaxScaler
2.2 Build an air quality prediction model
Model
Use xgboost.XGBRegressor as the base model and wrap it in MultiOutputRegressor to produce a multi-step output over time (multi-target regression). The core code of the model is as follows:
def fit_model(self, x, y, learning_rate=0.05,
              n_estimators=500,
              max_depth=7,
              min_child_weight=1,
              gamma=0.0,
              subsample=0.8,
              colsample_bytree=0.8,
              scale_pos_weight=0.8):
    model = xgb.XGBRegressor(learning_rate=learning_rate,
                             n_estimators=n_estimators,
                             max_depth=max_depth,
                             min_child_weight=min_child_weight,
                             gamma=gamma,
                             subsample=subsample,
                             colsample_bytree=colsample_bytree,
                             scale_pos_weight=scale_pos_weight,
                             seed=42,
                             tree_method='gpu_hist',
                             gpu_id=2)
    multioutput = MultiOutputRegressor(model).fit(x, y)
    return multioutput
- Input x: shape (N, W, 24), where N is the number of days of data, W is the feature dimension, and 24 is the number of input hours per day (see the shape note after this list).
- Output y: shape (N, 24), where N is the number of days of data and 24 is the number of output hours per day.
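Note (an addition, not stated in the article): MultiOutputRegressor and XGBRegressor expect a 2-D feature matrix, so a (N, W, 24) array would normally be flattened to (N, W * 24) before it reaches fit_model. A minimal sketch with hypothetical sizes:

import numpy as np

# Hypothetical sizes for illustration: 100 days, 5 features, 24 hourly values each.
N, W = 100, 5
x = np.random.rand(N, W, 24)   # (N, W, 24) raw feature tensor
y = np.random.rand(N, 24)      # (N, 24) targets: the next 24 hourly values

# scikit-learn estimators expect 2-D features, so flatten each day's
# (W, 24) block into a single row of W * 24 values.
x_2d = x.reshape(N, -1)        # shape (N, W * 24)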
2.3 Core code
# Air quality model based on XGBRegressor
class AQXGB():
    def __init__(self, factor, n_input, n_output, version):
        self.n_input = n_input
        self.n_output = n_output
        self.version = version
        self.factor = factor  # air quality factor (e.g. O3)
        if not os.path.exists('./ml_data/'):  # directory for trained models and training artifacts
            os.mkdir('./ml_data/')

    def train(self, train_data_path, test_data_path):
        # load_data (not shown in this excerpt) builds the (x, y) arrays from the CSV files
        x, y = self.load_data(self.version, 'train', train_data_path, self.n_input, self.n_output)
        train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=2022)
        model = self.fit_model(train_x, train_y)
        pre_y = model.predict(test_x)
        # coefficient of determination (R squared) on the held-out split
        r2 = self.performance_metric(test_y, pre_y)
        print('test_r2 = ', r2)
        x, y = self.load_data(self.version, 'test', test_data_path, self.n_input, self.n_output)
        pre_y = model.predict(x)
        r2 = self.performance_metric(y, pre_y)
        print('val_r2 = ', r2)
        # save the model
        joblib.dump(model, './ml_data/xgb_%s_%d_%d_%s.model' % (self.factor, self.n_input, self.n_output, self.version))

    def performance_metric(self, y_true, y_predict):
        # choose the evaluation metrics as needed
        # R squared
        score = r2_score(y_true, y_predict)
        # RMSE (square root of the mean squared error)
        MSE = np.mean((y_predict - y_true) ** 2)
        print('RMSE: ', MSE ** 0.5)
        # MAE
        MAE = np.mean(np.abs(y_predict - y_true))
        print('MAE: ', MAE)
        # SMAPE
        SMAPE = self.smape(y_true, y_predict)
        print('SMAPE: ', SMAPE)
        return score

    def smape(self, A, F):
        A = A.reshape(-1)
        F = F.reshape(-1)
        return 1.0 / len(A) * np.sum(2 * np.abs(F - A) / (np.abs(A) + np.abs(F)))

    def fit_model(self, x, y, learning_rate=0.05,
                  n_estimators=500,
                  max_depth=7,
                  min_child_weight=1,
                  gamma=0.0,
                  subsample=0.8,
                  colsample_bytree=0.8,
                  scale_pos_weight=0.8):
        model = xgb.XGBRegressor(learning_rate=learning_rate,
                                 n_estimators=n_estimators,
                                 max_depth=max_depth,
                                 min_child_weight=min_child_weight,
                                 gamma=gamma,
                                 subsample=subsample,
                                 colsample_bytree=colsample_bytree,
                                 scale_pos_weight=scale_pos_weight,
                                 seed=42,
                                 tree_method='gpu_hist',  # train on GPU
                                 gpu_id=2)
        multioutput = MultiOutputRegressor(model).fit(x, y)
        return multioutput
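The trained MultiOutputRegressor is saved with joblib, so it can be loaded back for inference as sketched below. This usage example is not part of the original article: the file name simply follows the naming pattern used in train(), and the input width assumes the features were flattened to W * 24 columns (W = 5 is a placeholder):

import joblib
import numpy as np

# Load a model saved by AQXGB.train(); the path follows the pattern
# './ml_data/xgb_<factor>_<n_input>_<n_output>_<version>.model'.
model = joblib.load('./ml_data/xgb_O3_24_24_v2.model')

# One flattened input sample with W * 24 feature values (W = 5 is a placeholder).
x_new = np.random.rand(1, 5 * 24)

# Predicted O3 values for the next 24 hours, shape (1, 24).
pred_24h = model.predict(x_new)
print(pred_24h.shape)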
2.4 Model training
Training code
if __name__ == "__main__":
    if len(sys.argv) == 7:
        # Train the model, e.g.:
        # python3 src/train_xgb_model.py data/train_data.csv data/test_data.csv O3 24 24 v2
        aq_model = AQXGB(sys.argv[3], int(sys.argv[4]), int(sys.argv[5]), sys.argv[6])
        aq_model.train(sys.argv[1], sys.argv[2])
Training script
Take the past 24 hours of data as input features and output the O3 forecast for the next 24 hours:
python3 src/train_xgb_model.py data/train_data.csv data/test_data.csv O3 24 24 v2
2.5 Data format
- Data format: CSV file
- Example:
air_pressure,CO,humidity,AQI,monitoring_time,NO2,O3,PM10,PM25,SO2,station_number,air_temperature,wind_direction,wind_speed,longitude,latitude,station_type_name
1013.0,0.3,59.0,69.0,2019-02-01 00:00:00,15.0,80.0,88.0,26.0,8.0,xxx Monitoring stations ,-0.4,205.8,1.1,116.97810856433719,36.61655020673796,shik
1013.0,0.3,58.0,68.0,2019-02-01 01:00:00,15.0,80.0,86.0,26.0,8.0,xxx Monitoring stations ,-0.5,179.4,1.0,116.97810856433719,36.61655020673796,shik
1012.0,0.3,62.0,72.0,2019-02-01 02:00:00,15.0,80.0,94.0,26.0,8.0,xxx Monitoring stations ,-0.9,175.7,0.8,116.97810856433719,36.61655020673796,shik
1011.0,0.3,64.0,76.0,2019-02-01 03:00:00,15.0,80.0,102.0,26.0,8.0,xxx Monitoring stations ,-1.0,166.9,0.9,116.97810856433719,36.61655020673796,shik
1011.0,0.3,65.0,80.0,2019-02-01 04:00:00,15.0,80.0,110.0,26.0,8.0,xxx Monitoring stations ,-0.8,191.1,0.9,116.97810856433719,36.61655020673796,shik
1011.0,0.3,66.0,84.0,2019-02-01 05:00:00,15.0,80.0,117.0,26.0,8.0,xxx Monitoring stations ,-1.1,211.4,1.0,116.97810856433719,36.61655020673796,shik
1011.0,0.3,68.0,85.0,2019-02-01 06:00:00,15.0,80.0,119.0,26.0,8.0,xxx Monitoring stations ,-1.4,137.3,1.3,116.97810856433719,36.61655020673796,shik
1011.0,0.3,68.0,65.75,2019-02-01 07:00:00,15.0,80.0,130.6,26.0,8.0,xxx Monitoring stations ,-1.3,147.0,1.5,116.97810856433719,36.61655020673796,shik
1011.0,0.3,58.0,46.5,2019-02-01 08:00:00,15.0,80.0,142.2,26.0,8.0,xxx Monitoring stations ,0.7,157.0,1.4,116.97810856433719,36.61655020673796,shik
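The article does not show how load_data turns this hourly CSV into (x, y) samples. Purely as a hypothetical sketch of one possible approach (column names follow the CSV header above; the windowing logic is an assumption, not the author's implementation):

import numpy as np
import pandas as pd

def make_daily_samples(csv_path, factor='O3',
                       feature_cols=('air_pressure', 'CO', 'humidity', 'NO2', 'O3',
                                     'PM10', 'PM25', 'SO2', 'air_temperature', 'wind_speed')):
    # Hypothetical helper, not the article's load_data(): pair each complete day's
    # 24 hourly feature rows with the next day's 24 hourly values of `factor`.
    df = pd.read_csv(csv_path, parse_dates=['monitoring_time'])
    df = df.sort_values('monitoring_time')
    # keep only days with a full set of 24 hourly records (gaps are not handled here)
    days = [g for _, g in df.groupby(df['monitoring_time'].dt.date) if len(g) == 24]

    x_list, y_list = [], []
    for today, tomorrow in zip(days[:-1], days[1:]):
        x_list.append(today[list(feature_cols)].to_numpy().T.reshape(-1))  # (W * 24,)
        y_list.append(tomorrow[factor].to_numpy())                          # (24,)
    return np.array(x_list), np.array(y_list)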
3、 Other references
[AI Practice] Accelerating XGBRegressor model training: training XGBRegressor in seconds with a GPU
[AI Practice] Tunable parameters of xgb.XGBRegressor multi-output regression with MultiOutputRegressor (1)