当前位置：网站首页>[AI practice] Application xgboost Xgbregressor builds air quality prediction model (II)

[AI practice] Application xgboost Xgbregressor builds air quality prediction model (II)

2022-07-07 14:04:00 【szZack】

Last one ：【AI actual combat 】 application xgboost.XGBRegressor Build an air quality prediction model （ One ）

This chapter mainly explains the data feature processing .

1、 features

With PM2.5 Explain the artificial characteristics as an example ：

#  features ：
# PM25 + PM25 Adjacent difference  + PM25 The maximum of 、 minimum value 、 Recent value 、 Average 、 Median 、 The difference between the maximum and minimum values 、 Standard deviation 、 variance  + 
#  Time characteristics （ month 、 Japan 、 Hours 、 week ） +  Humidity difference 2, The temperature difference 2, The wind direction is different 2, Wind speed difference 2 +  pressure  +  temperature  +  humidity  +  The wind speed  +  wind direction  +  Site features （ Longitude and latitude ）  +  Humidity difference 1, The temperature difference 1, The wind direction is different 1, Wind speed difference 1
#  In the future （ Time characteristics （ month 、 Japan 、 Hours ） +  Humidity difference 2, The temperature difference 2, The wind direction is different 2, Wind speed difference 2 +  pressure  +  temperature  +  humidity  +  The wind speed  +  wind direction ）   +  Humidity difference 1, The temperature difference 1, The wind direction is different 1, Wind speed difference 1
#  Yes   pressure  +  temperature  +  humidity  +  The wind speed   All increase   Maximum 、 minimum value 、 Recent value 、 Average 、 Median 、 The difference between the maximum and minimum values 、 Standard deviation 、 variance 

# 【 a key 】 Time characteristics use   A point on the circumference of a two-dimensional plane   It means 
#          Wind direction characteristics are used   A point on the circumference of a two-dimensional plane   It means

2、 Core code

Data feature processing code ：

    def load_data(self, data_type, data_path, n_input, n_output):
        #  features ：
        # PM25 + PM25 Adjacent difference  + PM25 The maximum of 、 minimum value 、 Recent value 、 Average 、 Median 、 The difference between the maximum and minimum values 、 Standard deviation 、 variance  + 
        #  Time characteristics （ month 、 Japan 、 Hours 、 week ） +  Humidity difference 2, The temperature difference 2, The wind direction is different 2, Wind speed difference 2 +  pressure  +  temperature  +  humidity  +  The wind speed  +  wind direction  +  Site features （ Longitude and latitude ） +  Humidity difference 1, The temperature difference 1, The wind direction is different 1, Wind speed difference 1
        #  In the future （ Time characteristics （ month 、 Japan 、 Hours ） +  Humidity difference 2, The temperature difference 2, The wind direction is different 2, Wind speed difference 2 +  pressure  +  temperature  +  humidity  +  The wind speed  +  wind direction ） +  Humidity difference 1, The temperature difference 1, The wind direction is different 1, Wind speed difference 1
        #  Yes   pressure  +  temperature  +  humidity  +  The wind speed   All increase   Maximum 、 minimum value 、 Recent value 、 Average 、 Median 、 The difference between the maximum and minimum values 、 Standard deviation 、 variance 
        
        # 【 a key 】 Time characteristics use   A point on the circumference of a two-dimensional plane   It means 
        #  Wind direction characteristics are used   A point on the circumference of a two-dimensional plane   It means 
        
        #  Field header of data file ：
        # air_pressure,CO,humidity,AQI,monitoring_time,NO2,O3,PM10,PM25,SO2,station_number,air_temperature,wind_direction,wind_speed,longitude,latitude,station_type_name
        
        usecols=['air_pressure','humidity','monitoring_time',self.factor,'station_number','air_temperature','wind_direction','wind_speed','longitude','latitude']
        df = pd.read_csv(data_path, usecols=usecols, low_memory=False)
        
        station_list = list(set(df['station_number'].values.tolist()))
        station_list.sort()
        print('station_list', station_list)
        
        # Time characteristics 
        df['monitoring_time'] = pd.to_datetime(df['monitoring_time'])
        #df['year'] = df['monitoring_time'].map(lambda x: (x.year))
        df['month_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.month, 12)[0])
        df['day_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.day,31)[0])
        df['hour_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.hour,24)[0])
        df['dayofweek_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.dayofweek+1,7)[0])
        df['month_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.month, 12)[1])
        df['day_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.day,31)[1])
        df['hour_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.hour,24)[1])
        df['dayofweek_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.dayofweek+1,7)[1])
        
        # Calculate with the previous day 23 Time difference 
        # Calculation ：humidity_diff  Humidity difference ,air_temperature_diff  The temperature difference ,wind_direction_diff  The wind direction is different ,wind_speed_diff  Wind speed difference 
        df = self.calcu_diff(df, 'humidity')
        df = self.calcu_diff(df, 'air_temperature')
        #df = self.calcu_diff(df, 'wind_direction')
        df = self.calcu_diff(df, 'wind_speed')
        df = self.calcu_diff(df, self.factor)
        
        # Calculate the adjacent difference 
        # Calculation 2：humidity_diff  Humidity difference ,air_temperature_diff  The temperature difference ,wind_direction_diff  The wind direction is different ,wind_speed_diff  Wind speed difference 
        df = self.calcu_diff2(df, 'humidity')
        df = self.calcu_diff2(df, 'air_temperature')
        #df = self.calcu_diff2(df, 'wind_direction')
        df = self.calcu_diff2(df, 'wind_speed')
        df = self.calcu_diff2(df, self.factor)

        # Wind direction use   A point on the circumference of a two-dimensional plane   It means 
        # The wind direction is divided into  8  Directions 
        df['wind_direction_x'] = df['wind_direction'].map(lambda x: self.to_periodic_feature(int(x//45), 8)[0])
        df['wind_direction_y'] = df['wind_direction'].map(lambda x: self.to_periodic_feature(int(x//45), 8)[1])
        
        # Calculation O3 The maximum of 、 minimum value 、 Recent value 、 Average 、 Median 、 The difference between the maximum and minimum values 
        df = self.calcu_value_feature3(df, self.factor)
        #  pressure  +  temperature  +  humidity  +  The wind speed   Maximum 、 minimum value 、 Recent value 、 Average 、 Median 、 The difference between the maximum and minimum values 
        df = self.calcu_value_feature3(df, 'humidity')
        df = self.calcu_value_feature3(df, 'air_temperature')
        df = self.calcu_value_feature3(df, 'wind_speed')
        df = self.calcu_value_feature3(df, 'air_pressure')
        
        # First calculate the total quantity 
        n_total = 0
        n_step = 24# Take the step of data 
        for site_id in station_list:
            site_df = df[(df['station_number'] == site_id)]
            for i in range(0, site_df.shape[0] - self.n_output - self.n_input, n_step):
                n_total += 1
        print('n_total:', n_total)
        
        # X The length of ：
        x_len = (8 + 6 + 3 +3+ 2+8+2 + 4*8) + 1 + (8+6+3+3 + 4*8)
        X = np.ones((n_total, self.n_input, x_len), dtype=np.float32)
        Y = np.ones((n_total, self.n_output, 1), dtype=np.float32)
        
        print('*'*20)
        print('X.shape', X.shape, 'Y.shape', Y.shape)
        print(df.shape)
        print(df.head())
        n = 0
        x_feat = ['month_x','day_x','hour_x', 'dayofweek_x'] + \
                ['month_y','day_y','hour_y', 'dayofweek_y'] + \
                ['air_pressure','humidity', 'air_temperature', 'wind_direction_x', 'wind_direction_y', 'wind_speed'] + \
                ['humidity_diff', 'air_temperature_diff', 'wind_speed_diff'] + \
                ['humidity_diff2', 'air_temperature_diff2', 'wind_speed_diff2'] + \
                [i+'_max' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
                [i+'_min' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
                [i+'_recent' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
                [i+'_mean' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
                [i+'_median' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
                [i+'_max_min_diff' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
                [i+'_std' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
                [i+'_var' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
                ['longitude', 'latitude'] + \
                [self.factor + '_diff', self.factor + '_diff2', self.factor + '_max', self.factor + '_min', self.factor + '_recent', self.factor + '_mean', self.factor + '_median', self.factor + '_max_min_diff', self.factor + '_std', self.factor + '_var'] + \
                [self.factor]
        for site_id in station_list:
            site_df = df[(df['station_number'] == site_id)]
            print('site_id:', site_id)
            for i in range(0, site_df.shape[0] - self.n_output - self.n_input, n_step):
                X[n] = np.hstack((site_df.loc[site_df.index[i: i+self.n_input], x_feat].values, \
                        site_df.loc[site_df.index[i+self.n_output: i+self.n_input+self.n_output], x_feat[:-13]].values))
                Y[n] = site_df.loc[site_df.index[i+self.n_input: i+self.n_input+self.n_output], [self.factor]].values
                n += 1
        
        X = X.reshape(n_total, -1)
        Y = Y.reshape(n_total, -1)
        
        np.save('./ml_data/%s_%s_X-%d-%d-%s.npy' %(self.factor, data_type, self.n_input, self.n_output, self.version), X)
        np.save('./ml_data/%s_%s_Y-%d-%d-%s.npy' %(self.factor, data_type, self.n_input, self.n_output, self.version), Y)
        
        return X, Y
		
    def calcu_diff(self, df, field):
        
        df['tmp'] = df[field].copy()
        
        tmp_list = df.loc[df['tmp'].index[23::24], 'tmp'].tolist()
        tmp_list.insert(0, df.loc[0, field])
        tmp_list.pop(-1)
        output = [val for val in tmp_list for _ in range(24)]
        df['tmp'] = output
        
        df[field + '_diff'] = df[field] - df['tmp']
        
        df.drop(['tmp'], axis=1, inplace=True)
        
        return df
        
    def calcu_diff2(self, df, field):
        # The difference between the value of the current hour and the value of the previous hour 
        
        df['tmp'] = df[field].copy()
        df['tmp'] = df['tmp'].shift(1)
        df.loc[0, 'tmp'] = df.loc[1, 'tmp']
        
        df[field + '_diff2'] = df[field] - df['tmp']
        print(df.head())
        df.drop(['tmp'], axis=1, inplace=True)
        
        return df
        
    def calcu_value_feature(self, df, field):
        # Calculation O3 The maximum of 、 minimum value 、 Recent value 、 Average 、 Median 
        
        tmp_list = df[field].tolist()
        max_val_list = [max(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
        min_val_list = [min(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
        recent_val_list = [tmp_list[i+23] for i in range(0, len(tmp_list), 24) for j in range(24)]
        mean_val_list = [sum(tmp_list[i:i+24])/24.0 for i in range(0, len(tmp_list), 24) for j in range(24)]
        median_val_list = [self.get_median(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
        
        df[field + '_max'] = max_val_list
        df[field + '_min'] = min_val_list
        df[field + '_recent'] = recent_val_list
        df[field + '_mean'] = mean_val_list
        df[field + '_median'] = median_val_list
        
        return df

    def calcu_value_feature3(self, df, field):
        # Calculation O3 The maximum of 、 minimum value 、 Recent value 、 Average 、 Median 、 The difference between the maximum and minimum values 、 Standard deviation 、 variance 
        
        tmp_list = df[field].tolist()
        max_val_list = [max(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
        min_val_list = [min(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
        recent_val_list = [tmp_list[i+23] for i in range(0, len(tmp_list), 24) for j in range(24)]
        mean_val_list = [sum(tmp_list[i:i+24])/24.0 for i in range(0, len(tmp_list), 24) for j in range(24)]
        t=time.time()
        median_val_list = [self.get_median(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
        std_val_list = [self.get_std(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
        var_val_list = [self.get_var(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
        
        df[field + '_max'] = max_val_list
        df[field + '_min'] = min_val_list
        df[field + '_recent'] = recent_val_list
        df[field + '_mean'] = mean_val_list
        df[field + '_median'] = median_val_list
        df[field + '_max_min_diff'] = df[field + '_max'] - df[field + '_min']
        df[field + '_std'] = std_val_list
        df[field + '_var'] = var_val_list
        
        return df

3、 Model training effect

With smape To measure the effectiveness of the model , give the result as follows ：

factor	smape
PM25	0.28
PM10	0.29
O3	0.316

【 notes 】 The above result is after After model parameter adjustment The result ,xgb.XGBRegressor The parameter adjustment process of can refer to my following article