[AI practice] Building an air quality prediction model with xgboost.XGBRegressor (II)

2022-07-07 14:04:00 szZack

Previous post: [AI practice] Building an air quality prediction model with xgboost.XGBRegressor (I)

This post covers the feature processing of the data.

1. Features

The hand-crafted features are explained using PM2.5 as an example:

# Features:
# PM25 + PM25 adjacent difference + PM25 maximum, minimum, most recent value, mean, median, max-min difference, standard deviation, variance +
# time features (month, day, hour, day of week) + humidity difference 2, temperature difference 2, wind direction difference 2, wind speed difference 2 + pressure + temperature + humidity + wind speed + wind direction + station features (longitude and latitude) + humidity difference 1, temperature difference 1, wind direction difference 1, wind speed difference 1
# Future period: (time features (month, day, hour) + humidity difference 2, temperature difference 2, wind direction difference 2, wind speed difference 2 + pressure + temperature + humidity + wind speed + wind direction) + humidity difference 1, temperature difference 1, wind direction difference 1, wind speed difference 1
# For pressure, temperature, humidity and wind speed, also add: maximum, minimum, most recent value, mean, median, max-min difference, standard deviation, variance

# [Key point] Time features are represented as a point on a circle in the two-dimensional plane.
#             Wind direction features are represented the same way.
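
The listing in the next section calls `self.to_periodic_feature(value, period)` and reads elements `[0]` and `[1]` of the result, but the helper itself is not shown in this post. A minimal sketch of such a circular encoding could look like the following. The cosine/sine order and the standalone function form are assumptions, not the author's code.

```python
import math

def to_periodic_feature(value, period):
    """Hypothetical helper: map a cyclic value to a point on the unit circle.

    Returns (x, y) = (cos, sin) of the angle 2*pi*value/period.
    The component order is an assumption; the original helper is not shown.
    """
    angle = 2.0 * math.pi * value / period
    return math.cos(angle), math.sin(angle)

# Hour 23 and hour 0 are far apart as raw numbers but adjacent on the circle.
x23, y23 = to_periodic_feature(23, 24)
x0, y0 = to_periodic_feature(0, 24)
distance = math.hypot(x23 - x0, y23 - y0)
print('distance between hour 23 and hour 0:', distance)
```

With this kind of encoding, values at the two ends of a cycle (hour 23 and hour 0, December and January, 350 and 10 degrees of wind direction) end up close together in feature space.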

2. Core code

Feature processing code:

    def load_data(self, data_type, data_path, n_input, n_output):
        # Features:
        # PM25 + PM25 adjacent difference + PM25 maximum, minimum, most recent value, mean, median, max-min difference, standard deviation, variance +
        # time features (month, day, hour, day of week) + humidity difference 2, temperature difference 2, wind direction difference 2, wind speed difference 2 + pressure + temperature + humidity + wind speed + wind direction + station features (longitude and latitude) + humidity difference 1, temperature difference 1, wind direction difference 1, wind speed difference 1
        # Future period: (time features (month, day, hour) + humidity difference 2, temperature difference 2, wind direction difference 2, wind speed difference 2 + pressure + temperature + humidity + wind speed + wind direction) + humidity difference 1, temperature difference 1, wind direction difference 1, wind speed difference 1
        # For pressure, temperature, humidity and wind speed, also add: maximum, minimum, most recent value, mean, median, max-min difference, standard deviation, variance

        # [Key point] Time features are represented as a point on a circle in the two-dimensional plane.
        # Wind direction features are represented the same way.

        # Column header of the data file:
        # air_pressure,CO,humidity,AQI,monitoring_time,NO2,O3,PM10,PM25,SO2,station_number,air_temperature,wind_direction,wind_speed,longitude,latitude,station_type_name
        
        usecols=['air_pressure','humidity','monitoring_time',self.factor,'station_number','air_temperature','wind_direction','wind_speed','longitude','latitude']
        df = pd.read_csv(data_path, usecols=usecols, low_memory=False)
        
        station_list = list(set(df['station_number'].values.tolist()))
        station_list.sort()
        print('station_list', station_list)
        
        # Time features, each encoded as an (x, y) point on a circle
        df['monitoring_time'] = pd.to_datetime(df['monitoring_time'])
        #df['year'] = df['monitoring_time'].map(lambda x: (x.year))
        df['month_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.month, 12)[0])
        df['day_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.day,31)[0])
        df['hour_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.hour,24)[0])
        df['dayofweek_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.dayofweek+1,7)[0])
        df['month_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.month, 12)[1])
        df['day_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.day,31)[1])
        df['hour_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.hour,24)[1])
        df['dayofweek_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.dayofweek+1,7)[1])
        
        # Difference from the value at 23:00 of the previous day.
        # Adds humidity_diff, air_temperature_diff, wind_speed_diff and <factor>_diff (the wind direction difference is disabled).
        df = self.calcu_diff(df, 'humidity')
        df = self.calcu_diff(df, 'air_temperature')
        #df = self.calcu_diff(df, 'wind_direction')
        df = self.calcu_diff(df, 'wind_speed')
        df = self.calcu_diff(df, self.factor)
        
        # Difference between adjacent hours.
        # Adds humidity_diff2, air_temperature_diff2, wind_speed_diff2 and <factor>_diff2 (the wind direction difference is disabled).
        df = self.calcu_diff2(df, 'humidity')
        df = self.calcu_diff2(df, 'air_temperature')
        #df = self.calcu_diff2(df, 'wind_direction')
        df = self.calcu_diff2(df, 'wind_speed')
        df = self.calcu_diff2(df, self.factor)

        # Wind direction is encoded as a point on a circle in the two-dimensional plane,
        # after binning it into 8 sectors of 45 degrees each.
        df['wind_direction_x'] = df['wind_direction'].map(lambda x: self.to_periodic_feature(int(x//45), 8)[0])
        df['wind_direction_y'] = df['wind_direction'].map(lambda x: self.to_periodic_feature(int(x//45), 8)[1])
        
        # Target factor: maximum, minimum, most recent value, mean, median, max-min difference, standard deviation, variance
        df = self.calcu_value_feature3(df, self.factor)
        # Same statistics for humidity, temperature, wind speed and pressure
        df = self.calcu_value_feature3(df, 'humidity')
        df = self.calcu_value_feature3(df, 'air_temperature')
        df = self.calcu_value_feature3(df, 'wind_speed')
        df = self.calcu_value_feature3(df, 'air_pressure')
        
        # First count the total number of samples
        n_total = 0
        n_step = 24  # stride between the start rows of consecutive samples
        for site_id in station_list:
            site_df = df[(df['station_number'] == site_id)]
            for i in range(0, site_df.shape[0] - self.n_output - self.n_input, n_step):
                n_total += 1
        print('n_total:', n_total)
        
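        # Feature count per time step: all 65 columns of x_feat for the input window,
        # plus 52 columns (x_feat[:-13], i.e. without the station coordinates and the target-factor columns)
        # taken from the same-length window shifted forward by n_output rows.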
        x_len = (8 + 6 + 3 +3+ 2+8+2 + 4*8) + 1 + (8+6+3+3 + 4*8)
        X = np.ones((n_total, self.n_input, x_len), dtype=np.float32)
        Y = np.ones((n_total, self.n_output, 1), dtype=np.float32)
        
        print('*'*20)
        print('X.shape', X.shape, 'Y.shape', Y.shape)
        print(df.shape)
        print(df.head())
        n = 0
        x_feat = ['month_x','day_x','hour_x', 'dayofweek_x'] + \
                ['month_y','day_y','hour_y', 'dayofweek_y'] + \
                ['air_pressure','humidity', 'air_temperature', 'wind_direction_x', 'wind_direction_y', 'wind_speed'] + \
                ['humidity_diff', 'air_temperature_diff', 'wind_speed_diff'] + \
                ['humidity_diff2', 'air_temperature_diff2', 'wind_speed_diff2'] + \
                [i+'_max' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
                [i+'_min' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
                [i+'_recent' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
                [i+'_mean' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
                [i+'_median' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
                [i+'_max_min_diff' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
                [i+'_std' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
                [i+'_var' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
                ['longitude', 'latitude'] + \
                [self.factor + '_diff', self.factor + '_diff2', self.factor + '_max', self.factor + '_min', self.factor + '_recent', self.factor + '_mean', self.factor + '_median', self.factor + '_max_min_diff', self.factor + '_std', self.factor + '_var'] + \
                [self.factor]
        for site_id in station_list:
            site_df = df[(df['station_number'] == site_id)]
            print('site_id:', site_id)
            for i in range(0, site_df.shape[0] - self.n_output - self.n_input, n_step):
                X[n] = np.hstack((site_df.loc[site_df.index[i: i+self.n_input], x_feat].values, \
                        site_df.loc[site_df.index[i+self.n_output: i+self.n_input+self.n_output], x_feat[:-13]].values))
                Y[n] = site_df.loc[site_df.index[i+self.n_input: i+self.n_input+self.n_output], [self.factor]].values
                n += 1
        
        X = X.reshape(n_total, -1)
        Y = Y.reshape(n_total, -1)
        
        np.save('./ml_data/%s_%s_X-%d-%d-%s.npy' %(self.factor, data_type, self.n_input, self.n_output, self.version), X)
        np.save('./ml_data/%s_%s_Y-%d-%d-%s.npy' %(self.factor, data_type, self.n_input, self.n_output, self.version), Y)
        
        return X, Y
		
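    def calcu_diff(self, df, field):
        # Difference between each value and the last value (row 23) of the previous 24-row block,
        # i.e. the 23:00 reading of the previous day. The first block uses the first value of the column.
        # Assumes a default integer index starting at 0 and a row count that is a multiple of 24.
        df['tmp'] = df[field].copy()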
        
        tmp_list = df.loc[df['tmp'].index[23::24], 'tmp'].tolist()
        tmp_list.insert(0, df.loc[0, field])
        tmp_list.pop(-1)
        output = [val for val in tmp_list for _ in range(24)]
        df['tmp'] = output
        
        df[field + '_diff'] = df[field] - df['tmp']
        
        df.drop(['tmp'], axis=1, inplace=True)
        
        return df
        
    def calcu_diff2(self, df, field):
        # The difference between the value of the current hour and the value of the previous hour 
        
        df['tmp'] = df[field].copy()
        df['tmp'] = df['tmp'].shift(1)
        df.loc[0, 'tmp'] = df.loc[1, 'tmp']
        
        df[field + '_diff2'] = df[field] - df['tmp']
        print(df.head())
        df.drop(['tmp'], axis=1, inplace=True)
        
        return df
        
    def calcu_value_feature(self, df, field):
        # Maximum, minimum, most recent value, mean and median of `field` over each 24-row block
        
        tmp_list = df[field].tolist()
        max_val_list = [max(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
        min_val_list = [min(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
        recent_val_list = [tmp_list[i+23] for i in range(0, len(tmp_list), 24) for j in range(24)]
        mean_val_list = [sum(tmp_list[i:i+24])/24.0 for i in range(0, len(tmp_list), 24) for j in range(24)]
        median_val_list = [self.get_median(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
        
        df[field + '_max'] = max_val_list
        df[field + '_min'] = min_val_list
        df[field + '_recent'] = recent_val_list
        df[field + '_mean'] = mean_val_list
        df[field + '_median'] = median_val_list
        
        return df

    def calcu_value_feature3(self, df, field):
        # Maximum, minimum, most recent value, mean, median, max-min difference, standard deviation and variance of `field` over each 24-row block
        
        tmp_list = df[field].tolist()
        max_val_list = [max(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
        min_val_list = [min(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
        recent_val_list = [tmp_list[i+23] for i in range(0, len(tmp_list), 24) for j in range(24)]
        mean_val_list = [sum(tmp_list[i:i+24])/24.0 for i in range(0, len(tmp_list), 24) for j in range(24)]
        median_val_list = [self.get_median(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
        std_val_list = [self.get_std(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
        var_val_list = [self.get_var(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
        
        df[field + '_max'] = max_val_list
        df[field + '_min'] = min_val_list
        df[field + '_recent'] = recent_val_list
        df[field + '_mean'] = mean_val_list
        df[field + '_median'] = median_val_list
        df[field + '_max_min_diff'] = df[field + '_max'] - df[field + '_min']
        df[field + '_std'] = std_val_list
        df[field + '_var'] = var_val_list
        
        return df
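
The helpers `get_median`, `get_std` and `get_var` used above are not included in this post. The sketch below shows one plausible shape for them and a NumPy version of the per-24-row block statistics. It is an illustration under stated assumptions, not the author's implementation: the function `block_stats` is a hypothetical name, and population statistics (`ddof=0`) are assumed because the post does not say which variant is used.

```python
import numpy as np

def get_median(values):
    """Hypothetical stand-in for self.get_median."""
    return float(np.median(values))

def get_std(values):
    """Hypothetical stand-in for self.get_std (population std, ddof=0, assumed)."""
    return float(np.std(values))

def get_var(values):
    """Hypothetical stand-in for self.get_var (population variance, ddof=0, assumed)."""
    return float(np.var(values))

def block_stats(values, block=24):
    """Per-block statistics, each repeated `block` times so that the
    result lines up row by row with the input column."""
    arr = np.asarray(values, dtype=float)
    if arr.size % block != 0:
        # The listing indexes tmp_list[i+23], so it also needs full blocks.
        raise ValueError('length must be a multiple of the block size')
    blocks = arr.reshape(-1, block)
    per_block = {
        'max': blocks.max(axis=1),
        'min': blocks.min(axis=1),
        'recent': blocks[:, -1],
        'mean': blocks.mean(axis=1),
        'median': np.median(blocks, axis=1),
        'std': blocks.std(axis=1),
        'var': blocks.var(axis=1),
    }
    per_block['max_min_diff'] = per_block['max'] - per_block['min']
    return {name: np.repeat(col, block) for name, col in per_block.items()}

# Two "days" of hourly values 0..47.
stats = block_stats(list(range(48)))
print(stats['max'][:2], stats['max'][24:26])
```

Note that the blocks here, as in the listing, are cut from the whole column in row order, so they match calendar days only if every station contributes complete days starting at hour 0.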

3. Model results

Model quality is measured with SMAPE. The results are as follows:

factor   SMAPE
PM25     0.28
PM10     0.29
O3       0.316

[Note] The results above were obtained after tuning the model parameters. For the xgb.XGBRegressor tuning process, see the articles listed in section 4.
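
The post does not say which SMAPE variant was used. As a rough illustration, the sketch below implements one common definition, the mean of |pred - true| / ((|true| + |pred|) / 2), reported as a fraction rather than a percentage. The function name and the `eps` guard against division by zero are assumptions.

```python
import numpy as np

def smape(y_true, y_pred, eps=1e-8):
    """Symmetric mean absolute percentage error as a fraction in [0, 2].

    One common variant; the exact formula behind the table above is not stated.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # eps keeps the ratio finite when both values are zero
    return float(np.mean(np.abs(y_pred - y_true) / np.maximum(denom, eps)))

score = smape([100.0, 100.0], [50.0, 150.0])
print('smape:', score)
```

Under this definition, lower is better and 0 means a perfect prediction.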

4. Other references

[AI practice] Accelerating XGBRegressor training: training XGBRegressor in seconds on a GPU

[AI practice] xgb.XGBRegressor multi-output regression with MultiOutputRegressor: parameter tuning 1

[AI practice] xgb.XGBRegressor multi-output regression with MultiOutputRegressor: parameter tuning 2 (training the model on a GPU)

5. Summary

Feature engineering matters: as effective features were added, the SMAPE of the O3 model dropped from 0.41 to 0.31, a clear improvement.

Copyright notice
This article was written by [szZack]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/188/202207071159085028.html