当前位置:网站首页>[AI practice] Application xgboost Xgbregressor builds air quality prediction model (II)
[AI practice] Application xgboost Xgbregressor builds air quality prediction model (II)
2022-07-07 14:04:00 【szZack】
Last one :【AI actual combat 】 application xgboost.XGBRegressor Build an air quality prediction model ( One )
This chapter mainly explains the data feature processing .
1、 features
With PM2.5 Explain the artificial characteristics as an example :
# features :
# PM25 + PM25 Adjacent difference + PM25 The maximum of 、 minimum value 、 Recent value 、 Average 、 Median 、 The difference between the maximum and minimum values 、 Standard deviation 、 variance +
# Time characteristics ( month 、 Japan 、 Hours 、 week ) + Humidity difference 2, The temperature difference 2, The wind direction is different 2, Wind speed difference 2 + pressure + temperature + humidity + The wind speed + wind direction + Site features ( Longitude and latitude ) + Humidity difference 1, The temperature difference 1, The wind direction is different 1, Wind speed difference 1
# In the future ( Time characteristics ( month 、 Japan 、 Hours ) + Humidity difference 2, The temperature difference 2, The wind direction is different 2, Wind speed difference 2 + pressure + temperature + humidity + The wind speed + wind direction ) + Humidity difference 1, The temperature difference 1, The wind direction is different 1, Wind speed difference 1
# Yes pressure + temperature + humidity + The wind speed All increase Maximum 、 minimum value 、 Recent value 、 Average 、 Median 、 The difference between the maximum and minimum values 、 Standard deviation 、 variance
# 【 a key 】 Time characteristics use A point on the circumference of a two-dimensional plane It means
# Wind direction characteristics are used A point on the circumference of a two-dimensional plane It means
2、 Core code
Data feature processing code :
def load_data(self, data_type, data_path, n_input, n_output):
# features :
# PM25 + PM25 Adjacent difference + PM25 The maximum of 、 minimum value 、 Recent value 、 Average 、 Median 、 The difference between the maximum and minimum values 、 Standard deviation 、 variance +
# Time characteristics ( month 、 Japan 、 Hours 、 week ) + Humidity difference 2, The temperature difference 2, The wind direction is different 2, Wind speed difference 2 + pressure + temperature + humidity + The wind speed + wind direction + Site features ( Longitude and latitude ) + Humidity difference 1, The temperature difference 1, The wind direction is different 1, Wind speed difference 1
# In the future ( Time characteristics ( month 、 Japan 、 Hours ) + Humidity difference 2, The temperature difference 2, The wind direction is different 2, Wind speed difference 2 + pressure + temperature + humidity + The wind speed + wind direction ) + Humidity difference 1, The temperature difference 1, The wind direction is different 1, Wind speed difference 1
# Yes pressure + temperature + humidity + The wind speed All increase Maximum 、 minimum value 、 Recent value 、 Average 、 Median 、 The difference between the maximum and minimum values 、 Standard deviation 、 variance
# 【 a key 】 Time characteristics use A point on the circumference of a two-dimensional plane It means
# Wind direction characteristics are used A point on the circumference of a two-dimensional plane It means
# Field header of data file :
# air_pressure,CO,humidity,AQI,monitoring_time,NO2,O3,PM10,PM25,SO2,station_number,air_temperature,wind_direction,wind_speed,longitude,latitude,station_type_name
usecols=['air_pressure','humidity','monitoring_time',self.factor,'station_number','air_temperature','wind_direction','wind_speed','longitude','latitude']
df = pd.read_csv(data_path, usecols=usecols, low_memory=False)
station_list = list(set(df['station_number'].values.tolist()))
station_list.sort()
print('station_list', station_list)
# Time characteristics
df['monitoring_time'] = pd.to_datetime(df['monitoring_time'])
#df['year'] = df['monitoring_time'].map(lambda x: (x.year))
df['month_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.month, 12)[0])
df['day_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.day,31)[0])
df['hour_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.hour,24)[0])
df['dayofweek_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.dayofweek+1,7)[0])
df['month_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.month, 12)[1])
df['day_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.day,31)[1])
df['hour_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.hour,24)[1])
df['dayofweek_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.dayofweek+1,7)[1])
# Calculate with the previous day 23 Time difference
# Calculation :humidity_diff Humidity difference ,air_temperature_diff The temperature difference ,wind_direction_diff The wind direction is different ,wind_speed_diff Wind speed difference
df = self.calcu_diff(df, 'humidity')
df = self.calcu_diff(df, 'air_temperature')
#df = self.calcu_diff(df, 'wind_direction')
df = self.calcu_diff(df, 'wind_speed')
df = self.calcu_diff(df, self.factor)
# Calculate the adjacent difference
# Calculation 2:humidity_diff Humidity difference ,air_temperature_diff The temperature difference ,wind_direction_diff The wind direction is different ,wind_speed_diff Wind speed difference
df = self.calcu_diff2(df, 'humidity')
df = self.calcu_diff2(df, 'air_temperature')
#df = self.calcu_diff2(df, 'wind_direction')
df = self.calcu_diff2(df, 'wind_speed')
df = self.calcu_diff2(df, self.factor)
# Wind direction use A point on the circumference of a two-dimensional plane It means
# The wind direction is divided into 8 Directions
df['wind_direction_x'] = df['wind_direction'].map(lambda x: self.to_periodic_feature(int(x//45), 8)[0])
df['wind_direction_y'] = df['wind_direction'].map(lambda x: self.to_periodic_feature(int(x//45), 8)[1])
# Calculation O3 The maximum of 、 minimum value 、 Recent value 、 Average 、 Median 、 The difference between the maximum and minimum values
df = self.calcu_value_feature3(df, self.factor)
# pressure + temperature + humidity + The wind speed Maximum 、 minimum value 、 Recent value 、 Average 、 Median 、 The difference between the maximum and minimum values
df = self.calcu_value_feature3(df, 'humidity')
df = self.calcu_value_feature3(df, 'air_temperature')
df = self.calcu_value_feature3(df, 'wind_speed')
df = self.calcu_value_feature3(df, 'air_pressure')
# First calculate the total quantity
n_total = 0
n_step = 24# Take the step of data
for site_id in station_list:
site_df = df[(df['station_number'] == site_id)]
for i in range(0, site_df.shape[0] - self.n_output - self.n_input, n_step):
n_total += 1
print('n_total:', n_total)
# X The length of :
x_len = (8 + 6 + 3 +3+ 2+8+2 + 4*8) + 1 + (8+6+3+3 + 4*8)
X = np.ones((n_total, self.n_input, x_len), dtype=np.float32)
Y = np.ones((n_total, self.n_output, 1), dtype=np.float32)
print('*'*20)
print('X.shape', X.shape, 'Y.shape', Y.shape)
print(df.shape)
print(df.head())
n = 0
x_feat = ['month_x','day_x','hour_x', 'dayofweek_x'] + \
['month_y','day_y','hour_y', 'dayofweek_y'] + \
['air_pressure','humidity', 'air_temperature', 'wind_direction_x', 'wind_direction_y', 'wind_speed'] + \
['humidity_diff', 'air_temperature_diff', 'wind_speed_diff'] + \
['humidity_diff2', 'air_temperature_diff2', 'wind_speed_diff2'] + \
[i+'_max' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
[i+'_min' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
[i+'_recent' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
[i+'_mean' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
[i+'_median' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
[i+'_max_min_diff' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
[i+'_std' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
[i+'_var' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
['longitude', 'latitude'] + \
[self.factor + '_diff', self.factor + '_diff2', self.factor + '_max', self.factor + '_min', self.factor + '_recent', self.factor + '_mean', self.factor + '_median', self.factor + '_max_min_diff', self.factor + '_std', self.factor + '_var'] + \
[self.factor]
for site_id in station_list:
site_df = df[(df['station_number'] == site_id)]
print('site_id:', site_id)
for i in range(0, site_df.shape[0] - self.n_output - self.n_input, n_step):
X[n] = np.hstack((site_df.loc[site_df.index[i: i+self.n_input], x_feat].values, \
site_df.loc[site_df.index[i+self.n_output: i+self.n_input+self.n_output], x_feat[:-13]].values))
Y[n] = site_df.loc[site_df.index[i+self.n_input: i+self.n_input+self.n_output], [self.factor]].values
n += 1
X = X.reshape(n_total, -1)
Y = Y.reshape(n_total, -1)
np.save('./ml_data/%s_%s_X-%d-%d-%s.npy' %(self.factor, data_type, self.n_input, self.n_output, self.version), X)
np.save('./ml_data/%s_%s_Y-%d-%d-%s.npy' %(self.factor, data_type, self.n_input, self.n_output, self.version), Y)
return X, Y
def calcu_diff(self, df, field):
df['tmp'] = df[field].copy()
tmp_list = df.loc[df['tmp'].index[23::24], 'tmp'].tolist()
tmp_list.insert(0, df.loc[0, field])
tmp_list.pop(-1)
output = [val for val in tmp_list for _ in range(24)]
df['tmp'] = output
df[field + '_diff'] = df[field] - df['tmp']
df.drop(['tmp'], axis=1, inplace=True)
return df
def calcu_diff2(self, df, field):
# The difference between the value of the current hour and the value of the previous hour
df['tmp'] = df[field].copy()
df['tmp'] = df['tmp'].shift(1)
df.loc[0, 'tmp'] = df.loc[1, 'tmp']
df[field + '_diff2'] = df[field] - df['tmp']
print(df.head())
df.drop(['tmp'], axis=1, inplace=True)
return df
def calcu_value_feature(self, df, field):
# Calculation O3 The maximum of 、 minimum value 、 Recent value 、 Average 、 Median
tmp_list = df[field].tolist()
max_val_list = [max(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
min_val_list = [min(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
recent_val_list = [tmp_list[i+23] for i in range(0, len(tmp_list), 24) for j in range(24)]
mean_val_list = [sum(tmp_list[i:i+24])/24.0 for i in range(0, len(tmp_list), 24) for j in range(24)]
median_val_list = [self.get_median(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
df[field + '_max'] = max_val_list
df[field + '_min'] = min_val_list
df[field + '_recent'] = recent_val_list
df[field + '_mean'] = mean_val_list
df[field + '_median'] = median_val_list
return df
def calcu_value_feature3(self, df, field):
# Calculation O3 The maximum of 、 minimum value 、 Recent value 、 Average 、 Median 、 The difference between the maximum and minimum values 、 Standard deviation 、 variance
tmp_list = df[field].tolist()
max_val_list = [max(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
min_val_list = [min(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
recent_val_list = [tmp_list[i+23] for i in range(0, len(tmp_list), 24) for j in range(24)]
mean_val_list = [sum(tmp_list[i:i+24])/24.0 for i in range(0, len(tmp_list), 24) for j in range(24)]
t=time.time()
median_val_list = [self.get_median(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
std_val_list = [self.get_std(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
var_val_list = [self.get_var(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
df[field + '_max'] = max_val_list
df[field + '_min'] = min_val_list
df[field + '_recent'] = recent_val_list
df[field + '_mean'] = mean_val_list
df[field + '_median'] = median_val_list
df[field + '_max_min_diff'] = df[field + '_max'] - df[field + '_min']
df[field + '_std'] = std_val_list
df[field + '_var'] = var_val_list
return df
3、 Model training effect
With smape To measure the effectiveness of the model , give the result as follows :
| factor | smape |
|---|---|
| PM25 | 0.28 |
| PM10 | 0.29 |
| O3 | 0.316 |
【 notes 】 The above result is after After model parameter adjustment The result ,xgb.XGBRegressor The parameter adjustment process of can refer to my following article
4、 Other reference
【AI actual combat 】XGBRegressor Model acceleration training , Use GPU Second training XGBRegressor
【AI actual combat 】xgb.XGBRegressor Multiple regression MultiOutputRegressor Adjustable parameter 1
5、 summary
Feature engineering is important , With the increase of effective features ,O3 Model smape from 0.41 Down to 0.31, The effect is obviously improved .
边栏推荐
- 高等数学---第八章多元函数微分学1
- 内存溢出和内存泄漏的区别
- Is it safe to open an account online now? Which securities company should I choose to open an account online?
- Thread pool reject policy best practices
- . Net core about redis pipeline and transactions
- Redis 核心数据结构 & Redis 6 新特性详
- Best practice | using Tencent cloud AI willingness to audit as the escort of telephone compliance
- 最佳实践 | 用腾讯云AI意愿核身为电话合规保驾护航
- Environment configuration
- PostgreSQL array type, each splice
猜你喜欢

手把手教会:XML建模

Build a secure and trusted computing platform based on Kunpeng's native security

2022-7-6 beginner redis (I) download, install and run redis under Linux

Advanced Mathematics - Chapter 8 differential calculus of multivariate functions 1

118. Yanghui triangle

AutoCAD - how to input angle dimensions and CAD diameter symbols greater than 180 degrees?

高等数学---第八章多元函数微分学1

Deep understanding of array related problems in C language

2022-7-6 初学redis(一)在 Linux 下下载安装并运行 redis

Use of polarscatter function in MATLAB
随机推荐
THINKPHP框架的优秀开源系统推荐
Battle Atlas: 12 scenarios detailing the requirements for container safety construction
Help tenants
Leetcode simple question sharing (20)
toRaw和markRaw
Getting started with MySQL
[1] Basic knowledge of ros2 - summary version of operation commands
Dry goods | summarize the linkage use of those vulnerability tools
Move base parameter analysis and experience summary
118. Yanghui triangle
PostgreSQL array type, each splice
[daily training] 648 Word replacement
566. Reshaping the matrix
requires php ~7.1 -> your PHP version (7.0.18) does not satisfy that requirement
Seven propagation behaviors of transactions
属性关键字Aliases,Calculated,Cardinality,ClientName
Evolution of customer service hotline of dewu
Is the compass stock software reliable? Is it safe to trade stocks?
2022-7-6 使用SIGURG来接受外带数据,不知道为什么打印不出来
Toraw and markraw