[AI in Practice] Building an Air Quality Prediction Model with xgboost.XGBRegressor (II)
2022-07-07 14:04:00 【szZack】
Previous post: [AI in Practice] Building an Air Quality Prediction Model with xgboost.XGBRegressor (I)
This part covers the feature engineering applied to the data.
1. Features
The hand-crafted features are explained below, using PM2.5 as the example:
# Features:
# PM25 + PM25 adjacent differences + PM25 maximum, minimum, most recent value, mean, median, max-min range, standard deviation, variance +
# time features (month, day, hour, day of week) + humidity diff 2, temperature diff 2, wind direction diff 2, wind speed diff 2 + air pressure + temperature + humidity + wind speed + wind direction + station features (longitude, latitude) + humidity diff 1, temperature diff 1, wind direction diff 1, wind speed diff 1
# Future window: (time features (month, day, hour) + humidity diff 2, temperature diff 2, wind direction diff 2, wind speed diff 2 + air pressure + temperature + humidity + wind speed + wind direction) + humidity diff 1, temperature diff 1, wind direction diff 1, wind speed diff 1
# Air pressure, temperature, humidity and wind speed each also get: maximum, minimum, most recent value, mean, median, max-min range, standard deviation, variance
# Here "diff 1" is the difference from the value at 23:00 of the previous day, and "diff 2" is the difference from the previous hour
# [Key point] Time features are represented as a point on a circle in the 2D plane
# Wind direction is also represented as a point on a circle in the 2D plane
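The circle representation can be sketched as follows. The post calls `self.to_periodic_feature(value, period)` and reads elements `[0]` and `[1]` of the result, but it does not show the implementation. The version below is therefore an assumption: it maps a value to the (cos, sin) of its angle on the unit circle, so that the end and the start of a cycle (for example hour 23 and hour 0) land next to each other instead of 23 units apart. `circle_distance` is a hypothetical helper used only for the demonstration.

```python
import math

def to_periodic_feature(value, period):
    """Map a cyclic value to a point (x, y) on the unit circle.

    Hypothetical stand-in for the method of the same name used in the post;
    returning cos first and sin second is an assumption.
    """
    angle = 2.0 * math.pi * (value % period) / period
    return math.cos(angle), math.sin(angle)

def circle_distance(a, b, period):
    """Euclidean distance between the encoded points of a and b."""
    ax, ay = to_periodic_feature(a, period)
    bx, by = to_periodic_feature(b, period)
    return math.hypot(ax - bx, ay - by)

# Hour 23 is as close to hour 0 as hour 0 is to hour 1,
# while hours 0 and 12 sit on opposite sides of the circle.
print(circle_distance(23, 0, 24))
print(circle_distance(0, 1, 24))
print(circle_distance(0, 12, 24))
```

A tree model sees the two coordinates as separate columns, which is why the code stores them as `*_x` and `*_y` features.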
2. Core code
Feature processing code:
import time

import numpy as np
import pandas as pd

# The functions below take `self`: they are methods of the model class, whose definition is not shown.
def load_data(self, data_type, data_path, n_input, n_output):
    # Features:
    # PM25 + PM25 adjacent differences + PM25 maximum, minimum, most recent value, mean, median, max-min range, standard deviation, variance +
    # time features (month, day, hour, day of week) + humidity diff 2, temperature diff 2, wind direction diff 2, wind speed diff 2 + air pressure + temperature + humidity + wind speed + wind direction + station features (longitude, latitude) + humidity diff 1, temperature diff 1, wind direction diff 1, wind speed diff 1
    # Future window: (time features (month, day, hour) + humidity diff 2, temperature diff 2, wind direction diff 2, wind speed diff 2 + air pressure + temperature + humidity + wind speed + wind direction) + humidity diff 1, temperature diff 1, wind direction diff 1, wind speed diff 1
    # Air pressure, temperature, humidity and wind speed each also get: maximum, minimum, most recent value, mean, median, max-min range, standard deviation, variance
    # [Key point] Time features are represented as a point on a circle in the 2D plane
    # Wind direction is also represented as a point on a circle in the 2D plane

    # Column header of the data file:
    # air_pressure,CO,humidity,AQI,monitoring_time,NO2,O3,PM10,PM25,SO2,station_number,air_temperature,wind_direction,wind_speed,longitude,latitude,station_type_name
    usecols = ['air_pressure', 'humidity', 'monitoring_time', self.factor, 'station_number', 'air_temperature', 'wind_direction', 'wind_speed', 'longitude', 'latitude']
    df = pd.read_csv(data_path, usecols=usecols, low_memory=False)
    station_list = list(set(df['station_number'].values.tolist()))
    station_list.sort()
    print('station_list', station_list)

    # Time features: each one becomes the (x, y) coordinates of a point on a circle
    df['monitoring_time'] = pd.to_datetime(df['monitoring_time'])
    #df['year'] = df['monitoring_time'].map(lambda x: (x.year))
    df['month_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.month, 12)[0])
    df['day_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.day, 31)[0])
    df['hour_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.hour, 24)[0])
    df['dayofweek_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.dayofweek+1, 7)[0])
    df['month_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.month, 12)[1])
    df['day_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.day, 31)[1])
    df['hour_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.hour, 24)[1])
    df['dayofweek_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.dayofweek+1, 7)[1])

    # Diff 1: difference from the value at 23:00 of the previous day
    # Produces humidity_diff, air_temperature_diff, wind_speed_diff and <factor>_diff (wind_direction_diff is disabled)
    df = self.calcu_diff(df, 'humidity')
    df = self.calcu_diff(df, 'air_temperature')
    #df = self.calcu_diff(df, 'wind_direction')
    df = self.calcu_diff(df, 'wind_speed')
    df = self.calcu_diff(df, self.factor)

    # Diff 2: difference from the previous hour
    # Produces humidity_diff2, air_temperature_diff2, wind_speed_diff2 and <factor>_diff2 (wind_direction_diff2 is disabled)
    df = self.calcu_diff2(df, 'humidity')
    df = self.calcu_diff2(df, 'air_temperature')
    #df = self.calcu_diff2(df, 'wind_direction')
    df = self.calcu_diff2(df, 'wind_speed')
    df = self.calcu_diff2(df, self.factor)

    # Wind direction is also represented as a point on a circle:
    # it is first binned into 8 directions (x // 45), then encoded
    df['wind_direction_x'] = df['wind_direction'].map(lambda x: self.to_periodic_feature(int(x//45), 8)[0])
    df['wind_direction_y'] = df['wind_direction'].map(lambda x: self.to_periodic_feature(int(x//45), 8)[1])

    # Target factor (self.factor, e.g. O3): maximum, minimum, most recent value, mean, median, max-min range, standard deviation, variance
    df = self.calcu_value_feature3(df, self.factor)
    # The same statistics for humidity, temperature, wind speed and air pressure
    df = self.calcu_value_feature3(df, 'humidity')
    df = self.calcu_value_feature3(df, 'air_temperature')
    df = self.calcu_value_feature3(df, 'wind_speed')
    df = self.calcu_value_feature3(df, 'air_pressure')

    # First count the total number of samples so that X and Y can be preallocated
    n_total = 0
    n_step = 24  # stride between the start rows of consecutive samples
    for site_id in station_list:
        site_df = df[(df['station_number'] == site_id)]
        for i in range(0, site_df.shape[0] - self.n_output - self.n_input, n_step):
            n_total += 1
    print('n_total:', n_total)

    # Number of features per time step in X:
    # input part: 8 time + 6 weather + 3 diff + 3 diff2 + 2 factor diffs + 8 factor statistics + 2 longitude/latitude + 4*8 weather statistics,
    # plus 1 for the factor itself, plus the shifted part without station and factor columns: 8 + 6 + 3 + 3 + 4*8
    x_len = (8 + 6 + 3 + 3 + 2 + 8 + 2 + 4*8) + 1 + (8 + 6 + 3 + 3 + 4*8)
    X = np.ones((n_total, self.n_input, x_len), dtype=np.float32)
    Y = np.ones((n_total, self.n_output, 1), dtype=np.float32)
    print('*'*20)
    print('X.shape', X.shape, 'Y.shape', Y.shape)
    print(df.shape)
    print(df.head())

    n = 0
    # Column order matters: the last 13 entries (longitude, latitude, 10 factor features, factor) are dropped for the shifted window
    x_feat = ['month_x','day_x','hour_x', 'dayofweek_x'] + \
        ['month_y','day_y','hour_y', 'dayofweek_y'] + \
        ['air_pressure','humidity', 'air_temperature', 'wind_direction_x', 'wind_direction_y', 'wind_speed'] + \
        ['humidity_diff', 'air_temperature_diff', 'wind_speed_diff'] + \
        ['humidity_diff2', 'air_temperature_diff2', 'wind_speed_diff2'] + \
        [i+'_max' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
        [i+'_min' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
        [i+'_recent' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
        [i+'_mean' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
        [i+'_median' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
        [i+'_max_min_diff' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
        [i+'_std' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
        [i+'_var' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
        ['longitude', 'latitude'] + \
        [self.factor + '_diff', self.factor + '_diff2', self.factor + '_max', self.factor + '_min', self.factor + '_recent', self.factor + '_mean', self.factor + '_median', self.factor + '_max_min_diff', self.factor + '_std', self.factor + '_var'] + \
        [self.factor]

    for site_id in station_list:
        site_df = df[(df['station_number'] == site_id)]
        print('site_id:', site_id)
        for i in range(0, site_df.shape[0] - self.n_output - self.n_input, n_step):
            # Input window (all features) placed side by side with the window shifted by n_output rows
            # (time and weather features only). The shifted window coincides with the target
            # window only when n_input == n_output.
            X[n] = np.hstack((site_df.loc[site_df.index[i: i+self.n_input], x_feat].values,
                              site_df.loc[site_df.index[i+self.n_output: i+self.n_input+self.n_output], x_feat[:-13]].values))
            # Target: the n_output factor values that follow the input window
            Y[n] = site_df.loc[site_df.index[i+self.n_input: i+self.n_input+self.n_output], [self.factor]].values
            n += 1

    # Flatten each sample to one row, as expected by XGBRegressor
    X = X.reshape(n_total, -1)
    Y = Y.reshape(n_total, -1)
    np.save('./ml_data/%s_%s_X-%d-%d-%s.npy' %(self.factor, data_type, self.n_input, self.n_output, self.version), X)
    np.save('./ml_data/%s_%s_Y-%d-%d-%s.npy' %(self.factor, data_type, self.n_input, self.n_output, self.version), Y)
    return X, Y
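The windowing at the end of `load_data` can be illustrated on a toy array. This is a simplified sketch and not the author's code: `make_windows` is a hypothetical helper, it works on one plain feature matrix instead of per-station DataFrames, and it leaves out the extra block of shifted weather features that the original concatenates to every input window.

```python
import numpy as np

def make_windows(features, target, n_input, n_output, n_step):
    """Cut (n_input -> n_output) samples out of one hourly series.

    features: array of shape (T, F); target: array of shape (T,).
    Returns X of shape (N, n_input * F) and Y of shape (N, n_output).
    """
    xs, ys = [], []
    # Same loop bounds as in load_data: stop before a window runs past the end
    for i in range(0, len(features) - n_input - n_output, n_step):
        xs.append(features[i:i + n_input].reshape(-1))          # flattened input window
        ys.append(target[i + n_input:i + n_input + n_output])   # values to predict
    return np.asarray(xs, dtype=np.float32), np.asarray(ys, dtype=np.float32)

# Five days of hourly data with three features per hour
T, F = 24 * 5, 3
features = np.arange(T * F, dtype=np.float32).reshape(T, F)
target = np.arange(T, dtype=np.float32)

X, Y = make_windows(features, target, n_input=24, n_output=24, n_step=24)
print(X.shape, Y.shape)  # (3, 72) (3, 24)
```

With a stride of 24 every sample starts at the same hour of the day, so each station contributes roughly one sample per day of data.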
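def calcu_diff(self, df, field):
    # Diff 1: difference between each value and the value at row 23 of the previous 24-row block
    # (23:00 of the previous day). The code relies on a default 0..N-1 index and on
    # len(df) being a multiple of 24; otherwise the assignment of `output` fails.
    df['tmp'] = df[field].copy()
    # Value in the last row of every 24-row block
    tmp_list = df.loc[df['tmp'].index[23::24], 'tmp'].tolist()
    # The first block has no previous block: use the very first value instead
    tmp_list.insert(0, df.loc[0, field])
    tmp_list.pop(-1)
    # Repeat each reference value for the 24 rows of the following block
    output = [val for val in tmp_list for _ in range(24)]
    df['tmp'] = output
    df[field + '_diff'] = df[field] - df['tmp']
    df.drop(['tmp'], axis=1, inplace=True)
    return df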
def calcu_diff2(self, df, field):
    # Diff 2: difference between the value of the current hour and the value of the previous hour
    df['tmp'] = df[field].copy()
    df['tmp'] = df['tmp'].shift(1)
    # Row 0 has no previous hour: fill it with its own value, so its difference is 0
    df.loc[0, 'tmp'] = df.loc[1, 'tmp']
    df[field + '_diff2'] = df[field] - df['tmp']
    print(df.head())
    df.drop(['tmp'], axis=1, inplace=True)
    return df
def calcu_value_feature(self, df, field):
    # Earlier variant with five statistics per 24-row block: maximum, minimum, most recent value, mean, median.
    # Each block value is repeated 24 times (inner loop over j) so that every row of the block carries it.
    # self.get_median is a helper method that is not shown in this post.
    tmp_list = df[field].tolist()
    max_val_list = [max(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    min_val_list = [min(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    recent_val_list = [tmp_list[i+23] for i in range(0, len(tmp_list), 24) for j in range(24)]
    mean_val_list = [sum(tmp_list[i:i+24])/24.0 for i in range(0, len(tmp_list), 24) for j in range(24)]
    median_val_list = [self.get_median(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    df[field + '_max'] = max_val_list
    df[field + '_min'] = min_val_list
    df[field + '_recent'] = recent_val_list
    df[field + '_mean'] = mean_val_list
    df[field + '_median'] = median_val_list
    return df
def calcu_value_feature3(self, df, field):
    # Statistics of `field` per 24-row block: maximum, minimum, most recent value, mean, median,
    # max-min range, standard deviation, variance.
    # Each block value is repeated 24 times (inner loop over j) so that every row of the block carries it.
    # tmp_list[i+23] requires len(df) to be a multiple of 24.
    # self.get_median, self.get_std and self.get_var are helper methods that are not shown in this post.
    tmp_list = df[field].tolist()
    max_val_list = [max(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    min_val_list = [min(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    recent_val_list = [tmp_list[i+23] for i in range(0, len(tmp_list), 24) for j in range(24)]
    mean_val_list = [sum(tmp_list[i:i+24])/24.0 for i in range(0, len(tmp_list), 24) for j in range(24)]
    t = time.time()  # leftover timing probe; t is not used below
    median_val_list = [self.get_median(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    std_val_list = [self.get_std(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    var_val_list = [self.get_var(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    df[field + '_max'] = max_val_list
    df[field + '_min'] = min_val_list
    df[field + '_recent'] = recent_val_list
    df[field + '_mean'] = mean_val_list
    df[field + '_median'] = median_val_list
    df[field + '_max_min_diff'] = df[field + '_max'] - df[field + '_min']
    df[field + '_std'] = std_val_list
    df[field + '_var'] = var_val_list
    return df
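The per-block statistics can also be computed without Python-level loops. The following is an alternative sketch based on pandas `groupby(...).transform(...)`, not the author's implementation. `add_block_stats` is a hypothetical name, rows are assumed to be hourly and grouped into consecutive blocks of 24, and the population standard deviation and variance (`ddof=0`) are an assumption, because `get_std` and `get_var` are not shown in the post.

```python
import numpy as np
import pandas as pd

def add_block_stats(df, field, block=24):
    """Add per-block statistics of `field`, repeated on every row of the block."""
    block_id = np.arange(len(df)) // block          # 0 for rows 0..23, 1 for rows 24..47, ...
    grouped = df[field].groupby(block_id)
    out = df.copy()
    out[field + '_max'] = grouped.transform('max')
    out[field + '_min'] = grouped.transform('min')
    out[field + '_recent'] = grouped.transform('last')   # last non-missing value of the block
    out[field + '_mean'] = grouped.transform('mean')
    out[field + '_median'] = grouped.transform('median')
    out[field + '_max_min_diff'] = out[field + '_max'] - out[field + '_min']
    out[field + '_std'] = grouped.transform(lambda s: s.std(ddof=0))   # ddof=0 is an assumption
    out[field + '_var'] = grouped.transform(lambda s: s.var(ddof=0))   # ddof=0 is an assumption
    return out

# Two days of synthetic hourly humidity values: 0, 1, ..., 47
demo = pd.DataFrame({'humidity': np.arange(48, dtype=float)})
demo = add_block_stats(demo, 'humidity')
print(demo.loc[[0, 23, 24, 47], ['humidity_max', 'humidity_recent', 'humidity_mean']])
```

Unlike the list-based version, this sketch does not raise an error when the last block has fewer than 24 rows; it simply computes the statistics over the rows that are present.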
3. Model results
Model quality is measured with SMAPE (symmetric mean absolute percentage error). The results are as follows:
| Factor | SMAPE |
|---|---|
| PM25 | 0.28 |
| PM10 | 0.29 |
| O3 | 0.316 |
[Note] The results above were obtained after hyperparameter tuning. For the tuning process of xgb.XGBRegressor, see my articles listed in section 4.
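The post does not state which SMAPE formula it uses. A commonly used definition, assumed here, averages 2|y - ŷ| / (|y| + |ŷ|) over all points, which gives a value between 0 and 2. `smape` below is a hypothetical helper, and treating points where both values are zero as zero error is also an assumption.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, as a fraction in [0, 2]."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    diff = np.abs(y_true - y_pred)
    denom = np.abs(y_true) + np.abs(y_pred)
    # Where both values are 0 the ratio is undefined; count it as 0 error (assumption)
    ratio = np.divide(2.0 * diff, denom, out=np.zeros_like(diff), where=denom != 0)
    return float(ratio.mean())

print(smape([100, 50, 0], [110, 40, 0]))
```

Lower is better: a perfect prediction scores 0, and predicting a nonzero value where the truth is 0 scores the maximum of 2 for that point.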
4. Other references
[AI in Practice] Accelerating XGBRegressor training: training XGBRegressor in seconds on a GPU
[AI in Practice] xgb.XGBRegressor multi-output regression: tuning MultiOutputRegressor (1)
5. Summary
Feature engineering matters: as effective features were added, the SMAPE of the O3 model dropped from 0.41 to 0.31, a clear improvement.