[AI in Practice] Building an air quality prediction model with xgboost.XGBRegressor (Part 2)
2022-07-07 14:04:00 【szZack】
Previous article: [AI in Practice] Building an air quality prediction model with xgboost.XGBRegressor (Part 1)
This part covers feature processing of the data.
1. Features
The hand-crafted features are described below, using PM2.5 as the example:
# Features:
# PM25 + PM25 adjacent difference + PM25 maximum, minimum, most recent value, mean, median, max-min difference, standard deviation, variance +
# time features (month, day, hour, day of week) + humidity diff 2, temperature diff 2, wind direction diff 2, wind speed diff 2 + air pressure + temperature + humidity + wind speed + wind direction + station features (longitude, latitude) + humidity diff 1, temperature diff 1, wind direction diff 1, wind speed diff 1
# Future window: time features (month, day, hour) + humidity diff 2, temperature diff 2, wind direction diff 2, wind speed diff 2 + air pressure + temperature + humidity + wind speed + wind direction + humidity diff 1, temperature diff 1, wind direction diff 1, wind speed diff 1
# Air pressure, temperature, humidity and wind speed each also get maximum, minimum, most recent value, mean, median, max-min difference, standard deviation and variance
# (diff 1 is the difference from the last value of the previous day; diff 2 is the difference from the previous hour)
# [Key point] Each time feature is represented as a point on a circle in the two-dimensional plane
# Wind direction is likewise represented as a point on a circle in the two-dimensional plane
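The circular encoding mentioned in the key point can be sketched as follows. The article calls a helper named `to_periodic_feature(value, period)` but does not show its body, so the version below is an assumption: it maps a cyclic value to an angle and returns the cosine as x and the sine as y. The author's helper may order or scale the coordinates differently.

```python
import math


def to_periodic_feature(value, period):
    """Map a cyclic value (hour, month, ...) to a point (x, y) on the unit circle.

    Hypothetical implementation: the original helper is not shown in the article.
    """
    angle = 2.0 * math.pi * value / period
    return math.cos(angle), math.sin(angle)


# Hour 23 and hour 0 are neighbours on the circle, although the raw
# integers 23 and 0 are far apart.
x23, y23 = to_periodic_feature(23, 24)
x0, y0 = to_periodic_feature(0, 24)
dist_23_to_0 = math.hypot(x23 - x0, y23 - y0)

# Hour 12 sits on the opposite side of the circle from hour 0.
x12, y12 = to_periodic_feature(12, 24)
dist_12_to_0 = math.hypot(x12 - x0, y12 - y0)
```

With this encoding the model sees 23:00 and 00:00 as close together, which a single integer column cannot express.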
2. Core code
Feature processing code:
# Imports needed by the methods below, which belong to a model class that is not shown in this article
import time
import numpy as np
import pandas as pd

def load_data(self, data_type, data_path, n_input, n_output):
    # Note: the window sizes are read from self.n_input / self.n_output;
    # the n_input / n_output arguments are not used in this method.
    # Features:
    # PM25 + PM25 adjacent difference + PM25 maximum, minimum, most recent value, mean, median, max-min difference, standard deviation, variance +
    # time features (month, day, hour, day of week) + humidity diff 2, temperature diff 2, wind direction diff 2, wind speed diff 2 + air pressure + temperature + humidity + wind speed + wind direction + station features (longitude, latitude) + humidity diff 1, temperature diff 1, wind direction diff 1, wind speed diff 1
    # Future window: time features (month, day, hour) + humidity diff 2, temperature diff 2, wind direction diff 2, wind speed diff 2 + air pressure + temperature + humidity + wind speed + wind direction + humidity diff 1, temperature diff 1, wind direction diff 1, wind speed diff 1
    # Air pressure, temperature, humidity and wind speed each also get maximum, minimum, most recent value, mean, median, max-min difference, standard deviation and variance
    # [Key point] Each time feature is represented as a point on a circle in the two-dimensional plane
    # Wind direction is likewise represented as a point on a circle in the two-dimensional plane
    # Column header of the data file:
    # air_pressure,CO,humidity,AQI,monitoring_time,NO2,O3,PM10,PM25,SO2,station_number,air_temperature,wind_direction,wind_speed,longitude,latitude,station_type_name
    usecols = ['air_pressure', 'humidity', 'monitoring_time', self.factor, 'station_number', 'air_temperature', 'wind_direction', 'wind_speed', 'longitude', 'latitude']
    df = pd.read_csv(data_path, usecols=usecols, low_memory=False)
    station_list = list(set(df['station_number'].values.tolist()))
    station_list.sort()
    print('station_list', station_list)

    # Time features: each one becomes an (x, y) point on a circle
    df['monitoring_time'] = pd.to_datetime(df['monitoring_time'])
    #df['year'] = df['monitoring_time'].map(lambda x: (x.year))
    df['month_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.month, 12)[0])
    df['day_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.day, 31)[0])
    df['hour_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.hour, 24)[0])
    df['dayofweek_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.dayofweek + 1, 7)[0])
    df['month_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.month, 12)[1])
    df['day_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.day, 31)[1])
    df['hour_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.hour, 24)[1])
    df['dayofweek_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.dayofweek + 1, 7)[1])

    # Diff 1: difference from the value at 23:00 of the previous day
    # Produces humidity_diff, air_temperature_diff, wind_speed_diff and <factor>_diff
    # (the wind direction difference is disabled)
    df = self.calcu_diff(df, 'humidity')
    df = self.calcu_diff(df, 'air_temperature')
    #df = self.calcu_diff(df, 'wind_direction')
    df = self.calcu_diff(df, 'wind_speed')
    df = self.calcu_diff(df, self.factor)

    # Diff 2: difference from the previous hour (adjacent difference)
    # Produces humidity_diff2, air_temperature_diff2, wind_speed_diff2 and <factor>_diff2
    df = self.calcu_diff2(df, 'humidity')
    df = self.calcu_diff2(df, 'air_temperature')
    #df = self.calcu_diff2(df, 'wind_direction')
    df = self.calcu_diff2(df, 'wind_speed')
    df = self.calcu_diff2(df, self.factor)

    # Wind direction is represented as a point on a circle in the two-dimensional plane
    # The direction (in degrees) is bucketed into 8 sectors of 45 degrees each
    df['wind_direction_x'] = df['wind_direction'].map(lambda x: self.to_periodic_feature(int(x // 45), 8)[0])
    df['wind_direction_y'] = df['wind_direction'].map(lambda x: self.to_periodic_feature(int(x // 45), 8)[1])

    # Target factor: maximum, minimum, most recent value, mean, median, max-min difference, standard deviation, variance
    df = self.calcu_value_feature3(df, self.factor)
    # The same statistics for humidity, temperature, wind speed and air pressure
    df = self.calcu_value_feature3(df, 'humidity')
    df = self.calcu_value_feature3(df, 'air_temperature')
    df = self.calcu_value_feature3(df, 'wind_speed')
    df = self.calcu_value_feature3(df, 'air_pressure')

    # First count the total number of samples
    n_total = 0
    n_step = 24  # stride between consecutive samples, in rows
    for site_id in station_list:
        site_df = df[(df['station_number'] == site_id)]
        for i in range(0, site_df.shape[0] - self.n_output - self.n_input, n_step):
            n_total += 1
    print('n_total:', n_total)

    # Width of one row of X: 65 columns for the history window + 52 columns for the future window
    x_len = (8 + 6 + 3 + 3 + 2 + 8 + 2 + 4*8) + 1 + (8 + 6 + 3 + 3 + 4*8)
    X = np.ones((n_total, self.n_input, x_len), dtype=np.float32)
    Y = np.ones((n_total, self.n_output, 1), dtype=np.float32)
    print('*' * 20)
    print('X.shape', X.shape, 'Y.shape', Y.shape)
    print(df.shape)
    print(df.head())

    n = 0
    # The last 13 columns (longitude, latitude, the 10 factor statistics and the factor itself)
    # are only available for the history window
    x_feat = (['month_x', 'day_x', 'hour_x', 'dayofweek_x'] +
              ['month_y', 'day_y', 'hour_y', 'dayofweek_y'] +
              ['air_pressure', 'humidity', 'air_temperature', 'wind_direction_x', 'wind_direction_y', 'wind_speed'] +
              ['humidity_diff', 'air_temperature_diff', 'wind_speed_diff'] +
              ['humidity_diff2', 'air_temperature_diff2', 'wind_speed_diff2'] +
              [i + '_max' for i in ['air_pressure', 'humidity', 'air_temperature', 'wind_speed']] +
              [i + '_min' for i in ['air_pressure', 'humidity', 'air_temperature', 'wind_speed']] +
              [i + '_recent' for i in ['air_pressure', 'humidity', 'air_temperature', 'wind_speed']] +
              [i + '_mean' for i in ['air_pressure', 'humidity', 'air_temperature', 'wind_speed']] +
              [i + '_median' for i in ['air_pressure', 'humidity', 'air_temperature', 'wind_speed']] +
              [i + '_max_min_diff' for i in ['air_pressure', 'humidity', 'air_temperature', 'wind_speed']] +
              [i + '_std' for i in ['air_pressure', 'humidity', 'air_temperature', 'wind_speed']] +
              [i + '_var' for i in ['air_pressure', 'humidity', 'air_temperature', 'wind_speed']] +
              ['longitude', 'latitude'] +
              [self.factor + '_diff', self.factor + '_diff2', self.factor + '_max', self.factor + '_min', self.factor + '_recent', self.factor + '_mean', self.factor + '_median', self.factor + '_max_min_diff', self.factor + '_std', self.factor + '_var'] +
              [self.factor])
    for site_id in station_list:
        site_df = df[(df['station_number'] == site_id)]
        print('site_id:', site_id)
        for i in range(0, site_df.shape[0] - self.n_output - self.n_input, n_step):
            # History window (all features) stacked side by side with the
            # window shifted by n_output rows (features without the last 13 columns)
            X[n] = np.hstack((site_df.loc[site_df.index[i: i + self.n_input], x_feat].values,
                              site_df.loc[site_df.index[i + self.n_output: i + self.n_input + self.n_output], x_feat[:-13]].values))
            # Target: the factor values of the n_output rows that follow the history window
            Y[n] = site_df.loc[site_df.index[i + self.n_input: i + self.n_input + self.n_output], [self.factor]].values
            n += 1

    # Flatten each sample to one row for XGBRegressor
    X = X.reshape(n_total, -1)
    Y = Y.reshape(n_total, -1)
    np.save('./ml_data/%s_%s_X-%d-%d-%s.npy' % (self.factor, data_type, self.n_input, self.n_output, self.version), X)
    np.save('./ml_data/%s_%s_Y-%d-%d-%s.npy' % (self.factor, data_type, self.n_input, self.n_output, self.version), Y)
    return X, Y
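The sample construction at the end of `load_data` is a strided sliding window. The sketch below isolates that idea on a single one-dimensional series. It is a simplification and not the author's code: the function name `make_windows` is hypothetical, and it leaves out the per-station loop and the extra block of future-window features.

```python
import numpy as np


def make_windows(values, n_input, n_output, n_step=24):
    """Cut one series into (history, target) pairs with a fixed stride.

    Simplified, hypothetical helper: a single feature column, no stations.
    """
    xs, ys = [], []
    # Same loop bounds as in load_data: start positions 0, n_step, 2*n_step, ...
    for i in range(0, len(values) - n_input - n_output, n_step):
        xs.append(values[i:i + n_input])                        # history window
        ys.append(values[i + n_input:i + n_input + n_output])   # rows to predict
    return np.array(xs), np.array(ys)


series = np.arange(100, dtype=np.float32)
X_demo, Y_demo = make_windows(series, n_input=48, n_output=24)
print(X_demo.shape, Y_demo.shape)
```

With a stride of 24 rows and hourly data, each sample starts one day after the previous one, so consecutive samples overlap whenever `n_input` is larger than 24.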
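def calcu_diff(self, df, field):
    # Diff 1: difference between each row and the last value (row 23) of the previous 24-row block.
    # Assumes a default 0..n-1 index and a row count that is a multiple of 24.
    df['tmp'] = df[field].copy()
    # Last value of every 24-row block
    tmp_list = df.loc[df['tmp'].index[23::24], 'tmp'].tolist()
    # Shift by one block: the first block has no previous day, so it uses the first row's own value
    tmp_list.insert(0, df.loc[0, field])
    tmp_list.pop(-1)
    # Repeat each reference value for the 24 rows of its block
    output = [val for val in tmp_list for _ in range(24)]
    df['tmp'] = output
    df[field + '_diff'] = df[field] - df['tmp']
    df.drop(['tmp'], axis=1, inplace=True)
    return df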
def calcu_diff2(self, df, field):
    # Diff 2: difference between the value of the current hour and the value of the previous hour
    df['tmp'] = df[field].copy()
    df['tmp'] = df['tmp'].shift(1)
    # The first row has no previous hour; reuse its own value so that its difference is 0
    df.loc[0, 'tmp'] = df.loc[1, 'tmp']
    df[field + '_diff2'] = df[field] - df['tmp']
    print(df.head())
    df.drop(['tmp'], axis=1, inplace=True)
    return df
def calcu_value_feature(self, df, field):
    # Per 24-row block of `field` (for example O3): maximum, minimum, most recent value, mean, median.
    # Each block statistic is repeated for all 24 rows of the block.
    tmp_list = df[field].tolist()
    max_val_list = [max(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    min_val_list = [min(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    recent_val_list = [tmp_list[i+23] for i in range(0, len(tmp_list), 24) for j in range(24)]
    mean_val_list = [sum(tmp_list[i:i+24])/24.0 for i in range(0, len(tmp_list), 24) for j in range(24)]
    median_val_list = [self.get_median(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    df[field + '_max'] = max_val_list
    df[field + '_min'] = min_val_list
    df[field + '_recent'] = recent_val_list
    df[field + '_mean'] = mean_val_list
    df[field + '_median'] = median_val_list
    return df
def calcu_value_feature3(self, df, field):
    # Per 24-row block of `field` (for example O3): maximum, minimum, most recent value, mean,
    # median, max-min difference, standard deviation, variance.
    # Each block statistic is repeated for all 24 rows of the block.
    tmp_list = df[field].tolist()
    max_val_list = [max(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    min_val_list = [min(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    recent_val_list = [tmp_list[i+23] for i in range(0, len(tmp_list), 24) for j in range(24)]
    mean_val_list = [sum(tmp_list[i:i+24])/24.0 for i in range(0, len(tmp_list), 24) for j in range(24)]
    t = time.time()  # timestamp left over from profiling; not used afterwards
    median_val_list = [self.get_median(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    std_val_list = [self.get_std(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    var_val_list = [self.get_var(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    df[field + '_max'] = max_val_list
    df[field + '_min'] = min_val_list
    df[field + '_recent'] = recent_val_list
    df[field + '_mean'] = mean_val_list
    df[field + '_median'] = median_val_list
    df[field + '_max_min_diff'] = df[field + '_max'] - df[field + '_min']
    df[field + '_std'] = std_val_list
    df[field + '_var'] = var_val_list
    return df
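The per-block statistics above can also be written with pandas group operations. The sketch below is an alternative formulation, not the author's code: `block_stats` is a hypothetical name, and using the population standard deviation and variance (`ddof=0`) is an assumption, because `get_std` and `get_var` are not shown in the article.

```python
import numpy as np
import pandas as pd


def block_stats(df, field, block=24):
    """Add per-block statistics of `field`, repeated on every row of the block.

    Hypothetical alternative to calcu_value_feature3; returns a new DataFrame.
    """
    out = df.copy()
    # Consecutive rows form fixed-size blocks: rows 0..23, 24..47, ...
    block_id = np.arange(len(out)) // block
    grouped = out[field].groupby(block_id)
    for name in ["max", "min", "mean", "median"]:
        out[f"{field}_{name}"] = grouped.transform(name)
    # "Most recent value" is the last row of each block
    out[f"{field}_recent"] = grouped.transform("last")
    out[f"{field}_max_min_diff"] = out[f"{field}_max"] - out[f"{field}_min"]
    # ddof=0 is an assumption; the original get_std / get_var are not shown
    out[f"{field}_std"] = grouped.transform(lambda s: s.std(ddof=0))
    out[f"{field}_var"] = grouped.transform(lambda s: s.var(ddof=0))
    return out


demo = pd.DataFrame({"humidity": np.arange(48, dtype=float)})
result = block_stats(demo, "humidity")
print(result.iloc[[0, 24]])
```

One practical difference: the list-based version indexes `tmp_list[i+23]`, so it requires the row count to be an exact multiple of 24, while the grouped version also accepts a shorter final block.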
3. Model training results
Model quality is measured with SMAPE. The results are as follows:
| Factor | SMAPE |
|---|---|
| PM25 | 0.28 |
| PM10 | 0.29 |
| O3 | 0.316 |
[Note] The results above were obtained after tuning the model parameters. For the xgb.XGBRegressor tuning process, see my articles listed below.
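For reference, one common definition of SMAPE (symmetric mean absolute percentage error) is sketched below. The article does not say which variant was used for the table, so this form, which ranges from 0 to 2, is an assumption. Treating points where both the true and predicted values are zero as zero error is also a choice made here.

```python
import numpy as np


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in the range [0, 2].

    One common variant; not necessarily the one used for the table above.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # Where both values are zero the ratio is defined as 0 instead of 0/0
    ratio = np.divide(np.abs(y_pred - y_true), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return float(np.mean(ratio))


score = smape([100.0, 200.0], [110.0, 180.0])
print(round(score, 4))
```

Lower is better, and a value of 0 means the predictions match the observations exactly.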
4. Other references
[AI in Practice] Speeding up XGBRegressor training: training XGBRegressor in seconds on a GPU
[AI in Practice] xgb.XGBRegressor multi-output regression with MultiOutputRegressor: parameter tuning 1
5. Summary
Feature engineering matters: as effective features were added, the SMAPE of the O3 model dropped from 0.41 to 0.31, a clear improvement.