当前位置:网站首页>【AI实战】应用xgboost.XGBRegressor搭建空气质量预测模型(二)
【AI实战】应用xgboost.XGBRegressor搭建空气质量预测模型(二)
2022-07-07 11:59:00 【szZack】
上一篇:【AI实战】应用xgboost.XGBRegressor搭建空气质量预测模型(一)
本篇主要详解数据特征处理。
1、特征
以 PM2.5 为例子进行说明人工特征说明:
# 特征:
# PM25 + PM25相邻差 + PM25的最大值、最小值、最近值、平均值、中位数、最大最小值差、标准差、方差 +
# 时间特征(月、日、小时、星期) + 湿度差2,气温差2,风向差2,风速差2 + 气压 + 温度 + 湿度 + 风速 + 风向 + 站点特征(经纬度) + 湿度差1,气温差1,风向差1,风速差1
# 未来的(时间特征(月、日、小时) + 湿度差2,气温差2,风向差2,风速差2 + 气压 + 温度 + 湿度 + 风速 + 风向) + 湿度差1,气温差1,风向差1,风速差1
# 对 气压 + 温度 + 湿度 + 风速 都增加 最大值、最小值、最近值、平均值、中位数、最大最小值差、标准差、方差
# 【重点】时间特征使用 一个二维平面圆周上的点 的表示
# 风向特征使用 一个二维平面圆周上的点 的表示
2、核心代码
数据特征处理代码:
def load_data(self, data_type, data_path, n_input, n_output):
# 特征:
# PM25 + PM25相邻差 + PM25的最大值、最小值、最近值、平均值、中位数、最大最小值差、标准差、方差 +
# 时间特征(月、日、小时、星期) + 湿度差2,气温差2,风向差2,风速差2 + 气压 + 温度 + 湿度 + 风速 + 风向 + 站点特征(经纬度) + 湿度差1,气温差1,风向差1,风速差1
# 未来的(时间特征(月、日、小时) + 湿度差2,气温差2,风向差2,风速差2 + 气压 + 温度 + 湿度 + 风速 + 风向) + 湿度差1,气温差1,风向差1,风速差1
# 对 气压 + 温度 + 湿度 + 风速 都增加 最大值、最小值、最近值、平均值、中位数、最大最小值差、标准差、方差
# 【重点】时间特征使用 一个二维平面圆周上的点 的表示
# 风向特征使用 一个二维平面圆周上的点 的表示
# 数据文件的字段头:
# air_pressure,CO,humidity,AQI,monitoring_time,NO2,O3,PM10,PM25,SO2,station_number,air_temperature,wind_direction,wind_speed,longitude,latitude,station_type_name
usecols=['air_pressure','humidity','monitoring_time',self.factor,'station_number','air_temperature','wind_direction','wind_speed','longitude','latitude']
df = pd.read_csv(data_path, usecols=usecols, low_memory=False)
station_list = list(set(df['station_number'].values.tolist()))
station_list.sort()
print('station_list', station_list)
#时间特征
df['monitoring_time'] = pd.to_datetime(df['monitoring_time'])
#df['year'] = df['monitoring_time'].map(lambda x: (x.year))
df['month_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.month, 12)[0])
df['day_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.day,31)[0])
df['hour_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.hour,24)[0])
df['dayofweek_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.dayofweek+1,7)[0])
df['month_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.month, 12)[1])
df['day_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.day,31)[1])
df['hour_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.hour,24)[1])
df['dayofweek_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.dayofweek+1,7)[1])
#计算与上一天23时的差
#计算:humidity_diff 湿度差,air_temperature_diff 气温差,wind_direction_diff 风向差,wind_speed_diff 风速差
df = self.calcu_diff(df, 'humidity')
df = self.calcu_diff(df, 'air_temperature')
#df = self.calcu_diff(df, 'wind_direction')
df = self.calcu_diff(df, 'wind_speed')
df = self.calcu_diff(df, self.factor)
#计算相邻差
#计算2:humidity_diff 湿度差,air_temperature_diff 气温差,wind_direction_diff 风向差,wind_speed_diff 风速差
df = self.calcu_diff2(df, 'humidity')
df = self.calcu_diff2(df, 'air_temperature')
#df = self.calcu_diff2(df, 'wind_direction')
df = self.calcu_diff2(df, 'wind_speed')
df = self.calcu_diff2(df, self.factor)
#风向使用 一个二维平面圆周上的点 的表示
#风向分为 8 个方位
df['wind_direction_x'] = df['wind_direction'].map(lambda x: self.to_periodic_feature(int(x//45), 8)[0])
df['wind_direction_y'] = df['wind_direction'].map(lambda x: self.to_periodic_feature(int(x//45), 8)[1])
#计算O3的最大值、最小值、最近值、平均值、中位数、最大最小值差
df = self.calcu_value_feature3(df, self.factor)
# 气压 + 温度 + 湿度 + 风速 最大值、最小值、最近值、平均值、中位数、最大最小值差
df = self.calcu_value_feature3(df, 'humidity')
df = self.calcu_value_feature3(df, 'air_temperature')
df = self.calcu_value_feature3(df, 'wind_speed')
df = self.calcu_value_feature3(df, 'air_pressure')
#先计算总数量
n_total = 0
n_step = 24#取数据的步长
for site_id in station_list:
site_df = df[(df['station_number'] == site_id)]
for i in range(0, site_df.shape[0] - self.n_output - self.n_input, n_step):
n_total += 1
print('n_total:', n_total)
# X的长度:
x_len = (8 + 6 + 3 +3+ 2+8+2 + 4*8) + 1 + (8+6+3+3 + 4*8)
X = np.ones((n_total, self.n_input, x_len), dtype=np.float32)
Y = np.ones((n_total, self.n_output, 1), dtype=np.float32)
print('*'*20)
print('X.shape', X.shape, 'Y.shape', Y.shape)
print(df.shape)
print(df.head())
n = 0
x_feat = ['month_x','day_x','hour_x', 'dayofweek_x'] + \
['month_y','day_y','hour_y', 'dayofweek_y'] + \
['air_pressure','humidity', 'air_temperature', 'wind_direction_x', 'wind_direction_y', 'wind_speed'] + \
['humidity_diff', 'air_temperature_diff', 'wind_speed_diff'] + \
['humidity_diff2', 'air_temperature_diff2', 'wind_speed_diff2'] + \
[i+'_max' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
[i+'_min' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
[i+'_recent' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
[i+'_mean' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
[i+'_median' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
[i+'_max_min_diff' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
[i+'_std' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
[i+'_var' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
['longitude', 'latitude'] + \
[self.factor + '_diff', self.factor + '_diff2', self.factor + '_max', self.factor + '_min', self.factor + '_recent', self.factor + '_mean', self.factor + '_median', self.factor + '_max_min_diff', self.factor + '_std', self.factor + '_var'] + \
[self.factor]
for site_id in station_list:
site_df = df[(df['station_number'] == site_id)]
print('site_id:', site_id)
for i in range(0, site_df.shape[0] - self.n_output - self.n_input, n_step):
X[n] = np.hstack((site_df.loc[site_df.index[i: i+self.n_input], x_feat].values, \
site_df.loc[site_df.index[i+self.n_output: i+self.n_input+self.n_output], x_feat[:-13]].values))
Y[n] = site_df.loc[site_df.index[i+self.n_input: i+self.n_input+self.n_output], [self.factor]].values
n += 1
X = X.reshape(n_total, -1)
Y = Y.reshape(n_total, -1)
np.save('./ml_data/%s_%s_X-%d-%d-%s.npy' %(self.factor, data_type, self.n_input, self.n_output, self.version), X)
np.save('./ml_data/%s_%s_Y-%d-%d-%s.npy' %(self.factor, data_type, self.n_input, self.n_output, self.version), Y)
return X, Y
def calcu_diff(self, df, field):
df['tmp'] = df[field].copy()
tmp_list = df.loc[df['tmp'].index[23::24], 'tmp'].tolist()
tmp_list.insert(0, df.loc[0, field])
tmp_list.pop(-1)
output = [val for val in tmp_list for _ in range(24)]
df['tmp'] = output
df[field + '_diff'] = df[field] - df['tmp']
df.drop(['tmp'], axis=1, inplace=True)
return df
def calcu_diff2(self, df, field):
#当前小时的值与上一个小时值之差
df['tmp'] = df[field].copy()
df['tmp'] = df['tmp'].shift(1)
df.loc[0, 'tmp'] = df.loc[1, 'tmp']
df[field + '_diff2'] = df[field] - df['tmp']
print(df.head())
df.drop(['tmp'], axis=1, inplace=True)
return df
def calcu_value_feature(self, df, field):
#计算O3的最大值、最小值、最近值、平均值、中位数
tmp_list = df[field].tolist()
max_val_list = [max(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
min_val_list = [min(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
recent_val_list = [tmp_list[i+23] for i in range(0, len(tmp_list), 24) for j in range(24)]
mean_val_list = [sum(tmp_list[i:i+24])/24.0 for i in range(0, len(tmp_list), 24) for j in range(24)]
median_val_list = [self.get_median(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
df[field + '_max'] = max_val_list
df[field + '_min'] = min_val_list
df[field + '_recent'] = recent_val_list
df[field + '_mean'] = mean_val_list
df[field + '_median'] = median_val_list
return df
def calcu_value_feature3(self, df, field):
#计算O3的最大值、最小值、最近值、平均值、中位数、最大最小值差、标准差、方差
tmp_list = df[field].tolist()
max_val_list = [max(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
min_val_list = [min(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
recent_val_list = [tmp_list[i+23] for i in range(0, len(tmp_list), 24) for j in range(24)]
mean_val_list = [sum(tmp_list[i:i+24])/24.0 for i in range(0, len(tmp_list), 24) for j in range(24)]
t=time.time()
median_val_list = [self.get_median(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
std_val_list = [self.get_std(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
var_val_list = [self.get_var(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
df[field + '_max'] = max_val_list
df[field + '_min'] = min_val_list
df[field + '_recent'] = recent_val_list
df[field + '_mean'] = mean_val_list
df[field + '_median'] = median_val_list
df[field + '_max_min_diff'] = df[field + '_max'] - df[field + '_min']
df[field + '_std'] = std_val_list
df[field + '_var'] = var_val_list
return df
3、模型训练效果
以 smape 来衡量模型的效果,结果如下:
| 因子 | smape |
|---|---|
| PM25 | 0.28 |
| PM10 | 0.29 |
| O3 | 0.316 |
【注】上面的结果是经过模型调参后得到的结果,xgb.XGBRegressor 的调参过程可以参考我下面的文章
4、其他参考
【AI实战】XGBRegressor模型加速训练,使用GPU秒级训练XGBRegressor
【AI实战】xgb.XGBRegressor之多回归MultiOutputRegressor调参1
【AI实战】xgb.XGBRegressor之多回归MultiOutputRegressor调参2(GPU训练模型)
5、总结
特征工程很重要,随着有效特征的增加,O3模型的 smape 从 0.41 降低到了 0.31,效果提升明显。
边栏推荐
- 648. Word replacement: the classic application of dictionary tree
- Co create a collaborative ecosystem of software and hardware: the "Joint submission" of graphcore IPU and Baidu PaddlePaddle appeared in mlperf
- 实现IP地址归属地显示功能、号码归属地查询
- 2022-7-6 Leetcode27. Remove the element - I haven't done the problem for a long time. It's such an embarrassing day for double pointers
- PC端页面如何调用QQ进行在线聊天?
- Problems that cannot be accessed in MySQL LAN
- 2022-7-6 Leetcode 977. Square of ordered array
- 【堡垒机】云堡垒机和普通堡垒机的区别是什么?
- Talk about pseudo sharing
- [1] ROS2基础知识-操作命令总结版
猜你喜欢

2022-7-7 Leetcode 34.在排序数组中查找元素的第一个和最后一个位置

Advanced Mathematics - Chapter 8 differential calculus of multivariate functions 1

Sliding rail stepping motor commissioning (national ocean vehicle competition) (STM32 master control)

干货|总结那些漏洞工具的联动使用

Custom thread pool rejection policy

"New red flag Cup" desktop application creativity competition 2022

2022-7-6 Leetcode27.移除元素——太久没有做题了,为双指针如此狼狈的一天

Ogre introduction

为租客提供帮助

Final review notes of single chip microcomputer principle
随机推荐
最佳实践 | 用腾讯云AI意愿核身为电话合规保驾护航
3D Detection: 3D Box和点云 快速可视化
提升树莓派性能的方法
How does MySQL control the number of replace?
Introduction to database system - Chapter 1 introduction [conceptual model, hierarchical model and three-level mode (external mode, mode, internal mode)]
Did login metamask
Data refresh of recyclerview
Flink | multi stream conversion
Cinnamon taskbar speed
得物客服热线的演进之路
Milkdown control icon
AI talent cultivation new ideas, this live broadcast has what you care about
Talk about pseudo sharing
How can the PC page call QQ for online chat?
1. Deep copy 2. Call apply bind 3. For of in differences
Leetcode simple question sharing (20)
云计算安全扩展要求关注的安全目标和实现方式区分原则有哪些?
Toraw and markraw
Help tenants
The meaning of variables starting with underscores in PHP