[AI in Practice] Building an air quality prediction model with xgboost.XGBRegressor (Part 2)
2022-07-07 11:59:00 【szZack】
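Previous post: [AI in Practice] Building an air quality prediction model with xgboost.XGBRegressor (Part 1)
This post walks through the data feature processing in detail.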
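1. Features
The hand-crafted features are described below, using PM2.5 as the example:
# Features:
# PM25 + PM25 adjacent difference + PM25 max, min, most recent value, mean, median, max-min range, standard deviation, variance +
# time features (month, day, hour, day of week) + humidity diff 2, air temperature diff 2, wind direction diff 2, wind speed diff 2 + air pressure + temperature + humidity + wind speed + wind direction + station features (longitude, latitude) + humidity diff 1, air temperature diff 1, wind direction diff 1, wind speed diff 1
# future (time features (month, day, hour) + humidity diff 2, air temperature diff 2, wind direction diff 2, wind speed diff 2 + air pressure + temperature + humidity + wind speed + wind direction) + humidity diff 1, air temperature diff 1, wind direction diff 1, wind speed diff 1
# For air pressure, temperature, humidity and wind speed, also add max, min, most recent value, mean, median, max-min range, standard deviation, variance
# [Key point] Time features are represented as a point on a circle in the 2D plane
# Wind direction is represented as a point on a circle in the 2D plane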
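The "point on a circle" encoding can be sketched as follows. The code in section 2 calls `self.to_periodic_feature(value, period)` and reads the result as an `(x, y)` pair, but the post does not show its body, so the cosine/sine mapping below is an assumption about how it might be implemented, not the author's code.

```python
import math

def to_periodic_feature(value, period):
    """Map a cyclic value onto the unit circle and return (x, y).

    Assumed implementation: the original post only shows the call sites.
    """
    angle = 2.0 * math.pi * value / period
    return math.cos(angle), math.sin(angle)

# Hour of day: 23:00 and 00:00 end up next to each other on the circle,
# which a raw integer encoding (23 vs 0) cannot express.
hour_23 = to_periodic_feature(23, 24)
hour_00 = to_periodic_feature(0, 24)

# Wind direction in degrees, bucketed into 8 sectors as in the post.
wind_sector = int(200 // 45)
wind_xy = to_periodic_feature(wind_sector, 8)
```

With this kind of mapping each cyclic quantity contributes two columns (`*_x` and `*_y`), which matches the eight time columns used in `x_feat`.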
2. Core code
Data feature processing code:
import numpy as np
import pandas as pd

# The functions below are methods of the model class (class definition not shown).
def load_data(self, data_type, data_path, n_input, n_output):
    # Note: the body reads self.n_input / self.n_output, not the n_input / n_output arguments.
    # Features:
    # PM25 + PM25 adjacent difference + PM25 max, min, most recent value, mean, median, max-min range, standard deviation, variance +
    # time features (month, day, hour, day of week) + humidity diff 2, air temperature diff 2, wind direction diff 2, wind speed diff 2 + air pressure + temperature + humidity + wind speed + wind direction + station features (longitude, latitude) + humidity diff 1, air temperature diff 1, wind direction diff 1, wind speed diff 1
    # future (time features (month, day, hour) + humidity diff 2, air temperature diff 2, wind direction diff 2, wind speed diff 2 + air pressure + temperature + humidity + wind speed + wind direction) + humidity diff 1, air temperature diff 1, wind direction diff 1, wind speed diff 1
    # For air pressure, temperature, humidity and wind speed, also add max, min, most recent value, mean, median, max-min range, standard deviation, variance
    # [Key point] Time features are represented as a point on a circle in the 2D plane
    # Wind direction is represented as a point on a circle in the 2D plane
    # Header fields of the data file:
    # air_pressure,CO,humidity,AQI,monitoring_time,NO2,O3,PM10,PM25,SO2,station_number,air_temperature,wind_direction,wind_speed,longitude,latitude,station_type_name
    usecols = ['air_pressure', 'humidity', 'monitoring_time', self.factor, 'station_number',
               'air_temperature', 'wind_direction', 'wind_speed', 'longitude', 'latitude']
    df = pd.read_csv(data_path, usecols=usecols, low_memory=False)
    station_list = list(set(df['station_number'].values.tolist()))
    station_list.sort()
    print('station_list', station_list)

    # Time features
    df['monitoring_time'] = pd.to_datetime(df['monitoring_time'])
    #df['year'] = df['monitoring_time'].map(lambda x: (x.year))
    df['month_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.month, 12)[0])
    df['day_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.day, 31)[0])
    df['hour_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.hour, 24)[0])
    df['dayofweek_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.dayofweek + 1, 7)[0])
    df['month_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.month, 12)[1])
    df['day_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.day, 31)[1])
    df['hour_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.hour, 24)[1])
    df['dayofweek_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.dayofweek + 1, 7)[1])

    # Difference from the 23:00 value of the previous day
    # Computes: humidity_diff, air_temperature_diff, wind_speed_diff (wind_direction_diff is disabled)
    df = self.calcu_diff(df, 'humidity')
    df = self.calcu_diff(df, 'air_temperature')
    #df = self.calcu_diff(df, 'wind_direction')
    df = self.calcu_diff(df, 'wind_speed')
    df = self.calcu_diff(df, self.factor)

    # Adjacent (hour-to-hour) difference
    # Computes: humidity_diff2, air_temperature_diff2, wind_speed_diff2 (wind_direction_diff2 is disabled)
    df = self.calcu_diff2(df, 'humidity')
    df = self.calcu_diff2(df, 'air_temperature')
    #df = self.calcu_diff2(df, 'wind_direction')
    df = self.calcu_diff2(df, 'wind_speed')
    df = self.calcu_diff2(df, self.factor)

    # Wind direction is represented as a point on a circle in the 2D plane
    # The direction is bucketed into 8 sectors of 45 degrees
    df['wind_direction_x'] = df['wind_direction'].map(lambda x: self.to_periodic_feature(int(x // 45), 8)[0])
    df['wind_direction_y'] = df['wind_direction'].map(lambda x: self.to_periodic_feature(int(x // 45), 8)[1])

    # Target factor (self.factor): max, min, most recent value, mean, median, max-min range, std, variance
    df = self.calcu_value_feature3(df, self.factor)
    # Air pressure, temperature, humidity, wind speed: the same statistics
    df = self.calcu_value_feature3(df, 'humidity')
    df = self.calcu_value_feature3(df, 'air_temperature')
    df = self.calcu_value_feature3(df, 'wind_speed')
    df = self.calcu_value_feature3(df, 'air_pressure')

    # Count the total number of samples first
    n_total = 0
    n_step = 24  # stride between consecutive samples, in rows (hours)
    for site_id in station_list:
        site_df = df[(df['station_number'] == site_id)]
        for i in range(0, site_df.shape[0] - self.n_output - self.n_input, n_step):
            n_total += 1
    print('n_total:', n_total)

    # Width of one row of X: 65 columns from x_feat, plus 52 columns from x_feat[:-13]
    x_len = (8 + 6 + 3 + 3 + 2 + 8 + 2 + 4*8) + 1 + (8 + 6 + 3 + 3 + 4*8)
    X = np.ones((n_total, self.n_input, x_len), dtype=np.float32)
    Y = np.ones((n_total, self.n_output, 1), dtype=np.float32)
    print('*'*20)
    print('X.shape', X.shape, 'Y.shape', Y.shape)
    print(df.shape)
    print(df.head())

    n = 0
    x_feat = (
        ['month_x', 'day_x', 'hour_x', 'dayofweek_x'] +
        ['month_y', 'day_y', 'hour_y', 'dayofweek_y'] +
        ['air_pressure', 'humidity', 'air_temperature', 'wind_direction_x', 'wind_direction_y', 'wind_speed'] +
        ['humidity_diff', 'air_temperature_diff', 'wind_speed_diff'] +
        ['humidity_diff2', 'air_temperature_diff2', 'wind_speed_diff2'] +
        [i + '_max' for i in ['air_pressure', 'humidity', 'air_temperature', 'wind_speed']] +
        [i + '_min' for i in ['air_pressure', 'humidity', 'air_temperature', 'wind_speed']] +
        [i + '_recent' for i in ['air_pressure', 'humidity', 'air_temperature', 'wind_speed']] +
        [i + '_mean' for i in ['air_pressure', 'humidity', 'air_temperature', 'wind_speed']] +
        [i + '_median' for i in ['air_pressure', 'humidity', 'air_temperature', 'wind_speed']] +
        [i + '_max_min_diff' for i in ['air_pressure', 'humidity', 'air_temperature', 'wind_speed']] +
        [i + '_std' for i in ['air_pressure', 'humidity', 'air_temperature', 'wind_speed']] +
        [i + '_var' for i in ['air_pressure', 'humidity', 'air_temperature', 'wind_speed']] +
        ['longitude', 'latitude'] +
        [self.factor + '_diff', self.factor + '_diff2', self.factor + '_max', self.factor + '_min',
         self.factor + '_recent', self.factor + '_mean', self.factor + '_median',
         self.factor + '_max_min_diff', self.factor + '_std', self.factor + '_var'] +
        [self.factor]
    )
    for site_id in station_list:
        site_df = df[(df['station_number'] == site_id)]
        print('site_id:', site_id)
        for i in range(0, site_df.shape[0] - self.n_output - self.n_input, n_step):
            # Left block: all features of the n_input history rows.
            # Right block: time and weather features of the rows shifted by n_output;
            # x_feat[:-13] drops longitude/latitude, the 10 factor-derived columns and the factor itself.
            X[n] = np.hstack((
                site_df.loc[site_df.index[i: i + self.n_input], x_feat].values,
                site_df.loc[site_df.index[i + self.n_output: i + self.n_input + self.n_output], x_feat[:-13]].values))
            Y[n] = site_df.loc[site_df.index[i + self.n_input: i + self.n_input + self.n_output], [self.factor]].values
            n += 1

    X = X.reshape(n_total, -1)
    Y = Y.reshape(n_total, -1)
    np.save('./ml_data/%s_%s_X-%d-%d-%s.npy' % (self.factor, data_type, self.n_input, self.n_output, self.version), X)
    np.save('./ml_data/%s_%s_Y-%d-%d-%s.npy' % (self.factor, data_type, self.n_input, self.n_output, self.version), Y)
    return X, Y

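def calcu_diff(self, df, field):
    # Difference between each value and the last value of the previous 24-row block
    # (the 23:00 reading of the previous day when the data starts at 00:00).
    # Assumes hourly rows, a default 0..n-1 index, and len(df) being a multiple of 24.
    df['tmp'] = df[field].copy()
    tmp_list = df.loc[df['tmp'].index[23::24], 'tmp'].tolist()
    tmp_list.insert(0, df.loc[0, field])  # the first block has no previous day; use the first value
    tmp_list.pop(-1)
    output = [val for val in tmp_list for _ in range(24)]
    df['tmp'] = output
    df[field + '_diff'] = df[field] - df['tmp']
    df.drop(['tmp'], axis=1, inplace=True)
    return df
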
def calcu_diff2(self, df, field):
    # Difference between the current hour's value and the previous hour's value
    df['tmp'] = df[field].copy()
    df['tmp'] = df['tmp'].shift(1)
    df.loc[0, 'tmp'] = df.loc[1, 'tmp']  # fill the first row with its own value, so its diff is 0
    df[field + '_diff2'] = df[field] - df['tmp']
    print(df.head())
    df.drop(['tmp'], axis=1, inplace=True)
    return df

def calcu_value_feature(self, df, field):
    # Per 24-row block of `field`: max, min, most recent value, mean, median
    # (earlier version; load_data uses calcu_value_feature3)
    tmp_list = df[field].tolist()
    max_val_list = [max(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for _ in range(24)]
    min_val_list = [min(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for _ in range(24)]
    recent_val_list = [tmp_list[i+23] for i in range(0, len(tmp_list), 24) for _ in range(24)]
    mean_val_list = [sum(tmp_list[i:i+24])/24.0 for i in range(0, len(tmp_list), 24) for _ in range(24)]
    median_val_list = [self.get_median(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for _ in range(24)]
    df[field + '_max'] = max_val_list
    df[field + '_min'] = min_val_list
    df[field + '_recent'] = recent_val_list
    df[field + '_mean'] = mean_val_list
    df[field + '_median'] = median_val_list
    return df

def calcu_value_feature3(self, df, field):
    # Per 24-row block of `field`: max, min, most recent value, mean, median,
    # max-min range, standard deviation, variance.
    # Each block statistic is repeated for all 24 rows of that block;
    # len(df) is assumed to be a multiple of 24.
    tmp_list = df[field].tolist()
    max_val_list = [max(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for _ in range(24)]
    min_val_list = [min(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for _ in range(24)]
    recent_val_list = [tmp_list[i+23] for i in range(0, len(tmp_list), 24) for _ in range(24)]
    mean_val_list = [sum(tmp_list[i:i+24])/24.0 for i in range(0, len(tmp_list), 24) for _ in range(24)]
    median_val_list = [self.get_median(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for _ in range(24)]
    std_val_list = [self.get_std(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for _ in range(24)]
    var_val_list = [self.get_var(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for _ in range(24)]
    df[field + '_max'] = max_val_list
    df[field + '_min'] = min_val_list
    df[field + '_recent'] = recent_val_list
    df[field + '_mean'] = mean_val_list
    df[field + '_median'] = median_val_list
    df[field + '_max_min_diff'] = df[field + '_max'] - df[field + '_min']
    df[field + '_std'] = std_val_list
    df[field + '_var'] = var_val_list
    return df
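The sample construction inside `load_data` may be easier to follow on a toy array. The sketch below reproduces only the slicing logic for a single station; `make_windows`, `n_future_cols` and the toy data are hypothetical names introduced for illustration and are not part of the original code.

```python
import numpy as np

def make_windows(features, target, n_input, n_output, n_step=24, n_future_cols=None):
    """Slice one station's rows into (X, Y) samples, mirroring load_data.

    features: array of shape (n_rows, n_feat); target: array of shape (n_rows,).
    n_future_cols: how many leading feature columns to reuse from the shifted rows
    (the post keeps x_feat[:-13]); defaults to all columns.
    """
    n_rows, n_feat = features.shape
    if n_future_cols is None:
        n_future_cols = n_feat
    xs, ys = [], []
    for i in range(0, n_rows - n_output - n_input, n_step):
        history = features[i:i + n_input]                       # rows i .. i+n_input-1
        shifted = features[i + n_output:i + n_input + n_output,  # same length, shifted by n_output
                           :n_future_cols]
        xs.append(np.hstack((history, shifted)).reshape(-1))    # flatten to one row per sample
        ys.append(target[i + n_input:i + n_input + n_output])   # the n_output values to predict
    return np.asarray(xs, dtype=np.float32), np.asarray(ys, dtype=np.float32)

# Toy data: 5 days of hourly rows, 3 feature columns, reuse 2 of them from the shifted rows.
toy_features = np.arange(120 * 3, dtype=np.float32).reshape(120, 3)
toy_target = np.arange(120, dtype=np.float32)
X, Y = make_windows(toy_features, toy_target, n_input=24, n_output=24, n_future_cols=2)
print(X.shape, Y.shape)
```

One detail worth checking against your own settings: the shifted block starts at `i + n_output`, while the target starts at `i + n_input`, so the two cover the same rows only when `n_input == n_output`.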
3. Model training results
Model quality is measured with SMAPE; the results are as follows:
Factor | SMAPE |
---|---|
PM25 | 0.28 |
PM10 | 0.29 |
O3 | 0.316 |
[Note] The results above were obtained after hyperparameter tuning; for the xgb.XGBRegressor tuning process, see my posts listed below.
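For readers unfamiliar with the metric, here is a minimal sketch of one common SMAPE definition (mean of |prediction - truth| divided by the average of their absolute values, giving a range of 0 to 2). The post does not state which variant was used, so both the formula and the `eps` guard against zero denominators are assumptions.

```python
import numpy as np

def smape(y_true, y_pred, eps=1e-8):
    """Symmetric mean absolute percentage error, as a fraction in [0, 2].

    Assumed variant; eps avoids division by zero when both values are 0.
    """
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return float(np.mean(np.abs(y_pred - y_true) / np.maximum(denom, eps)))

# Works on the flattened Y arrays produced by load_data as well as plain lists.
score = smape([100.0, 80.0], [100.0, 80.0])
```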
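4. Other references
[AI in Practice] Speeding up XGBRegressor training: training XGBRegressor in seconds on a GPU
[AI in Practice] Tuning xgb.XGBRegressor for multi-output regression with MultiOutputRegressor, part 1
[AI in Practice] Tuning xgb.XGBRegressor for multi-output regression with MultiOutputRegressor, part 2 (training on a GPU)
5. Summary
Feature engineering matters: as more effective features were added, the SMAPE of the O3 model dropped from 0.41 to 0.31, a clear improvement.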