[AI in Practice] Building an Air Quality Prediction Model with xgboost.XGBRegressor (Part 2)
2022-07-07 11:59:00 【szZack】
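Previous post: [AI in Practice] Building an Air Quality Prediction Model with xgboost.XGBRegressor (Part 1)
This post walks through the feature processing in detail.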
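1. Features
The hand-crafted features are described below, using PM2.5 as the example:
# Features:
# PM25 + PM25 hour-to-hour difference + PM25 max, min, most recent value, mean, median, max-min range, standard deviation, variance +
# time features (month, day, hour, day of week) + humidity diff2, air temperature diff2, wind direction diff2, wind speed diff2 + air pressure + temperature + humidity + wind speed + wind direction + station features (longitude, latitude) + humidity diff1, air temperature diff1, wind direction diff1, wind speed diff1
# future (time features (month, day, hour) + humidity diff2, air temperature diff2, wind direction diff2, wind speed diff2 + air pressure + temperature + humidity + wind speed + wind direction) + humidity diff1, air temperature diff1, wind direction diff1, wind speed diff1
# For air pressure, temperature, humidity and wind speed, also add max, min, most recent value, mean, median, max-min range, standard deviation, variance
# (diff1 = difference from the previous day's 23:00 value, diff2 = difference from the previous hour)
# [Key point] Time features are represented as a point on a circle in a 2-D plane
# Wind direction is represented as a point on a circle in a 2-D plane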
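The circular representation named in the feature list can be sketched as follows. The post calls a helper `to_periodic_feature(value, period)` and reads two components from its result, but does not show its body, so the version here is an assumption: it places the value on the unit circle and returns `(cos, sin)`. The point of the encoding is that hour 23 and hour 0, or wind sector 7 and sector 0, end up as neighbours instead of being far apart numerically.

```python
import math


def to_periodic_feature(value, period):
    """Hypothetical stand-in for the helper used in the post.

    Maps a cyclic value (hour, month, wind sector, ...) to a point
    (x, y) on the unit circle. The (cos, sin) order is an assumption.
    """
    angle = 2.0 * math.pi * (value % period) / period
    return math.cos(angle), math.sin(angle)


def euclidean(a, b):
    """Straight-line distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


hour_0 = to_periodic_feature(0, 24)
hour_12 = to_periodic_feature(12, 24)
hour_23 = to_periodic_feature(23, 24)

# Wind direction in degrees is first bucketed into 8 sectors of 45 degrees,
# as in the post, and then encoded the same way.
wind_sector = int(350 // 45)
wind_xy = to_periodic_feature(wind_sector, 8)

print(euclidean(hour_23, hour_0) < euclidean(hour_12, hour_0))
```

With a plain integer hour, 23 and 0 would be 23 units apart; on the circle they are adjacent, which is usually what a tree model needs to pick up daily cycles.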
2. Core code
Feature processing code:
import numpy as np
import pandas as pd


def load_data(self, data_type, data_path, n_input, n_output):
    # Features:
    # PM25 + PM25 hour-to-hour difference + PM25 max, min, most recent value, mean, median, max-min range, standard deviation, variance +
    # time features (month, day, hour, day of week) + humidity diff2, air temperature diff2, wind direction diff2, wind speed diff2 + air pressure + temperature + humidity + wind speed + wind direction + station features (longitude, latitude) + humidity diff1, air temperature diff1, wind direction diff1, wind speed diff1
    # future (time features (month, day, hour) + humidity diff2, air temperature diff2, wind direction diff2, wind speed diff2 + air pressure + temperature + humidity + wind speed + wind direction) + humidity diff1, air temperature diff1, wind direction diff1, wind speed diff1
    # For air pressure, temperature, humidity and wind speed, also add max, min, most recent value, mean, median, max-min range, standard deviation, variance
    # [Key point] Time features are represented as a point on a circle in a 2-D plane
    # Wind direction is represented as a point on a circle in a 2-D plane
    # Header of the data file:
    # air_pressure,CO,humidity,AQI,monitoring_time,NO2,O3,PM10,PM25,SO2,station_number,air_temperature,wind_direction,wind_speed,longitude,latitude,station_type_name
    usecols = ['air_pressure', 'humidity', 'monitoring_time', self.factor, 'station_number',
               'air_temperature', 'wind_direction', 'wind_speed', 'longitude', 'latitude']
    df = pd.read_csv(data_path, usecols=usecols, low_memory=False)
    station_list = sorted(set(df['station_number'].values.tolist()))
    print('station_list', station_list)

    # Time features: every cyclic field becomes an (x, y) point on a circle
    df['monitoring_time'] = pd.to_datetime(df['monitoring_time'])
    #df['year'] = df['monitoring_time'].map(lambda x: (x.year))
    df['month_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.month, 12)[0])
    df['day_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.day, 31)[0])
    df['hour_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.hour, 24)[0])
    df['dayofweek_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.dayofweek + 1, 7)[0])
    df['month_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.month, 12)[1])
    df['day_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.day, 31)[1])
    df['hour_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.hour, 24)[1])
    df['dayofweek_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.dayofweek + 1, 7)[1])

    # Difference from the 23:00 value of the previous day (columns *_diff).
    # The wind direction difference is currently disabled.
    df = self.calcu_diff(df, 'humidity')
    df = self.calcu_diff(df, 'air_temperature')
    #df = self.calcu_diff(df, 'wind_direction')
    df = self.calcu_diff(df, 'wind_speed')
    df = self.calcu_diff(df, self.factor)

    # Difference from the previous hour (columns *_diff2).
    # The wind direction difference is currently disabled.
    df = self.calcu_diff2(df, 'humidity')
    df = self.calcu_diff2(df, 'air_temperature')
    #df = self.calcu_diff2(df, 'wind_direction')
    df = self.calcu_diff2(df, 'wind_speed')
    df = self.calcu_diff2(df, self.factor)

    # Wind direction is represented as a point on a circle in a 2-D plane.
    # The direction in degrees is bucketed into 8 sectors of 45 degrees.
    df['wind_direction_x'] = df['wind_direction'].map(lambda x: self.to_periodic_feature(int(x // 45), 8)[0])
    df['wind_direction_y'] = df['wind_direction'].map(lambda x: self.to_periodic_feature(int(x // 45), 8)[1])

    # Max, min, most recent value, mean, median, max-min range, std and variance
    # of the target pollutant (self.factor)
    df = self.calcu_value_feature3(df, self.factor)
    # The same statistics for humidity, air temperature, wind speed and air pressure
    df = self.calcu_value_feature3(df, 'humidity')
    df = self.calcu_value_feature3(df, 'air_temperature')
    df = self.calcu_value_feature3(df, 'wind_speed')
    df = self.calcu_value_feature3(df, 'air_pressure')

    # Count the total number of samples first so X and Y can be preallocated
    n_total = 0
    n_step = 24  # stride between consecutive samples, in rows (hours)
    for site_id in station_list:
        site_df = df[(df['station_number'] == site_id)]
        n_total += len(range(0, site_df.shape[0] - self.n_output - self.n_input, n_step))
    print('n_total:', n_total)

    # Width of X per time step:
    # 65 current features (len(x_feat)) + 52 future features (len(x_feat[:-13]))
    x_len = (8 + 6 + 3 + 3 + 2 + 8 + 2 + 4*8) + 1 + (8 + 6 + 3 + 3 + 4*8)
    X = np.ones((n_total, self.n_input, x_len), dtype=np.float32)
    Y = np.ones((n_total, self.n_output, 1), dtype=np.float32)
    print('*' * 20)
    print('X.shape', X.shape, 'Y.shape', Y.shape)
    print(df.shape)
    print(df.head())

    n = 0
    weather = ['air_pressure', 'humidity', 'air_temperature', 'wind_speed']
    stat_suffixes = ['_max', '_min', '_recent', '_mean', '_median', '_max_min_diff', '_std', '_var']
    # The last 13 entries (station coordinates + target-derived columns + target)
    # are not available for the future window and are dropped via x_feat[:-13].
    x_feat = ['month_x', 'day_x', 'hour_x', 'dayofweek_x'] + \
             ['month_y', 'day_y', 'hour_y', 'dayofweek_y'] + \
             ['air_pressure', 'humidity', 'air_temperature', 'wind_direction_x', 'wind_direction_y', 'wind_speed'] + \
             ['humidity_diff', 'air_temperature_diff', 'wind_speed_diff'] + \
             ['humidity_diff2', 'air_temperature_diff2', 'wind_speed_diff2'] + \
             [name + suffix for suffix in stat_suffixes for name in weather] + \
             ['longitude', 'latitude'] + \
             [self.factor + suffix for suffix in ['_diff', '_diff2'] + stat_suffixes] + \
             [self.factor]

    for site_id in station_list:
        site_df = df[(df['station_number'] == site_id)]
        print('site_id:', site_id)
        for i in range(0, site_df.shape[0] - self.n_output - self.n_input, n_step):
            # Past window with all features, stacked side by side with the
            # weather/time features of the window shifted forward by n_output rows
            X[n] = np.hstack((
                site_df.loc[site_df.index[i: i + self.n_input], x_feat].values,
                site_df.loc[site_df.index[i + self.n_output: i + self.n_input + self.n_output], x_feat[:-13]].values))
            # Target: the pollutant values of the n_output rows after the past window
            Y[n] = site_df.loc[site_df.index[i + self.n_input: i + self.n_input + self.n_output], [self.factor]].values
            n += 1

    # Flatten each sample to one row for XGBRegressor
    X = X.reshape(n_total, -1)
    Y = Y.reshape(n_total, -1)
    np.save('./ml_data/%s_%s_X-%d-%d-%s.npy' % (self.factor, data_type, self.n_input, self.n_output, self.version), X)
    np.save('./ml_data/%s_%s_Y-%d-%d-%s.npy' % (self.factor, data_type, self.n_input, self.n_output, self.version), Y)
    return X, Y
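The sample construction in `load_data` can be easier to follow on row indices alone. The sketch below reproduces only the slicing arithmetic of the loop (same range bound, same three slices); `window_slices` is a hypothetical helper that does not appear in the post, and the toy sizes are made up for illustration.

```python
def window_slices(n_rows, n_input, n_output, step=24):
    """Return (past, future, target) row slices for each sample.

    Mirrors the loop in load_data: the past window feeds all features,
    the future window feeds only weather/time features, and the target
    window supplies the pollutant values to predict.
    """
    samples = []
    for i in range(0, n_rows - n_output - n_input, step):
        past = slice(i, i + n_input)
        future = slice(i + n_output, i + n_input + n_output)
        target = slice(i + n_input, i + n_input + n_output)
        samples.append((past, future, target))
    return samples


# Toy station: 5 days of hourly rows, 24 h of history, 24 h forecast horizon.
samples = window_slices(n_rows=120, n_input=24, n_output=24)
for past, future, target in samples:
    print(past, future, target)
```

Two details become visible this way. When `n_input` equals `n_output`, the future window covers exactly the target rows, so the model sees the weather for the hours it predicts. And because the range bound is exclusive, the window that would end exactly on the last row (here starting at row 72) is not generated.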
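def calcu_diff(self, df, field):
    # Difference between each value and the 23:00 value of the previous day.
    # Assumes a default 0..n-1 index, rows in hourly order and a row count
    # that is a multiple of 24 (complete days).
    df['tmp'] = df[field].copy()
    # Last value of every 24-row block (row 23, 47, ...)
    tmp_list = df.loc[df['tmp'].index[23::24], 'tmp'].tolist()
    # The first day has no previous day: use the very first value instead
    tmp_list.insert(0, df.loc[0, field])
    tmp_list.pop(-1)
    # Repeat each reference value for the 24 rows of the following day
    output = [val for val in tmp_list for _ in range(24)]
    df['tmp'] = output
    df[field + '_diff'] = df[field] - df['tmp']
    df.drop(['tmp'], axis=1, inplace=True)
    return df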
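`calcu_diff` works on the whole frame in fixed 24-row blocks, so it relies on every station having complete days and on the rows being ordered by station and time. A per-station variant that groups by calendar day instead could look like the sketch below. This is not the author's method: `diff_from_prev_day_last` is a hypothetical name, and it assumes that `monitoring_time` is already a datetime column and that rows are sorted by time within each station. If a day is missing, the "previous day" becomes the previous available day.

```python
import pandas as pd


def diff_from_prev_day_last(df, field, station_col="station_number",
                            time_col="monitoring_time"):
    """Subtract the last value of the previous day, station by station."""
    out = df.copy()
    station = out[station_col]
    day = out[time_col].dt.floor("D").rename("day")
    # Last observed value of every (station, day) pair.
    last_per_day = out.groupby([station, day])[field].last()
    # Shift by one day inside each station so stations never mix.
    prev_last = last_per_day.groupby(level=0).shift(1)
    # Look the reference value up again for every original row.
    keys = pd.MultiIndex.from_arrays([station, day])
    prev = pd.Series(prev_last.reindex(keys).to_numpy(), index=out.index)
    # The first day of a station has no previous day: fall back to the
    # station's first value, as calcu_diff does for the first block.
    first = out.groupby(station_col)[field].transform("first")
    out[field + "_diff"] = out[field] - prev.fillna(first)
    return out


# Toy data: two stations, two days of hourly values each.
times = pd.Timestamp("2022-01-01") + pd.to_timedelta(list(range(48)), unit="h")
toy = pd.DataFrame({
    "station_number": ["A"] * 48 + ["B"] * 48,
    "monitoring_time": list(times) * 2,
    "PM25": list(range(48)) + list(range(100, 148)),
})
diffs = diff_from_prev_day_last(toy, "PM25")["PM25_diff"].tolist()
```

On the toy data, the first row of station B is compared with station B's own first value rather than with the last value of station A.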
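def calcu_diff2(self, df, field):
    # Difference between the current hour and the previous hour.
    # Assumes a default 0..n-1 index.
    df['tmp'] = df[field].copy()
    df['tmp'] = df['tmp'].shift(1)
    # The first row has no previous hour: reuse its own value so the diff is 0
    df.loc[0, 'tmp'] = df.loc[1, 'tmp']
    df[field + '_diff2'] = df[field] - df['tmp']
    print(df.head())  # debug output
    df.drop(['tmp'], axis=1, inplace=True)
    return df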
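Like `calcu_diff`, `calcu_diff2` shifts across the whole frame, so the first row of each station after the first is compared with the last row of the previous station. A minimal sketch of a per-station alternative, using pandas `groupby(...).diff()`, is shown below. `hourly_diff` is a hypothetical name; note that `fillna(0.0)` also zeroes differences that are undefined because of missing measurements, which may or may not be what you want.

```python
import pandas as pd


def hourly_diff(df, field, station_col="station_number"):
    """Hour-to-hour difference computed separately for each station."""
    out = df.copy()
    # diff() leaves NaN on the first row of each group; treat it as "no change".
    out[field + "_diff2"] = out.groupby(station_col)[field].diff().fillna(0.0)
    return out


toy = pd.DataFrame({
    "station_number": ["A", "A", "A", "B", "B", "B"],
    "humidity": [1.0, 3.0, 6.0, 10.0, 10.0, 7.0],
})
hourly = hourly_diff(toy, "humidity")["humidity_diff2"].tolist()
print(hourly)
```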
def calcu_value_feature(self, df, field):
    # Max, min, most recent value, mean and median of `field` per 24-row block.
    # Each statistic is repeated for all 24 rows of its block; the row count
    # must be a multiple of 24.
    tmp_list = df[field].tolist()
    max_val_list = [max(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for _ in range(24)]
    min_val_list = [min(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for _ in range(24)]
    recent_val_list = [tmp_list[i+23] for i in range(0, len(tmp_list), 24) for _ in range(24)]
    mean_val_list = [sum(tmp_list[i:i+24])/24.0 for i in range(0, len(tmp_list), 24) for _ in range(24)]
    median_val_list = [self.get_median(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for _ in range(24)]
    df[field + '_max'] = max_val_list
    df[field + '_min'] = min_val_list
    df[field + '_recent'] = recent_val_list
    df[field + '_mean'] = mean_val_list
    df[field + '_median'] = median_val_list
    return df
def calcu_value_feature3(self, df, field):
    # Max, min, most recent value, mean, median, max-min range, standard
    # deviation and variance of `field` per 24-row block.
    # Each statistic is repeated for all 24 rows of its block; the row count
    # must be a multiple of 24.
    tmp_list = df[field].tolist()
    max_val_list = [max(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for _ in range(24)]
    min_val_list = [min(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for _ in range(24)]
    recent_val_list = [tmp_list[i+23] for i in range(0, len(tmp_list), 24) for _ in range(24)]
    mean_val_list = [sum(tmp_list[i:i+24])/24.0 for i in range(0, len(tmp_list), 24) for _ in range(24)]
    median_val_list = [self.get_median(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for _ in range(24)]
    std_val_list = [self.get_std(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for _ in range(24)]
    var_val_list = [self.get_var(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for _ in range(24)]
    df[field + '_max'] = max_val_list
    df[field + '_min'] = min_val_list
    df[field + '_recent'] = recent_val_list
    df[field + '_mean'] = mean_val_list
    df[field + '_median'] = median_val_list
    df[field + '_max_min_diff'] = df[field + '_max'] - df[field + '_min']
    df[field + '_std'] = std_val_list
    df[field + '_var'] = var_val_list
    return df
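The block statistics above are built with Python list comprehensions, and the helpers `get_median`, `get_std` and `get_var` are not shown in the post. A compact NumPy sketch of the same idea follows. `daily_block_stats` is a hypothetical name, and the use of population standard deviation and variance (NumPy's default `ddof=0`) is an assumption about what the missing helpers compute.

```python
import numpy as np


def daily_block_stats(values, block=24):
    """Per-block statistics, each repeated for every row of its block."""
    arr = np.asarray(values, dtype=float)
    if arr.size % block != 0:
        raise ValueError("row count must be a multiple of the block size")
    blocks = arr.reshape(-1, block)  # one row per 24-hour block
    stats = {
        "max": blocks.max(axis=1),
        "min": blocks.min(axis=1),
        "recent": blocks[:, -1],  # last value of the block
        "mean": blocks.mean(axis=1),
        "median": np.median(blocks, axis=1),
        "std": blocks.std(axis=1),  # ddof=0, an assumption
        "var": blocks.var(axis=1),  # ddof=0, an assumption
    }
    stats["max_min_diff"] = stats["max"] - stats["min"]
    # Broadcast each block-level value back to all rows of the block.
    return {name: np.repeat(col, block) for name, col in stats.items()}


# Two toy days with hourly values 0..47.
stats = daily_block_stats(range(48))
print({name: col[0] for name, col in stats.items()})
```

The explicit length check replaces the `IndexError` that the list-based version would raise on an incomplete final day.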
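3. Model training results
Model quality is measured with SMAPE. The results are:
Pollutant | SMAPE |
---|---|
PM25 | 0.28 |
PM10 | 0.29 |
O3 | 0.316 |
[Note] These results were obtained after hyperparameter tuning. For the xgb.XGBRegressor tuning process, see my articles listed below.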
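The post does not state which SMAPE variant it uses. A common definition, assumed here, is the mean of 2·|ŷ − y| / (|y| + |ŷ|), which ranges from 0 to 2. The sketch below implements that variant in plain Python; pairs where both values are zero are counted as zero error, which is also an assumption.

```python
def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, as a fraction in [0, 2]."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        denominator = abs(actual) + abs(predicted)
        if denominator > 0:  # both zero: treat as a perfect prediction
            total += 2.0 * abs(predicted - actual) / denominator
    return total / len(y_true)


perfect = smape([10.0, 20.0], [10.0, 20.0])
off_by_factor_three = smape([1.0], [3.0])
print(perfect, off_by_factor_three)
```

Under this definition a value such as 0.28 means the typical error is roughly 28% of the average of the observed and predicted magnitudes.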
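4. Other references
[AI in Practice] Speeding up XGBRegressor training: training XGBRegressor in seconds on a GPU
[AI in Practice] Tuning xgb.XGBRegressor for multi-output regression with MultiOutputRegressor, part 1
[AI in Practice] Tuning xgb.XGBRegressor for multi-output regression with MultiOutputRegressor, part 2 (training the model on a GPU)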
5. Summary
Feature engineering matters: as effective features were added, the SMAPE of the O3 model dropped from 0.41 to 0.31, a clear improvement.