当前位置:网站首页>[AI practice] Application xgboost Xgbregressor builds air quality prediction model (II)
[AI practice] Application xgboost Xgbregressor builds air quality prediction model (II)
2022-07-07 14:04:00 【szZack】
Last one :【AI actual combat 】 application xgboost.XGBRegressor Build an air quality prediction model ( One )
This chapter mainly explains the data feature processing .
1、 features
With PM2.5 Explain the artificial characteristics as an example :
# features :
# PM25 + PM25 Adjacent difference + PM25 The maximum of 、 minimum value 、 Recent value 、 Average 、 Median 、 The difference between the maximum and minimum values 、 Standard deviation 、 variance +
# Time characteristics ( month 、 Japan 、 Hours 、 week ) + Humidity difference 2, The temperature difference 2, The wind direction is different 2, Wind speed difference 2 + pressure + temperature + humidity + The wind speed + wind direction + Site features ( Longitude and latitude ) + Humidity difference 1, The temperature difference 1, The wind direction is different 1, Wind speed difference 1
# In the future ( Time characteristics ( month 、 Japan 、 Hours ) + Humidity difference 2, The temperature difference 2, The wind direction is different 2, Wind speed difference 2 + pressure + temperature + humidity + The wind speed + wind direction ) + Humidity difference 1, The temperature difference 1, The wind direction is different 1, Wind speed difference 1
# Yes pressure + temperature + humidity + The wind speed All increase Maximum 、 minimum value 、 Recent value 、 Average 、 Median 、 The difference between the maximum and minimum values 、 Standard deviation 、 variance
# 【 a key 】 Time characteristics use A point on the circumference of a two-dimensional plane It means
# Wind direction characteristics are used A point on the circumference of a two-dimensional plane It means
2、 Core code
Data feature processing code :
def load_data(self, data_type, data_path, n_input, n_output):
# features :
# PM25 + PM25 Adjacent difference + PM25 The maximum of 、 minimum value 、 Recent value 、 Average 、 Median 、 The difference between the maximum and minimum values 、 Standard deviation 、 variance +
# Time characteristics ( month 、 Japan 、 Hours 、 week ) + Humidity difference 2, The temperature difference 2, The wind direction is different 2, Wind speed difference 2 + pressure + temperature + humidity + The wind speed + wind direction + Site features ( Longitude and latitude ) + Humidity difference 1, The temperature difference 1, The wind direction is different 1, Wind speed difference 1
# In the future ( Time characteristics ( month 、 Japan 、 Hours ) + Humidity difference 2, The temperature difference 2, The wind direction is different 2, Wind speed difference 2 + pressure + temperature + humidity + The wind speed + wind direction ) + Humidity difference 1, The temperature difference 1, The wind direction is different 1, Wind speed difference 1
# Yes pressure + temperature + humidity + The wind speed All increase Maximum 、 minimum value 、 Recent value 、 Average 、 Median 、 The difference between the maximum and minimum values 、 Standard deviation 、 variance
# 【 a key 】 Time characteristics use A point on the circumference of a two-dimensional plane It means
# Wind direction characteristics are used A point on the circumference of a two-dimensional plane It means
# Field header of data file :
# air_pressure,CO,humidity,AQI,monitoring_time,NO2,O3,PM10,PM25,SO2,station_number,air_temperature,wind_direction,wind_speed,longitude,latitude,station_type_name
usecols=['air_pressure','humidity','monitoring_time',self.factor,'station_number','air_temperature','wind_direction','wind_speed','longitude','latitude']
df = pd.read_csv(data_path, usecols=usecols, low_memory=False)
station_list = list(set(df['station_number'].values.tolist()))
station_list.sort()
print('station_list', station_list)
# Time characteristics
df['monitoring_time'] = pd.to_datetime(df['monitoring_time'])
#df['year'] = df['monitoring_time'].map(lambda x: (x.year))
df['month_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.month, 12)[0])
df['day_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.day,31)[0])
df['hour_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.hour,24)[0])
df['dayofweek_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.dayofweek+1,7)[0])
df['month_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.month, 12)[1])
df['day_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.day,31)[1])
df['hour_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.hour,24)[1])
df['dayofweek_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.dayofweek+1,7)[1])
# Calculate with the previous day 23 Time difference
# Calculation :humidity_diff Humidity difference ,air_temperature_diff The temperature difference ,wind_direction_diff The wind direction is different ,wind_speed_diff Wind speed difference
df = self.calcu_diff(df, 'humidity')
df = self.calcu_diff(df, 'air_temperature')
#df = self.calcu_diff(df, 'wind_direction')
df = self.calcu_diff(df, 'wind_speed')
df = self.calcu_diff(df, self.factor)
# Calculate the adjacent difference
# Calculation 2:humidity_diff Humidity difference ,air_temperature_diff The temperature difference ,wind_direction_diff The wind direction is different ,wind_speed_diff Wind speed difference
df = self.calcu_diff2(df, 'humidity')
df = self.calcu_diff2(df, 'air_temperature')
#df = self.calcu_diff2(df, 'wind_direction')
df = self.calcu_diff2(df, 'wind_speed')
df = self.calcu_diff2(df, self.factor)
# Wind direction use A point on the circumference of a two-dimensional plane It means
# The wind direction is divided into 8 Directions
df['wind_direction_x'] = df['wind_direction'].map(lambda x: self.to_periodic_feature(int(x//45), 8)[0])
df['wind_direction_y'] = df['wind_direction'].map(lambda x: self.to_periodic_feature(int(x//45), 8)[1])
# Calculation O3 The maximum of 、 minimum value 、 Recent value 、 Average 、 Median 、 The difference between the maximum and minimum values
df = self.calcu_value_feature3(df, self.factor)
# pressure + temperature + humidity + The wind speed Maximum 、 minimum value 、 Recent value 、 Average 、 Median 、 The difference between the maximum and minimum values
df = self.calcu_value_feature3(df, 'humidity')
df = self.calcu_value_feature3(df, 'air_temperature')
df = self.calcu_value_feature3(df, 'wind_speed')
df = self.calcu_value_feature3(df, 'air_pressure')
# First calculate the total quantity
n_total = 0
n_step = 24# Take the step of data
for site_id in station_list:
site_df = df[(df['station_number'] == site_id)]
for i in range(0, site_df.shape[0] - self.n_output - self.n_input, n_step):
n_total += 1
print('n_total:', n_total)
# X The length of :
x_len = (8 + 6 + 3 +3+ 2+8+2 + 4*8) + 1 + (8+6+3+3 + 4*8)
X = np.ones((n_total, self.n_input, x_len), dtype=np.float32)
Y = np.ones((n_total, self.n_output, 1), dtype=np.float32)
print('*'*20)
print('X.shape', X.shape, 'Y.shape', Y.shape)
print(df.shape)
print(df.head())
n = 0
x_feat = ['month_x','day_x','hour_x', 'dayofweek_x'] + \
['month_y','day_y','hour_y', 'dayofweek_y'] + \
['air_pressure','humidity', 'air_temperature', 'wind_direction_x', 'wind_direction_y', 'wind_speed'] + \
['humidity_diff', 'air_temperature_diff', 'wind_speed_diff'] + \
['humidity_diff2', 'air_temperature_diff2', 'wind_speed_diff2'] + \
[i+'_max' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
[i+'_min' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
[i+'_recent' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
[i+'_mean' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
[i+'_median' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
[i+'_max_min_diff' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
[i+'_std' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
[i+'_var' for i in ['air_pressure','humidity', 'air_temperature', 'wind_speed']] + \
['longitude', 'latitude'] + \
[self.factor + '_diff', self.factor + '_diff2', self.factor + '_max', self.factor + '_min', self.factor + '_recent', self.factor + '_mean', self.factor + '_median', self.factor + '_max_min_diff', self.factor + '_std', self.factor + '_var'] + \
[self.factor]
for site_id in station_list:
site_df = df[(df['station_number'] == site_id)]
print('site_id:', site_id)
for i in range(0, site_df.shape[0] - self.n_output - self.n_input, n_step):
X[n] = np.hstack((site_df.loc[site_df.index[i: i+self.n_input], x_feat].values, \
site_df.loc[site_df.index[i+self.n_output: i+self.n_input+self.n_output], x_feat[:-13]].values))
Y[n] = site_df.loc[site_df.index[i+self.n_input: i+self.n_input+self.n_output], [self.factor]].values
n += 1
X = X.reshape(n_total, -1)
Y = Y.reshape(n_total, -1)
np.save('./ml_data/%s_%s_X-%d-%d-%s.npy' %(self.factor, data_type, self.n_input, self.n_output, self.version), X)
np.save('./ml_data/%s_%s_Y-%d-%d-%s.npy' %(self.factor, data_type, self.n_input, self.n_output, self.version), Y)
return X, Y
def calcu_diff(self, df, field):
df['tmp'] = df[field].copy()
tmp_list = df.loc[df['tmp'].index[23::24], 'tmp'].tolist()
tmp_list.insert(0, df.loc[0, field])
tmp_list.pop(-1)
output = [val for val in tmp_list for _ in range(24)]
df['tmp'] = output
df[field + '_diff'] = df[field] - df['tmp']
df.drop(['tmp'], axis=1, inplace=True)
return df
def calcu_diff2(self, df, field):
# The difference between the value of the current hour and the value of the previous hour
df['tmp'] = df[field].copy()
df['tmp'] = df['tmp'].shift(1)
df.loc[0, 'tmp'] = df.loc[1, 'tmp']
df[field + '_diff2'] = df[field] - df['tmp']
print(df.head())
df.drop(['tmp'], axis=1, inplace=True)
return df
def calcu_value_feature(self, df, field):
# Calculation O3 The maximum of 、 minimum value 、 Recent value 、 Average 、 Median
tmp_list = df[field].tolist()
max_val_list = [max(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
min_val_list = [min(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
recent_val_list = [tmp_list[i+23] for i in range(0, len(tmp_list), 24) for j in range(24)]
mean_val_list = [sum(tmp_list[i:i+24])/24.0 for i in range(0, len(tmp_list), 24) for j in range(24)]
median_val_list = [self.get_median(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
df[field + '_max'] = max_val_list
df[field + '_min'] = min_val_list
df[field + '_recent'] = recent_val_list
df[field + '_mean'] = mean_val_list
df[field + '_median'] = median_val_list
return df
def calcu_value_feature3(self, df, field):
# Calculation O3 The maximum of 、 minimum value 、 Recent value 、 Average 、 Median 、 The difference between the maximum and minimum values 、 Standard deviation 、 variance
tmp_list = df[field].tolist()
max_val_list = [max(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
min_val_list = [min(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
recent_val_list = [tmp_list[i+23] for i in range(0, len(tmp_list), 24) for j in range(24)]
mean_val_list = [sum(tmp_list[i:i+24])/24.0 for i in range(0, len(tmp_list), 24) for j in range(24)]
t=time.time()
median_val_list = [self.get_median(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
std_val_list = [self.get_std(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
var_val_list = [self.get_var(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
df[field + '_max'] = max_val_list
df[field + '_min'] = min_val_list
df[field + '_recent'] = recent_val_list
df[field + '_mean'] = mean_val_list
df[field + '_median'] = median_val_list
df[field + '_max_min_diff'] = df[field + '_max'] - df[field + '_min']
df[field + '_std'] = std_val_list
df[field + '_var'] = var_val_list
return df
3、 Model training effect
With smape To measure the effectiveness of the model , give the result as follows :
| factor | smape |
|---|---|
| PM25 | 0.28 |
| PM10 | 0.29 |
| O3 | 0.316 |
【 notes 】 The above result is after After model parameter adjustment The result ,xgb.XGBRegressor The parameter adjustment process of can refer to my following article
4、 Other reference
【AI actual combat 】XGBRegressor Model acceleration training , Use GPU Second training XGBRegressor
【AI actual combat 】xgb.XGBRegressor Multiple regression MultiOutputRegressor Adjustable parameter 1
5、 summary
Feature engineering is important , With the increase of effective features ,O3 Model smape from 0.41 Down to 0.31, The effect is obviously improved .
边栏推荐
- 【堡垒机】云堡垒机和普通堡垒机的区别是什么?
- [1] ROS2基础知识-操作命令总结版
- "New red flag Cup" desktop application creativity competition 2022
- SSRF vulnerability file pseudo protocol [netding Cup 2018] fakebook1
- Flink | multi stream conversion
- Redis can only cache? Too out!
- Evolution of customer service hotline of dewu
- PostgreSQL array type, each splice
- 【日常训练】648. 单词替换
- Drawerlayout suppress sideslip display
猜你喜欢

Redis can only cache? Too out!

C语言数组相关问题深度理解

作战图鉴:12大场景详述容器安全建设要求

Custom thread pool rejection policy

2022-7-6 Leetcode27.移除元素——太久没有做题了,为双指针如此狼狈的一天
![供应链供需预估-[时间序列]](/img/2c/82d118cfbcef4498998298dd3844b1.png)
供应链供需预估-[时间序列]

AI talent cultivation new ideas, this live broadcast has what you care about

Xshell connection server changes key login to password login

Help tenants

MySQL error 28 and solution
随机推荐
C语言数组相关问题深度理解
What are the principles for distinguishing the security objectives and implementation methods that cloud computing security expansion requires to focus on?
FC连接数据库,一定要使用自定义域名才能在外面访问吗?
2022-7-7 Leetcode 844.比较含退格的字符串
华为镜像地址
Environment configuration of lavarel env
Learning breakout 2 - about effective learning methods
Enregistrement de la navigation et de la mise en service du robot ROS intérieur (expérience de sélection du rayon de dilatation)
【AI实战】应用xgboost.XGBRegressor搭建空气质量预测模型(二)
[untitled]
Did login metamask
作战图鉴:12大场景详述容器安全建设要求
mysql导入文件出现Data truncated for column ‘xxx’ at row 1的原因
2022-7-6 使用SIGURG来接受外带数据,不知道为什么打印不出来
【堡垒机】云堡垒机和普通堡垒机的区别是什么?
Environment configuration
云计算安全扩展要求关注的安全目标和实现方式区分原则有哪些?
[1] ROS2基础知识-操作命令总结版
toRaw和markRaw
最佳实践 | 用腾讯云AI意愿核身为电话合规保驾护航