[AI in Practice] Building an air quality prediction model with xgboost.XGBRegressor (Part 2)
2022-07-07 14:04:00 【szZack】
Previous article: [AI in Practice] Building an air quality prediction model with xgboost.XGBRegressor (Part 1)
This part covers the feature processing of the data.
1、 Features
Taking PM2.5 as the example, the hand-crafted features are listed below. "Difference 1" is the difference from the value at hour 23 of the previous day (the `_diff` columns in the code); "difference 2" is the difference from the previous hour (the `_diff2` columns).
# Features:
# PM25 + PM25 adjacent difference + PM25 maximum, minimum, most recent value, mean, median, max-min difference, standard deviation, variance +
# time features (month, day, hour, day of week) + humidity difference 2, temperature difference 2, wind direction difference 2, wind speed difference 2 + air pressure + temperature + humidity + wind speed + wind direction + station features (longitude and latitude) + humidity difference 1, temperature difference 1, wind direction difference 1, wind speed difference 1
# Future window: (time features (month, day, hour) + humidity difference 2, temperature difference 2, wind direction difference 2, wind speed difference 2 + air pressure + temperature + humidity + wind speed + wind direction) + humidity difference 1, temperature difference 1, wind direction difference 1, wind speed difference 1
# For air pressure, temperature, humidity and wind speed, also add the maximum, minimum, most recent value, mean, median, max-min difference, standard deviation, variance
# [Key point] Time features are represented as a point on a circle in the two-dimensional plane
# Wind direction is also represented as a point on a circle in the two-dimensional plane
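The `to_periodic_feature` helper that the code in section 2 calls is not shown in this post. Below is a minimal sketch of what such a helper might look like, assuming it maps a value with a known period onto the unit circle and returns an (x, y) pair. The cosine-then-sine order is a guess; only the `[0]` / `[1]` indexing is confirmed by the calling code.

```python
import math


def to_periodic_feature(value, period):
    """Hypothetical helper: place a cyclic value on the unit circle.

    Returns (x, y) = (cos(angle), sin(angle)) with angle = 2*pi*value/period.
    """
    angle = 2.0 * math.pi * value / period
    return math.cos(angle), math.sin(angle)


# Hour 23 and hour 0 are one step apart on the circle,
# although the raw values 23 and 0 look far apart.
x23, y23 = to_periodic_feature(23, 24)
x0, y0 = to_periodic_feature(0, 24)
gap_neighbours = math.hypot(x23 - x0, y23 - y0)

# Hour 12 sits on the opposite side of the circle from hour 0.
x12, y12 = to_periodic_feature(12, 24)
gap_opposite = math.hypot(x12 - x0, y12 - y0)

# Wind direction in degrees, bucketed into 8 sectors of 45 degrees as in section 2.
wind_x, wind_y = to_periodic_feature(int(270 // 45), 8)
```

The point of the encoding is that values at the two ends of a cycle (hour 23 and hour 0, December and January, 350 and 10 degrees) end up close together, which a single integer column cannot express.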
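2、 Core code
The feature processing code (methods of the model class; the helpers `to_periodic_feature`, `get_median`, `get_std` and `get_var` are not shown in this post):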
import numpy as np
import pandas as pd

def load_data(self, data_type, data_path, n_input, n_output):
    # Note: the window sizes are read from self.n_input / self.n_output;
    # the n_input / n_output arguments are not used in this method.
    # Features:
    # PM25 + PM25 adjacent difference + PM25 max, min, most recent value, mean, median, max-min difference, std, variance +
    # time features (month, day, hour, day of week) + difference 2 of humidity, temperature, wind direction, wind speed
    #   + air pressure + temperature + humidity + wind speed + wind direction + station features (longitude, latitude)
    #   + difference 1 of humidity, temperature, wind direction, wind speed
    # Future window: (time features (month, day, hour) + difference 2 of the weather fields
    #   + air pressure + temperature + humidity + wind speed + wind direction) + difference 1 of the weather fields
    # Air pressure, temperature, humidity and wind speed also get max, min, most recent value, mean, median,
    #   max-min difference, std, variance
    # [Key point] Time features are represented as a point on a circle in the 2D plane
    # Wind direction is also represented as a point on a circle in the 2D plane
    # Header of the data file:
    # air_pressure,CO,humidity,AQI,monitoring_time,NO2,O3,PM10,PM25,SO2,station_number,air_temperature,wind_direction,wind_speed,longitude,latitude,station_type_name
    usecols = ['air_pressure', 'humidity', 'monitoring_time', self.factor, 'station_number',
               'air_temperature', 'wind_direction', 'wind_speed', 'longitude', 'latitude']
    df = pd.read_csv(data_path, usecols=usecols, low_memory=False)
    station_list = sorted(set(df['station_number'].values.tolist()))
    print('station_list', station_list)

    # Time features: each cyclic value becomes an (x, y) point on a circle
    df['monitoring_time'] = pd.to_datetime(df['monitoring_time'])
    #df['year'] = df['monitoring_time'].map(lambda x: (x.year))
    df['month_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.month, 12)[0])
    df['day_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.day, 31)[0])
    df['hour_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.hour, 24)[0])
    df['dayofweek_x'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.dayofweek + 1, 7)[0])
    df['month_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.month, 12)[1])
    df['day_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.day, 31)[1])
    df['hour_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.hour, 24)[1])
    df['dayofweek_y'] = df['monitoring_time'].map(lambda x: self.to_periodic_feature(x.dayofweek + 1, 7)[1])

    # Difference 1: difference from the value at hour 23 of the previous day
    # Adds humidity_diff, air_temperature_diff, wind_speed_diff and <factor>_diff
    # (the wind direction difference is disabled)
    df = self.calcu_diff(df, 'humidity')
    df = self.calcu_diff(df, 'air_temperature')
    #df = self.calcu_diff(df, 'wind_direction')
    df = self.calcu_diff(df, 'wind_speed')
    df = self.calcu_diff(df, self.factor)

    # Difference 2: difference from the previous hour
    # Adds humidity_diff2, air_temperature_diff2, wind_speed_diff2 and <factor>_diff2
    df = self.calcu_diff2(df, 'humidity')
    df = self.calcu_diff2(df, 'air_temperature')
    #df = self.calcu_diff2(df, 'wind_direction')
    df = self.calcu_diff2(df, 'wind_speed')
    df = self.calcu_diff2(df, self.factor)

    # Wind direction as a point on a circle: degrees are bucketed into 8 sectors of 45 degrees
    df['wind_direction_x'] = df['wind_direction'].map(lambda x: self.to_periodic_feature(int(x // 45), 8)[0])
    df['wind_direction_y'] = df['wind_direction'].map(lambda x: self.to_periodic_feature(int(x // 45), 8)[1])

    # Per-day statistics of the target factor: max, min, most recent value, mean, median,
    # max-min difference, std, variance
    df = self.calcu_value_feature3(df, self.factor)
    # The same statistics for humidity, temperature, wind speed and air pressure
    df = self.calcu_value_feature3(df, 'humidity')
    df = self.calcu_value_feature3(df, 'air_temperature')
    df = self.calcu_value_feature3(df, 'wind_speed')
    df = self.calcu_value_feature3(df, 'air_pressure')

    # First pass: count the samples so the arrays can be allocated up front
    n_total = 0
    n_step = 24  # stride between consecutive samples (one day)
    for site_id in station_list:
        site_df = df[df['station_number'] == site_id]
        for i in range(0, site_df.shape[0] - self.n_output - self.n_input, n_step):
            n_total += 1
    print('n_total:', n_total)

    # Feature length per time step: all of x_feat for the history window (65 columns)
    # plus x_feat[:-13] for the shifted window (52 columns), 117 in total
    x_len = (8 + 6 + 3 + 3 + 2 + 8 + 2 + 4*8) + 1 + (8 + 6 + 3 + 3 + 4*8)
    X = np.ones((n_total, self.n_input, x_len), dtype=np.float32)
    Y = np.ones((n_total, self.n_output, 1), dtype=np.float32)
    print('*' * 20)
    print('X.shape', X.shape, 'Y.shape', Y.shape)
    print(df.shape)
    print(df.head())

    # Column order matters: the last 13 columns (longitude, latitude, the 10 factor
    # statistics and the factor itself) are dropped for the shifted window below
    weather = ['air_pressure', 'humidity', 'air_temperature', 'wind_speed']
    stats = ['_max', '_min', '_recent', '_mean', '_median', '_max_min_diff', '_std', '_var']
    x_feat = (
        ['month_x', 'day_x', 'hour_x', 'dayofweek_x']
        + ['month_y', 'day_y', 'hour_y', 'dayofweek_y']
        + ['air_pressure', 'humidity', 'air_temperature', 'wind_direction_x', 'wind_direction_y', 'wind_speed']
        + ['humidity_diff', 'air_temperature_diff', 'wind_speed_diff']
        + ['humidity_diff2', 'air_temperature_diff2', 'wind_speed_diff2']
        + [f + s for s in stats for f in weather]
        + ['longitude', 'latitude']
        + [self.factor + s for s in ['_diff', '_diff2'] + stats]
        + [self.factor]
    )

    # Second pass: fill X and Y station by station
    n = 0
    for site_id in station_list:
        site_df = df[df['station_number'] == site_id]
        print('site_id:', site_id)
        for i in range(0, site_df.shape[0] - self.n_output - self.n_input, n_step):
            # History window with all features, next to the window shifted forward
            # by n_output rows without the station and target-factor columns
            X[n] = np.hstack((
                site_df.loc[site_df.index[i: i + self.n_input], x_feat].values,
                site_df.loc[site_df.index[i + self.n_output: i + self.n_input + self.n_output], x_feat[:-13]].values,
            ))
            # Target: the factor over the n_output rows that follow the history window
            Y[n] = site_df.loc[site_df.index[i + self.n_input: i + self.n_input + self.n_output], [self.factor]].values
            n += 1

    # Flatten each sample to one row for XGBRegressor
    X = X.reshape(n_total, -1)
    Y = Y.reshape(n_total, -1)
    np.save('./ml_data/%s_%s_X-%d-%d-%s.npy' % (self.factor, data_type, self.n_input, self.n_output, self.version), X)
    np.save('./ml_data/%s_%s_Y-%d-%d-%s.npy' % (self.factor, data_type, self.n_input, self.n_output, self.version), Y)
    return X, Y

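def calcu_diff(self, df, field):
    # Difference between each value and the value at hour 23 of the previous day.
    # As written, this assumes hourly rows in time order starting at hour 0,
    # a default 0..N-1 index, and len(df) divisible by 24.
    df['tmp'] = df[field].copy()
    # Last value of every 24-row block
    tmp_list = df.loc[df['tmp'].index[23::24], 'tmp'].tolist()
    # Shift by one day; the first day has no predecessor, so it reuses the first value
    tmp_list.insert(0, df.loc[0, field])
    tmp_list.pop(-1)
    # Repeat each reference value for the 24 rows of its day
    output = [val for val in tmp_list for _ in range(24)]
    df['tmp'] = output
    df[field + '_diff'] = df[field] - df['tmp']
    df.drop(['tmp'], axis=1, inplace=True)
    return df
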
def calcu_diff2(self, df, field):
    # Difference between the value of the current hour and the value of the previous hour
    df['tmp'] = df[field].copy()
    df['tmp'] = df['tmp'].shift(1)
    # The first row has no predecessor: reuse its own value, so its difference is 0
    df.loc[0, 'tmp'] = df.loc[1, 'tmp']
    df[field + '_diff2'] = df[field] - df['tmp']
    print(df.head())
    df.drop(['tmp'], axis=1, inplace=True)
    return df

def calcu_value_feature(self, df, field):
    # Earlier version: max, min, most recent value, mean and median of `field`
    # over each 24-row block, repeated for every row of the block.
    # Assumes len(df) is divisible by 24.
    tmp_list = df[field].tolist()
    max_val_list = [max(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    min_val_list = [min(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    recent_val_list = [tmp_list[i+23] for i in range(0, len(tmp_list), 24) for j in range(24)]
    mean_val_list = [sum(tmp_list[i:i+24])/24.0 for i in range(0, len(tmp_list), 24) for j in range(24)]
    median_val_list = [self.get_median(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    df[field + '_max'] = max_val_list
    df[field + '_min'] = min_val_list
    df[field + '_recent'] = recent_val_list
    df[field + '_mean'] = mean_val_list
    df[field + '_median'] = median_val_list
    return df

def calcu_value_feature3(self, df, field):
    # Max, min, most recent value, mean, median, max-min difference, standard deviation
    # and variance of `field` over each 24-row block, repeated for every row of the block.
    # Assumes len(df) is divisible by 24 (tmp_list[i+23] fails otherwise).
    tmp_list = df[field].tolist()
    max_val_list = [max(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    min_val_list = [min(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    recent_val_list = [tmp_list[i+23] for i in range(0, len(tmp_list), 24) for j in range(24)]
    mean_val_list = [sum(tmp_list[i:i+24])/24.0 for i in range(0, len(tmp_list), 24) for j in range(24)]
    median_val_list = [self.get_median(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    std_val_list = [self.get_std(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    var_val_list = [self.get_var(tmp_list[i:i+24]) for i in range(0, len(tmp_list), 24) for j in range(24)]
    df[field + '_max'] = max_val_list
    df[field + '_min'] = min_val_list
    df[field + '_recent'] = recent_val_list
    df[field + '_mean'] = mean_val_list
    df[field + '_median'] = median_val_list
    df[field + '_max_min_diff'] = df[field + '_max'] - df[field + '_min']
    df[field + '_std'] = std_val_list
    df[field + '_var'] = var_val_list
    return df
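The per-block statistics above are built with Python list comprehensions that recompute each block 24 times. As an alternative sketch (not the author's code), the same columns can be produced with a pandas group-by over 24-row blocks. `block_stats` is a hypothetical name, and using the population standard deviation and variance (`ddof=0`) is an assumption, since `get_std` and `get_var` are not shown in the post.

```python
import numpy as np
import pandas as pd


def block_stats(df, field, block=24):
    """Add per-block summary columns for `field`, repeated on every row of the block."""
    out = df.copy()
    # Row position // block size gives one group id per 24-row block
    block_id = np.arange(len(out)) // block
    grouped = out[field].groupby(block_id)
    out[field + '_max'] = grouped.transform('max')
    out[field + '_min'] = grouped.transform('min')
    out[field + '_recent'] = grouped.transform('last')  # last row of the block
    out[field + '_mean'] = grouped.transform('mean')
    out[field + '_median'] = grouped.transform('median')
    out[field + '_max_min_diff'] = out[field + '_max'] - out[field + '_min']
    # Assumption: population statistics (ddof=0)
    out[field + '_std'] = grouped.transform(lambda s: s.std(ddof=0))
    out[field + '_var'] = grouped.transform(lambda s: s.var(ddof=0))
    return out


# Two synthetic "days" of hourly values: 0..23 and 24..47
demo = block_stats(pd.DataFrame({'humidity': np.arange(48, dtype=float)}), 'humidity')
print(demo[['humidity', 'humidity_max', 'humidity_min', 'humidity_mean']].iloc[[0, 23, 24, 47]])
```

Unlike the list-based version, this sketch does not fail on a trailing partial block; it simply summarises whatever rows the last block contains. If stations have record counts that are not multiples of 24, grouping by station as well as by block would keep blocks from straddling two stations.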
3、 Model training results
Model quality is measured with SMAPE (symmetric mean absolute percentage error). The results are:
| Factor | SMAPE |
|---|---|
| PM25 | 0.28 |
| PM10 | 0.29 |
| O3 | 0.316 |
[Note] These are the results after tuning the model parameters. For the xgb.XGBRegressor tuning process, see my articles listed below.
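The post does not say which SMAPE variant was used. The sketch below assumes the common form mean(2·|ŷ − y| / (|y| + |ŷ|)), which ranges from 0 to 2; the `smape` function name and the `eps` guard against a zero denominator are illustrative choices, not taken from the original code.

```python
import numpy as np


def smape(y_true, y_pred, eps=1e-8):
    """Symmetric mean absolute percentage error, in the range [0, 2]."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    denom = np.abs(y_true) + np.abs(y_pred)
    # eps keeps the ratio defined when both the true and predicted values are 0
    return float(np.mean(2.0 * np.abs(y_pred - y_true) / np.maximum(denom, eps)))


perfect = smape([100.0, 80.0], [100.0, 80.0])
half = smape([100.0], [50.0])      # 2 * 50 / 150
both_zero = smape([0.0], [0.0])
```

Because the denominator includes the prediction, the metric stays bounded even when the true concentration is close to zero, which plain MAPE does not.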
4、 Other references
[AI in Practice] Speeding up XGBRegressor training: training XGBRegressor in seconds on a GPU
[AI in Practice] xgb.XGBRegressor multi-output regression with MultiOutputRegressor: parameter tuning 1
5、 Summary
Feature engineering matters: as effective features were added, the SMAPE of the O3 model dropped from 0.41 to 0.31, a clear improvement.