Information leakage and computational complexity of EMD-like methods in time series prediction
2022-06-09 22:31:00 【Cyril_KI】
I. Preface
Nowadays, many time series prediction papers follow the routine of EMD decomposition + LSTM, and report good results. However, almost none of them describes its data processing pipeline in detail, or how the method could be deployed in a real application; they stay at the theoretical level. This article therefore discusses how the data should be processed when we use EMD-like methods.
II. Information leakage and computational cost
The data processing in most papers goes like this: first, all the data is decomposed with EMD into multiple components. Then each component is split into a training set and a test set. For each component, a model is trained on its training set and then evaluated on its test set. To recover the ground truth / prediction for the original data, the ground truths / predictions of all components are simply summed.
In ordinary time series prediction, however, data is processed like this: first, all the data is split into a training set and a test set; then the training set is normalized and feature-engineered, and the model is trained. This way the training set and the test set are kept separate: the training data is independent of the test data. Note in particular that when normalizing the test set, we use the minimum/maximum from the training set. Looking back at EMD, we decompose all the data first and split it afterwards, which creates a problem: the training set contains future data! In short, computing each component of the training set used data from the test set, even though the test data should be unknown during training. This is the information leakage problem.
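To make the leakage-free order of operations concrete, here is a minimal sketch of the ordinary (non-EMD) pipeline described above. The toy `series` and the min-max `scale` helper are illustrative, not from the original article:

```python
# Sketch: split FIRST, then fit preprocessing on the training set only.
series = [float(i % 7) for i in range(100)]  # toy data

# 1. Split before any preprocessing, so the test set never
#    influences the statistics used for normalization.
train, test = series[:80], series[80:]

# 2. Fit min/max on the training set only.
lo, hi = min(train), max(train)

def scale(x, lo=lo, hi=hi):
    # Test data is scaled with TRAIN min/max; its scaled values may
    # fall outside [0, 1], which is expected and leakage-free.
    return (x - lo) / (hi - lo)

train_scaled = [scale(x) for x in train]
test_scaled = [scale(x) for x in test]
```

Decomposing the full series with EMD before this split violates exactly this principle: every training component would already contain information from the test period.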
Besides, EMD has the property that the number of decomposed components depends heavily on the distribution and length of the sequence, which causes further trouble when the model is actually deployed. Suppose we use the previous 24 time steps to predict the next 12, and during training the data was decomposed into 10 components, so we trained 10 component models. After training, the model goes into production: to predict the next 12 hours we only need the latest 24 data points, and the key question is: how do we decompose these 24 points into 10 components? Only if the 24 points decompose into 10 components can we use the 10 trained models. In reality this decomposition is almost impossible: the data used for training is hundreds or tens of thousands of points long, and while such long data may decompose into 10 components, 24 points almost certainly will not.
Assume the time series used for training and testing is D with length T+k: the first T points are for training, the last k for testing, and we use the previous 24 points to predict the next 12. For the two problems above, there are generally the following three solutions.
2.1 Repeated training + repeated prediction
To obtain each prediction on the test set, we can proceed as follows:
- Apply EMD to the first T points to get multiple components, then train one model per component.
- After training, feed the components of D[T-23:T] into the corresponding component models to get 12 outputs each, then sum them across components to obtain the predicted values for D[T+1:T+12].
- Apply EMD to D[13:T+12], again obtaining multiple components, and train one model per component. Then use the components of the 24 points D[T-11:T+12] to get 12 outputs from each component model; sum them to obtain the 12 predicted values for D[T+13:T+24].
- Repeat the steps above, sliding the decomposition window by 12 points each time, decomposing the data in the window, training models, and predicting, until all predicted values for D[T+1:T+k] are obtained.
- Compute the evaluation metrics from the predicted and true values on D[T+1:T+k].
Although this method causes no information leakage, it is computationally very expensive. Its biggest advantage, however, is that the number of IMFs does not need to be the same across windows: if decomposing D[1:T] yields 10 components, then every point of D[T-23:T] also has 10 components and can be predicted directly; if decomposing D[13:T+12] yields 12 components, then every point of D[T-11:T+12] also has 12 components and can likewise be predicted directly.
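The retrain-every-window loop above can be sketched as follows. `decompose`, `train_model`, and the toy data are hypothetical placeholders: in practice `decompose` would be an EMD implementation (e.g. PyEMD) and each component model an LSTM.

```python
LOOKBACK, HORIZON = 24, 12

def decompose(window):
    # Placeholder: pretend every window splits into 2 components
    # whose sum reconstructs the original signal (real code: EMD).
    half = [v / 2 for v in window]
    return [half, half]

def train_model(component):
    # Toy "model": always predicts the last training value, repeated.
    last = component[-1]
    return lambda x: [last] * HORIZON

def rolling_forecast(D, T, k):
    preds = []
    start = 0
    while len(preds) < k:
        window = D[start:T + len(preds)]          # length-T window
        comps = decompose(window)                 # re-decompose each round
        models = [train_model(c) for c in comps]  # retrain each round
        step = [0.0] * HORIZON
        for c, m in zip(comps, models):
            out = m(c[-LOOKBACK:])                # predict per component
            step = [a + b for a, b in zip(step, out)]
        preds.extend(step)                        # superimpose components
        start += HORIZON                          # slide window by 12
    return preds[:k]
```

Every prediction round re-decomposes and retrains from scratch, which is exactly why this scheme is leakage-free but expensive.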
2.2 Single training + repeated decomposition and prediction
The main difference between method 2.2 and method 2.1 is that training happens only once:
- First decompose D[1:T] into components and train one model per component.
- At prediction time, first use the decomposition of D[1:T] (its last 24 points, D[T-23:T]) to obtain the predictions for D[T+1:T+12]; then decompose D[13:T+12] and use D[T-11:T+12] to obtain the predictions for D[T+13:T+24].
- Continue predicting in this rolling fashion until the end of the test set.
In other words, each time we decompose the latest window whose length matches the training set, and then predict on it. This method causes no information leakage, but it is also computationally expensive. Moreover, it has a fatal flaw: every decomposition of T points must yield the same number of IMFs, otherwise the trained models cannot be used.
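A minimal sketch of this scheme, again with hypothetical placeholders for the decomposition and the models (real code: EMD per window, one LSTM per component), including the IMF-count check that makes the fatal flaw explicit:

```python
LOOKBACK, HORIZON = 24, 12
N_COMPONENTS = 2

def decompose(window, n=N_COMPONENTS):
    # Placeholder split into n equal components (real code: EMD).
    return [[v / n for v in window] for _ in range(n)]

def train_model(component):
    # Toy "model": predicts the last input value, repeated.
    return lambda x: [x[-1]] * HORIZON

def forecast(D, T, k):
    comps = decompose(D[:T])
    models = [train_model(c) for c in comps]   # trained ONCE
    preds = []
    start = 0
    while len(preds) < k:
        window = D[start:start + T]            # latest length-T window
        comps_now = decompose(window)
        if len(comps_now) != len(models):      # the fatal requirement
            raise ValueError("IMF count changed; trained models unusable")
        step = [0.0] * HORIZON
        for c, m in zip(comps_now, models):
            step = [a + b for a, b in zip(step, m(c[-LOOKBACK:]))]
        preds.extend(step)
        start += HORIZON
    return preds[:k]
```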
2.3 Constructing samples by sliding decomposition
The specific steps:
- Decompose D[1:24] to get a component set X, then decompose D[13:36] to get a component set Y, and construct one sample (x, y) per component, where x is the component of the first window (24 points) and y is the last 12 points of the corresponding component of the second window (positions 25:36).
- Then decompose D[13:36] and D[25:48]: the components of D[13:36] give x, and positions 37:48 give y, again generating one sample per component.
- Repeat, sliding both windows by 12 points each time, until the end of the training set, to obtain all training samples.
- Use the samples from the previous three steps to train the component models.
- At prediction time, decompose the latest 24 points each time, then predict per component and sum.
This method also causes no information leakage, but again at great computational cost. It shares the flaw of method 2.2: every decomposition of 24 points must yield the same number of IMFs, otherwise training fails, and the trained models cannot be used either.
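The sample construction above can be sketched like this. `decompose` is again a hypothetical stand-in for EMD, assumed here to always return `N_COMPONENTS` components; real EMD would need the consistency check discussed above.

```python
WINDOW_X, HORIZON = 24, 12
N_COMPONENTS = 2

def decompose(window, n=N_COMPONENTS):
    # Placeholder split into n equal components (real code: EMD).
    return [[v / n for v in window] for _ in range(n)]

def build_samples(train):
    # samples[c] collects the (x, y) pairs for component c.
    samples = [[] for _ in range(N_COMPONENTS)]
    i = 0
    while i + WINDOW_X + HORIZON <= len(train):
        xs = decompose(train[i:i + WINDOW_X])                      # e.g. D[1:24]
        ys = decompose(train[i + HORIZON:i + HORIZON + WINDOW_X])  # e.g. D[13:36]
        for c in range(N_COMPONENTS):
            # x: full first window; y: last 12 points of second window.
            samples[c].append((xs[c], ys[c][-HORIZON:]))
        i += HORIZON                                               # slide both by 12
    return samples
```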
2.4 Summary
- Method 1 has no loopholes, but its computational cost is the largest of the three.
- Methods 2 and 3 both suffer from the uncertain number of components.
- For the problems of the last two methods, some articles propose a workaround. Anyone who has used PyEMD knows that a maximum number of components can be set for a decomposition, although the actual number will never exceed what the signal allows. We can therefore fix a target number of components in advance, usually a small value, and decompose to that number each time. There is still an extreme case: a particular window may not reach even that small number. In that case we can append zeros to the data and decompose again, continuing to append zeros until the target number is reached. Since the appended zeros decompose into components that sum back to 0 (or a little noise), they have little effect on the final result. Note, however, that this zero-padding must not be used during model training; it is only acceptable at prediction time.
This improved workaround remains to be tested!
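The zero-padding trick can be sketched as follows. The `decompose` stand-in mimics EMD's length-dependent component count; with PyEMD one would instead call the decomposition with a capped number of IMFs (treat the exact API as something to verify against the PyEMD documentation).

```python
def decompose(window):
    # Placeholder whose component count grows with window length,
    # mimicking EMD's length-dependent behaviour (real code: EMD).
    n = max(1, len(window) // 16)
    return [[v / n for v in window] for _ in range(n)]

def pad_to_target(window, target, max_pad=1000):
    # Append zeros (PREDICTION time only!) until the decomposition
    # yields at least `target` components.
    padded = list(window)
    while len(decompose(padded)) < target and max_pad > 0:
        padded.append(0.0)
        max_pad -= 1
    comps = decompose(padded)
    if len(comps) < target:
        raise ValueError("could not reach target IMF count")
    return comps
```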