Time series analysis of data mining [easy to understand]
2022-06-25 17:41:00 【Full stack programmer webmaster】
A time series is a set of random variables X1, X2, …, Xt arranged in chronological order, representing the evolution of a random event over time.
The purpose of time series analysis is, given an observed time series, to predict the future values of the sequence.
| Model name | Description |
|---|---|
| Smoothing method | Often used for trend analysis and prediction. Smoothing techniques weaken the effect of short-term random fluctuations on the series and make the sequence smoother. Depending on the technique used, it is divided into the moving average method and the exponential smoothing method. |
| Trend fitting method | Takes time as the independent variable and the corresponding sequence observations as the dependent variable, and builds a regression model. Depending on the characteristics of the sequence, it is divided into linear fitting and curve fitting. |
| Combination model | The change of a time series is mainly driven by four factors: the long-term trend (T), seasonal changes (S), cyclical changes (C) and irregular changes (ε). Depending on the characteristics of the sequence, an additive or a multiplicative model can be built. Additive model: x = T + S + C + ε. Multiplicative model: x = T × S × C × ε. |
| AR model | A linear regression model with the sequence values of the previous p periods as independent variables and the random variable Xt as the dependent variable. |
| MA model | The value of the random variable Xt is unrelated to the sequence values of previous periods; a linear regression model is built between Xt and the random disturbances (ε) of the previous q periods. |
| ARMA model | The value of the random variable Xt depends not only on the sequence values of the previous p periods but also on the random disturbances (ε) of the previous q periods. |
| ARIMA model | Many nonstationary sequences behave like stationary sequences after differencing; such a sequence is called a difference stationary sequence. Difference stationary sequences can be fitted with an ARIMA model. |
| ARCH model | Models the volatility of a time series variable accurately. It suits sequences that are heteroscedastic and whose heteroscedasticity function has short-term autocorrelation. |
| GARCH model and its derivative models | Known as the generalized ARCH model, an extension of the ARCH model. It better reflects properties of real sequences such as long-term memory and the asymmetry of information. |
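The two smoothing methods named in the table can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the original article: the function names, the sample values and the choice of initialising exponential smoothing with the first observation are all assumptions made here.

```python
def moving_average(series, window):
    """Simple moving average: mean of each run of `window` consecutive values."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]


def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]  # assumption: start from the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


sales = [10, 12, 11, 15, 14, 18]  # made-up illustrative values
print(moving_average(sales, 3))
print(exponential_smoothing(sales, 0.5))
```

A larger `window` or a smaller `alpha` smooths more aggressively, at the cost of reacting more slowly to genuine changes in the level of the series.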
1、 Before time series analysis, the sequence must be preprocessed, which includes a pure randomness test and a stationarity test. Depending on the test results, sequences fall into different types, and each type calls for a different analysis method.
| Sequence type | Description |
|---|---|
| Pure random sequence | Also called a white noise sequence. There is no correlation between the terms of the sequence, which fluctuates in a completely disordered, random way. A white noise sequence is a stationary sequence with no information left to extract, so the analysis can stop. |
| Stationary non-white-noise sequence | The mean and variance are constant. A linear model is usually built to fit the development of the sequence and extract the useful information. The ARMA model is the most commonly used model for fitting stationary series. |
| Nonstationary sequence | The mean and variance are not constant. The sequence is generally transformed into a stationary one and then analysed with the methods for stationary time series, such as the ARMA model. If a time series becomes stationary after differencing, it is called a difference stationary sequence and is analysed with an ARIMA model. |
(1) Pure randomness test
If the sequence is purely random, there should be no relationship between its values. In practice the sample autocorrelation coefficients of a pure random sequence are not exactly zero, but they are close to zero and fluctuate randomly around it.
The pure randomness test, also called the white noise test, is generally carried out by constructing a test statistic. The commonly used statistics are the Q statistic and the LB statistic. The test statistic is computed from the sample autocorrelation coefficients at each lag, and the corresponding p-value is then calculated. If the p-value is greater than the significance level, the null hypothesis cannot be rejected: the sequence is treated as purely random and the analysis stops.
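To make the white noise test more concrete, the sketch below computes the sample autocorrelation coefficient and the LB (Ljung-Box) statistic, LB = n(n+2) Σ r_k² / (n−k) summed over lags k = 1..m, by hand. The function names and the toy sequence are assumptions made for illustration. Under the null hypothesis of pure randomness the statistic is approximately chi-square distributed with m degrees of freedom.

```python
def autocorr(series, k):
    """Sample autocorrelation coefficient r_k at lag k."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[t] - mean) * (series[t + k] - mean) for t in range(n - k))
    return num / denom


def ljung_box(series, max_lag):
    """LB statistic: n(n+2) * sum over k=1..max_lag of r_k^2 / (n - k)."""
    n = len(series)
    return n * (n + 2) * sum(autocorr(series, k) ** 2 / (n - k)
                             for k in range(1, max_lag + 1))


# A strictly alternating toy sequence is clearly not white noise.
toy = [1, -1, 1, -1, 1, -1]
lb = ljung_box(toy, 1)
# 3.841 is the 5% critical value of the chi-square distribution with 1 degree of freedom.
print(lb, "reject white noise" if lb > 3.841 else "cannot reject white noise")
```

In practice a library routine such as `acorr_ljungbox` from statsmodels, which the article uses later, also returns the p-value directly.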
(2) Stationarity test
A time series is called stationary if it fluctuates around a constant within a bounded range, that is, it has a constant mean and a constant variance, and the autocovariance and autocorrelation coefficient between sequence variables k periods apart are the same wherever they are measured (the degree of influence between variables k periods apart does not change over time).
There are two test methods:
a. Graph tests, which judge stationarity from the features of the sequence diagram and the autocorrelation diagram. They are simple to carry out and widely used; their drawback is subjectivity.
Sequence diagram test: because the mean and variance of a stationary time series are constant, its sequence diagram shows values that always fluctuate randomly around a constant, within a bounded range.
If there is a clear trend or periodicity, the sequence is usually not stationary.
Autocorrelation diagram test: a stationary series has only short-term correlation, so only recent sequence values have a significant effect on the current value, and the further back a past value lies, the smaller its effect.
As the lag k increases, the autocorrelation coefficient of a stationary series decays quickly towards zero and then fluctuates randomly around zero, whereas the autocorrelation coefficient of a nonstationary series decays slowly.
b. Constructing a test statistic. The most commonly used method at present is the unit root test.
The unit root test checks whether the sequence has a unit root; a sequence that has a unit root is a nonstationary time series.
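As a hedged illustration of what "having a unit root" means, consider the simplest case, a first-order autoregression. The symbols φ and ε_t are notation introduced here, not taken from the article:

```latex
x_t = \phi\, x_{t-1} + \varepsilon_t
\quad\Longleftrightarrow\quad
\Delta x_t = (\phi - 1)\, x_{t-1} + \varepsilon_t,
\qquad \Delta x_t = x_t - x_{t-1}.
```

When |φ| < 1 the sequence is stationary; when φ = 1 it has a unit root and behaves like a random walk. Dickey-Fuller type tests take the unit root (φ − 1 = 0, nonstationary) as the null hypothesis, so a p-value above the significance level means nonstationarity cannot be rejected, while a small p-value supports stationarity.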
2、 Stationary time series analysis
ARMA stands for autoregressive moving average model. It can be subdivided into three types, the AR model, the MA model and the ARMA model, each of which can be regarded as a multiple linear regression model.
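For reference, the three model types can be written out as follows. This is a sketch in one common notation: the coefficient names φ_i, θ_j, the constants φ_0 and μ, and the minus sign in front of the θ terms are conventions assumed here (some texts and libraries use a plus sign instead), and ε_t denotes the zero-mean random disturbance.

```latex
\begin{aligned}
\text{AR}(p):\quad & x_t = \phi_0 + \phi_1 x_{t-1} + \dots + \phi_p x_{t-p} + \varepsilon_t \\
\text{MA}(q):\quad & x_t = \mu + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \dots - \theta_q \varepsilon_{t-q} \\
\text{ARMA}(p,q):\quad & x_t = \phi_0 + \phi_1 x_{t-1} + \dots + \phi_p x_{t-p}
  + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \dots - \theta_q \varepsilon_{t-q}
\end{aligned}
```

Setting q = 0 in ARMA(p, q) gives AR(p), and setting p = 0 gives MA(q).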
Modeling steps :
(1) Calculate the autocorrelation coefficient (ACF) and the partial autocorrelation coefficient (PACF).
(2) ARMA model identification, also called order determination: choose a suitable model based on the properties of the autocorrelation and partial autocorrelation coefficients of the AR(p), MA(q) and ARMA(p,q) models.
| Model | Autocorrelation coefficient (ACF) | Partial autocorrelation coefficient (PACF) |
|---|---|---|
| AR(p) | Tails off | Cuts off after lag p |
| MA(q) | Cuts off after lag q | Tails off |
| ARMA(p,q) | Tails off | Tails off |
(3) Estimate the unknown parameters of the model and test their significance.
(4) Model test
(5) Model optimization
(6) Model application: make short-term forecasts.
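The "tails off" versus "cuts off" behaviour used in step (2) can be checked numerically. The sketch below derives partial autocorrelations from a list of autocorrelations with the Durbin-Levinson recursion and applies it to the theoretical ACF of an AR(1) and an MA(1) process. The function name and the coefficient values 0.6 and 0.5 are assumptions chosen for illustration, and the MA(1) is written with a plus sign, x_t = ε_t + 0.5 ε_{t-1}.

```python
def pacf_from_acf(rho):
    """Partial autocorrelations phi_kk for k = 1..len(rho)-1 (Durbin-Levinson).

    `rho` is a list of autocorrelations with rho[0] == 1.
    """
    pacf = []
    prev = []  # prev[j-1] holds phi_{k-1, j}
    for k in range(1, len(rho)):
        if k == 1:
            phi_kk = rho[1]
            current = [phi_kk]
        else:
            num = rho[k] - sum(prev[j - 1] * rho[k - j] for j in range(1, k))
            den = 1 - sum(prev[j - 1] * rho[j] for j in range(1, k))
            phi_kk = num / den
            # phi_{k,j} = phi_{k-1,j} - phi_kk * phi_{k-1,k-j}
            current = [prev[j - 1] - phi_kk * prev[k - j - 1] for j in range(1, k)]
            current.append(phi_kk)
        pacf.append(phi_kk)
        prev = current
    return pacf


# AR(1) with coefficient 0.6: ACF is 0.6**k (tails off), PACF cuts off after lag 1.
ar1_acf = [0.6 ** k for k in range(6)]
ar1_pacf = pacf_from_acf(ar1_acf)
print([round(v, 4) for v in ar1_pacf])

# MA(1) with coefficient 0.5: ACF cuts off after lag 1, PACF tails off.
ma1_acf = [1.0, 0.5 / (1 + 0.5 ** 2), 0.0, 0.0, 0.0, 0.0]
ma1_pacf = pacf_from_acf(ma1_acf)
print([round(v, 4) for v in ma1_pacf])
```

With sample data the cut-off is never exact, so in practice a coefficient is treated as zero when it falls inside the confidence band drawn on the ACF or PACF plot.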
3、 Nonstationary time series analysis
In practice, most sequences found in nature are nonstationary.
The analysis methods fall into two categories:
(1) Time series analysis by deterministic factor decomposition
All changes in the sequence are attributed to the combined effect of four factors: the long-term trend, seasonal changes, cyclical changes and random changes.
The drawback is that the fluctuation caused by random factors is hard to determine and analyse, so much of the random information is wasted and the fitting accuracy of the model suffers.
(2) Stochastic time series analysis
Depending on the characteristics of the time series, the models of stochastic time series analysis include the ARIMA model, the residual autoregressive model, seasonal models, heteroscedasticity models and others.
ARIMA modeling steps:
a. Check the stationarity of the sequence.
b. Difference the original sequence, then run the stationarity and white noise tests.
c. Choose the ARIMA model.
The parameters p, d and q of the ARIMA(p, d, q) model must be specified, where d is the order of differencing.
Alternatively, use the auto.arima function in the forecast package to select the optimal ARIMA model automatically.
| Model | ACF | PACF |
|---|---|---|
| ARIMA(p,d,0) | Decays gradually to zero | Drops to zero after lag p |
| ARIMA(0,d,q) | Drops to zero after lag q | Decays gradually to zero |
| ARIMA(p,d,q) | Decays gradually to zero | Decays gradually to zero |
d. Fit the model.
e. Forecast.
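The differencing in step b is simple enough to write by hand. The sketch below, with a hypothetical helper name and made-up values, applies the difference operator d times, which is what the parameter d of ARIMA(p, d, q) counts: one pass removes a linear trend, two passes remove a quadratic one.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: x_t - x_{t-1}."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series


linear = [1, 3, 5, 7]          # linear trend
quadratic = [0, 1, 4, 9, 16]   # quadratic trend
print(difference(linear))        # constant after one difference
print(difference(quadratic, 2))  # constant after two differences
```

Each pass shortens the sequence by one observation, which is why the differenced series in the examples has one value fewer than the original.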
The following example illustrates the application of the ARIMA model.
R implementation:
1、 Read the data set
2、 Generate a time series object and test for stationarity
sales = ts(data) # Generate a time series object
plot.ts(sales, xlab="Time", ylab="Sales") # Draw the sequence diagram
acf(sales) # Draw the autocorrelation diagram
library(fUnitRoots)
unitrootTest(sales) # Unit root test
The sequence diagram shows a monotonically increasing trend.
In the autocorrelation diagram the autocorrelation coefficient stays above zero for a long time, which indicates strong long-term correlation within the sequence.
The unit root test gives a p-value significantly greater than 0.05, so the sequence is judged to be nonstationary. (A nonstationary sequence cannot be a white noise sequence.)
3、 Take the first-order difference of the original sequence and test it for stationarity.
difsales = diff(sales) # First-order difference
plot.ts(difsales, xlab="Time", ylab="Sales residual") # Draw the sequence diagram
acf(difsales) # Draw the autocorrelation diagram
unitrootTest(difsales) # Unit root test
The sequence diagram shows fairly steady fluctuation around the mean.
The autocorrelation diagram shows strong short-term correlation.
The unit root test gives a p-value below 0.05, so the sequence after first-order differencing is stationary.
4、 White noise test
Box.test(difsales, type = "Ljung-Box") # White noise test
The p-value is significantly less than 0.05, so the sequence after first-order differencing is a stationary non-white-noise sequence.
5、 Fit the ARIMA model
The first method:
pacf(difsales) # Draw the partial autocorrelation diagram
Based on the partial autocorrelation diagram and the ARIMA selection table above, the ARIMA(1,1,0) model is selected.
fit = arima(sales, order=c(1,1,0)) # ARIMA(1,1,0) model
The second method:
library(forecast) # auto.arima comes from the forecast package
auto.arima(CWD) # Another example, using a different data set (CWD)
6、 Model test
Once the model is determined, check whether the residuals are white noise. If they are not, useful information remains in the residuals, and the model parameters need to be modified to extract it.
fit = arima(CWD, order = c(0,1,1)) # Another example
r3 = fit$residuals
Box.test(r3, type = "Ljung-Box")
7、 Forecast
library(forecast)
forecast(fit, 5) # Predict the sequence values of the next 5 periods
plot(forecast(fit, 5)) # Draw the forecast chart; the shaded areas are the 80% and 95% confidence intervals
8、 Model evaluation
Three statistical indicators measure the prediction accuracy of the model: mean absolute error, root mean square error and mean absolute percentage error. Each reflects the prediction accuracy from a different angle.
# pre holds the predicted values, real the observed values
mae = mean(abs(pre-real)) # Mean absolute error
rmse = sqrt(mean((pre-real)^2)) # Root mean square error (square root of the mean squared error)
mape = mean(abs(pre-real)/real) # Mean absolute percentage error
In light of the actual business context, choose an error threshold (for example 1.5) and judge the prediction accuracy of the model against it.
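The same three indicators can be written in plain Python. This is a minimal sketch with made-up predicted and observed values; the function names are chosen here for illustration. Note that RMSE takes the square root of the mean squared error, and that MAPE is undefined when an observed value is zero.

```python
import math


def mae(pred, real):
    """Mean absolute error."""
    return sum(abs(p - r) for p, r in zip(pred, real)) / len(real)


def rmse(pred, real):
    """Root mean square error: square root of the mean squared error."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, real)) / len(real))


def mape(pred, real):
    """Mean absolute percentage error, as a fraction (requires non-zero observations)."""
    return sum(abs(p - r) / abs(r) for p, r in zip(pred, real)) / len(real)


pred = [2.0, 4.0, 6.0]  # made-up predictions
real = [1.0, 4.0, 8.0]  # made-up observations
print(mae(pred, real), rmse(pred, real), mape(pred, real))
```

MAE and RMSE are in the units of the data, with RMSE penalising large errors more heavily, while MAPE is scale-free and therefore easier to compare across series.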
Python implementation:
# ARIMA time series model
import pandas as pd
forecastnum = 5
# Column names 'date' and 'sales' must match the header row of the spreadsheet
data = pd.read_excel("arima_data.xls", index_col='date') # pandas automatically recognizes the 'date' column as datetime
# Sequence diagram
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei'] # Font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False # Render the minus sign correctly
data.plot()
plt.show()
# Autocorrelation diagram
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(data).show()
# ADF unit root test
from statsmodels.tsa.stattools import adfuller as ADF
print('ADF test result for the original sequence:', ADF(data['sales']))
# The return values are, in order: adf, pvalue, usedlag, nobs, critical values, icbest
The test shows that the time series is nonstationary, so differencing is required.
D_data = data.diff().dropna() # First-order difference
D_data.columns = ['Sales difference']
D_data.plot() # Sequence diagram
plt.show()
plot_acf(D_data).show() # Autocorrelation diagram
from statsmodels.graphics.tsaplots import plot_pacf
plot_pacf(D_data).show() # Partial autocorrelation diagram
print('ADF test result for the differenced sequence:', ADF(D_data['Sales difference']))
The differenced sequence is now stationary.
# White noise test
from statsmodels.stats.diagnostic import acorr_ljungbox
print('White noise test result for the differenced sequence:', acorr_ljungbox(D_data, lags=1)) # Returns the statistic and the p-value
The p-value is below the significance level, so the differenced sequence is not white noise.
# The older statsmodels.tsa.arima_model module has been removed from recent statsmodels releases
from statsmodels.tsa.arima.model import ARIMA
data['sales'] = data['sales'].astype(float)
# Order determination
pmax = int(len(D_data)/10) # As a rule of thumb the order does not exceed length/10
qmax = int(len(D_data)/10)
bic_matrix = [] # BIC matrix
for p in range(pmax+1):
    tmp = []
    for q in range(qmax+1):
        try: # Some (p, q) combinations fail to fit
            tmp.append(ARIMA(data, order=(p,1,q)).fit().bic)
        except Exception:
            tmp.append(None)
    bic_matrix.append(tmp)
bic_matrix = pd.DataFrame(bic_matrix).astype(float) # Failed fits become NaN
p,q = bic_matrix.stack().idxmin() # Flatten with stack, then find the position of the minimum with idxmin
print('The p and q values with the smallest BIC are: %s, %s' % (p,q))
model = ARIMA(data, order=(p,1,q)).fit() # Fit ARIMA(p,1,q); the original run selected ARIMA(0,1,1)
print(model.summary()) # Model report
fc = model.get_forecast(5) # Forecast 5 days ahead
print(fc.predicted_mean, fc.se_mean, fc.conf_int()) # Forecast values, standard errors, confidence intervals
Publisher: Full stack programmer webmaster. When reprinting, please cite the source: https://javaforall.cn/151153.html Original link: https://javaforall.cn