Time series analysis of data mining [easy to understand]

2022-06-25 17:41:00 【Full stack programmer webmaster】

Hello everyone, nice to see you again. I'm your friend Quan Jun.

A time series is a set of random variables X1, X2, …, Xt arranged in chronological order, representing a sequence of random events over time.

The goal of time series analysis is, given an observed time series, to predict the future values of the sequence.

Model name	Description
Smoothing method	Often used for trend analysis and forecasting. Smoothing weakens the effect of short-term random fluctuations so that the sequence becomes smoother. Depending on the technique used, it is divided into the moving average method and the exponential smoothing method.
Trend fitting method	Takes time as the independent variable and the corresponding sequence observations as the dependent variable to build a regression model. Depending on the characteristics of the sequence, it is divided into linear fitting and curve fitting.
Combination model	A time series is mainly driven by four factors: long-term trend (T), seasonal variation (S), cyclical variation (C), and irregular variation (ε). Depending on the characteristics of the sequence, either an additive or a multiplicative model can be built. Additive model: x = T + S + C + ε. Multiplicative model: x = T × S × C × ε.
AR model	A linear regression model with the sequence values of the previous p periods as independent variables and the random variable Xt as the dependent variable.
MA model	The value of the random variable Xt does not depend on the sequence values of previous periods; a linear regression model is built between Xt and the random disturbances (ε) of the previous q periods.
ARMA model	The value of the random variable Xt depends both on the sequence values of the previous p periods and on the random disturbances (ε) of the previous q periods.
ARIMA model	Many nonstationary sequences become stationary after differencing; such a sequence is called a difference-stationary sequence. Difference-stationary sequences can be fitted with an ARIMA model.
ARCH model	Can model the volatility of a time series variable accurately. It is suitable for sequences that are heteroscedastic and whose heteroscedasticity function shows short-term autocorrelation.
GARCH model and its derivatives	Known as the generalized ARCH model, an extension of ARCH. It better captures features of real sequences such as long-term memory and asymmetric information.
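
As a small illustration of the smoothing methods in the first row of the table, here is a minimal sketch of a simple moving average and simple exponential smoothing in plain Python. The function names, the window size, the smoothing factor `alpha`, and the sample numbers are all assumptions made for the example; they are not taken from the original article.

```python
def moving_average(series, window):
    """Simple moving average: mean of each run of `window` consecutive values."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]  # start from the first observation (one common choice)
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


sales = [10, 12, 11, 13, 15, 14, 16]  # made-up numbers, for illustration only
ma3 = moving_average(sales, window=3)
ses = exponential_smoothing(sales, alpha=0.5)
print(ma3)
print(ses)
```

A larger window (or a smaller `alpha`) smooths more strongly but reacts more slowly to real changes in the level of the sequence.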

1、 Before time series analysis, the sequence must be preprocessed, which includes a pure randomness test and a stationarity test. Based on the test results, sequences are divided into different types, and each type calls for a different analysis method.

Pure random sequence	Also called a white noise sequence. The items of the sequence are uncorrelated, and the sequence fluctuates in a completely disordered, random way. A white noise sequence is a stationary sequence with no information left to extract, so the analysis can stop.
Stationary non-white-noise sequence	Mean and variance are constant. A linear model is usually fitted to describe how the sequence develops and to extract useful information. The ARMA model is the most commonly used model for fitting stationary sequences.
Nonstationary sequence	Mean and variance are not constant. The sequence is generally transformed into a stationary one and then analyzed with stationary time series methods such as the ARMA model. If the sequence becomes stationary after differencing, it is called a difference-stationary sequence and can be analyzed with an ARIMA model.

（1） Pure randomness test

If a sequence is purely random, there should be no relationship between its values. In practice, the sample autocorrelation coefficients of a purely random sequence are not exactly zero, but they are close to zero and fluctuate randomly around it.

The pure randomness test, also called the white noise test, is usually carried out by constructing a test statistic. Commonly used statistics are the Q statistic and the LB (Ljung-Box) statistic. The statistic is computed from the sample autocorrelation coefficients at each lag, and the corresponding p-value is then calculated. If the p-value is greater than the significance level, the null hypothesis cannot be rejected: the sequence is treated as purely random and the analysis stops.
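
To make the test concrete, the sketch below computes the Ljung-Box statistic, LB = n(n+2) Σ r_k² / (n − k) for k = 1..m, in plain Python. Instead of a p-value it compares the statistic with the upper 5% critical value of the chi-square distribution with m degrees of freedom, which leads to the same decision at the 0.05 level. The function names and the sample series are assumptions made for illustration.

```python
def autocorr(series, lag):
    """Sample autocorrelation coefficient r_k at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return num / denom


def ljung_box(series, max_lag):
    """Ljung-Box statistic over lags 1..max_lag."""
    n = len(series)
    return n * (n + 2) * sum(
        autocorr(series, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )


# Upper 5% critical values of the chi-square distribution, keyed by degrees of freedom.
CHI2_CRIT_95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}

trend = [float(t) for t in range(30)]  # a strongly autocorrelated toy series
lb = ljung_box(trend, max_lag=3)
is_white_noise = lb <= CHI2_CRIT_95[3]  # True means "cannot reject pure randomness"
print(lb, is_white_noise)
```

For this trending series the statistic is far above the critical value, so the white noise hypothesis is rejected and the analysis would continue.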

（2） Stationarity test

If a time series fluctuates around a constant within a bounded range, has a constant mean and a constant variance, and the autocovariance and autocorrelation coefficient between values k periods apart depend only on the lag k (that is, the degree of influence between values k periods apart is the same everywhere in the sequence), the time series is called a stationary sequence.
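
Written as formulas, the definition above corresponds to the usual notion of weak stationarity. The symbols μ, σ², γ_k, and ρ_k are notation chosen here for illustration and do not appear in the original text.

```latex
\begin{aligned}
E(X_t) &= \mu && \text{for all } t, \\
\mathrm{Var}(X_t) &= \sigma^2 && \text{for all } t, \\
\mathrm{Cov}(X_t, X_{t+k}) &= \gamma_k && \text{for all } t \text{ (depends only on the lag } k\text{)},
\end{aligned}
\qquad
\rho_k = \frac{\gamma_k}{\gamma_0}.
```

Here ρ_k is the lag-k autocorrelation coefficient, and γ_0 = σ².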

There are two test methods:

a. The graph test, which judges stationarity from the characteristics of the sequence diagram and the autocorrelation diagram. It is simple to carry out and widely used, but its drawback is that it is subjective.

Sequence diagram test: because the mean and variance of a stationary time series are constant, its sequence diagram shows values that always fluctuate randomly around a constant, within a bounded range.

A sequence with a clear trend or periodicity is usually not stationary.

Autocorrelation diagram test: a stationary sequence has only short-term correlation. Only recent values have a significant influence on the current value, and the further back a past value lies, the smaller its influence.

As the lag k increases, the autocorrelation coefficient of a stationary sequence decays quickly toward zero and then fluctuates randomly around it, whereas the autocorrelation coefficient of a nonstationary sequence decays slowly.
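
The difference in decay speed can be seen on simulated data. The sketch below is an illustration under assumptions: it generates a stationary AR(1)-style series with coefficient 0.5 and a second series that adds a linear trend to it, then compares their sample autocorrelation coefficients. The series, the seed, and the function name `acf` are made up for the example.

```python
import random


def acf(series, max_lag):
    """Sample autocorrelation coefficients for lags 1..max_lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    return [
        sum((series[t] - mean) * (series[t + k] - mean) for t in range(n - k)) / denom
        for k in range(1, max_lag + 1)
    ]


random.seed(0)
n = 500

# Stationary series: each value is half the previous value plus random noise.
stationary = [0.0]
for _ in range(n - 1):
    stationary.append(0.5 * stationary[-1] + random.gauss(0, 1))

# Nonstationary series: the same values with a linear trend added.
trending = [0.1 * t + stationary[t] for t in range(n)]

acf_stationary = acf(stationary, 10)
acf_trending = acf(trending, 10)
print("lag 10, stationary:", round(acf_stationary[-1], 3))
print("lag 10, trending:  ", round(acf_trending[-1], 3))
```

At lag 10 the stationary series has an autocorrelation close to zero, while the trending series is still strongly autocorrelated.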

b. Constructing a test statistic. The most commonly used method at present is the unit root test.

The unit root test checks whether the sequence has a unit root, because a sequence with a unit root is a nonstationary time series.

2、 Stationary time series analysis

ARMA stands for autoregressive moving average model. It can be subdivided into three types, the AR model, the MA model, and the ARMA model, each of which can be regarded as a multiple linear regression model.
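
For reference, the three model types are commonly written as follows. The coefficient symbols φ_i and θ_j, the constants φ_0 and μ, and the sign convention on the moving average terms are assumptions following one common textbook notation; other references use plus signs for the θ terms.

```latex
\begin{aligned}
\text{AR}(p):\quad & x_t = \phi_0 + \phi_1 x_{t-1} + \phi_2 x_{t-2} + \dots + \phi_p x_{t-p} + \varepsilon_t \\
\text{MA}(q):\quad & x_t = \mu + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2} - \dots - \theta_q \varepsilon_{t-q} \\
\text{ARMA}(p,q):\quad & x_t = \phi_0 + \sum_{i=1}^{p} \phi_i x_{t-i} + \varepsilon_t - \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}
\end{aligned}
```

In each case ε_t is a white noise disturbance, which is why the models resemble linear regressions on lagged values and lagged disturbances.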

Modeling steps ：

（1） Calculate the autocorrelation coefficient （ACF） And partial autocorrelation coefficient （PACF）

（2） ARMA model identification, also called order determination: choose a suitable model based on the behavior of the autocorrelation and partial autocorrelation coefficients of the AR(p), MA(q), and ARMA(p,q) models.

Model	Autocorrelation coefficient (ACF)	Partial autocorrelation coefficient (PACF)
AR(p)	Tails off	Cuts off after lag p
MA(q)	Cuts off after lag q	Tails off
ARMA(p,q)	Tails off	Tails off
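
The cut-off and tail-off patterns in the table can be checked numerically. The sketch below converts a list of autocorrelation coefficients into partial autocorrelation coefficients with the Durbin-Levinson recursion, then applies it to the theoretical ACF of an AR(1) process (ρ_k = 0.6^k) and of an MA(1) process whose only nonzero autocorrelation is ρ_1 = −0.4. The function name and the two example processes are assumptions chosen for illustration.

```python
def pacf_from_acf(rho):
    """Partial autocorrelations from autocorrelations (rho[0] is lag 1),
    computed with the Durbin-Levinson recursion."""
    pacf = []
    prev = []  # coefficients phi_{k-1,1} .. phi_{k-1,k-1} from the previous step
    for k in range(1, len(rho) + 1):
        if k == 1:
            phi_kk = rho[0]
            current = [phi_kk]
        else:
            num = rho[k - 1] - sum(prev[j] * rho[k - 2 - j] for j in range(k - 1))
            den = 1 - sum(prev[j] * rho[j] for j in range(k - 1))
            phi_kk = num / den
            current = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
            current.append(phi_kk)
        pacf.append(phi_kk)
        prev = current
    return pacf


# AR(1) with coefficient 0.6: the ACF tails off geometrically.
acf_ar1 = [0.6 ** k for k in range(1, 6)]
pacf_ar1 = pacf_from_acf(acf_ar1)

# MA(1) with lag-1 autocorrelation -0.4: the ACF cuts off after lag 1.
acf_ma1 = [-0.4, 0.0, 0.0, 0.0, 0.0]
pacf_ma1 = pacf_from_acf(acf_ma1)

print([round(v, 4) for v in pacf_ar1])  # nonzero at lag 1 only: PACF cuts off
print([round(v, 4) for v in pacf_ma1])  # shrinking but nonzero: PACF tails off
```

With sample data the coefficients are only estimates, so in practice "zero" means "inside the confidence band of the ACF or PACF plot".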

（3） Estimate the unknown parameters of the model and test the parameters

（4） Model test

（5） Model optimization

（6） Model application: make short-term forecasts.

3、 Nonstationary time series analysis

In practice, most sequences found in nature are nonstationary.

The analysis methods fall into two categories:

（1） Time series analysis by deterministic factor decomposition

All changes in the sequence are attributed to the combined effect of four factors: long-term trend, seasonal variation, cyclical variation, and random variation.

The drawback is that fluctuations caused by random factors are hard to determine and analyze, so a lot of the random information is wasted and the fitting accuracy of the model is often unsatisfactory.

（2） Stochastic time series analysis

Depending on the characteristics of the time series, the models used in stochastic time series analysis include the ARIMA model, the residual autoregressive model, seasonal models, heteroscedasticity models, and others.

ARIMA modeling steps:

a. Check the stationarity of the sequence

b. Difference the original sequence, then run the stationarity and white noise tests on the result

c. Choose the ARIMA model

The parameters p, d, and q of the ARIMA(p, d, q) model must be specified, where d is the order of differencing.

Alternatively, the auto.arima function in the forecast package (R) can select the best ARIMA model automatically.

Model	ACF	PACF
ARIMA(p,d,0)	Decays gradually to zero	Drops to zero after lag p
ARIMA(0,d,q)	Drops to zero after lag q	Decays gradually to zero
ARIMA(p,d,q)	Decays gradually to zero	Decays gradually to zero

d. Fit the model

e. Forecast
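
Step b above relies on differencing, which can be sketched in a few lines of plain Python. The helper name `difference` and the two toy series are assumptions made for illustration: a linear trend becomes constant after one difference (d = 1), and a quadratic trend after two (d = 2).

```python
def difference(series, d=1):
    """Apply first-order differencing d times: x_t - x_{t-1}."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series


linear = [3 + 2 * t for t in range(8)]   # 3, 5, 7, ...
quadratic = [t * t for t in range(8)]    # 0, 1, 4, 9, ...

linear_d1 = difference(linear, d=1)
quadratic_d2 = difference(quadratic, d=2)
print(linear_d1)     # constant after one difference
print(quadratic_d2)  # constant after two differences
```

Each round of differencing shortens the sequence by one observation, and differencing more often than needed adds noise, so d is usually kept as small as the stationarity test allows.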

The following example illustrates the application of the ARIMA model.

R implementation:

1、 Read the data set

2、 Generate a time series object and test for stationarity

sales = ts(data) # Generate the time series object

plot.ts(sales, xlab = "Time", ylab = "Sales") # Draw the sequence diagram

acf(sales) # Draw the autocorrelation diagram

library(fUnitRoots)
unitrootTest(sales) # Unit root test

The sequence diagram shows a monotonically increasing trend.

In the autocorrelation diagram, the autocorrelation coefficient stays above zero over many lags, which indicates strong long-term correlation in the sequence.

The unit root test gives a p-value well above 0.05, so the sequence is judged to be nonstationary. (A nonstationary sequence cannot be a white noise sequence.)

3、 Take the first-order difference of the original sequence and test it for stationarity.

difsales = diff(sales) # First-order difference

plot.ts(difsales, xlab = "Time", ylab = "Differenced sales") # Draw the sequence diagram

acf(difsales) # Draw the autocorrelation diagram

unitrootTest(difsales) # Unit root test

The sequence diagram of the differenced sequence fluctuates fairly steadily around its mean.

The autocorrelation diagram shows strong short-term correlation.

The unit root test gives a p-value below 0.05, so the sequence after first-order differencing is stationary.

4、 White noise test

Box.test(difsales, type = "Ljung-Box") # White noise test

The p-value is well below 0.05, so the sequence after first-order differencing is a stationary non-white-noise sequence.

5、 Fit the ARIMA model

Method 1:

pacf(difsales) # Draw the partial autocorrelation diagram

Based on the partial autocorrelation diagram and the ACF/PACF selection table above (Table 4), the ARIMA(1,1,0) model is selected.

fit = arima(sales, order = c(1,1,0)) # ARIMA(1,1,0) model

Method 2:

library(forecast)
auto.arima(CWD) # Automatic model selection, shown on another example series (CWD)

6、 Model test

Once the model is chosen, check whether its residuals are white noise. If they are not, the residuals still contain useful information, and the model parameters should be revised to extract it.

fit = arima(CWD, order = c(0,1,1)) # Another example

r3 = fit$residuals # Model residuals

Box.test(r3, type = "Ljung-Box") # White noise test on the residuals

7、 Forecast

library(forecast)
forecast(fit, 5) # Predict the sequence values for the next 5 periods

plot(forecast(fit, 5)) # Draw the forecast chart; the shaded bands are the 80% and 95% prediction intervals

8、 Model evaluation

Three statistical indicators are used to measure the prediction accuracy of the model: mean absolute error, root mean square error, and mean absolute percentage error. They reflect the prediction accuracy from different angles. In the code below, pre holds the predicted values and real the actual values.

mae = mean(abs(pre-real)) # Mean absolute error

rmse = sqrt(mean((pre-real)^2)) # Root mean square error

mape = mean(abs(pre-real)/real) # Mean absolute percentage error

Based on the actual business context, set an error threshold (for example, 1.5) and use it to judge whether the model's prediction accuracy is acceptable.
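
The same three indicators can be written in plain Python. This is a minimal sketch; the function names and the sample predicted and actual values are made up for illustration.

```python
import math


def mae(pred, real):
    """Mean absolute error."""
    return sum(abs(p - r) for p, r in zip(pred, real)) / len(real)


def rmse(pred, real):
    """Root mean square error: square root of the mean squared error."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, real)) / len(real))


def mape(pred, real):
    """Mean absolute percentage error, as a fraction (real values must be nonzero)."""
    return sum(abs(p - r) / abs(r) for p, r in zip(pred, real)) / len(real)


pred = [102.0, 98.0, 105.0, 110.0]   # hypothetical forecasts
real = [100.0, 100.0, 100.0, 100.0]  # hypothetical actual values

print(mae(pred, real), rmse(pred, real), mape(pred, real))
```

RMSE penalizes large errors more heavily than MAE, and MAPE is scale-free, which is why the three are usually reported together.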

Python implementation:

# ARIMA time series model
import pandas as pd

forecastnum = 5
# The column names 'date' and 'sales' are translated headers; adjust them to match the spreadsheet.
data = pd.read_excel("arima_data.xls", index_col='date') # the 'date' column becomes the index and is parsed as datetime when the cells hold dates

# Sequence diagram
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei'] # font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False # render the minus sign correctly
data.plot()
plt.show()

# Autocorrelation diagram
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(data)
plt.show()

# ADF unit root test
from statsmodels.tsa.stattools import adfuller as ADF
print('ADF test result for the original sequence:', ADF(data['sales']))
# The returned tuple starts with the ADF statistic and the p-value

From the output it can be concluded that the time series is nonstationary, so differencing is required.

D_data = data.diff().dropna() # First-order difference
D_data.columns = ['sales difference']
D_data.plot() # Sequence diagram
plt.show()

plot_acf(D_data) # Autocorrelation diagram
plt.show()

from statsmodels.graphics.tsaplots import plot_pacf
plot_pacf(D_data) # Partial autocorrelation diagram
plt.show()

print('ADF test result for the differenced sequence:', ADF(D_data['sales difference']))

The output shows that the differenced sequence is stationary.

# White noise test
from statsmodels.stats.diagnostic import acorr_ljungbox
print('White noise test result for the differenced sequence:', acorr_ljungbox(D_data['sales difference'], lags=1)) # Returns the test statistic and the p-value

The p-value is below the significance level, so the differenced sequence is not white noise.

from statsmodels.tsa.arima.model import ARIMA # current statsmodels API; the older statsmodels.tsa.arima_model module has been removed
data['sales'] = data['sales'].astype(float)
# Order determination
pmax = int(len(D_data)/10) # As a rule of thumb, the order does not exceed length/10
qmax = int(len(D_data)/10)
bic_matrix = [] # BIC matrix
for p in range(pmax+1):
    tmp = []
    for q in range(qmax+1):
        try: # Some (p, q) combinations fail to fit
            tmp.append(ARIMA(data, order=(p,1,q)).fit().bic)
        except Exception:
            tmp.append(float('nan'))
    bic_matrix.append(tmp)

bic_matrix = pd.DataFrame(bic_matrix)
p,q = bic_matrix.stack().idxmin() # stack flattens the matrix; idxmin locates the minimum
print('The p and q values with the smallest BIC are: %s, %s' % (p,q))

model = ARIMA(data, order=(p,1,q)).fit() # Fit the selected model; ARIMA(0,1,1) in the original run
print(model.summary()) # Model report

print(model.get_forecast(steps=forecastnum).summary_frame()) # Forecast 5 days: point forecast, standard error, and confidence interval

Publisher: Full stack programmer webmaster. Please credit the source when reprinting: https://javaforall.cn/151153.html Original link: https://javaforall.cn
