Summary of preprocessing methods for time series data
2022-07-29 00:50:00 【I love Python data mining】
Time series data is everywhere, and any time series analysis must begin with preprocessing. The preprocessing techniques applied have a significant impact on the accuracy of the resulting models.
In this article, we will mainly discuss the following points:
Definition and importance of time series data.
Preprocessing steps for time series data.
Structuring time series data, finding missing values, denoising features, and finding outliers in the dataset.
First, let's define what a time series is:
A time series is a sequence of evenly spaced observations recorded over a specific time interval.
An example of a time series is the price of gold. In this case, each observation is the gold price collected after a fixed time interval. The unit of time can be minutes, hours, days, years, and so on, but the time difference between any two consecutive samples is the same.
In this article, we will look at the common preprocessing steps that should be performed, and the common problems with time series data that should be addressed, before moving on to the modeling stage.
Time series data preprocessing
Time series data contains a lot of information, but it is usually not directly visible. Common problems associated with time series are unordered timestamps, missing values (or missing timestamps), outliers, and noise in the data. Of these, handling missing values is one of the most difficult, because traditional imputation methods (techniques that handle missing data by replacing missing values so as to retain most of the information) do not apply well to time series data. To demonstrate these preprocessing steps in practice, we will use the Air Passengers dataset from Kaggle.
Time series data often arrives in an unstructured form, that is, the timestamps may be mixed up and not sorted correctly. In addition, in most cases the date-time column has the default string data type, so it must be converted to a datetime data type before any operation is applied to it. Let's do this for our dataset:
import pandas as pd
passenger = pd.read_csv('AirPassengers.csv')
passenger['Date'] = pd.to_datetime(passenger['Date'])
passenger.sort_values(by=['Date'], inplace=True, ascending=True)
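Sorting fixes the order of the timestamps, but it does not reveal timestamps that are missing altogether. The following is a minimal sketch of one way to find them, assuming a monthly series; the small `raw` frame and its values are made up for illustration and are not the Kaggle data.

```python
import pandas as pd

# Hypothetical monthly data: unsorted string dates, and March 1949 is absent.
raw = pd.DataFrame({
    "Date": ["1949-04-01", "1949-01-01", "1949-02-01", "1949-05-01"],
    "Passengers": [129, 112, 118, 121],
})

raw["Date"] = pd.to_datetime(raw["Date"])
series = raw.sort_values("Date").set_index("Date")["Passengers"]

# Build the calendar we expect (month starts) and compare it with the index we have.
expected = pd.date_range(series.index.min(), series.index.max(), freq="MS")
missing = expected.difference(series.index)

# Reindexing inserts a NaN row for every missing timestamp,
# so the gap becomes an ordinary missing value that can be imputed later.
regular = series.reindex(expected)

print(missing)
print(regular)
```

After reindexing, the series is evenly spaced again, which is what the definition of a time series above requires.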

Missing values in time series
Dealing with missing values in time series data is a challenging task. Traditional imputation techniques are not suitable for time series data, because the order in which the values were recorded matters. To solve this problem, we can use interpolation:
Interpolation is a commonly used missing-value imputation technique for time series. It estimates a missing data point from the two known data points surrounding it. This method is simple and intuitive. The following variants can be used when processing time series data:
Time-based interpolation
Spline interpolation
Linear interpolation
Let's see what our data looks like before interpolation :
from matplotlib.pyplot import figure
import matplotlib.pyplot as plt
figure(figsize=(12, 5), dpi=80, linewidth=10)
plt.plot(passenger['Date'], passenger['Passengers'])
plt.title('Air Passengers Raw Data with Missing Values')
plt.xlabel('Years', fontsize=14)
plt.ylabel('Number of Passengers', fontsize=14)
plt.show()

Let's look at the results of the above three methods :
passenger['Linear'] = passenger['Passengers'].interpolate(method='linear')
# Spline interpolation requires SciPy to be installed
passenger['Spline order 3'] = passenger['Passengers'].interpolate(method='spline', order=3)
# Time-based interpolation only works on a DatetimeIndex, so index by date for this step
passenger['Time'] = passenger.set_index('Date')['Passengers'].interpolate(method='time').to_numpy()
methods = ['Linear', 'Spline order 3', 'Time']
for method in methods:
    figure(figsize=(12, 4), dpi=80, linewidth=10)
    plt.plot(passenger["Date"], passenger[method])
    plt.title('Air Passengers Imputation using: ' + method)
    plt.xlabel("Years", fontsize=14)
    plt.ylabel("Number of Passengers", fontsize=14)
    plt.show()

All three methods give reasonable results. They work best when the missing-value window (the width of the run of missing data) is very small. If several consecutive values are missing, it becomes much harder for these methods to estimate them.
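To make the difference between the methods concrete, here is a small sketch on made-up values. It assumes irregularly spaced timestamps, which is where linear and time-based interpolation disagree, and it uses the `limit` parameter of `interpolate` as one possible way to avoid filling wide gaps.

```python
import numpy as np
import pandas as pd

# Made-up series: the missing point is 1 day after the first value
# and 3 days before the last one.
idx = pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-05"])
s = pd.Series([10.0, np.nan, 50.0], index=idx)

linear = s.interpolate(method="linear")  # ignores the index: midpoint of 10 and 50
timed = s.interpolate(method="time")     # weights by elapsed time: 1 day out of 4

print(linear.iloc[1])  # 30.0
print(timed.iloc[1])   # approximately 20.0

# A wide gap: limit=1 fills at most one consecutive NaN and leaves the rest missing.
gap = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])
partly_filled = gap.interpolate(method="linear", limit=1)
print(partly_filled)
```

Leaving the remainder of a wide gap as NaN keeps the uncertainty visible instead of hiding it behind a straight line.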
Time series denoising
Noise in a time series can cause serious problems, so a noise-removal step generally comes before building any model. The process of minimizing noise is called denoising. Here are some methods commonly used to remove noise from time series:
Rolling average
The rolling average is the mean of a window of previous observations, where the window is a fixed-length run of consecutive values from the time series. The mean is computed for each window as it slides along the ordered series. This can greatly help minimize noise in time series data.
Let's apply a rolling average to Google's stock price :
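# google_stock_price is assumed to be a DataFrame with 'Date' and 'Open' columns, loaded beforehand
rolling_google = google_stock_price['Open'].rolling(20).mean()  # 20-observation window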
plt.plot(google_stock_price['Date'], google_stock_price['Open'])
plt.plot(google_stock_price['Date'], rolling_google)
plt.xlabel('Date')
plt.ylabel('Stock Price')
plt.legend(['Open','Rolling Mean'])
plt.show()
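Since the stock price data is not loaded in the snippet above, the same idea can be tried on synthetic data. This is only a sketch: the trend, the noise level, and the window length of 20 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.arange(500)
trend = 100 + 0.05 * t                                  # slowly rising "true" signal
noisy = pd.Series(trend + rng.normal(0, 2.0, size=t.size))

window = 20
trailing = noisy.rolling(window).mean()                 # first window-1 values are NaN
centered = noisy.rolling(window, center=True).mean()    # NaN at both ends instead

# Mean absolute deviation from the true signal (NaN positions are skipped).
err_raw = (noisy - trend).abs().mean()
err_trailing = (trailing - trend).abs().mean()
err_centered = (centered - trend).abs().mean()

print(err_raw, err_trailing, err_centered)
```

A trailing window uses only past values, so it lags behind a trending signal; a centered window avoids most of that lag but looks ahead of each point, which matters if the smoothed series is later used for forecasting.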

Fourier transform
The Fourier transform helps remove noise by converting the time series to the frequency domain, where the noise frequencies can be filtered out. The filtered time series is then recovered with the inverse Fourier transform. Let's apply the Fourier transform to Google's stock price:
# fft_denoiser, value and time are defined outside this snippet and are not shown here
denoised_google_stock_price = fft_denoiser(value, 0.001, True)
plt.plot(time, google_stock['Open'][0:300])
plt.plot(time, denoised_google_stock_price)
plt.xlabel('Date', fontsize = 13)
plt.ylabel('Stock Price', fontsize = 13)
plt.legend(['Open', 'Denoised: 0.001'])
plt.show()
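The `fft_denoiser` helper used above is not defined in this article. As a rough illustration of the frequency-domain idea, the sketch below implements a simple low-pass filter with NumPy. The function name `fft_lowpass` and its `keep_fraction` parameter are hypothetical and are not the helper used above, which may select frequencies differently.

```python
import numpy as np

def fft_lowpass(values, keep_fraction=0.05):
    """Keep only the lowest `keep_fraction` of frequencies (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    spectrum = np.fft.rfft(values)                 # time domain -> frequency domain
    cutoff = max(1, int(len(spectrum) * keep_fraction))
    spectrum[cutoff:] = 0                          # drop high-frequency components
    return np.fft.irfft(spectrum, n=len(values))   # back to the time domain

# Synthetic example: a slow sine wave plus random noise.
rng = np.random.default_rng(0)
t = np.arange(300)
clean = 50 + 10 * np.sin(2 * np.pi * t / 100)
noisy = clean + rng.normal(0, 3.0, size=t.size)

denoised = fft_lowpass(noisy, keep_fraction=0.05)

rmse_noisy = np.sqrt(np.mean((noisy - clean) ** 2))
rmse_denoised = np.sqrt(np.mean((denoised - clean) ** 2))
print(rmse_noisy, rmse_denoised)
```

This works well here because the signal lives in a few low frequencies. Real price series are not periodic, so the cutoff has to be chosen with care.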

Outlier detection in time series
Outliers are sudden peaks or dips in a time series. Many different factors can lead to outliers. Let's take a look at the methods available to detect them:
Method based on rolling statistics
This approach is the most intuitive and applies to almost all types of time series. Upper and lower limits are created from specific statistical measures, such as the mean and standard deviation, Z and T scores, or percentiles of the distribution. For example, the limits can be placed a fixed number of standard deviations above and below the mean.

It is not advisable to take the mean and standard deviation of the whole series, because the bounds would then be static. Instead, the bounds should be computed over a rolling window: consider a run of consecutive observations to create the bounds, then move on to the next window. This is an efficient and simple outlier detection method.
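A minimal sketch of rolling bounds follows, assuming bounds of the form rolling mean ± k × rolling standard deviation. The multiplier `k = 3`, the window of 30 observations, and the injected spike are illustrative assumptions rather than values taken from this article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
values = pd.Series(50 + rng.normal(0, 1.0, size=200))
values.iloc[120] += 15          # inject one artificial spike

window, k = 30, 3               # assumed window length and multiplier

# shift(1) so that each point is compared with statistics of the
# preceding window only, not with a window that already contains it.
rolling_mean = values.rolling(window).mean().shift(1)
rolling_std = values.rolling(window).std().shift(1)

upper = rolling_mean + k * rolling_std
lower = rolling_mean - k * rolling_std

# Comparisons against NaN bounds (the first `window` points) are False.
outliers = values[(values > upper) | (values < lower)]
print(outliers)
```

Note that once a spike enters the window it inflates the rolling standard deviation for the next `window` observations, so the bounds widen temporarily; more robust statistics such as the rolling median can reduce this effect.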
Isolation forest
As the name suggests, isolation forest is a decision-tree-based machine learning algorithm for anomaly detection. It works by using the partitions of decision trees to isolate data points on a given feature set. In other words, it takes a sample from the dataset and builds a tree on that sample until every point is isolated. To isolate data points, partitions are made at random by selecting a split value between the minimum and maximum values of a feature. Random partitioning produces shorter paths in the tree for anomalous data points, which distinguishes them from the rest of the data.
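The "shorter path" idea can be illustrated with a simplified, one-dimensional sketch. This is not the full isolation forest algorithm (a complete implementation is available as `sklearn.ensemble.IsolationForest` in scikit-learn); the helper names, the sample size of 64, and the 200 repetitions are assumptions made for this illustration.

```python
import numpy as np

def isolation_depth(sample, point, rng, max_depth=50):
    """Count random splits needed until `point` is alone (sample must contain it)."""
    depth = 0
    while len(sample) > 1 and depth < max_depth:
        lo, hi = sample.min(), sample.max()
        if lo == hi:
            break
        split = rng.uniform(lo, hi)            # random split between min and max
        # Follow only the side of the split that contains the point.
        sample = sample[sample < split] if point < split else sample[sample >= split]
        depth += 1
    return depth

def mean_isolation_depth(data, point, n_trees=200, sample_size=64, seed=0):
    """Average isolation depth of `point` over many random subsamples."""
    rng = np.random.default_rng(seed)
    depths = []
    for _ in range(n_trees):
        sample = rng.choice(data, size=min(sample_size, len(data)), replace=False)
        depths.append(isolation_depth(np.append(sample, point), point, rng))
    return float(np.mean(depths))

rng = np.random.default_rng(0)
data = rng.normal(100, 5, size=500)            # typical values around 100

outlier_depth = mean_isolation_depth(data, 200.0)   # far from the bulk of the data
typical_depth = mean_isolation_depth(data, 100.0)   # in the middle of the data
print(outlier_depth, typical_depth)
```

The far-away value is separated after very few random splits, while a value in the dense region needs many more, which is exactly the signal the algorithm turns into an anomaly score.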

K-means clustering
K-means clustering is an unsupervised machine learning algorithm that is often used to detect outliers in time series data. The algorithm looks at the data points in the dataset and groups similar data points into K clusters. Anomalies are identified by measuring the distance from each data point to its nearest centroid: if the distance is greater than a certain threshold, the data point is marked as an anomaly. The K-means algorithm uses Euclidean distance for this comparison.
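The distance-to-centroid rule can be sketched with a small NumPy implementation of K-means. Everything specific here is an assumption for illustration: the two-dimensional synthetic points (standing in for feature vectors derived from a series), the simple deterministic initialization, and the threshold of mean plus three standard deviations of the distances.

```python
import numpy as np

def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm; initial centroids are taken at evenly spaced rows."""
    n = len(points)
    centroids = points[np.arange(k) * (n // k)].copy()
    for _ in range(n_iter):
        # Euclidean distance from every point to every centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids

rng = np.random.default_rng(0)
cluster_a = rng.normal([0, 0], 1.0, size=(100, 2))
cluster_b = rng.normal([10, 10], 1.0, size=(100, 2))
anomaly = np.array([[5.0, 30.0]])                 # one point far from both groups
points = np.vstack([cluster_a, cluster_b, anomaly])

centroids = kmeans(points, k=2)

# Distance of each point to its nearest centroid, then a simple threshold.
nearest_dist = np.linalg.norm(
    points[:, None, :] - centroids[None, :, :], axis=2
).min(axis=1)
threshold = nearest_dist.mean() + 3 * nearest_dist.std()
anomalies = np.where(nearest_dist > threshold)[0]
print(anomalies)
```

The result depends on the choice of K and of the threshold; with a random initialization an extreme point can also end up as its own cluster, in which case its distance to the nearest centroid is zero and it is not flagged.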

Possible interview questions
If someone lists a time series project on their resume, the interviewer may ask questions such as these:
What are the methods for preprocessing time series data, and how do they differ from standard imputation methods?
What does a time series window mean?
Have you heard of isolation forest? If so, can you explain how it works?
What is the Fourier transform, and why do we need it?
What are the different ways to fill in missing values in time series data?
Summary
In this article, we looked at some common time series preprocessing techniques. We started by sorting the time series observations, then studied various missing-value imputation techniques. Because we are dealing with an ordered set of observations, time series imputation differs from traditional imputation techniques. In addition, some noise-removal techniques were applied to the Google stock price dataset, and finally some outlier detection methods for time series were discussed. Applying all of these preprocessing steps helps ensure high-quality data that is ready for building complex models.