Methods for detecting outliers in data
2022-07-04 07:58:00 【Data analysis - Zhongzhi】
Outlier detection is an important step in the data analysis process and an essential skill for data analysts. The following is a brief introduction to common outlier detection methods; criticism and corrections of any shortcomings are welcome.
I. Distribution-based outlier detection methods
1. 3-sigma
Based on the normal distribution, the 3-sigma rule treats any data point that falls more than 3 standard deviations away from the mean as an outlier.
Python implementation:
import numpy as np

def three_sigma(s):
    mu, std = np.mean(s), np.std(s)
    lower, upper = mu - 3*std, mu + 3*std
    return lower, upper
2. Z-score
The Z-score is the standard score: it measures how many standard deviations a data point lies from the mean. If sample A differs from the mean by 2 standard deviations, its Z-score is 2. Using |Z-score| = 3 as the threshold for removing outliers is equivalent to the 3-sigma rule.
def z_score(s):
    z_score = (s - np.mean(s)) / np.std(s)
    return z_score
3. Boxplot
The boxplot finds outliers based on the interquartile range (IQR): points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are flagged.
def boxplot(s):
    # s is a pandas Series
    q1, q3 = s.quantile(.25), s.quantile(.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr
    return lower, upper
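A minimal usage sketch of the three helpers above on made-up data (the sample values are purely illustrative):

import numpy as np
import pandas as pd

s = pd.Series([1.2, 0.8, 1.1, 0.9, 1.0, 5.6, 1.3, 0.7])   # hypothetical data

lower, upper = three_sigma(s)
print(s[(s < lower) | (s > upper)])       # points outside mu +/- 3*sigma

print(s[z_score(s).abs() > 3])            # points with |Z-score| > 3

lower, upper = boxplot(s)
print(s[(s < lower) | (s > upper)])       # points outside the IQR fences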
4. Grubbs' test
Grubbs' test is a hypothesis test commonly used to detect a single outlier in a univariate, normally distributed data set Y. If an outlier exists, it must be the maximum or the minimum value in the data set. The null and alternative hypotheses are as follows:
● H0: There are no outliers in the dataset
● H1: There is an outlier in the dataset
The Grubbs test requires the population to be normally distributed. Algorithm flow:
1. Sort the samples in ascending order.
2. Compute the sample mean and standard deviation.
3. Compute the gaps between min/max and the mean; the larger gap marks the suspicious value.
4. Compute the z-score (standard score) of the suspicious value; if it exceeds the Grubbs critical value, the value is an outlier.
The Grubbs critical value can be looked up in a table and is determined by two quantities: the significance level α (the stricter the test, the smaller it is) and the sample size n. After excluding the detected outlier, repeat steps 1-4 on the remaining sequence [1]. For detailed worked examples, please refer to the original reference.
from outliers import smirnov_grubbs as grubbs
print(grubbs.test([8, 9, 10, 1, 9], alpha=0.05))               # data with the outlier removed
print(grubbs.min_test_outliers([8, 9, 10, 1, 9], alpha=0.05))  # outliers on the minimum side
print(grubbs.max_test_outliers([8, 9, 10, 1, 9], alpha=0.05))  # outliers on the maximum side
print(grubbs.max_test_indices([8, 9, 10, 50, 9], alpha=0.05))  # indices of maximum-side outliers
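For reference, below is a minimal hand-written sketch of a single two-sided Grubbs iteration (it assumes scipy is available; grubbs_once is a hypothetical helper, not part of the outliers package):

import numpy as np
from scipy import stats

def grubbs_once(x, alpha=0.05):
    # One iteration of the two-sided Grubbs test: return the most suspicious
    # point if its Grubbs statistic exceeds the critical value, else None
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean, sd = x.mean(), x.std(ddof=1)
    idx = np.argmax(np.abs(x - mean))            # most suspicious point
    g = abs(x[idx] - mean) / sd                  # Grubbs statistic
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)  # critical t value
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return (idx, x[idx]) if g > g_crit else None

print(grubbs_once([8, 9, 10, 1, 9]))             # expected to flag the value 1

In a full implementation this step would be repeated on the remaining data until no further point exceeds the critical value, matching the "eliminate one by one" procedure described above.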
Limitations:
1. Only one-dimensional data can be tested.
2. It cannot output an exact normal range.
3. Its mechanism eliminates outliers one by one, so the whole procedure must be rerun for every outlier, which is expensive on large data sets.
4. The data must be assumed to follow (or approximately follow) a normal distribution.
II. Distance-based outlier detection methods
1. KNN
For each sample point, compute the average distance to its K nearest neighbors and compare it with a threshold; if the distance exceeds the threshold, the point is considered an outlier. The advantage is that no assumption about the data distribution is required; the disadvantage is that only global outliers can be found, not local ones.
from pyod.models.knn import KNN

# Initialize the detector clf
clf = KNN(method='mean', n_neighbors=3)
clf.fit(X_train)

# Binary labels on the training data (0: normal, 1: outlier)
y_train_pred = clf.labels_
# Outlier scores on the training data (the higher, the more abnormal)
y_train_scores = clf.decision_scores_
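To score data not seen during training (X_test below is assumed to exist), pyod detectors also expose predict and decision_function:

# Labels (0/1) and raw outlier scores for new data
y_test_pred = clf.predict(X_test)
y_test_scores = clf.decision_function(X_test)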
III. Density-based methods
1. Local Outlier Factor (LOF)
LOF is a classic density-based algorithm (Breunig et al., 2000). It assigns each data point an outlier factor that depends on the density of its neighborhood, and then uses that factor to decide whether the point is an outlier. Its advantage is that it quantifies the degree of outlierness of every data point.
The local outlier factor of a data point P (its local relative density) is the ratio of the average local reachability density of the points in P's neighborhood to the local reachability density of P itself. The local reachability density of P is the reciprocal of the average reachability distance from P to its k nearest neighbors; the larger that distance, the lower the density.
The reachability distance from point P to point O is max(k-distance of O, distance from P to O).
The k-distance of point O is the distance between O and its k-th nearest neighbor.
Overall, the LOF algorithm proceeds as follows:
● For each data point, compute its distance to all other points and sort them from nearest to farthest;
● For each data point, find its K nearest neighbors and compute its LOF score.
from sklearn.neighbors import LocalOutlierFactor as LOF
X = [[-1.1], [0.2], [100.1], [0.3]]
clf = LOF(n_neighbors=2)
res = clf.fit_predict(X)
print(res)
print(clf.negative_outlier_factor_)
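To score new samples with scikit-learn's LOF, set novelty=True (an option of LocalOutlierFactor); in that mode fit_predict is unavailable and predict should only be applied to unseen data. A short sketch:

clf = LOF(n_neighbors=2, novelty=True)
clf.fit(X)
print(clf.predict([[0.25], [150.0]]))   # 1 = inlier, -1 = outlier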
2. Connectivity-Based Outlier Factor (COF)
COF is a variant of LOF. Compared with LOF, COF can handle outliers in low-density regions, because its local density is computed from the average chaining distance. As with LOF, we first find the k nearest neighbors of each point, and then compute its set-based nearest path (SBN path), as in the following example.
Suppose k = 5, so the neighbors of F are B, C, D, E and G. The point closest to F is E, so the first element of the SBN path is F and the second is E. The point closest to E is D, so the third element is D; the points closest to D are C and G, so the fourth and fifth elements are C and G; finally, the point closest to C is B, so the sixth element is B. The complete SBN path of F is therefore {F, E, D, C, G, B}, with corresponding distances e = {e1, e2, e3, ..., ek}; in this example e = {3, 2, 1, 1, 1}.
In other words, to obtain the SBN path of a point p, we build the minimum spanning tree of the graph formed by p and its neighbors, and then traverse it starting from p (a shortest-path-style expansion) to get the SBN path.
From the SBN path we compute the average chaining distance (ac_distance) of p, a weighted average of the distances along the path in which earlier edges receive larger weights; the COF of p is then the ratio of p's average chaining distance to the mean average chaining distance of its k nearest neighbors. (The exact formulas appear in figures in the original post that are not reproduced here.)
import numpy as np
from pyod.models.cof import COF
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame.iloc[:, :4]   # feature columns of the iris data

cof = COF(contamination=0.06,    # proportion of outliers
          n_neighbors=20)        # number of nearest neighbors
cof.fit(iris.values)
cof_label = cof.labels_          # 0: normal, 1: outlier
print("Number of outliers detected:", np.sum(cof_label == 1))
3. Stochastic Outlier Selection (SOS)
The SOS algorithm takes a feature matrix or a dissimilarity matrix as input and returns a vector of outlier probabilities (one per point). The idea behind SOS: a point is an outlier when all other points have only a very small affinity with it.
The SOS procedure:
1. Compute the dissimilarity matrix D;
2. Compute the affinity matrix A;
3. Compute the binding probability matrix B;
4. Compute the outlier probability vector.
The dissimilarity matrix D holds the pairwise distances between samples, such as Euclidean or Hamming distance. The affinity matrix reflects the variance associated with each point: a point in the densest region gets the smallest variance, while a point in the sparsest region gets the largest variance (this was illustrated in Figures 7 and 8 of the original post, which show the density view of the affinity matrix and the binding probability matrix; they are not reproduced here). The binding probability matrix B is the affinity matrix normalized row by row.
Given the binding probability matrix, the outlier probability of each point is computed: a point is an outlier when all other points have only a very small affinity with it.
import pandas as pd
from sksos import SOS
iris = pd.read_csv("http://bit.ly/iris-csv")
X = iris.drop("Name", axis=1).values
detector = SOS()
iris["score"] = detector.predict(X)
iris.sort_values("score", ascending=False).head(10)
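Since the scores returned by sksos are outlier probabilities, a fixed cut-off can be used to flag points; the value 0.5 below is an arbitrary, purely illustrative choice:

outliers = iris[iris["score"] > 0.5]   # flag points whose outlier probability exceeds 0.5
print(len(outliers))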
IV. Clustering-based methods
The DBSCAN algorithm (Density-Based Spatial Clustering of Applications with Noise) has the inputs and outputs listed below; the points that cannot be assigned to any cluster are the outliers (noise points).
● Input: the data set, the neighborhood radius Eps, and the minimum number of data objects in a neighborhood MinPts;
● Output: density-connected clusters.
The procedure is as follows:
1. Select an arbitrary data object point p from the data set;
2. If, for the given Eps and MinPts, p is a core point, find all data object points that are density-reachable from p and form a cluster;
3. If p is an edge point, select another data object point;
4. Repeat steps 2 and 3 until all points have been processed.
from sklearn.cluster import DBSCAN
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])
clustering = DBSCAN(eps=3, min_samples=2).fit(X)
print(clustering.labels_)
# array([ 0,  0,  0,  1,  1, -1])
# 0, 0, 0: the first three samples form one cluster
# 1, 1:    the middle two samples form another cluster
# -1:      the last sample is an outlier that belongs to no cluster
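To extract the outliers directly, select the rows labelled -1:

outliers = X[clustering.labels_ == -1]   # DBSCAN noise points are the outliers
print(outliers)                          # [[25 80]]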
V. Tree-based methods
In Isolation Forest, "isolation" means "isolating an outlier from all the other samples"; the original paper phrases it as "separating an instance from the rest of the instances".
We cut the data space with a random hyperplane, which produces two subspaces at a time. We then keep choosing random hyperplanes to cut the resulting subspaces, and repeat until every subspace contains only one data point. Dense clusters need many cuts before every point ends up alone in its own subspace, whereas sparse points mostly end up alone in a subspace very early. The idea of Isolation Forest is therefore that anomalous samples fall into leaf nodes more quickly, i.e. they lie closer to the root of the decision tree.
To build a tree, randomly select m features and split the data at a value chosen at random between the minimum and the maximum of the selected feature. The partitioning is repeated recursively until all observations are isolated.
After building t isolation trees, training is complete. The trees can then be used to score test data: for each sample x, the results of all trees are combined into the anomaly score

    s(x, n) = 2^( -E(h(x)) / c(n) )

where:
● h(x): the path length of the sample in a single iTree;
● E(h(x)): the average path length of the sample over the t iTrees;
● c(n): the average path length of unsuccessful searches in a binary search tree (BST) built from n samples (the expected h(x) at an external node is equivalent to an unsuccessful BST search); it normalizes h(x) and equals c(n) = 2H(n-1) - 2(n-1)/n, where H(n-1) is a harmonic number that can be estimated as ln(n-1) + 0.5772156649 (the Euler-Mascheroni constant).
The exponent ranges over (-∞, 0), so s ranges over (0, 1). The smaller the path length, the closer s is to 1 and the more likely the sample is an outlier.
from sklearn.datasets import load_iris
from sklearn.ensemble import IsolationForest

data = load_iris(as_frame=True)
X, y = data.data, data.target
df = data.frame

# Train the model
iforest = IsolationForest(n_estimators=100, max_samples='auto',
                          contamination=0.05, max_features=4,
                          bootstrap=False, n_jobs=-1, random_state=1)

# fit_predict trains and predicts in one step: -1 means abnormal, 1 means normal
df['label'] = iforest.fit_predict(X)

# decision_function returns the anomaly score (lower means more abnormal)
df['scores'] = iforest.decision_function(X)
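To inspect the results, one can for example sort by the score column added above (a usage sketch continuing the block above; lower decision_function scores mean more anomalous):

print(df.sort_values('scores').head(10))   # the 10 most anomalous samples
print(df['label'].value_counts())          # -1 = flagged as outliers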
VI. Dimensionality-reduction-based methods
1. Principal Component Analysis (PCA)
There are generally two ways to use PCA for anomaly detection:
(1) Map the data into a lower-dimensional feature space and examine how much each data point deviates from the others along the different dimensions of that space;
(2) Map the data into the lower-dimensional feature space, map it back to the original space, try to reconstruct each sample from its low-dimensional representation, and examine the size of the reconstruction error.
Eigenvalue decomposition in PCA yields:
● eigenvectors: the directions along which the variance of the original data varies;
● eigenvalues: the variance of the data in the corresponding direction.
Hence the eigenvector with the largest eigenvalue points in the direction of maximum variance, and the eigenvector with the smallest eigenvalue in the direction of minimum variance. The variance of the original data along different directions reflects its intrinsic structure. If a single sample is inconsistent with the overall data, for example it deviates strongly from the other samples in certain directions, it may be an outlier.
In the first approach, the anomaly score of a sample x_i is its accumulated deviation over all directions:

    Score(x_i) = sum_{j=1}^{n} d_ij^2 / lambda_j

where d_ij is the deviation of the sample along the j-th eigenvector in the projected space and lambda_j is the corresponding eigenvalue, used for normalization so that the deviations along different directions are comparable. The farther a sample deviates from the principal components, the larger the score and the higher its degree of abnormality.
When computing the anomaly score, there are two further choices of which eigenvectors (i.e. which reference directions) to use:
● Consider the deviation along the first k eigenvectors: the leading eigenvectors often correspond directly to some of the original features, so samples that deviate strongly along them are usually extreme values of those features.
● Consider the deviation along the last r eigenvectors: the trailing eigenvectors usually represent linear combinations of several original features whose small variance reflects the correlations among those features, so samples that deviate strongly along them violate the expected relationships in the original data. A sample is judged anomalous when its score exceeds a threshold C.
In the second approach, PCA extracts the main structure of the data. If a sample is hard to reconstruct from that structure, its characteristics are inconsistent with the overall data, so it is very likely an anomalous sample. The anomaly score is the reconstruction error, i.e. the deviation between x_i and its reconstruction, where the reconstruction is obtained from the top-k eigenvectors.
When samples are reconstructed from the low-dimensional features, the information along the eigenvectors with smaller eigenvalues is discarded; in other words, the reconstruction error mainly comes from those directions. Based on this intuition, both PCA-based approaches pay special attention to the eigenvectors associated with smaller eigenvalues, so the two approaches are essentially similar; of course, the first approach can also focus on the eigenvectors with larger eigenvalues.
from sklearn.decomposition import PCA
import numpy as np
import pandas as pd

# training_data, centered_training_data and features are assumed to be prepared beforehand
pca = PCA()
pca.fit(centered_training_data)
transformed_data = pca.transform(training_data)
y = transformed_data

# Compute the anomaly score components
lambdas = pca.singular_values_
M = (y * y) / lambdas

# First q eigenvectors and last r eigenvectors
q = 5
print("Explained variance by first q terms: ", sum(pca.explained_variance_ratio_[:q]))
q_values = list(pca.singular_values_ < .2)
r = q_values.index(True)

# Sum of the normalized squared distances for each sample point
major_components = M[:, range(q)]
minor_components = M[:, range(r, len(features))]
major_components = np.sum(major_components, axis=1)
minor_components = np.sum(minor_components, axis=1)

# Set the thresholds c1 and c2 (here the 99th percentile)
components = pd.DataFrame({
    'major_components': major_components,
    'minor_components': minor_components})
c1 = components.quantile(0.99)['major_components']
c2 = components.quantile(0.99)['minor_components']

# Build the classifier
def classifier(major_components, minor_components):
    major = major_components > c1
    minor = minor_components > c2
    return np.logical_or(major, minor)

results = classifier(major_components=major_components, minor_components=minor_components)
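A short follow-up, assuming the block above has run: count and locate the flagged samples.

print(np.sum(results))        # number of samples flagged by either rule
print(np.where(results)[0])   # their indices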
2. AutoEncoder
PCA performs linear dimensionality reduction, while an AutoEncoder performs nonlinear dimensionality reduction. An AutoEncoder trained on normal data can reconstruct normal samples well, but cannot restore data points that deviate from the normal distribution, which leads to a large reconstruction error. So if the error of a new sample after encoding and decoding exceeds the error range observed on normal data, it is treated as anomalous. Note that the AutoEncoder must be trained on normal data only (i.e. data without outliers); only then does the reconstruction error define a reasonable range for normality. Because the training data must be known to be normal, anomaly detection with an AutoEncoder is usually described as a semi-supervised (one-class) method rather than a purely unsupervised one.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from sklearn import preprocessing
from keras.models import Sequential
from keras.layers import Dense
from keras import regularizers

# dataset_train / dataset_test are assumed to be prepared beforehand
# Standardize the data
scaler = preprocessing.MinMaxScaler()
X_train = pd.DataFrame(scaler.fit_transform(dataset_train),
                       columns=dataset_train.columns,
                       index=dataset_train.index)
# Randomly shuffle the training data
X_train = X_train.sample(frac=1)
X_test = pd.DataFrame(scaler.transform(dataset_test),
                      columns=dataset_test.columns,
                      index=dataset_test.index)
tf.random.set_seed(10)
act_func = 'relu'
# Input layer:
model=Sequential()
# First hidden layer, connected to input vector X.
model.add(Dense(10,activation=act_func,
kernel_initializer='glorot_uniform',
kernel_regularizer=regularizers.l2(0.0),
input_shape=(X_train.shape[1],)
)
)
model.add(Dense(2,activation=act_func,
kernel_initializer='glorot_uniform'))
model.add(Dense(10,activation=act_func,
kernel_initializer='glorot_uniform'))
model.add(Dense(X_train.shape[1],
kernel_initializer='glorot_uniform'))
model.compile(loss='mse',optimizer='adam')
print(model.summary())
# Train model for 100 epochs, batch size of 10:
NUM_EPOCHS=100
BATCH_SIZE=10
history=model.fit(np.array(X_train),np.array(X_train),
batch_size=BATCH_SIZE,
epochs=NUM_EPOCHS,
validation_split=0.05,
verbose = 1)
plt.plot(history.history['loss'],
'b',
label='Training loss')
plt.plot(history.history['val_loss'],
'r',
label='Validation loss')
plt.legend(loc='upper right')
plt.xlabel('Epochs')
plt.ylabel('Loss, [mse]')
plt.ylim([0,.1])
plt.show()
# Inspect the reconstruction-error distribution on the training set to establish the normal error range
X_pred = model.predict(np.array(X_train))
X_pred = pd.DataFrame(X_pred,
columns=X_train.columns)
X_pred.index = X_train.index
scored = pd.DataFrame(index=X_train.index)
scored['Loss_mae'] = np.mean(np.abs(X_pred-X_train), axis = 1)
plt.figure()
sns.distplot(scored['Loss_mae'],
bins = 10,
kde= True,
color = 'blue')
plt.xlim([0.0,.5])
# Compare against the error threshold to find the outliers
X_pred = model.predict(np.array(X_test))
X_pred = pd.DataFrame(X_pred,
                      columns=X_test.columns)
X_pred.index = X_test.index

threshold = 0.3
scored = pd.DataFrame(index=X_test.index)
scored['Loss_mae'] = np.mean(np.abs(X_pred - X_test), axis=1)
scored['Threshold'] = threshold
scored['Anomaly'] = scored['Loss_mae'] > scored['Threshold']
scored.head()
VII. Classification-based methods
1. One-Class SVM
The idea of One-Class SVM is simple: find a boundary that encloses the positive samples, and use that boundary for prediction; samples inside the boundary are treated as positive, samples outside as negative. In anomaly detection, the negative samples are regarded as anomalies. It is an unsupervised method, so no labels are required.
Another way to derive One-Class SVM is SVDD (Support Vector Domain Description). SVDD expects all non-anomalous samples to be positive, and uses a hypersphere rather than a hyperplane to separate them: the algorithm finds a spherical boundary around the data in feature space and minimizes the volume of this hypersphere, thereby minimizing the influence of outlier data.
Suppose the hypersphere has center o and radius r > 0, and that the volume V of the hypersphere is to be minimized (the center o is a linear combination of the support vectors). As in a standard SVM, one could require the distance from every training point x_i to the center to be strictly less than r; instead, slack variables ζ_i with penalty coefficient C are introduced, giving the optimization problem

    min  r^2 + C * sum_i ζ_i
    s.t. || x_i - o ||^2 <= r^2 + ζ_i,  ζ_i >= 0  for all i

C adjusts the influence of the slack variables: loosely speaking, it controls how much slack the relaxed data points are given. If C is relatively small, outliers receive more slack and are allowed to remain outside the hypersphere.
from sklearn import svm

# Fit the model (X is assumed to be prepared beforehand)
clf = svm.OneClassSVM(nu=0.1, kernel='rbf', gamma=0.1)
clf.fit(X)
y_pred = clf.predict(X)
n_error_outlier = y_pred[y_pred == -1].size   # number of samples predicted as outliers (-1)
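A short usage note continuing the block above: scikit-learn's OneClassSVM also exposes decision_function, whose sign indicates which side of the learned boundary a sample falls on (negative means outside, i.e. anomalous).

scores = clf.decision_function(X)                              # signed distance to the boundary
print("flagged as outliers:", n_error_outlier, "of", len(X))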
VIII. Prediction-based methods
For a single time series, compare the predicted curve with the observed data, compute the residual at each point, model the residual series, and then apply a k-sigma or quantile rule to the residuals to detect anomalies. (The flow-chart figure from the original post is not reproduced here.)
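Below is a minimal sketch of this idea under stated assumptions: a univariate pandas Series, a naive rolling-mean one-step "forecast" standing in for a real forecasting model, and a 3-sigma rule on the residuals; the window size and threshold are placeholders.

import numpy as np
import pandas as pd

def residual_outliers(s, window=12, k=3):
    pred = s.rolling(window, min_periods=1).mean().shift(1)   # naive one-step forecast
    resid = s - pred                                          # residual series
    mu, sd = resid.mean(), resid.std()
    return s[(resid - mu).abs() > k * sd]                     # points with abnormal residuals

np.random.seed(0)
s = pd.Series(np.sin(np.linspace(0, 20, 200)) + np.random.normal(0, 0.1, 200))
s.iloc[120] += 3                                              # inject an artificial anomaly
print(residual_outliers(s))

In practice the rolling mean would be replaced by a proper time-series forecasting model, but the residual-plus-threshold logic stays the same.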
IX. Summary
The methods covered above fall into eight families: distribution-based (3-sigma, Z-score, boxplot, Grubbs' test), distance-based (KNN), density-based (LOF, COF, SOS), clustering-based (DBSCAN), tree-based (Isolation Forest), dimensionality-reduction-based (PCA, AutoEncoder), classification-based (One-Class SVM), and prediction-based (residual analysis of time-series forecasts).