Statistical method for anomaly detection

2022-07-07 23:06:00 Anny Linlin

1、 The general idea: learn a generative model that fits the given data set, then identify the objects that lie in low-probability regions of the model and treat them as outliers.

2、 Statistical methods for anomaly detection fall into two main types: parametric methods and nonparametric methods.

3、 Parametric methods

    3.1  Univariate outlier detection based on normal distribution

            Data involving only one attribute or variable is called univariate data. We assume the data is generated by a normal distribution, learn the parameters of that distribution (mean and standard deviation) from the input data, and identify the points with low probability as outliers.
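A minimal sketch of this idea, assuming a "low probability" point is one that lies more than `threshold` standard deviations from the mean (the 3-sigma cutoff, the function name and the sample readings are illustrative choices, not part of the original text):

```python
import statistics

def normal_outliers(values, threshold=3.0):
    """Return the values whose |z-score| exceeds `threshold`."""
    mu = statistics.fmean(values)      # estimated mean
    sigma = statistics.pstdev(values)  # maximum-likelihood estimate of the std
    if sigma == 0:
        return []                      # all values identical: nothing stands out
    return [x for x in values if abs(x - mu) / sigma > threshold]

# Hypothetical sensor readings with one suspicious value.
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.3, 9.7, 10.0, 10.1, 9.9,
            10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 25.0]
print(normal_outliers(readings))  # [25.0]
```

Note that with n points the largest possible z-score under this estimate is sqrt(n - 1), so a 3-sigma rule can never fire on a sample of 10 or fewer points.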

     3.2 Multivariate outlier detection

          Data involving two or more attributes or variables is called multivariate data. Many univariate outlier detection methods can be extended to handle multivariate data. The core idea is to turn the multivariate outlier detection task into univariate outlier detection problems. For example, when univariate outlier detection based on the normal distribution is extended to the multivariate case, you can compute the mean and standard deviation of each dimension.
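The per-dimension extension could look roughly like the sketch below. It assumes an object is flagged as soon as any single dimension has a z-score above `threshold`; that combination rule, the function name and the toy data are assumptions made for illustration, and the approach ignores correlations between dimensions.

```python
import statistics

def per_dimension_outliers(rows, threshold=3.0):
    """Return indices of rows with |z| > threshold in at least one dimension."""
    columns = list(zip(*rows))                       # one tuple per dimension
    mus = [statistics.fmean(c) for c in columns]     # mean of each dimension
    sigmas = [statistics.pstdev(c) for c in columns] # std of each dimension
    flagged = []
    for i, row in enumerate(rows):
        for x, mu, sigma in zip(row, mus, sigmas):
            if sigma > 0 and abs(x - mu) / sigma > threshold:
                flagged.append(i)
                break                                # one extreme dimension is enough
    return flagged

# Hypothetical 2-D data: 19 ordinary rows plus one row that is extreme
# in the second dimension only.
rows = [(10.0 + 0.1 * (i % 5), 50.0 + 0.5 * (i % 4)) for i in range(19)]
rows.append((10.2, 80.0))
print(per_dimension_outliers(rows))  # [19]
```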

4、 Nonparametric methods

In nonparametric methods for anomaly detection, the model of "normal data" is learned from the input data rather than assumed a priori. Nonparametric methods usually make fewer assumptions about the data, so they can be applied in more situations.

Example: using a histogram to detect outliers.

The histogram is a frequently used nonparametric statistical model, and it can be used to detect outliers. The process consists of the following two steps:

Step 1: Construct the histogram. Use the input data (training data) to construct a histogram. The histogram can be univariate, or multivariate if the input data is multidimensional.

Although nonparametric methods do not assume any prior statistical model, the user often still has to supply parameters so that the model can be learned from the data. For example, the user must specify the type of histogram (equal-width or equal-depth) and other parameters (the number of bins in the histogram or the size of each bin). Unlike in parametric methods, these parameters do not specify the type of data distribution.

Step 2: Detect outliers. To determine whether an object is an outlier, check it against the histogram. In the simplest approach, if the object falls into one of the histogram's bins, it is considered normal; otherwise it is considered an outlier.
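The two steps can be sketched as follows for univariate data. The sketch assumes an equal-width histogram whose bins are indexed by `floor(x / bin_width)`, and it treats "falls into a bin" as "falls into a bin that holds at least one training value"; the function names and the numbers are illustrative.

```python
from collections import Counter

def build_histogram(train, bin_width):
    """Step 1: count the training values in each equal-width bin."""
    return Counter(int(x // bin_width) for x in train)

def is_outlier(hist, x, bin_width):
    """Step 2: an object is an outlier if its bin holds no training data."""
    return hist[int(x // bin_width)] == 0

train = [12, 15, 17, 21, 22, 25, 28, 31, 33, 38]
hist = build_histogram(train, bin_width=10)

print(is_outlier(hist, 26, bin_width=10))  # False: bin [20, 30) is populated
print(is_outlier(hist, 75, bin_width=10))  # True: bin [70, 80) is empty
```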

In more sophisticated approaches, the histogram can be used to assign each object an outlier score. For example, the outlier score of an object can be the reciprocal of the volume of the bin that the object falls into.
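A rough sketch of such a score, under the assumption that the "volume" of a bin means the number of training objects it contains (so sparse bins give high scores and empty bins give an infinite score). The helper name `make_scorer` and the data are hypothetical.

```python
import math
from collections import Counter

def make_scorer(train, bin_width):
    """Build an equal-width histogram and return a scoring function."""
    counts = Counter(math.floor(v / bin_width) for v in train)

    def score(x):
        n = counts[math.floor(x / bin_width)]  # training objects in x's bin
        return math.inf if n == 0 else 1.0 / n

    return score

score = make_scorer([12, 15, 17, 21, 22, 25, 28, 31, 33, 38, 47], bin_width=10)
print(score(26))  # 0.25 (bin holds 4 training values)
print(score(45))  # 1.0  (bin holds 1 training value)
print(score(75))  # inf  (empty bin)
```

Unlike the yes/no test, the score ranks objects, so a cutoff can be chosen afterwards.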

One disadvantage of using a histogram as a nonparametric model for outlier detection is that it is hard to choose a suitable bin size. On the one hand, if the bins are too small, many normal objects fall into empty or sparse bins and are mistakenly identified as outliers. On the other hand, if the bins are too large, outliers can slip into frequent bins and so "pass" as normal.
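The trade-off can be seen on a toy example. The sketch below reuses the simple empty-bin test with three bin widths; the data, the query points (26 as a plausibly normal value, 75 as an anomalous one) and the function name are made up for illustration.

```python
from collections import Counter

def flag_outliers(train, queries, bin_width):
    """Map each query to True if its equal-width bin has no training data."""
    hist = Counter(int(v // bin_width) for v in train)
    return {q: hist[int(q // bin_width)] == 0 for q in queries}

train = [12, 15, 17, 21, 22, 25, 28, 31, 33, 38]

for width in (1, 10, 100):
    print(width, flag_outliers(train, [26, 75], width))
# width 1:   26 is flagged too (false alarm, its tiny bin is empty)
# width 10:  only 75 is flagged
# width 100: nothing is flagged (75 hides in the single large bin)
```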


5、 HBOS

HBOS stands for Histogram-based Outlier Score. It is a combination of univariate methods, so it cannot model dependencies between features, but it is fast and scales well to large data sets. Its basic assumption is that the dimensions of the data set are independent of one another. Each dimension is divided into intervals (bins); the higher the density of the interval an object falls into, the lower its outlier score.
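A simplified sketch of the HBOS idea in plain Python. It assumes fixed-width bins, normalises each histogram so that its tallest bin has height 1, and sums log(1 / height) over the dimensions; it only scores the training objects themselves. The function name, `n_bins` and the data are illustrative, and this is not a full implementation of the published algorithm.

```python
import math

def hbos_scores(rows, n_bins=5):
    """Sum over dimensions of log(1 / normalised bin height) for each row."""
    scores = [0.0] * len(rows)
    for column in zip(*rows):                 # treat each dimension independently
        lo, hi = min(column), max(column)
        width = (hi - lo) / n_bins or 1.0     # guard against a constant column
        # Clamp so that the maximum value lands in the last bin.
        bins = [min(int((x - lo) / width), n_bins - 1) for x in column]
        counts = [0] * n_bins
        for b in bins:
            counts[b] += 1
        tallest = max(counts)
        for i, b in enumerate(bins):
            # Dense bin -> height near 1 -> contribution near 0.
            scores[i] += math.log(tallest / counts[b])
    return scores

# Hypothetical 2-D data: the last row is extreme in the second dimension.
rows = [(10.0 + 0.1 * (i % 5), 50.0 + 0.5 * (i % 4)) for i in range(19)]
rows.append((10.2, 80.0))
scores = hbos_scores(rows)
print(max(range(len(rows)), key=scores.__getitem__))  # 19
```

Because each dimension needs only one pass to build its histogram, the cost grows linearly with the number of objects and dimensions, which is where the speed advantage comes from.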

6、 Practice




