Statistical method for anomaly detection
2022-07-07 23:06:00 【Anny Linlin】
1、 The general idea: learn a generative model that fits the given data set, then identify the objects that fall in low-probability regions of that model and treat them as outliers.
2、 Statistical methods for anomaly detection fall into two main types: parametric methods and nonparametric methods.
3、 Parametric methods
3.1 Univariate outlier detection based on the normal distribution
Data involving only one attribute or variable is called univariate data. Assume the data is generated by a normal distribution. The parameters of that distribution (mean and standard deviation) can then be learned from the input data, and points with low probability under it are identified as outliers.
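The idea above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `normal_outliers`, the threshold `k=3.0` (the common "three sigma" rule) and the sample data are all assumptions made for the example.

```python
import numpy as np

def normal_outliers(x, k=3.0):
    """Flag values lying more than k standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()      # estimated mean of the fitted normal distribution
    sigma = x.std()    # estimated standard deviation (maximum-likelihood form)
    if sigma == 0:     # all values identical: nothing can be flagged
        return np.zeros(x.shape, dtype=bool)
    return np.abs(x - mu) / sigma > k

# Hypothetical data: 50 evenly spread readings between 9 and 11, plus one reading of 20.
values = np.append(np.linspace(9.0, 11.0, 50), 20.0)
flags = normal_outliers(values)
print(values[flags])
```

Here only the reading of 20 is flagged. Note that with very small samples a single extreme value inflates the estimated standard deviation, so a fixed threshold of 3 may never be exceeded.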
3.2 Multivariate outlier detection
Data involving two or more attributes or variables is called multivariate data. Many univariate outlier detection methods can be extended to handle multivariate data. The core idea is to transform the multivariate outlier detection task into a univariate outlier detection problem. For example, when univariate outlier detection based on the normal distribution is extended to the multivariate case, the mean and standard deviation can be computed for each dimension.
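One possible reading of the per-dimension extension is sketched below: each column gets its own mean and standard deviation, and an object is flagged if it is extreme in at least one dimension. The "any dimension" rule, the name `per_dimension_outliers`, the threshold `k=3.0` and the data are assumptions for illustration; the text does not prescribe how the per-dimension results are combined.

```python
import numpy as np

def per_dimension_outliers(X, k=3.0):
    """Flag rows whose value in any column is more than k standard deviations from that column's mean."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)                        # one mean per dimension
    sigma = X.std(axis=0)                      # one standard deviation per dimension
    sigma = np.where(sigma == 0, 1.0, sigma)   # constant columns then give z = 0
    z = np.abs(X - mu) / sigma
    return (z > k).any(axis=1)

# Hypothetical data: 50 two-dimensional objects in the unit square,
# plus one object that is ordinary in the first dimension but extreme in the second.
a = np.linspace(0.0, 1.0, 50)
X = np.vstack([np.column_stack([a, a[::-1]]), [0.5, 8.0]])
flags = per_dimension_outliers(X)
print(np.flatnonzero(flags))
```

Because every dimension is treated separately, this sketch cannot detect objects that are unusual only in the combination of their attribute values.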
4、 Nonparametric methods
In nonparametric methods for anomaly detection, the model of “normal data” is learned from the input data rather than assumed a priori. Nonparametric methods usually make fewer assumptions about the data, so they can be applied in more situations.
Example: using a histogram to detect outliers.
A histogram is a frequently used nonparametric statistical model that can be used to detect outliers. The process has two steps:
Step 1: Construct the histogram. Use the input data (training data) to construct a histogram. The histogram can be univariate, or multivariate if the input data is multidimensional.
Although nonparametric methods do not assume any prior statistical model, they often do require the user to supply parameters in order to learn from the data. For example, the user must specify the type of histogram (equal-width or equal-depth) and other parameters (the number of bins or the size of each bin). Unlike in parametric methods, these parameters do not specify the type of data distribution.
Step 2: Detect outliers. To decide whether an object is an outlier, check it against the histogram. In the simplest approach, if the object falls into one of the histogram's bins it is considered normal; otherwise it is considered an outlier.
In a more sophisticated approach, the histogram is used to assign each object an outlier score. For example, the outlier score of an object can be the reciprocal of the volume of the bin it falls into.
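The two steps can be sketched as follows. Two assumptions are made here and should be read as such: "falling into a bin" is taken to mean falling into a non-empty bin of the training histogram, and the "volume" of a bin is interpreted as the fraction of training objects it holds. The helper `bin_mass`, the bin edges and the data are hypothetical.

```python
import numpy as np

def bin_mass(x, counts, edges):
    """Fraction of training objects in the bin holding each x (0 if the bin is empty or x is out of range)."""
    x = np.asarray(x, dtype=float)
    idx = np.searchsorted(edges, x, side="right") - 1
    idx = np.where(x == edges[-1], len(counts) - 1, idx)   # the last bin is closed on the right
    inside = (x >= edges[0]) & (x <= edges[-1])
    idx = np.where(inside, idx, 0)                          # placeholder index, masked below
    return np.where(inside, counts[idx] / counts.sum(), 0.0)

# Step 1: construct an equal-width histogram from hypothetical training data.
train = [1.0, 1.5, 2.0, 2.2, 2.5, 2.7, 3.0, 3.1, 3.3, 3.5, 3.8, 4.0, 4.5, 9.5]
counts, edges = np.histogram(train, bins=[0, 2, 4, 6, 8, 10])

# Step 2: check new objects against the histogram.
queries = np.array([3.0, 7.0, 9.0, 12.0])
mass = bin_mass(queries, counts, edges)
is_outlier = mass == 0                  # simplest rule: empty bin or outside the histogram
with np.errstate(divide="ignore"):
    scores = 1.0 / mass                 # outlier score: reciprocal of the bin's share of the data

print(is_outlier)
print(scores)
```

With these inputs, 7.0 (empty bin) and 12.0 (outside the histogram) are outliers under the simple rule and receive an infinite score, while 9.0 is technically "normal" but scores much higher than 3.0 because its bin holds a single training object.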
One disadvantage of using a histogram as a nonparametric model for outlier detection is that it is hard to choose an appropriate bin size. On the one hand, if the bins are too small, many normal objects fall into empty or sparse bins and are mistakenly identified as outliers. On the other hand, if the bins are too large, outliers can slip into frequent bins and thus “pass” as normal.
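The effect of the bin size can be seen on a small, made-up data set. The sketch below counts how many training objects share a bin with a given object for three different numbers of equal-width bins; the helper `bin_population` and the data are assumptions for illustration, and `x` is assumed to lie within the range of the training data.

```python
import numpy as np

def bin_population(train, x, bins):
    """Number of training objects sharing an equal-width bin with x."""
    counts, edges = np.histogram(train, bins=bins)
    i = int(np.searchsorted(edges, x, side="right")) - 1
    i = min(max(i, 0), len(counts) - 1)   # the last bin is closed on the right
    return int(counts[i])

# Hypothetical data: 100 normal objects spread over [0, 1] and one outlier at 1.5.
normal = np.linspace(0.0, 1.0, 100)
outlier = 1.5
train = np.append(normal, outlier)

for bins in (300, 6, 2):
    print(bins,
          bin_population(train, normal[50], bins),   # a typical normal object
          bin_population(train, outlier, bins))      # the outlier
```

With 300 bins the normal object sits alone in its bin, just like the outlier, so the two cannot be told apart. With 6 bins the normal object shares its bin with many others while the outlier is alone. With 2 bins the outlier shares a bin with a quarter of the normal objects and looks normal.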
5、 HBOS
HBOS stands for Histogram-based Outlier Score. It is a combination of univariate methods: it cannot model dependencies between features, but it is fast and scales well to large data sets. Its basic assumption is that the dimensions of the data set are independent of each other. Each dimension is divided into intervals (bins); the higher the density of an interval, the lower the outlier score.
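A minimal sketch of this scoring scheme with fixed-width bins is shown below. It assumes the usual HBOS formulation, in which each per-dimension histogram is normalised so that its tallest bin has height 1 and the score of an object is the sum over dimensions of log(1 / height); the function name `hbos_scores`, the choice of 10 bins and the data are illustrative. Only the data used to build the histograms is scored, so every object falls into a non-empty bin.

```python
import numpy as np

def hbos_scores(X, bins=10):
    """Histogram-based outlier score of every row of X, using equal-width bins per dimension."""
    X = np.asarray(X, dtype=float)
    scores = np.zeros(X.shape[0])
    for j in range(X.shape[1]):                      # dimensions are treated as independent
        counts, edges = np.histogram(X[:, j], bins=bins)
        heights = counts / counts.max()              # normalise so the densest bin has height 1
        idx = np.digitize(X[:, j], edges[1:-1])      # bin index of each value; the maximum lands in the last bin
        scores += np.log(1.0 / heights[idx])         # denser bin -> lower contribution
    return scores

# Hypothetical data: 50 objects in the unit square and one object far away in the second dimension.
a = np.linspace(0.0, 1.0, 50)
X = np.vstack([np.column_stack([a, a[::-1]]), [0.5, 8.0]])
scores = hbos_scores(X)
print(int(np.argmax(scores)))
```

The last object receives the highest score because it lies in a nearly empty bin of the second dimension. In practice a library implementation, such as the HBOS detector in PyOD, is likely a better choice than this sketch.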
6、 Practice