Extended Isolation Forest for Anomaly Detection

Extended Isolation Forest

This is a simple Python implementation of the Extended Isolation Forest method described in this paper (https://doi.org/10.1109/TKDE.2019.2947676). It is an improvement on the original Isolation Forest algorithm, which is described (among other places) in this paper, for detecting anomalies and outliers in multidimensional data point distributions. An R wrapper around the core Python implementation can be found here.

Summary

The problem of anomaly detection has a wide range of applications in various scientific and other fields. Anomalous data can have as much scientific value as normal data, or in some cases even more, and it is of vital importance to have robust, fast and reliable algorithms to detect and flag such anomalies. Here, we present an extension to the model-free anomaly detection algorithm Isolation Forest (Liu2008). This extension, named Extended Isolation Forest (EIF), improves the consistency and reliability of the anomaly score produced by standard methods for a given data point. We show that the standard Isolation Forest produces inconsistent anomaly score maps, and that these score maps suffer from an artifact produced by the way the criterion for the branching operation of the binary tree is selected.

Our method allows the data to be sliced using hyperplanes with random slopes, which results in improved score maps. The consistency and reliability of the algorithm are much improved by this extension. Here we show the need for an improvement on the source algorithm to improve the scoring of anomalies and the robustness of the score maps, especially around the edges of nominal data. We discuss the sources of the problem, and we present an efficient way of choosing these hyperplanes, which gives rise to multiple extension levels in the case of higher dimensional data. The standard Isolation Forest is therefore a special case of the Extended Isolation Forest as presented here. For an N dimensional dataset, Extended Isolation Forest has N levels of extension, with 0 being identical to the case of standard Isolation Forest, and N-1 being the fully extended version.

Motivation

Figure 1: Example training data. a) Normally distributed cluster. b) Two normally distributed clusters. c) Sinusoidal data points with Gaussian noise.

While various techniques exist for approaching anomaly detection, Isolation Forest (Liu2008) is one with unique capabilities. This algorithm can readily work on high dimensional data, it is model free, and it scales well. It is therefore highly desirable and easy to use. However, looking at score maps for some basic examples, we can see that the anomaly scores produced by the standard Isolation Forest are inconsistent. To see this, we look at the three examples shown in Figure 1.

In each case, we use the data to train our Isolation Forest. We then use the trained models to score a square grid of uniformly distributed data points, which results in the score maps shown in Figure 2. Because the example data are so simple, we have an intuition about what the score maps should look like. For example, for the data shown in Figure 1a, we expect to see low anomaly scores in the center of the map, while the anomaly score should increase as we move radially away from the center. The same reasoning applies to the other two examples.

Looking at the score maps produced by the standard Isolation Forest shown in Figure 2, we can clearly see the inconsistencies in the scores. While there is a region of low anomaly score in the center of Figure 2a, there are also regions aligned with the x and y axes, passing through the origin, that have lower anomaly scores than the four corners of the region. Based on our intuitive understanding of the data, this cannot be correct. A similar phenomenon is observed in Figure 2b. In this case, the problem is amplified. Since there are two clusters, the artificially low anomaly score regions intersect close to the points (0,0) and (10,10), and create low anomaly score regions where there is no data. It is immediately obvious how this can be problematic. As for the third example, Figure 2c shows that the structure of the data is completely lost. The sinusoidal shape is essentially treated as one rectangular blob.

Figure 2: Score maps using the Standard Isolation Forest for the points from Figure 1. We can see the bands and artifacts on these maps

Isolation Forest

Given a dataset of dimension N, the algorithm chooses a random sub-sample of data to construct a binary tree. The branching process of the tree occurs by selecting a random dimension x_i with i in {1,2,...,N} of the data (a single variable). It then selects a random value v within the minimum and maximum values in that dimension. If a given data point possesses a value smaller than v for dimension x_i, then that point is sent to the left branch, otherwise it is sent to the right branch. In this manner the data on the current node of the tree is split in two. This process of branching is performed recursively over the dataset until a single point is isolated, or a predetermined depth limit is reached. The process begins again with a new random sub-sample to build another randomized tree. After building a large ensemble of trees, i.e. a forest, the training is complete.
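The branching procedure described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code shipped in this package: the function names `build_tree` and `build_forest`, the dictionary-based node layout, and the choice to stop splitting when the selected dimension has no spread are all assumptions made for the example.

```python
import random

def build_tree(points, depth, depth_limit, rng):
    """Recursively build one isolation tree using axis-parallel cuts."""
    # Stop when the node isolates a single point or the depth limit is reached.
    if len(points) <= 1 or depth >= depth_limit:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))           # random dimension x_i
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:                                  # no spread: treat as a leaf (simplification)
        return {"size": len(points)}
    v = rng.uniform(lo, hi)                       # random value between min and max
    left = [p for p in points if p[dim] < v]      # smaller than v goes left
    right = [p for p in points if p[dim] >= v]    # everything else goes right
    return {
        "dim": dim,
        "value": v,
        "left": build_tree(left, depth + 1, depth_limit, rng),
        "right": build_tree(right, depth + 1, depth_limit, rng),
    }

def build_forest(data, n_trees, sample_size, depth_limit, seed=0):
    """Build an ensemble of trees, each from a fresh random sub-sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = rng.sample(data, min(sample_size, len(data)))
        forest.append(build_tree(sample, 0, depth_limit, rng))
    return forest

# Small usage example on a synthetic 2D cluster.
data_rng = random.Random(1)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
forest = build_forest(data, n_trees=10, sample_size=64, depth_limit=8)
```

Each internal node stores only the chosen dimension and split value, which is all that is needed later to route a point through the tree.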

During the scoring step, a new candidate data point (or one chosen from the data used to create the trees) is run through all the trees, and an ensemble anomaly score is assigned based on the depth the point reaches in each tree. Figure 3 shows a schematic example of a tree and a forest plotted radially.

Figure 3: a) Shows an example tree formed from the example data, while b) shows the forest generated, where each tree is represented by a radial line from the center to the outer circle. Anomalous points (shown in red) are isolated very quickly, which means they reach shallower depths than nominal points (shown in blue).
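To make the link between depth and score concrete, the sketch below turns the depths a point reaches across the trees into a single score. It assumes the normalization from the original Isolation Forest paper, where the score is 2^(-E[h(x)] / c(n)) and c(n) is the average path length of an unsuccessful search in a binary search tree of n points. The function names `c_factor` and `anomaly_score` are made up for this example and are not part of this package's interface.

```python
import math

EULER_GAMMA = 0.5772156649

def c_factor(n):
    """Average unsuccessful-search path length in a binary search tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA      # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(path_lengths, sample_size):
    """Combine per-tree depths into one score in (0, 1]; higher means more anomalous."""
    mean_depth = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_depth / c_factor(sample_size))

# A point that is isolated after a few cuts scores higher than one that goes deep.
shallow = anomaly_score([2, 3, 2, 4], sample_size=256)
deep = anomaly_score([9, 10, 11, 10], sample_size=256)
```

With this normalization, a point whose mean depth equals c(n) gets a score of 0.5, and scores approach 1 as the mean depth approaches zero.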

It turns out that the splitting process described above is the main source of the bias observed in the score maps. Figure 4 shows the process described above for each of the examples considered thus far. The branch cuts are always parallel to the axes, and as a result, over the construction of many trees, regions of the domain that contain no data points receive superfluous branch cuts.

Figure 4: Splitting of data in the domain during the process of construction of one tree.

Extension

The Extended Isolation Forest remedies this problem by allowing the branching process to occur in every direction. The process of choosing branch cuts is altered so that at each node, instead of choosing a random feature along with a random value, we choose a random normal vector along with a random intercept point.
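A minimal sketch of this modified branching rule is shown below, assuming the convention that a point x goes to the left branch when (x - p) · n <= 0, where n is the random normal vector and p the random intercept point. The extension level is assumed to work by zeroing out N - level - 1 randomly chosen components of n, so that level 0 reproduces an axis-parallel cut. The helper names `random_hyperplane` and `split` are hypothetical and are not taken from this package.

```python
import random

def random_hyperplane(points, extension_level, rng):
    """Draw a random normal vector and a random intercept point for one node."""
    dim = len(points[0])
    # Random direction: each component drawn from a standard normal.
    normal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Zero out dim - extension_level - 1 components; level 0 leaves one
    # non-zero component, i.e. an axis-parallel cut.
    for i in rng.sample(range(dim), dim - extension_level - 1):
        normal[i] = 0.0
    # Intercept drawn uniformly inside the bounding box of the node's data.
    intercept = [
        rng.uniform(min(p[i] for p in points), max(p[i] for p in points))
        for i in range(dim)
    ]
    return normal, intercept

def split(points, normal, intercept):
    """Send each point left if (x - p) . n <= 0, otherwise right."""
    left, right = [], []
    for x in points:
        side = sum((xi - pi) * ni for xi, pi, ni in zip(x, intercept, normal))
        if side <= 0:
            left.append(x)
        else:
            right.append(x)
    return left, right

# Usage: one fully extended cut (level N - 1) on a small 3D data set.
rng = random.Random(0)
pts = [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
n_vec, p_vec = random_hyperplane(pts, extension_level=2, rng=rng)
left_pts, right_pts = split(pts, n_vec, p_vec)
```

Plugging this rule into the recursive tree construction in place of the single-feature comparison is the only change needed to move from the standard to the extended algorithm.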

Figure 5 shows the resulting branch cuts in the domain for each of our examples.

Figure 5: Same as Figure 4 but using Extended Isolation Forest

We can see that the region is divided much more uniformly, and without the bias-introducing effects of the coordinate system. As in the case of the standard Isolation Forest, the anomaly score is computed from the aggregated depth that a given point reaches on each iTree.

As we see in Figure 6, these modifications completely fix the issue with the score maps that we saw before and produce reliable results. Clearly, these score maps are a much better representation of anomaly score distributions.

Figure 6: Score maps using the Extended Isolation Forest.

Figure 7 shows a very simple example of anomalies and nominal points from the single-blob example shown in Figure 1a. It also shows the distribution of the anomaly scores, which can be used to make hard cuts on the definition of anomalies or even to assign probabilities to each point.

Figure 7: a) Shows the dataset used, some sample anomalous data points discovered using the algorithm are highlighted in black. We also highlight some nominal points in red. In b), we have the distribution of anomaly scores obtained by the algorithm.
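As a rough illustration of making a hard cut on the score distribution, the sketch below flags points either above a fixed score threshold or within the highest-scoring fraction. The helper names `flag_above` and `flag_top_fraction`, the example scores, and the cut values are all invented for this example; a suitable cut depends on the data.

```python
def flag_above(scores, threshold):
    """Indices of points whose anomaly score exceeds a fixed threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]

def flag_top_fraction(scores, fraction):
    """Indices of the highest-scoring fraction of points (at least one point)."""
    n_flag = max(1, int(round(fraction * len(scores))))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:n_flag])

# Made-up scores for ten points; two of them stand out.
scores = [0.41, 0.39, 0.72, 0.40, 0.65, 0.38, 0.42, 0.44, 0.37, 0.43]
by_threshold = flag_above(scores, threshold=0.6)
by_fraction = flag_top_fraction(scores, fraction=0.2)
```

A fixed threshold is easier to interpret across data sets, while a fraction-based cut guarantees a known number of flagged points.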

The Code

Here we provide the source code for the algorithm as well as documented example notebooks to help get started. Various visualizations are provided such as score distributions, score maps, aggregate slicing of the domain, and tree and whole forest visualizations. Most examples are in 2D. We present one 3D example. However, the algorithm works readily with higher dimensional data.

Installation

pip install eif

or directly from the repository

pip install git+https://github.com/sahandha/eif.git

Alternatively, you can install the eif R package from here, which provides an R wrapper around the core Python implementation.

Requirements

  • numpy
  • cython

No other packages are required. The package also contains means to draw the trees it creates using the igraph library. See the example for tree visualizations.

Use

See these notebooks for examples of how to use it.

Citation

If you use this code and method, please consider using the following reference:

A link to the paper can be found here.

@ARTICLE{8888179,
author={S. {Hariri} and M. {Carrasco Kind} and R. J. {Brunner}},
journal={IEEE Transactions on Knowledge and Data Engineering},
title={Extended Isolation Forest},
year={2019},
volume={},
number={},
pages={1-1},
keywords={Forestry;Vegetation;Distributed databases;Anomaly detection;Standards;Clustering algorithms;Heating systems;Anomaly Detection;Isolation Forest},
doi={10.1109/TKDE.2019.2947676},
ISSN={},
month={},}

Releases

v2.0.2

2019-NOV-14

  • Converted the code to C++ using Cython.
  • Much faster and more efficient forest generation and scoring procedures.
  • Previous implementation renamed; use import eif_old to use the old version.

v1.0.2

2018-OCT-01

  • Release
  • Added documentation, examples and software paper

v1.0.1

2018-AUG-08

  • Bugfix for multidimensional data

v1.0.0

2018-JUL-15

  • Initial Release
Comments
  • Error while installing eif

    Hi!

    Trying to install eif through pip I get the following error:

    
    (base) C:\WINDOWS\system32>pip install eif
    Collecting eif
      Using cached https://files.pythonhosted.org/packages/83/b2/d87d869deeb192ab599c899b91a9ad1d3775d04f5b7adcaf7ff6daa54c24/eif-2.0.2.tar.gz
    Requirement already satisfied: numpy in c:\users\o.korshun\appdata\local\continuum\anaconda3\lib\site-packages (from eif) (1.16.5)
    Requirement already satisfied: cython in c:\users\o.korshun\appdata\local\continuum\anaconda3\lib\site-packages (from eif) (0.29.13)
    Building wheels for collected packages: eif
      Building wheel for eif (setup.py) ... error
      ERROR: Command errored out with exit status 1:
       command: 'C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"'; __file__='"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\O0ADF~1.KOR\AppData\Local\Temp\pip-wheel-kw_2kpwv' --python-tag cp37
           cwd: C:\Users\O0ADF~1.KOR\AppData\Local\Temp\pip-install-1adywqes\eif\
      Complete output (60 lines):
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build\lib.win-amd64-3.7
      copying eif_old.py -> build\lib.win-amd64-3.7
      copying version.py -> build\lib.win-amd64-3.7
      running egg_info
      writing eif.egg-info\PKG-INFO
      writing dependency_links to eif.egg-info\dependency_links.txt
      writing requirements to eif.egg-info\requires.txt
      writing top-level names to eif.egg-info\top_level.txt
      reading manifest file 'eif.egg-info\SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      writing manifest file 'eif.egg-info\SOURCES.txt'
      running build_ext
      cythoning _eif.pyx to _eif.cpp
      building 'eif' extension
      creating build\temp.win-amd64-3.7
      creating build\temp.win-amd64-3.7\Release
      C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\Library\mingw-w64\bin\gcc.exe -mdll -O -Wall -DMS_WIN64 -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\include -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\include -c _eif.cpp -o build\temp.win-amd64-3.7\Release\_eif.o -Wcpp
      In file included from C:/Users/o.korshun/AppData/Local/Continuum/anaconda3/Library/mingw-w64/include/c++/5.3.0/random:35:0,
                       from eif.hxx:5,
                       from _eif.cpp:614:
      C:/Users/o.korshun/AppData/Local/Continuum/anaconda3/Library/mingw-w64/include/c++/5.3.0/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
       #error This file requires compiler and library support for the \
        ^
      In file included from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/ndarraytypes.h:1822:0,
                       from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/ndarrayobject.h:12,
                       from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/arrayobject.h:4,
                       from _eif.cpp:612:
      C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/npy_1_7_deprecated_api.h:15:77: note: #pragma message: C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/npy_1_7_deprecated_api.h(14) : Warning Msg: Using deprecated NumPy API, disable it with #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
                                "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION")
                                                                                   ^
      In file included from _eif.cpp:614:0:
      eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
       #define RANDOM_ENGINE std::mt19937_64
                                  ^
      eif.hxx:65:55: note: in expansion of macro 'RANDOM_ENGINE'
               void build_tree (double*, int, int, int, int, RANDOM_ENGINE&, int);
                                                             ^
      eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
       #define RANDOM_ENGINE std::mt19937_64
                                  ^
      eif.hxx:66:44: note: in expansion of macro 'RANDOM_ENGINE'
               Node* add_node (double*, int, int, RANDOM_ENGINE&);
                                                  ^
      eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
       #define RANDOM_ENGINE std::mt19937_64
                                  ^
      eif.hxx:132:63: note: in expansion of macro 'RANDOM_ENGINE'
       inline std::vector<int> sample_without_replacement (int, int, RANDOM_ENGINE&);
                                                                     ^
      _eif.cpp: In function 'PyTypeObject* __Pyx_ImportType(PyObject*, const char*, const char*, size_t, __Pyx_ImportType_CheckSize)':
      _eif.cpp:8085:53: warning: unknown conversion type character 'z' in format [-Wformat=]
                   module_name, class_name, size, basicsize);
                                                           ^
      _eif.cpp:8085:53: warning: unknown conversion type character 'z' in format [-Wformat=]
      _eif.cpp:8085:53: warning: too many arguments for format [-Wformat-extra-args]
      error: command 'C:\\Users\\o.korshun\\AppData\\Local\\Continuum\\anaconda3\\Library\\mingw-w64\\bin\\gcc.exe' failed with exit status 1
      ----------------------------------------
      ERROR: Failed building wheel for eif
      Running setup.py clean for eif
    Failed to build eif
    Installing collected packages: eif
        Running setup.py install for eif ... error
        ERROR: Command errored out with exit status 1:
         command: 'C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"'; __file__='"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\O0ADF~1.KOR\AppData\Local\Temp\pip-record-yqa9lmac\install-record.txt' --single-version-externally-managed --compile
             cwd: C:\Users\O0ADF~1.KOR\AppData\Local\Temp\pip-install-1adywqes\eif\
        Complete output (60 lines):
        running install
        running build
        running build_py
        creating build
        creating build\lib.win-amd64-3.7
        copying eif_old.py -> build\lib.win-amd64-3.7
        copying version.py -> build\lib.win-amd64-3.7
        running egg_info
        writing eif.egg-info\PKG-INFO
        writing dependency_links to eif.egg-info\dependency_links.txt
        writing requirements to eif.egg-info\requires.txt
        writing top-level names to eif.egg-info\top_level.txt
        reading manifest file 'eif.egg-info\SOURCES.txt'
        reading manifest template 'MANIFEST.in'
        writing manifest file 'eif.egg-info\SOURCES.txt'
        running build_ext
        skipping '_eif.cpp' Cython extension (up-to-date)
        building 'eif' extension
        creating build\temp.win-amd64-3.7
        creating build\temp.win-amd64-3.7\Release
        C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\Library\mingw-w64\bin\gcc.exe -mdll -O -Wall -DMS_WIN64 -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\include -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\include -c _eif.cpp -o build\temp.win-amd64-3.7\Release\_eif.o -Wcpp
        In file included from C:/Users/o.korshun/AppData/Local/Continuum/anaconda3/Library/mingw-w64/include/c++/5.3.0/random:35:0,
                         from eif.hxx:5,
                         from _eif.cpp:614:
        C:/Users/o.korshun/AppData/Local/Continuum/anaconda3/Library/mingw-w64/include/c++/5.3.0/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
         #error This file requires compiler and library support for the \
          ^
        In file included from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/ndarraytypes.h:1822:0,
                         from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/ndarrayobject.h:12,
                         from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/arrayobject.h:4,
                         from _eif.cpp:612:
        C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/npy_1_7_deprecated_api.h:15:77: note: #pragma message: C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/npy_1_7_deprecated_api.h(14) : Warning Msg: Using deprecated NumPy API, disable it with #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
                                  "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION")
                                                                                     ^
        In file included from _eif.cpp:614:0:
        eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
         #define RANDOM_ENGINE std::mt19937_64
                                    ^
        eif.hxx:65:55: note: in expansion of macro 'RANDOM_ENGINE'
                 void build_tree (double*, int, int, int, int, RANDOM_ENGINE&, int);
                                                               ^
        eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
         #define RANDOM_ENGINE std::mt19937_64
                                    ^
        eif.hxx:66:44: note: in expansion of macro 'RANDOM_ENGINE'
                 Node* add_node (double*, int, int, RANDOM_ENGINE&);
                                                    ^
        eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
         #define RANDOM_ENGINE std::mt19937_64
                                    ^
        eif.hxx:132:63: note: in expansion of macro 'RANDOM_ENGINE'
         inline std::vector<int> sample_without_replacement (int, int, RANDOM_ENGINE&);
                                                                       ^
        _eif.cpp: In function 'PyTypeObject* __Pyx_ImportType(PyObject*, const char*, const char*, size_t, __Pyx_ImportType_CheckSize)':
        _eif.cpp:8085:53: warning: unknown conversion type character 'z' in format [-Wformat=]
                     module_name, class_name, size, basicsize);
                                                             ^
        _eif.cpp:8085:53: warning: unknown conversion type character 'z' in format [-Wformat=]
        _eif.cpp:8085:53: warning: too many arguments for format [-Wformat-extra-args]
        error: command 'C:\\Users\\o.korshun\\AppData\\Local\\Continuum\\anaconda3\\Library\\mingw-w64\\bin\\gcc.exe' failed with exit status 1
        ----------------------------------------
    ERROR: Command errored out with exit status 1: 'C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"'; __file__='"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\O0ADF~1.KOR\AppData\Local\Temp\pip-record-yqa9lmac\install-record.txt' --single-version-externally-managed --compile Check the logs for full command output.
    
    
    opened by PoradaKev 25
  • Can the extension concept be applied to Gradient Boosted Machines?

    Hi there,

    This might be a dumb question.

    I was curious whether the "extension" concept that you introduce can be applied to supervised algorithms such as Gradient Boosted Trees. There are several widely known implementations, like XGBoost or LightGBM. All of these GBTs also suffer from "box"-like decision boundaries. I believe it would be great to see GBTs create decision boundaries the way your Extended Isolation Forest does.

    What do you guys think?

    Feel free to close this issue, since it's not a real issue, just a discussion.

    opened by alfian777 5
  • Installation problem

    Hello, I'm trying to install this package, but I'm getting error messages and can't install it. Can you help?

    Windows 10

    (base) C:\Users\quirosgu>pip install eif
    Collecting eif
      Using cached eif-2.0.2.tar.gz (1.6 MB)
    Requirement already satisfied: numpy in c:\users\quirosgu\anaconda3\lib\site-packages (from eif) (1.18.5)
    Requirement already satisfied: cython in c:\users\quirosgu\anaconda3\lib\site-packages (from eif) (0.29.21)
    Building wheels for collected packages: eif
      Building wheel for eif (setup.py) ... error
      ERROR: Command errored out with exit status 1:
       command: 'C:\Users\quirosgu\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"'; file='"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\quirosgu\AppData\Local\Temp\pip-wheel-6t9epked'
           cwd: C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif
      Complete output (19 lines):
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build\lib.win32-3.8
      copying eif_old.py -> build\lib.win32-3.8
      copying version.py -> build\lib.win32-3.8
      running egg_info
      writing eif.egg-info\PKG-INFO
      writing dependency_links to eif.egg-info\dependency_links.txt
      writing requirements to eif.egg-info\requires.txt
      writing top-level names to eif.egg-info\top_level.txt
      reading manifest file 'eif.egg-info\SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      writing manifest file 'eif.egg-info\SOURCES.txt'
      running build_ext
      cythoning _eif.pyx to _eif.cpp
      building 'eif' extension
      error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/

      ERROR: Failed building wheel for eif
      Running setup.py clean for eif
    Failed to build eif
    Installing collected packages: eif
        Running setup.py install for eif ... error
        ERROR: Command errored out with exit status 1:
         command: 'C:\Users\quirosgu\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"'; file='"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\quirosgu\AppData\Local\Temp\pip-record-fjpa9g_k\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\quirosgu\Anaconda3\Include\eif'
             cwd: C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif
        Complete output (19 lines):
        running install
        running build
        running build_py
        creating build
        creating build\lib.win32-3.8
        copying eif_old.py -> build\lib.win32-3.8
        copying version.py -> build\lib.win32-3.8
        running egg_info
        writing eif.egg-info\PKG-INFO
        writing dependency_links to eif.egg-info\dependency_links.txt
        writing requirements to eif.egg-info\requires.txt
        writing top-level names to eif.egg-info\top_level.txt
        reading manifest file 'eif.egg-info\SOURCES.txt'
        reading manifest template 'MANIFEST.in'
        writing manifest file 'eif.egg-info\SOURCES.txt'
        running build_ext
        skipping '_eif.cpp' Cython extension (up-to-date)
        building 'eif' extension
        error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
        ----------------------------------------
    ERROR: Command errored out with exit status 1: 'C:\Users\quirosgu\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"'; file='"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\quirosgu\AppData\Local\Temp\pip-record-fjpa9g_k\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\quirosgu\Anaconda3\Include\eif' Check the logs for full command output.

    In this case, I already installed all the required dependencies (MVC++, etc.), but the problem continues.

    I tried to reproduce it on another Windows machine and it does not work there either; on a Linux-based system, by contrast, it does work.

    opened by luigiquiros 3
  • PR for Parallelization and Reduce Memory

    Hello,

    For high dimensional datasets, I'm finding that multi-processing parallelization can speed things up a bit. I also find that storing the original data in each Node and each iTree consumes a lot of memory needlessly. Would you be open to reviewing a pull request (or pull requests) that addressed both of these items? If so, would you accept them bundled together as one PR, or would you like them separated?

    Thanks

    opened by pford221 3
  • Use in novelty detection/one-class classification

    From what I understand, your api doesn't distinguish between constructing the trees and querying to obtain scores (like the fit/predict methods of scikit-learn), is that correct?

    So it's not currently possible to use this implementation for novelty detection/one-class classification, where the training set is different from the test set?

    opened by oulenz 2
  • Scoring takes too long

    My training and validation data are of similar size (about 1,500,000 rows and 11 features). Model building took very little time, even with full extension. But when scoring the validation data using compute_paths, the function has been running for close to 15 hours and scoring is still not done. Is there some way to speed up the scoring process?

    opened by thedarklord310780 2
  • Add Arxiv paper to readme

    Thanks for providing this code. Please add mention of and a link to your associated Arxiv paper into the repo's readme. The link is https://arxiv.org/abs/1811.02141

    opened by impredicative 2
  • setting ExtensionLevel

    If I understand the paper correctly, we obtain the full EIF approach by setting ExtensionLevel equal to the number of dimensions of the data minus 1, correct?

    opened by oulenz 1
  • Small fix install progress

    One of the extra compile arguments in setup.py seemed to prevent successful installation on multiple systems. Simply removing this argument seems to resolve this with no negative implications. The argument seems to try to force the compiler to run in C++11 mode. I am unsure whether this problem was even present on the tested systems.

    opened by Dainean 1
  • Update eif.py

    Goal: more convenient usage. Inspired by the tutorial document, I added two functions to iForest, outlier_pred and outlier_index, which return the outlier prediction index and label matrix.

    opened by MaiRajborirug 0
  • How to save the eif Model?

    I am trying to save the model using pickle.dump(), but this is not working. How do I save the eif model? Please provide me with a solution, as I am stuck on this problem. Thank you.

    opened by SanthanaMano 0
  • module 'eif' has no attribute '__version__'

    I installed eif with "pip install eif", which reported "Successfully installed eif-2.0.2", but when I use eif.iForest I get AttributeError: module 'eif' has no attribute '__version__'.

    opened by wererLinC 0
  • I can't install eif 2.0.2, please tell me the reason

    (base) C:\Users\22393\eif-2.0.2\eif-2.0.2>python setup.py install
    running install
    running bdist_egg
    running egg_info
    writing eif.egg-info\PKG-INFO
    writing dependency_links to eif.egg-info\dependency_links.txt
    writing requirements to eif.egg-info\requires.txt
    writing top-level names to eif.egg-info\top_level.txt
    reading manifest file 'eif.egg-info\SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'eif.egg-info\SOURCES.txt'
    installing library code to build\bdist.win-amd64\egg
    running install_lib
    running build_py
    running build_ext
    skipping '_eif.cpp' Cython extension (up-to-date)
    building 'eif' extension
    C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IE:\ProgramFiles\anaconda\lib\site-packages\numpy\core\include -IE:\ProgramFiles\anaconda\include -IE:\ProgramFiles\anaconda\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" /EHsc /Tp_eif.cpp /Fobuild\temp.win-amd64-3.8\Release_eif.obj -Wcpp
    cl : Command line error D8021 : invalid numeric argument '/Wcpp'
    error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\cl.exe' failed with exit status 2

    opened by whmwhm123 0
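
The log shows the GCC-style flag -Wcpp being passed to MSVC's cl.exe, which reads it as a warning-level option and rejects it with error D8021. One possible way to avoid this is to choose the extra compile arguments per platform in setup.py. The sketch below is an assumption about how that could be done, not the project's actual build script; `extra_compile_args_for` is a hypothetical helper, and only the `-Wcpp` flag itself is taken from the log.

```python
import sys


def extra_compile_args_for(platform=None):
    """Pick extra compiler flags for the current platform.

    Assumption: on Windows the extension is built with MSVC (cl.exe),
    which does not understand GCC-style flags such as -Wcpp, so no
    extra flags are passed there.
    """
    platform = sys.platform if platform is None else platform
    if platform.startswith("win"):
        return []
    return ["-Wcpp"]


# The resulting list could then be handed to setuptools.Extension via its
# extra_compile_args argument inside setup.py.
windows_args = extra_compile_args_for("win32")
linux_args = extra_compile_args_for("linux")
```

Note that a GCC-based toolchain on Windows (for example MinGW) would accept the flag, so keying on the compiler rather than the platform may be the more precise choice.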
  • Unable to install eif2.0.2

    Dear Team, I am getting the error below while trying to install eif 2.0.2. Methods tried:

    1. pip install eif
    2. Downloaded eif tar file from pypi.org and tried installing
    3. Downloaded the code from github and tried installing
    4. Edited the setup.py file as suggested in one of the issues (removed the extra_compile line) and ran the installation again

    All of the above methods failed. Below is the error:

    ERROR: Complete output from command 'C:\Anaconda3\python.exe' -u -c 'import setuptools, tokenize;file='"'"'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-req-build-rqacf45o\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-wheel-wjzxwp64' --python-tag cp37:
    ERROR: running bdist_wheel
    running build
    running build_py
    creating build
    creating build\lib.win-amd64-3.7
    copying eif_old.py -> build\lib.win-amd64-3.7
    copying version.py -> build\lib.win-amd64-3.7
    running egg_info
    writing eif.egg-info\PKG-INFO
    writing dependency_links to eif.egg-info\dependency_links.txt
    writing requirements to eif.egg-info\requires.txt
    writing top-level names to eif.egg-info\top_level.txt
    reading manifest file 'eif.egg-info\SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'eif.egg-info\SOURCES.txt'
    running build_ext
    cythoning _eif.pyx to _eif.cpp
    building 'eif' extension
    creating build\temp.win-amd64-3.7
    creating build\temp.win-amd64-3.7\Release
    C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Anaconda3\lib\site-packages\numpy\core\include -IC:\Anaconda3\include -IC:\Anaconda3\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\8.1\include\shared" "-IC:\Program Files (x86)\Windows Kits\8.1\include\um" "-IC:\Program Files (x86)\Windows Kits\8.1\include\winrt" /EHsc /Tp_eif.cpp /Fobuild\temp.win-amd64-3.7\Release_eif.obj -Wcpp
    cl : Command line error D8021 : invalid numeric argument '/Wcpp'
    error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe' failed with exit status 2

    ERROR: Failed building wheel for eif
    Running setup.py clean for eif
    Failed to build eif
    Installing collected packages: eif
    Running setup.py install for eif ... error
    ERROR: Complete output from command 'C:\Anaconda3\python.exe' -u -c 'import setuptools, tokenize;file='"'"'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-req-build-rqacf45o\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-record-f8wv7_fl\install-record.txt' --single-version-externally-managed --compile:
    ERROR: running install
    running build
    running build_py
    creating build
    creating build\lib.win-amd64-3.7
    copying eif_old.py -> build\lib.win-amd64-3.7
    copying version.py -> build\lib.win-amd64-3.7
    running egg_info
    writing eif.egg-info\PKG-INFO
    writing dependency_links to eif.egg-info\dependency_links.txt
    writing requirements to eif.egg-info\requires.txt
    writing top-level names to eif.egg-info\top_level.txt
    reading manifest file 'eif.egg-info\SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'eif.egg-info\SOURCES.txt'
    running build_ext
    skipping '_eif.cpp' Cython extension (up-to-date)
    building 'eif' extension
    creating build\temp.win-amd64-3.7
    creating build\temp.win-amd64-3.7\Release
    C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Anaconda3\lib\site-packages\numpy\core\include -IC:\Anaconda3\include -IC:\Anaconda3\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\8.1\include\shared" "-IC:\Program Files (x86)\Windows Kits\8.1\include\um" "-IC:\Program Files (x86)\Windows Kits\8.1\include\winrt" /EHsc /Tp_eif.cpp /Fobuild\temp.win-amd64-3.7\Release_eif.obj -Wcpp
    cl : Command line error D8021 : invalid numeric argument '/Wcpp'
    error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe' failed with exit status 2
    ----------------------------------------
    ERROR: Command "'C:\Anaconda3\python.exe' -u -c 'import setuptools, tokenize;file='"'"'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-req-build-rqacf45o\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-record-f8wv7_fl\install-record.txt' --single-version-externally-managed --compile" failed with error code 1 in C:\Users\XSVIJA~1\AppData\Local\Temp\pip-req-build-rqacf45o\

    Please help.

    opened by botlavijaykumar 1
  • Effect of feature scaling

    Hi, thanks for the great package (and example notebooks!). My issue is summarised in two points:

    • It appears that feature scale influences the orientation of the hyperplane splits in the trees, resulting in a poor anomaly score map.
    • Is this expected behaviour? If so, can anyone explain how it comes about, since the paper suggests that the orientation of every hyperplane is random?

    The following illustrates this further:

    I have noticed that the extended forest shows odd results when applied to features with very different scales. For example, if I draw 2D points from two normal distributions with variances 1 and 1000 and plot contour maps comparing the regular iForest with the extended one, the contours of the extended forest become horizontal and its heat map is generally worse than that of the regular iForest. [image]

    It seems as though the choice of hyperplane is biased towards horizontal lines. This is also noticeable in the examples given in the paper (figure 9), where 3 plots of tree splits are shown: [image] In the first two examples (a and b), the x and y values of the data lie on the same scale and the splits look randomly oriented. In c), however, the x scale of the data is much larger than the y scale, and most splits look more vertical. As a result, we see areas of higher anomaly score above and below the point cloud in the resulting heat map: [image]

    This issue is easily fixed by scaling all features before using the forest. However, I was wondering: if the splits are done on hyperplanes of random orientation, why and how does feature scale influence the orientation of the splits in each tree?

    Apologies if I am missing something obvious. Any insight would be useful, thanks!

    opened by felixcaz 0
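
One possible explanation can be checked numerically. The sketch below assumes the split test has the form (x - p) . n <= 0 with the normal n drawn isotropically, as the paper describes; it has not been verified against this implementation. Under that assumption, rewriting the test in standardized units z = x / scale gives an effective normal n * scale, which is dominated by the large-scale feature. Relative to the spread of the data, the splits are then no longer randomly oriented. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two features with variance 1 and 1000, as in the report above.
X = np.column_stack([
    rng.normal(0.0, 1.0, 5000),
    rng.normal(0.0, np.sqrt(1000.0), 5000),
])
scale = X.std(axis=0)

# Isotropic random normals (assumed form of the random hyperplane slopes).
normals = rng.normal(size=(10000, 2))

# (x - p) . n <= 0 in raw units is the same test as (z - q) . (n * scale) <= 0
# in standardized units, so the effective normal is n * scale.
effective = normals * scale


def angle_from_y_axis(n):
    """Angle in degrees between each normal and the y axis, folded into [0, 90]."""
    return np.degrees(np.arctan2(np.abs(n[:, 0]), np.abs(n[:, 1])))


# Share of normals within 10 degrees of the y axis (near-horizontal splits).
raw_share = np.mean(angle_from_y_axis(normals) < 10)
effective_share = np.mean(angle_from_y_axis(effective) < 10)

# The fix mentioned above: standardize every feature before building the forest.
X_scaled = (X - X.mean(axis=0)) / scale
```

With isotropic normals, roughly one in nine lies within 10 degrees of the y axis, whereas the large majority of the effective normals do, which is consistent with the near-horizontal contours described in the report. After standardization both features have unit spread, so the raw and effective normals coincide.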
Releases(v2.0.2)
Owner
Sahand Hariri