A rule learning algorithm for the deduction of syndrome definitions from time series data.

This project provides a rule learning algorithm for the deduction of syndrome definitions from time series data. Large parts of the algorithm are based on "BOOMER".

Features

The algorithm that is provided by this project currently supports the following functionalities for learning descriptive rules:

  • The quality of rules is assessed by comparing the predictions of the current model to the ground truth in terms of the Pearson correlation coefficient.
  • When learning a new rule, random samples of the features may be used.
  • Hyper-parameters that provide control over the specificity/generality of rules are available.
  • The algorithm can natively handle numerical, ordinal and nominal features (without the need for pre-processing techniques such as one-hot encoding).
  • The algorithm is able to deal with missing feature values, i.e., occurrences of NaN in the feature matrix.

In addition, the following features that may speed up training or reduce the memory footprint are currently implemented:

  • Dense or sparse feature matrices can be used for training. The use of sparse matrices may speed up training significantly on some data sets.
  • Multi-threading can be used to parallelize the evaluation of a rule's potential refinements across multiple CPU cores.
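
As an illustration of the quality measure named in the first list above, the following minimal sketch computes the Pearson correlation coefficient between a model's predictions and the ground truth. It is not the project's C++ implementation; the function name `pearson_correlation` and the sample numbers are made up for this example.

```python
import math


def pearson_correlation(predictions, ground_truth):
    """Returns the Pearson correlation coefficient of two equally long sequences."""
    if len(predictions) != len(ground_truth) or len(predictions) < 2:
        raise ValueError('Both sequences must have the same length (at least 2)')
    n = len(predictions)
    mean_p = sum(predictions) / n
    mean_t = sum(ground_truth) / n
    covariance = sum((p - mean_p) * (t - mean_t) for p, t in zip(predictions, ground_truth))
    var_p = sum((p - mean_p) ** 2 for p in predictions)
    var_t = sum((t - mean_t) ** 2 for t in ground_truth)
    if var_p == 0 or var_t == 0:
        # The coefficient is undefined for constant sequences; this sketch returns 0
        return 0.0
    return covariance / math.sqrt(var_p * var_t)


# Hypothetical values: instances covered by the model per week vs. reported cases
covered_per_week = [3, 5, 9, 4, 2]
cases_per_week = [6, 10, 18, 8, 4]
print(pearson_correlation(covered_per_week, cases_per_week))
```

Because the second sequence is an exact multiple of the first one, the printed value is (up to rounding) 1. A coefficient close to 1 indicates that the predictions rise and fall together with the ground truth.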

Project structure

|-- cpp                     Contains the implementation of core algorithms in C++
    |-- subprojects
        |-- common          Contains implementations that all algorithms have in common
        |-- tsa             Contains implementations for time series analysis
    |-- ...
|-- python                  Contains Python code for running experiments
    |-- rl
        |-- common          Contains Python code that is needed to run any kind of algorithms
            |-- cython      Contains commonly used Cython wrappers
            |-- ...
        |-- tsa             Contains Python code for time series analysis
            |-- cython      Contains time series-specific Cython wrappers
            |-- ...
        |-- testbed         Contains useful functionality for running experiments
            |-- ...
    |-- main.py             Can be used to start an experiment
    |-- ...
|-- Makefile                Makefile for compilation
|-- ...

Project setup

The algorithm provided by this project is implemented in C++. In addition, a Python wrapper that implements the scikit-learn API is available. To be able to integrate the underlying C++ implementation with Python, Cython is used.

The C++ implementation, as well as the Cython wrappers, must be compiled in order to be able to run the provided algorithm. To facilitate compilation, this project comes with a Makefile that automatically executes the necessary steps.

First, a virtual Python environment can be created via the following command:

make venv

As a prerequisite, Python 3.7 (or a more recent version) must be available on the host system. All compile-time dependencies (numpy, scipy, Cython, meson and ninja) that are required for building the project will automatically be installed into the virtual environment. As a result of executing the above command, a subdirectory venv should have been created within the project's root directory.

Afterwards, the compilation can be started by executing the following command:

make compile

Finally, the library must be installed into the virtual environment, together with all of its runtime dependencies (e.g. scikit-learn; a full list can be found in setup.py). For this purpose, the project's Makefile provides the following command:

make install

Whenever any C++ or Cython source files have been modified, they must be recompiled by running the command make compile again! If compilation files already exist, only the modified files will be recompiled.

Cleanup: To get rid of any compilation files, as well as of the virtual environment, the following command can be used:

make clean

For more fine-grained control, the command make clean_venv (for deleting the virtual environment) or make clean_compile (for deleting the compiled files) can be used. If only the compiled Cython files should be removed, the command make clean_cython can be used. Accordingly, the command make clean_cpp removes the compiled C++ files.

Parameters

The file python/main.py can be used to run experiments on a specific data set using different configurations of the learning algorithm. The implementation takes care of writing the experimental results into .csv files. Optionally, the learned model can be stored on disk for later reuse.

When running an experiment, the following command line arguments can be provided (most of them are optional):

| Parameter | Optional? | Default | Description |
|---|---|---|---|
| --data-dir | No | None | The path of the directory where the data sets are located. |
| --temp-dir | No | None | The path of the directory where temporary files should be saved. |
| --dataset | No | None | The name of the .csv files that store the raw data (without suffix). |
| --feature-definition | No | None | The name of the .txt file that specifies the names of the features to be used (without suffix). |
| --from-year | No | None | The first year (inclusive) that should be taken into account. |
| --to-year | No | None | The last year (inclusive) that should be taken into account. |
| --from-week | Yes | -1 | The first week (inclusive) of the first year that should be taken into account or -1, if all weeks of that year should be used. |
| --to-week | Yes | -1 | The last week (inclusive) of the last year that should be taken into account or -1, if all weeks of that year should be used. |
| --count-file-name | Yes | None | The name of the file that stores the number of cases that correspond to individual weeks (without suffix). If not specified, the name that results from appending "_counts" to the dataset name is used. |
| --one-hot-encoding | Yes | False | True, if one-hot-encoding should be used for nominal attributes, False otherwise. |
| --output-dir | Yes | None | The path of the directory into which the experimental results (.csv files) should be written. |
| --print-rules | Yes | True | True, if the induced rules should be printed on the console, False otherwise. |
| --store-rules | Yes | True | True, if the induced rules should be stored as a .txt file, False otherwise. Only has an effect if the parameter --output-dir is specified. |
| --print-options | Yes | {} | A dictionary that specifies additional options to be used for printing or storing rules, if the parameter --print-rules and/or --store-rules is set to True, e.g. {'print_feature_names':True,'print_label_names':True,'print_nominal_values':True}. |
| --store-predictions | Yes | True | True, if the predictions for the training data should be stored as a .csv file, False otherwise. Only has an effect if the parameter --output-dir is specified. |
| --model-dir | Yes | None | The path of the directory where models (.model files) are located. |
| --max-rules | Yes | 50 | The maximum number of rules to be induced or -1, if the number of rules should not be restricted. |
| --time-limit | Yes | -1 | The duration in seconds after which the induction of rules should be canceled or -1, if no time limit should be used. |
| --feature-sub-sampling | Yes | None | The name of the strategy to be used for feature sub-sampling. Must be random-feature-selection or None. Additional arguments may be provided as a dictionary, e.g. random-feature-selection{'sample_size':0.5}. |
| --min-support | Yes | 0.0001 | The fraction of training examples that must be covered by a rule. Must be greater than 0 and smaller than 1. |
| --max-conditions | Yes | -1 | The maximum number of conditions to be included in a rule's body. Must be at least 1 or -1, if the number of conditions should not be restricted. |
| --random-state | Yes | 1 | The seed to be used by random number generators. |
| --feature-format | Yes | auto | The format to be used for the feature matrix. Must be sparse, if a sparse matrix should be used, dense, if a dense matrix should be used, or auto, if the format should be chosen automatically. |
| --num-threads-refinement | Yes | 1 | The number of threads to be used to search for potential refinements of rules. Must be at least 1 or -1, if the number of cores that are available on the machine should be used. |
| --log-level | Yes | info | The log level to be used. Must be debug, info, warn, warning, error, critical, fatal or notset. |
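
The value of --feature-sub-sampling consists of a strategy name that may be followed by a dictionary of additional arguments. The following sketch shows one way such a value could be split into its parts and how a random feature sample could then be drawn. It is an illustration only, not the parser or sampler used by python/main.py: the functions `parse_strategy` and `sample_features` are hypothetical, and interpreting `sample_size` as the fraction of features to keep is an assumption.

```python
import ast
import math
import random


def parse_strategy(value):
    """Splits a value like "name{'key':value}" into the name and a dictionary of arguments."""
    if value is None or value == 'None':
        return None, {}
    brace = value.find('{')
    if brace < 0:
        return value, {}
    name, arguments = value[:brace], ast.literal_eval(value[brace:])
    if not isinstance(arguments, dict):
        raise ValueError('Additional arguments must be given as a dictionary')
    return name, arguments


def sample_features(feature_names, sample_size, random_state):
    """Draws a random subset of the features (assumption: sample_size is a fraction)."""
    num_samples = max(1, math.ceil(sample_size * len(feature_names)))
    # Seeding the generator makes the sample reproducible, as with --random-state
    return random.Random(random_state).sample(feature_names, num_samples)


name, arguments = parse_strategy("random-feature-selection{'sample_size':0.5}")
print(name, arguments)
if name == 'random-feature-selection':
    # The feature names are made up for this example
    print(sample_features(['age', 'fever', 'cough', 'rash'], arguments['sample_size'], 1))
```

Using `ast.literal_eval` rather than `eval` restricts the dictionary to plain Python literals, which is a reasonable precaution for values that come from the command line.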

Example and data format

In the following, we give a more detailed description of the data that must be provided to the algorithm. All input files must use UTF-8 encoding and they must be available in a single directory. The path of the directory must be specified via the parameter --data-dir. The following files must be included in the directory:

  • A .csv file that stores the raw training data (see data/example.csv for an example). Each row (separated by line breaks) must correspond to an individual instance and the columns (separated by commas) must correspond to the available features. The names of the columns/features must be given as the first row. The names of columns can be arbitrary, but there must be a column named "week" that associates each instance with a corresponding year and week (using the format year-week, e.g. 2019-2).
  • A .csv file that specifies the number of cases that correspond to individual weeks (see data/example_counts.csv for an example). The file must consist of three columns, year,week,cases, separated by commas. The names of columns must be given as the first row. Each of the other rows (separated by line breaks) assigns a specific number of cases to a certain week of a year (all values must be positive integers). For each combination of year and week that occurs in the column "week" of the first .csv file, the number of cases must be specified in this second .csv file.
  • A .txt file that specifies the names of the features that should be taken into account (see data/features.txt for an example). Each feature name must be given as a new line. For each feature that is specified in the text file, a column with the same name must exist in the first .csv file.
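
The requirement that every week in the raw data has a case count can be checked before starting an experiment. The following sketch does so with Python's csv module. It is not part of the project: the function `find_weeks_without_counts`, the feature columns and the file contents are made up, and values of the "week" column are assumed to look like 2019-2.

```python
import csv
import io


def find_weeks_without_counts(data_file, counts_file):
    """Returns the (year, week) pairs that occur in the data but have no case count."""
    counted = {(int(row['year']), int(row['week'])) for row in csv.DictReader(counts_file)}
    missing = set()
    for row in csv.DictReader(data_file):
        # Split a value such as "2019-2" into year and week
        year, week = (int(part) for part in row['week'].split('-'))
        if (year, week) not in counted:
            missing.add((year, week))
    return sorted(missing)


# In-memory stand-ins for the two .csv files; "age" and "fever" are illustrative features
data = io.StringIO('week,age,fever\n2019-1,34,yes\n2019-2,71,no\n2019-3,25,yes\n')
counts = io.StringIO('year,week,cases\n2019,1,12\n2019,2,7\n')
print(find_weeks_without_counts(data, counts))  # prints [(2019, 3)]
```

With real data, the two `io.StringIO` objects would be replaced by files opened with `open(path, encoding='utf-8', newline='')`.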

The parameter --dataset is used to identify the .csv files that should be used by the algorithm. Its value must correspond to the name of the first .csv file mentioned above, omitting the file's suffix (e.g. example if the file's name is example.csv). The second .csv file must be named accordingly by appending the suffix _counts to the name of the first file (e.g. example_counts.csv). The parameter --feature-definition is used to specify the name of the text file that stores the names of relevant features. The given value must correspond to the name of the text file, again omitting the file's suffix (e.g. features, if the file's name is features.txt).

The following command runs an experiment, specifying all mandatory parameters:

venv/bin/python3 python/main.py --data-dir /path/to/data/ --temp-dir /path/to/temp/ --dataset example --feature-definition features --from-year 2018 --to-year 2019

When running the program for the first time, the .csv files that are located in the specified data directory will be loaded. The data will be filtered according to the parameters --from-year and --to-year, such that only instances that belong to the specified timespan are retained. Furthermore, all columns that are missing from the supplied text file will be removed. Finally, the data is converted into the format that is required for learning a rule model. This results in two files (an .arff file and a .xml file) that are written to the directory that is specified via the parameter --temp-dir. The resulting files are named according to the following scheme: <dataset>_<feature-definition>_<from-year>-<to-year> (e.g., example_features_2018-2019). When running the program multiple times, it will check if the files already exist. If this is the case, the preprocessing step will be skipped and the available files will be used as they are.
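
The naming scheme and the check for existing files can be sketched as follows. This is a simplified illustration under the assumptions stated in the comments, not the code in python/main.py; the helper functions `preprocessed_files` and `needs_preprocessing` are hypothetical.

```python
import tempfile
from pathlib import Path


def preprocessed_files(temp_dir, dataset, feature_definition, from_year, to_year):
    """Returns the paths of the .arff and the .xml file according to the naming scheme."""
    name = f'{dataset}_{feature_definition}_{from_year}-{to_year}'
    return Path(temp_dir) / f'{name}.arff', Path(temp_dir) / f'{name}.xml'


def needs_preprocessing(temp_dir, dataset, feature_definition, from_year, to_year):
    """Returns True unless both preprocessed files already exist (assumed condition)."""
    files = preprocessed_files(temp_dir, dataset, feature_definition, from_year, to_year)
    return not all(path.is_file() for path in files)


with tempfile.TemporaryDirectory() as temp_dir:
    # First run: no files exist yet, so preprocessing is required
    print(needs_preprocessing(temp_dir, 'example', 'features', 2018, 2019))
    for path in preprocessed_files(temp_dir, 'example', 'features', 2018, 2019):
        path.touch()  # Stand-in for the actual preprocessing step
    # Later runs: both files are found, so preprocessing can be skipped
    print(needs_preprocessing(temp_dir, 'example', 'features', 2018, 2019))
```

Since existing files are reused as they are, changes to the raw data only take effect after the corresponding files have been deleted from the temporary directory.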

Releases
  • 0.1.0 (Sep 24, 2021)

    The first release of the algorithm. It supports the functionalities for learning descriptive rules that are listed under "Features".