Automatic extraction of relevant features from time series

Overview

tsfresh


This repository contains the TSFRESH python package. The abbreviation stands for

"Time Series Feature extraction based on scalable hypothesis tests".

The package contains many feature extraction methods and a robust feature selection algorithm.

Spend less time on feature engineering

Data scientists often spend most of their time either cleaning data or building features. While we cannot change the former, the latter can be automated. TSFRESH frees up the time you would spend building features by extracting them automatically. Hence, you have more time to study the newest deep learning paper, read Hacker News, or build better models.

Automatic extraction of 100s of features

TSFRESH automatically extracts 100s of features from time series. Those features describe basic characteristics of the time series, such as the number of peaks or the average or maximal value, as well as more complex characteristics such as the time reversal symmetry statistic.

The features extracted from an exemplary time series

The set of features can then be used to construct statistical or machine learning models on the time series, for example for regression or classification tasks.
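
A minimal sketch of what this looks like in code, using the robot execution failures example dataset that ships with tsfresh:

    from tsfresh import extract_features
    from tsfresh.examples.robot_execution_failures import (
        download_robot_execution_failures,
        load_robot_execution_failures,
    )

    download_robot_execution_failures()  # fetch the demo data once
    timeseries, y = load_robot_execution_failures()

    # One row per time series id, one column per extracted feature.
    X = extract_features(timeseries, column_id="id", column_sort="time")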

Forget irrelevant features

Time series often contain noise, redundancies, or irrelevant information. As a result, most of the extracted features will not be useful for the machine learning task at hand.

To avoid extracting irrelevant features, the TSFRESH package has a built-in filtering procedure. This filtering procedure evaluates the explanatory power and importance of each characteristic for the regression or classification task at hand.

It is based on the well-developed theory of hypothesis testing and uses a multiple test procedure. As a result, the filtering process mathematically controls the percentage of irrelevant extracted features.
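
A sketch of the filtering step, assuming the X and y from the extraction example above:

    from tsfresh import select_features
    from tsfresh.utilities.dataframe_functions import impute

    # Replace NaN/inf in place; the significance tests need finite values.
    impute(X)

    # Keep only features with statistically relevant explanatory power for y.
    X_filtered = select_features(X, y)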

The TSFRESH package is described in the following open access paper

  • Christ, M., Braun, N., Neuffer, J. and Kempa-Liehr A.W. (2018). Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh -- A Python package). Neurocomputing 307 (2018) 72-77, doi:10.1016/j.neucom.2018.03.067.

The FRESH algorithm is described in the following whitepaper

  • Christ, M., Kempa-Liehr, A.W. and Feindt, M. (2017). Distributed and parallel time series feature extraction for industrial big data applications. ArXiv e-print 1610.07717, https://arxiv.org/abs/1610.07717.

Advantages of tsfresh

TSFRESH has several selling points, for example:

  1. it is field tested
  2. it is unit tested
  3. the filtering process is statistically/mathematically correct
  4. it has a comprehensive documentation
  5. it is compatible with sklearn, pandas and numpy (see the sketch after this list)
  6. it allows anyone to easily add their favorite features
  7. it runs both on your local machine and on a cluster
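
To illustrate point 5, a hedged sketch of plugging tsfresh's FeatureAugmenter transformer into an sklearn pipeline; the timeseries and y containers are assumed to follow the format of the examples above:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline
    from tsfresh.transformers import FeatureAugmenter

    pipeline = Pipeline([
        ("augmenter", FeatureAugmenter(column_id="id", column_sort="time")),
        ("classifier", RandomForestClassifier()),
    ])

    # The raw time series are handed to the transformer separately; the design
    # matrix only needs one row per id and receives the features as columns.
    design_matrix = pd.DataFrame(index=y.index)
    pipeline.set_params(augmenter__timeseries_container=timeseries)
    pipeline.fit(design_matrix, y)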

Next steps

If you are interested in the technical workings, see our comprehensive documentation on Read the Docs at http://tsfresh.readthedocs.io.

The algorithm, especially the filtering part, is also described in the papers mentioned above.

If you have questions or feedback, you can find the developers in the Gitter chatroom.

We appreciate any contributions. If you are interested in helping us make TSFRESH the biggest archive of feature extraction methods in Python, just head over to our How-To-Contribute instructions.

If you want to try out tsfresh quickly or if you want to integrate it into your workflow, we also have a docker image available:

docker pull nbraun/tsfresh

Acknowledgements

The research and development of TSFRESH was funded in part by the German Federal Ministry of Education and Research under grant number 01IS14004 (project iPRODICT).

Comments
  • Shapelet extraction

    One interesting feature with an explanatory ability is shapelet extraction.

    Would maybe be interesting to implement within this package? A far from optimal code example by me can be found here

    enhancement 
    opened by GillesVandewiele 74
  • extract_features is failing with: "OverflowError: value too large to convert to int"

    I am running extract_features on a very large matrix, having ~350 million rows and 6 features (as part of a complex data science pipeline). I am using a machine with 64 cores and 2TB memory, and utilizing all 64 cores. I am getting this error: "OverflowError: value too large to convert to int". Some comments:

    i) When I split the matrix vertically into, say, 3 chunks (each chunk having 2 features only) and run them sequentially, everything works fine (see the sketch after the traceback below). So it does not seem like I am having issues with "problematic" values in the matrix.
    ii) It does not seem to be a memory-related issue either (as alluded to in https://github.com/blue-yonder/tsfresh/issues/368), because I was babysitting the mentioned run that failed and was checking memory usage regularly (using "free -g"). It never got above 400GB.
    iii) I also tried running with LocalDaskDistributor and got the same error.
    iv) All 6 features in the matrix are floats.
    v) pai_tsfresh below is a fork of tsfresh.

    1. Your operating system:

    No LSB modules are available.
    Distributor ID: Ubuntu
    Description:    Ubuntu 16.04.3 LTS
    Release:        16.04
    Codename:       xenial

    2. The version of tsfresh that you are using: latest

    3. A minimal code snippet which reproduces the problem/bug. Here's the call in my code to extract_features:

    extracted_features_df = extract_features(
        rolled_design_matrix,
        column_id='account_date_index',
        column_sort='date',
        default_fc_parameters=fc_parameters,
        n_jobs=64,
    )

    where fc_parameters is:

    {'abs_energy': None,
     'autocorrelation': [{'lag': 1}],
     'binned_entropy': [{'max_bins': 10}],
     'c3': [{'lag': 1}],
     'cid_ce': [{'normalize': True}],
     'fft_aggregated': [{'aggtype': 'centroid'}, {'aggtype': 'variance'}, {'aggtype': 'skew'}, {'aggtype': 'kurtosis'}],
     'fft_coefficient': [{'attr': 'real', 'coeff': 0}],
     'sample_entropy': None,
     'spkt_welch_density': [{'coeff': 2}],
     'time_reversal_asymmetry_statistic': [{'lag': 1}]}

    4. Any reported errors or traceback. Here's the traceback:

    Traceback (most recent call last):
      File "/home/yuval/pai/projects/ds-feature-engineering-service/feature_engineering_service/src/fe/stateless/time_series_features_enricher/time_series_features_enricher.py", line 175, in do_enrich
        distributor=local_dask_distributor)
      File "/home/yuval/pai/projects/pai-tsfresh/pai_tsfresh/feature_extraction/extraction.py", line 152, in extract_features
        distributor=distributor)
      File "/home/yuval/pai/projects/pai-tsfresh/pai_tsfresh/feature_extraction/extraction.py", line 217, in _do_extraction
        data_in_chunks = [x + (y,) for x, y in df.groupby([column_id, column_kind])[column_value]]
      File "/home/yuval/pai/projects/pai-tsfresh/pai_tsfresh/feature_extraction/extraction.py", line 217, in <listcomp>
        data_in_chunks = [x + (y,) for x, y in df.groupby([column_id, column_kind])[column_value]]
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 1922, in get_iterator
        splitter = self._get_splitter(data, axis=axis)
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 1928, in _get_splitter
        comp_ids, _, ngroups = self.group_info
      File "pandas/_libs/properties.pyx", line 38, in pandas._libs.properties.cache_readonly.__get__
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2040, in group_info
        comp_ids, obs_group_ids = self._get_compressed_labels()
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2056, in _get_compressed_labels
        all_labels = [ping.labels for ping in self.groupings]
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2056, in <listcomp>
        all_labels = [ping.labels for ping in self.groupings]
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2750, in labels
        self._make_labels()
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2767, in _make_labels
        self.grouper, sort=self.sort)
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py", line 468, in factorize
        table = hash_klass(size_hint or len(values))
      File "pandas/_libs/hashtable_class_helper.pxi", line 1005, in pandas._libs.hashtable.StringHashTable.__init__
    OverflowError: value too large to convert to int
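
    For reference, a hedged sketch of the vertical-split workaround from point i); the value-column names are hypothetical:

    import pandas as pd
    from tsfresh import extract_features

    # Extract each chunk of value columns sequentially, then join the
    # resulting feature matrices on the id index.
    column_chunks = [["f1", "f2"], ["f3", "f4"], ["f5", "f6"]]  # hypothetical names
    parts = []
    for cols in column_chunks:
        sub = rolled_design_matrix[["account_date_index", "date"] + cols]
        parts.append(extract_features(sub,
                                      column_id="account_date_index",
                                      column_sort="date",
                                      default_fc_parameters=fc_parameters))
    extracted_features_df = pd.concat(parts, axis=1)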

    bug 
    opened by yuval-nardi 46
  • Improve performance of the impute function.

    Improve performance of the functions:

    • utilities.dataframe_function.get_range_values_per_column(df)
    • utilities.dataframe_function.impute(df_impute)

    More specifically: apply the impute function directly on the numpy array to improve computation time.

    The impute function now runs in 109 ms (60 samples, 14256 features/columns).

    Note: I did not improve the performance of impute_dataframe_range(...) since it would have been too much of a hassle to implement all the checks in that function, e.g. in case the min/max/median values of each column are not present. In our case we call get_range_values_per_column just before, so these checks are not necessary. So I just reimplemented impute_dataframe_range directly in the impute function. This is less modular. Maybe you could pack this code into a new impute_dataframe_range function.

    Solves issue #123.

    opened by F-A 31
  • Avoid leaking indices from training data sets as feature, classification accuracy depends on order of input time series in data frame

    I attempt to use tsfresh for a simple binary classification using a k-nearest-neighbor classifier and k-fold validation. However, the classification accuracy depends on the order of the input time series, which should not be relevant at all.

    The underlying problem is the set of features selected by select_features:

    value__index_mass_quantile__q_0.8
    value__index_mass_quantile__q_0.7
    value__index_mass_quantile__q_0.2
    value__index_mass_quantile__q_0.3

    and so on. All of them are directly proportional to the id in the training data set.

    Now the k-nearest-neighbor classifier just has to decide whether these index "features" are above a certain threshold to make a correct classification.

    I need to disable the consideration of the index for feature extraction. Using the index of the samples in my training data as input for feature extraction reduces my model to absurdity. All features should be based only on the time stamps and the associated values, but not on the order of the samples in my input data.

    How can I disable this incorrect behavior?

    extracted_features = extract_features(time_series, column_id="id", column_sort="time", column_value="value")
    impute(extracted_features)
    features_filtered = select_features(extracted_features, y) # use features_filtered and y as input for k-fold validation
    
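
    One way to drop such calculators is to remove them from the extraction settings before extracting; a sketch, not an official fix:

    from tsfresh import extract_features
    from tsfresh.feature_extraction import ComprehensiveFCParameters

    settings = ComprehensiveFCParameters()  # the default calculator dictionary
    del settings["index_mass_quantile"]     # skip the offending calculator

    extracted_features = extract_features(time_series, column_id="id",
                                          column_sort="time", column_value="value",
                                          default_fc_parameters=settings)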

    The time_series data frame is constructed in the same way as the robots example:

         id  time  value
    0     0     1    760
    1     0    11    761
    2     0   466    761
    3     0   473    765
    4     0   481    763
    5     0   488    761
    6     0   516    763
    7     0   532    763
    8     0   542    756
    9     0   610    756
    10    0   618    757
    11    0   885    757
    12    0  1189    757
    13    0  1206    758
    14    0  1263    758
    15    0  1275    760
    16    0  1295    768
    17    1     1    760
    18    1    11    761
    19    1   466    761
    20    1   473    765
    21    1   481    763
    22    1   488    761
    23    1   516    763
    ..   ..   ...    ...
    538  31   885    757
    539  31  1189    757
    540  31  1206    758
    541  31  1263    758
    542  31  1275    760
    543  31  5000    768
    544  32     1    760
    545  32    11    761
    546  32   466    761
    547  32   473    765
    548  32   481    763
    549  32   488    761
    550  32   516    763
    551  32   532    763
    552  32   542    756
    553  32   610    756
    554  32   618    757
    555  32   885    757
    556  32  1189    757
    557  32  1206    758
    558  32  1263    758
    559  32  1275    760
    560  32  5000    768

    The same goes for the target labels y:

    0     1
    1     1
    2     1
    3     1
    4     1
    5     1
    6     1
    7     1
    8     1
    9     1
    10    1
    11    2
    12    2
    13    2
    14    2
    15    2
    16    2
    17    2
    18    2
    19    2
    20    2
    21    2
    22    2
    23    2
    24    2
    25    2
    26    2
    27    2
    28    2
    29    2
    30    2
    31    2
    32    2
    dtype: int64

    opened by fkirc 28
  • Notebook rolling

    It is not ready to be merged.

    However, I would like to get your feedback on this. What do you think about the make_forecasting_frame method?

    The notebooks/timeseries_forecasting_basic_example.ipynb notebook can be used to play a little bit with the method.

    opened by MaxBenChrist 26
  • Parallelization performance v0.7.1 and commit ...1c99e8

    Hi,

    Two topics that I wanted to discuss:

    Runtime increase in latest commit

    We compared the extraction time of version 0.7.1 to code version 1c99e8. We varied the number of ids, where every chunk/time series had length = 91 and the dataframe had 4 kinds of ts columns. As you can see in the table below, we didn't see any significant improvement when using 4 processes in parallel and CHUNKSIZE = None (the fastest settings). How does this fit your benchmark results?

    Results table:

    #ids | 0.7.1 w parallel [sec] | 1c99e8 w parallel [sec]
    -- | -- | --
    940 | 8.95 ±1.2 | 14.6 ±2.47
    4500 | 46.98 ±14.21 | 61.96 ±1.66
    7612 | 74.37 ±9.18 | 69.07 ±6.3
    30523 | 295.98 ±49.15 | 317.16 ±16.52

    BTW, I actually saw a nice improvement with no parallelization (N_PROCESSES = 0). See the table below:

    #ids | 0.7.1 w/o parallel [sec] | 1c99e8 w/o parallel [sec]
    -- | -- | --
    940 | 12.42 ±0.51 | 10.34 ±0.69
    4500 | 58.04 ±1.04 | 45.79 ±0.09

    mean and std taken from 5 trials

    50% CPU usage with no-parallelization

    Recently I've noticed that while using the package for feature extraction, the CPU usage with no-parallelization settings stays mostly near 50±2%. This means that functions located deeper in _do_extraction_on_chunk are opening many processes/threads that are not configurable by the tsfresh API. Accordingly, what is the point of working with more than 2-3 processes, which already use 100% CPU? I've also noticed that when working with N_PROCESSES > 4, run time tends to rise drastically.

    Disclaimer:

    I've modified the parallelization backend in feature_extraction/extraction.py from multiprocessing to concurrent.futures (ProcessPoolExecutor) in order to allow a hierarchical parallelization scheme. This scheme allows running 2 modules in parallel: one module for DB extraction, preprocessing, and feature extraction (tsfresh), and a second child module for the internal tsfresh parallelization. The change can be seen in the image below.

    ![image](https://user-images.githubusercontent.com/13464827/28779134-c5264a92-760a-11e7-91b0-009e9aa8123b.png)

    Benchmark settings:

    • OS: win7
    • Resources: Xeon 48 cores, 32 GB RAM
    • Python interpreter: python 3.6
    • tsfresh package version: 0.7.1 and 1c99e8
    • feature extraction settings: mean, std, var, median, max, min, sum_values, length, augmented_dickey_fuller, ar_5_params

    Thanks

    opened by NoamGit 23
  • Formatting the data for 'tsfresh'

    First off, thank you for this amazing library, which showed me another way of observing data.

    I read the tutorials, but I clearly don't get something right... Below is the DataFrame 'df' I was trying to feed into 'tsfresh': multiple tickers with n 'features', so that the machine can learn the 'label' for a particular ticker on a particular date.

    [image: the DataFrame 'df' described above]

    X = extract_features(df.iloc[:, :-1], column_id='ticker', column_sort='date')
    y = df['label']
    

    But then [ValueError: Index of X must be a subset of y's index] occurred, because the number of unique 'column_id' values equals 3 whereas the number of 'label' values equals 15. I know this is what the tutorial explains; each robot will be predicted 'only once' with all the corresponding time-series data of features.

    My intention was to predict each 'label' per 'time step' for each 'ticker', as in the figure above. Could you please help me out with this?
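
    For context, a hedged sketch of the long ("stacked") input format extract_features expects, with exactly one label per id; the tickers and values are made up:

    import pandas as pd
    from tsfresh import extract_features

    # One row per observation; 'ticker' identifies the time series.
    df_long = pd.DataFrame({
        "ticker": ["AAA", "AAA", "BBB", "BBB"],
        "date": pd.to_datetime(["2020-01-01", "2020-01-02",
                                "2020-01-01", "2020-01-02"]),
        "price": [10.0, 10.5, 20.0, 19.5],
    })

    X = extract_features(df_long, column_id="ticker", column_sort="date")

    # y must carry exactly one label per unique id (here: per ticker).
    y = pd.Series({"AAA": 1, "BBB": 0})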

    opened by Nuri8 23
  • STUMPY, Matrix Profiles, and Motif Discovery

    Hello, tsfresh devs/users! First off, I wanted to say thank you for this wonderful and thoughtfully created package. I am a big fan of the work that y'all are doing here!

    I had noticed an older issue (and PR) between @Ezekiel-Kruglick, @GillesVandewiele, @MaxBenChrist, and @nils-braun regarding the earlier work from Eamonn Keogh's group on motif discovery (and shapelets too). If I understand correctly, this discussion happened right before Keogh published his wonderful papers on matrix profiles during the fall of 2017. I was wondering if it is of any interest to the group to re-explore the idea of motif extraction in light of these papers. The STUMPY Python package is focused on providing a fast and user friendly interface for computing the matrix profile and, more importantly, faithfully reproduces Keogh's work. It is Python 3 only and has support for parallel CPU computation via Numba, distributed computations via Dask, multi-GPU support, and maintains 100% code coverage. Depending on the data size, it may fit well with some of the tsfresh use cases.

    Full disclosure, I am the creator of STUMPY so let me know if you see an opportunity to collaborate here!
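
    For readers unfamiliar with it, a minimal sketch of computing a matrix profile with STUMPY (stumpy.stump is its core routine); the series and window length here are arbitrary toy values:

    import numpy as np
    import stumpy

    ts = np.random.rand(1000)    # a toy time series
    mp = stumpy.stump(ts, m=50)  # matrix profile with window length m=50

    # The smallest matrix profile value marks the best-conserved motif pair.
    motif_idx = np.argmin(mp[:, 0])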

    new feature calculator 
    opened by seanlaw 21
  • extract_features MemoryError with > 7x5k timeseries

    I'm attempting to do feature extraction on a time series that's 5 minutes long at 30 samples per second, with 7 features. However, I noticed that once I got past ~5k samples, I got a MemoryError.

    Running on Windows 10 64-bit/16gb RAM with python 3.5.2 32-bit, and the master branch of tsfresh. Sample Timeseries, Code & Traceback: https://gist.github.com/ProgBot/0463a68efcbabdb0e6c204c4b8bbf52a

    Is this a limitation of 32 bit Python? Or is tsfresh sadly incapable of handling this amount of data? Thanks

    question 
    opened by ProgBot 21
  • Implement parallelization of feature calculation per kind

    If there are several kinds of time series, their features are calculated in parallel using a process pool. The standard behavior is one process per CPU. This setting can be overwritten in the FeatureExtractionSettings object provided to extract_features.

    opened by jneuff 21
  • Use tqdm for Jupyter Notebooks

    When using tsfresh in a Jupyter Notebook, the tqdm progress bar output is not overwriting itself but creates a new line for every change in percentage. Can we somehow build in a switch for using tqdm_notebook?
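
    For illustration, the generic tqdm-side pattern such a switch could build on (plain tqdm usage, not tsfresh API): tqdm.auto picks the notebook-friendly widget automatically when running inside Jupyter.

    from tqdm.auto import tqdm  # falls back to the console bar outside notebooks

    for chunk in tqdm(range(100), desc="Feature Extraction"):
        pass  # process one chunk of time series here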

    opened by anderl80 19
  • Added functionality to test multiple versions of python using tox

    What functionality changed?

    This PR adds functionality to be able to test multiple versions of python using tox.

    This PR will run the test suite for python versions 3.7, 3.8, 3.9, 3.10, and 3.11 on whichever platform the user runs the tests on. The specific micro version of python depends on what is installed on the user's machine.

    tox will skip over running the test suite for a given version if it cannot find a particular python environment.

    How to run?

    To test on multiple python versions, edit envlist in setup.cfg to include the python versions you want to test on, and then run:

    tox -r
    

    in the top-level directory of tsfresh. Currently the versions being tested span python 3.7-3.11.

    Why?

    This PR will make it much easier to test tsfresh against multiple versions of python, for a given platform.

    What code changed?

    setup.cfg was changed to include some extra configuration information.

    tox was added to test-requirements.txt

    Tips for reviewing/running

    By default, tox will look in binary directories for any relevant python interpreters. For example, if

    envlist = (py37, py38)
    

    exists in setup.cfg, then tox will look inside binary directories for python executables named similarly to python3.7.X and python3.8.X. If it cannot find any executables for a given python version, then it will skip over testing that version.

    Using pyenv in tandem with tox

    A recommended way to handle multiple versions of python is with pyenv. pyenv will allow you to install multiple versions of python in a well-organised fashion.

    If you choose to use pyenv to manage the different python versions installed on your machine, then the executables of each python version will be in ~/.pyenv/shims/, which will not be found by tox by default. A recommended solution is to make a .python-version file with the versions of python that you want tox to look for.

    For example, if the output of running

    pyenv versions
    

    shows that you have installed python 3.7.16 and python 3.8.16, then you should put 3.7.16 and 3.8.16 as separate lines in the .python-version file. Running tox -r will then run the tests for python3.7.16 and python3.8.16, and tox will know where to find the relevant python interpreters.

    pyenv important note

    The python version that the tox command is initially invoked from matters!

    Running pyenv which python will show the version of python from which the tox command will be invoked. If tox is initially invoked from a version of python that is not supported by the package (i.e. the package is invoked from python 3.6 and python_requires is >=3.7), then tox will fail for all environments, including python versions that would otherwise work if tox had been invoked from a version of python supported by the package.

    Note that we can still test tsfresh on unsupported versions of python (such as 3.6), provided that tox is initially invoked from a version of python that is in tsfresh python_requires (such as 3.8).

    Log files

    Log files are stored in the .tox directory which is created once tox is run.

    opened by Scott-Simmons 2
  • Why make_forecasting_frame does not have min_timeshift argument?

    The problem:

    Hi. I noticed that make_forecasting_frame does not have a min_timeshift argument, so the first few rows have fewer predictor rows.
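
    As a possible workaround (a sketch, not a confirmed fix): roll_time_series, which make_forecasting_frame builds on, does expose min_timeshift, so windows shorter than a minimum length can be dropped there; df, "id" and "time" are placeholder names.

    from tsfresh.utilities.dataframe_functions import roll_time_series

    # Windows with fewer than min_timeshift points are skipped entirely.
    rolled = roll_time_series(df, column_id="id", column_sort="time",
                              max_timeshift=10, min_timeshift=3)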

    Environment:

    • Python version: 3.8.16
    • Operating System: Ubuntu
    • tsfresh version: 0.19.0
    • Install method (conda, pip, source): pip-Colab
    bug 
    opened by arashag 0
  • unmaintained dependency `matrixprofile` makes `tsfresh` uninstallable on python 3.10

    The problem:

    tsfresh cannot be installed on python 3.10 because it has matrixprofile as a dependency, yet matrixprofile cannot be installed on python 3.10 and is no longer maintained.

    Anything else we need to know?:

    matrixprofile is superseded by stumpy.

    Environment:

    • Python version: 3.10
    bug 
    opened by mmp3 6
  • IndexError: cannot do a non-empty take from an empty axes.

    This error occurs when using EfficientFCParameters or ComprehensiveFCParameters (not MinimalFCParameters). The error does not occur with pandas version 1.3.5. However, this is an old version and is not compatible with other python packages. Can you please resolve the issue, making it work with later pandas versions such as 1.4.3?

    bug 
    opened by hn2 0
  • BrokenPipeError

    The problem: I'm trying to run extract_features on my data but keep getting a BrokenPipeError. I tried it on two different computers (both with the same environment) with the same error. The dataset is quite large (merged DataFrame shape: (880169, 522)), so it is expected to run for 20 hours. It runs for a few hours and then crashes.

    Settings:

    features = extract_features(
        merged_time_series,
        column_id="id",
        default_fc_parameters=ComprehensiveFCParameters(),
        n_jobs=15,
        impute_function=impute,
    )
    

    Error (repeated many times):

    Process ForkPoolWorker-1:
    Traceback (most recent call last):
      File "/usr/lib/python3.10/multiprocessing/pool.py", line 131, in worker
        put((job, i, result))
      File "/usr/lib/python3.10/multiprocessing/queues.py", line 377, in put
        self._writer.send_bytes(obj)
      File "/usr/lib/python3.10/multiprocessing/connection.py", line 200, in send_bytes
        self._send_bytes(m[offset:offset + size])
      File "/usr/lib/python3.10/multiprocessing/connection.py", line 404, in _send_bytes
        self._send(header)
      File "/usr/lib/python3.10/multiprocessing/connection.py", line 368, in _send
        n = write(self._handle, buf)
    BrokenPipeError: [Errno 32] Broken pipe
    

    Anything else we need to know?: I also tried running it with a smaller chunksize and fewer jobs, but with no change.

    features = extract_features(
        merged_time_series,
        column_id="id",
        default_fc_parameters=ComprehensiveFCParameters(),
        n_jobs=8,
        impute_function=impute,
        chunksize=1,
    )
    

    Environment:

    • Python version: 3.10
    • Operating System: Ubuntu 22.04
    • tsfresh version: 0.19.0
    • Install method (conda, pip, source): pip
    bug 
    opened by johan-sightic 0
Releases (v0.20.0)
  • v0.20.0(Dec 30, 2022)

    • Breaking Change

      • The matrixprofile package becomes an optional dependency
    • Bugfixes/Typos/Documentation:

      • Fix feature extraction of Friedrich coefficients for pandas>1.3.5
      • Fix file paths after example notebooks were moved
    Source code(tar.gz)
    Source code(zip)
  • v0.19.0(Dec 21, 2021)

    • Breaking Change

      • Drop Python 3.6 support due to dependency on statsmodels 0.13
    • Added Features

      • Improve documentation (#831, #834, #851, #853, #870)
      • Add absolute_maximum and mean_n_absolute_max features (#833)
      • Make settings pickable (#845, #847, #910)
      • Disable multiprocessing for n_jobs=1 (#852)
      • Add black, isort, and pre-commit (#876)
    • Bugfixes/Typos/Documentation:

      • Fix conversion of time-series into sequence for lempel_ziv_complexity (#806)
      • Fix range count config (#827)
      • Reword documentation (#893)
      • Fix statsmodels deprecation issues (#898, #912)
      • Fix typo in requirements (#903)
      • Updated references
    Source code(tar.gz)
    Source code(zip)
  • v0.18.0(Mar 6, 2021)

    • Added Features

      • Allow arbitrary rolling sizes (#766)
      • Allow for multiclass significance tests (#762)
      • Add multiclass option to RelevantFeatureAugmenter (#782)
      • Addition of matrix_profile feature (#793)
      • Added new query similarity counter feature (#798)
      • Add root mean square feature (#813)
    • Bugfixes/Typos/Documentation:

      • Do not send coverage of notebook tests to codecov (#759)
      • Fix typos in notebook (#757, #780)
      • Fix output format of make_forecasting_frame (#758)
      • Fix badges and remove benchmark test
      • Fix BY notebook plot (#760)
      • Ts forecast example improvement (#763)
      • Also suppress warnings in dask (#769)
      • Update relevant_feature_augmenter.py (#779)
      • Fix column names in quick_start.rst (#778)
      • Improve relevance table function documentation (#781)
      • Fixed #789 Typo in "how to add custom feature" (#790)
      • Convert to the correct type on warnings (#799)
      • Fix minor typos in the docs (#802)
      • Add unwanted filetypes to gitignore (#819)
      • Fix build and test failures (#815)
      • Fix imputing docu (#800)
      • Bump the scikit-learn version (#822)
    Source code(tar.gz)
    Source code(zip)
  • v0.17.0(Sep 9, 2020)

    We changed the default branch from "master" to "main".

    • Breaking Change
      • Changed constructed id in roll_time_series from string to tuple (#700)
      • Same for add_sub_time_series_index (#720)
    • Added Features
      • Implemented the Lempel-Ziv-Complexity and the Fourier Entropy (#688)
      • Prevent #524 by adding an assert for common identifiers (#690)
      • Added permutation entropy (#691)
      • Added a logo :-) (#694)
      • Implemented the benford distribution feature (#689)
      • Reworked the notebooks (#701, #704)
      • Speed up the result pivoting (#705)
      • Add a test for the dask bindings (#719)
      • Refactor input data iteration to need less memory (#707)
      • Added benchmark tests (#710)
      • Make dask a possible input format (#736)
    • Bugfixes:
      • Fixed a bug in the selection, that caused all regression tasks with un-ordered index to be wrong (#715)
      • Fixed readthedocs (#695, #696)
      • Fix spark and dask after #705 and for non-id named id columns (#712)
      • Fix in the forecasting notebook (#729)
      • Let tsfresh choose the value column if possible (#722)
      • Move from coveralls github action to codecov (#734)
      • Improve speed of data processing (#735)
      • Fix for newer, more strict pandas versions (#737)
      • Fix documentation for feature calculators (#743)
    Source code(tar.gz)
    Source code(zip)
  • v0.16.0(May 12, 2020)

    • Breaking Change
      • Fix the sorting of the parameters in the feature names (#656). The feature names now consist of a sorted list of all parameters. That used to be true for all non-combiner features, and is now also true for combiner features. If you relied on the actual feature name, this is a breaking change.
      • Change the id after the rolling (#668). Now, the old id of your data is still kept. Additionally, we improved the way dataframes without a time column are rolled and how the new sub-time series are named. Also, the documentation was improved a lot.
    • Added Features
      • Added variation coefficient (#654)
      • Added the datetimeindex explanation from the notebook to the docs (#661)
      • Optimize RelevantFeatureAugmenter to avoid re-extraction (#669)
      • Added a function add_sub_time_series_index (#666)
      • Added Dockerfile
      • Speed optimizations and speed testing script (#681)
    • Bugfixes
      • Increase the extracted ar coefficients to the full parameter range. (#662)
      • Documentation fixes (#663, #664, #665)
      • Rewrote the sample_entropy feature calculator (#681). It is now faster and (hopefully) more correct. But your results will change!
    Source code(tar.gz)
    Source code(zip)
  • v0.15.1(May 12, 2020)

  • v0.15.0(Mar 26, 2020)

    • Added Features
      • Add count_above and count_below feature (#632)
      • Add convenience bindings for dask dataframes and pyspark dataframes (#651)
    • Bugfixes
      • Fix documentation build and feature table in sphinx (#637, #631, #627)
      • Add scripts to API documentation
      • Skip dask test for older python versions (#649)
      • Add missing distributor keyword (#648)
      • Fix tuple input for cwt (#645)
    Source code(tar.gz)
    Source code(zip)
  • v0.14.0(Feb 4, 2020)

    • Breaking Change
      • Replace Benjamini-Hochberg implementation with statsmodels implementation (#570)
    • Refactoring and Documentation
      • travis.yml (#605)
      • gitignore (#608)
      • Fix docstring of c3 (#590)
      • Feature/pep8 (#607)
    • Added Features
      • Improve test coverage (#609)
      • Add "autolag" parameter to augmented_dickey_fuller() (#612)
    • Bugfixes
      • Feature/pep8 (#607)
      • Fix filtering on warnings with multiprocessing on Windows (#610)
      • Remove outdated logging config (#621)
      • Replace Benjamini-Hochberg implementation with statsmodels implementation (#570)
      • Fix the kernel and the naming of a notebook (#626)
    Source code(tar.gz)
    Source code(zip)
  • v0.13.0(Nov 24, 2019)

    • Drop python 2.7 support (#568)
    • Fixed bugs
      • Fix cache in friedrich_coefficients and agg_linear_trend (#593)
      • Added a check for wrong column names and a test for this check (#586)
      • Make sure to not install the tests folder (#599)
      • Make sure there is at least a single column which we can use for data (#589)
      • Avoid division by zero in energy_ratio_by_chunks (#588)
      • Ensure that get_moment() uses float computations (#584)
      • Preserve index when column_value and column_kind not provided (#576)
      • Add @set_property("input", "pd.Series") when needed (#582)
      • Fix off-by-one error in longest strike features (fixes #577) (#578)
      • Add set_property import (#572)
      • Fix typo (#571)
      • Fix indexing of melted normalized input (#563)
      • Fix travis (#569)
    • Remove warnings (#583)
    • Update to newest python version (#594)
    • Optimizations
      • Early return from change_quantiles if ql >= qh (#591)
      • Optimize mean_second_derivative_central (#587)
      • Improve performance with Numpy's sum function (#567)
      • Optimize mean_change (fixes issue #542) and correct documentation (#574)
    Source code(tar.gz)
    Source code(zip)
  • v0.12.0(Nov 24, 2019)

    • fixed bugs
      • wrong calculation of friedrich coefficients
      • feature selection selected too many features
      • an ignored max_timeshift parameter in roll_time_series
    • add deprecation warning for python 2
    • added support for index based features
    • new feature calculator
      • linear_trend_timewise
    • enable the RelevantFeatureAugmenter to be used in cross validated pipelines
    • increased scipy dependency to 1.2.0
    Source code(tar.gz)
    Source code(zip)
  • v0.11.1(Nov 24, 2019)

    • general performance improvements
    • removed hard pinning of dependencies
    • fixed bugs
      • the stock price forecasting notebook
      • the multi classification notebook
    Source code(tar.gz)
    Source code(zip)
  • v0.11.0(Nov 24, 2019)

    • new feature calculators:
      • fft_aggregated
      • cid_ce
    • renamed mean_second_derivate_central to mean_second_derivative_central
    • add warning if no relevant features were found in feature selection
    • add columns_to_ignore parameter to from_columns method
    • add distribution module, contains support for distributed feature extraction on Dask
    Source code(tar.gz)
    Source code(zip)