scikit-fda: Functional Data Analysis in Python

Overview

Functional Data Analysis (FDA) is the field of statistics that analyses data that depend on a continuous parameter.

This package offers classes, methods and functions to support FDA in Python. It includes a wide range of utilities to work with functional data and its representation, exploratory analysis, and preprocessing, among other tasks such as inference, classification, regression and clustering of functional data. See the documentation for further information on the features included in the package.
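A minimal sketch of a typical session, assuming the fetch_growth dataset loader and the FDataGrid API:

    import skfda

    # Load a classical functional dataset (Berkeley growth curves).
    X, y = skfda.datasets.fetch_growth(return_X_y=True)

    # Samples are stored as an FDataGrid: functions discretized on a common grid.
    print(X.data_matrix.shape)   # (n_samples, n_grid_points, dim_codomain)

    # Exploratory plotting returns a Matplotlib figure.
    fig = X.plot()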

Documentation

The documentation is available at fda.readthedocs.io/en/stable/, and includes detailed information on the different modules, classes and methods of the package, along with several examples showing different functionalities.

The documentation of the latest version, corresponding to the develop branch of the package, can be found at fda.readthedocs.io/en/latest/.

Installation

Currently, scikit-fda is available for Python 3.6 and 3.7, regardless of the platform. The stable version can be installed via PyPI:

pip install scikit-fda

Installation from source

It is possible to install the latest version of the package, available in the develop branch, by cloning this repository and doing a manual installation.

git clone https://github.com/GAA-UAM/scikit-fda.git
pip install ./scikit-fda

Make sure that your default Python version is currently supported, or invoke the python and pip commands with an explicit version, such as python3.6:

git clone https://github.com/GAA-UAM/scikit-fda.git
python3.6 -m pip install ./scikit-fda

Requirements

scikit-fda depends on several scientific Python packages, including NumPy, SciPy, Matplotlib, scikit-learn, fdasrsf and findiff. The dependencies are installed automatically along with the package.

Contributions

All contributions are welcome. You can help this project grow in multiple ways: creating an issue, reporting an improvement or a bug, or forking the repository and opening a pull request against the development branch.

The people involved at some point in the development of the package can be found in the contributors file.

License

The package is licensed under the BSD 3-Clause License. A copy of the license can be found along with the code.

Comments
  • Feature/improve visualization


The plotting functions have been changed so that they no longer use global state (via pyplot) whenever possible. Instead, the plotting functions in skfda create and return a new figure if none is passed.

Pyplot is still used for Sphinx-Gallery, as it has to scrape the produced images.

All the examples have been changed to use the new API and the object-oriented functionality of Matplotlib.
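A minimal sketch of the new behavior described above (the fig parameter and the dataset loader are assumptions based on this description, not this PR's actual code):

    import matplotlib.pyplot as plt
    import skfda

    X, _ = skfda.datasets.fetch_growth(return_X_y=True)

    # With no figure passed, plot() creates and returns a new one.
    fig = X.plot()

    # An existing figure can be passed instead, avoiding pyplot global state.
    fig2 = plt.figure()
    X.plot(fig=fig2)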

    opened by vnmabus 17
  • Retrieve coefficients for function reconstruction


    Hey there!

I fitted a KNN FDataGrid to my input data. It actually looks pretty good so far, and I would now like to "export" it so that I can represent the function as numerical values (preferably in a NumPy array).

I saw that you offer some bases that can be used to "export" the underlying representation. Could you elaborate on which basis should be used when? My data represent a demand/supply curve. I tried the BSpline one, but it only constructs something close to a sine wave, which doesn't really represent my data.

Here is an image of the graph itself: [image: grafik]

    Is there some way to get the raw representation instead of transforming it to another basis?
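For reference, a minimal sketch of both routes, assuming a B-spline basis purely for illustration (not a recommendation for this particular curve):

    import skfda
    from skfda.representation.basis import BSpline

    X, _ = skfda.datasets.fetch_growth(return_X_y=True)

    # Raw discretized representation: the values on the grid points.
    raw = X.data_matrix              # shape (n_samples, n_grid_points, dim)

    # Basis representation: project onto a basis and read the coefficients.
    X_basis = X.to_basis(BSpline(n_basis=10))
    coefs = X_basis.coefficients     # shape (n_samples, n_basis)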

    opened by mainrs 16
  • add inverse_transform #297 method to FPCA


    Reference issue: https://github.com/GAA-UAM/scikit-fda/issues/297#issue-775429060

What does it address? The inverse_transform method projects the functional principal component scores (coefficient values w.r.t. the eigenfunction basis) of a sample (or set of samples) back to the input space. In other words, FPCA.inverse_transform(FPCA.transform()) should approximate the identity map.

Empirical tests: I've only tested the inverse_transform method on synthetic datasets (both FDataGrid and FDataBasis) and assessed the results in terms of 'identity regression', i.e. with QQ-plots on the distribution of the residuals, where residuals = X_input.value - X_recovered.value and:

    • X_recovered is computed as FPCA.inverse_transform(FPCA.transform(X_input)),
    • value is .coefficients or .data_matrix attribute, depending on the input data format.

    X_input is generated according to:

    import numpy as np
    from skfda.datasets import make_gaussian_process
    from skfda.misc.covariances import Exponential

    cov = Exponential(length_scale=0.5)
    n_grid = 100
    n_samples = 50

    fd = make_gaussian_process(
        start=0,
        stop=4,
        n_samples=n_samples,
        n_features=n_grid,
        mean=lambda t: np.power(t, 2) + 5,
        cov=cov,
    )
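
For context, a sketch of the round trip being assessed (the FPCA import path is an assumption and may differ between versions):

    from skfda.preprocessing.dim_reduction import FPCA  # path may vary by version

    fpca = FPCA(n_components=3)
    scores = fpca.fit_transform(fd)                 # functional principal scores
    fd_recovered = fpca.inverse_transform(scores)   # should approximate fd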
    

Below are the resulting QQ-plots (computed with scipy.stats.probplot) when X_recovered is computed with 3 principal components:

• When fd is kept in FDataGrid format: slope ~ 0.675, intercept ~ -3e-17 and R^2 ~ 0.9994. Note that the residuals are computed in terms of functional data values. [image: qqplot-fdatagrid]

• When fd is mapped to FDataBasis format with a 4th-order B-spline basis with cardinality 50: slope ~ 1.08, intercept ~ -5e-17 and R^2 ~ 0.9996. Note that the residuals are computed in terms of the coefficients in the (here B-spline) basis. [image: qqplot-fdatabasis]

I can enhance the PR and provide further empirical results, just tell me :)

    opened by Clej 12
  • ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject


    Scikit-fda is successfully installed, but when I try to import it, I receive the following error:

    ValueError                                Traceback (most recent call last)
    <ipython-input-17-97c44e0d3210> in <module>
    ----> 1 import skfda
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/__init__.py in <module>
         35 from .representation._functional_data import concatenate
         36 
    ---> 37 from . import representation, datasets, preprocessing, exploratory, misc, ml, \
         38     inference
         39 
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/datasets/__init__.py in <module>
          5                              fetch_weather, fetch_aemet,
          6                              fetch_octane, fetch_gait)
    ----> 7 from ._samples_generators import (make_gaussian,
          8                                   make_gaussian_process,
          9                                   make_sinusoidal_process,
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/datasets/_samples_generators.py in <module>
          7 from .. import FDataGrid
          8 from .._utils import _cartesian_product
    ----> 9 from ..misc import covariances
         10 from ..preprocessing.registration import normalize_warping
         11 from ..representation.interpolation import SplineInterpolation
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/misc/__init__.py in <module>
    ----> 1 from . import covariances, kernels, metrics
          2 from . import operators
          3 from . import regularization
          4 from ._math import (log, log2, log10, exp, sqrt, cumsum,
          5                     inner_product, inner_product_matrix)
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/misc/covariances.py in <module>
          7 import sklearn.gaussian_process.kernels as sklearn_kern
          8 
    ----> 9 from ..exploratory.visualization._utils import _create_figure, _figure_to_svg
         10 
         11 
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/exploratory/__init__.py in <module>
          1 from . import depth
    ----> 2 from . import outliers
          3 from . import stats
          4 from . import visualization
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/exploratory/outliers/__init__.py in <module>
          4 )
          5 from ._iqr import IQROutlierDetector
    ----> 6 from .neighbors_outlier import LocalOutlierFactor
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/exploratory/outliers/neighbors_outlier.py in <module>
          2 from sklearn.base import OutlierMixin
          3 
    ----> 4 from ...misc.metrics import lp_distance
          5 from ...ml._neighbors_base import (
          6     KNeighborsMixin,
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/misc/metrics.py in <module>
          6 
          7 from .._utils import _pairwise_commutative
    ----> 8 from ..preprocessing.registration import normalize_warping, ElasticRegistration
          9 from ..preprocessing.registration._warping import _normalize_scale
         10 from ..preprocessing.registration.elastic import SRSF
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/preprocessing/__init__.py in <module>
    ----> 1 from . import registration
          2 from . import smoothing
          3 from . import dim_reduction
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/preprocessing/registration/__init__.py in <module>
         14 from ._warping import invert_warping, normalize_warping
         15 
    ---> 16 from .elastic import ElasticRegistration
         17 
         18 from . import validation, elastic
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/preprocessing/registration/elastic.py in <module>
          1 
    ----> 2 from fdasrsf.utility_functions import optimum_reparam
          3 import scipy.integrate
          4 from sklearn.base import BaseEstimator, TransformerMixin
          5 from sklearn.utils.validation import check_is_fitted
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/fdasrsf/__init__.py in <module>
         20 del sys
         21 
    ---> 22 from .time_warping import fdawarp, align_fPCA, align_fPLS, pairwise_align_bayes
         23 from .plot_style import f_plot, rstyle, plot_curve, plot_reg_open_curve, plot_geod_open_curve, plot_geod_close_curve
         24 from .utility_functions import smooth_data, optimum_reparam, f_to_srsf, gradient_spline, elastic_distance, invertGamma, srsf_to_f
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/fdasrsf/time_warping.py in <module>
          7 import numpy as np
          8 import matplotlib.pyplot as plt
    ----> 9 import fdasrsf.utility_functions as uf
         10 import fdasrsf.fPCA as fpca
         11 import fdasrsf.geometry as geo
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/fdasrsf/utility_functions.py in <module>
         17 from joblib import Parallel, delayed
         18 import numpy.random as rn
    ---> 19 import optimum_reparamN2 as orN2
         20 import optimum_reparam_N as orN
         21 import cbayesian as bay
    
    src/optimum_reparamN2.pyx in init optimum_reparamN2()
    
    ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
    

I'm using the standard import:

    import skfda
    

I don't know why this is happening; any help is greatly appreciated.

    Version information

    • OS: MacOS
    • Python version: 3.7.9
    • scikit-fda version: 0.5
    • Version of other packages involved: [numpy: 1.19.2, scipy: 1.6.0, matplotlib: 3.3.4 , conda: 4.9.2 ]
    bug 
    opened by adeyemi-lawal 10
  • ModuleNotFoundError: No module named 'optimum_reparam'


Hi, I just forked/cloned the repository and found a module that is not listed in the dependencies. It's called "optimum_reparam" and it's imported right here (and maybe somewhere else). I failed to find the module myself; could someone provide details about this module in the README requirements section, please?

    Thanks for this initiative!

    opened by alejandro-ariza 8
  • Problem compiling binaries in macOS


Describe the bug

Problem compiling C code from fdasrsf:

    error: $MACOSX_DEPLOYMENT_TARGET mismatch: now "10.14" but "10.15" during configure

The fdasrsf package pins this variable with os.environ['MACOSX_DEPLOYMENT_TARGET'] = '10.14' in its setup.py. After cloning the repository and removing that line, I get the same error. I also tried exporting the environment variable before running the installation, with no success.

    Complete trace:

    Building wheel for fdasrsf (PEP 517) ... error
    ERROR: Command errored out with exit status 1:
      command: /Users/pablomm/scikit_fda_test2/venv/bin/python /Users/pablomm/scikit_fda_test2/venv/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py build_wheel /var/folders/tk/bl_pbdpj6kb7r3s584k85jxw0000gn/T/tmpo2_rkkrs
      cwd: /private/var/folders/tk/bl_pbdpj6kb7r3s584k85jxw0000gn/T/pip-install-ym8nls3t/fdasrsf_a923c378ec5a4ec689f1d7459d35c8c7
      Complete output (38 lines):
      generating build/_DP.c (already up-to-date)
      running bdist_wheel
      running build
      running build_py
      creating build/lib.macosx-10.15-x86_64-3.8
      creating build/lib.macosx-10.15-x86_64-3.8/fdasrsf
      copying fdasrsf/plot_style.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
      copying fdasrsf/curve_stats.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
      copying fdasrsf/curve_functions.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
      copying fdasrsf/geodesic.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
      copying fdasrsf/utility_functions.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
      copying fdasrsf/regression.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
      copying fdasrsf/tolerance.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
      copying fdasrsf/pcr_regression.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
      copying fdasrsf/init.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
      copying fdasrsf/fPCA.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
      copying fdasrsf/umap_metric.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
      copying fdasrsf/time_warping.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
      copying fdasrsf/geometry.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
      copying fdasrsf/curve_regression.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
      copying fdasrsf/rbfgs.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
      copying fdasrsf/fPLS.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
      copying fdasrsf/boxplots.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
      running build_ext
      cythoning src/optimum_reparamN2.pyx to src/optimum_reparamN2.c
      cythoning src/fpls_warp.pyx to src/fpls_warp.c
      cythoning src/mlogit_warp.pyx to src/mlogit_warp.c
      cythoning src/ocmlogit_warp.pyx to src/ocmlogit_warp.c
      cythoning src/oclogit_warp.pyx to src/oclogit_warp.c
      cythoning src/optimum_reparam_N.pyx to src/optimum_reparam_N.c
      cythoning src/cbayesian.pyx to src/cbayesian.cpp
      building 'optimum_reparamN2' extension
      creating build/temp.macosx-10.15-x86_64-3.8
      creating build/temp.macosx-10.15-x86_64-3.8/src
      clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk -I/private/var/folders/tk/bl_pbdpj6kb7r3s584k85jxw0000gn/T/pip-build-env-qnbgsf0y/overlay/lib/python3.8/site-packages/numpy/core/include -I/usr/local/include -I/usr/local/opt/openssl@1.1/include -I/usr/local/opt/sqlite/include -I/usr/local/opt/tcl-tk/include -I/Users/pablomm/scikit_fda_test2/venv/include -I/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/include/python3.8 -c src/optimum_reparamN2.c -o build/temp.macosx-10.15-x86_64-3.8/src/optimum_reparamN2.o
      not modified: 'build/_DP.c'
      error: $MACOSX_DEPLOYMENT_TARGET mismatch: now "10.14" but "10.15" during configure


    ERROR: Failed building wheel for fdasrsf
    Building wheel for findiff (setup.py) ... done
    Created wheel for findiff: filename=findiff-0.8.9-py3-none-any.whl size=29218 sha256=c2bc96e93c195fb5c2a1acbf932b5311e8e94e571edf74929eca6e019c66532d
    Stored in directory: /Users/pablomm/Library/Caches/pip/wheels/df/48/68/71cc95b16d5f7c5115a009f92f9a5a3896fb2ece31228b0aa5
    Successfully built findiff
    Failed to build fdasrsf
    ERROR: Could not build wheels for fdasrsf which use PEP 517 and cannot be installed directly

    To Reproduce

    virtualenv venv
    source venv/bin/activate
    python --version # 3.8.8 and also tried 3.9.2
    pip install scikit-fda
    

The same error also occurs when the package is installed from source with python setup.py install.

    Version information

    • OS: macOS catalina 10.15.7
    • Python version: 3.9.2 and 3.8.8
    • scikit-fda version: stable master (0.5) and develop
    • gcc and g++ version: 12.0.0
    bug 
    opened by pablomm 7
• Importing FDataGrid from skfda gives a ValueError


Describe the bug

When importing FDataGrid from skfda, the following error is raised:

    ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

To Reproduce

We are installing scikit-fda==0.3 with the following Docker image:

    FROM python:3.7-slim-buster
    
    # ------------------------------------------------------------------------------------Basic Linux Packages------------------------------------------------------------------------------------
    RUN apt-get update \
        && apt-get install -y --no-install-recommends \
        ca-certificates \
        cmake \
        build-essential \
        gcc \
        g++ \
        git \
        wget \
        curl \
        libffi-dev \
        python-dev \
        unixodbc-dev \
        && rm -rf /var/lib/apt/lists/*
    

Expected behavior

To be able to import the library.

Version information

    • OS: Linux
    • Python version: python:3.7-slim-buster
    • scikit-fda version: 0.3
    • Version of other packages involved [e.g. numpy, scipy, matplotlib, ... ]: numpy==1.16.2 , scipy==1.3.1, matplotlib==3.1.1

Additional context

Since 01/02/2021 we have not been able to import the library. Before that date we could import it without problems, with the same requirements and environment.

    Error Thread:

    src/hvl/utils/model_utils.py:12: in <module>
        from skfda import FDataGrid
    /usr/local/lib/python3.7/site-packages/skfda/__init__.py:36: in <module>
        from . import representation, datasets, preprocessing, exploratory, misc, ml
    /usr/local/lib/python3.7/site-packages/skfda/datasets/__init__.py:6: in <module>
        from ._samples_generators import (make_gaussian_process,
    /usr/local/lib/python3.7/site-packages/skfda/datasets/_samples_generators.py:8: in <module>
        from ..misc import covariances
    /usr/local/lib/python3.7/site-packages/skfda/misc/__init__.py:2: in <module>
        from . import covariances, kernels, metrics
    /usr/local/lib/python3.7/site-packages/skfda/misc/covariances.py:9: in <module>
        from ..exploratory.visualization._utils import _create_figure, _figure_to_svg
    /usr/local/lib/python3.7/site-packages/skfda/exploratory/__init__.py:4: in <module>
        from . import visualization
    /usr/local/lib/python3.7/site-packages/skfda/exploratory/visualization/__init__.py:1: in <module>
        from . import clustering, representation
    /usr/local/lib/python3.7/site-packages/skfda/exploratory/visualization/clustering.py:11: in <module>
        from ...ml.clustering.base_kmeans import FuzzyKMeans
    /usr/local/lib/python3.7/site-packages/skfda/ml/__init__.py:2: in <module>
        from . import classification, clustering, regression
    /usr/local/lib/python3.7/site-packages/skfda/ml/classification/__init__.py:3: in <module>
        from ..._neighbors import (KNeighborsClassifier, RadiusNeighborsClassifier,
    /usr/local/lib/python3.7/site-packages/skfda/_neighbors/__init__.py:11: in <module>
        from .unsupervised import NearestNeighbors
    >   ???
    E   ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
    
    deps/fdasrsf/optimum_reparam.pyx:1: ValueError
    
    bug 
    opened by JavierHernandezMontes 6
  • Feature/local outlier factor


• Created LocalOutlierFactor (which wraps the scikit-learn multivariate version)
    • Example in gallery of detection of outliers
    • New real dataset employed in the example (fetch_octane)
• Tests and doctests added
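
As a rough usage sketch of the wrapped estimator (the import path follows the package layout shown in the tracebacks above; fit_predict and return_X_y follow scikit-learn conventions and are assumptions here):

    import skfda
    from skfda.exploratory.outliers import LocalOutlierFactor

    # Octane dataset, added in this PR.
    X, _ = skfda.datasets.fetch_octane(return_X_y=True)

    lof = LocalOutlierFactor()
    labels = lof.fit_predict(X)   # -1 marks outliers, 1 marks inliers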
    enhancement pending theory 
    opened by pablomm 6
  • Error when calling L2Regularization


I have tried different Python versions, but calling:

    regularization=L2Regularization(LinearDifferentialOperator())

as in the docs, results in the following error:

TypeError: __init__() takes 1 positional argument but 2 were given

    opened by dcbrien 5
  • Feature/neighbors


    Added the following neighbors estimators:

    • NearestNeighbors
    • KNeighborsClassifier
    • RadiusNeighborsClassifier
    • NearestCentroids
    • KNeighborsScalarRegressor
    • RadiusNeighborsScalarRegressor
    • KNeighborsFunctionalRegressor
    • RadiusNeighborsFunctionalRegressor

I wrote some examples for KNeighborsClassifier, RadiusNeighborsClassifier and KNeighborsScalarRegressor; I will write others for the regressors with functional response and the nearest centroids, but in another PR.

Also, I had to modify lp_distance to accept the infinity case, and the mean to accept a list of weights.

There are some little things which can be improved after merging #101.
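
A hedged sketch of how one of these estimators might be used, following scikit-learn conventions (the dataset and split are illustrative):

    import skfda
    from skfda.ml.classification import KNeighborsClassifier
    from sklearn.model_selection import train_test_split

    # Berkeley growth curves: classify samples from the curves themselves.
    X, y = skfda.datasets.fetch_growth(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))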

    opened by pablomm 5
  • Transformer to reduce the image dimension


Created a transformer that receives a multivariate function from R^n -> R^d and applies a vector norm to reduce the image dimension to 1.

There are two versions of the transformer: a procedural one and a scikit-learn-style transformer.

    I put the functions in skfda.preprocessing.dim_reduction, but maybe there is a better place.

    opened by pablomm 5
• Does BasisSmoother rely on smoothing_parameter values?


Describe the bug

It seems to me that changing smoothing_parameter doesn't affect the result of BasisSmoother, particularly the score.

To Reproduce

I'm using something similar to the example in the docs:

    import numpy as np
    import skfda

    t = np.linspace(0, 1, 5)
    x = np.sin(2 * np.pi * t) + np.cos(2 * np.pi * t) + 2
    fd = skfda.FDataGrid(data_matrix=x, grid_points=t)
    basis = skfda.representation.basis.FourierBasis((0, 1), n_basis=3)
    smoother = skfda.preprocessing.smoothing.BasisSmoother(basis, smoothing_parameter=100)
    fd_smooth = smoother.fit_transform(fd)
    fd_smooth.data_matrix.round(2)
    smoother.score(fd, fd)
    

Expected behavior

Changing smoothing_parameter should affect the score, but it doesn't seem to. I'm using the latest GitHub version.

    Version information

    • OS: [e.g. MacOs 13.0]
    • Python version: [e.g. 3.9.15]
    • scikit-fda version: [e.g. latest]
    • Version of other packages involved [e.g. numpy 1.23.4]

    Additional context

    bug 
    opened by mikeaalv 2
  • Allow BasisSmoother to accept irregularly sampled data


Alternatively, I have implemented a separate SparseBasisSmoother class which encapsulates the functionality of BasisSmoother in order to apply it to irregularly sampled data, in the format given in Issue #325.

    opened by opintosant 1
  • Data structure for discretized sparse data


Is your feature request related to a problem? Please describe.

We need an FData subclass capable of storing discretized functions where:

• The points at which the functions are measured are NOT in a grid (especially relevant for high-dimensional functions such as surfaces).
• The points at which each function is measured differ between observations, probably with a different number of points per observation.
    • There are typically few points measured per observation (sparse instead of dense).

    This structure is necessary for efficient storage and computation with this kind of data.

Describe the solution you'd like

The proposal for the implementation is to have three arrays (see the sketch after this list):

    • A first array containing the (concatenated) input points for the observations. For example, for three functional observations $f_1$, $f_2$, $f_3$, evaluated at $f_1(1, 2) = (3, 4, 5), f_1(2, 4) = (1, 7, 3), f_2(0, 0) = (1, 0, 1), f_3(0, 1) = (2, 2, 2), f_3(3, 5) = (1, 0, 0)$, this array would contain [(1, 2), (2, 4), (0, 0), (0, 1), (3, 5)].
    • A second array with the corresponding values. In the previous example it would be [(3, 4, 5), (1, 7, 3), (1, 0, 1), (2, 2, 2), (1, 0, 0)].
    • A third array with the indexes where the points of each observation start. In this case [0, 2, 3].

    This approach has a compact representation and also allows for vectorization to be applied. The indexes can be used to apply the reduceat method of NumPy ufuncs.
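A minimal sketch of the proposed layout, using the example values above (all names are illustrative):

    import numpy as np

    # Concatenated input points of all observations.
    points = np.array([(1, 2), (2, 4), (0, 0), (0, 1), (3, 5)])
    # Corresponding output values.
    values = np.array([(3, 4, 5), (1, 7, 3), (1, 0, 1), (2, 2, 2), (1, 0, 0)])
    # Index at which each observation starts.
    starts = np.array([0, 2, 3])

    # Vectorized per-observation reductions via ufunc.reduceat,
    # e.g. the sum of the values of each observation:
    per_obs_sum = np.add.reduceat(values, starts, axis=0)
    print(per_obs_sum)   # one row per functional observation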

Describe alternatives you've considered

A more high-level API should also be exposed and used when possible.

    enhancement 
    opened by vnmabus 1
  • Fix regularized FPCA with FDataGrid


    The issue

When using FPCA with regularization on FDataGrid objects, the obtained components are orthogonal with respect to the usual inner product, which is not the one that should be used under regularization. As specified in Ramsay, J. O. and Silverman, B. W., Functional Data Analysis (page 178), the components should not be orthogonal; instead they should fulfill

$$ \int \xi_j(s) \xi_k(s) ds + \lambda \int D^2 \xi_j(s) D^2 \xi_k(s) ds = 0, \quad (j\ne k) $$

where $\xi_i$ are the loadings, $\lambda$ is the regularization parameter, and $D^2$ is the second-order differential operator.

    Steps to reproduce

    The following code shows the problem:

    import skfda
    from skfda.misc.operators import LinearDifferentialOperator
    from skfda.misc.regularization import L2Regularization
    from skfda.preprocessing.dim_reduction import FPCA  # path may vary by version
    from skfda.representation.basis import Fourier

    X, y = skfda.datasets.fetch_phoneme(return_X_y=True)
    X = X[:300]
    y = y[:300]
    
    n_components = 4
    
    fpca_regression = FPCA(
        n_components=n_components,
        regularization=L2Regularization(
            LinearDifferentialOperator(1),
            regularization_parameter=20,
        ),
    )
    
    fpca_regression.fit(X, y)
    
    matrix = skfda.misc.inner_product_matrix(fpca_regression.components_)
    print("Grid regularized:\n", matrix)
    
    
    grid_points = X.grid_points
    X = X.to_basis(Fourier(n_basis=10))
    
    fpca_regression = FPCA(
        n_components=n_components,
        regularization=L2Regularization(
            LinearDifferentialOperator(1),
            regularization_parameter=0.25,
        ),
    )
    fpca_regression.fit(X, y)
    
    
    matrix = skfda.misc.inner_product_matrix(fpca_regression.components_)
    print("Basis regularized:\n", matrix)
    

    Output:

    Grid regularized:
     [[ 1.00000000e+00  6.96057795e-16  9.19403442e-17 -2.91433544e-16]
     [ 6.71554826e-16  1.00000000e+00  1.51354623e-16 -2.83627288e-16]
     [ 6.93889390e-18  1.55257751e-16  1.00000000e+00  1.49619900e-16]
     [-2.43294968e-16 -2.56305394e-16  1.47451495e-16  1.00000000e+00]]
    Basis regularized:
     [[ 0.99393024 -0.0180874  -0.01324601  0.0030897 ]
     [-0.0180874   0.79784464 -0.06858017  0.07861877]
     [-0.01324601 -0.06858017  0.74805033  0.09757001]
     [ 0.0030897   0.07861877  0.09757001  0.70994851]]
    

    In the grid regularized case, the output is the identity matrix, even though the regularization coefficient has been set to a very high value (20). In contrast, the basis regularized case outputs a matrix very different from the identity.

    Current implementation

    The relevant extract of the code is the following:

    basis = FDataGrid(
        data_matrix=np.identity(n_points_discretization),
        grid_points=X.grid_points,
    )
    regularization_matrix = compute_penalty_matrix(
        basis_iterable=(basis,),
        regularization_parameter=1,
        regularization=self.regularization,
    )
    
    basis_matrix = basis.data_matrix[..., 0]
    if regularization_matrix is not None:
        basis_matrix += regularization_matrix
    
    fd_data = np.linalg.solve(
        basis_matrix.T,
        fd_data.T,
    ).T
    
    # see docstring for more information
    final_matrix = fd_data @ np.sqrt(weights_matrix)
    
    pca = PCA(n_components=self.n_components)
    pca.fit(final_matrix)
    self.components_ = X.copy(
        data_matrix=np.transpose(
            np.linalg.solve(
                np.sqrt(weights_matrix),
                np.transpose(pca.components_),
            ),
        ),
        sample_names=(None,) * self.n_components,
    )
    

We can see that the code "matches" the TFG (undergraduate thesis) corresponding to this functionality. In that thesis, it is stated that the problem is equivalent to solving the multivariate PCA on the following matrix:

    $$ \mathbf{V}=\frac{\mathbf{X}\mathbf{W}^{1/2}}{\sqrt{N}(\mathbf{I}_M + \mathbf{P}_k)} $$

This formula has been extracted directly from the TFG, where the matrix operation is expressed as a fraction. Note, however, that the correct equation would have the inverse of the denominator multiplying the numerator from the right side. In the TFG, this equation is justified by stating that maximizing the penalized variance is equivalent to the following eigenvalue problem:

    $$ \frac{\mathbf{W}^{1/2} (\mathbf{X}^\intercal \mathbf{X})\mathbf{W}^{1/2} v_j(t)} {N (\mathbf{I}_M + \mathbf{P}_k)^\intercal(\mathbf{I}_M + \mathbf{P}_k)} = \lambda_j v_j(t) $$

    Then, the previous equation can be seen as the usual eigenvalue problem for the covariance matrix of the data, but with a modified covariance matrix. However, even though the previous equation is similar to $\mathbf{V}^\intercal \mathbf{V}$, there are issues with this notation and analysis.

    fda.usc implementation

    The implementation in fda.usc is the following:

     if (lambda > 0) {
        if (is.vector(P)) {
          P <- P.penalty(tt, P)
        }
        dimp <- dim(P)
        if (!(dimp[1] == dimp[2] & dimp[1] == J)) {
          stop("Incorrect matrix dimension P")
        }
        M <- solve(diag(J) + lambda * P)
        Xcen.fdata$data <- Xcen.fdata$data %*% t(M)
      }
    

    where

    • diag(J) is the identity matrix of size J
    • P is the penalty matrix

Therefore, the implementation in fda.usc is equivalent to the current implementation and also returns the identity matrix as the inner product matrix. The issue (moviedo5/fda.usc#6) has been opened in the fda.usc repository.

    Proposed solution

    Mathematical analysis

    Non-regularized FPCA summary

    In the usual PCA, the goal is to maximize the variance of the projections upon the principal components found subject to two restrictions. That is to say, maximize

    $$ \frac{1}{N} \mathbf{\phi^\intercal X^\intercal X\phi} $$

    subject to:

    • $\mathbf{\phi^\intercal \phi}=1$
    • The loadings are orthogonal $\mathbf{\phi_i^\intercal \phi_j}=0$ if $i\neq j$

    Regularized FPCA

The regularized version of FPCA aims to maximize the penalized sample variance. If we want to penalize the second derivative, we can define it as:

    $$ psv(\phi) = \frac{ \frac 1N \sum_{n=1}^N \left(\int \phi x_n\right)^2 } { |\phi|^2 + \lambda p(\phi,2) }, $$

    where $p(\phi,2)$ is the penalty function for the second derivative applied to $\phi$. The restrictions now are:

    • $|\phi|^2=1$
    • $\int \phi_i \phi_j + \lambda \int D^2\phi_i D^2\phi_j= 0$

    For simplicity, we will ignore the quadrature for now and assume $\mathbf{W=Id}$ (the integration weights). Then, we can make the following substitutions:

$$ psv(\phi) = \frac{ \frac 1N \sum_{n=1}^N \left(\int \phi x_n\right)^2 } { |\phi|^2 + \lambda p(\phi,2) } \approx \frac{ \frac 1N \boldsymbol \phi^\intercal \mathbf X^\intercal \mathbf X \boldsymbol \phi }{ \boldsymbol \phi^\intercal \boldsymbol \phi + \lambda \boldsymbol \phi^\intercal \mathbf P_2^\intercal \mathbf P_2 \boldsymbol \phi } = \frac{ \frac 1N \boldsymbol {\phi}^\intercal \mathbf X^\intercal \mathbf X \boldsymbol \phi }{ \boldsymbol \phi^\intercal \left(\mathbf{Id} + \lambda \mathbf P_2^\intercal \mathbf P_2\right)\boldsymbol \phi }, $$

    where $\mathbf P_2$ is the penalty matrix for the second derivative, $\boldsymbol \phi$ is the vector of $\phi$ evaluated in the grid points and $\mathbf X$ is the usual data matrix. That is to say, the functions evaluated in the grid points.

    We can also obtain:

$$ 0=\int \phi_i \phi_j + \lambda \int D^2\phi_i D^2\phi_j= \boldsymbol \phi_i^\intercal \boldsymbol \phi_j + \lambda \boldsymbol \phi_i ^\intercal \mathbf P_2^\intercal \mathbf P_2 \boldsymbol \phi_j = \boldsymbol \phi_i^\intercal \left(\mathbf{Id} + \lambda \mathbf P_2^\intercal \mathbf P_2\right)\boldsymbol \phi_j $$

    In light of these relationships, we can see that we can transform the problem back to the usual PCA problem using the Cholesky decomposition $\mathbf{L} \mathbf{L}^\intercal = \mathbf{Id} + \lambda \mathbf P_2^\intercal \mathbf P_2$. Then, we can apply the change of variable $\boldsymbol \phi = \mathbf{(L^\intercal)^{-1}} \boldsymbol \psi$. Using the Cholesky decomposition, we can obtain the following equation.

    $$ psv(\phi) = \frac{ \frac 1N \boldsymbol \psi^\intercal \mathbf{L}^{-1} \mathbf X^\intercal \mathbf X \mathbf{(L^\intercal)}^{-1} \boldsymbol \psi }{ \boldsymbol \psi^\intercal \mathbf {L}^{-1} \left(\mathbf{LL^\intercal}\right) \mathbf{(L^\intercal)}^{-1} \boldsymbol \psi }= \frac{ \frac 1N \boldsymbol \psi ^\intercal \Big(\mathbf X \mathbf {(L^\intercal)}^{-1}\Big)^\intercal \Big(\mathbf X \mathbf {(L^\intercal)}^{-1}\Big) \boldsymbol \psi } { \boldsymbol \psi^\intercal \boldsymbol \psi } $$

    Then the constraints become:

    $$ 1 = |\boldsymbol \phi|^2 = |(\mathbf L^\intercal)^{-1} \boldsymbol \psi|^2 = \boldsymbol \psi^\intercal \mathbf L^{-1} (\mathbf L^{-1})^\intercal \boldsymbol \psi $$

    $$ 0 = \boldsymbol \psi_i^\intercal \mathbf {L}^{-1} \left(\mathbf{LL^\intercal}\right) \mathbf{(L^\intercal)}^{-1} \boldsymbol \psi_j = \boldsymbol \psi_i^\intercal \boldsymbol \psi_j $$

Therefore, we can see that the problem is equivalent to the usual PCA problem with the following changes:

    • The data matrix $\mathbf X$ is replaced by $\mathbf X \mathbf {(L^\intercal)}^{-1}$.
    • Use the change of variable $\boldsymbol \phi = \mathbf{(L^\intercal)^{-1}} \boldsymbol \psi$ so that loadings obtained are orthogonal with respect to the new inner product given by $\mathbf{L}^{-1} (\mathbf L^{-1})^\intercal$.

Note that after the change of variable, the loadings will no longer have norm 1 with respect to the original inner product. However, this can easily be fixed by normalizing them at the end.
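
A minimal numerical sketch of this approach, assuming identity integration weights (W = Id); the function below is illustrative and is not the pull request's actual code:

    import numpy as np

    def regularized_fpca_sketch(X, P2, lam, n_components):
        # X: (n_samples, n_grid) functions evaluated on the grid.
        # P2: penalty matrix for the second derivative.
        # lam: regularization parameter lambda.
        n_grid = X.shape[1]

        # Cholesky factor: L L^T = Id + lambda * P2^T P2.
        L = np.linalg.cholesky(np.eye(n_grid) + lam * P2.T @ P2)

        # Change of variable: replace X by X (L^T)^{-1}.
        X_tilde = np.linalg.solve(L, X.T).T
        X_tilde -= X_tilde.mean(axis=0)

        # Ordinary PCA on the transformed data.
        _, _, Vt = np.linalg.svd(X_tilde, full_matrices=False)
        psi = Vt[:n_components].T            # loadings in the transformed space

        # Map back: phi = (L^T)^{-1} psi, orthogonal w.r.t. the new inner product.
        phi = np.linalg.solve(L.T, psi)

        # Renormalize, since phi no longer has norm 1 in the original metric.
        phi /= np.linalg.norm(phi, axis=0)
        return phi.T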

    Progress so far

This Cholesky decomposition approach has been implemented in the attached pull request. Note, however, that the code has not been optimized yet (it still uses np.linalg.inv instead of np.linalg.solve, for example), nor has it been tested extensively.

    Regularization parameter effect

The main issue remaining is that the regularization parameter $\lambda$ does not seem to have the same effect in the grid implementation as in the basis implementation. The reason may be related to the penalty matrix. To reproduce the issue, run the following code:

    import matplotlib.pyplot as plt
    import skfda
    from skfda.misc.operators import LinearDifferentialOperator
    from skfda.misc.regularization import L2Regularization
    from skfda.preprocessing.dim_reduction import FPCA  # path may vary by version
    from skfda.representation.basis import Fourier

    X, y = skfda.datasets.fetch_phoneme(return_X_y=True)
    X = X[:300]
    y = y[:300]
    
    
    def fit_plot(ax, title, regularization_parameter=0):
        fpca_regression = FPCA(
            n_components=4,
            regularization=L2Regularization(
                LinearDifferentialOperator(1),
                regularization_parameter=regularization_parameter,
            ),
        )
        fpca_regression.fit(X, y)
        fpca_regression.components_.plot(axes=ax)
        ax.set_title(title)
        ax.set_xlabel("")
        ax.set_ylabel("")
    
    
    fig, ax = plt.subplots(3, 2, figsize=(10, 10))
    
    fit_plot(ax[0, 0], "Grid not regularized")
    fit_plot(ax[0, 1], "Grid regularized", regularization_parameter=2.5)
    
    grid_points = X.grid_points
    X = X.to_basis(Fourier(n_basis=10))
    
    fit_plot(ax[1, 0], "Basis not regularized")
    fit_plot(ax[1, 1], "Basis regularized", regularization_parameter=0.125)
    
    # Convert back to grid using the same points as before
    X = X.to_grid(grid_points)
    
    fit_plot(ax[2, 0], "Grid from basis not regularized")
    fit_plot(ax[2, 1], "Grid from basis regularized", regularization_parameter=2.5)
    
    plt.show()
    

[image: FPCA components for the six cases above]

As can be seen, the regularization parameter has less of an effect in the basis implementation than in the grid implementation. However, by adjusting the parameters, the results do seem to match almost perfectly.

    Next steps

    • [ ] Fix the regularization parameter effect issue.
• [ ] Either change the tests so that they do not compare against fda.usc in this particular case, or wait for a response to the issue (moviedo5/fda.usc#6). Taking advantage of the migration from unittest to pytest, it would be possible to use pytest mechanisms to skip these tests while documenting the reason why they should not be executed.
    • [ ] Optimize and revise the implementation in #485
    • [ ] Investigate the fact that the results in FDataBasis might not have norm 1.
    opened by Ddelval 0
Releases (0.7.1)
  • 0.7.1(Jan 25, 2022)

  • 0.7(Jan 25, 2022)

    • TikhonovRegularization renamed to L2Regularization. The old name is deprecated. See #409 for details.
    • Add the option for MagnitudeShapePlot to plot the inlier ellipse.
    • Additional bug fixes.
    Source code(tar.gz)
    Source code(zip)
  • 0.6.1(Dec 23, 2021)

  • 0.6(Sep 12, 2021)

    Some highlights of this version:

    • Add typing support for most functionalities of the package. Now you can use Mypy to check for errors.
    • New depth-based classifiers
    • New visualization functions
      • Parametric plot
      • Outliergram
      • Multiple interactive plots
    • New historical functional regression model
    • Other minor additions and fixes
    Source code(tar.gz)
    Source code(zip)
  • 0.5(Dec 31, 2020)

    Some highlights of the new functionality:

    • Depth functions refactored
    • Removed C code and added fdasrsf as a dependency
    • Added new variable selection methods
    • Added Maximum Depth classifier
    • Optimized FDataGrid inner product
    Source code(tar.gz)
    Source code(zip)
  • 0.4(Jul 23, 2020)

    Adds functional component analysis, ANOVA and Hotelling tests, basis representation for multivariate and vector valued functions and Tikhonov regularization.

    Source code(tar.gz)
    Source code(zip)
Owner
Grupo de Aprendizaje Automático - Universidad Autónoma de Madrid
Machine Learning Group at Universidad Autónoma de Madrid