Universal 1d/2d data containers with Transformers functionality for data analysis.

Overview

XPandas (extended Pandas) implements 1D and 2D data containers for type-heterogeneous tabular data, in which entries may be of any type, and encapsulates feature extraction and transformation modelling in an sklearn-compatible transformer interface.

Quickstart

Install the latest version

$ pip install xpandas

and run the example Jupyter notebook

$ jupyter notebook examples/ExampleUsage.ipynb

Documentation

The full documentation is available at https://alan-turing-institute.github.io/xpandas/.

Acknowledgements

  • Bernd Bischl (@berndbischl), who mentioned the idea of a general data container with transformers attached to columns in personal discussion with Franz Kiraly during a London visit in 2016.
  • Franz Kiraly (@fkiraly), who initiated and funded the project up to release, and who substantially contributed to the API design.
  • Haoran Xue (@HaoranXue), who, under the supervision of Franz Kiraly, earlier completed a thesis for a degree at UCL on the topic, and who wrote a similar package as part of it. No code was re-used in the creation of the XPandas package.

List of developers and contributors

Comments
  • Acknowledgments

    Before I forget: somewhere prominently the following should be acknowledged in some form, not necessarily in the below:

    • Bernd Bischl, who mentioned the idea of a general data container with transformers attached to columns in personal discussion during a London visit in 2016.
    • Myself, having (in my opinion) substantially contributed through the API design (?).
    • Haoran Xue, who completed a thesis on the topic earlier. While no code was transferred, lessons that were learnt may have been transferred.

    opened by fkiraly 2
  • Improved documentation

    This pull request improves the readability of the documentation.

    While going through your codebase, I realised that there is a lot of redundancy in the module naming, e.g. /transformers/transformers/series_transformers/series_transformer.py instead of /transformers/series/series_transformer.py. Is there a specific reason for that? If not, I'd suggest refactoring the modules into a more straightforward naming structure.

    opened by frthjf 1
  • sensible default for transformation: column replacement

    Currently, a transformation adds the transformer output while retaining the original column. The default should instead be to replace the column.

    To retain the original column, use identityTransformer (to be implemented).
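The two behaviours discussed in this issue can be sketched with plain pandas rather than the XPandas API. This is only an illustration under stated assumptions: the helper `apply_to_column` and its `keep_original` flag are hypothetical names introduced here and are not part of XPandas.

```python
import pandas as pd


def apply_to_column(df, column, func, keep_original=False):
    """Apply func to one column (hypothetical helper, not XPandas API).

    keep_original=False: the transformed values replace the column,
    which is the default proposed in this issue.
    keep_original=True: the output is added as a new column and the
    original column is retained.
    """
    out = df.copy()
    transformed = out[column].map(func)
    if keep_original:
        out[column + "_transformed"] = transformed
    else:
        out[column] = transformed
    return out


df = pd.DataFrame({"text": ["a", "bb", "ccc"], "n": [1, 2, 3]})

replaced = apply_to_column(df, "text", len)
retained = apply_to_column(df, "text", len, keep_original=True)

print(list(replaced.columns))
print(list(retained.columns))
```

With replacement as the default, the output keeps the same column layout as the input, and retaining the original becomes an explicit choice.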

    opened by fkiraly 0
  • tutorial: separate data container from transformer tutorial

    The structure should be changed to: (1) data containers (XSeries and XDataFrame), (2) transformer functionality.

    The user should be made aware that (1) is a separate interface concept, on top of which (2) may be invoked, but the two are not necessarily tied together.

    opened by fkiraly 0
  • Bump numpy from 1.15.2 to 1.22.0

    Bumps numpy from 1.15.2 to 1.22.0.

    Release notes

    Sourced from numpy's releases.

    v1.22.0

    NumPy 1.22.0 Release Notes

    NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

    • Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
    • A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across application such as CuPy and JAX.
    • NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
    • New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
    • A new configurable allocator for use by downstream projects.

    These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

    The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

    Expired deprecations

    Deprecated numeric style dtype strings have been removed

    Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

    (gh-19539)

    Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

    numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

    (gh-19615)

    ... (truncated)

    dependencies 
    opened by dependabot[bot] 0
  • Bump ipython from 7.0.1 to 7.16.3

    Bumps ipython from 7.0.1 to 7.16.3.

    dependencies 
    opened by dependabot[bot] 0
  • will it work for multivariate time series prediction both regression and classification

    Great code, thanks. Could you clarify whether it will work for multivariate time series prediction, both regression and classification?

    1. Where all values are continuous:

           weight  height  age  target
        1  56      160     34   1.2
        2  77      170     54   3.5
        3  87      167     43   0.7
        4  55      198     72   0.5
        5  88      176     32   2.3

    2. Or where the values are a mixture of continuous and categorical, for example 2 dimensions with continuous values and 3 dimensions with categorical values:

           color   weight  gender  height  age  target
        1  black   56      m       160     34   yes
        2  white   77      f       170     54   no
        3  yellow  87      m       167     43   yes
        4  white   55      m       198     72   no
        5  white   88      f       176     32   yes

    opened by Sandy4321 0
  • Many standard methods do not work (properly) on XDataFrame with hierarchical data

    # loading some time-series data
    from io import BytesIO
    from zipfile import ZipFile
    from urllib.request import urlopen
    from xpandas.data_container import XSeries, XDataFrame
    import numpy as np
    import pandas as pd
    
    def read_data(file):
        data = file.readlines()
        rows = [row.decode('utf-8').strip().split('  ') for row in data]
        X = pd.DataFrame(rows, dtype=float)  # np.float was removed in NumPy 1.24
        y = X.pop(0)
        ts = XSeries([row for _, row in X.iterrows()])
        X = XDataFrame({'ts1': ts, 'ts2': ts})
        return X, y
    
    url = 'http://www.timeseriesclassification.com/Downloads/GunPoint.zip'
    url = urlopen(url)
    zipfile = ZipFile(BytesIO(url.read()))
    file = zipfile.open('GunPoint_TRAIN.txt')
    X, y = read_data(file)
    
    X.mean()  # returns an empty series rather than the mean of the series; the same holds for many other methods such as .std() and .median()
    
    X.apply(np.mean)  # breaks
    
    X['ts1'].mean()  # breaks
    
    X['ts1'].apply(np.mean)  # works
    
    X['ts1'].apply(np.percentile, args=(25,))  # breaks: args are not passed on
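
Since apply with a single-argument function is reported to work above, one possible workaround for the last case is to bind the extra argument in a closure instead of passing `args`. The sketch below uses a plain pandas Series with toy data in place of an XSeries; whether the same call behaves identically on an XSeries is an assumption that has not been checked here.

```python
import numpy as np
import pandas as pd

# Each cell holds a whole series, stored here as a plain list (toy data).
nested = pd.Series([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [10.0, 20.0, 30.0, 40.0, 50.0],
])

# Bind the percentile in a closure so that apply only ever sees a
# one-argument function and no args=(25,) needs to be forwarded.
q25 = nested.apply(lambda cell: np.percentile(cell, 25))

print(q25.tolist())
```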
    
    opened by mloning 0
  • Slicing single row of XDataFrame does not work

    Slicing a single row of an XDataFrame does not work, probably because it tries to return a series, which fails because the column types are heterogeneous. Instead, one may want to return an XDataFrame with a single row.

    import pandas as pd
    from xpandas.data_container import XDataFrame
    
    iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
    iris.iloc[0] # works
    
    irisx = XDataFrame(iris) 
    irisx.iloc[0] # breaks
    
    
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-32-b46ce52f9af0> in <module>
    ----> 1 irisx.iloc[0]
    
    ~/.conda/envs/sktime/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
       1476 
       1477             maybe_callable = com._apply_if_callable(key, self.obj)
    -> 1478             return self._getitem_axis(maybe_callable, axis=axis)
       1479 
       1480     def _is_scalar_access(self, key):
    
    ~/.conda/envs/sktime/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
       2102             self._validate_integer(key, axis)
       2103 
    -> 2104             return self._get_loc(key, axis=axis)
       2105 
       2106     def _convert_to_indexer(self, obj, axis=None, is_setter=False):
    
    ~/.conda/envs/sktime/lib/python3.7/site-packages/pandas/core/indexing.py in _get_loc(self, key, axis)
        143         if axis is None:
        144             axis = self.axis
    --> 145         return self.obj._ixs(key, axis=axis)
        146 
        147     def _slice(self, obj, axis=None, kind=None):
    
    ~/.conda/envs/sktime/lib/python3.7/site-packages/pandas/core/frame.py in _ixs(self, i, axis)
       2624                                                       index=self.columns,
       2625                                                       name=self.index[i],
    -> 2626                                                       dtype=new_values.dtype)
       2627                 result._set_is_copy(self, copy=copy)
       2628                 return result
    
    ~/.conda/envs/sktime/lib/python3.7/site-packages/xpandas/data_container/data_container.py in __init__(self, *args, **kwargs)
         71         check_result, data_type = _check_all_elements_have_the_same_property(data, type)
         72         if not check_result:
    ---> 73             raise ValueError('Not all elements the same type')
         74 
         75         if data_type is not None:
    
    ValueError: Not all elements the same type
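
For comparison, plain pandas already distinguishes the two return shapes: a scalar position yields a Series, while a list of positions yields a one-row DataFrame. The sketch below shows this on a small stand-in frame (the column values are made up). Whether `irisx.iloc[[0]]` avoids the error on an XDataFrame is not verified here.

```python
import pandas as pd

# Small stand-in for the iris data, with mixed column types.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "species": ["setosa", "setosa"],
})

row_as_series = df.iloc[0]    # scalar position: pandas returns a Series
row_as_frame = df.iloc[[0]]   # list of positions: pandas returns a one-row DataFrame

print(type(row_as_series).__name__, type(row_as_frame).__name__)
print(row_as_frame.shape)
```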
    
    opened by mloning 0
Releases(1.0.2)
Owner
The Alan Turing Institute
The UK's national institute for data science and artificial intelligence.