Easy pipelines for pandas DataFrames.

Overview

pdpipe ˨

Easy pipelines for pandas DataFrames (learn how!).

Website: https://pdpipe.github.io/pdpipe/

Documentation: https://pdpipe.github.io/pdpipe/doc/pdpipe/

>>> import pandas as pd
>>> df = pd.DataFrame(
...     data=[[4, 165, 'USA'], [2, 180, 'UK'], [2, 170, 'Greece']],
...     index=['Dana', 'Jane', 'Nick'],
...     columns=['Medals', 'Height', 'Born']
... )
>>> import pdpipe as pdp
>>> pipeline = pdp.ColDrop('Medals').OneHotEncode('Born')
>>> pipeline(df)
            Height  Born_UK  Born_USA
    Dana     165        0         1
    Jane     180        1         0
    Nick     170        0         0

1   Documentation

This is the repository of the pdpipe package, and this readme file is aimed at helping potential contributors to the project.

To learn more about how to use pdpipe, either visit pdpipe's homepage or read the online documentation of pdpipe.

2   Installation

Install pdpipe with:

pip install pdpipe

Some pipeline stages require scikit-learn; they will simply not be loaded if scikit-learn is not found on the system, and pdpipe will issue a warning. To use them you must also install scikit-learn.

Similarly, some pipeline stages require nltk; they will simply not be loaded if nltk is not found on your system, and pdpipe will issue a warning. To use them you must additionally install nltk.
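
To check up front which of these optional dependencies are available, something like the following could be used. This is a minimal sketch relying only on the standard library; the helper name has_module is made up for illustration and is not part of pdpipe.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


# scikit-learn is imported as `sklearn`; nltk is imported as `nltk`.
for package in ("sklearn", "nltk"):
    status = "found" if has_module(package) else "missing"
    print(f"{package}: {status}")
```

If a package is reported as missing, installing it with pip should make the corresponding stages available the next time pdpipe is imported.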

3   Contributing

The package author and current maintainer is Shay Palachy ([email protected]); you are more than welcome to approach him for help. Contributions are very welcome, especially since this package is very much in its infancy and many other pipeline stages can be added.

3.1   Installing for development

Clone:

git clone git@github.com:pdpipe/pdpipe.git

Install in development mode with test dependencies:

cd pdpipe
pip install -e ".[test]"

3.2   Running the tests

To run the tests, use:

python -m pytest

Note that pytest runs are configured by the pytest.ini file. Read it to see the exact pytest arguments used.

3.3   Adding tests

At the time of writing, pdpipe is maintained with a test coverage of 100%. Although challenging, I hope to maintain this status. If you add code to the package, please make sure you thoroughly test it. Codecov automatically reports changes in coverage on each PR, so a PR that reduces test coverage will not be reviewed until that is fixed.

Tests reside under the tests directory in the root of the repository. Each module has a separate test folder, with each class - usually a pipeline stage - having a dedicated file (always starting with the string "test") containing several tests (each a global function starting with the string "test"). Please adhere to this structure, and try to separate test cases into different test functions; this allows us to quickly focus on problem areas and use cases. Thank you! :)

3.4   Code style

pdpipe code is written to adhere to the coding style dictated by flake8. Practically, this means that one of the jobs that runs on the project's Travis for each commit and pull request checks for a successful run of the flake8 CLI command in the repository's root. This means pull requests will be flagged red by the Travis bot if non-flake8-compliant code was added.

To solve this, please run flake8 on your code (whether through your text editor/IDE or using the command line) and fix all resulting errors. Thank you! :)

3.5   Adding documentation

This project is documented using the numpy docstring conventions, which were chosen as they are perhaps the most widely-spread conventions that are both supported by common tools such as Sphinx and result in human-readable docstrings (in my personal opinion, of course). When documenting code you add to this project, please follow these conventions.

Additionally, if you update this README.rst file, use python setup.py checkdocs to validate it compiles.

3.6   Adding doctests

Please note that for pdoc3 - the Python package used to generate the HTML documentation files for pdpipe - to successfully include doctests in the generated documentation files, the whole doctest must be indented relative to the indentation of the opening multi-line string, like so:

class ApplyByCols(PdPipelineStage):
    """A pipeline stage applying an element-wise function to columns.

    Parameters
    ----------
    columns : str or list-like
        Names of columns on which to apply the given function.
    func : function
        The function to be applied to each element of the given columns.
    result_columns : str or list-like, default None
        The names of the new columns resulting from the mapping operation. Must
        be of the same length as columns. If None, behavior depends on the
        drop parameter: If drop is True, the name of the source column is used;
        otherwise, the name of the source column is used with the suffix
        '_app'.
    drop : bool, default True
        If set to True, source columns are dropped after being mapped.
    func_desc : str, default None
        A function description of the given function; e.g. 'normalizing revenue
        by company size'. A default description is used if None is given.


    Example
    -------
        >>> import pandas as pd; import pdpipe as pdp; import math;
        >>> data = [[3.2, "acd"], [7.2, "alk"], [12.1, "alk"]]
        >>> df = pd.DataFrame(data, [1,2,3], ["ph","lbl"])
        >>> round_ph = pdp.ApplyByCols("ph", math.ceil)
        >>> round_ph(df)
           ph  lbl
        1   4  acd
        2   8  alk
        3  13  alk
    """

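The extra indentation does not stop the examples from running as doctests. The following is a minimal sketch, using only the standard library, that checks an indented doctest with the doctest module; the function ceil_all is a made-up stand-in and not part of pdpipe.

```python
import doctest
import math


def ceil_all(values):
    """Round every number in a list up to the nearest integer.

    Example
    -------
        >>> ceil_all([3.2, 7.2, 12.1])
        [4, 8, 13]
    """
    return [math.ceil(v) for v in values]


# Parse the docstring and run its examples. The extra indentation is fine
# as long as the expected output is indented like the >>> line.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    ceil_all.__doc__, {"ceil_all": ceil_all}, "ceil_all", None, 0
)
results = doctest.DocTestRunner(verbose=False).run(test)
print(results)
```

The printed result reports how many examples were attempted and how many failed.
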
4   Credits

Created by Shay Palachy ([email protected]).

Comments
  • My article about your package

    Hi,

    I found your package quite useful and pretty neat. I may contribute to it later with ideas and suggestions for enhancement.

    For now, I have written an article about it and it has gained traction pretty fast.

    Check out the article here.

    question 
    opened by tirthajyoti 14
  • Pickling of pipelines

    Hi,

    I stumbled upon your excellent library when researching ways of serializing and pickling pipelines for reuse on new datasets, and found your approach to be most promising. However, when attempting to pickle the following pipeline from your documentation:

    pipeline = pdp.ColDrop("Name") + pdp.OneHotEncode("Label")
    pipeline += pdp.MapColVals("Job", {"Part": True, "Full": True, "No": False})
    pipeline += pdp.PdPipeline([pdp.ColRename({"Job": "Employed"})])
    joblib.dump(pipeline, "pipeline.pkl")

    ... I get the following Exception:

    PicklingError: Can't pickle <function ColRename.__init__.<locals>._tprec at 0x000001CBC33864C0>: it's not found as pdpipe.basic_stages.ColRename.__init__.<locals>._tprec

    I get similar exceptions when attempting other function-based steps, like ApplyToRows or ColByFrameFunc.

    Is there a way to get around this? As an example: How would I solve applying a function that calculates age from a column containing birthdate and pickling this?

    In general I'd love to see more explanations and examples on pickling pipelines in your otherwise excellent documentation!
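
The `<locals>` part of the error message above points at a general Python limitation rather than something specific to pdpipe: functions defined inside another function cannot be pickled by the standard pickle module, while importable functions can. The sketch below illustrates this with the standard library only; make_doubler and can_pickle are made-up names, and this is not a description of how pdpipe resolved the issue.

```python
import operator
import pickle
from functools import partial


def make_doubler():
    """Return a function defined inside another function (a local function)."""
    def _double(x):
        return 2 * x
    return _double


def can_pickle(obj) -> bool:
    """Return True if `obj` survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


local_func = make_doubler()
# A partial over an importable function is pickled by reference to that function.
importable_func = partial(operator.mul, 2)

print(can_pickle(local_func))       # local functions cannot be pickled
print(can_pickle(importable_func))  # importable functions can
restored = pickle.loads(pickle.dumps(importable_func))
print(restored(21))
```

Under this reading, a user-supplied function such as an age calculation is more likely to pickle cleanly when it is defined at the top level of an importable module than when it is a lambda or a nested function.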

    bug complex issue 
    opened by MagBuchSB1 13
  • Application context objects should not be kept by default + add way to supply fit context

    pdpipe uses PdpApplicationContext objects in two ways:

    1. As the fit_context that should be kept as-is after a fit, and used by stages to pass to one another parameters that should also be used on transform time.
    2. As the application_context that should be discarded after a specific application is done, and is used by stages to feed consecutive stages with context. It can be added to by supplying apply(context={}), fit_transform(context={}) or transform(context={}) with a dict that will be used to update the application context.

    Two changes are required:

    1. At the moment there is a single context parameter to application functions that is used to update both the fit and the application context. I think they should be two, one for each type of context.
    2. At the moment the application_context is not discarded when the application is done. Fixing this is as simple as adding a self.application_context = None expression at the PdPipeline level in the couple of relevant cases.
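
The two changes could look roughly like the toy sketch below. It is only an illustration under assumptions: ToyPipeline is a hypothetical stand-in, not pdpipe's PdPipeline, and the parameter names fit_context and application_context are made up for the example.

```python
class ToyPipeline:
    """Hypothetical stand-in for a pipeline; not pdpipe's PdPipeline."""

    def __init__(self):
        self.fit_context = None
        self.application_context = None

    def _run(self, data, fit_context, application_context):
        # A real pipeline would hand both contexts to its stages here.
        return data

    def fit_transform(self, data, fit_context=None, application_context=None):
        # The fit context is kept as-is for later transforms.
        self.fit_context = dict(fit_context or {})
        self.application_context = dict(application_context or {})
        try:
            return self._run(data, self.fit_context, self.application_context)
        finally:
            # The application context is discarded once the application is done.
            self.application_context = None

    def transform(self, data, application_context=None):
        self.application_context = dict(application_context or {})
        try:
            return self._run(data, self.fit_context, self.application_context)
        finally:
            self.application_context = None


pipeline = ToyPipeline()
pipeline.fit_transform(
    [1, 2], fit_context={"mean": 1.5}, application_context={"run": 1}
)
print(pipeline.fit_context, pipeline.application_context)
```

Using try/finally means the application context is cleared even when a stage raises.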
    enhancement good first issue 
    opened by shaypal5 9
  • Docs/Improve documentation

    Summary

    Issue #28

    • Fix spellings
    • Fix docstrings
      • Example: Missing callable in columns : str or iterable, optional
    • Add code examples when use cases are complex
      • Trying to use existing tests when possible
      • If cannot, tests are added
    • Enable links for crucial references
      • Example: pdpipe.cq to pdpipe.cq

    Progress

    • [x] documentation.md
    • [x] pdpipe.basic_stages
    • [x] pdpipe.col_generation
    • [x] pdpipe.cond
    • [x] pdpipe.core
    • [x] pdpipe.cq
    • [x] pdpipe.nltk_stages
    • [x] pdpipe.skintegrate
    • [x] pdpipe.sklearn_stages
    • [x] pdpipe.text_stages
    • [x] pdpipe.wrappers
    opened by yarkhinephyo 8
  • Improve the docs

    Excellent work! I love this tool very much. I think you can improve the docs. I tried several times to figure out how to use AdHocStage to do sum/mean operations. The description of AggByCols is even less clear.

    enhancement help wanted good first issue documentation 
    opened by gandad 7
  • Fix setup_required in setup.py

    In the setup.py there are the following lines:

    install_requires=INSTALL_REQUIRES,
    setup_requires=INSTALL_REQUIRES,

    I think they should only be install_requires and not setup_requires. This causes an issue where dependencies during conda packaging can't be met. Removal of the line fixes the problem.

    Does that sound reasonable? Sorry for the earlier confusion on my part about sklearn :)

    bug 
    opened by Silun 6
  • Issue #27 changes

    1.1 Not sure what exactly is needed as "description". Implemented the class name for now.

    3. Maybe this can be implemented whenever #19 is done.

    enhancement 
    opened by naveenkaushik2504 6
  • About ColByFrameFunc

    Expected Result

    Execute the add_Col function according to the condition on the df['A'] column value.

    Actual Result

    ERROR: PipelineApplicationError: Exception raised in stage [ 0] PdPipelineStage: Applying a function to generate column A.

    Reproduction Steps

    import pandas as pd
    import pdpipe as pdp

    data = [[3, 3], [2, 4], [1, 5]]
    df_usage = pd.DataFrame(data, [1,2,3], ["A","B"])
    
    def add_Col(a,b):
        if a > 2:
            return a+ b
        else:
            return a+10
    
    func = lambda df:add_Col(df['A'],df['B'])
    
    pipeline = pdp.PdPipeline([
        pdp.ColByFrameFunc("A",func,follow_column='B'),
        pdp.ColRename({'A':'CloA','B':'ColB'})
    ])
    
    df_usage = pipeline(df_usage,verbose=True)
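
A likely cause, assuming the function passed to ColByFrameFunc receives the whole DataFrame so that df['A'] is a Series: the condition `if a > 2` asks for the truth value of an entire Series, which pandas refuses to evaluate. A vectorized version of the same rule can be sketched with plain pandas and numpy as follows; add_col_vectorized is a made-up name for the example.

```python
import numpy as np
import pandas as pd

df_usage = pd.DataFrame([[3, 3], [2, 4], [1, 5]], [1, 2, 3], ["A", "B"])


def add_col_vectorized(df: pd.DataFrame) -> pd.Series:
    """Row-wise rule: A + B where A > 2, otherwise A + 10."""
    values = np.where(df["A"] > 2, df["A"] + df["B"], df["A"] + 10)
    return pd.Series(values, index=df.index)


print(add_col_vectorized(df_usage).tolist())
```

A function of this shape takes a DataFrame and returns a Series of the same length, which is what a frame-level function is generally expected to produce.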
    
    

    System Information

    MacOS 10.15.7 python version:3.9.10

    invalid 
    opened by banduoba 5
  • Add example for GridSearchCV parameter tuning

    Hi, please add to the documentation a simple example for a common scenario in ML models: how to use pdpipe with GridSearchCV.

    It is not clear how to do it, or whether this is an expected scenario at all.

    This basic example will help people use pdpipe for all ML models and parameter-tuning techniques.

    Best, Stefan

    enhancement question 
    opened by stefansimik 5
  • Feature Request: More information in the messages of PipelineApplicationError

    At the moment, there is too little information in the messages of PipelineApplicationError.

    Things that should be added:

    1. The type of the pipeline stage (so the name of the class).
       1.1. Actually, maybe the description?! :)
    2. The index in the pipeline (obviously only when part of a pipeline; requires catching and rethrowing at the pipeline level).
    3. The label of the pipeline stage, when #19 is implemented.
    enhancement good first issue 
    opened by shaypal5 5
  • exclude_columns does not work for ColumnsBasedPipelineStage

    When setting exclude_columns as a list of strings in the constructor of ColumnsBasedPipelineStage the member variable self._exclude_columns is set to a tuple instead of the passed in list. This leads to errors with __get_cols_by_arg which ends up returning the tuple and not the list itself.

    Code where the issue appears to be

    https://github.com/pdpipe/pdpipe/blob/763320db326e9a49f51bd7fb9ea65944f51869f2/pdpipe/core.py#L602

    The line which seems incorrect is linked above. _interpret_columns_param returns a tuple, so we should be assigning the individual components of the tuple, in that line, as opposed to assigning the entire return value to 'self._exclude_columns'; something like

    self._exclude_columns, _ = self._interpret_columns_param(...
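
The difference between the two assignments can be seen in a small standalone sketch. The function interpret_columns_param below is a hypothetical stand-in that merely returns a two-element tuple; what the real _interpret_columns_param returns in its second slot is not stated here and is assumed for illustration.

```python
def interpret_columns_param(columns):
    """Hypothetical stand-in returning a (columns, description) tuple."""
    cols = [columns] if isinstance(columns, str) else list(columns)
    return cols, "columns " + ", ".join(cols)


# Assigning the whole return value stores the tuple itself.
exclude_wrong = interpret_columns_param(["a", "b"])

# Unpacking keeps only the first component, as suggested above.
exclude_right, _ = interpret_columns_param(["a", "b"])

print(type(exclude_wrong).__name__, exclude_right)
```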

    bug 
    opened by jjlee88 4
  • Read & write pipeline configuration from/to YAML

    This is a feature that has been discussed and requested several times.

    A couple of independent projects tried to do exactly that, but I believe it is worthwhile to have one official API exposed by the package, and also kept up-to-date:

    • https://github.com/blakeNaccarato/pdpipewrench
    • https://github.com/altescy/pdpcli
    • https://github.com/neilbartlett/datapipeliner
    enhancement complex issue 
    opened by shaypal5 0
  • Feature Request: Contextual params using application context

    I want a way to supply pipeline stage constructor parameters with future-like placeholders, so that actual values will be determined by prior stages on application time.

    I would assign the name of the future ApplicationContext key holding the value after it was calculated — wrapped by a unique class — to the parameter.

    The constructor will hold on to this object, and will — in application time — pull the value of the right key from either the fit context or the application context (depending on how I set it: I might want this value to be set on pipeline fit, or to be set on each application dynamically, even on transforms when the pipeline is fitted), and use it for the transformation. Default should probably be the fit context?

    Here's an example:

    import numpy as np; import pandas as pd; import pdpipe as pdp;
    
    def scaling_decider(X: pd.DataFrame) -> str:
        """Determines which type of scaling to apply by examining all numerical columns."""
        numX = X.select_dtypes(include=np.number)
        for col in numX.columns:
            # this is nonsense logic, just an example
            if np.std(numX[col]) > 2 * np.mean(numX[col]):
                return 'StandardScaler'
        return 'MinMaxScaler'
    
    pipeline = pdp.PdPipeline(stages=[
        pdp.ColDrop(pdp.cq.StartWith('n_')),
        pdp.ApplicationContextEnricher(scaling_type=scaling_decider),
        pdp.Scale(
            # fit=False means it will take it from the application context, and not fit context
            scaler=pdp.contextual('scaling_type', fit=False),  
            joint=True,
        ),
    ])
    

    Design

    This has to be implemented at the PdPipeline base class. Since the base class can't hijack constructor arguments, I think the contract with extending classes should be:

    1. When implementing a class extending PdPipeline, if you want to enjoy support for contextual constructor parameters, you MUST delay any initialization of inner state objects to fit_transform, so that the fit/application context is available on initialization (it is NOT available at pipeline stage construction and initialization, after all).

    2. pdp.contextual is a factory function that returns contextual parameter placeholder objects. Code using it shouldn't really care about it, as it should never interact with the resulting objects directly. I think.

    3. PdPipeline can auto-magically make sure that any attribute of a PdPipeline instance that is assigned a pdp.contextual object in the constructor (e.g. self.k = k, where the k constructor argument was provided with k=pdp.contextual('pca_k')) will be hot-swapped with a concrete value by the time we wish to use it in fit_transform or transform (for example, when we call self.pca_ = PCA(k=self.k)). It can also do so for any such object that is contained in any iterable or dict-like attribute (so if I have self._pca_kwargs = {...} in my constructor, I can safely call self.pca_ = PCA(**self._pca_kwargs) in fit_transform()).

    Implementation thoughts

    To make this efficient, since this means accessing sub-class instance attributes on pipeline transformations, I have a few thoughts:

    1. The contextuals module should have a global variable such as CONTEXTUALS_ARE_ON = False. Then, the pdp.contextual factory function sets global CONTEXTUALS_ARE_ON; CONTEXTUALS_ARE_ON = True when called. Then, we condition the whole inspection-heavy logic on this indicator variable, so that if our user never called pdp.contextual during the current kernel, runtime is saved.

    2. I first thought pdp.contextual could somehow register _ContextualParam objects in a registry we could use to find what needed to be swapped, but actually this wouldn't help, as they won't know which attribute of which pipeline stage they were assigned to.

    3. We thus have to scan sub-class attributes, but we can do so only if, and only after, pdp.contextual was called, and right after pipeline stage initialization. Moreover, we can create a literal list of all attribute names we should ignore, stored in pdpipe.core as a global: e.g. _IGNORE_PDPSTAGE_ATT = ['_desc', '_name'], etc.; that is, everything we know isn't an attribute the sub-class declared. Then, we can check any attribute that isn't one of these. This can be done in pdpipe.PdPipelineStage.__init__(), since it's called (by contract; we can demand that from extending subclasses) at the end of the __init__() method of subclasses. When we find that such an attribute holds a pdp.contextual, we register it at the pdp.contextuals module, in some global dict (or something more sophisticated), keyed by the attribute name. We can also register the containing stage object.

    Then, in a stage's fit_transform and transform methods, if the current stage object is registered for contextual hot-swapping, we find the concrete contextual value of any attribute registered for this stage (in either the self.fit_context object or the self.application_context object that the pipeline injects into all stages during applications of the pipeline) and hot-swap it literally: This will look something like setattr(self, 'k', self.fit_context['pca_k']), since we're at pdpipe.PdPipelineStage.fit_transform(), and the self object is an instance of the subclass requiring the hot swap (in this case, pdp.Decompose).
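
The hot-swapping idea can be sketched in a few lines of plain Python. Everything here is hypothetical: Contextual and ToyStage are made-up names standing in for the proposed pdp.contextual placeholder and a pipeline stage, and, for brevity, the placeholders are recorded on the stage itself rather than in a global registry.

```python
class Contextual:
    """Hypothetical placeholder naming the context key that will hold a value."""

    def __init__(self, key, fit=True):
        self.key = key
        self.fit = fit  # True: read from fit context; False: application context


class ToyStage:
    """Hypothetical stage; real pdpipe stages are more involved."""

    def __init__(self, k):
        self.k = k  # may be a concrete value or a Contextual placeholder
        # Record which attributes hold placeholders, right after construction.
        self._contextuals = {
            name: value
            for name, value in vars(self).items()
            if isinstance(value, Contextual)
        }

    def _swap_contextuals(self, fit_context, application_context):
        # Replace each placeholder attribute with its concrete value.
        for name, placeholder in self._contextuals.items():
            source = fit_context if placeholder.fit else application_context
            setattr(self, name, source[placeholder.key])

    def transform(self, data, fit_context, application_context):
        self._swap_contextuals(fit_context, application_context)
        return [x * self.k for x in data]


stage = ToyStage(k=Contextual("pca_k"))
print(stage.transform([1, 2, 3], fit_context={"pca_k": 2}, application_context={}))
```

Because the placeholders are remembered separately from the attribute they were assigned to, the swap can be repeated on every application with whatever context is current.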

    enhancement complex issue 
    opened by shaypal5 0
  • External file loading problem with pdpipe import

    Discussed in https://github.com/pdpipe/pdpipe/discussions/97

    Originally posted by Dranikf, March 14, 2022

    I will start with an example. I create a test.py file in some directory; it may look like this:

    from sys import path
    import pdpipe
    
    path.append("<path to external folder>/some_external_folder")
    import temp
    

    In ".../some_external_folder" I have added a temp.py file, which contains a 'hello world' print. When I try to run the first file I get an error:

    Traceback (most recent call last):
      File "<first file folder path>/test.py", line 6, in <module>
        import temp
    ModuleNotFoundError: No module named 'temp'
    

    But when I remove import pdpipe, the program runs. What can cause this behavior, and how can it be corrected?

    bug 
    opened by shaypal5 4
  • Feature: A generic, column-based stage wrapper for any matrix-to-matrix sklearn transformer.

    Basically, the column parameter (which can be a single column, a list of columns or a dynamic ColumnQualifier object) selects a subset of the input dataframe's columns; the wrapped sklearn transformer transforms just this sub-dataframe, and the stage puts the result back in the right place.
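
The core of such a wrapper might look like the sketch below. It is an illustration under assumptions, not the proposed stage: transform_columns and HalveTransformer are made-up names, the toy transformer only mimics the fit_transform method of an sklearn transformer, and only a plain list of column names is handled (no ColumnQualifier support).

```python
import pandas as pd


class HalveTransformer:
    """Toy stand-in for an sklearn-style transformer (has fit_transform)."""

    def fit_transform(self, X):
        return X.to_numpy() / 2.0


def transform_columns(df, columns, transformer):
    """Apply `transformer` to `columns` only and put the result back in place."""
    result = df.copy()
    sub_df = df[columns]
    transformed = transformer.fit_transform(sub_df)
    result[columns] = pd.DataFrame(transformed, index=df.index, columns=columns)
    return result


df = pd.DataFrame({"a": [2, 4], "b": [6, 8], "c": ["x", "y"]})
out = transform_columns(df, ["a", "b"], HalveTransformer())
print(out)
```

The untouched column c keeps both its values and its position, which is the "puts back in the right place" part of the description.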

    enhancement 
    opened by shaypal5 0
  • Chore: Add pickling tests for all pipeline stages

    Issue #71 pointed at a pickling problem with the ColRename stage, with a fix released in v0.0.68. After fixing that, a couple of additional unpicklable stages were found and fixed, which was released in v0.0.69.

    Everything should be pickle-able now, BUT not all stages are tested for this. This situation should be remedied sooner rather than later.

    chore tests 
    opened by shaypal5 0
Releases(v0.3.2)