functional data manipulation for pandas

pandas-ply: functional data manipulation for pandas

pandas-ply is a thin layer which makes it easier to manipulate data with pandas. In particular, it provides elegant, functional, chainable syntax in cases where pandas would require mutation, saved intermediate values, or other awkward constructions. In this way, it aims to move pandas closer to the "grammar of data manipulation" provided by the dplyr package for R.

For example, take the dplyr code below:

flights %>%
  group_by(year, month, day) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 & dep > 30)

The most common way to express this in pandas is probably:

grouped_flights = flights.groupby(['year', 'month', 'day'])
output = pd.DataFrame()
output['arr'] = grouped_flights.arr_delay.mean()
output['dep'] = grouped_flights.dep_delay.mean()
filtered_output = output[(output.arr > 30) & (output.dep > 30)]

pandas-ply lets you instead write:

(flights
  .groupby(['year', 'month', 'day'])
  .ply_select(
    arr = X.arr_delay.mean(),
    dep = X.dep_delay.mean())
  .ply_where(X.arr > 30, X.dep > 30))

In our opinion, this pandas-ply code is cleaner, more expressive, more readable, more concise, and less error-prone than the original pandas code.

Explanatory notes on the pandas-ply code sample above:

  • pandas-ply's methods (like ply_select and ply_where above) are attached directly to pandas objects and can be used immediately, without any wrapping or redirection. They start with a ply_ prefix to distinguish them from built-in pandas methods.
  • pandas-ply's methods are named for (and modelled after) SQL's operators. (But keep in mind that these operators will not always appear in the same order as they do in a SQL statement: SELECT a FROM b WHERE c GROUP BY d probably maps to b.ply_where(c).groupby(d).ply_select(a).)
  • pandas-ply includes a simple system for building "symbolic expressions" to provide as arguments to its methods. X above is an instance of ply.symbolic.Symbol. Operations on this symbol produce larger compound symbolic expressions. When pandas-ply receives a symbolic expression as an argument, it converts it into a function. So, for instance, X.arr > 30 in the above code could have instead been provided as lambda x: x.arr > 30. Use of symbolic expressions allows the lambda x: to be left off, resulting in less cluttered code.
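To make the last note concrete, here is a minimal sketch of how a symbolic expression could be recorded and later converted into a function. It is not pandas-ply's actual implementation: the classes Expr, Symbol, _GetAttr, _Call and _BinOp, and the helper to_function, are hypothetical names chosen for illustration, and a SimpleNamespace stands in for a DataFrame so the example runs without pandas.

```python
import operator
from types import SimpleNamespace


class Expr:
    """A deferred computation; nothing runs until it is evaluated."""

    def _eval(self, value):
        raise NotImplementedError

    def __getattr__(self, name):
        # Only reached for names not found normally; ignore dunder probes.
        if name.startswith("__"):
            raise AttributeError(name)
        return _GetAttr(self, name)

    def __call__(self, *args, **kwargs):
        return _Call(self, args, kwargs)

    def __gt__(self, other):
        return _BinOp(operator.gt, self, other)

    def __lt__(self, other):
        return _BinOp(operator.lt, self, other)


class Symbol(Expr):
    """The placeholder itself: evaluates to whatever value is supplied."""

    def _eval(self, value):
        return value


class _GetAttr(Expr):
    def __init__(self, target, name):
        self._target, self._name = target, name

    def _eval(self, value):
        return getattr(self._target._eval(value), self._name)


class _Call(Expr):
    def __init__(self, func, args, kwargs):
        self._func, self._args, self._kwargs = func, args, kwargs

    def _eval(self, value):
        func = _resolve(self._func, value)
        args = [_resolve(a, value) for a in self._args]
        kwargs = {k: _resolve(v, value) for k, v in self._kwargs.items()}
        return func(*args, **kwargs)


class _BinOp(Expr):
    def __init__(self, op, left, right):
        self._op, self._left, self._right = op, left, right

    def _eval(self, value):
        return self._op(_resolve(self._left, value), _resolve(self._right, value))


def _resolve(obj, value):
    """Evaluate nested expressions; pass plain values through unchanged."""
    return obj._eval(value) if isinstance(obj, Expr) else obj


def to_function(expr):
    """Turn a symbolic expression into a one-argument function."""
    if isinstance(expr, Expr):
        return lambda value: expr._eval(value)
    return expr  # already an ordinary callable, e.g. a lambda


X = Symbol()

row = SimpleNamespace(arr=45, dep=12)      # stand-in for a DataFrame
is_late = to_function(X.arr > 30)          # same role as lambda x: x.arr > 30
print(is_late(row))                        # True
```

In this sketch, X.arr > 30 builds a small expression tree instead of computing anything, and to_function walks that tree once a real object is available. Plain callables pass through untouched, which is consistent with the note that a lambda can be supplied in place of a symbolic expression.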

Warning

pandas-ply is new, and in an experimental stage of its development. The API is not yet stable. Expect the unexpected.

(Pull requests are welcome. Feel free to contact us at [email protected].)

Using pandas-ply

Install pandas-ply with:

$ pip install pandas-ply

Typical use of pandas-ply starts with:

import pandas as pd
from pandas_ply import install_ply, X, sym_call

install_ply(pd)

After calling install_ply, all pandas objects have pandas-ply's methods attached.
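One plausible way an install step like this could work is by assigning new functions onto existing classes at runtime. The sketch below illustrates only that general technique and makes no claim about pandas-ply's internals: Frame, fake_pd, install_sketch and _ply_where are hypothetical stand-ins, and rows are plain dicts so the example runs without pandas.

```python
from types import SimpleNamespace


class Frame:
    """Hypothetical stand-in for a pandas DataFrame: a list of dict rows."""

    def __init__(self, rows):
        self.rows = list(rows)


def _ply_where(self, *conditions):
    """Keep the rows for which every condition (a callable) is true."""
    kept = [row for row in self.rows if all(cond(row) for cond in conditions)]
    return type(self)(kept)


def install_sketch(module):
    """Attach ply_-prefixed methods to the classes exposed by `module`."""
    module.Frame.ply_where = _ply_where


# A fake "pandas" module exposing the stand-in class.
fake_pd = SimpleNamespace(Frame=Frame)

flights = Frame([{"arr": 45, "dep": 40}, {"arr": 10, "dep": 50}])
install_sketch(fake_pd)

# The method is now available on every Frame, including existing instances.
late = flights.ply_where(lambda r: r["arr"] > 30, lambda r: r["dep"] > 30)
print(late.rows)  # [{'arr': 45, 'dep': 40}]
```

Because the method is set on the class rather than on individual objects, instances created before the install call gain it as well. The ply_ prefix keeps the added name from colliding with methods the class already defines.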

API reference

Full API reference is available at http://pythonhosted.org/pandas-ply/.

Possible TODOs

  • Extend pandas' native groupby to support symbolic expressions?
  • Extend pandas' native apply to support symbolic expressions?
  • Add .ply_call to pandas objects to extend chainability?
  • Version of ply_select which supports later computed columns relying on earlier computed columns?
  • Version of ply_select which supports careful column ordering?
  • Better handling of indices?

License

Copyright 2015 Coursera Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Comments
  • python3 support ?

    import pandas as pd
    from ply import install_ply, X, sym_call
    install_ply(pd)
    

    gives

    ---------------------------------------------------------------------------
    ImportError                               Traceback (most recent call last)
    <ipython-input-1-f35c480251ef> in <module>()
          1 import pandas as pd
    ----> 2 from ply import install_ply, X, sym_call
          3 
          4 install_ply(pd)
    
    D:\result_tests\WinPython-64bit-3.4.2.3_build3\python-3.4.2.amd64\lib\site-packages\ply\__init__.py in <module>()
    ----> 1 from methods import install_ply
          2 from symbolic import X, sym_call
    
    ImportError: No module named 'methods'
    
    opened by stonebig 6
  • Continuous Integration

    Hello,

    maybe you should add CI to this project. Travis-CI can help. You might use miniconda to install Pandas on Travis.

    Here is an example https://github.com/scls19fr/pandas_confusion/blob/master/.travis.yml

    A much more complex .travis.yml file can be found at https://github.com/pydata/pandas/blob/master/.travis.yml and shows how to define a build matrix: http://docs.travis-ci.com/user/customizing-the-build/#Build-Matrix

    Kind regards

    opened by scls19fr 2
  • Outputs for README example don't match

    There doesn't seem to be a __version__ in the code, but I installed via pip semi-recently. The filtered_output and the pandas-ply output in the README don't match: the pandas-ply results are missing January. This is on Python 3.4.

    import pandas as pd
    from ply import install_ply, X
    install_ply(pd)
    
    %load_ext rpy2.ipython.rmagic
    from pandas.rpy import common as com
    %R library("nycflights13")
    flights = com.load_data("flights")
    
    grouped_flights = flights.groupby(['year', 'month', 'day'])
    output = pd.DataFrame()
    output['arr'] = grouped_flights.arr_delay.mean()
    output['dep'] = grouped_flights.arr_delay.mean()
    filtered_output = output[(output.arr > 30) & (output.dep > 30)]
    
    print(filtered_output)
    
    (flights
      .groupby(['year', 'month', 'day'])
      .ply_select(
        arr = X.arr_delay.mean(),
        dep = X.dep_delay.mean())
      .ply_where(X.arr > 30, X.dep > 30))
    

    Produces

    [42]: print(filtered_output)
                          arr        dep
    year month day                      
    2013 1     16   34.247362  34.247362
               31   32.602854  32.602854
         2     11   36.290094  36.290094
               27   31.252492  31.252492
         3     8    85.862155  85.862155
               18   41.291892  41.291892
         4     10   38.412311  38.412311
               12   36.048140  36.048140
               18   36.028481  36.028481
               19   47.911697  47.911697
               22   37.812166  37.812166
               25   33.681250  33.681250
         5     8    39.609183  39.609183
               23   61.970899  61.970899
         6     13   63.753689  63.753689
               18   37.648026  37.648026
               24   51.176808  51.176808
               25   41.513684  41.513684
               27   44.783296  44.783296
               28   44.976852  44.976852
               30   43.510278  43.510278
         7     1    58.280502  58.280502
               7    40.306378  40.306378
               9    31.334365  31.334365
               10   59.626478  59.626478
               22   62.763403  62.763403
               23   44.959821  44.959821
               28   49.831776  49.831776
         8     1    35.989259  35.989259
               8    55.481163  55.481163
               9    43.313641  43.313641
               28   35.203074  35.203074
         9     2    45.518430  45.518430
               12   58.912418  58.912418
         10    7    39.017260  39.017260
         12    5    51.666255  51.666255
               8    36.911801  36.911801
               9    42.575556  42.575556
               10   44.508796  44.508796
               14   46.397504  46.397504
               17   55.871856  55.871856
               23   32.226042  32.226042
    

    and

                          dep        arr
    year month day                      
    2013 2     11   39.073598  36.290094
               27   37.763274  31.252492
         3     8    83.536921  85.862155
               18   30.117960  41.291892
         4     10   33.023675  38.412311
               12   34.838428  36.048140
               18   34.915361  36.028481
               19   46.127828  47.911697
               22   30.642553  37.812166
         5     8    43.217778  39.609183
               23   51.144720  61.970899
         6     13   45.790828  63.753689
               18   35.950766  37.648026
               24   47.157418  51.176808
               25   43.063025  41.513684
               27   40.891232  44.783296
               28   48.827784  44.976852
               30   44.188179  43.510278
         7     1    56.233825  58.280502
               7    36.617450  40.306378
               9    30.711499  31.334365
               10   52.860702  59.626478
               22   46.667047  62.763403
               23   44.741685  44.959821
               28   37.710162  49.831776
         8     1    34.574034  35.989259
               8    43.349947  55.481163
               9    34.691898  43.313641
               28   40.526894  35.203074
         9     2    53.029551  45.518430
               12   49.958750  58.912418
         10    7    39.146710  39.017260
         12    5    52.327990  51.666255
               9    34.800221  42.575556
               17   40.705602  55.871856
               23   32.254149  32.226042
    
    opened by jseabold 2
  • pipe

    Hello,

    maybe you should mention the new pipe method http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pipe.html#pandas.DataFrame.pipe

    Kind regards

    opened by scls19fr 1
  • Main page example flights.groupby(['year', 'month', 'day']) using pandas only

    The main example in the readme could be re-written as such using only pandas:

    import pandas
    flights = pandas.read_csv('~/downloads/flights.csv')
    df = (flights
          .groupby(['year', 'month', 'day'])
          .agg({'arr_delay': 'mean',
                'dep_delay': 'mean'})
          .query("arr_delay>30 & dep_delay>30")
         )
    

    Note I exported the flights data set from R with

    library(nycflights13)
    write.csv(flights, "~/downloads/flights.csv", row.names=FALSE)
    
    opened by paulrougieux 1
  • `ply_select` doesn't work for grouped mutate

    With dplyr, I often find myself using mutate to calculate an item-level value using a grouped aggregate. For example:

    flights %>%
      group_by(year) %>%
      mutate(mean_delay = mean(arr_delay),
             std_delay = sd(arr_delay),
             z_delay = (arr_delay - mean_delay)/std_delay)
    

    From the docs, I thought that the first step of the pandas-ply equivalent would be:

    (flights
      .groupby('year')
      .ply_select('*',
        mean_delay = X.arr_delay.mean(),
        std_delay = X.arr_delay.std())
    )
    

    But when I try this I get the following error:

    Traceback (most recent call last):
      File "<pyshell#17>", line 5, in <module>
        sd = X.arr_delay.std()))
    TypeError: _ply_select_for_groups() takes exactly 1 argument (4 given)
    

    The problem appears to be the '*' argument not working when ply_select operates on a group.

    opened by jkeirstead 0
  • Sample

    Hello,

    I'm trying your package, but it would be nice to improve the docs to say where to find the flights sample dataframe. I have been looking inside the dplyr package http://cran.r-project.org/web/packages/dplyr/index.html but I wasn't able to find it. Thanks

    Kind regards

    opened by scls19fr 3
Releases(v0.2.1)