Tidy interface to polars

Overview

tidypolars

tidypolars is a data frame library built on top of the blazingly fast polars library that gives access to methods and functions familiar to R tidyverse users.

Installation

$ pip3 install tidypolars

General syntax

tidypolars methods are designed to work like tidyverse functions:

import tidypolars as tp
from tidypolars import col, desc

df = tp.Tibble(x = range(3), y = range(3, 6), z = ['a', 'a', 'b'])

(
    df
    .select('x', 'y', 'z')
    .filter(col('x') < 4, col('y') > 1)
    .arrange(desc('z'), 'x')
    .mutate(double_x = col('x') * 2,
            x_plus_y = col('x') + col('y'))
)
┌─────┬─────┬─────┬──────────┬──────────┐
│ x   ┆ y   ┆ z   ┆ double_x ┆ x_plus_y │
│ --- ┆ --- ┆ --- ┆ ---      ┆ ---      │
│ i64 ┆ i64 ┆ str ┆ i64      ┆ i64      │
╞═════╪═════╪═════╪══════════╪══════════╡
│ 2   ┆ 5   ┆ b   ┆ 4        ┆ 7        │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 0   ┆ 3   ┆ a   ┆ 0        ┆ 3        │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ 4   ┆ a   ┆ 2        ┆ 5        │
└─────┴─────┴─────┴──────────┴──────────┘

The key difference from R is that column names must be wrapped in col() in the following methods:

  • .filter()
  • .mutate()
  • .summarize()

The general idea: when doing calculations on a column, wrap it in col(). For simple column selections (as in .select()), pass the column names as strings.
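
To make the semantics concrete, here is a plain-Python sketch (no tidypolars or polars involved) of what the pipeline in the General syntax example computes. The list-of-dicts representation and the variable names are illustrative assumptions, not how tidypolars works internally.

```python
# Same data as the Tibble in the example: x = 0..2, y = 3..5, z = a, a, b.
rows = [
    {"x": x, "y": y, "z": z}
    for x, y, z in zip(range(3), range(3, 6), ["a", "a", "b"])
]

# .filter(col('x') < 4, col('y') > 1): keep rows where both conditions hold.
kept = [r for r in rows if r["x"] < 4 and r["y"] > 1]

# .arrange(desc('z'), 'x'): sort by the secondary key first, then do a
# stable sort on the primary key in descending order.
kept.sort(key=lambda r: r["x"])
kept.sort(key=lambda r: r["z"], reverse=True)

# .mutate(...): add the two derived columns to every row.
result = [
    {**r, "double_x": r["x"] * 2, "x_plus_y": r["x"] + r["y"]}
    for r in kept
]

for r in result:
    print(r)
```

The rows come out in the same order as in the table above: the single 'b' row first, then the two 'a' rows ordered by x.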

Group by syntax

Methods operate by group when you pass the by argument.

  • A single column can be passed with by = 'z'
  • Multiple columns can be passed with by = ['y', 'z']

(
    df
    .summarize(avg_x = tp.mean(col('x')),
               by = 'z')
)
┌─────┬───────┐
│ z   ┆ avg_x │
│ --- ┆ ---   │
│ str ┆ f64   │
╞═════╪═══════╡
│ a   ┆ 0.5   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b   ┆ 2     │
└─────┴───────┘
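
For readers who want to check the numbers, the grouped summary above can be reproduced with a short plain-Python sketch. It only illustrates what a grouped mean computes; it does not show how tidypolars or polars implement it, and the variable names are assumptions made for the example.

```python
from collections import defaultdict

# Columns x and z from the example Tibble.
x = [0, 1, 2]
z = ["a", "a", "b"]

# Collect the x values that belong to each value of the grouping column z.
groups = defaultdict(list)
for key, value in zip(z, x):
    groups[key].append(value)

# One mean per group, which is what summarize(..., by = 'z') returns.
avg_x = {key: sum(values) / len(values) for key, values in groups.items()}
print(avg_x)  # {'a': 0.5, 'b': 2.0}
```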

Selecting/dropping columns

tidyselect functions can be mixed with normal selection when selecting columns:

df = tp.Tibble(x1 = range(3), x2 = range(3), y = range(3), z = range(3))

df.select(tp.starts_with('x'), 'z')
┌─────┬─────┬─────┐
│ x1  ┆ x2  ┆ z   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 0   ┆ 0   ┆ 0   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 1   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 2   ┆ 2   │
└─────┴─────┴─────┘

To drop columns use the .drop() method:

df.drop(tp.starts_with('x'), 'z')
┌─────┐
│ y   │
│ --- │
│ i64 │
╞═════╡
│ 0   │
├╌╌╌╌╌┤
│ 1   │
├╌╌╌╌╌┤
│ 2   │
└─────┘
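
The traceback quoted in the Comments section shows the pattern ^x.*$ for tp.starts_with('x'), which suggests that tidyselect helpers resolve to regular expressions over column names. Assuming that is the case, the select and drop logic can be sketched in plain Python. The helpers starts_with and resolve below are hypothetical stand-ins written for this illustration, not the tidypolars implementation.

```python
import re

def starts_with(prefix):
    # Hypothetical helper: build an anchored pattern such as "^x.*$".
    return re.compile("^" + re.escape(prefix) + ".*$")

def resolve(columns, *selectors):
    """Expand plain names and regex selectors into an ordered list of columns."""
    picked = []
    for selector in selectors:
        if isinstance(selector, re.Pattern):
            matches = [c for c in columns if selector.match(c)]
        else:
            matches = [c for c in columns if c == selector]
        for name in matches:
            if name not in picked:
                picked.append(name)
    return picked

columns = ["x1", "x2", "y", "z"]

# Mixing a tidyselect-style helper with a plain column name.
selected = resolve(columns, starts_with("x"), "z")

# Dropping is the complement of selecting.
remaining = [c for c in columns if c not in selected]

print(selected)   # ['x1', 'x2', 'z']
print(remaining)  # ['y']
```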

Converting to/from pandas data frames

If you need to use a package that requires pandas data frames, you can convert from a tidypolars Tibble to a pandas DataFrame.

To do this you'll first need to install pyarrow:

pip3 install pyarrow

To convert to a pandas DataFrame:

df = df.to_pandas()

To convert from a pandas DataFrame to a tidypolars Tibble:

df = tp.from_pandas(df)

Speed Comparisons

A few notes:

  • Comparing times across different functions typically isn't very useful. For example, the .summarize() tests were performed on a different dataset from the .pivot_wider() tests.
  • All tests are run 5 times. The times shown are the median of those 5 runs.
  • All timings are in milliseconds.
  • All tests can be found in the project's source code.
  • FAQ: Why are some tidypolars functions faster than their polars counterparts?
    • Short answer: they're not! After all, they're just using polars in the background.
    • Long answer: all Python functions have some slight natural variation in their execution time. By chance, the tidypolars runs were slightly shorter for those specific functions in this iteration of the tests. However, one goal of these tests is to show that the "time cost" of translating syntax to polars is negligible to the user (especially on medium-to-large datasets).
  • Lastly, these tests were not rigorously designed to cover all angles equally. They are only meant to give general insight into the performance of these packages.
┌─────────────┬────────────┬─────────┬──────────┐
│ func_testedtidypolarspolarspandas   │
│ ------------      │
│ strf64f64f64      │
╞═════════════╪════════════╪═════════╪══════════╡
│ arrange190.345169.478500.112  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ case_when87.34879.427152.623  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ distinct16.88816.28228.725   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ filter29.78929.91231.397  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ full_join236.784231.2831042.689 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ inner_join49.7147.563630.98   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ left_join113.7921151100.607 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ mutate7.9797.408117.283  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ pivot_wider42.76439.93949.048   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ summarize59.43458.011453.707  │
└─────────────┴────────────┴─────────┴──────────┘
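
The timing procedure described in the notes (five runs per test, median reported, in milliseconds) can be sketched with the standard library as follows. This is a minimal illustration under those stated assumptions, not the project's actual benchmark code; the helper median_ms and the sorted() workload are placeholders invented for the example.

```python
import statistics
import timeit

def median_ms(func, runs=5):
    """Time func once per run and return the median duration in milliseconds."""
    seconds = timeit.repeat(func, number=1, repeat=runs)
    return statistics.median(seconds) * 1000

# Placeholder workload; a real test would call a data frame method instead.
elapsed = median_ms(lambda: sorted(range(100_000), reverse=True))
print(f"median of 5 runs: {elapsed:.3f} ms")
```

Using the median rather than the mean keeps a single slow run (for example, one affected by a background process) from skewing the reported time.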

Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

Comments
  • `drop` with error `RuntimeError: Any(NotFound("^x.*$"))`

    import sys
    import tidypolars as tp
    sys.version
    # '3.9.7 (default, Sep 16 2021, 13:09:58) \n[GCC 7.5.0]'
    tp.__version__
    # '0.2.1'
    ## error
    df = tp.Tibble(x1 = range(3), x2 = range(3), y=range(3), z = range(3))
    df.drop([tp.starts_with('x'), 'z'])
    df.drop()
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    /tmp/ipykernel_12815/866601321.py in <module>
    ----> 1 df.drop(tp.starts_with('x'))
    
    ~/miniconda3/envs/py39/lib/python3.9/site-packages/polars/eager/frame.py in drop(self, name)
       2253             return df
       2254 
    -> 2255         return wrap_df(self._df.drop(name))
       2256 
       2257     def drop_in_place(self, name: str) -> "pl.Series":
    
    RuntimeError: Any(NotFound("^x.*$"))
    
    opened by ztsweet 9
  • `AttributeError: arrange not found`

    import tidypolars as tp
    from tidypolars import col, desc
    import sys
    sys.version
    # '3.10.0 | packaged by conda-forge | (default, Oct 12 2021, 21:24:52) [GCC 9.4.0]'
    tp.__version__
    # '0.2.1'
    df = tp.Tibble({'x': ['a', 'a', 'b'], 'y': range(3)})
    df.arrange('x', 'y')
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    ~/miniconda3/envs/py310/lib/python3.10/site-packages/polars/eager/frame.py in __getattr__(self, item)
        882         try:
    --> 883             return pl.eager.series.wrap_s(self._df.column(item))
        884         except RuntimeError:
    
    RuntimeError: Any(NotFound("arrange"))
    
    During handling of the above exception, another exception occurred:
    
    AttributeError                            Traceback (most recent call last)
    /tmp/ipykernel_21110/1194586334.py in <module>
    ----> 1 df.arrange('x', 'y')
    
    ~/miniconda3/envs/py310/lib/python3.10/site-packages/polars/eager/frame.py in __getattr__(self, item)
        883             return pl.eager.series.wrap_s(self._df.column(item))
        884         except RuntimeError:
    --> 885             raise AttributeError(f"{item} not found")
        886 
        887     def __iter__(self) -> Iterator[Any]:
    
    AttributeError: arrange not found
    
    bug 
    opened by ztsweet 6
  • Missing attributes when chaining

    Hi Mark, thanks for putting this package together. It looks very cool.

    I'm having a tough time getting the motivating examples to work, though. For example, the following triggers an error:

    import tidypolars as tp
    from tidypolars import col, desc
    
    df = tp.Tibble(x = range(3), y = range(3, 6), z = ['a', 'a', 'b'])
    
    df.filter(col('x') < 2).arrange(desc('z'), 'x')
    
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    Untitled-1 in <cell line: 1>()
    ----> 8 df.filter(col('x') < 2).arrange(desc('z'), 'x')
    
    AttributeError: 'DataFrame' object has no attribute 'arrange'
    

    What's genuinely odd about the above is that arrange works on its own and when it comes before filter.

    # All of these work as expected
    df.filter(col('x') < 2)
    df.arrange(desc('z'), 'x')
    df.arrange(desc('z'), 'x').filter(col('x') < 2)
    

    A seemingly related issue is that I can't pass two arguments to filter when it follows arrange (or other verbs like select, for that matter).

    df.filter(col('x') < 2, col('y')>3) ## works
    df.arrange(desc('z'), 'x').filter(col('x') < 2, col('y')>3) ## errors with "filter() takes 2 positional arguments but 3 were given"
    

    Any ideas?

    I'm on Python 3.9.2 installed via Homebrew on a 2019 Macbook (so regular Intel chip) and running the latest version of tidypolars (0.2.15).

    bug 
    opened by grantmcdermott 5
  • Is it possible to have dplyr's `group_by` + `mutate` behavior?

    First of all, I really like this package and I've started to use it a lot in my work. As a Pythonista whose first language is R, I really enjoy tidypolars.

    In R, we can do something like the following

    library(dplyr)
    data(iris)
    
    iris %>%
      group_by(Species) %>%
      mutate(
        result = Petal.Width - mean(Petal.Width)
      )
    

    Since we have a group_by(Species) call, dplyr will subtract the mean that corresponds to each group in the mutate() operation (not the mean across all observations from all species).

    As far as I understand, this is still not possible with tidypolars since we don't have a group_by function that behaves in a similar way to the one in dplyr. So my questions are

    • Is it possible to have this behavior in tidypolars now?
      • If yes, how?
      • If not, is it going to be possible? I could volunteer to try to implement it. I'm not familiar with the existing codebase, but I suspect that Python eager evaluation of function arguments is what makes it harder to have such a feature?

    Again, thanks for the fantastic library!

    opened by tomicapretto 5
  • idiomatic way to add list as column

    Forgive what's probably a dumb question, but is there a way to get .mutate to return the same object as the .bind_cols line?

    import tidypolars as tp
    
    tb = tp.Tibble({'a': [1, 2, 3]})
    x = [4, 5, 6]
    # gives desired output 
    tb.bind_cols(tp.Tibble({'b': x}))
    # gives error: ValueError: could not convert value '[4, 5, 6]' as a Literal
    tb.mutate(b = x)
    
    feature 
    opened by eutwt 4
  • purrr functions!?

    I noticed that in tidytable, you have purrr functions like map.(), but not in tidypolars.

    Using for loops + lambda functions are just not desirable for collaborative coding / code readability/comprehension. In Python, even if there is a bit of sacrifice in performance, if it allows better code readability, it would be really nice to have.

    Would something like map.() be in the scope of this repo?

    feature 
    opened by exsell-jc 3
  • `ValueError` with `filter`

    When I chain filter expressions with | (error message said to use | and not or), I receive a ValueError message:

    tp.Tibble(chr_col = tp.Series(['this is a test 1', 'this is a test 2', 'this is a test 3']))\
        .filter(col('chr_col') == 'this is a test 1' |
                col('chr_col') == 'this is a test 2')
    
    ValueError: Since Expr are lazy, the truthiness of an Expr is ambiguous. 
    Hint: use '&' or '|' to chain Expr together, not and/or.
    

    It works fine if I do one or the other:

    tp.Tibble(chr_col = tp.Series(['this is a test 1', 'this is a test 2', 'this is a test 3']))\
        .filter(# col('chr_col') == 'this is a test 1' |
                col('chr_col') == 'this is a test 2')
    
    # chr_col
    #   --
    #   str
    # "this is a test 2"
    
    tp.Tibble(chr_col = tp.Series(['this is a test 1', 'this is a test 2', 'this is a test 3']))\
        .filter(col('chr_col') == 'this is a test 1' # |
                # col('chr_col') == 'this is a test 2'
                )
    
    # chr_col
    #   --
    #   str
    # "this is a test 1"
    
    opened by alexandro-ag 3
  • `as_date` with `RuntimeError: please define a fmt`

    Good afternoon,

    I think I found an issue with the as_date method. In the example per the documentation, the following succeeds:

    import tidypolars as tp
    from tidypolars import col
    
    date_df = tp.Tibble(date = ['2021-12-31']) # Year-Month-Day (%Y-%m-%d)
    date_df.mutate(date_parsed = tp.as_date(col('date'))) # Success
    

    However when parsing different formats (using the fmt argument), the date fails to parse:

    import tidypolars as tp
    from tidypolars import col
    
    date_df = tp.Tibble(date = ['12/31/2021']) # Month/Day/Year (%m/%d/%Y)
    date_df.mutate(date_parsed = tp.as_date(col('date'), fmt='%m/%d/%Y')) # RuntimeError
    

    I also extend my appreciation for all the work on this package. I've been searching for a tidyverse implementation in python and this one knocks my expectations out of the park. Thank you.

    opened by alexandro-ag 3
  • Revisit `.rename()` syntax

    Should the syntax be the same as pl.DataFrame.rename? Currently polars mimics pandas syntax. Or should it be something that attempts to mimic tidyverse syntax?

    Note: polars also has a .rename_col() with syntax df.rename_col('old', 'new').

    opened by markfairbanks 3
  • Compatibility with polars v0.14.0

    PR that caused the break: https://github.com/pola-rs/polars/pull/4309

    Old behavior that tidypolars relied on: https://github.com/pola-rs/polars/pull/2862

    feature 
    opened by markfairbanks 2
  • Basics: tp.read_csv(), df.drop(x1, x2, x3, ...), and df.colnames?

    Really new to the library, but looking at the documentation did not really help with understanding.

    Problem 1

    import polars as pl
    import tidypolars as tp
    import csv
    import requests
    
    url = f'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-05-24/Scrumqueens-data-2022-05-23.csv'
    
    df = tp.read_csv(file = url) # does not work
    df = pl.read_csv(file = url) # works??
    

    Problem 2

    df = df.drop('...1', 'Notes') # does not work
    df = df.drop('...1') # works separately
    df = df.drop('Notes') # works separately
    

    Problem 3

    df.colnames
    df.names
    df.colnames()
    df.names()
    # None of these work
    

    What am I missing, exactly?

    opened by exsell-jc 2
  • plans for adding type hints

    Hi, it seems that the codebase is not annotated making the discoverability of methods difficult and static code analysis not working. Any plans on adding type hints?

    feature 
    opened by mr-majkel 1
  • `write_csv()` returns `'super' object has no attribute 'to_csv'`

    Hi, There seems to be a problem with write_csv(). I can import tidypolars and the data just fine:

    import tidypolars as tp
    
    rents = tp.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-07-05/rent.csv")
    

    But when I try to export the data frame as a csv file:

    rents.write_csv("rents.csv")
    

    I get an error stating 'super' object has no attribute 'to_csv'.

    The data come from the Tidytuesday repo. Python version is 3.10.8 and tidypolars is 0.2.19. I'm on macOS 13.

    bug 
    opened by alesvomacka 1
  • Calculating time

    In R with lubridate, it would look like this:

    one_year_before = some_date - years(1)
    one_year_before = some_date - months(12)
    

    But in tidypolars functions list, there doesn't seem to be a years or months function: https://tidypolars.readthedocs.io/en/latest/reference.html

    feature 
    opened by exsell-jc 4
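
A note on the `ValueError` with `filter` report above: the message is consistent with Python operator precedence rather than with anything specific to filter. In Python, | binds more tightly than ==, so an unparenthesized condition is parsed as a chained comparison, and chained comparisons call bool() on an intermediate result. The sketch below reproduces that behavior with a toy Expr class written only for this illustration; it is an assumption-laden stand-in, not the polars or tidypolars expression type.

```python
class Expr:
    """Toy lazy expression that records operations as text (not the polars class)."""

    def __init__(self, text):
        self.text = text

    @staticmethod
    def _show(value):
        return value.text if isinstance(value, Expr) else repr(value)

    def __eq__(self, other):
        return Expr(f"({self.text} == {self._show(other)})")

    def __or__(self, other):
        return Expr(f"({self.text} | {self._show(other)})")

    def __ror__(self, other):
        return Expr(f"({self._show(other)} | {self.text})")

    def __bool__(self):
        # Lazy expressions have no single truth value.
        raise ValueError("the truthiness of an Expr is ambiguous")


chr_col = Expr("col('chr_col')")

# Without parentheses this parses as: chr_col == ('a' | chr_col) == 'b',
# a chained comparison whose implicit `and` calls bool() on an Expr.
try:
    chr_col == 'a' | chr_col == 'b'
    raised = False
except ValueError:
    raised = True

# Parenthesizing each comparison gives the intended "either condition" expression.
combined = (chr_col == 'a') | (chr_col == 'b')

print(raised)         # True
print(combined.text)  # ((col('chr_col') == 'a') | (col('chr_col') == 'b'))
```

Under that reading, wrapping each comparison in parentheses before combining them with | should avoid the error in the reported filter call.
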
Releases(v0.2.19)