Modin: Speed up your Pandas workflows by changing a single line of code

Overview

Scale your pandas workflows by changing one line of code

To use Modin, replace the pandas import:

# import pandas as pd
import modin.pandas as pd

Installation

Modin can be installed from PyPI:

pip install modin

If you don't have Ray or Dask installed, you will need to install Modin with one of the targets:

pip install modin[ray] # Install Modin dependencies and Ray to run on Ray
pip install modin[dask] # Install Modin dependencies and Dask to run on Dask
pip install modin[all] # Install all of the above

Modin will automatically detect which engine you have installed and use that for scheduling computation!
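
To see which engines are available in your environment before importing Modin, a minimal sketch along these lines may help. It is not Modin's own detection code: the helper name `installed_engines` is hypothetical, and it only asks the standard library whether the `ray` and `dask` packages can be located.

```python
import importlib.util


def installed_engines(candidates=("ray", "dask")):
    """Return the candidate engine packages that can be found here.

    Hypothetical helper for illustration; Modin's own detection may differ.
    """
    found = []
    for name in candidates:
        # find_spec returns None when a top-level package cannot be located.
        if importlib.util.find_spec(name) is not None:
            found.append(name)
    return found


print(installed_engines())
```

An empty list suggests that neither engine is installed, in which case one of the `pip install modin[...]` targets above is needed.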

Pandas API Coverage

pandas Object | Modin's Ray Engine Coverage | Modin's Dask Engine Coverage
pd.DataFrame | |
pd.Series | |
pd.read_csv | |
pd.read_table | |
pd.read_parquet | |
pd.read_sql | |
pd.read_feather | |
pd.read_excel | |
pd.read_json | ✳️ | ✳️
pd.read_<other> | ✴️ | ✴️

Some pandas APIs are easier to implement than others, so if something is missing feel free to open an issue!

Choosing a Compute Engine

If you want to choose a specific compute engine to run on, you can set the environment variable MODIN_ENGINE and Modin will do computation with that engine:

export MODIN_ENGINE=ray  # Modin will use Ray
export MODIN_ENGINE=dask  # Modin will use Dask

This can also be done within a notebook/interpreter before you import Modin:

import os

# Set exactly one of the following; a second assignment overrides the first.
os.environ["MODIN_ENGINE"] = "ray"  # Modin will use Ray
# os.environ["MODIN_ENGINE"] = "dask"  # Modin will use Dask

import modin.pandas as pd

Note: You should not change the engine after you have imported Modin, as doing so results in undefined behavior.
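
One way to guard against setting the engine too late is a small wrapper like the one below. This is only a sketch: `select_engine` is a hypothetical helper, not part of Modin, and it accepts only the two engines named in this document.

```python
import os
import sys


def select_engine(name):
    """Set MODIN_ENGINE, refusing if modin has already been imported."""
    if name not in ("ray", "dask"):
        raise ValueError(f"unsupported engine: {name!r}")
    # Once modin is in sys.modules the engine choice may already be fixed.
    already_imported = any(
        mod == "modin" or mod.startswith("modin.") for mod in list(sys.modules)
    )
    if already_imported:
        raise RuntimeError("set MODIN_ENGINE before importing modin")
    os.environ["MODIN_ENGINE"] = name


select_engine("dask")
# import modin.pandas as pd  # import only after the engine is chosen
```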

Which engine should I use?

If you are on Windows, you must use Dask, because Ray does not support Windows. If you are on Linux or macOS, you can install and use either engine. No prior knowledge of either engine is required, since Modin abstracts away the complexity, so feel free to pick either!
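
The guidance above can be written as a small platform check. This is a sketch only: `default_engine` is a hypothetical helper, and picking Ray on non-Windows systems is an arbitrary assumption, since the text says either engine works there.

```python
import platform


def default_engine(system=None):
    """Pick an engine following the guidance in this section.

    `system` defaults to platform.system(); pass a value to try other platforms.
    """
    system = system or platform.system()
    # Per the text above: Dask on Windows, either engine elsewhere.
    # Returning "ray" for other systems is an arbitrary choice for this sketch.
    return "dask" if system == "Windows" else "ray"


print(default_engine())
```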

Advanced usage

In Modin, you can start a custom environment in Dask or Ray, and Modin will connect to that environment automatically. For example, to limit the resources that Modin uses, you can start a Dask Client or initialize Ray yourself, and Modin will use those instances. Make sure you've set the correct environment variable so Modin knows which engine to connect to!

For Ray:

import ray
ray.init(plasma_directory="/path/to/custom/dir", object_store_memory=10**10)
# Modin will connect to the existing Ray environment
import modin.pandas as pd

For Dask:

from distributed import Client
client = Client(n_workers=6)
# Modin will connect to the Dask Client
import modin.pandas as pd

This gives you the flexibility to start with custom resource constraints and limit the amount of resources Modin uses.
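
As an illustration of deriving such a constraint, the sketch below turns a CPU budget into keyword arguments for a Dask `Client`. The helper `dask_client_kwargs` and the 50% default are assumptions made for this example, not Modin or Dask defaults.

```python
import os


def dask_client_kwargs(cpu_fraction=0.5):
    """Build keyword arguments for distributed.Client from a CPU budget."""
    if not 0 < cpu_fraction <= 1:
        raise ValueError("cpu_fraction must be in (0, 1]")
    total = os.cpu_count() or 1  # cpu_count() can return None
    n_workers = max(1, int(total * cpu_fraction))
    # One thread per worker keeps the worker count equal to the cores used.
    return {"n_workers": n_workers, "threads_per_worker": 1}


kwargs = dask_client_kwargs(0.5)
print(kwargs)

# With Dask installed, the result could be passed on before importing Modin:
# from distributed import Client
# client = Client(**kwargs)
# import modin.pandas as pd
```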

Full Documentation

Visit the complete documentation on readthedocs: https://modin.readthedocs.io

Scale your pandas workflow by changing a single line of code.

import modin.pandas as pd
import numpy as np

frame_data = np.random.randint(0, 100, size=(2**10, 2**8))
df = pd.DataFrame(frame_data)

In local mode (without a cluster), Modin will create and manage a local Dask or Ray cluster for execution.

To use Modin, you do not need to know how many cores your system has and you do not need to specify how to distribute the data. In fact, you can continue using your previous pandas notebooks while experiencing a considerable speedup from Modin, even on a single machine. Once you've changed your import statement, you're ready to use Modin just like you would pandas.
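
If a notebook has to run both where Modin is installed and where it is not, a fallback import is one option. The sketch below assumes the notebook's code works with either library; `first_importable` is a hypothetical helper, not part of Modin or pandas.

```python
import importlib


def first_importable(names):
    """Return the first module in `names` that imports successfully."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(f"none of {names!r} could be imported")


# In a notebook this might read:
# pd = first_importable(["modin.pandas", "pandas"])
```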

Faster pandas, even on your laptop

The modin.pandas DataFrame is an extremely light-weight parallel DataFrame. Modin transparently distributes the data and computation so that all you need to do is continue using the pandas API as you were before installing Modin. Unlike other parallel DataFrame systems, Modin is designed to be both light-weight and robust. Because it is so light-weight, Modin provides speed-ups of up to 4x on a laptop with 4 physical cores.

In pandas, you are only able to use one core at a time when you are doing computation of any kind. With Modin, you are able to use all of the CPU cores on your machine. Even in read_csv, we see large gains by efficiently distributing the work across your entire machine.
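
The general idea of spreading rows across cores can be sketched as a partitioning step. This is not Modin's partitioning code: `split_rows` is a hypothetical helper that only shows how a row range might be cut into one near-equal chunk per core.

```python
import os


def split_rows(n_rows, n_parts=None):
    """Split range(n_rows) into near-equal contiguous (start, stop) chunks."""
    n_parts = n_parts or os.cpu_count() or 1
    # Never create more chunks than there are rows.
    n_parts = max(1, min(n_parts, n_rows)) if n_rows else 1
    base, extra = divmod(n_rows, n_parts)
    bounds, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks take one additional row each.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


print(split_rows(10, 4))  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Each chunk could then be handed to a separate worker, with the per-chunk results combined at the end.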

import modin.pandas as pd

df = pd.read_csv("my_dataset.csv")

Modin is a DataFrame designed for datasets from 1MB to 1TB+

We have focused heavily on bridging the solutions between DataFrames for small data (e.g. pandas) and large data. Often data scientists require different tools for doing the same thing on different sizes of data. The DataFrame solutions that exist for 1KB do not scale to 1TB+, and the overheads of the solutions for 1TB+ are too costly for datasets in the 1KB range. With Modin, because of its light-weight, robust, and scalable nature, you get a fast DataFrame for both small and large data. With preliminary cluster and out-of-core support, Modin is a DataFrame library with great single-node performance and high scalability in a cluster.

Modin Architecture

We designed Modin to be modular so we can plug in different components as they develop and improve:

[Figure: Modin architecture diagram]
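
One way to picture this kind of modularity is an interface that the API layer talks to, with interchangeable implementations behind it. The sketch below is purely illustrative: `Engine`, `SerialEngine`, `ThreadEngine`, and `column_sum` are hypothetical names and do not correspond to Modin's actual classes.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class Engine(ABC):
    """Hypothetical execution-engine interface (not a Modin class)."""

    @abstractmethod
    def map(self, func, partitions):
        """Apply func to every partition and return the results in order."""


class SerialEngine(Engine):
    """Simplest possible engine: run everything in the calling thread."""

    def map(self, func, partitions):
        return [func(part) for part in partitions]


class ThreadEngine(Engine):
    """Drop-in alternative that uses a thread pool."""

    def map(self, func, partitions):
        with ThreadPoolExecutor() as pool:
            return list(pool.map(func, partitions))


def column_sum(engine, partitions):
    """The 'API layer' depends only on the Engine interface."""
    return sum(engine.map(sum, partitions))


parts = [[1, 2], [3, 4]]
print(column_sum(SerialEngine(), parts), column_sum(ThreadEngine(), parts))
```

Swapping the engine changes how the work runs without touching `column_sum`, which is the property a modular design aims for.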

Visit the documentation for more information, and check out the difference between Modin and Dask!

modin.pandas is currently under active development. Requests and contributions are welcome!

More information and Getting Involved

Comments
  • Refactor testing suite

    What do these changes do?

    This PR aims to refactor the testing suite to simplify the addition of new tests and to increase the code coverage of our current tests.

    • [x] passes git diff upstream/master -u -- "*.py" | flake8 --diff
    opened by williamma12 278
  • Add codecov to testing suite

    What do these changes do?

    Related issue number

    N/A

    • [x] passes git diff upstream/master -u -- "*.py" | flake8 --diff
    • [x] passes black --check modin/
    opened by devin-petersohn 181
  • First initiative to distribute the read_sql() method.

    What do these changes do?

    Propose a new approach to distribute read_sql() method.

    Related issue number

    #418

    • [x] passes git diff upstream/master -u -- "*.py" | flake8 --diff
    • [x] passes black --check modin/
    opened by igorborgest 72
  • Create Base Query Compiler object for query compilers

    What do these changes do?

    Creates a BaseQueryCompiler abstract class that will be the parent of all the other QueryCompiler classes to facilitate the adding of other QueryCompilers

    Related issue number

    • [x] passes git diff upstream/master -u -- "*.py" | flake8 --diff
    • [x] passes black --check modin/
    opened by williamma12 65
  • Adding SeriesView object that allows inplace operations

    • Now df['column1'].fillna(0, inplace=True) will work correctly
    • Wraps all Series objects that are returned from getitem
    • Minor fix to an issue in setitem where some Series objects were incorrectly set based on their index

    What do these changes do?

    Related issue number

    • [x] passes git diff upstream/master -u -- "*.py" | flake8 --diff
    • [x] passes black --check modin/
    opened by devin-petersohn 64
  • Making Modin more efficient at small DataFrames

    • Adds a minimum partition size of 4kb
    • Adds some improved shuffling methods
    • Improves small DataFrame performance by up to 10x

    What do these changes do?

    Related issue number

    • [x] passes git diff upstream/master -u -- "*.py" | flake8 --diff
    • [x] passes black --check modin/
    opened by devin-petersohn 52
  • Modify test dataframes

    What do these changes do?

    Tests empty dataframes and larger test dataframes

    Related issue number

    • [x] passes git diff upstream/master -u -- "*.py" | flake8 --diff
    • [x] passes black --check modin/
    opened by williamma12 46
  • Fixed apply for python backend

    What do these changes do?

    Removes try_scale so it will properly assign index in a python backend

    Related issue number

    • [x] passes git diff upstream/master -u -- "*.py" | flake8 --diff
    • [x] passes black --check modin/
    opened by williamma12 45
  • Squeeze functionality

    What do these changes do?

    Adding squeeze() functionality for modin. Basic test with several cases also included in test_dataframe.py. Currently returns pandas.Series where necessary because no Series implementation in modin yet.

    • [x] passes git diff upstream/master -u -- "*.py" | flake8 --diff
    • [x] passes black --check modin/
    opened by adits31 44
  • FEAT-#4147: Add partial compatibility with Python 3.6 and pandas 1.1

    What do these changes do?

    • [x] commit message follows format outlined here
    • [x] passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
    • [x] passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
    • [x] signed commit with git commit -s
    • [x] Resolves #4147
    • [x] tests added and passing
    • [x] module layout described at docs/development/architecture.rst is up-to-date
    • [x] added (Issue Number: PR title (PR Number)) and github username to release notes for next major release
    Ready for review 
    opened by devin-petersohn 42
  • REFACTOR-#2642: Refactor partition API

    Signed-off-by: Igoshev, Yaroslav

    What do these changes do?

    • [x] commit message follows format outlined here
    • [x] passes flake8 modin
    • [x] passes black --check modin
    • [x] signed commit with git commit -s
    • [x] Resolves #2642
    • [x] tests passing
    opened by YarShev 39
  • PERF-#5182: pre compute dtypes in query compiler for binary operations

    What do these changes do?

    Similar to #5494 with the logic for pre computing dtypes moved to the query compiler.

    • [x] first commit message and PR title follow format outlined here

      NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

    • [ ] passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
    • [ ] passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
    • [ ] signed commit with git commit -s
    • [ ] Resolves #?
    • [ ] tests added and passing
    • [ ] module layout described at docs/development/architecture.rst is up-to-date
    opened by arunjose696 0
  • Use Ray design patterns and avoid anti-patterns

    Having read design patterns and anti-patterns in Ray docs I think that we could use/avoid some of them in our code to gain benefit/performance. Here are some examples.

    Individual issues/tasks can be created from this list in order to be investigated and resolved separately/more granularly.

    I encourage everyone from @modin-project/modin-core to read Ray design patterns and anti-patterns so we could probably see more gaps in our code.

    Aside: resolving the concrete issues we should be careful regarding performance of every engine because that may speed up one engine, but slow down other.

    new feature/request 💬 
    opened by YarShev 0
  • PyArrowOnRay: implement read_parquet???

    Is your feature request related to a problem? Please describe. A clear and concise description of what the problem is. What kind of performance improvements would you like to see with this new API?

    Right now it appears the only way to create a dataframe with the PyArrowOnRay engine is to read it in from CSV. This is very unfortunate, and makes the module essentially unusable.

    In my case, I'm using lists of integers for certain columns, which PyArrow handles beautifully as a segmented array, but pandas encodes as objects, thereby exploding processing time and memory usage. PyArrow's format is highly desirable for this reason, as would reading from feather/parquet (which natively support this format, unlike csv).

    I understand that PyArrowOnRay is an experimental feature, but would it be possible to implement a parquet reader? This would enable testing on a key use case as PyArrowOnRay is developed further.

    Thanks!

    new feature/request 💬 pandas.io P1 External 
    opened by swamidass 1
  • FIX-#2320: raise exceptions in read_csv in some cases with `skipfooter!=0`

    Signed-off-by: Anatoly Myachev

    What do these changes do?

    • [x] first commit message and PR title follow format outlined here

      NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

    • [x] passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
    • [x] passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
    • [x] signed commit with git commit -s
    • [x] Resolves #2320
    • [x] tests added and passing
    • [x] module layout described at docs/development/architecture.rst is up-to-date
    Ready for review 
    opened by anmyachev 2
  • FIX-#3080: Fix case when there is duplicated columns for read_csv on hdk

    Signed-off-by: Anatoly Myachev

    What do these changes do?

    • [x] first commit message and PR title follow format outlined here

      NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

    • [x] passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
    • [x] passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
    • [x] signed commit with git commit -s
    • [x] Resolves #3080
    • [x] tests added and passing
    • [x] module layout described at docs/development/architecture.rst is up-to-date
    Ready for review 
    opened by anmyachev 0
Releases(0.18.0)
  • 0.18.0(Dec 12, 2022)

    This release includes support for MPI backend using Unidist, improvements to the shuffling mechanism, SQL query execution on the HDK backend (currently pyhdk==0.3), support for pandas 1.5.2 and external query compilers. It also includes many bug fixes and some performance enhancements.

    Key Features and Updates Since 0.17.0

    • Stability and Bugfixes
      • FIX-https://github.com/modin-project/modin/issues/3823: Fix TypeError when creating Series from SparseArray (https://github.com/modin-project/modin/pull/5377)
      • FIX-https://github.com/modin-project/modin/issues/4100: Fall back to Pandas on row drop (https://github.com/modin-project/modin/pull/4937)
      • FIX-https://github.com/modin-project/modin/issues/4636: Allows read_parquet to detect column partitioning in non-local filesystems (https://github.com/modin-project/modin/pull/5192)
      • FIX-https://github.com/modin-project/modin/issues/4859: Add support for PyArrow Dictionary Arrays to type mapping (https://github.com/modin-project/modin/pull/4864)
      • FIX-https://github.com/modin-project/modin/issues/4859: Add support for PyArrow Dictionary Arrays to type mapping (https://github.com/modin-project/modin/pull/5271)
      • FIX-https://github.com/modin-project/modin/issues/5016: Suppress spammy ray task errors. (https://github.com/modin-project/modin/pull/5298)
      • FIX-https://github.com/modin-project/modin/issues/5114: Change mask name to resolve namespace conflict with numpy mask (https://github.com/modin-project/modin/pull/5215)
      • FIX-https://github.com/modin-project/modin/issues/5137: df.info failure with default columns (https://github.com/modin-project/modin/pull/5251)
      • FIX-https://github.com/modin-project/modin/issues/5138: df_categories_equals typo (https://github.com/modin-project/modin/pull/5250)
      • FIX-https://github.com/modin-project/modin/issues/5171: Allow xgboost >= 1.7.0. (https://github.com/modin-project/modin/pull/5195)
      • FIX-https://github.com/modin-project/modin/issues/5186: set_index case with multiindex (https://github.com/modin-project/modin/pull/5190)
      • FIX-https://github.com/modin-project/modin/issues/5187: Fixed RecursionError in OmnisciLaunchParameters.get() (https://github.com/modin-project/modin/pull/5199)
      • FIX-https://github.com/modin-project/modin/issues/5204: Fix binary operations with a dictionary (https://github.com/modin-project/modin/pull/5205)
      • FIX-https://github.com/modin-project/modin/issues/5208: Support ray==2.1.0 (https://github.com/modin-project/modin/pull/5283)
      • FIX-https://github.com/modin-project/modin/issues/5232: Stop changing original series names during binary ops. (https://github.com/modin-project/modin/pull/5249)
      • FIX-https://github.com/modin-project/modin/issues/5234: Use query compiler str_repeat. (https://github.com/modin-project/modin/pull/5235)
      • FIX-https://github.com/modin-project/modin/issues/5236: Allow binary operations with custom classes. (https://github.com/modin-project/modin/pull/5237)
      • FIX-https://github.com/modin-project/modin/issues/5238: Make rmul really rmul instead of mul. (https://github.com/modin-project/modin/pull/5246)
      • FIX-https://github.com/modin-project/modin/issues/5240: Fix dask[complete] syntax in conda environment files (https://github.com/modin-project/modin/pull/5241)
      • FIX-https://github.com/modin-project/modin/issues/5252: Disable notebook tests until access control issues are resolved for modin-test bucket (https://github.com/modin-project/modin/pull/5257)
      • FIX-https://github.com/modin-project/modin/issues/5277: Fix internal execute function (https://github.com/modin-project/modin/pull/5278)
      • FIX-https://github.com/modin-project/modin/issues/5284: Move ray, redis, tqdm, xgboost packages from pip to conda deps (https://github.com/modin-project/modin/pull/5270)
      • FIX-https://github.com/modin-project/modin/issues/5285: Check for both pyarrow and fastparquet when read parquet format (https://github.com/modin-project/modin/pull/5297)
      • FIX-https://github.com/modin-project/modin/issues/5306: Fix code scanning alert - Use of the return value of a procedure (https://github.com/modin-project/modin/pull/5307)
      • FIX-https://github.com/modin-project/modin/issues/5308: Allow custom execution with no known engine. (https://github.com/modin-project/modin/pull/5379)
      • FIX-https://github.com/modin-project/modin/issues/5319: Do not use deprecated '.iteritems()' (https://github.com/modin-project/modin/pull/5320)
      • FIX-https://github.com/modin-project/modin/issues/5325: Fix read_csv_glob with non-empty parse_dates dict (https://github.com/modin-project/modin/pull/5339)
      • FIX-https://github.com/modin-project/modin/issues/5327: Bump mypy cap to fix CI. (https://github.com/modin-project/modin/pull/5328)
      • FIX-https://github.com/modin-project/modin/issues/5364: Fix get_indices internal function (https://github.com/modin-project/modin/pull/5355)
      • FIX-https://github.com/modin-project/modin/issues/5380: Fix warning about setting _cache attribute. (https://github.com/modin-project/modin/pull/5381)
      • FIX-https://github.com/modin-project/modin/issues/5398: Resolve length 1 nonNA partition issue, and off by one error in sort (https://github.com/modin-project/modin/pull/5400)
      • FIX-https://github.com/modin-project/modin/issues/5405: Pin ray>=1.13.0 (https://github.com/modin-project/modin/pull/5390)
    • Performance enhancements
      • PERF-https://github.com/modin-project/modin/issues/5225: Do not convert 'value' to a list at '.insert()' (https://github.com/modin-project/modin/pull/5226)
      • PERF-https://github.com/modin-project/modin/issues/5268: Call get on all partitions at once in to_pandas (https://github.com/modin-project/modin/pull/4776)
    • Refactor Codebase
      • REFACTOR-https://github.com/modin-project/modin/issues/5202: Pass loc arguments to query compiler. (https://github.com/modin-project/modin/pull/5305)
      • REFACTOR-https://github.com/modin-project/modin/issues/5262: Update the examples to the latest version of the omniscripts (https://github.com/modin-project/modin/pull/5263)
      • REFACTOR-https://github.com/modin-project/modin/issues/5287: Remove code to test getting TypeError for Series.dropna (https://github.com/modin-project/modin/pull/5288)
      • REFACTOR-https://github.com/modin-project/modin/issues/5294: Fix code scanning alert - Potentially uninitialized local variable (https://github.com/modin-project/modin/pull/5383)
      • REFACTOR-https://github.com/modin-project/modin/issues/5299: Variable defined multiple times error found by CodeQL (https://github.com/modin-project/modin/pull/5300)
      • REFACTOR-https://github.com/modin-project/modin/issues/5301: Fix code scanning alert - Duplicate key in dict literal (https://github.com/modin-project/modin/pull/5302)
      • REFACTOR-https://github.com/modin-project/modin/issues/5303: Fix code scanning alert - Unused local variable (https://github.com/modin-project/modin/pull/5304)
      • REFACTOR-https://github.com/modin-project/modin/issues/5310: Remove some hasattr('columns') checks. (https://github.com/modin-project/modin/pull/5311)
      • REFACTOR-https://github.com/modin-project/modin/issues/5312: Let lazy query compilers check for astype and drop errors. (https://github.com/modin-project/modin/pull/5313)
      • REFACTOR-https://github.com/modin-project/modin/issues/5322: Remove python3.7 related code from read_csv_glob (https://github.com/modin-project/modin/pull/5323)
      • REFACTOR-https://github.com/modin-project/modin/issues/5330: Remove BaseIO._read (https://github.com/modin-project/modin/pull/5329)
      • REFACTOR-https://github.com/modin-project/modin/issues/5332: Define PQ_INDEX_REGEX as class variable (https://github.com/modin-project/modin/pull/5333)
      • REFACTOR-https://github.com/modin-project/modin/issues/5334: Make _validate as classmethod (https://github.com/modin-project/modin/pull/5331)
      • REFACTOR-https://github.com/modin-project/modin/issues/5335: Remove unnecessary lambdas (https://github.com/modin-project/modin/pull/5336)
      • REFACTOR-https://github.com/modin-project/modin/issues/5359: Fix code scanning alert - File is not always closed (https://github.com/modin-project/modin/pull/5362)
      • REFACTOR-https://github.com/modin-project/modin/issues/5363: Introduce partition constructor; move add_to_apply_calls impl in base class (https://github.com/modin-project/modin/pull/5354)
      • REFACTOR-https://github.com/modin-project/modin/issues/5382: Use pandas.util.cache_readonly for __constructors__ (https://github.com/modin-project/modin/pull/5368)
      • REFACTOR-https://github.com/modin-project/modin/issues/5386: Move partition.split implementation in base class (https://github.com/modin-project/modin/pull/5384)
      • REFACTOR-https://github.com/modin-project/modin/issues/5391: Improve setup function in TimeDropDuplicatesDataframe (https://github.com/modin-project/modin/pull/5389)
      • REFACTOR-https://github.com/modin-project/modin/issues/5413: Check Index.dtype instead of isinstance(obj, Int64Index) (https://github.com/modin-project/modin/pull/5406)
    • Update testing suite
      • TEST-https://github.com/modin-project/modin/issues/2073: Check that read_csv can use a parse_dates dict. (https://github.com/modin-project/modin/pull/4572)
      • TEST-https://github.com/modin-project/modin/issues/4562: In windows CI, try to start ray a few times (https://github.com/modin-project/modin/pull/5101)
      • TEST-https://github.com/modin-project/modin/issues/4821: Monkeypatch cache_readonly to avoid errors in doc_checker.py (https://github.com/modin-project/modin/pull/5365)
      • TEST-https://github.com/modin-project/modin/issues/5123: Add CodeQL workflow for GitHub code scanning (https://github.com/modin-project/modin/pull/5222)
      • TEST-https://github.com/modin-project/modin/issues/5219: Relax matplotlib and coverage pins (https://github.com/modin-project/modin/pull/5216)
      • TEST-https://github.com/modin-project/modin/issues/5259: Use new URL for dataset (https://github.com/modin-project/modin/pull/5401)
      • TEST-https://github.com/modin-project/modin/issues/5261: Port indexing, reindex and fillna benchmarks from pandas github (https://github.com/modin-project/modin/pull/5244)
      • TEST-https://github.com/modin-project/modin/issues/5280: Test pandas objects for non-commutative multiply. (https://github.com/modin-project/modin/pull/5281)
      • TEST-https://github.com/modin-project/modin/issues/5290: Add testing for unidist on push (https://github.com/modin-project/modin/pull/5291)
      • TEST-https://github.com/modin-project/modin/issues/5340: Use dev requirements in test-ray-master to get fastparquet (https://github.com/modin-project/modin/pull/5347)
      • TEST-https://github.com/modin-project/modin/issues/5341: Bump test-ray-master ray to 3.0. (https://github.com/modin-project/modin/pull/5342)
      • TEST-https://github.com/modin-project/modin/issues/5343: Unpin test-ray-client ray version. (https://github.com/modin-project/modin/pull/5344)
      • TEST-https://github.com/modin-project/modin/issues/5345: Stop running CI for some workflow changes. (https://github.com/modin-project/modin/pull/5346)
      • TEST-https://github.com/modin-project/modin/issues/5348: Instead of capping mypy, exclude 0.990. (https://github.com/modin-project/modin/pull/5349)
      • TEST-https://github.com/modin-project/modin/issues/5350: Port DropDuplicates and LevelAlign benchmarks from pandas github (https://github.com/modin-project/modin/pull/5351)
      • TEST-https://github.com/modin-project/modin/issues/5374: Port DatetimeAccessor and Categories benchmarks from pandas github (https://github.com/modin-project/modin/pull/5375)
      • TEST-https://github.com/modin-project/modin/issues/5378: Port stack, unstack, replace and groups benchmarks from pandas (https://github.com/modin-project/modin/pull/5388)
    • Documentation improvements
      • DOCS-https://github.com/modin-project/modin/issues/5279: Add documentation for pandas on unidist (https://github.com/modin-project/modin/pull/5289)
      • DOCS-https://github.com/modin-project/modin/issues/5292: Make readme image links raw so they render on pypi.org. (https://github.com/modin-project/modin/pull/5293)
      • DOCS-https://github.com/modin-project/modin/issues/5314: Update documentation for Ray Generic module (https://github.com/modin-project/modin/pull/5315)
      • DOCS-https://github.com/modin-project/modin/issues/5356: Update conda install instructions (https://github.com/modin-project/modin/pull/5357)
      • DOCS-https://github.com/modin-project/modin/issues/5402: Add warning about instability of sort (https://github.com/modin-project/modin/pull/5403)
    • New Features
      • FEAT-https://github.com/modin-project/modin/issues/3535: Implement partition shuffling mechanism and algebra sort_by (https://github.com/modin-project/modin/pull/4601)
      • FEAT-https://github.com/modin-project/modin/issues/4263: Efficiently construct dataframes from a dict of modin Series (https://github.com/modin-project/modin/pull/5193)
      • FEAT-https://github.com/modin-project/modin/issues/4433: Add support of MultiIndex in reindex method (https://github.com/modin-project/modin/pull/4434)
      • FEAT-https://github.com/modin-project/modin/issues/4747: Implement release notes generation (https://github.com/modin-project/modin/pull/5214)
      • FEAT-https://github.com/modin-project/modin/issues/4897: Drop python 3.6 support. (https://github.com/modin-project/modin/pull/5229)
      • FEAT-https://github.com/modin-project/modin/issues/5053: Add pandas on unidist execution with MPI backend (https://github.com/modin-project/modin/pull/5059)
      • FEAT-https://github.com/modin-project/modin/issues/5223: Execute SQL queries on the HDK backend (https://github.com/modin-project/modin/pull/5224)
      • FEAT-https://github.com/modin-project/modin/issues/5230: Support external query compiler and IO (https://github.com/modin-project/modin/pull/5231)
      • FEAT-https://github.com/modin-project/modin/issues/5242: Implement str.extract when expand==True (https://github.com/modin-project/modin/pull/5243)
      • FEAT-https://github.com/modin-project/modin/issues/5253: Upgrade pandas to 1.5.2 (https://github.com/modin-project/modin/pull/5254)
      • FEAT-https://github.com/modin-project/modin/issues/5255: Add a timestamp to the folder names generated by the logger (https://github.com/modin-project/modin/pull/5321)
      • FEAT-https://github.com/modin-project/modin/issues/5367: Introduce new API for repartitioning Modin objects (https://github.com/modin-project/modin/pull/5366)
      • FEAT-https://github.com/modin-project/modin/issues/5387: Enable rebalance_partitions for Unidist (https://github.com/modin-project/modin/pull/5385)
      • FEAT-https://github.com/modin-project/modin/issues/5396: Bump pyhdk version to 0.3 (https://github.com/modin-project/modin/pull/5397)

    Contributors

    @AndreyPavlenko @Billy2551 @Garra1980 @RehanSD @YarShev @anmyachev @arunjose696 @dchigarev @devin-petersohn @lgtm-migrator @mvashishtha @noloerino @pyrito @trgiangdo @vnlitvinov @Retribution98

    Source code(tar.gz)
    Source code(zip)
  • 0.17.1(Nov 25, 2022)

    This release includes pandas 1.5.2 support and a bunch of bug fixes.

    Key Features and Updates Since 0.17.0

    • Stability and Bugfixes
      • FIX-#4100: Fall back to Pandas on row drop (#4937)
      • FIX-#4636: allows read_parquet to detect column partitioning in non-local filesystems (#5192)
      • FIX-#5138: df_categories_equals typo (#5250)
      • FIX-#5186: set_index case with multiindex (#5190)
      • FIX-#5187: Fixed RecursionError in OmnisciLaunchParameters.get() (#5199)
      • FIX-#5204: fix binary operations with a dictionary (#5205)
      • FIX-#5232: Stop changing original series names during binary ops. (#5249)
      • FIX-#5234: Use query compiler str_repeat. (#5235)
      • FIX-#5236: Allow binary operations with custom classes. (#5237)
      • FIX-#5252: Disable notebook tests until access control issues are resolved for modin-test bucket (#5257)
    • New Features
      • FEAT-#5253: Upgrade pandas to 1.5.2 (#5254)

    Contributors

    @AndreyPavlenko @Billy2551 @RehanSD @YarShev @anmyachev @dchigarev @mvashishtha @noloerino

    Source code(tar.gz)
    Source code(zip)
  • 0.17.0(Nov 11, 2022)

    This release includes support for pyhdk 0.2. It also includes many bug fixes and some performance enhancements.

    Key Features and Updates Since 0.16.0

    • Stability and Bugfixes
      • FIX-#3764: Ensure df.loc with a scalar out of bounds appends to df (#3765)
      • FIX-#4016, FIX-#4086, FIX-#4039: Fall back to pandas in case of duplicate column names (#4896)
      • FIX-#4023: Fall back to pandas in case of MultiIndex columns (#5149)
      • FIX-#4660: Fix fillna when Modin series object is an argument (#4674)
      • FIX-#5034: Handle lists in df.get() (#5035)
      • FIX-#5097: Stop using deprecated mangle_dup_cols. (#5104)
      • FIX-#5098: Stop using append internally. (#5100)
      • FIX-#5099: Fix PandasQueryCompiler.groupby_mean with timestamp in by (#5140)
      • FIX-#5112: allows empty partition to be passed into query_compiler.dt_prop_map (#5133)
      • FIX-#5128: Fix reading parquet directory from s3. (#5129)
      • FIX-#5150: Sync row labels after read_csv when index_col is False (#5151)
      • FIX-#5158: Synchronize metadata before to_parquet (#5161)
      • FIX-#5168: module 'collections' has no attribute 'Sequence' in dataframe protocol (#5169)
      • FIX-#5174: Pin xgboost < 1.7. (#5175)
      • FIX-#5180: Do not set OMP_NUM_THREADS=1 on modin.pandas init (#5181)
      • FIX-#5184: Fix get_dummies to respect passed columns to be encoded (#5185)
      • FIX-#5188: Fix getitem_bool when the key is Series with empty partition (#5189)
      • FIX-#5206: pin mypy<0.990 (#5207)
      • FIX-#5208: pin ray version under 2.1.0 (#5209)
    • Performance enhancements
      • PERF-#5029: Don't use _compute_axis_labels_and_lengths for computing _row_lengths/_column_widths (#5030)
      • PERF-#5087: use cache for widths/lengths/index/columns if possible (#5031)
      • PERF-#5162: precompute new row/column lengths in '._reorder_labels' (#5144)
    • Refactor Codebase
      • REFACTOR-#4631: Add mypy checks for modin.distributed (#5109)
      • REFACTOR-#5079: Add mypy checks for modin.core.dataframe.base (#5110)
      • REFACTOR-#5092: Fix future warning for set_axis function (#5093)
    • Update testing suite
      • TEST-#4982: Require format for PR descriptions instead of commit descriptions (#5117)
      • TEST-#5124: Disable codecov comments. (#5125)
      • TEST-#5135: Return CI back after accidental removal (#5136)
      • TEST-#5172: Add fuzzydata logs to artifacts (#5173)
    • Benchmarking enhancements
      • BENCH: add some cases for join and merge ops from pandas (#5021)
      • TEST-#5102: Add HDK benchmarks to github workflows (#5063)
    • Documentation improvements
      • DOCS-#3634: Fix examples related to ProgressBar usage (#5119)
      • DOCS-#5019: Update HDK on native documentation (#5088)
      • DOCS-#5095: Remove release note checkbox from PR template (#5096)
      • DOCS-#5105: Update release procedure (#5106)
    • New Features
      • FEAT-#5120: Update to pyhdk 0.2 (#5121)
      • FEAT-#5141: Implement 2D insertion of Modin DFs in .__setitem__ (#5142)
      • FEAT-#5145: Upgrade pandas to 1.5.1 (#5146)
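
FIX-#5168 above refers to the error `module 'collections' has no attribute 'Sequence'`. From Python 3.10 onward the abstract base classes are only available from `collections.abc`, so code that still references `collections.Sequence` fails with that message. A minimal sketch of the portable import, not taken from the Modin patch itself (`is_sequence` is a hypothetical helper):

```python
from collections.abc import Sequence  # available since Python 3.3


def is_sequence(obj):
    """Hypothetical helper: True for list-like sequences, False for str/bytes."""
    return isinstance(obj, Sequence) and not isinstance(obj, (str, bytes))


print(is_sequence([1, 2, 3]))
print(is_sequence("abc"))
print(is_sequence({"a": 1}))
```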

    Contributors

    @AndreyPavlenko @Billy2551 @RehanSD @YarShev @anmyachev @dchigarev @devin-petersohn @ienkovich @mvashishtha @noloerino @pyrito @rosdyana @shalearkane @suhailrehman @vnlitvinov

  • 0.16.2(Oct 21, 2022)

    This release includes pandas 1.5.1 support and two bug fixes.

    Key Features and Updates

    • Stability and Bugfixes
      • FIX-#4016, FIX-#4086, FIX-#4039: Fall back to pandas in case of duplicate column names (#4896)
      • FIX-#5128: Fix reading parquet directory from s3. (#5129)
    • New Features
      • FEAT-#5145: Upgrade pandas to 1.5.1 (#5146)

    Contributors

    @AndreyPavlenko @mvashishtha @YarShev

  • 0.16.1(Oct 11, 2022)

    This release features a bug fix, as well as fixes for deprecation warnings introduced by pandas 1.5.

    Key Features and Updates

    • Stability and Bugfixes
      • FIX-#5034: Handle lists in df.get() (#5035)
      • FIX-#5098: Stop using append internally. (#5100)
      • FIX-#5097: Stop using deprecated mangle_dup_cols. (#5104)
    • Refactor Codebase
      • REFACTOR-#5092: Fix future warning for set_axis function (#5093)

    Contributors

    @mvashishtha @pyrito @anmyachev @vnlitvinov

  • 0.16.0(Oct 5, 2022)

    This release includes support for pandas 1.5, support for the latest version of Dask, and backwards compatibility with Python 3.6 and pandas 1.1. Additionally, it includes many performance enhancements, bug fixes, and documentation improvements.

    Key Features and Updates

    • Stability and Bugfixes
      • FIX-#4570: Replace np.bool -> np.bool_ (#4571)
      • FIX-#4543: Fix read_csv in case skiprows=<0, []> (#4544)
      • FIX-#4059: Add cell-wise execution for binary ops, fix bin ops for empty dataframes (#4391)
      • FIX-#4589: Pin protobuf<4.0.0 to fix ray (#4590)
      • FIX-#4577: Set attribute of Modin dataframe to updated value (#4588)
      • FIX-#4411: Fix binary_op between datetime64 Series and pandas timedelta (#4592)
      • FIX-#4604: Fix groupby + agg in case when multicolumn can arise (#4642)
      • FIX-#4582: Inherit custom log layer (#4583)
      • FIX-#4639: Fix storage_options usage for read_csv and read_csv_glob (#4644)
      • FIX-#4593: Ensure Modin warns when setting columns via attributes (#4621)
      • FIX-#4584: Enable pdb debug when running cloud tests (#4585)
      • FIX-#4564: Workaround import issues in Ray: auto-import pandas on python start if env var is set (#4603)
      • FIX-#4641: Reindex pandas partitions in df.describe() (#4651)
      • FIX-#2064: Fix iloc/loc assignment when dataframe is empty (#4677)
      • FIX-#4634: Check for FrozenList as by in df.groupby() (#4667)
      • FIX-#4680: Fix read_csv that started defaulting to pandas again in case of reading from a buffer and when a buffer has a non-zero starting position (#4681)
      • FIX-#4491: Wait for all partitions in parallel in benchmark mode (#4656)
      • FIX-#4358: MultiIndex loc shouldn't drop levels for full-key lookups (#4608)
      • FIX-#4658: Expand exception handling for read_* functions from s3 storages (#4659)
      • FIX-#4672: Fix incorrect warning when setting frame.index or frame.columns (#4721)
      • FIX-#4686: Propagate metadata and drain call queue in unwrap_partitions (#4697)
      • FIX-#4652: Support categorical data in from_dataframe (#4737)
      • FIX-#4756: Correctly propagate storage_options in read_parquet (#4764)
      • FIX-#4657: Use fsspec for handling s3/http-like paths instead of s3fs (#4710)
      • FIX-#4676: drain sub-virtual-partition call queues (#4695)
      • FIX-#4782: Exclude certain non-parquet files in read_parquet (#4783)
      • FIX-#4808: Set dtypes correctly after column rename (#4809)
      • FIX-#4811: Apply dataframe -> not_dataframe functions to virtual partitions (#4812)
      • FIX-#4099: Use mangled column names but keep the original when building frames from arrow (#4767)
      • FIX-#4838: Bump up modin-spreadsheet to latest master (#4839)
      • FIX-#4840: Change modin-spreadsheet version for notebook requirements (#4841)
      • FIX-#4835: Handle Pathlike paths in read_parquet (#4837)
      • FIX-#4872: Stop checking the private ray mac memory limit (#4873)
      • FIX-#4914: base_lengths should be computed from base_frame instead of self in copartition (#4915)
      • FIX-#4848: Fix rebalancing partitions when NPartitions == 1 (#4874)
      • FIX-#4927: Fix dtypes computation in dataframe.filter (#4928)
      • FIX-#4907: Implement radd for Series and DataFrame (#4908)
      • FIX-#4945: Fix _take_2d_positional that loses indexes due to filtering empty dataframes (#4951)
      • FIX-#4818, PERF-#4825: Fix where by using the new n-ary operator (#4820)
      • FIX-#3983, FIX-#4107: Materialize 'rowid' columns when selecting rows by position (#4834)
      • FIX-#4845: Fix KeyError from __getitem_bool for single row dataframes (#4845)
      • FIX-#4734: Handle Series.apply when return type is a DataFrame (#4830)
      • FIX-#4983: Set frac to None in _sample when n=0 (#4984)
      • FIX-#4993: Return _default_to_pandas in df.attrs (#4995)
      • FIX-#5043: Fix execute function in ASV utils that failed if len(partitions) == 0 (#5044)
      • FIX-#4597: Refactor Partition handling of func, args, kwargs (#4715)
      • FIX-#4996: Evaluate BenchmarkMode at each function call (#4997)
      • FIX-#4022: Fixed empty data frame with index (#4910)
      • FIX-#4090: Fixed check if the index is trivial (#4936)
      • FIX-#4966: Fix to_timedelta to return Series instead of TimedeltaIndex (#5028)
      • FIX-#5042: Fix series getitem with invalid strings (#5048)
      • FIX-#4691: Fix binary operations between virtual partitions (#5049)
      • FIX-#5045: Fix ray virtual_partition.wait with duplicate object refs (#5058)
    • Performance enhancements
      • PERF-#4182: Add cell-wise execution for binary ops, fix bin ops for empty dataframes (#4391)
      • PERF-#4288: Improve perf of groupby.mean for narrow data (#4591)
      • PERF-#4772: Remove df.copy call from from_pandas since it is not needed for Ray and Dask (#4781)
      • PERF-#4325: Improve perf of multi-column assignment in __setitem__ when no new column names are being assigned (#4455)
      • PERF-#3844: Improve perf of drop operation (#4694)
      • PERF-#4727: Improve perf of concat operation (#4728)
      • PERF-#4705: Improve perf of arithmetic operations between Series objects with shared .index (#4689)
      • PERF-#4703: Improve performance in accessing ser.cat.categories, ser.cat.ordered, and ser.__array_priority__ (#4704)
      • PERF-#4305: Parallelize read_parquet over row groups (#4700)
      • PERF-#4773: Compute lengths and widths in put method of Dask partition like Ray does (#4780)
      • PERF-#4732: Avoid overwriting already-evaluated PandasOnRayDataframePartition._length_cache and PandasOnRayDataframePartition._width_cache (#4754)
      • PERF-#4862: Don't call compute_sliced_len.remote when row_labels/col_labels == slice(None) (#4863)
      • PERF-#4713: Stop overriding the ray MacOS object store size limit (#4792)
      • PERF-#4944: Avoid default_to_pandas in Series.cat.codes, Series.dt.tz, and Series.dt.to_pytimedelta (#4833)
      • PERF-#4851: Compute dtypes for binary operations that can only return bool type and the right operand is not a Modin object (#4852)
      • PERF-#4842: copy should not trigger any previous computations (#4843)
      • PERF-#4849: Compute dtypes in concat also for ROW_WISE case when possible (#4850)
      • PERF-#4929: Compute dtype when using Series.dt accessor (#4930)
      • PERF-#4892: Compute lengths in rebalance_partitions when possible (#4893)
      • PERF-#4794: Compute caches in _propagate_index_objs (#4888)
      • PERF-#4860: PandasDataframeAxisPartition.deploy_axis_func should be serialized only once (#4861)
      • PERF-#4890: PandasDataframeAxisPartition.drain should be serialized only once (#4891)
      • PERF-#4870: Avoid index materialization in __getattribute__ and __getitem__ (#4911)
      • PERF-#4886: Use lazy index and columns evaluation in query method (#4887)
      • PERF-#4866: iloc function that is used in partition.mask should be serialized only once (#4901)
      • PERF-#4920: Avoid index and cache computations in take_2d_labels_or_positional unless they are needed (#4921)
      • PERF-#4999: don't call apply in virtual partition's drain_call_queue if call_queue is empty (#4975)
      • PERF-#4268: Implement partition-parallel getitem for bool Series masks (#4753)
      • PERF-#5017: reset_index shouldn't trigger index materialization if possible (#5018)
      • PERF-#4963: Use partition width/length methods instead of _compute_axis_labels_and_lengths if index is already known (#4964)
      • PERF-#4940: Optimize categorical dtype check in concatenate (#4953)
    • Benchmarking enhancements
      • TEST-#5066: Add outer join case for TimeConcat benchmark (#5067)
      • TEST-#5083: Add merge op with categorical data (#5084)
      • FEAT-#4706: Add Modin ClassLogger to PandasDataframePartitionManager (#4707)
      • TEST-#5014: Simplify adding new ASV benchmarks (#5015)
      • TEST-#5064: Update TimeConcat benchmark with new parameter ignore_index (#5065)
      • TEST-#5068: Add binary op benchmark for Series (#5069)
    • Refactor Codebase
      • REFACTOR-#4530: Standardize access to physical data in partitions (#4563)
      • REFACTOR-#4534: Replace logging meta class with class decorator (#4535)
      • REFACTOR-#4708: Delete combine dtypes (#4709)
      • REFACTOR-#4629: Add type annotations to modin/config (#4685)
      • REFACTOR-#4717: Improve PartitionMgr.get_indices() usage (#4718)
      • REFACTOR-#4730: make Indexer immutable (#4731)
      • REFACTOR-#4774: remove _build_treereduce_func call from _compute_dtypes (#4775)
      • REFACTOR-#4750: Delete BaseDataframeAxisPartition.shuffle (#4751)
      • REFACTOR-#4722: Stop suppressing undefined name lint (#4723)
      • REFACTOR-#4832: unify split_result_of_axis_func_pandas (#4831)
      • REFACTOR-#4796: Introduce constant for reduced column name (#4799)
      • REFACTOR-#4000: Remove code duplication for PandasOnRayDataframePartitionManager (#4895)
      • REFACTOR-#3780: Remove code duplication for PandasOnDaskDataframe (#3781)
      • REFACTOR-#4530: Unify access to physical data for any partition type (#4829)
      • REFACTOR-#4978: Align modin/core/execution/dask/common/__init__.py with modin/core/execution/ray/common/__init__.py (#4979)
      • REFACTOR-#4949: Remove code duplication in default2pandas/dataframe.py and default2pandas/any.py (#4950)
      • REFACTOR-#4976: Rename RayTask to RayWrapper in accordance with Dask (#4977)
      • REFACTOR-#4885: De-duplicated take_2d_labels_or_positional methods (#4883)
      • REFACTOR-#5005: Use finalize method instead of list comprehension + drain_call_queue (#5006)
      • REFACTOR-#5001: Remove jenkins stuff (#5002)
      • REFACTOR-#5026: Change exception names to simplify grepping (#5027)
      • REFACTOR-#4970: Rewrite base implementations of a partition's width/length (#4971)
      • REFACTOR-#4942: Remove call method in favor of register due to duplication (#4943)
      • REFACTOR-#4922: Helpers for take_2d_labels_or_positional (#4865)
      • REFACTOR-#5024: Make _row_lengths and _column_widths public (#5025)
      • REFACTOR-#5009: Use RayWrapper.materialize instead of ray.get (#5010)
      • REFACTOR-#4755: Rewrite Pandas version mismatch warning (#4965)
      • REFACTOR-#5012: Add mypy checks for singleton files in base modin directory (#5013)
      • REFACTOR-#5038: Remove unnecessary _method argument from resamplers (#5039)
      • REFACTOR-#5081: Remove c323f7fe385011ed849300155de07645.db file (#5082)
    • Pandas API implementations and improvements
      • FEAT-#4670: Implement convert_dtypes by mapping across partitions (#4671)
    • OmniSci enhancements
      • FEAT-#4913: Enabling pyhdk
    • Update testing suite
      • TEST-#4508: Reduce test_partition_api pytest threads to deflake it (#4551)
      • TEST-#4550: Use much less data in test_partition_api (#4554)
      • TEST-#4610: Remove explicit installation of black/flake8 for omnisci ci-notebooks (#4609)
      • TEST-#2564: Add caching and use mamba for conda setups in GH (#4607)
      • TEST-#4557: Delete multiindex sorts instead of xfailing (#4559)
      • TEST-#4698: Stop passing invalid storage_options param (#4699)
      • TEST-#4745: Pin flake8 to <5 to workaround installation conflict (#4752)
      • TEST-#4875: XFail tests failing due to file gone missing (#4876)
      • TEST-#4879: Use pandas ensure_clean() in place of io_tests_data (#4881)
      • TEST-#4562: Use local Ray cluster in CI to resolve flaky test-compat-win (#5007)
      • TEST-#5040: Rework test_series using eval_general() (#5041)
      • TEST-#5050: Add black to pre-commit hook (#5051)
    • Documentation improvements
      • DOCS-#4552: Change default sphinx language to en to fix sphinx >= 5.0.0 build (#4553)
      • DOCS-#4628: Add to_parquet partial support notes (#4648)
      • DOCS-#4668: Set light theme for readthedocs page, remove theme switcher (#4669)
      • DOCS-#4748: Apply the Triage label to new issues (#4749)
      • DOCS-#4790: Give all templates issue type and triage labels (#4791)
      • DOCS-#4521: Document how to benchmark modin (#5020)
    • Dependencies
      • FEAT-#4598: Add support for pandas 1.4.3 (#4599)
      • FEAT-#4619: Integrate mypy static type checking (#4620)
      • FEAT-#4202: Allow dask past 2022.2.0 (#4769)
      • FEAT-#4925: Upgrade pandas to 1.4.4 (#4926)
      • TEST-#4998: Add flake8 plugins to dev requirements (#5000)
    • New Features
      • FEAT-#4463: Add experimental fuzzydata integration for testing against a randomized dataframe workflow (#4556)
      • FEAT-#4419: Extend virtual partitioning API to pandas on Dask (#4420)
      • FEAT-#4147: Add partial compatibility with Python 3.6 and pandas 1.1 (#4301)
      • FEAT-#4569: Add error message when read_ function defaults to pandas (#4647)
      • FEAT-#4725: Make index and columns lazy in Modin DataFrame (#4726)
      • FEAT-#4664: Finalize compatibility support for Python 3.6 (#4800)
      • FEAT-#4746: Sync interchange protocol with recent API changes (#4763)
      • FEAT-#4733: Support fastparquet as engine for read_parquet (#4807)
      • FEAT-#4766: Support fsspec URLs in read_csv and read_csv_glob (#4898)
      • FEAT-#4827: Implement infer_types dataframe algebra operator (#4871)
      • FEAT-#4989: Switch pandas version to 1.5 (#5037)

    Contributors

    @mvashishtha @NickCrews @prutskov @vnlitvinov @pyrito @suhailrehman @RehanSD @helmeleegy @anmyachev @d33bs @noloerino @devin-petersohn @YarShev @naren-ponder @jbrockmendel @ienkovich @Garra1980 @Billy2551

  • 0.15.3(Sep 7, 2022)

    This release adds support for pandas 1.4.4 and includes a bunch of bugfixes.

    Key Features and Updates

    • Stability and Bugfixes
      • FIX-#4593: Ensure Modin warns when setting columns via attributes (#4621)
      • FIX-#4604: Fix groupby + agg in case when multicolumn can arise (#4642)
      • FIX-#4641: Reindex pandas partitions in df.describe() (#4651)
      • FIX-#4634: Check for FrozenList as by in df.groupby() (#4667)
      • FIX-#2064: Fix iloc/loc assignment when dataframe is empty (#4677)
      • FIX-#4658: Expand exception handling for read_* functions from s3 storages (#4659)
      • FIX-#4672: Fix incorrect warning when setting frame.index or frame.columns (#4721)
      • FIX-#4686: Propagate metadata and drain call queue in unwrap_partitions (#4697)
      • FIX-#4680: Fix read_csv that started defaulting to pandas again in case of reading from a buffer and when a buffer has a non-zero starting position (#4681)
      • FIX-#4808: Set dtypes correctly after column rename (#4809)
      • FIX-#4811: Apply dataframe -> not_dataframe functions to virtual partitions (#4812)
      • FIX-#4848: Fix rebalancing partitions when NPartitions == 1 (#4874)
      • FIX-#4838: Bump up modin-spreadsheet to latest master (#4839)
      • FIX-#4840: Change modin-spreadsheet version for notebook requirements (#4841)
      • FIX-#4657: Use fsspec for handling s3/http-like paths instead of s3fs (#4710)
      • FIX-#4639: Fix storage_options usage for read_csv and read_csv_glob (#4644)
    • Update testing suite
      • TEST-#4875: XFail tests failing due to file gone missing (#4876)
    • Dependencies
      • FEAT-#4925: Upgrade pandas to 1.4.4 (#4926)

    Contributors

    @helmeleegy @YarShev @anmyachev @pyrito @prutskov @jbrockmendel @mvashishtha @RehanSD @vnlitvinov

  • 0.15.2(Jun 25, 2022)

    This release adds support for pandas 1.4.3, pins protobuf < 4.0.0 to ensure compatibility with ray < 1.13, and includes a bugfix for modifying columns via attribute access.

    Key Features and Updates

    • Stability and Bugfixes
      • FIX-https://github.com/modin-project/modin/issues/4589: Pin protobuf<4.0.0 to fix ray (https://github.com/modin-project/modin/pull/4590)
      • FIX-https://github.com/modin-project/modin/issues/4577: Set attribute of Modin dataframe to updated value (https://github.com/modin-project/modin/pull/4588)
    • Dependencies
      • FEAT-https://github.com/modin-project/modin/issues/4598: Add support for pandas 1.4.3 (https://github.com/modin-project/modin/pull/4599)

    Contributors

    @mvashishtha @pyrito @RehanSD

  • 0.15.1(Jun 16, 2022)

    This release pins Ray < 1.13.0 to avoid a deserialization race condition.

    Key Features and Updates

    • Stability and Bugfixes
      • FIX-#4566: Pin Ray < 1.13.0 to avoid deserialization race condition. (#4567)
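
A pin such as `ray < 1.13.0` means an installed version must compare lower than the bound. The sketch below illustrates that comparison with plain tuples. It assumes purely numeric dotted versions with the same number of components, and it is not how pip or Modin evaluates requirements (`version_tuple` and `satisfies_upper_bound` are hypothetical names):

```python
def version_tuple(version):
    """Convert a numeric dotted version such as '1.12.1' into (1, 12, 1)."""
    return tuple(int(part) for part in version.split("."))


def satisfies_upper_bound(installed, bound):
    """Return True when installed < bound, comparing component by component."""
    return version_tuple(installed) < version_tuple(bound)


print(satisfies_upper_bound("1.12.1", "1.13.0"))
print(satisfies_upper_bound("1.13.0", "1.13.0"))
```

Comparing integer tuples rather than raw strings matters here: as strings, "1.9.2" would sort after "1.13.0", while as tuples it correctly sorts before it.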

    Contributors

    @mvashishtha

  • 0.15.0(Jun 8, 2022)

    This release includes updated support for pandas 1.4.2, new Batch and Logging APIs, and a plethora of bug fixes and documentation improvements.

    Key Features and Updates

    • Stability and Bugfixes
      • FIX-https://github.com/modin-project/modin/issues/4376: Upgrade pandas to 1.4.2 (https://github.com/modin-project/modin/pull/4377)
      • FIX-https://github.com/modin-project/modin/issues/3615: Relax some deps in development env (https://github.com/modin-project/modin/pull/4365)
      • FIX-https://github.com/modin-project/modin/issues/4370: Fix broken docstring links (https://github.com/modin-project/modin/pull/4375)
      • FIX-https://github.com/modin-project/modin/issues/4392: Align Modin XGBoost with xgb>=1.6 (https://github.com/modin-project/modin/pull/4393)
      • FIX-https://github.com/modin-project/modin/issues/4385: Get rid of use-deprecated option in pip (https://github.com/modin-project/modin/pull/4386)
      • FIX-https://github.com/modin-project/modin/issues/3527: Fix parquet partitioning issue causing negative row length partitions (https://github.com/modin-project/modin/pull/4368)
      • FIX-https://github.com/modin-project/modin/issues/4330: Override the memory limit to start ray 1.11.0 on Macs (https://github.com/modin-project/modin/pull/4335)
      • FIX-https://github.com/modin-project/modin/issues/4407: Align insert function with pandas in case of numpy array with several columns (https://github.com/modin-project/modin/pull/4408)
      • FIX-https://github.com/modin-project/modin/issues/4373: Fix invalid file path when trying read_csv_glob with usecols parameter (https://github.com/modin-project/modin/pull/4405)
      • FIX-https://github.com/modin-project/modin/issues/4394: Fix issue with multiindex metadata desync (https://github.com/modin-project/modin/pull/4395)
      • FIX-https://github.com/modin-project/modin/issues/4438: Fix reindex function that doesn't preserve initial index metadata (https://github.com/modin-project/modin/pull/4442)
      • FIX-https://github.com/modin-project/modin/issues/4425: Add parameters to groupby pct_change (https://github.com/modin-project/modin/pull/4429)
      • FIX-https://github.com/modin-project/modin/pull/4457: Fix loc when the item needs to be reindexed (https://github.com/modin-project/modin/pull/4457)
      • FIX-https://github.com/modin-project/modin/issues/4414: Add missing f prefix on f-strings found at https://codereview.doctor/ (https://github.com/modin-project/modin/pull/4415)
      • FIX-https://github.com/modin-project/modin/issues/4461: Fix S3 CSV data path (https://github.com/modin-project/modin/pull/4462)
      • FIX-https://github.com/modin-project/modin/issues/4467: drop_duplicates no longer removes items based on index values (https://github.com/modin-project/modin/pull/4468)
      • FIX-https://github.com/modin-project/modin/issues/4449: Drain the call queue before waiting on result in benchmark mode (https://github.com/modin-project/modin/pull/4472)
      • FIX-https://github.com/modin-project/modin/issues/4518: Fix Modin Logging to report specific Modin warnings/errors (https://github.com/modin-project/modin/pull/4519)
      • FIX-https://github.com/modin-project/modin/issues/4481: Allow clipping with a Modin Series of bounds (https://github.com/modin-project/modin/pull/4486)
      • FIX-https://github.com/modin-project/modin/issues/4504: Support na_action in applymap (https://github.com/modin-project/modin/pull/4505)
      • FIX-https://github.com/modin-project/modin/issues/4503: Stop the memory logging thread after session exit (https://github.com/modin-project/modin/pull/4515)
      • FIX-https://github.com/modin-project/modin/issues/4531: Fix a makedirs race condition in to_parquet (https://github.com/modin-project/modin/pull/4533)
      • FIX-https://github.com/modin-project/modin/issues/4464: Refactor Ray utils and quick fix groupby.count failing on virtual partitions (https://github.com/modin-project/modin/pull/4490)
      • FIX-https://github.com/modin-project/modin/issues/4436: Fix to_pydatetime dtype for timezone None (https://github.com/modin-project/modin/pull/4437)
      • FIX-https://github.com/modin-project/modin/issues/4541: Fix merge_asof with non-unique right index (https://github.com/modin-project/modin/pull/4542)
    • Performance enhancements
      • FEAT-https://github.com/modin-project/modin/issues/4320: Add connectorx as an alternative engine for read_sql (https://github.com/modin-project/modin/pull/4346)
      • PERF-https://github.com/modin-project/modin/issues/4493: Use partition size caches more in Modin dataframe (https://github.com/modin-project/modin/pull/4495)
    • Benchmarking enhancements
      • FEAT-https://github.com/modin-project/modin/issues/4371: Add logging to Modin (https://github.com/modin-project/modin/pull/4372)
      • FEAT-https://github.com/modin-project/modin/issues/4501: Add RSS Memory Profiling to Modin Logging (https://github.com/modin-project/modin/pull/4502)
      • FEAT-https://github.com/modin-project/modin/issues/4524: Split Modin API and Memory log files (https://github.com/modin-project/modin/pull/4526)
    • Refactor Codebase
      • REFACTOR-https://github.com/modin-project/modin/issues/4284: use variable length unpacking when getting results from deploy function (https://github.com/modin-project/modin/pull/4285)
      • REFACTOR-https://github.com/modin-project/modin/issues/3642: Move PyArrow storage format usage from main feature to experimental ones (https://github.com/modin-project/modin/pull/4374)
      • REFACTOR-https://github.com/modin-project/modin/issues/4003: Delete the deprecated cloud mortgage example (https://github.com/modin-project/modin/pull/4406)
      • REFACTOR-https://github.com/modin-project/modin/issues/4513: Fix spelling mistakes in docs and docstrings (https://github.com/modin-project/modin/pull/4514)
      • REFACTOR-https://github.com/modin-project/modin/issues/4510: Align experimental and regular IO modules initializations (https://github.com/modin-project/modin/pull/4511)
    • Developer API enhancements
      • FEAT-https://github.com/modin-project/modin/issues/4359: Add dataframe method to the protocol dataframe (https://github.com/modin-project/modin/pull/4360)
    • Update testing suite
      • TEST-https://github.com/modin-project/modin/issues/4363: Use Ray from pypi in CI (https://github.com/modin-project/modin/pull/4364)
      • FIX-https://github.com/modin-project/modin/issues/4422: get rid of case sensitivity for warns_that_defaulting_to_pandas (https://github.com/modin-project/modin/pull/4423)
      • TEST-https://github.com/modin-project/modin/issues/4426: Stop passing is_default kwarg to Modin and pandas (https://github.com/modin-project/modin/pull/4428)
      • FIX-https://github.com/modin-project/modin/issues/4439: Fix flake8 CI fail (https://github.com/modin-project/modin/pull/4440)
      • FIX-https://github.com/modin-project/modin/issues/4409: Fix eval_insert utility that doesn't actually check results of insert function (https://github.com/modin-project/modin/pull/4410)
      • TEST-https://github.com/modin-project/modin/issues/4482: Fix getitem and loc with series of bools (https://github.com/modin-project/modin/pull/4483).
    • Documentation improvements
      • DOCS-https://github.com/modin-project/modin/issues/4296: Fix docs warnings (https://github.com/modin-project/modin/pull/4297)
      • DOCS-https://github.com/modin-project/modin/issues/4388: Turn off fail_on_warning option for docs build (https://github.com/modin-project/modin/pull/4389)
      • DOCS-https://github.com/modin-project/modin/issues/4469: Say that commit messages can start with PERF (https://github.com/modin-project/modin/pull/4470).
      • DOCS-https://github.com/modin-project/modin/issues/4466: Recommend GitHub issues over bug_reports@modin.org (https://github.com/modin-project/modin/pull/4474).
      • DOCS-https://github.com/modin-project/modin/issues/4487: Recommend GitHub issues over feature_requests@modin.org (https://github.com/modin-project/modin/pull/4489).
    • Dependencies
      • FIX-https://github.com/modin-project/modin/issues/4327: Update min pin for xgboost version (https://github.com/modin-project/modin/pull/4328)
      • FIX-https://github.com/modin-project/modin/issues/4383: Remove pathlib from deps (https://github.com/modin-project/modin/pull/4384)
      • FIX-https://github.com/modin-project/modin/issues/4390: Add redis to Modin dependencies (https://github.com/modin-project/modin/pull/4396)
      • FIX-https://github.com/modin-project/modin/issues/3689: Add black and flake8 into development environment files (https://github.com/modin-project/modin/pull/4480)
      • TEST-https://github.com/modin-project/modin/issues/4516: Add numpydoc to developer requirements (https://github.com/modin-project/modin/pull/4517)
    • New Features
      • FEAT-https://github.com/modin-project/modin/issues/4412: Add Batch Pipeline API to Modin (https://github.com/modin-project/modin/pull/4452)

    Contributors

    @YarShev @Garra1980 @prutskov @alexander3774 @amyskov @wangxiaoying @jeffreykennethli @mvashishtha @anmyachev @dchigarev @devin-petersohn @jrsacher @orcahmlee @naren-ponder @RehanSD

  • 0.14.1(May 4, 2022)

    This release contains a few key bugfixes and a pandas version update.

    Key Features and Updates

    • FIX-#4376: Upgrade pandas to 1.4.2 (#4377)
    • FIX-#4390: Add redis to Modin dependencies (#4396)
    • FIX-#3527: Fix parquet partitioning issue causing negative row length partitions (#4368)
    • FIX-#4330: Override the memory limit to start ray 1.11.0 on Macs. (#4335)
    • FIX-#4394: Fix issue with multiindex metadata desync (#4395)
    • FIX-#4373: fix usage of 'read_csv_glob' with 'usecols' parameter (#4405)
    • FIX-#4425: Add parameters to groupby pct_change. (#4429)

    Contributors

    @Garra1980, @devin-petersohn, @dchigarev, @jeffreykennethli, @mvashishtha, @YarShev, @anmyachev

  • 0.14.0(Mar 29, 2022)

    This release contains significant upgrades to the Developer API and to Modin's documentation, along with some codebase refactoring, performance enhancements, and multiple bugfixes.

    Key Features and Updates

    • Stability and Bugfixes
      • FIX-https://github.com/modin-project/modin/issues/4058: Allow pickling empty dataframes and series (https://github.com/modin-project/modin/pull/4095)
      • FIX-https://github.com/modin-project/modin/issues/4136: Fix exercise_3.ipynb example notebook (https://github.com/modin-project/modin/pull/4137)
      • FIX-https://github.com/modin-project/modin/issues/4105: Fix names of pandas options to avoid OptionError (https://github.com/modin-project/modin/pull/4109)
      • FIX-https://github.com/modin-project/modin/issues/3417: Fix read_csv with skiprows and header parameters (https://github.com/modin-project/modin/pull/3419)
      • FIX-https://github.com/modin-project/modin/issues/4142: Fix OmniSci enabling (https://github.com/modin-project/modin/pull/4146)
      • FIX-https://github.com/modin-project/modin/issues/4162: Use skipif instead of skip for compatibility with pytest 7.0 (https://github.com/modin-project/modin/pull/4163)
      • FIX-https://github.com/modin-project/modin/issues/4158: Do not print OmniSci logs to stdout by default (https://github.com/modin-project/modin/pull/4159)
      • FIX-https://github.com/modin-project/modin/issues/4177: Support read_feather from pathlike objects (https://github.com/modin-project/modin/issues/4177)
      • FIX-https://github.com/modin-project/modin/issues/4234: Upgrade pandas to 1.4.1 (https://github.com/modin-project/modin/pull/4235)
      • FIX-https://github.com/modin-project/modin/issues/3368: support unsigned integers in OmniSci backend (https://github.com/modin-project/modin/pull/4256)
      • FIX-https://github.com/modin-project/modin/issues/4057: Allow reading an empty parquet file (https://github.com/modin-project/modin/pull/4075)
      • FIX-https://github.com/modin-project/modin/issues/3884: Fix read_excel() dropping empty rows (https://github.com/modin-project/modin/pull/4161)
      • FIX-https://github.com/modin-project/modin/issues/4257: Fix Categorical() for scalar categories (https://github.com/modin-project/modin/pull/4258)
      • FIX-https://github.com/modin-project/modin/issues/4300: Fix Modin Categorical column dtype categories (https://github.com/modin-project/modin/pull/4276)
      • FIX-https://github.com/modin-project/modin/issues/4208: Fix lazy metadata update for PandasDataFrame.from_labels (https://github.com/modin-project/modin/pull/4209)
      • FIX-https://github.com/modin-project/modin/issues/3981, FIX-https://github.com/modin-project/modin/issues/3801, FIX-https://github.com/modin-project/modin/issues/4149: Stop broadcasting scalars to set items (https://github.com/modin-project/modin/pull/4160)
      • FIX-https://github.com/modin-project/modin/issues/4185: Fix rolling across column partitions (https://github.com/modin-project/modin/pull/4262)
      • FIX-https://github.com/modin-project/modin/issues/4303: Fix the syntax error in reading from postgres (https://github.com/modin-project/modin/pull/4304)
      • FIX-https://github.com/modin-project/modin/issues/4308: Add proper error handling in df.set_index (https://github.com/modin-project/modin/pull/4309)
      • FIX-https://github.com/modin-project/modin/issues/4056: Allow an empty parse_date list in read_csv_glob (https://github.com/modin-project/modin/pull/4074)
      • FIX-https://github.com/modin-project/modin/issues/4312: Fix constructing categorical frame with duplicate column names (https://github.com/modin-project/modin/pull/4313)
      • FIX-https://github.com/modin-project/modin/issues/4314: Allow passing a series of dtypes to astype (https://github.com/modin-project/modin/pull/4318)
      • FIX-https://github.com/modin-project/modin/issues/4310: Handle lists of lists of ints in read_csv_glob (https://github.com/modin-project/modin/pull/4319)
    • Performance enhancements
      • FIX-https://github.com/modin-project/modin/issues/4138, FIX-https://github.com/modin-project/modin/issues/4009: remove redundant sorting in the internal '.mask()' flow (https://github.com/modin-project/modin/pull/4140)
      • FIX-https://github.com/modin-project/modin/issues/4183: Stop shallow copies from creating global shared state. (https://github.com/modin-project/modin/pull/4184)
    • Benchmarking enhancements
      • FIX-https://github.com/modin-project/modin/issues/4221: add wait method for PandasOnRayDataframeColumnPartition class (https://github.com/modin-project/modin/pull/4231)
    • Refactor Codebase
      • REFACTOR-https://github.com/modin-project/modin/issues/3990: remove code duplication in PandasDataframePartition hierarchy (https://github.com/modin-project/modin/pull/3991)
      • REFACTOR-https://github.com/modin-project/modin/issues/4229: remove unused dask_client global variable in modin\pandas\__init__.py (https://github.com/modin-project/modin/pull/4230)
      • REFACTOR-https://github.com/modin-project/modin/issues/3997: remove code duplication for broadcast_apply method (https://github.com/modin-project/modin/pull/3996)
      • REFACTOR-https://github.com/modin-project/modin/issues/3994: remove code duplication for get_indices function (https://github.com/modin-project/modin/pull/3995)
      • REFACTOR-https://github.com/modin-project/modin/issues/4331: remove code duplication for to_pandas, to_numpy functions in QueryCompiler hierarchy (https://github.com/modin-project/modin/pull/4332)
      • REFACTOR-https://github.com/modin-project/modin/issues/4213: Refactor modin/examples/tutorial/ directory (https://github.com/modin-project/modin/pull/4214)
      • REFACTOR-https://github.com/modin-project/modin/issues/4206: add assert check into __init__ method of PandasOnDaskDataframePartition class (https://github.com/modin-project/modin/pull/4207)
      • REFACTOR-https://github.com/modin-project/modin/issues/3900: add flake8-no-implicit-concat plugin and refactor flake8 error codes (https://github.com/modin-project/modin/pull/3901)
      • REFACTOR-https://github.com/modin-project/modin/issues/4093: Refactor base to be smaller (https://github.com/modin-project/modin/pull/4220)
      • REFACTOR-https://github.com/modin-project/modin/issues/4047: Rename cluster directory to cloud in examples (https://github.com/modin-project/modin/pull/4212)
      • REFACTOR-https://github.com/modin-project/modin/issues/3853: interacting with Dask interface through DaskWrapper class (https://github.com/modin-project/modin/pull/3854)
      • REFACTOR-https://github.com/modin-project/modin/issues/4322: Move is_reduce_fn outside of groupby_agg (https://github.com/modin-project/modin/pull/4323)
    • Pandas API implementations and improvements
      • FEAT-https://github.com/modin-project/modin/issues/3603: add experimental read_custom_text function that can read custom line-by-line text files (https://github.com/modin-project/modin/pull/3441)
      • FEAT-https://github.com/modin-project/modin/issues/979: Enable reading from SQL server (https://github.com/modin-project/modin/pull/4279)
    • Developer API enhancements
      • FEAT-https://github.com/modin-project/modin/issues/4245: Define base interface for dataframe exchange protocol (https://github.com/modin-project/modin/pull/4246)
      • FEAT-https://github.com/modin-project/modin/issues/4244: Implement dataframe exchange protocol for OmnisciOnNative execution (https://github.com/modin-project/modin/pull/4269)
      • FEAT-https://github.com/modin-project/modin/issues/4144: Implement dataframe exchange protocol for pandas storage format (https://github.com/modin-project/modin/pull/4150)
      • FEAT-https://github.com/modin-project/modin/issues/4342: Support `from_dataframe` for pandas storage format (https://github.com/modin-project/modin/pull/4343)
    • Update testing suite
      • TEST-https://github.com/modin-project/modin/issues/3628: Report coverage data for test-internals CI job (https://github.com/modin-project/modin/pull/4198)
      • TEST-https://github.com/modin-project/modin/issues/3938: Test tutorial notebooks in CI (https://github.com/modin-project/modin/pull/4145)
      • TEST-https://github.com/modin-project/modin/issues/4153: Fix condition of running lint-commit and set of CI triggers (https://github.com/modin-project/modin/pull/4156)
      • TEST-https://github.com/modin-project/modin/issues/4201: Add read_parquet, explode, tail, and various arithmetic functions to asv_bench (https://github.com/modin-project/modin/pull/4203)
    • Documentation improvements
      • DOCS-https://github.com/modin-project/modin/issues/4077: Add release notes template to docs folder (https://github.com/modin-project/modin/pull/4078)
      • DOCS-https://github.com/modin-project/modin/issues/4082: Add pdf/epub/htmlzip formats for doc builds (https://github.com/modin-project/modin/pull/4083)
      • DOCS-https://github.com/modin-project/modin/issues/4168: Fix rendering the examples on troubleshooting page (https://github.com/modin-project/modin/pull/4169)
      • DOCS-https://github.com/modin-project/modin/issues/4151: Add info in troubleshooting page related to Dask engine usage (https://github.com/modin-project/modin/pull/4152)
      • DOCS-https://github.com/modin-project/modin/issues/4172: Refresh Intel Distribution of Modin paragraph (https://github.com/modin-project/modin/pull/4175)
      • DOCS-https://github.com/modin-project/modin/issues/4173: Mention strict channel priority in conda install section (https://github.com/modin-project/modin/pull/4178)
      • DOCS-https://github.com/modin-project/modin/issues/4176: Update OmniSci usage section (https://github.com/modin-project/modin/pull/4192)
      • DOCS-https://github.com/modin-project/modin/issues/4027: Add GIF images and chart to Modin README demonstrating speedups (https://github.com/modin-project/modin/pull/4232)
      • DOCS-https://github.com/modin-project/modin/issues/3954: Add Dask example notebooks (https://github.com/modin-project/modin/pull/4139)
      • DOCS-https://github.com/modin-project/modin/issues/4272: Add bar chart comparisons to quick start guide (https://github.com/modin-project/modin/pull/4277)
      • DOCS-https://github.com/modin-project/modin/issues/3953: Add docs and notebook examples on running Modin with OmniSci (https://github.com/modin-project/modin/pull/4001)
      • DOCS-https://github.com/modin-project/modin/issues/4280: Change links in jupyter notebooks (https://github.com/modin-project/modin/pull/4281)
      • DOCS-https://github.com/modin-project/modin/issues/4290: Add changes for OmniSci notebooks (https://github.com/modin-project/modin/pull/4291)
      • DOCS-https://github.com/modin-project/modin/issues/4241: Update warnings and docs regarding defaulting to pandas (https://github.com/modin-project/modin/pull/4242)
      • DOCS-https://github.com/modin-project/modin/issues/3099: Fix BasePandasDataSet docstrings warnings (https://github.com/modin-project/modin/pull/4333)
      • DOCS-https://github.com/modin-project/modin/issues/4339: Reformat I/O functions docstrings (https://github.com/modin-project/modin/pull/4341)
      • DOCS-https://github.com/modin-project/modin/issues/4336: Reformat general utilities docstrings (https://github.com/modin-project/modin/pull/4338)
    • Dependencies
      • FIX-https://github.com/modin-project/modin/issues/4113, FIX-https://github.com/modin-project/modin/issues/4116, FIX-https://github.com/modin-project/modin/issues/4115: Apply new black formatting, fix pydocstyle check and readthedocs build (https://github.com/modin-project/modin/pull/4114)
      • TEST-https://github.com/modin-project/modin/issues/3227: Use codecov github action instead of bash form in GA workflows (https://github.com/modin-project/modin/pull/3226)
      • FIX-https://github.com/modin-project/modin/issues/4115: Unpin pip in readthedocs deps list (https://github.com/modin-project/modin/pull/4170)
      • TEST-https://github.com/modin-project/modin/issues/4217: Pin Dask<2022.2.0 as a temporary fix of CI (https://github.com/modin-project/modin/pull/4218)

    Contributors

    @prutskov, @amyskov, @paulovn, @anmyachev, @YarShev, @RehanSD, @devin-petersohn, @dchigarev, @Garra1980, @mvashishtha, @naren-ponder, @jeffreykennethli, @dorisjlee, @Rubtsowa

  • 0.13.3(Mar 18, 2022)

    This release contains a few key bugfixes and a pandas version update.

    Key Features and Updates

    • Stability and Bugfixes
      • Stop shallow dataframe copies from creating global shared state (#4184)
      • Make PandasOnRayDataframeColumnPartition conformant to partition interface (#4231)
      • Fix lazy metadata update for PandasDataFrame.from_labels (#4209)
      • Fix Categorical() for scalar categories (#4258)
      • Fix some cases when assigning a scalar to a subset of dataframe or series. (#4160)
      • Align read_excel() behaviour on empty rows with pandas 1.3+ (#4161)
      • Allow reading an empty parquet file. (#4075)
      • Pin Dask<2022.2.0 as a temporary fix. (#4218)
      • Add proper error handling in df.set_index. (#4309)
    • Documentation improvements
      • Clarify OmniSci activation in its usage section. (#4192)
    • Upgrade pandas to 1.4.1 (#4235)

    Contributors

    @mvashishtha @anmyachev @prutskov @devin-petersohn @naren-ponder @YarShev @Garra1980

  • 0.13.2(Feb 10, 2022)

    This release contains documentation polishing and small user experience improvements.

    Key Features and Updates

    • Mention strict channel priority in conda install section (#4178)
    • Refresh Intel Distribution of Modin paragraph (#4175)
    • Add info in troubleshooting page related to Dask engine usage (#4152)
    • Do not print OmniSci logs to stdout by default (#4159)
    • Fix rendering the examples on troubleshooting page (#4169)
    • Use skipif instead of skip for compatibility with pytest 7.0 (#4163)

    Contributors

    @RehanSD, @YarShev, @dchigarev, @prutskov, @Garra1980

  • 0.13.1(Feb 4, 2022)

    This release contains a few key bugfixes and updates to the documentation.

    Key Features and Updates

    • Stability and Bugfixes
      • FIX-#4058: Allow pickling empty dataframes and series (#4095)
      • FIX-#4105: Fix names of pandas options to avoid OptionError (#4109)
      • FIX-#4142: Fix OmniSci enabling (#4146)
    • Documentation improvements
      • DOCS-#4082: Add pdf/epub/htmlzip formats for doc builds (#4083)
      • DOCS-#4079: Fix link to PandasDataframe in docs (#4108)

    Contributors

    @prutskov, @paulovn, @YarShev, @RehanSD, @devin-petersohn, @mvashishtha

  • 0.13.0(Jan 27, 2022)

    This release contains significant upgrades to Modin's documentation, support for pandas 1.4, new algebra and partitioning layer APIs, and some bugfixes.

    Key Features and Updates

    • Stability and bugfixes
      • Support for subscripting Resampler (1a1edfd)
      • Fix groupby with column name for by (a04d7b7)
      • Workaround for groupby with sort=False with categorical keys (c67a7c5)
      • Align default value of REDIS_PASSWORD with Ray's DEFAULT_REDIS_PASSWORD (f79cb85)
      • Fix groupby dictionary aggregation when by and columns to aggregate overlap (d42c070)
      • Fix read_csv when callables are provided for the skiprows parameter (7c84758)
      • Ensure address is not passed to ray.init when running Ray in local mode (02a23d4)
      • Ensure that groupby.indices returns positional indices (e9c06f2)
      • Fix setting of categorical values (0e36e22)
      • Ensure df.__getitem__ respects step attribute of slice (7e85c5d)
      • Ensure the data argument is delivered to the DataFrame in experimental cloud mode (2f7da1f)
      • Fix assigning to a Series with a single item (0d9d14e)
      • Fix the default to pandas in pd.DataFrame.sparse.from_spmatrix (ab2855b)
      • Fix apply result type inference (ac17ca1)
      • Exclude "scripts" from setup package (6224aba)
      • Fix assigning a Categorical to a column (cb4e727)
      • Ensure df.to_csv propagates metadata (e.g. index) (154697b)
      • Update pyarrow requirement in environment files (b55b08d)
    • Performance enhancements
      • Optimize __getitem__ flow for .loc/.iloc (0947ee8)
      • Delay instantiation of lazy dtypes on transpose (cd8db0c)
    • Benchmarking enhancements
      • Update groupby benchmarks to be more representative (0582aa2)
    • Refactor Codebase
      • Update CODEOWNERS to reflect repository after refactor (cde6390)
      • Remove duplicate import of FactoryDispatcher in Modin experimental pandas IO (2cfabaf)
      • Update Modin to incorporate dataframe algebra (58bbcc3)
    • Pandas API implementations and improvements
      • Add support for storage_options argument to read_csv_glob (7c33afe)
      • Add support for dropna argument for groupby.indices and groupby.groups (144a613)
      • Ensure relabeling Modin Frame does not lose partition shape (3c740db)
      • Update Series.values to default to to_numpy() (67228ef)
      • Add support for modin.pandas.show_versions and python -m modin --versions (efe717f)
      • Upgrade pandas support to 1.4 (39fbc57)
    • OmniSci enhancements
      • Update groupby benchmarks to be more representative (9396f23)
      • Update documentation on Native + OmniSci (edc1608)
      • Add support for getArrowTable() (6882ec2)
      • Fix segfault during init when only OmniSci is present (8c8a6a3)
      • Optimize append with default arguments (67013f9)
      • Fix OmniSci engine enabling for IO functions (9d1a334)
    • XGBoost enhancements
    • Developer API enhancements
      • Add parameter for minimum partition size (1be66d1)
      • Improve documentation for read_csv_glob and ensure a warning is raised if a wildcard is not in filepath_or_buffer (be10ba9)
      • Expand virtual partitioning utility (8d1004f)
    • Update testing suite
    • Documentation improvements
      • Improve documentation on pandas on Ray execution (b76dc57)
      • Reformat documentation to match pandas documentation theme (cc96f5d)
      • Improve documentation on pandas on Python execution (d590de0)
      • Improve System view in architecture documentation (6d51921)
      • Improve documentation on using pandas on Dask (003f338)
      • Improve documentation on pandas on Dask execution (61bf043)
      • Add documentation on using pandas on Python (195b668)
      • Improve Modin Out of Core documentation (cf426c4)
      • Improve documentation on OmniSci on native execution (689faee)
      • Improve documentation on IO (ffa67c7)
      • Add documentation on factories and parsers (6ca66db)
      • Improve documentation for experimental pandas on Ray execution (20abddd)
      • Improve documentation for modin.core.dataframe.base and modin.core.dataframe.pandas (cf1e541)
      • Update troubleshooting documentation and add FAQs (cc95ae2)
      • Improve README introduction and installation sections (a632d1f)
      • Update copyright year (7da1dc8)
      • Update a link to pandas.read_json (0315823)
      • Improve documentation for Modin vs. Dask (34732cb)
      • Fix links to the contributing page (81a06d6)
      • Remove broken links from supported apis (c04502d)
      • Change docs copyright statement to 'Modin Developers' (ed2a7a4)
      • Rename Developer page to Development in docs (406af7c)
      • Improve "Getting Started" section (4a62bba)
      • Update Modin tutorials (76707bf)
      • Add back quickstart notebook (4dd97ab)
      • Fix links in README and update README and FAQs (5d84042)
      • Update Modin module layout in architecture docs (7fcafa7)
      • Update documentation with new algebra operators and ModinDataframe (4b70725)
      • Add usage guide to documentation (4511566)
      • Build docs with Python 3.8 (01c1876)
    • Dependencies
      • Update PyArrow to 6.0 and OmniSci to 5.10.1 (018515f)

    Contributors

    @anmyachev, @prutskov, @Rubtsowa, @vnlitvinov, @dchigarev, @YarShev, @amyskov, @mvashishtha, @dorisjlee, @devin-petersohn, @jeffreykennethli, @RehanSD, @novichkovg, @Lozovskii-Aleksandr, @naren-ponder, @ahallermed, @fexolm, @adityagp, @susmitpy, @ienkovich
