A simple and efficient tool to parallelize Pandas operations on all available CPUs

Overview

Pandaral·lel

PyPI version fury.io PyPI license PyPI download month

Without parallelization Without Pandarallel
With parallelization With Pandarallel

Installation

$ pip install pandarallel [--upgrade] [--user]

Requirements

On Windows, Pandaral·lel will works only if the Python session (python, ipython, jupyter notebook, jupyter lab, ...) is executed from Windows Subsystem for Linux (WSL).

On Linux & macOS, nothing special has to be done.

Warning

  • Parallelization has a cost (instantiating new processes, sending data via shared memory, ...), so parallelization is efficient only if the amount of calculation to parallelize is high enough. For very little amount of data, using parallelization is not always worth it.

Examples

An example of each API is available here.

Benchmark

For some examples, here is the comparative benchmark with and without using Pandaral·lel.

Computer used for this benchmark:

  • OS: Linux Ubuntu 16.04
  • Hardware: Intel Core i7 @ 3.40 GHz - 4 cores

Benchmark

For those given examples, parallel operations run approximately 4x faster than the standard operations (except for series.map which runs only 3.2x faster).

API

First, you have to import pandarallel:

from pandarallel import pandarallel

Then, you have to initialize it.

pandarallel.initialize()

This method takes 5 optional parameters:

  • shm_size_mb: Deprecated.
  • nb_workers: Number of workers used for parallelization. (int) If not set, all available CPUs will be used.
  • progress_bar: Display progress bars if set to True. (bool, False by default)
  • verbose: The verbosity level (int, 2 by default)
    • 0 - don't display any logs
    • 1 - display only warning logs
    • 2 - display all logs
  • use_memory_fs: (bool, None by default)
    • If set to None and if memory file system is available, Pandarallel will use it to transfer data between the main process and workers. If memory file system is not available, Pandarallel will default on multiprocessing data transfer (pipe).
    • If set to True, Pandarallel will use memory file system to transfer data between the main process and workers and will raise a SystemError if memory file system is not available.
    • If set to False, Pandarallel will use multiprocessing data transfer (pipe) to transfer data between the main process and workers.

Using memory file system reduces data transfer time between the main process and workers, especially for big data.

Memory file system is considered as available only if the directory /dev/shm exists and if the user has read and write rights on it.

Basically, memory file system is only available on some Linux distributions (including Ubuntu).

With df a pandas DataFrame, series a pandas Series, func a function to apply/map, args, args1, args2 some arguments, and col_name a column name:

Without parallelization With parallelization
df.apply(func) df.parallel_apply(func)
df.applymap(func) df.parallel_applymap(func)
df.groupby(args).apply(func) df.groupby(args).parallel_apply(func)
df.groupby(args1).col_name.rolling(args2).apply(func) df.groupby(args1).col_name.rolling(args2).parallel_apply(func)
df.groupby(args1).col_name.expanding(args2).apply(func) df.groupby(args1).col_name.expanding(args2).parallel_apply(func)
series.map(func) series.parallel_map(func)
series.apply(func) series.parallel_apply(func)
series.rolling(args).apply(func) series.rolling(args).parallel_apply(func)

You will find a complete example here for each row in this table.

Troubleshooting

I have 8 CPUs but parallel_apply speeds up computation only about x4. Why?

Actually Pandarallel can only speed up computation until about the number of cores your computer has. The majority of recent CPUs (like Intel Core i7) uses hyperthreading. For example, a 4-core hyperthreaded CPU will show 8 CPUs to the operating system, but will really have only 4 physical computation units.

On Ubuntu, you can get the number of cores with $ grep -m 1 'cpu cores' /proc/cpuinfo.


I use Jupyter Lab and instead of progress bars, I see these kind of things:
VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=625000), Label(value='0 / 625000')…

Run the following 3 lines, and you should be able to see the progress bars:

$ pip install ipywidgets
$ jupyter nbextension enable --py widgetsnbextension
$ jupyter labextension install @jupyter-widgets/jupyterlab-manager

(You may also have to install nodejs if asked)

Comments
  • Maximum size exceeded

    Maximum size exceeded

    hi I set pandarallel.initialize(shm_size_mb=10000) and after apply parallel_apply to my column i get the net error Maximum size exceeded (2GB)

    why i get this message when i set more than 2gb?

    bug 
    opened by vvssttkk 23
  • Setting progress_bar=True freezes execution for parallel_apply before reaching 1% completion on all CPU's

    Setting progress_bar=True freezes execution for parallel_apply before reaching 1% completion on all CPU's

    When progress_bar=True, I noticed that the execution of my parallel_apply task stopped right before all parallel processes reached 1% progress mark. Here are some further details of what I was encountering -

    • I turned on logging with DEBUG messages, but no messages were displayed when the execution stopped. There were no error messages either. The dataframe rows simply stopped processing further and the process seemed to be frozen.
    • I have two CPU's. It seems that the progress bar only updates in 1% increments. One of the progress bars reaches 1% mark, but when the number of processed rows reaches the 2% mark (which I assume is associated with the second progress bar updating to 1% as well), that's when the process froze.
    • The process runs fine with progress_bar=False.
    opened by abhineetgupta 22
  • pandarallel_apply crashes with OverflowError: int too big to convert

    pandarallel_apply crashes with OverflowError: int too big to convert

    Hi everyone,

    I am getting this error here using parallel_apply in pandas:

      File "extract_specifications.py", line 156, in <module>
        extracted_data = df.parallel_apply(extract_raw_infos, axis=1)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 367, in closure
        kwargs,
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 239, in get_workers_args
        zip(input_files, output_files, chunk_lengths)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 238, in <listcomp>
        for index, (input_file, output_file, chunk_length) in enumerate(
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 169, in wrapper
        time=time,
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
        return function(*args, **kwargs)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 464, in inline
        func_instructions, len(b"".join(pinned_pre_func_instructions_without_return))
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
        return function(*args, **kwargs)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 314, in shift_instructions
        for instruction in instructions
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 314, in <genexpr>
        for instruction in instructions
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
        return function(*args, **kwargs)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 293, in shift_instruction
        return bytes((operation,)) + int2python_bytes(python_ints2int(values) + qty)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
        return function(*args, **kwargs)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 71, in int2python_bytes
        return int.to_bytes(item, nb_bytes, "little")
    OverflowError: int too big to convert
    

    I am using

    pandarallel == 1.4.2
    pandas == 0.24.2
    python == 3.6.9
    

    Any idea how to proceed from here? I have basically no idea what could cause this bug. I suspect it might be related to the size of the data I have in one column (I save html from web pages in there). But otherwise no idea. I would help removing this bug(?) if I had some guidance here. Thx for helping.

    opened by yeus 21
  • Fails with

    Fails with "_wrap_applied_output() missing 1 required positional argument" where a simple pandas apply succeeds

    Hello,

    I'm using python 3.8.10 (anaconda distribution, GCC 7.5.10) in Ubuntu LTS 20 64bits x86

    From my pip freeze:

    pandarallel 1.5.2 pandas 1.3.0 numpy 1.20.3

    I'm working with a dataFrame that looks like this one:

    HoleID scaffold tpl strand base score tMean tErr modelPrediction ipdRatio coverage isboundary identificationQv context experiment isbegin_bondary isend_boundary isin_IES uniqueID No_known_IES_retention_this_CCS detailed_classif
    1025444 70189477 scaffold_024_with_IES 688203 0 T 2 0.517 0.190 0.555 0.931 11 True NaN TTAAATAGAAATTAAAATCAGCTGC NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
    1025446 70189477 scaffold_024_with_IES 688204 0 A 4 1.347 0.367 1.251 1.077 13 True NaN TAAATAGAAATTAAAATCAGCTGCT NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
    1025448 70189477 scaffold_024_with_IES 688205 0 A 5 1.913 0.779 1.464 1.307 16 True NaN AAATAGAAATTAAAATCAGCTGCTT NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
    1025450 70189477 scaffold_024_with_IES 688206 0 A 4 1.535 0.712 1.328 1.156 18 True NaN AATAGAAATTAAAATCAGCTGCTTA NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
    1025452 70189477 scaffold_024_with_IES 688207 0 A 5 1.655 0.565 1.391 1.190 18 True NaN ATAGAAATTAAAATCAGCTGCTTAA NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES

    I defined the following function

    def get_distance_from_nearest_criteria(df,criteria):
        begins = df[df[criteria]].copy()
        
        if len(begins) == 0:
            return pd.Series([np.nan for x in range(len(df))])
        else:
            list_return = []
    
            for idx, nt in df.iterrows():
                distances = [abs(nt["tpl"] - x) for x in begins["tpl"]]
                mindistance = min(distances,default=np.nan)
                list_return.append(mindistance)
    
            return pd.Series(list_return)
    

    Then using :

    from pandarallel import pandarallel
    pandarallel.initialize(progress_bar=False, nb_workers=12)
    out = df.groupby(["uniqueID"]).parallel_apply(lambda x: get_distance_from_nearest_criteria(x,'isbegin_bondary'))
    

    leads to :

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-49-02fc7c0589e3> in <module>
    ----> 1 out = df.groupby(["uniqueID"]).parallel_apply(lambda x: get_distance_from_nearest_criteria(x,'isbegin_bondary'))
    
    ~/conda3/envs/ies/lib/python3.8/site-packages/pandarallel/pandarallel.py in closure(data, func, *args, **kwargs)
        463             )
        464 
    --> 465             return reduce(results, reduce_meta_args)
        466 
        467         finally:
    
    ~/conda3/envs/ies/lib/python3.8/site-packages/pandarallel/data_types/dataframe_groupby.py in reduce(results, df_grouped)
         14         keys, values, mutated = zip(*results)
         15         mutated = any(mutated)
    ---> 16         return df_grouped._wrap_applied_output(
         17             keys, values, not_indexed_same=df_grouped.mutated or mutated
         18         )
    
    TypeError: _wrap_applied_output() missing 1 required positional argument: 'values'
    
    

    For me, the error is not clear enough (I can't tell what's happening)

    However, when I run it with a simple pandas apply :

    uniqueID           
    HT2_10354935    0      297.0
                    1      297.0
                    2      296.0
                    3      296.0
                    4      295.0
                           ...  
    NM9_10_9568952  502      NaN
                    503      NaN
                    504      NaN
                    505      NaN
                    506      NaN
    Length: 1028437, dtype: float64
    

    I'm running all of this in a jupyter notebook

    ipykernel 5.3.4 ipython 7.22.0 ipython-genutils 0.2.0 notebook 6.4.0 jupyter 1.0.0 jupyter-client 6.1.12 jupyter-console 6.4.0 jupyter-core 4.7.1 jupyter-dash 0.4.0 jupyterlab-pygments 0.1.2 jupyterlab-widgets 1.0.0

    I was wondering if someone could explain me what's hapenning, and how to fix it if the error is mine. Because it works out of the box with a simple pandas apply, I suppose that there is a small problem in pandarallel

    NB: Note also that this code leaves unkilled processes even after I interrupted or restarted the ipython kernel EDIT: Would it be linked to the fact that I'm using a lambda function ?

    opened by GDelevoye 18
  • Add `parallel_apply` for `Resampler` class

    Add `parallel_apply` for `Resampler` class

    I implemented parallel_apply for the Resampler class to have some important time series functionality. For now it is still using the default _chunk method, but it can lead to some processes terminating much quicker than others i.e. if the time series gets denser over time. A potential upgrade would be to random sample the contents of the chunks, so each chunk gets a similar distribution of workloads.

    P.S.: I noticed that 30/188 of the tests fail due to a ZeroDivisionError. This is unrelated to my pull request, but an important issue.

    opened by alvail 17
  • Connection to IPC socket failed for pathname

    Connection to IPC socket failed for pathname

    Hello, ask, pandarallel.initialize () appears warning how is it? Thank you: WARNING: Logging before InitGoogleLogging() is written to STDERR E0812 19:11:57.484051 2409853824 http://io.cc:168] Connection to IPC socket failed for pathname /var/folders/sp/vz74h1tx3jlb3jqrq__bjwh00000gp/T/pandarallel-32ts0h6r/plasma_sock, retrying 20 more times please help me,thank

    opened by lsircc 10
  • ZeroDivisionError: float division by zero

    ZeroDivisionError: float division by zero

    General

    • Operating System: windows 11
    • Python version: 3.8.3
    • Pandas version: 1.4.2
    • Pandarallel version: 1.6.1

    Acknowledgement

    • [ ] My issue is NOT present when using pandas without alone (without pandarallel)
    • [ ] If I am on Windows, I read the Troubleshooting page before writing a new bug report

    Bug description

    WARNING: You are on Windows. If you detect any issue with pandarallel, be sure you checked out the Troubleshooting page: 0.99% | 46 / 4628 | 0.00% | 0 / 4627 | multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\multiprocessing\pool.py", line 125, in worker result = (True, func(*args, **kwds)) File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\multiprocessing\pool.py", line 51, in starmapstar return list(itertools.starmap(args[0], args[1])) File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandarallel\core.py", line 158, in call results = self.work_function( File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandarallel\data_types\series.py", line 26, in work return data.apply( File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandas\core\series.py", line 4433, in apply return SeriesApply(self, func, convert_dtype, args, kwargs).apply() File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandas\core\apply.py", line 1082, in apply return self.apply_standard() File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandas\core\apply.py", line 1137, in apply_standard mapped = lib.map_infer( File "pandas_libs\lib.pyx", line 2870, in pandas._libs.lib.map_infer File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandarallel\progress_bars.py", line 206, in closure state.next_put_iteration += max(int((delta_i / delta_t) * 0.25), 1) ZeroDivisionError: float division by zero

    Observed behavior

    Write here the observed behavior

    Expected behavior

    Write here the expected behavior

    Minimal but working code sample to ease bug fix for pandarallel team

    Write here the minimal code sample to ease bug fix for pandarallel team

    bug 
    opened by heya5 8
  • OverflowError

    OverflowError

    Using python 3.8, I am getting an OverflowError running apply_parallel:

    OverflowError                             Traceback (most recent call last)
    <ipython-input-5-a78fd5119887> in <module>
         37     grouped_df = df.groupby("id")
         38 
    ---> 39     grouped_df.parallel_apply(lookahead) \
         40         .to_parquet(output_location_look_ahead, compression='snappy', engine='pyarrow')
         41 
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/pandarallel.py in closure(data, func, *args, **kwargs)
        429         queue = manager.Queue()
        430 
    --> 431         workers_args, chunk_lengths, input_files, output_files = get_workers_args(
        432             use_memory_fs,
        433             nb_requested_workers,
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/pandarallel.py in get_workers_args(use_memory_fs, nb_workers, progress_bar, chunks, worker_meta_args, queue, func, args, kwargs)
        284             raise OSError(msg)
        285 
    --> 286         workers_args = [
        287             (
        288                 input_file.name,
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/pandarallel.py in <listcomp>(.0)
        293                 progress_bar == PROGRESS_IN_WORKER,
        294                 dill.dumps(
    --> 295                     progress_wrapper(
        296                         progress_bar >= PROGRESS_IN_FUNC, queue, index, chunk_length
        297                     )(func)
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/pandarallel.py in wrapper(func)
        203     def wrapper(func):
        204         if progress_bar:
    --> 205             wrapped_func = inline(
        206                 progress_pre_func,
        207                 func,
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in wrapper(*args, **kwargs)
         32             raise SystemError("Python version should be 3.{5, 6, 7, 8}")
         33 
    ---> 34         return function(*args, **kwargs)
         35 
         36     return wrapper
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in inline(pre_func, func, pre_func_arguments)
        485 
        486     func_instructions = tuple(get_instructions(func))
    --> 487     shifted_func_instructions = shift_instructions(
        488         func_instructions, len(b"".join(pinned_pre_func_instructions_without_return))
        489     )
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in wrapper(*args, **kwargs)
         32             raise SystemError("Python version should be 3.{5, 6, 7, 8}")
         33 
    ---> 34         return function(*args, **kwargs)
         35 
         36     return wrapper
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in shift_instructions(instructions, qty)
        301     If Python version not in 3.{5, 6, 7}, a SystemError is raised.
        302     """
    --> 303     return tuple(
        304         shift_instruction(instruction, qty)
        305         if bytes((instruction[0],))
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in <genexpr>(.0)
        302     """
        303     return tuple(
    --> 304         shift_instruction(instruction, qty)
        305         if bytes((instruction[0],))
        306         in (
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in wrapper(*args, **kwargs)
         32             raise SystemError("Python version should be 3.{5, 6, 7, 8}")
         33 
    ---> 34         return function(*args, **kwargs)
         35 
         36     return wrapper
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in shift_instruction(instruction, qty)
        291     """
        292     operation, *values = instruction
    --> 293     return bytes((operation,)) + int2python_bytes(python_ints2int(values) + qty)
        294 
        295 
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in wrapper(*args, **kwargs)
         32             raise SystemError("Python version should be 3.{5, 6, 7, 8}")
         33 
    ---> 34         return function(*args, **kwargs)
         35 
         36     return wrapper
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in int2python_bytes(item)
         69 
         70     nb_bytes = 2 if python_version.minor == 5 else 1
    ---> 71     return int.to_bytes(item, nb_bytes, "little")
         72 
         73 
    
    OverflowError: int too big to convert
    
    opened by lfdversluis 7
  • TypeError: 'generator' object is not subscriptable error in colab - works in VScode

    TypeError: 'generator' object is not subscriptable error in colab - works in VScode

    General

    • Operating System: OSX
    • Python version: 3.7.13
    • Pandas version: 1.3.5
    • Pandarallel version: 1.6.2

    Acknowledgement

    • Issue happens on Colab only. When I use VScode, the problem does not happen

    Bug description

    I get the error:

    100.00%
    1 / 1
    100.00%
    1 / 1
    100.00%
    1 / 1
    100.00%
    1 / 1
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    [<ipython-input-16-b9ca1a6007f2>](https://localhost:8080/#) in <module>()
          1 data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'], 'Age': [20, 21, 19, 18]}
          2 df = pd.DataFrame(data)
    ----> 3 df['HalfAge'] = df.parallel_apply(lambda r: r.Age/2,axis=1)
    
    2 frames
    [/usr/local/lib/python3.7/dist-packages/pandarallel/core.py](https://localhost:8080/#) in closure(data, user_defined_function, *user_defined_function_args, **user_defined_function_kwargs)
        324             return wrapped_reduce_function(
        325                 (Path(output_file.name) for output_file in output_files),
    --> 326                 reduce_extra,
        327             )
        328 
    
    [/usr/local/lib/python3.7/dist-packages/pandarallel/core.py](https://localhost:8080/#) in closure(output_file_paths, extra)
        197         )
        198 
    --> 199         return reduce_function(dfs, extra)
        200 
        201     return closure
    
    [/usr/local/lib/python3.7/dist-packages/pandarallel/data_types/dataframe.py](https://localhost:8080/#) in reduce(datas, extra)
         45             datas: Iterable[pd.DataFrame], extra: Dict[str, Any]
         46         ) -> pd.DataFrame:
    ---> 47             axis = 0 if isinstance(datas[0], pd.Series) else 1 - extra["axis"]
         48             return pd.concat(datas, copy=False, axis=axis)
         49 
    
    TypeError: 'generator' object is not subscriptable
    

    Minimal but working code sample to ease bug fix for pandarallel team

    !pip install pandarallel
    from pandarallel import pandarallel
    pandarallel.initialize(progress_bar=True)
    
    import pandas as pd
    
    data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'], 'Age': [20, 21, 19, 18]}
    df = pd.DataFrame(data)
    df['HalfAge'] = df.parallel_apply(lambda r: r.Age/2,axis=1)
    
    opened by agiveon 6
  • Library not working....

    Library not working....

    Hello everyone!

    While trying to run the example cases and I'm receiving the following error:

    Code

    import pandas as pd
    import pandarallel
    pandarallel.initialize()
    
    def func(x):
        return math.sin(x.a**2) + math.sin(x.b**2)
    
    if __name__ == '__main__':
        df_size = int(5e6)
        df = pd.DataFrame(dict(a=np.random.randint(1, 8, df_size),
                               b=np.random.rand(df_size)))
        res_parallel = df.parallel_apply(func, axis=1, progress_bar=True)
    

    Error

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <timed exec> in <module>
    
    ~\.conda\envs\ingestion-env\lib\site-packages\pandarallel\pandarallel.py in closure(data, func, *args, **kwargs)
        434         try:
        435             pool = Pool(
    --> 436                 nb_workers, worker_init, (prepare_worker(use_memory_fs)(worker),)
        437             )
        438 
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\context.py in Pool(self, processes, initializer, initargs, maxtasksperchild)
        117         from .pool import Pool
        118         return Pool(processes, initializer, initargs, maxtasksperchild,
    --> 119                     context=self.get_context())
        120 
        121     def RawValue(self, typecode_or_type, *args):
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\pool.py in __init__(self, processes, initializer, initargs, maxtasksperchild, context)
        174         self._processes = processes
        175         self._pool = []
    --> 176         self._repopulate_pool()
        177 
        178         self._worker_handler = threading.Thread(
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\pool.py in _repopulate_pool(self)
        239             w.name = w.name.replace('Process', 'PoolWorker')
        240             w.daemon = True
    --> 241             w.start()
        242             util.debug('added worker')
        243 
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\process.py in start(self)
        110                'daemonic processes are not allowed to have children'
        111         _cleanup()
    --> 112         self._popen = self._Popen(self)
        113         self._sentinel = self._popen.sentinel
        114         # Avoid a refcycle if the target function holds an indirect
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\context.py in _Popen(process_obj)
        320         def _Popen(process_obj):
        321             from .popen_spawn_win32 import Popen
    --> 322             return Popen(process_obj)
        323 
        324     class SpawnContext(BaseContext):
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\popen_spawn_win32.py in __init__(self, process_obj)
         87             try:
         88                 reduction.dump(prep_data, to_child)
    ---> 89                 reduction.dump(process_obj, to_child)
         90             finally:
         91                 set_spawning_popen(None)
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
         58 def dump(obj, file, protocol=None):
         59     '''Replacement for pickle.dump() using ForkingPickler.'''
    ---> 60     ForkingPickler(file, protocol).dump(obj)
         61 
         62 #
    
    AttributeError: Can't pickle local object 'prepare_worker.<locals>.closure.<locals>.wrapper'
    

    I'm working on a Jupyter Lab notebook with a conda environment: What can I do?

    opened by murilobellatini 6
  • IndexError when there are fewer DataFrame rows than workers

    IndexError when there are fewer DataFrame rows than workers

    When the number of rows is below the number of workers an IndexError is raised. Minimal example:

    Code

    import time
    import pandas as pd
    from pandarallel import pandarallel
    
    pandarallel.initialize(progress_bar=True)
    
    df = pd.DataFrame({'x':[1,2]})
    df.parallel_apply(lambda row: print('A'), time.sleep(2), print('B'), axis=1)
    

    Output

    INFO: Pandarallel will run on 6 workers.
    INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
    B
       0.00%                                          |        0 /        1 |                                                                                                                    
       0.00%                                          |        0 /        1 |                                                                                                                    Traceback (most recent call last):
      File "foo.py", line 8, in <module>
        df.parallel_apply(lambda row: print('A'), time.sleep(2), print('B'), axis=1)
      File "$VIRTUAL_ENV/lib/python3.7/site-packages/pandarallel/pandarallel.py", line 446, in closure
        map_result,
      File "$VIRTUAL_ENV/lib/python3.7/site-packages/pandarallel/pandarallel.py", line 382, in get_workers_result
        progress_bars.update(progresses)
      File "$VIRTUAL_ENV/lib/python3.7/site-packages/pandarallel/utils/progress_bars.py", line 82, in update
        self.__bars[index][0] = value
    IndexError: list index out of range
    

    I'm using python version 3.7.4 with pandas 0.25.3 and pandarallel 1.4.4.

    bug 
    opened by elemakil 6
  • __main__ imposition in code

    __main__ imposition in code

    Make sure if a novice tries to access the module, he should be aware of encapsulating code within main execution context, pretty much similar to that of requirements of multiprocessing in general. This will reduce wasting of time and delving into multiporcessing more.

    opened by karunakar2 4
  • how to use it in fastapi

    how to use it in fastapi

    I have a server like:

    from fastapi import FastAPI
    from pandarallel import pandarallel
    pandarallel.initialize()
    app = FastAPI()
    
    @app.post("/area_quant")
    def create_item(event_data: Areadata):
        data = pd.DataFrame(event_data.data)
        data['type_score'] = data['EVENT_TYPE'].applymap(Config.type_map)
    
    if __name__ == '__main__':
        uvicorn.run(app)
    

    as a server,it can only run one time,then it will shutdown,how can i use it in a server?

    opened by lim-0 0
  • 'functools.partial' object has no attribute '__code__' in Jupyter Notebooks

    'functools.partial' object has no attribute '__code__' in Jupyter Notebooks

    General

    • Operating System: Ubuntu 20.04
    • Python version: 3.9.13
    • Pandas version: 1.4.4
    • Pandarallel version: 1.6.3

    Acknowledgement

    • [x] My issue is NOT present when using pandas without alone (without pandarallel)
    • [x] If I am on Windows, I read the Troubleshooting page before writing a new bug report

    Bug description

    functool.partial cannot be used in junction with jupyter notebooks and pandarallel

    Observed behavior

    Write here the observed behavior

    Expected behavior

    Write here the expected behavior

    Minimal but working code sample to ease bug fix for pandarallel team

    import pandas as pd
    from functools import partial
    from typing import List
    import numpy as np
    from pandarallel import pandarallel
    
    pandarallel.initialize(progress_bar=True)
    
    def processing_fn(row: pd.Series, invalid_numbers: List[int], default: int) -> pd.Series:
        cond = row.isin(invalid_numbers)
        row[cond] = default
        return row
    
    data = pd.DataFrame(np.random.randint(low=-10, high=10, size=(10000, 5)))
    print("Before", (data.values == 100).sum())
    
    fn = partial(processing_fn, invalid_numbers=[-5, 2, 5], default=100)
    new_data = data.apply(fn, axis=1)
    
    print("After serial", (new_data.values == 100).sum())
    
    data = data.parallel_apply(fn, axis=1)
    print("After parallel", (data.values == 100).sum())
    
    

    Works fine in a standalone script, but fails if ran in Jupyter notebook

    opened by Meehai 0
  • Choose which type of progress bar you want in a notebook

    Choose which type of progress bar you want in a notebook

    Sometimes one doesn't always have control of the environment of the Jupyter instance where one's working (think Jupyterhub) and can't install the necessary extensions for progress bars. In this case it might be nice to have the option to manually request the simple progress bar so that it doesn't just display a widget error.

    Is there already a way to do so that I've missed? Otherwise, if it's something you'd consider adding, I'd also be happy to try and draft a PR.

    opened by astrojarred 3
  • TypeError: cannot pickle 'sqlite3.Connection' object in pyCharm

    TypeError: cannot pickle 'sqlite3.Connection' object in pyCharm

    General

    • Operating System: Ubuntu 22.04
    • Python version: 3.9.7
    • Pandas version: 1.4.2
    • Pandarallel version: 1.6.3

    Acknowledgement

    • [x] My issue is NOT present when using pandas without alone (without pandarallel)
    • [ ] If I am on Windows, I read the Troubleshooting page before writing a new bug report

    Bug description

    I observe this when running parallel_apply in pyCharm. #76 sounds similar to me, but none of the tricks suggested there work for me. At this point I am also not sure if it's more of an issue with pyCharm.

    Observed behavior

    Traceback (most recent call last):
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3444, in run_code
        exec(code_obj, self.user_global_ns, self.user_ns)
      File "<ipython-input-39-d230d86ff5ef>", line 1, in <module>
        iris.groupby("species").parallel_apply(lambda x: np.mean(x))
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/pandarallel/core.py", line 265, in closure
        dilled_user_defined_function = dill.dumps(user_defined_function)
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 364, in dumps
        dump(obj, file, protocol, byref, fmode, recurse, **kwds)#, strictio)
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 336, in dump
        Pickler(file, protocol, **_kwds).dump(obj)
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 620, in dump
        StockPickler.dump(self, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 487, in dump
        self.save(obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1963, in save_function
        _save_with_postproc(pickler, (_create_function, (
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1154, in _save_with_postproc
        pickler._batch_setitems(iter(source.items()))
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 603, in save
        self.save_reduce(obj=obj, *rv)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 692, in save_reduce
        save(args)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 886, in save_tuple
        save(element)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 603, in save
        self.save_reduce(obj=obj, *rv)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 717, in save_reduce
        save(state)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1251, in save_module_dict
        StockPickler.save_dict(pickler, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 971, in save_dict
        self._batch_setitems(obj.items())
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1251, in save_module_dict
        StockPickler.save_dict(pickler, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 971, in save_dict
        self._batch_setitems(obj.items())
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 603, in save
        self.save_reduce(obj=obj, *rv)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 717, in save_reduce
        save(state)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1251, in save_module_dict
        StockPickler.save_dict(pickler, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 971, in save_dict
        self._batch_setitems(obj.items())
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1251, in save_module_dict
        StockPickler.save_dict(pickler, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 971, in save_dict
        self._batch_setitems(obj.items())
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 578, in save
        rv = reduce(self.proto)
    TypeError: cannot pickle 'sqlite3.Connection' object
    

    Minimal but working code sample to ease bug fix for pandarallel team

    import pandas as pd
    import numpy as np
    from pandarallel import pandarallel
    pandarallel.initialize(use_memory_fs=False)
    iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
    iris.groupby("species").parallel_apply(lambda x: np.mean(x))
    
    opened by wiessall 1
Releases(v1.6.3)
Owner
Manu NALEPA
Data Scientist / Data Engineer @ Clustree — Sousaphone & saxophone player
Manu NALEPA
Weather Image Recognition - Python weather application using series of data

Weather Image Recognition - Python weather application using series of data

Kushal Shingote 1 Feb 04, 2022
Python package to transfer data in a fast, reliable, and packetized form.

pySerialTransfer Python package to transfer data in a fast, reliable, and packetized form.

PB2 101 Dec 07, 2022
Analytical view of olist e-commerce in Brazil

Analysis of E-Commerce Public Dataset by Olist The objective of this project is to propose an analytical view of olist e-commerce in Brazil. For this

Gurpreet Singh 1 Jan 11, 2022
This is an example of how to automate Ridit Analysis for a dataset with large amount of questions and many item attributes

This is an example of how to automate Ridit Analysis for a dataset with large amount of questions and many item attributes

Ishan Hegde 1 Nov 17, 2021
Desafio 1 ~ Bantotal

Challenge 01 | Bantotal Please read the instructions for the challenge by selecting your preferred language below: Español Português License Copyright

Maratona Behind the Code 44 Sep 28, 2022
Toolchest provides APIs for scientific and bioinformatic data analysis.

Toolchest Python Client Toolchest provides APIs for scientific and bioinformatic data analysis. It allows you to abstract away the costliness of runni

Toolchest 11 Jun 30, 2022
Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis. You write a high level configuration file specifying your in

Blue Collar Bioinformatics 917 Jan 03, 2023
First steps with Python in Life Sciences

First steps with Python in Life Sciences This course material is part of the "First Steps with Python in Life Science" three-day course of SIB-trainin

SIB Swiss Institute of Bioinformatics 22 Jan 08, 2023
Code for the DH project "Dhimmis & Muslims – Analysing Multireligious Spaces in the Medieval Muslim World"

Damast This repository contains code developed for the digital humanities project "Dhimmis & Muslims – Analysing Multireligious Spaces in the Medieval

University of Stuttgart Visualization Research Center 2 Jul 01, 2022
Fitting thermodynamic models with pycalphad

ESPEI ESPEI, or Extensible Self-optimizing Phase Equilibria Infrastructure, is a tool for thermodynamic database development within the CALPHAD method

Phases Research Lab 42 Sep 12, 2022
Full ELT process on GCP environment.

Rent Houses Germany - GCP Pipeline Project: The goal of the project is to extract data about house rentals in Germany, store, process and analyze it u

Felipe Demenech Vasconcelos 2 Jan 20, 2022
A program that uses an API and a AI model to get info of sotcks

Stock-Market-AI-Analysis I dont mind anyone using this code but please give me credit A program that uses an API and a AI model to get info of stocks

1 Dec 17, 2021
2019 Data Science Bowl

Kaggle-2019-Data-Science-Bowl-Solution - Here i present my solution to kaggle 2019 data science bowl and how i improved it to win a silver medal in that competition.

Deepak Nandwani 1 Jan 01, 2022
Exploratory Data Analysis for Employee Retention Dataset

Exploratory Data Analysis for Employee Retention Dataset Employee turn-over is a very costly problem for companies. The cost of replacing an employee

kana sudheer reddy 2 Oct 01, 2021
Improving your data science workflows with

Make Better Defaults Author: Kjell Wooding [email protected] This is the git re

Kjell Wooding 18 Dec 23, 2022
💬 Python scripts to parse Messenger, Hangouts, WhatsApp and Telegram chat logs into DataFrames.

Chatistics Python 3 scripts to convert chat logs from various messaging platforms into Pandas DataFrames. Can also generate histograms and word clouds

Florian 893 Jan 02, 2023
Binance Kline Data With Python

Binance Kline Data by seunghan(gingerthorp) reference https://github.com/binance/binance-public-data/ All intervals are supported: 1m, 3m, 5m, 15m, 30

shquant 5 Jul 13, 2022
InDels analysis of CRISPR lines by NGS amplicon sequencing technology for a multicopy gene family.

CRISPRanalysis InDels analysis of CRISPR lines by NGS amplicon sequencing technology for a multicopy gene family. In this work, we present a workflow

2 Jan 31, 2022
A CLI tool to reduce the friction between data scientists by reducing git conflicts removing notebook metadata and gracefully resolving git conflicts.

databooks is a package for reducing the friction data scientists while using Jupyter notebooks, by reducing the number of git conflicts between different notebooks and assisting in the resolution of

dataroots 86 Dec 25, 2022
EOD Historical Data Python Library (Unofficial)

EOD Historical Data Python Library (Unofficial) https://eodhistoricaldata.com Installation python3 -m pip install eodhistoricaldata Note Demo API key

Michael Whittle 20 Dec 22, 2022