A simple and efficient tool to parallelize Pandas operations on all available CPUs

Overview

Pandaral·lel


(Animated demos: without Pandarallel vs. with Pandarallel)

Installation

$ pip install pandarallel [--upgrade] [--user]

Requirements

On Windows, Pandaral·lel works only if the Python session (python, ipython, jupyter notebook, jupyter lab, ...) is executed from Windows Subsystem for Linux (WSL).

On Linux & macOS, nothing special has to be done.

Warning

  • Parallelization has a cost (instantiating new processes, sending data via shared memory, ...), so parallelization is efficient only if the amount of computation to parallelize is high enough. For very small amounts of data, parallelization is not always worth it. A quick timing comparison (sketched below) can tell you whether it pays off for your data.
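
An illustrative sketch of such a comparison (the data and function are made up; timings will vary with your machine and workload):

import math
import time

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize()

def func(x):
    return math.sin(x ** 2)

df = pd.DataFrame({"a": range(1_000_000)})

start = time.time()
df["a"].apply(func)
print("sequential:", round(time.time() - start, 2), "s")

start = time.time()
df["a"].parallel_apply(func)
print("parallel:  ", round(time.time() - start, 2), "s")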

Examples

An example of each API is available here.

Benchmark

For some examples, here is the comparative benchmark with and without using Pandaral·lel.

Computer used for this benchmark:

  • OS: Linux Ubuntu 16.04
  • Hardware: Intel Core i7 @ 3.40 GHz - 4 cores


For these examples, parallel operations run approximately 4x faster than the standard operations (except for series.map, which runs only about 3.2x faster).

API

First, you have to import pandarallel:

from pandarallel import pandarallel

Then, you have to initialize it.

pandarallel.initialize()

This method takes 5 optional parameters:

  • shm_size_mb: Deprecated.
  • nb_workers: Number of workers used for parallelization. (int) If not set, all available CPUs will be used.
  • progress_bar: Display progress bars if set to True. (bool, False by default)
  • verbose: The verbosity level (int, 2 by default)
    • 0 - don't display any logs
    • 1 - display only warning logs
    • 2 - display all logs
  • use_memory_fs: (bool, None by default)
    • If set to None and if memory file system is available, Pandarallel will use it to transfer data between the main process and workers. If memory file system is not available, Pandarallel will default to multiprocessing data transfer (pipe).
    • If set to True, Pandarallel will use memory file system to transfer data between the main process and workers and will raise a SystemError if memory file system is not available.
    • If set to False, Pandarallel will use multiprocessing data transfer (pipe) to transfer data between the main process and workers.

Using memory file system reduces data transfer time between the main process and workers, especially for big data.

Memory file system is considered as available only if the directory /dev/shm exists and if the user has read and write rights on it.

In practice, the memory file system is only available on some Linux distributions (including Ubuntu).
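
For instance, here is a minimal initialization sketch using the parameters listed above (the values are illustrative, not recommendations):

from pandarallel import pandarallel

pandarallel.initialize(
    nb_workers=4,         # use 4 workers instead of all available CPUs
    progress_bar=True,    # display one progress bar per worker
    verbose=1,            # show warning logs only
    use_memory_fs=False,  # force data transfer through multiprocessing pipes
)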

With df a pandas DataFrame, series a pandas Series, func a function to apply/map, args, args1, args2 some arguments, and col_name a column name:

Without parallelization → With parallelization
df.apply(func) → df.parallel_apply(func)
df.applymap(func) → df.parallel_applymap(func)
df.groupby(args).apply(func) → df.groupby(args).parallel_apply(func)
df.groupby(args1).col_name.rolling(args2).apply(func) → df.groupby(args1).col_name.rolling(args2).parallel_apply(func)
df.groupby(args1).col_name.expanding(args2).apply(func) → df.groupby(args1).col_name.expanding(args2).parallel_apply(func)
series.map(func) → series.parallel_map(func)
series.apply(func) → series.parallel_apply(func)
series.rolling(args).apply(func) → series.rolling(args).parallel_apply(func)

You will find a complete example here for each row in this table.
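
As an illustration (the data and function below are made up; only the apply/parallel_apply calls come from the table above):

import math

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize()

df = pd.DataFrame({"a": range(10_000), "b": range(10_000)})

def func(row):
    # some CPU-bound work on a single row
    return math.sin(row.a ** 2) + math.sin(row.b ** 2)

res_sequential = df.apply(func, axis=1)          # standard pandas
res_parallel = df.parallel_apply(func, axis=1)   # same computation, spread over all workers

assert res_sequential.equals(res_parallel)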

Troubleshooting

I have 8 CPUs but parallel_apply speeds up computation only about x4. Why?

Pandarallel can only speed up computation up to roughly the number of physical cores your computer has. Most recent CPUs (like the Intel Core i7) use hyperthreading: for example, a 4-core hyperthreaded CPU will show 8 CPUs to the operating system, but really has only 4 physical computation units.

On Ubuntu, you can get the number of cores with $ grep -m 1 'cpu cores' /proc/cpuinfo.
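
If you want to pin the worker count to the number of physical cores yourself, one possible approach (an assumption on my part; psutil is not required by pandarallel) is:

import psutil
from pandarallel import pandarallel

physical_cores = psutil.cpu_count(logical=False)  # e.g. 4 on a 4-core / 8-thread CPU
pandarallel.initialize(nb_workers=physical_cores)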


I use Jupyter Lab and instead of progress bars, I see this kind of output:
VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=625000), Label(value='0 / 625000')…

Run the following 3 lines, and you should be able to see the progress bars:

$ pip install ipywidgets
$ jupyter nbextension enable --py widgetsnbextension
$ jupyter labextension install @jupyter-widgets/jupyterlab-manager

(You may also have to install nodejs if asked)

Comments
  • Maximum size exceeded


    Hi, I set pandarallel.initialize(shm_size_mb=10000) and after applying parallel_apply to my column I get the error Maximum size exceeded (2GB).

    Why do I get this message when I set more than 2 GB?

    bug 
    opened by vvssttkk 23
  • Setting progress_bar=True freezes execution for parallel_apply before reaching 1% completion on all CPU's


    When progress_bar=True, I noticed that the execution of my parallel_apply task stopped right before all parallel processes reached 1% progress mark. Here are some further details of what I was encountering -

    • I turned on logging with DEBUG messages, but no messages were displayed when the execution stopped. There were no error messages either. The dataframe rows simply stopped processing further and the process seemed to be frozen.
    • I have two CPU's. It seems that the progress bar only updates in 1% increments. One of the progress bars reaches 1% mark, but when the number of processed rows reaches the 2% mark (which I assume is associated with the second progress bar updating to 1% as well), that's when the process froze.
    • The process runs fine with progress_bar=False.
    opened by abhineetgupta 22
  • pandarallel_apply crashes with OverflowError: int too big to convert


    Hi everyone,

    I am getting this error here using parallel_apply in pandas:

      File "extract_specifications.py", line 156, in <module>
        extracted_data = df.parallel_apply(extract_raw_infos, axis=1)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 367, in closure
        kwargs,
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 239, in get_workers_args
        zip(input_files, output_files, chunk_lengths)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 238, in <listcomp>
        for index, (input_file, output_file, chunk_length) in enumerate(
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 169, in wrapper
        time=time,
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
        return function(*args, **kwargs)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 464, in inline
        func_instructions, len(b"".join(pinned_pre_func_instructions_without_return))
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
        return function(*args, **kwargs)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 314, in shift_instructions
        for instruction in instructions
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 314, in <genexpr>
        for instruction in instructions
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
        return function(*args, **kwargs)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 293, in shift_instruction
        return bytes((operation,)) + int2python_bytes(python_ints2int(values) + qty)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
        return function(*args, **kwargs)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 71, in int2python_bytes
        return int.to_bytes(item, nb_bytes, "little")
    OverflowError: int too big to convert
    

    I am using

    pandarallel == 1.4.2
    pandas == 0.24.2
    python == 3.6.9
    

    Any idea how to proceed from here? I have basically no idea what could cause this bug. I suspect it might be related to the size of the data I have in one column (I store HTML from web pages there), but otherwise no idea. I would help remove this bug if I had some guidance here. Thanks for helping.

    opened by yeus 21
  • Fails with "_wrap_applied_output() missing 1 required positional argument" where a simple pandas apply succeeds

    Hello,

    I'm using Python 3.8.10 (Anaconda distribution, GCC 7.5.10) on Ubuntu 20 LTS, 64-bit x86.

    From my pip freeze:

    pandarallel 1.5.2
    pandas 1.3.0
    numpy 1.20.3

    I'm working with a dataFrame that looks like this one:

    HoleID scaffold tpl strand base score tMean tErr modelPrediction ipdRatio coverage isboundary identificationQv context experiment isbegin_bondary isend_boundary isin_IES uniqueID No_known_IES_retention_this_CCS detailed_classif
    1025444 70189477 scaffold_024_with_IES 688203 0 T 2 0.517 0.190 0.555 0.931 11 True NaN TTAAATAGAAATTAAAATCAGCTGC NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
    1025446 70189477 scaffold_024_with_IES 688204 0 A 4 1.347 0.367 1.251 1.077 13 True NaN TAAATAGAAATTAAAATCAGCTGCT NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
    1025448 70189477 scaffold_024_with_IES 688205 0 A 5 1.913 0.779 1.464 1.307 16 True NaN AAATAGAAATTAAAATCAGCTGCTT NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
    1025450 70189477 scaffold_024_with_IES 688206 0 A 4 1.535 0.712 1.328 1.156 18 True NaN AATAGAAATTAAAATCAGCTGCTTA NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
    1025452 70189477 scaffold_024_with_IES 688207 0 A 5 1.655 0.565 1.391 1.190 18 True NaN ATAGAAATTAAAATCAGCTGCTTAA NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES

    I defined the following function

    def get_distance_from_nearest_criteria(df,criteria):
        begins = df[df[criteria]].copy()
        
        if len(begins) == 0:
            return pd.Series([np.nan for x in range(len(df))])
        else:
            list_return = []
    
            for idx, nt in df.iterrows():
                distances = [abs(nt["tpl"] - x) for x in begins["tpl"]]
                mindistance = min(distances,default=np.nan)
                list_return.append(mindistance)
    
            return pd.Series(list_return)
    

    Then using :

    from pandarallel import pandarallel
    pandarallel.initialize(progress_bar=False, nb_workers=12)
    out = df.groupby(["uniqueID"]).parallel_apply(lambda x: get_distance_from_nearest_criteria(x,'isbegin_bondary'))
    

    leads to :

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-49-02fc7c0589e3> in <module>
    ----> 1 out = df.groupby(["uniqueID"]).parallel_apply(lambda x: get_distance_from_nearest_criteria(x,'isbegin_bondary'))
    
    ~/conda3/envs/ies/lib/python3.8/site-packages/pandarallel/pandarallel.py in closure(data, func, *args, **kwargs)
        463             )
        464 
    --> 465             return reduce(results, reduce_meta_args)
        466 
        467         finally:
    
    ~/conda3/envs/ies/lib/python3.8/site-packages/pandarallel/data_types/dataframe_groupby.py in reduce(results, df_grouped)
         14         keys, values, mutated = zip(*results)
         15         mutated = any(mutated)
    ---> 16         return df_grouped._wrap_applied_output(
         17             keys, values, not_indexed_same=df_grouped.mutated or mutated
         18         )
    
    TypeError: _wrap_applied_output() missing 1 required positional argument: 'values'
    
    

    For me, the error is not clear enough (I can't tell what's happening)

    However, when I run it with a simple pandas apply, it works:

    uniqueID           
    HT2_10354935    0      297.0
                    1      297.0
                    2      296.0
                    3      296.0
                    4      295.0
                           ...  
    NM9_10_9568952  502      NaN
                    503      NaN
                    504      NaN
                    505      NaN
                    506      NaN
    Length: 1028437, dtype: float64
    

    I'm running all of this in a jupyter notebook

    ipykernel 5.3.4 ipython 7.22.0 ipython-genutils 0.2.0 notebook 6.4.0 jupyter 1.0.0 jupyter-client 6.1.12 jupyter-console 6.4.0 jupyter-core 4.7.1 jupyter-dash 0.4.0 jupyterlab-pygments 0.1.2 jupyterlab-widgets 1.0.0

    I was wondering if someone could explain to me what's happening, and how to fix it if the error is mine. Because it works out of the box with a simple pandas apply, I suppose there is a small problem in pandarallel.

    NB: Note also that this code leaves unkilled processes even after I interrupt or restart the IPython kernel. EDIT: Could it be linked to the fact that I'm using a lambda function?

    opened by GDelevoye 18
  • Add `parallel_apply` for `Resampler` class


    I implemented parallel_apply for the Resampler class to have some important time series functionality. For now it is still using the default _chunk method, but that can lead to some processes terminating much more quickly than others, e.g. if the time series gets denser over time. A potential upgrade would be to randomly sample the contents of the chunks, so each chunk gets a similar distribution of workloads.

    P.S.: I noticed that 30/188 of the tests fail due to a ZeroDivisionError. This is unrelated to my pull request, but an important issue.

    opened by alvail 17
  • Connection to IPC socket failed for pathname


    Hello, may I ask: when I call pandarallel.initialize(), the following warning appears. What does it mean? Thank you: WARNING: Logging before InitGoogleLogging() is written to STDERR E0812 19:11:57.484051 2409853824 io.cc:168] Connection to IPC socket failed for pathname /var/folders/sp/vz74h1tx3jlb3jqrq__bjwh00000gp/T/pandarallel-32ts0h6r/plasma_sock, retrying 20 more times. Please help me, thanks.

    opened by lsircc 10
  • ZeroDivisionError: float division by zero


    General

    • Operating System: windows 11
    • Python version: 3.8.3
    • Pandas version: 1.4.2
    • Pandarallel version: 1.6.1

    Acknowledgement

    • [ ] My issue is NOT present when using pandas alone (without pandarallel)
    • [ ] If I am on Windows, I read the Troubleshooting page before writing a new bug report

    Bug description

    WARNING: You are on Windows. If you detect any issue with pandarallel, be sure you checked out the Troubleshooting page:
    0.99% | 46 / 4628 | 0.00% | 0 / 4627 |
    multiprocessing.pool.RemoteTraceback:
    """
    Traceback (most recent call last):
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\multiprocessing\pool.py", line 125, in worker
        result = (True, func(*args, **kwds))
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\multiprocessing\pool.py", line 51, in starmapstar
        return list(itertools.starmap(args[0], args[1]))
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandarallel\core.py", line 158, in __call__
        results = self.work_function(
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandarallel\data_types\series.py", line 26, in work
        return data.apply(
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandas\core\series.py", line 4433, in apply
        return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandas\core\apply.py", line 1082, in apply
        return self.apply_standard()
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandas\core\apply.py", line 1137, in apply_standard
        mapped = lib.map_infer(
      File "pandas\_libs\lib.pyx", line 2870, in pandas._libs.lib.map_infer
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandarallel\progress_bars.py", line 206, in closure
        state.next_put_iteration += max(int((delta_i / delta_t) * 0.25), 1)
    ZeroDivisionError: float division by zero

    Observed behavior

    Write here the observed behavior

    Expected behavior

    Write here the expected behavior

    Minimal but working code sample to ease bug fix for pandarallel team

    Write here the minimal code sample to ease bug fix for pandarallel team

    bug 
    opened by heya5 8
  • OverflowError


    Using Python 3.8, I am getting an OverflowError running parallel_apply:

    OverflowError                             Traceback (most recent call last)
    <ipython-input-5-a78fd5119887> in <module>
         37     grouped_df = df.groupby("id")
         38 
    ---> 39     grouped_df.parallel_apply(lookahead) \
         40         .to_parquet(output_location_look_ahead, compression='snappy', engine='pyarrow')
         41 
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/pandarallel.py in closure(data, func, *args, **kwargs)
        429         queue = manager.Queue()
        430 
    --> 431         workers_args, chunk_lengths, input_files, output_files = get_workers_args(
        432             use_memory_fs,
        433             nb_requested_workers,
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/pandarallel.py in get_workers_args(use_memory_fs, nb_workers, progress_bar, chunks, worker_meta_args, queue, func, args, kwargs)
        284             raise OSError(msg)
        285 
    --> 286         workers_args = [
        287             (
        288                 input_file.name,
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/pandarallel.py in <listcomp>(.0)
        293                 progress_bar == PROGRESS_IN_WORKER,
        294                 dill.dumps(
    --> 295                     progress_wrapper(
        296                         progress_bar >= PROGRESS_IN_FUNC, queue, index, chunk_length
        297                     )(func)
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/pandarallel.py in wrapper(func)
        203     def wrapper(func):
        204         if progress_bar:
    --> 205             wrapped_func = inline(
        206                 progress_pre_func,
        207                 func,
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in wrapper(*args, **kwargs)
         32             raise SystemError("Python version should be 3.{5, 6, 7, 8}")
         33 
    ---> 34         return function(*args, **kwargs)
         35 
         36     return wrapper
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in inline(pre_func, func, pre_func_arguments)
        485 
        486     func_instructions = tuple(get_instructions(func))
    --> 487     shifted_func_instructions = shift_instructions(
        488         func_instructions, len(b"".join(pinned_pre_func_instructions_without_return))
        489     )
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in wrapper(*args, **kwargs)
         32             raise SystemError("Python version should be 3.{5, 6, 7, 8}")
         33 
    ---> 34         return function(*args, **kwargs)
         35 
         36     return wrapper
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in shift_instructions(instructions, qty)
        301     If Python version not in 3.{5, 6, 7}, a SystemError is raised.
        302     """
    --> 303     return tuple(
        304         shift_instruction(instruction, qty)
        305         if bytes((instruction[0],))
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in <genexpr>(.0)
        302     """
        303     return tuple(
    --> 304         shift_instruction(instruction, qty)
        305         if bytes((instruction[0],))
        306         in (
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in wrapper(*args, **kwargs)
         32             raise SystemError("Python version should be 3.{5, 6, 7, 8}")
         33 
    ---> 34         return function(*args, **kwargs)
         35 
         36     return wrapper
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in shift_instruction(instruction, qty)
        291     """
        292     operation, *values = instruction
    --> 293     return bytes((operation,)) + int2python_bytes(python_ints2int(values) + qty)
        294 
        295 
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in wrapper(*args, **kwargs)
         32             raise SystemError("Python version should be 3.{5, 6, 7, 8}")
         33 
    ---> 34         return function(*args, **kwargs)
         35 
         36     return wrapper
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in int2python_bytes(item)
         69 
         70     nb_bytes = 2 if python_version.minor == 5 else 1
    ---> 71     return int.to_bytes(item, nb_bytes, "little")
         72 
         73 
    
    OverflowError: int too big to convert
    
    opened by lfdversluis 7
  • TypeError: 'generator' object is not subscriptable error in colab - works in VScode


    General

    • Operating System: OSX
    • Python version: 3.7.13
    • Pandas version: 1.3.5
    • Pandarallel version: 1.6.2

    Acknowledgement

    • Issue happens on Colab only. When I use VScode, the problem does not happen

    Bug description

    I get the error:

    100.00%
    1 / 1
    100.00%
    1 / 1
    100.00%
    1 / 1
    100.00%
    1 / 1
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    [<ipython-input-16-b9ca1a6007f2>](https://localhost:8080/#) in <module>()
          1 data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'], 'Age': [20, 21, 19, 18]}
          2 df = pd.DataFrame(data)
    ----> 3 df['HalfAge'] = df.parallel_apply(lambda r: r.Age/2,axis=1)
    
    2 frames
    [/usr/local/lib/python3.7/dist-packages/pandarallel/core.py](https://localhost:8080/#) in closure(data, user_defined_function, *user_defined_function_args, **user_defined_function_kwargs)
        324             return wrapped_reduce_function(
        325                 (Path(output_file.name) for output_file in output_files),
    --> 326                 reduce_extra,
        327             )
        328 
    
    [/usr/local/lib/python3.7/dist-packages/pandarallel/core.py](https://localhost:8080/#) in closure(output_file_paths, extra)
        197         )
        198 
    --> 199         return reduce_function(dfs, extra)
        200 
        201     return closure
    
    [/usr/local/lib/python3.7/dist-packages/pandarallel/data_types/dataframe.py](https://localhost:8080/#) in reduce(datas, extra)
         45             datas: Iterable[pd.DataFrame], extra: Dict[str, Any]
         46         ) -> pd.DataFrame:
    ---> 47             axis = 0 if isinstance(datas[0], pd.Series) else 1 - extra["axis"]
         48             return pd.concat(datas, copy=False, axis=axis)
         49 
    
    TypeError: 'generator' object is not subscriptable
    

    Minimal but working code sample to ease bug fix for pandarallel team

    !pip install pandarallel
    from pandarallel import pandarallel
    pandarallel.initialize(progress_bar=True)
    
    import pandas as pd
    
    data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'], 'Age': [20, 21, 19, 18]}
    df = pd.DataFrame(data)
    df['HalfAge'] = df.parallel_apply(lambda r: r.Age/2,axis=1)
    
    opened by agiveon 6
  • Library not working....


    Hello everyone!

    While trying to run the example cases, I'm receiving the following error:

    Code

    import pandas as pd
    import pandarallel
    pandarallel.initialize()
    
    def func(x):
        return math.sin(x.a**2) + math.sin(x.b**2)
    
    if __name__ == '__main__':
        df_size = int(5e6)
        df = pd.DataFrame(dict(a=np.random.randint(1, 8, df_size),
                               b=np.random.rand(df_size)))
        res_parallel = df.parallel_apply(func, axis=1, progress_bar=True)
    

    Error

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <timed exec> in <module>
    
    ~\.conda\envs\ingestion-env\lib\site-packages\pandarallel\pandarallel.py in closure(data, func, *args, **kwargs)
        434         try:
        435             pool = Pool(
    --> 436                 nb_workers, worker_init, (prepare_worker(use_memory_fs)(worker),)
        437             )
        438 
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\context.py in Pool(self, processes, initializer, initargs, maxtasksperchild)
        117         from .pool import Pool
        118         return Pool(processes, initializer, initargs, maxtasksperchild,
    --> 119                     context=self.get_context())
        120 
        121     def RawValue(self, typecode_or_type, *args):
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\pool.py in __init__(self, processes, initializer, initargs, maxtasksperchild, context)
        174         self._processes = processes
        175         self._pool = []
    --> 176         self._repopulate_pool()
        177 
        178         self._worker_handler = threading.Thread(
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\pool.py in _repopulate_pool(self)
        239             w.name = w.name.replace('Process', 'PoolWorker')
        240             w.daemon = True
    --> 241             w.start()
        242             util.debug('added worker')
        243 
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\process.py in start(self)
        110                'daemonic processes are not allowed to have children'
        111         _cleanup()
    --> 112         self._popen = self._Popen(self)
        113         self._sentinel = self._popen.sentinel
        114         # Avoid a refcycle if the target function holds an indirect
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\context.py in _Popen(process_obj)
        320         def _Popen(process_obj):
        321             from .popen_spawn_win32 import Popen
    --> 322             return Popen(process_obj)
        323 
        324     class SpawnContext(BaseContext):
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\popen_spawn_win32.py in __init__(self, process_obj)
         87             try:
         88                 reduction.dump(prep_data, to_child)
    ---> 89                 reduction.dump(process_obj, to_child)
         90             finally:
         91                 set_spawning_popen(None)
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
         58 def dump(obj, file, protocol=None):
         59     '''Replacement for pickle.dump() using ForkingPickler.'''
    ---> 60     ForkingPickler(file, protocol).dump(obj)
         61 
         62 #
    
    AttributeError: Can't pickle local object 'prepare_worker.<locals>.closure.<locals>.wrapper'
    

    I'm working on a Jupyter Lab notebook with a conda environment: What can I do?

    opened by murilobellatini 6
  • IndexError when there are fewer DataFrame rows than workers


    When the number of rows is below the number of workers an IndexError is raised. Minimal example:

    Code

    import time
    import pandas as pd
    from pandarallel import pandarallel
    
    pandarallel.initialize(progress_bar=True)
    
    df = pd.DataFrame({'x':[1,2]})
    df.parallel_apply(lambda row: print('A'), time.sleep(2), print('B'), axis=1)
    

    Output

    INFO: Pandarallel will run on 6 workers.
    INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
    B
       0.00%                                          |        0 /        1 |                                                                                                                    
       0.00%                                          |        0 /        1 |                                                                                                                    Traceback (most recent call last):
      File "foo.py", line 8, in <module>
        df.parallel_apply(lambda row: print('A'), time.sleep(2), print('B'), axis=1)
      File "$VIRTUAL_ENV/lib/python3.7/site-packages/pandarallel/pandarallel.py", line 446, in closure
        map_result,
      File "$VIRTUAL_ENV/lib/python3.7/site-packages/pandarallel/pandarallel.py", line 382, in get_workers_result
        progress_bars.update(progresses)
      File "$VIRTUAL_ENV/lib/python3.7/site-packages/pandarallel/utils/progress_bars.py", line 82, in update
        self.__bars[index][0] = value
    IndexError: list index out of range
    

    I'm using python version 3.7.4 with pandas 0.25.3 and pandarallel 1.4.4.

    bug 
    opened by elemakil 6
  • __main__ imposition in code


    Make sure that if a novice uses the module, they are aware that the code should be encapsulated within the main execution context, much like the general requirements of multiprocessing. This will reduce wasted time and unnecessary digging into multiprocessing.
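
    A minimal sketch of what this could look like (illustrative only, not taken from the pandarallel docs):

    import pandas as pd
    from pandarallel import pandarallel

    def double(x):
        return x * 2

    if __name__ == "__main__":
        # keep pandarallel usage inside the main execution context,
        # as with multiprocessing in general
        pandarallel.initialize()
        df = pd.DataFrame({"x": range(1_000)})
        print(df["x"].parallel_map(double).head())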

    opened by karunakar2 4
  • how to use it in fastapi


    I have a server like:

    from fastapi import FastAPI
    from pandarallel import pandarallel
    pandarallel.initialize()
    app = FastAPI()
    
    @app.post("/area_quant")
    def create_item(event_data: Areadata):
        data = pd.DataFrame(event_data.data)
        data['type_score'] = data['EVENT_TYPE'].applymap(Config.type_map)
    
    if __name__ == '__main__':
        uvicorn.run(app)
    

    As a server, it can only run one time; then it will shut down. How can I use it in a server?

    opened by lim-0 0
  • 'functools.partial' object has no attribute '__code__' in Jupyter Notebooks


    General

    • Operating System: Ubuntu 20.04
    • Python version: 3.9.13
    • Pandas version: 1.4.4
    • Pandarallel version: 1.6.3

    Acknowledgement

    • [x] My issue is NOT present when using pandas alone (without pandarallel)
    • [x] If I am on Windows, I read the Troubleshooting page before writing a new bug report

    Bug description

    functools.partial cannot be used in conjunction with Jupyter notebooks and pandarallel.

    Observed behavior

    Write here the observed behavior

    Expected behavior

    Write here the expected behavior

    Minimal but working code sample to ease bug fix for pandarallel team

    import pandas as pd
    from functools import partial
    from typing import List
    import numpy as np
    from pandarallel import pandarallel
    
    pandarallel.initialize(progress_bar=True)
    
    def processing_fn(row: pd.Series, invalid_numbers: List[int], default: int) -> pd.Series:
        cond = row.isin(invalid_numbers)
        row[cond] = default
        return row
    
    data = pd.DataFrame(np.random.randint(low=-10, high=10, size=(10000, 5)))
    print("Before", (data.values == 100).sum())
    
    fn = partial(processing_fn, invalid_numbers=[-5, 2, 5], default=100)
    new_data = data.apply(fn, axis=1)
    
    print("After serial", (new_data.values == 100).sum())
    
    data = data.parallel_apply(fn, axis=1)
    print("After parallel", (data.values == 100).sum())
    
    

    Works fine in a standalone script, but fails if run in a Jupyter notebook.

    opened by Meehai 0
  • Choose which type of progress bar you want in a notebook


    One doesn't always have control over the environment of the Jupyter instance where one is working (think JupyterHub) and can't install the necessary extensions for progress bars. In this case it would be nice to have the option to manually request the simple progress bar so that it doesn't just display a widget error.

    Is there already a way to do so that I've missed? Otherwise, if it's something you'd consider adding, I'd also be happy to try and draft a PR.

    opened by astrojarred 3
  • TypeError: cannot pickle 'sqlite3.Connection' object in pyCharm


    General

    • Operating System: Ubuntu 22.04
    • Python version: 3.9.7
    • Pandas version: 1.4.2
    • Pandarallel version: 1.6.3

    Acknowledgement

    • [x] My issue is NOT present when using pandas alone (without pandarallel)
    • [ ] If I am on Windows, I read the Troubleshooting page before writing a new bug report

    Bug description

    I observe this when running parallel_apply in PyCharm. #76 sounds similar, but none of the tricks suggested there work for me. At this point I am also not sure whether it's more of an issue with PyCharm.

    Observed behavior

    Traceback (most recent call last):
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3444, in run_code
        exec(code_obj, self.user_global_ns, self.user_ns)
      File "<ipython-input-39-d230d86ff5ef>", line 1, in <module>
        iris.groupby("species").parallel_apply(lambda x: np.mean(x))
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/pandarallel/core.py", line 265, in closure
        dilled_user_defined_function = dill.dumps(user_defined_function)
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 364, in dumps
        dump(obj, file, protocol, byref, fmode, recurse, **kwds)#, strictio)
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 336, in dump
        Pickler(file, protocol, **_kwds).dump(obj)
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 620, in dump
        StockPickler.dump(self, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 487, in dump
        self.save(obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1963, in save_function
        _save_with_postproc(pickler, (_create_function, (
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1154, in _save_with_postproc
        pickler._batch_setitems(iter(source.items()))
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 603, in save
        self.save_reduce(obj=obj, *rv)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 692, in save_reduce
        save(args)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 886, in save_tuple
        save(element)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 603, in save
        self.save_reduce(obj=obj, *rv)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 717, in save_reduce
        save(state)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1251, in save_module_dict
        StockPickler.save_dict(pickler, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 971, in save_dict
        self._batch_setitems(obj.items())
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1251, in save_module_dict
        StockPickler.save_dict(pickler, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 971, in save_dict
        self._batch_setitems(obj.items())
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 603, in save
        self.save_reduce(obj=obj, *rv)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 717, in save_reduce
        save(state)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1251, in save_module_dict
        StockPickler.save_dict(pickler, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 971, in save_dict
        self._batch_setitems(obj.items())
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1251, in save_module_dict
        StockPickler.save_dict(pickler, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 971, in save_dict
        self._batch_setitems(obj.items())
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 578, in save
        rv = reduce(self.proto)
    TypeError: cannot pickle 'sqlite3.Connection' object
    

    Minimal but working code sample to ease bug fix for pandarallel team

    import pandas as pd
    import numpy as np
    from pandarallel import pandarallel
    pandarallel.initialize(use_memory_fs=False)
    iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
    iris.groupby("species").parallel_apply(lambda x: np.mean(x))
    
    opened by wiessall 1
Releases(v1.6.3)
Owner
Manu NALEPA, Data Scientist / Data Engineer @ Clustree, sousaphone & saxophone player