Out-of-Core DataFrames for Python, ML, visualize and explore big tabular data at a billion rows per second 🚀

What is Vaex?

Vaex is a high-performance Python library for lazy, out-of-core DataFrames (similar to Pandas) that lets you visualize and explore big tabular datasets. It calculates statistics such as mean, sum, count, and standard deviation on an N-dimensional grid at more than a billion (10^9) samples/rows per second. Visualization is done using histograms, density plots, and 3D volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero-memory-copy policy, and lazy computations for best performance (no memory wasted).
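To make the memory-mapping idea concrete, here is a minimal standard-library sketch. It is not Vaex's implementation: the file layout (raw little-endian float64 values) and the helper names `mapped_mean` and `write_demo_file` are assumptions made for illustration. It computes a mean over a file while decoding only one chunk at a time.

```python
import mmap
import os
import struct
import tempfile


def mapped_mean(path, chunk_items=1024):
    """Mean of a file of raw little-endian float64 values, read through mmap in chunks."""
    total = 0.0
    count = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            n_items = len(mm) // 8  # 8 bytes per float64
            for start in range(0, n_items, chunk_items):
                stop = min(start + chunk_items, n_items)
                # Only this slice of the mapped file is decoded into Python floats.
                values = struct.unpack_from(f"<{stop - start}d", mm, start * 8)
                total += sum(values)
                count += len(values)
    return total / count if count else float("nan")


def write_demo_file(values):
    """Hypothetical helper: store the values as raw float64 in a temporary file."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        f.write(struct.pack(f"<{len(values)}d", *values))
    return path


demo_path = write_demo_file([float(i) for i in range(10_000)])
demo_mean = mapped_mean(demo_path)
os.remove(demo_path)
```

The operating system pages data in as it is touched, so the Python process never holds the whole file as a list; that is the general property the paragraph above relies on.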

Installing

With pip:

$ pip install vaex

Or conda:

$ conda install -c conda-forge vaex

For more details, see the documentation

Key features

Instant opening of huge data files (memory mapping)

HDF5 and Apache Arrow supported.

Read the documentation on how to efficiently convert your data from CSV files, Pandas DataFrames, or other sources.

Lazy streaming from S3 supported in combination with memory mapping.

Expression system

Don't waste memory or time with feature engineering, we (lazily) transform your data when needed.
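As a rough illustration of lazy transformation, the sketch below defines a column as a function of other columns and evaluates it only for the rows requested. The class `LazyColumn` and its `evaluate` method are hypothetical names for this example, not part of the Vaex API.

```python
class LazyColumn:
    """A column defined by a function of other columns, evaluated on demand."""

    def __init__(self, func, *sources):
        self.func = func
        self.sources = sources

    def evaluate(self, start, stop):
        # Compute only the requested rows; the result is never stored on the column.
        rows = zip(*(src[start:stop] for src in self.sources))
        return [self.func(*row) for row in rows]


x = list(range(10))
y = [v * 2 for v in x]

# Defining the column does no work yet.
r = LazyColumn(lambda a, b: (a * a + b * b) ** 0.5, x, y)

# Work happens here, and only for rows 0..2.
first_rows = r.evaluate(0, 3)
```

Because the derived column is just a recipe, adding it costs no memory proportional to the number of rows.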

Out-of-core DataFrame

Filtering and evaluating expressions will not waste memory by making copies; the data is kept untouched on disk and is streamed only when needed. This postpones the point at which you need a cluster.

Fast groupby / aggregations

Vaex implements parallelized, highly performant groupby operations, especially when using categories (>1 billion/second).
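One reason categories help is that each group key becomes a small integer, so aggregation reduces to indexing into fixed-size accumulators. The following single-threaded sketch shows that idea only; `groupby_sum_count` is a hypothetical helper and says nothing about how Vaex parallelizes the work.

```python
def groupby_sum_count(codes, values, n_categories):
    """Aggregate values per integer category code in a single pass."""
    sums = [0.0] * n_categories
    counts = [0] * n_categories
    for code, value in zip(codes, values):
        sums[code] += value
        counts[code] += 1
    return sums, counts


labels = ["a", "b", "a", "c", "b", "a"]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Encode labels as integer codes once, up front.
categories = sorted(set(labels))
code_of = {label: i for i, label in enumerate(categories)}
codes = [code_of[label] for label in labels]

sums, counts = groupby_sum_count(codes, values, len(categories))
result = {categories[i]: (sums[i], counts[i]) for i in range(len(categories))}
```

No hashing of the original labels happens inside the aggregation loop, which is why pre-encoded categories tend to be the fast path.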

Fast and efficient join

Vaex doesn't copy/materialize the 'right' table when joining, saving gigabytes of memory. With subsecond joining on a billion rows, it's pretty fast!
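A join that does not materialize the right table can be pictured as a lookup from each left key to a row number on the right, resolved only when a value is read. The sketch below assumes that picture; `LazyJoinedColumn` is a hypothetical name, and the first-match-wins rule for duplicate right keys is a choice made for this example.

```python
class LazyJoinedColumn:
    """A right-table column seen through a left join, resolved per access."""

    def __init__(self, left_keys, right_keys, right_values, missing=None):
        self.left_keys = left_keys
        self.right_values = right_values
        self.missing = missing
        # Map each right key to its first row number; right_values is never copied.
        self.index = {}
        for row, key in enumerate(right_keys):
            self.index.setdefault(key, row)

    def __len__(self):
        return len(self.left_keys)

    def __getitem__(self, i):
        row = self.index.get(self.left_keys[i])
        return self.missing if row is None else self.right_values[row]


left_keys = [3, 1, 2, 1, 9]
right_keys = [1, 2, 3]
right_values = ["one", "two", "three"]

joined = LazyJoinedColumn(left_keys, right_keys, right_values)
materialized = [joined[i] for i in range(len(joined))]
```

The only extra memory is the key-to-row index; the joined values themselves exist only when something reads them.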

More features

Learn more about Vaex

Comments
  • When opening a bigger than memory CSV, using chunksize and converting to HDF5, the memory is still fully used until the machine hangs and the kernel dies.

    Following the recommendation here: https://vaex.readthedocs.io/en/latest/faq.html

    I have been able to verify that the small files (from each chunk) are being generated.

    However, although the data is being chunked and the small files are created, the memory usage increases until all memory is used, and the machine hangs until the kernel dies.

    When reading the dataset with pandas.read_csv() and its chunksize argument, memory use stays bounded by the chunk size. Perhaps the memory leak is happening in the process of conversion to HDF5?

    To reproduce the issue, just try to read a bigger than memory csv file with this code (as recommended on the faq link):

    df = vaex.from_csv('./my_data/my_big_file.csv', convert=True, chunk_size=5_000_000)
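For reference, the behaviour the report expects from chunked reading (only one chunk resident at a time) looks like the standard-library sketch below. This is not Vaex's conversion code; `iter_csv_chunks` is a hypothetical helper and the in-memory CSV is demo data.

```python
import csv
import io
from itertools import islice


def iter_csv_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size rows from an iterable of CSV lines."""
    reader = csv.DictReader(lines)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Demo input: ten rows with columns x and y = x * x.
demo_csv = io.StringIO("x,y\n" + "".join(f"{i},{i * i}\n" for i in range(10)))

chunk_sizes = []
total_y = 0
for chunk in iter_csv_chunks(demo_csv, chunk_size=4):
    # Each chunk is processed and then dropped before the next one is read.
    chunk_sizes.append(len(chunk))
    total_y += sum(int(row["y"]) for row in chunk)
```

If memory still grows with a loop of this shape, something downstream of the reader is holding on to the chunks.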

    opened by nf78 42
  • List of Numpy functions that can be used in expressions

    I am trying to create a virtual column that holds the values of another column shifted by one. I know there is a shift() method on the dataframe class, but it appears to create a whole new copy of the column. I tried to create an expression using a Numpy function, i.e., np.roll, that could then be used to form a virtual column, but it appears it isn't compatible with vaex expressions. Specifically, I did the following:

    import vaex
    import numpy as np
    osc_data = vaex.open('.../data.hdf5')  # '...' is the path to data.hdf5
    osc_data['sx'] = np.roll(osc_data.x, -1)

    This threw an error:


    TypeError                                 Traceback (most recent call last)
    /var/folders/l5/wjz73b69427bh_9x0954j0dr0000gn/T/ipykernel_6758/539048344.py in <module>
          1 import numpy as np
    ----> 2 osc_data['sx'] = np.roll(osc_data.x, -1)
          3 # print(type(osc_data['sx']))
          4 # print(type(osc_data['x']))
          5 # print(osc_data[['x', 'sx']])

    /usr/local/lib/python3.9/site-packages/vaex/dataframe.py in __setitem__(self, name, value)
       4969         if isinstance(name, six.string_types):
       4970             if isinstance(value, supported_column_types):
    -> 4971                 self.add_column(name, value)
       4972             else:
       4973                 self.add_virtual_column(name, value)

    /usr/local/lib/python3.9/site-packages/vaex/dataframe.py in add_column(self, name, f_or_array, dtype)
       3261             self._length_original = _len(data)
       3262             self._index_end = self._length_unfiltered
    -> 3263         if _len(ar) != self.length_original():
       3264             if self.filtered:
       3265                 # give a better warning to avoid confusion

    /usr/local/lib/python3.9/site-packages/vaex/dataframe.py in _len(o)
         65
         66 def _len(o):
    ---> 67     return o.__len__()
         68
         69

    TypeError: len() of unsized object

    My guess is that np.roll is not one of the Numpy functions supported in vaex expressions. However, I could not find a list of Numpy functions that are valid in vaex expressions. This means those of us coding with vaex have to guess, hit or miss, which functions are compatible and which are not. This is extremely frustrating.

    As a side question, is there any other way of creating an expression that shifts a column without copying the column's data?
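Independently of what Vaex supports, the idea of a shifted column that shares storage with the original can be sketched in plain Python. `ShiftedView` below is a hypothetical class for illustration, not a Vaex feature; unlike np.roll it fills the vacated position instead of wrapping around.

```python
class ShiftedView:
    """Read-only view of a sequence shifted by `periods`, without copying it."""

    def __init__(self, data, periods, fill=None):
        self.data = data
        self.periods = periods
        self.fill = fill

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        if not 0 <= i < len(self.data):
            raise IndexError(i)
        # periods = -1 means position i shows the value stored at i + 1.
        j = i - self.periods
        return self.data[j] if 0 <= j < len(self.data) else self.fill


x = [10, 20, 30, 40]
sx = ShiftedView(x, periods=-1)
shifted_values = list(sx)
```

The view holds a reference to `x` plus an offset, so its memory cost does not depend on the column length.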

    opened by dnessett 35
  • Virtual Column coordinates __xxx_matrix not found / Stride != 8

    A strange bug appears when I add a virtual column like

    cat.add_virtual_columns_eq2gal('RAout', 'DECout')
    

    I can evaluate the column but when I try to select upon it I get a strange message:

    error name '__celestial_eq2gal_matrix' is not defined
    
    opened by mfouesneau 31
  • Add basic support for accessing struct arrays

    This PR continues work described in #1423. It adds support to access individual fields of arrow based struct arrays and wraps them in vaex expressions.

    The following additions are made:

    • Adds StructOperations namespace to Expression.
    • Adds Expression.struct.get_field and Expression.struct.project.
    • Adds Expression.struct.field_names and Expression.struct.field_types
    • Adds shorthand notation to use struct_get_field and struct_project via Expression.__getitem__.

    Still to do and to be discussed:

    • [ ] Add examples to documentation.
    • [ ] Add changes to changelog.

    Please feel free to modify the function names. I'm not 100% sure about the API.

    opened by mansenfranzen 29
  • [BUG-REPORT] - `to_numpy` corrupts numpy array

    Description

    When you have a column whose values are pyarrow arrays and you try to convert it to numpy, the result is a corrupted numpy array that cannot be worked with directly.

    from random import random

    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import vaex

    NUM_SAMPLES = 100
    WIDTH = 20

    # Each row of 'vals' is a pyarrow array of WIDTH random floats.
    vals = [pa.array([random() for _ in range(WIDTH)]) for _ in range(NUM_SAMPLES)]
    ids = list(range(NUM_SAMPLES))
    data = {
        'vals': vals,
        'id': ids
    }
    df = vaex.from_pandas(pd.DataFrame(data))

    bad = df['vals'].to_numpy() # np.array of type Object, cannot become a (NUM_SAMPLES,WIDTH) matrix
    good = np.array(vals) # (NUM_SAMPLES,WIDTH) matrix
    

    In the example above, good is the numpy matrix you'd expect, but bad is a numpy array of pyarrow arrays. For some reason, numpy cannot convert it to a numpy matrix; it thinks the objects are of unequal length. The only ways to do it are to

    1. Cast it to a list, then back to numpy
    2. Convert each pa.array to a numpy array first, then call the np.array constructor

    Both of those are very slow. Do you have a suggestion for a way to get around this? Thanks!

    Software information

    • Vaex version (import vaex; vaex.__version__):
    {'vaex': '4.5.0',
     'vaex-core': '4.5.1',
     'vaex-viz': '0.5.0',
     'vaex-hdf5': '0.10.0',
     'vaex-server': '0.6.1',
     'vaex-astro': '0.9.0',
     'vaex-jupyter': '0.6.0',
     'vaex-ml': '0.14.0'}
    
    • Vaex was installed via: pip / conda-forge / from source : pip
    • OS: MacOS big sur

    opened by Ben-Epstein 26
  • [BUG-REPORT] pip install is broken

    There are many other closed issues about this, but it still pops up for me. If I do pip install vaex-core vaex-hdf5 and then try to import vaex, it can't find ANY of the submodules. I have also tried the more general pip install vaex, with the same results.

    I have tried everything: reinstalling multiple times, clearing the pip cache, etc. It just never works somehow. It used to work for me, but I am on a new virtual machine and now I can't install it anymore.

    I am running python3.9 on the latest CentOS. Installed packages:

    Package         Version
    --------------- -----------
    aplus           0.11.0
    blake3          0.1.8
    certifi         2021.5.30
    chardet         4.0.0
    cloudpickle     1.6.0
    dask            2021.6.2
    frozendict      2.0.3
    fsspec          2021.6.1
    future          0.18.2
    h5py            3.3.0
    idna            2.10
    locket          0.2.1
    nest-asyncio    1.5.1
    numpy           1.21.0
    pandas          1.2.5
    partd           1.2.0
    pip             21.1.2
    progressbar2    3.53.1
    pyarrow         4.0.1
    python-dateutil 2.8.1
    python-utils    2.5.6
    pytz            2021.1
    PyYAML          5.4.1
    requests        2.25.1
    setuptools      57.0.0
    six             1.16.0
    tabulate        0.8.9
    toolz           0.11.1
    urllib3         1.26.6
    vaex-core       4.3.0.post1
    vaex-hdf5       0.8.0
    wheel           0.36.2
    

    When I try to open an HDF5 file I get:

    ModuleNotFoundError: No module named 'vaex.hdf5'
    

    Very frustrating not to get it to work :( Does anybody have any ideas?
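When a meta-package is split across several distributions, as the vaex-core and vaex-hdf5 entries in the listing above suggest, it can help to ask the running interpreter which of them it actually sees. The sketch below uses only the standard library; `installed_with_prefix` is a hypothetical helper, and it reports what is installed without diagnosing why an import fails.

```python
from importlib import metadata


def installed_with_prefix(prefix):
    """Return {distribution name: version} for installed names starting with prefix."""
    found = {}
    for dist in metadata.distributions():
        try:
            name = dist.metadata["Name"]
        except Exception:
            # Skip distributions whose metadata cannot be read.
            continue
        if name and name.lower().startswith(prefix.lower()):
            found[name] = dist.version
    return found


# Lists whatever vaex-* distributions this interpreter can see (possibly none).
vaex_packages = installed_with_prefix("vaex")
```

Running this with the same interpreter that raises the ModuleNotFoundError shows whether the packages were installed into that environment or into a different one.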

    opened by FrankBoermanTenneT 25
  • Problems with vaex to support Python3

    After we installed vaex using conda, calling ipython started Python2 instead of the original Python3. In principle vaex supports Python3, so how can we avoid this?

    opened by fprada 25
  • [FEATURE-REQUEST] Create OHLC from tick data

    Description Trying to create OHLC (Open High Low Close) data from price tick data. Can sort of create Open, High, Low, but Open is wrong and times have some offset. Not sure about close.

    import pandas as pd
    import vaex
    import numpy as np
    import time
    dates = pd.date_range("01-01-2019", "14-04-2020", freq="60s")
    num = len(dates)
    vdf = vaex.from_arrays(ts=pd.to_datetime(dates), x=np.random.randint(1, 1000, num))
    print(vdf.head(17))
    print()
    
    # Desired output
    print(vdf.to_pandas_df().resample('15Min', on='ts')['x'].ohlc())
    print()
    
    # Create Open High Low (Bit off)
    vdf2 = vdf.groupby(by=vaex.BinnerTime(vdf.ts, resolution='m', every=15), agg={'O': vaex.agg.first('x', 'ts'), 'H': vaex.agg.max('x'), 'L': vaex.agg.min('x')})
    print(vdf2)
    print()
    
    # Create Close?
    vdf3 = vdf.groupby(by=vaex.BinnerTime(vdf.ts, resolution='m', every=15), agg={'C': vaex.agg.first('x', '-ts')})
    print(vdf3)
    

    Output:

    #    ts                             x
    0    2019-01-01 00:00:00.000000000  5
    1    2019-01-01 00:01:00.000000000  20
    2    2019-01-01 00:02:00.000000000  690
    3    2019-01-01 00:03:00.000000000  434
    4    2019-01-01 00:04:00.000000000  686
    ...  ...                            ...
    12   2019-01-01 00:12:00.000000000  182
    13   2019-01-01 00:13:00.000000000  530
    14   2019-01-01 00:14:00.000000000  659
    15   2019-01-01 00:15:00.000000000  929
    16   2019-01-01 00:16:00.000000000  734
    
                         open  high  low  close
    ts                                         
    2019-01-01 00:00:00     5   894    5    659
    2019-01-01 00:15:00   929   929  217    611
    2019-01-01 00:30:00   424   966   41    228
    2019-01-01 00:45:00    19   977   19     42
    2019-01-01 01:00:00   137   989   96    686
    ...                   ...   ...  ...    ...
    2020-04-13 23:00:00   756   994   99    204
    2020-04-13 23:15:00   510   847    3      3
    2020-04-13 23:30:00   128   898   62    501
    2020-04-13 23:45:00   920   937   54    626
    2020-04-14 00:00:00   694   694  694    694
    
    [45025 rows x 4 columns]
    
    #       ts                O    H    L
    0       2018-12-31 23:59  690  894  5
    1       2019-01-01 00:14  929  929  217
    2       2019-01-01 00:29  938  966  41
    3       2019-01-01 00:44  904  977  19
    4       2019-01-01 00:59  96   989  42
    ...     ...               ...  ...  ...
    45,020  2020-04-13 22:59  440  994  3
    45,021  2020-04-13 23:14  204  847  117
    45,022  2020-04-13 23:29  263  898  3
    45,023  2020-04-13 23:44  54   937  54
    45,024  2020-04-13 23:59  626  694  626
    
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    KeyError: "Unknown variables or column: '-ts'"
    
    During handling of the above exception, another exception occurred:
    
    UFuncTypeError: ufunc 'negative' did not contain a loop with signature matching types dtype('<M8[ns]') -> dtype('<M8[ns]')
    
    During handling of the above exception, another exception occurred:
    
    ...
    KeyError: "Unknown variables or column: '-ts'"
    
    During handling of the above exception, another exception occurred:
    ...
    UFuncTypeError: ufunc 'negative' did not contain a loop with signature matching types dtype('<M8[ns]') -> dtype('<M8[ns]')
    

    Additional Context

    Trying to emulate Pandas OHLC
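To pin down the target semantics independently of any library, here is a plain-Python sketch of OHLC binning. It assumes the ticks arrive sorted by time and that the bin width in minutes divides 60; the function name `ohlc` and the demo prices are made up for this example.

```python
from datetime import datetime, timedelta


def ohlc(ticks, minutes=15):
    """Map each bin start to [open, high, low, close] for time-sorted (timestamp, price) ticks."""
    bars = {}
    for ts, price in ticks:
        # Floor the timestamp to the start of its bin.
        bin_start = ts.replace(minute=ts.minute - ts.minute % minutes,
                               second=0, microsecond=0)
        bar = bars.get(bin_start)
        if bar is None:
            # First tick in the bin sets all four values.
            bars[bin_start] = [price, price, price, price]
        else:
            bar[1] = max(bar[1], price)
            bar[2] = min(bar[2], price)
            bar[3] = price  # last tick seen so far is the close
    return bars


start = datetime(2019, 1, 1)
demo_ticks = [(start + timedelta(minutes=m), p)
              for m, p in [(0, 10), (5, 12), (10, 8), (14, 11), (15, 9), (20, 15)]]
bars = ohlc(demo_ticks)
```

Open is the first price in a bin and close is the last one, so a close aggregation needs "last by time" rather than a negated timestamp.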

    opened by Penacillin 19
  • [BUG-REPORT] problem using data from parquet/pyarrow in correlation function

    Dear Vaex developers,

    Description I would like to use the Vaex correlation function to calculate the correlation coefficient between two columns of a dataframe (one type int64, the other type str). I am loading the dataframe from a parquet file generated using the pyarrow engine. It seems that correlation may not work correctly with data in this format.

    Here is a reproducer:

    import pandas as pd
    import vaex

    #use pandas to create and save dataframe as parquet
    data = [[0,'hello'], [1,'world'], [3,'this'], [5,'example'], [0,'hello']]
    pdf = pd.DataFrame(data, columns=['a', 'b'], dtype=int)
    pdf.to_parquet('test', engine='pyarrow')

    #now load into vaex
    vdf = vaex.open('test')
    #try correlation without .values
    out = vdf.correlation(vdf.a, vdf.b)
    

    This fails with a traceback that ends in ValueError: could not convert string to float: 'hello'.

    I thought I might need to explicitly request .values:

    #try correlation with .values
    out = vdf.correlation(vdf.a.values, vdf.b.values)
    

    This fails with a traceback that ends in

    ValueError: <pyarrow.lib.Int64Array object at 0x2aaae733da60>
    [
      0,
      1,
      3,
      5,
      0
    ] is not of string or Expression type, but <class 'pyarrow.lib.Int64Array'>
    

    Software information

    • Vaex version:
    {'vaex-core': '4.1.0',
    'vaex-viz': '0.5.0',
    'vaex-hdf5': '0.7.0',
    'vaex-server': '0.4.0',
    'vaex-astro': '0.8.0',
    'vaex-jupyter': '0.6.0',
    'vaex-ml': '0.11.1'}
    
    • Vaex was installed via: conda-forge
    • OS: Ubuntu 20.04 (in Docker)

    Apologies if I am using Vaex or this function incorrectly.

    Thank you very much, Laurie

    opened by lastephey 19
  • AttributeError: module 'vaex' has no attribute 'from_csv'

    I have attempted to utilize vaex, but have been unsuccessful. Reason unknown. I am using the most recent version of anaconda 3. I just uninstalled and reinstalled the most recent version as I was experiencing this same problem and figured a fresh install may help to clear the issue. The same error is appearing. I have copied and pasted the console here.

    Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
    Type "copyright", "credits" or "license" for more information.

    IPython 7.19.0 -- An enhanced Interactive Python.

    conda install -c conda-forge vaex
    Collecting package metadata (current_repodata.json): ...working... done
    Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
    Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
    Collecting package metadata (repodata.json): ...working... done
    Solving environment: ...working... done

    Package Plan

    environment location: C:\Users\sutei\anaconda3

    added / updated specs: - vaex

    The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    abseil-cpp-20200225.2      |       ha925a31_2         1.9 MB  conda-forge
    aplus-0.11.0               |             py_1           6 KB  conda-forge
    arrow-cpp-2.0.0            |py38h647f3f1_6_cpu        13.4 MB  conda-forge
    aws-c-common-0.4.59        |       h8ffe710_1         151 KB  conda-forge
    aws-c-event-stream-0.1.6   |       hb4e73fc_6          26 KB  conda-forge
    aws-checksums-0.1.10       |       h6f0a1a5_0          51 KB  conda-forge
    aws-sdk-cpp-1.8.70         |       he2782d2_1         2.9 MB  conda-forge
    boto3-1.17.19              |     pyhd8ed1ab_0          70 KB  conda-forge
    botocore-1.20.19           |     pyhd8ed1ab_0         4.5 MB  conda-forge
    bqplot-0.12.20             |     pyhd8ed1ab_0         1.0 MB  conda-forge
    branca-0.4.2               |     pyhd8ed1ab_0          26 KB  conda-forge
    brotli-1.0.9               |       h0e60522_4         882 KB  conda-forge
    c-ares-1.17.1              |       h8ffe710_0         109 KB  conda-forge
    cachetools-4.2.1           |     pyhd8ed1ab_0          13 KB  conda-forge
    conda-4.9.2                |   py38haa244fe_0         3.1 MB  conda-forge
    geos-3.9.1                 |       h39d44d4_2         1.1 MB  conda-forge
    gflags-2.2.2               |    ha925a31_1004          80 KB  conda-forge
    glog-0.4.0                 |       h0174b99_3          83 KB  conda-forge
    grpc-cpp-1.33.2            |       h59b151f_1        14.0 MB  conda-forge
    ipydatawidgets-4.2.0       |     pyhd3deb0d_0         171 KB  conda-forge
    ipyleaflet-0.13.4          |     pyhd3deb0d_0         4.3 MB  conda-forge
    ipympl-0.5.8               |     pyh9f0ad1d_0         506 KB  conda-forge
    ipyvolume-0.6.0a6          |     pyh9f0ad1d_0         5.1 MB  conda-forge
    ipyvue-1.5.0               |     pyhd3deb0d_0         2.2 MB  conda-forge
    ipyvuetify-1.6.2           |     pyh44b312d_0         7.7 MB  conda-forge
    ipywebrtc-0.5.0            |   py38h32f6830_1         877 KB  conda-forge
    jmespath-0.10.0            |     pyh9f0ad1d_0          21 KB  conda-forge
    

    Note: you may need to restart the kernel to use updated packages.

    libprotobuf-3.13.0.1       |       h200bbdf_0         2.3 MB  conda-forge
    libthrift-0.13.0           |       hdfef310_6         1.6 MB  conda-forge
    libutf8proc-2.6.1          |       hcb41399_0          98 KB  conda-forge
    openssl-1.1.1h             |       he774522_0         5.8 MB  conda-forge
    parquet-cpp-1.5.1          |                2           3 KB  conda-forge
    pcre-8.44                  |       ha925a31_0         498 KB  conda-forge
    progressbar2-3.53.1        |     pyh9f0ad1d_0          25 KB  conda-forge
    pyarrow-2.0.0              |py38ha37a76c_6_cpu         2.0 MB  conda-forge
    python-utils-2.5.5         |     pyh44b312d_0          15 KB  conda-forge
    python_abi-3.8             |           1_cp38           4 KB  conda-forge
    pythreejs-2.3.0            |     pyhd8ed1ab_0         2.6 MB  conda-forge
    re2-2020.11.01             |       h0e60522_0         468 KB  conda-forge
    s3fs-0.2.2                 |             py_0          20 KB  conda-forge
    s3transfer-0.3.4           |     pyhd8ed1ab_0          51 KB  conda-forge
    shapely-1.7.1              |   py38h2426642_4         420 KB  conda-forge
    snappy-1.1.8               |       ha925a31_3          50 KB  conda-forge
    tabulate-0.8.9             |     pyhd8ed1ab_0          26 KB  conda-forge
    traittypes-0.2.1           |     pyh9f0ad1d_2          10 KB  conda-forge
    vaex-3.0.0                 |     pyh9f0ad1d_0           9 KB  conda-forge
    vaex-arrow-0.5.1           |     pyh9f0ad1d_0          10 KB  conda-forge
    vaex-astro-0.7.0           |     pyh9f0ad1d_0          13 KB  conda-forge
    vaex-core-2.0.3            |   py38h4c96930_1         1.5 MB  conda-forge
    vaex-hdf5-0.6.0            |     pyh9f0ad1d_0          13 KB  conda-forge
    vaex-jupyter-0.5.2         |     pyh9f0ad1d_0          36 KB  conda-forge
    vaex-ml-0.9.0              |     pyh9f0ad1d_0          76 KB  conda-forge
    vaex-server-0.3.1          |     pyh9f0ad1d_0          14 KB  conda-forge
    vaex-viz-0.4.0             |     pyh9f0ad1d_0          18 KB  conda-forge
    xarray-0.17.0              |     pyhd8ed1ab_0         561 KB  conda-forge
    ------------------------------------------------------------
                                           Total:        82.3 MB
    

    The following NEW packages will be INSTALLED:

    abseil-cpp          conda-forge/win-64::abseil-cpp-20200225.2-ha925a31_2
    aplus               conda-forge/noarch::aplus-0.11.0-py_1
    arrow-cpp           conda-forge/win-64::arrow-cpp-2.0.0-py38h647f3f1_6_cpu
    aws-c-common        conda-forge/win-64::aws-c-common-0.4.59-h8ffe710_1
    aws-c-event-stream  conda-forge/win-64::aws-c-event-stream-0.1.6-hb4e73fc_6
    aws-checksums       conda-forge/win-64::aws-checksums-0.1.10-h6f0a1a5_0
    aws-sdk-cpp         conda-forge/win-64::aws-sdk-cpp-1.8.70-he2782d2_1
    boto3               conda-forge/noarch::boto3-1.17.19-pyhd8ed1ab_0
    botocore            conda-forge/noarch::botocore-1.20.19-pyhd8ed1ab_0
    bqplot              conda-forge/noarch::bqplot-0.12.20-pyhd8ed1ab_0
    branca              conda-forge/noarch::branca-0.4.2-pyhd8ed1ab_0
    brotli              conda-forge/win-64::brotli-1.0.9-h0e60522_4
    c-ares              conda-forge/win-64::c-ares-1.17.1-h8ffe710_0
    cachetools          conda-forge/noarch::cachetools-4.2.1-pyhd8ed1ab_0
    geos                conda-forge/win-64::geos-3.9.1-h39d44d4_2
    gflags              conda-forge/win-64::gflags-2.2.2-ha925a31_1004
    glog                conda-forge/win-64::glog-0.4.0-h0174b99_3
    grpc-cpp            conda-forge/win-64::grpc-cpp-1.33.2-h59b151f_1
    ipydatawidgets      conda-forge/noarch::ipydatawidgets-4.2.0-pyhd3deb0d_0
    ipyleaflet          conda-forge/noarch::ipyleaflet-0.13.4-pyhd3deb0d_0
    ipympl              conda-forge/noarch::ipympl-0.5.8-pyh9f0ad1d_0
    ipyvolume           conda-forge/noarch::ipyvolume-0.6.0a6-pyh9f0ad1d_0
    ipyvue              conda-forge/noarch::ipyvue-1.5.0-pyhd3deb0d_0
    ipyvuetify          conda-forge/noarch::ipyvuetify-1.6.2-pyh44b312d_0
    ipywebrtc           conda-forge/win-64::ipywebrtc-0.5.0-py38h32f6830_1
    jmespath            conda-forge/noarch::jmespath-0.10.0-pyh9f0ad1d_0
    libprotobuf         conda-forge/win-64::libprotobuf-3.13.0.1-h200bbdf_0
    libthrift           conda-forge/win-64::libthrift-0.13.0-hdfef310_6
    libutf8proc         conda-forge/win-64::libutf8proc-2.6.1-hcb41399_0
    parquet-cpp         conda-forge/noarch::parquet-cpp-1.5.1-2
    pcre                conda-forge/win-64::pcre-8.44-ha925a31_0
    progressbar2        conda-forge/noarch::progressbar2-3.53.1-pyh9f0ad1d_0
    pyarrow             conda-forge/win-64::pyarrow-2.0.0-py38ha37a76c_6_cpu
    python-utils        conda-forge/noarch::python-utils-2.5.5-pyh44b312d_0
    python_abi          conda-forge/win-64::python_abi-3.8-1_cp38
    pythreejs           conda-forge/noarch::pythreejs-2.3.0-pyhd8ed1ab_0
    re2                 conda-forge/win-64::re2-2020.11.01-h0e60522_0
    s3fs                conda-forge/noarch::s3fs-0.2.2-py_0
    s3transfer          conda-forge/noarch::s3transfer-0.3.4-pyhd8ed1ab_0
    shapely             conda-forge/win-64::shapely-1.7.1-py38h2426642_4
    snappy              conda-forge/win-64::snappy-1.1.8-ha925a31_3
    tabulate            conda-forge/noarch::tabulate-0.8.9-pyhd8ed1ab_0
    traittypes          conda-forge/noarch::traittypes-0.2.1-pyh9f0ad1d_2
    vaex                conda-forge/noarch::vaex-3.0.0-pyh9f0ad1d_0
    vaex-arrow          conda-forge/noarch::vaex-arrow-0.5.1-pyh9f0ad1d_0
    vaex-astro          conda-forge/noarch::vaex-astro-0.7.0-pyh9f0ad1d_0
    vaex-core           conda-forge/win-64::vaex-core-2.0.3-py38h4c96930_1
    vaex-hdf5           conda-forge/noarch::vaex-hdf5-0.6.0-pyh9f0ad1d_0
    vaex-jupyter        conda-forge/noarch::vaex-jupyter-0.5.2-pyh9f0ad1d_0
    vaex-ml             conda-forge/noarch::vaex-ml-0.9.0-pyh9f0ad1d_0
    vaex-server         conda-forge/noarch::vaex-server-0.3.1-pyh9f0ad1d_0
    vaex-viz            conda-forge/noarch::vaex-viz-0.4.0-pyh9f0ad1d_0
    xarray              conda-forge/noarch::xarray-0.17.0-pyhd8ed1ab_0

    The following packages will be SUPERSEDED by a higher-priority channel:

    conda    pkgs/main::conda-4.9.2-py38haa95532_0 --> conda-forge::conda-4.9.2-py38haa244fe_0
    openssl  pkgs/main --> conda-forge

    Downloading and Extracting Packages

    Preparing transaction: ...working... done
    Verifying transaction: ...working... done
    Executing transaction: ...working... done

    runfile('C:/Users/sutei/Desktop/CASEY/vaex2.py', wdir='C:/Users/sutei/Desktop/CASEY')

    WARNING: This is not valid Python code. If you want to use IPython magics, flexible indentation, and prompt removal, please save this file with the .ipy extension. This will be an error in a future version of Spyder.

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 100000 entries, 0 to 99999
    Data columns (total 10 columns):
     #   Column  Non-Null Count   Dtype
    ---  ------  --------------   -----
     0   c0      100000 non-null  int32
     1   c1      100000 non-null  int32
     2   c2      100000 non-null  int32
     3   c3      100000 non-null  int32
     4   c4      100000 non-null  int32
     5   c5      100000 non-null  int32
     6   c6      100000 non-null  int32
     7   c7      100000 non-null  int32
     8   c8      100000 non-null  int32
     9   c9      100000 non-null  int32
    dtypes: int32(10)
    memory usage: 3.8 MB

    Traceback (most recent call last):
      File "C:\Users\sutei\Desktop\CASEY\vaex2.py", line 25, in <module>
        vaex_df = vaex.from_csv(filepath, convert = True, chunk_size = 5_000_000)

    AttributeError: module 'vaex' has no attribute 'from_csv'

    Code:

    import vaex
    import pandas as pd
    import numpy as np

    n_rows = 100000
    n_cols = 10
    df = pd.DataFrame(np.random.randint(0, 100, size=(n_rows, n_cols)), columns=['c%d' % i for i in range(n_cols)])

    df.info(memory_usage='deep')

    # creating .csv files
    filepath = 'main_dataset.csv'
    df.to_csv(filepath, index=False)

    # create hdf5 files
    vaex_df = vaex.from_csv(filepath, convert = True, chunk_size = 5_000_000)

    type(vaex_df)

    opened by steveentrehub 19
  • csv file without header not supported

    csv file without header not supported

    Hi, I am trying to load a CSV with no header using

    df = vaex.open('data/star0000-1.csv',sep=",", header=None, error_bad_lines=False)
    

    but I get

    could not convert column 0, error: TypeError('getattr(): attribute name must be string'), will try to convert it to string
    Giving up column 0, error: TypeError('getattr(): attribute name must be string')
    could not convert column 1, error: TypeError('getattr(): attribute name must be string'), will try to convert it to string
    Giving up column 1, error: TypeError('getattr(): attribute name must be string')
    could not convert column 2, error: TypeError('getattr(): attribute name must be string'), will try to convert it to string
    Giving up column 2, error: TypeError('getattr(): attribute name must be string')
    could not convert column 3, error: TypeError('getattr(): attribute name must be string'), will try to convert it to string
    Giving up column 3, error: TypeError('getattr(): attribute name must be string')
    could not convert column 4, error: TypeError('getattr(): attribute name must be string'), will try to convert it to string
    Giving up column 4, error: TypeError('getattr(): attribute name must be string')
    could not convert column 5, error: TypeError('getattr(): attribute name must be string'), will try to convert it to string
    Giving up column 5, error: TypeError('getattr(): attribute name must be string')
    could not convert column 6, error: TypeError('getattr(): attribute name must be string'), will try to convert it to string
    Giving up column 6, error: TypeError('getattr(): attribute name must be string')
    could not convert column 7, error: TypeError('getattr(): attribute name must be string'), will try to convert it to string
    Giving up column 7, error: TypeError('getattr(): attribute name must be string')
    could not convert column 8, error: TypeError('getattr(): attribute name must be string'), will try to convert it to string
    Giving up column 8, error: TypeError('getattr(): attribute name must be string')
    could not convert column 9, error: TypeError('getattr(): attribute name must be string'), will try to convert it to string
    Giving up column 9, error: TypeError('getattr(): attribute name must be string')
    could not convert column 10, error: TypeError('getattr(): attribute name must be string'), will try to convert it to string
    Giving up column 10, error: TypeError('getattr(): attribute name must be string')
    could not convert column 11, error: TypeError('getattr(): attribute name must be string'), will try to convert it to string
    Giving up column 11, error: TypeError('getattr(): attribute name must be string')
    
    

    Also it takes like 20x the time to do nothing, any help?
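A possible workaround, sketched under assumptions rather than confirmed by the thread: the `getattr(): attribute name must be string` messages are consistent with the integer column labels (0, 1, 2, ...) that pandas assigns when `header=None` is used. If the keyword arguments in the call above are forwarded to `pandas.read_csv`, passing explicit string names through its `names=` parameter may avoid the conversion failures. The helper `string_column_names` below is hypothetical and not part of Vaex; it only counts the fields in the first row and builds one string name per field.

```python
import csv


def string_column_names(path, prefix="c", sep=","):
    """Return ['c0', 'c1', ...], one name per field in the first row of a header-less CSV.

    Hypothetical helper for illustration; it reads only the first row.
    """
    with open(path, newline="") as f:
        first_row = next(csv.reader(f, delimiter=sep))
    return [f"{prefix}{i}" for i in range(len(first_row))]


# Assumed usage (not verified against Vaex here): forward the names to the reader, e.g.
#   names = string_column_names('data/star0000-1.csv')
#   df = vaex.open('data/star0000-1.csv', sep=",", header=None, names=names)
```

The helper raises `StopIteration` on an empty file, which is left unhandled to keep the sketch short.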

    enhancement help wanted priority: low good first issue 
    opened by argenisleon 19
  • [BUG-REPORT] `tolist` is much slower than `to_numpy().tolist()`

    [BUG-REPORT] `tolist` is much slower than `to_numpy().tolist()`

    Software information

    • Vaex version (import vaex; vaex.__version__): {'vaex-core': '4.16.0', 'vaex-hdf5': '0.12.2'}
    • Vaex was installed via: pip / conda-forge / from source
    • OS:

    Additional information

    import vaex
    df = vaex.example()
    df.export("file.arrow")
    df2 = vaex.open("file.arrow")
    
    # time this
    with vaex.cache.off():
        df2.id.tolist()
    
    
    # vs this
    with vaex.cache.off():
        df2.id.to_numpy().tolist()
    
    image
    opened by Ben-Epstein 0
  • [BUG-REPORT] Can't do binary ops between expressions from copied DFs

    [BUG-REPORT] Can't do binary ops between expressions from copied DFs

    Description

    import vaex
    
    df = vaex.from_arrays(x=[1, 2, 3])
    df2 = df.copy()
    df.x + df2.x
    

    raises:

    ---------------------------------------------------------------------------
    AssertionError                            Traceback (most recent call last)
    Cell In[5], line 5
          3 df = vaex.from_arrays(x=[1, 2, 3])
          4 df2 = df.copy()
    ----> 5 df.x + df2.x
    
    File ~/Library/Application Support/hatch/env/virtual/noatak-UM6-FHel/noatak/lib/python3.10/site-packages/vaex/expression.py:144, in Meta.__new__.<locals>.wrap.<locals>.f(a, b)
        142 else:
        143     if isinstance(b, Expression):
    --> 144         assert b.ds == a.ds
        145         b = b.expression
        146     elif isinstance(b, (np.timedelta64)):
    
    AssertionError: 
    

    I'm not sure if this is a bug. It seems to me as though it should be: Both df and df2 are from the same source of data, and that should be the important thing that should make this operation possible. But vaex complains that df and df2 aren't the exact same DataFrame. Am I missing something though?
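The behaviour in the traceback can be illustrated with a toy model. This is not Vaex code: `ToyDataFrame` and `ToyExpression` are hypothetical stand-ins that copy only the two things visible in the traceback, namely that an expression carries a `ds` reference plus an `expression` string, and that the binary operator asserts `b.ds == a.ds`. The sketch assumes the comparison falls back to object identity, so a copy never compares equal to its original. Whether rebinding the expression string onto the first DataFrame is supported or advisable in Vaex itself is not confirmed by the report.

```python
class ToyDataFrame:
    """Toy stand-in for a dataframe: named columns stored as plain lists."""

    def __init__(self, **columns):
        self.columns = columns

    def copy(self):
        # A copy shares the column data but is a distinct object.
        return ToyDataFrame(**self.columns)

    def __getitem__(self, expression):
        return ToyExpression(self, expression)


class ToyExpression:
    """Toy stand-in for an expression: an owner (ds) plus an expression string."""

    def __init__(self, ds, expression):
        self.ds = ds
        self.expression = expression

    def __add__(self, other):
        # Mirrors the assertion in the traceback; default == compares identity.
        assert other.ds == self.ds
        return ToyExpression(self.ds, f"({self.expression} + {other.expression})")


df = ToyDataFrame(x=[1, 2, 3])
df2 = df.copy()

try:
    df["x"] + df2["x"]
    mixed_ok = True
except AssertionError:
    mixed_ok = False  # same data, but different owner objects

# Rebinding the expression string onto df satisfies the assertion in this toy model.
rebound = df["x"] + df[df2["x"].expression]
```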

    Software information

    • Vaex version (import vaex; vaex.__version__): {'vaex': '4.16.0', 'vaex-core': '4.16.1', 'vaex-viz': '0.5.4', 'vaex-hdf5': '0.14.1', 'vaex-server': '0.8.1', 'vaex-astro': '0.9.3', 'vaex-jupyter': '0.8.1', 'vaex-ml': '0.18.1'}
    opened by NickCrews 0
  • [BUG-REPORT] Problems unlinking shared memory of arrays after dataframe deletion

    [BUG-REPORT] Problems unlinking shared memory of arrays after dataframe deletion

    I'm trying to use vaex with numpy arrays that reference shared memory, and I run into problems when trying to unlink the shared memory. Here is a minimal reproducing example:

    import numpy as np
    from multiprocessing import shared_memory
    import time
    import vaex
    
    shm = shared_memory.SharedMemory(create=True, size=8)
    arr = np.frombuffer(shm.buf, dtype="uint8", count=8)
    df = vaex.from_dict(dict(x=arr))
    
    del arr
    del df
    time.sleep(2)
    
    shm.close()
    shm.unlink()
    

    Execution throws the following exception:

    Traceback (most recent call last):
      File "<...>\memory_test.py", line 15, in <module>
        shm.close()
      File "<...>\.pyenv\pyenv-win\versions\3.9.10\lib\multiprocessing\shared_memory.py", line 227, in close
        self._mmap.close()
    BufferError: cannot close exported pointers exist
    

    It works fine when not creating the dataframe object.

    It seems like vaex is still keeping a reference to the array/shm block after deleting the dataframe object. Is that a bug or is there a recommended way to delete all references?
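The mechanism behind the `BufferError` can be reproduced with the standard library alone. The sketch below does not show where Vaex holds its reference; it only demonstrates, under the assumption that some object still exports the buffer, that `close()` fails while any view on the block is alive and succeeds once that view is released. The slice `view` stands in for whatever reference is being kept.

```python
from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(create=True, size=8)

# A second view on the same block, standing in for a reference held elsewhere.
view = shm.buf[:4]

try:
    shm.close()
    close_failed = False
except BufferError:
    # The underlying mmap still has an exported buffer, so it cannot be closed.
    close_failed = True

# Once the last outstanding view is released, the block can be closed and unlinked.
view.release()
shm.close()
shm.unlink()
```

In the reported case, this suggests looking for whichever object still exposes the array's buffer after `del df`, since deleting a name does not by itself release the buffer if another reference survives.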

    Software information

    • Vaex version (import vaex; vaex.__version__): {'vaex': '4.16.0', 'vaex-core': '4.16.1', 'vaex-viz': '0.5.4', 'vaex-hdf5': '0.14.1', 'vaex-server': '0.8.1', 'vaex-astro': '0.9.3', 'vaex-jupyter': '0.8.1', 'vaex-ml': '0.18.1'}
    • Vaex was installed via: pip
    • OS: Windows 10
    • Python: 3.9.10
    opened by schwingkopf 0
  • Issue when trying to use interactive heatmap (df.widget.heatmap --> NameError: name 'observe' is not defined)

    Issue when trying to use interactive heatmap (df.widget.heatmap --> NameError: name 'observe' is not defined)

    I was following your article (https://towardsdatascience.com/vaex-out-of-core-dataframes-for-python-and-fast-visualization-12c102db044a) and tried to reproduce it locally (on an Azure AML compute instance)

    Whilst most things work as expected, creating an interactive heatmap fails. I am using the yellow taxi 2015 hdf5 file you provide in your datasets section. Invoking df_yt_2015.widget.heatmap(df_yt_2015.dropoff_longitude, df_yt_2015.dropoff_latitude, shape=400, f='log1p', controls_selection=True) fails with:

    File /anaconda/envs/my_env/lib/python3.8/site-packages/vaex/jupyter/widgets.py:517, in ToolsToolbar()
        513 @traitlets.default('template')
        514 def _template(self):
        515     return load_template('vue/tools-toolbar.vue')
    --> 517 @observe('z_normalize')
        518 def _observe_normalize(self, change):
        519     self.normalize = bool(self.z_normalize)
    
    NameError: name 'observe' is not defined
    

    I didn't have time to dig further into it.

    opened by meierale 0
  • Issue on page /faq.html

    Issue on page /faq.html

    I am not able to install Vaex via pip install vaex.

    I installed Python 3.11.1. The error below is what I get when I try to install:

    Using cached numba-0.56.4.tar.gz (2.4 MB)
    Preparing metadata (setup.py) ... error
    error: subprocess-exited-with-error

    × python setup.py egg_info did not run successfully.
    │ exit code: 1
    ╰─> [8 lines of output]
        Traceback (most recent call last):
          File "<string>", line 2, in <module>
          File "<pip-setuptools-caller>", line 34, in <module>
          File "C:\Users\HP\AppData\Local\Temp\pip-install-ksldzaij\numba_1712998ee6f0470e807e228c6b892e9b\setup.py", line 51, in <module>
            _guard_py_ver()
          File "C:\Users\HP\AppData\Local\Temp\pip-install-ksldzaij\numba_1712998ee6f0470e807e228c6b892e9b\setup.py", line 48, in _guard_py_ver
            raise RuntimeError(msg.format(cur_py, min_py, max_py))
        RuntimeError: Cannot install on Python version 3.11.1; only versions >=3.7,<3.11 are supported.
        [end of output]

    note: This error originates from a subprocess, and is likely not a problem with pip.
    error: metadata-generation-failed

    × Encountered error while generating package metadata.
    ╰─> See above for output.

    note: This is an issue with the package mentioned above, not pip.
    hint: See above for details.
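The log shows the failure coming from the numba 0.56.4 source build, which rejects interpreters outside the range `>=3.7,<3.11`. A minimal sketch of the same kind of check is shown below; `python_supported` is a hypothetical helper, and the default bounds are taken only from the error message above, so they may not hold for other numba releases.

```python
import sys


def python_supported(version_info=sys.version_info, min_py=(3, 7), max_py=(3, 11)):
    """Return True if min_py <= (major, minor) < max_py.

    Default bounds come from the RuntimeError quoted in the log (an assumption
    that applies to that numba release only).
    """
    return min_py <= tuple(version_info[:2]) < max_py


# The interpreter from the report falls outside the supported range.
reported_ok = python_supported((3, 11, 1))
```

Running such a check before installation makes it clear that the interpreter version, not pip, is the cause: with these bounds Python 3.11.1 is rejected while 3.10 is accepted.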

    opened by navikaran1 0
  • [BUG-REPORT] Runtime error occurs when joining dataframes

    [BUG-REPORT] Runtime error occurs when joining dataframes

    Hi Vaex Team,

    Description

    When joining two dataframes, I get a runtime error saying "Oops, get an empty chunk, from 33112 to 33112, that should not happen". The first dataframe, df1, is filtered and aggregated. The second dataframe, df2, is simply the source dataframe filtered. The size of both the left and right dataframes is 33112. I am using a c5.xlarge EC2 instance. Can you please confirm whether this is a bug, or whether I am missing a parameter that should be passed?

    Software information

    • Vaex version: 4.14.0
    • Vaex was installed via: pip
    • OS: Linux

    Additional information

    Jupyter notebook and sample data attached. Error log attached.

    opened by vignesh-bungee 0
Releases(vaexpaper_v1)
Owner
vaex io
Big data made simple. Visualization and exploration. Machine learning and deployment.
vaex io
A pure Python implementation of Apache Spark's RDD and DStream interfaces.

pysparkling Pysparkling provides a faster, more responsive way to develop programs for PySpark. It enables code intended for Spark applications to exe

Sven Kreiss 254 Dec 06, 2022
The goal of pandas-log is to provide feedback about basic pandas operations. It provides simple wrapper functions for the most common functions that add additional logs

pandas-log The goal of pandas-log is to provide feedback about basic pandas operations. It provides simple wrapper functions for the most common funct

Eyal Trabelsi 206 Dec 13, 2022
Modin: Speed up your Pandas workflows by changing a single line of code

Scale your pandas workflows by changing one line of code To use Modin, replace the pandas import: # import pandas as pd import modin.pandas as pd Inst

8.2k Jan 01, 2023
Out-of-Core DataFrames for Python, ML, visualize and explore big tabular data at a billion rows per second 🚀

What is Vaex? Vaex is a high performance Python library for lazy Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular data

vaex io 7.7k Jan 01, 2023
High performance datastore for time series and tick data

Arctic TimeSeries and Tick store Arctic is a high performance datastore for numeric data. It supports Pandas, numpy arrays and pickled objects out-of-

Man Group 2.9k Dec 23, 2022
Create HTML profiling reports from pandas DataFrame objects

Pandas Profiling Documentation | Slack | Stack Overflow Generates profile reports from a pandas DataFrame. The pandas df.describe() function is great

10k Jan 01, 2023
Pandas Google BigQuery

pandas-gbq pandas-gbq is a package providing an interface to the Google BigQuery API from pandas Installation Install latest release version via conda

Python for Data 348 Jan 03, 2023
A Python package for manipulating 2-dimensional tabular data structures

datatable This is a Python package for manipulating 2-dimensional tabular data structures (aka data frames). It is close in spirit to pandas or SFrame

H2O.ai 1.6k Jan 05, 2023
NumPy and Pandas interface to Big Data

Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze allows Python users a familiar inte

Blaze 3.1k Jan 01, 2023
Koalas: pandas API on Apache Spark

pandas API on Apache Spark Explore Koalas docs » Live notebook · Issues · Mailing list Help Thirsty Koalas Devastated by Recent Fires The Koalas proje

Databricks 3.2k Jan 04, 2023
The easy way to write your own flavor of Pandas

Pandas Flavor The easy way to write your own flavor of Pandas Pandas 0.23 added a (simple) API for registering accessors with Pandas objects. Pandas-f

Zachary Sailer 260 Jan 01, 2023
Universal 1d/2d data containers with Transformers functionality for data analysis.

XPandas (extended Pandas) implements 1D and 2D data containers for storing type-heterogeneous tabular data of any type, and encapsulates feature extra

The Alan Turing Institute 25 Mar 14, 2022
A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner

swifter A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner. Blog posts Release 1.0.0 Fir

Jason Carpenter 2.2k Jan 04, 2023
sqldf for pandas

pandasql pandasql allows you to query pandas DataFrames using SQL syntax. It works similarly to sqldf in R. pandasql seeks to provide a more familiar

yhat 1.2k Jan 09, 2023
cuDF - GPU DataFrame Library

cuDF - GPU DataFrames NOTE: For the latest stable README.md ensure you are on the main branch. Built based on the Apache Arrow columnar memory format,

RAPIDS 5.2k Dec 31, 2022