BlazingSQL is a lightweight, GPU accelerated, SQL engine for Python. Built on RAPIDS cuDF.

Overview

A lightweight, GPU accelerated, SQL engine built on the RAPIDS.ai ecosystem.

Get Started on app.blazingsql.com

Getting Started | Documentation | Examples | Contributing | License | Blog | Try Now

BlazingSQL is a GPU accelerated SQL engine built on top of the RAPIDS ecosystem. RAPIDS is based on the Apache Arrow columnar memory format, and cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.

BlazingSQL is a SQL interface for cuDF, with various features to support large scale data science workflows and enterprise datasets.

  • Query Data Stored Externally - a single line of code can register remote storage solutions, such as Amazon S3.
  • Simple SQL - incredibly easy to use, run a SQL query and the results are GPU DataFrames (GDFs).
  • Interoperable - GDFs are immediately accessible to any RAPIDS library for data science workloads.

Try our 5-min Welcome Notebook to start using BlazingSQL and RAPIDS AI.

Getting Started

Here are two copy-and-paste reproducible BlazingSQL snippets; keep scrolling to find example notebooks below.

Create and query a table from a cudf.DataFrame with progress bar:

import cudf

df = cudf.DataFrame()

df['key'] = ['a', 'b', 'c', 'd', 'e']
df['val'] = [7.6, 2.9, 7.1, 1.6, 2.2]

from blazingsql import BlazingContext
bc = BlazingContext(enable_progress_bar=True)

bc.create_table('game_1', df)

bc.sql('SELECT * FROM game_1 WHERE val > 4') # the query progress will be shown
  key  val
0   a  7.6
1   c  7.1

Create and query a table from an AWS S3 bucket:

from blazingsql import BlazingContext
bc = BlazingContext()

bc.s3('blazingsql-colab', bucket_name='blazingsql-colab')

bc.create_table('taxi', 's3://blazingsql-colab/yellow_taxi/taxi_data.parquet')

bc.sql('SELECT passenger_count, trip_distance FROM taxi LIMIT 2')
passenger_count trip_distance
0 1.0 1.1
1 1.0 0.7

Examples

Notebook Title | Description | Try Now
Welcome Notebook | An introduction to BlazingSQL Notebooks and the GPU Data Science Ecosystem. | Launch on BlazingSQL Notebooks
The DataFrame | Learn how to use BlazingSQL and cuDF to create GPU DataFrames with SQL and Pandas-like APIs. | Launch on BlazingSQL Notebooks
Data Visualization | Plug in your favorite Python visualization packages, or use GPU accelerated visualization tools to render millions of rows in a flash. | Launch on BlazingSQL Notebooks
Machine Learning | Learn about cuML, which mirrors the Scikit-Learn API and offers GPU accelerated machine learning on GPU DataFrames. | Launch on BlazingSQL Notebooks

Documentation

You can find our full documentation at docs.blazingdb.com.

Prerequisites

  • Anaconda or Miniconda installed
  • OS Support
    • Ubuntu 16.04/18.04 LTS
    • CentOS 7
  • GPU Support
    • Pascal or Better
    • Compute Capability >= 6.0
  • CUDA Support
    • 10.1.2
    • 10.2
  • Python Support
    • 3.7
    • 3.8

Install Using Conda

BlazingSQL can be installed with conda (miniconda, or the full Anaconda distribution) from the blazingsql channel:

Stable Version

conda install -c blazingsql -c rapidsai -c nvidia -c conda-forge -c defaults blazingsql python=$PYTHON_VERSION cudatoolkit=$CUDA_VERSION

Where $CUDA_VERSION is 10.1, 10.2 or 11.0 and $PYTHON_VERSION is 3.7 or 3.8. For example, for CUDA 10.1 and Python 3.7:

conda install -c blazingsql -c rapidsai -c nvidia -c conda-forge -c defaults blazingsql python=3.7 cudatoolkit=10.1
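
The substitution of $CUDA_VERSION and $PYTHON_VERSION can be sketched as a small helper. This is a hypothetical convenience function, not part of BlazingSQL; the name conda_install_command and its nightly flag are assumptions, and the version lists and channel names are taken from this section.

```python
# Hypothetical helper (not part of BlazingSQL): assembles the conda install
# command from this section and rejects versions the README does not list.
SUPPORTED_CUDA = ("10.1", "10.2", "11.0")
SUPPORTED_PYTHON = ("3.7", "3.8")


def conda_install_command(cuda_version, python_version, nightly=False):
    if cuda_version not in SUPPORTED_CUDA:
        raise ValueError(f"unsupported CUDA version: {cuda_version}")
    if python_version not in SUPPORTED_PYTHON:
        raise ValueError(f"unsupported Python version: {python_version}")
    # The nightly build uses the -nightly variants of the first two channels.
    suffix = "-nightly" if nightly else ""
    channels = [f"blazingsql{suffix}", f"rapidsai{suffix}", "nvidia", "conda-forge", "defaults"]
    parts = ["conda", "install"]
    for channel in channels:
        parts += ["-c", channel]
    parts += ["blazingsql", f"python={python_version}", f"cudatoolkit={cuda_version}"]
    return " ".join(parts)


print(conda_install_command("10.1", "3.7"))
```

Validating the versions up front gives a clear error before conda starts resolving packages.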

Nightly Version

conda install -c blazingsql-nightly -c rapidsai-nightly -c nvidia -c conda-forge -c defaults blazingsql python=$PYTHON_VERSION  cudatoolkit=$CUDA_VERSION

Where $CUDA_VERSION is 10.1, 10.2 or 11.0 and $PYTHON_VERSION is 3.7 or 3.8. For example, for CUDA 10.1 and Python 3.7:

conda install -c blazingsql-nightly -c rapidsai-nightly -c nvidia -c conda-forge -c defaults blazingsql python=3.7  cudatoolkit=10.1

Build/Install from Source (Conda Environment)

This is the recommended way of building all of the BlazingSQL components and dependencies from source. It ensures that all the dependencies are available to the build process.

Stable Version

Install build dependencies

conda create -n bsql python=$PYTHON_VERSION
conda activate bsql
conda install --yes -c conda-forge spdlog=1.7.0 google-cloud-cpp=1.16 ninja
conda install --yes -c rapidsai -c nvidia -c conda-forge -c defaults dask-cuda=0.18 dask-cudf=0.18 cudf=0.18 ucx-py=0.18 ucx-proc=*=gpu python=$PYTHON_VERSION cudatoolkit=$CUDA_VERSION
conda install --yes -c conda-forge cmake=3.18 gtest gmock cppzmq cython=0.29 openjdk=8.0 maven jpype1 netifaces pyhive tqdm ipywidgets

Where $CUDA_VERSION is 10.1, 10.2 or 11.0 and $PYTHON_VERSION is 3.7 or 3.8. For example, for CUDA 10.1 and Python 3.7:

conda create -n bsql python=3.7
conda activate bsql
conda install --yes -c conda-forge spdlog=1.7.0 google-cloud-cpp=1.16 ninja
conda install --yes -c rapidsai -c nvidia -c conda-forge -c defaults dask-cuda=0.18 dask-cudf=0.18 cudf=0.18 ucx-py=0.18 ucx-proc=*=gpu python=3.7 cudatoolkit=10.1
conda install --yes -c conda-forge cmake=3.18 gtest gmock cppzmq cython=0.29 openjdk=8.0 maven jpype1 netifaces pyhive tqdm ipywidgets

Build

The build process checks out the BlazingSQL repository, then builds it and installs it into the conda environment.

cd $CONDA_PREFIX
git clone https://github.com/BlazingDB/blazingsql.git
cd blazingsql
git checkout main
export CUDACXX=/usr/local/cuda/bin/nvcc
./build.sh

NOTE: You can run ./build.sh -h to see more build options.

$CONDA_PREFIX now has a folder for the blazingsql repository.

Nightly Version

Install build dependencies

conda create -n bsql python=$PYTHON_VERSION
conda activate bsql
conda install --yes -c conda-forge spdlog=1.7.0 google-cloud-cpp=1.16 ninja
conda install --yes -c rapidsai-nightly -c nvidia -c conda-forge -c defaults dask-cuda=0.19 dask-cudf=0.19 cudf=0.19 ucx-py=0.19 ucx-proc=*=gpu python=$PYTHON_VERSION cudatoolkit=$CUDA_VERSION
conda install --yes -c conda-forge cmake=3.18 gtest==1.10.0=h0efe328_4 gmock cppzmq cython=0.29 openjdk=8.0 maven jpype1 netifaces pyhive tqdm ipywidgets

Where $CUDA_VERSION is 10.1, 10.2 or 11.0 and $PYTHON_VERSION is 3.7 or 3.8. For example, for CUDA 10.1 and Python 3.7:

conda create -n bsql python=3.7
conda activate bsql
conda install --yes -c conda-forge spdlog=1.7.0 google-cloud-cpp=1.16 ninja
conda install --yes -c rapidsai-nightly -c nvidia -c conda-forge -c defaults dask-cuda=0.19 dask-cudf=0.19 cudf=0.19 ucx-py=0.19 ucx-proc=*=gpu python=3.7 cudatoolkit=10.1
conda install --yes -c conda-forge cmake=3.18 gtest==1.10.0=h0efe328_4 gmock cppzmq cython=0.29 openjdk=8.0 maven jpype1 netifaces pyhive tqdm ipywidgets

Build

The build process checks out the BlazingSQL repository, then builds it and installs it into the conda environment.

cd $CONDA_PREFIX
git clone https://github.com/BlazingDB/blazingsql.git
cd blazingsql
export CUDACXX=/usr/local/cuda/bin/nvcc
./build.sh

NOTE: You can run ./build.sh -h to see more build options.

NOTE: You can perform static analysis with cppcheck with the command cppcheck --project=compile_commands.json in any of the cpp project build directories.

$CONDA_PREFIX now has a folder for the blazingsql repository.

Storage plugins

To build without the storage plugins (AWS S3, Google Cloud Storage), use the following arguments:

# Disable all storage plugins
./build.sh disable-aws-s3 disable-google-gs

# Disable AWS S3 storage plugin
./build.sh disable-aws-s3

# Disable Google Cloud Storage plugin
./build.sh disable-google-gs

NOTE: If you disable the storage plugins, you don't need to install the AWS SDK for C++ or Google Cloud Storage (or any of their dependencies) beforehand.

Documentation

User guides and public API documentation can be found here.

Documentation for our internal code architecture can be built using Sphinx.

pip install recommonmark exhale
conda install -c conda-forge doxygen
cd $CONDA_PREFIX
cd blazingsql/docs
make html

The generated documentation can be viewed in a browser at blazingsql/docs/_build/html/index.html

Community

Contributing

Have questions or feedback? Post a new github issue.

Please see our guide for contributing to BlazingSQL.

Contact

Feel free to join our channel (#blazingsql) in the RAPIDS-GoAi Slack: join RAPIDS-GoAi workspace.

You can also email us at [email protected] or find out more details on BlazingSQL.com.

License

Apache License 2.0

RAPIDS AI - Open GPU Data Science

The RAPIDS suite of open source software libraries aims to enable execution of end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.

Apache Arrow on GPU

The GPU version of Apache Arrow is a common API that enables efficient interchange of tabular data between processes running on the GPU. End-to-end computation on the GPU avoids unnecessary copying and converting of data off the GPU, reducing compute time and cost for high-performance analytics common in artificial intelligence workloads. As the name implies, cuDF uses the Apache Arrow columnar data format on the GPU. Currently, a subset of the features in Apache Arrow is supported.
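
To make "columnar data format" more concrete, here is a toy, CPU-only sketch of one Arrow-style column: a contiguous values buffer plus a validity bitmap that marks nulls using least-significant-bit numbering. The function names build_column and is_valid are illustrative assumptions; this is not how cuDF or Arrow are implemented, only the layout idea.

```python
# Illustrative only: a toy column with an Arrow-style validity bitmap.
def build_column(values):
    """Return (data, validity_bitmap, null_count) for a list that may contain None."""
    data = []
    bitmap = bytearray((len(values) + 7) // 8)  # one bit per row, padded to whole bytes
    null_count = 0
    for i, value in enumerate(values):
        if value is None:
            data.append(0)  # placeholder slot; its content is meaningless for a null row
            null_count += 1
        else:
            data.append(value)
            bitmap[i // 8] |= 1 << (i % 8)  # least-significant-bit numbering
    return data, bytes(bitmap), null_count


def is_valid(bitmap, i):
    """True when row i holds a real value rather than a null."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))


data, bitmap, nulls = build_column([7, None, 3, 4])
print(data, bitmap, nulls)
```

Because each column is a flat buffer, it can be handed to another process or library without per-row conversion, which is the interchange benefit described above.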

Comments
  • Communication c++ layer

    We had a pretty interesting experience trying to get performance and correctness while sending all of our messages between nodes using ucx-py and dask. Between the single-threaded nature of Python, the fact that dask uses tornado.ioloop, and seeing things like coroutines running at the same time while we were awaiting a ucx.send, it has been really hard to troubleshoot, and the performance isn't there for us.

    We need to send and receive messages in the c++ layer to remove the issues we had. Seeing as how we have often been hasty in trying to implement ucx as fast as possible we are going to try and be smart and slow the heck down. Hell if it takes 3 times as long to develop and 1/2 as long to debug we will come out ahead :).

    I kind of envision a few classes like this

    
    template <typename SerializerFunction, typename BufferCommunicator>
    class SenderClass {
    public:
        SenderClass(SerializerFunction serializer, BufferCommunicator bufferCommunicator);

        // stuff like broadcast, send_to_node, ...
    };

    template <typename DeserializerFunction, typename BufferAssembler>
    class ReceiverClass {
    public:
        ReceiverClass(DeserializerFunction deserializer, BufferAssembler bufferAssembler);
    };

    // SerializerFunction   ==> f(vector<column_views>) returns a list of views to rmm buffers, plus metadata for reassembling these buffers into views
    // DeserializerFunction ==> takes a list of buffers and metadata and gives us a unique_ptr to a BlazingTable
    // BufferAssembler      ==> collects all the buffers of a message and associates them with their metadata
    // BufferCommunicator   ==> can send a single buffer from one node to another and have it arrive at the other node's buffer assembler
    
    

    We can use a combination of these things to send a message and receive it on the other end with a listener that passes the buffer to the bufferAssembler. When the buffer assembler is done, the deserializer converts it to a cudf::table and a Metadata that we can use to add the message to the appropriate class.
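
The way these four pieces fit together can be sketched in plain Python. This is an in-process toy under stated assumptions, not the proposed C++ design: tables are dicts of integer lists, "sending" is a direct method call, and every name below (serialize, deserialize, InProcessCommunicator, make_handler, and so on) is hypothetical.

```python
from array import array


def serialize(table):
    """Serializer stand-in: {column name: list of ints} -> (buffers, metadata)."""
    names = list(table)
    buffers = [array("q", table[name]).tobytes() for name in names]
    return buffers, {"names": names, "num_buffers": len(buffers)}


def deserialize(buffers, metadata):
    """Deserializer stand-in: rebuilds the table from its buffers and metadata."""
    table = {}
    for name, buf in zip(metadata["names"], buffers):
        column = array("q")
        column.frombytes(buf)
        table[name] = column.tolist()
    return table


class BufferAssembler:
    """Collects the buffers of each message and pairs them with their metadata."""

    def __init__(self, deserializer, on_message):
        self.deserializer = deserializer
        self.on_message = on_message
        self.pending = {}  # message_id -> (metadata, {buffer index: buffer})

    def receive(self, message_id, metadata, index, buffer):
        _, parts = self.pending.setdefault(message_id, (metadata, {}))
        parts[index] = buffer
        if len(parts) == metadata["num_buffers"]:  # message is complete
            del self.pending[message_id]
            ordered = [parts[i] for i in range(metadata["num_buffers"])]
            self.on_message(message_id, self.deserializer(ordered, metadata))


class InProcessCommunicator:
    """BufferCommunicator stand-in: delivers one buffer to a node's assembler."""

    def __init__(self, assemblers):
        self.assemblers = assemblers  # node name -> BufferAssembler

    def send_buffer(self, node, message_id, metadata, index, buffer):
        self.assemblers[node].receive(message_id, metadata, index, buffer)


class Sender:
    def __init__(self, serializer, communicator):
        self.serializer = serializer
        self.communicator = communicator

    def send_to_node(self, node, message_id, table):
        buffers, metadata = self.serializer(table)
        for index, buffer in enumerate(buffers):
            self.communicator.send_buffer(node, message_id, metadata, index, buffer)

    def broadcast(self, nodes, message_id, table):
        for node in nodes:
            self.send_to_node(node, message_id, table)


received = {"node_b": {}, "node_c": {}}


def make_handler(node):
    def handle(message_id, table):
        received[node][message_id] = table
    return handle


assemblers = {node: BufferAssembler(deserialize, make_handler(node)) for node in received}
sender = Sender(serialize, InProcessCommunicator(assemblers))
sender.broadcast(["node_b", "node_c"], "query0_kernel3", {"key": [1, 2, 3], "val": [10, 20, 30]})
print(received["node_b"]["query0_kernel3"])
```

Keeping the serializer and the communicator as injected parameters mirrors the template parameters above: either one can be swapped (for example, a UCX-backed communicator) without touching the sender logic.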

    Design 
    opened by felipeblazing 28
  • Blazingsql cannot process large files

    blazingsql cannot process csv files larger than 3 GB, and the message "out of memory" is displayed. The number of GPUs is 4 and the total memory is 6 GB. The failure occurs whether a single GPU or multiple GPUs are used.

    If multiple GPUs are used, the following information is displayed: distributed.nanny - WARNING - Restarting worker

    question 
    opened by Wxinxiny 21
  • [REVIEW] Enabling E2E tests with null data

    Enabling this new env flag BLAZINGSQL_E2E_TEST_WITH_NULLS, I got:

    TOTAL SUMMARY for test suite: 
    PASSED: 1839/1945
    FAILED: 106/1945
    CRASH: 0/1945
    TOTAL: 1945
    saveLog = false
    MAX DELTA: 192.0
    
    opened by rommelDB 19
  • Allow for concurrent queries from a single BlazingContext

    Right now, when you call bc.sql(), execution of the Python script halts until the function returns with the result of the query. You used to be able to use the return_futures option, but that feature is now obsolete due to https://github.com/BlazingDB/blazingsql/pull/1289

    On the other hand https://github.com/BlazingDB/blazingsql/pull/1289 makes it easy to implement multiple concurrent queries.

    This feature request is to propose an API and user experience for multiple concurrent queries support from a single BlazingContext.

    The proposed API would be something like the following (Proposed API A):

    query0 = 'SELECT * FROM my_table where columnA > 0'
    query1 = 'SELECT * FROM my_table where columnB < 0'
    token0 = bc.sql(query0, return_token=True)  
    token1 = bc.sql(query1, return_token=True)
    result0 = bc.fetch(token0)
    result1 = bc.fetch(token1)
    

    In this case token0 and token1 would be int32s which are actually just the queryId. In this case bc.fetch would halt execution until the results are available. We would also implement a function (which would be optional) that would look like this: done = bc.is_query_done(token0) which would return a boolean, simply indicating if the query is done.
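
The user experience of Proposed API A can be mocked up with the standard library. This is only a sketch of the call pattern: TokenContext and fake_engine are hypothetical stand-ins, a thread pool replaces the real engine, and nothing here reflects how BlazingContext is implemented.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


class TokenContext:
    """Hypothetical stand-in for a context that supports Proposed API A."""

    def __init__(self, run_query, max_workers=2):
        self._run_query = run_query
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}  # token (query id) -> Future
        self._ids = itertools.count()

    def sql(self, query, return_token=False):
        future = self._pool.submit(self._run_query, query)
        if not return_token:
            return future.result()  # blocking behavior, as today
        token = next(self._ids)
        self._futures[token] = future
        return token

    def is_query_done(self, token):
        return self._futures[token].done()

    def fetch(self, token):
        # Blocks until the result is available, then forgets the token.
        return self._futures.pop(token).result()


def fake_engine(query):
    # Placeholder for real query execution.
    return f"rows for: {query}"


bc = TokenContext(fake_engine)
token0 = bc.sql("SELECT * FROM my_table where columnA > 0", return_token=True)
token1 = bc.sql("SELECT * FROM my_table where columnB < 0", return_token=True)
result0 = bc.fetch(token0)
result1 = bc.fetch(token1)
print(result0, result1, sep="\n")
```

In this sketch a token is consumed by fetch, so a second fetch on the same token raises KeyError; whether tokens should be reusable is one of the API questions left open here.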

    Other ways we could do this are as follows. Proposed API B:

    token0 = bc.async_sql(query)
    is_done = bc.async_sql(token0, get_status=True)  # this is the optional is_query_done API
    result = bc.async_sql(token0)  # here it's the same API, but since we are passing in an int instead of a string we would know that we are getting the result
    

    Proposed API C:

    token0 = bc.sql(query, return_token=True)
    is_done = bc.sql(token0, get_status=True)  # this is the optional is_query_done API
    result = bc.sql(token0)  # here it's the same API, but since we are passing in an int instead of a string we would know that we are getting the result
    

    Feel free to propose other APIs.

    Internally, this would just use the APIs that are now part of https://github.com/BlazingDB/blazingsql/pull/1289, which allow us to start a query, check its status and get the results. When multiple queries are running at the same time, each query has its own graph, and each graph generates compute tasks. The compute tasks are then processed by the executor as resources allow. Right now the tasks would be processed FIFO (with a certain amount of parallelism depending on resources and configuration). Eventually we can set prioritization policies for which tasks get done first. For example, tasks from the first query to start are given priority, or tasks which are most likely to reduce memory pressure are prioritized, etc.
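
The difference between FIFO processing and a "first query to start goes first" policy can be illustrated with a toy task list. This is a sketch, not the executor's code: tasks are (query_id, task_name) pairs, a lower query_id is assumed to mean the query started earlier, and both function names are hypothetical.

```python
import heapq
from collections import deque


def run_fifo(tasks):
    """Process tasks strictly in arrival order."""
    queue = deque(tasks)
    order = []
    while queue:
        order.append(queue.popleft())
    return order


def run_oldest_query_first(tasks):
    """Prefer tasks from the query that started first (lowest query_id);
    arrival order breaks ties within a query."""
    heap = [(query_id, arrival, name) for arrival, (query_id, name) in enumerate(tasks)]
    heapq.heapify(heap)
    order = []
    while heap:
        query_id, _, name = heapq.heappop(heap)
        order.append((query_id, name))
    return order


# Tasks generated by two query graphs, interleaved as they arrive.
arrivals = [(0, "scan"), (1, "scan"), (0, "filter"), (1, "filter"), (0, "project")]
print(run_fifo(arrivals))
print(run_oldest_query_first(arrivals))
```

A memory-pressure policy would follow the same shape, with an estimate of freed memory as the heap key instead of the query id.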

    opened by wmalpica 19
  • [REVIEW] Fix `CC`/`CXX` variables in CI

    This PR adds CC, CXX, and CUDAHOSTCXX entries to the build.script_env section of the conda recipe, so that those environment variables get passed through the build.sh script and ultimately to CMake. This enables CMake to use the correct versions of gcc and g++ when compiling.

    Additionally, it includes some fixes for the upstream cudf changes in https://github.com/rapidsai/cudf/pull/8142

    opened by ajschmidt8 16
  • Implement a prototype to create a table from other RDBMS

    Proposal 1 Direct approach using only create table semantic:

    bc.create_table("dept_emp", "mysql://lucho:[email protected]:3306/sampledb/dept_emp")
    bc.create_table("titles", "postgres://luis:[email protected]:3306/testdb/subchema/titles")
    

    Proposal 2 2 step approach, similar to what we have with storage registration:

    bc.mysql("mysqldb1", "mysql://lucho:[email protected]:3306/sampledb")
    bc.create_table("dept_emp", "mysqldb1/dept_emp")
    
    bc.postgres("pgdb1", "postgres://luis:[email protected]:3306/testdb")
    bc.create_table("titles", "pgdb1/subchema/titles")
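
For Proposal 1, the connection URI has to be split into its parts before a table can be created. The sketch below shows one way to do that with the standard library. The function parse_table_uri, the returned field names, and the example hosts and credentials are all made up for illustration; the path convention (/database/table or /database/schema/table) is inferred from the two examples above.

```python
from urllib.parse import urlsplit


def parse_table_uri(uri):
    """Split engine://user:password@host:port/database[/schema]/table into parts."""
    parts = urlsplit(uri)
    path = [segment for segment in parts.path.split("/") if segment]
    if len(path) not in (2, 3):
        raise ValueError("expected /database/table or /database/schema/table")
    return {
        "engine": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": path[0],
        "schema": path[1] if len(path) == 3 else None,
        "table": path[-1],
    }


# Hypothetical URIs in the same shape as the proposals above.
mysql_info = parse_table_uri("mysql://alice:secret@db.example.com:3306/sampledb/dept_emp")
pg_info = parse_table_uri("postgres://bob:secret@pg.example.com:5432/testdb/myschema/titles")
print(mysql_info)
print(pg_info)
```

Proposal 2 would run the same parsing once at registration time and keep only the schema and table parts in the create_table call.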
    

    cc @williamBlazing @felipeblazing @rommelDB

    feature request 
    opened by aucahuasi 15
  • [BUG] parallel_for failed: cudaErrorIllegalAddress: an illegal memory access was encountered

    Describe the bug: Crash when using the example from https://blog.blazingdb.com/data-visualization-with-blazingsql-12095862eb73

    Steps/Code to reproduce bug: Run the sample code (attached as s3-test.py.txt).

    Expected behavior: No illegal memory access exception.

    Environment overview (please complete the following information)

    • Environment location: Bare metal conda Python 3.7.7 (default, Mar 23 2020, 22:36:06) [GCC 7.3.0] :: Anaconda, Inc. on linux Debian 10 CUDA 10.2

    • Method of cuDF install:

    conda install -c blazingsql/label/cuda10.2 -c blazingsql -c rapidsai -c nvidia -c conda-forge -c defaults blazingsql python=3.7

    Environment details: PATH=/opt/miniconda3/bin:/opt/miniconda3/condabin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/usr/local/go/bin

    Additional context: Code attached. Other tests using blazingsql worked fine on this box.

    Output:

    listening: tcp://*:22758
    2020-06-14T15:56:41Z|-78920688|TRACE|deregisterFileSystem: filesystem authority not found
    CacheDataLocalFile: /tmp/.blazing-temp-D63WqK6ZgzRBOMd0kxS4CzTDNC69hqAn1vlzzPGIjU8ijs78nLFqpShVKo8Qkdmm.orc
    terminate called after throwing an instance of 'thrust::system::system_error'
    what(): parallel_for failed: cudaErrorIllegalAddress: an illegal memory access was encountered
    distributed.nanny - WARNING - Restarting worker
    BlazingContext ready
    distributed.nanny - WARNING - Worker process still alive after 3 seconds, killing

    After crash, nvidia-smi shows below, main python process is hung:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  TITAN V             On   | 00000000:01:00.0 Off |                  N/A |
    | 29%   43C    P8    26W / 250W |    640MiB / 12066MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0     31262      C   python                                       627MiB |
    +-----------------------------------------------------------------------------+
    
    

    The first time I ran it, it created a number of .orc files in /tmp before crashing with the above error. Another time it gave:

    listening: tcp://*:22170
    BlazingContext ready
    2020-06-14T16:59:14Z|-682139984|TRACE|deregisterFileSystem: filesystem authority not found
    distributed.nanny - WARNING - Restarting worker
    distributed.nanny - WARNING - Worker process still alive after 3 seconds, killing
    Unable to start CUDA Context
    Traceback (most recent call last):
      File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/dask_cuda/initialize.py", line 108, in dask_setup
        numba.cuda.current_context()
      File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/numba/cuda/cudadrv/devices.py", line 212, in get_context
        return _runtime.get_or_create_context(devnum)
      File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/numba/cuda/cudadrv/devices.py", line 138, in get_or_create_context
        return self._get_or_create_context_uncached(devnum)
      File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/numba/cuda/cudadrv/devices.py", line 153, in _get_or_create_context_uncached
        return self._activate_context_for(0)
      File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/numba/cuda/cudadrv/devices.py", line 169, in _activate_context_for
        newctx = gpu.get_primary_context()
      File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 529, in get_primary_context
        driver.cuDevicePrimaryCtxRetain(byref(hctx), self.id)
      File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 295, in safe_cuda_api_call
        self._check_error(fname, retcode)
      File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 330, in _check_error
        raise CudaAPIError(retcode, msg)
    numba.cuda.cudadrv.driver.CudaAPIError: [304] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_OPERATING_SYSTEM
    pure virtual method called
    terminate called without an active exception

    Edit: Tried accessing the files locally instead of from S3 and reproduced the same error. As soon as memory fills up, or after several .orc files have been created (it varies), it hits an illegal memory access or the worker dies. If the worker restarts successfully, it does not appear to process anything. Using the non-dask/cluster version of BlazingContext, it says 'Killed' as soon as it runs out of memory on the GPU. Processing only one input file works fine, as it does not run out of memory.

    bug 
    opened by threedliteguy 15
  • [BUG]  b'In function ddlCreateTableService: cannot create the table: Could not create table'

    (rapids_blazing) sh-4.2$ conda list | grep blazing
    blazingdb-toolchain 0.4.0 py37hf484d3e_0 blazingsql
    blazingsql-calcite 0.4.0 py37_0 blazingsql
    blazingsql-communication 0.4.0 py37_80 blazingsql
    blazingsql-io 0.4.0 py37_31 blazingsql
    blazingsql-orchestrator 0.4.0 py37_19 blazingsql
    blazingsql-protocol 0.4.0 py37_25 blazingsql
    blazingsql-python 0.4.0 cuda10.0_py37_14 blazingsql/label/cuda10.0
    blazingsql-ral 0.4.0 cuda10.0_py37_5 blazingsql/label/cuda10.0

    Yesterday I was able to create a table based on a cudf DataFrame, but today I'm getting errors. I have already recreated the whole instance and repeated all the steps, but I'm still seeing errors similar to the following: b'In function ddlCreateTableService: cannot create the table: Could not create table'

    There is no more information. Any ideas on how I can trace the error? In addition, it seems that in version 0.4.2 there will be some changes in how BlazingContext launches processes; could it be related to this? When will this release be conda-installable?

    If I execute BlazingContext() again and try to create the table, I can get two different kinds of errors:

    1) Already connected to the Orchestrator
       b'In function ddlCreateTableService: cannot create the table: Connection to server failed.'

    2) WARNING: blazingsql-orchestrator was not automativally started, its probably already running
       WARNING: blazingsql-engine was not automativally started, its probably already running
       WARNING: blazingsql-algebra was not automativally started, its probably already running
       Already connected to the Orchestrator
       Unexpected error on create_table, can only concatenate str (not "tuple") to str

    bug 
    opened by ivenzor 15
  • [REVIEW] Implement string REPLACE

    This PR:

    • Implements the REPLACE operator for string columns using cudf::strings::replace
    • Adds a new end-to-end test in stringsTests.py, and updates the runTest.py
    • Refactors the removal of the string encapsulation characters (i.e., the single quotes in LIKE '%the%') in several parts of LogicalProjection.cpp to use a string utility function

    If the implementation looks fine, I'll push the new parquet file to https://github.com/BlazingDB/blazingsql-testing-files and update the CHANGELOG to unblock gpuCI.

    This closes https://github.com/BlazingDB/blazingsql/issues/1175

    • [x] passes e2e tests locally
    • [x] PR to update testing files https://github.com/BlazingDB/blazingsql-testing-files/pull/1

    from pyspark.sql import SparkSession
    from blazingsql import BlazingContext
    import pandas as pd

    # Create the Spark session and BlazingContext used below
    # (skip these two statements if they already exist in your session).
    spark = SparkSession.builder \
        .master("local") \
        .getOrCreate()

    bc = BlazingContext()

    df = pd.DataFrame({
        "a": ["Felipe", "William", "Rodrigo"],
        "b": [2, 4, 6],
        "c": ["2020-11-20", "2020-11-19", "2020-11-18"]
    })

    # Register the same data in both engines under the name "df"
    bc.create_table("df", df)
    sdf = spark.createDataFrame(df)
    sdf.createOrReplaceTempView("df")

    query = """
    SELECT
        a,
        REPLACE(a, 'i', '##') as a_new,
        c,
        REPLACE(c, '2020', '1999') as c_new
    FROM df
    """

    # Spark result (reference), then the BlazingSQL plan and result
    spark.sql(query).show()

    print(bc.explain(query))
    print(bc.sql(query))

    +-------+---------+----------+----------+
    |      a|    a_new|         c|     c_new|
    +-------+---------+----------+----------+
    | Felipe|  Fel##pe|2020-11-20|1999-11-20|
    |William|W##ll##am|2020-11-19|1999-11-19|
    |Rodrigo| Rodr##go|2020-11-18|1999-11-18|
    +-------+---------+----------+----------+
    
    LogicalProject(a=[$0], a_new=[REPLACE($0, 'i', '##')], c=[$1], c_new=[REPLACE($1, '2020', '1999')])
      BindableTableScan(table=[[main, df]], projects=[[0, 2]], aliases=[[a, a_new, c, c_new]])
    
             a      a_new           c       c_new
    0   Felipe    Fel##pe  2020-11-20  1999-11-20
    1  William  W##ll##am  2020-11-19  1999-11-19
    2  Rodrigo   Rodr##go  2020-11-18  1999-11-18
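
For readers without a GPU at hand, the expected REPLACE results above can be reproduced in plain Python. This is a reference sketch only: sql_replace and strip_quotes are hypothetical helpers, not the cudf::strings::replace implementation, and the empty-target case is not handled the way a SQL engine might handle it.

```python
def strip_quotes(literal):
    """Remove the single quotes that wrap a SQL string literal, e.g. "'i'" -> "i"."""
    if len(literal) >= 2 and literal[0] == "'" and literal[-1] == "'":
        return literal[1:-1]
    return literal


def sql_replace(column, target_literal, replacement_literal):
    """Apply REPLACE(column, target, replacement) to a list of strings; nulls stay null."""
    target = strip_quotes(target_literal)
    replacement = strip_quotes(replacement_literal)
    return [None if value is None else value.replace(target, replacement) for value in column]


a_new = sql_replace(["Felipe", "William", "Rodrigo"], "'i'", "'##'")
c_new = sql_replace(["2020-11-20", "2020-11-19", "2020-11-18"], "'2020'", "'1999'")
print(a_new)
print(c_new)
```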
    
    opened by beckernick 13
  • Barriers Required for Distributed execution.

    Right now Kernels handle distribution and ensuring completeness so that they continue when they have to communicate. Here is an example of what that looks like in aggregation.

    Below we are iterating through batches that this kernel gets from its input cache, partitioning them and sending each node its corresponding partition. We store a count of how many partitions we sent to each node and how many we kept for ourselves.

    // Excerpt from the kernel's run function: input, columns_to_hash, num_partitions,
    // self_node, node_count, output_cache and batch_count are defined elsewhere in the kernel.
    while (input.wait_for_next()) {
        auto batch = input.next();
        CudfTableView batch_view = batch->view();
        std::vector<CudfTableView> partitioned;
        std::unique_ptr<CudfTable> hashed_data; // Keep table alive in this scope
        if (batch_view.num_rows() > 0) {
            std::vector<cudf::size_type> hashed_data_offsets;
            std::tie(hashed_data, hashed_data_offsets) = cudf::hash_partition(batch->view(), columns_to_hash, num_partitions);
            // the offsets returned by hash_partition will always start at 0, which is a value we want to ignore for cudf::split
            std::vector<cudf::size_type> split_indexes(hashed_data_offsets.begin() + 1, hashed_data_offsets.end());
            partitioned = cudf::split(hashed_data->view(), split_indexes);
        } else {
            // copy empty view
            for (auto i = 0; i < num_partitions; i++) {
                partitioned.push_back(batch_view);
            }
        }

        // Keep our own partition locally and send every other node its partition,
        // counting how many partitions each node receives.
        ral::cache::MetadataDictionary metadata;
        for (int i = 0; i < this->context->getTotalNodes(); i++) {
            auto partition = std::make_unique<ral::frame::BlazingTable>(partitioned[i], batch->names());
            if (this->context->getNode(i) == self_node) {
                this->output_.get_cache()->addToCache(std::move(partition), "", true);
                node_count[self_node.id()]++;
            } else {
                node_count[this->context->getNode(i).id()]++;
                output_cache->addCacheData(std::make_unique<ral::cache::GPUCacheDataMetaData>(std::move(partition), metadata), "", true);
            }
        }
        batch_count++;
    }
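
The partition-and-count bookkeeping in the loop above can be followed more easily in a CPU-only Python sketch. Assumptions: rows are tuples, crc32 stands in for cudf's hash function, and an "outbox" dict stands in for the caches; hash_partition and distribute here are hypothetical helpers, not the cudf or BlazingSQL functions.

```python
import zlib


def hash_partition(rows, key_index, num_partitions):
    """Assign each row to a partition based on a hash of its key column."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        bucket = zlib.crc32(str(row[key_index]).encode()) % num_partitions
        partitions[bucket].append(row)
    return partitions


def distribute(batches, key_index, nodes):
    """Split every batch into one partition per node and count what each node gets."""
    node_count = {node: 0 for node in nodes}
    outbox = {node: [] for node in nodes}
    for batch in batches:
        for node, partition in zip(nodes, hash_partition(batch, key_index, len(nodes))):
            outbox[node].append(partition)  # empty partitions are sent too
            node_count[node] += 1
    return outbox, node_count


nodes = ["node_a", "node_b", "node_c"]
batches = [[("a", 1), ("b", 2), ("a", 3)], [("c", 4)], []]
outbox, node_count = distribute(batches, key_index=0, nodes=nodes)
print(node_count)
```

Every node receives exactly one partition per batch, empty or not, which is what makes the per-node counts usable as a completion signal later on.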
    
    

    After this code executes we send each node a count of how many partitions we sent them.

    
    auto self_node = ral::communication::CommunicationData::getInstance().getSelfNode();
    auto nodes = context->getAllNodes();
    std::string worker_ids = "";
    
    
    for(std::size_t i = 0; i < nodes.size(); ++i) {
        if(!(nodes[i] == self_node)) {
            ral::cache::MetadataDictionary metadata;
            messages_to_wait_for.push_back(metadata.get_values()[ral::cache::QUERY_ID_METADATA_LABEL] + "_" +
                                    metadata.get_values()[ral::cache::KERNEL_ID_METADATA_LABEL] +	"_" +
                                    metadata.get_values()[ral::cache::WORKER_IDS_METADATA_LABEL]);
            this->query_graph->get_output_cache()->addCacheData(
                std::unique_ptr<ral::cache::GPUCacheData>(new 
                   ral::cache::GPUCacheDataMetaData(ral::utilities::create_empty_table({}, {}), metadata)),"",true);
        }
    }
    
    
    

    Then we collect all of the partition counts from each worker node. After this we sum them up and wait for our output cache to have that many partitions before we can say this kernel is finished.

    
    auto self_node = ral::communication::CommunicationData::getInstance().getSelfNode();
    int total_count = node_count[self_node.id()];
    for (auto message : messages_to_wait_for){
        auto meta_message = this->query_graph->get_input_cache()->pullCacheData(message);
        total_count += std::stoi(static_cast<ral::cache::GPUCacheDataMetaData *>(meta_message.get())->getMetadata().get_values()[ral::cache::PARTITION_COUNT]);
    }
    this->output_cache()->wait_for_count(total_count);
    
    

    We want to abstract away a few of the things that are happening here. We often follow this pattern of spreading data out and then hitting a barrier before we can continue. We want to remove this code from the kernel run function itself and have a more generic way of expressing things like this.

    As we discuss and implement the movement towards scheduling tasks to be run we need to have primitives that can do things like :

    • create a broadcast to all and expect broadcast from all primitive
    • create a method for preventing tasks from either being scheduled or run by the scheduler until some kind of condition is met (e.g. wait_for_count but disassociated from the actual run function so it is something that can be "injected" preferably through something like composition).
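
One possible shape for such a primitive is a counter-based barrier that lives outside the task and is composed with it. The sketch below uses threads in a single process; CountBarrier, expected_total and run_after are hypothetical names, and the real design would have to work across nodes and inside the scheduler.

```python
import threading


class CountBarrier:
    """Blocks until an expected number of partitions has arrived."""

    def __init__(self):
        self._cond = threading.Condition()
        self._arrived = 0

    def add(self, n=1):
        with self._cond:
            self._arrived += n
            self._cond.notify_all()

    def wait_for_count(self, expected, timeout=None):
        # Returns True once enough partitions arrived, False on timeout.
        with self._cond:
            return self._cond.wait_for(lambda: self._arrived >= expected, timeout)


def expected_total(own_count, counts_from_peers):
    """Partitions we kept for ourselves plus the counts every peer reported sending us."""
    return own_count + sum(counts_from_peers)


def run_after(barrier, expected, task, timeout=None):
    """Run task only once the barrier condition holds; the task itself stays barrier-free."""
    if not barrier.wait_for_count(expected, timeout):
        raise TimeoutError("barrier condition was not met in time")
    return task()


barrier = CountBarrier()
total = expected_total(2, [3, 1])  # 2 local partitions, peers report 3 and 1
producer = threading.Thread(target=barrier.add, args=(total,))
producer.start()
result = run_after(barrier, total, lambda: "kernel finished", timeout=5)
producer.join()
print(result)
```

Because run_after takes the barrier as an argument, the same task can be scheduled with a different condition (or none) without changing its body, which is the composition idea mentioned in the second bullet.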
    Design 
    opened by felipeblazing 13
  • [REVIEW] fix latest cudf dependencies

    This PR contains fixes for build issues (cc @romulo-auccapuclla). It also contains fixes for the new Arrow 4.0.1 API (due to cudf-nightly) and a fix due to https://github.com/rapidsai/cudf/pull/8692 (related to the conda env name). Note: something that worries me is that HiveFileTest has lately been crashing randomly with a std::bad_alloc (query 01 CSV, which was commented out).

    opened by Christian8491 11
  • Bump calcite-core from 1.23.0 to 1.32.0 in /algebra/blazingdb-calcite-core

    Bumps calcite-core from 1.23.0 to 1.32.0.

    Commits
    • 413eded [CALCITE-5275] Release Calcite 1.32.0
    • 57aafa3 Cosmetic changes to release notes
    • 2624925 [CALCITE-5262] Add many spatial functions, including support for WKB (well-kn...
    • 479afa6 [CALCITE-5278] Upgrade Janino from 3.1.6 to 3.1.8
    • 1167b12 [CALCITE-5270] JDBC adapter should not generate 'FILTER (WHERE)' in Firebolt ...
    • 89c940c [CALCITE-5241] Implement CHAR function for MySQL and Spark, also JDBC '{fn CH...
    • d20fd09 [CALCITE-5274] Improve DocumentBuilderFactory in DiffRepository test class by...
    • 6302e6f [CALCITE-5277] Make EnumerableRelImplementor stashedParameters order determin...
    • baeecc8 [CALCITE-5251] Support SQL hint for Snapshot
    • ba80b91 [CALCITE-5263] Improve XmlFunctions by using an XML DocumentBuilder
    • Additional commits viewable in compare view


    dependencies java 
    opened by dependabot[bot] 0
  • [BUG] Cannot import BlazingContext when processor type unknown


    Describe the bug: Cannot import BlazingContext when processor type unknown.

    Steps/Code to reproduce bug: Code and output from ipython (personal info hidden).

    In [1]: from blazingsql import BlazingContext
    ---------------------------------------------------------------------------
    FileNotFoundError                         Traceback (most recent call last)
    <ipython-input-7-0b19b5b41f48> in <module>
    ----> 1 from blazingsql import BlazingContext
    
    ~/miniconda3/envs/blazingsql/lib/python3.7/site-packages/blazingsql/__init__.py in <module>
          1 from pyblazing.apiv2 import S3EncryptionType
          2 from pyblazing.apiv2 import DataType
    ----> 3 from pyblazing.apiv2.context import BlazingContext
          4
          5 from cio import getProductDetailsCaller
    
    ~/miniconda3/envs/blazingsql/lib/python3.7/site-packages/pyblazing/apiv2/context.py in <module>
        105         )
        106
    --> 107 jpype.startJVM("-ea", convertStrings=False, jvmpath=jvm_path)
        108 # jpype.startJVM()
        109
    
    ~/miniconda3/envs/blazingsql/lib/python3.7/site-packages/jpype/_core.py in startJVM(*args, **kwargs)
        225     try:
        226         _jpype.startup(jvmpath, tuple(args),
    --> 227                        ignoreUnrecognized, convertStrings, interrupt)
        228         initializeResources()
        229     except RuntimeError as ex:
    
    FileNotFoundError: [Errno 2] JVM DLL not found: /home/{my_username}/miniconda3/envs/blazingsql/lib/server/libjvm.so
    
    
    In [2]: !uname -p
    unknown
    

    Expected behavior Should be imported without any errors.

    Environment overview (please complete the following information)

    • Environment location: Bare-metal
    • Method of BlazingSQL install: conda
    • BlazingSQL Version
    BlazingSQL version (git hash): 13618d177a37bd34bb20ac832fb8a14f8243ff5c
    BlazingSQL branch name: HEAD
    BlazingSQL branch tag: v21.08.02
    BlazingSQL build id: 0
    BlazingSQL compiler version: GNU /usr/local/gcc9/bin/g++ 9.4.0
    BlazingSQL cuda flags: -Xcompiler -Wno-parentheses -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 --expt-extended-lambda --expt-relaxed-constexpr -Werror=cross-execution-space-call -Xcompiler -Wall,-Wno-error=deprecated-declarations --default-stream=per-thread -DHT_DEFAULT_ALLOCATOR
    BlazingSQL Operating system kernel: Linux-5.4.0-1054-aws
    BlazingSQL Operating system architecture: x86_64
    BlazingSQL Linux Operating system release: NAME=CentOS Linux|VERSION=7 (Core)|ID=centos|ID_LIKE=rhel fedora|VERSION_ID=7|PRETTY_NAME=CentOS Linux 7 (Core)|ANSI_COLOR=031|CPE_NAME=cpe:/o:centos:centos:7|HOME_URL=https://www.centos.org/|BUG_REPORT_URL=https://bugs.centos.org/||CENTOS_MANTISBT_PROJECT=CentOS-7|CENTOS_MANTISBT_PROJECT_VERSION=7|REDHAT_SUPPORT_PRODUCT=centos|REDHAT_SUPPORT_PRODUCT_VERSION=7|
    

    ----For BlazingSQL Developers---- Suspected source of the issue https://github.com/BlazingDB/blazingsql/blob/branch-21.08/pyblazing/pyblazing/apiv2/context.py#L70

    machine_processor = platform.processor()
    
    if machine_processor in ("x86_64", "x64"):
        machine_processor = "amd64"
    

    When uname -p returns unknown, platform.processor() returns '', so machine_processor is empty, which leads to a wrong JVM library path.
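One possible workaround, sketched below, is to fall back to platform.machine() whenever platform.processor() yields an empty or "unknown" value. The helper name resolve_jvm_arch and its optional parameters are hypothetical (they exist only to make the logic easy to test); this is not the fix adopted by the project, just an illustration of the fallback.

```python
import platform


def resolve_jvm_arch(processor=None, machine=None):
    """Hypothetical helper: pick an architecture name for building the JVM lib path.

    Falls back to platform.machine() when platform.processor() is empty or
    'unknown', which is the situation described in this report.
    """
    if processor is None:
        processor = platform.processor()
    if machine is None:
        machine = platform.machine()

    candidate = processor.strip()
    if candidate in ("", "unknown"):
        candidate = machine.strip()

    # Same normalisation as the snippet quoted from context.py.
    if candidate in ("x86_64", "x64"):
        return "amd64"
    return candidate
```

Calling resolve_jvm_arch() with no arguments uses the values reported by the running interpreter.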

    bug ? - Needs Triage 
    opened by callofdutyops 2
  • [BUG] app.blazing.com website not reachable


    Describe the bug Not able to open any app on the app deployment

    https://app.blazingsql.com/jupyter/user-redirect/lab/workspaces/auto-b/tree/Welcome_to_BlazingSQL_Notebooks/welcome.ipynb

    Steps/Code to reproduce bug: Open a browser and try to load https://app.blazingsql.com. Error: Site could not be reached.

    Expected behavior Should be able to open the website

    Other design considerations: It is very difficult to set up BlazingSQL on Google Colab; I am running into too many version compatibility issues. If someone can help me with that setup, that would work as well.

    bug ? - Needs Triage 
    opened by shailee-m 1
  • Bump liquibase-core from 3.6.2 to 4.8.0 in /algebra/blazingdb-calcite-application


    Bumps liquibase-core from 3.6.2 to 4.8.0.

    Release notes

    Sourced from liquibase-core's releases.

    v4.8.0

    Liquibase 4.8.0 release

    Please report any issues to https://github.com/liquibase/liquibase/issues.

    Notable Changes

    Liquibase 4.8.0 introduces the following functionality:

    • The init hub subcommand that connects your local Liquibase activity to Liquibase Hub and sets up the Liquibase environment to use Liquibase Hub. [DAT-8769]

    Note: For more information, see init hub and Getting Started with Liquibase Hub.

    • [PRO] The sqlcmd utility support to process complex SQL for MSSQL Server. Liquibase provides the liquibase.sqlcmd.conf file to pass arguments to your executor when running Liquibase Pro. [DAT-7447]

    Note: For more information, see Using the SQLCMD integration and runWith attribute with Liquibase Pro and MSSQL Server.

    • Changes to the behavior of the XML parser, which no longer allows referencing external DTD files for security reasons. If you use externally defined entities or any other potentially insecure XML feature in your changelogs, set liquibase.secureParsing=false. [PR#2384] [LB-2218]

    Note: For more information about the ways to set the parameter, see Command Parameters.

    • The upgrade of the postgresql (from 42.2.12 to 42.3.2) and h2 (from 2.0.206 to 2.1.210) drivers that Liquibase includes in the installation package. If you use those drivers and upgrade an existing Liquibase installation, remove the earlier versions of drivers from the LIQUIBASE_HOME/lib directory.

    Enhancements

    • Implemented the SimpleObjectConstructor interface for DB2 on z/OS [DAT-8580]
    • Included the CLI instructions on how to use the properties file with a nonstandard name when running the init project subcommand [DAT-9041]
    • Improved the output message for init start-h2 when the H2 database driver is specified, but there is no connection detected [DAT-8992]
    • Added validation errors for the enableCheckConstraint, disableCheckConstraint, dropPackage, dropPackageBody Change Types [DAT-9017]
    • [PR#2367] [Mike Olivas] Added example rollback scripts to the example-changelog.sql file [LB-2220]
    • [PR#1648] [Daniel Gray] Improved the exception error message for the customChange node with no class attribute [LB-1144]
    • [PR#2222] [msimko81] Added the offline mode support for the rollback-sql <tag> operation [LB-2198]
    • [PR#2273] [Tsvi Zandany] Added the autocomplete quality checks commands for macOS
    • [PR#2308] [Valentin Blistin] Added the close method for the ClassLoaderResourceAccessor class [LB-2205]

    Fixes

    ... (truncated)

    Changelog

    Sourced from liquibase-core's changelog.

    Liquibase Core Changelog

    Changes in version 4.8.0 (2022.02.23)

    Notable Changes

    Liquibase 4.8.0 introduces a built-in SQLCMD integration that allows you to specify the runWith parameter with the sqlcmd custom executor to process complex SQL for MSSQL Server. Liquibase provides the liquibase.sqlcmd.conf file to pass arguments to your executor when running Liquibase Pro.

    For new and existing Liquibase Hub users, Liquibase 4.8.0 introduces the init hub command, used in Hub’s Getting Started on-boarding. Users can get defaults and changelog files setup, working, and registered to Hub with just this one command.

    Enhancements

    • Implemented the SimpleObjectConstructor interface for DB2 on z/OS [DAT-8580]
    • Implemented the init hub command to complete Liquibase Hub onboarding
    • Included the CLI instructions on how to use the properties file with a nonstandard name when running the init project subcommand [DAT-9041]
    • Added to init start-h2 a clearer message when the H2 database driver is specified, but there is no connection detected. [DAT-8992]
    • Added validation errors for the enableCheckConstraint, disableCheckConstraint, dropPackage, dropPackageBody Change Types [DAT-9017]
    • [PR#2367] [Mike Olivas] Added example rollback scripts to the example-changelog.sql file [LB-2220]
    • [PR#1648] [Daniel Gray] Improved the exception error message for the customChange node with no class attribute [LB-1144]
    • [PR#2222] [msimko81] Added the offline mode support for the rollback-sql operation [LB-2198]

    Fixes

    ... (truncated)


    dependencies java 
    opened by dependabot[bot] 0
  • FileNotFoundError: [Errno 2] No such file or directory: 'blazingsql-orchestrator': 'blazingsql-orchestrator'


    When I try to run BlazingSQL in Google Colab, I get the following error. Link to the Colab notebook: https://blog.blazingdb.com/blazingsql-rapids-ai-now-free-on-google-colab-b8646f1ea948

    <module 'subprocess' from '/usr/lib/python3.7/subprocess.py'>

    FileNotFoundError Traceback (most recent call last) in ()
    8 import subprocess
    9 print(subprocess)
    ---> 10 subprocess.Popen(['blazingsql-orchestrator', '9100', '8889', '127.0.0.1', '8890'],stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    11 subprocess.Popen(['java', '-jar', '/usr/local/lib/blazingsql-algebra.jar', '-p', '8890'])
    12 import pyblazing.apiv2.context as cont

    1 frames
    /usr/lib/python3.7/subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, start_new_session)
    1549 if errno_num == errno.ENOENT:
    1550 err_msg += ': ' + repr(err_filename)
    -> 1551 raise child_exception_type(errno_num, err_msg, err_filename)
    1552 raise child_exception_type(err_msg)
    1553

    FileNotFoundError: [Errno 2] No such file or directory: 'blazingsql-orchestrator': 'blazingsql-orchestrator'

    bug ? - Needs Triage 
    opened by SoumyaB57 0
Releases(v21.08.00)
  • v21.08.00(Aug 16, 2021)

    Improvements

    • Update ucx-py versions to 0.21
    • return ok for filesystems
    • Setting up default value for max_bytes_chunk_read to 256 MB

    Bug Fixes

    • Fix build due to changes in rmm device buffer
    • Fix reading decimal columns from orc file
    • Fix CC/CXX variables in CI
    • Fix latest cudf dependencies
    • Fix concat suite E2E test for nested calls
    • Fix for GCS credentials from filepath
    • Fix decimal support using float64
    • Fix build issue with thrust package
  • v21.06.00(Aug 16, 2021)

    Note new versioning system from Major.Minor to Year.Month. Previous version was 0.19.

    New Features

    • Limited support of unbounded partitioned windows
    • Support for CURRENT_DATE, CURRENT_TIME and CURRENT_TIMESTAMP
    • Support for right outer join
    • Support for DURATION type
    • Support for IS NOT FALSE condition
    • Support ORDERing by null values
    • Support for multiple columns inside COUNT() statement

    Improvements

    • Support for concurrency in E2E tests
    • Better Support for unsigned types in C++ side
    • Folder refactoring related to caches, kernels, execution_graph, BlazingTable
    • Improve data loading when the algebra contains only BindableScan/Scan and Limit
    • Enable support for spdlog 1.8.5
    • Update RAPIDS version references

    Bug Fixes

    • Fix IS NOT DISTINCT FROM with joins
    • Fix wrong results from timestampdiff/add
    • Fixed build issues due to cudf aggregation API change
    • Comparing param set to true for e2e
    • Fixed provider unit_tests
    • Fix orc statistic building
    • Fix Decimal/Fixed Point issue
    • Fix for max_bytes_chunk_read param to csv files
    • Fix ucx-py versioning specs
    • Reading chunks of max bytes for csv files
  • v0.19.0(Apr 21, 2021)

    New Features

    • New API that supports concurrent queries, by starting a query and obtaining a token, and then retrieving the result with that token.
    • Support for string CONCAT using the CONCAT keyword, instead of '||'.
    • New API to get the physical execution plan: bc.explain(query, detail = True)
    • Support for querying PostgreSQL tables
    • New documentation page
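The token-based pattern behind the concurrent query feature (start a query, receive a token, fetch the result later with that token) can be pictured with a small standard-library sketch. The TokenRunner class and its start/fetch methods are hypothetical stand-ins and do not reproduce the BlazingContext API; plain Python callables play the role of queries here.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


class TokenRunner:
    """Hypothetical runner: start() returns a token, fetch() redeems it for the result."""

    def __init__(self, max_workers=2):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}
        self._ids = itertools.count(1)

    def start(self, fn, *args):
        # Submit the work without blocking and hand back an opaque token.
        token = next(self._ids)
        self._futures[token] = self._pool.submit(fn, *args)
        return token

    def fetch(self, token):
        # Block until the work behind this token is finished, then return its result.
        return self._futures.pop(token).result()

    def close(self):
        self._pool.shutdown()
```

Because each token is independent, results can be fetched in any order, which is what lets several queries be in flight at once.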

    Improvements

    • Improvements and expansion to the end-to-end testing framework, including adding testing for data with nulls
    • Improved performance of joins by adding a timeout to the concatenating CacheMachine
    • Improved kernel row output estimation

    Bug Fixes

    • Fixed bugs in uninitialized variables in orc metadata and improvements to handling the parseMetadata exceptions
    • Fixed bugs in handling nulls in case conditions with strings
    • Fixed issue with deleting allocated host memory
    • Fixed issues in capturing error messages from exceptions
    • Fixed bug when there are no projects in a BindableTableScan
    • Fixed issues from cuda when freeing pinned memory
    • Fixed bug in DistributeAggregationKernel where the wrong columns were being hashed
    • Fixed bug with empty row group ids for parquet
    • Fixed issues with int64 literal values
    • Fixed issue when CAST was applied to a literal
    • Fixed bug when getting ORC metadata for decimal type
    • Fixed bug with substrings with nulls
    • Fixed support for minus unary operator
    • Fixed bug with calculating number of batches in BindableTableScan
    • Fixed bug with full outer join when both tables contained nulls
    • Fixed bug with COUNT DISTINCT
    • Fixed issue with columns aliases when there was a Join operation
    • Fixed issue with python side exceptions
    • Fixed various issues due to changes in cudf or other dependencies

    Window Functions (Experimental)

    This release now provides limited Window Functions support. Window Functions that have the partition by clause support the following aggregations:

    • MIN
    • MAX
    • COUNT
    • SUM
    • AVG
    • ROW_NUMBER
    • LEAD
    • LAG

    Window Functions that do not have a partition by clause and have a bounded window frame using ROWS BETWEEN (the window frame does not use the keyword UNBOUNDED) support the following aggregations:

    • MIN
    • MAX
    • COUNT
    • SUM
    • AVG

    At this moment, window frames using the keywords UNBOUNDED and CURRENT ROW don't fully work.
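To make the two supported shapes concrete, the sketch below runs one partitioned query and one bounded ROWS BETWEEN query. It uses Python's built-in sqlite3 module purely as a stand-in SQL engine (window functions need SQLite 3.25 or newer), since BlazingSQL itself requires a GPU; the sales table and its columns are made up for the example, and BlazingSQL's results for the same SQL are not verified here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 1, 10), ("east", 2, 20), ("east", 3, 30), ("west", 1, 5), ("west", 2, 15)],
)

# Shape 1: window with a PARTITION BY clause (ROW_NUMBER and LAG per region).
partitioned = conn.execute(
    """
    SELECT region, day,
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY day) AS rn,
           LAG(amount) OVER (PARTITION BY region ORDER BY day) AS prev_amount
    FROM sales
    ORDER BY region, day
    """
).fetchall()

# Shape 2: no PARTITION BY, bounded frame that avoids UNBOUNDED and CURRENT ROW.
bounded = conn.execute(
    """
    SELECT region, day,
           SUM(amount) OVER (
               ORDER BY region, day
               ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
           ) AS moving_sum
    FROM sales
    ORDER BY region, day
    """
).fetchall()

conn.close()
```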

    Deprecated Features

    • Disabled support for outer joins with inequalities
  • v0.18.0(Feb 26, 2021)

    New SQL Functions

    The following SQL commands are now supported:

    • REGEXP_REPLACE
    • INITCAP

    New Features

    • New centralized task executor for all query execution
    • New pinned memory buffer pool for improved performance in communication
    • New host memory buffer pool for improved performance in caching data to system memory
    • Support for UCX communications, which enables the use of high-performance communication hardware such as InfiniBand
    • Creating table from ORC files now collects metadata from ORC files and can perform predicate pushdown on metadata
    • Progress bar when executing queries
    • Added ability to retry tasks after out-of-memory errors
    • Added ability to get maximum gpu memory used

    Improvements

    • Improved support for concurrent queries
    • Improvements to query execution logs
    • Added/improved communication logs
    • Added ability to disable logs
    • Improved storage plugin output messages
    • Improved support for creating tables from JSON files

    Bug Fixes

    • Fixed distribution so that it evenly distributes data loading based on row groups
    • Fixed cython exception handling
    • Support FileSystems (GS, S3) when extension of the files are not provided
    • Fixed issue when creating tables from a local dir relative path
    • Misc bug fixes

    Codebase improvements

    • Code base clean up, improved code organization and refactoring
    • No longer depending on gtest for runtime
    • Reduced number of compilation warnings
  • v0.17.0(Dec 14, 2020)

    New SQL Functions

    The following SQL commands are now supported:

    • TO_DATE / TO_TIMESTAMP
    • DAYOFWEEK
    • TRIM / LTRIM / RTRIM
    • LEFT / RIGHT
    • UPPER / LOWER
    • REPLACE
    • REVERSE

    New Features

    • New communications architecture with support for both TCP and UCX (UCX support is in beta)
    • Allow creating tables from compressed delimited text files
    • Allow creating tables from a Hive partitioned folder structure, where BlazingSQL will infer columns and types.
    • Added powerPC building script and instructions
    • Added local logging directory option to BlazingContext to help resolve logging file permission issues
    • Added option to read csv files in chunks
    • Logs are now configurable to have max size and be rotated

    Improvements

    • Added Apache Calcite rule for window functions. (Window functions not supported yet)
    • Add validation for the kwargs when BlazingContext.create_table API is called
    • Added validation for s3 buckets
    • Added scheduler file support for e2e testing framework
    • Improved how sampling is done for ORDER BY
    • Several changes to keep up with cuDF API changes
    • Remove temp files when an error occurs
    • Added new end-to-end tests
    • Added new unit tests
    • Improved contribution documentation
    • Code refactoring and removing dead or duplicate code

    Improvements in error logging

    • Improvement to error messaging when validating any GCP bucket
    • Added error logging in DataSourceSequence
    • Showing an appropriate error to indicate that we don't support opening directories with wildcards
    • Showing an appropriate error for invalid or unsupported expressions on the logical plan

    Changes or improvements in technology stack or CI

    • Added output compile json option for cppcheck
    • Bump junit from 4.12 to 4.13.1 in /algebra
    • Improved gpuCI scripts
    • Removed need to specify cuda version via a label for conda packages
    • Fixed cmake version to be 3.18.4
    • Fix SSL errors for conda

    Bug Fixes

    • Fixed issue when loading parquet files with local_files=True
    • Fixed logging directory setup
    • Fixed issues with config_options
    • Fixed issue in float columns when parsing parquet metadata
    • Fixed bug in MergeAggregations when single node has multiple batches
    • Fix graph thread pool hang when exception is thrown
    • Fix ignoring headers when multiple CSV files were provided
    • Fix column_names (table) always as list of string
    • Fixed literal type inference for integers

    Deprecated features

    • Deprecated bc.partition
  • v0.16.0(Oct 23, 2020)

    Improvements

    • Activate End-to-end test result validation for GPU_CI.
    • Add ability to set the transport memory
    • Update conda recipe, remove cxx11 abi from cmake
    • Use a single initialize() function at startup and add allocation-related logs
    • Made it possible to read system environment variables to set up config_option for BlazingContext
    • Update TPCH queries for end to end tests: converting implicit joins into explicit joins
    • Removing cudf source code dependency as some cudf utilities headers were exposed
    • BLAZING_CACHE_DIRECTORY can now be set manually

    Bug Fixes

    • Fixed issue due to cudf orc api change
    • Fixed issue parsing fixed width string literals
    • Fixed issue with hive string columns
    • Fixed issue due to an rmm include
    • Fixed build issues with latest rmm 0.16 and columnBasisTest due to deprecated drop_column() function
    • Fix metadata mismatch due to parsedMetadata, caused by parquet files that had only nulls in certain columns for only some files
    • Removed workaround for parquet read schema
    • Fixed issue caused by creating tables with multiple csv files and having BSQL infer the datatypes and having a dtypes mismatch
    • Avoid read _metadata files
    • Fixed issues with parsers, in particular ORC parser was misbehaving
    • Fixed issue with logging directories in distributed environments
    • Pinned google cloud version to 1.16
    • Partial revert of some changes on parquet rowgroups flow with local_files=True
    • Fixed issue when loading paths with wildcards
    • Fixed issue with concat_all in concatenating cache
    • Fix arrow and spdlog compilation issues
    • Fixed intra-query memory leak in joins
    • Fixed crash when loading an empty folder
    • Fixed parseSchemaPython can throw exceptions
  • v0.15.0(Aug 31, 2020)

    New Features:

    • Added a memory monitor for better memory management for out of core processing
    • Added list_tables() and describe_table() functions
    • Added support for constant expressions evaluation by Calcite
    • Added support for cross join
    • Added rand() and support for running unary operations on literals
    • Added get_free_memory() function

    Improvements

    Performance improvements:

    • Implemented Unordered pull from cache to help performance
    • Concatenating cache improvement and replacing PartwiseJoin::load_set with a concatenating cache
    • Adding max kernel num threads pool
    • Added new separate threshold for concat cache

    Stability improvements:

    • Added checks for concatenation to prevent String overflow
    • Added nogil statements for pure C functions in Cython
    • Round-robin dask workers on single gpu queries
    • Reraising query errors in context.py
    • Implemented using threadpool for outgoing messages

    Documentation improvements:

    • Added exhale to generate doxygen for sphinx docs
    • Added Sphinx based code architecture documentation
    • Added doxygen comments to CacheMachine.h
    • Added more documentation about memory management
    • Updated readme
    • Added doxygen comments to some kernels and the batch processing

    Building improvements:

    • Updated Calcite to the most recent version 1.23
    • Added check for CUDF_HOME to allow build to use an existing prebuilt cudf source tree
    • Python/Cython check code style
    • Make AWS and GCS optional

    Logging improvements:

    • Logging level (flush_on) is now configurable
    • Set log_level when using LOGGING_LEVEL param

    Testing improvements:

    • Added unit tests on Calcite to check how logical plans are affected when rulesets are updated
    • Updated set of TPCH queries on the E2E tests
    • Added initial set of unit tests for WaitingQueue and nullptr checks around spdlog calls
    • Add unit test for Project kernel

    Other improvements:

    • Removed a lot of dead code from the codebase
    • Replace random_generator with cudf::sample
    • Adding extern C for include files
    • Use default client and network interface from Dask. BlazingSQL should now be able to infer the network interface.
    • Updated the GPUManager functions
    • Handle exceptions from pool_threads

    Bug Fixes

    • Fixed various issues due to updates to cudf
    • Fixed issue with Hive partitions when doing SELECT *
    • Normalize columns before distribution in JoinPartitionKernel
    • Fixed issue with hive partitions base folder
    • Fix interops operators output types
    • Fix when the algebra plan was provided using one-line as logical plan
    • Fix issue related to Hive metadata
    • Remove temp files from data cached to disk
    • Fix when checking only Limit and Scan Kernels
    • Loading one file at a time (LimitKernel and ScanKernel)
    • Fixed small issue with hive types conversion
    • Fix for literal cast
    • Fixed issue with start and length of substring being different types
    • Fixed issue on logical plans when there is an EXISTS clause
    • Fixed issue with casting string to string
    • Fixed issue with getting table scan info
    • Fixed row_groups issue in ParquetParser.cpp
    • Fixed issue with some constant expressions not evaluated by calcite
    • Fixed issue with log directory creation in a distributed environment
    • Fixed issue where we were including testing hpp in our code
    • Fixed optimization regression on the select count(*) case
    • Fixed issue caused by using new arrow_io_source
    • Fixed e2e string comparison
    • Fixed random segfault issue in parser
    • Fixed issue with column names on sample function
    • Introduced config param for max orderby samples and fixed issue with oversampling in ORDER BY
  • v0.14.0(Jun 24, 2020)

    New Features:

    • New execution architecture, supporting executing queries on data that does not fit in the GPU. The new architecture features the following:

      • The execution model is an acyclic graph of execution nodes with a cache in between execution nodes.
      • Each execution node operates independently on batches of data, allowing it to process steps in parallel as much as possible instead of sequentially.
      • Each cache between every execution step can hold the data in GPU, in system memory or on disk.
      • Has support for multi-partition dask.cudf.DataFrame result set outputs.
    • Added ability to set configuration options

    • Added support for using NULL as a literal value

    • Implemented CHAR_LENGTH function

    • Added ability to specify region for S3 buckets

    • Added type normalization for UNION ALL

    • Added support for MinIO Storage
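The execution architecture described at the top of this list (a graph of execution nodes, a cache between each pair of nodes, and each node working on batches independently) can be approximated with a short threaded sketch. Here queue.Queue plays the role of the cache and each stage runs in its own thread; run_node, run_pipeline and the linear chain of stages are illustrative assumptions, whereas the real engine is written in C++, supports general acyclic graphs, and can keep cached data in GPU memory, system memory or on disk.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the batch stream


def run_node(fn, inbox, outbox):
    """One execution node: pull a batch from the upstream cache, process it, push it on."""
    while True:
        batch = inbox.get()
        if batch is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(fn(batch))


def run_pipeline(batches, stages):
    """Chain the stages with one queue (the 'cache') between each pair of nodes."""
    caches = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_node, args=(fn, caches[i], caches[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()

    for batch in batches:
        caches[0].put(batch)
    caches[0].put(_DONE)

    results = []
    while True:
        out = caches[-1].get()
        if out is _DONE:
            break
        results.append(out)

    for t in threads:
        t.join()
    return results
```

Because every node only talks to its neighbouring caches, a downstream node can start on the first batch while an upstream node is still producing later ones.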

    Improvements:

    • Improved support for CAST function to include TINYINT and SMALLINT
    • Handle behavior when the optimized plan contains a LogicalValues
    • Improvements to exception handling
    • Support modern compilers (>= g++-7.x)
    • Improved logging now uses spdlog
    • Adding event logging
    • BlazingSQL engine no longer needs to concatenate dask.cudf.DataFrame partitions prior to running a query on a dask.cudf.DataFrame table
    • Improved expression parser, including support for expression trees of unlimited size.
    • Optimized data loading for queries of the type: SELECT * FROM table LIMIT N
    • Added built in end to end testing framework
    • Added logging to condition variables that are waiting too long

    Bug Fixes:

    • Fixed bug in size estimation for tables before joins
    • Fixed issue with excessive thread creation in communication
    • Fixed bug in expression parsing for joins
    • Fixed bug caused by sharing data loaders when a query has one table more than once
    • Fixed Hive file format inference
  • v0.13.0(Apr 7, 2020)

    New Features:

    • Support for AVG in distributed mode
    • Added ability to use existing memory allocator
    • Implemented unify_partitions function for preparing dask_cudf DataFrames prior to creating BlazingSQL tables
    • Implemented ROUND function
    • Implemented support for CASE with strings

    Improvements:

    • Local files can be referenced with relative file paths when creating tables.
    • Automatic casting for joins on similar data types (e.g., joining an int32 with an int64 will cast the int32 to an int64)
    • Updated AWS SDK version
    • More changes related to the migration from libcudf to libcudf++
    • Added docstrings to main python APIs
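
The automatic casting rule for join keys can be sketched as "pick the wider of two types of the same kind". The type table and the `common_join_type` helper below are hypothetical and only illustrate the int32/int64 case mentioned above; the engine's actual rules may differ.

```python
# Hypothetical type table: name -> (kind, width in bytes).
_TYPES = {
    "int8": ("int", 1),
    "int16": ("int", 2),
    "int32": ("int", 4),
    "int64": ("int", 8),
    "float32": ("float", 4),
    "float64": ("float", 8),
}


def common_join_type(left, right):
    """Return the type both join keys would be cast to (the wider one)."""
    left_kind, left_width = _TYPES[left]
    right_kind, right_width = _TYPES[right]
    if left_kind != right_kind:
        # Mixing kinds is outside what this sketch covers.
        raise TypeError(f"cannot unify {left} with {right}")
    return left if left_width >= right_width else right


print(common_join_type("int32", "int64"))
```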

    Bug Fixes:

    • Fixed bug when joining against an empty DataFrame
    • Fixed bug with GROUP BY ignoring nulls
    • Fixed various issues related to creating tables from dask_cudf DataFrames
    • Fixed various bugs with creating tables from Hive Cursor
    • Fixed bugs related to new libcudf++ functionality
    • Fixed bug in LIMIT statement
    • Fixed bug in timestamp processing
    • Fixed bug in SUM0 aggregation (which enables COUNT DISTINCT)
    • Fixed bug when querying single file with multiple workers
    • Fixed bug with distributed COUNT aggregation without GROUP BY
    • Fixed bug when creating and querying a table with several Apache Parquet files and one is empty
    • Fixed bug with joins with nulls in the join key columns

    Other:

    • Temporarily deprecated JSON reader. In the meantime we recommend using: cudf.read_json
  • v0.12.0(Feb 6, 2020)

    New Features:

    • Ability to skip reading and processing row groups when querying Apache Parquet files by applying predicates on metadata
    • Ability to do SELECT COUNT (DISTINCT column)
    • Ability to use and set Pool memory allocator for increased performance and/or managed (UVM) allocator which provides robustness against running out of GPU memory
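
Skipping row groups by applying predicates on metadata can be illustrated as follows. The sketch assumes each row group carries min/max statistics for a column, as Apache Parquet files commonly do; the `RowGroup` class and `scan_greater_than` function are hypothetical and do not reflect the engine's internals.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_val: float   # smallest value in the row group (from metadata)
    max_val: float   # largest value in the row group (from metadata)
    rows: list       # the actual values, only touched if not skipped


def scan_greater_than(row_groups, threshold):
    """Evaluate `val > threshold`, skipping row groups that cannot match."""
    matches, skipped = [], 0
    for group in row_groups:
        if group.max_val <= threshold:
            skipped += 1  # metadata proves no row can satisfy the predicate
            continue
        matches.extend(v for v in group.rows if v > threshold)
    return matches, skipped


groups = [
    RowGroup(1.0, 3.0, [1.0, 2.5, 3.0]),
    RowGroup(2.0, 9.0, [2.0, 9.0, 4.5]),
    RowGroup(0.5, 3.9, [0.5, 3.9]),
]
matches, skipped = scan_greater_than(groups, 4)
print(matches, skipped)
```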

    Improvements:

    • New building scripts thanks to @dillon-cullinan

    Bug Fixes:

    • Fixed various bugs in the Apache Arrow provider
    • Fixed bug with incorrect data type in CASE statements
    • Fixed bug and memory leak in distributed joins
    • Fixed bug in usage of Google Cloud Storage plugin
  • v0.11(Dec 17, 2019)

    New Features:

    • Merged all the code repos for the whole stack into one repo
    • Pythonization of the whole BlazingSQL stack. See our blog post for more information
    • New API for being able to query performance and execution logs
    • Ability to create BlazingSQL tables from Hive tables
    • Partial support for Non-equality joins. For example SELECT * FROM tableA as A INNER JOIN tableB as B ON A.key = B.key AND A.this_date > B.that_date
    • Added arrow-provider
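
To show what the non-equality join example above returns, here is the same query run against two tiny tables. It uses Python's built-in `sqlite3` module only as a stand-in to demonstrate the SQL semantics; it is not BlazingSQL, and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tableA (key INTEGER, this_date TEXT);
    CREATE TABLE tableB (key INTEGER, that_date TEXT);
    INSERT INTO tableA VALUES (1, '2019-12-10'), (2, '2019-11-01');
    INSERT INTO tableB VALUES (1, '2019-12-01'), (2, '2019-11-15');
    """
)

# Equality condition on key plus a non-equality condition on the dates.
rows = conn.execute(
    """
    SELECT A.key, A.this_date, B.that_date
    FROM tableA AS A
    INNER JOIN tableB AS B
      ON A.key = B.key AND A.this_date > B.that_date
    """
).fetchall()
conn.close()

print(rows)
```

Only the pair with key 1 survives: both keys match on equality, but only for key 1 is `this_date` later than `that_date`.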

    Improvements:

    • Optimized simple queries that only have COUNT(*)
    • Removed limitation on number of operands for outer joins
    • Improved error messaging
    • Improvements to relational algebra optimization

    Bug Fixes:

    • Fixed bug where a python script running BlazingSQL would hang at the end of a script
    • Fixed bug when using wildcards for file paths and using dask distribution
    • Fixed bug with HDFS
    • Fixed bug with projects with large amounts of transformations on large GPUs
    • Fixed bug with multiple projections on the same column
    • Fixed COUNT(*) to properly ignore nulls
    • Fixed stability issues with certain queries running on 3 or more nodes
    • Fixed bug with querying a GDF and no transformations are applied
    • Fixed bug with empty result sets
    • Fixed bug with empty column names
  • v0.4.6(Nov 12, 2019)

    New Features

    • Implemented string concat operator
    • Implemented substring operator

    Improvements:

    • Improved management of services
    • Changed Apache Calcite schema database to an in-memory database
    • Improved performance of communication between nodes by enabling parallel messaging
    • Improved performance of data loading by enabling parallel file reading
    • Added new distributed join method for joining small tables

    Bug Fixes:

    • Fixed various issues with Timestamp data types
    • Fixed issue when column names were too long
    • Fixed bug in relational algebra generation
    • Fixed various bugs in communication layer
    • Fixed bug with order by with strings
    • Fixed issue with parsing Apache Parquet file schemas
    • Fixed memory leak in joins
    • Fixed memory leak in communication layer
    • Fixed bug in table concatenation in distribution algorithms
    • Fixed bug when trying to join on columns of integers of different byte widths, or floats of different byte widths
    • Fixed bug when trying to do a union on columns of integers of different byte widths, or floats of different byte widths
    • Fixed bug in passing error message to user
  • v0.4.5(Oct 22, 2019)

    New Features

    • Completely revamped data transport layer that is much faster and more robust
    • Added support for LIKE operator
    • Added ability to create tables from Dask dataframes.
    • Improved how services are launched from BlazingContext, including a new ready() function, which checks whether all services are online, and a shutdown() function, which shuts down all services.

    Improvements

    • Improved performance logging
    • Now using in-memory H2 database for Apache Calcite table catalog
    • Updated to cudf v0.10

    Bug Fixes

    • Fixed bug in expression parsing
    • Fixed various bugs with date literals, date functions and GDF_TIMESTAMP data type
    • Fixed bug with aliases
    • Fixed bug in order by for distributed queries when there are empty partitions
    • Fixed bug in creating tables from S3 directories
    • Fixed bug where predicate pushdown was not happening in certain types of queries
  • v0.4.4(Oct 22, 2019)

    New Features

    • Added support for CAST
    • Added file_format parameter to create_table. Use this parameter when the file format cannot be determined from the file extension.
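
The idea behind the file_format parameter can be sketched as "use the explicit value if given, otherwise fall back to the file extension". The `resolve_file_format` helper and its list of known formats are hypothetical and are not the engine's code.

```python
import os

# Hypothetical set of formats recognized by extension.
_KNOWN_FORMATS = {"csv", "parquet", "orc", "json"}


def resolve_file_format(path, file_format=None):
    """Prefer an explicit file_format; otherwise infer it from the extension."""
    if file_format is not None:
        return file_format.lower()
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    if extension in _KNOWN_FORMATS:
        return extension
    raise ValueError(f"cannot infer file format for {path!r}; pass file_format")


print(resolve_file_format("data/taxi.parquet"))
print(resolve_file_format("data/part-00000", file_format="parquet"))
```

A file such as `data/part-00000` has no extension, so without the explicit `file_format` argument the helper raises an error instead of guessing.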

    Bug Fixes

    • Fixed bug where aliases would sometimes not be set correctly
  • v0.4.3(Sep 26, 2019)

    New Features

    • Added file_format parameter to create_table to help create tables from files that don't have extensions

    Bug Fixes

    • Fixed how releases are versioned for Conda
    • Fixed bug with joining against an empty table
  • v0.4.2(Sep 20, 2019)

    New Features

    • Added support for CASE
    • Improved support for Boolean columns
    • Creating tables using wildcards in file paths
    • Added support for Google Cloud Storage
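
Creating a table from a wildcard path amounts to expanding the pattern into a list of matching files first. The sketch below shows that expansion step with Python's standard `fnmatch` module; the `expand_wildcard` helper and the file names are hypothetical, and the engine's own matching rules may differ.

```python
from fnmatch import fnmatch


def expand_wildcard(pattern, available_paths):
    """Return the paths that match a shell-style wildcard pattern, sorted."""
    return sorted(path for path in available_paths if fnmatch(path, pattern))


paths = [
    "data/taxi_2019_02.parquet",
    "data/readme.txt",
    "data/taxi_2019_01.parquet",
]
matched = expand_wildcard("data/taxi_*.parquet", paths)
print(matched)
```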

    Bug Fixes

    • Fixed bug in GROUP BYs with strings in a distributed cluster
    • Fixed issues in how BlazingContext launches processes
    • Fixed issue where releases were being done in Debug mode
    • Fixed bug related to creating multiple tables with the same name
  • v0.4.1(Sep 20, 2019)

    New Features:

    • Ability to compile and install using Conda
    • Creating a BlazingContext can now automatically launch processes
    • Support for creating tables from JSON and ORC files
    • Added more CSV parsing parameters for creating tables from CSV files
    • Updated to use cudf v0.9 release
    • Added support for LIMIT

    Bug fixes

    • Fixed bug with processing queries using date literals
    • Fixed distribution issues with data with nulls
  • v0.4.0(Aug 16, 2019)

    A great deal has happened since we last released.

    • We now support distributed query execution!
    • Distributed results output to dask-cudf
    • Updated to cuDF 0.9
    • Millions, literally millions, of bug fixes.
    • No longer use main. before any table names. That was awful. bc.sql('select * from main.table_name') --> bc.sql('select * from table_name')
  • simple-distribution-tcp-cudf0.7(Jun 14, 2019)
