ArrayFire: a general purpose GPU library.

Overview

ArrayFire is a general-purpose library that simplifies the process of developing software that targets parallel and massively-parallel architectures including CPUs, GPUs, and other hardware acceleration devices.

Several of ArrayFire's benefits include:

ArrayFire provides software developers with a high-level abstraction of data which resides on the accelerator, the af::array object. Developers write code which performs operations on ArrayFire arrays which, in turn, are automatically translated into near-optimal kernels that execute on the computational device.

ArrayFire is successfully used on devices ranging from low-power mobile phones to high-power GPU-enabled supercomputers. ArrayFire runs on CPUs from all major vendors (Intel, AMD, ARM), GPUs from the prominent manufacturers (NVIDIA, AMD, and Qualcomm), as well as a variety of other accelerator devices on Windows, Mac, and Linux.

Installation

You can install the ArrayFire library in one of the following ways:

Package Managers

This approach is currently only supported for Ubuntu 18.04 and 20.04. Please go through our GitHub wiki page for the detailed instructions.

Official installers

Execute one of our official binary installers for Linux, OSX, and Windows platforms.

Build from source

Build from source by following instructions on our wiki.

Examples

The following examples are simplified versions of helloworld.cpp and conway_pretty.cpp, respectively. For more code examples, visit the examples/ directory.

Hello, world!

array A = randu(5, 3, f32); // Create 5x3 matrix of random floats on the GPU
array B = sin(A) + 1.5;     // Element-wise arithmetic
array C = fft(B);           // Fourier transform the result

float d[] = { 1, 2, 3, 4, 5, 6 };
array D(2, 3, d, afHost);   // Create 2x3 matrix from host data
D.col(0) = D.col(end);      // Copy last column onto first

array vals, inds;
sort(vals, inds, A);        // Sort A and print sorted array and corresponding indices
af_print(vals);
af_print(inds);

Conway's Game of Life

Visit the Wikipedia page for a description of Conway's Game of Life.

static const float h_kernel[] = {1, 1, 1, 1, 0, 1, 1, 1, 1};
static const array kernel(3, 3, h_kernel, afHost);

array state = (randu(128, 128, f32) > 0.5).as(f32); // Generate starting state
Window myWindow(256, 256);
while(!myWindow.close()) {
  array nHood = convolve(state, kernel); // Obtain neighbors
  array C0 = (nHood == 2);               // Generate conditions for life
  array C1 = (nHood == 3);
  state = state * C0 + C1;               // Update state
  myWindow.image(state);                 // Display
}
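
As a cross-check of the update rule, the same generation step can be written in plain C++ without ArrayFire. This is a minimal sketch, not part of conway_pretty.cpp: the function name lifeStep is made up for illustration, and cells outside the grid are assumed to be dead, since how convolve treats the border is not stated above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One Game of Life generation on a rows x cols grid stored row-major.
// Assumption of this sketch: cells outside the grid count as dead.
std::vector<int> lifeStep(const std::vector<int>& state, int rows, int cols) {
    assert(state.size() == static_cast<std::size_t>(rows) * cols);
    std::vector<int> next(state.size(), 0);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            // Count live neighbors. The center cell is skipped, which
            // corresponds to the 0 in the middle of h_kernel.
            int nHood = 0;
            for (int dr = -1; dr <= 1; ++dr) {
                for (int dc = -1; dc <= 1; ++dc) {
                    if (dr == 0 && dc == 0) continue;
                    const int rr = r + dr;
                    const int cc = c + dc;
                    if (rr >= 0 && rr < rows && cc >= 0 && cc < cols) {
                        nHood += state[rr * cols + cc];
                    }
                }
            }
            const int C0 = (nHood == 2);  // a live cell survives
            const int C1 = (nHood == 3);  // a cell is born or survives
            next[r * cols + c] = state[r * cols + c] * C0 + C1;
        }
    }
    return next;
}
```

Calling lifeStep on a 5x5 grid holding a vertical three-cell "blinker" should return the same blinker rotated to horizontal, which makes a convenient sanity test for either implementation.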

Documentation

You can find our complete documentation here.

Quick links:

Language support

ArrayFire has several official and third-party language APIs:

Native

Official wrappers

We currently support the following language wrappers for ArrayFire:

Wrappers for other languages are a work-in-progress: .NET, Fortran, Go, Java, Lua, NodeJS, R, Ruby

Third-party wrappers

The following wrappers are being maintained and supported by third parties:

Contributing

Contributions of any kind are welcome! Please refer to CONTRIBUTING.md to learn more about how you can get involved with ArrayFire.

Citations and Acknowledgements

If you redistribute ArrayFire, please follow the terms established in the license. If you wish to cite ArrayFire in an academic publication, please use the following citation document.

ArrayFire development is funded by ArrayFire LLC and several third parties; please see the list of acknowledgements for further details.

Support and Contact Info

Trademark Policy

The literal mark “ArrayFire” and ArrayFire logos are trademarks of AccelerEyes LLC DBA ArrayFire. If you wish to use either of these marks in your own project, please consult ArrayFire's Trademark Policy.

Comments
  • Build error on OSX

    While building af from the source, I got the error shown below. All the dependencies are installed (CUDA version: 6.5). How can I fix the issue?

    Any advice is appreciated.

    CMake Error at afcuda_generated_copy.cu.o.cmake:264 (message):
      Error generating file
      /Users/kerkil/Git/arrayfire/build/src/backend/cuda/CMakeFiles/afcuda.dir//./afcuda_generated_copy.cu.o
    
    
    make[2]: *** [src/backend/cuda/CMakeFiles/afcuda.dir/./afcuda_generated_copy.cu.o] Error 1
    2 errors detected in the compilation of "/var/folders/7m/g0rk38md4z75ryh_h6b60d040000gn/T//tmpxft_00003ee2_00000000-6_count.cpp1.ii".
    CMake Error at afcuda_generated_count.cu.o.cmake:264 (message):
      Error generating file
      /Users/kerkil/Git/arrayfire/build/src/backend/cuda/CMakeFiles/afcuda.dir//./afcuda_generated_count.cu.o
    
    
    make[2]: *** [src/backend/cuda/CMakeFiles/afcuda.dir/./afcuda_generated_count.cu.o] Error 1
    make[1]: *** [src/backend/cuda/CMakeFiles/afcuda.dir/all] Error 2
    make: *** [all] Error 2
    
    build OSX CUDA 
    opened by kerkilchoi 56
  • Add framework for extensible ArrayFire memory managers

    Motivation

    Many different use cases require performance across many different memory allocation patterns. Even different devices/backends have different costs associated with memory allocations/manipulations. Having the flexibility to implement different memory management schemes can help optimize performance for the use case and backend.

    Framework

    • The basic interface lives in a new header: include/af/memory.h and includes two interfaces:
      • A C-style interface defined in the af_memory_manager struct, which includes function pointers that custom memory manager implementations should define, along with device/backend-specific functions that can be called by the implementation (e.g. nativeAlloc) and will be set dynamically. Typesafe C-style struct inheritance should be used.
      • A C++ style interface using MemoryManagerBase, which defines pure-virtual methods for the API along with device/backend-specific functions as above.

    A C++ implementation is simple, and requires only:

    #include <af/memory.h>
    ...
    class MyCustomMemoryManager : public af::MemoryManagerBase {
    ...
      void* alloc(const size_t size, bool user_lock) override {
        ...
        void* ptr = this->nativeAlloc(...);
        ...
      }
    };
    
    // In some code run at startup:
    af::MemoryManagerBase* p = new MyCustomMemoryManager();
    af::setMemoryManager(p);
    

    For the C API:

    #include <af/memory.h>
    ...
    void* af_memory_manager_impl_alloc(const size_t size, bool user_lock) {
      ...
    }
    
    typedef struct af_memory_manager_impl {
      af_memory_manager manager; // inherit base methods
      // define custom implementation
    } af_memory_manager_impl;
    
    // In some code run at startup:
    ...
    af_memory_manager* p = (af_memory_manager*)malloc(sizeof(af_memory_manager_impl));
    p->af_memory_manager_alloc = &af_memory_manager_impl_alloc;
    ...
    af_set_memory_manager(p);
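
    The C-style struct inheritance used above can be illustrated in isolation. The following is only a sketch of the pattern: base_manager, counting_manager, and their fields are hypothetical names invented for this example and are not the actual af_memory_manager layout. The key point is that the base struct is the first member of the derived struct, so a pointer to one can be converted to a pointer to the other.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical base "class": a struct of function pointers.
// These names are illustrative, NOT the real af_memory_manager fields.
struct base_manager {
    void* (*alloc)(base_manager* self, std::size_t size);
    void (*release)(base_manager* self, void* ptr);
};

// Derived manager. The base struct is the FIRST member, so a pointer to
// the derived struct and a pointer to its base refer to the same address.
struct counting_manager {
    base_manager base;        // "inherit" the base methods
    std::size_t live_allocs;  // custom state of this implementation
};

void* counting_alloc(base_manager* self, std::size_t size) {
    counting_manager* impl = reinterpret_cast<counting_manager*>(self);
    ++impl->live_allocs;
    return std::malloc(size);
}

void counting_release(base_manager* self, void* ptr) {
    counting_manager* impl = reinterpret_cast<counting_manager*>(self);
    --impl->live_allocs;
    std::free(ptr);
}

// Fill in the function-pointer table, as the startup code would.
void counting_manager_init(counting_manager* m) {
    m->base.alloc = &counting_alloc;
    m->base.release = &counting_release;
    m->live_allocs = 0;
}
```

    Code that only knows about base_manager can call p->alloc(p, n) and p->release(p, ptr) and still reach the derived state, which is what allows one setter function to accept any implementation.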
    
    

    Details

    • The F-bounded polymorphism (CRTP) pattern present in the existing MemoryManager implementation is removed; this was required because the pattern precludes dynamically dispatching to a derived implementation.
    • New interfaces are defined for C/C++ (see below)
    • MemoryManagerCWrapper wraps a C struct implementation of a memory manager and facilitates using the same backend and DeviceManager APIs to manipulate a manager implemented in C.

    API Design Decisions

    • If a custom memory manager is not defined or set, the default memory manager will be used. While the default memory manager implements the new interface, its behavior is identical to the existing behavior by default (as verified by tests).
    • Memory managers should be stored on the existing DeviceManager framework so as to preserve the integrity of existing backend APIs; memory managers can exist on a per-backend basis and work with the unified backend.
    • Existing ArrayFire APIs expect garbage collection and memory step sizing to be implemented in a memory manager. These and a few other slightly opinionated methods are included in the overall API.
      • That said, these methods can be noops or throw exceptions (e.g. garbage collection) if the style of custom memory manager implementation doesn't implement those facilities.
    • Setting a memory manager should use one API in the same C/C++ fashion so as to be compatible with the unified backend via dynamic invocation of symbols in a shared object. The C and C++ APIs should have a polymorphic relationship such that either can be passed to the public API (af::MemoryManagerBase is thus a subtype of af_memory_manager, a C struct).

    • Adds tests defining custom memory manager implementations in both the C++ and C API and testing end-to-end Array allocations and public AF API calls (e.g. garbage collection, step size).
    opened by jacobkahn 52
  • Speedup of kernel caching mechanism by hashing sources at compile time

    Measured results: a) 1 million times join(1, af::dim4(10,10), af::dim4(10,10)) --> 63% faster vs master 3.8.0

    b) example/neural_network.cpp with varying batch sizes (to switch saturation between CPU and GPU)

    • The best test accuracy is obtained with a batch size around 48 (the reason for going so small) on an AMD A10-7870K (AMD Radeon R7 Graphics, 8 CUs). On faster GPUs the improvement will persist with higher batch sizes. --> up to 18% faster vs master 3.8.0. Timings neural_network cacheHashing.xlsx

    c) Memory footprint reduced by 37%, and on top of that the sources are no longer copied internally. All the OpenCL .cl kernel source code files occupy 451 KB, whereas the remaining code strings in the generated obfuscated hpp files occupy only 283 KB. I assume that a similar effect is visible with the CUDA kernel code.

    Description

    Changes in backend/common/kernel_cache.cpp & backend/common/util.cpp

    1. Hashing is now incremental, so that only the dynamic parts are calculated at run time
    2. The hash key changed from string to size_t, to speed up the find functions on the map
    3. The hash key of each kernel is now calculated at compile time by bin2cpp.exe
    4. The hash key of multiple sources is obtained by re-hashing the individual hashes
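
    The idea behind these four points can be sketched in a few lines. This is not the code from the PR: the hash function (64-bit FNV-1a), the Source struct shown here, and the names hashBytes, combineHash, and moduleKey are all assumptions made for illustration. The sketch shows a hash that can continue from a previous value, a source hash computed at compile time, and a key built by folding only the dynamic strings into that precomputed hash.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// 64-bit FNV-1a constants. Using FNV-1a is an assumption of this sketch.
constexpr std::uint64_t kFnvOffset = 14695981039346656037ULL;
constexpr std::uint64_t kFnvPrime = 1099511628211ULL;

// Hash len bytes, continuing from seed. Passing an earlier result as the
// seed is what makes the hash incremental.
constexpr std::uint64_t hashBytes(const char* data, std::size_t len,
                                  std::uint64_t seed = kFnvOffset) {
    std::uint64_t h = seed;
    for (std::size_t i = 0; i < len; ++i) {
        h ^= static_cast<unsigned char>(data[i]);
        h *= kFnvPrime;
    }
    return h;
}

// Fold an already-computed hash into a running key by re-hashing its bytes.
constexpr std::uint64_t combineHash(std::uint64_t key, std::uint64_t h) {
    for (int i = 0; i < 8; ++i) {
        key ^= (h >> (8 * i)) & 0xFFu;
        key *= kFnvPrime;
    }
    return key;
}

// Hypothetical stand-in for a kernel source embedded at build time.
struct Source {
    const char* code;
    std::size_t length;
    std::uint64_t hash;
};

// The hash of the static source text is evaluated by the compiler.
constexpr char kAddCode[] = "kernel void add() {}";
constexpr Source kAddSource{kAddCode, sizeof(kAddCode) - 1,
                            hashBytes(kAddCode, sizeof(kAddCode) - 1)};

// At run time only the dynamic parts (options, instance name) are hashed.
std::uint64_t moduleKey(const Source& src, const std::string& options,
                        const std::string& tInstance) {
    std::uint64_t key = combineHash(kFnvOffset, src.hash);
    key = hashBytes(options.data(), options.size(), key);
    key = hashBytes(tInstance.data(), tInstance.size(), key);
    return key;
}
```

    Because the resulting key is a plain integer, looking it up in a map costs one integer comparison per candidate rather than a character-by-character string comparison.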

    Changes in all OpenCL kernels in backend/opencl/kernel/*.hpp

    1. The struct common::Source now contains: Source code, Source length & Source hash
    2. The static struct, generated at compile time, is now used directly

    Changes in interfaces:

    1. deterministicHash overloaded for incremental hashing of string, vector and vector
    2. getKernel now accepts a common::Sources object instead of a vector

    Current flow of data:

    New kernel:

    0. Compile, kernel.cl: static const common::Source{*code, length, hash}
    1. Rep, kernel.hpp: vector
    2. Rep, kernel_cache.cpp: string tInstance <-- build directly from input data (fast)
    3. Rep, util.cpp: size_t moduleKey <--combined hashes for multiple sources <-- incremental hashes of options & tInstance (fast)

    Search kernel:

    1. Rep, kernel_cache.cpp: search map with moduleKey (1 cmp 64bit instruction per kernel)

    Previous (3.8.0 master) flow of data:

    New kernel:

    1. Once, kernel.hpp: static const string <-- main kernel codefile cached
    2. Rep, kernel.hpp: vector <-- build combined kernel codefiles
    3. Rep, kernel_cache.cpp: vector args <-- transform targs vector into args (replace)
    4. Rep, kernel_cache.cpp: string tInstance <-- build from args vector
    5. Rep, kernel_cache.cpp: vector hashingVals <-- copy tInstance + kernel codefiles + options
    6. Rep, util.cpp: string accumStr <-- copy vector into 1 string
    7. Rep, util.cpp: size_t hashkey <-- hash on full string (slow)
    8. Rep, kernel_cache.cpp: string moduleKey <-- convert size_t to string

    Search kernel:

    1. Rep, kernel_cache.cpp: search map with moduleKey (1cmp per char, 23 cmp 8bit instructions per kernel)

    Changes to Users

    None

    Checklist

    • [x] Rebased on latest master (Nov 18, 2020)
    • [x] Code compiles
    • [x] Tests pass
    • [-] Functions added to unified API
    • [-] Functions documented
    perf 
    opened by willyborn 43
  • Fontconfig error: Cannot load default config file on Mac OSX 10.11.3

    Hi Everyone,

    I have a MacBook Pro with an NVIDIA discrete graphics card (750M). I succeeded in compiling ArrayFire without errors, but when I try to run filters_cuda, for example, I get:

    ArrayFire v3.3.0 (CUDA, 64-bit Mac OSX, build fd660a0)
    Platform: CUDA Toolkit 7.5, Driver: CUDA Driver Version: 7050
    [0] GeForce GT 750M, 2048 MB, CUDA Compute 3.0
    Fontconfig error: Cannot load default config file
    ArrayFire Exception (Internal error:998): @Freetype library:217: font face creation failed(3001)

    In function void af::Window::initWindow(const int, const int, const char *const)
    In file src/api/cpp/graphics.cpp:19
    libc++abi.dylib: terminating with uncaught exception of type af::exception: ArrayFire Exception (Internal error:998): @Freetype library:217: font face creation failed(3001)

    In function void af::Window::initWindow(const int, const int, const char *const)
    In file src/api/cpp/graphics.cpp:19
    Abort trap: 6

    I checked with brew install and I have both "fontconfig" and "freetype" libraries.

    OSX 
    opened by dvasiliu 42
  • NVCC does not support Apple Clang version 8.x

    Error message: nvcc fatal : The version ('80000') of the host compiler ('Apple clang') is not supported

    Steps to fix:

    1. Log in to https://developer.apple.com/downloads/
    2. Download Xcode CLT (Command Line Tools) 7.3
    3. Install CLT
    4. Run sudo xcode-select --switch /Library/Developer/CommandLineTools
    5. Verify that clang has been downgraded via clang --version

    Source: http://stackoverflow.com/a/36590330/701646

    Edit: Update to 7.3 and fail at 8.0

    OSX known issue 
    opened by mlloreda 37
  • LAPACK tests fail if CLBlast is used

    A couple of BLAS-related tests (e.g. cholesky_dense, solve_dense, inverse_dense and LU) fail on a GTX Titan using OpenCL if compiled with CLBlast on Windows. cholesky_dense gives errors like "Matrix C's OpenCL buffer is too small", so I added some printf debugging to CLBlast's TestMatrixC:

    printf("ld          == %d\n", ld);
    printf("one         == %d\n", one);
    printf("two         == %d\n", two);
    printf("offset      == %d\n", offset);
    printf("buffer.size == %d\n", buffer.GetSize());
    printf("req size   == %d\n", required_size); 
    

    and ArrayFire's gpu_blas_herk_func in magma_blas_clblast.h:

    printf("triangle      == %d\n", triangle);
    printf("transpose     == %d\n", a_transpose);
    printf("n             == %d\n", n);
    printf("k             == %d\n", k);
    printf("a_buffer      == %d\n", a_buffer);
    printf("a_offset      == %d\n", a_offset);
    printf("a_ld          == %d\n", a_ld);
    printf("c_buffer      == %d\n", c_buffer);
    printf("c_offset      == %d\n", c_offset);
    printf("c_ld          == %d\n", c_ld);
    

    With this, cholesky_dense_opencl produced the following output:

    cholesky.txt

    I don't know if this is caused by an error in CLBlast (probably not, since CLBlast's tests all pass) or by the integration in ArrayFire.

    Maybe @CNugteren could take a look at it?

    inverse_LU_solve.txt

    opened by fzimmermann89 35
  • Memory access errors after many iterations, cause: convolve or sum ?

    An access violation error occurs after many iterations:

    Exception thrown at 0x00007FFBB53B669F (ntdll.dll) in MNIST_CNN-Toy.exe: 0xC0000005: 
    Access violation reading location 0x0000000000000010.
    

    It is raised from the line below:

    new_values(af::span, af::span, af::span, kernel) = 
    af::sum(af::convolve2(gradient, filter(af::span, af::span, kernel), AF_CONV_EXPAND), 3);
    

    The code is hosted at the URLs below: https://github.com/Reithan/MachineLearning https://github.com/Reithan/MNIST_CNN-Toy

    This issue was originally reported on the Slack channel by Reithan.

    bug 
    opened by 9prady9 34
  • Compiling ArrayFire with FlexiBLAS

    We are currently deploying a new AMD cluster, on which BLIS performs better than MKL, so we are moving away from building against MKL to use FlexiBLAS, which can be switched between MKL, BLIS, OpenBLAS or other BLAS/LAPACK libraries at run time.

    Would it be possible to build ArrayFire against FlexiBLAS instead of MKL?

    At the moment, building it without MKL fails with:

    CMake Error at CMakeModules/InternalUtils.cmake:10 (message):
      MKL not found
    Call Stack (most recent call first):
    

    Does ArrayFire need MKL itself, or does it simply need BLAS/LAPACK?

    feature 
    opened by mboisson 33
  • test/threading_cuda random crash at exit

    Worrying crash in test/threading_cuda:

    ArrayFire v3.7.0 (CUDA, 64-bit Linux, build 70ef1989)
    Platform: CUDA Toolkit 10.0, Driver: 440.33.01
    [0] GeForce GTX 1080 Ti, 11179 MB, CUDA Compute 6.1

    $ test/threading_cuda
    Running main() from /local/nbuild/jenkins/workspace/AF-Release-Linux/test/gtest/googletest/src/gtest_main.cc
    [==========] Running 9 tests from 1 test case.
    [----------] Global test environment set-up.
    [----------] 9 tests from Threading
    [ RUN ] Threading.SetPerThreadActiveDevice
    Image IO Not Configured. Test will exit
    [ OK ] Threading.SetPerThreadActiveDevice (0 ms)
    [ RUN ] Threading.SimultaneousRead
    [ OK ] Threading.SimultaneousRead (5594 ms)
    [ RUN ] Threading.MemoryManagementScope
    [ OK ] Threading.MemoryManagementScope (2008 ms)
    [ RUN ] Threading.MemoryManagement_JIT_Node
    [ OK ] Threading.MemoryManagement_JIT_Node (8 ms)
    [ RUN ] Threading.FFT_R2C
    [ OK ] Threading.FFT_R2C (687 ms)
    [ RUN ] Threading.FFT_C2C
    [ OK ] Threading.FFT_C2C (13 ms)
    [ RUN ] Threading.FFT_ALL
    [ OK ] Threading.FFT_ALL (12 ms)
    [ RUN ] Threading.BLAS
    [ OK ] Threading.BLAS (339 ms)
    [ RUN ] Threading.Sparse
    [ OK ] Threading.Sparse (12699 ms)
    [----------] 9 tests from Threading (21360 ms total)

    [----------] Global test environment tear-down
    [==========] 9 tests from 1 test case ran. (21360 ms total)
    [ PASSED ] 9 tests.

    YOU HAVE 2 DISABLED TESTS

    Segmentation fault

    With gdb:

    ...
    [ OK ] Threading.Sparse (12035 ms)
    [----------] 9 tests from Threading (20924 ms total)
    [----------] Global test environment tear-down
    [==========] 9 tests from 1 test case ran. (20924 ms total)
    [ PASSED ] 9 tests.

    YOU HAVE 2 DISABLED TESTS

    Program received signal SIGSEGV, Segmentation fault.
    0x0000000000000000 in ?? ()
    (gdb) bt
    #0 0x0000000000000000 in ?? ()
    #1 0x00007fffe3f9f8ad in ?? () from /usr/local/cuda-10.0/lib64/libnvrtc.so.10.0
    #2 0x00007fffe3f9f8d5 in ?? () from /usr/local/cuda-10.0/lib64/libnvrtc.so.10.0
    #3 0x00007fffe5577d9d in __cxa_finalize () from /lib64/libc.so.6
    #4 0x00007fffe3c26e06 in ?? () from /usr/local/cuda-10.0/lib64/libnvrtc.so.10.0
    #5 0x0000000000000016 in ?? ()
    #6 0x0000000000000000 in ?? ()

    opened by WilliamTambellini 32
  • OPT: Improved memcopy, JIT & join

    • Workload per thread is increased to optimize for memcopy and JIT.
    • memcopy can now increase its unit size.
    • JIT can now write directly into the final buffer, instead of first evaluating into a temp buffer.
    • JIT now performs multiple copies in parallel (when they are the same size).
    • join now selects optimally between memcopy and JIT.

    Description

    • No feature changes for the end-user, only a stable performance improvement. Improvements are extreme for:
      • vectors in the 2nd, 3rd or 4th dimension
      • not yet evaluated arrays (JIT nodes)
      • arrays fully contained in L2 cache

    Fixes: #2417

    Changes to Users

    Performance improvement, less dependent on dimension.

    Checklist

    • [ x ] Rebased on latest master
    • [ x ] Code compiles
    • [ x ] Tests pass
    • [ ] Functions added to unified API
    • [ ] Functions documented
    perf improvement 
    opened by willyborn 31
  • [Build]Undefined reference to afcu_set_native_id

    Error while trying to build the program.

    Description

    I'm trying to use the setNativeId function from af/cuda.h, but CMake cannot find any reference to that function. I'm correctly including the af/cuda.h file in C++, so I think the issue is in CMake. CMake has /opt/arrayfire/include on the include path, but it still doesn't work properly. My AF version is 3.6.2. I don't know if I need some additional headers for this function. I also tried adding /opt/arrayfire/include/af/cuda.h as an additional include path, but that also failed.

    Error Log

    [email protected]:/workspaces/fiber/Debug> cmake .. -DBUILD_CUDA=ON
    Disabling BUILD_NIGHTLY_TESTS due to BUILD_TESTS
    -- Found libaf : ArrayFire::afcuda
    -- ArrayFire Include Dir /opt/arrayfire/include
    Enabled building tests
    Enabled building matio due to building tests
    -- Found PythonInterp: /usr/bin/python (found suitable version "3.4.6", minimum required is "3") 
    Build Test spot
    Build Test alpha
    Build Test interpolation
    Build Test ioMatlab
    Build Test param
    Build Test pmd
    Build Test raman
    Build Test smf_ssfm
    Build Test stepSizeDistribution
    Build Test utility
    Build Test fiber
     ##### Final Configuration ##### 
    SPOT_VERSION : v3.2-76-g351b46a
    CMAKE_BUILD_TYPE : DEBUG
    BUILD_MATLAB : ON
    BUILD_TESTS : ON
    BUILD_NIGHTLY_TESTS : OFF
    BUILD_SINGLETEST : 
    BUILD_PYTHON : ON
    BUILD_MATIO : ON
    BUILD_CUDA : ON
    CIRUNNER : OFF
    TEST_GPU : OFF
    BENCHMARK_LINEAR : ON
    BENCHMARK_NONLINEAR : ON
     ##### End of Configuration ##### 
    /opt/arrayfire/include
    -- Found MatLibs
    -- Found libmatio : /usr/local/lib/libmatio.so
    -- Found PythonInterp: /usr/bin/python (found version "3.4.6") 
    -- Configuring done
    -- Generating done
    -- Build files have been written to: /workspaces/fiber/Debug
    [email protected]:/workspaces/fiber/Debug> make -j8
    [  0%] Built target header_obj
    Scanning dependencies of target source_obj
    [  2%] Built target cuda_kernels
    [  9%] Built target gmock_main
    [  5%] Built target gmock
    [ 21%] Built target gtest
    [ 25%] Built target obj
    [ 27%] Built target gtest_main
    [ 42%] Built target test_obj
    Scanning dependencies of target cuda_kernel
    [ 43%] Building Fortran object CMakeFiles/source_obj.dir/src/bvp_solver/BVP_M-2.f90.o
    [ 44%] Linking CXX executable interpolation
    [ 45%] Linking CXX executable ioMatlab
    [ 46%] Linking CXX executable pmd
    [ 47%] Linking CXX executable smf_ssfm
    [ 48%] Linking CXX executable alpha
    [ 50%] Building CXX object test/cuda/CMakeFiles/cuda_kernel.dir/cuda_kernels.cpp.o
    [ 51%] Linking CXX executable param
    f951: Warning: Nonexistent include directory '/usr/include/eigen3' [-Wmissing-include-dirs]
    f951: Fatal Error: '/opt/arrayfire/include/af/cuda.h' is not a directory
    compilation terminated.
    CMakeFiles/source_obj.dir/build.make:166: recipe for target 'CMakeFiles/source_obj.dir/src/bvp_solver/BVP_M-2.f90.o' failed
    make[2]: *** [CMakeFiles/source_obj.dir/src/bvp_solver/BVP_M-2.f90.o] Error 1
    CMakeFiles/Makefile2:109: recipe for target 'CMakeFiles/source_obj.dir/all' failed
    make[1]: *** [CMakeFiles/source_obj.dir/all] Error 2
    make[1]: *** Waiting for unfinished jobs....
    /usr/lib64/gcc/x86_64-suse-linux/6/../../../../x86_64-suse-linux/bin/ld: CMakeFiles/test_obj.dir/__/src/param.cpp.o: in function `afcu::setNativeId(int)':
    /opt/arrayfire/include/af/cuda.h:115: undefined reference to `afcu_set_native_id'
    collect2: error: ld returned 1 exit status
    test/CMakeFiles/alpha.dir/build.make:166: recipe for target 'test/alpha' failed
    make[2]: *** [test/alpha] Error 1
    CMakeFiles/Makefile2:544: recipe for target 'test/CMakeFiles/alpha.dir/all' failed
    make[1]: *** [test/CMakeFiles/alpha.dir/all] Error 2
    /usr/lib64/gcc/x86_64-suse-linux/6/../../../../x86_64-suse-linux/bin/ld: CMakeFiles/test_obj.dir/__/src/param.cpp.o: in function `afcu::setNativeId(int)':
    /opt/arrayfire/include/af/cuda.h:115: undefined reference to `afcu_set_native_id'
    collect2: error: ld returned 1 exit status
    test/CMakeFiles/interpolation.dir/build.make:166: recipe for target 'test/interpolation' failed
    make[2]: *** [test/interpolation] Error 1
    CMakeFiles/Makefile2:387: recipe for target 'test/CMakeFiles/interpolation.dir/all' failed
    make[1]: *** [test/CMakeFiles/interpolation.dir/all] Error 2
    /usr/lib64/gcc/x86_64-suse-linux/6/../../../../x86_64-suse-linux/bin/ld: CMakeFiles/test_obj.dir/__/src/param.cpp.o: in function `afcu::setNativeId(int)':
    /opt/arrayfire/include/af/cuda.h:115: undefined reference to `afcu_set_native_id'
    collect2: error: ld returned 1 exit status
    test/CMakeFiles/ioMatlab.dir/build.make:166: recipe for target 'test/ioMatlab' failed
    make[2]: *** [test/ioMatlab] Error 1
    CMakeFiles/Makefile2:307: recipe for target 'test/CMakeFiles/ioMatlab.dir/all' failed
    make[1]: *** [test/CMakeFiles/ioMatlab.dir/all] Error 2
    /usr/lib64/gcc/x86_64-suse-linux/6/../../../../x86_64-suse-linux/bin/ld: CMakeFiles/test_obj.dir/__/src/param.cpp.o: in function `afcu::setNativeId(int)':
    /opt/arrayfire/include/af/cuda.h:115: undefined reference to `afcu_set_native_id'
    collect2: error: ld returned 1 exit status
    /usr/lib64/gcc/x86_64-suse-linux/6/../../../../x86_64-suse-linux/bin/ld: CMakeFiles/test_obj.dir/__/src/param.cpp.o: in function `afcu::setNativeId(int)':
    /opt/arrayfire/include/af/cuda.h:115: undefined reference to `afcu_set_native_id'
    collect2: error: ld returned 1 exit status
    test/CMakeFiles/pmd.dir/build.make:166: recipe for target 'test/pmd' failed
    make[2]: *** [test/pmd] Error 1
    CMakeFiles/Makefile2:347: recipe for target 'test/CMakeFiles/pmd.dir/all' failed
    make[1]: *** [test/CMakeFiles/pmd.dir/all] Error 2
    test/CMakeFiles/smf_ssfm.dir/build.make:166: recipe for target 'test/smf_ssfm' failed
    make[2]: *** [test/smf_ssfm] Error 1
    CMakeFiles/Makefile2:427: recipe for target 'test/CMakeFiles/smf_ssfm.dir/all' failed
    make[1]: *** [test/CMakeFiles/smf_ssfm.dir/all] Error 2
    /usr/lib64/gcc/x86_64-suse-linux/6/../../../../x86_64-suse-linux/bin/ld: CMakeFiles/test_obj.dir/__/src/param.cpp.o: in function `afcu::setNativeId(int)':
    /opt/arrayfire/include/af/cuda.h:115: undefined reference to `afcu_set_native_id'
    collect2: error: ld returned 1 exit status
    test/CMakeFiles/param.dir/build.make:166: recipe for target 'test/param' failed
    make[2]: *** [test/param] Error 1
    CMakeFiles/Makefile2:467: recipe for target 'test/CMakeFiles/param.dir/all' failed
    make[1]: *** [test/CMakeFiles/param.dir/all] Error 2
    [ 52%] Linking CUDA device code CMakeFiles/cuda_kernel.dir/cmake_device_link.o
    [ 53%] Linking CXX executable cuda_kernel
    /usr/lib64/gcc/x86_64-suse-linux/6/../../../../x86_64-suse-linux/bin/ld: CMakeFiles/obj.dir/__/__/src/param.cpp.o: in function `afcu::setNativeId(int)':
    /opt/arrayfire/include/af/cuda.h:115: undefined reference to `afcu_set_native_id'
    collect2: error: ld returned 1 exit status
    test/cuda/CMakeFiles/cuda_kernel.dir/build.make:162: recipe for target 'test/cuda/cuda_kernel' failed
    make[2]: *** [test/cuda/cuda_kernel] Error 1
    CMakeFiles/Makefile2:932: recipe for target 'test/cuda/CMakeFiles/cuda_kernel.dir/all' failed
    make[1]: *** [test/cuda/CMakeFiles/cuda_kernel.dir/all] Error 2
    Makefile:140: recipe for target 'all' failed
    make: *** [all] Error 2
    

    Build Environment

    Compiler version: gcc 6.3.0
    Operating system: Debian GNU/Linux 9.13
    Build environment:
    CMake variables: cmake .. -DBUILD_CUDA=ON

    question build 
    opened by DamjanBVB 30
  • use doxygen-awesome css theme

    Change documentation theme to use doxygen-awesome

    Hoping to improve readability and provide the foundation for a number of inbound documentation improvements 🤞

    This change bumps the doxygen.mk to 1.9.5. May still need some tweaks with the searchbar, need to test on arrayfire.org.

    [Images: sidebar, light mode/dark mode, and function preview]

    opened by syurkevi 4
  • [Build] compute_arch deprecation not accounted for in CUDA v12

    Description

    When automatic GPU detection fails, PTX code will be compiled for a list of common targets: https://github.com/arrayfire/arrayfire/blob/138f12e9f181b8a7bd013323137931aec0f3bd59/CMakeModules/select_compute_arch.cmake#L33

    This list is modified by various CUDA version checks to account for deprecation, e.g.:

    https://github.com/arrayfire/arrayfire/blob/138f12e9f181b8a7bd013323137931aec0f3bd59/CMakeModules/select_compute_arch.cmake#L83

    Proposed fix

    Add similar checks for CUDA 12 to allow build.

    Edit: Further debugging revealed that CUDA_COMMON_ARCHITECTURES is in fact being correctly created (even on CUDA v12) with deprecated architectures removed, but it happens too late, and then gets mysteriously overwritten later on.

    The following screenshot shows debug prints for CUDA_COMMON_ARCHITECTURES at various stages:

    1. After the last version check that removes deprecated compute architectures https://github.com/arrayfire/arrayfire/blob/138f12e9f181b8a7bd013323137931aec0f3bd59/CMakeModules/select_compute_arch.cmake#L95
    2. At the beginning of function(CUDA_DETECT_INSTALLED_GPUS OUT_VARIABLE) https://github.com/arrayfire/arrayfire/blob/138f12e9f181b8a7bd013323137931aec0f3bd59/CMakeModules/select_compute_arch.cmake#L112

    [Image: debug prints of CUDA_COMMON_ARCHITECTURES]

    Error Log

    nvcc fatal : Unsupported gpu architecture 'compute_35'

    Build Environment

    Compiler version: clang version 10.0.0-4ubuntu1
    Operating system: Ubuntu 22.04 (WSL)
    Build environment: CUDA 12.0, vcpkg

    build 
    opened by mikex86 0
  • [Build] liblapack_static.a does not exist in CUDA 12

    Description

    liblapack_static.a does not exist in CUDA 12; arrayfire/src/backend/cuda/CMakeLists.txt tries to find this library and fails. A possible workaround without a fix is to create a symlink from liblapack_static.a to libcusolver_lapack_static.a; the file seems to have been renamed.

    Proposed fix

    Look for libcusolver_lapack_static.a after appropriate CUDA version checks in https://github.com/arrayfire/arrayfire/blob/138f12e9f181b8a7bd013323137931aec0f3bd59/src/backend/cuda/CMakeLists.txt#L102

    Build Environment

    Compiler version: clang version 10.0.0-4ubuntu1
    Operating system: Ubuntu 22.04 (WSL)
    Build environment: CUDA 12.0, vcpkg

    build 
    opened by mikex86 0
  • [BUG] Can't convert sparse COO matrix format to dense matrix format on OpenCL (3.8.2)

    Here is a very simple snippet that works on the CPU backend but not on my OpenCL backend. If you can't reproduce it on your end, please let me know if you need something more.

    The documentation says: "Converting storage formats is allowed between AF_STORAGE_CSR, AF_STORAGE_COO and AF_STORAGE_DENSE." https://arrayfire.org/docs/group__sparse__func__convert__to.htm


    int nSize = 5;

    // Each row holds 0, 1, 2, 3, 4 (range along dimension 1)
    af::array arrayA = af::range(af::dim4(nSize, nSize), 1, f32);

    // Zero out every other column (columns 0, 2 and 4)
    for (int ii = 0; ii < nSize; ii++) {
        for (int jj = 0; jj < nSize; jj = jj + 2)
            arrayA(ii, jj) = 0.0;
    }

    af_print(arrayA);

    // Dense -> COO
    af::array arrayB = af::sparseConvertTo(arrayA, AF_STORAGE_COO);

    af_print(arrayB);

    // COO -> dense (this is the step that fails on OpenCL)
    af::array arrayC = af::sparseConvertTo(arrayB, AF_STORAGE_DENSE);

    af_print(arrayC);

    std::cout << "af::max<float>(abs(arrayA - arrayC)) = " << af::max<float>(abs(arrayA - arrayC));
    

    Here is the output on OpenCL:

    ArrayFire v3.8.2 (OpenCL, 64-bit Windows, build 5752f2dc)
    [0] AMD: gfx906, 16368 MB
    arrayA
    [5 5 1 1]
        0.0000 1.0000 0.0000 3.0000 0.0000
        0.0000 1.0000 0.0000 3.0000 0.0000
        0.0000 1.0000 0.0000 3.0000 0.0000
        0.0000 1.0000 0.0000 3.0000 0.0000
        0.0000 1.0000 0.0000 3.0000 0.0000

    arrayB
    Storage Format : AF_STORAGE_COO
    [5 5 1 1]
    arrayB: Values
    [10 1 1 1]
        1.0000 1.0000 1.0000 1.0000 1.0000 3.0000 3.0000 3.0000 3.0000 3.0000

    arrayB: RowIdx
    [10 1 1 1]
        0 1 2 3 4 0 1 2 3 4

    arrayB: ColIdx
    [10 1 1 1]
        1 1 1 1 1 3 3 3 3 3

    In function opencl::buildProgram
    In file src\backend\opencl\compile_module.cpp:128
    OpenCL Device: gfx906
    Options: -cl-std=CL2.0 -D dim_t=long -D T=float -D resp=32
    Log:
    C:\Users--\AppData\Local\Temp\OCL23708T1.cl:28:54: error: use of undeclared identifier 'reps'
        const int id = get_group_id(0) * get_local_size(0) * reps + get_local_id(0);
    C:\Users--\AppData\Local\Temp\OCL23708T1.cl:31:35: error: use of undeclared identifier 'reps'
        for (int i = get_local_id(0); i < reps * dimSize; i += dimSize) {
    2 errors generated.

    error: Clang front-end compilation failed! Frontend phase failed compilation. Error: Compiling CL to IR

    0# af::array::host in afopencl
    1# af::array::host in afopencl
    2# 0x00007FFBB6321080 in VCRUNTIME140_1
    3# _NLG_Return2 in VCRUNTIME140_1
    4# RtlCaptureContext2 in ntdll
    5# af::array::host in afopencl
    6# af::array::host in afopencl
    7# af::array::host in afopencl
    8# af::array::host in afopencl
    9# af::array::host in afopencl
    10# af::array::host in afopencl
    11# af::array::host in afopencl
    12# af::array::host in afopencl
    13# af::array::host in af
    14# af::array::host in af
    15# main at XYZ:99
    16# invoke_main at D:\a_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:79
    17# __scrt_common_main_seh at D:\a_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:288
    18# __scrt_common_main at D:\a_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:331
    19# mainCRTStartup at D:\a_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_main.cpp:17
    20# BaseThreadInitThunk in KERNEL32
    21# RtlUserThreadStart in ntdll
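
    For reference, the result the dense conversion should produce can be worked out by hand from the Values, RowIdx and ColIdx arrays printed above. The following is a minimal sketch in plain C++, not ArrayFire's implementation: cooToDense is a hypothetical helper, and it assumes column-major storage and no duplicate coordinates.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of ArrayFire): expands COO triplets into a
// dense column-major buffer of rows * cols elements.
// Assumes each (row, col) coordinate appears at most once.
std::vector<float> cooToDense(int rows, int cols,
                              const std::vector<float>& values,
                              const std::vector<int>& rowIdx,
                              const std::vector<int>& colIdx) {
    assert(rows > 0 && cols > 0);
    assert(values.size() == rowIdx.size() && values.size() == colIdx.size());

    std::vector<float> dense(static_cast<std::size_t>(rows) * cols, 0.0f);
    for (std::size_t k = 0; k < values.size(); ++k) {
        assert(rowIdx[k] >= 0 && rowIdx[k] < rows);
        assert(colIdx[k] >= 0 && colIdx[k] < cols);
        // Column-major layout: element (r, c) is stored at r + c * rows.
        const std::size_t offset = static_cast<std::size_t>(rowIdx[k]) +
                                   static_cast<std::size_t>(colIdx[k]) * rows;
        dense[offset] = values[k];
    }
    return dense;
}
```

    Feeding it the ten triplets from the log gives a 5x5 buffer in which column 1 holds 1.0, column 3 holds 3.0 and everything else is zero, which matches arrayA.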


    bug 
    opened by sebastienleclaire 0
  • Option to avoid freeing device memory when the array is created with af_device_array


    The current af_device_array function takes ownership of the memory pointer passed to it. However, this is not necessarily desirable when interacting with other frameworks, as the memory may be managed independently by another library.

    Description

    • Allow interaction between several frameworks running at the same time in the same application. For example, I would like to allow CUDA device memory to be sent from CuPy/Torch/JAX to ArrayFire and back.
    • Add a new function based on af_device_array that turns off the memory freeing when the reference count goes to zero inside arrayfire.

    The current API of af_device_array is like this:

        AFAPI af_err af_device_array(af_array *arr, void *data, const unsigned ndims, const dim_t * const dims, const af_dtype type);
    

    which provides no way to specify that the memory should not be freed.

    • Manipulating the memory manager does not sound possible either, as other operations should proceed as usual.
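
    The requested behaviour can be illustrated outside ArrayFire. The sketch below is plain C++ and makes no claim about how ArrayFire would implement it: Ownership and wrapBuffer are hypothetical names, and std::shared_ptr stands in for the library's internal reference counting.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical names for illustration only; none of these exist in ArrayFire.
enum class Ownership { Take, Borrow };

// Wraps an externally allocated buffer in a reference-counted handle.
// `release` is whatever function frees the buffer (for example a device
// free call supplied by the caller).
std::shared_ptr<void> wrapBuffer(void* data, Ownership mode,
                                 std::function<void(void*)> release) {
    assert(data != nullptr);
    if (mode == Ownership::Take) {
        // Owning handle: the buffer is released when the count reaches zero.
        return std::shared_ptr<void>(data, std::move(release));
    }
    // Borrowed handle: the count still works, but reaching zero is a no-op,
    // so the framework that allocated the buffer stays in charge of it.
    return std::shared_ptr<void>(data, [](void*) {});
}
```

    With Ownership::Borrow the buffer outlives every handle, which is the property needed when another framework owns the allocation.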
    feature 
    opened by lavaux 1
  • [BUG] 3.8.2 fails to build on cuda 12


    Trying to compile v3.8.2 against cuda 12 with gcc 12 results in:

    cd /build/arrayfire/src/arrayfire-full-3.8.2/build/src/backend/cuda/CMakeFiles/af_cuda_static_cuda_library.dir && /usr/bin/cmake -E make_directory /build/arrayfire/src/arrayfire-full-3.8.2/build/src/backend/cuda/CMakeFiles/af_cuda_static_cuda_library.dir//. && /usr/bin/cmake -D verbose:BOOL=OFF -D build_configuration:STRING=Release -D generated_file:STRING=/build/arrayfire/src/arrayfire-full-3.8.2/build/src/backend/cuda/CMakeFiles/af_cuda_static_cuda_library.dir//./af_cuda_static_cuda_library_generated_sparse_blas.cu.o -D generated_cubin_file:STRING=/build/arrayfire/src/arrayfire-full-3.8.2/build/src/backend/cuda/CMakeFiles/af_cuda_static_cuda_library.dir//./af_cuda_static_cuda_library_generated_sparse_blas.cu.o.cubin.txt -P /build/arrayfire/src/arrayfire-full-3.8.2/build/src/backend/cuda/CMakeFiles/af_cuda_static_cuda_library.dir//af_cuda_static_cuda_library_generated_sparse_blas.cu.o.Release.cmake
    /build/arrayfire/src/arrayfire-full-3.8.2/src/backend/cuda/sparse_blas.cu(45): error: identifier "CUSPARSE_CSRMV_ALG1" is undefined
    
    /build/arrayfire/src/arrayfire-full-3.8.2/src/backend/cuda/sparse_blas.cu(55): error: identifier "CUSPARSE_MV_ALG_DEFAULT" is undefined
    
    /build/arrayfire/src/arrayfire-full-3.8.2/src/backend/cuda/sparse_blas.cu(66): error: identifier "CUSPARSE_CSRMM_ALG1" is undefined
    
    /build/arrayfire/src/arrayfire-full-3.8.2/src/backend/cuda/sparse_blas.cu(76): error: identifier "CUSPARSE_CSRMM_ALG1" is undefined
    

    Apparently some stuff got removed in cuda 12.

    bug 
    opened by svenstaro 0
Releases(v3.8.2)
  • v3.8.2(May 19, 2022)

    v3.8.2

    Improvements

    • Optimize JIT by removing some consecutive cast operations #3031
    • Add driver checks for CUDA 11.5 and 11.6 #3203
    • Improve the timing algorithm used for timeit #3185
    • Dynamically link against CUDA numeric libraries by default #3205
    • Add support for pruning CUDA binaries to reduce static binary sizes #3234 #3237
    • Remove unused cuDNN libraries from installations #3235
    • Add support for statically linking NVRTC libraries after CUDA 11.5 #3236
    • Add support for compiling with ccache when building the CUDA backend #3241

    Fixes

    • Fix issue with consecutive moddims operations in the CPU backend #3232
    • Better floating point comparisons for tests #3212
    • Fix several warnings and inconsistencies with doxygen and documentation #3226
    • Fix issue when passing empty arrays into join #3211
    • Fix default value for the AF_COMPUTE_LIBRARY when not set #3228
    • Fix missing symbol issue when MKL is statically linked #3244
    • Remove linking of OpenCL's library to the unified backend #3244

    Contributions

    Special thanks to our contributors: Jacob Kahn, Willy Born

    arrayfire-full-3.8.2.tar.bz2(64.07 MB)
  • v3.8.1(Jan 18, 2022)

    v3.8.1

    Improvements

    • moddims now uses JIT approach for certain special cases - #3177
    • Embed Version Info in Windows DLLs - #3025
    • OpenCL device max parameter is now queried from device properties - #3032
    • JIT Performance Optimization: Unique funcName generation sped up - #3040
    • Improved readability of log traces - #3050
    • Use short function name in non-debug build error messages - #3060
    • SIFT/GLOH are now available as part of website binaries - #3071
    • Short-circuit zero elements case in detail::copyArray backend function - #3059
    • Speedup of kernel caching mechanism - #3043
    • Add short-circuit check for empty Arrays in JIT evalNodes - #3072
    • Performance optimization of indexing using dynamic thread block sizes - #3111
    • Starting with this release, ArrayFire uses the Intel MKL single dynamic library, which resolves many of the linking issues the unified library had when user applications used MKL themselves - #3120
    • Add shortcut check for zero elements in af_write_array - #3130
    • Speedup join by eliminating temp buffers for cascading joins - #3145
    • Added batch support for solve - #1705
    • Use pinned memory to copy device pointers in CUDA solve - #1705
    • Added package manager instructions to docs - #3076
    • CMake Build Improvements - #3027 , #3089 , #3037 , #3072 , #3095 , #3096 , #3097 , #3102 , #3106 , #3105 , #3120 , #3136 , #3135 , #3137 , #3119 , #3150 , #3138 , #3156 , #3139 , #1705 , #3162
    • CPU backend improvements - #3010 , #3138 , #3161
    • CUDA backend improvements - #3066 , #3091 , #3093 , #3125 , #3143 , #3161
    • OpenCL backend improvements - #3091 , #3068 , #3127 , #3010 , #3039 , #3138 , #3161
    • General (including JIT) performance improvements across backends - #3167
    • Testing improvements - #3072 , #3131 , #3151 , #3141 , #3153 , #3152 , #3157 , #1705 , #3170 , #3167
    • Update CLBlast to latest version - #3135 , #3179
    • Improved Otsu threshold computation helper in canny algorithm - #3169
    • Modified default parameters for fftR2C and fftC2R C++ API from 0 to 1.0 - #3178
    • Use appropriate MKL getrs_batch_strided API based on MKL Versions - #3181

    Fixes

    • Fixed a bug in JIT kernel disk caching - #3182
    • Fixed the stream used by thrust (CUDA backend) functions - #3029
    • Added workaround for new cuSparse API that was added by CUDA amid fix releases - #3057
    • Fixed const array indexing inside gfor - #3078
    • Handle zero elements in copyData to host - #3059
    • Fixed double free regression in OpenCL backend - #3091
    • Fixed an infinite recursion bug in NaryNode JIT Node - #3072
    • Added missing input validation check in sparse-dense arithmetic operations - #3129
    • Fixed bug in getMappedPtr in OpenCL due to invalid lambda capture - #3163
    • Fixed bug in getMappedPtr on Arrays that are not ready - #3163
    • Fixed edgeTraceKernel for CPU devices on OpenCL backend - #3164
    • Fixed Windows build issue(s) with VS2019 - #3048
    • API documentation fixes - #3075 , #3076 , #3143 , #3161
    • CMake Build Fixes - #3088
    • Fixed the tutorial link in README - #3033
    • Fixed function name typo in timing tutorial - #3028
    • Fixed a couple of bugs in the CPU backend canny implementation - #3169
    • Fixed the reference count of arrays used in JIT operations, which relates to ArrayFire's internal memory bookkeeping. The behavior and accuracy of ArrayFire code were not broken earlier; the fix brings the reference count to its optimal value in those scenarios, which may reduce memory usage in some narrow cases - #3167
    • Added assert that checks if topk is called with a negative value for k - #3176
    • Fixed an issue where countByKey would give incorrect results for any n > 128 - #3175

    Contributions

    Special thanks to our contributors: HO-COOH, Willy Born, Gilad Avidov, Pavan Yalamanchili

    arrayfire-full-3.8.1.tar.bz2(63.82 MB)
  • v3.7.3(Nov 23, 2020)

    v3.7.3

    Improvements

    • Add f16 support for histogram - #2984
    • Update confidence connected components example for better illustration - #2968
    • Enable disk caching of OpenCL kernel binaries - #2970
    • Refactor extension of kernel binaries stored to disk .bin - #2970
    • Add minimum driver versions for CUDA toolkit 11 in internal map - #2982
    • Improve warning messages from run-time kernel compilation functions - #2996

    Fixes

    • Fix bias factor of variance in var_all and cov functions - #2986
    • Fix a race condition in confidence connected components function for OpenCL backend - #2969
    • Safely ignore disk cache failures in CUDA backend for compiled kernel binaries - #2970
    • Fix randn by passing in correct values to Box-Muller - #2980
    • Fix rounding issues in Box-Muller function used for RNG - #2980
    • Fix problems in RNG for older compute architectures with fp16 - #2980 #2996
    • Fix performance regression of approx functions - #2977
    • Remove assert that checks that signal/filter types have to be the same - #2993
    • Fix checkAndSetDevMaxCompute when the device cc is greater than max - #2996
    • Fix documentation errors and warnings - #2973 , #2987
    • Add missing opencl-arrayfire interoperability functions in unified backend - #2981
    • Fix constexpr-related compilation error with VS2019 and Clang compilers - #3049
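
    The variance fix in the list above concerns the bias factor, that is, whether the sum of squared deviations is divided by N (population) or by N - 1 (sample). A minimal sketch of the difference in plain C++ follows; variance is a hypothetical helper, not the ArrayFire function.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not the ArrayFire API): variance of `x` with a
// selectable bias factor.
//   sample == false: divide by N      (population variance)
//   sample == true : divide by N - 1  (sample variance)
double variance(const std::vector<double>& x, bool sample) {
    assert(x.size() > (sample ? 1u : 0u));

    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= static_cast<double>(x.size());

    double sumSq = 0.0;
    for (double v : x) sumSq += (v - mean) * (v - mean);

    const double denom =
        static_cast<double>(x.size()) - (sample ? 1.0 : 0.0);
    return sumSq / denom;
}
```

    For the values 1, 2, 3, 4 the sum of squared deviations is 5, so the population variance is 5/4 and the sample variance is 5/3.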

    Contributions

    Special thanks to our contributors: P. J. Reed

    arrayfire-full-3.7.3.tar.bz2(51.45 MB)
  • v3.8.0(Jan 8, 2021)

    v3.8.0

    New Functions

    • Ragged max reduction - #2786
    • Initialization list constructor for array class - #2829 , #2987
    • New API for the following statistics functions: cov, var and stdev - #2986
    • Bit-wise operator support for array and C API (af_bitnot) - #2865
    • allocV2 and freeV2 which return cl_mem on OpenCL backend - #2911
    • Move constructor and move assignment operator for Dim4 class - #2946

    Improvements

    • Add f16 support for histogram - #2984
    • Update confidence connected components example for better illustration - #2968
    • Enable disk caching of OpenCL kernel binaries - #2970
    • Refactor extension of kernel binaries stored to disk .bin - #2970
    • Add minimum driver versions for CUDA toolkit 11 in internal map - #2982
    • Improve warning messages from run-time kernel compilation functions - #2996

    Fixes

    • Fix bias factor of variance in var_all and cov functions - #2986
    • Fix a race condition in confidence connected components function for OpenCL backend - #2969
    • Safely ignore disk cache failures in CUDA backend for compiled kernel binaries - #2970
    • Fix randn by passing in correct values to Box-Muller - #2980
    • Fix rounding issues in Box-Muller function used for RNG - #2980
    • Fix problems in RNG for older compute architectures with fp16 - #2980 #2996
    • Fix performance regression of approx functions - #2977
    • Remove assert that checks that signal/filter types have to be the same - #2993
    • Fix checkAndSetDevMaxCompute when the device cc is greater than max - #2996
    • Fix documentation errors and warnings - #2973 , #2987
    • Add missing opencl-arrayfire interoperability functions in unified backend - #2981
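
    Two of the fixes above mention the Box-Muller transform used by randn. As a reminder of what that transform computes, here is a minimal sketch of its basic form in plain C++. It is not ArrayFire's implementation, and boxMuller is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Basic Box-Muller transform: maps two uniform values, u1 in (0, 1] and
// u2 in [0, 1), to two independent standard normal values.
std::pair<double, double> boxMuller(double u1, double u2) {
    assert(u1 > 0.0 && u1 <= 1.0);  // log(0) would give an infinite radius
    assert(u2 >= 0.0 && u2 < 1.0);

    const double twoPi = 6.283185307179586;
    const double radius = std::sqrt(-2.0 * std::log(u1));
    const double angle = twoPi * u2;
    return {radius * std::cos(angle), radius * std::sin(angle)};
}
```

    The input ranges matter: feeding u1 = 0 makes the logarithm diverge, which is why the sketch asserts on the ranges before transforming.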

    Contributions

    Special thanks to our contributors: P. J. Reed

    arrayfire-full-3.8.0.tar.bz2(51.46 MB)
  • v3.8.rc(Oct 5, 2020)

    v3.8.0 Release Candidate

    New Functions

    • Ragged max reduction - #2786
    • Initialization list constructor for array class - #2829 , #2987
    • New API for the following statistics functions: cov, var and stdev - #2986
    • Bit-wise operator support for array and C API (af_bitnot) - #2865
    • allocV2 and freeV2 which return cl_mem on OpenCL backend - #2911
    • Move constructor and move assignment operator for Dim4 class - #2946

    Improvements

    • Add f16 support for histogram - #2984
    • Update confidence connected components example for better illustration - #2968
    • Enable disk caching of OpenCL kernel binaries - #2970
    • Refactor extension of kernel binaries stored to disk .bin - #2970
    • Add minimum driver versions for CUDA toolkit 11 in internal map - #2982
    • Improve warning messages from run-time kernel compilation functions - #2996

    Fixes

    • Fix bias factor of variance in var_all and cov functions - #2986
    • Fix a race condition in confidence connected components function for OpenCL backend - #2969
    • Safely ignore disk cache failures in CUDA backend for compiled kernel binaries - #2970
    • Fix randn by passing in correct values to Box-Muller - #2980
    • Fix rounding issues in Box-Muller function used for RNG - #2980
    • Fix problems in RNG for older compute architectures with fp16 - #2980 #2996
    • Fix performance regression of approx functions - #2977
    • Remove assert that checks that signal/filter types have to be the same - #2993
    • Fix checkAndSetDevMaxCompute when the device cc is greater than max - #2996
    • Fix documentation errors and warnings - #2973 , #2987
    • Add missing opencl-arrayfire interoperability functions in unified backend - #2981

    Contributions

    Special thanks to our contributors: P. J. Reed

    arrayfire-full-3.8.rc.tar.bz2(51.47 MB)
  • v3.7.2(Jul 13, 2020)

    v3.7.2

    Improvements

    • Cache CUDA kernels to disk to improve load times (Thanks to @cschreib-ibex) #2848
    • Statically link against CUDA libraries #2785
    • Make cuDNN an optional build dependency #2836
    • Improve support for different compilers and OS #2876 #2945 #2925 #2942 #2943 #2945
    • Improve performance of join and transpose on CPU #2849
    • Improve documentation #2816 #2821 #2846 #2918 #2928 #2947
    • Reduce binary size by using NVRTC and reducing template instantiations #2849 #2861 #2890
    • Improve reduceByKey performance on OpenCL by using builtin functions #2851
    • Improve support for Intel OpenCL GPUs #2855
    • Allow statically linking against MKL #2877 (Sponsored by SDL)
    • Better support for older CUDA toolkits #2923
    • Add support for CUDA 11 #2939
    • Add support for ccache for faster builds #2931
    • Add support for the conan package manager on linux #2875
    • Propagate build errors up the stack in AFError exceptions #2948 #2957
    • Improve runtime dependency library loading #2954
    • Improved cuDNN runtime checks and warnings #2960
    • Document af_memory_manager_* native memory return values #2911
    • Add support for cuDNN 8 #2963

    Fixes

    • Fix crash when allocating large arrays #2827
    • Fix various compiler warnings #2827 #2849 #2872 #2876
    • Fix minor leaks in OpenCL functions #2913
    • Various continuous integration related fixes #2819
    • Fix zero padding with convolve2NN #2820
    • Fix af_get_memory_pressure_threshold return value #2831
    • Increased the max filter length for morph
    • Handle empty array inputs for LU, QR, and Rank functions #2838
    • Fix FindMKL.cmake script for sequential threading library #2840
    • Various internal refactoring #2839 #2861 #2864 #2873 #2890 #2891 #2913
    • Fix OpenCL 2.0 builtin function name conflict #2851
    • Fix error caused when releasing memory with multiple devices #2867
    • Fix missing set stacktrace symbol from unified API #2915
    • Fix zero padding issue in convolve2NN #2820
    • Fixed bugs in ReduceByKey #2957
    • Add clblast patch to handle custom context with multiple devices #2967

    Contributions

    Special thanks to our contributors: Corentin Schreiber, Jacob Kahn, Paul Jurczak, Christoph Junghans

    arrayfire-full-3.7.2.tar.bz2(51.47 MB)
  • v3.7.1(Mar 28, 2020)

    v3.7.1

    Improvements

    • Improve mtx download for test data #2742
    • Improve Documentation #2754 #2792 #2797
    • Remove verbose messages in older CMake versions #2773
    • Reduce binary size with the use of NVRTC #2790
    • Use texture memory to load LUT in orb and fast #2791
    • Add missing print function for f16 #2784
    • Add checks for f16 support in the CUDA backend #2784
    • Create a thrust policy to intercept temporary buffer allocations #2806

    Fixes

    • Fix segfault on exit when ArrayFire is not initialized in the main thread
    • Fix support for CMake 3.5.1 #2771 #2772 #2760
    • Fix evalMultiple if the input array sizes aren't the same #2766
    • Fix error when AF_BACKEND_DEFAULT is passed directly to backend #2769
    • Workaround name collision with AMD OpenCL implementation #2802
    • Fix on-exit errors with the unified backend #2769
    • Fix check for f16 compatibility in OpenCL #2773
    • Fix matmul on Intel OpenCL when passing same array as input #2774
    • Fix CPU OpenCL blas batching #2774
    • Fix memory pressure in the default memory manager #2801

    Contributions

    Special thanks to our contributors: padentomasello, glavaux2

    arrayfire-full-3.7.1.tar.bz2(50.61 MB)
  • v3.7.0(Feb 13, 2020)

    v3.7.0

    Major Updates

    • Added the ability to customize the memory manager (Thanks jacobkahn and flashlight) [#2461]
    • Added 16-bit floating point support for several functions [#2413] [#2587] [#2585] [#2587] [#2583]
    • Added sumByKey, productByKey, minByKey, maxByKey, allTrueByKey, anyTrueByKey, countByKey [#2254]
    • Added confidence connected components [#2748]
    • Added neural network based convolution and gradient functions [#2359]
    • Added a padding function [#2682]
    • Added pinverse for pseudo inverse [#2279]
    • Added support for uniform ranges in approx1 and approx2 functions. [#2297]
    • Added support to write to preallocated arrays for some functions [#2599] [#2481] [#2328] [#2327]
    • Added meanvar function [#2258]
    • Added support for sparse-sparse arithmetic [#2312]
    • Added rsqrt function for reciprocal square root [#2500]
    • Added a lower level af_gemm function for general matrix multiplication [#2481]
    • Added a function to set the cuBLAS math mode for the CUDA backend [#2584]
    • Separate debug symbols into separate files [#2535]
    • Print stacktraces on errors [#2632]
    • Support move constructor for af::array [#2595]
    • Expose events in the public API [#2461]
    • Add setAxesLabelFormat to format labels on graphs [#2495]
    • Added deconvolution functions [#1881]
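
    The by-key reductions in the list above (sumByKey and its siblings) can be pictured with a short sketch in plain C++. It assumes that a by-key reduction combines each run of consecutive equal keys, in the style of Thrust's reduce_by_key; that assumption should be checked against the ArrayFire documentation, and sumByKeyRuns is a hypothetical helper, not the library function.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not the ArrayFire API): sums `vals` over each run of
// consecutive equal `keys` and returns the compacted (keys, sums) pair.
std::pair<std::vector<int>, std::vector<float>>
sumByKeyRuns(const std::vector<int>& keys, const std::vector<float>& vals) {
    assert(keys.size() == vals.size());

    std::vector<int> outKeys;
    std::vector<float> outVals;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        // A key that differs from the current run's key starts a new run.
        if (outKeys.empty() || outKeys.back() != keys[i]) {
            outKeys.push_back(keys[i]);
            outVals.push_back(0.0f);
        }
        outVals.back() += vals[i];
    }
    return {outKeys, outVals};
}
```

    Under this assumption a key that reappears after a different key starts a new output entry instead of being merged with its earlier run.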

    Improvements

    • Better error messages for systems with driver or device incompatibilities [#2678] [#2448][#2761]
    • Optimized unified backend function calls [#2695]
    • Optimized anisotropic smoothing [#2713]
    • Optimized canny filter for CUDA and OpenCL [#2727]
    • Better MKL search script [#2738][#2743][#2745]
    • Better logging of different submodules in ArrayFire [#2670] [#2669]
    • Improve documentation [#2665] [#2620] [#2615] [#2639] [#2628] [#2633] [#2622] [#2617] [#2558] [#2326][#2515]
    • Optimized af::array assignment [#2575]
    • Update the k-means example to display the result [#2521]

    Fixes

    • Fix multi-config generators [#2736]
    • Fix access errors in canny [#2727]
    • Fix segfault in the unified backend if no backends are available [#2720]
    • Fix access errors in scan-by-key [#2693]
    • Fix sobel operator [#2600]
    • Fix an issue with the random number generator and s16 [#2587]
    • Fix issue with boolean product reduction [#2544]
    • Fix array_proxy move constructor [#2537]
    • Fix convolve3 launch configuration [#2519]
    • Fix an issue where the fft function modified the input array [#2520]
    • Added a work around for nvidia-opencl runtime if forge dependencies are missing [#2761]

    Contributions

    Special thanks to our contributors: @jacobkahn @WilliamTambellini @lehins @r-barnes @gaika @ShalokShalom

  • v3.6.4(May 20, 2019)

    v3.6.4

    The source code with sub-modules can be downloaded directly from the following link:

    http://arrayfire.com/arrayfire_source/arrayfire-full-3.6.4.tar.bz2

    Fixes

    • Address a JIT performance regression due to moving kernel arguments to shared memory #2501
    • Fix the default parameter for setAxisTitle #2491
  • v3.6.3(Apr 22, 2019)

    v3.6.3

    The source code with sub-modules can be downloaded directly from the following link:

    http://arrayfire.com/arrayfire_source/arrayfire-full-3.6.3.tar.bz2

    Improvements

    • Graphics are now a runtime dependency instead of a link time dependency #2365
    • Reduce the CUDA backend binary size using runtime compilation of kernels #2437
    • Improved batched matrix multiplication on the CPU backend by using Intel MKL's cblas_Xgemm_batched #2206
    • Print JIT kernels to disk or stream using the AF_JIT_KERNEL_TRACE environment variable #2404
    • void* pointers are now allowed as arguments to af::array::write() #2367
    • Slightly improve the efficiency of JITed tile operations #2472
    • Make random number generation on the CPU backend consistent with CUDA and OpenCL #2435
    • Handled very large JIT tree generations #2484 #2487

    Bug Fixes

    • Fixed af::array::array_proxy move assignment operator #2479
    • Fixed input array dimensions validation in svdInplace() #2331
    • Fixed the typedef declaration for window resource handle #2357.
    • Increase compatibility with GCC 8 #2379
    • Fixed af::write tests #2380
    • Fixed a bug in broadcast step of 1D exclusive scan #2366
    • Fixed OpenGL related build errors on OSX #2382
    • Fixed multiple array evaluation. Performance improvement. #2384
    • Fixed buffer overflow and expected output of kNN SSD small test #2445
    • Fixed MKL linking order to enable threaded BLAS #2444
    • Added validations for forge module plugin availability before calling resource cleanup #2443
    • Improve compatibility on MSVC toolchain(_MSC_VER > 1914) with the CUDA backend #2443
    • Fixed BLAS gemm func generators for newest MSVC 19 on VS 2017 #2464
    • Fix errors on exits when using the cuda backend with unified #2470

    Documentation

    • Updated svdInplace() documentation following a bugfix #2331
    • Fixed a typo in matrix multiplication documentation #2358
    • Fixed a code snippet demonstrating C-API use #2406
    • Updated hamming matcher implementation limitation #2434
    • Added illustration for the rotate function #2453

    Misc

    • Use cudaMemcpyAsync instead of cudaMemcpy throughout the codebase #2362
    • Display a more informative error message if CUDA driver is incompatible #2421 #2448
    • Changed forge resource management to use smart pointers #2452
    • Deprecated intl and uintl typedefs in API #2360
    • Enabled graphics by default for all builds starting with v3.6.3 #2365
    • Fixed several warnings #2344 #2356 #2361
    • Refactored initArray() calls to use createEmptyArray(). initArray() is for internal use only by Array class. #2361
    • Refactored void* memory allocations to use unsigned char type #2459
    • Replaced deprecated MKL API with in-house implementations for sparse to sparse/dense conversions #2312
    • Reorganized and fixed some internal backend API #2356
    • Updated compilation order of CUDA files to speed up compile time #2368
    • Removed conditional graphics support builds after enabling runtime loading of graphics dependencies #2365
    • Marked graphics dependencies as optional in CPack RPM config #2365
    • Refactored a sparse arithmetic backend API #2379
    • Fixed const correctness of af_device_array API #2396
    • Update Forge to v1.0.4 #2466
    • Manage Forge resources from the DeviceManager class #2381
    • Fixed non-mkl & non-batch blas upstream call arguments #2401
    • Link MKL with OpenMP instead of TBB by default
    • use clang-format to format source code

    Contributions

    Special thanks to our contributors: Alessandro Bessi, zhihaoy, Jacob Kahn, William Tambellini

  • v3.6.2(Nov 29, 2018)

    v3.6.2

    The source code with sub-modules can be downloaded directly from the following link:

    http://arrayfire.com/arrayfire_source/arrayfire-full-3.6.2.tar.bz2

    Features

    • Batching support for cond argument in select() [#2243]
    • Broadcast batching for matmul [#2315]
    • Add support for multiple nearest neighbours from nearestNeighbour() [#2280]

    Improvements

    • Performance improvements in morph() [#2238]
    • Fix linking errors when compiling without Freeimage/Graphics [#2248]
    • Fixes to improve the usage of ArrayFire as a sub-project [#2290]
    • Allow custom library path for loading dynamic backend libraries [#2302]

    Bug fixes

    • Fix overflow in dim4::ndims. [#2289]
    • Remove setDevice from af::array destructor [#2319]
    • Fix pow precision for integral types [#2305]
    • Fix issues with tile with a large repeat dimension [#2307]
    • Fix grid based indexing calculation in af_draw_hist [#2230]
    • Fix bug when using an af::array for indexing [#2311]
    • Fix CLBlast errors on exit on Windows [#2222]

    Documentation

    • Improve unwrap documentation [#2301]
    • Improve wrap documentation [#2320]
    • Fix and improve accum documentation [#2298]
    • Improve tile documentation [#2293]
    • Clarify approx* indexing in documentation [#2287]
    • Update examples of select in detailed documentation [#2277]
    • Update lookup examples [#2288]
    • Update set documentation [#2299]

    Misc

    • New ArrayFire ASSERT utility functions [#2249][#2256][#2257][#2263]
    • Improve error messages in JIT [#2309]
    • af* library and dependencies directory changed to lib64 [#2186]

    Contributions

    Thank you to our contributors: Jacob Kahn, Vardan Akopian

  • v3.6.1(Jul 6, 2018)

    v3.6.1

    The source code for this release can be downloaded here: http://arrayfire.com/arrayfire_source/arrayfire-full-3.6.1.tar.bz2

    Improvements

    • FreeImage is now a run-time dependency [#2164]
    • Reduced binary size by setting the symbol visibility to hidden [#2168]
    • Add logging to memory manager and unified loader using the AF_TRACE environment variable [#2169][#2216]
    • Improved CPU Anisotropic Diffusion performance [#2174]
    • Perform normalization after FFT for improved accuracy [#2185, #2192]
    • Updated CLBlast to v1.4.0 [#2178]
    • Added additional validation when using af::seq for indexing [#2153]
    • Perform checks for unsupported cards by the CUDA implementation [#2182]
    • Avoid selecting backend if no devices are found. [#2218]

    Bug Fixes

    • Fixed region when all pixels were the foreground or background [#2152]
    • Fixed several memory leaks [#2202, #2201, #2180, #2179, #2177, #2175]
    • Fixed bug in setDevice which didn't allow you to select the last device [#2189]
    • Fixed bug in min/max where the first element of the array was a NaN value [#2155]
    • Fixed graphics window indexing [#2207]
    • Fixed renaming issue when installing cuda libraries on OSX [#2221]
    • Fixed NSIS installer PATH variable [#2223]
  • v3.6.0(May 4, 2018)

    v3.6.0

    The source code with submodules can be downloaded directly from the following link: http://arrayfire.com/arrayfire_source/arrayfire-full-3.6.0.tar.bz2

    Major Updates

    • Added the topk() function.
    • Added batched matrix multiply support.
    • Added anisotropic diffusion, anisotropicDiffusion().

    Features

    • Added support for batched matrix multiply.
    • New anisotropic diffusion function, anisotropicDiffusion().
    • New topk() function, which returns the top k elements along a given dimension of the input.
    • New gradient diffusion example.
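
    To make the topk() entry concrete, the sketch below shows a top-k selection on a one-dimensional input in plain C++. It is an illustration of the operation only: topkValues is a hypothetical helper, it returns values without indices, and it says nothing about how ArrayFire implements topk().

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper (not the ArrayFire API): returns the k largest values
// of `x` in descending order. `x` is taken by value so the caller's data is
// left untouched.
std::vector<float> topkValues(std::vector<float> x, std::size_t k) {
    assert(k <= x.size());

    // Only the first k positions need to end up sorted.
    const auto middle = x.begin() + static_cast<std::ptrdiff_t>(k);
    std::partial_sort(x.begin(), middle, x.end(), std::greater<float>());
    x.resize(k);
    return x;
}
```

    A partial sort is used because only the first k positions have to be ordered, which avoids sorting the whole input when k is small.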

    Improvements

    • JITed select() and shift() functions for CUDA and OpenCL backends.
    • Significant CMake improvements.
    • Improved the quality of the random number generator.
    • Corrected assert function calls in select() tests.
    • Modified af_colormap struct to match forge's definition.
    • Improved Black Scholes example.
    • Used CPack to generate installers. We will be using CPack to generate installers beginning with this release.
    • Refactored black_scholes_options example to use built-in af::erfc function for cumulative normal distribution.
    • Reduced the scope of mutexes in memory manager.
    • Official installers do not require the CUDA toolkit to be installed starting with v3.6.0.

    Bug fixes

    • Fixed shfl_down() warnings with CUDA 9.
    • Disabled CUDA JIT debug flags on ARM architecture.
    • Fixed CLBlast install lib dir for Linux platforms where the lib directory has an arch (64) suffix.
    • Fixed assert condition in 3d morph opencl kernel.
    • Fixed JIT errors with large non-linear kernels.
    • Fixed bug in CPU JIT after moddims was called.
    • Fixed a deadlock scenario caused by the method MemoryManager::nativeFree.

    Documentation

    • Fixed variable name typo in vectorization.md. 1
    • Fixed AF_API_VERSION value in Doxygen config file. 2

    Known issues

    • NVCC does not currently support platform toolset v141 (Visual Studio 2017 R15.6). Use the v140 platform toolset instead. You may pass the toolset version to CMake via the -T flag, like so: cmake -G "Visual Studio 15 2017 Win64" -T v140.
      • To download and install other platform toolsets, visit https://blogs.msdn.microsoft.com/vcblog/2017/11/15/side-by-side-minor-version-msvc-toolsets-in-visual-studio-2017
    • Several OpenCL tests failing on OSX:
      • canny_opencl, fft_opencl, gen_assign_opencl, homography_opencl, reduce_opencl, scan_by_key_opencl, solve_dense_opencl, sparse_arith_opencl, sparse_convert_opencl, where_opencl

    Contributions

    Special thanks to our contributors: Adrien F. Vincent, Cedric Nugteren, Felix, Filip Matzner, HoneyPatouceul, Patrick Lavin, Ralf Stubner, William Tambellini

  • v3.5.1(Sep 19, 2017)

    v3.5.1

    The source code with submodules can be downloaded directly from the following link: http://arrayfire.com/arrayfire_source/arrayfire-full-3.5.1.tar.bz2

    Installer CUDA Version: 8.0 (Required)
    Installer OpenCL Version: 1.2 (Minimum)

    Improvements

    • Relaxed af::unwrap() function's arguments. 1
    • Changed behavior of af::array::allocated() to specify memory allocated. 1
    • Removed restriction on the number of bins for af::histogram() on CUDA and OpenCL kernels. 1

    Performance

    • Improved JIT performance. 1
    • Improved CPU element-wise operation performance. 1
    • Improved regions performance using texture objects. 1

    Bug fixes

    • Fixed overflow issues in mean. 1
    • Fixed memory leak when chaining indexing operations. 1
    • Fixed bug in array assignment when using an empty array to index. 1
    • Fixed bug with af::matmul() which occurred when its RHS argument was an indexed vector. 1
    • Fixed deadlock bug when a sparse array was used with a JIT array. 1
    • Fixed pixel tests for FAST kernels. 1
    • Fixed af::replace so that it is now copy-on-write. 1
    • Fixed launch configuration issues in CUDA JIT. 1
    • Fixed segfaults and "Pure Virtual Call" error warnings when exiting on Windows. 1 2
    • Workaround for clEnqueueReadBuffer bug on OSX. 1

    Build

    • Fixed issues when compiling with GCC 7.1. 1 2
    • Eliminated unnecessary Boost dependency from CPU and CUDA backends. 1

    Misc

    • Updated support links to point to Slack instead of Gitter. 1
  • v3.5.0(Jun 23, 2017)

    v3.5.0

    The source code with submodules can be downloaded directly from the following link: http://arrayfire.com/arrayfire_source/arrayfire-full-3.5.0.tar.bz2

    Installer CUDA Version: 8.0 (Required)
    Installer OpenCL Version: 1.2 (Minimum)

    Major Updates

    • ArrayFire now supports threaded applications. 1
    • Added Canny edge detector. 1
    • Added Sparse-Dense arithmetic operations. 1

    Features

    • ArrayFire Threading
      • af::array can be read by multiple threads
      • All ArrayFire functions can be executed concurrently by multiple threads
      • Threads can operate on different devices to simplify multi-device workloads
    • New Canny edge detector function, af::canny(). 1
      • Can automatically calculate high threshold with AF_CANNY_THRESHOLD_AUTO_OTSU
      • Supports both L1 and L2 Norms to calculate gradients
    • New tuned OpenCL BLAS backend, CLBlast.

    Improvements

    • Converted CUDA JIT to use NVRTC instead of NVVM.
    • Performance improvements in af::reorder(). 1
    • Performance improvements in array::scalar(). 1
    • Improved unified backend performance. 1
    • ArrayFire now depends on Forge v1.0. 1
    • Can now specify the FFT plan cache size using the af::setFFTPlanCacheSize() function.
    • Get the number of physical bytes allocated by the memory manager af_get_allocated_bytes(). 1
    • af::dot() can now return a scalar value to the host. 1

    Bug Fixes

    • Fixed improper release of default Mersenne random engine. 1
    • Fixed af::randu() and af::randn() ranges for floating point types. 1
    • Fixed assignment bug in CPU backend. 1
    • Fixed complex (c32,c64) multiplication in OpenCL convolution kernels. 1
    • Fixed inconsistent behavior with af::replace() and replace_scalar(). 1
    • Fixed memory leak in af_fir(). 1
    • Fixed memory leaks in af_cast for sparse arrays. 1
    • Fixed correctness of af_pow for complex numbers by using Cartesian form. 1
    • Corrected af::select() with indexing in CUDA and OpenCL backends. 1
    • Workaround for VS2015 compiler ternary bug. 1
    • Fixed memory corruption in cuda::findPlan(). 1
    • Argument checks in af_create_sparse_array now avoid inputs of type int64. 1

    Build fixes

    • On OSX, utilize new GLFW package from the brew package manager. 1 2
    • Fixed CUDA PTX names generated by CMake v3.7. 1
    • Support gcc > 5.x for CUDA. 1

    Examples

    • New genetic algorithm example. 1

    Documentation

    • Updated README.md to improve readability and formatting. 1
    • Updated README.md to mention Julia and Nim wrappers. 1
    • Improved installation instructions - docs/pages/install.md. 1

    Miscellaneous

    • A few improvements for ROCm support. 1
    • Removed CUDA 6.5 support. 1

    Known issues

    • Windows
      • The Windows NVIDIA driver version 37x.xx contains a bug which causes fftconvolve_opencl to fail. Upgrade or downgrade to a different version of the driver to avoid this failure.
      • The following tests fail on Windows with NVIDIA hardware: threading_cuda, qr_dense_opencl, solve_dense_opencl.
    • macOS
      • The Accelerate framework, used by the CPU backend on macOS, leverages Intel graphics cards (Iris) when there are no discrete GPUs available. This OpenCL implementation is known to give incorrect results on the following tests: lu_dense_{cpu,opencl}, solve_dense_{cpu,opencl}, inverse_dense_{cpu,opencl}.
      • Certain tests intermittently fail on macOS with NVIDIA GPUs apparently due to inconsistent driver behavior: fft_large_cuda and svd_dense_cuda.
      • The following tests are currently failing on macOS with AMD GPUs: cholesky_dense_opencl and scan_by_key_opencl.
  • v3.4.2(Dec 21, 2016)

    v3.4.2

    The source code with submodules can be downloaded directly from the following link: http://arrayfire.com/arrayfire_source/arrayfire-full-3.4.2.tar.bz2

    Installer CUDA Version: 8.0 (Required)
    Installer OpenCL Version: 1.2 (Minimum)

    Deprecation Announcement

    This release supports CUDA 6.5 and higher. The next ArrayFire release will support CUDA 7.0 and higher, dropping support for CUDA 6.5. Reasons for no longer supporting CUDA 6.5 include:

    • CUDA 7.0 NVCC supports the C++11 standard (whereas CUDA 6.5 does not), which is used by ArrayFire's CPU and OpenCL backends.
    • Very few ArrayFire users still use CUDA 6.5.

    As a result, the older Jetson TK1 / Tegra K1 will no longer be supported in the next ArrayFire release. The newer Jetson TX1 / Tegra X1 will continue to have full capability with ArrayFire.

    Docker

    Improvements

    • Implemented sparse storage format conversions between AF_STORAGE_CSR and AF_STORAGE_COO. 1
      • Directly convert between AF_STORAGE_COO <--> AF_STORAGE_CSR using the af::sparseConvertTo() function.
      • af::sparseConvertTo() now also supports converting to dense.
    • Added cast support for sparse arrays. 1
      • Casting only changes the values array and the type. The row and column index arrays are not changed.
    • Reintroduced automated computation of chart axes limits for graphics functions. 1
      • The axes limits will always be the minimum/maximum of the current and new limit.
      • The user can still set limits from API calls. If the user sets a limit from the API call, then the automatic limit setting will be disabled.
    • Using boost::scoped_array instead of boost::scoped_ptr when managing array resources. 1
    • Internal performance improvements to getInfo() by using const references to avoid unnecessary copying of ArrayInfo objects. 1
    • Added support for scalar af::array inputs for af::convolve() and set functions. 1 2 3
    • Performance fixes in af::fftConvolve() kernels. 1 2
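
    The COO-to-CSR conversion listed above can be illustrated with a small standard-library sketch. The `Coo` and `Csr` structs and the `coo_to_csr` function are hypothetical and say nothing about ArrayFire's internal layout; they only show how row indices are compressed into row offsets while the values are carried along.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration types; ArrayFire's internal layout may differ.
struct Coo {
    int rows;                       // number of matrix rows
    std::vector<int> row_idx;       // row index of each non-zero
    std::vector<int> col_idx;       // column index of each non-zero
    std::vector<float> values;      // value of each non-zero
};

struct Csr {
    std::vector<int> row_ptr;       // rows + 1 offsets into col_idx/values
    std::vector<int> col_idx;
    std::vector<float> values;
};

Csr coo_to_csr(const Coo& coo) {
    const std::size_t nnz = coo.values.size();
    assert(coo.rows >= 0);
    assert(coo.row_idx.size() == nnz && coo.col_idx.size() == nnz);

    Csr csr;
    csr.row_ptr.assign(static_cast<std::size_t>(coo.rows) + 1, 0);
    csr.col_idx.resize(nnz);
    csr.values.resize(nnz);

    // 1. Count the non-zeros in each row.
    for (int r : coo.row_idx) {
        assert(r >= 0 && r < coo.rows);
        ++csr.row_ptr[r + 1];
    }
    // 2. A prefix sum turns the counts into row start offsets.
    for (int r = 0; r < coo.rows; ++r) csr.row_ptr[r + 1] += csr.row_ptr[r];
    // 3. Scatter each entry into the next free slot of its row.
    std::vector<int> next(csr.row_ptr.begin(), csr.row_ptr.end() - 1);
    for (std::size_t i = 0; i < nnz; ++i) {
        const int dst = next[coo.row_idx[i]]++;
        csr.col_idx[dst] = coo.col_idx[i];
        csr.values[dst]  = coo.values[i];
    }
    return csr;
}
```

    Only the index arrays are restructured here; the values are moved but not modified, which is also why a cast on a sparse array needs to touch only the values array.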

    Build

    • Support for Visual Studio 2015 compilation. 1 2
    • Fixed FindCBLAS.cmake when PkgConfig is used. 1

    Bug fixes

    • Fixes to JIT when tree is large. 1 2
    • Fixed indexing bug when converting dense to sparse af::array as AF_STORAGE_COO. 1
    • Fixed af::bilateral() OpenCL kernel compilation on OS X. 1
    • Fixed memory leak in af::regions() (CPU) and af::rgb2ycbcr(). 1 2 3

    Installers

    • Major OS X installer fixes. 1
      • Fixed installation scripts.
      • Fixed installation symlinks for libraries.
    • Windows installer now ships with more pre-built examples.

    Examples

    • Added af::choleskyInPlace() calls to cholesky.cpp example. 1

    Documentation

    • Added u8 as supported data type in getting_started.md. 1
    • Fixed typos. 1

    CUDA 8 on OSX

    Known Issues

    • Known failures with CUDA 6.5. These include all functions that use sorting. As a result, sparse storage format conversion between AF_STORAGE_COO and AF_STORAGE_CSR has been disabled for CUDA 6.5.
  • v3.4.1(Oct 15, 2016)

    v3.4.1

    The source code with submodules can be downloaded directly from the following link: http://arrayfire.com/arrayfire_source/arrayfire-full-3.4.1.tar.bz2

    Installer CUDA Version: 8.0 (Required)
    Installer OpenCL Version: 1.2 (Minimum)

    Installers

    • Installers for Linux, OS X and Windows
      • CUDA backend now uses CUDA 8.0.
      • Uses Intel MKL 2017.
      • CUDA Compute 2.x (Fermi) is no longer compiled into the library.
    • Installer for OS X
      • The libraries shipping in the OS X Installer are now compiled with Apple Clang v7.3.1 (previously v6.1.0).
      • The OS X version used is 10.11.6 (previously 10.10.5).
    • Installer for Jetson TX1 / Tegra X1
      • Requires JetPack for L4T 2.3 (containing Linux for Tegra r24.2 for TX1).
      • CUDA backend now uses CUDA 8.0 64-bit.
      • Using CUDA's cusolver instead of CPU fallback.
      • Uses OpenBLAS for CPU BLAS.
      • All ArrayFire libraries are now 64-bit.

    Improvements

    • Add sparse array support to af::eval(). 1
    • Add OpenCL-CPU fallback support for sparse af::matmul() when running on a unified memory device. Uses MKL Sparse BLAS.
    • When using CUDA libdevice, pick the correct compute version based on device. 1
    • OpenCL FFT now also supports prime factors 7, 11 and 13. 1 2

    Bug Fixes

    • Allow CUDA libdevice to be detected from custom directory.
    • Fix aarch64 detection on Jetson TX1 64-bit OS. 1
    • Add missing definition of af_set_fft_plan_cache_size in unified backend. 1
    • Fix initial values for af::min() and af::max() operations. 1 2
    • Fix distance calculation in af::nearestNeighbour for CUDA and OpenCL backends. 1 2
    • Fix OpenCL bug where scalars were passed incorrectly to compile options. 1
    • Fix bug in af::Window::surface() with respect to dimensions and ranges. 1
    • Fix possible double free corruption in af_assign_seq(). 1
    • Add missing eval for key in af::scanByKey in CPU backend. 1
    • Fixed creation of sparse values array using AF_STORAGE_COO. 1 1

    Examples

    • Add a Conjugate Gradient solver example to demonstrate sparse and dense matrix operations. 1

    CUDA Backend

    • When using CUDA 8.0, compute 2.x are no longer in default compute list.
      • This follows CUDA 8.0 deprecating computes 2.x.
      • Default computes for CUDA 8.0 will be 30, 50, 60.
    • When using CUDA pre-8.0, the default selection remains 20, 30, 50.
    • CUDA backend now uses -arch=sm_30 for PTX compilation as default.
      • Unless compute 2.0 is enabled.

    Known Issues

    • af::lu() on CPU is known to give incorrect results when built and run on OS X 10.11 or 10.12 and compiled with the Accelerate Framework. 1
      • Since the OS X Installer libraries use MKL rather than the Accelerate Framework, this issue does not affect those libraries.
  • v3.4.0(Sep 13, 2016)

    v3.4.0

    The source code with submodules can be downloaded directly from the following link: http://arrayfire.com/arrayfire_source/arrayfire-full-3.4.0.tar.bz2

    Installer CUDA Version: 7.5 (Required)
    Installer OpenCL Version: 1.2 (Minimum)

    Major Updates

    • [Sparse Matrix and BLAS](ref sparse_func). 1 2
    • Faster JIT for CUDA and OpenCL. 1 2
    • Support for [random number generator engines](ref af::randomEngine). 1 2
    • Improvements to graphics. 1 2

    Features

    • [Sparse Matrix and BLAS](ref sparse_func) 1 2
      • Support for [CSR](ref AF_STORAGE_CSR) and [COO](ref AF_STORAGE_COO) [storage types](ref af_storage).
      • Sparse-Dense Matrix Multiplication and Matrix-Vector Multiplication as a part of af::matmul() using AF_STORAGE_CSR format for sparse.
      • Conversion to and from [dense](ref AF_STORAGE_DENSE) matrix to [CSR](ref AF_STORAGE_CSR) and [COO](ref AF_STORAGE_COO) [storage types](ref af_storage).
    • Faster JIT 1 2
      • Performance improvements for CUDA and OpenCL JIT functions.
      • Support for evaluating multiple outputs in a single kernel. See af::array::eval() for more.
    • [Random Number Generation](ref af::randomEngine) 1 2
      • af::randomEngine(): A random engine class to handle setting the type and seed for random number generator engines.
      • Supported engine types are:
    • Graphics 1 2
      • Using Forge v0.9.0
      • [Vector Field](ref af::Window::vectorField) plotting functionality. 1
      • Removed GLEW and replaced with glbinding.
        • Removed usage of GLEW after support for MX (multithreaded) was dropped in v2.0. 1
      • Multiple overlays on the same window are now possible.
        • Overlays support for same type of object (2D/3D)
        • Supported by af::Window::plot, af::Window::hist, af::Window::surface, af::Window::vectorField.
      • New API to set axes limits for graphs.
        • Draw calls do not automatically compute the limits. This is now under user control.
        • af::Window::setAxesLimits can be used to set axes limits automatically or manually.
        • af::Window::setAxesTitles can be used to set axes titles.
      • New API for plot and scatter:
        • af::Window::plot() and af::Window::scatter() now can handle 2D and 3D and determine appropriate order.
        • af_draw_plot_nd()
        • af_draw_plot_2d()
        • af_draw_plot_3d()
        • af_draw_scatter_nd()
        • af_draw_scatter_2d()
        • af_draw_scatter_3d()
    • New [interpolation methods](ref af_interp_type) 1
      • Applies to
        • af::resize()
        • af::transform()
        • af::approx1()
        • af::approx2()
    • Support for [complex mathematical functions](ref mathfunc_mat) 1
      • Add complex support for trig_mat, af::sqrt(), af::log().
    • af::medfilt1(): Median filter for 1-d signals 1
    • Generalized scan functions: scan_func_scan and scan_func_scanbykey
      • Now supports inclusive or exclusive scans
      • Supports binary operations defined by af_binary_op. 1
    • [Image Moments](ref moments_mat) functions 1
    • Add af::getSizeOf() function for af_dtype 1
    • Explicitly instantiate af::array::device() for void *. 1
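
    The difference between the inclusive and exclusive scans listed above is easiest to see on a small input. The sketch below is plain C++ and not ArrayFire code; `scan` is a hypothetical helper, and the caller supplies the binary operation together with its identity element (for example, 0 for addition or 1 for multiplication).

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical helper: generalized scan of `in` with binary operation `op`.
// `identity` must be the neutral element of `op`.
// Inclusive: out[i] combines in[0..i]. Exclusive: out[i] combines in[0..i-1].
std::vector<int> scan(const std::vector<int>& in,
                      const std::function<int(int, int)>& op,
                      int identity, bool inclusive) {
    assert(op);  // a callable must be provided
    std::vector<int> out;
    out.reserve(in.size());
    int acc = identity;
    for (int x : in) {
        if (inclusive) {
            acc = op(acc, x);
            out.push_back(acc);   // result includes the current element
        } else {
            out.push_back(acc);   // result excludes the current element
            acc = op(acc, x);
        }
    }
    return out;
}
```

    With addition on the input 1, 2, 3, 4, the inclusive scan gives 1, 3, 6, 10 and the exclusive scan gives 0, 1, 3, 6.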

    Bug Fixes

    • Fixes to edge-cases in morph_mat. 1
    • Makes JIT tree size consistent between devices. 1
    • Delegate higher-dimension in convolve_mat to correct dimensions. 1
    • Indexing fixes with C++11. 1 2
    • Handle empty arrays as inputs in various functions. 1
    • Fix bug with single-element input to af::median. 1
    • Fix bug in calculation of time from af::timeit(). 1
    • Fix bug in floating point numbers in af::seq. 1
    • Fixes for OpenCL graphics interop on NVIDIA devices. 1
    • Fix bug when compiling large kernels for AMD devices. 1
    • Fix bug in af::bilateral when shared memory is over the limit. 1
    • Fix bug in kernel header compilation tool bin2cpp. 1
    • Fix initial values for morph_mat functions. 1
    • Fix bugs in af::homography() CPU and OpenCL kernels. 1
    • Fix bug in CPU TNJ. 1

    Improvements

    • CUDA 8 and compute 6.x(Pascal) support, current installer ships with CUDA 7.5. 1 2 3
    • User controlled FFT plan caching. 1
    • CUDA performance improvements for image_func_wrap, image_func_unwrap and approx_mat. 1
    • Fallback for CUDA-OpenGL interop when no device supports OpenGL. 1
    • Additional forms of batching with the transform_func_transform functions. New behavior defined here. 1
    • Update to OpenCL2 headers. 1
    • Support for integration with external OpenCL contexts. 1
    • Performance improvements to internal copy in CPU Backend. 1
    • Performance improvements to af::select and af::replace CUDA kernels. 1
    • Enable OpenCL-CPU offload by default for devices with Unified Host Memory. 1
      • To disable, use the environment variable AF_OPENCL_CPU_OFFLOAD=0.

    Build

    • Compilation speedups. 1
    • Build fixes with MKL. 1
    • Error message when CMake CUDA Compute Detection fails. 1
    • Several CMake build issues with Xcode generator fixed. 1 2
    • Fix multiple OpenCL definitions at link time. 1
    • Fix lapacke detection in CMake. 1
    • Update build tags of
    • Fix builds with GCC 6.1.1 and GCC 5.3.0. 1

    Installers

    • All installers now ship with ArrayFire libraries built with MKL 2016.
    • All installers now ship with Forge development files and examples included.
    • CUDA Compute 2.0 has been removed from the installers. Please contact us directly if you have a special need.

    Examples

    • Added [example simulating gravity](ref graphics/field.cpp) for demonstration of vector field.
    • Improvements to financial/black_scholes_options.cpp example.
    • Improvements to graphics/gravity_sim.cpp example.
    • Fix graphics examples to use af::Window::setAxesLimits and af::Window::setAxesTitles functions.

    Documentation & Licensing

    • ArrayFire copyright and trademark policy
    • Fixed grammar in license.
    • Add license information for glbinding.
    • Remove license information for GLEW.
    • Random123 now applies to all backends.
    • Random number functions are now under random_mat.

    Deprecations

    The following functions have been deprecated and may be modified or removed permanently from future versions of ArrayFire.

    • af::Window::plot3(): Use af::Window::plot instead.
    • af_draw_plot(): Use af_draw_plot_nd or af_draw_plot_2d instead.
    • af_draw_plot3(): Use af_draw_plot_nd or af_draw_plot_3d instead.
    • af::Window::scatter3(): Use af::Window::scatter instead.
    • af_draw_scatter(): Use af_draw_scatter_nd or af_draw_scatter_2d instead.
    • af_draw_scatter3(): Use af_draw_scatter_nd or af_draw_scatter_3d instead.

    Known Issues

    Certain CUDA functions are known to be broken on Tegra K1. The following ArrayFire tests are currently failing:

    • assign_cuda
    • harris_cuda
    • homography_cuda
    • median_cuda
    • orb_cuda
    • sort_cuda
    • sort_by_key_cuda
    • sort_index_cuda
  • v3.3.2(Apr 26, 2016)

    v3.3.2

    The source code with submodules can be downloaded directly from the following link: http://arrayfire.com/arrayfire_source/arrayfire-full-3.3.2.tar.bz2

    Improvements

    • Family of [Sort](ref sort_mat) functions now support higher order dimensions.
    • Improved performance of batched sort on dim 0 for all [Sort](ref sort_mat) functions.
    • [Median](ref stat_func_median) now also supports higher order dimensions.

    Bug Fixes

    Build

    Documentation

    • Fixed documentation for \ref af::replace().
    • Fixed images in [Using on OSX](ref using_on_osx) page.

    Installer

    • Linux x64 installers will now be compiled with GCC 4.9.2.
    • OSX installer gives better error messages on brew failures and now includes a link to Fixing OS X Installer Failures for brew installation failures.
  • v3.3.1(Mar 17, 2016)

    v3.3.1

    The source code with submodules can be downloaded directly from the following link: http://arrayfire.com/arrayfire_source/arrayfire-full-3.3.1.tar.bz2

    Bug Fixes

    • Fixes to \ref af::array::device()
      • CPU Backend: evaluate arrays before returning the pointer when asynchronous calls are in use.
      • OpenCL Backend: fix segfaults when device pointers are requested on empty arrays.
    • Fixed \ref af::array::operator%() to use mod instead of rem.
    • Fixed array destruction when backends are switched in Unified API.
    • Fixed indexing after \ref af::moddims() is called.
    • Fixes FFT calls for CUDA and OpenCL backends when used on multiple devices.
    • Fixed unresolved external for some functions from \ref af::array::array_proxy class.

    Build

    • CMake compiles files in alphabetical order.
    • CMake fixes for BLAS and LAPACK on some Linux distributions.

    Improvements

    Documentation

    • Reorganized, cleaner README file.
    • Replaced non-free lena image in assets with free-to-distribute lena image.
  • v3.3.0(Feb 28, 2016)

    v3.3.0

    The source code with submodules can be downloaded directly from the following link: http://arrayfire.com/arrayfire_source/arrayfire-full-3.3.0.tar.bz2

    Major Updates

    • CPU backend supports asynchronous execution.
    • Performance improvements to OpenCL BLAS and FFT functions.
    • Improved performance of memory manager.
    • Improvements to visualization functions.
    • Improved sorted order for OpenCL devices.
    • Integration with external OpenCL projects.

    Features

    • \ref af::getActiveBackend(): Returns the current backend being used.
    • Scatter plot added to graphics.
    • \ref af::transform() now supports perspective transformation matrices.
    • \ref af::infoString(): Returns af::info() as a string.
    • \ref af::printMemInfo(): Prints a table showing information about buffers from the memory manager
      • The \ref AF_MEM_INFO macro prints numbers and total sizes of all buffers (requires including af/macros.h)
    • \ref af::allocHost(): Allocates memory on host.
    • \ref af::freeHost(): Frees host-side memory allocated by ArrayFire.
    • OpenCL functions can now use CPU implementation.
      • Currently limited to Unified Memory devices (CPU and On-board Graphics).
      • Functions: af::matmul() and all [LAPACK](ref linalg_mat) functions.
      • Takes advantage of optimized libraries such as MKL without doing memory copies.
      • Use the environment variable AF_OPENCL_CPU_OFFLOAD=1 to take advantage of this feature.
    • Functions specific to OpenCL backend.
      • \ref afcl::addDevice(): Adds an external device and context to ArrayFire's device manager.
      • \ref afcl::deleteDevice(): Removes an external device and context from ArrayFire's device manager.
      • \ref afcl::setDevice(): Sets an external device and context from ArrayFire's device manager.
      • \ref afcl::getDeviceType(): Gets the device type of the current device.
      • \ref afcl::getPlatform(): Gets the platform of the current device.
    • \ref af::createStridedArray() allows array creation with user-defined strides and a device pointer.
    • Expose functions that provide information about memory layout of Arrays.
      • \ref af::getStrides(): Gets the strides for each dimension of the array.
      • \ref af::getOffset(): Gets the offsets for each dimension of the array.
      • \ref af::getRawPtr(): Gets raw pointer to the location of the array on device.
      • \ref af::isLinear(): Returns true if all elements in the array are contiguous.
      • \ref af::isOwner(): Returns true if the array owns the raw pointer, false if it is a sub-array.
    • \ref af::getDeviceId(): Gets the device id on which the array resides.
    • \ref af::isImageIOAvailable(): Returns true if ArrayFire was compiled with FreeImage enabled
    • \ref af::isLAPACKAvailable(): Returns true if ArrayFire was compiled with LAPACK functions enabled
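
    The memory-layout queries listed above (strides and linearity) can be illustrated with a small sketch. It assumes the column-major layout ArrayFire uses, where the first dimension is contiguous; `Dim4`, `packed_strides`, and `is_linear` are hypothetical stand-ins rather than ArrayFire types or functions, and the linearity check is simplified (it does not special-case dimensions of size 1).

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for a 4-dimensional shape or stride tuple.
using Dim4 = std::array<long long, 4>;

// Strides (in elements) of a densely packed column-major array:
// the first dimension has stride 1, each later one the product of the
// preceding dimension sizes.
Dim4 packed_strides(const Dim4& dims) {
    Dim4 strides{};
    long long step = 1;
    for (int i = 0; i < 4; ++i) {
        assert(dims[i] >= 0);
        strides[i] = step;
        step *= dims[i];
    }
    return strides;
}

// Simplified linearity check: the array is contiguous when its strides
// equal the packed strides for its own dimensions.
bool is_linear(const Dim4& dims, const Dim4& strides) {
    return strides == packed_strides(dims);
}
```

    A 4x3x2 array has packed strides 1, 4, 12, 24. A 2x3 view into the first two rows of a 4x3 parent keeps the parent's strides (1, 4, ...), so it is not linear, which matches the idea of a sub-array that does not own contiguous memory.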

    Bug Fixes

    Improvements

    • Optionally offload BLAS and LAPACK functions to CPU implementations to improve performance.
    • Performance improvements to the memory manager.
    • Error messages are now more detailed.
    • Improved sorted order for OpenCL devices.
    • JIT heuristics can now be tweaked using environment variables. See [Environment Variables](ref configuring_environment) tutorial.
    • Add BUILD_<BACKEND> options to examples and tests to toggle backends when compiling independently.

    Examples

    • New visualization [example simulating gravity](ref graphics/gravity_sim.cpp).

    Build

    • Support for Intel icc compiler
    • Support to compile with Intel MKL as a BLAS and LAPACK provider
    • Tests are now available for building as standalone (like examples)
    • Tests can now be built as a single file for each backend
    • Better handling of NONFREE build options
    • Searching for GLEW in CMake default paths
    • Fixes for compiling with MKL on OSX.

    Installers

    • Improvements to OSX Installer
      • CMake config files are now installed with libraries
      • Independent options for installing examples and documentation components

    Deprecations

    • af_lock_device_arr is now deprecated and will be removed in v4.0.0. Use \ref af_lock_array() instead.
    • af_unlock_device_arr is now deprecated and will be removed in v4.0.0. Use \ref af_unlock_array() instead.

    Documentation

    • Fixes to documentation for \ref matchTemplate().
    • Improved documentation for deviceInfo.
    • Fixes to documentation for \ref exp().

    Known Issues

  • v3.3.alpha(Feb 4, 2016)

  • v3.2.2(Dec 31, 2015)

    Release Notes

    The source code with submodules can be downloaded directly from the following link: http://arrayfire.com/arrayfire_source/arrayfire-full-3.2.2.tar.bz2

    v3.2.2

    Bug Fixes

    • Fixed memory leak in CUDA Random number generators
    • Fixed bug in af::select() and af::replace() tests
    • Fixed exception thrown when printing empty arrays with af::print()
    • Fixed bug in CPU random number generation. Changed the generator to mt19937
    • Fixed exception handling (internal)
      • Exceptions now show function, short file name and line number
      • Added AF_RETURN_ERROR macro to handle returning errors.
      • Removed THROW macro, and renamed AF_THROW_MSG to AF_THROW_ERR.
    • Fixed bug in \ref af::identity() that may have affected CUDA Compute 5.2 cards

    Build

    • Added a MIN_BUILD_TIME option to build with minimum optimization compiler flags resulting in faster compile times
    • Fixed issue in CBLAS detection by CMake
    • Fixed tests failing for builds without optional components FreeImage and LAPACK
    • Added a test for unified backend
    • Only info and backend tests are now built for unified backend
    • Sort tests execution alphabetically
    • Fixed compilation flags and errors in tests and examples
    • Moved AF_REVISION and AF_COMPILER_STR into src/backend. Because the revision changes with every commit, the previous layout forced a rebuild of all of ArrayFire each time.
      • v3.3 will add an af_get_revision() function to get the revision string.
    • Clean up examples
      • Remove getchar for Windows (this will be handled by the installer)
      • Other miscellaneous code cleanup
      • Fixed bug in [plot3.cpp](ref graphics/plot3.cpp) example
    • Rename clBLAS/clFFT external project suffix from external -> ext
    • Add OpenBLAS as a lapack/lapacke alternative

    Improvements

    • Added \ref AF_MEM_INFO macro to print memory info from ArrayFire's memory manager (cross issue)
    • Added additional paths for searching for libaf* for Unified backend on unix-style OS.
      • Note: This still requires dependencies such as forge, CUDA, NVVM etc to be in LD_LIBRARY_PATH as described in [Unified Backend](ref unifiedbackend)
    • Create streams for devices only when required in CUDA Backend

    Documentation

    • Hide scrollbars appearing for pre and code styles
    • Fix documentation for af::replace
    • Add code sample for converting the output of af::getAvailableBackends() into bools
    • Minor fixes in documentation
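
    The conversion of the af::getAvailableBackends() result into bools, mentioned above, amounts to testing bits of an integer mask. The sketch below is self-contained and does not call ArrayFire; the constants are assumed to mirror the af_backend enum values (CPU = 1, CUDA = 2, OpenCL = 4), and `BackendFlags` and `decode_backends` are hypothetical names.

```cpp
#include <cassert>

// Assumed to mirror ArrayFire's af_backend enum values
// (AF_BACKEND_CPU, AF_BACKEND_CUDA, AF_BACKEND_OPENCL).
enum Backend { kCpu = 1, kCuda = 2, kOpenCL = 4 };

struct BackendFlags {
    bool cpu;
    bool cuda;
    bool opencl;
};

// Decodes a bitmask of available backends into one bool per backend.
BackendFlags decode_backends(int mask) {
    assert(mask >= 0);
    return BackendFlags{(mask & kCpu) != 0,
                        (mask & kCuda) != 0,
                        (mask & kOpenCL) != 0};
}
```

    In a program linked against ArrayFire, the mask would come from af::getAvailableBackends(); a value of 5, for instance, would indicate that the CPU and OpenCL backends are available and CUDA is not.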
  • v3.2.1(Dec 5, 2015)

    Release Notes

    The source code with submodules can be downloaded directly from the following link: http://arrayfire.com/arrayfire_source/arrayfire-full-3.2.1.tar.bz2

    v3.2.1

    Bug Fixes

    • Fixed bug in homography()
    • Fixed bug in behavior of af::array::device()
    • Fixed bug when indexing with span along trailing dimension
    • Fixed bug when indexing in [GFor](ref gfor)
    • Fixed bug in CPU information fetching
    • Fixed compilation bug in unified backend caused by missing link library
    • Add missing symbol for af_draw_surface()

    Build

    • Tests can now be used as a standalone project
      • Tests can now be built using pre-compiled libraries
      • Similar to how the examples are built
    • The install target now installs the examples source irrespective of the BUILD_EXAMPLES value
      • Examples are not built if BUILD_EXAMPLES is off

    Documentation

    • HTML documentation is now built and installed in docs/html
    • Added documentation for \ref af::seq class
    • Updated Matrix Manipulation tutorial
    • Examples list is now generated by CMake
      • Examples are now listed as dir/example.cpp
    • Removed dummy groups used for indexing documentation (affected doxygen < 1.8.9)
  • v3.2.0(Nov 13, 2015)

    Release Notes

    The source code with submodules can be downloaded directly from the following link: http://arrayfire.com/arrayfire_source/arrayfire-full-3.2.0.tar.bz2

    Major Updates

    • Added Unified backend
      • Allows switching backends at runtime
      • Read [Unified Backend](ref unifiedbackend) for more.
    • Support for 16-bit integers (\ref s16 and \ref u16)
      • All functions that support 32-bit integer types (\ref s32, \ref u32) now also support 16-bit integer types

    Function Additions

    • Unified Backend
      • \ref setBackend() - Sets a backend as active
      • \ref getBackendCount() - Gets the number of backends available for use
      • \ref getAvailableBackends() - Returns information about available backends
      • \ref getBackendId() - Gets the backend enum for an array
    • Vision
      • \ref homography() - Homography estimation
      • \ref gloh() - GLOH Descriptor for SIFT
    • Image Processing
      • \ref loadImageNative() - Load an image as native data without modification
      • \ref saveImageNative() - Save an image without modifying data or type
    • Graphics
      • \ref af::Window::plot3() - 3-dimensional line plot
      • \ref af::Window::surface() - 3-dimensional curve plot
    • Indexing
      • \ref af_create_indexers()
      • \ref af_set_array_indexer()
      • \ref af_set_seq_indexer()
      • \ref af_set_seq_param_indexer()
      • \ref af_release_indexers()
    • CUDA Backend Specific
      • \ref setNativeId() - Set the CUDA device with given native id as active
        • ArrayFire uses a modified order for devices. The native id for a device can be retrieved using nvidia-smi
    • OpenCL Backend Specific
      • \ref setDeviceId() - Set the OpenCL device using the clDeviceId

    Other Improvements

    • Added \ref c32 and \ref c64 support for \ref isNaN(), \ref isInf() and \ref iszero()
    • Added CPU information for x86 and x86_64 architectures in CPU backend's \ref info()
    • Batch support for \ref approx1() and \ref approx2()
      • Now can be used with gfor as well
    • Added \ref s64 and \ref u64 support to:
      • \ref sort() (along with sort index and sort by key)
      • \ref setUnique(), \ref setUnion(), \ref setIntersect()
      • \ref convolve() and \ref fftConvolve()
      • \ref histogram() and \ref histEqual()
      • \ref lookup()
      • \ref mean()
    • Added \ref AF_MSG macro

    Build Improvements

    • Submodules update is now automatically called if not cloned recursively
    • Fixes for compilation on Visual Studio 2015
    • Option to fall back to CPU LAPACK for linear algebra functions when using CUDA 6.5 or older.

    Bug Fixes

    Documentation Updates

    • Improved tutorials documentation
      • More detailed Using on [Linux](\ref using_on_linux), [OSX](\ref using_on_osx), and [Windows](\ref using_on_windows) pages.
    • Added return type information for functions that return different type arrays

    New Examples

    • Graphics
      • [Plot3](\ref plot3.cpp)
      • [Surface](\ref surface.cpp)
    • [Shallow Water Equation](\ref swe.cpp)
    • [Basic](\ref basic.cpp) as a Unified backend example

    Installers

    • All installers now include the Unified backend and corresponding CMake files
    • Visual Studio projects include Unified in the Platform Configurations
    • Added installer for Jetson TX1
    • SIFT and GLOH do not ship with the installers as SIFT is protected by patents that do not allow commercial distribution without licensing.
  • v3.1.3(Oct 18, 2015)

    The source code with submodules can be downloaded directly from the following link: http://arrayfire.com/arrayfire_source/arrayfire-full-3.1.3.tar.bz2

    Bug Fixes

    • Fixed bugs in various OpenCL kernels without offset additions
    • Remove ARCH_32 and ARCH_64 flags
    • Fix missing symbols when freeimage is not found
    • Use CUDA driver version for Windows
    • Improvements to SIFT
    • Fixed memory leak in median
    • Fixes for Windows compilation when not using MKL #1047
    • Fixes for building without LAPACK

    Other

    • Documentation: Fixed documentation for select and replace
    • Documentation: Fixed documentation for af_isnan
  • v3.1.2(Sep 26, 2015)

    The source code with submodules can be downloaded directly from the following link: http://arrayfire.com/arrayfire_source/arrayfire-full-3.1.2.tar.bz2

    Bug Fixes

    • Fixed bug in assign that was causing test to fail
    • Fixed bug in convolve. Frequency condition now depends on kernel size only
    • Fixed bug in indexed reductions for complex type in OpenCL backend
    • Fixed bug in kernel name generation in ireduce for OpenCL backend
    • Fixed non-linear to linear indices in ireduce
    • Fixed bug in reductions for small arrays
    • Fixed bug in histogram for indexed arrays
    • Fixed compiler error CPUID for non-compliant devices
    • Fixed failing tests on i386 platforms
    • Add missing AFAPI

    Other

    • Documentation: Added missing examples and other corrections
    • Documentation: Fixed warnings in documentation building
    • Installers: Send error messages to log file in OSX Installer
  • v3.1.1(Sep 13, 2015)

    The source code with submodules can be downloaded directly from the following link: http://arrayfire.com/arrayfire_source/arrayfire-full-3.1.1.tar.bz2

    Installers

    • CUDA backend now depends on CUDA 7.5 toolkit
    • OpenCL backend now requires OpenCL 1.2 or greater

    Bug Fixes

    • Fixed bug in reductions after indexing
    • Fixed bug in indexing when using reverse indices

    Build

    • CMake now includes PKG_CONFIG in the search path for CBLAS and LAPACKE libraries
    • heston_model.cpp example now builds with the default ArrayFire CMake files after installation

    Other

  • v3.1.0(Aug 28, 2015)

    The source code with submodules can be downloaded directly from the following link: http://arrayfire.com/arrayfire_source/arrayfire-full-3.1.0.tar.bz2

    Release Notes

    Function Additions

    • Computer Vision Functions
      • nearestNeighbour() - Nearest Neighbour with SAD, SSD and SHD distances
      • harris() - Harris Corner Detector
      • susan() - Susan Corner Detector
      • sift() - Scale Invariant Feature Transform (SIFT)
        • "Method and apparatus for identifying scale invariant features in an image and use of same for locating an object in an image," David G. Lowe, US Patent 6,711,293 (March 23, 2004). Provisional application filed March 8, 1999. Assignee: The University of British Columbia. For further details, contact David Lowe ([email protected]) or the University-Industry Liaison Office of the University of British Columbia.
        • SIFT is available for compiling but does not ship with ArrayFire hosted installers/pre-built libraries
      • dog() - Difference of Gaussians
    • Image Processing Functions
      • ycbcr2rgb() and rgb2ycbcr() - RGB <-> YCbCr color space conversion
      • wrap() and unwrap() - Wrap and Unwrap
      • sat() - Summed Area Tables
      • loadImageMem() and saveImageMem() - Load and Save images to/from memory
        • af_image_format - Added imageFormat (af_image_format) enum
    • Array & Data Handling
      • copy() - Copy
      • array::lock() and array::unlock() - Lock and Unlock
      • select() and replace() - Select and Replace
      • Get array reference count (af_get_data_ref_count)
    • Signal Processing
      • fftInPlace() - 1D in place FFT
      • fft2InPlace() - 2D in place FFT
      • fft3InPlace() - 3D in place FFT
      • ifftInPlace() - 1D in place Inverse FFT
      • ifft2InPlace() - 2D in place Inverse FFT
      • ifft3InPlace() - 3D in place Inverse FFT
      • fftR2C() - Real to complex FFT
      • fftC2R() - Complex to Real FFT
    • Linear Algebra
      • svd() and svdInPlace() - Singular Value Decomposition
    • Other operations
      • sigmoid() - Sigmoid
      • Sum (with option to replace NaN values)
      • Product (with option to replace NaN values)
    • Graphics
      • Window::setSize() - Window resizing using Forge API
    • Utility
      • Allow users to set print precision (print, af_print_array_gen)
      • saveArray() and readArray() - Stream arrays to binary files
      • toString() - toString function returns the array and data as a string
    • CUDA specific functionality
      • getStream() - Returns default CUDA stream ArrayFire uses for the current device
      • getNativeId() - Returns native id of the CUDA device
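
For readers unfamiliar with the data structure behind sat() in the list above, the following is a minimal sketch of what a summed area table holds and why it is useful. It is written in plain standard C++ rather than with ArrayFire. The Image struct, summedAreaTable(), and boxSum() are hypothetical names used only for illustration, and the row-major layout is an assumption of this sketch, not a description of ArrayFire's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major 2D image of integers; hypothetical helper type for this sketch only.
struct Image {
    std::size_t rows, cols;
    std::vector<long long> data;  // element (r, c) lives at data[r * cols + c]
    long long at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Build the summed area table: out(r, c) = sum of in(i, j) for all i <= r, j <= c.
Image summedAreaTable(const Image& in) {
    assert(in.data.size() == in.rows * in.cols);
    Image out{in.rows, in.cols, std::vector<long long>(in.data.size(), 0)};
    for (std::size_t r = 0; r < in.rows; ++r) {
        long long rowSum = 0;  // running sum of the current row up to column c
        for (std::size_t c = 0; c < in.cols; ++c) {
            rowSum += in.at(r, c);
            const long long above = (r > 0) ? out.at(r - 1, c) : 0;
            out.data[r * in.cols + c] = rowSum + above;
        }
    }
    return out;
}

// Sum of the inclusive rectangle [r0, r1] x [c0, c1] using at most four table lookups.
long long boxSum(const Image& sat, std::size_t r0, std::size_t c0,
                 std::size_t r1, std::size_t c1) {
    assert(r0 <= r1 && r1 < sat.rows && c0 <= c1 && c1 < sat.cols);
    long long total = sat.at(r1, c1);
    if (r0 > 0) total -= sat.at(r0 - 1, c1);
    if (c0 > 0) total -= sat.at(r1, c0 - 1);
    if (r0 > 0 && c0 > 0) total += sat.at(r0 - 1, c0 - 1);
    return total;
}
```

Once the table is built, boxSum() answers any rectangular sum in constant time. For the 2x3 image with rows {1, 2, 3} and {4, 5, 6}, the bottom-right table entry is 21, the sum of all pixels.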

    Improvements

    • dot
      • Allow complex inputs with conjugate option
    • AF_INTERP_LOWER interpolation
      • For resize, rotate and transform based functions
    • 64-bit integer support
      • For reductions, random, iota, range, diff1, diff2, accum, join, shift and tile
    • convolve
      • Support for non-overlapping batched convolutions
    • Complex Arrays
      • Fix binary ops on complex inputs of mixed types
      • Complex type support for exp
    • tile
      • Performance improvements by using JIT when possible.
    • Add AF_API_VERSION macro
      • Allows disabling of API to maintain consistency with previous versions
    • Other Performance Improvements
      • Use reference counting to reduce unnecessary copies
    • CPU Backend
      • Device properties for CPU
      • Improved performance when all buffers are indexed linearly
    • CUDA Backend
      • Use streams in CUDA (no longer using default stream)
      • Using async cudaMem ops
      • Add 64-bit integer support for JIT functions
      • Performance improvements for CUDA JIT for non-linear 3D and 4D arrays
    • OpenCL Backend
      • Improve compilation times for OpenCL backend
      • Performance improvements for non-linear JIT kernels on OpenCL
      • Improved shared memory load/store in many OpenCL kernels (PR 933)
      • Using cl.hpp v1.2.7
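
To make the "complex inputs with conjugate option" item for dot above concrete, here is a minimal sketch of a dot product that optionally conjugates either operand. It uses only the C++ standard library. The dotProduct() function and its boolean flags are hypothetical stand-ins chosen for this illustration and are not ArrayFire's actual signature.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Dot product of two complex vectors with optional conjugation of either operand.
// The boolean flags are hypothetical; they only illustrate the idea of a conjugate option.
cfloat dotProduct(const std::vector<cfloat>& lhs, const std::vector<cfloat>& rhs,
                  bool conjLhs = false, bool conjRhs = false) {
    assert(lhs.size() == rhs.size());
    cfloat acc(0.0f, 0.0f);
    for (std::size_t i = 0; i < lhs.size(); ++i) {
        const cfloat a = conjLhs ? std::conj(lhs[i]) : lhs[i];
        const cfloat b = conjRhs ? std::conj(rhs[i]) : rhs[i];
        acc += a * b;
    }
    return acc;
}
```

Conjugating the left operand and passing the same vector twice yields the squared Euclidean norm, which is real-valued. Without conjugation the result is generally complex.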

    Bug Fixes

    • Common
      • Fix compatibility of c32/c64 arrays when operating with scalars
      • Fix median for all values of an array
      • Fix double free issue when indexing (30cbbc7)
      • Fix bug in rank
      • Fix default values for scale throwing exception
      • Fix conjg raising exception on real input
      • Fix bug when using conjugate transpose for vector input
      • Fix issue with const input for array_proxy::get()
    • CPU Backend
      • Fix randn generating same sequence for multiple calls
      • Fix setSeed for randu
      • Fix casting to and from complex
      • Check NULL values when allocating memory
      • Fix offset issue for CPU element-wise operations

    New Examples

    • Match Template
    • Susan
    • Heston Model (contributed by Michael Nowotny)

    Distribution Changes

    • Fixed automatic detection of ArrayFire when using with CMake in the Windows Installer
    • Compiling ArrayFire with FreeImage as a static library for Linux x86 installers

    Known Issues

    • OpenBLAS can cause issues with QR factorization in CPU backend
    • FreeImage older than 3.10 can cause issues with loadImageMem and saveImageMem
    • OpenCL backend issues on OSX
      • AMD GPUs not supported because of driver issues
      • Intel CPUs not supported
      • Linear algebra functions do not work on Intel GPUs.
    • Stability and correctness issues with open source OpenCL implementations such as Beignet, GalliumCompute.
  • v3.0.2(Jun 26, 2015)

    The source code with submodules can be downloaded directly from the following link: http://arrayfire.com/arrayfire_source/arrayfire-full-3.0.2.tar.bz2

    Bug Fixes

    • Added missing symbols from the compatible API
    • Fixed a bug affecting corner rows and elements in grad()
    • Fixed linear interpolation bugs affecting large images in the following:
      • approx1()
      • approx2()
      • resize()
      • rotate()
      • scale()
      • skew()
      • transform()
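
The functions listed above all sample their input at fractional positions. As a reminder of what linear interpolation computes in the 1D case, here is a minimal sketch in plain standard C++. The interpolateLinear() function and its offGrid parameter are hypothetical names for this illustration. The sketch does not reproduce ArrayFire's implementation or the bug that was fixed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Linearly interpolate `samples` (assumed to sit at integer positions 0, 1, 2, ...)
// at each fractional position in `positions`. Positions outside the sampled range
// yield `offGrid`.
std::vector<float> interpolateLinear(const std::vector<float>& samples,
                                     const std::vector<float>& positions,
                                     float offGrid = 0.0f) {
    assert(!samples.empty());
    const float last = static_cast<float>(samples.size() - 1);
    std::vector<float> out;
    out.reserve(positions.size());
    for (float p : positions) {
        if (!(p >= 0.0f && p <= last)) {  // also rejects NaN positions
            out.push_back(offGrid);
            continue;
        }
        const std::size_t lo = static_cast<std::size_t>(std::floor(p));
        const std::size_t hi = (lo + 1 < samples.size()) ? lo + 1 : lo;
        const float t = p - static_cast<float>(lo);  // fractional part in [0, 1)
        out.push_back((1.0f - t) * samples[lo] + t * samples[hi]);
    }
    return out;
}
```

Each output is a weighted average of the two neighbouring samples, with the weights given by the distance to each neighbour. Positions that fall off the grid receive the caller-supplied fill value.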

    Documentation

    • Added missing documentation for constant()
    • Added missing documentation for array::scalar()
    • Added supported input types for functions in arith.h