A Python package for manipulating 2-dimensional tabular data structures

Overview

datatable


This is a Python package for manipulating 2-dimensional tabular data structures (aka data frames). It is close in spirit to pandas or SFrame; however, we put specific emphasis on speed and big-data support. As the name suggests, the package is closely related to R's data.table and attempts to mimic its core algorithms and API.

Currently datatable is in beta and undergoing active development. Some features may still be missing. Python 3.6+ is required.

Project goals

datatable started in 2017 as a toolkit for performing big data (up to 100GB) operations on a single-node machine, at the maximum speed possible. Such requirements are dictated by modern machine-learning applications, which need to process large volumes of data and generate many features in order to achieve the best model accuracy. The first user of datatable was Driverless.ai.

At a minimum, we want datatable to implement the following features:

  • Column-oriented data storage.

  • Native-C implementation for all datatypes, including strings. Packages such as pandas and numpy already do that for numeric columns, but not for strings.

  • Support for date-time and categorical types. Object type is also supported, but promotion into object is discouraged.

  • All types should support null values, with as little overhead as possible.

  • Data should be stored on disk in the same format as in memory. This will allow us to memory-map data on disk and work on out-of-memory datasets transparently.

  • Work with memory-mapped datasets to avoid loading into memory more data than necessary for each particular operation.

  • Fast data reading from CSV and other formats.

  • Multi-threaded data processing: time-consuming operations should attempt to utilize all cores for maximum efficiency.

  • Efficient algorithms for sorting/grouping/joining.

  • Expressive query syntax (similar to data.table); see the sketch after this list.

  • LLVM-based lazy computation for complex queries (code generated, compiled and executed on-the-fly).

  • LLVM-based user-defined functions.

  • Minimal amount of data copying, copy-on-write semantics for shared data.

  • Use "rowindex" views in filtering/sorting/grouping/joining operators to avoid unnecessary data copying.

  • Interoperability with pandas / numpy / pure python: users should be able to convert to another data-processing framework with ease.

  • Restrictions: Python 3.6+, 64-bit systems only.
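
To give a feel for the expressive query syntax mentioned in the goals above, here is a minimal sketch (the frame and column names are made up; f, by and dt.sum are part of datatable's public API):

import datatable as dt
from datatable import f, by

# a tiny frame made up for illustration
DT = dt.Frame(city=["NY", "NY", "SF"], sales=[10, 20, 30])

# filter rows, compute a named aggregate, and group -- all in a single DT[i, j, by] call
result = DT[f.sales > 5, {"total": dt.sum(f.sales)}, by(f.city)]
print(result)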

Installation

On macOS, Linux, and Windows systems, installing datatable is as easy as

pip install datatable

On all other platforms a source distribution will be needed. For more information see Build instructions.
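
Once installed, a typical session looks roughly like the sketch below (data.csv is a placeholder file name):

import datatable as dt

DT = dt.fread("data.csv")    # fast, multi-threaded CSV reader
print(DT.shape)              # (nrows, ncols)
print(DT.head(5))            # preview the first 5 rows

df = DT.to_pandas()          # convert to pandas when interoperability is needed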

Comments
  • [ENH] `nth` function

    Implement a dt.nth(cols, n=0) function to return the nth row (also per group) for the specified columns. If n goes out of bounds, an NA row is returned.
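
    A usage sketch, assuming the signature proposed above; the frame and column names are hypothetical:

    from datatable import dt, f, by

    DT = dt.Frame(group=["a", "a", "b"], value=[10, 20, 30])
    # second row of `value` within each group; a group with fewer rows yields an NA row
    DT[:, dt.nth(f.value, n=1), by(f.group)]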

    Closes #3128

    new feature 
    opened by samukweku 47
  • Implement cumulative functions

    The list of functions to be implemented and the corresponding PRs (a usage sketch follows the list):

    • [x] cumsum() https://github.com/h2oai/datatable/pull/3257
    • [x] cumprod() https://github.com/h2oai/datatable/pull/3304
    • [x] cummax() https://github.com/h2oai/datatable/pull/3288
    • [x] cummin() https://github.com/h2oai/datatable/pull/3288
    • [x] cumcount() https://github.com/h2oai/datatable/pull/3310
    • [x] ngroup() - not strictly cumulative https://github.com/h2oai/datatable/pull/3310
    • [x] fillna() for forward/backward fill https://github.com/h2oai/datatable/pull/3311
    • [x] fillna() for filling with a value https://github.com/h2oai/datatable/pull/3344
    • ~~[ ] rank~~ continued in #3148
    • ~~[ ] rolling aggregations~~ continued in #1500
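
    As a usage sketch for the first of these, assuming cumsum() is exposed as dt.cumsum() per the linked PR (the frame is made up):

    from datatable import dt, f, by

    DT = dt.Frame(group=["a", "a", "b", "b"], x=[1, 2, 3, 4])
    # running sum of x computed within each group
    DT[:, dt.cumsum(f.x), by(f.group)]
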
    new feature 
    opened by samukweku 42
  • Mac M1 import error

    Mac M1 on Big Sur 11.4; Python 3.8.8 in a Miniforge Conda environment; datatable 1.0.0, installed via pip install git+https://github.com/h2oai/datatable. Import error:

    Traceback (most recent call last):
      File "/Users/zwang/miniforge3/envs/tf24/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3437, in run_code
        exec(code_obj, self.user_global_ns, self.user_ns)
      File "<ipython-input-4-98efda56b751>", line 1, in <module>
        import datatable as dt
      File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
        module = self._system_import(name, *args, **kwargs)
      File "/Users/zwang/miniforge3/envs/tf24/lib/python3.8/site-packages/datatable/__init__.py", line 23, in <module>
        from .frame import Frame
      File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
        module = self._system_import(name, *args, **kwargs)
      File "/Users/zwang/miniforge3/envs/tf24/lib/python3.8/site-packages/datatable/frame.py", line 23, in <module>
        from datatable.lib._datatable import Frame
      File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
        module = self._system_import(name, *args, **kwargs)
      File "/Users/zwang/miniforge3/envs/tf24/lib/python3.8/site-packages/datatable/lib/__init__.py", line 31, in <module>
        from . import _datatable as core
      File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
        module = self._system_import(name, *args, **kwargs)
    ImportError: dlopen(/Users/zwang/miniforge3/envs/tf24/lib/python3.8/site-packages/datatable/lib/_datatable.cpython-38-darwin.so, 2): no suitable image found.  Did find:
    	/Users/zwang/miniforge3/envs/tf24/lib/python3.8/site-packages/datatable/lib/_datatable.cpython-38-darwin.so: mach-o, but wrong architecture
    	/Users/zwang/miniforge3/envs/tf24/lib/python3.8/site-packages/datatable/lib/_datatable.cpython-38-darwin.so: mach-o, but wrong architecture
    
    ITA 
    opened by acmilannesta 39
  • [ENH] Column aliasing

    This PR implements column aliasing as proposed in #2684. We couldn't name the method .as(), because as is a reserved Python keyword, so we use .alias() instead. Column aliasing is now also available in the group-by clause.
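
    A short sketch of the usage described above, assuming .alias() is available on f-expressions (the frame and names are made up):

    from datatable import dt, f, by

    DT = dt.Frame(A=[1, 1, 2], B=[10, 20, 30])
    DT[:, f.B.alias("B_renamed")]               # rename a selected column on the fly
    DT[:, dt.sum(f.B), by(f.A.alias("group"))]  # aliasing inside the group-by clause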

    Closes #2504

    test documentation new feature 
    opened by samukweku 30
  • memory leak and speed concerns

    import numpy as np
    import lightgbm_gpu as lgb
    import scipy
    import pandas as pd
    from sklearn.utils import shuffle
    from h2oaicore.metrics import def_rmse
    import datatable as dt
    
    def set_dt_col(train_dt, name, value):
        if isinstance(name, int):
            name = train_dt.names[name]
        train_dt[:, name] = dt.Frame(value)
        return train_dt
    
    nrow = 4000
    ncol = 5000
    X = np.random.randn(nrow, ncol)
    y = np.random.randn(nrow)
    model = lgb.LGBMRegressor(objective='regression', n_jobs=20)  # 40 very slow
    model.fit(X, y)
    
    X_dt = dt.Frame(X)
    cols_actual = list(X_dt.names)
    
    do_numpy = False
    score_f = def_rmse
    preds = model.predict(X)
    main_metric = score_f(actual=y, predicted=preds)
    seed = 1234
    def go():
        feature_importances = {}
        for n in range(ncol):
            print(n, flush=True)
            if do_numpy:
                shuf = shuffle(X[:,n].ravel())
                X_tmp = X # .copy()
                X_tmp[:,n] = shuf
                new_preds = model.predict(X_tmp)
                metric = score_f(actual=y, predicted=new_preds)
                col = "C" + str(n)
                feature_importances[col] = main_metric - metric
            else:
                col = cols_actual[n]
                shuf = shuffle(X_dt[:, col].to_numpy().ravel(), random_state=seed)
                X_tmp = set_dt_col(dt.Frame(X_dt), col, shuf)
                new_preds = model.predict(X_tmp)
    
                metric = score_f(actual=y, predicted=new_preds)
                feature_importances[col] = main_metric - metric
        return feature_importances
    
    print(go())
    

    Related to permutation variable importance.

    With do_numpy = False (i.e. using dt), I see the resident memory slowly creep up from about 0.8GB to 1.6GB by n=1800; by n=4000 it's using 2.7GB.

    With do_numpy = True (no dt involved), resident memory never changes over all n.

    I thought at one point I only saw this with LightGBM and not xgboost, but I'm not sure.

    Unit tests like this numpy version by Microsoft show that LightGBM itself is not leaking: https://github.com/Microsoft/LightGBM/issues/1968

    These two cases aren't doing exactly the same thing: the numpy version keeps shuffling the same original X, while the dt version, I think, essentially holds two copies, although the other original X_dt columns are not modified. @st-pasha can confirm.

    One can add X_tmp = X.copy(), but that's not quite a fair comparison: it makes a full copy, while dt should be able to get away with overwriting only a single column.

    Perhaps the flaw is how we are using dt and the frames?

    bug 
    opened by pseudotensor 27
  • segfault on Ubuntu 20.04 when in combination with LightGBM

    # on host
    cd /tmp/
    wget https://files.slack.com/files-pri/T0329MHH6-F013VU6RW94/download/dt_lgb.gz?pub_secret=fb7b5f3988
    mv 'dt_lgb.gz?pub_secret=fb7b5f3988' dt_lgb.gz
    tar xfz dt_lgb.gz
    docker pull ubuntu:20.04
    docker run -t -v `pwd`:/tmp --security-opt seccomp=unconfined -i ubuntu:20.04 /bin/bash
    
    # on Ubuntu 20.04
    chmod 1777 /tmp
    apt-get update
    DEBIAN_FRONTEND=noninteractive apt-get install -y software-properties-common
    add-apt-repository -y ppa:deadsnakes/ppa
    apt-get update
    apt-get install -y python3.6 python3.6-dev virtualenv libgomp1 gdb vim valgrind
    
    # repro failure
    virtualenv -p python3.6 blah
    source blah/bin/activate
    pip install datatable
    pip install lightgbm
    pip install pandas
    cd /tmp/
    python lgb_prefit_df669346-4e47-4ecf-b131-0838ae8f9474.py
    

    fails with:

    /blah/lib/python3.6/site-packages/lightgbm/basic.py:1295: UserWarning: categorical_feature in Dataset is overridden.
    New categorical_feature is []
      'New categorical_feature is {}'.format(sorted(list(categorical_feature))))
    /blah/lib/python3.6/site-packages/lightgbm/basic.py:842: UserWarning: categorical_feature keyword has been found in `params` and will be ignored.
    Please use categorical_feature argument of the Dataset constructor to pass this parameter.
      .format(key))
    Segmentation fault (core dumped)
    
    segfault 
    opened by arnocandel 26
  • Support for Apache Arrow

    Is there any reason why you did not go with the Apache Arrow format from the beginning?

    It would at least be nice if you allowed to_arrow_table and from_arrow_table conversions.

    question 
    opened by AnthonyJacob 25
  • Aggregator in datatable

    • Is there something datatable can't do just yet, but you think it'd be nice if it did? Aggregate

    • Is it related to some problem you're trying to solve? Solve slow reading of NFF format files.

    • What do you think the API for your feature should be? See API in the Java code. Methods required are in base class DataSource

    See Java code in https://github.com/h2oai/vis-data-server/blob/master/library/src/main/java/com/h2o/data/Aggregator.java

    Plus other classes in that package for support. All of this should be done in C++

    improve 
    opened by lelandwilkinson 25
  • Steps towards Python 3.11 support

    • Replace "Py_TYPE(obj) = type" with: "Py_SET_TYPE(obj, type)"
    • Replace "Py_REFCNT(Py_None) += n" with: "Py_SET_REFCNT(Py_None, Py_REFCNT(Py_None) + n)"
    • Add pythoncapi_compat.h to get Py_SET_TYPE() and Py_SET_REFCNT() on Python 3.9 and older. File copied from: https://github.com/pythoncapi/pythoncapi_compat

    On Python 3.10, Py_REFCNT() can no longer be used to set a reference count:

    • https://docs.python.org/dev/c-api/structures.html#c.Py_REFCNT
    • https://docs.python.org/dev/whatsnew/3.10.html#id2

    On Python 3.11, Py_TYPE() can no longer be used to set an object type:

    • https://docs.python.org/dev/c-api/structures.html#c.Py_TYPE
    • https://docs.python.org/dev/whatsnew/3.11.html#id2
    improve 
    opened by vstinner 21
  • Switch back to the Apache-v2 license

    The vast majority of Python packages use Apache, MIT, BSD, or similarly open licenses. It would be courteous to the broader Python community, and would invite broader collaboration/contribution, if we did as well.

    Historically, this project was Apache-licensed from the very first commit. However, sometime before the public release, we switched to the MPL-2 license. The idea was to have the same license as the R data.table project (which at that time switched from GPL to MPL too). Unfortunately, we failed to grasp the primary difference between the R and Python communities at that point: the majority of R packages are licensed as GPL, and within such an environment an MPL-licensed project can be integrated freely and will be seen as more open compared to others. Within the Python community, on the contrary, an MPL license is more restrictive and will be eyed with suspicion. In fact, the MPL creates a perfectly tangible barrier: the ASF includes it in its Category B list of software that can only be integrated in binary, but not in source code, form.

    Please, share your thoughts/comments.

    wont-fix 
    opened by st-pasha 20
  • FTRL algo does not work properly on views

    Hi,

    I'm trying to use the datatable FTRL-Proximal algorithm on a dataset and it behaves strangely: LogLoss increases with the number of epochs.

    Here is the code I use:

    # imports assumed for this snippet (not shown in the original report)
    import numpy as np
    import datatable as dt
    from datatable.models import Ftrl
    from sklearn.metrics import log_loss

    train_dt = dt.fread('dt_ftrl_test_set.csv.gz')
    features = [f for f in train_dt.names if f not in ['HasDetections']]
    for n in range(10):
        ftrl = Ftrl(nepochs=n+1)
        ftrl.fit(train_dt[:, features], train_dt[:, 'HasDetections'])
        print(log_loss(np.array(train_dt[trn_, 'HasDetections'])[:, 0], np.array(ftrl.predict(train_dt[trn_, features]))))
    

    The output is

    0.6975873940617929
    0.7004277294410224
    0.7030339011892597
    0.705290424565774
    0.7072685897773024
    0.7091474008277487
    0.7108282513596036
    0.7123130263929156
    0.713890830846544
    0.7151695514165213
    

    My own version of FTRL trains correctly, with the following output:

    time_used:0:00:01.026606	epoch: 0   rows:10001	t_logloss:0.59638
    time_used:0:00:01.715622	epoch: 1   rows:10001	t_logloss:0.52452
    time_used:0:00:02.436984	epoch: 2   rows:10001	t_logloss:0.48113
    time_used:0:00:03.158367	epoch: 3   rows:10001	t_logloss:0.44260
    time_used:0:00:03.851369	epoch: 4   rows:10001	t_logloss:0.39633
    time_used:0:00:04.553488	epoch: 5   rows:10001	t_logloss:0.38197
    time_used:0:00:05.264179	epoch: 6   rows:10001	t_logloss:0.35380
    time_used:0:00:05.973398	epoch: 7   rows:10001	t_logloss:0.32839
    time_used:0:00:06.688121	epoch: 8   rows:10001	t_logloss:0.32057
    time_used:0:00:07.394217	epoch: 9   rows:10001	t_logloss:0.29917
    
    • Your environment? I'm on Ubuntu 16.04, clang+llvm-7.0.0-x86_64-linux-gnu-ubuntu-16.04, Python 3.6; datatable is compiled from source.

    let me know if you need more.

    I guess I'm missing something but could not find anything in the unit tests.

    Thanks for your help.

    P.S.: The make test results and the dataset I use are attached: datatable_make_test_results.txt, dt_ftrl_test_set.csv.gz

    bug views 
    opened by goldentom42 19
  • [ENH] nth function

    Implement a dt.nth(cols, n) function to return the nth row (also per group) for the specified columns. If n goes out of bounds, an NA row is returned.

    Closes #3128

    new feature 
    opened by samukweku 1
  • `fread()` doesn't support unicode in file names on Windows

    I just started trying datatable and found that an IOError occurs if the file path contains Chinese characters; the same file reads fine from an all-English path. The error message is attached at the end. I don't know whether a solution already exists; I searched but could not find one.

    IOError                                   Traceback (most recent call last)
    <timed exec> in <module>
    
    IOError: Unable to obtain size of D:/测试.csv: [errno 2] No such file or directory
    
    bug 
    opened by o414o 4
  • DT[f.A == "", :] is bugged for columns with all empty strings

    from datatable import dt, f
    
    DT = dt.Frame({"A": ["", ""]})
    DT[f.A == "", dt.count()][0, 0]
    # 0
    

    If any value in the column is not an empty string, it works as expected.

    Workaround:

    DT[dt.str.len(f.A) == 0, dt.count()][0, 0]
    # 2
    
    bug 
    opened by hallmeier 3
  • Is it possible to read data from gcs://?

    Hi guys, is it possible to read data using fread() from gcs://? I don't see it in the docs, and I don't see any reference in the code either.

    Thank you! V

    opened by vgmartinez 2
Releases (v1.0.0)
  • v0.10.1(Dec 24, 2019)

  • v0.9.0(Dec 3, 2019)

    0.9.0 — 2019-06-15

    Added

    • Added function dt.models.kfold(nrows, nsplits) to prepare indices for k-fold splitting. This function will return nsplits pairs of row selectors such that, when these selectors are applied to an nrows-row frame, that frame will be split into train and test parts according to the k-fold splitting scheme.

    • Added function dt.models.kfold_random(nrows, nsplits, seed), which is similar to kfold(nrows, nsplits), except that the assignment of rows into folds is randomized, not deterministic.

    • Frame.rbind() can now also accept a list or tuple of frames (previously only a vararg sequence was allowed).

    • Method .len() can be applied to a string column to obtain the lengths of strings in each row.

    • Method .re_match(re) applies to a string column, and produces a boolean indicator of whether each value matches the regular expression re or not. The method matches the entire string, not just the beginning; thus, it most closely resembles the Python function re.fullmatch().

    • Added early stopping support to the FTRL algorithm, which can now do binomial and multinomial classification for categorical targets, as well as regression for continuous targets.

    • New function dt.median() can be used to compute the median of a certain column or expression, either per group or for the entire Frame (#1530).

    • Frame.__str__() now returns a string containing the preview of the frame's data. This allows datatable frames to be used with print().

    • Added method dt.options.describe(), which will print the available options together with their values and descriptions.

    • Added dt.options.context(option=value), which can be used in a with- statement to temporarily change the value of one or more options, and then go back to their original values at the end of the with-block.

    • Added options fread.log.escape_unicode (controls treatment of unicode characters in fread's verbose log) and display.use_colors (allows turning colored output in the console on or off).

    • dt.options now helps the user when they make a typo: if an option with a certain name does not exist, the error message will suggest the correct spelling.

    • Most long-running operations in datatable will now show a progress bar. Its behavior can be controlled via the dt.options.progress set of options.

    • Added internal function dt.internal.compiler_version().

    • The new datatable.math module is a library of various mathematical functions that can be applied to datatable Frames. The set of functions is close to what is available in the standard Python math module. See the documentation for more details.

    • New module datatable.sphinxext.dtframe_directive, which can be used as a plugin for Sphinx. This module adds a directive .. dtframe that makes it easy to include a Frame display in an .rst document.

    • Frame can now be treated as an iterable over the columns. Thus, a Frame object can now be used in a for-loop, producing its individual columns.

    • A Frame can now be treated as a mapping; in particular both dict(frame) and **frame are now valid.

    • Single-column frames can be used as sources for Frame construction.

    • CSV writer now quotes fields containing single-quote mark (').

    • Added parameter quoting= to method Frame.to_csv(). The accepted values are 4 constants from the standard csv module: csv.QUOTE_MINIMAL (default), csv.QUOTE_ALL, csv.QUOTE_NONNUMERIC and csv.QUOTE_NONE.
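
    To make a few of the additions above concrete, here is a short sketch that combines them (the frame contents are made up, and the calls assume the APIs exactly as described in the entries above):

      import csv
      import datatable as dt
      from datatable import f, by

      DT = dt.Frame(g=["a", "a", "b", "b"], x=[1.0, 2.0, 3.0, 4.0])

      # per-group median
      DT[:, dt.median(f.x), by(f.g)]

      # k-fold splitting: nsplits pairs of (train_rows, test_rows) row selectors
      train_rows, test_rows = dt.models.kfold(nrows=DT.nrows, nsplits=2)[0]

      # temporarily change an option inside a with-block
      # (dotted option names are passed via a keyword dictionary)
      with dt.options.context(**{"display.use_colors": False}):
          print(DT)

      # quoting constants from the standard csv module are accepted by to_csv();
      # with no path given, the CSV text is returned as a string
      csv_text = DT.to_csv(quoting=csv.QUOTE_ALL)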

    Fixed

    • Fixed crash in certain circumstances when a key was applied after a groupby (#1639).

    • Frame.to_numpy() now returns a numpy masked_array if the frame has any NA values (#1619).

    • A keyed frame will now be rendered correctly when viewing it in python console via Frame.view() (#1672).

    • Str32 column can no longer overflow during the .replace() operation, or when converting from python, numpy or pandas, etc. In all these cases we will now transparently create a Str64 column instead (#1694).

    • The reported frame size (sys.getsizeof(DT)) is now more accurate; in particular the content of string columns is no longer ignored (#1697).

    • Type casting into str32 no longer produces an error if the resulting column is larger than 2GB. Now a str64 column will be returned instead (#1695).

    • Fixed memory leak during computation of a generic DT[i, j] expression. Another memory leak was during generation of string columns, now also fixed (#1705).

    • Fixed crash upon exiting from a python terminal, if the user ever called function frame_column_rowindex().type (#1703).

    • Pandas "boolean column with NAs" (of dtype object) now converts into datatable bool8 column when pandas DataFrame is converted into a datatable Frame (#1730).

    • Fixed conversion to numpy of a view Frame which contains NAs (#1738).

    • datatable can now be safely used with multiprocessing, or other modules that perform fork-without-exec (#1758). The child process will spawn its own thread pool that will have the same number of threads as the parent. Adjust dt.options.nthreads in the child process(es) if a different number of threads is required.

    • The interactive mode is no longer improperly turned on in IPython (#1789).

    • Fixed issue with mis-aligned frame headers in IPython, caused by IPython inserting Out[X]: in front of the rendered Frame display (#1793).

    • Improved rendering of Frames in terminals with white background: we no longer use 'bright_white' color for emphasis, only 'bold' (#1793).

    • Fixed crash when a new column was created via partial assignment, i.e. DT[i, "new_col"] = expr (#1800).

    • Fixed memory leaks/crashes when materializing an object column (#1805).

    • Fixed creating a Frame from a pandas DataFrame that has duplicate column names (#1816).

    • Fixed a UnicodeDecodeError that could be thrown when viewing a Frame with unicode characters in Jupyter notebook. The error only manifested for strings that were longer than 50 bytes in length (#1825).

    • Fixed crash when Frame.colindex() was used without any arguments, now this raises an exception instead (#1834).

    • Fixed possible crash when writing to a disk that doesn't have enough free space (#1837).

    • Fixed invalid Frame being created when reading a large string column (str64) with fread, and the column contains NA values.

    • Fixed FTRL model not resuming properly after unpickling (#1846).

    • Fixed crash that occurred when sorting by multiple columns, and the first column is of low cardinality (#1857).

    • Fixed display of NA values produced during a join, when a Frame was displayed in Jupyter Lab (#1872).

    • Fixed a crash when replacing values in a str64 column (#1890).

    • cbind() no longer throws an error when passed a generator producing temporary frames (#1905).

    • Fixed comparison of string columns vs. value None (#1912).

    • Fixed a crash when trying to select individual cells from a joined Frame, for the cells that were un-matched during the join (#1917).

    • Fixed a crash when writing a joined frame into CSV (#1919).

    • Fixed a crash when writing into CSV string view columns, especially of str64 type (#1921).

    Changed

    • A Frame will no longer be shown in "interactive" mode in console by default. The previous behavior can be restored with dt.options.display.interactive = True. Alternatively, you can explore a Frame interactively using frame.view(True).

    • Improved performance of type-casting a view column: now the code avoids materializing the column before performing the cast.

    • Frame class is now defined fully in C++, improving code robustness and performance. The property Frame.internal was removed, as it no longer represents anything. Certain internal properties of a Frame can be accessed via functions declared in the dt.internal module.

    • datatable no longer uses OpenMP for parallelism. Instead, we use our own thread pool to perform multi-threaded computations (#1736).

    • Parameter progress_fn in function dt.models.aggregate() is removed. In its place you can set the global option dt.options.progress.callback.

    • Removed deprecated Frame methods .topython(), .topandas(), .tonumpy(), and Frame.__call__().

    • Syntax DT[col] has been restored (it was previously deprecated in 0.7.0); however, it works only when col is an integer or a string. Support for slices may be added in the future, or not: DT[a:b] could be confused with a row selection. A column slice may still be selected via the i-j selector DT[:, a:b].

    • The nthreads= parameter in Frame.to_csv() was removed. If needed, please set the global option dt.options.nthreads.

    Deprecated

    • Frame method .scalar() is now deprecated and will be removed in release 0.10.0. Please use frame[0, 0] instead.

    • Frame method .append() is now deprecated and will be removed in release 0.10.0. Please use .rbind() instead.

    • Frame method .save() was renamed into .to_jay() (for consistency with other .to_*() methods). The old name is still usable, but marked as deprecated and will be removed in 0.10.0.

    Notes

    • Thanks to everyone who helped make datatable more stable by discovering and reporting bugs that were fixed in this release:

      • [Arno Candel][] (#1619, #1730, #1738, #1800, #1803, #1846, #1857, #1890, #1891, #1919, #1921),

      • [Antorsae][] (#1639),

      • [Olivier][] (#1872),

      • [Hawk Berry][] (#1834),

      • [Jonathan McKinney][] (#1816, #1837),

      • [Mateusz Dymczyk][] (#1912),

      • [NachiGithub][] (#1789, #1793),

      • [Pasha Stetsenko][] (#1672, #1694, #1695, #1697, #1703, #1705, #1905, #1917),

      • [Tom Kraljevic][] (#1805),

      • [XiaomoWu][] (#1825)

    Source code(tar.gz)
    Source code(zip)
    datatable-0.9.0-cp35-cp35m-linux_ppc64le.whl(10.72 MB)
    datatable-0.9.0-cp35-cp35m-linux_x86_64.whl(10.53 MB)
    datatable-0.9.0-cp35-cp35m-macosx_10_7_x86_64.whl(1.62 MB)
    datatable-0.9.0-cp35-cp35m-manylinux2010_x86_64.whl(14.59 MB)
    datatable-0.9.0-cp36-cp36m-linux_ppc64le.whl(10.72 MB)
    datatable-0.9.0-cp36-cp36m-linux_x86_64.whl(10.53 MB)
    datatable-0.9.0-cp36-cp36m-macosx_10_7_x86_64.whl(1.62 MB)
    datatable-0.9.0-cp36-cp36m-manylinux2010_x86_64.whl(14.60 MB)
    datatable-0.9.0-cp37-cp37m-linux_ppc64le.whl(10.72 MB)
    datatable-0.9.0-cp37-cp37m-linux_x86_64.whl(10.53 MB)
    datatable-0.9.0-cp37-cp37m-manylinux2010_x86_64.whl(14.68 MB)
    datatable-0.9.0.tar.gz(725.80 KB)
  • v0.8.0(Jul 25, 2019)

    0.8.0 — 2019-01-04

    Added

    • Method frame.to_tuples() converts a Frame into a list of tuples, each tuple representing a single row (#1439).

    • Method frame.to_dict() converts the Frame into a dictionary where the keys are column names and values are lists of elements in each column (#1439).

    • Methods frame.head(n) and frame.tail(n) added, returning the first/last n rows of the Frame respectively (#1307).

    • Frame objects can now be pickled using the standard Python pickle interface (#1444). This also has an added benefit of reducing the potential for a deadlock when using the multiprocessing module.

    • Added function repeat(frame, n) that creates a new Frame by row-binding n copies of the frame (#1459).

    • The datatable module now exposes a C API to allow other C/C++ libraries to interact with datatable Frames natively (#1469). See "datatable/include/datatable.h" for the description of the API functions. Thanks Qiang Kou for testing this functionality.

    • The column selector j in DT[i, j] can now be a list/iterator of booleans. This list should have length DT.ncols, and the entries in this list will indicate whether to select the corresponding column of the Frame or not (#1503). This can be used to implement a simple column filter, for example:

      del DT[:, (name.endswith("_tmp") for name in DT.names)]
      
    • Added ability to train and fit an FTRL-Proximal (Follow The Regularized Leader) online learning algorithm on a data frame (#1389). The implementation is multi-threaded and has high performance.

    • Added functions log and log10 for computing the natural and base-10 logarithms of a column (#1558).

    • Sorting functionality is now integrated into the DT[i, j, ...] call via the function sort(). If sorting is specified alongside a groupby, the values will be sorted within each group (#1531).

    • A slice-valued i expression can now be combined with a by() operator in DT[i, j, by()]. The result is that the slice i is applied to each group produced by by(), before the j is evaluated (#1585).

    • Implemented sorting in reverse direction, via sort(-col), where col is any regular column selector such as f.A or f[column]. The - sign is symbolic, no actual negation occurs. As such, this works even for string columns (#792).
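
    A short sketch of the sorting syntax described in the last two entries above (the frame is made up):

      from datatable import dt, f, by, sort

      DT = dt.Frame(grp=["a", "b", "a", "b"], val=[3, 1, 2, 4])

      # sorting inside the DT[i, j, ...] call; combined with by(), values are sorted within each group
      DT[:, :, by(f.grp), sort(f.val)]

      # reverse ordering via the symbolic minus sign (works even for string columns)
      DT[:, :, sort(-f.val)]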

    Fixed

    • Fixed rendering of "view" Frames in a Jupyter notebook (#1448). This bug caused the frame to display wrong data when viewed in a notebook.

    • Fixed crash when an int-column i selector is applied to a Frame which already had another row filter applied (#1437).

    • Frame.copy() now retains the frame's key, if any (#1443).

    • Installation from source distribution now works as expected (#1451).

    • When a g.-column is used but there is no join frame, an appropriate error message is now emitted (#1481).

    • The equality operators == / != can now be applied to string columns too (#1491).

    • Function dt.split_into_nhot() now works correctly with view Frames (#1507).

    • DT.replace() now works correctly when the replacement list is [+inf] or [1.7976931348623157e+308] (#1510).

    • FTRL algorithm now works correctly with view frames (#1502).

    • Partial column update (i.e. expression of the form DT[i, j] = R) now works for string columns as well (#1523).

    • DT.replace() now throws an error if called with 0 or 1 argument (#1525).

    • Fixed crash when viewing a frame obtained by resizing a 0-row frame (#1527).

    • Function count() now returns correct result within the DT[i, j] expression with non-trivial i (#1316).

    • Fixed groupby when it is applied to a Frame with view columns (#1542).

    • When replacing an empty set of columns, the replacement frame can now be also empty (i.e. have shape [0 x 0]) (#1544).

    • Fixed join results when join is applied to a view frame (#1540).

    • Fixed Frame.replace() in view string columns (#1549).

    • A 0-row integer column can now be used as i in DT[i, j] (#1551).

    • A string column produced from a partial join now materializes correctly (#1556).

    • Fixed incorrect result during "true division" of integer columns, when one of the values was negative and the other positive (#1562).

    • Frame.to_csv() no longer crashes on Unix when writing an empty frame (#1565).

    • The build process on macOS now ensures that libomp.dylib is properly referenced via @rpath. This prevents installation problems caused by dynamic dependencies being referenced via absolute paths that are not valid outside of the build machine (#1559).

    • Fixed crash when the RHS of assignment DT[i, j] = ... was a list of expressions (#1539).

    • Fixed crash when an empty by() condition was used in DT[i, j, by] (#1572).

    • Expression DT[:, :, by(...)] no longer produces duplicates of columns used in the by-clause (#1576).

    • In certain circumstances mixing computed and plain columns under groupby caused incorrect result (#1578).

    • Fixed an internal error which was occurring when multiple row filters were applied to a Frame in sequence (#1592).

    • Fixed rbinding of frames if one of them is a negative step slice (#1594).

    • Fixed a crash that occurred with the latest pandas 0.24.0 (#1600).

    • Fixed invalid result when cbinding several 0-row frames (#1604).

    Changed

    • The primary datatable expression DT[i, j, ...] is now evaluated entirely in C++, improving performance and reliability.

    • Setting frame.nrows now always pads the Frame with NAs, even if the Frame has only 1 row. Previously changing .nrows on a 1-row Frame caused its value to be repeated. Use frame.repeat() in order to expand the Frame by copying its values.

    • Improved the performance of setting frame.nrows. Now if the frame has multiple columns, a view will be created.

    • When no columns are selected in DT[i, j], the returned frame will now have the same number of rows as if at least 1 column was selected. Previously an empty [0 x 0] frame was returned.

    • Assigning a value to a column DT[:, 'A'] = x will attempt to preserve the column's stype; or if not possible, the column will be upcasted within its logical type.

    • It is no longer possible to assign a value of an incompatible logical type to an existing column. For example, an assignment DT[:, 'A'] = 3 is now legal only if column A is of integer or real type, but will raise an exception if A is a boolean or string.

    • Frame.rbind() method no longer has a return value. The method always updated the frame in-place, so it was confusing to both update in-place and return the original frame (#1610).

    • min() / max() over an empty or all-NA column now returns None instead of +Inf / -Inf respectively (#1624).

    Deprecated

    • Frame methods .topython(), .topandas() and .tonumpy() are now deprecated, they will be removed in 0.9.0. Please use .to_list(), .to_pandas() and .to_numpy() instead.

    • Calling a frame object DT(rows=i, select=j, groupby=g, join=z, sort=s) is now deprecated. Use the expression DT[i, j, by(g), join(z), sort(s)] instead, where symbols by(), join() and sort() can all be imported from the datatable namespace (#1579).
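
    For illustration, a call-form query and its bracket-form equivalent might look like the sketch below (column names are made up):

      from datatable import dt, f, by, sort

      DT = dt.Frame(g=["a", "a", "b"], x=[1, -1, 2], y=[10, 20, 30])

      # old, deprecated call form:
      #     DT(rows=f.x > 0, select=f.y, groupby="g", sort="y")
      # new bracket form:
      DT[f.x > 0, f.y, by(f.g), sort(f.y)]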

    Removed

    • Single-item Frame selectors are now prohibited: DT[col] is an error. In the future this expression will be interpreted as a row selector instead.

    Notes

    • datatable now uses integration with Codacy to keep track of code quality and potential errors.

    • Internally, we now allow each Column in a Frame to have its own separate RowIndex. This will improve the performance, especially in join/cbind operations. Applications that use the datatable's C API may need to be updated to account for this (#1188).

    • This release was prepared by:

    • Additional thanks to people who helped make datatable more stable by discovering and reporting bugs that were fixed in this release:

      Pasha Stetsenko (#1316, #1443, #1481, #1539, #1542, #1551, #1572, #1576, #1578, #1592, #1594, #1602, #1604), Arno Candel (#1437, #1491, #1510, #1525, #1549, #1556, #1562), Michael Frasco (#1448), Jonathan McKinney (#1451, #1565), CarlosThinkBig (#1475), Olivier (#1502), Oleksiy Kononenko (#1507, #1600), Nishant Kalonia (#1527, #1540), Megan Kurka (#1544), Joseph Granados (#1559).


    Download links

    Source code(tar.gz)
    Source code(zip)
    datatable-0.8.0-cp35-cp35m-linux_ppc64le.whl(1.93 MB)
    datatable-0.8.0-cp35-cp35m-linux_x86_64.whl(1.53 MB)
    datatable-0.8.0-cp35-cp35m-macosx_10_7_x86_64.whl(1.40 MB)
    datatable-0.8.0-cp36-cp36m-linux_ppc64le.whl(1.93 MB)
    datatable-0.8.0-cp36-cp36m-linux_x86_64.whl(1.53 MB)
    datatable-0.8.0-cp36-cp36m-macosx_10_7_x86_64.whl(1.40 MB)
    datatable-0.8.0-cp37-cp37m-linux_ppc64le.whl(1.93 MB)
    datatable-0.8.0-cp37-cp37m-linux_x86_64.whl(1.53 MB)
    datatable-0.8.0-cp37-cp37m-macosx_10_7_x86_64.whl(1.40 MB)
    datatable-0.8.0.tar.gz(575.81 KB)
  • v0.7.0(Nov 20, 2018)

    v0.7.0 — 2018-11-16

    Added

    • Frame can now be created from a list/dict of numpy arrays.
    • Filters can now be used together with groupby expressions.
    • fread's verbose output now includes time spent opening the input file.
    • Added ability to read/write Jay files.
    • Frames can now be constructed via the keyword-args list of columns (i.e. Frame(A=..., B=...)).
    • Implemented logical operators "and" & and "or" | for eager evaluator.
    • Implemented integer division // and modulo % operators.
    • A Frame can now have a key column (or columns).
    • Key column(s) are saved when the frame is saved into a Jay file.
    • A Frame can now be naturally-joined with a keyed Frame.
    • Columns can now be updated within join expressions.
    • The error message when selecting a column that does not exist in the Frame now refers to similarly-named columns in that Frame, if there are any. At most 3 possible columns are reported, and they are ordered from most likely to least likely (#1253).
    • Frame() constructor now accepts a list of tuples, which it treats as rows when creating the frame.
    • Frame() can now be constructed from a list of named tuples, which will be treated as rows and field names will be used as column names.
    • frame.copy() can now be used to create a copy of the Frame.
    • Frame() can now be constructed from a list of dictionaries, where each item in the list represents a single row.
    • Frame() can now be created from a datetime64 numpy array (#1274).
    • Groupby calculations are now parallel.
    • Frame.cbind() now accepts a list of frames as the argument.
    • Frame can now be sorted by multiple columns.
    • new function split_into_nhot() to split a string column into fragments and then convert them into a set of indicator variables ("n-hot encode").
    • ability to convert object columns into strings.
    • implemented Frame.replace() function.
    • function abs() to find the absolute value of elements in the frame.
    • improved handling of Excel files by fread:
      • sheet name can now be used as a path component in the file name, causing only that particular sheet to be parsed;
      • further, a cell range can be specified as a path component after the sheet name, forcing fread to consider only the provided cell range;
      • fread can now handle the situation when a spreadsheet has multiple separate tables in the same sheet. They will now be detected automatically and returned to the user as separate Frame objects (the name of each frame will contain the sheet name and cell range from where the data was extracted).
    • HTML rendering of Frames inside a Jupyter notebook.
    • set-theoretic functions: union, intersect, setdiff and symdiff.
    • support for multi-column keys.
    • ability to join Frames on multiple columns.
    • In Jupyter notebook columns now have visual indicators of their types. The logical types are color-coded, and the size of each element is given by the number of dots (#1428).
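
    A sketch combining several of the constructors and the keyed natural join listed above (the data is made up):

      import datatable as dt
      from datatable import join

      # keyword-args constructor
      DT = dt.Frame(id=["x", "y", "z"], amount=[1, 2, 3])

      # a list of dictionaries is treated as a list of rows
      labels = dt.Frame([{"id": "x", "label": 10}, {"id": "y", "label": 20}])

      # natural join against a keyed frame; unmatched rows get NA in the joined columns
      labels.key = "id"
      joined = DT[:, :, join(labels)]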

    Changed

    • names argument in Frame() constructor can no longer be a string -- use a list or tuple of strings instead.
    • Frame.resize() removed -- same functionality is available via assigning to Frame.nrows.
    • Frame.rename() removed -- the .names setter can be used instead.
    • Frame([]) now creates a 0x0 Frame instead of 0x1.
    • Parameter inplace in Frame.cbind() removed (was deprecated). Instead of inplace=False use dt.cbind(...).
    • Frame.cbind() no longer returns anything (previously it returned self, but this was confusing w.r.t whether it modifies the target, or returns a modified copy).
    • DT[i, j] now returns a python scalar value if i is integer, and j is integer/string. This is referred to as "explicit element selection". In the unlikely scenario when a single element needs to be returned as a frame, one can always write DT[i:i+1, j] or DT[[i], j].
    • The performance of explicit element selection improved by a factor of 200x.
    • Building no longer requires an LLVM distribution.
    • DT[col] syntax has been deprecated and now emits a warning. This will be converted to an error in version 0.8.0, and will be interpreted as row selector in 0.9.0.
    • default format for Frame.save() is now "jay".

    Fixed

    • bug in dt.cbind() where the first Frame in the list was ignored.
    • bug with applying a cast expression to a view column.
    • occasional memory errors caused by a lack of available mmap handles.
    • memory leak in groupby operations.
    • names parameter in Frame constructor is now checked for correctness.
    • bug in fread with QR bump occurring out-of-sample.
    • import datatable now takes only 0.13s, down from 0.6s.
    • fread no longer wastes time reading the full input, if max_nrows option is used.
    • bug where the max_nrows parameter was sometimes causing a segfault
    • fread performance bug caused by memory-mapped file being accidentally copied into RAM.
    • rare crash in fread when resizing the number of rows.
    • saving view frames to csv.
    • crash when sorting string columns containing NA strings.
    • crash when applying a filter to a 0-rows frame.
    • if x is a Frame, then y = dt.Frame(x) now creates a shallow copy instead of a copy-by-reference.
    • upgraded dependency version for typesentry, the previous version was not compatible with Python 3.7.
    • rare crash when converting a string column from pandas DataFrame, when that column contains many non-ASCII characters.
    • f-column-selectors should no longer throw errors and produce only unique ids when stringified (#1241).
    • crash when saving a frame with many boolean columns into CSV (#1278).
    • incorrect .stypes/.ltypes property after calling cbind().
    • calculation of min/max values in internal rowindex upon row resizing.
    • frame.sort() with no arguments no longer produces an error.
    • f-expressions now do not crash when reused with a different Frame.
    • g-columns can be properly selected in a join (#1352).
    • writing to disk of columns > 2GB in size (#1387).
    • crash when sorting by multiple columns and the first column was of string type (#1401).

    Download links

    Source code(tar.gz)
    Source code(zip)
    datatable-0.7.0-cp35-cp35m-linux_ppc64le.whl(1.86 MB)
    datatable-0.7.0-cp35-cp35m-linux_x86_64.whl(1.44 MB)
    datatable-0.7.0-cp35-cp35m-macosx_10_7_x86_64.whl(1.33 MB)
    datatable-0.7.0-cp36-cp36m-linux_ppc64le.whl(1.86 MB)
    datatable-0.7.0-cp36-cp36m-linux_x86_64.whl(1.44 MB)
    datatable-0.7.0-cp36-cp36m-macosx_10_7_x86_64.whl(1.33 MB)
    datatable-0.7.0-cp37-cp37m-linux_ppc64le.whl(1.86 MB)
    datatable-0.7.0-cp37-cp37m-linux_x86_64.whl(1.44 MB)
    datatable-0.7.0-cp37-cp37m-macosx_10_7_x86_64.whl(1.32 MB)
    datatable-0.7.0.tar.gz(513.80 KB)
  • v0.6.0(Jun 6, 2018)

    v0.6.0 — 2018-06-05

    Added

    • fread will detect feather file and issue an appropriate error message.
    • when fread extracts data from archives into memory, it will now display the size of the extracted data in verbose mode.
    • syntax DT[i, j, by] is now supported.
    • multiple reduction operators can now be performed at once.
    • in groupby, reduction columns can now be combined with regular or computed columns.
    • during grouping, group keys are now added automatically to the select list.
    • implement sum() reducer.
    • == operator now works for string columns too.
    • Improved performance of groupby operations.

    Fixed

    • fread will no longer emit an error if there is an NA string in the header.
    • if the input contains excessively long lines, fread will no longer waste time printing a sample of first 5 lines in verbose mode.
    • fixed wrong calculation of mean / standard deviation of line length in fread if the sample contained broken lines.
    • frame view will no longer get stuck in a Jupyter notebook.

    Download links

    Source code(tar.gz)
    Source code(zip)
    datatable-0.6.0-cp35-cp35m-linux_ppc64le.whl(3.56 MB)
    datatable-0.6.0-cp35-cp35m-linux_x86_64.whl(3.03 MB)
    datatable-0.6.0-cp35-cp35m-macosx_10_7_x86_64.whl(887.14 KB)
    datatable-0.6.0-cp36-cp36m-linux_ppc64le.whl(3.56 MB)
    datatable-0.6.0-cp36-cp36m-linux_x86_64.whl(3.03 MB)
    datatable-0.6.0-cp36-cp36m-macosx_10_7_x86_64.whl(887.15 KB)
    datatable-0.6.0.tar.gz(369.88 KB)
  • v0.5.0(May 25, 2018)

    v0.5.0 — 2018-05-25

    Added

    • rbind()-ing now works on columns of all types (including between columns of different types).
    • dt.rbind() function to perform out-of-place row binding.
    • ability to change the number of rows in a Frame.
    • ability to modify a Frame in-place by assigning new values to particular cells.
    • dt.__git_version__ variable containing the commit hash from which the package was built.
    • ability to read .bz2 compressed files with fread.

    Fixed

    • Ensure that fread only emits messages to Python from the master thread.
    • Fread can now properly recognize quoted NA strings.
    • Fixed error when unbounded f-expressions were printed to console.
    • Fixed problems when operating with too many memory-mapped Frames at once.
    • Fixed incorrect groupby calculation in some rare cases.

    Download links

    Source code(tar.gz)
    Source code(zip)
    datatable-0.5.0-cp35-cp35m-linux_ppc64le.whl(3.41 MB)
    datatable-0.5.0-cp35-cp35m-linux_x86_64.whl(2.91 MB)
    datatable-0.5.0-cp35-cp35m-macosx_10_7_x86_64.whl(862.52 KB)
    datatable-0.5.0-cp36-cp36m-linux_ppc64le.whl(3.41 MB)
    datatable-0.5.0-cp36-cp36m-linux_x86_64.whl(2.91 MB)
    datatable-0.5.0-cp36-cp36m-macosx_10_7_x86_64.whl(862.51 KB)
    datatable-0.5.0.tar.gz(363.31 KB)
  • v0.3.2(Apr 25, 2018)

    Added

    • Implemented sorting for str64 columns.
    • write_csv can now write columns of type str64.
    • Fread can now accept a list of files to read, or a glob pattern.

    Fixed

    • Fix the source distribution (sdist) by including all the files that are required for building from source.
    • Install no longer fails with llvmlite 0.23.0 package.
    Source code(tar.gz)
    Source code(zip)
  • v0.3.1(Apr 20, 2018)

    Added

    • Added ability to delete rows from a view Frame.
    • Implement countna() function for obj64 columns.
    • New option dt.options.core_logger to help debug datatable.
    • New Frame method .materialize() to convert a view Frame into a "real" one. This method is a no-op if applied to a non-view Frame.
    • Several internal options to fine-tune the performance of sorting algorithm.
    • Significantly improved performance of sorting doubles.
    • fread can now read string columns that are larger than 2GB in size.
    • fread can now accept a list/tuple of stypes for its columns parameter.
    • improved logic for auto-assigning column names when they are missing.
    • fread now supports reading files that contain NUL characters.
    • Added global settings options.frame.names_auto_index and options.frame.names_auto_prefix to control automatic column name generation in a Frame.

    Changed

    • When creating a column of "object" type, we will now coerce float "nan" values into Nones.
    • Renamed fread's parameter strip_white into strip_whitespace.
    • Eliminated all assert() statements from C code, and replaced them with exception throws.
    • Default column names, if none given by the user, are "C0", "C1", "C2", ... for both fread and Frame constructor.
    • The function-valued columns parameter in fread has been changed: previously the function was invoked for every column; now it receives the list of all columns at once and is expected to return a modified list (or dict / set / etc.). Each column description in the list that the function receives carries the column's name and stype; in the future a format field will also be added.

    Fixed

    • fread will no longer consume excessive amounts of memory when reading a file with too many columns and few rows.
    • fixed a possible crash when reading CSV file containing long string fields.
    • fread: NA fields with whitespace were not recognized correctly.
    • fread will no longer emit error messages or type-bump variables due to incorrectly recognized chunk boundaries.
    • Fixed a crash when rbinding string column with non-string: now an exception will be thrown instead.
    • Calling any stats function on a column of obj64 type will no longer result in a crash.
    • Columns/rows slices no longer fail on an empty Frame.
    • Fixed crash when materializing a view frame containing obj64 columns.
    • Fixed erroneous grouping calculations.
    • Fixed sorting of 1-row view frames.
    Source code(tar.gz)
    Source code(zip)
  • v0.3.0(Mar 19, 2018)

    Added

    • Method df.tonumpy() now has argument stype which will force conversion into a numpy array of the specific stype.
    • Enums stype and ltype that encapsulate the type-system of the datatable module.
    • It is now possible to fread from a bytes object.
    • Allow columns to be renamed by setting the names property on the datatable.
    • Internal "MemoryMapManager" will make datatable more robust when opening a frame with many columns on Linux systems. In particular, error 12 "not enough memory" should become much more rare now.
    • Number of threads used by fread can now be controlled via parameter nthreads.
    • It is now possible to supply string argument to dt.DataTable constructor, which in turn will try to interpret that argument via fread.
    • fread can now read compressed .xz files.
    • fread now automatically skips Ctrl+Z / NUL characters at the end of the file.
    • It is now possible to create a datatable from a string numpy array.
    • Added parameters skip_blank_lines, strip_white, quotechar and dec to fread.
    • Single-column files with blank lines can now be read successfully.
    • Fread now recognizes \r\r\n as a valid line ending.
    • Added parameters url and cmd to fread, as well as the ability to detect URLs automatically. The url parameter downloads a file from an HTTP/HTTPS/FTP server into a temporary location and reads it from there. The cmd parameter executes the provided shell command and then reads the data from its stdout.
    • It is now possible to pass file objects to fread (or any objects exposing method read()).
    • File path given to fread can now transparently select files within .zip archives. This doesn't work with archives-within-archives.
    • GenericReader now supports auto-detecting and reading UTF-16 files.
    • GenericReader now attempts to detect whether the input file is an HTML, and if so raises an exception with the appropriate error message.
    • Datatable can now use either llvm-4.0 or llvm-5.0 depending on what the user has.
    • fread now allows sep="", causing the file to be read line-by-line.
    • range arguments can now be passed to a DataTable constructor.
    • datatable will now fall back to eager execution if it cannot detect LLVM runtime.
    • simple Excel file reader.
    • It is now possible to select columns from DataTable by type: df[int] selects all integer columns from df.
    • Allow creating DataTable from list, while forcing a specific stype(s).
    • Added ability to delete rows from a DataTable: del df[rows, :]
    • DataTable can now accept pandas/numpy frames with columns of float16 dtype (which will be automatically converted to float32).
    • .isna() function now works on strings too.
    • .save() is now a method of Frame class.
    • Warnings now have custom display hook.
    • Added global option nthreads which controls the number of OMP threads used by datatable for parallel execution. Example: dt.options.nthreads = 1.
    • Add method .scalar() to quickly convert a 1x1 Frame into a python scalar.
    • New methods .min1(), .max1(), .mean1(), .sum1(), .sd1(), .countna1() that are similar to .min(), .max(), etc., but return a scalar instead of a Frame (however, they only work with 1-column Frames).
    • Implemented method .nunique() to compute the number of unique values in each column.
    • Added stats functions .mode() and .nmodal().

    Changed

    • When writing "round" doubles/floats to CSV, they'll now always have trailing zero. For example, [0.0, 1.0, 1e23] now produce "0.0,1.0,1.0e+23" instead of "0,1,1e+23".
    • df.stypes now returns a tuple of stype elements (previously it was returning a list of strings). Likewise, df.types was renamed into df.ltypes and now it returns a tuple of ltype elements instead of strings.
    • Parameter colnames= in DataTable constructor was renamed to names=. The old parameter may still be used, but it will result in a warning.
    • DataTable can no longer have duplicate column names. If such names are given, they will be mangled to make them unique, and a warning will be issued.
    • Special characters (in the ASCII range \x00 - \x1F) are no longer permitted in column names. If encountered, they will be replaced with a dot ".".
    • Fread now ignores trailing whitespace on each line, even if ' ' separator is used.
    • Fread on an empty file now produces an empty DataTable, instead of an exception.
    • Fread's parameter skip_lines was replaced with skip_to_line, so that it's more in sync with the similar argument skip_to_string.
    • When saving a datatable containing "obj64" columns, those columns will no longer be saved, and a user warning will be shown (previously saving such a column would eventually lead to a segfault).
    • (python) DataTable class was renamed into Frame.
    • "eager" evaluation engine is now the default.
    • Parameter inplace of method rbind() was removed: instead you can now rbind frames to an empty frame: dt.Frame().rbind(df1, df2).

    Fixed

    • datatable will no longer cause the C locale settings to change upon importing.
    • reading a csv file with invalid UTF-8 characters in column names will no longer throw an exception.
    • creating a DataTable from pandas.Series with explicit colnames will no longer ignore those column names.
    • fread(fill=True) will correctly fill missing fields with NAs.
    • fread(columns=set(...)) will correctly handle the case when the input contains multiple columns with the same names.
    • fread will no longer crash if the input dataset contains invalid utf8/win1252 data in the column headers (#594, #628).
    • fixed bug in exception handling, which occasionally caused empty exception messages.
    • fixed bug in fread where string fields starting with "NaN" caused an assertion error.
    • Fixed bug when saving a DataTable with unicode column names into .nff format on systems where default encoding is not unicode-aware.
    • More robust newline handling in fread (#634, #641, #647).
    • Quoted fields are now correctly unquoted in fread.
    • Fixed a bug in fread which occurred if the number of rows in the CSV file was estimated too low (#664).
    • Fixed fread bug where an invalid DataTable was constructed if parameter max_nrows was used and there were any string columns (#671).
    • Fixed a rare bug in fread which produced error message "Jump X did not finish reading where jump X+1 started" (#682).
    • Prevented memory leak when using "PyObject" columns in conjunction with numpy.
    • View frames can now be properly saved.
    • Fixed crash when sorting view frame by a string column.
    • Deleting 0 columns is no longer an error.
    • The rows filter now works properly when applied to a view table under the "eager" evaluation engine.
    • A computed-columns expression can now be combined with a rows expression, or applied to a view Frame.
  • v0.2.2(Oct 18, 2017)

    Added

    • Ability to write DataTable into a CSV file: the .to_csv() method. The CSV writer is multi-threaded and extremely fast.
    • Added .internal.column(i).data_pointer getter, to allow native code from other libraries to easily access the data in each column.
    • Fread can now read hexadecimal floating-point numbers: floats and doubles.
    • The CSV writer will now auto-quote an empty string, as well as a string containing leading/trailing whitespace, so that it can be read back by fread reliably.
    • Fread now prints file sizes in "human-readable" form, i.e. KB/MB/GB instead of bytes.
    • Fread can now understand a variety of "NaN" / "Inf" literals produced by different systems.
    • Added option hex to the CSV writer, which controls whether floats are written in decimal (default) or hexadecimal format (see the sketch after this list).
    • The CSV writer now uses the "dragonfly" algorithm for writing doubles, which is faster than all known alternatives.
    • It is now allowed to pass a single-row numpy array as an argument to dt(rows=...), which will be treated the same as if it were a single-column array.
    • Now datatable's wheel will include libraries libomp and libc++ on the platforms where they are not widely available.
    • Fread's new argument logger allows the user to supply a custom logging mechanism. When this argument is provided, "verbose" mode is turned on automatically.
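
    A hedged sketch of the CSV round trip described by these entries; the hex= keyword name is taken from the entry above, and the modern Frame name is used even though at v0.2.2 the class was still called DataTable:

        import datatable as dt

        df = dt.Frame(x=[0.1, 2.5, 3.75])

        # Multi-threaded CSV writer; hex=True (keyword assumed from the
        # entry above) writes doubles in hexadecimal instead of decimal.
        df.to_csv("data_decimal.csv")
        df.to_csv("data_hex.csv", hex=True)

        # fread understands hexadecimal floats as well as NaN/Inf literals.
        df2 = dt.fread("data_hex.csv")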

    Changed

    • datatable will no longer attempt to distinguish between NA and NAN floating-point values.
    • Constructing a DataTable from a 2D numpy array now preserves the shape of that array. At the same time, it is no longer true that arr.tolist() == numpy.array(DataTable(arr)).tolist(): the list will be transposed.
    • Converting a DataTable into a numpy array now also preserves its shape. At the same time, it is no longer true that dt.topython() == dt.tonumpy().tolist(): the list will be transposed (illustrated after this list).
    • The internal _datatable module was moved to datatable.lib._datatable.
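
    A short illustration of the shape-preservation and transposition notes above, using the period-appropriate names from these entries (later releases renamed the class to Frame):

        import numpy as np
        import datatable as dt

        arr = np.array([[1, 2, 3],
                        [4, 5, 6]])    # shape (2, 3)

        # Shape is now preserved: a 2x3 array becomes a 2x3 DataTable.
        d = dt.DataTable(arr)

        # Because data is stored by column, the Python-list view is the
        # transpose of arr.tolist():
        # d.topython()  ->  [[1, 4], [2, 5], [3, 6]]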

    Fixed

    • datatable will now convert huge integers into double inf values instead of raising an exception.
  • v0.2.1(Sep 12, 2017)

    Added

    • The environment variable DTNOOPENMP will cause datatable to be built without OpenMP support.
    • If d0 is a DataTable, then d1 = DataTable(d0) will create a shallow copy of d0 (a sketch follows this list).
    • In addition to the LLVM4 environment variable, datatable will now also look for the llvm4 folder within the package's directory.
    • Getter df.internal.rowindex allows access to the RowIndex on the DataTable (for inspection / reuse).
    • Implemented statistics min, max, mean, stdev, countna for numeric and boolean columns.
    • A framework for computing and storing per-column summary statistics.
    • sys.getsizeof(dt) can now be used to query the size of the datatable in memory.
    • This CHANGELOG file.
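
    A brief sketch of the shallow-copy, statistics, and memory-introspection entries above (DataTable is the class name used in this era; the data is invented):

        import sys
        import datatable as dt

        d0 = dt.DataTable({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

        # Shallow copy: d1 shares the underlying data with d0.
        d1 = dt.DataTable(d0)

        # Per-column summary statistics for numeric and boolean columns.
        means = d0.mean()

        # Approximate in-memory size of the datatable, in bytes.
        print(sys.getsizeof(d0))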

    Fixed

    • The filter function, when applied to a view DataTable, now produces the correct result.
  • v0.2.0(Aug 30, 2017)
