High performance datastore for time series and tick data

Overview

Arctic TimeSeries and Tick store

Arctic is a high performance datastore for numeric data. It supports Pandas, numpy arrays and pickled objects out-of-the-box, with pluggable support for other data types and optional versioning.

Arctic can query millions of rows per second per client, achieves ~10x compression on network bandwidth, ~10x compression on disk, and scales to hundreds of millions of rows per second per MongoDB instance.

Arctic has been under active development at Man AHL since 2012.

Quickstart

Install Arctic

pip install git+https://github.com/manahl/arctic.git

Run a MongoDB

mongod --dbpath <path/to/db_directory>

Using VersionStore

from arctic import Arctic
import quandl

# Connect to local MongoDB
store = Arctic('localhost')

# Create the library - defaults to VersionStore
store.initialize_library('NASDAQ')

# Access the library
library = store['NASDAQ']

# Load some data - maybe from Quandl
aapl = quandl.get("WIKI/AAPL", authtoken="your token here")

# Store the data in the library
library.write('AAPL', aapl, metadata={'source': 'Quandl'})

# Reading the data
item = library.read('AAPL')
aapl = item.data
metadata = item.metadata

VersionStore supports much more: See the HowTo!

Adding your own storage engine

Plugging a custom class in as a library type is straightforward. This example shows how.
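
A minimal sketch of what such a class might look like. The hooks used here (a constructor that receives the library binding, an `initialize_library` classmethod, and `register_library_type`) are assumed to follow the pattern of Arctic's custom-library how-to; the class name, the type string, and the `store`/`load` methods are hypothetical and only for illustration.

```python
class NoteStore(object):
    """Hypothetical library type that keeps one small document per key."""

    # Hypothetical type name; the class is registered under this string.
    _LIBRARY_TYPE = 'example.NoteStore'

    def __init__(self, arctic_lib):
        # arctic_lib is the library binding handed over by Arctic; its
        # top-level collection is the library's main MongoDB collection.
        self._collection = arctic_lib.get_top_level_collection()

    @classmethod
    def initialize_library(cls, arctic_lib, **kwargs):
        # Runs once when the library is created: a good place for indexes.
        arctic_lib.get_top_level_collection().create_index('key', unique=True)

    def store(self, key, note):
        # Insert or overwrite the document for this key.
        self._collection.replace_one(
            {'key': key}, {'key': key, 'note': note}, upsert=True)

    def load(self, key):
        doc = self._collection.find_one({'key': key})
        return None if doc is None else doc['note']


def register():
    # Imported lazily so the class above can be read without Arctic installed.
    from arctic import register_library_type
    register_library_type(NoteStore._LIBRARY_TYPE, NoteStore)
```

Once `register()` has run, passing `NoteStore._LIBRARY_TYPE` as the library type to `store.initialize_library` would be expected to create a library backed by this class.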

Documentation

You can find complete documentation at Arctic docs

Concepts

Libraries

Arctic provides namespaced libraries of data. These libraries allow bucketing data by source, user or some other metric (for example frequency: End-Of-Day; Minute Bars; etc.).

Arctic supports multiple data libraries per user. A user (or namespace) maps to a MongoDB database (the granularity of mongo authentication). The library itself is composed of a number of collections within the database. Libraries look like:

  • user.EOD
  • user.ONEMINUTE

A library is mapped to a Python class. All library databases in MongoDB are prefixed with 'arctic_'
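
The naming rule above can be sketched in a few lines. This is an illustration only, not Arctic's own code: `split_library_name` is a hypothetical helper, and the handling of names without a namespace (a shared database called `arctic`) is an assumption.

```python
DB_PREFIX = 'arctic'  # prefix described in the text above


def split_library_name(library):
    """Return (database, library) for a name such as 'user.EOD'."""
    namespace, dot, name = library.partition('.')
    if not dot:
        # Assumption: un-namespaced libraries share one database named 'arctic'.
        return DB_PREFIX, library
    return DB_PREFIX + '_' + namespace, name


print(split_library_name('user.EOD'))        # ('arctic_user', 'EOD')
print(split_library_name('user.ONEMINUTE'))  # ('arctic_user', 'ONEMINUTE')
```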

Storage Engines

Arctic includes three storage engines:

  • VersionStore: a key-value versioned TimeSeries store. It supports:
    • Pandas data types (other Python types pickled)
    • Multiple versions of each data item. Can easily read previous versions.
    • Create point-in-time snapshots across symbols in a library
    • Soft quota support
    • Hooks for persisting other data types
    • Audited writes: API for saving metadata and data before and after a write.
    • a wide range of TimeSeries data frequencies: End-Of-Day to Minute bars
    • See the HowTo
    • Documentation
  • TickStore: Column-oriented tick database. Supports dynamic fields; chunks aren't versioned. Designed for large, continuously ticking data.
  • ChunkStore: A storage type that allows data to be stored in customizable chunk sizes. Chunks aren't versioned, and can be appended to and updated in place.

Arctic storage implementations are pluggable. VersionStore is the default.
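
A minimal sketch of choosing an engine when a library is created. `initialize_library(name, lib_type=...)` and indexing the store by library name both appear elsewhere in this document; the helper `create_library` and the `LIB_TYPES` mapping are hypothetical. The type strings for VersionStore and ChunkStore are assumed to be what arctic's `VERSION_STORE` and `CHUNK_STORE` constants resolve to, so prefer importing those constants in real code.

```python
# Assumed type strings; arctic also exports VERSION_STORE, TICK_STORE and
# CHUNK_STORE, which are the safer thing to pass in practice.
LIB_TYPES = {
    'versioned': 'VersionStore',
    'ticks': 'TickStoreV3',
    'chunks': 'ChunkStoreV1',
}


def create_library(store, name, kind='versioned'):
    """Create a library with the chosen engine and return a handle to it."""
    store.initialize_library(name, lib_type=LIB_TYPES[kind])
    return store[name]
```

With a connected `store = Arctic('localhost')`, a call such as `create_library(store, 'user.ticks', kind='ticks')` would be expected to return a TickStore library.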

Requirements

Arctic currently works with:

  • Python 2.7, 3.4, 3.5, 3.6
  • pymongo >= 3.6
  • Pandas
  • MongoDB >= 2.4.x

Operating Systems:

  • Linux
  • macOS
  • Windows 10

Acknowledgements

Arctic has been under active development at Man AHL since 2012.

It wouldn't be possible without the work of the AHL Data Engineering Team.

Contributions welcome!

License

Arctic is licensed under the GNU LGPL v2.1, a copy of which is included in LICENSE.

Comments
  • Cannot compile on Windows

    I installed the recommended C++ Compiler for Python 2.7, however I can't seem to run the installer.

    The fatal error I got is: C1083: Cannot open include file: 'stdint.h': No such file or directory

    Is it possible to release a compiled version of this?

    help wanted 
    opened by beartastic 36
  • can't install on mac os x

    clang -fno-strict-aliasing -fno-common -dynamic -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python/2.7.10/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/_compress.c -o build/temp.macosx-10.10-x86_64-2.7/src/_compress.o -fopenmp
    src/_compress.c:259:10: fatal error: 'omp.h' file not found
    #include "omp.h"
             ^
    1 error generated.
    error: command 'clang' failed with exit status 1
    

    I installed clang-omp..

    enhancement help wanted 
    opened by ccoossdddffdd 21
  • Usage discussion: VersionStore vs TickStore, allowed options for VersionStore.write..

    First of all - my thanks to the maintainers. This library is exactly what I was looking for and looks very promising.

    I've been having a bit of trouble figuring out how to use arctic optimally, though. I've been following the examples in /howto, which are... sparse. Is there somewhere else I might find examples or docs?

    Now, some dumb questions about VersionStore and TickStore:

    • I've noticed that every time I write to a VersionStore, an entirely new version is created. Are finer-grained options for versioning available? For instance, I would like to write streaming updates to a single version, only incrementing version when manually specified. I tried just passing version=1 to lib.write, but this doesn't seem to be supported.
    • In what scenarios might one want to use VersionStore vs TickStore? It's not clear to me what the differences are from the README or the code.
    • My current use case is primarily as a database for streams - for this use case TickStore is recommended? Is there a reason one might want to use VersionStore for this?
    • ~~Is TickStore appropriate for data which may have more than one row for each timestamp (event data)?~~ Nope, not allowed by TickStore

    Thanks in advance for your help and patience!

    opened by rueberger 19
  • how_to_use_arctic.py fails with SyntaxError: unexpected EOF while parsing on library.read()

    Attempting to run the demo as a test, and getting errors on library.read(). Also, I get an error the first time I run store.initialize_library('username.scratch'), but the second time it works. I did have to make the following change to get import Arctic to run:

    store/_version_store_utils.py, line 66: if pd.__version__.startswith("0.14"):

    See the paste of the output, including the EOF error, below. I'm not sure if it's a numpy bug or arctic, and I'm not sure where to go from here.

    Python 2.7.10 |Anaconda 2.3.0 (64-bit)| (default, Oct 19 2015, 18:04:42)
    [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    Anaconda is brought to you by Continuum Analytics.
    Please check out: http://continuum.io/thanks and https://anaconda.org
    >>> from arctic import Arctic
    >>> from datetime import datetime as dt
    >>> import pandas as pd
    >>> store = Arctic('localhost')
    >>> store.list_libraries()
    [u'NASDAQ', u'username.scratch']
    >>> store.initialize_library('username.scratch')
    No handlers could be found for logger "arctic.store.version_store"
    >>> store.initialize_library('username.scratch')
    >>> library = store['username.scratch']
    >>> df = pd.DataFrame({'prices': [1, 2, 3]},[dt(2014, 1, 1), dt(2014, 1, 2), dt(2014, 1, 3)])
    >>> library.write('SYMBOL', df)
    VersionedItem(symbol=SYMBOL,library=arctic_username.scratch,data=<type 'NoneType'>,version=19,metadata=None
    >>> library.read('SYMBOL')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/jeff/anaconda/lib/python2.7/site-packages/arctic/store/version_store.py", line 319, in read
        date_range=date_range, read_preference=read_preference, **kwargs)
      File "/home/jeff/anaconda/lib/python2.7/site-packages/arctic/store/version_store.py", line 363, in _do_read
        data = handler.read(self._arctic_lib, version, symbol, from_version=from_version, **kwargs)
      File "/home/jeff/anaconda/lib/python2.7/site-packages/arctic/store/_pandas_ndarray_store.py", line 279, in read
        item = super(PandasDataFrameStore, self).read(arctic_lib, version, symbol, **kwargs)
      File "/home/jeff/anaconda/lib/python2.7/site-packages/arctic/store/_pandas_ndarray_store.py", line 193, in read
        date_range=date_range, **kwargs)
      File "/home/jeff/anaconda/lib/python2.7/site-packages/arctic/store/_ndarray_store.py", line 180, in read
        return self._do_read(collection, version, symbol, index_range=index_range)
      File "/home/jeff/anaconda/lib/python2.7/site-packages/arctic/store/_ndarray_store.py", line 219, in _do_read
        dtype = self._dtype(version['dtype'], version.get('dtype_metadata', {}))
      File "/home/jeff/anaconda/lib/python2.7/site-packages/arctic/store/_ndarray_store.py", line 139, in _dtype
        return np.dtype(string, metadata=metadata)
      File "/home/jeff/anaconda/lib/python2.7/site-packages/numpy/core/_internal.py", line 191, in _commastring
        newitem = (dtype, eval(repeats))
      File "<string>", line 1
        (
        ^
    SyntaxError: unexpected EOF while parsing

    opened by jeffneuen 19
  • ServerSelectionTimeoutError:  even when can connect to mongo via CLI or Compass

    Arctic Version

    # arctic==1.66.0
    

    Arctic Store

    # TICK_STORE
    

    Platform and version

    Ubuntu 18.04 LTS (mongodb)

    Description of problem and/or code sample that reproduces the issue

    I am trying to access a mongodb instance on the network, I have been able to connect via MongoDB CLI and Compass via the ip:port call below. But somehow when I run it I get a server timeout?

    from arctic import Arctic
    
    # also tried full mongo uri 'mongodb://197.168.0.210:27200/'
    store = Arctic('197.168.0.210:27200')
    
    store.list_libraries()
    
    
    ServerSelectionTimeoutError: 197.168.0.210:27200: timed out
    
    

    I checked the pymongo call to connect, and attempted to use the full mongo URI but it still fails any call to the database.

    opened by derekwong9 17
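
One thing worth ruling out in a report like the one above is whether the Python process itself can open a TCP connection to the server; the CLI and Compass may have been run from a different machine or network environment (an assumption, since the report does not say). A minimal standard-library sketch, where `can_reach` is a hypothetical helper and not part of arctic or pymongo:

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

Calling `can_reach('197.168.0.210', 27200)` from the same interpreter that runs Arctic separates a network problem (result `False`) from a driver or configuration problem (result `True`, yet the timeout persists).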
  • Problem with TickStore: "arctic.date._mktz.TimezoneError: Timezone "UTC" can not be read"

    I'm having an issue trying to add data to a TickStore library.

    The data is a DataFrame and its index is a DatetimeIndex.

    (Pdb) data_new.index
    DatetimeIndex(['2014-11-21 16:56:58.534000-02:00', '2014-11-21 17:49:56.935000-02:00', '2014-11-21 18:00:01.099000-02:00', '2014-11-21 18:06:00.012000-02:00'], dtype='datetime64[ns, America/Sao_Paulo]', freq=None)

    When I try to add it to the library, I get this:

    (Pdb) library.write(i, data_new)
    arctic.date._mktz.TimezoneError: Timezone "UTC" can not be read, error: "[Errno 2] No such file or directory: '/usr/share/zoneinfo/UTC'"

    How can I deal with it?

    opened by rtadewald 17
  • Future Development? - Java API

    Hi there,

    I have been trialling this out and it looks like a great framework for storage and retrieval. Are there any plans for the Java API, or are you looking for contributors to help? I'm testing this out as data storage for a zipline/quantopian backtester and also for a JVM-based project, and was wondering what stage that was at, if any.

    opened by michaeljohnbennett 17
  • arctic import error (gcc/cpython)

    Arctic Version

    1.55.0
    

    Arctic Store

    ChunkStore
    

    Platform and version

    macOS Sierra 10.12.6 anaconda python3.6

    Description of problem and/or code sample that reproduces the issue

    I am having an error when importing arctic that I believe relates to my current gcc or cython build. I think I need to install gcc --without-multilib, but I am not entirely sure. I also think it is possible that the gcc version python is using is outdated.

    Here is the full code output and error statement:

    Suryas-MacBook-Pro:~ surya$ python
    Python 3.6.3 |Anaconda custom (64-bit)| (default, Oct 6 2017, 12:04:38)
    [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import arctic
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/surya/anaconda3/lib/python3.6/site-packages/arctic/__init__.py", line 3, in <module>
        from .arctic import Arctic, register_library_type
      File "/Users/surya/anaconda3/lib/python3.6/site-packages/arctic/arctic.py", line 12, in <module>
        from .store import version_store, bson_store, metadata_store
      File "/Users/surya/anaconda3/lib/python3.6/site-packages/arctic/store/version_store.py", line 15, in <module>
        from ._pickle_store import PickleStore
      File "/Users/surya/anaconda3/lib/python3.6/site-packages/arctic/store/_pickle_store.py", line 7, in <module>
        from .._compression import decompress, compress_array
      File "/Users/surya/anaconda3/lib/python3.6/site-packages/arctic/_compression.py", line 3, in <module>
        from . import _compress as clz4
    ImportError: dlopen(/Users/surya/anaconda3/lib/python3.6/site-packages/arctic/_compress.cpython-36m-darwin.so, 2): Symbol not found: _GOMP_parallel
      Referenced from: /Users/surya/anaconda3/lib/python3.6/site-packages/arctic/_compress.cpython-36m-darwin.so
      Expected in: flat namespace
     in /Users/surya/anaconda3/lib/python3.6/site-packages/arctic/_compress.cpython-36m-darwin.so

    Here is my output for brew list --versions:

    coreutils 8.28_1
    gcc 7.2.0
    gmp 6.1.2_1
    isl 0.18
    libmpc 1.0.3_1
    mongodb 3.4.10
    mpfr 3.1.6
    openssl 1.0.2l
    pkg-config 0.29.2

    opened by suryabahubalendruni 15
  • Randomly raise Exception: Error decompressing if append many times to a symbol in chunk store

    Arctic Version

    arctic (1.51.0)
    

    Arctic Store

    ChunkStore
    

    Platform and version

    Red Hat Enterprise Linux Server release 7.2 (Maipo)

    Description of problem and/or code sample that reproduces the issue

    I append daily data to one symbol (write if not exists and set chunk_size = 'A').

    The data looks like this:

    • columns of the DataFrame
    Index(['beta', 'btop', 'earnyild', 'growth', 'industry', 'leverage',
           'liquidty', 'momentum', 'resvol', 'sid', 'size', 'sizenl'],
          dtype='object')
    
    • head of the DataFrame (part)
                 beta   btop  earnyild  growth  industry  leverage  liquidty 
    date                                                                       
    2008-12-25  0.200 -0.386    -0.669  -0.432        23    -0.307     0.746   
    2008-12-25  0.653  0.048     0.671   0.182        10     0.255     1.097   
    2008-12-25 -1.726 -1.105    -1.042  -2.661        22    -0.732    -3.400   
    2008-12-25 -0.407  2.840     2.588  -1.505        19    -0.454    -1.137   
    2008-12-25  0.931  1.302    -0.946  -0.306        31     3.042    -0.429   
    
    • the dtypes
    beta        float64
    btop        float64
    earnyild    float64
    growth      float64
    industry      int64
    leverage    float64
    liquidty    float64
    momentum    float64
    resvol      float64
    sid           int64
    size        float64
    sizenl      float64
    

    it will randomly raise the following exception (2008-12-25 for example)

    [2017-08-29 14:17:00] [factor.value] [INFO] update 2008-12-25 barra exposures failed:Error decompressing
    [2017-08-29 14:17:00] [factor.value] [ERROR] Traceback (most recent call last):
      File "/home/quant/newalpha/warden/warden/_update_factors.py", line 88, in _update_barra_exposures
        n = update_lib(lib_factors, 'barra_exposures', exposures)
      File "/home/quant/newalpha/warden/warden/utils.py", line 70, in update_lib
        lib.append(symbol, data_to_append, metadata=meta)
      File "/opt/anaconda3/lib/python3.5/site-packages/arctic/chunkstore/chunkstore.py", line 503, in append
        self.__update(sym, item, metadata=metadata, combine_method=SER_MAP[sym[SERIALIZER]].combine, audit=audit)
      File "/opt/anaconda3/lib/python3.5/site-packages/arctic/chunkstore/chunkstore.py", line 415, in __update
        df = self.read(symbol, chunk_range=chunker.to_range(start, end), filter_data=False)
      File "/opt/anaconda3/lib/python3.5/site-packages/arctic/chunkstore/chunkstore.py", line 268, in read
        data = SER_MAP[sym[SERIALIZER]].deserialize(chunks, **kwargs)
      File "/opt/anaconda3/lib/python3.5/site-packages/arctic/serialization/numpy_arrays.py", line 195, in deserialize
        df = pd.concat([self.converter.objify(d, columns) for d in data], ignore_index=not index)
      File "/opt/anaconda3/lib/python3.5/site-packages/arctic/serialization/numpy_arrays.py", line 195, in <listcomp>
        df = pd.concat([self.converter.objify(d, columns) for d in data], ignore_index=not index)
      File "/opt/anaconda3/lib/python3.5/site-packages/arctic/serialization/numpy_arrays.py", line 126, in objify
        d = decompress(doc[DATA][doc[METADATA][LENGTHS][col][0]: doc[METADATA][LENGTHS][col][1] + 1])
      File "/opt/anaconda3/lib/python3.5/site-packages/arctic/_compression.py", line 55, in decompress
        return clz4.decompress(_str)
      File "_compress.pyx", line 121, in _compress.decompress (src/_compress.c:2151)
    Exception: Error decompressing
    
    

    It seems that the data was broken and can not be decompressed (for any date range). If I delete the document related to 2008, it can be decompressed again.

    Thx a lot!

    opened by lf-shaw 14
  • With lib_type='TickStoreV3': No field of name index - index.name and index.tzinfo not preserved - max_date returning min date (without timezone)

    Hello,

    this code

    from pandas_datareader import data as pdr
    symbol = "IBM"
    df = pdr.DataReader(symbol, "yahoo", "2010-01-01", "2015-12-29")
    df.index = df.index.tz_localize('UTC')
    
    from arctic import Arctic
    store = Arctic('localhost')
    store.initialize_library('library_name', 'TickStoreV3')
    library = store['library_name']
    library.write(symbol, df)
    

    raises

    ValueError: no field of name index
    

    I'm using TickStoreV3 as lib_type because I'm not very interested (at least for now) in audited writes, versioning...

    I noticed that

    >>> df['index']=0
    >>> library.write(symbol, df)
    1 buckets in 0.015091: approx 6626466 ticks/sec
    

    seems to fix this... but

    >>> library.read(symbol)
                               index        High   Adj Close     ...             Low       Close        Open
    1970-01-01 01:00:00+01:00      0  132.970001  116.564610     ...      130.850006  132.449997  131.179993
    1970-01-01 01:00:00+01:00      0  131.850006  115.156514     ...      130.100006  130.850006  131.679993
    1970-01-01 01:00:00+01:00      0  131.490005  114.408453     ...      129.809998  130.000000  130.679993
    1970-01-01 01:00:00+01:00      0  130.250000  114.012427     ...      128.910004  129.550003  129.869995
    1970-01-01 01:00:00+01:00      0  130.919998  115.156514     ...      129.050003  130.850006  129.070007
    ...                          ...         ...         ...     ...             ...         ...         ...
    1970-01-01 01:00:00+01:00      0  135.830002  135.500000     ...      134.020004  135.500000  135.830002
    1970-01-01 01:00:00+01:00      0  138.190002  137.929993     ...      135.649994  137.929993  135.880005
    1970-01-01 01:00:00+01:00      0  139.309998  138.539993     ...      138.110001  138.539993  138.300003
    1970-01-01 01:00:00+01:00      0  138.880005  138.250000     ...      138.110001  138.250000  138.429993
    1970-01-01 01:00:00+01:00      0  138.039993  137.610001     ...      136.539993  137.610001  137.740005
    
    [1507 rows x 7 columns]
    

    It looks as if write was looking for a DataFrame with a column named 'index'... which is quite odd.

    If I do

    df['index']=1
    library.write(symbol, df)
    

    then

    library.write(symbol, df)
    

    raises

    OverflowError: Python int too large to convert to C long
    

    Any idea ?

    opened by femtotrader 13
  • MemoryError when saving a dataframe with large strings to TickStore

    Arctic Version

    1.79.2

    Arctic Store

    TickStore

    Platform and version

    Python 3.6.7, Linux Mint 19 Cinnamon 64-bit

    Description of problem and/or code sample that reproduces the issue

    Hi, I'm trying to save the following data: https://drive.google.com/file/d/1dWWBNvx6vjyNK4kjZTVL4-YM0fmWxT5b/view?usp=sharing

    to TickStore, code: https://pastebin.com/jEqXxq2t

    and getting a MemoryError, see the stack traces: https://pastebin.com/Uy4pYAfH

    I'm quite new to arctic so I might be doing something wrong, and I would appreciate it if you could guide me with this.

    Side question: considering the nature of my data (2 columns: a timestamp and a long string/JSON document), what is the best way to store it using arctic?

    Thanks, Alan

    opened by alanbogossian 12
  • Missing last chunk in CHUNK_STORE

    Arctic Version

    1.80.5
    

    Arctic Store

    # ChunkStore
    

    Platform and version

    Python 3.8.5

    Description of problem and/or code sample that reproduces the issue

    I noticed that if I save a dataframe where the UTC date carries over to the next day, most functions (reverse_iterator, get_chunk_ranges, get_info, ...) don't return the chunk for the new date. The following example will make this clear (jupyter notebook attached in the zip file):

    Set Up

    import pandas as pd
    from arctic import Arctic, CHUNK_STORE
    store = Arctic("localhost")
    store.initialize_library("scratch_lib", lib_type=CHUNK_STORE)
    
    lib = store["scratch_lib"]
    

    Create an Index with some times that will change dates when converted to UTC

    ind = pd.Index([pd.Timestamp("20121208T16:00", tz="US/Eastern"), pd.Timestamp("20121208T18:00", tz="US/Eastern"), 
                    pd.Timestamp("20121208T20:00", tz="US/Eastern"), pd.Timestamp("20121208T22:00", tz="US/Eastern")], name="date")
    print(ind)
    

    Output:

    DatetimeIndex(['2012-12-08 16:00:00-05:00', '2012-12-08 18:00:00-05:00', '2012-12-08 20:00:00-05:00', '2012-12-08 22:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', name='date', freq=None)

    print(ind.tz_convert("UTC"))

    Output

    DatetimeIndex(['2012-12-08 21:00:00+00:00', '2012-12-08 23:00:00+00:00', '2012-12-09 01:00:00+00:00', '2012-12-09 03:00:00+00:00'], dtype='datetime64[ns, UTC]', name='date', freq=None)

    Create dataframe, write it to the library, and read it back out

    df = pd.DataFrame([1, 2, 3, 4], index=ind, columns=["col"])
    lib.write("example_df", df, chunk_size="D")
    df_read = lib.read("example_df")
    print(df_read)
    

    Output

    date                 col
    2012-12-08 21:00:00    1
    2012-12-08 23:00:00    2
    2012-12-09 01:00:00    3
    2012-12-09 03:00:00    4

    This is different from what I expected. Is this behavior expected?

    lib.get_info("example_df")

    Output

    {'chunk_count': 1, 'len': 4, 'appended_rows': 0, 'metadata': {'columns': ['date', 'col']}, 'chunker': 'date', 'chunk_size': 'D', 'serializer': 'FrameToArray'}

    >> expected chunk_count = 2, not 1

    list(lib.get_chunk_ranges("example_df"))

    Output

    [(b'2012-12-08 00:00:00', b'2012-12-08 23:59:59.999000')]

    >> expected [(b'2012-12-08 00:00:00', b'2012-12-08 23:59:59.999000'), (b'2012-12-09 00:00:00', b'2012-12-09 23:59:59.999000')]

    iterator = lib.reverse_iterator("example_df")
    while True:
        data = next(iterator, None)
        if data is None:
            break
        print(data)
    

    Output

    date                 col
    2012-12-08 21:00:00    1
    2012-12-08 23:00:00    2

    >> expected the following:

    date                 col
    2012-12-09 01:00:00    3
    2012-12-09 03:00:00    4

    date                 col
    2012-12-08 21:00:00    1
    2012-12-08 23:00:00    2

    arctic_issue_example.zip

    opened by atamkapoor 1
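
The expectation in the report above (two daily chunks once the timestamps are viewed in UTC) can be reproduced with the standard library alone. This sketch only illustrates the date arithmetic; it does not show how ChunkStore assigns chunks, and `utc_day_chunks` is a hypothetical helper. A fixed -05:00 offset stands in for US/Eastern, matching the offsets printed in the report.

```python
from collections import OrderedDict
from datetime import datetime, timedelta, timezone

# Fixed offset standing in for US/Eastern in December (-05:00).
EASTERN_STANDARD = timezone(timedelta(hours=-5))


def utc_day_chunks(timestamps):
    """Group timezone-aware datetimes by their UTC calendar date."""
    chunks = OrderedDict()
    for ts in sorted(timestamps):
        utc = ts.astimezone(timezone.utc)
        chunks.setdefault(utc.date(), []).append(utc)
    return chunks


stamps = [datetime(2012, 12, 8, hour, tzinfo=EASTERN_STANDARD)
          for hour in (16, 18, 20, 22)]
for day, members in utc_day_chunks(stamps).items():
    print(day, [m.strftime('%H:%M') for m in members])
# 2012-12-08 ['21:00', '23:00']
# 2012-12-09 ['01:00', '03:00']
```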
  • best practice usage

    Hello, thank you very much for making this open source.

    1/ Is there an optimised way to access the time series of revisions? Suppose we have saved several versions in the VersionStore:

    version1, saved at 2022-01-04:
    2022-01-01 1
    2022-01-02 2

    version2, saved at 2022-01-05:
    2022-01-01 1
    2022-01-02 3

    Then I would like to retrieve, in an efficient manner, the time series of changes in the value as of 2022-01-02, i.e.:
    2022-01-04 2
    2022-01-05 3

    2/ Is there a permission layer that lets us choose who has access to which ticker?

    opened by RockScience 0
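
For the first question above, a straightforward (not optimised) approach is to walk the versions and read each one. The sketch below assumes that VersionStore's `list_versions(symbol)` returns dicts with `version`, `date` and `deleted` keys and that `read(symbol, as_of=version)` returns an item with a `.data` attribute; `revision_history` and its `lookup` argument are hypothetical. Every version is read in full, so the cost grows with the number of versions.

```python
def revision_history(library, symbol, lookup):
    """Return [(write_time, value)] for one observation across all versions.

    `lookup` is a caller-supplied function that extracts the observation of
    interest from the stored data, e.g. lambda df: df.loc['2022-01-02'].
    """
    history = []
    for info in library.list_versions(symbol):
        if info.get('deleted'):
            # Deleted versions are assumed to be unreadable, so skip them.
            continue
        item = library.read(symbol, as_of=info['version'])
        history.append((info['date'], lookup(item.data)))
    # Sort oldest first, whatever order list_versions uses.
    history.sort(key=lambda pair: pair[0])
    return history
```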
  • Index Monotonic Sort Bug in class DateChunker

    Index Monotonic Sort Bug in class DateChunker (in file date_chunker.py)

    If the df's index is not monotonically increasing, arctic will sort the df by index, but the variable dates is still not in order.

    I suggest moving the line dates = df.index.get_level_values('date') to after the if statement.

    def to_chunks(self, df, chunk_size='D', func=None, **kwargs):
        """
        chunks the dataframe/series by dates
    
        Parameters
        ----------
        df: pandas dataframe or series
        chunk_size: str
            any valid Pandas frequency string
        func: function
            func will be applied to each `chunk` generated by the chunker.
            This function CANNOT modify the date column of the dataframe!
    
        Returns
        -------
        generator that produces tuples: (start date, end date,
                  chunk_size, dataframe/series)
        """
        if 'date' in df.index.names:
            dates = df.index.get_level_values('date')
            if not df.index.is_monotonic_increasing:
                df = df.sort_index()
            # TODO dates won't be sorted, which will cause data store error.
            
          # dates = df.index.get_level_values('date')
    

    Anyway, arctic is an excellent project!

    This is my first time leaving a comment on GitHub. Apologies for my clumsy English.

    opened by qcyfred 0
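
The ordering problem reported above can be shown without pandas. This is a plain-Python analogue under stated assumptions, not the DateChunker code: rows are `(date, value)` tuples and both function names are hypothetical. Capturing the dates before sorting leaves them out of step with the sorted rows; capturing them afterwards keeps the two aligned.

```python
def dates_then_sort(rows):
    """Mirror of the reported order: capture dates, then sort the rows."""
    dates = [date for date, _ in rows]
    rows = sorted(rows)
    return dates, rows


def sort_then_dates(rows):
    """Suggested order: sort first, then capture dates from the sorted rows."""
    rows = sorted(rows)
    dates = [date for date, _ in rows]
    return dates, rows


rows = [('2020-01-03', 3), ('2020-01-01', 1), ('2020-01-02', 2)]

stale_dates, sorted_rows = dates_then_sort(rows)
print(stale_dates == [d for d, _ in sorted_rows])   # False: dates are stale

fresh_dates, sorted_rows = sort_then_dates(rows)
print(fresh_dates == [d for d, _ in sorted_rows])   # True: dates match rows
```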
  • MongoDB 4.2 EOL April 2023 - What's Next?

    I recently received an email from MongoDB reminding me that my 4.2 cluster will be reaching EOL in April 2023. I am sure I am not the only one who received this...

    @jamesblackburn - What's the plan here? Will we need to request an extension from Mongo to continue running 4.2? Is there any roadmap for when, if at all, you will support 4.4 or 5.0? Per #938, a 4.4 or 5.0 version doesn't seem to be close?

    Will we need to plan to move to the S3 version? What's going on here?

    Some clarity here would help... as the date is approaching real fast.

    Btw - Thanks for the v1.80.5

    opened by luongjames8 6
  • mongodump and mongorestore library - Blob (not pure dataframe)

    Arctic Version

    # 1.80.0
    

    Arctic Store

    # VersionStore
    

    Platform and version

    Spyder (Python 3.8)

    Description of problem and/or code sample that reproduces the issue

    Hi, I use mongodump and mongorestore to move libraries between PCs (let me know if there are easier ways). Each library (in this case, my library is called "attribution_europe_data") has 5 collections from MongoDB's point of view: attribution_europe_data / ....ARCTIC / ....snapshots / ...version_nums / ...versions. During the mongodump process, it dumps 2 files for each collection, so a total of 10 files for each library.

    I successfully managed to mongorestore those 10 files onto a separate PC, i.e. I can do things like print(Arctic('localhost')['attribution_data_europe'].list_symbols()).

    Now, each symbol in my library represents a pandas dataframe (actually they are saved as Blob, since they contain Objects) of around 5000 rows x 2000 columns. The issue is that if I read it on the new PC, e.g. "Arctic('localhost')['attribution_europe_data'].read('20220913').data" in Spyder, it freezes, and eventually shows "restarting kernel....".

    It shouldn't be a memory issue reading that dataframe, as I generated a similar-size dataframe randomly on the same PC and it was OK.

    As a test, I used the same mongodump and mongorestore method on a smaller, simpler library consisting of a single symbol holding the dictionary {'hi':1}. The new PC (where I restored it) is able to read this library and this symbol without any issue. Similarly, when I use the same method on a pure dataframe, as opposed to a Blob, it works as well!

    So do you think the mongodump and mongorestore process corrupts the Blob object?

    Also, what do you normally use to transfer arctic libraries from one PC to another? Surely there is a simpler way than mongodump and mongorestore?

    Update after more investigation:

    1. if the symbol is a dataframe (that is NOT saved as a blob), it works
    2. if the symbol is a dict say {'hi':1}, it works
    3. if the symbol is a blob, it DOES NOT work (ie it will have trouble reading that symbol from the restored library in the new PC)
    4. if the symbol is a dict wrapped around a pure dataframe, eg {'hi' : pd.DataFrame(np.random.rand(2,2))}, then it works
    5. if the symbol is a dict wrapper around a blob, DOES NOT WORK, eg {'hi': some_blob}.

    I have included what it looks like on the old PC, and the error it throws on the new PC when the symbol is a dict wrapped around a blob.

    (old PC) image

    (new PC) image

    opened by fengster123 0
Releases(v1.80.5)