
Techniques for visualizing large time series.

2022-07-28 11:44:00 【The way of Python data】

Source: Kaggle Competition Book


Author: Devo

Midimax compression algorithm

Introduction

In many time series problems, such as financial time series, we often need to visualize the data in order to understand it. Financial data sets are usually very large, so plotting them consumes a lot of RAM, disk, and other compute and storage resources. This article introduces a compression algorithm called "Midimax", which makes time series plots easier to draw by reducing the number of data points. The algorithm was designed with the following goals:

Do not introduce non-actual data. Only a subset of the original data is returned, so there is no averaging, median interpolation, regression, or statistical aggregation;
Be fast and computationally cheap;
Maximize information gain. This means capturing as much of the variation in the original data as possible;
Taking only the minimum and maximum points may give a misleading, exaggerated impression of the variance, so the median point is also taken to retain information about signal stability.
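
To make the first and last goals concrete, here is a minimal sketch in plain Python. The sample values and variable names are made up for illustration. It contrasts a mean-based downsample, which produces values that never occurred in the data, with picking the minimum, median, and maximum of a window, which only returns observed values.

```python
from statistics import mean

# Hypothetical window of six observed values (illustrative only).
window = [10.0, 10.2, 9.9, 15.0, 10.1, 10.0]

# Mean aggregation: two averaged points, neither of which is in the data.
mean_points = [mean(window[:3]), mean(window[3:])]

# Min / median / max selection: every point is an observed value.
ordered = sorted(window)
picked = [ordered[0], ordered[len(ordered) // 2], ordered[-1]]

print("mean-based:", mean_points)
print("min/median/max:", picked)
print("all picked values observed:", all(v in window for v in picked))
```

The picked median (10.1) shows where the signal usually sits, while the maximum (15.0) keeps the spike; the mean-based points blur both.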

Midimax compression algorithm

Algorithm pseudocode

Pass the time series data and a compression factor (a floating-point number) to the algorithm.
Split the time series into non-overlapping windows of equal size, where window size = floor(3 × compression factor). The 3 corresponds to the minimum, median, and maximum points taken from each window. So, to achieve a compression factor of 2, the window size must be 6. A larger compression factor requires a wider window.
Sort the values in each window in ascending order.
Pick the first and last values as the minimum and maximum points. This ensures that the selected points capture the full spread of values in the window.
Pick the middle value as the median, where the middle position is defined as floor(window size / 2). Therefore, no interpolation is performed even when the window size is even.
Re-sort the selected points by their original index (timestamp).
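
The steps above can be sketched in plain Python on a list of values. This is an illustrative sketch, not the author's implementation: the function name `midimax_sketch` and the sample data are hypothetical, and it returns positions rather than timestamps. The `factor < 1.4` cutoff mirrors the threshold used in the code posted in this article.

```python
def midimax_sketch(values, factor=2.0):
    """Return sorted positions of the min, median, and max of each window."""
    if factor < 1.4:
        # Factor too low to compress: keep every point
        return list(range(len(values)))
    win_size = int(3 * factor)  # window size from the compression factor
    keep = []
    for start in range(0, len(values), win_size):
        stop = min(start + win_size, len(values))
        # Positions in this window, ordered by ascending value
        idx = sorted(range(start, stop), key=lambda i: values[i])
        if values[idx[0]] == values[idx[-1]]:
            keep.append(idx[0])  # flat window: one point is enough
        else:
            # min, middle position (no interpolation), max; the set drops repeats
            keep.extend({idx[0], idx[len(idx) // 2], idx[-1]})
    return sorted(keep)  # restore the original (time) order


data = [5, 1, 4, 9, 3, 7, 2, 2, 2, 2, 2, 2]
positions = midimax_sketch(data, factor=2)
print(positions)                     # [0, 1, 3, 6]
print([data[i] for i in positions])  # [5, 1, 9, 2]
```

In the first window the sorted values are 1, 3, 4, 5, 7, 9, so the minimum is 1, the value at middle position 3 is 5, and the maximum is 9. The second window is flat, so a single point represents it.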

Example

The blue line is the original series;
The green dots are the points selected by the Midimax algorithm.

Code

'''
Code from: https://medium.com/towards-data-science/midimax-data-compression-for-large-time-series-data-daf744c89310
'''
import pandas as pd
def compress_series(inputser: pd.Series, compfactor=2):
    """
    Split into segments and pick 3 points from each segment, the minimum,
    median, and maximum. Segment length = int(compfactor x 3). So, to achieve a
    compression factor of 2, a segment length of 6 is needed.
    Parameters
    ----------
    inputser : pd.Series
        Input data to be compressed.
    compfactor : float
        Compression factor. The default is 2.
    Returns
    -------
    pd.Series
        Compressed output series.
    """
    # If comp factor is too low, return original data
    if (compfactor < 1.4):
        return inputser

    win_size = int(3 * compfactor)  # window size

    # Create a column of segment numbers
    ser = inputser.rename('value')
    ser = ser.round(3)
    wdf = ser.to_frame()
    del ser
    start_idxs = wdf.index[range(0, len(wdf), win_size)]
    wdf['win_start'] = 0
    wdf.loc[start_idxs, 'win_start'] = 1
    wdf['win_num'] = wdf['win_start'].cumsum()
    wdf.drop(columns='win_start', inplace=True)
    del win_size, start_idxs

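    # For each window (rows already sorted by value), get the indices of
    # min, median, and max
    def get_midimax_idxs(gdf):
        if len(gdf) == 1:
            # Single point: keep it
            return [gdf.index[0]]
        elif gdf['value'].iloc[0] == gdf['value'].iloc[-1]:
            # Flat window (min equals max): one point is enough
            return [gdf.index[0]]
        elif len(gdf) == 2:
            # Two distinct values: keep both
            return [gdf.index[0], gdf.index[1]]
        else:
            # First (min), middle position (median, no interpolation), last (max)
            return [gdf.index[0], gdf.index[len(gdf) // 2], gdf.index[-1]]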

    wdf = wdf.dropna()
    wdf_sorted = wdf.sort_values(['win_num', 'value'])
    midimax_idxs = wdf_sorted.groupby('win_num').apply(get_midimax_idxs)

    # Convert into a list
    midimax_idxs = [idx for sublist in midimax_idxs for idx in sublist]
    midimax_idxs.sort()
    return inputser.loc[midimax_idxs]
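
A quick way to sanity-check the output of a function like the one above is to compare index labels before and after compression. The helper below is a hypothetical sketch (the name `check_compression` and the sample labels are not from the original article); it works on any two sequences of sortable, hashable labels.

```python
def check_compression(original_index, compressed_index):
    """Report basic facts about a compressed series, given only index labels."""
    original = list(original_index)
    compressed = list(compressed_index)
    known = set(original)
    return {
        # Every retained label must exist in the input (no invented points)
        "is_subset": all(label in known for label in compressed),
        # Labels should come back in ascending (time) order
        "is_sorted": compressed == sorted(compressed),
        # Achieved ratio; it may differ from the requested factor
        "ratio": len(original) / len(compressed) if compressed else float("inf"),
    }


report = check_compression(range(12), [0, 1, 3, 6, 8, 11])
print(report)  # {'is_subset': True, 'is_sorted': True, 'ratio': 2.0}
```

With pandas, something like `check_compression(series.index, compress_series(series, 2).index)` should give the same kind of report, assuming the series has unique, sortable index labels.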

Summary

Midimax is a simple, lightweight algorithm that reduces the size of the data so that large time series can be plotted quickly. We found that:

Midimax preserves the trend of the original series when plotting large time series; it captures the variation in the original data with fewer points and processes a large amount of data in a few seconds.
Midimax loses some detail; the larger the compression factor, the more information may be lost.
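
To get a feel for this trade-off, the sketch below estimates how many points can survive for a given compression factor. It assumes at most 3 points are kept per window of `int(3 * factor)` values, as in the code above. The function name and the series length of one million are hypothetical, and the result is an upper bound because flat windows keep only one point.

```python
def retained_points_upper_bound(n_points, factor):
    """Upper bound on points kept: at most 3 per window of int(3 * factor)."""
    if factor < 1.4:
        return n_points  # same cutoff as the posted code: no compression
    win_size = int(3 * factor)
    full, rest = divmod(n_points, win_size)
    # A trailing partial window contributes at most min(rest, 3) points
    return 3 * full + min(rest, 3)


for factor in (2, 5, 20, 100):
    kept = retained_points_upper_bound(1_000_000, factor)
    print(f"factor={factor:>3}  window={int(3 * factor):>3}  kept<={kept}")
```

Each retained point has to stand in for a wider window as the factor grows, which is why short-lived features inside a window are the first details to disappear.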

References

1. https://github.com/edwinsutrisno/midimax_compression

2. Midimax Compression for Large Time-Series Data
