Apache IoTDB time series database: unit vs. multivariate time series write and query performance comparison - Tian Yuan

2022-06-21 04:47:00 Apache IoTDB

This article was contributed by community member Tian Yuan.

With the spread of the Internet of Things and the continued development of industrial technology, the demand for efficient management of massive time series data is becoming more widespread, and data volumes keep growing. There are two main types of time series: unit (univariate) time series and multivariate time series. A unit time series has a single time-dependent variable and contains only a column of timestamps and a column of values. A multivariate time series has multiple time-dependent variables: it contains multiple unit time series as components, and all components are sampled at the same time points. The data can therefore be expressed as a matrix in which each row is a time point and each column is a unit time series.
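The two shapes described above can be sketched in a few lines of Python. This is only an illustration of the data model; the class names `UnitSeries` and `MultivariateSeries` and the sample values are assumptions made for this example and are not part of Apache IoTDB.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UnitSeries:
    """A unit (univariate) series: one timestamp list and one value list."""
    timestamps: List[int]
    values: List[float]


@dataclass
class MultivariateSeries:
    """Components share one timestamp list; rows[i][j] is component j at timestamps[i]."""
    timestamps: List[int]
    component_names: List[str]
    rows: List[List[float]]

    def component(self, name: str) -> UnitSeries:
        """Return one column of the matrix as a unit series."""
        j = self.component_names.index(name)
        return UnitSeries(list(self.timestamps), [row[j] for row in self.rows])


# Hypothetical sample data: three time points, two components.
gps = MultivariateSeries(
    timestamps=[1, 2, 3],
    component_names=["latitude", "longitude"],
    rows=[[39.9, 116.4], [39.8, 116.3], [39.7, 116.2]],
)
longitude = gps.component("longitude")
```

Each row of `rows` is one time point and each column is one component, so extracting a column yields a unit series that reuses the shared timestamps.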

f6c840cf26eca695acf35153be25d9b7.png

1. Time series database classification

Most mainstream time series database storage engines support only one time series model (unit or multivariate), so we can classify time series databases by whether they use a unit or a multivariate time series storage engine.

Unit time series storage engine

This kind of storage engine stores each time series independently. When written to disk, a series corresponds to two columns of data, a timestamp column and a value column, whose entries correspond one to one. This storage engine suits scenarios where each sensor collects data independently, so that the data from each sensor has its own timestamps.

Time series databases built on existing key-value databases, such as KairosDB and OpenTSDB, basically fall into this category. The storage engines of some native time series databases, such as InfluxDB and Prometheus, also fall into this category.

In Apache IoTDB 0.12 and earlier, the storage engine and file format support only unit time series and cannot efficiently store or query multivariate time series.

Multivariate time series storage engine

This kind of storage engine stores one timestamp column shared by multiple time series, and each time series additionally stores its own value column, so one timestamp column corresponds to multiple value columns. This storage engine suits scenarios where multiple sensors collect data at the same time. For example, in real production environments data is often collected at device granularity, so the values of the multiple sensors under one device share the same timestamp.
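To make the difference between the two layouts concrete, here is a minimal Python sketch that arranges the same device-level samples both ways. The function names, the dictionary layout, and the sensor names are hypothetical and chosen for illustration; they do not reflect any database's actual on-disk format.

```python
from typing import Dict, List, Tuple

Sample = Tuple[int, Dict[str, float]]  # (timestamp, {sensor name: value})


def unit_layout(samples: List[Sample]) -> Dict[str, Dict[str, list]]:
    """Non-shared layout: every sensor keeps its own timestamp column."""
    layout: Dict[str, Dict[str, list]] = {}
    for ts, readings in samples:
        for sensor, value in readings.items():
            columns = layout.setdefault(sensor, {"time": [], "value": []})
            columns["time"].append(ts)
            columns["value"].append(value)
    return layout


def shared_layout(samples: List[Sample], sensors: List[str]) -> dict:
    """Shared layout: one timestamp column plus one value column per sensor."""
    layout = {"time": [], "values": {s: [] for s in sensors}}
    for ts, readings in samples:
        layout["time"].append(ts)
        for s in sensors:
            layout["values"][s].append(readings.get(s))  # None marks a missing value
    return layout


# Hypothetical device with three sensors sampled at two time points.
samples = [
    (1, {"a": 1.0, "b": 2.0, "c": 3.0}),
    (2, {"a": 1.5, "b": 2.5, "c": 3.5}),
]
unit = unit_layout(samples)
shared = shared_layout(samples, ["a", "b", "c"])
```

With three sensors, the non-shared layout holds six columns (three time columns and three value columns), while the shared layout holds four (one time column and three value columns).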

Time series databases built on existing relational databases basically fall into this category: all series under a device are modeled as one table with a single time column. A typical example is TimescaleDB. A few native time series databases, such as TDengine, also build their storage engines on the multivariate time series model.

Recently, Heracles, a multivariate time series storage engine developed on top of Prometheus, has also appeared. It is only a prototype system and has not been merged into the Prometheus main branch. Besides reducing redundant timestamp storage and improving query efficiency, the Heracles paper also points out that Prometheus locks each time series one by one when writing, and that this locking sits on the critical code path of the write process. When a multivariate time series has many components, the impact of this locking overhead on write performance cannot be ignored.

90f086cf6d8515a535616021b9ba1c2f.png

2. Apache IoTDB dual storage engine

Starting with version 0.13, Apache IoTDB introduces a dual storage engine design for time series databases, with two efficient built-in storage engines: a non-shared-timestamp storage engine for unit time series and a shared-timestamp storage engine for multivariate time series.

Dual storage engine definition

In the overall architecture of a database management system, the storage engine sits between the query engine and the storage medium. Upward, it interfaces with the query engine and provides it with a standardized data access format. Downward, it interfaces with the storage medium: it organizes data according to the file format and, at the granularity of data pages or other units, reads and writes data through the interfaces the storage medium provides.

Unit and multivariate time series scenarios place different requirements on the storage engine of a time series database, so Apache IoTDB supports two storage engines to meet the needs of each. The figure below shows the overall architecture of the Apache IoTDB dual storage engines. The main difference between the two engines is whether the series under a device share a time column: the original non-shared-timestamp storage engine suits unit series, and the newly added shared-timestamp storage engine is optimized for multivariate series. Because of this difference, the two engines differ significantly in the result set format exchanged with the query engine, in the memory table, in the sorting stage during persistence, and in the encoding used during persistence. Thanks to good abstraction, however, they share the metadata manager and the cache manager. Even the underlying file format uses a hybrid structure for the two engines, so a single TsFile can store both unit time series and multivariate time series.

5d2cde1c2b5d9ff44a4ffcaa6f50334f.png

Dual storage engine data model design

Merging the two storage engines into one database first raises the questions of how to stay compatible with the original data model and how to let users specify which storage engine to use. We therefore extended the metadata model so that users can specify through the API that certain time series share timestamps as a multivariate time series, while ensuring that multivariate time series remain compatible with the semantics of the original data model.

The storage engine could be specified at storage group granularity, but then all series under a storage group would have to be either multivariate or unit time series, which limits users' flexibility. Since the components of a multivariate time series all belong to one device, either all series under a device share one timestamp column or none of them do. We therefore set the granularity of the storage engine at the device level, as shown in the figure below. This allows multivariate and unit time series to coexist in the same storage group. The device node in the metadata tree uses a Boolean variable to record whether the series under the device share a timestamp column, that is, whether they form a multivariate time series.

8bb682c606948f6eaf957e87399b634c.png
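The device-level flag can be pictured with a small Python sketch of a metadata tree. The classes `StorageGroup` and `DeviceNode`, the `aligned` field, and the paths `root.sg.d1` and `root.sg.d2` are hypothetical names used only for this illustration; they are not the actual IoTDB metadata classes.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DeviceNode:
    path: str
    aligned: bool = False  # True: all series under this device share one timestamp column
    measurements: Dict[str, str] = field(default_factory=dict)  # name -> data type


@dataclass
class StorageGroup:
    name: str
    devices: Dict[str, DeviceNode] = field(default_factory=dict)

    def add_device(self, path: str, aligned: bool, **measurements: str) -> DeviceNode:
        """Register a device; the storage engine choice is recorded on the device node."""
        node = DeviceNode(path, aligned, dict(measurements))
        self.devices[path] = node
        return node


# One storage group holding both kinds of device (hypothetical paths).
sg = StorageGroup("root.sg")
sg.add_device("root.sg.d1", aligned=True, latitude="FLOAT", longitude="FLOAT")
sg.add_device("root.sg.d2", aligned=False, temperature="FLOAT")
```

Because the flag lives on the device node rather than on the storage group, the multivariate device `root.sg.d1` and the unit-series device `root.sg.d2` can coexist under the same storage group.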

For multivariate series, we added the ALIGNED keyword, which marks the time series under a device as a multivariate time series sharing one timestamp column. As shown in the figures below, we use explicit creation and automatic creation, respectively, to create a multivariate series with latitude and longitude components for the device root.ln.wf01.GPS. Both statements create a device node in the metadata tree whose shared-timestamp attribute is true.

1648b1a42aade59033b205797d7bbd51.png

dd1bf19f8b657c659a4c4478267e9c0f.png

3. Performance comparison

Write performance vs. disk footprint

To measure how much the shared-timestamp storage engine improves write persistence performance and saves disk space for multivariate time series with different numbers of components, we tested multivariate time series with 1, 10, 30, and 100 components. All components are of type long, each value equals its timestamp, adjacent timestamps are 1 ms apart, and timestamps start at 1646134492000. In this set of experiments, 10,000,000 points are written to each component, and every component has a value at every timestamp, that is, the null ratio of all multivariate time series is 0%.

As shown in the figure below, when the number of components exceeds 1, persisting multivariate time series is on average 1.6 times faster than persisting unit time series.

4e1c7023bbce0242f454bcccd03bd6f7.png

In terms of disk usage, as shown in the figure below, when a multivariate time series has only one component, the unit storage layout takes about 1% less disk space than the multivariate layout, because the multivariate layout stores more statistics at various granularities than the unit layout, as well as null-value information for the value columns. When the number of components exceeds 1, for example 10, 30, and 100, the multivariate layout stores only one timestamp column and therefore stores 9, 29, and 99 fewer timestamp columns, respectively, than the unit layout. Because all value columns and time columns in the experiment hold the same values and use the same encoding, multivariate time series take about 50% less disk space than unit time series.

8ed64ac8f43b97a839a5df6a880247ad.png
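The "about 50%" figure follows from simple column counting. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes every column (time or value) takes the same encoded size, which matches the experiment's setup of identical values and encoding, and it ignores statistics and null-value information. The function name `timestamp_savings` is made up for this example.

```python
def timestamp_savings(num_components: int) -> float:
    """Fraction of columns saved by sharing one timestamp column.

    Assumes every column (time or value) has the same encoded size.
    """
    if num_components < 1:
        raise ValueError("num_components must be at least 1")
    unit_columns = 2 * num_components    # one time column + one value column per series
    shared_columns = 1 + num_components  # one shared time column + one value column per series
    return (unit_columns - shared_columns) / unit_columns


for n in (1, 10, 30, 100):
    print(f"{n:>3} components: {timestamp_savings(n):.1%} fewer columns")
```

Under these assumptions the saving is (n - 1) / (2n): 0% for 1 component, 45.0% for 10, 48.3% for 30, and 49.5% for 100, which approaches 50% as the number of components grows.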

Query performance comparison

Time series databases support a rich variety of queries, but they fall into two general kinds. The first is the raw data query, which returns the original points written to the series; depending on whether the where clause contains a value filter, it is subdivided into raw data queries without value filtering and raw data queries with value filtering. The second is the downsampling query, which returns the result of applying some computation to the raw data within a period of time. We fix the number of components of the multivariate time series at 30 and compare the query performance of multivariate and unit time series in these three query scenarios.

Raw data query without value filtering

The duration of a raw data query without value filtering depends on the number of series queried: the more series there are, the more data is read from disk, and for unit time series the timestamps of the multiple series must also be aligned. Each query reads all 30 components, with SQL similar to “select * from root.**”. As the number of queried components increases, the query performance advantage of multivariate time series grows, because compared with unit time series they read fewer time columns from disk and need less timestamp alignment across value columns. As shown in the figure below, for full-component raw data queries without value filtering, multivariate time series are on average 62.2% faster than unit time series.

22f2d9e5796a25fc71e90fd1a93fcd95.png
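The alignment step that unit time series require can be sketched as a k-way merge on timestamps. This is an illustrative Python version using the standard library's `heapq.merge`, not IoTDB's actual reader; the function name `align_unit_series` and the sample series are assumptions for this example.

```python
import heapq
from typing import Dict, List, Optional, Tuple


def align_unit_series(
    series: Dict[str, Tuple[List[int], List[float]]],
) -> Tuple[List[str], List[Tuple[int, List[Optional[float]]]]]:
    """Merge independently stored series into rows keyed by timestamp.

    Each series is (timestamps, values), with timestamps sorted ascending and unique.
    """
    names = list(series)
    # One sorted stream of (timestamp, column index, value) per series.
    streams = [
        [(t, j, v) for t, v in zip(*series[name])]
        for j, name in enumerate(names)
    ]
    rows: List[Tuple[int, List[Optional[float]]]] = []
    for t, j, v in heapq.merge(*streams):
        if not rows or rows[-1][0] != t:
            rows.append((t, [None] * len(names)))  # start a new row for a new timestamp
        rows[-1][1][j] = v
    return names, rows


names, rows = align_unit_series({
    "s1": ([1, 2, 3], [10.0, 20.0, 30.0]),
    "s2": ([2, 3, 4], [200.0, 300.0, 400.0]),
})
```

With a shared timestamp column this merge is unnecessary, since row i of every value column already belongs to timestamp i; that is the alignment work a multivariate layout avoids.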

Raw data query with value filtering

The efficiency of a raw data query with value filtering depends on the query's selectivity, that is, the percentage of the total data volume that satisfies the query's filter condition. We ran experiments at selectivities of 90%, 50%, and 10% on data sets with 30 components and null ratios of 0%, 10%, and 50%.
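As a small illustration of selectivity, the Python sketch below computes the fraction of values that pass a filter and picks a threshold that yields a target selectivity. The filter form `value >= threshold` is an assumption made for this example (the article does not state the exact predicate used), as are the function names; the data follows the experiment's description of consecutive integer values, scaled down to 1,000 points.

```python
from typing import Callable, Iterable


def selectivity(values: Iterable[int], predicate: Callable[[int], bool]) -> float:
    """Fraction of values that satisfy the filter predicate."""
    values = list(values)
    return sum(1 for v in values if predicate(v)) / len(values)


def threshold_for_selectivity(start: int, count: int, target: float) -> int:
    """Threshold T such that `value >= T` keeps about `target` of the values.

    Assumes the values are start, start + 1, ..., start + count - 1.
    """
    return start + count - round(count * target)


start, count = 1646134492000, 1000  # scaled-down version of the test data
values = range(start, start + count)
thresholds = {s: threshold_for_selectivity(start, count, s) for s in (0.9, 0.5, 0.1)}
measured = {s: selectivity(values, lambda v, t=t: v >= t) for s, t in thresholds.items()}
```

A lower selectivity means fewer rows survive the filter, so less of the read cost is spent on assembling result rows.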

Each query involves all 30 components. As shown in the figure below, at 90% selectivity multivariate time series are on average 34.8% faster than unit time series; at 50% selectivity they are on average 30.1% faster; and at 10% selectivity they are on average 4% faster. With 30 components queried, across all combinations of selectivity and null ratio, the average query performance of multivariate time series is 1.23 times that of unit time series. As with queries over 15 components, if we look only at the results for selectivities of 90% and 50% and null ratios of 0% and 10%, the average query performance improvement of multivariate time series for full-component queries can reach 40%.

7682ac9baa23ccf48a6eb01c9c49a987.png

Downsampling query

A downsampling query returns data at a lower frequency than the frequency at which it was collected, and is a special case of an aggregate query. For example, if data is collected once per second but should be displayed per minute, a downsampling query is needed. In IoTDB, the GROUP BY clause aggregates over time intervals; the result set is divided according to the time interval and a customizable sliding step (which defaults to the time interval), and results are returned in ascending time order by default. In this experiment the aggregation window is 5000 ms, the aggregation function is count, and the SQL is similar to “select count(*) from root.** group by([1646134492000, 1646144492001), 5000ms)”.
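The windowed count described above can be sketched in plain Python. This is an illustration of the aggregation semantics only, assuming left-closed, right-open windows and a sliding step equal to the interval; it is not IoTDB's implementation, and the function name `count_by_window` is made up for this example. The data is scaled down to 20,000 points to keep the example quick.

```python
from typing import Iterable, List, Tuple


def count_by_window(
    timestamps: Iterable[int], start: int, end: int, interval: int
) -> List[Tuple[int, int]]:
    """Count points per window [start + k*interval, start + (k+1)*interval) within [start, end)."""
    num_windows = -(-(end - start) // interval)  # ceiling division
    counts = [0] * num_windows
    for t in timestamps:
        if start <= t < end:
            counts[(t - start) // interval] += 1
    return [(start + k * interval, c) for k, c in enumerate(counts)]


start = 1646134492000
points = range(start, start + 20_000)  # one point per millisecond
windows = count_by_window(points, start, start + 20_001, 5000)
```

Here the first four windows each contain 5,000 points, and the final window covers only the last millisecond of the range, which holds no data in this sketch. By the same arithmetic, the article's range of 10,000,001 ms splits into 2,001 windows of 5000 ms.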

As shown in the figure below, when the query involves all 30 components, multivariate time series are about 15% faster than unit time series; when it involves 15 components, they are about 10.9% faster; and when only one component is queried, they are about 6% slower.

fb3d2866851d130a3f31810f9502208e.png

4. Summary

The experiments above show that each of the two storage engines in Apache IoTDB has its own application scenarios:

1. In single-component scenarios, model the series as a unit time series: write persistence with the non-shared-timestamp storage engine is faster than with the shared-timestamp storage engine, disk usage is lower, and query performance is slightly better.

2. When the number of components is greater than 1 and the null ratio is low, model the series as a multivariate time series: write persistence with the shared-timestamp storage engine is on average 1.6 times faster than with the non-shared-timestamp storage engine, and disk usage is reduced by nearly half.

3. When series are modeled as a multivariate time series and written with the shared-timestamp storage engine, multivariate time series outperform unit time series in all query scenarios as long as the query involves more than one component. Even when only one component is queried, multivariate time series are only slightly slower than unit time series, by about 10% on average.

About us

Apache IoTDB: the best solution for massive time series data management, a high-throughput, high-compression, high-availability open source time series database built natively for the Internet of Things. Its time series storage scheme, IoT data model, and low-traffic data transmission scheme were developed from scratch, so that data sampled at nanosecond granularity can be written with ease, TB-level data can be queried in milliseconds, and data can be losslessly compressed dozens of times over. The core technology originated at Tsinghua University and is independently controllable. It is currently in wide use at organizations including State Grid, the National Meteorological Administration, AVIC Chengfei, CNNC, Changan Automobile, and Goldwind Technology.

As a global open source project, Apache IoTDB currently has 193 contributors, 2K stars, and 634 forks. We have a contribution guide and welcome more people to help the Apache IoTDB project continue to grow.

You are welcome to take your first step into the Apache IoTDB community!

WeChat group: add qinchuqing as a friend

Original site

Copyright notice
This article was created by [Apache IoTDB]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/172/202206210444245140.html