Apache IoTDB time series database: write and query performance of univariate vs. multivariate time series - Tian Yuan
2022-06-21 04:47:00 【Apache IoTDB】
This article was contributed by community member Tian Yuan.
With the spread of the Internet of Things and the continued development of industrial technology, the need to manage massive time series efficiently is increasingly common, and data volumes keep growing. There are two main types of time series: univariate and multivariate. A univariate time series has a single time-dependent variable, so it consists of just a list of timestamps and a list of values. A multivariate time series has multiple time-dependent variables: it contains several univariate time series as components, all sampled at the same points in time. The data can therefore be expressed as a matrix in which each row is a point in time and each column is a univariate time series.

1. Classification of time series databases
Most mainstream time series database storage engines support only one time series model (univariate or multivariate), so time series databases can be classified by whether they use a univariate or a multivariate time series storage engine.
Univariate time series storage engine
This kind of storage engine stores each time series independently. When written to disk, a series corresponds to two columns of data, a timestamp column and a value column, whose entries correspond one to one. It suits scenarios where each sensor collects data independently, so that the data from each sensor has its own timestamps.
Time series databases built on existing key-value databases basically belong to this category, such as KairosDB and OpenTSDB. The storage engines of some native time series databases also fall into this category, such as InfluxDB and Prometheus.
In Apache IoTDB 0.12 and earlier, the storage engine and file format support only univariate series and cannot store or query multivariate time series efficiently.
Multivariate time series storage engine
This kind of storage engine lets multiple time series share one stored timestamp column, while each time series stores its own value column, so that one timestamp column corresponds to multiple value columns. It suits scenarios where multiple sensors collect data at the same time. In real production environments, for example, data is often collected at device granularity, and the values of all sensors under a device carry the same timestamp.
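The difference between the two layouts can be sketched in a few lines of Python. This is only an in-memory illustration of the column organization described above, not IoTDB's actual implementation; the class names `UnivariateSeries` and `AlignedDevice` and the sensor names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class UnivariateSeries:
    """Univariate layout: every series owns a timestamp column and a value column."""
    timestamps: List[int] = field(default_factory=list)
    values: List[Optional[int]] = field(default_factory=list)

    def append(self, ts: int, value: int) -> None:
        self.timestamps.append(ts)
        self.values.append(value)


class AlignedDevice:
    """Multivariate layout: one shared timestamp column, one value column per sensor."""

    def __init__(self, sensors: List[str]) -> None:
        self.sensors = list(sensors)
        self.timestamps: List[int] = []
        self.columns: Dict[str, List[Optional[int]]] = {s: [] for s in self.sensors}

    def append(self, ts: int, row: Dict[str, int]) -> None:
        self.timestamps.append(ts)
        for s in self.sensors:
            # A sensor without a value at this timestamp is stored as None (null).
            self.columns[s].append(row.get(s))


# Write the same 4 rows for 3 hypothetical sensors into both layouts.
sensors = ["s1", "s2", "s3"]
independent = {s: UnivariateSeries() for s in sensors}
aligned = AlignedDevice(sensors)
for ts in range(4):
    row = {s: ts * 10 for s in sensors}
    for s in sensors:
        independent[s].append(ts, row[s])
    aligned.append(ts, row)

# Number of timestamps each layout has to store.
univariate_ts_stored = sum(len(series.timestamps) for series in independent.values())
aligned_ts_stored = len(aligned.timestamps)
```

With three sensors the univariate layout stores every timestamp three times, while the aligned layout stores it once and pays instead for explicit nulls when a sensor has no value at some timestamp.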
Time series databases built on existing relational databases basically belong to this category: all series under a device are modeled as one table with a single time column. TimescaleDB is a typical example. A few native time series databases also build their storage engines on the multivariate time series model, such as TDengine.
Recently, Heracles, a multivariate time series storage engine developed on top of Prometheus, has also appeared. It is only a prototype and has not been merged into the Prometheus main branch. Besides reducing redundant timestamp storage and improving query efficiency, the Heracles paper points out that Prometheus locks each time series one by one when writing, and that this locking sits on the critical path of the write process. When a multivariate time series with many components is written, the impact of the locking overhead on write performance cannot be ignored.

2. Apache IoTDB dual storage engine
Starting with version 0.13, Apache IoTDB introduces a dual storage engine design for time series databases, with two efficient built-in storage engines: a non-shared-timestamp storage engine for univariate time series and a shared-timestamp storage engine for multivariate time series.
Dual storage engine definition
In the overall architecture of a database management system, the storage engine interfaces upward with the query engine, providing it with a standardized data access format, and downward with the storage medium: it organizes data according to the file format and, at the granularity of data pages or other units, reads and writes data on the storage medium through the specific interfaces that medium provides.
Univariate and multivariate business scenarios place different requirements on the storage engine of a time series database, so Apache IoTDB supports two storage engines to meet the needs of each. The figure below shows the overall architecture of the Apache IoTDB dual storage engine. The main difference between the two engines is whether the series under a device share a time column: the original non-shared-timestamp storage engine suits univariate series, and the newly added shared-timestamp storage engine is optimized for multivariate series. Because of this difference, the two engines differ significantly in the result set format exchanged with the query engine, in the memory tables, in the sorting stage of persistence, and in the encoding used for persistence. Thanks to good abstraction, however, the metadata manager and the cache manager are shared by both. Even the underlying file format uses a hybrid structure, so that a single TsFile can store univariate and multivariate time series together.

Dual storage engine data model design
When two storage engines are merged into one database, the first problems are how to stay compatible with the original data model and how to let users specify which storage engine to use. We therefore extended the metadata model so that users can specify, through the access API, that certain multivariate time series share timestamps, while keeping multivariate time series compatible with the semantics of the original data model.
The storage engine could be chosen at storage group granularity, but then all series in a storage group would have to be either multivariate or univariate, which limits users' flexibility. Since the components of a multivariate time series all belong to one device, the series under a device either all share one timestamp column or none of them do. We therefore set the granularity of the storage engine at the device level, as shown in the figure below. This allows multivariate and univariate time series to coexist in the same storage group. A Boolean variable in the device node of the metadata tree records whether the series under the device share a timestamp column, that is, whether they form a multivariate time series.
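The device-level choice can be pictured with a small sketch of the metadata tree. It is an illustration under assumptions, not IoTDB's metadata code: the names `DeviceNode`, `StorageGroupNode`, `is_aligned` and `engine_for`, and the second device `wf01.wt01`, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DeviceNode:
    name: str
    # True means the sensors under this device share one timestamp column.
    is_aligned: bool = False
    sensors: List[str] = field(default_factory=list)


@dataclass
class StorageGroupNode:
    name: str
    devices: Dict[str, DeviceNode] = field(default_factory=dict)

    def add_device(self, name: str, sensors: List[str], aligned: bool = False) -> DeviceNode:
        node = DeviceNode(name, aligned, list(sensors))
        self.devices[name] = node
        return node

    def engine_for(self, device: str) -> str:
        """Pick the storage engine from the Boolean flag on the device node."""
        if self.devices[device].is_aligned:
            return "shared-timestamp"
        return "non-shared-timestamp"


# One storage group holding both kinds of device.
sg = StorageGroupNode("root.ln")
sg.add_device("wf01.GPS", ["latitude", "longitude"], aligned=True)
sg.add_device("wf01.wt01", ["temperature"])  # hypothetical univariate device
```

Because the flag lives on the device node rather than on the storage group, `sg` can route `wf01.GPS` to the shared-timestamp engine and `wf01.wt01` to the non-shared one.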

For multivariate series we added the ALIGNED keyword, which marks the time series under a device as a multivariate time series sharing one timestamp column. As shown in the figure below, we use explicit creation and automatic creation respectively to create a multivariate series with latitude and longitude components for the device root.ln.wf01.GPS. Both syntaxes create a device node in the metadata tree whose shared-timestamp attribute is true.
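The two statements might look roughly like the strings built below. This is a sketch that assumes the ALIGNED syntax of IoTDB 0.13 (`CREATE ALIGNED TIMESERIES` and `INSERT ... ALIGNED VALUES`); the FLOAT data type, the sample values and the helper `aligned_insert` are illustrative assumptions and should be checked against the IoTDB SQL reference for the version in use.

```python
from typing import Dict

DEVICE = "root.ln.wf01.GPS"

# Explicit creation: declare both components under one aligned device.
# The FLOAT type is an assumption made for illustration.
create_stmt = f"CREATE ALIGNED TIMESERIES {DEVICE}(latitude FLOAT, longitude FLOAT)"


def aligned_insert(device: str, ts: int, values: Dict[str, float]) -> str:
    """Build an aligned insert; with automatic creation enabled, the first
    such insert would also create the aligned device and its series."""
    cols = ", ".join(values)
    vals = ", ".join(str(v) for v in values.values())
    return f"INSERT INTO {device}(time, {cols}) ALIGNED VALUES ({ts}, {vals})"


insert_stmt = aligned_insert(DEVICE, 1, {"latitude": 30.5, "longitude": 114.3})
```

The strings can be sent through any IoTDB client session; only the statement text is sketched here.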


3. Performance comparison
Write performance and disk footprint
To measure how much the shared-timestamp storage engine improves write persistence performance and saves disk space for multivariate time series with different numbers of components, we tested multivariate time series with 1, 10, 30 and 100 components. All components are of type long, each value equals its timestamp, adjacent timestamps are 1 ms apart, and timestamps start at 1646134492000. In this set of experiments, 10,000,000 points are written to each component, and every component has a value at every timestamp, so the null ratio of all multivariate time series is 0%.
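The workload described above is simple enough to restate as a generator. The sketch below only reproduces the stated data shape (long values equal to the timestamp, 1 ms spacing, no nulls); the function name `aligned_rows` is hypothetical and the actual benchmark tool used in the experiment is not specified in the article.

```python
from itertools import islice
from typing import Iterator, List, Tuple

START_TS = 1646134492000          # first timestamp, in ms
INTERVAL_MS = 1                   # spacing between adjacent timestamps
POINTS_PER_COMPONENT = 10_000_000


def aligned_rows(
    num_components: int,
    num_points: int = POINTS_PER_COMPONENT,
    start: int = START_TS,
    interval: int = INTERVAL_MS,
) -> Iterator[Tuple[int, List[int]]]:
    """Yield (timestamp, values) rows; every component's value equals the timestamp."""
    for i in range(num_points):
        ts = start + i * interval
        yield ts, [ts] * num_components


# Peek at the first rows of the 10-component data set without materializing it.
first_rows = list(islice(aligned_rows(10), 3))
last_ts = START_TS + (POINTS_PER_COMPONENT - 1) * INTERVAL_MS
```

For the univariate runs, the same rows would be written as `num_components` independent series, each with its own copy of the timestamps.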
As shown in the figure below, overall, when the number of components exceeds 1, multivariate time series persist on average 1.6 times faster than univariate time series.

In terms of disk usage, as shown in the figure below, when a multivariate time series has only one component, the univariate storage format takes about 1% less disk space than the multivariate format. This is because the multivariate format stores more statistics at various granularities than the univariate format, plus null-value information for the value column. When the number of components exceeds 1, for example 10, 30 and 100, the multivariate format stores only one timestamp column, which is 9, 29 and 99 fewer timestamp columns respectively than the univariate format. Because all value columns and time columns in the experiment contain the same values and use the same encoding, multivariate time series take about 50% less disk space than univariate time series.
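The "about 50%" figure follows from counting columns. The sketch below is an idealized model that assumes every column occupies the same space (reasonable here because values equal timestamps and share one encoding) and ignores the extra statistics and null-value information mentioned above; the function names are hypothetical.

```python
from typing import Tuple


def column_counts(n: int) -> Tuple[int, int]:
    """Return (univariate, multivariate) column counts for n components."""
    univariate = 2 * n      # each series has its own timestamp and value column
    multivariate = n + 1    # one shared timestamp column plus n value columns
    return univariate, multivariate


def ideal_saving(n: int) -> float:
    """Fraction of space saved by the shared-timestamp layout in this model."""
    univariate, multivariate = column_counts(n)
    return (univariate - multivariate) / univariate


savings = {n: round(ideal_saving(n), 4) for n in (1, 10, 30, 100)}
# savings == {1: 0.0, 10: 0.45, 30: 0.4833, 100: 0.495}
```

In this model the saving is (n - 1) / (2n), which approaches 50% as the number of components grows, and it is zero for a single component, where the measured result is slightly negative because of the extra metadata.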

Query performance comparison
Time series databases support a rich variety of query scenarios, but these fall into two broad kinds. The first is the raw data query, which returns the original points written to a series; depending on whether the where clause contains a value filter, it is subdivided into raw data queries without value filtering and raw data queries with value filtering. The second is the downsampling query, which returns the result of some computation over the raw data within a time range. We fix the number of components of the multivariate time series at 30 and compare the query performance of multivariate and univariate time series in these three query scenarios.
Raw data query without value filtering
The duration of a raw data query without value filtering depends on the number of series queried: the more series there are, the more data is read from disk, and for univariate time series the timestamps of the multiple series also have to be aligned. Each query reads all 30 components, with SQL similar to “select * from root.**”. As the number of queried components grows, the query performance advantage of multivariate time series is amplified further, because fewer timestamp columns are read from disk than with univariate time series and less timestamp alignment work is needed across the value columns. As shown in the figure below, for full-component raw data queries without value filtering, multivariate time series are on average 62.2% faster than univariate time series.
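The alignment cost can be illustrated with a small sketch. It assumes each univariate series is a list of (timestamp, value) pairs sorted by time and uses a k-way merge to assemble rows, while the aligned layout simply reads rows by position. The function names are hypothetical and this is not how IoTDB's query engine is implemented.

```python
import heapq
from itertools import groupby
from typing import Dict, Iterator, List, Tuple

Row = Tuple[int, Dict[str, int]]


def _tag(name: str, points: List[Tuple[int, int]]) -> Iterator[Tuple[int, str, int]]:
    """Attach the series name to every point of one series."""
    for ts, value in points:
        yield ts, name, value


def align_univariate(series: Dict[str, List[Tuple[int, int]]]) -> List[Row]:
    """Merge independently stored series by timestamp into rows."""
    streams = [_tag(name, points) for name, points in series.items()]
    merged = heapq.merge(*streams, key=lambda item: item[0])
    rows = []
    for ts, group in groupby(merged, key=lambda item: item[0]):
        rows.append((ts, {name: value for _, name, value in group}))
    return rows


def read_aligned(timestamps: List[int], columns: Dict[str, List[int]]) -> List[Row]:
    """Rows come straight from the shared timestamp column; no merge is needed."""
    return [
        (ts, {name: column[i] for name, column in columns.items()})
        for i, ts in enumerate(timestamps)
    ]


univariate_rows = align_univariate({
    "s1": [(1, 10), (2, 20), (3, 30)],
    "s2": [(1, 11), (2, 21), (3, 31)],
})
aligned_result = read_aligned([1, 2, 3], {"s1": [10, 20, 30], "s2": [11, 21, 31]})
```

Both paths return the same rows, but the univariate path reads one timestamp list per series and pays for the merge, and that work grows with the number of components queried.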

Raw data query with value filtering
The efficiency of a raw data query with value filtering depends on the selection rate of the query, that is, the percentage of the total data volume that satisfies the filter condition and ends up in the result set. We ran experiments at selection rates of 90%, 50% and 10% on data sets with 30 components and null ratios of 0%, 10% and 50%.
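As a concrete reading of the definition, the sketch below computes the selection rate of a filter and picks a threshold for a filter of the form `value >= t` that reaches a target rate. The article does not state which filter conditions were used, so the predicate shape and both function names are assumptions made for illustration.

```python
from typing import Callable, List


def selection_rate(values: List[int], predicate: Callable[[int], bool]) -> float:
    """Share of points that satisfy the value filter."""
    matched = sum(1 for v in values if predicate(v))
    return matched / len(values)


def threshold_for_rate(values: List[int], rate: float) -> int:
    """Return t such that the filter `value >= t` keeps roughly `rate` of the values."""
    ordered = sorted(values)
    cut = round(len(ordered) * (1 - rate))
    return ordered[min(cut, len(ordered) - 1)]


values = list(range(1000))
rates = {}
for target in (0.9, 0.5, 0.1):
    t = threshold_for_rate(values, target)
    rates[target] = selection_rate(values, lambda v, t=t: v >= t)
```

A lower selection rate means the filter discards more points, so more of the query time is spent evaluating the filter rather than assembling rows.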
Each query touches components of the 30-component series. As shown in the figure below, at a selection rate of 90%, multivariate time series are on average 34.8% faster than univariate time series; at 50% they are on average 30.1% faster; at 10% they are on average 4% faster. When the number of queried components is further increased to 30, the average query performance of multivariate time series across all combinations of selection rate and null ratio is 1.23 times that of univariate time series. As with the 15-component queries, if we look only at the results for selection rates of 90% and 50% with null ratios of 0% and 10%, the average query performance improvement of multivariate time series on full-component queries reaches 40%.

Downsampling query
A downsampling query returns data at a lower frequency than the frequency at which it was collected, and is a special case of an aggregate query. For example, if data is collected once per second but should be displayed per minute, a downsampling query is needed. In IoTDB, the GROUP BY clause can be used to aggregate over time intervals. The result set is divided according to the time interval and a custom sliding step (which defaults to the time interval), and results are sorted in ascending time order by default. In this experiment the aggregation window is 5000 ms, the aggregation operator is count, and the SQL is similar to “select count(*) from root.** group by([1646134492000, 1646144492001), 5000ms)”.
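The semantics of this query can be sketched as a windowed count over a left-closed, right-open time range. The function below is an illustration with a hypothetical name, run on a scaled-down range; reporting empty windows with a count of 0 is a choice of this sketch, not a statement about IoTDB's output.

```python
from typing import Iterable, List, Tuple


def count_by_window(
    timestamps: Iterable[int], start: int, end: int, window_ms: int
) -> List[Tuple[int, int]]:
    """Count points in each window [start + k*w, start + (k+1)*w) clipped to [start, end)."""
    num_windows = -(-(end - start) // window_ms)  # ceiling division
    counts = [0] * num_windows
    for ts in timestamps:
        if start <= ts < end:
            counts[(ts - start) // window_ms] += 1
    return [(start + k * window_ms, c) for k, c in enumerate(counts)]


# Scaled-down version of the experiment: 20,000 points at 1 ms spacing.
START_TS = 1646134492000
points = range(START_TS, START_TS + 20_000)
windows = count_by_window(points, START_TS, START_TS + 20_001, 5000)
```

Each full 5000 ms window holds 5000 points at 1 ms spacing; the 1 ms window left over at the end of the range holds none, because the last point lies before it.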
As shown in the figure below, when the query involves all 30 components, multivariate time series are about 15% faster than univariate time series; when it involves 15 components, they are about 10.9% faster; when only one component is queried, they are about 6% slower.

4. Summary
The experiments above show that each of the two storage engines proposed by Apache IoTDB has its own application scenarios:
1. In a single-component scenario, model the series as a univariate time series: write persistence with the non-shared-timestamp storage engine is faster than with the shared-timestamp storage engine, disk usage is lower, and query performance is slightly better.
2. When the number of components is greater than 1 and the null ratio is low, model the series as a multivariate time series: write persistence with the shared-timestamp storage engine is on average 1.6 times faster than with the non-shared-timestamp storage engine, and disk usage is reduced by nearly half.
3. After a series is modeled as a multivariate time series and written with the shared-timestamp storage engine, query performance in the various query scenarios is better than that of univariate time series as long as the query involves more than one component. Even when only one component is queried, multivariate time series are only slightly slower than univariate time series, by about 10% on average.
About us
Apache IoTDB is a high-throughput, high-compression, high-availability, IoT-native open-source time series database for managing massive time series data. Its time series storage format, IoT data model and low-bandwidth data transmission scheme were developed from scratch, so that data sampled at nanosecond resolution can be written without strain, terabyte-scale data can be queried in milliseconds, and stored data can be compressed losslessly by a factor of dozens. The core technology originates from Tsinghua University and is independently controllable. It is already widely used at State Grid, the National Meteorological Administration, AVIC Chengfei, CNNC, Changan Automobile, Goldwind and other enterprises.

As a global open-source project, Apache IoTDB currently has 193 contributors, 2K stars and 634 forks. We have a contribution guide and welcome more people to help the project keep developing.
Welcome to take your first step into the Apache IoTDB community!
WeChat group: add the contact qinchuqing