[System Design] Metrics Monitoring and Alerting System

2022-07-07 11:30:00 【Dotnet cross platform】

In this article, we discuss how to design a scalable metrics monitoring and alerting system. A good monitoring and alerting system plays a key role in the observability, high availability, and reliability of infrastructure.

The following figure shows some popular metrics monitoring and alerting services on the market.

Next, we will design a similar service for internal use at a large company.

Design requirements

Let's start with a story about Xiao Ming's interview.

Interviewer: If you were asked to design a metrics monitoring and alerting system, how would you do it?

Xiao Ming: OK. Is this system for internal use, or are we designing a SaaS service like Datadog?

Interviewer: Good question. For now, the system is for internal use only.

Xiao Ming: Which metrics do we want to collect?

Interviewer: Operating system metrics, middleware metrics, and metrics such as QPS for the running application services.

Xiao Ming: How large is the infrastructure we monitor with this system?

Interviewer: 100 million active users, 1,000 server pools, and 100 machines per pool.

Xiao Ming: How long does the metrics data need to be kept?

Interviewer: We want to keep it for one year.

Xiao Ming: OK. For long-term storage, can the resolution of the metrics data be reduced?

Interviewer: Good question. The newest data is kept at full resolution for 7 days. After 7 days it can be rolled up to 1-minute resolution, and after 30 days it can be rolled up further to 1-hour resolution.

Xiao Ming: Which alert channels should be supported?

Interviewer: Email, DingTalk, WeCom (Enterprise WeChat), and HTTP endpoints.

Xiao Ming: Do we need to collect logs? And do we need to support distributed tracing?

Interviewer: For now we are focusing on metrics. The others are out of scope.

Xiao Ming: OK, I think I understand.

To sum up, the monitored infrastructure is large, and the system must support metrics across many dimensions. The overall system also has demanding requirements: scalability, low latency, reliability, and flexibility.

Fundamentals

A metrics monitoring and alerting system usually consists of five components, as shown in the figure below.

1. Data collection: collect metrics data from different data sources.
2. Data transmission: send the metrics data to the metrics monitoring system.
3. Data storage: store the metrics data.
4. Alerting: analyze the incoming data and send an alert notification when an anomaly is detected.
5. Visualization: present the data as graphs and charts.

Data model

Metrics data is usually stored as time series, each containing a set of values with their associated timestamps.

A series is uniquely identified by its name and, optionally, by a set of labels.

Let's look at two examples.

Example 1: What is the CPU load on production server i631 at 20:00?

The data point highlighted in the figure above can be represented in the following format.

In the example above, the time series is made up of the metric name, the labels (host:i631,env:prod), a timestamp, and the corresponding value.

Example 2: What is the average CPU load across all web servers in the Shanghai region over the past 10 minutes?

Conceptually, we would retrieve data similar to the following:

CPU.load host=webserver01,region=shanghai 1613707265 50

CPU.load host=webserver01,region=shanghai 1613707270 62

CPU.load host=webserver02,region=shanghai 1613707275 43

We can compute the average CPU load from the values at the end of each line. This data format is called the line protocol. It is a common input format for many monitoring systems on the market; Prometheus and OpenTSDB are two examples.

Each time series contains the following:

• Metric name: a string.
• Labels: a list of key-value pairs, List<key, value>.
• Data points: a list of values and their timestamps, List<value, timestamp>.
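
As a rough illustration of this data model, the sketch below parses the three sample lines from Example 2 into series objects and averages the Shanghai values. The `TimeSeries` class and the `parse_line` and `load` helpers are hypothetical names chosen for this example, not part of Prometheus or OpenTSDB.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class TimeSeries:
    """One series: metric name, labels, and (timestamp, value) points."""
    name: str
    labels: dict[str, str]
    points: list[tuple[int, float]] = field(default_factory=list)


def parse_line(line: str) -> tuple[str, dict[str, str], int, float]:
    """Parse '<name> <k=v,k=v> <unix_ts> <value>' into its four parts."""
    name, raw_labels, ts, value = line.split()
    labels = dict(pair.split("=", 1) for pair in raw_labels.split(","))
    return name, labels, int(ts), float(value)


def load(lines: list[str]) -> dict[tuple, TimeSeries]:
    """Group points into series keyed by metric name plus sorted labels."""
    series: dict[tuple, TimeSeries] = {}
    for line in lines:
        name, labels, ts, value = parse_line(line)
        key = (name, tuple(sorted(labels.items())))
        series.setdefault(key, TimeSeries(name, labels)).points.append((ts, value))
    return series


LINES = [
    "CPU.load host=webserver01,region=shanghai 1613707265 50",
    "CPU.load host=webserver01,region=shanghai 1613707270 62",
    "CPU.load host=webserver02,region=shanghai 1613707275 43",
]

series = load(LINES)
shanghai_values = [
    value
    for s in series.values()
    if s.labels.get("region") == "shanghai"
    for _, value in s.points
]
average_load = mean(shanghai_values)  # (50 + 62 + 43) / 3
```

The three lines collapse into two series, because the name and label set together identify a series: webserver01 has two points and webserver02 has one.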

Data storage

Data storage is the core of the design. Building your own storage system is not recommended, and neither is using a general-purpose storage system (such as MySQL) for the job.

In theory, a general-purpose database can support time series data, but it takes expert-level tuning to make it work at large data volumes.

Specifically, relational databases are not optimized for time series data, for several reasons:

• Computing an average over a rolling time window requires complex, hard-to-read SQL.
• To support tag/label data, an index must be added for each tag.
• Relational databases do not perform well under sustained, highly concurrent writes.

What about NoSQL? In theory, a few NoSQL databases on the market can handle time series data effectively, for example Cassandra and Bigtable. However, storing and querying the data efficiently and building a scalable system on top of them requires a deep understanding of how each NoSQL database works internally.

By comparison, time series databases are optimized specifically for time series data and are a better fit for this scenario.

OpenTSDB is a distributed time series database, but because it is built on Hadoop and HBase, running a Hadoop/HBase cluster adds complexity. Twitter uses MetricsDB to store metrics data, and Amazon offers Timestream as a time series database service.

According to DB-Engines, the two most popular time series databases are InfluxDB and Prometheus. Both can store large volumes of time series data and support fast, real-time analysis of it.

As shown in the figure below, an InfluxDB instance with 8 CPU cores and 32 GB of RAM can handle more than 250,000 writes per second.

High-level design

• Metrics Source: where metrics come from, such as application services, databases, and message queues.
• Metrics Collector: collects the metrics data.
• Time Series DB: stores the metrics data.
• Query Service: provides the interface for querying metrics.
• Alerting System: sends alert notifications when an anomaly is detected.
• Visualization System: displays metrics as charts.

In-depth design

Now let's focus on the data collection flow. There are two main approaches: pull and push.

Pull mode

The figure above shows data collection in pull mode: dedicated metrics collectors periodically pull metrics data from the running applications.

This raises a question: how does a collector know the address of each data source? A good solution is to introduce a service registration and discovery component, such as etcd or ZooKeeper, as follows.

The following figure shows the resulting pull flow.

1. The metrics collector fetches metadata from the service discovery component, including the pull interval, IP addresses, timeouts, and retry parameters.
2. The metrics collector pulls the metrics data from the configured HTTP endpoint.
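
The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Target` fields mirror the metadata listed in step 1, the `/metrics` path is an assumed endpoint name, and the HTTP call is injected as a `fetch` function so the example runs without a network. `flaky_fetch` is a hypothetical stand-in that times out once and then succeeds.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Target:
    """Scrape metadata as it might be returned by service discovery."""
    address: str
    interval_seconds: int
    timeout_seconds: float
    retries: int


def scrape(target: Target, fetch: Callable[[str, float], str]) -> Optional[str]:
    """Try the target's metrics endpoint up to 1 + retries times."""
    url = f"http://{target.address}/metrics"  # assumed endpoint path
    for _ in range(target.retries + 1):
        try:
            return fetch(url, target.timeout_seconds)
        except (TimeoutError, ConnectionError):
            continue  # transient failure: try again
    return None  # give up until the next pull interval


attempts: list[str] = []


def flaky_fetch(url: str, timeout: float) -> str:
    """Hypothetical fetch: the first call times out, later calls succeed."""
    attempts.append(url)
    if len(attempts) < 2:
        raise TimeoutError("simulated timeout")
    return "CPU.load host=webserver01,region=shanghai 1613707265 50"


target = Target("10.0.0.5:9100", interval_seconds=10, timeout_seconds=2.0, retries=3)
payload = scrape(target, flaky_fetch)
```

In a real collector, `fetch` would wrap an HTTP client and `scrape` would be scheduled once per `interval_seconds` for every target returned by service discovery.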

With a large volume of data, a single metrics collector cannot keep up, so we need a pool of collectors. But how should multiple collectors and multiple data sources be coordinated so that they work without conflicts?

Consistent hashing fits this scenario well. We can map both the collectors and the data sources onto a hash ring, as follows.

This ensures that each data source is handled by exactly one metrics collector, so the collectors do not conflict with each other.
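
A minimal sketch of such a hash ring is shown below. The `HashRing` class, the use of MD5 as the ring hash, and the 100 virtual nodes per collector are assumptions made for illustration; the collector and source names are made up.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable position on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, collectors: list[str], vnodes: int = 100) -> None:
        # Place each collector at several positions (virtual nodes)
        # so that sources spread more evenly across collectors.
        self._ring = sorted(
            (_hash(f"{c}#{i}"), c) for c in collectors for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._ring]

    def collector_for(self, source: str) -> str:
        """Walk clockwise from the source's position to the next collector."""
        idx = bisect.bisect_right(self._positions, _hash(source)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["collector-1", "collector-2", "collector-3"])
sources = [f"10.0.0.{i}" for i in range(1, 201)]
assignment = {s: ring.collector_for(s) for s in sources}
```

Each source maps to exactly one collector. If a collector is removed from the ring, only the sources it owned are reassigned; every other source keeps its collector.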

Push mode

As shown in the figure below, in push mode the metrics sources (web applications, databases, message queues) send their data directly to the metrics collector.

In push mode, a collection agent is installed on every monitored server. It collects the server's metrics data and periodically sends it to the metrics collector.

Which is better, push or pull? There is no definitive answer. Both approaches are viable, and some complex scenarios require supporting both at the same time.

Scaling the data transmission pipeline

Now let's focus on the metrics collectors and the time series database. Whether you use push or pull, the metrics collector is usually a cluster of services when large volumes of data must be received.

However, if the time series database is unavailable, there is a risk of data loss. We therefore introduce Kafka as a message queue, as shown in the figure below.

The metrics collectors send the metrics data to the Kafka message queue. Consumers or stream processing services such as Apache Storm, Flink, and Spark then process the data and finally push it to the time series database.
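
To illustrate why the queue prevents data loss, the sketch below replaces Kafka with an in-memory `deque` and the time series database with a hypothetical `FlakyTSDB` stub. It is not Kafka client code. It only shows the idea that a row leaves the queue after a successful write and not before.

```python
from collections import deque


class FlakyTSDB:
    """Stand-in for a time series database that may be unavailable."""

    def __init__(self) -> None:
        self.available = True
        self.rows: list[tuple[str, int, float]] = []

    def write(self, row: tuple[str, int, float]) -> None:
        if not self.available:
            raise ConnectionError("TSDB unavailable")
        self.rows.append(row)


def drain(queue: deque, db: FlakyTSDB) -> int:
    """Consumer loop: remove a row from the queue only after a successful write."""
    written = 0
    while queue:
        try:
            db.write(queue[0])
        except ConnectionError:
            break  # leave the row queued and retry on the next run
        queue.popleft()
        written += 1
    return written


queue: deque = deque()  # in-memory stand-in for a Kafka topic
db = FlakyTSDB()

# Collectors keep producing while the database is down.
db.available = False
queue.append(("CPU.load", 1613707265, 50.0))
queue.append(("CPU.load", 1613707270, 62.0))
written_while_down = drain(queue, db)

# Once the database recovers, the backlog is written in order.
db.available = True
written_after_recovery = drain(queue, db)
```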

Metrics aggregation

Metrics can be aggregated in several places. Let's see how they differ.

• Collection agent: the agent installed on the client side supports only simple aggregation logic.
• Ingestion pipeline: before data is written to the time series database, a stream processing service such as Flink can aggregate it so that only the summarized data is written. This greatly reduces the write volume, but because the raw data is not stored, some precision is lost.
• Query side: the raw data can be aggregated at query time, but such queries are not very efficient.
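
The trade-off of aggregating in the ingestion pipeline can be seen in a small sketch. It assumes raw events arrive as (timestamp, metric name) pairs and counts them per minute before writing. `aggregate_per_minute` and the `http.requests` metric are hypothetical names, and a real pipeline would run this logic inside a stream processor rather than on a list.

```python
from collections import Counter


def aggregate_per_minute(events: list[tuple[int, str]]) -> dict[tuple[int, str], int]:
    """Count events per (minute start, metric name)."""
    counts: Counter = Counter()
    for ts, name in events:
        minute_start = ts - ts % 60  # truncate the timestamp to the minute
        counts[(minute_start, name)] += 1
    return dict(counts)


events = [
    (1613707265, "http.requests"),
    (1613707270, "http.requests"),
    (1613707275, "http.requests"),
    (1613707325, "http.requests"),
]

rows_to_write = aggregate_per_minute(events)
# Four raw events become two rows. The individual timestamps are gone,
# which is the loss of precision mentioned above.
```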

Time series database query language

Most popular metrics monitoring systems, such as Prometheus and InfluxDB, do not use SQL and instead have their own query languages. One major reason is that time series queries are hard to write in SQL and hard to read. For example, can you tell what data the following SQL is looking for?

select id,
       temp,
       avg(temp) over (partition by group_nr order by time_read) as rolling_avg
from (
  select id,
         temp,
         time_read,
         interval_group,
         id - row_number() over (partition by interval_group order by time_read) as group_nr
  from (
    select id,
           time_read,
           'epoch'::timestamp + '900 seconds'::interval * (extract(epoch from time_read)::int4 / 900) as interval_group,
           temp
    from readings
  ) t1
) t2
order by time_read;

By comparison, Flux, the query language InfluxDB uses for time series data, is simpler and easier to understand, as follows.

from(db:"telegraf")
  |> range(start:-1h)
  |> filter(fn: (r) => r._measurement == "foo")
  |> exponentialMovingAverage(size:-10s)

Data encoding and compression

Data encoding and compression can greatly reduce the size of the data, especially in a time series database. Here is a simple example.

Because data is usually collected at a fixed interval, we can store a base value followed by deltas, for example 1610087371, 10, 10, 9, 11, which takes up less space than storing every full timestamp.
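
The base-plus-delta idea can be sketched in a few lines. The function names are hypothetical, and production systems typically layer further compression on top of this, but the example reproduces the sequence above exactly.

```python
from itertools import accumulate


def delta_encode(timestamps: list[int]) -> list[int]:
    """Keep the first timestamp, then store each difference to the previous one."""
    if not timestamps:
        return []
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [timestamps[0]] + deltas


def delta_decode(encoded: list[int]) -> list[int]:
    """Rebuild the timestamps with a running sum over the base and deltas."""
    return list(accumulate(encoded))


timestamps = [1610087371, 1610087381, 1610087391, 1610087400, 1610087411]
encoded = delta_encode(timestamps)  # [1610087371, 10, 10, 9, 11]
decoded = delta_decode(encoded)
```

The deltas are small numbers that fit in far fewer bits than a full 10-digit timestamp, which is where the space saving comes from.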

Downsampling

Downsampling is the process of converting high-resolution data into low-resolution data, which reduces disk usage. Because our retention period is 1 year, we can downsample older data. For example:

• Data from the last 7 days: no downsampling.
• Data between 7 and 30 days old: downsampled to 1-minute resolution.
• Data between 30 days and 1 year old: downsampled to 1-hour resolution.

Let's look at a concrete example that aggregates 10-second resolution data into 30-second resolution data.

Raw data

After downsampling
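
The aggregation from 10-second to 30-second resolution can be sketched as follows. The sample values are made up for illustration, and averaging is only one possible aggregation function (max, min, or sum are also common choices).

```python
from collections import defaultdict
from statistics import mean


def downsample(points: list[tuple[int, float]], bucket_seconds: int) -> list[tuple[int, float]]:
    """Average all points that fall into the same fixed-size time bucket."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - ts % bucket_seconds
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Illustrative 10-second samples (timestamp, CPU load)
raw = [
    (1613707260, 10.0),
    (1613707270, 16.0),
    (1613707280, 19.0),
    (1613707290, 30.0),
    (1613707300, 20.0),
    (1613707310, 25.0),
]

downsampled = downsample(raw, bucket_seconds=30)
```

Six raw points become two: one per 30-second bucket, each holding the average of the three samples in that bucket.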

Alerting service

Let's look at the design of the alerting service and its workflow.

1. Load the alert configuration files, written in YAML, into the cache. For example:
   - name: instance_down
     rules:
     # Trigger an alert when the service has been unavailable for more than 5 minutes.
     - alert: instance_down
       expr: up == 0
       for: 5m
       labels:
         severity: page
2. The alert manager reads the configuration from the cache.
3. Following the alert rules, the alert manager queries the metrics at the configured interval and with the configured conditions. If a threshold is exceeded, an alert is triggered.
4. The Alert Store keeps the state of every alert (pending, firing, resolved).
5. Alerts that meet the conditions are added to Kafka.
6. Consumers read from the queue and, according to the alert rules, send the alert to the appropriate notification channels.
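
Steps 3 and 4 can be illustrated with a small state tracker for the instance_down rule. This is a sketch under assumptions, not the behavior of any particular alert manager: the `AlertRule` and `AlertTracker` names are hypothetical, and the extra "inactive" state for an alert that has never fired is added here for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    name: str
    for_seconds: int  # how long the condition must hold before the alert fires


class AlertTracker:
    """Track one alert through inactive -> pending -> firing -> resolved."""

    def __init__(self, rule: AlertRule) -> None:
        self.rule = rule
        self.state = "inactive"
        self._since: Optional[int] = None  # when the condition first became true

    def evaluate(self, now: int, condition_true: bool) -> str:
        if not condition_true:
            # A firing alert becomes resolved; anything else goes back to inactive.
            self.state = "resolved" if self.state == "firing" else "inactive"
            self._since = None
        else:
            if self._since is None:
                self._since = now
            held_for = now - self._since
            self.state = "firing" if held_for >= self.rule.for_seconds else "pending"
        return self.state


rule = AlertRule(name="instance_down", for_seconds=5 * 60)  # for: 5m
tracker = AlertTracker(rule)

# (timestamp in seconds, value of the `up` metric), sampled once per minute
samples = [(0, 1), (60, 0), (120, 0), (180, 0), (240, 0), (300, 0), (360, 0), (420, 1)]
states = [tracker.evaluate(ts, up == 0) for ts, up in samples]
```

The alert stays pending while `up == 0` has held for less than 5 minutes, fires once the 5 minutes have elapsed, and is resolved when the instance comes back.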

Visualization

Visualization is built on top of the data layer. Metrics can be shown on a metrics dashboard, and alerts can be shown on an alerts dashboard. The following figure shows several metrics: the number of requests handled by the server, memory/CPU utilization, page load time, traffic, and login information.

Grafana is a very good visualization system that we can use directly.

Summary

In this article, we walked through the design of a metrics monitoring and alerting system. At a high level, we discussed data collection, the time series database, alerting, and visualization. The following figure shows our final design:

