
[System Design] Metrics Monitoring and Alerting System

2022-07-07 11:30:00 Dotnet cross platform

In this article, we will discuss how to design a scalable metrics monitoring and alerting system. A good monitoring and alerting system plays a key role in the observability, high availability, and reliability of the infrastructure.

The following figure shows some popular metrics monitoring and alerting services on the market.

[Figure: popular metrics monitoring and alerting services]

Next, we will design a similar service that can be used inside a large company.

Design requirements

Let's start with a story about Xiao Ming's interview.


Interviewer: If you were asked to design a metrics monitoring and alerting system, what would you do?

Xiao Ming: Well, is this system for internal use, or are we designing a SaaS service like Datadog?

Interviewer: Good question. For now, the system is only for internal use.

Xiao Ming: What metrics do we want to collect?

Interviewer: Operating system metrics, middleware metrics, and application-level metrics such as the QPS of running services.

Xiao Ming: How large is the infrastructure this system will monitor?

Interviewer: 100 million active users, 1,000 server pools, and 100 machines per pool.

Xiao Ming: How long does the metric data need to be retained?

Interviewer: We want to keep it for one year.

Xiao Ming: OK. For long-term storage, can the resolution of the metric data be reduced?

Interviewer: Good question. The most recent data is kept at full resolution for 7 days; after 7 days it can be rolled up to 1-minute resolution, and after 30 days it can be further aggregated at 1-hour resolution.

Xiao Ming: Which alert channels need to be supported?

Interviewer: Email, DingTalk, Enterprise WeChat, and HTTP endpoints.

Xiao Ming: Do we need to collect logs? And do we need to support distributed tracing?

Interviewer: For now we focus only on metrics; the others are out of scope.

Xiao Ming: OK, I think I understand.

To sum up, the infrastructure being monitored is large, and the system needs to support metrics across many dimensions. In addition, the system itself has demanding requirements: scalability, low latency, reliability, and flexibility.

Fundamentals

A metrics monitoring and alerting system usually consists of five components, as shown in the figure below.

[Figure: the five components of a metrics monitoring and alerting system]

1. Data collection: collect metric data from different data sources.

2. Data transmission: send the metric data to the metrics monitoring system.

3. Data storage: store the metric data.

4. Alerting: analyze the incoming data and send alert notifications when anomalies are detected.

5. Visualization: present the data as graphs and charts.

Data model

Metric data is usually stored as a time series: a set of values with their associated timestamps.

A series can be uniquely identified by its name, and optionally by a set of labels.

Let's look at two examples .

Example 1: What is the CPU load of production server i631 at 20:00?

[Figure: CPU load of server i631 over time, with the 20:00 data point marked]

The data point marked in the figure above can be represented in the following format:

[Figure: the data point represented as a metric name, labels, a timestamp, and a value]

In the example above, the time series consists of the metric name, the labels (host:i631, env:prod), a timestamp, and the corresponding value.

Example 2: What is the average CPU load across all web servers in Shanghai over the past 10 minutes?

Conceptually, we would query data that looks like the following:

CPU.load host=webserver01,region=shanghai 1613707265 50

CPU.load host=webserver01,region=shanghai 1613707270 62

CPU.load host=webserver02,region=shanghai 1613707275 43

We can compute the average CPU load from the value at the end of each line. This data format is also called the line protocol, and it is a common input format for many monitoring tools on the market; Prometheus and OpenTSDB are two examples. A small parsing sketch follows the list below.

Each time series contains the following:

• A metric name (string).

• A set of key-value pairs, the labels of the metric: List<key, value>.

• An array of values and their timestamps: List<value, timestamp>.
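To make the data model concrete, here is a minimal sketch (not tied to any particular database) that parses one line of the protocol above into an in-memory structure; the class and helper names are illustrative only.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class TimeSeries:
    metric: str                       # metric name, e.g. "CPU.load"
    labels: Dict[str, str]            # e.g. {"host": "webserver01", "region": "shanghai"}
    points: List[Tuple[int, float]] = field(default_factory=list)  # (timestamp, value) pairs

def parse_line(line: str) -> Tuple[str, Dict[str, str], int, float]:
    # One line looks like: "<metric> <k=v,k=v,...> <timestamp> <value>"
    metric, raw_labels, ts, value = line.split()
    labels = dict(pair.split("=", 1) for pair in raw_labels.split(","))
    return metric, labels, int(ts), float(value)

metric, labels, ts, value = parse_line("CPU.load host=webserver01,region=shanghai 1613707265 50")
series = TimeSeries(metric, labels, [(ts, value)])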

Data storage

Data storage is the core of the design. It is not recommended to build your own storage system, nor to use a general-purpose database (such as MySQL) for this job.

In theory, a conventional database can handle time series data, but it would need expert-level tuning to cope with the data volumes in this scenario.

More specifically, relational databases are not optimized for time series data, for several reasons:

• Computing an average over a rolling time window requires complex, hard-to-read SQL.

• To support labels (tags), an index has to be added for every tag.

• Relational databases do not perform well under sustained, highly concurrent writes.

What about NoSQL? In theory, a few NoSQL databases on the market can handle time series data effectively; Cassandra and Bigtable both work, for example. However, to store and query the data efficiently and build a scalable system on top, you need a deep understanding of how each NoSQL database works internally.

By comparison, time series databases are optimized specifically for time series data and are a better fit for this scenario.

OpenTSDB is a distributed time series database, but because it is built on Hadoop and HBase, running a Hadoop/HBase cluster adds operational complexity. Twitter uses the MetricsDB time series database to store its metric data, and Amazon offers the Timestream time series database service.

According to the DB-engines ranking, the two most popular time series databases are InfluxDB and Prometheus. Both can store large volumes of time series data and support fast, real-time analysis of that data.

As shown in the figure below, an InfluxDB instance with 8 CPU cores and 32 GB of RAM can handle more than 250,000 writes per second.

[Figure: InfluxDB write throughput benchmark]

High-level design

[Figure: high-level design]

• Metrics source: where the metrics come from, such as application services, databases, and message queues.

• Metrics collector: collects the metric data.

• Time series DB: the time series database that stores the metrics.

• Query service: provides an interface for querying the metrics.

• Alerting system: sends alert notifications when anomalies are detected.

• Visualization system: presents the metrics as charts.

Detailed design

[Figure: detailed design]

Now let's focus on the data collection process. There are two main approaches: push and pull.

Pull mode

[Figure: data collection in pull mode]

The figure above shows data collection in pull mode: a separate metrics collector periodically pulls metric data from the running applications.

This raises a question: how does the metrics collector know the address of each data source? A good solution is to introduce a service registration and discovery component such as etcd or ZooKeeper, as shown below.

[Figure: pull mode with a service registration and discovery component such as etcd or ZooKeeper]

The following figure shows the data pull workflow in more detail.

[Figure: the data pull workflow]

1. The metrics collector fetches configuration metadata from the service discovery component, including the pull interval, IP addresses, timeout, and retry parameters.

2. The metrics collector pulls the metric data through the configured HTTP endpoint; a minimal sketch of this loop is shown below.
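A minimal sketch of this pull loop, assuming a hypothetical discover_targets() helper in place of a real etcd/ZooKeeper client; the /metrics path, field names, and intervals are assumptions made for illustration, not part of the original design.

import time
import requests  # third-party HTTP client

def discover_targets():
    # Hypothetical stand-in for querying the service discovery component (etcd/ZooKeeper).
    return [
        {"address": "10.0.0.1:9100", "timeout_s": 5, "retries": 3},
        {"address": "10.0.0.2:9100", "timeout_s": 5, "retries": 3},
    ]

def scrape(target):
    # Pull metric data from the target's HTTP endpoint, retrying on failure.
    for _ in range(target["retries"]):
        try:
            resp = requests.get(f"http://{target['address']}/metrics", timeout=target["timeout_s"])
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            continue
    return None

while True:
    for target in discover_targets():
        payload = scrape(target)
        if payload is not None:
            print(payload[:80])  # a real collector would forward this to storage or Kafka
    time.sleep(15)               # pull interval; in practice this comes from the discovery metadata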

When the volume of data is large, a single metrics collector cannot keep up, so we need a cluster of collectors. But how should multiple collectors coordinate with multiple data sources so that they work without conflict?

Consistent hashing fits this scenario well: we map the data sources onto a hash ring, as shown below.

[Figure: data sources mapped onto a consistent hash ring]

This ensures that every data source is owned by exactly one metrics collector, so the collectors can work together without conflict.
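A minimal consistent-hashing sketch using only the Python standard library; the number of virtual nodes and the hash function are arbitrary choices for illustration.

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, collectors, vnodes=100):
        # Place each collector on the ring multiple times (virtual nodes) for better balance.
        self._ring = sorted((self._hash(f"{c}#{i}"), c) for c in collectors for i in range(vnodes))
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def collector_for(self, source: str) -> str:
        # Walk clockwise from the source's position to the first collector node.
        idx = bisect.bisect(self._keys, self._hash(source)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["collector-1", "collector-2", "collector-3"])
print(ring.collector_for("webserver01"))  # each data source maps to exactly one collector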

Push mode

As shown in the figure below, in push mode the various metric sources (web applications, databases, message queues) send their data directly to the metrics collector.

[Figure: data collection in push mode]

In push mode, a collection agent is installed on every monitored server. It gathers the server's metric data and periodically sends it to the metrics collector.
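A minimal sketch of such an agent, assuming the collector exposes an HTTP ingest endpoint; the URL, payload shape, and the use of the psutil library are assumptions made for illustration.

import socket
import time
import psutil    # third-party library for local OS metrics
import requests  # third-party HTTP client

COLLECTOR_URL = "http://metrics-collector.internal:8080/ingest"  # assumed endpoint

def collect_local_metrics():
    return {
        "host": socket.gethostname(),
        "timestamp": int(time.time()),
        "cpu_percent": psutil.cpu_percent(interval=None),
        "mem_percent": psutil.virtual_memory().percent,
    }

while True:
    try:
        requests.post(COLLECTOR_URL, json=collect_local_metrics(), timeout=5)
    except requests.RequestException:
        pass  # drop (or buffer locally) and keep the agent alive
    time.sleep(15)  # push interval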

Is push or pull better? There is no fixed answer; both approaches are viable, and in some complex scenarios you may need to support both at the same time.

Scaling data transmission

[Figure: the metrics collectors and the time series database]

Now let's look at the metrics collectors and the time series database. Whether push or pull is used, the metrics collector is usually a cluster of services when a large volume of data has to be ingested.

However, if the time series database becomes unavailable there is a risk of data loss, so we introduce Kafka as a message queue, as shown in the figure below.

[Figure: Kafka as a buffer between the metrics collectors and the time series database]

The metrics collectors send the metric data to a Kafka topic. Consumers or stream-processing services such as Apache Storm, Flink, and Spark then process the data and finally write it to the time series database.
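A minimal consumer sketch using the kafka-python client; the topic name and the write_to_tsdb() helper are illustrative assumptions rather than a specific database API.

import json
from kafka import KafkaConsumer  # kafka-python package

def write_to_tsdb(point):
    # Hypothetical stand-in for a (batched) write to the time series database.
    print("writing", point)

consumer = KafkaConsumer(
    "metrics",                                             # assumed topic name
    bootstrap_servers=["kafka:9092"],
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    write_to_tsdb(message.value)  # a production pipeline would batch, retry, and commit offsets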

Metrics aggregation

Metrics can be aggregated in several places; let's look at how they differ.

• Collection agent: the agent installed on the client side only supports simple aggregation logic.

• Transmission pipeline: before the data is written to the time series database, a stream-processing service such as Flink can aggregate it so that only the aggregated results are written, which greatly reduces the write volume. Because the raw data is not stored, some precision is lost (see the sketch after this list).

• Query side: the raw data can be aggregated at query time, but such queries are not very efficient.
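To make the pipeline option concrete, here is a tumbling-window average in plain Python, standing in for what a Flink or Spark job would do; the window size and data shape are illustrative.

from collections import defaultdict

def tumbling_window_avg(points, window_s=60):
    # points: iterable of (timestamp, value); returns {window_start_timestamp: average}.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in points:
        window_start = ts - ts % window_s
        sums[window_start] += value
        counts[window_start] += 1
    return {w: sums[w] / counts[w] for w in sums}

raw = [(1613707265, 50), (1613707270, 62), (1613707275, 43)]
print(tumbling_window_avg(raw))  # only these per-window averages would be written to the database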

Time series database query language

Most popular metrics monitoring systems, such as Prometheus and InfluxDB, do not use SQL and instead have their own query languages. One major reason is that it is hard to express time series queries in SQL, and the result is hard to read. Can you tell at a glance what data the following SQL is looking for?

select id,
       temp,
       avg(temp) over (partition by group_nr order by time_read) as rolling_avg
from (
  select id,
         temp,
         time_read,
         interval_group,
         id - row_number() over (partition by interval_group order by time_read) as group_nr
  from (
    select id,
           time_read,
           'epoch'::timestamp + '900 seconds'::interval * (extract(epoch from time_read)::int4 / 900) as interval_group,
           temp
    from readings
  ) t1
) t2
order by time_read;

By comparison, Flux, the query language InfluxDB uses for time series data, is simpler and easier to understand, for example:

from(db:"telegraf")
  |> range(start:-1h)
  |> filter(fn: (r) => r._measurement == "foo")
  |> exponentialMovingAverage(size:-10s)

Data encoding and compression

Data encoding and compression can greatly reduce the size of the data, which is especially important in a time series database. Here is a simple example.

[Figure: timestamps encoded as a base value plus small deltas]

Because data is usually collected at a fixed interval, we can store one base timestamp together with the deltas to the previous value, for example 1610087371, 10, 10, 9, 11, which takes up much less space.
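A minimal sketch of this delta idea for timestamps; production engines such as Gorilla or InfluxDB apply far more aggressive bit-level compression, so this only illustrates the principle.

def delta_encode(timestamps):
    # Store the first timestamp in full, then only the difference to the previous one.
    base = timestamps[0]
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return base, deltas

def delta_decode(base, deltas):
    out = [base]
    for d in deltas:
        out.append(out[-1] + d)
    return out

ts = [1610087371, 1610087381, 1610087391, 1610087400, 1610087411]
base, deltas = delta_encode(ts)
print(base, deltas)                      # 1610087371 [10, 10, 9, 11]
assert delta_decode(base, deltas) == ts  # lossless round trip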

Downsampling

Downsampling is the process of converting high-resolution data into low-resolution data, which reduces disk usage. Since our retention period is one year, we can downsample old data as follows (a small retention sketch follows this list):

• 7 days of data: no downsampling.

• 30 days of data: downsample to 1-minute resolution.

• 1 year of data: downsample to 1-hour resolution.
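A minimal sketch of how these retention tiers could be expressed in code; the tier table mirrors the requirements above, while the function name and return convention are assumptions.

import time

# Raw data for 7 days, 1-minute resolution up to 30 days, 1-hour resolution up to 1 year.
RETENTION_TIERS = [
    (7 * 24 * 3600, None),     # newer than 7 days: keep raw resolution
    (30 * 24 * 3600, 60),      # 7-30 days old: 1-minute buckets
    (365 * 24 * 3600, 3600),   # 30 days to 1 year old: 1-hour buckets
]

def target_resolution(point_ts, now=None):
    # Return the bucket size (in seconds) that data of this age should be stored at.
    age = (now if now is not None else time.time()) - point_ts
    for max_age, bucket_s in RETENTION_TIERS:
        if age <= max_age:
            return bucket_s
    return None  # older than one year: eligible for deletion

print(target_resolution(time.time() - 10 * 24 * 3600))  # 60: a 10-day-old point goes into 1-minute buckets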

Let's look at another concrete example, which aggregates data at 10-second resolution into 30-second resolution.

Raw data

[Table: raw data at 10-second resolution]

After downsampling

[Table: the same data downsampled to 30-second resolution]

Alerting service

Let's look at the design of the alerting service and its workflow.

[Figure: alerting service design]

1. Load the YAML-format alert configuration files into a cache, for example:

   - name: instance_down
     rules:
     # Trigger an alert if the service is down for more than 5 minutes.
     - alert: instance_down
       expr: up == 0
       for: 5m
       labels:
         severity: page

2. The alert manager reads the configuration from the cache.

3. Based on the alert rules, it queries the metrics over the configured time range and conditions; if a threshold is exceeded, an alert is triggered (a small evaluation sketch follows this list).

4. The alert store keeps the state of all alerts (pending, firing, resolved).

5. Qualifying alerts are written to Kafka.

6. Consumers read from the queue and, according to the alert rules, send the alert information to the different notification channels.
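A minimal sketch of steps 2 to 4, evaluating one rule of the kind shown in the YAML above; the query_service() helper and its return shape are assumptions, not the real query service API.

import time

RULE = {"alert": "instance_down", "expr": "up == 0", "for_s": 300, "severity": "page"}

def query_service(expr):
    # Hypothetical call to the query service; returns the latest value per series.
    return {"host=webserver01": 0, "host=webserver02": 1}

pending_since = {}  # series key -> timestamp when the condition first became true

def evaluate(rule, now=None):
    now = now if now is not None else time.time()
    firing = []
    for key, value in query_service(rule["expr"]).items():
        if value == 0:                          # the condition from 'up == 0'
            pending_since.setdefault(key, now)
            if now - pending_since[key] >= rule["for_s"]:
                firing.append(key)              # down for at least 5 minutes: trigger
        else:
            pending_since.pop(key, None)        # recovered: clear the pending state
    return firing

print(evaluate(RULE))                         # []: the condition just became true, still pending
print(evaluate(RULE, now=time.time() + 400))  # ['host=webserver01']: fires after more than 5 minutes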

Visualization

Visualization is built on top of the data layer. Metric data can be shown on metrics dashboards, and alert information on alert dashboards. The following figure shows a dashboard with metrics such as server request counts, memory/CPU utilization, page load time, traffic, and login information.

[Figure: an example metrics dashboard]

Grafana is an excellent visualization system, and we can use it directly.

Summary

In this article, we walked through the design of a metrics monitoring and alerting system. At a high level, we discussed data collection, the time series database, alerting, and visualization. The figure below shows the final design.

[Figure: final design]

Reference

[0] System Design Interview Volume 2: https://www.amazon.com/System-Design-Interview-Insiders-Guide/dp/1736049119

[1] Datadog: https://www.datadoghq.com/

[2] Splunk: https://www.splunk.com/

[3] Elastic stack: https://www.elastic.co/elastic-stack

[4] Dapper, a Large-Scale Distributed Systems Tracing Infrastructure: https://research.google/pubs/pub36356/

[5] Distributed Systems Tracing with Zipkin: https://blog.twitter.com/engineering/en_us/a/2012/distributed-systems-tracing-with-zipkin.html

[6] Prometheus: https://prometheus.io/docs/introduction/overview/

[7] OpenTSDB - A Distributed, Scalable Monitoring System: http://opentsdb.net/

[8] Data model: https://prometheus.io/docs/concepts/data_model/

[9] Schema design for time-series data | Cloud Bigtable Documentation: https://cloud.google.com/bigtable/docs/schema-design-time-series

[10] MetricsDB: TimeSeries Database for storing metrics at Twitter: https://blog.twitter.com/engineering/en_us/topics/infrastructure/2019/metricsdb.html

[11] Amazon Timestream: https://aws.amazon.com/timestream/

[12] DB-Engines Ranking of time-series DBMS: https://db-engines.com/en/ranking/time+series+dbms

[13] InfluxDB: https://www.influxdata.com/

[14] etcd: https://etcd.io

[15] Service Discovery with Zookeeper: https://cloud.spring.io/spring-cloud-zookeeper/1.2.x/multi/multi_spring-cloud-zookeeper-discovery.html

[16] Amazon CloudWatch: https://aws.amazon.com/cloudwatch/

[17] Graphite: https://graphiteapp.org/

[18] Push vs Pull: http://bit.ly/3aJEPxE

[19] Pull doesn’t scale - or does it?: https://prometheus.io/blog/2016/07/23/pull-does-not-scale-or-does-it/

[20] Monitoring Architecture: https://developer.lightbend.com/guides/monitoring-at-scale/monitoring-architecture/architecture.html

[21] Push vs Pull in Monitoring Systems: https://giedrius.blog/2019/05/11/push-vs-pull-in-monitoring-systems/

[22] Pushgateway: https://github.com/prometheus/pushgateway

[23] Building Applications with Serverless Architectures: https://aws.amazon.com/lambda/serverless-architectures-learn-more/

[24] Gorilla: A Fast, Scalable, In-Memory Time Series Database: http://www.vldb.org/pvldb/vol8/p1816-teller.pdf

[25] Why We’re Building Flux, a New Data Scripting and Query Language: https://www.influxdata.com/blog/why-were-building-flux-a-new-data-scripting-and-query-language/

[26] InfluxDB storage engine: https://docs.influxdata.com/influxdb/v2.0/reference/internals/storage-engine/

[27] YAML: https://en.wikipedia.org/wiki/YAML

[28] Grafana Demo: https://play.grafana.org/

END

