当前位置:网站首页>[system design] index monitoring and alarm system
[system design] index monitoring and alarm system
2022-07-07 11:30:00 【Dotnet cross platform】
In this paper , We will discuss how to design a scalable index monitoring and warning system . A good monitoring and warning system , Observability of infrastructure , High availability , Reliability plays a key role .
The following figure shows some popular indicator monitoring and alarm services in the market .
Next , We will design a similar service , It can be used inside large companies .
The design requirements
Start with a story about Xiao Ming's interview .
interviewer : If you are asked to design an indicator monitoring and alarm system , What would you do ?
Xiao Ming : well , This system is for internal use , Or design image Datadog such SaaS service ?
interviewer : Good question , At present, this system is only used internally .
Xiao Ming : What indicator information do we want to collect ?
interviewer : Including the index information of the operating system , Middleware metrics , And the running application services qps These indicators .
Xiao Ming : How large is the infrastructure we monitor with this system ?
interviewer :1 100 million active users ,1000 Server pools , Each pool 100 Taiwan machine .
Xiao Ming : How long does the index data need to be saved ?
interviewer : We want to keep it for a year .
Xiao Ming : ok , For long-term storage , Can the resolution of index data be reduced ?
interviewer : Good question , For the latest data , Will save 7 God ,7 It can be reduced to 1 Minute resolution , And then 30 After heaven , May, in accordance with the 1 Hour resolution for further summary .
Xiao Ming : Which alarm channels are supported ?
interviewer : mail , electric nailing , Enterprise WeChat ,Http Endpoint.
Xiao Ming : Do we need to collect logs ? There is also the need to support link tracking in distributed systems ?
interviewer : Currently focusing on indicators , Others are not considered for the time being .
Xiao Ming : well , Probably understand .
To sum up , The infrastructure being monitored is large , And indicators that need to support various dimensions . in addition , The overall system also has higher requirements , Consider scalability , Low latency , Reliability and flexibility .
Basic knowledge of
An indicator monitoring and alarm system usually consists of five components , As shown in the figure below
1. data collection : Collect indicator data from different data sources .
2. The data transfer : Send the index data to the index monitoring system .
3. data storage : Store metrics .
4. The alarm : Analyze the received data , When an abnormality is detected, an alarm notification can be sent .
5. visualization : Visualization page , In graphics , Present data in the form of charts .
Data patterns
Indicator data is usually saved as a time series , It contains a set of values and their associated timestamps .
The sequence itself can be uniquely identified by name , It can also be identified by a group of labels .
Let's look at two examples .
Example 1: Production server i631 stay 20:00 Of CPU What's the load ?
The data points marked in the above figure can be represented in the following format
In the example above , The time series consists of the index name , label (host:i631,env:prod), The timestamp and the corresponding value constitute .
Example 2: In the past 10 Within minutes, all in Shanghai Web The average number of servers CPU What's the load ?
conceptually , We will find something similar to the following
CPU.load host=webserver01,region=shanghai 1613707265 50
CPU.load host=webserver01,region=shanghai 1613707270 62
CPU.load host=webserver02,region=shanghai 1613707275 43
We can calculate the average by the value at the end of each line above CPU load , The above data format is also called row Protocol . It is the input format commonly used by many monitoring software on the market ,Prometheus and OpenTSDB There are two examples .
Each time series contains the following :
• Index name , String type metric name .
• An array of key value pairs , Label indicating the indicator ,List<key,value>
• An array containing time stamps and corresponding values ,List <value, timestamp>
data storage
Data storage is the core of design , It is not recommended to build your own storage system , Nor is it recommended to use a conventional storage system ( such as MySQL) To finish the work .
In theory , Conventional databases can support time series data , But it requires expert level tuning of the database , To meet the needs of scenarios with a large amount of data .
Specifically speaking , Relational databases do not optimize time series data , There are several reasons
• Calculate the average value in the rolling time window , Need to write complex and difficult to read SQL.
• To support labels (tag/label) data , We need to add an index to each tag .
• by comparison , Relational databases do not perform well in continuous high concurrency write operations .
that NoSQL ? ? Theoretically , A few in the market NoSQL Database can effectively process time series data . such as Cassandra and Bigtable Fine . however , Want to meet the needs of efficient storage and query of data , And building scalable systems , Need to understand each NoSQL How it works inside .
by comparison , Time series database specially optimized for time series data , More suitable for this kind of scene .
OpenTSDB It's a distributed temporal database , But because it's based on Hadoop and HBase, function Hadoop/HBase Clustering also brings complexity .Twitter Used MetricsDB Time series database stores index data , And Amazon offers Timestream Time series database service .
according to DB-engines The report of , The two most popular time series databases are InfluxDB and Prometheus , They can store a large amount of time series data , And support the real-time analysis of these data quickly .
As shown in the figure below ,8 nucleus CPU and 32 GB RAM Of InfluxDB Can handle over per second 250,000 Time to write .
High level design
• Metrics Source Source of indicators , Application service , database , Message queuing, etc .
• Metrics Collector Indicator collector .
• Time series DB Time series database , Store metrics .
• Query Service Query service , Provide index query interface .
• Alerting System The alarm system , When an exception is detected , Send alert notification .
• Visualization System visualization , Show the indicators in the form of charts .
In depth design
Now? , Let's focus on the data collection process . There are mainly two ways of pushing and pulling .
Pull mode
The figure above shows data collection using pull mode , The data collector is set separately , Regularly pull index data from running applications .
Here's a question , How does the data collector know the address of each data source ? A better solution is to introduce the service registration and discovery component , such as etcd,ZooKeeper, as follows
The following figure shows our current data pull process .
1. The indicator collector obtains metadata from the service discovery component , Including pulling interval ,IP Address , Overtime , Retry parameters, etc .
2. The indicator collector passes the set HTTP Endpoint obtains indicator data .
In a scenario with a large amount of data , A single indicator collector is difficult to support , We must use a set of indicator collectors . But how should multiple collectors and multiple data sources coordinate , In order to work normally without conflict ?
Consistent hashing is very suitable for this scenario , We can map the data source to the hash ring , as follows
This ensures that each indicator collector has a corresponding data source , Work with each other without conflict .
Push mode
As shown in the figure below , In push mode , Various indicator data sources (Web application , database , Message queue ) Send directly to the indicator collector .
In push mode , You need to install the collector agent on each monitored server , It can collect the indicator data of the server , Then send it to the indicator collector regularly .
Which is better, push or pull ? There is no fixed answer , Both schemes are feasible , Even in some complex scenes , You need to support push and pull at the same time .
Extended data transmission
Now? , Let's focus on indicator collectors and time series databases . Whether you use push or pull mode , In a scenario where a large amount of data needs to be received , The indicator collector is usually a service cluster .
however , When the chronological database is unavailable , There is a risk of data loss , therefore , We introduced Kafka Message queuing components , Here's the picture
The indicator collector sends the indicator data to Kafka Message queue , Then consumers or stream processing services process data , such as Apache Storm、Flink and Spark, Finally, push it to the timing database .
Index calculation
Indicators can be aggregated and calculated in multiple places , See how they are different .
• Client agent : The collection agent installed on the client only supports simple aggregation logic .
• Transmission pipeline : Before the data is written to the timing database , We can use Flink Stream processing services perform aggregate Computing , Then write only the summarized data , This will greatly reduce the amount of writing . But because we don't store the original data , So the data accuracy is lost .
• Query end : We can aggregate and query the original data in real time at the query end , But this way of query is not very efficient .
Temporal database query language
Most popular indicator monitoring systems , such as Prometheus and InfluxDB Not used SQL, It has its own query language . One of the main reasons is that it is difficult to pass SQL To query time series data , And it's hard to read , Like the following SQL Can you see what data you are looking for ?
select id,
temp,
avg(temp) over (partition by group_nr order by time_read) as rolling_avg
from (
select id,
temp,
time_read,
interval_group,
id - row_number() over (partition by interval_group order by time_read) as group_nr
from (
select id,
time_read,
"epoch"::timestamp + "900 seconds"::interval * (extract(epoch from time_read)::int4 / 900) as interval_group,
temp
from readings
) t1
) t2
order by time_read;
by comparison , InfluxDB Used for timing data Flux The query language will be simpler and better understood , as follows
from(db:"telegraf")
|> range(start:-1h)
|> filter(fn: (r) => r._measurement == "foo")
|> exponentialMovingAverage(size:-10s)
Data encoding and compression
Data encoding and compression can greatly reduce the size of data , Especially in time series database , Here is a simple example .
Because the time interval of general data collection is fixed , So we can store a basic value together with the increment , such as 1610087371, 10, 10, 9, 11 such , It can take up less space .
Down sampling
Down sampling is the process of converting high-resolution data into low-resolution data , This can reduce disk usage . Because our data retention period is 1 year , We can down sample the old data , This is an example :
• 7 Day data , No sampling .
• 30 Day data , Down sampling to 1 Minute resolution
• 1 Annual data , Down sampling to 1 Hour resolution .
Let's look at another specific example , It is the 10 Second resolution data are aggregated into 30 Second resolution .
Raw data
After downsampling
Alarm service
Let's take a look at the design of alarm service , And the workflow .
1. load YAML Format alarm configuration file to cache .
- name: instance_down
rules:
# The service is unavailable for more than 5 Minute trigger alarm .
- alert: instance_down
expr: up == 0
for: 5m
labels:
severity: page2. The alert manager reads the configuration from the cache .
3. According to the alarm rules , Query indicators according to the set time and conditions , If the threshold is exceeded , The alarm is triggered .
4. Alert Store Save the status of all alarms ( Hang up , Trigger , resolved ).
5. Qualified alarms will be added to Kafka in .
6. Consumption queue , According to the alarm rules , Send alert information to different notification channels .
visualization
Visualization is built on the data layer , Indicator data can be displayed on the indicator dashboard , The alarm information can be displayed on the alarm dashboard . The following figure shows some indicators , Number of requests from the server 、 Memory /CPU utilization 、 Page load time 、 Traffic and login information .
Grafana It can be a very good visualization system , We can use it directly .
summary
In this paper , We introduce the design of index monitoring and alarm system . At a high level , We discussed data collection 、 Time series database 、 Alarm and visualization , The following figure is our final design :
Reference
[0] System Design Interview Volume 2: https://www.amazon.com/System-Design-Interview-Insiders-Guide/dp/1736049119
[1] Datadog: https://www.datadoghq.com/
[2] Splunk: https://www.splunk.com/
[3] Elastic stack: https://www.elastic.co/elastic-stack
[4] Dapper, a Large-Scale Distributed Systems Tracing Infrastructure: https://research.google/pubs/pub36356/
[5] Distributed Systems Tracing with Zipkin: https://blog.twitter.com/engineering/en_us/a/2012/distributed-systems-tracing-with-zipkin.html
[6] Prometheus: https://prometheus.io/docs/introduction/overview/
[7] OpenTSDB - A Distributed, Scalable Monitoring System: http://opentsdb.net/
[8] Data model: : https://prometheus.io/docs/concepts/data_model/
[9] Schema design for time-series data | Cloud Bigtable Documentation https://cloud.google.com/bigtable/docs/schema-design-time-series
[10] MetricsDB: TimeSeries Database for storing metrics at Twitter: https://blog.twitter.com/engineering/en_us/topics/infrastructure/2019/metricsdb.html
[11] Amazon Timestream: https://aws.amazon.com/timestream/
[12] DB-Engines Ranking of time-series DBMS: https://db-engines.com/en/ranking/time+series+dbms
[13] InfluxDB: https://www.influxdata.com/
[14] etcd: https://etcd.io
[15] Service Discovery with Zookeeper https://cloud.spring.io/spring-cloud-zookeeper/1.2.x/multi/multi_spring-cloud-zookeeper-discovery.html
[16] Amazon CloudWatch: https://aws.amazon.com/cloudwatch/
[17] Graphite: https://graphiteapp.org/
[18] Push vs Pull: http://bit.ly/3aJEPxE
[19] Pull doesn’t scale - or does it?: https://prometheus.io/blog/2016/07/23/pull-does-not-scale-or-does-it/
[20] Monitoring Architecture: https://developer.lightbend.com/guides/monitoring-at-scale/monitoring-architecture/architecture.html
[21] Push vs Pull in Monitoring Systems: https://giedrius.blog/2019/05/11/push-vs-pull-in-monitoring-systems/
[22] Pushgateway: https://github.com/prometheus/pushgateway
[23] Building Applications with Serverless Architectures https://aws.amazon.com/lambda/serverless-architectures-learn-more/
[24] Gorilla: A Fast, Scalable, In-Memory Time Series Database: http://www.vldb.org/pvldb/vol8/p1816-teller.pdf
[25] Why We’re Building Flux, a New Data Scripting and Query Language: https://www.influxdata.com/blog/why-were-building-flux-a-new-data-scripting-and-query-language/
[26] InfluxDB storage engine: https://docs.influxdata.com/influxdb/v2.0/reference/internals/storage-engine/
[27] YAML: https://en.wikipedia.org/wiki/YAML
[28] Grafana Demo: https://play.grafana.org/
END
Made a .NET The learning website , Covering distributed systems , Data structure and algorithm , Design patterns , operating system , Computer network, etc , And job recommendation and interview experience sharing , Welcome to flirt .
reply dotnet Get the website address .
reply Interview questions obtain .NET Interview questions .
reply Programmer sideline Get a sideline guide for programmers .
边栏推荐
- [untitled]
- Process control (creation, termination, waiting, program replacement)
- 关于jmeter中编写shell脚本json的应用
- Vuthink proper installation process
- Case study of Jinshan API translation function based on retrofit framework
- 分布式数据库主从配置(MySQL)
- From pornographic live broadcast to live broadcast E-commerce
- 请查收.NET MAUI 的最新学习资源
- How to add aplayer music player in blog
- 如何在博客中添加Aplayer音乐播放器
猜你喜欢
Activity lifecycle
Talk about SOC startup (x) kernel startup pilot knowledge
数据库同步工具 DBSync 新增对MongoDB、ES的支持
Table replication in PostgreSQL
Apprentissage comparatif non supervisé des caractéristiques visuelles par les assignations de groupes de contrôle
Vscode 尝试在目标目录创建文件时发生一个错误:拒绝访问【已解决】
Technology sharing | packet capturing analysis TCP protocol
关于SIoU《SIoU Loss: More Powerful Learning for Bounding Box Regression Zhora Gevorgyan 》的一些看法及代码实现
Verilog design responder [with source code]
What is cloud computing?
随机推荐
Array object sorting
Force buckle 1002 Find common characters
Verilog realizes nixie tube display driver [with source code]
PostgreSQL中的表复制
Design intelligent weighing system based on Huawei cloud IOT (STM32)
关于jmeter中编写shell脚本json的应用
在我有限的软件测试经历里,一段专职的自动化测试经验总结
oracle常见锁表处理方式
mif 文件格式记录
基于Retrofit框架的金山API翻译功能案例
The annual salary of general test is 15W, and the annual salary of test and development is 30w+. What is the difference between the two?
常用sql语句整理:mysql
Use metersphere to keep your testing work efficient
关于测试人生的一站式发展建议
從色情直播到直播電商
自动化测试框架
《论文阅读》Neural Approaches to Conversational AI(1)
Electron adding SQLite database
verilog设计抢答器【附源码】
Apprentissage comparatif non supervisé des caractéristiques visuelles par les assignations de groupes de contrôle