druid.io + tranquility real-time task performance: a summary based on Double 11 2020
2022-07-29 02:00:00 【master-dragon】
This post sums up my druid.io cluster operations work during Double 11, along with some reflections.
I drew one diagram to make this easy to remember.
Summary of problems
Operations automation?
Many problems are still handled by receiving an alert, judging the situation manually, and only then acting.
This can be automated. The judgment conditions should be abstracted into explicit logic (if machine-level metrics are missing, ask the operations team for them; if the system's own metrics are missing, develop them, and a script is enough). The operations can be encapsulated, with every operation paired with a rollback. With those two pieces in place, you can start.
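As a minimal sketch of that idea (not the tooling used on the cluster), a runbook can bundle a condition, an action, and its rollback. All names here (`Runbook`, `run_runbook`, the `lag` metric, the `verify` callback) are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Runbook:
    """One automated operation: when to fire, what to do, how to undo it."""
    name: str
    condition: Callable[[Dict[str, float]], bool]  # judgment logic over metrics
    action: Callable[[], None]                     # the encapsulated operation
    rollback: Callable[[], None]                   # the paired fallback


def run_runbook(runbook: Runbook,
                metrics: Dict[str, float],
                verify: Callable[[], bool]) -> str:
    """Apply the action if the condition holds; roll back if verification fails."""
    if not runbook.condition(metrics):
        return "skipped"
    runbook.action()
    if verify():
        return "applied"
    runbook.rollback()
    return "rolled_back"
```

For example, a runbook whose condition is `lambda m: m["lag"] > 100_000` could wrap a scale-out script as its action and the matching scale-in script as its rollback, so the manual judgment step becomes explicit and testable.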
Requirements raised repeatedly, communication only after the fact?
We tend to think about how to solve a problem only once it has been exposed. In fact, these problems had been raised before; they were ignored because they seemed minor or unimportant.
So simulate early and get hands-on: simulate failures even when nothing is failing. This is like considering boundary conditions when writing an algorithm. It takes foresight.
Insufficient design and ability?
The architecture design has flaws, and on top of that comes a lack of ability: we want to change something but cannot, so we have to go through a third party or keep researching in depth. Improve your ability first, then improve the architecture.
How does Double 11 differ from normal days?
Not by much, because load testing and capacity expansion are prepared in advance. Still, accidents always happen. How well they go depends on how thorough the planning was, and temporary remedies cover whatever was missed, always aiming for the least impact. Treat the event as an opportunity to test service capability, and as an opportunity to expose problems and improve.
Beyond that, there is no special difference to speak of.
druid.io performance summary
What druid.io + tranquility fears most is two middleManagers going down at once (machine failure or OutOfMemory). Judging from long operations experience, the probability is very small. First, simultaneous machine failures are unlikely unless there is a power outage or a problem in the data center. Second, there are enough middleManagers, real-time queries are restricted for some very large queries, the tasks have replicas, and there are alerts on upstream lag and more, so alerting is sufficient.
During the whole promotion the tasks were very stable. Query volume rose a lot, but with query routing, rate limiting, machine capacity expansion, and all kinds of machine monitoring in place (capacity can be expanded at any time), queries were also very stable. Overall I would rate the performance 9 out of 10, apart from a few incidents:
- Underestimation leading to emergency scale-out
When real-time task traffic rises, scale out; at present this is done manually or automatically. Given a time window and a correct estimate, about 99% of real-time tasks are handled in advance. One 2C4G tranquility instance can consume 10,000 to 30,000 QPS from upstream (written 1-3w, where w = 10,000). For an upstream kafka real-time stream of >= 2,000,000 QPS (200w), however, it really takes a lot of machines. The kafka partition count also needs to be expanded to a reasonable range; with too few partitions, consumption cannot keep up.
There will always be underestimates, though, and this Double 11 we hit one. Thanks to timely alerts and prompt action, the emergency scale-out resolved it.
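The figures above allow a rough sizing check. The sketch below is plain arithmetic, not a druid.io or tranquility API: `instances_needed` and the `headroom` factor are hypothetical names, and the per-instance throughput is the 10,000 to 30,000 QPS range quoted for a 2C4G instance.

```python
import math


def instances_needed(peak_qps: float,
                     per_instance_qps: float,
                     headroom: float = 1.0) -> int:
    """Instances required to absorb peak_qps.

    headroom is an assumed safety multiplier (1.0 means no margin).
    """
    if per_instance_qps <= 0:
        raise ValueError("per_instance_qps must be positive")
    return math.ceil(peak_qps * headroom / per_instance_qps)


# A 2,000,000 QPS stream at the optimistic and pessimistic per-instance rates.
best_case = instances_needed(2_000_000, 30_000)   # 67
worst_case = instances_needed(2_000_000, 10_000)  # 200
```

The threefold spread between 67 and 200 instances shows why a per-instance figure measured under load testing matters: an estimate taken at the optimistic end leaves little room when traffic exceeds the forecast.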
- Data skew
This is a common problem. Continued scale-out can solve it, and so can adding a random dimension. In an emergency the answer is of course to scale out and relieve the skew. Afterwards, data analysis and governance are still needed to avoid severe skew. At this stage the skewed data is already hashed on all dimension values, so in theory it is fine and the skew is not at the order-of-magnitude level. Later this may need to be predicted in advance and handled properly.
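The random-dimension remedy can be illustrated with a generic sketch. It does not use tranquility's partitioner; `partition_for`, `salted_key`, the `#` separator, and the bucket counts are all hypothetical, and `zlib.crc32` stands in for whatever hash the real pipeline uses.

```python
import random
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic hash partitioning (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def salted_key(key: str, salt_buckets: int, rng: random.Random) -> str:
    """Append a random salt so one hot key spreads over several partitions."""
    return f"{key}#{rng.randrange(salt_buckets)}"


rng = random.Random(0)
hot_events = ["hot"] * 1000

# Without a salt, every event for the hot key lands on the same partition.
plain = Counter(partition_for(k, 8) for k in hot_events)

# With a salt, the same events are spread across partitions.
salted = Counter(partition_for(salted_key(k, 16, rng), 8) for k in hot_events)
```

The trade-off is that the extra dimension raises cardinality, so roll-up at ingestion becomes less effective and queries have to aggregate across the salt values.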
Personal operations summary
My takeaway from this on-call shift: as usual, never get flustered. The work is still analyzing the various metrics, focusing on the problem, and then acting sensibly; the actions are just a little faster than usual.
As usual, there is no need to panic.
---------- Friday, 2020-11-20 20:48:28 CST