druid.io + tranquility real-time task performance: a summary based on Double 11, 2020
2022-07-29 02:00:00 【master-dragon】
This post mainly summarizes my druid.io cluster operations work during Double 11, along with some reflections.
I drew everything into one diagram as a record.
Summary of problems
Automating operations?
Many problems are still handled by waiting for an alarm, judging the situation manually, and then operating by hand.
This can be automated. The judgment conditions should be abstracted into explicit logic (if machine metrics are missing, ask the operations team; if the system lacks its own metrics, develop them, and scripts are fine for this). The operations can be encapsulated (each operation paired with a fallback). This work can start now.
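As a rough illustration of the idea above, the sketch below pairs an abstracted condition with an encapsulated operation and its fallback. All names here (`Runbook`, `run`, the `kafka_lag` metric, and the 100,000 threshold) are hypothetical and only serve the example; they are not part of druid.io or tranquility.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Runbook:
    """One alarm-handling rule: a condition, an operation, and its fallback."""
    name: str
    condition: Callable[[Metrics], bool]  # abstracted judgment logic
    action: Callable[[], None]            # encapsulated operation
    rollback: Callable[[], None]          # fallback if the operation fails


def run(runbook: Runbook, metrics: Metrics) -> str:
    """Check the condition; on a hit, run the action and fall back on failure."""
    if not runbook.condition(metrics):
        return "skipped"
    try:
        runbook.action()
        return "applied"
    except Exception:
        runbook.rollback()
        return "rolled_back"


# Hypothetical rule: scale out when upstream lag exceeds an assumed threshold.
# The lambdas only record what would be done; real code would call a deploy tool.
log: List[str] = []
scale_out = Runbook(
    name="scale-out-on-lag",
    condition=lambda m: m.get("kafka_lag", 0) > 100_000,
    action=lambda: log.append("add tranquility instance"),
    rollback=lambda: log.append("remove tranquility instance"),
)
```

Calling `run(scale_out, metrics)` with the latest metrics returns `"skipped"`, `"applied"`, or `"rolled_back"`, so the same alarm that used to page a person can drive the operation directly.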
Repeated requirements / communication only after the fact?
We always think about how to solve a problem only once it has been exposed. In fact, these problems had been raised before; they were just ignored because they seemed minor or unimportant.
So simulate as early as possible and get hands-on. Simulate problems even when none have occurred yet. This is like considering boundary conditions when writing an algorithm: it takes foresight.
Insufficient design and ability?
The architecture design has flaws, and on top of that, ability is lacking: we want to change something but cannot, and have to go through a third party or keep researching in depth. Improve your ability first, then make the architecture better.
How does Double 11 differ from normal days?
Not much, because load testing and capacity expansion are prepared in advance. But accidents always happen; it depends on how thorough the preparation was, and there are temporary remedies for whatever was missed. The aim is always to keep the impact as small as possible. It is an opportunity to test the service's capacity, and also an opportunity to expose problems and improve.
There is nothing particularly different to report.
druid.io performance summary
What druid.io + tranquility fears most is two middleManagers going down at once (machine downtime or OutOfMemory). Judging from long operations experience, however, the probability of this is very small. First, machines rarely go down at the same time unless there is a power failure or a problem in the machine room. Second, there are enough middleManagers, real-time queries restrict very large queries, tasks have backups, and upstream lag alarms and other alarms give sufficient coverage.
During the whole promotion the tasks were very stable. Queries rose a lot (but query routing, rate limiting, machine capacity expansion, and all kinds of machine monitoring are in place, so capacity can be expanded at any time), and queries were also very stable. Overall I would rate the performance 9 out of 10, apart from a few incidents:
- Underestimation led to temporary expansion
When real-time task traffic rises, capacity is expanded, currently either manually or automatically. Given a time window and a correct estimate, 99% of real-time tasks are basically handled in advance. One 2C4G tranquility instance can consume 10,000 to 30,000 QPS from upstream, but an upstream Kafka real-time stream of >= 2,000,000 QPS really does take a lot of machines. Meanwhile, the number of Kafka partitions needs to be expanded within a reasonable range; too few and consumption cannot keep up.
Still, there will always be underestimates, and this Double 11 we hit one. Thanks to timely alarms and operations, a temporary expansion resolved it.
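The estimate above can be turned into simple arithmetic. The sketch below is a back-of-the-envelope calculation using the per-instance figures quoted in the post (10,000 to 30,000 QPS per 2C4G instance); the function name `instances_needed` and the `headroom` safety factor are assumptions added for illustration.

```python
import math


def instances_needed(peak_qps: float, per_instance_qps: float,
                     headroom: float = 1.0) -> int:
    """Number of instances required to absorb peak_qps.

    headroom is an assumed safety multiplier (e.g. 1.5 = 50% spare capacity).
    """
    if per_instance_qps <= 0:
        raise ValueError("per_instance_qps must be positive")
    return math.ceil(peak_qps * headroom / per_instance_qps)


# A 2,000,000 QPS upstream at the pessimistic and optimistic per-instance rates.
print(instances_needed(2_000_000, 10_000))                # → 200
print(instances_needed(2_000_000, 30_000))                # → 67
# The same pessimistic case with 50% headroom for an underestimated peak.
print(instances_needed(2_000_000, 10_000, headroom=1.5))  # → 300
```

The spread between 67 and 200 instances shows why the per-instance rate should be measured during load testing rather than guessed.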
- Data skew
This is a common problem. Continued expansion can solve it, and so can adding a random dimension. In an emergency, of course, we expand to relieve the skew. Afterwards, though, we still need data analysis and governance to avoid severe skew. At this stage all dimension values are already hashed, so in theory this is fine. However, the degree of skew has not been quantified, which may need to be predicted and handled properly later.
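To make the "random dimension" remedy concrete, here is a generic sketch of key salting. It is not druid.io's or tranquility's actual partitioner: the CRC32-based `partition_for`, the `#` separator, and the bucket counts are all assumptions chosen for the illustration.

```python
import random
import zlib
from collections import Counter
from typing import Iterable


def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic hash partitioning (CRC32 is used only for the demo)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def salted_key(key: str, salt_buckets: int, rng: random.Random) -> str:
    """Append a random salt so one hot key maps to several partitions."""
    return f"{key}#{rng.randrange(salt_buckets)}"


def distribution(keys: Iterable[str], num_partitions: int) -> Counter:
    """Count how many events land on each partition."""
    return Counter(partition_for(k, num_partitions) for k in keys)


rng = random.Random(42)  # fixed seed so the demo is repeatable
hot_events = ["hot"] * 1000

# Without salt, every event with the hot value lands on the same partition.
plain = distribution(hot_events, num_partitions=8)
# With a random salt, the same events spread over several partitions.
salted = distribution((salted_key(k, 16, rng) for k in hot_events),
                      num_partitions=8)

print(len(plain), max(plain.values()))
print(len(salted), max(salted.values()))
```

The largest bucket in each `Counter` is also one simple way to quantify the degree of skew, which the post notes has not yet been measured.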
Personal operations summary
My summary of this on-call shift: same as usual, never get flustered. It still comes down to analyzing the various metrics, pinpointing the problem, and then operating sensibly; the actions are just a little faster than usual.
Treat it like any other day; there is no need to panic.
---------- Friday, 2020-11-20 20:48:28 CST