On the first day of the new year, 3000 Apache servers went down
2022-07-01 05:41:00 【CSDN information】
Author | Ali Josie    Translator | Meniscus
Produced by | CSDN (ID: CSDNnews)
It was the first day of the new year, and a Saturday at that. Imagine waking up to a pile of alerts saying the entire infrastructure is down. That is exactly what happened to one of my colleagues, and you can imagine his mood.
Early in the morning
First and foremost, we had to restore service and minimize the impact of the outage. We restarted all the Apache servers and, fortunately, they came back without problems. The next step was to find the cause. Why did every server go down on the first day of the new year? Surely this could not be a coincidence?
We found the following log entries on every server:
AH00171: Graceful restart requested, doing restart
libgomp: could not create thread pool destructor.
What is libgomp? We first searched for the error online. Someone had asked about it on ServerFault, but nobody had answered, or at least there was nothing we could use. The question itself was a little odd, though: the poster said it happened on his server every 24 to 36 hours.
Reflection
Back to the error itself. We rotate logs every morning so that each day starts with a fresh log file, and the rotation triggers a graceful restart of the server. It looked as though Apache restarted successfully and then went down again because of the libgomp error.
Hunting for an answer among the flood of search results was like looking for a needle in a haystack, so we started reading the libgomp source code to see what was going on. First of all, what is libgomp? According to its home page:
- "The GOMP project is an implementation of OpenMP for the C, C++, and Fortran compilers ... GOMP simplifies parallel programming on all GNU systems."
So it is an implementation of OpenMP. How could that go wrong?
We searched the source code and found only one place that emits this error message. Clearly, it tries to create a thread key with pthread_key_create, and something goes wrong. We checked the manual for pthread_key_create:
- "pthread_key_create creates a thread-specific data key that is visible to all threads in the process. The key values provided by pthread_key_create() are opaque objects used to locate thread-specific data. Although different threads may use the same key, the values bound to the key by pthread_setspecific() are maintained per thread and persist for the lifetime of that thread."
Interesting! What about the return value?
- "The pthread_key_create() function fails in the following cases:
The system lacks the resources to create another thread-specific data key, or the total number of keys per process has reached the PTHREAD_KEYS_MAX limit.
There is not enough memory to create the key."
We then read the code to see how keys are allocated and what the value of PTHREAD_KEYS_MAX is. A key is simply a number between 0 and 1024 (1024 excluded) that is handed to the caller of pthread_key_create. Keys are allocated with a simple CAS (compare-and-swap), so there must be a place where they are released. It seemed we had found the problem: we just needed to increase PTHREAD_KEYS_MAX. However, this value is a constant. We even found a post asking for a larger PTHREAD_KEYS_MAX:
- "pthread_key_create() refuses requests to create more than PTHREAD_KEYS_MAX pthread_key_t keys. The problem I ran into is that Apache on NetBSD cannot work with multiple modules because this value is too low. After running for a long time, the server ends up in a state where it can no longer serve requests."
The problem described in this post is similar to ours, so our hypothesis may be correct. But we still cannot increase this value.
We began to investigate why reloading Apache ends up in this libgomp code. It turns out that reloading Apache causes mod_php to load a module named Imagick. What is Imagick? It is a PHP extension that uses the ImageMagick library to create and modify images.
Doubt
It seemed that turning off multithreading in Imagick would avoid libgomp altogether, so the limit would never be hit, and all it takes is setting an environment variable. This looked like a very safe fix, but we still had some big questions:
Why did it happen on January 1? And on such a large scale: was that really a coincidence?
Why had everything worked fine for so many years? Could an update have caused it?
Clearly, fixing the problem this way did not satisfy us; too many mysteries remained. We read further into the Apache HTTP Server and libgomp code, but everything looked normal, or at least we found nothing wrong. We could not reproduce the problem, and it was well on its way to becoming an unsolved mystery. We searched for many loosely related keywords and even found posts about the Year 2038 problem.
But none of this helped. We even suspected a limit on Apache's maximum uptime.
Finally, we checked the Imagick changelog and found this:
"Multiple changes to reduce the occurrence of GOMP segfaults, including:
Calling omp_pause_resource_all during shutdown, if possible.
Added imagick.shutdown_sleep_count (default 10) and imagick.set_single_thread (default On). Both reduce segfaults during shutdown."
This matched our guess: setting Imagick's maximum number of threads to 1 solves the problem. But it did not answer the biggest question, the one about timing.
Epiphany
After searching for even more odd things, we wanted to see whether anyone else had run into this problem in January.
The first article was the key that unlocked everything!
Then it hit us: what if the thread keys are never released? Is that possible? The problem had never occurred since the dependency was deployed... So we did the math again: with 1024 keys and one reload every morning, it takes roughly two years and ten months to exceed 1024 reloads. What if one thread key had been allocated every morning for the past 1024 days and never released?
At last, a glimmer of light. We had finally found a way to reproduce the problem. We built a test environment with the same server configuration and simply ran this script:
for i in {1..1100}; do sudo systemctl reload apache2; done
This reloads apache2 1100 times (76 more than 1024, as a margin). And, as expected, the problem appeared!
After Apache had been reloaded 1024 times, libgomp raised the error. Now every question had an answer.
Next we tried adding the environment variable MAGICK_THREAD_LIMIT (OMP_THREAD_LIMIT in newer versions of Imagick). Unfortunately, the problem remained. So the next step was to upgrade Imagick to a version that fixes the problem (v3.5.0+). Luckily, after the upgrade, reloading thousands of times caused no problems.
Check
One question was still open: does the new version of Imagick delete the key? To answer it, we used ltrace, a tool that intercepts and records the library calls a program makes. We started by running ltrace on a server with the old version of Imagick (v3.4.4):
ltrace -xpthread_key_*@libpthread.so.0 -L -c /usr/sbin/apache2 -k graceful
-x takes a filter that selects functions in a specific library; here it matches pthread_key_create and pthread_key_delete in libpthread.so.0.
-L tells ltrace to ignore the default filter, which reduces noise.
-c summarizes all results at the end. And /usr/sbin/apache2 -k graceful is equivalent to systemctl reload apache2.
The result was not surprising:
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
100.00    0.000157         157         1 pthread_key_create
------ ----------- ----------- --------- --------------------
100.00    0.000157                     1 total
Version 3.4.4 only calls pthread_key_create and never deletes the key!
Then we ran the same command on the new version (v3.6.0):
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
------ ----------- ----------- --------- --------------------
100.00    0.000000                     0 total
It turns out that the new version does not use multithreading, so no key is created at all.
Summary
The problem was finally solved, but why had the servers gone so long without a full restart? We decided not to spend time on that question, because "once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth."
It felt strange once the problem was solved. I am proud that we solved it, but there are many long-running servers in the world, and who knows when they will run into the same problem.
Original article: https://alijosie.medium.com/this-is-why-our-3000-apache-servers-went-down-on-the-first-day-of-2022-3cc5e9639587
This article was translated by CSDN. Please credit the source when reposting.