On the first day of the new year, 3000 Apache servers went down
2022-07-01 05:41:00 【CSDN information】
Author | Ali Josie  Translator | Meniscus
Produced by | CSDN (ID: CSDNnews)
It is the first day of the new year, which also happens to be a Saturday, and you wake up to a pile of alerts saying the entire infrastructure is down! This is exactly what happened to one of my colleagues, and you can imagine his mood at the time.
Early in the morning
First and most important, we had to restore service and minimize the impact of the downtime. We restarted every Apache server, and fortunately they all came back without problems. The next step was to find the cause. Why did all the servers go down on the first day of the new year? Surely this could not be a coincidence?
We found the following log entries on every server:
AH00171: Graceful restart requested, doing restart
libgomp: could not create thread pool destructor.
What is libgomp? We first searched for the error online. Someone had asked about it on ServerFault, but nobody had answered, or at least there was nothing we could use. The question itself was a little odd, though: the asker said it happened on his server every 24 to 36 hours.
Reflection
Back to the error itself. We rotate the logs every morning so that each day starts with a fresh log file, and the rotation triggers a graceful restart of the server. It looked as though Apache had restarted successfully and then gone down again because of the libgomp error.
Hunting for an answer among the many search results was like looking for a needle in a haystack, so we started reading the libgomp source code to see what was going on. First, what is libgomp? According to its home page:
- “The GOMP project is an implementation of OpenMP for the C, C++, and Fortran compilers ... GOMP simplifies parallel programming on all GNU systems.”
So it is an implementation of OpenMP. How could it go wrong?
Searching the source code, we found only one place that emits this error message:
Clearly, it tries to create a thread key and something goes wrong. We checked the manual for pthread_key_create:
- “pthread_key_create creates a thread-specific data key that is visible to all threads in the process. The key values provided by pthread_key_create() are opaque objects used to locate thread-specific data. Although different threads can use the same key, the values bound to the key by pthread_setspecific() are maintained per thread and remain valid for the lifetime of the thread.”
Interesting! What about the return value?
“The pthread_key_create() function will fail in the following cases:
Insufficient system resources to create another thread-specific data key, or the total number of keys per process has reached the PTHREAD_KEYS_MAX limit.
Insufficient memory to create the key.”
Then we looked at the code to see what happens and what the value of PTHREAD_KEYS_MAX is:
So a key is simply a number from 0 to 1024 (exclusive) that is handed to the caller of pthread_key_create. Keys are allocated with a simple CAS, so there must also be a place where they are released. It seemed we had found the problem: we just needed to raise PTHREAD_KEYS_MAX. However, this value is a constant. We even found a post asking for a larger PTHREAD_KEYS_MAX:
- “pthread_key_create() refuses requests to create more than PTHREAD_KEYS_MAX pthread_key_t keys. The problem I ran into is that Apache on NetBSD cannot work with many modules because this value is too low. After running for a long time, the server ends up in a state where it can no longer serve requests.”
The problem described in this post resembles ours, so our hypothesis was probably right. But we still could not raise this value.
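The allocation behaviour described above can be illustrated with a toy model. This is a minimal sketch in Python, not glibc's actual code: `KeyTable` and `reloads_until_failure` are hypothetical names, and a plain list stands in for the CAS-managed slot table. It only shows why a process that creates one key per reload and never deletes it runs out after exactly 1,024 reloads.

```python
KEYS_MAX = 1024  # the PTHREAD_KEYS_MAX value discussed in the article


class KeyTable:
    """Toy stand-in for a per-process table of thread-specific data keys."""

    def __init__(self, size=KEYS_MAX):
        self.in_use = [False] * size

    def create(self):
        """Return the first free slot index, or None when the table is full."""
        for key, used in enumerate(self.in_use):
            if not used:
                self.in_use[key] = True
                return key
        return None  # the real call would report an error such as EAGAIN

    def delete(self, key):
        """Release a slot so that a later create() can reuse it."""
        self.in_use[key] = False


def reloads_until_failure(table, release_on_reload, cap=5000):
    """Count successful reloads (one key each) until create() fails or cap is hit."""
    reloads = 0
    while reloads < cap:
        key = table.create()
        if key is None:
            break
        if release_on_reload:
            table.delete(key)
        reloads += 1
    return reloads


print(reloads_until_failure(KeyTable(), release_on_reload=False))  # 1024 (leaking)
print(reloads_until_failure(KeyTable(), release_on_reload=True))   # 5000 (cap reached)
```

In the leaking case the count stops at the table size; when every key is released again, the loop only ends because of the artificial cap.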
We then investigated why reloading Apache ends up in this libgomp code. It turned out that reloading Apache makes mod_php load a module named Imagick. What is Imagick? It is a PHP extension that uses the ImageMagick library to create and modify images.
Doubts
It seemed that turning off multithreading in Imagick would avoid libgomp altogether, so the limit would never be reached, and that this only required setting an environment variable. The fix looked very safe, but big questions remained:
Why did it happen on January 1? And at such a scale: was it really a coincidence?
Why had everything worked fine for so many years? Could an update be responsible?
Clearly, fixing the problem this way did not satisfy us. Too many mysteries remained. We read further into the Apache HTTP Server and libgomp code, but everything looked normal, or at least we found nothing wrong. We could not reproduce the problem, and it was about to become an unsolved mystery. We searched for many loosely related keywords and even found some posts about the Year 2038 problem.
But none of it helped. We even suspected a maximum uptime limit in Apache.
Finally, we checked the Imagick changelog and found this:
“Several changes to reduce the occurrence of GOMP segfaults, including:
- Call omp_pause_resource_all, if available, during shutdown.
- Added imagick.shutdown_sleep_count (default 10) and imagick.set_single_thread (default On). Both reduce segfaults on shutdown.”
This matched our guess: setting Imagick's maximum number of threads to 1 should solve the problem. But it did not answer the biggest question, the one about timing.
A flash of insight
After more searching in increasingly odd directions, we decided to check whether anyone else had run into this problem in January.

The first result was the key that unlocked everything!
Then it hit us: what happens if the thread key is never released? Is that possible? We had never seen this problem since the dependency was deployed. So we did the math: with 1,024 keys and one reload every morning, it takes about two years and ten months to exceed 1,024 reloads. What if one thread key had been allocated every morning for the past 1,024 days and never released?
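The back-of-the-envelope calculation above can be checked with a few lines of Python. This is only a sketch under the article's assumptions: exactly one key leaks per daily reload, and the table runs out on 2022-01-01. The function name `first_leak_day` is made up for illustration and does not claim to recover the servers' real start date.

```python
from datetime import date, timedelta

KEYS_MAX = 1024  # number of available thread keys per process


def first_leak_day(failure_day, keys=KEYS_MAX, reloads_per_day=1):
    """Day on which the first key would have leaked, given a steady daily leak."""
    days_to_exhaust = keys // reloads_per_day
    return failure_day - timedelta(days=days_to_exhaust)


print(first_leak_day(date(2022, 1, 1)))  # 2019-03-14
print(round(KEYS_MAX / 365.25, 2))       # 2.8 (years)
```

About 2.8 years is roughly two years and ten months, which agrees with the estimate in the text.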
At last we saw a glimmer of light: we had a way to reproduce the problem. We set up a test environment with the same server configuration and ran a simple script:
for i in {1..1100}; do sudo systemctl reload apache2; done
This reloads apache2 1,100 times (76 more than 1,024, as a safety margin). As expected, the problem appeared!
After Apache had been reloaded 1,024 times, libgomp raised the error. Now every question had an answer.
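To record exactly which reload brings the service down, the reproduction loop can be wrapped in a small harness. The following is a hedged sketch, not the author's script: the function names are hypothetical, it assumes systemd and passwordless sudo on a test machine, and it assumes `systemctl is-active` reflects the failure immediately, which may not hold if Apache takes a moment to exit.

```python
import subprocess


def systemctl_reload(unit="apache2"):
    """Reload the unit, as the shell loop in the article does."""
    subprocess.run(["sudo", "systemctl", "reload", unit], check=False)


def systemctl_is_active(unit="apache2"):
    """Return True while systemd reports the unit as active."""
    result = subprocess.run(["systemctl", "is-active", "--quiet", unit], check=False)
    return result.returncode == 0


def count_reloads_until_down(reload, is_active, max_reloads=1100):
    """Reload repeatedly; return the reload number after which the service is down."""
    for attempt in range(1, max_reloads + 1):
        reload()
        if not is_active():
            return attempt
    return None  # the service survived every reload
```

On a disposable test server, `count_reloads_until_down(systemctl_reload, systemctl_is_active)` would run the experiment. Passing the two actions in as arguments keeps the loop itself testable without touching a real service.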
We then checked whether setting the environment variable MAGICK_THREAD_LIMIT (OMP_THREAD_LIMIT in newer versions of Imagick) would help. Unfortunately, the problem remained. So the next step was to upgrade Imagick to a version that contains the fix (v3.5.0+). Fortunately, after the upgrade we could reload thousands of times without any problem.
Verification
One question was still open: does the new version of Imagick delete the key? To answer it, we used ltrace, a tool that intercepts and records the library calls a program makes. We started by running ltrace on a server with the old version of Imagick (v3.4.4):
ltrace -x 'pthread_key_*@libpthread.so.0' -L -c /usr/sbin/apache2 -k graceful
-x is a filter that selects functions in a specific library, here pthread_key_create and pthread_key_delete in libpthread.so.0.
-L tells ltrace to ignore its default filter, which reduces noise.
-c prints a summary of all results at the end. /usr/sbin/apache2 -k graceful is equivalent to systemctl reload apache2.
The result was no surprise:
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
100.00    0.000157         157         1 pthread_key_create
------ ----------- ----------- --------- --------------------
100.00    0.000157                     1 total
Version 3.4.4 only calls pthread_key_create and never deletes the key!
Then we ran the same command with the new version (v3.6.0):
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
------ ----------- ----------- --------- --------------------
100.00    0.000000                     0 total
It seems the new version does not use multithreading here, so no key is created at all.
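The comparison of the two ltrace summaries can also be automated. The sketch below is an illustration rather than part of the original investigation: `parse_ltrace_summary` and `leaked_keys` are hypothetical helper names, and the sample text reproduces the two summaries from the article with assumed column spacing.

```python
def parse_ltrace_summary(text):
    """Map function name to call count from an `ltrace -c` summary table."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have five fields: % time, seconds, usecs/call, calls, function.
        if len(fields) == 5 and fields[3].isdigit():
            counts[fields[4]] = int(fields[3])
    return counts


def leaked_keys(counts):
    """Number of keys created but never deleted, according to the counts."""
    return counts.get("pthread_key_create", 0) - counts.get("pthread_key_delete", 0)


OLD_VERSION = """\
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
100.00    0.000157         157         1 pthread_key_create
------ ----------- ----------- --------- --------------------
100.00    0.000157                     1 total
"""

NEW_VERSION = """\
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
------ ----------- ----------- --------- --------------------
100.00    0.000000                     0 total
"""

print(leaked_keys(parse_ltrace_summary(OLD_VERSION)))  # 1
print(leaked_keys(parse_ltrace_summary(NEW_VERSION)))  # 0
```

A non-zero result per reload is the signature of the leak: one key is created each time and none is returned.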
Summary
The case was finally closed. But why had the servers gone so long without a restart? We decided not to spend time on that question, because “once you have eliminated the impossible, whatever remains, however improbable, must be the truth.”
Solving this problem left me with a strange feeling. I am proud that we solved it, but there are many long-running servers in the world, and nobody knows when they will run into the same problem.
Original article: https://alijosie.medium.com/this-is-why-our-3000-apache-servers-went-down-on-the-first-day-of-2022-3cc5e9639587
This article was translated by CSDN. Please credit the source when reprinting.