On the first day of the new year, 3000 Apache servers went down
2022-07-01 05:41:00 【CSDN information】
Author | Ali Josie   Translator | Meniscus
Produced by | CSDN (ID: CSDNnews)
It was the first day of the new year, and a Saturday at that. Imagine waking up in the morning to a pile of alerts saying the entire infrastructure is down. That is exactly what happened to one of my colleagues, and you can imagine his mood.
Early in the morning
The first priority was to restore service and minimize the impact of the outage. We restarted every Apache server, and fortunately they all came back without trouble. The next step was to find the cause. Why would every server go down on the first day of the new year? Surely that could not be a coincidence.
Every server had recorded the following log lines:
AH00171: Graceful restart requested, doing restart
libgomp: could not create thread pool destructor.
What is libgomp? We searched for the error online first. Someone had asked about it on ServerFault, but nobody had answered, or at least there was nothing we could use. One detail of the question was odd, though: the asker said it happened on their server every 24 to 36 hours.
Reflection
Back to the error itself. We rotate logs every morning so that each day starts with a fresh log file, which means Apache is reloaded every morning. It looked as though Apache had restarted successfully and then gone down again because of the libgomp error.
Searching for an answer among the mass of results online was like looking for a needle in a haystack, so we started reading the libgomp source code to see what was going on. First of all, what is libgomp? According to its home page:
- “The GOMP project is an implementation of OpenMP for the C, C++ and Fortran compilers ... GOMP simplifies parallel programming on all GNU systems.”
So it is an implementation of OpenMP. How could that go wrong?
We searched the source code, and the only place where the error message appeared was here:
Clearly it is trying to create a thread key, and something goes wrong. We checked the pthread_key_create manual:
- “pthread_key_create() creates a thread-specific data key that is visible to all threads in the process. Key values provided by pthread_key_create() are opaque objects used to locate thread-specific data. Although different threads may use the same key, the values bound to the key by pthread_setspecific() are maintained per thread and persist for the life of that thread.”
Interesting! What about the return value?
“pthread_key_create() will fail in the following cases:
The system lacks the resources to create another thread-specific data key, or the total number of keys per process has reached the PTHREAD_KEYS_MAX limit.
There is not enough memory to create the key.”
We then looked at the code to see what happens and what the value of PTHREAD_KEYS_MAX is:
So a key is simply a number from 0 to 1024 (exclusive) that is handed to the caller of pthread_key_create. The keys are allocated with a simple CAS, so there must be somewhere that releases them. It seemed we had found the problem: we just needed to raise PTHREAD_KEYS_MAX. That value is a constant, however. We even found a post asking for a larger PTHREAD_KEYS_MAX:
- “pthread_key_create() refuses requests to create more than PTHREAD_KEYS_MAX pthread_key_t keys. The problem I ran into is that Apache on NetBSD cannot work with multiple modules because this value is too low. After a while, the server ends up in a state where it can no longer serve requests.”
The problem described in that post resembles ours, so our hypothesis may well have been correct. But we still could not raise the value.
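To make the limit concrete, here is a small toy model in Python of a fixed-size key table. It is only a sketch of the behaviour described above, not glibc's or libgomp's actual code: the `KeyTable` class and the `creates_until_failure` helper are hypothetical names invented for illustration, and the real allocator uses a CAS on shared slots rather than a plain list.

```python
import errno

PTHREAD_KEYS_MAX = 1024  # per-process key limit quoted in the article


class KeyTable:
    """Toy model (hypothetical) of a per-process table of thread-specific data keys."""

    def __init__(self, size=PTHREAD_KEYS_MAX):
        self.in_use = [False] * size

    def key_create(self):
        # Hand out the first free slot in the fixed-size table.
        for key, used in enumerate(self.in_use):
            if not used:
                self.in_use[key] = True
                return key
        # POSIX reports EAGAIN when no more keys are available.
        raise OSError(errno.EAGAIN, "no free thread-specific data keys")

    def key_delete(self, key):
        # Mark the slot as free so a later key_create can reuse it.
        self.in_use[key] = False


def creates_until_failure(release_keys, attempts=1100):
    """Return the 1-based create call that fails, or None if all succeed."""
    table = KeyTable()
    for attempt in range(1, attempts + 1):
        try:
            key = table.key_create()
        except OSError:
            return attempt
        if release_keys:
            table.key_delete(key)
    return None


if __name__ == "__main__":
    print(creates_until_failure(release_keys=False))  # keys are never released
    print(creates_until_failure(release_keys=True))   # each key is released
```

In this model, a caller that creates keys without ever deleting them fails on call 1025, while a caller that deletes each key can go on indefinitely.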
We started investigating why reloading Apache reaches this libgomp code at all. It turned out that reloading Apache causes mod_php to load a module named Imagick. What is Imagick? It is a PHP extension that uses the ImageMagick library to create and modify images.
Doubt
It seemed that turning off threading in Imagick would avoid libgomp altogether, so the limit would never be hit, and doing so only requires setting an environment variable. The workaround looked safe, but big questions remained:
Why did it happen on January 1? And at such a scale, could it really be a coincidence?
Why had everything worked fine for so many years? Could an update be responsible?
Solving the problem this way clearly did not satisfy us; too many mysteries remained. We read further into the Apache HTTP Server and libgomp code, but everything looked normal, or at least we found nothing wrong. We could not reproduce the problem, and it was on its way to becoming an unsolved mystery. We searched for many unrelated keywords and even found posts about the “Year 2038 problem”.
None of it helped. We even suspected that Apache had some maximum uptime.
Finally, we checked Imagick's changelog and found this:
“Multiple changes to reduce the occurrence of GOMP-related segfaults, including:
Call omp_pause_resource_all during shutdown, if available.
Added imagick.shutdown_sleep_count (default 10) and imagick.set_single_thread (default On). Both can reduce segfaults on shutdown.”
This matched our guess: limiting Imagick to a single thread solves the problem. But it did not answer the biggest question, the one about timing.
A flash of insight
After searching for even stranger things, we wanted to see whether anyone else had run into this problem in January.

The first article turned out to be the key to everything!
Then it hit us: what happens if the thread keys are never released? Is that even possible? The problem had never occurred since this dependency was deployed... So we did the math again. With 1024 keys and one reload every morning, it takes about two years and ten months to exceed 1024 reloads. What if a thread key had been allocated every morning for the past 1024 days and never released?
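The back-of-the-envelope figure can be checked with a few lines of Python. This is only a sketch under the assumptions stated in the text: exactly one reload per day, one leaked key per reload, and no full restart in between. The helper names are hypothetical, and the article does not state the actual date of the servers' last full restart.

```python
from datetime import date, timedelta

PTHREAD_KEYS_MAX = 1024  # key limit quoted in the article


def days_to_years_months(days):
    """Approximate a day count as (years, months) using average month length."""
    total_months = round(days / (365.25 / 12))
    return divmod(total_months, 12)


def first_leak_date(failure_day, reloads_per_day=1, limit=PTHREAD_KEYS_MAX):
    """Day the leak would have had to start, assuming a steady reload rate."""
    return failure_day - timedelta(days=limit // reloads_per_day)


if __name__ == "__main__":
    years, months = days_to_years_months(PTHREAD_KEYS_MAX)
    print(f"{PTHREAD_KEYS_MAX} daily reloads is about {years} years {months} months")
    print("leak would have started around", first_leak_date(date(2022, 1, 1)))
```

Under those assumptions, a failure on 2022-01-01 points back to a process that had been running, and reloading daily, since roughly mid-March 2019.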
At last there was a glimmer of light: we had a way to reproduce the problem. We built a test environment with the same server configuration and simply ran this script:
for i in {1..1100}; do sudo systemctl reload apache2; done
This reloads apache2 1100 times (76 more than needed, as a margin). And, as expected, the problem appeared!
After Apache had been reloaded 1024 times, libgomp reported the error. Now every question had an answer.
Next we checked whether setting the environment variable MAGICK_THREAD_LIMIT (OMP_THREAD_LIMIT in newer Imagick versions) would help. Unfortunately, the problem persisted. So the next step was to update Imagick to a version that contains the fix (v3.5.0+). Luckily, after the update, reloading thousands of times caused no problems.
Verification
One question was still open: does the new Imagick version delete the key? To answer it we used ltrace, a tool that intercepts and records the library calls a program makes. We started by running ltrace on a server with the old Imagick version (v3.4.4):
ltrace -xpthread_key_*@libpthread.so.0 -L -c /usr/sbin/apache2 -k graceful
-x takes a filter that selects functions in a specific library, here pthread_key_create and pthread_key_delete in libpthread.so.0.
-L tells ltrace to ignore its default filter, which reduces noise.
-c prints a summary of all results at the end, and /usr/sbin/apache2 -k graceful is equivalent to systemctl reload apache2.
The result was no surprise:
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
100.00    0.000157         157         1 pthread_key_create
------ ----------- ----------- --------- --------------------
100.00    0.000157                     1 total
Version 3.4.4 only calls pthread_key_create and never deletes the key!
Then we ran the same command on the new version (v3.6.0):
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
------ ----------- ----------- --------- --------------------
100.00    0.000000                     0 total
It appears that the new version does not use multithreading, so no key is created at all.
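The comparison of the two summaries can also be automated. The Python sketch below parses text in the column layout of the `ltrace -c` summaries shown above and counts keys that were created but not deleted. The function names are hypothetical, and the parser assumes data rows have exactly five whitespace-separated fields, which holds for the output in this article but is not guaranteed for every ltrace version.

```python
def parse_ltrace_summary(text):
    """Map function name -> call count from an `ltrace -c` style summary."""
    calls = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows look like: "100.00 0.000157 157 1 pthread_key_create".
        # Header, separator and "total" rows do not match this shape.
        if len(fields) == 5 and fields[3].isdigit():
            calls[fields[4]] = int(fields[3])
    return calls


def leaked_keys(calls):
    """Keys created but not deleted during the traced run."""
    created = calls.get("pthread_key_create", 0)
    deleted = calls.get("pthread_key_delete", 0)
    return created - deleted


OLD_VERSION = """\
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
100.00    0.000157         157         1 pthread_key_create
------ ----------- ----------- --------- --------------------
100.00    0.000157                     1 total
"""

NEW_VERSION = """\
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
------ ----------- ----------- --------- --------------------
100.00    0.000000                     0 total
"""

if __name__ == "__main__":
    print(leaked_keys(parse_ltrace_summary(OLD_VERSION)))
    print(leaked_keys(parse_ltrace_summary(NEW_VERSION)))
```

A positive result for a single graceful reload means one key is leaked per reload, which is the pattern seen with v3.4.4.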
Summary
The problem was finally solved. But why had the servers gone so long without a full restart? We decided not to spend time on that question, because “once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth.”
Solving this left me with a strange feeling. I am proud of having solved it, but there are many long-running servers in the world, and who knows when they will run into the same problem.
Link to the original text: https://alijosie.medium.com/this-is-why-our-3000-apache-servers-went-down-on-the-first-day-of-2022-3cc5e9639587
This article was translated by CSDN. Please credit the source when reprinting.