On the first day of the new year, 3000 Apache servers went down
2022-07-01 05:41:00 【CSDN information】
Author | Ali Josie    Translator | Meniscus
Produced by | CSDN (ID: CSDNnews)
It is the first day of the new year, which also happens to be a Saturday, and you wake up to a pile of alerts saying the entire infrastructure is down. This really happened to one of my colleagues; you can imagine his mood at the time.
Early in the morning
First, the most important thing was to restore service and minimize the impact of the downtime. We restarted every Apache server, and fortunately they all came back without problems. The next step was to find the cause. Why did all the servers go down on the first day of the new year? Surely this was no coincidence?
We found the following log entries on every server:
AH00171: Graceful restart requested, doing restart
libgomp: could not create thread pool destructor.
What is libgomp? We first searched for the error online. Someone had asked about it on ServerFault, but nobody had answered, or at least there was nothing we could use. The question was a little odd, though: the asker said it happened on his server every 24 to 36 hours.
Reflection
Back to the error itself. We rotate the logs every morning so that each day starts with a fresh log, and the rotation restarts the server. It looked as if Apache restarted successfully and then went down again because of the libgomp error.
Searching through the flood of results online was like looking for a needle in a haystack, so we started reading the libgomp source code to see what was going on. First, what is libgomp? According to its home page:
- “The GOMP project is an implementation of OpenMP for the C, C++ and Fortran compilers …… GOMP simplifies parallel programming on all GNU systems.”
So it is an implementation of OpenMP. How could that go wrong?
We searched the source code, and the only place where the error message appears was here:
So evidently it tries to create a thread key and something goes wrong. We checked the pthread_key_create manual:
- “pthread_key_create creates a thread-specific data key that is visible to all threads in the process. Key values provided by pthread_key_create() are opaque objects used to locate thread-specific data. Although different threads may use the same key value, the values bound to the key by pthread_setspecific() are maintained per thread and persist for the life of the calling thread.”
Interesting! What about the return value?
“The pthread_key_create() function fails in the following cases:
The system lacks the resources to create another thread-specific data key, or the total number of keys per process has reached the PTHREAD_KEYS_MAX limit.
There is not enough memory to create the key.”
Then we checked the code to see what happens and what the value of PTHREAD_KEYS_MAX is:
So a key is just a number between 0 and 1024 (exclusive) that is handed to the caller of pthread_key_create. The keys are allocated with a simple CAS (compare-and-swap), so there must be a place where they are released. It seemed we had found the problem: we just needed to raise PTHREAD_KEYS_MAX. However, this value is a constant. We even found a post asking for a higher PTHREAD_KEYS_MAX:
- “pthread_key_create() refuses requests to create more than PTHREAD_KEYS_MAX pthread_key_t. The problem I ran into is that Apache on NetBSD cannot work with many modules because this value is too low. After a long time, the server ends up in a state where it can no longer serve requests.”
The problem described in this post resembles ours, so our hypothesis may well be correct. But we still could not raise the value.
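The allocation scheme described above can be pictured with a toy model. This is only a sketch in Python, not the C library's code: `KeyTable`, `key_create` and `key_delete` are hypothetical names, a linear scan stands in for the CAS-based slot claim, and the limit of 1024 is the value quoted in this article.

```python
import errno

PTHREAD_KEYS_MAX = 1024  # limit quoted in the article; assumed here


class KeyTable:
    """Toy model of a per-process table of thread-specific data keys."""

    def __init__(self, size=PTHREAD_KEYS_MAX):
        self.in_use = [False] * size

    def key_create(self):
        """Return (0, key) on success, or (errno.EAGAIN, None) when full."""
        for key, used in enumerate(self.in_use):
            if not used:
                # The real implementation claims the slot atomically (CAS);
                # a plain assignment is enough for this single-threaded model.
                self.in_use[key] = True
                return 0, key
        return errno.EAGAIN, None

    def key_delete(self, key):
        """Free the slot so that a later key_create can hand it out again."""
        self.in_use[key] = False


if __name__ == "__main__":
    table = KeyTable()
    keys = [table.key_create()[1] for _ in range(PTHREAD_KEYS_MAX)]
    print(table.key_create())   # fails: every slot is taken
    table.key_delete(keys[0])
    print(table.key_create())   # succeeds again and reuses slot 0
```

The point of the model is that the table never grows: once every slot is taken, creation keeps failing until some caller deletes a key.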
We began to investigate why reloading Apache ends up in this libgomp code. It turned out that reloading Apache causes mod_php to call a module named Imagick. What is Imagick? It is a PHP extension that uses the ImageMagick library to create and modify images.
Doubts
Turning off multithreading in Imagick seemed to avoid libgomp altogether, so the limit would never be reached. It only takes one environment variable. The approach looked very safe, but we still had some big questions:
Why did it happen on January 1? And on such a large scale: was it really a coincidence?
Why had everything worked fine for so many years? Could an update be responsible?
Solving the problem this way clearly did not satisfy us. Too many mysteries remained. We read further into the Apache HTTP Server and libgomp code, but everything looked normal, or at least we found nothing wrong. We could not reproduce the problem, and it was about to become an unsolved mystery. We searched for many loosely related keywords and even found posts about the Year 2038 problem.
But none of this helped. We even suspected a maximum uptime in Apache.
Finally, we checked the Imagick changelog and found this:
“Multiple changes to reduce the occurrence of GOMP segfaults, including:
Calling omp_pause_resource_all during shutdown, when possible.
Added imagick.shutdown_sleep_count (default 10) and imagick.set_single_thread (default On). Both reduce segfaults during shutdown.”
This matched our guess: setting Imagick's maximum number of threads to 1 can solve the problem. But it did not answer the biggest question, the one about timing.
Inspiration
After searching for more odd things, we wanted to see whether anyone else had hit this problem in January.

The first result was the key that unlocked everything!
It suddenly occurred to us: what happens if the thread key is never released? Is that possible? The problem had never occurred since the dependency was deployed …… So we did the math again. There are 1024 keys. With one reload every morning, it takes about two years and ten months to exceed 1024 reloads. What if a thread key had been allocated every morning for the past 1024 days and never released ……
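The arithmetic behind that estimate can be checked in a few lines. This is a minimal sketch that assumes exactly one reload per day, exactly one leaked key per reload and no other keys in use; the helper names are hypothetical, and the start date it prints is only what the arithmetic implies, not something stated in the incident report.

```python
from datetime import date, timedelta

KEY_LIMIT = 1024               # PTHREAD_KEYS_MAX as quoted in the article
OUTAGE_DAY = date(2022, 1, 1)  # the day the servers went down


def days_to_exhaust(key_limit, leaked_per_day=1):
    """Days until the key table is full, rounded up."""
    return -(-key_limit // leaked_per_day)


def as_years_and_months(days):
    """Rough conversion using average year and month lengths."""
    years, rest = divmod(days, 365.25)
    return int(years), round(rest / (365.25 / 12))


if __name__ == "__main__":
    days = days_to_exhaust(KEY_LIMIT)
    print(days, as_years_and_months(days))    # 1024 (2, 10)
    print(OUTAGE_DAY - timedelta(days=days))  # 2019-03-14
```

Under those assumptions, a process that had been reloaded daily since mid-March 2019 would run out of keys right around New Year's Day 2022, which fits the "two years and ten months" figure.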
At last there was a glimmer of light: we had a way to reproduce the problem. We built a test environment with the same server configuration and simply ran this script:
for i in {1..1100}; do sudo systemctl reload apache2; done
It reloads apache2 1100 times (76 more than 1024, as a safety margin). As expected, the problem appeared!
After Apache had been reloaded 1024 times, libgomp raised the error. Now all the questions were answered.
We first tried setting the environment variable MAGICK_THREAD_LIMIT (OMP_THREAD_LIMIT for newer versions of Imagick). Unfortunately, the problem remained. So the next step was to update Imagick to a version that fixes the problem (v3.5.0+). Fortunately, after the update we could reload thousands of times without any problem.
Check
One question remained: does the new version of Imagick delete the key? To answer it we used a tool called ltrace, which can intercept and record the library calls a program makes. We started by running ltrace on a server with the old version of Imagick (v3.4.4):
ltrace -xpthread_key_*@libpthread.so.0 -L -c /usr/sbin/apache2 -k graceful
-x takes a filter string for functions in a specific library, here pthread_key_create and pthread_key_delete in libpthread.so.0.
-L tells ltrace to ignore its default filter, which reduces noise.
-c summarizes all results at the end. And /usr/sbin/apache2 -k graceful is equivalent to systemctl reload apache2.
The result was no surprise:
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
100.00    0.000157         157         1 pthread_key_create
------ ----------- ----------- --------- --------------------
100.00    0.000157                     1 total
Version 3.4.4 only calls pthread_key_create and never deletes the key!
Then we ran the same command on the new version (v3.6.0):
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
------ ----------- ----------- --------- --------------------
100.00    0.000000                     0 total
It seems the new version does not use multithreading, so no key is created at all.
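The comparison of the two ltrace summaries can also be automated. The sketch below is an illustration rather than part of the original investigation: `parse_ltrace_summary` and `leaked_keys` are hypothetical helpers, and the parser assumes the five-column layout of the summaries shown in this article.

```python
def parse_ltrace_summary(text):
    """Map function name -> call count from an `ltrace -c` style summary."""
    calls = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have five columns: % time, seconds, usecs/call, calls,
        # function. The header and the "total" row split differently.
        if len(fields) != 5 or fields[0].startswith("-"):
            continue
        if not fields[3].isdigit() or fields[4] == "total":
            continue
        calls[fields[4]] = int(fields[3])
    return calls


def leaked_keys(calls):
    """Keys created but never deleted during the traced run."""
    created = calls.get("pthread_key_create", 0)
    deleted = calls.get("pthread_key_delete", 0)
    return created - deleted


OLD_SUMMARY = """\
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
100.00    0.000157         157         1 pthread_key_create
------ ----------- ----------- --------- --------------------
100.00    0.000157                     1 total
"""

NEW_SUMMARY = """\
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
------ ----------- ----------- --------- --------------------
100.00    0.000000                     0 total
"""

if __name__ == "__main__":
    print(leaked_keys(parse_ltrace_summary(OLD_SUMMARY)))  # 1
    print(leaked_keys(parse_ltrace_summary(NEW_SUMMARY)))  # 0
```

A non-zero result for a single graceful reload is the signal to look for: it means each reload leaves one more slot of the key table occupied.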
Summary
The problem was finally solved. But why had the servers gone so long without a restart? We decided not to spend time on that question, because “once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth.”
Solving this left me with a strange feeling. I am proud that we cracked it, but there are many long-running servers in the world, and who knows when they will run into the same problem.
Original article: https://alijosie.medium.com/this-is-why-our-3000-apache-servers-went-down-on-the-first-day-of-2022-3cc5e9639587
This article was translated by CSDN. Please credit the source when reprinting.