当前位置:网站首页>Remember the stuck analysis of an industrial automation control system in .NET

Remember the stuck analysis of an industrial automation control system in .NET

2022-08-02 20:05:00 first-line coders

一:背景

1. 讲故事

Some time ago, I encountered several about the form program Process load lock 引发的 程序卡死线程暴涨 问题,这种 dump 分析难度较大,主要涉及到 Windows操作系统 和 C++ 的基础知识,So it is necessary to simply organize and share with you,上 windbg 说话.

二:WinDbg 分析

1. What the main thread is doing at the moment

The form program is stuck,The entry point of analysis is at 主线程 上,使用 ~0s; k 命令即可.


0:000> ~0s; k
ntdll!NtWaitForSingleObject+0x14:
00007ffc`6010e614 c3              ret
 # Child-SP          RetAddr               Call Site
00 0000008c`107fe5d8 00007ffc`5cda4313     ntdll!NtWaitForSingleObject+0x14
01 0000008c`107fe5e0 00007ffc`257b2fe8     KERNELBASE!WaitForSingleObjectEx+0x93
02 0000008c`107fe680 00007ffc`257b2f9e     clr!CLREventWaitHelper2+0x3c
03 0000008c`107fe6c0 00007ffc`257b2efc     clr!CLREventWaitHelper+0x1f
04 0000008c`107fe720 00007ffc`256beed2     clr!CLREventBase::WaitEx+0x71
05 0000008c`107fe7b0 00007ffc`25687e44     clr!WKS::GCHeap::WaitUntilGCComplete+0x2e
06 0000008c`107fe7e0 00007ffc`25688092     clr!Thread::RareDisablePreemptiveGC+0x18f
07 0000008c`107fe880 00007ffc`255d44f4     clr!JIT_RareDisableHelperWorker+0x42
08 0000008c`107fe9d0 00007ffc`22544314     clr!JIT_RareDisableHelper+0x14
09 0000008c`107fea10 00007ffc`22525f32     WindowsBase_ni+0x184314
0a 0000008c`107fead0 00007ffc`22520298     WindowsBase_ni+0x165f32
0b 0000008c`107feb10 00007ffc`2251edaf     WindowsBase_ni+0x160298
0c 0000008c`107feba0 00007ffc`202b6421     WindowsBase_ni+0x15edaf
...

from the hexagram WaitUntilGCComplete 函数看,The main thread at this point is waiting GC完成,Then who triggered it GC 呢? 接下来用 !t 命令查看下 GC 标记.


0:000> !t
ThreadCount:      58
UnstartedThread:  9
BackgroundThread: 39
PendingThread:    9
DeadThread:       5
Hosted Runtime:   no
  42   41  cd8 000001ec5f7f7c90  202b220 Preemptive  0000000000000000:0000000000000000 000001ec3353c710 0     MTA 
  43   34 1160 000001ec5f7f4db0    21220 Preemptive  0000000000000000:0000000000000000 000001ec3353c710 0     Ukn 
  44   33 218c 000001ec5f7f5580    2b220 Cooperative 0000000000000000:0000000000000000 000001ec3353c710 1     MTA (GC) 
  45   36 1110 000001ec5f7f8460  202b220 Preemptive  0000000000000000:0000000000000000 000001ec3353c710 0     MTA 
  48   32 26a8 000001ec545813e0    2b220 Preemptive  0000000000000000:0000000000000000 000001ec3353c710 0     MTA 
  49   31  4b4 000001ec54581bb0    2b220 Preemptive  0000000000000000:0000000000000000 000001ec3353c710 0     MTA 

从卦中看,当前的 44 号线程触发了 GC,Next, take a look at its thread stack.


0:000> ~~[218c]s
ntdll!NtWaitForSingleObject+0x14:
00007ffc`6010e614 c3              ret
0:044> k
 # Child-SP          RetAddr               Call Site
00 0000008c`0a0bd9b8 00007ffc`5cda4313     ntdll!NtWaitForSingleObject+0x14
01 0000008c`0a0bd9c0 00007ffc`257b2fe8     KERNELBASE!WaitForSingleObjectEx+0x93
02 0000008c`0a0bda60 00007ffc`257b2f9e     clr!CLREventWaitHelper2+0x3c
03 0000008c`0a0bdaa0 00007ffc`257b2efc     clr!CLREventWaitHelper+0x1f
04 0000008c`0a0bdb00 00007ffc`256c821d     clr!CLREventBase::WaitEx+0x71
05 0000008c`0a0bdb90 00007ffc`256c8120     clr!standalone::`anonymous namespace'::CreateSuspendableThread+0x10c
06 0000008c`0a0bdc50 00007ffc`257b9e4c     clr!GCToEEInterface::CreateThread+0x170
07 0000008c`0a0bde40 00007ffc`257b8543     clr!WKS::gc_heap::prepare_bgc_thread+0x4c
08 0000008c`0a0bde70 00007ffc`256be9f7     clr!WKS::gc_heap::garbage_collect+0xfbb37
09 0000008c`0a0bdeb0 00007ffc`256c0c47     clr!WKS::GCHeap::GarbageCollectGeneration+0xef
0a 0000008c`0a0bdf00 00007ffc`255dc7b3     clr!WKS::GCHeap::Alloc+0x29c
0b 0000008c`0a0bdf50 00007ffb`c631853d     clr!JIT_New+0x339

从线程栈看,GC Ready to use during triggering CreateThread 函数创建线程,Some friends may not understand,GCTriggers also create operations for threads??? 哈哈,This involves a bit CLR 的基础知识,workstation 的 bgc mode will have a dedicated 后台线程, And this background thread is created and destroyed at some point during runtime,So look at the thread stack,GC 正在等待 bgc 线程初始化完毕.

很奇怪的是,I'm from multiple stuck states dump 看,发现 GC all stuck on this CreateThread 函数上,The implication is that the thread is waiting indefinitely here.

2. CreateThread Why can't the initialization be completed?

如果大家玩过 C++ 的话,应该知道 C++ 的 dll 会有一个 DllMain 方法,its meaning and Main 方法一致,The code skeleton diagram is as follows:


// dllmain.cpp : Defines the entry point for the DLL application.
#include "pch.h"

BOOL APIENTRY DllMain( HMODULE hModule,
                       DWORD  ul_reason_for_call,
                       LPVOID lpReserved
                     )
{
    switch (ul_reason_for_call)
    {
    case DLL_PROCESS_ATTACH:
    case DLL_THREAD_ATTACH:
    case DLL_THREAD_DETACH:
    case DLL_PROCESS_DETACH:
        break;
    }
    return TRUE;
}

从 switch Look at the enumeration parameters in ,就是 dll 加载和卸载,线程创建和销毁,有此 DllMain 方法的 dll 都会收到通知,into this DllMain Before, a global one will be obtained first Process load lock(LdrpLoaderLock).

既然 GC cannot be created during the process CreateThread,Someone must be holding this LdrpLoaderLock 锁,The next question is how to find it which thread is holding LdrpLoaderLock ? 这就涉及到 windows 操作系统的 基础知识了.

3. 谁在持有 LdrpLoaderLock 锁?

LdrpLoaderLock 变量是在 ntdll.dll User mode gatewaydll中,可以用 x ntdll!LdrpLoaderLock 命令检索,Then see which critical section is held as.


0:044>  x ntdll!LdrpLoaderLock
00007ffc`601cf4f8 ntdll!LdrpLoaderLock = <no type information>

0:044> dt _RTL_CRITICAL_SECTION  00007ffc`601cf4f8
atl100!_RTL_CRITICAL_SECTION
   +0x000 DebugInfo        : 0x00007ffc`601cf978 _RTL_CRITICAL_SECTION_DEBUG
   +0x008 LockCount        : 0n-2
   +0x00c RecursionCount   : 0n1
   +0x010 OwningThread     : 0x00000000`0000138c Void
   +0x018 LockSemaphore    : (null) 
   +0x020 SpinCount        : 0x4000000

从卦中看,当前 138c Thread No. holds this critical section,Next, switch to this thread and look at its thread stack.


0:044> ~~[138c]s
win32u!NtUserMessageCall+0x14:
00007ffc`5c891184 c3              ret
0:061> k
 # Child-SP          RetAddr               Call Site
00 0000008c`00ffec68 00007ffc`5f21bfbe     win32u!NtUserMessageCall+0x14
01 0000008c`00ffec70 00007ffc`5f21be38     user32!SendMessageWorker+0x11e
02 0000008c`00ffed10 00007ffc`124fd4af     user32!SendMessageW+0xf8
03 0000008c`00ffed70 00007ffc`125e943b     cogxImagingDevice!DllUnregisterServer+0x3029f
04 0000008c`00ffeda0 00007ffc`125e9685     cogxImagingDevice!DllUnregisterServer+0x11c22b
05 0000008c`00ffede0 00007ffc`600b50e7     cogxImagingDevice!DllUnregisterServer+0x11c475
06 0000008c`00ffee20 00007ffc`60093ccd     ntdll!LdrpCallInitRoutine+0x6f
07 0000008c`00ffee90 00007ffc`60092eef     ntdll!LdrpProcessDetachNode+0xf5
08 0000008c`00ffef60 00007ffc`600ae319     ntdll!LdrpUnloadNode+0x3f
09 0000008c`00ffefb0 00007ffc`600ae293     ntdll!LdrpDecrementModuleLoadCountEx+0x71
0a 0000008c`00ffefe0 00007ffc`5cd7c00e     ntdll!LdrUnloadDll+0x93
0b 0000008c`00fff010 00007ffc`5d47cf78     KERNELBASE!FreeLibrary+0x1e
0c 0000008c`00fff040 00007ffc`5d447aa3     combase!CClassCache::CDllPathEntry::CFinishObject::Finish+0x28 [onecore\com\combase\objact\dllcache.cxx @ 3420] 
0d 0000008c`00fff070 00007ffc`5d4471a9     combase!CClassCache::CFinishComposite::Finish+0x4b [onecore\com\combase\objact\dllcache.cxx @ 3530] 
0e 0000008c`00fff0a0 00007ffc`5d3f1499     combase!CClassCache::FreeUnused+0xdd [onecore\com\combase\objact\dllcache.cxx @ 6547] 
0f 0000008c`00fff650 00007ffc`5d3f13c7     combase!CoFreeUnusedLibrariesEx+0x89 [onecore\com\combase\objact\dllapi.cxx @ 117] 
10 (Inline Function) --------`--------     combase!CoFreeUnusedLibraries+0xa [onecore\com\combase\objact\dllapi.cxx @ 74] 
11 0000008c`00fff690 00007ffc`6008a019     combase!CDllHost::MTADllUnloadCallback+0x17 [onecore\com\combase\objact\dllhost.cxx @ 929] 
12 0000008c`00fff6c0 00007ffc`6008bec4     ntdll!TppTimerpExecuteCallback+0xa9
13 0000008c`00fff710 00007ffc`5f167e94     ntdll!TppWorkerThread+0x644
14 0000008c`00fffa00 00007ffc`600d7ad1     kernel32!BaseThreadInitThunk+0x14
15 0000008c`00fffa30 00000000`00000000     ntdll!RtlUserThreadStart+0x21

可以看到,cogxImagingDevice 发起了一个 user32!SendMessageW 同步方法,Causes the program to completely deadlock,Some friends may be a little confused,我简单罗列下.

  1. 44 No. thread to use CreateThread 创建线程,But you have to get it first load lock,所以一直在等待 61 No. thread releases the load lock.
  2. 61 No. threads in a synchronized way user32!SendMessageW 给 主线程的 WndProc The gateway function prints synchronous messages,等待 The main thread responds,While the main thread is waiting GC 完成.
  3. 主线程 indefinitely 等待 GC 结束.

综合来看,As long as the main thread does not respond 44 号线程, 44No. thread will not be released load lock,这个 load lock 不释放,This will cause many threads to fail to initialize,This is in it dump also reacted,代码如下:


  70  Id: 300.4f0 Suspend: 0 Teb: 0000008c`102e1000 Unfrozen
 # Child-SP          RetAddr               Call Site
00 0000008c`0ecff388 00007ffc`6008902d     ntdll!NtWaitForSingleObject+0x14
01 0000008c`0ecff390 00007ffc`600b29a7     ntdll!LdrpDrainWorkQueue+0x15d
02 0000008c`0ecff3d0 00007ffc`600e76d5     ntdll!LdrpInitializeThread+0x8b
03 0000008c`0ecff4b0 00007ffc`600e7633     ntdll!_LdrpInitialize+0x89
04 0000008c`0ecff550 00007ffc`600e75de     ntdll!LdrpInitialize+0x3b
05 0000008c`0ecff580 00000000`00000000     ntdll!LdrInitializeThunk+0xe

  71  Id: 300.1c88 Suspend: 0 Teb: 0000008c`102e5000 Unfrozen
 # Child-SP          RetAddr               Call Site
00 0000008c`0f4ff268 00007ffc`6008902d     ntdll!NtWaitForSingleObject+0x14
01 0000008c`0f4ff270 00007ffc`600b29a7     ntdll!LdrpDrainWorkQueue+0x15d
02 0000008c`0f4ff2b0 00007ffc`600e76d5     ntdll!LdrpInitializeThread+0x8b
03 0000008c`0f4ff390 00007ffc`600e7633     ntdll!_LdrpInitialize+0x89
04 0000008c`0f4ff430 00007ffc`600e75de     ntdll!LdrpInitialize+0x3b
05 0000008c`0f4ff460 00000000`00000000     ntdll!LdrInitializeThunk+0xe

  72  Id: 300.15c0 Suspend: 0 Teb: 0000008c`102e7000 Unfrozen
 # Child-SP          RetAddr               Call Site
00 0000008c`0f8ff278 00007ffc`6008902d     ntdll!NtWaitForSingleObject+0x14
01 0000008c`0f8ff280 00007ffc`600b29a7     ntdll!LdrpDrainWorkQueue+0x15d
02 0000008c`0f8ff2c0 00007ffc`600e76d5     ntdll!LdrpInitializeThread+0x8b
03 0000008c`0f8ff3a0 00007ffc`600e7633     ntdll!_LdrpInitialize+0x89
04 0000008c`0f8ff440 00007ffc`600e75de     ntdll!LdrpInitialize+0x3b
05 0000008c`0f8ff470 00000000`00000000     ntdll!LdrInitializeThunk+0xe

  73  Id: 300.764 Suspend: 0 Teb: 0000008c`102ef000 Unfrozen
 # Child-SP          RetAddr               Call Site
00 0000008c`0fcff388 00007ffc`6008902d     ntdll!NtWaitForSingleObject+0x14
01 0000008c`0fcff390 00007ffc`600b29a7     ntdll!LdrpDrainWorkQueue+0x15d
02 0000008c`0fcff3d0 00007ffc`600e76d5     ntdll!LdrpInitializeThread+0x8b
03 0000008c`0fcff4b0 00007ffc`600e7633     ntdll!_LdrpInitialize+0x89
04 0000008c`0fcff550 00007ffc`600e75de     ntdll!LdrpInitialize+0x3b
05 0000008c`0fcff580 00000000`00000000     ntdll!LdrInitializeThunk+0xe

可以看到,There are a lot of threads that are stuck ntdll!LdrpInitializeThread+0x8b cannot proceed,So what exactly is this method doing?可以看下 反汇编代码 .


0:000> u ntdll!LdrpInitializeThread+0x8b
ntdll!LdrpInitializeThread+0x8b:
00007ffc`600b29a7 e874a50000      call    ntdll!LdrpAcquireLoaderLock (00007ffc`600bcf20)
00007ffc`600b29ac 90              nop
00007ffc`600b29ad 488b1d1c2a1200  mov     rbx,qword ptr [ntdll!PebLdr+0x10 (00007ffc`601d53d0)]
00007ffc`600b29b4 488d05152a1200  lea     rax,[ntdll!PebLdr+0x10 (00007ffc`601d53d0)]
00007ffc`600b29bb 483bd8          cmp     rbx,rax
00007ffc`600b29be 0f84c5000000    je      ntdll!LdrpInitializeThread+0x16d (00007ffc`600b2a89)
....

It can be clearly seen from the assembly,are stuck at acquiring load lock ntdll!LdrpAcquireLoaderLock 函数上.

三:总结

这是一个由 cogxImagingDevice.dllCauses a program deadlock,I checked that Baidu is a commercial version 视觉图像库,I can't solve this either,Can only recommend to friends.

  1. Get familiar with this dll 的配置,If it is not caused by the configuration, it is recommended to submit an official solution.

  2. 争取做到 C# 和 C++ process-level isolation,Or simply replace it cogxImagingDevice.dll ,Although this is a bit difficult.

这个 dump 给我们的教训是: 当 C# 和 C++ 混在一起,Strive to achieve the greatest possible isolation,Once there is a problem, it will test you right windows Understanding of the underlying knowledge,The threshold for analysis and troubleshooting is high.

图片名称
原网站

版权声明
本文为[first-line coders]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/214/202208021723251247.html