Server memory failure prediction can actually be done this way!
2022-07-26 09:11:00 【InfoQ】
1. Background
- With MCE logs alone, it is difficult to locate the failed memory slot directly.
- There is no intuitive CE/UCE error count.
- Memory health cannot be judged from the number of CE/UCE errors on each memory module.
2. How EDAC Works
- 【edac_mc_alloc()】: The memory controller is described by the structure mem_ctl_info, which only the EDAC core may touch. edac_mc_alloc() allocates this structure and fills in its contents.
- 【edac_device_handle_ce()】: Records a CE (correctable error).
- 【edac_device_handle_ue()】: Records a UCE (uncorrectable error).
- 【edac_mc_handle_error()】: Reports a memory event to user space. Its parameters include the hierarchy layers of the fault location and the error type, and it updates the related UCE/CE error counters.
- 【edac_raw_mc_handle_error()】: Reports a memory event to user space without doing anything to discover the error location. It is used internally by edac_mc_handle_error() and should be called directly only when the hardware error comes from the BIOS.
# ls /sys/devices/system/edac/mc/mc0/csrow0/
ce_count ch0_ce_count ch0_dimm_label ch1_ce_count ch1_dimm_label dev_type edac_mode mem_type power size_mb subsystem ue_count uevent
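The per-csrow counters listed above can be collected with a few lines of shell. The following is a minimal sketch, assuming the csrow-based sysfs layout shown in the listing; the function name edac_counts is made up for illustration and is not part of any EDAC tool.

```shell
#!/bin/sh
# Hypothetical helper: print "<mc>/<csrow> <ce_count> <ue_count>" for every
# csrow directory under an EDAC sysfs root (defaults to the path shown above).
edac_counts() (
    root=${1:-/sys/devices/system/edac/mc}
    for dir in "$root"/mc*/csrow*; do
        # Skip unmatched globs and directories without a readable counter.
        [ -r "$dir/ce_count" ] || continue
        ce=$(cat "$dir/ce_count")
        ue=$(cat "$dir/ue_count" 2>/dev/null || echo 0)
        printf '%s/%s %s %s\n' \
            "$(basename "$(dirname "$dir")")" "$(basename "$dir")" "$ce" "$ue"
    done
)
```

Calling edac_counts with no argument reads the live counters. Passing a directory makes it easy to try the function against a copy of the tree.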
3. Applying EDAC
# ls /lib/modules/3.10.0-693.el7.x86_64/kernel/drivers/edac/
amd64_edac_mod.ko.xz edac_core.ko.xz i3000_edac.ko.xz i5000_edac.ko.xz i5400_edac.ko.xz i7core_edac.ko.xz ie31200_edac.ko.xz skx_edac.ko.xz
e752x_edac.ko.xz edac_mce_amd.ko.xz i3200_edac.ko.xz i5100_edac.ko.xz i7300_edac.ko.xz i82975x_edac.ko.xz sb_edac.ko.xz x38_edac.ko.xz
# modinfo sb_edac
filename: /lib/modules/3.10.0-693.el7.x86_64/kernel/drivers/edac/sb_edac.ko.xz
description: MC Driver for Intel Sandy Bridge and Ivy Bridge memory controllers - Ver: 1.1.1
...
# modinfo skx_edac
filename: /lib/modules/3.10.0-693.el7.x86_64/kernel/drivers/edac/skx_edac.ko.xz
description: MC Driver for Intel Skylake server processors
...
# cat /etc/edac/labels.db
# EDAC Motherboard DIMM labels Database file.
#
# $Id: labels.db 102 2008-09-25 15:52:07Z grondo $
#
# Vendor-name and model-name are found from the program 'dmidecode'
# labels are found from the silk screen on the motherboard.
#
#Vendor: <vendor-name>
# Model: <model-name>
# <label>: <mc>.<row>.<channel>
- BERT (Boot Error Record Table): Mainly records errors that occur during boot.
- ERST (Error Record Serialization Table): An abstract interface for storing errors persistently. It stores errors related to various hardware or platforms. Error types include Corrected Error (CE), Uncorrected Recoverable Error (UCR), and Uncorrected Non-Recoverable Error, also called Fatal Error.
- EINJ (Error Injection Table): Used to inject and trigger errors. It is a table intended for testing.
- HEST (Hardware Error Source Table): Defines many error sources and types. The purpose of defining these hardware error sources is to standardize the error interfaces between software and hardware.
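To see which of these four tables a given machine's firmware exposes, a small loop is enough. This is a sketch under the assumption that each table appears under /sys/firmware/acpi/tables by its signature, as EINJ does; the function name apei_tables is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report which of the four APEI tables are present.
# The directory argument defaults to the usual sysfs location.
apei_tables() (
    dir=${1:-/sys/firmware/acpi/tables}
    for t in BERT ERST EINJ HEST; do
        if [ -e "$dir/$t" ]; then
            echo "$t present"
        else
            echo "$t missing"
        fi
    done
)
```

A missing EINJ entry means error injection through this interface is not available on that platform.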
# Check whether the EINJ table exists
# ls /sys/firmware/acpi/tables/EINJ
# grep <each of the following options> /boot/config-3.10.0-693.el7.x86_64
CONFIG_DEBUG_FS=y
CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_EINJ=m
# Load the einj module
# modprobe einj
# Check the memory address ranges. /proc/iomem records how physical addresses are allocated. Some addresses are reserved by the system or occupied by other devices, and errors cannot be injected there.
# cat /proc/iomem | grep "System RAM"
00001000-000997ff
00100000-69f79fff
6c867000-6c9e6fff
6f345000-6f7fffff
100000000-407fffffff
# Check the memory page size
# getconf PAGESIZE
4096 (that is, 4 KB)
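Before choosing an injection address, it can help to confirm that the address falls inside a System RAM range and to round it down to a page boundary. The sketch below assumes the usual /proc/iomem line format "start-end : System RAM"; both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if address $1 (e.g. 0x32dec000) lies inside a
# "System RAM" range of the iomem-style file $2 (default: /proc/iomem).
addr_in_system_ram() (
    addr=$(( $1 ))
    file=${2:-/proc/iomem}
    grep 'System RAM' "$file" | while read -r range rest; do
        start=$(( 0x${range%-*} ))   # text before the dash
        end=$(( 0x${range#*-} ))     # text after the dash
        if [ "$addr" -ge "$start" ] && [ "$addr" -le "$end" ]; then
            echo yes
            break
        fi
    done | grep -q yes
)

# Hypothetical helper: round address $1 down to a multiple of page size $2
# (default: the value reported by getconf PAGESIZE).
page_align() (
    addr=$(( $1 ))
    page=${2:-$(getconf PAGESIZE)}
    printf '0x%x\n' $(( addr - addr % page ))
)
```

With 4096-byte pages, rounding down clears the low 12 bits of the address, which is the same part of the address that the mask 0xfffffffffffff000 excludes.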
# Enter the EINJ error injection directory
# cat /proc/mounts | grep debugfs
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
# cd /sys/kernel/debug/apei/einj/
# List the error types that can be injected
# cat available_error_type
0x00000008 Memory Correctable
0x00000010 Memory Uncorrectable non-fatal
0x00000020 Memory Uncorrectable fatal
# Write the type of error to inject
echo 0x8 > error_type
# Write the memory address mask
echo 0xfffffffffffff000 > param2
# Write the memory address
echo 0x32dec000 > param1
# Write 0x0 to notrigger; if it is set to 1, the trigger step is skipped
echo 0x0 > notrigger
# Writing any integer to error_inject triggers the injection; this is the last step
echo 1 > error_inject
# Check the log
# tail /var/log/messages
xxxxxx xxxxxxxx kernel: [2258720.203422] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x32dec offset:0x0 grain:32 syndrome:0x0 - err_code:0101:0090 socket:0 imc:0 rank:0 bg:0 ba:3 row:327 col:300)
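A log line like the one above can be reduced to the fields that matter for tracking: event count, error type, DIMM label, and page. The sed expression below is a sketch written against this single sample line, so the exact message layout is an assumption and may differ between drivers and kernel versions. The function name parse_edac_line is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print
# "<count> <type> <dimm-label> <page>" for each EDAC memory-controller event.
# Lines that do not match the assumed layout are dropped.
parse_edac_line() {
    sed -n 's/.*EDAC MC[0-9]*: \([0-9]*\) \([A-Z]*\) .* on \([^ ]*\) (.*page:\(0x[0-9a-f]*\) .*/\1 \2 \3 \4/p'
}
```

For the sample line, the reported page 0x32dec with 4 KB pages corresponds to physical address 0x32dec000, which is the address that was written to param1.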
# Run edac-util -v; the CE count of the corresponding memory module has increased
4. Summary and Outlook
- EDAC can reliably obtain the CE count of every memory module in a server. We can set thresholds on the CE count, analyze CE count curves, and combine these with other data such as MCE logs and SEL to evaluate memory health and predict memory failures. Since EDAC was rolled out across vivo's servers, it has found 450+ memory CE cases in advance, and the number of server crashes has dropped significantly. For servers that meet the repair criteria, the business is migrated and the affected memory module is replaced, which avoids the business instability, and even the losses, caused by sudden server downtime.
- EDAC is a small part of how server RAS (Reliability, Availability and Serviceability) applies to memory. RAS refers to technical means that combine software and hardware to ensure these three capabilities of the server. RAS offers many more memory optimizations, such as MCA (Machine Check Architecture) recovery. In the future we will also introduce more RAS features to mitigate the impact of hardware failures on the system.
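The threshold idea mentioned in the summary can be sketched as a comparison of two CE count snapshots. This is only an illustration: the snapshot format ("<name> <ce_count>" per line), the function name ce_alerts, and any threshold value are assumptions, not the criteria actually used in production.

```shell
#!/bin/sh
# Hypothetical helper. Usage: ce_alerts PREV_FILE CURR_FILE THRESHOLD
# Each file holds "<name> <ce_count>" lines. Prints "<name> <delta>" for
# every entry whose CE count grew by at least THRESHOLD between snapshots.
ce_alerts() {
    awk -v limit="$3" '
        FILENAME == ARGV[1] { prev[$1] = $2; next }  # older snapshot
        {
            delta = $2 - prev[$1]                    # unseen names start at 0
            if (delta >= limit) print $1, delta
        }
    ' "$1" "$2"
}
```

Counters restart from zero after a reboot, so a negative delta simply produces no alert in this sketch.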
- https://www.kernel.org/doc/html/latest/driver-api/edac.html
- https://www.kernel.org/doc/html/latest/admin-guide/ras.html
- https://www.kernel.org/doc/html/latest/firmware-guide/acpi/apei/einj.html
- https://github.com/grondo/edac-utils/
- https://uefi.org/specs/ACPI/6.4/18_ACPI_Platform_Error_Interfaces/ACPI_PLatform_Error_Interfaces.html