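RAS (Part 1): Introduction
Before We Begin
I was recently laid off, so I am using the job-hunting period to organize what I know about Linux RAS into a series of articles. As the owner and a developer of OS RAS, I delivered many RAS enhancements and solutions for Alibaba Cloud's x86 and Yitian 710 platforms and made a modest contribution to the stability of Alibaba Cloud servers. During that time quite a few other teams came to me with RAS questions, so I want to write this down in the hope that it helps Linux enthusiasts who plan to learn about RAS. My perspective is mainly that of the Linux kernel: I will survey the components, functions, and features that Linux RAS involves, and also introduce the hardware that kernel RAS deals with.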
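RAS Background
With the arrival of the cloud era, companies are migrating their products and services to the cloud, which has made digital transformation far more convenient. Cloud server vendors have sprung up at home and abroad, for example Google Cloud, Alibaba Cloud, and Tencent Cloud. In recent years cloud services worldwide have suffered multiple outages, and cloud stability is drawing more and more attention. According to the White Paper on China's Data Disaster Recovery Industry and Survey Report on Data Disaster Recovery Construction (《中国数据灾备产业白皮书暨数据灾备建设调研报告》), one minute of business downtime costs on average USD 150,000 in the transportation industry, USD 270,000 in banking, USD 350,000 in telecommunications, USD 420,000 in manufacturing, and USD 450,000 in securities, while the damage to intangible assets such as company reputation cannot be estimated.
In the embedded field, more and more domestically developed hardware is being released commercially, including memory, disks, GPUs, and CPUs, for markets such as PCs, automotive electronics, and industrial control. As the deployed volume of this hardware rises rapidly, stability problems have begun to surface. For example, when I was on the Huawei kernel team supporting an ARM embedded product, we found that domestic memory modules developed hardware faults far more often than modules from a major foreign brand.
Overall, as software technology matures, the share of problems caused by software falls year by year. Correspondingly, the share of hardware problems grows: server malfunctions or crashes caused by hardware faults have gradually become the number one issue for cloud servers.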
RAS Definition
Server hardware stability is mainly reflected in RAS, which stands for Reliability, Availability, and Serviceability. The Linux kernel documentation defines the three terms as follows:
Reliability
is the probability that a system will produce correct outputs.
•Generally measured as Mean Time Between Failures (MTBF)
•Enhanced by features that help to avoid, detect and repair hardware faults
Availability
is the probability that a system is operational at a given time
•Generally measured as a percentage of downtime per a period of time
•Often uses mechanisms to detect and correct hardware faults in runtime;
Serviceability (or maintainability)
is the simplicity and speed with which a system can be repaired or maintained
•Generally measured on Mean Time Between Repair (MTBR)
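To make the relationship between these metrics concrete, here is a minimal sketch that computes steady-state availability from a mean time between failures and a mean repair time, using the common textbook formula availability = MTBF / (MTBF + MTTR). The function names and the sample figures (10,000 hours between failures, 1 hour to repair) are assumptions for illustration only and do not come from the kernel documentation.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is operational."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and repair time non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected downtime per (non-leap) year for a given availability."""
    return (1.0 - avail) * 365 * 24 * 60


# Hypothetical figures, chosen only to illustrate the calculation.
a = availability(mtbf_hours=10_000, mttr_hours=1)
print(f"availability = {a:.6f}")
print(f"expected downtime = {downtime_minutes_per_year(a):.1f} minutes/year")
```

The sketch shows why both levers matter: raising MTBF (reliability) and shortening repair time (serviceability) each push availability up.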
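The goals of RAS are to keep the system running reliably for as long as possible without stopping and to reduce system downtime; to provide hardware error detection and reporting, so that administrators are notified in time to replace hardware before an error causes data loss or a crash; and to provide hardware error recovery that corrects errors wherever possible, so that the system keeps running reliably.
The hardware involved in RAS includes, but is not limited to, CPU, memory, I/O, PCIe, disks, and other peripherals: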
•CPU – detect errors at instruction execution and at L1/L2/L3 caches;
•Memory – add error correction logic (ECC) to detect and correct errors;
•I/O – add CRC checksums for transferred data;
•Storage – RAID, journal file systems, checksums, Self-Monitoring, Analysis and Reporting Technology (SMART).
Hardware errors are usually divided into CE, UE, Fatal Error, and Non-fatal Error, defined as follows:
•Correctable Error (CE) - the error detection mechanism detected and corrected the error. Such errors are usually not fatal, although some Kernel mechanisms allow the system administrator to consider them as fatal.
•Uncorrected Error (UE) - the amount of errors happened above the error correction threshold, and the system was unable to auto-correct.
•Fatal Error - when an UE error happens on a critical component of the system (for example, a piece of the Kernel got corrupted by an UE), the only reliable way to avoid data corruption is to hang or reboot the machine.
•Non-fatal Error - when an UE error happens on an unused component, like a CPU in power down state or an unused memory bank, the system may still run, eventually replacing the affected hardware by a hot spare, if available.
In practice, however, this classification is broad and rather crude. For example, there is also the Deferred Error (DE):
Deferred error
The error was detected, was not corrected, and was deferred. The error has not been silently propagated. The error might be latent in the system. It is IMPLEMENTATION DEFINED whether the error continues to infect the state of the node or whether it has been deferred to the consumer. The node continues to operate. If the error might have been silently propagated, it must be reported as an Uncorrected error.
As another example, Intel defines software-recoverable uncorrected (UC) errors as UCR (Uncorrected Recoverable) errors, which are further divided into SRAR, SRAO, UCNA, and so on.
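As a rough illustration of how these Intel classes are told apart, the sketch below decodes the relevant flag bits of an IA32_MCi_STATUS value. The bit positions (VAL 63, UC 61, PCC 57, S 56, AR 55) follow the Intel SDM's machine-check architecture chapter; the function name and the returned labels are my own. It assumes the processor supports software error recovery (so the S and AR bits are meaningful) and deliberately ignores the EN and OVER bits, so treat it as a simplified sketch rather than the kernel's severity grading.

```python
# Flag bits in IA32_MCi_STATUS (positions per the Intel SDM).
MCI_STATUS_VAL = 1 << 63  # register contains a valid error
MCI_STATUS_UC = 1 << 61   # error was not corrected by hardware
MCI_STATUS_PCC = 1 << 57  # processor context corrupted
MCI_STATUS_S = 1 << 56    # signaled via machine-check exception
MCI_STATUS_AR = 1 << 55   # software recovery action required


def classify_mci_status(status: int) -> str:
    """Simplified classification; ignores EN/OVER and assumes S/AR are valid."""
    if not status & MCI_STATUS_VAL:
        return "INVALID"
    if not status & MCI_STATUS_UC:
        return "CE"       # corrected error
    if status & MCI_STATUS_PCC:
        return "FATAL"    # uncorrected, context corrupted: not recoverable
    if not status & MCI_STATUS_S:
        return "UCNA"     # uncorrected, no action required
    if status & MCI_STATUS_AR:
        return "SRAR"     # software recoverable, action required
    return "SRAO"         # software recoverable, action optional


# Status value taken from the sample log in the fault example section.
print(classify_mci_status(0x8C00004000010090))
```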
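RAS Basic Block Diagram
The block diagram shows the basic RAS flow: after a hardware fault occurs, the hardware's RAS capability raises an interrupt or exception to notify the firmware or OS, and the software then applies the corresponding policy, such as panicking, executing recovery actions, or notifying the user.
As RAS features keep evolving and architectures differ, the RAS ecosystem has become diverse and varies with the use case. The diversity shows up in the following ways: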
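1. Diverse notification mechanisms
Broken down, the notification mechanisms include IRQ, exception, polling, SEA, SDEI, GPIO, and others.
2. Diverse modes
When a hardware fault is reported to the firmware first, and the firmware either handles it out of band or forwards it to the OS, this is called Firmware First Mode;
When a hardware fault is reported to the OS and the OS handles it, this is called Kernel First Mode;
The two modes can also be combined. Each has its strengths and weaknesses, so choose according to the situation. Take CEs as an example: a server that frequently sees large numbers of CE events will experience a CE IRQ storm, where CPUs spend long periods handling these IRQs, other tasks are not scheduled, and overall performance suffers.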
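One common way to contain such a storm is to stop taking an interrupt per CE once the rate crosses a threshold and fall back to periodic polling until things quiet down. The sketch below models that idea in user-space Python. The class name, the threshold of 15 interrupts per second, and the 5-second quiet period are all hypothetical values chosen for illustration; this is not the Linux kernel's storm-handling code.

```python
from collections import deque


class CeStormDetector:
    """Toy model: switch from interrupt mode to poll mode during a CE storm."""

    def __init__(self, threshold: int = 15, window_s: float = 1.0,
                 quiet_s: float = 5.0):
        self.threshold = threshold  # max interrupts tolerated per window
        self.window_s = window_s    # sliding window length in seconds
        self.quiet_s = quiet_s      # error-free time needed to leave poll mode
        self.mode = "interrupt"
        self._events = deque()      # timestamps of recent CE interrupts
        self._last_error = 0.0

    def on_ce_interrupt(self, now: float) -> str:
        """Record one CE interrupt and return the resulting mode."""
        self._last_error = now
        self._events.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while self._events and now - self._events[0] > self.window_s:
            self._events.popleft()
        if self.mode == "interrupt" and len(self._events) > self.threshold:
            self.mode = "poll"      # storm detected: stop taking per-CE IRQs
        return self.mode

    def on_poll(self, now: float, errors_found: int) -> str:
        """Periodic poll; return to interrupt mode after a quiet period."""
        if errors_found:
            self._last_error = now
        elif self.mode == "poll" and now - self._last_error >= self.quiet_s:
            self.mode = "interrupt"
            self._events.clear()
        return self.mode
```

Polling bounds the CPU time spent on CE handling at the cost of slower reporting, which is usually acceptable for errors the hardware has already corrected.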
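3. Diverse chip architectures and hardware
With the chip industry's growth in recent years, chip architectures and vendors have become increasingly diverse, including Intel, AMD, ARM, and RISC-V, and the hardware composition differs somewhat between them.
4. Diverse software
On the Linux driver side there are the mce, apei, and edac drivers, among others;
On the user-space RAS service side there are mcelog, rasdaemon, perf event notifications, and others;
Overall, RAS is a complex system: RAS capabilities differ across chip architectures and across hardware, and RAS developers need to choose the RAS solution that fits each business scenario.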
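RAS Fault Handling Flow
Take an Intel server as an example:
1. When a memory CE occurs on an Intel server, the hardware raises a CMCI interrupt and the interrupt handler registered by the OS runs;
2. The handler calls into the EDAC driver code, which reads the MCA status registers to obtain the hardware fault information, such as the severity, the location of the faulty hardware, and the fault address. The EDAC driver saves this information to /dev/mcelog;
3. Mcelog is a user-space service that parses the information from /dev/mcelog and saves it to /var/log/mcelog. By reading this file, users can find out whether the server has had hardware faults, along with key details such as when a fault occurred, which hardware was involved, and whether it was recovered;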
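Besides the mcelog path described in the steps above, the EDAC subsystem also exposes per-memory-controller error counters in sysfs (ce_count and ue_count under /sys/devices/system/edac/mc/mcN/), which gives a quick way to check whether a machine has seen memory errors. The helper below is a small sketch of reading those counters; the function name is my own, the root directory is a parameter so the code can be tried against a test directory, and it returns an empty result on machines where no EDAC driver is loaded.

```python
from pathlib import Path


def read_edac_counters(root: str = "/sys/devices/system/edac/mc") -> dict:
    """Return {"mc0": {"ce_count": n, "ue_count": m}, ...} from EDAC sysfs."""
    base = Path(root)
    if not base.is_dir():
        return {}  # EDAC not available (no driver loaded, or not Linux)
    counters = {}
    for mc_dir in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc_dir / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # counter missing or unreadable
        counters[mc_dir.name] = entry
    return counters


if __name__ == "__main__":
    for mc, counts in read_edac_counters().items():
        print(mc, counts)
```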
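RAS Hardware Fault Example
Below is the log from injecting a memory CE on an x86 server. The EDAC driver prints the hardware where the fault occurred (memory), the address (ADDR), the processor, the error type (CE), the memory channel and DIMM, and other details.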
[22715.830801] EDAC sbridge MC3: HANDLING MCE MEMORY ERROR
[22715.834759] EDAC sbridge MC3: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010090
[22715.834759] EDAC sbridge MC3: TSC 0
[22715.834759] EDAC sbridge MC3: ADDR 12345000 EDAC sbridge MC3: MISC 144780c86
[22715.834759] EDAC sbridge MC3: PROCESSOR 0:306e7 TIME 1422553404 SOCKET 0 APIC 0
[22716.616173] EDAC MC3: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x12345 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
Image source: v2-8e4986144a6a70301ee1a30c60c5ffad_720w.webp (720×378) (zhimg.com)
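To show how the fields in the final EDAC summary line of that log fit together, here is a small parsing sketch. The regular expression is written against this one sample line only and may need adjusting for other drivers or kernel versions; the function name is hypothetical, and the 4 KiB page size (page_shift=12) is an assumption. Under that assumption, combining page and offset reproduces the ADDR value printed earlier in the same log.

```python
import re

# The final summary line from the sample log.
EDAC_LINE = (
    "[22716.616173] EDAC MC3: 1 CE memory read error on "
    "CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x12345 offset:0x0 "
    "grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 "
    "channel_mask:1 rank:0)"
)

EDAC_PATTERN = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<severity>CE|UE) (?P<msg>.+?) on "
    r"(?P<label>\S+) \((?P<details>.*)\)"
)


def parse_edac_line(line: str, page_shift: int = 12):
    """Parse one EDAC summary line; returns None if the line does not match."""
    m = EDAC_PATTERN.search(line)
    if m is None:
        return None
    # Turn "key:value" pairs inside the parentheses into a dict.
    details = dict(re.findall(r"(\w+):(\S+)", m.group("details")))
    page = int(details["page"], 16)
    offset = int(details["offset"], 16)
    return {
        "mc": int(m.group("mc")),
        "count": int(m.group("count")),
        "severity": m.group("severity"),
        "message": m.group("msg"),
        "label": m.group("label"),
        "phys_addr": (page << page_shift) | offset,  # assumes 4 KiB pages
        "details": details,
    }


record = parse_edac_line(EDAC_LINE)
print(record["severity"], record["label"], hex(record["phys_addr"]))
```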