Computer Architecture Today

Informing the broad computing community about current activities, advances and future directions in computer architecture.

Over the last four years, data center hyperscalers (Meta, Google, Alibaba) have disclosed an unexpectedly high number of CPUs (~1 in 1000) that produce Silent Data Corruptions (SDCs), i.e., program executions that produce wrong results without any observable indication. Other data-parallel processing chips (AI accelerators from Google and Meta, and NVIDIA GPUs) have also been reported to generate SDCs. Reports converge on the root cause of such SDCs: silicon chips that are born defective (escaping manufacturing testing), become defective over time (aging), or simply differ from each other (timing variability).

Industry-driven initiatives (Meta, AMD, OCP), initiatives of professional organizations (the IEEE Computer Society RAS in Data Centers Summit), panels, special sessions, publications, and blog posts underline the problem’s importance. However, exaggeration (speakers stating that the “blast radius of SDCs is huge”) does not necessarily reflect the size of the problem. The problem is real and it needs attention and resources, but it is not going to break computing.

This blog post aims to contribute to the ongoing discussion about SDCs in computing, clarifying a few aspects and prioritizing some objectives. 

Terminology (always)

SDCs are neither detected nor corrected. An SDC is the eventual result (a terrible effect indeed) of a defect in a silicon chip when a program runs. The effect is not detected, and it is not corrected; that’s why it is called silent (unlike effects such as crashes, which can’t be missed). Per the fault-tolerant computing terminology used for decades: we detect defects (or their models, the faults) and we correct their effects, the errors, at different abstraction layers (circuit, microarchitecture, ISA, software, system) so that the machine delivers the expected service (defined differently across domains).

Statements in talks and papers like “an SDC detection scheme” and “the SDC is corrected” (even if the speaker or author does know what they are talking about) don’t educate newcomers to the topic.

What the computing community needs to do for SDCs is to minimize or, ideally, eliminate their rate. In a perfect world, the result of a computation produced by chips with or without defects should be either (a) correct (defect does not matter or correction worked), or (b) incorrect but known to be so (detection worked but no correction is possible).
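
To make the two desirable outcomes above concrete, here is a minimal, purely illustrative Python sketch of software-level redundant execution. It is not a scheme proposed in this post, and the function name and exception policy are hypothetical choices for the example; it only shows how detection turns a silent corruption into a known one.

```python
# Purely illustrative sketch: software-level redundant execution turns a
# would-be silent corruption into a detected (no longer silent) error.
# The function name and the exception policy are hypothetical choices made
# for this example, not a scheme described or endorsed in this post.

def run_twice_and_compare(fn, *args):
    """Run a computation twice and compare the two results.

    If a defect corrupts one of the two runs, the mismatch is reported
    instead of being silently consumed: outcome (b) above. No correction
    is attempted; the caller only learns the result cannot be trusted.
    Note that re-running on the same defective unit may reproduce the
    same wrong result, which is why real schemes rely on diverse
    hardware resources or independent checks.
    """
    first = fn(*args)
    second = fn(*args)
    if first != second:
        raise ValueError("mismatch between redundant executions: possible SDC")
    return first  # Outcome (a): the result is assumed correct.
```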

How many SDCs are happening over time?

The key words in the previous paragraph are “minimize” and “eliminate”, meaning reduce a number to a small value or turn it into zero. But how can we reduce or zero out a number if we don’t know its initial value (before any action is taken)?

This brings us to the issue of the “SDC incidents rate” (or just “SDC rate”). The “1 in 1000 chips” disclosure is very useful, but it only means that one chip out of 1000 has been identified to contain a defect that can generate SDCs. The number of SDC incidents happening per day/hour/minute is, however, not known. In a single CPU, GPU, or AI accelerator chip, the SDC rate depends on how often the chip runs programs with instructions that use the defective hardware unit at points in time (and thus under operating conditions) where the defect is active (not all defects are active all the time). At data center scale, the SDC rate also depends on the number of systems, the number of chips and cores, the utilization of systems, the defectivity of each chip type (a function of its manufacturing and testing quality as well as its age and use conditions), and, of course, the hardware and software detection and correction mechanisms already in place. Does the SDC rate sound hard to estimate? It really is.
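
To see why, consider the back-of-the-envelope sketch below. Every input is an invented placeholder rather than a number from any of the disclosures discussed here; the point is only that the estimate swings by an order of magnitude whenever one of the guesses does.

```python
# Back-of-the-envelope sketch of why the fleet-level SDC rate is hard to pin
# down. Every number below is a placeholder assumption for illustration; none
# comes from the disclosures discussed in this post.

chips_in_fleet     = 1_000_000   # assumed fleet size
defective_fraction = 1 / 1000    # "1 in 1000" chips found to host an SDC-prone defect
utilization        = 0.6         # assumed fraction of time chips run workloads
hit_probability    = 1e-6        # assumed probability, per active hour, that a workload
                                 # exercises the defective unit under conditions that
                                 # activate the defect
detection_coverage = 0.5         # assumed fraction of corruptions caught by existing
                                 # hardware/software checks

defective_chips   = chips_in_fleet * defective_fraction
active_hours_day  = 24 * utilization
raw_incidents_day = defective_chips * active_hours_day * hit_probability
silent_incidents  = raw_incidents_day * (1 - detection_coverage)

print(f"hypothetical silent incidents per day across the fleet: {silent_incidents:.4f}")
# Changing any of the guessed inputs by an order of magnitude changes the
# answer by an order of magnitude -- which is exactly the estimation problem.
```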

Why know the expected SDC rate?

As in all application areas where fault-tolerant computing is important, the cost that a designer, user, or operator needs (or can afford) to pay to achieve fault tolerance is the key aspect that drives decisions. The severity of the problem determines the cost; the severity here is the (unknown) SDC rate. Perfectionists may believe that “every single SDC matters”, but as in all engineering fields, the challenge lies in the performance/quality vs. cost tradeoffs. Computing faced a similar challenge when it had to deal with soft errors, and it addressed the problem by estimating severity (the FIT rate: failures in time, i.e., failures per billion device-hours) through modeling, simulation, and accelerated beam experiments, and then paying the cost of hardware- or software-based solutions in terms of performance, power, and silicon area (redundancy of any type to detect and correct is never free).
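
As a hedged illustration of how a severity estimate becomes an engineering number, the sketch below converts an assumed FIT rate into an expected yearly failure count for an assumed fleet; both inputs are invented for the example, and no equivalent number exists today for defect-induced SDCs.

```python
# How the soft-error community turns a severity estimate into a deployable
# number: a FIT rate (Failures In Time, failures per 10^9 device-hours).
# The values below are illustrative assumptions, not measured data.

fit_per_chip   = 100          # assumed soft-error FIT rate of one chip
fleet_chips    = 100_000      # assumed number of chips in a deployment
hours_per_year = 24 * 365

device_hours_per_year = fleet_chips * hours_per_year
expected_failures_per_year = fit_per_chip * device_hours_per_year / 1e9

print(f"expected failures per year across the fleet: {expected_failures_per_year:.1f}")
# 100 FIT x 100,000 chips x 8,760 hours is about 87.6 failures per year: a
# concrete number against which the cost of redundancy can be weighed.
# SDCs lack such a number today.
```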

Similarly, until tight and validated bounds for the SDC rates occurring in the wild are measured (“measuring the SDC rate” is a contradiction in terms; how do you measure silence?) or estimated, no easy decisions can be made about protecting large-scale systems from defects leading to SDCs (and about the cost such protection incurs). Unfortunately, unlike the soft errors case, the root causes are multiple and their contribution to the overall SDC rate varies over time and space.

The role of industry

Meta and Google are to be praised for raising awareness about the problem of SDCs at scale by sharing reports of their fleet findings and through other dissemination activities. They have mobilized other companies, the members of the Open Compute Project (OCP), to actively participate in the SDC discussion and engage with academia. AMD and Intel, following their long tradition of working with the research community, are actively participating in events, providing open test frameworks, and engaging with research teams.

Despite the honest recognition of the problem by many (not all) industry players, the research community should probably expect fewer inputs from industry than in the soft errors era. Although at that time the enemy was external (particles hitting silicon), information on how often flipped bits affected the operation of programs (failure rates) was scarce. Failure rates in computing systems are valuable trade secrets; no surprise there. Computer chips are perceived as flawless by a public that massively depends on their correct operation. Who would risk stating that they fail at certain rates (and so terribly, producing corrupted outputs of which nobody is aware)? In our era of SDCs due to silicon defects in chips, the root causes come from the inside. Silicon vendors, designers, system integrators, and software developers all have a role in the problem and its solutions. Let’s hope that field information about SDC rates more detailed than “occasionally the multiplier produces wrong results”, “the FPU produces wrong results more frequently than the integer ALUs”, “SDC incidents become more frequent when chips age and operate at low voltage”, “the chip may intermittently experience SDC resulting in incorrect results”, or “defects in unprotected hardware units are more likely to generate SDCs” will come out soon; if it doesn’t, we understand why.

Bad times or good times?

The research community is used to situations like this, where the importance of a research problem is clearly understood but the field information needed to analyze and mitigate it is insufficient. These are, of course, great days to work on computing systems resilience. Solutions are needed at all layers of the computing stack, from device physics and circuits all the way up to cloud system software, through every layer in between. Despite the public discussion’s focus on cloud computing, all other domains in the computing continuum are equally affected (HPC, edge, IoT), because silicon defects affect program execution in all systems. Still, each domain has completely different cost and performance constraints. Accurate estimates of the expected SDC rates (and their impact), as discussed above, are much needed before cost-effective mitigations can be decided in any domain.

About the Author: Dimitris Gizopoulos is a Professor of Computer Architecture at the University of Athens and works on dependability and energy efficiency, focusing on modeling, measuring, and mitigating the effects of defects, bugs, and variability in computing systems. Along with his research, he tried to organize the MICRO symposium twice but ended up hosting the computer architecture community on Zoom; he has promised himself another shot at a real edition.



Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.