by Dimitris Gizopoulos on Sep 16, 2024 | Tags: fault tolerance, Reliability, silent data corruptions
Over the last four years, data center hyperscalers (Meta, Google, Alibaba) have disclosed an unexpectedly high number of CPUs (~1 in 1000) that produce Silent Data Corruptions (SDCs), i.e., program executions that produce wrong results without any observable indication....
by Sudhanva Gurumurthi, Vilas Sridharan, and Sankar Gurumurthy on Jul 17, 2023 | Tags: Architecture, Debugging, fault tolerance, faults
Overview: Reliability is essential for computing. However, as technology nodes have scaled, there have been several fundamental physical challenges to overcome to provide the abstraction of reliability. One such challenge has been the emergence of marginal...