Computer Architecture Today

Informing the broad computing community about current activities, advances and future directions in computer architecture.
Archive of posts tagged: Reliability
SDCs: A B C

Over the last four years, data center hyperscalers (Meta, Google, Alibaba) have disclosed that an unexpectedly high share of CPUs (~1 in 1000) produce Silent Data Corruptions (SDCs), i.e., program executions that produce wrong results without any observable indication....

Silent Data Corruption at Scale

Hyperscalers are reporting frequent silent data corruptions (SDCs), also known as silent errors or corrupt execution errors (CEEs), in their cloud fleets, caused by silicon manufacturing defects. Notably, SDCs at scale exhibit error occurrence rates on the order of one fault...
