The discovery of Meltdown and Spectre, along with their extensive media coverage, brought hardware security research into the spotlight. A wake-up call for major chip manufacturers such as Intel, AMD, and ARM, these attacks taught us that hardware vulnerabilities can be exploited remotely from software and can break all isolation boundaries. This has led to a significant shift in how we think about computer architecture today. Despite this progress, there is still a long road ahead.
Meltdown and Spectre, later categorized as transient execution attacks, showed that the distinction between architecture and microarchitecture exists only in our heads, especially when it comes to security. In the early 2000s, as multi-tenant computers became the norm, a few research papers explored exploiting cache timing to attack cryptography, but it was mostly treated as an isolated problem whose solution was to change how one writes cryptographic software. Transient execution attacks showed that if you have never thought about security at the microarchitecture and silicon level, whatever you define on top of it, the ISA, can be very fragile and break miserably.
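To make this concrete, here is a minimal sketch of the classic Spectre v1 bounds-check-bypass gadget, in the spirit of the original Spectre paper (array names and sizes here are illustrative, not from any particular codebase):

```c
#include <stddef.h>
#include <stdint.h>

uint8_t array1[16];
uint8_t array2[256 * 512];     /* probe array: one cache line per byte value */
volatile size_t array1_size = 16;

/* If the branch predictor has been trained on in-bounds indices, an
 * out-of-bounds idx is still read transiently while the size check
 * resolves. The dependent load encodes the stolen byte into the cache,
 * and the attacker later recovers it by timing accesses to array2.
 * Architecturally nothing leaks, which is why reasoning at the ISA
 * level alone cannot rule this out. */
void victim(size_t idx) {
    if (idx < array1_size) {                      /* predicted taken */
        uint8_t secret = array1[idx];             /* transient OOB read */
        volatile uint8_t tmp = array2[secret * 512]; /* cache-channel encode */
        (void)tmp;
    }
}
```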
Since then, hundreds of research papers have been written about how to discover and detect such vulnerabilities, how to design more secure hardware, and how to develop software that is resilient. But how much real progress have we made in addressing hardware vulnerabilities and building more secure computers?
Impact assessment & response
Improper impact assessment leads to improper mitigation and response: either panicking too much about, or entirely dismissing, an otherwise critical security vulnerability.
Impact assessment for hardware vulnerabilities is hard. Hardware vulnerabilities are typically due to a unique interplay between the hardware and software. A general-purpose component that is affected by a vulnerability, e.g., a particular branch predictor, may exist in many different products under different execution and software constraints. A behavior that is considered a vulnerability in one environment may not be exploitable in a slightly different software configuration. And sometimes, it is not obvious what the actual loss of security (e.g., confidentiality, integrity, …) is when it comes to real workloads.
In the past few years, major vendors have formed (not so big) teams to reproduce reports of hardware vulnerabilities and assess their impact across different product lines. We have collectively gathered a lot of knowledge about the impact of such vulnerabilities across different execution environments, e.g., virtual machines, operating systems, and trusted execution environments. Security engineers go out of their way to review RTL code and microcode to understand the impact of a vulnerability on various products and use cases. At the upper layers of the stack, OS knobs and software policies are revisited regularly when there is a new finding. And red-team efforts have demonstrated end-to-end exploitation of these vulnerabilities in real scenarios.
The other challenge is coordinating with many stakeholders. I reported the Downfall vulnerability to Intel two months before I joined Google. Later on, when I saw the scale of the operation to respond to Downfall, I was humbled by how much more work needed to be done to mitigate it. Hundreds of people across the industry, including hardware and microcode engineers, kernel developers, security engineers and architects, lawyers, executives, and press teams, work together to ensure that a vulnerability affecting billions of users is mitigated properly and that users and developers do not panic.
But there are too many things to look out for and not enough people. Transient execution attacks are only the tip of the iceberg of hardware flaws impacting security.
Rethinking architecture
From an architectural perspective, most people in the industry now agree that if you care about hardware vulnerabilities, isolating hardware resources down to the gate level is crucial. For processors, isolating physical cores, as proposed by Amazon Nitro and Google ASI, in theory addresses a lot of such vulnerabilities. But the devil is in the details: these software architectures are only as good as the processor's ability to clear core-private memory resources. Clearing all microarchitectural resources across domain switches, an uphill battle between OS maintainers and hardware vendors, is very costly with today's hardware. Many of the buffers and register files inside the core were designed at a time when no one thought they would need to be cleared frequently.
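For a flavor of what such clearing looks like, here is a minimal sketch of the VERW-based sequence that kernels issue on MDS-affected Intel CPUs; the selector value below is a hypothetical stand-in for the kernel's writable data segment (e.g., Linux's __KERNEL_DS), and a real kernel must run this on every domain transition, which is exactly where the cost adds up:

```c
#include <stdint.h>

/* Sketch of the MD_CLEAR idiom: with updated microcode, executing VERW
 * with a writable data-segment selector overwrites store buffers, fill
 * buffers, and load ports as a side effect. Kernels issue this on
 * transitions between security domains, e.g., before returning to user
 * space. The 0x18 value is a hypothetical placeholder selector. */
static inline void clear_cpu_buffers(void)
{
    static const uint16_t ds = 0x18;  /* stand-in for __KERNEL_DS */
    __asm__ volatile("verw %[ds]" : : [ds] "m"(ds) : "cc");
}
```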
And the long-lived practice of scheduling separate security domains on sibling SMT threads has to go: SMT (e.g., Intel Hyper-Threading) in its current form provides little value in exchange for too much security risk.
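Linux's core-scheduling interface is one concrete answer on the software side: a process tags itself with a cookie, and the kernel never co-schedules it on sibling hyper-threads with tasks from another domain. A minimal sketch, assuming Linux 5.14+ built with CONFIG_SCHED_CORE:

```c
#include <stdio.h>
#include <sys/prctl.h>

/* Fallback definitions in case the libc headers predate core scheduling. */
#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE                    62
#define PR_SCHED_CORE_CREATE              1
#define PR_SCHED_CORE_SCOPE_THREAD_GROUP  1
#endif

int main(void)
{
    /* Create a fresh core-scheduling cookie for this whole process.
     * The scheduler will only run tasks sharing this cookie on the
     * sibling hyper-threads of a physical core; any other task gets a
     * different core, or the sibling thread stays idle. */
    if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
              PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0) != 0) {
        perror("PR_SCHED_CORE_CREATE");
        return 1;
    }
    /* ... run SMT-sensitive work here ... */
    return 0;
}
```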
Asset management is another crucial part of mitigating the risk of complex and messy hardware. Today, most server and mobile computers are equipped with dedicated cryptographic coprocessors that safeguard critical credentials within a small trusted computing base, physically isolated from everything else. These physically isolated cores have been a great line of defense in protecting cryptographic keys against software and hardware vulnerabilities. Of course, this approach is limited: what about other assets that cannot be kept on a small cryptographic coprocessor?
These vulnerabilities have also changed how we perceive security extensions. Hardware extensions for improving memory safety, sandboxing, and exploit mitigation have been an exciting area of research for the computer architecture community. Today, handling transient execution attacks and side channels is always part of the discussion around these features. For example, one big drawback of ARM MTE and PAC is that they do not provide any protection against speculative execution attacks. That does not make them useless, but this limitation hinders their deployment in places where the attacker has access to a rich execution environment (e.g., JavaScript).
Buggy hardware or design flaws?
Are these the only hardware vulnerabilities we should be concerned about? Strangely, while the community has mapped most components that are potentially vulnerable to speculative execution and side channels, other hardware defects still interact with core security primitives in mysterious ways. Last year, Google researchers discovered Reptar and Zenbleed, which are, in some ways, overlooked hardware errata with critical impact on security. Unfortunately, there is little understanding and knowledge of how various bugs and errata affect security and reliability. For decades, errata reported by design validation (DV) teams have either been silently fixed or ignored, and that will not change without a greater understanding of their impact on security and reliability at the upper layers of the stack.
Current DV practices are also limited in catching real vulnerabilities. Pre-silicon validation is prioritized because, once a chip is taped out, not a lot of people want to go back and look at what was wrong with it. But pre-silicon testing typically can only focus on a subset of components and doesn’t deal with real workloads. For example, many of the buffers and components inside Intel CPUs were entirely fine from a validation perspective, but their complex interaction with microcode and software has resulted in several vulnerabilities such as microarchitectural data sampling.
In the past few years, we have seen more research into post-silicon testing and fuzzing. It is always bad news to learn that your silicon is defective after tapeout, but better late than never: we can still fix the bugs in the next revision of the chip, or at least learn from the mistakes for future designs. The SiliFuzz project is an example of a running production system that has found CPU bugs that were not caught by hardware DV teams. While promising, hardware fuzzing feels like it is 15 years behind established software fuzzing practices.
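As a toy illustration of the underlying idea (this is not SiliFuzz itself, just a sketch of differential post-silicon testing): run an identical deterministic computation pinned to each core and flag any core whose result diverges, the way production fuzzers catch defective execution units on individual chips.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Deterministic mix of multiplies, shifts, and xors. A real post-silicon
 * fuzzer would execute randomized instruction sequences instead. */
static uint64_t workload(uint64_t seed) {
    uint64_t x = seed;
    for (int i = 0; i < 1000000; i++) {
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
        x ^= x >> 29;
    }
    return x;
}

int main(void) {
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    uint64_t reference = workload(42);  /* expected result on a good core */
    for (long cpu = 0; cpu < ncpus; cpu++) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0)
            continue;  /* core offline or not allowed */
        if (workload(42) != reference)
            printf("core %ld diverges: possible silicon defect\n", cpu);
    }
    return 0;
}
```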
Final thoughts
We still don’t understand hardware vulnerabilities well enough. Fundamentally, it is unclear which development practices are more likely to introduce bugs and vulnerabilities. Hardware vulnerabilities are not limited to transient execution attacks or CPU vulnerabilities. It is inevitable that as we build more diverse high-performance processors of all kinds, CPUs, GPUs, xPUs, we also introduce more bugs and vulnerabilities, including kinds we have never seen before and did not account for in the design.
It is all the more important to design hardware and tools that help us find these issues early and prepare us for doomsday. I believe we need to develop new simulation tools and hardware interfaces that facilitate analysis and testing, improve isolation throughout the entire hardware stack, and design reconfigurable and patchable components that account for the unseen. Intel has gotten lucky several times, being able to mitigate vulnerabilities through microcode; that is something for the rest of the industry to reflect on.
About the author: Daniel Moghimi is a Senior Research Scientist at Google, working on automated agents for hardware vulnerability analysis and computer architecture for data security and privacy. His past work has improved the security of superscalar CPUs, memory subsystems, and cryptographic implementations, which billions of users use daily.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.