Once upon a time, cores and memory ran at similar speeds, and programs could read and write memory directly without complications. The load-store interface was born as a simple way to give programs access to data, and, at this stage in computing history, this interface made perfect sense — what else could it have been?
Memory systems have changed enormously in the last three decades, but you wouldn’t know it from the memory interface. Programmers still access data through loads and stores, albeit with a few extra features (e.g., software prefetching, cache partitioning). This is a problem because the load-store interface is fundamentally compute-centric — data moves to compute, not vice versa — and it gives software too little control over data. Partly as a result, systems are burdened with tons of unnecessary data movement that increasingly limits their performance.
It’s past time to revisit the memory interface. Systems should open up the memory hierarchy to programmers, giving software visibility into and control over data movement throughout the system. Doing so will let systems solve a host of problems that are currently intractable or too expensive to solve in software. This means revisiting several old ideas whose time has come again, while adapting them to a modern context.
Memory systems have changed beyond recognition. Why hasn’t their interface?
Systems have grown from simple cores connected directly to memory into heterogeneous multicores with deep cache hierarchies that distribute memory throughout the chip, all connected via an on-chip network. At each step along the way, memory-system design changed in an evolutionary fashion to minimize disruption to software. To be sure, there have been significant microarchitectural changes: modern cache hierarchies are complex beasts running many sophisticated caching policies. But to software, the picture looks barely changed. There are a few new features, like cache partitioning, that can be impactful, but the load-store paradigm remains.
The cache hierarchy in a modern multicore consumes a huge share of resources: roughly half of chip area and a similar fraction of power. Yet its operation is opaque to software, and software has at best indirect control over data movement. Application throughput, tail latency, and power can vary by an order of magnitude depending on whether data fits in caches. Data movement often determines whether an application meets its goals, and only software knows what those goals are. Why, then, does software have so little control over data movement?
Maybe current memory hierarchies are good enough? (They aren’t.)
The load-store interface hasn’t survived out of mere inertia. There are many advantages to sticking with simple, read-write semantics and keeping the cache hierarchy purely microarchitectural. It is simpler to reason about correctness, and hardware designers have a huge playground of ideas to improve performance transparently to software. The question is whether its advantages continue to outweigh its costs in an era where data movement dominates power and performance. Some have proposed adding hints to the load-store interface to let cache hierarchies make better decisions, which is a solid evolutionary improvement, but do these proposals go far enough? Is there a demonstrated need for a more radical change to the memory interface?
Absolutely, there is! Recent architecture conferences are rife with dozens (probably more) of proposals that specialize the memory hierarchy in various ways. The clearest examples are those that offload computation into the memory hierarchy, where software directly augments the memory interface with its own operations to avoid unnecessary data movement. Other examples change the traditional semantics of a memory hierarchy, e.g., to support dynamic object management, improve prefetching or streaming, support commutative operations, or accelerate graph workloads. Still other examples add custom hardware within the memory hierarchy for specific tasks, such as garbage collection, memory de-duplication, data transformations, graph traversals, or file-system redundancy. And this is just the tip of the iceberg.
These proposals demonstrate that there are huge opportunities on the table, if we are willing to specialize the memory hierarchy. But it is unlikely that these ideas will be implemented in a traditional cache hierarchy if they require custom, fixed-function hardware. Design and verification costs will overwhelm the benefits of accelerating any single task. The only path for these ideas to see the light of day is for software to be given radically more control over the memory hierarchy, letting applications re-program it to behave as desired.
A sketch of a future, programmable memory hierarchy
Hence, if we buy that current memory hierarchies aren’t cutting it, then the solution is to give software more control, so that the memory hierarchy can be specialized as needed by different tasks and applications. That means the memory interface must change, giving software visibility and control over things that have been hidden for thirty years or more. One question is whether it’s possible to address these needs without giving up the proven advantages of modern cache hierarchies. Can we have our cake and eat it too?
The goals for a modern, programmable cache hierarchy would be, at a minimum, to give software visibility into important cache events that are currently hidden (e.g., when data moves between levels) and to let software interpose at critical points. This ability should be optional: we should preserve the semantics and high performance of load-store access when it works well. The latter constraint rules out extreme processing-in-memory (PIM) designs that eliminate the cache hierarchy entirely. PIM designs can work well for bandwidth-limited applications with regular access patterns, where it is more effective to distribute compute near data. But PIM is a poor fit for the majority of applications, which have good data locality and benefit greatly from a cache hierarchy.
Some good news is that we are not flying blind. Two decades ago, a variety of programmable memory and in-memory processing ideas were developed, such as FLASH, Impulse, IRAM, Tempest and Typhoon, and Smart Memories. Many of these designs focused on cache coherence, with the idea that different applications would optimize the coherence protocol in different ways. Today, we face a different and wider set of challenges, with much weaker tailwinds from Moore’s Law. We therefore need a highly programmable solution that is simultaneously highly efficient.
A sketch of a solution can be drawn by combining several threads of past and recent work. Systems can borrow an idea from earlier programmable memory designs: place lightweight, programmable compute engines inside the memory hierarchy itself, onto which software offloads computation to execute near data.

A key question will be the programming interface. Depending on the use case, software may explicitly offload computation into the memory hierarchy, or it may request that computation run implicitly in response to data movement. The explicit-offload case is more familiar and can be addressed via task-parallel programming, remote memory operations, or (less efficiently) by migrating code. Support for long-lived, stateful computations (i.e., threads) may also be important for some workloads, such as prefetching or in-cache accelerators. The implicit-offload case requires a different approach, whereby software registers callbacks that trigger tasks in response to cache events such as misses, evictions, and writebacks (a concrete sketch appears below). These callbacks give software visibility into data movement that is currently hidden and also let software modify the memory hierarchy’s behavior. In either case, low-level code must be hidden from application programmers via libraries or compiler support.

Microarchitecturally, there is a wide-open design space for a programmable memory hierarchy. Coarse-grain reconfigurable arrays (CGRAs) are a good candidate for the engines, as recent work has demonstrated CGRAs that come within a small factor of ASIC performance and energy efficiency. Moreover, the CGRA fabrics can be small because tasks will be short (otherwise, just move the data to a core), and they can be specialized for common operations to improve efficiency. Hence, one attractive architecture would be a tiled design with a core, cache slice, and CGRA fabric per tile. Going beyond the programming interface and microarchitecture, there are a host of other issues to address, such as OS integration, virtualization, and security.
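To make the programming interface more concrete, here is a minimal sketch, in C, of what explicit offload and event-driven callbacks might look like to systems software. To be clear, everything here is hypothetical: the mh_* names, event types, and semantics are invented for illustration and do not correspond to any existing hardware or library.

```c
/*
 * Hypothetical sketch of a programmable memory-hierarchy interface.
 * The mh_* functions and types are invented for illustration; no such
 * API exists today.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdatomic.h>

/* --- Explicit offload: run a short task near the data it touches. --- */

/* A near-data task: a short function that hardware would run on the
 * compute engine (e.g., a CGRA fabric) closest to `addr`, rather than
 * moving the data to a core. */
typedef uint64_t (*mh_task_fn)(void *addr, uint64_t arg);

/* Execute `fn` at the cache slice or memory controller holding `addr`
 * and return its result; analogous to a remote memory operation. */
uint64_t mh_offload(mh_task_fn fn, void *addr, uint64_t arg);

/* --- Implicit offload: callbacks triggered by data movement. --- */

typedef enum {
    MH_EVENT_MISS,      /* an access missed at some cache level */
    MH_EVENT_EVICTION,  /* a cache line is being evicted        */
    MH_EVENT_WRITEBACK  /* a dirty line is being written back   */
} mh_event_t;

/* Callback run near the cache when `event` occurs on a line within a
 * registered address range. */
typedef void (*mh_callback_fn)(mh_event_t event, void *line_addr);

/* Register `cb` to fire on `event` for any line in [base, base+len).
 * Returns 0 on success. */
int mh_register_callback(mh_event_t event, void *base, size_t len,
                         mh_callback_fn cb);

/* Example: count evictions from a hot buffer, so the application can
 * tell whether its working set fits in cache. */
static _Atomic uint64_t hot_evictions;

static void count_eviction(mh_event_t event, void *line_addr)
{
    (void)event; (void)line_addr;  /* unused in this trivial example */
    hot_evictions++;
}

void monitor_hot_buffer(void *buf, size_t len)
{
    mh_register_callback(MH_EVENT_EVICTION, buf, len, count_eviction);
}
```

The eviction-counting example is deliberately trivial; the same registration mechanism could drive richer tasks, such as prefetching the next node of a pointer-based structure on a miss, or transforming data on writeback. In practice, such callbacks would be generated by libraries or compilers rather than written by application programmers directly.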
Clearly, there is a lot of work to be done. This is to be expected: we are paying down a technical debt accumulated over three decades. But the alternative is to watch as systems are steadily drowned in data movement. Who doesn’t love a good challenge?
About the author: Nathan Beckmann is an assistant professor in the Computer Science Department and (by courtesy) Electrical and Computer Engineering Department at Carnegie Mellon University.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.