One of the long-standing debates in computer systems is shared nothing versus shared memory. Should parallel computers provide the illusion of shared memory, or should they do away with support for sharing altogether?
The debate has seen a resurgence with the rise of warehouse-scale computing. While today's datacenters are typically built from commodity multiprocessors that support sharing within each node, there is, for the most part, no sharing across nodes.
This status quo, however, appears to be an uncomfortable truce. There have been calls to do away with shared memory completely, even within individual nodes. At the same time, there is an opposing trend of exposing the memory of multiple nodes as a single shared address space.
What is the way forward? Global shared memory? Pure shared nothing? Somewhere in the middle?
Whatever the path, we will argue that architects can add value by transferring to the datacenter the insights and lessons learned in supporting shared memory.
The importance of shared state
Datacenter operators and application programmers invariably need to maintain and program with shared state. Coordination services, messaging systems, graph applications, and deep learning are examples of use cases that rely on NoSQL-style key-value stores (KVSes). The debate, therefore, is not about whether a shared state abstraction needs to be maintained; it is about how. Should hardware, the OS, or middleware enforce this abstraction?
All shared state abstractions, be they a datacenter KVS or hardware shared memory, have a number of features in common. They provide a global namespace (cf. address space) via a familiar read/write/read-modify-write API. They often replicate (cf. cache) objects for performance and fault tolerance, and they rely on replication protocols (cf. cache coherence protocols) to keep the replicas consistent.
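To make the analogy concrete, the sketch below (in Python, with purely illustrative names rather than any particular system's API) shows the shape of such an abstraction: a global namespace of keys accessed through read/write/read-modify-write, with every object kept on several replicas.

```python
# A minimal sketch of a shared-state abstraction: a global namespace of keys
# (cf. addresses) accessed via read/write/read-modify-write, with every key
# replicated (cf. cached) on several nodes. Names are illustrative only.

class ReplicatedStore:
    def __init__(self, num_replicas):
        # One dict per replica stands in for a node's copy of the data.
        self.replicas = [dict() for _ in range(num_replicas)]

    def read(self, key, node=0):
        # cf. a load served by a local cache.
        return self.replicas[node].get(key)

    def write(self, key, value):
        # cf. a store; here the "replication protocol" is simply updating every
        # copy. Doing this efficiently and consistently is where the hard
        # problems (and the rest of this post) begin.
        for replica in self.replicas:
            replica[key] = value

    def read_modify_write(self, key, fn):
        # cf. an atomic read-modify-write such as fetch-and-add or CAS.
        new_value = fn(self.replicas[0].get(key))
        self.write(key, new_value)
        return new_value

store = ReplicatedStore(num_replicas=3)
store.write("counter", 0)
store.read_modify_write("counter", lambda v: v + 1)
print(store.read("counter", node=2))  # prints 1
```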
Sadly, but unsurprisingly, these abstractions also share a vexing problem: the tension between programmability and performance. Keep the replicas strongly consistent and forgo performance, or keep them weakly consistent and forgo programmability.
This conundrum has a more severe performance implication for a datacenter KVS than for hardware shared memory, because a KVS must also be available, i.e., it must continue to function amidst node failures and dropped messages. Supporting both availability and strong consistency, without sacrificing performance, is hard.
Faced with these opposing requirements, practitioners have settled on a solution that constitutes the state of the art: a KVS that offers multiple consistency guarantees (cf. relaxed memory models). Such an API, however, shifts the burden of finding the sweet spot between performance and correctness onto the programmer, who is forced to reason about the appropriate consistency level for every access.
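The sketch below, with hypothetical consistency levels loosely modeled on quorum-style KVSes rather than any specific system, shows where that burden lands: every call asks the programmer to choose a level, trading latency against the guarantees the surrounding code must be able to tolerate.

```python
from enum import Enum

# Hypothetical per-operation consistency levels, loosely modeled on
# quorum-style KVSes; not any specific system's interface.
class Consistency(Enum):
    ONE = 1      # fastest: a single replica acks; reads may be stale
    QUORUM = 2   # slower: a majority of replicas; stronger guarantees
    ALL = 3      # slowest: every replica; strongest, least available

class MultiConsistencyKVS:
    """In-memory stand-in: a real system would contact one, a majority,
    or all replicas depending on the requested level."""
    def __init__(self):
        self.data = {}

    def read(self, key, consistency):
        return self.data.get(key)

    def write(self, key, value, consistency):
        self.data[key] = value

def transfer(kvs, src, dst, amount):
    # The burden in action: for every access, the programmer must decide how
    # much consistency the surrounding logic needs and pay for it in latency.
    balance = kvs.read(src, consistency=Consistency.QUORUM)
    kvs.write(src, balance - amount, consistency=Consistency.QUORUM)
    kvs.write(dst, kvs.read(dst, consistency=Consistency.QUORUM) + amount,
              consistency=Consistency.QUORUM)
    # A low-stakes hint can tolerate staleness, so a cheaper level suffices.
    kvs.write("last_transfer", (src, dst), consistency=Consistency.ONE)

kvs = MultiConsistencyKVS()
kvs.write("alice", 100, Consistency.ALL)
kvs.write("bob", 0, Consistency.ALL)
transfer(kvs, "alice", "bob", 25)
print(kvs.read("bob", Consistency.QUORUM))  # prints 25
```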
How can the architect help?
High-performance networking technologies and RDMA are starting to make the warehouse-scale computer look like a NUMA machine. In this context, how can we fulfill the conflicting demands for a programmable yet performant shared state abstraction?
The thesis of this post is that viewing the datacenter as a cache-coherent NUMA machine can serve as a guiding metaphor. Below, we discuss a few examples.
Coherence-inspired Replication. Cache coherence protocols typically assume fault-free controllers and reliable message delivery, assumptions that lead to very efficient protocols.
For example, a coherence write typically invalidates all sharers; in doing so, it guarantees that any valid cached copy is up to date, allowing reads to be served locally from private caches, a powerful optimization, especially for read-heavy workloads. In contrast, replication protocols, which are often drawn from the distributed systems literature, are pessimistic in that they explicitly account for failures using quorums. A write, for instance, cannot simply wait until all replicas have been invalidated, because one of the replicas might have died! Instead, distributed systems often employ a more sophisticated, but slower, protocol in which both writers and readers must communicate with a majority of the replicas (a quorum), thereby losing out on local reads.
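The contrast can be made concrete with a heavily simplified, failure-free sketch of the two read paths (illustrative only; everything that makes real protocols subtle is omitted): the coherence-style system pays at write time so that reads stay local, whereas the quorum-style system pays at every read.

```python
import random

# A heavily simplified, failure-free sketch contrasting the two read paths;
# it is illustrative only and omits everything that makes real protocols subtle.

class CoherenceStyleReplica:
    """Coherence-inspired: a write invalidates/updates every copy, so any
    valid local copy is up to date and reads never leave the node."""
    def __init__(self):
        self.peers, self.store = [], {}

    def write(self, key, value):
        for peer in self.peers:            # pay at write time: touch all sharers
            peer.store[key] = value
        self.store[key] = value

    def read(self, key):
        return self.store.get(key)         # purely local read

class QuorumReplica:
    """Quorum-based: both writes and reads contact a majority, tolerating
    replica failures but giving up local reads."""
    def __init__(self):
        self.peers, self.store, self.version = [], {}, {}

    def write(self, key, value, ts):
        majority = [self] + random.sample(self.peers, len(self.peers) // 2)
        for replica in majority:           # a majority acknowledges the write
            replica.store[key], replica.version[key] = value, ts

    def read(self, key):
        majority = [self] + random.sample(self.peers, len(self.peers) // 2)
        freshest = max(majority, key=lambda r: r.version.get(key, -1))
        return freshest.store.get(key)     # pay at read time: consult a majority

def connect(replicas):
    for r in replicas:
        r.peers = [p for p in replicas if p is not r]

coherent = [CoherenceStyleReplica() for _ in range(3)]
connect(coherent)
coherent[0].write("x", 42)
print(coherent[2].read("x"))               # 42, with zero read-time communication

quorum = [QuorumReplica() for _ in range(3)]
connect(quorum)
quorum[0].write("x", 42, ts=1)
print(quorum[2].read("x"))                 # 42, after consulting a majority
```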
Although guaranteeing availability in the presence of failures is critical, failures are relatively infrequent in the datacenter. Google reports that a service running on 2000 servers sees roughly one failure every 2.5 hours; put differently, 2000 servers run for 2.5 hours, on average, without a single failure. Therefore, it makes sense to strike a balance: a fast path that allows for local reads, and a slow path that handles failures when they do happen.
The protocol in the distributed systems literature that comes closest to this philosophy is chain replication. It propagates writes to all replicas, but the propagation happens in a predetermined order (the chain), which serializes the writes. In case of a failure, the chain is reconfigured (on the slow path) to exclude the failed node. Thus, although chain replication permits local reads, it suffers from high write latency owing to the chain. In contrast, the recent Hermes protocol achieves both local reads and fast writes using a coherence-inspired fast path that propagates invalidations to all replicas but uses timestamps (instead of the chain) to serialize writes.
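The sketch below gives a rough flavor of such a coherence-inspired, failure-free fast path: invalidate every replica and use a Lamport-style timestamp, rather than a chain, to order concurrent writes. It is a loose paraphrase for illustration, not the actual Hermes protocol, and the slow path that handles failures is omitted entirely.

```python
# A loose, failure-free sketch of an invalidation-based write serialized by
# timestamps, in the spirit of the fast path described above. This is not the
# actual Hermes protocol: membership changes, the slow path, and per-node
# concurrency control are all omitted.

class Replica:
    def __init__(self, node_id):
        self.node_id = node_id
        self.value, self.ts = {}, {}   # committed values and their timestamps
        self.state = {}                # "VALID" or "INVALID" per key
        self.pending = {}              # value buffered by an invalidation

    def recv_inv(self, key, value, ts):
        # Accept the invalidation only if it carries a higher timestamp;
        # the (counter, node_id) pair totally orders concurrent writes.
        if ts > self.ts.get(key, (0, -1)):
            self.state[key], self.pending[key], self.ts[key] = "INVALID", value, ts
        return True                    # ack

    def recv_val(self, key, ts):
        # Validate the matching write; local reads may resume afterwards.
        if self.ts.get(key) == ts:
            self.value[key], self.state[key] = self.pending[key], "VALID"

    def local_read(self, key):
        # Reads are served locally, but must wait out an in-flight write.
        assert self.state.get(key, "VALID") == "VALID", "key is being written"
        return self.value.get(key)

def write(coordinator, replicas, key, value, counter):
    ts = (counter, coordinator.node_id)                    # Lamport-style timestamp
    acks = [r.recv_inv(key, value, ts) for r in replicas]  # invalidate all replicas
    if all(acks):                                          # fast path: everyone reachable
        for r in replicas:
            r.recv_val(key, ts)

replicas = [Replica(i) for i in range(3)]
write(replicas[0], replicas, "x", 7, counter=1)
print(replicas[2].local_read("x"))  # served locally: prints 7
```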
One problem with writes that wait for all replicas is that the slowest replica determines the write latency. With low-latency RDMA and typical replication degrees of just 3 to 9 nodes, this may not be a problem today, but making the approach work for high replication degrees is an important research challenge.
Network Ordering. A number of coherence protocols have relied on the interconnect to provide ordering guarantees, an approach that not only simplifies the protocol but can also reduce latency. A classic example is a bus-based snooping protocol that uses the bus as an ordering point, allowing a write to complete as soon as it has been ordered on the bus. Network-Ordered Paxos applies an optimization that is similar in spirit, both simplifying consensus in the datacenter and improving its performance.
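A toy sketch of the idea follows (it is not Network-Ordered Paxos itself): once a sequencer, standing in for the bus or a programmable switch, stamps each operation, every replica applies operations in stamp order, so the replicas cannot diverge on ordering.

```python
import itertools

# A toy illustration of network-assigned ordering (not Network-Ordered Paxos):
# a sequencer stamps each operation, and replicas apply operations in stamp
# order, so they can never disagree on the order of writes.

class Sequencer:
    """Stands in for the ordering point: a bus, or a programmable switch."""
    def __init__(self):
        self.counter = itertools.count(1)

    def stamp(self, op):
        return (next(self.counter), op)

class OrderedReplica:
    def __init__(self):
        self.log, self.buffer, self.next_seq = [], {}, 1

    def deliver(self, seq, op):
        # Buffer out-of-order arrivals and apply strictly in sequence order.
        self.buffer[seq] = op
        while self.next_seq in self.buffer:
            self.log.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1

sequencer = Sequencer()
replicas = [OrderedReplica() for _ in range(3)]
for op in ["put x=1", "put y=2", "put x=3"]:
    seq, stamped_op = sequencer.stamp(op)
    for r in replicas:
        r.deliver(seq, stamped_op)
# Every replica ends up with the identical, network-imposed order.
assert all(r.log == replicas[0].log for r in replicas)
```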
One limitation of network-imposed ordering is that it inevitably constrains concurrency. Perhaps this can be alleviated with architectural support for achieving high-throughput ordering.
Shared memory-inspired Consistency. Is there a consistency API that simplifies programming while still allowing the system to maximize performance?
The distributed systems community has been exploring various consistency models in search of the holy grail. Causal consistency is one such model that has garnered renewed interest. (For memory model enthusiasts, it is akin to TSO but without multi-copy atomicity.)
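For those enthusiasts, the classic independent-reads-of-independent-writes (IRIW) litmus test, sketched below as commented pseudothreads, captures the difference: multi-copy atomicity (as in TSO) forbids two readers from disagreeing on the order of two unrelated writes, whereas causal consistency allows it.

```python
# The IRIW litmus test. x and y start at 0; each column is a different node.
#
#   T1: write(x, 1)              T2: write(y, 1)
#
#   T3: r1 = read(x)  # sees 1   T4: r3 = read(y)  # sees 1
#       r2 = read(y)  # sees 0       r4 = read(x)  # sees 0
#
# The outcome r1=1, r2=0, r3=1, r4=0 means T3 observed x's write before y's,
# while T4 observed y's write before x's. A multi-copy-atomic model such as
# TSO forbids this outcome; causal consistency allows it, because the two
# writes are causally unrelated and different nodes may observe them in
# different orders.
```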
Meanwhile, in the shared memory world, after a debate spanning three decades, the community has converged on the data-race-free (DRF) programming paradigm, in which the programmer labels synchronization operations and, in return, gets the benefits of both performance and strong consistency.
Kite advocates a DRF-inspired API for KVSes. Such an API allows nonblocking (i.e., naturally fault-tolerant) shared memory algorithms to be ported to the KVS.
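The flavor of such an API is sketched below with hypothetical method names (this is not Kite's actual interface): most accesses are plain, weakly ordered operations, and only the accesses the programmer labels as synchronization pay for strong ordering, mirroring release and acquire in DRF shared memory.

```python
# A sketch of a DRF-inspired KVS API with hypothetical method names (not
# Kite's actual interface). Plain accesses are cheap and weakly ordered;
# only labeled synchronization accesses pay for strong ordering.

class DRFStyleKVS:
    def __init__(self):
        self.data = {}

    # Relaxed data accesses: may be reordered or served by weakly consistent replicas.
    def read(self, key): return self.data.get(key)
    def write(self, key, value): self.data[key] = value

    # Labeled synchronization accesses: the system enforces ordering here,
    # analogous to acquire loads and release stores in DRF shared memory.
    def read_acquire(self, key): return self.data.get(key)
    def write_release(self, key, value): self.data[key] = value

def producer(kvs):
    kvs.write("result:part1", 10)     # plain data writes
    kvs.write("result:part2", 32)
    kvs.write_release("ready", True)  # release: publishes the data above

def consumer(kvs):
    if kvs.read_acquire("ready"):     # acquire: synchronizes with the release
        # Data-race-free by construction, so these reads see the producer's
        # writes even though they are plain, relaxed accesses.
        return kvs.read("result:part1") + kvs.read("result:part2")

kvs = DRFStyleKVS()
producer(kvs)
print(consumer(kvs))  # prints 42
```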
Going forward, there are other interesting parallels and opportunities. KVSes must be able to tolerate catastrophic failures, such as an entire datacenter going down, so data needs to be replicated not only within a datacenter but also across datacenters. This leads to natural replication domains: a rack, a datacenter, and beyond. These replication domains bear a strong resemblance to coherence domains, also known as scopes, in the world of heterogeneous shared memory. Can scoped consistency models be adapted to distributed systems?
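If scoped consistency were carried over, the API might look something like the hypothetical sketch below, where each synchronization access names the replication domain it must be ordered within, so that a rack-local handoff need not pay for cross-datacenter coordination.

```python
from enum import Enum

# A hypothetical sketch of scoped synchronization for a geo-replicated KVS,
# borrowing the notion of scopes from heterogeneous shared memory. Nothing
# here corresponds to an existing system's API.

class Scope(Enum):
    RACK = 1        # order only with respect to replicas in the same rack
    DATACENTER = 2  # order within the datacenter
    GLOBAL = 3      # order across datacenters (most expensive)

class ScopedKVS:
    def __init__(self):
        self.data = {}

    def write_release(self, key, value, scope):
        # A real system would wait only on replicas within `scope`, making
        # rack-scoped synchronization far cheaper than global synchronization.
        self.data[key] = value

    def read_acquire(self, key, scope):
        return self.data.get(key)

kvs = ScopedKVS()
kvs.write_release("job:42:done", True, scope=Scope.RACK)      # cheap, rack-local handoff
kvs.write_release("checkpoint:epoch", 7, scope=Scope.GLOBAL)  # survives a datacenter outage
print(kvs.read_acquire("checkpoint:epoch", scope=Scope.GLOBAL))  # prints 7
```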
Protocol Offloading. In shared memory multiprocessors, coherence protocols are offloaded onto specialized controllers and do not consume CPU cycles for protocol actions. With high-performance networking making inroads into the datacenter, power-limited CPUs, rather than the network, could turn out to be the bottleneck. It is natural, then, to consider offloading protocol actions. Consensus in a Box is one example that explores the benefits of offloading a consensus protocol onto an FPGA.
Going forward, architects can help by identifying the set of primitives that are useful for enforcing consistency in a datacenter setting, and could make a case for hardware-supported replication, drawing on their extensive experience in designing hardware coherence protocols.
Summary and a call to action
Datacenters need to offer a performant yet programmable shared state abstraction. With the advent of high-performance networking and RDMA, the datacenter is starting to look like a distributed shared memory multiprocessor. We have argued that this analogy not only helps architects contextualize recent advances but also informs future research. The time is ripe for architects and systems researchers to come together to address this important problem.
About the Authors: Vijay Nagarajan and Boris Grot are Associate Professors; Vasilis Gavrielatos and Antonis Katsarakis are senior PhD students, all from the University of Edinburgh.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.