Seemingly insatiable application demands for memory bandwidth, coupled with the energy needed to sustain high off-chip bandwidth, are putting increasing pressure on main memory systems. In the quest for solutions that provide higher performance and better energy efficiency than the traditional wide, parallel, off-chip main memory interfaces (e.g., DDR), newer memory architectures have emerged that exploit techniques such as tight in-package integration and high-speed serial links. JEDEC High Bandwidth Memory (HBM) is one such example that allows the DRAM and a processor to be integrated within a single package and communicate via a silicon interposer or bridge (i.e., a layer of silicon that overlaps both chips and allows for very high bandwidth communication through it), an arrangement also referred to as 2.5D integration. Another example is the Hybrid Memory Cube (HMC), which uses high-speed serial links in pursuit of similar goals. Coincidentally, both HBM and HMC rely on vertical stacking of multiple DRAM dies (“3D stacking”) internally within the memory modules for increased capacity and bandwidth. The similarity of the 3D internal structures, along with the similar overarching goals of improved performance and efficiency, has led to HBM and HMC being treated interchangeably as “3D DRAM” in some research studies. However, the fundamentally different approaches to physical memory integration taken by HMC and HBM have important implications for architectural studies. Our goal in this blog is to draw attention to some of these important differences and highlight their implications for research in 3D DRAM and related technologies.
The emergence of high-performance DRAM architectures such as HBM and HMC represents an exciting development in the evolution of external memory solutions. Correspondingly, the architecture research community has embraced 3D-stacked memories, and platforms that use them: 18% of the papers in recent high-profile architecture conferences (ISCA’20, MICRO’20, and HPCA’20) refer to at least one of HBM and HMC. However, as the architecture community delves deeper into solutions that employ 3D DRAMs and innovates on 3D DRAM architectures themselves, it is important to keep in mind that there are fundamentally different ways to incorporate 3D DRAM at the system level, and they are not interchangeable.
Contrasting 3D-stacked memory solutions
Both of the 3D DRAM solutions that we discuss in this article (HBM and HMC) share some commonalities: both stack memory dies vertically and employ through-silicon vias (TSVs) between memory dies. While it is worth acknowledging these commonalities, the differences between these solutions, and the implications of these differences, are the focus of this article. We argue that research proposals should be cognizant of these differences, as ideas proposed with, for example, HMC in mind may not always carry over to HBM and vice versa. Our discussion moves through the system from the processor towards the memory module and identifies key system aspects wherein these 3D DRAM solutions differ. Note, however, that this discussion is not meant to capture all differences between these two architectures and their respective approaches. Further, as these architectures evolve, and new 3D memory organizations emerge, the specific differences may change. Our goal, however, is to encourage the research community to pay close attention to the relevant implementation differences of 3D DRAM solutions as they can have important architectural implications.
(1) Memory controller placement
Perhaps the most significant difference between HBM and HMC is the placement of the memory controller. Like the mainstream memories with which it shares many characteristics, HBM is intended to be controlled by a memory controller on the processor side of the memory interface. HMC, on the other hand, integrates the memory controller within the memory module, on the memory side of the interface.
Implications: The placement of the memory controller is an important consideration in memory system optimizations that may be enabled by a memory architecture. As an example, a processor-side memory controller can enable memory scheduling optimizations that rely on host-side metadata or other application-aware characteristics that may be communicated from the host. However, such optimizations are typically not feasible in a memory-side memory controller where the necessary information may not be exposed over the memory interface. Conversely, a memory-side memory controller may perform optimizations based on proprietary information available to the memory manufacturer and may even fine-tune memory scheduling to tailor to the specific characteristics of individual memory devices.
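To make this distinction concrete, the following minimal Python sketch contrasts what each controller placement can observe when choosing the next request to service. The request fields, priority scheme, and scheduling policies are illustrative assumptions for this sketch, not part of either standard.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Request:
    address: int
    is_read: bool
    priority: Optional[int] = None  # host-side metadata (e.g., application priority)

def host_side_schedule(queue: List[Request]) -> Request:
    """Processor-side controller (HBM-style): host metadata is visible,
    so scheduling can honor application-level priorities."""
    return max(queue, key=lambda r: r.priority or 0)

def memory_side_schedule(queue: List[Request]) -> Request:
    """Memory-side controller (HMC-style): only the packet contents cross
    the interface, so host metadata is unavailable. Device-specific
    policies based on proprietary timing information could apply here."""
    return queue[0]  # oldest-first, lacking any host-side hints

queue = [Request(0x1000, True, priority=1), Request(0x2000, True, priority=7)]
print(hex(host_side_schedule(queue).address))    # 0x2000: high-priority wins
print(hex(memory_side_schedule(queue).address))  # 0x1000: oldest wins
```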
(2) Processor-memory interface
Another key difference between the two 3D DRAM solutions is whether they utilize a DDR-like fixed-timing protocol (a synchronous memory interface with deterministic timing; the double-data-rate nature of the interface, while also a commonality, is not of relevance here) or a more flexible split-transaction protocol between the processor and DRAM (where the delay between a memory request and the corresponding response can be variable). In mainstream systems, DRAM modules are controlled by the processor-side memory controller via deterministic commands as specified by a standard (typically defined by an industry body such as JEDEC). While HBM maintains this type of interface, HMC, which integrates processor and memory via high-speed serial links, breaks with it and instead employs a generalized, packetized interface where read/write commands are conveyed via command packets to the memory module and responses are returned via separate packets with no fixed timing relationship to the requests. As such, in HMC, unlike in HBM or conventional DDR memories, read (or write) responses have non-deterministic timing.
Implications: A fixed-timing interface imposes strict response scheduling requirements on the memory module, which essentially requires all memory accesses to be routed through the processor’s memory controller. This precludes direct connection of a memory module to multiple requestors or daisy-chaining of memory modules. However, the single-origin nature of the request stream enables the host memory controller to have control over request scheduling and to better enforce request priorities and other quality-of-service considerations. On the other hand, the relaxed timing relationship between requests and responses in a split-transaction interface enables architectures where memory modules may introduce a level of arbitration between memory requests from the processor and other sources. Such flexibility can more easily accommodate architectural capabilities such as direct daisy-chaining or networking of memory modules, direct connectivity to multiple requestors, or insertion of requests generated within a memory module by near-memory processing. In addition, the fixed-timing nature of HBM’s approach limits disruptions to the already-developed DDR-based infrastructure and causes less shake-up of fundamental memory-side logic (memory modules are still largely passive and receive DDR commands). In contrast, a packetized interface requires an overhaul of processor-memory interaction. Another key aspect of differentiation due to the differing interface choices is that while HBM employs a shared data bus for both reads and writes, HMC separates the two. As such, while HMC can prevent read and write traffic from interfering with each other, HBM better handles workloads where read (or write) bandwidth demand is skewed, by making the entire data bandwidth available to read (or write) traffic as needed.
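The Python sketch below illustrates why a split-transaction interface needs request tags: responses can return in any order, so each response packet must identify the request it answers, rather than relying on timing. The packet fields and tag scheme here are illustrative assumptions, not taken from the HMC specification.

```python
import random

def issue_requests(addresses):
    """Host side: attach a unique tag to each read-request packet."""
    return [{"tag": tag, "cmd": "RD", "addr": addr}
            for tag, addr in enumerate(addresses)]

def memory_module(requests):
    """Memory side: service requests with variable latency, so responses
    may return in a different order than the requests arrived."""
    responses = [{"tag": req["tag"], "data": f"data@{req['addr']:#x}"}
                 for req in requests]
    random.shuffle(responses)  # models non-deterministic completion order
    return responses

# Host side: match responses to requests by tag, not by arrival time.
reqs = issue_requests([0x1000, 0x2000, 0x3000])
pending = {r["tag"]: r for r in reqs}
for resp in memory_module(reqs):
    req = pending.pop(resp["tag"])
    print(f"request {req['addr']:#x} completed with {resp['data']}")
```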
(3) Processor-memory integration
While in both HBM and HMC the DRAM itself is 3D stacked, integration of processor and memory differs across these approaches. To realize an in-package wide parallel interface, HBM integrates memory modules with the processor via a silicon interposer (or silicon bridge, or similar technology) in a 2.5D manner. An interposer incorporates direct metal routes providing point-to-point interconnects between the processor and the memory. This allows a much wider interface to memory than is possible in conventional (off-package) DDR systems, which are limited by external package pin counts. In contrast, in HMC, the processor and the memory are organized similarly to conventional off-package memories (retaining the flexibility of board-level integration), but to provide high memory bandwidth, the processor interacts with the memory module over high-speed serial links (with faster signaling rates) which carry the communication packets between the processor and the memory module.
Implications: The high-speed serial links in HMC can be configured to full-, half-, or quarter-width operation. This configuration is an important knob for a system designer as the high-speed links consume significant idle power. As such, depending on the link configuration, as well as due to interface bandwidth consumed by additional requestors or daisy-chaining (as discussed above), a memory bandwidth differential can arise in an HMC system: the base die of an HMC module can enjoy higher memory bandwidth than the processor. This differential can motivate research that places logic in the base die of 3D DRAMs to offload computations near memory. Conversely, a similar bandwidth differential between the memory module and the processor is harder to justify in HBM as there are no other requestors and the interface consumes little power when idle.
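A back-of-the-envelope calculation helps illustrate this differential. The numbers below are purely illustrative assumptions (not datasheet values), chosen to show how narrower link configurations widen the gap between the bandwidth available at the base die and the bandwidth delivered to the processor.

```python
# All figures are illustrative assumptions, not specification values.
internal_bw = 320.0   # assumed aggregate DRAM bandwidth at the base die (GB/s)
full_link_bw = 320.0  # assumed external bandwidth with full-width links (GB/s)

for fraction, name in [(1.0, "full"), (0.5, "half"), (0.25, "quarter")]:
    external_bw = full_link_bw * fraction
    surplus = internal_bw - external_bw
    print(f"{name:>7}-width links: processor sees {external_bw:6.1f} GB/s; "
          f"base-die logic could exploit a {surplus:6.1f} GB/s surplus")
```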
(4) Base die composition
HBM and HMC both employ a base die stacked below the memory dies. In both solutions, the base die reserves area for TSVs that deliver signals, power, and ground to the memory dies. However, HBM and HMC place different functionality within their base dies. As the processor interacts with the HMC module via a packetized interface, the base die in HMC contains packet-processing logic, SerDes logic, and memory controllers. In contrast, as the processor interacts with HBM modules via deterministic commands, the memory controller is on the processor side and the base die for HBM typically contains physical interfaces (PHY) to communicate with the processor. Both memory architectures are also likely to integrate various testing capabilities and other support logic on their base dies.
Implications: As discussed above, movement of the entire memory controller logic to the memory side, as is done in HMC, marks a major change for processor vendors who typically own this logic block. Further, the nature of the logic available on the base die of each memory module may have implications for what else may be co-located with it; for example, the implications for near-memory processing on the base die were already touched on above.
(5) Miscellaneous
There exist many other differences between HBM and HMC. Given the generalized packetized interface exposed by HMC, these memory modules can be chained together to form a memory-module network that increases the total memory capacity available in the system. On the other hand, multiple HBM modules can be integrated within a package to increase the total memory bandwidth available to the processor.
Implications: Differences in capabilities as manifested by these solutions can have system-level implications that ought to be considered in a holistic system proposal. As an example, an HMC-based solution can, in principle, better address memory capacity limitations than HBM, which is limited in this regard: today, total HBM capacity is more constrained because the relatively small package area limits the total number of HBM stacks per package. However, HMC’s high-capacity approach does manifest NUMA effects that are often hard to overcome with software-only solutions.
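The trade-off can again be sketched with simple, assumed numbers: chaining HMC-style modules grows capacity but adds per-hop latency (the NUMA effect noted above), while adding in-package HBM stacks grows bandwidth but is capped by package area. All parameters in this sketch are illustrative assumptions, not product values.

```python
# All parameters are illustrative assumptions, not product values.
module_capacity_gb = 8     # assumed capacity per 3D DRAM module
hop_latency_ns = 10        # assumed extra latency per daisy-chain hop
stack_bw_gbps = 256        # assumed bandwidth per in-package HBM stack
max_stacks = 4             # assumed package-area limit on HBM stacks

# HMC-style chaining: capacity scales with chain length; latency grows per hop.
for chain in (1, 2, 4, 8):
    print(f"chain of {chain} modules: {chain * module_capacity_gb} GB total, "
          f"up to {(chain - 1) * hop_latency_ns} ns of extra hop latency")

# HBM-style: bandwidth scales with stack count, capped by package area.
print(f"in-package: {max_stacks} stacks deliver up to "
      f"{max_stacks * stack_bw_gbps} GB/s of aggregate bandwidth")
```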
At the time of this writing, of the two 3D DRAM solutions, HBM has gained wider traction and has been deployed in high-performance GPUs, accelerators, and FPGAs from multiple major vendors. While this (along with the differences above) can change or evolve over time, at the very least our community’s research proposals should be aware of the differences in commercial 3D-stacked memory solutions given the implications above. In summary, while 3D DRAM is enabling new forms of main memory organization, architecture researchers should pay close attention to the multiple potential approaches to 3D DRAM integration at the system level and focus their proposals on the 3D DRAM approaches that are most conducive to their solutions.
About the authors:
Shaizeen Aga is a Technical Lead and Member of Technical Staff at AMD Research where she leads research on application-driven design of accelerators and future architectures. Her research interests include processor architectures, memory subsystems and security, with a specific interest in near-memory accelerators.
Nuwan Jayasena is a Fellow at AMD Research and his interests include memory systems, heterogeneous computing, and emerging applications. He is happiest when he cannot tell whether he is working on hardware or software.
Acknowledgement: The authors thank Gabe Loh for helpful comments on this blog.
© 2021 Advanced Micro Devices, Inc. All rights reserved.
AMD Disclaimer
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
THIS INFORMATION IS PROVIDED “AS IS.” AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.