Computer Architecture Today

Informing the broad computing community about current activities, advances and future directions in computer architecture.

In an era where artificial intelligence (AI) is rapidly advancing and the demise of Dennard scaling is becoming increasingly apparent, Sustainable Computing has never been more important. Global datacenter electricity consumption reached 460 TWh in 2022 (more than doubling since 2018) and is expected, as described in the 2024 IEA report, to exceed 620 TWh in 2026. As computing systems continue to grow in complexity and power consumption, it is imperative that we prioritize the development of energy-efficient and environmentally friendly solutions. Sustainable Computing aims to address the challenges posed by the ever-increasing demand for computational power while minimizing the negative impact on our planet. By focusing on innovative approaches to hardware design and manufacturing, software optimization, and resource management, Sustainable Computing seeks to create a future where computing can thrive without compromising the well-being of our environment.

The life cycle of computing is a multifaceted process that encompasses everything from the initial design of chips, Systems on Chips (SoCs), and chiplets, to manufacturing, operations, and eventual recycling. This cycle is a complex, carbon-intensive, multi-stage process in which each phase poses its own sustainability challenges, and these are exacerbated by the move to specialization and chip miniaturization. To effectively tackle the challenges of Sustainable Computing, we must adopt a holistic perspective that explores the inter-relatedness of the design, manufacturing, and operations of computing. By examining the tradeoffs within and across the phases of the computing lifecycle, we may identify optimal design points for the lifetime carbon efficiency of systems and software, and make informed decisions that drive the development of truly sustainable computing systems. In this blog article, which is based on a keynote I gave at ASPLOS’24, we examine these challenges and offer some ideas for future research to address them.

Challenges in Manufacturing

Manufacturing chips is an energy-intensive process that contributes significantly to the carbon footprint of the computing industry. The sources of greenhouse gas (GHG) emissions in chip manufacturing can be categorized into two main scopes: Scope 1, which includes direct emissions from processes involving gases, and Scope 2, which encompasses indirect emissions from electricity consumption. The manufacturing process also involves the use of harmful materials, such as per- and polyfluoroalkyl substances (PFAS), which pose significant environmental and health risks. As the industry pushes towards smaller process nodes, the environmental impact of chip manufacturing has increased dramatically. Moving from 28-nanometer to 7-nanometer or even 2-nanometer technologies has led to a substantial increase in electricity consumption and the use of harmful gases.

To address these challenges, a two-fold approach is necessary. First, we must focus on process optimization. By modeling each step of the manufacturing process, from input to output, we can identify hotspots where energy consumption and harmful material use are highest. This data-driven approach will allow us to optimize these processes, increasing reuse and reducing their environmental impact without compromising the quality or performance of the result. Second, we must explore the use of alternative, less harmful materials. This involves investing in research and development to discover new materials. At IBM Research, we believe that we can use AI to significantly accelerate the scientific discovery of new materials.
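The hotspot idea can be illustrated with a minimal sketch. It assumes each manufacturing step is described by only two numbers, its electricity use (a Scope 2 driver) and its direct process-gas emissions (Scope 1). The step names, the values, and the grid carbon intensity are hypothetical placeholders, not measured fab data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessStep:
    """One manufacturing step with its two emission drivers."""
    name: str
    electricity_kwh: float    # drives Scope 2 (indirect) emissions
    direct_gas_kgco2e: float  # Scope 1 (direct) process-gas emissions


def step_emissions(step: ProcessStep, grid_kgco2e_per_kwh: float) -> float:
    """Total emissions of one step: direct gases plus electricity use."""
    return step.direct_gas_kgco2e + step.electricity_kwh * grid_kgco2e_per_kwh


def rank_hotspots(steps, grid_kgco2e_per_kwh):
    """Return (step name, share of total emissions), largest share first."""
    emissions = {s.name: step_emissions(s, grid_kgco2e_per_kwh) for s in steps}
    total = sum(emissions.values())
    if total <= 0:
        raise ValueError("total emissions must be positive")
    ranked = sorted(emissions.items(), key=lambda item: item[1], reverse=True)
    return [(name, value / total) for name, value in ranked]


# Placeholder values for illustration only, not measured fab data.
EXAMPLE_STEPS = [
    ProcessStep("lithography", electricity_kwh=10.0, direct_gas_kgco2e=0.0),
    ProcessStep("etch", electricity_kwh=4.0, direct_gas_kgco2e=4.0),
    ProcessStep("deposition", electricity_kwh=3.0, direct_gas_kgco2e=1.0),
]

hotspots = rank_hotspots(EXAMPLE_STEPS, grid_kgco2e_per_kwh=0.5)
```

A real process model would track many more inputs and outputs per step (water, chemicals, abatement efficiency), but the same ranking logic would apply to whichever metric is being optimized.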

Challenges in Design

Operational energy efficiency has always been a goal in chip design, and it is even more so now that we are reaching the limits of Dennard scaling. In a seminal paper, Brooke et al. were the first to argue for modeling and analysis of energy behavior as an integral part of the early design process of chips. As the demand for energy-efficient computing continues to grow, chip designers are increasingly turning to heterogeneous architectures and SoC designs. Heterogeneity involves the integration of specialized hardware components (accelerators) alongside general-purpose CPUs on a single chip, allowing for the optimization of specific tasks and workloads. By tailoring the hardware to the software’s needs, heterogeneous designs can achieve significant improvements in energy efficiency during the operational stage of the compute life cycle. As an example, consider the 12nm RISC-V SoC described in a 2024 ISSCC paper. This SoC is designed specifically for the application domain of collaborative autonomous vehicles. It combines 14 different types of accelerators to support a plurality of workloads, including deep learning, signal processing, and cryptography. The design of the SoC was carried out by a small team of PhD students, postdocs, and industry researchers in three months, thanks to the ESP open-source platform for agile SoC design, which enables modularity and reuse.

As the complexity of single-die SoCs grows, so does the carbon associated with their manufacturing. As chips become more complex and the number of processing steps increases, the fabrication yield is reduced and the amortized carbon cost per chip goes up. We posit that it is crucial to incorporate an analysis of the manufacturing cost in the chip design phase, in addition to the operational energy estimation. Such early analysis will allow designers to examine tradeoffs between complexity, redundancy, operational efficiency, and the carbon cost of manufacturing, and to explore the design space for the most beneficial Pareto frontier, balancing quality and cost across the entire life cycle.
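To see why lower yield raises the amortized carbon cost per chip, consider a minimal sketch that assumes the classic Poisson yield model, Y = exp(-A * D0), where A is the die area and D0 the defect density. The carbon-per-area figure is a hypothetical parameter, and additional processing steps could be approximated as a higher effective defect density; this is an illustration, not a calibrated fab model.

```python
import math


def poisson_yield(die_area_cm2: float, defect_density_per_cm2: float) -> float:
    """Fraction of good dies under the Poisson yield model."""
    if die_area_cm2 < 0 or defect_density_per_cm2 < 0:
        raise ValueError("area and defect density must be non-negative")
    return math.exp(-die_area_cm2 * defect_density_per_cm2)


def carbon_per_good_die(
    die_area_cm2: float,
    defect_density_per_cm2: float,
    carbon_per_cm2_kg: float,  # hypothetical embodied carbon per processed cm^2
) -> float:
    """Embodied carbon of one die, amortized over good dies only.

    Carbon spent on defective dies is charged to the good ones,
    so the cost is divided by the yield.
    """
    fab_carbon = carbon_per_cm2_kg * die_area_cm2
    return fab_carbon / poisson_yield(die_area_cm2, defect_density_per_cm2)


# Doubling the die area more than doubles the carbon per good die.
small = carbon_per_good_die(1.0, 0.1, carbon_per_cm2_kg=1.0)
large = carbon_per_good_die(2.0, 0.1, carbon_per_cm2_kg=1.0)
```

Under these assumptions the cost grows faster than linearly with die area, which is one reason to weigh a large monolithic die against smaller dies during design exploration.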

As an example of why it is crucial to incorporate sustainability concerns across the life cycle of systems, consider the case of solid-state drives (SSDs). A common problem with SSDs is that reliability deteriorates over time, correlated with the number and frequency of writes. Hardware redundancy is a common strategy for improving reliability over time, thus extending the SSD’s lifetime. By incorporating redundant components, such as extra memory cells or controller circuits, designers can ensure that the device continues to function even if some components fail. This, in turn, can help amortize the embodied emissions associated with the device’s production over a longer period, thereby improving its lifetime carbon efficiency. However, hardware redundancy increases the carbon cost of the manufacturing phase when that phase is considered in isolation. This is why a holistic lifecycle efficiency analysis is necessary to meaningfully identify the optimal design point, balancing the cost of redundant hardware against the value added by lifetime extension.
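The redundancy tradeoff can be made concrete with a toy model. The sketch assumes that embodied carbon grows linearly with the fraction of redundant hardware, while lifetime grows with diminishing returns (a square-root curve). Both functional forms and all numbers are hypothetical choices for illustration, and operational emissions are left out.

```python
import math


def embodied_carbon_kg(base_kg: float, redundancy: float) -> float:
    """Embodied carbon, assumed to grow linearly with redundant hardware."""
    return base_kg * (1.0 + redundancy)


def lifetime_years(base_years: float, redundancy: float, gain: float = 1.0) -> float:
    """Lifetime, assumed to grow with diminishing returns (square root)."""
    return base_years * (1.0 + gain * math.sqrt(redundancy))


def annualized_embodied_kg(
    redundancy: float,
    base_kg: float = 100.0,   # hypothetical embodied carbon with no redundancy
    base_years: float = 5.0,  # hypothetical lifetime with no redundancy
    gain: float = 1.0,
) -> float:
    """Embodied carbon amortized over each year of service."""
    return embodied_carbon_kg(base_kg, redundancy) / lifetime_years(
        base_years, redundancy, gain
    )


def best_redundancy(candidates, **model):
    """Pick the redundancy level with the lowest annualized embodied carbon."""
    return min(candidates, key=lambda r: annualized_embodied_kg(r, **model))


levels = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
optimum = best_redundancy(levels)
```

With these placeholder curves the optimum lies strictly between no redundancy and the maximum considered, which is the point of the lifecycle view: neither the manufacturing phase nor the lifetime alone identifies it.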

Challenges in Operations

Data centers play a crucial role in the overall sustainability of computing. Managing data centers efficiently is complex due to several factors.

First, data centers are highly dynamic environments, running multiple workloads with different characteristics and associated Service Level Objectives (SLOs), all sharing IT resources across multiple geo-locations. This complexity makes it challenging to optimize resource allocation and minimize energy consumption without compromising performance and reliability.

Second, the growing adoption of heterogeneous architectures and technologies is making the problem of efficient data center management even harder. The efficiency of software running on these heterogeneous systems can vary significantly depending on the specific characteristics of both the software and the underlying hardware system. This means that traditional, one-size-fits-all, black-box approaches to resource management may no longer be effective. This is most evident with AI workloads. Careful resource management for AI requires knowledge of the specifics of the models, their layers, the model lifecycle phase, and configuration parameters such as batch size, as well as the particulars of memory layout and technology-specific sharing mechanisms.

Third, the evolution of new technologies, such as low-latency memory and storage hierarchies, results in shifting bottlenecks in the datacenter which, if not addressed, may interfere with our ability to gain the full efficiency benefit of these new technologies. Asaf Cidon et al. posit that software bloat has become a new bottleneck, a problem that is exacerbated by the Operating System's lack of awareness of sustainability concerns. A possible way forward is to extend the functionality of Linux via extension mechanisms such as eBPF to address this gap, as described here.

Lastly, effective optimization of data center operations relies on the quantification of the energy consumed by different applications. This is a complex challenge, as it requires considering factors such as resource sharing, where multiple applications run on the same system, and the provenance chain across a distributed set of resources used by an application. Developing robust methodologies and techniques for energy quantification is essential for identifying opportunities for meaningful optimization and measuring its impact. The open-source project Kepler is focused on that goal. One of the main issues in optimizing for sustainability in hybrid cloud environments is the lack of transparency and of a common methodology across cloud providers. The Green Software Foundation, under the Linux Foundation, is working to define standards and a common methodology to address this gap.
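To show what per-application energy attribution on a shared node might involve, here is a deliberately simplified sketch. It assumes that the measured node energy can be split into an idle part, shared equally among the running applications, and a dynamic part, shared in proportion to CPU time. This is an illustrative policy only; it is not the model used by Kepler, and the function and parameter names are hypothetical.

```python
def attribute_energy(node_energy_j, idle_energy_j, cpu_seconds):
    """Split measured node energy (joules) among applications.

    node_energy_j: total energy measured for the node over an interval
    idle_energy_j: portion of that energy the node would draw when idle
    cpu_seconds:   mapping of application name -> CPU time in the interval
    """
    if not 0 <= idle_energy_j <= node_energy_j:
        raise ValueError("idle energy must be between 0 and node energy")
    if not cpu_seconds:
        return {}

    dynamic_j = node_energy_j - idle_energy_j
    total_cpu = sum(cpu_seconds.values())
    idle_share_j = idle_energy_j / len(cpu_seconds)

    attributed = {}
    for app, cpu in cpu_seconds.items():
        # Dynamic energy follows CPU usage; if nothing ran, none is assigned.
        dynamic_share_j = dynamic_j * cpu / total_cpu if total_cpu > 0 else 0.0
        attributed[app] = idle_share_j + dynamic_share_j
    return attributed


# Hypothetical interval: 1000 J measured, 400 J of it idle, two applications.
per_app = attribute_energy(1000.0, 400.0, {"training-job": 30.0, "web": 10.0})
```

Other policies are equally defensible, for example splitting idle energy in proportion to reserved resources, and the choice of policy changes the numbers each application is charged. That sensitivity is part of why a common methodology matters.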

Ultimately, achieving truly sustainable computing requires a holistic, lifecycle-oriented approach that takes into account the complex tradeoffs and interdependencies between design, manufacturing, and operations. By developing new tools, methodologies, and best practices for modeling, analysis, and optimization across the entire compute lifecycle, the industry can drive significant improvements in the carbon efficiency and environmental sustainability of computing systems. As we continue to push the boundaries of performance and functionality, it is essential that we keep these lifecycle considerations at the forefront of our efforts to build a more sustainable future for computing.

About the author:

Dr. Tamar Eilam is an IBM Fellow and Chief Scientist for Sustainable Computing at the IBM T. J. Watson Research Center, New York. Tamar completed her Ph.D. in Computer Science at the Technion, Israel, in 2000. She joined the IBM T. J. Watson Research Center in New York as a Research Staff Member that same year. She was recognized as an IBM Fellow in 2014. The work described in this blog entry is partially based on a collaboration between IBM Research and Columbia University centered around ‘Sustainable Computing’, for which Tamar is a co-PI.

Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.