Computer Architecture Today

Informing the broad computing community about current activities, advances and future directions in computer architecture.

This is the second part of my “Brief and Biased History of Computer Architecture”; you can find the first part here

While IBM rules the 1960s and the 360/91 might be considered the first supercomputer[1], other groups are pushing hard to build the fastest machines on the planet.

  • Released by Control Data Corporation in 1964, the CDC 6600 is built by a team led by Seymour Cray[2] and Jim Thornton. It beats the 7030 Stretch by a factor of three and holds the title of fastest computer on the planet for five years. Thomas J. Watson Jr., President of IBM, is apoplectic that a team of “34 people including the janitor” has bested Big Blue.
  • ILLIAC-IV is the first machine envisioned with massive parallelism, aiming for 256 64-bit floating-point units. It includes ALUs that can be sliced to operate on smaller subwords in a register. It starts working in 1973. Only ¼ of it is ever built, after the project suffers terrible budget overruns. It takes almost a decade for the programmers to realize the performance of the machine, but when they do, it’s 2-6x faster than a CDC 7600. Widely viewed as a failure, ILLIAC-IV plants the seeds of both MIMD and SIMD processing, hastens the transition to DRAM, inspires compilation techniques that reorder computations, and spurs the development of CAD tools.
  • CDC ails, so Cray founds his own company and delivers the Cray-1 in 1976, the first vector computer and the canonical supercomputer. Unlike today, where compute is cheap and memory is expensive, Cray is working in an era with the opposite situation. His machine includes a multi-banked memory system and the world’s fastest scalar execution unit, but their job is to make sure that the precious single ALU in the vector unit stays busy. It takes a decade for the compilers to catch up. Cray hires women[3] to hand-assemble the nanosecond-trimmed wires in his machines, finding them both faster and more accurate than men in a pre-VLSI example of the importance of Design for Manufacturability. Cray goes on to build a sequence of supercomputers, and is deemed a national security asset. He also obsesses about nuclear war, digging tunnels in his free time under his company’s Chippewa Falls building.[4]

There are a number of ambitious failures during the 1980s:

  • Charles Leiserson and H. T. Kung invent systolic arrays around 1980, spurring a wave of parallel special-purpose computers. Kung is a polymath, with contributions across theory, databases, and networking as well. Systolic arrays are limited by the technology of their day, where one systolic node with a single ALU occupies an entire chip. 35 years later, they come back with a vengeance.[5]
  • In 1981, the Symbolics LM-2 becomes the first commercially available Lisp machine. They aim to “close the semantic gap” between computer architecture and the Lisp programming language popular in AI[6], and they embrace exotic architectural features including extra tag bits that aid garbage collection. These architectural features don’t save them, and they are swept away by the performance of more general workstations. 20 years later, Sun Microsystems releases the picoJava processor, which is tailored “…to the Java environment;” it suffers a similar fate.[7]
  • Burton Smith at Denelcor builds the HEP processor in 1982, which tolerates operation and memory latency through a massively multithreaded design. It fails; he tries again in the 90s with Tera Computer, which sells approximately one pointer-chasing machine to the NSA. In an ironic twist of history, Tera buys Cray then renames itself Cray Inc. before failing. This kind of threading prefigures SMT in modern CPUs and the SIMT design of modern GPUs, both of which actually work. 
  • Around 1985, the Manchester dataflow machine aims to build a computer that represents computation as a graph, unlocking parallelism without the bottleneck of a single program counter. The control problems turn out to be the fundamental problem with all dataflow machines: finding an efficient way to manage and traverse the whole program graph is beyond the hardware and compilation technologies of the day. Ironically, the best descendant of these ideas is an Out-of-Order execution engine, which is anchored by a program counter, keeps the control problem tractable by only having hundreds of instructions simultaneously in flight, and fits the ROB or other scoreboarding structures all on-chip. 
  • Danny Hillis’ award-winning thesis observes that the vast majority of computers’ transistors are idle in memory, while a tiny fraction are busy in the compute unit. He founds Thinking Machines Corporation and attracts Richard Feynman, among other luminaries, to revolutionize parallel computing with a 1-bit, massively SIMD architecture. They fail, occasionally delivering high performance but never finding an easy way to program the machine, although they too spend a decade trying. Their work foreshadows the Processing-in-Memory work of the 90s, which never finds a killer application. 
  • Josh Fisher invents trace scheduling in his thesis, pursues the ELI architecture at Yale, and founds Multiflow Computer with John Ruttenberg and John O’Donnell in 1984. Somewhere along the way Josh coins the term “VLIW”. Multiflow ships over 100 machines before it fails.[8] The ILP-focused Multiflow compiler lives on in the 90s as the best way to compile SPEC benchmarks to x86 and RISC processors. VLIWs find a healthy niche in embedded computing and particularly digital signal processing; the most popular units are Qualcomm Hexagons.[9]
  • Bob Rau is chief architect of the Cydra 5 from Cydrome, which goes bankrupt in 1988. Their elegant and beautiful HW/SW combination of a modulo scheduling algorithm and a rotating register architecture gets new life a decade later in Itanium. Despite tens of billions of dollars of revenue, Itanium is viewed as a much larger failure, but not before it creates enough FUD to cause SGI/MIPS and Alpha to fold their microprocessor efforts.
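
For readers who have not met a systolic array, the idea in the first bullet above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `systolic_matmul`, the output-stationary arrangement (each node keeps one result and passes operands on), and the skewed feeding schedule are choices made for this sketch, not a description of Kung and Leiserson's original designs or of any particular chip.

```python
def systolic_matmul(a, b):
    """Multiply a (n x k) by b (k x m) on a simulated n x m grid of nodes.

    Assumed (illustrative) dataflow: node (i, j) owns output c[i][j].
    Values of `a` enter from the left edge and move right one node per step;
    values of `b` enter from the top edge and move down one node per step.
    Row i and column j are fed with a delay of i and j steps respectively,
    so matching operands meet at the right node at the right time.
    """
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]      # one accumulator per node
    a_reg = [[0] * m for _ in range(n)]    # operand each node forwards right
    b_reg = [[0] * m for _ in range(n)]    # operand each node forwards down

    for t in range(n + m + k - 2):         # steps until the last operands meet
        new_a = [[0] * m for _ in range(n)]
        new_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:                 # left edge: read from memory, skewed by i
                    s = t - i
                    a_in = a[i][s] if 0 <= s < k else 0
                else:                      # interior: take the left neighbor's value
                    a_in = a_reg[i][j - 1]
                if i == 0:                 # top edge: read from memory, skewed by j
                    s = t - j
                    b_in = b[s][j] if 0 <= s < k else 0
                else:                      # interior: take the upper neighbor's value
                    b_in = b_reg[i - 1][j]
                acc[i][j] += a_in * b_in   # the node's single multiply-accumulate
                new_a[i][j], new_b[i][j] = a_in, b_in
        a_reg, b_reg = new_a, new_b        # all nodes advance in lockstep
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each node touches memory only through its neighbors, which is why a single node with one ALU could fill a 1980 chip while a modern die can hold tens of thousands of them.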

Before I cut off my history, there are three huge commercial developments that don’t get enough coverage in academic computer architecture:

  • Stunning the world in 1995, Bob Colwell leads a team at Intel that builds the Pentium Pro, which holds the SPECint95 performance crown for two months. All manufacturers had been years late to deliver their first out-of-order designs, but no one outside Intel had expected Pentium Pro, based on the CISC-style x86 architecture, to be this good. I have friends who disagree with me[10], but I think Pentium Pro ends the first round of the RISC-CISC wars with a decisive commercial win for CISC. Colwell goes on to design Pentium 4, which flames out with the end of Dennard scaling. But Pentium Pro returns in the form of Pentium M (built by the Intel team in Haifa), saving the company once again. Intel rides the volume subsidy to fund a 2-generation process lead over two decades.
  • NVIDIA debuts the GeForce 256 in 1999; they claim it’s the world’s first GPU but 3dfx’s products are earlier. The ensuing GPU wars are vicious, with dozens of companies whittling down to the two remaining players today.[11] This specialized architecture becomes surprisingly general and impressively programmable over time, coming to dominate not just graphics but also high-performance computing. It does, however, take ten years for the compilers to get good, and doing so requires hardware changes. When they do, the CUDA programming model looks like an inside-out vector machine. GPUs are just the right technology to launch the current AI revolution, where AlexNet runs on a pair of 2012 GPUs.
  • In 2002, Intel adds Hyperthreading to its Xeon server processor line, allowing a single out-of-order processor core to share its resources across two hardware-supported threads. Hyperthreading is the first widespread commercial realization[12] of Simultaneous Multithreading (SMT), introduced by Dean Tullsen, Susan Eggers and Hank Levy in the 90s. I think of SMT as the last great general performance technique for CPUs, and in some ways it’s a hint of the difficulties that are soon to come: it improves the throughput but not the latency of the CPU.
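
The throughput-versus-latency point in the last bullet can be seen in a toy model. The sketch below is a deliberately simplified assumption, not a model of Hyperthreading or of any real core: each thread is a chain of dependent instructions, each instruction has a fixed latency before the next one in the same thread may issue, and the hypothetical `simulate_core` function lets any ready thread use one of the core's issue slots each cycle.

```python
def simulate_core(threads, issue_width=2):
    """Return the cycle at which each thread's last result is ready.

    threads: one list per thread; entry p is the latency (in cycles) of that
             thread's p-th instruction. Instructions within a thread are
             assumed fully dependent, so a thread issues at most one per cycle
             and only after its previous result is ready.
    issue_width: issue slots shared by all threads each cycle (assumed).
    """
    n = len(threads)
    next_instr = [0] * n   # index of the next instruction per thread
    ready_at = [0] * n     # cycle at which each thread may issue again
    finish = [0] * n
    cycle = 0
    while any(next_instr[t] < len(threads[t]) for t in range(n)):
        slots = issue_width
        for t in range(n):             # fixed priority, lowest thread id first
            if slots == 0:
                break
            if next_instr[t] < len(threads[t]) and ready_at[t] <= cycle:
                ready_at[t] = cycle + threads[t][next_instr[t]]
                next_instr[t] += 1
                finish[t] = ready_at[t]
                slots -= 1
        cycle += 1
    return finish


chain = [3, 3, 3]                        # three dependent 3-cycle operations
alone = simulate_core([chain])           # one thread has the core to itself
shared = simulate_core([chain, chain])   # two threads share the same core
print(alone, shared)
```

In this model a thread running alone finishes in 9 cycles, and two threads sharing the core both finish in 9 cycles rather than the 18 that running them back to back would take. The core does twice the work in the same time, yet neither thread finishes any sooner than it would alone.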

I’ll stop here, as more recent events are still controversial, with debates still raging and not yet settled by markets or history. But there are scary and wondrous things coming up: Dennard scaling ends around 2003, and while no one has gotten rich shorting Moore’s Law, in 2021 it feels more and more like the end is near. I’ve omitted the move to mobile and power efficiency (including the rise of ARM), and I’ve entirely skipped both the history of computer graphics and the new age of accelerators. Let’s see how much controversy we generate from this historical cutoff before addressing the last two decades. 

I’ve probably also left out some of your heroes and favorite machines. Rather than asking you to forgive me, instead, please get your flamethrower out and roast me! Mythologies are built through shared arguments and discussion, and even centuries-long debates (Leibniz or Newton?). I look forward to building a shared mythology together. 

Acknowledgements: Thanks to Bob Colwell, Paolo Faraboschi, Josh Fisher, Trevor Gale, Milad Hashemi, Bob Iannucci, Norm Jouppi, Samira Khan, James Laudon, Martin Maas, Ravi Nair, David Patterson, Herman Schmit, and Michael D. Smith for comments on early drafts of this post.

About the Author: Cliff Young is a software engineer in Google Research, where he works on codesign for deep learning accelerators. He is one of the designers of Google’s Tensor Processing Unit (TPU) and one of the founders of the MLPerf benchmark. Previously, Cliff built special-purpose supercomputers for molecular dynamics at D. E. Shaw Research and was a Member of Technical Staff at Bell Labs. 

Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.

  • 1
    My definition of a supercomputer is a system that prioritizes absolute performance on a problem domain over other considerations, even price/performance. Some of my friends disagree.
  • 2
    You may have noticed that I have been italicizing the names of Nobel, Turing and Eckert-Mauchly laureates. Seymour Cray also has a Supercomputing award named in his honor.
  • 3
    For this link, pay more attention to the photos than the text. The text’s assertion that no wire was longer than three inches is visibly wrong.
  • 4
    I heard this story secondhand, from people who had worked with Cray in Chippewa Falls, but it is not clear that it’s documented in his biographies. Perhaps this is like Newton’s apple: the myth is more fun than what historians can prove.
  • 5
    Measured in Moore’s Law doublings, 35 years is 35/2≅17 doublings, or 2^17=128K times more transistors per chip. TPUv1 has 64K 8-bit MAC units. Coincidence?
  • 6
    Which turns out to be the wrong, symbolic, Marvin Minsky style of AI, but that’s another story with different heroes.
  • 7
    The year-by-year improvements in general-purpose processors are no longer what they used to be. We are in an age of domain-specific processors; is it worth revisiting language-specific processors as well? This is less clear—a domain has a justifying application, while a language is a means to an end.
  • 8
    Customers keep buying machines from the successor maintenance company even after Multiflow closes.
  • 9
    There’s a good chance you have a Hexagon in your pocket right now.
  • 10
    Perhaps to my peril, as at least one of them holds a Turing Award.
  • 11
    SGI also succumbs to the Innovator’s Dilemma, although its alumni populate the GPU startups.
  • 12
    The IBM AS/400 Northstar is earlier, in 1998.
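
The back-of-the-envelope arithmetic in footnote 5 can be checked in a few lines. The two-year doubling period is the assumption the footnote itself makes, and the variable names are chosen here for illustration only.

```python
YEARS = 35               # gap between the first systolic arrays and their comeback
YEARS_PER_DOUBLING = 2   # assumption: transistor count doubles every two years

doublings = YEARS // YEARS_PER_DOUBLING   # whole doublings in that span
growth = 2 ** doublings                   # transistor budget multiplier
tpu_v1_macs = 64 * 1024                   # 64K 8-bit MAC units, per the footnote

print(doublings)              # number of doublings
print(growth // 1024, "K")    # growth expressed in units of 1024
print(growth / tpu_v1_macs)   # transistor multiplier per TPUv1 MAC unit
```

Under these assumptions there are 17 doublings and a 128K-fold larger transistor budget, about twice the number of MAC units in TPUv1.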