It feels strange to be writing about architecture in a time of great global stress. The COVID-19 pandemic has been an unprecedented disruption to our lives on a global scale. As I pondered what to write about for this blog post, I found myself coming back over and over again to the concept of balance. The longer I live in this world, the more I realize that the true art of living is in balance. And the true job of an architect is to create balanced systems. Everything, like the Force, should be in balance.
As I’ve transitioned to being an employee, a homeschool teacher for a 2nd grader and a pre-kindergartener, and a restaurateur (shopping, menus, cooking, cleanup), I have very much appreciated the exhortation of my manager, who tells us, “This is a marathon, not a sprint. Do what you need to do to not burn out.” Oddly, I think of how I am managing my life now in a similar way to how I think about, say, power management. There is a lot of work to do, and I could work feverishly for a while, but then find myself in some sort of thermal emergency and need to drastically slow down and recover. The key to a good system is to ensure that most of the time, you are not in an emergency, no matter what comes – a power virus, or a coronavirus. At the same time, it would do me no good to be the equivalent of an underpowered laptop, limping along and unable to make good on the various tasks at hand, leading to a chucked laptop (or a chucked employee).
As a young architect, I was good at balancing my life along many dimensions, but I didn’t realize that balance was key to architecture as well. I thought architecture was about understanding how computers worked, clever schemes, exploiting observations, and widgets. But really, it is about designing a steady state with high enough performance that does not creep towards a line that should not be crossed (e.g., a thermal emergency), while also designing for sudden, unexpected changes (e.g., losing a DIMM). And it is about balancing the investment between the two, since designing for the uncommon case adds cost.
Designing for Steady State
When you consider how to design a pipeline, you must think about the rate of instructions being fed vs the rate of instructions being consumed. You must think about the capacity for buffering to make up any difference. If you can’t accept more instructions to execute and you’ve run out of buffering, then you’ve crossed a line – you must stall.
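To make the feed-vs-consume balance concrete, here is a minimal sketch in Python. It is a toy model, not a description of any real pipeline: the function name `simulate_pipeline`, its parameters, and the fixed per-cycle rates are all assumptions made for illustration.

```python
def simulate_pipeline(feed_per_cycle, consume_per_cycle, buffer_capacity, cycles):
    """Toy model (hypothetical): count cycles in which the front end must stall.

    Each cycle the consumer drains up to `consume_per_cycle` buffered
    instructions, then the producer tries to insert `feed_per_cycle` new ones.
    A stall is any cycle in which the buffer cannot accept the full feed.
    """
    occupancy = 0
    stall_cycles = 0
    for _ in range(cycles):
        # The back end consumes what it can from the buffer.
        occupancy -= min(occupancy, consume_per_cycle)
        # The front end fills whatever space is left.
        free_slots = buffer_capacity - occupancy
        accepted = min(feed_per_cycle, free_slots)
        occupancy += accepted
        if accepted < feed_per_cycle:
            stall_cycles += 1  # out of buffering: the line has been crossed
    return stall_cycles


balanced = simulate_pipeline(feed_per_cycle=2, consume_per_cycle=2,
                             buffer_capacity=8, cycles=100)
unbalanced = simulate_pipeline(feed_per_cycle=3, consume_per_cycle=2,
                               buffer_capacity=8, cycles=100)
print(balanced, unbalanced)
```

In this model, buffering only delays the stall when the feed rate is persistently higher than the consume rate. A bigger buffer buys time against bursts, not against a sustained imbalance.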
When you consider how to design an arbitration mechanism, you must consider the number of requests that can be accepted in a cycle vs the number of requests arriving. Who wins? Who loses? What happens to the loser? Where is it buffered, and is the buffer sized to accept the losers? If there are 4 contenders for a register file with only 2 ports, and your buffering for the 2 losers is full, then you have crossed a line: you must stall.
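The register file example can be sketched the same way. The code below is a simplified model under stated assumptions: the name `first_stall_cycle` is hypothetical, waiting losers simply re-contend alongside new requests, and the arbiter's priority policy is ignored because only the counts matter here.

```python
def first_stall_cycle(arrivals, ports, loser_slots):
    """Toy arbiter (hypothetical): return the first cycle that must stall, or None.

    `arrivals[i]` is the number of new requests in cycle i. Up to `ports`
    contenders win each cycle; the rest must fit in `loser_slots` buffer entries.
    """
    waiting = 0  # losers carried over from earlier cycles
    for cycle, new_requests in enumerate(arrivals):
        contenders = waiting + new_requests
        granted = min(contenders, ports)
        losers = contenders - granted
        if losers > loser_slots:
            return cycle  # nowhere to put the losers: stall
        waiting = losers
    return None


# 4 contenders, 2 ports, room for 2 losers: fine if the next cycle is quiet.
print(first_stall_cycle([4, 0, 4, 0], ports=2, loser_slots=2))
# Back-to-back bursts of 4 overflow the loser buffer in the second cycle.
print(first_stall_cycle([4, 4], ports=2, loser_slots=2))
```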
When you design a chip, you expect a certain amount of memory traffic to leave the chip. But if your IOs have much less bandwidth than that traffic needs, then you’ll get queuing buildup, and you’ll have to stall.
If you have two teams who need to collaborate on a project and one team can deliver their portion on one date, and the other team can’t deliver until 6 months later, then the project will stall.
When you consider how to handle an ambiguously long (but certainly multi-month) stay-at-home order with no school for young children, you must think about how to ensure that you stay employed, how to ensure you don’t damage your children, and yet also manage to not damage yourself physically or mentally for the duration of the crisis. I personally have decided that occasionally, having a bourbon paired with sour cream and onion Lay’s is a great way to avoid mental damage, while possibly incurring physical damage, in order to brace myself for the next day. This is the balance I have decided on for now.
Designing for the Uncommon Case
The previous examples are about designing for steady state – but a good architect also needs to balance designing for the steady state vs designing for the unexpected.
How big should the voltage guardband be to avoid incorrectness during voltage droops?
How much canned food (food guardband) should you have?
How much money should a person (or a corporation) have in the bank?
How much duplication is required for disaster management at the datacenter scale? If a hurricane takes out a datacenter, should there have been another one exactly like it elsewhere, waiting for disaster? (Usually, no.)
Did you know that the rule of thumb is that all storage in a datacenter is 3x replicated?
Memory, however, is certainly not replicated; it’s too expensive. Instead, we’re willing to spend ECC bits to hedge against catastrophic data loss.
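A quick back-of-the-envelope comparison shows how different those two hedges are in cost. The 3x figure is the rule of thumb above; the ECC figure assumes a common layout of 8 check bits per 64 data bits, which is my assumption for illustration rather than a statement about any particular system.

```python
def redundancy_overhead(stored_bits, data_bits):
    """Extra bits stored per bit of useful data, as a fraction."""
    return (stored_bits - data_bits) / data_bits


DATA_BITS = 64

# 3x replication: every data bit is stored three times.
replication_overhead = redundancy_overhead(3 * DATA_BITS, DATA_BITS)

# Assumed ECC layout: 64 data bits plus 8 check bits (72 bits stored).
ecc_overhead = redundancy_overhead(DATA_BITS + 8, DATA_BITS)

print(f"3x replication: {replication_overhead:.1%} overhead")
print(f"ECC (assumed 72/64): {ecc_overhead:.1%} overhead")
```

Under these assumptions, replication costs 200% extra capacity while ECC costs 12.5%. The two are not interchangeable, though: replication can survive losing an entire copy, while ECC only protects against a small number of bit errors per word.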
When you build out a datacenter, you could kit it out with as many machines as you think you might ever need and let them sit idle and empty for extended periods of time. Or you could add capacity piecemeal as load increases, which may add complexity to the bring-up process or risk an inability to satisfy demand if it spikes suddenly. The world has seen spikes in demand for all sorts of things (toilet paper, Microsoft Teams, masks, ventilators); some systems have been able to spin up and pivot, while others have not.
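The tradeoff between the two build-out strategies can be sketched with a toy model. Everything here is hypothetical: the function names, the demand numbers, and the policy of ordering just enough capacity to match observed demand, with a fixed delivery lead time.

```python
def idle_and_unmet(demand, capacity):
    """Return (total idle capacity, total unmet demand) over all periods."""
    idle = sum(max(c - d, 0) for d, c in zip(demand, capacity))
    unmet = sum(max(d - c, 0) for d, c in zip(demand, capacity))
    return idle, unmet


def upfront_capacity(demand):
    """Build for the peak on day one (assumes the peak is known in advance)."""
    return [max(demand)] * len(demand)


def piecemeal_capacity(demand, initial, lead_time):
    """Order capacity only after demand exceeds it; it arrives `lead_time` periods later."""
    if lead_time < 1:
        raise ValueError("lead_time must be at least 1 period")
    capacity = []
    current = initial
    pending = {}  # arrival period -> capacity level ordered
    for period, load in enumerate(demand):
        # Any order scheduled for this period arrives before the load is served.
        current = max(current, pending.pop(period, current))
        capacity.append(current)
        if load > current:
            arrival = period + lead_time
            pending[arrival] = max(load, pending.get(arrival, 0))
    return capacity


demand = [10, 10, 12, 40, 40, 40]  # made-up load with a sudden spike
print(idle_and_unmet(demand, upfront_capacity(demand)))
print(idle_and_unmet(demand, piecemeal_capacity(demand, initial=10, lead_time=2)))
```

With these made-up numbers, the upfront build never misses demand but carries idle capacity before the spike, while the piecemeal build carries no idle capacity but misses demand until its orders arrive. Neither is right in general; the answer depends on how likely the spike is and how costly each kind of miss is.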
These are all questions of balance – cost for the steady state vs preparedness for the unexpected, against expectations of the likelihood or frequency of unexpected events.
All this to say, COVID-19 has led to quite a lot of introspection on my part about system design, optimization, and the parallels across all the facets of my life that are currently blurring together. Hopefully, everyone in our community is safe, well, and also able to introspect about how the meaning of life and the meaning of architecture are really the same thing. May everyone achieve balance at work, at home, and in the systems they design.
About the Author: Lisa Hsu is a Principal Engineer at Microsoft in the Azure Compute group, working on strategic initiatives for datacenter deployment.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.