Computer Architecture Today

Informing the broad computing community about current activities, advances and future directions in computer architecture.

Introduction

I created the performance counters found in almost all CPU chips today, initially in the DEC Alpha 21064 and DEC NVAX chips, announced in early 1992 and late 1991 respectively, though both ran internally earlier. I made some mistakes, focusing on what events could be counted in the hardware rather than on how software programmers could reliably observe execution slowdowns.

To support fine-grained analysis of software performance, five counters are particularly desirable. One simply gives a reliable high-resolution elapsed-time base, while the other four give an initial top-down decomposition of fine-grained instructions per cycle and the main reasons for low values.

The counters described here should all be accessible directly (i.e. with no global setup), via single user-mode read instructions, with one-cycle read latency. There is no reason to prevent multi-issue of these instructions, or provide any other “help” that would slow them down. If I want to serialize execution, I can add appropriate serialization instructions, but if the hardware “helps”, I cannot remove that help.

Chips today have many performance counters and timers accessible in user mode, but they remain unsuitable for careful performance work. The intended use of these counters is to do sequences of

  • Read one or more counters
  • <Execute some instructions>
  • Read again
  • Subtract and take ratios

The “some instructions” may be a single instruction, a short block of code, a named subroutine, the code between kernel-user transitions, or some other execution interval of interest. Current counters are ill-suited to this when the measured execution interval can be relatively short, on the order of nanoseconds or microseconds. Current-counter drawbacks include slow reads, global setup, and unknown units.
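
As a concrete illustration, here is a minimal C sketch of that pattern on x86 using the compiler's __rdtsc() intrinsic; work_under_test() is a hypothetical stand-in for the execution interval of interest, and all of the drawbacks discussed below still apply to it.

  /* Minimal sketch of the read / execute / read / subtract pattern,
     using __rdtsc() as a stand-in for the ideal one-cycle counter.
     work_under_test() is a hypothetical placeholder. */
  #include <stdint.h>
  #include <stdio.h>
  #include <x86intrin.h>              /* __rdtsc() */

  static void work_under_test(void) {
      volatile uint64_t sum = 0;      /* <Execute some instructions> */
      for (int i = 0; i < 1000; i++) sum += i;
  }

  int main(void) {
      uint64_t start = __rdtsc();     /* read counter */
      work_under_test();              /* execute some instructions */
      uint64_t stop = __rdtsc();      /* read again */
      printf("elapsed constant-rate counts: %llu\n",
             (unsigned long long)(stop - start));   /* subtract */
      return 0;
  }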

1. Elapsed-time counter

The goal here is a reliable high-resolution elapsed-time base. The current x86 RDTSC instruction returns a “constant cycle count”. But it takes 20-30 cycles to read, the result arrives in two pieces that must be shifted and ORed to make a full count, the granularity of increment is not specified (it is not 1), and the counts per second are not documented, much less architected to be the same across different chips. In addition, there is no required consistency between counts on different CPU cores, so the sequence above is meaningless if code is migrated to a different core between the first step and the third.
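
For reference, the shift-and-OR in question looks like this when written by hand (a sketch in GCC-style inline assembly); this is only the raw read, before any serialization is added.

  /* Raw RDTSC read: the low 32 bits arrive in EAX and the high 32 bits
     in EDX, and must be shifted and ORed into a full 64-bit count.
     Nothing here defines the units or pins the code to one core. */
  #include <stdint.h>

  static inline uint64_t read_tsc(void) {
      uint32_t lo, hi;
      __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
      return ((uint64_t)hi << 32) | lo;
  }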

The 20-30 cycle read time prevents useful approaches such as automatic compiler insertion of per-subroutine timing, because the measurement slowdown and distortion is too high for short routines, and the short routines cannot be identified reliably without first inserting timing.

Some x86 implementations further “help” the performance programmer by draining the execution pipeline to guarantee in-order execution, and sometimes also guarantee that two back-to-back read values will differ by at least one, i.e. that they are forced to be monotonically increasing rather than simply non-decreasing. These measures simply make the already unacceptable measurement overhead worse.

With unspecified units, one can read and subtract RDTSC counter values and get a number, but there is no specification of what that number means. For one chip it might mean 2.7 counts per nanosecond, while for another chip it might mean 3.9 counts per nanosecond. The time counter frequency TCFREQ tends to be the nominal GHz rate printed on the box the chip came in, a marketing number. But there is no reliable OS-agnostic way to determine what that constant is for a particular chip. “8 Ways to Check CPU Clock Speed on Linux” is indicative of the problem.
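
In practice, programmers fall back on calibration. Here is a rough sketch, assuming Linux and a constant-rate TSC, that estimates counts per nanosecond by comparing RDTSC to the OS monotonic clock. It produces a usable number, but it is a per-machine measurement, not an architected constant.

  /* Calibrate the RDTSC rate against CLOCK_MONOTONIC (Linux assumed).
     The result is approximate and must be re-measured on every machine. */
  #define _POSIX_C_SOURCE 199309L
  #include <stdint.h>
  #include <stdio.h>
  #include <time.h>
  #include <x86intrin.h>

  int main(void) {
      struct timespec t0, t1;
      struct timespec req = {0, 100 * 1000 * 1000};   /* ~100 msec */
      clock_gettime(CLOCK_MONOTONIC, &t0);
      uint64_t c0 = __rdtsc();
      nanosleep(&req, NULL);
      uint64_t c1 = __rdtsc();
      clock_gettime(CLOCK_MONOTONIC, &t1);
      double nsec = (t1.tv_sec - t0.tv_sec) * 1e9 +
                    (t1.tv_nsec - t0.tv_nsec);
      printf("~%.2f TSC counts per nanosecond\n", (c1 - c0) / nsec);
      return 0;
  }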

The ARMv8.6 architecture [“Arm Architecture Reference Manual”, DDI 0487G.b (ID072021)] gets the design almost right for the system time counter:

– The count rate is architected at 1 GHz (one count per nsec) and the width at 64 bits, so wrap-around is not an issue

– The granularity of increment is recommended to be anywhere from 1 GHz (increment by 1 every 1 nsec) down to 50 MHz (increment by 20 every 20 nsec). These are practical numbers for careful performance analysis.

– Time is required not to go backward across cores:

  • Device A reads the time from the system counter.
  • Device A communicates with another agent in the system, Device B.
  • After recognizing the communication from Device A, Device B reads the time from the system counter.

It must be impossible for this sequence of events to show system time going backwards.

The piece that is missing is a requirement for a one-cycle read.
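
For comparison, reading the ARMv8 generic timer and its architected frequency register from user mode looks roughly like this (a sketch in AArch64 GCC-style inline assembly; user-mode access depends on the kernel enabling it). Unlike RDTSC, the units are discoverable, but the read is still not architecturally guaranteed to take one cycle.

  /* Sketch: AArch64 system counter read. CNTVCT_EL0 is the virtual count,
     CNTFRQ_EL0 its frequency in Hz. User-mode access is kernel-dependent. */
  #include <stdint.h>

  static inline uint64_t read_arm_count(void) {
      uint64_t c;
      __asm__ volatile("mrs %0, cntvct_el0" : "=r"(c));
      return c;
  }

  static inline uint64_t read_arm_count_freq(void) {
      uint64_t f;
      __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(f));
      return f;                        /* counts per second */
  }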

For completeness, two other values are needed: an instruction giving the constant TCFREQ, and an instruction giving the current CPU frequency for a core. For both of these, kilohertz (1000 Hz) is an appropriate unit, giving values that easily fit in a default 32-bit integer. The first allows conversion between time counter increments and constant cycle increments, while the second allows monitoring of the current, possibly slow, frequency of a core to see if that explains slow software execution.
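
Neither instruction exists today, but a hypothetical sketch shows how the two kilohertz values could be used, assuming a 1 GHz time counter as in ARMv8.6 and taking TCFREQ as the nominal constant-cycle rate; read_tcfreq_khz() and read_curfreq_khz() are invented names standing in for the two proposed reads.

  /* Hypothetical only: read_tcfreq_khz() and read_curfreq_khz() stand in
     for the two proposed instructions, returning kilohertz values. */
  #include <stdint.h>

  extern uint32_t read_tcfreq_khz(void);    /* constant nominal rate, kHz */
  extern uint32_t read_curfreq_khz(void);   /* current core clock, kHz    */

  /* Convert a delta of time-counter increments (assumed here to tick at
     1 GHz, i.e. nanoseconds) into constant-cycle increments at TCFREQ. */
  static uint64_t nsec_to_constant_cycles(uint64_t delta_nsec) {
      return delta_nsec * read_tcfreq_khz() / 1000000;   /* kHz -> GHz */
  }

  /* Is this core currently running below its nominal frequency? */
  static int core_is_running_slow(void) {
      return read_curfreq_khz() < read_tcfreq_khz();
  }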

2. Instructions per constant cycle, IPCC

The original cycle counter on the Cray-1 circa 1976 counted once per CPU cycle. You could read it twice, subtract, and get 1. Things have gone downhill from there.

The introduction of power-saving Intel SpeedStep in January 2000 decoupled CPU cycle counts from elapsed time: as the CPU clock frequency was reduced, the number of counts per second varied in unpredictable and undocumented ways. Later, multi-core chips came along, and then multi-core chips whose cores could be running at different clock frequencies. The cycle counter value on core A then bore no relationship to the counter value on core B. The possibility of code migrating from one core to another meant that some other machine-specific way had to be used to measure elapsed time. This was such a problem at Google circa 2005 that CPU power-saving was disabled in order to keep cycle counters in sync across multiple cores. At the time, observing time go backwards crashed the Google File System, GFS.

It turns out that measuring elapsed time is more important than counting varying CPU cycles. Eventually, so-called constant-rate TSC was implemented, restoring the ability to count elapsed time and removing the ability to count actual CPU cycles.

To examine software slowdowns caused by interference between programs sharing hardware such as caches, branch-instruction decoders, or non-pipelined divide units, it is particularly helpful to measure instructions executed per cycle (IPC) during times of no interference and also during times of heavy interference, to see how much actual slowdown the interference causes. With a constant-rate “cycle counter”, the measurement is actually instructions per constant-rate count ticking at TCFREQ. The industry terminology is IPC, pretending that “cycle” speed doesn’t change. What we normally call IPC is really instructions per constant cycle, IPCC.
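
Getting even an approximate IPCC on Linux today means pairing RDTSC with an instructions-retired count read through the kernel’s perf_event_open interface. A rough sketch follows (Linux and x86 assumed); note that each counter read goes through a system call, exactly the kind of heavyweight access this note argues against.

  /* Sketch: IPCC = delta(instructions retired) / delta(constant-rate counts).
     Instructions retired come from perf_event_open(); "constant cycles"
     come from RDTSC. Linux and x86 assumed. */
  #include <linux/perf_event.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <x86intrin.h>

  int main(void) {
      struct perf_event_attr attr;
      memset(&attr, 0, sizeof(attr));
      attr.type = PERF_TYPE_HARDWARE;
      attr.size = sizeof(attr);
      attr.config = PERF_COUNT_HW_INSTRUCTIONS;   /* instructions retired */
      attr.exclude_kernel = 1;
      int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
      if (fd < 0) { perror("perf_event_open"); return 1; }

      uint64_t i0, i1;
      read(fd, &i0, sizeof(i0));                  /* read counters */
      uint64_t c0 = __rdtsc();
      volatile uint64_t sum = 0;                  /* <some instructions> */
      for (int i = 0; i < 100000; i++) sum += i;
      uint64_t c1 = __rdtsc();                    /* read again */
      read(fd, &i1, sizeof(i1));
      printf("IPCC ~ %.2f\n",                     /* subtract, take ratio */
             (double)(i1 - i0) / (double)(c1 - c0));
      return 0;
  }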

3. Cross-Program Interference

In a single computer running multiple programs, a change in IPCC over a given repeated code interval directly reflects interference from other programs using shared hardware. For example, just after the operating system zeros one or more physical pages in a minor page fault handler, the resumed user code may see a lower IPCC while it refills the D-cache lines that the handler evicted. In back-to-back bursts of page fault handling, the handler’s IPCC may increase as it warms the caches with its own code and data (to the corresponding detriment of the user code).

Microsecond-scale IPCC over short kernel- and user-mode execution intervals gives a powerful observation tool for understanding dynamic program interference, and hence for understanding one of the major sources of unexpected slow execution [Sites, Richard L. Understanding Software Dynamics. Addison-Wesley, 2021. ISBN 978-0-13-758973-9]. The elapsed-time counter here and the instructions-retired counter in Part 2 allow observation of microsecond-scale IPCC with tiny overhead.

4. Four other counters

Part 2 of this note will discuss the four counters needed to make sense of unexpectedly slow IPCC: total issue slots, issue slots filled with instructions, filled issue slots that actually issue, and instructions retired.

About the Author:  Richard L. Sites has been interested in computer architecture since 1965. He is perhaps best known as co-architect of the DEC Alpha, and for software tracing work at DEC, Adobe, and Google.

Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.