Part 1 of this article discussed a reliable elapsed-time counter design. This Part 2 discusses the four counters needed to make sense of unexpectedly slow execution: total issue slots, issue slots filled with instructions, filled issue slots that actually issue, and instructions retired. The design is inspired by Yasin’s 2014 paper [Yasin, Ahmad. “A Top-Down method for performance analysis and counters architecture,” ISPASS 2014].
1. Tracking Issue Slots
Some chips have some of the counters below, but they are not architected in a consistent way. They all require operating-system-specific kernel-mode access to set up global registers configuring what to count among dozens of counters.
The global setup is a killer. It means that only one program can use the counters at a time. This makes them practically useless in an environment with multiple unrelated programs running at once — no individual program can reliably use the counters without exposure to a second program configuring them differently. It also means that the operating system itself cannot use any of the counters without exposure to user re-configuration ruining their meaning.
The need for operating-system-specific manipulation is also a killer. It means that the same application code (C, JavaScript, or whatever) cannot be used unchanged across several operating systems or even operating system versions. This creates Linux-only code, FreeBSD-only code, Windows-only code, Mac-only code, etc.
The 20-30 cycle read time of today’s performance counters (often implemented in microcode that accesses counters outside the CPU core, in so-called uncore hardware) prevents useful approaches such as automatic compiler insertion of per-subroutine timing, because the measurement slowdown and distortion are too high for short routines.
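To make the distortion concrete, here is a rough back-of-the-envelope sketch. The routine lengths and the function name `timing_overhead` are illustrative assumptions, not measurements. The 25-cycle figure is simply the midpoint of the 20-30 cycle range above, and the sketch assumes one counter read on entry to a routine and one on exit.

```python
def timing_overhead(routine_cycles, read_cycles, reads_per_call=2):
    """Fraction of extra time added by counter reads around one routine call.

    Assumes (hypothetically) one read at entry and one at exit.
    """
    return reads_per_call * read_cycles / routine_cycles


# Hypothetical routine lengths in cycles; these are not measurements.
for routine in (20, 100, 1000):
    slow = timing_overhead(routine, read_cycles=25)  # midpoint of 20-30 cycles
    fast = timing_overhead(routine, read_cycles=1)   # a one-cycle read
    print(f"{routine:5d}-cycle routine: {slow:.0%} overhead vs {fast:.0%}")
```

Under these assumptions, a 100-cycle routine pays 50% overhead with 25-cycle reads but only 2% with one-cycle reads.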
For cores that implement simultaneous multithreading (SMT; Intel calls it Hyper-Threading), each logical core must have its own set of these four counters. From a chip designer’s point of view, it can be interesting to see how a physical chip core performs across all logical threads, but from a programmer’s point of view, the only thing of interest is how it performs running her thread. We adopt the programmer’s point of view here. Intermixing counts from unrelated programs makes the entire exercise meaningless.
2. Total Issue Slots
Every CPU core design has a maximum total number of instructions, T, that it can sustain issuing each cycle. Each per-logical-core Total Issue Slots counter increments by T each (actual) CPU cycle. Different cores may count at different frequencies. In an N-way SMT implementation, the N counters per physical core each increment by T every cycle. T=4 is common today.
3. Filled Issue Slots
Each Filled Issue Slots counter increments each cycle by F, the number of instructions available to issue for that logical core. F is in the range 0..T inclusive. Whenever F is less than T, it indicates a front-end stall — the instruction-fetch hardware has not presented enough instructions to execute. In an SMT implementation, the N Filled counters reflect their respective instruction streams. From a programmer’s point of view, what matters is how many of her instructions are filled and available to issue.
4. Filled Issue Slots That Actually Issue
Each Issued Instructions counter increments each cycle by I, the number of filled instructions that actually issue for that logical core. I is in the range 0..F inclusive. Whenever I is less than F, it indicates a back-end stall — the instruction-execution units have instructions available that they are not able to accept.
5. Instructions Retired
Each Instructions Retired counter increments each cycle by R, the number of instructions retired in that cycle. R is typically in the range 0..T, but some hardware may be able to retire more than T instructions per cycle. John Hauser has pointed out that an instruction reading R can be predicted at issue time based on the I-stream up to that point. The value will be backed up if speculative execution recovery causes the reading instruction to re-execute. The instructions-retired counter and the elapsed-time counter in Part 1 allow calculating IPCC.
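The per-cycle increment rules for the four counters can be sketched as a toy software model. This is only an illustration of the definitions above, not a hardware design. The class name `IssueSlotCounters`, the method `tick`, and the per-cycle values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IssueSlotCounters:
    """Toy software model of one logical core's four counters."""
    width: int = 4      # T: issue slots per cycle
    total: int = 0
    filled: int = 0
    issued: int = 0
    retired: int = 0

    def tick(self, f, i, r):
        """Advance one CPU cycle with F filled, I issued, and R retired."""
        if not 0 <= i <= f <= self.width:
            raise ValueError("need 0 <= I <= F <= T")
        if r < 0:
            # R may exceed T on some hardware, so only the lower bound is checked.
            raise ValueError("R must be non-negative")
        self.total += self.width
        self.filled += f
        self.issued += i
        self.retired += r


core = IssueSlotCounters(width=4)
# Hypothetical per-cycle (F, I, R) values, for illustration only.
for f, i, r in [(4, 4, 0), (2, 2, 3), (4, 1, 1), (0, 0, 2)]:
    core.tick(f, i, r)
print(core.total, core.filled, core.issued, core.retired)
```

In an N-way SMT core, this model would be instantiated once per logical core, each instance advancing by T every cycle.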
6. Useful Ratios
The ratio of Filled / Total slots gives the fraction of the total issue slots that contain instructions, or conversely, what fraction are wasted because instructions are unavailable.
The ratio of Issued / Filled slots gives the fraction of the filled issue slots whose instructions actually start executing, or conversely, what fraction are wasted because execution units are unavailable.
The ratio of Retired / Total slots gives the fraction of the total issue slots that contain executed instructions, or conversely, what fraction are wasted.
The difference S of Issued minus Retired instructions gives the number of speculatively-issued instructions that were suppressed. S does not require a separate counter, just software subtraction.
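As a small worked example, the ratios and the difference S can be computed in software from counter deltas taken over an interval. The function name `slot_ratios` and the counter values below are hypothetical, chosen only to illustrate the arithmetic.

```python
def slot_ratios(total, filled, issued, retired):
    """Compute the ratios above from counter deltas over one interval."""
    if total <= 0:
        raise ValueError("interval must cover at least one cycle")
    return {
        "filled_per_total": filled / total,
        # If nothing was filled, report 0.0 rather than dividing by zero.
        "issued_per_filled": issued / filled if filled else 0.0,
        "retired_per_total": retired / total,
        "suppressed": issued - retired,  # S: plain software subtraction
    }


# Hypothetical counter deltas over an interval of 1000 cycles with T=4.
ratios = slot_ratios(total=4000, filled=3000, issued=2400, retired=2000)
for name, value in ratios.items():
    print(f"{name}: {value}")
```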
Of the total issue slots T for each cycle for each logical core, these counters unambiguously assign each slot to one of four categories:
- Not filled with an instruction (front-end stall)
- Filled but not issued (back-end stall)
- Filled, Issued and suppressed (wasted speculation)
- Filled, Issued and retired
This top-down breakdown directly reflects the performance that a software programmer sees and hints at the major reasons for any slow execution. As a programmer makes code changes, she can see their effect on increasing or decreasing the number of instructions retired over various execution intervals of interest. When running in a multi-program environment, she can see the effects of shared-hardware interference.
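The four-way assignment of slots can be sketched as simple subtraction over counter deltas. The function name `classify_slots` and the input values are hypothetical. The point is only that the four categories always sum to the total.

```python
def classify_slots(total, filled, issued, retired):
    """Split an interval's issue slots into the four categories."""
    slots = {
        "not filled (front-end stall)": total - filled,
        "filled, not issued (back-end stall)": filled - issued,
        "issued, suppressed (wasted speculation)": issued - retired,
        "issued and retired": retired,
    }
    # The four categories partition the total by construction.
    assert sum(slots.values()) == total
    return slots


# Hypothetical counter deltas, for illustration only.
breakdown = classify_slots(total=4000, filled=3000, issued=2400, retired=2000)
for category, count in breakdown.items():
    print(f"{category:42s} {count:5d} {count / 4000:6.1%}")
```

Comparing such a breakdown before and after a code change shows which category absorbed the gain or the loss.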
7. Per-process Counters
It is useful for an operating system to provide per-process copies of the four counters above, context-switching them just as if they were four more general registers. In addition to the global performance view of the four free-running counters above, four per-process counters can reveal program-specific execution bottlenecks. Doing so automatically removes today’s idle-process measurement distortion by accumulating idle counts separately for the idle process and accumulating a programmer’s measurements just for her processes.
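A minimal sketch of the save-and-restore idea follows, assuming the per-process counters behave like extra registers that the operating system swaps at each context switch. The class `LogicalCore`, its methods, the process names, and the counter values are all hypothetical. No real operating-system interface is implied.

```python
COUNTER_NAMES = ("total", "filled", "issued", "retired")


class LogicalCore:
    """Toy model: per-process counters swapped like general registers."""

    def __init__(self):
        self.per_process = dict.fromkeys(COUNTER_NAMES, 0)  # live counter values
        self.saved = {}       # pid -> counter values saved at switch-out
        self.current = None   # pid of the running process

    def run(self, **deltas):
        """Hardware counting while the current process executes."""
        for name in COUNTER_NAMES:
            self.per_process[name] += deltas.get(name, 0)

    def switch_to(self, pid):
        """Save the outgoing process's counters and restore the incoming ones."""
        if self.current is not None:
            self.saved[self.current] = dict(self.per_process)
        fresh = dict.fromkeys(COUNTER_NAMES, 0)
        self.per_process = dict(self.saved.get(pid, fresh))
        self.current = pid


core = LogicalCore()
core.switch_to("app")
core.run(total=400, filled=300, issued=250, retired=200)
core.switch_to("idle")                  # idle time accumulates separately
core.run(total=4000, filled=40, issued=40, retired=40)
core.switch_to("app")
core.run(total=400, filled=320, issued=260, retired=210)
print(core.per_process)                 # the app's counts only
```

In this model the idle process's 4000 slots never appear in the application's totals, which is the distortion-free view described above.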
8. Superiority Over Counting Hardware Events
Note that counting hardware events, such as L1 cache misses, is not as useful for programmers because only some hardware events result in slowing execution. This was a mistake on my part circa 1990 and it has been repeated ever since. A particular program may have lots of L1 cache misses and see a substantial slowdown because of that, while a different program may see no slowdown because the chip hardware successfully covers up the 10 extra cycles or so for each miss. The design here counts the issue slots that are unused, which is the slowdown that the programmer actually sees, rather than just counting the misses. Some misses matter and some do not, and the user of the counts cannot tell which is which.
I have no objection to the myriad existing bottom-up performance-monitoring-unit counters, but they all miss the mark when one cares about first-order performance measurements: they are too narrowly detailed, too slow to use, and potentially meaningless due to global state changes or intermixed program counts.
I do object, however, to the train of thought that says “If four first-order performance counters are good, fourteen would be better.” Stick with exactly these four counters, please.
9. Summary
An architected nanosecond-scale elapsed-time counter and four architected per-logical-core issue-slot counters will provide a strong underpinning for careful software performance analysis and understanding. All must be accessible directly (i.e., with no global setup) in user code via one-cycle instructions. A one-cycle read means that even short sequences of code can be measured with insignificant distortion.
See the case studies in Understanding Software Dynamics for examples of using fine-grained IPCC measurements to understand program interference, and see Yasin’s paper to understand the power of top-down issue-slot decomposition.
About the Author: Richard L. Sites has been interested in computer architecture since 1965. He is perhaps best known as co-architect of the DEC Alpha, and for software tracing work at DEC, Adobe, and Google.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.