Cache prefetching is a well-studied topic, but we continue to strive for improvement. Two small ideas are presented here, one for software prefetching and one for hardware.
- Bigger software prefetches
When I worked at Google, a survey of fleet execution time revealed that 25-35% of all non-idle CPU time was spent in memcpy() or similar routines: comparison, checksumming, decompression, page zeroing, and kernel-user I/O buffer transfers. Some calls handled only a few bytes of data, such as individual UTF-8 characters, webpage tokens, or short decompression strings. Others handled a few hundred bytes of sentences and paragraphs, and still others handled bulk data of 1000 bytes or more.
With main memory 200 or more CPU cycles away and bulk processing rates approaching 16 bytes per cycle, a good prefetching scheme needs to be at least 3200 bytes ahead to fully hide memory access latency. Today’s instructions to prefetch a single cache line are not adequate.
A useful addition is a pair of two-operand instructions:
prefetch_r  addr, len
prefetch_w  addr, len
These instructions prefetch min(len, 4096) bytes for read or write starting at addr. Bounding the length to 4KB means that no more than two pages are accessed and hence no more than two TLB lookups are needed. It also means that the prefetching does not get too far ahead of the actual processing.
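For concreteness, here is a minimal C model of these semantics, expressed with today's single-line __builtin_prefetch. The names model_prefetch_r and model_prefetch_w, the 64-byte line size, and the per-line loop are illustrative assumptions only; real hardware would issue all of the line fetches itself from the single instruction.

/* Minimal C model of the proposed prefetch_r/prefetch_w semantics.
   The 64-byte line size and per-line loop are assumptions for
   illustration; real hardware would fetch the lines without a loop. */
#include <stddef.h>

#define CACHE_LINE    64u
#define PREFETCH_CAP  4096u   /* min(len, 4096) */

static inline void model_prefetch_r(const void *addr, size_t len) {
    size_t n = (len < PREFETCH_CAP) ? len : PREFETCH_CAP;
    const char *p = (const char *)addr;
    const char *end = p + n;
    /* Touch each cache line in [addr, addr+n) for read. With n <= 4096,
       the range spans at most two 4KB pages, hence at most two TLB lookups. */
    for (; p < end; p += CACHE_LINE)
        __builtin_prefetch(p, 0 /* read */, 3 /* keep in all cache levels */);
}

static inline void model_prefetch_w(void *addr, size_t len) {
    size_t n = (len < PREFETCH_CAP) ? len : PREFETCH_CAP;
    char *p = (char *)addr;
    char *end = p + n;
    for (; p < end; p += CACHE_LINE)
        __builtin_prefetch(p, 1 /* write */, 3);
}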
The beginning of memcpy(dest, src, len) and its ilk can then look like this:
memcpy: prefetch_r  src, len
        prefetch_w  dest, len
        ... move the bytes ...
Getting the length sent all the way down to the memory controller logic can allow it to take advantage of 3x faster CAS-only accesses to DRAM row buffers. To take better advantage of this, DRAM interleaving might be done not on single cache lines but on groups of four or eight or modestly more lines.
The designed-in min operation avoids conditional branching, which might dominate if the length to move is tiny. It also avoids prefetching millions of bytes if the length is large. For short moves the only cost is the two starting prefetches, which ideally take just one or two cycles to issue. For relatively long moves, a software loop can interleave occasional 4KB prefetch instructions that keep pace with the data movement.
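As a sketch only, building on the model_prefetch_* helpers above, a long-move path might keep its prefetches roughly 4KB ahead of the copy as shown below. The chunking scheme and the plain byte-copy loop are illustrative assumptions; a real memcpy would move the bytes with wide vector loads and stores.

#include <stddef.h>

#define PREFETCH_AHEAD 4096u

void *memcpy_prefetched(void *dest, const void *src, size_t len) {
    char *d = (char *)dest;
    const char *s = (const char *)src;

    /* The two starting prefetches; the instruction applies min(len, 4096). */
    model_prefetch_r(s, len);
    model_prefetch_w(d, len);

    size_t done = 0;
    while (done < len) {
        size_t left  = len - done;
        size_t chunk = (left < PREFETCH_AHEAD) ? left : PREFETCH_AHEAD;

        /* Before copying this chunk, prefetch the next one, so the
           prefetches keep pace with the data movement. */
        if (left > chunk) {
            model_prefetch_r(s + done + chunk, left - chunk);
            model_prefetch_w(d + done + chunk, left - chunk);
        }

        /* ... move the bytes ... (illustrative; use wide moves in practice) */
        for (size_t i = 0; i < chunk; i++)
            d[done + i] = s[done + i];
        done += chunk;
    }
    return dest;
}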
Two other variations are useful.
prefetch_i  targ, len     prefetch instruction stream
prefetch_o  dest, len     prefetch for overwrite
The first is useful at the entry to a known large block of occasionally-executed code, especially if the compiler and linker have packed the likely-executed code together.
The second variation delays, and usually avoids, the memory reads for lines that are about to be completely overwritten, except for a partial first or last cache line. Since a write-allocating memcpy otherwise reads the source, reads the destination, and writes the destination, eliminating the destination read reduces the memory-system traffic of memcpy by about 1/3.
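A small sketch may make the partial-line exception concrete: at most the first and last destination cache lines are partially written and must still be read, while every fully-covered line can be allocated for overwrite with no memory read at all. The 64-byte line size and the helper name are assumptions for illustration.

#include <stddef.h>
#include <stdint.h>

#define LINE 64u

/* Number of destination cache lines that still require a read before writing.
   All other destination lines can use prefetch_o: allocated, never read. */
static unsigned dest_lines_needing_read(uintptr_t dest, size_t len) {
    if (len == 0) return 0;
    uintptr_t first_line = dest & ~(uintptr_t)(LINE - 1);
    uintptr_t last_line  = (dest + len - 1) & ~(uintptr_t)(LINE - 1);
    int first_partial = (dest % LINE) != 0;           /* copy starts mid-line */
    int last_partial  = ((dest + len) % LINE) != 0;   /* copy ends mid-line   */
    if (first_line == last_line)
        return (first_partial || last_partial) ? 1u : 0u;
    return (unsigned)first_partial + (unsigned)last_partial;   /* 0, 1, or 2 */
}

For bulk moves this is at most two lines out of many, so essentially the entire destination read stream disappears.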
These longer software prefetches may be especially useful for covering the latency of bulk misses that hit in L3 or main memory.
- Dynamic Hardware Next-N Prefetching
The common N+1 hardware prefetching design often speeds up execution slightly, but it also sometimes slows down execution by inaccurate prefetching of unused blocks, tying up the memory system and delaying useful data. Today’s mechanisms to statically enable or disable N+1 prefetching are not adequate.
Instead, we propose a simple dynamic mechanism, an array of two-bit saturating counters indexed by instruction address PC, implemented and sized much like dynamic branch-prediction arrays. The counter values mean:
00 no prefetching
01 N+1 prefetching, i.e. fetch 2 cache lines on a miss to line N
10 N+1..3 prefetching, 4 cache lines
11 N+1..7 prefetching, 8 cache lines
The counters are only used and updated during an L1 cache miss, so need not be very fast or expensive. If these counters have appropriate values for instructions that miss in an L1 cache, they prevent over-use of N+1 prefetching when it is useless, and they allow modest multi-line prefetching when that is useful. At its best the mechanism can reduce misses by a factor of eight.
This mechanism can be applied equally to L1 D-cache misses and L1 I-cache misses. It applies to all misses, not just those that would trigger a stream prefetcher.
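A toy software model may help make the mechanism concrete. The sketch below shows only the lookup side: on an L1 miss, the instruction's two-bit counter selects how many additional lines to prefetch. The table size, the PC hash, and the memory-system hook issue_line_prefetch are assumptions of the sketch.

#include <stdint.h>

#define NCOUNTERS 1024                  /* sized much like a branch-predictor array */

static uint8_t degree_ctr[NCOUNTERS];   /* two-bit values 0..3 per instruction */

extern void issue_line_prefetch(uint64_t line_num);   /* assumed memory-system hook */

static inline unsigned ctr_index(uint64_t pc) {
    return (unsigned)((pc >> 2) & (NCOUNTERS - 1));   /* simple PC hash (assumed) */
}

/* Additional lines to prefetch on a miss to line N at this PC:
   00 -> 0 (none), 01 -> 1 (N+1), 10 -> 3 (N+1..3), 11 -> 7 (N+1..7). */
static inline unsigned prefetch_degree(uint64_t pc) {
    static const unsigned extra[4] = { 0, 1, 3, 7 };
    return extra[degree_ctr[ctr_index(pc)] & 3u];
}

/* Consulted only on an L1 miss, so it need not be fast or expensive. */
void on_l1_miss(uint64_t pc, uint64_t miss_line) {
    unsigned extra = prefetch_degree(pc);
    for (unsigned i = 1; i <= extra; i++)
        issue_line_prefetch(miss_line + i);
}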
2.1. Incrementing a counter. In addition to the array of counters, we keep a small round-robin array of recent increment-on-miss memory addresses (truncated to cache-line granularity), each paired with a saturating-counter index:
<address, counter#>
The array might be 16 to 32 entries.
Each L1 miss on address N looks in the array for N. If found, the corresponding two-bit counter is incremented. Each L1 miss can also create a new entry. The address in the new entry is just off the end of the current prefetch interval, which is a function of the counter value for the instruction that missed:
00 N+1
01 N+2
10 N+4
11 no new entry
The net effect is to increment the counter for PC1 when a later miss, at some instruction PC2, lands on the address just off the end of PC1's current prefetch interval. The next time PC1 takes a miss, it will prefetch a larger interval that includes that address.
On the other hand, if PC1 accesses memory with no prefetching and that is the best behavior for that PC, then a miss to address N+1 will not occur and PC1's counter will remain at no prefetching.
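Continuing the sketch above, and reusing its degree_ctr and ctr_index, the increment side might look like the following. The 16-entry array size, the decision to consume a matched entry, and the function name are choices of the sketch, not part of the proposal.

#include <stdint.h>

#define NRECENT 16                      /* 16 to 32 entries */

typedef struct {
    uint64_t line;                      /* line just off the end of a prefetch interval */
    unsigned ctr_idx;                   /* two-bit counter to increment on a match */
    int      valid;
} RecentEntry;

static RecentEntry recent[NRECENT];
static unsigned    recent_next;         /* round-robin replacement pointer */

/* Called on each L1 miss, alongside on_l1_miss(). */
void train_on_l1_miss(uint64_t pc, uint64_t miss_line) {
    unsigned idx = ctr_index(pc);

    /* 1. If this miss lands just off the end of some earlier interval,
          saturating-increment that instruction's counter. */
    for (unsigned i = 0; i < NRECENT; i++) {
        if (recent[i].valid && recent[i].line == miss_line) {
            if (degree_ctr[recent[i].ctr_idx] < 3) degree_ctr[recent[i].ctr_idx]++;
            recent[i].valid = 0;        /* consume the entry (a choice of this sketch) */
        }
    }

    /* 2. Create a new entry just off the end of this miss's own interval:
          counter 00 -> N+1, 01 -> N+2, 10 -> N+4, 11 -> no new entry. */
    static const unsigned off_end[4] = { 1, 2, 4, 0 };
    unsigned step = off_end[degree_ctr[idx] & 3u];
    if (step != 0) {
        recent[recent_next] = (RecentEntry){ miss_line + step, idx, 1 };
        recent_next = (recent_next + 1) % NRECENT;
    }
}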
2.2. Decrementing a counter. If the prefetch interval for a given instruction is too large, we need a way to dynamically reduce it. We could keep a second round-robin array of mid-interval addresses (none, N+1, N+2, or N+4, depending on the interval) for each prefetching miss, decrementing the corresponding two-bit counter when an entry is replaced without ever having seen a matching miss. But this design does not work well when the reason for seeing no access is precisely that the prefetch interval is good and the mid-interval address is hitting in the L1 cache.
Instead, we take the simple approach of periodically decrementing some counter. If it should not have been decremented, there will be one subsequent extra miss that re-increments it. There is a weak analogy here to the random early detection (RED) queue-management algorithm used with TCP traffic.
The simple decrement design is to count misses that prefetch, i.e. those whose instruction's two-bit counter is not 00, and whenever the count reaches a threshold, to decrement that instruction's two-bit counter. A threshold of 32, for example, means that about 3% of prefetching misses will have their prefetch interval shortened. Using a small linear feedback shift register to count and to set the threshold can somewhat randomize when the decrements occur, breaking synchronization with any loop repetition cadence.
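A sketch of this decrement policy, again reusing degree_ctr and ctr_index from the earlier model, might look like the following. The particular 16-bit LFSR and the 24..39 jitter window are assumptions, chosen only to average near a threshold of 32.

#include <stdint.h>

static unsigned miss_budget = 32;       /* about 3% of prefetching misses decrement */
static uint16_t lfsr = 0xACE1u;         /* any nonzero seed */

/* One step of a Galois 16-bit LFSR, used only to jitter the next threshold. */
static inline uint16_t lfsr_step(void) {
    lfsr = (uint16_t)((lfsr >> 1) ^ (-(lfsr & 1u) & 0xB400u));
    return lfsr;
}

/* Called on each L1 miss whose instruction's two-bit counter is not 00. */
void maybe_decrement(uint64_t pc) {
    if (--miss_budget != 0) return;
    unsigned idx = ctr_index(pc);
    if (degree_ctr[idx] > 0) degree_ctr[idx]--;     /* shorten its prefetch interval */
    miss_budget = 24 + (lfsr_step() & 15u);         /* next threshold in 24..39,
                                                       breaking any loop cadence */
}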
2.3. Prior work. The idea of prefetching multiple cache lines is not new, e.g., see Watkins et al. But in that paper, the prefetch degree was the same for all instructions and was updated only occasionally at a millisecond time scale.
The idea of feedback directed prefetching is also not new, e.g., see Srinath et al., which only considers stream-based prefetching of loop-intensive SPEC benchmarks.
The idea of PC-based prefetching rather than data-address-based is also not new, but it seems largely to be used only to separate different-stride streams that are tied to different PC addresses, i.e. just for stride-based prefetching and not for simple prefetching.
The hardware mechanism proposed here is modest and simple, applying to all L1 cache misses. It may be especially useful to cover the latency for misses that hit in L2.
About the Author: Richard L. Sites has been interested in computer architecture since 1965. He is perhaps best known as co-architect of the DEC Alpha, and for software tracing work at DEC, Adobe, and Google.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.