Cache prefetching explained

Cache prefetching is a technique used by computer processors to boost execution performance by fetching instructions or data from their original storage in slower memory into a faster local memory before they are actually needed (hence the term 'prefetch').[1] [2] Most modern computer processors have fast, local cache memory in which prefetched data is held until it is required. The source for the prefetch operation is usually main memory. Because of their design, accessing cache memories is typically much faster than accessing main memory, so prefetching data and then accessing it from the cache is usually much faster than accessing it directly from main memory. Prefetching can be done with non-blocking cache control instructions.

Data vs. instruction cache prefetching

Cache prefetching can either fetch data or instructions into cache.

Hardware vs. software cache prefetching

Cache prefetching can be accomplished either by hardware or by software.

Methods of hardware prefetching

Stream buffers

Whenever a cache miss occurs for a memory block, its successors (the k subsequent addresses) are fetched into a separate buffer of depth k. This buffer is called a stream buffer and is kept separate from the cache. The processor then consumes data/instructions from the stream buffer when the address associated with a prefetched block matches the requested address generated by the program executing on the processor.

Strided prefetching

This type of prefetching monitors the delta between the addresses of the memory accesses and looks for patterns within it.

Regular strides

In this pattern, consecutive memory accesses are made to blocks that are s addresses apart.[11] In this case, the prefetcher calculates the stride s and uses it to compute the memory address to prefetch. For example, if the stride s is 4 and the current access is to address A, the address to be prefetched would be A + 4.

Irregular spatial strides

In this case, the delta between the addresses of consecutive memory accesses is variable but still follows a pattern. Some prefetcher designs[12] [13] exploit this property to predict and prefetch future accesses.

Irregular temporal prefetching

This class of prefetchers looks for memory access streams that repeat over time.[14] [15] For example, in this stream of memory accesses: N, A, B, C, E, G, H, A, B, C, I, J, K, A, B, C, L, M, N, O, A, B, C, ...; the subsequence A, B, C repeats over time. Other design variations have tried to provide more efficient, higher-performance implementations.[16] [17]

Collaborative prefetching

Computer applications generate a variety of access patterns. The processor and memory subsystem architectures used to execute these applications further influence the memory access patterns they generate. Hence, the effectiveness and efficiency of prefetching schemes often depend on the application and the architectures used to execute them.[18] Recent research[19] [20] has focused on building collaborative mechanisms that synergistically use multiple prefetching schemes for better prefetching coverage and accuracy.

Methods of software prefetching

Compiler directed prefetching

Compiler directed prefetching is widely used within loops with a large number of iterations. In this technique, the compiler predicts future cache misses and inserts a prefetch instruction based on the miss penalty and execution time of the instructions.

These prefetches are non-blocking memory operations, i.e. they do not stall the processor and do not interfere with actual memory accesses. They do not change the state of the processor or cause page faults.

One main advantage of software prefetching is that it reduces the number of compulsory cache misses.

The following example shows how a prefetch instruction would be added into code to improve cache performance.

Consider a for loop as shown below:

    for (int i = 0; i < 1024; i++) {
        array1[i] = 2 * array1[i];
    }

At each iteration, the i-th element of the array "array1" is accessed. Therefore, the system can prefetch the elements that are going to be accessed in future iterations by inserting a "prefetch" instruction as shown below:

    for (int i = 0; i < 1024; i++) {
        prefetch(array1[i + k]);
        array1[i] = 2 * array1[i];
    }

Here, the prefetch stride k depends on two factors: the cache miss penalty and the time it takes to execute a single iteration of the for loop. For instance, if one iteration of the loop takes 7 cycles to execute and the cache miss penalty is 49 cycles, then k = 49/7 = 7, which means that the system should prefetch 7 elements ahead. With the first iteration, i will be 0, so the system prefetches the 7th element. With this arrangement, the first 7 accesses (i = 0 to 6) will still be misses (under the simplifying assumption that each element of array1 is in a separate cache line of its own).

Comparison of hardware and software prefetching

Metrics of cache prefetching

There are three main metrics to judge cache prefetching:[22]

Coverage

Coverage is the fraction of total misses that are eliminated because of prefetching, i.e.

    Coverage = (Cache misses eliminated by prefetching) / (Total cache misses)

where

    Total cache misses = (Cache misses eliminated by prefetching) + (Cache misses not eliminated by prefetching)

Accuracy

Accuracy is the fraction of total prefetches that were useful, i.e. the ratio of the number of prefetched memory addresses that were actually referenced by the program to the total number of prefetches done:

    Prefetch accuracy = (Cache misses eliminated by prefetching) / ((Useless cache prefetches) + (Cache misses eliminated by prefetching))
While it appears that having perfect accuracy might imply that there are no misses, this is not the case. The prefetches themselves might result in new misses if the prefetched blocks are placed directly into the cache. Although these may be a small fraction of the total number of misses observed without any prefetching, this is a non-zero number of misses.

Timeliness

The qualitative definition of timeliness is how early a block is prefetched versus when it is actually referenced. An example to further explain timeliness is as follows:

Consider a for loop where each iteration takes 3 cycles to execute and the 'prefetch' operation takes 12 cycles. This implies that for the prefetched data to be useful, the system must start the prefetch 12/3 = 4 iterations prior to its usage to maintain timeliness.

Notes and References

  1. Smith. Alan Jay. 1982-09-01. Cache Memories. ACM Comput. Surv.. 14. 3. 473–530. 10.1145/356887.356892. 6023466 . 0360-0300.
  2. Li . Chunlin . Song . Mingyang . Du . Shaofeng . Wang . Xiaohai . Zhang . Min . Luo . Youlong . 2020-09-01 . Adaptive priority-based cache replacement and prediction-based cache prefetching in edge computing environment . Journal of Network and Computer Applications . en . 165 . 102715 . 10.1016/j.jnca.2020.102715. 219506427 .
  3. Baer . Jean-Loup . Chen . Tien-Fu . 1991-01-01 . An Effective On-chip Preloading Scheme to Reduce Data Access Penalty . 1991 ACM/IEEE Conference on Supercomputing . Albuquerque, New Mexico, USA . Association for Computing Machinery . 176–186 . 10.1.1.642.703 . 10.1145/125826.125932 . 978-0897914598.
  4. Kennedy. Porterfield, Allan. Software methods for improvement of cache performance on supercomputer applications. 1989-01-01. Rice University. 1911/19069.
  5. Jouppi . Norman P. . 1990 . Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers . 17th annual international symposium on Computer Architecture – ISCA 1990 . New York, New York, USA . Association for Computing Machinery Press . 364–373 . 10.1.1.37.6114 . 10.1145/325164.325162 . 0-89791-366-3 . Proceedings of the 17th annual international symposium on Computer Architecture – ISCA 1990.
  6. Chen. Tien-Fu. Baer. Jean-Loup. 1450745. 1995-05-01. Effective hardware-based data prefetching for high-performance processors. IEEE Transactions on Computers. 44. 5. 609–623. 10.1109/12.381947. 0018-9340.
  7. Palacharla . S. . Kessler . R. E. . 1994-01-01 . Evaluating Stream Buffers As a Secondary Cache Replacement . 21st Annual International Symposium on Computer Architecture . Chicago, Illinois, USA . IEEE Computer Society Press . 24–33 . 10.1.1.92.3031 . 10.1109/ISCA.1994.288164 . 978-0818655104.
  8. Grannaes . Marius . Jahre . Magnus . Natvig . Lasse . Storage Efficient Hardware Prefetching using Delta-Correlating Prediction Tables . 10.1.1.229.3483 . Journal of Instruction-Level Parallelism . 13 . 2011 . 1–16.
  9. Ishii . Yasuo . Inaba . Mary . Hiraki . Kei . 2009-06-08 . Access map pattern matching for data cache prefetch . ICS 2009 . New York, New York, USA . Association for Computing Machinery . 499–500 . 10.1145/1542275.1542349 . 978-1-60558-498-0 . Proceedings of the 23rd International Conference on Supercomputing . 37841036.
  10. Srinath . Santhosh . Mutlu . Onur . Kim . Hyesoon . Hyesoon Kim. Patt . Yale N.. Yale Patt . February 2007 . Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers . 2007 IEEE 13th International Symposium on High Performance Computer Architecture . 63–74 . 10.1109/HPCA.2007.346185. 978-1-4244-0804-7 . 6909725 .
  11. Kondguli . Sushant . Huang . Michael . November 2017 . T2: A Highly Accurate and Energy Efficient Stride Prefetcher . 2017 IEEE International Conference on Computer Design (ICCD) . 373–376 . 10.1109/ICCD.2017.64. 978-1-5386-2254-4 . 11055312 .
  12. Shevgoor . Manjunath . Koladiya . Sahil . Balasubramonian . Rajeev . Wilkerson . Chris . Pugsley . Seth H. . Chishti . Zeshan . December 2015 . Efficiently prefetching complex address patterns . 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) . 141–152 . 10.1145/2830772.2830793 . 9781450340342 . 15294463.
  13. Kim . Jinchun . Pugsley . Seth H. . Gratz . Paul V. . Reddy . A.L. Narasimha . Wilkerson . Chris . Chishti . Zeshan . October 2016 . Path confidence based lookahead prefetching . 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) . 1–12 . 10.1109/MICRO.2016.7783763. 978-1-5090-3508-3 . 1097472 .
  14. Joseph . Doug . Grunwald . Dirk . 1997-05-01 . Prefetching using Markov predictors . ISCA 1997 . ISCA 1997 . New York, New York, USA . Association for Computing Machinery . 252–263 . 10.1145/264107.264207 . 978-0-89791-901-2 . Proceedings of the 24th Annual International Symposium on Computer Architecture . 434419.
  15. Collins . J. . Sair . S. . Calder . B. . Tullsen . D.M. . November 2002 . Pointer cache assisted prefetching . 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings. . 62–73 . 10.1109/MICRO.2002.1176239. 0-7695-1859-1 . 5608519 .
  16. Jain . Akanksha . Lin . Calvin . 2013-12-07 . Linearizing irregular memory accesses for improved correlated prefetching . MICRO-46 . New York, New York, USA . Association for Computing Machinery . 247–259 . 10.1145/2540708.2540730 . 978-1-4503-2638-4 . Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture . 196831.
  17. Web site: Making Temporal Prefetchers Practical: The MISB Prefetcher – Research Articles – Arm Research – Arm Community . 2022-03-16 . community.arm.com . 24 June 2019 . en-us.
  18. Kim . Jinchun . Teran . Elvira . Gratz . Paul V. . Jiménez . Daniel A. . Pugsley . Seth H. . Wilkerson . Chris . 2017-05-12 . Kill the Program Counter: Reconstructing Program Behavior in the Processor Cache Hierarchy . ACM SIGPLAN Notices . en . 52 . 4 . 737–749 . 10.1145/3093336.3037701 . 0362-1340. free .
  19. Kondguli . Sushant . Huang . Michael . 2018-06-02 . Division of labor: a more effective approach to prefetching . Proceedings of the 45th Annual International Symposium on Computer Architecture . ISCA '18 . Los Angeles, California . IEEE Press . 83–95 . 10.1109/ISCA.2018.00018 . 978-1-5386-5984-7. 50777324 .
  20. Pakalapati . Samuel . Panda . Biswabandan . May 2020 . Bouquet of Instruction Pointers: Instruction Pointer Classifier-based Spatial Hardware Prefetching . 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) . 118–131 . 10.1109/ISCA45697.2020.00021. 978-1-7281-4661-4 . 218683672 .
  21. Callahan . David . Kennedy . Ken . Porterfield . Allan . 1991-01-01 . Software Prefetching . Fourth International Conference on Architectural Support for Programming Languages and Operating Systems . Santa Clara, California, USA . Association for Computing Machinery . 40–52 . 10.1145/106972.106979 . 978-0897913805.
  22. Book: Solihin, Yan . Fundamentals of parallel multicore architecture . CRC Press, Taylor & Francis Group . 2016 . 978-1482211184 . Boca Raton, Florida . 163 . en-us.