CPU Cache Overview

References:
https://stackoverflow.com/questions/3928995/how-do-cache-lines-work
https://lwn.net/Articles/250967/
Computer Systems - A Programmers Perspective – Chapter 6 "The Memory Hierarchy"
https://stackoverflow.com/a/11227902/560065

Overview
The following text provides a high level view of the different CPU read and write operations when experiencing different cache hit and miss scenarios for CPU D-cache and I-cache and how each scenario affects performance.

Most CPUs have Level 1, 2 and 3 cache's on chip (time of writing is 2017) which are usually SRAM. Usually separate L1 instruction cache and L1 data cache exist with L2 and L3 caches being "mixed" in that they can stored both data and instructions. Data is ideally prefetched from RAM or disk (main memory) which normally happens in 64 byte "cache lines" (CPU and memory architecture dependant). When a single byte is needed from RAM 64 bytes are read from the nearest 64 byte address boundary (the lowest six address bits are all zeros) as data is often needed from sequential addresses it helps to amortize some sequential reads over this one "larger" read. If just one byte is needed that is not in L1 D-cache, L2, or L3, then a cache line of 64 bytes will be read from RAM. The addresses of data read from RAM into cache are usually placed in a hash table that states where in the cache the data it is located, when making a read request the memory address can be checked against the hash table, if it is present in L1/L2/L3 cache it can be read from there, else read from main memory.

Typical memory read times for a 64 byte block can be as follows (it highly depends on CPU and memory architecture);
CPU Registers: 0 cycles
L1 cache: 1-4 cycles
L2 cache: 6-10 cycles
L3 cache: 24-40 cycles
RAM: 100-200 cycles

With the "Intel Core i7 processor each CPU chip has four cores. Each core has its own private L1 i-cache, L1 d-cache, and L2 unified cache. All of the cores share an on-chip L3 unified cache. An interesting feature of this hierarchy is that all of the SRAM cache memories are contained in the CPU chip. L1 I-cache access time: 4 cycles. L1 d-cache access time: 4 cycles. L2 unified cache access time: 11 cycles. L3 unified cache access time: 30-40 cycles."

Data Cache Read Hit
Data that needs to be read by the current instruction has already been prefetched into the D-cache so it is accessed from the local memory (D-cache) and thus much faster than if it was absent.

"As we have seen, the operation of a cache with respect to reads is straightforward. First, look for a copy of the desired word w in the cache. If there is a hit, return w immediately. If there is a miss, fetch the block that contains w from the next lower level of the memory hierarchy, store the block in some cache line (possibly evicting a valid line), and then return w." - It is important to note that different cache store and replacement algorithms also effect cache performance;

Data Cache Read Miss
Data that needs to be read by the current instruction hasn't been prefetched into the local memory so the instruction will [probably] stall whilst the main memory is accessed (usually off-chip RAM). Techniques such as pipelining exist to help maintain overall CPU performance during such events but for some workloads this doesn't help and the thread of execution waits for the read from main memory to return.

"Line Replacement on Misses in Set Associative Caches
If the word requested by the CPU is not stored in any of the lines in the set, then we have a cache miss, and the cache must fetch the block that contains the word from memory. However, once the cache has retrieved the block, which line should it replace? Of course, if there is an empty line, then it would be a good candidate. But if there are no empty lines in the set, then we must choose one of the nonempty lines and hope that the CPU does not reference the replaced line anytime soon. It is very difficult for programmers to exploit knowledge of the cache replacement policy in their codes, so we will not go into much detail about it here. The simplest replacement policy is to choose the line to replace at random. Other more sophisticated policies draw on the principle of locality to try to minimize the probability that the replaced line will be referenced in the near future. For example, a least-frequently-used (LFU) policy will replace the line that has been referenced the fewest times over some past time window. A least-recently-used (LRU) policy will replace the line that was last accessed the furthest in the past. All of these policies require additional time and hardware. But as we move further down the memory hierarchy, away from the CPU, the cost of a miss becomes more expensive and it becomes more worthwhile to minimize misses with good replacement
policies."

Data Cache Write Hit
Data that is being written or updated has been prefetched into the D-cache (more data may exist in the D-cache than is being updated by the current instruction, e.g. a 1MB cache) so the write is made to the respective entries in the D-cache and later de-staged back to main memory (RAM or disk).

"First, look for a copy of the desired word w in the cache... After the cache updates its copy of w, what does it do about updating the copy of w in the next lower level of the hierarchy? The simplest approach, known as write-through, is to immediately write w's cache block to the next lower level. While simple, write-through has the disadvantage of causing bus traffic with every write. Another approach, known as write-back, defers the update as long as possible by writing the updated block to the next lower level only when it is evicted from the cache by the replacement algorithm. Because of locality, write-back can significantly reduce the amount of bus traffic, but it has the disadvantage of additional complexity. The cache must maintain an additional dirty bit for each cache line that indicates whether or not the cache block has been modified."

Data Cache Write Miss
Data that is being written or updated which has not been prefetched into the D-cache. This will cause the currently executing instruction to stall whilst the data is fetched from main memory into a cache line to be updated.

"One approach, known as write-allocate, loads the corresponding block from the next lower level into the cache and then updates the cache block. Write-allocate tries to exploit spatial locality of writes, but it has the disadvantage that every miss results in a block transfer from the next lower level to cache. The alternative, known as no-write-allocate, bypasses the cache and writes the word directly to the next lower level. Write-through caches are typically no-write-allocate. Write-back caches are typically write-allocate."

Instruction Cache Read Hit
The current instruction to be fetched/decoded has been prefetched into the I-cache (on-chip memory). "Loading" this instruction (to then be decoded and executed) is very fast.

Instruction Cache Read Miss
The current instruction has not been prefetched to the local I-cache so execution of this instruction will stall whilst fetching it form main (off-chip) memory. Techniques like branch prediction and pipelining can help with this issue however certain workloads will suffer as instruction execution stalls.


Previous page: C Notes
Next page: DPDK Notes