CPU Cache


24 Apr 2020

In this post we’ll study CPU caches. We’ll first understand why they exist and some of their typical properties, and then explore vulnerabilities associated with them.

Motivation

According to Wikipedia [1]:

Microprocessors have advanced much faster than memory, especially in terms of their operating frequency, so memory became a performance bottleneck.

How much faster is it? Doing some estimates based on a benchmark for Intel’s Haswell architecture [6], it seems that the worst case of RAM access is 100x slower than the fastest CPU cache. See Appendix A for our calculations.

CPU cache is cache

I wanted to begin with this tautology to call out that many of the properties and design choices of the high-level caching systems we might be more familiar with also apply to a CPU cache. It was useful for me to keep this in mind while learning this new topic.

For example, as we’ll see later in more detail, a CPU cache can be seen as a key-value table, which requires hash-like strategies for assigning keys to rows and for cache eviction.

Types

There are multiple CPU caches and they perform different functions. The exact setup varies depending on the CPU model, but there are some general similarities.

Hierarchy

The CPU cache is divided into multiple levels. The first level, referred to as L1, is the fastest and smallest. The n-th level, Ln, is the largest and slowest. Typical values of n seem to be 3 or 4. In a multi-core CPU, the first levels of cache are exclusive to each core, while the later levels are shared.

For a real-world example, Wikipedia [4] describes Intel’s Kaby Lake architecture, where L1 is 64KB, L2 is 256KB and L3 is between 2MB and 8MB, shared by the cores.

Why do we have multiple levels of cache vs. a single level? This Q&A [5] suggests that it allows working with different tradeoffs of physical constraints, namely the location and size of the component.

Data, Instruction, Address

L1 is often divided between data and instruction. As the name suggests, the L1-data caches data from memory needed during a program execution, while the L1-instruction caches the instructions needed to execute the program itself.

There’s another specialized cache called translation lookaside buffer, or TLB, which is used to lookup memory addresses (we’ll see more details later).

Why do we need separate caches for these 3 different uses? Wikipedia [1] provides an insight:

Pipelined CPUs access memory from multiple points in the pipeline: instruction fetch, virtual-to-physical address translation, and data fetch. The natural design is to use different physical caches for each of these points, so that no one physical resource has to be scheduled to service two points in the pipeline.

It’s worth recalling that a CPU operates in cycles and it follows a set of operations in a pipeline fashion (think of a conveyor belt in a factory). The following is a diagram of a possible CPU pipeline:

(Figure: diagram of a pipelined CPU)
Source: https://en.wikibooks.org/wiki/Microprocessor_Design/Pipelined_Processors

Structure

We’ll now analyze the structure of a CPU cache, in particular a data cache. Like any general purpose cache, it conceptually represents a lookup table or a key-value structure, so we’ll use this as our analogy to make absorbing new terminology easier.

Cache entry

The value of our lookup table is called a cache line and it has a fixed size. The key is called a tag and, in a data cache, corresponds to a memory address. The combined key-value entry of our table is called a cache entry. The cache entry also reserves some bits to store flags, which can be seen as metadata.

A cache entry can have an associated index, which would correspond to the row number in a table. The index information can be utilized to shave off some bits from the tag. To see why, imagine a cache with 64 (2^6) entries and that our assignment policy (think hash function) is that an address A is assigned to index A mod 64. So for example, an address (in binary) 10110010110101 would be assigned to index 110101 (53 in decimal). Because the number of entries, 64, is a power of 2, the 6 least significant bits determine the bucket. This means our tag for 10110010110101 can be just 10110010, since the remaining bits 110101 are implicit from the index.
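
To make this concrete, here is a minimal C sketch of the decomposition, assuming the hypothetical 64-entry direct-mapped layout from the example above (the constants and names are illustrative, not a real cache configuration):

#include <stdio.h>
#include <stdint.h>

// Illustrative direct-mapped cache with 64 (2^6) entries.
#define INDEX_BITS 6
#define NUM_ENTRIES (1u << INDEX_BITS)

int main() {
    uint64_t address = 0x2CB5;                // 10110010110101 in binary

    uint64_t index = address % NUM_ENTRIES;   // low 6 bits: 110101 = 53
    uint64_t tag   = address >> INDEX_BITS;   // remaining bits: 10110010

    printf("index = %llu, tag = 0x%llx\n",
           (unsigned long long)index, (unsigned long long)tag);
    return 0;
}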

Associativity

We hinted at this above when talking about the mod function: we can implement our lookup table using a simple hash function, and when talking about hash functions we necessarily need to think of collisions, i.e., two different entries mapping to the same index.

On one side, a CPU cache is “bare metal” hardware, so complex hash collision resolution strategies are infeasible to implement. On the other hand, it is a cache, not a persistent key-value store, meaning that when a collision happens we can just evict the previous entry.

Still, high eviction rates can also mean high cache miss rates, which is a key quality metric of a cache. Given the limited size of a cache, there are two dimensions we can trade off: the number of indexes and the number of cache entries per index. In our analogy, this roughly maps to the number of rows vs. the number of columns.

Some common strategies use either one column (direct mapping) or two columns (two-way set associativity), but in theory we could have any number of columns (Intel’s Haswell architecture uses 8) [6]. Any such strategy is what we call associativity. Let’s take a closer look at these strategies.

Direct-mapping

This works like a hash table with a simple hash function: each address maps to exactly one index/row. If a new entry maps to a row already “occupied”, the previous entry is evicted.

The advantage of this is simplicity: there’s no criterion we need to apply to choose which “cell” of our table to use once we’ve determined the row, since there’s only one. Eviction is easy too.

The downside is that we’re subject to degenerate cases where two hot addresses map to the same index. This results in the entries constantly swapping each other out, causing many cache misses.

Two-way set associativity

As we said, we have two columns for each row, so a given memory address can go into either of two cells. Deciding which column to assign it to boils down to deciding which of the existing values in that row to evict, and a natural strategy is to use LRU (least recently used).

To do so, we need an extra bit to keep track of this information: every time an item is read, we set a bit in that cell to 1 and to 0 in the other. We can then always evict the slot with bit 0.

This is a more complex approach than direct mapping, and we need an extra bit to keep track of the LRU information. On the other hand, it avoids the degenerate case of direct mapping, though it suffers from the same problem if there are 3 or more hot addresses mapping to the same index.
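
As a rough sketch of the idea (not how real hardware implements it, which uses parallel comparator circuitry; all names and sizes here are made up), a two-way set-associative lookup with a single LRU bit per set could look like this in C:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_SETS 64

typedef struct {
    bool     valid[2];
    uint64_t tag[2];
    int      lru;        // which way (0 or 1) is least recently used
} Set;

Set cache[NUM_SETS];

// Returns true on a hit; on a miss, evicts the LRU way and installs the tag.
bool cache_access(uint64_t address) {
    uint64_t index = address % NUM_SETS;
    uint64_t tag   = address / NUM_SETS;
    Set *set = &cache[index];

    for (int way = 0; way < 2; way++) {
        if (set->valid[way] && set->tag[way] == tag) {
            set->lru = 1 - way;   // the other way becomes least recently used
            return true;          // hit
        }
    }

    // Miss: evict whichever way was least recently used.
    int victim = set->lru;
    set->valid[victim] = true;
    set->tag[victim]   = tag;
    set->lru           = 1 - victim;
    return false;
}

int main() {
    // Addresses 5 and 69 collide on the same set (they differ by NUM_SETS).
    printf("%d\n", cache_access(5));    // 0: miss, fills the first way
    printf("%d\n", cache_access(69));   // 0: miss, fills the second way
    printf("%d\n", cache_access(5));    // 1: hit, still cached
    return 0;
}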

N-way associative cache

A more general structure would be N columns per row. This further shifts the balance between complexity (the higher the N, the more expensive it is) and avoiding degenerate cases (it is resilient to up to N hot addresses).

Translation Lookaside Buffer

Before we describe the translation lookaside buffer, we need to revisit virtual memory.

Virtual Memory: a prelude

Virtual memory is an abstraction layer that hides the details of physical memory. Some use cases: isolating processes from each other, presenting each process with its own contiguous address space, and allowing more memory to be addressed than is physically available (by swapping pages to disk).

The mapping between virtual memory and physical memory is done by the memory management unit, or MMU. It keeps a table (the page table) in memory (RAM) that has one entry for each page, mapping it to the corresponding physical address.

Why is memory segmented into pages? The page table is a likely reason. If we mapped byte to byte, the page table wouldn’t fit in memory because we’d need an entry for every byte! We need a good balance: a page that’s too big (coarse granularity) leads to wasted memory if most of the page goes unused, while a page that’s too small (fine granularity) leads to a gigantic page table.
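
As a back-of-the-envelope illustration of this tradeoff, here’s a tiny C program assuming 32-bit virtual addresses and 4 KiB pages (hypothetical but common round numbers):

#include <stdio.h>
#include <stdint.h>

int main() {
    uint64_t address_space = 1ull << 32;  // 4 GiB of virtual addresses
    uint64_t page_size     = 4096;        // 4 KiB pages

    // One entry per page: 2^32 / 2^12 = 2^20, about a million entries.
    printf("entries with 4 KiB pages: %llu\n",
           (unsigned long long)(address_space / page_size));

    // One entry per byte: 2^32, about four billion entries -- far too many.
    printf("entries with byte granularity: %llu\n",
           (unsigned long long)address_space);
    return 0;
}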

TLB: The page table cache

Many CPU instructions require access to data in (physical) memory, so the CPU needs to translate a given virtual address to its physical counterpart. As we’ve seen, memory access is slow, so not surprisingly there’s also a cache for the page table, which is called the translation lookaside buffer (TLB).
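
Conceptually, a translation first checks the TLB and only falls back to the in-memory page table on a miss. Below is a hand-wavy C sketch of that flow; all structures, sizes and the page_table_lookup stub are invented for illustration, and real TLBs are associative hardware rather than array scans:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096
#define TLB_ENTRIES 64

typedef struct {
    bool     valid;
    uint64_t virtual_page;
    uint64_t physical_frame;
} TlbEntry;

TlbEntry tlb[TLB_ENTRIES];

// Stub for illustration: pretend each virtual page maps to the same frame number.
uint64_t page_table_lookup(uint64_t virtual_page) {
    return virtual_page;
}

uint64_t translate(uint64_t virtual_address) {
    uint64_t virtual_page = virtual_address / PAGE_SIZE;
    uint64_t offset       = virtual_address % PAGE_SIZE;

    // Fast path: check the TLB.
    TlbEntry *entry = &tlb[virtual_page % TLB_ENTRIES];
    if (entry->valid && entry->virtual_page == virtual_page) {
        return entry->physical_frame * PAGE_SIZE + offset;
    }

    // Slow path: walk the page table in RAM and cache the result in the TLB.
    uint64_t frame = page_table_lookup(virtual_page);
    entry->valid          = true;
    entry->virtual_page   = virtual_page;
    entry->physical_frame = frame;
    return frame * PAGE_SIZE + offset;
}

int main() {
    // Translate the same address twice; the second time hits the TLB.
    printf("0x%llx\n", (unsigned long long)translate(0x12345));
    printf("0x%llx\n", (unsigned long long)translate(0x12345));
    return 0;
}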

Page or Cache Coloring

Imagine our application needs access to a large block of memory. This will consist of a list of virtual pages that are adjacent in the virtual address space. However, they are not guaranteed to be adjacent in the physical address space.

The CPU cache is designed to map adjacent physical addresses to different cache lines, so that when we access a contiguous piece of physical memory, most of the data can be in cache at the same time. If adjacent addresses mapped to the same cache line, we’d keep evicting entries.

What can happen is that adjacent virtual pages are not physically adjacent, which means there’s no guarantee they will map to different cache lines.

To address this problem, page frames are “colored” based on which cache lines they would be assigned to [8]. The corresponding virtual page “inherits” that color. When assigning sequential pages to physical locations, the OS tries to make sure that contiguous pages get a diverse set of colors.
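
A simplified way to compute a page’s color, assuming a physically-indexed cache: the color is determined by which group of cache sets the page’s addresses fall into. The parameters in this sketch are made up for illustration:

#include <stdint.h>
#include <stdio.h>

// Hypothetical cache parameters, purely illustrative.
#define CACHE_SIZE    (256 * 1024)   // 256 KiB
#define ASSOCIATIVITY 8
#define PAGE_SIZE     4096

// Each "way" of the cache spans CACHE_SIZE / ASSOCIATIVITY = 32 KiB of
// consecutive addresses, i.e. 8 pages, so there are 8 possible colors here.
#define NUM_COLORS (CACHE_SIZE / (ASSOCIATIVITY * PAGE_SIZE))

// Physical pages with the same color compete for the same cache sets.
uint64_t page_color(uint64_t physical_address) {
    return (physical_address / PAGE_SIZE) % NUM_COLORS;
}

int main() {
    // Pages 0 and 8 (32 KiB apart) share a color and thus compete in the cache.
    printf("%llu %llu\n",
           (unsigned long long)page_color(0),
           (unsigned long long)page_color(8 * PAGE_SIZE));
    return 0;
}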

Security

Whenever we add caching to improve the performance of a system, an important consideration is security. One example is caching some expensive computation which also performs access checks: the cached data could be vulnerable if an attacker has access to the key.

We’ll now concentrate on two famous exploits known as Spectre and Meltdown. Before that, we need to understand out-of-order execution.

Out of order execution

In conjunction with caching, an optimization technique the CPU employs to speed up execution is to execute instructions ahead of time (out of order) if it can guarantee the end result is correct.

For example, say one instruction i requires reading a value from memory, which can be expensive on a cache miss. Say the next instruction i+1 doesn’t depend on this value. The CPU can then execute instruction i+1 before i is finished.

One more concrete example is called branch prediction. Say your code has a conditional like:

if (condition) { expression1; } else { expression2; }

and that evaluating condition is slow. The CPU might predict condition=true and execute expression1 before it knows whether condition is actually true. If it happens to be false, it discards the speculative results and resumes execution at the other branch.

Meltdown and Spectre

A few years ago, two vulnerabilities related to the CPU cache were found, named Spectre and Meltdown. Let’s analyze Spectre [11] first.

Spectre

As we mentioned above, the CPU might perform branch prediction. The problem is that the memory accessed while speculatively executing expression1 stays in the cache, even if that work is later abandoned.

Consider the following contrived example proposed by ricbit [7]:

if (bigvalue < a.length) {
  value = a[bigvalue];
  if ((value & 1) > 0) {
    x = a[100];
  } else {
    x = a[200];
  }
}
// Timing attack
const t1 = performance.now();
const y1 = a[100];
const duration1 = performance.now() - t1;

const t2 = performance.now();
const y2 = a[200];
const duration2 = performance.now() - t2;

if (duration1 * 10 < duration2) {
  // a[100] is cached, value & 1 = 1
} else if (duration2 * 10 < duration1) {
  // a[200] is cached, value & 1 = 0
}

Suppose we want to read the contents of the memory address represented by

a[bigvalue]

Because we didn’t explicitly allocate memory for that position, we generally won’t have access to it. But say we can be sure the CPU will perform branch prediction in

bigvalue < a.length


And predict it will be true. Then it will execute the next few lines before it determines the condition is actually false. By that time it has accessed the value in a[bigvalue] without permission checks. If the last bit of the value is 1, x will have the value of a[100], but importantly the CPU will cache the memory address represented by a[100]. Otherwise, it will do the same with a[200].

After the branch prediction fails and the speculative work is discarded, we execute our timing attack. It simply measures the time to access both a[100] and a[200]. Whichever address is cached will return much faster, which in turn is a proxy for the last bit of the value at a[bigvalue]. Repeat this for all the bits and we’ll recover the exact value.

Meltdown

Meltdown uses a similar approach [12], but instead of relying on branch prediction, it relies on the CPU executing instructions out of order after an exception is raised. If the application manages to handle the exception, it can then use a timing attack to detect which value was cached.

The paper provides the code in assembly. I’ve attempted to write a pseudo version in C (meaning it doesn’t work as a real exploit, but captures the idea) for educational purposes:

#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

void primer(char probe[256]) {
    // Some specific (protected) address we have no permission to read
    uintptr_t addr = 136322;
    // This will throw a segmentation fault, but we're "catching" it below
    char v = *(char *)addr;
    // This will be executed speculatively by the CPU, bringing the probe
    // entry indexed by the secret byte into the cache. (A real exploit
    // would space the entries one cache line apart, e.g. probe[v * 4096].)
    probe[v] = 1;
}

sigjmp_buf point;

// segfault signal handling
static void handler(int sig, siginfo_t *dont_care, void *dont_care_either) {
   siglongjmp(point, 1);
}

int main() {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);

    sa.sa_flags     = SA_SIGINFO | SA_NODEFER;
    sa.sa_sigaction = handler;
    sigaction(SIGSEGV, &sa, NULL);

    char probe[256];
    if (sigsetjmp(point, 1) == 0) {
        primer(probe);
    } else {
        printf("Read chunks of probe and check which one is cached\n");
    }

    return 0;
}

The steps are:


1-) Try accessing some protected memory address; an exception is thrown due to the invalid access, but we handle it so the application doesn’t crash.

2-) Right after the exception, use the content of the protected address as the index into a probe array (at a non-protected address). This exploits the CPU behavior of executing instructions out of order and caching their memory accesses. In this scenario, it will speculatively execute the instruction above and cache that portion of the array.

3-) Timing: try reading parts of the array and measure which one returns faster. The faster one indicates that its address is cached, which in turn reveals the value read from the protected address.

Spectre vs. Meltdown?

One of the most difficult things for me was to understand the precise difference between Spectre and Meltdown. I’ve seen a lot of comparisons between the two, but they’re too high-level to satisfy my curiosity. On the other hand, the distinctions provided by the papers are hard for me to grasp.

The discussion of Spectre [11] on Meltdown is a bit clearer:

Meltdown is a related attack which exploits out-of-order execution to leak kernel memory. Meltdown is distinct from Spectre attacks in two main ways. First, unlike Spectre, Meltdown does not use branch prediction. Instead, it relies on the observation that when an instruction causes a trap, following instructions are executed out-of-order before being terminated. Second, Meltdown exploits a vulnerability specific to many Intel and some ARM processors which allows certain speculatively executed instructions to bypass memory protection. Combining these issues, Meltdown accesses kernel memory from user space. This access causes a trap, but before the trap is issued, the instructions that follow the access leak the contents of the accessed memory through a cache covert channel.

The remarks of Meltdown [12] on Spectre are more succinct and obscure:

Meltdown is distinct from the Spectre Attacks in several ways, notably that Spectre requires tailoring to the victim process’s software environment, but applies more broadly to CPUs.

My feeling is that the differences are too specific and technical to be explained in a simple way, yet I seem to lack the background to understand them fully.

Conclusion

I was reading a paper that made references to CPU cache lines, which prompted me to research the topic. I was surprised by how much information there is about it, so I decided to dedicate an entire post to it.

It was also great to dive deeper into Spectre and Meltdown, even though I didn’t fully internalize their specifics. I also learned a lot about virtual memory.

Appendix A: RAM vs Cache latency estimates

[6] provides a benchmark for Intel’s Haswell architecture, a high-end i7-4770. It provides data in terms of CPU cycles:

References