## **Advanced Systems Lab**

Spring 2021

Lecture: Memory hierarchy, locality, caches

Instructor: Markus Püschel, Ce Zhang

TA: Joao Rivera, several more

ETH

Eidgenössische Technische Hochschule Zürich Swiss Federal Institute of Technology Zurich

# Organization

Temporal and spatial locality

Memory hierarchy

Caches

Chapter 5 in Computer Systems: A Programmer's Perspective, 2<sup>nd</sup> edition, Randal E. Bryant and David R. O'Hallaron, Addison Wesley 2010 Part of these slides are adapted from the course associated with this book









## **Example: Locality?**

```
sum = 0;
for (i = 0; i < n; i++)
  sum += a[i];
return sum;</pre>
```

#### Data:

- Temporal: **sum** referenced in each iteration
- Spatial: array a[] accessed consecutively

#### Instructions:

- Temporal: loops cycle through the same instructions
- Spatial: instructions referenced in sequence

Being able to assess the locality of code is a crucial skill for a performance programmer

# **Locality Example #1**

```
int sum_array_rows(double a[M][N])
{
  int i, j, sum = 0;
  for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
      sum += a[i][j];
  return sum;
}</pre>
```

# **Locality Example #2**

```
How to improve locality?
 int sum_array_3d(double a[K][M][N])
   int i, j, k, sum = 0;
   for (i = 0; i < M; i++)
     for (j = 0; j < N; j++)
for (k = 0; k < K; k++)
          sum += a[k][i][j];
   return sum;
Performance [flops/cycle]
                                   Loop order k-i-j
                                                         flags: -O3 -fno-vectorize
```



CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz gcc: Apple LLVM version 8.0.0 (clang-800.0.42.1)

# **Operational Intensity Again**

Definition: Given a program P, assume cold (empty) cache

Operational intensity: 
$$I(n) = \frac{W(n)}{Q(n)}$$
 #flops (input size n)

#bytes transferred cache  $\leftrightarrow$  memory (for input size n)

Examples: Determine asymptotic bounds on I(n)

| Vector sum: y = x + y                    | O(1)      |
|------------------------------------------|-----------|
| Matrix-vector product: y = Ax            | O(1)      |
| <ul><li>Fast Fourier transform</li></ul> | O(log(n)) |
| Matrix-matrix product: C = AB + C        | O(n)      |

## **Compute/Memory Bound**

A function/piece of code is:

- Compute bound if it has high operational intensity
- Memory bound if it has low operational intensity

Relationship between operational intensity and locality?

- They are closely related
- Operational intensity only describes the boundary last level cache/memory











## Types of Cache Misses (The 3 C's)

#### Compulsory (cold) miss

Occurs on first access to a block

#### Capacity miss

Occurs when working set is larger than the cache

#### Conflict miss

Conflict misses occur when the cache is large enough, but multiple data objects all map to the same slot

Not a clean classification but still useful



#### Ignore the variables sum, i, j **Example (S=8, E=1)** assume: cold (empty) cache, a[0][0] goes here int sum\_array\_rows(double a[16][16]) int i, j; double sum = 0; for (i = 0; i < 16; i++) for (j = 0; j < 16; j++) sum += a[i][j];</pre> return sum; int sum\_array\_cols(double a[16][16]) int i, j; double sum = 0; for (j = 0; j < 16; j++) for (i = 0; i < 16; i++)</pre> B = 32 byte = 4 doubles sum += a[i][j]; return sum; How is the cache filled?









# **Terminology**

#### Direct mapped cache:

- Cache with E = 1
- Means every block from memory has a unique location in cache

#### Fully associative cache

- Cache with S = 1 (i.e., maximal E)
- Means every block from memory can be mapped to any location in cache
- In practice to expensive to build
- One can view the register file as a fully associative cache

#### LRU (least recently used) replacement

 when selecting which block should be replaced (happens only for E > 1), the least recently used one is chosen

## Small Example, Part 1



Array (accessed twice in example)

x = x[0], ..., x[7]

% Matlab style code
for j = 0:1
 for i = 0:7
 access(x[i])

Access pattern: 0123456701234567
Hit/Miss: MHMHMHMHMHMHMHMH

Result: 8 misses, 8 hits Spatial locality: yes Temporal locality: no

25

## Small Example, Part 2



Array (accessed twice in example) x = x[0], ..., x[7]

% Matlab style code
for j = 0:1
 for i = 0:2:7
 access(x[i])
 for i = 1:2:7
 access(x[i])

Result: 16 misses Spatial locality: no Temporal locality: no

## **Small Example, Part 3**





$$x = x[0], ..., x[7]$$

Access pattern: 0123012345674567 Hit/Miss: MHMHHHHHHHHHHH

Result: 4 misses, 12 hits (is optimal, why?)

**Spatial locality:** yes **Temporal locality:** yes

27

#### **Cache Performance Metrics**

#### Miss rate

Fraction of memory references not found in cache: misses / accesses
 = 1 - hit rate

#### Hit time

- Time to deliver a block in the cache to the processor
- Haswell: 4 clock cycles for L1 11 clock cycles for L2

#### Miss penalty

- Additional time required because of a miss
- Haswell: about 100 cycles for L3 miss

#### What about writes?

What to do on a write-hit?

- Write-through: write immediately to memory
- Write-back: defer write to memory until replacement of line

What to do on a write-miss?

- Write-allocate: load into cache, update line in cache
- No-write-allocate: writes immediately to memory

Write-back/write-allocate (Core)

Write-through/no-write-allocate



Write-hit Write-miss



Write-hit Write-miss

# **Example:**

z = x + y, x, y, z vector of doubles of length n

assume they fit jointly in cache + cold cache

memory traffic Q(n): 4n doubles = 32n bytes

operational intensity I(n)? W(n) = n flops, so

I(n) = W(n)/Q(n) = 1/32

# **Locality Optimization: Blocking**

Example: MMM

```
void mmm(double *A, double *B, double *C, int n) {
  for( int i = 0; i < n; i++ )
    for( int j = 0; j < n; j++ )
    for( int k = 0; k < n; k++ )
        C[n*i + j] = C[n*i + j] + A[n*i + k] * B[n*k + j]; }</pre>
```



31

## **Cache Miss Analysis MMM**

C = A\*B, all  $n \times n$ 

Assumptions: cache size γ << n, cache block: 8 doubles, only 1 cache

Triple loop:

Blocked (six-fold loop): block size b, 8 divides b





1. entry: n/8 + n = 9n/8 cache misses

1. block: nb/8 + nb/8 = nb/4 cache misses

same 2. block: same

Total:  $(n/b)^2 * nb/4 = n^3/(4b)$ 

Total:  $n^2 * 9n/8 = 9n^3/8$ 

#### How to choose b?

2. entry:

The above analysis assumes that the multiplication of b x b blocks can be done with only compulsory misses. This requires  $3b^2 \le \gamma$ .

b = sqrt( $\gamma$ /3) which yields about sqrt(3)/(4\*sqrt( $\gamma$ )) \* n<sup>3</sup> cache misses, a gain of  $\approx$  2.6\*sqrt( $\gamma$ ) I(n) = O(sqrt( $\gamma$ ))



### **On Previous Slide**

Refine the analysis by including the misses incurred by C

Compute the operational intensity in both cases

Try an analogous analysis for matrix-vector multiplication

#### The Killer: Two-Power Strided Working Sets

```
% t = 1,2,4,8,... a 2-power
% size W of working set: W = n/t
for (i = 0; i < n; i += t)
  access(x[i])
for (i = 0; i < n; i += t)
  access(x[i])</pre>
```

Cache: E = 2, B = 4 doubles



Working with a two-power-strided working set is like having a smaller cache

35

#### The Killer: Where Can It Occur?

Accessing two-power size 2D arrays (e.g., images) columnwise

- 2d Transforms
- Stencil computations
- Correlations

Various transform algorithms

- Fast Fourier transform
- Wavelet transforms
- Filter banks

## **Example from Before**

```
int sum_array_3d(double a[K][M][N])
                     int i, j, k, sum = 0;
                      for (i = 0; i < M; i++)
                                      for (j = 0; j < N; j++)
for (k = 0; k < K; k++)
                                                                        sum += a[k][i][j];
                      return sum;
                                                                                                                                                                                                                                                                                                                                                                                              CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
Performance [flops/cycle]
                                                                                                                                                                                                                                                                                                                                                                                              gcc: Apple LLVM version 8.0.0 (clang-800.0.42.1)
                                                                                                                                                                                                                                       Loop order k-i-j
                                                                                                                                                                                                                                                                                                                                                                                              flags: -O3 -fno-vectorize
                                                                                                                                                                                                                                                                                                        i-j-k
0.15
                                                                                                                                                                                                                                                                                                                                                                                                                                 2-power strides
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  37
                       \begin{array}{c} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0
```

### **Summary**

It is important to assess temporal and spatial locality in the code

Cache structure is determined by three parameters

- block size
- number of sets
- associativity

You should be able to roughly simulate a computation on paper

Blocking to improve locality

Two-power strides can be problematic (conflict misses)