# **Advanced Systems Lab**

Spring 2024

Lecture: Dense linear algebra, LAPACK/BLAS, ATLAS, fast MMM

Instructor: Markus Püschel

TA: Tommaso Pegolotti, several more

ETH

Eidgenössische Technische Hochschule Zürich Swiss Federal Institute of Technology Zurich

1

# **Overview**

Linear algebra software: the path to fast libraries, LAPACK and BLAS

Blocking (BLAS 3): key to performance

Fast MMM

- Algorithms
- ATLAS
- Model-based ATLAS

2

### **Linear Algebra Algorithms: Examples**

Solving systems of linear equations

Eigenvalue problems

Singular value decomposition

LU/Cholesky/QR/... decompositions

... and many others

Make up much of the numerical computation across disciplines (sciences, computer science, data science and machine learning, engineering)

Efficient software is extremely relevant

3

3

#### The Path to Fast Libraries

EISPACK and LINPACK (early 1970s)

- Focus on dense matrices
- Jack Dongarra, Jim Bunch, Cleve Moler, Gilbert Stewart
- LINPACK still the name of the benchmark for the <u>TOP500</u> (<u>Wiki</u>) list of most powerful supercomputers

Matlab: Invented in the late 1970s by Cleve Moler

Commercialized (MathWorks) in 1984

Motivation: Make LINPACK, EISPACK easy to use

Matlab uses linear algebra libraries but can only call it *if you operate with matrices and vectors and do not write your own loops* 

- A\*B (calls MMM routine)
- A\b (calls linear system solver)

#### The Path to Fast Libraries

#### EISPACK/LINPACK Problem:

- Implementation vector-based = low operational intensity (e.g., MMM as double loop over scalar products of vectors)
- Low performance on computers with caches (80s) and superscalar microarchitectures (late 90s)

5

5







#### **Small Detour: MMM Complexity?**

Usually computed as C = AB + C

Cost as computed before

- $n^3$  multiplications +  $n^3$  additions =  $2n^3$  floating point operations
- $= O(n^3)$  runtime

#### **Blocking**

- Increases locality
- Does not decrease cost

Can we reduce the op count?

9

9

# Strassen's Algorithm

Strassen, V. "Gaussian Elimination is Not Optimal," *Numerische Mathematik* 13, 354-356, 1969

Until then, MMM was thought to be  $\Theta(n^3)$ 

Recurrence for flops:

- $T(n) = 7T(n/2) + 9/2 n^2 = 7n^{\log_2(7)} 6n^2 = O(n^{2.808})$
- Later improved:  $9/2 \rightarrow 15/4$

Fewer ops from n = 654, but ...

- Structure more complex → runtime crossover much later
- Numerical stability inferior

Can we reduce more?



#### MMM Complexity: What is known

Coppersmith, D. and Winograd, S.: "Matrix Multiplication via Arithmetic Programming," *J. Symb. Comput.* 9, 251-280, 1990

Makes MMM O(n<sup>2.3755...</sup>)

**Current best (Mar 2024):** O(n<sup>2.371552...</sup>) Previous best: O(n<sup>2.371866...</sup>)

But unpractical (a galactic algorithm)

MMM is obviously  $\Omega(n^2)$ 

It could well be close to  $\Theta(n^2)$ 

Practically all code out there uses 2n<sup>3</sup> flops

Compare this to matrix-vector multiplication:

Known to be Θ(n²) (Winograd), i.e., boring

11

11

### The Path to Fast Libraries (continued)



ATLAS (late 1990s, inspired by PhiPAC): BLAS generator

Enumerates many implementation variants (blocking etc.) and picks the fastest (example): advent of so-called autotuning

Enables automatic performance porting

Most important: BLAS3 MMM generator





#### **Optimizing MMM**



#### References:

R. Clint Whaley, Antoine Petitet and Jack Dongarra, <u>Automated Empirical</u>
<u>Optimization of Software and the ATLAS project</u>, Parallel Computing, 27(1-2):3-35, 2001

K. Goto and R. van de Geijn, <u>Anatomy of high-performance matrix</u> <u>multiplication</u>, ACM Transactions on mathematical software (TOMS), 34(23), 2008

K. Yotov, X. Li, G. Ren, M. Garzaran, D. Padua, K. Pingali, P. Stodghill, Is Search Really Necessary to Generate High-Performance BLAS?, Proceedings of the IEEE, 93(2), pp. 358–386, 2005.

Our presentation is based on this paper

15

15

### **0: Starting Point**

Standard triple loop



Most important in practice (based on usage in LAPACK)

- Two out of N, M, K are small
- One out of N, M, K is small
- None is small (e.g., square matrices)





# 2: Blocking for Cache

a) Idea: Working set has to fit into cache Easy estimate: | working set | = 3  $N_B^2$  Model: 3  $N_B^2 \le C_1$ 

b) Closer analysis of working set:

$$N_B^2 + N_B + 1 \leq C_1$$
 all of b  $\bigcap_{\text{row of a}} \bigcap_{\text{element of c}} \bigcap_{\text{element of a}} \bigcap_{\text{element of a}} \bigcap_{\text{row of a}} \bigcap_{\text{element of a}}$ 



c) Take into account cache block size B<sub>1</sub>:

$$\left\lceil \frac{N_B^2}{B_1} \right\rceil + \left\lceil \frac{N_B}{B_1} \right\rceil + 1 \le \frac{C_1}{B_1}$$

19

19

# 2: Blocking for Cache

d) Take into account LRU replacement Build a history of accessed elements

i=0: 
$$a_{0,0} \, b_{0,0} \, a_{0,1} b_{1,0} \dots a_{0,N_B-1} \, b_{N_B-1,0} \, c_{0,0}$$
 (j=0)  $a_{0,0} \, b_{0,1} \, a_{0,1} b_{1,1} \dots a_{0,N_B-1} \, b_{N_B-1,1} \, c_{0,1}$  (j=1)  $\dots$ 

$$a_{0,0} b_{0,N_B-1} a_{0,1} b_{1,N_B-1} \dots a_{0,N_B-1} b_{N_B-1,N_B-1} c_{0,N_B-1}$$
 (j=N<sub>B</sub>-1)

Corresponding history:

$$\begin{array}{c} b_{0,0} \ b_{1,0} \dots b_{N_B-1,0} \ c_{0,0} \\ b_{0,1} \ b_{1,1} \dots b_{N_B-1,1} \ c_{0,1} \\ \dots \\ a_{0,0} \ b_{0,N_B-1} \ a_{0,1} b_{1,N_B-1} \dots \\ a_{0,N_B-1} \ b_{N_B-1,N_B-1} \ c_{0,N_B-1} \end{array}$$

#### Observations:

- All of b has to fit for next iteration (i = 1)
- When i = 1, row 1 of a will not cleanly replace row 0 of a
- When i = 1, elements of c will not cleanly replace previous elements of c

# 2: Blocking for Cache

d) Take into account LRU replacement



History (i = 0):

$$\begin{array}{l} b_{0,0}\,b_{1,0}\ldots b_{N_B-1,0}\,c_{0,0} \\ b_{0,1}\,b_{1,1}\ldots b_{N_B-1,1}\,c_{0,1} \\ \ldots \\ a_{0,0}\,b_{0,N_B-1}\,a_{0,1}b_{1,N_B-1}\ldots a_{0,N_B-1}\,b_{N_B-1,N_B-1}\,c_{0,N_B-1} \end{array}$$

#### Observations:

- All of b has to fit for next iteration (i = 1)
- When i = 1, row 1 of a will not cleanly replace row 0 of a
- When i = 1, elements of c will not cleanly replace previous elements of c

 $\left\lceil \frac{N_B^2}{B_1} \right\rceil + 3 \left\lceil \frac{N_B}{B_1} \right\rceil + 1 \le \frac{C_1}{B_1}$ 

This has to fit:

- Entire b
- 2 rows of a
- 1 row of c
- 1 element of c

21

21

# 2: Blocking for Cache

e) Take into account blocking for registers (next optimization)

$$\left\lceil \frac{N_B^2}{B_1} \right\rceil + 3 \left\lceil \frac{N_B M_U}{B_1} \right\rceil + \left\lceil \frac{M_U N_U}{B_1} \right\rceil \le \frac{C_1}{B_1}$$



Blocking mini-MMMs into micro-MMMs for registers revisits the question of loop order:



For fixed i, j: 2n operations

- n independent mults
- n dependent adds

k-i-j:

For fixed k: 2n<sup>2</sup> operations

- n² independent mults
- n² independent adds

Better ILP (but larger working set)

Result: k-i-j loop order for micro-MMMs

23

24

23

#### 3: Blocking for Registers for $i = 0:N_B:N-1$ for $j = 0:N_B:M-1$ $\begin{array}{ll} \text{or } j = 0:N_B:N-1 \\ \text{for } k = 0:N_B:K-1 \\ \text{for } i' = i:M_I:i+N_B-1 \\ \text{for } j' = j:N_I:j+N_B-1 \\ \text{for } k' = k:K_I:k+N_B-1 \\ \text{for } k'' = k':k':k'+K_U-1 \\ \text{for } i'' = i':i'+M_U-1 \\ \text{colored} i'' = j':j'+N_U-1 \\ \text{colored} i'' = colored i'' + a_i''k''*b_k''j'' \end{array}$ mini-MMM micro-MMM mini-MMM How to find the best $M_{U}$ , $N_{U}$ , $K_{U}$ ? micro-MMM ATLAS: uses search with bound $M_U + N_U + M_U N_U \le N_R$ number of are multiplied size of working set in X (all the red parts on the left) $\rightarrow M_{U}$ Model: Use largest M<sub>U</sub>, N<sub>U</sub> that satisfy this equation and $M_U \approx N_U$



# 5: Other optimizations (see paper)

Skewing: separate dependent add-mults for better ILP

Software pipelining: move load from one iteration to previous iteration to hide load latency (a form of prefetching)

Buffering to avoid TLB misses (later)

### **Remaining Details**

Register renaming and the refined model for x86

TLB-related optimizations

27

27

# **Dependencies**

Read-after-write (RAW) or true dependency

$$egin{array}{lll} m{W} & \mathbf{r}_1 = \mathbf{r}_3 + \mathbf{r}_4 & \textit{nothing can be done} \\ m{R} & \mathbf{r}_2 = 2\mathbf{r}_1 & \textit{no ILP} \end{array}$$

Write after read (WAR) or antidependency

Write after write (WAW) or output dependency

# **Resolving WAR by Renaming**

```
R r_1 = r_2 + r_3 dependency only by r_1 = r_2 + r_3 r_2 = r_4 + r_5 now ILP
```

Renaming can be done at three levels:

1. C source code (= you rename): use SSA style (next slide)

29

30

29

### Scalar Replacement + SSA

How to avoid WAR and WAW in your basic block source code

Solution: Single static assignment (SSA) code:

Each variable is assigned exactly once

```
cmore>
s266 = (t287 - t285);
s267 = (t282 + t286);
s268 = (t282 - t286);
s268 = (t284 - t288);
s270 = (t284 - t288);
s271 = (0.5*(t271 + t280));
s272 = (0.5*(t271 + t280));
s273 = (0.5*(t271 - t280));
s273 = (0.5*(t281 + t283) - (t285 + t287)));
s274 = (0.5*(s265 - s266));
t289 = ((9.0*s272) + (5.4*s273));
t290 = ((5.4*s272) + (12.6*s273));
t291 = ((1.8*s271) + (1.2*s274));
t292 = ((1.2*s271) + (2.4*s274));
a122 = (1.8*(t269 - t278));
a123 = (1.8*s267);
a124 = (1.8*s267);
a125 = ((a122 - a123) + a124);
t295 = ((a125 - a122) + (3.6*s267));
t296 = (a122 + a125 + (3.6*s269));
```

### **Resolving WAR by Renaming**

$$R$$
  $r_1 = r_2 + r_3$   
 $W$   $r_2 = r_4 + r_5$ 

dependency only by  $r_1 = r_2 + r_3$  now ILP  $r_1 = r_2 + r_3$ 

$$r_1 = r_2 + r_3$$

$$r = r_4 + r_5$$

#### Renaming can be done at three levels:

- 1. C source code (= you rename)
- 2. Compiler: Uses a different register upon register allocation,  $r = r_6$
- 3. Hardware (if supported): dynamic register renaming
  - Requires a separation of architectural and physical registers
  - Requires more physical than architectural registers

31

# **Register Renaming**



Each logical register has several associated physical registers

Hardware manages mapping architectural → physical registers

Hence: more instances of each r<sub>i</sub> can be created

Used in superscalar architectures (e.g., Intel Core) to increase ILP by dynamically resolving WAR/WAW dependencies

#### **Micro-MMM Standard Model**

this parameter I did not  $M_U^*N_U + M_U + N_U \le N_R - ceil((L_x+1)/2)$  core  $(N_R = 16)$ :  $M_U = 2$ ,  $N_U = 3$ 

reuse in a, b, c

Code sketch  $(K_{IJ} = 1)$ 

```
rc1 = c[0,0], ..., rc6 = c[1,2] // 6 registers
loop over k {
  load a // 2 registers
  load b // 3 registers
  compute // 6 independent mults, 6 independent adds, reuse a and b
}
c[0,0] = rc1, ..., c[1,2] = rc6
```

But on x86 that's not what ATLAS' search found

33

33

### **Extended Model (x86)**

Set  $M_U = 1$ ,  $N_U = N_R - 2 = 14$ 

reuse in c only

Code sketch  $(K_U = 1)$ 

#### Summary:

- no reuse in a and b
- + larger tile size available for c since for b only one register is used





### **Remaining Details**

Register renaming and the refined model for x86

TLB-related optimizations

3/

37

# **Virtual Memory System (Core Family)**

The processor works with virtual addresses

All caches work with *physical addresses* 

Both address spaces are organized in pages

Page size: 4 KB (can be changed to 2 MB and even 1 GB in OS settings)

Address translation: virtual address  $\rightarrow$  physical address



#### **Address Translation**

Uses a cache called translation lookaside buffer (TLB)

#### Skylake:

Level 1 ITLB (instructions): 128 entries

DTLB (data): 64 entries

Level 2 Shared (STLB): 1536 entries

#### Miss Penalties:

DTLB hit: no penalty

DTLB miss, STLB hit: few cycles penalty

STLB miss: can be very expensive

# **Impact on Performance**

Repeatedly accessing a working set spread over too many pages yields TLB misses and can result in a significant slowdown.

Example Skylake: STLB = 1536

A computation that repeatedly accesses a working set of 2048 doubles spread over 2048 pages will cause STLB misses.

How much space will this working set occupy in cache (assume no conflicts)? 2048 \* 64 B = 128 KB (fits into L2 cache)

41

41

### **Example MMM**



Working set at highest level

We are looking for parts in the working set that are spread out in memory:

- Block row of a: contiguous
- All of b: contiguous
- Block of c: if M > 512, then spread over  $N_B$  pages

Typically, N<sub>B</sub> is in the 10s, so no problem

### Example MMM, contd.

Interface BLAS function: dgemm(a, b, c, N, K, M, 1da, 1db, 1dc)

matrices sizes leading dimensions

Leading dimensions: Enable use on matrices inside matrices



Assume Ida, Idb, Idc > 512:

- Block row of a: spread over  $\geq N_B$  pages
- All of b: spread over ≥ K pages
- Block of c: Spread over ≥ N<sub>B</sub> pages

So copying to contiguous memory may pay off

43

43

### Example MMM, contd.

Resulting code (sketch):

```
// all of b reused: possible copy to contiguous memory for i=0:N_B:N-1
// block row of a reused: possibly copy for j=0:N_B:M-1
// block of c reused: possibly copy for k=0:N_B:K-1
.....
```

#### **Fast MMM: Principles**

Optimization for memory hierarchy

- Blocking for cache
- Blocking for registers

#### Basic block optimizations

- Loop order for ILP
- Unrolling + scalar replacement
- Scheduling & software pipelining

#### Optimizations for virtual memory

Buffering (copying spread-out data into contiguous memory)

#### Autotuning

- Search over parameters (ATLAS)
- Model to estimate parameters (Model-based ATLAS)

All high performance MMM libraries do some of these (but possibly in slightly different ways)

45

45

#### **Path to Fast Libraries**

LAPACK static higher level functions

BLAS kernel functions *generated* for each computer

The advent of SIMD vector instructions (SSE, 1999) made ATLAS obsolete

The advent of multicore systems (ca. 2005) required a redesign of LAPACK (just parallelizing BLAS is suboptimal)

Recently, BLAS interface needs to be extended to handle higher-order tensor operations (used in machine learning)

Automatic generation of blocked algorithms, alternatives to LAPACK (FLAME)

Small scale linear algebra requires quite different optimizations (see program generator <a href="SLinGen/LGen">SLinGen/LGen</a>)

#### **Lessons Learned**

Implementing even a relatively simple function with optimal performance can be highly nontrivial

Autotuning can find solutions that a human would not think of implementing

Understanding which choices lead to the fastest code can be very difficult

MMM is a great case study, touches on many performance-relevant issues

Most domains are not studied as carefully as dense linear algebra

47