# **Advanced Systems Lab**

Spring 2021

Lecture: Dense linear algebra, LAPACK/BLAS, ATLAS, fast MMM

Instructor: Markus Püschel, Ce Zhang

TA: Joao Rivera, several more

ETH

Eidgenössische Technische Hochschule Zürich Swiss Federal Institute of Technology Zurich

# **Overview**

Linear algebra software: the path to fast libraries, LAPACK and BLAS

Blocking (BLAS 3): key to performance

Fast MMM

- Algorithms
- ATLAS
- model-based ATLAS

## **Linear Algebra Algorithms: Examples**

Solving systems of linear equations

Eigenvalue problems

Singular value decomposition

LU/Cholesky/QR/... decompositions

... and many others

Make up much of the numerical computation across disciplines (sciences, computer science, engineering)

Efficient software is extremely relevant

3

#### The Path to Fast Libraries

EISPACK and LINPACK (early 1970s)

- Jack Dongarra, Jim Bunch, Cleve Moler, Gilbert Stewart
- LINPACK still the name of the benchmark for the <u>TOP500</u> (<u>Wiki</u>) list of most powerful supercomputers

Matlab: Invented in the late 1970s by Cleve Moler

Commercialized (MathWorks) in 1984

Motivation: Make LINPACK, EISPACK easy to use

Matlab uses linear algebra libraries but can only call it *if you operate with matrices and vectors and do not write your own loops* 

- A\*B (calls MMM routine)
- A\b (calls linear system solver)

#### The Path to Fast Libraries

#### EISPACK/LINPACK Problem:

- Implementation vector-based = low operational intensity (e.g., MMM as double loop over scalar products of vectors)
- Low performance on computers with deep memory hierarchy (became apparent in the 80s)

5

#### The Path to Fast Libraries

Now there is implementation effort for each processor!

#### LAPACK (late 1980s, early 1990s)

- Redesign all algorithms to be "block-based" to increase locality
- Jim Demmel, Jack Dongarra et al.

#### Two-layer architecture

BLAS static higher level functions kernel functions implemented for each computer

#### Basic Linear Algebra Subroutines (BLAS)

- BLAS 1: vector-vector operations (e.g., vector sum)
- BLAS 2: matrix-vector operations (e.g., matrix-vector product)
- BLAS 3: matrix-matrix operations (e.g., MMM)

LAPACK uses BLAS 3 as much as possible



## Reminder: Why is BLAS3 so important?

Using BLAS 3 (instead of BLAS 1 or 2) in LAPACK

- = blocking
- = high operational intensity I
- = high performance

Remember (blocking MMM):

$$I(n) =$$



O(1)



 $O(\sqrt{C})$ 



#### **Small Detour: MMM Complexity?**

Usually computed as C = AB + C

Cost as computed before

- $n^3$  multiplications +  $n^3$  additions =  $2n^3$  floating point operations
- $= O(n^3)$  runtime

#### **Blocking**

- Increases locality
- Does not decrease cost

Can we reduce the op count?

9

# Strassen's Algorithm

Strassen, V. "Gaussian Elimination is Not Optimal," *Numerische Mathematik* 13, 354-356, 1969

Until then, MMM was thought to be  $\Theta(n^3)$ 

Recurrence for flops:

- $T(n) = 7T(n/2) + 9/2 n^2 = 7n^{\log_2(7)} 6n^2 = O(n^{2.808})$
- Later improved:  $9/2 \rightarrow 15/4$

Fewer ops from n = 654, but ...

- Structure more complex → runtime crossover much later
- Numerical stability inferior

Can we reduce more?



#### MMM Complexity: What is known

Coppersmith, D. and Winograd, S.: "Matrix Multiplication via Arithmetic Programming," *J. Symb. Comput.* 9, 251-280, 1990

Makes MMM O(n<sup>2.376...</sup>)

Current best (Oct. 2020): O(n<sup>2.3728596...</sup>) Previous best: O(n<sup>2.3728639...</sup>)

But unpractical

MMM is obviously  $\Omega(n^2)$ 

It could well be close to  $\Theta(n^2)$ 

Practically all code out there uses 2n<sup>3</sup> flops

Compare this to matrix-vector multiplication:

Known to be Θ(n²) (Winograd), i.e., boring

11

## The Path to Fast Libraries (continued)



ATLAS (late 1990s, inspired by PhiPAC): BLAS generator

Enumerates many implementation variants (blocking etc.) and picks the fastest (example): advent of so-called autotuning

Enables automatic performance porting

Most important: BLAS3 MMM generator





#### **Optimizing MMM**



#### References:

R. Clint Whaley, Antoine Petitet and Jack Dongarra, <u>Automated Empirical</u>
<u>Optimization of Software and the ATLAS project</u>, Parallel Computing, 27(1-2):3-35, 2001

K. Goto and R. van de Geijn, <u>Anatomy of high-performance matrix</u> <u>multiplication</u>, ACM Transactions on mathematical software (TOMS), 34(23), 2008

K. Yotov, X. Li, G. Ren, M. Garzaran, D. Padua, K. Pingali, P. Stodghill, Is Search Really Necessary to Generate High-Performance BLAS?, Proceedings of the IEEE, 93(2), pp. 358–386, 2005.

Our presentation is based on this paper

15

## **0: Starting Point**

Standard triple loop



Most important in practice (based on usage in LAPACK)

- Two out of N, M, K are small
- One out of N, M, K is small
- None is small (e.g., square matrices)





## 2: Blocking for Cache

a) Idea: Working set has to fit into cache Easy estimate: | working set | = 3  $N_B^2$  Model: 3  $N_B^2 \le C_1$ 

b) Closer analysis of working set:



c) Take into account cache block size  $B_1$ :

$$\left\lceil \frac{N_B^2}{B_1} \right\rceil + \left\lceil \frac{N_B}{B_1} \right\rceil + 1 \le \frac{C_1}{B_1}$$

19

# 2: Blocking for Cache

d) Take into account LRU replacement Build a history of accessed elements

Corresponding history:

$$\begin{array}{l} b_{0,0}\,b_{1,0}\,\ldots\,b_{N_B-1,0}\,c_{0,0} \\ b_{0,1}\,b_{1,1}\,\ldots\,b_{N_B-1,1}\,c_{0,1} \\ \ldots \\ a_{0,0}\,b_{0,N_B-1}\,a_{0,1}b_{1,N_B-1}\,\ldots\,a_{0,N_B-1}\,b_{N_B-1,N_B-1}\,c_{0,N_B-1} \end{array}$$

#### Observations:

- All of b has to fit for next iteration (i = 1)
- When i = 1, row 1 of a will not cleanly replace row 0 of a
- When i = 1, elements of c will not cleanly replace previous elements of c

# 2: Blocking for Cache

d) Take into account LRU replacement



History (i = 0):

$$\begin{array}{l} b_{0,0} \ b_{1,0} \dots b_{N_B-1,0} \ c_{0,0} \\ b_{0,1} \ b_{1,1} \dots b_{N_B-1,1} \ c_{0,1} \\ \dots \\ a_{0,0} \ b_{0,N_B-1} \ a_{0,1} b_{1,N_B-1} \dots a_{0,N_B-1} \ b_{N_B-1,N_B-1} \ c_{0,N_B-1} \end{array}$$

#### Observations:

- All of b has to fit for next iteration (i = 1)
- When i = 1, row 1 of a will not cleanly replace row 0 of a
- When i = 1, elements of c will not cleanly replace previous elements of c

#### This has to fit:

- Entire b
- 2 rows of a
- 1 row of c
- 1 element of c

$$\left\lceil \frac{N_B^2}{B_1} \right\rceil + 3 \left\lceil \frac{N_B}{B_1} \right\rceil + 1 \le \frac{C_1}{B_1}$$

21

# 2: Blocking for Cache

e) Take into account blocking for registers (next optimization)

$$\left\lceil \frac{N_B^2}{B_1} \right\rceil + 3 \left\lceil \frac{N_B M_U}{B_1} \right\rceil + \left\lceil \frac{M_U N_U}{B_1} \right\rceil \le \frac{C_1}{B_1}$$

## 3: Blocking for Registers

Blocking mini-MMMs into micro-MMMs for registers revisits the question of loop order:

For fixed i,j: 2n operations

- · n independent mults
- n dependent adds

For fixed k:  $2n^2$  operations

- n² independent mults
- n<sup>2</sup> independent adds

Better ILP (but larger working set)

Result: k-i-j loop order for micro-MMMs

23

# 3: Blocking for Registers





#### How to find the best $M_{U}$ , $N_{U}$ , $K_{U}$ ?

ATLAS: uses search with bound

 $\underbrace{M_U + N_U + M_U N_U}_{\text{Size of working set in } \textbf{X}} \leq N_R \quad \text{number of registers}$ 

Model: Use largest  $\mathbf{M}_{\rm U},~\mathbf{N}_{\rm U}$  that satisfy this equation and  $\mathbf{M}_{\rm U}\approx\mathbf{N}_{\rm U}$ 

## 4: Basic Block Optimizations



Loads from a and b ( $M_U + N_U$  many) at 2 Requires  $M_U + N_U + M_U N_U$  scalar variables

Example of ATLAS-generated code

25

## 5: Other optimizations

Skewing: separate dependent add-mults for better ILP

Software pipelining: move load from one iteration to previous iteration to high load latency (a form of prefetching)

Buffering to avoid TLB misses (later)

## **Remaining Details**

Register renaming and the refined model for x86

TLB-related optimizations

27

# **Dependencies**

Read-after-write (RAW) or true dependency

$$W$$
  $r_1 = r_3 + r_4$  nothing can be done  $R$   $r_2 = 2r_1$  no ILP

Write after read (WAR) or antidependency

$$egin{array}{lll} R & r_1 = r_2 + r_3 & dependency only \ by \\ W & r_2 = r_4 + r_5 & name \rightarrow rename \end{array} \qquad egin{array}{lll} r_1 = r_2 + r_3 \\ r = r_4 + r_5 \end{array} \qquad now \ ILP$$

Write after write (WAW) or output dependency

## **Resolving WAR by Renaming**

```
R r_1 = r_2 + r_3 dependency only by r_1 = r_2 + r_3 now ILP r_2 = r_4 + r_5
```

Renaming can be done at three levels:

C source code (= you rename): use SSA style (next slide)

29

## Scalar Replacement + SSA

How to avoid WAR and WAW in your basic block source code

Solution: Single static assignment (SSA) code:

Each variable is assigned exactly once

# **Resolving WAR by Renaming**

$$\mathbf{R}$$
  $\mathbf{r}_1 = \mathbf{r}_2 + \mathbf{r}_3$   
 $\mathbf{W}$   $\mathbf{r}_2 = \mathbf{r}_4 + \mathbf{r}_5$ 

dependency only by  $r_1 = r_2 + r_3$  now ILP  $r_1 = r_2 + r_3$ 

$$r_1 = r_2 + r_3$$

$$r = r_4 + r_5$$

#### Renaming can be done at three levels:

C source code (= you rename)

Compiler: Uses a different register upon register allocation, r = r<sub>6</sub>

Hardware (if supported): dynamic register renaming

- Requires a separation of architectural and physical registers
- Requires more physical than architectural registers

## **Register Renaming**



Hardware manages mapping architectural → physical registers

Each logical register has several associated physical registers

Hence: more instances of each r<sub>i</sub> can be created

Used in superscalar architectures (e.g., Intel Core) to increase ILP by dynamically resolving WAR/WAW dependencies

#### **Micro-MMM Standard Model**

```
this parameter I did not explain, see paper

Core (NR = 16): MU = 2, NU = 3

ext{reuse in a, b, c}
```

Code sketch (KU = 1)

```
rc1 = c[0,0], ..., rc6 = c[1,2] // 6 registers
loop over k {
  load a // 2 registers
  load b // 3 registers
  compute // 6 independent mults, 6 independent adds, reuse a and b
}
c[0,0] = rc1, ..., c[1,2] = rc6
```

33

## **Extended Model (x86)**

```
Set MU = 1, NU = NR - 2 = 14
```

reuse in c

Code sketch (KU = 1)

#### Summary:

- no reuse in a and b
- + larger tile size available for c since for b only one register is used





## **Remaining Details**

Register renaming and the refined model for x86

TLB-related optimizations

37

# **Virtual Memory System (Core Family)**

The processor works with virtual addresses

All caches work with *physical addresses* 

Both address spaces are organized in pages

Page size: 4 KB (can be changed to 2 MB and even 1 GB in OS settings)

Address translation: virtual address  $\rightarrow$  physical address



#### **Address Translation**

Uses a cache called translation lookaside buffer (TLB)

#### Haswell/Skylake:

Level 1 ITLB (instructions): 128 entries

DTLB (data): 64 entries

Level 2 Shared (STLB): 1024/1536 entries (Haswell/Skylake)

#### Miss Penalties:

DTLB hit: no penalty

DTLB miss, STLB hit: few cycles penalty

STLB miss: can be very expensive

## **Impact on Performance**

Repeatedly accessing a working set spread over too many pages yields TLB misses and can result in a significant slowdown.

Example Haswell: STLB = 1024

A computation that repeatedly accesses a working set of 2048 doubles spread over 2048 pages will cause STLB misses.

How much space will this working set occupy in cache (assume no conflicts)? 2048 \* 64 B = 128 KB (fits into L2 cache)

41

## **Example MMM**



We are looking for parts in the working set that are spread out in memory:

- Block row of a: contiguous
- All of b: contiguous
- Block of c: if M > 512, then spread over  $N_B$  pages

Typically, N<sub>B</sub> is in the 10s, so no problem

## Example MMM, contd.

Interface BLAS function: dgemm(a, b, c, N, K, M, 1da, 1db, 1dc)

matrices sizes leading dimensions

Leading dimensions: Enable use on matrices inside matrices



Assume Ida, Idb, Idc > 512:

- Block row of a: spread over  $\geq N_B$  pages
- All of b: spread over ≥ K pages
- Block of c: Spread over ≥ N<sub>B</sub> pages

So copying to contiguous memory may pay off

43

## Example MMM, contd.

Resulting code (sketch):

#### **Fast MMM: Principles**

Optimization for memory hierarchy

- Blocking for cache
- Blocking for registers

#### Basic block optimizations

- Loop order for ILP
- Unrolling + scalar replacement
- Scheduling & software pipelining

#### Optimizations for virtual memory

Buffering (copying spread-out data into contiguous memory)

#### Autotuning

- Search over parameters (ATLAS)
- Model to estimate parameters (Model-based ATLAS)

All high performance MMM libraries do some of these (but possibly in slightly different ways)

4

#### **Path to Fast Libraries**

LAPACK static higher level functions

BLAS kernel functions *generated* for each computer

The advent of SIMD vector instructions (SSE, 1999) made ATLAS obsolete

The advent of multicore systems (ca. 2005) required a redesign of LAPACK (just parallelizing BLAS is suboptimal)

Recently, BLAS interface needs to be extended to handle higher-order tensor operations (used in machine learning)

Automatic generation of blocked algorithms, alternatives to LAPACK (FLAME)

Program generator for small linear algebra operations (SLinGen/LGen)

#### **Lessons Learned**

Implementing even a relatively simple function with optimal performance can be highly nontrivial

Autotuning can find solutions that a human would not think of implementing

Understanding which choices lead to the fastest code can be very difficult

MMM is a great case study, touches on many performance-relevant issues

Most domains are not studied as carefully as dense linear algebra