

## Overview

Linear algebra software: the path to fast libraries, LAPACK and BLAS

Blocking (BLAS 3): key to performance

### Fast MMM

- Algorithms
- ATLAS
- Model-based ATLAS





# The Path to Fast Libraries

EISPACK/LINPACK Problem:

- Implementation is vector-based, hence low operational intensity (e.g., MMM as a double loop over scalar products of vectors)
- Low performance on computers with a deep memory hierarchy (became apparent in the 1980s)
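To make the vector-based formulation concrete, here is a minimal C sketch of MMM written as a double loop over scalar products. The function names `dot` and `mmm_dot`, the row-major layout, and the square `n`-by-`n` shape are assumptions chosen for illustration; this is not code from EISPACK or LINPACK.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar product of two strided vectors of length n.
   Each call performs 2n flops while streaming 2n operands,
   so nothing is reused across calls at this level. */
static double dot(const double *x, size_t incx,
                  const double *y, size_t incy, size_t n) {
    double s = 0.0;
    for (size_t k = 0; k < n; k++)
        s += x[k * incx] * y[k * incy];
    return s;
}

/* C = A * B for n-by-n row-major matrices (illustrative layout),
   expressed as a double loop over scalar products. */
void mmm_dot(size_t n, const double *A, const double *B, double *C) {
    assert(A && B && C);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            /* row i of A (stride 1) times column j of B (stride n) */
            C[i * n + j] = dot(&A[i * n], 1, &B[j], n, n);
}
```

Written this way, every entry of C re-reads a full row of A and a full column of B. Blocking is the technique that removes this redundancy.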

































# **2: Blocking for Cache**

e) Take into account blocking for registers (next optimization):

$\left\lceil\frac{N_B^2}{B_1}\right\rceil + 3\left\lceil\frac{N_B M_U}{B_1}\right\rceil + \left\lceil\frac{M_U N_U}{B_1}\right\rceil \le \frac{C_1}{B_1}$
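As a rough illustration of how this constraint could be used, the C sketch below searches for the largest $N_B$ that satisfies the inequality exactly as stated above. It treats each bracketed term as a ceiling (a count of cache lines) and assumes that $C_1$ is the cache capacity and $B_1$ the cache line size, both measured in matrix elements. The helper names `ceil_div`, `fits_in_cache`, and `max_nb` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Integer ceiling of a / b. */
static size_t ceil_div(size_t a, size_t b) { return (a + b - 1) / b; }

/* Returns 1 if the tile sizes satisfy the constraint as stated:
   ceil(NB^2/B1) + 3*ceil(NB*MU/B1) + ceil(MU*NU/B1) <= C1/B1.
   c1 and b1 are assumed to be given in matrix elements. */
int fits_in_cache(size_t nb, size_t mu, size_t nu, size_t c1, size_t b1) {
    assert(b1 > 0);
    size_t lines = ceil_div(nb * nb, b1)
                 + 3 * ceil_div(nb * mu, b1)
                 + ceil_div(mu * nu, b1);
    /* lines is an integer, so comparing with floor(C1/B1) is exact */
    return lines <= c1 / b1;
}

/* Largest NB satisfying the constraint (0 if none does).
   The left-hand side grows monotonically with NB, so a linear scan suffices. */
size_t max_nb(size_t mu, size_t nu, size_t c1, size_t b1) {
    size_t nb = 0;
    while (fits_in_cache(nb + 1, mu, nu, c1, b1))
        nb++;
    return nb;
}
```

For example, with an assumed 32 KB cache of doubles (`c1 = 4096`), 64-byte lines (`b1 = 8`), and illustrative register tile sizes `mu = nu = 4`, `max_nb` returns 58.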









## **Remaining Details**

Register renaming and the refined model for x86

**TLB-related optimizations** 













## **Extended Model (x86)**

Set $M_U = 1$, $N_U = N_R - 2 = 14$ (reuse in c)

Code sketch (KII = 1):

<pre>
rc1 = c[0], ..., rc14 = c[13] // 14 registers
loop over k {
    load a
</pre>

Summary:

- (−) no reuse in a and b
- (+) larger tile size available for c since for b only one register is used
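The register allocation described for the extended model ($M_U = 1$, $N_U = 14$) can be sketched as a small C micro-kernel. This is one guess at the intended structure, not the slide's full code: the function name `micro_kernel_1xNU`, the parameters `K` and `ldb`, and the body of the inner loop are assumptions. In C the compiler decides the actual register assignment, so the comments only indicate the intended mapping.

```c
#include <assert.h>
#include <stddef.h>

enum { NU = 14 };  /* tile width from the model: N_U = N_R - 2 = 14 */

/* Computes c[j] += sum over k of a[k] * b[k*ldb + j] for j = 0..NU-1,
   i.e., a 1 x NU tile of C (M_U = 1). Assumed layout: b is row-major
   with leading dimension ldb. */
void micro_kernel_1xNU(size_t K, const double *a,
                       const double *b, size_t ldb, double *c) {
    assert(a && b && c);
    double rc[NU];                       /* intended: 14 registers for c */
    for (int j = 0; j < NU; j++)
        rc[j] = c[j];
    for (size_t k = 0; k < K; k++) {
        double ra = a[k];                /* intended: 1 register for a */
        for (int j = 0; j < NU; j++) {
            double rb = b[k * ldb + j];  /* intended: 1 register, reused for every b */
            rb *= ra;                    /* two-operand multiply overwrites rb */
            rc[j] += rb;                 /* accumulate into the c tile */
        }
    }
    for (int j = 0; j < NU; j++)
        c[j] = rc[j];
}
```

Each loaded element of a and b is used exactly once, while the 14 accumulators for c are reused in every iteration of the `k` loop. This matches the trade-off in the summary above.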








## **Virtual Memory System (Core Family)**

The processor works with virtual addresses

All caches work with *physical addresses* 

Both address spaces are organized in pages

Page size: 4 KB (can be changed to 2 MB and even 1 GB in OS settings)

Address translation: virtual address  $\rightarrow$  physical address
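The page structure of an address can be illustrated with a few lines of C, assuming the default 4 KB page size mentioned above. The function names are hypothetical, and `translate` takes the physical frame number as an argument as a stand-in for the page table or TLB lookup that the hardware and OS actually perform.

```c
#include <assert.h>
#include <stdint.h>

enum { PAGE_SHIFT = 12 };  /* 4 KB pages: 2^12 bytes */

/* Upper bits of the address: the page number, which is what gets translated. */
uint64_t page_number(uint64_t addr) { return addr >> PAGE_SHIFT; }

/* Lower 12 bits: the offset within the page, unchanged by translation. */
uint64_t page_offset(uint64_t addr) {
    return addr & ((UINT64_C(1) << PAGE_SHIFT) - 1);
}

/* Illustrative translation: combine a physical frame number (assumed to
   come from a page table or TLB lookup) with the unchanged page offset. */
uint64_t translate(uint64_t vaddr, uint64_t frame) {
    assert(frame < (UINT64_C(1) << (64 - PAGE_SHIFT)));
    return (frame << PAGE_SHIFT) | page_offset(vaddr);
}
```

For instance, the virtual address `0x12345` has page number `0x12` and offset `0x345`. If that page maps to frame `0x7`, the physical address is `0x7345`.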


















## **Lessons Learned**

Implementing even a relatively simple function with optimal performance can be highly nontrivial

Autotuning can find solutions that a human would not think of implementing

Understanding which choices lead to the fastest code can be very difficult

MMM is a great case study: it touches on many performance-relevant issues

Most domains are not studied as carefully as dense linear algebra