Overview

Linear algebra software: the path to fast libraries, LAPACK and BLAS

Blocking (BLAS 3): key to performance

Fast MMM
- Algorithms
- ATLAS
- Model-based ATLAS
Linear Algebra Algorithms: Examples

Solving systems of linear equations
Eigenvalue problems
Singular value decomposition
LU/Cholesky/QR/... decompositions
... and many others

Make up much of the numerical computation across disciplines (sciences, computer science, data science and machine learning, engineering)

Efficient software is extremely relevant

The Path to Fast Libraries

**EISPACK** and **LINPACK** (early 1970s)
- Focus on dense matrices
- Jack Dongarra, Jim Bunch, Cleve Moler, Gilbert Stewart
- LINPACK still the name of the benchmark for the [TOP500](https://www.top500.org/) list of most powerful supercomputers

Matlab: Invented in the late 1970s by Cleve Moler

Commercialized (MathWorks) in 1984

Motivation: Make LINPACK, EISPACK easy to use

Matlab uses linear algebra libraries but can only call it *if you operate with matrices and vectors and do not write your own loops*
- A*B (calls MMM routine)
- A\b (calls linear system solver)
The Path to Fast Libraries

EISPACK/LINPACK Problem:
- Implementation vector-based = low operational intensity
  (e.g., MMM as double loop over scalar products of vectors)
- Low performance on computers with caches (80s) and superscalar microarchitectures (late 90s)

LAPACK (late 1980s, early 1990s)
- Redesign all algorithms to be “block-based” to increase locality
- Jack Dongarra (Turing award 2021), Jim Demmel et al.

Two-layer architecture

Basic Linear Algebra Subroutines (BLAS)
- BLAS 1: vector-vector operations (e.g., vector sum)
- BLAS 2: matrix-vector operations (e.g., matrix-vector product)
- BLAS 3: matrix-matrix operations (e.g., MMM)

LAPACK uses BLAS 3 as much as possible
Fast MMM → fast BLAS3 (since all MMM related)

Now there is implementation effort for each processor!
Reminder: Why is BLAS3 so important?

Using BLAS 3 (instead of BLAS 1 or 2) in LAPACK
= blocking
= high operational intensity I
= high performance

Remember (blocking MMM):

\[ I(n) = \]

\[
\begin{array}{c}
\text{ } \\
\text{ } \\
\end{array}
\]

\[ O(1) \]

\[
\begin{array}{c}
\text{ } \\
\text{ } \\
\end{array}
\]

\[ O(\sqrt{C}) \]
Small Detour: MMM Complexity?

Usually computed as $C = AB + C$

Cost as computed before
- $n^3$ multiplications + $n^3$ additions = $2n^3$ floating point operations
- = $O(n^3)$ runtime

Blocking
- Increases locality
- Does not decrease cost

Can we reduce the op count?

Strassen’s Algorithm

Strassen, V. "Gaussian Elimination is Not Optimal," *Numerische Mathematik* 13, 354-356, 1969

*Until then, MMM was thought to be $\Theta(n^3)$*

Recurrence for flops:
- $T(n) = 7T(n/2) + 9/2 n^2 = 7n^{\log_2 7} - 6n^2 = O(n^{2.808})$
- Later improved: $9/2 \rightarrow 15/4$

Fewer ops from $n = 654$, but ...
- Structure more complex → runtime crossover much later
- Numerical stability inferior

Can we reduce more?
MMM Complexity: What is known


Makes MMM $O(n^{2.3755...})$

**Current best (Mar 2024):** $O(n^{2.371552...})$  
**Previous best:** $O(n^{2.371866...})$

But unpractical ([a galactic algorithm](#))

MMM is obviously $\Omega(n^2)$

It could well be close to $\Theta(n^2)$

Practically all code out there uses $2n^3$ flops

Compare this to matrix-vector multiplication:

- Known to be $\Theta(n^2)$ (Winograd), i.e., boring

---

The Path to Fast Libraries (continued)

ATLAS (late 1990s, inspired by PhiPAC): BLAS generator

Enumerates many implementation variants (blocking etc.) and picks the fastest (example): advent of so-called autotuning

Enables automatic performance porting

Most important: BLAS3 MMM generator
**ATLAS Architecture**

- Detect Hardware Parameters
  - L1Size
  - NR
  - MulAdd
  - L

- ATLAS Search Engine (MMSearch)
  - NB
  - MUL/ADD
  - xFetch
  - MulAdd
  - Latency

- ATLAS MM Code Generator (MMCase)
  - NB
  - MUL/ADD
  - xFetch
  - MulAdd
  - Latency

- Compile, Execute, Measure

**Hardware parameters:**
- L1Size: size of L1 data cache
- NR: number of registers
- MulAdd: fused multiply-add available?
- L*: latency of FP multiplication

**Search parameters:**
- for example blocking sizes
- span search space
- specify code
- found by orthogonal line search

source: Pingali, Yotov, et al., Cornell U.

---

**ATLAS**

- Detect Hardware Parameters
  - L1Size
  - NR
  - MulAdd
  - L

- ATLAS Search Engine
  - NB
  - MUL/ADD
  - xFetch
  - MulAdd
  - Latency

- ATLAS MM Code Generator
  - NB
  - MUL/ADD
  - xFetch
  - MulAdd
  - Latency

- Compile, Execute, Measure

- MiniMMM Source

**Model-Based ATLAS (2005)**

- Detect Hardware Parameters
  - L1Size
  - L1Size
  - NR
  - MulAdd
  - L

- Model
  - NB
  - MUL/ADD
  - xFetch
  - MulAdd
  - Latency

- ATLAS MM Code Generator
  - NB
  - MUL/ADD
  - xFetch
  - MulAdd
  - Latency

- MiniMMM Source

- Search for parameters replaced by model to compute them
- Much faster + provides understanding of what parameters are found

source: Pingali, Yotov, et al., Cornell U.
Optimizing MMM

References:


K. Goto and R. van de Geijn, Anatomy of high-performance matrix multiplication, ACM Transactions on mathematical software (TOMS), 34(23), 2008


Our presentation is based on this paper

0: Starting Point

Standard triple loop

```
// Computes c = c + ab
for i = 0:N-1
  for j = 0:M-1
    for k = 0:K-1
      c_ik = c_ik + a_ik * b_kj
```

Matlab-style code notation

```
N x K
\times
K x M
= N x M
```

Most important in practice (based on usage in LAPACK)

- Two out of \( N, M, K \) are small
- One out of \( N, M, K \) is small
- None is small (e.g., square matrices)
1: Loop Order

// Computes C = C + AB
for i = 0:N-1
    for j = 0:M-1
        for k = 0:K-1
            c_{ij} = c_{ij} + a_{ik} * b_{kj}

i,j,k loops can be permuted in any order!

i-j-k: B is reused, good if M < N (B is smaller than A)
j-i-k: A is reused, good if N < M

Other options are inferior, e.g., k-i-j:

ATLAS does versioning (code for both variants)
Poor temporal locality w.r.t. C

2: Blocking for Cache

Like multiplying matrices consisting of size N_B x N_B entries
Assume N_B | M,N,K

Results in six-fold loop
Formally obtained through loop-tiling and loop exchange

How to find the best N_B?
ATLAS: uses search over all N_B^2 ≤ min(C, 80^2) (C = measured cache size)
Model: explained next, uses C_1 = measured L1 cache size
2: Blocking for Cache

a) Idea: Working set has to fit into cache
Easy estimate: $|\text{working set}| = 3N_B^2$
Model: $3N_B^2 \leq C_1$

b) Closer analysis of working set:

$$N_B^2 + N_B + 1 \leq C_1$$

all of b row of a element of c

b) Take into account cache block size $B_1$:

$$\left\lfloor \frac{N_B^2}{B_1} \right\rfloor + \left\lfloor \frac{N_B}{B_1} \right\rfloor + 1 \leq \frac{C_1}{B_1}$$

2: Blocking for Cache

d) Take into account LRU replacement
Build a history of accessed elements

Observations:
- All of $b$ has to fit for next iteration ($i = 1$)
- When $i = 1$, row 1 of $a$ will not cleanly replace row 0 of $a$
- When $i = 1$, elements of $c$ will not cleanly replace previous elements of $c$
2: Blocking for Cache

d) Take into account LRU replacement

History (i = 0):

| b₀,₀ b₁,₀ ... bᵮ₋₁,₀ t₀,₀ |
| b₀,₁ b₁,₁ ... bᵮ₋₁,₁ t₀,₁ |
| ... |
| a₀,₀ b₀,ᵮ₋₁ t₀,ᵮ₋₁ b₁,ᵮ₋₁ ... a₀,ᵮ₋₁ bᵮ₋₁,ᵮ₋₁ t₀,ᵮ₋₁ | b₀,ᵮ₋₁ t₀,ᵮ₋₁ b₁,ᵮ₋₁ ... bᵮ₋₁,ᵮ₋₁ t₀,ᵮ₋₁ |

Observations:
• All of b has to fit for next iteration (i = 1)
• When i = 1, row 1 of a will not cleanly replace row 0 of a
• When i = 1, elements of c will not cleanly replace previous elements of c

This has to fit:
• Entire b
• 2 rows of a
• 1 row of c
• 1 element of c

\[ \left\lfloor \frac{N^2}{B_1} \right\rfloor + 3 \left\lfloor \frac{N_B}{B_1} \right\rfloor + 1 \leq \frac{C_1}{B_1} \]

2: Blocking for Cache

e) Take into account blocking for registers (next optimization)

\[ \left\lfloor \frac{N^2}{B_1} \right\rfloor + 3 \left\lfloor \frac{N_B M_L}{B_1} \right\rfloor + \left\lfloor \frac{M_L N_L}{B_1} \right\rfloor \leq \frac{C_1}{B_1} \]
3: Blocking for Registers

Blocking mini-MMMs into micro-MMMs for registers revisits the question of loop order:

i-j-k:

- For fixed i, j: 2n operations
  - n independent mults
  - n dependent adds

k-i-j:

- For fixed k: 2n^2 operations
  - n^2 independent mults
  - n^2 independent adds

Better ILP (but larger working set)

Result: k-i-j loop order for micro-MMMs

3: Blocking for Registers

How to find the best \( M_U, N_U, K_U \)?

ATLAS: uses search with bound

\[
M_U + N_U + M_U N_U \leq N_B
\]

number of registers

size of working set in (all the red parts on the left)

Model: Use largest \( M_U, N_U \) that satisfy this equation and \( M_U = N_U \)
4: Basic Block Optimizations

Unroll micro-MMMs
Scalar replacement
Loads from c (M_uN_u many) at ①
Loads from a and b (M_u + N_u many) at ②
Requires M_u + N_u + M_uN_u scalar variables
(all the red parts on the right)

Example of ATLAS-generated code

5: Other optimizations (see paper)

Skewing: separate dependent add-mults for better ILP

Software pipelining: move load from one iteration to previous iteration to hide load latency (a form of prefetching)

Buffering to avoid TLB misses (later)
Remaining Details

Register renaming and the refined model for x86

TLB-related optimizations

Dependencies

Read-after-write (RAW) or true dependency

\[
\begin{align*}
W: & \quad r_1 &= r_3 + r_4 \\
R: & \quad r_2 &= 2r_1
\end{align*}
\]

nothing can be done
no ILP

Write after read (WAR) or antidependency

\[
\begin{align*}
R: & \quad r_1 &= r_2 + r_3 \\
W: & \quad r_2 &= r_4 + r_5
\end{align*}
\]

dependency only by name → rename

\[
\begin{align*}
W: & \quad r_1 &= r_2 + r_3 \\
\cdots
W: & \quad r_1 &= r_4 + r_5
\end{align*}
\]

dependency only by name → rename

now ILP

Write after write (WAW) or output dependency

\[
\begin{align*}
W: & \quad r_1 &= r_2 + r_3 \\
W: & \quad r_1 &= r_4 + r_5
\end{align*}
\]

now ILP
Resolving WAR by Renaming

Renaming can be done at three levels:

1. C source code (= you rename): use SSA style (next slide)

Scalar Replacement + SSA

How to avoid WAR and WAW in your basic block source code

Solution: Single static assignment (SSA) code:

- Each variable is assigned exactly once

```plaintext
s266 = (t287 - t285);
s267 = (t282 + t286);
s268 = (t282 - t286);
s269 = (t284 + t288);
s278 = (t284 - t288);
s271 = (0.5*(t271 + t280));
s272 = (0.5*(t271 - t280));
s273 = (0.5*(t284 + t288) - (t285 + t287));
s274 = (0.5*(s265 - s266));
t289 = ((9.0*s272) + (5.4*s273));
t290 = ((5.4*s272) + (12.6*s273));
t291 = ((1.8*s271) + (1.2*s274));
t292 = ((1.2*s271) + (2.4*s274));
a122 = (1.8*(t269 - t278));
a123 = (1.8*s267);
a124 = (1.8*s269);
t293 = ((a122 - a123) + a124);
a125 = (1.8*(t267 - t276));
t294 = (a125 + a123 + a124);
t295 = ((a125 + a122) + (3.6*s267));
t296 = (a122 + a125 + (3.6*s269));
```

no duplicates
Resolving WAR by Renaming

Renaming can be done at three levels:

1. C source code (= you rename)
2. Compiler: Uses a different register upon register allocation, $r = r_6$
3. Hardware (if supported): dynamic register renaming
   - Requires a separation of architectural and physical registers
   - Requires more physical than architectural registers

Register Renaming

Each logical register has several associated physical registers

Hardware manages mapping architectural $\rightarrow$ physical registers

Hence: more instances of each $r_i$ can be created

Used in superscalar architectures (e.g., Intel Core) to increase ILP by dynamically resolving WAR/WAW dependencies
Micro-MMM Standard Model

\[ M_u \cdot N_u + M_u + N_u \leq N_R - \text{ceil}(L+1)/2 \]

Core (\(N_R = 16\)): \(M_u = 2, N_u = 3\)

\[ \begin{array}{c}
\bullet & & \bullet \\
\text{a} & & \text{b} & & \text{c} \\
\end{array} \]

```
reuse in a, b, c
```

Code sketch (\(K_u = 1\))

\[
\begin{aligned}
rc1 &= c[0,0], \ldots, rc6 = c[1,2] \quad \text{// 6 registers} \\
\text{loop over } k \{ \\
\text{load } a \quad \text{// 2 registers} \\
\text{load } b \quad \text{// 3 registers} \\
\text{compute} \quad \text{// 6 independent mults, 6 independent adds, reuse } a \text{ and } b \\
\} \\
c[0,0] &= rc1, \ldots, c[1,2] = rc6
\end{aligned}
\]

But on x86 that's not what ATLAS' search found

Extended Model (x86)

Set \(M_u = 1, N_u = N_R - 2 = 14\)

\[ \begin{array}{c}
\bullet & & \bullet \\
\text{a} & & \text{b} & & \text{c} \\
\end{array} \]

```
reuse in c only
```

Code sketch (\(K_u = 1\))

\[
\begin{aligned}
rc1 &= c[0], \ldots, rc14 = c[13] \quad \text{// 14 registers} \\
\text{loop over } k \{ \\
\text{load } a \quad \text{// 1 register} \\
r_b &= b[1] \quad \text{// 1 register} \\
r_c1 &= rc1 + rb \quad \text{// add (two-operand)} \\
r_b &= b[2] \quad \text{// reuse register (WAR: register renaming resolves it)} \\
r_b &= r_b \cdot a \\
r_c2 &= rc2 + rb \\
\} \\
c[0] &= rc1, \ldots, c[13] = rc14
\end{aligned}
\]

**Summary:**

- no reuse in a and b
- + larger tile size available for c since for b only one register is used
Visualization of What Seems to Happen

\[ 2 \quad \bullet \quad 3 = 2 \times 3 \]

- reuse in \(a, b, c\)

\[ 1 \quad \bullet \quad 14 = 14 \]

- reuse in \(c\) only

Experiments

**Unleashed:** Not generated = hand-written contributed mini-MMM code

**Refined model** for computing register tiles on x86

Blocking by model is for L1 cache

**Result:** Model-based is comparable to search-based (except Itanium)

Search optimized for L2 cache (without knowing), the model blocks for L1 cache which here cannot store floats

ATLAS generated (Code generation with search) is the baseline

Graph: Pingali, Yotov, Cornell U.
Remaining Details

Register renaming and the refined model for x86

TLB-related optimizations

Virtual Memory System (Core Family)

The processor works with virtual addresses

All caches work with physical addresses

Both address spaces are organized in pages

Page size: 4 KB (can be changed to 2 MB and even 1 GB in OS settings)

Address translation: virtual address $\rightarrow$ physical address
Virtual/Physical Addresses

Processor: virtual addresses
Caches: physical addresses
Page size = 4 KB

VPN: virtual page number
PPN: physical page number

L1 cache lookup can start concurrently with address translation!

How would Intel (likely) increase the L1 cache size?

Address Translation

Uses a cache called translation lookaside buffer (TLB)

Skylake:
- **Level 1**
  - ITLB (instructions): 128 entries
  - DTLB (data): 64 entries
- **Level 2**
  - Shared (STLB): 1536 entries

Miss Penalties:
- **DTLB hit: no penalty**
- **DTLB miss, STLB hit: few cycles penalty**
- **STLB miss: can be very expensive**
**Impact on Performance**

Repeatedly accessing a working set spread over too many pages yields TLB misses and can result in a significant slowdown.

**Example Skylake:** STLB = 1536

A computation that repeatedly accesses a working set of 2048 doubles spread over 2048 pages will cause STLB misses.

*How much space will this working set occupy in cache (assume no conflicts)?*

2048 * 64 B = 128 KB (fits into L2 cache)

---

**Example MMM**

We are looking for parts in the working set that are spread out in memory:

- Block row of $a$: contiguous
- All of $b$: contiguous
- Block of $c$: if $M > 512$, then spread over $N_B$ pages

Typically, $N_B$ is in the 10s, so no problem
**Example MMM, contd.**

Interface BLAS function: \texttt{dgemm}(a, b, c, N, K, M, lda, ldb, ldc)

Leading dimensions: Enable use on matrices inside matrices

- Block row of \(a\): spread over \(\geq N_a\) pages
- All of \(b\): spread over \(\geq K\) pages
- Block of \(c\): Spread over \(\geq N_c\) pages

*So copying to contiguous memory may pay off*

---

**Example MMM, contd.**

Resulting code (sketch):

```c
// all of \(b\) reused: possible copy to contiguous memory
for i = 0:N_a
    // block row of \(a\) reused: possibly copy
    for j = 0:N_b
        // block of \(c\) reused: possibly copy
        for k = 0:N_c
            ....
```
Fast MMM: Principles

Optimization for memory hierarchy
- Blocking for cache
- Blocking for registers

Basic block optimizations
- Loop order for ILP
- Unrolling + scalar replacement
- Scheduling & software pipelining

Optimizations for virtual memory
- Buffering (copying spread-out data into contiguous memory)

Autotuning
- Search over parameters (ATLAS)
- Model to estimate parameters (Model-based ATLAS)

All high performance MMM libraries do some of these (but possibly in slightly different ways)

Path to Fast Libraries

<table>
<thead>
<tr>
<th>LAPACK</th>
<th>static higher level functions</th>
</tr>
</thead>
<tbody>
<tr>
<td>BLAS</td>
<td>kernel functions generated for each computer</td>
</tr>
</tbody>
</table>

The advent of SIMD vector instructions (SSE, 1999) made ATLAS obsolete

The advent of multicore systems (ca. 2005) required a redesign of LAPACK (just parallelizing BLAS is suboptimal)

Recently, BLAS interface needs to be extended to handle higher-order tensor operations (used in machine learning)

Automatic generation of blocked algorithms, alternatives to LAPACK (FLAME)

Small scale linear algebra requires quite different optimizations (see program generator SLinGen/LGen)
Lessons Learned

Implementing even a relatively simple function with optimal performance can be highly nontrivial.

Autotuning can find solutions that a human would not think of implementing.

Understanding which choices lead to the fastest code can be very difficult.

MMM is a great case study, touches on many performance-relevant issues.

Most domains are not studied as carefully as dense linear algebra.