## Advanced Systems Lab Spring 2022 Lecture: Optimization for Instruction-Level Parallelism

Instructor: Markus Püschel, Ce Zhang

TA: Joao Rivera, several more



Eidgenössische Technische Hochschule Zürich Swiss Federal Institute of Technology Zurich



## Mapping of execution units to ports

| Port 0   | Port 1   | Port 2  | Port 3  | Port 4 | Port 5   | Port 6  | Port 7  |
|----------|----------|---------|---------|--------|----------|---------|---------|
| fp fma   | fp fma   | load    | load    | store  | SIMD log | Int ALU | st addr |
| fp mul   | fp mul   | st addr | st addr |        | shuffle  |         |         |
| fp add   | fp add   |         |         |        | fp mov   |         |         |
| fp div   | SIMD log |         |         |        | Int ALU  |         |         |
| SIMD log | Int ALU  |         |         |        |          |         |         |
| Int ALU  |          |         |         |        |          |         |         |

The table entries are execution units:

- fp = floating point
- log = logic
- fp units do scalar and vector flops
- SIMD log: other, non-fp SIMD ops

| Execution unit (fp) | Latency [cycles] | Throughput [ops/cycle] | Gap [cycles/issue] |
|---------------------|------------------|------------------------|--------------------|
| fma                 | 4                | 2                      | 0.5                |
| mul                 | 4                | 2                      | 0.5                |
| add                 | 4                | 2                      | 0.5                |
| div (scalar)        | 14               | 1/4                    | 4                  |

- Every port can issue one instruction/cycle
- Gap = 1/throughput
- Intel calls gap the throughput!
- Same exec units for scalar and vector flops
- Same latency/throughput for scalar and AVX vector (four doubles) flops, except for div
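As a worked reading of these numbers (a sketch, assuming a sequence of $n$ floating-point additions and ignoring loads, loop overhead, and everything else): the gap is the reciprocal of the throughput, and latency and gap give two different lower bounds on runtime, depending on whether each addition needs the result of the previous one.

$$
\text{gap}_{\text{add}} = \frac{1}{2\ \text{ops/cycle}} = 0.5\ \text{cycles/issue},
\qquad
\text{gap}_{\text{div}} = \frac{1}{1/4\ \text{ops/cycle}} = 4\ \text{cycles/issue}
$$

$$
T_{\text{dependent}}(n) \ge 4n\ \text{cycles},
\qquad
T_{\text{independent}}(n) \ge 0.5\,n\ \text{cycles},
\qquad
\frac{4n}{0.5\,n} = 8
$$

Under these assumptions, a fully dependent chain of additions can be up to 8x slower than the same number of independent additions.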













```c
void reduce(vec_ptr v, data_t *dest)
{
    int i;
    int length = vec_length(v);
    data_t *d = get_vec_start(v);
    data_t t = IDENT;
    for (i = 0; i < length; i++)
        t = t OP d[i];
    *dest = t;
}
```

Computes `d[0] OP d[1] OP d[2] OP ... OP d[length-1]`, where:

- `data_t`: double or int
- `OP`: + or *
- `IDENT`: 0 or 1
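To make the loop above concrete, here is a minimal sketch of one possible instantiation. It assumes `data_t` is `double`, `OP` is `+`, and `IDENT` is `0`. The vector type (`vec_rec`, `vec_ptr`) and the helpers `vec_length` and `get_vec_start` are not defined in these slides, so the definitions below are hypothetical stand-ins.

```c
/* Assumed instantiation of the parameters left open above. */
typedef double data_t;
#define OP +
#define IDENT 0

/* Hypothetical minimal vector type; the original definition is not shown. */
typedef struct {
    int len;
    data_t *data;
} vec_rec, *vec_ptr;

static int vec_length(vec_ptr v) { return v->len; }
static data_t *get_vec_start(vec_ptr v) { return v->data; }

/* Same loop as above: every iteration depends on the previous value of t,
   so the operations form one sequential dependency chain. */
void reduce(vec_ptr v, data_t *dest)
{
    int i;
    int length = vec_length(v);
    data_t *d = get_vec_start(v);
    data_t t = IDENT;
    for (i = 0; i < length; i++)
        t = t OP d[i];
    *dest = t;
}
```

With a `vec_rec` wrapping the array `{1.0, 2.0, 3.0, 4.0}`, calling `reduce(&v, &r)` stores `10.0` in `r`.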






```c
{
    int length = vec_length(v);
    int limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x = IDENT;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2)
```

Perform 2x more useful work per iteration
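The fragment above ends at the loop header, so the sketch below fills in the loop body with two common textbook variants; both bodies are assumptions, not taken from the slide. The function names are hypothetical, the functions work on a plain array instead of `vec_ptr`, and `data_t`, `OP`, and `IDENT` are fixed to `double`, `+`, and `0` for illustration.

```c
typedef double data_t;
#define OP +
#define IDENT 0

/* Unrolled by 2, same evaluation order as the original loop:
   both operations still sit on a single dependency chain through x. */
data_t reduce_unroll2(const data_t *d, int length)
{
    int limit = length - 1;
    data_t x = IDENT;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2)
        x = (x OP d[i]) OP d[i + 1];
    /* Finish any remaining element */
    for (; i < length; i++)
        x = x OP d[i];
    return x;
}

/* Unrolled by 2 with two separate accumulators:
   x0 and x1 form two independent chains that can overlap in the pipeline. */
data_t reduce_unroll2_acc2(const data_t *d, int length)
{
    int limit = length - 1;
    data_t x0 = IDENT, x1 = IDENT;
    int i;
    /* Combine 2 elements at a time, one per accumulator */
    for (i = 0; i < limit; i += 2) {
        x0 = x0 OP d[i];
        x1 = x1 OP d[i + 1];
    }
    /* Finish any remaining element */
    for (; i < length; i++)
        x0 = x0 OP d[i];
    return x0 OP x1;
}
```

The second variant changes the order in which elements are combined, so for floating-point data it may round differently from the original loop.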

























## Summary (ILP)

Deep pipelines and multiple ports require ILP for good throughput

ILP may have to be made explicit in the program

Potential blockers for compilers

- Reassociation changes result (floating point)
- Too many choices, no good way of deciding
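The first blocker can be seen with three numbers. The sketch below (function names are hypothetical) evaluates the same sum with two different associations. It assumes IEEE 754 double arithmetic and a compiler that is not allowed to reassociate (for example, no `-ffast-math`).

```c
/* Left-to-right association: (a + b) + c */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* Reassociated: a + (b + c) */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}
```

For `a = 1e20`, `b = -1e20`, `c = 1.0`, `sum_left` returns `1.0`, because `a + b` is exactly `0`. `sum_right` returns `0.0`, because `1.0` is lost when it is added to `-1e20`. A compiler that must preserve floating-point results therefore cannot apply this transformation on its own.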

Unrolling

- By itself usually does nothing (branch prediction usually works well)
- But may be needed to enable additional transformations (here: reassociation)

How to program this example?

- Solution 1: program generator generates alternatives and picks best
- Solution 2: use model based on latency and throughput
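Solution 2 can be sketched as follows. The model assumed here is that a pipelined unit stays busy only if about latency × throughput independent operations are in flight. With the fp add numbers from this lecture (latency 4 cycles, throughput 2 ops/cycle) that gives 8 independent chains. The names `chains_needed`, `sum_chains`, and `K` are hypothetical, and the code is an illustration of the idea, not a tuned implementation.

```c
/* Model: chains = ceil(latency * throughput), with the throughput given
   as the fraction thr_num / thr_den in ops/cycle (integers avoid rounding). */
int chains_needed(int latency, int thr_num, int thr_den)
{
    return (latency * thr_num + thr_den - 1) / thr_den;
}

/* Number of accumulators, fixed at compile time so the inner loop has a
   constant trip count that a compiler can fully unroll.
   8 = chains_needed(4, 2, 1) for fp add under the assumed model. */
#define K 8

double sum_chains(const double *d, int length)
{
    double acc[K] = {0.0};
    int i, j;
    /* Main loop: K independent additions per iteration */
    for (i = 0; i + K <= length; i += K)
        for (j = 0; j < K; j++)
            acc[j] += d[i + j];
    /* Remaining elements that do not fill a full group of K */
    for (; i < length; i++)
        acc[0] += d[i];
    /* Combine the partial sums */
    for (j = 1; j < K; j++)
        acc[0] += acc[j];
    return acc[0];
}
```

Because the partial sums are combined in a different order than in the sequential loop, the floating-point result may differ slightly from it. Whether the accumulators actually stay in registers depends on the compiler, so Solution 1 (generate alternatives and measure) remains a useful cross-check.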
