



# Today

- Architecture/Microarchitecture: What is the difference?
- In detail: Intel Skylake
- Derivation of runtime bounds
- Execution units, latency and throughput
- Brief: Apple M series




| Intel x86 | Processors (subset)               |
|-----------|-----------------------------------|
| x86-16    | 8086 (1978)<br>286                |
| <b>x86-32</b> | 386<br>486<br>Pentium         |
| MMX       | Pentium MMX                       |
| SSE       | Pentium III                       |
| SSE2      | Pentium 4                         |
| SSE3      | Pentium 4E                        |
| x86-64    | Pentium 4F<br>Core 2              |
| SSE4      | <i>Penryn</i><br>Core i3/5/7      |
| AVX       | Sandy Bridge                      |
| AVX2      | Haswell                           |
| AVX-512   | Skylake-X                         |
|           | Ice Lake<br>Golden Cove           |

Rows are in chronological order (time runs downward).

- <b>MMX:</b> Multimedia extension
- <b>SSE:</b> Streaming SIMD extension
- <b>AVX:</b> Advanced vector extensions

<b>Backward compatible:</b> Old binary code (≥ 8086) runs on newer processors. Does new code run on old processors? That depends on the compiler flags.



































# Execution Units and Ports (Skylake)

| Port 0   | Port 1   | Port 2  | Port 3  | Port 4 | Port 5   | Port 6  | Port 7  |
|----------|----------|---------|---------|--------|----------|---------|---------|
| fp fma   | fp fma   | load    | load    | store  | SIMD log | Int ALU | st addr |
| fp mul   | fp mul   | st addr | st addr |        | shuffle  |         |         |
| fp add   | fp add   |         |         |        | fp mov   |         |         |
| fp div   | SIMD log |         |         |        | Int ALU  |         |         |
| SIMD log | Int ALU  |         |         |        |          |         |         |
| Int ALU  |          |         |         |        |          |         |         |

The entries under each port are execution units:

- fp = floating point
- log = logic
- fp units do scalar *and* vector flops
- SIMD log: other, non-fp SIMD ops

| Execution<br>Unit (fp) | Latency<br>[cycles] | Throughput<br>[ops/cycle] | 1/Throughput<br>[cycles/issue] |
|------------------------|---------------------|---------------------------|--------------------------------|
| fma                    | 4                   | 2                         | 0.5                            |
| mul                    | 4                   | 2                         | 0.5                            |
| add                    | 4                   | 2                         | 0.5                            |
| div (scalar)           | 14                  | 1/4                       | 4                              |
| div (4-way)            | 14                  | 1/8                       | 8                              |

- Every port can issue one instruction/cycle
- Intel calls 1/throughput the throughput!
- Same exec units for scalar and vector flops
- Same latency/throughput for scalar (one double) and AVX vector (four doubles) flops, except for div

<u>Check Agner Fog's tables</u> (pp. 278ff)
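The table suggests how a runtime lower bound could be derived from an operation count. The following is a minimal sketch, not a definitive model: the function names and the dictionary layout are made up for illustration, and it assumes that fma, mul and add compete for the two fp ports (0 and 1) while div is limited by the divider alone.

```python
from fractions import Fraction

# Latency [cycles] and throughput [ops/cycle] copied from the Skylake table above
# (scalar double precision; div is the scalar variant).
SKYLAKE_FP = {
    "fma": {"latency": 4, "throughput": Fraction(2)},
    "mul": {"latency": 4, "throughput": Fraction(2)},
    "add": {"latency": 4, "throughput": Fraction(2)},
    "div": {"latency": 14, "throughput": Fraction(1, 4)},
}

# Assumption: fma, mul and add are all served by ports 0 and 1 only.
FP_PORTS = 2


def throughput_bound(op_counts, table=SKYLAKE_FP):
    """Lower bound in cycles when all operations are independent."""
    pooled = sum(op_counts.get(op, 0) for op in ("fma", "mul", "add"))
    port_bound = Fraction(pooled, FP_PORTS)
    div_bound = op_counts.get("div", 0) / table["div"]["throughput"]
    return max(port_bound, div_bound)


def latency_bound(chain, table=SKYLAKE_FP):
    """Lower bound in cycles for a chain in which each op needs the previous result."""
    return sum(table[op]["latency"] for op in chain)


if __name__ == "__main__":
    n = 1024  # hypothetical example: summing n doubles needs n - 1 adds
    print("independent adds:", float(throughput_bound({"add": n - 1})), "cycles")
    print("one sequential chain:", latency_bound(["add"] * (n - 1)), "cycles")
```

The gap between the two numbers (a factor of latency × throughput = 8 for add) is what reordering a reduction into several independent partial sums tries to close.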













# **Firestorm Microarchitecture**

### Integer ports:

- 1: alu + flags + branch + addr + msr/mrs nzcv + mrs
- 2: alu + flags + branch + addr + msr/mrs nzcv + ptrauth
- 3: alu + flags + mov-from-simd/fp?
- 4: alu + mov-from-simd/fp?
- 5: alu + mul + div
- 6: alu + mul + madd + crc + bfm/extr

### Load and store ports:

- 7: store + amx
- 8: load/store + amx
- 9: load
- 10: load

### FP/SIMD ports:

- 11: fp/simd
- 12: fp/simd
- 13: fp/simd + fcsel + to-gpr
- 14: fp/simd + fcsel + to-gpr + fcmp/e + fdiv + ...

| Instruction | Latency<br>[cycles] | Throughput<br>[ops/cycle] | 1/Throughput<br>[cycles/issue] |
|-------------|---------------------|---------------------------|--------------------------------|
| fma         | 4                   | 4                         | 0.25                           |
| add         | 3                   | 4                         | 0.25                           |
| mul         | 4                   | 4                         | 0.25                           |
| div         | 10                  | 1                         | 1                              |
| load        |                     | 3                         | 0.33                           |
| store       |                     | 2                         | 0.5                            |

Latency and throughput of FP instructions in double precision. The numbers are the same for scalar and vector instructions.

This information is based on black-box reverse engineering (micro-benchmarking): https://dougallj.github.io/applecpu/firestorm.html
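Two quantities follow directly from such a latency/throughput table: how many independent operations must be in flight to keep the units busy, and the peak flops per cycle. The sketch below is illustrative only. The function names are invented here, and the vector widths are assumptions: four doubles per AVX vector as stated for Skylake, and two doubles per vector on Firestorm (128-bit SIMD).

```python
def min_independent_ops(latency_cycles, throughput_ops_per_cycle):
    """Independent operations needed in flight to reach full throughput.

    A unit that starts `throughput` ops per cycle, each taking `latency`
    cycles, has latency * throughput ops in progress when fully busy.
    """
    return latency_cycles * throughput_ops_per_cycle


def peak_flops_per_cycle(fma_per_cycle, doubles_per_vector):
    """Peak double-precision flops per cycle; one fma counts as two flops."""
    return 2 * fma_per_cycle * doubles_per_vector


if __name__ == "__main__":
    # fma latency and throughput as listed in the tables above.
    print("Firestorm fma in flight:", min_independent_ops(4, 4))
    print("Skylake fma in flight:  ", min_independent_ops(4, 2))
    # Vector widths are assumptions, see the lead-in.
    print("Firestorm peak flops/cycle:", peak_flops_per_cycle(4, 2))
    print("Skylake peak flops/cycle:  ", peak_flops_per_cycle(2, 4))
```

Under these assumptions both cores reach the same peak per cycle, but Firestorm needs twice as many independent fma operations to get there.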


### Integer ports:

- 1: alu + br + mrs
- 2: alu + br + div + ptrauth
- 3: alu + mul + bfm + crc

### Load and store ports:

- 4: load/store + amx
- 5: load

### FP/SIMD ports:

- 6: fp/simd
- 7: fp/simd + fcsel + to-gpr + fcmp/e + fdiv + ...

| Instruction  | Latency<br>[cycles] | Throughput<br>[ops/cycle] | 1/Throughput<br>[cycles/issue] |
|--------------|---------------------|---------------------------|--------------------------------|
| fma          | 4                   | 2                         | 0.5                            |
| add          | 3                   | 2                         | 0.5                            |
| mul          | 4                   | 2                         | 0.5                            |
| div (scalar) | 10                  | 1                         | 1                              |
| div (2-way)  | 11                  | 0.5                       | 2                              |
| load         |                     | 2                         | 0.5                            |
| store        |                     | 1                         | 1                              |

Latency and throughput of FP instructions in double precision. The numbers are the same for scalar and vector instructions except for div.
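With load and store throughputs in the table, a per-iteration bound for a simple loop can be sketched for both Apple cores. This is a rough illustration under stated assumptions: the loop body `y[i] = a * x[i] + y[i]` is a hypothetical example, the name `SMALL_CORE` is a placeholder because the second core is not named above (it is assumed to be the M1 efficiency core), and each instruction class is bounded separately even though loads and stores share ports. The result is therefore a lower bound, not a prediction.

```python
from fractions import Fraction

# Throughputs [ops/cycle] copied from the two Apple tables above.
FIRESTORM = {"fma": 4, "add": 4, "mul": 4, "div": 1, "load": 3, "store": 2}
SMALL_CORE = {"fma": 2, "add": 2, "mul": 2, "div": 1, "load": 2, "store": 1}

# Hypothetical scalar loop body y[i] = a * x[i] + y[i]:
# one fma, two loads (x[i], y[i]) and one store (y[i]).
AXPY = {"fma": 1, "load": 2, "store": 1}


def cycles_per_iteration(ops_per_iteration, throughput):
    """Lower bound on cycles per iteration: the slowest instruction class decides."""
    return max(
        Fraction(count, throughput[op]) for op, count in ops_per_iteration.items()
    )


if __name__ == "__main__":
    print("Firestorm: ", cycles_per_iteration(AXPY, FIRESTORM), "cycles/iteration")
    print("Small core:", cycles_per_iteration(AXPY, SMALL_CORE), "cycles/iteration")
```

In both cases the bound comes from the memory operations rather than from the fma units, which is typical for loops with this little arithmetic per loaded value.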



## Apple M2 (5 nm, June 2022)

https://en.wikipedia.org/wiki/Apple_M2

## Apple M3 (3 nm, October 2023)

https://en.wikipedia.org/wiki/Apple_M3

## Apple M4 (3 nm, May 2024)

https://en.wikipedia.org/wiki/Apple_M4

Some information on the execution units is available in Apple's developer guide (requires an account).

See the semester project by F. Sidler on measuring instructions on the M3.

The M1 latency/throughput table for the performance cores shown earlier seems to hold for the M3 as well (and thus likely for the M2).


