




- Architecture/Microarchitecture: What is the difference?
- In detail: Core 2/Core i7
- Crucial microarchitectural parameters
- Peak performance
- Operational intensity
























![](_page_7_Figure_0.jpeg)

![](_page_7_Figure_1.jpeg)

![](_page_8_Figure_0.jpeg)

```
/* matrix multiplication; A, B, C are n x n matrices of doubles */
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            C[i*n+j] += A[i*n+k]*B[k*n+j];
```

## Operational intensity:

- Flops: W(n) = 2n<sup>3</sup>
- Memory/cache transfers (doubles): ≥ 3n<sup>2</sup> (just from the reads)
- Reads (bytes): Q(n) ≥ 24n<sup>2</sup>
- Operational intensity: I(n) = W(n)/Q(n) ≤ 2n<sup>3</sup>/(24n<sup>2</sup>) = n/12 (flops per byte)
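The counts above can be checked numerically. The following is a minimal sketch, assuming 8-byte doubles; the function names `mmm_flops`, `mmm_min_bytes`, and `mmm_intensity_bound` are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* W(n): one multiply and one add per (i, j, k) iteration of the triple loop. */
double mmm_flops(size_t n) {
    return 2.0 * n * n * n;
}

/* Lower bound on Q(n): each of A, B, C is read at least once.
   Assumes sizeof(double) == 8, as on common platforms. */
double mmm_min_bytes(size_t n) {
    return 3.0 * n * n * sizeof(double);
}

/* Upper bound on I(n) = W(n)/Q(n) in flops per byte; equals n/12. */
double mmm_intensity_bound(size_t n) {
    assert(n > 0);
    return mmm_flops(n) / mmm_min_bytes(n);
}
```

For example, `mmm_intensity_bound(120)` evaluates 3456000 flops over 345600 bytes, which gives the bound n/12 = 10 flops per byte.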

![](_page_8_Figure_7.jpeg)


![](_page_9_Figure_0.jpeg)

![](_page_9_Figure_1.jpeg)

![](_page_10_Figure_0.jpeg)

![](_page_10_Figure_1.jpeg)

![](_page_11_Figure_0.jpeg)

- **MMX:** Multimedia extension
- **SSE:** Streaming SIMD extension
- **AVX:** Advanced vector extensions

Intel x86 processors (time runs from top to bottom):

| Instruction set / extension | Processor         |
|-----------------------------|-------------------|
| x86-16                      | 8086              |
|                             | 286               |
| x86-32                      | 386               |
|                             | 486               |
|                             | Pentium           |
| MMX                         | Pentium MMX       |
| SSE                         | Pentium III       |
| SSE2                        | Pentium 4         |
| SSE3                        | Pentium 4E        |
| x86-64 / em64t              | Pentium 4F        |
|                             | Core 2 Duo        |
| SSE4                        | Penryn            |
|                             | Core i7 (Nehalem) |
| AVX                         | Sandy Bridge      |
| AVX2                        | Haswell           |
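To make "SIMD extension" concrete, here is a small sketch of a 2-way double-precision multiply-add written with SSE2 intrinsics from `<emmintrin.h>`. The function name `muladd` is hypothetical, the length is assumed to be even, and a scalar fallback is included for compilers that do not define `__SSE2__`.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Computes y[i] += a[i] * b[i] for i = 0..n-1; n is assumed to be even. */
void muladd(const double *a, const double *b, double *y, size_t n) {
    assert(n % 2 == 0);
#if defined(__SSE2__)
    /* Each SSE2 register holds two doubles, so one iteration handles two elements. */
    for (size_t i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);         /* unaligned load of a[i], a[i+1] */
        __m128d vb = _mm_loadu_pd(b + i);
        __m128d vy = _mm_loadu_pd(y + i);
        vy = _mm_add_pd(vy, _mm_mul_pd(va, vb));  /* one vector mult, one vector add */
        _mm_storeu_pd(y + i, vy);
    }
#else
    /* Scalar fallback with the same result. */
    for (size_t i = 0; i < n; i++)
        y[i] += a[i] * b[i];
#endif
}
```

The unaligned load and store variants are used so the sketch works for any array address; aligned variants exist but require 16-byte aligned data.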

![](_page_12_Figure_0.jpeg)

| Operation                    |      |      | Comment                                | Unit         |
|------------------------------|------|------|----------------------------------------|--------------|
| Single-precision (SP) FP MUL | 4, 1 | 4, 1 | Issue port 0; Writeback port 0         | SSE based FP |
| FP MUL (X87)                 | 5, 1 | 5, 1 | Issue port 0; Writeback port 0         | x87 FP       |
| FP Shuffle<br>DIV/SQRT       | 1, 1 | 1, 1 | FP shuffle does not handle QW shuffle. |              |

- 1 add and 1 mult / cycle: 2 flops/cycle
- Assume 3 GHz: **6 Gflop/s scalar peak performance on one core**

**Vector double precision (SSE2)**

- 1 vadd and 1 vmult / cycle (2-way): 4 flops/cycle
- Assume 3 GHz: **12 Gflop/s peak performance on one core**
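The peak-performance arithmetic above is just flops per cycle times clock frequency. A minimal sketch, assuming one add unit and one mult unit that can each issue every cycle; the function name `peak_gflops` and its parameters are hypothetical.

```c
#include <assert.h>

/* Peak Gflop/s on one core:
   (floating-point units issuing per cycle) * (vector width in doubles) * (clock in GHz). */
double peak_gflops(int units, int vector_width, double ghz) {
    assert(units > 0 && vector_width > 0 && ghz > 0.0);
    return units * vector_width * ghz;
}
```

With the numbers from the slide, `peak_gflops(2, 1, 3.0)` gives the scalar peak of 6 Gflop/s, and `peak_gflops(2, 2, 3.0)` gives the 2-way SSE2 peak of 12 Gflop/s.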

![](_page_13_Figure_0.jpeg)

![](_page_13_Figure_1.jpeg)

## Summary

- Architecture vs. microarchitecture
- To optimize code one needs to understand a suitable abstraction of the microarchitecture
- Operational intensity:
  - High = compute bound = runtime dominated by data operations
  - Low = memory bound = runtime dominated by data movement
