ref: https://saidganim.github.io/amdncte.html
Abstract
- Research has focused almost exclusively on Intel CPUs; as a result, relatively few vulnerabilities have been found in other CPUs.
- The goal is to find flaws in AMD CPUs by means of transient execution hijacking attacks.
- A pattern similar to Meltdown/MDS can be observed on the AMD Zen family.
- Meltdown on AMD is more limited than on Intel CPUs, but it may amplify other microarchitectural attacks.
Core
- For a load instruction issued onto the pipeline to work, a TLB hit is required.
- For a transient load to work, it is enough for only the canonical part of the address to match.
- If we try to dereference a non-canonical address while the TLB contains an entry for the corresponding canonical address, the content of the canonical address is passed transiently to the load.
- We suspect that the full address check is done when the instruction leaves the pipeline in program order.
- We verified that the source of the leakage can be the L1 cache as well as a not-yet-committed entry in the Store Queue.
- The AMD optimization manual describes that bits [11:0] are used to determine Store-To-Load Forwarding (STLF).
- However, we did not see any illegal STLF triggered by a match on the lowest 12 bits, even within the same address space.
- Moreover, we noticed that a TLB hit alone is not enough to trigger the illegal STLF.
- The second condition is that the store instruction in the Store Queue has to be an L1 data cache hit.
- We verified our main observation (the non-canonical address violation) on both speculative (Spectre-type) and non-speculative (Meltdown-type) execution paths.
- To explain this behaviour, we studied a patent which states that AMD CPUs may require a micro-TLB hit before any load instruction passes (Figure 4).
- The micro-TLB is a dedicated structure that keeps partial information from the main TLB.
- However, we were told by the AMD security team that the Ryzen family of CPUs is not equipped with a micro-TLB but uses the main TLB for this check.
- In other words, if there is no TLB hit, the load will not be passed transiently.
- Hence, we conclude that the main TLB ignores the upper address bits when it compares addresses.
- We also could not trigger this leak between two user address spaces running on the same core, obviously because of the TLB flush.
- However, it is possible to "leak" across different threads that share the same address space via the L1D cache.
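The observations above can be illustrated with a small behavioural model. This is only a sketch under stated assumptions, not a description of the real hardware: it assumes 48-bit virtual addresses, and `ToyTLB`, `transient_load`, and `make_non_canonical` are hypothetical names invented for illustration. The model's TLB compares only the low 48 bits, so a non-canonical alias of a cached canonical address still hits, while the full canonical check is deferred to retirement.

```python
VA_BITS = 48                      # assumption: 48-bit virtual addresses
PAGE_SHIFT = 12                   # 4 KiB pages
LOW_MASK = (1 << VA_BITS) - 1     # the "canonical part" of the address


def is_canonical(addr: int) -> bool:
    """True if bits 63..47 are all zeros or all ones."""
    upper = addr >> (VA_BITS - 1)                 # the 17 bits 63..47
    return upper == 0 or upper == (1 << (64 - VA_BITS + 1)) - 1


def make_non_canonical(addr: int) -> int:
    """Flip bit 63 so the upper bits no longer mirror bit 47."""
    return addr ^ (1 << 63)


class ToyTLB:
    """Hypothetical TLB whose tag compare ignores bits above VA_BITS."""

    def __init__(self):
        self.entries = set()

    def _tag(self, addr: int) -> int:
        return (addr & LOW_MASK) >> PAGE_SHIFT

    def fill(self, addr: int) -> None:
        self.entries.add(self._tag(addr))

    def hit(self, addr: int) -> bool:
        return self._tag(addr) in self.entries


def transient_load(tlb: ToyTLB, memory: dict, addr: int):
    """Return (transiently forwarded value or None, faults_at_retirement)."""
    # No TLB hit: nothing is passed transiently to the load.
    value = memory.get(addr & LOW_MASK) if tlb.hit(addr) else None
    # The full address check happens only when the load retires.
    return value, not is_canonical(addr)
```

In this model, a load from `make_non_canonical(a)` returns the data stored at `a` once the TLB holds an entry for `a`, and it is still flagged as faulting at retirement; without the TLB entry it returns nothing.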
Store to load forwarding
- Buffering stores until retirement avoids WAW and WAR dependencies but introduces a new issue. Consider the following scenario: a store executes and buffers its address and data in the store queue. A few instructions later, a load executes that reads from the same memory address to which the store just wrote. If the load reads its data from the memory system, it will read an old value that would have been overwritten by the preceding store. The data obtained by the load will be incorrect.
- To solve this problem, processors employ a technique called store-to-load forwarding using the store queue. In addition to buffering stores until retirement, the store queue serves a second purpose: forwarding data from completed but not-yet-retired ("in-flight") stores to later loads. Rather than a simple FIFO queue, the store queue is really a Content-Addressable Memory (CAM) searched using the memory address. When a load executes, it searches the store queue for in-flight stores to the same address that are logically earlier in program order. If a matching store exists, the load obtains its data value from that store instead of the memory system. If there is no matching store, the load accesses the memory system as usual; any preceding, matching stores must have already retired and committed their values. This technique allows loads to obtain correct data if their producer store has completed but not yet retired.
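The hazard and its fix can be sketched with a toy store queue. This is a simplified software model under stated assumptions, not real hardware behaviour: every store in the queue is assumed to be older than the load being executed, addresses are compared in full, and the names `Store` and `StoreQueue` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Store:
    addr: int
    data: int


class StoreQueue:
    """Toy store queue: buffers stores until retirement, searched by address."""

    def __init__(self, memory: dict):
        self.memory = memory      # committed architectural state
        self.in_flight = []       # executed but not yet retired, oldest first

    def execute_store(self, addr: int, data: int) -> None:
        self.in_flight.append(Store(addr, data))

    def retire_oldest(self) -> None:
        store = self.in_flight.pop(0)
        self.memory[store.addr] = store.data      # commit on retirement

    def load_without_forwarding(self, addr: int) -> int:
        # Reads only the memory system, so it can observe a stale value.
        return self.memory.get(addr, 0)

    def load(self, addr: int) -> int:
        # CAM-style search by address, newest in-flight store first.
        for store in reversed(self.in_flight):
            if store.addr == addr:
                return store.data                 # forwarded from the queue
        return self.memory.get(addr, 0)           # no match: use memory
```

With memory holding `1` at some address and an in-flight store of `2` to it, `load_without_forwarding` returns the stale `1`, whereas `load` returns the forwarded `2`.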
- Multiple stores to the load's memory address may be present in the store queue. To handle this case, the store queue is priority encoded to select the latest store that is logically earlier than the load in program order. The determination of which store is "latest" can be achieved by attaching some sort of timestamp to the instructions as they are fetched and decoded, or alternatively by knowing the relative position (slot) of the load with respect to the oldest and newest stores within the store queue.
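The timestamp variant of this selection can be sketched as follows. It assumes each instruction receives a monotonically increasing sequence number at fetch/decode; the `seq` field, `InFlightStore`, and `forward` are hypothetical names used only for illustration, and a real priority encoder does this in parallel hardware rather than with a list scan.

```python
from typing import NamedTuple, Optional


class InFlightStore(NamedTuple):
    seq: int     # assumed timestamp assigned in program order
    addr: int
    data: int


def forward(stores, load_seq: int, load_addr: int) -> Optional[int]:
    """Return data from the youngest matching store older than the load."""
    candidates = [
        s for s in stores
        if s.addr == load_addr and s.seq < load_seq   # same address, earlier in program order
    ]
    if not candidates:
        return None                                   # fall back to the memory system
    return max(candidates, key=lambda s: s.seq).data  # the "latest" earlier store
```

For example, with stores to the same address at sequence numbers 1, 3, and 7, a load with sequence number 5 receives the data of store 3: store 7 is younger than the load and must be ignored, and store 1 has been superseded.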