why CXL
The trade-off is similar to message passing vs. shared memory in programming, but applied in hardware.
Without coherency, it is unclear whether a piece of memory is currently being modified, which forces an explicit ownership transfer at some point.
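A rough software analogy as a sketch (plain C; the mailbox and function names are made up and nothing here is CXL-specific): the first function hands data over message-passing style with an explicit copy and notification, the second just writes the shared buffer and lets the coherency protocol move cache-line ownership.

```c
#include <stdatomic.h>
#include <string.h>

struct msg { char payload[64]; };

/* Message-passing style: producer copies into a mailbox and publishes it;
 * ownership of the data moves with the notification. */
struct mailbox {
    struct msg slot;
    atomic_int full;                  /* 0 = empty, 1 = handed to consumer */
};

void mp_send(struct mailbox *mb, const struct msg *m) {
    mb->slot = *m;                                         /* explicit copy */
    atomic_store_explicit(&mb->full, 1, memory_order_release);
}

/* Shared-memory style: both sides load/store the same buffer directly and
 * the coherency protocol transfers cache-line ownership as needed. */
void shm_write(struct msg *shared, const char *data) {
    strncpy(shared->payload, data, sizeof shared->payload);
}
```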
where CXL is not applicable
Traditional non-coherent I/O devices mainly rely on standard Producer-Consumer ordering models and execute against Host-attached memory. For such devices, there is little interaction with the Host except for work submission and signaling on work completion boundaries. Such accelerators also tend to work on data streams or large contiguous data objects. These devices typically do not need the advanced capabilities provided by CXL, and traditional PCIe* is sufficient as an accelerator-attached medium.
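A minimal sketch of that Producer-Consumer hand-off (C; the descriptor ring and field names are hypothetical, not a real driver API). The release store on the tail pointer is essentially the only ordering such a device needs.

```c
#include <stdatomic.h>
#include <stdint.h>

struct work_desc { uint64_t buf_addr; uint64_t len; };

struct queue {
    struct work_desc ring[256];
    _Atomic uint32_t tail;            /* producer index, read by the device */
};

/* Host (producer): write the descriptor, then publish the new tail. The
 * release store guarantees the device never observes the tail move before
 * the descriptor contents are visible. */
void submit_work(struct queue *q, struct work_desc d) {
    uint32_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    q->ring[t % 256] = d;
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
}
```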
what CXL is good for
CXL allows ordering models other than just Producer-Consumer. It also supports a rich set of atomic operations.
how coherency can be done
1. snoopy bus
- send all requests to all processors (broadcast)
- each processor snoops the request and decides whether it needs to respond
- broadcast works well with a small number of processors
- contention for the bus ensures only one write goes through at a time (writes are serialized)
write invalidate: when one processor writes, all other processors' copies of the cache line are invalidated
write broadcast (update): when one processor writes, all other processors' copies of the cache line are updated
snooping refers to each processor silently monitoring the bus and updating its own cache state based on the bus transactions (see the sketch below).
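A toy write-invalidate model, assuming a single cache line and four processors (illustrative C, not a real protocol implementation):

```c
#include <stdio.h>

#define NPROC 4

enum state { INVALID, SHARED, MODIFIED };

static enum state line[NPROC];   /* state of one cache line in each cache */

/* Every other processor snoops the write on the bus and invalidates. */
static void snoop_write(int writer) {
    for (int p = 0; p < NPROC; p++)
        if (p != writer)
            line[p] = INVALID;
}

static void write_line(int p) {
    snoop_write(p);              /* broadcast the write on the bus first  */
    line[p] = MODIFIED;          /* writer gains exclusive ownership      */
}

static void read_line(int p) {
    if (line[p] == INVALID)      /* miss: refetch, now shared             */
        line[p] = SHARED;
}

int main(void) {
    read_line(0); read_line(1);  /* both caches hold the line as SHARED   */
    write_line(0);               /* P0 writes; P1's copy is invalidated   */
    printf("P1 state after P0 write: %d (0 = INVALID)\n", line[1]);
    return 0;
}
```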
2. directory-based cache coherence
- no broadcast, no longer need to hold the bus
Local node: the node the request originates from
Home node: the node that owns the memory (it may hold only the directory/mapping, since the SN performs the actual memory access)
Subordinate node: the node that services the Home node's requests, typically a memory controller
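A sketch of the per-line directory state a Home node might keep (the exact fields are an assumption, not taken from a specific protocol); targeted snoops to recorded sharers replace the broadcast.

```c
#include <stdint.h>

#define MAX_NODES 64

struct dir_entry {
    uint64_t sharers;   /* bitmap of requesting nodes that cache the line */
    int      owner;     /* node holding a dirty copy, or -1 if none       */
};

/* Handle a write request from `requester`: invalidate every other sharer,
 * then record the requester as the exclusive owner. */
void dir_handle_write(struct dir_entry *e, int requester) {
    for (int n = 0; n < MAX_NODES; n++) {
        if (((e->sharers >> n) & 1) && n != requester) {
            /* send a targeted invalidate snoop to node n (omitted) */
        }
    }
    e->sharers = 1ULL << requester;
    e->owner   = requester;
}
```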
CHI
HN-F: Fully coherent Home Node
https://developer.arm.com/documentation/101481/0200/Components-and-configuration/Components/Fully-coherent-Home-Node–HN-F-
HN-I: I/O coherent Home Node -> converts CHI transactions to AMBA transactions
without DMT (Direct Memory Transfer)
RN -> HN -> SN -> HN -> RN
with DMT
RN -> HN -> SN -> RN
without DCT (Direct Cache Transfer)
The HN knows (via its snoop filter) that another RN already holds the data in its cache, so it can fetch the line from that cache instead of from memory.
RN -> HN -> another processor -> HN -> RN
with DCT (Direct Cache Transfer)
RN -> HN -> another processor -> RN
CXL
CXL type
type 1: device without memory = cxl.cache + cxl.io
type 2: device with memory = cxl.mem + cxl.cache + cxl.io
type 3: memory expansion = cxl.mem + cxl.io
cxl.io
non-coherent protocol for I/O access; basically PCIe
cxl.cache
protocol for transferring one 64-byte cache line at a time
two directions: D2H (Device-to-Host) and H2D (Host-to-Device)
bias-based coherency
Host bias: the host may hold the cache line, so the device must follow the coherency protocol when accessing its own memory
Device bias: the device does not need to worry about the host caching the line, so it can operate on its memory very fast
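A hedged sketch of how software might use the bias flip around an accelerator job (all function names are illustrative stubs; CXL defines the hardware flows, not a software API like this):

```c
#include <stddef.h>
#include <stdio.h>

/* Illustrative stubs: in a real system these would be driver calls. */
static void flip_to_device_bias(void *buf, size_t len) { (void)buf; (void)len; }
static void flip_to_host_bias(void *buf, size_t len)   { (void)buf; (void)len; }

void run_accelerator_job(char *hdm_buf, size_t len) {
    /* Host-bias phase: the host writes the input into device-attached
     * memory; device accesses would have to involve the host right now. */
    for (size_t i = 0; i < len; i++) hdm_buf[i] = (char)i;

    /* Device-bias phase: the host gives up its cached copies, so the device
     * can hit its local memory at full bandwidth without snooping the host. */
    flip_to_device_bias(hdm_buf, len);
    /* ...device kernel runs against hdm_buf here... */

    /* Back to host bias before the host reads the results coherently. */
    flip_to_host_bias(hdm_buf, len);
    printf("first result byte: %d\n", hdm_buf[0]);
}
```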
cxl.mem
- HDM-H (Host-only Coherent): Used only for Type 3 Devices
- HDM-D (Device Coherent): Used only for legacy Type 2 Devices that rely on CXL.cache to manage coherence with the Host
- HDM-DB (Device Coherent using Back-Invalidate): Can be used by Type 2 Devices or Type 3 Devices
M2S, S2M: Master to Subordinate, Subordinate to Master
HDM-DB allows a Type 2 device to operate on another Type 3 device's memory (this is called direct P2P).
HDM
Host-managed Device Memory. Device-attached memory that is mapped to system coherent address space and accessible to the Host using standard write-back semantics. Memory located on a CXL device can be mapped as either HDM or PDM.
HPA
Host Physical Address.
PDM
Private Device Memory.
DPA
Device Physical Address. DPA forms a device-scoped flat address space. An LD-FAM device presents a distinct DPA space per LD. A G-FAM device presents the same DPA space to all hosts. The CXL HDM decoders or GFD decoders map HPA into DPA space.
LD: logical device
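A simplified model of what an HDM decoder does with an address (interleaving across devices is omitted and the field names are assumptions, not the spec's register layout): the decoder claims a window of HPA space and maps it base-and-offset onto DPA space.

```c
#include <stdint.h>
#include <stdbool.h>

struct hdm_decoder {
    uint64_t hpa_base;   /* start of the claimed HPA window      */
    uint64_t size;       /* window size in bytes                 */
    uint64_t dpa_base;   /* where the window lands in DPA space  */
};

/* Returns false if the HPA is not claimed by this decoder. */
bool hpa_to_dpa(const struct hdm_decoder *dec, uint64_t hpa, uint64_t *dpa) {
    if (hpa < dec->hpa_base || hpa >= dec->hpa_base + dec->size)
        return false;
    *dpa = dec->dpa_base + (hpa - dec->hpa_base);
    return true;
}
```

Real decoders are programmed by firmware/OS and can interleave an HPA window across several devices; the point here is only the base-and-offset translation.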
FAM
Fabric-Attached Memory. HDM within a Type 2 or Type 3 device that can be made accessible to multiple hosts concurrently. Each HDM region can either be pooled (dedicated to a single host) or shared (accessible concurrently by multiple hosts).
SMMU: System Memory Management Unit, Arm's IOMMU for translating device-issued addresses
I/O coherent
https://developer.arm.com/documentation/109242/0100/System-architecture-considerations/I-O-coherency
A device is I/O coherent with the PE caches if its transactions snoop the PE caches for cacheable regions of memory. This improves performance by avoiding Cache Maintenance Operations (CMOs).
If the data is already in the PE caches, the device does not even need to access external memory.
Note that the relationship is one-way: the PE does not snoop the device's cache.
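For contrast, a sketch of the CMO that I/O coherency removes (clean_dcache_range() is a stand-in for an architectural clean-to-point-of-coherency loop, e.g. DC CVAC on Arm; it is not a real library call):

```c
#include <stddef.h>

/* Stub standing in for a data-cache clean over the buffer. */
static void clean_dcache_range(void *buf, size_t len) { (void)buf; (void)len; }

void start_dma_non_coherent(void *buf, size_t len) {
    /* Device reads do not snoop the PE caches, so the CPU must first clean
     * its dirty lines out to memory. */
    clean_dcache_range(buf, len);
    /* ...ring the device doorbell... */
}

void start_dma_io_coherent(void *buf, size_t len) {
    (void)buf; (void)len;
    /* Device reads snoop the PE caches, so no CMO is needed and the data
     * may even be served straight from a cache. */
    /* ...ring the device doorbell... */
}
```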