
CXL


why CXL

similar to message passing vs. shared memory in programming, but applied in hardware.

Without coherency, it is unclear whether a piece of memory is being modified, forcing an ownership transfer at some point.

where CXL is not applicable

Traditional non-coherent I/O devices mainly rely on standard Producer-Consumer ordering models and execute against Host-attached memory. For such devices, there is little interaction with the Host except for work submission and signaling on work completion boundaries. Such accelerators also tend to work on data streams or large contiguous data objects. These devices typically do not need the advanced capabilities provided by CXL, and traditional PCIe* is sufficient as an accelerator-attached medium.
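To make the Producer-Consumer ordering model concrete, here is a minimal C sketch (all names like `work_desc`, `doorbell`, and `done` are hypothetical, not from any spec): the producer writes the work descriptor first and signals afterward, so the consumer can never observe the signal before the data.

```c
/* Minimal sketch of the Producer-Consumer ordering model used by
   non-coherent I/O devices. All names are hypothetical, for illustration. */
#include <stdatomic.h>
#include <stdint.h>

struct work_desc { uint64_t src, dst, len; }; /* one large contiguous data object */

struct work_desc queue[64];     /* work queue in host-attached memory */
_Atomic uint32_t doorbell;      /* "device register": work submission signal */
_Atomic uint32_t done;          /* completion flag, written by the device */

/* Producer (host): write the payload first, signal afterward. The release
   store guarantees the descriptor is visible before the doorbell is seen. */
void submit(struct work_desc d, uint32_t slot) {
    queue[slot] = d;
    atomic_store_explicit(&doorbell, slot + 1, memory_order_release);
}

/* Consumer side: whoever sees the flag is guaranteed to see the payload. */
void wait_done(void) {
    while (atomic_load_explicit(&done, memory_order_acquire) == 0)
        ; /* spin on the completion boundary */
}

int main(void) {
    submit((struct work_desc){ .src = 0, .dst = 0x1000, .len = 4096 }, 0);
    return 0;
}
```

Note that the only host/device interaction here is at work-submission and completion boundaries, which is exactly why plain PCIe suffices for such devices.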

what CXL is good for

CXL allows orderings other than just Producer-Consumer. It also allows a large set of atomic operations.

how coherency can be done

1. snoopy bus

  1. send all requests to all processors (broadcast)
  2. each processor snoops the request and decides if it needs to respond
  3. broadcast works well with a small number of processors
  4. bus arbitration ensures only one write goes through at a time (writes are serialized)

write invalidate: when one processor writes, all other processors' cached copies of the line are invalidated

write broadcast: when one processor writes, all other processors' cached copies of the line are updated

snooping refers to a processor silently monitoring the bus and updating its own state based on the bus transactions it observes.
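A minimal sketch of write-invalidate snooping, assuming a simple MSI-style three-state cache line (the states and names are illustrative, not tied to any particular bus protocol):

```c
/* Sketch: write-invalidate snooping with MSI-style states (illustrative). */
#include <stdio.h>

enum state { INVALID, SHARED, MODIFIED };
enum bus_msg { BUS_READ, BUS_WRITE_INVALIDATE };

#define NPROC 4
enum state line[NPROC];   /* state of one cache line in each processor's cache */

/* every processor snoops every broadcast and updates its own copy's state */
static void snoop(int self, int sender, enum bus_msg msg) {
    if (self == sender) return;
    if (msg == BUS_WRITE_INVALIDATE)
        line[self] = INVALID;            /* write invalidate: drop our copy */
    else if (msg == BUS_READ && line[self] == MODIFIED)
        line[self] = SHARED;             /* supply the data, fall back to SHARED */
}

static void broadcast(int sender, enum bus_msg msg) {
    for (int p = 0; p < NPROC; p++) snoop(p, sender, msg);
}

static void write_line(int p) {
    broadcast(p, BUS_WRITE_INVALIDATE);  /* bus arbitration serializes writes */
    line[p] = MODIFIED;
}

int main(void) {
    for (int p = 0; p < NPROC; p++) { broadcast(p, BUS_READ); line[p] = SHARED; }
    write_line(2);                       /* processor 2 writes */
    for (int p = 0; p < NPROC; p++)
        printf("P%d: %d\n", p, line[p]); /* only P2 is MODIFIED, rest INVALID */
    return 0;
}
```

Running it shows why broadcast scales poorly: every write generates traffic to every processor, whether or not it actually holds the line.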

2. directory-based cache coherence

  1. no broadcast; no longer need to hold the bus

Local node: the node that the request originates from

Home node: the node that holds the memory (may just hold a mapping, considering that the SN will perform the actual memory access)

Subordinate node: the node that responds to the Home node's requests?
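A sketch of the directory idea, assuming a hypothetical per-line directory entry at the Home node: because the directory records exactly who holds a copy, invalidations go point-to-point to the actual sharers instead of being broadcast.

```c
/* Sketch: a directory entry at the Home node tracks which nodes cache a line,
   so invalidations are point-to-point, not broadcast (illustrative only). */
#include <stdint.h>
#include <stdio.h>

struct dir_entry {
    uint64_t sharers;   /* bit i set => node i holds a copy */
    int      owner;     /* node holding the modified copy; -1 if memory is current */
};

static void send_invalidate(int node) { printf("invalidate -> node %d\n", node); }

/* Local node asks the Home node for write ownership of a line */
void home_handle_write(struct dir_entry *e, int local) {
    for (int n = 0; n < 64; n++)             /* message only the actual sharers */
        if ((e->sharers >> n) & 1 && n != local)
            send_invalidate(n);
    e->sharers = 1ull << local;              /* local node is now the sole holder */
    e->owner   = local;
}

int main(void) {
    struct dir_entry e = { .sharers = 0x0B, .owner = -1 }; /* nodes 0,1,3 share */
    home_handle_write(&e, 0);                /* node 0 wants to write */
    return 0;                                /* only nodes 1 and 3 get invalidates */
}
```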

CHI

HN-F: Fully coherent Home Node

https://developer.arm.com/documentation/101481/0200/Components-and-configuration/Components/Fully-coherent-Home-Node--HN-F-

HN-I: I/O coherent Home Node -> converts CHI transactions to AMBA transactions

without DMT (Direct Memory Transfer)

RN -> HN -> SN -> HN -> RN

with DMT

RN -> HN -> SN -> RN

without DCT (Direct Cache Transfer)

The HN can notice (via its snoop filter) that a prior RN has cached the data, so it can retrieve the line from that cache instead of going to memory.

RN -> HN -> another processor -> HN -> RN

with DCT

RN -> HN -> another processor -> RN
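The four flows above can be read as one decision at the HN. This sketch (the predicates are hypothetical) just makes the hop saving of DMT and DCT explicit:

```c
/* Sketch: how the HN might pick the data source and return path for a read.
   The predicates are hypothetical; the printed flows are the four above. */
#include <stdbool.h>
#include <stdio.h>

void hn_handle_read(bool snoop_filter_hit, bool dct, bool dmt) {
    if (snoop_filter_hit)              /* a peer RN already caches the line */
        puts(dct ? "RN -> HN -> peer RN -> RN"        /* DCT: cache sends data directly */
                 : "RN -> HN -> peer RN -> HN -> RN");
    else                               /* line must come from memory at the SN */
        puts(dmt ? "RN -> HN -> SN -> RN"             /* DMT: SN sends data directly */
                 : "RN -> HN -> SN -> HN -> RN");
}

int main(void) {
    hn_handle_read(false, true, true);  /* memory read with DMT: one hop saved */
    hn_handle_read(true,  true, true);  /* cache hit with DCT: one hop saved */
    return 0;
}
```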

CXL

CXL type

type 1: device without memory = cxl.cache + cxl.io

type 2: device with memory = cxl.mem + cxl.cache + cxl.io

type 3: memory expansion = cxl.mem + cxl.io

cxl.io

non-coherent protocol for I/O access; basically PCIe

cxl.cache

protocol for transferring 64-byte cache lines

D2H (Device-to-Host), H2D (Host-to-Device)
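For orientation, a sketch of the channel structure: CXL.cache defines three channels in each direction (request, response, data), and data always moves as a full 64-byte line. The struct fields here are illustrative, not the spec's flit layout.

```c
/* Sketch of CXL.cache's channel structure: request/response/data in each
   direction, with data moving as full 64-byte cache lines (illustrative). */
#include <stdint.h>

#define CACHELINE 64

enum d2h_channel { D2H_REQ, D2H_RSP, D2H_DATA };   /* device to host */
enum h2d_channel { H2D_REQ, H2D_RSP, H2D_DATA };   /* host to device */

struct data_msg {
    uint64_t addr;                 /* cache-line-aligned address */
    uint8_t  payload[CACHELINE];   /* exactly one 64-byte cache line */
};

int main(void) { return 0; }
```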

bias-based coherency

Host-bias: the host may hold copies of the cache line, so the device must follow the coherency protocol when accessing its own memory

Device-bias: the device does not need to worry about host-cached lines and can operate on its memory very fast
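A sketch of bias-based coherency from the device's point of view, with a hypothetical helper standing in for the bias-table lookup; the two branches mirror the definitions above.

```c
/* Sketch: bias check before a device accesses its own attached memory.
   page_in_device_bias is a hypothetical stand-in for the bias table. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* hypothetical bias-table lookup: one bias bit per 4 KiB page */
static bool page_in_device_bias(uint64_t addr) { return (addr >> 12) & 1; }

static void device_access(uint64_t addr) {
    if (page_in_device_bias(addr)) {
        /* device bias: host is guaranteed not to hold the line,
           so the device hits its local memory at full bandwidth */
        puts("device bias: direct local access, no host involvement");
    } else {
        /* host bias: host may cache the line, so coherency must be
           resolved through the host (via CXL.cache) before the access */
        puts("host bias: resolve coherency with host, then access");
    }
}

int main(void) { device_access(0x1000); device_access(0x2000); return 0; }
```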

cxl.mem

• HDM-H (Host-only Coherent): Used only for Type 3 Devices

• HDM-D (Device Coherent): Used only for legacy Type 2 Devices that rely on CXL.cache to manage coherence with the Host

• HDM-DB (Device Coherent using Back-Invalidate): Can be used by Type 2 Devices or Type 3 Devices

M2S, S2M: Master-to-Subordinate, Subordinate-to-Master

HDM-DB allows a Type 2 device to operate on a Type 3 device's memory (this is called direct P2P).

HDM

Host-managed Device Memory. Device-attached memory that is mapped to system coherent address space and accessible to the Host using standard write-back semantics. Memory located on a CXL device can be mapped as either HDM or PDM.

HPA

Host Physical Address.

PDM

Private Device Memory.

DPA

Device Physical Address. DPA forms a device-scoped flat address space. An LD-FAM device presents a distinct DPA space per LD. A G-FAM device presents the same DPA space to all hosts. The CXL HDM decoders or GFD decoders map HPA into DPA space.

LD: logical device
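As a rough illustration of what an HDM decoder does when mapping HPA into DPA space, here is a simplified sketch. It assumes power-of-two interleave parameters, ignores real decoder details (target lists, commit/skip), and omits the selection of which device in the interleave set owns a given chunk; all names are illustrative.

```c
/* Simplified sketch of an HDM decoder mapping HPA -> DPA (illustrative).
   With interleaving, each device owns every ways-th granularity-sized chunk,
   so the decoder strips out the chunks owned by the other devices:
     DPA offset = (HPA offset / (granularity * ways)) * granularity
                + (HPA offset % granularity)                              */
#include <stdint.h>
#include <stdio.h>

struct hdm_decoder {
    uint64_t hpa_base;     /* start of the decoded HPA range */
    uint64_t granularity;  /* interleave granularity in bytes, e.g. 256 */
    uint64_t ways;         /* interleave ways; 1 = no interleaving */
};

uint64_t hpa_to_dpa(const struct hdm_decoder *d, uint64_t hpa) {
    uint64_t off = hpa - d->hpa_base;
    return (off / (d->granularity * d->ways)) * d->granularity
         + (off % d->granularity);
}

int main(void) {
    struct hdm_decoder d = { .hpa_base = 0x100000000, .granularity = 256, .ways = 2 };
    /* HPA offset 0x140 lies in the second 256 B chunk; within the device that
       owns that chunk, it lands at DPA 0x40 */
    printf("DPA = 0x%llx\n", (unsigned long long)hpa_to_dpa(&d, 0x100000140));
    return 0;
}
```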

FAM

Fabric-Attached Memory. HDM within a Type 2 or Type 3 device that can be made accessible to multiple hosts concurrently. Each HDM region can either be pooled (dedicated to a single host) or shared (accessible concurrently by multiple hosts).

SMMU

I/O coherent

https://developer.arm.com/documentation/109242/0100/System-architecture-considerations/I-O-coherency

A device is I/O coherent with the PE caches if its transactions snoop the PE caches for cacheable regions of memory. This improves performance by avoiding Cache Maintenance Operations (CMOs).

The device's reads do not always need to go out to external memory; they can be serviced directly from the PE caches.

The PE does not snoop the device's cache, so this coherency is one-way.