
RISC-V IOMMU spec survey

DMA remapping

What the IOMMU does

  1. Security: prevent malicious accesses.
    In the absence of an IOMMU, a device driver must program devices with Physical Addresses, which implies that DMA from a device could be used to access any memory, such as privileged memory, and cause malicious or unintended corruptions. This may be caused by hardware bugs, device-driver bugs, or by malicious software/hardware. (p. 13)

  2. Let legacy 32-bit peripherals access memory above 4 GiB, so software no longer needs bounce buffers, improving performance.
    Legacy 32-bit devices cannot access the memory above 4 GiB. (p. 13)
    The integration of the IOMMU, through its address remapping capability, offers a simple mechanism for the DMA to directly access any address in the system. (p. 13)
    Without an IOMMU, the OS must resort to copying data through buffers (also known as bounce buffers) allocated in memory below 4 GiB. (p. 13)

  3. The IOMMU can be useful as it permits allocating large regions of memory without the need to be contiguous in physical memory. (p. 13)
    Large allocations need not be physically contiguous.

Interrupt remapping

MSI remapping

To handle MSIs from a device controlled by a guest OS, the hypervisor configures an IOMMU to redirect those MSIs to a guest interrupt file in an IMSIC (see Figure 3) or to a memory-resident interrupt file. The IOMMU is responsible to use the MSI address-translation data structures supplied by the hypervisor to perform the MSI redirection. Because every interrupt file, real or virtual, occupies a naturally aligned 4-KiB page of address space, the required address translation is from a virtual (guest) page address to a physical page address. (p. 15)

The hypervisor configures the IOMMU to redirect a guest's MSIs (GPA) to a guest interrupt file in the IMSIC (HPA).

Using the IOMMU's redirection capability, a guest MSI access at a GPA is mapped directly to the HPA of the physical MSI MMIO, so the guest can read and write the physical MSI MMIO directly; this implements interrupt remapping and passes interrupts straight through to the vCPU.

The MSI address remapping (GPA -> HPA) is an IOMMU mapping that the hypervisor sets up on behalf of the guest OS.

Device-Directory-Table (DDT)

base format dc

extended format dc

Non-leaf DDT entry

A valid (V==1) non-leaf DDT entry provides PPN of the next level DDT.

Leaf DDT entry

The leaf DDT page is indexed by DDI[0] and holds the device-context (DC). (p. 24)
In base-format the DC is 32-bytes. In extended-format the DC is 64-bytes. (p. 24)

Below is the description of the base-format DC (the leaf entry indexed by DDI[0]).

Translation control (tc)

Translation control (tc) (p. 25)

The PDTV is expected to be set to 1 when DC is associated with a device that supports multiple process contexts and thus generates a valid process_id with its memory accesses. For PCIe, for example, if the request has a PASID then the PASID is used as the process_id. (p. 27)

tc.PDTV = 0: fsc holds iosatp/iovsatp, i.e. the base of the first-stage page table.
tc.PDTV = 1: fsc holds pdtp (the device supports multiple process contexts).

iohgatp

IO hypervisor guest address translation and protection (iohgatp) (p. 27)

The iohgatp field holds the PPN of the root G-stage page table and a virtual machine identified by a guest soft-context ID (GSCID). (p. 27)

The root page table as determined by iohgatp.PPN is 16 KiB and must be aligned to a 16-KiB boundary. If the root page table is not aligned to 16 KiB as required, then all entries in that G-stage root page table appear to an IOMMU as UNSPECIFIED and any address an IOMMU may compute and use for accessing an entry in the root page table is also UNSPECIFIED. (p. 27)

iohgatp.mode

The G-stage page table format and MODE encoding follow the format defined by the privileged specification. (p. 27)
MODE is either Bare or uses the same encodings as hgatp's MODE.

fsc

First-Stage context (fsc) (p. 27)

tc.PDTV = 0 and iohgatp.mode = Bare: the field holds iosatp;
tc.PDTV = 0 and iohgatp.mode != Bare: the field holds iovsatp.


Same format as satp.

tc.PDTV = 1: the field holds pdtp.

When PDTV is 1, the fsc field holds the process-directory table pointer (pdtp). (p. 28)

When the device supports multiple process contexts, selected by the process_id, the PDT is used to determine the S/VS-stage page table and associated PSCID for virtual address translation and protection. (p. 28)

The pdtp field holds the PPN of the root PDT and the MODE field that determines the number of levels of the PDT.

The PDT walk, together with the PSCID, locates the corresponding page-table base.
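Combining the tc.PDTV and iohgatp.mode rules above, the interpretation of DC.fsc can be sketched as follows (a minimal illustration; the function name and the mode strings are mine, not spec identifiers):

```python
def interpret_fsc(pdtv, iohgatp_mode):
    """Decide what the DC.fsc field holds (sketch of the rules above)."""
    if pdtv == 1:
        return "pdtp"    # process-directory table pointer (multi-process device)
    if iohgatp_mode == "Bare":
        return "iosatp"  # single-stage translation only
    return "iovsatp"     # VS-stage base; G-stage via iohgatp is also active
```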

pdtp.mode


ta

Translation attributes (ta) (p. 28)

The PSCID field of ta provides the process soft-context ID that identifies the address-space of the process. PSCID facilitates address-translation fences on a per-address-space basis.
The PSCID field in ta is used as the address-space ID if PDTV is 0 and the iosatp/iovsatp MODE field is not Bare. (p. 29)

When tc.PDTV = 0 and the iosatp/iovsatp MODE is not Bare, ta.PSCID serves as the ASID.

Process-Directory-Table (PDT)

Non-leaf PDT entry

V == 1 indicates a valid non-leaf entry.

Leaf PDT entry

First-Stage context (fsc)

First-Stage context (fsc) (p. 31)

The software assigned process soft-context ID (PSCID) is used as the address space ID (ASID) for the process identified by the S/VS-stage page table. (p. 32)


pdtp.MODE determines how many levels of PDT are used.

Translation attributes (ta)

PC is valid if the V bit is 1; if it is 0, all other bits in PC are don't care and may be freely used by software. (p. 31)

MSI page table

MSI page table pointer (msiptp) (p. 29)
DC.msiptp


An MSI page table is a flat array of MSI page table entries (MSI PTEs). (p. 32)
The MSI page table has only a single level.

Address A matching

A write to guest physical address A is recognized as an MSI to a virtual interrupt file (p. 29) when
((A >> 12) & ~msi_addr_mask) == (msi_addr_pattern & ~msi_addr_mask);
if this holds, A is treated as an MSI address.
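The match rule can be written as a small predicate (a sketch; the function name is mine):

```python
def is_msi_address(A, msi_addr_mask, msi_addr_pattern):
    """True when a write to guest physical address A is recognized as an
    MSI: the bits of the page number not covered by msi_addr_mask must
    match msi_addr_pattern."""
    return ((A >> 12) & ~msi_addr_mask) == (msi_addr_pattern & ~msi_addr_mask)
```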

Each MSI PTE may specify either the address of a real guest interrupt file that substitutes for the targeted virtual interrupt file (figure 1 below), or a memory-resident interrupt file in which to store incoming MSIs for the virtual interrupt file (figure 2 below). (p. 32)

MSI PTE write-through mode

also called msipte

If V = 1 and the custom-use bit C = 0, then bit 2 of the first doubleword is field W (Write-through). If W = 1, the MSI PTE specifies write-through mode for incoming MSIs, and if W = 0, it specifies MRIF mode. (p. 33)

When an MSI PTE has fields V = 1, C = 0, and W = 1 (write-through mode), the PTE’s complete
format is:

An MSI PTE in write-through mode allows a hypervisor to route an MSI intended for a virtual interrupt file to go instead to a guest interrupt file of a real IMSIC in the machine. (p. 33)

Memory-Mapped Registers

All registers are listed in the table below:
Table 10. IOMMU Memory-mapped register layout (p. 65)

The most important of these is ddtp.

ddtp

Device-directory-table pointer (ddtp) (p. 69)

iommu_mode

IOMMU capabilities

IOMMU capabilities (capabilities) (p. 66)

Translation process

Process to translate an IOVA

Process to translate an IOVA (p. 38)

The full translation flow is fairly complex: several fields must be consulted to decide among the DC/PC/MSI tables. Once the PTE is finally determined, iohgatp.mode decides whether the SPA/GPA translation is 2-stage or 1-stage.

Locating the leaf DDT entry (DDTE) via device_id is always required: whether the goal is to locate the DC, PC, or MSI table, the DDTE must be decoded first.

Process to translate addresses of MSIs

Process to translate addresses of MSIs (p. 43)

  1. MRIF mode is omitted here; it is fairly complex and rarely used, and will be analyzed separately when needed.
  2. In non-MRIF mode, i.e. msipte.W = 1, the IMSIC interrupt-file index extracted is extract(iova >> 12, DC.msi_addr_mask):
    ◦ x = a b c d e f g h
    ◦ y = 1 0 1 0 0 1 1 0
    ◦ r = acfg
  3. The final translated address is (msipte.PPN << 12) | iova[11:0], where the msipte is read from address (DC.msiptp.PPN << 12) + interrupt_file_num * 16 (each MSI PTE is 16 bytes).
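The bit-gather in step 2 and the address arithmetic in step 3 can be sketched as follows (function names are mine; extract() packs the masked bits LSB-first, matching the a..h example above):

```python
def extract(x, mask):
    """Gather the bits of x selected by mask into a right-justified value."""
    r, out = 0, 0
    while mask:
        if mask & 1:
            r |= (x & 1) << out
            out += 1
        x >>= 1
        mask >>= 1
    return r

def msipte_address(msiptp_ppn, interrupt_file_num):
    """Address of the MSI PTE: each entry of the flat table is 16 bytes."""
    return (msiptp_ppn << 12) + interrupt_file_num * 16

def translated_msi_address(msipte_ppn, iova):
    """Final target address after MSI translation."""
    return (msipte_ppn << 12) | (iova & 0xFFF)
```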

Process to locate the Device-context

Process to locate the Device-context (p. 40)
Uses device_id; this flow should be read together with the IOVA-translation flow above.

The device context is located using the hardware provided unique device_id. The supported device_id width may be up to 24-bit. (p. 19)

  • capabilities.MSI_FLAT == 0

    DDI[0] -> device_id[6:0]
    DDI[1] -> device_id[15:7]
    DDI[2] -> device_id[23:16]

  • capabilities.MSI_FLAT == 1

    DDI[0] -> device_id[5:0]
    DDI[1] -> device_id[14:6]
    DDI[2] -> device_id[23:15]

Figure: base-format DC walk when capabilities.MSI_FLAT == 0.
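The two DDI splits above can be sketched as (a hedged illustration; the function name is mine):

```python
def ddi_split(device_id, msi_flat):
    """Split a device_id (up to 24 bits) into (DDI[0], DDI[1], DDI[2]).
    With capabilities.MSI_FLAT == 1 the extended 64-byte DC halves the
    entries per 4-KiB leaf page, so DDI[0] is one bit narrower."""
    if msi_flat:
        return (device_id & 0x3F,           # DDI[0] = device_id[5:0]
                (device_id >> 6) & 0x1FF,   # DDI[1] = device_id[14:6]
                (device_id >> 15) & 0x1FF)  # DDI[2] = device_id[23:15]
    return (device_id & 0x7F,               # DDI[0] = device_id[6:0]
            (device_id >> 7) & 0x1FF,       # DDI[1] = device_id[15:7]
            (device_id >> 16) & 0xFF)       # DDI[2] = device_id[23:16]
```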

Process to locate the Process-context

Process to locate the Process-context (p. 42)
Uses process_id.
The hardware identities associated with a transaction are the device_id and, if applicable, the process_id. The IOMMU uses the hardware identities to retrieve the context information to perform the requested address translations. (p. 17)

process_id is up to 20 bits wide; pdtp.MODE determines how many PDT levels are used.

PDI[0] = process_id[7:0]
PDI[1] = process_id[16:8]
PDI[2] = process_id[19:17]
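Equivalently, as a sketch (the function name is mine):

```python
def pdi_split(process_id):
    """Split a 20-bit process_id into (PDI[0], PDI[1], PDI[2])."""
    return (process_id & 0xFF,          # PDI[0] = process_id[7:0]
            (process_id >> 8) & 0x1FF,  # PDI[1] = process_id[16:8]
            (process_id >> 17) & 0x7)   # PDI[2] = process_id[19:17]
```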

Queue Interface

  1. A command-queue (CQ) used by software to queue commands to the IOMMU.
  2. A fault/event queue (FQ) used by IOMMU to bring faults and events to software attention.
  3. A page-request queue (PQ) used by IOMMU to report “Page Request” messages received from
    PCIe devices. This queue is supported if the IOMMU supports the PCIe-defined Page Request
    Interface.

CQ

cmdq

The PPN of the base of this in-memory queue and the size of the queue is configured into a
memory-mapped register called command-queue base (cqb).

The tail of the command-queue resides in a software controlled read/write memory-mapped
register called command-queue tail (cqt).

The head of the command-queue resides in a read-only memory-mapped IOMMU controlled
register called command-queue head (cqh)

If cqh == cqt, the command-queue is empty.
If cqt == (cqh - 1), modulo the queue size, the command-queue is full.
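These are the usual ring-buffer conditions, with indices wrapping modulo the queue size (the size below is an illustrative power of two; in reality it is configured via cqb):

```python
CQ_ENTRIES = 256  # illustrative queue size; real size comes from cqb

def cq_empty(cqh, cqt):
    return cqh == cqt

def cq_full(cqh, cqt):
    # the tail sits one slot behind the head: one more push would make
    # the queue indistinguishable from empty
    return cqt == (cqh - 1) % CQ_ENTRIES
```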

IOMMU Page-Table cache invalidation commands

IOTINVAL.VMA ensures that previous stores made to the S/VS-stage page tables by the harts are
observed by the IOMMU before all subsequent implicit reads from IOMMU to the corresponding
S/VS-stage page tables.

IOTINVAL.GVMA ensures that previous stores made to the G-stage page tables are observed before all
subsequent implicit reads from IOMMU to the corresponding G-stage page tables. Setting PSCV to 1
with IOTINVAL.GVMA is illegal.

IOMMU Command-queue Fence commands

An IOFENCE.C command guarantees that all previous commands fetched from the CQ have been
completed and committed.

FQ

faultq

Fault/Event queue is an in-memory queue data structure used to report events and faults raised
when processing transactions.

The PPN of the base of this in-memory queue and the size of the queue is configured into a
memory-mapped register called fault-queue base (fqb).

The tail of the fault-queue resides in an IOMMU controlled read-only memory-mapped register called
fqt.

fqh is the index of the next fault record that software should process.


CAUSE: the fault cause code reported in each fault record (see the CAUSE encodings table in the spec).


PQ

priq or Page-Request-Queue

Page-request queue is an in-memory queue data structure used to report PCIe ATS “Page Request”
and “Stop Marker” messages to software.

The base PPN of this in-memory queue and the size of the queue is configured into a memory-mapped register called page-request queue base (pqb).

The tail of the queue resides in an IOMMU controlled read-only memory-mapped register called pqt.

The head of the queue resides in a software controlled read/write memory-mapped register called
pqh.

If pqh == pqt, the page-request queue is empty.
If pqt == (pqh - 1), modulo the queue size, the page-request queue is full.

PCIe-related

ATS & PRI
ATS: Address Translation Service
PRI: Page Request Interface

A PCIe device can issue page-fault requests; when the IOMMU hardware decodes such a page request it writes it into the page-request queue (PQ), and once software has set up the page tables it sends the PRI response back to the PCIe device through the command queue (CQ). The concrete ATS and PRI implementations are hardware-specific.
After software kicks off a DMA, the device first issues an ATS request, asking the IOMMU for the PA corresponding to that VA. If the IOMMU already holds a VA-to-PA mapping, the device obtains the PA and then issues the memory access with that PA directly; that access goes straight to the PA without address translation in the IOMMU.
If the IOMMU has no VA-to-PA mapping, the device, upon learning this, follows up with a PRI request to the IOMMU; after receiving the PRI response from the IOMMU, the device issues the memory access, which is then translated in the IOMMU to a PA and finally reaches physical memory.

Seen from this path, the device side is effectively doing the TLB's job.
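The ATS/PRI round trip above can be sketched as follows. Everything here (the class, the method names, the synchronous fault-in) is an illustrative assumption, not a spec or driver interface:

```python
class StubIommu:
    """Stand-in for the IOMMU plus the OS software behind it."""
    def __init__(self):
        self.mappings = {}                 # va -> pa translations present

    def ats_translate(self, va):
        """ATS translation request: return pa, or None on a miss."""
        return self.mappings.get(va)

    def page_request(self, va):
        """PRI request: in reality the IOMMU queues it to the PQ, software
        builds the page table and answers via the CQ; modeled synchronously."""
        self.mappings[va] = 0x8000_0000 + (va & 0x0FFF_F000)

def device_access(va, atc, iommu):
    """Device-side flow: ATC lookup, then ATS, then PRI on a miss."""
    pa = atc.get(va)                       # device-local ATC plays the TLB role
    if pa is None:
        pa = iommu.ats_translate(va)       # 1. ATS translation request
        if pa is None:
            iommu.page_request(va)         # 2. PRI request, wait for response
            pa = iommu.ats_translate(va)   # 3. retry; translation now exists
        atc[va] = pa                       # cache it for later accesses
    return pa                              # DMA is then issued with this pa
```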


Problems the ATS mechanism aims to solve (its advantages):

  1. It offloads translation-lookup pressure from the host (CPU) side. For high-bandwidth I/O flows in large topologies, the table walks in the CPU-side IOMMU become a performance bottleneck; ATS offloads that lookup pressure onto the individual devices, giving the whole system a "who consumes it, pays for it" model.
  2. The single most important factor in lookup performance is TLB prefetching. For traditional PCIe I/O flows, doing the IOMMU walk centrally on the CPU side makes TLB prefetching very difficult and unfriendly: I/O streams from many different workloads converge at one point, hitting the predictor hard; the streams interfere with each other, accurate prediction becomes nearly impossible, the TLB hit rate never gets very high, and I/O performance suffers. ATS instead offloads TLB prefetching to the source: each user (device) designs its own prefetch policy around its own traffic, and the devices' prediction models do not disturb one another, which greatly improves each one's accuracy. Viewed abstractly, the device then behaves more like a CPU core, prefetching its local TLB based on the workload it actually runs, improving its own prediction performance and, with it, the whole system's.