DMA Remapping
Role of the IOMMU
- Security: prevents malicious accesses.
In the absence of an IOMMU, a device driver must program devices with Physical Addresses, which implies that DMA from a device could be used to access any memory, such as privileged memory, and cause malicious or unintended corruptions. This may be caused by hardware bugs, device driver bugs, or by malicious software/hardware. (p. 13)
- Lets legacy 32-bit peripherals access memory above 4 GiB, removing the need for software bounce buffers and improving performance.
Legacy 32-bit devices cannot access the memory above 4 GiB. (p. 13)
The integration of the IOMMU, through its address remapping capability, offers a simple mechanism for the DMA to directly access any address in the system. (p. 13)
Without an IOMMU, the OS must resort to copying data through buffers (also known as bounce buffers) allocated in memory below 4 GiB. (p. 13) The IOMMU can be useful as it permits to allocate large regions of memory without the need to be contiguous in physical memory. (p. 13)
- Removes the need for physically contiguous buffers: the IOMMU can present a contiguous IOVA range over scattered physical pages.
Interrupt Remapping
MSI Remapping
To handle MSIs from a device controlled by a guest OS, the hypervisor configures an IOMMU to redirect those MSIs to a guest interrupt file in an IMSIC (see Figure 3) or to a memory-resident interrupt file. The IOMMU is responsible to use the MSI address-translation data structures supplied by the hypervisor to perform the MSI redirection. Because every interrupt file, real or virtual, occupies a naturally aligned 4-KiB page of address space, the required address translation is from a virtual (guest) page address to a physical page address. (p. 15)
The hypervisor configures the IOMMU to redirect the guest's MSIs (written to GPAs) to a guest interrupt file in the IMSIC (at an HPA).
Using this redirection capability, a guest MSI access at a GPA maps directly onto the HPA of the physical MSI MMIO region, so the guest can read and write the physical MSI MMIO directly; this implements interrupt remapping and delivers device interrupts straight to the vCPU.
The MSI address remapping (GPA -> HPA) is an IOMMU mapping the hypervisor sets up on behalf of the guest OS.
Device-Directory-Table (DDT)
Base-format DC
Extended-format DC
Non-leaf DDT entry
A valid (V==1) non-leaf DDT entry provides the PPN of the next-level DDT.
Leaf DDT entry
The leaf DDT page is indexed by DDI[0] and holds the device-context (DC). (p. 24)
In base-format the DC is 32 bytes. In extended-format the DC is 64 bytes. (p. 24)
Below is a description of the base-format DC (the leaf entry indexed by DDI[0]).
Translation control (tc) (p. 25)
The PDTV is expected to be set to 1 when DC is associated with a device that supports multiple process contexts and thus generates a valid process_id with its memory accesses. For PCIe, for example, if the request has a PASID then the PASID is used as the process_id. (p. 27)
- When tc.PDTV = 0, fsc holds iosatp/iovsatp, i.e., the base of the 1-stage (first-stage) page table.
- When tc.PDTV = 1, fsc holds pdtp (the device supports multiple process contexts).
iohgatp
IO hypervisor guest address translation and protection (iohgatp) (p. 27)
The iohgatp field holds the PPN of the root G-stage page table and a virtual machine identified by a guest soft-context ID (GSCID). (p. 27)
The root page table as determined by iohgatp.PPN is 16 KiB and must be aligned to a 16-KiB boundary. If the root page table is not aligned to 16 KiB as required, then all entries in that G-stage root page table appear to an IOMMU as UNSPECIFIED and any address an IOMMU may compute and use for accessing an entry in the root page table is also UNSPECIFIED. (p. 27)
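As a tiny illustration of the alignment rule (the function name is made up for this note): since iohgatp.PPN is a 4-KiB page number, a 16-KiB-aligned root table simply means the low two PPN bits are zero.

```c
/* Sketch: 16-KiB alignment of the G-stage root table pointed to by
 * iohgatp.PPN. The byte address is PPN << 12, so 16-KiB alignment is
 * equivalent to the low two bits of the PPN being zero. */
static int gstage_root_aligned(unsigned long long iohgatp_ppn)
{
    return (iohgatp_ppn & 0x3) == 0;
}
```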
iohgatp.mode
The G-stage page table format and MODE encoding follow the format defined by the privileged specification. (p. 27)
mode: either Bare or the same MODE encodings as hgatp.
fsc
First-Stage context (fsc) (p. 27)
- tc.PDTV = 0 and iohgatp.mode = Bare: this field holds iosatp;
- tc.PDTV = 0 and iohgatp.mode != Bare: this field holds iovsatp;
  in both cases the format is the same as satp.
- tc.PDTV = 1: this field holds pdtp.
When PDTV is 1, the fsc field holds the process-directory table pointer (pdtp). (p. 28)
When the device supports multiple process contexts, selected by the process_id, the PDT is used to determine the S/VS-stage page table and associated PSCID for virtual address translation and protection. (p. 28)
The pdtp field holds the PPN of the root PDT and the MODE field that determines the number of levels of the PDT.
The PDT walk yields the S/VS-stage page-table base together with the associated PSCID.
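A minimal C sketch of the fsc interpretation rules above; the struct layout and names are illustrative for this note, not the spec's encoding.

```c
#include <stdbool.h>

/* Illustrative, hand-decoded DC fields (not the in-memory DC layout). */
struct dc_fields {
    bool     pdtv;          /* tc.PDTV */
    unsigned iohgatp_mode;  /* iohgatp.MODE; 0 means Bare */
};

enum fsc_kind { FSC_IOSATP, FSC_IOVSATP, FSC_PDTP };

/* What does the DC's fsc field hold? */
static enum fsc_kind classify_fsc(const struct dc_fields *dc)
{
    if (dc->pdtv)
        return FSC_PDTP;             /* per-process contexts via the PDT */
    return dc->iohgatp_mode == 0     /* iohgatp.mode == Bare? */
         ? FSC_IOSATP                /* single-stage translation */
         : FSC_IOVSATP;              /* VS-stage nested under G-stage */
}
```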
pdtp.mode
ta
Translation attributes (ta) (p. 28)
The PSCID field of ta provides the process soft-context ID that identifies the address-space of the process. PSCID facilitates address-translation fences on a per-address-space basis.
The PSCID field in ta is used as the address-space ID if PDTV is 0 and the iosatp/iovsatp MODE field is not Bare. (p. 29)
When tc.PDTV = 0 and iosatp/iovsatp.mode != Bare, ta.PSCID serves as the ASID.
Process-Directory-Table (PDT)
Non-leaf PDT entry
V == 1 indicates the non-leaf entry is valid.
Leaf PDT entry
First-Stage context (fsc) (p. 31)
The software assigned process soft-context ID (PSCID) is used as the address space ID (ASID) for the process identified by the S/VS-stage page table. (p. 32)
This path applies when tc.PDTV = 1: the pdtp field holds the PPN of the root PDT, and pdtp.MODE determines how many PDT levels are walked.
Translation attributes (ta)
PC is valid if the V bit is 1; if it is 0, all other bits in PC are don't care and may be freely used by software. (p. 31)
MSI page table
MSI page table pointer (msiptp) (p. 29)
DC.msiptp
An MSI page table is a flat array of MSI page table entries (MSI PTEs). (p. 32)
The MSI page table has only a single level.
Matching an address A
A write to guest physical address A is recognized as an MSI to a virtual interrupt file (p. 29) when
((A >> 12) & ~msi_addr_mask) == (msi_addr_pattern & ~msi_addr_mask)
in which case A is treated as an MSI address.
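The check is a one-liner in C; a sketch, with the two DC fields passed in directly:

```c
#include <stdbool.h>
#include <stdint.h>

/* Does guest-physical write address A fall into the MSI address range
 * described by DC.msi_addr_mask / DC.msi_addr_pattern? */
static bool is_msi_address(uint64_t A, uint64_t msi_addr_mask,
                           uint64_t msi_addr_pattern)
{
    return ((A >> 12) & ~msi_addr_mask)
        == (msi_addr_pattern & ~msi_addr_mask);
}
```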
Each MSI PTE may specify either the address of a real guest interrupt file that substitutes for the targeted virtual interrupt file (figure 1 below), or a memory-resident interrupt file in which to store incoming MSIs for the virtual interrupt file (figure 2 below). (p. 32)
MSI PTE write-through mode
also called msipte
If V = 1 and the custom-use bit C = 0, then bit 2 of the first doubleword is field W (Write-through). If W = 1, the MSI PTE specifies write-through mode for incoming MSIs, and if W = 0, it specifies MRIF mode. (p. 33)
When an MSI PTE has fields V = 1, C = 0, and W = 1 (write-through mode), the PTE’s complete format is:
An MSI PTE in write-through mode allows a hypervisor to route an MSI intended for a virtual interrupt file to go instead to a guest interrupt file of a real IMSIC in the machine. (p. 33)
Memory-Mapped Registers
All of the memory-mapped registers are listed in the following table:
Table 10. IOMMU Memory-mapped register layout (p. 65)
Of these, ddtp is one of the more important ones.
ddtp
Device-directory-table pointer (ddtp) (p. 69)
iommu_mode
IOMMU capabilities
IOMMU capabilities (capabilities) (p. 66)
Translation Process
Process to translate an IOVA (p. 38)
The overall translation process is fairly involved: multiple fields must be consulted to decide whether the lookup goes through the DC, a PC, or the MSI table.
Once the final PTE is determined, iohgatp.mode decides whether the translation is 2-stage or 1-stage, i.e., whether a GPA -> SPA stage is applied.
Locating the leaf DDT entry (DDTE) via device_id is always required: whether the access ultimately resolves to a DC, PC, or MSI table, the DDTE must be decoded first.
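A heavily simplified C sketch of that decision flow. Every type and helper below is a hypothetical placeholder for this note; fault reporting, permission checks, MRIF handling and caching are all omitted.

```c
#include <stdbool.h>
#include <stdint.h>

struct dc { bool pdtv; bool msi_on;        /* tc.PDTV, msiptp.MODE != Off */
            uint64_t fsc, iohgatp,
                     msi_addr_mask, msi_addr_pattern; };
struct pc { uint64_t fsc; };

/* Placeholder walks; the real flows are in the sections that follow. */
extern struct dc locate_dc(uint32_t device_id);           /* DDT walk */
extern struct pc locate_pc(const struct dc *, uint32_t);  /* PDT walk */
extern uint64_t  translate_msi(const struct dc *, uint64_t iova);
extern uint64_t  walk(uint64_t fsc, uint64_t iohgatp, uint64_t iova);

uint64_t translate_iova(uint32_t device_id, uint32_t process_id, uint64_t iova)
{
    struct dc dc = locate_dc(device_id);        /* always needed first */

    if (dc.msi_on &&
        (((iova >> 12) & ~dc.msi_addr_mask) ==
         (dc.msi_addr_pattern & ~dc.msi_addr_mask)))
        return translate_msi(&dc, iova);        /* MSI page table */

    if (dc.pdtv) {                              /* fsc holds pdtp */
        struct pc pc = locate_pc(&dc, process_id);
        return walk(pc.fsc, dc.iohgatp, iova);  /* S/VS-stage (+ G-stage) */
    }
    return walk(dc.fsc, dc.iohgatp, iova);      /* fsc holds iosatp/iovsatp */
}
```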
Process to translate addresses of MSIs (p. 43)
- MRIF mode is omitted here: it is fairly complex and rarely used, so it can be analyzed separately when actually needed.
- In non-MRIF mode, i.e. msipte.W = 1, the index of the target IMSIC interrupt file is extract(iova >> 12, DC.msi_addr_mask), where extract(x, y) gathers the bits of x at the positions where y is 1, for example:
  ◦ x = a b c d e f g h
  ◦ y = 1 0 1 0 0 1 1 0
  ◦ r = a c f g
- The final translated address is (msipte.PPN << 12) | iova[11:0], where the MSI PTE is fetched from (DC.msiptp.PPN << 12) | (interrupt_file_no * 16).
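A C sketch of extract() (the same bit-gather operation as x86 BMI2 PEXT), plus the two address computations above:

```c
#include <stdint.h>

/* Gather the bits of x at positions where mask has a 1, packing them
 * toward bit 0 while preserving their order. */
static uint64_t extract(uint64_t x, uint64_t mask)
{
    uint64_t r = 0;
    unsigned out = 0;
    for (unsigned bit = 0; bit < 64; bit++) {
        if (mask & (1ULL << bit)) {
            r |= ((x >> bit) & 1ULL) << out;
            out++;
        }
    }
    return r;
}

/* Address of the MSI PTE to fetch: each PTE is 16 bytes. */
static uint64_t msipte_addr(uint64_t msiptp_ppn, uint64_t msi_addr_mask,
                            uint64_t iova)
{
    uint64_t interrupt_file_no = extract(iova >> 12, msi_addr_mask);
    return (msiptp_ppn << 12) | (interrupt_file_no * 16);
}

/* Final translated address in write-through mode. */
static uint64_t msi_target(uint64_t msipte_ppn, uint64_t iova)
{
    return (msipte_ppn << 12) | (iova & 0xFFF);
}
```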
Process to locate the Device-context (p. 40)
Uses device_id; this flow should be read together with the IOVA-translation process above.
The device context is located using the hardware provided unique device_id. The supported device_id width may be up to 24-bit. (p. 19)
capabilities.MSI_FLAT == 0:
- DDI[0] -> device_id[6:0]
- DDI[1] -> device_id[15:7]
- DDI[2] -> device_id[23:16]

capabilities.MSI_FLAT == 1:
- DDI[0] -> device_id[5:0]
- DDI[1] -> device_id[14:6]
- DDI[2] -> device_id[23:15]
For example, the figure below shows the base-format DC walk when capabilities.MSI_FLAT == 0.
Process to locate the Process-context (p. 42)
Uses process_id.
The hardware identities associated with a transaction: the device_id and, if applicable, the process_id. The IOMMU uses the hardware identities to retrieve the context information to perform the requested address translations. (p. 17)
process_id is 20 bits wide; pdtp.MODE determines how many PDT levels are used.
- PDI[0] = process_id[7:0]
- PDI[1] = process_id[16:8]
- PDI[2] = process_id[19:17]
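Both index splits are plain bit-field extractions; a C sketch (mask values follow the bit ranges listed above):

```c
#include <stdint.h>

/* Split device_id into DDT indexes for the two layouts above. */
static void ddt_indexes(uint32_t device_id, int msi_flat, uint32_t ddi[3])
{
    if (!msi_flat) {                        /* capabilities.MSI_FLAT == 0 */
        ddi[0] = device_id & 0x7F;          /* device_id[6:0]   */
        ddi[1] = (device_id >> 7) & 0x1FF;  /* device_id[15:7]  */
        ddi[2] = (device_id >> 16) & 0xFF;  /* device_id[23:16] */
    } else {                                /* capabilities.MSI_FLAT == 1 */
        ddi[0] = device_id & 0x3F;          /* device_id[5:0]   */
        ddi[1] = (device_id >> 6) & 0x1FF;  /* device_id[14:6]  */
        ddi[2] = (device_id >> 15) & 0x1FF; /* device_id[23:15] */
    }
}

/* Split the 20-bit process_id into PDT indexes. */
static void pdt_indexes(uint32_t process_id, uint32_t pdi[3])
{
    pdi[0] = process_id & 0xFF;         /* process_id[7:0]   */
    pdi[1] = (process_id >> 8) & 0x1FF; /* process_id[16:8]  */
    pdi[2] = (process_id >> 17) & 0x7;  /* process_id[19:17] */
}
```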
Queue Interface
- A command-queue (CQ) used by software to queue commands to the IOMMU.
- A fault/event queue (FQ) used by IOMMU to bring faults and events to software attention.
- A page-request queue (PQ) used by IOMMU to report “Page Request” messages received from PCIe devices. This queue is supported if the IOMMU supports the PCIe-defined Page Request Interface.
CQ
cmdq
The PPN of the base of this in-memory queue and the size of the queue is configured into a memory-mapped register called command-queue base (cqb).
The tail of the command-queue resides in a software-controlled read/write memory-mapped register called command-queue tail (cqt).
The head of the command-queue resides in a read-only memory-mapped IOMMU-controlled register called command-queue head (cqh).
If cqh == cqt, the command-queue is empty.
If cqt == (cqh - 1) the command-queue is full.
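With power-of-two queue sizes the empty/full tests are the usual ring-buffer checks; a sketch (the same convention applies to the FQ and PQ below):

```c
#include <stdbool.h>
#include <stdint.h>

/* queue_size = number of entries, a power of two. One slot stays unused
 * so that full and empty are distinguishable. */
static bool cq_empty(uint32_t cqh, uint32_t cqt)
{
    return cqh == cqt;
}

static bool cq_full(uint32_t cqh, uint32_t cqt, uint32_t queue_size)
{
    /* "cqt == cqh - 1", computed with wraparound. */
    return ((cqt + 1) & (queue_size - 1)) == cqh;
}
```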
IOMMU Page-Table cache invalidation commands
IOTINVAL.VMA ensures that previous stores made to the S/VS-stage page tables by the harts are observed by the IOMMU before all subsequent implicit reads from IOMMU to the corresponding S/VS-stage page tables.
IOTINVAL.GVMA ensures that previous stores made to the G-stage page tables are observed before all subsequent implicit reads from IOMMU to the corresponding G-stage page tables. Setting PSCV to 1 with IOTINVAL.GVMA is illegal.
IOMMU Command-queue Fence commands
An IOFENCE.C command guarantees that all previous commands fetched from the CQ have been completed and committed.
FQ
faultq
Fault/Event queue is an in-memory queue data structure used to report events and faults raised when processing transactions.
The PPN of the base of this in-memory queue and the size of the queue is configured into a memory-mapped register called fault-queue base (fqb).
The tail of the fault-queue resides in an IOMMU-controlled read-only memory-mapped register called fqt.
fqh is the index of the next fault record that SW should process.
CAUSE: the fault-record field that encodes the cause of the fault.
PQ
priq, also called the Page-Request-Queue
Page-request queue is an in-memory queue data structure used to report PCIe ATS “Page Request” and “Stop Marker” messages to software.
The base PPN of this in-memory queue and the size of the queue is configured into a memory-mapped register called page-request queue base (pqb).
The tail of the queue resides in an IOMMU-controlled read-only memory-mapped register called pqt.
The head of the queue resides in a software-controlled read/write memory-mapped register called pqh.
If pqh == pqt, the page-request queue is empty.
If pqt == (pqh - 1) the page-request queue is full.
PCIe-related
ATS & PRI
Address Translation Service
Page Request Interface
A PCIe device can issue page-fault ("Page Request") messages; when the IOMMU hardware decodes such a request, it writes it into the page-request queue (PQ). Once software has set up the page tables, it sends the PRI response back to the PCIe device through the command queue (CQ). The concrete ATS and PRI implementations are hardware-specific.
After software initiates a DMA, the device first issues an ATS translation request, asking the IOMMU for the PA that corresponds to the VA. If the IOMMU already holds a VA-to-PA mapping, the device receives the PA and then issues the memory access with that PA; such an access goes straight to the PA without further address translation in the IOMMU.
If the IOMMU has no VA-to-PA mapping, the device, on learning this, follows up with a PRI request to the IOMMU; after receiving the PRI response from the IOMMU, the device issues the memory-access request, which is then translated by the IOMMU to a PA and finally reaches physical memory.
Seen from this flow, the device side is effectively doing the TLB's job.
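A device-side pseudo-implementation of that flow; every function below is a hypothetical stand-in for a hardware message (ATS/PRI TLPs), not a real API.

```c
#include <stdint.h>

/* Hypothetical message primitives standing in for PCIe ATS/PRI TLPs. */
extern int  ats_translate(uint64_t va, uint64_t *pa); /* 0 = translation returned */
extern void pri_page_request(uint64_t va);            /* PRI "Page Request"   */
extern void pri_wait_response(uint64_t va);           /* wait for PRG response */
extern void dma_access(uint64_t addr, int translated);

/* One DMA access from the device's point of view, with ATS + PRI. */
void device_dma(uint64_t va)
{
    uint64_t pa;

    if (ats_translate(va, &pa) == 0) {
        /* ATC hit: access with the PA, marked "translated", so the IOMMU
         * lets it through without another table walk. */
        dma_access(pa, /*translated=*/1);
    } else {
        /* No mapping: ask software to create one, then retry with an
         * untranslated access that the IOMMU translates on the fly. */
        pri_page_request(va);
        pri_wait_response(va);
        dma_access(va, /*translated=*/0);
    }
}
```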
Problems the ATS mechanism is meant to solve (its advantages):
- It offloads the translation-lookup burden from the host (CPU) side. For high-bandwidth I/O streams in large topologies, the CPU-side IOMMU table walks become a performance bottleneck; ATS offloads that lookup pressure onto the individual devices, giving the whole system a "who consumes it, who pays for it" model.
- A key determinant of lookup performance is TLB prefetching/prediction. With traditional PCIe I/O streams, all IOMMU lookups happen centrally on the CPU side, which is very unfriendly to TLB prefetching: I/O streams from many different workloads converge at a single point, interfere heavily with one another, and defeat accurate prediction, so the TLB hit rate stays low and I/O performance suffers. ATS instead pushes TLB prediction out to the source, letting each device design a prefetch policy around its own traffic, with no interference from other devices' prediction models, which greatly improves accuracy. Viewed abstractly, the device then behaves much like a CPU core, predicting its local TLB from the workload it actually runs, improving its own translation performance and, with it, that of the whole system.