RISC-V QEMU-KVM framework

ARM vs RISC-V hardware virtualization

riscv-aia

On RISC-V, interrupt passthrough requires riscv-imsic, the IMSIC defined by the AIA architecture.

riscv-imsic stable

riscv-imsic release

riscv-iommu

Device passthrough requires an IOMMU. On RISC-V the IOMMU is a non-ISA specification.

Non-ISA specifications do not add new instructions, create or change opcodes, or in any way modify the RISC-V ISA. They do help us to develop an ecosystem around the ISA Specifications.

https://github.com/riscv-non-isa/riscv-iommu/

Device passthrough

Hardware and software architecture support

PCIe support
Interrupt passthrough
IOMMU

In the QEMU-KVM architecture, device passthrough targets PCIe devices.
QEMU uses VFIO to implement PCIe device passthrough.

Device passthrough has two parts:

Interrupt passthrough
DMA remapping

PCIe device passthrough on ARM

How KVM injects an interrupt for a PCIe device in the ordinary case:

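QEMU calls ioctl(VFIO_DEVICE_SET_IRQS) on the device to associate the VFIO device interrupt with an eventfd; VFIO requests the interrupt for the device and installs vfio_msihandler() as its handler.
QEMU associates the interrupt (virq) requested by the guest with the eventfd, so that an event on the eventfd injects an interrupt into the guest OS. QEMU does this by calling ioctl(KVM_IRQFD).
When the guest OS writes to the memory-mapped region that controls MSI/MSI-X (the device configuration space or a device BAR), the access causes a VM exit to QEMU. QEMU writes the data into the MSI-X table in the device's BAR, which lets the device raise the interrupt through the ITS.
When the VFIO device raises an interrupt, the vfio-pci handler vfio_msihandler() runs first. It calls eventfd_signal() on the eventfd associated with the virq, and the event on that eventfd injects the interrupt into the guest OS.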

With GICv4, where LPIs are upgraded to vLPIs, the steps above become:

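QEMU calls ioctl(VFIO_DEVICE_SET_IRQS) on the device to associate the VFIO device interrupt with an eventfd.
After the guest OS writes to the MSI/MSI-X memory-mapped region (the device configuration space or a device BAR), or after the PCIe device writes the MSI/MSI-X interrupt address: if the guest OS is running, it handles the interrupt itself. If the guest OS is not running, the GICv4 doorbell mechanism takes over: the doorbell interrupt handler injects the interrupt into the vCPU, and the vCPU is scheduled so that its guest OS can handle the interrupt.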

Hardware support for interrupt passthrough

riscv-imsic

The IMSIC adds guest interrupt files. Each guest interrupt file is bound to one vCPU hosted on a physical CPU.

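When software sets up an interrupt it programs the MSI address and data. The data carries the hardware interrupt number (the interrupt identity), and the address selects the CPU or vCPU it is bound to, that is, whether the M-level interrupt file, the S-level interrupt file, or a particular guest interrupt file is targeted.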

After the device writes the MSI address, the IMSIC responds by signalling the hart's CSRs. Which CSR is affected depends on the interrupt file the interrupt was bound to:

M-level interrupt file: MEIP in mip is set.
S-level interrupt file: SEIP in sip is set.
Guest interrupt file X: bit X of hgeip is set.

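After software takes the interrupt, the interrupt handling code reads the [m/s/vs]topi CSR (added by AIA) to find the highest-priority pending interrupt; with an IMSIC, the identity of the external interrupt is then read from [m/s/vs]topei. It then jumps to the matching interrupt handler.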
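The mapping from interrupt file to pending bit can be written out as a small model. This is only a sketch: `HartCsrs` and `deliver_msi` are hypothetical names, the CSRs are plain integers, and it assumes the M-level file drives mip.MEIP (bit 11) and the S-level file drives sip.SEIP (bit 9).

```python
from dataclasses import dataclass

MEIP_BIT = 11  # machine external interrupt pending, in mip
SEIP_BIT = 9   # supervisor external interrupt pending, in sip


@dataclass
class HartCsrs:
    """Hypothetical container for the CSRs touched by an incoming MSI."""
    mip: int = 0
    sip: int = 0
    hgeip: int = 0


def deliver_msi(csrs: HartCsrs, target: str, guest_index: int = 0) -> HartCsrs:
    """Set the pending bit for the interrupt file the MSI was bound to.

    target is "M", "S" or "G" (guest interrupt file number guest_index).
    """
    if target == "M":
        csrs.mip |= 1 << MEIP_BIT
    elif target == "S":
        csrs.sip |= 1 << SEIP_BIT
    elif target == "G":
        if guest_index < 1:
            raise ValueError("guest interrupt files are numbered from 1")
        csrs.hgeip |= 1 << guest_index
    else:
        raise ValueError(f"unknown interrupt file {target!r}")
    return csrs
```

For example, `deliver_msi(HartCsrs(), "G", 3)` leaves only bit 3 of `hgeip` set, which is the state the scenarios later in this post start from.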

MIPS GIC virtualization

A generic External Interrupt Controller (EIC) typically has a number of input Interrupt Ports that are statically tied to devices in the system. It also has one logical output port to each core in the system, where the port has independent channels for root and guest interrupts. Input port interrupts are routed to the output ports. Logic within the EIC is implementation-dependent, although each slice of logic for the interface to the core can also have a root and guest section to configure interrupts separately for root and guest for the core. The following sections describe the interface virtualization.


If an EIC interrupt port is tied to either a root or guest-owned device (and not just root), the port should be modified such that it can be programmed with GuestID. In a multi-core system, each port also identifies a core destination. An interrupt can then be routed to a specific core through a specific interrupt channel (root or guest) for the core.

The logical output port to a core is split into the Root Interrupt Bus and Guest Interrupt Bus. These two independent channels route root and guest interrupts to the core. The following steps are required to deliver up to two interrupts (one for root, one for guest) in a cycle. The description assumes that interrupts are prioritized and represented by an Interrupt Priority Level.

Prioritize all incoming root (GuestID=0) interrupts every cycle based on assigned Interrupt Priority Level (IPL).
• Deliver highest priority interrupt for the cycle on Root Interrupt Bus.
Prioritize all incoming guest (GuestID!=0) interrupts every cycle based on assigned Interrupt Priority Level (IPL).
• If prioritized GuestID != resident GuestID, deliver interrupt on Root Interrupt Bus. Otherwise, deliver on Guest Interrupt Bus.
The resident GuestID is established from an input to the EIC from each core.
The External Interrupt Controller may reassign an interrupt from the Root Interrupt Bus to the Guest Interrupt Bus if the guest interrupt on the Root Interrupt Bus can be delivered to the core guest context as a result of a context switch, that is, the guest is now resident. Such handling is optional because root software can accomplish the same task by reprogramming the interrupt controller before switching guest context. The EIC can reassign active interrupts in this way as long as the core has not registered interrupt. This may be established by checking the Interrupt Priority Level that accompanies the interrupt acknowledgment.
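The per-cycle routing steps quoted above can be sketched as follows. This is an illustrative model, not the EIC's actual logic: `PendingIrq` and `route_cycle` are hypothetical names, a larger IPL is assumed to mean higher priority, and when a root interrupt and a non-resident guest interrupt both land on the Root Interrupt Bus the higher IPL is assumed to win (the quoted text does not say how that case is arbitrated).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PendingIrq:
    guest_id: int  # 0 means the interrupt is owned by root
    ipl: int       # Interrupt Priority Level; larger = more urgent (assumption)


def route_cycle(
    pending: List[PendingIrq], resident_guest_id: int
) -> Tuple[Optional[PendingIrq], Optional[PendingIrq]]:
    """Return (root_bus, guest_bus) for one cycle."""
    root = [p for p in pending if p.guest_id == 0]
    guest = [p for p in pending if p.guest_id != 0]

    # Step 1: highest-priority root interrupt goes on the Root Interrupt Bus.
    root_bus = max(root, key=lambda p: p.ipl, default=None)

    # Step 2: highest-priority guest interrupt goes on the Guest Interrupt Bus
    # only if its guest is resident; otherwise it falls back to the root bus.
    guest_bus = None
    top_guest = max(guest, key=lambda p: p.ipl, default=None)
    if top_guest is not None:
        if top_guest.guest_id == resident_guest_id:
            guest_bus = top_guest
        else:
            candidates = [p for p in (root_bus, top_guest) if p is not None]
            root_bus = max(candidates, key=lambda p: p.ipl)  # assumed arbitration
    return root_bus, guest_bus
```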

Interrupt passthrough scenario analysis

Background:
Different vCPUs can run different guest OSes.
A physical CPU has three sets of registers:

v-prefixed: vsip, vsie, and so on
h-prefixed: hvip, hip, hie, hgeie, hgeip, and so on
s-prefixed: sip, sie, and so on

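On the V=0 to V=1 transition the v-prefixed registers stand in for the s-prefixed ones, and the hart becomes the vCPU's execution environment.
With V=1, software can only access the s-prefixed names, which hardware redirects to the v-prefixed registers.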

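In the device passthrough case, 0 < hstatus.VGEIN <= GEILEN, where GEILEN is the number of guest external interrupts the hart implements and therefore bounds how many vCPUs hosted on the physical CPU can receive passed-through interrupts.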

The HS-mode VMM has to program hgeie and hstatus for the vCPUs hosted on the physical CPU.
For example, if the physical CPU currently hosts 8 vCPUs and vCPU 2 is running:

Set hstatus.VGEIN to 2.
Set bits 1 to 8 of hgeie (bit 0 is read-only zero). This means the physical CPU tracks the interrupt state of 8 ready vCPUs, all of which handle guest external interrupts.

Given the above, hardware must keep hip.VSEIP distinct from the state of hgeip.
As noted, hip.VSEIP comes from the "bit of hgeip selected by hstatus.VGEIN".

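In the passthrough case (hstatus.VGEIN is set), the interrupt controller has to decide which vCPU the interrupt is for, which comes down to which bit of hgeip to set. Hardware then sets hip.VSEIP to the bit of hgeip selected by hstatus.VGEIN, and sets hip.SGEIP when the bitwise AND of hgeip and hgeie is nonzero.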

If the interrupt controller targets vCPU 3, it sets bit 3 of hgeip.
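The two derived bits can be checked with a few lines of arithmetic. The functions below are a hypothetical sketch with the CSRs as plain integers; the optional `hvip_vseip` argument reflects that software can also assert VSEIP through hvip, which the privileged specification ORs with the selected hgeip bit.

```python
def hip_vseip(hgeip: int, vgein: int, hvip_vseip: int = 0) -> int:
    """hip.VSEIP: the hgeip bit selected by hstatus.VGEIN, ORed with hvip.VSEIP.

    VGEIN = 0 selects no guest external interrupt source.
    """
    selected = (hgeip >> vgein) & 1 if vgein != 0 else 0
    return selected | (hvip_vseip & 1)


def hip_sgeip(hgeip: int, hgeie: int) -> int:
    """hip.SGEIP: 1 when any bit is set in both hgeip and hgeie."""
    return 1 if (hgeip & hgeie) != 0 else 0


# Example from the text: 8 vCPUs enabled (hgeie bits 1..8), interrupt for vCPU 3.
HGEIE = 0x1FE
HGEIP = 1 << 3
print(hip_vseip(HGEIP, vgein=2), hip_sgeip(HGEIP, HGEIE))  # vCPU 2 resident
print(hip_vseip(HGEIP, vgein=3), hip_sgeip(HGEIP, HGEIE))  # vCPU 3 resident
```

With vCPU 2 resident the selected bit is 0 while SGEIP is 1; with vCPU 3 resident both are 1. These are the starting points of scenarios 1 and 3.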

Scenario 1: the physical CPU is in V=1 mode running vCPU 2, hstatus.VGEIN = 2:

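Because vCPU 2 is running with hstatus.VGEIN=2, hip.VSEIP = 0 while hip.SGEIP is 1. Since vsip.SEIP (seen by the guest as sip.SEIP) mirrors hip.VSEIP, vsip.SEIP is not set (assuming only external interrupts are in play, so SSIP and STIP are 0). Hardware sees nothing pending in vsip but sees hip.SGEIP asserted, so it cannot hand the interrupt to the running vCPU and instead delivers it to the VMM in HS-mode on the host.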

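After trapping from the vCPU into the hypervisor, the VMM checks hip.SGEIP & hie.SGEIE (or hip & hie) and finds a pending guest external interrupt. It then reads hgeip, finds that the interrupt belongs to vCPU 3, sets hstatus.VGEIN to 3 and switches to vCPU 3. Now vsip.SEIP = hip.VSEIP becomes 1 (the bit of hgeip selected by hstatus.VGEIN), and vCPU 3 enters V=1 mode and handles the virtual external interrupt. If the guest kernel has set sie.SEIE, the guest OS takes the external interrupt (interrupt 10, the VS-level external interrupt, is presented to the guest as interrupt 9, the supervisor external interrupt). The guest OS queries the interrupt controller to find which source raised the interrupt and which handler should run. When handling is done it clears the pending state in the interrupt controller, which makes the controller clear the hgeip bit. After the return to the hypervisor, vsip.SEIP = hip.VSEIP is also clear because hgeip is clear.
Scenario 2: the physical CPU is in V=0 mode, running the host: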

The host OS receives the interrupt.
The VMM checks hip.SGEIP & hie.SGEIE (or hip & hie) and finds a pending guest external interrupt. It reads hgeip, finds that the interrupt belongs to vCPU 3, sets hstatus.VGEIN to 3 and switches to vCPU 3. vsip.SEIP = hip.VSEIP becomes 1, vCPU 3 enters V=1 mode, and the guest OS handles and clears the interrupt exactly as in scenario 1.
Scenario 3: the physical CPU is in V=1 mode running vCPU 3, hstatus.VGEIN = 3:
The bit of hgeip selected by hstatus.VGEIN is 1, so hip.VSEIP is set.
Since vsip.SEIP (seen by the guest as sip.SEIP) mirrors hip.VSEIP, vsip.SEIP is set. Hardware sees vsip pending as well as hip.SGEIP, and the interrupt should go to the guest OS on the vCPU.
The vCPU handles the virtual external interrupt. If the guest kernel has set sie.SEIE, the guest OS takes the external interrupt, queries the interrupt controller, runs the handler and clears the pending state as in scenario 1; hgeip and therefore vsip.SEIP = hip.VSEIP are cleared as a result.
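The three scenarios reduce to one decision. The sketch below follows the scenarios as described here rather than any real hypervisor: `pending_guest_files` and `decide` are hypothetical names, and picking the lowest-numbered pending guest interrupt file first is an arbitrary assumption.

```python
from typing import List, Optional, Tuple


def pending_guest_files(hgeip: int, hgeie: int) -> List[int]:
    """Guest interrupt file numbers that are both pending and enabled."""
    ready = hgeip & hgeie
    return [n for n in range(1, ready.bit_length()) if (ready >> n) & 1]


def decide(hgeip: int, hgeie: int, v: int, vgein: int) -> Tuple[str, Optional[int]]:
    """Return who handles the interrupt and for which vCPU.

    v is the virtualization mode (1 = a guest is running), vgein is
    hstatus.VGEIN of the resident vCPU (ignored when v == 0).
    """
    pending = pending_guest_files(hgeip, hgeie)
    if not pending:
        return ("none", None)
    if v == 1 and vgein in pending:
        # Scenario 3: the target vCPU is resident, its guest OS handles it.
        return ("guest-handles", vgein)
    # Scenarios 1 and 2: the VMM sets hstatus.VGEIN and switches vCPU.
    return ("vmm-switches-to", pending[0])
```

With `hgeip = 1 << 3` and `hgeie = 0x1FE`, `decide(..., v=1, vgein=2)` and `decide(..., v=0, vgein=0)` both ask the VMM to switch to vCPU 3, while `decide(..., v=1, vgein=3)` leaves the interrupt with the guest.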

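With riscv-iommu support a PCIe device can write the MSI address directly and deliver the interrupt to the right vCPU. This hardware design simplifies the hypervisor and interrupt subsystem software.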

PCIe device passthrough with ARM GICv4

Interrupt passthrough raises many questions. The important ones are:

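How does a PCIe device enable and disable interrupts?
How does a PCIe device send an interrupt, and how does the OS handle it?
The PCIe device seen by the guest OS is a virtual device. How is it associated with the real PCIe device, and how does enabling interrupts in the guest reach the real device's physical address?
The guest OS requests interrupts but only has a virtual interrupt controller. How is that connected to the real interrupt controller, how is the virtual interrupt route set up, and how does the interrupt controller bind the vCPU to the hardware interrupt number?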

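PCIe devices generally support MSI and MSI-X. Both raise a message-based interrupt by writing data to a mapped memory address (the address is the Message Address, the data is the Message Data).
The memory-mapped region that sets up MSI is in the PCIe device's configuration space, while the one for MSI-X is in the device's BAR space.
MSI supports at most 32 interrupts, and they must be allocated contiguously.
MSI-X supports many more (2048) and does not require contiguous allocation.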
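The difference in allocation rules can be sketched as two helper functions. The names are hypothetical, and one detail is not stated above: the MSI capability encodes the vector count as a power of two (1, 2, 4, 8, 16 or 32), so a request is rounded up.

```python
MSI_MAX_VECTORS = 32
MSIX_MAX_VECTORS = 2048


def msi_vectors_to_request(wanted: int) -> int:
    """Round an MSI request up to the next power of two, at most 32.

    The resulting vectors form one contiguous block.
    """
    if not 1 <= wanted <= MSI_MAX_VECTORS:
        raise ValueError("MSI supports 1 to 32 vectors")
    count = 1
    while count < wanted:
        count *= 2
    return count


def msix_vectors_to_request(wanted: int) -> int:
    """MSI-X entries are independent, so the request is used as-is."""
    if not 1 <= wanted <= MSIX_MAX_VECTORS:
        raise ValueError("MSI-X supports 1 to 2048 vectors")
    return wanted
```

A driver that wants 5 interrupts would therefore ask for 8 MSI vectors but only 5 MSI-X vectors.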

Only MSI is analyzed here.

Enabling interrupts on a PCIe device
Flow on the host OS

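Write the device's MSI registers in PCIe configuration space. The write ultimately lands in the device registers and sets the enable bit. The steps in between are fairly involved and belong to the PCIe driver, so they are skipped here.
Set up the mapping between the Linux virq and the device's hardware interrupt number.
The kernel allocates a virq and maps it to the hardware interrupt number, which ultimately ties the hardware interrupt number to an irq handler.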

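How a PCIe device sends an interrupt and how it is handled
The kernel queries the interrupt controller or the CPU's CSRs to find which hardware interrupt number is pending, maps that number to the Linux virq, and finally calls the interrupt handler.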
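The hardware interrupt number to virq to handler chain is essentially two table lookups. The class below is a toy model whose names are only loosely inspired by the Linux irq domain concept; it is not the kernel API.

```python
from typing import Callable, Dict


class IrqDomain:
    """Toy model: hwirq -> virq -> handler."""

    def __init__(self) -> None:
        self._hwirq_to_virq: Dict[int, int] = {}
        self._handlers: Dict[int, Callable[[int], str]] = {}
        self._next_virq = 1

    def map_hwirq(self, hwirq: int) -> int:
        """Allocate a virq for hwirq, or return the existing one."""
        if hwirq not in self._hwirq_to_virq:
            self._hwirq_to_virq[hwirq] = self._next_virq
            self._next_virq += 1
        return self._hwirq_to_virq[hwirq]

    def request_irq(self, virq: int, handler: Callable[[int], str]) -> None:
        """Attach a handler to a virq."""
        self._handlers[virq] = handler

    def dispatch(self, hwirq: int) -> str:
        """Called with the pending hwirq read from the controller or CSRs."""
        virq = self._hwirq_to_virq[hwirq]
        return self._handlers[virq](virq)
```

Usage: map the hardware interrupt number once at setup (`virq = domain.map_hwirq(42)`), register a handler with `request_irq`, and call `dispatch(42)` when the interrupt fires.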

Enabling interrupts for a passed-through PCIe device under the guest OS

Enabling interrupts on the PCIe device
This is where it gets more complicated.

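First, the interrupt controller, PCIe devices and other external controllers seen by the guest are all emulated by QEMU.
Even with passthrough, the devices and controllers the guest OS sees are all emulated, and the guest cannot directly access the physical addresses of the real devices or controllers.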

Two things are involved:

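Interrupt controller emulation: how input to the virtual interrupt controller is forwarded to the real interrupt controller under passthrough.
PCIe device emulation: how operations on the virtual device are forwarded to the real device under passthrough.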

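There is also a software flow: how does the deviceid / hardware interrupt number / vCPU relationship established in the guest kernel get routed to the VMM on the host OS? There are roughly two reasons this is needed: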

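When a vCPU has exited and the physical CPU is running the host OS, the host CPU has to take over the vCPU's interrupts (a doorbell-like mechanism), inject the interrupt into the vCPU, and then schedule the vCPU.
Passthrough or not, a device's interrupt signal always goes to the real interrupt controller, while the guest OS only has a virtual one. The deviceid / hardware interrupt number / vCPU route is ultimately established by the VMM on the host OS.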

How does a PCIe device send an interrupt, and how is it handled?

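The PCIe device hardware sends the interrupt to the interrupt controller. The controller looks up the deviceid in its interrupt routing table to find the hardware interrupt number and the CPU/vCPU id, then sets the pending signal and the bits in the hardware-interrupt-number registers on the physical CPU that hosts the vCPU.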

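If the physical CPU is running the host OS, the doorbell mechanism handles it: the interrupt is injected into the vCPU and the vCPU is scheduled. The guest OS on the vCPU queries the virtual interrupt controller, or vCPU-specific registers, for the hardware interrupt number, finds the Linux virq, and finally calls the irq handler.
If the physical CPU is running in the context of that vCPU, the guest OS on the vCPU queries the virtual interrupt controller, or vCPU-specific registers, for the hardware interrupt number, finds the Linux virq, and finally calls the irq handler.
If the physical CPU is in the context of another vCPU, the doorbell mechanism means the running vCPU does not receive the interrupt; the host OS receives it instead. The interrupt traps into the VMM on the host OS, which reads the interrupt information, works out which vCPU it is for, and injects the interrupt into that vCPU. The guest OS on that vCPU then queries the virtual interrupt controller, or vCPU-specific registers, for the hardware interrupt number, finds the Linux virq, and finally calls the irq handler.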
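The routing-table lookup and the three delivery cases can be summarized in one function. Everything here is a hypothetical model: the table is a plain dictionary keyed by deviceid, and the returned path names are labels for the three cases above, not real GIC or KVM states.

```python
from typing import Dict, Optional, Tuple

# deviceid -> (hardware interrupt number, vCPU id), as programmed by the VMM
RouteTable = Dict[int, Tuple[int, int]]


def deliver(
    routes: RouteTable, device_id: int, resident_vcpu: Optional[int]
) -> Tuple[str, int, int]:
    """Return (path, vcpu, hwirq) for an interrupt from device_id.

    resident_vcpu is None when the physical CPU is running the host OS.
    """
    hwirq, vcpu = routes[device_id]
    if resident_vcpu is None:
        path = "host-doorbell"   # host injects, then schedules the vCPU
    elif resident_vcpu == vcpu:
        path = "direct"          # the guest OS takes the interrupt itself
    else:
        path = "trap-doorbell"   # trap to the VMM, inject into the target vCPU
    return path, vcpu, hwirq
```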

Reference: the ARM architecture
What needs to be added:

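Hooks for the GIC controller ops in the KVM module of the kernel.
The KVM-related ops implemented by the interrupt controller driver.
The registration flow for the virtual interrupt controller in QEMU, which is passed down to KVM so that KVM sets up the corresponding mappings.
MMIO emulation for the guest's virtual interrupt controller. This region lives in memory managed by KVM. The interrupt controller driver has to register the MMIO region, handle the read/write path entered when a guest page fault traps into that region, parse the guest OS's deviceid / hardware interrupt number / vCPU mapping, and set up the interrupt route for the vCPU in the real interrupt controller. It also has to set up the CPU-to-vCPU doorbell route and the interrupt injection flow.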

PCIe part:

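Add MMIO emulation of PCIe configuration space, BAR space and so on to QEMU.
All PCIe drivers and PCIe mechanisms that the QEMU VFIO framework depends on must be in place.
IOMMU: DMA remapping for the PCIe (device-side) address space has to be set up; the IOMMU driver does this.
Interrupts: QEMU and KVM together have to decode MMIO reads and writes to the MSI/MSI-X space and turn them into operations on the real interrupt controller, establishing the real interrupt route.
When the VM takes a VM exit because of a write to PCI configuration space, MSI and MSI-X end up enabled: QEMU sets up the eventfd handler and registers the irqfd with the kernel through KVM, which in turn registers the virtual interrupt for the VM.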

KVM features the passthrough framework depends on

Features not yet implemented in RISC-V KVM

ioeventfd

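Sometimes an I/O request is only a notification: the event tells KVM or QEMU to carry out some other, actual I/O. In that case there is no need to wait for the data to be fully written as with ordinary I/O; a simple notification is enough.
Handling such a request through the usual synchronous path clearly adds unnecessary overhead.
ioeventfd is an optimization for this kind of notification I/O. A userspace program such as QEMU associates an eventfd with a specific guest address, listens for events on that eventfd, and calls ioctl(KVM_IOEVENTFD) to register the address with KVM. When the guest takes a VM exit because of I/O, KVM checks whether the address has an eventfd; if so, it calls eventfd_signal directly on that fd. QEMU then returns from its event loop and handles the request.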

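The difference from ordinary MMIO handling is that, for the guest, this is an asynchronous call: the guest does not wait for the access to complete and re-enters guest mode right away. The vCPU traps into KVM (HS-mode) only once; KVM notifies the userspace thread, which need not be on the same CPU as the vCPU, and the vCPU goes straight back to the guest OS.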
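The fast path can be modelled without any real KVM calls. In this sketch `IoEventFdTable` is a hypothetical stand-in for the table KVM keeps after ioctl(KVM_IOEVENTFD), and a Python callback stands in for eventfd_signal; the point is only that a registered address resumes the guest immediately, while any other address falls back to a full exit to userspace.

```python
from typing import Callable, Dict, Tuple


class IoEventFdTable:
    """Toy model of KVM's address -> eventfd lookup on a guest write."""

    def __init__(self) -> None:
        self._by_range: Dict[Tuple[int, int], Callable[[], None]] = {}

    def register(self, addr: int, length: int, signal: Callable[[], None]) -> None:
        """Models ioctl(KVM_IOEVENTFD): bind (addr, length) to a notifier."""
        self._by_range[(addr, length)] = signal

    def on_guest_write(self, addr: int, length: int, value: int) -> str:
        """Called in the model's 'kernel' when a guest write causes a VM exit."""
        signal = self._by_range.get((addr, length))
        if signal is None:
            # Ordinary MMIO: userspace must emulate before the guest resumes.
            return "exit-to-userspace"
        signal()  # stands in for eventfd_signal(); userspace wakes up later
        return "resume-guest"
```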

irqfd
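ioeventfd is a fast path for the guest OS to notify KVM/QEMU; irqfd is the analogous fast path for KVM/QEMU to notify the guest OS. irqfd ties an eventfd to a global interrupt number, so signalling the eventfd injects the corresponding interrupt into the VM.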
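The reverse direction can be modelled the same way. `IrqFdTable` is a hypothetical stand-in for the binding created by ioctl(KVM_IRQFD); integers stand in for file descriptors and the `inject` callback stands in for KVM's interrupt injection.

```python
from typing import Callable, Dict


class IrqFdTable:
    """Toy model: signalling an eventfd injects its bound interrupt number."""

    def __init__(self, inject: Callable[[int], None]) -> None:
        self._gsi_by_fd: Dict[int, int] = {}
        self._inject = inject

    def assign(self, fd: int, gsi: int) -> None:
        """Models ioctl(KVM_IRQFD): bind an eventfd to a global interrupt number."""
        self._gsi_by_fd[fd] = gsi

    def signal(self, fd: int) -> None:
        """Models eventfd_signal() as called from vfio_msihandler()."""
        self._inject(self._gsi_by_fd[fd])


injected = []
table = IrqFdTable(injected.append)
table.assign(fd=7, gsi=33)
table.signal(7)  # injected now holds the interrupt number bound to fd 7
```

This is the same chain as in the ordinary VFIO path described earlier: the device interrupt runs the VFIO handler, the handler signals the eventfd, and the irqfd binding turns that into an injected guest interrupt.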

blog

Hardware virtualization and the device passthrough framework