RISC-V KVM commit notes

1. RISC-V: Add initial skeletal KVM support

This patch adds initial skeletal KVM RISC-V support which has:
1. A simple implementation of arch specific VM functions
   except kvm_vm_ioctl_get_dirty_log(), which will be implemented
   in the future as part of stage2 page logging.
2. Stubs of required arch specific VCPU functions except
   kvm_arch_vcpu_ioctl_run() which is semi-complete and
   extended by subsequent patches.
3. Stubs for required arch specific stage2 MMU functions.

2. RISC-V: KVM: Implement VCPU create, init and destroy functions

This patch implements VCPU create, init and destroy functions
required by generic KVM module. We don't have much dynamic
resources in struct kvm_vcpu_arch so these functions are quite
simple for KVM RISC-V.

3. RISC-V: KVM: Implement VCPU interrupts and requests handling

This patch implements VCPU interrupts and requests which are both
asynchronous events.

The VCPU interrupts can be set/unset using KVM_INTERRUPT ioctl from
user-space. In future, the in-kernel IRQCHIP emulation will use
kvm_riscv_vcpu_set_interrupt() and kvm_riscv_vcpu_unset_interrupt()
functions to set/unset VCPU interrupts.

Important VCPU requests implemented by this patch are:
KVM_REQ_SLEEP       - set whenever VCPU itself goes to sleep state
KVM_REQ_VCPU_RESET  - set whenever VCPU reset is requested

The WFI trap-n-emulate (added later) will use KVM_REQ_SLEEP request
and kvm_riscv_vcpu_has_interrupt() function.

The KVM_REQ_VCPU_RESET request will be used by SBI emulation (added
later) to power-up a VCPU in power-off state. The user-space can use
the GET_MPSTATE/SET_MPSTATE ioctls to get/set power state of a VCPU.
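
A minimal user-space sketch of the KVM_INTERRUPT side of this, assuming
an open VCPU fd (the request handling itself lives in the kernel):

    #include <stdbool.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Set or clear the VCPU external interrupt via KVM_INTERRUPT. */
    static int set_vcpu_irq(int vcpu_fd, bool pending)
    {
            struct kvm_interrupt irq = {
                    .irq = pending ? KVM_INTERRUPT_SET : KVM_INTERRUPT_UNSET,
            };

            return ioctl(vcpu_fd, KVM_INTERRUPT, &irq);
    }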

4. RISC-V: KVM: Implement KVM_GET_ONE_REG/KVM_SET_ONE_REG ioctls

For KVM RISC-V, we use KVM_GET_ONE_REG/KVM_SET_ONE_REG ioctls to access
VCPU config and registers from user-space.

We have three types of VCPU registers:
1. CONFIG - these are VCPU config and capabilities
2. CORE   - these are VCPU general purpose registers
3. CSR    - these are VCPU control and status registers

The CONFIG register available to user-space is ISA. The ISA register is
a read and write register where user-space can only write the desired
VCPU ISA capabilities before running the VCPU.

The CORE registers available to user-space are PC, RA, SP, GP, TP, A0-A7,
T0-T6, S0-S11 and MODE. Most of these are RISC-V general registers except
PC and MODE. The PC register represents program counter whereas the MODE
register represent VCPU privilege mode (i.e. S/U-mode).

The CSRs available to user-space are SSTATUS, SIE, STVEC, SSCRATCH, SEPC,
SCAUSE, STVAL, SIP, and SATP. All of these are read/write registers.

In future, more VCPU register types will be added (such as FP) for the
KVM_GET_ONE_REG/KVM_SET_ONE_REG ioctls.
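
A user-space sketch of reading one CORE register (the guest PC) with the
ONE_REG API on RV64; RISCV_CORE_REG() is a local helper modeled on the
encoding used by the kernel selftests, not a UAPI macro:

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    #define RISCV_CORE_REG(name) \
            (KVM_REG_RISCV | KVM_REG_SIZE_U64 | KVM_REG_RISCV_CORE | \
             (offsetof(struct kvm_riscv_core, name) / sizeof(unsigned long)))

    static int get_guest_pc(int vcpu_fd, uint64_t *pc)
    {
            struct kvm_one_reg reg = {
                    .id   = RISCV_CORE_REG(regs.pc),
                    .addr = (uint64_t)(uintptr_t)pc,
            };

            return ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);
    }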

5. RISC-V: KVM: Implement VCPU world-switch

This patch implements the VCPU world-switch for KVM RISC-V.

The KVM RISC-V world-switch (i.e. __kvm_riscv_switch_to()) mostly
switches general purpose registers, SSTATUS, STVEC, SSCRATCH and
HSTATUS CSRs. Other CSRs are switched via vcpu_load() and vcpu_put()
interface in kvm_arch_vcpu_load() and kvm_arch_vcpu_put() functions
respectively.

6. RISC-V: KVM: Handle MMIO exits for VCPU

We will get stage2 page faults whenever the Guest/VM accesses a SW
emulated MMIO device or unmapped Guest RAM.

This patch implements MMIO read/write emulation by extracting MMIO
details from the trapped load/store instruction and forwarding the
MMIO read/write to user-space. The actual MMIO emulation will happen
in user-space and KVM kernel module will only take care of register
updates before resuming the trapped VCPU.

The handling for stage2 page faults for unmapped Guest RAM will be
implemented by a separate patch later.
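
A sketch of the user-space side of this protocol (emulate_mmio_read()
and emulate_mmio_write() stand in for a real device model):

    #include <linux/kvm.h>

    void emulate_mmio_read(__u64 addr, void *data, __u32 len);
    void emulate_mmio_write(__u64 addr, const void *data, __u32 len);

    /* Called when KVM_RUN returns with exit_reason == KVM_EXIT_MMIO. */
    static void handle_mmio_exit(struct kvm_run *run)
    {
            if (run->mmio.is_write)
                    emulate_mmio_write(run->mmio.phys_addr,
                                       run->mmio.data, run->mmio.len);
            else
                    /* Result lands in mmio.data; KVM completes the
                     * trapped load on the next KVM_RUN. */
                    emulate_mmio_read(run->mmio.phys_addr,
                                      run->mmio.data, run->mmio.len);
    }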

7. RISC-V: KVM: Handle WFI exits for VCPU

We get an illegal instruction trap whenever the Guest/VM executes the
WFI instruction.

This patch handles WFI trap by blocking the trapped VCPU using
kvm_vcpu_block() API. The blocked VCPU will be automatically
resumed whenever a VCPU interrupt is injected from user-space
or from in-kernel IRQCHIP emulation.

8. RISC-V: KVM: Implement VMID allocator

We implement a simple VMID allocator for Guests/VMs which:
1. Detects number of VMID bits at boot-time
2. Uses atomic number to track VMID version and increments
   VMID version whenever we run-out of VMIDs
3. Flushes Guest TLBs on all host CPUs whenever we run-out
   of VMIDs
4. Force updates HW Stage2 VMID for each Guest VCPU whenever
   VMID changes using VCPU request KVM_REQ_UPDATE_HGATP
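
A simplified sketch of that rollover logic (illustrative names, locking
omitted; the real code lives in arch/riscv/kvm/vmid.c):

    static unsigned long vmid_bits;     /* detected from hgatp at boot */
    static unsigned long next_vmid;
    static atomic_long_t vmid_version = ATOMIC_LONG_INIT(1);

    static void flush_guest_tlb(void *info)
    {
            __kvm_riscv_hfence_gvma_all();
    }

    static unsigned long vmid_alloc(void)
    {
            if (next_vmid >= (1UL << vmid_bits)) {
                    /* Ran out of VMIDs: bump the version, flush guest
                     * TLBs on all host CPUs, restart numbering. */
                    atomic_long_inc(&vmid_version);
                    next_vmid = 0;
                    on_each_cpu(flush_guest_tlb, NULL, 1);
            }
            return next_vmid++;
    }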

9. RISC-V: KVM: Implement stage2 page table programming

This patch implements all required functions for programming
the stage2 page table for each Guest/VM.

At high-level, the flow of stage2 related functions is similar
from KVM ARM/ARM64 implementation but the stage2 page table
format is quite different for KVM RISC-V.

Note: this provides the interfaces for directly programming the stage-2
page table, e.g. kvm_riscv_gstage_alloc_pgd()/kvm_riscv_gstage_free_pgd(),
gstage_get_leaf_entry(), gstage_pte_page_vaddr(), etc.
(see arch/riscv/kvm/mmu.c)

10. RISC-V: KVM: Implement MMU notifiers

This patch implements MMU notifiers for KVM RISC-V so that Guest
physical address space is in-sync with Host physical address space.

This will allow swapping, page migration, etc to work transparently
with KVM RISC-V.
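
The RISC-V side boils down to wiring the common notifier hooks to the
stage2 unmap path; a hedged sketch of one such hook (the helper name and
return-value handling follow arch/riscv/kvm/mmu.c only loosely):

    /* Host is dropping this HVA range (swap, migration, madvise), so
     * remove the matching G-stage mappings. */
    bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
    {
            stage2_unmap_range(kvm, range->start << PAGE_SHIFT,
                               (range->end - range->start) << PAGE_SHIFT);
            return true;    /* caller flushes the TLB */
    }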

Note: when the KVM_CAP_SYNC_MMU capability is available, changes in the
backing of a memory region are automatically reflected into the guest;
for example, an mmap() that affects the region becomes visible
immediately. Another example is madvise(MADV_DROP).

11. RISC-V: KVM: Add timer functionality

The RISC-V hypervisor specification doesn't have any virtual timer
feature.

Due to this, the guest VCPU timer will be programmed via SBI calls.
The host will use a separate hrtimer event for each guest VCPU to
provide timer functionality. We inject a virtual timer interrupt to
the guest VCPU whenever the guest VCPU hrtimer event expires.

This patch adds guest VCPU timer implementation along with ONE_REG
interface to access VCPU timer state from user space.
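
A sketch of the per-VCPU hrtimer expiry path (the container_of layout
mirrors the in-tree kvm_vcpu_timer, but treat the field names as
illustrative):

    #include <linux/hrtimer.h>
    #include <linux/kvm_host.h>

    static enum hrtimer_restart vcpu_hrtimer_expired(struct hrtimer *h)
    {
            struct kvm_vcpu_timer *t =
                    container_of(h, struct kvm_vcpu_timer, hrt);
            struct kvm_vcpu *vcpu =
                    container_of(t, struct kvm_vcpu, arch.timer);

            /* Inject the VS-level timer interrupt; this also wakes a
             * blocked VCPU. */
            kvm_riscv_vcpu_set_interrupt(vcpu, IRQ_VS_TIMER);
            return HRTIMER_NORESTART;
    }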

12. RISC-V: KVM: FP lazy save/restore

This patch adds floating point (F and D extension) context save/restore
for guest VCPUs. The FP context is saved and restored lazily only when
kernel enter/exits the in-kernel run loop and not during the KVM world
switch. This way FP save/restore has minimal impact on KVM performance.
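
A sketch of the lazy policy on the save side, assuming the usual
sstatus.FS dirty tracking (helper names taken from the original series):

    static void guest_fp_save(struct kvm_cpu_context *cntx,
                              const unsigned long *isa)
    {
            /* Only save if the guest actually dirtied the FP state. */
            if ((cntx->sstatus & SR_FS) != SR_FS_DIRTY)
                    return;

            if (riscv_isa_extension_available(isa, d))
                    __kvm_riscv_fp_d_save(cntx);
            else if (riscv_isa_extension_available(isa, f))
                    __kvm_riscv_fp_f_save(cntx);

            /* Mark the context clean until the guest touches FP again. */
            cntx->sstatus = (cntx->sstatus & ~SR_FS) | SR_FS_CLEAN;
    }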

13. RISC-V: KVM: Implement ONE REG interface for FP registers

Add a KVM_GET_ONE_REG/KVM_SET_ONE_REG ioctl interface for floating
point registers such as F0-F31 and FCSR. This support is added for
both 'F' and 'D' extensions.

14. RISC-V: KVM: Add SBI v0.1 support

The KVM host kernel runs in HS-mode, so we need to handle the SBI calls
coming from the guest kernel running in VS-mode.

This patch adds SBI v0.1 support in KVM RISC-V. Almost all SBI v0.1
calls are implemented in the KVM kernel module, except the GETCHAR and
PUTCHAR calls, which are forwarded to user space because they cannot be
implemented in kernel space. In future, when we implement SBI v0.2 for
the Guest, we will forward SBI v0.2 experimental and vendor extension
calls to user space.
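
A condensed sketch of the resulting dispatch (constants from asm/sbi.h;
returning 1 resumes the guest, returning 0 exits to user space):

    static int sbi_v01_ecall(struct kvm_vcpu *vcpu, struct kvm_run *run)
    {
            struct kvm_cpu_context *cp = &vcpu->arch.guest_context;

            switch (cp->a7) {   /* SBI extension id */
            case SBI_EXT_0_1_SET_TIMER:
                    kvm_riscv_vcpu_timer_next_event(vcpu, cp->a0);
                    return 1;
            case SBI_EXT_0_1_CONSOLE_GETCHAR:
            case SBI_EXT_0_1_CONSOLE_PUTCHAR:
                    /* Not implementable in-kernel: bounce to user space. */
                    run->exit_reason = KVM_EXIT_RISCV_SBI;
                    return 0;
            default:
                    cp->a0 = SBI_ERR_NOT_SUPPORTED;
                    return 1;
            }
    }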

16. RISC-V: KVM: Fix GPA passed to __kvm_riscv_hfence_gvma_xyz() functions

The parameter passed to HFENCE.GVMA instruction in rs1 register
is guest physical address right shifted by 2 (i.e. divided by 4).

Unfortunately, we overlooked the semantics of rs1 registers for
HFENCE.GVMA instruction and never right shifted guest physical
address by 2. This issue did not manifest for hypervisors till
now because:
  1) Currently, only __kvm_riscv_hfence_gvma_all() and SBI
     HFENCE calls are used to invalidate TLB.
  2) All H-extension implementations (such as QEMU, Spike,
     Rocket Core FPGA, etc) that we tried till now were
     conservatively flushing everything upon any HFENCE.GVMA
     instruction.

This patch fixes GPA passed to __kvm_riscv_hfence_gvma_vmid_gpa()
and __kvm_riscv_hfence_gvma_gpa() functions.
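
The corrected encoding, sketched as inline assembly (the kernel itself
emits the instruction as a raw .word for older assemblers):

    static inline void hfence_gvma_vmid_gpa(unsigned long gpa,
                                            unsigned long vmid)
    {
            /* rs1 carries GPA >> 2 per the privileged spec; rs2 = VMID. */
            asm volatile ("hfence.gvma %0, %1"
                          : : "r" (gpa >> 2), "r" (vmid) : "memory");
    }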

Bug fix; not of further interest here.

17. KVM: RISC-V: Unmap stage2 mapping when deleting/moving a memslot

Unmap stage2 page tables when a memslot is being deleted or moved.  It's
the architectures' responsibility to ensure existing mappings are removed
when kvm_arch_flush_shadow_memslot() returns.

18. KVM: Let/force architectures to deal with arch specific memslot data

Pass the "old" slot to kvm_arch_prepare_memory_region() and force arch
code to handle propagating arch specific data from "new" to "old" when
necessary.  This is a baby step towards dynamically allocating "new" from
the get go, and is a (very) minor performance boost on x86 due to not
unnecessarily copying arch data.

For PPC HV, copy the rmap in the !CREATE and !DELETE paths, i.e. for MOVE
and FLAGS_ONLY.  This is functionally a nop as the previous behavior
would overwrite the pointer for CREATE, and eventually discard/ignore it
for DELETE.

For x86, copy the arch data only for FLAGS_ONLY changes.  Unlike PPC HV,
x86 needs to reallocate arch data in the MOVE case as the size of x86's
allocations depend on the alignment of the memslot's gfn.

Opportunistically tweak kvm_arch_prepare_memory_region()'s param order to
match the "commit" prototype.

Note: optimizations related to dynamic allocation of memslot data.

19. KVM: RISC-V: Use "new" memslot instead of userspace memory region

Get the slot ID, hva, etc... from the "new" memslot instead of the
userspace memory region when preparing/committing a memory region.  This
will allow a future commit to drop @mem from the prepare/commit hooks
once all architectures convert to using "new".

Opportunistically wait to get the various "new" values until after
filtering out the DELETE case in anticipation of a future commit passing
NULL for @new when deleting a memslot.

20. KVM: RISC-V: Use common KVM implementation of MMU memory caches

Use common KVM's implementation of the MMU memory caches, which for all
intents and purposes is semantically identical to RISC-V's version, the
only difference being that the common implementation will fall back to an
atomic allocation if there's a KVM bug that triggers a cache underflow.

RISC-V appears to have based its MMU code on arm64 before the conversion
to the common caches in commit c1a33aebe91d ("KVM: arm64: Use common KVM
implementation of MMU memory caches"), despite having also copy-pasted
the definition of KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE in kvm_types.h.

Opportunistically drop the superfluous wrapper
kvm_riscv_stage2_flush_cache(), whose name is very, very confusing as
"cache flush" in the context of MMU code almost always refers to flushing
hardware caches, not freeing unused software objects.

No functional change intended.
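
Usage pattern of the common cache API this converts to: top up while
sleeping is allowed, then allocate with the MMU lock held (a sketch; the
capacity value is illustrative):

    static int map_one_gpa(struct kvm *kvm,
                           struct kvm_mmu_memory_cache *cache)
    {
            void *new_table;
            int ret;

            /* May sleep: pre-fill the cache before taking mmu_lock. */
            ret = kvm_mmu_topup_memory_cache(cache, 4 /* pgd levels */);
            if (ret)
                    return ret;

            spin_lock(&kvm->mmu_lock);
            /* Never sleeps; backed by the pre-filled cache. */
            new_table = kvm_mmu_memory_cache_alloc(cache);
            /* ... install new_table into the stage2 page table ... */
            spin_unlock(&kvm->mmu_lock);
            return 0;
    }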

21. RISC-V: KVM: Add SBI v0.2 base extension

The SBI v0.2 base extension is defined to allow backward compatibility
and probing of future extensions. It is also the only mandatory SBI
extension that must be implemented by SBI implementors.

22. RISC-V: KVM: Add SBI HSM extension in KVM

SBI HSM extension allows OS to start/stop harts any time. It also allows
ordered booting of harts instead of random booting.

Implement the SBI HSM extension and designate vcpu 0 as the boot vcpu id.
All other non-zero non-booting vcpus should be brought up by the OS
implementing HSM extension. If the guest OS doesn't implement HSM
extension, only a single vcpu will be available to the OS.
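
A hedged sketch of the hart_start path (field and helper names follow
the series only loosely):

    static int hsm_hart_start(struct kvm_vcpu *caller, unsigned long hartid,
                              unsigned long start_addr, unsigned long opaque)
    {
            struct kvm_vcpu *target = kvm_get_vcpu_by_id(caller->kvm, hartid);

            if (!target)
                    return SBI_ERR_INVALID_PARAM;
            if (!target->arch.power_off)
                    return SBI_ERR_ALREADY_AVAILABLE;

            /* Boot register state: pc = start_addr, a0 = hartid,
             * a1 = opaque, applied via the VCPU reset request. */
            target->arch.guest_reset_context.sepc = start_addr;
            target->arch.guest_reset_context.a0 = hartid;
            target->arch.guest_reset_context.a1 = opaque;
            kvm_make_request(KVM_REQ_VCPU_RESET, target);
            kvm_riscv_vcpu_power_on(target);
            return 0;
    }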

23. RISC-V: KVM: Forward SBI experimental and vendor extensions

The SBI experimental extension space is for temporary (or experimental)
stuff whereas SBI vendor extension space is for hardware vendor specific
stuff. Both these SBI extension spaces won't be standardized by the SBI
specification so let's blindly forward such SBI calls to the userspace.

24. KVM: RISC-V: Avoid spurious virtual interrupts after clearing hideleg CSR

When the last VM is terminated, the host kernel will invoke function
hardware_disable_nolock() on each CPU to disable the related virtualization
functions. Here, RISC-V currently only clears hideleg CSR and hedeleg CSR.
This behavior will cause the host kernel to receive spurious interrupts if
hvip CSR has pending interrupts and the corresponding enable bits in vsie
CSR are asserted. To avoid it, hvip CSR and vsie CSR must be cleared
before clearing hideleg CSR.

25. kvm/riscv: rework guest entry logic

In kvm_arch_vcpu_ioctl_run() we enter an RCU extended quiescent state
(EQS) by calling guest_enter_irqoff(), and unmask IRQs prior to exiting
the EQS by calling guest_exit(). As the IRQ entry code will not wake RCU
in this case, we may run the core IRQ code and IRQ handler without RCU
watching, leading to various potential problems.

Additionally, we do not inform lockdep or tracing that interrupts will
be enabled during guest execution, which can lead to misleading traces
and warnings that interrupts have been enabled for overly-long periods.

This patch fixes these issues by using the new timing and context
entry/exit helpers to ensure that interrupts are handled during guest
vtime but with RCU watching, with a sequence:

        guest_timing_enter_irqoff();

        guest_state_enter_irqoff();
        < run the vcpu >
        guest_state_exit_irqoff();

        < take any pending IRQs >

        guest_timing_exit_irqoff();

Since instrumentation may make use of RCU, we must also ensure that no
instrumented code is run during the EQS. I've split out the critical
section into a new kvm_riscv_enter_exit_vcpu() helper which is marked
noinstr.

Note: optimizations around RCU EQS handling and interrupt masking on
guest entry/exit.

26. RISC-V: KVM: Add common kvm_riscv_vcpu_sbi_system_reset() function

We rename kvm_sbi_system_shutdown() to kvm_riscv_vcpu_sbi_system_reset()
and move it to vcpu_sbi.c so that it can be shared by SBI v0.1 shutdown
and SBI v0.3 SRST extension.

The SBI v0.3 specification defines SRST (System Reset) extension which
provides a standard poweroff and reboot interface. This patch implements
SRST extension for the KVM Guest.

Note: guest OS poweroff/reboot support.

27. RISC-V: KVM: Add common kvm_riscv_vcpu_wfi() function

The wait for interrupt (WFI) instruction emulation can share the VCPU
halt logic with SBI HSM suspend emulation so this patch adds a common
kvm_riscv_vcpu_wfi() function for this purpose.

28. RISC-V: KVM: Implement SBI HSM suspend call

The SBI v0.3 specification extends SBI HSM extension by adding SBI HSM
suspend call and related HART states. This patch extends the KVM RISC-V
HSM implementation to provide KVM guest a minimal SBI HSM suspend call
which is equivalent to a WFI instruction.

29. RISC-V: KVM: Add Sv57x4 mode support for G-stage

Latest QEMU supports G-stage Sv57x4 mode so this patch extends KVM
RISC-V G-stage handling to detect and use Sv57x4 mode when available.

30. RISC-V: KVM: Treat SBI HFENCE calls as NOPs

We should treat SBI HFENCE calls as NOPs until nested virtualization
is supported by KVM RISC-V. This will help us test booting a hypervisor
under KVM RISC-V.

31. RISC-V: KVM: Add remote HFENCE functions based on VCPU requests

The generic KVM has support for VCPU requests which can be used
to do arch-specific work in the run-loop. We introduce remote
HFENCE functions which will internally use VCPU requests instead
of host SBI calls.

Advantages of doing remote HFENCEs as VCPU requests are:
1) Multiple VCPUs of a Guest may be running on different Host CPUs
   so it is not always possible to determine the Host CPU mask for
   doing Host SBI call. For example, when VCPU X wants to do HFENCE
   on VCPU Y, it is possible that VCPU Y is blocked or in user-space
   (i.e. vcpu->cpu < 0).
2) To support nested virtualization, we will be having a separate
   shadow G-stage for each VCPU and a common host G-stage for the
   entire Guest/VM. The VCPU requests based remote HFENCEs helps
   us easily synchronize the common host G-stage and shadow G-stage
   of each VCPU without any additional IPI calls.

This is also a preparatory patch for upcoming nested virtualization
support where we will be having a shadow G-stage page table for
each Guest VCPU.
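
The mechanism reuses the generic VCPU-request machinery; a sketch of
both halves (request and helper names from the series, the vmid access
path simplified):

    /* Sender: queue the request on every VCPU and kick them. */
    static void remote_hfence_gvma_vmid_all(struct kvm *kvm)
    {
            kvm_make_all_cpus_request(kvm, KVM_REQ_HFENCE_GVMA_VMID_ALL);
    }

    /* Receiver: handled in the run loop before re-entering the guest. */
    static void check_vcpu_requests(struct kvm_vcpu *vcpu)
    {
            if (kvm_check_request(KVM_REQ_HFENCE_GVMA_VMID_ALL, vcpu))
                    kvm_riscv_local_hfence_gvma_vmid_all(
                            READ_ONCE(vcpu->kvm->arch.vmid.vmid));
    }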

32. RISC-V: KVM: Cleanup stale TLB entries when host CPU changes

On RISC-V platforms with hardware VMID support, we share same
VMID for all VCPUs of a particular Guest/VM. This means we might
have stale G-stage TLB entries on the current Host CPU due to
some other VCPU of the same Guest which ran previously on the
current Host CPU.

To cleanup stale TLB entries, we simply flush all G-stage TLB
entries by VMID whenever underlying Host CPU changes for a VCPU.

Note: VMID-related; flushes all G-stage TLB entries (by VMID) when a
VCPU is re-run on a different host CPU.

33. RISC-V: KVM: Add extensible system instruction emulation framework

We will be emulating more system instructions in near future with
upcoming AIA, PMU, Nested and other virtualization features.

To accommodate above, we add an extensible system instruction emulation
framework in vcpu_insn.c.
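
The framework is a mask/match table over the trapped instruction word; a
sketch close to the in-tree vcpu_insn.c:

    struct insn_func {
            unsigned long mask;
            unsigned long match;
            int (*func)(struct kvm_vcpu *vcpu, struct kvm_run *run,
                        ulong insn);
    };

    static const struct insn_func system_opcode_funcs[] = {
            { .mask = INSN_MASK_WFI, .match = INSN_MATCH_WFI,
              .func = wfi_insn },
            /* AIA/PMU/Nested instructions add entries here. */
    };

    static int system_opcode_insn(struct kvm_vcpu *vcpu,
                                  struct kvm_run *run, ulong insn)
    {
            int i;

            for (i = 0; i < ARRAY_SIZE(system_opcode_funcs); i++) {
                    const struct insn_func *ifn = &system_opcode_funcs[i];

                    if ((insn & ifn->mask) == ifn->match)
                            return ifn->func(vcpu, run, insn);
            }
            /* Unknown system instruction: redirect as illegal insn. */
            return truly_illegal_insn(vcpu, run, insn);
    }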

34. RISC-V: KVM: Add extensible CSR emulation framework

We add an extensible CSR emulation framework which is based upon the
existing system instruction emulation. This will be useful to upcoming
AIA, PMU, Nested and other virtualization features.

The CSR emulation framework also has provision to emulate CSR in user
space but this will be used only in very specific cases such as AIA
IMSIC CSR emulation in user space or vendor specific CSR emulation
in user space.

By default, all CSRs not handled by KVM RISC-V will be redirected back
to Guest VCPU as illegal instruction trap.

35. RISC-V: KVM: Add G-stage ioremap() and iounmap() functions

The in-kernel AIA IMSIC support requires on-demand mapping / unmapping
of Guest IMSIC address to Host IMSIC guest files. To help achieve this,
we add kvm_riscv_stage2_ioremap() and kvm_riscv_stage2_iounmap() functions.
These new functions for updating G-stage page table mappings will be called
in atomic context so we have special "in_atomic" parameter for this purpose.

Note: interrupt (AIA IMSIC) related.

36. RISC-V: KVM: Add support for Svpbmt inside Guest/VM

The Guest/VM can use Svpbmt in VS-stage page tables when allowed by the
Hypervisor using the henvcfg.PBMTE bit.

We add Svpbmt support for the KVM Guest/VM which can be enabled/disabled
by the KVM user-space (QEMU/KVMTOOL) using the ISA extension ONE_REG
interface.

37. RISC-V: KVM: Support sstc extension

Sstc extension allows the guest to program the vstimecmp CSR directly
instead of making an SBI call to the hypervisor to program the next
event. The timer interrupt is also directly injected to the guest by
the hardware in this case. To maintain backward compatibility, the
hypervisors also update the vstimecmp in an SBI set_time call if
the hardware supports it. Thus, the older kernels in guest also
take advantage of the sstc extension.
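
A sketch of the resulting set_time path on RV64 (the Sstc probe is real;
the hrtimer fallback name is illustrative):

    static void timer_next_event(struct kvm_vcpu *vcpu, u64 ncycles)
    {
            if (riscv_isa_extension_available(NULL, SSTC)) {
                    /* Hardware injects the timer interrupt directly. */
                    csr_write(CSR_VSTIMECMP, ncycles);
                    return;
            }
            /* Pre-Sstc fallback: per-VCPU hrtimer + injected IRQ. */
            start_vcpu_hrtimer(vcpu, ncycles);
    }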

38. RISC-V: KVM: Allow Guest use Svinval extension

We should advertise Svinval ISA extension to KVM user-space whenever
host supports it. This will allow KVM user-space (i.e. QEMU or KVMTOOL)
to pass on this information to Guest via ISA string.

RISC-V: KVM: Allow Guest use Zihintpause extension

We should advertise Zihintpause ISA extension to KVM user-space whenever
host supports it. This will allow KVM user-space (i.e. QEMU or KVMTOOL)
to pass on this information to Guest via ISA string.

RISC-V: KVM: Provide UAPI for Zicbom block size

We're about to allow guests to use the Zicbom extension. KVM
userspace needs to know the cache block size in order to
properly advertise it to the guest. Provide a virtual config
register for userspace to get it with the GET_ONE_REG API, but
setting it cannot be supported, so disallow SET_ONE_REG.

RISC-V: KVM: Expose Zicbom to the guest

Guests may use the cbo.inval,clean,flush instructions when the
CPU has the Zicbom extension and the hypervisor sets henvcfg.CBIE
(for cbo.inval) and henvcfg.CBCFE (for cbo.clean,flush).

Add Zicbom support for KVM guests which may be enabled and
disabled from KVM userspace using the ISA extension ONE_REG API.

Also opportunistically switch the other isa extension checks in
kvm_riscv_vcpu_update_config() to riscv_isa_extension_available().

39. RISC-V: KVM: Save mvendorid, marchid, and mimpid when creating VCPU

We should save the VCPU mvendorid, marchid, and mimpid at the time
of creating the VCPU so that we don't have to do a host SBI call every
time the Guest/VM asks for these details.

RISC-V: KVM: Add ONE_REG interface for mvendorid, marchid, and mimpid

We add ONE_REG interface for VCPU mvendorid, marchid, and mimpid
so that KVM user-space can change these details to support migration
across heterogeneous hosts.

Note: adds the GET_ONE_REG/SET_ONE_REG interfaces for mvendorid,
marchid, and mimpid (consumed by QEMU).

Summary

  1. VCPU create, init, and destroy functions required by the generic
     KVM module.

  2. VCPU interrupts and requests: sleep (KVM_REQ_SLEEP), reset
     (KVM_REQ_VCPU_RESET), and WFI handling; user space can use the
     GET_MPSTATE/SET_MPSTATE ioctls to get/set a VCPU's power state.

  3. KVM_GET_ONE_REG/KVM_SET_ONE_REG ioctls to read/write the VCPU
     config registers, general-purpose registers, control and status
     registers, and floating-point registers, plus reading of mvendorid,
     mimpid (version of the processor implementation), and marchid.

  4. Save/restore of register state across the world switch, covering
     the register groups above as well as the corresponding host
     registers.

  5. MMIO exit handling: extract the MMIO access details from the
     trapped load/store instruction and forward the read/write to user
     space. The KVM kernel module extracts those details from the guest
     page fault, hands them to the user-space VMM (e.g. QEMU), and
     updates the registers before resuming the trapped VCPU.

  6. A VMID allocator/manager for guests/VMs; whenever the VMID changes,
     the KVM_REQ_UPDATE_HGATP VCPU request forces an update of the HW
     Stage2 VMID for every guest VCPU.

  7. All functions required to program the stage2 page table for each
     guest/VM. At a high level, the flow of the stage2 functions is
     similar to the KVM ARM/ARM64 implementation, but the stage2
     page-table format of KVM RISC-V is quite different.

  8. SBI support: when the guest executes an SBI ecall it traps into the
     hypervisor, so the KVM module must implement the corresponding SBI
     handlers, tracking the evolution of the spec from v0.1 to v0.2 to
     v0.3.

     Almost all SBI v0.1 calls are implemented in the KVM kernel module,
     except GETCHAR and PUTCHAR, which are forwarded to user space; SBI
     v0.2 experimental and vendor extension calls are forwarded to user
     space as well.

     The SBI HSM extension lets the OS start/stop harts at any time and
     boot harts in an ordered fashion instead of randomly. VCPU 0 is
     designated as the boot vcpu; all other VCPUs are brought up by an
     OS implementing the HSM extension. If the guest OS does not
     implement HSM, only a single VCPU is available to it.

     The SRST (System Reset) extension and the HSM suspend call from SBI
     v0.3 are implemented.

     The HFENCE extension is implemented in preparation for future
     nested virtualization.

  9. Optimizations around dynamic allocation of the memslot (gpa->hva
     mapping) data in KVM.

  10. An extensible system-instruction emulation framework for the
      upcoming AIA, PMU, Nested, and other virtualization features. The
      CSR emulation framework can also emulate CSRs in user space, but
      only in very specific cases such as user-space AIA IMSIC CSR
      emulation or vendor-specific CSR emulation.

  11. Sstc support: the guest programs the vstimecmp CSR directly
      instead of making an SBI call to the hypervisor to program the
      next event, and the timer interrupt is injected into the guest
      directly by the hardware. For backward compatibility, the
      hypervisor also updates vstimecmp in the SBI set_time call when
      the hardware supports it, so older guest kernels benefit from
      Sstc too.