KVM currenty fails a nested VMRUN and injects VMEXIT_INVALID (aka
SVM_EXIT_ERR) if L1 sets NP_ENABLE and the host does not support NPTs.
On first glance, it seems like the check should actually be for
guest_cpu_cap_has(X86_FEATURE_NPT) instead, as it is possible for the
host to support NPTs but the guest CPUID to not advertise it.
However, the consistency check is not architectural to begin with. The
APM does not mention VMEXIT_INVALID if NP_ENABLE is set on a processor
that does not have X86_FEATURE_NPT. Hence, NP_ENABLE should be ignored
if X86_FEATURE_NPT is not available for L1. Apart from the consistency
check, this is currently the case because NP_ENABLE is actually copied
from VMCB01 to VMCB02, not from VMCB12.
On the other hand, the APM does mention two other consistency checks for
NP_ENABLE, both of which are missing (paraphrased):
In Volume #2, 15.25.3 (24593—Rev. 3.42—March 2024):
If VMRUN is executed with hCR0.PG cleared to zero and NP_ENABLE set to
1, VMRUN terminates with #VMEXIT(VMEXIT_INVALID)
In Volume #2, 15.25.4 (24593—Rev. 3.42—March 2024):
When VMRUN is executed with nested paging enabled (NP_ENABLE = 1), the
following conditions are considered illegal state combinations, in
addition to those mentioned in “Canonicalization and Consistency
Checks”:
• Any MBZ bit of nCR3 is set.
• Any G_PAT.PA field has an unsupported type encoding or any
reserved field in G_PAT has a nonzero value.
Replace the existing consistency check with consistency checks on
hCR0.PG and nCR3. Only perform the consistency checks if L1 has
X86_FEATURE_NPT and NP_ENABLE is set in VMCB12. The G_PAT consistency
check will be addressed separately.
As it is now possible for an L1 to run L2 with NP_ENABLE set but
ignored, also check that L1 has X86_FEATURE_NPT in nested_npt_enabled().
Pass L1's CR0 to __nested_vmcb_check_controls(). In
nested_vmcb_check_controls(), L1's CR0 is available through
kvm_read_cr0(), as vcpu->arch.cr0 is not updated to L2's CR0 until later
through nested_vmcb02_prepare_save() -> svm_set_cr0().
In svm_set_nested_state(), L1's CR0 is available in the captured save
area, as svm_get_nested_state() captures L1's save area when running L2,
and L1's CR0 is stashed in VMCB01 on nested VMRUN (in
nested_svm_vmrun()).
Fixes: 4b16184c1cca ("KVM: SVM: Initialize Nested Nested MMU context on VMRUN")
Cc: stable(a)vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry.ahmed(a)linux.dev>
---
arch/x86/kvm/svm/nested.c | 21 ++++++++++++++++-----
arch/x86/kvm/svm/svm.h | 3 ++-
2 files changed, 18 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 74211c5c68026..87bcc5eff96e8 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -325,7 +325,8 @@ static bool nested_svm_check_bitmap_pa(struct kvm_vcpu *vcpu, u64 pa, u32 size)
}
static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
- struct vmcb_ctrl_area_cached *control)
+ struct vmcb_ctrl_area_cached *control,
+ unsigned long l1_cr0)
{
if (CC(!vmcb12_is_intercept(control, INTERCEPT_VMRUN)))
return false;
@@ -333,8 +334,12 @@ static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
if (CC(control->asid == 0))
return false;
- if (CC((control->nested_ctl & SVM_NESTED_CTL_NP_ENABLE) && !npt_enabled))
- return false;
+ if (nested_npt_enabled(to_svm(vcpu))) {
+ if (CC(!kvm_vcpu_is_legal_gpa(vcpu, control->nested_cr3)))
+ return false;
+ if (CC(!(l1_cr0 & X86_CR0_PG)))
+ return false;
+ }
if (CC(!nested_svm_check_bitmap_pa(vcpu, control->msrpm_base_pa,
MSRPM_SIZE)))
@@ -400,7 +405,12 @@ static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu)
struct vcpu_svm *svm = to_svm(vcpu);
struct vmcb_ctrl_area_cached *ctl = &svm->nested.ctl;
- return __nested_vmcb_check_controls(vcpu, ctl);
+ /*
+ * Make sure we did not enter guest mode yet, in which case
+ * kvm_read_cr0() could return L2's CR0.
+ */
+ WARN_ON_ONCE(is_guest_mode(vcpu));
+ return __nested_vmcb_check_controls(vcpu, ctl, kvm_read_cr0(vcpu));
}
static
@@ -1831,7 +1841,8 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
ret = -EINVAL;
__nested_copy_vmcb_control_to_cache(vcpu, &ctl_cached, ctl);
- if (!__nested_vmcb_check_controls(vcpu, &ctl_cached))
+ /* 'save' contains L1 state saved from before VMRUN */
+ if (!__nested_vmcb_check_controls(vcpu, &ctl_cached, save->cr0))
goto out_free;
/*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index f6fb70ddf7272..3e805a43ffcdb 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -552,7 +552,8 @@ static inline bool gif_set(struct vcpu_svm *svm)
static inline bool nested_npt_enabled(struct vcpu_svm *svm)
{
- return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
+ return guest_cpu_cap_has(&svm->vcpu, X86_FEATURE_NPT) &&
+ svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
}
static inline bool nested_vnmi_enabled(struct vcpu_svm *svm)
--
2.51.2.1041.gc1ab5b90ca-goog
The struct ip_tunnel_info has a flexible array member named
options that is protected by a counted_by(options_len)
attribute.
The compiler will use this information to enforce runtime bounds
checking deployed by FORTIFY_SOURCE string helpers.
As laid out in the GCC documentation, the counter must be
initialized before the first reference to the flexible array
member.
In the normal case the ip_tunnel_info_opts_set() helper is used
which would initialize options_len properly, however in the GRE
ERSPAN code a partial update is done, preventing the use of the
helper function.
Before this change the handling of ERSPAN traffic in GRE tunnels
would cause a kernel panic when the kernel is compiled with
GCC 15+ and having FORTIFY_SOURCE configured:
memcpy: detected buffer overflow: 4 byte write of buffer size 0
Call Trace:
<IRQ>
__fortify_panic+0xd/0xf
erspan_rcv.cold+0x68/0x83
? ip_route_input_slow+0x816/0x9d0
gre_rcv+0x1b2/0x1c0
gre_rcv+0x8e/0x100
? raw_v4_input+0x2a0/0x2b0
ip_protocol_deliver_rcu+0x1ea/0x210
ip_local_deliver_finish+0x86/0x110
ip_local_deliver+0x65/0x110
? ip_rcv_finish_core+0xd6/0x360
ip_rcv+0x186/0x1a0
Link: https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html#index-co…
Reported-at: https://launchpad.net/bugs/2129580
Fixes: bb5e62f2d547 ("net: Add options as a flexible array to struct ip_tunnel_info")
Signed-off-by: Frode Nordahl <fnordahl(a)ubuntu.com>
---
net/ipv4/ip_gre.c | 18 ++++++++++++++++--
net/ipv6/ip6_gre.c | 18 ++++++++++++++++--
2 files changed, 32 insertions(+), 4 deletions(-)
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 761a53c6a89a..285a656c9e41 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -330,6 +330,22 @@ static int erspan_rcv(struct sk_buff *skb, struct tnl_ptk_info *tpi,
if (!tun_dst)
return PACKET_REJECT;
+ /* The struct ip_tunnel_info has a flexible array member named
+ * options that is protected by a counted_by(options_len)
+ * attribute.
+ *
+ * The compiler will use this information to enforce runtime bounds
+ * checking deployed by FORTIFY_SOURCE string helpers.
+ *
+ * As laid out in the GCC documentation, the counter must be
+ * initialized before the first reference to the flexible array
+ * member.
+ *
+ * Link: https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html#index-co…
+ */
+ info = &tun_dst->u.tun_info;
+ info->options_len = sizeof(*md);
+
/* skb can be uncloned in __iptunnel_pull_header, so
* old pkt_md is no longer valid and we need to reset
* it
@@ -344,10 +360,8 @@ static int erspan_rcv(struct sk_buff *skb, struct tnl_ptk_info *tpi,
memcpy(md2, pkt_md, ver == 1 ? ERSPAN_V1_MDSIZE :
ERSPAN_V2_MDSIZE);
- info = &tun_dst->u.tun_info;
__set_bit(IP_TUNNEL_ERSPAN_OPT_BIT,
info->key.tun_flags);
- info->options_len = sizeof(*md);
}
skb_reset_mac_header(skb);
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index c82a75510c0e..eb840a11b93b 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -535,6 +535,22 @@ static int ip6erspan_rcv(struct sk_buff *skb,
if (!tun_dst)
return PACKET_REJECT;
+ /* The struct ip_tunnel_info has a flexible array member named
+ * options that is protected by a counted_by(options_len)
+ * attribute.
+ *
+ * The compiler will use this information to enforce runtime bounds
+ * checking deployed by FORTIFY_SOURCE string helpers.
+ *
+ * As laid out in the GCC documentation, the counter must be
+ * initialized before the first reference to the flexible array
+ * member.
+ *
+ * Link: https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html#index-co…
+ */
+ info = &tun_dst->u.tun_info;
+ info->options_len = sizeof(*md);
+
/* skb can be uncloned in __iptunnel_pull_header, so
* old pkt_md is no longer valid and we need to reset
* it
@@ -543,7 +559,6 @@ static int ip6erspan_rcv(struct sk_buff *skb,
skb_network_header_len(skb);
pkt_md = (struct erspan_metadata *)(gh + gre_hdr_len +
sizeof(*ershdr));
- info = &tun_dst->u.tun_info;
md = ip_tunnel_info_opts(info);
md->version = ver;
md2 = &md->u.md2;
@@ -551,7 +566,6 @@ static int ip6erspan_rcv(struct sk_buff *skb,
ERSPAN_V2_MDSIZE);
__set_bit(IP_TUNNEL_ERSPAN_OPT_BIT,
info->key.tun_flags);
- info->options_len = sizeof(*md);
ip6_tnl_rcv(tunnel, skb, tpi, tun_dst, log_ecn_error);
--
2.43.0
From: Chen Linxuan <me(a)black-desk.cn>
When using fsconfig(..., FSCONFIG_CMD_CREATE, ...), the filesystem
context is retrieved from the file descriptor. Since the file structure
persists across syscall restarts, the context state is preserved:
// fs/fsopen.c
SYSCALL_DEFINE5(fsconfig, ...)
{
...
fc = fd_file(f)->private_data;
...
ret = vfs_fsconfig_locked(fc, cmd, ¶m);
...
}
In vfs_cmd_create(), the context phase is transitioned to
FS_CONTEXT_CREATING before calling vfs_get_tree():
// fs/fsopen.c
static int vfs_cmd_create(struct fs_context *fc, bool exclusive)
{
...
fc->phase = FS_CONTEXT_CREATING;
...
ret = vfs_get_tree(fc);
...
}
However, vfs_get_tree() may return -ERESTARTNOINTR if the filesystem
implementation needs to restart the syscall. For example, cgroup v1 does
this when it encounters a race condition where the root is dying:
// kernel/cgroup/cgroup-v1.c
int cgroup1_get_tree(struct fs_context *fc)
{
...
if (unlikely(ret > 0)) {
msleep(10);
return restart_syscall();
}
return ret;
}
If the syscall is restarted, fsconfig() is called again and retrieves
the *same* fs_context. However, vfs_cmd_create() rejects the call
because the phase was left as FS_CONTEXT_CREATING during the first
attempt:
if (fc->phase != FS_CONTEXT_CREATE_PARAMS)
return -EBUSY;
Fix this by resetting fc->phase back to FS_CONTEXT_CREATE_PARAMS if
vfs_get_tree() returns -ERESTARTNOINTR.
Cc: stable(a)vger.kernel.org
Signed-off-by: Chen Linxuan <me(a)black-desk.cn>
---
fs/fsopen.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/fsopen.c b/fs/fsopen.c
index f645c99204eb..8a7cb031af50 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -229,6 +229,10 @@ static int vfs_cmd_create(struct fs_context *fc, bool exclusive)
fc->exclusive = exclusive;
ret = vfs_get_tree(fc);
+ if (ret == -ERESTARTNOINTR) {
+ fc->phase = FS_CONTEXT_CREATE_PARAMS;
+ return ret;
+ }
if (ret) {
fc->phase = FS_CONTEXT_FAILED;
return ret;
---
base-commit: 187d0801404f415f22c0b31531982c7ea97fa341
change-id: 20251213-mount-ebusy-8ee3888a7e4f
Best regards,
--
Chen Linxuan <me(a)black-desk.cn>