The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 179c6c27bf487273652efc99acd3ba512a23c137 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Tue, 3 Aug 2021 09:27:46 -0700
Subject: [PATCH] KVM: SVM: Fix off-by-one indexing when nullifying last used
SEV VMCB
Use the raw ASID, not ASID-1, when nullifying the last used VMCB when
freeing an SEV ASID. The consumer, pre_sev_run(), indexes the array by
the raw ASID, thus KVM could get a false negative when checking for a
different VMCB if KVM manages to reallocate the same ASID+VMCB combo for
a new VM.
Note, this cannot cause a functional issue _in the current code_, as
pre_sev_run() also checks which pCPU last did VMRUN for the vCPU, and
last_vmentry_cpu is initialized to -1 during vCPU creation, i.e. is
guaranteed to mismatch on the first VMRUN. However, prior to commit
8a14fe4f0c54 ("kvm: x86: Move last_cpu into kvm_vcpu_arch as
last_vmentry_cpu"), SVM tracked pCPU on its own and zero-initialized the
last_cpu variable. Thus it's theoretically possible that older versions
of KVM could miss a TLB flush if the first VMRUN is on pCPU0 and the ASID
and VMCB exactly match those of a prior VM.
Fixes: 70cd94e60c73 ("KVM: SVM: VMRUN should use associated ASID when SEV is enabled")
Cc: Tom Lendacky <thomas.lendacky(a)amd.com>
Cc: Brijesh Singh <brijesh.singh(a)amd.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 6710d9ee2e4b..4d0aba185412 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -189,7 +189,7 @@ static void sev_asid_free(struct kvm_sev_info *sev)
for_each_possible_cpu(cpu) {
sd = per_cpu(svm_data, cpu);
- sd->sev_vmcbs[pos] = NULL;
+ sd->sev_vmcbs[sev->asid] = NULL;
}
mutex_unlock(&sev_bitmap_lock);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 179c6c27bf487273652efc99acd3ba512a23c137 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Tue, 3 Aug 2021 09:27:46 -0700
Subject: [PATCH] KVM: SVM: Fix off-by-one indexing when nullifying last used
SEV VMCB
Use the raw ASID, not ASID-1, when nullifying the last used VMCB when
freeing an SEV ASID. The consumer, pre_sev_run(), indexes the array by
the raw ASID, thus KVM could get a false negative when checking for a
different VMCB if KVM manages to reallocate the same ASID+VMCB combo for
a new VM.
Note, this cannot cause a functional issue _in the current code_, as
pre_sev_run() also checks which pCPU last did VMRUN for the vCPU, and
last_vmentry_cpu is initialized to -1 during vCPU creation, i.e. is
guaranteed to mismatch on the first VMRUN. However, prior to commit
8a14fe4f0c54 ("kvm: x86: Move last_cpu into kvm_vcpu_arch as
last_vmentry_cpu"), SVM tracked pCPU on its own and zero-initialized the
last_cpu variable. Thus it's theoretically possible that older versions
of KVM could miss a TLB flush if the first VMRUN is on pCPU0 and the ASID
and VMCB exactly match those of a prior VM.
Fixes: 70cd94e60c73 ("KVM: SVM: VMRUN should use associated ASID when SEV is enabled")
Cc: Tom Lendacky <thomas.lendacky(a)amd.com>
Cc: Brijesh Singh <brijesh.singh(a)amd.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 6710d9ee2e4b..4d0aba185412 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -189,7 +189,7 @@ static void sev_asid_free(struct kvm_sev_info *sev)
for_each_possible_cpu(cpu) {
sd = per_cpu(svm_data, cpu);
- sd->sev_vmcbs[pos] = NULL;
+ sd->sev_vmcbs[sev->asid] = NULL;
}
mutex_unlock(&sev_bitmap_lock);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 179c6c27bf487273652efc99acd3ba512a23c137 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc(a)google.com>
Date: Tue, 3 Aug 2021 09:27:46 -0700
Subject: [PATCH] KVM: SVM: Fix off-by-one indexing when nullifying last used
SEV VMCB
Use the raw ASID, not ASID-1, when nullifying the last used VMCB when
freeing an SEV ASID. The consumer, pre_sev_run(), indexes the array by
the raw ASID, thus KVM could get a false negative when checking for a
different VMCB if KVM manages to reallocate the same ASID+VMCB combo for
a new VM.
Note, this cannot cause a functional issue _in the current code_, as
pre_sev_run() also checks which pCPU last did VMRUN for the vCPU, and
last_vmentry_cpu is initialized to -1 during vCPU creation, i.e. is
guaranteed to mismatch on the first VMRUN. However, prior to commit
8a14fe4f0c54 ("kvm: x86: Move last_cpu into kvm_vcpu_arch as
last_vmentry_cpu"), SVM tracked pCPU on its own and zero-initialized the
last_cpu variable. Thus it's theoretically possible that older versions
of KVM could miss a TLB flush if the first VMRUN is on pCPU0 and the ASID
and VMCB exactly match those of a prior VM.
Fixes: 70cd94e60c73 ("KVM: SVM: VMRUN should use associated ASID when SEV is enabled")
Cc: Tom Lendacky <thomas.lendacky(a)amd.com>
Cc: Brijesh Singh <brijesh.singh(a)amd.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 6710d9ee2e4b..4d0aba185412 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -189,7 +189,7 @@ static void sev_asid_free(struct kvm_sev_info *sev)
for_each_possible_cpu(cpu) {
sd = per_cpu(svm_data, cpu);
- sd->sev_vmcbs[pos] = NULL;
+ sd->sev_vmcbs[sev->asid] = NULL;
}
mutex_unlock(&sev_bitmap_lock);
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 85cd39af14f498f791d8aab3fbd64cd175787f1a Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini(a)redhat.com>
Date: Wed, 4 Aug 2021 05:28:52 -0400
Subject: [PATCH] KVM: Do not leak memory for duplicate debugfs directories
KVM creates a debugfs directory for each VM in order to store statistics
about the virtual machine. The directory name is built from the process
pid and a VM fd. While generally unique, it is possible to keep a
file descriptor alive in a way that causes duplicate directories, which
manifests as these messages:
[ 471.846235] debugfs: Directory '20245-4' with parent 'kvm' already present!
Even though this should not happen in practice, it is more or less
expected in the case of KVM for testcases that call KVM_CREATE_VM and
close the resulting file descriptor repeatedly and in parallel.
When this happens, debugfs_create_dir() returns an error but
kvm_create_vm_debugfs() goes on to allocate stat data structs which are
later leaked. The slow memory leak was spotted by syzkaller, where it
caused OOM reports.
Since the issue only affects debugfs, do a lookup before calling
debugfs_create_dir, so that the message is downgraded and rate-limited.
While at it, ensure kvm->debugfs_dentry is NULL rather than an error
if it is not created. This fixes kvm_destroy_vm_debugfs, which was not
checking IS_ERR_OR_NULL correctly.
Cc: stable(a)vger.kernel.org
Fixes: 536a6f88c49d ("KVM: Create debugfs dir and stat files for each VM")
Reported-by: Alexey Kardashevskiy <aik(a)ozlabs.ru>
Suggested-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Acked-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d20fba0fc290..b50dbe269f4b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -892,6 +892,8 @@ static void kvm_destroy_vm_debugfs(struct kvm *kvm)
static int kvm_create_vm_debugfs(struct kvm *kvm, int fd)
{
+ static DEFINE_MUTEX(kvm_debugfs_lock);
+ struct dentry *dent;
char dir_name[ITOA_MAX_LEN * 2];
struct kvm_stat_data *stat_data;
const struct _kvm_stats_desc *pdesc;
@@ -903,8 +905,20 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, int fd)
return 0;
snprintf(dir_name, sizeof(dir_name), "%d-%d", task_pid_nr(current), fd);
- kvm->debugfs_dentry = debugfs_create_dir(dir_name, kvm_debugfs_dir);
+ mutex_lock(&kvm_debugfs_lock);
+ dent = debugfs_lookup(dir_name, kvm_debugfs_dir);
+ if (dent) {
+ pr_warn_ratelimited("KVM: debugfs: duplicate directory %s\n", dir_name);
+ dput(dent);
+ mutex_unlock(&kvm_debugfs_lock);
+ return 0;
+ }
+ dent = debugfs_create_dir(dir_name, kvm_debugfs_dir);
+ mutex_unlock(&kvm_debugfs_lock);
+ if (IS_ERR(dent))
+ return 0;
+ kvm->debugfs_dentry = dent;
kvm->debugfs_stat_data = kcalloc(kvm_debugfs_num_entries,
sizeof(*kvm->debugfs_stat_data),
GFP_KERNEL_ACCOUNT);
@@ -5201,7 +5215,7 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm)
}
add_uevent_var(env, "PID=%d", kvm->userspace_pid);
- if (!IS_ERR_OR_NULL(kvm->debugfs_dentry)) {
+ if (kvm->debugfs_dentry) {
char *tmp, *p = kmalloc(PATH_MAX, GFP_KERNEL_ACCOUNT);
if (p) {
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 85cd39af14f498f791d8aab3fbd64cd175787f1a Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini(a)redhat.com>
Date: Wed, 4 Aug 2021 05:28:52 -0400
Subject: [PATCH] KVM: Do not leak memory for duplicate debugfs directories
KVM creates a debugfs directory for each VM in order to store statistics
about the virtual machine. The directory name is built from the process
pid and a VM fd. While generally unique, it is possible to keep a
file descriptor alive in a way that causes duplicate directories, which
manifests as these messages:
[ 471.846235] debugfs: Directory '20245-4' with parent 'kvm' already present!
Even though this should not happen in practice, it is more or less
expected in the case of KVM for testcases that call KVM_CREATE_VM and
close the resulting file descriptor repeatedly and in parallel.
When this happens, debugfs_create_dir() returns an error but
kvm_create_vm_debugfs() goes on to allocate stat data structs which are
later leaked. The slow memory leak was spotted by syzkaller, where it
caused OOM reports.
Since the issue only affects debugfs, do a lookup before calling
debugfs_create_dir, so that the message is downgraded and rate-limited.
While at it, ensure kvm->debugfs_dentry is NULL rather than an error
if it is not created. This fixes kvm_destroy_vm_debugfs, which was not
checking IS_ERR_OR_NULL correctly.
Cc: stable(a)vger.kernel.org
Fixes: 536a6f88c49d ("KVM: Create debugfs dir and stat files for each VM")
Reported-by: Alexey Kardashevskiy <aik(a)ozlabs.ru>
Suggested-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Acked-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d20fba0fc290..b50dbe269f4b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -892,6 +892,8 @@ static void kvm_destroy_vm_debugfs(struct kvm *kvm)
static int kvm_create_vm_debugfs(struct kvm *kvm, int fd)
{
+ static DEFINE_MUTEX(kvm_debugfs_lock);
+ struct dentry *dent;
char dir_name[ITOA_MAX_LEN * 2];
struct kvm_stat_data *stat_data;
const struct _kvm_stats_desc *pdesc;
@@ -903,8 +905,20 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, int fd)
return 0;
snprintf(dir_name, sizeof(dir_name), "%d-%d", task_pid_nr(current), fd);
- kvm->debugfs_dentry = debugfs_create_dir(dir_name, kvm_debugfs_dir);
+ mutex_lock(&kvm_debugfs_lock);
+ dent = debugfs_lookup(dir_name, kvm_debugfs_dir);
+ if (dent) {
+ pr_warn_ratelimited("KVM: debugfs: duplicate directory %s\n", dir_name);
+ dput(dent);
+ mutex_unlock(&kvm_debugfs_lock);
+ return 0;
+ }
+ dent = debugfs_create_dir(dir_name, kvm_debugfs_dir);
+ mutex_unlock(&kvm_debugfs_lock);
+ if (IS_ERR(dent))
+ return 0;
+ kvm->debugfs_dentry = dent;
kvm->debugfs_stat_data = kcalloc(kvm_debugfs_num_entries,
sizeof(*kvm->debugfs_stat_data),
GFP_KERNEL_ACCOUNT);
@@ -5201,7 +5215,7 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm)
}
add_uevent_var(env, "PID=%d", kvm->userspace_pid);
- if (!IS_ERR_OR_NULL(kvm->debugfs_dentry)) {
+ if (kvm->debugfs_dentry) {
char *tmp, *p = kmalloc(PATH_MAX, GFP_KERNEL_ACCOUNT);
if (p) {
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 85cd39af14f498f791d8aab3fbd64cd175787f1a Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini(a)redhat.com>
Date: Wed, 4 Aug 2021 05:28:52 -0400
Subject: [PATCH] KVM: Do not leak memory for duplicate debugfs directories
KVM creates a debugfs directory for each VM in order to store statistics
about the virtual machine. The directory name is built from the process
pid and a VM fd. While generally unique, it is possible to keep a
file descriptor alive in a way that causes duplicate directories, which
manifests as these messages:
[ 471.846235] debugfs: Directory '20245-4' with parent 'kvm' already present!
Even though this should not happen in practice, it is more or less
expected in the case of KVM for testcases that call KVM_CREATE_VM and
close the resulting file descriptor repeatedly and in parallel.
When this happens, debugfs_create_dir() returns an error but
kvm_create_vm_debugfs() goes on to allocate stat data structs which are
later leaked. The slow memory leak was spotted by syzkaller, where it
caused OOM reports.
Since the issue only affects debugfs, do a lookup before calling
debugfs_create_dir, so that the message is downgraded and rate-limited.
While at it, ensure kvm->debugfs_dentry is NULL rather than an error
if it is not created. This fixes kvm_destroy_vm_debugfs, which was not
checking IS_ERR_OR_NULL correctly.
Cc: stable(a)vger.kernel.org
Fixes: 536a6f88c49d ("KVM: Create debugfs dir and stat files for each VM")
Reported-by: Alexey Kardashevskiy <aik(a)ozlabs.ru>
Suggested-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Acked-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d20fba0fc290..b50dbe269f4b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -892,6 +892,8 @@ static void kvm_destroy_vm_debugfs(struct kvm *kvm)
static int kvm_create_vm_debugfs(struct kvm *kvm, int fd)
{
+ static DEFINE_MUTEX(kvm_debugfs_lock);
+ struct dentry *dent;
char dir_name[ITOA_MAX_LEN * 2];
struct kvm_stat_data *stat_data;
const struct _kvm_stats_desc *pdesc;
@@ -903,8 +905,20 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, int fd)
return 0;
snprintf(dir_name, sizeof(dir_name), "%d-%d", task_pid_nr(current), fd);
- kvm->debugfs_dentry = debugfs_create_dir(dir_name, kvm_debugfs_dir);
+ mutex_lock(&kvm_debugfs_lock);
+ dent = debugfs_lookup(dir_name, kvm_debugfs_dir);
+ if (dent) {
+ pr_warn_ratelimited("KVM: debugfs: duplicate directory %s\n", dir_name);
+ dput(dent);
+ mutex_unlock(&kvm_debugfs_lock);
+ return 0;
+ }
+ dent = debugfs_create_dir(dir_name, kvm_debugfs_dir);
+ mutex_unlock(&kvm_debugfs_lock);
+ if (IS_ERR(dent))
+ return 0;
+ kvm->debugfs_dentry = dent;
kvm->debugfs_stat_data = kcalloc(kvm_debugfs_num_entries,
sizeof(*kvm->debugfs_stat_data),
GFP_KERNEL_ACCOUNT);
@@ -5201,7 +5215,7 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm)
}
add_uevent_var(env, "PID=%d", kvm->userspace_pid);
- if (!IS_ERR_OR_NULL(kvm->debugfs_dentry)) {
+ if (kvm->debugfs_dentry) {
char *tmp, *p = kmalloc(PATH_MAX, GFP_KERNEL_ACCOUNT);
if (p) {
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e30e8d46cf605d216a799a28c77b8a41c328613a Mon Sep 17 00:00:00 2001
From: Mark Rutland <mark.rutland(a)arm.com>
Date: Mon, 2 Aug 2021 11:42:00 +0100
Subject: [PATCH] arm64: fix compat syscall return truncation
Due to inconsistencies in the way we manipulate compat GPRs, we have a
few issues today:
* For audit and tracing, where error codes are handled as a (native)
long, negative error codes are expected to be sign-extended to the
native 64-bits, or they may fail to be matched correctly. Thus a
syscall which fails with an error may erroneously be identified as
failing.
* For ptrace, *all* compat return values should be sign-extended for
consistency with 32-bit arm, but we currently only do this for
negative return codes.
* As we may transiently set the upper 32 bits of some compat GPRs while
in the kernel, these can be sampled by perf, which is somewhat
confusing. This means that where a syscall returns a pointer above 2G,
this will be sign-extended, but will not be mistaken for an error as
error codes are constrained to the inclusive range [-4096, -1] where
no user pointer can exist.
To fix all of these, we must consistently use helpers to get/set the
compat GPRs, ensuring that we never write the upper 32 bits of the
return code, and always sign-extend when reading the return code. This
patch does so, with the following changes:
* We re-organise syscall_get_return_value() to always sign-extend for
compat tasks, and reimplement syscall_get_error() atop. We update
syscall_trace_exit() to use syscall_get_return_value().
* We consistently use syscall_set_return_value() to set the return
value, ensureing the upper 32 bits are never set unexpectedly.
* As the core audit code currently uses regs_return_value() rather than
syscall_get_return_value(), we special-case this for
compat_user_mode(regs) such that this will do the right thing. Going
forward, we should try to move the core audit code over to
syscall_get_return_value().
Cc: <stable(a)vger.kernel.org>
Reported-by: He Zhe <zhe.he(a)windriver.com>
Reported-by: weiyuchen <weiyuchen3(a)huawei.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Will Deacon <will(a)kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas(a)arm.com>
Link: https://lore.kernel.org/r/20210802104200.21390-1-mark.rutland@arm.com
Signed-off-by: Will Deacon <will(a)kernel.org>
diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
index e58bca832dff..41b332c054ab 100644
--- a/arch/arm64/include/asm/ptrace.h
+++ b/arch/arm64/include/asm/ptrace.h
@@ -320,7 +320,17 @@ static inline unsigned long kernel_stack_pointer(struct pt_regs *regs)
static inline unsigned long regs_return_value(struct pt_regs *regs)
{
- return regs->regs[0];
+ unsigned long val = regs->regs[0];
+
+ /*
+ * Audit currently uses regs_return_value() instead of
+ * syscall_get_return_value(). Apply the same sign-extension here until
+ * audit is updated to use syscall_get_return_value().
+ */
+ if (compat_user_mode(regs))
+ val = sign_extend64(val, 31);
+
+ return val;
}
static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc)
diff --git a/arch/arm64/include/asm/syscall.h b/arch/arm64/include/asm/syscall.h
index cfc0672013f6..03e20895453a 100644
--- a/arch/arm64/include/asm/syscall.h
+++ b/arch/arm64/include/asm/syscall.h
@@ -29,22 +29,23 @@ static inline void syscall_rollback(struct task_struct *task,
regs->regs[0] = regs->orig_x0;
}
-
-static inline long syscall_get_error(struct task_struct *task,
- struct pt_regs *regs)
+static inline long syscall_get_return_value(struct task_struct *task,
+ struct pt_regs *regs)
{
- unsigned long error = regs->regs[0];
+ unsigned long val = regs->regs[0];
if (is_compat_thread(task_thread_info(task)))
- error = sign_extend64(error, 31);
+ val = sign_extend64(val, 31);
- return IS_ERR_VALUE(error) ? error : 0;
+ return val;
}
-static inline long syscall_get_return_value(struct task_struct *task,
- struct pt_regs *regs)
+static inline long syscall_get_error(struct task_struct *task,
+ struct pt_regs *regs)
{
- return regs->regs[0];
+ unsigned long error = syscall_get_return_value(task, regs);
+
+ return IS_ERR_VALUE(error) ? error : 0;
}
static inline void syscall_set_return_value(struct task_struct *task,
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 499b6b2f9757..b381a1ee9ea7 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -1862,7 +1862,7 @@ void syscall_trace_exit(struct pt_regs *regs)
audit_syscall_exit(regs);
if (flags & _TIF_SYSCALL_TRACEPOINT)
- trace_sys_exit(regs, regs_return_value(regs));
+ trace_sys_exit(regs, syscall_get_return_value(current, regs));
if (flags & (_TIF_SYSCALL_TRACE | _TIF_SINGLESTEP))
tracehook_report_syscall(regs, PTRACE_SYSCALL_EXIT);
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index f8192f4ae0b8..23036334f4dc 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -29,6 +29,7 @@
#include <asm/unistd.h>
#include <asm/fpsimd.h>
#include <asm/ptrace.h>
+#include <asm/syscall.h>
#include <asm/signal32.h>
#include <asm/traps.h>
#include <asm/vdso.h>
@@ -890,7 +891,7 @@ static void do_signal(struct pt_regs *regs)
retval == -ERESTART_RESTARTBLOCK ||
(retval == -ERESTARTSYS &&
!(ksig.ka.sa.sa_flags & SA_RESTART)))) {
- regs->regs[0] = -EINTR;
+ syscall_set_return_value(current, regs, -EINTR, 0);
regs->pc = continue_addr;
}
diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
index 263d6c1a525f..50a0f1a38e84 100644
--- a/arch/arm64/kernel/syscall.c
+++ b/arch/arm64/kernel/syscall.c
@@ -54,10 +54,7 @@ static void invoke_syscall(struct pt_regs *regs, unsigned int scno,
ret = do_ni_syscall(regs, scno);
}
- if (is_compat_task())
- ret = lower_32_bits(ret);
-
- regs->regs[0] = ret;
+ syscall_set_return_value(current, regs, 0, ret);
/*
* Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
@@ -115,7 +112,7 @@ static void el0_svc_common(struct pt_regs *regs, int scno, int sc_nr,
* syscall. do_notify_resume() will send a signal to userspace
* before the syscall is restarted.
*/
- regs->regs[0] = -ERESTARTNOINTR;
+ syscall_set_return_value(current, regs, -ERESTARTNOINTR, 0);
return;
}
@@ -136,7 +133,7 @@ static void el0_svc_common(struct pt_regs *regs, int scno, int sc_nr,
* anyway.
*/
if (scno == NO_SYSCALL)
- regs->regs[0] = -ENOSYS;
+ syscall_set_return_value(current, regs, -ENOSYS, 0);
scno = syscall_trace_enter(regs);
if (scno == NO_SYSCALL)
goto trace_exit;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e30e8d46cf605d216a799a28c77b8a41c328613a Mon Sep 17 00:00:00 2001
From: Mark Rutland <mark.rutland(a)arm.com>
Date: Mon, 2 Aug 2021 11:42:00 +0100
Subject: [PATCH] arm64: fix compat syscall return truncation
Due to inconsistencies in the way we manipulate compat GPRs, we have a
few issues today:
* For audit and tracing, where error codes are handled as a (native)
long, negative error codes are expected to be sign-extended to the
native 64-bits, or they may fail to be matched correctly. Thus a
syscall which fails with an error may erroneously be identified as
failing.
* For ptrace, *all* compat return values should be sign-extended for
consistency with 32-bit arm, but we currently only do this for
negative return codes.
* As we may transiently set the upper 32 bits of some compat GPRs while
in the kernel, these can be sampled by perf, which is somewhat
confusing. This means that where a syscall returns a pointer above 2G,
this will be sign-extended, but will not be mistaken for an error as
error codes are constrained to the inclusive range [-4096, -1] where
no user pointer can exist.
To fix all of these, we must consistently use helpers to get/set the
compat GPRs, ensuring that we never write the upper 32 bits of the
return code, and always sign-extend when reading the return code. This
patch does so, with the following changes:
* We re-organise syscall_get_return_value() to always sign-extend for
compat tasks, and reimplement syscall_get_error() atop. We update
syscall_trace_exit() to use syscall_get_return_value().
* We consistently use syscall_set_return_value() to set the return
value, ensureing the upper 32 bits are never set unexpectedly.
* As the core audit code currently uses regs_return_value() rather than
syscall_get_return_value(), we special-case this for
compat_user_mode(regs) such that this will do the right thing. Going
forward, we should try to move the core audit code over to
syscall_get_return_value().
Cc: <stable(a)vger.kernel.org>
Reported-by: He Zhe <zhe.he(a)windriver.com>
Reported-by: weiyuchen <weiyuchen3(a)huawei.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Will Deacon <will(a)kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas(a)arm.com>
Link: https://lore.kernel.org/r/20210802104200.21390-1-mark.rutland@arm.com
Signed-off-by: Will Deacon <will(a)kernel.org>
diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
index e58bca832dff..41b332c054ab 100644
--- a/arch/arm64/include/asm/ptrace.h
+++ b/arch/arm64/include/asm/ptrace.h
@@ -320,7 +320,17 @@ static inline unsigned long kernel_stack_pointer(struct pt_regs *regs)
static inline unsigned long regs_return_value(struct pt_regs *regs)
{
- return regs->regs[0];
+ unsigned long val = regs->regs[0];
+
+ /*
+ * Audit currently uses regs_return_value() instead of
+ * syscall_get_return_value(). Apply the same sign-extension here until
+ * audit is updated to use syscall_get_return_value().
+ */
+ if (compat_user_mode(regs))
+ val = sign_extend64(val, 31);
+
+ return val;
}
static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc)
diff --git a/arch/arm64/include/asm/syscall.h b/arch/arm64/include/asm/syscall.h
index cfc0672013f6..03e20895453a 100644
--- a/arch/arm64/include/asm/syscall.h
+++ b/arch/arm64/include/asm/syscall.h
@@ -29,22 +29,23 @@ static inline void syscall_rollback(struct task_struct *task,
regs->regs[0] = regs->orig_x0;
}
-
-static inline long syscall_get_error(struct task_struct *task,
- struct pt_regs *regs)
+static inline long syscall_get_return_value(struct task_struct *task,
+ struct pt_regs *regs)
{
- unsigned long error = regs->regs[0];
+ unsigned long val = regs->regs[0];
if (is_compat_thread(task_thread_info(task)))
- error = sign_extend64(error, 31);
+ val = sign_extend64(val, 31);
- return IS_ERR_VALUE(error) ? error : 0;
+ return val;
}
-static inline long syscall_get_return_value(struct task_struct *task,
- struct pt_regs *regs)
+static inline long syscall_get_error(struct task_struct *task,
+ struct pt_regs *regs)
{
- return regs->regs[0];
+ unsigned long error = syscall_get_return_value(task, regs);
+
+ return IS_ERR_VALUE(error) ? error : 0;
}
static inline void syscall_set_return_value(struct task_struct *task,
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 499b6b2f9757..b381a1ee9ea7 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -1862,7 +1862,7 @@ void syscall_trace_exit(struct pt_regs *regs)
audit_syscall_exit(regs);
if (flags & _TIF_SYSCALL_TRACEPOINT)
- trace_sys_exit(regs, regs_return_value(regs));
+ trace_sys_exit(regs, syscall_get_return_value(current, regs));
if (flags & (_TIF_SYSCALL_TRACE | _TIF_SINGLESTEP))
tracehook_report_syscall(regs, PTRACE_SYSCALL_EXIT);
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index f8192f4ae0b8..23036334f4dc 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -29,6 +29,7 @@
#include <asm/unistd.h>
#include <asm/fpsimd.h>
#include <asm/ptrace.h>
+#include <asm/syscall.h>
#include <asm/signal32.h>
#include <asm/traps.h>
#include <asm/vdso.h>
@@ -890,7 +891,7 @@ static void do_signal(struct pt_regs *regs)
retval == -ERESTART_RESTARTBLOCK ||
(retval == -ERESTARTSYS &&
!(ksig.ka.sa.sa_flags & SA_RESTART)))) {
- regs->regs[0] = -EINTR;
+ syscall_set_return_value(current, regs, -EINTR, 0);
regs->pc = continue_addr;
}
diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
index 263d6c1a525f..50a0f1a38e84 100644
--- a/arch/arm64/kernel/syscall.c
+++ b/arch/arm64/kernel/syscall.c
@@ -54,10 +54,7 @@ static void invoke_syscall(struct pt_regs *regs, unsigned int scno,
ret = do_ni_syscall(regs, scno);
}
- if (is_compat_task())
- ret = lower_32_bits(ret);
-
- regs->regs[0] = ret;
+ syscall_set_return_value(current, regs, 0, ret);
/*
* Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
@@ -115,7 +112,7 @@ static void el0_svc_common(struct pt_regs *regs, int scno, int sc_nr,
* syscall. do_notify_resume() will send a signal to userspace
* before the syscall is restarted.
*/
- regs->regs[0] = -ERESTARTNOINTR;
+ syscall_set_return_value(current, regs, -ERESTARTNOINTR, 0);
return;
}
@@ -136,7 +133,7 @@ static void el0_svc_common(struct pt_regs *regs, int scno, int sc_nr,
* anyway.
*/
if (scno == NO_SYSCALL)
- regs->regs[0] = -ENOSYS;
+ syscall_set_return_value(current, regs, -ENOSYS, 0);
scno = syscall_trace_enter(regs);
if (scno == NO_SYSCALL)
goto trace_exit;
From: Niklas Cassel <niklas.cassel(a)wdc.com>
A user space process should not need the CAP_SYS_ADMIN capability set
in order to perform a BLKREPORTZONE ioctl.
Getting the zone report is required in order to get the write pointer.
Neither read() nor write() requires CAP_SYS_ADMIN, so it is reasonable
that a user space process that can read/write from/to the device, also
can get the write pointer. (Since e.g. writes have to be at the write
pointer.)
Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls")
Signed-off-by: Niklas Cassel <niklas.cassel(a)wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal(a)dc.com>
Reviewed-by: Aravind Ramesh <aravind.ramesh(a)wdc.com>
Reviewed-by: Adam Manzanares <a.manzanares(a)samsung.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani(a)oracle.com>
Cc: stable(a)vger.kernel.org # v4.10+
---
Changes since v3:
-Picked up additional Reviewed-by tags.
block/blk-zoned.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 8a60dbeb44be..1d0c76c18fc5 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -360,9 +360,6 @@ int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
if (!blk_queue_is_zoned(q))
return -ENOTTY;
- if (!capable(CAP_SYS_ADMIN))
- return -EACCES;
-
if (copy_from_user(&rep, argp, sizeof(struct blk_zone_report)))
return -EFAULT;
--
2.31.1
From: Niklas Cassel <niklas.cassel(a)wdc.com>
Zone management send operations (BLKRESETZONE, BLKOPENZONE, BLKCLOSEZONE
and BLKFINISHZONE) should be allowed under the same permissions as write().
(write() does not require CAP_SYS_ADMIN).
Additionally, other ioctls like BLKSECDISCARD and BLKZEROOUT only check if
the fd was successfully opened with FMODE_WRITE.
(They do not require CAP_SYS_ADMIN).
Currently, zone management send operations require both CAP_SYS_ADMIN
and that the fd was successfully opened with FMODE_WRITE.
Remove the CAP_SYS_ADMIN requirement, so that zone management send
operations match the access control requirement of write(), BLKSECDISCARD
and BLKZEROOUT.
Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls")
Signed-off-by: Niklas Cassel <niklas.cassel(a)wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal(a)wdc.com>
Reviewed-by: Aravind Ramesh <aravind.ramesh(a)wdc.com>
Reviewed-by: Adam Manzanares <a.manzanares(a)samsung.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani(a)oracle.com>
Cc: stable(a)vger.kernel.org # v4.10+
---
Changes since v3:
-Picked up additional Reviewed-by tags.
Note to backporter:
Function was added as blkdev_reset_zones_ioctl() in v4.10.
Function was renamed to blkdev_zone_mgmt_ioctl() in v5.5.
The patch is valid both before and after the function rename.
block/blk-zoned.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 86fce751bb17..8a60dbeb44be 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -421,9 +421,6 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
if (!blk_queue_is_zoned(q))
return -ENOTTY;
- if (!capable(CAP_SYS_ADMIN))
- return -EACCES;
-
if (!(mode & FMODE_WRITE))
return -EBADF;
--
2.31.1