From: Jonathan Bell <jonathan(a)raspberrypi.org>
Seen on a VLI VL805 PCIe to USB controller. For non-stream endpoints
at least, if the xHC halts on a particular TRB due to an error then
the DCS field in the Out Endpoint Context maintained by the hardware
is not updated with the current cycle state.
Using the quirk XHCI_EP_CTX_BROKEN_DCS and instead fetch the DCS bit
from the TRB that the xHC stopped on.
Cc: stable(a)vger.kernel.org
Link: https://github.com/raspberrypi/linux/issues/3060
Signed-off-by: Jonathan Bell <jonathan(a)raspberrypi.org>
Signed-off-by: Bjørn Mork <bjorn(a)mork.no>
---
Ran into this issue on an RPi4 running Debian bullseye, having mostly
a plain v5.10.40 kernel. Using an RTL2838 (0bda:2838) with rtl-sdr
just did not work, showing all the issues described on the above link.
This quirk found in https://github.com/raspberrypi/linux.git solves
the problem for me. I don't see why it shouldn't be in mainline. And
I propose adding it to stable as well, since it solves a real problem.
Mostly for my own convenience as I'd prefer just using a Debian built
kernel ;-)
Did not check this submission with Jonathan - hoping it is OK...
Bjørn
drivers/usb/host/xhci-pci.c | 4 +++-
drivers/usb/host/xhci-ring.c | 26 ++++++++++++++++++++++++++
drivers/usb/host/xhci.h | 1 +
3 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/drivers/usb/host/xhci-pci.c b/drivers/usb/host/xhci-pci.c
index 18c2bbddf080..6f3bed09028c 100644
--- a/drivers/usb/host/xhci-pci.c
+++ b/drivers/usb/host/xhci-pci.c
@@ -279,8 +279,10 @@ static void xhci_pci_quirks(struct device *dev, struct xhci_hcd *xhci)
pdev->device == 0x3432)
xhci->quirks |= XHCI_BROKEN_STREAMS;
- if (pdev->vendor == PCI_VENDOR_ID_VIA && pdev->device == 0x3483)
+ if (pdev->vendor == PCI_VENDOR_ID_VIA && pdev->device == 0x3483) {
xhci->quirks |= XHCI_LPM_SUPPORT;
+ xhci->quirks |= XHCI_EP_CTX_BROKEN_DCS;
+ }
if (pdev->vendor == PCI_VENDOR_ID_ASMEDIA &&
pdev->device == PCI_DEVICE_ID_ASMEDIA_1042_XHCI)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 8fea44bbc266..a9c860ff5177 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -559,8 +559,11 @@ static int xhci_move_dequeue_past_td(struct xhci_hcd *xhci,
struct xhci_ring *ep_ring;
struct xhci_command *cmd;
struct xhci_segment *new_seg;
+ struct xhci_segment *halted_seg = NULL;
union xhci_trb *new_deq;
int new_cycle;
+ union xhci_trb *halted_trb;
+ int index = 0;
dma_addr_t addr;
u64 hw_dequeue;
bool cycle_found = false;
@@ -600,6 +603,29 @@ static int xhci_move_dequeue_past_td(struct xhci_hcd *xhci,
new_deq = ep_ring->dequeue;
new_cycle = hw_dequeue & 0x1;
+ /*
+ * Quirk: xHC write-back of the DCS field in the hardware dequeue
+ * pointer is wrong - use the cycle state of the TRB pointed to by
+ * the dequeue pointer.
+ */
+ if (xhci->quirks & XHCI_EP_CTX_BROKEN_DCS &&
+ !(ep->ep_state & EP_HAS_STREAMS))
+ halted_seg = trb_in_td(xhci, cur_td->start_seg,
+ cur_td->first_trb, cur_td->last_trb,
+ hw_dequeue & ~0xf, false);
+ if (halted_seg) {
+ index = ((dma_addr_t)(hw_dequeue & ~0xf) - halted_seg->dma) /
+ sizeof(*halted_trb);
+ halted_trb = &halted_seg->trbs[index];
+ state->new_cycle_state = halted_trb->generic.field[3] & 0x1;
+ xhci_dbg(xhci, "Endpoint DCS = %d TRB index = %d cycle = %d\n",
+ (u8)(hw_dequeue & 0x1), index,
+ state->new_cycle_state);
+ } else {
+ state->new_cycle_state = hw_dequeue & 0x1;
+ }
+ state->stream_id = stream_id;
+
/*
* We want to find the pointer, segment and cycle state of the new trb
* (the one after current TD's last_trb). We know the cycle state at
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index 3c7d281672ae..911aeb7d8a19 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -1896,6 +1896,7 @@ struct xhci_hcd {
#define XHCI_SG_TRB_CACHE_SIZE_QUIRK BIT_ULL(39)
#define XHCI_NO_SOFT_RETRY BIT_ULL(40)
#define XHCI_BROKEN_D3COLD BIT_ULL(41)
+#define XHCI_EP_CTX_BROKEN_DCS BIT_ULL(42)
unsigned int num_active_eps;
unsigned int limit_active_eps;
--
2.30.2
Hi Greg,
Linus has taken in a group of mm/thp commits Cc stable today:
504e070dc08f mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split
22061a1ffabd mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
31657170deaf mm/thp: fix page_address_in_vma() on file THP tails
494334e43c16 mm/thp: fix vma_address() if virtual address below file offset
732ed55823fc mm/thp: try_to_unmap() use TTU_SYNC for safe splitting
3b77e8c8cde5 mm/thp: make is_huge_zero_pmd() safe and quicker
99fa8a48203d mm/thp: fix __split_huge_pmd_locked() on shmem migration entry
ffc90cbb2970 mm, thp: use head page in __migration_entry_wait()
and I expect some more to follow in a few days time (thanks Andrew).
No problem with the commits themselves, but I'm aware that some of them
have dependencies on other commits not yet in stable, which I have to
sort out for you now.
I'd prefer to avoid a deluge of "does not apply" messages, so ask you
please to hold off trying to merge these into stable trees for a few days:
I'll get back to you with what's needed for them to apply.
Thanks,
Hugh
If a user program uses userfaultfd on ranges of heap memory, it may
end up passing a tagged pointer to the kernel in the range.start
field of the UFFDIO_REGISTER ioctl. This can happen when using an
MTE-capable allocator, or on Android if using the Tagged Pointers
feature for MTE readiness [1].
When a fault subsequently occurs, the tag is stripped from the fault
address returned to the application in the fault.address field
of struct uffd_msg. However, from the application's perspective,
the tagged address *is* the memory address, so if the application
is unaware of memory tags, it may get confused by receiving an
address that is, from its point of view, outside of the bounds of the
allocation. We observed this behavior in the kselftest for userfaultfd
[2] but other applications could have the same problem.
Fix this by remembering which tag was used to originally register the
userfaultfd and passing that tag back in fault.address. In a future
enhancement, we may want to pass back the original fault address,
but like SA_EXPOSE_TAGBITS, this should be guarded by a flag.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
Signed-off-by: Peter Collingbourne <pcc(a)google.com>
Link: https://linux-review.googlesource.com/id/I761aa9f0344454c482b83fcfcce547db0…
Fixes: 63f0c6037965 ("arm64: Introduce prctl() options to control the tagged user addresses ABI")
Cc: <stable(a)vger.kernel.org> # 5.4
---
Documentation/arm64/tagged-pointers.rst | 5 +++++
fs/userfaultfd.c | 17 +++++++++++------
include/linux/mm_types.h | 3 ++-
3 files changed, 18 insertions(+), 7 deletions(-)
diff --git a/Documentation/arm64/tagged-pointers.rst b/Documentation/arm64/tagged-pointers.rst
index 19d284b70384..ec8e1f90b744 100644
--- a/Documentation/arm64/tagged-pointers.rst
+++ b/Documentation/arm64/tagged-pointers.rst
@@ -73,6 +73,11 @@ flag setting.
Non-zero tags are never preserved in sigcontext.fault_address
regardless of the SA_EXPOSE_TAGBITS flag setting.
+When using userfaultfd the address tag supplied in the range.start
+field of the UFFDIO_REGISTER ioctl is preserved and returned to
+userspace via the fault.address field of struct uffd_msg, and the
+tag of the original fault address is discarded.
+
The architecture prevents the use of a tagged PC, so the upper byte will
be set to a sign-extension of bit 55 on exception return.
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index dd7a6c62b56f..adb0f7d0638a 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -110,15 +110,15 @@ static int userfaultfd_wake_function(wait_queue_entry_t *wq, unsigned mode,
struct userfaultfd_wake_range *range = key;
int ret;
struct userfaultfd_wait_queue *uwq;
- unsigned long start, len;
+ unsigned long start, len, addr;
uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
ret = 0;
/* len == 0 means wake all */
start = range->start;
len = range->len;
- if (len && (start > uwq->msg.arg.pagefault.address ||
- start + len <= uwq->msg.arg.pagefault.address))
+ addr = untagged_addr(uwq->msg.arg.pagefault.address);
+ if (len && (start > addr || start + len <= addr))
goto out;
WRITE_ONCE(uwq->waken, true);
/*
@@ -480,8 +480,9 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
uwq.wq.private = current;
- uwq.msg = userfault_msg(vmf->address, vmf->flags, reason,
- ctx->features);
+ uwq.msg = userfault_msg(
+ vmf->address + vmf->vma->vm_userfaultfd_ctx.address_tag,
+ vmf->flags, reason, ctx->features);
uwq.ctx = ctx;
uwq.waken = false;
@@ -1287,7 +1288,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
unsigned long vm_flags, new_flags;
bool found;
bool basic_ioctls;
- unsigned long start, end, vma_end;
+ unsigned long address_tag, start, end, vma_end;
user_uffdio_register = (struct uffdio_register __user *) arg;
@@ -1313,6 +1314,9 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
vm_flags |= VM_UFFD_MINOR;
}
+ address_tag = uffdio_register.range.start -
+ untagged_addr(uffdio_register.range.start);
+
ret = validate_range(mm, &uffdio_register.range.start,
uffdio_register.range.len);
if (ret)
@@ -1462,6 +1466,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
*/
vma->vm_flags = new_flags;
vma->vm_userfaultfd_ctx.ctx = ctx;
+ vma->vm_userfaultfd_ctx.address_tag = address_tag;
if (is_vm_hugetlb_page(vma) && uffd_disable_huge_pmd_share(vma))
hugetlb_unshare_all_pmds(vma);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8f0fb62e8975..cb93e5b17896 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -286,9 +286,10 @@ struct vm_region {
};
#ifdef CONFIG_USERFAULTFD
-#define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) { NULL, })
+#define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) { NULL, 0, })
struct vm_userfaultfd_ctx {
struct userfaultfd_ctx *ctx;
+ unsigned long address_tag;
};
#else /* CONFIG_USERFAULTFD */
#define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) {})
--
2.32.0.93.g670b81a890-goog
If perf_event_open() is called with another task as target and
perf_event_attr::sigtrap is set, and the target task's user does not
match the calling user, also require the CAP_KILL capability.
Otherwise, with the CAP_PERFMON capability alone it would be possible
for a user to send SIGTRAP signals via perf events to another user's
tasks. This could potentially result in those tasks being terminated if
they cannot handle SIGTRAP signals.
Note: The check complements the existing capability check, but is not
supposed to supersede the ptrace_may_access() check. At a high level we
now have:
capable of CAP_PERFMON and (CAP_KILL if sigtrap)
OR
ptrace_may_access() // also checks for same thread-group and uid
Fixes: 97ba62b27867 ("perf: Add support for SIGTRAP on perf events")
Cc: <stable(a)vger.kernel.org> # 5.13+
Reported-by: Dmitry Vyukov <dvyukov(a)google.com>
Signed-off-by: Marco Elver <elver(a)google.com>
---
v2:
* Drop kill_capable() and just check CAP_KILL (reported by Ondrej Mosnacek).
* Use ns_capable(__task_cred(task)->user_ns, CAP_KILL) to check for
capability in target task's ns (reported by Ondrej Mosnacek).
---
kernel/events/core.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index fe88d6eea3c2..43c99695dc3f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12152,10 +12152,23 @@ SYSCALL_DEFINE5(perf_event_open,
}
if (task) {
+ bool is_capable;
+
err = down_read_interruptible(&task->signal->exec_update_lock);
if (err)
goto err_file;
+ is_capable = perfmon_capable();
+ if (attr.sigtrap) {
+ /*
+ * perf_event_attr::sigtrap sends signals to the other
+ * task. Require the current task to have CAP_KILL.
+ */
+ rcu_read_lock();
+ is_capable &= ns_capable(__task_cred(task)->user_ns, CAP_KILL);
+ rcu_read_unlock();
+ }
+
/*
* Preserve ptrace permission check for backwards compatibility.
*
@@ -12165,7 +12178,7 @@ SYSCALL_DEFINE5(perf_event_open,
* perf_event_exit_task() that could imply).
*/
err = -EACCES;
- if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
+ if (!is_capable && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
goto err_cred;
}
--
2.32.0.93.g670b81a890-goog
From: Eric Biggers <ebiggers(a)google.com>
Add a helper function fscrypt_symlink_getattr() which will be called
from the various filesystems' ->getattr() methods to read and decrypt
the target of encrypted symlinks in order to report the correct st_size.
Detailed explanation:
As required by POSIX and as documented in various man pages, st_size for
a symlink is supposed to be the length of the symlink target.
Unfortunately, st_size has always been wrong for encrypted symlinks
because st_size is populated from i_size from disk, which intentionally
contains the length of the encrypted symlink target. That's slightly
greater than the length of the decrypted symlink target (which is the
symlink target that userspace usually sees), and usually won't match the
length of the no-key encoded symlink target either.
This hadn't been fixed yet because reporting the correct st_size would
require reading the symlink target from disk and decrypting or encoding
it, which historically has been considered too heavyweight to do in
->getattr(). Also historically, the wrong st_size had only broken a
test (LTP lstat03) and there were no known complaints from real users.
(This is probably because the st_size of symlinks isn't used too often,
and when it is, typically it's for a hint for what buffer size to pass
to readlink() -- which a slightly-too-large size still works for.)
However, a couple things have changed now. First, there have recently
been complaints about the current behavior from real users:
- Breakage in rpmbuild:
https://github.com/rpm-software-management/rpm/issues/1682https://github.com/google/fscrypt/issues/305
- Breakage in toybox cpio:
https://www.mail-archive.com/toybox@lists.landley.net/msg07193.html
- Breakage in libgit2: https://issuetracker.google.com/issues/189629152
(on Android public issue tracker, requires login)
Second, we now cache decrypted symlink targets in ->i_link. Therefore,
taking the performance hit of reading and decrypting the symlink target
in ->getattr() wouldn't be as big a deal as it used to be, since usually
it will just save having to do the same thing later.
Also note that eCryptfs ended up having to read and decrypt symlink
targets in ->getattr() as well, to fix this same issue; see
commit 3a60a1686f0d ("eCryptfs: Decrypt symlink target for stat size").
So, let's just bite the bullet, and read and decrypt the symlink target
in ->getattr() in order to report the correct st_size. Add a function
fscrypt_symlink_getattr() which the filesystems will call to do this.
(Alternatively, we could store the decrypted size of symlinks on-disk.
But there isn't a great place to do so, and encryption is meant to hide
the original size to some extent; that property would be lost.)
Cc: stable(a)vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
---
fs/crypto/hooks.c | 44 +++++++++++++++++++++++++++++++++++++++++
include/linux/fscrypt.h | 7 +++++++
2 files changed, 51 insertions(+)
diff --git a/fs/crypto/hooks.c b/fs/crypto/hooks.c
index a73b0376e6f3..af74599ae1cf 100644
--- a/fs/crypto/hooks.c
+++ b/fs/crypto/hooks.c
@@ -384,3 +384,47 @@ const char *fscrypt_get_symlink(struct inode *inode, const void *caddr,
return ERR_PTR(err);
}
EXPORT_SYMBOL_GPL(fscrypt_get_symlink);
+
+/**
+ * fscrypt_symlink_getattr() - set the correct st_size for encrypted symlinks
+ * @path: the path for the encrypted symlink being queried
+ * @stat: the struct being filled with the symlink's attributes
+ *
+ * Override st_size of encrypted symlinks to be the length of the decrypted
+ * symlink target (or the no-key encoded symlink target, if the key is
+ * unavailable) rather than the length of the encrypted symlink target. This is
+ * necessary for st_size to match the symlink target that userspace actually
+ * sees. POSIX requires this, and some userspace programs depend on it.
+ *
+ * This requires reading the symlink target from disk if needed, setting up the
+ * inode's encryption key if possible, and then decrypting or encoding the
+ * symlink target. This makes lstat() more heavyweight than is normally the
+ * case. However, decrypted symlink targets will be cached in ->i_link, so
+ * usually the symlink won't have to be read and decrypted again later if/when
+ * it is actually followed, readlink() is called, or lstat() is called again.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+int fscrypt_symlink_getattr(const struct path *path, struct kstat *stat)
+{
+ struct dentry *dentry = path->dentry;
+ struct inode *inode = d_inode(dentry);
+ const char *link;
+ DEFINE_DELAYED_CALL(done);
+
+ /*
+ * To get the symlink target that userspace will see (whether it's the
+ * decrypted target or the no-key encoded target), we can just get it in
+ * the same way the VFS does during path resolution and readlink().
+ */
+ link = READ_ONCE(inode->i_link);
+ if (!link) {
+ link = inode->i_op->get_link(dentry, inode, &done);
+ if (IS_ERR(link))
+ return PTR_ERR(link);
+ }
+ stat->size = strlen(link);
+ do_delayed_call(&done);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(fscrypt_symlink_getattr);
diff --git a/include/linux/fscrypt.h b/include/linux/fscrypt.h
index 2ea1387bb497..b7bfd0cd4f3e 100644
--- a/include/linux/fscrypt.h
+++ b/include/linux/fscrypt.h
@@ -253,6 +253,7 @@ int __fscrypt_encrypt_symlink(struct inode *inode, const char *target,
const char *fscrypt_get_symlink(struct inode *inode, const void *caddr,
unsigned int max_size,
struct delayed_call *done);
+int fscrypt_symlink_getattr(const struct path *path, struct kstat *stat);
static inline void fscrypt_set_ops(struct super_block *sb,
const struct fscrypt_operations *s_cop)
{
@@ -583,6 +584,12 @@ static inline const char *fscrypt_get_symlink(struct inode *inode,
return ERR_PTR(-EOPNOTSUPP);
}
+static inline int fscrypt_symlink_getattr(const struct path *path,
+ struct kstat *stat)
+{
+ return -EOPNOTSUPP;
+}
+
static inline void fscrypt_set_ops(struct super_block *sb,
const struct fscrypt_operations *s_cop)
{
--
2.32.0
From: Paul Burton <paulburton(a)google.com>
Currently tgid_map is sized at PID_MAX_DEFAULT entries, which means that
on systems where pid_max is configured higher than PID_MAX_DEFAULT the
ftrace record-tgid option doesn't work so well. Any tasks with PIDs
higher than PID_MAX_DEFAULT are simply not recorded in tgid_map, and
don't show up in the saved_tgids file.
In particular since systemd v243 & above configure pid_max to its
highest possible 1<<22 value by default on 64 bit systems this renders
the record-tgids option of little use.
Increase the size of tgid_map to the configured pid_max instead,
allowing it to cover the full range of PIDs up to the maximum value of
PID_MAX_LIMIT if the system is configured that way.
On 64 bit systems with pid_max == PID_MAX_LIMIT this will increase the
size of tgid_map from 256KiB to 16MiB. Whilst this 64x increase in
memory overhead sounds significant 64 bit systems are presumably best
placed to accommodate it, and since tgid_map is only allocated when the
record-tgid option is actually used presumably the user would rather it
spends sufficient memory to actually record the tgids they expect.
The size of tgid_map could also increase for CONFIG_BASE_SMALL=y
configurations, but these seem unlikely to be systems upon which people
are both configuring a large pid_max and running ftrace with record-tgid
anyway.
Of note is that we only allocate tgid_map once, the first time that the
record-tgid option is enabled. Therefore its size is only set once, to
the value of pid_max at the time the record-tgid option is first
enabled. If a user increases pid_max after that point, the saved_tgids
file will not contain entries for any tasks with pids beyond the earlier
value of pid_max.
Link: https://lkml.kernel.org/r/20210701172407.889626-2-paulburton@google.com
Fixes: d914ba37d714 ("tracing: Add support for recording tgid of tasks")
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Joel Fernandes <joelaf(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Paul Burton <paulburton(a)google.com>
[ Fixed comment coding style ]
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
---
kernel/trace/trace.c | 63 +++++++++++++++++++++++++++++++++-----------
1 file changed, 47 insertions(+), 16 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 4843076d67d3..14f56e9fa001 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2191,8 +2191,15 @@ void tracing_reset_all_online_cpus(void)
}
}
+/*
+ * The tgid_map array maps from pid to tgid; i.e. the value stored at index i
+ * is the tgid last observed corresponding to pid=i.
+ */
static int *tgid_map;
+/* The maximum valid index into tgid_map. */
+static size_t tgid_map_max;
+
#define SAVED_CMDLINES_DEFAULT 128
#define NO_CMDLINE_MAP UINT_MAX
static arch_spinlock_t trace_cmdline_lock = __ARCH_SPIN_LOCK_UNLOCKED;
@@ -2468,24 +2475,41 @@ void trace_find_cmdline(int pid, char comm[])
preempt_enable();
}
+static int *trace_find_tgid_ptr(int pid)
+{
+ /*
+ * Pairs with the smp_store_release in set_tracer_flag() to ensure that
+ * if we observe a non-NULL tgid_map then we also observe the correct
+ * tgid_map_max.
+ */
+ int *map = smp_load_acquire(&tgid_map);
+
+ if (unlikely(!map || pid > tgid_map_max))
+ return NULL;
+
+ return &map[pid];
+}
+
int trace_find_tgid(int pid)
{
- if (unlikely(!tgid_map || !pid || pid > PID_MAX_DEFAULT))
- return 0;
+ int *ptr = trace_find_tgid_ptr(pid);
- return tgid_map[pid];
+ return ptr ? *ptr : 0;
}
static int trace_save_tgid(struct task_struct *tsk)
{
+ int *ptr;
+
/* treat recording of idle task as a success */
if (!tsk->pid)
return 1;
- if (unlikely(!tgid_map || tsk->pid > PID_MAX_DEFAULT))
+ ptr = trace_find_tgid_ptr(tsk->pid);
+ if (!ptr)
return 0;
- tgid_map[tsk->pid] = tsk->tgid;
+ *ptr = tsk->tgid;
return 1;
}
@@ -5225,6 +5249,8 @@ int trace_keep_overwrite(struct tracer *tracer, u32 mask, int set)
int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled)
{
+ int *map;
+
if ((mask == TRACE_ITER_RECORD_TGID) ||
(mask == TRACE_ITER_RECORD_CMD))
lockdep_assert_held(&event_mutex);
@@ -5247,10 +5273,19 @@ int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled)
trace_event_enable_cmd_record(enabled);
if (mask == TRACE_ITER_RECORD_TGID) {
- if (!tgid_map)
- tgid_map = kvcalloc(PID_MAX_DEFAULT + 1,
- sizeof(*tgid_map),
- GFP_KERNEL);
+ if (!tgid_map) {
+ tgid_map_max = pid_max;
+ map = kvcalloc(tgid_map_max + 1, sizeof(*tgid_map),
+ GFP_KERNEL);
+
+ /*
+ * Pairs with smp_load_acquire() in
+ * trace_find_tgid_ptr() to ensure that if it observes
+ * the tgid_map we just allocated then it also observes
+ * the corresponding tgid_map_max value.
+ */
+ smp_store_release(&tgid_map, map);
+ }
if (!tgid_map) {
tr->trace_flags &= ~TRACE_ITER_RECORD_TGID;
return -ENOMEM;
@@ -5664,18 +5699,14 @@ static void *saved_tgids_next(struct seq_file *m, void *v, loff_t *pos)
{
int pid = ++(*pos);
- if (pid > PID_MAX_DEFAULT)
- return NULL;
-
- return &tgid_map[pid];
+ return trace_find_tgid_ptr(pid);
}
static void *saved_tgids_start(struct seq_file *m, loff_t *pos)
{
- if (!tgid_map || *pos > PID_MAX_DEFAULT)
- return NULL;
+ int pid = *pos;
- return &tgid_map[*pos];
+ return trace_find_tgid_ptr(pid);
}
static void saved_tgids_stop(struct seq_file *m, void *v)
--
2.30.2
From: Paul Burton <paulburton(a)google.com>
The tgid_map array records a mapping from pid to tgid, where the index
of an entry within the array is the pid & the value stored at that index
is the tgid.
The saved_tgids_next() function iterates over pointers into the tgid_map
array & dereferences the pointers which results in the tgid, but then it
passes that dereferenced value to trace_find_tgid() which treats it as a
pid & does a further lookup within the tgid_map array. It seems likely
that the intent here was to skip over entries in tgid_map for which the
recorded tgid is zero, but instead we end up skipping over entries for
which the thread group leader hasn't yet had its own tgid recorded in
tgid_map.
A minimal fix would be to remove the call to trace_find_tgid, turning:
if (trace_find_tgid(*ptr))
into:
if (*ptr)
..but it seems like this logic can be much simpler if we simply let
seq_read() iterate over the whole tgid_map array & filter out empty
entries by returning SEQ_SKIP from saved_tgids_show(). Here we take that
approach, removing the incorrect logic here entirely.
Link: https://lkml.kernel.org/r/20210630003406.4013668-1-paulburton@google.com
Fixes: d914ba37d714 ("tracing: Add support for recording tgid of tasks")
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Joel Fernandes <joelaf(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Paul Burton <paulburton(a)google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
---
kernel/trace/trace.c | 38 +++++++++++++-------------------------
1 file changed, 13 insertions(+), 25 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 60492464281e..4843076d67d3 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -5662,37 +5662,20 @@ static const struct file_operations tracing_readme_fops = {
static void *saved_tgids_next(struct seq_file *m, void *v, loff_t *pos)
{
- int *ptr = v;
+ int pid = ++(*pos);
- if (*pos || m->count)
- ptr++;
-
- (*pos)++;
-
- for (; ptr <= &tgid_map[PID_MAX_DEFAULT]; ptr++) {
- if (trace_find_tgid(*ptr))
- return ptr;
- }
+ if (pid > PID_MAX_DEFAULT)
+ return NULL;
- return NULL;
+ return &tgid_map[pid];
}
static void *saved_tgids_start(struct seq_file *m, loff_t *pos)
{
- void *v;
- loff_t l = 0;
-
- if (!tgid_map)
+ if (!tgid_map || *pos > PID_MAX_DEFAULT)
return NULL;
- v = &tgid_map[0];
- while (l <= *pos) {
- v = saved_tgids_next(m, v, &l);
- if (!v)
- return NULL;
- }
-
- return v;
+ return &tgid_map[*pos];
}
static void saved_tgids_stop(struct seq_file *m, void *v)
@@ -5701,9 +5684,14 @@ static void saved_tgids_stop(struct seq_file *m, void *v)
static int saved_tgids_show(struct seq_file *m, void *v)
{
- int pid = (int *)v - tgid_map;
+ int *entry = (int *)v;
+ int pid = entry - tgid_map;
+ int tgid = *entry;
+
+ if (tgid == 0)
+ return SEQ_SKIP;
- seq_printf(m, "%d %d\n", pid, trace_find_tgid(pid));
+ seq_printf(m, "%d %d\n", pid, tgid);
return 0;
}
--
2.30.2