The tgid_map array records a mapping from pid to tgid, where the index of an entry within the array is the pid & the value stored at that index is the tgid.
The saved_tgids_next() function iterates over pointers into the tgid_map array & dereferences the pointers which results in the tgid, but then it passes that dereferenced value to trace_find_tgid() which treats it as a pid & does a further lookup within the tgid_map array. It seems likely that the intent here was to skip over entries in tgid_map for which the recorded tgid is zero, but instead we end up skipping over entries for which the thread group leader hasn't yet had its own tgid recorded in tgid_map.
A minimal fix would be to remove the call to trace_find_tgid, turning:
if (trace_find_tgid(*ptr))
into:
if (*ptr)
..but it seems like this logic can be much simpler if we simply let seq_read() iterate over the whole tgid_map array & filter out empty entries by returning SEQ_SKIP from saved_tgids_show(). Here we take that approach, removing the incorrect logic here entirely.
Signed-off-by: Paul Burton paulburton@google.com Fixes: d914ba37d714 ("tracing: Add support for recording tgid of tasks") Cc: Steven Rostedt rostedt@goodmis.org Cc: Ingo Molnar mingo@redhat.com Cc: Joel Fernandes joelaf@google.com Cc: stable@vger.kernel.org --- kernel/trace/trace.c | 38 +++++++++++++------------------------- 1 file changed, 13 insertions(+), 25 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index d23a09d3eb37b..9570667310bcc 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -5608,37 +5608,20 @@ static const struct file_operations tracing_readme_fops = {
static void *saved_tgids_next(struct seq_file *m, void *v, loff_t *pos) { - int *ptr = v; + int pid = ++(*pos);
- if (*pos || m->count) - ptr++; - - (*pos)++; - - for (; ptr <= &tgid_map[PID_MAX_DEFAULT]; ptr++) { - if (trace_find_tgid(*ptr)) - return ptr; - } + if (pid > PID_MAX_DEFAULT) + return NULL;
- return NULL; + return &tgid_map[pid]; }
static void *saved_tgids_start(struct seq_file *m, loff_t *pos) { - void *v; - loff_t l = 0; - - if (!tgid_map) + if (!tgid_map || *pos > PID_MAX_DEFAULT) return NULL;
- v = &tgid_map[0]; - while (l <= *pos) { - v = saved_tgids_next(m, v, &l); - if (!v) - return NULL; - } - - return v; + return &tgid_map[*pos]; }
static void saved_tgids_stop(struct seq_file *m, void *v) @@ -5647,9 +5630,14 @@ static void saved_tgids_stop(struct seq_file *m, void *v)
static int saved_tgids_show(struct seq_file *m, void *v) { - int pid = (int *)v - tgid_map; + int *entry = (int *)v; + int pid = entry - tgid_map; + int tgid = *entry; + + if (tgid == 0) + return SEQ_SKIP;
- seq_printf(m, "%d %d\n", pid, trace_find_tgid(pid)); + seq_printf(m, "%d %d\n", pid, tgid); return 0; }
base-commit: 62fb9874f5da54fdb243003b386128037319b219
Currently tgid_map is sized at PID_MAX_DEFAULT entries, which means that on systems where pid_max is configured higher than PID_MAX_DEFAULT the ftrace record-tgid option doesn't work so well. Any tasks with PIDs higher than PID_MAX_DEFAULT are simply not recorded in tgid_map, and don't show up in the saved_tgids file.
In particular since systemd v243 & above configure pid_max to its highest possible 1<<22 value by default on 64 bit systems this renders the record-tgids option of little use.
Increase the size of tgid_map to PID_MAX_LIMIT instead, allowing it to cover the full range of PIDs up to the maximum value of pid_max.
On 64 bit systems this will increase the size of tgid_map from 256KiB to 16MiB. Whilst this 64x increase in memory overhead sounds significant 64 bit systems are presumably best placed to accommodate it, and since tgid_map is only allocated when the record-tgid option is actually used presumably the user would rather it spends sufficient memory to actually record the tgids they expect.
The size of tgid_map will also increase for CONFIG_BASE_SMALL=y configurations, but these seem unlikely to be systems upon which people are running ftrace with record-tgid anyway.
Signed-off-by: Paul Burton paulburton@google.com Fixes: d914ba37d714 ("tracing: Add support for recording tgid of tasks") Cc: Steven Rostedt rostedt@goodmis.org Cc: Ingo Molnar mingo@redhat.com Cc: Joel Fernandes joelaf@google.com Cc: stable@vger.kernel.org --- kernel/trace/trace.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 9570667310bcc..c893c0c2754f8 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -2460,7 +2460,7 @@ void trace_find_cmdline(int pid, char comm[])
int trace_find_tgid(int pid) { - if (unlikely(!tgid_map || !pid || pid > PID_MAX_DEFAULT)) + if (unlikely(!tgid_map || !pid || pid > PID_MAX_LIMIT)) return 0;
return tgid_map[pid]; @@ -2472,7 +2472,7 @@ static int trace_save_tgid(struct task_struct *tsk) if (!tsk->pid) return 1;
- if (unlikely(!tgid_map || tsk->pid > PID_MAX_DEFAULT)) + if (unlikely(!tgid_map || tsk->pid > PID_MAX_LIMIT)) return 0;
tgid_map[tsk->pid] = tsk->tgid; @@ -5194,7 +5194,7 @@ int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled)
if (mask == TRACE_ITER_RECORD_TGID) { if (!tgid_map) - tgid_map = kvcalloc(PID_MAX_DEFAULT + 1, + tgid_map = kvcalloc(PID_MAX_LIMIT + 1, sizeof(*tgid_map), GFP_KERNEL); if (!tgid_map) { @@ -5610,7 +5610,7 @@ static void *saved_tgids_next(struct seq_file *m, void *v, loff_t *pos) { int pid = ++(*pos);
- if (pid > PID_MAX_DEFAULT) + if (pid > PID_MAX_LIMIT) return NULL;
return &tgid_map[pid]; @@ -5618,7 +5618,7 @@ static void *saved_tgids_next(struct seq_file *m, void *v, loff_t *pos)
static void *saved_tgids_start(struct seq_file *m, loff_t *pos) { - if (!tgid_map || *pos > PID_MAX_DEFAULT) + if (!tgid_map || *pos > PID_MAX_LIMIT) return NULL;
return &tgid_map[*pos];
On Tue, 29 Jun 2021 17:34:06 -0700 Paul Burton paulburton@google.com wrote:
On 64 bit systems this will increase the size of tgid_map from 256KiB to 16MiB. Whilst this 64x increase in memory overhead sounds significant 64 bit systems are presumably best placed to accommodate it, and since tgid_map is only allocated when the record-tgid option is actually used presumably the user would rather it spends sufficient memory to actually record the tgids they expect.
NAK. Please see how I fixed this for the saved_cmdlines, and implement it the same way.
785e3c0a3a87 ("tracing: Map all PIDs to command lines")
It's a cache, it doesn't need to save everything.
-- Steve
The size of tgid_map will also increase for CONFIG_BASE_SMALL=y configurations, but these seem unlikely to be systems upon which people are running ftrace with record-tgid anyway.
Hi Steven,
On Wed, Jun 30, 2021 at 08:35:13AM -0400, Steven Rostedt wrote:
On Tue, 29 Jun 2021 17:34:06 -0700 Paul Burton paulburton@google.com wrote:
On 64 bit systems this will increase the size of tgid_map from 256KiB to 16MiB. Whilst this 64x increase in memory overhead sounds significant 64 bit systems are presumably best placed to accommodate it, and since tgid_map is only allocated when the record-tgid option is actually used presumably the user would rather it spends sufficient memory to actually record the tgids they expect.
NAK. Please see how I fixed this for the saved_cmdlines, and implement it the same way.
785e3c0a3a87 ("tracing: Map all PIDs to command lines")
It's a cache, it doesn't need to save everything.
Well sure, but it's a cache that (modulo pid recycling) previously had a 100% hit rate for tasks observed in sched_switch events.
It differs from saved_cmdlines in a few key ways that led me to treat it differently:
1) The cost of allocating map_pid_to_cmdline is paid by all users of ftrace, whilst as I mentioned in my commit description the cost of allocating tgid_map is only paid by those who actually enable the record-tgid option.
2) We verify that the data in map_pid_to_cmdline is valid by cross-referencing it against map_cmdline_to_pid before reporting it. We don't currently have an equivalent for tgid_map, so we'd need to add a second array or make tgid_map an array of struct { int pid; int tgid; } to avoid reporting incorrect tgids. We therefore need to double the memory we consume or further reduce the effectiveness of this cache.
3) As mentioned before, with the default pid_max tgid_map/record-tgid has a 100% hit rate which was never the case for saved_cmdlines. If we go with a solution that changes this property then I certainly think the docs need updating - the description of saved_tgids in Documentation/trace/ftrace.rst makes no mention of this being anything but a perfect recreation of pid->tgid relationships, and unlike the description of saved_cmdlines it doesn't use the word "cache" at all.
Having said that I think taking a similar approach to saved_cmdlines would be better than what we have now, though I'm not sure whether it'll be sufficient to actually be usable for me. My use case is grouping threads into processes when displaying scheduling information, and experience tells me that if any threads don't get grouped appropriately the result will be questions.
Thanks, Paul
On Wed, 30 Jun 2021 14:09:42 -0700 Paul Burton paulburton@google.com wrote:
Hi Steven,
On Wed, Jun 30, 2021 at 08:35:13AM -0400, Steven Rostedt wrote:
On Tue, 29 Jun 2021 17:34:06 -0700 Paul Burton paulburton@google.com wrote:
On 64 bit systems this will increase the size of tgid_map from 256KiB to 16MiB. Whilst this 64x increase in memory overhead sounds significant 64 bit systems are presumably best placed to accommodate it, and since tgid_map is only allocated when the record-tgid option is actually used presumably the user would rather it spends sufficient memory to actually record the tgids they expect.
NAK. Please see how I fixed this for the saved_cmdlines, and implement it the same way.
785e3c0a3a87 ("tracing: Map all PIDs to command lines")
It's a cache, it doesn't need to save everything.
Well sure, but it's a cache that (modulo pid recycling) previously had a 100% hit rate for tasks observed in sched_switch events.
Obviously it wasn't 100% when it went over the PID_MAX_DEFAULT.
It differs from saved_cmdlines in a few key ways that led me to treat it differently:
- The cost of allocating map_pid_to_cmdline is paid by all users of ftrace, whilst as I mentioned in my commit description the cost of allocating tgid_map is only paid by those who actually enable the record-tgid option.
I'll admit that this is a valid point.
- We verify that the data in map_pid_to_cmdline is valid by cross-referencing it against map_cmdline_to_pid before reporting it. We don't currently have an equivalent for tgid_map, so we'd need to add a second array or make tgid_map an array of struct { int pid; int tgid; } to avoid reporting incorrect tgids. We therefore need to double the memory we consume or further reduce the effectiveness of this cache.
Double 256K is just 512K which is still much less than 16M.
- As mentioned before, with the default pid_max tgid_map/record-tgid has a 100% hit rate which was never the case for saved_cmdlines. If we go with a solution that changes this property then I certainly think the docs need updating - the description of saved_tgids in Documentation/trace/ftrace.rst makes no mention of this being anything but a perfect recreation of pid->tgid relationships, and unlike the description of saved_cmdlines it doesn't use the word "cache" at all.
Having said that I think taking a similar approach to saved_cmdlines would be better than what we have now, though I'm not sure whether it'll be sufficient to actually be usable for me. My use case is grouping threads into processes when displaying scheduling information, and experience tells me that if any threads don't get grouped appropriately the result will be questions.
I found a few bugs recently in the saved_cmdlines that were causing a much higher miss rate, but now it's been rather accurate. I wonder how much pain that's been causing you?
Anyway, I'll wait to hear what Joel says on this. If he thinks this is worth 16M of memory when enabled, I may take it.
-- Steve
On Wed, Jun 30, 2021 at 5:34 PM Steven Rostedt rostedt@goodmis.org wrote: [..]
Having said that I think taking a similar approach to saved_cmdlines would be better than what we have now, though I'm not sure whether it'll be sufficient to actually be usable for me. My use case is grouping threads into processes when displaying scheduling information, and experience tells me that if any threads don't get grouped appropriately the result will be questions.
I found a few bugs recently in the saved_cmdlines that were causing a much higher miss rate, but now it's been rather accurate. I wonder how much pain that's been causing you?
Anyway, I'll wait to hear what Joel says on this. If he thinks this is worth 16M of memory when enabled, I may take it.
I am not a huge fan of the 16M, in Android we enable this feature on all devices. Low end Android devices traced in the field are sometimes 1 GB and the added memory pressure may be unwelcome. Very least, maybe make it optional only for folks who increase pid_max?
thanks, -Joel
On Wed, 30 Jun 2021 18:34:11 -0400 Joel Fernandes joelaf@google.com wrote:
Anyway, I'll wait to hear what Joel says on this. If he thinks this is worth 16M of memory when enabled, I may take it.
I am not a huge fan of the 16M, in Android we enable this feature on all devices. Low end Android devices traced in the field are sometimes 1 GB and the added memory pressure may be unwelcome. Very least, maybe make it optional only for folks who increase pid_max?
Yeah, can we just set it to the size of pid_max, at whatever it is set to?
-- Steve
On Wed, 30 Jun 2021 19:11:45 -0400 Steven Rostedt rostedt@goodmis.org wrote:
On Wed, 30 Jun 2021 18:34:11 -0400 Joel Fernandes joelaf@google.com wrote:
Anyway, I'll wait to hear what Joel says on this. If he thinks this is worth 16M of memory when enabled, I may take it.
I am not a huge fan of the 16M, in Android we enable this feature on all devices. Low end Android devices traced in the field are sometimes 1 GB and the added memory pressure may be unwelcome. Very least, maybe make it optional only for folks who increase pid_max?
Yeah, can we just set it to the size of pid_max, at whatever it is set to?
Can you send a V2 with this update and address Joel's comments to patch 1, and get it to me today? I plan on sending Linus a pull request tomorrow, which means I have to kick off my tests tonight for what I want to send.
-- Steve
The tgid_map array records a mapping from pid to tgid, where the index of an entry within the array is the pid & the value stored at that index is the tgid.
The saved_tgids_next() function iterates over pointers into the tgid_map array & dereferences the pointers which results in the tgid, but then it passes that dereferenced value to trace_find_tgid() which treats it as a pid & does a further lookup within the tgid_map array. It seems likely that the intent here was to skip over entries in tgid_map for which the recorded tgid is zero, but instead we end up skipping over entries for which the thread group leader hasn't yet had its own tgid recorded in tgid_map.
A minimal fix would be to remove the call to trace_find_tgid, turning:
if (trace_find_tgid(*ptr))
into:
if (*ptr)
..but it seems like this logic can be much simpler if we simply let seq_read() iterate over the whole tgid_map array & filter out empty entries by returning SEQ_SKIP from saved_tgids_show(). Here we take that approach, removing the incorrect logic here entirely.
Signed-off-by: Paul Burton paulburton@google.com Fixes: d914ba37d714 ("tracing: Add support for recording tgid of tasks") Cc: Steven Rostedt rostedt@goodmis.org Cc: Ingo Molnar mingo@redhat.com Cc: Joel Fernandes joelaf@google.com Cc: stable@vger.kernel.org --- Changes in v2: - Add comments describing why we know tgid_map is non-NULL in saved_tgids_next() & saved_tgids_start(). --- kernel/trace/trace.c | 45 ++++++++++++++++++++------------------------ 1 file changed, 20 insertions(+), 25 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index d23a09d3eb37..7a37c9e36b88 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -5608,37 +5608,23 @@ static const struct file_operations tracing_readme_fops = {
static void *saved_tgids_next(struct seq_file *m, void *v, loff_t *pos) { - int *ptr = v; + int pid = ++(*pos);
- if (*pos || m->count) - ptr++; - - (*pos)++; - - for (; ptr <= &tgid_map[PID_MAX_DEFAULT]; ptr++) { - if (trace_find_tgid(*ptr)) - return ptr; - } + if (pid > PID_MAX_DEFAULT) + return NULL;
- return NULL; + // We already know that tgid_map is non-NULL here because the v + // argument is by definition a non-NULL pointer into tgid_map returned + // by saved_tgids_start() or an earlier call to saved_tgids_next(). + return &tgid_map[pid]; }
static void *saved_tgids_start(struct seq_file *m, loff_t *pos) { - void *v; - loff_t l = 0; - - if (!tgid_map) + if (!tgid_map || *pos > PID_MAX_DEFAULT) return NULL;
- v = &tgid_map[0]; - while (l <= *pos) { - v = saved_tgids_next(m, v, &l); - if (!v) - return NULL; - } - - return v; + return &tgid_map[*pos]; }
static void saved_tgids_stop(struct seq_file *m, void *v) @@ -5647,9 +5633,18 @@ static void saved_tgids_stop(struct seq_file *m, void *v)
static int saved_tgids_show(struct seq_file *m, void *v) { - int pid = (int *)v - tgid_map; + int *entry = (int *)v; + int pid, tgid; + + // We already know that tgid_map is non-NULL here because the v + // argument is by definition a non-NULL pointer into tgid_map returned + // by saved_tgids_start() or saved_tgids_next(). + pid = entry - tgid_map; + tgid = *entry; + if (tgid == 0) + return SEQ_SKIP;
- seq_printf(m, "%d %d\n", pid, trace_find_tgid(pid)); + seq_printf(m, "%d %d\n", pid, tgid); return 0; }
base-commit: 62fb9874f5da54fdb243003b386128037319b219
Currently tgid_map is sized at PID_MAX_DEFAULT entries, which means that on systems where pid_max is configured higher than PID_MAX_DEFAULT the ftrace record-tgid option doesn't work so well. Any tasks with PIDs higher than PID_MAX_DEFAULT are simply not recorded in tgid_map, and don't show up in the saved_tgids file.
In particular since systemd v243 & above configure pid_max to its highest possible 1<<22 value by default on 64 bit systems this renders the record-tgids option of little use.
Increase the size of tgid_map to the configured pid_max instead, allowing it to cover the full range of PIDs up to the maximum value of PID_MAX_LIMIT if the system is configured that way.
On 64 bit systems with pid_max == PID_MAX_LIMIT this will increase the size of tgid_map from 256KiB to 16MiB. Whilst this 64x increase in memory overhead sounds significant 64 bit systems are presumably best placed to accommodate it, and since tgid_map is only allocated when the record-tgid option is actually used presumably the user would rather it spends sufficient memory to actually record the tgids they expect.
The size of tgid_map could also increase for CONFIG_BASE_SMALL=y configurations, but these seem unlikely to be systems upon which people are both configuring a large pid_max and running ftrace with record-tgid anyway.
Of note is that we only allocate tgid_map once, the first time that the record-tgid option is enabled. Therefore its size is only set once, to the value of pid_max at the time the record-tgid option is first enabled. If a user increases pid_max after that point, the saved_tgids file will not contain entries for any tasks with pids beyond the earlier value of pid_max.
Signed-off-by: Paul Burton paulburton@google.com Fixes: d914ba37d714 ("tracing: Add support for recording tgid of tasks") Cc: Steven Rostedt rostedt@goodmis.org Cc: Ingo Molnar mingo@redhat.com Cc: Joel Fernandes joelaf@google.com Cc: stable@vger.kernel.org --- Changes in v2: - Use the configured value of pid_max at the time record-tgid is enabled rather than unconditionally using PID_MAX_LIMIT, to avoid added memory overhead for systems that don't configure such a high pid_max. --- kernel/trace/trace.c | 60 ++++++++++++++++++++++++++++++-------------- 1 file changed, 41 insertions(+), 19 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 7a37c9e36b88..3c4b3b207c06 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -2184,8 +2184,13 @@ void tracing_reset_all_online_cpus(void) } }
+// The tgid_map array maps from pid to tgid; i.e. the value stored at index i +// is the tgid last observed corresponding to pid=i. static int *tgid_map;
+// The maximum valid index into tgid_map. +static size_t tgid_map_max; + #define SAVED_CMDLINES_DEFAULT 128 #define NO_CMDLINE_MAP UINT_MAX static arch_spinlock_t trace_cmdline_lock = __ARCH_SPIN_LOCK_UNLOCKED; @@ -2458,24 +2463,39 @@ void trace_find_cmdline(int pid, char comm[]) preempt_enable(); }
+static int *trace_find_tgid_ptr(int pid) +{ + // Pairs with the smp_store_release in set_tracer_flag() to ensure that + // if we observe a non-NULL tgid_map then we also observe the correct + // tgid_map_max. + int *map = smp_load_acquire(&tgid_map); + + if (unlikely(!map || pid > tgid_map_max)) + return NULL; + + return &map[pid]; +} + int trace_find_tgid(int pid) { - if (unlikely(!tgid_map || !pid || pid > PID_MAX_DEFAULT)) - return 0; + int *ptr = trace_find_tgid_ptr(pid);
- return tgid_map[pid]; + return ptr ? *ptr : 0; }
static int trace_save_tgid(struct task_struct *tsk) { + int *ptr; + /* treat recording of idle task as a success */ if (!tsk->pid) return 1;
- if (unlikely(!tgid_map || tsk->pid > PID_MAX_DEFAULT)) + ptr = trace_find_tgid_ptr(tsk->pid); + if (!ptr) return 0;
- tgid_map[tsk->pid] = tsk->tgid; + *ptr = tsk->tgid; return 1; }
@@ -5171,6 +5191,8 @@ int trace_keep_overwrite(struct tracer *tracer, u32 mask, int set)
int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled) { + int *map; + if ((mask == TRACE_ITER_RECORD_TGID) || (mask == TRACE_ITER_RECORD_CMD)) lockdep_assert_held(&event_mutex); @@ -5193,10 +5215,17 @@ int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled) trace_event_enable_cmd_record(enabled);
if (mask == TRACE_ITER_RECORD_TGID) { - if (!tgid_map) - tgid_map = kvcalloc(PID_MAX_DEFAULT + 1, - sizeof(*tgid_map), - GFP_KERNEL); + if (!tgid_map) { + tgid_map_max = pid_max; + map = kvcalloc(tgid_map_max + 1, sizeof(*tgid_map), + GFP_KERNEL); + + // Pairs with smp_load_acquire() in + // trace_find_tgid_ptr() to ensure that if it observes + // the tgid_map we just allocated then it also observes + // the corresponding tgid_map_max value. + smp_store_release(&tgid_map, map); + } if (!tgid_map) { tr->trace_flags &= ~TRACE_ITER_RECORD_TGID; return -ENOMEM; @@ -5610,21 +5639,14 @@ static void *saved_tgids_next(struct seq_file *m, void *v, loff_t *pos) { int pid = ++(*pos);
- if (pid > PID_MAX_DEFAULT) - return NULL; - - // We already know that tgid_map is non-NULL here because the v - // argument is by definition a non-NULL pointer into tgid_map returned - // by saved_tgids_start() or an earlier call to saved_tgids_next(). - return &tgid_map[pid]; + return trace_find_tgid_ptr(pid); }
static void *saved_tgids_start(struct seq_file *m, loff_t *pos) { - if (!tgid_map || *pos > PID_MAX_DEFAULT) - return NULL; + int pid = *pos;
- return &tgid_map[*pos]; + return trace_find_tgid_ptr(pid); }
static void saved_tgids_stop(struct seq_file *m, void *v)
On Thu, 1 Jul 2021 10:24:07 -0700 Paul Burton paulburton@google.com wrote:
+static int *trace_find_tgid_ptr(int pid) +{
- // Pairs with the smp_store_release in set_tracer_flag() to ensure that
- // if we observe a non-NULL tgid_map then we also observe the correct
- // tgid_map_max.
BTW, it's against the Linux kernel coding style to use // for comments.
I can take this patch, but I need to change this to:
/* * Pairs with the smp_store_release in set_tracer_flag() to ensure that * if we observe a non-NULL tgid_map then we also observe the correct * tgid_map_max. */
Same with the other comments. Please follow coding style that can be found in:
Documentation/process/coding-style.rst
And see section 8 on Commenting.
Thanks,
-- Steve
- int *map = smp_load_acquire(&tgid_map);
- if (unlikely(!map || pid > tgid_map_max))
return NULL;
- return &map[pid];
+}
Hi Steven,
On Thu, Jul 01, 2021 at 02:12:21PM -0400, Steven Rostedt wrote:
On Thu, 1 Jul 2021 10:24:07 -0700 Paul Burton paulburton@google.com wrote:
+static int *trace_find_tgid_ptr(int pid) +{
- // Pairs with the smp_store_release in set_tracer_flag() to ensure that
- // if we observe a non-NULL tgid_map then we also observe the correct
- // tgid_map_max.
BTW, it's against the Linux kernel coding style to use // for comments.
I can take this patch, but I need to change this to:
/* * Pairs with the smp_store_release in set_tracer_flag() to ensure that * if we observe a non-NULL tgid_map then we also observe the correct * tgid_map_max. */
Same with the other comments. Please follow coding style that can be found in:
Documentation/process/coding-style.rst
And see section 8 on Commenting.
Yeah, sorry about that - I should know better having been a maintainer in a former life...
Just to confirm - are you happy to fix those up when applying or should I send a v3?
Thanks, Paul
On Thu, 1 Jul 2021 11:15:37 -0700 Paul Burton paulburton@google.com wrote:
Yeah, sorry about that - I should know better having been a maintainer in a former life...
No problem.
Just to confirm - are you happy to fix those up when applying or should I send a v3?
I made the conversion and I'm going to start my testing now.
Joel, I never saw a reviewed-by from you for this patch.
Thanks!
-- Steve
On Thu, Jul 1, 2021 at 1:24 PM Paul Burton paulburton@google.com wrote:
The tgid_map array records a mapping from pid to tgid, where the index of an entry within the array is the pid & the value stored at that index is the tgid.
The saved_tgids_next() function iterates over pointers into the tgid_map array & dereferences the pointers which results in the tgid, but then it passes that dereferenced value to trace_find_tgid() which treats it as a pid & does a further lookup within the tgid_map array. It seems likely that the intent here was to skip over entries in tgid_map for which the recorded tgid is zero, but instead we end up skipping over entries for which the thread group leader hasn't yet had its own tgid recorded in tgid_map.
A minimal fix would be to remove the call to trace_find_tgid, turning:
if (trace_find_tgid(*ptr))
into:
if (*ptr)
..but it seems like this logic can be much simpler if we simply let seq_read() iterate over the whole tgid_map array & filter out empty entries by returning SEQ_SKIP from saved_tgids_show(). Here we take that approach, removing the incorrect logic here entirely.
Signed-off-by: Paul Burton paulburton@google.com Fixes: d914ba37d714 ("tracing: Add support for recording tgid of tasks") Cc: Steven Rostedt rostedt@goodmis.org Cc: Ingo Molnar mingo@redhat.com Cc: Joel Fernandes joelaf@google.com
Reviewed-by: Joel Fernandes (Google) joel@joelfernandes.org
On Tue, 29 Jun 2021 17:34:05 -0700 Paul Burton paulburton@google.com wrote:
The tgid_map array records a mapping from pid to tgid, where the index of an entry within the array is the pid & the value stored at that index is the tgid.
The saved_tgids_next() function iterates over pointers into the tgid_map array & dereferences the pointers which results in the tgid, but then it passes that dereferenced value to trace_find_tgid() which treats it as a pid & does a further lookup within the tgid_map array. It seems likely that the intent here was to skip over entries in tgid_map for which the recorded tgid is zero, but instead we end up skipping over entries for which the thread group leader hasn't yet had its own tgid recorded in tgid_map.
A minimal fix would be to remove the call to trace_find_tgid, turning:
if (trace_find_tgid(*ptr))
into:
if (*ptr)
..but it seems like this logic can be much simpler if we simply let seq_read() iterate over the whole tgid_map array & filter out empty entries by returning SEQ_SKIP from saved_tgids_show(). Here we take that approach, removing the incorrect logic here entirely.
Signed-off-by: Paul Burton paulburton@google.com Fixes: d914ba37d714 ("tracing: Add support for recording tgid of tasks") Cc: Steven Rostedt rostedt@goodmis.org Cc: Ingo Molnar mingo@redhat.com Cc: Joel Fernandes joelaf@google.com Cc: stable@vger.kernel.org
Joel,
Can you review this please.
-- Steve
On Wed, Jun 30, 2021 at 8:31 AM Steven Rostedt rostedt@goodmis.org wrote:
On Tue, 29 Jun 2021 17:34:05 -0700 Paul Burton paulburton@google.com wrote:
The tgid_map array records a mapping from pid to tgid, where the index of an entry within the array is the pid & the value stored at that index is the tgid.
The saved_tgids_next() function iterates over pointers into the tgid_map array & dereferences the pointers which results in the tgid, but then it passes that dereferenced value to trace_find_tgid() which treats it as a pid & does a further lookup within the tgid_map array. It seems likely that the intent here was to skip over entries in tgid_map for which the recorded tgid is zero, but instead we end up skipping over entries for which the thread group leader hasn't yet had its own tgid recorded in tgid_map.
A minimal fix would be to remove the call to trace_find_tgid, turning:
if (trace_find_tgid(*ptr))
into:
if (*ptr)
..but it seems like this logic can be much simpler if we simply let seq_read() iterate over the whole tgid_map array & filter out empty entries by returning SEQ_SKIP from saved_tgids_show(). Here we take that approach, removing the incorrect logic here entirely.
Signed-off-by: Paul Burton paulburton@google.com Fixes: d914ba37d714 ("tracing: Add support for recording tgid of tasks") Cc: Steven Rostedt rostedt@goodmis.org Cc: Ingo Molnar mingo@redhat.com Cc: Joel Fernandes joelaf@google.com Cc: stable@vger.kernel.org
Joel,
Can you review this please.
Sure thing Steve, will review it today.
thanks, -Joel
On Tue, Jun 29, 2021 at 8:34 PM Paul Burton paulburton@google.com wrote:
The tgid_map array records a mapping from pid to tgid, where the index of an entry within the array is the pid & the value stored at that index is the tgid.
The saved_tgids_next() function iterates over pointers into the tgid_map array & dereferences the pointers which results in the tgid, but then it passes that dereferenced value to trace_find_tgid() which treats it as a pid & does a further lookup within the tgid_map array. It seems likely that the intent here was to skip over entries in tgid_map for which the recorded tgid is zero, but instead we end up skipping over entries for which the thread group leader hasn't yet had its own tgid recorded in tgid_map.
A minimal fix would be to remove the call to trace_find_tgid, turning:
if (trace_find_tgid(*ptr))
into:
if (*ptr)
..but it seems like this logic can be much simpler if we simply let seq_read() iterate over the whole tgid_map array & filter out empty entries by returning SEQ_SKIP from saved_tgids_show(). Here we take that approach, removing the incorrect logic here entirely.
Looks reasonable except for one nit:
Signed-off-by: Paul Burton paulburton@google.com Fixes: d914ba37d714 ("tracing: Add support for recording tgid of tasks") Cc: Steven Rostedt rostedt@goodmis.org Cc: Ingo Molnar mingo@redhat.com Cc: Joel Fernandes joelaf@google.com Cc: stable@vger.kernel.org
kernel/trace/trace.c | 38 +++++++++++++------------------------- 1 file changed, 13 insertions(+), 25 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index d23a09d3eb37b..9570667310bcc 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -5608,37 +5608,20 @@ static const struct file_operations tracing_readme_fops = {
static void *saved_tgids_next(struct seq_file *m, void *v, loff_t *pos) {
int *ptr = v;
int pid = ++(*pos);
if (*pos || m->count)
ptr++;
(*pos)++;
for (; ptr <= &tgid_map[PID_MAX_DEFAULT]; ptr++) {
if (trace_find_tgid(*ptr))
return ptr;
It would be great if you can add back the check for !tgid_map to both next() and show() as well, for added robustness (since the old code previously did it).
With that change: Reviewed-by: Joel Fernandes (Google) joel@joelfernandes.org
thanks,
-Joel
}
if (pid > PID_MAX_DEFAULT)
return NULL;
return NULL;
return &tgid_map[pid];
}
static void *saved_tgids_start(struct seq_file *m, loff_t *pos) {
void *v;
loff_t l = 0;
if (!tgid_map)
if (!tgid_map || *pos > PID_MAX_DEFAULT) return NULL;
v = &tgid_map[0];
while (l <= *pos) {
v = saved_tgids_next(m, v, &l);
if (!v)
return NULL;
}
return v;
return &tgid_map[*pos];
}
static void saved_tgids_stop(struct seq_file *m, void *v) @@ -5647,9 +5630,14 @@ static void saved_tgids_stop(struct seq_file *m, void *v)
static int saved_tgids_show(struct seq_file *m, void *v) {
int pid = (int *)v - tgid_map;
int *entry = (int *)v;
int pid = entry - tgid_map;
int tgid = *entry;
if (tgid == 0)
return SEQ_SKIP;
seq_printf(m, "%d %d\n", pid, trace_find_tgid(pid));
seq_printf(m, "%d %d\n", pid, tgid); return 0;
}
base-commit: 62fb9874f5da54fdb243003b386128037319b219
2.32.0.93.g670b81a890-goog
Hi Joel,
On Wed, Jun 30, 2021 at 06:29:55PM -0400, Joel Fernandes wrote:
On Tue, Jun 29, 2021 at 8:34 PM Paul Burton paulburton@google.com wrote:
The tgid_map array records a mapping from pid to tgid, where the index of an entry within the array is the pid & the value stored at that index is the tgid.
The saved_tgids_next() function iterates over pointers into the tgid_map array & dereferences the pointers which results in the tgid, but then it passes that dereferenced value to trace_find_tgid() which treats it as a pid & does a further lookup within the tgid_map array. It seems likely that the intent here was to skip over entries in tgid_map for which the recorded tgid is zero, but instead we end up skipping over entries for which the thread group leader hasn't yet had its own tgid recorded in tgid_map.
A minimal fix would be to remove the call to trace_find_tgid, turning:
if (trace_find_tgid(*ptr))
into:
if (*ptr)
..but it seems like this logic can be much simpler if we simply let seq_read() iterate over the whole tgid_map array & filter out empty entries by returning SEQ_SKIP from saved_tgids_show(). Here we take that approach, removing the incorrect logic here entirely.
Looks reasonable except for one nit:
Signed-off-by: Paul Burton paulburton@google.com Fixes: d914ba37d714 ("tracing: Add support for recording tgid of tasks") Cc: Steven Rostedt rostedt@goodmis.org Cc: Ingo Molnar mingo@redhat.com Cc: Joel Fernandes joelaf@google.com Cc: stable@vger.kernel.org
kernel/trace/trace.c | 38 +++++++++++++------------------------- 1 file changed, 13 insertions(+), 25 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index d23a09d3eb37b..9570667310bcc 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -5608,37 +5608,20 @@ static const struct file_operations tracing_readme_fops = {
static void *saved_tgids_next(struct seq_file *m, void *v, loff_t *pos) {
int *ptr = v;
int pid = ++(*pos);
if (*pos || m->count)
ptr++;
(*pos)++;
for (; ptr <= &tgid_map[PID_MAX_DEFAULT]; ptr++) {
if (trace_find_tgid(*ptr))
return ptr;
It would be great if you can add back the check for !tgid_map to both next() and show() as well, for added robustness (since the old code previously did it).
That condition cannot happen, because both next() & show() are called to iterate through the content of the seq_file & by definition their v argument is non-NULL (else seq_file would have finished iterating already). That argument came from either start() or an earlier call to next(), which would only have returned a non-NULL pointer into tgid_map if tgid_map is non-NULL.
I've added comments to this effect in v2, though the second patch in v2 does wind up effectively adding back the check in next() anyway in order to reuse some code.
I was tempted to just add the redundant checks anyway (pick your battles and all) but for show() in particular it wound up making things seem non-sensical to me ("display the value describing this non-NULL pointer into tgid_map only if tgid_map is not NULL?").
Thanks, Paul
With that change: Reviewed-by: Joel Fernandes (Google) joel@joelfernandes.org
thanks,
-Joel
}
if (pid > PID_MAX_DEFAULT)
return NULL;
return NULL;
return &tgid_map[pid];
}
static void *saved_tgids_start(struct seq_file *m, loff_t *pos) {
void *v;
loff_t l = 0;
if (!tgid_map)
if (!tgid_map || *pos > PID_MAX_DEFAULT) return NULL;
v = &tgid_map[0];
while (l <= *pos) {
v = saved_tgids_next(m, v, &l);
if (!v)
return NULL;
}
return v;
return &tgid_map[*pos];
}
static void saved_tgids_stop(struct seq_file *m, void *v) @@ -5647,9 +5630,14 @@ static void saved_tgids_stop(struct seq_file *m, void *v)
static int saved_tgids_show(struct seq_file *m, void *v) {
int pid = (int *)v - tgid_map;
int *entry = (int *)v;
int pid = entry - tgid_map;
int tgid = *entry;
if (tgid == 0)
return SEQ_SKIP;
seq_printf(m, "%d %d\n", pid, trace_find_tgid(pid));
seq_printf(m, "%d %d\n", pid, tgid); return 0;
}
base-commit: 62fb9874f5da54fdb243003b386128037319b219
2.32.0.93.g670b81a890-goog
On Thu, Jul 1, 2021 at 1:32 PM Paul Burton paulburton@google.com wrote:
Hi Joel,
On Wed, Jun 30, 2021 at 06:29:55PM -0400, Joel Fernandes wrote:
On Tue, Jun 29, 2021 at 8:34 PM Paul Burton paulburton@google.com wrote:
The tgid_map array records a mapping from pid to tgid, where the index of an entry within the array is the pid & the value stored at that index is the tgid.
The saved_tgids_next() function iterates over pointers into the tgid_map array & dereferences the pointers which results in the tgid, but then it passes that dereferenced value to trace_find_tgid() which treats it as a pid & does a further lookup within the tgid_map array. It seems likely that the intent here was to skip over entries in tgid_map for which the recorded tgid is zero, but instead we end up skipping over entries for which the thread group leader hasn't yet had its own tgid recorded in tgid_map.
A minimal fix would be to remove the call to trace_find_tgid, turning:
if (trace_find_tgid(*ptr))
into:
if (*ptr)
..but it seems like this logic can be much simpler if we simply let seq_read() iterate over the whole tgid_map array & filter out empty entries by returning SEQ_SKIP from saved_tgids_show(). Here we take that approach, removing the incorrect logic here entirely.
Looks reasonable except for one nit:
Signed-off-by: Paul Burton paulburton@google.com Fixes: d914ba37d714 ("tracing: Add support for recording tgid of tasks") Cc: Steven Rostedt rostedt@goodmis.org Cc: Ingo Molnar mingo@redhat.com Cc: Joel Fernandes joelaf@google.com Cc: stable@vger.kernel.org
kernel/trace/trace.c | 38 +++++++++++++------------------------- 1 file changed, 13 insertions(+), 25 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index d23a09d3eb37b..9570667310bcc 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -5608,37 +5608,20 @@ static const struct file_operations tracing_readme_fops = {
static void *saved_tgids_next(struct seq_file *m, void *v, loff_t *pos) {
int *ptr = v;
int pid = ++(*pos);
if (*pos || m->count)
ptr++;
(*pos)++;
for (; ptr <= &tgid_map[PID_MAX_DEFAULT]; ptr++) {
if (trace_find_tgid(*ptr))
return ptr;
It would be great if you can add back the check for !tgid_map to both next() and show() as well, for added robustness (since the old code previously did it).
That condition cannot happen, because both next() & show() are called to iterate through the content of the seq_file & by definition their v argument is non-NULL (else seq_file would have finished iterating already). That argument came from either start() or an earlier call to next(), which would only have returned a non-NULL pointer into tgid_map if tgid_map is non-NULL.
Hmm, You do have a point. Alright then. You could add my Reviewed-by tag for this patch to subsequent postings.
thanks, -Joel
On Thu, 1 Jul 2021 10:31:59 -0700 Paul Burton paulburton@google.com wrote:
I was tempted to just add the redundant checks anyway (pick your battles and all) but for show() in particular it wound up making things seem non-sensical to me ("display the value describing this non-NULL pointer into tgid_map only if tgid_map is not NULL?").
I agree with your assessment, and will actually take your first patch, as I don't think the comment is that helpful, not to mention, we don't use '//' comments in the kernel, so that would have to be changed.
But for cases like this, I usually have something like:
if (WARN_ON_ONCE(!tgid_map)) return -1;
Because the logic is what makes tgid_map not being NULL, but as experience has taught me, the logic can sometimes be mistaken, at least as time goes by. And things that are protected by logic, deserve a WARN*() when it doesn't go as planned.
We can always add that later, if needed.
-- Steve
On Thu, Jul 1, 2021 at 2:07 PM Steven Rostedt rostedt@goodmis.org wrote:
On Thu, 1 Jul 2021 10:31:59 -0700 Paul Burton paulburton@google.com wrote:
I was tempted to just add the redundant checks anyway (pick your battles and all) but for show() in particular it wound up making things seem non-sensical to me ("display the value describing this non-NULL pointer into tgid_map only if tgid_map is not NULL?").
I agree with your assessment, and will actually take your first patch, as I don't think the comment is that helpful, not to mention, we don't use '//' comments in the kernel, so that would have to be changed.
But for cases like this, I usually have something like:
if (WARN_ON_ONCE(!tgid_map)) return -1;
Because the logic is what makes tgid_map not being NULL, but as experience has taught me, the logic can sometimes be mistaken, at least as time goes by. And things that are protected by logic, deserve a WARN*() when it doesn't go as planned.
We can always add that later, if needed.
Agreed, was thinking similar/same.
thanks,
-Joel
On Thu, Jul 01, 2021 at 02:07:54PM -0400, Steven Rostedt wrote:
On Thu, 1 Jul 2021 10:31:59 -0700 Paul Burton paulburton@google.com wrote:
I was tempted to just add the redundant checks anyway (pick your battles and all) but for show() in particular it wound up making things seem non-sensical to me ("display the value describing this non-NULL pointer into tgid_map only if tgid_map is not NULL?").
I agree with your assessment, and will actually take your first patch, as I don't think the comment is that helpful,
Thanks - agreed, the comment doesn't add much.
not to mention, we don't use '//' comments in the kernel, so that would have to be changed.
D'oh! Apparently a year away from the kernel melted my internal style checker. Interestingly though, checkpatch didn't complain about this as I would have expected...
Thanks, Paul
[ Added Joe Perches ]
On Thu, 1 Jul 2021 11:12:54 -0700 Paul Burton paulburton@google.com wrote:
not to mention, we don't use '//' comments in the kernel, so that would have to be changed.
D'oh! Apparently a year away from the kernel melted my internal style checker. Interestingly though, checkpatch didn't complain about this as I would have expected...
Joe, should the above be added to checkpatch?
I do understand that there are a few cases it's acceptable. Like for SPDX headers.
-- Steve
On 2021-07-01 11:26, Steven Rostedt wrote:
[ Added Joe Perches ]
On Thu, 1 Jul 2021 11:12:54 -0700 Paul Burton paulburton@google.com wrote:
not to mention, we don't use '//' comments in the kernel, so that would have to be changed.
D'oh! Apparently a year away from the kernel melted my internal style checker. Interestingly though, checkpatch didn't complain about this as I would have expected...
Joe, should the above be added to checkpatch?
I do understand that there are a few cases it's acceptable. Like for SPDX headers.
C99 comments are allowed since about 5 years ago.
And if you really want there's a checkpatch option to disable them.
On Thu, 01 Jul 2021 12:35:29 -0700 Joe Perches joe@perches.com wrote:
C99 comments are allowed since about 5 years ago.
Really, I thought Linus hated them. Personally, I find them rather ugly myself. The only user of them I see in the kernel/ directory appears to be for RCU. But Paul's on the C/C++ committee, so perhaps he favors them.
The net/ directory doesn't have any, except perhaps to comment out code (which I sometimes use it for that too).
The block/, arch/x86/ directories don't have them either.
I wouldn't go and change checkpatch, but I still rather avoid them, especially for multi line comments.
/* * When it comes to multi line comments I prefer using something * that denotes a start and an end to the comment, as it makes it * look like a nice clip of information. */
Instead of:
// When it comes to multi line comments I prefer using something // that denotes a start and an end to the comment, as it makes it // look like a nice clip of information.
Which just looks like noise. But hey, maybe that's just me because I find "*" as a sign of information and '//' something to ignore. ;-)
-- Steve
On 2021-07-01 12:51, Steven Rostedt wrote:
On Thu, 01 Jul 2021 12:35:29 -0700 Joe Perches joe@perches.com wrote:
C99 comments are allowed since about 5 years ago.
Really, I thought Linus hated them. Personally, I find them rather ugly myself. The only user of them I see in the kernel/ directory appears to be for RCU. But Paul's on the C/C++ committee, so perhaps he favors them.
The net/ directory doesn't have any, except perhaps to comment out code (which I sometimes use it for that too).
The block/, arch/x86/ directories don't have them either.
I wouldn't go and change checkpatch, but I still rather avoid them, especially for multi line comments.
/*
- When it comes to multi line comments I prefer using something
- that denotes a start and an end to the comment, as it makes it
- look like a nice clip of information.
*/
Instead of:
// When it comes to multi line comments I prefer using something // that denotes a start and an end to the comment, as it makes it // look like a nice clip of information.
Which just looks like noise. But hey, maybe that's just me because I find "*" as a sign of information and '//' something to ignore. ;-)
May I suggest using something other than an amber vt220?
On Thu, Jul 1, 2021 at 5:07 PM Joe Perches joe@perches.com wrote:
On 2021-07-01 12:51, Steven Rostedt wrote:
On Thu, 01 Jul 2021 12:35:29 -0700 Joe Perches joe@perches.com wrote:
C99 comments are allowed since about 5 years ago.
Really, I thought Linus hated them. Personally, I find them rather ugly myself. The only user of them I see in the kernel/ directory appears to be for RCU. But Paul's on the C/C++ committee, so perhaps he favors them.
The net/ directory doesn't have any, except perhaps to comment out code (which I sometimes use it for that too).
The block/, arch/x86/ directories don't have them either.
I wouldn't go and change checkpatch, but I still rather avoid them, especially for multi line comments.
/*
- When it comes to multi line comments I prefer using something
- that denotes a start and an end to the comment, as it makes it
- look like a nice clip of information.
*/
Instead of:
// When it comes to multi line comments I prefer using something // that denotes a start and an end to the comment, as it makes it // look like a nice clip of information.
Which just looks like noise. But hey, maybe that's just me because I find "*" as a sign of information and '//' something to ignore. ;-)
May I suggest using something other than an amber vt220?
Steve - mostly comments are to be ignored and the code is the ultimate source of truth ;-), so // is fine :-D
That said, don't discard the amber vt220 I recently sent you just because Joe says so ;-) <:o)
- Joel
linux-stable-mirror@lists.linaro.org