Allow debugfs override of sched_tick_max_deferment in order to ease finding/fixing the remaining issues with full nohz.
The value to be written is in jiffies, and -1 means the max deferment is disabled (scheduler_tick_max_deferment() returns KTIME_MAX.)
Cc: Frederic Weisbecker fweisbec@gmail.com
Signed-off-by: Kevin Hilman khilman@linaro.org
---
 kernel/sched/core.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5ac63c9a995a..4b1fe3e69fe4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2175,6 +2175,8 @@ void scheduler_tick(void)
 }
 
 #ifdef CONFIG_NO_HZ_FULL
+static u32 sched_tick_max_deferment = HZ;
+
 /**
  * scheduler_tick_max_deferment
  *
@@ -2193,13 +2195,25 @@ u64 scheduler_tick_max_deferment(void)
 	struct rq *rq = this_rq();
 	unsigned long next, now = ACCESS_ONCE(jiffies);
 
-	next = rq->last_sched_tick + HZ;
+	if (sched_tick_max_deferment == -1)
+		return KTIME_MAX;
+
+	next = rq->last_sched_tick + sched_tick_max_deferment;
 
 	if (time_before_eq(next, now))
 		return 0;
 
 	return jiffies_to_usecs(next - now) * NSEC_PER_USEC;
 }
+
+static __init int sched_nohz_full_init_debug(void)
+{
+	debugfs_create_u32("sched_tick_max_deferment", 0644, NULL,
+			   &sched_tick_max_deferment);
+
+	return 0;
+}
+late_initcall(sched_nohz_full_init_debug);
 #endif
 
 notrace unsigned long get_parent_ip(unsigned long addr)
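For readers following along outside the kernel tree, the logic of the patched function can be modelled in plain C. This is only a sketch under stated assumptions: the `sketch_` names, the fixed HZ of 100 and the 32-bit jiffies counters are made up for illustration, and the widening cast is included so the model itself cannot wrap. It is not the kernel code.

```c
#include <stdint.h>

/* Assumed values for this sketch only; the kernel takes these from its config. */
#define SKETCH_HZ            100u
#define SKETCH_NSEC_PER_SEC  1000000000ull
#define SKETCH_KTIME_MAX     ((uint64_t)INT64_MAX)	/* stands in for KTIME_MAX */

/*
 * Hypothetical userspace model of scheduler_tick_max_deferment().
 * last_tick and now are jiffies counters; max_deferment is the tunable,
 * expressed in jiffies as in the patch description.
 */
static uint64_t sketch_max_deferment_ns(uint32_t last_tick, uint32_t now,
					uint32_t max_deferment)
{
	uint32_t next;

	/* In C, -1 converted to a 32-bit unsigned value is 0xffffffff. */
	if (max_deferment == (uint32_t)-1)
		return SKETCH_KTIME_MAX;

	next = last_tick + max_deferment;

	/* Same idea as time_before_eq(next, now): wrap-safe signed difference. */
	if ((int32_t)(now - next) >= 0)
		return 0;

	/* Widen before multiplying so the nanosecond result cannot wrap. */
	return (uint64_t)(next - now) * (SKETCH_NSEC_PER_SEC / SKETCH_HZ);
}
```

With these assumed values, a last tick at jiffy 1000, a current time of 1040 and a deferment of 100 jiffies leaves 60 jiffies, i.e. 600,000,000 ns, before the next tick is required.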
The conversion of the max deferment from usecs to nsecs can easily overflow on platforms where a long is 32 bits. To fix, cast the usecs value to u64 before multiplying by NSEC_PER_USEC.
This was discovered on a 32-bit ARM platform when extending the max deferment value.
Cc: Frederic Weisbecker fweisbec@gmail.com
Signed-off-by: Kevin Hilman khilman@linaro.org
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4b1fe3e69fe4..3d7c80e1c4d9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2203,7 +2203,7 @@ u64 scheduler_tick_max_deferment(void)
 	if (time_before_eq(next, now))
 		return 0;
 
-	return jiffies_to_usecs(next - now) * NSEC_PER_USEC;
+	return (u64)jiffies_to_usecs(next - now) * NSEC_PER_USEC;
 }
 
 static __init int sched_nohz_full_init_debug(void)
On Tue, Dec 17, 2013 at 01:23:08PM -0800, Kevin Hilman wrote:
The conversion of the max deferment from usecs to nsecs can easily overflow on platforms where a long is 32 bits. To fix, cast the usecs value to u64 before multiplying by NSEC_PER_USEC.
This was discovered on a 32-bit ARM platform when extending the max deferment value.
Cc: Frederic Weisbecker fweisbec@gmail.com
Signed-off-by: Kevin Hilman khilman@linaro.org
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4b1fe3e69fe4..3d7c80e1c4d9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2203,7 +2203,7 @@ u64 scheduler_tick_max_deferment(void)
 	if (time_before_eq(next, now))
 		return 0;
 
-	return jiffies_to_usecs(next - now) * NSEC_PER_USEC;
+	return (u64)jiffies_to_usecs(next - now) * NSEC_PER_USEC;
Just to be sure I understand the issue: the problem is that jiffies_to_usecs() returns an unsigned int, which is then multiplied by NSEC_PER_USEC. If the result of the multiplication is too big to be stored in an unsigned int, it overflows and we may lose the high part of the result. Right?
 }
 
 static __init int sched_nohz_full_init_debug(void)
--
1.8.3
Frederic Weisbecker fweisbec@gmail.com writes:
Just to be sure I understand the issue: the problem is that jiffies_to_usecs() returns an unsigned int, which is then multiplied by NSEC_PER_USEC. If the result of the multiplication is too big to be stored in an unsigned int, it overflows and we may lose the high part of the result. Right?
Correct.
Kevin
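The truncation confirmed above can be reproduced in a few lines of standalone C. This is a sketch rather than kernel code: the `sketch_` names are hypothetical, and the 32-bit intermediate is written out explicitly to mimic what happens when both operands are 32 bits wide (in the kernel, NSEC_PER_USEC is, as far as I recall, defined as 1000L, so the product is only 32 bits wide where long is 32 bits).

```c
#include <stdint.h>

#define SKETCH_NSEC_PER_USEC 1000u	/* assumed stand-in for NSEC_PER_USEC */

/* Pre-fix behaviour on a 32-bit platform: the product wraps modulo 2^32. */
static uint64_t sketch_usecs_to_nsecs_narrow(uint32_t usecs)
{
	uint32_t product = usecs * SKETCH_NSEC_PER_USEC;

	return product;		/* widened only after the damage is done */
}

/* Post-fix behaviour: widen one operand first, then multiply in 64 bits. */
static uint64_t sketch_usecs_to_nsecs_wide(uint32_t usecs)
{
	return (uint64_t)usecs * SKETCH_NSEC_PER_USEC;
}
```

A deferment of one second (1,000,000 usecs) still fits, since 1,000,000,000 is below 2^32. A deferment of five seconds does not: the narrow version yields 705,032,704 ns instead of 5,000,000,000 ns, which is consistent with the bug only showing up once the max deferment was extended.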
On Tue, Dec 17, 2013 at 01:23:07PM -0800, Kevin Hilman wrote:
Allow debugfs override of sched_tick_max_deferment in order to ease finding/fixing the remaining issues with full nohz.
The value to be written is in jiffies, and -1 means the max deferment is disabled (scheduler_tick_max_deferment() returns KTIME_MAX.)
Cc: Frederic Weisbecker fweisbec@gmail.com
Signed-off-by: Kevin Hilman khilman@linaro.org
---
 kernel/sched/core.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5ac63c9a995a..4b1fe3e69fe4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2175,6 +2175,8 @@ void scheduler_tick(void)
 }
 
 #ifdef CONFIG_NO_HZ_FULL
+static u32 sched_tick_max_deferment = HZ;
+
 /**
  * scheduler_tick_max_deferment
  *
@@ -2193,13 +2195,25 @@ u64 scheduler_tick_max_deferment(void)
 	struct rq *rq = this_rq();
 	unsigned long next, now = ACCESS_ONCE(jiffies);
 
-	next = rq->last_sched_tick + HZ;
+	if (sched_tick_max_deferment == -1)
+		return KTIME_MAX;
+
+	next = rq->last_sched_tick + sched_tick_max_deferment;
 
 	if (time_before_eq(next, now))
 		return 0;
 
 	return jiffies_to_usecs(next - now) * NSEC_PER_USEC;
 }
+
+static __init int sched_nohz_full_init_debug(void)
+{
+	debugfs_create_u32("sched_tick_max_deferment", 0644, NULL,
+			   &sched_tick_max_deferment);
+
+	return 0;
+}
+late_initcall(sched_nohz_full_init_debug);
If the goal is mostly to turn off sched_tick_max_deferment (set to -1), we should perhaps make it a boolean sched feature (see kernel/sched/features.h) as it's a pretty well consolidated interface.
 #endif
 
 notrace unsigned long get_parent_ip(unsigned long addr)
--
1.8.3
Frederic Weisbecker fweisbec@gmail.com writes:
If the goal is mostly to turn off sched_tick_max_deferment (set to -1), we should perhaps make it a boolean sched feature (see kernel/sched/features.h) as it's a pretty well consolidated interface.
Well, I suspect folks may want to set it to various values, depending on workload to experiment with the results.
Also, my first attempt was to add control over this via sysctl[1] (though not sched_features) and you suggested[2] I use debugfs instead since this should be a temporary hack until we can remove the 1Hz residual tick.
Kevin
[1] http://marc.info/?l=linux-kernel&m=137159992306877&w=2
[2] http://marc.info/?l=linux-kernel&m=137166737830821&w=2
On Mon, Jan 06, 2014 at 10:37:27AM -0800, Kevin Hilman wrote:
Frederic Weisbecker fweisbec@gmail.com writes:
If the goal is mostly to turn off sched_tick_max_deferment (set to -1), we should perhaps make it a boolean sched feature (see kernel/sched/features.h) as it's a pretty well consolidated interface.
Well, I suspect folks may want to set it to various values, depending on workload to experiment with the results.
Another option is to add an integer file in sched_features/ debugfs directory. But all other files there are boolean, so that wouldn't integrate there very well.
One of the things I would like to try is to offload sched_class(current[$CPU])::scheduler_tick() to the timekeeper or any housekeeping CPU.
So the housekeeper could handle the periodic tick on behalf of full dynticks CPUs. And then being able to tune the frequency of this sounds interesting.
So yeah, having a tunable integer makes sense after all.
Also, my first attempt was to add control over this via sysctl[1] (though not sched_features) and you suggested[2] I use debugfs instead since this should be a temporary hack until we can remove the 1Hz residual tick.
Right, but SCHED_FEAT are debugfs :) And I thought we could either reuse it or reuse the sched feature debugfs directory. But again I realize it's all made of bool values so it's not very welcoming for consistency.
Anyway, thinking about it more, perhaps we should actually use your patch that uses sysctl, since the rest of the scheduler does that for tunable numbers.
Now since it's sysctl, I'm kind of more picky about correctness limits: what if people set high values, thinking the kernel can handle them just fine, while it obviously can't yet? Should we ignore values that go too far? And how do we draw the line?
Thoughts?
Kevin
[1] http://marc.info/?l=linux-kernel&m=137159992306877&w=2
[2] http://marc.info/?l=linux-kernel&m=137166737830821&w=2
Frederic Weisbecker fweisbec@gmail.com writes:
On Mon, Jan 06, 2014 at 10:37:27AM -0800, Kevin Hilman wrote:
Frederic Weisbecker fweisbec@gmail.com writes:
If the goal is mostly to turn off sched_tick_max_deferment (set to -1), we should perhaps make it a boolean sched feature (see kernel/sched/features.h) as it's a pretty well consolidated interface.
Well, I suspect folks may want to set it to various values, depending on workload to experiment with the results.
Another option is to add an integer file in sched_features/ debugfs directory. But all other files there are boolean, so that wouldn't integrate there very well.
One of the things I would like to try is to offload sched_class(current[$CPU])::scheduler_tick() to the timekeeper or any housekeeping CPU.
So the housekeeper could handle the periodic tick on behalf of full dynticks CPUs. And then being able to tune the frequency of this sounds interesting.
So yeah, having a tunable integer makes sense after all.
Also, my first attempt was to add control over this via sysctl[1] (though not sched_features) and you suggested[2] I use debugfs instead since this should be a temporary hack until we can remove the 1Hz residual tick.
Right, but SCHED_FEAT are debugfs :) And I thought we could either reuse it or reuse the sched feature debugfs directory. But again I realize it's all made of bool values so it's not very welcoming for consistency.
Anyway, thinking about it more, perhaps we should actually use your patch that uses sysctl, since the rest of the scheduler does that for tunable numbers.
Now since it's sysctl, I'm kind of more picky about correctness limits: what if people set high values, thinking the kernel can handle them just fine, while it obviously can't yet? Should we ignore values that go too far? And how do we draw the line?
Thoughts?
Well, since I think it's more of a debug feature, I don't think we should draw a line. For example, the current patch lets you disable it completely by setting it to -1. Surely this will break some use cases, but breaking things is kinda the point, so we are better able to see where the problems are.
Also, you previously made the case that it shouldn't be a sysctl option since this will (hopefully) be going away in the not too distant future. Isn't that still the case?
Kevin