On 64-bit platforms, reads/writes of the various cpustat fields are atomic due to native 64-bit loads/stores. However, on non-64-bit platforms, reads/writes of the cpustat fields are not atomic and can lead to inconsistent statistics.
This problem was originally reported by Frederic Weisbecker as a 64-bit limitation with the nsec-granularity cputime accounting for full dynticks, but we then realized that it is a problem that has been around for a while and is not specific to the new cputime accounting.
This series fixes this by first converting all access to the cputime fields to use accessor functions, and then converting the accessor functions to use the atomic64 functions.
Implemented based on an idea proposed by Frederic Weisbecker.
Kevin Hilman (2):
  cpustat: use accessor functions for get/set/add
  cpustat: convert to atomic operations
 arch/s390/appldata/appldata_os.c   | 16 +++++++--------
 drivers/cpufreq/cpufreq_governor.c | 18 ++++++++---------
 drivers/cpufreq/cpufreq_ondemand.c |  2 +-
 drivers/macintosh/rack-meter.c     |  6 +++---
 fs/proc/stat.c                     | 40 +++++++++++++++++++-------------------
 fs/proc/uptime.c                   |  2 +-
 include/linux/kernel_stat.h        | 11 ++++++++++-
 kernel/sched/core.c                | 12 +++++-------
 kernel/sched/cputime.c             | 29 +++++++++++++--------------
 9 files changed, 70 insertions(+), 66 deletions(-)
Add some accessor functions in order to facilitate the conversion to atomic reads/writes of cpustat values.
Signed-off-by: Kevin Hilman khilman@linaro.org
---
 arch/s390/appldata/appldata_os.c   | 16 +++++++--------
 drivers/cpufreq/cpufreq_governor.c | 18 ++++++++---------
 drivers/cpufreq/cpufreq_ondemand.c |  2 +-
 drivers/macintosh/rack-meter.c     |  6 +++---
 fs/proc/stat.c                     | 40 +++++++++++++++++++-------------------
 fs/proc/uptime.c                   |  2 +-
 include/linux/kernel_stat.h        |  7 ++++++-
 kernel/sched/core.c                | 12 +++++-------
 kernel/sched/cputime.c             | 29 +++++++++++++--------------
 9 files changed, 66 insertions(+), 66 deletions(-)
diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c index 87521ba..eff76f8 100644 --- a/arch/s390/appldata/appldata_os.c +++ b/arch/s390/appldata/appldata_os.c @@ -113,21 +113,21 @@ static void appldata_get_os_data(void *data) j = 0; for_each_online_cpu(i) { os_data->os_cpu[j].per_cpu_user = - cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_USER]); + cputime_to_jiffies(kcpustat_cpu_get(i, CPUTIME_USER)); os_data->os_cpu[j].per_cpu_nice = - cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_NICE]); + cputime_to_jiffies(kcpustat_cpu_get(i, CPUTIME_NICE)); os_data->os_cpu[j].per_cpu_system = - cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM]); + cputime_to_jiffies(kcpustat_cpu_get(i, CPUTIME_SYSTEM)); os_data->os_cpu[j].per_cpu_idle = - cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_IDLE]); + cputime_to_jiffies(kcpustat_cpu_get(i, CPUTIME_IDLE)); os_data->os_cpu[j].per_cpu_irq = - cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_IRQ]); + cputime_to_jiffies(kcpustat_cpu_get(i, CPUTIME_IRQ)); os_data->os_cpu[j].per_cpu_softirq = - cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ]); + cputime_to_jiffies(kcpustat_cpu_get(i, CPUTIME_SOFTIRQ)); os_data->os_cpu[j].per_cpu_iowait = - cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_IOWAIT]); + cputime_to_jiffies(kcpustat_cpu_get(i, CPUTIME_IOWAIT)); os_data->os_cpu[j].per_cpu_steal = - cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_STEAL]); + cputime_to_jiffies(kcpustat_cpu_get(i, CPUTIME_STEAL)); os_data->os_cpu[j].cpu_id = i; j++; } diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c index 6c5f1d3..ec6c315 100644 --- a/drivers/cpufreq/cpufreq_governor.c +++ b/drivers/cpufreq/cpufreq_governor.c @@ -36,12 +36,12 @@ static inline u64 get_cpu_idle_time_jiffy(unsigned int cpu, u64 *wall)
cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
- busy_time = kcpustat_cpu(cpu).cpustat[CPUTIME_USER]; - busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_SYSTEM]; - busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_IRQ]; - busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_SOFTIRQ]; - busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_STEAL]; - busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_NICE]; + busy_time = kcpustat_cpu_get(cpu, CPUTIME_USER); + busy_time += kcpustat_cpu_get(cpu, CPUTIME_SYSTEM); + busy_time += kcpustat_cpu_get(cpu, CPUTIME_IRQ); + busy_time += kcpustat_cpu_get(cpu, CPUTIME_SOFTIRQ); + busy_time += kcpustat_cpu_get(cpu, CPUTIME_STEAL); + busy_time += kcpustat_cpu_get(cpu, CPUTIME_NICE);
idle_time = cur_wall_time - busy_time; if (wall) @@ -103,7 +103,7 @@ void dbs_check_cpu(struct dbs_data *dbs_data, int cpu) u64 cur_nice; unsigned long cur_nice_jiffies;
- cur_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE] - + cur_nice = kcpustat_cpu_get(j, CPUTIME_NICE) - cdbs->prev_cpu_nice; /* * Assumption: nice time between sampling periods will @@ -113,7 +113,7 @@ void dbs_check_cpu(struct dbs_data *dbs_data, int cpu) cputime64_to_jiffies64(cur_nice);
cdbs->prev_cpu_nice = - kcpustat_cpu(j).cpustat[CPUTIME_NICE]; + kcpustat_cpu_get(j, CPUTIME_NICE); idle_time += jiffies_to_usecs(cur_nice_jiffies); }
@@ -216,7 +216,7 @@ int cpufreq_governor_dbs(struct dbs_data *dbs_data, &j_cdbs->prev_cpu_wall); if (ignore_nice) j_cdbs->prev_cpu_nice = - kcpustat_cpu(j).cpustat[CPUTIME_NICE]; + kcpustat_cpu_get(j, CPUTIME_NICE); }
/* diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c index 7731f7c..ac5d49f 100644 --- a/drivers/cpufreq/cpufreq_ondemand.c +++ b/drivers/cpufreq/cpufreq_ondemand.c @@ -403,7 +403,7 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b, &dbs_info->cdbs.prev_cpu_wall); if (od_tuners.ignore_nice) dbs_info->cdbs.prev_cpu_nice = - kcpustat_cpu(j).cpustat[CPUTIME_NICE]; + kcpustat_cpu_get(j, CPUTIME_NICE);
} return count; diff --git a/drivers/macintosh/rack-meter.c b/drivers/macintosh/rack-meter.c index cad0e19..e799f3c 100644 --- a/drivers/macintosh/rack-meter.c +++ b/drivers/macintosh/rack-meter.c @@ -83,11 +83,11 @@ static inline cputime64_t get_cpu_idle_time(unsigned int cpu) { u64 retval;
- retval = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE] + - kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT]; + retval = kcpustat_cpu_get(cpu, CPUTIME_IDLE) + + kcpustat_cpu_get(cpu, CPUTIME_IOWAIT);
if (rackmeter_ignore_nice) - retval += kcpustat_cpu(cpu).cpustat[CPUTIME_NICE]; + retval += kcpustat_cpu_get(cpu, CPUTIME_NICE);
return retval; } diff --git a/fs/proc/stat.c b/fs/proc/stat.c index e296572..6a24276 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -25,7 +25,7 @@ static cputime64_t get_idle_time(int cpu) { cputime64_t idle;
- idle = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE]; + idle = kcpustat_cpu_get(cpu, CPUTIME_IDLE); if (cpu_online(cpu) && !nr_iowait_cpu(cpu)) idle += arch_idle_time(cpu); return idle; @@ -35,7 +35,7 @@ static cputime64_t get_iowait_time(int cpu) { cputime64_t iowait;
- iowait = kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT]; + iowait = kcpustat_cpu_get(cpu, CPUTIME_IOWAIT); if (cpu_online(cpu) && nr_iowait_cpu(cpu)) iowait += arch_idle_time(cpu); return iowait; @@ -52,7 +52,7 @@ static u64 get_idle_time(int cpu)
if (idle_time == -1ULL) /* !NO_HZ or cpu offline so we can rely on cpustat.idle */ - idle = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE]; + idle = kcpustat_cpu_get(cpu, CPUTIME_IDLE); else idle = usecs_to_cputime64(idle_time);
@@ -68,7 +68,7 @@ static u64 get_iowait_time(int cpu)
if (iowait_time == -1ULL) /* !NO_HZ or cpu offline so we can rely on cpustat.iowait */ - iowait = kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT]; + iowait = kcpustat_cpu_get(cpu, CPUTIME_IOWAIT); else iowait = usecs_to_cputime64(iowait_time);
@@ -95,16 +95,16 @@ static int show_stat(struct seq_file *p, void *v) jif = boottime.tv_sec;
for_each_possible_cpu(i) { - user += kcpustat_cpu(i).cpustat[CPUTIME_USER]; - nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE]; - system += kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM]; + user += kcpustat_cpu_get(i, CPUTIME_USER); + nice += kcpustat_cpu_get(i, CPUTIME_NICE); + system += kcpustat_cpu_get(i, CPUTIME_SYSTEM); idle += get_idle_time(i); iowait += get_iowait_time(i); - irq += kcpustat_cpu(i).cpustat[CPUTIME_IRQ]; - softirq += kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ]; - steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; - guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; - guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; + irq += kcpustat_cpu_get(i, CPUTIME_IRQ); + softirq += kcpustat_cpu_get(i, CPUTIME_SOFTIRQ); + steal += kcpustat_cpu_get(i, CPUTIME_STEAL); + guest += kcpustat_cpu_get(i, CPUTIME_GUEST); + guest_nice += kcpustat_cpu_get(i, CPUTIME_GUEST_NICE); sum += kstat_cpu_irqs_sum(i); sum += arch_irq_stat_cpu(i);
@@ -132,16 +132,16 @@ static int show_stat(struct seq_file *p, void *v)
for_each_online_cpu(i) { /* Copy values here to work around gcc-2.95.3, gcc-2.96 */ - user = kcpustat_cpu(i).cpustat[CPUTIME_USER]; - nice = kcpustat_cpu(i).cpustat[CPUTIME_NICE]; - system = kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM]; + user = kcpustat_cpu_get(i, CPUTIME_USER); + nice = kcpustat_cpu_get(i, CPUTIME_NICE); + system = kcpustat_cpu_get(i, CPUTIME_SYSTEM); idle = get_idle_time(i); iowait = get_iowait_time(i); - irq = kcpustat_cpu(i).cpustat[CPUTIME_IRQ]; - softirq = kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ]; - steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; - guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; - guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; + irq = kcpustat_cpu_get(i, CPUTIME_IRQ); + softirq = kcpustat_cpu_get(i, CPUTIME_SOFTIRQ); + steal = kcpustat_cpu_get(i, CPUTIME_STEAL); + guest = kcpustat_cpu_get(i, CPUTIME_GUEST); + guest_nice = kcpustat_cpu_get(i, CPUTIME_GUEST_NICE); seq_printf(p, "cpu%d", i); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice)); diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c index 9610ac7..fd9c93d 100644 --- a/fs/proc/uptime.c +++ b/fs/proc/uptime.c @@ -18,7 +18,7 @@ static int uptime_proc_show(struct seq_file *m, void *v)
idletime = 0; for_each_possible_cpu(i) - idletime += (__force u64) kcpustat_cpu(i).cpustat[CPUTIME_IDLE]; + idletime += (__force u64) kcpustat_cpu_get(i, CPUTIME_IDLE);
do_posix_clock_monotonic_gettime(&uptime); monotonic_to_bootbased(&uptime); diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index 66b7078..df8ad75 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -32,7 +32,7 @@ enum cpu_usage_stat { };
struct kernel_cpustat { - u64 cpustat[NR_STATS]; + u64 _cpustat[NR_STATS]; };
struct kernel_stat { @@ -51,6 +51,11 @@ DECLARE_PER_CPU(struct kernel_cpustat, kernel_cpustat); #define kcpustat_this_cpu (&__get_cpu_var(kernel_cpustat)) #define kstat_cpu(cpu) per_cpu(kstat, cpu) #define kcpustat_cpu(cpu) per_cpu(kernel_cpustat, cpu) +#define kcpustat_cpu_get(cpu, i) (kcpustat_cpu(cpu)._cpustat[i]) +#define kcpustat_cpu_set(cpu, i, val) (kcpustat_cpu(cpu)._cpustat[i] = (val)) +#define kcpustat_cpu_add(cpu, i, val) (kcpustat_cpu(cpu)._cpustat[i] += (val)) +#define kcpustat_this_cpu_set(i, val) (kcpustat_this_cpu->_cpustat[i] = (val)) +#define kcpustat_this_cpu_add(i, val) (kcpustat_this_cpu->_cpustat[i] += (val))
extern unsigned long long nr_context_switches(void);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 26058d0..4a8234c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -8082,19 +8082,17 @@ static int cpuacct_stats_show(struct cgroup *cgrp, struct cftype *cft, s64 val = 0;
for_each_online_cpu(cpu) { - struct kernel_cpustat *kcpustat = per_cpu_ptr(ca->cpustat, cpu); - val += kcpustat->cpustat[CPUTIME_USER]; - val += kcpustat->cpustat[CPUTIME_NICE]; + val += kcpustat_cpu_get(cpu, CPUTIME_USER); + val += kcpustat_cpu_get(cpu, CPUTIME_NICE); } val = cputime64_to_clock_t(val); cb->fill(cb, cpuacct_stat_desc[CPUACCT_STAT_USER], val);
val = 0; for_each_online_cpu(cpu) { - struct kernel_cpustat *kcpustat = per_cpu_ptr(ca->cpustat, cpu); - val += kcpustat->cpustat[CPUTIME_SYSTEM]; - val += kcpustat->cpustat[CPUTIME_IRQ]; - val += kcpustat->cpustat[CPUTIME_SOFTIRQ]; + val += kcpustat_cpu_get(cpu, CPUTIME_SYSTEM); + val += kcpustat_cpu_get(cpu, CPUTIME_IRQ); + val += kcpustat_cpu_get(cpu, CPUTIME_SOFTIRQ); }
val = cputime64_to_clock_t(val); diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 293b202..a4c0594 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -124,7 +124,7 @@ static inline void task_group_account_field(struct task_struct *p, int index, * is the only cgroup, then nothing else should be necessary. * */ - __get_cpu_var(kernel_cpustat).cpustat[index] += tmp; + kcpustat_this_cpu_add(index, tmp);
#ifdef CONFIG_CGROUP_CPUACCT if (unlikely(!cpuacct_subsys.active)) @@ -175,8 +175,6 @@ void account_user_time(struct task_struct *p, cputime_t cputime, static void account_guest_time(struct task_struct *p, cputime_t cputime, cputime_t cputime_scaled) { - u64 *cpustat = kcpustat_this_cpu->cpustat; - /* Add guest time to process. */ p->utime += cputime; p->utimescaled += cputime_scaled; @@ -185,11 +183,12 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
/* Add guest time to cpustat. */ if (TASK_NICE(p) > 0) { - cpustat[CPUTIME_NICE] += (__force u64) cputime; - cpustat[CPUTIME_GUEST_NICE] += (__force u64) cputime; + kcpustat_this_cpu_add(CPUTIME_NICE, (__force u64) cputime); + kcpustat_this_cpu_add(CPUTIME_GUEST_NICE, + (__force u64) cputime); } else { - cpustat[CPUTIME_USER] += (__force u64) cputime; - cpustat[CPUTIME_GUEST] += (__force u64) cputime; + kcpustat_this_cpu_add(CPUTIME_USER, (__force u64) cputime); + kcpustat_this_cpu_add(CPUTIME_GUEST, (__force u64) cputime); } }
@@ -249,9 +248,7 @@ void account_system_time(struct task_struct *p, int hardirq_offset, */ void account_steal_time(cputime_t cputime) { - u64 *cpustat = kcpustat_this_cpu->cpustat; - - cpustat[CPUTIME_STEAL] += (__force u64) cputime; + kcpustat_this_cpu_add(CPUTIME_STEAL, (__force u64) cputime); }
/* @@ -260,13 +257,12 @@ void account_steal_time(cputime_t cputime) */ void account_idle_time(cputime_t cputime) { - u64 *cpustat = kcpustat_this_cpu->cpustat; struct rq *rq = this_rq();
if (atomic_read(&rq->nr_iowait) > 0) - cpustat[CPUTIME_IOWAIT] += (__force u64) cputime; + kcpustat_this_cpu_add(CPUTIME_IOWAIT, (__force u64) cputime); else - cpustat[CPUTIME_IDLE] += (__force u64) cputime; + kcpustat_this_cpu_add(CPUTIME_IDLE, (__force u64) cputime); }
static __always_inline bool steal_account_process_tick(void) @@ -344,15 +340,16 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick, struct rq *rq) { cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy); - u64 *cpustat = kcpustat_this_cpu->cpustat;
if (steal_account_process_tick()) return;
if (irqtime_account_hi_update()) { - cpustat[CPUTIME_IRQ] += (__force u64) cputime_one_jiffy; + kcpustat_this_cpu_add(CPUTIME_IRQ, + (__force u64) cputime_one_jiffy); } else if (irqtime_account_si_update()) { - cpustat[CPUTIME_SOFTIRQ] += (__force u64) cputime_one_jiffy; + kcpustat_this_cpu_add(CPUTIME_SOFTIRQ, + (__force u64) cputime_one_jiffy); } else if (this_cpu_ksoftirqd() == p) { /* * ksoftirqd time do not get accounted in cpu_softirq_time.
On Fri, Feb 22, 2013 at 11:26 AM, Kevin Hilman khilman@linaro.org wrote:
Add some accessor functions in order to facilitate the conversion to atomic reads/writes of cpustat values.
Signed-off-by: Kevin Hilman khilman@linaro.org
drivers/cpufreq/cpufreq_governor.c | 18 ++++++++--------- drivers/cpufreq/cpufreq_ondemand.c | 2 +-
For cpufreq:
Acked-by: Viresh Kumar viresh.kumar@linaro.org
Though i believe you also need this:
diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index 64ef737..38e3ad7 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -242,7 +242,7 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b,
 						&dbs_info->cdbs.prev_cpu_wall);
 		if (cs_tuners.ignore_nice)
 			dbs_info->cdbs.prev_cpu_nice =
-				kcpustat_cpu(j).cpustat[CPUTIME_NICE];
+				kcpustat_cpu_get(j, CPUTIME_NICE);
 	}
 	return count;
 }
BTW, i don't see kcpustat_cpu() used in
kernel/sched/core.c | 12 +++++------- kernel/sched/cputime.c | 29 +++++++++++++--------------
I searched tip/master as well as lnext/master.
On Fri, Feb 22, 2013 at 11:51 AM, Viresh Kumar viresh.kumar@linaro.org wrote:
BTW, i don't see kcpustat_cpu() used in
kernel/sched/core.c | 12 +++++------- kernel/sched/cputime.c | 29 +++++++++++++--------------
I searched tip/master as well as lnext/master.
Added by Frederic's Adaptive NOHZ patchset?
On 22 February 2013 12:47, Amit Kucheria amit.kucheria@linaro.org wrote:
On Fri, Feb 22, 2013 at 11:51 AM, Viresh Kumar viresh.kumar@linaro.org wrote:
BTW, i don't see kcpustat_cpu() used in
kernel/sched/core.c | 12 +++++------- kernel/sched/cputime.c | 29 +++++++++++++--------------
I searched tip/master as well as lnext/master.
Added by Frederic's Adaptive NOHZ patchset?
I don't even see them on our unused-nohz-adaptive-tickless-v2 branch :) Maybe some other latest work.
On Fri, Feb 22, 2013 at 12:47:33PM +0530, Amit Kucheria wrote:
On Fri, Feb 22, 2013 at 11:51 AM, Viresh Kumar viresh.kumar@linaro.org wrote:
BTW, i don't see kcpustat_cpu() used in
kernel/sched/core.c | 12 +++++------- kernel/sched/cputime.c | 29 +++++++++++++--------------
I searched tip/master as well as lnext/master.
Added by Frederic's Adaptive NOHZ patchset?
No, Kevin has been using kcpustat_this_cpu_set() from kernel/sched/cputime.c. That's because it's always written locally.
Viresh Kumar viresh.kumar@linaro.org writes:
For cpufreq:
Acked-by: Viresh Kumar viresh.kumar@linaro.org
Though i believe you also need this:
diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index 64ef737..38e3ad7 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -242,7 +242,7 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b,
 						&dbs_info->cdbs.prev_cpu_wall);
 		if (cs_tuners.ignore_nice)
 			dbs_info->cdbs.prev_cpu_nice =
-				kcpustat_cpu(j).cpustat[CPUTIME_NICE];
+				kcpustat_cpu_get(j, CPUTIME_NICE);
 	}
 	return count;
 }
Thanks, I missed that one. I've added it in locally.
Kevin
Looks good, just a minor nit:
On Thu, Feb 21, 2013 at 09:56:43PM -0800, Kevin Hilman wrote:
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 66b7078..df8ad75 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -32,7 +32,7 @@ enum cpu_usage_stat {
 };

 struct kernel_cpustat {
-	u64 cpustat[NR_STATS];
+	u64 _cpustat[NR_STATS];
Why do you need the underscore?
On Fri, 2013-02-22 at 14:38 +0100, Frederic Weisbecker wrote:
Looks good, just a minor nit:
On Thu, Feb 21, 2013 at 09:56:43PM -0800, Kevin Hilman wrote:
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 66b7078..df8ad75 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -32,7 +32,7 @@ enum cpu_usage_stat {
 };

 struct kernel_cpustat {
-	u64 cpustat[NR_STATS];
+	u64 _cpustat[NR_STATS];
Why do you need the underscore?
To make sure there's nobody still referencing the data directly instead of through the accessors.
On Fri, Feb 22, 2013 at 02:58:51PM +0100, Peter Zijlstra wrote:
On Fri, 2013-02-22 at 14:38 +0100, Frederic Weisbecker wrote:
Looks good, just a minor nit:
On Thu, Feb 21, 2013 at 09:56:43PM -0800, Kevin Hilman wrote:
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 66b7078..df8ad75 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -32,7 +32,7 @@ enum cpu_usage_stat {
 };

 struct kernel_cpustat {
-	u64 cpustat[NR_STATS];
+	u64 _cpustat[NR_STATS];
Why do you need the underscore?
To make sure there's nobody still referencing the data directly instead of through the accessors.
Good point!
For non-64-bit platforms, convert the cpustat fields to atomic64 type so that reads and updates of cpustat are atomic on those platforms as well.
For 64-bit platforms, the cpustat fields are left as u64 because on 64-bit, using atomic64_add would add the overhead of a locked operation. We could also have used atomic64_set(atomic64_read() + delta), but on 32-bit platforms using the generic 64-bit ops (lib/atomic64.c), that results in taking a lock twice.
Signed-off-by: Kevin Hilman khilman@linaro.org
---
 include/linux/kernel_stat.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index df8ad75..a433f87 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -32,7 +32,11 @@ enum cpu_usage_stat {
 };

 struct kernel_cpustat {
+#ifdef CONFIG_64BIT
 	u64 _cpustat[NR_STATS];
+#else
+	atomic64_t _cpustat[NR_STATS];
+#endif
 };

 struct kernel_stat {

@@ -51,11 +55,23 @@ DECLARE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
 #define kcpustat_this_cpu (&__get_cpu_var(kernel_cpustat))
 #define kstat_cpu(cpu) per_cpu(kstat, cpu)
 #define kcpustat_cpu(cpu) per_cpu(kernel_cpustat, cpu)
+#ifdef CONFIG_64BIT
 #define kcpustat_cpu_get(cpu, i) (kcpustat_cpu(cpu)._cpustat[i])
 #define kcpustat_cpu_set(cpu, i, val) (kcpustat_cpu(cpu)._cpustat[i] = (val))
 #define kcpustat_cpu_add(cpu, i, val) (kcpustat_cpu(cpu)._cpustat[i] += (val))
 #define kcpustat_this_cpu_set(i, val) (kcpustat_this_cpu->_cpustat[i] = (val))
 #define kcpustat_this_cpu_add(i, val) (kcpustat_this_cpu->_cpustat[i] += (val))
+#else
+#define kcpustat_cpu_get(cpu, i) atomic64_read(&kcpustat_cpu(cpu)._cpustat[i])
+#define kcpustat_cpu_set(cpu, i, val) \
+	atomic64_set(val, &kcpustat_cpu(cpu)._cpustat[i])
+#define kcpustat_cpu_add(cpu, i, val) \
+	atomic64_add(val, &kcpustat_cpu(cpu)._cpustat[i])
+#define kcpustat_this_cpu_set(i, val) \
+	atomic64_set(val, &kcpustat_this_cpu->_cpustat[i])
+#define kcpustat_this_cpu_add(i, val) \
+	atomic64_add(val, &kcpustat_this_cpu->_cpustat[i])
+#endif
extern unsigned long long nr_context_switches(void);
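To make the tradeoff described in the changelog concrete, here is a sketch (not part of the patch; the field name and index mirror the patch above, and delta is a placeholder) of the two update styles being compared:

	/* One call into the generic 64-bit ops (lib/atomic64.c) on 32-bit
	 * platforms without native 64-bit atomics, i.e. one spinlock
	 * acquire/release, and the update stays a single atomic RMW: */
	atomic64_add(delta, &kcpustat_this_cpu->_cpustat[i]);

	/* Read-modify-write done by hand: two calls into the generic ops,
	 * so the lock is taken twice, and a concurrent update between the
	 * read and the set can be lost: */
	atomic64_set(&kcpustat_this_cpu->_cpustat[i],
		     atomic64_read(&kcpustat_this_cpu->_cpustat[i]) + delta);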
On Thu, 2013-02-21 at 21:56 -0800, Kevin Hilman wrote:
On 64-bit platforms, reads/writes of the various cpustat fields are atomic due to native 64-bit loads/stores. However, on non-64-bit platforms, reads/writes of the cpustat fields are not atomic and can lead to inconsistent statistics.
Which is a problem how?
This problem was originally reported by Frederic Weisbecker as a 64-bit limitation with the nsec-granularity cputime accounting for full dynticks, but we then realized that it is a problem that has been around for a while and is not specific to the new cputime accounting.
This series fixes this by first converting all access to the cputime fields to use accessor functions, and then converting the accessor functions to use the atomic64 functions.
Argh!! at what cost? 64bit atomics are like expensive. Wouldn't adding a seqlock be saner?
On Fri, Feb 22, 2013 at 09:46:07AM +0100, Peter Zijlstra wrote:
On Thu, 2013-02-21 at 21:56 -0800, Kevin Hilman wrote:
On 64-bit platforms, reads/writes of the various cpustat fields are atomic due to native 64-bit loads/stores. However, on non-64-bit platforms, reads/writes of the cpustat fields are not atomic and can lead to inconsistent statistics.
Which is a problem how?
So here is a possible scenario, CPU 0 reads a kcpustat value, and CPU 1 writes it at the same time:
//Initial value of "cpustat" is 0xffffffff

   == CPU 0 ==                      == CPU 1 ==

   //load low part
   mov %eax, [cpustat]
                                    inc [cpustat]
                                    //Update the high part if necessary
                                    jnc 1f
                                    inc [cpustat + 4]
                                    1:
   //load high part
   mov %edx, [cpustat + 4]
Afterward, CPU 0 will think the value is 0x1ffffffff while it's actually 0x100000000.
atomic64_read() and atomic64_set() are supposed to take care of that, without even the need for _inc() or _add() parts that use LOCK.
This problem was originally reported by Frederic Weisbecker as a 64-bit limitation with the nsec-granularity cputime accounting for full dynticks, but we then realized that it is a problem that has been around for a while and is not specific to the new cputime accounting.
This series fixes this by first converting all access to the cputime fields to use accessor functions, and then converting the accessor functions to use the atomic64 functions.
Argh!! at what cost? 64bit atomics are like expensive. Wouldn't adding a seqlock be saner?
Not sure. This requires a spinlock in the write side which is called from fast path like the timer interrupt.
* Peter Zijlstra peterz@infradead.org wrote:
On Fri, 2013-02-22 at 13:50 +0100, Frederic Weisbecker wrote:
atomic64_read() and atomic64_set() are supposed to take care of that, without even the need for _inc() or _add() parts that use LOCK.
Are you sure? Generally atomic*_set() is not actually an atomic operation.
as per Documentation/atomic_ops.txt:
#define atomic_read(v) ((v)->counter)
which simply reads the counter value currently visible to the calling thread. The read is atomic in that the return value is guaranteed to be one of the values initialized or modified with the interface operations if a proper implicit or explicit memory barrier is used after possible runtime initialization by any other thread and the value is modified only with the interface operations.
...
Properly aligned pointers, longs, ints, and chars (and unsigned equivalents) may be atomically loaded from and stored to in the same sense as described for atomic_read() and atomic_set().
The ACCESS_ONCE() macro should be used to prevent the compiler from using optimizations that might otherwise optimize accesses out of existence on the one hand, or that might create unsolicited accesses on the other.
This is usually a side effect of M[O]ESI cache coherency protocols - you can only get a 'split' word access if the word crosses cache line boundaries (multiples of 32 bytes, generally).
Thanks,
Ingo
On Fri, 2013-02-22 at 14:54 +0100, Ingo Molnar wrote:
* Peter Zijlstra peterz@infradead.org wrote:
On Fri, 2013-02-22 at 13:50 +0100, Frederic Weisbecker wrote:
atomic64_read() and atomic64_set() are supposed to take care of that, without even the need for _inc() or _add() parts that use LOCK.
Are you sure? Generally atomic*_set() is not actually an atomic operation.
as per Documentation/atomic_ops.txt:
I think the interesting part is:
"The setting is atomic in that the return values of the atomic operations by all threads are guaranteed to be correct reflecting either the value that has been set with this operation or set with another operation. A proper implicit or explicit memory barrier is needed before the value set with the operation is guaranteed to be readable with atomic_read from another thread."
Which does give us the wanted guarantee, however:
I checked arch/x86/include/asm/atomic64_32.h and we use cmpxchg8b for everything from _set() to _read(), which translates into 'horridly stupendifyingly slow' for a number of machines, but coherent.
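To illustrate why that is expensive (a sketch of the idea only, not the kernel's actual x86-32 code): a 64-bit atomic load can be emulated with a 64-bit compare-and-exchange, so every plain read pays for a locked cmpxchg8b.

	/* Sketch only: emulate a 64-bit atomic load via compare-and-exchange.
	 * atomic64_cmpxchg() returns the value that was in memory whether or
	 * not the exchange happened; "exchanging" 0 for 0 either fails (the
	 * value was non-zero) or rewrites the same 0, so the counter is left
	 * unchanged and the returned value is an atomic snapshot. */
	static inline long long atomic64_read_sketch(atomic64_t *v)
	{
		return atomic64_cmpxchg(v, 0, 0);	/* locked cmpxchg8b on x86-32 */
	}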
* Peter Zijlstra peterz@infradead.org wrote:
Which does give us the wanted guarantee, however:
I checked arch/x86/include/asm/atomic64_32.h and we use cmpxchg8b for everything from _set() to _read(), which translates into 'horridly stupendifyingly slow' for a number of machines, but coherent.
That's a valid concern - and cmpxchg8b is the only 64-bit op available on most 32-bit x86 CPUs which does not involve the FPU.
Wondering how significant this range of x86 problem boxes will be by the time any of these changes reaches upstream and distros - and how much 'horridly stupendifyingly slow' is in terms of cycles expended.
Thanks,
Ingo
On Fri, 2013-02-22 at 15:16 +0100, Ingo Molnar wrote:
I checked arch/x86/include/asm/atomic64_32.h and we use cmpxchg8b for everything from _set() to _read(), which translates into 'horridly stupendifyingly slow' for a number of machines, but coherent.
That's a valid concern - and cmpxchg8b is the only 64-bit op available on most 32-bit x86 CPUs which does not involve the FPU.
Wondering how significant this range of x86 problem boxes will be by the time any of these changes reaches upstream and distros - and how much 'horridly stupendifyingly slow' is in terms of cycles expended.
On the !x86 side of things we're implementing (generic) atomic64 using hashed spinlocks, so there too using a single spinlock around the entire data structure is a complete win.
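For reference, the generic fallback being referred to looks roughly like this (simplified sketch; lib/atomic64.c actually hashes the variable's address into a small array of spinlocks rather than using a single lock):

	/* Simplified sketch of the lib/atomic64.c style fallback: every
	 * atomic64 operation serializes on a spinlock, which is why one
	 * explicit lock around the whole cpustat array would cost no more. */
	static DEFINE_RAW_SPINLOCK(atomic64_fallback_lock);

	static void sketch_atomic64_add(long long a, atomic64_t *v)
	{
		unsigned long flags;

		raw_spin_lock_irqsave(&atomic64_fallback_lock, flags);
		v->counter += a;
		raw_spin_unlock_irqrestore(&atomic64_fallback_lock, flags);
	}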
On Fri, 2013-02-22 at 13:50 +0100, Frederic Weisbecker wrote:
Which is a problem how?
So here is a possible scenario, CPU 0 reads a kcpustat value, and CPU 1 writes it at the same time:
Afterward, CPU 0 will think the value is 0x1ffffffff while it's actually 0x100000000.
atomic64_read() and atomic64_set() are supposed to take care of that, without even the need for _inc() or _add() parts that use LOCK.
Sure I get that, but again, why is that a problem,.. who relies on these statistics that makes it a problem?
On Fri, Feb 22, 2013 at 03:05:39PM +0100, Peter Zijlstra wrote:
Sure I get that, but again, why is that a problem,.. who relies on these statistics that makes it a problem?
I guess we want to provide at least some minimal reliability in /proc/stat. I mean, we don't mind if the read is slightly off (reading stats from userspace is inherently racy anyway), but if it suddenly shows a bogus increase of 4 billion that disappears soon after, it looks like a bug to me.
On Fri, 2013-02-22 at 13:50 +0100, Frederic Weisbecker wrote:
Argh!! at what cost? 64bit atomics are like expensive. Wouldn't adding a seqlock be saner?
Not sure. This requires a spinlock in the write side which is called from fast path like the timer interrupt.
A single spinlock is _way_ cheaper than a ton of cmpxchg8b()s to update a few variables.
* Peter Zijlstra peterz@infradead.org wrote:
A single spinlock is _way_ cheaper than a ton of cmpxchg8b()s to update a few variables.
Every cmpxchg8b() will be roughly as expensive as a spinlock acquire+release fastpath.
Thanks,
Ingo
On Fri, 2013-02-22 at 15:09 +0100, Peter Zijlstra wrote:
A single spinlock is _way_ cheaper than a ton of cmpxchg8b()s to update a few variables.
We also have include/linux/u64_stats_sync.h since 2.6.36
On Fri, Feb 22, 2013 at 06:21:31AM -0800, Eric Dumazet wrote:
We also have include/linux/u64_stats_sync.h since 2.6.36
Interesting, we should probably use that instead.
Frederic Weisbecker fweisbec@gmail.com writes:
On Fri, Feb 22, 2013 at 06:21:31AM -0800, Eric Dumazet wrote:
We also have include/linux/u64_stats_sync.h since 2.6.36
Interesting, we should probably use that instead.
OK, I'll spin a version using the u64_stats interface.
Unfortunately, for 32-bit platforms that have atomic 64-bit loads/stores[1], u64_stats leads to some unnecessary overhead, but I'll look at possibly optimizing u64_stats for those platforms as a follow-up patch.
Kevin
[1] ARM >= v6k devices have ldrexd/strexd instructions for 64-bit loads/stores, which are used by the atomic64 accessors on those devices (cf. arch/arm/include/asm/atomic.h:atomic64_read()).
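For reference, a minimal sketch of what a u64_stats_sync-based variant of the patch 1 accessors might look like (this is not Kevin's actual follow-up; the _sketch names and the extra struct member are hypothetical):

	#include <linux/u64_stats_sync.h>

	/* Hypothetical layout: one sync sequence per kernel_cpustat. */
	struct kernel_cpustat {
		u64 cpustat[NR_STATS];
		struct u64_stats_sync sync;	/* compiles away on 64-bit */
	};

	/* Writer side (always local to the CPU being updated): */
	static inline void kcpustat_this_cpu_add_sketch(int i, u64 val)
	{
		struct kernel_cpustat *kcs = &__get_cpu_var(kernel_cpustat);

		u64_stats_update_begin(&kcs->sync);
		kcs->cpustat[i] += val;
		u64_stats_update_end(&kcs->sync);
	}

	/* Reader side (may run on another CPU): retry if a writer was active. */
	static inline u64 kcpustat_cpu_get_sketch(int cpu, int i)
	{
		struct kernel_cpustat *kcs = &per_cpu(kernel_cpustat, cpu);
		unsigned int seq;
		u64 val;

		do {
			seq = u64_stats_fetch_begin(&kcs->sync);
			val = kcs->cpustat[i];
		} while (u64_stats_fetch_retry(&kcs->sync, seq));

		return val;
	}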