This series adds new functions to the mmu interval notifier API to
allow device drivers with MMUs to dynamically mirror a process' page
tables based on device faults and invalidation callbacks. The Nouveau
driver is updated to use the extended API, and a set of stand-alone self-tests
is added to help validate and maintain correctness.
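For context, here is a rough sketch of the mirroring pattern this series
builds on (a sketch only: the notifier signatures follow the 5.5
mmu_interval_notifier API, while the driver structure and the device page
table handling are hypothetical):

#include <linux/mmu_notifier.h>
#include <linux/mutex.h>

struct demo_mirror {				/* hypothetical driver state */
	struct mmu_interval_notifier notifier;
	struct mutex lock;
};

static bool demo_invalidate(struct mmu_interval_notifier *mni,
			    const struct mmu_notifier_range *range,
			    unsigned long cur_seq)
{
	struct demo_mirror *mirror =
		container_of(mni, struct demo_mirror, notifier);

	if (!mmu_notifier_range_blockable(range))
		return false;

	mutex_lock(&mirror->lock);
	/* Mark the mirror stale so concurrent device faults will retry. */
	mmu_interval_set_seq(mni, cur_seq);
	/* ... zap device page tables covering [range->start, range->end) ... */
	mutex_unlock(&mirror->lock);
	return true;
}

static const struct mmu_interval_notifier_ops demo_ops = {
	.invalidate = demo_invalidate,
};

A driver registers such a notifier with mmu_interval_notifier_insert() and
re-faults device mappings under mmu_interval_read_begin()/
mmu_interval_read_retry(); the insert/put/update/find helpers added by this
series are aimed at making that manageable, including from within the
invalidate() callback itself (see the changelog below).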
The patches are based on linux-5.5.0-rc6 and are for Jason's rdma/hmm tree
since I believe he is planning some interval notifier changes.
Changes v5 -> v6:
Rebase to linux-5.5.0-rc6
Refactored mmu interval notifier patches
Converted nouveau to use the new mmu interval notifier API
Changes v4 -> v5:
Added mmu interval notifier insert/remove/update callable from the
invalidate() callback
Updated HMM tests to use the new core interval notifier API
Changes v1 -> v4:
https://lore.kernel.org/linux-mm/20191104222141.5173-1-rcampbell@nvidia.com
Ralph Campbell (6):
mm/mmu_notifier: add mmu_interval_notifier_insert_safe()
mm/mmu_notifier: add mmu_interval_notifier_put()
mm/notifier: add mmu_interval_notifier_update()
mm/mmu_notifier: add mmu_interval_notifier_find()
nouveau: use new mmu interval notifiers
mm/hmm/test: add self tests for HMM
MAINTAINERS | 3 +
drivers/gpu/drm/nouveau/nouveau_svm.c | 313 ++++--
include/linux/mmu_notifier.h | 29 +
lib/Kconfig.debug | 11 +
lib/Makefile | 1 +
lib/test_hmm.c | 1368 ++++++++++++++++++++++++
mm/mmu_notifier.c | 223 +++-
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 3 +
tools/testing/selftests/vm/config | 2 +
tools/testing/selftests/vm/hmm-tests.c | 1354 +++++++++++++++++++++++
tools/testing/selftests/vm/run_vmtests | 16 +
tools/testing/selftests/vm/test_hmm.sh | 97 ++
13 files changed, 3289 insertions(+), 132 deletions(-)
create mode 100644 lib/test_hmm.c
create mode 100644 tools/testing/selftests/vm/hmm-tests.c
create mode 100755 tools/testing/selftests/vm/test_hmm.sh
--
2.20.1
There are four patches here, but only one of them actually does anything. The
first patch fixes a BPF selftests build failure on my machine and has already
been sent to the list separately. The next three are staged so that the
refactorings that don't change any functionality are split out from the patch
that carries the whole point: two cleanups, and then the idea itself.
Maybe this is an odd thing to say in a cover letter, but I'm not actually sure
this patch set is a good idea. The issue of extra moves after calls came up as
I was reviewing some unrelated performance optimizations to the RISC-V BPF JIT.
I figured I'd take a whack at performing the optimization in the context of the
arm64 port just to get a breath of fresh air, and I'm not convinced I like the
results.
That said, I think I would accept something like this for the RISC-V port
because we're already doing a multi-pass optimization for shrinking function
addresses, so it's not as much extra complexity over there. If we do that we
should probably start pulling some of this code into the shared BPF compiler,
but we're also opening the door to more complicated BPF JIT optimizations.
Given that the BPF JIT appears to have been designed explicitly to be
simple/fast rather than to perform complex optimizations, I'm not sure this is
a sane way to move forward.
I figured I'd send the patch set out as more of a question than anything else.
Specifically:
* How should I go about measuring the performance of this sort of
optimization? I'd like to balance the time it takes to run the JIT against the
time spent executing the program, but I don't have any feel for what real BPF
programs look like or have any benchmark suite to run. Is there something
out there this should be benchmarked against? (I'd also like to know so I can
run those benchmarks on the RISC-V port.)
* Is this the sort of thing that makes sense in a BPF JIT? I guess I've just
realized I turned "review this patch" into a way bigger rabbit hole than I
really want to go down...
I worked on top of 5.4 for these, but trivially different versions of the
patches applied on Linus' master a few days ago when I tried. LMK if those
aren't sane places to start from over here; I'm new to both arm64 and BPF, so
I might be a bit lost.
[PATCH 1/4] selftests/bpf: Elide a check for LLVM versions that can't
[PATCH 2/4] arm64: bpf: Convert bpf2a64 to a function
[PATCH 3/4] arm64: bpf: Split the read and write halves of dst
[PATCH 4/4] arm64: bpf: Elide some moves to a0 after calls
There is an important use-case which is not possible with the
"rseq" (Restartable Sequences) system call, which was left as
future work.
That use-case is to modify user-space per-cpu data structures
belonging to specific CPUs which may be brought offline and
online again by CPU hotplug. This can be used by memory
allocators to migrate free memory pools when CPUs are brought
offline, or by ring buffer consumers to target specific per-CPU
buffers, even when CPUs are brought offline.
A few rather complex prior attempts were made to solve this.
Those were based on in-kernel interpreters (cpu_opv, do_on_cpu).
That complexity was generally frowned upon, even by their author.
This patch fulfills this use-case in a refreshingly simple way:
it introduces a "pin_on_cpu" system call, which allows user-space
threads to pin themselves on a specific CPU (which needs to be
present in the thread's allowed cpu mask), and then clear this
pinned state.
"But this can already be done with sched_setaffinity", some
would rightfully reply. However, there is a significant twist
in the way pin_on_cpu deals with CPU hotplug compared to the
allowed cpu mask.
When all CPUs part of the thread's allowed cpu mask are offlined,
this mask is effectively reset to include all CPUs. This behavior
is completely incompatible with modifying per-cpu data structures:
the updates then become racy between concurrent CPUs trying to
update the given per-cpu data.
Conversely, all threads pinned on a given CPU with pin_on_cpu are
guaranteed to be scheduled on the same runqueue when that CPU is
offline. If that CPU is brought back online, the CPU hotplug
scheduler hooks are responsible for migrating back the tasks to
their pinned CPU.
For instance, this allows implementing the following userspace library
API for incrementing a per-cpu counter for a specific cpu number
received as a parameter:
static inline __attribute__((always_inline))
int percpu_addv(intptr_t *v, intptr_t count, int cpu)
{
	int ret;

	/* Fast path: succeeds if currently running on @cpu. */
	ret = rseq_addv(v, count, cpu);
check:
	if (rseq_unlikely(ret)) {
		/* Slow path: pin to @cpu, retry there, then unpin. */
		pin_on_cpu_set(cpu);
		ret = rseq_addv(v, count, percpu_current_cpu());
		pin_on_cpu_clear();
		goto check;
	}
	return 0;
}
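The pin_on_cpu_set()/pin_on_cpu_clear() helpers used above are not defined
by this patch; a minimal sketch of what they could look like, built directly
on the syscall number and commands introduced below (the wrapper names and
lack of error handling are illustrative only):

#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_pin_on_cpu
#define __NR_pin_on_cpu		436		/* from this patch's unistd.h change */
#endif
#define PIN_ON_CPU_CMD_SET	(1 << 0)	/* from uapi/linux/sched.h below */
#define PIN_ON_CPU_CMD_CLEAR	(1 << 1)

static int pin_on_cpu_set(int cpu)
{
	/* Pin the calling thread to @cpu (must be in its allowed mask). */
	return syscall(__NR_pin_on_cpu, PIN_ON_CPU_CMD_SET, 0, cpu);
}

static int pin_on_cpu_clear(void)
{
	/* Clear the pinned state; the cpu argument is ignored. */
	return syscall(__NR_pin_on_cpu, PIN_ON_CPU_CMD_CLEAR, 0, 0);
}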
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: linux-kselftest@vger.kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
fs/exec.c | 1 +
include/linux/sched.h | 1 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/sched.h | 6 +
init/init_task.c | 1 +
kernel/sched/core.c | 321 +++++++++++++++++++++++--
kernel/sched/deadline.c | 54 +++--
kernel/sched/fair.c | 19 ++
kernel/sched/rt.c | 15 +-
kernel/sched/sched.h | 28 +++
kernel/sys_ni.c | 1 +
14 files changed, 413 insertions(+), 42 deletions(-)
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 15908eb9b17e..0b1081a9b872 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -440,3 +440,4 @@
433 i386 fspick sys_fspick __ia32_sys_fspick
434 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open
435 i386 clone3 sys_clone3 __ia32_sys_clone3
+436 i386 pin_on_cpu sys_pin_on_cpu __ia32_sys_pin_on_cpu
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c29976eca4a8..90f9b3cab88d 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -357,6 +357,7 @@
433 common fspick __x64_sys_fspick
434 common pidfd_open __x64_sys_pidfd_open
435 common clone3 __x64_sys_clone3/ptregs
+436 common pin_on_cpu __x64_sys_pin_on_cpu
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/exec.c b/fs/exec.c
index c27231234764..6d882dbdd1e3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1827,6 +1827,7 @@ static int __do_execve_file(int fd, struct filename *filename,
current->fs->in_exec = 0;
current->in_execve = 0;
rseq_execve(current);
+ current->pinned_cpu = -1;
acct_update_integrals(current);
task_numa_free(current, false);
free_bprm(bprm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7f0bb6fff27c..ac0cac7b8d1d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -651,6 +651,7 @@ struct task_struct {
/* Current CPU: */
unsigned int cpu;
#endif
+ int pinned_cpu;
unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
struct task_struct *last_wakee;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index be0d0cf788ba..46fee5af99e3 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1000,6 +1000,7 @@ asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int flags)
asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
siginfo_t __user *info,
unsigned int flags);
+asmlinkage long sys_pin_on_cpu(int cmd, int flags, int cpu);
/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 1fc8faa6e973..43b0c956cc3c 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -851,8 +851,11 @@ __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
__SYSCALL(__NR_clone3, sys_clone3)
#endif
+#define __NR_pin_on_cpu 436
+__SYSCALL(__NR_pin_on_cpu, sys_pin_on_cpu)
+
#undef __NR_syscalls
-#define __NR_syscalls 436
+#define __NR_syscalls 437
/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 25b4fa00bad1..590cdc613698 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -114,4 +114,10 @@ struct clone_args {
SCHED_FLAG_KEEP_ALL | \
SCHED_FLAG_UTIL_CLAMP)
+enum pin_on_cpu_cmd {
+ PIN_ON_CPU_CMD_QUERY = 0,
+ PIN_ON_CPU_CMD_SET = (1 << 0),
+ PIN_ON_CPU_CMD_CLEAR = (1 << 1),
+};
+
#endif /* _UAPI_LINUX_SCHED_H */
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5eab7b..9aabce589cc7 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -88,6 +88,7 @@ struct task_struct init_task
.tasks = LIST_HEAD_INIT(init_task.tasks),
#ifdef CONFIG_SMP
.pushable_tasks = PLIST_NODE_INIT(init_task.pushable_tasks, MAX_PRIO),
+ .pinned_cpu = -1,
#endif
#ifdef CONFIG_CGROUP_SCHED
.sched_task_group = &root_task_group,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8dacda4b0362..6ca904d6e0ef 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -52,6 +52,8 @@ const_debug unsigned int sysctl_sched_features =
#undef SCHED_FEAT
#endif
+#define PIN_ON_CPU_CMD_BITMASK (PIN_ON_CPU_CMD_SET | PIN_ON_CPU_CMD_CLEAR)
+
/*
* Number of tasks to iterate in a single balance run.
* Limited because this is done with IRQs disabled.
@@ -1457,8 +1459,13 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
*/
static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
{
- if (!cpumask_test_cpu(cpu, p->cpus_ptr))
- return false;
+ if (is_pinned_task(p)) {
+ if (!allowed_pinned_cpu(p, cpu))
+ return false;
+ } else {
+ if (!cpumask_test_cpu(cpu, p->cpus_ptr))
+ return false;
+ }
if (is_per_cpu_kthread(p))
return cpu_online(cpu);
@@ -1662,6 +1669,12 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
goto out;
}
+ /* Prevent removing the currently pinned CPU from the allowed cpu mask. */
+ if (is_pinned_task(p) && !cpumask_test_cpu(p->pinned_cpu, new_mask)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
do_set_cpus_allowed(p, new_mask);
if (p->flags & PF_KTHREAD) {
@@ -1674,6 +1687,10 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
p->nr_cpus_allowed != 1);
}
+ /* Task pinned to a CPU overrides allowed cpu mask. */
+ if (is_pinned_task(p))
+ goto out;
+
/* Can the task run on the task's current CPU? If so, we're done */
if (cpumask_test_cpu(task_cpu(p), new_mask))
goto out;
@@ -1813,11 +1830,20 @@ static int migrate_swap_stop(void *data)
if (task_cpu(arg->src_task) != arg->src_cpu)
goto unlock;
- if (!cpumask_test_cpu(arg->dst_cpu, arg->src_task->cpus_ptr))
- goto unlock;
-
- if (!cpumask_test_cpu(arg->src_cpu, arg->dst_task->cpus_ptr))
- goto unlock;
+ if (is_pinned_task(arg->src_task)) {
+ if (!allowed_pinned_cpu(arg->src_task, arg->dst_cpu))
+ goto unlock;
+ } else {
+ if (!cpumask_test_cpu(arg->dst_cpu, arg->src_task->cpus_ptr))
+ goto unlock;
+ }
+ if (is_pinned_task(arg->dst_task)) {
+ if (!allowed_pinned_cpu(arg->dst_task, arg->src_cpu))
+ goto unlock;
+ } else {
+ if (!cpumask_test_cpu(arg->src_cpu, arg->dst_task->cpus_ptr))
+ goto unlock;
+ }
__migrate_swap_task(arg->src_task, arg->dst_cpu);
__migrate_swap_task(arg->dst_task, arg->src_cpu);
@@ -1858,11 +1884,21 @@ int migrate_swap(struct task_struct *cur, struct task_struct *p,
if (!cpu_active(arg.src_cpu) || !cpu_active(arg.dst_cpu))
goto out;
- if (!cpumask_test_cpu(arg.dst_cpu, arg.src_task->cpus_ptr))
- goto out;
+ if (is_pinned_task(arg.src_task)) {
+ if (!allowed_pinned_cpu(arg.src_task, arg.dst_cpu))
+ goto out;
+ } else {
+ if (!cpumask_test_cpu(arg.dst_cpu, arg.src_task->cpus_ptr))
+ goto out;
+ }
- if (!cpumask_test_cpu(arg.src_cpu, arg.dst_task->cpus_ptr))
- goto out;
+ if (is_pinned_task(arg.dst_task)) {
+ if (!allowed_pinned_cpu(arg.dst_task, arg.src_cpu))
+ goto out;
+ } else {
+ if (!cpumask_test_cpu(arg.src_cpu, arg.dst_task->cpus_ptr))
+ goto out;
+ }
trace_sched_swap_numa(cur, arg.src_cpu, p, arg.dst_cpu);
ret = stop_two_cpus(arg.dst_cpu, arg.src_cpu, migrate_swap_stop, &arg);
@@ -2034,6 +2070,18 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
enum { cpuset, possible, fail } state = cpuset;
int dest_cpu;
+ /*
+ * If the task is pinned to a CPU which is online, pick that pinned CPU
+ * number.
+ * If the task is pinned to a CPU which is offline, pick a CPU which is
+ * guaranteed to be the same for all tasks pinned to that offlined CPU.
+ */
+ if (is_pinned_task(p)) {
+ if (cpu_online(p->pinned_cpu))
+ return p->pinned_cpu;
+ else
+ return pinned_cpu_offline_offload(p);
+ }
/*
* If the node that the CPU is on has been offlined, cpu_to_node()
* will return -1. There is no CPU on the node, and we should
@@ -2104,10 +2152,15 @@ int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
{
lockdep_assert_held(&p->pi_lock);
- if (p->nr_cpus_allowed > 1)
- cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
- else
- cpu = cpumask_any(p->cpus_ptr);
+ if (is_pinned_task(p))
+ cpu = p->pinned_cpu;
+ else {
+ if (p->nr_cpus_allowed > 1)
+ cpu = p->sched_class->select_task_rq(p, cpu, sd_flags,
+ wake_flags);
+ else
+ cpu = cpumask_any(p->cpus_ptr);
+ }
/*
* In order not to call set_task_cpu() on a blocking task we need
@@ -6130,8 +6183,13 @@ int migrate_task_to(struct task_struct *p, int target_cpu)
if (curr_cpu == target_cpu)
return 0;
- if (!cpumask_test_cpu(target_cpu, p->cpus_ptr))
- return -EINVAL;
+ if (is_pinned_task(p)) {
+ if (!allowed_pinned_cpu(p, target_cpu))
+ return -EINVAL;
+ } else {
+ if (!cpumask_test_cpu(target_cpu, p->cpus_ptr))
+ return -EINVAL;
+ }
/* TODO: This is not properly updating schedstats */
@@ -6300,6 +6358,7 @@ static void migrate_tasks(struct rq *dead_rq, struct rq_flags *rf)
rq->stop = stop;
}
+
#endif /* CONFIG_HOTPLUG_CPU */
void set_rq_online(struct rq *rq)
@@ -6380,11 +6439,100 @@ static int cpuset_cpu_inactive(unsigned int cpu)
return 0;
}
+static bool skip_pinned_task(int pinned_cpu, int cpu,
+ bool first_online)
+{
+ if (pinned_cpu < 0)
+ return true;
+ if (first_online) {
+ if (cpu_online(pinned_cpu) && pinned_cpu != cpu)
+ return true;
+ } else {
+ if (pinned_cpu != cpu)
+ return true;
+ }
+ return false;
+}
+
+static void sched_cpu_migrate_pinned_tasks(unsigned int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+ struct task_struct *p, *t;
+ bool first_online = false;
+
+ if (cpu == cpumask_first(cpu_online_mask))
+ first_online = true;
+
+ /*
+ * This state transition (online && !active) when going online
+ * only allow bound kthreads to be scheduled.
+ * At this point, the CPU is completely online and running,
+ * but no userspace tasks are scheduled yet.
+ */
+ read_lock(&tasklist_lock);
+ for_each_process_thread(p, t) {
+ struct rq *target_rq;
+ struct rq_flags rf;
+ int pinned_cpu;
+
+ /*
+ * Migrate t to cpu if pinned to this cpu.
+ *
+ * Migrate t to cpu if its pinned cpu is offline
+ * and cpu becomes the new first online cpu.
+ *
+ * Transition of t->pinned_cpu to cpu can only
+ * happen if the thread is scheduled on cpu, which
+ * is impossible at this point because the cpu is
+ * not active.
+ *
+ * Transition of t->pinned_cpu from cpu to -1 or some
+ * other cpu number may happen concurrently. Therefore,
+ * skip early (without rq lock), and check again with
+ * the rq lock held to eliminate concurrent transitions
+ * from cpu to -1 or some other cpu number.
+ */
+ pinned_cpu = READ_ONCE(t->pinned_cpu);
+ if (skip_pinned_task(pinned_cpu, cpu, first_online))
+ continue;
+ if (pinned_cpu == cpu)
+ printk("pin_on_cpu migrate to owner: online cpu %d\n",
+ cpu);
+ if (first_online && !cpu_online(pinned_cpu))
+ printk("pin_on_cpu migrate to new offload cpu %d\n",
+ cpu);
+ target_rq = task_rq_lock(t, &rf);
+ pinned_cpu = t->pinned_cpu;
+ if (skip_pinned_task(pinned_cpu, cpu, first_online))
+ goto unlock;
+ WARN_ON_ONCE(target_rq == rq);
+ update_rq_clock(target_rq);
+ if (task_running(target_rq, t) || t->state == TASK_WAKING) {
+ struct migration_arg arg = { t, cpu };
+ /* Need help from migration thread: drop lock and wait. */
+ task_rq_unlock(target_rq, t, &rf);
+ stop_one_cpu(cpu_of(target_rq), migration_cpu_stop, &arg);
+ continue;
+ } else if (task_on_rq_queued(t)) {
+ /*
+ * OK, since we're going to drop the lock immediately
+ * afterwards anyway.
+ */
+ rq = move_queued_task(target_rq, &rf, t, cpu);
+ }
+ unlock:
+ task_rq_unlock(target_rq, t, &rf);
+ }
+ read_unlock(&tasklist_lock);
+}
+
int sched_cpu_activate(unsigned int cpu)
{
struct rq *rq = cpu_rq(cpu);
struct rq_flags rf;
+ sched_cpu_migrate_pinned_tasks(cpu);
+
#ifdef CONFIG_SCHED_SMT
/*
* When going up, increment the number of cores with SMT present.
@@ -7899,6 +8047,145 @@ struct cgroup_subsys cpu_cgrp_subsys = {
#endif /* CONFIG_CGROUP_SCHED */
+static void do_set_pinned_cpu(struct task_struct *p, int cpu)
+{
+ struct rq *rq = task_rq(p);
+ bool queued, running;
+
+ lockdep_assert_held(&p->pi_lock);
+
+ queued = task_on_rq_queued(p);
+ running = task_current(rq, p);
+
+ if (queued) {
+ /*
+ * Because __kthread_bind() calls this on blocked tasks without
+ * holding rq->lock.
+ */
+ lockdep_assert_held(&rq->lock);
+ dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
+ }
+ if (running)
+ put_prev_task(rq, p);
+
+ WRITE_ONCE(p->pinned_cpu, cpu);
+
+ if (queued)
+ enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
+ if (running)
+ set_next_task(rq, p);
+}
+
+static int __do_pin_on_cpu(int cpu)
+{
+ struct task_struct *p = current;
+ struct rq_flags rf;
+ struct rq *rq;
+ int ret = 0, dest_cpu;
+ struct migration_arg arg = { p };
+
+ cpus_read_lock();
+ rq = task_rq_lock(p, &rf);
+ update_rq_clock(rq);
+ if (cpu >= 0 && !cpumask_test_cpu(cpu, current->cpus_ptr)) {
+ ret = -EINVAL;
+ goto out;
+ }
+#ifdef CONFIG_SMP
+ do_set_pinned_cpu(p, cpu);
+ if (cpu >= 0) {
+ if (cpu_online(cpu))
+ dest_cpu = cpu;
+ else
+ dest_cpu = pinned_cpu_offline_offload(p);
+ if (task_cpu(p) == dest_cpu) {
+ /*
+ * If the task already runs on the pinned cpu, we're
+ * done.
+ */
+ goto out;
+ }
+ } else {
+ /*
+ * When clearing the pinned cpu, we may need to migrate the
+ * current task if it is currently sitting on a runqueue which
+ * does not belong to the allowed mask.
+ */
+ dest_cpu = cpumask_any(p->cpus_ptr);
+ }
+ arg.dest_cpu = dest_cpu;
+ /* Need help from migration thread: drop lock and wait. */
+ task_rq_unlock(rq, p, &rf);
+ stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
+
+ /* Preempt disable prevents hotplug on current cpu. */
+ preempt_disable();
+ WARN_ON_ONCE(cpu >= 0 && cpu_online(cpu) &&
+ smp_processor_id() != cpu);
+ preempt_enable();
+ cpus_read_unlock();
+ return 0;
+#endif
+out:
+ task_rq_unlock(rq, p, &rf);
+ cpus_read_unlock();
+ return ret;
+}
+
+static int pin_on_cpu_set(int cpu)
+{
+ if (cpu < 0 || !cpu_possible(cpu)) {
+ return -EINVAL;
+ }
+ return __do_pin_on_cpu(cpu);
+}
+
+static int pin_on_cpu_clear(void)
+{
+ return __do_pin_on_cpu(-1);
+}
+
+/*
+ * sys_pin_on_cpu - pin current task to a specific cpu.
+ * @cmd: command to issue (enum pin_on_cpu_cmd)
+ * @flags: system call flags
+ * @cpu: cpu where the task should run.
+ *
+ * Returns -EINVAL if cmd is unknown.
+ * Returns -EINVAL if flags are unknown.
+ * Returns -EINVAL if the CPU is not part of the possible CPUs.
+ * Returns -EINVAL if the CPU is not part of the allowed cpu mask
+ * for the current task.
+ *
+ * PIN_ON_CPU_CMD_QUERY: Return the mask of supported commands.
+ * PIN_ON_CPU_CMD_SET: Pin the current task to a specific CPU.
+ * PIN_ON_CPU_CMD_CLEAR: Clear cpu pinning for current task.
+ *
+ * If the pinned CPU is online, the current task will run on that CPU.
+ *
+ * If the pinned CPU is offline, the scheduler guarantees that
+ * all tasks pinned to that CPU number are moved to the same
+ * runqueue.
+ *
+ * Removing the pinned CPU from the task's allowed cpu mask is
+ * forbidden.
+ */
+SYSCALL_DEFINE3(pin_on_cpu, int, cmd, int, flags, int, cpu)
+{
+ if (unlikely(flags))
+ return -EINVAL;
+ switch (cmd) {
+ case PIN_ON_CPU_CMD_QUERY:
+ return PIN_ON_CPU_CMD_BITMASK;
+ case PIN_ON_CPU_CMD_SET:
+ return pin_on_cpu_set(cpu);
+ case PIN_ON_CPU_CMD_CLEAR:
+ return pin_on_cpu_clear();
+ default:
+ return -EINVAL;
+ }
+}
+
void dump_cpu_task(int cpu)
{
pr_info("Task dump for CPU %d:\n", cpu);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a8a08030a8f7..8a1581e8509e 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -535,24 +535,31 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
if (!later_rq) {
int cpu;
- /*
- * If we cannot preempt any rq, fall back to pick any
- * online CPU:
- */
- cpu = cpumask_any_and(cpu_active_mask, p->cpus_ptr);
- if (cpu >= nr_cpu_ids) {
- /*
- * Failed to find any suitable CPU.
- * The task will never come back!
- */
- BUG_ON(dl_bandwidth_enabled());
-
+ if (is_pinned_task(p)) {
+ if (cpu_online(p->pinned_cpu))
+ cpu = p->pinned_cpu;
+ else
+ cpu = pinned_cpu_offline_offload(p);
+ } else {
/*
- * If admission control is disabled we
- * try a little harder to let the task
- * run.
+ * If we cannot preempt any rq, fall back to pick any
+ * online CPU:
*/
- cpu = cpumask_any(cpu_active_mask);
+ cpu = cpumask_any_and(cpu_active_mask, p->cpus_ptr);
+ if (cpu >= nr_cpu_ids) {
+ /*
+ * Failed to find any suitable CPU.
+ * The task will never come back!
+ */
+ BUG_ON(dl_bandwidth_enabled());
+
+ /*
+ * If admission control is disabled we
+ * try a little harder to let the task
+ * run.
+ */
+ cpu = cpumask_any(cpu_active_mask);
+ }
}
later_rq = cpu_rq(cpu);
double_lock_balance(rq, later_rq);
@@ -1836,9 +1843,15 @@ static void task_fork_dl(struct task_struct *p)
static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
{
- if (!task_running(rq, p) &&
- cpumask_test_cpu(cpu, p->cpus_ptr))
- return 1;
+ if (!task_running(rq, p)) {
+ if (is_pinned_task(p)) {
+ if (allowed_pinned_cpu(p, cpu))
+ return 1;
+ } else {
+ if (cpumask_test_cpu(cpu, p->cpus_ptr))
+ return 1;
+ }
+ }
return 0;
}
@@ -1987,7 +2000,8 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
/* Retry if something changed. */
if (double_lock_balance(rq, later_rq)) {
if (unlikely(task_rq(task) != rq ||
- !cpumask_test_cpu(later_rq->cpu, task->cpus_ptr) ||
+ (is_pinned_task(task) && !allowed_pinned_cpu(task, later_rq->cpu)) ||
+ (!is_pinned_task(task) && !cpumask_test_cpu(later_rq->cpu, task->cpus_ptr)) ||
task_running(rq, task) ||
!dl_task(task) ||
!task_on_rq_queued(task))) {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 69a81a5709ff..e96ae1ce9829 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7223,6 +7223,25 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
lockdep_assert_held(&env->src_rq->lock);
+ if (is_pinned_task(p)) {
+ if (task_running(env->src_rq, p)) {
+ schedstat_inc(p->se.statistics.nr_failed_migrations_running);
+ return 0;
+ }
+ if (cpu_online(p->pinned_cpu)) {
+ if (env->dst_cpu == p->pinned_cpu)
+ return 1;
+ else
+ return 0;
+ } else {
+ if (env->dst_cpu ==
+ pinned_cpu_offline_offload(p))
+ return 1;
+ else
+ return 0;
+ }
+ }
+
/*
* We do not migrate tasks that are:
* 1) throttled_lb_pair, or
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 9b8adc01be3d..2774311e5750 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1600,9 +1600,15 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
{
- if (!task_running(rq, p) &&
- cpumask_test_cpu(cpu, p->cpus_ptr))
- return 1;
+ if (!task_running(rq, p)) {
+ if (is_pinned_task(p)) {
+ if (allowed_pinned_cpu(p, cpu))
+ return 1;
+ } else {
+ if (cpumask_test_cpu(cpu, p->cpus_ptr))
+ return 1;
+ }
+ }
return 0;
}
@@ -1738,7 +1744,8 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
* Also make sure that it wasn't scheduled on its rq.
*/
if (unlikely(task_rq(task) != rq ||
- !cpumask_test_cpu(lowest_rq->cpu, task->cpus_ptr) ||
+ (is_pinned_task(task) && !allowed_pinned_cpu(task, lowest_rq->cpu)) ||
+ (!is_pinned_task(task) && !cpumask_test_cpu(lowest_rq->cpu, task->cpus_ptr)) ||
task_running(rq, task) ||
!rt_task(task) ||
!task_on_rq_queued(task))) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 49ed949f850c..922bc618cc87 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -187,6 +187,34 @@ static inline int task_has_dl_policy(struct task_struct *p)
return dl_policy(p->policy);
}
+/*
+ * All tasks which require to be pinned on offlined CPUs are sent
+ * to runqueue of the first online CPU.
+ */
+static inline bool is_pinned_task(struct task_struct *p)
+{
+ return p->pinned_cpu >= 0;
+}
+
+static inline int pinned_cpu_offline_offload(struct task_struct *p)
+{
+ return cpumask_first(cpu_online_mask);
+}
+
+static inline bool allowed_pinned_cpu(struct task_struct *p, int cpu)
+{
+ if (!cpu_possible(cpu))
+ return false;
+ if (cpu_online(p->pinned_cpu)) {
+ if (p->pinned_cpu == cpu)
+ return true;
+ } else {
+ if (cpu == pinned_cpu_offline_offload(p))
+ return true;
+ }
+ return false;
+}
+
#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
/*
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 34b76895b81e..7e5192cd8d9d 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -449,3 +449,4 @@ COND_SYSCALL(setuid16);
/* restartable sequence */
COND_SYSCALL(rseq);
+COND_SYSCALL(pin_on_cpu);
--
2.17.1
A cgroup containing only dying tasks will be seen as empty when a userspace
process reads its cgroup.procs or cgroup.tasks files. It should be safe to
delete such a cgroup as it is considered empty. However, if one of the dying
tasks has not yet reached cgroup_exit, an attempt to delete the cgroup will
fail with EBUSY because cgroup_is_populated() will not consider it empty
until all tasks reach cgroup_exit. Such a condition can be triggered when
a task consumes large amounts of memory and spends enough time in exit_mm
to create a delay between the moment it is flagged as PF_EXITING and the
moment it reaches cgroup_exit.
Fix this by detecting cgroups containing only dying tasks during cgroup
destruction and proceeding with it while postponing the final step of
releasing the last reference until the last task reaches cgroup_exit.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: JeiFeng Lee <linger.lee@mediatek.com>
Fixes: c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations")
---
include/linux/cgroup-defs.h | 3 ++
kernel/cgroup/cgroup.c | 65 +++++++++++++++++++++++++++++++++----
2 files changed, 61 insertions(+), 7 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 63097cb243cb..f9bcccbac8dd 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -71,6 +71,9 @@ enum {
/* Cgroup is frozen. */
CGRP_FROZEN,
+
+ /* Cgroup is dead. */
+ CGRP_DEAD,
};
/* cgroup_root->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 735af8f15f95..a99ebddd37d9 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -795,10 +795,11 @@ static bool css_set_populated(struct css_set *cset)
* that the content of the interface file has changed. This can be used to
* detect when @cgrp and its descendants become populated or empty.
*/
-static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
+static bool cgroup_update_populated(struct cgroup *cgrp, bool populated)
{
struct cgroup *child = NULL;
int adj = populated ? 1 : -1;
+ bool state_change = false;
lockdep_assert_held(&css_set_lock);
@@ -817,6 +818,7 @@ static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
if (was_populated == cgroup_is_populated(cgrp))
break;
+ state_change = true;
cgroup1_check_for_release(cgrp);
TRACE_CGROUP_PATH(notify_populated, cgrp,
cgroup_is_populated(cgrp));
@@ -825,6 +827,21 @@ static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
child = cgrp;
cgrp = cgroup_parent(cgrp);
} while (cgrp);
+
+ return state_change;
+}
+
+static void cgroup_prune_dead(struct cgroup *cgrp)
+{
+ lockdep_assert_held(&css_set_lock);
+
+ do {
+ /* put the base reference if cgroup was already destroyed */
+ if (!cgroup_is_populated(cgrp) &&
+ test_bit(CGRP_DEAD, &cgrp->flags))
+ percpu_ref_kill(&cgrp->self.refcnt);
+ cgrp = cgroup_parent(cgrp);
+ } while (cgrp);
}
/**
@@ -838,11 +855,15 @@ static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
static void css_set_update_populated(struct css_set *cset, bool populated)
{
struct cgrp_cset_link *link;
+ bool state_change;
lockdep_assert_held(&css_set_lock);
- list_for_each_entry(link, &cset->cgrp_links, cgrp_link)
- cgroup_update_populated(link->cgrp, populated);
+ list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
+ state_change = cgroup_update_populated(link->cgrp, populated);
+ if (state_change && !populated)
+ cgroup_prune_dead(link->cgrp);
+ }
}
/*
@@ -5458,8 +5479,26 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
* Only migration can raise populated from zero and we're already
* holding cgroup_mutex.
*/
- if (cgroup_is_populated(cgrp))
- return -EBUSY;
+ if (cgroup_is_populated(cgrp)) {
+ struct css_task_iter it;
+ struct task_struct *task;
+
+ /*
+ * cgroup_is_populated does not account for exiting tasks
+ * that did not reach cgroup_exit yet. Check if all the tasks
+ * in this cgroup are exiting.
+ */
+ css_task_iter_start(&cgrp->self, 0, &it);
+ do {
+ task = css_task_iter_next(&it);
+ } while (task && (task->flags & PF_EXITING));
+ css_task_iter_end(&it);
+
+ if (task) {
+ /* cgroup is indeed populated */
+ return -EBUSY;
+ }
+ }
/*
* Make sure there's no live children. We can't test emptiness of
@@ -5510,8 +5549,20 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
cgroup_bpf_offline(cgrp);
- /* put the base reference */
- percpu_ref_kill(&cgrp->self.refcnt);
+ /*
+ * Take css_set_lock because of the possible race with
+ * cgroup_update_populated.
+ */
+ spin_lock_irq(&css_set_lock);
+ /* The last task might have died since we last checked */
+ if (cgroup_is_populated(cgrp)) {
+ /* mark cgroup for future destruction */
+ set_bit(CGRP_DEAD, &cgrp->flags);
+ } else {
+ /* put the base reference */
+ percpu_ref_kill(&cgrp->self.refcnt);
+ }
+ spin_unlock_irq(&css_set_lock);
return 0;
};
--
2.25.0.rc1.283.g88dfdc4193-goog
[Resend the v9 patch set to Shuah Khan and the linux-kselftest mailing list.
No code or commit message changes.]
With more and more resctrl features being added by Intel, AMD
and ARM, a test tool is becoming increasingly useful to validate
that both hardware and software functionalities work as expected.
We introduce a resctrl selftest to cover resctrl features on Intel, AMD
and ARM platforms. It first implements MBM (Memory Bandwidth
Monitoring), MBA (Memory Bandwidth Allocation), L3 CAT (Cache Allocation
Technology), and CQM (Cache QoS Monitoring) tests. We will enhance
the selftest tool to include more functionality tests in the future.
The tool has been tested on Intel RDT, AMD QoS and ARM MPAM and lives
in tools/testing/selftests/resctrl in order to have a generic test code
base for all architectures.
The selftest we are introducing here provides a convenient tool that
does automatic resctrl testing, is readily available in the kernel
tree, and covers Intel RDT, AMD QoS and ARM MPAM.
There is an existing resctrl test suite, 'intel_cmt_cat', but its major
purpose is to test Intel RDT hardware by writing and reading MSR
registers. It does access the resctrl file system, but those
functionalities are very limited, and it doesn't support automated
testing, so a lot of manual verification is involved.
Changelog:
v9:
- Per Boris suggestion, add Co-developed-by in each patch to make it
clear who contributed to the patch set.
v8:
Update code per comments from Andre Przywara from ARM:
- Change Makefile and remove inline assembly code to build and test the
tool on ARM
- Change the output to TAP format because the format is both readable by
human and other test tools.
- Detect resctrl feature from /proc/cpuinfo instead of dmesg to support
generic detection on all architectures.
- Fix a few coding issues.
v7:
- Fix a few warnings when compiling patches separately, pointed out by Babu
v6:
- Fix a benchmark reading optimized out issue in newer GCC.
- Fix a few coding style issues.
- Re-arrange code among patches to make cleaner code. No change in patches
structure.
v5:
- Based the v4 patches submitted by Fenghua Yu and added changes to support
AMD.
- Changed the function name get_sock_num to get_resource_id. Intel uses
socket number for schemata and AMD uses l3 index id. To generalize,
changed the function name to get_resource_id.
- Added the code to detect vendor.
- Disabled the few tests for AMD where the test results are not clear.
Also AMD does not have IMC.
- Fixed few compile issues.
- Some cleanup to make each patch independent.
- Tested the patches on AMD system. Fenghua, Need your help to test on
Intel box. Please feel free to change and resubmit if something
broken.
v4:
- address comments from Balu and Randy
- Add CAT and CQM tests
v3:
- Change code based on comments from Babu Moger
- Remove some unnecessary code and use a pipe to communicate b/w processes
v2:
- Change code based on comments from Babu Moger
- Clean up other places.
Babu Moger (3):
selftests/resctrl: Add vendor detection mechanism
selftests/resctrl: Use cache index3 id for AMD schemata masks
selftests/resctrl: Disable MBA and MBM tests for AMD
Fenghua Yu (6):
selftests/resctrl: Add README for resctrl tests
selftests/resctrl: Add MBM test
selftests/resctrl: Add MBA test
selftests/resctrl: Add Cache QoS Monitoring (CQM) selftest
selftests/resctrl: Add Cache Allocation Technology (CAT) selftest
selftests/resctrl: Add the test in MAINTAINERS
Sai Praneeth Prakhya (4):
selftests/resctrl: Add basic resctrl file system operations and data
selftests/resctrl: Read memory bandwidth from perf IMC counter and
from resctrl file system
selftests/resctrl: Add callback to start a benchmark
selftests/resctrl: Add built in benchmark
MAINTAINERS | 1 +
tools/testing/selftests/resctrl/Makefile | 17 +
tools/testing/selftests/resctrl/README | 53 ++
tools/testing/selftests/resctrl/cache.c | 272 +++++++
tools/testing/selftests/resctrl/cat_test.c | 250 ++++++
tools/testing/selftests/resctrl/cqm_test.c | 176 +++++
tools/testing/selftests/resctrl/fill_buf.c | 213 +++++
tools/testing/selftests/resctrl/mba_test.c | 171 ++++
tools/testing/selftests/resctrl/mbm_test.c | 145 ++++
tools/testing/selftests/resctrl/resctrl.h | 107 +++
.../testing/selftests/resctrl/resctrl_tests.c | 202 +++++
tools/testing/selftests/resctrl/resctrl_val.c | 744 ++++++++++++++++++
tools/testing/selftests/resctrl/resctrlfs.c | 722 +++++++++++++++++
13 files changed, 3073 insertions(+)
create mode 100644 tools/testing/selftests/resctrl/Makefile
create mode 100644 tools/testing/selftests/resctrl/README
create mode 100644 tools/testing/selftests/resctrl/cache.c
create mode 100644 tools/testing/selftests/resctrl/cat_test.c
create mode 100644 tools/testing/selftests/resctrl/cqm_test.c
create mode 100644 tools/testing/selftests/resctrl/fill_buf.c
create mode 100644 tools/testing/selftests/resctrl/mba_test.c
create mode 100644 tools/testing/selftests/resctrl/mbm_test.c
create mode 100644 tools/testing/selftests/resctrl/resctrl.h
create mode 100644 tools/testing/selftests/resctrl/resctrl_tests.c
create mode 100644 tools/testing/selftests/resctrl/resctrl_val.c
create mode 100644 tools/testing/selftests/resctrl/resctrlfs.c
--
2.19.1
From: SeongJae Park <sjpark@amazon.de>
When closing a connection, the two ACKs that are required to change the
closing socket's state to FIN_WAIT_2 and then TIME_WAIT could be processed
in reverse order. This is possible in RSS-disabled environments such as a
connection inside a host.
For example, the expected state transitions and the packets required for
the disconnection would look like the flow below.
00 (Process A)                       (Process B)
01 ESTABLISHED                       ESTABLISHED
02 close()
03 FIN_WAIT_1
04              ---FIN-->
05                                   CLOSE_WAIT
06              <--ACK---
07 FIN_WAIT_2
08              <--FIN/ACK---
09 TIME_WAIT
10              ---ACK-->
11                                   LAST_ACK
12 CLOSED                            CLOSED
The ACKs in lines 06 and 08 are the two ACKs in question. If the line 08
packet is processed before the line 06 packet, it will simply be ignored
as it is not an expected packet, and the later processing of the line 06
packet will change the state of Process A to FIN_WAIT_2; but as Process A
has already handled the line 08 packet, it will not go to TIME_WAIT and
thus will not send the line 10 packet to Process B. Thus, Process B will
be left in the CLOSE_WAIT state, as below.
00 (Process A)                       (Process B)
01 ESTABLISHED                       ESTABLISHED
02 close()
03 FIN_WAIT_1
04              ---FIN-->
05                                   CLOSE_WAIT
06              (<--ACK---)
07              (<--FIN/ACK---)
08              (fired in right order)
09              <--FIN/ACK---
10              <--ACK---
11              (processed in reverse order)
12 FIN_WAIT_2
Later, if Process B sends a SYN to Process A for reconnection using
the same port, Process A will respond with an ACK for the last flow,
which has no increased sequence number. Thus, Process B will send an RST,
wait for TCP_TIMEOUT_INIT (one second by default), and then retry the
connection. If reconnections are frequent, the one-second latency
spikes can be a big problem. Below is a tcpdump trace of the problem:
14.436259 IP 127.0.0.1.45150 > 127.0.0.1.4242: Flags [S], seq 2560603644
14.436266 IP 127.0.0.1.4242 > 127.0.0.1.45150: Flags [.], ack 5, win 512
14.436271 IP 127.0.0.1.45150 > 127.0.0.1.4242: Flags [R], seq 2541101298
/* ONE SECOND DELAY */
15.464613 IP 127.0.0.1.45150 > 127.0.0.1.4242: Flags [S], seq 2560603644
Patchset Organization
---------------------
The first patch fixes a trivial nit. The second one fixes the problem by
adjusting the resend delay of the SYN in this case. Finally, the third
patch adds a user-space test to reproduce this problem.
The patches are based on the v5.5. You can also clone the complete git
tree:
$ git clone git://github.com/sjp38/linux -b patches/finack_lat/v1
The web is also available:
https://github.com/sjp38/linux/tree/patches/finack_lat/v1
SeongJae Park (3):
net/ipv4/inet_timewait_sock: Fix inconsistent comments
tcp: Reduce SYN resend delay if a suspicious ACK is received
selftests: net: Add FIN_ACK processing order related latency spike
test
net/ipv4/inet_timewait_sock.c | 1 +
net/ipv4/tcp_input.c | 6 +-
tools/testing/selftests/net/.gitignore | 2 +
tools/testing/selftests/net/Makefile | 2 +
tools/testing/selftests/net/fin_ack_lat.sh | 42 ++++++++++
.../selftests/net/fin_ack_lat_accept.c | 49 +++++++++++
.../selftests/net/fin_ack_lat_connect.c | 81 +++++++++++++++++++
7 files changed, 182 insertions(+), 1 deletion(-)
create mode 100755 tools/testing/selftests/net/fin_ack_lat.sh
create mode 100644 tools/testing/selftests/net/fin_ack_lat_accept.c
create mode 100644 tools/testing/selftests/net/fin_ack_lat_connect.c
--
2.17.1
From: "Steven Rostedt (VMware)" <rostedt(a)goodmis.org>
While running the ftracetests, the pid filter test failed because the
instance "foo" already existed, and the test was using that name to rerun
itself under an instance named foo. The collision caused the test to fail,
as the mkdir failed because the name already existed.
As of commit b5b77be812de7 ("selftests: ftrace: Allow some tests to be run
in a tracing instance"), all a selftest needs to do to be tested in an
instance is to set the "instance" flag. There's no reason for a selftest to
create an instance directly in order to run its test in one.
Remove the open coded testing in an instance for the pid filter test and
have it set the "instance" flag instead.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
.../selftests/ftrace/test.d/ftrace/func-filter-pid.tc | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/func-filter-pid.tc b/tools/testing/selftests/ftrace/test.d/ftrace/func-filter-pid.tc
index 64cfcc75e3c1..f2ee1e889e13 100644
--- a/tools/testing/selftests/ftrace/test.d/ftrace/func-filter-pid.tc
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/func-filter-pid.tc
@@ -1,6 +1,7 @@
#!/bin/sh
# SPDX-License-Identifier: GPL-2.0
# description: ftrace - function pid filters
+# flags: instance
# Make sure that function pid matching filter works.
# Also test it on an instance directory
@@ -96,13 +97,6 @@ do_test() {
}
do_test
-
-mkdir instances/foo
-cd instances/foo
-do_test
-cd ../../
-rmdir instances/foo
-
do_reset
exit 0
--
2.20.1
OK, as requested, I've split the tracking patch into 6 smaller patches,
and it should be *much* easier to understand and review now.
============================================================
Changes since v1:
* Split the tracking patch into 6 smaller patches
* Rebased onto today's linux-next/akpm (there weren't any conflicts).
* Fixed an "unsigned int" vs. "int" problem in gup_benchmark, reported
by Nathan Chancellor. (I don't see it in my local builds, probably
because they use gcc, but an LLVM test found the mismatch.)
* Fixed a huge page pincount problem (add/subtract vs.
increment/decrement), spotted by Jan Kara.
============================================================
There is a reasonable case to be made for merging two of the patches
(patches 4 and 5), given that patch 4 provides tracking that has upper
limits on the number of pins that can be done with huge pages. Let me
know if anyone wants those merged, but unless there is some weird chance
of someone grabbing patch 4 and not patch 5, I don't really see the
need. Meanwhile, it's easier to review in this form.
Also, patch 3 has been revived. Earlier reviewers asked for it to be
merged into the tracking patch (one cannot please everyone, heh), but
now it's back out on its own.
This activates tracking of FOLL_PIN pages. This is in support of fixing
the get_user_pages()+DMA problem described in [1]-[4].
It is based on today's (Jan 28) linux-next (branch: akpm),
commit 280e9cb00b41 ("drivers/media/platform/sti/delta/delta-ipc.c: fix
read buffer overflow")
There is a git repo and branch, for convenience in reviewing:
git@github.com:johnhubbard/linux.git
track_user_pages_v2_linux-next_akpm_28Jan2020
FOLL_PIN support is (so far) in mmotm and linux-next. However, the
patch to use FOLL_PIN to track pages was *not* submitted, because Leon
saw an RDMA test suite failure that involved (I think) page refcount
overflows when huge pages were used.
This patch definitively solves that kind of overflow problem, by adding
an exact pincount, for compound pages (of order > 1), in the 3rd struct
page of a compound page. If available, that form of pincounting is used,
instead of the GUP_PIN_COUNTING_BIAS approach. Thanks again to Jan Kara
for that idea.
Here's the last reviewed version of the tracking patch (v11):
https://lore.kernel.org/r/20191216222537.491123-1-jhubbard@nvidia.com
Jan Kara had provided a reviewed-by tag for that, but I've had to remove
it (again) here, due to having changed the patch "a little bit", in
order to add the feature described above.
Other interesting changes:
* dump_page(): added one or two new things to report for compound
pages: head refcount (for all compound pages), and map_pincount (for
compound pages of order > 1).
* Documentation/core-api/pin_user_pages.rst: removed the "TODO" for the
huge page refcount upper limit problems, and added notes about how it
works now. Also added a note about the dump_page() enhancements.
* Added some comments in gup.c and mm.h, to explain that there are two
ways to count pinned pages: exact (for compound pages of order > 1)
and fuzzy (GUP_PIN_COUNTING_BIAS: for all other pages). A rough sketch
of that split follows below.
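As a rough sketch of that two-way accounting (conceptual only, not code from
this series; the field and macro names follow the patch titles and notes
above, but the real helpers differ):

#include <linux/mm.h>
#include <linux/page_ref.h>

static void sketch_record_pin(struct page *page)
{
	page = compound_head(page);

	if (PageCompound(page) && compound_order(page) > 1) {
		/*
		 * Exact pin count: a dedicated counter in the 3rd struct page
		 * of the compound page, plus a normal reference on the head.
		 */
		get_page(page);
		atomic_inc(&page[2].hpage_pinned_refcount);
	} else {
		/*
		 * Fuzzy pin count: overload the page refcount by a large bias
		 * so "is this page pinned?" can be answered heuristically
		 * from the refcount alone.
		 */
		page_ref_add(page, GUP_PIN_COUNTING_BIAS);
	}
}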
============================================================
General notes about the tracking patch:
This is a prerequisite to solving the problem of proper interactions
between file-backed pages, and [R]DMA activities, as discussed in [1],
[2], [3], [4] and in a remarkable number of email threads since about
2017. :)
In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:
get_user_pages() (sets FOLL_GET)
put_page()
to this:
pin_user_pages() (sets FOLL_PIN)
unpin_user_page()
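As a rough illustration of that call-site conversion (a sketch, not code from
this series; the surrounding driver logic is invented, and the caller is
assumed to hold mmap_sem for read, as with get_user_pages()):

#include <linux/mm.h>

static long demo_pin_user_buffer(unsigned long start, unsigned long nr_pages,
				 struct page **pages)
{
	long pinned;

	/* Previously: get_user_pages(start, nr_pages, FOLL_WRITE, pages, NULL) */
	pinned = pin_user_pages(start, nr_pages, FOLL_WRITE, pages, NULL);
	if (pinned <= 0)
		return pinned;

	/* ... set up and perform DMA against the pinned pages ... */

	/* Previously: a put_page() loop over the pinned pages. */
	unpin_user_pages(pages, pinned);
	return pinned;
}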
============================================================
Next steps:
* Convert more subsystems from get_user_pages() to pin_user_pages().
* Work with Ira and others to connect this all up with file system
leases.
[1] Some slow progress on get_user_pages() (Apr 2, 2019):
https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018):
https://lwn.net/Articles/753027/
[4] LWN kernel index: get_user_pages()
https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
John Hubbard (8):
mm: dump_page: print head page's refcount, for compound pages
mm/gup: split get_user_pages_remote() into two routines
mm/gup: pass a flags arg to __gup_device_* functions
mm/gup: track FOLL_PIN pages
mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages
mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting
mm/gup_benchmark: support pin_user_pages() and related calls
selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN
coverage
Documentation/core-api/pin_user_pages.rst | 47 +--
include/linux/mm.h | 109 ++++-
include/linux/mm_types.h | 7 +-
include/linux/mmzone.h | 2 +
include/linux/page_ref.h | 10 +
mm/debug.c | 22 +-
mm/gup.c | 460 ++++++++++++++++-----
mm/gup_benchmark.c | 71 +++-
mm/huge_memory.c | 29 +-
mm/hugetlb.c | 44 +-
mm/page_alloc.c | 2 +
mm/rmap.c | 6 +
mm/vmstat.c | 2 +
tools/testing/selftests/vm/gup_benchmark.c | 15 +-
tools/testing/selftests/vm/run_vmtests | 22 +
15 files changed, 681 insertions(+), 167 deletions(-)
--
2.25.0