Star64_linux/kernel
Peter Zijlstra a35b6466aa sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies
Peter Portante reported that for large cgroup hierarchies (and/or
large CPU counts) we get immense lock contention on rq->lock and stuff
stops working properly.

His workload was a ton of processes, each in their own cgroup,
everybody idling except for a sporadic wakeup once every so often.

It was found that:

  schedule()
    idle_balance()
      load_balance()
        local_irq_save()
        double_rq_lock()
        update_h_load()
          walk_tg_tree(tg_load_down)
            tg_load_down()

Results in an entire cgroup hierarchy walk under rq->lock for every
new-idle balance and since new-idle balance isn't throttled this
results in a lot of work while holding the rq->lock.

This patch does two things. First, it removes the work from under
rq->lock, based on the good principle of race and pray which is widely
employed in the load-balancer as a whole. Second, it throttles the
update_h_load() calculation to at most once per jiffy.
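
The once-per-jiffy throttle can be sketched in user space as below. This is
only an illustration under stated assumptions, not the code from this commit:
struct fake_rq, walk_hierarchy(), update_h_load_throttled() and the
walked_once flag are hypothetical names, and the now argument stands in for
jiffies. As in the description above, no lock is taken around the walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-CPU runqueue; not the kernel's struct rq. */
struct fake_rq {
    unsigned long h_load_throttle; /* tick at which the last walk ran */
    bool walked_once;              /* false until the first walk has run */
    unsigned int walks;            /* number of expensive walks performed */
};

/* Stand-in for the expensive cgroup hierarchy walk. */
static void walk_hierarchy(struct fake_rq *rq)
{
    rq->walks++;
}

/*
 * Run the hierarchy walk at most once per tick. 'now' plays the role of
 * jiffies. Concurrent callers could race on h_load_throttle; the sketch
 * accepts that, since the worst case is one redundant walk.
 */
static void update_h_load_throttled(struct fake_rq *rq, unsigned long now)
{
    assert(rq != NULL);

    /* Already walked during this tick: skip the expensive work. */
    if (rq->walked_once && rq->h_load_throttle == now)
        return;

    rq->h_load_throttle = now;
    rq->walked_once = true;
    walk_hierarchy(rq);
}
```

With this shape, any number of new-idle balance attempts within the same
tick trigger only one hierarchy walk per runqueue, and the next tick allows
one more.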

I considered excluding update_h_load() for new-idle balance
altogether, but purely relying on regular balance passes to update
this data might not work out under some rare circumstances where the
new-idle busiest isn't the regular busiest for a while (unlikely, but
a nightmare to debug if someone hits it and suffers).

Cc: pjt@google.com
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Reported-by: Peter Portante <pportant@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-aaarrzfpnaam7pqrekofu8a6@git.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13 18:41:54 +02:00
..
debug kdb: Switch to nolock variants of kmsg_dump functions 2012-07-21 10:34:00 -07:00
events perf: Use css_tryget() to avoid propping up css refcount 2012-06-18 11:45:57 +02:00
gcov
irq Merge branches 'irq-urgent-for-linus' and 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2012-06-04 11:36:51 -07:00
power Make wait_for_device_probe() also do scsi_complete_async_scans() 2012-07-18 18:15:46 -07:00
sched sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies 2012-08-13 18:41:54 +02:00
time Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2012-07-18 10:36:02 -07:00
trace Merge branches 'core-urgent-for-linus', 'perf-urgent-for-linus' and 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2012-07-14 11:16:24 -07:00
.gitignore
acct.c
async.c
audit.c
audit.h
audit_tree.c
audit_watch.c
auditfilter.c
auditsc.c
backtracetest.c
bounds.c
capability.c
cgroup.c cgroup: fix cgroup hierarchy umount race 2012-07-07 16:08:18 -07:00
cgroup_freezer.c
compat.c
configs.c
cpu.c kernel/cpu.c: document clear_tasks_mm_cpumask() 2012-05-31 17:49:30 -07:00
cpu_pm.c
cpuset.c cpusets: Remove/update outdated comments 2012-07-24 13:53:28 +02:00
crash_dump.c
cred.c
delayacct.c
dma.c
elfcore.c
exec_domain.c
exit.c pidns: find_new_reaper() can no longer switch to init_pid_ns.child_reaper 2012-06-20 14:39:36 -07:00
extable.c
fork.c sched: Fix fork() error path to not crash 2012-07-05 20:57:32 +02:00
freezer.c
futex.c
futex_compat.c
groups.c
hrtimer.c hrtimer: Update hrtimer base offsets each hrtimer_interrupt 2012-07-11 23:34:39 +02:00
hung_task.c
irq_work.c
itimer.c
jump_label.c
kallsyms.c
kcmp.c syscalls, x86: add __NR_kcmp syscall 2012-05-31 17:49:32 -07:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
kexec.c
kfifo.c
kmod.c
kprobes.c
ksysfs.c
kthread.c
latencytop.c
lglock.c
lockdep.c
lockdep_internals.h
lockdep_proc.c
lockdep_states.h
Makefile Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-06-01 10:34:35 -07:00
module.c
mutex-debug.c
mutex-debug.h
mutex.c
mutex.h
notifier.c
nsproxy.c
padata.c
panic.c
params.c
pid.c
pid_namespace.c pidns: guarantee that the pidns init will be the last pidns process reaped 2012-06-20 14:39:36 -07:00
posix-cpu-timers.c
posix-timers.c
printk.c printk: Implement some unlocked kmsg_dump functions 2012-07-21 10:34:00 -07:00
profile.c
ptrace.c
range.c
rcu.h
rcupdate.c
rcutiny.c
rcutiny_plugin.h
rcutorture.c
rcutree.c Revert "rcu: Move PREEMPT_RCU preemption to switch_to() invocation" 2012-07-02 11:39:19 -07:00
rcutree.h Revert "rcu: Move PREEMPT_RCU preemption to switch_to() invocation" 2012-07-02 11:39:19 -07:00
rcutree_plugin.h Revert "rcu: Move PREEMPT_RCU preemption to switch_to() invocation" 2012-07-02 11:39:19 -07:00
rcutree_trace.c
relay.c splice: fix racy pipe->buffers uses 2012-06-13 21:16:42 +02:00
res_counter.c
resource.c
rtmutex-debug.c
rtmutex-debug.h
rtmutex-tester.c
rtmutex.c
rtmutex.h
rtmutex_common.h
rwsem.c
seccomp.c
semaphore.c
signal.c new helper: signal_delivered() 2012-06-01 12:58:52 -04:00
smp.c
smpboot.c
smpboot.h
softirq.c
spinlock.c
srcu.c
stacktrace.c
stop_machine.c
sys.c c/r: prctl: less paranoid prctl_set_mm_exe_file() 2012-07-11 16:04:43 -07:00
sys_ni.c syscalls, x86: add __NR_kcmp syscall 2012-05-31 17:49:32 -07:00
sysctl.c
sysctl_binary.c
task_work.c
taskstats.c
test_kprobes.c
time.c
timeconst.pl
timer.c
tracepoint.c
tsacct.c
uid16.c
up.c
user-return-notifier.c
user.c
user_namespace.c
utsname.c
utsname_sysctl.c
wait.c
watchdog.c watchdog: Quiet down the boot messages 2012-06-14 12:20:50 +02:00
workqueue.c
workqueue_sched.h