Star64_linux/kernel
Srivatsa S. Bhat 8d056c48e4 CPU hotplug, smp: flush any pending IPI callbacks before CPU offline
There is a race between the CPU offline code (within stop-machine) and
the smp-call-function code, which can lead to getting IPIs on the
outgoing CPU, *after* it has gone offline.

Specifically, this can happen when using
smp_call_function_single_async() to send the IPI, since this API allows
sending asynchronous IPIs from IRQ disabled contexts.  The exact race
condition is described below.

During CPU offline, in stop-machine, we don't enforce any rule in the
_DISABLE_IRQ stage, regarding the order in which the outgoing CPU and
the other CPUs disable their local interrupts.  Due to this, we can
encounter a situation in which an IPI is sent by one of the other CPUs
to the outgoing CPU (while it is *still* online), but the outgoing CPU
ends up noticing it only *after* it has gone offline.

              CPU 1                                         CPU 2
          (Online CPU)                               (CPU going offline)

       Enter _PREPARE stage                          Enter _PREPARE stage

                                                     Enter _DISABLE_IRQ stage

                                                   =
       Got a device interrupt, and                 | Didn't notice the IPI
       the interrupt handler sent an               | since interrupts were
       IPI to CPU 2 using                          | disabled on this CPU.
       smp_call_function_single_async()            |
                                                   =

       Enter _DISABLE_IRQ stage

       Enter _RUN stage                              Enter _RUN stage

                                  =
       Busy loop with interrupts  |                  Invoke take_cpu_down()
       disabled.                  |                  and take CPU 2 offline
                                  =

       Enter _EXIT stage                             Enter _EXIT stage

       Re-enable interrupts                          Re-enable interrupts

                                                     The pending IPI is noted
                                                     immediately, but alas,
                                                     the CPU is offline at
                                                     this point.

This of course, makes the smp-call-function IPI handler code running on
CPU 2 unhappy and it complains about "receiving an IPI on an offline
CPU".

One real example of the scenario on CPU 1 is the block layer's
complete-request call-path:

	__blk_complete_request() [interrupt-handler]
	    raise_blk_irq()
	        smp_call_function_single_async()

However, if we look closely, the block layer does check that the target
CPU is online before firing the IPI.  So in this case, it is actually
the unfortunate ordering/timing of events in the stop-machine phase that
leads to receiving IPIs after the target CPU has gone offline.

In reality, getting a late IPI on an offline CPU is not too bad by
itself (this can happen even due to hardware latencies in IPI
send-receive).  It is a bug only if the target CPU really went offline
without executing all the callbacks queued on its list.  (Note that a
CPU is free to execute its pending smp-call-function callbacks in a
batch, without waiting for the corresponding IPIs to arrive for each one
of those callbacks).

So, fixing this issue can be broken up into two parts:

1. Ensure that a CPU goes offline only after executing all the
   callbacks queued on it.

2. Modify the warning condition in the smp-call-function IPI handler
   code such that it warns only if an offline CPU got an IPI *and* that
   CPU had gone offline with callbacks still pending in its queue.

Achieving part 1 is straight-forward - just flush (execute) all the
queued callbacks on the outgoing CPU in the CPU_DYING stage[1],
including those callbacks for which the source CPU's IPIs might not have
been received on the outgoing CPU yet.  Once we do this, an IPI that
arrives late on the CPU going offline (either due to the race mentioned
above, or due to hardware latencies) will be completely harmless, since
the outgoing CPU would have executed all the queued callbacks before
going offline.

Overall, this fix (parts 1 and 2 put together) additionally guarantees
that we will see a warning only when the *IPI-sender code* is buggy -
that is, if it queues the callback _after_ the target CPU has gone
offline.

[1].  The CPU_DYING part needs a little more explanation: by the time we
execute the CPU_DYING notifier callbacks, the CPU would have already
been marked offline.  But we want to flush out the pending callbacks at
this stage, ignoring the fact that the CPU is offline.  So restructure
the IPI handler code so that we can by-pass the "is-cpu-offline?" check
in this particular case.  (Of course, the right solution here is to fix
CPU hotplug to mark the CPU offline _after_ invoking the CPU_DYING
notifiers, but this requires a lot of audit to ensure that this change
doesn't break any existing code; hence lets go with the solution
proposed above until that is done).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Suggested-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <mgalbraith@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sachin Kamat <sachin.kamat@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 16:47:43 -07:00
..
debug kernel/printk: use symbolic defines for console loglevels 2014-06-04 16:54:17 -07:00
events Merge branch 'perf/core' into perf/urgent, to pick up the latest fixes 2014-06-14 14:10:08 +02:00
gcov gcov: add support for GCC 4.9 2014-06-10 15:34:46 -07:00
irq genirq: Improve documentation to match current implementation 2014-05-27 10:16:44 +02:00
locking Merge branch 'locking-urgent-for-linus.patch' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-06-21 07:06:02 -10:00
power x86, kaslr: boot-time selectable with hibernation 2014-06-16 23:30:44 +02:00
printk kernel/printk: use symbolic defines for console loglevels 2014-06-04 16:54:17 -07:00
rcu Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next 2014-06-03 12:57:53 -07:00
sched Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-06-12 19:42:15 -07:00
time Merge branch 'akpm' (patchbomb from Andrew) into next 2014-06-04 16:55:13 -07:00
trace One bug fix that goes back to 3.10. Accessing a non existent buffer 2014-06-12 21:07:25 -07:00
.gitignore
acct.c ipc, kernel: clear whitespace 2014-06-06 16:08:14 -07:00
async.c
audit.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2014-06-12 14:27:40 -07:00
audit.h
audit_tree.c
audit_watch.c
auditfilter.c
auditsc.c auditsc: audit_krule mask accesses need bounds checking 2014-06-10 08:44:40 -07:00
backtracetest.c kernel/backtracetest.c: replace no level printk by pr_info() 2014-06-04 16:54:14 -07:00
bounds.c
capability.c fs,userns: Change inode_capable to capable_wrt_inode_uidgid 2014-06-10 13:57:22 -07:00
cgroup.c Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2014-06-09 15:03:33 -07:00
cgroup_freezer.c cgroup: remove css_parent() 2014-05-16 13:22:48 -04:00
compat.c kernel/compat.c: use sizeof() instead of sizeof 2014-06-04 16:54:19 -07:00
configs.c
context_tracking.c x86/kprobes: Fix build errors and blacklist context_track_user 2014-06-14 09:07:44 +02:00
cpu.c More ACPI and power management updates for 3.16-rc1 2014-06-12 13:14:19 -07:00
cpu_pm.c
cpuset.c Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2014-06-09 15:03:33 -07:00
crash_dump.c
cred.c
delayacct.c
dma.c
elfcore.c
exec_domain.c kernel/exec_domain.c: code clean-up 2014-06-04 16:54:15 -07:00
exit.c signals: mv {dis,}allow_signal() from sched.h/exit.c to signal.[ch] 2014-06-06 16:08:11 -07:00
extable.c
fork.c ptrace: fix fork event messages across pid namespaces 2014-06-06 16:08:11 -07:00
freezer.c
futex.c Merge branch 'next' (accumulated 3.16 merge window patches) into master 2014-06-08 11:31:16 -07:00
futex_compat.c
groups.c
hrtimer.c Merge branch 'perf/urgent' into perf/core, to resolve conflict and to prepare for new patches 2014-06-06 07:55:06 +02:00
hung_task.c kernel/hung_task.c: convert simple_strtoul to kstrtouint 2014-06-04 16:54:15 -07:00
irq_work.c
itimer.c
jump_label.c
kallsyms.c
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks locking/rwlocks: Introduce 'qrwlocks' - fair, queued rwlocks 2014-06-06 07:58:28 +02:00
Kconfig.preempt
kexec.c kernel/kexec.c: convert printk to pr_foo() 2014-06-06 16:08:12 -07:00
kmod.c signals: change wait_for_helper() to use kernel_sigaction() 2014-06-06 16:08:12 -07:00
kprobes.c
ksysfs.c
kthread.c kthread: fix return value of kthread_create() upon SIGKILL. 2014-06-04 16:53:51 -07:00
latencytop.c kernel/latencytop.c: convert seq_printf to seq_puts 2014-06-04 16:54:15 -07:00
Makefile
module-internal.h
module.c Most of this is cleaning up various driver sysfs permissions so we can 2014-06-11 16:09:14 -07:00
module_signing.c
notifier.c
nsproxy.c
padata.c
panic.c kernel/panic.c: add "crash_kexec_post_notifiers" option for kdump after panic_notifers 2014-06-06 16:08:12 -07:00
params.c
pid.c
pid_namespace.c
posix-cpu-timers.c
posix-timers.c
profile.c kernel/profile.c: use static const char instead of static char 2014-06-06 16:08:13 -07:00
ptrace.c
range.c
reboot.c kernel/reboot.c: convert simple_strtoul to kstrtoint 2014-06-04 16:54:15 -07:00
relay.c
res_counter.c kernel/res_counter.c: replace simple_strtoull by kstrtoull 2014-06-04 16:54:15 -07:00
resource.c resources: Clarify sanity check message 2014-05-23 10:47:21 -06:00
seccomp.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2014-06-12 14:27:40 -07:00
signal.c signals: introduce kernel_sigaction() 2014-06-06 16:08:12 -07:00
smp.c CPU hotplug, smp: flush any pending IPI callbacks before CPU offline 2014-06-23 16:47:43 -07:00
smpboot.c
smpboot.h
softirq.c Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu 2014-05-22 11:36:10 +02:00
stacktrace.c
stop_machine.c kernel/stop_machine.c: kernel-doc warning fix 2014-06-04 16:54:15 -07:00
sys.c sched: Consolidate open coded implementations of nice level frobbing into nice_to_rlimit() and rlimit_to_nice() 2014-05-22 11:16:36 +02:00
sys_ni.c sys_sgetmask/sys_ssetmask: add CONFIG_SGETMASK_SYSCALL 2014-06-04 16:54:14 -07:00
sysctl.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next 2014-06-19 07:50:07 -10:00
sysctl_binary.c
system_certificates.S
system_keyring.c
task_work.c
taskstats.c
test_kprobes.c
time.c
timeconst.bc
timer.c
torture.c
tracepoint.c kernel/tracepoint.c: kernel-doc fixes 2014-06-04 16:54:15 -07:00
tsacct.c
uid16.c
up.c
user-return-notifier.c
user.c kernel/user.c: drop unused field 'files' from user_struct 2014-06-04 16:54:16 -07:00
user_namespace.c kernel/user_namespace.c: kernel-doc/checkpatch fixes 2014-06-06 16:08:13 -07:00
utsname.c
utsname_sysctl.c sysctl: convert use of typedef ctl_table to struct ctl_table 2014-06-06 16:08:16 -07:00
watchdog.c
workqueue.c Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq 2014-06-09 14:56:49 -07:00
workqueue_internal.h workqueue: rename manager_mutex to attach_mutex 2014-05-20 10:59:32 -04:00