Clocksources don't get the VALID_FOR_HRES flag until they have been
checked by a watchdog. However, when using an override, the
clocksource_select logic will clear the override value if the
clocksource is not marked VALID_FOR_HRES during that inititial check.
When using the boot arguments clocksource=<foo>, this selection can
run before the watchdog, and can cause the override to be incorrectly
cleared.
To address this condition, the override_name is only invalidated for
unstable clocksources. Otherwise, the override is left intact until after
the watchdog has validated the clocksource as stable/unstable.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Kyle Walker <kwalker@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Prior to the change the function would blindly deference mm, exe_file
and exe_file->f_inode, each of which could have been NULL or freed.
Use get_task_exe_file to safely obtain stable exe_file.
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Richard Guy Briggs <rgb@redhat.com>
Cc: <stable@vger.kernel.org> # 4.3.x
Signed-off-by: Paul Moore <paul@paul-moore.com>
For more convenient access if one has a pointer to the task.
As a minor nit take advantage of the fact that only task lock + rcu are
needed to safely grab ->exe_file. This saves mm refcount dance.
Use the helper in proc_exe_link.
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Richard Guy Briggs <rgb@redhat.com>
Cc: <stable@vger.kernel.org> # 4.3.x
Signed-off-by: Paul Moore <paul@paul-moore.com>
v2: Fixed the very obvious lack of setting ucounts
on struct mnt_ns reported by Andrei Vagin, and the kbuild
test report.
Reported-by: Andrei Vagin <avagin@openvz.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This fixes a ptrace vs fatal pending signals bug as manifested in
seccomp now that seccomp was reordered to happen after ptrace. The
short version is that seccomp should not attempt to call do_exit()
while fatal signals are pending under a tracer. The existing code was
trying to be as defensively paranoid as possible, but it now ends up
confusing ptrace. Instead, the syscall can just be skipped (which solves
the original concern that the do_exit() was addressing) and normal signal
handling, tracer notification, and process death can happen.
Paraphrasing from the original bug report:
If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed
after such a trap but not yet been scheduled, and another task in the
thread-group calls exit_group(), then the tracee task exits without the
ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here:
https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7
The bug happens because when __seccomp_filter() detects
fatal_signal_pending(), it calls do_exit() without dequeuing the fatal
signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and
that task is descheduled, __schedule() notices that there is a fatal
signal pending and changes its state from TASK_TRACED to TASK_RUNNING.
That prevents the ptracer's waitpid() from returning the ptrace event.
A more detailed analysis is here:
https://github.com/mozilla/rr/issues/1762#issuecomment-237396255.
Reported-by: Robert O'Callahan <robert@ocallahan.org>
Reported-by: Kyle Huey <khuey@kylehuey.com>
Tested-by: Kyle Huey <khuey@kylehuey.com>
Fixes: 93e35efb8d ("x86/ptrace: run seccomp after ptrace")
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Unfortunately we record PIDs in audit records using a variety of
methods despite the correct way being the use of task_tgid_nr().
This patch converts all of these callers, except for the case of
AUDIT_SET in audit_receive_msg() (see the comment in the code).
Reported-by: Jeff Vander Stoep <jeffv@google.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Pull cgroup fixes from Tejun Heo:
"Two fixes for cgroup.
- There still was a hole in enforcing cpuset rules, fixed by Li.
- The recent switch to global percpu_rwseom for threadgroup locking
revealed a couple issues in how percpu_rwsem is implemented and
used by cgroup. Balbir found that the read locking section was too
wide unnecessarily including operations which can often depend on
IOs. With percpu_rwsem updates (coming through a different tree)
and reduction of read locking section, all the reported locking
latency issues, including the android one, are resolved.
It looks like we can keep global percpu_rwsem locking for now. If
there actually are cases which can't be resolved, we can go back to
more complex per-signal_struct locking"
* 'for-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: reduce read locked section of cgroup_threadgroup_rwsem during fork
cpuset: make sure new tasks conform to the current config of the cpuset
Pull perf fixes from Thomas Gleixner:
"A few fixes from the perf departement
- prevent a imbalanced preemption disable in the events teardown code
- prevent out of bound acces in perf userspace
- make perf tools compile with UCLIBC again
- a fix for the userspace unwinder utility"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Use this_cpu_ptr() when stopping AUX events
perf evsel: Do not access outside hw cache name arrays
tools lib: Reinstate strlcpy() header guard with __UCLIBC__
perf unwind: Use addr_location::addr instead of ip for entries
Pull irq fixes from Thomas Gleixner:
"This lot provides:
- plug a hotplug race in the new affinity infrastructure
- a fix for the trigger type of chained interrupts
- plug a potential memory leak in the core code
- a few fixes for ARM and MIPS GICs"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/mips-gic: Implement activate op for device domain
irqchip/mips-gic: Cleanup chip and handler setup
genirq/affinity: Use get/put_online_cpus around cpumask operations
genirq: Fix potential memleak when failing to get irq pm
irqchip/gicv3-its: Disable the ITS before initializing it
irqchip/gicv3: Remove disabling redistributor and group1 non-secure interrupts
irqchip/gic: Allow self-SGIs for SMP on UP configurations
genirq: Correctly configure the trigger on chained interrupts
Pull timer fixes from Thomas Gleixner:
"A few updates for timers & co:
- prevent a livelock in the timekeeping code when debugging is
enabled
- prevent out of bounds access in the timekeeping debug code
- various fixes in clocksource drivers
- a new maintainers entry"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/drivers/sun4i: Clear interrupts after stopping timer in probe function
drivers/clocksource/pistachio: Fix memory corruption in init
clocksource/drivers/timer-atmel-pit: Enable mck clock
clocksource/drivers/pxa: Fix include files for compilation
MAINTAINERS: Add ARM ARCHITECTED TIMER entry
timekeeping: Cap array access in timekeeping_debug
timekeeping: Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING
Pull block fixes from Jens Axboe:
"Here's a set of block fixes for the current 4.8-rc release. This
contains:
- a fix for a secure erase regression, from Adrian.
- a fix for an mmc use-after-free bug regression, also from Adrian.
- potential zero pointer deference in bdev freezing, from Andrey.
- a race fix for blk_set_queue_dying() from Bart.
- a set of xen blkfront fixes from Bob Liu.
- three small fixes for bcache, from Eric and Kent.
- a fix for a potential invalid NVMe state transition, from Gabriel.
- blk-mq CPU offline fix, preventing us from issuing and completing a
request on the wrong queue. From me.
- revert two previous floppy changes, since they caused a user
visibile regression. A better fix is in the works.
- ensure that we don't send down bios that have more than 256
elements in them. Fixes a crash with bcache, for example. From
Ming.
- a fix for deferencing an error pointer with cgroup writeback.
Fixes a regression. From Vegard"
* 'for-linus' of git://git.kernel.dk/linux-block:
mmc: fix use-after-free of struct request
Revert "floppy: refactor open() flags handling"
Revert "floppy: fix open(O_ACCMODE) for ioctl-only open"
fs/block_dev: fix potential NULL ptr deref in freeze_bdev()
blk-mq: improve warning for running a queue on the wrong CPU
blk-mq: don't overwrite rq->mq_ctx
block: make sure a big bio is split into at most 256 bvecs
nvme: Fix nvme_get/set_features() with a NULL result pointer
bdev: fix NULL pointer dereference
xen-blkfront: free resources if xlvbd_alloc_gendisk fails
xen-blkfront: introduce blkif_set_queue_limits()
xen-blkfront: fix places not updated after introducing 64KB page granularity
bcache: pr_err: more meaningful error message when nr_stripes is invalid
bcache: RESERVE_PRIO is too small by one when prio_buckets() is a power of two.
bcache: register_bcache(): call blkdev_put() when cache_alloc() fails
block: Fix race triggered by blk_set_queue_dying()
block: Fix secure erase
nvme: Prevent controller state invalid transition
Commit bbeddf52ad ("printk: move braille console support into separate
braille.[ch] files") moved the parsing of braille-related options into
_braille_console_setup(), changing the type of variable str from char*
to char**. In this commit, memcmp(str, "brl,", 4) was correctly updated
to memcmp(*str, "brl,", 4) but not memcmp(str, "brl=", 4).
Update the code to make "brl=" option work again and replace memcmp()
with strncmp() to make the compiler able to detect such an issue.
Fixes: bbeddf52ad ("printk: move braille console support into separate braille.[ch] files")
Link: http://lkml.kernel.org/r/20160823165700.28952-1-nicolas.iooss_linux@m4x.org
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have scripts which write to certain fields on 3.18 kernels but this
seems to be failing on 4.4 kernels. An entry which we write to here is
xfrm_aevent_rseqth which is u32.
echo 4294967295 > /proc/sys/net/core/xfrm_aevent_rseqth
Commit 230633d109 ("kernel/sysctl.c: detect overflows when converting
to int") prevented writing to sysctl entries when integer overflow
occurs. However, this does not apply to unsigned integers.
Heinrich suggested that we introduce a new option to handle 64 bit
limits and set min as 0 and max as UINT_MAX. This might not work as it
leads to issues similar to __do_proc_doulongvec_minmax. Alternatively,
we would need to change the datatype of the entry to 64 bit.
static int __do_proc_doulongvec_minmax(void *data, struct ctl_table
{
i = (unsigned long *) data; //This cast is causing to read beyond the size of data (u32)
vleft = table->maxlen / sizeof(unsigned long); //vleft is 0 because maxlen is sizeof(u32) which is lesser than sizeof(unsigned long) on x86_64.
Introduce a new proc handler proc_douintvec. Individual proc entries
will need to be updated to use the new handler.
[akpm@linux-foundation.org: coding-style fixes]
Fixes: 230633d109 ("kernel/sysctl.c:detect overflows when converting to int")
Link: http://lkml.kernel.org/r/1471479806-5252-1-git-send-email-subashab@codeaurora.org
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's no reliable way to determine which module tainted the kernel
with TAINT_LIVEPATCH. For example, /sys/module/<klp module>/taint
doesn't report it. Neither does the "mod -t" command in the crash tool.
Make it crystal clear who the guilty party is by associating
TAINT_LIVEPATCH with any module which sets the "livepatch" modinfo
attribute. The flag will still get set in the kernel like before, but
now it also sets the same flag in mod->taint.
Note that now the taint flag gets set when the module is loaded rather
than when it's enabled.
I also renamed find_livepatch_modinfo() to check_modinfo_livepatch() to
better reflect its purpose: it's basically a livepatch-specific
sub-function of check_modinfo().
Reported-by: Chunyu Hu <chuhu@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Jessica Yu <jeyu@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
disable_nonboot_cpus() assumes that the lowest numbered online CPU is
the boot CPU, and that this is the correct CPU to run any power
management code on.
On x86 this is always correct, as CPU0 cannot (easily) by taken offline.
On arm64 CPU0 can be taken offline. For hibernate/resume this means we
may hibernate on a CPU other than CPU0. If the system is rebooted with
kexec 'CPU0' will be assigned to a different physical CPU. This
complicates hibernate/resume as now we can't trust the CPU numbers.
Arch code can find the correct physical CPU, and ensure it is online
before resume from hibernate begins, but also needs to influence
disable_nonboot_cpus()s choice of CPU.
Rename disable_nonboot_cpus() as freeze_secondary_cpus() and add an
argument indicating which CPU should be left standing. Follow the logic
in migrate_to_reboot_cpu() to use the lowest numbered online CPU if the
requested CPU is not online.
Add disable_nonboot_cpus() as an inline function that has the existing
behaviour.
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
When tearing down an AUX buf for an event via perf_mmap_close(),
__perf_event_output_stop() is called on the event's CPU to ensure that
trace generation is halted before the process of unmapping and
freeing the buffer pages begins.
The callback is performed via cpu_function_call(), which ensures that it
runs with interrupts disabled and is therefore not preemptible.
Unfortunately, the current code grabs the per-cpu context pointer using
get_cpu_ptr(), which unnecessarily disables preemption and doesn't pair
the call with put_cpu_ptr(), leading to a preempt_count() imbalance and
a BUG when freeing the AUX buffer later on:
WARNING: CPU: 1 PID: 2249 at kernel/events/ring_buffer.c:539 __rb_free_aux+0x10c/0x120
Modules linked in:
[...]
Call Trace:
[<ffffffff813379dd>] dump_stack+0x4f/0x72
[<ffffffff81059ff6>] __warn+0xc6/0xe0
[<ffffffff8105a0c8>] warn_slowpath_null+0x18/0x20
[<ffffffff8112761c>] __rb_free_aux+0x10c/0x120
[<ffffffff81128163>] rb_free_aux+0x13/0x20
[<ffffffff8112515e>] perf_mmap_close+0x29e/0x2f0
[<ffffffff8111da30>] ? perf_iterate_ctx+0xe0/0xe0
[<ffffffff8115f685>] remove_vma+0x25/0x60
[<ffffffff81161796>] exit_mmap+0x106/0x140
[<ffffffff8105725c>] mmput+0x1c/0xd0
[<ffffffff8105cac3>] do_exit+0x253/0xbf0
[<ffffffff8105e32e>] do_group_exit+0x3e/0xb0
[<ffffffff81068d49>] get_signal+0x249/0x640
[<ffffffff8101c273>] do_signal+0x23/0x640
[<ffffffff81905f42>] ? _raw_write_unlock_irq+0x12/0x30
[<ffffffff81905f69>] ? _raw_spin_unlock_irq+0x9/0x10
[<ffffffff81901896>] ? __schedule+0x2c6/0x710
[<ffffffff810022a4>] exit_to_usermode_loop+0x74/0x90
[<ffffffff81002a56>] prepare_exit_to_usermode+0x26/0x30
[<ffffffff81906d1b>] retint_user+0x8/0x10
This patch uses this_cpu_ptr() instead of get_cpu_ptr(), since preemption is
already disabled by the caller.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 95ff4ca26c ("perf/core: Free AUX pages in unmap path")
Link: http://lkml.kernel.org/r/20160824091905.GA16944@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that the x86 switch_to() uses the standard C calling convention,
the STACK_FRAME_NON_STANDARD() annotation is no longer needed.
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1471106302-10159-8-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When we get a hung task it can often be valuable to see _all_ the held
locks on the system (in case we are being blocked on trying to acquire
one), e.g. with this patch we can immediately see where the problem is
below:
INFO: task trinity-c3:14933 blocked for more than 120 seconds.
Not tainted 4.8.0-rc1+ #135
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
trinity-c3 D ffff88010c16fc88 0 14933 1 0x00080004
ffff88010c16fc88 000000003b9aca00 0000000000000000 0000000000000296
00000000776cdf88 ffff88011a520ae0 ffff88011a520b08 ffff88011a520198
ffffffff867d7f00 ffff88011942c080 ffff880116841580 ffff88010c168000
Call Trace:
[<ffffffff845e9d37>] schedule+0x77/0x230
[<ffffffff833cb8b9>] __lock_sock+0x129/0x250
[<ffffffff833cb790>] ? __sk_destruct+0x450/0x450
[<ffffffff81408ac0>] ? wake_bit_function+0x2e0/0x2e0
[<ffffffff833d832b>] lock_sock_nested+0xeb/0x120
[<ffffffff83bad815>] irda_setsockopt+0x65/0xb40
[<ffffffff833c6c09>] SyS_setsockopt+0x139/0x230
[<ffffffff833c6ad0>] ? SyS_recv+0x20/0x20
[<ffffffff81004660>] ? trace_event_raw_event_sys_enter+0xb90/0xb90
[<ffffffff823c7023>] ? __this_cpu_preempt_check+0x13/0x20
[<ffffffff8162ee60>] ? __context_tracking_exit.part.3+0x30/0x1b0
[<ffffffff833c6ad0>] ? SyS_recv+0x20/0x20
[<ffffffff81007bd3>] do_syscall_64+0x1b3/0x4b0
[<ffffffff845f84aa>] entry_SYSCALL64_slow_path+0x25/0x25
Showing all locks held in the system:
2 locks held by khungtaskd/563:
#0: (rcu_read_lock){......}, at: [<ffffffff81534ce6>] watchdog+0x106/0x910
#1: (tasklist_lock){......}, at: [<ffffffff8141b3c4>] debug_show_all_locks+0x74/0x360
1 lock held by trinity-c0/19280:
#0: (sk_lock-AF_IRDA){......}, at: [<ffffffff83bab7c6>] irda_accept+0x176/0x10f0
1 lock held by trinity-c0/12865:
#0: (sk_lock-AF_IRDA){......}, at: [<ffffffff83bab7c6>] irda_accept+0x176/0x10f0
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1471538460-7505-1-git-send-email-vegard.nossum@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When function graph tracing is enabled for a function, ftrace modifies
the stack by replacing the original return address with the address of a
hook function (return_to_handler).
Stack unwinders need a way to get the original return address. Add an
arch-independent helper function for that named ftrace_graph_ret_addr().
This adds two variations of the function: one depends on
HAVE_FUNCTION_GRAPH_RET_ADDR_PTR, and the other relies on an index state
variable.
The former is recommended because, in some cases, the latter can cause
problems when the unwinder skips stack frames. It can get out of sync
with the ret_stack index and wrong addresses can be reported for the
stack trace.
Once all arches have been ported to use
HAVE_FUNCTION_GRAPH_RET_ADDR_PTR, we can get rid of the distinction.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/36bd90f762fc5e5af3929e3797a68a64906421cf.1471607358.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Storing this value will help prevent unwinders from getting out of sync
with the function graph tracer ret_stack. Now instead of needing a
stateful iterator, they can compare the return address pointer to find
the right ret_stack entry.
Note that an array of 50 ftrace_ret_stack structs is allocated for every
task. So when an arch implements this, it will add either 200 or 400
bytes of memory usage per task (depending on whether it's a 32-bit or
64-bit platform).
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a95cfcc39e8f26b89a430c56926af0bb217bc0a1.1471607358.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This saves some memory when HAVE_FUNCTION_GRAPH_FP_TEST isn't defined.
On x86_64 with newer versions of gcc which have -mfentry, it saves 400
bytes per task.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/5c7747d9ea7b5cb47ef0a8ce8a6cea6bf7aa94bf.1471607358.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make HAVE_FUNCTION_GRAPH_FP_TEST a normal define, independent from
kconfig. This removes some config file pollution and simplifies the
checking for the fp test.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/2c4e5f05054d6d367f702fd153af7a0109dd5c81.1471607358.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If CONFIG_VMAP_STACK=y is selected, kernel stacks are allocated with
__vmalloc_node_range().
Grsecurity has had a similar feature (called GRKERNSEC_KSTACKOVERFLOW=y)
for a long time.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/14c07d4fd173a5b117f51e8b939f9f4323e39899.1470907718.git.luto@kernel.org
[ Minor edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It was reported that hibernation could fail on the 2nd attempt, where the
system hangs at hibernate() -> syscore_resume() -> i8237A_resume() ->
claim_dma_lock(), because the lock has already been taken.
However there is actually no other process would like to grab this lock on
that problematic platform.
Further investigation showed that the problem is triggered by setting
/sys/power/pm_trace to 1 before the 1st hibernation.
Since once pm_trace is enabled, the rtc becomes unmeaningful after suspend,
and meanwhile some BIOSes would like to adjust the 'invalid' RTC (e.g, smaller
than 1970) to the release date of that motherboard during POST stage, thus
after resumed, it may seem that the system had a significant long sleep time
which is a completely meaningless value.
Then in timekeeping_resume -> tk_debug_account_sleep_time, if the bit31 of the
sleep time happened to be set to 1, fls() returns 32 and we add 1 to
sleep_time_bin[32], which causes an out of bounds array access and therefor
memory being overwritten.
As depicted by System.map:
0xffffffff81c9d080 b sleep_time_bin
0xffffffff81c9d100 B dma_spin_lock
the dma_spin_lock.val is set to 1, which caused this problem.
This patch adds a sanity check in tk_debug_account_sleep_time()
to ensure we don't index past the sleep_time_bin array.
[jstultz: Problem diagnosed and original patch by Chen Yu, I've solved the
issue slightly differently, but borrowed his excelent explanation of the
issue here.]
Fixes: 5c83545f24 "power: Add option to log time spent in suspend"
Reported-by: Janek Kozicki <cosurgi@gmail.com>
Reported-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: linux-pm@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Xunlei Pang <xpang@redhat.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: stable <stable@vger.kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Link: http://lkml.kernel.org/r/1471993702-29148-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When I added some extra sanity checking in timekeeping_get_ns() under
CONFIG_DEBUG_TIMEKEEPING, I missed that the NMI safe __ktime_get_fast_ns()
method was using timekeeping_get_ns().
Thus the locking added to the debug checks broke the NMI-safety of
__ktime_get_fast_ns().
This patch open-codes the timekeeping_get_ns() logic for
__ktime_get_fast_ns(), so can avoid any deadlocks in NMI.
Fixes: 4ca22c2648 "timekeeping: Add warnings when overflows or underflows are observed"
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: stable <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1471993702-29148-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Change kprobe/uprobe-tracer to show the arguments type-casted
with u8/u16/u32/u64 in decimal digits instead of hexadecimal.
To minimize compatibility issue, the arguments without type
casting are typed by x64 (or x32 for 32bit arch) by default.
Note: all arguments set by old perf probe without types are
shown in decimal by default.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Hemant Kumar <hemant@linux.vnet.ibm.com>
Cc: Naohiro Aota <naohiro.aota@hgst.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/r/147151076135.12957.14684546093034343894.stgit@devbox
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Add README entries for kprobe-events and uprobe-events.
This allows user to check what options can be acceptable
for running kernel.
E.g. perf tools can choose correct types for the kernel.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Hemant Kumar <hemant@linux.vnet.ibm.com>
Cc: Naohiro Aota <naohiro.aota@hgst.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/r/147151069524.12957.12957179170304055028.stgit@devbox
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Add x8/x16/x32/x64 for hexadecimal type casting to kprobe/uprobe event
tracer.
These type casts can be used for integer arguments for explicitly
showing them in hexadecimal digits in formatted text.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Hemant Kumar <hemant@linux.vnet.ibm.com>
Cc: Naohiro Aota <naohiro.aota@hgst.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/r/147151067029.12957.11591314629326414783.stgit@devbox
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
A few rcuperf dmesg output messages have no space between the flag and
the start of the message. In contrast, every other messages consistently
supplies a single space. This difference makes rcuperf dmesg output
hard to read and to mechanically parse. This commit therefore fixes
this problem by modifying a pr_alert() call and PERFOUT_STRING() macro
function to provide that single space.
Signed-off-by: SeongJae Park <sj38.park@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tests for rcu_barrier() were introduced by commit fae4b54f28 ("rcu:
Introduce rcutorture testing for rcu_barrier()"). This commit updated
the documentation to say that the "rtbe" field in rcutorture's dmesg
output indicates test failure. However, the code was not updated, only
the documentation. This commit therefore updates the code to match the
updated documentation.
Signed-off-by: SeongJae Park <sj38.park@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit adds a dump of the scheduler state for stalled rcutorture
writer tasks. This addition provides yet more debug for the intermittent
"failures to proceed", where grace periods move ahead but the rcutorture
writer tasks fail to do so.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Upcoming changes to the timer wheel introduce significant inaccuracy
and possibly also an ultimate limit on timeout duration. This is a
problem for the current implementation of torture_shutdown() because
(1) shutdown times are user-specified, and can therefore be quite long,
and (2) the torture scripting will kill a test instance that runs for
more than a few minutes longer than scheduled. This commit therefore
converts the torture_shutdown() timed waits to an hrtimer, thus avoiding
too-short torture test runs as well as death by scripting.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Install the callbacks via the state machine and let the core invoke
the callbacks on the already online CPUs.
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CPU_STARTING is scheduled for removal. There is no use of it in drivers
and core code uses it only for compatibility with old-style CPU-hotplug
notifiers. This patch removes therefore removes CPU_STARTING from an
RCU-related comment.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Up to now, RCU has assumed that the CPU-online process makes it from
CPU_UP_PREPARE to set_cpu_online() within one jiffy. Given the recent
rise of virtualized environments, this assumption is very clearly
obsolete. Failing to meet this deadline can result in RCU paying
attention to an incoming CPU for one jiffy, then ignoring it until the
grace period following the one in which that CPU sets itself online.
This situation might prove to be fatally disappointing to any RCU
read-side critical sections that had the misfortune to execute during
the time in which RCU was ignoring the slow-to-come-online CPU.
This commit therefore updates RCU's internal CPU state-tracking
information at notify_cpu_starting() time, thus providing RCU with
an exact transition of the CPU's state from offline to online.
Note that this means that incoming CPUs must not use RCU read-side
critical section (other than those of SRCU) until notify_cpu_starting()
time. Note also that the CPU_STARTING notifiers -are- allowed to use
RCU read-side critical sections. (Of course, CPU-hotplug notifiers are
rapidly becoming obsolete, so you need to act fast!)
If a given architecture or CPU family needs to use RCU read-side
critical sections earlier, the call to rcu_cpu_starting() from
notify_cpu_starting() will need to be architecture-specific, with
architectures that need early use being required to hand-place
the call to rcu_cpu_starting() at some point preceding the call to
notify_cpu_starting().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, __note_gp_changes() checks to see if the CPU has slept through
multiple grace periods. If it has, it resynchronizes that CPU's view
of the grace-period state, which includes whether or not the current
grace period needs a quiescent state from this CPU. The fact of this
need (or lack thereof) needs to be in two places, rdp->cpu_no_qs.b.norm
and rdp->core_needs_qs. The former tells RCU's context-switch code to
go get a quiescent state and the latter says that it needs to be reported.
The current code unconditionally sets the former to true, but correctly
sets the latter.
This does not result in failures, but it does unnecessarily increase
the amount of work done on average at context-switch time. This commit
therefore correctly sets both fields.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The Kconfig currently controlling compilation of tree.c is:
init/Kconfig:config TREE_RCU
init/Kconfig: bool
...and update.c and sync.c are "obj-y" meaning that none are ever
built as a module by anyone.
Since MODULE_ALIAS is a no-op for non-modular code, we can remove
them from these files.
We leave moduleparam.h behind since the files instantiate some boot
time configuration parameters with module_param() still.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Both timers and hrtimers are maintained on the outgoing CPU until
CPU_DEAD time, at which point they are migrated to a surviving CPU. If a
mod_timer() executes between CPU_DYING and CPU_DEAD time, x86 systems
will splat in native_smp_send_reschedule() when attempting to wake up
the just-now-offlined CPU, as shown below from a NO_HZ_FULL kernel:
[ 7976.741556] WARNING: CPU: 0 PID: 661 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:125 native_smp_send_reschedule+0x39/0x40
[ 7976.741595] Modules linked in:
[ 7976.741595] CPU: 0 PID: 661 Comm: rcu_torture_rea Not tainted 4.7.0-rc2+ #1
[ 7976.741595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 7976.741595] 0000000000000000 ffff88000002fcc8 ffffffff8138ab2e 0000000000000000
[ 7976.741595] 0000000000000000 ffff88000002fd08 ffffffff8105cabc 0000007d1fd0ee18
[ 7976.741595] 0000000000000001 ffff88001fd16d40 ffff88001fd0ee00 ffff88001fd0ee00
[ 7976.741595] Call Trace:
[ 7976.741595] [<ffffffff8138ab2e>] dump_stack+0x67/0x99
[ 7976.741595] [<ffffffff8105cabc>] __warn+0xcc/0xf0
[ 7976.741595] [<ffffffff8105cb98>] warn_slowpath_null+0x18/0x20
[ 7976.741595] [<ffffffff8103cba9>] native_smp_send_reschedule+0x39/0x40
[ 7976.741595] [<ffffffff81089bc2>] wake_up_nohz_cpu+0x82/0x190
[ 7976.741595] [<ffffffff810d275a>] internal_add_timer+0x7a/0x80
[ 7976.741595] [<ffffffff810d3ee7>] mod_timer+0x187/0x2b0
[ 7976.741595] [<ffffffff810c89dd>] rcu_torture_reader+0x33d/0x380
[ 7976.741595] [<ffffffff810c66f0>] ? sched_torture_read_unlock+0x30/0x30
[ 7976.741595] [<ffffffff810c86a0>] ? rcu_bh_torture_read_lock+0x80/0x80
[ 7976.741595] [<ffffffff8108068f>] kthread+0xdf/0x100
[ 7976.741595] [<ffffffff819dd83f>] ret_from_fork+0x1f/0x40
[ 7976.741595] [<ffffffff810805b0>] ? kthread_create_on_node+0x200/0x200
However, in this case, the wakeup is redundant, because the timer
migration will reprogram timer hardware as needed. Note that the fact
that preemption is disabled does not avoid the splat, as the offline
operation has already passed both the synchronize_sched() and the
stop_machine() that would be blocked by disabled preemption.
This commit therefore modifies wake_up_nohz_cpu() to avoid attempting
to wake up offline CPUs. It also adds a comment stating that the
caller must tolerate lost wakeups when the target CPU is going offline,
and suggesting the CPU_DEAD notifier as a recovery mechanism.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Commit abedf8e241 ("rcu: Use simple wait queues where possible in
rcutree") converts Tree RCU's wait queues to simple wait queues,
but it incorrectly reverts the commit 2aa792e6fa ("rcu: Use
rcu_gp_kthread_wake() to wake up grace period kthreads"). This can
result in redundant self-wakeups.
This commit therefore replaces the simple wait-queue wakeups with
rcu_gp_kthread_wake(), thus avoiding the redundant wakeups.
Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit improves the accuracy of the interaction between CPU hotplug
operations and RCU's expedited grace periods by using RCU's online-CPU
state to determine when failed IPIs should be retried.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The expedited RCU grace periods currently rely on a failure indication
from smp_call_function_single() to determine that a given CPU is offline.
This works after a fashion, but is more contorted and less precise than
relying on RCU's internal state. This commit therefore takes a first
step towards relying on internal state.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The expedited RCU CPU stall warnings currently responds to neither the
panic_on_rcu_stall sysctl setting nor the rcupdate.rcu_cpu_stall_suppress
kernel boot parameter. This commit therefore updates the expedited code
to respond to these two controls.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that RCU expedited grace periods are always driven by a workqueue,
there is no need to account for signal reception, and thus no need
to disable expedited RCU CPU stall warnings due to signal reception.
This commit therefore removes the signal-reception checks, leaving a
WARN_ON() to catch possible future bugs.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The current implementation of expedited grace periods has the user
task drive the grace period. This works, but has downsides: (1) The
user task must awaken tasks piggybacking on this grace period, which
can result in latencies rivaling that of the grace period itself, and
(2) User tasks can receive signals, which interfere with RCU CPU stall
warnings.
This commit therefore uses workqueues to drive the grace periods, so
that the user task need not do the awakening. A subsequent commit
will remove the now-unnecessary code allowing for signals.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The functions synchronize_rcu_expedited() and synchronize_sched_expedited()
have nearly identical code. This commit therefore consolidates this code
into a new _synchronize_rcu_expedited() function.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>