A sequence of pairs [class_idx -> corresponding chain_key iteration]
is printed for both the current held_lock chain and the cached chain.
That exposes the two different class_idx sequences that led to that
particular hash value.
This helps with debugging hash chain collision reports.
Signed-off-by: Alfredo Alvarez Fernandez <alfredoalvarezfernandez@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-fsdevel@vger.kernel.org
Cc: sedat.dilek@gmail.com
Cc: tytso@mit.edu
Link: http://lkml.kernel.org/r/1459357416-19190-1-git-send-email-alfredoalvarezernandez@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A new task's util_avg is set to full utilization of a CPU (100% time
running). This accelerates a new task's utilization ramp-up, useful to
boost its execution in early time. However, it may result in
(insanely) high utilization for a transient time period when a flood
of tasks are spawned. Importantly, it violates the "fundamentally
bounded" CPU utilization, and its side effect is negative if we don't
take any measure to bound it.
This patch proposes an algorithm to address this issue. It has
two methods to approach a sensible initial util_avg:
(1) An expected (or average) util_avg based on its cfs_rq's util_avg:
util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight
(2) A trajectory of how successive new tasks' util develops, which
gives 1/2 of the left utilization budget to a new task such that
the additional util is noticeably large (when overall util is low) or
unnoticeably small (when overall util is high enough). In the meantime,
the aggregate utilization is well bounded:
util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n
where n denotes the nth task.
If util_avg is larger than util_avg_cap, then the effective util is
clamped to the util_avg_cap.
Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Link: http://lkml.kernel.org/r/1459283456-21682-1-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following commit:
ed82b8a1ff ("sched/core: Move the sched_to_prio[] arrays out of line")
renamed prio_to_weight to sched_prio_to_weight, but the old name was not
updated in comments.
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1459292871-22531-1-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While testing the tracer preemptoff, I hit this strange trace:
<...>-259 0...1 0us : schedule <-worker_thread
<...>-259 0d..1 0us : rcu_note_context_switch <-__schedule
<...>-259 0d..1 0us : rcu_sched_qs <-rcu_note_context_switch
<...>-259 0d..1 0us : rcu_preempt_qs <-rcu_note_context_switch
<...>-259 0d..1 0us : _raw_spin_lock <-__schedule
<...>-259 0d..1 0us : preempt_count_add <-_raw_spin_lock
<...>-259 0d..2 0us : do_raw_spin_lock <-_raw_spin_lock
<...>-259 0d..2 1us : deactivate_task <-__schedule
<...>-259 0d..2 1us : update_rq_clock.part.84 <-deactivate_task
<...>-259 0d..2 1us : dequeue_task_fair <-deactivate_task
<...>-259 0d..2 1us : dequeue_entity <-dequeue_task_fair
<...>-259 0d..2 1us : update_curr <-dequeue_entity
<...>-259 0d..2 1us : update_min_vruntime <-update_curr
<...>-259 0d..2 1us : cpuacct_charge <-update_curr
<...>-259 0d..2 1us : __rcu_read_lock <-cpuacct_charge
<...>-259 0d..2 1us : __rcu_read_unlock <-cpuacct_charge
<...>-259 0d..2 1us : clear_buddies <-dequeue_entity
<...>-259 0d..2 1us : account_entity_dequeue <-dequeue_entity
<...>-259 0d..2 2us : update_min_vruntime <-dequeue_entity
<...>-259 0d..2 2us : update_cfs_shares <-dequeue_entity
<...>-259 0d..2 2us : hrtick_update <-dequeue_task_fair
<...>-259 0d..2 2us : wq_worker_sleeping <-__schedule
<...>-259 0d..2 2us : kthread_data <-wq_worker_sleeping
<...>-259 0d..2 2us : pick_next_task_fair <-__schedule
<...>-259 0d..2 2us : check_cfs_rq_runtime <-pick_next_task_fair
<...>-259 0d..2 2us : pick_next_entity <-pick_next_task_fair
<...>-259 0d..2 2us : clear_buddies <-pick_next_entity
<...>-259 0d..2 2us : pick_next_entity <-pick_next_task_fair
<...>-259 0d..2 2us : clear_buddies <-pick_next_entity
<...>-259 0d..2 2us : set_next_entity <-pick_next_task_fair
<...>-259 0d..2 3us : put_prev_entity <-pick_next_task_fair
<...>-259 0d..2 3us : check_cfs_rq_runtime <-put_prev_entity
<...>-259 0d..2 3us : set_next_entity <-pick_next_task_fair
gnome-sh-1031 0d..2 3us : finish_task_switch <-__schedule
gnome-sh-1031 0d..2 3us : _raw_spin_unlock_irq <-finish_task_switch
gnome-sh-1031 0d..2 3us : do_raw_spin_unlock <-_raw_spin_unlock_irq
gnome-sh-1031 0...2 3us!: preempt_count_sub <-_raw_spin_unlock_irq
gnome-sh-1031 0...1 582us : do_raw_spin_lock <-_raw_spin_lock
gnome-sh-1031 0...1 583us : _raw_spin_unlock <-drm_gem_object_lookup
gnome-sh-1031 0...1 583us : do_raw_spin_unlock <-_raw_spin_unlock
gnome-sh-1031 0...1 583us : preempt_count_sub <-_raw_spin_unlock
gnome-sh-1031 0...1 584us : _raw_spin_unlock <-drm_gem_object_lookup
gnome-sh-1031 0...1 584us+: trace_preempt_on <-drm_gem_object_lookup
gnome-sh-1031 0...1 603us : <stack trace>
=> preempt_count_sub
=> _raw_spin_unlock
=> drm_gem_object_lookup
=> i915_gem_madvise_ioctl
=> drm_ioctl
=> do_vfs_ioctl
=> SyS_ioctl
=> entry_SYSCALL_64_fastpath
As I'm tracing preemption disabled, it seemed incorrect that the trace
would go across a schedule and report not being in the scheduler.
Looking into this I discovered the problem.
schedule() calls preempt_disable() but the preempt_schedule() calls
preempt_enable_notrace(). What happened above was that the gnome-shell
task was preempted on another CPU, migrated over to the idle cpu. The
tracer stared with idle calling schedule(), which called
preempt_disable(), but then gnome-shell finished, and it enabled
preemption with preempt_enable_notrace() that does stop the trace, even
though preemption was enabled.
The purpose of the preempt_disable_notrace() in the preempt_schedule()
is to prevent function tracing from going into an infinite loop.
Because function tracing can trace the preempt_enable/disable() calls
that are traced. The problem with function tracing is:
NEED_RESCHED set
preempt_schedule()
preempt_disable()
preempt_count_inc()
function trace (before incrementing preempt count)
preempt_disable_notrace()
preempt_enable_notrace()
sees NEED_RESCHED set
preempt_schedule() (repeat)
Now by breaking out the preempt off/on tracing into their own code:
preempt_disable_check() and preempt_enable_check(), we can add these to
the preempt_schedule() code. As preemption would then be disabled, even
if they were to be traced by the function tracer, the disabled
preemption would prevent the recursion.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160321112339.6dc78ad6@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In account_entity_enqueue(), we do not do account_numa_enqueue()
as NUMA balancing is not needed for UP kernels.
Hence, we should remove the account_numa_dequeue() call from
account_entity_dequeue() for UP kernels.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1454366879.21738.29.camel@schen9-desk2.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
To force a task migration during active balancing, nr_balance_failed is set
to cache_nice_tries + 1. However nr_balance_failed is not reset. As a side
effect, the next regular load balance under the same sd, a cache hot task
might be migrated, just because nr_balance_failed count is high.
Resetting nr_balance_failed after a successful active balance ensures
that a hot task is not unreasonably migrated. This can be verified by
looking at othe number of hot task migrations reported by /proc/schedstat.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1458735884-30105-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Sometimes, cpuacct.usage is not detailed enough to see how much CPU
usage a group had. We want to know how much time it used in user mode
and how much in kernel mode.
This patch introduces more files to give this information:
# ls /sys/fs/cgroup/cpuacct/cpuacct.usage*
/sys/fs/cgroup/cpuacct/cpuacct.usage
/sys/fs/cgroup/cpuacct/cpuacct.usage_percpu
/sys/fs/cgroup/cpuacct/cpuacct.usage_user
/sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_user
/sys/fs/cgroup/cpuacct/cpuacct.usage_sys
/sys/fs/cgroup/cpuacct/cpuacct.usage_percpu_sys
... while keeping the ABI with the existing counter.
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
[ Ported to newer kernels. ]
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/aa171da036b520b51c79549e9b3215d29473f19d.1458635566.git.zhaolei@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Current code show stats of online CPUs in cpuacct.statcpus,
show stats of present cpus in cpuacct.usage(_percpu), and using
present CPUs for setting cpuacct.usage.
It will cause inconsistent result when a CPU is online or offline
or hotpluged.
We should always use possible CPUs to avoid above problem.
Here are the contents of a cpuacct.usage_percpu sysfs file,
on a 4 CPU system with maxcpus=32:
Before the patch:
# cat cpuacct.usage_percpu
2456565 411435 1052897 832584
After the patch:
# cat cpuacct.usage_percpu
2456565 411435 1052897 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a11d56cef12d0b4807f8be3a46bf9798c3014d59.1458635566.git.zhaolei@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Convert perf_output_begin() to __perf_output_begin() and make the later
function able to write records from the end of the ring-buffer.
Following commits will utilize the 'backward' flag.
This is the core patch to support writing to the ring-buffer backwards,
which will be introduced by upcoming patches to support reading from
overwritable ring-buffers.
In theory, this patch should not introduce any extra performance
overhead since we use always_inline, but it does not hurt to double
check that assumption:
When CONFIG_OPTIMIZE_INLINING is disabled, the output object is nearly
identical to original one. See:
http://lkml.kernel.org/g/56F52E83.70409@huawei.com
When CONFIG_OPTIMIZE_INLINING is enabled, the resuling object file becomes
smaller:
$ size kernel/events/ring_buffer.o*
text data bss dec hex filename
4641 4 8 4653 122d kernel/events/ring_buffer.o.old
4545 4 8 4557 11cd kernel/events/ring_buffer.o.new
Performance testing results:
Calling 3000000 times of 'close(-1)', use gettimeofday() to check
duration. Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
system calls. In ns.
Testing environment:
CPU : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Kernel : v4.5.0
MEAN STDVAR
BASE 800214.950 2853.083
PRE 2253846.700 9997.014
POST 2257495.540 8516.293
Where 'BASE' is pure performance without capturing. 'PRE' is test
result of pure 'v4.5.0' kernel. 'POST' is test result after this
patch.
Considering the stdvar, this patch doesn't hurt performance, within
noise margin.
For testing details, see:
http://lkml.kernel.org/g/56F89DCD.1040202@huawei.com
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <pi3orama@163.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/r/1459147292-239310-4-git-send-email-wangnan0@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Set a default event->overflow_handler in perf_event_alloc() so don't
need to check event->overflow_handler in __perf_event_overflow().
Following commits can give a different default overflow_handler.
Initial idea comes from Peter:
http://lkml.kernel.org/r/20130708121557.GA17211@twins.programming.kicks-ass.net
Since the default value of event->overflow_handler is not NULL, existing
'if (!overflow_handler)' checks need to be changed.
is_default_overflow_handler() is introduced for this.
No extra performance overhead is introduced into the hot path because in the
original code we still need to read this handler from memory. A conditional
branch is avoided so actually we remove some instructions.
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <pi3orama@163.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/r/1459147292-239310-3-git-send-email-wangnan0@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add new ioctl() to pause/resume ring-buffer output.
In some situations we want to read from the ring-buffer only when we
ensure nothing can write to the ring-buffer during reading. Without
this patch we have to turn off all events attached to this ring-buffer
to achieve this.
This patch is a prerequisite to enable overwrite support for the
perf ring-buffer support. Following commits will introduce new methods
support reading from overwrite ring buffer. Before reading, caller
must ensure the ring buffer is frozen, or the reading is unreliable.
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <pi3orama@163.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/r/1459147292-239310-2-git-send-email-wangnan0@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently we check sample type for ftrace:function events
even if it's not created as a sampling event. That prevents
creating ftrace_function event in counting mode.
Make sure we check sample types only for sampling events.
Before:
$ sudo perf stat -e ftrace:function ls
...
Performance counter stats for 'ls':
<not supported> ftrace:function
0.001983662 seconds time elapsed
After:
$ sudo perf stat -e ftrace:function ls
...
Performance counter stats for 'ls':
44,498 ftrace:function
0.037534722 seconds time elapsed
Suggested-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1458138873-1553-2-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In order to ensure safe AUX buffer management, we rely on the assumption
that pmu::stop() stops its ongoing AUX transaction and not just the hw.
This patch documents this requirement for the perf_aux_output_{begin,end}()
APIs.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1457098969-21595-4-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that we can ensure that when ring buffer's AUX area is on the way
to getting unmapped new transactions won't start, we only need to stop
all events that can potentially be writing aux data to our ring buffer.
Having done that, we can safely free the AUX pages and corresponding
PMU data, as this time it is guaranteed to be the last aux reference
holder.
This partially reverts:
57ffc5ca67 ("perf: Fix AUX buffer refcounting")
... which was made to defer deallocation that was otherwise possible
from an NMI context. Now it is no longer the case; the last call to
rb_free_aux() that drops the last AUX reference has to happen in
perf_mmap_close() on that AUX area.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/87d1qtz23d.fsf@ashishki-desk.ger.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When ring buffer's AUX area is unmapped and rb->aux_mmap_count drops to
zero, new AUX transactions into this buffer can still be started,
even though the buffer in en route to deallocation.
This patch adds a check to perf_aux_output_begin() for rb->aux_mmap_count
being zero, in which case there is no point starting new transactions,
in other words, the ring buffers that pass a certain point in
perf_mmap_close will not have their events sending new data, which
clears path for freeing those buffers' pages right there and then,
provided that no active transactions are holding the AUX reference.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1457098969-21595-2-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There should (and can) only be a single PMU for perf_hw_context
events.
This is because of how we schedule events: once a hardware event fails to
schedule (the PMU is 'full') we stop trying to add more. The trivial
'fix' would break the Round-Robin scheduling we do.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Export irq_domain_free_irqs_common so it can be used by modules.
Signed-off-by: Axel Lin <axel.lin@ingics.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
In the error path, event_file not being NULL is used to determine
whether the event itself still needs to be free'd, so fix it up to
avoid leaking.
Reported-by: Leon Yu <chianglungyu@gmail.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 130056275a ("perf: Do not double free")
Link: http://lkml.kernel.org/r/87twk06yxp.fsf@ashishki-desk.ger.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Stephane reported that commit:
3cbaa59069 ("perf: Fix ctx time tracking by introducing EVENT_TIME")
introduced a regression wrt. time tracking, as easily observed by:
> This patch introduce a bug in the time tracking of events when
> multiplexing is used.
>
> The issue is easily reproducible with the following perf run:
>
> $ perf stat -a -C 0 -e branches,branches,branches,branches,branches,branches -I 1000
> 1.000730239 652,394 branches (66.41%)
> 1.000730239 597,809 branches (66.41%)
> 1.000730239 593,870 branches (66.63%)
> 1.000730239 651,440 branches (67.03%)
> 1.000730239 656,725 branches (66.96%)
> 1.000730239 <not counted> branches
>
> One branches event is shown as not having run. Yet, with
> multiplexing, all events should run especially with a 1s (-I 1000)
> interval. The delta for time_running comes out to 0. Yet, the event
> has run because the kernel is actually multiplexing the events. The
> problem is that the time tracking is the kernel and especially in
> ctx_sched_out() is wrong now.
>
> The problem is that in case that the kernel enters ctx_sched_out() with the
> following state:
> ctx->is_active=0x7 event_type=0x1
> Call Trace:
> [<ffffffff813ddd41>] dump_stack+0x63/0x82
> [<ffffffff81182bdc>] ctx_sched_out+0x2bc/0x2d0
> [<ffffffff81183896>] perf_mux_hrtimer_handler+0xf6/0x2c0
> [<ffffffff811837a0>] ? __perf_install_in_context+0x130/0x130
> [<ffffffff810f5818>] __hrtimer_run_queues+0xf8/0x2f0
> [<ffffffff810f6097>] hrtimer_interrupt+0xb7/0x1d0
> [<ffffffff810509a8>] local_apic_timer_interrupt+0x38/0x60
> [<ffffffff8175ca9d>] smp_apic_timer_interrupt+0x3d/0x50
> [<ffffffff8175ac7c>] apic_timer_interrupt+0x8c/0xa0
>
> In that case, the test:
> if (is_active & EVENT_TIME)
>
> will be false and the time will not be updated. Time must always be updated on
> sched out.
Fix this by always updating time if EVENT_TIME was set, as opposed to
only updating time when EVENT_TIME changed.
Reported-by: Stephane Eranian <eranian@google.com>
Tested-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: kan.liang@intel.com
Cc: namhyung@kernel.org
Fixes: 3cbaa59069 ("perf: Fix ctx time tracking by introducing EVENT_TIME")
Link: http://lkml.kernel.org/r/20160329072644.GB3408@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch functionally reverts:
5fd7a09cfb ("atomic: Export fetch_or()")
During the merge Linus observed that the generic version of fetch_or()
was messy:
" This makes the ugly "fetch_or()" macro that the scheduler used
internally a new generic helper, and does a bad job at it. "
e23604edac Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Now that we have introduced atomic_fetch_or(), fetch_or() is only used
by the scheduler in order to deal with thread_info flags which type
can vary across architectures.
Lets confine fetch_or() back to the scheduler so that we encourage
future users to use the more robust and well typed atomic_t version
instead.
While at it, fetch_or() gets robustified, pasting improvements from a
previous patch by Ingo Molnar that avoids needless expression
re-evaluations in the loop.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1458830281-4255-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The tick dependency mask was intially unsigned long because this is the
type on which clear_bit() operates on and fetch_or() accepts it.
But now that we have atomic_fetch_or(), we can instead use
atomic_andnot() to clear the bit. This consolidates the type of our
tick dependency mask, reduce its size on structures and benefit from
possible architecture optimizations on atomic_t operations.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1458830281-4255-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
KASAN needs to know whether the allocation happens in an IRQ handler.
This lets us strip everything below the IRQ entry point to reduce the
number of unique stack traces needed to be stored.
Move the definition of __irq_entry to <linux/interrupt.h> so that the
users don't need to pull in <linux/ftrace.h>. Also introduce the
__softirq_entry macro which is similar to __irq_entry, but puts the
corresponding functions to the .softirqentry.text section.
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When oom_reaper manages to unmap all the eligible vmas there shouldn't
be much of the freable memory held by the oom victim left anymore so it
makes sense to clear the TIF_MEMDIE flag for the victim and allow the
OOM killer to select another task.
The lack of TIF_MEMDIE also means that the victim cannot access memory
reserves anymore but that shouldn't be a problem because it would get
the access again if it needs to allocate and hits the OOM killer again
due to the fatal_signal_pending resp. PF_EXITING check. We can safely
hide the task from the OOM killer because it is clearly not a good
candidate anymore as everyhing reclaimable has been torn down already.
This patch will allow to cap the time an OOM victim can keep TIF_MEMDIE
and thus hold off further global OOM killer actions granted the oom
reaper is able to take mmap_sem for the associated mm struct. This is
not guaranteed now but further steps should make sure that mmap_sem for
write should be blocked killable which will help to reduce such a lock
contention. This is not done by this patch.
Note that exit_oom_victim might be called on a remote task from
__oom_reap_task now so we have to check and clear the flag atomically
otherwise we might race and underflow oom_victims or wake up waiters too
early.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This will be needed in the patch "mm, oom: introduce oom reaper".
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add map_flags attribute to bpf_map_show_fdinfo(), so that tools like
tc can check for them when loading objects from a pinned entry, e.g.
if user intent wrt allocation (BPF_F_NO_PREALLOC) is different to the
pinned object, it can bail out. Follow-up to 6c90598174 ("bpf:
pre-allocate hash map elements"), so that tc can still support this
with v4.6.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Fix for an intel_pstate driver issue related to the handling of
MSR updates uncovered by the recent cpufreq rework (Rafael Wysocki).
- cpufreq core cleanups related to starting governors and frequency
synchronization during resume from system suspend and a locking
fix for cpufreq_quick_get() (Rafael Wysocki, Richard Cochran).
- acpi-cpufreq and powernv cpufreq driver updates (Jisheng Zhang,
Michael Neuling, Richard Cochran, Shilpasri Bhat).
- intel_idle driver update preventing some Skylake-H systems
from hanging during initialization by disabling deep C-states
mishandled by the platform in the problematic configurations (Len
Brown).
- Intel Xeon Phi Processor x200 support for intel_idle (Dasaratharaman
Chandramouli).
- cpuidle menu governor updates to make it always honor PM QoS
latency constraints (and prevent C1 from being used as the
fallback C-state on x86 when they are set below its exit latency)
and to restore the previous behavior to fall back to C1 if the next
timer event is set far enough in the future that was changed in 4.4
which led to an energy consumption regression (Rik van Riel, Rafael
Wysocki).
- New device ID for a future AMD UART controller in the ACPI driver
for AMD SoCs (Wang Hongcheng).
- Rockchip rk3399 support for the rockchip-io-domain adaptive voltage
scaling (AVS) driver (David Wu).
- ACPI PCI resources management fix for the handling of IO space
resources on architectures where the IO space is memory mapped
(IA64 and ARM64) broken by the introduction of common ACPI
resources parsing for PCI host bridges in 4.4 (Lorenzo Pieralisi).
- Fix for the ACPI backend of the generic device properties API
to make it parse non-device (data node only) children of an
ACPI device correctly (Irina Tirdea).
- Fixes for the handling of global suspend flags (introduced in 4.4)
during hibernation and resume from it (Lukas Wunner).
- Support for obtaining configuration information from Device Trees
in the PM clocks framework (Jon Hunter).
- ACPI _DSM helper code and devfreq framework cleanups (Colin Ian
King, Geert Uytterhoeven).
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJW9JaRAAoJEILEb/54YlRx/GAQAJujANWilWHZYm24a9JDcIE9
rsNZIC/FdeBVilPtRTZQnig/Pj32Z4Jm7IZ/DLOq0Deu1YK/9uv3y59M3BcX6WyL
H5VR80L8geUJZ7RRk0WfM5D4X82ovzwpE/kWt2Z7HDuvJSCBmFBZOvNrXbaRncKD
jIvat/p6uCuxt5c08+ebnBLQ6tOs8wLTWiCx3fO128GIrGRGN2xFV6hzRWVGnJ4g
WXGAR+AdLxRMZz4PPmqdTfRj4TNSR071GjKyaeKfZUjQGAsf5O9A77JFjeNVomDx
g1K37Byid2bTByzVavlEXPJZ7eKb5dAhlo7IJ9HAcOAXChLqH2Czjrpd+1XjR9MF
SV/78rCnF8eet83QYLbGV/Mzf7gbJP2Xp6wiaM22VAPpGe+sYfphJoQka9XRTfId
OgAjyYMYdWAKo5DhxVNI8WyN0W5dsoBFPxnaUFhHSGDCIJH7Ksy20m6y3plG2Bxf
ahoiQhmd9ohjtB5JbRnf4MY0hjekp8Srdf+DoNKsk/+JscIyROpYY3msQ3smUKo+
f628MC/wAosMpSV+l+KOYkbjCbtB49IabWtZ//NVD9hYB3E1f6aTN59yFbWB+1rp
L7Y8iaxzSkyJy/yYVuBal3rSk356+BvvoXBlLXmBsyu1TMlcDjALIYztSiTVT5MB
RZBhgNwdkxNCYJfU3ex+
=hUVj
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-4.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management and ACPI updates from Rafael Wysocki:
"The second batch of power management and ACPI updates for v4.6.
Included are fixups on top of the previous PM/ACPI pull request and
other material that didn't make into it but still should go into 4.6.
Among other things, there's a fix for an intel_pstate driver issue
uncovered by recent cpufreq changes, a workaround for a boot hang on
Skylake-H related to the handling of deep C-states by the platform and
a PCI/ACPI fix for the handling of IO port resources on non-x86
architectures plus some new device IDs and similar.
Specifics:
- Fix for an intel_pstate driver issue related to the handling of MSR
updates uncovered by the recent cpufreq rework (Rafael Wysocki).
- cpufreq core cleanups related to starting governors and frequency
synchronization during resume from system suspend and a locking fix
for cpufreq_quick_get() (Rafael Wysocki, Richard Cochran).
- acpi-cpufreq and powernv cpufreq driver updates (Jisheng Zhang,
Michael Neuling, Richard Cochran, Shilpasri Bhat).
- intel_idle driver update preventing some Skylake-H systems from
hanging during initialization by disabling deep C-states mishandled
by the platform in the problematic configurations (Len Brown).
- Intel Xeon Phi Processor x200 support for intel_idle
(Dasaratharaman Chandramouli).
- cpuidle menu governor updates to make it always honor PM QoS
latency constraints (and prevent C1 from being used as the fallback
C-state on x86 when they are set below its exit latency) and to
restore the previous behavior to fall back to C1 if the next timer
event is set far enough in the future that was changed in 4.4 which
led to an energy consumption regression (Rik van Riel, Rafael
Wysocki).
- New device ID for a future AMD UART controller in the ACPI driver
for AMD SoCs (Wang Hongcheng).
- Rockchip rk3399 support for the rockchip-io-domain adaptive voltage
scaling (AVS) driver (David Wu).
- ACPI PCI resources management fix for the handling of IO space
resources on architectures where the IO space is memory mapped
(IA64 and ARM64) broken by the introduction of common ACPI
resources parsing for PCI host bridges in 4.4 (Lorenzo Pieralisi).
- Fix for the ACPI backend of the generic device properties API to
make it parse non-device (data node only) children of an ACPI
device correctly (Irina Tirdea).
- Fixes for the handling of global suspend flags (introduced in 4.4)
during hibernation and resume from it (Lukas Wunner).
- Support for obtaining configuration information from Device Trees
in the PM clocks framework (Jon Hunter).
- ACPI _DSM helper code and devfreq framework cleanups (Colin Ian
King, Geert Uytterhoeven)"
* tag 'pm+acpi-4.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
PM / AVS: rockchip-io: add io selectors and supplies for rk3399
intel_idle: Support for Intel Xeon Phi Processor x200 Product Family
intel_idle: prevent SKL-H boot failure when C8+C9+C10 enabled
ACPI / PM: Runtime resume devices when waking from hibernate
PM / sleep: Clear pm_suspend_global_flags upon hibernate
cpufreq: governor: Always schedule work on the CPU running update
cpufreq: Always update current frequency before startig governor
cpufreq: Introduce cpufreq_update_current_freq()
cpufreq: Introduce cpufreq_start_governor()
cpufreq: powernv: Add sysfs attributes to show throttle stats
cpufreq: acpi-cpufreq: make Intel/AMD MSR access, io port access static
PCI: ACPI: IA64: fix IO port generic range check
ACPI / util: cast data to u64 before shifting to fix sign extension
cpufreq: powernv: Define per_cpu chip pointer to optimize hot-path
cpuidle: menu: Fall back to polling if next timer event is near
cpufreq: acpi-cpufreq: Clean up hot plug notifier callback
intel_pstate: Do not call wrmsrl_on_cpu() with disabled interrupts
cpufreq: Make cpufreq_quick_get() safe to call
ACPI / property: fix data node parsing in acpi_get_next_subnode()
ACPI / APD: Add device HID for future AMD UART controller
...
Some visible changes:
A new flag was added to distinguish traces done in NMI context.
Preempt tracer now shows functions where preemption is disabled but
interrupts are still enabled.
Other notes:
Updates were done to function tracing to allow better performance
with perf.
Infrastructure code has been added to allow for a new histogram
feature for recording live trace event histograms that can be
configured by simple user commands. The feature itself was just
finished, but needs a round in linux-next before being pulled.
This only includes some infrastructure changes that will be needed.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJW8/WPAAoJEKKk/i67LK/8wrAH/j2gU9ZfjVxTu8068TBGWRJP
yvvzq0cK5evB3dsVuUmKKRfU52nSv4J1WcFF569X0RulSLylR0dHlcxFJMn4kkgR
bm0AHRrqOf87ub3VimcpG146iVQij37l5A0SRoFbvSPLQx1KUW18v99x41Ji8dv6
oWXRc6/YhdzEE7l0nUsVjmScQ4b2emsems3cxZzXOY+nRJsiim6i+VaDeatdyey1
csLVqtRCs+x62TVtxG3+GhcLdRoPRbnHAGzrKDFIn1SrQaRXCc54wN5d2hWxjgNI
1laOwaj070lnJiWfBLIP/K+lx+VKRx5/O0rKZX35foLUTqJJKSyjAbKXuMCcSAM=
=2h2K
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"Nothing major this round. Mostly small clean ups and fixes.
Some visible changes:
- A new flag was added to distinguish traces done in NMI context.
- Preempt tracer now shows functions where preemption is disabled but
interrupts are still enabled.
Other notes:
- Updates were done to function tracing to allow better performance
with perf.
- Infrastructure code has been added to allow for a new histogram
feature for recording live trace event histograms that can be
configured by simple user commands. The feature itself was just
finished, but needs a round in linux-next before being pulled.
This only includes some infrastructure changes that will be needed"
* tag 'trace-v4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (22 commits)
tracing: Record and show NMI state
tracing: Fix trace_printk() to print when not using bprintk()
tracing: Remove redundant reset per-CPU buff in irqsoff tracer
x86: ftrace: Fix the misleading comment for arch/x86/kernel/ftrace.c
tracing: Fix crash from reading trace_pipe with sendfile
tracing: Have preempt(irqs)off trace preempt disabled functions
tracing: Fix return while holding a lock in register_tracer()
ftrace: Use kasprintf() in ftrace_profile_tracefs()
ftrace: Update dynamic ftrace calls only if necessary
ftrace: Make ftrace_hash_rec_enable return update bool
tracing: Fix typoes in code comment and printk in trace_nop.c
tracing, writeback: Replace cgroup path to cgroup ino
tracing: Use flags instead of bool in trigger structure
tracing: Add an unreg_all() callback to trigger commands
tracing: Add needs_rec flag to event triggers
tracing: Add a per-event-trigger 'paused' field
tracing: Add get_syscall_name()
tracing: Add event record param to trigger_ops.func()
tracing: Make event trigger functions available
tracing: Make ftrace_event_field checking functions available
...
Pull perf fixes from Ingo Molnar:
"This tree contains various perf fixes on the kernel side, plus three
hw/event-enablement late additions:
- Intel Memory Bandwidth Monitoring events and handling
- the AMD Accumulated Power Mechanism reporting facility
- more IOMMU events
... and a final round of perf tooling updates/fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
perf llvm: Use strerror_r instead of the thread unsafe strerror one
perf llvm: Use realpath to canonicalize paths
perf tools: Unexport some methods unused outside strbuf.c
perf probe: No need to use formatting strbuf method
perf help: Use asprintf instead of adhoc equivalents
perf tools: Remove unused perf_pathdup, xstrdup functions
perf tools: Do not include stringify.h from the kernel sources
tools include: Copy linux/stringify.h from the kernel
tools lib traceevent: Remove redundant CPU output
perf tools: Remove needless 'extern' from function prototypes
perf tools: Simplify die() mechanism
perf tools: Remove unused DIE_IF macro
perf script: Remove lots of unused arguments
perf thread: Rename perf_event__preprocess_sample_addr to thread__resolve
perf machine: Rename perf_event__preprocess_sample to machine__resolve
perf tools: Add cpumode to struct perf_sample
perf tests: Forward the perf_sample in the dwarf unwind test
perf tools: Remove misplaced __maybe_unused
perf list: Fix documentation of :ppp
perf bench numa: Fix assertion for nodes bitfield
...
Pull scheduler fixes from Ingo Molnar:
"Misc fixes: a cgroup fix, a fair-scheduler migration accounting fix, a
cputime fix and two cpuacct cleanups"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/cpuacct: Simplify the cpuacct code
sched/cpuacct: Rename parameter in cpuusage_write() for readability
sched/fair: Add comments to explain select_idle_sibling()
sched/fair: Fix fairness issue on migration
sched/cgroup: Fix/cleanup cgroup teardown/init
sched/cputime: Fix steal time accounting vs. CPU hotplug
When suspending to RAM, waking up and later suspending to disk,
we gratuitously runtime resume devices after the thaw phase.
This does not occur if we always suspend to RAM or always to disk.
pm_complete_with_resume_check(), which gets called from
pci_pm_complete() among others, schedules a runtime resume
if PM_SUSPEND_FLAG_FW_RESUME is set. The flag is set during
a suspend-to-RAM cycle. It is cleared at the beginning of
the suspend-to-RAM cycle but not afterwards and it is not
cleared during a suspend-to-disk cycle at all. Fix it.
Fixes: ef25ba0476 (PM / sleep: Add flags to indicate platform firmware involvement)
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: 4.4+ <stable@vger.kernel.org> # 4.4+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Use the more common logging method with the eventual goal of removing
pr_warning altogether.
Miscellanea:
- Realign arguments
- Coalesce formats
- Add missing space between a few coalesced formats
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [kernel/power/suspend.c]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a flag to memremap() for writecombine mappings. Mappings satisfied
by this flag will not be cached, however writes may be delayed or
combined into more efficient bursts. This is most suitable for buffers
written sequentially by the CPU for use by other DMA devices.
Signed-off-by: Brian Starkey <brian.starkey@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These patches implement a MEMREMAP_WC flag for memremap(), which can be
used to obtain writecombine mappings. This is then used for setting up
dma_coherent_mem regions which use the DMA_MEMORY_MAP flag.
The motivation is to fix an alignment fault on arm64, and the suggestion
to implement MEMREMAP_WC for this case was made at [1]. That particular
issue is handled in patch 4, which makes sure that the appropriate
memset function is used when zeroing allocations mapped as IO memory.
This patch (of 4):
Don't modify the flags input argument to memremap(). MEMREMAP_WB is
already a special case so we can check for it directly instead of
clearing flag bits in each mapper.
Signed-off-by: Brian Starkey <brian.starkey@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The value of __ARCH_SI_PREAMBLE_SIZE defines the size (including
padding) of the part of the struct siginfo that is before the union, and
it is then used to calculate the needed padding (SI_PAD_SIZE) to make
the size of struct siginfo equal to 128 (SI_MAX_SIZE) bytes.
Depending on the target architecture and word width it equals to either
3 or 4 times sizeof int.
Since the very beginning we had __ARCH_SI_PREAMBLE_SIZE wrong on the
parisc architecture for the 64bit kernel build. It's even more
frustrating, because it can easily be checked at compile time if the
value was defined correctly.
This patch adds such a check for the correctness of
__ARCH_SI_PREAMBLE_SIZE in the hope that it will prevent existing and
future architectures from running into the same problem.
I refrained from replacing __ARCH_SI_PREAMBLE_SIZE by offsetof() in
copy_siginfo() in include/asm-generic/siginfo.h, because a) it doesn't
make any difference and b) it's used in the Documentation/kmemcheck.txt
example.
I ran this patch through the 0-DAY kernel test infrastructure and only
the parisc architecture triggered as expected. That means that this
patch should be OK for all major architectures.
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kcov provides code coverage collection for coverage-guided fuzzing
(randomized testing). Coverage-guided fuzzing is a testing technique
that uses coverage feedback to determine new interesting inputs to a
system. A notable user-space example is AFL
(http://lcamtuf.coredump.cx/afl/). However, this technique is not
widely used for kernel testing due to missing compiler and kernel
support.
kcov does not aim to collect as much coverage as possible. It aims to
collect more or less stable coverage that is function of syscall inputs.
To achieve this goal it does not collect coverage in soft/hard
interrupts and instrumentation of some inherently non-deterministic or
non-interesting parts of kernel is disbled (e.g. scheduler, locking).
Currently there is a single coverage collection mode (tracing), but the
API anticipates additional collection modes. Initially I also
implemented a second mode which exposes coverage in a fixed-size hash
table of counters (what Quentin used in his original patch). I've
dropped the second mode for simplicity.
This patch adds the necessary support on kernel side. The complimentary
compiler support was added in gcc revision 231296.
We've used this support to build syzkaller system call fuzzer, which has
found 90 kernel bugs in just 2 months:
https://github.com/google/syzkaller/wiki/Found-Bugs
We've also found 30+ bugs in our internal systems with syzkaller.
Another (yet unexplored) direction where kcov coverage would greatly
help is more traditional "blob mutation". For example, mounting a
random blob as a filesystem, or receiving a random blob over wire.
Why not gcov. Typical fuzzing loop looks as follows: (1) reset
coverage, (2) execute a bit of code, (3) collect coverage, repeat. A
typical coverage can be just a dozen of basic blocks (e.g. an invalid
input). In such context gcov becomes prohibitively expensive as
reset/collect coverage steps depend on total number of basic
blocks/edges in program (in case of kernel it is about 2M). Cost of
kcov depends only on number of executed basic blocks/edges. On top of
that, kernel requires per-thread coverage because there are always
background threads and unrelated processes that also produce coverage.
With inlined gcov instrumentation per-thread coverage is not possible.
kcov exposes kernel PCs and control flow to user-space which is
insecure. But debugfs should not be mapped as user accessible.
Based on a patch by Quentin Casasnovas.
[akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
[akpm@linux-foundation.org: unbreak allmodconfig]
[akpm@linux-foundation.org: follow x86 Makefile layout standards]
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Tavis Ormandy <taviso@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@google.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: David Drysdale <drysdale@google.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A couple of functions and variables in the profile implementation are
used only on SMP systems by the procfs code, but are unused if either
procfs is disabled or in uniprocessor kernels. gcc prints a harmless
warning about the unused symbols:
kernel/profile.c:243:13: error: 'profile_flip_buffers' defined but not used [-Werror=unused-function]
static void profile_flip_buffers(void)
^
kernel/profile.c:266:13: error: 'profile_discard_flip_buffers' defined but not used [-Werror=unused-function]
static void profile_discard_flip_buffers(void)
^
kernel/profile.c:330:12: error: 'profile_cpu_callback' defined but not used [-Werror=unused-function]
static int profile_cpu_callback(struct notifier_block *info,
^
This adds further #ifdef to the file, to annotate exactly in which cases
they are used. I have done several thousand ARM randconfig kernels with
this patch applied and no longer get any warnings in this file.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 1717f2096b ("panic, x86: Fix re-entrance problem due to panic
on NMI") and commit 58c5661f21 ("panic, x86: Allow CPUs to save
registers even if looping in NMI context") introduced nmi_panic() which
prevents concurrent/recursive execution of panic(). It also saves
registers for the crash dump on x86.
However, there are some cases where NMI handlers still use panic().
This patch set partially replaces them with nmi_panic() in those cases.
Even this patchset is applied, some NMI or similar handlers (e.g. MCE
handler) continue to use panic(). This is because I can't test them
well and actual problems won't happen. For example, the possibility
that normal panic and panic on MCE happen simultaneously is very low.
This patch (of 3):
Convert nmi_panic() to a proper function and export it instead of
exporting internal implementation details to modules, for obvious
reasons.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Cc: Javi Merino <javi.merino@arm.com>
Cc: Gobinda Charan Maji <gobinda.cemk07@gmail.com>
Cc: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This commit fixes the following security hole affecting systems where
all of the following conditions are fulfilled:
- The fs.suid_dumpable sysctl is set to 2.
- The kernel.core_pattern sysctl's value starts with "/". (Systems
where kernel.core_pattern starts with "|/" are not affected.)
- Unprivileged user namespace creation is permitted. (This is
true on Linux >=3.8, but some distributions disallow it by
default using a distro patch.)
Under these conditions, if a program executes under secure exec rules,
causing it to run with the SUID_DUMP_ROOT flag, then unshares its user
namespace, changes its root directory and crashes, the coredump will be
written using fsuid=0 and a path derived from kernel.core_pattern - but
this path is interpreted relative to the root directory of the process,
allowing the attacker to control where a coredump will be written with
root privileges.
To fix the security issue, always interpret core_pattern for dumps that
are written under SUID_DUMP_ROOT relative to the root directory of init.
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This test-case (simplified version of generated by syzkaller)
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
void test(void)
{
for (;;) {
if (fork()) {
wait(NULL);
continue;
}
ptrace(PTRACE_SEIZE, getppid(), 0, 0);
ptrace(PTRACE_INTERRUPT, getppid(), 0, 0);
_exit(0);
}
}
int main(void)
{
int np;
for (np = 0; np < 8; ++np)
if (!fork())
test();
while (wait(NULL) > 0)
;
return 0;
}
triggers the 2nd WARN_ON_ONCE(!signr) warning in do_jobctl_trap(). The
problem is that __ptrace_unlink() clears task->jobctl under siglock but
task->ptrace is cleared without this lock held; this fools the "else"
branch which assumes that !PT_SEIZED means PT_PTRACED.
Note also that most of other PTRACE_SEIZE checks can race with detach
from the exiting tracer too. Say, the callers of ptrace_trap_notify()
assume that SEIZED can't go away after it was checked.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Except on SPARC, this is what the code always did. SPARC compat seccomp
was buggy, although the impact of the bug was limited because SPARC
32-bit and 64-bit syscall numbers are the same.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Users of the 32-bit ptrace() ABI expect the full 32-bit ABI. siginfo
translation should check ptrace() ABI, not caller task ABI.
This is an ABI change on SPARC. Let's hope that no one relied on the
old buggy ABI.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Seccomp wants to know the syscall bitness, not the caller task bitness,
when it selects the syscall whitelist.
As far as I know, this makes no difference on any architecture, so it's
not a security problem. (It generates identical code everywhere except
sparc, and, on sparc, the syscall numbering is the same for both ABIs.)
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When new timeout is written to /proc/sys/kernel/hung_task_timeout_secs,
khungtaskd is interrupted and again sleeps for full timeout duration.
This means that hang task will not be checked if new timeout is written
periodically within old timeout duration and/or checking of hang task
will be delayed for up to previous timeout duration. Fix this by
remembering last time khungtaskd checked hang task.
This change will allow other watchdog tasks (if any) to share khungtaskd
by sleeping for minimal timeout diff of all watchdog tasks. Doing more
watchdog tasks from khungtaskd will reduce the possibility of printk()
collisions by multiple watchdog threads.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The latency tracer format has a nice column to indicate IRQ state, but
this is not able to tell us about NMI state.
When tracing perf interrupt handlers (which often run in NMI context)
it is very useful to see how the events nest.
Link: http://lkml.kernel.org/r/20160318153022.105068893@infradead.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace_printk() code will allocate extra buffers if the compile detects
that a trace_printk() is used. To do this, the format of the trace_printk()
is saved to the __trace_printk_fmt section, and if that section is bigger
than zero, the buffers are allocated (along with a message that this has
happened).
If trace_printk() uses a format that is not a constant, and thus something
not guaranteed to be around when the print happens, the compiler optimizes
the fmt out, as it is not used, and the __trace_printk_fmt section is not
filled. This means the kernel will not allocate the special buffers needed
for the trace_printk() and the trace_printk() will not write anything to the
tracing buffer.
Adding a "__used" to the variable in the __trace_printk_fmt section will
keep it around, even though it is set to NULL. This will keep the string
from being printed in the debugfs/tracing/printk_formats section as it is
not needed.
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Fixes: 07d777fe8c "tracing: Add percpu buffers for trace_printk()"
Cc: stable@vger.kernel.org # v3.5+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull cgroup namespace support from Tejun Heo:
"These are changes to implement namespace support for cgroup which has
been pending for quite some time now. It is very straight-forward and
only affects what part of cgroup hierarchies are visible.
After unsharing, mounting a cgroup fs will be scoped to the cgroups
the task belonged to at the time of unsharing and the cgroup paths
exposed to userland would be adjusted accordingly"
* 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: fix and restructure error handling in copy_cgroup_ns()
cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
Add FS_USERNS_FLAG to cgroup fs
cgroup: Add documentation for cgroup namespaces
cgroup: mount cgroupns-root when inside non-init cgroupns
kernfs: define kernfs_node_dentry
cgroup: cgroup namespace setns support
cgroup: introduce cgroup namespaces
sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
kernfs: Add API to generate relative kernfs path
- Use for() instead of while() loop in some functions
to make the code simpler.
- Use this_cpu_ptr() instead of per_cpu_ptr() to make the code
cleaner and a bit faster.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/d8a7ef9592f55224630cb26dea239f05b6398a4e.1458187654.git.zhaolei@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The name of the 'reset' parameter to cpuusage_write() is quite confusing,
because the only valid value we allow is '0', so !reset is actually the
case that resets ...
Rename it to 'val' and explain it in a comment that we only allow 0.
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: cgroups@vger.kernel.org
Cc: tj@kernel.org
Link: http://lkml.kernel.org/r/1450696483-2864-1-git-send-email-yangds.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>