This adds a REQ_OP_FLUSH operation that is sent to request_fn
based drivers by the block layer's flush code, instead of
sending requests with the request->cmd_flags REQ_FLUSH bit set.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Have blktrace use the req/bio op accessor to get the REQ_OP.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have the pm code
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Fixed up fs/ext4/crypto.c
Signed-off-by: Jens Axboe <axboe@fb.com>
When checking the current cred for a capability in a specific user
namespace, it isn't always desirable to have the LSMs audit the check.
This patch adds a noaudit variant of ns_capable() for when those
situations arise.
The common logic between ns_capable() and the new ns_capable_noaudit()
is moved into a single, shared function to keep duplicated code to a
minimum and ease maintainability.
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Pull irq fixes from Thomas Gleixner:
- a few simple fixes for fallout from the recent gic-v3 changes
- a workaround for a Cavium thunderX erratum
- a bugfix for the pic32 irqchip to make external interrupts work proper
- a missing return value in the generic IPI management code
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/irq-pic32-evic: Fix bug with external interrupts.
irqchip/gicv3-its: numa: Enable workaround for Cavium thunderx erratum 23144
irqchip/gic-v3: Fix quiescence check in gic_enable_redist
irqchip/gic-v3: Fix copy+paste mistakes in defines
irqchip/gic-v3: Fix ICC_SGI1R_EL1.INTID decoding mask
genirq: Fix missing return value in irq_destroy_ipi()
- A number of embarassing buglets (GICv3, PIC32)
- A more substential errata workaround for Cavium's GICv3 ITS
(kept for post-rc1 due to its dependency on NUMA)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXUXcjAAoJECPQ0LrRPXpDGFcQAICkj5cIsQghW2dgR0eo2d7S
+ieyxr55tz3A0c1Cisw09ESHz6wCrQ+PmvXkKISIG1l2AHv9UOUZVrsB1cl2CvBg
C9ClvUyMafiZ+FlhxMO1QM1Vfa3EUV2EPIx3mh5klUp8ph/cT+aArJe+WmJApS25
nlYiobi2AE0+m2V5ekikMtVbM5xXWKHPRgzYqZUlNBV74k/FGgRlBk/bw1AWqnsd
TfUF+QZpEd/4GPglbQLvwJwjQg+qanl59CJqi403U00emLuvRqdqTeMoqPEfw5id
MgVPMtUF+N7fgtygo20oPFBriFBHFUTj+c5Oafd4ahgp6eU02HYX8A7w/jj84tTP
cPa9bcoyKyec5vpO1mbU2a/VzqXPDNL17Dg9tRaf0NpksMeLvBh14jXWp5v8vEqU
Qm4mFlmEYKivWTJhz6pGJmxFX/X5vMa2wrFY7xvOVYby5mSuEGD7+puuKuVNNBEa
THaElOYM4ogTrUBM39dzfzXxSEQN/bcLHXNd2IDuUUK49NvNFjnke3PvLxesiDuT
Pxk4mO912+/Ldk7K1LVVVWltVtOHFNG+I7a3R75gwftHXMYONuOpEiflA4QorxVk
9Rq9yUI4h/69V5fIthBN8BUYU5hxaTtLi0DI1fWSweugZb5PUXiagKnXKaIOLQ/4
A3pvoYEHdynDVO7nJ+Kd
=E+j+
-----END PGP SIGNATURE-----
Merge tag 'irqchip-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/urgent
Merge irqchip updates from Marc Zyngier:
- A number of embarassing buglets (GICv3, PIC32)
- A more substential errata workaround for Cavium's GICv3 ITS
(kept for post-rc1 due to its dependency on NUMA)
The mutex owner can get read and written to locklessly.
Use WRITE_ONCE when setting and clearing the owner field
in order to avoid optimizations such as store tearing. This
avoids situations where the owner field gets written to with
multiple stores and another thread could concurrently read
and use a partially written owner value.
Signed-off-by: Jason Low <jason.low2@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Waiman Long <Waiman.Long@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Terry Rudd <terry.rudd@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1463782776.2479.9.camel@j-VirtualBox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When acquiring the rwsem write lock in the slowpath, we first try
to set count to RWSEM_WAITING_BIAS. When that is successful,
we then atomically add the RWSEM_WAITING_BIAS in cases where
there are other tasks on the wait list. This causes write lock
operations to often issue multiple atomic operations.
We can instead make the list_is_singular() check first, and then
set the count accordingly, so that we issue at most 1 atomic
operation when acquiring the write lock and reduce unnecessary
cacheline contention.
Signed-off-by: Jason Low <jason.low2@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long<Waiman.Long@hpe.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jason Low <jason.low2@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Terry Rudd <terry.rudd@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1463445486-16078-2-git-send-email-jason.low2@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Readers that are awoken will expect a nil ->task indicating
that a wakeup has occurred. Because of the way readers are
implemented, there's a small chance that the waiter will never
block in the slowpath (rwsem_down_read_failed), and therefore
requires some form of reference counting to avoid the following
scenario:
rwsem_down_read_failed() rwsem_wake()
get_task_struct();
spin_lock_irq(&wait_lock);
list_add_tail(&waiter.list)
spin_unlock_irq(&wait_lock);
raw_spin_lock_irqsave(&wait_lock)
__rwsem_do_wake()
while (1) {
set_task_state(TASK_UNINTERRUPTIBLE);
waiter->task = NULL
if (!waiter.task) // true
break;
schedule() // never reached
__set_task_state(TASK_RUNNING);
do_exit();
wake_up_process(tsk); // boom
... and therefore race with do_exit() when the caller returns.
There is also a mismatch between the smp_mb() and its documentation,
in that the serialization is done between reading the task and the
nil store. Furthermore, in addition to having the overlapping of
loads and stores to waiter->task guaranteed to be ordered within
that CPU, both wake_up_process() originally and now wake_q_add()
already imply barriers upon successful calls, which serves the
comment.
Now, as an alternative to perhaps inverting the checks in the blocker
side (which has its own penalty in that schedule is unavoidable),
with lockless wakeups this situation is naturally addressed and we
can just use the refcount held by wake_q_add(), instead doing so
explicitly. Of course, we must guarantee that the nil store is done
as the _last_ operation in that the task must already be marked for
deletion to not fall into the race above. Spurious wakeups are also
handled transparently in that the task's reference is only removed
when wake_up_q() is actually called _after_ the nil store.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman.Long@hpe.com
Cc: dave@stgolabs.net
Cc: jason.low2@hp.com
Cc: peter@hurleysoftware.com
Link: http://lkml.kernel.org/r/1463165787-25937-3-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As wake_qs gain users, we can teach rwsems about them such that
waiters can be awoken without the wait_lock. This is for both
readers and writer, the former being the most ideal candidate
as we can batch the wakeups shortening the critical region that
much more -- ie writer task blocking a bunch of tasks waiting to
service page-faults (mmap_sem readers).
In general applying wake_qs to rwsem (xadd) is not difficult as
the wait_lock is intended to be released soon _anyways_, with
the exception of when a writer slowpath will proactively wakeup
any queued readers if it sees that the lock is owned by a reader,
in which we simply do the wakeups with the lock held (see comment
in __rwsem_down_write_failed_common()).
Similar to other locking primitives, delaying the waiter being
awoken does allow, at least in theory, the lock to be stolen in
the case of writers, however no harm was seen in this (in fact
lock stealing tends to be a _good_ thing in most workloads), and
this is a tiny window anyways.
Some page-fault (pft) and mmap_sem intensive benchmarks show some
pretty constant reduction in systime (by up to ~8 and ~10%) on a
2-socket, 12 core AMD box. In addition, on an 8-core Westmere doing
page allocations (page_test)
aim9:
4.6-rc6 4.6-rc6
rwsemv2
Min page_test 378167.89 ( 0.00%) 382613.33 ( 1.18%)
Min exec_test 499.00 ( 0.00%) 502.67 ( 0.74%)
Min fork_test 3395.47 ( 0.00%) 3537.64 ( 4.19%)
Hmean page_test 395433.06 ( 0.00%) 414693.68 ( 4.87%)
Hmean exec_test 499.67 ( 0.00%) 505.30 ( 1.13%)
Hmean fork_test 3504.22 ( 0.00%) 3594.95 ( 2.59%)
Stddev page_test 17426.57 ( 0.00%) 26649.92 (-52.93%)
Stddev exec_test 0.47 ( 0.00%) 1.41 (-199.05%)
Stddev fork_test 63.74 ( 0.00%) 32.59 ( 48.86%)
Max page_test 429873.33 ( 0.00%) 456960.00 ( 6.30%)
Max exec_test 500.33 ( 0.00%) 507.66 ( 1.47%)
Max fork_test 3653.33 ( 0.00%) 3650.90 ( -0.07%)
4.6-rc6 4.6-rc6
rwsemv2
User 1.12 0.04
System 0.23 0.04
Elapsed 727.27 721.98
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman.Long@hpe.com
Cc: dave@stgolabs.net
Cc: jason.low2@hp.com
Cc: peter@hurleysoftware.com
Link: http://lkml.kernel.org/r/1463165787-25937-2-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change the return code for sampling event not supported from -ENOTSUPP
to -EOPNOTSUPP.
This allows userspace to identify this case specifically, instead of
printing the catch-all error message it did previously.
Technically this is an ABI change, but we think we can get away
with it.
Old behavior:
-------
| # perf record ls
| Error:
| The sys_perf_event_open() syscall returned with 524 (Unknown error 524)
| for event (cycles:ppp).
| /bin/dmesg may provide additional information.
| No CONFIG_PERF_EVENTS=y kernel support configured?
New behavior:
-------
| # perf record ls
| Error:
| PMU Hardware doesn't support sampling/overflow-interrupts.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <acme@redhat.com>
Cc: <linux-snps-arc@lists.infradead.org>
Cc: <vincent.weaver@maine.edu>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Link: http://lkml.kernel.org/r/1462786660-2900-3-git-send-email-vgupta@synopsys.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch fixes an issue which was introduced by commit:
91a612eea9 ("perf/core: Fix dynamic interrupt throttle")
... which commit unconditionally sets the perf_sample_allowed_ns value
to !0. But that could trigger a bug in the following corner case:
The user can disable the dynamic interrupt throttle mechanism by setting
perf_cpu_time_max_percent to 0. Then they change perf_event_max_sample_rate.
For this case, the mechanism will be enabled implicitly, because
perf_sample_allowed_ns becomes !0 - which is not what we want.
This patch only updates perf_sample_allowed_ns when the dynamic
interrupt throttle mechanism is enabled.
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Link: http://lkml.kernel.org/r/1462260366-3160-1-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are now two different things called AUX in perf, the
infrastructure to deliver the mmap/comm/task records and the
AUX part in the mmap buffer (with associated AUX_RECORD).
Since the former is internal, rename it to side-band to reduce
the confusion factor.
No change in functionality.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The perf_event_aux() function iterates all PMUs and all events in
their respective per-CPU contexts to find the events to deliver
side-band records to.
For example, the brk test case in lkp triggers many mmap() operations,
which, if we're also running perf, results in many perf_event_aux()
invocations.
If we enable uncore PMU support (even when uncore events are not used),
dozens of uncore PMUs will be iterated, which can significantly
decrease brk_test's throughput.
For example, the brk throughput:
without uncore PMUs: 2647573 ops_per_sec
with uncore PMUs: 1768444 ops_per_sec
... a 33% reduction.
To get at the per-CPU events that need side-band records, this patch
puts these events on a per-CPU list, this avoids iterating the PMUs
and any events that do not need side-band records.
Per task events are unchanged to avoid extra overhead on the context
switch paths.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Huang, Ying <ying.huang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1458757477-3781-1-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Generally task_struct is only protected by RCU if it was found on a
RCU protected list (say, for_each_process() or find_task_by_vpid()).
As Kirill pointed out rq->curr isn't protected by RCU, the scheduler
drops the (potentially) last reference without RCU gp, this means
that we need to fix the code which uses foreign_rq->curr under
rcu_read_lock().
Add a new helper which can be used to dereference rq->curr or any
other pointer to task_struct assuming that it should be cleared or
updated before the final put_task_struct(). It returns non-NULL
only if this task can't go away before rcu_read_unlock().
( Also add try_get_task_struct() to make it easier to use this API
correctly. )
Suggested-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[ Updated comments; added try_get_task_struct()]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Link: http://lkml.kernel.org/r/20160518170218.GY3192@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, smp_processor_id() is used to fetch the current CPU in
cpu_idle_loop(). Every time the idle thread runs, it fetches the
current CPU using smp_processor_id().
Since the idle thread is per CPU, the current CPU is constant, so we
can lift the load out of the loop, saving execution cycles/time in the
loop.
x86-64:
Before patch (execution in loop):
148: 0f ae e8 lfence
14b: 65 8b 04 25 00 00 00 00 mov %gs:0x0,%eax
152: 00
153: 89 c0 mov %eax,%eax
155: 49 0f a3 04 24 bt %rax,(%r12)
After patch (execution in loop):
150: 0f ae e8 lfence
153: 4d 0f a3 34 24 bt %r14,(%r12)
ARM64:
Before patch (execution in loop):
168: d5033d9f dsb ld
16c: b9405661 ldr w1,[x19,#84]
170: 1100fc20 add w0,w1,#0x3f
174: 6b1f003f cmp w1,wzr
178: 1a81b000 csel w0,w0,w1,lt
17c: 130c7000 asr w0,w0,#6
180: 937d7c00 sbfiz x0,x0,#3,#32
184: f8606aa0 ldr x0,[x21,x0]
188: 9ac12401 lsr x1,x0,x1
18c: 36000e61 tbz w1,#0,358
After patch (execution in loop):
1a8: d50339df dsb ld
1ac: f8776ac0 ldr x0,[x22,x23]
ab0: ea18001f tst x0,x24
1b4: 54000ea0 b.eq 388
Further observance on ARM64 for 4 seconds shows that cpu_idle_loop is
called 8672 times. Shifting the code will save instructions executed
in loop and eventually time as well.
Signed-off-by: Gaurav Jindal <gaurav.jindal@spreadtrum.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sanjeev Yadav <sanjeev.yadav@spreadtrum.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160512101330.GA488@gauravjindalubtnb.del.spreadtrum.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Two minor fixes for cfs_rq_clock_task():
1) If cfs_rq is currently being throttled, we need to subtract the cfs
throttled clock time.
2) Make "throttled_clock_task_time" update SMP unrelated. Now UP cases
need it as well.
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1462885398-14724-1-git-send-email-xlpang@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Recursive locking for ww_mutexes was originally conceived as an
exception. However, it is heavily used by the DRM atomic modesetting
code. Currently, the recursive deadlock is checked after we have queued
up for a busy-spin and as we never release the lock, we spin until
kicked, whereupon the deadlock is discovered and reported.
A simple solution for the now common problem is to move the recursive
deadlock discovery to the first action when taking the ww_mutex.
Suggested-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1464293297-19777-1-git-send-email-chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Create a new helper to avoid code duplication across governors.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The design of the cpufreq governor API is not very straightforward,
as struct cpufreq_governor provides only one callback to be invoked
from different code paths for different purposes. The purpose it is
invoked for is determined by its second "event" argument, causing it
to act as a "callback multiplexer" of sorts.
Unfortunately, that leads to extra complexity in governors, some of
which implement the ->governor() callback as a switch statement
that simply checks the event argument and invokes a separate function
to handle that specific event.
That extra complexity can be eliminated by replacing the all-purpose
->governor() callback with a family of callbacks to carry out specific
governor operations: initialization and exit, start and stop and policy
limits updates. That also turns out to reduce the code size too, so
do it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The cpuidle_devices per-CPU variable is only defined when CPU_IDLE is
enabled. Commit c8cc7d4de7 ("sched/idle: Reorganize the idle loop")
removed the #ifdef CONFIG_CPU_IDLE around cpuidle_idle_call() with the
compiler optimising away __this_cpu_read(cpuidle_devices). However, with
CONFIG_UBSAN && !CONFIG_CPU_IDLE, this optimisation no longer happens
and the kernel fails to link since cpuidle_devices is not defined.
This patch introduces an accessor function for the current CPU cpuidle
device (returning NULL when !CONFIG_CPU_IDLE) and uses it in
cpuidle_idle_call().
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: 4.5+ <stable@vger.kernel.org> # 4.5+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull networking fixes from David Miller:
1) Fix negative error code usage in ATM layer, from Stefan Hajnoczi.
2) If CONFIG_SYSCTL is disabled, the default TTL is not initialized
properly. From Ezequiel Garcia.
3) Missing spinlock init in mvneta driver, from Gregory CLEMENT.
4) Missing unlocks in hwmb error paths, also from Gregory CLEMENT.
5) Fix deadlock on team->lock when propagating features, from Ivan
Vecera.
6) Work around buffer offset hw bug in alx chips, from Feng Tang.
7) Fix double listing of SCTP entries in sctp_diag dumps, from Xin
Long.
8) Various statistics bug fixes in mlx4 from Eric Dumazet.
9) Fix some randconfig build errors wrt fou ipv6 from Arnd Bergmann.
10) All of l2tp was namespace aware, but the ipv6 support code was not
doing so. From Shmulik Ladkani.
11) Handle on-stack hrtimers properly in pktgen, from Guenter Roeck.
12) Propagate MAC changes properly through VLAN devices, from Mike
Manning.
13) Fix memory leak in bnx2x_init_one(), from Vitaly Kuznetsov.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (62 commits)
sfc: Track RPS flow IDs per channel instead of per function
usbnet: smsc95xx: fix link detection for disabled autonegotiation
virtio_net: fix virtnet_open and virtnet_probe competing for try_fill_recv
bnx2x: avoid leaking memory on bnx2x_init_one() failures
fou: fix IPv6 Kconfig options
openvswitch: update checksum in {push,pop}_mpls
sctp: sctp_diag should dump sctp socket type
net: fec: update dirty_tx even if no skb
vlan: Propagate MAC address to VLANs
atm: iphase: off by one in rx_pkt()
atm: firestream: add more reserved strings
vxlan: Accept user specified MTU value when create new vxlan link
net: pktgen: Call destroy_hrtimer_on_stack()
timer: Export destroy_hrtimer_on_stack()
net: l2tp: Make l2tp_ip6 namespace aware
Documentation: ip-sysctl.txt: clarify secure_redirects
sfc: use flow dissector helpers for aRFS
ieee802154: fix logic error in ieee802154_llsec_parse_dev_addr
net: nps_enet: Disable interrupts before napi reschedule
net/lapb: tuse %*ph to dump buffers
...
hrtimer_init_on_stack() needs a matching call to
destroy_hrtimer_on_stack(), so both need to be exported.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 724e4fcc the intention was to pass any errors back from
audit_filter_user_rules() to audit_filter_user(). Add that code.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Additionally to being able to control the system wide maximum depth via
/proc/sys/kernel/perf_event_max_stack, now we are able to ask for
different depths per event, using perf_event_attr.sample_max_stack for
that.
This uses an u16 hole at the end of perf_event_attr, that, when
perf_event_attr.sample_type has the PERF_SAMPLE_CALLCHAIN, if
sample_max_stack is zero, means use perf_event_max_stack, otherwise
it'll be bounds checked under callchain_mutex.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Milian Wolff <milian.wolff@kdab.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Pull vfs fixes from Al Viro:
"Followups to the parallel lookup work:
- update docs
- restore killability of the places that used to take ->i_mutex
killably now that we have down_write_killable() merged
- Additionally, it turns out that I missed a prerequisite for
security_d_instantiate() stuff - ->getxattr() wasn't the only thing
that could be called before dentry is attached to inode; with smack
we needed the same treatment applied to ->setxattr() as well"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
switch ->setxattr() to passing dentry and inode separately
switch xattr_handler->set() to passing dentry and inode separately
restore killability of old mutex_lock_killable(&inode->i_mutex) users
add down_write_killable_nested()
update D/f/directory-locking
Most users of IS_ERR_VALUE() in the kernel are wrong, as they
pass an 'int' into a function that takes an 'unsigned long'
argument. This happens to work because the type is sign-extended
on 64-bit architectures before it gets converted into an
unsigned type.
However, anything that passes an 'unsigned short' or 'unsigned int'
argument into IS_ERR_VALUE() is guaranteed to be broken, as are
8-bit integers and types that are wider than 'unsigned long'.
Andrzej Hajda has already fixed a lot of the worst abusers that
were causing actual bugs, but it would be nice to prevent any
users that are not passing 'unsigned long' arguments.
This patch changes all users of IS_ERR_VALUE() that I could find
on 32-bit ARM randconfig builds and x86 allmodconfig. For the
moment, this doesn't change the definition of IS_ERR_VALUE()
because there are probably still architecture specific users
elsewhere.
Almost all the warnings I got are for files that are better off
using 'if (err)' or 'if (err < 0)'.
The only legitimate user I could find that we get a warning for
is the (32-bit only) freescale fman driver, so I did not remove
the IS_ERR_VALUE() there but changed the type to 'unsigned long'.
For 9pfs, I just worked around one user whose calling conventions
are so obscure that I did not dare change the behavior.
I was using this definition for testing:
#define IS_ERR_VALUE(x) ((unsigned long*)NULL == (typeof (x)*)NULL && \
unlikely((unsigned long long)(x) >= (unsigned long long)(typeof(x))-MAX_ERRNO))
which ends up making all 16-bit or wider types work correctly with
the most plausible interpretation of what IS_ERR_VALUE() was supposed
to return according to its users, but also causes a compile-time
warning for any users that do not pass an 'unsigned long' argument.
I suggested this approach earlier this year, but back then we ended
up deciding to just fix the users that are obviously broken. After
the initial warning that caused me to get involved in the discussion
(fs/gfs2/dir.c) showed up again in the mainline kernel, Linus
asked me to send the whole thing again.
[ Updated the 9p parts as per Al Viro - Linus ]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andrzej Hajda <a.hajda@samsung.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.org/lkml/2016/1/7/363
Link: https://lkml.org/lkml/2016/5/27/486
Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> # For nvmem part
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull kbuild updates from Michal Marek:
- new option CONFIG_TRIM_UNUSED_KSYMS which does a two-pass build and
unexports symbols which are not used in the current config [Nicolas
Pitre]
- several kbuild rule cleanups [Masahiro Yamada]
- warning option adjustments for gcov etc [Arnd Bergmann]
- a few more small fixes
* 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: (31 commits)
kbuild: move -Wunused-const-variable to W=1 warning level
kbuild: fix if_change and friends to consider argument order
kbuild: fix adjust_autoksyms.sh for modules that need only one symbol
kbuild: fix ksym_dep_filter when multiple EXPORT_SYMBOL() on the same line
gcov: disable -Wmaybe-uninitialized warning
gcov: disable tree-loop-im to reduce stack usage
gcov: disable for COMPILE_TEST
Kbuild: disable 'maybe-uninitialized' warning for CONFIG_PROFILE_ALL_BRANCHES
Kbuild: change CC_OPTIMIZE_FOR_SIZE definition
kbuild: forbid kernel directory to contain spaces and colons
kbuild: adjust ksym_dep_filter for some cmd_* renames
kbuild: Fix dependencies for final vmlinux link
kbuild: better abstract vmlinux sequential prerequisites
kbuild: fix call to adjust_autoksyms.sh when output directory specified
kbuild: Get rid of KBUILD_STR
kbuild: rename cmd_as_s_S to cmd_cpp_s_S
kbuild: rename cmd_cc_i_c to cmd_cpp_i_c
kbuild: drop redundant "PHONY += FORCE"
kbuild: delete unnecessary "@:"
kbuild: mark help target as PHONY
...
mmput_async is currently used only from the oom_reaper which is defined
only for CONFIG_MMU. We can save work_struct in mm_struct for
!CONFIG_MMU.
[akpm@linux-foundation.org: fix typo, per Minchan]
Link: http://lkml.kernel.org/r/20160520061658.GB19172@dhcp22.suse.cz
Reported-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When create css failed, before call css_free_rcu_fn, we remove the css
id and exit the percpu_ref, but we will do these again in
css_free_work_fn, so they are redundant. Especially the css id, that
would cause problem if we remove it twice, since it may be assigned to
another css after the first remove.
tj: This was broken by two commits updating the free path without
synchronizing the creation failure path. This can be easily
triggered by trying to create more than 64k memory cgroups.
Signed-off-by: Wenwei Tao <ww.tao0320@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Fixes: 9a1049da9b ("percpu-refcount: require percpu_ref to be exited explicitly")
Fixes: 01e586598b ("cgroup: release css->id after css_free")
Cc: stable@vger.kernel.org # v3.17+
Pull scheduler fixes from Ingo Molnar:
"Two fixes: one for a lost wakeup, the other to fix the compiler
optimizing out preempt operations on ARM64 (and possibly other non-x86
architectures)"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Fix remote wakeups
sched/preempt: Fix preempt_count manipulations
Pull perf updates from Ingo Molnar:
"Mostly tooling and PMU driver fixes, but also a number of late updates
such as the reworking of the call-chain size limiting logic to make
call-graph recording more robust, plus tooling side changes for the
new 'backwards ring-buffer' extension to the perf ring-buffer"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
perf record: Read from backward ring buffer
perf record: Rename variable to make code clear
perf record: Prevent reading invalid data in record__mmap_read
perf evlist: Add API to pause/resume
perf trace: Use the ptr->name beautifier as default for "filename" args
perf trace: Use the fd->name beautifier as default for "fd" args
perf report: Add srcline_from/to branch sort keys
perf evsel: Record fd into perf_mmap
perf evsel: Add overwrite attribute and check write_backward
perf tools: Set buildid dir under symfs when --symfs is provided
perf trace: Only auto set call-graph to "dwarf" when syscalls are being traced
perf annotate: Sort list of recognised instructions
perf annotate: Fix identification of ARM blt and bls instructions
perf tools: Fix usage of max_stack sysctl
perf callchain: Stop validating callchains by the max_stack sysctl
perf trace: Fix exit_group() formatting
perf top: Use machine->kptr_restrict_warned
perf trace: Warn when trying to resolve kernel addresses with kptr_restrict=1
perf machine: Do not bail out if not managing to read ref reloc symbol
perf/x86/intel/p4: Trival indentation fix, remove space
...
- Stable-candidate cpuidle fix to make it check the right variable
when deciding whether or not to enable interrupts on the local CPU
so as to avoid enabling iterrupts too early in some cases if the
system has both coupled and per-core idle states (Daniel Lezcano).
- Stable-candidate PM core fix to make it handle failures at the
"late suspend" stage of device suspend consistently for all
devices regardless of whether or not async suspend/resume is
enabled for them (Rafael Wysocki).
- Cleanups in the cpufreq core, the schedutil governor and the
intel_pstate driver (Rafael Wysocki, Pankaj Gupta, Viresh Kumar).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJXRglcAAoJEILEb/54YlRxCwEQAKQNLO5wrNnOQmVrMHaOk1XH
4zT+AdggLRVgBld4d4Vcli2zJPZfpnZuzjJdwTB5zLgQ4WIcBb6meOfH87XGqRyJ
o4ksyEUhpvDk8AdlmTA5CvvjFuydPJG5ZUSiM035XRT9heebvhgyaMBnT3ucXbq9
7LhNhCQ+a8arndt9ePO7tZnFfQQUbwNJ2BDVuH5DJBqMIFOo2/Kpag43CdFWlWZT
jnWaleDCjSmanuJ/45bFJHJeSZ7PK2etnArfzKtb9QLSGnuEfFPdHuUzJYo5dkP7
UBeYA94hhfR3f5FJIqNlF3N+eLEX1idpwxc8+CJLLDKDd1ZCBrbLoz5fwM+fVn0h
AfmyR+J1czcbiphsmpViOYDRrKdiQVkbP6SpBswvgMCZAcNDF2bxhzOlcuTUc+u0
8xsjWOtArL6uvzsAHa1HY6hhgUn9FB8m20HX+DmS2/zzqyzoRefenoyVcuLsAhXC
fm+sARQ7tvy3OoGRQ9mloWgv2X5iQUY5IVjOG2amIhbUvVmKQutPjVTGTwHqmmcb
2nNYptLsTA6crvnexPcPHY+OFjkQl/omtfaMx+OJl63yhln5ibveGOfZ6F8sPdoB
bRqHuHoK/xh9hSNwj117ZFzq1nm54mLjh0Yhw3EXFcV4I9vdsTp8/WeNThGvT17j
M+6PDXyjlwh3HZpGm+HW
=3vtL
-----END PGP SIGNATURE-----
Merge tag 'pm-4.7-rc1-more' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These are two stable-candidate fixes (PM core, cpuidle) and a bunch of
cpufreq cleanups.
Specifics:
- Stable-candidate cpuidle fix to make it check the right variable
when deciding whether or not to enable interrupts on the local CPU
so as to avoid enabling iterrupts too early in some cases if the
system has both coupled and per-core idle states (Daniel Lezcano).
- Stable-candidate PM core fix to make it handle failures at the
"late suspend" stage of device suspend consistently for all devices
regardless of whether or not async suspend/resume is enabled for
them (Rafael Wysocki).
- Cleanups in the cpufreq core, the schedutil governor and the
intel_pstate driver (Rafael Wysocki, Pankaj Gupta, Viresh Kumar)"
* tag 'pm-4.7-rc1-more' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / sleep: Handle failures in device_suspend_late() consistently
cpufreq: schedutil: Improve prints messages with pr_fmt
cpuidle: Fix cpuidle_state_is_coupled() argument in cpuidle_enter()
cpufreq: simplified goto out in cpufreq_register_driver()
cpufreq: governor: CPUFREQ_GOV_STOP never fails
cpufreq: governor: CPUFREQ_GOV_POLICY_EXIT never fails
intel_pstate: Simplify conditional in intel_pstate_set_policy()
Commit:
b5179ac70d ("sched/fair: Prepare to fix fairness problems on migration")
... introduced a bug: Mike Galbraith found that it introduced a
performance regression, while Paul E. McKenney reported lost
wakeups and bisected it to this commit.
The reason is that I mis-read ttwu_queue() such that I assumed any
wakeup that got a remote queue must have had the task migrated.
Since this is not so; we need to transfer this information between
queueing the wakeup and actually doing the wakeup. Use a new
task_struct::sched_flag for this, we already write to
sched_contributes_to_load in the wakeup path so this is a hot and
modified cacheline.
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Hunter <ahh@google.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ben Segall <bsegall@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Turner <pjt@google.com>
Cc: Pavan Kondeti <pkondeti@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: byungchul.park@lge.com
Fixes: b5179ac70d ("sched/fair: Prepare to fix fairness problems on migration")
Link: http://lkml.kernel.org/r/20160523091907.GD15728@worktop.ger.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
after a crash and a potential BUG_ON crash if a file has the data
journalling flag enabled while it has dirty delayed allocation blocks
that haven't been written yet. Also fix a potential crash in the new
project quota code and a maliciously corrupted file system.
In addition, fix some DAX-specific bugs, including when there is a
transient ENOSPC situation and races between writes via direct I/O and
an mmap'ed segment that could lead to lost I/O.
Finally the usual set of miscellaneous cleanups.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJXQ40fAAoJEPL5WVaVDYGjnwMH+wXHASgPfzZgtRInsTG8W/2L
jsmAcMlyMAYIATWMppNtPIq0td49z1dYO0YkKhtPVMwfzu230IFWhGWp93WqP9ve
XYHMmaBorFlMAzWgMKn1K0ExWZlV+ammmcTKgU0kU4qyZp0G/NnMtlXIkSNv2amI
9Mn6R+v97c20gn8e9HWP/IVWkgPr+WBtEXaSGjC7dL6yI8hL+rJMqN82D76oU5ea
vtwzrna/ISijy+etYmQzqHNYNaBKf40+B5HxQZw/Ta3FSHofBwXAyLaeEAr260Mf
V3Eg2NDcKQxiZ3adBzIUvrRnrJV381OmHoguo8Frs8YHTTRiZ0T/s7FGr2Q0NYE=
=7yIM
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Fix a number of bugs, most notably a potential stale data exposure
after a crash and a potential BUG_ON crash if a file has the data
journalling flag enabled while it has dirty delayed allocation blocks
that haven't been written yet. Also fix a potential crash in the new
project quota code and a maliciously corrupted file system.
In addition, fix some DAX-specific bugs, including when there is a
transient ENOSPC situation and races between writes via direct I/O and
an mmap'ed segment that could lead to lost I/O.
Finally the usual set of miscellaneous cleanups"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
ext4: pre-zero allocated blocks for DAX IO
ext4: refactor direct IO code
ext4: fix race in transient ENOSPC detection
ext4: handle transient ENOSPC properly for DAX
dax: call get_blocks() with create == 1 for write faults to unwritten extents
ext4: remove unmeetable inconsisteny check from ext4_find_extent()
jbd2: remove excess descriptions for handle_s
ext4: remove unnecessary bio get/put
ext4: silence UBSAN in ext4_mb_init()
ext4: address UBSAN warning in mb_find_order_for_block()
ext4: fix oops on corrupted filesystem
ext4: fix check of dqget() return value in ext4_ioctl_setproject()
ext4: clean up error handling when orphan list is corrupted
ext4: fix hang when processing corrupted orphaned inode list
ext4: remove trailing \n from ext4_warning/ext4_error calls
ext4: fix races between changing inode journal mode and ext4_writepages
ext4: handle unwritten or delalloc buffers before enabling data journaling
ext4: fix jbd2 handle extension in ext4_ext_truncate_extend_restart()
ext4: do not ask jbd2 to write data for delalloc buffers
jbd2: add support for avoiding data writes during transaction commits
...
Commit 7cec18a390 changed the return type of irq_destroy_ipi to int, but
missed adding a value to one return statement. Fix this to silence the
resulting compiler warning:
kernel/irq/ipi.c In function ‘irq_destroy_ipi’:
kernel/irq/ipi.c:128:3: warning: ‘return’ with no value, in function returning non-void [-Wreturn-type]
Fixes: 7cec18a390 "genirq: Add error code reporting to irq_{reserve,destroy}_ipi"
Signed-off-by: Matt Redfearn <matt.redfearn@imgtec.com>
Cc: linux-mips@linux-mips.org
Link: http://lkml.kernel.org/r/1464086550-24734-1-git-send-email-matt.redfearn@imgtec.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
xol_add_vma needs mmap_sem for write. If the waiting task gets killed
by the oom killer it would block oom_reaper from asynchronous address
space reclaim and reduce the chances of timely OOM resolving. Wait for
the lock in the killable mode and return with EINTR if the task got
killed while waiting.
Do not warn in dup_xol_work if __create_xol_area failed due to fatal
signal pending because this is usually considered a kernel issue.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PR_SET_THP_DISABLE requires mmap_sem for write. If the waiting task
gets killed by the oom killer it would block oom_reaper from
asynchronous address space reclaim and reduce the chances of timely OOM
resolving. Wait for the lock in the killable mode and return with EINTR
if the task got killed while waiting.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dup_mmap needs to lock current's mm mmap_sem for write. If the waiting
task gets killed by the oom killer it would block oom_reaper from
asynchronous address space reclaim and reduce the chances of timely OOM
resolving. Wait for the lock in the killable mode and return with EINTR
if the task got killed while waiting.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 3f625002581b ("kexec: introduce a protection mechanism for the
crashkernel reserved memory") is a similar mechanism for protecting the
crash kernel reserved memory to previous crash_map/unmap_reserved_pages()
implementation, the new one is more generic in name and cleaner in code
(besides, some arch may not be allowed to unmap the pgtable).
Therefore, this patch consolidates them, and uses the new
arch_kexec_protect(unprotect)_crashkres() to replace former
crash_map/unmap_reserved_pages() which by now has been only used by
S390.
The consolidation work needs the crash memory to be mapped initially,
this is done in machine_kdump_pm_init() which is after
reserve_crashkernel(). Once kdump kernel is loaded, the new
arch_kexec_protect_crashkres() implemented for S390 will actually
unmap the pgtable like before.
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Minfei Huang <mhuang@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are a lof of work to be done in function kexec_load, not only for
allocating structs and loading initram, but also for some misc.
To make it more clear, wrap a new function do_kexec_load which is used
to allocate structs and load initram. And the pre-work will be done in
kexec_load.
Signed-off-by: Minfei Huang <mnfhuang@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Xunlei Pang <xlpang@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For some arch, kexec shall map the reserved pages, then use them, when
we try to start the kdump service.
kexec may return directly, without unmaping the reserved pages, if it
fails during starting service. To fix it, we make a pair of map/unmap
reserved pages both in generic path and error path.
This patch only affects s390. Other architecturess don't implement the
interface of crash_unmap_reserved_pages and crash_map_reserved_pages.
It isn't a urgent patch. Kernel can work well without any risk,
although the reserved pages are not unmapped before returning in error
path.
Signed-off-by: Minfei Huang <mnfhuang@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Xunlei Pang <xlpang@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For the cases that some kernel (module) path stamps the crash reserved
memory(already mapped by the kernel) where has been loaded the second
kernel data, the kdump kernel will probably fail to boot when panic
happens (or even not happens) leaving the culprit at large, this is
unacceptable.
The patch introduces a mechanism for detecting such cases:
1) After each crash kexec loading, it simply marks the reserved memory
regions readonly since we no longer access it after that. When someone
stamps the region, the first kernel will panic and trigger the kdump.
The weak arch_kexec_protect_crashkres() is introduced to do the actual
protection.
2) To allow multiple loading, once 1) was done we also need to remark
the reserved memory to readwrite each time a system call related to
kdump is made. The weak arch_kexec_unprotect_crashkres() is introduced
to do the actual protection.
The architecture can make its specific implementation by overriding
arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres().
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Minfei Huang <mhuang@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linux preallocates the task structs of the idle tasks for all possible
CPUs. This currently means they all end up on node 0. This also
implies that the cache line of MWAIT, which is around the flags field in
the task struct, are all located in node 0.
We see a noticeable performance improvement on Knights Landing CPUs when
the cache lines used for MWAIT are located in the local nodes of the
CPUs using them. I would expect this to give a (likely slight)
improvement on other systems too.
The patch implements placing the idle task in the node of its CPUs, by
passing the right target node to copy_process()
[akpm@linux-foundation.org: use NUMA_NO_NODE, not a bare -1]
Link: http://lkml.kernel.org/r/1463492694-15833-1-git-send-email-andi@firstfloor.org
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use pr_<level> instead of printk(KERN_<LEVEL> ).
Signed-off-by: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I see no reason why waitid() can't support other linux-specific flags
allowed in sys_wait4().
In particular this change can help if we reconsider the previous change
("wait/ptrace: assume __WALL if the child is traced") which adds the
"automagical" __WALL for debugger.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: <syzkaller@googlegroups.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>