Commit graph

23773 commits

Author SHA1 Message Date
Richard Guy Briggs
8443075eac audit: tame initialization warning len_abuf in audit_log_execve_info
Tame initialization warning of len_abuf in audit_log_execve_info even
though there isn't presently a bug introduced by commit 43761473c2
("audit: fix a double fetch in audit_log_single_execve_arg()").  Using
UNINITIALIZED_VAR instead may mask future bugs.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-11-14 15:18:48 -05:00
Paul E. McKenney
aa3e0bf1aa rcu: Don't kick unless grace period or request
The current code can result in spurious kicks when there are no grace
periods in progress and no grace-period-related requests.  This is
sort of OK for a diagnostic aid, but the resulting ftrace-dump messages
in dmesg are annoying.  This commit therefore avoids spurious kicks
in the common case.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2016-11-14 10:46:31 -08:00
Paul E. McKenney
0742ac3e2f rcu: Make expedited grace periods recheck dyntick idle state
Expedited grace periods check dyntick-idle state, and avoid sending
IPIs to idle CPUs, including those running guest OSes, and, on NOHZ_FULL
kernels, nohz_full CPUs.  However, the kernel has been observed checking
a CPU while it was non-idle, but sending the IPI after it has gone
idle.  This commit therefore rechecks idle state immediately before
sending the IPI, refraining from IPIing CPUs that have since gone idle.

Reported-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-11-14 10:46:31 -08:00
Paul E. McKenney
d0af39e89e torture: Trace long read-side delays
Although rcutorture will occasionally do a 50-millisecond grace-period
delay, these delays are quite rare.  And rightly so, because otherwise
the read rate would be quite low.  Thie means that it can be important
to identify whether or not a given run contained a long-delay read.
This commit therefore inserts a trace_rcu_torture_read() event to flag
runs containing long delays.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-11-14 10:46:30 -08:00
Paul E. McKenney
1f21b50b77 rcu: Remove obsolete comment from __call_rcu()
The __call_rcu() comment about opportunistically noting grace period
beginnings and endings is obsolete.  RCU still does such opportunistic
noting, but in __call_rcu_core() rather than __call_rcu(), and there
already is an appropriate comment in __call_rcu_core().  This commit
therefore removes the obsolete comment.

Reported-by: Michalis Kokologiannakis <mixaskok@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2016-11-14 10:46:19 -08:00
Paul E. McKenney
5403d367a7 rcu: Remove obsolete rcu_check_callbacks() header comment
In the deep past, rcu_check_callbacks() was only invoked if rcu_pending()
returned true.  Which was fine, but these days rcu_check_callbacks()
is invoked unconditionally.  This commit therefore removes the obsolete
sentence from the header comment.

Reported-by: Michalis Kokologiannakis <mixaskok@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2016-11-14 10:46:14 -08:00
Paul E. McKenney
b8f2ed5384 rcu: Tighten up __call_rcu() rcu_head alignment check
Commit 720abae3d6 ("rcu: force alignment on struct
callback_head/rcu_head") forced the rcu_head (AKA callback_head)
structure's alignment to pointer size, that is, to 4-byte boundaries on
32-bit systems and to 8-byte boundaries on 64-bit systems.  This
commit therefore checks for this same alignment in __call_rcu(),
which used to contain a looser check for two-byte alignment.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2016-11-14 10:46:08 -08:00
Linus Torvalds
f5c9f9c723 Revert "printk: make reading the kernel log flush pending lines"
This reverts commit bfd8d3f23b.

It turns out that this flushes things much too aggressiverly, and causes
lines to break up when the system logger races with new continuation
lines being printed.

There's a pending patch to make printk() flushing much more
straightforward, but it's too invasive for 4.9, so in the meantime let's
just not make the system message logging flush continuation lines.
They'll be flushed by the final newline anyway.

Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-14 09:31:52 -08:00
Linus Torvalds
5d69561b7b Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Ingo Molnar:
 "This fixes a genirq regression that resulted in the Intel/Broxton
  pinctrl/GPIO driver (and possibly others) spewing warnings"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Use irq type from irqdata instead of irqdesc
2016-11-14 08:34:56 -08:00
James Morris
185c0f26c0 Merge commit 'v4.9-rc5' into next 2016-11-14 09:34:01 +11:00
Daniel Borkmann
c540594f86 bpf, mlx4: fix prog refcount in mlx4_en_try_alloc_resources error path
Commit 67f8b1dcb9 ("net/mlx4_en: Refactor the XDP forwarding rings
scheme") added a bug in that the prog's reference count is not dropped
in the error path when mlx4_en_try_alloc_resources() is failing from
mlx4_xdp_set().

We previously took bpf_prog_add(prog, priv->rx_ring_num - 1), that we
need to release again. Earlier in the call path, dev_change_xdp_fd()
itself holds a reference to the prog as well (hence the '- 1' in the
bpf_prog_add()), so a simple atomic_sub() is safe to use here. When
an error is propagated, then bpf_prog_put() is called eventually from
dev_change_xdp_fd()

Fixes: 67f8b1dcb9 ("net/mlx4_en: Refactor the XDP forwarding rings scheme")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-12 23:31:57 -05:00
Linus Torvalds
b9f659b810 Power management fixes for v4.9-rc5
- Prevent the PM core from attempting to suspend parent devices
    if any of their children, whose suspend callbacks were invoked
    asynchronously, have failed to suspend during the "late" and
    "noirq" phases of system-wide suspend of devices (Brian Norris).
 
  - Prevent the boot-time system suspend test code from leaking a
    reference to the RTC device used by it (Johan Hovold).
 
  - Fix cpupower to use the return value of one of its library
    functions correctly and restore the correct behavior of it
    when used for setting cpufreq tunables broken during the 4.7
    development cycle (Laura Abbott).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJYJloUAAoJEILEb/54YlRxkYgP/RvTRLO+6Hlk0Re/Lm5SS7O4
 YjOrcemWkGkA+YJ5Qe4vki28nMzGITIAWuHJK7Nex0A1zFMX3GrNPxxMPHBejfog
 7VjeEp77x0h9F8EzWX2YLZaTCSpvQjpHH5/vUQaTIE8a/WtZVbSzfCT4jqAYYHzV
 rHn52aNwiAKd4kjMTSBJdjTOdhqebRFYRHpBHovoTARcGH4DgEJciQzaEh1O6qAq
 barOGrDDkB32mULjsBNRX4Wyx1wuJiOpVjiD7sIM75ZomlIDC/RyYc9O37t6GDBC
 koTTlilV2/i2VoqcU5nQgAmgUOoDz1skGJW6Rsp/d/S2GnZiLBJwma31e187otBL
 NFl1egL1BJaz5J1sow7YPYBQILzlYGwNhXLiKHoPEshWp5WSn7mIhxxzAou3Llsp
 oPCXDwuRrMhqYrOdclxiL+Pj6dUapvut3SerZB5C9rGg2JL9Y2gDxrhf/jFqXxNb
 mBw3kRx94bDb6DNHJPuplRHWrL0SlM2ZXnUa7XwUSt+jIiwMK2la0cAL1P/7Qva9
 zJ6SG01BnGEd6OI+y9ZSjITdpTVK10JIuCrTmR9cvWNA+A1vOGCkg6Nj+qunmzFh
 j6W2O0KQ/moR0ICCPCg96Hjf888mohX7vhGQqDQwpXYlsvWcWvjNpURsfY47yBH9
 p70+ziBkqyyIAprJKjLB
 =ykEg
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix two bugs in error code paths in the PM core (system-wide
  suspend of devices), a device reference leak in the boot-time suspend
  test code and a cpupower utility regression from the 4.7 cycle.

  Specifics:

   - Prevent the PM core from attempting to suspend parent devices if
     any of their children, whose suspend callbacks were invoked
     asynchronously, have failed to suspend during the "late" and
     "noirq" phases of system-wide suspend of devices (Brian Norris).

   - Prevent the boot-time system suspend test code from leaking a
     reference to the RTC device used by it (Johan Hovold).

   - Fix cpupower to use the return value of one of its library
     functions correctly and restore the correct behavior of it when
     used for setting cpufreq tunables broken during the 4.7 development
     cycle (Laura Abbott)"

* tag 'pm-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM / sleep: don't suspend parent when async child suspend_{noirq, late} fails
  PM / sleep: fix device reference leak in test_suspend
  cpupower: Correct return type of cpu_power_is_cpu_online() in cpufreq-set
2016-11-11 16:54:23 -08:00
Rafael J. Wysocki
cd16f3dcdb Merge branches 'pm-tools-fixes' and 'pm-sleep-fixes'
* pm-tools-fixes:
  cpupower: Correct return type of cpu_power_is_cpu_online() in cpufreq-set

* pm-sleep-fixes:
  PM / sleep: don't suspend parent when async child suspend_{noirq, late} fails
  PM / sleep: fix device reference leak in test_suspend
2016-11-11 23:24:58 +01:00
Hans de Goede
c6c7d83b9c Revert "console: don't prefer first registered if DT specifies stdout-path"
This reverts commit 05fd007e46 ("console: don't prefer first
registered if DT specifies stdout-path").

The reverted commit changes existing behavior on which many ARM boards
rely.  Many ARM small-board-computers, like e.g.  the Raspberry Pi have
both a video output and a serial console.  Depending on whether the user
is using the device as a more regular computer; or as a headless device
we need to have the console on either one or the other.

Many users rely on the kernel behavior of the console being present on
both outputs, before the reverted commit the console setup with no
console= kernel arguments on an ARM board which sets stdout-path in dt
would look like this:

  [root@localhost ~]# cat /proc/consoles
  ttyS0                -W- (EC p a)    4:64
  tty0                 -WU (E  p  )    4:1

Where as after the reverted commit, it looks like this:

  [root@localhost ~]# cat /proc/consoles
  ttyS0                -W- (EC p a)    4:64

This commit reverts commit 05fd007e46 ("console: don't prefer first
registered if DT specifies stdout-path") restoring the original
behavior.

Fixes: 05fd007e46 ("console: don't prefer first registered if DT specifies stdout-path")
Link: http://lkml.kernel.org/r/20161104121135.4780-2-hdegoede@redhat.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-11 08:12:37 -08:00
Daniel Bristot de Oliveira
9846d50df3 sched/deadline: Fix typo in a comment
In the comment:

        /*
         * The task might have changed its scheduling policy to something
         * different than SCHED_DEADLINE (through switched_fromd_dl()).
         */

s/fromd/from/

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/5408b3b3f9ee197a7b7f10fb834341100a4f2c88.1478599881.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-11 08:28:52 +01:00
Ingo Molnar
bfdd5537dc Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-11 08:27:11 +01:00
Tahsin Erdogan
83f06168ef locking/lockdep: Remove unused parameter from the add_lock_to_list() function
The 'class' parameter is not used, remove it.
n
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1478592127-4376-1-git-send-email-tahsin@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-11 08:25:20 +01:00
Ingo Molnar
4c8ee71620 Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-11 08:25:07 +01:00
Tobias Klauser
de464375da bpf: Remove unused but set variables
Remove the unused but set variables min_set and max_set in
adjust_reg_min_max_vals to fix the following warning when building with
'W=1':

  kernel/bpf/verifier.c:1483:7: warning: variable ‘min_set’ set but not used [-Wunused-but-set-variable]

There is no warning about max_set being unused, but since it is only
used in the assignment of min_set it can be removed as well.

They were introduced in commit 484611357c ("bpf: allow access into map
value arrays") but seem to have never been used.

Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 21:15:00 -05:00
Sebastian Andrzej Siewior
90b14889d2 kernel/printk: Convert to hotplug state machine
Install the callbacks via the state machine.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161103145021.28528-3-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-09 23:45:26 +01:00
Christoph Hellwig
67c93c218d genirq/affinity: Handle pre/post vectors in irq_create_affinity_masks()
Only calculate the affinity for the main I/O vectors, and skip the
pre or post vectors specified by struct irq_affinity.

Also remove the irq_affinity cpumask argument that has never been used.
If we ever need it in the future we can pass it through struct
irq_affinity.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Link: http://lkml.kernel.org/r/1478654107-7384-4-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-09 08:25:09 +01:00
Christoph Hellwig
212bd84622 genirq/affinity: Handle pre/post vectors in irq_calc_affinity_vectors()
Only calculate the affinity for the main I/O vectors, and skip the pre or
post vectors specified by struct irq_affinity.

Also remove the irq_affinity cpumask argument that has never been used.  If
we ever need it in the future we can pass it through struct irq_affinity.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Link: http://lkml.kernel.org/r/1478654107-7384-3-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-09 08:25:08 +01:00
Thomas Gleixner
7ee7e87dfb genirq: Use irq type from irqdata instead of irqdesc
The type flags in the irq descriptor are there for historical reasons and
only updated via irq_modify_status() or irq_set_type(). Both functions also
update the type flags in irqdata. __setup_irq() is the only left over user
of the type flags in the irq descriptor.

If __setup_irq() is called with empty irq type flags, then the type flags
are retrieved from irqdata. If an interrupt is shared, then the type flags
are compared with the type flags stored in the irq descriptor. 

On x86 the ioapic does not have a irq_set_type() callback because the type
is defined in the BIOS tables and cannot be changed. The type is stored in
irqdata at setup time without updating the type data in the irq
descriptor. As a result the comparison described above fails.

There is no point in updating the irq descriptor flags because the only
relevant storage is irqdata. Use the type flags from irqdata for both
retrieval and comparison in __setup_irq() instead.

Aside of that the print out in case of non matching type flags has the old
and new type flags arguments flipped. Fix that as well.

For correctness sake the flags stored in the irq descriptor should be
removed, but this is beyond the scope of this bugfix and will be done in a
later patch.

Fixes: 4b357daed6 ("genirq: Look-up trigger type if not specified by caller")
Reported-and-tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Jon Hunter <jonathanh@nvidia.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1611072020360.3501@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-08 15:15:19 +01:00
Daniel Borkmann
20b2b24f91 bpf: fix map not being uncharged during map creation failure
In map_create(), we first find and create the map, then once that
suceeded, we charge it to the user's RLIMIT_MEMLOCK, and then fetch
a new anon fd through anon_inode_getfd(). The problem is, once the
latter fails f.e. due to RLIMIT_NOFILE limit, then we only destruct
the map via map->ops->map_free(), but without uncharging the previously
locked memory first. That means that the user_struct allocation is
leaked as well as the accounted RLIMIT_MEMLOCK memory not released.
Make the label names in the fix consistent with bpf_prog_load().

Fixes: aaac3ba95e ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 13:22:26 -05:00
Daniel Borkmann
483bed2b0d bpf: fix htab map destruction when extra reserve is in use
Commit a6ed3ea65d ("bpf: restore behavior of bpf_map_update_elem")
added an extra per-cpu reserve to the hash table map to restore old
behaviour from pre prealloc times. When non-prealloc is in use for a
map, then problem is that once a hash table extra element has been
linked into the hash-table, and the hash table is destroyed due to
refcount dropping to zero, then htab_map_free() -> delete_all_elements()
will walk the whole hash table and drop all elements via htab_elem_free().
The problem is that the element from the extra reserve is first fed
to the wrong backend allocator and eventually freed twice.

Fixes: a6ed3ea65d ("bpf: restore behavior of bpf_map_update_elem")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07 13:20:52 -05:00
Linus Torvalds
ffbcbfca84 Merge branches 'sched-urgent-for-linus' and 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull stack vmap fixups from Thomas Gleixner:
 "Two small patches related to sched_show_task():

   - make sure to hold a reference on the task stack while accessing it

   - remove the thread_saved_pc printout

  .. and add a sanity check into release_task_stack() to catch problems
  with task stack references"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Remove pointless printout in sched_show_task()
  sched/core: Fix oops in sched_show_task()

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  fork: Add task stack refcounting sanity check and prevent premature task stack freeing
2016-11-05 11:46:02 -07:00
WANG Cong
243d521261 taskstats: fix the length of cgroupstats_cmd_get_policy
cgroupstats_cmd_get_policy is [CGROUPSTATS_CMD_ATTR_MAX+1],
taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1],
but their family.maxattr is TASKSTATS_CMD_ATTR_MAX.
CGROUPSTATS_CMD_ATTR_MAX is less than TASKSTATS_CMD_ATTR_MAX,
so we could end up accessing out-of-bound.

Change cgroupstats_cmd_get_policy to TASKSTATS_CMD_ATTR_MAX+1,
this is safe because the rest are initialized to 0's.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:55:58 -04:00
Linus Torvalds
8243d55977 sched/core: Remove pointless printout in sched_show_task()
In sched_show_task() we print out a useless hex number, not even a
symbol, and there's a big question mark whether this even makes sense
anyway, I suspect we should just remove it all.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@alien8.de
Cc: brgerst@gmail.com
Cc: jann@thejh.net
Cc: keescook@chromium.org
Cc: linux-api@vger.kernel.org
Cc: tycho.andersen@canonical.com
Link: http://lkml.kernel.org/r/CA+55aFzphURPFzAvU4z6Moy7ZmimcwPuUdYU8bj9z0J+S8X1rw@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-03 07:31:34 +01:00
Tetsuo Handa
382005027f sched/core: Fix oops in sched_show_task()
When CONFIG_THREAD_INFO_IN_TASK=y, it is possible that an exited thread
remains in the task list after its stack pointer was already set to NULL.

Therefore, thread_saved_pc() and stack_not_used() in sched_show_task()
will trigger NULL pointer dereference if an attempt to dump such thread's
traces (e.g. SysRq-t, khungtaskd) is made.

Since show_stack() in sched_show_task() calls try_get_task_stack() and
sched_show_task() is called from interrupt context, calling
try_get_task_stack() from sched_show_task() will be safe as well.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@alien8.de
Cc: brgerst@gmail.com
Cc: jann@thejh.net
Cc: keescook@chromium.org
Cc: linux-api@vger.kernel.org
Cc: tycho.andersen@canonical.com
Link: http://lkml.kernel.org/r/201611021950.FEJ34368.HFFJOOMLtQOVSF@I-love.SAKURA.ne.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-03 07:27:34 +01:00
Johan Hovold
ceb75787bc PM / sleep: fix device reference leak in test_suspend
Make sure to drop the reference taken by class_find_device() after
opening the RTC device.

Fixes: 77437fd4e6 (pm: boot time suspend selftest)
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-11-02 05:10:04 +01:00
Mickaël Salaün
285fdfc5d9 seccomp: Fix documentation
Fix struct seccomp_filter and seccomp_run_filters() signatures.

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: James Morris <jmorris@namei.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Drewry <wad@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-11-01 08:54:26 -07:00
Christoph Hellwig
70fd76140a block,fs: use REQ_* flags directly
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly.  Where applicable this also drops usage of the
bio_set_op_attrs wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Ingo Molnar
05b93c19d5 Merge branch 'linus' into x86/asm, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-01 07:41:06 +01:00
Andy Lutomirski
405c075971 fork: Add task stack refcounting sanity check and prevent premature task stack freeing
If something goes wrong with task stack refcounting and a stack
refcount hits zero too early, warn and leak it rather than
potentially freeing it early (and silently).

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f29119c783a9680a4b4656e751b6123917ace94b.1477926663.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-01 07:39:17 +01:00
Daniel Borkmann
0f98621bef bpf, inode: add support for symlinks and fix mtime/ctime
While commit bb35a6ef7d ("bpf, inode: allow for rename and link ops")
added support for hard links that can be used for prog and map nodes,
this work adds simple symlink support, which can be used f.e. for
directories also when unpriviledged and works with cmdline tooling that
understands S_IFLNK anyway. Since the switch in e27f4a942a ("bpf: Use
mount_nodev not mount_ns to mount the bpf filesystem"), there can be
various mount instances with mount_nodev() and thus hierarchy can be
flattened to facilitate object sharing. Thus, we can keep bpf tooling
also working by repointing paths.

Most of the functionality can be used from vfs library operations. The
symlink is stored in the inode itself, that is in i_link, which is
sufficient in our case as opposed to storing it in the page cache.
While at it, I noticed that bpf_mkdir() and bpf_mkobj() don't update
the directories mtime and ctime, so add a common helper for it called
bpf_dentry_finalize() that takes care of it for all cases now.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 15:28:11 -04:00
David S. Miller
27058af401 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Mostly simple overlapping changes.

For example, David Ahern's adjacency list revamp in 'net-next'
conflicted with an adjacency list traversal bug fix in 'net'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-30 12:42:58 -04:00
Thomas Graf
ebb676daa1 bpf: Print function name in addition to function id
The verifier currently prints raw function ids when printing CALL
instructions or when complaining:

	5: (85) call 23
	unknown func 23

print a meaningful function name instead:

	5: (85) call bpf_redirect#23
	unknown func bpf_redirect#23

Moves the function documentation to a single comment and renames all
helpers names in the list to conform to the bpf_ prefix notation so
they can be greped in the kernel source.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:55:13 -04:00
Linus Torvalds
b546e0c289 Power management fixes for v4.9-rc3
Specifics:
 
  - Fix a missing KERN_CONT in a system suspend message by converting
    the affected code to using pr_info() and pr_cont() instead of the
    "raw" printk() (Jon Hunter).
 
  - Make intel_pstate set the CPU P-state from its .set_policy()
    callback when the scaling_governor sysfs attribute is set to
    "performance" so that it interacts with NOHZ_FULL more
    predictably which was the case before 4.7 (Rafael Wysocki).
 
  - Make intel_pstate always request the maximum allowed P-state when
    the scaling_governor sysfs attribute is set to "performance" to
    prevent it from effectively ingoring that setting is some
    situations (Rafael Wysocki).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJYE+vbAAoJEILEb/54YlRxsqQQAIZLfm5Isp4L0JBt2svtLYuO
 5mLgGOfcpsHIpqq5+saKtK+JOcCpm/MJ0PMQdv3WhKkFGXG9++Dz6F2gNNFlE+Sy
 era/B9NFry1BsQ14xsX8XonYRkQ2tS+CgOw+akSZl9BTVeX57VASqPIzlKI4Auz/
 Ru5M8n4QWAD7dPWnqHTFHYbmTnMtr18wn0mNjiC/cLz7TPxmGM9Pe730/bpGKMDC
 DQpflnpT/ktqV1imDuaiMeuP2re2DSZWMrirZo326UsDtcKN1A4sCO+PPUUt/TTb
 k6pKsZ9OB9FgMmaRqx5G6wCD3NvuhKa1JOqg3RvAVZ0SmMcl2AFhCaoEJtX+TKvz
 3365o2tSQeGc3bmXTb7YqP2SaYm107rOG/M0I2JuA7qNUKxk8MFfbNdaYxTY2ezT
 D5NF0SCycOMeEXIDTTuZ2oN1NHR6osO7bNFOIacKlcIL8PmEl8fDqX2KdYQze5xh
 AGsa1Oee7FPrDd2kQIL6HNGgVjd6gtHU9cVjyzQeUv9SU6zwOlu4/BbZI+wIterN
 L2OfW7w6/KSSyScDZ3TbifeJwpP918bUv578dEP6U/2VKthPBUPY/MYMMQzSuEIn
 RAQoDShqacIjdyLZtStii28KctMOk0L/edEDbdFmyu2ttNwpB+hk2ieCYWfI8wAa
 2SAaYyRGpC2nafsSTv+E
 =RtkB
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix two intel_pstate issues related to the way it works when the
  scaling_governor sysfs attribute is set to "performance" and fix up
  messages in the system suspend core code.

  Specifics:

   - Fix a missing KERN_CONT in a system suspend message by converting
     the affected code to using pr_info() and pr_cont() instead of the
     "raw" printk() (Jon Hunter).

   - Make intel_pstate set the CPU P-state from its .set_policy()
     callback when the scaling_governor sysfs attribute is set to
     "performance" so that it interacts with NOHZ_FULL more predictably
     which was the case before 4.7 (Rafael Wysocki).

   - Make intel_pstate always request the maximum allowed P-state when
     the scaling_governor sysfs attribute is set to "performance" to
     prevent it from effectively ingoring that setting is some
     situations (Rafael Wysocki)"

* tag 'pm-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: intel_pstate: Always set max P-state in performance mode
  PM / suspend: Fix missing KERN_CONT for suspend message
  cpufreq: intel_pstate: Set P-state upfront in performance mode
2016-10-28 18:29:13 -07:00
Linus Torvalds
b49c3170bf Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc kernel fixes: a virtualization environment related fix, an uncore
  PMU driver removal handling fix, a PowerPC fix and new events for
  Knights Landing"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Honour the CPUID for number of fixed counters in hypervisors
  perf/powerpc: Don't call perf_event_disable() from atomic context
  perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix CONFIG_DEBUG_TEST_DRIVER_REMOVE=y kernel panic
  perf/x86/intel/cstate: Add C-state residency events for Knights Landing
2016-10-28 16:27:16 -07:00
Linus Torvalds
a8006bd915 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
 "Fix four timer locking races: two were noticed by Linus while
  reviewing the code while chasing for a corruption bug, and two
  from fixing spurious USB timeouts"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers: Prevent base clock corruption when forwarding
  timers: Prevent base clock rewind when forwarding clock
  timers: Lock base for same bucket optimization
  timers: Plug locking race vs. timer migration
2016-10-28 11:26:01 -07:00
Linus Torvalds
965c4b7e1a Merge branches 'core-urgent-for-linus', 'irq-urgent-for-linus' and 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool, irq and scheduler fixes from Ingo Molnar:
 "One more objtool fixlet for GCC6 code generation patterns, an irq
  DocBook fix and an unused variable warning fix in the scheduler"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Fix rare switch jump table pattern detection

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  doc: Add missing parameter for msi_setup

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Remove unused but set variable 'rq'
2016-10-28 10:12:27 -07:00
Christoph Hellwig
ef295ecf09 block: better op and flags encoding
Now that we don't need the common flags to overflow outside the range
of a 32-bit type we can encode them the same way for both the bio and
request fields.  This in addition allows us to place the operation
first (and make some room for more ops while we're at it) and to
stop having to shift around the operation values.

In addition this allows passing around only one value in the block layer
instead of two (and eventuall also in the file systems, but we can do
that later) and thus clean up a lot of code.

Last but not least this allows decreasing the size of the cmd_flags
field in struct request to 32-bits.  Various functions passing this
value could also be updated, but I'd like to avoid the churn for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28 08:48:16 -06:00
Jiri Olsa
5aab90ce1e perf/powerpc: Don't call perf_event_disable() from atomic context
The trinity syscall fuzzer triggered following WARN() on powerpc:

  WARNING: CPU: 9 PID: 2998 at arch/powerpc/kernel/hw_breakpoint.c:278
  ...
  NIP [c00000000093aedc] .hw_breakpoint_handler+0x28c/0x2b0
  LR [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0
  Call Trace:
  [c0000002f7933580] [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0 (unreliable)
  [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
  [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
  [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
  [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
  [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48

Followed by a lockdep warning:

  ===============================
  [ INFO: suspicious RCU usage. ]
  4.8.0-rc5+ #7 Tainted: G        W
  -------------------------------
  ./include/linux/rcupdate.h:556 Illegal context switch in RCU read-side critical section!

  other info that might help us debug this:

  rcu_scheduler_active = 1, debug_locks = 0
  2 locks held by ls/2998:
   #0:  (rcu_read_lock){......}, at: [<c0000000000f6a00>] .__atomic_notifier_call_chain+0x0/0x1c0
   #1:  (rcu_read_lock){......}, at: [<c00000000093ac50>] .hw_breakpoint_handler+0x0/0x2b0

  stack backtrace:
  CPU: 9 PID: 2998 Comm: ls Tainted: G        W       4.8.0-rc5+ #7
  Call Trace:
  [c0000002f7933150] [c00000000094b1f8] .dump_stack+0xe0/0x14c (unreliable)
  [c0000002f79331e0] [c00000000013c468] .lockdep_rcu_suspicious+0x138/0x180
  [c0000002f7933270] [c0000000001005d8] .___might_sleep+0x278/0x2e0
  [c0000002f7933300] [c000000000935584] .mutex_lock_nested+0x64/0x5a0
  [c0000002f7933410] [c00000000023084c] .perf_event_ctx_lock_nested+0x16c/0x380
  [c0000002f7933500] [c000000000230a80] .perf_event_disable+0x20/0x60
  [c0000002f7933580] [c00000000093aeec] .hw_breakpoint_handler+0x29c/0x2b0
  [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
  [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
  [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
  [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
  [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48

While it looks like the first WARN() is probably valid, the other one is
triggered by disabling event via perf_event_disable() from atomic context.

The event is disabled here in case we were not able to emulate
the instruction that hit the breakpoint. By disabling the event
we unschedule the event and make sure it's not scheduled back.

But we can't call perf_event_disable() from atomic context, instead
we need to use the event's pending_disable irq_work method to disable it.

Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161026094824.GA21397@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-28 11:06:25 +02:00
Jiri Olsa
0933840acf perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix CONFIG_DEBUG_TEST_DRIVER_REMOVE=y kernel panic
CAI Qian reported a crash in the PMU uncore device removal code,
enabled by the CONFIG_DEBUG_TEST_DRIVER_REMOVE=y option:

  https://marc.info/?l=linux-kernel&m=147688837328451

The reason for the crash is that perf_pmu_unregister() tries to remove
a PMU device which is not added at this point. We add PMU devices
only after pmu_bus is registered, which happens in the
perf_event_sysfs_init() call and sets the 'pmu_bus_running' flag.

The fix is to get the 'pmu_bus_running' flag state at the point
the PMU is taken out of the PMU list and remove the device
later only if it's set.

Reported-by: CAI Qian <caiqian@redhat.com>
Tested-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161020111011.GA13361@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-28 11:06:25 +02:00
Andrey Konovalov
b274c0bb39 kcov: properly check if we are in an interrupt
in_interrupt() returns a nonzero value when we are either in an
interrupt or have bh disabled via local_bh_disable().  Since we are
interested in only ignoring coverage from actual interrupts, do a proper
check instead of just calling in_interrupt().

As a result of this change, kcov will start to collect coverage from
within local_bh_disable()/local_bh_enable() sections.

Link: http://lkml.kernel.org/r/1476115803-20712-1-git-send-email-andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Nicolai Stange <nicstange@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: James Morse <james.morse@arm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-27 18:43:42 -07:00
Johannes Berg
56989f6d85 genetlink: mark families as __ro_after_init
Now genl_register_family() is the only thing (other than the
users themselves, perhaps, but I didn't find any doing that)
writing to the family struct.

In all families that I found, genl_register_family() is only
called from __init functions (some indirectly, in which case
I've add __init annotations to clarifly things), so all can
actually be marked __ro_after_init.

This protects the data structure from accidental corruption.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg
489111e5c2 genetlink: statically initialize families
Instead of providing macros/inline functions to initialize
the families, make all users initialize them statically and
get rid of the macros.

This reduces the kernel code size by about 1.6k on x86-64
(with allyesconfig).

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg
a07ea4d994 genetlink: no longer support using static family IDs
Static family IDs have never really been used, the only
use case was the workaround I introduced for those users
that assumed their family ID was also their multicast
group ID.

Additionally, because static family IDs would never be
reserved by the generic netlink code, using a relatively
low ID would only work for built-in families that can be
registered immediately after generic netlink is started,
which is basically only the control family (apart from
the workaround code, which I also had to add code for so
it would reserve those IDs)

Thus, anything other than GENL_ID_GENERATE is flawed and
luckily not used except in the cases I mentioned. Move
those workarounds into a few lines of code, and then get
rid of GENL_ID_GENERATE entirely, making it more robust.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Linus Torvalds
9c953d639c Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A set of fixes for this series, most notably the fix for the blk-mq
  software queue regression in from this merge window.

  Apart from that, a fix for an unlikely hang if a queue is flooded with
  FUA requests from Ming, and a few small fixes for nbd and badblocks.
  Lastly, a rename update for the proc softirq output, since the block
  polling code was made generic"

* 'for-linus' of git://git.kernel.dk/linux-block:
  blk-mq: update hardware and software queues for sleeping alloc
  block: flush: fix IO hang in case of flood fua req
  nbd: fix incorrect unlock of nbd->sock_lock in sock_shutdown
  badblocks: badblocks_set/clear update unacked_exist
  softirq: Display IRQ_POLL for irq-poll statistics
2016-10-27 10:05:31 -07:00
Linus Torvalds
9dcb8b685f mm: remove per-zone hashtable of bitlock waitqueues
The per-zone waitqueues exist because of a scalability issue with the
page waitqueues on some NUMA machines, but it turns out that they hurt
normal loads, and now with the vmalloced stacks they also end up
breaking gfs2 that uses a bit_wait on a stack object:

     wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)

where 'gh' can be a reference to the local variable 'mount_gh' on the
stack of fill_super().

The reason the per-zone hash table breaks for this case is that there is
no "zone" for virtual allocations, and trying to look up the physical
page to get at it will fail (with a BUG_ON()).

It turns out that I actually complained to the mm people about the
per-zone hash table for another reason just a month ago: the zone lookup
also hurts the regular use of "unlock_page()" a lot, because the zone
lookup ends up forcing several unnecessary cache misses and generates
horrible code.

As part of that earlier discussion, we had a much better solution for
the NUMA scalability issue - by just making the page lock have a
separate contention bit, the waitqueue doesn't even have to be looked at
for the normal case.

Peter Zijlstra already has a patch for that, but let's see if anybody
even notices.  In the meantime, let's fix the actual gfs2 breakage by
simplifying the bitlock waitqueues and removing the per-zone issue.

Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Tested-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-27 09:27:57 -07:00