No description
Find a file
Johannes Weiner 0149a66e74 mm: vmscan: fix extreme overreclaim and swap floods
commit f53af4285d upstream.

During proactive reclaim, we sometimes observe severe overreclaim, with
several thousand times more pages reclaimed than requested.

This trace was obtained from shrink_lruvec() during such an instance:

    prio:0 anon_cost:1141521 file_cost:7767
    nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
    nr=[7161123 345 578 1111]

While he reclaimer requested 4M, vmscan reclaimed close to 16G, most of it
by swapping.  These requests take over a minute, during which the write()
to memory.reclaim is unkillably stuck inside the kernel.

Digging into the source, this is caused by the proportional reclaim
bailout logic.  This code tries to resolve a fundamental conflict: to
reclaim roughly what was requested, while also aging all LRUs fairly and
in accordance to their size, swappiness, refault rates etc.  The way it
attempts fairness is that once the reclaim goal has been reached, it stops
scanning the LRUs with the smaller remaining scan targets, and adjusts the
remainder of the bigger LRUs according to how much of the smaller LRUs was
scanned.  It then finishes scanning that remainder regardless of the
reclaim goal.

This works fine if priority levels are low and the LRU lists are
comparable in size.  However, in this instance, the cgroup that is
targeted by proactive reclaim has almost no files left - they've already
been squeezed out by proactive reclaim earlier - and the remaining anon
pages are hot.  Anon rotations cause the priority level to drop to 0,
which results in reclaim targeting all of anon (a lot) and all of file
(almost nothing).  By the time reclaim decides to bail, it has scanned
most or all of the file target, and therefor must also scan most or all of
the enormous anon target.  This target is thousands of times larger than
the reclaim goal, thus causing the overreclaim.

The bailout code hasn't changed in years, why is this failing now?  The
most likely explanations are two other recent changes in anon reclaim:

1. Before the series starting with commit 5df741963d ("mm: fix LRU
   balancing effect of new transparent huge pages"), the VM was
   overall relatively reluctant to swap at all, even if swap was
   configured. This means the LRU balancing code didn't come into play
   as often as it does now, and mostly in high pressure situations
   where pronounced swap activity wouldn't be as surprising.

2. For historic reasons, shrink_lruvec() loops on the scan targets of
   all LRU lists except the active anon one, meaning it would bail if
   the only remaining pages to scan were active anon - even if there
   were a lot of them.

   Before the series starting with commit ccc5dc6734 ("mm/vmscan:
   make active/inactive ratio as 1:1 for anon lru"), most anon pages
   would live on the active LRU; the inactive one would contain only a
   handful of preselected reclaim candidates. After the series, anon
   gets aged similarly to file, and the inactive list is the default
   for new anon pages as well, making it often the much bigger list.

   As a result, the VM is now more likely to actually finish large
   anon targets than before.

Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward the
larger LRU lists is made before bailing out on a met reclaim goal.

This fixes the extreme overreclaim problem.

Fairness is more subtle and harder to evaluate.  No obvious misbehavior
was observed on the test workload, in any case.  Conceptually, fairness
should primarily be a cumulative effect from regular, lower priority
scans.  Once the VM is in trouble and needs to escalate scan targets to
make forward progress, fairness needs to take a backseat.  This is also
acknowledged by the myriad exceptions in get_scan_count().  This patch
makes fairness decrease gradually, as it keeps fairness work static over
increasing priority levels with growing scan targets.  This should make
more sense - although we may have to re-visit the exact values.

Link: https://lkml.kernel.org/r/20220802162811.39216-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-19 17:56:41 +08:00
arch arm64: dts: rockchip: lower rk3399-puma-haikou SD controller clock frequency 2023-04-19 17:56:40 +08:00
block block, bfq: fix null pointer dereference in bfq_bio_bfqg() 2023-04-19 17:56:33 +08:00
certs certs/blacklist_hashes.c: fix const confusion in certs blacklist 2023-04-19 17:50:34 +08:00
crypto crypto: akcipher - default implementation for setting a private key 2023-04-19 17:55:27 +08:00
Documentation docs: update mediator contact information in CoC doc 2023-04-19 17:56:27 +08:00
drivers usb: dwc3: gadget: Clear ep descriptor last 2023-04-19 17:56:41 +08:00
fs nilfs2: fix nilfs_sufile_mark_dirty() not set segment usage as dirty 2023-04-19 17:56:41 +08:00
include rxrpc: Use refcount_t rather than atomic_t 2023-04-19 17:56:35 +08:00
init init/Kconfig: fix CC_HAS_ASM_GOTO_TIED_OUTPUT test with dash 2023-04-19 17:56:40 +08:00
ipc ipc/mqueue: use get_tree_nodev() in mqueue_get_tree() 2023-04-19 17:49:52 +08:00
kernel gcov: clang: fix the buffer overflow issue 2023-04-19 17:56:41 +08:00
lib lib/vdso: use "grep -E" instead of "egrep" 2023-04-19 17:56:40 +08:00
LICENSES LICENSES/dual/CC-BY-4.0: Git rid of "smart quotes" 2021-07-15 06:31:24 -06:00
mm mm: vmscan: fix extreme overreclaim and swap floods 2023-04-19 17:56:41 +08:00
net ipv4: Fix error return code in fib_table_insert() 2023-04-19 17:56:39 +08:00
samples samples/landlock: Format with clang-format 2023-04-19 17:50:01 +08:00
scripts cert host tools: Stop complaining about deprecated OpenSSL functions 2023-04-19 17:56:14 +08:00
security capabilities: fix potential memleak on error path from vfs_getxattr_alloc() 2023-04-19 17:56:03 +08:00
sound ASoC: max98373: Add checks for devm_kcalloc 2023-04-19 17:56:34 +08:00
tools selftests: mptcp: fix mibit vs mbit mix up 2023-04-19 17:56:35 +08:00
usr usr/include/Makefile: add linux/nfc.h to the compile-test coverage 2023-04-19 17:44:58 +08:00
virt kvm: Add support for arch compat vm ioctls 2023-04-19 17:55:40 +08:00
.clang-format clang-format: Update with the latest for_each macro list 2021-05-12 23:32:39 +02:00
.cocciconfig
.get_maintainer.ignore Opt out of scripts/get_maintainer.pl 2019-05-16 10:53:40 -07:00
.gitattributes .gitattributes: use 'dts' diff driver for dts files 2019-12-04 19:44:11 -08:00
.gitignore .gitignore: ignore only top-level modules.builtin 2021-05-02 00:43:35 +09:00
.mailmap mailmap: add Andrej Shadura 2021-10-18 20:22:03 -10:00
COPYING COPYING: state that all contributions really are covered by this file 2020-02-10 13:32:20 -08:00
CREDITS MAINTAINERS: Move Daniel Drake to credits 2021-09-21 08:34:58 +03:00
Kbuild kbuild: rename hostprogs-y/always to hostprogs/always-y 2020-02-04 01:53:07 +09:00
Kconfig kbuild: ensure full rebuild when the compiler is updated 2020-05-12 13:28:33 +09:00
MAINTAINERS Input: goodix - add a goodix.h header file 2023-04-19 17:50:59 +08:00
Makefile Linux 5.15.80 2023-04-19 17:56:30 +08:00
README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.