From 1f7a44f63e6c782c9c2aa9f18f40c23914e6b46a Mon Sep 17 00:00:00 2001 From: Nick Desaulniers Date: Tue, 13 Oct 2020 16:47:33 -0700 Subject: [PATCH 001/181] compiler-clang: add build check for clang 10.0.1 Patch series "set clang minimum version to 10.0.1", v3. Adds a compile time #error to compiler-clang.h setting the effective minimum supported version to clang 10.0.1. A separate patch has already been picked up into the Documentation/ tree also confirming the version. Next are a series of reverts. One for 32b arm is a partial revert. Then Marco suggested fixes to KASAN docs. Finally, improve the warning for GCC too as per Kees. This patch (of 7): During Plumbers 2020, we voted to just support the latest release of Clang for now. Add a compile time check for this. We plan to remove workarounds for older versions now, which will break in subtle and not so subtle ways. Suggested-by: Sedat Dilek Suggested-by: Nathan Chancellor Suggested-by: Kees Cook Signed-off-by: Nick Desaulniers Signed-off-by: Andrew Morton Tested-by: Sedat Dilek Reviewed-by: Kees Cook Reviewed-by: Miguel Ojeda Reviewed-by: Sedat Dilek Acked-by: Marco Elver Acked-by: Nathan Chancellor Acked-by: Sedat Dilek Cc: Andrey Konovalov Cc: Fangrui Song Cc: Masahiro Yamada Cc: Daniel Borkmann Cc: Alexei Starovoitov Cc: Will Deacon Cc: Vincenzo Frascino Link: https://lkml.kernel.org/r/20200902225911.209899-1-ndesaulniers@google.com Link: https://lkml.kernel.org/r/20200902225911.209899-2-ndesaulniers@google.com Link: https://github.com/ClangBuiltLinux/linux/issues/9 Link: https://github.com/ClangBuiltLinux/linux/issues/941 Signed-off-by: Linus Torvalds --- include/linux/compiler-clang.h | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/include/linux/compiler-clang.h b/include/linux/compiler-clang.h index cee0c728d39a..230604e7f057 100644 --- a/include/linux/compiler-clang.h +++ b/include/linux/compiler-clang.h @@ -3,6 +3,14 @@ #error "Please don't include directly, include instead." #endif +#define CLANG_VERSION (__clang_major__ * 10000 \ + + __clang_minor__ * 100 \ + + __clang_patchlevel__) + +#if CLANG_VERSION < 100001 +# error Sorry, your version of Clang is too old - please use 10.0.1 or newer. +#endif + /* Compiler specific definitions for Clang compiler */ /* same as gcc, this was present in clang-2.6 so we can assume it works From 4c207c50ea35abf98f087bbcda8c95b23c4bb3b1 Mon Sep 17 00:00:00 2001 From: Nick Desaulniers Date: Tue, 13 Oct 2020 16:47:37 -0700 Subject: [PATCH 002/181] Revert "kbuild: disable clang's default use of -fmerge-all-constants" This reverts commit 87e0d4f0f37fb0c8c4aeeac46fff5e957738df79. -fno-merge-all-constants has been the default since clang-6; the minimum supported version of clang in the kernel is clang-10 (10.0.1). Suggested-by: Nathan Chancellor Signed-off-by: Nick Desaulniers Signed-off-by: Andrew Morton Tested-by: Sedat Dilek Reviewed-by: Fangrui Song Reviewed-by: Nathan Chancellor Reviewed-by: Sedat Dilek Reviewed-by: Kees Cook Cc: Andrey Konovalov Cc: Marco Elver Cc: Miguel Ojeda Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: Masahiro Yamada Cc: Vincenzo Frascino Cc: Will Deacon Link: https://lkml.kernel.org/r/20200902225911.209899-3-ndesaulniers@google.com Link: https://reviews.llvm.org/rL329300. Link: https://github.com/ClangBuiltLinux/linux/issues/9 Signed-off-by: Linus Torvalds --- Makefile | 9 --------- 1 file changed, 9 deletions(-) diff --git a/Makefile b/Makefile index 51540b291738..342d26432553 100644 --- a/Makefile +++ b/Makefile @@ -921,15 +921,6 @@ KBUILD_CFLAGS += $(call cc-disable-warning, maybe-uninitialized) # disable invalid "can't wrap" optimizations for signed / pointers KBUILD_CFLAGS += $(call cc-option,-fno-strict-overflow) -# clang sets -fmerge-all-constants by default as optimization, but this -# is non-conforming behavior for C and in fact breaks the kernel, so we -# need to disable it here generally. -KBUILD_CFLAGS += $(call cc-option,-fno-merge-all-constants) - -# for gcc -fno-merge-all-constants disables everything, but it is fine -# to have actual conforming behavior enabled. -KBUILD_CFLAGS += $(call cc-option,-fmerge-constants) - # Make sure -fstack-check isn't enabled (like gentoo apparently did) KBUILD_CFLAGS += $(call cc-option,-fno-stack-check,) From 2980e6070eefc0e0b67e223ed4e1db4f730c6f69 Mon Sep 17 00:00:00 2001 From: Nick Desaulniers Date: Tue, 13 Oct 2020 16:47:40 -0700 Subject: [PATCH 003/181] Revert "arm64: bti: Require clang >= 10.0.1 for in-kernel BTI support" This reverts commit b9249cba25a5dce5de87e5404503a5e11832c2dd. The minimum supported version of clang is now 10.0.1. Suggested-by: Nathan Chancellor Signed-off-by: Nick Desaulniers Signed-off-by: Andrew Morton Reviewed-by: Kees Cook Reviewed-by: Nathan Chancellor Cc: Andrey Konovalov Cc: Fangrui Song Cc: Marco Elver Cc: Miguel Ojeda Cc: Sedat Dilek Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: Masahiro Yamada Cc: Vincenzo Frascino Cc: Will Deacon Link: https://lkml.kernel.org/r/20200902225911.209899-4-ndesaulniers@google.com Signed-off-by: Linus Torvalds --- arch/arm64/Kconfig | 2 -- 1 file changed, 2 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 4b136e923ccb..583b6a60b094 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1612,8 +1612,6 @@ config ARM64_BTI_KERNEL depends on CC_HAS_BRANCH_PROT_PAC_RET_BTI # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94697 depends on !CC_IS_GCC || GCC_VERSION >= 100100 - # https://reviews.llvm.org/rGb8ae3fdfa579dbf366b1bb1cbfdbf8c51db7fa55 - depends on !CC_IS_CLANG || CLANG_VERSION >= 100001 depends on !(CC_IS_CLANG && GCOV_KERNEL) depends on (!FUNCTION_GRAPH_TRACER || DYNAMIC_FTRACE_WITH_REGS) help From 3759da22e5c0dfc25ee5296ca470262204ba35a8 Mon Sep 17 00:00:00 2001 From: Nick Desaulniers Date: Tue, 13 Oct 2020 16:47:44 -0700 Subject: [PATCH 004/181] Revert "arm64: vdso: Fix compilation with clang older than 8" This reverts commit 3acf4be235280f14d838581a750532219d67facc. The minimum supported version of clang is clang 10.0.1. Suggested-by: Nathan Chancellor Signed-off-by: Nick Desaulniers Signed-off-by: Andrew Morton Reviewed-by: Kees Cook Reviewed-by: Nathan Chancellor Cc: Andrey Konovalov Cc: Fangrui Song Cc: Marco Elver Cc: Miguel Ojeda Cc: Sedat Dilek Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: Masahiro Yamada Cc: Vincenzo Frascino Cc: Will Deacon Link: https://lkml.kernel.org/r/20200902225911.209899-5-ndesaulniers@google.com Signed-off-by: Linus Torvalds --- arch/arm64/kernel/vdso/Makefile | 7 ------- 1 file changed, 7 deletions(-) diff --git a/arch/arm64/kernel/vdso/Makefile b/arch/arm64/kernel/vdso/Makefile index 45d5cfe46429..04021a93171c 100644 --- a/arch/arm64/kernel/vdso/Makefile +++ b/arch/arm64/kernel/vdso/Makefile @@ -43,13 +43,6 @@ ifneq ($(c-gettimeofday-y),) CFLAGS_vgettimeofday.o += -include $(c-gettimeofday-y) endif -# Clang versions less than 8 do not support -mcmodel=tiny -ifeq ($(CONFIG_CC_IS_CLANG), y) - ifeq ($(shell test $(CONFIG_CLANG_VERSION) -lt 80000; echo $$?),0) - CFLAGS_REMOVE_vgettimeofday.o += -mcmodel=tiny - endif -endif - # Disable gcov profiling for VDSO code GCOV_PROFILE := n From 3511af0a72efb2ba5df7f1b4c8c1bf3b1a19a9ea Mon Sep 17 00:00:00 2001 From: Nick Desaulniers Date: Tue, 13 Oct 2020 16:47:48 -0700 Subject: [PATCH 005/181] Partially revert "ARM: 8905/1: Emit __gnu_mcount_nc when using Clang 10.0.0 or newer" This partially reverts commit b0fe66cf095016e0b238374c10ae366e1f087d11. The minimum supported version of clang is now clang 10.0.1. We still want to pass -meabi=gnu. Suggested-by: Nathan Chancellor Signed-off-by: Nick Desaulniers Signed-off-by: Andrew Morton Reviewed-by: Kees Cook Reviewed-by: Nathan Chancellor Cc: Andrey Konovalov Cc: Fangrui Song Cc: Marco Elver Cc: Miguel Ojeda Cc: Sedat Dilek Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: Masahiro Yamada Cc: Vincenzo Frascino Cc: Will Deacon Link: https://lkml.kernel.org/r/20200902225911.209899-6-ndesaulniers@google.com Signed-off-by: Linus Torvalds --- arch/arm/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index e67ef15c800f..1ed277f2e4fc 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -84,7 +84,7 @@ config ARM select HAVE_FAST_GUP if ARM_LPAE select HAVE_FTRACE_MCOUNT_RECORD if !XIP_KERNEL select HAVE_FUNCTION_GRAPH_TRACER if !THUMB2_KERNEL && !CC_IS_CLANG - select HAVE_FUNCTION_TRACER if !XIP_KERNEL && (CC_IS_GCC || CLANG_VERSION >= 100000) + select HAVE_FUNCTION_TRACER if !XIP_KERNEL select HAVE_GCC_PLUGINS select HAVE_HW_BREAKPOINT if PERF_EVENTS && (CPU_V6 || CPU_V6K || CPU_V7) select HAVE_IDE if PCI || ISA || PCMCIA From 527f6750d92beb9c787d8aba48477b1e834d64e5 Mon Sep 17 00:00:00 2001 From: Marco Elver Date: Tue, 13 Oct 2020 16:47:51 -0700 Subject: [PATCH 006/181] kasan: remove mentions of unsupported Clang versions Since the kernel now requires at least Clang 10.0.1, remove any mention of old Clang versions and simplify the documentation. Signed-off-by: Marco Elver Signed-off-by: Nick Desaulniers Signed-off-by: Andrew Morton Reviewed-by: Andrey Konovalov Reviewed-by: Kees Cook Reviewed-by: Nathan Chancellor Cc: Fangrui Song Cc: Miguel Ojeda Cc: Sedat Dilek Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: Masahiro Yamada Cc: Vincenzo Frascino Cc: Will Deacon Link: https://lkml.kernel.org/r/20200902225911.209899-7-ndesaulniers@google.com Signed-off-by: Linus Torvalds --- Documentation/dev-tools/kasan.rst | 4 ++-- lib/Kconfig.kasan | 9 ++++----- 2 files changed, 6 insertions(+), 7 deletions(-) diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst index 38fd5681fade..4abc84b1798c 100644 --- a/Documentation/dev-tools/kasan.rst +++ b/Documentation/dev-tools/kasan.rst @@ -13,10 +13,10 @@ KASAN uses compile-time instrumentation to insert validity checks before every memory access, and therefore requires a compiler version that supports that. Generic KASAN is supported in both GCC and Clang. With GCC it requires version -8.3.0 or later. With Clang it requires version 7.0.0 or later, but detection of +8.3.0 or later. Any supported Clang version is compatible, but detection of out-of-bounds accesses for global variables is only supported since Clang 11. -Tag-based KASAN is only supported in Clang and requires version 7.0.0 or later. +Tag-based KASAN is only supported in Clang. Currently generic KASAN is supported for the x86_64, arm64, xtensa, s390 and riscv architectures, and tag-based KASAN is supported only for arm64. diff --git a/lib/Kconfig.kasan b/lib/Kconfig.kasan index 047b53dbfd58..033a5bc67ac4 100644 --- a/lib/Kconfig.kasan +++ b/lib/Kconfig.kasan @@ -54,9 +54,9 @@ config KASAN_GENERIC Enables generic KASAN mode. This mode is supported in both GCC and Clang. With GCC it requires - version 8.3.0 or later. With Clang it requires version 7.0.0 or - later, but detection of out-of-bounds accesses for global variables - is supported only since Clang 11. + version 8.3.0 or later. Any supported Clang version is compatible, + but detection of out-of-bounds accesses for global variables is + supported only since Clang 11. This mode consumes about 1/8th of available memory at kernel start and introduces an overhead of ~x1.5 for the rest of the allocations. @@ -78,8 +78,7 @@ config KASAN_SW_TAGS Enables software tag-based KASAN mode. This mode requires Top Byte Ignore support by the CPU and therefore - is only supported for arm64. This mode requires Clang version 7.0.0 - or later. + is only supported for arm64. This mode requires Clang. This mode consumes about 1/16th of available memory at kernel start and introduces an overhead of ~20% for the rest of the allocations. From c8db3b0a7ba7614f761f309d6aa7499127b18a0b Mon Sep 17 00:00:00 2001 From: Nick Desaulniers Date: Tue, 13 Oct 2020 16:47:55 -0700 Subject: [PATCH 007/181] compiler-gcc: improve version error As Kees suggests, doing so provides developers with two useful pieces of information: - The kernel build was attempting to use GCC. (Maybe they accidentally poked the wrong configs in a CI.) - They need 4.9 or better. ("Upgrade to what version?" doesn't need to be dug out of documentation, headers, etc.) Suggested-by: Kees Cook Signed-off-by: Nick Desaulniers Signed-off-by: Andrew Morton Tested-by: Sedat Dilek Reviewed-by: Kees Cook Reviewed-by: Miguel Ojeda Reviewed-by: Nathan Chancellor Reviewed-by: Sedat Dilek Cc: Andrey Konovalov Cc: Fangrui Song Cc: Marco Elver Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: Masahiro Yamada Cc: Vincenzo Frascino Cc: Will Deacon Link: https://lkml.kernel.org/r/20200902225911.209899-8-ndesaulniers@google.com Signed-off-by: Linus Torvalds --- include/linux/compiler-gcc.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h index 7a3769040d7d..d1e3c6896b71 100644 --- a/include/linux/compiler-gcc.h +++ b/include/linux/compiler-gcc.h @@ -12,7 +12,7 @@ /* https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145 */ #if GCC_VERSION < 40900 -# error Sorry, your compiler is too old - please upgrade it. +# error Sorry, your version of GCC is too old - please use 4.9 or newer. #endif /* Optimization barrier */ From a25c13b3aa1bdbf100e8770902c30908728f8410 Mon Sep 17 00:00:00 2001 From: Nick Desaulniers Date: Tue, 13 Oct 2020 16:47:58 -0700 Subject: [PATCH 008/181] compiler.h: avoid escaped section names The stringification operator, `#`, in the preprocessor escapes strings. For example, `# "foo"` becomes `"\"foo\""`. GCC and Clang differ in how they treat section names that contain \". The portable solution is to not use a string literal with the preprocessor stringification operator. In this case, since __section unconditionally uses the stringification operator, we actually want the more verbose __attribute__((__section__())). Fixes: commit e04462fb82f8 ("Compiler Attributes: remove uses of __attribute__ from compiler.h") Signed-off-by: Nick Desaulniers Signed-off-by: Andrew Morton Cc: Miguel Ojeda Cc: Luc Van Oostenryck Cc: Nathan Chancellor Cc: Arvind Sankar Link: https://bugs.llvm.org/show_bug.cgi?id=42950 Link: https://lkml.kernel.org/r/20200929194318.548707-1-ndesaulniers@google.com Signed-off-by: Linus Torvalds --- include/linux/compiler.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/compiler.h b/include/linux/compiler.h index 92ef163a7479..ac45f6d40d39 100644 --- a/include/linux/compiler.h +++ b/include/linux/compiler.h @@ -155,7 +155,7 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val, extern typeof(sym) sym; \ static const unsigned long __kentry_##sym \ __used \ - __section("___kentry" "+" #sym ) \ + __attribute__((__section__("___kentry+" #sym))) \ = (unsigned long)&sym; #endif From 4d6fb34acb5d0bfc579ccd29df9cc6f653e51ab2 Mon Sep 17 00:00:00 2001 From: Nick Desaulniers Date: Tue, 13 Oct 2020 16:48:01 -0700 Subject: [PATCH 009/181] export.h: fix section name for CONFIG_TRIM_UNUSED_KSYMS for Clang When enabling CONFIG_TRIM_UNUSED_KSYMS, the linker will warn about the orphan sections: (".discard.ksym") is being placed in '".discard.ksym"' repeatedly when linking vmlinux. This is because the stringification operator, `#`, in the preprocessor escapes strings. GCC and Clang differ in how they treat section names that contain \". The portable solution is to not use a string literal with the preprocessor stringification operator. Fixes: commit bbda5ec671d3 ("kbuild: simplify dependency generation for CONFIG_TRIM_UNUSED_KSYMS") Reported-by: kbuild test robot Suggested-by: Kees Cook Signed-off-by: Nick Desaulniers Signed-off-by: Andrew Morton Reviewed-by: Kees Cook Cc: Nathan Chancellor Cc: Masahiro Yamada Cc: Matthias Maennich Cc: Jessica Yu Cc: Greg Kroah-Hartman Cc: Will Deacon Link: https://bugs.llvm.org/show_bug.cgi?id=42950 Link: https://github.com/ClangBuiltLinux/linux/issues/1166 Link: https://lkml.kernel.org/r/20200929190701.398762-1-ndesaulniers@google.com Signed-off-by: Linus Torvalds --- include/linux/export.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/export.h b/include/linux/export.h index fceb5e855717..8933ff6ad23a 100644 --- a/include/linux/export.h +++ b/include/linux/export.h @@ -130,7 +130,7 @@ struct kernel_symbol { * discarded in the final link stage. */ #define __ksym_marker(sym) \ - static int __ksym_marker_##sym[0] __section(".discard.ksym") __used + static int __ksym_marker_##sym[0] __section(.discard.ksym) __used #define __EXPORT_SYMBOL(sym, sec, ns) \ __ksym_marker(sym); \ From eb38f37c3cee08a0197bdc7bbb9b4e02e40e2300 Mon Sep 17 00:00:00 2001 From: Lukas Bulwahn Date: Tue, 13 Oct 2020 16:48:05 -0700 Subject: [PATCH 010/181] kbuild: doc: describe proper script invocation During an investigation to fix up the execute bits of scripts in the kernel repository, Andrew Morton and Kees Cook pointed out that the execute bit should not matter, and that build scripts cannot rely on that. Kees could not point to any documentation, though. Masahiro Yamada explained the convention of setting execute bits to make it easier for manual script invocation. Provide some basic documentation how the build shall invoke scripts, such that the execute bits do not matter, and acknowledge that execute bits are useful nonetheless. This serves as reference for further clean-up patches in the future. Suggested-by: Andrew Morton Suggested-by: Kees Cook Signed-off-by: Lukas Bulwahn Signed-off-by: Andrew Morton Cc: Masahiro Yamada Cc: Michal Marek Cc: Jonathan Corbet Cc: Ujjwal Kumar Cc: Lukas Bulwahn Link: https://lore.kernel.org/lkml/20200830174409.c24c3f67addcce0cea9a9d4c@linux-foundation.org/ Link: https://lore.kernel.org/lkml/202008271102.FEB906C88@keescook/ Link: https://lore.kernel.org/linux-kbuild/CAK7LNAQdrvMkDA6ApDJCGr+5db8SiPo=G+p8EiOvnnGvEN80gA@mail.gmail.com/ Link: https://lkml.kernel.org/r/20201001075723.24246-1-lukas.bulwahn@gmail.com Signed-off-by: Linus Torvalds --- Documentation/kbuild/makefiles.rst | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/Documentation/kbuild/makefiles.rst b/Documentation/kbuild/makefiles.rst index 58d513a0fa95..0d5dd5413af0 100644 --- a/Documentation/kbuild/makefiles.rst +++ b/Documentation/kbuild/makefiles.rst @@ -21,6 +21,7 @@ This document describes the Linux kernel Makefiles. --- 3.10 Special Rules --- 3.11 $(CC) support functions --- 3.12 $(LD) support functions + --- 3.13 Script Invocation === 4 Host Program support --- 4.1 Simple Host Program @@ -605,6 +606,25 @@ more details, with real examples. #Makefile LDFLAGS_vmlinux += $(call ld-option, -X) +3.13 Script invocation +---------------------- + + Make rules may invoke scripts to build the kernel. The rules shall + always provide the appropriate interpreter to execute the script. They + shall not rely on the execute bits being set, and shall not invoke the + script directly. For the convenience of manual script invocation, such + as invoking ./scripts/checkpatch.pl, it is recommended to set execute + bits on the scripts nonetheless. + + Kbuild provides variables $(CONFIG_SHELL), $(AWK), $(PERL), + $(PYTHON) and $(PYTHON3) to refer to interpreters for the respective + scripts. + + Example:: + + #Makefile + cmd_depmod = $(CONFIG_SHELL) $(srctree)/scripts/depmod.sh $(DEPMOD) \ + $(KERNELRELEASE) 4 Host Program support ====================== From 2c92406f33433b624522acbc4e78e6c58c397cd5 Mon Sep 17 00:00:00 2001 From: Wang Qing Date: Tue, 13 Oct 2020 16:48:08 -0700 Subject: [PATCH 011/181] scripts/spelling.txt: increase error-prone spell checking Increase direcly,ununsed,manger spelling error check Signed-off-by: Wang Qing Signed-off-by: Andrew Morton Cc: Colin Ian King Cc: Wang Qing Cc: Xiong Cc: SeongJae Park Cc: Jonathan Neuschfer Cc: Luca Ceresoli Cc: Joe Perches Link: https://lkml.kernel.org/r/1601085397-27586-1-git-send-email-wangqing@vivo.com Signed-off-by: Linus Torvalds --- scripts/spelling.txt | 3 +++ 1 file changed, 3 insertions(+) diff --git a/scripts/spelling.txt b/scripts/spelling.txt index feb2efaaa5e6..06f4cf4ca717 100644 --- a/scripts/spelling.txt +++ b/scripts/spelling.txt @@ -482,6 +482,7 @@ disgest||digest dispalying||displaying diplay||display directon||direction +direcly||directly direectly||directly diregard||disregard disassocation||disassociation @@ -871,6 +872,7 @@ malplace||misplace managable||manageable managment||management mangement||management +manger||manager manoeuvering||maneuvering manufaucturing||manufacturing mappping||mapping @@ -1478,6 +1480,7 @@ unsolicitied||unsolicited unsuccessfull||unsuccessful unsuported||unsupported untill||until +ununsed||unused unuseful||useless unvalid||invalid upate||update From 33c5bb375ea4128b6e72b3ee260b74b59c295957 Mon Sep 17 00:00:00 2001 From: Naoki Hayama Date: Tue, 13 Oct 2020 16:48:11 -0700 Subject: [PATCH 012/181] scripts/spelling.txt: add "arbitrary" typo Add "abitrary||arbitrary". Signed-off-by: Naoki Hayama Signed-off-by: Andrew Morton Cc: Colin Ian King Cc: Andy Whitcroft Cc: Joe Perches Link: https://lkml.kernel.org/r/6bf6520d-787d-5749-09b5-ff92185f501f@lineo.co.jp Signed-off-by: Linus Torvalds --- scripts/spelling.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/scripts/spelling.txt b/scripts/spelling.txt index 06f4cf4ca717..cb46a79665f5 100644 --- a/scripts/spelling.txt +++ b/scripts/spelling.txt @@ -9,6 +9,7 @@ # abandonning||abandoning abigious||ambiguous +abitrary||arbitrary abitrate||arbitrate abnornally||abnormally abnrormal||abnormal From d72e720a19393eb611a112e4c5c377785dbd645d Mon Sep 17 00:00:00 2001 From: Borislav Petkov Date: Tue, 13 Oct 2020 16:48:14 -0700 Subject: [PATCH 013/181] scripts/decodecode: add the capability to supply the program counter So that comparing with objdump output from vmlinux can ease pinpointing where the trapping instruction actually is. An example is better than a thousand words: $ PC=0xffffffff8329a927 ./scripts/decodecode < ~/tmp/syz/gfs2.splat [ 477.379104][T23917] Code: 48 83 ec 28 48 89 3c 24 48 89 54 24 08 e8 c1 b4 4a fe 48 8d bb 00 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 97 05 00 00 48 8b 9b 00 01 00 00 48 85 db 0f 84 All code ======== ffffffff8329a8fd: 48 83 ec 28 sub $0x28,%rsp ffffffff8329a901: 48 89 3c 24 mov %rdi,(%rsp) ffffffff8329a905: 48 89 54 24 08 mov %rdx,0x8(%rsp) ffffffff8329a90a: e8 c1 b4 4a fe callq 0xffffffff81745dd0 ffffffff8329a90f: 48 8d bb 00 01 00 00 lea 0x100(%rbx),%rdi ffffffff8329a916: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax ffffffff8329a91d: fc ff df ffffffff8329a920: 48 89 fa mov %rdi,%rdx ffffffff8329a923: 48 c1 ea 03 shr $0x3,%rdx ffffffff8329a927:* 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1) <-- trapping instruction ffffffff8329a92b: 0f 85 97 05 00 00 jne 0xffffffff8329aec8 ffffffff8329a931: 48 8b 9b 00 01 00 00 mov 0x100(%rbx),%rbx ffffffff8329a938: 48 85 db test %rbx,%rbx ffffffff8329a93b: 0f .byte 0xf ffffffff8329a93c: 84 .byte 0x84 Signed-off-by: Borislav Petkov Signed-off-by: Andrew Morton Cc: Marc Zyngier Cc: Will Deacon Cc: Rabin Vincent Link: https://lkml.kernel.org/r/20200930111416.GF6810@zn.tnic Link: https://lkml.kernel.org/r/20200929113238.GC21110@zn.tnic Signed-off-by: Linus Torvalds --- scripts/decodecode | 29 ++++++++++++++++++++++------- 1 file changed, 22 insertions(+), 7 deletions(-) diff --git a/scripts/decodecode b/scripts/decodecode index fbdb325cdf4f..31d884e35f2f 100755 --- a/scripts/decodecode +++ b/scripts/decodecode @@ -6,6 +6,7 @@ # options: set env. variable AFLAGS=options to pass options to "as"; # e.g., to decode an i386 oops on an x86_64 system, use: # AFLAGS=--32 decodecode < 386.oops +# PC=hex - the PC (program counter) the oops points to cleanup() { rm -f $T $T.s $T.o $T.oo $T.aa $T.dis @@ -67,15 +68,19 @@ if [ -z "$ARCH" ]; then esac fi +# Params: (tmp_file, pc_sub) disas() { - ${CROSS_COMPILE}as $AFLAGS -o $1.o $1.s > /dev/null 2>&1 + t=$1 + pc_sub=$2 + + ${CROSS_COMPILE}as $AFLAGS -o $t.o $t.s > /dev/null 2>&1 if [ "$ARCH" = "arm" ]; then if [ $width -eq 2 ]; then OBJDUMPFLAGS="-M force-thumb" fi - ${CROSS_COMPILE}strip $1.o + ${CROSS_COMPILE}strip $t.o fi if [ "$ARCH" = "arm64" ]; then @@ -83,11 +88,18 @@ disas() { type=inst fi - ${CROSS_COMPILE}strip $1.o + ${CROSS_COMPILE}strip $t.o fi - ${CROSS_COMPILE}objdump $OBJDUMPFLAGS -S $1.o | \ - grep -v "/tmp\|Disassembly\|\.text\|^$" > $1.dis 2>&1 + if [ $pc_sub -ne 0 ]; then + if [ $PC ]; then + adj_vma=$(( $PC - $pc_sub )) + OBJDUMPFLAGS="$OBJDUMPFLAGS --adjust-vma=$adj_vma" + fi + fi + + ${CROSS_COMPILE}objdump $OBJDUMPFLAGS -S $t.o | \ + grep -v "/tmp\|Disassembly\|\.text\|^$" > $t.dis 2>&1 } marker=`expr index "$code" "\<"` @@ -95,14 +107,17 @@ if [ $marker -eq 0 ]; then marker=`expr index "$code" "\("` fi + touch $T.oo if [ $marker -ne 0 ]; then + # 2 opcode bytes and a single space + pc_sub=$(( $marker / 3 )) echo All code >> $T.oo echo ======== >> $T.oo beforemark=`echo "$code"` echo -n " .$type 0x" > $T.s echo $beforemark | sed -e 's/ /,0x/g; s/[<>()]//g' >> $T.s - disas $T + disas $T $pc_sub cat $T.dis >> $T.oo rm -f $T.o $T.s $T.dis @@ -114,7 +129,7 @@ echo =========================================== >> $T.aa code=`echo $code | sed -e 's/ [<(]/ /;s/[>)] / /;s/ /,0x/g; s/[>)]$//'` echo -n " .$type 0x" > $T.s echo $code >> $T.s -disas $T +disas $T 0 cat $T.dis >> $T.aa # (lines of whole $T.oo) - (lines of $T.aa, i.e. "Code starting") + 3, From 4f8c94022f0bc3babd0a124c0a7dcdd7547bd94e Mon Sep 17 00:00:00 2001 From: Rustam Kovhaev Date: Tue, 13 Oct 2020 16:48:17 -0700 Subject: [PATCH 014/181] ntfs: add check for mft record size in superblock Number of bytes allocated for mft record should be equal to the mft record size stored in ntfs superblock as reported by syzbot, userspace might trigger out-of-bounds read by dereferencing ctx->attr in ntfs_attr_find() Reported-by: syzbot+aed06913f36eff9b544e@syzkaller.appspotmail.com Signed-off-by: Rustam Kovhaev Signed-off-by: Andrew Morton Tested-by: syzbot+aed06913f36eff9b544e@syzkaller.appspotmail.com Acked-by: Anton Altaparmakov Link: https://syzkaller.appspot.com/bug?extid=aed06913f36eff9b544e Link: https://lkml.kernel.org/r/20200824022804.226242-1-rkovhaev@gmail.com Signed-off-by: Linus Torvalds --- fs/ntfs/inode.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/fs/ntfs/inode.c b/fs/ntfs/inode.c index 9bb9f0952b18..caf563981532 100644 --- a/fs/ntfs/inode.c +++ b/fs/ntfs/inode.c @@ -1810,6 +1810,12 @@ int ntfs_read_inode_mount(struct inode *vi) brelse(bh); } + if (le32_to_cpu(m->bytes_allocated) != vol->mft_record_size) { + ntfs_error(sb, "Incorrect mft record size %u in superblock, should be %u.", + le32_to_cpu(m->bytes_allocated), vol->mft_record_size); + goto err_out; + } + /* Apply the mst fixups. */ if (post_read_mst_fixup((NTFS_RECORD*)m, vol->mft_record_size)) { /* FIXME: Try to use the $MFTMirr now. */ From 679edeb0ed8ac3e5df020976249d062624f35fa5 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 13 Oct 2020 16:48:21 -0700 Subject: [PATCH 015/181] ocfs2: delete repeated words in comments Drop duplicated words {the, and} in comments. Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Acked-by: Joseph Qi Cc: Mark Fasheh Cc: Joel Becker Link: https://lkml.kernel.org/r/20200811021845.25134-1-rdunlap@infradead.org Signed-off-by: Linus Torvalds --- fs/ocfs2/alloc.c | 2 +- fs/ocfs2/localalloc.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c index 4c1b90442d6f..32317ffb9e5c 100644 --- a/fs/ocfs2/alloc.c +++ b/fs/ocfs2/alloc.c @@ -6013,7 +6013,7 @@ int __ocfs2_flush_truncate_log(struct ocfs2_super *osb) goto out; } - /* Appending truncate log(TA) and and flushing truncate log(TF) are + /* Appending truncate log(TA) and flushing truncate log(TF) are * two separated transactions. They can be both committed but not * checkpointed. If crash occurs then, both two transaction will be * replayed with several already released to global bitmap clusters. diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c index 720e9f94957e..fc8252a28cb1 100644 --- a/fs/ocfs2/localalloc.c +++ b/fs/ocfs2/localalloc.c @@ -677,7 +677,7 @@ int ocfs2_reserve_local_alloc_bits(struct ocfs2_super *osb, /* * Under certain conditions, the window slide code * might have reduced the number of bits available or - * disabled the the local alloc entirely. Re-check + * disabled the local alloc entirely. Re-check * here and return -ENOSPC if necessary. */ status = -ENOSPC; From 8dd71b25c56a707fd492035c03e20e91040eedcf Mon Sep 17 00:00:00 2001 From: Gang He Date: Tue, 13 Oct 2020 16:48:24 -0700 Subject: [PATCH 016/181] ocfs2: fix potential soft lockup during fstrim When we discard unused blocks on a mounted ocfs2 filesystem, fstrim handles each block goup with locking/unlocking global bitmap meta-file repeatedly. we should let fstrim thread take a break(if need) between unlock and lock, this will avoid the potential soft lockup problem, and also gives the upper applications more IO opportunities, these applications are not blocked for too long at writing files. Signed-off-by: Gang He Signed-off-by: Andrew Morton Reviewed-by: Joseph Qi Cc: Mark Fasheh Cc: Joel Becker Cc: Junxiao Bi Cc: Changwei Ge Cc: Jun Piao Link: https://lkml.kernel.org/r/20200927015815.14904-1-ghe@suse.com Signed-off-by: Linus Torvalds --- fs/ocfs2/alloc.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c index 32317ffb9e5c..78710788c237 100644 --- a/fs/ocfs2/alloc.c +++ b/fs/ocfs2/alloc.c @@ -7654,8 +7654,10 @@ out_mutex: * main_bm related locks for avoiding the current IO starve, then go to * trim the next group */ - if (ret >= 0 && group <= last_group) + if (ret >= 0 && group <= last_group) { + cond_resched(); goto next_group; + } out: range->len = trimmed * sb->s_blocksize; return ret; From da5c1c0bb316e3a0454d2c6980e9c2b618c149b0 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 13 Oct 2020 16:48:27 -0700 Subject: [PATCH 017/181] fs/xattr.c: fix kernel-doc warnings for setxattr & removexattr Fix kernel-doc warnings in fs/xattr.c: ../fs/xattr.c:251: warning: Function parameter or member 'dentry' not described in '__vfs_setxattr_locked' ../fs/xattr.c:251: warning: Function parameter or member 'name' not described in '__vfs_setxattr_locked' ../fs/xattr.c:251: warning: Function parameter or member 'value' not described in '__vfs_setxattr_locked' ../fs/xattr.c:251: warning: Function parameter or member 'size' not described in '__vfs_setxattr_locked' ../fs/xattr.c:251: warning: Function parameter or member 'flags' not described in '__vfs_setxattr_locked' ../fs/xattr.c:251: warning: Function parameter or member 'delegated_inode' not described in '__vfs_setxattr_locked' ../fs/xattr.c:458: warning: Function parameter or member 'dentry' not described in '__vfs_removexattr_locked' ../fs/xattr.c:458: warning: Function parameter or member 'name' not described in '__vfs_removexattr_locked' ../fs/xattr.c:458: warning: Function parameter or member 'delegated_inode' not described in '__vfs_removexattr_locked' Fixes: 08b5d5014a27 ("xattr: break delegations in {set,remove}xattr") Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Cc: Al Viro Cc: Frank van der Linden Cc: Chuck Lever Link: http://lkml.kernel.org/r/7a3dd5a2-5787-adf3-d525-c203f9910ec4@infradead.org Signed-off-by: Linus Torvalds --- fs/xattr.c | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/fs/xattr.c b/fs/xattr.c index 386b45676d7e..cd7a563e8bcd 100644 --- a/fs/xattr.c +++ b/fs/xattr.c @@ -232,15 +232,15 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name, } /** - * __vfs_setxattr_locked: set an extended attribute while holding the inode + * __vfs_setxattr_locked - set an extended attribute while holding the inode * lock * - * @dentry - object to perform setxattr on - * @name - xattr name to set - * @value - value to set @name to - * @size - size of @value - * @flags - flags to pass into filesystem operations - * @delegated_inode - on return, will contain an inode pointer that + * @dentry: object to perform setxattr on + * @name: xattr name to set + * @value: value to set @name to + * @size: size of @value + * @flags: flags to pass into filesystem operations + * @delegated_inode: on return, will contain an inode pointer that * a delegation was broken on, NULL if none. */ int @@ -443,12 +443,12 @@ __vfs_removexattr(struct dentry *dentry, const char *name) EXPORT_SYMBOL(__vfs_removexattr); /** - * __vfs_removexattr_locked: set an extended attribute while holding the inode + * __vfs_removexattr_locked - set an extended attribute while holding the inode * lock * - * @dentry - object to perform setxattr on - * @name - name of xattr to remove - * @delegated_inode - on return, will contain an inode pointer that + * @dentry: object to perform setxattr on + * @name: name of xattr to remove + * @delegated_inode: on return, will contain an inode pointer that * a delegation was broken on, NULL if none. */ int From 97383c741b061ed2835802264c16784e42e20fe0 Mon Sep 17 00:00:00 2001 From: Luo Jiaxing Date: Tue, 13 Oct 2020 16:48:30 -0700 Subject: [PATCH 018/181] fs_parse: mark fs_param_bad_value() as static We found the following warning when build kernel with W=1: fs/fs_parser.c:192:5: warning: no previous prototype for `fs_param_bad_value' [-Wmissing-prototypes] int fs_param_bad_value(struct p_log *log, struct fs_parameter *param) ^ CC drivers/usb/gadget/udc/snps_udc_core.o And no header file define a prototype for this function, so we should mark it as static. Signed-off-by: Luo Jiaxing Signed-off-by: Andrew Morton Link: https://lkml.kernel.org/r/1601293463-25763-1-git-send-email-luojiaxing@huawei.com Signed-off-by: Linus Torvalds --- fs/fs_parser.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/fs_parser.c b/fs/fs_parser.c index ab53e42a874a..68b0148f4bb8 100644 --- a/fs/fs_parser.c +++ b/fs/fs_parser.c @@ -189,7 +189,7 @@ out: } EXPORT_SYMBOL(fs_lookup_param); -int fs_param_bad_value(struct p_log *log, struct fs_parameter *param) +static int fs_param_bad_value(struct p_log *log, struct fs_parameter *param) { return inval_plog(log, "Bad value for '%s'", param->key); } From c1ff3f95497e64b67230bddc1242f2d228880859 Mon Sep 17 00:00:00 2001 From: Mateusz Nosek Date: Tue, 13 Oct 2020 16:48:34 -0700 Subject: [PATCH 019/181] mm/slab.c: clean code by removing redundant if condition The removed code was unnecessary and changed nothing in the flow, since in case of returning NULL by 'kmem_cache_alloc_node' returning 'freelist' from the function in question is the same as returning NULL. Signed-off-by: Mateusz Nosek Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Link: https://lkml.kernel.org/r/20200915230329.13002-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds --- mm/slab.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index f658e86ec8ce..04bc6a6c48eb 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -2305,8 +2305,6 @@ static void *alloc_slabmgmt(struct kmem_cache *cachep, /* Slab management obj is off-slab. */ freelist = kmem_cache_alloc_node(cachep->freelist_cache, local_flags, nodeid); - if (!freelist) - return NULL; } else { /* We will use last bytes at the slab for freelist */ freelist = addr + (PAGE_SIZE << cachep->gfporder) - From d7cff4ded857ff7cc3e49eb39cc14df9345b9662 Mon Sep 17 00:00:00 2001 From: tangjianqiang Date: Tue, 13 Oct 2020 16:48:37 -0700 Subject: [PATCH 020/181] include/linux/slab.h: fix a typo error in comment fix a typo error in slab.h "allocagtor" -> "allocator" Signed-off-by: tangjianqiang Signed-off-by: Andrew Morton Acked-by: Souptick Joarder Link: https://lkml.kernel.org/r/1600230053-24303-1-git-send-email-tangjianqiang@xiaomi.com Signed-off-by: Linus Torvalds --- include/linux/slab.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/slab.h b/include/linux/slab.h index 24df2393ec03..9e155cc83b8a 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -279,7 +279,7 @@ static inline void __check_heap_object(const void *ptr, unsigned long n, #define KMALLOC_MAX_SIZE (1UL << KMALLOC_SHIFT_MAX) /* Maximum size for which we actually use a slab cache */ #define KMALLOC_MAX_CACHE_SIZE (1UL << KMALLOC_SHIFT_HIGH) -/* Maximum order allocatable via the slab allocagtor */ +/* Maximum order allocatable via the slab allocator */ #define KMALLOC_MAX_ORDER (KMALLOC_SHIFT_MAX - PAGE_SHIFT) /* From c270cf3041a5cb7c5853d45a794309e66576493a Mon Sep 17 00:00:00 2001 From: Abel Wu Date: Tue, 13 Oct 2020 16:48:40 -0700 Subject: [PATCH 021/181] mm/slub.c: branch optimization in free slowpath The two conditions are mutually exclusive and gcc compiler will optimise this into if-else-like pattern. Given that the majority of free_slowpath is free_frozen, let's provide some hint to the compilers. Tests (perf bench sched messaging -g 20 -l 400000, executed 10x after reboot) are done and the summarized result: un-patched patched max. 192.316 189.851 min. 187.267 186.252 avg. 189.154 188.086 stdev. 1.37 0.99 Signed-off-by: Abel Wu Signed-off-by: Andrew Morton Acked-by: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Hewenliang Cc: Hu Shiyuan Link: http://lkml.kernel.org/r/20200813101812.1617-1-wuyun.wu@huawei.com Signed-off-by: Linus Torvalds --- mm/slub.c | 23 ++++++++++++----------- 1 file changed, 12 insertions(+), 11 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 6d3574013b2f..da6438bd8202 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3019,20 +3019,21 @@ static void __slab_free(struct kmem_cache *s, struct page *page, if (likely(!n)) { - /* - * If we just froze the page then put it onto the - * per cpu partial list. - */ - if (new.frozen && !was_frozen) { + if (likely(was_frozen)) { + /* + * The list lock was not taken therefore no list + * activity can be necessary. + */ + stat(s, FREE_FROZEN); + } else if (new.frozen) { + /* + * If we just froze the page then put it onto the + * per cpu partial list. + */ put_cpu_partial(s, page, 1); stat(s, CPU_PARTIAL_FREE); } - /* - * The list lock was not taken therefore no list - * activity can be necessary. - */ - if (was_frozen) - stat(s, FREE_FROZEN); + return; } From 9f986d998a3001b6eeb189be8444bc0360e61e24 Mon Sep 17 00:00:00 2001 From: Abel Wu Date: Tue, 13 Oct 2020 16:48:43 -0700 Subject: [PATCH 022/181] mm/slub: fix missing ALLOC_SLOWPATH stat when bulk alloc The ALLOC_SLOWPATH statistics is missing in bulk allocation now. Fix it by doing statistics in alloc slow path. Signed-off-by: Abel Wu Signed-off-by: Andrew Morton Reviewed-by: Pekka Enberg Acked-by: David Rientjes Cc: Christoph Lameter Cc: Joonsoo Kim Cc: Hewenliang Cc: Hu Shiyuan Link: http://lkml.kernel.org/r/20200811022427.1363-1-wuyun.wu@huawei.com Signed-off-by: Linus Torvalds --- mm/slub.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/slub.c b/mm/slub.c index da6438bd8202..7728a0b71d63 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2661,6 +2661,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, void *freelist; struct page *page; + stat(s, ALLOC_SLOWPATH); + page = c->page; if (!page) { /* @@ -2850,7 +2852,6 @@ redo: page = c->page; if (unlikely(!object || !node_match(page, node))) { object = __slab_alloc(s, gfpflags, node, addr, c); - stat(s, ALLOC_SLOWPATH); } else { void *next_object = get_freepointer_safe(s, object); From 9cf7a111836552d159d443491a38b3bc2cc8a174 Mon Sep 17 00:00:00 2001 From: Abel Wu Date: Tue, 13 Oct 2020 16:48:47 -0700 Subject: [PATCH 023/181] mm/slub: make add_full() condition more explicit The commit below is incomplete, as it didn't handle the add_full() part. commit a4d3f8916c65 ("slub: remove useless kmem_cache_debug() before remove_full()") This patch checks for SLAB_STORE_USER instead of kmem_cache_debug(), since that should be the only context in which we need the list_lock for add_full(). Signed-off-by: Abel Wu Signed-off-by: Andrew Morton Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Liu Xiang Link: https://lkml.kernel.org/r/20200811020240.1231-1-wuyun.wu@huawei.com Signed-off-by: Linus Torvalds --- mm/slub.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/mm/slub.c b/mm/slub.c index 7728a0b71d63..f05900186c0b 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2245,7 +2245,8 @@ redo: } } else { m = M_FULL; - if (kmem_cache_debug(s) && !lock) { +#ifdef CONFIG_SLUB_DEBUG + if ((s->flags & SLAB_STORE_USER) && !lock) { lock = 1; /* * This also ensures that the scanning of full @@ -2254,6 +2255,7 @@ redo: */ spin_lock(&n->list_lock); } +#endif } if (l != m) { From c4b28963fd79457315783b3b0f21c01eb88cfdc1 Mon Sep 17 00:00:00 2001 From: Davidlohr Bueso Date: Tue, 13 Oct 2020 16:48:50 -0700 Subject: [PATCH 024/181] mm/kmemleak: rely on rcu for task stack scanning kmemleak_scan() currently relies on the big tasklist_lock hammer to stabilize iterating through the tasklist. Instead, this patch proposes simply using rcu along with the rcu-safe for_each_process_thread flavor (without changing scan semantics), which doesn't make use of next_thread/p->thread_group and thus cannot race with exit. Furthermore, any races with fork() and not seeing the new child should be benign as it's not running yet and can also be detected by the next scan. Avoiding the tasklist_lock could prove beneficial for performance considering the scan operation is done periodically. I have seen improvements of 30%-ish when doing similar replacements on very pathological microbenchmarks (ie stressing get/setpriority(2)). However my main motivation is that it's one less user of the global lock, something that Linus has long time wanted to see gone eventually (if ever) even if the traditional fairness issues has been dealt with now with qrwlocks. Of course this is a very long ways ahead. This patch also kills another user of the deprecated tsk->thread_group. Signed-off-by: Davidlohr Bueso Signed-off-by: Andrew Morton Reviewed-by: Qian Cai Acked-by: Catalin Marinas Acked-by: Oleg Nesterov Link: https://lkml.kernel.org/r/20200820203902.11308-1-dave@stgolabs.net Signed-off-by: Linus Torvalds --- mm/kmemleak.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/mm/kmemleak.c b/mm/kmemleak.c index 5e252d91eb14..c0014d3b91c1 100644 --- a/mm/kmemleak.c +++ b/mm/kmemleak.c @@ -1471,15 +1471,15 @@ static void kmemleak_scan(void) if (kmemleak_stack_scan) { struct task_struct *p, *g; - read_lock(&tasklist_lock); - do_each_thread(g, p) { + rcu_read_lock(); + for_each_process_thread(g, p) { void *stack = try_get_task_stack(p); if (stack) { scan_block(stack, stack + THREAD_SIZE, NULL); put_task_stack(p); } - } while_each_thread(g, p); - read_unlock(&tasklist_lock); + } + rcu_read_unlock(); } /* From 1abbef4f51724fb11f09adf0e75275f7cb422a8a Mon Sep 17 00:00:00 2001 From: Hui Su Date: Tue, 13 Oct 2020 16:48:53 -0700 Subject: [PATCH 025/181] mm,kmemleak-test.c: move kmemleak-test.c to samples dir kmemleak-test.c is just a kmemleak test module, which also can not be used as a built-in kernel module. Thus, i think it may should not be in mm dir, and move the kmemleak-test.c to samples/kmemleak/kmemleak-test.c. Fix the spelling of built-in by the way. Signed-off-by: Hui Su Signed-off-by: Andrew Morton Cc: Catalin Marinas Cc: Jonathan Corbet Cc: Mauro Carvalho Chehab Cc: David S. Miller Cc: Rob Herring Cc: Masahiro Yamada Cc: Sam Ravnborg Cc: Josh Poimboeuf Cc: Steven Rostedt (VMware) Cc: Miguel Ojeda Cc: Divya Indi Cc: Tomas Winkler Cc: David Howells Link: https://lkml.kernel.org/r/20200925183729.GA172837@rlk Signed-off-by: Linus Torvalds --- Documentation/dev-tools/kmemleak.rst | 2 +- MAINTAINERS | 2 +- mm/Makefile | 1 - samples/Makefile | 1 + samples/kmemleak/Makefile | 3 +++ {mm => samples/kmemleak}/kmemleak-test.c | 2 +- 6 files changed, 7 insertions(+), 4 deletions(-) create mode 100644 samples/kmemleak/Makefile rename {mm => samples/kmemleak}/kmemleak-test.c (98%) diff --git a/Documentation/dev-tools/kmemleak.rst b/Documentation/dev-tools/kmemleak.rst index a41a2d238af2..1c935f41cd3a 100644 --- a/Documentation/dev-tools/kmemleak.rst +++ b/Documentation/dev-tools/kmemleak.rst @@ -229,7 +229,7 @@ Testing with kmemleak-test To check if you have all set up to use kmemleak, you can use the kmemleak-test module, a module that deliberately leaks memory. Set CONFIG_DEBUG_KMEMLEAK_TEST -as module (it can't be used as bult-in) and boot the kernel with kmemleak +as module (it can't be used as built-in) and boot the kernel with kmemleak enabled. Load the module and perform a scan with:: # modprobe kmemleak-test diff --git a/MAINTAINERS b/MAINTAINERS index 71291bb26a07..42f1230d1add 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9727,8 +9727,8 @@ M: Catalin Marinas S: Maintained F: Documentation/dev-tools/kmemleak.rst F: include/linux/kmemleak.h -F: mm/kmemleak-test.c F: mm/kmemleak.c +F: samples/kmemleak/kmemleak-test.c KMOD KERNEL MODULE LOADER - USERMODE HELPER M: Luis Chamberlain diff --git a/mm/Makefile b/mm/Makefile index d5649f1c12c0..d73aed0fc99c 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -94,7 +94,6 @@ obj-$(CONFIG_GUP_BENCHMARK) += gup_benchmark.o obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o -obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o obj-$(CONFIG_DEBUG_RODATA_TEST) += rodata_test.o obj-$(CONFIG_DEBUG_VM_PGTABLE) += debug_vm_pgtable.o obj-$(CONFIG_PAGE_OWNER) += page_owner.o diff --git a/samples/Makefile b/samples/Makefile index 754553597581..c3392a595e4b 100644 --- a/samples/Makefile +++ b/samples/Makefile @@ -28,3 +28,4 @@ subdir-$(CONFIG_SAMPLE_VFS) += vfs obj-$(CONFIG_SAMPLE_INTEL_MEI) += mei/ subdir-$(CONFIG_SAMPLE_WATCHDOG) += watchdog subdir-$(CONFIG_SAMPLE_WATCH_QUEUE) += watch_queue +obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak/ diff --git a/samples/kmemleak/Makefile b/samples/kmemleak/Makefile new file mode 100644 index 000000000000..16b6132c540c --- /dev/null +++ b/samples/kmemleak/Makefile @@ -0,0 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only + +obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o diff --git a/mm/kmemleak-test.c b/samples/kmemleak/kmemleak-test.c similarity index 98% rename from mm/kmemleak-test.c rename to samples/kmemleak/kmemleak-test.c index e19279ff6aa3..7b476eb8285f 100644 --- a/mm/kmemleak-test.c +++ b/samples/kmemleak/kmemleak-test.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-only /* - * mm/kmemleak-test.c + * samples/kmemleak/kmemleak-test.c * * Copyright (C) 2008 ARM Limited * Written by Catalin Marinas From 2dd57d3415f8623a5e9494c88978a202886041aa Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 13 Oct 2020 16:48:57 -0700 Subject: [PATCH 026/181] x86/numa: cleanup configuration dependent command-line options MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "device-dax: Support sub-dividing soft-reserved ranges", v5. The device-dax facility allows an address range to be directly mapped through a chardev, or optionally hotplugged to the core kernel page allocator as System-RAM. It is the mechanism for converting persistent memory (pmem) to be used as another volatile memory pool i.e. the current Memory Tiering hot topic on linux-mm. In the case of pmem the nvdimm-namespace-label mechanism can sub-divide it, but that labeling mechanism is not available / applicable to soft-reserved ("EFI specific purpose") memory [3]. This series provides a sysfs-mechanism for the daxctl utility to enable provisioning of volatile-soft-reserved memory ranges. The motivations for this facility are: 1/ Allow performance differentiated memory ranges to be split between kernel-managed and directly-accessed use cases. 2/ Allow physical memory to be provisioned along performance relevant address boundaries. For example, divide a memory-side cache [4] along cache-color boundaries. 3/ Parcel out soft-reserved memory to VMs using device-dax as a security / permissions boundary [5]. Specifically I have seen people (ab)using memmap=nn!ss (mark System-RAM as Persistent Memory) just to get the device-dax interface on custom address ranges. A follow-on for the VM use case is to teach device-dax to dynamically allocate 'struct page' at runtime to reduce the duplication of 'struct page' space in both the guest and the host kernel for the same physical pages. [2]: http://lore.kernel.org/r/20200713160837.13774-11-joao.m.martins@oracle.com [3]: http://lore.kernel.org/r/157309097008.1579826.12818463304589384434.stgit@dwillia2-desk3.amr.corp.intel.com [4]: http://lore.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com [5]: http://lore.kernel.org/r/20200110190313.17144-1-joao.m.martins@oracle.com This patch (of 23): In preparation for adding a new numa= option clean up the existing ones to avoid ifdefs in numa_setup(), and provide feedback when the option is numa=fake= option is invalid due to kernel config. The same does not need to be done for numa=noacpi, since the capability is already hard disabled at compile-time. Suggested-by: Rafael J. Wysocki Signed-off-by: Dan Williams Signed-off-by: Andrew Morton Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Borislav Petkov Cc: Brice Goglin Cc: Catalin Marinas Cc: Daniel Vetter Cc: Dave Hansen Cc: Dave Jiang Cc: David Airlie Cc: David Hildenbrand Cc: Greg Kroah-Hartman Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Ira Weiny Cc: Jason Gunthorpe Cc: Jeff Moyer Cc: Jia He Cc: Joao Martins Cc: Jonathan Cameron Cc: Michael Ellerman Cc: Mike Rapoport Cc: Paul Mackerras Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: Rafael J. Wysocki Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Vishal Verma Cc: Wei Yang Cc: Will Deacon Cc: Ard Biesheuvel Cc: Bjorn Helgaas Cc: Boris Ostrovsky Cc: Hulk Robot Cc: Jason Yan Cc: "Jérôme Glisse" Cc: Juergen Gross Cc: kernel test robot Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Vivek Goyal Link: https://lkml.kernel.org/r/160106109960.30709.7379926726669669398.stgit@dwillia2-desk3.amr.corp.intel.com Link: https://lkml.kernel.org/r/159643094279.4062302.17779410714418721328.stgit@dwillia2-desk3.amr.corp.intel.com Link: https://lkml.kernel.org/r/159643094925.4062302.14979872973043772305.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- arch/x86/include/asm/numa.h | 8 +++++++- arch/x86/mm/numa.c | 8 ++------ arch/x86/mm/numa_emulation.c | 3 ++- arch/x86/xen/enlighten_pv.c | 2 +- drivers/acpi/numa/srat.c | 9 +++++++-- include/acpi/acpi_numa.h | 6 +++++- 6 files changed, 24 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h index bbfde3d2662f..0aecc0b629e0 100644 --- a/arch/x86/include/asm/numa.h +++ b/arch/x86/include/asm/numa.h @@ -3,6 +3,7 @@ #define _ASM_X86_NUMA_H #include +#include #include #include @@ -77,7 +78,12 @@ void debug_cpumask_set_cpu(int cpu, int node, bool enable); #ifdef CONFIG_NUMA_EMU #define FAKE_NODE_MIN_SIZE ((u64)32 << 20) #define FAKE_NODE_MIN_HASH_MASK (~(FAKE_NODE_MIN_SIZE - 1UL)) -void numa_emu_cmdline(char *); +int numa_emu_cmdline(char *str); +#else /* CONFIG_NUMA_EMU */ +static inline int numa_emu_cmdline(char *str) +{ + return -EINVAL; +} #endif /* CONFIG_NUMA_EMU */ #endif /* _ASM_X86_NUMA_H */ diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index aa76ec2d359b..87c52822cc44 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -37,14 +37,10 @@ static __init int numa_setup(char *opt) return -EINVAL; if (!strncmp(opt, "off", 3)) numa_off = 1; -#ifdef CONFIG_NUMA_EMU if (!strncmp(opt, "fake=", 5)) - numa_emu_cmdline(opt + 5); -#endif -#ifdef CONFIG_ACPI_NUMA + return numa_emu_cmdline(opt + 5); if (!strncmp(opt, "noacpi", 6)) - acpi_numa = -1; -#endif + disable_srat(); return 0; } early_param("numa", numa_setup); diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c index 683cd12f4793..87d77cc52f86 100644 --- a/arch/x86/mm/numa_emulation.c +++ b/arch/x86/mm/numa_emulation.c @@ -13,9 +13,10 @@ static int emu_nid_to_phys[MAX_NUMNODES]; static char *emu_cmdline __initdata; -void __init numa_emu_cmdline(char *str) +int __init numa_emu_cmdline(char *str) { emu_cmdline = str; + return 0; } static int __init emu_find_memblk_by_nid(int nid, const struct numa_meminfo *mi) diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c index 41485a8a6dcf..b1418a6c0e90 100644 --- a/arch/x86/xen/enlighten_pv.c +++ b/arch/x86/xen/enlighten_pv.c @@ -1300,7 +1300,7 @@ asmlinkage __visible void __init xen_start_kernel(void) * any NUMA information the kernel tries to get from ACPI will * be meaningless. Prevent it from trying. */ - acpi_numa = -1; + disable_srat(); #endif WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_pv, xen_cpu_dead_pv)); diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c index 15bbaab8500b..1b0ae0a1959b 100644 --- a/drivers/acpi/numa/srat.c +++ b/drivers/acpi/numa/srat.c @@ -27,7 +27,12 @@ static int node_to_pxm_map[MAX_NUMNODES] = { [0 ... MAX_NUMNODES - 1] = PXM_INVAL }; unsigned char acpi_srat_revision __initdata; -int acpi_numa __initdata; +static int acpi_numa __initdata; + +void __init disable_srat(void) +{ + acpi_numa = -1; +} int pxm_to_node(int pxm) { @@ -163,7 +168,7 @@ static int __init slit_valid(struct acpi_table_slit *slit) void __init bad_srat(void) { pr_err("SRAT: SRAT not used.\n"); - acpi_numa = -1; + disable_srat(); } int __init srat_disabled(void) diff --git a/include/acpi/acpi_numa.h b/include/acpi/acpi_numa.h index fdebcfc6c8df..8784183b2204 100644 --- a/include/acpi/acpi_numa.h +++ b/include/acpi/acpi_numa.h @@ -17,10 +17,14 @@ extern int pxm_to_node(int); extern int node_to_pxm(int); extern int acpi_map_pxm_to_node(int); extern unsigned char acpi_srat_revision; -extern int acpi_numa __initdata; +extern void disable_srat(void); extern void bad_srat(void); extern int srat_disabled(void); +#else /* CONFIG_ACPI_NUMA */ +static inline void disable_srat(void) +{ +} #endif /* CONFIG_ACPI_NUMA */ #endif /* __ACP_NUMA_H */ From 3b0d31011d39759e3ba7214f75f77bb31983b5a4 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 13 Oct 2020 16:49:02 -0700 Subject: [PATCH 027/181] x86/numa: add 'nohmat' option MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Disable parsing of the HMAT for debug, to workaround broken platform instances, or cases where it is otherwise not wanted. [rdunlap@infradead.org: fix build when CONFIG_ACPI is not set] Link: https://lkml.kernel.org/r/70e5ee34-9809-a997-7b49-499e4be61307@infradead.org Signed-off-by: Dan Williams Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Cc: Dave Hansen Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Ard Biesheuvel Cc: Ard Biesheuvel Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Brice Goglin Cc: Catalin Marinas Cc: Daniel Vetter Cc: Dave Jiang Cc: David Airlie Cc: David Hildenbrand Cc: Greg Kroah-Hartman Cc: Ira Weiny Cc: Jason Gunthorpe Cc: Jeff Moyer Cc: Jia He Cc: Joao Martins Cc: Jonathan Cameron Cc: Michael Ellerman Cc: Mike Rapoport Cc: Paul Mackerras Cc: Pavel Tatashin Cc: "Rafael J. Wysocki" Cc: Tom Lendacky Cc: Vishal Verma Cc: Wei Yang Cc: Will Deacon Cc: Bjorn Helgaas Cc: Boris Ostrovsky Cc: Hulk Robot Cc: Jason Yan Cc: "Jérôme Glisse" Cc: Juergen Gross Cc: kernel test robot Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Vivek Goyal Link: https://lkml.kernel.org/r/159643095540.4062302.732962081968036212.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- Documentation/x86/x86_64/boot-options.rst | 4 ++++ arch/x86/mm/numa.c | 2 ++ drivers/acpi/numa/hmat.c | 8 +++++++- include/acpi/acpi_numa.h | 8 ++++++++ include/linux/acpi.h | 2 ++ 5 files changed, 23 insertions(+), 1 deletion(-) diff --git a/Documentation/x86/x86_64/boot-options.rst b/Documentation/x86/x86_64/boot-options.rst index 2b98efb5ba7f..324cefff92e7 100644 --- a/Documentation/x86/x86_64/boot-options.rst +++ b/Documentation/x86/x86_64/boot-options.rst @@ -173,6 +173,10 @@ NUMA numa=noacpi Don't parse the SRAT table for NUMA setup + numa=nohmat + Don't parse the HMAT table for NUMA setup, or soft-reserved memory + partitioning. + numa=fake=[MG] If given as a memory unit, fills all system RAM with nodes of size interleaved over physical nodes. diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index 87c52822cc44..f3805bbaa784 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -41,6 +41,8 @@ static __init int numa_setup(char *opt) return numa_emu_cmdline(opt + 5); if (!strncmp(opt, "noacpi", 6)) disable_srat(); + if (!strncmp(opt, "nohmat", 6)) + disable_hmat(); return 0; } early_param("numa", numa_setup); diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c index 2c32cfb72370..a12e36a12618 100644 --- a/drivers/acpi/numa/hmat.c +++ b/drivers/acpi/numa/hmat.c @@ -26,6 +26,12 @@ #include static u8 hmat_revision; +static int hmat_disable __initdata; + +void __init disable_hmat(void) +{ + hmat_disable = 1; +} static LIST_HEAD(targets); static LIST_HEAD(initiators); @@ -814,7 +820,7 @@ static __init int hmat_init(void) enum acpi_hmat_type i; acpi_status status; - if (srat_disabled()) + if (srat_disabled() || hmat_disable) return 0; status = acpi_get_table(ACPI_SIG_SRAT, 0, &tbl); diff --git a/include/acpi/acpi_numa.h b/include/acpi/acpi_numa.h index 8784183b2204..0e9302285f14 100644 --- a/include/acpi/acpi_numa.h +++ b/include/acpi/acpi_numa.h @@ -27,4 +27,12 @@ static inline void disable_srat(void) { } #endif /* CONFIG_ACPI_NUMA */ + +#ifdef CONFIG_ACPI_HMAT +extern void disable_hmat(void); +#else /* CONFIG_ACPI_HMAT */ +static inline void disable_hmat(void) +{ +} +#endif /* CONFIG_ACPI_HMAT */ #endif /* __ACP_NUMA_H */ diff --git a/include/linux/acpi.h b/include/linux/acpi.h index 64ae25c59d55..cfa8c0015863 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -709,6 +709,8 @@ static inline u64 acpi_arch_get_root_pointer(void) #define ACPI_HANDLE_FWNODE(fwnode) (NULL) #define ACPI_DEVICE_CLASS(_cls, _msk) .cls = (0), .cls_msk = (0), +#include + struct fwnode_handle; static inline bool acpi_dev_found(const char *hid) From 88e9a5b7965c872d7a1f3624605ed0d2939ca03f Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 13 Oct 2020 16:49:08 -0700 Subject: [PATCH 028/181] efi/fake_mem: arrange for a resource entry per efi_fake_mem instance MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In preparation for attaching a platform device per iomem resource teach the efi_fake_mem code to create an e820 entry per instance. Similar to E820_TYPE_PRAM, bypass merging resource when the e820 map is sanitized. Signed-off-by: Dan Williams Signed-off-by: Andrew Morton Acked-by: Ard Biesheuvel Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Andy Lutomirski Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Brice Goglin Cc: Catalin Marinas Cc: Daniel Vetter Cc: Dave Hansen Cc: Dave Jiang Cc: David Airlie Cc: David Hildenbrand Cc: Greg Kroah-Hartman Cc: Ira Weiny Cc: Jason Gunthorpe Cc: Jeff Moyer Cc: Jia He Cc: Joao Martins Cc: Jonathan Cameron Cc: Michael Ellerman Cc: Mike Rapoport Cc: Paul Mackerras Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: Rafael J. Wysocki Cc: Tom Lendacky Cc: Vishal Verma Cc: Wei Yang Cc: Will Deacon Cc: Ard Biesheuvel Cc: Bjorn Helgaas Cc: Boris Ostrovsky Cc: Hulk Robot Cc: Jason Yan Cc: "Jérôme Glisse" Cc: Juergen Gross Cc: kernel test robot Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Vivek Goyal Link: https://lkml.kernel.org/r/159643096068.4062302.11590041070221681669.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- arch/x86/kernel/e820.c | 16 +++++++++++++++- drivers/firmware/efi/x86_fake_mem.c | 12 +++++++++--- 2 files changed, 24 insertions(+), 4 deletions(-) diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c index 983cd53ed4c9..22aad412f965 100644 --- a/arch/x86/kernel/e820.c +++ b/arch/x86/kernel/e820.c @@ -305,6 +305,20 @@ static int __init cpcompare(const void *a, const void *b) return (ap->addr != ap->entry->addr) - (bp->addr != bp->entry->addr); } +static bool e820_nomerge(enum e820_type type) +{ + /* + * These types may indicate distinct platform ranges aligned to + * numa node, protection domain, performance domain, or other + * boundaries. Do not merge them. + */ + if (type == E820_TYPE_PRAM) + return true; + if (type == E820_TYPE_SOFT_RESERVED) + return true; + return false; +} + int __init e820__update_table(struct e820_table *table) { struct e820_entry *entries = table->entries; @@ -380,7 +394,7 @@ int __init e820__update_table(struct e820_table *table) } /* Continue building up new map based on this information: */ - if (current_type != last_type || current_type == E820_TYPE_PRAM) { + if (current_type != last_type || e820_nomerge(current_type)) { if (last_type != 0) { new_entries[new_nr_entries].size = change_point[chg_idx]->addr - last_addr; /* Move forward only if the new size was non-zero: */ diff --git a/drivers/firmware/efi/x86_fake_mem.c b/drivers/firmware/efi/x86_fake_mem.c index e5d6d5a1b240..0bafcc1bb0f6 100644 --- a/drivers/firmware/efi/x86_fake_mem.c +++ b/drivers/firmware/efi/x86_fake_mem.c @@ -38,7 +38,7 @@ void __init efi_fake_memmap_early(void) m_start = mem->range.start; m_end = mem->range.end; for_each_efi_memory_desc(md) { - u64 start, end; + u64 start, end, size; if (md->type != EFI_CONVENTIONAL_MEMORY) continue; @@ -58,11 +58,17 @@ void __init efi_fake_memmap_early(void) */ start = max(start, m_start); end = min(end, m_end); + size = end - start + 1; if (end <= start) continue; - e820__range_update(start, end - start + 1, E820_TYPE_RAM, - E820_TYPE_SOFT_RESERVED); + + /* + * Ensure each efi_fake_mem instance results in + * a unique e820 resource + */ + e820__range_remove(start, size, E820_TYPE_RAM, 1); + e820__range_add(start, size, E820_TYPE_SOFT_RESERVED); e820__update_table(e820_table); } } From c01044cc819160323f3ca4acd44fca487c4432e6 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 13 Oct 2020 16:49:13 -0700 Subject: [PATCH 029/181] ACPI: HMAT: refactor hmat_register_target_device to hmem_register_device MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In preparation for exposing "Soft Reserved" memory ranges without an HMAT, move the hmem device registration to its own compilation unit and make the implementation generic. The generic implementation drops usage acpi_map_pxm_to_online_node() that was translating ACPI proximity domain values and instead relies on numa_map_to_online_node() to determine the numa node for the device. [joao.m.martins@oracle.com: CONFIG_DEV_DAX_HMEM_DEVICES should depend on CONFIG_DAX=y] Link: https://lkml.kernel.org/r/8f34727f-ec2d-9395-cb18-969ec8a5d0d4@oracle.com Signed-off-by: Dan Williams Signed-off-by: Joao Martins Signed-off-by: Andrew Morton Cc: Andy Lutomirski Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Borislav Petkov Cc: Brice Goglin Cc: Catalin Marinas Cc: Daniel Vetter Cc: Dave Hansen Cc: Dave Jiang Cc: David Airlie Cc: David Hildenbrand Cc: Greg Kroah-Hartman Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Ira Weiny Cc: Jason Gunthorpe Cc: Jeff Moyer Cc: Jia He Cc: Joao Martins Cc: Jonathan Cameron Cc: Michael Ellerman Cc: Mike Rapoport Cc: Paul Mackerras Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: Rafael J. Wysocki Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Vishal Verma Cc: Wei Yang Cc: Will Deacon Cc: Ard Biesheuvel Cc: Ard Biesheuvel Cc: Bjorn Helgaas Cc: Boris Ostrovsky Cc: Hulk Robot Cc: Jason Yan Cc: "Jérôme Glisse" Cc: Juergen Gross Cc: kernel test robot Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Vivek Goyal Link: https://lkml.kernel.org/r/159643096584.4062302.5035370788475153738.stgit@dwillia2-desk3.amr.corp.intel.com Link: https://lore.kernel.org/r/158318761484.2216124.2049322072599482736.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- drivers/acpi/numa/hmat.c | 68 ++++------------------------------- drivers/dax/Kconfig | 4 +++ drivers/dax/Makefile | 3 +- drivers/dax/hmem/Makefile | 5 +++ drivers/dax/hmem/device.c | 65 +++++++++++++++++++++++++++++++++ drivers/dax/{ => hmem}/hmem.c | 2 +- include/linux/dax.h | 8 +++++ 7 files changed, 90 insertions(+), 65 deletions(-) create mode 100644 drivers/dax/hmem/Makefile create mode 100644 drivers/dax/hmem/device.c rename drivers/dax/{ => hmem}/hmem.c (98%) diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c index a12e36a12618..134bcb40b2af 100644 --- a/drivers/acpi/numa/hmat.c +++ b/drivers/acpi/numa/hmat.c @@ -24,6 +24,7 @@ #include #include #include +#include static u8 hmat_revision; static int hmat_disable __initdata; @@ -640,66 +641,6 @@ static void hmat_register_target_perf(struct memory_target *target) node_set_perf_attrs(mem_nid, &target->hmem_attrs, 0); } -static void hmat_register_target_device(struct memory_target *target, - struct resource *r) -{ - /* define a clean / non-busy resource for the platform device */ - struct resource res = { - .start = r->start, - .end = r->end, - .flags = IORESOURCE_MEM, - }; - struct platform_device *pdev; - struct memregion_info info; - int rc, id; - - rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM, - IORES_DESC_SOFT_RESERVED); - if (rc != REGION_INTERSECTS) - return; - - id = memregion_alloc(GFP_KERNEL); - if (id < 0) { - pr_err("memregion allocation failure for %pr\n", &res); - return; - } - - pdev = platform_device_alloc("hmem", id); - if (!pdev) { - pr_err("hmem device allocation failure for %pr\n", &res); - goto out_pdev; - } - - pdev->dev.numa_node = acpi_map_pxm_to_online_node(target->memory_pxm); - info = (struct memregion_info) { - .target_node = acpi_map_pxm_to_node(target->memory_pxm), - }; - rc = platform_device_add_data(pdev, &info, sizeof(info)); - if (rc < 0) { - pr_err("hmem memregion_info allocation failure for %pr\n", &res); - goto out_pdev; - } - - rc = platform_device_add_resources(pdev, &res, 1); - if (rc < 0) { - pr_err("hmem resource allocation failure for %pr\n", &res); - goto out_resource; - } - - rc = platform_device_add(pdev); - if (rc < 0) { - dev_err(&pdev->dev, "device add failed for %pr\n", &res); - goto out_resource; - } - - return; - -out_resource: - put_device(&pdev->dev); -out_pdev: - memregion_free(id); -} - static void hmat_register_target_devices(struct memory_target *target) { struct resource *res; @@ -711,8 +652,11 @@ static void hmat_register_target_devices(struct memory_target *target) if (!IS_ENABLED(CONFIG_DEV_DAX_HMEM)) return; - for (res = target->memregions.child; res; res = res->sibling) - hmat_register_target_device(target, res); + for (res = target->memregions.child; res; res = res->sibling) { + int target_nid = acpi_map_pxm_to_node(target->memory_pxm); + + hmem_register_device(target_nid, res); + } } static void hmat_register_target(struct memory_target *target) diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig index 3b6c06f07326..a66dff78f298 100644 --- a/drivers/dax/Kconfig +++ b/drivers/dax/Kconfig @@ -48,6 +48,10 @@ config DEV_DAX_HMEM Say M if unsure. +config DEV_DAX_HMEM_DEVICES + depends on DEV_DAX_HMEM && DAX=y + def_bool y + config DEV_DAX_KMEM tristate "KMEM DAX: volatile-use of persistent memory" default DEV_DAX diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile index 80065b38b3c4..9d4ba672d305 100644 --- a/drivers/dax/Makefile +++ b/drivers/dax/Makefile @@ -2,11 +2,10 @@ obj-$(CONFIG_DAX) += dax.o obj-$(CONFIG_DEV_DAX) += device_dax.o obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o -obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o dax-y := super.o dax-y += bus.o device_dax-y := device.o -dax_hmem-y := hmem.o obj-y += pmem/ +obj-y += hmem/ diff --git a/drivers/dax/hmem/Makefile b/drivers/dax/hmem/Makefile new file mode 100644 index 000000000000..a9d353d0c9ed --- /dev/null +++ b/drivers/dax/hmem/Makefile @@ -0,0 +1,5 @@ +# SPDX-License-Identifier: GPL-2.0 +obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o +obj-$(CONFIG_DEV_DAX_HMEM_DEVICES) += device.o + +dax_hmem-y := hmem.o diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c new file mode 100644 index 000000000000..b9dd6b27745c --- /dev/null +++ b/drivers/dax/hmem/device.c @@ -0,0 +1,65 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include +#include + +void hmem_register_device(int target_nid, struct resource *r) +{ + /* define a clean / non-busy resource for the platform device */ + struct resource res = { + .start = r->start, + .end = r->end, + .flags = IORESOURCE_MEM, + }; + struct platform_device *pdev; + struct memregion_info info; + int rc, id; + + rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM, + IORES_DESC_SOFT_RESERVED); + if (rc != REGION_INTERSECTS) + return; + + id = memregion_alloc(GFP_KERNEL); + if (id < 0) { + pr_err("memregion allocation failure for %pr\n", &res); + return; + } + + pdev = platform_device_alloc("hmem", id); + if (!pdev) { + pr_err("hmem device allocation failure for %pr\n", &res); + goto out_pdev; + } + + pdev->dev.numa_node = numa_map_to_online_node(target_nid); + info = (struct memregion_info) { + .target_node = target_nid, + }; + rc = platform_device_add_data(pdev, &info, sizeof(info)); + if (rc < 0) { + pr_err("hmem memregion_info allocation failure for %pr\n", &res); + goto out_pdev; + } + + rc = platform_device_add_resources(pdev, &res, 1); + if (rc < 0) { + pr_err("hmem resource allocation failure for %pr\n", &res); + goto out_resource; + } + + rc = platform_device_add(pdev); + if (rc < 0) { + dev_err(&pdev->dev, "device add failed for %pr\n", &res); + goto out_resource; + } + + return; + +out_resource: + put_device(&pdev->dev); +out_pdev: + memregion_free(id); +} diff --git a/drivers/dax/hmem.c b/drivers/dax/hmem/hmem.c similarity index 98% rename from drivers/dax/hmem.c rename to drivers/dax/hmem/hmem.c index fe7214daf62e..29ceb5795297 100644 --- a/drivers/dax/hmem.c +++ b/drivers/dax/hmem/hmem.c @@ -3,7 +3,7 @@ #include #include #include -#include "bus.h" +#include "../bus.h" static int dax_hmem_probe(struct platform_device *pdev) { diff --git a/include/linux/dax.h b/include/linux/dax.h index 43b39ab9de1a..4ec0bbf86205 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -238,4 +238,12 @@ static inline bool dax_mapping(struct address_space *mapping) return mapping->host && IS_DAX(mapping->host); } +#ifdef CONFIG_DEV_DAX_HMEM_DEVICES +void hmem_register_device(int target_nid, struct resource *r); +#else +static inline void hmem_register_device(int target_nid, struct resource *r) +{ +} +#endif + #endif From 73fb952d83717697910d981e27fe2c252f64662b Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 13 Oct 2020 16:49:18 -0700 Subject: [PATCH 030/181] resource: report parent to walk_iomem_res_desc() callback MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In support of detecting whether a resource might have been been claimed, report the parent to the walk_iomem_res_desc() callback. For example, the ACPI HMAT parser publishes "hmem" platform devices per target range. However, if the HMAT is disabled / missing a fallback driver can attach devices to the raw memory ranges as a fallback if it sees unclaimed / orphan "Soft Reserved" resources in the resource tree. Otherwise, find_next_iomem_res() returns a resource with garbage data from the stack allocation in __walk_iomem_res_desc() for the res->parent field. There are currently no users that expect ->child and ->sibling to be valid, and the resource_lock would be needed to traverse them. Use a compound literal to implicitly zero initialize the fields that are not being returned in addition to setting ->parent. Signed-off-by: Dan Williams Signed-off-by: Andrew Morton Cc: Jason Gunthorpe Cc: Dave Hansen Cc: Wei Yang Cc: Tom Lendacky Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Borislav Petkov Cc: Brice Goglin Cc: Catalin Marinas Cc: Daniel Vetter Cc: Dave Jiang Cc: David Airlie Cc: David Hildenbrand Cc: Greg Kroah-Hartman Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Ira Weiny Cc: Jeff Moyer Cc: Jia He Cc: Joao Martins Cc: Jonathan Cameron Cc: Michael Ellerman Cc: Mike Rapoport Cc: Paul Mackerras Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: Rafael J. Wysocki Cc: Thomas Gleixner Cc: Vishal Verma Cc: Will Deacon Cc: Ard Biesheuvel Cc: Bjorn Helgaas Cc: Boris Ostrovsky Cc: Hulk Robot Cc: Jason Yan Cc: "Jérôme Glisse" Cc: Juergen Gross Cc: kernel test robot Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Vivek Goyal Link: https://lkml.kernel.org/r/159643097166.4062302.11875688887228572793.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- kernel/resource.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/kernel/resource.c b/kernel/resource.c index 841737bbda9e..f1175ce93a1d 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -382,10 +382,13 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end, if (p) { /* copy data */ - res->start = max(start, p->start); - res->end = min(end, p->end); - res->flags = p->flags; - res->desc = p->desc; + *res = (struct resource) { + .start = max(start, p->start), + .end = min(end, p->end), + .flags = p->flags, + .desc = p->desc, + .parent = p->parent, + }; } read_unlock(&resource_lock); From a035b6bf863e5c42c2746de2a8ed6600140307e7 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 13 Oct 2020 16:49:23 -0700 Subject: [PATCH 031/181] mm/memory_hotplug: introduce default phys_to_target_node() implementation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In preparation to set a fallback value for dev_dax->target_node, introduce generic fallback helpers for phys_to_target_node() A generic implementation based on node-data or memblock was proposed, but as noted by Mike: "Here again, I would prefer to add a weak default for phys_to_target_node() because the "generic" implementation is not really generic. The fallback to reserved ranges is x86 specfic because on x86 most of the reserved areas is not in memblock.memory. AFAIK, no other architecture does this." The info message in the generic memory_add_physaddr_to_nid() implementation is fixed up to properly reflect that memory_add_physaddr_to_nid() communicates "online" node info and phys_to_target_node() indicates "target / to-be-onlined" node info. [akpm@linux-foundation.org: fix CONFIG_MEMORY_HOTPLUG=n build] Link: https://lkml.kernel.org/r/202008252130.7YrHIyMI%25lkp@intel.com Signed-off-by: Dan Williams Signed-off-by: Andrew Morton Cc: David Hildenbrand Cc: Mike Rapoport Cc: Jia He Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Borislav Petkov Cc: Brice Goglin Cc: Catalin Marinas Cc: Daniel Vetter Cc: Dave Hansen Cc: Dave Jiang Cc: David Airlie Cc: Greg Kroah-Hartman Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Ira Weiny Cc: Jason Gunthorpe Cc: Jeff Moyer Cc: Joao Martins Cc: Jonathan Cameron Cc: Michael Ellerman Cc: Paul Mackerras Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: Rafael J. Wysocki Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Vishal Verma Cc: Wei Yang Cc: Will Deacon Cc: Ard Biesheuvel Cc: Bjorn Helgaas Cc: Boris Ostrovsky Cc: Hulk Robot Cc: Jason Yan Cc: "Jérôme Glisse" Cc: Juergen Gross Cc: kernel test robot Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Vivek Goyal Link: https://lkml.kernel.org/r/159643097768.4062302.3135192588966888630.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- arch/x86/mm/numa.c | 1 - include/linux/memory_hotplug.h | 23 ++++++++++++++--------- include/linux/numa.h | 11 ----------- mm/memory_hotplug.c | 10 +++++++++- 4 files changed, 23 insertions(+), 22 deletions(-) diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index f3805bbaa784..c62e274d52d0 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -917,7 +917,6 @@ int phys_to_target_node(phys_addr_t start) return meminfo_to_nid(&numa_reserved_meminfo, start); } -EXPORT_SYMBOL_GPL(phys_to_target_node); int memory_add_physaddr_to_nid(u64 start) { diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 375515803cd8..c0faa7a30c46 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -149,15 +149,6 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages, struct mhp_params *params); #endif /* ARCH_HAS_ADD_PAGES */ -#ifdef CONFIG_NUMA -extern int memory_add_physaddr_to_nid(u64 start); -#else -static inline int memory_add_physaddr_to_nid(u64 start) -{ - return 0; -} -#endif - #ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION /* * For supporting node-hotadd, we have to allocate a new pgdat. @@ -284,6 +275,20 @@ static inline bool movable_node_is_enabled(void) } #endif /* ! CONFIG_MEMORY_HOTPLUG */ +#ifdef CONFIG_NUMA +extern int memory_add_physaddr_to_nid(u64 start); +extern int phys_to_target_node(u64 start); +#else +static inline int memory_add_physaddr_to_nid(u64 start) +{ + return 0; +} +static inline int phys_to_target_node(u64 start) +{ + return 0; +} +#endif + #if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_DEFERRED_STRUCT_PAGE_INIT) /* * pgdat resizing functions diff --git a/include/linux/numa.h b/include/linux/numa.h index a42df804679e..8cb33ccfb671 100644 --- a/include/linux/numa.h +++ b/include/linux/numa.h @@ -23,22 +23,11 @@ #ifdef CONFIG_NUMA /* Generic implementation available */ int numa_map_to_online_node(int node); - -/* - * Optional architecture specific implementation, users need a "depends - * on $ARCH" - */ -int phys_to_target_node(phys_addr_t addr); #else static inline int numa_map_to_online_node(int node) { return NUMA_NO_NODE; } - -static inline int phys_to_target_node(phys_addr_t addr) -{ - return NUMA_NO_NODE; -} #endif #endif /* _LINUX_NUMA_H */ diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index ce3e73e3a5c1..8e9e2d44cdad 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -353,11 +353,19 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages, #ifdef CONFIG_NUMA int __weak memory_add_physaddr_to_nid(u64 start) { - pr_info_once("Unknown target node for memory at 0x%llx, assuming node 0\n", + pr_info_once("Unknown online node for memory at 0x%llx, assuming node 0\n", start); return 0; } EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); + +int __weak phys_to_target_node(u64 start) +{ + pr_info_once("Unknown target node for memory at 0x%llx, assuming node 0\n", + start); + return 0; +} +EXPORT_SYMBOL_GPL(phys_to_target_node); #endif /* find the smallest valid pfn in the range [start_pfn, end_pfn) */ From 5ccac54f3e124a49789c3773d5a351e87470cf12 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 13 Oct 2020 16:49:28 -0700 Subject: [PATCH 032/181] ACPI: HMAT: attach a device for each soft-reserved range MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The hmem enabling in commit cf8741ac57ed ("ACPI: NUMA: HMAT: Register "soft reserved" memory as an "hmem" device") only registered ranges to the hmem driver for each soft-reservation that also appeared in the HMAT. While this is meant to encourage platform firmware to "do the right thing" and publish an HMAT, the corollary is that platforms that fail to publish an accurate HMAT will strand memory from Linux usage. Additionally, the "efi_fake_mem" kernel command line option enabling will strand memory by default without an HMAT. Arrange for "soft reserved" memory that goes unclaimed by HMAT entries to be published as raw resource ranges for the hmem driver to consume. Include a module parameter to disable either this fallback behavior, or the hmat enabling from creating hmem devices. The module parameter requires the hmem device enabling to have unique name in the module namespace: "device_hmem". The driver depends on the architecture providing phys_to_target_node() which is only x86 via numa_meminfo() and arm64 via a generic memblock implementation. [joao.m.martins@oracle.com: require NUMA_KEEP_MEMINFO for phys_to_target_node()] Link: https://lkml.kernel.org/r/aaae71a7-4846-f5cc-5acf-cf05fdb1f2dc@oracle.com Signed-off-by: Dan Williams Signed-off-by: Joao Martins Signed-off-by: Andrew Morton Reviewed-by: Joao Martins Cc: Jonathan Cameron Cc: Brice Goglin Cc: Jeff Moyer Cc: Catalin Marinas Cc: Will Deacon Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Borislav Petkov Cc: Daniel Vetter Cc: Dave Hansen Cc: Dave Jiang Cc: David Airlie Cc: David Hildenbrand Cc: Greg Kroah-Hartman Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Ira Weiny Cc: Jason Gunthorpe Cc: Jia He Cc: Michael Ellerman Cc: Mike Rapoport Cc: Paul Mackerras Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: Rafael J. Wysocki Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Vishal Verma Cc: Wei Yang Cc: Ard Biesheuvel Cc: Bjorn Helgaas Cc: Boris Ostrovsky Cc: Hulk Robot Cc: Jason Yan Cc: "Jérôme Glisse" Cc: Juergen Gross Cc: kernel test robot Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Vivek Goyal Link: https://lkml.kernel.org/r/159643098298.4062302.17587338161136144730.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- drivers/dax/Kconfig | 2 ++ drivers/dax/hmem/Makefile | 3 ++- drivers/dax/hmem/device.c | 35 +++++++++++++++++++++++++++++++++++ 3 files changed, 39 insertions(+), 1 deletion(-) diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig index a66dff78f298..567428e10b7b 100644 --- a/drivers/dax/Kconfig +++ b/drivers/dax/Kconfig @@ -35,6 +35,7 @@ config DEV_DAX_PMEM config DEV_DAX_HMEM tristate "HMEM DAX: direct access to 'specific purpose' memory" depends on EFI_SOFT_RESERVE + select NUMA_KEEP_MEMINFO if (NUMA && X86) default DEV_DAX help EFI 2.8 platforms, and others, may advertise 'specific purpose' @@ -49,6 +50,7 @@ config DEV_DAX_HMEM Say M if unsure. config DEV_DAX_HMEM_DEVICES + depends on NUMA_KEEP_MEMINFO # for phys_to_target_node() depends on DEV_DAX_HMEM && DAX=y def_bool y diff --git a/drivers/dax/hmem/Makefile b/drivers/dax/hmem/Makefile index a9d353d0c9ed..57377b4c3d47 100644 --- a/drivers/dax/hmem/Makefile +++ b/drivers/dax/hmem/Makefile @@ -1,5 +1,6 @@ # SPDX-License-Identifier: GPL-2.0 obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o -obj-$(CONFIG_DEV_DAX_HMEM_DEVICES) += device.o +obj-$(CONFIG_DEV_DAX_HMEM_DEVICES) += device_hmem.o +device_hmem-y := device.o dax_hmem-y := hmem.o diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c index b9dd6b27745c..cb6401c9e9a4 100644 --- a/drivers/dax/hmem/device.c +++ b/drivers/dax/hmem/device.c @@ -5,6 +5,9 @@ #include #include +static bool nohmem; +module_param_named(disable, nohmem, bool, 0444); + void hmem_register_device(int target_nid, struct resource *r) { /* define a clean / non-busy resource for the platform device */ @@ -17,6 +20,9 @@ void hmem_register_device(int target_nid, struct resource *r) struct memregion_info info; int rc, id; + if (nohmem) + return; + rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM, IORES_DESC_SOFT_RESERVED); if (rc != REGION_INTERSECTS) @@ -63,3 +69,32 @@ out_resource: out_pdev: memregion_free(id); } + +static __init int hmem_register_one(struct resource *res, void *data) +{ + /* + * If the resource is not a top-level resource it was already + * assigned to a device by the HMAT parsing. + */ + if (res->parent != &iomem_resource) { + pr_info("HMEM: skip %pr, already claimed\n", res); + return 0; + } + + hmem_register_device(phys_to_target_node(res->start), res); + + return 0; +} + +static __init int hmem_init(void) +{ + walk_iomem_res_desc(IORES_DESC_SOFT_RESERVED, + IORESOURCE_MEM, 0, -1, NULL, hmem_register_one); + return 0; +} + +/* + * As this is a fallback for address ranges unclaimed by the ACPI HMAT + * parsing it must be at an initcall level greater than hmat_init(). + */ +late_initcall(hmem_init); From ec826909981c0b3262681ed7021b959593426d46 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 13 Oct 2020 16:49:33 -0700 Subject: [PATCH 033/181] device-dax: drop the dax_region.pfn_flags attribute MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit All callers specify the same flags to alloc_dax_region(), so there is no need to allow for anything other than PFN_DEV|PFN_MAP, or carry a ->pfn_flags around on the region. Device-dax instances are always page backed. Signed-off-by: Dan Williams Signed-off-by: Andrew Morton Cc: Vishal Verma Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Borislav Petkov Cc: Brice Goglin Cc: Catalin Marinas Cc: Daniel Vetter Cc: Dave Hansen Cc: Dave Jiang Cc: David Airlie Cc: David Hildenbrand Cc: Greg Kroah-Hartman Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Ira Weiny Cc: Jason Gunthorpe Cc: Jeff Moyer Cc: Jia He Cc: Joao Martins Cc: Jonathan Cameron Cc: Michael Ellerman Cc: Mike Rapoport Cc: Paul Mackerras Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: "Rafael J. Wysocki" Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Wei Yang Cc: Will Deacon Cc: Ard Biesheuvel Cc: Bjorn Helgaas Cc: Boris Ostrovsky Cc: Hulk Robot Cc: Jason Yan Cc: "Jérôme Glisse" Cc: Juergen Gross Cc: kernel test robot Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Vivek Goyal Link: https://lkml.kernel.org/r/159643098829.4062302.13611520567669439046.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- drivers/dax/bus.c | 4 +--- drivers/dax/bus.h | 3 +-- drivers/dax/dax-private.h | 2 -- drivers/dax/device.c | 26 +++----------------------- drivers/dax/hmem/hmem.c | 2 +- drivers/dax/pmem/core.c | 3 +-- 6 files changed, 7 insertions(+), 33 deletions(-) diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index df238c8b6ef2..f06ffa66cd78 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -226,8 +226,7 @@ static void dax_region_unregister(void *region) } struct dax_region *alloc_dax_region(struct device *parent, int region_id, - struct resource *res, int target_node, unsigned int align, - unsigned long long pfn_flags) + struct resource *res, int target_node, unsigned int align) { struct dax_region *dax_region; @@ -251,7 +250,6 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id, dev_set_drvdata(parent, dax_region); memcpy(&dax_region->res, res, sizeof(*res)); - dax_region->pfn_flags = pfn_flags; kref_init(&dax_region->kref); dax_region->id = region_id; dax_region->align = align; diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h index 9e4eba67e8b9..55577e9791da 100644 --- a/drivers/dax/bus.h +++ b/drivers/dax/bus.h @@ -10,8 +10,7 @@ struct dax_device; struct dax_region; void dax_region_put(struct dax_region *dax_region); struct dax_region *alloc_dax_region(struct device *parent, int region_id, - struct resource *res, int target_node, unsigned int align, - unsigned long long flags); + struct resource *res, int target_node, unsigned int align); enum dev_dax_subsys { DEV_DAX_BUS, diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index 16850d5388ab..8a4c40ccd2ef 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -23,7 +23,6 @@ void dax_bus_exit(void); * @dev: parent device backing this region * @align: allocation and mapping alignment for child dax devices * @res: physical address range of the region - * @pfn_flags: identify whether the pfns are paged back or not */ struct dax_region { int id; @@ -32,7 +31,6 @@ struct dax_region { struct device *dev; unsigned int align; struct resource res; - unsigned long long pfn_flags; }; /** diff --git a/drivers/dax/device.c b/drivers/dax/device.c index 1e89513f3c59..c528b725789b 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -41,14 +41,6 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma, return -EINVAL; } - if ((dax_region->pfn_flags & (PFN_DEV|PFN_MAP)) == PFN_DEV - && (vma->vm_flags & VM_DONTCOPY) == 0) { - dev_info_ratelimited(dev, - "%s: %s: fail, dax range requires MADV_DONTFORK\n", - current->comm, func); - return -EINVAL; - } - if (!vma_is_dax(vma)) { dev_info_ratelimited(dev, "%s: %s: fail, vma is not DAX capable\n", @@ -102,7 +94,7 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax, return VM_FAULT_SIGBUS; } - *pfn = phys_to_pfn_t(phys, dax_region->pfn_flags); + *pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP); return vmf_insert_mixed(vmf->vma, vmf->address, *pfn); } @@ -127,12 +119,6 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax, return VM_FAULT_SIGBUS; } - /* dax pmd mappings require pfn_t_devmap() */ - if ((dax_region->pfn_flags & (PFN_DEV|PFN_MAP)) != (PFN_DEV|PFN_MAP)) { - dev_dbg(dev, "region lacks devmap flags\n"); - return VM_FAULT_SIGBUS; - } - if (fault_size < dax_region->align) return VM_FAULT_SIGBUS; else if (fault_size > dax_region->align) @@ -150,7 +136,7 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax, return VM_FAULT_SIGBUS; } - *pfn = phys_to_pfn_t(phys, dax_region->pfn_flags); + *pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP); return vmf_insert_pfn_pmd(vmf, *pfn, vmf->flags & FAULT_FLAG_WRITE); } @@ -177,12 +163,6 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax, return VM_FAULT_SIGBUS; } - /* dax pud mappings require pfn_t_devmap() */ - if ((dax_region->pfn_flags & (PFN_DEV|PFN_MAP)) != (PFN_DEV|PFN_MAP)) { - dev_dbg(dev, "region lacks devmap flags\n"); - return VM_FAULT_SIGBUS; - } - if (fault_size < dax_region->align) return VM_FAULT_SIGBUS; else if (fault_size > dax_region->align) @@ -200,7 +180,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax, return VM_FAULT_SIGBUS; } - *pfn = phys_to_pfn_t(phys, dax_region->pfn_flags); + *pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP); return vmf_insert_pfn_pud(vmf, *pfn, vmf->flags & FAULT_FLAG_WRITE); } diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c index 29ceb5795297..506893861253 100644 --- a/drivers/dax/hmem/hmem.c +++ b/drivers/dax/hmem/hmem.c @@ -22,7 +22,7 @@ static int dax_hmem_probe(struct platform_device *pdev) memcpy(&pgmap.res, res, sizeof(*res)); dax_region = alloc_dax_region(dev, pdev->id, res, mri->target_node, - PMD_SIZE, PFN_DEV|PFN_MAP); + PMD_SIZE); if (!dax_region) return -ENOMEM; diff --git a/drivers/dax/pmem/core.c b/drivers/dax/pmem/core.c index 2bedf8414fff..ea52bb77a294 100644 --- a/drivers/dax/pmem/core.c +++ b/drivers/dax/pmem/core.c @@ -53,8 +53,7 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys) memcpy(&res, &pgmap.res, sizeof(res)); res.start += offset; dax_region = alloc_dax_region(dev, region_id, &res, - nd_region->target_node, le32_to_cpu(pfn_sb->align), - PFN_DEV|PFN_MAP); + nd_region->target_node, le32_to_cpu(pfn_sb->align)); if (!dax_region) return ERR_PTR(-ENOMEM); From 174ebece379bad6331048560dc7f7abfdb8442ee Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 13 Oct 2020 16:49:38 -0700 Subject: [PATCH 034/181] device-dax: move instance creation parameters to 'struct dev_dax_data' MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In preparation for adding more parameters to instance creation, move existing parameters to a new struct. Signed-off-by: Dan Williams Signed-off-by: Andrew Morton Cc: Vishal Verma Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Borislav Petkov Cc: Brice Goglin Cc: Catalin Marinas Cc: Daniel Vetter Cc: Dave Hansen Cc: Dave Jiang Cc: David Airlie Cc: David Hildenbrand Cc: Greg Kroah-Hartman Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Ira Weiny Cc: Jason Gunthorpe Cc: Jeff Moyer Cc: Jia He Cc: Joao Martins Cc: Jonathan Cameron Cc: Michael Ellerman Cc: Mike Rapoport Cc: Paul Mackerras Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: "Rafael J. Wysocki" Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Wei Yang Cc: Will Deacon Cc: Ard Biesheuvel Cc: Bjorn Helgaas Cc: Boris Ostrovsky Cc: Hulk Robot Cc: Jason Yan Cc: "Jérôme Glisse" Cc: Juergen Gross Cc: kernel test robot Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Vivek Goyal Link: https://lkml.kernel.org/r/159643099411.4062302.1337305960720423895.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- drivers/dax/bus.c | 14 +++++++------- drivers/dax/bus.h | 16 ++++++++-------- drivers/dax/hmem/hmem.c | 8 +++++++- drivers/dax/pmem/core.c | 9 ++++++++- 4 files changed, 30 insertions(+), 17 deletions(-) diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index f06ffa66cd78..dffa4655e128 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -395,9 +395,9 @@ static void unregister_dev_dax(void *dev) put_device(dev); } -struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id, - struct dev_pagemap *pgmap, enum dev_dax_subsys subsys) +struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data) { + struct dax_region *dax_region = data->dax_region; struct device *parent = dax_region->dev; struct dax_device *dax_dev; struct dev_dax *dev_dax; @@ -405,14 +405,14 @@ struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id, struct device *dev; int rc = -ENOMEM; - if (id < 0) + if (data->id < 0) return ERR_PTR(-EINVAL); dev_dax = kzalloc(sizeof(*dev_dax), GFP_KERNEL); if (!dev_dax) return ERR_PTR(-ENOMEM); - memcpy(&dev_dax->pgmap, pgmap, sizeof(*pgmap)); + memcpy(&dev_dax->pgmap, data->pgmap, sizeof(struct dev_pagemap)); /* * No 'host' or dax_operations since there is no access to this @@ -438,13 +438,13 @@ struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id, inode = dax_inode(dax_dev); dev->devt = inode->i_rdev; - if (subsys == DEV_DAX_BUS) + if (data->subsys == DEV_DAX_BUS) dev->bus = &dax_bus_type; else dev->class = dax_class; dev->parent = parent; dev->type = &dev_dax_type; - dev_set_name(dev, "dax%d.%d", dax_region->id, id); + dev_set_name(dev, "dax%d.%d", dax_region->id, data->id); rc = device_add(dev); if (rc) { @@ -464,7 +464,7 @@ struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id, return ERR_PTR(rc); } -EXPORT_SYMBOL_GPL(__devm_create_dev_dax); +EXPORT_SYMBOL_GPL(devm_create_dev_dax); static int match_always_count; diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h index 55577e9791da..299c2e7fac09 100644 --- a/drivers/dax/bus.h +++ b/drivers/dax/bus.h @@ -13,18 +13,18 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id, struct resource *res, int target_node, unsigned int align); enum dev_dax_subsys { - DEV_DAX_BUS, + DEV_DAX_BUS = 0, /* zeroed dev_dax_data picks this by default */ DEV_DAX_CLASS, }; -struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id, - struct dev_pagemap *pgmap, enum dev_dax_subsys subsys); +struct dev_dax_data { + struct dax_region *dax_region; + struct dev_pagemap *pgmap; + enum dev_dax_subsys subsys; + int id; +}; -static inline struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region, - int id, struct dev_pagemap *pgmap) -{ - return __devm_create_dev_dax(dax_region, id, pgmap, DEV_DAX_BUS); -} +struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data); /* to be deleted when DEV_DAX_CLASS is removed */ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys); diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c index 506893861253..b84fe17178d8 100644 --- a/drivers/dax/hmem/hmem.c +++ b/drivers/dax/hmem/hmem.c @@ -11,6 +11,7 @@ static int dax_hmem_probe(struct platform_device *pdev) struct dev_pagemap pgmap = { }; struct dax_region *dax_region; struct memregion_info *mri; + struct dev_dax_data data; struct dev_dax *dev_dax; struct resource *res; @@ -26,7 +27,12 @@ static int dax_hmem_probe(struct platform_device *pdev) if (!dax_region) return -ENOMEM; - dev_dax = devm_create_dev_dax(dax_region, 0, &pgmap); + data = (struct dev_dax_data) { + .dax_region = dax_region, + .id = 0, + .pgmap = &pgmap, + }; + dev_dax = devm_create_dev_dax(&data); if (IS_ERR(dev_dax)) return PTR_ERR(dev_dax); diff --git a/drivers/dax/pmem/core.c b/drivers/dax/pmem/core.c index ea52bb77a294..08ee5947a49c 100644 --- a/drivers/dax/pmem/core.c +++ b/drivers/dax/pmem/core.c @@ -14,6 +14,7 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys) resource_size_t offset; struct nd_pfn_sb *pfn_sb; struct dev_dax *dev_dax; + struct dev_dax_data data; struct nd_namespace_io *nsio; struct dax_region *dax_region; struct dev_pagemap pgmap = { }; @@ -57,7 +58,13 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys) if (!dax_region) return ERR_PTR(-ENOMEM); - dev_dax = __devm_create_dev_dax(dax_region, id, &pgmap, subsys); + data = (struct dev_dax_data) { + .dax_region = dax_region, + .id = id, + .pgmap = &pgmap, + .subsys = subsys, + }; + dev_dax = devm_create_dev_dax(&data); /* child dev_dax instances now own the lifetime of the dax_region */ dax_region_put(dax_region); From f5516ec5efb9fe0f426a46eeef25d389d3c2f988 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 13 Oct 2020 16:49:43 -0700 Subject: [PATCH 035/181] device-dax: make pgmap optional for instance creation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The passed in dev_pagemap is only required in the pmem case as the libnvdimm core may have reserved a vmem_altmap for dev_memremap_pages() to place the memmap in pmem directly. In the hmem case there is no agent reserving an altmap so it can all be handled by a core internal default. Pass the resource range via a new @range property of 'struct dev_dax_data'. Signed-off-by: Dan Williams Signed-off-by: Andrew Morton Cc: David Hildenbrand Cc: Vishal Verma Cc: Dave Hansen Cc: Pavel Tatashin Cc: Brice Goglin Cc: Dave Jiang Cc: Ira Weiny Cc: Jia He Cc: Joao Martins Cc: Jonathan Cameron Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Ard Biesheuvel Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Bjorn Helgaas Cc: Borislav Petkov Cc: Boris Ostrovsky Cc: Catalin Marinas Cc: Daniel Vetter Cc: David Airlie Cc: Greg Kroah-Hartman Cc: "H. Peter Anvin" Cc: Hulk Robot Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Jason Yan Cc: Jeff Moyer Cc: "Jérôme Glisse" Cc: Juergen Gross Cc: kernel test robot Cc: Michael Ellerman Cc: Mike Rapoport Cc: Paul Mackerras Cc: Peter Zijlstra Cc: "Rafael J. Wysocki" Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Vivek Goyal Cc: Wei Yang Cc: Will Deacon Link: https://lkml.kernel.org/r/159643099958.4062302.10379230791041872886.stgit@dwillia2-desk3.amr.corp.intel.com Link: https://lkml.kernel.org/r/160106110513.30709.4303239334850606031.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- drivers/dax/bus.c | 29 +++++++++++++++-------------- drivers/dax/bus.h | 2 ++ drivers/dax/dax-private.h | 9 ++++++++- drivers/dax/device.c | 28 +++++++++++++++++++--------- drivers/dax/hmem/hmem.c | 8 ++++---- drivers/dax/kmem.c | 12 ++++++------ drivers/dax/pmem/core.c | 4 ++++ tools/testing/nvdimm/dax-dev.c | 8 ++++---- 8 files changed, 62 insertions(+), 38 deletions(-) diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index dffa4655e128..96bd64ba95a5 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -271,7 +271,7 @@ static ssize_t size_show(struct device *dev, struct device_attribute *attr, char *buf) { struct dev_dax *dev_dax = to_dev_dax(dev); - unsigned long long size = resource_size(&dev_dax->region->res); + unsigned long long size = range_len(&dev_dax->range); return sprintf(buf, "%llu\n", size); } @@ -293,19 +293,12 @@ static ssize_t target_node_show(struct device *dev, } static DEVICE_ATTR_RO(target_node); -static unsigned long long dev_dax_resource(struct dev_dax *dev_dax) -{ - struct dax_region *dax_region = dev_dax->region; - - return dax_region->res.start; -} - static ssize_t resource_show(struct device *dev, struct device_attribute *attr, char *buf) { struct dev_dax *dev_dax = to_dev_dax(dev); - return sprintf(buf, "%#llx\n", dev_dax_resource(dev_dax)); + return sprintf(buf, "%#llx\n", dev_dax->range.start); } static DEVICE_ATTR(resource, 0400, resource_show, NULL); @@ -376,6 +369,7 @@ static void dev_dax_release(struct device *dev) dax_region_put(dax_region); put_dax(dax_dev); + kfree(dev_dax->pgmap); kfree(dev_dax); } @@ -412,7 +406,12 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data) if (!dev_dax) return ERR_PTR(-ENOMEM); - memcpy(&dev_dax->pgmap, data->pgmap, sizeof(struct dev_pagemap)); + if (data->pgmap) { + dev_dax->pgmap = kmemdup(data->pgmap, + sizeof(struct dev_pagemap), GFP_KERNEL); + if (!dev_dax->pgmap) + goto err_pgmap; + } /* * No 'host' or dax_operations since there is no access to this @@ -421,18 +420,19 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data) dax_dev = alloc_dax(dev_dax, NULL, NULL, DAXDEV_F_SYNC); if (IS_ERR(dax_dev)) { rc = PTR_ERR(dax_dev); - goto err; + goto err_alloc_dax; } /* a device_dax instance is dead while the driver is not attached */ kill_dax(dax_dev); - /* from here on we're committed to teardown via dax_dev_release() */ + /* from here on we're committed to teardown via dev_dax_release() */ dev = &dev_dax->dev; device_initialize(dev); dev_dax->dax_dev = dax_dev; dev_dax->region = dax_region; + dev_dax->range = data->range; dev_dax->target_node = dax_region->target_node; kref_get(&dax_region->kref); @@ -458,8 +458,9 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data) return ERR_PTR(rc); return dev_dax; - - err: +err_alloc_dax: + kfree(dev_dax->pgmap); +err_pgmap: kfree(dev_dax); return ERR_PTR(rc); diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h index 299c2e7fac09..4aeb36da83a4 100644 --- a/drivers/dax/bus.h +++ b/drivers/dax/bus.h @@ -3,6 +3,7 @@ #ifndef __DAX_BUS_H__ #define __DAX_BUS_H__ #include +#include struct dev_dax; struct resource; @@ -21,6 +22,7 @@ struct dev_dax_data { struct dax_region *dax_region; struct dev_pagemap *pgmap; enum dev_dax_subsys subsys; + struct range range; int id; }; diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index 8a4c40ccd2ef..6779f683671d 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -41,6 +41,7 @@ struct dax_region { * @target_node: effective numa node if dev_dax memory range is onlined * @dev - device core * @pgmap - pgmap for memmap setup / lifetime (driver owned) + * @range: resource range for the instance * @dax_mem_res: physical address range of hotadded DAX memory * @dax_mem_name: name for hotadded DAX memory via add_memory_driver_managed() */ @@ -49,10 +50,16 @@ struct dev_dax { struct dax_device *dax_dev; int target_node; struct device dev; - struct dev_pagemap pgmap; + struct dev_pagemap *pgmap; + struct range range; struct resource *dax_kmem_res; }; +static inline u64 range_len(struct range *range) +{ + return range->end - range->start + 1; +} + static inline struct dev_dax *to_dev_dax(struct device *dev) { return container_of(dev, struct dev_dax, dev); diff --git a/drivers/dax/device.c b/drivers/dax/device.c index c528b725789b..287cf0a3db23 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -55,12 +55,12 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma, __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, unsigned long size) { - struct resource *res = &dev_dax->region->res; + struct range *range = &dev_dax->range; phys_addr_t phys; - phys = pgoff * PAGE_SIZE + res->start; - if (phys >= res->start && phys <= res->end) { - if (phys + size - 1 <= res->end) + phys = pgoff * PAGE_SIZE + range->start; + if (phys >= range->start && phys <= range->end) { + if (phys + size - 1 <= range->end) return phys; } @@ -396,21 +396,31 @@ int dev_dax_probe(struct device *dev) { struct dev_dax *dev_dax = to_dev_dax(dev); struct dax_device *dax_dev = dev_dax->dax_dev; - struct resource *res = &dev_dax->region->res; + struct range *range = &dev_dax->range; + struct dev_pagemap *pgmap; struct inode *inode; struct cdev *cdev; void *addr; int rc; /* 1:1 map region resource range to device-dax instance range */ - if (!devm_request_mem_region(dev, res->start, resource_size(res), + if (!devm_request_mem_region(dev, range->start, range_len(range), dev_name(dev))) { - dev_warn(dev, "could not reserve region %pR\n", res); + dev_warn(dev, "could not reserve range: %#llx - %#llx\n", + range->start, range->end); return -EBUSY; } - dev_dax->pgmap.type = MEMORY_DEVICE_GENERIC; - addr = devm_memremap_pages(dev, &dev_dax->pgmap); + pgmap = dev_dax->pgmap; + if (!pgmap) { + pgmap = devm_kzalloc(dev, sizeof(*pgmap), GFP_KERNEL); + if (!pgmap) + return -ENOMEM; + pgmap->res.start = range->start; + pgmap->res.end = range->end; + } + pgmap->type = MEMORY_DEVICE_GENERIC; + addr = devm_memremap_pages(dev, pgmap); if (IS_ERR(addr)) return PTR_ERR(addr); diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c index b84fe17178d8..af82d6ba820a 100644 --- a/drivers/dax/hmem/hmem.c +++ b/drivers/dax/hmem/hmem.c @@ -8,7 +8,6 @@ static int dax_hmem_probe(struct platform_device *pdev) { struct device *dev = &pdev->dev; - struct dev_pagemap pgmap = { }; struct dax_region *dax_region; struct memregion_info *mri; struct dev_dax_data data; @@ -20,8 +19,6 @@ static int dax_hmem_probe(struct platform_device *pdev) return -ENOMEM; mri = dev->platform_data; - memcpy(&pgmap.res, res, sizeof(*res)); - dax_region = alloc_dax_region(dev, pdev->id, res, mri->target_node, PMD_SIZE); if (!dax_region) @@ -30,7 +27,10 @@ static int dax_hmem_probe(struct platform_device *pdev) data = (struct dev_dax_data) { .dax_region = dax_region, .id = 0, - .pgmap = &pgmap, + .range = { + .start = res->start, + .end = res->end, + }, }; dev_dax = devm_create_dev_dax(&data); if (IS_ERR(dev_dax)) diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index 275aa5f87399..5bb133df147d 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -22,7 +22,7 @@ static bool any_hotremove_failed; int dev_dax_kmem_probe(struct device *dev) { struct dev_dax *dev_dax = to_dev_dax(dev); - struct resource *res = &dev_dax->region->res; + struct range *range = &dev_dax->range; resource_size_t kmem_start; resource_size_t kmem_size; resource_size_t kmem_end; @@ -39,17 +39,17 @@ int dev_dax_kmem_probe(struct device *dev) */ numa_node = dev_dax->target_node; if (numa_node < 0) { - dev_warn(dev, "rejecting DAX region %pR with invalid node: %d\n", - res, numa_node); + dev_warn(dev, "rejecting DAX region with invalid node: %d\n", + numa_node); return -EINVAL; } /* Hotplug starting at the beginning of the next block: */ - kmem_start = ALIGN(res->start, memory_block_size_bytes()); + kmem_start = ALIGN(range->start, memory_block_size_bytes()); - kmem_size = resource_size(res); + kmem_size = range_len(range); /* Adjust the size down to compensate for moving up kmem_start: */ - kmem_size -= kmem_start - res->start; + kmem_size -= kmem_start - range->start; /* Align the size down to cover only complete blocks: */ kmem_size &= ~(memory_block_size_bytes() - 1); kmem_end = kmem_start + kmem_size; diff --git a/drivers/dax/pmem/core.c b/drivers/dax/pmem/core.c index 08ee5947a49c..4fa81d3d2f65 100644 --- a/drivers/dax/pmem/core.c +++ b/drivers/dax/pmem/core.c @@ -63,6 +63,10 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys) .id = id, .pgmap = &pgmap, .subsys = subsys, + .range = { + .start = res.start, + .end = res.end, + }, }; dev_dax = devm_create_dev_dax(&data); diff --git a/tools/testing/nvdimm/dax-dev.c b/tools/testing/nvdimm/dax-dev.c index 7e5d979e73cb..38d8e55c4a0d 100644 --- a/tools/testing/nvdimm/dax-dev.c +++ b/tools/testing/nvdimm/dax-dev.c @@ -9,12 +9,12 @@ phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, unsigned long size) { - struct resource *res = &dev_dax->region->res; + struct range *range = &dev_dax->range; phys_addr_t addr; - addr = pgoff * PAGE_SIZE + res->start; - if (addr >= res->start && addr <= res->end) { - if (addr + size - 1 <= res->end) { + addr = pgoff * PAGE_SIZE + range->start; + if (addr >= range->start && addr <= range->end) { + if (addr + size - 1 <= range->end) { if (get_nfit_res(addr)) { struct page *page; From 59bc8d10dc417884a3bc18a092a62e13645ed044 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 13 Oct 2020 16:49:48 -0700 Subject: [PATCH 036/181] device-dax/kmem: introduce dax_kmem_range() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Towards removing the mode specific @dax_kmem_res attribute from the generic 'struct dev_dax', and preparing for multi-range support, teach the driver to calculate the hotplug range from the device range. The hotplug range is the trivially calculated memory-block-size aligned version of the device range. Signed-off-by: Dan Williams Signed-off-by: Andrew Morton Cc: David Hildenbrand Cc: Vishal Verma Cc: Dave Hansen Cc: Pavel Tatashin Cc: Brice Goglin Cc: Dave Jiang Cc: Ira Weiny Cc: Jia He Cc: Joao Martins Cc: Jonathan Cameron Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Ard Biesheuvel Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Bjorn Helgaas Cc: Borislav Petkov Cc: Boris Ostrovsky Cc: Catalin Marinas Cc: Daniel Vetter Cc: David Airlie Cc: Greg Kroah-Hartman Cc: "H. Peter Anvin" Cc: Hulk Robot Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Jason Yan Cc: Jeff Moyer Cc: "Jérôme Glisse" Cc: Juergen Gross Cc: kernel test robot Cc: Michael Ellerman Cc: Mike Rapoport Cc: Paul Mackerras Cc: Peter Zijlstra Cc: "Rafael J. Wysocki" Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Vivek Goyal Cc: Wei Yang Cc: Will Deacon Link: https://lkml.kernel.org/r/160106111109.30709.3173462396758431559.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- drivers/dax/kmem.c | 40 +++++++++++++++++----------------------- 1 file changed, 17 insertions(+), 23 deletions(-) diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index 5bb133df147d..b0d6a99cf12d 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -19,13 +19,20 @@ static const char *kmem_name; /* Set if any memory will remain added when the driver will be unloaded. */ static bool any_hotremove_failed; +static struct range dax_kmem_range(struct dev_dax *dev_dax) +{ + struct range range; + + /* memory-block align the hotplug range */ + range.start = ALIGN(dev_dax->range.start, memory_block_size_bytes()); + range.end = ALIGN_DOWN(dev_dax->range.end + 1, memory_block_size_bytes()) - 1; + return range; +} + int dev_dax_kmem_probe(struct device *dev) { struct dev_dax *dev_dax = to_dev_dax(dev); - struct range *range = &dev_dax->range; - resource_size_t kmem_start; - resource_size_t kmem_size; - resource_size_t kmem_end; + struct range range = dax_kmem_range(dev_dax); struct resource *new_res; const char *new_res_name; int numa_node; @@ -44,25 +51,14 @@ int dev_dax_kmem_probe(struct device *dev) return -EINVAL; } - /* Hotplug starting at the beginning of the next block: */ - kmem_start = ALIGN(range->start, memory_block_size_bytes()); - - kmem_size = range_len(range); - /* Adjust the size down to compensate for moving up kmem_start: */ - kmem_size -= kmem_start - range->start; - /* Align the size down to cover only complete blocks: */ - kmem_size &= ~(memory_block_size_bytes() - 1); - kmem_end = kmem_start + kmem_size; - new_res_name = kstrdup(dev_name(dev), GFP_KERNEL); if (!new_res_name) return -ENOMEM; /* Region is permanently reserved if hotremove fails. */ - new_res = request_mem_region(kmem_start, kmem_size, new_res_name); + new_res = request_mem_region(range.start, range_len(&range), new_res_name); if (!new_res) { - dev_warn(dev, "could not reserve region [%pa-%pa]\n", - &kmem_start, &kmem_end); + dev_warn(dev, "could not reserve region [%#llx-%#llx]\n", range.start, range.end); kfree(new_res_name); return -EBUSY; } @@ -96,9 +92,8 @@ int dev_dax_kmem_probe(struct device *dev) static int dev_dax_kmem_remove(struct device *dev) { struct dev_dax *dev_dax = to_dev_dax(dev); + struct range range = dax_kmem_range(dev_dax); struct resource *res = dev_dax->dax_kmem_res; - resource_size_t kmem_start = res->start; - resource_size_t kmem_size = resource_size(res); const char *res_name = res->name; int rc; @@ -108,12 +103,11 @@ static int dev_dax_kmem_remove(struct device *dev) * there is no way to hotremove this memory until reboot because device * unbind will succeed even if we return failure. */ - rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size); + rc = remove_memory(dev_dax->target_node, range.start, range_len(&range)); if (rc) { any_hotremove_failed = true; - dev_err(dev, - "DAX region %pR cannot be hotremoved until the next reboot\n", - res); + dev_err(dev, "%#llx-%#llx cannot be hotremoved until the next reboot\n", + range.start, range.end); return rc; } From 7e6b431aaef8b611a5adcd7f18fe089ff0d7bb59 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 13 Oct 2020 16:49:53 -0700 Subject: [PATCH 037/181] device-dax/kmem: move resource name tracking to drvdata MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Towards removing the mode specific @dax_kmem_res attribute from the generic 'struct dev_dax', and preparing for multi-range support, move resource name tracking to driver data. The memory for the resource name needs to have its own lifetime separate from the device bind lifetime for cases where the driver is unbound, but the kmem range could not be unplugged from the page allocator. Signed-off-by: Dan Williams Signed-off-by: Andrew Morton Cc: David Hildenbrand Cc: Vishal Verma Cc: Dave Hansen Cc: Pavel Tatashin Cc: Brice Goglin Cc: Dave Jiang Cc: Ira Weiny Cc: Jia He Cc: Joao Martins Cc: Jonathan Cameron Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Ard Biesheuvel Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Bjorn Helgaas Cc: Borislav Petkov Cc: Boris Ostrovsky Cc: Catalin Marinas Cc: Daniel Vetter Cc: David Airlie Cc: Greg Kroah-Hartman Cc: "H. Peter Anvin" Cc: Hulk Robot Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Jason Yan Cc: Jeff Moyer Cc: "Jérôme Glisse" Cc: Juergen Gross Cc: kernel test robot Cc: Michael Ellerman Cc: Mike Rapoport Cc: Paul Mackerras Cc: Peter Zijlstra Cc: "Rafael J. Wysocki" Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Vivek Goyal Cc: Wei Yang Cc: Will Deacon Link: https://lkml.kernel.org/r/160106111639.30709.17624822766862009183.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- drivers/dax/kmem.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index b0d6a99cf12d..6fe2cb1c5f7c 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -34,7 +34,7 @@ int dev_dax_kmem_probe(struct device *dev) struct dev_dax *dev_dax = to_dev_dax(dev); struct range range = dax_kmem_range(dev_dax); struct resource *new_res; - const char *new_res_name; + char *res_name; int numa_node; int rc; @@ -51,15 +51,15 @@ int dev_dax_kmem_probe(struct device *dev) return -EINVAL; } - new_res_name = kstrdup(dev_name(dev), GFP_KERNEL); - if (!new_res_name) + res_name = kstrdup(dev_name(dev), GFP_KERNEL); + if (!res_name) return -ENOMEM; /* Region is permanently reserved if hotremove fails. */ - new_res = request_mem_region(range.start, range_len(&range), new_res_name); + new_res = request_mem_region(range.start, range_len(&range), res_name); if (!new_res) { dev_warn(dev, "could not reserve region [%#llx-%#llx]\n", range.start, range.end); - kfree(new_res_name); + kfree(res_name); return -EBUSY; } @@ -80,9 +80,11 @@ int dev_dax_kmem_probe(struct device *dev) if (rc) { release_resource(new_res); kfree(new_res); - kfree(new_res_name); + kfree(res_name); return rc; } + + dev_set_drvdata(dev, res_name); dev_dax->dax_kmem_res = new_res; return 0; @@ -94,7 +96,7 @@ static int dev_dax_kmem_remove(struct device *dev) struct dev_dax *dev_dax = to_dev_dax(dev); struct range range = dax_kmem_range(dev_dax); struct resource *res = dev_dax->dax_kmem_res; - const char *res_name = res->name; + const char *res_name = dev_get_drvdata(dev); int rc; /* From 0513bd5bb11456d45250c9283e1cb52533125180 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 13 Oct 2020 16:49:58 -0700 Subject: [PATCH 038/181] device-dax/kmem: replace release_resource() with release_mem_region() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Towards removing the mode specific @dax_kmem_res attribute from the generic 'struct dev_dax', and preparing for multi-range support, change the kmem driver to use the idiomatic release_mem_region() to pair with the initial request_mem_region(). This also eliminates the need to open code the release of the resource allocated by request_mem_region(). As there are no more dax_kmem_res users, delete this struct member. Signed-off-by: Dan Williams Signed-off-by: Andrew Morton Cc: David Hildenbrand Cc: Vishal Verma Cc: Dave Hansen Cc: Pavel Tatashin Cc: Brice Goglin Cc: Dave Jiang Cc: Ira Weiny Cc: Jia He Cc: Joao Martins Cc: Jonathan Cameron Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Ard Biesheuvel Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Bjorn Helgaas Cc: Borislav Petkov Cc: Boris Ostrovsky Cc: Catalin Marinas Cc: Daniel Vetter Cc: David Airlie Cc: Greg Kroah-Hartman Cc: "H. Peter Anvin" Cc: Hulk Robot Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Jason Yan Cc: Jeff Moyer Cc: "Jérôme Glisse" Cc: Juergen Gross Cc: kernel test robot Cc: Michael Ellerman Cc: Mike Rapoport Cc: Paul Mackerras Cc: Peter Zijlstra Cc: "Rafael J. Wysocki" Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Vivek Goyal Cc: Wei Yang Cc: Will Deacon Link: https://lkml.kernel.org/r/160106112239.30709.15909567572288425294.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- drivers/dax/dax-private.h | 3 --- drivers/dax/kmem.c | 20 +++++++------------- 2 files changed, 7 insertions(+), 16 deletions(-) diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index 6779f683671d..12a2dbc43b40 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -42,8 +42,6 @@ struct dax_region { * @dev - device core * @pgmap - pgmap for memmap setup / lifetime (driver owned) * @range: resource range for the instance - * @dax_mem_res: physical address range of hotadded DAX memory - * @dax_mem_name: name for hotadded DAX memory via add_memory_driver_managed() */ struct dev_dax { struct dax_region *region; @@ -52,7 +50,6 @@ struct dev_dax { struct device dev; struct dev_pagemap *pgmap; struct range range; - struct resource *dax_kmem_res; }; static inline u64 range_len(struct range *range) diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index 6fe2cb1c5f7c..e56fc688bdc5 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -33,7 +33,7 @@ int dev_dax_kmem_probe(struct device *dev) { struct dev_dax *dev_dax = to_dev_dax(dev); struct range range = dax_kmem_range(dev_dax); - struct resource *new_res; + struct resource *res; char *res_name; int numa_node; int rc; @@ -56,8 +56,8 @@ int dev_dax_kmem_probe(struct device *dev) return -ENOMEM; /* Region is permanently reserved if hotremove fails. */ - new_res = request_mem_region(range.start, range_len(&range), res_name); - if (!new_res) { + res = request_mem_region(range.start, range_len(&range), res_name); + if (!res) { dev_warn(dev, "could not reserve region [%#llx-%#llx]\n", range.start, range.end); kfree(res_name); return -EBUSY; @@ -69,23 +69,20 @@ int dev_dax_kmem_probe(struct device *dev) * inherit flags from the parent since it may set new flags * unknown to us that will break add_memory() below. */ - new_res->flags = IORESOURCE_SYSTEM_RAM; + res->flags = IORESOURCE_SYSTEM_RAM; /* * Ensure that future kexec'd kernels will not treat this as RAM * automatically. */ - rc = add_memory_driver_managed(numa_node, new_res->start, - resource_size(new_res), kmem_name); + rc = add_memory_driver_managed(numa_node, range.start, range_len(&range), kmem_name); if (rc) { - release_resource(new_res); - kfree(new_res); + release_mem_region(range.start, range_len(&range)); kfree(res_name); return rc; } dev_set_drvdata(dev, res_name); - dev_dax->dax_kmem_res = new_res; return 0; } @@ -95,7 +92,6 @@ static int dev_dax_kmem_remove(struct device *dev) { struct dev_dax *dev_dax = to_dev_dax(dev); struct range range = dax_kmem_range(dev_dax); - struct resource *res = dev_dax->dax_kmem_res; const char *res_name = dev_get_drvdata(dev); int rc; @@ -114,10 +110,8 @@ static int dev_dax_kmem_remove(struct device *dev) } /* Release and free dax resources */ - release_resource(res); - kfree(res); + release_mem_region(range.start, range_len(&range)); kfree(res_name); - dev_dax->dax_kmem_res = NULL; return 0; } From c2f3011ee697f85ba0166fb3780332aafc66b8f4 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 13 Oct 2020 16:50:03 -0700 Subject: [PATCH 039/181] device-dax: add an allocation interface for device-dax instances MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In preparation for a facility that enables dax regions to be sub-divided, introduce infrastructure to track and allocate region capacity. The new dax_region/available_size attribute is only enabled for volatile hmem devices, not pmem devices that are defined by nvdimm namespace boundaries. This is per Jeff's feedback the last time dynamic device-dax capacity allocation support was discussed. Signed-off-by: Dan Williams Signed-off-by: Andrew Morton Cc: Vishal Verma Cc: Brice Goglin Cc: Dave Hansen Cc: Dave Jiang Cc: David Hildenbrand Cc: Ira Weiny Cc: Jia He Cc: Joao Martins Cc: Jonathan Cameron Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Ard Biesheuvel Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Bjorn Helgaas Cc: Borislav Petkov Cc: Boris Ostrovsky Cc: Catalin Marinas Cc: Daniel Vetter Cc: David Airlie Cc: Greg Kroah-Hartman Cc: "H. Peter Anvin" Cc: Hulk Robot Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Jason Yan Cc: Jeff Moyer Cc: "Jérôme Glisse" Cc: Juergen Gross Cc: kernel test robot Cc: Michael Ellerman Cc: Mike Rapoport Cc: Paul Mackerras Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: "Rafael J. Wysocki" Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Vivek Goyal Cc: Wei Yang Cc: Will Deacon Link: https://lore.kernel.org/linux-nvdimm/x49shpp3zn8.fsf@segfault.boston.devel.redhat.com Link: https://lkml.kernel.org/r/159643101035.4062302.6785857915652647857.stgit@dwillia2-desk3.amr.corp.intel.com Link: https://lkml.kernel.org/r/160106112801.30709.14601438735305335071.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- drivers/dax/bus.c | 120 ++++++++++++++++++++++++++++++++++---- drivers/dax/bus.h | 7 ++- drivers/dax/dax-private.h | 2 +- drivers/dax/hmem/hmem.c | 7 +-- drivers/dax/pmem/core.c | 8 +-- 5 files changed, 121 insertions(+), 23 deletions(-) diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 96bd64ba95a5..0a48ce378686 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -130,6 +130,11 @@ ATTRIBUTE_GROUPS(dax_drv); static int dax_bus_match(struct device *dev, struct device_driver *drv); +static bool is_static(struct dax_region *dax_region) +{ + return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0; +} + static struct bus_type dax_bus_type = { .name = "dax", .uevent = dax_bus_uevent, @@ -185,7 +190,48 @@ static ssize_t align_show(struct device *dev, } static DEVICE_ATTR_RO(align); +#define for_each_dax_region_resource(dax_region, res) \ + for (res = (dax_region)->res.child; res; res = res->sibling) + +static unsigned long long dax_region_avail_size(struct dax_region *dax_region) +{ + resource_size_t size = resource_size(&dax_region->res); + struct resource *res; + + device_lock_assert(dax_region->dev); + + for_each_dax_region_resource(dax_region, res) + size -= resource_size(res); + return size; +} + +static ssize_t available_size_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct dax_region *dax_region = dev_get_drvdata(dev); + unsigned long long size; + + device_lock(dev); + size = dax_region_avail_size(dax_region); + device_unlock(dev); + + return sprintf(buf, "%llu\n", size); +} +static DEVICE_ATTR_RO(available_size); + +static umode_t dax_region_visible(struct kobject *kobj, struct attribute *a, + int n) +{ + struct device *dev = container_of(kobj, struct device, kobj); + struct dax_region *dax_region = dev_get_drvdata(dev); + + if (is_static(dax_region) && a == &dev_attr_available_size.attr) + return 0; + return a->mode; +} + static struct attribute *dax_region_attributes[] = { + &dev_attr_available_size.attr, &dev_attr_region_size.attr, &dev_attr_align.attr, &dev_attr_id.attr, @@ -195,6 +241,7 @@ static struct attribute *dax_region_attributes[] = { static const struct attribute_group dax_region_attribute_group = { .name = "dax_region", .attrs = dax_region_attributes, + .is_visible = dax_region_visible, }; static const struct attribute_group *dax_region_attribute_groups[] = { @@ -226,7 +273,8 @@ static void dax_region_unregister(void *region) } struct dax_region *alloc_dax_region(struct device *parent, int region_id, - struct resource *res, int target_node, unsigned int align) + struct resource *res, int target_node, unsigned int align, + unsigned long flags) { struct dax_region *dax_region; @@ -249,12 +297,17 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id, return NULL; dev_set_drvdata(parent, dax_region); - memcpy(&dax_region->res, res, sizeof(*res)); kref_init(&dax_region->kref); dax_region->id = region_id; dax_region->align = align; dax_region->dev = parent; dax_region->target_node = target_node; + dax_region->res = (struct resource) { + .start = res->start, + .end = res->end, + .flags = IORESOURCE_MEM | flags, + }; + if (sysfs_create_groups(&parent->kobj, dax_region_attribute_groups)) { kfree(dax_region); return NULL; @@ -267,6 +320,32 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id, } EXPORT_SYMBOL_GPL(alloc_dax_region); +static int alloc_dev_dax_range(struct dev_dax *dev_dax, resource_size_t size) +{ + struct dax_region *dax_region = dev_dax->region; + struct resource *res = &dax_region->res; + struct device *dev = &dev_dax->dev; + struct resource *alloc; + + device_lock_assert(dax_region->dev); + + /* TODO: handle multiple allocations per region */ + if (res->child) + return -ENOMEM; + + alloc = __request_region(res, res->start, size, dev_name(dev), 0); + + if (!alloc) + return -ENOMEM; + + dev_dax->range = (struct range) { + .start = alloc->start, + .end = alloc->end, + }; + + return 0; +} + static ssize_t size_show(struct device *dev, struct device_attribute *attr, char *buf) { @@ -361,6 +440,15 @@ void kill_dev_dax(struct dev_dax *dev_dax) } EXPORT_SYMBOL_GPL(kill_dev_dax); +static void free_dev_dax_range(struct dev_dax *dev_dax) +{ + struct dax_region *dax_region = dev_dax->region; + struct range *range = &dev_dax->range; + + device_lock_assert(dax_region->dev); + __release_region(&dax_region->res, range->start, range_len(range)); +} + static void dev_dax_release(struct device *dev) { struct dev_dax *dev_dax = to_dev_dax(dev); @@ -385,6 +473,7 @@ static void unregister_dev_dax(void *dev) dev_dbg(dev, "%s\n", __func__); kill_dev_dax(dev_dax); + free_dev_dax_range(dev_dax); device_del(dev); put_device(dev); } @@ -397,7 +486,7 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data) struct dev_dax *dev_dax; struct inode *inode; struct device *dev; - int rc = -ENOMEM; + int rc; if (data->id < 0) return ERR_PTR(-EINVAL); @@ -406,11 +495,25 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data) if (!dev_dax) return ERR_PTR(-ENOMEM); + dev_dax->region = dax_region; + dev = &dev_dax->dev; + device_initialize(dev); + dev_set_name(dev, "dax%d.%d", dax_region->id, data->id); + + rc = alloc_dev_dax_range(dev_dax, data->size); + if (rc) + goto err_range; + if (data->pgmap) { + dev_WARN_ONCE(parent, !is_static(dax_region), + "custom dev_pagemap requires a static dax_region\n"); + dev_dax->pgmap = kmemdup(data->pgmap, sizeof(struct dev_pagemap), GFP_KERNEL); - if (!dev_dax->pgmap) + if (!dev_dax->pgmap) { + rc = -ENOMEM; goto err_pgmap; + } } /* @@ -427,12 +530,7 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data) kill_dax(dax_dev); /* from here on we're committed to teardown via dev_dax_release() */ - dev = &dev_dax->dev; - device_initialize(dev); - dev_dax->dax_dev = dax_dev; - dev_dax->region = dax_region; - dev_dax->range = data->range; dev_dax->target_node = dax_region->target_node; kref_get(&dax_region->kref); @@ -444,7 +542,6 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data) dev->class = dax_class; dev->parent = parent; dev->type = &dev_dax_type; - dev_set_name(dev, "dax%d.%d", dax_region->id, data->id); rc = device_add(dev); if (rc) { @@ -458,9 +555,12 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data) return ERR_PTR(rc); return dev_dax; + err_alloc_dax: kfree(dev_dax->pgmap); err_pgmap: + free_dev_dax_range(dev_dax); +err_range: kfree(dev_dax); return ERR_PTR(rc); diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h index 4aeb36da83a4..44592a8cac0f 100644 --- a/drivers/dax/bus.h +++ b/drivers/dax/bus.h @@ -10,8 +10,11 @@ struct resource; struct dax_device; struct dax_region; void dax_region_put(struct dax_region *dax_region); + +#define IORESOURCE_DAX_STATIC (1UL << 0) struct dax_region *alloc_dax_region(struct device *parent, int region_id, - struct resource *res, int target_node, unsigned int align); + struct resource *res, int target_node, unsigned int align, + unsigned long flags); enum dev_dax_subsys { DEV_DAX_BUS = 0, /* zeroed dev_dax_data picks this by default */ @@ -22,7 +25,7 @@ struct dev_dax_data { struct dax_region *dax_region; struct dev_pagemap *pgmap; enum dev_dax_subsys subsys; - struct range range; + resource_size_t size; int id; }; diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index 12a2dbc43b40..99b1273bb232 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -22,7 +22,7 @@ void dax_bus_exit(void); * @kref: to pin while other agents have a need to do lookups * @dev: parent device backing this region * @align: allocation and mapping alignment for child dax devices - * @res: physical address range of the region + * @res: resource tree to track instance allocations */ struct dax_region { int id; diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c index af82d6ba820a..e7b64539e23e 100644 --- a/drivers/dax/hmem/hmem.c +++ b/drivers/dax/hmem/hmem.c @@ -20,17 +20,14 @@ static int dax_hmem_probe(struct platform_device *pdev) mri = dev->platform_data; dax_region = alloc_dax_region(dev, pdev->id, res, mri->target_node, - PMD_SIZE); + PMD_SIZE, 0); if (!dax_region) return -ENOMEM; data = (struct dev_dax_data) { .dax_region = dax_region, .id = 0, - .range = { - .start = res->start, - .end = res->end, - }, + .size = resource_size(res), }; dev_dax = devm_create_dev_dax(&data); if (IS_ERR(dev_dax)) diff --git a/drivers/dax/pmem/core.c b/drivers/dax/pmem/core.c index 4fa81d3d2f65..4fe700884338 100644 --- a/drivers/dax/pmem/core.c +++ b/drivers/dax/pmem/core.c @@ -54,7 +54,8 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys) memcpy(&res, &pgmap.res, sizeof(res)); res.start += offset; dax_region = alloc_dax_region(dev, region_id, &res, - nd_region->target_node, le32_to_cpu(pfn_sb->align)); + nd_region->target_node, le32_to_cpu(pfn_sb->align), + IORESOURCE_DAX_STATIC); if (!dax_region) return ERR_PTR(-ENOMEM); @@ -63,10 +64,7 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys) .id = id, .pgmap = &pgmap, .subsys = subsys, - .range = { - .start = res.start, - .end = res.end, - }, + .size = resource_size(&res), }; dev_dax = devm_create_dev_dax(&data); From f11cf813dee20e67eac22a6d78502aa564554eb4 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 13 Oct 2020 16:50:08 -0700 Subject: [PATCH 040/181] device-dax: introduce 'struct dev_dax' typed-driver operations MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In preparation for introducing seed devices the dax-bus core needs to be able to intercept ->probe() and ->remove() operations. Towards that end arrange for the bus and drivers to switch from raw 'struct device' driver operations to 'struct dev_dax' typed operations. Reported-by: Hulk Robot Signed-off-by: Dan Williams Signed-off-by: Andrew Morton Cc: Jason Yan Cc: Vishal Verma Cc: Brice Goglin Cc: Dave Hansen Cc: Dave Jiang Cc: David Hildenbrand Cc: Ira Weiny Cc: Jia He Cc: Joao Martins Cc: Jonathan Cameron Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Ard Biesheuvel Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Bjorn Helgaas Cc: Borislav Petkov Cc: Boris Ostrovsky Cc: Catalin Marinas Cc: Daniel Vetter Cc: David Airlie Cc: Greg Kroah-Hartman Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Jeff Moyer Cc: "Jérôme Glisse" Cc: Juergen Gross Cc: kernel test robot Cc: Michael Ellerman Cc: Mike Rapoport Cc: Paul Mackerras Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: "Rafael J. Wysocki" Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Vivek Goyal Cc: Wei Yang Cc: Will Deacon Link: https://lkml.kernel.org/r/160106113357.30709.4541750544799737855.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- drivers/dax/bus.c | 18 ++++++++++++++++++ drivers/dax/bus.h | 4 +++- drivers/dax/device.c | 12 +++++------- drivers/dax/kmem.c | 18 ++++++++---------- drivers/dax/pmem/compat.c | 2 +- 5 files changed, 35 insertions(+), 19 deletions(-) diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 0a48ce378686..9549f11ebed0 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -135,10 +135,28 @@ static bool is_static(struct dax_region *dax_region) return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0; } +static int dax_bus_probe(struct device *dev) +{ + struct dax_device_driver *dax_drv = to_dax_drv(dev->driver); + struct dev_dax *dev_dax = to_dev_dax(dev); + + return dax_drv->probe(dev_dax); +} + +static int dax_bus_remove(struct device *dev) +{ + struct dax_device_driver *dax_drv = to_dax_drv(dev->driver); + struct dev_dax *dev_dax = to_dev_dax(dev); + + return dax_drv->remove(dev_dax); +} + static struct bus_type dax_bus_type = { .name = "dax", .uevent = dax_bus_uevent, .match = dax_bus_match, + .probe = dax_bus_probe, + .remove = dax_bus_remove, .drv_groups = dax_drv_groups, }; diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h index 44592a8cac0f..da27ea70a19a 100644 --- a/drivers/dax/bus.h +++ b/drivers/dax/bus.h @@ -38,6 +38,8 @@ struct dax_device_driver { struct device_driver drv; struct list_head ids; int match_always; + int (*probe)(struct dev_dax *dev); + int (*remove)(struct dev_dax *dev); }; int __dax_driver_register(struct dax_device_driver *dax_drv, @@ -48,7 +50,7 @@ void dax_driver_unregister(struct dax_device_driver *dax_drv); void kill_dev_dax(struct dev_dax *dev_dax); #if IS_ENABLED(CONFIG_DEV_DAX_PMEM_COMPAT) -int dev_dax_probe(struct device *dev); +int dev_dax_probe(struct dev_dax *dev_dax); #endif /* diff --git a/drivers/dax/device.c b/drivers/dax/device.c index 287cf0a3db23..9833fa83b537 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -392,11 +392,11 @@ static void dev_dax_kill(void *dev_dax) kill_dev_dax(dev_dax); } -int dev_dax_probe(struct device *dev) +int dev_dax_probe(struct dev_dax *dev_dax) { - struct dev_dax *dev_dax = to_dev_dax(dev); struct dax_device *dax_dev = dev_dax->dax_dev; struct range *range = &dev_dax->range; + struct device *dev = &dev_dax->dev; struct dev_pagemap *pgmap; struct inode *inode; struct cdev *cdev; @@ -446,17 +446,15 @@ int dev_dax_probe(struct device *dev) } EXPORT_SYMBOL_GPL(dev_dax_probe); -static int dev_dax_remove(struct device *dev) +static int dev_dax_remove(struct dev_dax *dev_dax) { /* all probe actions are unwound by devm */ return 0; } static struct dax_device_driver device_dax_driver = { - .drv = { - .probe = dev_dax_probe, - .remove = dev_dax_remove, - }, + .probe = dev_dax_probe, + .remove = dev_dax_remove, .match_always = 1, }; diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index e56fc688bdc5..c2ac465cc342 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -29,10 +29,10 @@ static struct range dax_kmem_range(struct dev_dax *dev_dax) return range; } -int dev_dax_kmem_probe(struct device *dev) +static int dev_dax_kmem_probe(struct dev_dax *dev_dax) { - struct dev_dax *dev_dax = to_dev_dax(dev); struct range range = dax_kmem_range(dev_dax); + struct device *dev = &dev_dax->dev; struct resource *res; char *res_name; int numa_node; @@ -88,12 +88,12 @@ int dev_dax_kmem_probe(struct device *dev) } #ifdef CONFIG_MEMORY_HOTREMOVE -static int dev_dax_kmem_remove(struct device *dev) +static int dev_dax_kmem_remove(struct dev_dax *dev_dax) { - struct dev_dax *dev_dax = to_dev_dax(dev); + int rc; + struct device *dev = &dev_dax->dev; struct range range = dax_kmem_range(dev_dax); const char *res_name = dev_get_drvdata(dev); - int rc; /* * We have one shot for removing memory, if some memory blocks were not @@ -116,7 +116,7 @@ static int dev_dax_kmem_remove(struct device *dev) return 0; } #else -static int dev_dax_kmem_remove(struct device *dev) +static int dev_dax_kmem_remove(struct dev_dax *dev_dax) { /* * Without hotremove purposely leak the request_mem_region() for the @@ -131,10 +131,8 @@ static int dev_dax_kmem_remove(struct device *dev) #endif /* CONFIG_MEMORY_HOTREMOVE */ static struct dax_device_driver device_dax_kmem_driver = { - .drv = { - .probe = dev_dax_kmem_probe, - .remove = dev_dax_kmem_remove, - }, + .probe = dev_dax_kmem_probe, + .remove = dev_dax_kmem_remove, }; static int __init dax_kmem_init(void) diff --git a/drivers/dax/pmem/compat.c b/drivers/dax/pmem/compat.c index d7b15e6f30c5..863c114fd88c 100644 --- a/drivers/dax/pmem/compat.c +++ b/drivers/dax/pmem/compat.c @@ -22,7 +22,7 @@ static int dax_pmem_compat_probe(struct device *dev) return -ENOMEM; device_lock(&dev_dax->dev); - rc = dev_dax_probe(&dev_dax->dev); + rc = dev_dax_probe(dev_dax); device_unlock(&dev_dax->dev); devres_close_group(&dev_dax->dev, dev_dax); From 0f3da14a4f0503998fc6c12da3d2fc6e8b33e669 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 13 Oct 2020 16:50:13 -0700 Subject: [PATCH 041/181] device-dax: introduce 'seed' devices MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add a seed device concept for dynamic dax regions to be able to split the region amongst multiple sub-instances. The seed device, similar to libnvdimm seed devices, is a device that starts with zero capacity allocated and unbound to a driver. In contrast to libnvdimm seed devices explicit 'create' and 'delete' interfaces are added to the region to trigger seeds to be created and unused devices to be reclaimed. The explicit create and delete replaces implicit create as a side effect of probe and implicit delete when writing 0 to the size that libnvdimm implements. Delete can be performed on any 0-sized and idle device. This avoids the gymnastics of needing to move device_unregister() to its own async context. Specifically, it avoids the deadlock of deleting a device via one of its own attributes. It is also less surprising to userspace which never sees an extra device it did not request. For now just add the device creation, teardown, and ->probe() prevention. A later patch will arrange for the 'dax/size' attribute to be writable to allocate capacity from the region. Signed-off-by: Dan Williams Signed-off-by: Andrew Morton Cc: Vishal Verma Cc: Brice Goglin Cc: Dave Hansen Cc: Dave Jiang Cc: David Hildenbrand Cc: Ira Weiny Cc: Jia He Cc: Joao Martins Cc: Jonathan Cameron Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Ard Biesheuvel Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Bjorn Helgaas Cc: Borislav Petkov Cc: Boris Ostrovsky Cc: Catalin Marinas Cc: Daniel Vetter Cc: David Airlie Cc: Greg Kroah-Hartman Cc: "H. Peter Anvin" Cc: Hulk Robot Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Jason Yan Cc: Jeff Moyer Cc: "Jérôme Glisse" Cc: Juergen Gross Cc: kernel test robot Cc: Michael Ellerman Cc: Mike Rapoport Cc: Paul Mackerras Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: "Rafael J. Wysocki" Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Vivek Goyal Cc: Wei Yang Cc: Will Deacon Link: https://lkml.kernel.org/r/159643101583.4062302.12255093902950754962.stgit@dwillia2-desk3.amr.corp.intel.com Link: https://lkml.kernel.org/r/160106113873.30709.15168756050631539431.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- drivers/dax/bus.c | 301 +++++++++++++++++++++++++++++++++----- drivers/dax/dax-private.h | 9 ++ drivers/dax/hmem/hmem.c | 2 +- 3 files changed, 272 insertions(+), 40 deletions(-) diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 9549f11ebed0..dce9413a4394 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -139,8 +139,26 @@ static int dax_bus_probe(struct device *dev) { struct dax_device_driver *dax_drv = to_dax_drv(dev->driver); struct dev_dax *dev_dax = to_dev_dax(dev); + struct dax_region *dax_region = dev_dax->region; + struct range *range = &dev_dax->range; + int rc; - return dax_drv->probe(dev_dax); + if (range_len(range) == 0 || dev_dax->id < 0) + return -ENXIO; + + rc = dax_drv->probe(dev_dax); + + if (rc || is_static(dax_region)) + return rc; + + /* + * Track new seed creation only after successful probe of the + * previous seed. + */ + if (dax_region->seed == dev) + dax_region->seed = NULL; + + return 0; } static int dax_bus_remove(struct device *dev) @@ -237,14 +255,216 @@ static ssize_t available_size_show(struct device *dev, } static DEVICE_ATTR_RO(available_size); +static ssize_t seed_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct dax_region *dax_region = dev_get_drvdata(dev); + struct device *seed; + ssize_t rc; + + if (is_static(dax_region)) + return -EINVAL; + + device_lock(dev); + seed = dax_region->seed; + rc = sprintf(buf, "%s\n", seed ? dev_name(seed) : ""); + device_unlock(dev); + + return rc; +} +static DEVICE_ATTR_RO(seed); + +static ssize_t create_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct dax_region *dax_region = dev_get_drvdata(dev); + struct device *youngest; + ssize_t rc; + + if (is_static(dax_region)) + return -EINVAL; + + device_lock(dev); + youngest = dax_region->youngest; + rc = sprintf(buf, "%s\n", youngest ? dev_name(youngest) : ""); + device_unlock(dev); + + return rc; +} + +static ssize_t create_store(struct device *dev, struct device_attribute *attr, + const char *buf, size_t len) +{ + struct dax_region *dax_region = dev_get_drvdata(dev); + unsigned long long avail; + ssize_t rc; + int val; + + if (is_static(dax_region)) + return -EINVAL; + + rc = kstrtoint(buf, 0, &val); + if (rc) + return rc; + if (val != 1) + return -EINVAL; + + device_lock(dev); + avail = dax_region_avail_size(dax_region); + if (avail == 0) + rc = -ENOSPC; + else { + struct dev_dax_data data = { + .dax_region = dax_region, + .size = 0, + .id = -1, + }; + struct dev_dax *dev_dax = devm_create_dev_dax(&data); + + if (IS_ERR(dev_dax)) + rc = PTR_ERR(dev_dax); + else { + /* + * In support of crafting multiple new devices + * simultaneously multiple seeds can be created, + * but only the first one that has not been + * successfully bound is tracked as the region + * seed. + */ + if (!dax_region->seed) + dax_region->seed = &dev_dax->dev; + dax_region->youngest = &dev_dax->dev; + rc = len; + } + } + device_unlock(dev); + + return rc; +} +static DEVICE_ATTR_RW(create); + +void kill_dev_dax(struct dev_dax *dev_dax) +{ + struct dax_device *dax_dev = dev_dax->dax_dev; + struct inode *inode = dax_inode(dax_dev); + + kill_dax(dax_dev); + unmap_mapping_range(inode->i_mapping, 0, 0, 1); +} +EXPORT_SYMBOL_GPL(kill_dev_dax); + +static void free_dev_dax_range(struct dev_dax *dev_dax) +{ + struct dax_region *dax_region = dev_dax->region; + struct range *range = &dev_dax->range; + + device_lock_assert(dax_region->dev); + if (range_len(range)) + __release_region(&dax_region->res, range->start, + range_len(range)); +} + +static void unregister_dev_dax(void *dev) +{ + struct dev_dax *dev_dax = to_dev_dax(dev); + + dev_dbg(dev, "%s\n", __func__); + + kill_dev_dax(dev_dax); + free_dev_dax_range(dev_dax); + device_del(dev); + put_device(dev); +} + +/* a return value >= 0 indicates this invocation invalidated the id */ +static int __free_dev_dax_id(struct dev_dax *dev_dax) +{ + struct dax_region *dax_region = dev_dax->region; + struct device *dev = &dev_dax->dev; + int rc = dev_dax->id; + + device_lock_assert(dev); + + if (is_static(dax_region) || dev_dax->id < 0) + return -1; + ida_free(&dax_region->ida, dev_dax->id); + dev_dax->id = -1; + return rc; +} + +static int free_dev_dax_id(struct dev_dax *dev_dax) +{ + struct device *dev = &dev_dax->dev; + int rc; + + device_lock(dev); + rc = __free_dev_dax_id(dev_dax); + device_unlock(dev); + return rc; +} + +static ssize_t delete_store(struct device *dev, struct device_attribute *attr, + const char *buf, size_t len) +{ + struct dax_region *dax_region = dev_get_drvdata(dev); + struct dev_dax *dev_dax; + struct device *victim; + bool do_del = false; + int rc; + + if (is_static(dax_region)) + return -EINVAL; + + victim = device_find_child_by_name(dax_region->dev, buf); + if (!victim) + return -ENXIO; + + device_lock(dev); + device_lock(victim); + dev_dax = to_dev_dax(victim); + if (victim->driver || range_len(&dev_dax->range)) + rc = -EBUSY; + else { + /* + * Invalidate the device so it does not become active + * again, but always preserve device-id-0 so that + * /sys/bus/dax/ is guaranteed to be populated while any + * dax_region is registered. + */ + if (dev_dax->id > 0) { + do_del = __free_dev_dax_id(dev_dax) >= 0; + rc = len; + if (dax_region->seed == victim) + dax_region->seed = NULL; + if (dax_region->youngest == victim) + dax_region->youngest = NULL; + } else + rc = -EBUSY; + } + device_unlock(victim); + + /* won the race to invalidate the device, clean it up */ + if (do_del) + devm_release_action(dev, unregister_dev_dax, victim); + device_unlock(dev); + put_device(victim); + + return rc; +} +static DEVICE_ATTR_WO(delete); + static umode_t dax_region_visible(struct kobject *kobj, struct attribute *a, int n) { struct device *dev = container_of(kobj, struct device, kobj); struct dax_region *dax_region = dev_get_drvdata(dev); - if (is_static(dax_region) && a == &dev_attr_available_size.attr) - return 0; + if (is_static(dax_region)) + if (a == &dev_attr_available_size.attr + || a == &dev_attr_create.attr + || a == &dev_attr_seed.attr + || a == &dev_attr_delete.attr) + return 0; return a->mode; } @@ -252,6 +472,9 @@ static struct attribute *dax_region_attributes[] = { &dev_attr_available_size.attr, &dev_attr_region_size.attr, &dev_attr_align.attr, + &dev_attr_create.attr, + &dev_attr_seed.attr, + &dev_attr_delete.attr, &dev_attr_id.attr, NULL, }; @@ -320,6 +543,7 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id, dax_region->align = align; dax_region->dev = parent; dax_region->target_node = target_node; + ida_init(&dax_region->ida); dax_region->res = (struct resource) { .start = res->start, .end = res->end, @@ -347,6 +571,15 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, resource_size_t size) device_lock_assert(dax_region->dev); + /* handle the seed alloc special case */ + if (!size) { + dev_dax->range = (struct range) { + .start = res->start, + .end = res->start - 1, + }; + return 0; + } + /* TODO: handle multiple allocations per region */ if (res->child) return -ENOMEM; @@ -448,33 +681,15 @@ static const struct attribute_group *dax_attribute_groups[] = { NULL, }; -void kill_dev_dax(struct dev_dax *dev_dax) -{ - struct dax_device *dax_dev = dev_dax->dax_dev; - struct inode *inode = dax_inode(dax_dev); - - kill_dax(dax_dev); - unmap_mapping_range(inode->i_mapping, 0, 0, 1); -} -EXPORT_SYMBOL_GPL(kill_dev_dax); - -static void free_dev_dax_range(struct dev_dax *dev_dax) -{ - struct dax_region *dax_region = dev_dax->region; - struct range *range = &dev_dax->range; - - device_lock_assert(dax_region->dev); - __release_region(&dax_region->res, range->start, range_len(range)); -} - static void dev_dax_release(struct device *dev) { struct dev_dax *dev_dax = to_dev_dax(dev); struct dax_region *dax_region = dev_dax->region; struct dax_device *dax_dev = dev_dax->dax_dev; - dax_region_put(dax_region); put_dax(dax_dev); + free_dev_dax_id(dev_dax); + dax_region_put(dax_region); kfree(dev_dax->pgmap); kfree(dev_dax); } @@ -484,18 +699,6 @@ static const struct device_type dev_dax_type = { .groups = dax_attribute_groups, }; -static void unregister_dev_dax(void *dev) -{ - struct dev_dax *dev_dax = to_dev_dax(dev); - - dev_dbg(dev, "%s\n", __func__); - - kill_dev_dax(dev_dax); - free_dev_dax_range(dev_dax); - device_del(dev); - put_device(dev); -} - struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data) { struct dax_region *dax_region = data->dax_region; @@ -506,17 +709,35 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data) struct device *dev; int rc; - if (data->id < 0) - return ERR_PTR(-EINVAL); - dev_dax = kzalloc(sizeof(*dev_dax), GFP_KERNEL); if (!dev_dax) return ERR_PTR(-ENOMEM); + if (is_static(dax_region)) { + if (dev_WARN_ONCE(parent, data->id < 0, + "dynamic id specified to static region\n")) { + rc = -EINVAL; + goto err_id; + } + + dev_dax->id = data->id; + } else { + if (dev_WARN_ONCE(parent, data->id >= 0, + "static id specified to dynamic region\n")) { + rc = -EINVAL; + goto err_id; + } + + rc = ida_alloc(&dax_region->ida, GFP_KERNEL); + if (rc < 0) + goto err_id; + dev_dax->id = rc; + } + dev_dax->region = dax_region; dev = &dev_dax->dev; device_initialize(dev); - dev_set_name(dev, "dax%d.%d", dax_region->id, data->id); + dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id); rc = alloc_dev_dax_range(dev_dax, data->size); if (rc) @@ -579,6 +800,8 @@ err_alloc_dax: err_pgmap: free_dev_dax_range(dev_dax); err_range: + free_dev_dax_id(dev_dax); +err_id: kfree(dev_dax); return ERR_PTR(rc); diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index 99b1273bb232..b81a1494d82b 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -7,6 +7,7 @@ #include #include +#include /* private routines between core files */ struct dax_device; @@ -22,7 +23,10 @@ void dax_bus_exit(void); * @kref: to pin while other agents have a need to do lookups * @dev: parent device backing this region * @align: allocation and mapping alignment for child dax devices + * @ida: instance id allocator * @res: resource tree to track instance allocations + * @seed: allow userspace to find the first unbound seed device + * @youngest: allow userspace to find the most recently created device */ struct dax_region { int id; @@ -30,7 +34,10 @@ struct dax_region { struct kref kref; struct device *dev; unsigned int align; + struct ida ida; struct resource res; + struct device *seed; + struct device *youngest; }; /** @@ -39,6 +46,7 @@ struct dax_region { * @region - parent region * @dax_dev - core dax functionality * @target_node: effective numa node if dev_dax memory range is onlined + * @id: ida allocated id * @dev - device core * @pgmap - pgmap for memmap setup / lifetime (driver owned) * @range: resource range for the instance @@ -47,6 +55,7 @@ struct dev_dax { struct dax_region *region; struct dax_device *dax_dev; int target_node; + int id; struct device dev; struct dev_pagemap *pgmap; struct range range; diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c index e7b64539e23e..aa260009dfc7 100644 --- a/drivers/dax/hmem/hmem.c +++ b/drivers/dax/hmem/hmem.c @@ -26,7 +26,7 @@ static int dax_hmem_probe(struct platform_device *pdev) data = (struct dev_dax_data) { .dax_region = dax_region, - .id = 0, + .id = -1, .size = resource_size(res), }; dev_dax = devm_create_dev_dax(&data); From c77f520db8ebed1ffdeb8a545526dc093365d972 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 13 Oct 2020 16:50:18 -0700 Subject: [PATCH 042/181] drivers/base: make device_find_child_by_name() compatible with sysfs inputs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Use sysfs_streq() in device_find_child_by_name() to allow it to use a sysfs input string that might contain a trailing newline. The other "device by name" interfaces, {bus,driver,class}_find_device_by_name(), already account for sysfs strings. Signed-off-by: Dan Williams Signed-off-by: Andrew Morton Reviewed-by: Greg Kroah-Hartman Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Ard Biesheuvel Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Bjorn Helgaas Cc: Borislav Petkov Cc: Boris Ostrovsky Cc: Brice Goglin Cc: Catalin Marinas Cc: Daniel Vetter Cc: Dave Hansen Cc: Dave Jiang Cc: David Airlie Cc: David Hildenbrand Cc: "H. Peter Anvin" Cc: Hulk Robot Cc: Ingo Molnar Cc: Ira Weiny Cc: Jason Gunthorpe Cc: Jason Yan Cc: Jeff Moyer Cc: "Jérôme Glisse" Cc: Jia He Cc: Joao Martins Cc: Jonathan Cameron Cc: Juergen Gross Cc: kernel test robot Cc: Michael Ellerman Cc: Mike Rapoport Cc: Paul Mackerras Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: "Rafael J. Wysocki" Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Vishal Verma Cc: Vivek Goyal Cc: Wei Yang Cc: Will Deacon Link: https://lkml.kernel.org/r/159643102106.4062302.12229802117645312104.stgit@dwillia2-desk3.amr.corp.intel.com Link: https://lkml.kernel.org/r/160106114576.30709.2960091665444712180.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- drivers/base/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index bb5806a2bd4c..8dd753539c06 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -3324,7 +3324,7 @@ struct device *device_find_child_by_name(struct device *parent, klist_iter_init(&parent->p->klist_children, &i); while ((child = next_device(&i))) - if (!strcmp(dev_name(child), name) && get_device(child)) + if (sysfs_streq(dev_name(child), name) && get_device(child)) break; klist_iter_exit(&i); return child; From fcffb6a1df921c81579e9c01f9caa281c3f991d5 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 13 Oct 2020 16:50:24 -0700 Subject: [PATCH 043/181] device-dax: add resize support MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Make the device-dax 'size' attribute writable to allow capacity to be split between multiple instances in a region. The intended consumers of this capability are users that want to split a scarce memory resource between device-dax and System-RAM access, or users that want to have multiple security domains for a large region. By default the hmem instance provider allocates an entire region to the first instance. The process of creating a new instance (assuming a region-id of 0) is find the region and trigger the 'create' attribute which yields an empty instance to configure. For example: cd /sys/bus/dax/devices echo dax0.0 > dax0.0/driver/unbind echo $new_size > dax0.0/size echo 1 > $(readlink -f dax0.0)../dax_region/create seed=$(cat $(readlink -f dax0.0)../dax_region/seed) echo $new_size > $seed/size echo dax0.0 > ../drivers/{device_dax,kmem}/bind echo dax0.1 > ../drivers/{device_dax,kmem}/bind Instances can be destroyed by: echo $device > $(readlink -f $device)../dax_region/delete Signed-off-by: Dan Williams Signed-off-by: Andrew Morton Cc: Vishal Verma Cc: Brice Goglin Cc: Dave Hansen Cc: Dave Jiang Cc: David Hildenbrand Cc: Ira Weiny Cc: Jia He Cc: Joao Martins Cc: Jonathan Cameron Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Ard Biesheuvel Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Bjorn Helgaas Cc: Borislav Petkov Cc: Boris Ostrovsky Cc: Catalin Marinas Cc: Daniel Vetter Cc: David Airlie Cc: Greg Kroah-Hartman Cc: "H. Peter Anvin" Cc: Hulk Robot Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Jason Yan Cc: Jeff Moyer Cc: "Jérôme Glisse" Cc: Juergen Gross Cc: kernel test robot Cc: Michael Ellerman Cc: Mike Rapoport Cc: Paul Mackerras Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: "Rafael J. Wysocki" Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Vivek Goyal Cc: Wei Yang Cc: Will Deacon Link: https://lkml.kernel.org/r/159643102625.4062302.7431838945566033852.stgit@dwillia2-desk3.amr.corp.intel.com Link: https://lkml.kernel.org/r/160106115239.30709.9850106928133493138.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- drivers/dax/bus.c | 161 +++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 152 insertions(+), 9 deletions(-) diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index dce9413a4394..53d07f2f1285 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -6,6 +6,7 @@ #include #include #include +#include #include "dax-private.h" #include "bus.h" @@ -562,7 +563,8 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id, } EXPORT_SYMBOL_GPL(alloc_dax_region); -static int alloc_dev_dax_range(struct dev_dax *dev_dax, resource_size_t size) +static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start, + resource_size_t size) { struct dax_region *dax_region = dev_dax->region; struct resource *res = &dax_region->res; @@ -580,12 +582,7 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, resource_size_t size) return 0; } - /* TODO: handle multiple allocations per region */ - if (res->child) - return -ENOMEM; - - alloc = __request_region(res, res->start, size, dev_name(dev), 0); - + alloc = __request_region(res, start, size, dev_name(dev), 0); if (!alloc) return -ENOMEM; @@ -597,6 +594,29 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, resource_size_t size) return 0; } +static int adjust_dev_dax_range(struct dev_dax *dev_dax, struct resource *res, resource_size_t size) +{ + struct dax_region *dax_region = dev_dax->region; + struct range *range = &dev_dax->range; + int rc = 0; + + device_lock_assert(dax_region->dev); + + if (size) + rc = adjust_resource(res, range->start, size); + else + __release_region(&dax_region->res, range->start, range_len(range)); + if (rc) + return rc; + + dev_dax->range = (struct range) { + .start = range->start, + .end = range->start + size - 1, + }; + + return 0; +} + static ssize_t size_show(struct device *dev, struct device_attribute *attr, char *buf) { @@ -605,7 +625,127 @@ static ssize_t size_show(struct device *dev, return sprintf(buf, "%llu\n", size); } -static DEVICE_ATTR_RO(size); + +static bool alloc_is_aligned(struct dax_region *dax_region, + resource_size_t size) +{ + /* + * The minimum mapping granularity for a device instance is a + * single subsection, unless the arch says otherwise. + */ + return IS_ALIGNED(size, max_t(unsigned long, dax_region->align, + memremap_compat_align())); +} + +static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size) +{ + struct dax_region *dax_region = dev_dax->region; + struct range *range = &dev_dax->range; + struct resource *res, *adjust = NULL; + struct device *dev = &dev_dax->dev; + + for_each_dax_region_resource(dax_region, res) + if (strcmp(res->name, dev_name(dev)) == 0 + && res->start == range->start) { + adjust = res; + break; + } + + if (dev_WARN_ONCE(dev, !adjust, "failed to find matching resource\n")) + return -ENXIO; + return adjust_dev_dax_range(dev_dax, adjust, size); +} + +static ssize_t dev_dax_resize(struct dax_region *dax_region, + struct dev_dax *dev_dax, resource_size_t size) +{ + resource_size_t avail = dax_region_avail_size(dax_region), to_alloc; + resource_size_t dev_size = range_len(&dev_dax->range); + struct resource *region_res = &dax_region->res; + struct device *dev = &dev_dax->dev; + const char *name = dev_name(dev); + struct resource *res, *first; + + if (dev->driver) + return -EBUSY; + if (size == dev_size) + return 0; + if (size > dev_size && size - dev_size > avail) + return -ENOSPC; + if (size < dev_size) + return dev_dax_shrink(dev_dax, size); + + to_alloc = size - dev_size; + if (dev_WARN_ONCE(dev, !alloc_is_aligned(dax_region, to_alloc), + "resize of %pa misaligned\n", &to_alloc)) + return -ENXIO; + + /* + * Expand the device into the unused portion of the region. This + * may involve adjusting the end of an existing resource, or + * allocating a new resource. + */ + first = region_res->child; + if (!first) + return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc); + for (res = first; to_alloc && res; res = res->sibling) { + struct resource *next = res->sibling; + resource_size_t free; + + /* space at the beginning of the region */ + free = 0; + if (res == first && res->start > dax_region->res.start) + free = res->start - dax_region->res.start; + if (free >= to_alloc && dev_size == 0) + return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc); + + free = 0; + /* space between allocations */ + if (next && next->start > res->end + 1) + free = next->start - res->end + 1; + + /* space at the end of the region */ + if (free < to_alloc && !next && res->end < region_res->end) + free = region_res->end - res->end; + + if (free >= to_alloc && strcmp(name, res->name) == 0) + return adjust_dev_dax_range(dev_dax, res, resource_size(res) + to_alloc); + else if (free >= to_alloc && dev_size == 0) + return alloc_dev_dax_range(dev_dax, res->end + 1, to_alloc); + } + return -ENOSPC; +} + +static ssize_t size_store(struct device *dev, struct device_attribute *attr, + const char *buf, size_t len) +{ + ssize_t rc; + unsigned long long val; + struct dev_dax *dev_dax = to_dev_dax(dev); + struct dax_region *dax_region = dev_dax->region; + + rc = kstrtoull(buf, 0, &val); + if (rc) + return rc; + + if (!alloc_is_aligned(dax_region, val)) { + dev_dbg(dev, "%s: size: %lld misaligned\n", __func__, val); + return -EINVAL; + } + + device_lock(dax_region->dev); + if (!dax_region->dev->driver) { + device_unlock(dax_region->dev); + return -ENXIO; + } + device_lock(dev); + rc = dev_dax_resize(dax_region, dev_dax, val); + device_unlock(dev); + device_unlock(dax_region->dev); + + return rc == 0 ? len : rc; +} +static DEVICE_ATTR_RW(size); static int dev_dax_target_node(struct dev_dax *dev_dax) { @@ -654,11 +794,14 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n) { struct device *dev = container_of(kobj, struct device, kobj); struct dev_dax *dev_dax = to_dev_dax(dev); + struct dax_region *dax_region = dev_dax->region; if (a == &dev_attr_target_node.attr && dev_dax_target_node(dev_dax) < 0) return 0; if (a == &dev_attr_numa_node.attr && !IS_ENABLED(CONFIG_NUMA)) return 0; + if (a == &dev_attr_size.attr && is_static(dax_region)) + return 0444; return a->mode; } @@ -739,7 +882,7 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data) device_initialize(dev); dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id); - rc = alloc_dev_dax_range(dev_dax, data->size); + rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, data->size); if (rc) goto err_range; From a4574f63edc6f76fb46dcd65d3eb4d5a8e23ba38 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 13 Oct 2020 16:50:29 -0700 Subject: [PATCH 044/181] mm/memremap_pages: convert to 'struct range' MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The 'struct resource' in 'struct dev_pagemap' is only used for holding resource span information. The other fields, 'name', 'flags', 'desc', 'parent', 'sibling', and 'child' are all unused wasted space. This is in preparation for introducing a multi-range extension of devm_memremap_pages(). The bulk of this change is unwinding all the places internal to libnvdimm that used 'struct resource' unnecessarily, and replacing instances of 'struct dev_pagemap'.res with 'struct dev_pagemap'.range. P2PDMA had a minor usage of the resource flags field, but only to report failures with "%pR". That is replaced with an open coded print of the range. [dan.carpenter@oracle.com: mm/hmm/test: use after free in dmirror_allocate_chunk()] Link: https://lkml.kernel.org/r/20200926121402.GA7467@kadam Signed-off-by: Dan Williams Signed-off-by: Dan Carpenter Signed-off-by: Andrew Morton Reviewed-by: Boris Ostrovsky [xen] Cc: Paul Mackerras Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Vishal Verma Cc: Vivek Goyal Cc: Dave Jiang Cc: Ben Skeggs Cc: David Airlie Cc: Daniel Vetter Cc: Ira Weiny Cc: Bjorn Helgaas Cc: Juergen Gross Cc: Stefano Stabellini Cc: "Jérôme Glisse" Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Brice Goglin Cc: Catalin Marinas Cc: Dave Hansen Cc: David Hildenbrand Cc: Greg Kroah-Hartman Cc: "H. Peter Anvin" Cc: Hulk Robot Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Jason Yan Cc: Jeff Moyer Cc: Jia He Cc: Joao Martins Cc: Jonathan Cameron Cc: kernel test robot Cc: Mike Rapoport Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: "Rafael J. Wysocki" Cc: Randy Dunlap Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Wei Yang Cc: Will Deacon Link: https://lkml.kernel.org/r/159643103173.4062302.768998885691711532.stgit@dwillia2-desk3.amr.corp.intel.com Link: https://lkml.kernel.org/r/160106115761.30709.13539840236873663620.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- arch/powerpc/kvm/book3s_hv_uvmem.c | 13 +++-- drivers/dax/bus.c | 10 ++-- drivers/dax/bus.h | 2 +- drivers/dax/dax-private.h | 5 -- drivers/dax/device.c | 3 +- drivers/dax/hmem/hmem.c | 5 +- drivers/dax/pmem/core.c | 12 ++-- drivers/gpu/drm/nouveau/nouveau_dmem.c | 14 ++--- drivers/nvdimm/badrange.c | 26 ++++----- drivers/nvdimm/claim.c | 13 +++-- drivers/nvdimm/nd.h | 3 +- drivers/nvdimm/pfn_devs.c | 12 ++-- drivers/nvdimm/pmem.c | 26 +++++---- drivers/nvdimm/region.c | 21 ++++--- drivers/pci/p2pdma.c | 11 ++-- drivers/xen/unpopulated-alloc.c | 44 ++++++++++----- include/linux/memremap.h | 5 +- include/linux/range.h | 6 ++ lib/test_hmm.c | 50 ++++++++--------- mm/memremap.c | 77 +++++++++++++------------- tools/testing/nvdimm/test/iomap.c | 2 +- 21 files changed, 195 insertions(+), 165 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c index 7705d5557239..29ec555055c2 100644 --- a/arch/powerpc/kvm/book3s_hv_uvmem.c +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c @@ -687,9 +687,9 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm) struct kvmppc_uvmem_page_pvt *pvt; unsigned long pfn_last, pfn_first; - pfn_first = kvmppc_uvmem_pgmap.res.start >> PAGE_SHIFT; + pfn_first = kvmppc_uvmem_pgmap.range.start >> PAGE_SHIFT; pfn_last = pfn_first + - (resource_size(&kvmppc_uvmem_pgmap.res) >> PAGE_SHIFT); + (range_len(&kvmppc_uvmem_pgmap.range) >> PAGE_SHIFT); spin_lock(&kvmppc_uvmem_bitmap_lock); bit = find_first_zero_bit(kvmppc_uvmem_bitmap, @@ -1007,7 +1007,7 @@ static vm_fault_t kvmppc_uvmem_migrate_to_ram(struct vm_fault *vmf) static void kvmppc_uvmem_page_free(struct page *page) { unsigned long pfn = page_to_pfn(page) - - (kvmppc_uvmem_pgmap.res.start >> PAGE_SHIFT); + (kvmppc_uvmem_pgmap.range.start >> PAGE_SHIFT); struct kvmppc_uvmem_page_pvt *pvt; spin_lock(&kvmppc_uvmem_bitmap_lock); @@ -1170,7 +1170,8 @@ int kvmppc_uvmem_init(void) } kvmppc_uvmem_pgmap.type = MEMORY_DEVICE_PRIVATE; - kvmppc_uvmem_pgmap.res = *res; + kvmppc_uvmem_pgmap.range.start = res->start; + kvmppc_uvmem_pgmap.range.end = res->end; kvmppc_uvmem_pgmap.ops = &kvmppc_uvmem_ops; /* just one global instance: */ kvmppc_uvmem_pgmap.owner = &kvmppc_uvmem_pgmap; @@ -1205,7 +1206,7 @@ void kvmppc_uvmem_free(void) return; memunmap_pages(&kvmppc_uvmem_pgmap); - release_mem_region(kvmppc_uvmem_pgmap.res.start, - resource_size(&kvmppc_uvmem_pgmap.res)); + release_mem_region(kvmppc_uvmem_pgmap.range.start, + range_len(&kvmppc_uvmem_pgmap.range)); kfree(kvmppc_uvmem_bitmap); } diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 53d07f2f1285..00fa73a8dfb4 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -515,7 +515,7 @@ static void dax_region_unregister(void *region) } struct dax_region *alloc_dax_region(struct device *parent, int region_id, - struct resource *res, int target_node, unsigned int align, + struct range *range, int target_node, unsigned int align, unsigned long flags) { struct dax_region *dax_region; @@ -530,8 +530,8 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id, return NULL; } - if (!IS_ALIGNED(res->start, align) - || !IS_ALIGNED(resource_size(res), align)) + if (!IS_ALIGNED(range->start, align) + || !IS_ALIGNED(range_len(range), align)) return NULL; dax_region = kzalloc(sizeof(*dax_region), GFP_KERNEL); @@ -546,8 +546,8 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id, dax_region->target_node = target_node; ida_init(&dax_region->ida); dax_region->res = (struct resource) { - .start = res->start, - .end = res->end, + .start = range->start, + .end = range->end, .flags = IORESOURCE_MEM | flags, }; diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h index da27ea70a19a..72b92f95509f 100644 --- a/drivers/dax/bus.h +++ b/drivers/dax/bus.h @@ -13,7 +13,7 @@ void dax_region_put(struct dax_region *dax_region); #define IORESOURCE_DAX_STATIC (1UL << 0) struct dax_region *alloc_dax_region(struct device *parent, int region_id, - struct resource *res, int target_node, unsigned int align, + struct range *range, int target_node, unsigned int align, unsigned long flags); enum dev_dax_subsys { diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index b81a1494d82b..0cbb2ec81ca7 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -61,11 +61,6 @@ struct dev_dax { struct range range; }; -static inline u64 range_len(struct range *range) -{ - return range->end - range->start + 1; -} - static inline struct dev_dax *to_dev_dax(struct device *dev) { return container_of(dev, struct dev_dax, dev); diff --git a/drivers/dax/device.c b/drivers/dax/device.c index 9833fa83b537..a14448bca83d 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -416,8 +416,7 @@ int dev_dax_probe(struct dev_dax *dev_dax) pgmap = devm_kzalloc(dev, sizeof(*pgmap), GFP_KERNEL); if (!pgmap) return -ENOMEM; - pgmap->res.start = range->start; - pgmap->res.end = range->end; + pgmap->range = *range; } pgmap->type = MEMORY_DEVICE_GENERIC; addr = devm_memremap_pages(dev, pgmap); diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c index aa260009dfc7..1a3347bb6143 100644 --- a/drivers/dax/hmem/hmem.c +++ b/drivers/dax/hmem/hmem.c @@ -13,13 +13,16 @@ static int dax_hmem_probe(struct platform_device *pdev) struct dev_dax_data data; struct dev_dax *dev_dax; struct resource *res; + struct range range; res = platform_get_resource(pdev, IORESOURCE_MEM, 0); if (!res) return -ENOMEM; mri = dev->platform_data; - dax_region = alloc_dax_region(dev, pdev->id, res, mri->target_node, + range.start = res->start; + range.end = res->end; + dax_region = alloc_dax_region(dev, pdev->id, &range, mri->target_node, PMD_SIZE, 0); if (!dax_region) return -ENOMEM; diff --git a/drivers/dax/pmem/core.c b/drivers/dax/pmem/core.c index 4fe700884338..62b26bfceab1 100644 --- a/drivers/dax/pmem/core.c +++ b/drivers/dax/pmem/core.c @@ -9,7 +9,7 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys) { - struct resource res; + struct range range; int rc, id, region_id; resource_size_t offset; struct nd_pfn_sb *pfn_sb; @@ -50,10 +50,10 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys) if (rc != 2) return ERR_PTR(-EINVAL); - /* adjust the dax_region resource to the start of data */ - memcpy(&res, &pgmap.res, sizeof(res)); - res.start += offset; - dax_region = alloc_dax_region(dev, region_id, &res, + /* adjust the dax_region range to the start of data */ + range = pgmap.range; + range.start += offset, + dax_region = alloc_dax_region(dev, region_id, &range, nd_region->target_node, le32_to_cpu(pfn_sb->align), IORESOURCE_DAX_STATIC); if (!dax_region) @@ -64,7 +64,7 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys) .id = id, .pgmap = &pgmap, .subsys = subsys, - .size = resource_size(&res), + .size = range_len(&range), }; dev_dax = devm_create_dev_dax(&data); diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index 4e8112fde3e6..25811ed7e274 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -101,7 +101,7 @@ unsigned long nouveau_dmem_page_addr(struct page *page) { struct nouveau_dmem_chunk *chunk = nouveau_page_to_chunk(page); unsigned long off = (page_to_pfn(page) << PAGE_SHIFT) - - chunk->pagemap.res.start; + chunk->pagemap.range.start; return chunk->bo->offset + off; } @@ -249,7 +249,8 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage) chunk->drm = drm; chunk->pagemap.type = MEMORY_DEVICE_PRIVATE; - chunk->pagemap.res = *res; + chunk->pagemap.range.start = res->start; + chunk->pagemap.range.end = res->end; chunk->pagemap.ops = &nouveau_dmem_pagemap_ops; chunk->pagemap.owner = drm->dev; @@ -273,7 +274,7 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage) list_add(&chunk->list, &drm->dmem->chunks); mutex_unlock(&drm->dmem->mutex); - pfn_first = chunk->pagemap.res.start >> PAGE_SHIFT; + pfn_first = chunk->pagemap.range.start >> PAGE_SHIFT; page = pfn_to_page(pfn_first); spin_lock(&drm->dmem->lock); for (i = 0; i < DMEM_CHUNK_NPAGES - 1; ++i, ++page) { @@ -294,8 +295,7 @@ out_bo_unpin: out_bo_free: nouveau_bo_ref(NULL, &chunk->bo); out_release: - release_mem_region(chunk->pagemap.res.start, - resource_size(&chunk->pagemap.res)); + release_mem_region(chunk->pagemap.range.start, range_len(&chunk->pagemap.range)); out_free: kfree(chunk); out: @@ -382,8 +382,8 @@ nouveau_dmem_fini(struct nouveau_drm *drm) nouveau_bo_ref(NULL, &chunk->bo); list_del(&chunk->list); memunmap_pages(&chunk->pagemap); - release_mem_region(chunk->pagemap.res.start, - resource_size(&chunk->pagemap.res)); + release_mem_region(chunk->pagemap.range.start, + range_len(&chunk->pagemap.range)); kfree(chunk); } diff --git a/drivers/nvdimm/badrange.c b/drivers/nvdimm/badrange.c index b9eeefa27e3a..aaf6e215a8c6 100644 --- a/drivers/nvdimm/badrange.c +++ b/drivers/nvdimm/badrange.c @@ -211,7 +211,7 @@ static void __add_badblock_range(struct badblocks *bb, u64 ns_offset, u64 len) } static void badblocks_populate(struct badrange *badrange, - struct badblocks *bb, const struct resource *res) + struct badblocks *bb, const struct range *range) { struct badrange_entry *bre; @@ -222,34 +222,34 @@ static void badblocks_populate(struct badrange *badrange, u64 bre_end = bre->start + bre->length - 1; /* Discard intervals with no intersection */ - if (bre_end < res->start) + if (bre_end < range->start) continue; - if (bre->start > res->end) + if (bre->start > range->end) continue; /* Deal with any overlap after start of the namespace */ - if (bre->start >= res->start) { + if (bre->start >= range->start) { u64 start = bre->start; u64 len; - if (bre_end <= res->end) + if (bre_end <= range->end) len = bre->length; else - len = res->start + resource_size(res) + len = range->start + range_len(range) - bre->start; - __add_badblock_range(bb, start - res->start, len); + __add_badblock_range(bb, start - range->start, len); continue; } /* * Deal with overlap for badrange starting before * the namespace. */ - if (bre->start < res->start) { + if (bre->start < range->start) { u64 len; - if (bre_end < res->end) - len = bre->start + bre->length - res->start; + if (bre_end < range->end) + len = bre->start + bre->length - range->start; else - len = resource_size(res); + len = range_len(range); __add_badblock_range(bb, 0, len); } } @@ -267,7 +267,7 @@ static void badblocks_populate(struct badrange *badrange, * and add badblocks entries for all matching sub-ranges */ void nvdimm_badblocks_populate(struct nd_region *nd_region, - struct badblocks *bb, const struct resource *res) + struct badblocks *bb, const struct range *range) { struct nvdimm_bus *nvdimm_bus; @@ -279,7 +279,7 @@ void nvdimm_badblocks_populate(struct nd_region *nd_region, nvdimm_bus = walk_to_nvdimm_bus(&nd_region->dev); nvdimm_bus_lock(&nvdimm_bus->dev); - badblocks_populate(&nvdimm_bus->badrange, bb, res); + badblocks_populate(&nvdimm_bus->badrange, bb, range); nvdimm_bus_unlock(&nvdimm_bus->dev); } EXPORT_SYMBOL_GPL(nvdimm_badblocks_populate); diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c index 22d865ba6353..5a7c80053c62 100644 --- a/drivers/nvdimm/claim.c +++ b/drivers/nvdimm/claim.c @@ -303,13 +303,16 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns, int devm_nsio_enable(struct device *dev, struct nd_namespace_io *nsio, resource_size_t size) { - struct resource *res = &nsio->res; struct nd_namespace_common *ndns = &nsio->common; + struct range range = { + .start = nsio->res.start, + .end = nsio->res.end, + }; nsio->size = size; - if (!devm_request_mem_region(dev, res->start, size, + if (!devm_request_mem_region(dev, range.start, size, dev_name(&ndns->dev))) { - dev_warn(dev, "could not reserve region %pR\n", res); + dev_warn(dev, "could not reserve region %pR\n", &nsio->res); return -EBUSY; } @@ -317,9 +320,9 @@ int devm_nsio_enable(struct device *dev, struct nd_namespace_io *nsio, if (devm_init_badblocks(dev, &nsio->bb)) return -ENOMEM; nvdimm_badblocks_populate(to_nd_region(ndns->dev.parent), &nsio->bb, - &nsio->res); + &range); - nsio->addr = devm_memremap(dev, res->start, size, ARCH_MEMREMAP_PMEM); + nsio->addr = devm_memremap(dev, range.start, size, ARCH_MEMREMAP_PMEM); return PTR_ERR_OR_ZERO(nsio->addr); } diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h index 72740108ba42..696b55556d4d 100644 --- a/drivers/nvdimm/nd.h +++ b/drivers/nvdimm/nd.h @@ -377,8 +377,9 @@ int nvdimm_namespace_detach_btt(struct nd_btt *nd_btt); const char *nvdimm_namespace_disk_name(struct nd_namespace_common *ndns, char *name); unsigned int pmem_sector_size(struct nd_namespace_common *ndns); +struct range; void nvdimm_badblocks_populate(struct nd_region *nd_region, - struct badblocks *bb, const struct resource *res); + struct badblocks *bb, const struct range *range); int devm_namespace_enable(struct device *dev, struct nd_namespace_common *ndns, resource_size_t size); void devm_namespace_disable(struct device *dev, diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c index 3e11ef8d3f5b..3c4787b92a6a 100644 --- a/drivers/nvdimm/pfn_devs.c +++ b/drivers/nvdimm/pfn_devs.c @@ -672,7 +672,7 @@ static unsigned long init_altmap_reserve(resource_size_t base) static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap) { - struct resource *res = &pgmap->res; + struct range *range = &pgmap->range; struct vmem_altmap *altmap = &pgmap->altmap; struct nd_pfn_sb *pfn_sb = nd_pfn->pfn_sb; u64 offset = le64_to_cpu(pfn_sb->dataoff); @@ -689,16 +689,16 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap) .end_pfn = PHYS_PFN(end), }; - memcpy(res, &nsio->res, sizeof(*res)); - res->start += start_pad; - res->end -= end_trunc; - + *range = (struct range) { + .start = nsio->res.start + start_pad, + .end = nsio->res.end - end_trunc, + }; if (nd_pfn->mode == PFN_MODE_RAM) { if (offset < reserve) return -EINVAL; nd_pfn->npfns = le64_to_cpu(pfn_sb->npfns); } else if (nd_pfn->mode == PFN_MODE_PMEM) { - nd_pfn->npfns = PHYS_PFN((resource_size(res) - offset)); + nd_pfn->npfns = PHYS_PFN((range_len(range) - offset)); if (le64_to_cpu(nd_pfn->pfn_sb->npfns) > nd_pfn->npfns) dev_info(&nd_pfn->dev, "number of pfns truncated from %lld to %ld\n", diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index c86a0ceaece6..1f394f44838f 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -375,7 +375,7 @@ static int pmem_attach_disk(struct device *dev, struct nd_region *nd_region = to_nd_region(dev->parent); int nid = dev_to_node(dev), fua; struct resource *res = &nsio->res; - struct resource bb_res; + struct range bb_range; struct nd_pfn *nd_pfn = NULL; struct dax_device *dax_dev; struct nd_pfn_sb *pfn_sb; @@ -434,24 +434,26 @@ static int pmem_attach_disk(struct device *dev, pfn_sb = nd_pfn->pfn_sb; pmem->data_offset = le64_to_cpu(pfn_sb->dataoff); pmem->pfn_pad = resource_size(res) - - resource_size(&pmem->pgmap.res); + range_len(&pmem->pgmap.range); pmem->pfn_flags |= PFN_MAP; - memcpy(&bb_res, &pmem->pgmap.res, sizeof(bb_res)); - bb_res.start += pmem->data_offset; + bb_range = pmem->pgmap.range; + bb_range.start += pmem->data_offset; } else if (pmem_should_map_pages(dev)) { - memcpy(&pmem->pgmap.res, &nsio->res, sizeof(pmem->pgmap.res)); + pmem->pgmap.range.start = res->start; + pmem->pgmap.range.end = res->end; pmem->pgmap.type = MEMORY_DEVICE_FS_DAX; pmem->pgmap.ops = &fsdax_pagemap_ops; addr = devm_memremap_pages(dev, &pmem->pgmap); pmem->pfn_flags |= PFN_MAP; - memcpy(&bb_res, &pmem->pgmap.res, sizeof(bb_res)); + bb_range = pmem->pgmap.range; } else { if (devm_add_action_or_reset(dev, pmem_release_queue, &pmem->pgmap)) return -ENOMEM; addr = devm_memremap(dev, pmem->phys_addr, pmem->size, ARCH_MEMREMAP_PMEM); - memcpy(&bb_res, &nsio->res, sizeof(bb_res)); + bb_range.start = res->start; + bb_range.end = res->end; } if (IS_ERR(addr)) @@ -480,7 +482,7 @@ static int pmem_attach_disk(struct device *dev, / 512); if (devm_init_badblocks(dev, &pmem->bb)) return -ENOMEM; - nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_res); + nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_range); disk->bb = &pmem->bb; if (is_nvdimm_sync(nd_region)) @@ -591,8 +593,8 @@ static void nd_pmem_notify(struct device *dev, enum nvdimm_event event) resource_size_t offset = 0, end_trunc = 0; struct nd_namespace_common *ndns; struct nd_namespace_io *nsio; - struct resource res; struct badblocks *bb; + struct range range; struct kernfs_node *bb_state; if (event != NVDIMM_REVALIDATE_POISON) @@ -628,9 +630,9 @@ static void nd_pmem_notify(struct device *dev, enum nvdimm_event event) nsio = to_nd_namespace_io(&ndns->dev); } - res.start = nsio->res.start + offset; - res.end = nsio->res.end - end_trunc; - nvdimm_badblocks_populate(nd_region, bb, &res); + range.start = nsio->res.start + offset; + range.end = nsio->res.end - end_trunc; + nvdimm_badblocks_populate(nd_region, bb, &range); if (bb_state) sysfs_notify_dirent(bb_state); } diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c index 0f6978e72e7c..bfce87ed72ab 100644 --- a/drivers/nvdimm/region.c +++ b/drivers/nvdimm/region.c @@ -35,7 +35,10 @@ static int nd_region_probe(struct device *dev) return rc; if (is_memory(&nd_region->dev)) { - struct resource ndr_res; + struct range range = { + .start = nd_region->ndr_start, + .end = nd_region->ndr_start + nd_region->ndr_size - 1, + }; if (devm_init_badblocks(dev, &nd_region->bb)) return -ENODEV; @@ -44,9 +47,7 @@ static int nd_region_probe(struct device *dev) if (!nd_region->bb_state) dev_warn(&nd_region->dev, "'badblocks' notification disabled\n"); - ndr_res.start = nd_region->ndr_start; - ndr_res.end = nd_region->ndr_start + nd_region->ndr_size - 1; - nvdimm_badblocks_populate(nd_region, &nd_region->bb, &ndr_res); + nvdimm_badblocks_populate(nd_region, &nd_region->bb, &range); } rc = nd_region_register_namespaces(nd_region, &err); @@ -121,14 +122,16 @@ static void nd_region_notify(struct device *dev, enum nvdimm_event event) { if (event == NVDIMM_REVALIDATE_POISON) { struct nd_region *nd_region = to_nd_region(dev); - struct resource res; if (is_memory(&nd_region->dev)) { - res.start = nd_region->ndr_start; - res.end = nd_region->ndr_start + - nd_region->ndr_size - 1; + struct range range = { + .start = nd_region->ndr_start, + .end = nd_region->ndr_start + + nd_region->ndr_size - 1, + }; + nvdimm_badblocks_populate(nd_region, - &nd_region->bb, &res); + &nd_region->bb, &range); if (nd_region->bb_state) sysfs_notify_dirent(nd_region->bb_state); } diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c index f357f9a32b3a..256850513813 100644 --- a/drivers/pci/p2pdma.c +++ b/drivers/pci/p2pdma.c @@ -185,9 +185,8 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, return -ENOMEM; pgmap = &p2p_pgmap->pgmap; - pgmap->res.start = pci_resource_start(pdev, bar) + offset; - pgmap->res.end = pgmap->res.start + size - 1; - pgmap->res.flags = pci_resource_flags(pdev, bar); + pgmap->range.start = pci_resource_start(pdev, bar) + offset; + pgmap->range.end = pgmap->range.start + size - 1; pgmap->type = MEMORY_DEVICE_PCI_P2PDMA; p2p_pgmap->provider = pdev; @@ -202,13 +201,13 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, error = gen_pool_add_owner(pdev->p2pdma->pool, (unsigned long)addr, pci_bus_address(pdev, bar) + offset, - resource_size(&pgmap->res), dev_to_node(&pdev->dev), + range_len(&pgmap->range), dev_to_node(&pdev->dev), pgmap->ref); if (error) goto pages_free; - pci_info(pdev, "added peer-to-peer DMA memory %pR\n", - &pgmap->res); + pci_info(pdev, "added peer-to-peer DMA memory %#llx-%#llx\n", + pgmap->range.start, pgmap->range.end); return 0; diff --git a/drivers/xen/unpopulated-alloc.c b/drivers/xen/unpopulated-alloc.c index 3b98dc921426..091b8669eca3 100644 --- a/drivers/xen/unpopulated-alloc.c +++ b/drivers/xen/unpopulated-alloc.c @@ -18,27 +18,37 @@ static unsigned int list_count; static int fill_list(unsigned int nr_pages) { struct dev_pagemap *pgmap; + struct resource *res; void *vaddr; unsigned int i, alloc_pages = round_up(nr_pages, PAGES_PER_SECTION); - int ret; + int ret = -ENOMEM; + + res = kzalloc(sizeof(*res), GFP_KERNEL); + if (!res) + return -ENOMEM; pgmap = kzalloc(sizeof(*pgmap), GFP_KERNEL); if (!pgmap) - return -ENOMEM; + goto err_pgmap; pgmap->type = MEMORY_DEVICE_GENERIC; - pgmap->res.name = "Xen scratch"; - pgmap->res.flags = IORESOURCE_MEM | IORESOURCE_BUSY; + res->name = "Xen scratch"; + res->flags = IORESOURCE_MEM | IORESOURCE_BUSY; - ret = allocate_resource(&iomem_resource, &pgmap->res, + ret = allocate_resource(&iomem_resource, res, alloc_pages * PAGE_SIZE, 0, -1, PAGES_PER_SECTION * PAGE_SIZE, NULL, NULL); if (ret < 0) { pr_err("Cannot allocate new IOMEM resource\n"); - kfree(pgmap); - return ret; + goto err_resource; } + pgmap->range = (struct range) { + .start = res->start, + .end = res->end, + }; + pgmap->owner = res; + #ifdef CONFIG_XEN_HAVE_PVMMU /* * memremap will build page tables for the new memory so @@ -50,14 +60,13 @@ static int fill_list(unsigned int nr_pages) * conflict with any devices. */ if (!xen_feature(XENFEAT_auto_translated_physmap)) { - xen_pfn_t pfn = PFN_DOWN(pgmap->res.start); + xen_pfn_t pfn = PFN_DOWN(res->start); for (i = 0; i < alloc_pages; i++) { if (!set_phys_to_machine(pfn + i, INVALID_P2M_ENTRY)) { pr_warn("set_phys_to_machine() failed, no memory added\n"); - release_resource(&pgmap->res); - kfree(pgmap); - return -ENOMEM; + ret = -ENOMEM; + goto err_memremap; } } } @@ -66,9 +75,8 @@ static int fill_list(unsigned int nr_pages) vaddr = memremap_pages(pgmap, NUMA_NO_NODE); if (IS_ERR(vaddr)) { pr_err("Cannot remap memory range\n"); - release_resource(&pgmap->res); - kfree(pgmap); - return PTR_ERR(vaddr); + ret = PTR_ERR(vaddr); + goto err_memremap; } for (i = 0; i < alloc_pages; i++) { @@ -80,6 +88,14 @@ static int fill_list(unsigned int nr_pages) } return 0; + +err_memremap: + release_resource(res); +err_resource: + kfree(pgmap); +err_pgmap: + kfree(res); + return ret; } /** diff --git a/include/linux/memremap.h b/include/linux/memremap.h index e5862746751b..d0dd261d87c0 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -1,6 +1,7 @@ /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_MEMREMAP_H_ #define _LINUX_MEMREMAP_H_ +#include #include #include @@ -93,7 +94,7 @@ struct dev_pagemap_ops { /** * struct dev_pagemap - metadata for ZONE_DEVICE mappings * @altmap: pre-allocated/reserved memory for vmemmap allocations - * @res: physical address range covered by @ref + * @range: physical address range covered by @ref * @ref: reference count that pins the devm_memremap_pages() mapping * @internal_ref: internal reference if @ref is not provided by the caller * @done: completion for @internal_ref @@ -106,7 +107,7 @@ struct dev_pagemap_ops { */ struct dev_pagemap { struct vmem_altmap altmap; - struct resource res; + struct range range; struct percpu_ref *ref; struct percpu_ref internal_ref; struct completion done; diff --git a/include/linux/range.h b/include/linux/range.h index d1fbeb664012..274681cc3154 100644 --- a/include/linux/range.h +++ b/include/linux/range.h @@ -1,12 +1,18 @@ /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_RANGE_H #define _LINUX_RANGE_H +#include struct range { u64 start; u64 end; }; +static inline u64 range_len(const struct range *range) +{ + return range->end - range->start + 1; +} + int add_range(struct range *range, int az, int nr_range, u64 start, u64 end); diff --git a/lib/test_hmm.c b/lib/test_hmm.c index e7dc3de355b7..e97ca8ec0bce 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -460,6 +460,21 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice, unsigned long pfn_last; void *ptr; + devmem = kzalloc(sizeof(*devmem), GFP_KERNEL); + if (!devmem) + return -ENOMEM; + + res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE, + "hmm_dmirror"); + if (IS_ERR(res)) + goto err_devmem; + + devmem->pagemap.type = MEMORY_DEVICE_PRIVATE; + devmem->pagemap.range.start = res->start; + devmem->pagemap.range.end = res->end; + devmem->pagemap.ops = &dmirror_devmem_ops; + devmem->pagemap.owner = mdevice; + mutex_lock(&mdevice->devmem_lock); if (mdevice->devmem_count == mdevice->devmem_capacity) { @@ -472,33 +487,18 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice, sizeof(new_chunks[0]) * new_capacity, GFP_KERNEL); if (!new_chunks) - goto err; + goto err_release; mdevice->devmem_capacity = new_capacity; mdevice->devmem_chunks = new_chunks; } - res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE, - "hmm_dmirror"); - if (IS_ERR(res)) - goto err; - - devmem = kzalloc(sizeof(*devmem), GFP_KERNEL); - if (!devmem) - goto err_release; - - devmem->pagemap.type = MEMORY_DEVICE_PRIVATE; - devmem->pagemap.res = *res; - devmem->pagemap.ops = &dmirror_devmem_ops; - devmem->pagemap.owner = mdevice; - ptr = memremap_pages(&devmem->pagemap, numa_node_id()); if (IS_ERR(ptr)) - goto err_free; + goto err_release; devmem->mdevice = mdevice; - pfn_first = devmem->pagemap.res.start >> PAGE_SHIFT; - pfn_last = pfn_first + - (resource_size(&devmem->pagemap.res) >> PAGE_SHIFT); + pfn_first = devmem->pagemap.range.start >> PAGE_SHIFT; + pfn_last = pfn_first + (range_len(&devmem->pagemap.range) >> PAGE_SHIFT); mdevice->devmem_chunks[mdevice->devmem_count++] = devmem; mutex_unlock(&mdevice->devmem_lock); @@ -525,12 +525,12 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice, return true; -err_free: - kfree(devmem); err_release: - release_mem_region(res->start, resource_size(res)); -err: mutex_unlock(&mdevice->devmem_lock); + release_mem_region(devmem->pagemap.range.start, range_len(&devmem->pagemap.range)); +err_devmem: + kfree(devmem); + return false; } @@ -1100,8 +1100,8 @@ static void dmirror_device_remove(struct dmirror_device *mdevice) mdevice->devmem_chunks[i]; memunmap_pages(&devmem->pagemap); - release_mem_region(devmem->pagemap.res.start, - resource_size(&devmem->pagemap.res)); + release_mem_region(devmem->pagemap.range.start, + range_len(&devmem->pagemap.range)); kfree(devmem); } kfree(mdevice->devmem_chunks); diff --git a/mm/memremap.c b/mm/memremap.c index 006dace60b1a..d958d348b3ca 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -70,24 +70,24 @@ static void devmap_managed_enable_put(void) } #endif /* CONFIG_DEV_PAGEMAP_OPS */ -static void pgmap_array_delete(struct resource *res) +static void pgmap_array_delete(struct range *range) { - xa_store_range(&pgmap_array, PHYS_PFN(res->start), PHYS_PFN(res->end), + xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end), NULL, GFP_KERNEL); synchronize_rcu(); } static unsigned long pfn_first(struct dev_pagemap *pgmap) { - return PHYS_PFN(pgmap->res.start) + + return PHYS_PFN(pgmap->range.start) + vmem_altmap_offset(pgmap_altmap(pgmap)); } static unsigned long pfn_end(struct dev_pagemap *pgmap) { - const struct resource *res = &pgmap->res; + const struct range *range = &pgmap->range; - return (res->start + resource_size(res)) >> PAGE_SHIFT; + return (range->start + range_len(range)) >> PAGE_SHIFT; } static unsigned long pfn_next(unsigned long pfn) @@ -126,7 +126,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap) void memunmap_pages(struct dev_pagemap *pgmap) { - struct resource *res = &pgmap->res; + struct range *range = &pgmap->range; struct page *first_page; unsigned long pfn; int nid; @@ -143,20 +143,20 @@ void memunmap_pages(struct dev_pagemap *pgmap) nid = page_to_nid(first_page); mem_hotplug_begin(); - remove_pfn_range_from_zone(page_zone(first_page), PHYS_PFN(res->start), - PHYS_PFN(resource_size(res))); + remove_pfn_range_from_zone(page_zone(first_page), PHYS_PFN(range->start), + PHYS_PFN(range_len(range))); if (pgmap->type == MEMORY_DEVICE_PRIVATE) { - __remove_pages(PHYS_PFN(res->start), - PHYS_PFN(resource_size(res)), NULL); + __remove_pages(PHYS_PFN(range->start), + PHYS_PFN(range_len(range)), NULL); } else { - arch_remove_memory(nid, res->start, resource_size(res), + arch_remove_memory(nid, range->start, range_len(range), pgmap_altmap(pgmap)); - kasan_remove_zero_shadow(__va(res->start), resource_size(res)); + kasan_remove_zero_shadow(__va(range->start), range_len(range)); } mem_hotplug_done(); - untrack_pfn(NULL, PHYS_PFN(res->start), resource_size(res)); - pgmap_array_delete(res); + untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range)); + pgmap_array_delete(range); WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n"); devmap_managed_enable_put(); } @@ -182,7 +182,7 @@ static void dev_pagemap_percpu_release(struct percpu_ref *ref) */ void *memremap_pages(struct dev_pagemap *pgmap, int nid) { - struct resource *res = &pgmap->res; + struct range *range = &pgmap->range; struct dev_pagemap *conflict_pgmap; struct mhp_params params = { /* @@ -251,7 +251,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) return ERR_PTR(error); } - conflict_pgmap = get_dev_pagemap(PHYS_PFN(res->start), NULL); + conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->start), NULL); if (conflict_pgmap) { WARN(1, "Conflicting mapping in same section\n"); put_dev_pagemap(conflict_pgmap); @@ -259,7 +259,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) goto err_array; } - conflict_pgmap = get_dev_pagemap(PHYS_PFN(res->end), NULL); + conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->end), NULL); if (conflict_pgmap) { WARN(1, "Conflicting mapping in same section\n"); put_dev_pagemap(conflict_pgmap); @@ -267,26 +267,27 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) goto err_array; } - is_ram = region_intersects(res->start, resource_size(res), + is_ram = region_intersects(range->start, range_len(range), IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE); if (is_ram != REGION_DISJOINT) { - WARN_ONCE(1, "%s attempted on %s region %pr\n", __func__, - is_ram == REGION_MIXED ? "mixed" : "ram", res); + WARN_ONCE(1, "attempted on %s region %#llx-%#llx\n", + is_ram == REGION_MIXED ? "mixed" : "ram", + range->start, range->end); error = -ENXIO; goto err_array; } - error = xa_err(xa_store_range(&pgmap_array, PHYS_PFN(res->start), - PHYS_PFN(res->end), pgmap, GFP_KERNEL)); + error = xa_err(xa_store_range(&pgmap_array, PHYS_PFN(range->start), + PHYS_PFN(range->end), pgmap, GFP_KERNEL)); if (error) goto err_array; if (nid < 0) nid = numa_mem_id(); - error = track_pfn_remap(NULL, ¶ms.pgprot, PHYS_PFN(res->start), - 0, resource_size(res)); + error = track_pfn_remap(NULL, ¶ms.pgprot, PHYS_PFN(range->start), 0, + range_len(range)); if (error) goto err_pfn_remap; @@ -304,16 +305,16 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) * arch_add_memory(). */ if (pgmap->type == MEMORY_DEVICE_PRIVATE) { - error = add_pages(nid, PHYS_PFN(res->start), - PHYS_PFN(resource_size(res)), ¶ms); + error = add_pages(nid, PHYS_PFN(range->start), + PHYS_PFN(range_len(range)), ¶ms); } else { - error = kasan_add_zero_shadow(__va(res->start), resource_size(res)); + error = kasan_add_zero_shadow(__va(range->start), range_len(range)); if (error) { mem_hotplug_done(); goto err_kasan; } - error = arch_add_memory(nid, res->start, resource_size(res), + error = arch_add_memory(nid, range->start, range_len(range), ¶ms); } @@ -321,8 +322,8 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) struct zone *zone; zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE]; - move_pfn_range_to_zone(zone, PHYS_PFN(res->start), - PHYS_PFN(resource_size(res)), params.altmap); + move_pfn_range_to_zone(zone, PHYS_PFN(range->start), + PHYS_PFN(range_len(range)), params.altmap); } mem_hotplug_done(); @@ -334,17 +335,17 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) * to allow us to do the work while not holding the hotplug lock. */ memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE], - PHYS_PFN(res->start), - PHYS_PFN(resource_size(res)), pgmap); + PHYS_PFN(range->start), + PHYS_PFN(range_len(range)), pgmap); percpu_ref_get_many(pgmap->ref, pfn_end(pgmap) - pfn_first(pgmap)); - return __va(res->start); + return __va(range->start); err_add_memory: - kasan_remove_zero_shadow(__va(res->start), resource_size(res)); + kasan_remove_zero_shadow(__va(range->start), range_len(range)); err_kasan: - untrack_pfn(NULL, PHYS_PFN(res->start), resource_size(res)); + untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range)); err_pfn_remap: - pgmap_array_delete(res); + pgmap_array_delete(range); err_array: dev_pagemap_kill(pgmap); dev_pagemap_cleanup(pgmap); @@ -369,7 +370,7 @@ EXPORT_SYMBOL_GPL(memremap_pages); * 'live' on entry and will be killed and reaped at * devm_memremap_pages_release() time, or if this routine fails. * - * 4/ res is expected to be a host memory range that could feasibly be + * 4/ range is expected to be a host memory range that could feasibly be * treated as a "System RAM" range, i.e. not a device mmio range, but * this is not enforced. */ @@ -426,7 +427,7 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn, * In the cached case we're already holding a live reference. */ if (pgmap) { - if (phys >= pgmap->res.start && phys <= pgmap->res.end) + if (phys >= pgmap->range.start && phys <= pgmap->range.end) return pgmap; put_dev_pagemap(pgmap); } diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c index 03e40b3b0106..c62d372d426f 100644 --- a/tools/testing/nvdimm/test/iomap.c +++ b/tools/testing/nvdimm/test/iomap.c @@ -126,7 +126,7 @@ static void dev_pagemap_percpu_release(struct percpu_ref *ref) void *__wrap_devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) { int error; - resource_size_t offset = pgmap->res.start; + resource_size_t offset = pgmap->range.start; struct nfit_test_resource *nfit_res = get_nfit_res(offset); if (!nfit_res) From b7b3c01b191596d27a6980d1a42504f5b607f802 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 13 Oct 2020 16:50:34 -0700 Subject: [PATCH 045/181] mm/memremap_pages: support multiple ranges per invocation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In support of device-dax growing the ability to front physically dis-contiguous ranges of memory, update devm_memremap_pages() to track multiple ranges with a single reference counter and devm instance. Convert all [devm_]memremap_pages() users to specify the number of ranges they are mapping in their 'struct dev_pagemap' instance. Signed-off-by: Dan Williams Signed-off-by: Andrew Morton Cc: Paul Mackerras Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Vishal Verma Cc: Vivek Goyal Cc: Dave Jiang Cc: Ben Skeggs Cc: David Airlie Cc: Daniel Vetter Cc: Ira Weiny Cc: Bjorn Helgaas Cc: Boris Ostrovsky Cc: Juergen Gross Cc: Stefano Stabellini Cc: "Jérôme Glisse" Cc: Ard Biesheuvel Cc: Ard Biesheuvel Cc: Borislav Petkov Cc: Brice Goglin Cc: Catalin Marinas Cc: Dave Hansen Cc: David Hildenbrand Cc: Greg Kroah-Hartman Cc: "H. Peter Anvin" Cc: Hulk Robot Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Jason Yan Cc: Jeff Moyer Cc: "Jérôme Glisse" Cc: Jia He Cc: Joao Martins Cc: Jonathan Cameron Cc: kernel test robot Cc: Mike Rapoport Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: "Rafael J. Wysocki" Cc: Randy Dunlap Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Wei Yang Cc: Will Deacon Link: https://lkml.kernel.org/r/159643103789.4062302.18426128170217903785.stgit@dwillia2-desk3.amr.corp.intel.com Link: https://lkml.kernel.org/r/160106116293.30709.13350662794915396198.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- arch/powerpc/kvm/book3s_hv_uvmem.c | 1 + drivers/dax/device.c | 1 + drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 + drivers/nvdimm/pfn_devs.c | 1 + drivers/nvdimm/pmem.c | 1 + drivers/pci/p2pdma.c | 1 + drivers/xen/unpopulated-alloc.c | 1 + include/linux/memremap.h | 10 +- lib/test_hmm.c | 1 + mm/memremap.c | 274 ++++++++++++++----------- 10 files changed, 174 insertions(+), 118 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c index 29ec555055c2..84e5a2dc8be5 100644 --- a/arch/powerpc/kvm/book3s_hv_uvmem.c +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c @@ -1172,6 +1172,7 @@ int kvmppc_uvmem_init(void) kvmppc_uvmem_pgmap.type = MEMORY_DEVICE_PRIVATE; kvmppc_uvmem_pgmap.range.start = res->start; kvmppc_uvmem_pgmap.range.end = res->end; + kvmppc_uvmem_pgmap.nr_range = 1; kvmppc_uvmem_pgmap.ops = &kvmppc_uvmem_ops; /* just one global instance: */ kvmppc_uvmem_pgmap.owner = &kvmppc_uvmem_pgmap; diff --git a/drivers/dax/device.c b/drivers/dax/device.c index a14448bca83d..5f808617672a 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -417,6 +417,7 @@ int dev_dax_probe(struct dev_dax *dev_dax) if (!pgmap) return -ENOMEM; pgmap->range = *range; + pgmap->nr_range = 1; } pgmap->type = MEMORY_DEVICE_GENERIC; addr = devm_memremap_pages(dev, pgmap); diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index 25811ed7e274..a13c6215bba8 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -251,6 +251,7 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage) chunk->pagemap.type = MEMORY_DEVICE_PRIVATE; chunk->pagemap.range.start = res->start; chunk->pagemap.range.end = res->end; + chunk->pagemap.nr_range = 1; chunk->pagemap.ops = &nouveau_dmem_pagemap_ops; chunk->pagemap.owner = drm->dev; diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c index 3c4787b92a6a..b499df630d4d 100644 --- a/drivers/nvdimm/pfn_devs.c +++ b/drivers/nvdimm/pfn_devs.c @@ -693,6 +693,7 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap) .start = nsio->res.start + start_pad, .end = nsio->res.end - end_trunc, }; + pgmap->nr_range = 1; if (nd_pfn->mode == PFN_MODE_RAM) { if (offset < reserve) return -EINVAL; diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 1f394f44838f..875076b0ea6c 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -441,6 +441,7 @@ static int pmem_attach_disk(struct device *dev, } else if (pmem_should_map_pages(dev)) { pmem->pgmap.range.start = res->start; pmem->pgmap.range.end = res->end; + pmem->pgmap.nr_range = 1; pmem->pgmap.type = MEMORY_DEVICE_FS_DAX; pmem->pgmap.ops = &fsdax_pagemap_ops; addr = devm_memremap_pages(dev, &pmem->pgmap); diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c index 256850513813..9d53c16b7329 100644 --- a/drivers/pci/p2pdma.c +++ b/drivers/pci/p2pdma.c @@ -187,6 +187,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, pgmap = &p2p_pgmap->pgmap; pgmap->range.start = pci_resource_start(pdev, bar) + offset; pgmap->range.end = pgmap->range.start + size - 1; + pgmap->nr_range = 1; pgmap->type = MEMORY_DEVICE_PCI_P2PDMA; p2p_pgmap->provider = pdev; diff --git a/drivers/xen/unpopulated-alloc.c b/drivers/xen/unpopulated-alloc.c index 091b8669eca3..8c512ea550bb 100644 --- a/drivers/xen/unpopulated-alloc.c +++ b/drivers/xen/unpopulated-alloc.c @@ -47,6 +47,7 @@ static int fill_list(unsigned int nr_pages) .start = res->start, .end = res->end, }; + pgmap->nr_range = 1; pgmap->owner = res; #ifdef CONFIG_XEN_HAVE_PVMMU diff --git a/include/linux/memremap.h b/include/linux/memremap.h index d0dd261d87c0..79c49e7f5c30 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -94,7 +94,6 @@ struct dev_pagemap_ops { /** * struct dev_pagemap - metadata for ZONE_DEVICE mappings * @altmap: pre-allocated/reserved memory for vmemmap allocations - * @range: physical address range covered by @ref * @ref: reference count that pins the devm_memremap_pages() mapping * @internal_ref: internal reference if @ref is not provided by the caller * @done: completion for @internal_ref @@ -104,10 +103,12 @@ struct dev_pagemap_ops { * @owner: an opaque pointer identifying the entity that manages this * instance. Used by various helpers to make sure that no * foreign ZONE_DEVICE memory is accessed. + * @nr_range: number of ranges to be mapped + * @range: range to be mapped when nr_range == 1 + * @ranges: array of ranges to be mapped when nr_range > 1 */ struct dev_pagemap { struct vmem_altmap altmap; - struct range range; struct percpu_ref *ref; struct percpu_ref internal_ref; struct completion done; @@ -115,6 +116,11 @@ struct dev_pagemap { unsigned int flags; const struct dev_pagemap_ops *ops; void *owner; + int nr_range; + union { + struct range range; + struct range ranges[0]; + }; }; static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap) diff --git a/lib/test_hmm.c b/lib/test_hmm.c index e97ca8ec0bce..c710b4c5714d 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -472,6 +472,7 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice, devmem->pagemap.type = MEMORY_DEVICE_PRIVATE; devmem->pagemap.range.start = res->start; devmem->pagemap.range.end = res->end; + devmem->pagemap.nr_range = 1; devmem->pagemap.ops = &dmirror_devmem_ops; devmem->pagemap.owner = mdevice; diff --git a/mm/memremap.c b/mm/memremap.c index d958d348b3ca..532ec3d36ab4 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -77,15 +77,19 @@ static void pgmap_array_delete(struct range *range) synchronize_rcu(); } -static unsigned long pfn_first(struct dev_pagemap *pgmap) +static unsigned long pfn_first(struct dev_pagemap *pgmap, int range_id) { - return PHYS_PFN(pgmap->range.start) + - vmem_altmap_offset(pgmap_altmap(pgmap)); + struct range *range = &pgmap->ranges[range_id]; + unsigned long pfn = PHYS_PFN(range->start); + + if (range_id) + return pfn; + return pfn + vmem_altmap_offset(pgmap_altmap(pgmap)); } -static unsigned long pfn_end(struct dev_pagemap *pgmap) +static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id) { - const struct range *range = &pgmap->range; + const struct range *range = &pgmap->ranges[range_id]; return (range->start + range_len(range)) >> PAGE_SHIFT; } @@ -97,8 +101,8 @@ static unsigned long pfn_next(unsigned long pfn) return pfn + 1; } -#define for_each_device_pfn(pfn, map) \ - for (pfn = pfn_first(map); pfn < pfn_end(map); pfn = pfn_next(pfn)) +#define for_each_device_pfn(pfn, map, i) \ + for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn)) static void dev_pagemap_kill(struct dev_pagemap *pgmap) { @@ -124,20 +128,14 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap) pgmap->ref = NULL; } -void memunmap_pages(struct dev_pagemap *pgmap) +static void pageunmap_range(struct dev_pagemap *pgmap, int range_id) { - struct range *range = &pgmap->range; + struct range *range = &pgmap->ranges[range_id]; struct page *first_page; - unsigned long pfn; int nid; - dev_pagemap_kill(pgmap); - for_each_device_pfn(pfn, pgmap) - put_page(pfn_to_page(pfn)); - dev_pagemap_cleanup(pgmap); - /* make sure to access a memmap that was actually initialized */ - first_page = pfn_to_page(pfn_first(pgmap)); + first_page = pfn_to_page(pfn_first(pgmap, range_id)); /* pages are dead and unused, undo the arch mapping */ nid = page_to_nid(first_page); @@ -157,6 +155,22 @@ void memunmap_pages(struct dev_pagemap *pgmap) untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range)); pgmap_array_delete(range); +} + +void memunmap_pages(struct dev_pagemap *pgmap) +{ + unsigned long pfn; + int i; + + dev_pagemap_kill(pgmap); + for (i = 0; i < pgmap->nr_range; i++) + for_each_device_pfn(pfn, pgmap, i) + put_page(pfn_to_page(pfn)); + dev_pagemap_cleanup(pgmap); + + for (i = 0; i < pgmap->nr_range; i++) + pageunmap_range(pgmap, i); + WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n"); devmap_managed_enable_put(); } @@ -175,6 +189,114 @@ static void dev_pagemap_percpu_release(struct percpu_ref *ref) complete(&pgmap->done); } +static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params, + int range_id, int nid) +{ + struct range *range = &pgmap->ranges[range_id]; + struct dev_pagemap *conflict_pgmap; + int error, is_ram; + + if (WARN_ONCE(pgmap_altmap(pgmap) && range_id > 0, + "altmap not supported for multiple ranges\n")) + return -EINVAL; + + conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->start), NULL); + if (conflict_pgmap) { + WARN(1, "Conflicting mapping in same section\n"); + put_dev_pagemap(conflict_pgmap); + return -ENOMEM; + } + + conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->end), NULL); + if (conflict_pgmap) { + WARN(1, "Conflicting mapping in same section\n"); + put_dev_pagemap(conflict_pgmap); + return -ENOMEM; + } + + is_ram = region_intersects(range->start, range_len(range), + IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE); + + if (is_ram != REGION_DISJOINT) { + WARN_ONCE(1, "attempted on %s region %#llx-%#llx\n", + is_ram == REGION_MIXED ? "mixed" : "ram", + range->start, range->end); + return -ENXIO; + } + + error = xa_err(xa_store_range(&pgmap_array, PHYS_PFN(range->start), + PHYS_PFN(range->end), pgmap, GFP_KERNEL)); + if (error) + return error; + + if (nid < 0) + nid = numa_mem_id(); + + error = track_pfn_remap(NULL, ¶ms->pgprot, PHYS_PFN(range->start), 0, + range_len(range)); + if (error) + goto err_pfn_remap; + + mem_hotplug_begin(); + + /* + * For device private memory we call add_pages() as we only need to + * allocate and initialize struct page for the device memory. More- + * over the device memory is un-accessible thus we do not want to + * create a linear mapping for the memory like arch_add_memory() + * would do. + * + * For all other device memory types, which are accessible by + * the CPU, we do want the linear mapping and thus use + * arch_add_memory(). + */ + if (pgmap->type == MEMORY_DEVICE_PRIVATE) { + error = add_pages(nid, PHYS_PFN(range->start), + PHYS_PFN(range_len(range)), params); + } else { + error = kasan_add_zero_shadow(__va(range->start), range_len(range)); + if (error) { + mem_hotplug_done(); + goto err_kasan; + } + + error = arch_add_memory(nid, range->start, range_len(range), + params); + } + + if (!error) { + struct zone *zone; + + zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE]; + move_pfn_range_to_zone(zone, PHYS_PFN(range->start), + PHYS_PFN(range_len(range)), params->altmap); + } + + mem_hotplug_done(); + if (error) + goto err_add_memory; + + /* + * Initialization of the pages has been deferred until now in order + * to allow us to do the work while not holding the hotplug lock. + */ + memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE], + PHYS_PFN(range->start), + PHYS_PFN(range_len(range)), pgmap); + percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id) + - pfn_first(pgmap, range_id)); + return 0; + +err_add_memory: + kasan_remove_zero_shadow(__va(range->start), range_len(range)); +err_kasan: + untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range)); +err_pfn_remap: + pgmap_array_delete(range); + return error; +} + + /* * Not device managed version of dev_memremap_pages, undone by * memunmap_pages(). Please use dev_memremap_pages if you have a struct @@ -182,17 +304,16 @@ static void dev_pagemap_percpu_release(struct percpu_ref *ref) */ void *memremap_pages(struct dev_pagemap *pgmap, int nid) { - struct range *range = &pgmap->range; - struct dev_pagemap *conflict_pgmap; struct mhp_params params = { - /* - * We do not want any optional features only our own memmap - */ .altmap = pgmap_altmap(pgmap), .pgprot = PAGE_KERNEL, }; - int error, is_ram; + const int nr_range = pgmap->nr_range; bool need_devmap_managed = true; + int error, i; + + if (WARN_ONCE(!nr_range, "nr_range must be specified\n")) + return ERR_PTR(-EINVAL); switch (pgmap->type) { case MEMORY_DEVICE_PRIVATE: @@ -251,106 +372,27 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) return ERR_PTR(error); } - conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->start), NULL); - if (conflict_pgmap) { - WARN(1, "Conflicting mapping in same section\n"); - put_dev_pagemap(conflict_pgmap); - error = -ENOMEM; - goto err_array; - } - - conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->end), NULL); - if (conflict_pgmap) { - WARN(1, "Conflicting mapping in same section\n"); - put_dev_pagemap(conflict_pgmap); - error = -ENOMEM; - goto err_array; - } - - is_ram = region_intersects(range->start, range_len(range), - IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE); - - if (is_ram != REGION_DISJOINT) { - WARN_ONCE(1, "attempted on %s region %#llx-%#llx\n", - is_ram == REGION_MIXED ? "mixed" : "ram", - range->start, range->end); - error = -ENXIO; - goto err_array; - } - - error = xa_err(xa_store_range(&pgmap_array, PHYS_PFN(range->start), - PHYS_PFN(range->end), pgmap, GFP_KERNEL)); - if (error) - goto err_array; - - if (nid < 0) - nid = numa_mem_id(); - - error = track_pfn_remap(NULL, ¶ms.pgprot, PHYS_PFN(range->start), 0, - range_len(range)); - if (error) - goto err_pfn_remap; - - mem_hotplug_begin(); - /* - * For device private memory we call add_pages() as we only need to - * allocate and initialize struct page for the device memory. More- - * over the device memory is un-accessible thus we do not want to - * create a linear mapping for the memory like arch_add_memory() - * would do. - * - * For all other device memory types, which are accessible by - * the CPU, we do want the linear mapping and thus use - * arch_add_memory(). + * Clear the pgmap nr_range as it will be incremented for each + * successfully processed range. This communicates how many + * regions to unwind in the abort case. */ - if (pgmap->type == MEMORY_DEVICE_PRIVATE) { - error = add_pages(nid, PHYS_PFN(range->start), - PHYS_PFN(range_len(range)), ¶ms); - } else { - error = kasan_add_zero_shadow(__va(range->start), range_len(range)); - if (error) { - mem_hotplug_done(); - goto err_kasan; - } - - error = arch_add_memory(nid, range->start, range_len(range), - ¶ms); + pgmap->nr_range = 0; + error = 0; + for (i = 0; i < nr_range; i++) { + error = pagemap_range(pgmap, ¶ms, i, nid); + if (error) + break; + pgmap->nr_range++; } - if (!error) { - struct zone *zone; - - zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE]; - move_pfn_range_to_zone(zone, PHYS_PFN(range->start), - PHYS_PFN(range_len(range)), params.altmap); + if (i < nr_range) { + memunmap_pages(pgmap); + pgmap->nr_range = nr_range; + return ERR_PTR(error); } - mem_hotplug_done(); - if (error) - goto err_add_memory; - - /* - * Initialization of the pages has been deferred until now in order - * to allow us to do the work while not holding the hotplug lock. - */ - memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE], - PHYS_PFN(range->start), - PHYS_PFN(range_len(range)), pgmap); - percpu_ref_get_many(pgmap->ref, pfn_end(pgmap) - pfn_first(pgmap)); - return __va(range->start); - - err_add_memory: - kasan_remove_zero_shadow(__va(range->start), range_len(range)); - err_kasan: - untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range)); - err_pfn_remap: - pgmap_array_delete(range); - err_array: - dev_pagemap_kill(pgmap); - dev_pagemap_cleanup(pgmap); - devmap_managed_enable_put(); - return ERR_PTR(error); + return __va(pgmap->ranges[0].start); } EXPORT_SYMBOL_GPL(memremap_pages); From 60e93dc097f7f13a16a7e4b75b8803eb2adbb721 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 13 Oct 2020 16:50:39 -0700 Subject: [PATCH 046/181] device-dax: add dis-contiguous resource support MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Break the requirement that device-dax instances are physically contiguous. With this constraint removed it allows fragmented available capacity to be fully allocated. This capability is useful to mitigate the "noisy neighbor" problem with memory-side-cache management for virtual machines, or any other scenario where a platform address boundary also designates a performance boundary. For example a direct mapped memory side cache might rotate cache colors at 1GB boundaries. With dis-contiguous allocations a device-dax instance could be configured to contain only 1 cache color. It also satisfies Joao's use case (see link) for partitioning memory for exclusive guest access. It allows for a future potential mode where the host kernel need not allocate 'struct page' capacity up-front. Reported-by: Joao Martins Signed-off-by: Dan Williams Signed-off-by: Andrew Morton Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Ard Biesheuvel Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Bjorn Helgaas Cc: Borislav Petkov Cc: Boris Ostrovsky Cc: Brice Goglin Cc: Catalin Marinas Cc: Daniel Vetter Cc: Dave Hansen Cc: Dave Jiang Cc: David Airlie Cc: David Hildenbrand Cc: Greg Kroah-Hartman Cc: "H. Peter Anvin" Cc: Hulk Robot Cc: Ingo Molnar Cc: Ira Weiny Cc: Jason Gunthorpe Cc: Jason Yan Cc: Jeff Moyer Cc: "Jérôme Glisse" Cc: Jia He Cc: Jonathan Cameron Cc: Juergen Gross Cc: kernel test robot Cc: Michael Ellerman Cc: Mike Rapoport Cc: Paul Mackerras Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: "Rafael J. Wysocki" Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Vishal Verma Cc: Vivek Goyal Cc: Wei Yang Cc: Will Deacon Link: https://lore.kernel.org/lkml/20200110190313.17144-1-joao.m.martins@oracle.com/ Link: https://lkml.kernel.org/r/159643104304.4062302.16561669534797528660.stgit@dwillia2-desk3.amr.corp.intel.com Link: https://lkml.kernel.org/r/160106116875.30709.11456649969327399771.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- drivers/dax/bus.c | 233 +++++++++++++++++++++++++-------- drivers/dax/dax-private.h | 9 +- drivers/dax/device.c | 53 +++++--- drivers/dax/kmem.c | 126 ++++++++++++------ tools/testing/nvdimm/dax-dev.c | 20 ++- 5 files changed, 320 insertions(+), 121 deletions(-) diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 00fa73a8dfb4..06a789aba58a 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -136,15 +136,27 @@ static bool is_static(struct dax_region *dax_region) return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0; } +static u64 dev_dax_size(struct dev_dax *dev_dax) +{ + u64 size = 0; + int i; + + device_lock_assert(&dev_dax->dev); + + for (i = 0; i < dev_dax->nr_range; i++) + size += range_len(&dev_dax->ranges[i].range); + + return size; +} + static int dax_bus_probe(struct device *dev) { struct dax_device_driver *dax_drv = to_dax_drv(dev->driver); struct dev_dax *dev_dax = to_dev_dax(dev); struct dax_region *dax_region = dev_dax->region; - struct range *range = &dev_dax->range; int rc; - if (range_len(range) == 0 || dev_dax->id < 0) + if (dev_dax_size(dev_dax) == 0 || dev_dax->id < 0) return -ENXIO; rc = dax_drv->probe(dev_dax); @@ -354,15 +366,19 @@ void kill_dev_dax(struct dev_dax *dev_dax) } EXPORT_SYMBOL_GPL(kill_dev_dax); -static void free_dev_dax_range(struct dev_dax *dev_dax) +static void free_dev_dax_ranges(struct dev_dax *dev_dax) { struct dax_region *dax_region = dev_dax->region; - struct range *range = &dev_dax->range; + int i; device_lock_assert(dax_region->dev); - if (range_len(range)) + for (i = 0; i < dev_dax->nr_range; i++) { + struct range *range = &dev_dax->ranges[i].range; + __release_region(&dax_region->res, range->start, range_len(range)); + } + dev_dax->nr_range = 0; } static void unregister_dev_dax(void *dev) @@ -372,7 +388,7 @@ static void unregister_dev_dax(void *dev) dev_dbg(dev, "%s\n", __func__); kill_dev_dax(dev_dax); - free_dev_dax_range(dev_dax); + free_dev_dax_ranges(dev_dax); device_del(dev); put_device(dev); } @@ -423,7 +439,7 @@ static ssize_t delete_store(struct device *dev, struct device_attribute *attr, device_lock(dev); device_lock(victim); dev_dax = to_dev_dax(victim); - if (victim->driver || range_len(&dev_dax->range)) + if (victim->driver || dev_dax_size(dev_dax)) rc = -EBUSY; else { /* @@ -569,51 +585,86 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start, struct dax_region *dax_region = dev_dax->region; struct resource *res = &dax_region->res; struct device *dev = &dev_dax->dev; + struct dev_dax_range *ranges; + unsigned long pgoff = 0; struct resource *alloc; + int i; device_lock_assert(dax_region->dev); /* handle the seed alloc special case */ if (!size) { - dev_dax->range = (struct range) { - .start = res->start, - .end = res->start - 1, - }; + if (dev_WARN_ONCE(dev, dev_dax->nr_range, + "0-size allocation must be first\n")) + return -EBUSY; + /* nr_range == 0 is elsewhere special cased as 0-size device */ return 0; } - alloc = __request_region(res, start, size, dev_name(dev), 0); - if (!alloc) + ranges = krealloc(dev_dax->ranges, sizeof(*ranges) + * (dev_dax->nr_range + 1), GFP_KERNEL); + if (!ranges) return -ENOMEM; - dev_dax->range = (struct range) { - .start = alloc->start, - .end = alloc->end, + alloc = __request_region(res, start, size, dev_name(dev), 0); + if (!alloc) { + /* + * If this was an empty set of ranges nothing else + * will release @ranges, so do it now. + */ + if (!dev_dax->nr_range) { + kfree(ranges); + ranges = NULL; + } + dev_dax->ranges = ranges; + return -ENOMEM; + } + + for (i = 0; i < dev_dax->nr_range; i++) + pgoff += PHYS_PFN(range_len(&ranges[i].range)); + dev_dax->ranges = ranges; + ranges[dev_dax->nr_range++] = (struct dev_dax_range) { + .pgoff = pgoff, + .range = { + .start = alloc->start, + .end = alloc->end, + }, }; + dev_dbg(dev, "alloc range[%d]: %pa:%pa\n", dev_dax->nr_range - 1, + &alloc->start, &alloc->end); + return 0; } static int adjust_dev_dax_range(struct dev_dax *dev_dax, struct resource *res, resource_size_t size) { + int last_range = dev_dax->nr_range - 1; + struct dev_dax_range *dax_range = &dev_dax->ranges[last_range]; struct dax_region *dax_region = dev_dax->region; - struct range *range = &dev_dax->range; - int rc = 0; + bool is_shrink = resource_size(res) > size; + struct range *range = &dax_range->range; + struct device *dev = &dev_dax->dev; + int rc; device_lock_assert(dax_region->dev); - if (size) - rc = adjust_resource(res, range->start, size); - else - __release_region(&dax_region->res, range->start, range_len(range)); + if (dev_WARN_ONCE(dev, !size, "deletion is handled by dev_dax_shrink\n")) + return -EINVAL; + + rc = adjust_resource(res, range->start, size); if (rc) return rc; - dev_dax->range = (struct range) { + *range = (struct range) { .start = range->start, .end = range->start + size - 1, }; + dev_dbg(dev, "%s range[%d]: %#llx:%#llx\n", is_shrink ? "shrink" : "extend", + last_range, (unsigned long long) range->start, + (unsigned long long) range->end); + return 0; } @@ -621,7 +672,11 @@ static ssize_t size_show(struct device *dev, struct device_attribute *attr, char *buf) { struct dev_dax *dev_dax = to_dev_dax(dev); - unsigned long long size = range_len(&dev_dax->range); + unsigned long long size; + + device_lock(dev); + size = dev_dax_size(dev_dax); + device_unlock(dev); return sprintf(buf, "%llu\n", size); } @@ -639,32 +694,82 @@ static bool alloc_is_aligned(struct dax_region *dax_region, static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size) { + resource_size_t to_shrink = dev_dax_size(dev_dax) - size; struct dax_region *dax_region = dev_dax->region; - struct range *range = &dev_dax->range; - struct resource *res, *adjust = NULL; struct device *dev = &dev_dax->dev; + int i; - for_each_dax_region_resource(dax_region, res) - if (strcmp(res->name, dev_name(dev)) == 0 - && res->start == range->start) { - adjust = res; - break; + for (i = dev_dax->nr_range - 1; i >= 0; i--) { + struct range *range = &dev_dax->ranges[i].range; + struct resource *adjust = NULL, *res; + resource_size_t shrink; + + shrink = min_t(u64, to_shrink, range_len(range)); + if (shrink >= range_len(range)) { + __release_region(&dax_region->res, range->start, + range_len(range)); + dev_dax->nr_range--; + dev_dbg(dev, "delete range[%d]: %#llx:%#llx\n", i, + (unsigned long long) range->start, + (unsigned long long) range->end); + to_shrink -= shrink; + if (!to_shrink) + break; + continue; } - if (dev_WARN_ONCE(dev, !adjust, "failed to find matching resource\n")) - return -ENXIO; - return adjust_dev_dax_range(dev_dax, adjust, size); + for_each_dax_region_resource(dax_region, res) + if (strcmp(res->name, dev_name(dev)) == 0 + && res->start == range->start) { + adjust = res; + break; + } + + if (dev_WARN_ONCE(dev, !adjust || i != dev_dax->nr_range - 1, + "failed to find matching resource\n")) + return -ENXIO; + return adjust_dev_dax_range(dev_dax, adjust, range_len(range) + - shrink); + } + return 0; +} + +/* + * Only allow adjustments that preserve the relative pgoff of existing + * allocations. I.e. the dev_dax->ranges array is ordered by increasing pgoff. + */ +static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res) +{ + struct dev_dax_range *last; + int i; + + if (dev_dax->nr_range == 0) + return false; + if (strcmp(res->name, dev_name(&dev_dax->dev)) != 0) + return false; + last = &dev_dax->ranges[dev_dax->nr_range - 1]; + if (last->range.start != res->start || last->range.end != res->end) + return false; + for (i = 0; i < dev_dax->nr_range - 1; i++) { + struct dev_dax_range *dax_range = &dev_dax->ranges[i]; + + if (dax_range->pgoff > last->pgoff) + return false; + } + + return true; } static ssize_t dev_dax_resize(struct dax_region *dax_region, struct dev_dax *dev_dax, resource_size_t size) { resource_size_t avail = dax_region_avail_size(dax_region), to_alloc; - resource_size_t dev_size = range_len(&dev_dax->range); + resource_size_t dev_size = dev_dax_size(dev_dax); struct resource *region_res = &dax_region->res; struct device *dev = &dev_dax->dev; - const char *name = dev_name(dev); struct resource *res, *first; + resource_size_t alloc = 0; + int rc; if (dev->driver) return -EBUSY; @@ -685,35 +790,47 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region, * may involve adjusting the end of an existing resource, or * allocating a new resource. */ +retry: first = region_res->child; if (!first) return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc); - for (res = first; to_alloc && res; res = res->sibling) { + + rc = -ENOSPC; + for (res = first; res; res = res->sibling) { struct resource *next = res->sibling; - resource_size_t free; /* space at the beginning of the region */ - free = 0; - if (res == first && res->start > dax_region->res.start) - free = res->start - dax_region->res.start; - if (free >= to_alloc && dev_size == 0) - return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc); + if (res == first && res->start > dax_region->res.start) { + alloc = min(res->start - dax_region->res.start, to_alloc); + rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, alloc); + break; + } - free = 0; + alloc = 0; /* space between allocations */ if (next && next->start > res->end + 1) - free = next->start - res->end + 1; + alloc = min(next->start - (res->end + 1), to_alloc); /* space at the end of the region */ - if (free < to_alloc && !next && res->end < region_res->end) - free = region_res->end - res->end; + if (!alloc && !next && res->end < region_res->end) + alloc = min(region_res->end - res->end, to_alloc); - if (free >= to_alloc && strcmp(name, res->name) == 0) - return adjust_dev_dax_range(dev_dax, res, resource_size(res) + to_alloc); - else if (free >= to_alloc && dev_size == 0) - return alloc_dev_dax_range(dev_dax, res->end + 1, to_alloc); + if (!alloc) + continue; + + if (adjust_ok(dev_dax, res)) { + rc = adjust_dev_dax_range(dev_dax, res, resource_size(res) + alloc); + break; + } + rc = alloc_dev_dax_range(dev_dax, res->end + 1, alloc); + break; } - return -ENOSPC; + if (rc) + return rc; + to_alloc -= alloc; + if (to_alloc) + goto retry; + return 0; } static ssize_t size_store(struct device *dev, struct device_attribute *attr, @@ -767,8 +884,15 @@ static ssize_t resource_show(struct device *dev, struct device_attribute *attr, char *buf) { struct dev_dax *dev_dax = to_dev_dax(dev); + struct dax_region *dax_region = dev_dax->region; + unsigned long long start; - return sprintf(buf, "%#llx\n", dev_dax->range.start); + if (dev_dax->nr_range < 1) + start = dax_region->res.start; + else + start = dev_dax->ranges[0].range.start; + + return sprintf(buf, "%#llx\n", start); } static DEVICE_ATTR(resource, 0400, resource_show, NULL); @@ -833,6 +957,7 @@ static void dev_dax_release(struct device *dev) put_dax(dax_dev); free_dev_dax_id(dev_dax); dax_region_put(dax_region); + kfree(dev_dax->ranges); kfree(dev_dax->pgmap); kfree(dev_dax); } @@ -941,7 +1066,7 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data) err_alloc_dax: kfree(dev_dax->pgmap); err_pgmap: - free_dev_dax_range(dev_dax); + free_dev_dax_ranges(dev_dax); err_range: free_dev_dax_id(dev_dax); err_id: diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index 0cbb2ec81ca7..f863287107fd 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -49,7 +49,8 @@ struct dax_region { * @id: ida allocated id * @dev - device core * @pgmap - pgmap for memmap setup / lifetime (driver owned) - * @range: resource range for the instance + * @nr_range: size of @ranges + * @ranges: resource-span + pgoff tuples for the instance */ struct dev_dax { struct dax_region *region; @@ -58,7 +59,11 @@ struct dev_dax { int id; struct device dev; struct dev_pagemap *pgmap; - struct range range; + int nr_range; + struct dev_dax_range { + unsigned long pgoff; + struct range range; + } *ranges; }; static inline struct dev_dax *to_dev_dax(struct device *dev) diff --git a/drivers/dax/device.c b/drivers/dax/device.c index 5f808617672a..bf389712a20b 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -55,15 +55,22 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma, __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, unsigned long size) { - struct range *range = &dev_dax->range; - phys_addr_t phys; + int i; - phys = pgoff * PAGE_SIZE + range->start; - if (phys >= range->start && phys <= range->end) { + for (i = 0; i < dev_dax->nr_range; i++) { + struct dev_dax_range *dax_range = &dev_dax->ranges[i]; + struct range *range = &dax_range->range; + unsigned long long pgoff_end; + phys_addr_t phys; + + pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1; + if (pgoff < dax_range->pgoff || pgoff > pgoff_end) + continue; + phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start; if (phys + size - 1 <= range->end) return phys; + break; } - return -1; } @@ -395,30 +402,40 @@ static void dev_dax_kill(void *dev_dax) int dev_dax_probe(struct dev_dax *dev_dax) { struct dax_device *dax_dev = dev_dax->dax_dev; - struct range *range = &dev_dax->range; struct device *dev = &dev_dax->dev; struct dev_pagemap *pgmap; struct inode *inode; struct cdev *cdev; void *addr; - int rc; - - /* 1:1 map region resource range to device-dax instance range */ - if (!devm_request_mem_region(dev, range->start, range_len(range), - dev_name(dev))) { - dev_warn(dev, "could not reserve range: %#llx - %#llx\n", - range->start, range->end); - return -EBUSY; - } + int rc, i; pgmap = dev_dax->pgmap; + if (dev_WARN_ONCE(dev, pgmap && dev_dax->nr_range > 1, + "static pgmap / multi-range device conflict\n")) + return -EINVAL; + if (!pgmap) { - pgmap = devm_kzalloc(dev, sizeof(*pgmap), GFP_KERNEL); + pgmap = devm_kzalloc(dev, sizeof(*pgmap) + sizeof(struct range) + * (dev_dax->nr_range - 1), GFP_KERNEL); if (!pgmap) return -ENOMEM; - pgmap->range = *range; - pgmap->nr_range = 1; + pgmap->nr_range = dev_dax->nr_range; } + + for (i = 0; i < dev_dax->nr_range; i++) { + struct range *range = &dev_dax->ranges[i].range; + + if (!devm_request_mem_region(dev, range->start, + range_len(range), dev_name(dev))) { + dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve range\n", + i, range->start, range->end); + return -EBUSY; + } + /* don't update the range for static pgmap */ + if (!dev_dax->pgmap) + pgmap->ranges[i] = *range; + } + pgmap->type = MEMORY_DEVICE_GENERIC; addr = devm_memremap_pages(dev, pgmap); if (IS_ERR(addr)) diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index c2ac465cc342..6c933f2b604e 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -19,24 +19,28 @@ static const char *kmem_name; /* Set if any memory will remain added when the driver will be unloaded. */ static bool any_hotremove_failed; -static struct range dax_kmem_range(struct dev_dax *dev_dax) +static int dax_kmem_range(struct dev_dax *dev_dax, int i, struct range *r) { - struct range range; + struct dev_dax_range *dax_range = &dev_dax->ranges[i]; + struct range *range = &dax_range->range; /* memory-block align the hotplug range */ - range.start = ALIGN(dev_dax->range.start, memory_block_size_bytes()); - range.end = ALIGN_DOWN(dev_dax->range.end + 1, memory_block_size_bytes()) - 1; - return range; + r->start = ALIGN(range->start, memory_block_size_bytes()); + r->end = ALIGN_DOWN(range->end + 1, memory_block_size_bytes()) - 1; + if (r->start >= r->end) { + r->start = range->start; + r->end = range->end; + return -ENOSPC; + } + return 0; } static int dev_dax_kmem_probe(struct dev_dax *dev_dax) { - struct range range = dax_kmem_range(dev_dax); struct device *dev = &dev_dax->dev; - struct resource *res; + int i, mapped = 0; char *res_name; int numa_node; - int rc; /* * Ensure good NUMA information for the persistent memory. @@ -55,31 +59,58 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) if (!res_name) return -ENOMEM; - /* Region is permanently reserved if hotremove fails. */ - res = request_mem_region(range.start, range_len(&range), res_name); - if (!res) { - dev_warn(dev, "could not reserve region [%#llx-%#llx]\n", range.start, range.end); - kfree(res_name); - return -EBUSY; - } + for (i = 0; i < dev_dax->nr_range; i++) { + struct resource *res; + struct range range; + int rc; - /* - * Set flags appropriate for System RAM. Leave ..._BUSY clear - * so that add_memory() can add a child resource. Do not - * inherit flags from the parent since it may set new flags - * unknown to us that will break add_memory() below. - */ - res->flags = IORESOURCE_SYSTEM_RAM; + rc = dax_kmem_range(dev_dax, i, &range); + if (rc) { + dev_info(dev, "mapping%d: %#llx-%#llx too small after alignment\n", + i, range.start, range.end); + continue; + } - /* - * Ensure that future kexec'd kernels will not treat this as RAM - * automatically. - */ - rc = add_memory_driver_managed(numa_node, range.start, range_len(&range), kmem_name); - if (rc) { - release_mem_region(range.start, range_len(&range)); - kfree(res_name); - return rc; + /* Region is permanently reserved if hotremove fails. */ + res = request_mem_region(range.start, range_len(&range), res_name); + if (!res) { + dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve region\n", + i, range.start, range.end); + /* + * Once some memory has been onlined we can't + * assume that it can be un-onlined safely. + */ + if (mapped) + continue; + kfree(res_name); + return -EBUSY; + } + + /* + * Set flags appropriate for System RAM. Leave ..._BUSY clear + * so that add_memory() can add a child resource. Do not + * inherit flags from the parent since it may set new flags + * unknown to us that will break add_memory() below. + */ + res->flags = IORESOURCE_SYSTEM_RAM; + + /* + * Ensure that future kexec'd kernels will not treat + * this as RAM automatically. + */ + rc = add_memory_driver_managed(numa_node, range.start, + range_len(&range), kmem_name); + + if (rc) { + dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", + i, range.start, range.end); + release_mem_region(range.start, range_len(&range)); + if (mapped) + continue; + kfree(res_name); + return rc; + } + mapped++; } dev_set_drvdata(dev, res_name); @@ -90,9 +121,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) #ifdef CONFIG_MEMORY_HOTREMOVE static int dev_dax_kmem_remove(struct dev_dax *dev_dax) { - int rc; + int i, success = 0; struct device *dev = &dev_dax->dev; - struct range range = dax_kmem_range(dev_dax); const char *res_name = dev_get_drvdata(dev); /* @@ -101,17 +131,31 @@ static int dev_dax_kmem_remove(struct dev_dax *dev_dax) * there is no way to hotremove this memory until reboot because device * unbind will succeed even if we return failure. */ - rc = remove_memory(dev_dax->target_node, range.start, range_len(&range)); - if (rc) { + for (i = 0; i < dev_dax->nr_range; i++) { + struct range range; + int rc; + + rc = dax_kmem_range(dev_dax, i, &range); + if (rc) + continue; + + rc = remove_memory(dev_dax->target_node, range.start, + range_len(&range)); + if (rc == 0) { + release_mem_region(range.start, range_len(&range)); + success++; + continue; + } any_hotremove_failed = true; - dev_err(dev, "%#llx-%#llx cannot be hotremoved until the next reboot\n", - range.start, range.end); - return rc; + dev_err(dev, + "mapping%d: %#llx-%#llx cannot be hotremoved until the next reboot\n", + i, range.start, range.end); } - /* Release and free dax resources */ - release_mem_region(range.start, range_len(&range)); - kfree(res_name); + if (success >= dev_dax->nr_range) { + kfree(res_name); + dev_set_drvdata(dev, NULL); + } return 0; } diff --git a/tools/testing/nvdimm/dax-dev.c b/tools/testing/nvdimm/dax-dev.c index 38d8e55c4a0d..fb342a8c98d3 100644 --- a/tools/testing/nvdimm/dax-dev.c +++ b/tools/testing/nvdimm/dax-dev.c @@ -9,11 +9,18 @@ phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, unsigned long size) { - struct range *range = &dev_dax->range; - phys_addr_t addr; + int i; - addr = pgoff * PAGE_SIZE + range->start; - if (addr >= range->start && addr <= range->end) { + for (i = 0; i < dev_dax->nr_range; i++) { + struct dev_dax_range *dax_range = &dev_dax->ranges[i]; + struct range *range = &dax_range->range; + unsigned long long pgoff_end; + phys_addr_t addr; + + pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1; + if (pgoff < dax_range->pgoff || pgoff > pgoff_end) + continue; + addr = PFN_PHYS(pgoff - dax_range->pgoff) + range->start; if (addr + size - 1 <= range->end) { if (get_nfit_res(addr)) { struct page *page; @@ -23,9 +30,10 @@ phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, page = vmalloc_to_page((void *)addr); return PFN_PHYS(page_to_pfn(page)); - } else - return addr; + } + return addr; } + break; } return -1; } From 0b07ce872a9eca1ff88c0eb7f6e92dde127d21ca Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 13 Oct 2020 16:50:45 -0700 Subject: [PATCH 047/181] device-dax: introduce 'mapping' devices MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In support of interrogating the physical address layout of a device with dis-contiguous ranges, introduce a sysfs directory with 'start', 'end', and 'page_offset' attributes. The alternative is trying to parse /proc/iomem, and that file will not reflect the extent layout until the device is enabled. Signed-off-by: Dan Williams Signed-off-by: Andrew Morton Cc: Joao Martins Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Ard Biesheuvel Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Bjorn Helgaas Cc: Borislav Petkov Cc: Boris Ostrovsky Cc: Brice Goglin Cc: Catalin Marinas Cc: Daniel Vetter Cc: Dave Hansen Cc: Dave Jiang Cc: David Airlie Cc: David Hildenbrand Cc: Greg Kroah-Hartman Cc: "H. Peter Anvin" Cc: Hulk Robot Cc: Ingo Molnar Cc: Ira Weiny Cc: Jason Gunthorpe Cc: Jason Yan Cc: Jeff Moyer Cc: "Jérôme Glisse" Cc: Jia He Cc: Jonathan Cameron Cc: Juergen Gross Cc: kernel test robot Cc: Michael Ellerman Cc: Mike Rapoport Cc: Paul Mackerras Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: "Rafael J. Wysocki" Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Vishal Verma Cc: Vivek Goyal Cc: Wei Yang Cc: Will Deacon Link: https://lkml.kernel.org/r/159643104819.4062302.13691281391423291589.stgit@dwillia2-desk3.amr.corp.intel.com Link: https://lkml.kernel.org/r/160106117446.30709.2751020815463722537.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- drivers/dax/bus.c | 191 +++++++++++++++++++++++++++++++++++++- drivers/dax/dax-private.h | 14 +++ 2 files changed, 203 insertions(+), 2 deletions(-) diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 06a789aba58a..005fa3e6d41c 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -579,6 +579,167 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id, } EXPORT_SYMBOL_GPL(alloc_dax_region); +static void dax_mapping_release(struct device *dev) +{ + struct dax_mapping *mapping = to_dax_mapping(dev); + struct dev_dax *dev_dax = to_dev_dax(dev->parent); + + ida_free(&dev_dax->ida, mapping->id); + kfree(mapping); +} + +static void unregister_dax_mapping(void *data) +{ + struct device *dev = data; + struct dax_mapping *mapping = to_dax_mapping(dev); + struct dev_dax *dev_dax = to_dev_dax(dev->parent); + struct dax_region *dax_region = dev_dax->region; + + dev_dbg(dev, "%s\n", __func__); + + device_lock_assert(dax_region->dev); + + dev_dax->ranges[mapping->range_id].mapping = NULL; + mapping->range_id = -1; + + device_del(dev); + put_device(dev); +} + +static struct dev_dax_range *get_dax_range(struct device *dev) +{ + struct dax_mapping *mapping = to_dax_mapping(dev); + struct dev_dax *dev_dax = to_dev_dax(dev->parent); + struct dax_region *dax_region = dev_dax->region; + + device_lock(dax_region->dev); + if (mapping->range_id < 0) { + device_unlock(dax_region->dev); + return NULL; + } + + return &dev_dax->ranges[mapping->range_id]; +} + +static void put_dax_range(struct dev_dax_range *dax_range) +{ + struct dax_mapping *mapping = dax_range->mapping; + struct dev_dax *dev_dax = to_dev_dax(mapping->dev.parent); + struct dax_region *dax_region = dev_dax->region; + + device_unlock(dax_region->dev); +} + +static ssize_t start_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct dev_dax_range *dax_range; + ssize_t rc; + + dax_range = get_dax_range(dev); + if (!dax_range) + return -ENXIO; + rc = sprintf(buf, "%#llx\n", dax_range->range.start); + put_dax_range(dax_range); + + return rc; +} +static DEVICE_ATTR(start, 0400, start_show, NULL); + +static ssize_t end_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct dev_dax_range *dax_range; + ssize_t rc; + + dax_range = get_dax_range(dev); + if (!dax_range) + return -ENXIO; + rc = sprintf(buf, "%#llx\n", dax_range->range.end); + put_dax_range(dax_range); + + return rc; +} +static DEVICE_ATTR(end, 0400, end_show, NULL); + +static ssize_t pgoff_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct dev_dax_range *dax_range; + ssize_t rc; + + dax_range = get_dax_range(dev); + if (!dax_range) + return -ENXIO; + rc = sprintf(buf, "%#lx\n", dax_range->pgoff); + put_dax_range(dax_range); + + return rc; +} +static DEVICE_ATTR(page_offset, 0400, pgoff_show, NULL); + +static struct attribute *dax_mapping_attributes[] = { + &dev_attr_start.attr, + &dev_attr_end.attr, + &dev_attr_page_offset.attr, + NULL, +}; + +static const struct attribute_group dax_mapping_attribute_group = { + .attrs = dax_mapping_attributes, +}; + +static const struct attribute_group *dax_mapping_attribute_groups[] = { + &dax_mapping_attribute_group, + NULL, +}; + +static struct device_type dax_mapping_type = { + .release = dax_mapping_release, + .groups = dax_mapping_attribute_groups, +}; + +static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id) +{ + struct dax_region *dax_region = dev_dax->region; + struct dax_mapping *mapping; + struct device *dev; + int rc; + + device_lock_assert(dax_region->dev); + + if (dev_WARN_ONCE(&dev_dax->dev, !dax_region->dev->driver, + "region disabled\n")) + return -ENXIO; + + mapping = kzalloc(sizeof(*mapping), GFP_KERNEL); + if (!mapping) + return -ENOMEM; + mapping->range_id = range_id; + mapping->id = ida_alloc(&dev_dax->ida, GFP_KERNEL); + if (mapping->id < 0) { + kfree(mapping); + return -ENOMEM; + } + dev_dax->ranges[range_id].mapping = mapping; + dev = &mapping->dev; + device_initialize(dev); + dev->parent = &dev_dax->dev; + dev->type = &dax_mapping_type; + dev_set_name(dev, "mapping%d", mapping->id); + rc = device_add(dev); + if (rc) { + put_device(dev); + return rc; + } + + rc = devm_add_action_or_reset(dax_region->dev, unregister_dax_mapping, + dev); + if (rc) + return rc; + return 0; +} + static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start, resource_size_t size) { @@ -588,7 +749,7 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start, struct dev_dax_range *ranges; unsigned long pgoff = 0; struct resource *alloc; - int i; + int i, rc; device_lock_assert(dax_region->dev); @@ -633,6 +794,22 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start, dev_dbg(dev, "alloc range[%d]: %pa:%pa\n", dev_dax->nr_range - 1, &alloc->start, &alloc->end); + /* + * A dev_dax instance must be registered before mapping device + * children can be added. Defer to devm_create_dev_dax() to add + * the initial mapping device. + */ + if (!device_is_registered(&dev_dax->dev)) + return 0; + + rc = devm_register_dax_mapping(dev_dax, dev_dax->nr_range - 1); + if (rc) { + dev_dbg(dev, "delete range[%d]: %pa:%pa\n", dev_dax->nr_range - 1, + &alloc->start, &alloc->end); + dev_dax->nr_range--; + __release_region(res, alloc->start, resource_size(alloc)); + return rc; + } return 0; } @@ -701,11 +878,14 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size) for (i = dev_dax->nr_range - 1; i >= 0; i--) { struct range *range = &dev_dax->ranges[i].range; + struct dax_mapping *mapping = dev_dax->ranges[i].mapping; struct resource *adjust = NULL, *res; resource_size_t shrink; shrink = min_t(u64, to_shrink, range_len(range)); if (shrink >= range_len(range)) { + devm_release_action(dax_region->dev, + unregister_dax_mapping, &mapping->dev); __release_region(&dax_region->res, range->start, range_len(range)); dev_dax->nr_range--; @@ -1036,9 +1216,9 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data) /* a device_dax instance is dead while the driver is not attached */ kill_dax(dax_dev); - /* from here on we're committed to teardown via dev_dax_release() */ dev_dax->dax_dev = dax_dev; dev_dax->target_node = dax_region->target_node; + ida_init(&dev_dax->ida); kref_get(&dax_region->kref); inode = dax_inode(dax_dev); @@ -1061,6 +1241,13 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data) if (rc) return ERR_PTR(rc); + /* register mapping device for the initial allocation range */ + if (dev_dax->nr_range && range_len(&dev_dax->ranges[0].range)) { + rc = devm_register_dax_mapping(dev_dax, 0); + if (rc) + return ERR_PTR(rc); + } + return dev_dax; err_alloc_dax: diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index f863287107fd..13780f62b95e 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -40,6 +40,12 @@ struct dax_region { struct device *youngest; }; +struct dax_mapping { + struct device dev; + int range_id; + int id; +}; + /** * struct dev_dax - instance data for a subdivision of a dax region, and * data while the device is activated in the driver. @@ -47,6 +53,7 @@ struct dax_region { * @dax_dev - core dax functionality * @target_node: effective numa node if dev_dax memory range is onlined * @id: ida allocated id + * @ida: mapping id allocator * @dev - device core * @pgmap - pgmap for memmap setup / lifetime (driver owned) * @nr_range: size of @ranges @@ -57,12 +64,14 @@ struct dev_dax { struct dax_device *dax_dev; int target_node; int id; + struct ida ida; struct device dev; struct dev_pagemap *pgmap; int nr_range; struct dev_dax_range { unsigned long pgoff; struct range range; + struct dax_mapping *mapping; } *ranges; }; @@ -70,4 +79,9 @@ static inline struct dev_dax *to_dev_dax(struct device *dev) { return container_of(dev, struct dev_dax, dev); } + +static inline struct dax_mapping *to_dax_mapping(struct device *dev) +{ + return container_of(dev, struct dax_mapping, dev); +} #endif From 33cf94d7176672174042bea0566065f356e2caab Mon Sep 17 00:00:00 2001 From: Joao Martins Date: Tue, 13 Oct 2020 16:50:50 -0700 Subject: [PATCH 048/181] device-dax: make align a per-device property MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Introduce @align to struct dev_dax. When creating a new device, we still initialize to the default dax_region @align. Child devices belonging to a region may wish to keep a different alignment property instead of a global region-defined one. Signed-off-by: Joao Martins Signed-off-by: Dan Williams Signed-off-by: Andrew Morton Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Ard Biesheuvel Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Bjorn Helgaas Cc: Borislav Petkov Cc: Boris Ostrovsky Cc: Brice Goglin Cc: Catalin Marinas Cc: Daniel Vetter Cc: Dave Hansen Cc: Dave Jiang Cc: David Airlie Cc: David Hildenbrand Cc: Greg Kroah-Hartman Cc: "H. Peter Anvin" Cc: Hulk Robot Cc: Ingo Molnar Cc: Ira Weiny Cc: Jason Gunthorpe Cc: Jason Yan Cc: Jeff Moyer Cc: "Jérôme Glisse" Cc: Jia He Cc: Jonathan Cameron Cc: Juergen Gross Cc: kernel test robot Cc: Michael Ellerman Cc: Mike Rapoport Cc: Paul Mackerras Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: "Rafael J. Wysocki" Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Vishal Verma Cc: Vivek Goyal Cc: Wei Yang Cc: Will Deacon Link: https://lkml.kernel.org/r/159643105377.4062302.4159447829955683131.stgit@dwillia2-desk3.amr.corp.intel.com Link: https://lore.kernel.org/r/20200716172913.19658-2-joao.m.martins@oracle.com Link: https://lkml.kernel.org/r/160106117957.30709.1142303024324655705.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- drivers/dax/bus.c | 1 + drivers/dax/dax-private.h | 3 +++ drivers/dax/device.c | 41 ++++++++++++++------------------------- 3 files changed, 19 insertions(+), 26 deletions(-) diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 005fa3e6d41c..852899084d13 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -1218,6 +1218,7 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data) dev_dax->dax_dev = dax_dev; dev_dax->target_node = dax_region->target_node; + dev_dax->align = dax_region->align; ida_init(&dev_dax->ida); kref_get(&dax_region->kref); diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index 13780f62b95e..5fd3a26cfcea 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -62,6 +62,7 @@ struct dax_mapping { struct dev_dax { struct dax_region *region; struct dax_device *dax_dev; + unsigned int align; int target_node; int id; struct ida ida; @@ -84,4 +85,6 @@ static inline struct dax_mapping *to_dax_mapping(struct device *dev) { return container_of(dev, struct dax_mapping, dev); } + +phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, unsigned long size); #endif diff --git a/drivers/dax/device.c b/drivers/dax/device.c index bf389712a20b..25e0b84a4296 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -17,7 +17,6 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma, const char *func) { - struct dax_region *dax_region = dev_dax->region; struct device *dev = &dev_dax->dev; unsigned long mask; @@ -32,7 +31,7 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma, return -EINVAL; } - mask = dax_region->align - 1; + mask = dev_dax->align - 1; if (vma->vm_start & mask || vma->vm_end & mask) { dev_info_ratelimited(dev, "%s: %s: fail, unaligned vma (%#lx - %#lx, %#lx)\n", @@ -78,21 +77,19 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax, struct vm_fault *vmf, pfn_t *pfn) { struct device *dev = &dev_dax->dev; - struct dax_region *dax_region; phys_addr_t phys; unsigned int fault_size = PAGE_SIZE; if (check_vma(dev_dax, vmf->vma, __func__)) return VM_FAULT_SIGBUS; - dax_region = dev_dax->region; - if (dax_region->align > PAGE_SIZE) { + if (dev_dax->align > PAGE_SIZE) { dev_dbg(dev, "alignment (%#x) > fault size (%#x)\n", - dax_region->align, fault_size); + dev_dax->align, fault_size); return VM_FAULT_SIGBUS; } - if (fault_size != dax_region->align) + if (fault_size != dev_dax->align) return VM_FAULT_SIGBUS; phys = dax_pgoff_to_phys(dev_dax, vmf->pgoff, PAGE_SIZE); @@ -111,7 +108,6 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax, { unsigned long pmd_addr = vmf->address & PMD_MASK; struct device *dev = &dev_dax->dev; - struct dax_region *dax_region; phys_addr_t phys; pgoff_t pgoff; unsigned int fault_size = PMD_SIZE; @@ -119,16 +115,15 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax, if (check_vma(dev_dax, vmf->vma, __func__)) return VM_FAULT_SIGBUS; - dax_region = dev_dax->region; - if (dax_region->align > PMD_SIZE) { + if (dev_dax->align > PMD_SIZE) { dev_dbg(dev, "alignment (%#x) > fault size (%#x)\n", - dax_region->align, fault_size); + dev_dax->align, fault_size); return VM_FAULT_SIGBUS; } - if (fault_size < dax_region->align) + if (fault_size < dev_dax->align) return VM_FAULT_SIGBUS; - else if (fault_size > dax_region->align) + else if (fault_size > dev_dax->align) return VM_FAULT_FALLBACK; /* if we are outside of the VMA */ @@ -154,7 +149,6 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax, { unsigned long pud_addr = vmf->address & PUD_MASK; struct device *dev = &dev_dax->dev; - struct dax_region *dax_region; phys_addr_t phys; pgoff_t pgoff; unsigned int fault_size = PUD_SIZE; @@ -163,16 +157,15 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax, if (check_vma(dev_dax, vmf->vma, __func__)) return VM_FAULT_SIGBUS; - dax_region = dev_dax->region; - if (dax_region->align > PUD_SIZE) { + if (dev_dax->align > PUD_SIZE) { dev_dbg(dev, "alignment (%#x) > fault size (%#x)\n", - dax_region->align, fault_size); + dev_dax->align, fault_size); return VM_FAULT_SIGBUS; } - if (fault_size < dax_region->align) + if (fault_size < dev_dax->align) return VM_FAULT_SIGBUS; - else if (fault_size > dax_region->align) + else if (fault_size > dev_dax->align) return VM_FAULT_FALLBACK; /* if we are outside of the VMA */ @@ -267,9 +260,8 @@ static int dev_dax_split(struct vm_area_struct *vma, unsigned long addr) { struct file *filp = vma->vm_file; struct dev_dax *dev_dax = filp->private_data; - struct dax_region *dax_region = dev_dax->region; - if (!IS_ALIGNED(addr, dax_region->align)) + if (!IS_ALIGNED(addr, dev_dax->align)) return -EINVAL; return 0; } @@ -278,9 +270,8 @@ static unsigned long dev_dax_pagesize(struct vm_area_struct *vma) { struct file *filp = vma->vm_file; struct dev_dax *dev_dax = filp->private_data; - struct dax_region *dax_region = dev_dax->region; - return dax_region->align; + return dev_dax->align; } static const struct vm_operations_struct dax_vm_ops = { @@ -319,13 +310,11 @@ static unsigned long dax_get_unmapped_area(struct file *filp, { unsigned long off, off_end, off_align, len_align, addr_align, align; struct dev_dax *dev_dax = filp ? filp->private_data : NULL; - struct dax_region *dax_region; if (!dev_dax || addr) goto out; - dax_region = dev_dax->region; - align = dax_region->align; + align = dev_dax->align; off = pgoff << PAGE_SHIFT; off_end = off + len; off_align = round_up(off, align); From 6d82120f41561426dd67c86380d779b4599d070d Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 13 Oct 2020 16:50:55 -0700 Subject: [PATCH 049/181] device-dax: add an 'align' attribute MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Introduce a device align attribute. While doing so, rename the region align attribute to be more explicitly named as so, but keep it named as @align to retain the API for tools like daxctl. Changes on align may not always be valid, when say certain mappings were created with 2M and then we switch to 1G. So, we validate all ranges against the new value being attempted, post resizing. Signed-off-by: Joao Martins Signed-off-by: Dan Williams Signed-off-by: Andrew Morton Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Ard Biesheuvel Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Bjorn Helgaas Cc: Borislav Petkov Cc: Boris Ostrovsky Cc: Brice Goglin Cc: Catalin Marinas Cc: Daniel Vetter Cc: Dave Hansen Cc: Dave Jiang Cc: David Airlie Cc: David Hildenbrand Cc: Greg Kroah-Hartman Cc: "H. Peter Anvin" Cc: Hulk Robot Cc: Ingo Molnar Cc: Ira Weiny Cc: Jason Gunthorpe Cc: Jason Yan Cc: Jeff Moyer Cc: "Jérôme Glisse" Cc: Jia He Cc: Jonathan Cameron Cc: Juergen Gross Cc: kernel test robot Cc: Michael Ellerman Cc: Mike Rapoport Cc: Paul Mackerras Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: "Rafael J. Wysocki" Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Vishal Verma Cc: Vivek Goyal Cc: Wei Yang Cc: Will Deacon Link: https://lkml.kernel.org/r/159643105944.4062302.3131761052969132784.stgit@dwillia2-desk3.amr.corp.intel.com Link: https://lore.kernel.org/r/20200716172913.19658-3-joao.m.martins@oracle.com Link: https://lkml.kernel.org/r/160106118486.30709.13012322227204800596.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- drivers/dax/bus.c | 93 ++++++++++++++++++++++++++++++++++----- drivers/dax/dax-private.h | 18 ++++++++ 2 files changed, 101 insertions(+), 10 deletions(-) diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 852899084d13..0ac4a9c0fd18 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -230,14 +230,15 @@ static ssize_t region_size_show(struct device *dev, static struct device_attribute dev_attr_region_size = __ATTR(size, 0444, region_size_show, NULL); -static ssize_t align_show(struct device *dev, +static ssize_t region_align_show(struct device *dev, struct device_attribute *attr, char *buf) { struct dax_region *dax_region = dev_get_drvdata(dev); return sprintf(buf, "%u\n", dax_region->align); } -static DEVICE_ATTR_RO(align); +static struct device_attribute dev_attr_region_align = + __ATTR(align, 0400, region_align_show, NULL); #define for_each_dax_region_resource(dax_region, res) \ for (res = (dax_region)->res.child; res; res = res->sibling) @@ -488,7 +489,7 @@ static umode_t dax_region_visible(struct kobject *kobj, struct attribute *a, static struct attribute *dax_region_attributes[] = { &dev_attr_available_size.attr, &dev_attr_region_size.attr, - &dev_attr_align.attr, + &dev_attr_region_align.attr, &dev_attr_create.attr, &dev_attr_seed.attr, &dev_attr_delete.attr, @@ -858,15 +859,13 @@ static ssize_t size_show(struct device *dev, return sprintf(buf, "%llu\n", size); } -static bool alloc_is_aligned(struct dax_region *dax_region, - resource_size_t size) +static bool alloc_is_aligned(struct dev_dax *dev_dax, resource_size_t size) { /* * The minimum mapping granularity for a device instance is a * single subsection, unless the arch says otherwise. */ - return IS_ALIGNED(size, max_t(unsigned long, dax_region->align, - memremap_compat_align())); + return IS_ALIGNED(size, max_t(unsigned long, dev_dax->align, memremap_compat_align())); } static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size) @@ -961,7 +960,7 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region, return dev_dax_shrink(dev_dax, size); to_alloc = size - dev_size; - if (dev_WARN_ONCE(dev, !alloc_is_aligned(dax_region, to_alloc), + if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc), "resize of %pa misaligned\n", &to_alloc)) return -ENXIO; @@ -1025,7 +1024,7 @@ static ssize_t size_store(struct device *dev, struct device_attribute *attr, if (rc) return rc; - if (!alloc_is_aligned(dax_region, val)) { + if (!alloc_is_aligned(dev_dax, val)) { dev_dbg(dev, "%s: size: %lld misaligned\n", __func__, val); return -EINVAL; } @@ -1044,6 +1043,78 @@ static ssize_t size_store(struct device *dev, struct device_attribute *attr, } static DEVICE_ATTR_RW(size); +static ssize_t align_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct dev_dax *dev_dax = to_dev_dax(dev); + + return sprintf(buf, "%d\n", dev_dax->align); +} + +static ssize_t dev_dax_validate_align(struct dev_dax *dev_dax) +{ + resource_size_t dev_size = dev_dax_size(dev_dax); + struct device *dev = &dev_dax->dev; + int i; + + if (dev_size > 0 && !alloc_is_aligned(dev_dax, dev_size)) { + dev_dbg(dev, "%s: align %u invalid for size %pa\n", + __func__, dev_dax->align, &dev_size); + return -EINVAL; + } + + for (i = 0; i < dev_dax->nr_range; i++) { + size_t len = range_len(&dev_dax->ranges[i].range); + + if (!alloc_is_aligned(dev_dax, len)) { + dev_dbg(dev, "%s: align %u invalid for range %d\n", + __func__, dev_dax->align, i); + return -EINVAL; + } + } + + return 0; +} + +static ssize_t align_store(struct device *dev, struct device_attribute *attr, + const char *buf, size_t len) +{ + struct dev_dax *dev_dax = to_dev_dax(dev); + struct dax_region *dax_region = dev_dax->region; + unsigned long val, align_save; + ssize_t rc; + + rc = kstrtoul(buf, 0, &val); + if (rc) + return -ENXIO; + + if (!dax_align_valid(val)) + return -EINVAL; + + device_lock(dax_region->dev); + if (!dax_region->dev->driver) { + device_unlock(dax_region->dev); + return -ENXIO; + } + + device_lock(dev); + if (dev->driver) { + rc = -EBUSY; + goto out_unlock; + } + + align_save = dev_dax->align; + dev_dax->align = val; + rc = dev_dax_validate_align(dev_dax); + if (rc) + dev_dax->align = align_save; +out_unlock: + device_unlock(dev); + device_unlock(dax_region->dev); + return rc == 0 ? len : rc; +} +static DEVICE_ATTR_RW(align); + static int dev_dax_target_node(struct dev_dax *dev_dax) { struct dax_region *dax_region = dev_dax->region; @@ -1104,7 +1175,8 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n) return 0; if (a == &dev_attr_numa_node.attr && !IS_ENABLED(CONFIG_NUMA)) return 0; - if (a == &dev_attr_size.attr && is_static(dax_region)) + if ((a == &dev_attr_align.attr || + a == &dev_attr_size.attr) && is_static(dax_region)) return 0444; return a->mode; } @@ -1113,6 +1185,7 @@ static struct attribute *dev_dax_attributes[] = { &dev_attr_modalias.attr, &dev_attr_size.attr, &dev_attr_target_node.attr, + &dev_attr_align.attr, &dev_attr_resource.attr, &dev_attr_numa_node.attr, NULL, diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index 5fd3a26cfcea..1c974b7caae6 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -87,4 +87,22 @@ static inline struct dax_mapping *to_dax_mapping(struct device *dev) } phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, unsigned long size); + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static inline bool dax_align_valid(unsigned long align) +{ + if (align == PUD_SIZE && IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)) + return true; + if (align == PMD_SIZE && has_transparent_hugepage()) + return true; + if (align == PAGE_SIZE) + return true; + return false; +} +#else +static inline bool dax_align_valid(unsigned long align) +{ + return align == PAGE_SIZE; +} +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #endif From 5a505603a917854fd68d2c25e86e1fb96c845ced Mon Sep 17 00:00:00 2001 From: Joao Martins Date: Tue, 13 Oct 2020 16:51:00 -0700 Subject: [PATCH 050/181] dax/hmem: introduce dax_hmem.region_idle parameter MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Introduce a new module parameter for dax_hmem which initializes all region devices as free, rather than allocating a pagemap for the region by default. All hmem devices created with dax_hmem.region_idle=1 will have full available size for creating dynamic dax devices. Signed-off-by: Joao Martins Signed-off-by: Dan Williams Signed-off-by: Andrew Morton Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Ard Biesheuvel Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Bjorn Helgaas Cc: Borislav Petkov Cc: Boris Ostrovsky Cc: Brice Goglin Cc: Catalin Marinas Cc: Daniel Vetter Cc: Dave Hansen Cc: Dave Jiang Cc: David Airlie Cc: David Hildenbrand Cc: Greg Kroah-Hartman Cc: "H. Peter Anvin" Cc: Hulk Robot Cc: Ingo Molnar Cc: Ira Weiny Cc: Jason Gunthorpe Cc: Jason Yan Cc: Jeff Moyer Cc: "Jérôme Glisse" Cc: Jia He Cc: Jonathan Cameron Cc: Juergen Gross Cc: kernel test robot Cc: Michael Ellerman Cc: Mike Rapoport Cc: Paul Mackerras Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: "Rafael J. Wysocki" Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Vishal Verma Cc: Vivek Goyal Cc: Wei Yang Cc: Will Deacon Link: https://lkml.kernel.org/r/159643106460.4062302.5868522341307530091.stgit@dwillia2-desk3.amr.corp.intel.com Link: https://lore.kernel.org/r/20200716172913.19658-4-joao.m.martins@oracle.com Link: https://lkml.kernel.org/r/160106119033.30709.11249962152222193448.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- drivers/dax/hmem/hmem.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c index 1a3347bb6143..1bf040dbc834 100644 --- a/drivers/dax/hmem/hmem.c +++ b/drivers/dax/hmem/hmem.c @@ -5,6 +5,9 @@ #include #include "../bus.h" +static bool region_idle; +module_param_named(region_idle, region_idle, bool, 0644); + static int dax_hmem_probe(struct platform_device *pdev) { struct device *dev = &pdev->dev; @@ -30,7 +33,7 @@ static int dax_hmem_probe(struct platform_device *pdev) data = (struct dev_dax_data) { .dax_region = dax_region, .id = -1, - .size = resource_size(res), + .size = region_idle ? 0 : resource_size(res), }; dev_dax = devm_create_dev_dax(&data); if (IS_ERR(dev_dax)) From 8490e2e25b5a1f9591145f0e3bbd09b99409be76 Mon Sep 17 00:00:00 2001 From: Joao Martins Date: Tue, 13 Oct 2020 16:51:06 -0700 Subject: [PATCH 051/181] device-dax: add a range mapping allocation attribute MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add a sysfs attribute which denotes a range from the dax region to be allocated. It's an write only @mapping sysfs attribute in the format of '-' to allocate a range. @start and @end use hexadecimal values and the @pgoff is implicitly ordered wrt to previous writes to @mapping sysfs e.g. a write of a range of length 1G the pgoff is 0..1G(-4K), a second write will use @pgoff for 1G+4K... This range mapping interface is useful for: 1) Application which want to implement its own allocation logic, and thus pick the desired ranges from dax_region. 2) For use cases like VMM fast restart[0] where after kexec we want to the same gpa<->phys mappings (as originally created before kexec). [0] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf Signed-off-by: Joao Martins Signed-off-by: Dan Williams Signed-off-by: Andrew Morton Cc: Andy Lutomirski Cc: Ard Biesheuvel Cc: Ard Biesheuvel Cc: Benjamin Herrenschmidt Cc: Ben Skeggs Cc: Bjorn Helgaas Cc: Borislav Petkov Cc: Boris Ostrovsky Cc: Brice Goglin Cc: Catalin Marinas Cc: Daniel Vetter Cc: Dave Hansen Cc: Dave Jiang Cc: David Airlie Cc: David Hildenbrand Cc: Greg Kroah-Hartman Cc: "H. Peter Anvin" Cc: Hulk Robot Cc: Ingo Molnar Cc: Ira Weiny Cc: Jason Gunthorpe Cc: Jason Yan Cc: Jeff Moyer Cc: "Jérôme Glisse" Cc: Jia He Cc: Jonathan Cameron Cc: Juergen Gross Cc: kernel test robot Cc: Michael Ellerman Cc: Mike Rapoport Cc: Paul Mackerras Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: "Rafael J. Wysocki" Cc: Randy Dunlap Cc: Stefano Stabellini Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Vishal Verma Cc: Vivek Goyal Cc: Wei Yang Cc: Will Deacon Link: https://lkml.kernel.org/r/159643106970.4062302.10402616567780784722.stgit@dwillia2-desk3.amr.corp.intel.com Link: https://lore.kernel.org/r/20200716172913.19658-5-joao.m.martins@oracle.com Link: https://lkml.kernel.org/r/160106119570.30709.4548889722645210610.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds --- drivers/dax/bus.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 64 insertions(+) diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 0ac4a9c0fd18..27513d311242 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -1043,6 +1043,67 @@ static ssize_t size_store(struct device *dev, struct device_attribute *attr, } static DEVICE_ATTR_RW(size); +static ssize_t range_parse(const char *opt, size_t len, struct range *range) +{ + unsigned long long addr = 0; + char *start, *end, *str; + ssize_t rc = EINVAL; + + str = kstrdup(opt, GFP_KERNEL); + if (!str) + return rc; + + end = str; + start = strsep(&end, "-"); + if (!start || !end) + goto err; + + rc = kstrtoull(start, 16, &addr); + if (rc) + goto err; + range->start = addr; + + rc = kstrtoull(end, 16, &addr); + if (rc) + goto err; + range->end = addr; + +err: + kfree(str); + return rc; +} + +static ssize_t mapping_store(struct device *dev, struct device_attribute *attr, + const char *buf, size_t len) +{ + struct dev_dax *dev_dax = to_dev_dax(dev); + struct dax_region *dax_region = dev_dax->region; + size_t to_alloc; + struct range r; + ssize_t rc; + + rc = range_parse(buf, len, &r); + if (rc) + return rc; + + rc = -ENXIO; + device_lock(dax_region->dev); + if (!dax_region->dev->driver) { + device_unlock(dax_region->dev); + return rc; + } + device_lock(dev); + + to_alloc = range_len(&r); + if (alloc_is_aligned(dev_dax, to_alloc)) + rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc); + device_unlock(dev); + device_unlock(dax_region->dev); + + return rc == 0 ? len : rc; +} +static DEVICE_ATTR_WO(mapping); + static ssize_t align_show(struct device *dev, struct device_attribute *attr, char *buf) { @@ -1175,6 +1236,8 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n) return 0; if (a == &dev_attr_numa_node.attr && !IS_ENABLED(CONFIG_NUMA)) return 0; + if (a == &dev_attr_mapping.attr && is_static(dax_region)) + return 0; if ((a == &dev_attr_align.attr || a == &dev_attr_size.attr) && is_static(dax_region)) return 0444; @@ -1184,6 +1247,7 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n) static struct attribute *dev_dax_attributes[] = { &dev_attr_modalias.attr, &dev_attr_size.attr, + &dev_attr_mapping.attr, &dev_attr_target_node.attr, &dev_attr_align.attr, &dev_attr_resource.attr, From 853322a671047f9300b9ccc2c358af2859bca2c2 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 13 Oct 2020 16:51:10 -0700 Subject: [PATCH 052/181] mm/debug.c: do not dereference i_ino blindly __dump_page() checks i_dentry is fetchable and i_ino is earlier in the struct than i_ino, so it ought to work fine, but it's possible that struct randomisation has reordered i_ino after i_dentry and the pointer is just wild enough that i_dentry is fetchable and i_ino isn't. Also print the inode number if the dentry is invalid. Reported-by: Vlastimil Babka Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton Reviewed-by: John Hubbard Reviewed-by: Mike Rapoport Link: https://lkml.kernel.org/r/20200819185710.28180-1-willy@infradead.org Signed-off-by: Linus Torvalds --- mm/debug.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/mm/debug.c b/mm/debug.c index ca8d1cacdecc..2a767865145c 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -120,6 +120,7 @@ void __dump_page(struct page *page, const char *reason) struct hlist_node *dentry_first; struct dentry *dentry_ptr; struct dentry dentry; + unsigned long ino; /* * mapping can be invalid pointer and we don't want to crash @@ -136,21 +137,22 @@ void __dump_page(struct page *page, const char *reason) goto out_mapping; } - if (get_kernel_nofault(dentry_first, &host->i_dentry.first)) { + if (get_kernel_nofault(dentry_first, &host->i_dentry.first) || + get_kernel_nofault(ino, &host->i_ino)) { pr_warn("aops:%ps with invalid host inode %px\n", a_ops, host); goto out_mapping; } if (!dentry_first) { - pr_warn("aops:%ps ino:%lx\n", a_ops, host->i_ino); + pr_warn("aops:%ps ino:%lx\n", a_ops, ino); goto out_mapping; } dentry_ptr = container_of(dentry_first, struct dentry, d_u.d_alias); if (get_kernel_nofault(dentry, dentry_ptr)) { - pr_warn("aops:%ps with invalid dentry %px\n", a_ops, - dentry_ptr); + pr_warn("aops:%ps ino:%lx with invalid dentry %px\n", + a_ops, ino, dentry_ptr); } else { /* * if dentry is corrupted, the %pd handler may still @@ -158,7 +160,7 @@ void __dump_page(struct page *page, const char *reason) * corrupted struct page */ pr_warn("aops:%ps ino:%lx dentry name:\"%pd\"\n", - a_ops, host->i_ino, &dentry); + a_ops, ino, &dentry); } } out_mapping: From bac3cf4d01d43b587c873360dc8c84e3b570b344 Mon Sep 17 00:00:00 2001 From: John Hubbard Date: Tue, 13 Oct 2020 16:51:14 -0700 Subject: [PATCH 053/181] mm, dump_page: rename head_mapcount() --> head_compound_mapcount() Rename head_pincount() --> head_compound_pincount(). These names are more accurate (or less misleading) than the original ones. Signed-off-by: John Hubbard Signed-off-by: Andrew Morton Cc: Qian Cai Cc: Matthew Wilcox Cc: Vlastimil Babka Cc: Kirill A. Shutemov Cc: Mike Rapoport Cc: William Kucharski Link: https://lkml.kernel.org/r/20200807183358.105097-1-jhubbard@nvidia.com Signed-off-by: Linus Torvalds --- include/linux/mm.h | 8 ++++---- mm/debug.c | 6 +++--- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 13dc9b9ccf8e..9cc0894e7d61 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -791,7 +791,7 @@ static inline void *kvcalloc(size_t n, size_t size, gfp_t flags) extern void kvfree(const void *addr); extern void kvfree_sensitive(const void *addr, size_t len); -static inline int head_mapcount(struct page *head) +static inline int head_compound_mapcount(struct page *head) { return atomic_read(compound_mapcount_ptr(head)) + 1; } @@ -805,7 +805,7 @@ static inline int compound_mapcount(struct page *page) { VM_BUG_ON_PAGE(!PageCompound(page), page); page = compound_head(page); - return head_mapcount(page); + return head_compound_mapcount(page); } /* @@ -918,7 +918,7 @@ static inline bool hpage_pincount_available(struct page *page) return PageCompound(page) && compound_order(page) > 1; } -static inline int head_pincount(struct page *head) +static inline int head_compound_pincount(struct page *head) { return atomic_read(compound_pincount_ptr(head)); } @@ -927,7 +927,7 @@ static inline int compound_pincount(struct page *page) { VM_BUG_ON_PAGE(!hpage_pincount_available(page), page); page = compound_head(page); - return head_pincount(page); + return head_compound_pincount(page); } static inline void set_compound_order(struct page *page, unsigned int order) diff --git a/mm/debug.c b/mm/debug.c index 2a767865145c..ccca576b2899 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -102,12 +102,12 @@ void __dump_page(struct page *page, const char *reason) if (hpage_pincount_available(page)) { pr_warn("head:%p order:%u compound_mapcount:%d compound_pincount:%d\n", head, compound_order(head), - head_mapcount(head), - head_pincount(head)); + head_compound_mapcount(head), + head_compound_pincount(head)); } else { pr_warn("head:%p order:%u compound_mapcount:%d\n", head, compound_order(head), - head_mapcount(head)); + head_compound_mapcount(head)); } } if (PageKsm(page)) From 61ef1865570452801f6e554a668e049c2e25c1fd Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 13 Oct 2020 16:51:17 -0700 Subject: [PATCH 054/181] mm: factor find_get_incore_page out of mincore_page Patch series "Return head pages from find_*_entry", v2. This patch series started out as part of the THP patch set, but it has some nice effects along the way and it seems worth splitting it out and submitting separately. Currently find_get_entry() and find_lock_entry() return the page corresponding to the requested index, but the first thing most callers do is find the head page, which we just threw away. As part of auditing all the callers, I found some misuses of the APIs and some plain inefficiencies that I've fixed. The diffstat is unflattering, but I added more kernel-doc and a new wrapper. This patch (of 8); Provide this functionality from the swap cache. It's useful for more than just mincore(). Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton Cc: Hugh Dickins Cc: William Kucharski Cc: Jani Nikula Cc: Alexey Dobriyan Cc: Johannes Weiner Cc: Chris Wilson Cc: Matthew Auld Cc: Huang Ying Link: https://lkml.kernel.org/r/20200910183318.20139-1-willy@infradead.org Link: https://lkml.kernel.org/r/20200910183318.20139-2-willy@infradead.org Signed-off-by: Linus Torvalds --- include/linux/swap.h | 7 +++++++ mm/mincore.c | 28 ++-------------------------- mm/swap_state.c | 32 ++++++++++++++++++++++++++++++++ 3 files changed, 41 insertions(+), 26 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 4340a7b6e7a1..23c6e43a956d 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -427,6 +427,7 @@ extern void free_pages_and_swap_cache(struct page **, int); extern struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma, unsigned long addr); +struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index); extern struct page *read_swap_cache_async(swp_entry_t, gfp_t, struct vm_area_struct *vma, unsigned long addr, bool do_poll); @@ -570,6 +571,12 @@ static inline struct page *lookup_swap_cache(swp_entry_t swp, return NULL; } +static inline +struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index) +{ + return find_get_page(mapping, index); +} + static inline int add_to_swap(struct page *page) { return 0; diff --git a/mm/mincore.c b/mm/mincore.c index 453ff112470f..02db1a834021 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -48,7 +48,7 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, * and is up to date; i.e. that no page-in operation would be required * at this time if an application were to map and access this page. */ -static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff) +static unsigned char mincore_page(struct address_space *mapping, pgoff_t index) { unsigned char present = 0; struct page *page; @@ -59,31 +59,7 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff) * any other file mapping (ie. marked !present and faulted in with * tmpfs's .fault). So swapped out tmpfs mappings are tested here. */ -#ifdef CONFIG_SWAP - if (shmem_mapping(mapping)) { - page = find_get_entry(mapping, pgoff); - /* - * shmem/tmpfs may return swap: account for swapcache - * page too. - */ - if (xa_is_value(page)) { - swp_entry_t swp = radix_to_swp_entry(page); - struct swap_info_struct *si; - - /* Prevent swap device to being swapoff under us */ - si = get_swap_device(swp); - if (si) { - page = find_get_page(swap_address_space(swp), - swp_offset(swp)); - put_swap_device(si); - } else - page = NULL; - } - } else - page = find_get_page(mapping, pgoff); -#else - page = find_get_page(mapping, pgoff); -#endif + page = find_get_incore_page(mapping, index); if (page) { present = PageUptodate(page); put_page(page); diff --git a/mm/swap_state.c b/mm/swap_state.c index c16eebb81d8b..c79e2242dd04 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -21,6 +21,7 @@ #include #include #include +#include #include "internal.h" /* @@ -414,6 +415,37 @@ struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma, return page; } +/** + * find_get_incore_page - Find and get a page from the page or swap caches. + * @mapping: The address_space to search. + * @index: The page cache index. + * + * This differs from find_get_page() in that it will also look for the + * page in the swap cache. + * + * Return: The found page or %NULL. + */ +struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index) +{ + swp_entry_t swp; + struct swap_info_struct *si; + struct page *page = find_get_entry(mapping, index); + + if (!xa_is_value(page)) + return page; + if (!shmem_mapping(mapping)) + return NULL; + + swp = radix_to_swp_entry(page); + /* Prevent swapoff from happening to us */ + si = get_swap_device(swp); + if (!si) + return NULL; + page = find_get_page(swap_address_space(swp), swp_offset(swp)); + put_swap_device(si); + return page; +} + struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, struct vm_area_struct *vma, unsigned long addr, bool *new_page_allocated) From f5df8635c5a3c912919c91be64aa198554b0f9ed Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 13 Oct 2020 16:51:21 -0700 Subject: [PATCH 055/181] mm: use find_get_incore_page in memcontrol The current code does not protect against swapoff of the underlying swap device, so this is a bug fix as well as a worthwhile reduction in code complexity. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton Cc: Alexey Dobriyan Cc: Chris Wilson Cc: Huang Ying Cc: Hugh Dickins Cc: Jani Nikula Cc: Johannes Weiner Cc: Matthew Auld Cc: William Kucharski Link: https://lkml.kernel.org/r/20200910183318.20139-3-willy@infradead.org Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 24 ++---------------------- 1 file changed, 2 insertions(+), 22 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5c1983c84395..d8bdbe1e8ff8 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5539,35 +5539,15 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma, static struct page *mc_handle_file_pte(struct vm_area_struct *vma, unsigned long addr, pte_t ptent, swp_entry_t *entry) { - struct page *page = NULL; - struct address_space *mapping; - pgoff_t pgoff; - if (!vma->vm_file) /* anonymous vma */ return NULL; if (!(mc.flags & MOVE_FILE)) return NULL; - mapping = vma->vm_file->f_mapping; - pgoff = linear_page_index(vma, addr); - /* page is moved even if it's not RSS of this task(page-faulted). */ -#ifdef CONFIG_SWAP /* shmem/tmpfs may report page out on swap: account for that too. */ - if (shmem_mapping(mapping)) { - page = find_get_entry(mapping, pgoff); - if (xa_is_value(page)) { - swp_entry_t swp = radix_to_swp_entry(page); - *entry = swp; - page = find_get_page(swap_address_space(swp), - swp_offset(swp)); - } - } else - page = find_get_page(mapping, pgoff); -#else - page = find_get_page(mapping, pgoff); -#endif - return page; + return find_get_incore_page(vma->vm_file->f_mapping, + linear_page_index(vma, addr)); } /** From e6e88712e43b7942df451508aafc2f083266f56b Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 13 Oct 2020 16:51:24 -0700 Subject: [PATCH 056/181] mm: optimise madvise WILLNEED Instead of calling find_get_entry() for every page index, use an XArray iterator to skip over NULL entries, and avoid calling get_page(), because we only want the swap entries. [willy@infradead.org: fix LTP soft lockups] Link: https://lkml.kernel.org/r/20200914165032.GS6583@casper.infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton Acked-by: Johannes Weiner Cc: Alexey Dobriyan Cc: Chris Wilson Cc: Huang Ying Cc: Hugh Dickins Cc: Jani Nikula Cc: Matthew Auld Cc: William Kucharski Cc: Qian Cai Link: https://lkml.kernel.org/r/20200910183318.20139-4-willy@infradead.org Signed-off-by: Linus Torvalds --- mm/madvise.c | 21 ++++++++++++--------- 1 file changed, 12 insertions(+), 9 deletions(-) diff --git a/mm/madvise.c b/mm/madvise.c index 0e0d61003fc6..9b065d412e5f 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -224,25 +224,28 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma, unsigned long start, unsigned long end, struct address_space *mapping) { - pgoff_t index; + XA_STATE(xas, &mapping->i_pages, linear_page_index(vma, start)); + pgoff_t end_index = end / PAGE_SIZE; struct page *page; - swp_entry_t swap; - for (; start < end; start += PAGE_SIZE) { - index = ((start - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; + rcu_read_lock(); + xas_for_each(&xas, page, end_index) { + swp_entry_t swap; - page = find_get_entry(mapping, index); - if (!xa_is_value(page)) { - if (page) - put_page(page); + if (!xa_is_value(page)) continue; - } + xas_pause(&xas); + rcu_read_unlock(); + swap = radix_to_swp_entry(page); page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE, NULL, 0, false); if (page) put_page(page); + + rcu_read_lock(); } + rcu_read_unlock(); lru_add_drain(); /* Push any new pages onto the LRU now */ } From 8cf886463ecc621688a9c81c387d0f9ed32e45ea Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 13 Oct 2020 16:51:28 -0700 Subject: [PATCH 057/181] proc: optimise smaps for shmem entries Avoid bumping the refcount on pages when we're only interested in the swap entries. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton Acked-by: Johannes Weiner Cc: Alexey Dobriyan Cc: Chris Wilson Cc: Huang Ying Cc: Hugh Dickins Cc: Jani Nikula Cc: Matthew Auld Cc: William Kucharski Link: https://lkml.kernel.org/r/20200910183318.20139-5-willy@infradead.org Signed-off-by: Linus Torvalds --- fs/proc/task_mmu.c | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 35172a91148e..a1be198f755c 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -520,16 +520,10 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr, page = device_private_entry_to_page(swpent); } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap && pte_none(*pte))) { - page = find_get_entry(vma->vm_file->f_mapping, + page = xa_load(&vma->vm_file->f_mapping->i_pages, linear_page_index(vma, addr)); - if (!page) - return; - if (xa_is_value(page)) mss->swap += PAGE_SIZE; - else - put_page(page); - return; } From 9dfc8ff34b951f83632815a87e97a625a11360f0 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 13 Oct 2020 16:51:31 -0700 Subject: [PATCH 058/181] i915: use find_lock_page instead of find_lock_entry i915 does not want to see value entries. Switch it to use find_lock_page() instead, and remove the export of find_lock_entry(). Move find_lock_entry() and find_get_entry() to mm/internal.h to discourage any future use. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton Acked-by: Johannes Weiner Cc: Alexey Dobriyan Cc: Chris Wilson Cc: Huang Ying Cc: Hugh Dickins Cc: Jani Nikula Cc: Matthew Auld Cc: William Kucharski Link: https://lkml.kernel.org/r/20200910183318.20139-6-willy@infradead.org Signed-off-by: Linus Torvalds --- drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 4 ++-- include/linux/pagemap.h | 2 -- mm/filemap.c | 1 - mm/internal.h | 3 +++ 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c index 38113d3c0138..75e8b71c18b9 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c @@ -258,8 +258,8 @@ shmem_writeback(struct drm_i915_gem_object *obj) for (i = 0; i < obj->base.size >> PAGE_SHIFT; i++) { struct page *page; - page = find_lock_entry(mapping, i); - if (!page || xa_is_value(page)) + page = find_lock_page(mapping, i); + if (!page) continue; if (!page_mapped(page) && clear_page_dirty_for_io(page)) { diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 434c9c34aeb6..9d282fe6700d 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -385,8 +385,6 @@ static inline struct page *find_subpage(struct page *head, pgoff_t index) return head + (index & (thp_nr_pages(head) - 1)); } -struct page *find_get_entry(struct address_space *mapping, pgoff_t offset); -struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset); unsigned find_get_entries(struct address_space *mapping, pgoff_t start, unsigned int nr_entries, struct page **entries, pgoff_t *indices); diff --git a/mm/filemap.c b/mm/filemap.c index 748b7b1b4f6d..089a2f02067f 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1726,7 +1726,6 @@ repeat: } return page; } -EXPORT_SYMBOL(find_lock_entry); /** * pagecache_get_page - Find and get a reference to a page. diff --git a/mm/internal.h b/mm/internal.h index 10c677655912..a801a4d51f26 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -65,6 +65,9 @@ static inline void ra_submit(struct file_ra_state *ra, ra->start, ra->size, ra->async_size); } +struct page *find_get_entry(struct address_space *mapping, pgoff_t index); +struct page *find_lock_entry(struct address_space *mapping, pgoff_t index); + /** * page_evictable - test whether a page is evictable * @page: the page to test From a6de4b4873e1e352f5029b0f5b3c347427d74ab4 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 13 Oct 2020 16:51:34 -0700 Subject: [PATCH 059/181] mm: convert find_get_entry to return the head page There are only four callers remaining of find_get_entry(). get_shadow_from_swap_cache() only wants to see shadow entries and doesn't care about which page is returned. Push the find_subpage() call into find_lock_entry(), find_get_incore_page() and pagecache_get_page(). [willy@infradead.org: fix oops] Link: https://lkml.kernel.org/r/20200914112738.GM6583@casper.infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton Cc: Alexey Dobriyan Cc: Chris Wilson Cc: Huang Ying Cc: Hugh Dickins Cc: Jani Nikula Cc: Johannes Weiner Cc: Matthew Auld Cc: William Kucharski Link: https://lkml.kernel.org/r/20200910183318.20139-7-willy@infradead.org Signed-off-by: Linus Torvalds --- mm/filemap.c | 13 +++++++------ mm/swap_state.c | 4 +++- 2 files changed, 10 insertions(+), 7 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 089a2f02067f..9457d4f8b1c1 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1645,19 +1645,19 @@ EXPORT_SYMBOL(page_cache_prev_miss); /** * find_get_entry - find and get a page cache entry * @mapping: the address_space to search - * @offset: the page cache index + * @index: The page cache index. * * Looks up the page cache slot at @mapping & @offset. If there is a - * page cache page, it is returned with an increased refcount. + * page cache page, the head page is returned with an increased refcount. * * If the slot holds a shadow entry of a previously evicted page, or a * swap entry from shmem/tmpfs, it is returned. * - * Return: the found page or shadow entry, %NULL if nothing is found. + * Return: The head page or shadow entry, %NULL if nothing is found. */ -struct page *find_get_entry(struct address_space *mapping, pgoff_t offset) +struct page *find_get_entry(struct address_space *mapping, pgoff_t index) { - XA_STATE(xas, &mapping->i_pages, offset); + XA_STATE(xas, &mapping->i_pages, index); struct page *page; rcu_read_lock(); @@ -1685,7 +1685,6 @@ repeat: put_page(page); goto repeat; } - page = find_subpage(page, offset); out: rcu_read_unlock(); @@ -1722,6 +1721,7 @@ repeat: put_page(page); goto repeat; } + page = find_subpage(page, offset); VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page); } return page; @@ -1768,6 +1768,7 @@ repeat: page = NULL; if (!page) goto no_page; + page = find_subpage(page, index); if (fgp_flags & FGP_LOCK) { if (fgp_flags & FGP_NOWAIT) { diff --git a/mm/swap_state.c b/mm/swap_state.c index c79e2242dd04..f24f2cea4238 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -431,8 +431,10 @@ struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index) struct swap_info_struct *si; struct page *page = find_get_entry(mapping, index); - if (!xa_is_value(page)) + if (!page) return page; + if (!xa_is_value(page)) + return find_subpage(page, index); if (!shmem_mapping(mapping)) return NULL; From 63ec1973ddf3eb70feb5728088ca190f1af449cb Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 13 Oct 2020 16:51:38 -0700 Subject: [PATCH 060/181] mm/shmem: return head page from find_lock_entry Convert shmem_getpage_gfp() (the only remaining caller of find_lock_entry()) to cope with a head page being returned instead of the subpage for the index. [willy@infradead.org: fix BUG()s] Link https://lore.kernel.org/linux-mm/20200912032042.GA6583@casper.infradead.org/ Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton Cc: Alexey Dobriyan Cc: Chris Wilson Cc: Huang Ying Cc: Hugh Dickins Cc: Jani Nikula Cc: Johannes Weiner Cc: Matthew Auld Cc: William Kucharski Link: https://lkml.kernel.org/r/20200910183318.20139-8-willy@infradead.org Signed-off-by: Linus Torvalds --- include/linux/pagemap.h | 9 +++++++++ mm/filemap.c | 25 +++++++++++-------------- mm/shmem.c | 19 ++++++++++--------- 3 files changed, 30 insertions(+), 23 deletions(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 9d282fe6700d..5176009e4ffa 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -372,6 +372,15 @@ static inline struct page *grab_cache_page_nowait(struct address_space *mapping, mapping_gfp_mask(mapping)); } +/* Does this page contain this index? */ +static inline bool thp_contains(struct page *head, pgoff_t index) +{ + /* HugeTLBfs indexes the page cache in units of hpage_size */ + if (PageHuge(head)) + return head->index == index; + return page_index(head) == (index & ~(thp_nr_pages(head) - 1UL)); +} + /* * Given the page we found in the page cache, return the page corresponding * to this index in the file diff --git a/mm/filemap.c b/mm/filemap.c index 9457d4f8b1c1..b44346a748fa 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1692,37 +1692,34 @@ out: } /** - * find_lock_entry - locate, pin and lock a page cache entry - * @mapping: the address_space to search - * @offset: the page cache index + * find_lock_entry - Locate and lock a page cache entry. + * @mapping: The address_space to search. + * @index: The page cache index. * - * Looks up the page cache slot at @mapping & @offset. If there is a - * page cache page, it is returned locked and with an increased - * refcount. + * Looks up the page at @mapping & @index. If there is a page in the + * cache, the head page is returned locked and with an increased refcount. * * If the slot holds a shadow entry of a previously evicted page, or a * swap entry from shmem/tmpfs, it is returned. * - * find_lock_entry() may sleep. - * - * Return: the found page or shadow entry, %NULL if nothing is found. + * Context: May sleep. + * Return: The head page or shadow entry, %NULL if nothing is found. */ -struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset) +struct page *find_lock_entry(struct address_space *mapping, pgoff_t index) { struct page *page; repeat: - page = find_get_entry(mapping, offset); + page = find_get_entry(mapping, index); if (page && !xa_is_value(page)) { lock_page(page); /* Has the page been truncated? */ - if (unlikely(page_mapping(page) != mapping)) { + if (unlikely(page->mapping != mapping)) { unlock_page(page); put_page(page); goto repeat; } - page = find_subpage(page, offset); - VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page); + VM_BUG_ON_PAGE(!thp_contains(page, index), page); } return page; } diff --git a/mm/shmem.c b/mm/shmem.c index d42c27e4769f..6d4ddef4a24f 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1830,6 +1830,8 @@ repeat: return error; } + if (page) + hindex = page->index; if (page && sgp == SGP_WRITE) mark_page_accessed(page); @@ -1840,11 +1842,10 @@ repeat: unlock_page(page); put_page(page); page = NULL; + hindex = index; } - if (page || sgp == SGP_READ) { - *pagep = page; - return 0; - } + if (page || sgp == SGP_READ) + goto out; /* * Fast cache lookup did not find it: @@ -1969,14 +1970,13 @@ clear: * it now, lest undo on failure cancel our earlier guarantee. */ if (sgp != SGP_WRITE && !PageUptodate(page)) { - struct page *head = compound_head(page); int i; - for (i = 0; i < compound_nr(head); i++) { - clear_highpage(head + i); - flush_dcache_page(head + i); + for (i = 0; i < compound_nr(page); i++) { + clear_highpage(page + i); + flush_dcache_page(page + i); } - SetPageUptodate(head); + SetPageUptodate(page); } /* Perhaps the file has been truncated since we checked */ @@ -1992,6 +1992,7 @@ clear: error = -EINVAL; goto unlock; } +out: *pagep = page + index - hindex; return 0; From a8cf7f272b5a28a62ecfc39d6f7d75b4f486e350 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 13 Oct 2020 16:51:41 -0700 Subject: [PATCH 061/181] mm: add find_lock_head Add a new FGP_HEAD flag which avoids calling find_subpage() and add a convenience wrapper for it. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton Cc: Alexey Dobriyan Cc: Chris Wilson Cc: Huang Ying Cc: Hugh Dickins Cc: Jani Nikula Cc: Johannes Weiner Cc: Matthew Auld Cc: William Kucharski Link: https://lkml.kernel.org/r/20200910183318.20139-9-willy@infradead.org Signed-off-by: Linus Torvalds --- include/linux/pagemap.h | 32 ++++++++++++++++++++++++++------ mm/filemap.c | 9 ++++++--- 2 files changed, 32 insertions(+), 9 deletions(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 5176009e4ffa..1a3554f5d992 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -279,6 +279,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping, #define FGP_NOFS 0x00000010 #define FGP_NOWAIT 0x00000020 #define FGP_FOR_MMAP 0x00000040 +#define FGP_HEAD 0x00000080 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset, int fgp_flags, gfp_t cache_gfp_mask); @@ -310,18 +311,37 @@ static inline struct page *find_get_page_flags(struct address_space *mapping, * @mapping: the address_space to search * @offset: the page index * - * Looks up the page cache slot at @mapping & @offset. If there is a + * Looks up the page cache entry at @mapping & @offset. If there is a * page cache page, it is returned locked and with an increased * refcount. * - * Otherwise, %NULL is returned. - * - * find_lock_page() may sleep. + * Context: May sleep. + * Return: A struct page or %NULL if there is no page in the cache for this + * index. */ static inline struct page *find_lock_page(struct address_space *mapping, - pgoff_t offset) + pgoff_t index) { - return pagecache_get_page(mapping, offset, FGP_LOCK, 0); + return pagecache_get_page(mapping, index, FGP_LOCK, 0); +} + +/** + * find_lock_head - Locate, pin and lock a pagecache page. + * @mapping: The address_space to search. + * @offset: The page index. + * + * Looks up the page cache entry at @mapping & @offset. If there is a + * page cache page, its head page is returned locked and with an increased + * refcount. + * + * Context: May sleep. + * Return: A struct page which is !PageTail, or %NULL if there is no page + * in the cache for this index. + */ +static inline struct page *find_lock_head(struct address_space *mapping, + pgoff_t index) +{ + return pagecache_get_page(mapping, index, FGP_LOCK | FGP_HEAD, 0); } /** diff --git a/mm/filemap.c b/mm/filemap.c index b44346a748fa..63d2fed539d7 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1737,6 +1737,8 @@ repeat: * * * %FGP_ACCESSED - The page will be marked accessed. * * %FGP_LOCK - The page is returned locked. + * * %FGP_HEAD - If the page is present and a THP, return the head page + * rather than the exact page specified by the index. * * %FGP_CREAT - If no page is present then a new page is allocated using * @gfp_mask and added to the page cache and the VM's LRU list. * The page is returned locked and with an increased refcount. @@ -1765,7 +1767,6 @@ repeat: page = NULL; if (!page) goto no_page; - page = find_subpage(page, index); if (fgp_flags & FGP_LOCK) { if (fgp_flags & FGP_NOWAIT) { @@ -1778,12 +1779,12 @@ repeat: } /* Has the page been truncated? */ - if (unlikely(compound_head(page)->mapping != mapping)) { + if (unlikely(page->mapping != mapping)) { unlock_page(page); put_page(page); goto repeat; } - VM_BUG_ON_PAGE(page->index != index, page); + VM_BUG_ON_PAGE(!thp_contains(page, index), page); } if (fgp_flags & FGP_ACCESSED) @@ -1793,6 +1794,8 @@ repeat: if (page_is_idle(page)) clear_page_idle(page); } + if (!(fgp_flags & FGP_HEAD)) + page = find_subpage(page, index); no_page: if (!page && (fgp_flags & FGP_CREAT)) { From 27a83a609b3b39b0a4ec6c75050b1183d7c302db Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 13 Oct 2020 16:51:44 -0700 Subject: [PATCH 062/181] mm/filemap: fix filemap_map_pages for THP We dereference page->mapping and page->index directly after calling find_subpage() and these fields are not valid for tail pages. While commit 4101196b19d7 ("mm: page cache: store only head pages in i_pages") introduced the call to find_subpage(), the problem existed prior to this; I'm going to suggest all the way back to when THPs first existed. The user-visible effects of this are almost negligible. To hit it, you have to mmap a tmpfs file at an unaligned address and then it's only a disabled optimisation causing page faults to happen more frequently than they otherwise would. Fix this by keeping both head and page pointers and checking the appropriate one. We could use page_mapping() and page_to_index(), but that's higher overhead. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton Acked-by: Kirill A. Shutemov Cc: William Kucharski Link: https://lkml.kernel.org/r/20200911012532.24761-1-willy@infradead.org Signed-off-by: Linus Torvalds --- mm/filemap.c | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 63d2fed539d7..38546dca58fe 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2793,42 +2793,42 @@ void filemap_map_pages(struct vm_fault *vmf, pgoff_t last_pgoff = start_pgoff; unsigned long max_idx; XA_STATE(xas, &mapping->i_pages, start_pgoff); - struct page *page; + struct page *head, *page; unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss); rcu_read_lock(); - xas_for_each(&xas, page, end_pgoff) { - if (xas_retry(&xas, page)) + xas_for_each(&xas, head, end_pgoff) { + if (xas_retry(&xas, head)) continue; - if (xa_is_value(page)) + if (xa_is_value(head)) goto next; /* * Check for a locked page first, as a speculative * reference may adversely influence page migration. */ - if (PageLocked(page)) + if (PageLocked(head)) goto next; - if (!page_cache_get_speculative(page)) + if (!page_cache_get_speculative(head)) goto next; /* Has the page moved or been split? */ - if (unlikely(page != xas_reload(&xas))) + if (unlikely(head != xas_reload(&xas))) goto skip; - page = find_subpage(page, xas.xa_index); + page = find_subpage(head, xas.xa_index); - if (!PageUptodate(page) || + if (!PageUptodate(head) || PageReadahead(page) || PageHWPoison(page)) goto skip; - if (!trylock_page(page)) + if (!trylock_page(head)) goto skip; - if (page->mapping != mapping || !PageUptodate(page)) + if (head->mapping != mapping || !PageUptodate(head)) goto unlock; max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE); - if (page->index >= max_idx) + if (xas.xa_index >= max_idx) goto unlock; if (mmap_miss > 0) @@ -2840,12 +2840,12 @@ void filemap_map_pages(struct vm_fault *vmf, last_pgoff = xas.xa_index; if (alloc_set_pte(vmf, page)) goto unlock; - unlock_page(page); + unlock_page(head); goto next; unlock: - unlock_page(page); + unlock_page(head); skip: - put_page(page); + put_page(head); next: /* Huge page is mapped? No need to proceed. */ if (pmd_trans_huge(*vmf->pmd)) From eb1d7a65f08a52dfb828bf45b4ead7f617c64047 Mon Sep 17 00:00:00 2001 From: Yafang Shao Date: Tue, 13 Oct 2020 16:51:47 -0700 Subject: [PATCH 063/181] mm, fadvise: improve the expensive remote LRU cache draining after FADV_DONTNEED Our users reported that there're some random latency spikes when their RT process is running. Finally we found that latency spike is caused by FADV_DONTNEED. Which may call lru_add_drain_all() to drain LRU cache on remote CPUs, and then waits the per-cpu work to complete. The wait time is uncertain, which may be tens millisecond. That behavior is unreasonable, because this process is bound to a specific CPU and the file is only accessed by itself, IOW, there should be no pagecache pages on a per-cpu pagevec of a remote CPU. That unreasonable behavior is partially caused by the wrong comparation of the number of invalidated pages and the number of the target. For example, if (count < (end_index - start_index + 1)) The count above is how many pages were invalidated in the local CPU, and (end_index - start_index + 1) is how many pages should be invalidated. The usage of (end_index - start_index + 1) is incorrect, because they are virtual addresses, which may not mapped to pages. Besides that, there may be holes between start and end. So we'd better check whether there are still pages on per-cpu pagevec after drain the local cpu, and then decide whether or not to call lru_add_drain_all(). After I applied it with a hotfix to our production environment, most of the lru_add_drain_all() can be avoided. Suggested-by: Mel Gorman Signed-off-by: Yafang Shao Signed-off-by: Andrew Morton Acked-by: Mel Gorman Cc: Johannes Weiner Link: https://lkml.kernel.org/r/20200923133318.14373-1-laoar.shao@gmail.com Signed-off-by: Linus Torvalds --- include/linux/fs.h | 4 ++++ mm/fadvise.c | 9 +++---- mm/truncate.c | 58 ++++++++++++++++++++++++++++++++-------------- 3 files changed, 49 insertions(+), 22 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 2e621d28cd65..5815f7d4dbf4 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2581,6 +2581,10 @@ extern bool is_bad_inode(struct inode *); unsigned long invalidate_mapping_pages(struct address_space *mapping, pgoff_t start, pgoff_t end); +void invalidate_mapping_pagevec(struct address_space *mapping, + pgoff_t start, pgoff_t end, + unsigned long *nr_pagevec); + static inline void invalidate_remote_inode(struct inode *inode) { if (S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) || diff --git a/mm/fadvise.c b/mm/fadvise.c index 0e66f2aaeea3..d6baa4f451c5 100644 --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -141,7 +141,7 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice) } if (end_index >= start_index) { - unsigned long count; + unsigned long nr_pagevec = 0; /* * It's common to FADV_DONTNEED right after @@ -154,8 +154,9 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice) */ lru_add_drain(); - count = invalidate_mapping_pages(mapping, - start_index, end_index); + invalidate_mapping_pagevec(mapping, + start_index, end_index, + &nr_pagevec); /* * If fewer pages were invalidated than expected then @@ -163,7 +164,7 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice) * a per-cpu pagevec for a remote CPU. Drain all * pagevecs and try again. */ - if (count < (end_index - start_index + 1)) { + if (nr_pagevec) { lru_add_drain_all(); invalidate_mapping_pages(mapping, start_index, end_index); diff --git a/mm/truncate.c b/mm/truncate.c index dd9ebc1da356..6bbe0f0b3ce9 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -528,23 +528,8 @@ void truncate_inode_pages_final(struct address_space *mapping) } EXPORT_SYMBOL(truncate_inode_pages_final); -/** - * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode - * @mapping: the address_space which holds the pages to invalidate - * @start: the offset 'from' which to invalidate - * @end: the offset 'to' which to invalidate (inclusive) - * - * This function only removes the unlocked pages, if you want to - * remove all the pages of one inode, you must call truncate_inode_pages. - * - * invalidate_mapping_pages() will not block on IO activity. It will not - * invalidate pages which are dirty, locked, under writeback or mapped into - * pagetables. - * - * Return: the number of the pages that were invalidated - */ -unsigned long invalidate_mapping_pages(struct address_space *mapping, - pgoff_t start, pgoff_t end) +unsigned long __invalidate_mapping_pages(struct address_space *mapping, + pgoff_t start, pgoff_t end, unsigned long *nr_pagevec) { pgoff_t indices[PAGEVEC_SIZE]; struct pagevec pvec; @@ -610,8 +595,13 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping, * Invalidation is a hint that the page is no longer * of interest and try to speed up its reclaim. */ - if (!ret) + if (!ret) { deactivate_file_page(page); + /* It is likely on the pagevec of a remote CPU */ + if (nr_pagevec) + (*nr_pagevec)++; + } + if (PageTransHuge(page)) put_page(page); count += ret; @@ -623,8 +613,40 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping, } return count; } + +/** + * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode + * @mapping: the address_space which holds the pages to invalidate + * @start: the offset 'from' which to invalidate + * @end: the offset 'to' which to invalidate (inclusive) + * + * This function only removes the unlocked pages, if you want to + * remove all the pages of one inode, you must call truncate_inode_pages. + * + * invalidate_mapping_pages() will not block on IO activity. It will not + * invalidate pages which are dirty, locked, under writeback or mapped into + * pagetables. + * + * Return: the number of the pages that were invalidated + */ +unsigned long invalidate_mapping_pages(struct address_space *mapping, + pgoff_t start, pgoff_t end) +{ + return __invalidate_mapping_pages(mapping, start, end, NULL); +} EXPORT_SYMBOL(invalidate_mapping_pages); +/** + * This helper is similar with the above one, except that it accounts for pages + * that are likely on a pagevec and count them in @nr_pagevec, which will used by + * the caller. + */ +void invalidate_mapping_pagevec(struct address_space *mapping, + pgoff_t start, pgoff_t end, unsigned long *nr_pagevec) +{ + __invalidate_mapping_pages(mapping, start, end, nr_pagevec); +} + /* * This is like invalidate_complete_page(), except it ignores the page's * refcount. We do this because invalidate_inode_pages2() needs stronger From 4c6cd03ed88cbeed796d840a4f9c8ac082e82409 Mon Sep 17 00:00:00 2001 From: Barry Song Date: Tue, 13 Oct 2020 16:51:51 -0700 Subject: [PATCH 064/181] mm/gup_benchmark: update the documentation in Kconfig In the beginning, mm/gup_benchmark.c supported get_user_pages_fast() only, but right now, it supports the benchmarking of a couple of get_user_pages() related calls like: * get_user_pages_fast() * get_user_pages() * pin_user_pages_fast() * pin_user_pages() The documentation is confusing and needs update. Signed-off-by: Barry Song Signed-off-by: Andrew Morton Cc: John Hubbard Cc: Keith Busch Cc: Ira Weiny Cc: Kirill A. Shutemov Link: https://lkml.kernel.org/r/20200821032546.19992-1-song.bao.hua@hisilicon.com Signed-off-by: Linus Torvalds --- mm/Kconfig | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/Kconfig b/mm/Kconfig index e3ee7b32c637..8c60c49a123b 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -831,10 +831,10 @@ config PERCPU_STATS be used to help understand percpu memory usage. config GUP_BENCHMARK - bool "Enable infrastructure for get_user_pages_fast() benchmarking" + bool "Enable infrastructure for get_user_pages() and related calls benchmarking" help Provides /sys/kernel/debug/gup_benchmark that helps with testing - performance of get_user_pages_fast(). + performance of get_user_pages() and related calls. See tools/testing/selftests/vm/gup_benchmark.c From 657d4f7996c6a4235069d8f9a47b64af2f007dbc Mon Sep 17 00:00:00 2001 From: Barry Song Date: Tue, 13 Oct 2020 16:51:54 -0700 Subject: [PATCH 065/181] mm/gup_benchmark: use pin_user_pages for FOLL_LONGTERM flag MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit According to Documentation/core-api/pin_user_pages.rst, FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is, FOLL_LONGTERM is a specific case, more restrictive case of FOLL_PIN. Almost all kernel modules are using pin_user_pages() with FOLL_LONGTERM, mm/gup_benchmark.c seems to the only exception in which FOLL_PIN is not a prerequisite to FOLL_LONGTERM. Signed-off-by: Barry Song Signed-off-by: Andrew Morton Reviewed-by: John Hubbard Cc: Jan Kara Cc: Jérôme Glisse Cc: "Matthew Wilcox (Oracle)" Cc: Al Viro Cc: Christoph Hellwig Cc: Dan Williams Cc: Dave Chinner Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Michal Hocko Cc: Mike Kravetz Cc: Shuah Khan Cc: Vlastimil Babka Link: http://lkml.kernel.org/r/20200815122056.29508-1-song.bao.hua@hisilicon.com Signed-off-by: Linus Torvalds --- mm/gup_benchmark.c | 23 +++++++++++----------- tools/testing/selftests/vm/gup_benchmark.c | 14 ++++++------- 2 files changed, 19 insertions(+), 18 deletions(-) diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c index be690fa66a46..464cae1fa3ea 100644 --- a/mm/gup_benchmark.c +++ b/mm/gup_benchmark.c @@ -6,10 +6,10 @@ #include #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark) -#define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark) -#define GUP_BENCHMARK _IOWR('g', 3, struct gup_benchmark) -#define PIN_FAST_BENCHMARK _IOWR('g', 4, struct gup_benchmark) -#define PIN_BENCHMARK _IOWR('g', 5, struct gup_benchmark) +#define GUP_BENCHMARK _IOWR('g', 2, struct gup_benchmark) +#define PIN_FAST_BENCHMARK _IOWR('g', 3, struct gup_benchmark) +#define PIN_BENCHMARK _IOWR('g', 4, struct gup_benchmark) +#define PIN_LONGTERM_BENCHMARK _IOWR('g', 5, struct gup_benchmark) struct gup_benchmark { __u64 get_delta_usec; @@ -28,7 +28,6 @@ static void put_back_pages(unsigned int cmd, struct page **pages, switch (cmd) { case GUP_FAST_BENCHMARK: - case GUP_LONGTERM_BENCHMARK: case GUP_BENCHMARK: for (i = 0; i < nr_pages; i++) put_page(pages[i]); @@ -36,6 +35,7 @@ static void put_back_pages(unsigned int cmd, struct page **pages, case PIN_FAST_BENCHMARK: case PIN_BENCHMARK: + case PIN_LONGTERM_BENCHMARK: unpin_user_pages(pages, nr_pages); break; } @@ -50,6 +50,7 @@ static void verify_dma_pinned(unsigned int cmd, struct page **pages, switch (cmd) { case PIN_FAST_BENCHMARK: case PIN_BENCHMARK: + case PIN_LONGTERM_BENCHMARK: for (i = 0; i < nr_pages; i++) { page = pages[i]; if (WARN(!page_maybe_dma_pinned(page), @@ -101,11 +102,6 @@ static int __gup_benchmark_ioctl(unsigned int cmd, nr = get_user_pages_fast(addr, nr, gup->flags, pages + i); break; - case GUP_LONGTERM_BENCHMARK: - nr = get_user_pages(addr, nr, - gup->flags | FOLL_LONGTERM, - pages + i, NULL); - break; case GUP_BENCHMARK: nr = get_user_pages(addr, nr, gup->flags, pages + i, NULL); @@ -118,6 +114,11 @@ static int __gup_benchmark_ioctl(unsigned int cmd, nr = pin_user_pages(addr, nr, gup->flags, pages + i, NULL); break; + case PIN_LONGTERM_BENCHMARK: + nr = pin_user_pages(addr, nr, + gup->flags | FOLL_LONGTERM, + pages + i, NULL); + break; default: kvfree(pages); ret = -EINVAL; @@ -162,10 +163,10 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd, switch (cmd) { case GUP_FAST_BENCHMARK: - case GUP_LONGTERM_BENCHMARK: case GUP_BENCHMARK: case PIN_FAST_BENCHMARK: case PIN_BENCHMARK: + case PIN_LONGTERM_BENCHMARK: break; default: return -EINVAL; diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c index 43b4dfe161a2..31f8bb086907 100644 --- a/tools/testing/selftests/vm/gup_benchmark.c +++ b/tools/testing/selftests/vm/gup_benchmark.c @@ -15,12 +15,12 @@ #define PAGE_SIZE sysconf(_SC_PAGESIZE) #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark) -#define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark) -#define GUP_BENCHMARK _IOWR('g', 3, struct gup_benchmark) +#define GUP_BENCHMARK _IOWR('g', 2, struct gup_benchmark) /* Similar to above, but use FOLL_PIN instead of FOLL_GET. */ -#define PIN_FAST_BENCHMARK _IOWR('g', 4, struct gup_benchmark) -#define PIN_BENCHMARK _IOWR('g', 5, struct gup_benchmark) +#define PIN_FAST_BENCHMARK _IOWR('g', 3, struct gup_benchmark) +#define PIN_BENCHMARK _IOWR('g', 4, struct gup_benchmark) +#define PIN_LONGTERM_BENCHMARK _IOWR('g', 5, struct gup_benchmark) /* Just the flags we need, copied from mm.h: */ #define FOLL_WRITE 0x01 /* check pte is writable */ @@ -52,6 +52,9 @@ int main(int argc, char **argv) case 'b': cmd = PIN_BENCHMARK; break; + case 'L': + cmd = PIN_LONGTERM_BENCHMARK; + break; case 'm': size = atoi(optarg) * MB; break; @@ -67,9 +70,6 @@ int main(int argc, char **argv) case 'T': thp = 0; break; - case 'L': - cmd = GUP_LONGTERM_BENCHMARK; - break; case 'U': cmd = GUP_BENCHMARK; break; From 447f3e45c18a8f27018213bcb1b5a0076633f68a Mon Sep 17 00:00:00 2001 From: Barry Song Date: Tue, 13 Oct 2020 16:51:58 -0700 Subject: [PATCH 066/181] mm/gup: don't permit users to call get_user_pages with FOLL_LONGTERM MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit gup prohibits users from calling get_user_pages() with FOLL_PIN. But it allows users to call get_user_pages() with FOLL_LONGTERM only. It seems insensible. Since FOLL_LONGTERM is a stricter case of FOLL_PIN, we should prohibit users from calling get_user_pages() with FOLL_LONGTERM while not with FOLL_PIN. mm/gup_benchmark.c used to be the only user who did this improperly. But it has been fixed by moving to use pin_user_pages(). [akpm@linux-foundation.org: fix CONFIG_MMU=n build] Link: https://lkml.kernel.org/r/CA+G9fYuNS3k0DVT62twfV746pfNhCSrk5sVMcOcQ1PGGnEseyw@mail.gmail.com Signed-off-by: Barry Song Signed-off-by: Andrew Morton Reviewed-by: Ira Weiny Cc: John Hubbard Cc: Jan Kara Cc: Jérôme Glisse Cc: "Matthew Wilcox (Oracle)" Cc: Al Viro Cc: Christoph Hellwig Cc: Dan Williams Cc: Dave Chinner Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Michal Hocko Cc: Mike Kravetz Cc: Shuah Khan Cc: Vlastimil Babka Cc: Naresh Kamboju Link: http://lkml.kernel.org/r/20200819110100.23504-1-song.bao.hua@hisilicon.com Signed-off-by: Linus Torvalds --- mm/gup.c | 37 ++++++++++++++++++++++--------------- 1 file changed, 22 insertions(+), 15 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index e869c634cc9a..32d0e3ca7fbb 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1747,6 +1747,25 @@ static __always_inline long __gup_longterm_locked(struct mm_struct *mm, } #endif /* CONFIG_FS_DAX || CONFIG_CMA */ +static bool is_valid_gup_flags(unsigned int gup_flags) +{ + /* + * FOLL_PIN must only be set internally by the pin_user_pages*() APIs, + * never directly by the caller, so enforce that with an assertion: + */ + if (WARN_ON_ONCE(gup_flags & FOLL_PIN)) + return false; + /* + * FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying + * that is, FOLL_LONGTERM is a specific case, more restrictive case of + * FOLL_PIN. + */ + if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM)) + return false; + + return true; +} + #ifdef CONFIG_MMU static long __get_user_pages_remote(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, @@ -1842,11 +1861,7 @@ long get_user_pages_remote(struct mm_struct *mm, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas, int *locked) { - /* - * FOLL_PIN must only be set internally by the pin_user_pages*() APIs, - * never directly by the caller, so enforce that with an assertion: - */ - if (WARN_ON_ONCE(gup_flags & FOLL_PIN)) + if (!is_valid_gup_flags(gup_flags)) return -EINVAL; return __get_user_pages_remote(mm, start, nr_pages, gup_flags, @@ -1892,11 +1907,7 @@ long get_user_pages(unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas) { - /* - * FOLL_PIN must only be set internally by the pin_user_pages*() APIs, - * never directly by the caller, so enforce that with an assertion: - */ - if (WARN_ON_ONCE(gup_flags & FOLL_PIN)) + if (!is_valid_gup_flags(gup_flags)) return -EINVAL; return __gup_longterm_locked(current->mm, start, nr_pages, @@ -2786,11 +2797,7 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast_only); int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages) { - /* - * FOLL_PIN must only be set internally by the pin_user_pages*() APIs, - * never directly by the caller, so enforce that: - */ - if (WARN_ON_ONCE(gup_flags & FOLL_PIN)) + if (!is_valid_gup_flags(gup_flags)) return -EINVAL; /* From 146608bb75e6776af4cf42310f583d39311e5334 Mon Sep 17 00:00:00 2001 From: John Hubbard Date: Tue, 13 Oct 2020 16:52:01 -0700 Subject: [PATCH 067/181] mm/gup: protect unpin_user_pages() against npages==-ERRNO As suggested by Dan Carpenter, fortify unpin_user_pages() just a bit, against a typical caller mistake: check if the npages arg is really a -ERRNO value, which would blow up the unpinning loop: WARN and return. If this new WARN_ON() fires, then the system *might* be leaking pages (by leaving them pinned), but probably not. More likely, gup/pup returned a hard -ERRNO error to the caller, who erroneously passed it here. Signed-off-by: John Hubbard Signed-off-by: Dan Carpenter Signed-off-by: Andrew Morton Cc: Ira Weiny Cc: Souptick Joarder Link: https://lkml.kernel.org/r/20200917065706.409079-1-jhubbard@nvidia.com Signed-off-by: Linus Torvalds --- mm/gup.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/mm/gup.c b/mm/gup.c index 32d0e3ca7fbb..ad617e7f22f5 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -328,6 +328,13 @@ void unpin_user_pages(struct page **pages, unsigned long npages) { unsigned long index; + /* + * If this WARN_ON() fires, then the system *might* be leaking pages (by + * leaving them pinned), but probably not. More likely, gup/pup returned + * a hard -ERRNO error to the caller, who erroneously passed it here. + */ + if (WARN_ON(IS_ERR_VALUE(npages))) + return; /* * TODO: this can be optimized for huge pages: if a series of pages is * physically contiguous and part of the same compound page, then a From 3264631548b1f2bf89b71793d06bfd0f748f649d Mon Sep 17 00:00:00 2001 From: Gao Xiang Date: Tue, 13 Oct 2020 16:52:04 -0700 Subject: [PATCH 068/181] swap: rename SWP_FS to SWAP_FS_OPS to avoid ambiguity SWP_FS is used to make swap_{read,write}page() go through the filesystem, and it's only used for swap files over NFS for now. Otherwise it will directly submit IO to blockdev according to swapfile extents reported by filesystems in advance. As Matthew pointed out [1], SWP_FS naming is somewhat confusing, so let's rename to SWP_FS_OPS. [1] https://lore.kernel.org/r/20200820113448.GM17456@casper.infradead.org Suggested-by: Matthew Wilcox Signed-off-by: Gao Xiang Signed-off-by: Andrew Morton Link: https://lkml.kernel.org/r/20200822113019.11319-1-hsiangkao@redhat.com Signed-off-by: Linus Torvalds --- include/linux/swap.h | 2 +- mm/page_io.c | 6 +++--- mm/swap_state.c | 2 +- mm/swapfile.c | 2 +- 4 files changed, 6 insertions(+), 6 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 23c6e43a956d..7bd5b4aac049 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -170,7 +170,7 @@ enum { SWP_CONTINUED = (1 << 5), /* swap_map has count continuation */ SWP_BLKDEV = (1 << 6), /* its a block device */ SWP_ACTIVATED = (1 << 7), /* set after swap_activate success */ - SWP_FS = (1 << 8), /* swap file goes through fs */ + SWP_FS_OPS = (1 << 8), /* swapfile operations go through fs */ SWP_AREA_DISCARD = (1 << 9), /* single-time swap area discards */ SWP_PAGE_DISCARD = (1 << 10), /* freed swap page-cluster discards */ SWP_STABLE_WRITES = (1 << 11), /* no overwrite PG_writeback pages */ diff --git a/mm/page_io.c b/mm/page_io.c index f9e9267f296f..2ffe4c4a6d97 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -312,7 +312,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc, struct swap_info_struct *sis = page_swap_info(page); VM_BUG_ON_PAGE(!PageSwapCache(page), page); - if (data_race(sis->flags & SWP_FS)) { + if (data_race(sis->flags & SWP_FS_OPS)) { struct kiocb kiocb; struct file *swap_file = sis->swap_file; struct address_space *mapping = swap_file->f_mapping; @@ -403,7 +403,7 @@ int swap_readpage(struct page *page, bool synchronous) goto out; } - if (data_race(sis->flags & SWP_FS)) { + if (data_race(sis->flags & SWP_FS_OPS)) { struct file *swap_file = sis->swap_file; struct address_space *mapping = swap_file->f_mapping; @@ -467,7 +467,7 @@ int swap_set_page_dirty(struct page *page) { struct swap_info_struct *sis = page_swap_info(page); - if (data_race(sis->flags & SWP_FS)) { + if (data_race(sis->flags & SWP_FS_OPS)) { struct address_space *mapping = sis->swap_file->f_mapping; VM_BUG_ON_PAGE(!PageSwapCache(page), page); diff --git a/mm/swap_state.c b/mm/swap_state.c index f24f2cea4238..aa40e706604c 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -665,7 +665,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, goto skip; /* Test swap type to make sure the dereference is safe */ - if (likely(si->flags & (SWP_BLKDEV | SWP_FS))) { + if (likely(si->flags & (SWP_BLKDEV | SWP_FS_OPS))) { struct inode *inode = si->swap_file->f_mapping->host; if (inode_read_congested(inode)) goto skip; diff --git a/mm/swapfile.c b/mm/swapfile.c index ced4635d924c..183b87bc87cc 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2437,7 +2437,7 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span) if (ret >= 0) sis->flags |= SWP_ACTIVATED; if (!ret) { - sis->flags |= SWP_FS; + sis->flags |= SWP_FS_OPS; ret = add_swap_extent(sis, 0, sis->max, 0); *span = sis->pages; } From cc2828b21c764f901128ca2e7b9f056d0e72104f Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Tue, 13 Oct 2020 16:52:08 -0700 Subject: [PATCH 069/181] mm: remove activate_page() from unuse_pte() We don't initially add anon pages to active lruvec after commit b518154e59aa ("mm/vmscan: protect the workingset on anonymous LRU"). Remove activate_page() from unuse_pte(), which seems to be missed by the commit. And make the function static while we are at it. Before the commit, we called lru_cache_add_active_or_unevictable() to add new ksm pages to active lruvec. Therefore, activate_page() wasn't necessary for them in the first place. Signed-off-by: Yu Zhao Signed-off-by: Andrew Morton Reviewed-by: Yang Shi Cc: Alexander Duyck Cc: Huang Ying Cc: David Hildenbrand Cc: Michal Hocko Cc: Qian Cai Cc: Mel Gorman Cc: Nicholas Piggin Cc: Hugh Dickins Cc: Joonsoo Kim Link: http://lkml.kernel.org/r/20200818184704.3625199-1-yuzhao@google.com Signed-off-by: Linus Torvalds --- include/linux/swap.h | 1 - mm/swap.c | 4 ++-- mm/swapfile.c | 5 ----- 3 files changed, 2 insertions(+), 8 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 7bd5b4aac049..667935c0dbd4 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -340,7 +340,6 @@ extern void lru_note_cost_page(struct page *); extern void lru_cache_add(struct page *); extern void lru_add_page_tail(struct page *page, struct page *page_tail, struct lruvec *lruvec, struct list_head *head); -extern void activate_page(struct page *); extern void mark_page_accessed(struct page *); extern void lru_add_drain(void); extern void lru_add_drain_cpu(int cpu); diff --git a/mm/swap.c b/mm/swap.c index 65ef7e3525bf..82ddefda4904 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -348,7 +348,7 @@ static bool need_activate_page_drain(int cpu) return pagevec_count(&per_cpu(lru_pvecs.activate_page, cpu)) != 0; } -void activate_page(struct page *page) +static void activate_page(struct page *page) { page = compound_head(page); if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { @@ -368,7 +368,7 @@ static inline void activate_page_drain(int cpu) { } -void activate_page(struct page *page) +static void activate_page(struct page *page) { pg_data_t *pgdat = page_pgdat(page); diff --git a/mm/swapfile.c b/mm/swapfile.c index 183b87bc87cc..d5c19d0baf06 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1929,11 +1929,6 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, lru_cache_add_inactive_or_unevictable(page, vma); } swap_free(entry); - /* - * Move the page to the active list so it is not - * immediately swapped out again after swapon. - */ - activate_page(page); out: pte_unmap_unlock(pte, ptl); if (page != swapcache) { From 6f4dd8de4835563de9bae797ce1d7a13465a7a7d Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Tue, 13 Oct 2020 16:52:11 -0700 Subject: [PATCH 070/181] mm: remove superfluous __ClearPageActive() To activate a page, mark_page_accessed() always holds a reference on it. It either gets a new reference when adding a page to lru_pvecs.activate_page or reuses an existing one it previously got when it added a page to lru_pvecs.lru_add. So it doesn't call SetPageActive() on a page that doesn't have any reference left. Therefore, the race is impossible these days (I didn't brother to dig into its history). For other paths, namely reclaim and migration, a reference count is always held while calling SetPageActive() on a page. SetPageSlabPfmemalloc() also uses SetPageActive(), but it's irrelevant to LRU pages. Signed-off-by: Yu Zhao Signed-off-by: Andrew Morton Reviewed-by: Yang Shi Cc: Alexander Duyck Cc: David Hildenbrand Cc: Huang Ying Cc: Hugh Dickins Cc: Joonsoo Kim Cc: Mel Gorman Cc: Michal Hocko Cc: Nicholas Piggin Cc: Qian Cai Link: http://lkml.kernel.org/r/20200818184704.3625199-2-yuzhao@google.com Signed-off-by: Linus Torvalds --- mm/memremap.c | 2 -- mm/swap.c | 2 -- 2 files changed, 4 deletions(-) diff --git a/mm/memremap.c b/mm/memremap.c index 532ec3d36ab4..7dc7aec971de 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -494,8 +494,6 @@ void free_devmap_managed_page(struct page *page) return; } - /* Clear Active bit in case of parallel mark_page_accessed */ - __ClearPageActive(page); __ClearPageWaiters(page); mem_cgroup_uncharge(page); diff --git a/mm/swap.c b/mm/swap.c index 82ddefda4904..8c936404f254 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -943,8 +943,6 @@ void release_pages(struct page **pages, int nr) del_page_from_lru_list(page, lruvec, page_off_lru(page)); } - /* Clear Active bit in case of parallel mark_page_accessed */ - __ClearPageActive(page); __ClearPageWaiters(page); list_add(&page->lru, &pages_to_free); From a3e7bea060724c760906218f1c232c75ecb0c767 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Tue, 13 Oct 2020 16:52:15 -0700 Subject: [PATCH 071/181] mm/swap.c: fix confusing comment in release_pages() Since commit 07d802699528 ("mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages"), we have renamed the func put_devmap_managed_page() to page_is_devmap_managed(). Signed-off-by: Miaohe Lin Signed-off-by: Andrew Morton Cc: John Hubbard Link: https://lkml.kernel.org/r/20200905084453.19353-1-linmiaohe@huawei.com Signed-off-by: Linus Torvalds --- mm/swap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/swap.c b/mm/swap.c index 8c936404f254..43288a0e11bc 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -902,7 +902,7 @@ void release_pages(struct page **pages, int nr) } /* * ZONE_DEVICE pages that return 'false' from - * put_devmap_managed_page() do not require special + * page_is_devmap_managed() do not require special * processing, and instead, expect a call to * put_page_testzero(). */ From f3bc52cb04bcfccd12da3f03ca8bc50484898436 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Tue, 13 Oct 2020 16:52:18 -0700 Subject: [PATCH 072/181] mm/swap_slots.c: remove always zero and unused return value of enable_swap_slots_cache() enable_swap_slots_cache() always return zero and its return value is just ignored by the caller. So make enable_swap_slots_cache() void. Signed-off-by: Miaohe Lin Signed-off-by: Andrew Morton Link: https://lkml.kernel.org/r/20200924113554.50614-1-linmiaohe@huawei.com Signed-off-by: Linus Torvalds --- include/linux/swap_slots.h | 2 +- mm/swap_slots.c | 3 +-- 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h index e36b200c2a77..347f1a304190 100644 --- a/include/linux/swap_slots.h +++ b/include/linux/swap_slots.h @@ -23,7 +23,7 @@ struct swap_slots_cache { void disable_swap_slots_cache_lock(void); void reenable_swap_slots_cache_unlock(void); -int enable_swap_slots_cache(void); +void enable_swap_slots_cache(void); int free_swap_slot(swp_entry_t entry); extern bool swap_slot_cache_enabled; diff --git a/mm/swap_slots.c b/mm/swap_slots.c index 3e6453573a89..0357fbe70645 100644 --- a/mm/swap_slots.c +++ b/mm/swap_slots.c @@ -237,7 +237,7 @@ static int free_slot_cache(unsigned int cpu) return 0; } -int enable_swap_slots_cache(void) +void enable_swap_slots_cache(void) { mutex_lock(&swap_slots_cache_enable_mutex); if (!swap_slot_cache_initialized) { @@ -255,7 +255,6 @@ int enable_swap_slots_cache(void) __reenable_swap_slots_cache(); out_unlock: mutex_unlock(&swap_slots_cache_enable_mutex); - return 0; } /* called with swap slot cache's alloc lock held */ From 548d9782bd844048bc6b5159c848772a5fe3da32 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Tue, 13 Oct 2020 16:52:21 -0700 Subject: [PATCH 073/181] mm/page_io.c: remove useless out label in __swap_writepage() The out label is only used in one place and return ret directly without something like resource cleanup or lock release and so on. So we should remove this jump label and do some cleanup. Signed-off-by: Miaohe Lin Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: https://lkml.kernel.org/r/20200927124032.22521-1-linmiaohe@huawei.com Signed-off-by: Linus Torvalds --- mm/page_io.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/mm/page_io.c b/mm/page_io.c index 2ffe4c4a6d97..433df1263349 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -359,13 +359,11 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc, return 0; } - ret = 0; bio = get_swap_bio(GFP_NOIO, page, end_write_func); if (bio == NULL) { set_page_dirty(page); unlock_page(page); - ret = -ENOMEM; - goto out; + return -ENOMEM; } bio->bi_opf = REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc); bio_associate_blkg_from_page(bio, page); @@ -373,8 +371,8 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc, set_page_writeback(page); unlock_page(page); submit_bio(bio); -out: - return ret; + + return 0; } int swap_readpage(struct page *page, bool synchronous) From 12eab4289d3203be384d0c0733670c2c9daa1880 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Tue, 13 Oct 2020 16:52:24 -0700 Subject: [PATCH 074/181] mm/swap.c: fix incomplete comment in lru_cache_add_inactive_or_unevictable() Since commit 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping pagevecs"), unevictable pages do not goes directly back onto zone's unevictable list. Signed-off-by: Miaohe Lin Signed-off-by: Andrew Morton Cc: Shakeel Butt Link: https://lkml.kernel.org/r/20200927122209.59328-1-linmiaohe@huawei.com Signed-off-by: Linus Torvalds --- mm/swap.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index 43288a0e11bc..f41ccd8eae94 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -481,9 +481,7 @@ EXPORT_SYMBOL(lru_cache_add); * @vma: vma in which page is mapped for determining reclaimability * * Place @page on the inactive or unevictable LRU list, depending on its - * evictability. Note that if the page is not evictable, it goes - * directly back onto it's zone's unevictable list, it does NOT use a - * per cpu pagevec. + * evictability. */ void lru_cache_add_inactive_or_unevictable(struct page *page, struct vm_area_struct *vma) From 7a3d52e45e00dd1cb1f2e81661e53e324e0a7b82 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Tue, 13 Oct 2020 16:52:27 -0700 Subject: [PATCH 075/181] mm/swapfile.c: remove unnecessary goto out in _swap_info_get() It's unnecessary to goto the out label while out label is just below. Signed-off-by: Miaohe Lin Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: https://lkml.kernel.org/r/20200930102549.1885-1-linmiaohe@huawei.com Signed-off-by: Linus Torvalds --- mm/swapfile.c | 1 - 1 file changed, 1 deletion(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index d5c19d0baf06..b23090c912f4 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1184,7 +1184,6 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry) bad_free: pr_err("swap_info_get: %s%08lx\n", Unused_offset, entry.val); - goto out; out: return NULL; } From 822bca52ee7eb279acfba261a423ed7ac47d6f73 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Tue, 13 Oct 2020 16:52:30 -0700 Subject: [PATCH 076/181] mm/swapfile.c: fix potential memory leak in sys_swapon If we failed to drain inode, we would forget to free the swap address space allocated by init_swap_address_space() above. Fixes: dc617f29dbe5 ("vfs: don't allow writes to swap files") Signed-off-by: Miaohe Lin Signed-off-by: Andrew Morton Reviewed-by: Darrick J. Wong Link: https://lkml.kernel.org/r/20200930101803.53884-1-linmiaohe@huawei.com Signed-off-by: Linus Torvalds --- mm/swapfile.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index b23090c912f4..c4a613688a17 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -3342,7 +3342,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) error = inode_drain_writes(inode); if (error) { inode->i_flags &= ~S_SWAPFILE; - goto bad_swap_unlock_inode; + goto free_swap_address_space; } mutex_lock(&swapon_mutex); @@ -3367,6 +3367,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) error = 0; goto out; +free_swap_address_space: + exit_swap_address_space(p->type); bad_swap_unlock_inode: inode_unlock(inode); bad_swap: From 433e7d3177544c8cf0b6375abd310b0ef023fe9d Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Tue, 13 Oct 2020 16:52:33 -0700 Subject: [PATCH 077/181] mm/memremap.c: convert devmap static branch to {inc,dec} While reviewing Protection Key Supervisor support it was pointed out that using a counter to track static branch enable was an anti-pattern which was better solved using the provided static_branch_{inc,dec} functions.[1] Fix up devmap_managed_key to work the same way. Also this should be safer because there is a very small (very unlikely) race when multiple callers try to enable at the same time. [1] https://lore.kernel.org/lkml/20200714194031.GI5523@worktop.programming.kicks-ass.net/ Signed-off-by: Ira Weiny Signed-off-by: Andrew Morton Reviewed-by: William Kucharski Cc: Dan Williams Cc: Vishal Verma Link: https://lkml.kernel.org/r/20200810235319.2796597-1-ira.weiny@intel.com Signed-off-by: Linus Torvalds --- mm/memremap.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/mm/memremap.c b/mm/memremap.c index 7dc7aec971de..198083453182 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -40,12 +40,10 @@ EXPORT_SYMBOL_GPL(memremap_compat_align); #ifdef CONFIG_DEV_PAGEMAP_OPS DEFINE_STATIC_KEY_FALSE(devmap_managed_key); EXPORT_SYMBOL(devmap_managed_key); -static atomic_t devmap_managed_enable; static void devmap_managed_enable_put(void) { - if (atomic_dec_and_test(&devmap_managed_enable)) - static_branch_disable(&devmap_managed_key); + static_branch_dec(&devmap_managed_key); } static int devmap_managed_enable_get(struct dev_pagemap *pgmap) @@ -56,8 +54,7 @@ static int devmap_managed_enable_get(struct dev_pagemap *pgmap) return -EINVAL; } - if (atomic_inc_return(&devmap_managed_enable) == 1) - static_branch_enable(&devmap_managed_key); + static_branch_inc(&devmap_managed_key); return 0; } #else From e90342e6d26ab6d1c38c8f5804492d76e3e9b4e9 Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Tue, 13 Oct 2020 16:52:36 -0700 Subject: [PATCH 078/181] mm: memcontrol: use flex_array_size() helper in memcpy() Make use of the flex_array_size() helper to calculate the size of a flexible array member within an enclosing structure. This helper offers defense-in-depth against potential integer overflows, while at the same time makes it explicitly clear that we are dealing with a flexible array member. Also, remove unnecessary braces. Signed-off-by: Gustavo A. R. Silva Signed-off-by: Andrew Morton Acked-by: Michal Hocko Cc: Johannes Weiner Cc: Vladimir Davydov Link: https://lkml.kernel.org/r/ddd60dae2d9aea1ccdd2be66634815c93696125e.1596214831.git.gustavoars@kernel.org Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d8bdbe1e8ff8..e3565ec02401 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4255,10 +4255,9 @@ static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg, new->size = size; /* Copy thresholds (if any) to new array */ - if (thresholds->primary) { - memcpy(new->entries, thresholds->primary->entries, (size - 1) * - sizeof(struct mem_cgroup_threshold)); - } + if (thresholds->primary) + memcpy(new->entries, thresholds->primary->entries, + flex_array_size(new, entries, size - 1)); /* Add new threshold */ new->entries[size - 1].eventfd = eventfd; From 61e604e636ab9614de49df9149ef92cae17e9701 Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Tue, 13 Oct 2020 16:52:39 -0700 Subject: [PATCH 079/181] mm: memcontrol: use the preferred form for passing the size of a structure type Use the preferred form for passing the size of a structure type. The alternative form where the structure type is spelled out hurts readability and introduces an opportunity for a bug when the object type is changed but the corresponding object identifier to which the sizeof operator is applied is not. Signed-off-by: Gustavo A. R. Silva Signed-off-by: Andrew Morton Acked-by: Michal Hocko Cc: Johannes Weiner Cc: Vladimir Davydov Link: https://lkml.kernel.org/r/773e013ff2f07fe2a0b47153f14dea054c0c04f1.1596214831.git.gustavoars@kernel.org Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e3565ec02401..e71e8440de6b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4264,7 +4264,7 @@ static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg, new->entries[size - 1].threshold = threshold; /* Sort thresholds. Registering of new threshold isn't time-critical */ - sort(new->entries, size, sizeof(struct mem_cgroup_threshold), + sort(new->entries, size, sizeof(*new->entries), compare_thresholds, NULL); /* Find current threshold */ From 19b629c9795bfe67bf77be8fb611b84424b56d91 Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Tue, 13 Oct 2020 16:52:42 -0700 Subject: [PATCH 080/181] mm: memcg/slab: fix racy access to page->mem_cgroup in mem_cgroup_from_obj() mem_cgroup_from_obj() checks the lowest bit of the page->mem_cgroup pointer to determine if the page has an attached obj_cgroup vector instead of a regular memcg pointer. If it's not set, it simple returns the page->mem_cgroup value as a struct mem_cgroup pointer. The commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations") changed the moment when this bit is set: if previously it was set on the allocation of the slab page, now it can be set well after, when the first accounted object is allocated on this page. It opened a race: if page->mem_cgroup is set concurrently after the first page_has_obj_cgroups(page) check, a pointer to the obj_cgroups array can be returned as a memory cgroup pointer. A simple check for page->mem_cgroup pointer for NULL before the page_has_obj_cgroups() check fixes the race. Indeed, if the pointer is not NULL, it's either a simple mem_cgroup pointer or a pointer to obj_cgroup vector. The pointer can be asynchronously changed from NULL to (obj_cgroup_vec | 0x1UL), but can't be changed from a valid memcg pointer to objcg vector or back. If the object passed to mem_cgroup_from_obj() is a slab object and page->mem_cgroup is NULL, it means that the object is not accounted, so the function must return NULL. I've discovered the race looking at the code, so far I haven't seen it in the wild. Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations") Signed-off-by: Roman Gushchin Signed-off-by: Andrew Morton Reviewed-by: Shakeel Butt Cc: Johannes Weiner Cc: Vlastimil Babka Link: https://lkml.kernel.org/r/20200910022435.2773735-1-guro@fb.com Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e71e8440de6b..ba9f5404b8cf 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2887,6 +2887,17 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p) page = virt_to_head_page(p); + /* + * If page->mem_cgroup is set, it's either a simple mem_cgroup pointer + * or a pointer to obj_cgroup vector. In the latter case the lowest + * bit of the pointer is set. + * The page->mem_cgroup pointer can be asynchronously changed + * from NULL to (obj_cgroup_vec | 0x1UL), but can't be changed + * from a valid memcg pointer to objcg vector or back. + */ + if (!page->mem_cgroup) + return NULL; + /* * Slab objects are accounted individually, not per-page. * Memcg membership data for each individual object is saved in From 05bdc520b3ad39d216efc52112bc59be2e975299 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Tue, 13 Oct 2020 16:52:45 -0700 Subject: [PATCH 081/181] mm: memcontrol: correct the comment of mem_cgroup_iter() Since commit bbec2e15170a ("mm: rename page_counter's count/limit into usage/max"), the arg @reclaim has no priority field anymore. Signed-off-by: Miaohe Lin Signed-off-by: Andrew Morton Acked-by: Michal Hocko Cc: Johannes Weiner Cc: Vladimir Davydov Link: https://lkml.kernel.org/r/20200913094129.44558-1-linmiaohe@huawei.com Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ba9f5404b8cf..283aaf0864de 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1102,9 +1102,9 @@ static __always_inline struct mem_cgroup *get_mem_cgroup_from_current(void) * invocations for reference counting, or use mem_cgroup_iter_break() * to cancel a hierarchy walk before the round-trip is complete. * - * Reclaimers can specify a node and a priority level in @reclaim to - * divide up the memcgs in the hierarchy among all concurrent - * reclaimers operating on the same node and priority. + * Reclaimers can specify a node in @reclaim to divide up the memcgs + * in the hierarchy among all concurrent reclaimers operating on the + * same node. */ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup *prev, From f9f84ec56f7e370cc6fc478b7d09fbf41de970ea Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Tue, 13 Oct 2020 16:52:49 -0700 Subject: [PATCH 082/181] mm/memcg: clean up obsolete enum charge_type Patch series "mm/memcg: Miscellaneous cleanups and streamlining", v2. This patch (of 3): Since commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API") and commit 00501b531c47 ("mm: memcontrol: rewrite charge API") in v3.17, the enum charge_type was no longer used anywhere. However, the enum itself was not removed at that time. Remove the obsolete enum charge_type now. Signed-off-by: Waiman Long Signed-off-by: Andrew Morton Reviewed-by: Shakeel Butt Acked-by: Johannes Weiner Acked-by: Michal Hocko Acked-by: Chris Down Cc: Vladimir Davydov Cc: Tejun Heo Cc: Roman Gushchin Cc: Yafang Shao Link: https://lkml.kernel.org/r/20200914024452.19167-1-longman@redhat.com Link: https://lkml.kernel.org/r/20200914024452.19167-2-longman@redhat.com Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 8 -------- 1 file changed, 8 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 283aaf0864de..09e851f0bee3 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -197,14 +197,6 @@ static struct move_charge_struct { #define MEM_CGROUP_MAX_RECLAIM_LOOPS 100 #define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS 2 -enum charge_type { - MEM_CGROUP_CHARGE_TYPE_CACHE = 0, - MEM_CGROUP_CHARGE_TYPE_ANON, - MEM_CGROUP_CHARGE_TYPE_SWAPOUT, /* for accounting swapcache */ - MEM_CGROUP_CHARGE_TYPE_DROP, /* a page was unused swap cache */ - NR_CHARGE_TYPE, -}; - /* for encoding cft->private value on file */ enum res_type { _MEM, From 8d387a5f172f26ff8c76096d5876b881dec6b7ce Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Tue, 13 Oct 2020 16:52:52 -0700 Subject: [PATCH 083/181] mm/memcg: simplify mem_cgroup_get_max() mem_cgroup_get_max() used to get memory+swap max from both the v1 memsw and v2 memory+swap page counters & return the maximum of these 2 values. This is redundant and it is more efficient to just get either the v1 or the v2 values depending on which one is currently in use. [longman@redhat.com: v4] Link: https://lkml.kernel.org/r/20200914150928.7841-1-longman@redhat.com Signed-off-by: Waiman Long Signed-off-by: Andrew Morton Reviewed-by: Shakeel Butt Acked-by: Michal Hocko Cc: Chris Down Cc: Johannes Weiner Cc: Roman Gushchin Cc: Tejun Heo Cc: Vladimir Davydov Cc: Yafang Shao Link: https://lkml.kernel.org/r/20200914024452.19167-3-longman@redhat.com Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 09e851f0bee3..962f8d649b83 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1633,17 +1633,19 @@ void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg) */ unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg) { - unsigned long max; + unsigned long max = READ_ONCE(memcg->memory.max); - max = READ_ONCE(memcg->memory.max); - if (mem_cgroup_swappiness(memcg)) { - unsigned long memsw_max; - unsigned long swap_max; + if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) { + if (mem_cgroup_swappiness(memcg)) + max += min(READ_ONCE(memcg->swap.max), + (unsigned long)total_swap_pages); + } else { /* v1 */ + if (mem_cgroup_swappiness(memcg)) { + /* Calculate swap excess capacity from memsw limit */ + unsigned long swap = READ_ONCE(memcg->memsw.max) - max; - memsw_max = memcg->memsw.max; - swap_max = READ_ONCE(memcg->swap.max); - swap_max = min(swap_max, (unsigned long)total_swap_pages); - max = min(max + swap_max, memsw_max); + max += min(swap, (unsigned long)total_swap_pages); + } } return max; } From bd0b230fe14554bfffbae54e19038716f96f5a41 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Tue, 13 Oct 2020 16:52:56 -0700 Subject: [PATCH 084/181] mm/memcg: unify swap and memsw page counters The swap page counter is v2 only while memsw is v1 only. As v1 and v2 controllers cannot be active at the same time, there is no point to keep both swap and memsw page counters in mem_cgroup. The previous patch has made sure that memsw page counter is updated and accessed only when in v1 code paths. So it is now safe to alias the v1 memsw page counter to v2 swap page counter. This saves 14 long's in the size of mem_cgroup. This is a saving of 112 bytes for 64-bit archs. While at it, also document which page counters are used in v1 and/or v2. Signed-off-by: Waiman Long Signed-off-by: Andrew Morton Reviewed-by: Shakeel Butt Acked-by: Michal Hocko Cc: Chris Down Cc: Johannes Weiner Cc: Roman Gushchin Cc: Tejun Heo Cc: Vladimir Davydov Cc: Yafang Shao Link: https://lkml.kernel.org/r/20200914024452.19167-4-longman@redhat.com Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 13 ++++++++----- mm/memcontrol.c | 3 --- 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d0b036123c6a..6ef4a552e09d 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -215,13 +215,16 @@ struct mem_cgroup { struct mem_cgroup_id id; /* Accounted resources */ - struct page_counter memory; - struct page_counter swap; + struct page_counter memory; /* Both v1 & v2 */ + + union { + struct page_counter swap; /* v2 only */ + struct page_counter memsw; /* v1 only */ + }; /* Legacy consumer-oriented counters */ - struct page_counter memsw; - struct page_counter kmem; - struct page_counter tcpmem; + struct page_counter kmem; /* v1 only */ + struct page_counter tcpmem; /* v1 only */ /* Range enforcement for interrupt charges */ struct work_struct high_work; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 962f8d649b83..a0bfc92804b7 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5295,13 +5295,11 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) memcg->use_hierarchy = true; page_counter_init(&memcg->memory, &parent->memory); page_counter_init(&memcg->swap, &parent->swap); - page_counter_init(&memcg->memsw, &parent->memsw); page_counter_init(&memcg->kmem, &parent->kmem); page_counter_init(&memcg->tcpmem, &parent->tcpmem); } else { page_counter_init(&memcg->memory, NULL); page_counter_init(&memcg->swap, NULL); - page_counter_init(&memcg->memsw, NULL); page_counter_init(&memcg->kmem, NULL); page_counter_init(&memcg->tcpmem, NULL); /* @@ -5430,7 +5428,6 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css) page_counter_set_max(&memcg->memory, PAGE_COUNTER_MAX); page_counter_set_max(&memcg->swap, PAGE_COUNTER_MAX); - page_counter_set_max(&memcg->memsw, PAGE_COUNTER_MAX); page_counter_set_max(&memcg->kmem, PAGE_COUNTER_MAX); page_counter_set_max(&memcg->tcpmem, PAGE_COUNTER_MAX); page_counter_set_min(&memcg->memory, 0); From 5f9a4f4a709608fc15197368464a6c8ed4e3630a Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Tue, 13 Oct 2020 16:52:59 -0700 Subject: [PATCH 085/181] mm: memcontrol: add the missing numa_stat interface for cgroup v2 In the cgroup v1, we have a numa_stat interface. This is useful for providing visibility into the numa locality information within an memcg since the pages are allowed to be allocated from any physical node. One of the use cases is evaluating application performance by combining this information with the application's CPU allocation. But the cgroup v2 does not. So this patch adds the missing information. Suggested-by: Shakeel Butt Signed-off-by: Muchun Song Signed-off-by: Andrew Morton Reviewed-by: Shakeel Butt Cc: Zefan Li Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Michal Hocko Cc: Vladimir Davydov Cc: Roman Gushchin Cc: Randy Dunlap Link: https://lkml.kernel.org/r/20200916100030.71698-2-songmuchun@bytedance.com Signed-off-by: Linus Torvalds --- Documentation/admin-guide/cgroup-v2.rst | 69 +++++++--- mm/memcontrol.c | 170 +++++++++++++++--------- 2 files changed, 159 insertions(+), 80 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index baa07b30845e..608d7c279396 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1259,6 +1259,10 @@ PAGE_SIZE multiple when read back. can show up in the middle. Don't rely on items remaining in a fixed position; use the keys to look up specific values! + If the entry has no per-node counter(or not show in the + mempry.numa_stat). We use 'npn'(non-per-node) as the tag + to indicate that it will not show in the mempry.numa_stat. + anon Amount of memory used in anonymous mappings such as brk(), sbrk(), and mmap(MAP_ANONYMOUS) @@ -1270,15 +1274,11 @@ PAGE_SIZE multiple when read back. kernel_stack Amount of memory allocated to kernel stacks. - slab - Amount of memory used for storing in-kernel data - structures. - - percpu + percpu(npn) Amount of memory used for storing per-cpu kernel data structures. - sock + sock(npn) Amount of memory used in network transmission buffers shmem @@ -1318,11 +1318,9 @@ PAGE_SIZE multiple when read back. Part of "slab" that cannot be reclaimed on memory pressure. - pgfault - Total number of page faults incurred - - pgmajfault - Number of major page faults incurred + slab(npn) + Amount of memory used for storing in-kernel data + structures. workingset_refault_anon Number of refaults of previously evicted anonymous pages. @@ -1348,37 +1346,68 @@ PAGE_SIZE multiple when read back. workingset_nodereclaim Number of times a shadow node has been reclaimed - pgrefill + pgfault(npn) + Total number of page faults incurred + + pgmajfault(npn) + Number of major page faults incurred + + pgrefill(npn) Amount of scanned pages (in an active LRU list) - pgscan + pgscan(npn) Amount of scanned pages (in an inactive LRU list) - pgsteal + pgsteal(npn) Amount of reclaimed pages - pgactivate + pgactivate(npn) Amount of pages moved to the active LRU list - pgdeactivate + pgdeactivate(npn) Amount of pages moved to the inactive LRU list - pglazyfree + pglazyfree(npn) Amount of pages postponed to be freed under memory pressure - pglazyfreed + pglazyfreed(npn) Amount of reclaimed lazyfree pages - thp_fault_alloc + thp_fault_alloc(npn) Number of transparent hugepages which were allocated to satisfy a page fault. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE is not set. - thp_collapse_alloc + thp_collapse_alloc(npn) Number of transparent hugepages which were allocated to allow collapsing an existing range of pages. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE is not set. + memory.numa_stat + A read-only nested-keyed file which exists on non-root cgroups. + + This breaks down the cgroup's memory footprint into different + types of memory, type-specific details, and other information + per node on the state of the memory management system. + + This is useful for providing visibility into the NUMA locality + information within an memcg since the pages are allowed to be + allocated from any physical node. One of the use case is evaluating + application performance by combining this information with the + application's CPU allocation. + + All memory amounts are in bytes. + + The output format of memory.numa_stat is:: + + type N0= N1= ... + + The entries are ordered to be human readable, and new entries + can show up in the middle. Don't rely on items remaining in a + fixed position; use the keys to look up specific values! + + The entries can refer to the memory.stat. + memory.swap.current A read-only single value file which exists on non-root cgroups. diff --git a/mm/memcontrol.c b/mm/memcontrol.c index a0bfc92804b7..e9fa32a943c5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1448,6 +1448,70 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg) return false; } +struct memory_stat { + const char *name; + unsigned int ratio; + unsigned int idx; +}; + +static struct memory_stat memory_stats[] = { + { "anon", PAGE_SIZE, NR_ANON_MAPPED }, + { "file", PAGE_SIZE, NR_FILE_PAGES }, + { "kernel_stack", 1024, NR_KERNEL_STACK_KB }, + { "percpu", 1, MEMCG_PERCPU_B }, + { "sock", PAGE_SIZE, MEMCG_SOCK }, + { "shmem", PAGE_SIZE, NR_SHMEM }, + { "file_mapped", PAGE_SIZE, NR_FILE_MAPPED }, + { "file_dirty", PAGE_SIZE, NR_FILE_DIRTY }, + { "file_writeback", PAGE_SIZE, NR_WRITEBACK }, +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + /* + * The ratio will be initialized in memory_stats_init(). Because + * on some architectures, the macro of HPAGE_PMD_SIZE is not + * constant(e.g. powerpc). + */ + { "anon_thp", 0, NR_ANON_THPS }, +#endif + { "inactive_anon", PAGE_SIZE, NR_INACTIVE_ANON }, + { "active_anon", PAGE_SIZE, NR_ACTIVE_ANON }, + { "inactive_file", PAGE_SIZE, NR_INACTIVE_FILE }, + { "active_file", PAGE_SIZE, NR_ACTIVE_FILE }, + { "unevictable", PAGE_SIZE, NR_UNEVICTABLE }, + + /* + * Note: The slab_reclaimable and slab_unreclaimable must be + * together and slab_reclaimable must be in front. + */ + { "slab_reclaimable", 1, NR_SLAB_RECLAIMABLE_B }, + { "slab_unreclaimable", 1, NR_SLAB_UNRECLAIMABLE_B }, + + /* The memory events */ + { "workingset_refault_anon", 1, WORKINGSET_REFAULT_ANON }, + { "workingset_refault_file", 1, WORKINGSET_REFAULT_FILE }, + { "workingset_activate_anon", 1, WORKINGSET_ACTIVATE_ANON }, + { "workingset_activate_file", 1, WORKINGSET_ACTIVATE_FILE }, + { "workingset_restore_anon", 1, WORKINGSET_RESTORE_ANON }, + { "workingset_restore_file", 1, WORKINGSET_RESTORE_FILE }, + { "workingset_nodereclaim", 1, WORKINGSET_NODERECLAIM }, +}; + +static int __init memory_stats_init(void) +{ + int i; + + for (i = 0; i < ARRAY_SIZE(memory_stats); i++) { +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + if (memory_stats[i].idx == NR_ANON_THPS) + memory_stats[i].ratio = HPAGE_PMD_SIZE; +#endif + VM_BUG_ON(!memory_stats[i].ratio); + VM_BUG_ON(memory_stats[i].idx >= MEMCG_NR_STAT); + } + + return 0; +} +pure_initcall(memory_stats_init); + static char *memory_stat_format(struct mem_cgroup *memcg) { struct seq_buf s; @@ -1468,52 +1532,19 @@ static char *memory_stat_format(struct mem_cgroup *memcg) * Current memory state: */ - seq_buf_printf(&s, "anon %llu\n", - (u64)memcg_page_state(memcg, NR_ANON_MAPPED) * - PAGE_SIZE); - seq_buf_printf(&s, "file %llu\n", - (u64)memcg_page_state(memcg, NR_FILE_PAGES) * - PAGE_SIZE); - seq_buf_printf(&s, "kernel_stack %llu\n", - (u64)memcg_page_state(memcg, NR_KERNEL_STACK_KB) * - 1024); - seq_buf_printf(&s, "slab %llu\n", - (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) + - memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B))); - seq_buf_printf(&s, "percpu %llu\n", - (u64)memcg_page_state(memcg, MEMCG_PERCPU_B)); - seq_buf_printf(&s, "sock %llu\n", - (u64)memcg_page_state(memcg, MEMCG_SOCK) * - PAGE_SIZE); + for (i = 0; i < ARRAY_SIZE(memory_stats); i++) { + u64 size; - seq_buf_printf(&s, "shmem %llu\n", - (u64)memcg_page_state(memcg, NR_SHMEM) * - PAGE_SIZE); - seq_buf_printf(&s, "file_mapped %llu\n", - (u64)memcg_page_state(memcg, NR_FILE_MAPPED) * - PAGE_SIZE); - seq_buf_printf(&s, "file_dirty %llu\n", - (u64)memcg_page_state(memcg, NR_FILE_DIRTY) * - PAGE_SIZE); - seq_buf_printf(&s, "file_writeback %llu\n", - (u64)memcg_page_state(memcg, NR_WRITEBACK) * - PAGE_SIZE); + size = memcg_page_state(memcg, memory_stats[i].idx); + size *= memory_stats[i].ratio; + seq_buf_printf(&s, "%s %llu\n", memory_stats[i].name, size); -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - seq_buf_printf(&s, "anon_thp %llu\n", - (u64)memcg_page_state(memcg, NR_ANON_THPS) * - HPAGE_PMD_SIZE); -#endif - - for (i = 0; i < NR_LRU_LISTS; i++) - seq_buf_printf(&s, "%s %llu\n", lru_list_name(i), - (u64)memcg_page_state(memcg, NR_LRU_BASE + i) * - PAGE_SIZE); - - seq_buf_printf(&s, "slab_reclaimable %llu\n", - (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B)); - seq_buf_printf(&s, "slab_unreclaimable %llu\n", - (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B)); + if (unlikely(memory_stats[i].idx == NR_SLAB_UNRECLAIMABLE_B)) { + size = memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) + + memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B); + seq_buf_printf(&s, "slab %llu\n", size); + } + } /* Accumulated memory events */ @@ -1521,22 +1552,6 @@ static char *memory_stat_format(struct mem_cgroup *memcg) memcg_events(memcg, PGFAULT)); seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGMAJFAULT), memcg_events(memcg, PGMAJFAULT)); - - seq_buf_printf(&s, "workingset_refault_anon %lu\n", - memcg_page_state(memcg, WORKINGSET_REFAULT_ANON)); - seq_buf_printf(&s, "workingset_refault_file %lu\n", - memcg_page_state(memcg, WORKINGSET_REFAULT_FILE)); - seq_buf_printf(&s, "workingset_activate_anon %lu\n", - memcg_page_state(memcg, WORKINGSET_ACTIVATE_ANON)); - seq_buf_printf(&s, "workingset_activate_file %lu\n", - memcg_page_state(memcg, WORKINGSET_ACTIVATE_FILE)); - seq_buf_printf(&s, "workingset_restore_anon %lu\n", - memcg_page_state(memcg, WORKINGSET_RESTORE_ANON)); - seq_buf_printf(&s, "workingset_restore_file %lu\n", - memcg_page_state(memcg, WORKINGSET_RESTORE_FILE)); - seq_buf_printf(&s, "workingset_nodereclaim %lu\n", - memcg_page_state(memcg, WORKINGSET_NODERECLAIM)); - seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGREFILL), memcg_events(memcg, PGREFILL)); seq_buf_printf(&s, "pgscan %lu\n", @@ -6374,6 +6389,35 @@ static int memory_stat_show(struct seq_file *m, void *v) return 0; } +#ifdef CONFIG_NUMA +static int memory_numa_stat_show(struct seq_file *m, void *v) +{ + int i; + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); + + for (i = 0; i < ARRAY_SIZE(memory_stats); i++) { + int nid; + + if (memory_stats[i].idx >= NR_VM_NODE_STAT_ITEMS) + continue; + + seq_printf(m, "%s", memory_stats[i].name); + for_each_node_state(nid, N_MEMORY) { + u64 size; + struct lruvec *lruvec; + + lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid)); + size = lruvec_page_state(lruvec, memory_stats[i].idx); + size *= memory_stats[i].ratio; + seq_printf(m, " N%d=%llu", nid, size); + } + seq_putc(m, '\n'); + } + + return 0; +} +#endif + static int memory_oom_group_show(struct seq_file *m, void *v) { struct mem_cgroup *memcg = mem_cgroup_from_seq(m); @@ -6451,6 +6495,12 @@ static struct cftype memory_files[] = { .name = "stat", .seq_show = memory_stat_show, }, +#ifdef CONFIG_NUMA + { + .name = "numa_stat", + .seq_show = memory_numa_stat_show, + }, +#endif { .name = "oom.group", .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, From d437024e69b8c005d4eec65ee075d6725b4b8e58 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Tue, 13 Oct 2020 16:53:02 -0700 Subject: [PATCH 086/181] mm/page_counter: correct the obsolete func name in the comment of page_counter_try_charge() Since commit bbec2e15170a ("mm: rename page_counter's count/limit into usage/max"), page_counter_limit() is renamed to page_counter_set_max(). So replace page_counter_limit with page_counter_set_max in comment. Signed-off-by: Miaohe Lin Signed-off-by: Andrew Morton Cc: Roman Gushchin Cc: Johannes Weiner Link: https://lkml.kernel.org/r/20200917113629.14382-1-linmiaohe@huawei.com Signed-off-by: Linus Torvalds --- mm/page_counter.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page_counter.c b/mm/page_counter.c index afe22ad335cc..b24a60b28bb0 100644 --- a/mm/page_counter.c +++ b/mm/page_counter.c @@ -109,7 +109,7 @@ bool page_counter_try_charge(struct page_counter *counter, * * The atomic_long_add_return() implies a full memory * barrier between incrementing the count and reading - * the limit. When racing with page_counter_limit(), + * the limit. When racing with page_counter_set_max(), * we either see the new limit or the setter sees the * counter has changed and retries. */ From 7a52d4d88ade00c99db007708bbcc5b9311f9ea4 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Tue, 13 Oct 2020 16:53:05 -0700 Subject: [PATCH 087/181] mm: memcontrol: reword obsolete comment of mem_cgroup_unmark_under_oom() Since commit 79dfdaccd1d5 ("memcg: make oom_lock 0 and 1 based rather than counter"), the mem_cgroup_unmark_under_oom() is added and the comment of the mem_cgroup_oom_unlock() is moved here. But this comment make no sense here because mem_cgroup_oom_lock() does not operate on under_oom field. So we reword the comment as this would be helpful. [Thanks Michal Hocko for rewording this comment.] Signed-off-by: Miaohe Lin Signed-off-by: Andrew Morton Acked-by: Michal Hocko Cc: Johannes Weiner Cc: Vladimir Davydov Link: https://lkml.kernel.org/r/20200930095336.21323-1-linmiaohe@huawei.com Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e9fa32a943c5..c04b57ccefe9 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1826,8 +1826,8 @@ static void mem_cgroup_unmark_under_oom(struct mem_cgroup *memcg) struct mem_cgroup *iter; /* - * When a new child is created while the hierarchy is under oom, - * mem_cgroup_oom_lock() may not be called. Watch for underflow. + * Be careful about under_oom underflows becase a child memcg + * could have been added after mem_cgroup_mark_under_oom. */ spin_lock(&memcg_oom_lock); for_each_mem_cgroup_tree(iter, memcg) From d1b2cf6cb84a9bd0de6f151512648dd1af82f80f Mon Sep 17 00:00:00 2001 From: Bharata B Rao Date: Tue, 13 Oct 2020 16:53:09 -0700 Subject: [PATCH 088/181] mm: memcg/slab: uncharge during kmem_cache_free_bulk() Object cgroup charging is done for all the objects during allocation, but during freeing, uncharging ends up happening for only one object in the case of bulk allocation/freeing. Fix this by having a separate call to uncharge all the objects from kmem_cache_free_bulk() and by modifying memcg_slab_free_hook() to take care of bulk uncharging. Fixes: 964d4bd370d5 ("mm: memcg/slab: save obj_cgroup for non-root slab objects" Signed-off-by: Bharata B Rao Signed-off-by: Andrew Morton Acked-by: Roman Gushchin Cc: Christoph Lameter Cc: David Rientjes Cc: Joonsoo Kim Cc: Vlastimil Babka Cc: Shakeel Butt Cc: Johannes Weiner Cc: Michal Hocko Cc: Tejun Heo Cc: Link: https://lkml.kernel.org/r/20201009060423.390479-1-bharata@linux.ibm.com Signed-off-by: Linus Torvalds --- mm/slab.c | 2 +- mm/slab.h | 42 +++++++++++++++++++++++++++--------------- mm/slub.c | 3 ++- 3 files changed, 30 insertions(+), 17 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index 04bc6a6c48eb..399a9d185b0f 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3438,7 +3438,7 @@ void ___cache_free(struct kmem_cache *cachep, void *objp, memset(objp, 0, cachep->object_size); kmemleak_free_recursive(objp, cachep->flags); objp = cache_free_debugcheck(cachep, objp, caller); - memcg_slab_free_hook(cachep, virt_to_head_page(objp), objp); + memcg_slab_free_hook(cachep, &objp, 1); /* * Skip calling cache_free_alien() when the platform is not numa. diff --git a/mm/slab.h b/mm/slab.h index 6cc323f1313a..6dd4b702888a 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -345,30 +345,42 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s, obj_cgroup_put(objcg); } -static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page, - void *p) +static inline void memcg_slab_free_hook(struct kmem_cache *s_orig, + void **p, int objects) { + struct kmem_cache *s; struct obj_cgroup *objcg; + struct page *page; unsigned int off; + int i; if (!memcg_kmem_enabled()) return; - if (!page_has_obj_cgroups(page)) - return; + for (i = 0; i < objects; i++) { + if (unlikely(!p[i])) + continue; - off = obj_to_index(s, page, p); - objcg = page_obj_cgroups(page)[off]; - page_obj_cgroups(page)[off] = NULL; + page = virt_to_head_page(p[i]); + if (!page_has_obj_cgroups(page)) + continue; - if (!objcg) - return; + if (!s_orig) + s = page->slab_cache; + else + s = s_orig; - obj_cgroup_uncharge(objcg, obj_full_size(s)); - mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s), - -obj_full_size(s)); + off = obj_to_index(s, page, p[i]); + objcg = page_obj_cgroups(page)[off]; + if (!objcg) + continue; - obj_cgroup_put(objcg); + page_obj_cgroups(page)[off] = NULL; + obj_cgroup_uncharge(objcg, obj_full_size(s)); + mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s), + -obj_full_size(s)); + obj_cgroup_put(objcg); + } } #else /* CONFIG_MEMCG_KMEM */ @@ -406,8 +418,8 @@ static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s, { } -static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page, - void *p) +static inline void memcg_slab_free_hook(struct kmem_cache *s, + void **p, int objects) { } #endif /* CONFIG_MEMCG_KMEM */ diff --git a/mm/slub.c b/mm/slub.c index f05900186c0b..61d0d2968413 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3095,7 +3095,7 @@ static __always_inline void do_slab_free(struct kmem_cache *s, struct kmem_cache_cpu *c; unsigned long tid; - memcg_slab_free_hook(s, page, head); + memcg_slab_free_hook(s, &head, 1); redo: /* * Determine the currently cpus per cpu slab. @@ -3257,6 +3257,7 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) if (WARN_ON(!size)) return; + memcg_slab_free_hook(s, p, size); do { struct detached_freelist df; From 9a137153fc8798a89d8fce895cd0a06ea5b8e37c Mon Sep 17 00:00:00 2001 From: Ralph Campbell Date: Tue, 13 Oct 2020 16:53:13 -0700 Subject: [PATCH 089/181] mm/memcg: fix device private memcg accounting The code in mc_handle_swap_pte() checks for non_swap_entry() and returns NULL before checking is_device_private_entry() so device private pages are never handled. Fix this by checking for non_swap_entry() after handling device private swap PTEs. I assume the memory cgroup accounting would be off somehow when moving a process to another memory cgroup. Currently, the device private page is charged like a normal anonymous page when allocated and is uncharged when the page is freed so I think that path is OK. Signed-off-by: Ralph Campbell Signed-off-by: Andrew Morton Acked-by: Johannes Weiner Cc: Michal Hocko Cc: Vladimir Davydov Cc: Jerome Glisse Cc: Balbir Singh Cc: Ira Weiny Link: https://lkml.kernel.org/r/20201009215952.2726-1-rcampbell@nvidia.com xFixes: c733a82874a7 ("mm/memcontrol: support MEMORY_DEVICE_PRIVATE") Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c04b57ccefe9..7f74a158cfa8 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5516,7 +5516,7 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma, struct page *page = NULL; swp_entry_t ent = pte_to_swp_entry(ptent); - if (!(mc.flags & MOVE_ANON) || non_swap_entry(ent)) + if (!(mc.flags & MOVE_ANON)) return NULL; /* @@ -5535,6 +5535,9 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma, return page; } + if (non_swap_entry(ent)) + return NULL; + /* * Because lookup_swap_cache() updates some statistics counter, * we call find_get_page() with swapper_space directly. From efc9511cecf617c828734024f4e1cde5f974f510 Mon Sep 17 00:00:00 2001 From: John Hubbard Date: Tue, 13 Oct 2020 16:53:16 -0700 Subject: [PATCH 090/181] selftests/vm: fix false build success on the second and later attempts Patch series "selftests/vm: fix some minor aggravating factors in the Makefile". This fixes a couple of minor aggravating factors that I ran across while trying to do some changes in selftests/vm. These are simple things, but like most things with GNU Make, it's rarely obvious what's wrong until you understand *the entire Makefile and all of its includes*. So while there is, of course, joy in learning those details, I thought I'd fix these little things, so as to allow others to skip out on the Joy if they so choose. :) First of all, if you have an item (let's choose userfaultfd for an example) that fails to build, you might do this: $ make -j32 # ...you observe a failed item in the threaded output # OK, let's get a closer look $ make # ...but now the build quietly "succeeds". That's what Patch 0001 fixes. Second, if you instead attempt this approach for your closer look (a casual mistake, as it's not supported): $ make userfaultfd # ...userfaultfd fails to link, due to incomplete LDLIBS That's what Patch 0002 fixes. This patch (of 2): If one or more of these selftest fail to build, then after the first failure, subsequent invocations of "make" will make it appear that there are no build failures, after all. That's because the failed build products remain, with up-to-date timestamps, thus tricking Make (and you!) into believing that there's nothing else to build. Fix this by telling Make to delete targets that didn't completely succeed. Signed-off-by: John Hubbard Signed-off-by: Andrew Morton Cc: Shuah Khan Cc: Jason Gunthorpe Link: https://lkml.kernel.org/r/20200915012901.1655280-1-jhubbard@nvidia.com Link: https://lkml.kernel.org/r/20200915012901.1655280-2-jhubbard@nvidia.com Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/Makefile | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile index a9026706d597..9f2625bebf07 100644 --- a/tools/testing/selftests/vm/Makefile +++ b/tools/testing/selftests/vm/Makefile @@ -3,6 +3,11 @@ uname_M := $(shell uname -m 2>/dev/null || echo not) MACHINE ?= $(shell echo $(uname_M) | sed -e 's/aarch64.*/arm64/') +# Without this, failed build products remain, with up-to-date timestamps, +# thus tricking Make (and you!) into believing that All Is Well, in subsequent +# make invocations: +.DELETE_ON_ERROR: + CFLAGS = -Wall -I ../../../../usr/include $(EXTRA_CFLAGS) LDLIBS = -lrt TEST_GEN_FILES = compaction_test From 34d109131f485eccd3f7e3050581eb73bffa3520 Mon Sep 17 00:00:00 2001 From: John Hubbard Date: Tue, 13 Oct 2020 16:53:19 -0700 Subject: [PATCH 091/181] selftests/vm: fix incorrect gcc invocation in some cases Avoid accidental wrong builds, due to built-in rules working just a little bit too well--but not quite as well as required for our situation here. In other words, "make userfaultfd" (for example) is supposed to fail to build at all, because this Makefile only supports either "make" (all), or "make /full/path". However, the built-in rules, if not suppressed, will pick up CFLAGS and the initial LDLIBS (but not the target-specific LDLIBS, because those are only set for the full path target!). This causes it to get pretty far into building things despite using incorrect values such as an *occasionally* incomplete LDLIBS value. Signed-off-by: John Hubbard Signed-off-by: Andrew Morton Cc: Shuah Khan Cc: Jason Gunthorpe Link: https://lkml.kernel.org/r/20200915012901.1655280-3-jhubbard@nvidia.com Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/Makefile | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile index 9f2625bebf07..30873b19d04b 100644 --- a/tools/testing/selftests/vm/Makefile +++ b/tools/testing/selftests/vm/Makefile @@ -8,6 +8,18 @@ MACHINE ?= $(shell echo $(uname_M) | sed -e 's/aarch64.*/arm64/') # make invocations: .DELETE_ON_ERROR: +# Avoid accidental wrong builds, due to built-in rules working just a little +# bit too well--but not quite as well as required for our situation here. +# +# In other words, "make userfaultfd" is supposed to fail to build at all, +# because this Makefile only supports either "make" (all), or "make /full/path". +# However, the built-in rules, if not suppressed, will pick up CFLAGS and the +# initial LDLIBS (but not the target-specific LDLIBS, because those are only +# set for the full path target!). This causes it to get pretty far into building +# things despite using incorrect values such as an *occasionally* incomplete +# LDLIBS. +MAKEFLAGS += --no-builtin-rules + CFLAGS = -Wall -I ../../../../usr/include $(EXTRA_CFLAGS) LDLIBS = -lrt TEST_GEN_FILES = compaction_test From b2b29d6d01194404dfef4eafa026959be301705b Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Tue, 13 Oct 2020 16:53:22 -0700 Subject: [PATCH 092/181] mm: account PMD tables like PTE tables We account the PTE level of the page tables to the process in order to make smarter OOM decisions and help diagnose why memory is fragmented. For these same reasons, we should account pages allocated for PMDs. With larger process address spaces and ASLR, the number of PMDs in use is higher than it used to be so the inaccuracy is starting to matter. [rppt@linux.ibm.com: arm: __pmd_free_tlb(): call page table destructor] Link: https://lkml.kernel.org/r/20200825111303.GB69694@linux.ibm.com Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton Reviewed-by: Mike Rapoport Cc: Abdul Haleem Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Christophe Leroy Cc: Joerg Roedel Cc: Max Filippov Cc: Peter Zijlstra Cc: Satheesh Rajendran Cc: Stafford Horne Cc: Naresh Kamboju Cc: Anders Roxell Link: http://lkml.kernel.org/r/20200627184642.GF25039@casper.infradead.org Signed-off-by: Linus Torvalds --- arch/arm/include/asm/tlb.h | 1 + include/linux/mm.h | 24 ++++++++++++++++++++---- 2 files changed, 21 insertions(+), 4 deletions(-) diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h index 9415222b49ad..b8cbe03ad260 100644 --- a/arch/arm/include/asm/tlb.h +++ b/arch/arm/include/asm/tlb.h @@ -59,6 +59,7 @@ __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp, unsigned long addr) #ifdef CONFIG_ARM_LPAE struct page *page = virt_to_page(pmdp); + pgtable_pmd_page_dtor(page); tlb_remove_table(tlb, page); #endif } diff --git a/include/linux/mm.h b/include/linux/mm.h index 9cc0894e7d61..5320e7ab843f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2254,7 +2254,7 @@ static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd) return ptlock_ptr(pmd_to_page(pmd)); } -static inline bool pgtable_pmd_page_ctor(struct page *page) +static inline bool pmd_ptlock_init(struct page *page) { #ifdef CONFIG_TRANSPARENT_HUGEPAGE page->pmd_huge_pte = NULL; @@ -2262,7 +2262,7 @@ static inline bool pgtable_pmd_page_ctor(struct page *page) return ptlock_init(page); } -static inline void pgtable_pmd_page_dtor(struct page *page) +static inline void pmd_ptlock_free(struct page *page) { #ifdef CONFIG_TRANSPARENT_HUGEPAGE VM_BUG_ON_PAGE(page->pmd_huge_pte, page); @@ -2279,8 +2279,8 @@ static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd) return &mm->page_table_lock; } -static inline bool pgtable_pmd_page_ctor(struct page *page) { return true; } -static inline void pgtable_pmd_page_dtor(struct page *page) {} +static inline bool pmd_ptlock_init(struct page *page) { return true; } +static inline void pmd_ptlock_free(struct page *page) {} #define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte) @@ -2293,6 +2293,22 @@ static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd) return ptl; } +static inline bool pgtable_pmd_page_ctor(struct page *page) +{ + if (!pmd_ptlock_init(page)) + return false; + __SetPageTable(page); + inc_zone_page_state(page, NR_PAGETABLE); + return true; +} + +static inline void pgtable_pmd_page_dtor(struct page *page) +{ + pmd_ptlock_free(page); + __ClearPageTable(page); + dec_zone_page_state(page, NR_PAGETABLE); +} + /* * No scalability reason to split PUD locks yet, but follow the same pattern * as the PMD locks to make it easier if we decide to. The VM should not be From d383807aaf7796d328a43f20b98b99c2fcc664d7 Mon Sep 17 00:00:00 2001 From: Yanfei Xu Date: Tue, 13 Oct 2020 16:53:26 -0700 Subject: [PATCH 093/181] mm/memory.c: fix typo in __do_fault() comment It's "pte_alloc_one", not "pte_alloc_pne". Let's fix that. Signed-off-by: Yanfei Xu Signed-off-by: Andrew Morton Reviewed-by: David Hildenbrand Link: http://lkml.kernel.org/r/20200818104339.5310-1-yanfei.xu@windriver.com Signed-off-by: Linus Torvalds --- mm/memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memory.c b/mm/memory.c index eeae590e526a..1ba65e981541 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3589,7 +3589,7 @@ static vm_fault_t __do_fault(struct vm_fault *vmf) * unlock_page(A) * lock_page(B) * lock_page(B) - * pte_alloc_pne + * pte_alloc_one * shrink_page_list * wait_on_page_writeback(A) * SetPageWriteback(B) From a7069ee3f891b22727303dfe0857ae47ef0c4936 Mon Sep 17 00:00:00 2001 From: Yanfei Xu Date: Tue, 13 Oct 2020 16:53:29 -0700 Subject: [PATCH 094/181] mm/memory.c: replace vmf->vma with variable vma The code has declared a vma_struct named vma which is assigned a value of vmf->vma. Thus, use variable vma directly here. Signed-off-by: Yanfei Xu Signed-off-by: Andrew Morton Reviewed-by: Matthew Wilcox (Oracle) Link: http://lkml.kernel.org/r/20200818084607.37616-1-yanfei.xu@windriver.com Signed-off-by: Linus Torvalds --- mm/memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memory.c b/mm/memory.c index 1ba65e981541..817ac9c58473 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3597,7 +3597,7 @@ static vm_fault_t __do_fault(struct vm_fault *vmf) * # flush A, B to clear the writeback */ if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) { - vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm); + vmf->prealloc_pte = pte_alloc_one(vma->vm_mm); if (!vmf->prealloc_pte) return VM_FAULT_OOM; smp_wmb(); /* See comment in __pte_alloc() */ From 7c61f917b1617d5d1920f47d134de0b3401dc93a Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Tue, 13 Oct 2020 16:53:32 -0700 Subject: [PATCH 095/181] mm/mmap: rename __vma_unlink_common() to __vma_unlink() __vma_unlink_common() and __vma_unlink() are counterparts. Since there is no function named __vma_unlink(), let's rename __vma_unlink_common() to __vma_unlink() to make the code more self-explanatory and easy for audience to understand. Otherwise we may expect there are several variants of vma_unlink() and __vma_unlink_common() is used by them. Signed-off-by: Wei Yang Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: https://lkml.kernel.org/r/20200809232057.23477-1-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds --- mm/mmap.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index e71d2d471416..4815364107ef 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -677,7 +677,7 @@ static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) mm->map_count++; } -static __always_inline void __vma_unlink_common(struct mm_struct *mm, +static __always_inline void __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma, struct vm_area_struct *ignore) { @@ -859,7 +859,7 @@ again: * us to remove next before dropping the locks. */ if (remove_next != 3) - __vma_unlink_common(mm, next, next); + __vma_unlink(mm, next, next); else /* * vma is not before next if they've been @@ -870,7 +870,7 @@ again: * "next" (which is stored in post-swap() * "vma"). */ - __vma_unlink_common(mm, next, vma); + __vma_unlink(mm, next, vma); if (file) __remove_shared_vm_struct(next, file, mapping); } else if (insert) { From 4d1e72437b92598fa371120aade3f7805114bb16 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Tue, 13 Oct 2020 16:53:35 -0700 Subject: [PATCH 096/181] mm/mmap: leverage vma_rb_erase_ignore() to implement vma_rb_erase() These two functions share the same logic except ignore a different vma. Let's reuse the code. Signed-off-by: Wei Yang Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: https://lkml.kernel.org/r/20200809232057.23477-2-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds --- mm/mmap.c | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 4815364107ef..0f3ca5257335 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -474,8 +474,12 @@ static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma, { /* * All rb_subtree_gap values must be consistent prior to erase, - * with the possible exception of the "next" vma being erased if - * next->vm_start was reduced. + * with the possible exception of + * + * a. the "next" vma being erased if next->vm_start was reduced in + * __vma_adjust() -> __vma_unlink() + * b. the vma being erased in detach_vmas_to_be_unmapped() -> + * vma_rb_erase() */ validate_mm_rb(root, ignore); @@ -485,13 +489,7 @@ static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma, static __always_inline void vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root) { - /* - * All rb_subtree_gap values must be consistent prior to erase, - * with the possible exception of the vma being erased. - */ - validate_mm_rb(root, vma); - - __vma_rb_erase(vma, root); + vma_rb_erase_ignore(vma, root, vma); } /* From 07e5bfe651f8595ca6a777d989016aa8d8217924 Mon Sep 17 00:00:00 2001 From: Chinwen Chang Date: Tue, 13 Oct 2020 16:53:39 -0700 Subject: [PATCH 097/181] mmap locking API: add mmap_lock_is_contended() Patch series "Try to release mmap_lock temporarily in smaps_rollup", v4. Recently, we have observed some janky issues caused by unpleasantly long contention on mmap_lock which is held by smaps_rollup when probing large processes. To address the problem, we let smaps_rollup detect if anyone wants to acquire mmap_lock for write attempts. If yes, just release the lock temporarily to ease the contention. smaps_rollup is a procfs interface which allows users to summarize the process's memory usage without the overhead of seq_* calls. Android uses it to sample the memory usage of various processes to balance its memory pool sizes. If no one wants to take the lock for write requests, smaps_rollup with this patch will behave like the original one. Although there are on-going mmap_lock optimizations like range-based locks, the lock applied to smaps_rollup would be the coarse one, which is hard to avoid the occurrence of aforementioned issues. So the detection and temporary release for write attempts on mmap_lock in smaps_rollup is still necessary. This patch (of 3): Add new API to query if someone wants to acquire mmap_lock for write attempts. Using this instead of rwsem_is_contended makes it more tolerant of future changes to the lock type. Signed-off-by: Chinwen Chang Signed-off-by: Andrew Morton Reviewed-by: Steven Price Acked-by: Michel Lespinasse Cc: Alexey Dobriyan Cc: Daniel Jordan Cc: Daniel Kiss Cc: Davidlohr Bueso Cc: Huang Ying Cc: Jason Gunthorpe Cc: Jimmy Assarsson Cc: Laurent Dufour Cc: "Matthew Wilcox (Oracle)" Cc: Matthias Brugger Cc: Song Liu Cc: Vlastimil Babka Link: http://lkml.kernel.org/r/1597715898-3854-1-git-send-email-chinwen.chang@mediatek.com Link: http://lkml.kernel.org/r/1597715898-3854-2-git-send-email-chinwen.chang@mediatek.com Signed-off-by: Linus Torvalds --- include/linux/mmap_lock.h | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h index 0707671851a8..18e7eae9b5ba 100644 --- a/include/linux/mmap_lock.h +++ b/include/linux/mmap_lock.h @@ -87,4 +87,9 @@ static inline void mmap_assert_write_locked(struct mm_struct *mm) VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm); } +static inline int mmap_lock_is_contended(struct mm_struct *mm) +{ + return rwsem_is_contended(&mm->mmap_lock); +} + #endif /* _LINUX_MMAP_LOCK_H */ From 03b4b1149308b0f0d76075a25d4100f684bef326 Mon Sep 17 00:00:00 2001 From: Chinwen Chang Date: Tue, 13 Oct 2020 16:53:43 -0700 Subject: [PATCH 098/181] mm: smaps*: extend smap_gather_stats to support specified beginning Extend smap_gather_stats to support indicated beginning address at which it should start gathering. To achieve the goal, we add a new parameter @start assigned by the caller and try to refactor it for simplicity. If @start is 0, it will use the range of @vma for gathering. Signed-off-by: Chinwen Chang Signed-off-by: Andrew Morton Reviewed-by: Steven Price Cc: Michel Lespinasse Cc: Alexey Dobriyan Cc: Daniel Jordan Cc: Daniel Kiss Cc: Davidlohr Bueso Cc: Huang Ying Cc: Jason Gunthorpe Cc: Jimmy Assarsson Cc: Laurent Dufour Cc: "Matthew Wilcox (Oracle)" Cc: Matthias Brugger Cc: Song Liu Cc: Vlastimil Babka Link: http://lkml.kernel.org/r/1597715898-3854-3-git-send-email-chinwen.chang@mediatek.com Signed-off-by: Linus Torvalds --- fs/proc/task_mmu.c | 30 ++++++++++++++++++++++-------- 1 file changed, 22 insertions(+), 8 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index a1be198f755c..f77727530b9c 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -721,9 +721,21 @@ static const struct mm_walk_ops smaps_shmem_walk_ops = { .pte_hole = smaps_pte_hole, }; +/* + * Gather mem stats from @vma with the indicated beginning + * address @start, and keep them in @mss. + * + * Use vm_start of @vma as the beginning address if @start is 0. + */ static void smap_gather_stats(struct vm_area_struct *vma, - struct mem_size_stats *mss) + struct mem_size_stats *mss, unsigned long start) { + const struct mm_walk_ops *ops = &smaps_walk_ops; + + /* Invalid start */ + if (start >= vma->vm_end) + return; + #ifdef CONFIG_SHMEM /* In case of smaps_rollup, reset the value from previous vma */ mss->check_shmem_swap = false; @@ -740,18 +752,20 @@ static void smap_gather_stats(struct vm_area_struct *vma, */ unsigned long shmem_swapped = shmem_swap_usage(vma); - if (!shmem_swapped || (vma->vm_flags & VM_SHARED) || - !(vma->vm_flags & VM_WRITE)) { + if (!start && (!shmem_swapped || (vma->vm_flags & VM_SHARED) || + !(vma->vm_flags & VM_WRITE))) { mss->swap += shmem_swapped; } else { mss->check_shmem_swap = true; - walk_page_vma(vma, &smaps_shmem_walk_ops, mss); - return; + ops = &smaps_shmem_walk_ops; } } #endif /* mmap_lock is held in m_start */ - walk_page_vma(vma, &smaps_walk_ops, mss); + if (!start) + walk_page_vma(vma, ops, mss); + else + walk_page_range(vma->vm_mm, start, vma->vm_end, ops, mss); } #define SEQ_PUT_DEC(str, val) \ @@ -803,7 +817,7 @@ static int show_smap(struct seq_file *m, void *v) memset(&mss, 0, sizeof(mss)); - smap_gather_stats(vma, &mss); + smap_gather_stats(vma, &mss, 0); show_map_vma(m, vma); @@ -852,7 +866,7 @@ static int show_smaps_rollup(struct seq_file *m, void *v) hold_task_mempolicy(priv); for (vma = priv->mm->mmap; vma; vma = vma->vm_next) { - smap_gather_stats(vma, &mss); + smap_gather_stats(vma, &mss, 0); last_vma_end = vma->vm_end; } From ff9f47f6f00cfe1625a06a136e286a47c9483b2e Mon Sep 17 00:00:00 2001 From: Chinwen Chang Date: Tue, 13 Oct 2020 16:53:47 -0700 Subject: [PATCH 099/181] mm: proc: smaps_rollup: do not stall write attempts on mmap_lock smaps_rollup will try to grab mmap_lock and go through the whole vma list until it finishes the iterating. When encountering large processes, the mmap_lock will be held for a longer time, which may block other write requests like mmap and munmap from progressing smoothly. There are upcoming mmap_lock optimizations like range-based locks, but the lock applied to smaps_rollup would be the coarse type, which doesn't avoid the occurrence of unpleasant contention. To solve aforementioned issue, we add a check which detects whether anyone wants to grab mmap_lock for write attempts. Signed-off-by: Chinwen Chang Signed-off-by: Andrew Morton Cc: Steven Price Cc: Michel Lespinasse Cc: Matthias Brugger Cc: Vlastimil Babka Cc: Daniel Jordan Cc: Davidlohr Bueso Cc: Chinwen Chang Cc: Alexey Dobriyan Cc: "Matthew Wilcox (Oracle)" Cc: Jason Gunthorpe Cc: Song Liu Cc: Jimmy Assarsson Cc: Huang Ying Cc: Daniel Kiss Cc: Laurent Dufour Link: http://lkml.kernel.org/r/1597715898-3854-4-git-send-email-chinwen.chang@mediatek.com Signed-off-by: Linus Torvalds --- fs/proc/task_mmu.c | 66 +++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 65 insertions(+), 1 deletion(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index f77727530b9c..846d43df3fdf 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -865,9 +865,73 @@ static int show_smaps_rollup(struct seq_file *m, void *v) hold_task_mempolicy(priv); - for (vma = priv->mm->mmap; vma; vma = vma->vm_next) { + for (vma = priv->mm->mmap; vma;) { smap_gather_stats(vma, &mss, 0); last_vma_end = vma->vm_end; + + /* + * Release mmap_lock temporarily if someone wants to + * access it for write request. + */ + if (mmap_lock_is_contended(mm)) { + mmap_read_unlock(mm); + ret = mmap_read_lock_killable(mm); + if (ret) { + release_task_mempolicy(priv); + goto out_put_mm; + } + + /* + * After dropping the lock, there are four cases to + * consider. See the following example for explanation. + * + * +------+------+-----------+ + * | VMA1 | VMA2 | VMA3 | + * +------+------+-----------+ + * | | | | + * 4k 8k 16k 400k + * + * Suppose we drop the lock after reading VMA2 due to + * contention, then we get: + * + * last_vma_end = 16k + * + * 1) VMA2 is freed, but VMA3 exists: + * + * find_vma(mm, 16k - 1) will return VMA3. + * In this case, just continue from VMA3. + * + * 2) VMA2 still exists: + * + * find_vma(mm, 16k - 1) will return VMA2. + * Iterate the loop like the original one. + * + * 3) No more VMAs can be found: + * + * find_vma(mm, 16k - 1) will return NULL. + * No more things to do, just break. + * + * 4) (last_vma_end - 1) is the middle of a vma (VMA'): + * + * find_vma(mm, 16k - 1) will return VMA' whose range + * contains last_vma_end. + * Iterate VMA' from last_vma_end. + */ + vma = find_vma(mm, last_vma_end - 1); + /* Case 3 above */ + if (!vma) + break; + + /* Case 1 above */ + if (vma->vm_start >= last_vma_end) + continue; + + /* Case 4 above */ + if (vma->vm_end > last_vma_end) + smap_gather_stats(vma, &mss, last_vma_end); + } + /* Case 2 above */ + vma = vma->vm_next; } show_vma_header_prefix(m, priv->mm->mmap->vm_start, From e18c45ffcfa347b13c2f300f290bacff55a4b41e Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 13 Oct 2020 16:53:50 -0700 Subject: [PATCH 100/181] mm: move PageDoubleMap bit Patch series "Fix PageDoubleMap". This is a purely theoretical problem for now as none of the filesystems which use PG_private_2 (ie PG_fscache) are being converted at this time, but it's confusing to leave it like this. This patch (of 2): PG_private_2 is defined as being PF_ANY (applicable to tail pages as well as regular & head pages). That means that the first tail page of a double-map page will appear to have Private2 set. Use the Workingset bit instead which is defined as PF_HEAD so any attempt to access the Workingset bit on a tail page will redirect to the head page's Workingset bit. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton Reviewed-by: Zi Yan Link: https://lkml.kernel.org/r/20200629151933.15671-1-willy@infradead.org Link: https://lkml.kernel.org/r/20200629151933.15671-2-willy@infradead.org Signed-off-by: Linus Torvalds --- include/linux/page-flags.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 276140c94f4a..76413b2ffef0 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -167,7 +167,7 @@ enum pageflags { PG_slob_free = PG_private, /* Compound pages. Stored in first tail page's flags */ - PG_double_map = PG_private_2, + PG_double_map = PG_workingset, /* non-lru isolated movable page */ PG_isolated = PG_reclaim, From a08d93e5752a35a771054f6c463f789720f9a3e8 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 13 Oct 2020 16:53:54 -0700 Subject: [PATCH 101/181] mm: simplify PageDoubleMap with PF_SECOND policy Introduce the new page policy of PF_SECOND which lets us use the normal pageflags generation machinery to create the various DoubleMap manipulation functions. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton Reviewed-by: Zi Yan Link: https://lkml.kernel.org/r/20200629151933.15671-3-willy@infradead.org Signed-off-by: Linus Torvalds --- include/linux/page-flags.h | 40 ++++++++++---------------------------- 1 file changed, 10 insertions(+), 30 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 76413b2ffef0..38ded408bd4c 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -235,6 +235,9 @@ static inline void page_init_poison(struct page *page, size_t size) * * PF_NO_COMPOUND: * the page flag is not relevant for compound pages. + * + * PF_SECOND: + * the page flag is stored in the first tail page. */ #define PF_POISONED_CHECK(page) ({ \ VM_BUG_ON_PGFLAGS(PagePoisoned(page), page); \ @@ -250,6 +253,9 @@ static inline void page_init_poison(struct page *page, size_t size) #define PF_NO_COMPOUND(page, enforce) ({ \ VM_BUG_ON_PGFLAGS(enforce && PageCompound(page), page); \ PF_POISONED_CHECK(page); }) +#define PF_SECOND(page, enforce) ({ \ + VM_BUG_ON_PGFLAGS(!PageHead(page), page); \ + PF_POISONED_CHECK(&page[1]); }) /* * Macros to create function definitions for page flags @@ -688,42 +694,15 @@ static inline int PageTransTail(struct page *page) * * See also __split_huge_pmd_locked() and page_remove_anon_compound_rmap(). */ -static inline int PageDoubleMap(struct page *page) -{ - return PageHead(page) && test_bit(PG_double_map, &page[1].flags); -} - -static inline void SetPageDoubleMap(struct page *page) -{ - VM_BUG_ON_PAGE(!PageHead(page), page); - set_bit(PG_double_map, &page[1].flags); -} - -static inline void ClearPageDoubleMap(struct page *page) -{ - VM_BUG_ON_PAGE(!PageHead(page), page); - clear_bit(PG_double_map, &page[1].flags); -} -static inline int TestSetPageDoubleMap(struct page *page) -{ - VM_BUG_ON_PAGE(!PageHead(page), page); - return test_and_set_bit(PG_double_map, &page[1].flags); -} - -static inline int TestClearPageDoubleMap(struct page *page) -{ - VM_BUG_ON_PAGE(!PageHead(page), page); - return test_and_clear_bit(PG_double_map, &page[1].flags); -} - +PAGEFLAG(DoubleMap, double_map, PF_SECOND) + TESTSCFLAG(DoubleMap, double_map, PF_SECOND) #else TESTPAGEFLAG_FALSE(TransHuge) TESTPAGEFLAG_FALSE(TransCompound) TESTPAGEFLAG_FALSE(TransCompoundMap) TESTPAGEFLAG_FALSE(TransTail) PAGEFLAG_FALSE(DoubleMap) - TESTSETFLAG_FALSE(DoubleMap) - TESTCLEARFLAG_FALSE(DoubleMap) + TESTSCFLAG_FALSE(DoubleMap) #endif /* @@ -888,6 +867,7 @@ static inline int page_has_private(struct page *page) #undef PF_ONLY_HEAD #undef PF_NO_TAIL #undef PF_NO_COMPOUND +#undef PF_SECOND #endif /* !__GENERATING_BOUNDS_H */ #endif /* PAGE_FLAGS_H */ From f9d86a60572295ebb53c87a4305dc89b487711bd Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Tue, 13 Oct 2020 16:53:57 -0700 Subject: [PATCH 102/181] mm/mmap: leave adjust_next as virtual address instead of page frame number Instead of converting adjust_next between bytes and pages number, let's just store the virtual address into adjust_next. Also, this patch fixes one typo in the comment of vma_adjust_trans_huge(). [vbabka@suse.cz: changelog tweak] Signed-off-by: Wei Yang Signed-off-by: Andrew Morton Acked-by: Vlastimil Babka Cc: Mike Kravetz Link: http://lkml.kernel.org/r/20200828081031.11306-1-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds --- mm/huge_memory.c | 4 ++-- mm/mmap.c | 8 ++++---- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index ec0f0cc49545..65c289c13b58 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2306,13 +2306,13 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma, /* * If we're also updating the vma->vm_next->vm_start, if the new - * vm_next->vm_start isn't page aligned and it could previously + * vm_next->vm_start isn't hpage aligned and it could previously * contain an hugepage: check if we need to split an huge pmd. */ if (adjust_next > 0) { struct vm_area_struct *next = vma->vm_next; unsigned long nstart = next->vm_start; - nstart += adjust_next << PAGE_SHIFT; + nstart += adjust_next; if (nstart & ~HPAGE_PMD_MASK && (nstart & HPAGE_PMD_MASK) >= next->vm_start && (nstart & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= next->vm_end) diff --git a/mm/mmap.c b/mm/mmap.c index 0f3ca5257335..57de816e2614 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -758,7 +758,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start, * vma expands, overlapping part of the next: * mprotect case 5 shifting the boundary up. */ - adjust_next = (end - next->vm_start) >> PAGE_SHIFT; + adjust_next = (end - next->vm_start); exporter = next; importer = vma; VM_WARN_ON(expand != importer); @@ -768,7 +768,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start, * split_vma inserting another: so it must be * mprotect case 4 shifting the boundary down. */ - adjust_next = -((vma->vm_end - end) >> PAGE_SHIFT); + adjust_next = -(vma->vm_end - end); exporter = vma; importer = next; VM_WARN_ON(expand != importer); @@ -840,8 +840,8 @@ again: } vma->vm_pgoff = pgoff; if (adjust_next) { - next->vm_start += adjust_next << PAGE_SHIFT; - next->vm_pgoff += adjust_next; + next->vm_start += adjust_next; + next->vm_pgoff += adjust_next >> PAGE_SHIFT; } if (root) { From f1dc1685f6ca25d435d58d16e53bb6f336f54cac Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 13 Oct 2020 16:54:01 -0700 Subject: [PATCH 103/181] mm/memory.c: fix spello of "function" Fix typo/spello of "function". Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Link: https://lkml.kernel.org/r/e7bf180e-c558-b1d5-9a15-6d9708823c9c@infradead.org Signed-off-by: Linus Torvalds --- mm/memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memory.c b/mm/memory.c index 817ac9c58473..5a1c4c9759dc 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3764,7 +3764,7 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) /** * alloc_set_pte - setup new PTE entry for given page and add reverse page - * mapping. If needed, the fucntion allocates page table or use pre-allocated. + * mapping. If needed, the function allocates page table or use pre-allocated. * * @vmf: fault environment * @page: page to map From 808fbdbea05f1e965da5b887d808025ba22c1946 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Tue, 13 Oct 2020 16:54:04 -0700 Subject: [PATCH 104/181] mm/mmap: not necessary to check mapping separately *root* with type of struct rb_root_cached is an element of *mapping* with type of struct address_space. This implies when we have a valid *root* it must be a part of valid *mapping*. So we can merge these two checks together to make the code more easy to read and to save some cpu cycles. Signed-off-by: Wei Yang Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: https://lkml.kernel.org/r/20200913133631.37781-1-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds --- mm/mmap.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 57de816e2614..295197b75cb0 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -895,10 +895,9 @@ again: anon_vma_interval_tree_post_update_vma(next); anon_vma_unlock_write(anon_vma); } - if (mapping) - i_mmap_unlock_write(mapping); if (root) { + i_mmap_unlock_write(mapping); uprobe_mmap(vma); if (adjust_next) From 0fc48a6e213ab8e4033d26bd8333ee8558f210f6 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Tue, 13 Oct 2020 16:54:07 -0700 Subject: [PATCH 105/181] mm/mmap: check on file instead of the rb_root_cached of its address_space In __vma_adjust(), we do the check on *root* to decide whether to adjust the address_space. It seems to be more meaningful to do the check on *file* itself. This means we are adjusting some data because it is a file backed vma. Since we seem to assume the address_space is valid if it is a file backed vma, let's just replace *root* with *file* here. Signed-off-by: Wei Yang Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: https://lkml.kernel.org/r/20200913133631.37781-2-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds --- mm/mmap.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 295197b75cb0..19cd69524837 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -823,7 +823,7 @@ again: anon_vma_interval_tree_pre_update_vma(next); } - if (root) { + if (file) { flush_dcache_mmap_lock(mapping); vma_interval_tree_remove(vma, root); if (adjust_next) @@ -844,7 +844,7 @@ again: next->vm_pgoff += adjust_next >> PAGE_SHIFT; } - if (root) { + if (file) { if (adjust_next) vma_interval_tree_insert(next, root); vma_interval_tree_insert(vma, root); @@ -896,7 +896,7 @@ again: anon_vma_unlock_write(anon_vma); } - if (root) { + if (file) { i_mmap_unlock_write(mapping); uprobe_mmap(vma); From cf508b58457cfe51b168d49ea6c79a221900d354 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Tue, 13 Oct 2020 16:54:10 -0700 Subject: [PATCH 106/181] mm: use helper function mapping_allow_writable() Commit 4bb5f5d9395b ("mm: allow drivers to prevent new writable mappings") changed i_mmap_writable from unsigned int to atomic_t and add the helper function mapping_allow_writable() to atomic_inc i_mmap_writable. But it forgot to use this helper function in dup_mmap() and __vma_link_file(). Signed-off-by: Miaohe Lin Signed-off-by: Andrew Morton Cc: Christian Brauner Cc: Ingo Molnar Cc: Peter Zijlstra Cc: "Eric W. Biederman" Cc: Christian Kellner Cc: Suren Baghdasaryan Cc: Adrian Reber Cc: Shakeel Butt Cc: Aleksa Sarai Cc: Thomas Gleixner Link: https://lkml.kernel.org/r/20200917112736.7789-1-linmiaohe@huawei.com Signed-off-by: Linus Torvalds --- kernel/fork.c | 2 +- mm/mmap.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index a3795aaaab5c..2dcb19a30650 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -559,7 +559,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, atomic_dec(&inode->i_writecount); i_mmap_lock_write(mapping); if (tmp->vm_flags & VM_SHARED) - atomic_inc(&mapping->i_mmap_writable); + mapping_allow_writable(mapping); flush_dcache_mmap_lock(mapping); /* insert tmp into the share list, just after mpnt */ vma_interval_tree_insert_after(tmp, mpnt, diff --git a/mm/mmap.c b/mm/mmap.c index 19cd69524837..7799a3f2e483 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -621,7 +621,7 @@ static void __vma_link_file(struct vm_area_struct *vma) if (vma->vm_flags & VM_DENYWRITE) atomic_dec(&file_inode(file)->i_writecount); if (vma->vm_flags & VM_SHARED) - atomic_inc(&mapping->i_mmap_writable); + mapping_allow_writable(mapping); flush_dcache_mmap_lock(mapping); vma_interval_tree_insert(vma, &mapping->i_mmap); From cb48841fbf4ed6f5e001a79af4c2486028dfb291 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Tue, 13 Oct 2020 16:54:14 -0700 Subject: [PATCH 107/181] mm/mmap.c: use helper function allow_write_access() in __remove_shared_vm_struct() In commit 1da177e4c3f4 ("Linux-2.6.12-rc2"), the helper allow_write_access came with the atomic_inc operation of the i_writecount field in the func __remove_shared_vm_struct(). But it forgot to use this helper function. Signed-off-by: Miaohe Lin Signed-off-by: Andrew Morton Link: https://lkml.kernel.org/r/20200921115814.39680-1-linmiaohe@huawei.com Signed-off-by: Linus Torvalds --- mm/mmap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/mmap.c b/mm/mmap.c index 7799a3f2e483..cab12a759200 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -143,7 +143,7 @@ static void __remove_shared_vm_struct(struct vm_area_struct *vma, struct file *file, struct address_space *mapping) { if (vma->vm_flags & VM_DENYWRITE) - atomic_inc(&file_inode(file)->i_writecount); + allow_write_access(file); if (vma->vm_flags & VM_SHARED) mapping_unmap_writable(mapping); From 8332326e8e47fbc35711433419c31f610153dd58 Mon Sep 17 00:00:00 2001 From: Liao Pingfang Date: Tue, 13 Oct 2020 16:54:18 -0700 Subject: [PATCH 108/181] mm/mmap.c: replace do_brk with do_brk_flags in comment of insert_vm_struct() Replace do_brk with do_brk_flags in comment of insert_vm_struct(), since do_brk was removed in following commit. Fixes: bb177a732c4369 ("mm: do not bug_on on incorrect length in __mm_populate()") Signed-off-by: Liao Pingfang Signed-off-by: Yi Wang Signed-off-by: Andrew Morton Link: https://lkml.kernel.org/r/1600650778-43230-1-git-send-email-wang.yi59@zte.com.cn Signed-off-by: Linus Torvalds --- mm/mmap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/mmap.c b/mm/mmap.c index cab12a759200..67d11ad6df24 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3233,7 +3233,7 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) * By setting it to reflect the virtual start address of the * vma, merges and splits can happen in a seamless way, just * using the existing file pgoff checks and manipulations. - * Similarly in do_mmap and in do_brk. + * Similarly in do_mmap and in do_brk_flags. */ if (vma_is_anonymous(vma)) { BUG_ON(vma->anon_vma); From c78f463649d60f4ed12df97a32def4d77b583853 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Tue, 13 Oct 2020 16:54:21 -0700 Subject: [PATCH 109/181] mm: remove src/dst mm parameter in copy_page_range() Both of the mm pointers are not needed after commit 7a4830c380f3 ("mm/fork: Pass new vma pointer into copy_page_range()"). Jason Gunthorpe also reported that the ordering of copy_page_range() is odd. Since working at it, reorder the parameters to be logical, by (1) always put the dst_* fields to be before src_* fields, and (2) keep the same type of parameters together. [peterx@redhat.com: further reorder some parameters and line format, per Jason] Link: https://lkml.kernel.org/r/20201002192647.7161-1-peterx@redhat.com [peterx@redhat.com: fix warnings] Link: https://lkml.kernel.org/r/20201006200138.GA6026@xz-x1 Reported-by: Kirill A. Shutemov Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Reviewed-by: Jason Gunthorpe Acked-by: Kirill A. Shutemov Link: https://lkml.kernel.org/r/20200930204950.6668-1-peterx@redhat.com Signed-off-by: Linus Torvalds --- include/linux/mm.h | 4 +- kernel/fork.c | 2 +- mm/memory.c | 141 +++++++++++++++++++++++---------------------- 3 files changed, 76 insertions(+), 71 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 5320e7ab843f..620961e4f32b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1653,8 +1653,8 @@ struct mmu_notifier_range; void free_pgd_range(struct mmu_gather *tlb, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling); -int copy_page_range(struct mm_struct *dst, struct mm_struct *src, - struct vm_area_struct *vma, struct vm_area_struct *new); +int +copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma); int follow_pte_pmd(struct mm_struct *mm, unsigned long address, struct mmu_notifier_range *range, pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp); diff --git a/kernel/fork.c b/kernel/fork.c index 2dcb19a30650..ede26e5a6097 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -590,7 +590,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, mm->map_count++; if (!(tmp->vm_flags & VM_WIPEONFORK)) - retval = copy_page_range(mm, oldmm, mpnt, tmp); + retval = copy_page_range(tmp, mpnt); if (tmp->vm_ops && tmp->vm_ops->open) tmp->vm_ops->open(tmp); diff --git a/mm/memory.c b/mm/memory.c index 5a1c4c9759dc..f482af8bc828 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -794,15 +794,14 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, * lock. */ static inline int -copy_present_page(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pte_t *dst_pte, pte_t *src_pte, - struct vm_area_struct *vma, struct vm_area_struct *new, - unsigned long addr, int *rss, struct page **prealloc, - pte_t pte, struct page *page) +copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, + pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss, + struct page **prealloc, pte_t pte, struct page *page) { + struct mm_struct *src_mm = src_vma->vm_mm; struct page *new_page; - if (!is_cow_mapping(vma->vm_flags)) + if (!is_cow_mapping(src_vma->vm_flags)) return 1; /* @@ -832,16 +831,16 @@ copy_present_page(struct mm_struct *dst_mm, struct mm_struct *src_mm, * over and copy the page & arm it. */ *prealloc = NULL; - copy_user_highpage(new_page, page, addr, vma); + copy_user_highpage(new_page, page, addr, src_vma); __SetPageUptodate(new_page); - page_add_new_anon_rmap(new_page, new, addr, false); - lru_cache_add_inactive_or_unevictable(new_page, new); + page_add_new_anon_rmap(new_page, dst_vma, addr, false); + lru_cache_add_inactive_or_unevictable(new_page, dst_vma); rss[mm_counter(new_page)]++; /* All done, just insert the new page copy in the child */ - pte = mk_pte(new_page, new->vm_page_prot); - pte = maybe_mkwrite(pte_mkdirty(pte), new); - set_pte_at(dst_mm, addr, dst_pte, pte); + pte = mk_pte(new_page, dst_vma->vm_page_prot); + pte = maybe_mkwrite(pte_mkdirty(pte), dst_vma); + set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte); return 0; } @@ -850,24 +849,21 @@ copy_present_page(struct mm_struct *dst_mm, struct mm_struct *src_mm, * is required to copy this pte. */ static inline int -copy_present_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma, - struct vm_area_struct *new, - unsigned long addr, int *rss, struct page **prealloc) +copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, + pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss, + struct page **prealloc) { - unsigned long vm_flags = vma->vm_flags; + struct mm_struct *src_mm = src_vma->vm_mm; + unsigned long vm_flags = src_vma->vm_flags; pte_t pte = *src_pte; struct page *page; - page = vm_normal_page(vma, addr, pte); + page = vm_normal_page(src_vma, addr, pte); if (page) { int retval; - retval = copy_present_page(dst_mm, src_mm, - dst_pte, src_pte, - vma, new, - addr, rss, prealloc, - pte, page); + retval = copy_present_page(dst_vma, src_vma, dst_pte, src_pte, + addr, rss, prealloc, pte, page); if (retval <= 0) return retval; @@ -901,7 +897,7 @@ copy_present_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, if (!(vm_flags & VM_UFFD_WP)) pte = pte_clear_uffd_wp(pte); - set_pte_at(dst_mm, addr, dst_pte, pte); + set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte); return 0; } @@ -924,11 +920,13 @@ page_copy_prealloc(struct mm_struct *src_mm, struct vm_area_struct *vma, return new_page; } -static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma, - struct vm_area_struct *new, - unsigned long addr, unsigned long end) +static int +copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, + pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, + unsigned long end) { + struct mm_struct *dst_mm = dst_vma->vm_mm; + struct mm_struct *src_mm = src_vma->vm_mm; pte_t *orig_src_pte, *orig_dst_pte; pte_t *src_pte, *dst_pte; spinlock_t *src_ptl, *dst_ptl; @@ -971,15 +969,15 @@ again: if (unlikely(!pte_present(*src_pte))) { entry.val = copy_nonpresent_pte(dst_mm, src_mm, dst_pte, src_pte, - vma, addr, rss); + src_vma, addr, rss); if (entry.val) break; progress += 8; continue; } /* copy_present_pte() will clear `*prealloc' if consumed */ - ret = copy_present_pte(dst_mm, src_mm, dst_pte, src_pte, - vma, new, addr, rss, &prealloc); + ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte, + addr, rss, &prealloc); /* * If we need a pre-allocated page for this pte, drop the * locks, allocate, and try again. @@ -1014,7 +1012,7 @@ again: entry.val = 0; } else if (ret) { WARN_ON_ONCE(ret != -EAGAIN); - prealloc = page_copy_prealloc(src_mm, vma, addr); + prealloc = page_copy_prealloc(src_mm, src_vma, addr); if (!prealloc) return -ENOMEM; /* We've captured and resolved the error. Reset, try again. */ @@ -1028,11 +1026,13 @@ out: return ret; } -static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma, - struct vm_area_struct *new, - unsigned long addr, unsigned long end) +static inline int +copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, + pud_t *dst_pud, pud_t *src_pud, unsigned long addr, + unsigned long end) { + struct mm_struct *dst_mm = dst_vma->vm_mm; + struct mm_struct *src_mm = src_vma->vm_mm; pmd_t *src_pmd, *dst_pmd; unsigned long next; @@ -1045,9 +1045,9 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)) { int err; - VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, vma); + VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma); err = copy_huge_pmd(dst_mm, src_mm, - dst_pmd, src_pmd, addr, vma); + dst_pmd, src_pmd, addr, src_vma); if (err == -ENOMEM) return -ENOMEM; if (!err) @@ -1056,18 +1056,20 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src } if (pmd_none_or_clear_bad(src_pmd)) continue; - if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd, - vma, new, addr, next)) + if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd, + addr, next)) return -ENOMEM; } while (dst_pmd++, src_pmd++, addr = next, addr != end); return 0; } -static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - p4d_t *dst_p4d, p4d_t *src_p4d, struct vm_area_struct *vma, - struct vm_area_struct *new, - unsigned long addr, unsigned long end) +static inline int +copy_pud_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, + p4d_t *dst_p4d, p4d_t *src_p4d, unsigned long addr, + unsigned long end) { + struct mm_struct *dst_mm = dst_vma->vm_mm; + struct mm_struct *src_mm = src_vma->vm_mm; pud_t *src_pud, *dst_pud; unsigned long next; @@ -1080,9 +1082,9 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) { int err; - VM_BUG_ON_VMA(next-addr != HPAGE_PUD_SIZE, vma); + VM_BUG_ON_VMA(next-addr != HPAGE_PUD_SIZE, src_vma); err = copy_huge_pud(dst_mm, src_mm, - dst_pud, src_pud, addr, vma); + dst_pud, src_pud, addr, src_vma); if (err == -ENOMEM) return -ENOMEM; if (!err) @@ -1091,18 +1093,19 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src } if (pud_none_or_clear_bad(src_pud)) continue; - if (copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud, - vma, new, addr, next)) + if (copy_pmd_range(dst_vma, src_vma, dst_pud, src_pud, + addr, next)) return -ENOMEM; } while (dst_pud++, src_pud++, addr = next, addr != end); return 0; } -static inline int copy_p4d_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma, - struct vm_area_struct *new, - unsigned long addr, unsigned long end) +static inline int +copy_p4d_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, + pgd_t *dst_pgd, pgd_t *src_pgd, unsigned long addr, + unsigned long end) { + struct mm_struct *dst_mm = dst_vma->vm_mm; p4d_t *src_p4d, *dst_p4d; unsigned long next; @@ -1114,20 +1117,22 @@ static inline int copy_p4d_range(struct mm_struct *dst_mm, struct mm_struct *src next = p4d_addr_end(addr, end); if (p4d_none_or_clear_bad(src_p4d)) continue; - if (copy_pud_range(dst_mm, src_mm, dst_p4d, src_p4d, - vma, new, addr, next)) + if (copy_pud_range(dst_vma, src_vma, dst_p4d, src_p4d, + addr, next)) return -ENOMEM; } while (dst_p4d++, src_p4d++, addr = next, addr != end); return 0; } -int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, - struct vm_area_struct *vma, struct vm_area_struct *new) +int +copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) { pgd_t *src_pgd, *dst_pgd; unsigned long next; - unsigned long addr = vma->vm_start; - unsigned long end = vma->vm_end; + unsigned long addr = src_vma->vm_start; + unsigned long end = src_vma->vm_end; + struct mm_struct *dst_mm = dst_vma->vm_mm; + struct mm_struct *src_mm = src_vma->vm_mm; struct mmu_notifier_range range; bool is_cow; int ret; @@ -1138,19 +1143,19 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, * readonly mappings. The tradeoff is that copy_page_range is more * efficient than faulting. */ - if (!(vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) && - !vma->anon_vma) + if (!(src_vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) && + !src_vma->anon_vma) return 0; - if (is_vm_hugetlb_page(vma)) - return copy_hugetlb_page_range(dst_mm, src_mm, vma); + if (is_vm_hugetlb_page(src_vma)) + return copy_hugetlb_page_range(dst_mm, src_mm, src_vma); - if (unlikely(vma->vm_flags & VM_PFNMAP)) { + if (unlikely(src_vma->vm_flags & VM_PFNMAP)) { /* * We do not free on error cases below as remove_vma * gets called on error from higher level routine */ - ret = track_pfn_copy(vma); + ret = track_pfn_copy(src_vma); if (ret) return ret; } @@ -1161,11 +1166,11 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, * parent mm. And a permission downgrade will only happen if * is_cow_mapping() returns true. */ - is_cow = is_cow_mapping(vma->vm_flags); + is_cow = is_cow_mapping(src_vma->vm_flags); if (is_cow) { mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE, - 0, vma, src_mm, addr, end); + 0, src_vma, src_mm, addr, end); mmu_notifier_invalidate_range_start(&range); } @@ -1176,8 +1181,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(src_pgd)) continue; - if (unlikely(copy_p4d_range(dst_mm, src_mm, dst_pgd, src_pgd, - vma, new, addr, next))) { + if (unlikely(copy_p4d_range(dst_vma, src_vma, dst_pgd, src_pgd, + addr, next))) { ret = -ENOMEM; break; } From f577e143d85aa7ea3d9c62c607ad00fc46a5730c Mon Sep 17 00:00:00 2001 From: yuleixzhang Date: Tue, 13 Oct 2020 16:54:25 -0700 Subject: [PATCH 110/181] include/linux/huge_mm.h: remove mincore_huge_pmd declaration As mincore_huge_pmd() was dropped, remove the declaration from the header file. Signed-off-by: Yulei Zhang Signed-off-by: Andrew Morton Reviewed-by: Zi Yan Link: https://lkml.kernel.org/r/20200922083423.15074-1-yuleixzhang@tencent.com Signed-off-by: Linus Torvalds --- include/linux/huge_mm.h | 3 --- 1 file changed, 3 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 8a8bc46a2432..0365aa97f8e7 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -38,9 +38,6 @@ extern int zap_huge_pmd(struct mmu_gather *tlb, extern int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, pud_t *pud, unsigned long addr); -extern int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, unsigned long end, - unsigned char *vec); extern bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr, unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd); From bfe18a0900f1188e323d3f2c3cd2d6dfe2d0789c Mon Sep 17 00:00:00 2001 From: Ralph Campbell Date: Tue, 13 Oct 2020 16:54:29 -0700 Subject: [PATCH 111/181] tools/testing/selftests/vm/hmm-tests.c: use the new SKIP() macro Some tests might not be able to be run if resources like huge pages are not available. Mark these tests as skipped instead of simply passing. Signed-off-by: Ralph Campbell Signed-off-by: Andrew Morton Reviewed-by: Jason Gunthorpe Cc: Jerome Glisse Cc: John Hubbard Cc: Shuah Khan Link: http://lkml.kernel.org/r/20200827190400.12608-1-rcampbell@nvidia.com Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/hmm-tests.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c index 93fc5cadce61..0a28a6a29581 100644 --- a/tools/testing/selftests/vm/hmm-tests.c +++ b/tools/testing/selftests/vm/hmm-tests.c @@ -680,7 +680,7 @@ TEST_F(hmm, anon_write_hugetlbfs) n = gethugepagesizes(pagesizes, 4); if (n <= 0) - return; + SKIP(return, "Huge page size could not be determined"); for (idx = 0; --n > 0; ) { if (pagesizes[n] < pagesizes[idx]) idx = n; @@ -694,7 +694,7 @@ TEST_F(hmm, anon_write_hugetlbfs) buffer->ptr = get_hugepage_region(size, GHR_STRICT); if (buffer->ptr == NULL) { free(buffer); - return; + SKIP(return, "Huge page could not be allocated"); } buffer->fd = -1; From 9b53122f616a74ddbbd6c97a3c8294c631a13d15 Mon Sep 17 00:00:00 2001 From: Ralph Campbell Date: Tue, 13 Oct 2020 16:54:32 -0700 Subject: [PATCH 112/181] lib/test_hmm.c: remove unused dmirror_zero_page The variable dmirror_zero_page is unused in the HMM self test driver which was probably intended to demonstrate how a driver could use migrate_vma_setup() to share a single read-only device private zero page similar to how the CPU does. However, this isn't needed for the self tests so remove it. Signed-off-by: Ralph Campbell Signed-off-by: Andrew Morton Cc: Jerome Glisse Link: https://lkml.kernel.org/r/20200914213801.16520-1-rcampbell@nvidia.com Signed-off-by: Linus Torvalds --- lib/test_hmm.c | 14 -------------- 1 file changed, 14 deletions(-) diff --git a/lib/test_hmm.c b/lib/test_hmm.c index c710b4c5714d..e151a7f10519 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -36,7 +36,6 @@ static const struct dev_pagemap_ops dmirror_devmem_ops; static const struct mmu_interval_notifier_ops dmirror_min_ops; static dev_t dmirror_dev; -static struct page *dmirror_zero_page; struct dmirror_device; @@ -1127,17 +1126,6 @@ static int __init hmm_dmirror_init(void) goto err_chrdev; } - /* - * Allocate a zero page to simulate a reserved page of device private - * memory which is always zero. The zero_pfn page isn't used just to - * make the code here simpler (i.e., we need a struct page for it). - */ - dmirror_zero_page = alloc_page(GFP_HIGHUSER | __GFP_ZERO); - if (!dmirror_zero_page) { - ret = -ENOMEM; - goto err_chrdev; - } - pr_info("HMM test module loaded. This is only for testing HMM.\n"); return 0; @@ -1153,8 +1141,6 @@ static void __exit hmm_dmirror_exit(void) { int id; - if (dmirror_zero_page) - __free_page(dmirror_zero_page); for (id = 0; id < DMIRROR_NDEVICES; id++) dmirror_device_remove(dmirror_devices + id); unregister_chrdev_region(dmirror_dev, DMIRROR_NDEVICES); From 42286f83f80f58ae2cf3889d35cc60220df49cbb Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Tue, 13 Oct 2020 16:54:35 -0700 Subject: [PATCH 113/181] mm/dmapool.c: replace open-coded list_for_each_entry_safe() There is a place in the code where open-coded version of list_for_each_entry_safe() is used. Replace that with the standard macro. Signed-off-by: Andy Shevchenko Signed-off-by: Andrew Morton Cc: Matthew Wilcox Link: http://lkml.kernel.org/r/20200814135055.24898-1-andriy.shevchenko@linux.intel.com Signed-off-by: Linus Torvalds --- mm/dmapool.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/mm/dmapool.c b/mm/dmapool.c index f9fb9bbd733e..7716a7a42a1c 100644 --- a/mm/dmapool.c +++ b/mm/dmapool.c @@ -266,6 +266,7 @@ static void pool_free_page(struct dma_pool *pool, struct dma_page *page) */ void dma_pool_destroy(struct dma_pool *pool) { + struct dma_page *page, *tmp; bool empty = false; if (unlikely(!pool)) @@ -281,10 +282,7 @@ void dma_pool_destroy(struct dma_pool *pool) device_remove_file(pool->dev, &dev_attr_pools); mutex_unlock(&pools_reg_lock); - while (!list_empty(&pool->page_list)) { - struct dma_page *page; - page = list_entry(pool->page_list.next, - struct dma_page, page_list); + list_for_each_entry_safe(page, tmp, &pool->page_list, page_list) { if (is_page_busy(page)) { if (pool->dev) dev_err(pool->dev, From 41a04814a715ae561ac378587a8f7763f29beb4a Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Tue, 13 Oct 2020 16:54:38 -0700 Subject: [PATCH 114/181] mm/dmapool.c: replace hard coded function name with __func__ No need to hard code function name when __func__ can be used. While here, replace specifiers for special types like dma_addr_t. Signed-off-by: Andy Shevchenko Signed-off-by: Andrew Morton Cc: Matthew Wilcox Link: http://lkml.kernel.org/r/20200814135055.24898-2-andriy.shevchenko@linux.intel.com Signed-off-by: Linus Torvalds --- mm/dmapool.c | 40 ++++++++++++++++++---------------------- 1 file changed, 18 insertions(+), 22 deletions(-) diff --git a/mm/dmapool.c b/mm/dmapool.c index 7716a7a42a1c..a97c97232337 100644 --- a/mm/dmapool.c +++ b/mm/dmapool.c @@ -285,11 +285,10 @@ void dma_pool_destroy(struct dma_pool *pool) list_for_each_entry_safe(page, tmp, &pool->page_list, page_list) { if (is_page_busy(page)) { if (pool->dev) - dev_err(pool->dev, - "dma_pool_destroy %s, %p busy\n", + dev_err(pool->dev, "%s %s, %p busy\n", __func__, pool->name, page->vaddr); else - pr_err("dma_pool_destroy %s, %p busy\n", + pr_err("%s %s, %p busy\n", __func__, pool->name, page->vaddr); /* leak the still-in-use consistent memory */ list_del(&page->page_list); @@ -353,12 +352,11 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags, if (data[i] == POOL_POISON_FREED) continue; if (pool->dev) - dev_err(pool->dev, - "dma_pool_alloc %s, %p (corrupted)\n", - pool->name, retval); + dev_err(pool->dev, "%s %s, %p (corrupted)\n", + __func__, pool->name, retval); else - pr_err("dma_pool_alloc %s, %p (corrupted)\n", - pool->name, retval); + pr_err("%s %s, %p (corrupted)\n", + __func__, pool->name, retval); /* * Dump the first 4 bytes even if they are not @@ -414,12 +412,11 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma) if (!page) { spin_unlock_irqrestore(&pool->lock, flags); if (pool->dev) - dev_err(pool->dev, - "dma_pool_free %s, %p/%lx (bad dma)\n", - pool->name, vaddr, (unsigned long)dma); + dev_err(pool->dev, "%s %s, %p/%pad (bad dma)\n", + __func__, pool->name, vaddr, &dma); else - pr_err("dma_pool_free %s, %p/%lx (bad dma)\n", - pool->name, vaddr, (unsigned long)dma); + pr_err("%s %s, %p/%pad (bad dma)\n", + __func__, pool->name, vaddr, &dma); return; } @@ -430,12 +427,11 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma) if ((dma - page->dma) != offset) { spin_unlock_irqrestore(&pool->lock, flags); if (pool->dev) - dev_err(pool->dev, - "dma_pool_free %s, %p (bad vaddr)/%pad\n", - pool->name, vaddr, &dma); + dev_err(pool->dev, "%s %s, %p (bad vaddr)/%pad\n", + __func__, pool->name, vaddr, &dma); else - pr_err("dma_pool_free %s, %p (bad vaddr)/%pad\n", - pool->name, vaddr, &dma); + pr_err("%s %s, %p (bad vaddr)/%pad\n", + __func__, pool->name, vaddr, &dma); return; } { @@ -447,11 +443,11 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma) } spin_unlock_irqrestore(&pool->lock, flags); if (pool->dev) - dev_err(pool->dev, "dma_pool_free %s, dma %pad already free\n", - pool->name, &dma); + dev_err(pool->dev, "%s %s, dma %pad already free\n", + __func__, pool->name, &dma); else - pr_err("dma_pool_free %s, dma %pad already free\n", - pool->name, &dma); + pr_err("%s %s, dma %pad already free\n", + __func__, pool->name, &dma); return; } } From c43bc03d0a872dcfcaaf1689c4f6140f9d3cb867 Mon Sep 17 00:00:00 2001 From: Xianting Tian Date: Tue, 13 Oct 2020 16:54:42 -0700 Subject: [PATCH 115/181] mm/memory-failure: do pgoff calculation before for_each_process() There is no need to calculate pgoff in each loop of for_each_process(), so move it to the place before for_each_process(), which can save some CPU cycles. Signed-off-by: Xianting Tian Signed-off-by: Andrew Morton Acked-by: Naoya Horiguchi Link: http://lkml.kernel.org/r/20200818082647.34322-1-tian.xianting@h3c.com Signed-off-by: Linus Torvalds --- mm/memory-failure.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index a1e73943445e..24c28f80628d 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -484,11 +484,12 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill, struct vm_area_struct *vma; struct task_struct *tsk; struct address_space *mapping = page->mapping; + pgoff_t pgoff; i_mmap_lock_read(mapping); read_lock(&tasklist_lock); + pgoff = page_to_pgoff(page); for_each_process(tsk) { - pgoff_t pgoff = page_to_pgoff(page); struct task_struct *t = task_early_kill(tsk, force_early); if (!t) From 2c3125977ec1cb1ecb027727539049e277f42c25 Mon Sep 17 00:00:00 2001 From: Alex Shi Date: Tue, 13 Oct 2020 16:54:45 -0700 Subject: [PATCH 116/181] mm/memory-failure.c: remove unused macro `writeback' Unlike others we don't use the marco writeback. so let's remove it to tame gcc warning: mm/memory-failure.c:827: warning: macro "writeback" is not used [-Wunused-macros] Signed-off-by: Alex Shi Signed-off-by: Andrew Morton Cc: Naoya Horiguchi Link: https://lkml.kernel.org/r/1599715096-20369-1-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Linus Torvalds --- mm/memory-failure.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 24c28f80628d..990e3b2e37d5 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -825,7 +825,6 @@ static int me_huge_page(struct page *p, unsigned long pfn) #define sc ((1UL << PG_swapcache) | (1UL << PG_swapbacked)) #define unevict (1UL << PG_unevictable) #define mlock (1UL << PG_mlocked) -#define writeback (1UL << PG_writeback) #define lru (1UL << PG_lru) #define head (1UL << PG_head) #define slab (1UL << PG_slab) @@ -874,7 +873,6 @@ static struct page_state { #undef sc #undef unevict #undef mlock -#undef writeback #undef lru #undef head #undef slab From 82afbc32f22154d40f0bbbcfc7e18d2411f3dc92 Mon Sep 17 00:00:00 2001 From: Hui Su Date: Tue, 13 Oct 2020 16:54:48 -0700 Subject: [PATCH 117/181] mm/vmalloc.c: update the comment in __vmalloc_area_node() Since c67dc624757 ("mm/vmalloc: do not call kmemleak_free() on not yet accounted memory"), the __vunmap() have been changed to __vfree(), so update the confusing comment(). Signed-off-by: Hui Su Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Cc: Roman Penyaev Link: https://lkml.kernel.org/r/20200927155409.GA3315@rlk Signed-off-by: Linus Torvalds --- mm/vmalloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index be4724b916b3..689e7ef08a5d 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2447,7 +2447,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, page = alloc_pages_node(node, alloc_mask|highmem_mask, 0); if (unlikely(!page)) { - /* Successfully allocated i pages, free them in __vunmap() */ + /* Successfully allocated i pages, free them in __vfree() */ area->nr_pages = i; atomic_long_add(area->nr_pages, &nr_vmalloc_pages); goto fail; From 74640617e14fdae2cbac9dbd6fae38a811123e7d Mon Sep 17 00:00:00 2001 From: Hui Su Date: Tue, 13 Oct 2020 16:54:51 -0700 Subject: [PATCH 118/181] mm/vmalloc.c: fix the comment of find_vm_area Fix the comment of find_vm_area() and get_vm_area() Signed-off-by: Hui Su Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: https://lkml.kernel.org/r/20200927153034.GA199877@rlk Signed-off-by: Linus Torvalds --- mm/vmalloc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 689e7ef08a5d..04ac98bf5045 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2133,7 +2133,7 @@ struct vm_struct *get_vm_area_caller(unsigned long size, unsigned long flags, * It is up to the caller to do all required locking to keep the returned * pointer valid. * - * Return: pointer to the found area or %NULL on faulure + * Return: the area descriptor on success or %NULL on failure. */ struct vm_struct *find_vm_area(const void *addr) { @@ -2154,7 +2154,7 @@ struct vm_struct *find_vm_area(const void *addr) * This function returns the found VM area, but using it is NOT safe * on SMP machines, except for its size or flags. * - * Return: pointer to the found area or %NULL on faulure + * Return: the area descriptor on success or %NULL on failure. */ struct vm_struct *remove_vm_area(const void *addr) { From 25356cfad69cd67cfaf0e703df6f888e9727b947 Mon Sep 17 00:00:00 2001 From: Alexander Gordeev Date: Tue, 13 Oct 2020 16:54:54 -0700 Subject: [PATCH 119/181] docs/vm: fix 'mm_count' vs 'mm_users' counter confusion In the context of the anonymous address space lifespan description the 'mm_users' reference counter is confused with 'mm_count'. I.e a "zombie" mm gets released when "mm_count" becomes zero, not "mm_users". Signed-off-by: Alexander Gordeev Signed-off-by: Andrew Morton Cc: Jonathan Corbet Link: https://lkml.kernel.org/r/1597040695-32633-1-git-send-email-agordeev@linux.ibm.com Signed-off-by: Linus Torvalds --- Documentation/vm/active_mm.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/vm/active_mm.rst b/Documentation/vm/active_mm.rst index c84471b180f8..6f8269c284ed 100644 --- a/Documentation/vm/active_mm.rst +++ b/Documentation/vm/active_mm.rst @@ -64,7 +64,7 @@ Active MM actually get cases where you have a address space that is _only_ used by lazy users. That is often a short-lived state, because once that thread gets scheduled away in favour of a real thread, the "zombie" mm gets - released because "mm_users" becomes zero. + released because "mm_count" becomes zero. Also, a new rule is that _nobody_ ever has "init_mm" as a real MM any more. "init_mm" should be considered just a "lazy context when no other From 393824f650fabf6ea32bb09bea7acbc3a062dac8 Mon Sep 17 00:00:00 2001 From: Patricia Alfonso Date: Tue, 13 Oct 2020 16:54:58 -0700 Subject: [PATCH 120/181] kasan/kunit: add KUnit Struct to Current Task Patch series "KASAN-KUnit Integration", v14. This patchset contains everything needed to integrate KASAN and KUnit. KUnit will be able to: (1) Fail tests when an unexpected KASAN error occurs (2) Pass tests when an expected KASAN error occurs Convert KASAN tests to KUnit with the exception of copy_user_test because KUnit is unable to test those. Add documentation on how to run the KASAN tests with KUnit and what to expect when running these tests. This patch (of 5): In order to integrate debugging tools like KASAN into the KUnit framework, add KUnit struct to the current task to keep track of the current KUnit test. Signed-off-by: Patricia Alfonso Signed-off-by: David Gow Signed-off-by: Andrew Morton Tested-by: Andrey Konovalov Reviewed-by: Brendan Higgins Cc: Brendan Higgins Cc: Andrey Ryabinin Cc: Dmitry Vyukov Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Shuah Khan Link: https://lkml.kernel.org/r/20200915035828.570483-1-davidgow@google.com Link: https://lkml.kernel.org/r/20200915035828.570483-2-davidgow@google.com Link: https://lkml.kernel.org/r/20200910070331.3358048-1-davidgow@google.com Link: https://lkml.kernel.org/r/20200910070331.3358048-2-davidgow@google.com Signed-off-by: Linus Torvalds --- include/linux/sched.h | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index 829b0697d19c..9030f3abd969 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1208,6 +1208,10 @@ struct task_struct { #endif #endif +#if IS_ENABLED(CONFIG_KUNIT) + struct kunit *kunit_test; +#endif + #ifdef CONFIG_FUNCTION_GRAPH_TRACER /* Index of current stored address in ret_stack: */ int curr_ret_stack; From 83c4e7a0363bdb8104f510370907161623e31086 Mon Sep 17 00:00:00 2001 From: Patricia Alfonso Date: Tue, 13 Oct 2020 16:55:02 -0700 Subject: [PATCH 121/181] KUnit: KASAN Integration Integrate KASAN into KUnit testing framework. - Fail tests when KASAN reports an error that is not expected - Use KUNIT_EXPECT_KASAN_FAIL to expect a KASAN error in KASAN tests - Expected KASAN reports pass tests and are still printed when run without kunit_tool (kunit_tool still bypasses the report due to the test passing) - KUnit struct in current task used to keep track of the current test from KASAN code Make use of "[PATCH v3 kunit-next 1/2] kunit: generalize kunit_resource API beyond allocated resources" and "[PATCH v3 kunit-next 2/2] kunit: add support for named resources" from Alan Maguire [1] - A named resource is added to a test when a KASAN report is expected - This resource contains a struct for kasan_data containing booleans representing if a KASAN report is expected and if a KASAN report is found [1] (https://lore.kernel.org/linux-kselftest/1583251361-12748-1-git-send-email-alan.maguire@oracle.com/T/#t) Signed-off-by: Patricia Alfonso Signed-off-by: David Gow Signed-off-by: Andrew Morton Tested-by: Andrey Konovalov Reviewed-by: Andrey Konovalov Reviewed-by: Dmitry Vyukov Acked-by: Brendan Higgins Cc: Andrey Ryabinin Cc: Ingo Molnar Cc: Juri Lelli Cc: Peter Zijlstra Cc: Shuah Khan Cc: Vincent Guittot Link: https://lkml.kernel.org/r/20200915035828.570483-3-davidgow@google.com Link: https://lkml.kernel.org/r/20200910070331.3358048-3-davidgow@google.com Signed-off-by: Linus Torvalds --- include/kunit/test.h | 5 +++++ include/linux/kasan.h | 6 ++++++ lib/kunit/test.c | 13 +++++++----- lib/test_kasan.c | 47 +++++++++++++++++++++++++++++++++++++++++-- mm/kasan/report.c | 32 +++++++++++++++++++++++++++++ 5 files changed, 96 insertions(+), 7 deletions(-) diff --git a/include/kunit/test.h b/include/kunit/test.h index 59f3144f009a..3391f38389f8 100644 --- a/include/kunit/test.h +++ b/include/kunit/test.h @@ -224,6 +224,11 @@ struct kunit { struct list_head resources; /* Protected by lock. */ }; +static inline void kunit_set_failure(struct kunit *test) +{ + WRITE_ONCE(test->success, false); +} + void kunit_init_test(struct kunit *test, const char *name, char *log); int kunit_run_tests(struct kunit_suite *suite); diff --git a/include/linux/kasan.h b/include/linux/kasan.h index 087fba34b209..30d343b4a40a 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -14,6 +14,12 @@ struct task_struct; #include #include +/* kasan_data struct is used in KUnit tests for KASAN expected failures */ +struct kunit_kasan_expectation { + bool report_expected; + bool report_found; +}; + extern unsigned char kasan_early_shadow_page[PAGE_SIZE]; extern pte_t kasan_early_shadow_pte[PTRS_PER_PTE]; extern pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD]; diff --git a/lib/kunit/test.c b/lib/kunit/test.c index c36037200310..dcc35fd30d95 100644 --- a/lib/kunit/test.c +++ b/lib/kunit/test.c @@ -10,16 +10,12 @@ #include #include #include +#include #include "debugfs.h" #include "string-stream.h" #include "try-catch-impl.h" -static void kunit_set_failure(struct kunit *test) -{ - WRITE_ONCE(test->success, false); -} - static void kunit_print_tap_version(void) { static bool kunit_has_printed_tap_version; @@ -288,6 +284,10 @@ static void kunit_try_run_case(void *data) struct kunit_suite *suite = ctx->suite; struct kunit_case *test_case = ctx->test_case; +#if (IS_ENABLED(CONFIG_KASAN) && IS_ENABLED(CONFIG_KUNIT)) + current->kunit_test = test; +#endif /* IS_ENABLED(CONFIG_KASAN) && IS_ENABLED(CONFIG_KUNIT) */ + /* * kunit_run_case_internal may encounter a fatal error; if it does, * abort will be called, this thread will exit, and finally the parent @@ -602,6 +602,9 @@ void kunit_cleanup(struct kunit *test) spin_unlock(&test->lock); kunit_remove_resource(test, res); } +#if (IS_ENABLED(CONFIG_KASAN) && IS_ENABLED(CONFIG_KUNIT)) + current->kunit_test = NULL; +#endif /* IS_ENABLED(CONFIG_KASAN) && IS_ENABLED(CONFIG_KUNIT)*/ } EXPORT_SYMBOL_GPL(kunit_cleanup); diff --git a/lib/test_kasan.c b/lib/test_kasan.c index 53e953bb1d1d..58bffadd8367 100644 --- a/lib/test_kasan.c +++ b/lib/test_kasan.c @@ -23,6 +23,8 @@ #include +#include + #include "../mm/kasan/kasan.h" #define OOB_TAG_OFF (IS_ENABLED(CONFIG_KASAN_GENERIC) ? 0 : KASAN_SHADOW_SCALE_SIZE) @@ -32,14 +34,55 @@ * are not eliminated as dead code. */ -int kasan_int_result; void *kasan_ptr_result; +int kasan_int_result; + +static struct kunit_resource resource; +static struct kunit_kasan_expectation fail_data; +static bool multishot; + +static int kasan_test_init(struct kunit *test) +{ + /* + * Temporarily enable multi-shot mode and set panic_on_warn=0. + * Otherwise, we'd only get a report for the first case. + */ + multishot = kasan_save_enable_multi_shot(); + + return 0; +} + +static void kasan_test_exit(struct kunit *test) +{ + kasan_restore_multi_shot(multishot); +} + +/** + * KUNIT_EXPECT_KASAN_FAIL() - Causes a test failure when the expression does + * not cause a KASAN error. This uses a KUnit resource named "kasan_data." Do + * Do not use this name for a KUnit resource outside here. + * + */ +#define KUNIT_EXPECT_KASAN_FAIL(test, condition) do { \ + fail_data.report_expected = true; \ + fail_data.report_found = false; \ + kunit_add_named_resource(test, \ + NULL, \ + NULL, \ + &resource, \ + "kasan_data", &fail_data); \ + condition; \ + KUNIT_EXPECT_EQ(test, \ + fail_data.report_expected, \ + fail_data.report_found); \ +} while (0) + + /* * Note: test functions are marked noinline so that their names appear in * reports. */ - static noinline void __init kmalloc_oob_right(void) { char *ptr; diff --git a/mm/kasan/report.c b/mm/kasan/report.c index 4f49fa6cd1aa..e2c14b10bc81 100644 --- a/mm/kasan/report.c +++ b/mm/kasan/report.c @@ -33,6 +33,8 @@ #include +#include + #include "kasan.h" #include "../slab.h" @@ -464,12 +466,37 @@ static bool report_enabled(void) return !test_and_set_bit(KASAN_BIT_REPORTED, &kasan_flags); } +#if IS_ENABLED(CONFIG_KUNIT) +static void kasan_update_kunit_status(struct kunit *cur_test) +{ + struct kunit_resource *resource; + struct kunit_kasan_expectation *kasan_data; + + resource = kunit_find_named_resource(cur_test, "kasan_data"); + + if (!resource) { + kunit_set_failure(cur_test); + return; + } + + kasan_data = (struct kunit_kasan_expectation *)resource->data; + kasan_data->report_found = true; + kunit_put_resource(resource); +} +#endif /* IS_ENABLED(CONFIG_KUNIT) */ + void kasan_report_invalid_free(void *object, unsigned long ip) { unsigned long flags; u8 tag = get_tag(object); object = reset_tag(object); + +#if IS_ENABLED(CONFIG_KUNIT) + if (current->kunit_test) + kasan_update_kunit_status(current->kunit_test); +#endif /* IS_ENABLED(CONFIG_KUNIT) */ + start_report(&flags); pr_err("BUG: KASAN: double-free or invalid-free in %pS\n", (void *)ip); print_tags(tag, object); @@ -488,6 +515,11 @@ static void __kasan_report(unsigned long addr, size_t size, bool is_write, void *untagged_addr; unsigned long flags; +#if IS_ENABLED(CONFIG_KUNIT) + if (current->kunit_test) + kasan_update_kunit_status(current->kunit_test); +#endif /* IS_ENABLED(CONFIG_KUNIT) */ + disable_trace_on_warning(); tagged_addr = (void *)addr; From 73228c7ecc5e40c0851c4703c5ec6ed38123e989 Mon Sep 17 00:00:00 2001 From: Patricia Alfonso Date: Tue, 13 Oct 2020 16:55:06 -0700 Subject: [PATCH 122/181] KASAN: port KASAN Tests to KUnit Transfer all previous tests for KASAN to KUnit so they can be run more easily. Using kunit_tool, developers can run these tests with their other KUnit tests and see "pass" or "fail" with the appropriate KASAN report instead of needing to parse each KASAN report to test KASAN functionalities. All KASAN reports are still printed to dmesg. Stack tests do not work properly when KASAN_STACK is enabled so those tests use a check for "if IS_ENABLED(CONFIG_KASAN_STACK)" so they only run if stack instrumentation is enabled. If KASAN_STACK is not enabled, KUnit will print a statement to let the user know this test was not run with KASAN_STACK enabled. copy_user_test and kasan_rcu_uaf cannot be run in KUnit so there is a separate test file for those tests, which can be run as before as a module. [trishalfonso@google.com: v14] Link: https://lkml.kernel.org/r/20200915035828.570483-4-davidgow@google.com Signed-off-by: Patricia Alfonso Signed-off-by: David Gow Signed-off-by: Andrew Morton Tested-by: Andrey Konovalov Reviewed-by: Brendan Higgins Reviewed-by: Andrey Konovalov Reviewed-by: Dmitry Vyukov Cc: Andrey Ryabinin Cc: Ingo Molnar Cc: Juri Lelli Cc: Peter Zijlstra Cc: Shuah Khan Cc: Vincent Guittot Link: https://lkml.kernel.org/r/20200910070331.3358048-4-davidgow@google.com Signed-off-by: Linus Torvalds --- lib/Kconfig.kasan | 22 +- lib/Makefile | 4 +- lib/test_kasan.c | 705 +++++++++++++++------------------------- lib/test_kasan_module.c | 111 +++++++ 4 files changed, 395 insertions(+), 447 deletions(-) create mode 100644 lib/test_kasan_module.c diff --git a/lib/Kconfig.kasan b/lib/Kconfig.kasan index 033a5bc67ac4..542a9c18398e 100644 --- a/lib/Kconfig.kasan +++ b/lib/Kconfig.kasan @@ -166,12 +166,24 @@ config KASAN_VMALLOC for KASAN to detect more sorts of errors (and to support vmapped stacks), but at the cost of higher memory usage. -config TEST_KASAN - tristate "Module for testing KASAN for bug detection" - depends on m +config KASAN_KUNIT_TEST + tristate "KUnit-compatible tests of KASAN bug detection capabilities" if !KUNIT_ALL_TESTS + depends on KASAN && KUNIT + default KUNIT_ALL_TESTS help - This is a test module doing various nasty things like - out of bounds accesses, use after free. It is useful for testing + This is a KUnit test suite doing various nasty things like + out of bounds and use after free accesses. It is useful for testing kernel debugging features like KASAN. + For more information on KUnit and unit tests in general, please refer + to the KUnit documentation in Documentation/dev-tools/kunit + +config TEST_KASAN_MODULE + tristate "KUnit-incompatible tests of KASAN bug detection capabilities" + depends on m && KASAN + help + This is a part of the KASAN test suite that is incompatible with + KUnit. Currently includes tests that do bad copy_from/to_user + accesses. + endif # KASAN diff --git a/lib/Makefile b/lib/Makefile index a4a4c6864f51..d4af75136c54 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -65,9 +65,11 @@ CFLAGS_test_bitops.o += -Werror obj-$(CONFIG_TEST_SYSCTL) += test_sysctl.o obj-$(CONFIG_TEST_HASH) += test_hash.o test_siphash.o obj-$(CONFIG_TEST_IDA) += test_ida.o -obj-$(CONFIG_TEST_KASAN) += test_kasan.o +obj-$(CONFIG_KASAN_KUNIT_TEST) += test_kasan.o CFLAGS_test_kasan.o += -fno-builtin CFLAGS_test_kasan.o += $(call cc-disable-warning, vla) +obj-$(CONFIG_TEST_KASAN_MODULE) += test_kasan_module.o +CFLAGS_test_kasan_module.o += -fno-builtin obj-$(CONFIG_TEST_UBSAN) += test_ubsan.o CFLAGS_test_ubsan.o += $(call cc-disable-warning, vla) UBSAN_SANITIZE_test_ubsan.o := y diff --git a/lib/test_kasan.c b/lib/test_kasan.c index 58bffadd8367..63c26171a791 100644 --- a/lib/test_kasan.c +++ b/lib/test_kasan.c @@ -5,8 +5,6 @@ * Author: Andrey Ryabinin */ -#define pr_fmt(fmt) "kasan test: %s " fmt, __func__ - #include #include #include @@ -77,416 +75,327 @@ static void kasan_test_exit(struct kunit *test) fail_data.report_found); \ } while (0) - - -/* - * Note: test functions are marked noinline so that their names appear in - * reports. - */ -static noinline void __init kmalloc_oob_right(void) +static void kmalloc_oob_right(struct kunit *test) { char *ptr; size_t size = 123; - pr_info("out-of-bounds to right\n"); ptr = kmalloc(size, GFP_KERNEL); - if (!ptr) { - pr_err("Allocation failed\n"); - return; - } - - ptr[size + OOB_TAG_OFF] = 'x'; + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); + KUNIT_EXPECT_KASAN_FAIL(test, ptr[size + OOB_TAG_OFF] = 'x'); kfree(ptr); } -static noinline void __init kmalloc_oob_left(void) +static void kmalloc_oob_left(struct kunit *test) { char *ptr; size_t size = 15; - pr_info("out-of-bounds to left\n"); ptr = kmalloc(size, GFP_KERNEL); - if (!ptr) { - pr_err("Allocation failed\n"); - return; - } + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); - *ptr = *(ptr - 1); + KUNIT_EXPECT_KASAN_FAIL(test, *ptr = *(ptr - 1)); kfree(ptr); } -static noinline void __init kmalloc_node_oob_right(void) +static void kmalloc_node_oob_right(struct kunit *test) { char *ptr; size_t size = 4096; - pr_info("kmalloc_node(): out-of-bounds to right\n"); ptr = kmalloc_node(size, GFP_KERNEL, 0); - if (!ptr) { - pr_err("Allocation failed\n"); - return; - } + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); - ptr[size] = 0; + KUNIT_EXPECT_KASAN_FAIL(test, ptr[size] = 0); kfree(ptr); } -#ifdef CONFIG_SLUB -static noinline void __init kmalloc_pagealloc_oob_right(void) +static void kmalloc_pagealloc_oob_right(struct kunit *test) { char *ptr; size_t size = KMALLOC_MAX_CACHE_SIZE + 10; + if (!IS_ENABLED(CONFIG_SLUB)) { + kunit_info(test, "CONFIG_SLUB is not enabled."); + return; + } + /* Allocate a chunk that does not fit into a SLUB cache to trigger * the page allocator fallback. */ - pr_info("kmalloc pagealloc allocation: out-of-bounds to right\n"); ptr = kmalloc(size, GFP_KERNEL); - if (!ptr) { - pr_err("Allocation failed\n"); - return; - } - - ptr[size + OOB_TAG_OFF] = 0; + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); + KUNIT_EXPECT_KASAN_FAIL(test, ptr[size + OOB_TAG_OFF] = 0); kfree(ptr); } -static noinline void __init kmalloc_pagealloc_uaf(void) +static void kmalloc_pagealloc_uaf(struct kunit *test) { char *ptr; size_t size = KMALLOC_MAX_CACHE_SIZE + 10; - pr_info("kmalloc pagealloc allocation: use-after-free\n"); - ptr = kmalloc(size, GFP_KERNEL); - if (!ptr) { - pr_err("Allocation failed\n"); + if (!IS_ENABLED(CONFIG_SLUB)) { + kunit_info(test, "CONFIG_SLUB is not enabled."); return; } + ptr = kmalloc(size, GFP_KERNEL); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); + kfree(ptr); - ptr[0] = 0; + KUNIT_EXPECT_KASAN_FAIL(test, ptr[0] = 0); } -static noinline void __init kmalloc_pagealloc_invalid_free(void) +static void kmalloc_pagealloc_invalid_free(struct kunit *test) { char *ptr; size_t size = KMALLOC_MAX_CACHE_SIZE + 10; - pr_info("kmalloc pagealloc allocation: invalid-free\n"); - ptr = kmalloc(size, GFP_KERNEL); - if (!ptr) { - pr_err("Allocation failed\n"); + if (!IS_ENABLED(CONFIG_SLUB)) { + kunit_info(test, "CONFIG_SLUB is not enabled."); return; } - kfree(ptr + 1); -} -#endif + ptr = kmalloc(size, GFP_KERNEL); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); -static noinline void __init kmalloc_large_oob_right(void) + KUNIT_EXPECT_KASAN_FAIL(test, kfree(ptr + 1)); +} + +static void kmalloc_large_oob_right(struct kunit *test) { char *ptr; size_t size = KMALLOC_MAX_CACHE_SIZE - 256; /* Allocate a chunk that is large enough, but still fits into a slab * and does not trigger the page allocator fallback in SLUB. */ - pr_info("kmalloc large allocation: out-of-bounds to right\n"); ptr = kmalloc(size, GFP_KERNEL); - if (!ptr) { - pr_err("Allocation failed\n"); - return; - } + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); - ptr[size] = 0; + KUNIT_EXPECT_KASAN_FAIL(test, ptr[size] = 0); kfree(ptr); } -static noinline void __init kmalloc_oob_krealloc_more(void) +static void kmalloc_oob_krealloc_more(struct kunit *test) { char *ptr1, *ptr2; size_t size1 = 17; size_t size2 = 19; - pr_info("out-of-bounds after krealloc more\n"); ptr1 = kmalloc(size1, GFP_KERNEL); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr1); + ptr2 = krealloc(ptr1, size2, GFP_KERNEL); - if (!ptr1 || !ptr2) { - pr_err("Allocation failed\n"); - kfree(ptr1); - kfree(ptr2); - return; - } - - ptr2[size2 + OOB_TAG_OFF] = 'x'; + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr2); + KUNIT_EXPECT_KASAN_FAIL(test, ptr2[size2 + OOB_TAG_OFF] = 'x'); kfree(ptr2); } -static noinline void __init kmalloc_oob_krealloc_less(void) +static void kmalloc_oob_krealloc_less(struct kunit *test) { char *ptr1, *ptr2; size_t size1 = 17; size_t size2 = 15; - pr_info("out-of-bounds after krealloc less\n"); ptr1 = kmalloc(size1, GFP_KERNEL); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr1); + ptr2 = krealloc(ptr1, size2, GFP_KERNEL); - if (!ptr1 || !ptr2) { - pr_err("Allocation failed\n"); - kfree(ptr1); - return; - } - - ptr2[size2 + OOB_TAG_OFF] = 'x'; + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr2); + KUNIT_EXPECT_KASAN_FAIL(test, ptr2[size2 + OOB_TAG_OFF] = 'x'); kfree(ptr2); } -static noinline void __init kmalloc_oob_16(void) +static void kmalloc_oob_16(struct kunit *test) { struct { u64 words[2]; } *ptr1, *ptr2; - pr_info("kmalloc out-of-bounds for 16-bytes access\n"); ptr1 = kmalloc(sizeof(*ptr1) - 3, GFP_KERNEL); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr1); + ptr2 = kmalloc(sizeof(*ptr2), GFP_KERNEL); - if (!ptr1 || !ptr2) { - pr_err("Allocation failed\n"); - kfree(ptr1); - kfree(ptr2); - return; - } - *ptr1 = *ptr2; + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr2); + + KUNIT_EXPECT_KASAN_FAIL(test, *ptr1 = *ptr2); kfree(ptr1); kfree(ptr2); } -static noinline void __init kmalloc_oob_memset_2(void) +static void kmalloc_oob_memset_2(struct kunit *test) { char *ptr; size_t size = 8; - pr_info("out-of-bounds in memset2\n"); ptr = kmalloc(size, GFP_KERNEL); - if (!ptr) { - pr_err("Allocation failed\n"); - return; - } - - memset(ptr + 7 + OOB_TAG_OFF, 0, 2); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); + KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr + 7 + OOB_TAG_OFF, 0, 2)); kfree(ptr); } -static noinline void __init kmalloc_oob_memset_4(void) +static void kmalloc_oob_memset_4(struct kunit *test) { char *ptr; size_t size = 8; - pr_info("out-of-bounds in memset4\n"); ptr = kmalloc(size, GFP_KERNEL); - if (!ptr) { - pr_err("Allocation failed\n"); - return; - } - - memset(ptr + 5 + OOB_TAG_OFF, 0, 4); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); + KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr + 5 + OOB_TAG_OFF, 0, 4)); kfree(ptr); } -static noinline void __init kmalloc_oob_memset_8(void) +static void kmalloc_oob_memset_8(struct kunit *test) { char *ptr; size_t size = 8; - pr_info("out-of-bounds in memset8\n"); ptr = kmalloc(size, GFP_KERNEL); - if (!ptr) { - pr_err("Allocation failed\n"); - return; - } - - memset(ptr + 1 + OOB_TAG_OFF, 0, 8); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); + KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr + 1 + OOB_TAG_OFF, 0, 8)); kfree(ptr); } -static noinline void __init kmalloc_oob_memset_16(void) +static void kmalloc_oob_memset_16(struct kunit *test) { char *ptr; size_t size = 16; - pr_info("out-of-bounds in memset16\n"); ptr = kmalloc(size, GFP_KERNEL); - if (!ptr) { - pr_err("Allocation failed\n"); - return; - } - - memset(ptr + 1 + OOB_TAG_OFF, 0, 16); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); + KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr + 1 + OOB_TAG_OFF, 0, 16)); kfree(ptr); } -static noinline void __init kmalloc_oob_in_memset(void) +static void kmalloc_oob_in_memset(struct kunit *test) { char *ptr; size_t size = 666; - pr_info("out-of-bounds in memset\n"); ptr = kmalloc(size, GFP_KERNEL); - if (!ptr) { - pr_err("Allocation failed\n"); - return; - } - - memset(ptr, 0, size + 5 + OOB_TAG_OFF); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); + KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr, 0, size + 5 + OOB_TAG_OFF)); kfree(ptr); } -static noinline void __init kmalloc_memmove_invalid_size(void) +static void kmalloc_memmove_invalid_size(struct kunit *test) { char *ptr; size_t size = 64; volatile size_t invalid_size = -2; - pr_info("invalid size in memmove\n"); ptr = kmalloc(size, GFP_KERNEL); - if (!ptr) { - pr_err("Allocation failed\n"); - return; - } + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); memset((char *)ptr, 0, 64); - memmove((char *)ptr, (char *)ptr + 4, invalid_size); + + KUNIT_EXPECT_KASAN_FAIL(test, + memmove((char *)ptr, (char *)ptr + 4, invalid_size)); kfree(ptr); } -static noinline void __init kmalloc_uaf(void) +static void kmalloc_uaf(struct kunit *test) { char *ptr; size_t size = 10; - pr_info("use-after-free\n"); ptr = kmalloc(size, GFP_KERNEL); - if (!ptr) { - pr_err("Allocation failed\n"); - return; - } + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); kfree(ptr); - *(ptr + 8) = 'x'; + KUNIT_EXPECT_KASAN_FAIL(test, *(ptr + 8) = 'x'); } -static noinline void __init kmalloc_uaf_memset(void) +static void kmalloc_uaf_memset(struct kunit *test) { char *ptr; size_t size = 33; - pr_info("use-after-free in memset\n"); ptr = kmalloc(size, GFP_KERNEL); - if (!ptr) { - pr_err("Allocation failed\n"); - return; - } + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); kfree(ptr); - memset(ptr, 0, size); + KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr, 0, size)); } -static noinline void __init kmalloc_uaf2(void) +static void kmalloc_uaf2(struct kunit *test) { char *ptr1, *ptr2; size_t size = 43; - pr_info("use-after-free after another kmalloc\n"); ptr1 = kmalloc(size, GFP_KERNEL); - if (!ptr1) { - pr_err("Allocation failed\n"); - return; - } + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr1); kfree(ptr1); - ptr2 = kmalloc(size, GFP_KERNEL); - if (!ptr2) { - pr_err("Allocation failed\n"); - return; - } - ptr1[40] = 'x'; - if (ptr1 == ptr2) - pr_err("Could not detect use-after-free: ptr1 == ptr2\n"); + ptr2 = kmalloc(size, GFP_KERNEL); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr2); + + KUNIT_EXPECT_KASAN_FAIL(test, ptr1[40] = 'x'); + KUNIT_EXPECT_PTR_NE(test, ptr1, ptr2); + kfree(ptr2); } -static noinline void __init kfree_via_page(void) +static void kfree_via_page(struct kunit *test) { char *ptr; size_t size = 8; struct page *page; unsigned long offset; - pr_info("invalid-free false positive (via page)\n"); ptr = kmalloc(size, GFP_KERNEL); - if (!ptr) { - pr_err("Allocation failed\n"); - return; - } + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); page = virt_to_page(ptr); offset = offset_in_page(ptr); kfree(page_address(page) + offset); } -static noinline void __init kfree_via_phys(void) +static void kfree_via_phys(struct kunit *test) { char *ptr; size_t size = 8; phys_addr_t phys; - pr_info("invalid-free false positive (via phys)\n"); ptr = kmalloc(size, GFP_KERNEL); - if (!ptr) { - pr_err("Allocation failed\n"); - return; - } + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); phys = virt_to_phys(ptr); kfree(phys_to_virt(phys)); } -static noinline void __init kmem_cache_oob(void) +static void kmem_cache_oob(struct kunit *test) { char *p; size_t size = 200; struct kmem_cache *cache = kmem_cache_create("test_cache", size, 0, 0, NULL); - if (!cache) { - pr_err("Cache allocation failed\n"); - return; - } - pr_info("out-of-bounds in kmem_cache_alloc\n"); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, cache); p = kmem_cache_alloc(cache, GFP_KERNEL); if (!p) { - pr_err("Allocation failed\n"); + kunit_err(test, "Allocation failed: %s\n", __func__); kmem_cache_destroy(cache); return; } - *p = p[size + OOB_TAG_OFF]; - + KUNIT_EXPECT_KASAN_FAIL(test, *p = p[size + OOB_TAG_OFF]); kmem_cache_free(cache, p); kmem_cache_destroy(cache); } -static noinline void __init memcg_accounted_kmem_cache(void) +static void memcg_accounted_kmem_cache(struct kunit *test) { int i; char *p; @@ -494,12 +403,8 @@ static noinline void __init memcg_accounted_kmem_cache(void) struct kmem_cache *cache; cache = kmem_cache_create("test_cache", size, 0, SLAB_ACCOUNT, NULL); - if (!cache) { - pr_err("Cache allocation failed\n"); - return; - } + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, cache); - pr_info("allocate memcg accounted object\n"); /* * Several allocations with a delay to allow for lazy per memcg kmem * cache creation. @@ -519,134 +424,93 @@ free_cache: static char global_array[10]; -static noinline void __init kasan_global_oob(void) +static void kasan_global_oob(struct kunit *test) { volatile int i = 3; char *p = &global_array[ARRAY_SIZE(global_array) + i]; - pr_info("out-of-bounds global variable\n"); - *(volatile char *)p; + KUNIT_EXPECT_KASAN_FAIL(test, *(volatile char *)p); } -static noinline void __init kasan_stack_oob(void) +static void ksize_unpoisons_memory(struct kunit *test) +{ + char *ptr; + size_t size = 123, real_size; + + ptr = kmalloc(size, GFP_KERNEL); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); + real_size = ksize(ptr); + /* This access doesn't trigger an error. */ + ptr[size] = 'x'; + /* This one does. */ + KUNIT_EXPECT_KASAN_FAIL(test, ptr[real_size] = 'y'); + kfree(ptr); +} + +static void kasan_stack_oob(struct kunit *test) { char stack_array[10]; volatile int i = OOB_TAG_OFF; char *p = &stack_array[ARRAY_SIZE(stack_array) + i]; - pr_info("out-of-bounds on stack\n"); - *(volatile char *)p; -} - -static noinline void __init ksize_unpoisons_memory(void) -{ - char *ptr; - size_t size = 123, real_size; - - pr_info("ksize() unpoisons the whole allocated chunk\n"); - ptr = kmalloc(size, GFP_KERNEL); - if (!ptr) { - pr_err("Allocation failed\n"); - return; - } - real_size = ksize(ptr); - /* This access doesn't trigger an error. */ - ptr[size] = 'x'; - /* This one does. */ - ptr[real_size] = 'y'; - kfree(ptr); -} - -static noinline void __init copy_user_test(void) -{ - char *kmem; - char __user *usermem; - size_t size = 10; - int unused; - - kmem = kmalloc(size, GFP_KERNEL); - if (!kmem) - return; - - usermem = (char __user *)vm_mmap(NULL, 0, PAGE_SIZE, - PROT_READ | PROT_WRITE | PROT_EXEC, - MAP_ANONYMOUS | MAP_PRIVATE, 0); - if (IS_ERR(usermem)) { - pr_err("Failed to allocate user memory\n"); - kfree(kmem); + if (!IS_ENABLED(CONFIG_KASAN_STACK)) { + kunit_info(test, "CONFIG_KASAN_STACK is not enabled"); return; } - pr_info("out-of-bounds in copy_from_user()\n"); - unused = copy_from_user(kmem, usermem, size + 1 + OOB_TAG_OFF); - - pr_info("out-of-bounds in copy_to_user()\n"); - unused = copy_to_user(usermem, kmem, size + 1 + OOB_TAG_OFF); - - pr_info("out-of-bounds in __copy_from_user()\n"); - unused = __copy_from_user(kmem, usermem, size + 1 + OOB_TAG_OFF); - - pr_info("out-of-bounds in __copy_to_user()\n"); - unused = __copy_to_user(usermem, kmem, size + 1 + OOB_TAG_OFF); - - pr_info("out-of-bounds in __copy_from_user_inatomic()\n"); - unused = __copy_from_user_inatomic(kmem, usermem, size + 1 + OOB_TAG_OFF); - - pr_info("out-of-bounds in __copy_to_user_inatomic()\n"); - unused = __copy_to_user_inatomic(usermem, kmem, size + 1 + OOB_TAG_OFF); - - pr_info("out-of-bounds in strncpy_from_user()\n"); - unused = strncpy_from_user(kmem, usermem, size + 1 + OOB_TAG_OFF); - - vm_munmap((unsigned long)usermem, PAGE_SIZE); - kfree(kmem); + KUNIT_EXPECT_KASAN_FAIL(test, *(volatile char *)p); } -static noinline void __init kasan_alloca_oob_left(void) +static void kasan_alloca_oob_left(struct kunit *test) { volatile int i = 10; char alloca_array[i]; char *p = alloca_array - 1; - pr_info("out-of-bounds to left on alloca\n"); - *(volatile char *)p; + if (!IS_ENABLED(CONFIG_KASAN_STACK)) { + kunit_info(test, "CONFIG_KASAN_STACK is not enabled"); + return; + } + + KUNIT_EXPECT_KASAN_FAIL(test, *(volatile char *)p); } -static noinline void __init kasan_alloca_oob_right(void) +static void kasan_alloca_oob_right(struct kunit *test) { volatile int i = 10; char alloca_array[i]; char *p = alloca_array + i; - pr_info("out-of-bounds to right on alloca\n"); - *(volatile char *)p; + if (!IS_ENABLED(CONFIG_KASAN_STACK)) { + kunit_info(test, "CONFIG_KASAN_STACK is not enabled"); + return; + } + + KUNIT_EXPECT_KASAN_FAIL(test, *(volatile char *)p); } -static noinline void __init kmem_cache_double_free(void) +static void kmem_cache_double_free(struct kunit *test) { char *p; size_t size = 200; struct kmem_cache *cache; cache = kmem_cache_create("test_cache", size, 0, 0, NULL); - if (!cache) { - pr_err("Cache allocation failed\n"); - return; - } - pr_info("double-free on heap object\n"); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, cache); + p = kmem_cache_alloc(cache, GFP_KERNEL); if (!p) { - pr_err("Allocation failed\n"); + kunit_err(test, "Allocation failed: %s\n", __func__); kmem_cache_destroy(cache); return; } kmem_cache_free(cache, p); - kmem_cache_free(cache, p); + KUNIT_EXPECT_KASAN_FAIL(test, kmem_cache_free(cache, p)); kmem_cache_destroy(cache); } -static noinline void __init kmem_cache_invalid_free(void) +static void kmem_cache_invalid_free(struct kunit *test) { char *p; size_t size = 200; @@ -654,20 +518,17 @@ static noinline void __init kmem_cache_invalid_free(void) cache = kmem_cache_create("test_cache", size, 0, SLAB_TYPESAFE_BY_RCU, NULL); - if (!cache) { - pr_err("Cache allocation failed\n"); - return; - } - pr_info("invalid-free of heap object\n"); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, cache); + p = kmem_cache_alloc(cache, GFP_KERNEL); if (!p) { - pr_err("Allocation failed\n"); + kunit_err(test, "Allocation failed: %s\n", __func__); kmem_cache_destroy(cache); return; } /* Trigger invalid free, the object doesn't get freed */ - kmem_cache_free(cache, p + 1); + KUNIT_EXPECT_KASAN_FAIL(test, kmem_cache_free(cache, p + 1)); /* * Properly free the object to prevent the "Objects remaining in @@ -678,45 +539,63 @@ static noinline void __init kmem_cache_invalid_free(void) kmem_cache_destroy(cache); } -static noinline void __init kasan_memchr(void) +static void kasan_memchr(struct kunit *test) { char *ptr; size_t size = 24; - pr_info("out-of-bounds in memchr\n"); - ptr = kmalloc(size, GFP_KERNEL | __GFP_ZERO); - if (!ptr) + /* See https://bugzilla.kernel.org/show_bug.cgi?id=206337 */ + if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) { + kunit_info(test, + "str* functions are not instrumented with CONFIG_AMD_MEM_ENCRYPT"); return; + } + + ptr = kmalloc(size, GFP_KERNEL | __GFP_ZERO); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); + + KUNIT_EXPECT_KASAN_FAIL(test, + kasan_ptr_result = memchr(ptr, '1', size + 1)); - kasan_ptr_result = memchr(ptr, '1', size + 1); kfree(ptr); } -static noinline void __init kasan_memcmp(void) +static void kasan_memcmp(struct kunit *test) { char *ptr; size_t size = 24; int arr[9]; - pr_info("out-of-bounds in memcmp\n"); - ptr = kmalloc(size, GFP_KERNEL | __GFP_ZERO); - if (!ptr) + /* See https://bugzilla.kernel.org/show_bug.cgi?id=206337 */ + if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) { + kunit_info(test, + "str* functions are not instrumented with CONFIG_AMD_MEM_ENCRYPT"); return; + } + ptr = kmalloc(size, GFP_KERNEL | __GFP_ZERO); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); memset(arr, 0, sizeof(arr)); - kasan_int_result = memcmp(ptr, arr, size + 1); + + KUNIT_EXPECT_KASAN_FAIL(test, + kasan_int_result = memcmp(ptr, arr, size+1)); kfree(ptr); } -static noinline void __init kasan_strings(void) +static void kasan_strings(struct kunit *test) { char *ptr; size_t size = 24; - pr_info("use-after-free in strchr\n"); - ptr = kmalloc(size, GFP_KERNEL | __GFP_ZERO); - if (!ptr) + /* See https://bugzilla.kernel.org/show_bug.cgi?id=206337 */ + if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) { + kunit_info(test, + "str* functions are not instrumented with CONFIG_AMD_MEM_ENCRYPT"); return; + } + + ptr = kmalloc(size, GFP_KERNEL | __GFP_ZERO); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); kfree(ptr); @@ -727,220 +606,164 @@ static noinline void __init kasan_strings(void) * will likely point to zeroed byte. */ ptr += 16; - kasan_ptr_result = strchr(ptr, '1'); + KUNIT_EXPECT_KASAN_FAIL(test, kasan_ptr_result = strchr(ptr, '1')); - pr_info("use-after-free in strrchr\n"); - kasan_ptr_result = strrchr(ptr, '1'); + KUNIT_EXPECT_KASAN_FAIL(test, kasan_ptr_result = strrchr(ptr, '1')); - pr_info("use-after-free in strcmp\n"); - kasan_int_result = strcmp(ptr, "2"); + KUNIT_EXPECT_KASAN_FAIL(test, kasan_int_result = strcmp(ptr, "2")); - pr_info("use-after-free in strncmp\n"); - kasan_int_result = strncmp(ptr, "2", 1); + KUNIT_EXPECT_KASAN_FAIL(test, kasan_int_result = strncmp(ptr, "2", 1)); - pr_info("use-after-free in strlen\n"); - kasan_int_result = strlen(ptr); + KUNIT_EXPECT_KASAN_FAIL(test, kasan_int_result = strlen(ptr)); - pr_info("use-after-free in strnlen\n"); - kasan_int_result = strnlen(ptr, 1); + KUNIT_EXPECT_KASAN_FAIL(test, kasan_int_result = strnlen(ptr, 1)); } -static noinline void __init kasan_bitops(void) +static void kasan_bitops(struct kunit *test) { /* * Allocate 1 more byte, which causes kzalloc to round up to 16-bytes; * this way we do not actually corrupt other memory. */ long *bits = kzalloc(sizeof(*bits) + 1, GFP_KERNEL); - if (!bits) - return; + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, bits); /* * Below calls try to access bit within allocated memory; however, the * below accesses are still out-of-bounds, since bitops are defined to * operate on the whole long the bit is in. */ - pr_info("out-of-bounds in set_bit\n"); - set_bit(BITS_PER_LONG, bits); + KUNIT_EXPECT_KASAN_FAIL(test, set_bit(BITS_PER_LONG, bits)); - pr_info("out-of-bounds in __set_bit\n"); - __set_bit(BITS_PER_LONG, bits); + KUNIT_EXPECT_KASAN_FAIL(test, __set_bit(BITS_PER_LONG, bits)); - pr_info("out-of-bounds in clear_bit\n"); - clear_bit(BITS_PER_LONG, bits); + KUNIT_EXPECT_KASAN_FAIL(test, clear_bit(BITS_PER_LONG, bits)); - pr_info("out-of-bounds in __clear_bit\n"); - __clear_bit(BITS_PER_LONG, bits); + KUNIT_EXPECT_KASAN_FAIL(test, __clear_bit(BITS_PER_LONG, bits)); - pr_info("out-of-bounds in clear_bit_unlock\n"); - clear_bit_unlock(BITS_PER_LONG, bits); + KUNIT_EXPECT_KASAN_FAIL(test, clear_bit_unlock(BITS_PER_LONG, bits)); - pr_info("out-of-bounds in __clear_bit_unlock\n"); - __clear_bit_unlock(BITS_PER_LONG, bits); + KUNIT_EXPECT_KASAN_FAIL(test, __clear_bit_unlock(BITS_PER_LONG, bits)); - pr_info("out-of-bounds in change_bit\n"); - change_bit(BITS_PER_LONG, bits); + KUNIT_EXPECT_KASAN_FAIL(test, change_bit(BITS_PER_LONG, bits)); - pr_info("out-of-bounds in __change_bit\n"); - __change_bit(BITS_PER_LONG, bits); + KUNIT_EXPECT_KASAN_FAIL(test, __change_bit(BITS_PER_LONG, bits)); /* * Below calls try to access bit beyond allocated memory. */ - pr_info("out-of-bounds in test_and_set_bit\n"); - test_and_set_bit(BITS_PER_LONG + BITS_PER_BYTE, bits); + KUNIT_EXPECT_KASAN_FAIL(test, + test_and_set_bit(BITS_PER_LONG + BITS_PER_BYTE, bits)); - pr_info("out-of-bounds in __test_and_set_bit\n"); - __test_and_set_bit(BITS_PER_LONG + BITS_PER_BYTE, bits); + KUNIT_EXPECT_KASAN_FAIL(test, + __test_and_set_bit(BITS_PER_LONG + BITS_PER_BYTE, bits)); - pr_info("out-of-bounds in test_and_set_bit_lock\n"); - test_and_set_bit_lock(BITS_PER_LONG + BITS_PER_BYTE, bits); + KUNIT_EXPECT_KASAN_FAIL(test, + test_and_set_bit_lock(BITS_PER_LONG + BITS_PER_BYTE, bits)); - pr_info("out-of-bounds in test_and_clear_bit\n"); - test_and_clear_bit(BITS_PER_LONG + BITS_PER_BYTE, bits); + KUNIT_EXPECT_KASAN_FAIL(test, + test_and_clear_bit(BITS_PER_LONG + BITS_PER_BYTE, bits)); - pr_info("out-of-bounds in __test_and_clear_bit\n"); - __test_and_clear_bit(BITS_PER_LONG + BITS_PER_BYTE, bits); + KUNIT_EXPECT_KASAN_FAIL(test, + __test_and_clear_bit(BITS_PER_LONG + BITS_PER_BYTE, bits)); - pr_info("out-of-bounds in test_and_change_bit\n"); - test_and_change_bit(BITS_PER_LONG + BITS_PER_BYTE, bits); + KUNIT_EXPECT_KASAN_FAIL(test, + test_and_change_bit(BITS_PER_LONG + BITS_PER_BYTE, bits)); - pr_info("out-of-bounds in __test_and_change_bit\n"); - __test_and_change_bit(BITS_PER_LONG + BITS_PER_BYTE, bits); + KUNIT_EXPECT_KASAN_FAIL(test, + __test_and_change_bit(BITS_PER_LONG + BITS_PER_BYTE, bits)); - pr_info("out-of-bounds in test_bit\n"); - kasan_int_result = test_bit(BITS_PER_LONG + BITS_PER_BYTE, bits); + KUNIT_EXPECT_KASAN_FAIL(test, + kasan_int_result = + test_bit(BITS_PER_LONG + BITS_PER_BYTE, bits)); #if defined(clear_bit_unlock_is_negative_byte) - pr_info("out-of-bounds in clear_bit_unlock_is_negative_byte\n"); - kasan_int_result = clear_bit_unlock_is_negative_byte(BITS_PER_LONG + - BITS_PER_BYTE, bits); + KUNIT_EXPECT_KASAN_FAIL(test, + kasan_int_result = clear_bit_unlock_is_negative_byte( + BITS_PER_LONG + BITS_PER_BYTE, bits)); #endif kfree(bits); } -static noinline void __init kmalloc_double_kzfree(void) +static void kmalloc_double_kzfree(struct kunit *test) { char *ptr; size_t size = 16; - pr_info("double-free (kfree_sensitive)\n"); ptr = kmalloc(size, GFP_KERNEL); - if (!ptr) { - pr_err("Allocation failed\n"); - return; - } + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); kfree_sensitive(ptr); - kfree_sensitive(ptr); + KUNIT_EXPECT_KASAN_FAIL(test, kfree_sensitive(ptr)); } -#ifdef CONFIG_KASAN_VMALLOC -static noinline void __init vmalloc_oob(void) +static void vmalloc_oob(struct kunit *test) { void *area; - pr_info("vmalloc out-of-bounds\n"); + if (!IS_ENABLED(CONFIG_KASAN_VMALLOC)) { + kunit_info(test, "CONFIG_KASAN_VMALLOC is not enabled."); + return; + } /* * We have to be careful not to hit the guard page. * The MMU will catch that and crash us. */ area = vmalloc(3000); - if (!area) { - pr_err("Allocation failed\n"); - return; - } + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, area); - ((volatile char *)area)[3100]; + KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)area)[3100]); vfree(area); } -#else -static void __init vmalloc_oob(void) {} -#endif -static struct kasan_rcu_info { - int i; - struct rcu_head rcu; -} *global_rcu_ptr; +static struct kunit_case kasan_kunit_test_cases[] = { + KUNIT_CASE(kmalloc_oob_right), + KUNIT_CASE(kmalloc_oob_left), + KUNIT_CASE(kmalloc_node_oob_right), + KUNIT_CASE(kmalloc_pagealloc_oob_right), + KUNIT_CASE(kmalloc_pagealloc_uaf), + KUNIT_CASE(kmalloc_pagealloc_invalid_free), + KUNIT_CASE(kmalloc_large_oob_right), + KUNIT_CASE(kmalloc_oob_krealloc_more), + KUNIT_CASE(kmalloc_oob_krealloc_less), + KUNIT_CASE(kmalloc_oob_16), + KUNIT_CASE(kmalloc_oob_in_memset), + KUNIT_CASE(kmalloc_oob_memset_2), + KUNIT_CASE(kmalloc_oob_memset_4), + KUNIT_CASE(kmalloc_oob_memset_8), + KUNIT_CASE(kmalloc_oob_memset_16), + KUNIT_CASE(kmalloc_memmove_invalid_size), + KUNIT_CASE(kmalloc_uaf), + KUNIT_CASE(kmalloc_uaf_memset), + KUNIT_CASE(kmalloc_uaf2), + KUNIT_CASE(kfree_via_page), + KUNIT_CASE(kfree_via_phys), + KUNIT_CASE(kmem_cache_oob), + KUNIT_CASE(memcg_accounted_kmem_cache), + KUNIT_CASE(kasan_global_oob), + KUNIT_CASE(kasan_stack_oob), + KUNIT_CASE(kasan_alloca_oob_left), + KUNIT_CASE(kasan_alloca_oob_right), + KUNIT_CASE(ksize_unpoisons_memory), + KUNIT_CASE(kmem_cache_double_free), + KUNIT_CASE(kmem_cache_invalid_free), + KUNIT_CASE(kasan_memchr), + KUNIT_CASE(kasan_memcmp), + KUNIT_CASE(kasan_strings), + KUNIT_CASE(kasan_bitops), + KUNIT_CASE(kmalloc_double_kzfree), + KUNIT_CASE(vmalloc_oob), + {} +}; -static noinline void __init kasan_rcu_reclaim(struct rcu_head *rp) -{ - struct kasan_rcu_info *fp = container_of(rp, - struct kasan_rcu_info, rcu); +static struct kunit_suite kasan_kunit_test_suite = { + .name = "kasan", + .init = kasan_test_init, + .test_cases = kasan_kunit_test_cases, + .exit = kasan_test_exit, +}; - kfree(fp); - fp->i = 1; -} +kunit_test_suite(kasan_kunit_test_suite); -static noinline void __init kasan_rcu_uaf(void) -{ - struct kasan_rcu_info *ptr; - - pr_info("use-after-free in kasan_rcu_reclaim\n"); - ptr = kmalloc(sizeof(struct kasan_rcu_info), GFP_KERNEL); - if (!ptr) { - pr_err("Allocation failed\n"); - return; - } - - global_rcu_ptr = rcu_dereference_protected(ptr, NULL); - call_rcu(&global_rcu_ptr->rcu, kasan_rcu_reclaim); -} - -static int __init kmalloc_tests_init(void) -{ - /* - * Temporarily enable multi-shot mode. Otherwise, we'd only get a - * report for the first case. - */ - bool multishot = kasan_save_enable_multi_shot(); - - kmalloc_oob_right(); - kmalloc_oob_left(); - kmalloc_node_oob_right(); -#ifdef CONFIG_SLUB - kmalloc_pagealloc_oob_right(); - kmalloc_pagealloc_uaf(); - kmalloc_pagealloc_invalid_free(); -#endif - kmalloc_large_oob_right(); - kmalloc_oob_krealloc_more(); - kmalloc_oob_krealloc_less(); - kmalloc_oob_16(); - kmalloc_oob_in_memset(); - kmalloc_oob_memset_2(); - kmalloc_oob_memset_4(); - kmalloc_oob_memset_8(); - kmalloc_oob_memset_16(); - kmalloc_memmove_invalid_size(); - kmalloc_uaf(); - kmalloc_uaf_memset(); - kmalloc_uaf2(); - kfree_via_page(); - kfree_via_phys(); - kmem_cache_oob(); - memcg_accounted_kmem_cache(); - kasan_stack_oob(); - kasan_global_oob(); - kasan_alloca_oob_left(); - kasan_alloca_oob_right(); - ksize_unpoisons_memory(); - copy_user_test(); - kmem_cache_double_free(); - kmem_cache_invalid_free(); - kasan_memchr(); - kasan_memcmp(); - kasan_strings(); - kasan_bitops(); - kmalloc_double_kzfree(); - vmalloc_oob(); - kasan_rcu_uaf(); - - kasan_restore_multi_shot(multishot); - - return -EAGAIN; -} - -module_init(kmalloc_tests_init); MODULE_LICENSE("GPL"); diff --git a/lib/test_kasan_module.c b/lib/test_kasan_module.c new file mode 100644 index 000000000000..2d68db6ae67b --- /dev/null +++ b/lib/test_kasan_module.c @@ -0,0 +1,111 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * + * Copyright (c) 2014 Samsung Electronics Co., Ltd. + * Author: Andrey Ryabinin + */ + +#define pr_fmt(fmt) "kasan test: %s " fmt, __func__ + +#include +#include +#include +#include +#include + +#include "../mm/kasan/kasan.h" + +#define OOB_TAG_OFF (IS_ENABLED(CONFIG_KASAN_GENERIC) ? 0 : KASAN_SHADOW_SCALE_SIZE) + +static noinline void __init copy_user_test(void) +{ + char *kmem; + char __user *usermem; + size_t size = 10; + int unused; + + kmem = kmalloc(size, GFP_KERNEL); + if (!kmem) + return; + + usermem = (char __user *)vm_mmap(NULL, 0, PAGE_SIZE, + PROT_READ | PROT_WRITE | PROT_EXEC, + MAP_ANONYMOUS | MAP_PRIVATE, 0); + if (IS_ERR(usermem)) { + pr_err("Failed to allocate user memory\n"); + kfree(kmem); + return; + } + + pr_info("out-of-bounds in copy_from_user()\n"); + unused = copy_from_user(kmem, usermem, size + 1 + OOB_TAG_OFF); + + pr_info("out-of-bounds in copy_to_user()\n"); + unused = copy_to_user(usermem, kmem, size + 1 + OOB_TAG_OFF); + + pr_info("out-of-bounds in __copy_from_user()\n"); + unused = __copy_from_user(kmem, usermem, size + 1 + OOB_TAG_OFF); + + pr_info("out-of-bounds in __copy_to_user()\n"); + unused = __copy_to_user(usermem, kmem, size + 1 + OOB_TAG_OFF); + + pr_info("out-of-bounds in __copy_from_user_inatomic()\n"); + unused = __copy_from_user_inatomic(kmem, usermem, size + 1 + OOB_TAG_OFF); + + pr_info("out-of-bounds in __copy_to_user_inatomic()\n"); + unused = __copy_to_user_inatomic(usermem, kmem, size + 1 + OOB_TAG_OFF); + + pr_info("out-of-bounds in strncpy_from_user()\n"); + unused = strncpy_from_user(kmem, usermem, size + 1 + OOB_TAG_OFF); + + vm_munmap((unsigned long)usermem, PAGE_SIZE); + kfree(kmem); +} + +static struct kasan_rcu_info { + int i; + struct rcu_head rcu; +} *global_rcu_ptr; + +static noinline void __init kasan_rcu_reclaim(struct rcu_head *rp) +{ + struct kasan_rcu_info *fp = container_of(rp, + struct kasan_rcu_info, rcu); + + kfree(fp); + fp->i = 1; +} + +static noinline void __init kasan_rcu_uaf(void) +{ + struct kasan_rcu_info *ptr; + + pr_info("use-after-free in kasan_rcu_reclaim\n"); + ptr = kmalloc(sizeof(struct kasan_rcu_info), GFP_KERNEL); + if (!ptr) { + pr_err("Allocation failed\n"); + return; + } + + global_rcu_ptr = rcu_dereference_protected(ptr, NULL); + call_rcu(&global_rcu_ptr->rcu, kasan_rcu_reclaim); +} + + +static int __init test_kasan_module_init(void) +{ + /* + * Temporarily enable multi-shot mode. Otherwise, we'd only get a + * report for the first case. + */ + bool multishot = kasan_save_enable_multi_shot(); + + copy_user_test(); + kasan_rcu_uaf(); + + kasan_restore_multi_shot(multishot); + return -EAGAIN; +} + +module_init(test_kasan_module_init); +MODULE_LICENSE("GPL"); From 9ab5be976898860f70f67257be725b891ded10ea Mon Sep 17 00:00:00 2001 From: Patricia Alfonso Date: Tue, 13 Oct 2020 16:55:09 -0700 Subject: [PATCH 123/181] KASAN: Testing Documentation Include documentation on how to test KASAN using CONFIG_TEST_KASAN_KUNIT and CONFIG_TEST_KASAN_MODULE. Signed-off-by: Patricia Alfonso Signed-off-by: David Gow Signed-off-by: Andrew Morton Tested-by: Andrey Konovalov Reviewed-by: Andrey Konovalov Reviewed-by: Dmitry Vyukov Acked-by: Brendan Higgins Cc: Andrey Ryabinin Cc: Ingo Molnar Cc: Juri Lelli Cc: Peter Zijlstra Cc: Shuah Khan Cc: Vincent Guittot Link: https://lkml.kernel.org/r/20200915035828.570483-5-davidgow@google.com Link: https://lkml.kernel.org/r/20200910070331.3358048-5-davidgow@google.com Signed-off-by: Linus Torvalds --- Documentation/dev-tools/kasan.rst | 70 +++++++++++++++++++++++++++++++ 1 file changed, 70 insertions(+) diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst index 4abc84b1798c..c09c9ca2ff1c 100644 --- a/Documentation/dev-tools/kasan.rst +++ b/Documentation/dev-tools/kasan.rst @@ -281,3 +281,73 @@ unmapped. This will require changes in arch-specific code. This allows ``VMAP_STACK`` support on x86, and can simplify support of architectures that do not have a fixed module region. + +CONFIG_KASAN_KUNIT_TEST & CONFIG_TEST_KASAN_MODULE +-------------------------------------------------- + +``CONFIG_KASAN_KUNIT_TEST`` utilizes the KUnit Test Framework for testing. +This means each test focuses on a small unit of functionality and +there are a few ways these tests can be run. + +Each test will print the KASAN report if an error is detected and then +print the number of the test and the status of the test: + +pass:: + + ok 28 - kmalloc_double_kzfree +or, if kmalloc failed:: + + # kmalloc_large_oob_right: ASSERTION FAILED at lib/test_kasan.c:163 + Expected ptr is not null, but is + not ok 4 - kmalloc_large_oob_right +or, if a KASAN report was expected, but not found:: + + # kmalloc_double_kzfree: EXPECTATION FAILED at lib/test_kasan.c:629 + Expected kasan_data->report_expected == kasan_data->report_found, but + kasan_data->report_expected == 1 + kasan_data->report_found == 0 + not ok 28 - kmalloc_double_kzfree + +All test statuses are tracked as they run and an overall status will +be printed at the end:: + + ok 1 - kasan + +or:: + + not ok 1 - kasan + +(1) Loadable Module +~~~~~~~~~~~~~~~~~~~~ + +With ``CONFIG_KUNIT`` enabled, ``CONFIG_KASAN_KUNIT_TEST`` can be built as +a loadable module and run on any architecture that supports KASAN +using something like insmod or modprobe. The module is called ``test_kasan``. + +(2) Built-In +~~~~~~~~~~~~~ + +With ``CONFIG_KUNIT`` built-in, ``CONFIG_KASAN_KUNIT_TEST`` can be built-in +on any architecure that supports KASAN. These and any other KUnit +tests enabled will run and print the results at boot as a late-init +call. + +(3) Using kunit_tool +~~~~~~~~~~~~~~~~~~~~~ + +With ``CONFIG_KUNIT`` and ``CONFIG_KASAN_KUNIT_TEST`` built-in, we can also +use kunit_tool to see the results of these along with other KUnit +tests in a more readable way. This will not print the KASAN reports +of tests that passed. Use `KUnit documentation `_ for more up-to-date +information on kunit_tool. + +.. _KUnit: https://www.kernel.org/doc/html/latest/dev-tools/kunit/index.html + +``CONFIG_TEST_KASAN_MODULE`` is a set of KASAN tests that could not be +converted to KUnit. These tests can be run only as a module with +``CONFIG_TEST_KASAN_MODULE`` built as a loadable module and +``CONFIG_KASAN`` built-in. The type of error expected and the +function being run is printed before the expression expected to give +an error. Then the error is printed, if found, and that test +should be interpretted to pass only if the error was the one expected +by the test. From be4f1ae978ffe98cc95ec49ceb95386fb4474974 Mon Sep 17 00:00:00 2001 From: David Gow Date: Tue, 13 Oct 2020 16:55:13 -0700 Subject: [PATCH 124/181] mm: kasan: do not panic if both panic_on_warn and kasan_multishot set KASAN errors will currently trigger a panic when panic_on_warn is set. This renders kasan_multishot useless, as further KASAN errors won't be reported if the kernel has already paniced. By making kasan_multishot disable this behaviour for KASAN errors, we can still have the benefits of panic_on_warn for non-KASAN warnings, yet be able to use kasan_multishot. This is particularly important when running KASAN tests, which need to trigger multiple KASAN errors: previously these would panic the system if panic_on_warn was set, now they can run (and will panic the system should non-KASAN warnings show up). Signed-off-by: David Gow Signed-off-by: Andrew Morton Tested-by: Andrey Konovalov Reviewed-by: Andrey Konovalov Reviewed-by: Brendan Higgins Cc: Andrey Ryabinin Cc: Dmitry Vyukov Cc: Ingo Molnar Cc: Juri Lelli Cc: Patricia Alfonso Cc: Peter Zijlstra Cc: Shuah Khan Cc: Vincent Guittot Link: https://lkml.kernel.org/r/20200915035828.570483-6-davidgow@google.com Link: https://lkml.kernel.org/r/20200910070331.3358048-6-davidgow@google.com Signed-off-by: Linus Torvalds --- mm/kasan/report.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/kasan/report.c b/mm/kasan/report.c index e2c14b10bc81..00a53f1355ae 100644 --- a/mm/kasan/report.c +++ b/mm/kasan/report.c @@ -95,7 +95,7 @@ static void end_report(unsigned long *flags) pr_err("==================================================================\n"); add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE); spin_unlock_irqrestore(&report_lock, *flags); - if (panic_on_warn) { + if (panic_on_warn && !test_bit(KASAN_BIT_MULTI_SHOT, &kasan_flags)) { /* * This thread may hit another WARN() in the panic path. * Resetting this prevents additional WARN() from panicking the From c9c510dc2964420038f8527125a2cd5d8fb79cb6 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Tue, 13 Oct 2020 16:55:17 -0700 Subject: [PATCH 125/181] mm/page_alloc: tweak comments in has_unmovable_pages() Patch series "mm / virtio-mem: support ZONE_MOVABLE", v5. When introducing virtio-mem, the semantics of ZONE_MOVABLE were rather unclear, which is why we special-cased ZONE_MOVABLE such that partially plugged blocks would never end up in ZONE_MOVABLE. Now that the semantics are much clearer (and are documented in patch #6), let's support partially plugged memory blocks in ZONE_MOVABLE, allowing partially plugged memory blocks to be online to ZONE_MOVABLE and also unplugging from such memory blocks. This avoids surprises when onlining of memory blocks suddenly fails, just because they are not completely populated by virtio-mem (yet). This is especially helpful for testing, but also paves the way for virtio-mem optimizations, allowing more memory to get reliably unplugged. Cleanup has_unmovable_pages() and set_migratetype_isolate(), providing better documentation of how ZONE_MOVABLE interacts with different kind of unmovable pages (memory offlining vs. alloc_contig_range()). This patch (of 6): Let's move the split comment regarding bootmem allocations and memory holes, especially in the context of ZONE_MOVABLE, to the PageReserved() check. Signed-off-by: David Hildenbrand Signed-off-by: Andrew Morton Reviewed-by: Baoquan He Cc: Michal Hocko Cc: Michael S. Tsirkin Cc: Mike Kravetz Cc: Pankaj Gupta Cc: Jason Wang Cc: Mike Rapoport Cc: Qian Cai Link: http://lkml.kernel.org/r/20200816125333.7434-1-david@redhat.com Link: http://lkml.kernel.org/r/20200816125333.7434-2-david@redhat.com Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 22 ++++++---------------- 1 file changed, 6 insertions(+), 16 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 780c8f023b28..ff0b14b0e8d7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -8235,14 +8235,6 @@ struct page *has_unmovable_pages(struct zone *zone, struct page *page, unsigned long iter = 0; unsigned long pfn = page_to_pfn(page); - /* - * TODO we could make this much more efficient by not checking every - * page in the range if we know all of them are in MOVABLE_ZONE and - * that the movable zone guarantees that pages are migratable but - * the later is not the case right now unfortunatelly. E.g. movablecore - * can still lead to having bootmem allocations in zone_movable. - */ - if (is_migrate_cma_page(page)) { /* * CMA allocations (alloc_contig_range) really need to mark @@ -8261,6 +8253,12 @@ struct page *has_unmovable_pages(struct zone *zone, struct page *page, page = pfn_to_page(pfn + iter); + /* + * Both, bootmem allocations and memory holes are marked + * PG_reserved and are unmovable. We can even have unmovable + * allocations inside ZONE_MOVABLE, for example when + * specifying "movablecore". + */ if (PageReserved(page)) return page; @@ -8334,14 +8332,6 @@ struct page *has_unmovable_pages(struct zone *zone, struct page *page, * it. But now, memory offline itself doesn't call * shrink_node_slabs() and it still to be fixed. */ - /* - * If the page is not RAM, page_count()should be 0. - * we don't need more check. This is an _used_ not-movable page. - * - * The problematic thing here is PG_reserved pages. PG_reserved - * is set to both of a memory hole page and a _used_ kernel - * page at boot. - */ return page; } return NULL; From 51030a53d81e308f55e0e1d2048d23d8c8d16e5b Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Tue, 13 Oct 2020 16:55:21 -0700 Subject: [PATCH 126/181] mm/page_isolation: exit early when pageblock is isolated in set_migratetype_isolate() Right now, if we have two isolations racing on a pageblock that's in the MOVABLE zone, we would trigger the WARN_ON_ONCE(). Let's just return directly, simplifying error handling. The change was introduced in commit 3d680bdf60a5 ("mm/page_isolation: fix potential warning from user"). As far as I can see, we currently don't have alloc_contig_range() users that use the ZONE_MOVABLE (anymore), so it's currently more a cleanup and a preparation for the future than a fix. Signed-off-by: David Hildenbrand Signed-off-by: Andrew Morton Reviewed-by: Baoquan He Reviewed-by: Pankaj Gupta Acked-by: Mike Kravetz Cc: Michal Hocko Cc: Michael S. Tsirkin Cc: Qian Cai Cc: Jason Wang Cc: Mike Rapoport Link: http://lkml.kernel.org/r/20200816125333.7434-3-david@redhat.com Signed-off-by: Linus Torvalds --- mm/page_isolation.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/mm/page_isolation.c b/mm/page_isolation.c index 63a3db10a8c0..ad3aa7ac59a7 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -29,10 +29,12 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_ /* * We assume the caller intended to SET migrate type to isolate. * If it is already set, then someone else must have raced and - * set it before us. Return -EBUSY + * set it before us. */ - if (is_migrate_isolate_page(page)) - goto out; + if (is_migrate_isolate_page(page)) { + spin_unlock_irqrestore(&zone->lock, flags); + return -EBUSY; + } /* * FIXME: Now, memory hotplug doesn't call shrink_slab() by itself. @@ -52,7 +54,6 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_ ret = 0; } -out: spin_unlock_irqrestore(&zone->lock, flags); if (!ret) { drain_all_pages(zone); From 48381d7e4c1fbbbd67993aef822f7f79ca3dc194 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Tue, 13 Oct 2020 16:55:24 -0700 Subject: [PATCH 127/181] mm/page_isolation: drop WARN_ON_ONCE() in set_migratetype_isolate() Inside has_unmovable_pages(), we have a comment describing how unmovable data could end up in ZONE_MOVABLE - via "movablecore". Also, besides checking if the first page in the pageblock is reserved, we don't perform any further checks in case of ZONE_MOVABLE. In case of memory offlining, we set REPORT_FAILURE, properly dump_page() the page and handle the error gracefully. alloc_contig_pages() users currently never allocate from ZONE_MOVABLE. E.g., hugetlb uses alloc_contig_pages() for the allocation of gigantic pages only, which will never end up on the MOVABLE zone (see htlb_alloc_mask()). Signed-off-by: David Hildenbrand Signed-off-by: Andrew Morton Reviewed-by: Baoquan He Cc: Michal Hocko Cc: Michael S. Tsirkin Cc: Mike Kravetz Cc: Pankaj Gupta Cc: Jason Wang Cc: Mike Rapoport Cc: Qian Cai Link: http://lkml.kernel.org/r/20200816125333.7434-4-david@redhat.com Signed-off-by: Linus Torvalds --- mm/page_isolation.c | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/mm/page_isolation.c b/mm/page_isolation.c index ad3aa7ac59a7..dfc55fae6556 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -57,15 +57,12 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_ spin_unlock_irqrestore(&zone->lock, flags); if (!ret) { drain_all_pages(zone); - } else { - WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE); - - if ((isol_flags & REPORT_FAILURE) && unmovable) - /* - * printk() with zone->lock held will likely trigger a - * lockdep splat, so defer it here. - */ - dump_page(unmovable, "unmovable page"); + } else if ((isol_flags & REPORT_FAILURE) && unmovable) { + /* + * printk() with zone->lock held will likely trigger a + * lockdep splat, so defer it here. + */ + dump_page(unmovable, "unmovable page"); } return ret; From 1c31cb493c31441562d1a548a4430aaa54157480 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Tue, 13 Oct 2020 16:55:28 -0700 Subject: [PATCH 128/181] mm/page_isolation: cleanup set_migratetype_isolate() Let's clean it up a bit, simplifying the exit paths. Signed-off-by: David Hildenbrand Signed-off-by: Andrew Morton Reviewed-by: Baoquan He Reviewed-by: Pankaj Gupta Cc: Michal Hocko Cc: Michael S. Tsirkin Cc: Mike Kravetz Cc: Jason Wang Cc: Mike Rapoport Cc: Qian Cai Link: http://lkml.kernel.org/r/20200816125333.7434-5-david@redhat.com Signed-off-by: Linus Torvalds --- mm/page_isolation.c | 17 +++++++---------- 1 file changed, 7 insertions(+), 10 deletions(-) diff --git a/mm/page_isolation.c b/mm/page_isolation.c index dfc55fae6556..aa94afb63823 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -17,12 +17,9 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_flags) { - struct page *unmovable = NULL; - struct zone *zone; + struct zone *zone = page_zone(page); + struct page *unmovable; unsigned long flags; - int ret = -EBUSY; - - zone = page_zone(page); spin_lock_irqsave(&zone->lock, flags); @@ -51,13 +48,13 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_ NULL); __mod_zone_freepage_state(zone, -nr_pages, mt); - ret = 0; + spin_unlock_irqrestore(&zone->lock, flags); + drain_all_pages(zone); + return 0; } spin_unlock_irqrestore(&zone->lock, flags); - if (!ret) { - drain_all_pages(zone); - } else if ((isol_flags & REPORT_FAILURE) && unmovable) { + if (isol_flags & REPORT_FAILURE) { /* * printk() with zone->lock held will likely trigger a * lockdep splat, so defer it here. @@ -65,7 +62,7 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_ dump_page(unmovable, "unmovable page"); } - return ret; + return -EBUSY; } static void unset_migratetype_isolate(struct page *page, unsigned migratetype) From 27f852795a0684781750b95141c6d88be102ca5b Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Tue, 13 Oct 2020 16:55:31 -0700 Subject: [PATCH 129/181] virtio-mem: don't special-case ZONE_MOVABLE When introducing virtio-mem, the semantics of ZONE_MOVABLE were rather unclear, which is why we special-cased ZONE_MOVABLE such that partially plugged blocks would never end up in ZONE_MOVABLE. Now that the semantics are much clearer (and will be documented in a follow-up patch including the new virtio-mem behavior), let's allow to online partially plugged memory blocks to ZONE_MOVABLE and also consider memory blocks that were onlined to ZONE_MOVABLE when unplugging memory. While unplugged memory pages are, in general, unmovable, they can be skipped when offlining memory. virtio-mem only unplugs fairly big chunks (in the megabyte range) and rather tries to shrink the memory region than randomly choosing memory. In theory, if all other pages in the movable zone would be movable, virtio-mem would only shrink that zone and not create any kind of fragmentation. In the future, we might want to remember the zone again and use the information when (un)plugging memory. For now, let's keep it simple. Note: Support for defragmentation is planned, to deal with fragmentation after unplug due to memory chunks within memory blocks that could not get unplugged before (e.g., somebody pinning pages within ZONE_MOVABLE for a longer time). Signed-off-by: David Hildenbrand Signed-off-by: Andrew Morton Cc: Michal Hocko Cc: Michael S. Tsirkin Cc: Jason Wang Cc: Mike Kravetz Cc: Pankaj Gupta Cc: Baoquan He Cc: Mike Rapoport Cc: Qian Cai Link: http://lkml.kernel.org/r/20200816125333.7434-6-david@redhat.com Signed-off-by: Linus Torvalds --- drivers/virtio/virtio_mem.c | 47 +++++++------------------------------ 1 file changed, 8 insertions(+), 39 deletions(-) diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c index c08512fcea90..834b7c13ef3d 100644 --- a/drivers/virtio/virtio_mem.c +++ b/drivers/virtio/virtio_mem.c @@ -36,18 +36,10 @@ enum virtio_mem_mb_state { VIRTIO_MEM_MB_STATE_OFFLINE, /* Partially plugged, fully added to Linux, offline. */ VIRTIO_MEM_MB_STATE_OFFLINE_PARTIAL, - /* Fully plugged, fully added to Linux, online (!ZONE_MOVABLE). */ + /* Fully plugged, fully added to Linux, online. */ VIRTIO_MEM_MB_STATE_ONLINE, - /* Partially plugged, fully added to Linux, online (!ZONE_MOVABLE). */ + /* Partially plugged, fully added to Linux, online. */ VIRTIO_MEM_MB_STATE_ONLINE_PARTIAL, - /* - * Fully plugged, fully added to Linux, online (ZONE_MOVABLE). - * We are not allowed to allocate (unplug) parts of this block that - * are not movable (similar to gigantic pages). We will never allow - * to online OFFLINE_PARTIAL to ZONE_MOVABLE (as they would contain - * unmovable parts). - */ - VIRTIO_MEM_MB_STATE_ONLINE_MOVABLE, VIRTIO_MEM_MB_STATE_COUNT }; @@ -526,21 +518,10 @@ static bool virtio_mem_owned_mb(struct virtio_mem *vm, unsigned long mb_id) } static int virtio_mem_notify_going_online(struct virtio_mem *vm, - unsigned long mb_id, - enum zone_type zone) + unsigned long mb_id) { switch (virtio_mem_mb_get_state(vm, mb_id)) { case VIRTIO_MEM_MB_STATE_OFFLINE_PARTIAL: - /* - * We won't allow to online a partially plugged memory block - * to the MOVABLE zone - it would contain unmovable parts. - */ - if (zone == ZONE_MOVABLE) { - dev_warn_ratelimited(&vm->vdev->dev, - "memory block has holes, MOVABLE not supported\n"); - return NOTIFY_BAD; - } - return NOTIFY_OK; case VIRTIO_MEM_MB_STATE_OFFLINE: return NOTIFY_OK; default: @@ -560,7 +541,6 @@ static void virtio_mem_notify_offline(struct virtio_mem *vm, VIRTIO_MEM_MB_STATE_OFFLINE_PARTIAL); break; case VIRTIO_MEM_MB_STATE_ONLINE: - case VIRTIO_MEM_MB_STATE_ONLINE_MOVABLE: virtio_mem_mb_set_state(vm, mb_id, VIRTIO_MEM_MB_STATE_OFFLINE); break; @@ -579,24 +559,17 @@ static void virtio_mem_notify_offline(struct virtio_mem *vm, virtio_mem_retry(vm); } -static void virtio_mem_notify_online(struct virtio_mem *vm, unsigned long mb_id, - enum zone_type zone) +static void virtio_mem_notify_online(struct virtio_mem *vm, unsigned long mb_id) { unsigned long nb_offline; switch (virtio_mem_mb_get_state(vm, mb_id)) { case VIRTIO_MEM_MB_STATE_OFFLINE_PARTIAL: - BUG_ON(zone == ZONE_MOVABLE); virtio_mem_mb_set_state(vm, mb_id, VIRTIO_MEM_MB_STATE_ONLINE_PARTIAL); break; case VIRTIO_MEM_MB_STATE_OFFLINE: - if (zone == ZONE_MOVABLE) - virtio_mem_mb_set_state(vm, mb_id, - VIRTIO_MEM_MB_STATE_ONLINE_MOVABLE); - else - virtio_mem_mb_set_state(vm, mb_id, - VIRTIO_MEM_MB_STATE_ONLINE); + virtio_mem_mb_set_state(vm, mb_id, VIRTIO_MEM_MB_STATE_ONLINE); break; default: BUG(); @@ -675,7 +648,6 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb, const unsigned long start = PFN_PHYS(mhp->start_pfn); const unsigned long size = PFN_PHYS(mhp->nr_pages); const unsigned long mb_id = virtio_mem_phys_to_mb_id(start); - enum zone_type zone; int rc = NOTIFY_OK; if (!virtio_mem_overlaps_range(vm, start, size)) @@ -717,8 +689,7 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb, break; } vm->hotplug_active = true; - zone = page_zonenum(pfn_to_page(mhp->start_pfn)); - rc = virtio_mem_notify_going_online(vm, mb_id, zone); + rc = virtio_mem_notify_going_online(vm, mb_id); break; case MEM_OFFLINE: virtio_mem_notify_offline(vm, mb_id); @@ -726,8 +697,7 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb, mutex_unlock(&vm->hotplug_mutex); break; case MEM_ONLINE: - zone = page_zonenum(pfn_to_page(mhp->start_pfn)); - virtio_mem_notify_online(vm, mb_id, zone); + virtio_mem_notify_online(vm, mb_id); vm->hotplug_active = false; mutex_unlock(&vm->hotplug_mutex); break; @@ -1906,8 +1876,7 @@ static void virtio_mem_remove(struct virtio_device *vdev) if (vm->nb_mb_state[VIRTIO_MEM_MB_STATE_OFFLINE] || vm->nb_mb_state[VIRTIO_MEM_MB_STATE_OFFLINE_PARTIAL] || vm->nb_mb_state[VIRTIO_MEM_MB_STATE_ONLINE] || - vm->nb_mb_state[VIRTIO_MEM_MB_STATE_ONLINE_PARTIAL] || - vm->nb_mb_state[VIRTIO_MEM_MB_STATE_ONLINE_MOVABLE]) { + vm->nb_mb_state[VIRTIO_MEM_MB_STATE_ONLINE_PARTIAL]) { dev_warn(&vdev->dev, "device still has system memory added\n"); } else { virtio_mem_delete_resource(vm); From 9181a980625a45425085ccec0fc38074a16470a5 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Tue, 13 Oct 2020 16:55:35 -0700 Subject: [PATCH 130/181] mm: document semantics of ZONE_MOVABLE Let's document what ZONE_MOVABLE means, how it's used, and which special cases we have regarding unmovable pages (memory offlining vs. migration / allocations). Signed-off-by: David Hildenbrand Signed-off-by: Andrew Morton Acked-by: Mike Rapoport Cc: Michal Hocko Cc: Michael S. Tsirkin Cc: Mike Kravetz Cc: Pankaj Gupta Cc: Baoquan He Cc: Jason Wang Cc: Qian Cai Link: http://lkml.kernel.org/r/20200816125333.7434-7-david@redhat.com Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 35 +++++++++++++++++++++++++++++++++++ 1 file changed, 35 insertions(+) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 0f7a4ff4b059..927bd7e98a88 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -396,6 +396,41 @@ enum zone_type { */ ZONE_HIGHMEM, #endif + /* + * ZONE_MOVABLE is similar to ZONE_NORMAL, except that it contains + * movable pages with few exceptional cases described below. Main use + * cases for ZONE_MOVABLE are to make memory offlining/unplug more + * likely to succeed, and to locally limit unmovable allocations - e.g., + * to increase the number of THP/huge pages. Notable special cases are: + * + * 1. Pinned pages: (long-term) pinning of movable pages might + * essentially turn such pages unmovable. Memory offlining might + * retry a long time. + * 2. memblock allocations: kernelcore/movablecore setups might create + * situations where ZONE_MOVABLE contains unmovable allocations + * after boot. Memory offlining and allocations fail early. + * 3. Memory holes: kernelcore/movablecore setups might create very rare + * situations where ZONE_MOVABLE contains memory holes after boot, + * for example, if we have sections that are only partially + * populated. Memory offlining and allocations fail early. + * 4. PG_hwpoison pages: while poisoned pages can be skipped during + * memory offlining, such pages cannot be allocated. + * 5. Unmovable PG_offline pages: in paravirtualized environments, + * hotplugged memory blocks might only partially be managed by the + * buddy (e.g., via XEN-balloon, Hyper-V balloon, virtio-mem). The + * parts not manged by the buddy are unmovable PG_offline pages. In + * some cases (virtio-mem), such pages can be skipped during + * memory offlining, however, cannot be moved/allocated. These + * techniques might use alloc_contig_range() to hide previously + * exposed pages from the buddy again (e.g., to implement some sort + * of memory unplug in virtio-mem). + * + * In general, no unmovable allocations that degrade memory offlining + * should end up in ZONE_MOVABLE. Allocators (like alloc_contig_range()) + * have to expect that migrating pages in ZONE_MOVABLE can fail (even + * if has_unmovable_pages() states that there are no unmovable pages, + * there can be false negatives). + */ ZONE_MOVABLE, #ifdef CONFIG_ZONE_DEVICE ZONE_DEVICE, From 6a654e36fa51a4ae1f109b0b30a23d3f097b3d8a Mon Sep 17 00:00:00 2001 From: Li Xinhai Date: Tue, 13 Oct 2020 16:55:39 -0700 Subject: [PATCH 131/181] mm, isolation: avoid checking unmovable pages across pageblock boundary In has_unmovable_pages(), the page parameter would not always be the first page within a pageblock (see how the page pointer is passed in from start_isolate_page_range() after call __first_valid_page()), so that would cause checking unmovable pages span two pageblocks. After this patch, the checking is enforced within one pageblock no matter the page is first one or not, and obey the semantics of this function. This issue is found by code inspection. Michal said "this might lead to false negatives when an unrelated block would cause an isolation failure". Signed-off-by: Li Xinhai Signed-off-by: Andrew Morton Reviewed-by: Oscar Salvador Acked-by: Michal Hocko Cc: David Hildenbrand Link: https://lkml.kernel.org/r/20200824065811.383266-1-lixinhai.lxh@gmail.com Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ff0b14b0e8d7..b9f9b51e0342 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -8234,6 +8234,7 @@ struct page *has_unmovable_pages(struct zone *zone, struct page *page, { unsigned long iter = 0; unsigned long pfn = page_to_pfn(page); + unsigned long offset = pfn % pageblock_nr_pages; if (is_migrate_cma_page(page)) { /* @@ -8247,7 +8248,7 @@ struct page *has_unmovable_pages(struct zone *zone, struct page *page, return page; } - for (; iter < pageblock_nr_pages; iter++) { + for (; iter < pageblock_nr_pages - offset; iter++) { if (!pfn_valid_within(pfn + iter)) continue; From b630749f018c39f68e6ba0db95ac1cbd66cf0cbb Mon Sep 17 00:00:00 2001 From: Mateusz Nosek Date: Tue, 13 Oct 2020 16:55:42 -0700 Subject: [PATCH 132/181] mm/page_alloc.c: clean code by removing unnecessary initialization Previously variable 'tmp' was initialized, but was not read later before reassigning. So the initialization can be removed. [akpm@linux-foundation.org: remove `tmp' altogether] Signed-off-by: Mateusz Nosek Signed-off-by: Andrew Morton Link: https://lkml.kernel.org/r/20200904132422.17387-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b9f9b51e0342..316fff2780c0 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5651,7 +5651,6 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask) int n, val; int min_val = INT_MAX; int best_node = NUMA_NO_NODE; - const struct cpumask *tmp = cpumask_of_node(0); /* Use the local node if we haven't already */ if (!node_isset(node, *used_node_mask)) { @@ -5672,8 +5671,7 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask) val += (n < node); /* Give preference to headless and unused nodes */ - tmp = cpumask_of_node(n); - if (!cpumask_empty(tmp)) + if (!cpumask_empty(cpumask_of_node(n))) val += PENALTY_FOR_NODE_WITH_CPUS; /* Slight preference for less loaded node */ From cfb4a541918498016ea9fcd7c44fd87b99fb5701 Mon Sep 17 00:00:00 2001 From: Mateusz Nosek Date: Tue, 13 Oct 2020 16:55:45 -0700 Subject: [PATCH 133/181] mm/page_alloc.c: micro-optimization remove unnecessary branch Previously flags check was separated into two separated checks with two separated branches. In case of presence of any of two mentioned flags, the same effect on flow occurs. Therefore checks can be merged and one branch can be avoided. Signed-off-by: Mateusz Nosek Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: https://lkml.kernel.org/r/20200911092310.31136-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 316fff2780c0..cd43e8bdf156 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3986,8 +3986,10 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, * success so it is time to admit defeat. We will skip the OOM killer * because it is very likely that the caller has a more reasonable * fallback than shooting a random task. + * + * The OOM killer may not free memory on a specific node. */ - if (gfp_mask & __GFP_RETRY_MAYFAIL) + if (gfp_mask & (__GFP_RETRY_MAYFAIL | __GFP_THISNODE)) goto out; /* The OOM killer does not needlessly kill tasks for lowmem */ if (ac->highest_zoneidx < ZONE_NORMAL) @@ -4004,10 +4006,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, * failures more gracefully we should just bail out here. */ - /* The OOM killer may not free memory on a specific node */ - if (gfp_mask & __GFP_THISNODE) - goto out; - /* Exhausted what can be done so it's blame time */ if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) { *did_some_progress = 1; From fdd4fa1cd90441fd1d393e49c21c00ed73ef929c Mon Sep 17 00:00:00 2001 From: Mateusz Nosek Date: Tue, 13 Oct 2020 16:55:48 -0700 Subject: [PATCH 134/181] mm/page_alloc.c: fix early params garbage value accesses Previously in '__init early_init_on_alloc' and '__init early_init_on_free' the return values from 'kstrtobool' were not handled properly. That caused potential garbage value read from variable 'bool_result'. Introduced patch fixes error handling. Signed-off-by: Mateusz Nosek Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: https://lkml.kernel.org/r/20200916214125.28271-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index cd43e8bdf156..0907884e6913 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -156,16 +156,16 @@ static int __init early_init_on_alloc(char *buf) int ret; bool bool_result; - if (!buf) - return -EINVAL; ret = kstrtobool(buf, &bool_result); + if (ret) + return ret; if (bool_result && page_poisoning_enabled()) pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, will take precedence over init_on_alloc\n"); if (bool_result) static_branch_enable(&init_on_alloc); else static_branch_disable(&init_on_alloc); - return ret; + return 0; } early_param("init_on_alloc", early_init_on_alloc); @@ -174,16 +174,16 @@ static int __init early_init_on_free(char *buf) int ret; bool bool_result; - if (!buf) - return -EINVAL; ret = kstrtobool(buf, &bool_result); + if (ret) + return ret; if (bool_result && page_poisoning_enabled()) pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, will take precedence over init_on_free\n"); if (bool_result) static_branch_enable(&init_on_free); else static_branch_disable(&init_on_free); - return ret; + return 0; } early_param("init_on_free", early_init_on_free); From a0622d05374b61a37894df314c313736d1e17d0c Mon Sep 17 00:00:00 2001 From: Mateusz Nosek Date: Tue, 13 Oct 2020 16:55:51 -0700 Subject: [PATCH 135/181] mm/page_alloc.c: clean code by merging two functions finalise_ac() is just 'epilogue' for 'prepare_alloc_pages'. Therefore there is no need to keep them both so 'finalise_ac' content can be merged into prepare_alloc_pages() code. It would make __alloc_pages_nodemask() cleaner when it comes to readability. Signed-off-by: Mateusz Nosek Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Cc: Mel Gorman Cc: Mike Rapoport Link: https://lkml.kernel.org/r/20200916110118.6537-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 10 ++-------- 1 file changed, 2 insertions(+), 8 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0907884e6913..12bac250c8e4 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4838,12 +4838,6 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order, *alloc_flags = current_alloc_flags(gfp_mask, *alloc_flags); - return true; -} - -/* Determine whether to spread dirty pages and what the first usable zone */ -static inline void finalise_ac(gfp_t gfp_mask, struct alloc_context *ac) -{ /* Dirty zone balancing only done in the fast path */ ac->spread_dirty_pages = (gfp_mask & __GFP_WRITE); @@ -4854,6 +4848,8 @@ static inline void finalise_ac(gfp_t gfp_mask, struct alloc_context *ac) */ ac->preferred_zoneref = first_zones_zonelist(ac->zonelist, ac->highest_zoneidx, ac->nodemask); + + return true; } /* @@ -4882,8 +4878,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid, if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags)) return NULL; - finalise_ac(gfp_mask, &ac); - /* * Forbid the first pass from falling back to types that fragment * memory until all local zones are considered. From 2187e17b02036ecd0ee83e225b28b968d3788a71 Mon Sep 17 00:00:00 2001 From: Yanfei Xu Date: Tue, 13 Oct 2020 16:55:54 -0700 Subject: [PATCH 136/181] mm/page_alloc.c: __perform_reclaim should return 'unsigned long' __perform_reclaim()'s single caller expects it to return 'unsigned long', hence change its return value and a local variable to 'unsigned long'. Suggested-by: Andrew Morton Signed-off-by: Yanfei Xu Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: https://lkml.kernel.org/r/20200916022138.16740-1-yanfei.xu@windriver.com Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 12bac250c8e4..a105c657be37 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4253,13 +4253,12 @@ EXPORT_SYMBOL_GPL(fs_reclaim_release); #endif /* Perform direct synchronous page reclaim */ -static int +static unsigned long __perform_reclaim(gfp_t gfp_mask, unsigned int order, const struct alloc_context *ac) { - int progress; unsigned int noreclaim_flag; - unsigned long pflags; + unsigned long pflags, progress; cond_resched(); From 30d8ec73e8772b32a7eae626d14004bd37d8f13c Mon Sep 17 00:00:00 2001 From: Mateusz Nosek Date: Tue, 13 Oct 2020 16:55:57 -0700 Subject: [PATCH 137/181] mmzone: clean code by removing unused macro parameter Previously 'for_next_zone_zonelist_nodemask' macro parameter 'zlist' was unused so this patch removes it. Signed-off-by: Mateusz Nosek Signed-off-by: Andrew Morton Link: https://lkml.kernel.org/r/20200917211906.30059-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 2 +- mm/page_alloc.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 927bd7e98a88..c27fb1faffe5 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1116,7 +1116,7 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist, z = next_zones_zonelist(++z, highidx, nodemask), \ zone = zonelist_zone(z)) -#define for_next_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \ +#define for_next_zone_zonelist_nodemask(zone, z, highidx, nodemask) \ for (zone = z->zone; \ zone; \ z = next_zones_zonelist(++z, highidx, nodemask), \ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a105c657be37..6263a97e39c6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3741,8 +3741,8 @@ retry: */ no_fallback = alloc_flags & ALLOC_NOFRAGMENT; z = ac->preferred_zoneref; - for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, - ac->highest_zoneidx, ac->nodemask) { + for_next_zone_zonelist_nodemask(zone, z, ac->highest_zoneidx, + ac->nodemask) { struct page *page; unsigned long mark; From a9b576f7253e22528584f5aeb46edf0b6b007992 Mon Sep 17 00:00:00 2001 From: Ralph Campbell Date: Tue, 13 Oct 2020 16:56:00 -0700 Subject: [PATCH 138/181] mm: move call to compound_head() in release_pages() The function is_huge_zero_page() doesn't call compound_head() to make sure the page pointer is a head page. The call to is_huge_zero_page() in release_pages() is made before compound_head() is called so the test would fail if release_pages() was called with a tail page of the huge_zero_page and put_page_testzero() would be called releasing the page. This is unlikely to be happening in normal use or we would be seeing all sorts of process data corruption when accessing a THP zero page. Looking at other places where is_huge_zero_page() is called, all seem to only pass a head page so I think the right solution is to move the call to compound_head() in release_pages() to a point before calling is_huge_zero_page(). Signed-off-by: Ralph Campbell Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Cc: Yu Zhao Cc: Dan Williams Cc: Matthew Wilcox Cc: Christoph Hellwig Link: https://lkml.kernel.org/r/20200917173938.16420-1-rcampbell@nvidia.com Signed-off-by: Linus Torvalds --- mm/swap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/swap.c b/mm/swap.c index f41ccd8eae94..47a47681c86b 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -889,6 +889,7 @@ void release_pages(struct page **pages, int nr) locked_pgdat = NULL; } + page = compound_head(page); if (is_huge_zero_page(page)) continue; @@ -910,7 +911,6 @@ void release_pages(struct page **pages, int nr) } } - page = compound_head(page); if (!put_page_testzero(page)) continue; From e320d3012d25b1fb5f3df4edb7bd44a1c362ec10 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 13 Oct 2020 16:56:04 -0700 Subject: [PATCH 139/181] mm/page_alloc.c: fix freeing non-compound pages Here is a very rare race which leaks memory: Page P0 is allocated to the page cache. Page P1 is free. Thread A Thread B Thread C find_get_entry(): xas_load() returns P0 Removes P0 from page cache P0 finds its buddy P1 alloc_pages(GFP_KERNEL, 1) returns P0 P0 has refcount 1 page_cache_get_speculative(P0) P0 has refcount 2 __free_pages(P0) P0 has refcount 1 put_page(P0) P1 is not freed Fix this by freeing all the pages in __free_pages() that won't be freed by the call to put_page(). It's usually not a good idea to split a page, but this is a very unlikely scenario. Fixes: e286781d5f2e ("mm: speculative page references") Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton Acked-by: Mike Rapoport Cc: Nick Piggin Cc: Hugh Dickins Cc: Peter Zijlstra Link: https://lkml.kernel.org/r/20200926213919.26642-1-willy@infradead.org Signed-off-by: Linus Torvalds --- lib/Kconfig.debug | 9 +++++++++ lib/Makefile | 1 + lib/test_free_pages.c | 42 ++++++++++++++++++++++++++++++++++++++++++ mm/page_alloc.c | 3 +++ 4 files changed, 55 insertions(+) create mode 100644 lib/test_free_pages.c diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 0c781f912f9f..491789a793ae 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -2367,6 +2367,15 @@ config TEST_HMM If unsure, say N. +config TEST_FREE_PAGES + tristate "Test freeing pages" + help + Test that a memory leak does not occur due to a race between + freeing a block of pages and a speculative page reference. + Loading this module is safe if your kernel has the bug fixed. + If the bug is not fixed, it will leak gigabytes of memory and + probably OOM your system. + config TEST_FPU tristate "Test floating point operations in kernel space" depends on X86 && !KCOV_INSTRUMENT_ALL diff --git a/lib/Makefile b/lib/Makefile index d4af75136c54..49a2a9e36224 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -101,6 +101,7 @@ obj-$(CONFIG_TEST_BLACKHOLE_DEV) += test_blackhole_dev.o obj-$(CONFIG_TEST_MEMINIT) += test_meminit.o obj-$(CONFIG_TEST_LOCKUP) += test_lockup.o obj-$(CONFIG_TEST_HMM) += test_hmm.o +obj-$(CONFIG_TEST_FREE_PAGES) += test_free_pages.o # # CFLAGS for compiling floating point code inside the kernel. x86/Makefile turns diff --git a/lib/test_free_pages.c b/lib/test_free_pages.c new file mode 100644 index 000000000000..074e76bd76b2 --- /dev/null +++ b/lib/test_free_pages.c @@ -0,0 +1,42 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* + * test_free_pages.c: Check that free_pages() doesn't leak memory + * Copyright (c) 2020 Oracle + * Author: Matthew Wilcox + */ + +#include +#include +#include + +static void test_free_pages(gfp_t gfp) +{ + unsigned int i; + + for (i = 0; i < 1000 * 1000; i++) { + unsigned long addr = __get_free_pages(gfp, 3); + struct page *page = virt_to_page(addr); + + /* Simulate page cache getting a speculative reference */ + get_page(page); + free_pages(addr, 3); + put_page(page); + } +} + +static int m_in(void) +{ + test_free_pages(GFP_KERNEL); + test_free_pages(GFP_KERNEL | __GFP_COMP); + + return 0; +} + +static void m_ex(void) +{ +} + +module_init(m_in); +module_exit(m_ex); +MODULE_AUTHOR("Matthew Wilcox "); +MODULE_LICENSE("GPL"); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6263a97e39c6..73e33ab6d249 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4952,6 +4952,9 @@ void __free_pages(struct page *page, unsigned int order) { if (put_page_testzero(page)) free_the_page(page, order); + else if (!PageHead(page)) + while (order-- > 0) + free_the_page(page + (1 << order), order); } EXPORT_SYMBOL(__free_pages); From ab00db216c9c78cc0a68bc4e27889c1ee374598d Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Tue, 13 Oct 2020 16:56:07 -0700 Subject: [PATCH 140/181] include/linux/gfp.h: clarify usage of GFP_ATOMIC in !preemptible contexts There is a general understanding that GFP_ATOMIC/GFP_NOWAIT are to be used from atomic contexts. E.g. from within a spin lock or from the IRQ context. This is correct but there are some atomic contexts where the above doesn't hold. One of them would be an NMI context. Page allocator has never supported that and the general fear of this context didn't let anybody to actually even try to use the allocator there. Good, but let's be more specific about that. Another such a context, and that is where people seem to be more daring, is raw_spin_lock. Mostly because it simply resembles regular spin lock which is supported by the allocator and there is not any implementation difference with !RT kernels in the first place. Be explicit that such a context is not supported by the allocator. The underlying reason is that zone->lock would have to become raw_spin_lock as well and that has turned out to be a problem for RT (http://lkml.kernel.org/r/87mu305c1w.fsf@nanos.tec.linutronix.de). Signed-off-by: Michal Hocko Signed-off-by: Andrew Morton Reviewed-by: David Hildenbrand Reviewed-by: Thomas Gleixner Reviewed-by: Uladzislau Rezki Cc: "Paul E. McKenney" Link: https://lkml.kernel.org/r/20200929123010.5137-1-mhocko@kernel.org Signed-off-by: Linus Torvalds --- include/linux/gfp.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 67a0774e080b..2e8370cf60c7 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -238,7 +238,9 @@ struct vm_area_struct; * %__GFP_FOO flags as necessary. * * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower - * watermark is applied to allow access to "atomic reserves" + * watermark is applied to allow access to "atomic reserves". + * The current implementation doesn't support NMI and few other strict + * non-preemptive contexts (e.g. raw_spin_lock). The same applies to %GFP_NOWAIT. * * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim. From 3e5c36007e9c37378ff0bcaa2bc813d72c8659bc Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Tue, 13 Oct 2020 16:56:10 -0700 Subject: [PATCH 141/181] mm/hugetlb.c: make is_hugetlb_entry_hwpoisoned return bool Patch series "mm/hugetlb: Small cleanup and improvement", v2. This patch (of 3): Just like its neighbour is_hugetlb_entry_migration() has done. Signed-off-by: Baoquan He Signed-off-by: Andrew Morton Reviewed-by: Mike Kravetz Reviewed-by: David Hildenbrand Reviewed-by: Anshuman Khandual Link: https://lkml.kernel.org/r/20200723032248.24772-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20200723032248.24772-2-bhe@redhat.com Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 67fc6383995b..57b96c1d5265 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3805,17 +3805,17 @@ bool is_hugetlb_entry_migration(pte_t pte) return false; } -static int is_hugetlb_entry_hwpoisoned(pte_t pte) +static bool is_hugetlb_entry_hwpoisoned(pte_t pte) { swp_entry_t swp; if (huge_pte_none(pte) || pte_present(pte)) - return 0; + return false; swp = pte_to_swp_entry(pte); if (non_swap_entry(swp) && is_hwpoison_entry(swp)) - return 1; + return true; else - return 0; + return false; } int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, From d79d176a303775e2ccb16ca3c08bb63c030aa4dc Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Tue, 13 Oct 2020 16:56:14 -0700 Subject: [PATCH 142/181] mm/hugetlb.c: remove the unnecessary non_swap_entry() If a swap entry tests positive for either is_[migration|hwpoison]_entry(), then its swap_type() is among SWP_MIGRATION_READ, SWP_MIGRATION_WRITE and SWP_HWPOISON. All these types >= MAX_SWAPFILES, exactly what is asserted with non_swap_entry(). So the checking non_swap_entry() in is_hugetlb_entry_migration() and is_hugetlb_entry_hwpoisoned() is redundant. Let's remove it to optimize code. Signed-off-by: Baoquan He Signed-off-by: Andrew Morton Reviewed-by: Mike Kravetz Reviewed-by: David Hildenbrand Reviewed-by: Anshuman Khandual Link: https://lkml.kernel.org/r/20200723032248.24772-3-bhe@redhat.com Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 57b96c1d5265..a7d312cc018e 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3799,7 +3799,7 @@ bool is_hugetlb_entry_migration(pte_t pte) if (huge_pte_none(pte) || pte_present(pte)) return false; swp = pte_to_swp_entry(pte); - if (non_swap_entry(swp) && is_migration_entry(swp)) + if (is_migration_entry(swp)) return true; else return false; @@ -3812,7 +3812,7 @@ static bool is_hugetlb_entry_hwpoisoned(pte_t pte) if (huge_pte_none(pte) || pte_present(pte)) return false; swp = pte_to_swp_entry(pte); - if (non_swap_entry(swp) && is_hwpoison_entry(swp)) + if (is_hwpoison_entry(swp)) return true; else return false; From 540809be526724cfa8703d08b029d323fc123a41 Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Tue, 13 Oct 2020 16:56:17 -0700 Subject: [PATCH 143/181] doc/vm: fix typo in the hugetlb admin documentation Change 'pecify' to 'Specify'. Signed-off-by: Baoquan He Signed-off-by: Andrew Morton Reviewed-by: Mike Kravetz Reviewed-by: David Hildenbrand Cc: Anshuman Khandual Link: https://lkml.kernel.org/r/20200723032248.24772-4-bhe@redhat.com Signed-off-by: Linus Torvalds --- Documentation/admin-guide/mm/hugetlbpage.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst index 015a5f7d7854..f7b1c7462991 100644 --- a/Documentation/admin-guide/mm/hugetlbpage.rst +++ b/Documentation/admin-guide/mm/hugetlbpage.rst @@ -131,7 +131,7 @@ hugepages parameter is preceded by an invalid hugepagesz parameter, it will be ignored. default_hugepagesz - pecify the default huge page size. This parameter can + Specify the default huge page size. This parameter can only be specified once on the command line. default_hugepagesz can optionally be followed by the hugepages parameter to preallocate a specific number of huge pages of default size. The number of default From 7db5e7b67e3e8d7aa96beb2a4551d211b76a05ba Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Tue, 13 Oct 2020 16:56:20 -0700 Subject: [PATCH 144/181] mm/hugetlb: not necessary to coalesce regions recursively Patch series "mm/hugetlb: code refine and simplification", v4. Following are some cleanups for hugetlb. Simple testing with tools/testing/selftests/vm/map_hugetlb passes. This patch (of 7): Per my understanding, we keep the regions ordered and would always coalesce regions properly. So the task to keep this property is just to coalesce its neighbour. Let's simplify this. Signed-off-by: Wei Yang Signed-off-by: Andrew Morton Reviewed-by: Baoquan He Reviewed-by: Mike Kravetz Cc: Vlastimil Babka Link: https://lkml.kernel.org/r/20200901014636.29737-1-richard.weiyang@linux.alibaba.com Link: https://lkml.kernel.org/r/20200831022351.20916-1-richard.weiyang@linux.alibaba.com Link: https://lkml.kernel.org/r/20200831022351.20916-2-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a7d312cc018e..dc2977750e85 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -309,8 +309,7 @@ static void coalesce_file_region(struct resv_map *resv, struct file_region *rg) list_del(&rg->link); kfree(rg); - coalesce_file_region(resv, prg); - return; + rg = prg; } nrg = list_next_entry(rg, link); @@ -320,9 +319,6 @@ static void coalesce_file_region(struct resv_map *resv, struct file_region *rg) list_del(&rg->link); kfree(rg); - - coalesce_file_region(resv, nrg); - return; } } From a1ddc2e8250eb550733a3f257ea299a84492d8a7 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Tue, 13 Oct 2020 16:56:23 -0700 Subject: [PATCH 145/181] mm/hugetlb: remove VM_BUG_ON(!nrg) in get_file_region_entry_from_cache() We are sure to get a valid file_region, otherwise the VM_BUG_ON(resv->region_cache_count <= 0) at the very beginning would be triggered. Let's remove the redundant one. Signed-off-by: Wei Yang Signed-off-by: Andrew Morton Reviewed-by: Mike Kravetz Cc: Baoquan He Cc: Vlastimil Babka Link: https://lkml.kernel.org/r/20200831022351.20916-3-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 1 - 1 file changed, 1 deletion(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index dc2977750e85..39f86bbcb07e 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -240,7 +240,6 @@ get_file_region_entry_from_cache(struct resv_map *resv, long from, long to) resv->region_cache_count--; nrg = list_first_entry(&resv->region_cache, struct file_region, link); - VM_BUG_ON(!nrg); list_del(&nrg->link); nrg->from = from; From d3ec7b6e09e512ba902b86bcca2c512fb06d492f Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Tue, 13 Oct 2020 16:56:27 -0700 Subject: [PATCH 146/181] mm/hugetlb: use list_splice to merge two list at once Instead of add allocated file_region one by one to region_cache, we could use list_splice to merge two list at once. Also we know the number of entries in the list, increase the number directly. Signed-off-by: Wei Yang Signed-off-by: Andrew Morton Reviewed-by: Baoquan He Reviewed-by: Mike Kravetz Cc: Vlastimil Babka Link: https://lkml.kernel.org/r/20200831022351.20916-4-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 39f86bbcb07e..13dc0a455400 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -443,11 +443,8 @@ static int allocate_file_region_entries(struct resv_map *resv, spin_lock(&resv->lock); - list_for_each_entry_safe(rg, trg, &allocated_regions, link) { - list_del(&rg->link); - list_add(&rg->link, &resv->region_cache); - resv->region_cache_count++; - } + list_splice(&allocated_regions, &resv->region_cache); + resv->region_cache_count += to_allocate; } return 0; From 972a3da355c947283f3d88fd1764f001730206f9 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Tue, 13 Oct 2020 16:56:30 -0700 Subject: [PATCH 147/181] mm/hugetlb: count file_region to be added when regions_needed != NULL There are only two cases of function add_reservation_in_range() * count file_region and return the number in regions_needed * do the real list operation without counting This means it is not necessary to have two parameters to classify these two cases. Just use regions_needed to separate them. Signed-off-by: Wei Yang Signed-off-by: Andrew Morton Reviewed-by: Baoquan He Reviewed-by: Mike Kravetz Cc: Vlastimil Babka Link: https://lkml.kernel.org/r/20200831022351.20916-5-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 33 +++++++++++++++++---------------- 1 file changed, 17 insertions(+), 16 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 13dc0a455400..31586a1e70b5 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -321,16 +321,17 @@ static void coalesce_file_region(struct resv_map *resv, struct file_region *rg) } } -/* Must be called with resv->lock held. Calling this with count_only == true - * will count the number of pages to be added but will not modify the linked - * list. If regions_needed != NULL and count_only == true, then regions_needed - * will indicate the number of file_regions needed in the cache to carry out to - * add the regions for this range. +/* + * Must be called with resv->lock held. + * + * Calling this with regions_needed != NULL will count the number of pages + * to be added but will not modify the linked list. And regions_needed will + * indicate the number of file_regions needed in the cache to carry out to add + * the regions for this range. */ static long add_reservation_in_range(struct resv_map *resv, long f, long t, struct hugetlb_cgroup *h_cg, - struct hstate *h, long *regions_needed, - bool count_only) + struct hstate *h, long *regions_needed) { long add = 0; struct list_head *head = &resv->regions; @@ -366,14 +367,14 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t, */ if (rg->from > last_accounted_offset) { add += rg->from - last_accounted_offset; - if (!count_only) { + if (!regions_needed) { nrg = get_file_region_entry_from_cache( resv, last_accounted_offset, rg->from); record_hugetlb_cgroup_uncharge_info(h_cg, h, resv, nrg); list_add(&nrg->link, rg->link.prev); coalesce_file_region(resv, nrg); - } else if (regions_needed) + } else *regions_needed += 1; } @@ -385,13 +386,13 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t, */ if (last_accounted_offset < t) { add += t - last_accounted_offset; - if (!count_only) { + if (!regions_needed) { nrg = get_file_region_entry_from_cache( resv, last_accounted_offset, t); record_hugetlb_cgroup_uncharge_info(h_cg, h, resv, nrg); list_add(&nrg->link, rg->link.prev); coalesce_file_region(resv, nrg); - } else if (regions_needed) + } else *regions_needed += 1; } @@ -484,8 +485,8 @@ static long region_add(struct resv_map *resv, long f, long t, retry: /* Count how many regions are actually needed to execute this add. */ - add_reservation_in_range(resv, f, t, NULL, NULL, &actual_regions_needed, - true); + add_reservation_in_range(resv, f, t, NULL, NULL, + &actual_regions_needed); /* * Check for sufficient descriptors in the cache to accommodate @@ -513,7 +514,7 @@ retry: goto retry; } - add = add_reservation_in_range(resv, f, t, h_cg, h, NULL, false); + add = add_reservation_in_range(resv, f, t, h_cg, h, NULL); resv->adds_in_progress -= in_regions_needed; @@ -549,9 +550,9 @@ static long region_chg(struct resv_map *resv, long f, long t, spin_lock(&resv->lock); - /* Count how many hugepages in this range are NOT respresented. */ + /* Count how many hugepages in this range are NOT represented. */ chg = add_reservation_in_range(resv, f, t, NULL, NULL, - out_regions_needed, true); + out_regions_needed); if (*out_regions_needed == 0) *out_regions_needed = 1; From 15a8d68e9dc23dc9def4bd7e9563db60f4f86580 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Tue, 13 Oct 2020 16:56:33 -0700 Subject: [PATCH 148/181] mm/hugetlb: a page from buddy is not on any list The page allocated from buddy is not on any list, so just use list_add() is enough. Signed-off-by: Wei Yang Signed-off-by: Andrew Morton Reviewed-by: Baoquan He Reviewed-by: Mike Kravetz Cc: Vlastimil Babka Link: https://lkml.kernel.org/r/20200831022351.20916-6-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 31586a1e70b5..a3aea2c1181b 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2416,7 +2416,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, h->resv_huge_pages--; } spin_lock(&hugetlb_lock); - list_move(&page->lru, &h->hugepage_activelist); + list_add(&page->lru, &h->hugepage_activelist); /* Fall through */ } hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page); From 2f37511cb6c2a59b67f8f13ad206a0298992eaf5 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Tue, 13 Oct 2020 16:56:36 -0700 Subject: [PATCH 149/181] mm/hugetlb: narrow the hugetlb_lock protection area during preparing huge page set_hugetlb_cgroup_[rsvd] just manipulate page local data, which is not necessary to be protected by hugetlb_lock. Let's take this out. Signed-off-by: Wei Yang Signed-off-by: Andrew Morton Reviewed-by: Baoquan He Reviewed-by: Mike Kravetz Cc: Vlastimil Babka Link: https://lkml.kernel.org/r/20200831022351.20916-7-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a3aea2c1181b..a5068a38d8af 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1504,9 +1504,9 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) { INIT_LIST_HEAD(&page->lru); set_compound_page_dtor(page, HUGETLB_PAGE_DTOR); - spin_lock(&hugetlb_lock); set_hugetlb_cgroup(page, NULL); set_hugetlb_cgroup_rsvd(page, NULL); + spin_lock(&hugetlb_lock); h->nr_huge_pages++; h->nr_huge_pages_node[nid]++; spin_unlock(&hugetlb_lock); From 6664bfc8e934dd008565510a41d86a552bd0bb00 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Tue, 13 Oct 2020 16:56:39 -0700 Subject: [PATCH 150/181] mm/hugetlb: take the free hpage during the iteration directly Function dequeue_huge_page_node_exact() iterates the free list and return the first valid free hpage. Instead of break and check the loop variant, we could return in the loop directly. This could reduce some redundant check. [mike.kravetz@oracle.com: points out a logic error] [richard.weiyang@linux.alibaba.com: v4] Link: https://lkml.kernel.org/r/20200901014636.29737-8-richard.weiyang@linux.alibaba.com Signed-off-by: Wei Yang Signed-off-by: Andrew Morton Cc: Baoquan He Cc: Mike Kravetz Cc: Vlastimil Babka Link: https://lkml.kernel.org/r/20200831022351.20916-8-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 22 +++++++++------------- 1 file changed, 9 insertions(+), 13 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a5068a38d8af..cc70e541c9bf 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1040,21 +1040,17 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid) if (nocma && is_migrate_cma_page(page)) continue; - if (!PageHWPoison(page)) - break; + if (PageHWPoison(page)) + continue; + + list_move(&page->lru, &h->hugepage_activelist); + set_page_refcounted(page); + h->free_huge_pages--; + h->free_huge_pages_node[nid]--; + return page; } - /* - * if 'non-isolated free hugepage' not found on the list, - * the allocation fails. - */ - if (&h->hugepage_freelists[nid] == &page->lru) - return NULL; - list_move(&page->lru, &h->hugepage_activelist); - set_page_refcounted(page); - h->free_huge_pages--; - h->free_huge_pages_node[nid]--; - return page; + return NULL; } static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask, int nid, From 0bf7b64e6e51eb69cf6fce7c9f7ff44840393e64 Mon Sep 17 00:00:00 2001 From: Mike Kravetz Date: Tue, 13 Oct 2020 16:56:42 -0700 Subject: [PATCH 151/181] hugetlb: add lockdep check for i_mmap_rwsem held in huge_pmd_share As a debugging aid, huge_pmd_share should make sure i_mmap_rwsem is held if necessary. To clarify the 'if necessary', expand the comment block at the beginning of huge_pmd_share. No functional change. The added i_mmap_assert_locked() call is only enabled if CONFIG_LOCKDEP. Ideally, this should have been included with commit 34ae204f1851 ("hugetlbfs: remove call to huge_pte_alloc without i_mmap_rwsem"). Signed-off-by: Mike Kravetz Signed-off-by: Andrew Morton Cc: Matthew Wilcox Cc: Michal Hocko Cc: "Kirill A . Shutemov" Cc: Davidlohr Bueso Link: https://lkml.kernel.org/r/20200911201248.88537-1-mike.kravetz@oracle.com Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index cc70e541c9bf..2fb9a4c7a161 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5337,10 +5337,16 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma, * !shared pmd case because we can allocate the pmd later as well, it makes the * code much cleaner. * - * This routine must be called with i_mmap_rwsem held in at least read mode. - * For hugetlbfs, this prevents removal of any page table entries associated - * with the address space. This is important as we are setting up sharing - * based on existing page table entries (mappings). + * This routine must be called with i_mmap_rwsem held in at least read mode if + * sharing is possible. For hugetlbfs, this prevents removal of any page + * table entries associated with the address space. This is important as we + * are setting up sharing based on existing page table entries (mappings). + * + * NOTE: This routine is only called from huge_pte_alloc. Some callers of + * huge_pte_alloc know that sharing is not possible and do not take + * i_mmap_rwsem as a performance optimization. This is handled by the + * if !vma_shareable check at the beginning of the routine. i_mmap_rwsem is + * only required for subsequent processing. */ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) { @@ -5357,6 +5363,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) if (!vma_shareable(vma, addr)) return (pte_t *)pmd_alloc(mm, pud, addr); + i_mmap_assert_locked(mapping); vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) { if (svma == vma) continue; From 069c411de40a621c82efd2618663fee51d8c59b8 Mon Sep 17 00:00:00 2001 From: Chunxin Zang Date: Tue, 13 Oct 2020 16:56:46 -0700 Subject: [PATCH 152/181] mm/vmscan: fix infinite loop in drop_slab_node We have observed that drop_caches can take a considerable amount of time (). Especially when there are many memcgs involved because they are adding an additional overhead. It is quite unfortunate that the operation cannot be interrupted by a signal currently. Add a check for fatal signals into the main loop so that userspace can control early bailout. There are two reasons: 1. We have too many memcgs, even though one object freed in one memcg, the sum of object is bigger than 10. 2. We spend a lot of time in traverse memcg once. So, the memcg who traversed at the first have been freed many objects. Traverse memcg next time, the freed count bigger than 10 again. We can get the following info through 'ps': root:~# ps -aux | grep drop root 357956 ... R Aug25 21119854:55 echo 3 > /proc/sys/vm/drop_caches root 1771385 ... R Aug16 21146421:17 echo 3 > /proc/sys/vm/drop_caches root 1986319 ... R 18:56 117:27 echo 3 > /proc/sys/vm/drop_caches root 2002148 ... R Aug24 5720:39 echo 3 > /proc/sys/vm/drop_caches root 2564666 ... R 18:59 113:58 echo 3 > /proc/sys/vm/drop_caches root 2639347 ... R Sep03 2383:39 echo 3 > /proc/sys/vm/drop_caches root 3904747 ... R 03:35 993:31 echo 3 > /proc/sys/vm/drop_caches root 4016780 ... R Aug21 7882:18 echo 3 > /proc/sys/vm/drop_caches Use bpftrace follow 'freed' value in drop_slab_node: root:~# bpftrace -e 'kprobe:drop_slab_node+70 {@ret=hist(reg("bp")); }' Attaching 1 probe... ^B^C @ret: [64, 128) 1 | | [128, 256) 28 | | [256, 512) 107 |@ | [512, 1K) 298 |@@@ | [1K, 2K) 613 |@@@@@@@ | [2K, 4K) 4435 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [4K, 8K) 442 |@@@@@ | [8K, 16K) 299 |@@@ | [16K, 32K) 100 |@ | [32K, 64K) 139 |@ | [64K, 128K) 56 | | [128K, 256K) 26 | | [256K, 512K) 2 | | In the while loop, we can check whether the TASK_KILLABLE signal is set, if so, we should break the loop. Signed-off-by: Chunxin Zang Signed-off-by: Muchun Song Signed-off-by: Andrew Morton Acked-by: Chris Down Acked-by: Michal Hocko Cc: Vlastimil Babka Cc: Matthew Wilcox Link: https://lkml.kernel.org/r/20200909152047.27905-1-zangchunxin@bytedance.com Signed-off-by: Linus Torvalds --- mm/vmscan.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/vmscan.c b/mm/vmscan.c index 466fc3144fff..f134db6478d3 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -699,6 +699,9 @@ void drop_slab_node(int nid) do { struct mem_cgroup *memcg = NULL; + if (fatal_signal_pending(current)) + return; + freed = 0; memcg = mem_cgroup_iter(NULL, NULL, NULL); do { From 01c4776ba08ca9ab8cf58fb27d311868193dd368 Mon Sep 17 00:00:00 2001 From: Hui Su Date: Tue, 13 Oct 2020 16:56:49 -0700 Subject: [PATCH 153/181] mm/vmscan: fix comments for isolate_lru_page() fix comments for isolate_lru_page(): s/fundamentnal/fundamental Signed-off-by: Hui Su Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: https://lkml.kernel.org/r/20200927173923.GA8058@rlk Signed-off-by: Linus Torvalds --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index f134db6478d3..879fb57c5045 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1754,7 +1754,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, * Restrictions: * * (1) Must be called with an elevated refcount on the page. This is a - * fundamentnal difference from isolate_lru_pages (which is called + * fundamental difference from isolate_lru_pages (which is called * without a stable reference). * (2) the lru_lock must not be held. * (3) interrupts must be enabled. From f94afee9980c5722dfff6d3e46fd34a36293a509 Mon Sep 17 00:00:00 2001 From: Hui Su Date: Tue, 13 Oct 2020 16:56:52 -0700 Subject: [PATCH 154/181] mm/z3fold.c: use xx_zalloc instead xx_alloc and memset alloc_slots() allocates memory for slots using kmem_cache_alloc(), then memsets it. We can just use kmem_cache_zalloc(). Signed-off-by: Hui Su Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: https://lkml.kernel.org/r/20200926100834.GA184671@rlk Signed-off-by: Linus Torvalds --- mm/z3fold.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/mm/z3fold.c b/mm/z3fold.c index 460b0feced26..18feaa0bc537 100644 --- a/mm/z3fold.c +++ b/mm/z3fold.c @@ -212,13 +212,12 @@ static inline struct z3fold_buddy_slots *alloc_slots(struct z3fold_pool *pool, { struct z3fold_buddy_slots *slots; - slots = kmem_cache_alloc(pool->c_handle, + slots = kmem_cache_zalloc(pool->c_handle, (gfp & ~(__GFP_HIGHMEM | __GFP_MOVABLE))); if (slots) { /* It will be freed separately in free_handle(). */ kmemleak_not_leak(slots); - memset(slots->slot, 0, sizeof(slots->slot)); slots->pool = (unsigned long)pool; rwlock_init(&slots->lock); } From 1860129421c37a152b3bed23d7bef8ae21545d67 Mon Sep 17 00:00:00 2001 From: Xiang Chen Date: Tue, 13 Oct 2020 16:56:55 -0700 Subject: [PATCH 155/181] mm/zbud: remove redundant initialization zhdr is already initialized in the front of the function, so remove redundant initialization here. Signed-off-by: Xiang Chen Signed-off-by: Andrew Morton Reviewed-by: David Hildenbrand Cc: Seth Jennings Cc: Dan Streetman Link: https://lkml.kernel.org/r/1600419885-191907-1-git-send-email-chenxiang66@hisilicon.com Signed-off-by: Linus Torvalds --- mm/zbud.c | 1 - 1 file changed, 1 deletion(-) diff --git a/mm/zbud.c b/mm/zbud.c index bc93aa4e46fc..c49966ece674 100644 --- a/mm/zbud.c +++ b/mm/zbud.c @@ -367,7 +367,6 @@ int zbud_alloc(struct zbud_pool *pool, size_t size, gfp_t gfp, spin_lock(&pool->lock); /* First, try to find an unbuddied zbud page. */ - zhdr = NULL; for_each_unbuddied_list(i, chunks) { if (!list_empty(&pool->unbuddied[i])) { zhdr = list_first_entry(&pool->unbuddied[i], From 62b35fe0eba21b09b015cdb43cddf51073e4b18c Mon Sep 17 00:00:00 2001 From: Mateusz Nosek Date: Tue, 13 Oct 2020 16:56:58 -0700 Subject: [PATCH 156/181] mm/compaction.c: micro-optimization remove unnecessary branch The same code can work both for 'zone->compact_considered > defer_limit' and 'zone->compact_considered >= defer_limit'. In the latter there is one branch less which is more effective considering performance. Signed-off-by: Mateusz Nosek Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Cc: Joonsoo Kim Cc: Vlastimil Babka Cc: Mel Gorman Cc: David Rientjes Link: https://lkml.kernel.org/r/20200913190448.28649-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds --- mm/compaction.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index 176dcded298e..6c63844fc061 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -180,11 +180,10 @@ bool compaction_deferred(struct zone *zone, int order) return false; /* Avoid possible overflow */ - if (++zone->compact_considered > defer_limit) + if (++zone->compact_considered >= defer_limit) { zone->compact_considered = defer_limit; - - if (zone->compact_considered >= defer_limit) return false; + } trace_mm_compaction_deferred(zone, order); From 74c9da4e1dc0ecf70e7fa78568821e3ed8f77938 Mon Sep 17 00:00:00 2001 From: Mateusz Nosek Date: Tue, 13 Oct 2020 16:57:01 -0700 Subject: [PATCH 157/181] include/linux/compaction.h: clean code by removing unused enum value The enum value 'COMPACT_INACTIVE' is never used so can be removed. Signed-off-by: Mateusz Nosek Signed-off-by: Andrew Morton Link: https://lkml.kernel.org/r/20200917110750.12015-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds --- include/linux/compaction.h | 3 --- 1 file changed, 3 deletions(-) diff --git a/include/linux/compaction.h b/include/linux/compaction.h index 25a521d299c1..1de5a1151ee7 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -29,9 +29,6 @@ enum compact_result { /* compaction didn't start as it was deferred due to past failures */ COMPACT_DEFERRED, - /* compaction not active last round */ - COMPACT_INACTIVE = COMPACT_DEFERRED, - /* For more detailed tracepoint output - internal to compaction */ COMPACT_NO_SUITABLE_PAGE, /* compaction should continue to another pageblock */ From 1100262037be8008cc85240389fbe5eac4df034d Mon Sep 17 00:00:00 2001 From: John Hubbard Date: Tue, 13 Oct 2020 16:57:04 -0700 Subject: [PATCH 158/181] selftests/vm: 8x compaction_test speedup This patch reduces the running time for compaction_test from about 27 sec, to 3.3 sec, which is about an 8x speedup. These numbers are for an Intel x86_64 system with 32 GB of DRAM. The compaction_test.c program was spending most of its time doing mmap(), 1 MB at a time, on about 25 GB of memory. Instead, do the mmaps 100 MB at a time. (Going past 100 MB doesn't make things go much faster, because other parts of the program are using the remaining time.) Signed-off-by: John Hubbard Signed-off-by: Andrew Morton Acked-by: Sri Jayaramappa Cc: Shuah Khan Cc: Mel Gorman Link: https://lkml.kernel.org/r/20201002080621.551044-2-jhubbard@nvidia.com Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/compaction_test.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/vm/compaction_test.c b/tools/testing/selftests/vm/compaction_test.c index bcec71250873..9b420140ba2b 100644 --- a/tools/testing/selftests/vm/compaction_test.c +++ b/tools/testing/selftests/vm/compaction_test.c @@ -18,7 +18,8 @@ #include "../kselftest.h" -#define MAP_SIZE 1048576 +#define MAP_SIZE_MB 100 +#define MAP_SIZE (MAP_SIZE_MB * 1024 * 1024) struct map_list { void *map; @@ -165,7 +166,7 @@ int main(int argc, char **argv) void *map = NULL; unsigned long mem_free = 0; unsigned long hugepage_size = 0; - unsigned long mem_fragmentable = 0; + long mem_fragmentable_MB = 0; if (prereq() != 0) { printf("Either the sysctl compact_unevictable_allowed is not\n" @@ -190,9 +191,9 @@ int main(int argc, char **argv) return -1; } - mem_fragmentable = mem_free * 0.8 / 1024; + mem_fragmentable_MB = mem_free * 0.8 / 1024; - while (mem_fragmentable > 0) { + while (mem_fragmentable_MB > 0) { map = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE | MAP_LOCKED, -1, 0); if (map == MAP_FAILED) @@ -213,7 +214,7 @@ int main(int argc, char **argv) for (i = 0; i < MAP_SIZE; i += page_size) *(unsigned long *)(map + i) = (unsigned long)map + i; - mem_fragmentable--; + mem_fragmentable_MB -= MAP_SIZE_MB; } for (entry = list; entry != NULL; entry = entry->next) { From 78b132e9bae922f1e8bd9d137f0c27b81c44f15d Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Tue, 13 Oct 2020 16:57:08 -0700 Subject: [PATCH 159/181] mm/mempolicy: remove or narrow the lock on current It is not necessary to hold the lock of current when setting nodemask of a new policy. Signed-off-by: Wei Yang Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: https://lkml.kernel.org/r/20200921040416.86185-1-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds --- mm/mempolicy.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index eddbe4e56c73..69ce708c3a3a 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -875,13 +875,12 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags, goto out; } - task_lock(current); ret = mpol_set_nodemask(new, nodes, scratch); if (ret) { - task_unlock(current); mpol_put(new); goto out; } + task_lock(current); old = current->mempolicy; current->mempolicy = new; if (new && new->mode == MPOL_INTERLEAVE) @@ -1324,9 +1323,7 @@ static long do_mbind(unsigned long start, unsigned long len, NODEMASK_SCRATCH(scratch); if (scratch) { mmap_write_lock(mm); - task_lock(current); err = mpol_set_nodemask(new, nmask, scratch); - task_unlock(current); if (err) mmap_write_unlock(mm); } else From f8fd52535c7326d72645c9878d7897aaf44db51c Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Tue, 13 Oct 2020 16:57:11 -0700 Subject: [PATCH 160/181] mm: remove unused alloc_page_vma_node() No one use this macro anymore. Also fix code style of policy_node(). Signed-off-by: Wei Yang Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: https://lkml.kernel.org/r/20200921021401.84508-1-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds --- include/linux/gfp.h | 2 -- mm/mempolicy.c | 3 +-- 2 files changed, 1 insertion(+), 4 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 2e8370cf60c7..07e481993ef5 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -562,8 +562,6 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order, #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0) #define alloc_page_vma(gfp_mask, vma, addr) \ alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false) -#define alloc_page_vma_node(gfp_mask, vma, addr, node) \ - alloc_pages_vma(gfp_mask, 0, vma, addr, node, false) extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order); extern unsigned long get_zeroed_page(gfp_t gfp_mask); diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 69ce708c3a3a..3fde772ef5ef 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1882,8 +1882,7 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) } /* Return the node id preferred by the given mempolicy, or the given id */ -static int policy_node(gfp_t gfp, struct mempolicy *policy, - int nd) +static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd) { if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL)) nd = policy->v.preferred_node; From 544941d788319951ae75bae5d1bce076a6f71949 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Tue, 13 Oct 2020 16:57:14 -0700 Subject: [PATCH 161/181] mm/mempool: add 'else' to split mutually exclusive case Add else to split mutually exclusive case and avoid some unnecessary check. It doesn't seem to change code generation (compiler is smart), but I think it helps readability. [akpm@linux-foundation.org: fix comment location] Signed-off-by: Miaohe Lin Signed-off-by: Andrew Morton Link: https://lkml.kernel.org/r/20200924111641.28922-1-linmiaohe@huawei.com Signed-off-by: Linus Torvalds --- mm/mempool.c | 18 ++++++++---------- 1 file changed, 8 insertions(+), 10 deletions(-) diff --git a/mm/mempool.c b/mm/mempool.c index 79bff63ecf27..f473cdddaff0 100644 --- a/mm/mempool.c +++ b/mm/mempool.c @@ -58,11 +58,10 @@ static void __check_element(mempool_t *pool, void *element, size_t size) static void check_element(mempool_t *pool, void *element) { /* Mempools backed by slab allocator */ - if (pool->free == mempool_free_slab || pool->free == mempool_kfree) + if (pool->free == mempool_free_slab || pool->free == mempool_kfree) { __check_element(pool, element, ksize(element)); - - /* Mempools backed by page allocator */ - if (pool->free == mempool_free_pages) { + } else if (pool->free == mempool_free_pages) { + /* Mempools backed by page allocator */ int order = (int)(long)pool->pool_data; void *addr = kmap_atomic((struct page *)element); @@ -82,11 +81,10 @@ static void __poison_element(void *element, size_t size) static void poison_element(mempool_t *pool, void *element) { /* Mempools backed by slab allocator */ - if (pool->alloc == mempool_alloc_slab || pool->alloc == mempool_kmalloc) + if (pool->alloc == mempool_alloc_slab || pool->alloc == mempool_kmalloc) { __poison_element(element, ksize(element)); - - /* Mempools backed by page allocator */ - if (pool->alloc == mempool_alloc_pages) { + } else if (pool->alloc == mempool_alloc_pages) { + /* Mempools backed by page allocator */ int order = (int)(long)pool->pool_data; void *addr = kmap_atomic((struct page *)element); @@ -107,7 +105,7 @@ static __always_inline void kasan_poison_element(mempool_t *pool, void *element) { if (pool->alloc == mempool_alloc_slab || pool->alloc == mempool_kmalloc) kasan_poison_kfree(element, _RET_IP_); - if (pool->alloc == mempool_alloc_pages) + else if (pool->alloc == mempool_alloc_pages) kasan_free_pages(element, (unsigned long)pool->pool_data); } @@ -115,7 +113,7 @@ static void kasan_unpoison_element(mempool_t *pool, void *element) { if (pool->alloc == mempool_alloc_slab || pool->alloc == mempool_kmalloc) kasan_unpoison_slab(element); - if (pool->alloc == mempool_alloc_pages) + else if (pool->alloc == mempool_alloc_pages) kasan_alloc_pages(element, (unsigned long)pool->pool_data); } From 04ba0a923f07df85211cf83932ab9531dd2e6f7f Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 13 Oct 2020 16:57:17 -0700 Subject: [PATCH 162/181] KVM: PPC: Book3S HV: simplify kvm_cma_reserve() Patch series "memblock: seasonal cleaning^w cleanup", v3. These patches simplify several uses of memblock iterators and hide some of the memblock implementation details from the rest of the system. This patch (of 17): The memory size calculation in kvm_cma_reserve() traverses memblock.memory rather than simply call memblock_phys_mem_size(). The comment in that function suggests that at some point there should have been call to memblock_analyze() before memblock_phys_mem_size() could be used. As of now, there is no memblock_analyze() at all and memblock_phys_mem_size() can be used as soon as cold-plug memory is registered with memblock. Replace loop over memblock.memory with a call to memblock_phys_mem_size(). Signed-off-by: Mike Rapoport Signed-off-by: Andrew Morton Cc: Andy Lutomirski Cc: Baoquan He Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christoph Hellwig Cc: Daniel Axtens Cc: Dave Hansen Cc: Emil Renner Berthing Cc: Ingo Molnar Cc: Hari Bathini Cc: Marek Szyprowski Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Palmer Dabbelt Cc: Paul Mackerras Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Russell King Cc: Stafford Horne Cc: Thomas Gleixner Cc: Will Deacon Cc: Yoshinori Sato Cc: Ingo Molnar Cc: Jonathan Cameron Cc: Miguel Ojeda Cc: Thomas Bogendoerfer Link: https://lkml.kernel.org/r/20200818151634.14343-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20200818151634.14343-2-rppt@kernel.org Signed-off-by: Linus Torvalds --- arch/powerpc/kvm/book3s_hv_builtin.c | 12 ++---------- 1 file changed, 2 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c index 073617ce83e0..8f58dd20b362 100644 --- a/arch/powerpc/kvm/book3s_hv_builtin.c +++ b/arch/powerpc/kvm/book3s_hv_builtin.c @@ -95,23 +95,15 @@ EXPORT_SYMBOL_GPL(kvm_free_hpt_cma); void __init kvm_cma_reserve(void) { unsigned long align_size; - struct memblock_region *reg; - phys_addr_t selected_size = 0; + phys_addr_t selected_size; /* * We need CMA reservation only when we are in HV mode */ if (!cpu_has_feature(CPU_FTR_HVMODE)) return; - /* - * We cannot use memblock_phys_mem_size() here, because - * memblock_analyze() has not been called yet. - */ - for_each_memblock(memory, reg) - selected_size += memblock_region_memory_end_pfn(reg) - - memblock_region_memory_base_pfn(reg); - selected_size = (selected_size * kvm_cma_resv_ratio / 100) << PAGE_SHIFT; + selected_size = PAGE_ALIGN(memblock_phys_mem_size() * kvm_cma_resv_ratio / 100); if (selected_size) { pr_info("%s: reserving %ld MiB for global area\n", __func__, (unsigned long)selected_size / SZ_1M); From e9aa36ccbb4e845d043bb51de274ae877ae925a7 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 13 Oct 2020 16:57:22 -0700 Subject: [PATCH 163/181] dma-contiguous: simplify cma_early_percent_memory() The memory size calculation in cma_early_percent_memory() traverses memblock.memory rather than simply call memblock_phys_mem_size(). The comment in that function suggests that at some point there should have been call to memblock_analyze() before memblock_phys_mem_size() could be used. As of now, there is no memblock_analyze() at all and memblock_phys_mem_size() can be used as soon as cold-plug memory is registered with memblock. Replace loop over memblock.memory with a call to memblock_phys_mem_size(). Signed-off-by: Mike Rapoport Signed-off-by: Andrew Morton Reviewed-by: Christoph Hellwig Reviewed-by: Baoquan He Cc: Andy Lutomirski Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Catalin Marinas Cc: Daniel Axtens Cc: Dave Hansen Cc: Emil Renner Berthing Cc: Hari Bathini Cc: Ingo Molnar Cc: Ingo Molnar Cc: Jonathan Cameron Cc: Marek Szyprowski Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Miguel Ojeda Cc: Palmer Dabbelt Cc: Paul Mackerras Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Russell King Cc: Stafford Horne Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Will Deacon Cc: Yoshinori Sato Link: https://lkml.kernel.org/r/20200818151634.14343-3-rppt@kernel.org Signed-off-by: Linus Torvalds --- kernel/dma/contiguous.c | 11 +---------- 1 file changed, 1 insertion(+), 10 deletions(-) diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c index cff7e60968b9..0369fd5fda8f 100644 --- a/kernel/dma/contiguous.c +++ b/kernel/dma/contiguous.c @@ -73,16 +73,7 @@ early_param("cma", early_cma); static phys_addr_t __init __maybe_unused cma_early_percent_memory(void) { - struct memblock_region *reg; - unsigned long total_pages = 0; - - /* - * We cannot use memblock_phys_mem_size() here, because - * memblock_analyze() has not been called yet. - */ - for_each_memblock(memory, reg) - total_pages += memblock_region_memory_end_pfn(reg) - - memblock_region_memory_base_pfn(reg); + unsigned long total_pages = PHYS_PFN(memblock_phys_mem_size()); return (total_pages * CONFIG_CMA_SIZE_PERCENTAGE / 100) << PAGE_SHIFT; } From cddb5ddf2b76debdb8cad1728ad0a9321383d933 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 13 Oct 2020 16:57:26 -0700 Subject: [PATCH 164/181] arm, xtensa: simplify initialization of high memory pages free_highpages() in both arm and xtensa essentially open-code for_each_free_mem_range() loop to detect high memory pages that were not reserved and that should be initialized and passed to the buddy allocator. Replace open-coded implementation of for_each_free_mem_range() with usage of memblock API to simplify the code. Signed-off-by: Mike Rapoport Signed-off-by: Andrew Morton Tested-by: Max Filippov [xtensa] Reviewed-by: Max Filippov [xtensa] Cc: Andy Lutomirski Cc: Baoquan He Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christoph Hellwig Cc: Daniel Axtens Cc: Dave Hansen Cc: Emil Renner Berthing Cc: Hari Bathini Cc: Ingo Molnar Cc: Ingo Molnar Cc: Jonathan Cameron Cc: Marek Szyprowski Cc: Michael Ellerman Cc: Michal Simek Cc: Miguel Ojeda Cc: Palmer Dabbelt Cc: Paul Mackerras Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Russell King Cc: Stafford Horne Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Will Deacon Cc: Yoshinori Sato Link: https://lkml.kernel.org/r/20200818151634.14343-4-rppt@kernel.org Signed-off-by: Linus Torvalds --- arch/arm/mm/init.c | 48 +++++++------------------------------ arch/xtensa/mm/init.c | 55 ++++++++----------------------------------- 2 files changed, 18 insertions(+), 85 deletions(-) diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c index 000c1b48e973..50a5a30a78ff 100644 --- a/arch/arm/mm/init.c +++ b/arch/arm/mm/init.c @@ -347,61 +347,29 @@ static void __init free_unused_memmap(void) #endif } -#ifdef CONFIG_HIGHMEM -static inline void free_area_high(unsigned long pfn, unsigned long end) -{ - for (; pfn < end; pfn++) - free_highmem_page(pfn_to_page(pfn)); -} -#endif - static void __init free_highpages(void) { #ifdef CONFIG_HIGHMEM unsigned long max_low = max_low_pfn; - struct memblock_region *mem, *res; + phys_addr_t range_start, range_end; + u64 i; /* set highmem page free */ - for_each_memblock(memory, mem) { - unsigned long start = memblock_region_memory_base_pfn(mem); - unsigned long end = memblock_region_memory_end_pfn(mem); + for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, + &range_start, &range_end, NULL) { + unsigned long start = PHYS_PFN(range_start); + unsigned long end = PHYS_PFN(range_end); /* Ignore complete lowmem entries */ if (end <= max_low) continue; - if (memblock_is_nomap(mem)) - continue; - /* Truncate partial highmem entries */ if (start < max_low) start = max_low; - /* Find and exclude any reserved regions */ - for_each_memblock(reserved, res) { - unsigned long res_start, res_end; - - res_start = memblock_region_reserved_base_pfn(res); - res_end = memblock_region_reserved_end_pfn(res); - - if (res_end < start) - continue; - if (res_start < start) - res_start = start; - if (res_start > end) - res_start = end; - if (res_end > end) - res_end = end; - if (res_start != start) - free_area_high(start, res_start); - start = res_end; - if (start == end) - break; - } - - /* And now free anything which remains */ - if (start < end) - free_area_high(start, end); + for (; start < end; start++) + free_highmem_page(pfn_to_page(start)); } #endif } diff --git a/arch/xtensa/mm/init.c b/arch/xtensa/mm/init.c index a05b306cf371..ad9d59d93f39 100644 --- a/arch/xtensa/mm/init.c +++ b/arch/xtensa/mm/init.c @@ -79,67 +79,32 @@ void __init zones_init(void) free_area_init(max_zone_pfn); } -#ifdef CONFIG_HIGHMEM -static void __init free_area_high(unsigned long pfn, unsigned long end) -{ - for (; pfn < end; pfn++) - free_highmem_page(pfn_to_page(pfn)); -} - static void __init free_highpages(void) { +#ifdef CONFIG_HIGHMEM unsigned long max_low = max_low_pfn; - struct memblock_region *mem, *res; + phys_addr_t range_start, range_end; + u64 i; - reset_all_zones_managed_pages(); /* set highmem page free */ - for_each_memblock(memory, mem) { - unsigned long start = memblock_region_memory_base_pfn(mem); - unsigned long end = memblock_region_memory_end_pfn(mem); + for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, + &range_start, &range_end, NULL) { + unsigned long start = PHYS_PFN(range_start); + unsigned long end = PHYS_PFN(range_end); /* Ignore complete lowmem entries */ if (end <= max_low) continue; - if (memblock_is_nomap(mem)) - continue; - /* Truncate partial highmem entries */ if (start < max_low) start = max_low; - /* Find and exclude any reserved regions */ - for_each_memblock(reserved, res) { - unsigned long res_start, res_end; - - res_start = memblock_region_reserved_base_pfn(res); - res_end = memblock_region_reserved_end_pfn(res); - - if (res_end < start) - continue; - if (res_start < start) - res_start = start; - if (res_start > end) - res_start = end; - if (res_end > end) - res_end = end; - if (res_start != start) - free_area_high(start, res_start); - start = res_end; - if (start == end) - break; - } - - /* And now free anything which remains */ - if (start < end) - free_area_high(start, end); + for (; start < end; start++) + free_highmem_page(pfn_to_page(start)); } -} -#else -static void __init free_highpages(void) -{ -} #endif +} /* * Initialize memory pages. From ab8f21aa8b2ee77931e2fe7fd8a842628b780d22 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 13 Oct 2020 16:57:31 -0700 Subject: [PATCH 165/181] arm64: numa: simplify dummy_numa_init() dummy_numa_init() loops over memblock.memory and passes nid=0 to numa_add_memblk() which essentially wraps memblock_set_node(). However, memblock_set_node() can cope with entire memory span itself, so the loop over memblock.memory regions is redundant. Using a single call to memblock_set_node() rather than a loop also fixes an issue with a buggy ACPI firmware in which the SRAT table covers some but not all of the memory in the EFI memory map. Jonathan Cameron says: This issue can be easily triggered by having an SRAT table which fails to cover all elements of the EFI memory map. This firmware error is detected and a warning printed. e.g. "NUMA: Warning: invalid memblk node 64 [mem 0x240000000-0x27fffffff]" At that point we fall back to dummy_numa_init(). However, the failed ACPI init has left us with our memblocks all broken up as we split them when trying to assign them to NUMA nodes. We then iterate over the memblocks and add them to node 0. numa_add_memblk() calls memblock_set_node() which merges regions that were previously split up during the earlier attempt to add them to different nodes during parsing of SRAT. This means elements are moved in the memblock array and we can end up in a different memblock after the call to numa_add_memblk(). Result is: Unable to handle kernel paging request at virtual address 0000000000003a40 Mem abort info: ESR = 0x96000004 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000004 CM = 0, WnR = 0 [0000000000003a40] user address but active_mm is swapper Internal error: Oops: 96000004 [#1] PREEMPT SMP ... Call trace: sparse_init_nid+0x5c/0x2b0 sparse_init+0x138/0x170 bootmem_init+0x80/0xe0 setup_arch+0x2a0/0x5fc start_kernel+0x8c/0x648 Replace the loop with a single call to memblock_set_node() to the entire memory. Signed-off-by: Mike Rapoport Signed-off-by: Andrew Morton Acked-by: Jonathan Cameron Acked-by: Catalin Marinas Cc: Andy Lutomirski Cc: Baoquan He Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Christoph Hellwig Cc: Daniel Axtens Cc: Dave Hansen Cc: Emil Renner Berthing Cc: Hari Bathini Cc: Ingo Molnar Cc: Ingo Molnar Cc: Marek Szyprowski Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Miguel Ojeda Cc: Palmer Dabbelt Cc: Paul Mackerras Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Russell King Cc: Stafford Horne Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Will Deacon Cc: Yoshinori Sato Link: https://lkml.kernel.org/r/20200818151634.14343-5-rppt@kernel.org Signed-off-by: Linus Torvalds --- arch/arm64/mm/numa.c | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c index 676deb220b99..2040b97e92d0 100644 --- a/arch/arm64/mm/numa.c +++ b/arch/arm64/mm/numa.c @@ -427,19 +427,16 @@ out_free_distance: */ static int __init dummy_numa_init(void) { + phys_addr_t start = memblock_start_of_DRAM(); + phys_addr_t end = memblock_end_of_DRAM(); int ret; - struct memblock_region *mblk; if (numa_off) pr_info("NUMA disabled\n"); /* Forced off on command line. */ - pr_info("Faking a node at [mem %#018Lx-%#018Lx]\n", - memblock_start_of_DRAM(), memblock_end_of_DRAM() - 1); - - for_each_memblock(memory, mblk) { - ret = numa_add_memblk(0, mblk->base, mblk->base + mblk->size); - if (!ret) - continue; + pr_info("Faking a node at [mem %#018Lx-%#018Lx]\n", start, end - 1); + ret = numa_add_memblk(0, start, end); + if (ret) { pr_err("NUMA init failed\n"); return ret; } From 80c4574417ae42fe2ba6978772854dc67cfc5331 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 13 Oct 2020 16:57:36 -0700 Subject: [PATCH 166/181] h8300, nds32, openrisc: simplify detection of memory extents Instead of traversing memblock.memory regions to find memory_start and memory_end, simply query memblock_{start,end}_of_DRAM(). Signed-off-by: Mike Rapoport Signed-off-by: Andrew Morton Acked-by: Stafford Horne Cc: Andy Lutomirski Cc: Baoquan He Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christoph Hellwig Cc: Daniel Axtens Cc: Dave Hansen Cc: Emil Renner Berthing Cc: Hari Bathini Cc: Ingo Molnar Cc: Ingo Molnar Cc: Jonathan Cameron Cc: Marek Szyprowski Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Miguel Ojeda Cc: Palmer Dabbelt Cc: Paul Mackerras Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Russell King Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Will Deacon Cc: Yoshinori Sato Link: https://lkml.kernel.org/r/20200818151634.14343-6-rppt@kernel.org Signed-off-by: Linus Torvalds --- arch/h8300/kernel/setup.c | 8 +++----- arch/nds32/kernel/setup.c | 8 ++------ arch/openrisc/kernel/setup.c | 9 ++------- 3 files changed, 7 insertions(+), 18 deletions(-) diff --git a/arch/h8300/kernel/setup.c b/arch/h8300/kernel/setup.c index 28ac88358a89..0281f92eea3d 100644 --- a/arch/h8300/kernel/setup.c +++ b/arch/h8300/kernel/setup.c @@ -74,17 +74,15 @@ static void __init bootmem_init(void) memory_end = memory_start = 0; /* Find main memory where is the kernel */ - for_each_memblock(memory, region) { - memory_start = region->base; - memory_end = region->base + region->size; - } + memory_start = memblock_start_of_DRAM(); + memory_end = memblock_end_of_DRAM(); if (!memory_end) panic("No memory!"); /* setup bootmem globals (we use no_bootmem, but mm still depends on this) */ min_low_pfn = PFN_UP(memory_start); - max_low_pfn = PFN_DOWN(memblock_end_of_DRAM()); + max_low_pfn = PFN_DOWN(memory_end); max_pfn = max_low_pfn; memblock_reserve(__pa(_stext), _end - _stext); diff --git a/arch/nds32/kernel/setup.c b/arch/nds32/kernel/setup.c index a066efbe53c0..c356e484dcab 100644 --- a/arch/nds32/kernel/setup.c +++ b/arch/nds32/kernel/setup.c @@ -249,12 +249,8 @@ static void __init setup_memory(void) memory_end = memory_start = 0; /* Find main memory where is the kernel */ - for_each_memblock(memory, region) { - memory_start = region->base; - memory_end = region->base + region->size; - pr_info("%s: Memory: 0x%x-0x%x\n", __func__, - memory_start, memory_end); - } + memory_start = memblock_start_of_DRAM(); + memory_end = memblock_end_of_DRAM(); if (!memory_end) { panic("No memory!"); diff --git a/arch/openrisc/kernel/setup.c b/arch/openrisc/kernel/setup.c index 13c87f1f872b..2416a9f91533 100644 --- a/arch/openrisc/kernel/setup.c +++ b/arch/openrisc/kernel/setup.c @@ -48,17 +48,12 @@ static void __init setup_memory(void) unsigned long ram_start_pfn; unsigned long ram_end_pfn; phys_addr_t memory_start, memory_end; - struct memblock_region *region; memory_end = memory_start = 0; /* Find main memory where is the kernel, we assume its the only one */ - for_each_memblock(memory, region) { - memory_start = region->base; - memory_end = region->base + region->size; - printk(KERN_INFO "%s: Memory: 0x%x-0x%x\n", __func__, - memory_start, memory_end); - } + memory_start = memblock_start_of_DRAM(); + memory_end = memblock_end_of_DRAM(); if (!memory_end) { panic("No memory!"); From c8e470184a063b5fb1dc06779935247e398f2177 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 13 Oct 2020 16:57:40 -0700 Subject: [PATCH 167/181] riscv: drop unneeded node initialization RISC-V does not (yet) support NUMA and for UMA architectures node 0 is used implicitly during early memory initialization. There is no need to call memblock_set_node(), remove this call and the surrounding code. Signed-off-by: Mike Rapoport Signed-off-by: Andrew Morton Cc: Andy Lutomirski Cc: Baoquan He Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christoph Hellwig Cc: Daniel Axtens Cc: Dave Hansen Cc: Emil Renner Berthing Cc: Hari Bathini Cc: Ingo Molnar Cc: Ingo Molnar Cc: Jonathan Cameron Cc: Marek Szyprowski Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Miguel Ojeda Cc: Palmer Dabbelt Cc: Paul Mackerras Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Russell King Cc: Stafford Horne Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Will Deacon Cc: Yoshinori Sato Link: https://lkml.kernel.org/r/20200818151634.14343-7-rppt@kernel.org Signed-off-by: Linus Torvalds --- arch/riscv/mm/init.c | 9 --------- 1 file changed, 9 deletions(-) diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c index f750e012dbe5..17911b9402ea 100644 --- a/arch/riscv/mm/init.c +++ b/arch/riscv/mm/init.c @@ -191,15 +191,6 @@ void __init setup_bootmem(void) early_init_fdt_scan_reserved_mem(); memblock_allow_resize(); memblock_dump_all(); - - for_each_memblock(memory, reg) { - unsigned long start_pfn = memblock_region_memory_base_pfn(reg); - unsigned long end_pfn = memblock_region_memory_end_pfn(reg); - - memblock_set_node(PFN_PHYS(start_pfn), - PFN_PHYS(end_pfn - start_pfn), - &memblock.memory, 0); - } } #ifdef CONFIG_MMU From 49645793bce1a2bd8c56c0de709bb9093be0df2f Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 13 Oct 2020 16:57:45 -0700 Subject: [PATCH 168/181] mircoblaze: drop unneeded NUMA and sparsemem initializations microblaze does not support neither NUMA not SPARSMEM, so there is no point to call memblock_set_node() and sparse_memory_present_with_active_regions() functions during microblaze memory initialization. Remove these calls and the surrounding code. Signed-off-by: Mike Rapoport Signed-off-by: Andrew Morton Cc: Andy Lutomirski Cc: Baoquan He Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christoph Hellwig Cc: Daniel Axtens Cc: Dave Hansen Cc: Emil Renner Berthing Cc: Hari Bathini Cc: Ingo Molnar Cc: Ingo Molnar Cc: Jonathan Cameron Cc: Marek Szyprowski Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Miguel Ojeda Cc: Palmer Dabbelt Cc: Paul Mackerras Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Russell King Cc: Stafford Horne Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Will Deacon Cc: Yoshinori Sato Link: https://lkml.kernel.org/r/20200818151634.14343-8-rppt@kernel.org Signed-off-by: Linus Torvalds --- arch/microblaze/mm/init.c | 14 +------------- 1 file changed, 1 insertion(+), 13 deletions(-) diff --git a/arch/microblaze/mm/init.c b/arch/microblaze/mm/init.c index 3344d4a1fe89..25ec8f2c3a4d 100644 --- a/arch/microblaze/mm/init.c +++ b/arch/microblaze/mm/init.c @@ -108,9 +108,8 @@ static void __init paging_init(void) void __init setup_memory(void) { - struct memblock_region *reg; - #ifndef CONFIG_MMU + struct memblock_region *reg; u32 kernel_align_start, kernel_align_size; /* Find main memory where is the kernel */ @@ -164,17 +163,6 @@ void __init setup_memory(void) pr_info("%s: max_low_pfn: %#lx\n", __func__, max_low_pfn); pr_info("%s: max_pfn: %#lx\n", __func__, max_pfn); - /* Add active regions with valid PFNs */ - for_each_memblock(memory, reg) { - unsigned long start_pfn, end_pfn; - - start_pfn = memblock_region_memory_base_pfn(reg); - end_pfn = memblock_region_memory_end_pfn(reg); - memblock_set_node(start_pfn << PAGE_SHIFT, - (end_pfn - start_pfn) << PAGE_SHIFT, - &memblock.memory, 0); - } - paging_init(); } From cd991db8ddc33d2d2c7af45627fc48352915001c Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 13 Oct 2020 16:57:49 -0700 Subject: [PATCH 169/181] memblock: make for_each_memblock_type() iterator private for_each_memblock_type() is not used outside mm/memblock.c, move it there from include/linux/memblock.h Signed-off-by: Mike Rapoport Signed-off-by: Andrew Morton Reviewed-by: Baoquan He Cc: Andy Lutomirski Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christoph Hellwig Cc: Daniel Axtens Cc: Dave Hansen Cc: Emil Renner Berthing Cc: Hari Bathini Cc: Ingo Molnar Cc: Ingo Molnar Cc: Jonathan Cameron Cc: Marek Szyprowski Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Miguel Ojeda Cc: Palmer Dabbelt Cc: Paul Mackerras Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Russell King Cc: Stafford Horne Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Will Deacon Cc: Yoshinori Sato Link: https://lkml.kernel.org/r/20200818151634.14343-9-rppt@kernel.org Signed-off-by: Linus Torvalds --- include/linux/memblock.h | 5 ----- mm/memblock.c | 5 +++++ 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index 9d925db0d355..550faf69fc1c 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -552,11 +552,6 @@ static inline unsigned long memblock_region_reserved_end_pfn(const struct memblo region < (memblock.memblock_type.regions + memblock.memblock_type.cnt); \ region++) -#define for_each_memblock_type(i, memblock_type, rgn) \ - for (i = 0, rgn = &memblock_type->regions[0]; \ - i < memblock_type->cnt; \ - i++, rgn = &memblock_type->regions[i]) - extern void *alloc_large_system_hash(const char *tablename, unsigned long bucketsize, unsigned long numentries, diff --git a/mm/memblock.c b/mm/memblock.c index 45f198750be9..59f3998ae5db 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -132,6 +132,11 @@ struct memblock_type physmem = { }; #endif +#define for_each_memblock_type(i, memblock_type, rgn) \ + for (i = 0, rgn = &memblock_type->regions[0]; \ + i < memblock_type->cnt; \ + i++, rgn = &memblock_type->regions[i]) + int memblock_debug __initdata_memblock; static bool system_has_some_mirror __initdata_memblock = false; static int memblock_can_resize __initdata_memblock; From 87c55870f01266fe22f345a0767162f85f1cf8f1 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 13 Oct 2020 16:57:54 -0700 Subject: [PATCH 170/181] memblock: make memblock_debug and related functionality private The only user of memblock_dbg() outside memblock was s390 setup code and it is converted to use pr_debug() instead. This allows to stop exposing memblock_debug and memblock_dbg() to the rest of the kernel. [akpm@linux-foundation.org: make memblock_dbg() safer and neater] Signed-off-by: Mike Rapoport Signed-off-by: Andrew Morton Reviewed-by: Baoquan He Cc: Andy Lutomirski Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christoph Hellwig Cc: Daniel Axtens Cc: Dave Hansen Cc: Emil Renner Berthing Cc: Hari Bathini Cc: Ingo Molnar Cc: Ingo Molnar Cc: Jonathan Cameron Cc: Marek Szyprowski Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Miguel Ojeda Cc: Palmer Dabbelt Cc: Paul Mackerras Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Russell King Cc: Stafford Horne Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Will Deacon Cc: Yoshinori Sato Link: https://lkml.kernel.org/r/20200818151634.14343-10-rppt@kernel.org Signed-off-by: Linus Torvalds --- arch/s390/kernel/setup.c | 4 ++-- include/linux/memblock.h | 12 +----------- mm/memblock.c | 16 ++++++++++++++-- 3 files changed, 17 insertions(+), 15 deletions(-) diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c index c2c1b4e723ea..115c92839af5 100644 --- a/arch/s390/kernel/setup.c +++ b/arch/s390/kernel/setup.c @@ -776,8 +776,8 @@ static void __init memblock_add_mem_detect_info(void) unsigned long start, end; int i; - memblock_dbg("physmem info source: %s (%hhd)\n", - get_mem_info_source(), mem_detect.info_source); + pr_debug("physmem info source: %s (%hhd)\n", + get_mem_info_source(), mem_detect.info_source); /* keep memblock lists close to the kernel */ memblock_set_bottom_up(true); for_each_mem_detect_block(i, &start, &end) { diff --git a/include/linux/memblock.h b/include/linux/memblock.h index 550faf69fc1c..47a76e237fca 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -86,7 +86,6 @@ struct memblock { }; extern struct memblock memblock; -extern int memblock_debug; #ifndef CONFIG_ARCH_KEEP_MEMBLOCK #define __init_memblock __meminit @@ -98,9 +97,6 @@ void memblock_discard(void); static inline void memblock_discard(void) {} #endif -#define memblock_dbg(fmt, ...) \ - if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__) - phys_addr_t memblock_find_in_range(phys_addr_t start, phys_addr_t end, phys_addr_t size, phys_addr_t align); void memblock_allow_resize(void); @@ -476,13 +472,7 @@ bool memblock_is_region_memory(phys_addr_t base, phys_addr_t size); bool memblock_is_reserved(phys_addr_t addr); bool memblock_is_region_reserved(phys_addr_t base, phys_addr_t size); -extern void __memblock_dump_all(void); - -static inline void memblock_dump_all(void) -{ - if (memblock_debug) - __memblock_dump_all(); -} +void memblock_dump_all(void); /** * memblock_set_current_limit - Set the current allocation limit to allow diff --git a/mm/memblock.c b/mm/memblock.c index 59f3998ae5db..e9b6efd74329 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -137,7 +137,13 @@ struct memblock_type physmem = { i < memblock_type->cnt; \ i++, rgn = &memblock_type->regions[i]) -int memblock_debug __initdata_memblock; +#define memblock_dbg(fmt, ...) \ + do { \ + if (memblock_debug) \ + pr_info(fmt, ##__VA_ARGS__); \ + } while (0) + +static int memblock_debug __initdata_memblock; static bool system_has_some_mirror __initdata_memblock = false; static int memblock_can_resize __initdata_memblock; static int memblock_memory_in_slab __initdata_memblock = 0; @@ -1920,7 +1926,7 @@ static void __init_memblock memblock_dump(struct memblock_type *type) } } -void __init_memblock __memblock_dump_all(void) +static void __init_memblock __memblock_dump_all(void) { pr_info("MEMBLOCK configuration:\n"); pr_info(" memory size = %pa reserved size = %pa\n", @@ -1934,6 +1940,12 @@ void __init_memblock __memblock_dump_all(void) #endif } +void __init_memblock memblock_dump_all(void) +{ + if (memblock_debug) + __memblock_dump_all(); +} + void __init memblock_allow_resize(void) { memblock_can_resize = 1; From 6e245ad4a17ab92dba63406d3f517520a86c0a80 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 13 Oct 2020 16:57:59 -0700 Subject: [PATCH 171/181] memblock: reduce number of parameters in for_each_mem_range() Currently for_each_mem_range() and for_each_mem_range_rev() iterators are the most generic way to traverse memblock regions. As such, they have 8 parameters and they are hardly convenient to users. Most users choose to utilize one of their wrappers and the only user that actually needs most of the parameters is memblock itself. To avoid yet another naming for memblock iterators, rename the existing for_each_mem_range[_rev]() to __for_each_mem_range[_rev]() and add a new for_each_mem_range[_rev]() wrappers with only index, start and end parameters. The new wrapper nicely fits into init_unavailable_mem() and will be used in upcoming changes to simplify memblock traversals. Signed-off-by: Mike Rapoport Signed-off-by: Andrew Morton Acked-by: Thomas Bogendoerfer [MIPS] Cc: Andy Lutomirski Cc: Baoquan He Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christoph Hellwig Cc: Daniel Axtens Cc: Dave Hansen Cc: Emil Renner Berthing Cc: Hari Bathini Cc: Ingo Molnar Cc: Ingo Molnar Cc: Jonathan Cameron Cc: Marek Szyprowski Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Miguel Ojeda Cc: Palmer Dabbelt Cc: Paul Mackerras Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Russell King Cc: Stafford Horne Cc: Thomas Gleixner Cc: Will Deacon Cc: Yoshinori Sato Link: https://lkml.kernel.org/r/20200818151634.14343-11-rppt@kernel.org Signed-off-by: Linus Torvalds --- .clang-format | 2 ++ arch/arm64/kernel/machine_kexec_file.c | 6 ++-- arch/powerpc/kexec/file_load_64.c | 6 ++-- include/linux/memblock.h | 41 +++++++++++++++++++------- mm/page_alloc.c | 3 +- 5 files changed, 38 insertions(+), 20 deletions(-) diff --git a/.clang-format b/.clang-format index badfc1ba440a..0366a3d2e561 100644 --- a/.clang-format +++ b/.clang-format @@ -207,7 +207,9 @@ ForEachMacros: - 'for_each_memblock_type' - 'for_each_memcg_cache_index' - 'for_each_mem_pfn_range' + - '__for_each_mem_range' - 'for_each_mem_range' + - '__for_each_mem_range_rev' - 'for_each_mem_range_rev' - 'for_each_migratetype_order' - 'for_each_msi_entry' diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c index 361a1143e09e..5b0e67b93cdc 100644 --- a/arch/arm64/kernel/machine_kexec_file.c +++ b/arch/arm64/kernel/machine_kexec_file.c @@ -215,8 +215,7 @@ static int prepare_elf_headers(void **addr, unsigned long *sz) phys_addr_t start, end; nr_ranges = 1; /* for exclusion of crashkernel region */ - for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE, - MEMBLOCK_NONE, &start, &end, NULL) + for_each_mem_range(i, &start, &end) nr_ranges++; cmem = kmalloc(struct_size(cmem, ranges, nr_ranges), GFP_KERNEL); @@ -225,8 +224,7 @@ static int prepare_elf_headers(void **addr, unsigned long *sz) cmem->max_nr_ranges = nr_ranges; cmem->nr_ranges = 0; - for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE, - MEMBLOCK_NONE, &start, &end, NULL) { + for_each_mem_range(i, &start, &end) { cmem->ranges[cmem->nr_ranges].start = start; cmem->ranges[cmem->nr_ranges].end = end - 1; cmem->nr_ranges++; diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c index 53bb71e3a2e1..2c9d908eab96 100644 --- a/arch/powerpc/kexec/file_load_64.c +++ b/arch/powerpc/kexec/file_load_64.c @@ -250,8 +250,7 @@ static int __locate_mem_hole_top_down(struct kexec_buf *kbuf, phys_addr_t start, end; u64 i; - for_each_mem_range_rev(i, &memblock.memory, NULL, NUMA_NO_NODE, - MEMBLOCK_NONE, &start, &end, NULL) { + for_each_mem_range_rev(i, &start, &end) { /* * memblock uses [start, end) convention while it is * [start, end] here. Fix the off-by-one to have the @@ -350,8 +349,7 @@ static int __locate_mem_hole_bottom_up(struct kexec_buf *kbuf, phys_addr_t start, end; u64 i; - for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE, - MEMBLOCK_NONE, &start, &end, NULL) { + for_each_mem_range(i, &start, &end) { /* * memblock uses [start, end) convention while it is * [start, end] here. Fix the off-by-one to have the diff --git a/include/linux/memblock.h b/include/linux/memblock.h index 47a76e237fca..27c3b84d1615 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -162,7 +162,7 @@ static inline void __next_physmem_range(u64 *idx, struct memblock_type *type, #endif /* CONFIG_HAVE_MEMBLOCK_PHYS_MAP */ /** - * for_each_mem_range - iterate through memblock areas from type_a and not + * __for_each_mem_range - iterate through memblock areas from type_a and not * included in type_b. Or just type_a if type_b is NULL. * @i: u64 used as loop variable * @type_a: ptr to memblock_type to iterate @@ -173,7 +173,7 @@ static inline void __next_physmem_range(u64 *idx, struct memblock_type *type, * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL * @p_nid: ptr to int for nid of the range, can be %NULL */ -#define for_each_mem_range(i, type_a, type_b, nid, flags, \ +#define __for_each_mem_range(i, type_a, type_b, nid, flags, \ p_start, p_end, p_nid) \ for (i = 0, __next_mem_range(&i, nid, flags, type_a, type_b, \ p_start, p_end, p_nid); \ @@ -182,7 +182,7 @@ static inline void __next_physmem_range(u64 *idx, struct memblock_type *type, p_start, p_end, p_nid)) /** - * for_each_mem_range_rev - reverse iterate through memblock areas from + * __for_each_mem_range_rev - reverse iterate through memblock areas from * type_a and not included in type_b. Or just type_a if type_b is NULL. * @i: u64 used as loop variable * @type_a: ptr to memblock_type to iterate @@ -193,15 +193,36 @@ static inline void __next_physmem_range(u64 *idx, struct memblock_type *type, * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL * @p_nid: ptr to int for nid of the range, can be %NULL */ -#define for_each_mem_range_rev(i, type_a, type_b, nid, flags, \ - p_start, p_end, p_nid) \ +#define __for_each_mem_range_rev(i, type_a, type_b, nid, flags, \ + p_start, p_end, p_nid) \ for (i = (u64)ULLONG_MAX, \ - __next_mem_range_rev(&i, nid, flags, type_a, type_b,\ + __next_mem_range_rev(&i, nid, flags, type_a, type_b, \ p_start, p_end, p_nid); \ i != (u64)ULLONG_MAX; \ __next_mem_range_rev(&i, nid, flags, type_a, type_b, \ p_start, p_end, p_nid)) +/** + * for_each_mem_range - iterate through memory areas. + * @i: u64 used as loop variable + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL + */ +#define for_each_mem_range(i, p_start, p_end) \ + __for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE, \ + MEMBLOCK_NONE, p_start, p_end, NULL) + +/** + * for_each_mem_range_rev - reverse iterate through memblock areas from + * type_a and not included in type_b. Or just type_a if type_b is NULL. + * @i: u64 used as loop variable + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL + */ +#define for_each_mem_range_rev(i, p_start, p_end) \ + __for_each_mem_range_rev(i, &memblock.memory, NULL, NUMA_NO_NODE, \ + MEMBLOCK_NONE, p_start, p_end, NULL) + /** * for_each_reserved_mem_region - iterate over all reserved memblock areas * @i: u64 used as loop variable @@ -307,8 +328,8 @@ int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask); * soon as memblock is initialized. */ #define for_each_free_mem_range(i, nid, flags, p_start, p_end, p_nid) \ - for_each_mem_range(i, &memblock.memory, &memblock.reserved, \ - nid, flags, p_start, p_end, p_nid) + __for_each_mem_range(i, &memblock.memory, &memblock.reserved, \ + nid, flags, p_start, p_end, p_nid) /** * for_each_free_mem_range_reverse - rev-iterate through free memblock areas @@ -324,8 +345,8 @@ int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask); */ #define for_each_free_mem_range_reverse(i, nid, flags, p_start, p_end, \ p_nid) \ - for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved, \ - nid, flags, p_start, p_end, p_nid) + __for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved, \ + nid, flags, p_start, p_end, p_nid) int memblock_set_node(phys_addr_t base, phys_addr_t size, struct memblock_type *type, int nid); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 73e33ab6d249..34ac7127d1e6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6990,8 +6990,7 @@ static void __init init_unavailable_mem(void) * Loop through unavailable ranges not covered by memblock.memory. */ pgcnt = 0; - for_each_mem_range(i, &memblock.memory, NULL, - NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, NULL) { + for_each_mem_range(i, &start, &end) { if (next < start) pgcnt += init_unavailable_range(PFN_DOWN(next), PFN_UP(start)); From c9118e6c37bff9ade90b638207a6e0db676ee6a9 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 13 Oct 2020 16:58:03 -0700 Subject: [PATCH 172/181] arch, mm: replace for_each_memblock() with for_each_mem_pfn_range() There are several occurrences of the following pattern: for_each_memblock(memory, reg) { start_pfn = memblock_region_memory_base_pfn(reg); end_pfn = memblock_region_memory_end_pfn(reg); /* do something with start_pfn and end_pfn */ } Rather than iterate over all memblock.memory regions and each time query for their start and end PFNs, use for_each_mem_pfn_range() iterator to get simpler and clearer code. Signed-off-by: Mike Rapoport Signed-off-by: Andrew Morton Reviewed-by: Baoquan He Acked-by: Miguel Ojeda [.clang-format] Cc: Andy Lutomirski Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christoph Hellwig Cc: Daniel Axtens Cc: Dave Hansen Cc: Emil Renner Berthing Cc: Hari Bathini Cc: Ingo Molnar Cc: Ingo Molnar Cc: Jonathan Cameron Cc: Marek Szyprowski Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Palmer Dabbelt Cc: Paul Mackerras Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Russell King Cc: Stafford Horne Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Will Deacon Cc: Yoshinori Sato Link: https://lkml.kernel.org/r/20200818151634.14343-12-rppt@kernel.org Signed-off-by: Linus Torvalds --- arch/arm/mm/init.c | 11 ++++------- arch/arm64/mm/init.c | 11 ++++------- arch/powerpc/kernel/fadump.c | 11 ++++++----- arch/powerpc/mm/mem.c | 15 ++++++++------- arch/powerpc/mm/numa.c | 7 ++----- arch/s390/mm/page-states.c | 6 ++---- arch/sh/mm/init.c | 9 +++------ mm/memblock.c | 6 ++---- mm/sparse.c | 10 ++++------ 9 files changed, 35 insertions(+), 51 deletions(-) diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c index 50a5a30a78ff..45f9d5ec2360 100644 --- a/arch/arm/mm/init.c +++ b/arch/arm/mm/init.c @@ -299,16 +299,14 @@ free_memmap(unsigned long start_pfn, unsigned long end_pfn) */ static void __init free_unused_memmap(void) { - unsigned long start, prev_end = 0; - struct memblock_region *reg; + unsigned long start, end, prev_end = 0; + int i; /* * This relies on each bank being in address order. * The banks are sorted previously in bootmem_init(). */ - for_each_memblock(memory, reg) { - start = memblock_region_memory_base_pfn(reg); - + for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, NULL) { #ifdef CONFIG_SPARSEMEM /* * Take care not to free memmap entries that don't exist @@ -336,8 +334,7 @@ static void __init free_unused_memmap(void) * memmap entries are valid from the bank end aligned to * MAX_ORDER_NR_PAGES. */ - prev_end = ALIGN(memblock_region_memory_end_pfn(reg), - MAX_ORDER_NR_PAGES); + prev_end = ALIGN(end, MAX_ORDER_NR_PAGES); } #ifdef CONFIG_SPARSEMEM diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index 481d22c32a2e..f0bf86d81622 100644 --- a/arch/arm64/mm/init.c +++ b/arch/arm64/mm/init.c @@ -471,12 +471,10 @@ static inline void free_memmap(unsigned long start_pfn, unsigned long end_pfn) */ static void __init free_unused_memmap(void) { - unsigned long start, prev_end = 0; - struct memblock_region *reg; - - for_each_memblock(memory, reg) { - start = __phys_to_pfn(reg->base); + unsigned long start, end, prev_end = 0; + int i; + for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, NULL) { #ifdef CONFIG_SPARSEMEM /* * Take care not to free memmap entries that don't exist due @@ -496,8 +494,7 @@ static void __init free_unused_memmap(void) * memmap entries are valid from the bank end aligned to * MAX_ORDER_NR_PAGES. */ - prev_end = ALIGN(__phys_to_pfn(reg->base + reg->size), - MAX_ORDER_NR_PAGES); + prev_end = ALIGN(end, MAX_ORDER_NR_PAGES); } #ifdef CONFIG_SPARSEMEM diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c index 10ebb4bf71ad..e469b150be21 100644 --- a/arch/powerpc/kernel/fadump.c +++ b/arch/powerpc/kernel/fadump.c @@ -1242,14 +1242,15 @@ static void fadump_free_reserved_memory(unsigned long start_pfn, */ static void fadump_release_reserved_area(u64 start, u64 end) { - u64 tstart, tend, spfn, epfn; - struct memblock_region *reg; + u64 tstart, tend, spfn, epfn, reg_spfn, reg_epfn, i; spfn = PHYS_PFN(start); epfn = PHYS_PFN(end); - for_each_memblock(memory, reg) { - tstart = max_t(u64, spfn, memblock_region_memory_base_pfn(reg)); - tend = min_t(u64, epfn, memblock_region_memory_end_pfn(reg)); + + for_each_mem_pfn_range(i, MAX_NUMNODES, ®_spfn, ®_epfn, NULL) { + tstart = max_t(u64, spfn, reg_spfn); + tend = min_t(u64, epfn, reg_epfn); + if (tstart < tend) { fadump_free_reserved_memory(tstart, tend); diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c index 42e25874f5a8..80df329f180e 100644 --- a/arch/powerpc/mm/mem.c +++ b/arch/powerpc/mm/mem.c @@ -184,15 +184,16 @@ void __init initmem_init(void) /* mark pages that don't exist as nosave */ static int __init mark_nonram_nosave(void) { - struct memblock_region *reg, *prev = NULL; + unsigned long spfn, epfn, prev = 0; + int i; - for_each_memblock(memory, reg) { - if (prev && - memblock_region_memory_end_pfn(prev) < memblock_region_memory_base_pfn(reg)) - register_nosave_region(memblock_region_memory_end_pfn(prev), - memblock_region_memory_base_pfn(reg)); - prev = reg; + for_each_mem_pfn_range(i, MAX_NUMNODES, &spfn, &epfn, NULL) { + if (prev && prev < spfn) + register_nosave_region(prev, spfn); + + prev = epfn; } + return 0; } #else /* CONFIG_NEED_MULTIPLE_NODES */ diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 1f61fa2148b5..f4e20d8e6c02 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -804,17 +804,14 @@ static void __init setup_nonnuma(void) unsigned long total_ram = memblock_phys_mem_size(); unsigned long start_pfn, end_pfn; unsigned int nid = 0; - struct memblock_region *reg; + int i; printk(KERN_DEBUG "Top of RAM: 0x%lx, Total RAM: 0x%lx\n", top_of_ram, total_ram); printk(KERN_DEBUG "Memory hole size: %ldMB\n", (top_of_ram - total_ram) >> 20); - for_each_memblock(memory, reg) { - start_pfn = memblock_region_memory_base_pfn(reg); - end_pfn = memblock_region_memory_end_pfn(reg); - + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) { fake_numa_create_new_node(end_pfn, &nid); memblock_set_node(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn - start_pfn), diff --git a/arch/s390/mm/page-states.c b/arch/s390/mm/page-states.c index fc141893d028..567c69f3069e 100644 --- a/arch/s390/mm/page-states.c +++ b/arch/s390/mm/page-states.c @@ -183,9 +183,9 @@ static void mark_kernel_pgd(void) void __init cmma_init_nodat(void) { - struct memblock_region *reg; struct page *page; unsigned long start, end, ix; + int i; if (cmma_flag < 2) return; @@ -193,9 +193,7 @@ void __init cmma_init_nodat(void) mark_kernel_pgd(); /* Set all kernel pages not used for page tables to stable/no-dat */ - for_each_memblock(memory, reg) { - start = memblock_region_memory_base_pfn(reg); - end = memblock_region_memory_end_pfn(reg); + for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, NULL) { page = pfn_to_page(start); for (ix = start; ix < end; ix++, page++) { if (__test_and_clear_bit(PG_arch_1, &page->flags)) diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c index 4735176ab811..3348e0c4d769 100644 --- a/arch/sh/mm/init.c +++ b/arch/sh/mm/init.c @@ -226,15 +226,12 @@ void __init allocate_pgdat(unsigned int nid) static void __init do_init_bootmem(void) { - struct memblock_region *reg; + unsigned long start_pfn, end_pfn; + int i; /* Add active regions with valid PFNs. */ - for_each_memblock(memory, reg) { - unsigned long start_pfn, end_pfn; - start_pfn = memblock_region_memory_base_pfn(reg); - end_pfn = memblock_region_memory_end_pfn(reg); + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) __add_active_range(0, start_pfn, end_pfn); - } /* All of system RAM sits in node 0 for the non-NUMA case */ allocate_pgdat(0); diff --git a/mm/memblock.c b/mm/memblock.c index e9b6efd74329..d66d805dec96 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1663,12 +1663,10 @@ phys_addr_t __init_memblock memblock_reserved_size(void) phys_addr_t __init memblock_mem_size(unsigned long limit_pfn) { unsigned long pages = 0; - struct memblock_region *r; unsigned long start_pfn, end_pfn; + int i; - for_each_memblock(memory, r) { - start_pfn = memblock_region_memory_base_pfn(r); - end_pfn = memblock_region_memory_end_pfn(r); + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) { start_pfn = min_t(unsigned long, start_pfn, limit_pfn); end_pfn = min_t(unsigned long, end_pfn, limit_pfn); pages += end_pfn - start_pfn; diff --git a/mm/sparse.c b/mm/sparse.c index fcc3d176f1ea..b25ad8e64839 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -291,13 +291,11 @@ static void __init memory_present(int nid, unsigned long start, unsigned long en */ static void __init memblocks_present(void) { - struct memblock_region *reg; + unsigned long start, end; + int i, nid; - for_each_memblock(memory, reg) { - memory_present(memblock_get_region_node(reg), - memblock_region_memory_base_pfn(reg), - memblock_region_memory_end_pfn(reg)); - } + for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) + memory_present(nid, start, end); } /* From b10d6bca87204cdafd0cd7aaa837ad30b4eb8c20 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 13 Oct 2020 16:58:08 -0700 Subject: [PATCH 173/181] arch, drivers: replace for_each_membock() with for_each_mem_range() There are several occurrences of the following pattern: for_each_memblock(memory, reg) { start = __pfn_to_phys(memblock_region_memory_base_pfn(reg); end = __pfn_to_phys(memblock_region_memory_end_pfn(reg)); /* do something with start and end */ } Using for_each_mem_range() iterator is more appropriate in such cases and allows simpler and cleaner code. [akpm@linux-foundation.org: fix arch/arm/mm/pmsa-v7.c build] [rppt@linux.ibm.com: mips: fix cavium-octeon build caused by memblock refactoring] Link: http://lkml.kernel.org/r/20200827124549.GD167163@linux.ibm.com Signed-off-by: Mike Rapoport Signed-off-by: Andrew Morton Cc: Andy Lutomirski Cc: Baoquan He Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christoph Hellwig Cc: Daniel Axtens Cc: Dave Hansen Cc: Emil Renner Berthing Cc: Hari Bathini Cc: Ingo Molnar Cc: Ingo Molnar Cc: Jonathan Cameron Cc: Marek Szyprowski Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Miguel Ojeda Cc: Palmer Dabbelt Cc: Paul Mackerras Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Russell King Cc: Stafford Horne Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Will Deacon Cc: Yoshinori Sato Link: https://lkml.kernel.org/r/20200818151634.14343-13-rppt@kernel.org Signed-off-by: Linus Torvalds --- arch/arm/kernel/setup.c | 18 ++++++--- arch/arm/mm/mmu.c | 39 ++++++------------ arch/arm/mm/pmsa-v7.c | 23 ++++++----- arch/arm/mm/pmsa-v8.c | 17 ++++---- arch/arm/xen/mm.c | 7 ++-- arch/arm64/mm/kasan_init.c | 10 ++--- arch/arm64/mm/mmu.c | 11 ++---- arch/c6x/kernel/setup.c | 9 +++-- arch/microblaze/mm/init.c | 9 +++-- arch/mips/cavium-octeon/dma-octeon.c | 14 +++---- arch/mips/kernel/setup.c | 31 +++++++-------- arch/openrisc/mm/init.c | 8 ++-- arch/powerpc/kernel/fadump.c | 50 +++++++++++------------- arch/powerpc/kexec/file_load_64.c | 10 ++--- arch/powerpc/mm/book3s64/hash_utils.c | 16 ++++---- arch/powerpc/mm/book3s64/radix_pgtable.c | 10 ++--- arch/powerpc/mm/kasan/kasan_init_32.c | 8 ++-- arch/powerpc/mm/mem.c | 16 +++++--- arch/powerpc/mm/pgtable_32.c | 8 ++-- arch/riscv/mm/init.c | 25 +++++------- arch/riscv/mm/kasan_init.c | 10 ++--- arch/s390/kernel/setup.c | 23 +++++++---- arch/s390/mm/vmem.c | 7 ++-- arch/sparc/mm/init_64.c | 12 ++---- drivers/bus/mvebu-mbus.c | 12 +++--- 25 files changed, 195 insertions(+), 208 deletions(-) diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c index d8e18cdd96d3..3f65d0ac9f63 100644 --- a/arch/arm/kernel/setup.c +++ b/arch/arm/kernel/setup.c @@ -843,19 +843,25 @@ early_param("mem", early_mem); static void __init request_standard_resources(const struct machine_desc *mdesc) { - struct memblock_region *region; + phys_addr_t start, end, res_end; struct resource *res; + u64 i; kernel_code.start = virt_to_phys(_text); kernel_code.end = virt_to_phys(__init_begin - 1); kernel_data.start = virt_to_phys(_sdata); kernel_data.end = virt_to_phys(_end - 1); - for_each_memblock(memory, region) { - phys_addr_t start = __pfn_to_phys(memblock_region_memory_base_pfn(region)); - phys_addr_t end = __pfn_to_phys(memblock_region_memory_end_pfn(region)) - 1; + for_each_mem_range(i, &start, &end) { unsigned long boot_alias_start; + /* + * In memblock, end points to the first byte after the + * range while in resourses, end points to the last byte in + * the range. + */ + res_end = end - 1; + /* * Some systems have a special memory alias which is only * used for booting. We need to advertise this region to @@ -869,7 +875,7 @@ static void __init request_standard_resources(const struct machine_desc *mdesc) __func__, sizeof(*res)); res->name = "System RAM (boot alias)"; res->start = boot_alias_start; - res->end = phys_to_idmap(end); + res->end = phys_to_idmap(res_end); res->flags = IORESOURCE_MEM | IORESOURCE_BUSY; request_resource(&iomem_resource, res); } @@ -880,7 +886,7 @@ static void __init request_standard_resources(const struct machine_desc *mdesc) sizeof(*res)); res->name = "System RAM"; res->start = start; - res->end = end; + res->end = res_end; res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY; request_resource(&iomem_resource, res); diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c index c36f977b2ccb..698cc740c6b8 100644 --- a/arch/arm/mm/mmu.c +++ b/arch/arm/mm/mmu.c @@ -1154,9 +1154,8 @@ phys_addr_t arm_lowmem_limit __initdata = 0; void __init adjust_lowmem_bounds(void) { - phys_addr_t memblock_limit = 0; - u64 vmalloc_limit; - struct memblock_region *reg; + phys_addr_t block_start, block_end, memblock_limit = 0; + u64 vmalloc_limit, i; phys_addr_t lowmem_limit = 0; /* @@ -1172,26 +1171,18 @@ void __init adjust_lowmem_bounds(void) * The first usable region must be PMD aligned. Mark its start * as MEMBLOCK_NOMAP if it isn't */ - for_each_memblock(memory, reg) { - if (!memblock_is_nomap(reg)) { - if (!IS_ALIGNED(reg->base, PMD_SIZE)) { - phys_addr_t len; + for_each_mem_range(i, &block_start, &block_end) { + if (!IS_ALIGNED(block_start, PMD_SIZE)) { + phys_addr_t len; - len = round_up(reg->base, PMD_SIZE) - reg->base; - memblock_mark_nomap(reg->base, len); - } - break; + len = round_up(block_start, PMD_SIZE) - block_start; + memblock_mark_nomap(block_start, len); } + break; } - for_each_memblock(memory, reg) { - phys_addr_t block_start = reg->base; - phys_addr_t block_end = reg->base + reg->size; - - if (memblock_is_nomap(reg)) - continue; - - if (reg->base < vmalloc_limit) { + for_each_mem_range(i, &block_start, &block_end) { + if (block_start < vmalloc_limit) { if (block_end > lowmem_limit) /* * Compare as u64 to ensure vmalloc_limit does @@ -1440,19 +1431,15 @@ static void __init kmap_init(void) static void __init map_lowmem(void) { - struct memblock_region *reg; phys_addr_t kernel_x_start = round_down(__pa(KERNEL_START), SECTION_SIZE); phys_addr_t kernel_x_end = round_up(__pa(__init_end), SECTION_SIZE); + phys_addr_t start, end; + u64 i; /* Map all the lowmem memory banks. */ - for_each_memblock(memory, reg) { - phys_addr_t start = reg->base; - phys_addr_t end = start + reg->size; + for_each_mem_range(i, &start, &end) { struct map_desc map; - if (memblock_is_nomap(reg)) - continue; - if (end > arm_lowmem_limit) end = arm_lowmem_limit; if (start >= end) diff --git a/arch/arm/mm/pmsa-v7.c b/arch/arm/mm/pmsa-v7.c index 699fa2e88725..88950e41a3a9 100644 --- a/arch/arm/mm/pmsa-v7.c +++ b/arch/arm/mm/pmsa-v7.c @@ -231,12 +231,12 @@ static int __init allocate_region(phys_addr_t base, phys_addr_t size, void __init pmsav7_adjust_lowmem_bounds(void) { phys_addr_t specified_mem_size = 0, total_mem_size = 0; - struct memblock_region *reg; - bool first = true; phys_addr_t mem_start; phys_addr_t mem_end; + phys_addr_t reg_start, reg_end; unsigned int mem_max_regions; - int num, i; + int num; + u64 i; /* Free-up PMSAv7_PROBE_REGION */ mpu_min_region_order = __mpu_min_region_order(); @@ -262,20 +262,19 @@ void __init pmsav7_adjust_lowmem_bounds(void) mem_max_regions -= num; #endif - for_each_memblock(memory, reg) { - if (first) { + for_each_mem_range(i, ®_start, ®_end) { + if (i == 0) { phys_addr_t phys_offset = PHYS_OFFSET; /* * Initially only use memory continuous from * PHYS_OFFSET */ - if (reg->base != phys_offset) + if (reg_start != phys_offset) panic("First memory bank must be contiguous from PHYS_OFFSET"); - mem_start = reg->base; - mem_end = reg->base + reg->size; - specified_mem_size = reg->size; - first = false; + mem_start = reg_start; + mem_end = reg_end; + specified_mem_size = mem_end - mem_start; } else { /* * memblock auto merges contiguous blocks, remove @@ -283,8 +282,8 @@ void __init pmsav7_adjust_lowmem_bounds(void) * blocks separately while iterating) */ pr_notice("Ignoring RAM after %pa, memory at %pa ignored\n", - &mem_end, ®->base); - memblock_remove(reg->base, 0 - reg->base); + &mem_end, ®_start); + memblock_remove(reg_start, 0 - reg_start); break; } } diff --git a/arch/arm/mm/pmsa-v8.c b/arch/arm/mm/pmsa-v8.c index 0d7d5fb59247..2de019f7503e 100644 --- a/arch/arm/mm/pmsa-v8.c +++ b/arch/arm/mm/pmsa-v8.c @@ -94,20 +94,19 @@ static __init bool is_region_fixed(int number) void __init pmsav8_adjust_lowmem_bounds(void) { phys_addr_t mem_end; - struct memblock_region *reg; - bool first = true; + phys_addr_t reg_start, reg_end; + u64 i; - for_each_memblock(memory, reg) { - if (first) { + for_each_mem_range(i, ®_start, ®_end) { + if (i == 0) { phys_addr_t phys_offset = PHYS_OFFSET; /* * Initially only use memory continuous from * PHYS_OFFSET */ - if (reg->base != phys_offset) + if (reg_start != phys_offset) panic("First memory bank must be contiguous from PHYS_OFFSET"); - mem_end = reg->base + reg->size; - first = false; + mem_end = reg_end; } else { /* * memblock auto merges contiguous blocks, remove @@ -115,8 +114,8 @@ void __init pmsav8_adjust_lowmem_bounds(void) * blocks separately while iterating) */ pr_notice("Ignoring RAM after %pa, memory at %pa ignored\n", - &mem_end, ®->base); - memblock_remove(reg->base, 0 - reg->base); + &mem_end, ®_start); + memblock_remove(reg_start, 0 - reg_start); break; } } diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c index 396797ffe2b1..d3ef975a0965 100644 --- a/arch/arm/xen/mm.c +++ b/arch/arm/xen/mm.c @@ -25,11 +25,12 @@ unsigned long xen_get_swiotlb_free_pages(unsigned int order) { - struct memblock_region *reg; + phys_addr_t base; gfp_t flags = __GFP_NOWARN|__GFP_KSWAPD_RECLAIM; + u64 i; - for_each_memblock(memory, reg) { - if (reg->base < (phys_addr_t)0xffffffff) { + for_each_mem_range(i, &base, NULL) { + if (base < (phys_addr_t)0xffffffff) { if (IS_ENABLED(CONFIG_ZONE_DMA32)) flags |= __GFP_DMA32; else diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c index 7291b26ce788..b24e43d20667 100644 --- a/arch/arm64/mm/kasan_init.c +++ b/arch/arm64/mm/kasan_init.c @@ -212,8 +212,8 @@ void __init kasan_init(void) { u64 kimg_shadow_start, kimg_shadow_end; u64 mod_shadow_start, mod_shadow_end; - struct memblock_region *reg; - int i; + phys_addr_t pa_start, pa_end; + u64 i; kimg_shadow_start = (u64)kasan_mem_to_shadow(_text) & PAGE_MASK; kimg_shadow_end = PAGE_ALIGN((u64)kasan_mem_to_shadow(_end)); @@ -246,9 +246,9 @@ void __init kasan_init(void) kasan_populate_early_shadow((void *)mod_shadow_end, (void *)kimg_shadow_start); - for_each_memblock(memory, reg) { - void *start = (void *)__phys_to_virt(reg->base); - void *end = (void *)__phys_to_virt(reg->base + reg->size); + for_each_mem_range(i, &pa_start, &pa_end) { + void *start = (void *)__phys_to_virt(pa_start); + void *end = (void *)__phys_to_virt(pa_end); if (start >= end) break; diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 087a844b4d26..beff3ad8c7f8 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -473,8 +473,9 @@ static void __init map_mem(pgd_t *pgdp) { phys_addr_t kernel_start = __pa_symbol(_text); phys_addr_t kernel_end = __pa_symbol(__init_begin); - struct memblock_region *reg; + phys_addr_t start, end; int flags = 0; + u64 i; if (rodata_full || debug_pagealloc_enabled()) flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; @@ -493,15 +494,9 @@ static void __init map_mem(pgd_t *pgdp) #endif /* map all the memory banks */ - for_each_memblock(memory, reg) { - phys_addr_t start = reg->base; - phys_addr_t end = start + reg->size; - + for_each_mem_range(i, &start, &end) { if (start >= end) break; - if (memblock_is_nomap(reg)) - continue; - /* * The linear map must allow allocation tags reading/writing * if MTE is present. Otherwise, it has the same attributes as diff --git a/arch/c6x/kernel/setup.c b/arch/c6x/kernel/setup.c index 8ef35131f999..9254c3b794a5 100644 --- a/arch/c6x/kernel/setup.c +++ b/arch/c6x/kernel/setup.c @@ -287,7 +287,8 @@ notrace void __init machine_init(unsigned long dt_ptr) void __init setup_arch(char **cmdline_p) { - struct memblock_region *reg; + phys_addr_t start, end; + u64 i; printk(KERN_INFO "Initializing kernel\n"); @@ -351,9 +352,9 @@ void __init setup_arch(char **cmdline_p) disable_caching(ram_start, ram_end - 1); /* Set caching of external RAM used by Linux */ - for_each_memblock(memory, reg) - enable_caching(CACHE_REGION_START(reg->base), - CACHE_REGION_START(reg->base + reg->size - 1)); + for_each_mem_range(i, &start, &end) + enable_caching(CACHE_REGION_START(start), + CACHE_REGION_START(end - 1)); #ifdef CONFIG_BLK_DEV_INITRD /* diff --git a/arch/microblaze/mm/init.c b/arch/microblaze/mm/init.c index 25ec8f2c3a4d..0902c459c385 100644 --- a/arch/microblaze/mm/init.c +++ b/arch/microblaze/mm/init.c @@ -109,13 +109,14 @@ static void __init paging_init(void) void __init setup_memory(void) { #ifndef CONFIG_MMU - struct memblock_region *reg; u32 kernel_align_start, kernel_align_size; + phys_addr_t start, end; + u64 i; /* Find main memory where is the kernel */ - for_each_memblock(memory, reg) { - memory_start = (u32)reg->base; - lowmem_size = reg->size; + for_each_mem_range(i, &start, &end) { + memory_start = start; + lowmem_size = end - start; if ((memory_start <= (u32)_text) && ((u32)_text <= (memory_start + lowmem_size - 1))) { memory_size = lowmem_size; diff --git a/arch/mips/cavium-octeon/dma-octeon.c b/arch/mips/cavium-octeon/dma-octeon.c index 14ea680d180e..ad1aecc4b401 100644 --- a/arch/mips/cavium-octeon/dma-octeon.c +++ b/arch/mips/cavium-octeon/dma-octeon.c @@ -190,25 +190,25 @@ char *octeon_swiotlb; void __init plat_swiotlb_setup(void) { - struct memblock_region *mem; + phys_addr_t start, end; phys_addr_t max_addr; phys_addr_t addr_size; size_t swiotlbsize; unsigned long swiotlb_nslabs; + u64 i; max_addr = 0; addr_size = 0; - for_each_memblock(memory, mem) { + for_each_mem_range(i, &start, &end) { /* These addresses map low for PCI. */ - if (mem->base > 0x410000000ull && !OCTEON_IS_OCTEON2()) + if (start > 0x410000000ull && !OCTEON_IS_OCTEON2()) continue; - addr_size += mem->size; - - if (max_addr < mem->base + mem->size) - max_addr = mem->base + mem->size; + addr_size += (end - start); + if (max_addr < end) + max_addr = end; } swiotlbsize = PAGE_SIZE; diff --git a/arch/mips/kernel/setup.c b/arch/mips/kernel/setup.c index bf5f5acab0a8..335bd188b8b4 100644 --- a/arch/mips/kernel/setup.c +++ b/arch/mips/kernel/setup.c @@ -300,8 +300,9 @@ static void __init bootmem_init(void) static void __init bootmem_init(void) { - struct memblock_region *mem; phys_addr_t ramstart, ramend; + phys_addr_t start, end; + u64 i; ramstart = memblock_start_of_DRAM(); ramend = memblock_end_of_DRAM(); @@ -338,18 +339,13 @@ static void __init bootmem_init(void) min_low_pfn = ARCH_PFN_OFFSET; max_pfn = PFN_DOWN(ramend); - for_each_memblock(memory, mem) { - unsigned long start = memblock_region_memory_base_pfn(mem); - unsigned long end = memblock_region_memory_end_pfn(mem); - + for_each_mem_range(i, &start, &end) { /* * Skip highmem here so we get an accurate max_low_pfn if low * memory stops short of high memory. * If the region overlaps HIGHMEM_START, end is clipped so * max_pfn excludes the highmem portion. */ - if (memblock_is_nomap(mem)) - continue; if (start >= PFN_DOWN(HIGHMEM_START)) continue; if (end > PFN_DOWN(HIGHMEM_START)) @@ -450,13 +446,12 @@ early_param("memmap", early_parse_memmap); unsigned long setup_elfcorehdr, setup_elfcorehdr_size; static int __init early_parse_elfcorehdr(char *p) { - struct memblock_region *mem; + phys_addr_t start, end; + u64 i; setup_elfcorehdr = memparse(p, &p); - for_each_memblock(memory, mem) { - unsigned long start = mem->base; - unsigned long end = start + mem->size; + for_each_mem_range(i, &start, &end) { if (setup_elfcorehdr >= start && setup_elfcorehdr < end) { /* * Reserve from the elf core header to the end of @@ -720,7 +715,8 @@ static void __init arch_mem_init(char **cmdline_p) static void __init resource_init(void) { - struct memblock_region *region; + phys_addr_t start, end; + u64 i; if (UNCAC_BASE != IO_BASE) return; @@ -732,9 +728,7 @@ static void __init resource_init(void) bss_resource.start = __pa_symbol(&__bss_start); bss_resource.end = __pa_symbol(&__bss_stop) - 1; - for_each_memblock(memory, region) { - phys_addr_t start = PFN_PHYS(memblock_region_memory_base_pfn(region)); - phys_addr_t end = PFN_PHYS(memblock_region_memory_end_pfn(region)) - 1; + for_each_mem_range(i, &start, &end) { struct resource *res; res = memblock_alloc(sizeof(struct resource), SMP_CACHE_BYTES); @@ -743,7 +737,12 @@ static void __init resource_init(void) sizeof(struct resource)); res->start = start; - res->end = end; + /* + * In memblock, end points to the first byte after the + * range while in resourses, end points to the last byte in + * the range. + */ + res->end = end - 1; res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY; res->name = "System RAM"; diff --git a/arch/openrisc/mm/init.c b/arch/openrisc/mm/init.c index 3d7c79c7745d..8348feaaf46e 100644 --- a/arch/openrisc/mm/init.c +++ b/arch/openrisc/mm/init.c @@ -64,6 +64,7 @@ extern const char _s_kernel_ro[], _e_kernel_ro[]; */ static void __init map_ram(void) { + phys_addr_t start, end; unsigned long v, p, e; pgprot_t prot; pgd_t *pge; @@ -71,6 +72,7 @@ static void __init map_ram(void) pud_t *pue; pmd_t *pme; pte_t *pte; + u64 i; /* These mark extents of read-only kernel pages... * ...from vmlinux.lds.S */ @@ -78,9 +80,9 @@ static void __init map_ram(void) v = PAGE_OFFSET; - for_each_memblock(memory, region) { - p = (u32) region->base & PAGE_MASK; - e = p + (u32) region->size; + for_each_mem_range(i, &start, &end) { + p = (u32) start & PAGE_MASK; + e = (u32) end; v = (u32) __va(p); pge = pgd_offset_k(v); diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c index e469b150be21..5cdf4168a61a 100644 --- a/arch/powerpc/kernel/fadump.c +++ b/arch/powerpc/kernel/fadump.c @@ -191,13 +191,13 @@ int is_fadump_active(void) */ static bool is_fadump_mem_area_contiguous(u64 d_start, u64 d_end) { - struct memblock_region *reg; + phys_addr_t reg_start, reg_end; bool ret = false; - u64 start, end; + u64 i, start, end; - for_each_memblock(memory, reg) { - start = max_t(u64, d_start, reg->base); - end = min_t(u64, d_end, (reg->base + reg->size)); + for_each_mem_range(i, ®_start, ®_end) { + start = max_t(u64, d_start, reg_start); + end = min_t(u64, d_end, reg_end); if (d_start < end) { /* Memory hole from d_start to start */ if (start > d_start) @@ -422,34 +422,34 @@ static int __init add_boot_mem_regions(unsigned long mstart, static int __init fadump_get_boot_mem_regions(void) { - unsigned long base, size, cur_size, hole_size, last_end; + unsigned long size, cur_size, hole_size, last_end; unsigned long mem_size = fw_dump.boot_memory_size; - struct memblock_region *reg; + phys_addr_t reg_start, reg_end; int ret = 1; + u64 i; fw_dump.boot_mem_regs_cnt = 0; last_end = 0; hole_size = 0; cur_size = 0; - for_each_memblock(memory, reg) { - base = reg->base; - size = reg->size; - hole_size += (base - last_end); + for_each_mem_range(i, ®_start, ®_end) { + size = reg_end - reg_start; + hole_size += (reg_start - last_end); if ((cur_size + size) >= mem_size) { size = (mem_size - cur_size); - ret = add_boot_mem_regions(base, size); + ret = add_boot_mem_regions(reg_start, size); break; } mem_size -= size; cur_size += size; - ret = add_boot_mem_regions(base, size); + ret = add_boot_mem_regions(reg_start, size); if (!ret) break; - last_end = base + size; + last_end = reg_end; } fw_dump.boot_mem_top = PAGE_ALIGN(fw_dump.boot_memory_size + hole_size); @@ -985,9 +985,8 @@ static int fadump_init_elfcore_header(char *bufp) */ static int fadump_setup_crash_memory_ranges(void) { - struct memblock_region *reg; - u64 start, end; - int i, ret; + u64 i, start, end; + int ret; pr_debug("Setup crash memory ranges.\n"); crash_mrange_info.mem_range_cnt = 0; @@ -1005,10 +1004,7 @@ static int fadump_setup_crash_memory_ranges(void) return ret; } - for_each_memblock(memory, reg) { - start = (u64)reg->base; - end = start + (u64)reg->size; - + for_each_mem_range(i, &start, &end) { /* * skip the memory chunk that is already added * (0 through boot_memory_top). @@ -1242,7 +1238,9 @@ static void fadump_free_reserved_memory(unsigned long start_pfn, */ static void fadump_release_reserved_area(u64 start, u64 end) { - u64 tstart, tend, spfn, epfn, reg_spfn, reg_epfn, i; + unsigned long reg_spfn, reg_epfn; + u64 tstart, tend, spfn, epfn; + int i; spfn = PHYS_PFN(start); epfn = PHYS_PFN(end); @@ -1685,12 +1683,10 @@ int __init fadump_reserve_mem(void) /* Preserve everything above the base address */ static void __init fadump_reserve_crash_area(u64 base) { - struct memblock_region *reg; - u64 mstart, msize; + u64 i, mstart, mend, msize; - for_each_memblock(memory, reg) { - mstart = reg->base; - msize = reg->size; + for_each_mem_range(i, &mstart, &mend) { + msize = mend - mstart; if ((mstart + msize) < base) continue; diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c index 2c9d908eab96..c69bcf9b547a 100644 --- a/arch/powerpc/kexec/file_load_64.c +++ b/arch/powerpc/kexec/file_load_64.c @@ -138,15 +138,13 @@ out: */ static int get_crash_memory_ranges(struct crash_mem **mem_ranges) { - struct memblock_region *reg; + phys_addr_t base, end; struct crash_mem *tmem; + u64 i; int ret; - for_each_memblock(memory, reg) { - u64 base, size; - - base = (u64)reg->base; - size = (u64)reg->size; + for_each_mem_range(i, &base, &end) { + u64 size = end - base; /* Skip backup memory region, which needs a separate entry */ if (base == BACKUP_SRC_START) { diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c index c663e7ba801f..b830adee51f5 100644 --- a/arch/powerpc/mm/book3s64/hash_utils.c +++ b/arch/powerpc/mm/book3s64/hash_utils.c @@ -7,7 +7,7 @@ * * SMP scalability work: * Copyright (C) 2001 Anton Blanchard , IBM - * + * * Module name: htab.c * * Description: @@ -867,8 +867,8 @@ static void __init htab_initialize(void) unsigned long table; unsigned long pteg_count; unsigned long prot; - unsigned long base = 0, size = 0; - struct memblock_region *reg; + phys_addr_t base = 0, size = 0, end; + u64 i; DBG(" -> htab_initialize()\n"); @@ -884,7 +884,7 @@ static void __init htab_initialize(void) /* * Calculate the required size of the htab. We want the number of * PTEGs to equal one half the number of real pages. - */ + */ htab_size_bytes = htab_get_table_size(); pteg_count = htab_size_bytes >> 7; @@ -894,7 +894,7 @@ static void __init htab_initialize(void) firmware_has_feature(FW_FEATURE_PS3_LV1)) { /* Using a hypervisor which owns the htab */ htab_address = NULL; - _SDR1 = 0; + _SDR1 = 0; #ifdef CONFIG_FA_DUMP /* * If firmware assisted dump is active firmware preserves @@ -960,9 +960,9 @@ static void __init htab_initialize(void) #endif /* CONFIG_DEBUG_PAGEALLOC */ /* create bolted the linear mapping in the hash table */ - for_each_memblock(memory, reg) { - base = (unsigned long)__va(reg->base); - size = reg->size; + for_each_mem_range(i, &base, &end) { + size = end - base; + base = (unsigned long)__va(base); DBG("creating mapping for region: %lx..%lx (prot: %lx)\n", base, size, prot); diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index d5f0c10d752a..cc72666e891a 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -329,7 +329,8 @@ static int __meminit create_physical_mapping(unsigned long start, static void __init radix_init_pgtable(void) { unsigned long rts_field; - struct memblock_region *reg; + phys_addr_t start, end; + u64 i; /* We don't support slb for radix */ mmu_slb_size = 0; @@ -337,20 +338,19 @@ static void __init radix_init_pgtable(void) /* * Create the linear mapping */ - for_each_memblock(memory, reg) { + for_each_mem_range(i, &start, &end) { /* * The memblock allocator is up at this point, so the * page tables will be allocated within the range. No * need or a node (which we don't have yet). */ - if ((reg->base + reg->size) >= RADIX_VMALLOC_START) { + if (end >= RADIX_VMALLOC_START) { pr_warn("Outside the supported range\n"); continue; } - WARN_ON(create_physical_mapping(reg->base, - reg->base + reg->size, + WARN_ON(create_physical_mapping(start, end, radix_mem_block_size, -1, PAGE_KERNEL)); } diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c index fb294046e00e..26fda3203320 100644 --- a/arch/powerpc/mm/kasan/kasan_init_32.c +++ b/arch/powerpc/mm/kasan/kasan_init_32.c @@ -138,11 +138,11 @@ void __init kasan_mmu_init(void) void __init kasan_init(void) { - struct memblock_region *reg; + phys_addr_t base, end; + u64 i; - for_each_memblock(memory, reg) { - phys_addr_t base = reg->base; - phys_addr_t top = min(base + reg->size, total_lowmem); + for_each_mem_range(i, &base, &end) { + phys_addr_t top = min(end, total_lowmem); int ret; if (base >= top) diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c index 80df329f180e..5e2e7c0a8f1a 100644 --- a/arch/powerpc/mm/mem.c +++ b/arch/powerpc/mm/mem.c @@ -585,20 +585,24 @@ void flush_icache_user_page(struct vm_area_struct *vma, struct page *page, */ static int __init add_system_ram_resources(void) { - struct memblock_region *reg; + phys_addr_t start, end; + u64 i; - for_each_memblock(memory, reg) { + for_each_mem_range(i, &start, &end) { struct resource *res; - unsigned long base = reg->base; - unsigned long size = reg->size; res = kzalloc(sizeof(struct resource), GFP_KERNEL); WARN_ON(!res); if (res) { res->name = "System RAM"; - res->start = base; - res->end = base + size - 1; + res->start = start; + /* + * In memblock, end points to the first byte after + * the range while in resourses, end points to the + * last byte in the range. + */ + res->end = end - 1; res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY; WARN_ON(request_resource(&iomem_resource, res) < 0); } diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c index 6eb4eab79385..079159e97bca 100644 --- a/arch/powerpc/mm/pgtable_32.c +++ b/arch/powerpc/mm/pgtable_32.c @@ -123,11 +123,11 @@ static void __init __mapin_ram_chunk(unsigned long offset, unsigned long top) void __init mapin_ram(void) { - struct memblock_region *reg; + phys_addr_t base, end; + u64 i; - for_each_memblock(memory, reg) { - phys_addr_t base = reg->base; - phys_addr_t top = min(base + reg->size, total_lowmem); + for_each_mem_range(i, &base, &end) { + phys_addr_t top = min(end, total_lowmem); if (base >= top) continue; diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c index 17911b9402ea..1e8c3e24e0c4 100644 --- a/arch/riscv/mm/init.c +++ b/arch/riscv/mm/init.c @@ -145,21 +145,21 @@ static phys_addr_t dtb_early_pa __initdata; void __init setup_bootmem(void) { - struct memblock_region *reg; phys_addr_t mem_size = 0; phys_addr_t total_mem = 0; - phys_addr_t mem_start, end = 0; + phys_addr_t mem_start, start, end = 0; phys_addr_t vmlinux_end = __pa_symbol(&_end); phys_addr_t vmlinux_start = __pa_symbol(&_start); + u64 i; /* Find the memory region containing the kernel */ - for_each_memblock(memory, reg) { - end = reg->base + reg->size; + for_each_mem_range(i, &start, &end) { + phys_addr_t size = end - start; if (!total_mem) - mem_start = reg->base; - if (reg->base <= vmlinux_start && vmlinux_end <= end) - BUG_ON(reg->size == 0); - total_mem = total_mem + reg->size; + mem_start = start; + if (start <= vmlinux_start && vmlinux_end <= end) + BUG_ON(size == 0); + total_mem = total_mem + size; } /* @@ -455,7 +455,7 @@ static void __init setup_vm_final(void) { uintptr_t va, map_size; phys_addr_t pa, start, end; - struct memblock_region *reg; + u64 i; /* Set mmu_enabled flag */ mmu_enabled = true; @@ -466,14 +466,9 @@ static void __init setup_vm_final(void) PGDIR_SIZE, PAGE_TABLE); /* Map all memory banks */ - for_each_memblock(memory, reg) { - start = reg->base; - end = start + reg->size; - + for_each_mem_range(i, &start, &end) { if (start >= end) break; - if (memblock_is_nomap(reg)) - continue; if (start <= __pa(PAGE_OFFSET) && __pa(PAGE_OFFSET) < end) start = __pa(PAGE_OFFSET); diff --git a/arch/riscv/mm/kasan_init.c b/arch/riscv/mm/kasan_init.c index 87b4ab3d3c77..12ddd1f6bf70 100644 --- a/arch/riscv/mm/kasan_init.c +++ b/arch/riscv/mm/kasan_init.c @@ -85,16 +85,16 @@ static void __init populate(void *start, void *end) void __init kasan_init(void) { - struct memblock_region *reg; - unsigned long i; + phys_addr_t _start, _end; + u64 i; kasan_populate_early_shadow((void *)KASAN_SHADOW_START, (void *)kasan_mem_to_shadow((void *) VMALLOC_END)); - for_each_memblock(memory, reg) { - void *start = (void *)__va(reg->base); - void *end = (void *)__va(reg->base + reg->size); + for_each_mem_range(i, &_start, &_end) { + void *start = (void *)_start; + void *end = (void *)_end; if (start >= end) break; diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c index 115c92839af5..d44e522c569b 100644 --- a/arch/s390/kernel/setup.c +++ b/arch/s390/kernel/setup.c @@ -484,8 +484,9 @@ static struct resource __initdata *standard_resources[] = { static void __init setup_resources(void) { struct resource *res, *std_res, *sub_res; - struct memblock_region *reg; + phys_addr_t start, end; int j; + u64 i; code_resource.start = (unsigned long) _text; code_resource.end = (unsigned long) _etext - 1; @@ -494,7 +495,7 @@ static void __init setup_resources(void) bss_resource.start = (unsigned long) __bss_start; bss_resource.end = (unsigned long) __bss_stop - 1; - for_each_memblock(memory, reg) { + for_each_mem_range(i, &start, &end) { res = memblock_alloc(sizeof(*res), 8); if (!res) panic("%s: Failed to allocate %zu bytes align=0x%x\n", @@ -502,8 +503,13 @@ static void __init setup_resources(void) res->flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM; res->name = "System RAM"; - res->start = reg->base; - res->end = reg->base + reg->size - 1; + res->start = start; + /* + * In memblock, end points to the first byte after the + * range while in resourses, end points to the last byte in + * the range. + */ + res->end = end - 1; request_resource(&iomem_resource, res); for (j = 0; j < ARRAY_SIZE(standard_resources); j++) { @@ -819,14 +825,15 @@ static void __init reserve_kernel(void) static void __init setup_memory(void) { - struct memblock_region *reg; + phys_addr_t start, end; + u64 i; /* * Init storage key for present memory */ - for_each_memblock(memory, reg) { - storage_key_init_range(reg->base, reg->base + reg->size); - } + for_each_mem_range(i, &start, &end) + storage_key_init_range(start, end); + psw_set_key(PAGE_DEFAULT_KEY); /* Only cosmetics */ diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c index eddf71c22875..b239f2ba93b0 100644 --- a/arch/s390/mm/vmem.c +++ b/arch/s390/mm/vmem.c @@ -555,10 +555,11 @@ int vmem_add_mapping(unsigned long start, unsigned long size) */ void __init vmem_map_init(void) { - struct memblock_region *reg; + phys_addr_t base, end; + u64 i; - for_each_memblock(memory, reg) - vmem_add_range(reg->base, reg->size); + for_each_mem_range(i, &base, &end) + vmem_add_range(base, end - base); __set_memory((unsigned long)_stext, (unsigned long)(_etext - _stext) >> PAGE_SHIFT, SET_MEMORY_RO | SET_MEMORY_X); diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c index fad6d3129904..96edf64d4fb3 100644 --- a/arch/sparc/mm/init_64.c +++ b/arch/sparc/mm/init_64.c @@ -1192,18 +1192,14 @@ int of_node_to_nid(struct device_node *dp) static void __init add_node_ranges(void) { - struct memblock_region *reg; + phys_addr_t start, end; unsigned long prev_max; + u64 i; memblock_resized: prev_max = memblock.memory.max; - for_each_memblock(memory, reg) { - unsigned long size = reg->size; - unsigned long start, end; - - start = reg->base; - end = start + size; + for_each_mem_range(i, &start, &end) { while (start < end) { unsigned long this_end; int nid; @@ -1211,7 +1207,7 @@ memblock_resized: this_end = memblock_nid_range(start, end, &nid); numadbg("Setting memblock NUMA node nid[%d] " - "start[%lx] end[%lx]\n", + "start[%llx] end[%lx]\n", nid, start, this_end); memblock_set_node(start, this_end - start, diff --git a/drivers/bus/mvebu-mbus.c b/drivers/bus/mvebu-mbus.c index 5b2a11a88951..2519ceede64b 100644 --- a/drivers/bus/mvebu-mbus.c +++ b/drivers/bus/mvebu-mbus.c @@ -610,23 +610,23 @@ static unsigned int armada_xp_mbus_win_remap_offset(int win) static void __init mvebu_mbus_find_bridge_hole(uint64_t *start, uint64_t *end) { - struct memblock_region *r; - uint64_t s = 0; + phys_addr_t reg_start, reg_end; + uint64_t i, s = 0; - for_each_memblock(memory, r) { + for_each_mem_range(i, ®_start, ®_end) { /* * This part of the memory is above 4 GB, so we don't * care for the MBus bridge hole. */ - if (r->base >= 0x100000000ULL) + if (reg_start >= 0x100000000ULL) continue; /* * The MBus bridge hole is at the end of the RAM under * the 4 GB limit. */ - if (r->base + r->size > s) - s = r->base + r->size; + if (reg_end > s) + s = reg_end; } *start = s; From 3c45ee6dc7a1384de747e8afaa80ffb4b08be39f Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 13 Oct 2020 16:58:12 -0700 Subject: [PATCH 174/181] x86/setup: simplify initrd relocation and reservation Currently, initrd image is reserved very early during setup and then it might be relocated and re-reserved after the initial physical memory mapping is created. The "late" reservation of memblock verifies that mapped memory size exceeds the size of initrd, then checks whether the relocation required and, if yes, relocates inirtd to a new memory allocated from memblock and frees the old location. The check for memory size is excessive as memblock allocation will anyway fail if there is not enough memory. Besides, there is no point to allocate memory from memblock using memblock_find_in_range() + memblock_reserve() when there exists memblock_phys_alloc_range() with required functionality. Remove the redundant check and simplify memblock allocation. Signed-off-by: Mike Rapoport Signed-off-by: Andrew Morton Reviewed-by: Baoquan He Acked-by: Ingo Molnar Cc: Andy Lutomirski Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christoph Hellwig Cc: Daniel Axtens Cc: Dave Hansen Cc: Emil Renner Berthing Cc: Hari Bathini Cc: Ingo Molnar Cc: Jonathan Cameron Cc: Marek Szyprowski Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Miguel Ojeda Cc: Palmer Dabbelt Cc: Paul Mackerras Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Russell King Cc: Stafford Horne Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Will Deacon Cc: Yoshinori Sato Link: https://lkml.kernel.org/r/20200818151634.14343-14-rppt@kernel.org Signed-off-by: Linus Torvalds --- arch/x86/kernel/setup.c | 16 +++------------- 1 file changed, 3 insertions(+), 13 deletions(-) diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index fa16b906ea3f..c7a2ccee3f5f 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -264,16 +264,12 @@ static void __init relocate_initrd(void) u64 area_size = PAGE_ALIGN(ramdisk_size); /* We need to move the initrd down into directly mapped mem */ - relocated_ramdisk = memblock_find_in_range(0, PFN_PHYS(max_pfn_mapped), - area_size, PAGE_SIZE); - + relocated_ramdisk = memblock_phys_alloc_range(area_size, PAGE_SIZE, 0, + PFN_PHYS(max_pfn_mapped)); if (!relocated_ramdisk) panic("Cannot find place for new RAMDISK of size %lld\n", ramdisk_size); - /* Note: this includes all the mem currently occupied by - the initrd, we rely on that fact to keep the data intact. */ - memblock_reserve(relocated_ramdisk, area_size); initrd_start = relocated_ramdisk + PAGE_OFFSET; initrd_end = initrd_start + ramdisk_size; printk(KERN_INFO "Allocated new RAMDISK: [mem %#010llx-%#010llx]\n", @@ -300,13 +296,13 @@ static void __init early_reserve_initrd(void) memblock_reserve(ramdisk_image, ramdisk_end - ramdisk_image); } + static void __init reserve_initrd(void) { /* Assume only end is not page aligned */ u64 ramdisk_image = get_ramdisk_image(); u64 ramdisk_size = get_ramdisk_size(); u64 ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size); - u64 mapped_size; if (!boot_params.hdr.type_of_loader || !ramdisk_image || !ramdisk_size) @@ -314,12 +310,6 @@ static void __init reserve_initrd(void) initrd_start = 0; - mapped_size = memblock_mem_size(max_pfn_mapped); - if (ramdisk_size >= (mapped_size>>1)) - panic("initrd too large to handle, " - "disabling initrd (%lld needed, %lld available)\n", - ramdisk_size, mapped_size>>1); - printk(KERN_INFO "RAMDISK: [mem %#010llx-%#010llx]\n", ramdisk_image, ramdisk_end - 1); From 6120cdc01ef63e92f1b33547af87382364cd1b38 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 13 Oct 2020 16:58:16 -0700 Subject: [PATCH 175/181] x86/setup: simplify reserve_crashkernel() * Replace magic numbers with defines * Replace memblock_find_in_range() + memblock_reserve() with memblock_phys_alloc_range() * Stop checking for low memory size in reserve_crashkernel_low(). The allocation from limited range will anyway fail if there is no enough memory, so there is no need for extra traversal of memblock.memory Signed-off-by: Mike Rapoport Signed-off-by: Andrew Morton Reviewed-by: Baoquan He Acked-by: Ingo Molnar Cc: Andy Lutomirski Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christoph Hellwig Cc: Daniel Axtens Cc: Dave Hansen Cc: Emil Renner Berthing Cc: Hari Bathini Cc: Ingo Molnar Cc: Jonathan Cameron Cc: Marek Szyprowski Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Miguel Ojeda Cc: Palmer Dabbelt Cc: Paul Mackerras Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Russell King Cc: Stafford Horne Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Will Deacon Cc: Yoshinori Sato Link: https://lkml.kernel.org/r/20200818151634.14343-15-rppt@kernel.org Signed-off-by: Linus Torvalds --- arch/x86/kernel/setup.c | 40 ++++++++++++++-------------------------- 1 file changed, 14 insertions(+), 26 deletions(-) diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index c7a2ccee3f5f..210e878c4c0d 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -421,13 +421,13 @@ static int __init reserve_crashkernel_low(void) { #ifdef CONFIG_X86_64 unsigned long long base, low_base = 0, low_size = 0; - unsigned long total_low_mem; + unsigned long low_mem_limit; int ret; - total_low_mem = memblock_mem_size(1UL << (32 - PAGE_SHIFT)); + low_mem_limit = min(memblock_phys_mem_size(), CRASH_ADDR_LOW_MAX); /* crashkernel=Y,low */ - ret = parse_crashkernel_low(boot_command_line, total_low_mem, &low_size, &base); + ret = parse_crashkernel_low(boot_command_line, low_mem_limit, &low_size, &base); if (ret) { /* * two parts from kernel/dma/swiotlb.c: @@ -445,23 +445,17 @@ static int __init reserve_crashkernel_low(void) return 0; } - low_base = memblock_find_in_range(0, 1ULL << 32, low_size, CRASH_ALIGN); + low_base = memblock_phys_alloc_range(low_size, CRASH_ALIGN, 0, CRASH_ADDR_LOW_MAX); if (!low_base) { pr_err("Cannot reserve %ldMB crashkernel low memory, please try smaller size.\n", (unsigned long)(low_size >> 20)); return -ENOMEM; } - ret = memblock_reserve(low_base, low_size); - if (ret) { - pr_err("%s: Error reserving crashkernel low memblock.\n", __func__); - return ret; - } - - pr_info("Reserving %ldMB of low memory at %ldMB for crashkernel (System low RAM: %ldMB)\n", + pr_info("Reserving %ldMB of low memory at %ldMB for crashkernel (low RAM limit: %ldMB)\n", (unsigned long)(low_size >> 20), (unsigned long)(low_base >> 20), - (unsigned long)(total_low_mem >> 20)); + (unsigned long)(low_mem_limit >> 20)); crashk_low_res.start = low_base; crashk_low_res.end = low_base + low_size - 1; @@ -505,13 +499,13 @@ static void __init reserve_crashkernel(void) * unless "crashkernel=size[KMG],high" is specified. */ if (!high) - crash_base = memblock_find_in_range(CRASH_ALIGN, - CRASH_ADDR_LOW_MAX, - crash_size, CRASH_ALIGN); + crash_base = memblock_phys_alloc_range(crash_size, + CRASH_ALIGN, CRASH_ALIGN, + CRASH_ADDR_LOW_MAX); if (!crash_base) - crash_base = memblock_find_in_range(CRASH_ALIGN, - CRASH_ADDR_HIGH_MAX, - crash_size, CRASH_ALIGN); + crash_base = memblock_phys_alloc_range(crash_size, + CRASH_ALIGN, CRASH_ALIGN, + CRASH_ADDR_HIGH_MAX); if (!crash_base) { pr_info("crashkernel reservation failed - No suitable area found.\n"); return; @@ -519,19 +513,13 @@ static void __init reserve_crashkernel(void) } else { unsigned long long start; - start = memblock_find_in_range(crash_base, - crash_base + crash_size, - crash_size, 1 << 20); + start = memblock_phys_alloc_range(crash_size, SZ_1M, crash_base, + crash_base + crash_size); if (start != crash_base) { pr_info("crashkernel reservation failed - memory is in use.\n"); return; } } - ret = memblock_reserve(crash_base, crash_size); - if (ret) { - pr_err("%s: Error reserving crashkernel memblock.\n", __func__); - return; - } if (crash_base >= (1ULL << 32) && reserve_crashkernel_low()) { memblock_free(crash_base, crash_size); From 5bd0960b85d7e3e4a2dc5bbf1c87d0b505115d71 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 13 Oct 2020 16:58:20 -0700 Subject: [PATCH 176/181] memblock: remove unused memblock_mem_size() The only user of memblock_mem_size() was x86 setup code, it is gone now and memblock_mem_size() funciton can be removed. Signed-off-by: Mike Rapoport Signed-off-by: Andrew Morton Reviewed-by: Baoquan He Cc: Andy Lutomirski Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christoph Hellwig Cc: Daniel Axtens Cc: Dave Hansen Cc: Emil Renner Berthing Cc: Hari Bathini Cc: Ingo Molnar Cc: Ingo Molnar Cc: Jonathan Cameron Cc: Marek Szyprowski Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Miguel Ojeda Cc: Palmer Dabbelt Cc: Paul Mackerras Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Russell King Cc: Stafford Horne Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Will Deacon Cc: Yoshinori Sato Link: https://lkml.kernel.org/r/20200818151634.14343-16-rppt@kernel.org Signed-off-by: Linus Torvalds --- include/linux/memblock.h | 1 - mm/memblock.c | 15 --------------- 2 files changed, 16 deletions(-) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index 27c3b84d1615..15ed119701c1 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -481,7 +481,6 @@ static inline bool memblock_bottom_up(void) phys_addr_t memblock_phys_mem_size(void); phys_addr_t memblock_reserved_size(void); -phys_addr_t memblock_mem_size(unsigned long limit_pfn); phys_addr_t memblock_start_of_DRAM(void); phys_addr_t memblock_end_of_DRAM(void); void memblock_enforce_memory_limit(phys_addr_t memory_limit); diff --git a/mm/memblock.c b/mm/memblock.c index d66d805dec96..4de76cf48434 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1660,21 +1660,6 @@ phys_addr_t __init_memblock memblock_reserved_size(void) return memblock.reserved.total_size; } -phys_addr_t __init memblock_mem_size(unsigned long limit_pfn) -{ - unsigned long pages = 0; - unsigned long start_pfn, end_pfn; - int i; - - for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) { - start_pfn = min_t(unsigned long, start_pfn, limit_pfn); - end_pfn = min_t(unsigned long, end_pfn, limit_pfn); - pages += end_pfn - start_pfn; - } - - return PFN_PHYS(pages); -} - /* lowest address */ phys_addr_t __init_memblock memblock_start_of_DRAM(void) { From 9f3d5eaa3c60f95d9fff1ce4eea7553a3dc04906 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 13 Oct 2020 16:58:25 -0700 Subject: [PATCH 177/181] memblock: implement for_each_reserved_mem_region() using __next_mem_region() Iteration over memblock.reserved with for_each_reserved_mem_region() used __next_reserved_mem_region() that implemented a subset of __next_mem_region(). Use __for_each_mem_range() and, essentially, __next_mem_region() with appropriate parameters to reduce code duplication. While on it, rename for_each_reserved_mem_region() to for_each_reserved_mem_range() for consistency. Signed-off-by: Mike Rapoport Signed-off-by: Andrew Morton Acked-by: Miguel Ojeda [.clang-format] Cc: Andy Lutomirski Cc: Baoquan He Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christoph Hellwig Cc: Daniel Axtens Cc: Dave Hansen Cc: Emil Renner Berthing Cc: Hari Bathini Cc: Ingo Molnar Cc: Ingo Molnar Cc: Jonathan Cameron Cc: Marek Szyprowski Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Palmer Dabbelt Cc: Paul Mackerras Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Russell King Cc: Stafford Horne Cc: Thomas Bogendoerfer Cc: Thomas Gleixner Cc: Will Deacon Cc: Yoshinori Sato Link: https://lkml.kernel.org/r/20200818151634.14343-17-rppt@kernel.org Signed-off-by: Linus Torvalds --- .clang-format | 2 +- arch/arm64/kernel/setup.c | 2 +- drivers/irqchip/irq-gic-v3-its.c | 2 +- include/linux/memblock.h | 12 +++---- mm/memblock.c | 56 ++++++++++++-------------------- 5 files changed, 27 insertions(+), 47 deletions(-) diff --git a/.clang-format b/.clang-format index 0366a3d2e561..8806bb21b6c2 100644 --- a/.clang-format +++ b/.clang-format @@ -273,7 +273,7 @@ ForEachMacros: - 'for_each_registered_fb' - 'for_each_requested_gpio' - 'for_each_requested_gpio_in_range' - - 'for_each_reserved_mem_region' + - 'for_each_reserved_mem_range' - 'for_each_rtd_codec_dais' - 'for_each_rtd_codec_dais_rollback' - 'for_each_rtd_components' diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c index 53acbeca4f57..16aae43badfd 100644 --- a/arch/arm64/kernel/setup.c +++ b/arch/arm64/kernel/setup.c @@ -257,7 +257,7 @@ static int __init reserve_memblock_reserved_regions(void) if (!memblock_is_region_reserved(mem->start, mem_size)) continue; - for_each_reserved_mem_region(j, &r_start, &r_end) { + for_each_reserved_mem_range(j, &r_start, &r_end) { resource_size_t start, end; start = max(PFN_PHYS(PFN_DOWN(r_start)), mem->start); diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c index 0418071a3724..46d885575601 100644 --- a/drivers/irqchip/irq-gic-v3-its.c +++ b/drivers/irqchip/irq-gic-v3-its.c @@ -2198,7 +2198,7 @@ static bool gic_check_reserved_range(phys_addr_t addr, unsigned long size) addr_end = addr + size - 1; - for_each_reserved_mem_region(i, &start, &end) { + for_each_reserved_mem_range(i, &start, &end) { if (addr >= start && addr_end <= end) return true; } diff --git a/include/linux/memblock.h b/include/linux/memblock.h index 15ed119701c1..354078713cd1 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -132,9 +132,6 @@ void __next_mem_range_rev(u64 *idx, int nid, enum memblock_flags flags, struct memblock_type *type_b, phys_addr_t *out_start, phys_addr_t *out_end, int *out_nid); -void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start, - phys_addr_t *out_end); - void __memblock_free_late(phys_addr_t base, phys_addr_t size); #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP @@ -224,7 +221,7 @@ static inline void __next_physmem_range(u64 *idx, struct memblock_type *type, MEMBLOCK_NONE, p_start, p_end, NULL) /** - * for_each_reserved_mem_region - iterate over all reserved memblock areas + * for_each_reserved_mem_range - iterate over all reserved memblock areas * @i: u64 used as loop variable * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL @@ -232,10 +229,9 @@ static inline void __next_physmem_range(u64 *idx, struct memblock_type *type, * Walks over reserved areas of memblock. Available as soon as memblock * is initialized. */ -#define for_each_reserved_mem_region(i, p_start, p_end) \ - for (i = 0UL, __next_reserved_mem_region(&i, p_start, p_end); \ - i != (u64)ULLONG_MAX; \ - __next_reserved_mem_region(&i, p_start, p_end)) +#define for_each_reserved_mem_range(i, p_start, p_end) \ + __for_each_mem_range(i, &memblock.reserved, NULL, NUMA_NO_NODE, \ + MEMBLOCK_NONE, p_start, p_end, NULL) static inline bool memblock_is_hotpluggable(struct memblock_region *m) { diff --git a/mm/memblock.c b/mm/memblock.c index 4de76cf48434..a09cc4f057f0 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -132,6 +132,14 @@ struct memblock_type physmem = { }; #endif +/* + * keep a pointer to &memblock.memory in the text section to use it in + * __next_mem_range() and its helpers. + * For architectures that do not keep memblock data after init, this + * pointer will be reset to NULL at memblock_discard() + */ +static __refdata struct memblock_type *memblock_memory = &memblock.memory; + #define for_each_memblock_type(i, memblock_type, rgn) \ for (i = 0, rgn = &memblock_type->regions[0]; \ i < memblock_type->cnt; \ @@ -402,6 +410,8 @@ void __init memblock_discard(void) memblock.memory.max); __memblock_free_late(addr, size); } + + memblock_memory = NULL; } #endif @@ -952,42 +962,16 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size) return memblock_setclr_flag(base, size, 0, MEMBLOCK_NOMAP); } -/** - * __next_reserved_mem_region - next function for for_each_reserved_region() - * @idx: pointer to u64 loop variable - * @out_start: ptr to phys_addr_t for start address of the region, can be %NULL - * @out_end: ptr to phys_addr_t for end address of the region, can be %NULL - * - * Iterate over all reserved memory regions. - */ -void __init_memblock __next_reserved_mem_region(u64 *idx, - phys_addr_t *out_start, - phys_addr_t *out_end) -{ - struct memblock_type *type = &memblock.reserved; - - if (*idx < type->cnt) { - struct memblock_region *r = &type->regions[*idx]; - phys_addr_t base = r->base; - phys_addr_t size = r->size; - - if (out_start) - *out_start = base; - if (out_end) - *out_end = base + size - 1; - - *idx += 1; - return; - } - - /* signal end of iteration */ - *idx = ULLONG_MAX; -} - -static bool should_skip_region(struct memblock_region *m, int nid, int flags) +static bool should_skip_region(struct memblock_type *type, + struct memblock_region *m, + int nid, int flags) { int m_nid = memblock_get_region_node(m); + /* we never skip regions when iterating memblock.reserved or physmem */ + if (type != memblock_memory) + return false; + /* only memory regions are associated with nodes, check it */ if (nid != NUMA_NO_NODE && nid != m_nid) return true; @@ -1052,7 +1036,7 @@ void __next_mem_range(u64 *idx, int nid, enum memblock_flags flags, phys_addr_t m_end = m->base + m->size; int m_nid = memblock_get_region_node(m); - if (should_skip_region(m, nid, flags)) + if (should_skip_region(type_a, m, nid, flags)) continue; if (!type_b) { @@ -1156,7 +1140,7 @@ void __init_memblock __next_mem_range_rev(u64 *idx, int nid, phys_addr_t m_end = m->base + m->size; int m_nid = memblock_get_region_node(m); - if (should_skip_region(m, nid, flags)) + if (should_skip_region(type_a, m, nid, flags)) continue; if (!type_b) { @@ -1981,7 +1965,7 @@ static unsigned long __init free_low_memory_core_early(void) memblock_clear_hotplug(0, -1); - for_each_reserved_mem_region(i, &start, &end) + for_each_reserved_mem_range(i, &start, &end) reserve_bootmem_region(start, end); /* From cc6de1680538633e4ef9540b2313fa2481a7c641 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 13 Oct 2020 16:58:30 -0700 Subject: [PATCH 178/181] memblock: use separate iterators for memory and reserved regions for_each_memblock() is used to iterate over memblock.memory in a few places that use data from memblock_region rather than the memory ranges. Introduce separate for_each_mem_region() and for_each_reserved_mem_region() to improve encapsulation of memblock internals from its users. Signed-off-by: Mike Rapoport Signed-off-by: Andrew Morton Reviewed-by: Baoquan He Acked-by: Ingo Molnar [x86] Acked-by: Thomas Bogendoerfer [MIPS] Acked-by: Miguel Ojeda [.clang-format] Cc: Andy Lutomirski Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christoph Hellwig Cc: Daniel Axtens Cc: Dave Hansen Cc: Emil Renner Berthing Cc: Hari Bathini Cc: Ingo Molnar Cc: Jonathan Cameron Cc: Marek Szyprowski Cc: Max Filippov Cc: Michael Ellerman Cc: Michal Simek Cc: Palmer Dabbelt Cc: Paul Mackerras Cc: Paul Walmsley Cc: Peter Zijlstra Cc: Russell King Cc: Stafford Horne Cc: Thomas Gleixner Cc: Will Deacon Cc: Yoshinori Sato Link: https://lkml.kernel.org/r/20200818151634.14343-18-rppt@kernel.org Signed-off-by: Linus Torvalds --- .clang-format | 3 ++- arch/arm64/kernel/setup.c | 2 +- arch/arm64/mm/numa.c | 2 +- arch/mips/netlogic/xlp/setup.c | 2 +- arch/riscv/mm/init.c | 2 +- arch/x86/mm/numa.c | 2 +- include/linux/memblock.h | 19 ++++++++++++++++--- mm/memblock.c | 4 ++-- mm/page_alloc.c | 8 ++++---- 9 files changed, 29 insertions(+), 15 deletions(-) diff --git a/.clang-format b/.clang-format index 8806bb21b6c2..6cbd6ee51610 100644 --- a/.clang-format +++ b/.clang-format @@ -203,7 +203,7 @@ ForEachMacros: - 'for_each_matching_node' - 'for_each_matching_node_and_match' - 'for_each_member' - - 'for_each_memblock' + - 'for_each_mem_region' - 'for_each_memblock_type' - 'for_each_memcg_cache_index' - 'for_each_mem_pfn_range' @@ -274,6 +274,7 @@ ForEachMacros: - 'for_each_requested_gpio' - 'for_each_requested_gpio_in_range' - 'for_each_reserved_mem_range' + - 'for_each_reserved_mem_region' - 'for_each_rtd_codec_dais' - 'for_each_rtd_codec_dais_rollback' - 'for_each_rtd_components' diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c index 16aae43badfd..133257ffd859 100644 --- a/arch/arm64/kernel/setup.c +++ b/arch/arm64/kernel/setup.c @@ -217,7 +217,7 @@ static void __init request_standard_resources(void) if (!standard_resources) panic("%s: Failed to allocate %zu bytes\n", __func__, res_size); - for_each_memblock(memory, region) { + for_each_mem_region(region) { res = &standard_resources[i++]; if (memblock_is_nomap(region)) { res->name = "reserved"; diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c index 2040b97e92d0..a8303bc6b62a 100644 --- a/arch/arm64/mm/numa.c +++ b/arch/arm64/mm/numa.c @@ -354,7 +354,7 @@ static int __init numa_register_nodes(void) struct memblock_region *mblk; /* Check that valid nid is set to memblks */ - for_each_memblock(memory, mblk) { + for_each_mem_region(mblk) { int mblk_nid = memblock_get_region_node(mblk); if (mblk_nid == NUMA_NO_NODE || mblk_nid >= MAX_NUMNODES) { diff --git a/arch/mips/netlogic/xlp/setup.c b/arch/mips/netlogic/xlp/setup.c index 1a0fc5b62ba4..6e3102bcd2f1 100644 --- a/arch/mips/netlogic/xlp/setup.c +++ b/arch/mips/netlogic/xlp/setup.c @@ -70,7 +70,7 @@ static void nlm_fixup_mem(void) const int pref_backup = 512; struct memblock_region *mem; - for_each_memblock(memory, mem) { + for_each_mem_region(mem) { memblock_remove(mem->base + mem->size - pref_backup, pref_backup); } diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c index 1e8c3e24e0c4..1dc89303b679 100644 --- a/arch/riscv/mm/init.c +++ b/arch/riscv/mm/init.c @@ -531,7 +531,7 @@ static void __init resource_init(void) { struct memblock_region *region; - for_each_memblock(memory, region) { + for_each_mem_region(region) { struct resource *res; res = memblock_alloc(sizeof(struct resource), SMP_CACHE_BYTES); diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index c62e274d52d0..9df94e0aaee1 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -514,7 +514,7 @@ static void __init numa_clear_kernel_node_hotplug(void) * memory ranges, because quirks such as trim_snb_memory() * reserve specific pages for Sandy Bridge graphics. ] */ - for_each_memblock(reserved, mb_region) { + for_each_reserved_mem_region(mb_region) { int nid = memblock_get_region_node(mb_region); if (nid != MAX_NUMNODES) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index 354078713cd1..ef131255cedc 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -553,9 +553,22 @@ static inline unsigned long memblock_region_reserved_end_pfn(const struct memblo return PFN_UP(reg->base + reg->size); } -#define for_each_memblock(memblock_type, region) \ - for (region = memblock.memblock_type.regions; \ - region < (memblock.memblock_type.regions + memblock.memblock_type.cnt); \ +/** + * for_each_mem_region - itereate over memory regions + * @region: loop variable + */ +#define for_each_mem_region(region) \ + for (region = memblock.memory.regions; \ + region < (memblock.memory.regions + memblock.memory.cnt); \ + region++) + +/** + * for_each_reserved_mem_region - itereate over reserved memory regions + * @region: loop variable + */ +#define for_each_reserved_mem_region(region) \ + for (region = memblock.reserved.regions; \ + region < (memblock.reserved.regions + memblock.reserved.cnt); \ region++) extern void *alloc_large_system_hash(const char *tablename, diff --git a/mm/memblock.c b/mm/memblock.c index a09cc4f057f0..165f40a8a254 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1667,7 +1667,7 @@ static phys_addr_t __init_memblock __find_max_addr(phys_addr_t limit) * the memory memblock regions, if the @limit exceeds the total size * of those regions, max_addr will keep original value PHYS_ADDR_MAX */ - for_each_memblock(memory, r) { + for_each_mem_region(r) { if (limit <= r->size) { max_addr = r->base + limit; break; @@ -1837,7 +1837,7 @@ void __init_memblock memblock_trim_memory(phys_addr_t align) phys_addr_t start, end, orig_start, orig_end; struct memblock_region *r; - for_each_memblock(memory, r) { + for_each_mem_region(r) { orig_start = r->base; orig_end = r->base + r->size; start = round_up(orig_start, align); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 34ac7127d1e6..05fe1ddb033c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5961,7 +5961,7 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn) if (mirrored_kernelcore && zone == ZONE_MOVABLE) { if (!r || *pfn >= memblock_region_memory_end_pfn(r)) { - for_each_memblock(memory, r) { + for_each_mem_region(r) { if (*pfn < memblock_region_memory_end_pfn(r)) break; } @@ -6546,7 +6546,7 @@ static unsigned long __init zone_absent_pages_in_node(int nid, unsigned long start_pfn, end_pfn; struct memblock_region *r; - for_each_memblock(memory, r) { + for_each_mem_region(r) { start_pfn = clamp(memblock_region_memory_base_pfn(r), zone_start_pfn, zone_end_pfn); end_pfn = clamp(memblock_region_memory_end_pfn(r), @@ -7140,7 +7140,7 @@ static void __init find_zone_movable_pfns_for_nodes(void) * options. */ if (movable_node_is_enabled()) { - for_each_memblock(memory, r) { + for_each_mem_region(r) { if (!memblock_is_hotpluggable(r)) continue; @@ -7161,7 +7161,7 @@ static void __init find_zone_movable_pfns_for_nodes(void) if (mirrored_kernelcore) { bool mem_below_4gb_not_mirrored = false; - for_each_memblock(memory, r) { + for_each_mem_region(r) { if (memblock_is_mirror(r)) continue; From 67197a4f28d28d0b073ab0427b03cb2ee5382578 Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Tue, 13 Oct 2020 16:58:35 -0700 Subject: [PATCH 179/181] mm, oom_adj: don't loop through tasks in __set_oom_adj when not necessary Currently __set_oom_adj loops through all processes in the system to keep oom_score_adj and oom_score_adj_min in sync between processes sharing their mm. This is done for any task with more that one mm_users, which includes processes with multiple threads (sharing mm and signals). However for such processes the loop is unnecessary because their signal structure is shared as well. Android updates oom_score_adj whenever a tasks changes its role (background/foreground/...) or binds to/unbinds from a service, making it more/less important. Such operation can happen frequently. We noticed that updates to oom_score_adj became more expensive and after further investigation found out that the patch mentioned in "Fixes" introduced a regression. Using Pixel 4 with a typical Android workload, write time to oom_score_adj increased from ~3.57us to ~362us. Moreover this regression linearly depends on the number of multi-threaded processes running on the system. Mark the mm with a new MMF_MULTIPROCESS flag bit when task is created with (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK). Change __set_oom_adj to use MMF_MULTIPROCESS instead of mm_users to decide whether oom_score_adj update should be synchronized between multiple processes. To prevent races between clone() and __set_oom_adj(), when oom_score_adj of the process being cloned might be modified from userspace, we use oom_adj_mutex. Its scope is changed to global. The combination of (CLONE_VM && !CLONE_THREAD) is rarely used except for the case of vfork(). To prevent performance regressions of vfork(), we skip taking oom_adj_mutex and setting MMF_MULTIPROCESS when CLONE_VFORK is specified. Clearing the MMF_MULTIPROCESS flag (when the last process sharing the mm exits) is left out of this patch to keep it simple and because it is believed that this threading model is rare. Should there ever be a need for optimizing that case as well, it can be done by hooking into the exit path, likely following the mm_update_next_owner pattern. With the combination of (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK) being quite rare, the regression is gone after the change is applied. [surenb@google.com: v3] Link: https://lkml.kernel.org/r/20200902012558.2335613-1-surenb@google.com Fixes: 44a70adec910 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj") Reported-by: Tim Murray Suggested-by: Michal Hocko Signed-off-by: Suren Baghdasaryan Signed-off-by: Andrew Morton Acked-by: Christian Brauner Acked-by: Michal Hocko Acked-by: Oleg Nesterov Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Eugene Syromiatnikov Cc: Christian Kellner Cc: Adrian Reber Cc: Shakeel Butt Cc: Aleksa Sarai Cc: Alexey Dobriyan Cc: "Eric W. Biederman" Cc: Alexey Gladkov Cc: Michel Lespinasse Cc: Daniel Jordan Cc: Andrei Vagin Cc: Bernd Edlinger Cc: John Johansen Cc: Yafang Shao Link: https://lkml.kernel.org/r/20200824153036.3201505-1-surenb@google.com Debugged-by: Minchan Kim Signed-off-by: Linus Torvalds --- fs/proc/base.c | 3 +-- include/linux/oom.h | 1 + include/linux/sched/coredump.h | 1 + kernel/fork.c | 21 +++++++++++++++++++++ mm/oom_kill.c | 2 ++ 5 files changed, 26 insertions(+), 2 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index 617db4e0faa0..aa69c35d904c 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1055,7 +1055,6 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count, static int __set_oom_adj(struct file *file, int oom_adj, bool legacy) { - static DEFINE_MUTEX(oom_adj_mutex); struct mm_struct *mm = NULL; struct task_struct *task; int err = 0; @@ -1095,7 +1094,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy) struct task_struct *p = find_lock_task_mm(task); if (p) { - if (atomic_read(&p->mm->mm_users) > 1) { + if (test_bit(MMF_MULTIPROCESS, &p->mm->flags)) { mm = p->mm; mmgrab(mm); } diff --git a/include/linux/oom.h b/include/linux/oom.h index f022f581ac29..2db9a1432511 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -55,6 +55,7 @@ struct oom_control { }; extern struct mutex oom_lock; +extern struct mutex oom_adj_mutex; static inline void set_current_oom_origin(void) { diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h index ecdc6542070f..dfd82eab2902 100644 --- a/include/linux/sched/coredump.h +++ b/include/linux/sched/coredump.h @@ -72,6 +72,7 @@ static inline int get_dumpable(struct mm_struct *mm) #define MMF_DISABLE_THP 24 /* disable THP for all VMAs */ #define MMF_OOM_VICTIM 25 /* mm is the oom victim */ #define MMF_OOM_REAP_QUEUED 26 /* mm was queued for oom_reaper */ +#define MMF_MULTIPROCESS 27 /* mm is shared between processes */ #define MMF_DISABLE_THP_MASK (1 << MMF_DISABLE_THP) #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\ diff --git a/kernel/fork.c b/kernel/fork.c index ede26e5a6097..50c90d368117 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1812,6 +1812,25 @@ static __always_inline void delayed_free_task(struct task_struct *tsk) free_task(tsk); } +static void copy_oom_score_adj(u64 clone_flags, struct task_struct *tsk) +{ + /* Skip if kernel thread */ + if (!tsk->mm) + return; + + /* Skip if spawning a thread or using vfork */ + if ((clone_flags & (CLONE_VM | CLONE_THREAD | CLONE_VFORK)) != CLONE_VM) + return; + + /* We need to synchronize with __set_oom_adj */ + mutex_lock(&oom_adj_mutex); + set_bit(MMF_MULTIPROCESS, &tsk->mm->flags); + /* Update the values in case they were changed after copy_signal */ + tsk->signal->oom_score_adj = current->signal->oom_score_adj; + tsk->signal->oom_score_adj_min = current->signal->oom_score_adj_min; + mutex_unlock(&oom_adj_mutex); +} + /* * This creates a new process as a copy of the old one, * but does not actually start it yet. @@ -2288,6 +2307,8 @@ static __latent_entropy struct task_struct *copy_process( trace_task_newtask(p, clone_flags); uprobe_copy_process(p, clone_flags); + copy_oom_score_adj(clone_flags, p); + return p; bad_fork_cancel_cgroup: diff --git a/mm/oom_kill.c b/mm/oom_kill.c index e90f25d6385d..8b84661a6410 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -64,6 +64,8 @@ int sysctl_oom_dump_tasks = 1; * and mark_oom_victim */ DEFINE_MUTEX(oom_lock); +/* Serializes oom_score_adj and oom_score_adj_min updates */ +DEFINE_MUTEX(oom_adj_mutex); static inline bool is_memcg_oom(struct oom_control *oc) { From 4257889124cce4526ebf29329bae8794e97b455a Mon Sep 17 00:00:00 2001 From: Ralph Campbell Date: Tue, 13 Oct 2020 16:58:39 -0700 Subject: [PATCH 180/181] mm/migrate: remove cpages-- in migrate_vma_finalize() The variable struct migrate_vma->cpages is only used in migrate_vma_setup(). There is no need to decrement it in migrate_vma_finalize() since it is never checked. Signed-off-by: Ralph Campbell Signed-off-by: Andrew Morton Cc: Jason Gunthorpe Cc: Jerome Glisse Cc: John Hubbard Cc: Christoph Hellwig Link: http://lkml.kernel.org/r/20200827190735.12752-1-rcampbell@nvidia.com Signed-off-by: Linus Torvalds --- mm/migrate.c | 1 - 1 file changed, 1 deletion(-) diff --git a/mm/migrate.c b/mm/migrate.c index 4de11dfd730b..b0734900e953 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -3077,7 +3077,6 @@ void migrate_vma_finalize(struct migrate_vma *migrate) remove_migration_ptes(page, newpage, false); unlock_page(page); - migrate->cpages--; if (is_zone_device_page(page)) put_page(page); From f1f4f3ab54e9a52c7610c998ff8255f019742e67 Mon Sep 17 00:00:00 2001 From: Ralph Campbell Date: Tue, 13 Oct 2020 16:58:42 -0700 Subject: [PATCH 181/181] mm/migrate: remove obsolete comment about device public Device public memory never had an in tree consumer and was removed in commit 25b2995a35b6 ("mm: remove MEMORY_DEVICE_PUBLIC support"). Delete the obsolete comment. Signed-off-by: Ralph Campbell Signed-off-by: Andrew Morton Reviewed-by: Jason Gunthorpe Cc: Jerome Glisse Cc: John Hubbard Cc: Christoph Hellwig Link: http://lkml.kernel.org/r/20200827190735.12752-2-rcampbell@nvidia.com Signed-off-by: Linus Torvalds --- mm/migrate.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/migrate.c b/mm/migrate.c index b0734900e953..f94d7c7eeddf 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -381,7 +381,7 @@ static int expected_page_refs(struct address_space *mapping, struct page *page) int expected_count = 1; /* - * Device public or private pages have an extra refcount as they are + * Device private pages have an extra refcount as they are * ZONE_DEVICE pages. */ expected_count += is_device_private_page(page);