docs/vm: numa_memory_policy.txt: convert to ReST format

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Mike Rapoport 2018-03-21 21:22:29 +02:00 committed by Jonathan Corbet
parent 16f9f7f924
commit cb5e4376e5

@ -1,5 +1,11 @@
.. _numa_memory_policy:
===================
Linux Memory Policy
===================
What is Linux Memory Policy? What is Linux Memory Policy?
============================
In the Linux kernel, "memory policy" determines from which node the kernel will In the Linux kernel, "memory policy" determines from which node the kernel will
allocate memory in a NUMA system or in an emulated NUMA system. Linux has allocate memory in a NUMA system or in an emulated NUMA system. Linux has
@ -9,35 +15,36 @@ document attempts to describe the concepts and APIs of the 2.6 memory policy
support. support.
Memory policies should not be confused with cpusets Memory policies should not be confused with cpusets
(Documentation/cgroup-v1/cpusets.txt) (``Documentation/cgroup-v1/cpusets.txt``)
which is an administrative mechanism for restricting the nodes from which which is an administrative mechanism for restricting the nodes from which
memory may be allocated by a set of processes. Memory policies are a memory may be allocated by a set of processes. Memory policies are a
programming interface that a NUMA-aware application can take advantage of. When programming interface that a NUMA-aware application can take advantage of. When
both cpusets and policies are applied to a task, the restrictions of the cpuset both cpusets and policies are applied to a task, the restrictions of the cpuset
take priority. See "MEMORY POLICIES AND CPUSETS" below for more details. take priority. See :ref:`Memory Policies and cpusets <mem_pol_and_cpusets>`
below for more details.
MEMORY POLICY CONCEPTS Memory Policy Concepts
======================
Scope of Memory Policies Scope of Memory Policies
------------------------
The Linux kernel supports _scopes_ of memory policy, described here from The Linux kernel supports _scopes_ of memory policy, described here from
most general to most specific: most general to most specific:
System Default Policy: this policy is "hard coded" into the kernel. It System Default Policy
is the policy that governs all page allocations that aren't controlled this policy is "hard coded" into the kernel. It is the policy
by one of the more specific policy scopes discussed below. When the that governs all page allocations that aren't controlled by
system is "up and running", the system default policy will use "local one of the more specific policy scopes discussed below. When
allocation" described below. However, during boot up, the system the system is "up and running", the system default policy will
default policy will be set to interleave allocations across all nodes use "local allocation" described below. However, during boot
with "sufficient" memory, so as not to overload the initial boot node up, the system default policy will be set to interleave
with boot-time allocations. allocations across all nodes with "sufficient" memory, so as
not to overload the initial boot node with boot-time
allocations.
Task/Process Policy: this is an optional, per-task policy. When defined Task/Process Policy
for a specific task, this policy controls all page allocations made by or this is an optional, per-task policy. When defined for a specific task, this policy controls all page allocations made by or on behalf of the task that aren't controlled by a more specific scope. If a task does not define a task policy, then all page allocations that would have been controlled by the task policy "fall back" to the System Default Policy.
on behalf of the task that aren't controlled by a more specific scope.
If a task does not define a task policy, then all page allocations that
would have been controlled by the task policy "fall back" to the System
Default Policy.
The task policy applies to the entire address space of a task. Thus, The task policy applies to the entire address space of a task. Thus,
it is inheritable, and indeed is inherited, across both fork() it is inheritable, and indeed is inherited, across both fork()
@ -58,56 +65,66 @@ most general to most specific:
changes its task policy remain where they were allocated based on changes its task policy remain where they were allocated based on
the policy at the time they were allocated. the policy at the time they were allocated.
VMA Policy: A "VMA" or "Virtual Memory Area" refers to a range of a task's .. _vma_policy:
VMA Policy
A "VMA" or "Virtual Memory Area" refers to a range of a task's
virtual address space. A task may define a specific policy for a range virtual address space. A task may define a specific policy for a range
of its virtual address space. See the MEMORY POLICIES APIS section, of its virtual address space. See the MEMORY POLICIES APIS section,
below, for an overview of the mbind() system call used to set a VMA below, for an overview of the mbind() system call used to set a VMA
policy. policy.
A VMA policy will govern the allocation of pages that back this region of A VMA policy will govern the allocation of pages that back
the address space. Any regions of the task's address space that don't this region of the address space. Any regions of the task's
have an explicit VMA policy will fall back to the task policy, which may address space that don't have an explicit VMA policy will fall
itself fall back to the System Default Policy. back to the task policy, which may itself fall back to the
System Default Policy.
VMA policies have a few complicating details: VMA policies have a few complicating details:
VMA policy applies ONLY to anonymous pages. These include pages * VMA policy applies ONLY to anonymous pages. These include
allocated for anonymous segments, such as the task stack and heap, and pages allocated for anonymous segments, such as the task
any regions of the address space mmap()ed with the MAP_ANONYMOUS flag. stack and heap, and any regions of the address space
If a VMA policy is applied to a file mapping, it will be ignored if mmap()ed with the MAP_ANONYMOUS flag. If a VMA policy is
the mapping used the MAP_SHARED flag. If the file mapping used the applied to a file mapping, it will be ignored if the mapping
MAP_PRIVATE flag, the VMA policy will only be applied when an used the MAP_SHARED flag. If the file mapping used the
anonymous page is allocated on an attempt to write to the mapping-- MAP_PRIVATE flag, the VMA policy will only be applied when
i.e., at Copy-On-Write. an anonymous page is allocated on an attempt to write to the
mapping-- i.e., at Copy-On-Write.
VMA policies are shared between all tasks that share a virtual address * VMA policies are shared between all tasks that share a
space--a.k.a. threads--independent of when the policy is installed; and virtual address space--a.k.a. threads--independent of when
they are inherited across fork(). However, because VMA policies refer the policy is installed; and they are inherited across
to a specific region of a task's address space, and because the address fork(). However, because VMA policies refer to a specific
space is discarded and recreated on exec*(), VMA policies are NOT region of a task's address space, and because the address
inheritable across exec(). Thus, only NUMA-aware applications may space is discarded and recreated on exec*(), VMA policies
use VMA policies. are NOT inheritable across exec(). Thus, only NUMA-aware
applications may use VMA policies.
A task may install a new VMA policy on a sub-range of a previously * A task may install a new VMA policy on a sub-range of a
mmap()ed region. When this happens, Linux splits the existing virtual previously mmap()ed region. When this happens, Linux splits
memory area into 2 or 3 VMAs, each with its own policy. the existing virtual memory area into 2 or 3 VMAs, each with
its own policy.
By default, VMA policy applies only to pages allocated after the policy * By default, VMA policy applies only to pages allocated after
is installed. Any pages already faulted into the VMA range remain the policy is installed. Any pages already faulted into the
where they were allocated based on the policy at the time they were VMA range remain where they were allocated based on the
allocated. However, since 2.6.16, Linux supports page migration via policy at the time they were allocated. However, since
the mbind() system call, so that page contents can be moved to match 2.6.16, Linux supports page migration via the mbind() system
a newly installed policy. call, so that page contents can be moved to match a newly
installed policy.
Shared Policy: Conceptually, shared policies apply to "memory objects" Shared Policy
mapped shared into one or more tasks' distinct address spaces. An Conceptually, shared policies apply to "memory objects" mapped
application installs a shared policy the same way as VMA policies--using shared into one or more tasks' distinct address spaces. An
the mbind() system call specifying a range of virtual addresses that map application installs a shared policy the same way as VMA
the shared object. However, unlike VMA policies, which can be considered policies--using the mbind() system call specifying a range of
to be an attribute of a range of a task's address space, shared policies virtual addresses that map the shared object. However, unlike
apply directly to the shared object. Thus, all tasks that attach to the VMA policies, which can be considered to be an attribute of a
object share the policy, and all pages allocated for the shared object, range of a task's address space, shared policies apply
by any task, will obey the shared policy. directly to the shared object. Thus, all tasks that attach to
the object share the policy, and all pages allocated for the
shared object, by any task, will obey the shared policy.
As of 2.6.22, only shared memory segments, created by shmget() or As of 2.6.22, only shared memory segments, created by shmget() or
mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared
@ -118,11 +135,12 @@ most general to most specific:
Although hugetlbfs segments now support lazy allocation, their support Although hugetlbfs segments now support lazy allocation, their support
for shared policy has not been completed. for shared policy has not been completed.
As mentioned above [re: VMA policies], allocations of page cache As mentioned above :ref:`VMA policies <vma_policy>`,
pages for regular files mmap()ed with MAP_SHARED ignore any VMA allocations of page cache pages for regular files mmap()ed
policy installed on the virtual address range backed by the shared with MAP_SHARED ignore any VMA policy installed on the virtual
file mapping. Rather, shared page cache pages, including pages backing address range backed by the shared file mapping. Rather,
private mappings that have not yet been written by the task, follow shared page cache pages, including pages backing private
mappings that have not yet been written by the task, follow
task policy, if any, else System Default Policy. task policy, if any, else System Default Policy.
The shared policy infrastructure supports different policies on subset The shared policy infrastructure supports different policies on subset
@ -135,24 +153,27 @@ most general to most specific:
one or more ranges of the region. one or more ranges of the region.
Components of Memory Policies Components of Memory Policies
-----------------------------
A Linux memory policy consists of a "mode", optional mode flags, and an A Linux memory policy consists of a "mode", optional mode flags, and
optional set of nodes. The mode determines the behavior of the policy, an optional set of nodes. The mode determines the behavior of the
the optional mode flags determine the behavior of the mode, and the policy, the optional mode flags determine the behavior of the mode,
optional set of nodes can be viewed as the arguments to the policy and the optional set of nodes can be viewed as the arguments to the
behavior. policy behavior.
Internally, memory policies are implemented by a reference counted Internally, memory policies are implemented by a reference counted
structure, struct mempolicy. Details of this structure will be discussed structure, struct mempolicy. Details of this structure will be
in context, below, as required to explain the behavior. discussed in context, below, as required to explain the behavior.
Linux memory policy supports the following 4 behavioral modes: Linux memory policy supports the following 4 behavioral modes:
Default Mode--MPOL_DEFAULT: This mode is only used in the memory Default Mode--MPOL_DEFAULT
policy APIs. Internally, MPOL_DEFAULT is converted to the NULL This mode is only used in the memory policy APIs. Internally,
memory policy in all policy scopes. Any existing non-default policy MPOL_DEFAULT is converted to the NULL memory policy in all
will simply be removed when MPOL_DEFAULT is specified. As a result, policy scopes. Any existing non-default policy will simply be
MPOL_DEFAULT means "fall back to the next most specific policy scope." removed when MPOL_DEFAULT is specified. As a result,
MPOL_DEFAULT means "fall back to the next most specific policy
scope."
For example, a NULL or default task policy will fall back to the For example, a NULL or default task policy will fall back to the
system default policy. A NULL or default vma policy will fall system default policy. A NULL or default vma policy will fall
@ -164,57 +185,63 @@ Components of Memory Policies
It is an error for the set of nodes specified for this policy to It is an error for the set of nodes specified for this policy to
be non-empty. be non-empty.
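
The fallback chain described for MPOL_DEFAULT can be sketched in a few lines of C. This is only an illustration of the lookup order (VMA policy, then task policy, then system default), not the kernel's implementation; ``struct sketch_policy`` and every ``sketch_`` name are hypothetical stand-ins introduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mempolicy; only the mode is modelled. */
struct sketch_policy {
    int mode;
};

/* Stand-in for the hard-coded system default policy. */
const struct sketch_policy sketch_system_default = { .mode = 0 };

/*
 * A NULL pointer models "no policy installed at this scope", which is
 * what specifying MPOL_DEFAULT leaves behind. The most specific
 * non-NULL scope wins.
 */
const struct sketch_policy *
sketch_effective_policy(const struct sketch_policy *vma_pol,
                        const struct sketch_policy *task_pol)
{
    if (vma_pol)
        return vma_pol;
    if (task_pol)
        return task_pol;
    return &sketch_system_default;
}

/* Small usage demonstration of the three outcomes. */
void sketch_fallback_demo(void)
{
    struct sketch_policy task = { .mode = 2 };
    struct sketch_policy vma = { .mode = 3 };

    assert(sketch_effective_policy(&vma, &task) == &vma);
    assert(sketch_effective_policy(NULL, &task) == &task);
    assert(sketch_effective_policy(NULL, NULL) == &sketch_system_default);
}
```
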
MPOL_BIND: This mode specifies that memory must come from the MPOL_BIND
set of nodes specified by the policy. Memory will be allocated from This mode specifies that memory must come from the set of
the node in the set with sufficient free memory that is closest to nodes specified by the policy. Memory will be allocated from
the node where the allocation takes place. the node in the set with sufficient free memory that is
closest to the node where the allocation takes place.
MPOL_PREFERRED: This mode specifies that the allocation should be MPOL_PREFERRED
attempted from the single node specified in the policy. If that This mode specifies that the allocation should be attempted
allocation fails, the kernel will search other nodes, in order of from the single node specified in the policy. If that
increasing distance from the preferred node based on information allocation fails, the kernel will search other nodes, in order
provided by the platform firmware. of increasing distance from the preferred node based on
information provided by the platform firmware.
Internally, the Preferred policy uses a single node--the Internally, the Preferred policy uses a single node--the
preferred_node member of struct mempolicy. When the internal preferred_node member of struct mempolicy. When the internal
mode flag MPOL_F_LOCAL is set, the preferred_node is ignored and mode flag MPOL_F_LOCAL is set, the preferred_node is ignored
the policy is interpreted as local allocation. "Local" allocation and the policy is interpreted as local allocation. "Local"
policy can be viewed as a Preferred policy that starts at the node allocation policy can be viewed as a Preferred policy that
containing the cpu where the allocation takes place. starts at the node containing the cpu where the allocation
takes place.
It is possible for the user to specify that local allocation is It is possible for the user to specify that local allocation
always preferred by passing an empty nodemask with this mode. is always preferred by passing an empty nodemask with this
If an empty nodemask is passed, the policy cannot use the mode. If an empty nodemask is passed, the policy cannot use
MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags described the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags
below. described below.
MPOL_INTERLEAVE: This mode specifies that page allocations be MPOL_INTERLEAVE
interleaved, on a page granularity, across the nodes specified in This mode specifies that page allocations be interleaved, on a
the policy. This mode also behaves slightly differently, based on page granularity, across the nodes specified in the policy.
the context where it is used: This mode also behaves slightly differently, based on the
context where it is used:
For allocation of anonymous pages and shared memory pages, For allocation of anonymous pages and shared memory pages,
Interleave mode indexes the set of nodes specified by the policy Interleave mode indexes the set of nodes specified by the
using the page offset of the faulting address into the segment policy using the page offset of the faulting address into the
[VMA] containing the address modulo the number of nodes specified segment [VMA] containing the address modulo the number of
by the policy. It then attempts to allocate a page, starting at nodes specified by the policy. It then attempts to allocate a
the selected node, as if the node had been specified by a Preferred page, starting at the selected node, as if the node had been
policy or had been selected by a local allocation. That is, specified by a Preferred policy or had been selected by a
allocation will follow the per node zonelist. local allocation. That is, allocation will follow the per
node zonelist.
For allocation of page cache pages, Interleave mode indexes the set For allocation of page cache pages, Interleave mode indexes
of nodes specified by the policy using a node counter maintained the set of nodes specified by the policy using a node counter
per task. This counter wraps around to the lowest specified node maintained per task. This counter wraps around to the lowest
after it reaches the highest specified node. This will tend to specified node after it reaches the highest specified node.
spread the pages out over the nodes specified by the policy based This will tend to spread the pages out over the nodes
on the order in which they are allocated, rather than based on any specified by the policy based on the order in which they are
page offset into an address range or file. During system boot up, allocated, rather than based on any page offset into an
the temporary interleaved system default policy works in this address range or file. During system boot up, the temporary
mode. interleaved system default policy works in this mode.
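
The two node-selection rules described for interleave mode can be sketched as follows. This is a simplified model under stated assumptions, not the kernel's code: the nodemask is a single 64-bit word, pages are assumed to be 4 KiB, and all ``sketch_`` names are hypothetical. It only picks the starting node; the subsequent zonelist fallback is not modelled.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Number of nodes set in a single-word nodemask (assumes <= 64 nodes). */
unsigned sketch_nr_nodes(uint64_t nodemask)
{
    unsigned n = 0;

    for (; nodemask; nodemask &= nodemask - 1)
        n++;
    return n;
}

/* Id of the idx-th set node (0-based), or -1 if idx is out of range. */
int sketch_nth_node(uint64_t nodemask, unsigned idx)
{
    for (int node = 0; node < 64; node++) {
        if (!(nodemask & (UINT64_C(1) << node)))
            continue;
        if (idx-- == 0)
            return node;
    }
    return -1;
}

/*
 * Anonymous and shared memory pages: index the node set with the page
 * offset of the faulting address into the VMA, modulo the node count.
 */
int sketch_interleave_by_offset(uint64_t nodemask, uintptr_t vma_start,
                                uintptr_t fault_addr)
{
    unsigned nr = sketch_nr_nodes(nodemask);

    assert(nr > 0 && fault_addr >= vma_start);
    uintptr_t page_off = (fault_addr - vma_start) >> SKETCH_PAGE_SHIFT;
    return sketch_nth_node(nodemask, (unsigned)(page_off % nr));
}

/*
 * Page cache pages: advance a per-task counter to the next node in the
 * set, wrapping from the highest specified node to the lowest.
 * Pass prev_node = -1 for the first allocation.
 */
int sketch_interleave_next(uint64_t nodemask, int prev_node)
{
    assert(nodemask != 0 && prev_node >= -1);
    for (int node = prev_node + 1; node < 64; node++)
        if (nodemask & (UINT64_C(1) << node))
            return node;
    for (int node = 0; node < 64; node++)
        if (nodemask & (UINT64_C(1) << node))
            return node;
    return -1; /* not reached for a non-empty mask */
}
```

With a node set of {0, 2, 3}, the offset-based rule depends only on where the fault lands in the VMA, while the counter-based rule depends only on allocation order.
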
Linux memory policy supports the following optional mode flags: Linux memory policy supports the following optional mode flags:
MPOL_F_STATIC_NODES: This flag specifies that the nodemask passed by MPOL_F_STATIC_NODES
This flag specifies that the nodemask passed by
the user should not be remapped if the task or VMA's set of allowed the user should not be remapped if the task or VMA's set of allowed
nodes changes after the memory policy has been defined. nodes changes after the memory policy has been defined.
@ -242,7 +269,8 @@ Components of Memory Policies
MPOL_PREFERRED policies that were created with an empty nodemask MPOL_PREFERRED policies that were created with an empty nodemask
(local allocation). (local allocation).
MPOL_F_RELATIVE_NODES: This flag specifies that the nodemask passed MPOL_F_RELATIVE_NODES
This flag specifies that the nodemask passed
by the user will be mapped relative to the set of the task or VMA's by the user will be mapped relative to the set of the task or VMA's
set of allowed nodes. The kernel stores the user-passed nodemask, set of allowed nodes. The kernel stores the user-passed nodemask,
and if the allowed nodes changes, then that original nodemask will and if the allowed nodes changes, then that original nodemask will
@ -292,7 +320,8 @@ Components of Memory Policies
MPOL_PREFERRED policies that were created with an empty nodemask MPOL_PREFERRED policies that were created with an empty nodemask
(local allocation). (local allocation).
MEMORY POLICY REFERENCE COUNTING Memory Policy Reference Counting
================================
To resolve use/free races, struct mempolicy contains an atomic reference To resolve use/free races, struct mempolicy contains an atomic reference
count field. Internal interfaces, mpol_get()/mpol_put() increment and count field. Internal interfaces, mpol_get()/mpol_put() increment and
@ -360,60 +389,62 @@ follows:
or by prefaulting the entire shared memory region into memory and locking or by prefaulting the entire shared memory region into memory and locking
it down. However, this might not be appropriate for all applications. it down. However, this might not be appropriate for all applications.
MEMORY POLICY APIs Memory Policy APIs
Linux supports 3 system calls for controlling memory policy. These APIs Linux supports 3 system calls for controlling memory policy. These APIs
always affect only the calling task, the calling task's address space, or always affect only the calling task, the calling task's address space, or
some shared object mapped into the calling task's address space. some shared object mapped into the calling task's address space.
Note: the headers that define these APIs and the parameter data types .. note::
for user space applications reside in a package that is not part of the headers that define these APIs and the parameter data types for
the Linux kernel. The kernel system call interfaces, with the 'sys_' user space applications reside in a package that is not part of the
Linux kernel. The kernel system call interfaces, with the 'sys\_'
prefix, are defined in <linux/syscalls.h>; the mode and flag prefix, are defined in <linux/syscalls.h>; the mode and flag
definitions are defined in <linux/mempolicy.h>. definitions are defined in <linux/mempolicy.h>.
Set [Task] Memory Policy: Set [Task] Memory Policy::
long set_mempolicy(int mode, const unsigned long *nmask, long set_mempolicy(int mode, const unsigned long *nmask,
unsigned long maxnode); unsigned long maxnode);
Sets the calling task's "task/process memory policy" to mode Sets the calling task's "task/process memory policy" to mode
specified by the 'mode' argument and the set of nodes defined specified by the 'mode' argument and the set of nodes defined by
by 'nmask'. 'nmask' points to a bit mask of node ids containing 'nmask'. 'nmask' points to a bit mask of node ids containing at least
at least 'maxnode' ids. Optional mode flags may be passed by 'maxnode' ids. Optional mode flags may be passed by combining the
combining the 'mode' argument with the flag (for example: 'mode' argument with the flag (for example: MPOL_INTERLEAVE |
MPOL_INTERLEAVE | MPOL_F_STATIC_NODES). MPOL_F_STATIC_NODES).
See the set_mempolicy(2) man page for more details See the set_mempolicy(2) man page for more details
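
As a rough illustration of the call described above, the sketch below builds a node bit mask and passes it together with ``MPOL_INTERLEAVE | MPOL_F_STATIC_NODES``. It assumes a Linux system with the ``<linux/mempolicy.h>`` header installed and uses the raw ``syscall(2)`` entry point rather than a library wrapper. The ``sketch_`` helpers and the 64-node limit are hypothetical choices made for the example. The call can fail, for instance on a kernel without NUMA support or when a listed node does not exist, so the return value must be checked.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <linux/mempolicy.h> /* MPOL_* modes and mode flags */
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SKETCH_MAX_NODES 64UL /* assumption: node ids 0..63 are enough */
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_MASK_LONGS \
    ((SKETCH_MAX_NODES + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)

/* Set the bit for one node id in the mask. */
void sketch_node_set(unsigned long *mask, unsigned node)
{
    assert(node < SKETCH_MAX_NODES);
    mask[node / SKETCH_BITS_PER_LONG] |= 1UL << (node % SKETCH_BITS_PER_LONG);
}

/*
 * Ask for the calling task's allocations to be interleaved across the
 * given nodes. Returns 0 on success, -1 with errno set on failure.
 */
long sketch_set_interleave(const unsigned *nodes, size_t count)
{
    unsigned long mask[SKETCH_MASK_LONGS] = { 0 };

    for (size_t i = 0; i < count; i++)
        sketch_node_set(mask, nodes[i]);

    return syscall(SYS_set_mempolicy,
                   MPOL_INTERLEAVE | MPOL_F_STATIC_NODES,
                   mask, (unsigned long)SKETCH_MAX_NODES);
}
```

A caller might use ``unsigned nodes[] = { 0, 1 };`` followed by ``sketch_set_interleave(nodes, 2)``, and fall back to the default policy if the call returns -1.
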
Get [Task] Memory Policy or Related Information Get [Task] Memory Policy or Related Information::
long get_mempolicy(int *mode, long get_mempolicy(int *mode,
unsigned long *nmask, unsigned long maxnode, unsigned long *nmask, unsigned long maxnode,
void *addr, int flags); void *addr, int flags);
Queries the "task/process memory policy" of the calling task, or Queries the "task/process memory policy" of the calling task, or the
the policy or location of a specified virtual address, depending policy or location of a specified virtual address, depending on the
on the 'flags' argument. 'flags' argument.
See the get_mempolicy(2) man page for more details See the get_mempolicy(2) man page for more details
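
A minimal sketch of querying the calling task's policy follows. It passes a NULL address and zero flags, and uses the raw ``syscall(2)`` entry point. The ``sketch_`` name and the mask size are hypothetical choices, and the call may fail (returning -1 with errno set) on kernels without NUMA support or in restricted sandboxes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SKETCH_QUERY_LONGS 16 /* assumption: large enough for this system */

/*
 * Fetch the calling task's policy mode and node set.
 * Returns 0 on success, -1 with errno set on failure.
 */
long sketch_get_task_policy(int *mode, unsigned long mask[SKETCH_QUERY_LONGS])
{
    unsigned long maxnode =
        SKETCH_QUERY_LONGS * sizeof(unsigned long) * CHAR_BIT;

    assert(mode != NULL && mask != NULL);
    /* addr = NULL and flags = 0: ask about the task policy itself. */
    return syscall(SYS_get_mempolicy, mode, mask, maxnode, NULL, 0UL);
}
```
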
Install VMA/Shared Policy for a Range of Task's Address Space Install VMA/Shared Policy for a Range of Task's Address Space::
long mbind(void *start, unsigned long len, int mode, long mbind(void *start, unsigned long len, int mode,
const unsigned long *nmask, unsigned long maxnode, const unsigned long *nmask, unsigned long maxnode,
unsigned flags); unsigned flags);
mbind() installs the policy specified by (mode, nmask, maxnode) as mbind() installs the policy specified by (mode, nmask, maxnode) as a
a VMA policy for the range of the calling task's address space VMA policy for the range of the calling task's address space specified
specified by the 'start' and 'len' arguments. Additional actions by the 'start' and 'len' arguments. Additional actions may be
may be requested via the 'flags' argument. requested via the 'flags' argument.
See the mbind(2) man page for more details. See the mbind(2) man page for more details.
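
To make the mbind() description concrete, the sketch below maps an anonymous region and installs an MPOL_BIND policy for node 0 on it before any page is touched, since a VMA policy only governs pages allocated after it is installed. It assumes Linux with ``<linux/mempolicy.h>`` available and uses the raw ``syscall(2)`` entry point. The function name and the choice of node 0 are hypothetical. The mbind() call may fail, in which case the mapping is still usable under the task or system default policy.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <linux/mempolicy.h> /* MPOL_BIND */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Map 'len' bytes of anonymous memory and try to bind the range to
 * node 0. Returns the mapping, or NULL if mmap() failed. The result
 * of mbind() (0, or -1 with errno set) is stored in *mbind_rc.
 */
void *sketch_map_bound_to_node0(size_t len, long *mbind_rc)
{
    assert(len > 0 && mbind_rc != NULL);
    *mbind_rc = -1;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    unsigned long nodemask = 1UL; /* bit 0 set: node 0 only */
    unsigned long maxnode = sizeof(nodemask) * CHAR_BIT;

    /* Install the VMA policy before the first page fault in the range. */
    *mbind_rc = syscall(SYS_mbind, p, (unsigned long)len,
                        (unsigned long)MPOL_BIND, &nodemask, maxnode, 0UL);
    return p;
}
```

Pages in the returned range are allocated on first touch, so the policy takes effect when the caller first writes to the region.
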
MEMORY POLICY COMMAND LINE INTERFACE Memory Policy Command Line Interface
====================================
Although not strictly part of the Linux implementation of memory policy, Although not strictly part of the Linux implementation of memory policy,
a command line tool, numactl(8), exists that allows one to: a command line tool, numactl(8), exists that allows one to:
@ -428,8 +459,10 @@ containing the memory policy system call wrappers. Some distributions
package the headers and compile-time libraries in a separate development package the headers and compile-time libraries in a separate development
package. package.
.. _mem_pol_and_cpusets:
MEMORY POLICIES AND CPUSETS Memory Policies and cpusets
===========================
Memory policies work within cpusets as described above. For memory policies Memory policies work within cpusets as described above. For memory policies
that require a node or set of nodes, the nodes are restricted to the set of that require a node or set of nodes, the nodes are restricted to the set of