Star64_linux/kernel
Adrian Reber 49cb2fc42c fork: extend clone3() to support setting a PID
The main motivation to add set_tid to clone3() is CRIU.

To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.

Extending clone3() to support *set_tid makes it possible restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
race free (as long as the desired PID/TID is available).

This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with *set_tid as they are currently in place for ns_last_pid.

The original version of this change was using a single value for
set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
decided to change set_tid to an array to enable setting the PID of a
process in multiple PID namespaces at the same time. If a process is
created in a PID namespace it is possible to influence the PID inside
and outside of the PID namespace. Details also in the corresponding
selftest.

To create a process with the following PIDs:

      PID NS level         Requested PID
        0 (host)              31496
        1                        42
        2                         1

For that example the two newly introduced parameters to struct
clone_args (set_tid and set_tid_size) would need to be:

  set_tid[0] = 1;
  set_tid[1] = 42;
  set_tid[2] = 31496;
  set_tid_size = 3;

If only the PIDs of the two innermost nested PID namespaces should be
defined it would look like this:

  set_tid[0] = 1;
  set_tid[1] = 42;
  set_tid_size = 2;

The PID of the newly created process would then be the next available
free PID in the PID namespace level 0 (host) and 42 in the PID namespace
at level 1 and the PID of the process in the innermost PID namespace
would be 1.

The set_tid array is used to specify the PID of a process starting
from the innermost nested PID namespaces up to set_tid_size PID namespaces.

set_tid_size cannot be larger then the current PID namespace level.

Signed-off-by: Adrian Reber <areber@redhat.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Acked-by: Andrei Vagin <avagin@gmail.com>
Link: https://lore.kernel.org/r/20191115123621.142252-1-areber@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-11-15 23:49:22 +01:00
..
bpf Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-09-28 17:47:33 -07:00
cgroup
configs
debug
dma dma-mapping: fix false positivse warnings in dma_common_free_remap() 2019-10-05 10:24:17 +02:00
events perf/core: Fix corner case in perf_rotate_context() 2019-10-09 12:44:13 +02:00
gcov
irq
livepatch
locking
power Merge branch 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2019-09-28 08:14:15 -07:00
printk
rcu
sched Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-10-12 15:29:54 -07:00
time tick: broadcast-hrtimer: Fix a race in bc_set_next 2019-09-27 14:45:55 +02:00
trace tracing: Initialize iter->seq after zeroing in tracing_read_pipe() 2019-10-12 20:49:34 -04:00
.gitignore
acct.c
async.c
audit.c
audit.h
audit_fsnotify.c
audit_tree.c
audit_watch.c
auditfilter.c
auditsc.c
backtracetest.c
bounds.c
capability.c
compat.c
configs.c
context_tracking.c
cpu.c
cpu_pm.c
crash_core.c
crash_dump.c
cred.c
delayacct.c
dma.c
elfcore.c
exec_domain.c
exit.c exit: use pid_has_task() in do_wait() 2019-10-17 15:36:58 +02:00
extable.c
fail_function.c
fork.c fork: extend clone3() to support setting a PID 2019-11-15 23:49:22 +01:00
freezer.c Revert "libata, freezer: avoid block device removal while system is frozen" 2019-10-06 09:11:37 -06:00
futex.c
gen_kheaders.sh kheaders: make headers archive reproducible 2019-10-05 15:29:49 +09:00
groups.c
hung_task.c
iomem.c
irq_work.c
jump_label.c
kallsyms.c
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
kcov.c
kexec.c
kexec_core.c
kexec_elf.c
kexec_file.c Merge branch 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2019-09-28 08:14:15 -07:00
kexec_internal.h
kheaders.c
kmod.c
kprobes.c
ksysfs.c
kthread.c
latencytop.c
Makefile Merge branch 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity 2019-09-27 19:37:27 -07:00
module-internal.h
module.c Merge branch 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2019-09-28 08:14:15 -07:00
module_signature.c
module_signing.c
notifier.c
nsproxy.c
padata.c
panic.c panic: ensure preemption is disabled during panic() 2019-10-07 15:47:19 -07:00
params.c
pid.c fork: extend clone3() to support setting a PID 2019-11-15 23:49:22 +01:00
pid_namespace.c fork: extend clone3() to support setting a PID 2019-11-15 23:49:22 +01:00
profile.c
ptrace.c
range.c
reboot.c
relay.c
resource.c
rseq.c
seccomp.c
signal.c
smp.c
smpboot.c
smpboot.h
softirq.c
stackleak.c
stacktrace.c
stop_machine.c
sys.c
sys_ni.c
sysctl.c
sysctl_binary.c
task_work.c
taskstats.c
test_kprobes.c
torture.c
tracepoint.c
tsacct.c
ucount.c
uid16.c
uid16.h
umh.c
up.c
user-return-notifier.c
user.c
user_namespace.c
utsname.c
utsname_sysctl.c
watchdog.c
watchdog_hld.c
workqueue.c
workqueue_internal.h