linux-bl808/kernel
Daniel Borkmann e0cea7ce98 bpf: implement ld_abs/ld_ind in native bpf
The main part of this work is to finally allow removal of LD_ABS
and LD_IND from the BPF core by reimplementing them through native
eBPF instead. Both LD_ABS/LD_IND were carried over from cBPF and
keeping them around in native eBPF caused way more trouble than
actually worth it. To just list some of the security issues in
the past:

  * fdfaf64e75 ("x86: bpf_jit: support negative offsets")
  * 35607b02db ("sparc: bpf_jit: fix loads from negative offsets")
  * e0ee9c1215 ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
  * 07aee94394 ("bpf, sparc: fix usage of wrong reg for load_skb_regs after call")
  * 6d59b7dbf7 ("bpf, s390x: do not reload skb pointers in non-skb context")
  * 87338c8e2c ("bpf, ppc64: do not reload skb pointers in non-skb context")

For programs in native eBPF, LD_ABS/LD_IND are pretty much legacy
these days due to their limitations and more efficient/flexible
alternatives that have been developed over time such as direct
packet access. LD_ABS/LD_IND only cover 1/2/4 byte loads into a
register, the load happens in host endianness and its exception
handling can yield unexpected behavior. The latter is explained
in depth in f6b1b3bf0d ("bpf: fix subprog verifier bypass by
div/mod by 0 exception") with similar cases of exceptions we had.
In native eBPF more recent program types will disable LD_ABS/LD_IND
altogether through may_access_skb() in verifier, and given the
limitations in terms of exception handling, it's also disabled
in programs that use BPF to BPF calls.

In terms of cBPF, the LD_ABS/LD_IND is used in networking programs
to access packet data. It is not used in seccomp-BPF but programs
that use it for socket filtering or reuseport for demuxing with
cBPF. This is mostly relevant for applications that have not yet
migrated to native eBPF.

The main complexity and source of bugs in LD_ABS/LD_IND is coming
from their implementation in the various JITs. Most of them keep
the model around from cBPF times by implementing a fastpath written
in asm. They use typically two from the BPF program hidden CPU
registers for caching the skb's headlen (skb->len - skb->data_len)
and skb->data. Throughout the JIT phase this requires to keep track
whether LD_ABS/LD_IND are used and if so, the two registers need
to be recached each time a BPF helper would change the underlying
packet data in native eBPF case. At least in eBPF case, available
CPU registers are rare and the additional exit path out of the
asm written JIT helper makes it also inflexible since not all
parts of the JITer are in control from plain C. A LD_ABS/LD_IND
implementation in eBPF therefore allows to significantly reduce
the complexity in JITs with comparable performance results for
them, e.g.:

test_bpf             tcpdump port 22             tcpdump complex
x64      - before    15 21 10                    14 19  18
         - after      7 10 10                     7 10  15
arm64    - before    40 91 92                    40 91 151
         - after     51 64 73                    51 62 113

For cBPF we now track any usage of LD_ABS/LD_IND in bpf_convert_filter()
and cache the skb's headlen and data in the cBPF prologue. The
BPF_REG_TMP gets remapped from R8 to R2 since it's mainly just
used as a local temporary variable. This allows to shrink the
image on x86_64 also for seccomp programs slightly since mapping
to %rsi is not an ereg. In callee-saved R8 and R9 we now track
skb data and headlen, respectively. For normal prologue emission
in the JITs this does not add any extra instructions since R8, R9
are pushed to stack in any case from eBPF side. cBPF uses the
convert_bpf_ld_abs() emitter which probes the fast path inline
already and falls back to bpf_skb_load_helper_{8,16,32}() helper
relying on the cached skb data and headlen as well. R8 and R9
never need to be reloaded due to bpf_helper_changes_pkt_data()
since all skb access in cBPF is read-only. Then, for the case
of native eBPF, we use the bpf_gen_ld_abs() emitter, which calls
the bpf_skb_load_helper_{8,16,32}_no_cache() helper unconditionally,
does neither cache skb data and headlen nor has an inlined fast
path. The reason for the latter is that native eBPF does not have
any extra registers available anyway, but even if there were, it
avoids any reload of skb data and headlen in the first place.
Additionally, for the negative offsets, we provide an alternative
bpf_skb_load_bytes_relative() helper in eBPF which operates
similarly as bpf_skb_load_bytes() and allows for more flexibility.
Tested myself on x64, arm64, s390x, from Sandipan on ppc64.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03 16:49:19 -07:00
..
bpf bpf: implement ld_abs/ld_ind in native bpf 2018-05-03 16:49:19 -07:00
cgroup Merge branch 'for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq 2018-04-03 18:00:13 -07:00
configs
debug * Fix 2032 time access issues and new compiler warnings 2018-04-12 10:21:19 -07:00
events perf: Remove superfluous allocation error check 2018-04-17 09:47:40 -03:00
gcov
irq genirq/affinity: Spread irq vectors among present CPUs as far as possible 2018-04-06 12:19:51 +02:00
livepatch livepatch: Allow to call a custom callback when freeing shadow variables 2018-04-17 13:42:48 +02:00
locking locking/rwsem: Add DEBUG_RWSEMS to look for lock/unlock mismatches 2018-03-31 07:30:50 +02:00
power PM / QoS: mark expected switch fall-throughs 2018-04-09 13:49:40 +02:00
printk New features: 2018-04-10 11:27:30 -07:00
rcu
sched Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-04-15 12:43:30 -07:00
time posix-cpu-timers: Ensure set_process_cpu_timer is always evaluated 2018-04-19 12:54:57 +02:00
trace bpf: Allow bpf_current_task_under_cgroup in interrupt 2018-04-29 09:18:04 -07:00
.gitignore
acct.c
async.c
audit.c audit/stable-4.17 PR 20180403 2018-04-06 15:01:25 -07:00
audit.h
audit_fsnotify.c
audit_tree.c
audit_watch.c
auditfilter.c
auditsc.c
backtracetest.c
bounds.c
capability.c
compat.c mm: add kernel_move_pages() helper, move compat syscall to mm/migrate.c 2018-04-02 20:15:32 +02:00
configs.c
context_tracking.c
cpu.c
cpu_pm.c
crash_core.c kexec: export PG_swapbacked to VMCOREINFO 2018-04-13 17:10:27 -07:00
crash_dump.c
cred.c
delayacct.c
dma.c
elfcore.c
exec_domain.c
exit.c kernel: use kernel_wait4() instead of sys_wait4() 2018-04-02 20:14:51 +02:00
extable.c
fail_function.c
fork.c fork: unconditionally clear stack on fork 2018-04-20 17:18:35 -07:00
freezer.c
futex.c
futex_compat.c
groups.c
hung_task.c
irq_work.c
jump_label.c
kallsyms.c
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
kcov.c
kexec.c kexec: call do_kexec_load() in compat syscall directly 2018-04-02 20:15:01 +02:00
kexec_core.c
kexec_file.c kernel/kexec_file.c: allow archs to set purgatory load address 2018-04-13 17:10:28 -07:00
kexec_internal.h
kmod.c
kprobes.c
ksysfs.c
kthread.c
latencytop.c
Makefile
memremap.c
module-internal.h
module.c arch: remove obsolete architecture ports 2018-04-02 20:20:12 -07:00
module_signing.c
notifier.c
nsproxy.c
padata.c
panic.c taint: add taint for randstruct 2018-04-11 10:28:35 -07:00
params.c kernel/params.c: downgrade warning for unsafe parameters 2018-04-11 10:28:37 -07:00
pid.c xarray: add the xa_lock to the radix_tree_root 2018-04-11 10:28:39 -07:00
pid_namespace.c Merge branch 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2018-04-03 19:15:32 -07:00
profile.c
ptrace.c
range.c
reboot.c
relay.c
resource.c resource: fix integer overflow at reallocation 2018-04-13 17:10:27 -07:00
seccomp.c
signal.c Merge branch 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2018-04-07 11:11:41 -07:00
smp.c
smpboot.c
smpboot.h
softirq.c
stacktrace.c
stop_machine.c
sys.c kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid() 2018-04-02 20:16:06 +02:00
sys_ni.c syscalls/core: Prepare CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y for compat syscalls 2018-04-05 16:59:38 +02:00
sysctl.c kernel/sysctl.c: add kdoc comments to do_proc_do{u}intvec_minmax_conv_param 2018-04-11 10:28:38 -07:00
sysctl_binary.c
task_work.c
taskstats.c
test_kprobes.c
torture.c
tracepoint.c
tsacct.c
ucount.c headers: untangle kmemleak.h from mm.h 2018-04-05 21:36:27 -07:00
uid16.c fs: add do_fchownat(), ksys_fchown() helpers and ksys_{,l}chown() wrappers 2018-04-02 20:15:59 +02:00
uid16.h kernel: provide ksys_*() wrappers for syscalls called by kernel/uid16.c 2018-04-02 20:15:30 +02:00
umh.c kernel: use kernel_wait4() instead of sys_wait4() 2018-04-02 20:14:51 +02:00
up.c
user-return-notifier.c
user.c
user_namespace.c
utsname.c uts: create "struct uts_namespace" from kmem_cache 2018-04-11 10:28:35 -07:00
utsname_sysctl.c
watchdog.c
watchdog_hld.c
workqueue.c Merge branch 'for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq 2018-04-03 18:00:13 -07:00
workqueue_internal.h