Eric Dumazet 52085f9e7e tcp: switch orphan_count to bare per-cpu counters
[ Upstream commit 19757cebf0 ]

Use of percpu_counter structure to track count of orphaned
sockets is causing problems on modern hosts with 256 cpus
or more.

Stefan Bach reported serious spinlock contention in real workloads,
which I was able to reproduce with a netfilter rule dropping
incoming FIN packets.

    53.56%  server  [kernel.kallsyms]      [k] queued_spin_lock_slowpath
            |
            ---queued_spin_lock_slowpath
               |
                --53.51%--_raw_spin_lock_irqsave
                          |
                           --53.51%--__percpu_counter_sum
                                     tcp_check_oom
                                     |
                                     |--39.03%--__tcp_close
                                     |          tcp_close
                                     |          inet_release
                                     |          inet6_release
                                     |          sock_close
                                     |          __fput
                                     |          ____fput
                                     |          task_work_run
                                     |          exit_to_usermode_loop
                                     |          do_syscall_64
                                     |          entry_SYSCALL_64_after_hwframe
                                     |          __GI___libc_close
                                     |
                                      --14.48%--tcp_out_of_resources
                                                tcp_write_timeout
                                                tcp_retransmit_timer
                                                tcp_write_timer_handler
                                                tcp_write_timer
                                                call_timer_fn
                                                expire_timers
                                                __run_timers
                                                run_timer_softirq
                                                __softirqentry_text_start

As explained in commit cf86a086a1 ("net/dst: use a smaller percpu_counter
batch for dst entries accounting"), default batch size is too big
for the default value of tcp_max_orphans (262144).

But even if we reduce batch sizes, there would still be cases
where the estimated count of orphans is beyond the limit,
and where tcp_too_many_orphans() has to call the expensive
percpu_counter_sum_positive().
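
The interaction between batch size and the orphan limit can be made
concrete with a little arithmetic. The sketch below assumes the default
batch is max(32, 2 * nr_cpus), which is how lib/percpu_counter.c is
believed to compute it and should be checked against the tree at hand.
default_batch() and max_drift() are hypothetical helper names used only
for illustration, not kernel functions.

```c
#include <assert.h>

/* ASSUMPTION: default percpu_counter batch is max(32, 2 * nr_cpus). */
static long default_batch(long nr_cpus)
{
	long batch = 2 * nr_cpus;

	assert(nr_cpus > 0);
	return batch < 32 ? 32 : batch;
}

/*
 * Rough upper bound on how far the cheap global estimate can sit from
 * the true sum: every cpu may hold almost one full batch locally.
 */
static long max_drift(long nr_cpus)
{
	return nr_cpus * default_batch(nr_cpus);
}
```

Under that assumption, a 256-cpu host gets a batch of 512 and a
worst-case drift of about 131072, half of the default tcp_max_orphans
of 262144, so the cheap estimate can be off by a large fraction of the
limit.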

One solution is to use plain per-cpu counters, and have
a timer to periodically refresh this cache.

Updating this cache every 100ms seems about right: tcp pressure
state does not change radically over shorter periods.
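
The idea can be illustrated with a minimal, single-threaded user-space
sketch. It is not the kernel patch: struct orphan_counters,
orphan_add(), orphan_refresh() and too_many_orphans() are hypothetical
names, NR_CPUS_SKETCH is an assumed fixed cpu count, and the periodic
timer is stood in for by an explicit call to orphan_refresh().

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS_SKETCH 256	/* hypothetical cpu count for this sketch */

struct orphan_counters {
	long per_cpu[NR_CPUS_SKETCH];	/* each slot is written by one cpu only */
	long cached_sum;		/* refreshed periodically, read on hot paths */
};

/* Hot path: touch only the local slot, no lock and no shared cache line. */
static void orphan_add(struct orphan_counters *c, size_t cpu, long delta)
{
	assert(cpu < NR_CPUS_SKETCH);
	c->per_cpu[cpu] += delta;
}

/* Slow path: what a periodic timer would do, e.g. every 100ms. */
static void orphan_refresh(struct orphan_counters *c)
{
	long sum = 0;

	for (size_t cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		sum += c->per_cpu[cpu];
	c->cached_sum = sum < 0 ? 0 : sum;	/* clamp transient negatives */
}

/* Readers compare the possibly stale cached value against the limit. */
static int too_many_orphans(const struct orphan_counters *c, long limit)
{
	return c->cached_sum > limit;
}
```

The trade-off is staleness: between two refreshes, readers see the old
sum, which is acceptable here because the limit check is only a
heuristic for memory pressure.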

percpu_counter was nice 15 years ago, when hosts had fewer
than 16 cpus; it no longer is by current standards.

v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m,
    reported by kernel test robot <lkp@intel.com>
    Remove unused socket argument from tcp_too_many_orphans()

Fixes: dd24c00191 ("net: Use a percpu_counter for orphan_count")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Stefan Bach <sfb@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-19 16:57:23 +08:00
arch x86/insn: Use get_unaligned() instead of memcpy() 2023-04-19 16:57:21 +08:00
block block: remove inaccurate requeue check 2023-04-19 16:57:12 +08:00
certs certs: Add support for using elliptic curve keys for signing modules 2021-08-23 19:55:42 +03:00
crypto crypto: ecc - fix CRYPTO_DEFAULT_RNG dependency 2023-04-19 16:57:20 +08:00
Documentation fscrypt: allow 256-bit master keys with AES-256-XTS 2023-04-19 16:57:07 +08:00
drivers tcp: switch orphan_count to bare per-cpu counters 2023-04-19 16:57:23 +08:00
fs erofs: don't trigger WARN() when decompression fails 2023-04-19 16:57:15 +08:00
include tcp: switch orphan_count to bare per-cpu counters 2023-04-19 16:57:23 +08:00
init bootconfig: init: Fix memblock leak in xbc_make_cmdline() 2021-10-10 22:27:40 -04:00
ipc ipc: remove memcg accounting for sops objects in do_semtimedop() 2021-09-14 10:22:11 -07:00
kernel kernel/sched: Fix sched_fork() access an invalid sched_task_group 2023-04-19 16:57:23 +08:00
lib bpf/tests: Fix error in tail call limit tests 2023-04-19 16:57:18 +08:00
LICENSES LICENSES/dual/CC-BY-4.0: Git rid of "smart quotes" 2021-07-15 06:31:24 -06:00
mm kfence: always use static branches to guard kfence_alloc() 2023-04-19 16:56:46 +08:00
net tcp: switch orphan_count to bare per-cpu counters 2023-04-19 16:57:23 +08:00
samples samples/bpf: Fix application of sizeof to pointer 2023-04-19 16:57:12 +08:00
scripts leaking_addresses: Always print a trailing newline 2023-04-19 16:57:11 +08:00
security ima: fix deadlock when traversing "ima_default_rules". 2023-04-19 16:57:22 +08:00
sound ASoC: tegra: Restore AC97 support 2023-04-19 16:57:00 +08:00
tools x86/insn: Use get_unaligned() instead of memcpy() 2023-04-19 16:57:21 +08:00
usr [board]:Init board config for JH7110 2021-11-18 14:06:27 +08:00
virt KVM: Remove tlbs_dirty 2021-09-23 11:01:12 -04:00
.clang-format clang-format: Update with the latest for_each macro list 2021-05-12 23:32:39 +02:00
.cocciconfig
.get_maintainer.ignore Opt out of scripts/get_maintainer.pl 2019-05-16 10:53:40 -07:00
.gitattributes .gitattributes: use 'dts' diff driver for dts files 2019-12-04 19:44:11 -08:00
.gitignore .gitignore: ignore only top-level modules.builtin 2021-05-02 00:43:35 +09:00
.mailmap mailmap: add Andrej Shadura 2021-10-18 20:22:03 -10:00
COPYING COPYING: state that all contributions really are covered by this file 2020-02-10 13:32:20 -08:00
CREDITS MAINTAINERS: Move Daniel Drake to credits 2021-09-21 08:34:58 +03:00
Kbuild kbuild: rename hostprogs-y/always to hostprogs/always-y 2020-02-04 01:53:07 +09:00
Kconfig kbuild: ensure full rebuild when the compiler is updated 2020-05-12 13:28:33 +09:00
MAINTAINERS MAINTAINERS: Add entry for RISC-V PMU drivers 2023-01-03 14:26:18 +08:00
Makefile Linux 5.15.2 2023-04-19 16:56:47 +08:00
README Drop all 00-INDEX files from Documentation/ 2018-09-09 15:08:58 -06:00

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result from upgrading your kernel.