Star64_linux/include
Eric Dumazet 52085f9e7e tcp: switch orphan_count to bare per-cpu counters
[ Upstream commit 19757cebf0 ]

Use of percpu_counter structure to track count of orphaned
sockets is causing problems on modern hosts with 256 cpus
or more.

Stefan Bach reported serious spinlock contention in real workloads,
which I was able to reproduce with a netfilter rule dropping
incoming FIN packets.

    53.56%  server  [kernel.kallsyms]      [k] queued_spin_lock_slowpath
            |
            ---queued_spin_lock_slowpath
               |
                --53.51%--_raw_spin_lock_irqsave
                          |
                           --53.51%--__percpu_counter_sum
                                     tcp_check_oom
                                     |
                                     |--39.03%--__tcp_close
                                     |          tcp_close
                                     |          inet_release
                                     |          inet6_release
                                     |          sock_close
                                     |          __fput
                                     |          ____fput
                                     |          task_work_run
                                     |          exit_to_usermode_loop
                                     |          do_syscall_64
                                     |          entry_SYSCALL_64_after_hwframe
                                     |          __GI___libc_close
                                     |
                                      --14.48%--tcp_out_of_resources
                                                tcp_write_timeout
                                                tcp_retransmit_timer
                                                tcp_write_timer_handler
                                                tcp_write_timer
                                                call_timer_fn
                                                expire_timers
                                                __run_timers
                                                run_timer_softirq
                                                __softirqentry_text_start

As explained in commit cf86a086a1 ("net/dst: use a smaller percpu_counter
batch for dst entries accounting"), default batch size is too big
for the default value of tcp_max_orphans (262144).
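
To put a number on "too big", here is a rough sketch of the arithmetic. It
assumes the default percpu_counter batch is max(32, 2 * nr_cpus), which is
not stated in this message, and the helper names are hypothetical. Each cpu
can hold up to (batch - 1) unflushed updates, so the cheap estimate can
drift by about nr_cpus * batch.

```c
#include <assert.h>

/* Assumption: default percpu_counter batch is max(32, 2 * nr_cpus). */
static long default_batch(long nr_cpus)
{
    long batch = 2 * nr_cpus;

    return batch > 32 ? batch : 32;
}

/*
 * Upper bound on how far the cheap (unsummed) estimate can be from the
 * true value: every cpu may hold up to (batch - 1) unflushed updates.
 */
static long worst_case_drift(long nr_cpus)
{
    assert(nr_cpus > 0);
    return nr_cpus * (default_batch(nr_cpus) - 1);
}
```

Under that assumption, worst_case_drift(256) is 130816, about half of the
262144 limit, while worst_case_drift(16) is only 496.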

But even if we reduce batch sizes, there would still be cases
where the estimated count of orphans is beyond the limit,
and where tcp_too_many_orphans() has to call the expensive
percpu_counter_sum_positive().

One solution is to use plain per-cpu counters, and have
a timer to periodically refresh this cache.

Updating this cache every 100ms seems about right, since tcp pressure
state does not change radically over shorter periods.
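
The idea can be sketched in plain userspace C. This is only an illustration
under stated assumptions, not the kernel code: NR_CPUS, the array, and all
function names are hypothetical, cpus are passed explicitly, and the
periodic timer is left to the caller.

```c
#include <assert.h>

#define NR_CPUS     256     /* assumed host size, as in the text above */
#define MAX_ORPHANS 262144  /* default tcp_max_orphans quoted above */

/* One plain counter per cpu; each cpu only ever touches its own slot. */
static int orphan_count[NR_CPUS];

/* Cached total, refreshed periodically instead of on every check. */
static int orphan_cache;

static void orphan_inc(unsigned int cpu)
{
    assert(cpu < NR_CPUS);
    orphan_count[cpu]++;
}

static void orphan_dec(unsigned int cpu)
{
    assert(cpu < NR_CPUS);
    orphan_count[cpu]--;    /* a slot may go negative; only the sum matters */
}

/*
 * Expensive part: walk every slot. Intended to be driven by a timer
 * (every 100ms in the text above), never by the hot paths.
 */
static void orphan_cache_refresh(void)
{
    int total = 0;

    for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)
        total += orphan_count[cpu];
    orphan_cache = total > 0 ? total : 0;
}

/* Cheap part: a single read of the cached total, no lock, no loop. */
static int too_many_orphans(int shift)
{
    return (orphan_cache << shift) > MAX_ORPHANS;
}
```

The check can lag reality by up to one refresh period, which is the
trade-off accepted above in exchange for taking no lock on the close and
retransmit-timer paths.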

percpu_counter was fine 15 years ago, when hosts had fewer
than 16 cpus, but it no longer is by current standards.

v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m,
    reported by kernel test robot <lkp@intel.com>
    Remove unused socket argument from tcp_too_many_orphans()

Fixes: dd24c00191 ("net: Use a percpu_counter for orphan_count")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Stefan Bach <sfb@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-19 16:57:23 +08:00
acpi ACPI: tools: fix compilation error 2021-10-07 19:18:19 +02:00
asm-generic asm-generic: build fixes for v5.15 2021-10-08 11:57:54 -07:00
clocksource
crypto crypto: add patch for 5.15 2022-04-28 13:54:49 +08:00
drm
dt-bindings pinctrl: starfive: jh7110: Correct the ioconfig register address and bit definitions 2023-03-16 14:25:09 +08:00
keys
kunit kunit: fix kernel-doc warnings due to mismatched arg names 2021-10-06 17:54:07 -06:00
kvm KVM: arm64: Fix PMU probe ordering 2021-09-20 12:43:34 +01:00
linux kernel/sched: Fix sched_fork() access an invalid sched_task_group 2023-04-19 16:57:23 +08:00
math-emu
media media: videobuf2: rework vb2_mem_ops API 2023-04-19 16:57:08 +08:00
memory memory: renesas-rpc-if: Correct QSPI data transfer in Manual mode 2023-04-19 16:57:00 +08:00
misc
net tcp: switch orphan_count to bare per-cpu counters 2023-04-19 16:57:23 +08:00
pcmcia
ras
rdma
scsi scsi: core: Avoid leaving shost->last_reset with stale value if EH does not run 2023-04-19 16:56:53 +08:00
soc Add a pm function for GPU 2022-07-01 15:01:36 +08:00
sound [Audio: I2S] Support WM8960 one channel play and captrue 2022-08-11 02:37:26 -07:00
target
trace block-5.15-2021-10-17 2021-10-17 19:25:20 -10:00
uapi signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed 2023-04-19 16:57:00 +08:00
vdso
video v4l2: modify v4l2 compatible name 2022-08-31 09:16:30 +08:00
xen xen/privcmd: drop "pages" parameter from xen_remap_pfn() 2021-10-05 08:20:27 +02:00