MD5 keys are read with RCU protection, and tcp_md5_do_add()
might update in-place a prior key.
Normally, typical RCU updates would allocate a new piece
of memory. In this case only key->key and key->keylen might
be updated, and we do not care if an incoming packet could
see the old key, the new one, or some intermediate value,
since changing the key on a live flow is known to be problematic
anyway.
We only want to make sure that in the case key->keylen
is changed, cpus in tcp_md5_hash_key() wont try to use
uninitialized data, or crash because key->keylen was
read twice to feed sg_init_one() and ahash_request_set_crypt()
Fixes: 9ea88a1530 ("tcp: md5: check md5 signature without socket lock")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The flow is allocated in qrtr_tx_wait, but not freed when qrtr node
is released. (*slot) becomes NULL after radix_tree_iter_delete is
called in __qrtr_node_release. The fix is to save (*slot) to a
vairable and then free it.
This memory leak is catched when kmemleak is enabled in kernel,
the report looks like below:
unreferenced object 0xffffa0de69e08420 (size 32):
comm "kworker/u16:3", pid 176, jiffies 4294918275 (age 82858.876s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 28 84 e0 69 de a0 ff ff ........(..i....
28 84 e0 69 de a0 ff ff 03 00 00 00 00 00 00 00 (..i............
backtrace:
[<00000000e252af0a>] qrtr_node_enqueue+0x38e/0x400 [qrtr]
[<000000009cea437f>] qrtr_sendmsg+0x1e0/0x2a0 [qrtr]
[<000000008bddbba4>] sock_sendmsg+0x5b/0x60
[<0000000003beb43a>] qmi_send_message.isra.3+0xbe/0x110 [qmi_helpers]
[<000000009c9ae7de>] qmi_send_request+0x1c/0x20 [qmi_helpers]
Signed-off-by: Carl Huang <cjhuang@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make it an explicit counterpart to devm_register_netdev() just like we
do with devm_free_netdev() for better clarity.
Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2020-06-30
The following pull-request contains BPF updates for your *net* tree.
We've added 28 non-merge commits during the last 9 day(s) which contain
a total of 35 files changed, 486 insertions(+), 232 deletions(-).
The main changes are:
1) Fix an incorrect verifier branch elimination for PTR_TO_BTF_ID pointer
types, from Yonghong Song.
2) Fix UAPI for sockmap and flow_dissector progs that were ignoring various
arguments passed to BPF_PROG_{ATTACH,DETACH}, from Lorenz Bauer & Jakub Sitnicki.
3) Fix broken AF_XDP DMA hacks that are poking into dma-direct and swiotlb
internals and integrate it properly into DMA core, from Christoph Hellwig.
4) Fix RCU splat from recent changes to avoid skipping ingress policy when
kTLS is enabled, from John Fastabend.
5) Fix BPF ringbuf map to enforce size to be the power of 2 in order for its
position masking to work, from Andrii Nakryiko.
6) Fix regression from CAP_BPF work to re-allow CAP_SYS_ADMIN for loading
of network programs, from Maciej Żenczykowski.
7) Fix libbpf section name prefix for devmap progs, from Jesper Dangaard Brouer.
8) Fix formatting in UAPI documentation for BPF helpers, from Quentin Monnet.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When skb is coalesced tcp_ack_tstamp() still needs to be called when not
fully acked in tcp_clean_rtx_queue(), otherwise SCM_TSTAMP_ACK
timestamps may never be fired. Since the original patch series had
dependent commits, this patch fixes the issue instead of reverting by
restoring calls to tcp_ack_tstamp() when skb is not fully acked.
Fixes: fdb7eb21dd ("tcp: stamp SCM_TSTAMP_ACK later in tcp_clean_rtx_queue()")
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This clean-up the code a bit, reduces the number of
used hooks and indirect call requested, and allow
better error reporting from __mptcp_subflow_connect()
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add two tests for PTR_TO_BTF_ID vs. null ptr comparison,
one for PTR_TO_BTF_ID in the ctx structure and the
other for PTR_TO_BTF_ID after one level pointer chasing.
In both cases, the test ensures condition is not
removed.
For example, for this test
struct bpf_fentry_test_t {
struct bpf_fentry_test_t *a;
};
int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
{
if (arg == 0)
test7_result = 1;
return 0;
}
Before the previous verifier change, we have xlated codes:
int test7(long long unsigned int * ctx):
; int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
0: (79) r1 = *(u64 *)(r1 +0)
; int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
1: (b4) w0 = 0
2: (95) exit
After the previous verifier change, we have:
int test7(long long unsigned int * ctx):
; int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
0: (79) r1 = *(u64 *)(r1 +0)
; if (arg == 0)
1: (55) if r1 != 0x0 goto pc+4
; test7_result = 1;
2: (18) r1 = map[id:6][0]+48
4: (b7) r2 = 1
5: (7b) *(u64 *)(r1 +0) = r2
; int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
6: (b4) w0 = 0
7: (95) exit
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200630171241.2523875-1-yhs@fb.com
- bump version strings, by Simon Wunderlich
- update mailing list URL, by Sven Eckelmann
- fix typos and grammar in documentation, by Sven Eckelmann
- introduce a configurable per interface hop penalty,
by Linus Luessing
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl769w8WHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoatoEACN9O+fCwDIlIdxw/bUlGFw1J+V
4rDReo5grIkl3BSdnLkkaPXTcazCWRoFhCdH6esTsF4/R/nwUa5AZuAbVd2NnCFU
H+hVc7q/w0Hqppx5APT0AghgLcp8mFuHV9jqtJOjv9Kvd05Il6XwGkALT8QX2e2P
ypMD+PBuQT1XCMl0v02lA0ki2TxvCtFYs1UznbAdl86GOGMua3t6fuRWQ2fpUMou
t9251NnwSvSyeYuI8Zws7Wza2S4FOilX/CAjjGi02D0xf7T/JDs3FiPd6obdNfe/
4j6R2dHWjxz1sb6bEdM3wfmUlU4StMQ+Vywpu1fh3gihLc2tmtMZhn6tJk3YjALt
zlwDO2p/si7FLrizfMS7mUesZJH7Ji79BZtZWXsdW3p1zU1jlfJh5dc2ZaKwL9qk
MSpd9DH2F459YC6ELcITLnhCymc7O/g3n3voNnSxi7uX4M5LsGt0mcx6g/iiFN16
nPNHzXo7WJjgRcmLEQ3kYeD7Kjqh5XEogKLnwhThBxCFIQkNFF3OYqWtJyj5Sg5o
+EredOAcq1Rc5CR//V43rQZLk4o1I9qQysKFqbjHohMIgLccz+47QAeBXrhku0ZT
XU6k09Q0pkuo3iVPYnIUTDUkHtvFtsCKR98+8tB5xDR5F155OpFg7AXaKIx5Iym8
4/eGrx22T+6TIeSGGg==
=ceu6
-----END PGP SIGNATURE-----
Merge tag 'batadv-next-for-davem-20200630' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
This feature/cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
- update mailing list URL, by Sven Eckelmann
- fix typos and grammar in documentation, by Sven Eckelmann
- introduce a configurable per interface hop penalty,
by Linus Luessing
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
It is not very informative to know the DSA master device when a
subordinate network device fails to get its PHY setup. Provide the
device name and capitalize PHY while we are it.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The xfrm interface uses skb->protocol to determine packet type, and
bails out if it's not set. For AF_PACKET injection, we need to support
its call chain of:
packet_sendmsg -> packet_snd -> packet_parse_headers ->
dev_parse_header_protocol -> parse_protocol
Without a valid parse_protocol, this returns zero, and xfrmi rejects the
skb. So, this wires up the ip_tunnel handler for layer 3 packets for
that case.
Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sit uses skb->protocol to determine packet type, and bails out if it's
not set. For AF_PACKET injection, we need to support its call chain of:
packet_sendmsg -> packet_snd -> packet_parse_headers ->
dev_parse_header_protocol -> parse_protocol
Without a valid parse_protocol, this returns zero, and sit rejects the
skb. So, this wires up the ip_tunnel handler for layer 3 packets for
that case.
Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vti uses skb->protocol to determine packet type, and bails out if it's
not set. For AF_PACKET injection, we need to support its call chain of:
packet_sendmsg -> packet_snd -> packet_parse_headers ->
dev_parse_header_protocol -> parse_protocol
Without a valid parse_protocol, this returns zero, and vti rejects the
skb. So, this wires up the ip_tunnel handler for layer 3 packets for
that case.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ipip uses skb->protocol to determine packet type, and bails out if it's
not set. For AF_PACKET injection, we need to support its call chain of:
packet_sendmsg -> packet_snd -> packet_parse_headers ->
dev_parse_header_protocol -> parse_protocol
Without a valid parse_protocol, this returns zero, and ipip rejects the
skb. So, this wires up the ip_tunnel handler for layer 3 packets for
that case.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some devices that take straight up layer 3 packets benefit from having a
shared header_ops so that AF_PACKET sockets can inject packets that are
recognized. This shared infrastructure will be used by other drivers
that currently can't inject packets using AF_PACKET. It also exposes the
parser function, as it is useful in standalone form too.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sockmap code currently ignores the value of attach_bpf_fd when
detaching a program. This is contrary to the usual behaviour of
checking that attach_bpf_fd represents the currently attached
program.
Ensure that attach_bpf_fd is indeed the currently attached
program. It turns out that all sockmap selftests already do this,
which indicates that this is unlikely to cause breakage.
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200629095630.7933-5-lmb@cloudflare.com
Using BPF_PROG_ATTACH on a sockmap program currently understands no
flags or replace_bpf_fd, but accepts any value. Return EINVAL instead.
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200629095630.7933-4-lmb@cloudflare.com
Prepare for having multi-prog attachments for new netns attach types by
storing programs to run in a bpf_prog_array, which is well suited for
iterating over programs and running them in sequence.
After this change bpf(PROG_QUERY) may block to allocate memory in
bpf_prog_array_copy_to_user() for collected program IDs. This forces a
change in how we protect access to the attached program in the query
callback. Because bpf_prog_array_copy_to_user() can sleep, we switch from
an RCU read lock to holding a mutex that serializes updaters.
Because we allow only one BPF flow_dissector program to be attached to
netns at all times, the bpf_prog_array pointed by net->bpf.run_array is
always either detached (null) or one element long.
No functional changes intended.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200625141357.910330-3-jakub@cloudflare.com
Prepare for using bpf_prog_array to store attached programs by moving out
code that updates the attached program out of flow dissector.
Managing bpf_prog_array is more involved than updating a single bpf_prog
pointer. This will let us do it all from one place, bpf/net_namespace.c, in
the subsequent patch.
No functional change intended.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200625141357.910330-2-jakub@cloudflare.com
Keep the IPVS hooks registered in Netfilter only
while there are configured virtual services. This
saves CPU cycles while IPVS is loaded but not used.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In nft_pipapo_insert(), we need to reallocate scratch maps that will
be used for matching by lookup functions, if they have never been
allocated or if the bucket size changes as a result of the insertion.
As pipapo_realloc_scratch() provides a pair of fresh, zeroed out
maps, there's no need to select a particular one after reallocation.
Other than being useless, the existing assignment was also troubled
by the fact that the index was set only on the CPU performing the
actual insertion, as spotted by Florian.
Simply drop the assignment.
Reported-by: Florian Westphal <fw@strlen.de>
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
REJECT statement can be only used in INPUT, FORWARD and OUTPUT
chains. This patch adds support of REJECT, both icmp and tcp
reset, at PREROUTING stage.
The need for this patch comes from the requirement of some
forwarding devices to reject traffic before the natting and
routing decisions.
The main use case is to be able to send a graceful termination
to legitimate clients that, under any circumstances, the NATed
endpoints are not available. This option allows clients to
decide either to perform a reconnection or manage the error in
their side, instead of just dropping the connection and let
them die due to timeout.
It is supported ipv4, ipv6 and inet families for nft
infrastructure.
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use the dma_need_sync helper instead of (not always entirely correctly)
poking into the dma-mapping internals.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200629130359.2690853-5-hch@lst.de
->dev is already assigned at the top of the function, remove the duplicate
one at the end.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200629130359.2690853-4-hch@lst.de
Invert the polarity and better name the flag so that the use case is
properly documented.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200629130359.2690853-3-hch@lst.de
Currently, drivers can only tell whether the link is up/down using
LINKSTATE_GET, but no additional information is given.
Add attributes to LINKSTATE_GET command in order to allow drivers
to expose the user more information in addition to link state to ease
the debug process, for example, reason for link down state.
Extended state consists of two attributes - link_ext_state and
link_ext_substate. The idea is to avoid 'vendor specific' states in order
to prevent drivers to use specific link_ext_state that can be in the future
common link_ext_state.
The substates allows drivers to add more information to the common
link_ext_state. For example, vendor can expose 'Autoneg' as link_ext_state
and add 'No partner detected during force mode' as link_ext_substate.
If a driver cannot pinpoint the extended state with the substate
accuracy, it is free to expose only the extended state and omit the
substate attribute.
Signed-off-by: Amit Cohen <amitc@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since 'tcfp_burst' with TICK factor, driver side always need to recover
it to the original value, this patch moves the generic calculation and
recover to the 'burst' original value before offloading to device driver.
Signed-off-by: Po Liu <po.liu@nxp.com>
Acked-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
mptcp_poll always return POLLOUT for unblocking
connect(), ensure that the socket is a suitable
state.
The MPTCP_DATA_READY bit is never cleared on accept:
ensure we don't leave mptcp_accept() with an empty
accept queue and such bit set.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently __mptcp_tcp_fallback() always return NULL
on incoming connections, because MPTCP does not create
the additional socket for the first subflow.
Since the previous commit no __mptcp_tcp_fallback()
caller needs a struct socket, so let __mptcp_tcp_fallback()
return the first subflow sock and cope correctly even with
incoming connections.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This cleans the code a bit and makes the behavior more consistent.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This cleanup the code a bit and avoid corrupted states
on weird syscall sequence (accept(), connect()).
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Keep using MPTCP sockets and a use "dummy mapping" in case of fallback
to regular TCP. When fallback is triggered, skip addition of the MPTCP
option on send.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/11
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/22
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) Improve hardware layouts and structure for kTLS support
2) Generalize ICOSQ (Internal Channel Operations Send Queue)
Due to the asynchronous nature of adding new kTLS flows and handling
HW asynchronous kTLS resync requests, the XSK ICOSQ was extended to
support generic async operations, such as kTLS add flow and resync, in
addition to the existing XSK usages.
3) kTLS hardware flow steering and classification:
The driver already has the means to classify TCP ipv4/6 flows to send them
to the corresponding RSS HW engine, as reflected in patches 3 through 5,
the series will add a steering layer that will hook to the driver's TCP
classifiers and will match on well known kTLS connection, in case of a
match traffic will be redirected to the kTLS decryption engine, otherwise
traffic will continue flowing normally to the TCP RSS engine.
3) kTLS add flow RX HW offload support
New offload contexts post their static/progress params WQEs
(Work Queue Element) to communicate the newly added kTLS contexts
over the per-channel async ICOSQ.
The Channel/RQ is selected according to the socket's rxq index.
A new TLS-RX workqueue is used to allow asynchronous addition of
steering rules, out of the NAPI context.
It will be also used in a downstream patch in the resync procedure.
Feature is OFF by default. Can be turned on by:
$ ethtool -K <if> tls-hw-rx-offload on
4) Added mlx5 kTLS sw stats and new counters are documented in
Documentation/networking/tls-offload.rst
rx_tls_ctx - number of TLS RX HW offload contexts added to device for
decryption.
rx_tls_ooo - number of RX packets which were part of a TLS stream
but did not arrive in the expected order and triggered the resync
procedure.
rx_tls_del - number of TLS RX HW offload contexts deleted from device
(connection has finished).
rx_tls_err - number of RX packets which were part of a TLS stream
but were not decrypted due to unexpected error in the state machine.
5) Asynchronous RX resync
a. The NIC driver indicates that it would like to resync on some TLS
record within the received packet (P), but the driver does not
know (yet) which of the TLS records within the packet.
At this stage, the NIC driver will query the device to find the exact
TCP sequence for resync (tcpsn), however, the driver does not wait
for the device to provide the response.
b. Eventually, the device responds, and the driver provides the tcpsn
within the resync packet to KTLS. Now, KTLS can check the tcpsn against
any processed TLS records within packet P, and also against any record
that is processed in the future within packet P.
The asynchronous resync path simplifies the device driver, as it can
save bits on the packet completion (32-bit TCP sequence), and pass this
information on an asynchronous command instead.
Performance:
CPU: Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz, 24 cores, HT off
NIC: ConnectX-6 Dx 100GbE dual port
Goodput (app-layer throughput) comparison:
+---------------+-------+-------+---------+
| # connections | 1 | 4 | 8 |
+---------------+-------+-------+---------+
| SW (Gbps) | 7.26 | 24.70 | 50.30 |
+---------------+-------+-------+---------+
| HW (Gbps) | 18.50 | 64.30 | 92.90 |
+---------------+-------+-------+---------+
| Speedup | 2.55x | 2.56x | 1.85x * |
+---------------+-------+-------+---------+
* After linerate is reached, diff is observed in CPU util
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAl73s2kACgkQSD+KveBX
+j4wqAf/ZhcEn7i4N2F9wMMIL6wd4DgwKWWhbGpiREIxDwcRbqH7PGom8nBZMNd9
+3g3zfURvByWehLtYcjmMgR4B7+xDgEs0dSx6pQM9764HqLDV2jW8ENr9Vr/u8s1
hJ/eV8uzIfvx27MzbENZi0oJTw7N9nCgdcv1OyZkIba+Iado9pOeakPgBmTbINgo
46LJI9nIEROE15gfjyxrVeYAs3Nxt+bogQCWYfMqUfRmKcMJ0d4oTHaUdtmm+xQB
jC685/e4gE7jRgZ3qH/xvCZYp7+TVKaXsB0EtaJdPFEkvvvQpgPTfquIQ+6l7vvE
Yf1YUhnDOoxGUQy1CdSZ2reNxLIm8A==
=7+rG
-----END PGP SIGNATURE-----
Merge tag 'mlx5-tls-2020-06-26' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-tls-2020-06-26
1) Improve hardware layouts and structure for kTLS support
2) Generalize ICOSQ (Internal Channel Operations Send Queue)
Due to the asynchronous nature of adding new kTLS flows and handling
HW asynchronous kTLS resync requests, the XSK ICOSQ was extended to
support generic async operations, such as kTLS add flow and resync, in
addition to the existing XSK usages.
3) kTLS hardware flow steering and classification:
The driver already has the means to classify TCP ipv4/6 flows to send them
to the corresponding RSS HW engine, as reflected in patches 3 through 5,
the series will add a steering layer that will hook to the driver's TCP
classifiers and will match on well known kTLS connection, in case of a
match traffic will be redirected to the kTLS decryption engine, otherwise
traffic will continue flowing normally to the TCP RSS engine.
3) kTLS add flow RX HW offload support
New offload contexts post their static/progress params WQEs
(Work Queue Element) to communicate the newly added kTLS contexts
over the per-channel async ICOSQ.
The Channel/RQ is selected according to the socket's rxq index.
A new TLS-RX workqueue is used to allow asynchronous addition of
steering rules, out of the NAPI context.
It will be also used in a downstream patch in the resync procedure.
Feature is OFF by default. Can be turned on by:
$ ethtool -K <if> tls-hw-rx-offload on
4) Added mlx5 kTLS sw stats and new counters are documented in
Documentation/networking/tls-offload.rst
rx_tls_ctx - number of TLS RX HW offload contexts added to device for
decryption.
rx_tls_ooo - number of RX packets which were part of a TLS stream
but did not arrive in the expected order and triggered the resync
procedure.
rx_tls_del - number of TLS RX HW offload contexts deleted from device
(connection has finished).
rx_tls_err - number of RX packets which were part of a TLS stream
but were not decrypted due to unexpected error in the state machine.
5) Asynchronous RX resync
a. The NIC driver indicates that it would like to resync on some TLS
record within the received packet (P), but the driver does not
know (yet) which of the TLS records within the packet.
At this stage, the NIC driver will query the device to find the exact
TCP sequence for resync (tcpsn), however, the driver does not wait
for the device to provide the response.
b. Eventually, the device responds, and the driver provides the tcpsn
within the resync packet to KTLS. Now, KTLS can check the tcpsn against
any processed TLS records within packet P, and also against any record
that is processed in the future within packet P.
The asynchronous resync path simplifies the device driver, as it can
save bits on the packet completion (32-bit TCP sequence), and pass this
information on an asynchronous command instead.
Performance:
CPU: Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz, 24 cores, HT off
NIC: ConnectX-6 Dx 100GbE dual port
Goodput (app-layer throughput) comparison:
+---------------+-------+-------+---------+
| # connections | 1 | 4 | 8 |
+---------------+-------+-------+---------+
| SW (Gbps) | 7.26 | 24.70 | 50.30 |
+---------------+-------+-------+---------+
| HW (Gbps) | 18.50 | 64.30 | 92.90 |
+---------------+-------+-------+---------+
| Speedup | 2.55x | 2.56x | 1.85x * |
+---------------+-------+-------+---------+
* After linerate is reached, diff is observed in CPU util
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
genl_family_rcv_msg_attrs_parse() reuses the global family->attrbuf
when family->parallel_ops is false. However, family->attrbuf is not
protected by any lock on the genl_family_rcv_msg_doit() code path.
This leads to several different consequences, one of them is UAF,
like the following:
genl_family_rcv_msg_doit(): genl_start():
genl_family_rcv_msg_attrs_parse()
attrbuf = family->attrbuf
__nlmsg_parse(attrbuf);
genl_family_rcv_msg_attrs_parse()
attrbuf = family->attrbuf
__nlmsg_parse(attrbuf);
info->attrs = attrs;
cb->data = info;
netlink_unicast_kernel():
consume_skb()
genl_lock_dumpit():
genl_dumpit_info(cb)->attrs
Note family->attrbuf is an array of pointers to the skb data, once
the skb is freed, any dereference of family->attrbuf will be a UAF.
Maybe we could serialize the family->attrbuf with genl_mutex too, but
that would make the locking more complicated. Instead, we can just get
rid of family->attrbuf and always allocate attrbuf from heap like the
family->parallel_ops==true code path. This may add some performance
overhead but comparing with taking the global genl_mutex, it still
looks better.
Fixes: 75cdbdd089 ("net: ieee802154: have genetlink code to parse the attrs during dumpit")
Fixes: 057af70713 ("net: tipc: have genetlink code to parse the attrs during dumpit")
Reported-and-tested-by: syzbot+3039ddf6d7b13daf3787@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+80cad1e3cb4c41cde6ff@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+736bcbcb11b60d0c0792@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+520f8704db2b68091d44@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+c96e4dfb32f8987fdeed@syzkaller.appspotmail.com
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to allow acting on dropped and/or ECN-marked packets, add two new
qevents to the RED qdisc: "early_drop" and "mark". Filters attached at
"early_drop" block are executed as packets are early-dropped, those
attached at the "mark" block are executed as packets are ECN-marked.
Two new attributes are introduced: TCA_RED_EARLY_DROP_BLOCK with the block
index for the "early_drop" qevent, and TCA_RED_MARK_BLOCK for the "mark"
qevent. Absence of these attributes signifies "don't care": no block is
allocated in that case, or the existing blocks are left intact in case of
the change callback.
For purposes of offloading, blocks attached to these qevents appear with
newly-introduced binder types, FLOW_BLOCK_BINDER_TYPE_RED_EARLY_DROP and
FLOW_BLOCK_BINDER_TYPE_RED_MARK.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the following patches, RED will get two qevents. The implementation will
be clearer if the callback for change is not a pure subset of the callback
for init. Split the two and promote attribute parsing to the callbacks
themselves from the common code, because it will be handy there.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Qevents are attach points for TC blocks, where filters can be put that are
executed when "interesting events" take place in a qdisc. The data to keep
and the functions to invoke to maintain a qevent will be largely the same
between qevents. Therefore introduce sched-wide helpers for qevent
management.
Currently, similarly to ingress and egress blocks of clsact pseudo-qdisc,
blocks attachment cannot be changed after the qdisc is created. To that
end, add a helper tcf_qevent_validate_change(), which verifies whether
block index attribute is not attached, or if it is, whether its value
matches the current one (i.e. there is no material change).
The function tcf_qevent_handle() should be invoked when qdisc hits the
"interesting event" corresponding to a block. This function releases root
lock for the duration of executing the attached filters, to allow packets
generated through user actions (notably mirred) to be reinserted to the
same qdisc tree.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A following patch introduces qevents, points in qdisc algorithm where
packet can be processed by user-defined filters. Should this processing
lead to a situation where a new packet is to be enqueued on the same port,
holding the root lock would lead to deadlocks. To solve the issue, qevent
handler needs to unlock and relock the root lock when necessary.
To that end, add the root lock argument to the qdisc op enqueue, and
propagate throughout.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* TX control port status check fixed to not assume frame format
* mesh control port fixes
* error handling/leak fixes when starting AP, with HE attributes
* fix broadcast packet handling with encapsulation offload
* add new AKM suites
* and a small code cleanup
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl75+b4ACgkQB8qZga/f
l8QfpA//QLGvnLKLZwEpmGhqtp/ZvmX3jv24QsbxLO8do210/UqVojf8WrUUWq8P
5kccKp1wxbobuUVzviAmSRRfYIkXQjFsz82dV1hei2xVhhOB1rHavCPqkorZIX+s
p3a9WzaqRwgK+chGtGGDEGjkx31+Ve8tRkvRmIfhxX+r1mAF1jI8gwlsUmX7jTfN
VfbiLP1pbyNemNBqrugjhOmprTqUanfaY9alJVEaglQAild7YcFdmqbpKEBogQk7
0KPnYRsD0w3pv4ewLeBypnz4+kD2+JPaPkgAeq1ppi9YcvFwTalTRaP2P4miL0+x
lxfsT8eSlCls/FTEb7UheWP1Xt4ym0xAQh9e9VaWavzvbw2l6oHtCOEFL9YQrOa2
xSmNJBvle82DXy38RSFpcsaQSHAS3TAJyEeS1Z1b0D7IzBudpbwDLE828FXwylDP
eTnlNy7zrYtKwazhtI0uVB2K/DrEWmwqnyPmX75AiP/3FmiDI4mxmIj0hwpzC27e
jfl2d7qWLRQ8WVs0mE5u7QnNz2sZ3trq6TeO6NwLDIjdm65GpiQB2kpS6DsmIOiu
phpS8vnGAC9ZNThIDvFPYvEBTkRQ+hzrsonwIU54PEVj3RMAqQaL3RPrN7UFfmVK
fDZ21rjzujOchDZNCzwIpXMeAz1EzyNOzBcYwDs2bOoSKD5nYdE=
=rzWh
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-net-2020-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Couple of fixes/small things:
* TX control port status check fixed to not assume frame format
* mesh control port fixes
* error handling/leak fixes when starting AP, with HE attributes
* fix broadcast packet handling with encapsulation offload
* add new AKM suites
* and a small code cleanup
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Even if that's only a warning, not including asm/cacheflush.h
leads to svc_flush_bvec() being empty allthough powerpc defines
ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE.
CC net/sunrpc/svcsock.o
net/sunrpc/svcsock.c:227:5: warning: "ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE" is not defined [-Wundef]
#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
^
Include linux/highmem.h so that asm/cacheflush.h will be included.
Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reported-by: kernel test robot <lkp@intel.com>
Fixes: ca07eda33e ("SUNRPC: Refactor svc_recvfrom()")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The lockdep annotations for dev_uc_unsync() and dev_mc_unsync()
are not easy to understand, so add some comments to explain
why they are correct.
Similar for the rest netif_addr_lock_bh() cases, they don't
need nested version.
Cc: Taehee Yoo <ap420073@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
lockdep_set_class_and_subclass() is meant to reduce
the _nested() annotations by assigning a default subclass.
For addr_list_lock, we have to compute the subclass at
run-time as the netdevice topology changes after creation.
So, we should just get rid of these
lockdep_set_class_and_subclass() and stick with our _nested()
annotations.
Fixes: 845e0ebb44 ("net: change addr_list_lock back to static key")
Suggested-by: Taehee Yoo <ap420073@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
which is a typedef for an enum type, but the implementation in this
driver returns an 'int'.
Fix this by returning 'netdev_tx_t' in this driver too.
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
which is a typedef for an enum type, but the implementation in this
driver returns an 'int'.
Fix this by returning 'netdev_tx_t' in this driver too.
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If an ingress verdict program specifies message sizes greater than
skb->len and there is an ENOMEM error due to memory pressure we
may call the rcv_msg handler outside the strp_data_ready() caller
context. This is because on an ENOMEM error the strparser will
retry from a workqueue. The caller currently protects the use of
psock by calling the strp_data_ready() inside a rcu_read_lock/unlock
block.
But, in above workqueue error case the psock is accessed outside
the read_lock/unlock block of the caller. So instead of using
psock directly we must do a look up against the sk again to
ensure the psock is available.
There is an an ugly piece here where we must handle
the case where we paused the strp and removed the psock. On
psock removal we first pause the strparser and then remove
the psock. If the strparser is paused while an skb is
scheduled on the workqueue the skb will be dropped on the
flow and kfree_skb() is called. If the workqueue manages
to get called before we pause the strparser but runs the rcvmsg
callback after the psock is removed we will hit the unlikely
case where we run the sockmap rcvmsg handler but do not have
a psock. For now we will follow strparser logic and drop the
skb on the floor with skb_kfree(). This is ugly because the
data is dropped. To date this has not caused problems in practice
because either the application controlling the sockmap is
coordinating with the datapath so that skbs are "flushed"
before removal or we simply wait for the sock to be closed before
removing it.
This patch fixes the describe RCU bug and dropping the skb doesn't
make things worse. Future patches will improve this by allowing
the normal case where skbs are not merged to skip the strparser
altogether. In practice many (most?) use cases have no need to
merge skbs so its both a code complexity hit as seen above and
a performance issue. For example, in the Cilium case we always
set the strparser up to return sbks 1:1 without any merging and
have avoided above issues.
Fixes: e91de6afa8 ("bpf: Fix running sk_skb program types with ktls")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/159312679888.18340.15248924071966273998.stgit@john-XPS-13-9370
There are two paths to generate the below RCU splat the first and
most obvious is the result of the BPF verdict program issuing a
redirect on a TLS socket (This is the splat shown below). Unlike
the non-TLS case the caller of the *strp_read() hooks does not
wrap the call in a rcu_read_lock/unlock. Then if the BPF program
issues a redirect action we hit the RCU splat.
However, in the non-TLS socket case the splat appears to be
relatively rare, because the skmsg caller into the strp_data_ready()
is wrapped in a rcu_read_lock/unlock. Shown here,
static void sk_psock_strp_data_ready(struct sock *sk)
{
struct sk_psock *psock;
rcu_read_lock();
psock = sk_psock(sk);
if (likely(psock)) {
if (tls_sw_has_ctx_rx(sk)) {
psock->parser.saved_data_ready(sk);
} else {
write_lock_bh(&sk->sk_callback_lock);
strp_data_ready(&psock->parser.strp);
write_unlock_bh(&sk->sk_callback_lock);
}
}
rcu_read_unlock();
}
If the above was the only way to run the verdict program we
would be safe. But, there is a case where the strparser may throw an
ENOMEM error while parsing the skb. This is a result of a failed
skb_clone, or alloc_skb_for_msg while building a new merged skb when
the msg length needed spans multiple skbs. This will in turn put the
skb on the strp_wrk workqueue in the strparser code. The skb will
later be dequeued and verdict programs run, but now from a
different context without the rcu_read_lock()/unlock() critical
section in sk_psock_strp_data_ready() shown above. In practice
I have not seen this yet, because as far as I know most users of the
verdict programs are also only working on single skbs. In this case no
merge happens which could trigger the above ENOMEM errors. In addition
the system would need to be under memory pressure. For example, we
can't hit the above case in selftests because we missed having tests
to merge skbs. (Added in later patch)
To fix the below splat extend the rcu_read_lock/unnlock block to
include the call to sk_psock_tls_verdict_apply(). This will fix both
TLS redirect case and non-TLS redirect+error case. Also remove
psock from the sk_psock_tls_verdict_apply() function signature its
not used there.
[ 1095.937597] WARNING: suspicious RCU usage
[ 1095.940964] 5.7.0-rc7-02911-g463bac5f1ca79 #1 Tainted: G W
[ 1095.944363] -----------------------------
[ 1095.947384] include/linux/skmsg.h:284 suspicious rcu_dereference_check() usage!
[ 1095.950866]
[ 1095.950866] other info that might help us debug this:
[ 1095.950866]
[ 1095.957146]
[ 1095.957146] rcu_scheduler_active = 2, debug_locks = 1
[ 1095.961482] 1 lock held by test_sockmap/15970:
[ 1095.964501] #0: ffff9ea6b25de660 (sk_lock-AF_INET){+.+.}-{0:0}, at: tls_sw_recvmsg+0x13a/0x840 [tls]
[ 1095.968568]
[ 1095.968568] stack backtrace:
[ 1095.975001] CPU: 1 PID: 15970 Comm: test_sockmap Tainted: G W 5.7.0-rc7-02911-g463bac5f1ca79 #1
[ 1095.977883] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 1095.980519] Call Trace:
[ 1095.982191] dump_stack+0x8f/0xd0
[ 1095.984040] sk_psock_skb_redirect+0xa6/0xf0
[ 1095.986073] sk_psock_tls_strp_read+0x1d8/0x250
[ 1095.988095] tls_sw_recvmsg+0x714/0x840 [tls]
v2: Improve commit message to identify non-TLS redirect plus error case
condition as well as more common TLS case. In the process I decided
doing the rcu_read_unlock followed by the lock/unlock inside branches
was unnecessarily complex. We can just extend the current rcu block
and get the same effeective without the shuffling and branching.
Thanks Martin!
Fixes: e91de6afa8 ("bpf: Fix running sk_skb program types with ktls")
Reported-by: Jakub Sitnicki <jakub@cloudflare.com>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/159312677907.18340.11064813152758406626.stgit@john-XPS-13-9370
We can't cast sk_buff to rtable by (struct rtable *)hint. Use skb_rtable().
Fixes: 02b2494161 ("ipv4: use dst hint for ipv4 list receive")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>