commit dc5e8c99975bb1a1561de884a83b3c19e4ac7ada
Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date:   Fri Feb 8 11:25:33 2019 +0100

    Linux 4.4.174

commit 60c7f8fca1bbc6ac677a76f25843202b91f52abb
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Mon Jan 11 16:29:29 2016 -0800

    rcu: Force boolean subscript for expedited stall warnings
    
    commit ec3833ed02ae6ef2a933ece9de7cbab0c64c699e upstream.
    
    The cpu_online() function can return values other than 0 and 1, which
    can result in subscript overflow when applied to a two-element array.
    This commit allows for this behavior by using "!!" on the return value
    from cpu_online() when used as a subscript.
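
    A standalone illustration of the hazard (fake_cpu_online() and the
    tag array are stand-ins for demonstration, not the kernel code):

      #include <stdio.h>

      /* Like the kernel's cpu_online(), this may return any non-zero
       * value for "online", not necessarily 1. */
      static int fake_cpu_online(int cpu) { return cpu ? 4 : 0; }

      int main(void)
      {
              static const char tag[2] = { '.', 'O' }; /* offline/online */
              int cpu = 3;

              /* tag[fake_cpu_online(cpu)] would read tag[4], out of
               * bounds; "!!" collapses the value to 0 or 1. */
              printf("cpu %d: %c\n", cpu, tag[!!fake_cpu_online(cpu)]);
              return 0;
      }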
    
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: "Rantala, Tommi" <tommi.t.rantala@nokia.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 873beab3740a085ba284ed7daa4ab4068074db53
Author: Michal Kubecek <mkubecek@suse.cz>
Date:   Thu Dec 13 17:23:32 2018 +0100

    net: ipv4: do not handle duplicate fragments as overlapping
    
    commit ade446403bfb79d3528d56071a84b15351a139ad upstream.
    
    Since commit 7969e5c40dfd ("ip: discard IPv4 datagrams with overlapping
    segments.") IPv4 reassembly code drops the whole queue whenever an
    overlapping fragment is received. However, the test is written in a way
    which detects duplicate fragments as overlapping so that in environments
    with many duplicate packets, fragmented packets may be undeliverable.
    
    Add an extra test and, for a (potentially) duplicate fragment, drop
    only the new fragment rather than the whole queue. Only the starting
    offset and length are checked, not the contents of the fragments, as
    that would be too expensive. For the same reason, the linear list
    ("run") of an rbtree node is not iterated; we only check whether the
    new fragment is a subset of the interval covered by existing
    consecutive fragments.
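
    A conceptual sketch of the added test (run_start/run_end and the
    labels are illustrative, not the actual 4.4 code):

      /* [offset, end) is the new fragment; [run_start, run_end) is the
       * interval covered by an existing run of consecutive fragments. */
      if (offset >= run_start && end <= run_end)
              goto discard_fragment;  /* duplicate: drop only this skb */
      goto discard_qp;                /* true overlap: drop the queue */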
    
    v2: instead of an exact check iterating through linear list of an rbtree
    node, only check if the new fragment is subset of the "run" (suggested
    by Eric Dumazet)
    
    Fixes: 7969e5c40dfd ("ip: discard IPv4 datagrams with overlapping segments.")
    Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Mao Wenan <maowenan@huawei.com>
    [bwh: Backported to 4.4:
     - goto discard_qp, not err, in case of overlap
     - Set the err variable earlier, as done upstream in commit
       0ff89efb5246 "ip: fail fast on IP defrag errors"]
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 0836bdfee568a417ffba238fc180bf5c00c648e3
Author: Dimitris Michailidis <dmichail@google.com>
Date:   Fri Oct 19 17:07:13 2018 -0700

    net: fix pskb_trim_rcsum_slow() with odd trim offset
    
    commit d55bef5059dd057bd077155375c581b49d25be7e upstream.
    
    We've been getting checksum errors involving small UDP packets, usually
    59B packets with 1 extra non-zero padding byte. netdev_rx_csum_fault()
    has been complaining that HW is providing bad checksums. Turns out the
    problem is in pskb_trim_rcsum_slow(), introduced in commit 88078d98d1bb
    ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends").
    
    The source of the problem is that when the bytes we are trimming start
    at an odd address, as in the case of the 1 padding byte above,
    skb_checksum() returns a byte-swapped value. We cannot just combine this
    with skb->csum using csum_sub(). We need to use csum_block_sub() here
    that takes into account the parity of the start address and handles the
    swapping.
    
    Matches existing code in __skb_postpull_rcsum() and esp_remove_trailer().
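
    A sketch of the fixed pattern, paraphrased from the description
    above (not the literal 4.4 diff):

      if (skb->ip_summed == CHECKSUM_COMPLETE) {
              int delta = skb->len - len;     /* bytes being trimmed */

              /* skb_checksum() over a range starting at an odd address
               * is byte-swapped relative to skb->csum, so csum_sub() is
               * not enough; csum_block_sub() takes the start offset and
               * handles the swap. */
              skb->csum = csum_block_sub(skb->csum,
                                         skb_checksum(skb, len, delta, 0),
                                         len);
      }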
    
    Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
    Signed-off-by: Dimitris Michailidis <dmichail@google.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit e92b8475d6d7319bb69c3bdf307f3d9384c3776a
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Nov 8 17:34:27 2018 -0800

    inet: frags: better deal with smp races
    
    commit 0d5b9311baf27bb545f187f12ecfd558220c607d upstream.
    
    Multiple cpus might attempt to insert a new fragment into the rhashtable,
    if for example RPS is buggy, as reported by 배석진 in
    https://patchwork.ozlabs.org/patch/994601/
    
    We use rhashtable_lookup_get_insert_key() instead of
    rhashtable_insert_fast() to let cpus losing the race
    free their own inet_frag_queue and use the one that
    was inserted by another cpu.
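
    A hedged sketch of the resulting pattern (field names illustrative;
    rhashtable_lookup_get_insert_key() itself is added by a later commit
    in this log):

      prev = rhashtable_lookup_get_insert_key(&nf->rhashtable, &q->key,
                                              &q->node, f->rhash_params);
      if (!prev)
              return q;           /* we won the race: our queue is in */
      if (IS_ERR(prev))
              return NULL;        /* insertion failed outright */
      inet_frag_destroy(q);       /* lost the race: free our queue... */
      return prev;                /* ...and use the one already inserted */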
    
    Fixes: 648700f76b03 ("inet: frags: use rhashtables for reassembly units")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: 배석진 <soukjin.bae@samsung.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit acd00a0692072b374afee4b6f38c1eb1c6cf6f4a
Author: Dan Carpenter <dan.carpenter@oracle.com>
Date:   Wed Oct 10 12:30:17 2018 -0700

    ipv4: frags: precedence bug in ip_expire()
    
    commit 70837ffe3085c9a91488b52ca13ac84424da1042 upstream.
    
    We accidentally removed the parentheses here, but they are required
    because '!' has higher precedence than '&'.
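
    A tiny standalone illustration of the precedence trap (the flag
    value is made up for the demo):

      #include <stdio.h>

      #define INET_FRAG_COMPLETE 4   /* illustrative value only */

      int main(void)
      {
              unsigned int flags = 0;        /* queue not complete */

              /* '!' binds tighter than '&': (!flags) & 4 is always 0,
               * since !flags is 0 or 1 and bit 2 is never set in it. */
              printf("buggy:   %d\n", !flags & INET_FRAG_COMPLETE);
              /* The parenthesized form tests the intended bit. */
              printf("correct: %d\n", !(flags & INET_FRAG_COMPLETE));
              return 0;
      }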
    
    Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
    Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Mao Wenan <maowenan@huawei.com>
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit cb5fd4aa24b57206548d5940dc359f0b181a2688
Author: Taehee Yoo <ap420073@gmail.com>
Date:   Wed Oct 10 12:30:16 2018 -0700

    ip: frags: fix crash in ip_do_fragment()
    
    commit 5d407b071dc369c26a38398326ee2be53651cfe4 upstream.
    
    A kernel crash occurs when a defragmented packet is fragmented
    in ip_do_fragment().
    In the defragment routine, skb_orphan() is called and
    skb->ip_defrag_offset is set. But skb->sk and
    skb->ip_defrag_offset share the same union member, so
    frag->sk is not NULL.
    Hence the crash occurs in the skb->sk check in ip_do_fragment()
    when a defragmented packet is fragmented.
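
    A toy userspace demo of the union aliasing described above (the
    struct is a stand-in, not struct sk_buff):

      #include <stdio.h>

      struct toy_skb {
              union {
                      void *sk;               /* stand-in for skb->sk */
                      int   ip_defrag_offset; /* shares the same bytes */
              };
      };

      int main(void)
      {
              struct toy_skb skb = { .sk = NULL };

              skb.ip_defrag_offset = 64;      /* set during defrag */
              /* skb.sk is now non-NULL, so a later BUG_ON(skb->sk)
               * style check fires although no socket was attached. */
              printf("sk looks %s\n", skb.sk ? "set" : "NULL");
              return 0;
      }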
    
    test commands:
       %iptables -t nat -I POSTROUTING -j MASQUERADE
       %hping3 192.168.4.2 -s 1000 -p 2000 -d 60000
    
    splat looks like:
    [  261.069429] kernel BUG at net/ipv4/ip_output.c:636!
    [  261.075753] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
    [  261.083854] CPU: 1 PID: 1349 Comm: hping3 Not tainted 4.19.0-rc2+ #3
    [  261.100977] RIP: 0010:ip_do_fragment+0x1613/0x2600
    [  261.106945] Code: e8 e2 38 e3 fe 4c 8b 44 24 18 48 8b 74 24 08 e9 92 f6 ff ff 80 3c 02 00 0f 85 da 07 00 00 48 8b b5 d0 00 00 00 e9 25 f6 ff ff <0f> 0b 0f 0b 44 8b 54 24 58 4c 8b 4c 24 18 4c 8b 5c 24 60 4c 8b 6c
    [  261.127015] RSP: 0018:ffff8801031cf2c0 EFLAGS: 00010202
    [  261.134156] RAX: 1ffff1002297537b RBX: ffffed0020639e6e RCX: 0000000000000004
    [  261.142156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880114ba9bd8
    [  261.150157] RBP: ffff880114ba8a40 R08: ffffed0022975395 R09: ffffed0022975395
    [  261.158157] R10: 0000000000000001 R11: ffffed0022975394 R12: ffff880114ba9ca4
    [  261.166159] R13: 0000000000000010 R14: ffff880114ba9bc0 R15: dffffc0000000000
    [  261.174169] FS:  00007fbae2199700(0000) GS:ffff88011b400000(0000) knlGS:0000000000000000
    [  261.183012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [  261.189013] CR2: 00005579244fe000 CR3: 0000000119bf4000 CR4: 00000000001006e0
    [  261.198158] Call Trace:
    [  261.199018]  ? dst_output+0x180/0x180
    [  261.205011]  ? save_trace+0x300/0x300
    [  261.209018]  ? ip_copy_metadata+0xb00/0xb00
    [  261.213034]  ? sched_clock_local+0xd4/0x140
    [  261.218158]  ? kill_l4proto+0x120/0x120 [nf_conntrack]
    [  261.223014]  ? rt_cpu_seq_stop+0x10/0x10
    [  261.227014]  ? find_held_lock+0x39/0x1c0
    [  261.233008]  ip_finish_output+0x51d/0xb50
    [  261.237006]  ? ip_fragment.constprop.56+0x220/0x220
    [  261.243011]  ? nf_ct_l4proto_register_one+0x5b0/0x5b0 [nf_conntrack]
    [  261.250152]  ? rcu_is_watching+0x77/0x120
    [  261.255010]  ? nf_nat_ipv4_out+0x1e/0x2b0 [nf_nat_ipv4]
    [  261.261033]  ? nf_hook_slow+0xb1/0x160
    [  261.265007]  ip_output+0x1c7/0x710
    [  261.269005]  ? ip_mc_output+0x13f0/0x13f0
    [  261.273002]  ? __local_bh_enable_ip+0xe9/0x1b0
    [  261.278152]  ? ip_fragment.constprop.56+0x220/0x220
    [  261.282996]  ? nf_hook_slow+0xb1/0x160
    [  261.287007]  raw_sendmsg+0x21f9/0x4420
    [  261.291008]  ? dst_output+0x180/0x180
    [  261.297003]  ? sched_clock_cpu+0x126/0x170
    [  261.301003]  ? find_held_lock+0x39/0x1c0
    [  261.306155]  ? stop_critical_timings+0x420/0x420
    [  261.311004]  ? check_flags.part.36+0x450/0x450
    [  261.315005]  ? _raw_spin_unlock_irq+0x29/0x40
    [  261.320995]  ? _raw_spin_unlock_irq+0x29/0x40
    [  261.326142]  ? cyc2ns_read_end+0x10/0x10
    [  261.330139]  ? raw_bind+0x280/0x280
    [  261.334138]  ? sched_clock_cpu+0x126/0x170
    [  261.338995]  ? check_flags.part.36+0x450/0x450
    [  261.342991]  ? __lock_acquire+0x4500/0x4500
    [  261.348994]  ? inet_sendmsg+0x11c/0x500
    [  261.352989]  ? dst_output+0x180/0x180
    [  261.357012]  inet_sendmsg+0x11c/0x500
    [ ... ]
    
    v2:
     - clear skb->sk in the reassembly routine. (Eric Dumazet)
    
    Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Taehee Yoo <ap420073@gmail.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Mao Wenan <maowenan@huawei.com>
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 2822475e70db5a4b46de88a5b66eb2aceb3734af
Author: Peter Oskolkov <posk@google.com>
Date:   Wed Oct 10 12:30:15 2018 -0700

    ip: process in-order fragments efficiently
    
    commit a4fd284a1f8fd4b6c59aa59db2185b1e17c5c11c upstream.
    
    This patch changes the runtime behavior of IP defrag queue:
    incoming in-order fragments are added to the end of the current
    list/"run" of in-order fragments at the tail.
    
    On some workloads, UDP stream performance is substantially improved:
    
    RX: ./udp_stream -F 10 -T 2 -l 60
    TX: ./udp_stream -c -H <host> -F 10 -T 5 -l 60
    
    with this patchset applied on a 10Gbps receiver:
    
      throughput=9524.18
      throughput_units=Mbit/s
    
    upstream (net-next):
    
      throughput=4608.93
      throughput_units=Mbit/s
    
    Reported-by: Willem de Bruijn <willemb@google.com>
    Signed-off-by: Peter Oskolkov <posk@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Florian Westphal <fw@strlen.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Mao Wenan <maowenan@huawei.com>
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 2039bd8669f4ecefca163f0c9d8c5f5f6a4c8610
Author: Peter Oskolkov <posk@google.com>
Date:   Wed Oct 10 12:30:14 2018 -0700

    ip: add helpers to process in-order fragments faster.
    
    commit 353c9cb360874e737fb000545f783df756c06f9a upstream.
    
    This patch introduces several helper functions/macros that will be
    used in the follow-up patch. No runtime changes yet.
    
    The new logic (fully implemented in the second patch) is as follows
    (a code sketch follows the list):
    
    * Nodes in the rb-tree will now contain not single fragments, but lists
      of consecutive fragments ("runs").
    
    * At each point in time, the current "active" run at the tail is
      maintained/tracked. Fragments that arrive in-order, adjacent
      to the previous tail fragment, are added to this tail run without
      triggering the re-balancing of the rb-tree.
    
    * If a fragment arrives out of order with the offset _before_ the tail run,
      it is inserted into the rb-tree as a single fragment.
    
    * If a fragment arrives after the current tail fragment (with a gap),
      it starts a new "tail" run and is inserted into the rb-tree
      at the end as the head of the new run.
    
    skb->cb is used to store additional information
    needed here (suggested by Eric Dumazet).
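
    A decision sketch of the list above (helper names are illustrative;
    the real code lands in the follow-up patch):

      if (offset == tail_run_end)             /* in-order, adjacent */
              append_to_tail_run(q, skb);     /* no rb-tree rebalance */
      else if (offset < tail_run_start)       /* before the tail run */
              rb_insert_single_fragment(q, skb);
      else                                    /* gap after the tail */
              start_new_tail_run(q, skb);     /* new node, new run */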
    
    Reported-by: Willem de Bruijn <willemb@google.com>
    Signed-off-by: Peter Oskolkov <posk@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Florian Westphal <fw@strlen.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Mao Wenan <maowenan@huawei.com>
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 3f78a3f45e79ca378cb850a598e4c76633710e92
Author: Peter Oskolkov <posk@google.com>
Date:   Wed Oct 10 12:30:13 2018 -0700

    ip: use rb trees for IP frag queue.
    
    commit fa0f527358bd900ef92f925878ed6bfbd51305cc upstream.
    
    Similar to TCP OOO RX queue, it makes sense to use rb trees to store
    IP fragments, so that OOO fragments are inserted faster.
    
    Tested:
    
    - a follow-up patch contains a rather comprehensive ip defrag
      self-test (functional)
    - ran neper `udp_stream -c -H <host> -F 100 -l 300 -T 20`:
        netstat --statistics
        Ip:
            282078937 total packets received
            0 forwarded
            0 incoming packets discarded
            946760 incoming packets delivered
            18743456 requests sent out
            101 fragments dropped after timeout
            282077129 reassemblies required
            944952 packets reassembled ok
            262734239 packet reassembles failed
       (The numbers/stats above are somewhat better re:
        reassemblies vs a kernel without this patchset. More
        comprehensive performance testing TBD).
    
    Reported-by: Jann Horn <jannh@google.com>
    Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi>
    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Peter Oskolkov <posk@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Florian Westphal <fw@strlen.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Mao Wenan <maowenan@huawei.com>
    [bwh: Backported to 4.4:
     - Keep using frag_kfree_skb() in inet_frag_destroy()
     - Adjust context]
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 7fab8b2f0c994decf580027286b97533c8b7f6fd
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Oct 10 12:30:11 2018 -0700

    net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
    
    commit 88078d98d1bb085d72af8437707279e203524fa5 upstream.
    
    After working on IP defragmentation lately, I found that some large
    packets defeat the CHECKSUM_COMPLETE optimization because of the NIC
    adding zero padding on the last (small) fragment.
    
    While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed
    to CHECKSUM_NONE, forcing a full csum validation, even if all prior
    fragments had CHECKSUM_COMPLETE set.
    
    We can instead compute the checksum of the part we are trimming,
    usually smaller than the part we keep.
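
    The idea, sketched (paraphrased, not the literal patch):

      /* Instead of degrading to CHECKSUM_NONE on trim, subtract the
       * checksum of the trimmed tail from the running skb->csum: */
      skb->csum = csum_sub(skb->csum,
                           skb_checksum(skb, len, skb->len - len, 0));

    (The "net: fix pskb_trim_rcsum_slow() with odd trim offset" commit,
    also in this release, later replaces csum_sub() with csum_block_sub()
    to handle trims starting at an odd address.)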
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 5f2d68b6b5a439c3223d8fa6ba20736f91fc58d8
Author: Florian Westphal <fw@strlen.de>
Date:   Wed Oct 10 12:30:10 2018 -0700

    ipv6: defrag: drop non-last frags smaller than min mtu
    
    commit 0ed4229b08c13c84a3c301a08defdc9e7f4467e6 upstream.
    
    Don't bother with pathological cases; they only waste cycles.
    IPv6 requires a minimum MTU of 1280, so we should never see fragments
    smaller than this (except the last frag).
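
    A sketch of the check (hedged; the 4.4 backport applies it in
    nf_ct_frag6_gather() as noted below):

      /* A non-last fragment (MF set) shorter than the IPv6 minimum
       * MTU of 1280 cannot come from a conforming sender: drop it. */
      if (fhdr->frag_off & htons(IP6_MF) &&
          skb->len - skb_network_offset(skb) < IPV6_MIN_MTU)
              goto drop;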
    
    v3: don't use awkward "-offset + len"
    v2: drop IPv4 part, which added same check w. IPV4_MIN_MTU (68).
        There were concerns that there could be even smaller frags
        generated by intermediate nodes, e.g. on radio networks.
    
    Cc: Peter Oskolkov <posk@google.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Mao Wenan <maowenan@huawei.com>
    [bwh: Backported to 4.4: In nf_ct_frag6_gather() use clone instead of skb,
     and goto ret_orig in case of error]
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 26cfea3c1d041d08edacae291565f295553e15ce
Author: Peter Oskolkov <posk@google.com>
Date:   Wed Oct 10 12:30:09 2018 -0700

    net: modify skb_rbtree_purge to return the truesize of all purged skbs.
    
    commit 385114dec8a49b5e5945e77ba7de6356106713f4 upstream.
    
    Tested: see the next patch in the series.
    
    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Peter Oskolkov <posk@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Florian Westphal <fw@strlen.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Mao Wenan <maowenan@huawei.com>
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit ef0f963de1d2c5bc99d3d6ace3dd44a7d6002717
Author: Peter Oskolkov <posk@google.com>
Date:   Wed Oct 10 12:30:07 2018 -0700

    ip: discard IPv4 datagrams with overlapping segments.
    
    commit 7969e5c40dfd04799d4341f1b7cd266b6e47f227 upstream.
    
    This behavior is required in IPv6, and there is little need
    to tolerate overlapping fragments in IPv4. This change
    simplifies the code and eliminates potential DDoS attack vectors.
    
    Tested: ran ip_defrag selftest (not yet available upstream).
    
    Suggested-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Peter Oskolkov <posk@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Florian Westphal <fw@strlen.de>
    Acked-by: Stephen Hemminger <stephen@networkplumber.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Mao Wenan <maowenan@huawei.com>
    [bwh: Backported to 4.4:
     - s/__IP_INC_STATS/IP_INC_STATS_BH/
     - Deleted code is slightly different]
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit bf40801a018bbe8f2076fab89f538cc05eb15c03
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Oct 10 12:30:06 2018 -0700

    inet: frags: fix ip6frag_low_thresh boundary
    
    commit 3d23401283e80ceb03f765842787e0e79ff598b7 upstream.
    
    Giving an integer to proc_doulongvec_minmax() is dangerous on 64bit
    arches, since the linker might place a non-zero value next to it,
    preventing any change to ip6frag_low_thresh.
    
    ip6frag_low_thresh is not used anymore in the kernel, but we do not
    want to prematurely break user scripts that want to change it.
    
    Since specifying a minimal value of 0 for proc_doulongvec_minmax()
    is moot, let's remove these zero values in all defrag units.
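
    A toy userspace illustration of the hazard (struct layout and values
    are made up):

      #include <stdio.h>
      #include <string.h>

      int main(void)
      {
              /* proc_doulongvec_minmax() reads sizeof(long) bytes; if
               * the sysctl variable is a 4-byte int, whatever the linker
               * placed after it becomes the high half of the value. */
              struct { int low_thresh; int next_var; } vars = { 0, 1 };
              long as_long;

              memcpy(&as_long, &vars, sizeof(as_long));
              printf("0x%lx\n", as_long); /* 0x100000000 on LE 64bit */
              return 0;
      }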
    
    Fixes: 6e00f7dd5e4e ("ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: Maciej Żenczykowski <maze@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 826ff799146685450f84e3158ce66499c928c8ea
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Oct 10 12:30:05 2018 -0700

    inet: frags: get rid of ipfrag_skb_cb/FRAG_CB
    
    commit bf66337140c64c27fa37222b7abca7e49d63fb57 upstream.
    
    ip_defrag uses skb->cb[] to store the fragment offset, and unfortunately
    this integer is currently in a different cache line than skb->next,
    meaning that we use two cache lines per skb when finding the insertion point.
    
    By aliasing skb->ip_defrag_offset and skb->dev, we pack all the fields
    in a single cache line and save precious memory bandwidth.
    
    Note that after the fast path added by Changli Gao in commit
    d6bebca92c66 ("fragment: add fast path for in-order fragments"),
    this change won't help the fast path, since we still need
    to access prev->len (2nd cache line), but it will show great
    benefits when the slow path is entered, since we perform
    a linear scan of a potentially long list.
    
    Also, note that this potential long list is an attack vector,
    we might consider also using an rb-tree there eventually.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 29ff723c549906a5d8d64dd406d1b1b4da0eb85b
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Oct 10 12:30:04 2018 -0700

    inet: frags: reorganize struct netns_frags
    
    commit c2615cf5a761b32bf74e85bddc223dfff3d9b9f0 upstream.
    
    Put the read-mostly fields in a separate cache line
    at the beginning of struct netns_frags, to reduce
    false sharing noticed in inet_frag_kill()
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    [bwh: Backported to 4.4: adjust context]
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 33990010ea40454a41c4c5dc78d26165579578ca
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Oct 10 12:30:03 2018 -0700

    rhashtable: reorganize struct rhashtable layout
    
    commit e5d672a0780d9e7118caad4c171ec88b8299398d upstream.
    
    While under a frags DDOS I noticed unfortunate false sharing between
    @nelems and @params.automatic_shrinking.
    
    Move @nelems to the end of struct rhashtable so that the first cache
    line is shared between all cpus, because it is almost never dirtied.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit bf8187348f264b720d5bc165703afeb1dee36f14
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Oct 10 12:30:02 2018 -0700

    ipv6: frags: rewrite ip6_expire_frag_queue()
    
    commit 05c0b86b9696802fd0ce5676a92a63f1b455bdf3 upstream.
    
    Make it similar to IPv4 ip_expire(), and release the lock
    before calling icmp functions.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    [bwh: Backported to 4.4: adjust context]
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit f925a29652a00e312d373b19f177af17be4ba5be
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Oct 10 12:30:01 2018 -0700

    inet: frags: do not clone skb in ip_expire()
    
    commit 1eec5d5670084ee644597bd26c25e22c69b9f748 upstream.
    
    An skb_clone() was added in commit ec4fbd64751d ("inet: frag: release
    spinlock before calling icmp_send()")
    
    While fixing the bug at that time, it also added a very high cost
    for DDOS frags, as the ICMP rate limit is applied after this
    expensive operation (skb_clone() + consume_skb(), implying memory
    allocations, copying, and freeing).
    
    We can use skb_get(head) here; all we want is to make sure the skb
    won't be freed by another cpu.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 567ef0554b91de121e9c1ad6b30d0077a5ea1fbf
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Oct 10 12:30:00 2018 -0700

    inet: frags: break the 2GB limit for frags storage
    
    commit 3e67f106f619dcfaf6f4e2039599bdb69848c714 upstream.
    
    Some users are willing to provision huge amounts of memory to be able
    to perform reassembly reasonably well under pressure.
    
    Current memory tracking is using one atomic_t and integers.
    
    Switch to atomic_long_t so that 64bit arches can use more than 2GB,
    without any cost for 32bit arches.
    
    Note that this patch avoids an overflow error if high_thresh was set
    to ~2GB, since this test in inet_frag_alloc() was never true:
    
    if (... || frag_mem_limit(nf) > nf->high_thresh)
    
    Tested:
    
    $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh
    
    <frag DDOS>
    
    $ grep FRAG /proc/net/sockstat
    FRAG: inuse 14705885 memory 16000002880
    
    $ nstat -n ; sleep 1 ; nstat | grep Reas
    IpReasmReqds                    3317150            0.0
    IpReasmFails                    3317112            0.0
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 50fc08963b0ccc01bc9a01a4e699aed9eb0137f1
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Oct 10 12:29:59 2018 -0700

    inet: frags: remove inet_frag_maybe_warn_overflow()
    
    commit 2d44ed22e607f9a285b049de2263e3840673a260 upstream.
    
    This function is obsolete, after rhashtable addition to inet defrag.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit b047c796dedca591d1a3dc1ba8166d5082c337ef
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Oct 10 12:29:58 2018 -0700

    inet: frags: get rid of inet_frag_evicting()
    
    commit 399d1404be660d355192ff4df5ccc3f4159ec1e4 upstream.
    
    This refactors ip_expire(), removing one indentation level.
    
    Note: in the future, we should try hard to avoid the skb_clone()
    since it is a serious performance cost.
    Under DDOS, the ICMP message won't be sent because of rate limits.
    
    The fact that ip6_expire_frag_queue() does not use skb_clone() is
    disturbing too. Presumably IPv6 has the same
    issue as the one we fixed in commit ec4fbd64751d
    ("inet: frag: release spinlock before calling icmp_send()").
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Mao Wenan <maowenan@huawei.com>
    [bwh: Backported to 4.4: adjust context]
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit cf2b9e68a684bf8dd6c1a7671f59faaab8e4ca2c
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Oct 10 12:29:57 2018 -0700

    inet: frags: remove some helpers
    
    commit 6befe4a78b1553edb6eed3a78b4bcd9748526672 upstream.
    
    Remove sum_frag_mem_limit(), ip_frag_mem() & ip6_frag_mem()
    
    Also, since we use rhashtables, we can bring back the number of
    fragments in "grep FRAG /proc/net/sockstat /proc/net/sockstat6" that
    was removed in commit 434d305405ab ("inet: frag: don't account number
    of fragment queues").
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit f67b17c0233a3eb18463a661fbeb671110ada4b9
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Fri Jul 6 12:30:20 2018 +0200

    ipfrag: really prevent allocation on netns exit
    
    commit f6f2a4a2eb92bc73671204198bb2f8ab53ff59fb upstream.
    
    Setting the low threshold to 0 has no effect on frags allocation,
    we need to clear high_thresh instead.
    
    The code was pre-existent to commit 648700f76b03 ("inet: frags:
    use rhashtables for reassembly units"), but before the above,
    such assignment had a different role: prevent concurrent eviction
    from the worker and the netns cleanup helper.
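
    The fix, paraphrased (not the literal diff):

      static void __net_exit ipv4_frags_exit_net(struct net *net)
      {
              net->ipv4.frags.high_thresh = 0; /* refuse new frag queues */
              /* was: ...low_thresh = 0; which no longer stops
               * inet_frag_alloc() from creating queues */
              inet_frags_exit_net(&net->ipv4.frags);
      }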
    
    Fixes: 648700f76b03 ("inet: frags: use rhashtables for reassembly units")
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 24641fb6453c4cc642f548e1f11502c7c0caaa54
Author: Alexander Aring <aring@mojatatu.com>
Date:   Fri Apr 20 14:54:13 2018 -0400

    net: ieee802154: 6lowpan: fix frag reassembly
    
    commit f18fa5de5ba7f1d6650951502bb96a6e4715a948 upstream.
    
    This patch initializes to zero the stack variables which are used in
    frag_lowpan_compare_key. In my case there are padding bytes in the
    structures ieee802154_addr as well as in frag_lowpan_compare_key.
    Otherwise the key variable contains random bytes, so a compare of
    two keys by memcmp gives an incorrect result.
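
    A standalone demo of why uninitialized padding breaks memcmp()
    comparisons (the struct is a stand-in for the frag key):

      #include <stdio.h>
      #include <string.h>

      struct key { char a; /* 3 padding bytes on most ABIs */ int b; };

      int main(void)
      {
              struct key k1, k2;

              memset(&k1, 0xAA, sizeof(k1)); /* simulate stack garbage */
              memset(&k2, 0x55, sizeof(k2));
              k1.a = k2.a = 'x';             /* logically equal keys */
              k1.b = k2.b = 7;
              printf("memcmp: %d\n", memcmp(&k1, &k2, sizeof(k1)));
              /* Non-zero: the padding differs. The fix zeroes the whole
               * struct before filling the fields. */
              return 0;
      }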
    
    Fixes: 648700f76b03 ("inet: frags: use rhashtables for reassembly units")
    Signed-off-by: Alexander Aring <aring@mojatatu.com>
    Reported-by: Stefan Schmidt <stefan@osg.samsung.com>
    Signed-off-by: Stefan Schmidt <stefan@osg.samsung.com>
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 493107105843f299662b3b664a83804645564f12
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Oct 10 12:29:56 2018 -0700

    inet: frags: use rhashtables for reassembly units
    
    commit 648700f76b03b7e8149d13cc2bdb3355035258a9 upstream.
    
    Some applications still rely on IP fragmentation, and to be fair the
    Linux reassembly unit does not hold up under any serious load.
    
    It uses static hash tables of 1024 buckets, and up to 128 items per
    bucket (!!!)
    
    A work queue is supposed to garbage collect items when the host is
    under memory pressure, doing a hash rebuild and changing the seed used
    in hash computations.
    
    This work queue blocks softirqs for up to 25 ms when doing a hash
    rebuild, occurring every 5 seconds if the host is under fire.
    
    Then there is the problem of sharing this hash table for all netns.
    
    It is time to switch to rhashtables, and allocate one of them per netns
    to speedup netns dismantle, since this is a critical metric these days.
    
    Lookup is now using RCU. A followup patch will even remove
    the refcount hold/release left from prior implementation and save
    a couple of atomic operations.
    
    Before this patch, 16 cpus (16 RX queue NIC) could not handle more
    than 1 Mpps frags DDOS.
    
    After the patch, I reach 9 Mpps without any tuning, and can use up to
    2GB of storage for the fragments (the exact number depends on frags
    being evicted after timeout).
    
    $ grep FRAG /proc/net/sockstat
    FRAG: inuse 1966916 memory 2140004608
    
    A followup patch will change the limits for 64bit arches.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
    Cc: Herbert Xu <herbert@gondor.apana.org.au>
    Cc: Florian Westphal <fw@strlen.de>
    Cc: Jesper Dangaard Brouer <brouer@redhat.com>
    Cc: Alexander Aring <alex.aring@gmail.com>
    Cc: Stefan Schmidt <stefan@osg.samsung.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    [bwh: Backported to 4.4: adjust context]
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit bf5ea30ff5a143cf7ccb068259d16f4d2d1a2dd9
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Oct 10 12:29:55 2018 -0700

    rhashtable: add schedule points
    
    commit ae6da1f503abb5a5081f9f6c4a6881de97830f3e upstream.
    
    Rehashing and destroying large hash table takes a lot of time,
    and happens in process context. It is safe to add cond_resched()
    in rhashtable_rehash_table() and rhashtable_free_and_destroy()
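
    The pattern, sketched against rhashtable_rehash_table() (paraphrased,
    not the exact diff):

      for (old_hash = 0; old_hash < old_tbl->size; old_hash++) {
              err = rhashtable_rehash_chain(ht, old_hash);
              if (err)
                      return err;
              cond_resched(); /* process context: let other tasks run */
      }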
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 9e5f4d0b79f8708db79c912404e68c915eb54f4d
Author: Ben Hutchings <ben.hutchings@codethink.co.uk>
Date:   Fri Dec 7 17:16:46 2018 +0000

    rhashtable: Add rhashtable_lookup()
    
    Extracted from commit ca26893f05e8 "rhashtable: Add rhlist interface".
    
    Cc: Herbert Xu <herbert@gondor.apana.org.au>
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 8295999786a5141bb3a8db8e7a8b40e82e66ce3f
Author: Pablo Neira Ayuso <pablo@netfilter.org>
Date:   Wed Aug 24 12:31:31 2016 +0200

    rhashtable: add rhashtable_lookup_get_insert_key()
    
    commit 5ca8cc5bf11faed257c762018aea9106d529232f upstream.
    
    This patch modifies __rhashtable_insert_fast() so it returns the
    existing object that clashes with the one that you want to insert.
    In case the object is successfully inserted, NULL is returned.
    Otherwise, you get an error via ERR_PTR().
    
    This patch adapts the existing callers of __rhashtable_insert_fast()
    so they handle this new logic, and it adds a new
    rhashtable_lookup_get_insert_key() interface to fetch this existing
    object.
    
    nf_tables needs this change to improve handling of EEXIST cases via
    honoring the NLM_F_EXCL flag and by checking if the data part of the
    mapping matches what we have.
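
    The resulting caller pattern, sketched (names illustrative):

      prev = rhashtable_lookup_get_insert_key(ht, key, &new->node, params);
      if (!prev)
              return new;     /* inserted successfully */
      if (IS_ERR(prev))
              return prev;    /* error via ERR_PTR() */
      kfree(new);             /* key clash: reuse the existing object */
      return prev;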
    
    Cc: Herbert Xu <herbert@gondor.apana.org.au>
    Cc: Thomas Graf <tgraf@suug.ch>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit a7fb573c164488b36aeafe63a1531b9e7cf27fad
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Oct 10 12:29:53 2018 -0700

    inet: frags: refactor lowpan_net_frag_init()
    
    commit 807f1844df4ac23594268fa9f41902d0549e92aa upstream.
    
    We want to call lowpan_net_frag_init() earlier.
    Similar to commit "inet: frags: refactor ipv6_frag_init()"
    
    This is a prereq to "inet: frags: use rhashtables for reassembly units"
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 705e71ed99017c123cb09a516f51e0203446ba56
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Oct 10 12:29:52 2018 -0700

    inet: frags: refactor ipv6_frag_init()
    
    commit 5b975bab23615cd0fdf67af6c9298eb01c4b9f61 upstream.
    
    We want to call inet_frags_init() earlier.
    
    This is a prereq to "inet: frags: use rhashtables for reassembly units"
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    [bwh: Backported to 4.4: Also delete a redundant assignment to
     ip6_frags.skb_free]
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 8c639cad891c671b449014887678f11b8a1f519a
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Oct 10 12:29:51 2018 -0700

    inet: frags: refactor ipfrag_init()
    
    commit 483a6e4fa055123142d8956866fe2aa9c98d546d upstream.
    
    We need to call inet_frags_init() before register_pernet_subsys(),
    as a prereq for the following patch ("inet: frags: use rhashtables
    for reassembly units").
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 9c6727de82e47238f389b415e25443ac46a06de5
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Oct 10 12:29:50 2018 -0700

    inet: frags: add a pointer to struct netns_frags
    
    commit 093ba72914b696521e4885756a68a3332782c8de upstream.
    
    In order to simplify the API, add a pointer to struct inet_frags.
    This will allow us to make things less complex.
    
    These functions no longer have a struct inet_frags parameter :
    
    inet_frag_destroy(struct inet_frag_queue *q  /*, struct inet_frags *f */)
    inet_frag_put(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frag_kill(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frags_exit_net(struct netns_frags *nf /*, struct inet_frags *f */)
    ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    [bwh: Backported to 4.4: inet_frag_{kill,put}() are called in some
     different places; update all calls]
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 5eb2471ef43ea672cef90c7c8d66313aae8745f8
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Oct 10 12:29:49 2018 -0700

    inet: frags: change inet_frags_init_net() return value
    
    commit 787bea7748a76130566f881c2342a0be4127d182 upstream.
    
    We will soon initialize one rhashtable per struct netns_frags
    in inet_frags_init_net().
    
    This patch changes the return value to eventually propagate an
    error.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>