A11 nvme: reset submission queue state by yhavry · Pull Request #1 · HoolockLinux/linux

yhavry · 2026-05-14T03:24:32Z

Summary

Fixes Apple NVMe controller reset on A11 by clearing reused submission queue state.

apple_nvme_init_queue() already reset the completion queue state, but it left sq_tail and the SQ contents alone. That works on first boot because queue memory starts clean, but it breaks after a controller reset because the same queue objects are reused.

Fix

Reset q->sq_tail and clear SQ memory before enabling the queue.

Testing

Tested on iPhone 8 Plus / T8015 / D211.

Before this change:

echo 1 > /sys/class/nvme/nvme0/reset_controller

hung in nvme_reset_ctrl_sync() and ANS/RTKit crashed.

After this change:

echo 1 > /sys/class/nvme/nvme0/reset_controller
echo $?

returns 0, RTKit initializes again cleanly, and the NVMe device keeps working.

The Apple NVMe reset path was reusing stale submission queue state. apple_nvme_init_queue() reset the CQ state, but left sq_tail and the SQ contents alone. That works on first boot because the queue memory starts clean, but it is not true after a controller reset. On iPhone 8 Plus / T8015 / D211, writing to reset_controller crashed ANS/RTKit during the admin command sequence after nvme_enable_ctrl(), and the reset writer stayed blocked in nvme_reset_ctrl_sync(). Reset sq_tail and clear the SQ memory when the queue is initialized. With that change, reset_controller returns 0 and the NVMe device keeps working after reset. Signed-off-by: Yuriy Havrylyuk <yhavry@gmail.com>

When unregistered my self-written scx scheduler, the following panic occurs. [ 229.923133] Kernel text patching generated an invalid instruction at 0xffff80009bc2c1f8! [ 229.923146] Internal error: Oops - BRK: 00000000f2000100 [#1] SMP [ 230.077871] CPU: 48 UID: 0 PID: 1760 Comm: kworker/u583:7 Not tainted 7.0.0+ #3 PREEMPT(full) [ 230.086677] Hardware name: NVIDIA GB200 NVL/P3809-BMC, BIOS 02.05.12 20251107 [ 230.093972] Workqueue: events_unbound bpf_map_free_deferred [ 230.099675] Sched_ext: invariant_0.1.0_aarch64_unknown_linux_gnu_debug (disabling), task: runnable_at=-174ms [ 230.116843] pc : 0xffff80009bc2c1f8 [ 230.120406] lr : dequeue_task_scx+0x270/0x2d0 [ 230.217749] Call trace: [ 230.228515] 0xffff80009bc2c1f8 (P) [ 230.232077] dequeue_task+0x84/0x188 [ 230.235728] sched_change_begin+0x1dc/0x250 [ 230.240000] __set_cpus_allowed_ptr_locked+0x17c/0x240 [ 230.245250] __set_cpus_allowed_ptr+0x74/0xf0 [ 230.249701] ___migrate_enable+0x4c/0xa0 [ 230.253707] bpf_map_free_deferred+0x1a4/0x1b0 [ 230.258246] process_one_work+0x184/0x540 [ 230.262342] worker_thread+0x19c/0x348 [ 230.266170] kthread+0x13c/0x150 [ 230.269465] ret_from_fork+0x10/0x20 [ 230.281393] Code: d4202000 d4202000 d4202000 d4202000 (d4202000) [ 230.287621] ---[ end trace 0000000000000000 ]--- [ 231.160046] Kernel panic - not syncing: Oops - BRK: Fatal exception in interrupt The root cause is that the JIT page backing ops->quiescent() is freed before all callers of that function have stopped. The expected ordering during teardown is: bitmap_zero(sch->has_op) + synchronize_rcu() -> guarantees no CPU will ever call sch->ops.* again -> only THEN free the BPF struct_ops JIT page bpf_scx_unreg() is supposed to enforce the order, but after commit f4a6c50 ("sched_ext: Always bounce scx_disable() through irq_work"), disable_work is no longer queued directly, causing kthread_flush_work() to be a noop. Thus, the caller drops the struct_ops map too early and poisoned with AARCH64_BREAK_FAULT before disable_workfn ever execute. So the subsequent dequeue_task() still sees SCX_HAS_OP(sch, quiescent) as true and calls ops.quiescent, which hit on the poisoned page and BRK panic. Add a helper scx_flush_disable_work() so the future use cases that want to flush disable_work can use it. Also amend the call for scx_root_enable_workfn() and scx_sub_enable_workfn() which have similar pattern in the error path. Fixes: f4a6c50 ("sched_ext: Always bounce scx_disable() through irq_work") Signed-off-by: Richard Cheng <icheng@nvidia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>

…kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 7.1, take #1 - Allow tracing for non-pKVM, which was accidentally disabled when the series was merged - Rationalise the way the pKVM hypercall ranges are defined by using the same mechanism as already used for the vcpu_sysreg enum - Enforce that SMCCC function numbers relayed by the pKVM proxy are actually compliant with the specification - Fix a couple of feature to idreg mappings which resulted in the wrong sanitisation being applied - Fix the GICD_IIDR revision number field that could never been written correctly by userspace - Make kvm_vcpu_initialized() correctly use its parameter instead of relying on the surrounding context - Enforce correct ordering in __pkvm_init_vcpu(), plugging a potential pin leak at the same time - Move __pkvm_init_finalise() to a less dangerous spot, avoiding future problems - Restore functional userspace irqchip support after a four year breakage (last functional kernel was 5.18...). This is obviously ripe for garbage collection. - ... and the usual lot of spelling fixes

When a TAPRIO child qdisc is deleted via RTM_DELQDISC, taprio_graft() is called with new == NULL and stores NULL into q->qdiscs[cl - 1]. Subsequent RTM_GETTCLASS dump operations walk all classes via taprio_walk() and call taprio_dump_class(), which calls taprio_leaf() returning the NULL pointer, then dereferences it to read child->handle, causing a kernel NULL pointer dereference. The bug is reachable with namespace-scoped CAP_NET_ADMIN on any kernel with CONFIG_NET_SCH_TAPRIO enabled. On systems with unprivileged user namespaces enabled, an unprivileged local user can trigger a kernel panic by creating a taprio qdisc inside a new network namespace, grafting an explicit child qdisc, deleting it, and requesting a class dump. The RTM_GETTCLASS dump itself requires no capability. Oops: general protection fault, probably for non-canonical address 0xdffffc0000000007: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000038-0x000000000000003f] RIP: 0010:taprio_dump_class (net/sched/sch_taprio.c:2478) Call Trace: <TASK> tc_fill_tclass (net/sched/sch_api.c:1966) qdisc_class_dump (net/sched/sch_api.c:2326) taprio_walk (net/sched/sch_taprio.c:2514) tc_dump_tclass_qdisc (net/sched/sch_api.c:2352) tc_dump_tclass_root (net/sched/sch_api.c:2370) tc_dump_tclass (net/sched/sch_api.c:2431) rtnl_dumpit (net/core/rtnetlink.c:6864) netlink_dump (net/netlink/af_netlink.c:2325) rtnetlink_rcv_msg (net/core/rtnetlink.c:6959) netlink_rcv_skb (net/netlink/af_netlink.c:2550) </TASK> Fix this by substituting &noop_qdisc when new is NULL in taprio_graft(), a common pattern used by other qdiscs (e.g., multiq_graft()) to ensure the q->qdiscs[] slots are never NULL. This makes control-plane dump paths safe without requiring individual NULL checks. Since the data-plane paths (taprio_enqueue and taprio_dequeue_from_txq) previously had explicit NULL guards that would drop/skip the packet cleanly, update those checks to test for &noop_qdisc instead. Without this, packets would reach taprio_enqueue_one() which increments the root qdisc's qlen and backlog before calling the child's enqueue; noop_qdisc drops the packet but those counters are never rolled back, permanently inflating the root qdisc's statistics. After this change *old can be a valid qdisc, NULL, or &noop_qdisc. Only call qdisc_put(*old) in the first case to avoid decreasing noop_qdisc's refcount, which was never increased. Fixes: 665338b ("net/sched: taprio: dump class stats for the actual q->qdiscs[]") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Tested-by: Weiming Shi <bestswngs@gmail.com> Link: https://patch.msgid.link/20260422161958.2517539-3-bestswngs@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice_reset_all_vfs() ignores the return value of ice_vf_rebuild_vsi(). When the VSI rebuild fails (e.g. during NVM firmware update via nvmupdate64e), ice_vsi_rebuild() tears down the VSI on its error path, leaving txq_map and rxq_map as NULL. The subsequent unconditional call to ice_vf_post_vsi_rebuild() leads to a NULL pointer dereference in ice_ena_vf_q_mappings() when it accesses vsi->txq_map[0]. The single-VF reset path in ice_reset_vf() already handles this correctly by checking the return value of ice_vf_reconfig_vsi() and skipping ice_vf_post_vsi_rebuild() on failure. Apply the same pattern to ice_reset_all_vfs(): check the return value of ice_vf_rebuild_vsi() and skip ice_vf_post_vsi_rebuild() and ice_eswitch_attach_vf() on failure. The VF is left safely disabled (ICE_VF_STATE_INIT not set, VFGEN_RSTAT not set to VFACTIVE) and can be recovered via a VFLR triggered by a PCI reset of the VF (sysfs reset or driver rebind). Note that this patch does not prevent the VF VSI rebuild from failing during NVM update — the underlying cause is firmware being in a transitional state while the EMP reset is processed, which can cause Admin Queue commands (ice_add_vsi, ice_cfg_vsi_lan) to fail. This patch only prevents the subsequent NULL pointer dereference that crashes the kernel when the rebuild does fail. crash> bt PID: 50795 TASK: ff34c9ee708dc680 CPU: 1 COMMAND: "kworker/u512:5" #0 [ff72159bcfe5bb50] machine_kexec at ffffffffaa8850ee #1 [ff72159bcfe5bba8] __crash_kexec at ffffffffaaa15fba #2 [ff72159bcfe5bc68] crash_kexec at ffffffffaaa16540 #3 [ff72159bcfe5bc70] oops_end at ffffffffaa837eda #4 [ff72159bcfe5bc90] page_fault_oops at ffffffffaa893997 AsahiLinux#5 [ff72159bcfe5bce8] exc_page_fault at ffffffffab528595 AsahiLinux#6 [ff72159bcfe5bd10] asm_exc_page_fault at ffffffffab600bb2 [exception RIP: ice_ena_vf_q_mappings+0x79] RIP: ffffffffc0a85b29 RSP: ff72159bcfe5bdc8 RFLAGS: 00010206 RAX: 00000000000f0000 RBX: ff34c9efc9c00000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000010 RDI: ff34c9efc9c00000 RBP: ff34c9efc27d4828 R8: 0000000000000093 R9: 0000000000000040 R10: ff34c9efc27d4828 R11: 0000000000000040 R12: 0000000000100000 R13: 0000000000000010 R14: R15: ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 AsahiLinux#7 [ff72159bcfe5bdf8] ice_sriov_post_vsi_rebuild at ffffffffc0a85e2e [ice] AsahiLinux#8 [ff72159bcfe5be08] ice_reset_all_vfs at ffffffffc0a920b4 [ice] AsahiLinux#9 [ff72159bcfe5be48] ice_service_task at ffffffffc0a31519 [ice] AsahiLinux#10 [ff72159bcfe5be88] process_one_work at ffffffffaa93dca4 AsahiLinux#11 [ff72159bcfe5bec8] worker_thread at ffffffffaa93e9de AsahiLinux#12 [ff72159bcfe5bf18] kthread at ffffffffaa946663 AsahiLinux#13 [ff72159bcfe5bf50] ret_from_fork at ffffffffaa8086b9 The panic occurs attempting to dereference the NULL pointer in RDX at ice_sriov.c:294, which loads vsi->txq_map (offset 0x4b8 in ice_vsi). The faulting VSI is an allocated slab object but not fully initialized after a failed ice_vsi_rebuild(): crash> struct ice_vsi 0xff34c9efc27d4828 netdev = 0x0, rx_rings = 0x0, tx_rings = 0x0, q_vectors = 0x0, txq_map = 0x0, rxq_map = 0x0, alloc_txq = 0x10, num_txq = 0x10, alloc_rxq = 0x10, num_rxq = 0x10, The nvmupdate64e process was performing NVM firmware update: crash> bt 0xff34c9edd1a30000 PID: 49858 TASK: ff34c9edd1a30000 CPU: 1 COMMAND: "nvmupdate64e" #0 [ff72159bcd617618] __schedule at ffffffffab5333f8 #4 [ff72159bcd617750] ice_sq_send_cmd at ffffffffc0a35347 [ice] AsahiLinux#5 [ff72159bcd6177a8] ice_sq_send_cmd_retry at ffffffffc0a35b47 [ice] AsahiLinux#6 [ff72159bcd617810] ice_aq_send_cmd at ffffffffc0a38018 [ice] AsahiLinux#7 [ff72159bcd617848] ice_aq_read_nvm at ffffffffc0a40254 [ice] AsahiLinux#8 [ff72159bcd6178b8] ice_read_flat_nvm at ffffffffc0a4034c [ice] AsahiLinux#9 [ff72159bcd617918] ice_devlink_nvm_snapshot at ffffffffc0a6ffa5 [ice] dmesg: ice 0000:13:00.0: firmware recommends not updating fw.mgmt, as it may result in a downgrade. continuing anyways ice 0000:13:00.1: ice_init_nvm failed -5 ice 0000:13:00.1: Rebuild failed, unload and reload driver Fixes: 12bb018 ("ice: Refactor VF reset") Signed-off-by: Petr Oros <poros@redhat.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-5-cdcb48303fd8@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>

nvmet_tcp_release_queue_work() runs on nvmet-wq and can drop the final controller reference through nvmet_cq_put(). If that triggers nvmet_ctrl_free(), the teardown path flushes ctrl->async_event_work on the same nvmet-wq. Call chain: nvmet_tcp_schedule_release_queue() kref_put(&queue->kref, nvmet_tcp_release_queue) nvmet_tcp_release_queue() queue_work(nvmet_wq, &queue->release_work) <--- nvmet_wq process_one_work() nvmet_tcp_release_queue_work() nvmet_cq_put(&queue->nvme_cq) nvmet_cq_destroy() nvmet_ctrl_put(cq->ctrl) nvmet_ctrl_free() flush_work(&ctrl->async_event_work) <--- nvmet_wq Previously Scheduled by :- nvmet_add_async_event queue_work(nvmet_wq, &ctrl->async_event_work); This trips lockdep with a possible recursive locking warning. [ 5223.015876] run blktests nvme/003 at 2026-04-07 20:53:55 [ 5223.061801] loop0: detected capacity change from 0 to 2097152 [ 5223.072206] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 [ 5223.088368] nvmet_tcp: enabling port 0 (127.0.0.1:4420) [ 5223.126086] nvmet: Created discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349. [ 5223.128453] nvme nvme1: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 [ 5233.199447] nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery" [ 5233.227718] ============================================ [ 5233.231283] WARNING: possible recursive locking detected [ 5233.234696] 7.0.0-rc3nvme+ AsahiLinux#20 Tainted: G O N [ 5233.238434] -------------------------------------------- [ 5233.241852] kworker/u192:6/2413 is trying to acquire lock: [ 5233.245429] ffff888111632548 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x26/0x90 [ 5233.251438] but task is already holding lock: [ 5233.255254] ffff888111632548 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: process_one_work+0x5cc/0x6e0 [ 5233.261125] other info that might help us debug this: [ 5233.265333] Possible unsafe locking scenario: [ 5233.269217] CPU0 [ 5233.270795] ---- [ 5233.272436] lock((wq_completion)nvmet-wq); [ 5233.275241] lock((wq_completion)nvmet-wq); [ 5233.278020] *** DEADLOCK *** [ 5233.281793] May be due to missing lock nesting notation [ 5233.286195] 3 locks held by kworker/u192:6/2413: [ 5233.289192] #0: ffff888111632548 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: process_one_work+0x5cc/0x6e0 [ 5233.294569] #1: ffffc9000e2a7e40 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x1c5/0x6e0 [ 5233.300128] #2: ffffffff82d7dc40 (rcu_read_lock){....}-{1:3}, at: __flush_work+0x62/0x530 [ 5233.304290] stack backtrace: [ 5233.306520] CPU: 4 UID: 0 PID: 2413 Comm: kworker/u192:6 Tainted: G O N 7.0.0-rc3nvme+ AsahiLinux#20 PREEMPT(full) [ 5233.306524] Tainted: [O]=OOT_MODULE, [N]=TEST [ 5233.306525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014 [ 5233.306527] Workqueue: nvmet-wq nvmet_tcp_release_queue_work [nvmet_tcp] [ 5233.306532] Call Trace: [ 5233.306534] <TASK> [ 5233.306536] dump_stack_lvl+0x73/0xb0 [ 5233.306552] print_deadlock_bug+0x225/0x2f0 [ 5233.306556] __lock_acquire+0x13f0/0x2290 [ 5233.306563] lock_acquire+0xd0/0x300 [ 5233.306565] ? touch_wq_lockdep_map+0x26/0x90 [ 5233.306571] ? __flush_work+0x20b/0x530 [ 5233.306573] ? touch_wq_lockdep_map+0x26/0x90 [ 5233.306577] touch_wq_lockdep_map+0x3b/0x90 [ 5233.306580] ? touch_wq_lockdep_map+0x26/0x90 [ 5233.306583] ? __flush_work+0x20b/0x530 [ 5233.306585] __flush_work+0x268/0x530 [ 5233.306588] ? __pfx_wq_barrier_func+0x10/0x10 [ 5233.306594] ? xen_error_entry+0x30/0x60 [ 5233.306600] nvmet_ctrl_free+0x140/0x310 [nvmet] [ 5233.306617] nvmet_cq_put+0x74/0x90 [nvmet] [ 5233.306629] nvmet_tcp_release_queue_work+0x19f/0x360 [nvmet_tcp] [ 5233.306634] process_one_work+0x206/0x6e0 [ 5233.306640] worker_thread+0x184/0x320 [ 5233.306643] ? __pfx_worker_thread+0x10/0x10 [ 5233.306646] kthread+0xf1/0x130 [ 5233.306648] ? __pfx_kthread+0x10/0x10 [ 5233.306651] ret_from_fork+0x355/0x450 [ 5233.306653] ? __pfx_kthread+0x10/0x10 [ 5233.306656] ret_from_fork_asm+0x1a/0x30 [ 5233.306664] </TASK> There is also no need to flush async_event_work from controller teardown. The admin queue teardown already fails outstanding AER requests before the final controller put :- nvmet_sq_destroy(admin sq) nvmet_async_events_failall(ctrl) The controller has already been removed from the subsystem list before nvmet_ctrl_free() quiesces outstanding work. Replace flush_work() with cancel_work_sync() so a pending async_event_work item is canceled and a running instance is waited on without recursing into the same workqueue. Fixes: 06406d8 ("nvmet: cancel fatal error and flush async work before free controller") Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>

Commit 03b3bcd ("nvme: fix admin request_queue lifetime") moved the admin queue reference ->put call into nvme_free_ctrl() - a controller device release callback performed for every nvme driver doing nvme_init_ctrl(). nvme-apple sets refcount of the admin queue to 1 at allocation during the probe function and then puts it twice now: nvme_free_ctrl() blk_put_queue(ctrl->admin_q) // #1 ->free_ctrl() apple_nvme_free_ctrl() blk_put_queue(anv->ctrl.admin_q) // #2 Note that there is a commit 941f729 ("nvme-apple: remove an extra queue reference") which intended to drop taking an extra admin queue reference. Looks like at that moment it accidentally fixed a refcount leak, which existed since the driver's introduction. There were two ->get calls at driver's probe function and a single ->put inside apple_nvme_free_ctrl(). However now after commit 03b3bcd ("nvme: fix admin request_queue lifetime") the refcount is imbalanced again. Fix it by removing extra ->put call from apple_nvme_free_ctrl(). anv->dev and ctrl->dev point to the same device, so use ctrl->dev directly for simplification. Compile tested only. Found by Linux Verification Center (linuxtesting.org). Fixes: 03b3bcd ("nvme: fix admin request_queue lifetime") Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Signed-off-by: Keith Busch <kbusch@kernel.org>

When a controller reset is triggered via sysfs (by writing to /sys/class/nvme/<nvmedev>/reset_controller), the reset work tears down and re-establishes all queues. The socket release using fput() defers the actual cleanup to task_work delayed_fput workqueue. This deferred cleanup can race with the subsequent queue re-allocation during reset, potentially leading to use-after-free or resource conflicts. Replace fput() with __fput_sync() to ensure synchronous socket release, guaranteeing that all socket resources are fully cleaned up before the function returns. This prevents races during controller reset where new queue setup may begin before the old socket is fully released. * Call chain during reset: nvme_reset_ctrl_work() -> nvme_tcp_teardown_ctrl() -> nvme_tcp_teardown_io_queues() -> nvme_tcp_free_io_queues() -> nvme_tcp_free_queue() <-- fput() -> __fput_sync() -> nvme_tcp_teardown_admin_queue() -> nvme_tcp_free_admin_queue() -> nvme_tcp_free_queue() <-- fput() -> __fput_sync() -> nvme_tcp_setup_ctrl() <-- race with deferred fput memalloc_noreclaim_save() sets PF_MEMALLOC which is intended for tasks performing memory reclaim work that need reserve access. While PF_MEMALLOC prevents the task from entering direct reclaim (causing __need_reclaim() to return false), it does not strip __GFP_IO from gfp flags. The allocator can therefore still trigger writeback I/O when __GFP_IO remains set, which is unsafe when the caller holds block layer locks. Switch to memalloc_noio_save() which sets PF_MEMALLOC_NOIO. This causes current_gfp_context() to strip __GFP_IO|__GFP_FS from every allocation in the scope, making it safe to allocate memory while holding elevator_lock and set->srcu. * The issue can be reproduced using blktests: nvme_trtype=tcp ./check nvme/005 blktests (master) # nvme_trtype=tcp ./check nvme/005 nvme/005 (tr=tcp) (reset local loopback target) [failed] runtime 0.725s ... 0.798s something found in dmesg: [ 108.473940] run blktests nvme/005 at 2025-11-22 16:12:20 [...] ... (See '/root/blktests/results/nodev_tr_tcp/nvme/005.dmesg' for the entire message) blktests (master) # cat /root/blktests/results/nodev_tr_tcp/nvme/005.dmesg [ 108.473940] run blktests nvme/005 at 2025-11-22 16:12:20 [ 108.526983] loop0: detected capacity change from 0 to 2097152 [ 108.555606] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 [ 108.572531] nvmet_tcp: enabling port 0 (127.0.0.1:4420) [ 108.613061] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349. [ 108.616832] nvme nvme0: creating 48 I/O queues. [ 108.630791] nvme nvme0: mapped 48/0/0 default/read/poll queues. [ 108.661892] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 [ 108.746639] nvmet: Created nvm controller 2 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349. [ 108.748466] nvme nvme0: creating 48 I/O queues. [ 108.802984] nvme nvme0: mapped 48/0/0 default/read/poll queues. [ 108.829983] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1" [ 108.854288] block nvme0n1: no available path - failing I/O [ 108.854344] block nvme0n1: no available path - failing I/O [ 108.854373] Buffer I/O error on dev nvme0n1, logical block 1, async page read [ 108.891693] ====================================================== [ 108.895912] WARNING: possible circular locking dependency detected [ 108.900184] 6.17.0nvme+ #3 Tainted: G N [ 108.903913] ------------------------------------------------------ [ 108.908171] nvme/2734 is trying to acquire lock: [ 108.911957] ffff88810210e610 (set->srcu){.+.+}-{0:0}, at: __synchronize_srcu+0x17/0x170 [ 108.917587] but task is already holding lock: [ 108.921570] ffff88813abea198 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0xa8/0x1c0 [ 108.927361] which lock already depends on the new lock. [ 108.933018] the existing dependency chain (in reverse order) is: [ 108.938223] -> #4 (&q->elevator_lock){+.+.}-{4:4}: [ 108.942988] __mutex_lock+0xa2/0x1150 [ 108.945873] elevator_change+0xa8/0x1c0 [ 108.948925] elv_iosched_store+0xdf/0x140 [ 108.952043] kernfs_fop_write_iter+0x16a/0x220 [ 108.955367] vfs_write+0x378/0x520 [ 108.957598] ksys_write+0x67/0xe0 [ 108.959721] do_syscall_64+0x76/0xbb0 [ 108.962052] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 108.965145] -> #3 (&q->q_usage_counter(io)){++++}-{0:0}: [ 108.968923] blk_alloc_queue+0x30e/0x350 [ 108.972117] blk_mq_alloc_queue+0x61/0xd0 [ 108.974677] scsi_alloc_sdev+0x2a0/0x3e0 [ 108.977092] scsi_probe_and_add_lun+0x1bd/0x430 [ 108.979921] __scsi_add_device+0x109/0x120 [ 108.982504] ata_scsi_scan_host+0x97/0x1c0 [ 108.984365] async_run_entry_fn+0x2d/0x130 [ 108.986109] process_one_work+0x20e/0x630 [ 108.987830] worker_thread+0x184/0x330 [ 108.989473] kthread+0x10a/0x250 [ 108.990852] ret_from_fork+0x297/0x300 [ 108.992491] ret_from_fork_asm+0x1a/0x30 [ 108.994159] -> #2 (fs_reclaim){+.+.}-{0:0}: [ 108.996320] fs_reclaim_acquire+0x99/0xd0 [ 108.998058] kmem_cache_alloc_node_noprof+0x4e/0x3c0 [ 109.000123] __alloc_skb+0x15f/0x190 [ 109.002195] tcp_send_active_reset+0x3f/0x1e0 [ 109.004038] tcp_disconnect+0x50b/0x720 [ 109.005695] __tcp_close+0x2b8/0x4b0 [ 109.007227] tcp_close+0x20/0x80 [ 109.008663] inet_release+0x31/0x60 [ 109.010175] __sock_release+0x3a/0xc0 [ 109.011778] sock_close+0x14/0x20 [ 109.013263] __fput+0xee/0x2c0 [ 109.014673] delayed_fput+0x31/0x50 [ 109.016183] process_one_work+0x20e/0x630 [ 109.017897] worker_thread+0x184/0x330 [ 109.019543] kthread+0x10a/0x250 [ 109.020929] ret_from_fork+0x297/0x300 [ 109.022565] ret_from_fork_asm+0x1a/0x30 [ 109.024194] -> #1 (sk_lock-AF_INET-NVME){+.+.}-{0:0}: [ 109.026634] lock_sock_nested+0x2e/0x70 [ 109.028251] tcp_sendmsg+0x1a/0x40 [ 109.029783] sock_sendmsg+0xed/0x110 [ 109.031321] nvme_tcp_try_send_cmd_pdu+0x13e/0x260 [nvme_tcp] [ 109.034263] nvme_tcp_try_send+0xb3/0x330 [nvme_tcp] [ 109.036375] nvme_tcp_queue_rq+0x342/0x3d0 [nvme_tcp] [ 109.038528] blk_mq_dispatch_rq_list+0x297/0x800 [ 109.040448] __blk_mq_sched_dispatch_requests+0x3db/0x5f0 [ 109.042677] blk_mq_sched_dispatch_requests+0x29/0x70 [ 109.044787] blk_mq_run_work_fn+0x76/0x1b0 [ 109.046535] process_one_work+0x20e/0x630 [ 109.048245] worker_thread+0x184/0x330 [ 109.049890] kthread+0x10a/0x250 [ 109.051331] ret_from_fork+0x297/0x300 [ 109.053024] ret_from_fork_asm+0x1a/0x30 [ 109.054740] -> #0 (set->srcu){.+.+}-{0:0}: [ 109.056850] __lock_acquire+0x1468/0x2210 [ 109.058614] lock_sync+0xa5/0x110 [ 109.060048] __synchronize_srcu+0x49/0x170 [ 109.061802] elevator_switch+0xc9/0x330 [ 109.063950] elevator_change+0x128/0x1c0 [ 109.065675] elevator_set_none+0x4c/0x90 [ 109.067316] blk_unregister_queue+0xa8/0x110 [ 109.069165] __del_gendisk+0x14e/0x3c0 [ 109.070824] del_gendisk+0x75/0xa0 [ 109.072328] nvme_ns_remove+0xf2/0x230 [nvme_core] [ 109.074365] nvme_remove_namespaces+0xf2/0x150 [nvme_core] [ 109.076652] nvme_do_delete_ctrl+0x71/0x90 [nvme_core] [ 109.078775] nvme_delete_ctrl_sync+0x3b/0x50 [nvme_core] [ 109.081009] nvme_sysfs_delete+0x34/0x40 [nvme_core] [ 109.083082] kernfs_fop_write_iter+0x16a/0x220 [ 109.085009] vfs_write+0x378/0x520 [ 109.086539] ksys_write+0x67/0xe0 [ 109.087982] do_syscall_64+0x76/0xbb0 [ 109.089577] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 109.091665] other info that might help us debug this: [ 109.095478] Chain exists of: set->srcu --> &q->q_usage_counter(io) --> &q->elevator_lock [ 109.099544] Possible unsafe locking scenario: [ 109.101708] CPU0 CPU1 [ 109.103402] ---- ---- [ 109.105103] lock(&q->elevator_lock); [ 109.106530] lock(&q->q_usage_counter(io)); [ 109.109022] lock(&q->elevator_lock); [ 109.111391] sync(set->srcu); [ 109.112586] *** DEADLOCK *** [ 109.114772] 5 locks held by nvme/2734: [ 109.116189] #0: ffff888101925410 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0x67/0xe0 [ 109.119143] #1: ffff88817a914e88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x10f/0x220 [ 109.123141] #2: ffff8881046313f8 (kn->active#185){++++}-{0:0}, at: sysfs_remove_file_self+0x26/0x50 [ 109.126543] #3: ffff88810470e1d0 (&set->update_nr_hwq_lock){++++}-{4:4}, at: del_gendisk+0x6d/0xa0 [ 109.129891] #4: ffff88813abea198 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0xa8/0x1c0 [ 109.133149] stack backtrace: [ 109.134817] CPU: 6 UID: 0 PID: 2734 Comm: nvme Tainted: G N 6.17.0nvme+ #3 PREEMPT(voluntary) [ 109.134819] Tainted: [N]=TEST [ 109.134820] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 [ 109.134821] Call Trace: [ 109.134823] <TASK> [ 109.134824] dump_stack_lvl+0x75/0xb0 [ 109.134828] print_circular_bug+0x26a/0x330 [ 109.134831] check_noncircular+0x12f/0x150 [ 109.134834] __lock_acquire+0x1468/0x2210 [ 109.134837] ? __synchronize_srcu+0x17/0x170 [ 109.134838] lock_sync+0xa5/0x110 [ 109.134840] ? __synchronize_srcu+0x17/0x170 [ 109.134842] __synchronize_srcu+0x49/0x170 [ 109.134843] ? mark_held_locks+0x49/0x80 [ 109.134845] ? _raw_spin_unlock_irqrestore+0x2d/0x60 [ 109.134847] ? kvm_clock_get_cycles+0x14/0x30 [ 109.134853] ? ktime_get_mono_fast_ns+0x36/0xb0 [ 109.134858] elevator_switch+0xc9/0x330 [ 109.134860] elevator_change+0x128/0x1c0 [ 109.134862] ? kernfs_put.part.0+0x86/0x290 [ 109.134864] elevator_set_none+0x4c/0x90 [ 109.134866] blk_unregister_queue+0xa8/0x110 [ 109.134868] __del_gendisk+0x14e/0x3c0 [ 109.134870] del_gendisk+0x75/0xa0 [ 109.134872] nvme_ns_remove+0xf2/0x230 [nvme_core] [ 109.134879] nvme_remove_namespaces+0xf2/0x150 [nvme_core] [ 109.134887] nvme_do_delete_ctrl+0x71/0x90 [nvme_core] [ 109.134893] nvme_delete_ctrl_sync+0x3b/0x50 [nvme_core] [ 109.134899] nvme_sysfs_delete+0x34/0x40 [nvme_core] [ 109.134905] kernfs_fop_write_iter+0x16a/0x220 [ 109.134908] vfs_write+0x378/0x520 [ 109.134911] ksys_write+0x67/0xe0 [ 109.134913] do_syscall_64+0x76/0xbb0 [ 109.134915] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 109.134916] RIP: 0033:0x7fd68a737317 [ 109.134917] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 [ 109.134919] RSP: 002b:00007ffded1546d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 109.134920] RAX: ffffffffffffffda RBX: 000000000054f7e0 RCX: 00007fd68a737317 [ 109.134921] RDX: 0000000000000001 RSI: 00007fd68a855719 RDI: 0000000000000003 [ 109.134921] RBP: 0000000000000003 R08: 0000000030407850 R09: 00007fd68a7cd4e0 [ 109.134922] R10: 00007fd68a65b130 R11: 0000000000000246 R12: 00007fd68a855719 [ 109.134923] R13: 00000000304074c0 R14: 00000000304074c0 R15: 0000000030408660 [ 109.134926] </TASK> [ 109.962756] Key type psk unregistered Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>

Fix invalid data access by passing right data for debugfs entry. [ 171.549793] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 [ 171.559248] Mem abort info: [ 171.562173] ESR = 0x0000000096000044 [ 171.566227] EC = 0x25: DABT (current EL), IL = 32 bits [ 171.573108] SET = 0, FnV = 0 [ 171.576448] EA = 0, S1PTW = 0 [ 171.579745] FSC = 0x04: level 0 translation fault [ 171.584760] Data abort info: [ 171.588012] ISV = 0, ISS = 0x00000044, ISS2 = 0x00000000 [ 171.593734] CM = 0, WnR = 1, TnD = 0, TagAccess = 0 [ 171.598962] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 171.604471] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000083837000 [ 171.611358] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000 [ 171.618500] Internal error: Oops: 0000000096000044 [#1] SMP [ 171.624222] Modules linked in: powervr drm_shmem_helper drm_gpuvm... [ 171.656580] CPU: 0 UID: 0 PID: 549 Comm: bash Not tainted 7.0.0-rc2-g730b257ba723-dirty AsahiLinux#13 PREEMPT [ 171.665773] Hardware name: BeagleBoard.org BeaglePlay (DT) [ 171.671296] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 171.678306] pc : pvr_fw_trace_mask_set+0x78/0x154 [powervr] [ 171.683959] lr : pvr_fw_trace_mask_set+0x4c/0x154 [powervr] [ 171.689593] sp : ffff8000835ebb90 [ 171.692929] x29: ffff8000835ebc00 x28: ffff000005c60f80 x27: 0000000000000000 [ 171.700130] x26: 0000000000000000 x25: ffff00000504af28 x24: 0000000000000000 [ 171.707324] x23: ffff00000504af50 x22: 0000000000000203 x21: 0000000000000000 [ 171.714518] x20: ffff000005c44a80 x19: ffff000005c457b8 x18: 0000000000000000 [ 171.721715] x17: 0000000000000000 x16: 0000000000000000 x15: 0000aaaae8887580 [ 171.728908] x14: 0000000000000000 x13: 0000000000000000 x12: ffff8000835ebc30 [ 171.736095] x11: ffff00000504af2a x10: ffff00008504af29 x9 : 0fffffffffffffff [ 171.743286] x8 : ffff8000835ebbf8 x7 : 0000000000000000 x6 : 000000000000002a [ 171.750479] x5 : ffff00000504af2e x4 : 0000000000000000 x3 : 0000000000000010 [ 171.757674] x2 : 0000000000000203 x1 : 0000000000000000 x0 : ffff8000835ebba0 [ 171.764871] Call trace: [ 171.767342] pvr_fw_trace_mask_set+0x78/0x154 [powervr] (P) [ 171.772984] simple_attr_write_xsigned.isra.0+0xe0/0x19c [ 171.778341] simple_attr_write+0x18/0x24 [ 171.782296] debugfs_attr_write+0x50/0x98 [ 171.786341] full_proxy_write+0x6c/0xa8 [ 171.790208] vfs_write+0xd4/0x350 [ 171.793561] ksys_write+0x70/0x108 [ 171.796995] __arm64_sys_write+0x1c/0x28 [ 171.800952] invoke_syscall+0x48/0x10c [ 171.804740] el0_svc_common.constprop.0+0x40/0xe0 [ 171.809487] do_el0_svc+0x1c/0x28 [ 171.812834] el0_svc+0x34/0x108 [ 171.816013] el0t_64_sync_handler+0xa0/0xe4 [ 171.820237] el0t_64_sync+0x198/0x19c [ 171.823939] Code: 32000262 b90ac293 1a931056 9134e293 (b9000036) [ 171.830073] ---[ end trace 0000000000000000 ]--- Fixes: a331631 ("drm/imagination: Simplify module parameters") Signed-off-by: Brajesh Gupta <brajesh.gupta@imgtec.com> Reviewed-by: Alessio Belle <alessio.belle@imgtec.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260427-ftrace_fix-v3-1-e081530759a8@imgtec.com Signed-off-by: Matt Coster <matt.coster@imgtec.com>

Raw Packet QPs are unique in that they support separate send and receive queues, using 2 different user-provided buffers. They can also be created with one of the queues having size 0, allowing a send-only or receive-only QP. The Raw Packet RQ umem is created in the common user QP creation path, which allows zero-length queues. Add a later validation of the RQ umem in Raw Packet QP creation path when an RQ was requested. This prevents possible null-ptr dereference crashes, as seen in the below trace: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [#1] SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037] CPU: 6 UID: 0 PID: 3539 Comm: raw_packet_umem Not tainted 6.19.0-rc1+ AsahiLinux#166 NONE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 RIP: 0010:__mlx5_umem_find_best_quantized_pgoff+0x37/0x280 [mlx5_ib] Code: ff df 41 57 49 89 ff 41 56 41 55 41 89 d5 41 54 4d 89 cc 4c 8d 4f 30 55 4c 89 ca 48 89 f5 53 48 c1 ea 03 48 89 cb 48 83 ec 18 <80> 3c 02 00 44 89 04 24 0f 85 01 02 00 00 48 ba 00 00 00 00 00 fc RSP: 0018:ff1100013966f4e0 EFLAGS: 00010282 RAX: dffffc0000000000 RBX: 00000000ffffffc0 RCX: 00000000ffffffc0 RDX: 0000000000000006 RSI: 00000ffffffff000 RDI: 0000000000000000 RBP: 00000ffffffff000 R08: 0000000000000040 R09: 0000000000000030 R10: 0000000000000000 R11: 0000000000000000 R12: ff1100013966f648 R13: 0000000000000005 R14: ff1100013966f980 R15: 0000000000000000 FS: 00007fae6c82f740(0000) GS:ff11000898ba1000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000200000000000 CR3: 000000010f96c005 CR4: 0000000000373eb0 Call Trace: <TASK> create_qp+0x747d/0xc740 [mlx5_ib] ? is_module_address+0x18/0x110 ? _create_user_qp.constprop.0+0x18e0/0x18e0 [mlx5_ib] ? __module_address+0x49/0x210 ? is_module_address+0x68/0x110 ? static_obj+0x67/0x90 ? lockdep_init_map_type+0x58/0x200 mlx5_ib_create_qp+0xc85/0x2620 [mlx5_ib] ? find_held_lock+0x2b/0x80 ? create_qp+0xc740/0xc740 [mlx5_ib] ? lock_release+0xcb/0x260 ? lockdep_init_map_type+0x58/0x200 ? __init_swait_queue_head+0xcb/0x150 create_qp.part.0+0x558/0x7c0 [ib_core] ib_create_qp_user+0xa0/0x4f0 [ib_core] ? rdma_lookup_get_uobject+0x1e4/0x400 [ib_uverbs] create_qp+0xe4f/0x1d10 [ib_uverbs] ? ib_uverbs_rereg_mr+0xd40/0xd40 [ib_uverbs] ? ib_uverbs_cq_event_handler+0x120/0x120 [ib_uverbs] ? __might_fault+0x81/0x100 ? lock_release+0xcb/0x260 ? _copy_from_user+0x3e/0x90 ib_uverbs_create_qp+0x10a/0x150 [ib_uverbs] ? ib_uverbs_ex_create_qp+0xe0/0xe0 [ib_uverbs] ? __might_fault+0x81/0x100 ? lock_release+0xcb/0x260 ib_uverbs_write+0x7e5/0xc90 [ib_uverbs] ? uverbs_devnode+0xc0/0xc0 [ib_uverbs] ? lock_acquire+0xfa/0x2b0 ? find_held_lock+0x2b/0x80 ? finish_task_switch.isra.0+0x189/0x6c0 vfs_write+0x1c0/0xf70 ? lockdep_hardirqs_on_prepare+0xde/0x170 ? kernel_write+0x5a0/0x5a0 ? __switch_to+0x527/0xe60 ? __schedule+0x10a3/0x3950 ? io_schedule_timeout+0x110/0x110 ksys_write+0x170/0x1c0 ? __x64_sys_read+0xb0/0xb0 ? trace_hardirqs_off.part.0+0x4e/0xe0 do_syscall_64+0x70/0x1360 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7fae6ca3118d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5b cc 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe678ca308 EFLAGS: 00000213 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 00007ffe678ca448 RCX: 00007fae6ca3118d RDX: 0000000000000070 RSI: 0000200000000280 RDI: 0000000000000003 RBP: 00007ffe678ca320 R08: 00000000ffffffff R09: 00007fae6c8ec5b8 R10: 0000000000000064 R11: 0000000000000213 R12: 0000000000000001 R13: 0000000000000000 R14: 00007fae6cb71000 R15: 0000000000404df0 </TASK> Modules linked in: mlx5_ib mlx5_fwctl mlx5_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre rdma_ucm ib_uverbs rdma_cm iw_cm ib_ipoib ib_cm ib_umad ib_core rpcsec_gss_krb5 auth_rpcgss oid_registry overlay nfnetlink zram zsmalloc fuse scsi_transport_iscsi [last unloaded: mlx5_core] ---[ end trace 0000000000000000 ]--- RIP: 0010:__mlx5_umem_find_best_quantized_pgoff+0x37/0x280 [mlx5_ib] Fixes: 0fb2ed6 ("IB/mlx5: Add create and destroy functionality for Raw Packet QP") Link: https://patch.msgid.link/r/20260427-security-bug-fixes-v3-5-4621fa52de0e@nvidia.com Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

The switch case in loongson_gpu_fixup_dma_hang() may not DC2 or DC3, and readl(crtc_reg) will access with random address, because the "device" is from "base+PCI_DEVICE_ID", "base" is from "pdev->devfn+1". This is wrong when my platform inserts a discrete GPU: lspci -tv -[0000:00]-+-00.0 Loongson Technology LLC Hyper Transport Bridge Controller ... +-06.0 Loongson Technology LLC LG100 GPU +-06.2 Loongson Technology LLC Device 7a37 ... Add a default switch case to fix the panic as below: Kernel ade access[#1]: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.6.136-loong64-desktop-hwe+ #4 pc 90000000017e5534 ra 90000000017e54c0 tp 90000001002f8000 sp 90000001002fb6c0 a0 80000efe00003100 a1 0000000000003100 a2 0000000000000000 a3 0000000000000002 a4 90000001002fb6b4 a5 900000087cdb58fd a6 90000000027af000 a7 0000000000000001 t0 00000000000085b9 t1 000000000000ffff t2 0000000000000000 t3 0000000000000000 t4 fffffffffffffffd t5 00000000fffb6d9c t6 0000000000083b00 t7 00000000000070c0 t8 900000087cdb4d94 u0 900000087cdb58fd s9 90000001002fb826 s0 90000000031c12c8 s1 7fffffffffffff00 s2 90000000031c12d0 s3 0000000000002710 s4 0000000000000000 s5 0000000000000000 s6 9000000100053000 s7 7fffffffffffff00 s8 90000000030d4000 ra: 90000000017e54c0 loongson_gpu_fixup_dma_hang+0x40/0x210 ERA: 90000000017e5534 loongson_gpu_fixup_dma_hang+0xb4/0x210 CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE) PRMD: 00000004 (PPLV0 +PIE -PWE) EUEN: 00000000 (-FPE -SXE -ASXE -BTE) ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7) ESTAT: 00480000 [ADEM] (IS= ECode=8 EsubCode=1) BADV: 7fffffffffffff00 PRID: 0014d000 (Loongson-64bit, Loongson-3A6000-HV) Modules linked in: Process swapper/0 (pid: 1, threadinfo=(____ptrval____), task=(____ptrval____)) Stack : 0000000000000006 90000001002fb778 90000001002fb704 0000000000000007 0000000016a65700 90000000017e5690 000000000000ffff ffffffffffffffff 900000000209f7c0 9000000100053000 900000000209f7a8 9000000000eebc08 0000000000000000 0000000000000000 0000000000000006 90000001002fb778 90000001000530b8 90000000027af000 0000000000000000 9000000100054000 9000000100053000 9000000000ebb70c 9000000100004c00 9000000004000001 90000001002fb7e4 bae765461f31cb12 0000000000000000 0000000000000000 0000000000000006 90000000027af000 0000000000000030 90000000027af000 900000087cd6f800 9000000100053000 0000000000000000 9000000000ebc560 7a2500147cdaf720 bae765461f31cb12 0000000000000001 0000000000000030 ... Call Trace: [<90000000017e5534>] loongson_gpu_fixup_dma_hang+0xb4/0x210 [<9000000000eebc08>] pci_fixup_device+0x108/0x280 [<9000000000ebb70c>] pci_setup_device+0x24c/0x690 [<9000000000ebc560>] pci_scan_single_device+0xe0/0x140 [<9000000000ebc684>] pci_scan_slot+0xc4/0x280 [<9000000000ebdd00>] pci_scan_child_bus_extend+0x60/0x3f0 [<9000000000f5bc94>] acpi_pci_root_create+0x2b4/0x420 [<90000000017e5e74>] pci_acpi_scan_root+0x2d4/0x440 [<9000000000f5b02c>] acpi_pci_root_add+0x21c/0x3a0 [<9000000000f4ee54>] acpi_bus_attach+0x1a4/0x3c0 [<90000000010e200c>] device_for_each_child+0x6c/0xe0 [<9000000000f4bbf4>] acpi_dev_for_each_child+0x44/0x70 [<9000000000f4ef40>] acpi_bus_attach+0x290/0x3c0 [<90000000010e200c>] device_for_each_child+0x6c/0xe0 [<9000000000f4bbf4>] acpi_dev_for_each_child+0x44/0x70 [<9000000000f4ef40>] acpi_bus_attach+0x290/0x3c0 [<9000000000f5211c>] acpi_bus_scan+0x6c/0x280 [<900000000189c028>] acpi_scan_init+0x194/0x310 [<900000000189bc6c>] acpi_init+0xcc/0x140 [<9000000000220cdc>] do_one_initcall+0x4c/0x310 [<90000000018618fc>] kernel_init_freeable+0x258/0x2d4 [<900000000184326c>] kernel_init+0x28/0x13c [<9000000000222008>] ret_from_kernel_thread+0xc/0xa4 Cc: stable@vger.kernel.org Fixes: 95db0c9 ("LoongArch: Workaround LS2K/LS7A GPU DMA hang bug") Link: https://gist.github.com/opsiff/ebf2dac51b4013d22462f2124c55f807 Link: https://gist.github.com/opsiff/a62f2a73db0492b3c49bf223a339b133 Signed-off-by: Wentao Guan <guanwentao@uniontech.com> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

When mana_create_rxq() fails at mana_create_wq_obj() or any step before xdp_rxq_info_reg() is called, the error path jumps to `out:` which calls mana_destroy_rxq(). mana_destroy_rxq() unconditionally calls xdp_rxq_info_unreg() on xilinx xdp_rxq that was never registered, triggering a WARN_ON in net/core/xdp.c: mana 7870:00:00.0: HWC: Failed hw_channel req: 0xc000009a mana 7870:00:00.0 eth7: Failed to create RXQ: err = -71 Driver BUG WARNING: CPU: 442 PID: 491615 at ../net/core/xdp.c:150 xdp_rxq_info_unreg+0x44/0x70 Modules linked in: tcp_bbr xsk_diag udp_diag raw_diag unix_diag af_packet_diag netlink_diag nf_tables nfnetlink tcp_diag inet_diag binfmt_misc rpcsec_gss_krb5 nfsv3 nfs_acl auth_rpcgss nfsv4 dns_resolver nfs lockd ext4 grace crc16 iscsi_tcp mbcache fscache libiscsi_tcp jbd2 netfs rpcrdma af_packet sunrpc rdma_ucm ib_iser rdma_cm iw_cm iscsi_ibft ib_cm iscsi_boot_sysfs libiscsi rfkill scsi_transport_iscsi mana_ib ib_uverbs ib_core mana hyperv_drm(X) drm_shmem_helper intel_rapl_msr drm_kms_helper intel_rapl_common syscopyarea nls_iso8859_1 sysfillrect intel_uncore_frequency_common nls_cp437 vfat fat nfit sysimgblt libnvdimm hv_netvsc(X) hv_utils(X) fb_sys_fops hv_balloon(X) joydev fuse drm dm_mod configfs ip_tables x_tables xfs libcrc32c sd_mod nvme nvme_core nvme_common t10_pi crc64_rocksoft_generic crc64_rocksoft crc64 hid_generic serio_raw pci_hyperv(X) hv_storvsc(X) scsi_transport_fc hyperv_keyboard(X) hid_hyperv(X) pci_hyperv_intf(X) crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd hv_vmbus(X) softdog sg scsi_mod efivarfs Supported: Yes, External CPU: 442 PID: 491615 Comm: ethtool Kdump: loaded Tainted: G X 5.14.21-150500.55.136-default #1 SLE15-SP5 a627be1b53abbfd64ad16b2685e4308c52847f42 Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 07/25/2025 RIP: 0010:xdp_rxq_info_unreg+0x44/0x70 Code: e8 91 fe ff ff c7 43 0c 02 00 00 00 48 c7 03 00 00 00 00 5b c3 cc cc cc cc e9 58 3a 1c 00 48 c7 c7 f6 5f 19 97 e8 5c a4 7e ff <0f> 0b 83 7b 0c 01 74 ca 48 c7 c7 d9 5f 19 97 e8 48 a4 7e ff 0f 0b RSP: 0018:ff3df6c8f7207818 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ff30d89f94808a80 RCX: 0000000000000027 RDX: 0000000000000000 RSI: 0000000000000002 RDI: ff30d94bdcca2908 RBP: 0000000000080000 R08: ffffffff98ed11a0 R09: ff3df6c8f72077a0 R10: dead000000000100 R11: 000000000000000a R12: 0000000000000000 R13: 0000000000002000 R14: 0000000000040000 R15: ff30d89f94800000 FS: 00007fe6d8432b80(0000) GS:ff30d94bdcc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe6d81a89b1 CR3: 00000b3b6d578001 CR4: 0000000000371ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 Call Trace: <TASK> mana_destroy_rxq+0x5b/0x2f0 [mana 267acf7006bcb696095bba4d810643d1db3b9e94] mana_create_rxq.isra.55+0x3db/0x720 [mana 267acf7006bcb696095bba4d810643d1db3b9e94] ? simple_lookup+0x36/0x50 ? current_time+0x42/0x80 ? __d_free_external+0x30/0x30 mana_alloc_queues+0x32a/0x470 [mana 267acf7006bcb696095bba4d810643d1db3b9e94] ? _raw_spin_unlock+0xa/0x30 ? d_instantiate.part.29+0x2e/0x40 ? _raw_spin_unlock+0xa/0x30 ? debugfs_create_dir+0xe4/0x140 mana_attach+0x5c/0xf0 [mana 267acf7006bcb696095bba4d810643d1db3b9e94] mana_set_ringparam+0xd5/0x1a0 [mana 267acf7006bcb696095bba4d810643d1db3b9e94] ethnl_set_rings+0x292/0x320 genl_family_rcv_msg_doit.isra.15+0x11b/0x150 genl_rcv_msg+0xe3/0x1e0 ? rings_prepare_data+0x80/0x80 ? genl_family_rcv_msg_doit.isra.15+0x150/0x150 netlink_rcv_skb+0x50/0x100 genl_rcv+0x24/0x40 netlink_unicast+0x1b6/0x280 netlink_sendmsg+0x365/0x4d0 sock_sendmsg+0x5f/0x70 __sys_sendto+0x112/0x140 __x64_sys_sendto+0x24/0x30 do_syscall_64+0x5b/0x80 ? handle_mm_fault+0xd7/0x290 ? do_user_addr_fault+0x2d8/0x740 ? exc_page_fault+0x67/0x150 entry_SYSCALL_64_after_hwframe+0x6b/0xd5 RIP: 0033:0x7fe6d8122f06 Code: 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 72 f3 c3 41 57 41 56 4d 89 c7 41 55 41 54 41 RSP: 002b:00007fff2b66b068 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 000055771123d2a0 RCX: 00007fe6d8122f06 RDX: 0000000000000034 RSI: 000055771123d3b0 RDI: 0000000000000003 RBP: 00007fff2b66b100 R08: 00007fe6d8203360 R09: 000000000000000c R10: 0000000000000000 R11: 0000000000000246 R12: 000055771123d350 R13: 000055771123d340 R14: 0000000000000000 R15: 00007fff2b66b2b0 </TASK> Guard the xdp_rxq_info_unreg() call with xdp_rxq_info_is_reg() so that mana_destroy_rxq() is safe to call regardless of how far initialization progressed. Fixes: ed5356b ("net: mana: Add XDP support") Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com> Link: https://patch.msgid.link/20260430035935.1859220-2-dipayanroy@linux.microsoft.com Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

…g FLR During Function Level Reset recovery, the MANA driver reads hardware BAR0 registers that may temporarily contain garbage values. The SHM (Shared Memory) offset read from GDMA_REG_SHM_OFFSET is used to compute gc->shm_base, which is later dereferenced via readl() in mana_smc_poll_register(). If the hardware returns an unaligned or out-of-range value, the driver must not blindly use it, as this would propagate the hardware error into a kernel crash. The following crash was observed on an arm64 Hyper-V guest running kernel 6.17.0-3013-azure during VF reset recovery triggered by HWC timeout. [13291.785274] Unable to handle kernel paging request at virtual address ffff8000a200001b [13291.785311] Mem abort info: [13291.785332] ESR = 0x0000000096000021 [13291.785343] EC = 0x25: DABT (current EL), IL = 32 bits [13291.785355] SET = 0, FnV = 0 [13291.785363] EA = 0, S1PTW = 0 [13291.785372] FSC = 0x21: alignment fault [13291.785382] Data abort info: [13291.785391] ISV = 0, ISS = 0x00000021, ISS2 = 0x00000000 [13291.785404] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [13291.785412] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [13291.785421] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000014df3a1000 [13291.785432] [ffff8000a200001b] pgd=1000000100438403, p4d=1000000100438403, pud=1000000100439403, pmd=0068000fc2000711 [13291.785703] Internal error: Oops: 0000000096000021 [#1] SMP [13291.830975] Modules linked in: tls qrtr mana_ib ib_uverbs ib_core xt_owner xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables cfg80211 8021q garp mrp stp llc binfmt_misc joydev serio_raw nls_iso8859_1 hid_generic aes_ce_blk aes_ce_cipher polyval_ce ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher hid_hyperv sm4 sm3_ce sha3_ce hv_netvsc hid vmgenid hyperv_keyboard hyperv_drm sch_fq_codel nvme_fabrics efi_pstore dm_multipath nfnetlink vsock_loopback vmw_vsock_virtio_transport_common hv_sock vmw_vsock_vmci_transport vmw_vmci vsock dmi_sysfs ip_tables x_tables autofs4 [13291.862630] CPU: 122 UID: 0 PID: 61796 Comm: kworker/122:2 Tainted: G W 6.17.0-3013-azure AsahiLinux#13-Ubuntu VOLUNTARY [13291.869902] Tainted: [W]=WARN [13291.871901] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 01/08/2026 [13291.878086] Workqueue: events mana_serv_func [13291.880718] pstate: 62400005 (nZCv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--) [13291.884835] pc : mana_smc_poll_register+0x48/0xb0 [13291.887902] lr : mana_smc_setup_hwc+0x70/0x1c0 [13291.890493] sp : ffff8000ab79bbb0 [13291.892364] x29: ffff8000ab79bbb0 x28: ffff00410c8b5900 x27: ffff00410d630680 [13291.896252] x26: ffff004171f9fd80 x25: 000000016ed55000 x24: 000000017f37e000 [13291.899990] x23: 0000000000000000 x22: 000000016ed55000 x21: 0000000000000000 [13291.904497] x20: ffff8000a200001b x19: 0000000000004e20 x18: ffff8000a6183050 [13291.908308] x17: 0000000000000000 x16: 0000000000000000 x15: 000000000000000a [13291.912542] x14: 0000000000000004 x13: 0000000000000000 x12: 0000000000000000 [13291.916298] x11: 0000000000000000 x10: 0000000000000001 x9 : ffffc45006af1bd8 [13291.920945] x8 : ffff000151129000 x7 : 0000000000000000 x6 : 0000000000000000 [13291.925293] x5 : 000000015f214000 x4 : 000000017217a000 x3 : 000000016ed50000 [13291.930436] x2 : 000000016ed55000 x1 : 0000000000000000 x0 : ffff8000a1ffffff [13291.934342] Call trace: [13291.935736] mana_smc_poll_register+0x48/0xb0 (P) [13291.938611] mana_smc_setup_hwc+0x70/0x1c0 [13291.941113] mana_hwc_create_channel+0x1a0/0x3a0 [13291.944283] mana_gd_setup+0x16c/0x398 [13291.946584] mana_gd_resume+0x24/0x70 [13291.948917] mana_do_service+0x13c/0x1d0 [13291.951583] mana_serv_func+0x34/0x68 [13291.953732] process_one_work+0x168/0x3d0 [13291.956745] worker_thread+0x2ac/0x480 [13291.959104] kthread+0xf8/0x110 [13291.961026] ret_from_fork+0x10/0x20 [13291.963560] Code: d2807d00 9417c551 71000673 54000220 (b9400281) [13291.967299] ---[ end trace 0000000000000000 ]--- Disassembly of mana_smc_poll_register() around the crash site: Disassembly of section .text: 00000000000047c8 <mana_smc_poll_register>: 47c8: d503201f nop 47cc: d503201f nop 47d0: d503233f paciasp 47d4: f800865e str x30, [x18], AsahiLinux#8 47d8: a9bd7bfd stp x29, x30, [sp, #-48]! 47dc: 910003fd mov x29, sp 47e0: a90153f3 stp x19, x20, [sp, AsahiLinux#16] 47e4: 91007014 add x20, x0, #0x1c 47e8: 5289c413 mov w19, #0x4e20 47ec: f90013f5 str x21, [sp, AsahiLinux#32] 47f0: 12001c35 and w21, w1, #0xff 47f4: 14000008 b 4814 <mana_smc_poll_register+0x4c> 47f8: 36f801e1 tbz w1, AsahiLinux#31, 4834 <mana_smc_poll_register+0x6c> 47fc: 52800042 mov w2, #0x2 4800: d280fa01 mov x1, #0x7d0 4804: d2807d00 mov x0, #0x3e8 4808: 94000000 bl 0 <usleep_range_state> 480c: 71000673 subs w19, w19, #0x1 4810: 54000200 b.eq 4850 <mana_smc_poll_register+0x88> 4814: b9400281 ldr w1, [x20] <-- **** CRASHED HERE ***** 4818: d50331bf dmb oshld 481c: 2a0103e2 mov w2, w1 ... From the crash signature x20 = ffff8000a200001b, this address ends in 0x1b which is not 4-byte aligned, so the 'ldr w1, [x20]' instruction (readl) triggers the arm64 alignment fault (FSC = 0x21). The root cause is in mana_gd_init_vf_regs(), which computes: gc->shm_base = gc->bar0_va + mana_gd_r64(gc, GDMA_REG_SHM_OFFSET); The offset is used without any validation. The same problem exists in mana_gd_init_pf_regs() for sriov_base_off and sriov_shm_off. Fix this by validating all offsets before use: - VF: check shm_off is within BAR0, properly aligned to 4 bytes (readl requirement), and leaves room for the full 256-bit (32-byte) SMC aperture. - PF: check sriov_base_off is within BAR0, aligned to 8 bytes (readq requirement), and leaves room to safely read the sriov_shm_off register at sriov_base_off + GDMA_PF_REG_SHM_OFF. Then check sriov_shm_off leaves room for the full SMC aperture. All arithmetic uses subtraction rather than addition to avoid integer overflow on garbage values. Define SMC_APERTURE_SIZE (32 bytes, derived from the 256-bit aperture width) Return -EPROTO on invalid values. The existing recovery path in mana_serv_reset() already handles -EPROTO by falling through to PCI device rescan, giving the hardware another chance to present valid register values after reset. Fixes: 9bf6603 ("net: mana: Handle hardware recovery events when probing the device") Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com> Link: https://patch.msgid.link/afQUMClyjmBVfD+u@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>

syzbot reported null-ptr-deref in fib6_mtu(). [0] When res->f6i->fib6_pmtu is 0 in fib6_mtu(), it fetches MTU from __in6_dev_get(nh->fib_nh_dev)->cnf.mtu6. However, __in6_dev_get() could return NULL when the device is being unregistered. Let's return 0 MTU if __in6_dev_get() returns NULL in fib6_mtu(). [0]: Oops: general protection fault, probably for non-canonical address 0xdffffc00000000bc: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x00000000000005e0-0x00000000000005e7] CPU: 0 UID: 0 PID: 7890 Comm: syz.2.502 Tainted: G L syzkaller #0 PREEMPT(full) Tainted: [L]=SOFTLOCKUP Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:fib6_mtu net/ipv6/route.c:1648 [inline] RIP: 0010:rt6_insert_exception+0x9eb/0x10a0 net/ipv6/route.c:1753 Code: 3b 14 cf f7 45 85 f6 0f 85 1d 02 00 00 e8 7d 19 cf f7 48 8d bb e0 05 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 89 RSP: 0000:ffffc9000610f120 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffc9000c001000 RDX: 00000000000000bc RSI: ffffffff8a38bc83 RDI: 00000000000005e0 RBP: ffff888052f06000 R08: 0000000000000005 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: ffff888042d16c00 R13: ffff888042d16cc8 R14: 0000000000000001 R15: 0000000000000500 FS: 0000000000000000(0000) GS:ffff88809717d000(0063) knlGS:00000000f540db40 CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 CR2: 00000000f73c6d50 CR3: 000000006eff0000 CR4: 0000000000352ef0 Call Trace: <TASK> __ip6_rt_update_pmtu+0x555/0xd60 net/ipv6/route.c:2982 ip6_update_pmtu+0x34f/0x3b0 net/ipv6/route.c:3014 icmpv6_err+0x2a2/0x3f0 net/ipv6/icmp.c:82 icmpv6_notify+0x35e/0x820 net/ipv6/icmp.c:1087 icmpv6_rcv+0x10bf/0x1ae0 net/ipv6/icmp.c:1228 ip6_protocol_deliver_rcu+0xf97/0x1500 net/ipv6/ip6_input.c:478 ip6_input_finish+0x1e4/0x4a0 net/ipv6/ip6_input.c:529 NF_HOOK include/linux/netfilter.h:318 [inline] NF_HOOK include/linux/netfilter.h:312 [inline] ip6_input+0x105/0x2f0 net/ipv6/ip6_input.c:540 ip6_mc_input+0x513/0xf50 net/ipv6/ip6_input.c:630 dst_input include/net/dst.h:480 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:119 [inline] NF_HOOK include/linux/netfilter.h:318 [inline] NF_HOOK include/linux/netfilter.h:312 [inline] ipv6_rcv+0x34c/0x3d0 net/ipv6/ip6_input.c:351 __netif_receive_skb_one_core+0x12d/0x1e0 net/core/dev.c:6202 __netif_receive_skb+0x1f/0x120 net/core/dev.c:6315 netif_receive_skb_internal net/core/dev.c:6401 [inline] netif_receive_skb+0x13b/0x7f0 net/core/dev.c:6460 tun_rx_batched.isra.0+0x3f6/0x750 drivers/net/tun.c:1511 tun_get_user+0x1e31/0x3c20 drivers/net/tun.c:1955 tun_chr_write_iter+0xdc/0x200 drivers/net/tun.c:2001 new_sync_write fs/read_write.c:595 [inline] vfs_write+0x6ac/0x1070 fs/read_write.c:688 ksys_write+0x12a/0x250 fs/read_write.c:740 do_syscall_32_irqs_on arch/x86/entry/syscall_32.c:83 [inline] do_int80_emulation+0x141/0x700 arch/x86/entry/syscall_32.c:172 asm_int80_emulation+0x1a/0x20 arch/x86/include/asm/idtentry.h:621 RIP: 0023:0xf715616b Code: 57 56 53 8b 44 24 14 f6 00 08 75 23 8b 44 24 18 8b 5c 24 1c 8b 4c 24 20 8b 54 24 24 8b 74 24 28 8b 7c 24 2c 8b 6c 24 30 cd 80 <5b> 5e 5f 5d c3 5b 5e 5f 5d e9 f7 a1 ff ff 66 90 66 90 66 90 90 53 RSP: 002b:00000000f540d44c EFLAGS: 00000246 ORIG_RAX: 0000000000000004 RAX: ffffffffffffffda RBX: 00000000000000c8 RCX: 0000000080000640 RDX: 000000000000007a RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000292 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> Fixes: dcd1f57 ("net/ipv6: Remove fib6_idev") Reported-by: syzbot+01f005f9c6387ca6f6dd@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/69f83f22.170a0220.13cc2.0004.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260504064316.3820775-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Pavan Chebbi says: ==================== bnxt_en: Bug fixes This patchset adds the following fixes for bnxt: Patch #1 fixes DPC AER handling to make it more reliable Patch #2 fixes incorrect capping bp->max_tpa based on what the FW supports Patch #3 fixes ignoring of VNIC configuration result when RDMA driver is loading Patch #4 fixes logic to make phase adjustment on the PPS OUT signal ==================== Link: https://patch.msgid.link/20260504083611.1383776-1-pavan.chebbi@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

syzbot reports for sleeping function called from invalid context [1]. The recently added code for resizable hash tables uses hlist_bl bit locks in combination with spin_lock for the connection fields (cp->lock). Fix the following problems: * avoid using spin_lock(&cp->lock) under locked bit lock because it sleeps on PREEMPT_RT * as the recent changes call ip_vs_conn_hash() only for newly allocated connection, the spin_lock can be removed there because the connection is still not linked to table and does not need cp->lock protection. * the lock can be removed also from ip_vs_conn_unlink() where we are the last connection user. * the last place that is fixed is ip_vs_conn_fill_cport() where now the cp->lock is locked before the other locks to ensure other packets do not modify the cp->flags in non-atomic way. Here we make sure cport and flags are changed only once if two or more packets race to fill the cport. Also, we fill cport early, so that if we race with resizing there will be valid cport key for the hashing. Add a warning if too many hash table changes occur for our RCU read-side critical section which is error condition but minor because the connection still can expire gracefully. Still, restore the cport to 0 to allow retransmitted packet to properly fill the cport. Problems reported by Sashiko. [1]: BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 16, name: ktimers/0 preempt_count: 2, expected: 0 RCU nest depth: 3, expected: 3 8 locks held by ktimers/0/16: #0: ffffffff8de5f260 (local_bh){.+.+}-{1:3}, at: __local_bh_disable_ip+0x3c/0x420 kernel/softirq.c:163 #1: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: __local_bh_disable_ip+0x3c/0x420 kernel/softirq.c:163 #2: ffff8880b8826360 (&base->expiry_lock){+...}-{3:3}, at: spin_lock include/linux/spinlock_rt.h:45 [inline] #2: ffff8880b8826360 (&base->expiry_lock){+...}-{3:3}, at: timer_base_lock_expiry kernel/time/timer.c:1502 [inline] #2: ffff8880b8826360 (&base->expiry_lock){+...}-{3:3}, at: __run_timer_base+0x120/0x9f0 kernel/time/timer.c:2384 #3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire include/linux/rcupdate.h:300 [inline] #3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline] #3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: __rt_spin_lock kernel/locking/spinlock_rt.c:50 [inline] #3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rt_spin_lock+0x1e0/0x400 kernel/locking/spinlock_rt.c:57 #4: ffffc90000157a80 ((&cp->timer)){+...}-{0:0}, at: call_timer_fn+0xd4/0x5e0 kernel/time/timer.c:1745 AsahiLinux#5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire include/linux/rcupdate.h:300 [inline] AsahiLinux#5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline] AsahiLinux#5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: ip_vs_conn_unlink net/netfilter/ipvs/ip_vs_conn.c:315 [inline] AsahiLinux#5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: ip_vs_conn_expire+0x257/0x2390 net/netfilter/ipvs/ip_vs_conn.c:1260 AsahiLinux#6: ffffffff8de5f260 (local_bh){.+.+}-{1:3}, at: __local_bh_disable_ip+0x3c/0x420 kernel/softirq.c:163 AsahiLinux#7: ffff888068d4c3f0 (&cp->lock#2){+...}-{3:3}, at: spin_lock include/linux/spinlock_rt.h:45 [inline] AsahiLinux#7: ffff888068d4c3f0 (&cp->lock#2){+...}-{3:3}, at: ip_vs_conn_unlink net/netfilter/ipvs/ip_vs_conn.c:324 [inline] AsahiLinux#7: ffff888068d4c3f0 (&cp->lock#2){+...}-{3:3}, at: ip_vs_conn_expire+0xd4a/0x2390 net/netfilter/ipvs/ip_vs_conn.c:1260 Preemption disabled at: [<ffffffff898a6358>] bit_spin_lock include/linux/bit_spinlock.h:38 [inline] [<ffffffff898a6358>] hlist_bl_lock+0x18/0x110 include/linux/list_bl.h:149 CPU: 0 UID: 0 PID: 16 Comm: ktimers/0 Tainted: G W L syzkaller #0 PREEMPT_{RT,(full)} Tainted: [W]=WARN, [L]=SOFTLOCKUP Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/18/2026 Call Trace: <TASK> dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120 __might_resched+0x329/0x480 kernel/sched/core.c:9162 __rt_spin_lock kernel/locking/spinlock_rt.c:48 [inline] rt_spin_lock+0xc2/0x400 kernel/locking/spinlock_rt.c:57 spin_lock include/linux/spinlock_rt.h:45 [inline] ip_vs_conn_unlink net/netfilter/ipvs/ip_vs_conn.c:324 [inline] ip_vs_conn_expire+0xd4a/0x2390 net/netfilter/ipvs/ip_vs_conn.c:1260 call_timer_fn+0x192/0x5e0 kernel/time/timer.c:1748 expire_timers kernel/time/timer.c:1799 [inline] __run_timers kernel/time/timer.c:2374 [inline] __run_timer_base+0x6a3/0x9f0 kernel/time/timer.c:2386 run_timer_base kernel/time/timer.c:2395 [inline] run_timer_softirq+0xb7/0x170 kernel/time/timer.c:2405 handle_softirqs+0x1de/0x6d0 kernel/softirq.c:622 __do_softirq kernel/softirq.c:656 [inline] run_ktimerd+0x69/0x100 kernel/softirq.c:1151 smpboot_thread_fn+0x541/0xa50 kernel/smpboot.c:160 kthread+0x388/0x470 kernel/kthread.c:436 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> Reported-by: syzbot+504e778ddaecd36fdd17@syzkaller.appspotmail.com Link: https://sashiko.dev/#/patchset/20260415200216.79699-1-ja%40ssi.bg Link: https://sashiko.dev/#/patchset/20260420165539.85174-4-ja%40ssi.bg Link: https://sashiko.dev/#/patchset/20260422135823.50489-4-ja%40ssi.bg Fixes: 2fa7cc9 ("ipvs: switch to per-net connection table") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

When utilizing Socket-Direct single netdev functionality the driver resolves the actual auxiliary device using mlx5_sd_get_adev(). However, the current implementation returns the primary ETH auxiliary device without holding the device lock, leading to a potential race condition where the ETH device could be unbound or removed concurrently during probe, suspend, resume, or remove operations.[1] Fix this by introducing mlx5_sd_put_adev() and updating mlx5_sd_get_adev() so that secondaries devices would get a ref and acquire the device lock of the returned auxiliary device. After the lock is acquired, a second devcom check is needed[2]. In addition, update The callers to pair the get operation with the new put operation, ensuring the lock is held while the auxiliary device is being operated on and released afterwards. The "primary" designation is determined once in sd_register(). It's set before devcom is marked ready, and it never changes after that. In Addition, The primary path never locks a secondary: When the primary device invoke mlx5_sd_get_adev(), it sees dev == primary and returns. no additional lock is taken. Therefore lock ordering is always: secondary_lock -> primary_lock. The reverse never happens, so ABBA deadlock is impossible. [1] for example: BUG: kernel NULL pointer dereference, address: 0000000000000370 PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP CPU: 4 UID: 0 PID: 3945 Comm: bash Not tainted 6.19.0-rc3+ #1 NONE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:mlx5e_dcbnl_dscp_app+0x23/0x100 [mlx5_core] Call Trace: <TASK> mlx5e_remove+0x82/0x12a [mlx5_core] device_release_driver_internal+0x194/0x1f0 bus_remove_device+0xc6/0x140 device_del+0x159/0x3c0 ? devl_param_driverinit_value_get+0x29/0x80 mlx5_rescan_drivers_locked+0x92/0x160 [mlx5_core] mlx5_unregister_device+0x34/0x50 [mlx5_core] mlx5_uninit_one+0x43/0xb0 [mlx5_core] remove_one+0x4e/0xc0 [mlx5_core] pci_device_remove+0x39/0xa0 device_release_driver_internal+0x194/0x1f0 unbind_store+0x99/0xa0 kernfs_fop_write_iter+0x12e/0x1e0 vfs_write+0x215/0x3d0 ksys_write+0x5f/0xd0 do_syscall_64+0x55/0xe90 entry_SYSCALL_64_after_hwframe+0x4b/0x53 [2] CPU0 (primary) CPU1 (secondary) ========================================================================== mlx5e_remove() (device_lock held) mlx5e_remove() (2nd device_lock held) mlx5_sd_get_adev() mlx5_devcom_comp_is_ready() => true device_lock(primary) mlx5_sd_get_adev() ==> ret adev _mlx5e_remove() mlx5_sd_cleanup() // mlx5e_remove finished // releasing device_lock //need another check here... mlx5_devcom_comp_is_ready() => false Fixes: 381978d ("net/mlx5e: Create single netdev per SD group") Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260504180206.268568-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Two functions in ath12k assert that the caller holds an RCU read lock: ath12k_mac_get_arvif() and ath12k_p2p_noa_update_vdev_iter(). Both use: WARN_ON(!rcu_read_lock_any_held()); On kernels using preemptible RCU (CONFIG_PREEMPT=y or CONFIG_PREEMPT_RT=y) without CONFIG_DEBUG_LOCK_ALLOC, this produces a false positive splat whenever these functions are invoked from paths that do hold the RCU read lock (e.g. firmware stats processing or mac80211 interface iteration). Root cause: - Without CONFIG_DEBUG_LOCK_ALLOC, rcu_read_lock_any_held() is a static inline that returns !preemptible() as a proxy for "in an RCU read section". - With preemptible RCU, rcu_read_lock() does not disable preemption. A task can therefore be preemptible while legitimately holding an RCU read lock, making the proxy unreliable. - Callers such as ath12k_wmi_tlv_rssi_chain_parse() (via guard(rcu)()) and ieee80211_iterate_active_interfaces_atomic() do hold the RCU read lock, so these warnings are incorrect. Typical splat seen on a WCN7850 station with periodic fw stats processing: WARNING: drivers/net/wireless/ath/ath12k/mac.c:791 at ath12k_mac_get_arvif+0x9e/0xd0 [ath12k] Tainted: G W O 6.19.13-rt #1 PREEMPT_RT Call Trace: ath12k_wmi_tlv_rssi_chain_parse+0x69/0x170 [ath12k] ath12k_wmi_tlv_iter+0x7f/0x120 [ath12k] ath12k_wmi_tlv_fw_stats_parse+0x342/0x6b0 [ath12k] ath12k_wmi_op_rx+0xe9e/0x3150 [ath12k] ath12k_htc_rx_completion_handler+0x3df/0x5b0 [ath12k] ath12k_ce_per_engine_service+0x325/0x3e0 [ath12k] ath12k_pci_ce_workqueue+0x20/0x40 [ath12k] Replace WARN_ON(!rcu_read_lock_any_held()) with lockdep_assert_in_rcu_read_lock(), which is gated on CONFIG_PROVE_RCU and therefore compiles out entirely when PROVE_RCU is disabled. PROVE_RCU kernels continue to get the full lockdep-based check, and the new helper precisely checks for rcu_read_lock() rather than any RCU variant, which better matches the callers' expectations. Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.1.c5-00302-QCAHMTSWPL_V1.0_V2.0_SILICONZ-1.115823.3 Fixes: 3dd2c68 ("wifi: ath12k: prepare vif data structure for MLO handling") Suggested-by: Baochen Qiang <baochen.qiang@oss.qualcomm.com> Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Baochen Qiang <baochen.qiang@oss.qualcomm.com> Reviewed-by: Rameshkumar Sundaram <rameshkumar.sundaram@oss.qualcomm.com> Signed-off-by: Yu-Hsiang Tseng <asas1asas200@gmail.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20260422180814.1938317-1-asas1asas200@gmail.com Signed-off-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com>

fbnic_phylink_create() stores the newly allocated PCS in fbn->pcs and then calls phylink_create(). When phylink_create() fails, the error path correctly destroys the PCS via xpcs_destroy_pcs(), but the caller, fbnic_netdev_alloc(), responds by invoking fbnic_netdev_free() which calls fbnic_phylink_destroy(). That function finds fbn->pcs non-NULL and calls xpcs_destroy_pcs() a second time on the already-freed object, triggering a refcount underflow use-after-free: [ 1.934973] fbnic 0000:01:00.0: Failed to create Phylink interface, err: -22 [ 1.935103] ------------[ cut here ]------------ [ 1.935179] refcount_t: underflow; use-after-free. [ 1.935252] WARNING: lib/refcount.c:28 at refcount_warn_saturate+0x59/0x90, CPU#0: swapper/0/1 [ 1.935389] Modules linked in: [ 1.935484] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-virtme-04244-g1f5ffc672165-dirty #1 PREEMPT(lazy) [ 1.935661] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 [ 1.935826] RIP: 0010:refcount_warn_saturate+0x59/0x90 [ 1.935931] Code: 44 48 8d 3d 49 f9 a7 01 67 48 0f b9 3a e9 bf 1e 96 00 48 8d 3d 48 f9 a7 01 67 48 0f b9 3a c3 cc cc cc cc 48 8d 3d 47 f9 a7 01 <67> 48 0f b9 3a c3 cc cc cc cc 48 8d 3d 46 f9 a7 01 67 48 0f b9 3a [ 1.936274] RSP: 0000:ffffd0d440013c58 EFLAGS: 00010246 [ 1.936376] RAX: 0000000000000000 RBX: ffff8f39c188c278 RCX: 000000000000002b [ 1.936524] RDX: ffff8f39c004f000 RSI: 0000000000000003 RDI: ffffffff96abab00 [ 1.936692] RBP: ffff8f39c188c240 R08: ffffffff96988e88 R09: 00000000ffffdfff [ 1.936835] R10: ffffffff96878ea0 R11: 0000000000000187 R12: 0000000000000000 [ 1.936970] R13: ffff8f39c0cef0c8 R14: ffff8f39c1ac01c0 R15: 0000000000000000 [ 1.937114] FS: 0000000000000000(0000) GS:ffff8f3ba08b4000(0000) knlGS:0000000000000000 [ 1.937273] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1.937382] CR2: ffff8f3b3ffff000 CR3: 0000000172642001 CR4: 0000000000372ef0 [ 1.937540] Call Trace: [ 1.937619] <TASK> [ 1.937698] xpcs_destroy_pcs+0x25/0x40 [ 1.937783] fbnic_netdev_alloc+0x1e5/0x200 [ 1.937859] fbnic_probe+0x230/0x370 [ 1.937939] local_pci_probe+0x3e/0x90 [ 1.938013] pci_device_probe+0xbb/0x1e0 [ 1.938091] ? sysfs_do_create_link_sd+0x6d/0xe0 [ 1.938188] really_probe+0xc1/0x2b0 [ 1.938282] __driver_probe_device+0x73/0x120 [ 1.938371] driver_probe_device+0x1e/0xe0 [ 1.938466] __driver_attach+0x8d/0x190 [ 1.938560] ? __pfx___driver_attach+0x10/0x10 [ 1.938663] bus_for_each_dev+0x7b/0xd0 [ 1.938758] bus_add_driver+0xe8/0x210 [ 1.938854] driver_register+0x60/0x120 [ 1.938929] ? __pfx_fbnic_init_module+0x10/0x10 [ 1.939026] fbnic_init_module+0x25/0x60 [ 1.939109] do_one_initcall+0x49/0x220 [ 1.939202] ? rdinit_setup+0x20/0x40 [ 1.939304] kernel_init_freeable+0x1b0/0x310 [ 1.939449] ? __pfx_kernel_init+0x10/0x10 [ 1.939560] kernel_init+0x1a/0x1c0 [ 1.939640] ret_from_fork+0x1ed/0x240 [ 1.939730] ? __pfx_kernel_init+0x10/0x10 [ 1.939805] ret_from_fork_asm+0x1a/0x30 [ 1.939886] </TASK> [ 1.939927] ---[ end trace 0000000000000000 ]--- [ 1.940184] fbnic 0000:01:00.0: Netdev allocation failed Instead of calling fbnic_phylink_destroy(), the prior initialization of netdev should just be unrolled with free_netdev() and clearing fbd->netdev. Clearing fbd->netdev to NULL avoids UAF in init_failure_mode where callers guard by checking !fbd->netdev, such as fbnic_mdio_read_pmd(). These callers remain active even after a failed probe, so fdb->netdev still needs to be cleared. Fixes: d0fe710 ("fbnic: Replace use of internal PCS w/ Designware XPCS") Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260504-fbnic-pcs-fix-v2-1-de45192821d9@meta.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Yi Lai reported RCU splat in reg_vif_xmit() below. [0] When CONFIG_IP_MROUTE_MULTIPLE_TABLES=n, ipmr_fib_lookup() uses rcu_dereference() without explicit rcu_read_lock(). Although rcu_read_lock_bh() is already held by the caller __dev_queue_xmit(), lockdep requires explicit rcu_read_lock() for rcu_dereference(). Let's move up rcu_read_lock() in reg_vif_xmit() to cover ipmr_fib_lookup(). [0]: WARNING: suspicious RCU usage 7.1.0-rc2-next-20260504-9d0d467c3572 #1 Not tainted ----------------------------- net/ipv4/ipmr.c:329 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 2 locks held by syz.2.17/1779: #0: ffffffff87896440 (rcu_read_lock_bh){....}-{1:3}, at: local_bh_disable include/linux/bottom_half.h:20 [inline] #0: ffffffff87896440 (rcu_read_lock_bh){....}-{1:3}, at: rcu_read_lock_bh include/linux/rcupdate.h:891 [inline] #0: ffffffff87896440 (rcu_read_lock_bh){....}-{1:3}, at: __dev_queue_xmit+0x239/0x4140 net/core/dev.c:4792 #1: ffff88801a199d18 (_xmit_PIMREG#2){+...}-{3:3}, at: spin_lock include/linux/spinlock.h:342 [inline] #1: ffff88801a199d18 (_xmit_PIMREG#2){+...}-{3:3}, at: __netif_tx_lock include/linux/netdevice.h:4795 [inline] #1: ffff88801a199d18 (_xmit_PIMREG#2){+...}-{3:3}, at: __dev_queue_xmit+0x1d5d/0x4140 net/core/dev.c:4865 stack backtrace: CPU: 1 UID: 0 PID: 1779 Comm: syz.2.17 Not tainted 7.1.0-rc2-next-20260504-9d0d467c3572 #1 PREEMPT(lazy) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x121/0x150 lib/dump_stack.c:120 dump_stack+0x19/0x20 lib/dump_stack.c:129 lockdep_rcu_suspicious+0x15b/0x1f0 kernel/locking/lockdep.c:6878 ipmr_fib_lookup net/ipv4/ipmr.c:329 [inline] reg_vif_xmit+0x2ee/0x3c0 net/ipv4/ipmr.c:540 __netdev_start_xmit include/linux/netdevice.h:5382 [inline] netdev_start_xmit include/linux/netdevice.h:5391 [inline] xmit_one net/core/dev.c:3889 [inline] dev_hard_start_xmit+0x170/0x700 net/core/dev.c:3905 __dev_queue_xmit+0x1df1/0x4140 net/core/dev.c:4871 dev_queue_xmit include/linux/netdevice.h:3423 [inline] packet_xmit+0x252/0x370 net/packet/af_packet.c:276 packet_snd net/packet/af_packet.c:3082 [inline] packet_sendmsg+0x39ad/0x5650 net/packet/af_packet.c:3114 sock_sendmsg_nosec net/socket.c:797 [inline] __sock_sendmsg net/socket.c:812 [inline] ____sys_sendmsg+0xa21/0xba0 net/socket.c:2716 ___sys_sendmsg+0x121/0x1c0 net/socket.c:2770 __sys_sendmsg+0x177/0x220 net/socket.c:2802 __do_sys_sendmsg net/socket.c:2807 [inline] __se_sys_sendmsg net/socket.c:2805 [inline] __x64_sys_sendmsg+0x80/0xc0 net/socket.c:2805 x64_sys_call+0x1d9c/0x21c0 arch/x86/include/generated/asm/syscalls_64.h:47 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xc1/0x1020 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f37e563ee5d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe5caa7fa8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00000000005c5fa0 RCX: 00007f37e563ee5d RDX: 0000000000000000 RSI: 00002000000012c0 RDI: 0000000000000004 RBP: 00000000005c5fa0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 00000000005c5fac R15: 00000000005c5fa0 </TASK> Fixes: b3b6bab ("ipmr: Free mr_table after RCU grace period.") Reported-by: syzkaller <syzkaller@googlegroups.com> Reported-by: Yi Lai <yi1.lai@intel.com> Closes: https://lore.kernel.org/netdev/afrY34dLXNUboevf@ly-workstation/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260506065955.1695753-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit 0c22d95 upstream. The generic/642 test-case can reproduce the kernel crash: [40243.605254] ------------[ cut here ]------------ [40243.605956] kernel BUG at fs/ceph/xattr.c:918! [40243.607142] Oops: invalid opcode: 0000 [#1] SMP PTI [40243.608067] CPU: 7 UID: 0 PID: 498762 Comm: kworker/7:1 Not tainted 7.0.0-rc7+ #3 PREEMPT(full) [40243.609700] Hardware name: QEMU Ubuntu 25.10 PC v2 (i440FX + PIIX, + 10.1 machine, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [40243.611820] Workqueue: ceph-msgr ceph_con_workfn [40243.612715] RIP: 0010:__ceph_build_xattrs_blob+0x1b8/0x1e0 [40243.613731] Code: 0f 84 82 fe ff ff e9 cf 8e 56 ff 48 8d 65 e8 31 c0 5b 41 5c 41 5d 5d 31 d2 31 c9 31 f6 31 ff 45 31 c0 45 31 c9 c3 cc cc cc cc <0f> 0b 4c 8b 62 08 41 8b 85 24 07 00 00 49 83 c4 04 41 89 44 24 fc [40243.616888] RSP: 0018:ffffcc80c4d4b688 EFLAGS: 00010287 [40243.617773] RAX: 0000000000010026 RBX: 0000000000000001 RCX: 0000000000000000 [40243.618928] RDX: ffff8a773798dee0 RSI: 0000000000000000 RDI: 0000000000000000 [40243.620158] RBP: ffffcc80c4d4b6a0 R08: 0000000000000000 R09: 0000000000000000 [40243.621573] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a75f3b58000 [40243.622907] R13: ffff8a75f3b58000 R14: 0000000000000080 R15: 000000000000bffd [40243.624054] FS: 0000000000000000(0000) GS:ffff8a787d1b4000(0000) knlGS:0000000000000000 [40243.625331] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [40243.626269] CR2: 000072f390b623c0 CR3: 000000011c02a003 CR4: 0000000000372ef0 [40243.627408] Call Trace: [40243.627839] <TASK> [40243.628188] __prep_cap+0x3fd/0x4a0 [40243.628789] ? do_raw_spin_unlock+0x4e/0xe0 [40243.629474] ceph_check_caps+0x46a/0xc80 [40243.630094] ? __lock_acquire+0x4a2/0x2650 [40243.630773] ? find_held_lock+0x31/0x90 [40243.631347] ? handle_cap_grant+0x79f/0x1060 [40243.632068] ? lock_release+0xd9/0x300 [40243.632696] ? __mutex_unlock_slowpath+0x3e/0x340 [40243.633429] ? lock_release+0xd9/0x300 [40243.634052] handle_cap_grant+0xcf6/0x1060 [40243.634745] ceph_handle_caps+0x122b/0x2110 [40243.635415] mds_dispatch+0x5bd/0x2160 [40243.636034] ? ceph_con_process_message+0x65/0x190 [40243.636828] ? lock_release+0xd9/0x300 [40243.637431] ceph_con_process_message+0x7a/0x190 [40243.638184] ? kfree+0x311/0x4f0 [40243.638749] ? kfree+0x311/0x4f0 [40243.639268] process_message+0x16/0x1a0 [40243.639915] ? sg_free_table+0x39/0x90 [40243.640572] ceph_con_v2_try_read+0xf58/0x2120 [40243.641255] ? lock_acquire+0xc8/0x300 [40243.641863] ceph_con_workfn+0x151/0x820 [40243.642493] process_one_work+0x22f/0x630 [40243.643093] ? process_one_work+0x254/0x630 [40243.643770] worker_thread+0x1e2/0x400 [40243.644332] ? __pfx_worker_thread+0x10/0x10 [40243.645020] kthread+0x109/0x140 [40243.645560] ? __pfx_kthread+0x10/0x10 [40243.646125] ret_from_fork+0x3f8/0x480 [40243.646752] ? __pfx_kthread+0x10/0x10 [40243.647316] ? __pfx_kthread+0x10/0x10 [40243.647919] ret_from_fork_asm+0x1a/0x30 [40243.648556] </TASK> [40243.648902] Modules linked in: overlay hctr2 libpolyval chacha libchacha adiantum libnh libpoly1305 essiv intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit kvm_intel kvm irqbypass joydev ghash_clmulni_intel aesni_intel rapl input_leds mac_hid psmouse vga16fb serio_raw vgastate floppy i2c_piix4 pata_acpi bochs qemu_fw_cfg i2c_smbus sch_fq_codel rbd dm_crypt msr parport_pc ppdev lp parport efi_pstore [40243.654766] ---[ end trace 0000000000000000 ]--- Commit d93231a ("ceph: prevent a client from exceeding the MDS maximum xattr size") moved the required_blob_size computation to before the __build_xattrs() call, introducing a race. __build_xattrs() releases and reacquires i_ceph_lock during execution. In that window, handle_cap_grant() may update i_xattrs.blob with a newer MDS-provided blob and bump i_xattrs.version. When __build_xattrs() detects that index_version < version, it destroys and rebuilds the entire xattr rb-tree from the new blob, potentially increasing count, names_size, and vals_size. The prealloc_blob size check that follows still uses the stale required_blob_size computed before the rebuild, so it passes even when prealloc_blob is too small for the now-larger tree. After __set_xattr() adds one more xattr on top, __ceph_build_xattrs_blob() is called from the cap flush path and hits: BUG_ON(need > ci->i_xattrs.prealloc_blob->alloc_len); Fix this by recomputing required_blob_size after __build_xattrs() returns, using the current tree state. Also re-validate against m_max_xattr_size to fall back to the sync path if the rebuilt tree now exceeds the MDS limit. Cc: stable@vger.kernel.org Fixes: d93231a ("ceph: prevent a client from exceeding the MDS maximum xattr size") Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Reviewed-by: Alex Markuze <amarkuze@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 07d0f49 upstream. iommu_device_register() walks every device on the PCI bus via bus_for_each_dev() and calls amd_iommu_probe_device() for each. The inlined check_device() path computes the device's sbdf, calls rlookup_amd_iommu() to find the owning IOMMU, and only afterwards verifies devid <= pci_seg->last_bdf. __rlookup_amd_iommu() indexes rlookup_table[devid] with no bounds check of its own, so for a PCI device whose BDF is not described by the IVRS, the lookup reads past the end of the allocation before the caller's bounds check can run. This was harmless before commit e874c66 ("iommu/amd: Change rlookup, irq_lookup, and alias to use kvalloc()"): the table was a zeroed page-order allocation, so the over-read returned NULL and the caller's NULL check skipped the device. After that commit the table is a tight kvcalloc() and the over-read returns adjacent slab contents, which check_device() then dereferences as a struct amd_iommu *, causing a boot-time GPF. Seen on Google Compute Engine ct6e VMs, where the virtualized IVRS describes only the four TPU endpoints 00:04.0-07.0; the gVNIC at 00:08.0 (devid 0x40) indexes 56 bytes past the 456-byte allocation, into the adjacent kmalloc-512 slab object: pci 0000:00:04.0: Adding to iommu group 0 pci 0000:00:05.0: Adding to iommu group 1 pci 0000:00:06.0: Adding to iommu group 2 pci 0000:00:07.0: Adding to iommu group 3 Oops: general protection fault, probably for non-canonical address 0x3a64695f78746382: 0000 [#1] SMP NOPTI CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.18.22 #1 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 12/06/2025 RIP: 0010:amd_iommu_probe_device+0x54/0x3a0 Call Trace: __iommu_probe_device+0x107/0x520 probe_iommu_group+0x29/0x50 bus_for_each_dev+0x7e/0xe0 iommu_device_register+0xc9/0x240 iommu_go_to_state+0x9c0/0x1c60 amd_iommu_init+0x14/0x40 pci_iommu_init+0x16/0x60 do_one_initcall+0x47/0x2f0 Guard the array access in __rlookup_amd_iommu(). With the fix applied on 6.18.22, the gVNIC at 00:08.0 is skipped cleanly and the VM boots. Fixes: e874c66 ("iommu/amd: Change rlookup, irq_lookup, and alias to use kvalloc()") Cc: stable@vger.kernel.org Reported-by: Ziyuan Chen <zc@anthropic.com> Tested-by: Ziyuan Chen <zc@anthropic.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Assisted-by: Claude:unspecified Signed-off-by: Jose Fernandez (Anthropic) <jose.fernandez@linux.dev> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit a6dea58 upstream. Below oops triggers when kill QEMU process: Oops: general protection fault, probably for non-canonical address 0x7fffffff844eaaa7: 0000 [#1] SMP NOPTI Call Trace: <TASK> do_raw_spin_lock+0xaa/0xc0 _raw_spin_lock_irqsave+0x21/0x40 domain_remove_dev_pasid+0x52/0x160 intel_nested_set_dev_pasid+0x1b9/0x1e0 __iommu_set_group_pasid+0x56/0x120 pci_dev_reset_iommu_done+0xe3/0x180 pcie_flr+0x65/0x160 __pci_reset_function_locked+0x5b/0x120 vfio_pci_core_close_device+0x63/0xe0 [vfio_pci_core] vfio_df_close+0x4f/0xa0 vfio_df_unbind_iommufd+0x2d/0x60 vfio_device_fops_release+0x3e/0x40 __fput+0xe5/0x2c0 task_work_run+0x58/0xa0 do_exit+0x2c8/0x600 do_group_exit+0x2f/0xa0 get_signal+0x863/0x8c0 arch_do_signal_or_restart+0x24/0x100 exit_to_user_mode_loop+0x87/0x380 do_syscall_64+0x2ff/0x11e0 entry_SYSCALL_64_after_hwframe+0x76/0x7e The global static blocked domain is a dummy domain without corresponding dmar_domain structure, accessing beyond iommu_domain structure triggers oops easily. Fix it by return early in domain_remove_dev_pasid() like identity domain. Fixes: 7d0c9da ("iommu/vt-d: Add set_dev_pasid callback for dma domain") Cc: stable@vger.kernel.org Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20260421031347.1408890-1-zhenzhong.duan@intel.com Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

__team_change_mode() clears team->ops with memset() before restoring safe dummy handlers via team_adjust_ops(). A concurrent team_xmit() running under RCU on another CPU can read team->ops.transmit during this window and call a NULL function pointer, crashing the kernel. The race requires a mode change (CAP_NET_ADMIN) concurrent with transmit on the team device. BUG: kernel NULL pointer dereference, address: 0000000000000000 Oops: 0010 [#1] SMP KASAN NOPTI RIP: 0010:0x0 Call Trace: team_xmit (drivers/net/team/team_core.c:1853) dev_hard_start_xmit (net/core/dev.c:3904) __dev_queue_xmit (net/core/dev.c:4871) packet_sendmsg (net/packet/af_packet.c:3109) __sys_sendto (net/socket.c:2265) The original code assumed that no ports means no traffic, so mode changes could freely memset()/memcpy() the ops. AF_PACKET with forced carrier breaks that assumption. Prevent the race instead of making it safe: replace memset()/memcpy() with per-field updates that never touch transmit or receive. Those two handlers are managed solely by team_adjust_ops(), which already installs dummies when tx_en_port_count == 0 (always true during mode change since no ports are present). WRITE_ONCE/READ_ONCE prevent store/load tearing on the handler pointers. synchronize_net() before exit_op() drains in-flight readers that may still reference old mode state from before port removal switched the handlers to dummies. Fixes: 3d249d4 ("net: introduce ethernet teaming device") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://patch.msgid.link/20260521081159.1491563-3-bestswngs@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Since the introduction of the netlink configuration path for bridge ports in commit 25c71c7 ("bridge: bridge port parameters over netlink"), br_setport() was always called with the bridge lock held around it. Back then this decision made sense: The bridge lock protects the STP state of the bridge and its ports and at that time the function only processed three STP related netlink attributes (cost, priority and state). Nowadays, br_setport() processes a lot more attributes and most of them do not need the bridge lock: * Bridge flags: Only require RTNL. Read locklessly by the data path. Annotations can be added in net-next. * FDB port flushing: Only requires the FDB lock. * Multicast attributes: Only require the multicast lock. * Group forward mask: Only requires RTNL. Read locklessly by the data path. Annotations can be added in net-next. * Backup port and NHID: Only require RTNL. Read locklessly by the data path. This is a problem as the bridge calls dev_set_promiscuity() when certain bridge port flags change and this function can sleep since the commit cited below, resulting in a splat such as [1]. Fix this by reducing the scope of the bridge lock and only take it when processing the three STP related attributes that require it. This is consistent with the multicast attributes where each attribute acquires the multicast lock instead of having one critical section for all relevant attributes. [1] BUG: sleeping function called from invalid context at net/core/dev_addr_lists.c:1262 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 356, name: bridge preempt_count: 201, expected: 0 RCU nest depth: 0, expected: 0 2 locks held by bridge/356: #0: ffffffff919473a0 (rtnl_mutex){+.+.}-{4:4}, at: rtnetlink_rcv_msg (net/core/rtnetlink.c:80 net/core/rtnetlink.c:7002) #1: ffff888115072d58 (&br->lock){+...}-{3:3}, at: br_setlink (./include/linux/spinlock.h:348 net/bridge/br_netlink.c:1117) Preemption disabled at: 0x0 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:94 lib/dump_stack.c:120) __might_resched.cold (kernel/sched/core.c:9163) netif_rx_mode_run (net/core/dev_addr_lists.c:1262) netif_rx_mode_sync (net/core/dev_addr_lists.c:1428) dev_set_promiscuity (net/core/dev_api.c:289) br_manage_promisc (net/bridge/br_if.c:135 net/bridge/br_if.c:172) br_port_flags_change (net/bridge/br_if.c:242 net/bridge/br_if.c:747) br_setport (net/bridge/br_netlink.c:1000) br_setlink (net/bridge/br_netlink.c:1118) rtnl_bridge_setlink (net/core/rtnetlink.c:5572) rtnetlink_rcv_msg (net/core/rtnetlink.c:7005) netlink_rcv_skb (net/netlink/af_netlink.c:2550) netlink_unicast (net/netlink/af_netlink.c:1318 net/netlink/af_netlink.c:1344) netlink_sendmsg (net/netlink/af_netlink.c:1894) __sock_sendmsg (net/socket.c:787 (discriminator 4) net/socket.c:802 (discriminator 4)) ____sys_sendmsg (net/socket.c:2698) ___sys_sendmsg (net/socket.c:2752) __sys_sendmsg (net/socket.c:2784) do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121) Fixes: 78cd408 ("net: add missing instance lock to dev_set_promiscuity") Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260526064818.272516-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Since the start of the git history, brport_store() always acquired the bridge lock. Back then this decision made sense: The bridge lock protects the STP state of the bridge and its ports and at that time the function was only used by two STP related attributes (cost and priority). Nowadays, brport_store() processes a lot more attributes and most of them do not need the bridge lock: * Bridge flags: Only require RTNL. Read locklessly by the data path. Annotations can be added in net-next. * FDB port flushing: Only requires the FDB lock. * Multicast attributes: Only require the multicast lock. * Group forward mask: Only requires RTNL. Read locklessly by the data path. Annotations can be added in net-next. * Backup port: Only requires RTNL. Read locklessly by the data path. This is a problem as the bridge calls dev_set_promiscuity() when certain bridge port flags change and this function can sleep since the commit cited below, resulting in a splat such as [1]. Fix this by reducing the scope of the bridge lock and only take it when processing the two STP related attributes that require it. Remove the now stale comment from br_switchdev_set_port_flag(). The SWITCHDEV_F_DEFER flag can be removed in net-next. [1] BUG: sleeping function called from invalid context at net/core/dev_addr_lists.c:1262 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 372, name: bash preempt_count: 201, expected: 0 RCU nest depth: 0, expected: 0 5 locks held by bash/372: #0: ffff88810c51c3f0 (sb_writers#7){.+.+}-{0:0}, at: ksys_write (fs/read_write.c:740) #1: ffff888115ce9480 (&of->mutex){+.+.}-{4:4}, at: kernfs_fop_write_iter (fs/kernfs/file.c:343) #2: ffff88810b9fd330 (kn->active#37){.+.+}-{0:0}, at: kernfs_fop_write_iter (fs/kernfs/file.c:80 fs/kernfs/file.c:344) #3: ffffffffa59473a0 (rtnl_mutex){+.+.}-{4:4}, at: brport_store (net/bridge/br_sysfs_if.c:326) #4: ffff8881099d2d58 (&br->lock){+...}-{3:3}, at: brport_store (./include/linux/spinlock.h:348 net/bridge/br_sysfs_if.c:345) Preemption disabled at: 0x0 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:94 lib/dump_stack.c:120) __might_resched.cold (kernel/sched/core.c:9163) netif_rx_mode_run (net/core/dev_addr_lists.c:1262) netif_rx_mode_sync (net/core/dev_addr_lists.c:1428) dev_set_promiscuity (net/core/dev_api.c:289) br_manage_promisc (net/bridge/br_if.c:135 net/bridge/br_if.c:172) br_port_flags_change (net/bridge/br_if.c:242 net/bridge/br_if.c:747) store_learning (net/bridge/br_sysfs_if.c:79 net/bridge/br_sysfs_if.c:235) brport_store (net/bridge/br_sysfs_if.c:346) kernfs_fop_write_iter (fs/kernfs/file.c:352) new_sync_write (fs/read_write.c:595) vfs_write (fs/read_write.c:688) ksys_write (fs/read_write.c:740) do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121) Fixes: 78cd408 ("net: add missing instance lock to dev_set_promiscuity") Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260526064818.272516-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Ido Schimmel says: ==================== bridge: Fix sleep in atomic context Under certain circumstances the bridge driver can call dev_set_promiscuity() while holding the bridge spin lock. This is a problem as dev_set_promiscuity() might sleep. Patches #1-#2 fix the problem in the netlink and sysfs configuration paths by only taking the lock where it is actually needed, thereby avoiding calling dev_set_promiscuity() from an atomic context. Patch #3 adds test cases for both configuration paths in rtnetlink.sh which already includes test cases for similar issues. Note that dev_set_promiscuity() can sleep either when it takes the net device mutex or when calling netif_rx_mode_sync(). I encountered the problem with the latter, but blamed the former since it came earlier. ==================== Link: https://patch.msgid.link/20260526064818.272516-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

…k overflow tcf_mirred_act() checks sched_mirred_nest against MIRRED_NEST_LIMIT (4) to prevent deep recursion. However, when the action uses blockcast (tcfm_blockid != 0), the function returns at the tcf_blockcast() call BEFORE reaching the counter increment. As a result, the recursion counter never advances and the limit check is entirely bypassed. When two devices share a TC egress block with a mirred blockcast rule, a packet egressing on device A is mirrored to device B via blockcast; device B's egress TC re-enters tcf_mirred_act() via blockcast and mirrors back to A, creating an unbounded recursion loop: tcf_mirred_act -> tcf_blockcast -> tcf_mirred_to_dev -> dev_queue_xmit -> sch_handle_egress -> tcf_classify -> tcf_mirred_act -> (repeat) This recursion continues until the kernel stack overflows. The bug is reachable from an unprivileged user via unshare(CLONE_NEWUSER | CLONE_NEWNET): user namespaces grant CAP_NET_ADMIN in the new network namespace, which is sufficient to create dummy devices, attach clsact qdiscs with shared blocks, and install mirred blockcast filters. BUG: TASK stack guard page was hit at ffffc90000b7fff8 Oops: stack guard page: 0000 [#1] SMP KASAN NOPTI CPU: 2 UID: 1000 PID: 169 Comm: poc Not tainted 7.0.0-rc7-next-20260410 RIP: 0010:xas_find+0x17/0x480 Call Trace: xa_find+0x17b/0x1d0 tcf_mirred_act+0x640/0x1060 tcf_action_exec+0x400/0x530 basic_classify+0x128/0x1d0 tcf_classify+0xd83/0x1150 tc_run+0x328/0x620 __dev_queue_xmit+0x797/0x3100 tcf_mirred_to_dev+0x7b1/0xf70 tcf_mirred_act+0x68a/0x1060 [repeating ~30+ times until stack overflow] Kernel panic - not syncing: Fatal exception in interrupt Fix this by incrementing sched_mirred_nest before calling tcf_blockcast() and decrementing it on return, mirroring the non-blockcast path. This ensures subsequent recursive entries see the updated counter and are correctly limited by MIRRED_NEST_LIMIT. Fixes: fe946a7 ("net/sched: act_mirred: add loop detection") Signed-off-by: Kito Xu (veritas501) <hxzene@gmail.com> Link: https://patch.msgid.link/20260525122556.973584-7-jhs@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>

TLDR: The bo->ttm object might be changed by calling ttm_bo_validate(), move casting it to an i915_tt object later to actually get the right pointer. A user reported hitting the following bug under heavy use on DG2: [26620.095550] Oops: general protection fault, probably for non-canonical address 0xa56b6b6b6b6b6b8b: 0000 1 SMP NOPTI [26620.095556] CPU: 2 UID: 0 PID: 631 Comm: Xorg Not tainted 6.18.8 #1 PREEMPT(lazy) [26620.095558] Hardware name: ASRock B850M Steel Legend WiFi/B850M Steel Legend WiFi, BIOS 3.50 09/18/2025 [26620.095559] RIP: 0010:i915_ttm_purge+0x84/0x100 [i915] [26620.095604] Code: 00 00 00 48 8d 54 24 10 48 89 e6 48 89 fb e8 83 aa ae ff 85 c0 75 6f 48 83 bb a8 01 00 00 00 74 2c 48 8b 45 78 48 85 c0 74 23 <48> 8b 78 20 48 c7 c2 ff ff ff ff 31 f6 e8 7a 73 e3 e0 48 8b 7d 78 [26620.095605] RSP: 0018:ffffc90005fd7430 EFLAGS: 00010282 [26620.095607] RAX: a56b6b6b6b6b6b6b RBX: ffff8881f46c3dc0 RCX: 0000000000000000 [26620.095608] RDX: 0000000000000000 RSI: 0000000000000246 RDI: 00000000ffffffff [26620.095609] RBP: ffff888289610f00 R08: 0000000000000001 R09: ffff88823b022000 [26620.095609] R10: ffff888103029b28 R11: ffff8881fc7f3800 R12: ffff88810b6150d0 [26620.095609] R13: ffff888289610f00 R14: 0000000000000000 R15: ffff8881f46c3dc0 [26620.095610] FS: 00007f1004d86900(0000) GS:ffff88901c858000(0000) knlGS:0000000000000000 [26620.095611] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [26620.095611] CR2: 00007f0fdf489000 CR3: 000000035b0c1000 CR4: 0000000000750ef0 [26620.095612] PKRU: 55555554 [26620.095612] Call Trace: [26620.095615] <TASK> [26620.095615] i915_ttm_move+0x2b9/0x420 [i915] [26620.095642] ? ttm_tt_init+0x65/0x80 [ttm] [26620.095644] ? i915_ttm_tt_create+0xc6/0x150 [i915] [26620.095667] ttm_bo_handle_move_mem+0xb6/0x160 [ttm] [26620.095669] ttm_bo_evict+0x100/0x150 [ttm] [26620.095671] ? preempt_count_add+0x64/0xa0 [26620.095673] ? _raw_spin_lock+0xe/0x30 [26620.095675] ? _raw_spin_unlock+0xd/0x30 [26620.095675] ? i915_gem_object_evictable+0xb7/0xd0 [i915] [26620.095704] ttm_bo_evict_cb+0x6e/0xd0 [ttm] [26620.095705] ttm_lru_walk_for_evict+0xa6/0x200 [ttm] [26620.095708] ttm_bo_alloc_resource+0x185/0x4f0 [ttm] [26620.095709] ? init_object+0x62/0xd0 [26620.095712] ttm_bo_validate+0x7a/0x180 [ttm] [26620.095713] ? _raw_spin_unlock_irqrestore+0x16/0x30 [26620.095714] __i915_ttm_get_pages+0xb0/0x170 [i915] [26620.095737] i915_ttm_get_pages+0x9f/0x150 [i915] [26620.095759] ? i915_gem_do_execbuffer+0xedc/0x2b40 [i915] [26620.095786] ? alloc_debug_processing+0xd0/0x100 [26620.095787] ? _raw_spin_unlock_irqrestore+0x16/0x30 [26620.095788] ? i915_vma_instance+0xa0/0x4e0 [i915] [26620.095822] __i915_gem_object_get_pages+0x2f/0x40 [i915] [26620.095848] i915_vma_pin_ww+0x706/0x980 [i915] [26620.095875] ? i915_gem_do_execbuffer+0xedc/0x2b40 [i915] [26620.095904] eb_validate_vmas+0x170/0xa00 [i915] [26620.095930] i915_gem_do_execbuffer+0x1201/0x2b40 [i915] [26620.095953] ? alloc_debug_processing+0xd0/0x100 [26620.095954] ? _raw_spin_unlock_irqrestore+0x16/0x30 [26620.095955] ? i915_gem_execbuffer2_ioctl+0xc9/0x240 [i915] [26620.095977] ? __wake_up_sync_key+0x32/0x50 [26620.095979] ? i915_gem_execbuffer2_ioctl+0xc9/0x240 [i915] [26620.096001] ? __slab_alloc.isra.0+0x67/0xc0 [26620.096003] i915_gem_execbuffer2_ioctl+0x11a/0x240 [i915] Results from decode_stacktrace.sh pointed to dereference of a file pointer field of a i915 TTM page vector container associated with an object being purged on eviction. That path is taken when the object is marked as no longer needed. Code analysis revealed a possibility of the i915 TTM page vector container being replaced with a new instance inside a function that purges content of the object, should it be still busy. That function is called, indirectly via a more general function that changes the object's placement and caching policy, before the problematic dereference, but still after a pointer to the container is captured, rendering the pointer no longer valid. Fix the issue by capturing the pointer to the container only after its potential replacement. v2: Move the container_of() inside the if block (Sebastian), - a simplified version of the commit description that explains briefly why the change is necessary (Christian). Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/work_items/14882 Fixes: 7ae0345 ("drm/i915/ttm: add tt shmem backend") Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Cc: stable@vger.kernel.org # v5.17+ Cc: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Sebastian Brzezinka <sebastian.brzezinka@intel.com> Cc: Christian König <christian.koenig@amd.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://lore.kernel.org/r/20260508122612.469227-2-janusz.krzysztofik@linux.intel.com (cherry picked from commit 4462966) Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net>

Prevent a crash from happening as the first serial port is initialised: Console: switching to colour frame buffer device 160x64 tgafb: SFB+ detected, rev=0x02 fb0: Digital ZLX-E1 frame buffer device at 0x1e000000 DECstation DZ serial driver version 1.04 CPU 0 Unable to handle kernel paging request at virtual address 000000bc, epc == 8048b3a4, ra == 80470a78 Oops[#1]: CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.19.0-dirty AsahiLinux#35 NONE $ 0 : 00000000 1000ac00 00000004 804707ac $ 4 : 00000000 80e20850 80e20858 81000030 $ 8 : 00000000 8072c81c 00000008 fefefeff $12 : 6c616972 00000006 80c5917f 69726420 $16 : 80e20800 00000000 808f8968 80e20800 $20 : 00000000 807f5a9 808b0094 808d3bc8 $24 : 00000018 80479030 $28 : 80c2e000 80c2fd70 00000069 80470a78 Hi : 00000004 Lo : 00000000 epc : 8048b3a4 __dev_fwnode+0x0/0xc ra : 80470a78 serial_base_ctrl_add+0xa0/0x168 Status: 1000ac04 IEp Cause : 30000008 (ExcCode 02) BadVA : 000000bc PrId : 00000220 (R3000) Modules linked in: Process swapper/0 (pid: 1, threadinfo=(ptrval), task=(ptrval), tls=00000000) Stack : 00400044 00400040 8046f4cc 00000000 808a6148 808a0000 808f8968 8086983c 808e0000 8046fc84 1000ac01 00000028 80e20700 802ba3f8 80e20700 80d34a94 80c1b900 80e20700 80e20700 80e20700 80e20700 80444650 00000000 00000000 00000000 807f5a9 808b0094 80447080 00400040 808e0000 80d34a94 808a6148 80d34a94 00000004 80e20700 00000000 8076974c 80469810 80c2fe3c 1000ac01 ... Call Trace: [<8048b3a4>] __dev_fwnode+0x0/0xc [<80470a78>] serial_base_ctrl_add+0xa0/0x168 [<8046fc84>] serial_core_register_port+0x1c8/0x974 [<808c6af0>] dz_init+0x74/0xc8 [<800470e0>] do_one_initcall+0x44/0x2d4 [<808b111c>] kernel_init_freeable+0x258/0x308 [<8072e434>] kernel_init+0x20/0x114 [<80049cd0>] ret_from_kernel_thread+0x14/0x1c Code: 27bd0018 03e00008 2402ffea <8c8200bc> 03e00008 00000000 27bdffc0 afbe0038 afb30024 ---[ end trace 0000000000000000 ]--- -- where a pointer is dereferenced that has been derived from a null pointer to the port's parent device. Since no device is available with legacy probing and it's not anymore a preferable way to discover devices anyway, switch the driver to using a platform device and use it as the port's parent device. Update resource handling accordingly and only request the actual span of addresses used within the slot, which will have had its resource already requested by generic platform device code. Use platform_driver_probe() not just because the DZ device is fixed with solder on board and not straightforward to remove, but foremost because the associated TTY's major device number is the same as used by the zs driver and the first driver to claim it will prevent the other one from using it. Either one DZ device or some SCC devices will be present in a given system but never both at a time, and therefore we want the major device number to be claimed by the first driver to actually successfully bind to its device and platform_driver_probe() is a way to fulfil that. An unfortunate consequence of the switch to a platform device is we now hand the console over from the bootconsole much later in the bootstrap. The firmware console handler appears good enough though to work so late and in particular with interrupts enabled. Conversely only starting the console port so late lets the reset code fully utilise our delay handlers, so switch from udelay() to fsleep() for transmitter draining so as to avoid busy-waiting for an excessive amount of time. Fixes: 84a9582 ("serial: core: Start managing serial controllers to enable runtime PM") Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Cc: stable@vger.kernel.org # needs to use .remove_new for <= 6.10 Link: https://patch.msgid.link/alpine.DEB.2.21.2605062326540.46195@angie.orcam.me.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Prevent a crash from happening as the first serial port is initialised: Console: switching to mono frame buffer device 160x64 fb0: PMAG-AA frame buffer device at tc0 DECstation Z85C30 serial driver version 0.10 CPU 0 Unable to handle kernel paging request at virtual address 0000002c, epc == 803ab00c, ra == 803aafe0 Oops[#1]: CPU: 0 PID: 1 Comm: swapper Not tainted 6.4.0-rc3-00031-g84a9582fd203-dirty AsahiLinux#57 $ 0 : 00000000 10012c00 803aaeb0 00000000 $ 4 : 80e12f60 80e12f50 80e12f58 81000030 $ 8 : 00000000 805ff37c 00000000 33433538 $12 : 65732030 00000006 80c2915d 6c616972 $16 : 80e12f00 807b7630 00000000 00000000 $20 : 00000004 00000348 000001a0 807623b8 $24 : 00000018 00000000 $28 : 80c24000 80c25d6 8078b148 803aafe0 Hi : 00000000 Lo : 00000000 epc : 803ab00c serial_base_ctrl_add+0x78/0xf4 ra : 803aafe0 serial_base_ctrl_add+0x4c/0xf4 Status: 10012c0 KERNEL EXL IE Cause : 00000008 (ExcCode 02) BadVA : 0000002c PrId : 00000440 (R4400SC) Modules linked in: Process swapper (pid: 1, threadinfo=(ptrval), task=(ptrval), tls=00000000) Stack : 80760000 00000cc0 00400044 00400040 803aa02c 80d61ab8 00000000 807b7630 80760000 807623b8 807b7628 803aa644 80386998 00000000 80e17780 80220f68 80e17780 80d61ab8 80c17d80 80e17780 80e17780 8063c798 80e17780 80383fa0 00000010 80e17780 00000000 80386998 807a0000 00000000 00400040 8038f848 807623b8 80d61ab8 00000004 80e17780 00000000 803a68e4 80c25e2c 803bb884 ... Call Trace: [<803ab00c>] serial_base_ctrl_add+0x78/0xf4 [<803aa644>] serial_core_register_port+0x174/0x69c [<8077e9ac>] zs_init+0xc8/0xfc [<800404d4>] do_one_initcall+0x40/0x2ac [<8076cecc>] kernel_init_freeable+0x1e4/0x270 [<80605bec>] kernel_init+0x20/0x108 [<800431e8>] ret_from_kernel_thread+0x14/0x1c Code: 2442aeb ae120024 ae0200d0 <8c67002c> 50e00001 8c670000 3c06806 3c05806e afb30010 ---[ end trace 0000000000000000 ]--- (report at the offending commit) -- where a pointer is dereferenced that has been derived from a null pointer to the port's parent device. Since no device is available with legacy probing and it's not anymore a preferable way to discover devices anyway, switch the driver to using a platform device and use it as the port's parent device. Update resource handling accordingly and only request the actual span of addresses used within the slot, which will have had its resource already requested by generic platform device code. Use platform_driver_probe() not just because SCC devices are fixed with solder on board and not straightforward to remove, but foremost because the associated TTY's major device number is the same as used by the dz driver and the first driver to claim it will prevent the other one from using it. Either one DZ device or some SCC devices will be present in a given system but never both at a time, and therefore we want the major device number to be claimed by the first driver to actually successfully bind to its device and platform_driver_probe() is a way to fulfil that. An unfortunate consequence of the switch to a platform device is we now hand the console over from the bootconsole much later in the bootstrap. The firmware console handler appears good enough though to work so late and in particular with interrupts enabled. Since there is one way only remaining to reach zs_reset() now, remove the port initialisation marker as no longer needed and go through the channel reset unconditionally. Fixes: 84a9582 ("serial: core: Start managing serial controllers to enable runtime PM") Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Cc: stable@vger.kernel.org # needs to use .remove_new for <= 6.10 Link: https://patch.msgid.link/alpine.DEB.2.21.2605062328480.46195@angie.orcam.me.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

With CONFIG_CALL_DEPTH_TRACKING enabled on an x86 retbleed-affected platform (eg: Skylake), with retbleed=stuff, registering a dynamic ftrace trampoline crashes on the first call into the traced function: BUG: unable to handle page fault for address: ffff88817ae18880 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 4b53067 P4D 4b53067 PUD 0 Oops: Oops: 0002 [#1] SMP PTI CPU: 3 UID: 0 PID: 187 Comm: usleep Not tainted 7.0.10 AsahiLinux#243 PREEMPT(full) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-2-2 04/01/2014 Code: 24 78 00 00 00 00 48 89 ea 48 89 54 24 20 48 8b b4 24 b8 00 00 00 48 8b bc 24 b0 00 00 00 48 89 bc 24 80 00 00 00 48 83 ef 05 <65> 48 c1 3d 1f a8 b6 02 05 48 8b 15 f6 00 00 00 4c 89 3c 24 4c 89 Call Trace: <TASK> ? find_held_lock ? exc_page_fault ? lock_release ? __x64_sys_clock_nanosleep ? lockdep_hardirqs_on_prepare ? trace_hardirqs_on __x64_sys_clock_nanosleep do_syscall_64 ? exc_page_fault ? call_depth_return_thunk entry_SYSCALL_64_after_hwframe ... Kernel panic - not syncing: Fatal exception This small reproducer allows to easily trigger the crash: # echo 'p __x64_sys_clock_nanosleep' > /sys/kernel/tracing/kprobe_events # echo 1 > /sys/kernel/tracing/events/kprobes/p___x64_sys_clock_nanosleep_0/enable # usleep 1 Monitoring the crash under GDB points to the exact instruction in charge of incrementing the call depth: sarq $5, %gs:__x86_call_depth(%rip) This instruction matches the one inserted by the ftrace_regs_caller from ftrace_64.S. This emitted code was likely working fine until the introduction of 59bec00 ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()"): it has made the call depth accounting addressing relative to $rip, instead of being based on an absolute address. As this code exact location depends on where the trampoline lives in memory, the corresponding displacement needs to be adjusted at runtime to actually correctly find the per-cpu __x86_call_depth value, otherwise the targeted address is wrong, leading to the page fault seen above. Fix the %rip-relative displacement of the copied CALL_DEPTH_ACCOUNT instruction (from ftrace_regs_caller) by calling text_poke_apply_relocation(), as it is done for example by the x86 BPF JIT compiler through x86_call_depth_emit_accounting(). This corrects both CALL_DEPTH_ACCOUNT slots, in ftrace_caller and ftrace_regs_caller. [ bp: Massage. ] Fixes: 59bec00 ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()") Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: <stable@kernel.org> Link: https://patch.msgid.link/20260527-fix_call_depth_in_trampoline-v1-1-1c1abc8ae310@bootlin.com

commit 4b83cbc upstream. session_fd_check() walks the per-inode m_op_list during durable-handle session teardown and sets op->conn = NULL for every opinfo whose conn matched the closing session's connection. The matching opinfo, however, stays linked in its per-ClientGuid lease_table_list entry's lb->lease_list because destroy_lease_table() only runs on full TCP-connection teardown, not on SESSION_LOGOFF. If the same TCP connection then negotiates a fresh session with the same ClientGuid (ClientGuid is bound to NEGOTIATE, not the session, and is unchanged across LOGOFF + SETUP) and issues a SMB2 CREATE with a lease context on a different inode, find_same_lease_key() walks lb->lease_list, reaches the stale opinfo, and calls compare_guid_key(), which unconditionally dereferences opinfo->conn->ClientGUID. The conn pointer is NULL and the kernel panics. Reproducer requires only a successful SMB2 SESSION_SETUP and a share configured with 'durable handles = yes'. KASAN report on mainline 7039050: general protection fault, probably for non-canonical address 0xdffffc0000000069: 0000 [#1] SMP KASAN PTI KASAN: null-ptr-deref in range [0x0000000000000348-0x000000000000034f] Workqueue: ksmbd-io handle_ksmbd_work RIP: 0010:bcmp+0x5b/0x230 Call Trace: compare_guid_key+0x4b/0xd0 find_same_lease_key+0x324/0x690 smb2_open+0x6aea/0x8e60 handle_ksmbd_work+0x796/0xee0 ... Faulting address 0x348 is the offset of ClientGUID within struct ksmbd_conn, confirming opinfo->conn was NULL. Read opinfo->conn once and bail out if it has been cleared by a concurrent session_fd_check(). A half-detached opinfo cannot be the owner of an active lease, so returning 0 is the correct match result. Fixes: c8efcc7 ("ksmbd: add support for durable handles v1/v2") Cc: stable@vger.kernel.org Signed-off-by: Jeremy Laratro <research@aradex.io> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 9049015 upstream. When a SMB2 client opens a file with a durable v2 handle and then issues SMB2 SESSION_LOGOFF, session_fd_check() clears fp->tcon = NULL on the reconnectable file pointer but leaves the fp registered in global_ft.idr until the durable scavenger fires (up to fp->durable_timeout seconds later). During that window any read of /proc/fs/ksmbd/files (mode 0400) panics the kernel because proc_show_files() walks global_ft.idr and unconditionally dereferences fp->tcon->id with no NULL guard. Reproducer requires only a successful SMB2 SESSION_SETUP and a share configured with 'durable handles = yes'. KASAN report on mainline 7039050: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN PTI KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] RIP: 0010:proc_show_files+0x118/0x740 Call Trace: proc_show_files+0x118/0x740 seq_read_iter+0x4ef/0xe10 proc_reg_read_iter+0x1b7/0x280 ... Guard the dereference. A durable-disconnected fp legitimately has no tcon; report its tree id as 0 rather than oopsing. Fixes: b38f99c ("ksmbd: add procfs interface for runtime monitoring and statistics") Cc: stable@vger.kernel.org Signed-off-by: Jeremy Laratro <research@aradex.io> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 4412634 upstream. Booting with "nopcid" clears X86_FEATURE_PCID and keeps CR4.PCIDE from being set to one. On AMD CPUs that support INVLPGB, broadcast TLB flushing remains enabled. There are two checks that decide whether the global ASID code runs, mm_global_asid() and consider_global_asid(), that key off of the X86_FEATURE_INVLPGB feature. Once an mm becomes active on more than three CPUs, consider_global_asid() assigns it a global ASID, after which flush_tlb_mm_range() takes the broadcast_tlb_flush() path using a non-zero PCID. Issuing an INVLPGB with a non-zero PCID while CR4.PCIDE is not set results in a #GP: Oops: general protection fault, kernel NULL pointer dereference 0x1: 0000 [#1] SMP NOPTI CPU: 158 UID: 0 PID: 3119 Comm: snap Not tainted 7.1.0-rc3 #1 PREEMPT(full) Hardware name: ... RIP: 0010:broadcast_tlb_flush Code: ... 89 da 48 83 c8 07 <0f> 01 fe eb 08 cc cc cc ... Call Trace: <TASK> flush_tlb_mm_range ptep_clear_flush wp_page_copy ? _raw_spin_unlock __handle_mm_fault handle_mm_fault do_user_addr_fault exc_page_fault asm_exc_page_fault All processors that support broadcast TLB invalidation also have PCID support, so it is only the "nopcid" scenario that is of concern. In this situation just disable the broadcast TLB support using the CPUID dependency support by making X86_FEATURE_INVLPGB dependent on X86_FEATURE_PCID. [ bp: Massage commit message. ] Fixes: 4afeb0e ("x86/mm: Enable broadcast TLB invalidation for multi-threaded processes") Suggested-by: Dave Hansen <dave.hansen@intel.com> Assisted-by: Claude:claude-opus-4.7 Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Rik van Riel <riel@surriel.com> Cc: <stable@kernel.org> Link: https://patch.msgid.link/b915acfd63e8b2a094fdeb8dc608738072518764.1779296450.git.thomas.lendacky@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 3392291 upstream. With PROVE_LOCKING on an Snapdragon X1 and VM reclaim pressure, we see: ====================================================== WARNING: possible circular locking dependency detected 7.0.0-debug+ AsahiLinux#43 Tainted: G W ------------------------------------------------------ kswapd0/82 is trying to acquire lock: ffff800080ec3870 (reservation_ww_class_acquire){+.+.}-{0:0}, at: msm_gem_shrinker_scan+0x17c/0x400 [msm] but task is already holding lock: ffffc31709b263b8 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x88/0x988 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (fs_reclaim){+.+.}-{0:0}: __lock_acquire+0x4d0/0xad0 lock_acquire.part.0+0xc4/0x248 lock_acquire+0x8c/0x248 fs_reclaim_acquire+0xd0/0xf0 dma_resv_lockdep+0x224/0x348 do_one_initcall+0x84/0x5d0 do_initcalls+0x194/0x1d8 kernel_init_freeable+0x128/0x180 kernel_init+0x2c/0x160 ret_from_fork+0x10/0x20 -> #1 (reservation_ww_class_mutex){+.+.}-{4:4}: __lock_acquire+0x4d0/0xad0 lock_acquire.part.0+0xc4/0x248 lock_acquire+0x8c/0x248 dma_resv_lockdep+0x1a8/0x348 do_one_initcall+0x84/0x5d0 do_initcalls+0x194/0x1d8 kernel_init_freeable+0x128/0x180 kernel_init+0x2c/0x160 ret_from_fork+0x10/0x20 -> #0 (reservation_ww_class_acquire){+.+.}-{0:0}: check_prev_add+0x114/0x790 validate_chain+0x594/0x6f0 __lock_acquire+0x4d0/0xad0 lock_acquire.part.0+0xc4/0x248 lock_acquire+0x8c/0x248 drm_gem_lru_scan+0x1ac/0x440 msm_gem_shrinker_scan+0x17c/0x400 [msm] do_shrink_slab+0x150/0x4a0 shrink_slab+0x144/0x460 shrink_one+0x9c/0x1b0 shrink_many+0x27c/0x5c0 shrink_node+0x344/0x550 balance_pgdat+0x2c0/0x988 kswapd+0x11c/0x318 kthread+0x10c/0x128 ret_from_fork+0x10/0x20 other info that might help us debug this: Chain exists of: reservation_ww_class_acquire --> reservation_ww_class_mutex --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(reservation_ww_class_mutex); lock(fs_reclaim); lock(reservation_ww_class_acquire); *** DEADLOCK *** 1 lock held by kswapd0/82: #0: ffffc31709b263b8 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x88/0x988 stack backtrace: CPU: 4 UID: 0 PID: 82 Comm: kswapd0 Tainted: G W 7.0.0-debug+ AsahiLinux#43 PREEMPT(full) Tainted: [W]=WARN Hardware name: LENOVO 21BX0016US/21BX0016US, BIOS N3HET94W (1.66 ) 09/15/2025 Call trace: show_stack+0x20/0x40 (C) dump_stack_lvl+0x9c/0xd0 dump_stack+0x18/0x30 print_circular_bug+0x114/0x120 check_noncircular+0x178/0x198 check_prev_add+0x114/0x790 validate_chain+0x594/0x6f0 __lock_acquire+0x4d0/0xad0 lock_acquire.part.0+0xc4/0x248 lock_acquire+0x8c/0x248 drm_gem_lru_scan+0x1ac/0x440 msm_gem_shrinker_scan+0x17c/0x400 [msm] do_shrink_slab+0x150/0x4a0 shrink_slab+0x144/0x460 shrink_one+0x9c/0x1b0 shrink_many+0x27c/0x5c0 shrink_node+0x344/0x550 balance_pgdat+0x2c0/0x988 kswapd+0x11c/0x318 kthread+0x10c/0x128 ret_from_fork+0x10/0x20 kswapd0 holding fs_reclaim calls the MSM shrinker, which calls dma_resv_lock. This in turn acquires fs_reclaim. Fix this deadlock by using dma_resv_trylock() instead, dropping the subsequently unused passed wait-wound lock 'ticket'. Cc: stable@vger.kernel.org Signed-off-by: Daniel J Blueman <daniel@quora.org> Fixes: fe4952b ("drm/msm: Convert vm locking") Patchwork: https://patchwork.freedesktop.org/patch/723564/ Message-ID: <20260508065722.18785-1-daniel@quora.org> [rob: fixup compile errors, replace lockdep splat with something legible] Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

[ Upstream commit 90d77b3 ] Starting with commit bdb249f ("ARM: integrator: read counter using syscon/regmap"), intcp_init_early calls syscon_regmap_lookup_by_compatible which in turn calls of_syscon_register. This function allocates memory. Since the memory management code has not been initialized at that time, the call always fails. It either returns -ENOMEM or crashes as follows. Unable to handle kernel NULL pointer dereference at virtual address 0000000c when read [0000000c] *pgd=00000000 Internal error: Oops: 5 [#1] ARM Modules linked in: CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc5-00026-g5fcc9bf84ee5 #1 PREEMPT Hardware name: ARM Integrator/CP (Device Tree) PC is at __kmalloc_cache_noprof+0xec/0x39c LR is at __kmalloc_cache_noprof+0x34/0x39c ... Call trace: __kmalloc_cache_noprof from of_syscon_register+0x7c/0x310 of_syscon_register from device_node_get_regmap+0xa4/0xb0 device_node_get_regmap from intcp_init_early+0xc/0x40 intcp_init_early from start_kernel+0x60/0x688 start_kernel from 0x0 The crash is seen due to a dereferenced pointer which is not supposed to be NULL but is NULL if the memory management subsystem has not been initialized. The crash is not seen with all versions of gcc. Some versions such as gcc 9.x apparently do not dereference the pointer, presumably if tracing is disabled. The problem has been reproduced with gcc 10.x, 11.x, and 13.x. Either case, if the crash is not seen, the call to syscon_regmap_lookup_by_compatible returns -ENOMEM, and sched_clock_register is never called. Fix the problem by moving the early initialization code into the standard machine initialization code. Fixes: bdb249f ("ARM: integrator: read counter using syscon/regmap") Cc: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/20250518164118.3859567-1-linux@roeck-us.net Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Link: https://lore.kernel.org/r/20260505-integrator-fixes-v1-1-56ab9aac59db@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Sasha Levin <sashal@kernel.org>

…ile() [ Upstream commit c73370c ] The trace event btrfs_sync_file() is called in an atomic context (all trace events are) and its call to dput(), which is needed due to the call to dget_parent(), can sleep, triggering a kernel splat. This can be reproduced by enabling the trace event and running btrfs/056 from fstests for example. The splat shown in dmesg is the following: [53.919] BUG: sleeping function called from invalid context at fs/dcache.c:970 [53.947] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 32773, name: xfs_io [53.988] preempt_count: 2, expected: 0 [53.967] RCU nest depth: 0, expected: 0 [53.943] Preemption disabled at: [53.944] [<0000000000000000>] 0x0 [54.078] CPU: 0 UID: 0 PID: 32773 Comm: xfs_io Tainted: G W 7.1.0-rc1-btrfs-next-232+ #1 PREEMPT(full) [54.070] Tainted: [W]=WARN [54.071] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [54.072] Call Trace: [54.074] <TASK> [54.076] dump_stack_lvl+0x56/0x80 [54.079] __might_resched.cold+0xd6/0x10f [54.072] dput.part.0+0x24/0x110 [54.078] trace_event_raw_event_btrfs_sync_file+0x75/0x140 [btrfs] [54.089] btrfs_sync_file+0x1ed/0x530 [btrfs] [54.087] ? __handle_mm_fault+0x8ae/0xed0 [54.089] btrfs_do_write_iter+0x172/0x210 [btrfs] [54.091] vfs_write+0x21f/0x450 [54.094] __x64_sys_pwrite64+0x8d/0xc0 [54.096] ? do_user_addr_fault+0x20c/0x670 [54.099] do_syscall_64+0x60/0xf20 [54.092] ? clear_bhb_loop+0x60/0xb0 [54.094] entry_SYSCALL_64_after_hwframe+0x76/0x7e So stop using dget_parent() and dput() and access the parent dentry directly as dentry->d_parent. This is also what ext4 is doing in its equivalent trace event ext4_sync_file_enter(). Fixes: a85b46d ("btrfs: tracepoints: get correct superblock from dentry in event btrfs_sync_file()") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>

[ Upstream commit dc7832d ] The multiple runs of generic/013 test-case is capable to reproduce a kernel BUG at mm/filemap.c:1504 with probability of 30%. while true; do sudo ./check generic/013 done [ 9849.452376] page: refcount:3 mapcount:0 mapping:00000000e58ff252 index:0x10781 pfn:0x1c322 [ 9849.452412] memcg:ffff8881a1915800 [ 9849.452417] aops:ceph_aops ino:1000058db9e dentry name(?):"f9XXXXXX" [ 9849.452432] flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff) [ 9849.452441] raw: 0017ffffc0000000 0000000000000000 dead000000000122 ffff88816110d248 [ 9849.452445] raw: 0000000000010781 0000000000000000 00000003ffffffff ffff8881a1915800 [ 9849.452447] page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio)) [ 9849.452474] ------------[ cut here ]------------ [ 9849.452476] kernel BUG at mm/filemap.c:1504! [ 9849.478635] Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI [ 9849.481772] CPU: 2 UID: 0 PID: 84223 Comm: fsstress Not tainted 7.0.0-rc1+ AsahiLinux#18 PREEMPT(full) [ 9849.482881] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-9.fc43 06/1 0/2025 [ 9849.484539] RIP: 0010:folio_unlock+0x85/0xa0 [ 9849.485076] Code: 89 df 31 f6 e8 1c f3 ff ff 48 8b 5d f8 c9 31 c0 31 d2 31 f6 31 ff c3 cc cc cc cc 48 c7 c6 80 6c d9 a7 48 89 df e8 4b b3 10 00 <0f> 0b 48 89 df e8 21 e6 2c 00 eb 9d 0f 1f 40 00 66 66 2e 0f 1f 84 [ 9849.493818] RSP: 0018:ffff8881bb8076b0 EFLAGS: 00010246 [ 9849.495740] RAX: 0000000000000000 RBX: ffffea00070c8980 RCX: 0000000000000000 [ 9849.498678] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 9849.500559] RBP: ffff8881bb8076b8 R08: 0000000000000000 R09: 0000000000000000 [ 9849.501097] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000010782000 [ 9849.502108] R13: ffff8881935de738 R14: ffff88816110d010 R15: 0000000000001000 [ 9849.502516] FS: 00007e36cbe94740(0000) GS:ffff88824a899000(0000) knlGS:0000000000000000 [ 9849.502996] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9849.503810] CR2: 000000c0002b0000 CR3: 000000011bbf6004 CR4: 0000000000772ef0 [ 9849.504459] PKRU: 55555554 [ 9849.504626] Call Trace: [ 9849.505242] <TASK> [ 9849.505379] netfs_write_begin+0x7c8/0x10a0 [ 9849.505877] ? __kasan_check_read+0x11/0x20 [ 9849.506384] ? __pfx_netfs_write_begin+0x10/0x10 [ 9849.507178] ceph_write_begin+0x8c/0x1c0 [ 9849.507934] generic_perform_write+0x391/0x8f0 [ 9849.508503] ? __pfx_generic_perform_write+0x10/0x10 [ 9849.509062] ? file_update_time_flags+0x19a/0x4b0 [ 9849.509581] ? ceph_get_caps+0x63/0xf0 [ 9849.510259] ? ceph_get_caps+0x63/0xf0 [ 9849.510530] ceph_write_iter+0xe79/0x1ae0 [ 9849.511282] ? __pfx_ceph_write_iter+0x10/0x10 [ 9849.511839] ? lock_acquire+0x1ad/0x310 [ 9849.512334] ? ksys_write+0xf9/0x230 [ 9849.512582] ? lock_is_held_type+0xaa/0x140 [ 9849.513128] vfs_write+0x512/0x1110 [ 9849.513634] ? __fget_files+0x33/0x350 [ 9849.513893] ? __pfx_vfs_write+0x10/0x10 [ 9849.514143] ? mutex_lock_nested+0x1b/0x30 [ 9849.514394] ksys_write+0xf9/0x230 [ 9849.514621] ? __pfx_ksys_write+0x10/0x10 [ 9849.514887] ? do_syscall_64+0x25e/0x1520 [ 9849.515122] ? __kasan_check_read+0x11/0x20 [ 9849.515366] ? trace_hardirqs_on_prepare+0x178/0x1c0 [ 9849.515655] __x64_sys_write+0x72/0xd0 [ 9849.515885] ? trace_hardirqs_on+0x24/0x1c0 [ 9849.516130] x64_sys_call+0x22f/0x2390 [ 9849.516341] do_syscall_64+0x12b/0x1520 [ 9849.516545] ? do_syscall_64+0x27c/0x1520 [ 9849.516783] ? do_syscall_64+0x27c/0x1520 [ 9849.517003] ? lock_release+0x318/0x480 [ 9849.517220] ? __x64_sys_io_getevents+0x143/0x2d0 [ 9849.517479] ? percpu_ref_put_many.constprop.0+0x8f/0x210 [ 9849.517779] ? entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 9849.518073] ? do_syscall_64+0x25e/0x1520 [ 9849.518291] ? __kasan_check_read+0x11/0x20 [ 9849.518519] ? trace_hardirqs_on_prepare+0x178/0x1c0 [ 9849.518799] ? do_syscall_64+0x27c/0x1520 [ 9849.519024] ? local_clock_noinstr+0xf/0x120 [ 9849.519262] ? entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 9849.519544] ? do_syscall_64+0x25e/0x1520 [ 9849.519781] ? __kasan_check_read+0x11/0x20 [ 9849.520008] ? trace_hardirqs_on_prepare+0x178/0x1c0 [ 9849.520273] ? do_syscall_64+0x27c/0x1520 [ 9849.520491] ? trace_hardirqs_on_prepare+0x178/0x1c0 [ 9849.520767] ? irqentry_exit+0x10c/0x6c0 [ 9849.520984] ? trace_hardirqs_off+0x86/0x1b0 [ 9849.521224] ? exc_page_fault+0xab/0x130 [ 9849.521472] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 9849.521766] RIP: 0033:0x7e36cbd14907 [ 9849.521989] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 [ 9849.523057] RSP: 002b:00007ffff2d2a968 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 9849.523484] RAX: ffffffffffffffda RBX: 000000000000e549 RCX: 00007e36cbd14907 [ 9849.523885] RDX: 000000000000e549 RSI: 00005bd797ec6370 RDI: 0000000000000004 [ 9849.524277] RBP: 0000000000000004 R08: 0000000000000047 R09: 00005bd797ec6370 [ 9849.524652] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000049 [ 9849.525062] R13: 0000000010781a37 R14: 00005bd797ec6370 R15: 0000000000000000 [ 9849.525447] </TASK> [ 9849.525574] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core pmt_telemetry pmt_discovery pmt_class intel_pmc_ssram_telemetry intel_vsec kvm_intel joydev kvm irqbypass ghash_clmulni_intel aesni_intel input_leds rapl mac_hid psmouse vga16fb serio_raw vgastate floppy i2c_piix4 bochs qemu_fw_cfg i2c_smbus pata_acpi sch_fq_codel rbd msr parport_pc ppdev lp parport efi_pstore [ 9849.529150] ---[ end trace 0000000000000000 ]--- [ 9849.529502] RIP: 0010:folio_unlock+0x85/0xa0 [ 9849.530813] Code: 89 df 31 f6 e8 1c f3 ff ff 48 8b 5d f8 c9 31 c0 31 d2 31 f6 31 ff c3 cc cc cc cc 48 c7 c6 80 6c d9 a7 48 89 df e8 4b b3 10 00 <0f> 0b 48 89 df e8 21 e6 2c 00 eb 9d 0f 1f 40 00 66 66 2e 0f 1f 84 [ 9849.534986] RSP: 0018:ffff8881bb8076b0 EFLAGS: 00010246 [ 9849.536198] RAX: 0000000000000000 RBX: ffffea00070c8980 RCX: 0000000000000000 [ 9849.537718] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 9849.539321] RBP: ffff8881bb8076b8 R08: 0000000000000000 R09: 0000000000000000 [ 9849.540862] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000010782000 [ 9849.542438] R13: ffff8881935de738 R14: ffff88816110d010 R15: 0000000000001000 [ 9849.543996] FS: 00007e36cbe94740(0000) GS:ffff88824b899000(0000) knlGS:0000000000000000 [ 9849.545854] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9849.547092] CR2: 00007e36cb3ff000 CR3: 000000011bbf6006 CR4: 0000000000772ef0 [ 9849.548679] PKRU: 55555554 The race sequence: 1. Read completes -> netfs_read_collection() runs 2. netfs_wake_rreq_flag(rreq, NETFS_RREQ_IN_PROGRESS, ...) 3. netfs_wait_for_read() returns -EFAULT to netfs_write_begin() 4. The netfs_unlock_abandoned_read_pages() unlocks the folio 5. netfs_write_begin() calls folio_unlock(folio) -> VM_BUG_ON_FOLIO() The key reason of the issue that netfs_unlock_abandoned_read_pages() doesn't check the flag NETFS_RREQ_NO_UNLOCK_FOLIO and executes folio_unlock() unconditionally. This patch implements in netfs_unlock_abandoned_read_pages() logic similar to netfs_unlock_read_folio(). Fixes: ee4cdf7 ("netfs: Speed up buffered reading") Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-8-dhowells@redhat.com Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: Ceph Development <ceph-devel@vger.kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>

[ Upstream commit 8582792 ] pin_user_pages_fast() can partially succeed and return the number of pages that were actually pinned. However, the bio_integrity_map_user() does not handle this partial pinning. This leads to a general protection fault since bvec_from_pages() dereferences an unpinned page address, which is 0. To fix this, add a check to verify that all requested memory is pinned. If partial pinning occurs, unpin the memory and return -EFAULT. Kernel Oops: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] CPU: 0 UID: 0 PID: 1061 Comm: nvme-passthroug Not tainted 7.0.0-11783-g90957f9314e8-dirty AsahiLinux#16 PREEMPT(lazy) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014 RIP: 0010:bio_integrity_map_user.cold+0x1b0/0x9d6 Fixes: 492c5d4 ("block: bio-integrity: directly map user buffers") Acked-by: Chao Shi <cshi008@fiu.edu> Acked-by: Weidong Zhu <weizhu@fiu.edu> Acked-by: Dave Tian <daveti@purdue.edu> Signed-off-by: Sungwoo Kim <iam@sung-woo.kim> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: linux-blktests/blktests#244 Link: https://patch.msgid.link/20260512050929.541397-2-iam@sung-woo.kim Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>

…ch_irq_work_raise() [ Upstream commit 31467b2 ] A kernel panic is observed when handling machine check exceptions from real mode. BUG: Unable to handle kernel data access on read at 0xc00000006be21300 Oops: Kernel access of bad area, sig: 11 [#1] MSR: 8000000000001003 <SF,ME,RI,LE> CR: 88222248 XER: 00000005 CFAR: c00000000003ffc4 DAR: c00000006be21300 DSISR: 40000000 IRQMASK: 0 NIP [c000000000029e40] arch_irq_work_raise+0x10/0x70 LR [c00000000003ffc8] machine_check_queue_event+0xa8/0x150 Call Trace: [c0000000179d3c70] [c00000000003ff64] machine_check_queue_event+0x44/0x150 [c0000000179d3d30] [c0000000000084e0] machine_check_early_common+0x1f0/0x2c0 The crash occurs because arch_irq_work_raise() calls preempt_disable() from machine check exception (MCE) handlers running in real mode. In this context, accessing the preempt_count can fault, leading to the panic. The preempt_disable()/preempt_enable() pair in arch_irq_work_raise() was originally added by commit 0fe1ac4 ("powerpc/perf_event: Fix oops due to perf_event_do_pending call") to avoid races while raising irq work from exception context. Later, commit 471ba0e ("irq_work: Do not raise an IPI when queueing work on the local CPU") added preemption protection in irq_work_queue() path, while commit 20b8769 ("irq_work: Use per cpu atomics instead of regular atomics") added equivalent protection in irq_work_queue_on() before reaching arch_irq_work_raise(): irq_work_queue() / irq_work_queue_on() -> preempt_disable() -> __irq_work_queue_local() -> irq_work_raise() -> arch_irq_work_raise() As a result, callers other than mce_irq_work_raise() already execute with preemption disabled, making the additional preempt_disable()/preempt_enable() pair in arch_irq_work_raise() redundant. The arch_irq_work_raise() function executes in NMI context when called from MCE handler. Hence we will not be preempted or scheduled out since we are in NMI context with MSR[EE]=0. Therefore, it is safe to remove the preempt_disable()/preempt_enable() calls from here. Remove it to avoid accessing preempt_count from real mode context. Fixes: cc15ff3 ("powerpc/mce: Avoid using irq_work_queue() in realmode") Suggested-by: Mahesh Salgaonkar <mahesh@linux.ibm.com> Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Sayali Patil <sayalip@linux.ibm.com> [Maddy: Fixed the commit title] Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20260513081413.222490-1-sayalip@linux.ibm.com Signed-off-by: Sasha Levin <sashal@kernel.org>

[ Upstream commit 5eec442 ] Since the introduction of the netlink configuration path for bridge ports in commit 25c71c7 ("bridge: bridge port parameters over netlink"), br_setport() was always called with the bridge lock held around it. Back then this decision made sense: The bridge lock protects the STP state of the bridge and its ports and at that time the function only processed three STP related netlink attributes (cost, priority and state). Nowadays, br_setport() processes a lot more attributes and most of them do not need the bridge lock: * Bridge flags: Only require RTNL. Read locklessly by the data path. Annotations can be added in net-next. * FDB port flushing: Only requires the FDB lock. * Multicast attributes: Only require the multicast lock. * Group forward mask: Only requires RTNL. Read locklessly by the data path. Annotations can be added in net-next. * Backup port and NHID: Only require RTNL. Read locklessly by the data path. This is a problem as the bridge calls dev_set_promiscuity() when certain bridge port flags change and this function can sleep since the commit cited below, resulting in a splat such as [1]. Fix this by reducing the scope of the bridge lock and only take it when processing the three STP related attributes that require it. This is consistent with the multicast attributes where each attribute acquires the multicast lock instead of having one critical section for all relevant attributes. [1] BUG: sleeping function called from invalid context at net/core/dev_addr_lists.c:1262 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 356, name: bridge preempt_count: 201, expected: 0 RCU nest depth: 0, expected: 0 2 locks held by bridge/356: #0: ffffffff919473a0 (rtnl_mutex){+.+.}-{4:4}, at: rtnetlink_rcv_msg (net/core/rtnetlink.c:80 net/core/rtnetlink.c:7002) #1: ffff888115072d58 (&br->lock){+...}-{3:3}, at: br_setlink (./include/linux/spinlock.h:348 net/bridge/br_netlink.c:1117) Preemption disabled at: 0x0 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:94 lib/dump_stack.c:120) __might_resched.cold (kernel/sched/core.c:9163) netif_rx_mode_run (net/core/dev_addr_lists.c:1262) netif_rx_mode_sync (net/core/dev_addr_lists.c:1428) dev_set_promiscuity (net/core/dev_api.c:289) br_manage_promisc (net/bridge/br_if.c:135 net/bridge/br_if.c:172) br_port_flags_change (net/bridge/br_if.c:242 net/bridge/br_if.c:747) br_setport (net/bridge/br_netlink.c:1000) br_setlink (net/bridge/br_netlink.c:1118) rtnl_bridge_setlink (net/core/rtnetlink.c:5572) rtnetlink_rcv_msg (net/core/rtnetlink.c:7005) netlink_rcv_skb (net/netlink/af_netlink.c:2550) netlink_unicast (net/netlink/af_netlink.c:1318 net/netlink/af_netlink.c:1344) netlink_sendmsg (net/netlink/af_netlink.c:1894) __sock_sendmsg (net/socket.c:787 (discriminator 4) net/socket.c:802 (discriminator 4)) ____sys_sendmsg (net/socket.c:2698) ___sys_sendmsg (net/socket.c:2752) __sys_sendmsg (net/socket.c:2784) do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121) Fixes: 78cd408 ("net: add missing instance lock to dev_set_promiscuity") Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260526064818.272516-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>

[ Upstream commit 6d34594 ] Since the start of the git history, brport_store() always acquired the bridge lock. Back then this decision made sense: The bridge lock protects the STP state of the bridge and its ports and at that time the function was only used by two STP related attributes (cost and priority). Nowadays, brport_store() processes a lot more attributes and most of them do not need the bridge lock: * Bridge flags: Only require RTNL. Read locklessly by the data path. Annotations can be added in net-next. * FDB port flushing: Only requires the FDB lock. * Multicast attributes: Only require the multicast lock. * Group forward mask: Only requires RTNL. Read locklessly by the data path. Annotations can be added in net-next. * Backup port: Only requires RTNL. Read locklessly by the data path. This is a problem as the bridge calls dev_set_promiscuity() when certain bridge port flags change and this function can sleep since the commit cited below, resulting in a splat such as [1]. Fix this by reducing the scope of the bridge lock and only take it when processing the two STP related attributes that require it. Remove the now stale comment from br_switchdev_set_port_flag(). The SWITCHDEV_F_DEFER flag can be removed in net-next. [1] BUG: sleeping function called from invalid context at net/core/dev_addr_lists.c:1262 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 372, name: bash preempt_count: 201, expected: 0 RCU nest depth: 0, expected: 0 5 locks held by bash/372: #0: ffff88810c51c3f0 (sb_writers#7){.+.+}-{0:0}, at: ksys_write (fs/read_write.c:740) #1: ffff888115ce9480 (&of->mutex){+.+.}-{4:4}, at: kernfs_fop_write_iter (fs/kernfs/file.c:343) #2: ffff88810b9fd330 (kn->active#37){.+.+}-{0:0}, at: kernfs_fop_write_iter (fs/kernfs/file.c:80 fs/kernfs/file.c:344) #3: ffffffffa59473a0 (rtnl_mutex){+.+.}-{4:4}, at: brport_store (net/bridge/br_sysfs_if.c:326) #4: ffff8881099d2d58 (&br->lock){+...}-{3:3}, at: brport_store (./include/linux/spinlock.h:348 net/bridge/br_sysfs_if.c:345) Preemption disabled at: 0x0 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:94 lib/dump_stack.c:120) __might_resched.cold (kernel/sched/core.c:9163) netif_rx_mode_run (net/core/dev_addr_lists.c:1262) netif_rx_mode_sync (net/core/dev_addr_lists.c:1428) dev_set_promiscuity (net/core/dev_api.c:289) br_manage_promisc (net/bridge/br_if.c:135 net/bridge/br_if.c:172) br_port_flags_change (net/bridge/br_if.c:242 net/bridge/br_if.c:747) store_learning (net/bridge/br_sysfs_if.c:79 net/bridge/br_sysfs_if.c:235) brport_store (net/bridge/br_sysfs_if.c:346) kernfs_fop_write_iter (fs/kernfs/file.c:352) new_sync_write (fs/read_write.c:595) vfs_write (fs/read_write.c:688) ksys_write (fs/read_write.c:740) do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121) Fixes: 78cd408 ("net: add missing instance lock to dev_set_promiscuity") Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260526064818.272516-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>

…k overflow [ Upstream commit a005fa5 ] tcf_mirred_act() checks sched_mirred_nest against MIRRED_NEST_LIMIT (4) to prevent deep recursion. However, when the action uses blockcast (tcfm_blockid != 0), the function returns at the tcf_blockcast() call BEFORE reaching the counter increment. As a result, the recursion counter never advances and the limit check is entirely bypassed. When two devices share a TC egress block with a mirred blockcast rule, a packet egressing on device A is mirrored to device B via blockcast; device B's egress TC re-enters tcf_mirred_act() via blockcast and mirrors back to A, creating an unbounded recursion loop: tcf_mirred_act -> tcf_blockcast -> tcf_mirred_to_dev -> dev_queue_xmit -> sch_handle_egress -> tcf_classify -> tcf_mirred_act -> (repeat) This recursion continues until the kernel stack overflows. The bug is reachable from an unprivileged user via unshare(CLONE_NEWUSER | CLONE_NEWNET): user namespaces grant CAP_NET_ADMIN in the new network namespace, which is sufficient to create dummy devices, attach clsact qdiscs with shared blocks, and install mirred blockcast filters. BUG: TASK stack guard page was hit at ffffc90000b7fff8 Oops: stack guard page: 0000 [#1] SMP KASAN NOPTI CPU: 2 UID: 1000 PID: 169 Comm: poc Not tainted 7.0.0-rc7-next-20260410 RIP: 0010:xas_find+0x17/0x480 Call Trace: xa_find+0x17b/0x1d0 tcf_mirred_act+0x640/0x1060 tcf_action_exec+0x400/0x530 basic_classify+0x128/0x1d0 tcf_classify+0xd83/0x1150 tc_run+0x328/0x620 __dev_queue_xmit+0x797/0x3100 tcf_mirred_to_dev+0x7b1/0xf70 tcf_mirred_act+0x68a/0x1060 [repeating ~30+ times until stack overflow] Kernel panic - not syncing: Fatal exception in interrupt Fix this by incrementing sched_mirred_nest before calling tcf_blockcast() and decrementing it on return, mirroring the non-blockcast path. This ensures subsequent recursive entries see the updated counter and are correctly limited by MIRRED_NEST_LIMIT. Fixes: fe946a7 ("net/sched: act_mirred: add loop detection") Signed-off-by: Kito Xu (veritas501) <hxzene@gmail.com> Link: https://patch.msgid.link/20260525122556.973584-7-jhs@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>

commit a17dc12 upstream. With CONFIG_CALL_DEPTH_TRACKING enabled on an x86 retbleed-affected platform (eg: Skylake), with retbleed=stuff, registering a dynamic ftrace trampoline crashes on the first call into the traced function: BUG: unable to handle page fault for address: ffff88817ae18880 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 4b53067 P4D 4b53067 PUD 0 Oops: Oops: 0002 [#1] SMP PTI CPU: 3 UID: 0 PID: 187 Comm: usleep Not tainted 7.0.10 AsahiLinux#243 PREEMPT(full) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-2-2 04/01/2014 Code: 24 78 00 00 00 00 48 89 ea 48 89 54 24 20 48 8b b4 24 b8 00 00 00 48 8b bc 24 b0 00 00 00 48 89 bc 24 80 00 00 00 48 83 ef 05 <65> 48 c1 3d 1f a8 b6 02 05 48 8b 15 f6 00 00 00 4c 89 3c 24 4c 89 Call Trace: <TASK> ? find_held_lock ? exc_page_fault ? lock_release ? __x64_sys_clock_nanosleep ? lockdep_hardirqs_on_prepare ? trace_hardirqs_on __x64_sys_clock_nanosleep do_syscall_64 ? exc_page_fault ? call_depth_return_thunk entry_SYSCALL_64_after_hwframe ... Kernel panic - not syncing: Fatal exception This small reproducer allows to easily trigger the crash: # echo 'p __x64_sys_clock_nanosleep' > /sys/kernel/tracing/kprobe_events # echo 1 > /sys/kernel/tracing/events/kprobes/p___x64_sys_clock_nanosleep_0/enable # usleep 1 Monitoring the crash under GDB points to the exact instruction in charge of incrementing the call depth: sarq $5, %gs:__x86_call_depth(%rip) This instruction matches the one inserted by the ftrace_regs_caller from ftrace_64.S. This emitted code was likely working fine until the introduction of 59bec00 ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()"): it has made the call depth accounting addressing relative to $rip, instead of being based on an absolute address. As this code exact location depends on where the trampoline lives in memory, the corresponding displacement needs to be adjusted at runtime to actually correctly find the per-cpu __x86_call_depth value, otherwise the targeted address is wrong, leading to the page fault seen above. Fix the %rip-relative displacement of the copied CALL_DEPTH_ACCOUNT instruction (from ftrace_regs_caller) by calling text_poke_apply_relocation(), as it is done for example by the x86 BPF JIT compiler through x86_call_depth_emit_accounting(). This corrects both CALL_DEPTH_ACCOUNT slots, in ftrace_caller and ftrace_regs_caller. [ bp: Massage. ] Fixes: 59bec00 ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()") Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: <stable@kernel.org> Link: https://patch.msgid.link/20260527-fix_call_depth_in_trampoline-v1-1-1c1abc8ae310@bootlin.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 5c4063c upstream. TLDR: The bo->ttm object might be changed by calling ttm_bo_validate(), move casting it to an i915_tt object later to actually get the right pointer. A user reported hitting the following bug under heavy use on DG2: [26620.095550] Oops: general protection fault, probably for non-canonical address 0xa56b6b6b6b6b6b8b: 0000 1 SMP NOPTI [26620.095556] CPU: 2 UID: 0 PID: 631 Comm: Xorg Not tainted 6.18.8 #1 PREEMPT(lazy) [26620.095558] Hardware name: ASRock B850M Steel Legend WiFi/B850M Steel Legend WiFi, BIOS 3.50 09/18/2025 [26620.095559] RIP: 0010:i915_ttm_purge+0x84/0x100 [i915] [26620.095604] Code: 00 00 00 48 8d 54 24 10 48 89 e6 48 89 fb e8 83 aa ae ff 85 c0 75 6f 48 83 bb a8 01 00 00 00 74 2c 48 8b 45 78 48 85 c0 74 23 <48> 8b 78 20 48 c7 c2 ff ff ff ff 31 f6 e8 7a 73 e3 e0 48 8b 7d 78 [26620.095605] RSP: 0018:ffffc90005fd7430 EFLAGS: 00010282 [26620.095607] RAX: a56b6b6b6b6b6b6b RBX: ffff8881f46c3dc0 RCX: 0000000000000000 [26620.095608] RDX: 0000000000000000 RSI: 0000000000000246 RDI: 00000000ffffffff [26620.095609] RBP: ffff888289610f00 R08: 0000000000000001 R09: ffff88823b022000 [26620.095609] R10: ffff888103029b28 R11: ffff8881fc7f3800 R12: ffff88810b6150d0 [26620.095609] R13: ffff888289610f00 R14: 0000000000000000 R15: ffff8881f46c3dc0 [26620.095610] FS: 00007f1004d86900(0000) GS:ffff88901c858000(0000) knlGS:0000000000000000 [26620.095611] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [26620.095611] CR2: 00007f0fdf489000 CR3: 000000035b0c1000 CR4: 0000000000750ef0 [26620.095612] PKRU: 55555554 [26620.095612] Call Trace: [26620.095615] <TASK> [26620.095615] i915_ttm_move+0x2b9/0x420 [i915] [26620.095642] ? ttm_tt_init+0x65/0x80 [ttm] [26620.095644] ? i915_ttm_tt_create+0xc6/0x150 [i915] [26620.095667] ttm_bo_handle_move_mem+0xb6/0x160 [ttm] [26620.095669] ttm_bo_evict+0x100/0x150 [ttm] [26620.095671] ? preempt_count_add+0x64/0xa0 [26620.095673] ? _raw_spin_lock+0xe/0x30 [26620.095675] ? _raw_spin_unlock+0xd/0x30 [26620.095675] ? i915_gem_object_evictable+0xb7/0xd0 [i915] [26620.095704] ttm_bo_evict_cb+0x6e/0xd0 [ttm] [26620.095705] ttm_lru_walk_for_evict+0xa6/0x200 [ttm] [26620.095708] ttm_bo_alloc_resource+0x185/0x4f0 [ttm] [26620.095709] ? init_object+0x62/0xd0 [26620.095712] ttm_bo_validate+0x7a/0x180 [ttm] [26620.095713] ? _raw_spin_unlock_irqrestore+0x16/0x30 [26620.095714] __i915_ttm_get_pages+0xb0/0x170 [i915] [26620.095737] i915_ttm_get_pages+0x9f/0x150 [i915] [26620.095759] ? i915_gem_do_execbuffer+0xedc/0x2b40 [i915] [26620.095786] ? alloc_debug_processing+0xd0/0x100 [26620.095787] ? _raw_spin_unlock_irqrestore+0x16/0x30 [26620.095788] ? i915_vma_instance+0xa0/0x4e0 [i915] [26620.095822] __i915_gem_object_get_pages+0x2f/0x40 [i915] [26620.095848] i915_vma_pin_ww+0x706/0x980 [i915] [26620.095875] ? i915_gem_do_execbuffer+0xedc/0x2b40 [i915] [26620.095904] eb_validate_vmas+0x170/0xa00 [i915] [26620.095930] i915_gem_do_execbuffer+0x1201/0x2b40 [i915] [26620.095953] ? alloc_debug_processing+0xd0/0x100 [26620.095954] ? _raw_spin_unlock_irqrestore+0x16/0x30 [26620.095955] ? i915_gem_execbuffer2_ioctl+0xc9/0x240 [i915] [26620.095977] ? __wake_up_sync_key+0x32/0x50 [26620.095979] ? i915_gem_execbuffer2_ioctl+0xc9/0x240 [i915] [26620.096001] ? __slab_alloc.isra.0+0x67/0xc0 [26620.096003] i915_gem_execbuffer2_ioctl+0x11a/0x240 [i915] Results from decode_stacktrace.sh pointed to dereference of a file pointer field of a i915 TTM page vector container associated with an object being purged on eviction. That path is taken when the object is marked as no longer needed. Code analysis revealed a possibility of the i915 TTM page vector container being replaced with a new instance inside a function that purges content of the object, should it be still busy. That function is called, indirectly via a more general function that changes the object's placement and caching policy, before the problematic dereference, but still after a pointer to the container is captured, rendering the pointer no longer valid. Fix the issue by capturing the pointer to the container only after its potential replacement. v2: Move the container_of() inside the if block (Sebastian), - a simplified version of the commit description that explains briefly why the change is necessary (Christian). Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/work_items/14882 Fixes: 7ae0345 ("drm/i915/ttm: add tt shmem backend") Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Cc: stable@vger.kernel.org # v5.17+ Cc: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Sebastian Brzezinka <sebastian.brzezinka@intel.com> Cc: Christian König <christian.koenig@amd.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://lore.kernel.org/r/20260508122612.469227-2-janusz.krzysztofik@linux.intel.com (cherry picked from commit 4462966) Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 5d7a49d upstream. Prevent a crash from happening as the first serial port is initialised: Console: switching to colour frame buffer device 160x64 tgafb: SFB+ detected, rev=0x02 fb0: Digital ZLX-E1 frame buffer device at 0x1e000000 DECstation DZ serial driver version 1.04 CPU 0 Unable to handle kernel paging request at virtual address 000000bc, epc == 8048b3a4, ra == 80470a78 Oops[#1]: CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.19.0-dirty AsahiLinux#35 NONE $ 0 : 00000000 1000ac00 00000004 804707ac $ 4 : 00000000 80e20850 80e20858 81000030 $ 8 : 00000000 8072c81c 00000008 fefefeff $12 : 6c616972 00000006 80c5917f 69726420 $16 : 80e20800 00000000 808f8968 80e20800 $20 : 00000000 807f5a9 808b0094 808d3bc8 $24 : 00000018 80479030 $28 : 80c2e000 80c2fd70 00000069 80470a78 Hi : 00000004 Lo : 00000000 epc : 8048b3a4 __dev_fwnode+0x0/0xc ra : 80470a78 serial_base_ctrl_add+0xa0/0x168 Status: 1000ac04 IEp Cause : 30000008 (ExcCode 02) BadVA : 000000bc PrId : 00000220 (R3000) Modules linked in: Process swapper/0 (pid: 1, threadinfo=(ptrval), task=(ptrval), tls=00000000) Stack : 00400044 00400040 8046f4cc 00000000 808a6148 808a0000 808f8968 8086983c 808e0000 8046fc84 1000ac01 00000028 80e20700 802ba3f8 80e20700 80d34a94 80c1b900 80e20700 80e20700 80e20700 80e20700 80444650 00000000 00000000 00000000 807f5a9 808b0094 80447080 00400040 808e0000 80d34a94 808a6148 80d34a94 00000004 80e20700 00000000 8076974c 80469810 80c2fe3c 1000ac01 ... Call Trace: [<8048b3a4>] __dev_fwnode+0x0/0xc [<80470a78>] serial_base_ctrl_add+0xa0/0x168 [<8046fc84>] serial_core_register_port+0x1c8/0x974 [<808c6af0>] dz_init+0x74/0xc8 [<800470e0>] do_one_initcall+0x44/0x2d4 [<808b111c>] kernel_init_freeable+0x258/0x308 [<8072e434>] kernel_init+0x20/0x114 [<80049cd0>] ret_from_kernel_thread+0x14/0x1c Code: 27bd0018 03e00008 2402ffea <8c8200bc> 03e00008 00000000 27bdffc0 afbe0038 afb30024 ---[ end trace 0000000000000000 ]--- -- where a pointer is dereferenced that has been derived from a null pointer to the port's parent device. Since no device is available with legacy probing and it's not anymore a preferable way to discover devices anyway, switch the driver to using a platform device and use it as the port's parent device. Update resource handling accordingly and only request the actual span of addresses used within the slot, which will have had its resource already requested by generic platform device code. Use platform_driver_probe() not just because the DZ device is fixed with solder on board and not straightforward to remove, but foremost because the associated TTY's major device number is the same as used by the zs driver and the first driver to claim it will prevent the other one from using it. Either one DZ device or some SCC devices will be present in a given system but never both at a time, and therefore we want the major device number to be claimed by the first driver to actually successfully bind to its device and platform_driver_probe() is a way to fulfil that. An unfortunate consequence of the switch to a platform device is we now hand the console over from the bootconsole much later in the bootstrap. The firmware console handler appears good enough though to work so late and in particular with interrupts enabled. Conversely only starting the console port so late lets the reset code fully utilise our delay handlers, so switch from udelay() to fsleep() for transmitter draining so as to avoid busy-waiting for an excessive amount of time. Fixes: 84a9582 ("serial: core: Start managing serial controllers to enable runtime PM") Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Cc: stable@vger.kernel.org # needs to use .remove_new for <= 6.10 Link: https://patch.msgid.link/alpine.DEB.2.21.2605062326540.46195@angie.orcam.me.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 7cac59d upstream. Prevent a crash from happening as the first serial port is initialised: Console: switching to mono frame buffer device 160x64 fb0: PMAG-AA frame buffer device at tc0 DECstation Z85C30 serial driver version 0.10 CPU 0 Unable to handle kernel paging request at virtual address 0000002c, epc == 803ab00c, ra == 803aafe0 Oops[#1]: CPU: 0 PID: 1 Comm: swapper Not tainted 6.4.0-rc3-00031-g84a9582fd203-dirty AsahiLinux#57 $ 0 : 00000000 10012c00 803aaeb0 00000000 $ 4 : 80e12f60 80e12f50 80e12f58 81000030 $ 8 : 00000000 805ff37c 00000000 33433538 $12 : 65732030 00000006 80c2915d 6c616972 $16 : 80e12f00 807b7630 00000000 00000000 $20 : 00000004 00000348 000001a0 807623b8 $24 : 00000018 00000000 $28 : 80c24000 80c25d6 8078b148 803aafe0 Hi : 00000000 Lo : 00000000 epc : 803ab00c serial_base_ctrl_add+0x78/0xf4 ra : 803aafe0 serial_base_ctrl_add+0x4c/0xf4 Status: 10012c0 KERNEL EXL IE Cause : 00000008 (ExcCode 02) BadVA : 0000002c PrId : 00000440 (R4400SC) Modules linked in: Process swapper (pid: 1, threadinfo=(ptrval), task=(ptrval), tls=00000000) Stack : 80760000 00000cc0 00400044 00400040 803aa02c 80d61ab8 00000000 807b7630 80760000 807623b8 807b7628 803aa644 80386998 00000000 80e17780 80220f68 80e17780 80d61ab8 80c17d80 80e17780 80e17780 8063c798 80e17780 80383fa0 00000010 80e17780 00000000 80386998 807a0000 00000000 00400040 8038f848 807623b8 80d61ab8 00000004 80e17780 00000000 803a68e4 80c25e2c 803bb884 ... Call Trace: [<803ab00c>] serial_base_ctrl_add+0x78/0xf4 [<803aa644>] serial_core_register_port+0x174/0x69c [<8077e9ac>] zs_init+0xc8/0xfc [<800404d4>] do_one_initcall+0x40/0x2ac [<8076cecc>] kernel_init_freeable+0x1e4/0x270 [<80605bec>] kernel_init+0x20/0x108 [<800431e8>] ret_from_kernel_thread+0x14/0x1c Code: 2442aeb ae120024 ae0200d0 <8c67002c> 50e00001 8c670000 3c06806 3c05806e afb30010 ---[ end trace 0000000000000000 ]--- (report at the offending commit) -- where a pointer is dereferenced that has been derived from a null pointer to the port's parent device. Since no device is available with legacy probing and it's not anymore a preferable way to discover devices anyway, switch the driver to using a platform device and use it as the port's parent device. Update resource handling accordingly and only request the actual span of addresses used within the slot, which will have had its resource already requested by generic platform device code. Use platform_driver_probe() not just because SCC devices are fixed with solder on board and not straightforward to remove, but foremost because the associated TTY's major device number is the same as used by the dz driver and the first driver to claim it will prevent the other one from using it. Either one DZ device or some SCC devices will be present in a given system but never both at a time, and therefore we want the major device number to be claimed by the first driver to actually successfully bind to its device and platform_driver_probe() is a way to fulfil that. An unfortunate consequence of the switch to a platform device is we now hand the console over from the bootconsole much later in the bootstrap. The firmware console handler appears good enough though to work so late and in particular with interrupts enabled. Since there is one way only remaining to reach zs_reset() now, remove the port initialisation marker as no longer needed and go through the channel reset unconditionally. Fixes: 84a9582 ("serial: core: Start managing serial controllers to enable runtime PM") Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Cc: stable@vger.kernel.org # needs to use .remove_new for <= 6.10 Link: https://patch.msgid.link/alpine.DEB.2.21.2605062328480.46195@angie.orcam.me.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

On the UDP receive path skb->dev is repurposed as dev_scratch (the truesize/state cache set by udp_set_dev_scratch()), through the union { struct net_device *dev; unsigned long dev_scratch; } in sk_buff. When a UDP socket is in a sockmap, sk_data_ready is sk_psock_verdict_data_ready(), which calls udp_read_skb() -> recv_actor() (sk_psock_verdict_recv) to run the attached SK_SKB verdict program in softirq. If that program calls a socket-lookup helper (bpf_sk_lookup_tcp/udp, bpf_skc_lookup_tcp), bpf_skc_lookup() does: if (skb->dev) caller_net = dev_net(skb->dev); skb->dev still holds the dev_scratch value (a non-NULL integer), so dev_net() dereferences it as a struct net_device * and the kernel takes a general protection fault on a non-canonical address in softirq: Oops: general protection fault, probably for non-canonical address 0x1010000800004a0 CPU: 1 UID: 0 PID: 1406 Comm: syz.2.19 Not tainted 7.1.0-rc6 #1 PREEMPT(full) RIP: 0010:bpf_skc_lookup net/core/filter.c:7033 [inline] RIP: 0010:bpf_sk_lookup+0x45/0x160 net/core/filter.c:7047 Call Trace: <IRQ> bpf_prog_4675cb904b7071f8+0x12e/0x14e bpf_prog_run_pin_on_cpu+0xc6/0x1f0 sk_psock_verdict_recv+0x1ba/0x350 udp_read_skb+0x31a/0x370 sk_psock_verdict_data_ready+0x2e3/0x600 __udp_enqueue_schedule_skb+0x4c8/0x650 udpv6_queue_rcv_one_skb+0x3ec/0x740 udp6_unicast_rcv_skb+0x11d/0x140 ip6_protocol_deliver_rcu+0x61e/0x950 ip6_input_finish+0xa9/0x150 NF_HOOK+0x286/0x2f0 ip6_input+0x117/0x220 NF_HOOK+0x286/0x2f0 __netif_receive_skb+0x85/0x200 process_backlog+0x374/0x9a0 __napi_poll+0x4f/0x1c0 net_rx_action+0x3b0/0x770 handle_softirqs+0x15a/0x460 do_softirq+0x57/0x80 </IRQ> The rmem charge that dev_scratch accounted for is released by skb_recv_udp() on dequeue, just above, so the scratch is dead by the time recv_actor() runs. Clear skb->dev so bpf_skc_lookup() falls back to sock_net(skb->sk), which skb_set_owner_sk_safe() set just above. Fixes: 965b57b ("net: Introduce a new proto_ops ->read_skb()") Cc: stable@vger.kernel.org Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260603162737.697215-1-rhkrqnwk98@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mvebu_pwm_suspend() and mvebu_pwm_resume() are called for all GPIO banks during suspend/resume, but not all banks have PWM functionality. GPIO banks without PWM have mvchip->mvpwm set to NULL. Calling mvebu_pwm_suspend() with mvpwm == NULL causes a NULL pointer dereference when it tries to access mvpwm->blink_select. Unable to handle kernel NULL pointer dereference at virtual address 00000020 when write [00000020] *pgd=00000000 Internal error: Oops: 815 [#1] PREEMPT ARM Modules linked in: CPU: 0 UID: 0 PID: 406 Comm: sh Not tainted 6.12.74-rt12-yocto-standard-g4e96f98fb7db-dirty AsahiLinux#353 Hardware name: Marvell Armada 370/XP (Device Tree) PC is at regmap_mmio_read+0x38/0x54 LR is at regmap_mmio_read+0x38/0x54 pc : [<c05fd2ac>] lr : [<c05fd2ac>] psr: 200f0013 sp : f0c11d10 ip : 00000000 fp : c100d2f0 r10: c14fb854 r9 : 00000000 r8 : 00000000 r7 : c1799c00 r6 : 00000020 r5 : 00000020 r4 : c179c7c0 r3 : f0a231a0 r2 : 00000020 r1 : 00000020 r0 : 00000000 Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none Control: 10c5387d Table: 135ec059 DAC: 00000051 Call trace: regmap_mmio_read from _regmap_bus_reg_read+0x78/0xac _regmap_bus_reg_read from _regmap_read+0x60/0x154 _regmap_read from regmap_read+0x3c/0x60 regmap_read from mvebu_gpio_suspend+0xa4/0x14c mvebu_gpio_suspend from dpm_run_callback+0x54/0x180 dpm_run_callback from device_suspend+0x124/0x630 device_suspend from dpm_suspend+0x124/0x270 dpm_suspend from dpm_suspend_start+0x64/0x6c dpm_suspend_start from suspend_devices_and_enter+0x140/0x8e8 suspend_devices_and_enter from pm_suspend+0x2fc/0x308 pm_suspend from state_store+0x6c/0xc8 state_store from kernfs_fop_write_iter+0x10c/0x1f8 kernfs_fop_write_iter from vfs_write+0x270/0x468 vfs_write from ksys_write+0x70/0xf0 ksys_write from ret_fast_syscall+0x0/0x54 Add a NULL check for mvchip->mvpwm before calling the PWM suspend/resume functions. Fixes: 757642f ("gpio: mvebu: Add limited PWM support") Signed-off-by: Yun Zhou <yun.zhou@windriver.com> Link: https://patch.msgid.link/20260608084334.2960803-1-yun.zhou@windriver.com Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>

asdfugil and others added 9 commits April 27, 2026 14:51

Merge branch 'bits/000-cpufeature-a10' into hoolock

0c37f3a

Merge branch 'bits/010-usb' into hoolock

33d52d7

Merge branch 'bits/020-pmu' into hoolock

8191b15

Merge branch 'bits/030-perf' into hoolock

c53aac0

Merge branch 'bits/040-dialog' into hoolock

99f922e

Merge branch 'bits/050-usb-dt' into hoolock

5ad63a2

Merge branch 'bits/060-misc' into hoolock

1ecb03d

Merge branch 'bits/070-smc' into hoolock

0e556ca

yhavry force-pushed the apple-nvme-reset-queue-reinit branch from dcc06b0 to 9c70bd0 Compare May 14, 2026 04:21

asdfugil force-pushed the hoolock branch from 0e556ca to 0d23eac Compare May 14, 2026 06:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

A11 nvme: reset submission queue state - #1

A11 nvme: reset submission queue state#1
yhavry wants to merge 9 commits into
HoolockLinux:hoolockfrom
yhavry:apple-nvme-reset-queue-reinit

yhavry commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

yhavry commented May 14, 2026

Summary

Fix

Testing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants