问题背景
有台centos8机器热插拔virtio网卡失败了,dmesg内显示有hung task的栈。机器使用的是4.18内核。
相关知识
以前,网卡收发包都是通过中断触发去处理的,但是随着高性能网络时代的到来,如果每一次的网络收发包都依靠中断来通知处理,效率低下且会造成中断风暴,因此才出现了napi。简单来讲,napi就是一种中断+轮询混合模式。收发包时,触发中断,然后关闭中断改为轮询模式,批量处理完网络包后结束轮询,恢复中断模式。
这里要简单说一下napi的使用和生命周期,由于这回排查的是virtio网卡,那我们就以virtio驱动为例来梳理,其他的网卡驱动类似(本想画个图来的,但是东西太多不好塞下,让ai总结一下吧):
┌─────────────────────────────────────────────────────────────────────────┐
│ NAPI生命周期 │
└─────────────────────────────────────────────────────────────────────────┘
1. PROBE 阶段
┌──────────────────────────────────────────────────────────────┐
│ netif_napi_add() │
│ → 注册 napi_struct,绑定 poll 回调 │
│ → 此时 NAPI 未使能,不参与调度 │
└──────────────────────────────────────────────────────────────┘
2. OPEN 阶段
┌──────────────────────────────────────────────────────────────┐
│ virtnet_open() │
│ → virtnet_enable_queue_pair() │
│ → virtnet_napi_enable() │
│ → napi_enable() ← 使能 NAPI │
│ → virtqueue_napi_schedule() ← 检查是否有待处理包 │
└──────────────────────────────────────────────────────────────┘
3. 数据包到达 → 中断触发
┌──────────────────────────────────────────────────────────────┐
│ skb_recv_done() (virtio 回调) │
│ → virtqueue_napi_schedule() │
│ → napi_schedule_prep() ← 检查是否可以调度 │
│ → virtqueue_disable_cb() ← 禁用 virtio 回调 │
│ → __napi_schedule() ← 添加到软中断队列 │
└──────────────────────────────────────────────────────────────┘
│
▼
4. 软中断触发
┌──────────────────────────────────────────────────────────────┐
│ NET_RX_SOFTIRQ │
│ → virtnet_poll(napi, budget) │
│ → virtnet_receive() ← 实际接收数据包 │
│ → if received < budget: │
│ → virtqueue_napi_complete() │
│ → napi_complete_done() │
│ → virtqueue_enable_cb() ← 重新使能中断 │
└──────────────────────────────────────────────────────────────┘
5. CLOSE 阶段
┌──────────────────────────────────────────────────────────────┐
│ virtnet_close() │
│ → virtnet_disable_queue_pair() │
│ → virtnet_napi_disable() │
│ → napi_disable() ← 禁用 NAPI │
└──────────────────────────────────────────────────────────────┘
排查过程
先看一下dmesg日志,这是最早hung的栈,后面还有很多。只能看到是在等锁了,至于谁在持有锁,只能触发kdump拿vmcore来看了。
......
[ 615.640467] INFO: task kworker/218:2:4106 blocked for more than 120 seconds.
[ 615.647756] Tainted: G OE --------- - - 4.18.0-348.7.1.el8_5.x86_64 #1
[ 615.656038] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 615.663881] task:kworker/218:2 state:D stack: 0 pid: 4106 ppid: 2 flags:0x80004000
[ 615.672252] Workqueue: ipv6_addrconf addrconf_verify_work
[ 615.677665] Call Trace:
[ 615.680129] __schedule+0x2bd/0x760
[ 615.683629] schedule+0x37/0xa0
[ 615.686780] schedule_preempt_disabled+0xa/0x10
[ 615.691325] __mutex_lock.isra.6+0x2b5/0x4a0
[ 615.695604] addrconf_verify_work+0xa/0x20
[ 615.699715] process_one_work+0x1a7/0x360
[ 615.703743] ? create_worker+0x1a0/0x1a0
[ 615.707675] worker_thread+0x30/0x390
[ 615.711350] ? create_worker+0x1a0/0x1a0
[ 615.715285] kthread+0x116/0x130
[ 615.718528] ? kthread_flush_work_fn+0x10/0x10
[ 615.722981] ret_from_fork+0x1f/0x40
.......
接下来是一步一步排查问题的过程:
先看一下大体情况,有很多D进程
crash> ps |grep UN
2291 2 138 ffff9d9312741800 UN 0.0 0 0 [irq/103-pciehp]
4106 2 218 ffff9d93110d3000 UN 0.0 0 0 [kworker/218:2]
23180 5096 234 ffff9d9382b91800 UN 0.0 163784 10448 sshd
107852 1 210 ffff9d9312198000 UN 0.0 1764 1096 ethtool
107906 1 215 ffff9d93970c1800 UN 0.0 1764 1096 ethtool
107942 107927 223 ffff9d9e82421800 UN 0.0 163544 8080 curl
110122 110112 220 ffff9c9be4b30000 UN 0.0 163544 8068 curl
110159 1 239 ffff9d9b83d06000 UN 0.0 40788 4512 checknic
110195 5096 109 ffff9d9344796000 UN 0.0 74432 6792 sshd
......
crash> ps |grep UN |wc -l
352
切到kworker/218:2看一下该进程的栈
crash> bt ffff9d93110d3000
PID: 4106 TASK: ffff9d93110d3000 CPU: 218 COMMAND: "kworker/218:2"
#0 [ffffaaca6bb3fd98] __schedule at ffffffffa6b7a60d
#1 [ffffaaca6bb3fdf0] schedule at ffffffffa6b7aae7
#2 [ffffaaca6bb3fe00] schedule_preempt_disabled at ffffffffa6b7ae2a
#3 [ffffaaca6bb3fe08] __mutex_lock at ffffffffa6b7cb25
#4 [ffffaaca6bb3fe90] addrconf_verify_work at ffffffffa6aecdfa
#5 [ffffaaca6bb3fe98] process_one_work at ffffffffa6309777
#6 [ffffaaca6bb3fed8] worker_thread at ffffffffa6309e40
#7 [ffffaaca6bb3ff10] kthread at ffffffffa630f906
#8 [ffffaaca6bb3ff50] ret_from_fork at ffffffffa6c0023f
看一下这里的addrconf_verify_work函数,所以这里是一直在等rtnl_mutex锁
crash> dis -lr ffffffffa6aecdfa
/usr/src/debug/kernel-4.18.0-348.7.1.el8_5/linux-4.18.0-348.7.1.el8_5.x86_64/net/ipv6/addrconf.c: 4553
0xffffffffa6aecdf0 <addrconf_verify_work>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
/usr/src/debug/kernel-4.18.0-348.7.1.el8_5/linux-4.18.0-348.7.1.el8_5.x86_64/net/ipv6/addrconf.c: 4554
0xffffffffa6aecdf5 <addrconf_verify_work+5>: call 0xffffffffa69ce880 <rtnl_lock>
/usr/src/debug/kernel-4.18.0-348.7.1.el8_5/linux-4.18.0-348.7.1.el8_5.x86_64/net/ipv6/addrconf.c: 4555
0xffffffffa6aecdfa <addrconf_verify_work+10>: call 0xffffffffa6aeca00 <addrconf_verify_rtnl>
static void addrconf_verify_work(struct work_struct *w)
{
rtnl_lock();
addrconf_verify_rtnl();
rtnl_unlock();
}
void rtnl_lock(void)
{
mutex_lock(&rtnl_mutex);
}
查一下rtnl_mutex锁
crash> p rtnl_mutex -x
rtnl_mutex = $2 = {
owner = {
counter = 0xffff9d9312741801
},
wait_lock = {
{
rlock = {
raw_lock = {
{
val = {
counter = 0x0
.......
根据这个可以知道锁的owner是0xffff9d9312741801,不过这还不是真正的task_struct的地址,因为mutex的后几位可能是作为一些flags存在的,所以我们需要把这个0xffff9d9312741801进行一下地址对齐,得到的0xffff9d9312741800才是真正的task_struct地址。
知道了持有锁的进程的task_struct,就可以看一下这是哪个进程,原来就是我们最开始ps出来的irq/103-pciehp,这就是罪魁祸首了.
crash> task_struct.comm,tgid 0xffff9d9312741800
comm = "irq/103-pciehp\000",
tgid = 2291,
看一下这个pciehp进程的栈,可以看出这就是典型的pcie热插拔的栈,但是这里为什么hung住呢?
crash> bt ffff9d9312741800
PID: 2291 TASK: ffff9d9312741800 CPU: 138 COMMAND: "irq/103-pciehp"
#0 [ffffaaca74b9b9f0] __schedule at ffffffffa6b7a60d
#1 [ffffaaca74b9ba48] schedule at ffffffffa6b7aae7
#2 [ffffaaca74b9ba58] schedule_timeout at ffffffffa6b7f2c7
#3 [ffffaaca74b9baf0] msleep at ffffffffa637c9b9
#4 [ffffaaca74b9baf8] napi_disable at ffffffffa69b7dcb
#5 [ffffaaca74b9bb10] virtnet_close at ffffffffc0e1274d [virtio_net]
#6 [ffffaaca74b9bb30] __dev_close_many at ffffffffa69b70a1
#7 [ffffaaca74b9bb60] dev_close_many at ffffffffa69ba4af
#8 [ffffaaca74b9bba8] rollback_registered_many at ffffffffa69be898
#9 [ffffaaca74b9bc08] rollback_registered at ffffffffa69bedc6
#10 [ffffaaca74b9bc38] unregister_netdevice_queue at ffffffffa69bee73
#11 [ffffaaca74b9bc60] unregister_netdev at ffffffffa69beec8
#12 [ffffaaca74b9bc70] virtnet_remove at ffffffffc0e148ca [virtio_net]
#13 [ffffaaca74b9bc80] virtio_dev_remove at ffffffffa678ddc9
#14 [ffffaaca74b9bca8] device_release_driver_internal at ffffffffa6813003
#15 [ffffaaca74b9bcc8] bus_remove_device at ffffffffa6811647
#16 [ffffaaca74b9bce8] device_del at ffffffffa680cd91
#17 [ffffaaca74b9bd30] device_unregister at ffffffffa680d036
#18 [ffffaaca74b9bd40] unregister_virtio_device at ffffffffa678dfe1
#19 [ffffaaca74b9bd50] virtio_pci_remove at ffffffffa67920cf
#20 [ffffaaca74b9bd70] pci_device_remove at ffffffffa66ec4cb
#21 [ffffaaca74b9bd90] device_release_driver_internal at ffffffffa6813003
#22 [ffffaaca74b9bdb0] pci_stop_bus_device at ffffffffa66e3249
#23 [ffffaaca74b9bdd0] pci_stop_and_remove_bus_device at ffffffffa66e338e
#24 [ffffaaca74b9bde0] pciehp_unconfigure_device at ffffffffa67042dc
#25 [ffffaaca74b9be18] pciehp_disable_slot at ffffffffa670366b
#26 [ffffaaca74b9be48] pciehp_handle_disable_request at ffffffffa670396b
#27 [ffffaaca74b9be60] pciehp_ist at ffffffffa670532c
#28 [ffffaaca74b9be98] irq_thread_fn at ffffffffa635adef
#29 [ffffaaca74b9beb8] irq_thread at ffffffffa635b1e7
#30 [ffffaaca74b9bf10] kthread at ffffffffa630f906
#31 [ffffaaca74b9bf50] ret_from_fork at ffffffffa6c0023f
看一下上面,应该是在napi_disable这里,应该是在msleep。看一下napi_disable这个函数。通过对比汇编和代码的行号,可以知道是在while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))这个判断下面的msleep。
crash> dis -lr ffffffffa69b7dcb
/usr/src/debug/kernel-4.18.0-348.7.1.el8_5/linux-4.18.0-348.7.1.el8_5.x86_64/net/core/dev.c: 6622
0xffffffffa69b7da0 <napi_disable>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
/usr/src/debug/kernel-4.18.0-348.7.1.el8_5/linux-4.18.0-348.7.1.el8_5.x86_64/net/core/dev.c: 6623
0xffffffffa69b7da5 <napi_disable+5>: push %rbp
0xffffffffa69b7da6 <napi_disable+6>: mov %rdi,%rbp
0xffffffffa69b7da9 <napi_disable+9>: push %rbx
/usr/src/debug/kernel-4.18.0-348.7.1.el8_5/linux-4.18.0-348.7.1.el8_5.x86_64/net/core/dev.c: 6624
0xffffffffa69b7daa <napi_disable+10>: lea 0x10(%rbp),%rbx
/usr/src/debug/kernel-4.18.0-348.7.1.el8_5/linux-4.18.0-348.7.1.el8_5.x86_64/net/core/dev.c: 6623
0xffffffffa69b7dae <napi_disable+14>: call 0xffffffffa6b7abb0 <_cond_resched>
/usr/src/debug/kernel-4.18.0-348.7.1.el8_5/linux-4.18.0-348.7.1.el8_5.x86_64/./arch/x86/include/asm/bitops.h: 76
0xffffffffa69b7db3 <napi_disable+19>: lock orb $0x4,0x10(%rbp)
/usr/src/debug/kernel-4.18.0-348.7.1.el8_5/linux-4.18.0-348.7.1.el8_5.x86_64/./arch/x86/include/asm/bitops.h: 220
0xffffffffa69b7db8 <napi_disable+24>: lock btsq $0x0,0x10(%rbp)
0xffffffffa69b7dbf <napi_disable+31>: jae 0xffffffffa69b7dd3 <napi_disable+51>
/usr/src/debug/kernel-4.18.0-348.7.1.el8_5/linux-4.18.0-348.7.1.el8_5.x86_64/net/core/dev.c: 6627
0xffffffffa69b7dc1 <napi_disable+33>: mov $0x1,%edi
0xffffffffa69b7dc6 <napi_disable+38>: call 0xffffffffa637c990 <msleep>
/usr/src/debug/kernel-4.18.0-348.7.1.el8_5/linux-4.18.0-348.7.1.el8_5.x86_64/./arch/x86/include/asm/bitops.h: 220
0xffffffffa69b7dcb <napi_disable+43>: lock btsq $0x0,(%rbx)
void napi_disable(struct napi_struct *n)
{
might_sleep();
set_bit(NAPI_STATE_DISABLE, &n->state);
while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
msleep(1);
while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state))
msleep(1);
hrtimer_cancel(&n->timer);
clear_bit(NAPI_STATE_PREFER_BUSY_POLL, &n->state);
clear_bit(NAPI_STATE_DISABLE, &n->state);
clear_bit(NAPI_STATE_THREADED, &n->state);
}
所以这时候就可以推断,应该是热插拔走到virtnet_close去disable每个队列上的napi,然后某个队列的napi一直不能被成功关闭(因为NAPI_STATE_SCHED的状态不对,期望是0,而现在是1),导致一直在while重试,而pciehp线程还持有rtnl_mutex锁,导致其他的一些curl ethtool这些的进程也没有办法执行完成,一直在等pciehp线程释放rtnl_mutex锁,如此系统变hung住了。
那么我们查找一下这个napi结构体的地址,去看一下它的状态。
可以看到上面的napi_disable函数的参数就是napi的指针,所以可以从这里去查找,napi_disable的上一个函数virtnet_close调用napi_disable的时候必然会准备napi参数。所以看一下virtnet_close的栈,可以看到得到rbx寄存器的值就可以知道napi的地址。然后我们看一下napi_disable的栈,这里push了rbx,所以我们可以通过bt -f查找到第二个压栈的数加0x8就是napi的地址了。也就是0xffff9c9a037e2d00+0x8
crash> dis -lr ffffffffc0e1274d
......
0xffffffffc0e12739 <virtnet_close+89>: add 0xae0(%r12),%rbx
0xffffffffc0e12741 <virtnet_close+97>: mov %rbx,%rdi ---->rbx
0xffffffffc0e12744 <virtnet_close+100>: add $0x8,%rdi ---->也就是rbx+8
0xffffffffc0e12748 <virtnet_close+104>: call 0xffffffffa69b7da0 <napi_disable>
0xffffffffc0e1274d <virtnet_close+109>: imul $0x3b8,%rbp,%rdi
crash> dis -lr ffffffffa69b7dcb
/usr/src/debug/kernel-4.18.0-348.7.1.el8_5/linux-4.18.0-348.7.1.el8_5.x86_64/net/core/dev.c: 6622
0xffffffffa69b7da0 <napi_disable>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
/usr/src/debug/kernel-4.18.0-348.7.1.el8_5/linux-4.18.0-348.7.1.el8_5.x86_64/net/core/dev.c: 6623
0xffffffffa69b7da5 <napi_disable+5>: push %rbp
0xffffffffa69b7da6 <napi_disable+6>: mov %rdi,%rbp
0xffffffffa69b7da9 <napi_disable+9>: push %rbx ---->push了rbx
/usr/src/debug/kernel-4.18.0-348.7.1.el8_5/linux-4.18.0-348.7.1.el8_5.x86_64/net/core/dev.c: 6624
crash> bt -f
......
#4 [ffffaaca74b9baf8] napi_disable at ffffffffa69b7dcb
ffffaaca74b9bb00: ffff9c9a037e2d00 000000000000000a
ffffaaca74b9bb10: ffffffffc0e1274d
根据上面得到的地址看一下napi的内容,去源码中找一下这些bit的含义,所以这里的state就是NAPIF_STATE_SCHED | NAPIF_STATE_MISSED | NAPIF_STATE_DISABLE | NAPIF_STATE_LISTED,符合之前看到的NAPIF_STATE_SCHED一直是1导致一直在死循环的情况。
crash> struct napi_struct 0xffff9c9a037e2d08 -x
struct napi_struct {
poll_list = {
next = 0xffff9c9a037e2d08,
prev = 0xffff9c9a037e2d08
},
state = 0x17,
weight = 0x40,
gro_count = 0x0,
poll = 0xffffffffc0e185e0 <virtnet_poll>,
poll_owner = 0xffffffff,
dev = 0xffff9c9a03792000,
rh_reserved_gro_list = 0x0,
skb = 0x0,
......
enum {
NAPIF_STATE_SCHED = BIT(NAPI_STATE_SCHED),
NAPIF_STATE_MISSED = BIT(NAPI_STATE_MISSED),
NAPIF_STATE_DISABLE = BIT(NAPI_STATE_DISABLE),
NAPIF_STATE_NPSVC = BIT(NAPI_STATE_NPSVC),
NAPIF_STATE_LISTED = BIT(NAPI_STATE_LISTED),
NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
NAPIF_STATE_PREFER_BUSY_POLL = BIT(NAPI_STATE_PREFER_BUSY_POLL),
NAPIF_STATE_THREADED = BIT(NAPI_STATE_THREADED),
NAPIF_STATE_SCHED_THREADED = BIT(NAPI_STATE_SCHED_THREADED),
};
而如果NAPIF_STATE_SCHED是1的话,也就是在调度中,那么为什么poll list是空的?这就说明,至少是存在race condition的,可能sched但是没有completion清除掉NAPIF_STATE_SCHED,或者可能设置的时候就有问题,或者可能是其他的race。
总之,问题根因就是在pciehp热插拔网卡在virtnet_close中使用napi_disable去停止napi的时候,我们预期这时候napi的state中NAPIF_STATE_SCHED是0,也就是表示此时没有任何调度,当时当前的实际情况,NAPIF_STATE_SCHED是1,并且没有其他代码去set bit,导致napi_disable一直while msleep循环。而pciehp线程是持有rtnl_mutex锁的,它一直while msleep循环不能结束,也就一直持有锁,而其他需要加rtnl_mutex锁的进程就会一直等待,最终的表现就是hung task发生。
crash> list 0xffff9c9a037e2d08
ffff9c9a037e2d08
查了一下上游,应该是已经被修复了。
commit 3765996e4f0b8a755cab215a08df744490c76052
Author: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Date: Sat Sep 18 16:52:32 2021 +0800
napi: fix race inside napi_enable
The process will cause napi.state to contain NAPI_STATE_SCHED and
not in the poll_list, which will cause napi_disable() to get stuck.
The prefix "NAPI_STATE_" is removed in the figure below, and
NAPI_STATE_HASHED is ignored in napi.state.
CPU0 | CPU1 | napi.state
===============================================================================
napi_disable() | | SCHED | NPSVC
napi_enable() | |
{ | |
smp_mb__before_atomic(); | |
clear_bit(SCHED, &n->state); | | NPSVC
| napi_schedule_prep() | SCHED | NPSVC
| napi_poll() |
| napi_complete_done() |
| { |
| if (n->state & (NPSVC | | (1)
| _BUSY_POLL))) |
| return false; |
| ................ |
| } | SCHED | NPSVC
| |
clear_bit(NPSVC, &n->state); | | SCHED
} | |
| |
napi_schedule_prep() | | SCHED | MISSED (2)
(1) Here return direct. Because of NAPI_STATE_NPSVC exists.
(2) NAPI_STATE_SCHED exists. So not add napi.poll_list to sd->poll_list
Since NAPI_STATE_SCHED already exists and napi is not in the
sd->poll_list queue, NAPI_STATE_SCHED cannot be cleared and will always
exist.
1. This will cause this queue to no longer receive packets.
2. If you encounter napi_disable under the protection of rtnl_lock, it
will cause the entire rtnl_lock to be locked, affecting the overall
system.
This patch uses cmpxchg to implement napi_enable(), which ensures that
there will be no race due to the separation of clear two bits.
Fixes: 2d8bff12699abc ("netpoll: Close race condition between poll_one_napi and napi_disable")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/net/core/dev.c b/net/core/dev.c
index 74fd402d26dd..7ee9fecd3aff 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6923,12 +6923,16 @@ EXPORT_SYMBOL(napi_disable);
*/
void napi_enable(struct napi_struct *n)
{
- BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
- smp_mb__before_atomic();
- clear_bit(NAPI_STATE_SCHED, &n->state);
- clear_bit(NAPI_STATE_NPSVC, &n->state);
- if (n->dev->threaded && n->thread)
- set_bit(NAPI_STATE_THREADED, &n->state);
+ unsigned long val, new;
+
+ do {
+ val = READ_ONCE(n->state);
+ BUG_ON(!test_bit(NAPI_STATE_SCHED, &val));
+
+ new = val & ~(NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC);
+ if (n->dev->threaded && n->thread)
+ new |= NAPIF_STATE_THREADED;
+ } while (cmpxchg(&n->state, val, new) != val);
}
EXPORT_SYMBOL(napi_enable);
大佬的patch已经把race场景分析的很清晰了,就是napi_enable函数中clear NAPIF_STATE_SCHED和NAPIF_STATE_NPSVC之间,其他的cpu可能触发了schedule调度napi去执行,导致最后NAPIF_STATE_SCHED没有办法被clear掉。
转载请注明来源,欢迎对文章中的引用来源进行考证,欢迎指出任何有错误或不够清晰的表达。可以在下面评论区评论,也可以邮件至 857879363@qq.com