Fixing an FPU-related kernel bug, without kdump

  1. Background
  2. Analysis
    1. Root cause
    2. Who introduced the bug?
    3. Fixing the bug
  3. Afterword

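Background

As yzwddsg once said: having resources and not putting them to use is a reckless waste.

So in my spare time (slacking off at work) I set up a VM, pulled the latest kernel source, built and installed it, and went looking for problems. stress-ng, fio, hackbench, sysbench, ltp: whatever I could run, I ran.

And it really did break. While running ltp the machine crashed, and it crashed every time. I tried to set up kdump to get a core to analyze, because the log alone only showed a null pointer dereference. No luck: a core was produced, but crash could not open it. So the analysis below works from the log alone.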

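Analysis

Root cause

I was running a fairly recent kernel, 6.16-rc1, with ltp 20250530.

Once several ltp runs had all crashed at the same spot, it was clearly no coincidence. Here is the log: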

[ 8100.082169] LTP: starting read_all_dev (read_all -d /dev -p -q -r 3)
[ 8100.146532] VFIO - User Level meta-driver version: 0.3
[ 8100.297762] LTP: starting read_all_proc (read_all -d /proc -q -r 3)
[ 8100.547969] ------------[ cut here ]------------
[ 8100.547978] WARNING: CPU: 1 PID: 115002 at arch/x86/kernel/fpu/core.c:61 x86_task_fpu+0x46/0x60
[ 8100.547989] Modules linked in: vfio_iommu_type1 vfio dns_resolver tun overlay nls_iso8859_1 ntfs3 vfat fat xfs sctp ip6_udp_tunnel udp_tunnel nf_tables nfnetlink tcp_diag inet_diag ib_core isofs skx_edac_common input_leds led_class serio_raw sg virtio_balloon binfmt_misc squashfs loop sch_fq_codel dm_multipath fuse drm bpf_preload ip_tables x_tables raid10 async_tx raid1 raid0 linear dm_mirror dm_region_hash dm_log dm_mod hid_generic usbhid hid virtio_blk virtio_net net_failover failover ghash_clmulni_intel atkbd sha512_ssse3 aesni_intel vivaldi_fmap sr_mod i2c_i801 i2c_smbus i2c_core cdrom uhci_hcd ehci_pci virtio_pci ehci_hcd virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring [last unloaded: hwpoison_inject]
[ 8100.548134] CPU: 1 UID: 0 PID: 115002 Comm: read_all Kdump: loaded Tainted: G           OE       6.16.0-rc1 #4 PREEMPT(full)
[ 8100.548143] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 8100.548151] RIP: 0010:x86_task_fpu+0x46/0x60
[ 8100.548156] Code: ea 03 0f b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 1d 48 8d 83 00 25 00 00 f6 43 2e 20 75 06 5b c3 cc cc cc cc <0f> 0b 31 c0 5b c3 cc cc cc cc e8 cb da a4 00 eb dc 66 0f 1f 84 00
[ 8100.548162] RSP: 0018:ff1100017cdefaf8 EFLAGS: 00010202
[ 8100.548167] RAX: ff11000108b3d5c0 RBX: ff11000108b3b0c0 RCX: ff11000108b3b0c0
[ 8100.548172] RDX: 0000000000000000 RSI: ffffffffa610d020 RDI: ff11000108b3b0ec
[ 8100.548176] RBP: ff110002cc49c918 R08: 0000000000000001 R09: ffe21c002116761d
[ 8100.548179] R10: ff11000108b3b0eb R11: 0000000000000000 R12: ff11000108a80180
[ 8100.548183] R13: ff11000108b3b0e8 R14: ffffffffa610d020 R15: 0000000000000001
[ 8100.548187] FS:  00007fb2218f4740(0000) GS:ff11000f171d1000(0000) knlGS:0000000000000000
[ 8100.548193] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8100.548197] CR2: 000055e2a4620000 CR3: 000000011412f004 CR4: 0000000000771ef0
[ 8100.548200] DR0: 0000000000000001 DR1: 0000000000000000 DR2: 0000000000000000
[ 8100.548204] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[ 8100.548207] PKRU: 55555554
[ 8100.548210] Call Trace:
[ 8100.548213]  <TASK>
[ 8100.548216]  proc_pid_arch_status+0x1b/0xe0
[ 8100.548222]  proc_single_show+0x10c/0x1c0
[ 8100.548230]  seq_read_iter+0x3e5/0x1050
[ 8100.548239]  seq_read+0x24b/0x3b0
[ 8100.548249]  ? __pfx_do_filp_open+0x10/0x10
[ 8100.548256]  ? __pfx_seq_read+0x10/0x10
[ 8100.548260]  ? rcu_segcblist_enqueue+0x1d/0xe0
[ 8100.548268]  ? rcutree_enqueue.constprop.0+0x36/0x290
[ 8100.548274]  ? __call_rcu_common.constprop.0+0x30f/0x930
[ 8100.548281]  vfs_read+0x186/0xad0
[ 8100.548288]  ? alloc_fd+0x2c3/0x4c0
[ 8100.548293]  ? do_sys_openat2+0xef/0x160
[ 8100.548299]  ? __pfx_vfs_read+0x10/0x10
[ 8100.548304]  ? do_sys_openat2+0xef/0x160
[ 8100.548309]  ? __pfx_do_sys_openat2+0x10/0x10
[ 8100.548314]  ? kmem_cache_free+0x273/0x580
[ 8100.548320]  ? fdget_pos+0x1c9/0x4c0
[ 8100.548326]  ksys_read+0xef/0x1c0
[ 8100.548331]  ? __pfx_ksys_read+0x10/0x10
[ 8100.548338]  do_syscall_64+0x73/0x330
[ 8100.548343]  ? irqentry_exit_to_user_mode+0x32/0x210
[ 8100.548349]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 8100.548354] RIP: 0033:0x7fb2217147e2
[ 8100.548359] Code: c0 e9 b2 fe ff ff 50 48 8d 3d 8a b4 0c 00 e8 a5 1d 02 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
[ 8100.548364] RSP: 002b:00007fff86c1f1b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 8100.548369] RAX: ffffffffffffffda RBX: 000055e29cebb140 RCX: 00007fb2217147e2
[ 8100.548373] RDX: 00000000000003ff RSI: 00007fff86c1f290 RDI: 0000000000000003
[ 8100.548377] RBP: 000055e29cea7012 R08: 00000000003923e4 R09: 00007fb2219010e8
[ 8100.548384] R13: 000055e29cea706f R14: 000055e2a45f7e18 R15: 00007fb2218cb028
[ 8100.548390]  </TASK>
[ 8100.548393] ---[ end trace 0000000000000000 ]---
[ 8100.548408] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] SMP KASAN NOPTI
[ 8100.550709] KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
[ 8100.552307] CPU: 1 UID: 0 PID: 115002 Comm: read_all Kdump: loaded Tainted: G        W  OE       6.16.0-rc1 #4 PREEMPT(full)
[ 8100.554966] Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 8100.559389] RIP: 0010:proc_pid_arch_status+0x30/0xe0
[ 8100.560872] Code: 1f 44 00 00 55 48 89 fd 48 89 cf 53 48 83 ec 08 e8 e5 64 ff ff 48 ba 00 00 00 00 00 fc ff df 48 8d 78 08 48 89 f9 48 c1 e9 03 <80> 3c 11 00 75 7d 48 8b 58 08 48 c7 c2 ff ff ff ff 48 85 db 74 3d
[ 8100.565217] ICMPv6: process `read_all' is using deprecated sysctl (syscall) net.ipv6.neigh.default.base_reachable_time - use net.ipv6.neigh.default.base_reachable_time_ms instead
[ 8100.566314] RSP: 0018:ff1100017cdefb08 EFLAGS: 00010202
[ 8100.572739] RAX: 0000000000000000 RBX: ff1100024d6fe980 RCX: 0000000000000001
[ 8100.575131] RDX: dffffc0000000000 RSI: ffffffffa610d020 RDI: 0000000000000008
[ 8100.577328] RBP: ff110002cc49c918 R08: 0000000000000001 R09: ffe21c002116761d
[ 8100.579627] R10: ff11000108b3b0eb R11: 0000000000000000 R12: ff11000108a80180
[ 8100.581911] R13: ff11000108b3b0e8 R14: ffffffffa610d020 R15: 0000000000000001
[ 8100.584259] FS:  00007fb2218f4740(0000) GS:ff11000f171d1000(0000) knlGS:0000000000000000
[ 8100.586917] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8100.589242] CR2: 000055e2a4620000 CR3: 000000011412f004 CR4: 0000000000771ef0
[ 8100.591925] DR0: 0000000000000001 DR1: 0000000000000000 DR2: 0000000000000000
[ 8100.594666] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[ 8100.597372] PKRU: 55555554
[ 8100.599449] Call Trace:
[ 8100.601435]  <TASK>
[ 8100.603330]  proc_single_show+0x10c/0x1c0
[ 8100.605595]  seq_read_iter+0x3e5/0x1050
[ 8100.607898]  seq_read+0x24b/0x3b0
[ 8100.610127]  ? __pfx_do_filp_open+0x10/0x10
[ 8100.612476]  ? __pfx_seq_read+0x10/0x10
[ 8100.614670]  ? rcu_segcblist_enqueue+0x1d/0xe0
[ 8100.617053]  ? rcutree_enqueue.constprop.0+0x36/0x290
[ 8100.619531]  ? __call_rcu_common.constprop.0+0x30f/0x930
[ 8100.622088]  vfs_read+0x186/0xad0
[ 8100.624278]  ? alloc_fd+0x2c3/0x4c0
[ 8100.626502]  ? do_sys_openat2+0xef/0x160
[ 8100.628713]  ? __pfx_vfs_read+0x10/0x10
[ 8100.630963]  ? do_sys_openat2+0xef/0x160
[ 8100.633136]  ? __pfx_do_sys_openat2+0x10/0x10
[ 8100.635389]  ? kmem_cache_free+0x273/0x580
[ 8100.637560]  ? fdget_pos+0x1c9/0x4c0
[ 8100.639454]  ksys_read+0xef/0x1c0
[ 8100.641307]  ? __pfx_ksys_read+0x10/0x10
[ 8100.643164]  do_syscall_64+0x73/0x330
[ 8100.645072]  ? irqentry_exit_to_user_mode+0x32/0x210
[ 8100.647118]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 8100.649087] RIP: 0033:0x7fb2217147e2
[ 8100.650862] Code: c0 e9 b2 fe ff ff 50 48 8d 3d 8a b4 0c 00 e8 a5 1d 02 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
[ 8100.657529] RSP: 002b:00007fff86c1f1b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 8100.660056] RAX: ffffffffffffffda RBX: 000055e29cebb140 RCX: 00007fb2217147e2
[ 8100.662654] RDX: 00000000000003ff RSI: 00007fff86c1f290 RDI: 0000000000000003
[ 8100.665091] RBP: 000055e29cea7012 R08: 00000000003923e4 R09: 00007fb2219010e8
[ 8100.667422] R10: 0000000000000004 R11: 0000000000000246 R12: 00007fb2218c7000
[ 8100.669755] R13: 000055e29cea706f R14: 000055e2a45f7e18 R15: 00007fb2218cb028
[ 8100.672083]  </TASK>
[ 8100.673727] Modules linked in: vfio_iommu_type1 vfio dns_resolver tun overlay nls_iso8859_1 ntfs3 vfat fat xfs sctp ip6_udp_tunnel udp_tunnel nf_tables nfnetlink tcp_diag inet_diag ib_core isofs skx_edac_common input_leds led_class serio_raw sg virtio_balloon binfmt_misc squashfs loop sch_fq_codel dm_multipath fuse drm bpf_preload ip_tables x_tables raid10 async_tx raid1 raid0 linear dm_mirror dm_region_hash dm_log dm_mod hid_generic usbhid hid virtio_blk virtio_net net_failover failover ghash_clmulni_intel atkbd sha512_ssse3 aesni_intel vivaldi_fmap sr_mod i2c_i801 i2c_smbus i2c_core cdrom uhci_hcd ehci_pci virtio_pci ehci_hcd virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring [last unloaded: hwpoison_inject]

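Start with the warning at arch/x86/kernel/fpu/core.c:61.

It says that kernel threads are not supposed to use the FPU: when CONFIG_X86_DEBUG_FPU is enabled, calling x86_task_fpu() on a kernel thread warns once and simply returns NULL.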

#ifdef CONFIG_X86_DEBUG_FPU
struct fpu *x86_task_fpu(struct task_struct *task)
{
        if (WARN_ON_ONCE(task->flags & PF_KTHREAD))
                return NULL;

        return (void *)task + sizeof(*task);
}
#endif
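
To make that behavior concrete, here is a small userspace model of the debug variant. It is only a sketch: task_model, fpu_model, task_with_fpu, model_task_fpu() and read_timestamp() are names I made up for illustration, and the two-field fpu_model is an assumption that keeps just the fields this post talks about.

```c
#include <assert.h>
#include <stddef.h>

/* Same bit as PF_KTHREAD in include/linux/sched.h. */
#define PF_KTHREAD 0x00200000u

/* Hypothetical stand-ins for the kernel structures; only the fields used here. */
struct fpu_model {
    unsigned int last_cpu;
    unsigned long avx512_timestamp;
};

struct task_model {
    unsigned int flags;
};

/* The FPU state lives right behind the task, like "(void *)task + sizeof(*task)". */
struct task_with_fpu {
    struct task_model task;
    struct fpu_model fpu;
};

/* Model of the CONFIG_X86_DEBUG_FPU variant: NULL for kernel threads. */
struct fpu_model *model_task_fpu(struct task_model *task)
{
    assert(task != NULL);
    if (task->flags & PF_KTHREAD)
        return NULL; /* the kernel also fires WARN_ON_ONCE() here */
    return &((struct task_with_fpu *)task)->fpu;
}

/*
 * A caller that checks the result. The 6.16-rc1 reader has no such check,
 * which is what turns the NULL into a crash.
 */
int read_timestamp(struct task_model *task, unsigned long *out)
{
    struct fpu_model *fpu = model_task_fpu(task);

    if (!fpu)
        return -1;
    *out = fpu->avx512_timestamp;
    return 0;
}
```

In this model, read_timestamp() returns -1 for a task whose flags contain PF_KTHREAD and fills in the timestamp for any other task.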

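So who called x86_task_fpu()? According to the log it was the read_all process. That is part of ltp, and it presumably tries to read everything readable under /proc. I did not look closely, but that should be about right.

Looking around, I found this procfs file: /proc/[PID]/arch_status

Simply running cat /proc/*/arch_status crashes the machine, with the same log as the ltp-triggered crash.

So the question becomes: why does this read fail?

First, run cat /proc/*/arch_status under strace on a machine that does not have the problem: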

execve("/usr/bin/cat", ["cat", "/proc/1009334/arch_status", "/proc/100/arch_status", "/proc/1011/arch_status", "/proc/1012/arch_status", "/proc/1013/arch_status", "/proc/1014/arch_status", "/proc/1015/arch_status", "/proc/1016/arch_status", "/proc/1017/arch_status", "/proc/1018/arch_status", "/proc/1019/arch_status", "/proc/101/arch_status", "/proc/1020/arch_status", "/proc/1021336/arch_status", "/proc/1021386/arch_status", "/proc/1021/arch_status", "/proc/1022/arch_status", "/proc/1023/arch_status", "/proc/1024/arch_status", "/proc/1025/arch_status", "/proc/1026/arch_status", "/proc/1028/arch_status", "/proc/1029/arch_status", "/proc/102/arch_status", "/proc/1030/arch_status", "/proc/1032/arch_status", "/proc/1033/arch_status", "/proc/1034003/arch_status", "/proc/1034/arch_status", "/proc/1035/arch_status", "/proc/1036/arch_status", ...], [/* 27 vars */]) = 0
brk(NULL)                               = 0x1fff000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9a6eeb3000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=48807, ...}) = 0
mmap(NULL, 48807, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f9a6eea7000
close(3)                                = 0
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`&\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2156592, ...}) = 0
mmap(NULL, 3985920, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f9a6e800000
mprotect(0x7f9a6e9c4000, 2093056, PROT_NONE) = 0
mmap(0x7f9a6ebc3000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1c3000) = 0x7f9a6ebc3000
mmap(0x7f9a6ebc9000, 16896, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f9a6ebc9000
close(3)                                = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9a6eea6000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9a6eea4000
arch_prctl(ARCH_SET_FS, 0x7f9a6eea4740) = 0
access("/etc/sysconfig/strcasecmp-nonascii", F_OK) = -1 ENOENT (No such file or directory)
access("/etc/sysconfig/strcasecmp-nonascii", F_OK) = -1 ENOENT (No such file or directory)
mprotect(0x7f9a6ebc3000, 16384, PROT_READ) = 0
mprotect(0x60b000, 4096, PROT_READ)     = 0
mprotect(0x7f9a6ee21000, 4096, PROT_READ) = 0
munmap(0x7f9a6eea7000, 48807)           = 0
brk(NULL)                               = 0x1fff000
brk(0x2020000)                          = 0x2020000
brk(NULL)                               = 0x2020000
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=106176928, ...}) = 0
mmap(NULL, 106176928, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f9a68200000
close(3)                                = 0
fstat(1, {st_mode=S_IFREG|0644, st_size=2905, ...}) = 0
open("/proc/1009334/arch_status", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
read(3, "AVX512_elapsed_ms:\t-1\n", 65536) = 22
write(1, "AVX512_elapsed_ms:\t-1\n", 22AVX512_elapsed_ms:       -1
) = 22
read(3, "", 65536)                      = 0
close(3)                                = 0
open("/proc/100/arch_status", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
read(3, "AVX512_elapsed_ms:\t-1\n", 65536) = 22
write(1, "AVX512_elapsed_ms:\t-1\n", 22AVX512_elapsed_ms:       -1
) = 22
read(3, "", 65536)                      = 0
close(3)                                = 0
open("/proc/1011/arch_status", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
read(3, "AVX512_elapsed_ms:\t-1\n", 65536) = 22
write(1, "AVX512_elapsed_ms:\t-1\n", 22AVX512_elapsed_ms:       -1
......
read(3, "", 65536)                      = 0
close(3)                                = 0
open("/proc/self/arch_status", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
read(3, "AVX512_elapsed_ms:\t-1\n", 65536) = 22
write(1, "AVX512_elapsed_ms:\t-1\n", 22AVX512_elapsed_ms:       -1
) = 22
read(3, "", 65536)                      = 0
close(3)                                = 0
open("/proc/thread-self/arch_status", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
read(3, "AVX512_elapsed_ms:\t-1\n", 65536) = 22
write(1, "AVX512_elapsed_ms:\t-1\n", 22AVX512_elapsed_ms:       -1
) = 22
read(3, "", 65536)                      = 0
close(3)                                = 0
close(1)                                = 0
close(2)                                = 0
exit_group(1)                           = ?
+++ exited with 1 +++

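So cat simply reads the procfs files directly. Searching the kernel source for AVX512_elapsed_ms shows who produces the output. Here is the read path:

// tid_base_stuff/tgid_base_stuff contain the following entry, which looks like the definition of a procfs interface: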

#ifdef CONFIG_PROC_PID_ARCH_STATUS
        ONE("arch_status", S_IRUGO, proc_pid_arch_status),
#endif

#define ONE(NAME, MODE, show)                           \
        NOD(NAME, (S_IFREG|(MODE)),                     \
                NULL, &proc_single_file_operations,     \
                { .proc_show = show } )

#define NOD(NAME, MODE, IOP, FOP, OP) {                 \
        .name = (NAME),                                 \
        .len  = sizeof(NAME) - 1,                       \
        .mode = MODE,                                   \
        .iop  = IOP,                                    \
        .fop  = FOP,                                    \
        .op   = OP,                                     \
}



/*
 * Report architecture specific information
 */
int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
                        struct pid *pid, struct task_struct *task)
{
        /*
         * Report AVX512 state if the processor and build option supported.
         */
        if (cpu_feature_enabled(X86_FEATURE_AVX512F))
                avx512_status(m, task);

        return 0;
}

static void avx512_status(struct seq_file *m, struct task_struct *task)
{
        unsigned long timestamp = READ_ONCE(x86_task_fpu(task)->avx512_timestamp);
        long delta;

        if (!timestamp) {
                /*
                 * Report -1 if no AVX512 usage
                 */
                delta = -1;
        } else {
                delta = (long)(jiffies - timestamp);
                /*
                 * Cap to LONG_MAX if time difference > LONG_MAX
                 */
                if (delta < 0)
                        delta = LONG_MAX;
                delta = jiffies_to_msecs(delta);
        }

        seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
        seq_putc(m, '\n');
}

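As shown above, a procfs entry named arch_status is registered, and its show callback is proc_pid_arch_status(). If the CPU supports X86_FEATURE_AVX512F, that function calls avx512_status().

This is where it goes wrong. What does avx512_status() do when initializing timestamp? It evaluates x86_task_fpu(task)->avx512_timestamp. Recall x86_task_fpu() from earlier: with CONFIG_X86_DEBUG_FPU enabled, it returns NULL for a kernel thread.

That matches the log. The root cause: with CONFIG_X86_DEBUG_FPU enabled, x86_task_fpu() returns NULL when it is passed the task_struct of a kernel thread. /proc also contains the PIDs of kernel threads, so when cat /proc/*/arch_status reaches a /proc/[kernel thread PID]/arch_status, the timestamp initialization in avx512_status() gets NULL back from x86_task_fpu() and then reads ->avx512_timestamp through it, which is a null pointer dereference.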
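
The NULL dereference also explains the odd address in the Oops line. With KASAN, each access is preceded by a check of a shadow byte, and that shadow load is what faulted. Below is a sketch of the arithmetic. The shift of 3 and the offset are assumptions read off the register dump (RDX holds dffffc0000000000 and the code does a shift right by 3 on RCX), and the function names are mine, chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* One shadow byte covers 8 bytes of memory. */
#define KASAN_SHADOW_SCALE_SHIFT 3
/* Shadow offset as seen in RDX in the register dump. */
#define KASAN_SHADOW_OFFSET UINT64_C(0xdffffc0000000000)

/* Address of the shadow byte that describes addr. */
uint64_t kasan_mem_to_shadow(uint64_t addr)
{
    return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
}

/* First byte of the 8-byte granule described by a shadow address. */
uint64_t kasan_shadow_to_mem(uint64_t shadow)
{
    assert(shadow >= KASAN_SHADOW_OFFSET);
    return (shadow - KASAN_SHADOW_OFFSET) << KASAN_SHADOW_SCALE_SHIFT;
}
```

In the second register dump RAX is 0 (the NULL from x86_task_fpu()) and RDI is 0x8, the address of the field being read. kasan_mem_to_shadow(0x8) gives 0xdffffc0000000001, the non-canonical address in the Oops line, and the granule behind that shadow byte is 0x8 to 0xf, the range KASAN prints.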

So the cause of the crash is clear. The next questions are: which commit introduced the problem, and how should it be fixed?

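Who introduced the bug?

First question: which commit introduced it? The direct trigger is clearly the WARN_ON_ONCE() path in x86_task_fpu() that returns NULL, so I looked up its commit history.

Sure enough, it is commit 22aafe3bcb67 ("x86/fpu: Remove init_task FPU state dependencies, add debugging warning for PF_KTHREAD tasks").

The patch removes the FPU-related state from init_task, because init_task does not use an FPU context. It also notes that init_task and other kernel threads should use the FPU through kernel_fpu_begin()/_end(). So the committer added a check on the kernel-thread flag under CONFIG_X86_DEBUG_FPU: for a kernel thread, x86_task_fpu() warns and returns NULL.

Fixing the bug

Now the second question: how to fix it?

My first thought was to use the kernel_fpu_begin()/_end() pair mentioned in the patch: temporarily grant FPU access while arch_status is read for a kernel thread's PID, then revoke it afterwards. That does not look like a valid use, though. kernel_fpu_begin()/_end() appears to be meant for a kernel thread to call in its own context. Should I then write a helper that detects a kernel thread before x86_task_fpu() is called and temporarily enables FPU use for it? Is that really necessary? All this file does is expose the AVX512_elapsed_ms time to userspace, and granting and revoking FPU access on every cat /proc/[PID]/arch_status might cause other problems. So why not simply report -1 when the task is a kernel thread, as follows: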

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 9aa9ac8399ae..16f813a42f42 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1859,9 +1859,14 @@ long fpu_xstate_prctl(int option, unsigned long arg2)
  */
 static void avx512_status(struct seq_file *m, struct task_struct *task)
 {
-       unsigned long timestamp = READ_ONCE(x86_task_fpu(task)->avx512_timestamp);
+       unsigned long timestamp = 0;
        long delta;

+#ifdef CONFIG_X86_DEBUG_FPU
+       if (!(task->flags & PF_KTHREAD))
+#endif
+               timestamp = READ_ONCE(x86_task_fpu(task)->avx512_timestamp);
+
        if (!timestamp) {
                /*
                 * Report -1 if no AVX512 usage

I rebuilt the kernel, and the test passes:

root@instance-ogqytwuj:~# ps aux |grep 117440
root      117440  0.0  0.0      0     0 ?        I    19:49   0:00 [kworker/3:1]
root      120216  0.0  0.0   9756  2404 pts/0    S+   20:03   0:00 grep --color=auto 117440
root@instance-ogqytwuj:~# cat /proc/117440/arch_status
AVX512_elapsed_ms:	-1
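
To pick out kernel-thread PIDs for a check like this without eyeballing ps, the flags field of /proc/[PID]/stat can be tested for PF_KTHREAD. Here is a minimal sketch; stat_line_is_kthread() and pid_is_kthread() are names I made up, and the buffer sizes are assumptions that are generous for the first nine fields.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same bit as PF_KTHREAD in include/linux/sched.h. */
#define PF_KTHREAD 0x00200000u

/*
 * Parse one /proc/[PID]/stat line.
 * Returns 1 for a kernel thread, 0 for anything else, -1 on a parse error.
 */
int stat_line_is_kthread(const char *stat_line)
{
    const char *p;
    char state;
    int ppid, pgrp, session, tty_nr, tpgid;
    unsigned int flags;

    assert(stat_line != NULL);
    /* comm may itself contain spaces or ')', so start after the last ')'. */
    p = strrchr(stat_line, ')');
    if (!p)
        return -1;
    /* Fields 3..9: state ppid pgrp session tty_nr tpgid flags */
    if (sscanf(p + 1, " %c %d %d %d %d %d %u", &state, &ppid, &pgrp,
               &session, &tty_nr, &tpgid, &flags) != 7)
        return -1;
    return (flags & PF_KTHREAD) ? 1 : 0;
}

/* Convenience wrapper that reads the stat file of a live PID. */
int pid_is_kthread(int pid)
{
    char path[64];
    char line[512];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/stat", pid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(line, sizeof(line), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return stat_line_is_kthread(line);
}
```

Looping pid_is_kthread() over the numeric entries in /proc and reading arch_status only for the PIDs where it returns 1 exercises exactly the path that used to crash.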

https://lore.kernel.org/all/20250717094308.94450-1-wangfushuai@baidu.com/T/#u

https://lore.kernel.org/all/fa4e5e3d-431c-4dcb-9ffc-b20e6ee66e43@intel.com/T/#t

https://lore.kernel.org/all/20250724013422.307954-1-sohil.mehta@intel.com/T/#t

https://lore.kernel.org/all/11c3284d-1257-4010-b2fb-5cc5b7b87fb4@intel.com/T/#t

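After discussion on the list and co-development with Sohil Mehta, the final decision was to stop exposing AVX-512 usage time to userspace for kernel threads altogether (the patch also covers PF_USER_WORKER tasks), regardless of whether CONFIG_X86_DEBUG_FPU is enabled: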

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 12ed75c1b567..28e4fd65c9da 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1881,19 +1881,20 @@ long fpu_xstate_prctl(int option, unsigned long arg2)
 #ifdef CONFIG_PROC_PID_ARCH_STATUS
 /*
  * Report the amount of time elapsed in millisecond since last AVX512
- * use in the task.
+ * use in the task. Report -1 if no AVX-512 usage.
  */
 static void avx512_status(struct seq_file *m, struct task_struct *task)
 {
-	unsigned long timestamp = READ_ONCE(x86_task_fpu(task)->avx512_timestamp);
-	long delta;
+	unsigned long timestamp;
+	long delta = -1;
 
-	if (!timestamp) {
-		/*
-		 * Report -1 if no AVX512 usage
-		 */
-		delta = -1;
-	} else {
+	/* AVX-512 usage is not tracked for kernel threads. Don't report anything. */
+	if (task->flags & (PF_KTHREAD | PF_USER_WORKER))
+		return;
+
+	timestamp = READ_ONCE(x86_task_fpu(task)->avx512_timestamp);
+
+	if (timestamp) {
 		delta = (long)(jiffies - timestamp);
 		/*
 		 * Cap to LONG_MAX if time difference > LONG_MAX
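
The behavior after the final patch can be modeled in userspace as below. This is only a sketch: task_model and avx512_status_model() are hypothetical names, the timestamp is stored directly in the task for brevity, the PF_USER_WORKER value is quoted from memory of include/linux/sched.h, and I assume HZ=1000 so that one jiffy equals one millisecond and the jiffies_to_msecs() step can be left out.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

#define PF_KTHREAD     0x00200000u
#define PF_USER_WORKER 0x00004000u /* value assumed, see lead-in */

/* Hypothetical stand-in for the task; 0 means AVX-512 was never used. */
struct task_model {
    unsigned int flags;
    unsigned long avx512_timestamp;
};

/*
 * Writes what /proc/[PID]/arch_status would contain into buf.
 * Returns the number of characters written, 0 for kernel threads
 * and user workers. "now" plays the role of jiffies.
 */
int avx512_status_model(const struct task_model *task, unsigned long now,
                        char *buf, size_t len)
{
    long delta = -1; /* default: no AVX-512 usage */

    assert(task != NULL && buf != NULL);

    /* Usage is not tracked for these tasks, so report nothing at all. */
    if (task->flags & (PF_KTHREAD | PF_USER_WORKER))
        return 0;

    if (task->avx512_timestamp) {
        delta = (long)(now - task->avx512_timestamp);
        /* Cap to LONG_MAX if the difference does not fit in a long. */
        if (delta < 0)
            delta = LONG_MAX;
    }
    return snprintf(buf, len, "AVX512_elapsed_ms:\t%ld\n", delta);
}
```

A kernel thread now yields an empty file, a user task that never touched AVX-512 still gets -1, and a task with a timestamp gets the elapsed time. Note that x86_task_fpu() is never reached for kernel threads, so the debug warning stays quiet as well.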

Afterword

As mentioned above, a kernel thread that needs the FPU can use the kernel_fpu_begin()/_end() API. How does it work?

/* Code that is unaware of kernel_fpu_begin_mask() can use this */
static inline void kernel_fpu_begin(void)
{
#ifdef CONFIG_X86_64
        /*
         * Any 64-bit code that uses 387 instructions must explicitly request
         * KFPU_387.
         */
        kernel_fpu_begin_mask(KFPU_MXCSR);
#else
        /*
         * 32-bit kernel code may use 387 operations as well as SSE2, etc,
         * as long as it checks that the CPU has the required capability.
         */
        kernel_fpu_begin_mask(KFPU_387 | KFPU_MXCSR);
#endif
}

void kernel_fpu_begin_mask(unsigned int kfpu_mask)
{
        if (!irqs_disabled())
                fpregs_lock();

        WARN_ON_FPU(!irq_fpu_usable());

        /* Toggle kernel_fpu_allowed to false: */
        WARN_ON_FPU(!this_cpu_read(kernel_fpu_allowed));
        this_cpu_write(kernel_fpu_allowed, false);

        if (!(current->flags & (PF_KTHREAD | PF_USER_WORKER)) &&
            !test_thread_flag(TIF_NEED_FPU_LOAD)) {
                set_thread_flag(TIF_NEED_FPU_LOAD);
                save_fpregs_to_fpstate(x86_task_fpu(current));
        }
        __cpu_invalidate_fpregs_state();

        /* Put sane initial values into the control registers. */
        if (likely(kfpu_mask & KFPU_MXCSR) && boot_cpu_has(X86_FEATURE_XMM))
                ldmxcsr(MXCSR_DEFAULT);

        if (unlikely(kfpu_mask & KFPU_387) && boot_cpu_has(X86_FEATURE_FPU))
                asm volatile ("fninit");
}


/*
 * Track FPU initialization and kernel-mode usage. 'true' means the FPU is
 * initialized and is not currently being used by the kernel:
 */
DEFINE_PER_CPU(bool, kernel_fpu_allowed);

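So it looks like a kernel thread that wants to use the FPU first calls kernel_fpu_begin(). That function checks, with warnings, that the FPU is usable in the current context, sets kernel_fpu_allowed to false, and after that the kernel can use the FPU.

So it all hinges on the per-CPU variable kernel_fpu_allowed: true means the FPU is not currently being used by the kernel, false means it is. This per-CPU flag tracks the current usage state.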
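
The toggle can be sketched as a tiny userspace state machine. Everything here is a model: cpu_model and the model_* functions are made-up names, the warnings counter stands in for WARN_ON_FPU(), and the end side is my assumption of the mirror image of the begin code quoted above (that function is not shown in this post).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One CPU's view; the kernel uses DEFINE_PER_CPU(bool, kernel_fpu_allowed). */
struct cpu_model {
    bool kernel_fpu_allowed;
    unsigned int warnings; /* what WARN_ON_FPU() would have reported */
};

/* After FPU init: usable, and not in use by the kernel. */
void model_fpu_init(struct cpu_model *cpu)
{
    assert(cpu != NULL);
    cpu->kernel_fpu_allowed = true;
    cpu->warnings = 0;
}

void model_kernel_fpu_begin(struct cpu_model *cpu)
{
    assert(cpu != NULL);
    if (!cpu->kernel_fpu_allowed)
        cpu->warnings++; /* nested use, or use before init */
    cpu->kernel_fpu_allowed = false;
    /* The real code also saves user FPU registers and resets MXCSR here. */
}

void model_kernel_fpu_end(struct cpu_model *cpu)
{
    assert(cpu != NULL);
    if (cpu->kernel_fpu_allowed)
        cpu->warnings++; /* end without a matching begin */
    cpu->kernel_fpu_allowed = true;
}
```

A balanced begin/end pair leaves the flag at true with no warnings, while a second begin before the matching end bumps the warning counter, which is the kind of misuse the debug checks are there to catch.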

