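Background
yzwddsg once said that having resources and leaving them idle is a sheer waste.
So in my spare time (read: slacking off at work) I set up a VM, pulled the latest kernel source, built and installed it, and started testing to see whether anything would break. stress-ng, fio, hackbench, sysbench, LTP: whatever I could run, I ran.
And something really did break. During the LTP run the machine crashed, and it crashed every time. I tried to set up kdump to get a core to analyze, because the log only showed a NULL pointer dereference. Unfortunately I couldn't get it working properly: a core was produced, but crash couldn't open it. So the analysis below is based on the log alone.
Analysis
Root cause
I'm using a fairly recent kernel, 6.16-rc1, and LTP version 20250530.
After several LTP runs crashed at the same spot, it was clearly no coincidence. Here is the log: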
[ 8100.082169] LTP: starting read_all_dev (read_all -d /dev -p -q -r 3)
[ 8100.146532] VFIO - User Level meta-driver version: 0.3
[ 8100.297762] LTP: starting read_all_proc (read_all -d /proc -q -r 3)
[ 8100.547969] ------------[ cut here ]------------
[ 8100.547978] WARNING: CPU: 1 PID: 115002 at arch/x86/kernel/fpu/core.c:61 x86_task_fpu+0x46/0x60
[ 8100.547989] Modules linked in: vfio_iommu_type1 vfio dns_resolver tun overlay nls_iso8859_1 ntfs3 vfat fat xfs sctp ip6_udp_tun
nel udp_tunnel nf_tables nfnetlink tcp_diag inet_diag ib_core isofs skx_edac_common input_leds led_class serio_raw sg virtio_ballo
on binfmt_misc squashfs loop sch_fq_codel dm_multipath fuse drm bpf_preload ip_tables x_tables raid10 async_tx raid1 raid0 linear
dm_mirror dm_region_hash dm_log dm_mod hid_generic usbhid hid virtio_blk virtio_net net_failover failover ghash_clmulni_intel atkb
d sha512_ssse3 aesni_intel vivaldi_fmap sr_mod i2c_i801 i2c_smbus i2c_core cdrom uhci_hcd ehci_pci virtio_pci ehci_hcd virtio_pci_
legacy_dev virtio_pci_modern_dev virtio virtio_ring [last unloaded: hwpoison_inject]
[ 8100.548134] CPU: 1 UID: 0 PID: 115002 Comm: read_all Kdump: loaded Tainted: G OE 6.16.0-rc1 #4 PREEMPT(full)
[ 8100.548143] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 8100.548151] RIP: 0010:x86_task_fpu+0x46/0x60
[ 8100.548156] Code: ea 03 0f b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 1d 48 8d 83 00 25 00 00 f6 43 2e 20 75 06 5
b c3 cc cc cc cc <0f> 0b 31 c0 5b c3 cc cc cc cc e8 cb da a4 00 eb dc 66 0f 1f 84 00
[ 8100.548162] RSP: 0018:ff1100017cdefaf8 EFLAGS: 00010202
[ 8100.548167] RAX: ff11000108b3d5c0 RBX: ff11000108b3b0c0 RCX: ff11000108b3b0c0
[ 8100.548172] RDX: 0000000000000000 RSI: ffffffffa610d020 RDI: ff11000108b3b0ec
[ 8100.548176] RBP: ff110002cc49c918 R08: 0000000000000001 R09: ffe21c002116761d
[ 8100.548179] R10: ff11000108b3b0eb R11: 0000000000000000 R12: ff11000108a80180
[ 8100.548183] R13: ff11000108b3b0e8 R14: ffffffffa610d020 R15: 0000000000000001
[ 8100.548187] FS: 00007fb2218f4740(0000) GS:ff11000f171d1000(0000) knlGS:0000000000000000
[ 8100.548193] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8100.548197] CR2: 000055e2a4620000 CR3: 000000011412f004 CR4: 0000000000771ef0
[ 8100.548200] DR0: 0000000000000001 DR1: 0000000000000000 DR2: 0000000000000000
[ 8100.548204] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[ 8100.548207] PKRU: 55555554
[ 8100.548210] Call Trace:
[ 8100.548213] <TASK>
[ 8100.548216] proc_pid_arch_status+0x1b/0xe0
[ 8100.548222] proc_single_show+0x10c/0x1c0
[ 8100.548230] seq_read_iter+0x3e5/0x1050
[ 8100.548239] seq_read+0x24b/0x3b0
[ 8100.548249] ? __pfx_do_filp_open+0x10/0x10
[ 8100.548256] ? __pfx_seq_read+0x10/0x10
[ 8100.548260] ? rcu_segcblist_enqueue+0x1d/0xe0
[ 8100.548268] ? rcutree_enqueue.constprop.0+0x36/0x290
[ 8100.548274] ? __call_rcu_common.constprop.0+0x30f/0x930
[ 8100.548281] vfs_read+0x186/0xad0
[ 8100.548288] ? alloc_fd+0x2c3/0x4c0
[ 8100.548293] ? do_sys_openat2+0xef/0x160
[ 8100.548299] ? __pfx_vfs_read+0x10/0x10
[ 8100.548304] ? do_sys_openat2+0xef/0x160
[ 8100.548309] ? __pfx_do_sys_openat2+0x10/0x10
[ 8100.548314] ? kmem_cache_free+0x273/0x580
[ 8100.548320] ? fdget_pos+0x1c9/0x4c0
[ 8100.548326] ksys_read+0xef/0x1c0
[ 8100.548331] ? __pfx_ksys_read+0x10/0x10
[ 8100.548338] do_syscall_64+0x73/0x330
[ 8100.548343] ? irqentry_exit_to_user_mode+0x32/0x210
[ 8100.548349] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 8100.548354] RIP: 0033:0x7fb2217147e2
[ 8100.548359] Code: c0 e9 b2 fe ff ff 50 48 8d 3d 8a b4 0c 00 e8 a5 1d 02 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 8
5 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
[ 8100.548364] RSP: 002b:00007fff86c1f1b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 8100.548369] RAX: ffffffffffffffda RBX: 000055e29cebb140 RCX: 00007fb2217147e2
[ 8100.548373] RDX: 00000000000003ff RSI: 00007fff86c1f290 RDI: 0000000000000003
[ 8100.548377] RBP: 000055e29cea7012 R08: 00000000003923e4 R09: 00007fb2219010e8
[ 8100.548384] R13: 000055e29cea706f R14: 000055e2a45f7e18 R15: 00007fb2218cb028
[ 8100.548390] </TASK>
[ 8100.548393] ---[ end trace 0000000000000000 ]---
[ 8100.548408] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] SMP KASAN NOPTI
[ 8100.550709] KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
[ 8100.552307] CPU: 1 UID: 0 PID: 115002 Comm: read_all Kdump: loaded Tainted: G W OE 6.16.0-rc1 #4 PREEMPT(full)
[ 8100.554966] Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 8100.559389] RIP: 0010:proc_pid_arch_status+0x30/0xe0
[ 8100.560872] Code: 1f 44 00 00 55 48 89 fd 48 89 cf 53 48 83 ec 08 e8 e5 64 ff ff 48 ba 00 00 00 00 00 fc ff df 48 8d 78 08 48 8
9 f9 48 c1 e9 03 <80> 3c 11 00 75 7d 48 8b 58 08 48 c7 c2 ff ff ff ff 48 85 db 74 3d
[ 8100.565217] ICMPv6: process `read_all' is using deprecated sysctl (syscall) net.ipv6.neigh.default.base_reachable_time - use ne
t.ipv6.neigh.default.base_reachable_time_ms instead
[ 8100.566314] RSP: 0018:ff1100017cdefb08 EFLAGS: 00010202
[ 8100.572739] RAX: 0000000000000000 RBX: ff1100024d6fe980 RCX: 0000000000000001
[ 8100.575131] RDX: dffffc0000000000 RSI: ffffffffa610d020 RDI: 0000000000000008
[ 8100.577328] RBP: ff110002cc49c918 R08: 0000000000000001 R09: ffe21c002116761d
[ 8100.579627] R10: ff11000108b3b0eb R11: 0000000000000000 R12: ff11000108a80180
[ 8100.581911] R13: ff11000108b3b0e8 R14: ffffffffa610d020 R15: 0000000000000001
[ 8100.584259] FS: 00007fb2218f4740(0000) GS:ff11000f171d1000(0000) knlGS:0000000000000000
[ 8100.586917] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8100.589242] CR2: 000055e2a4620000 CR3: 000000011412f004 CR4: 0000000000771ef0
[ 8100.591925] DR0: 0000000000000001 DR1: 0000000000000000 DR2: 0000000000000000
[ 8100.594666] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[ 8100.597372] PKRU: 55555554
[ 8100.599449] Call Trace:
[ 8100.601435] <TASK>
[ 8100.603330] proc_single_show+0x10c/0x1c0
[ 8100.605595] seq_read_iter+0x3e5/0x1050
[ 8100.607898] seq_read+0x24b/0x3b0
[ 8100.610127] ? __pfx_do_filp_open+0x10/0x10
[ 8100.612476] ? __pfx_seq_read+0x10/0x10
[ 8100.614670] ? rcu_segcblist_enqueue+0x1d/0xe0
[ 8100.617053] ? rcutree_enqueue.constprop.0+0x36/0x290
[ 8100.619531] ? __call_rcu_common.constprop.0+0x30f/0x930
[ 8100.622088] vfs_read+0x186/0xad0
[ 8100.624278] ? alloc_fd+0x2c3/0x4c0
[ 8100.626502] ? do_sys_openat2+0xef/0x160
[ 8100.628713] ? __pfx_vfs_read+0x10/0x10
[ 8100.630963] ? do_sys_openat2+0xef/0x160
[ 8100.633136] ? __pfx_do_sys_openat2+0x10/0x10
[ 8100.635389] ? kmem_cache_free+0x273/0x580
[ 8100.637560] ? fdget_pos+0x1c9/0x4c0
[ 8100.639454] ksys_read+0xef/0x1c0
[ 8100.641307] ? __pfx_ksys_read+0x10/0x10
[ 8100.643164] do_syscall_64+0x73/0x330
[ 8100.645072] ? irqentry_exit_to_user_mode+0x32/0x210
[ 8100.647118] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 8100.649087] RIP: 0033:0x7fb2217147e2
[ 8100.650862] Code: c0 e9 b2 fe ff ff 50 48 8d 3d 8a b4 0c 00 e8 a5 1d 02 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 8
5 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
[ 8100.657529] RSP: 002b:00007fff86c1f1b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 8100.660056] RAX: ffffffffffffffda RBX: 000055e29cebb140 RCX: 00007fb2217147e2
[ 8100.662654] RDX: 00000000000003ff RSI: 00007fff86c1f290 RDI: 0000000000000003
[ 8100.665091] RBP: 000055e29cea7012 R08: 00000000003923e4 R09: 00007fb2219010e8
[ 8100.667422] R10: 0000000000000004 R11: 0000000000000246 R12: 00007fb2218c7000
[ 8100.669755] R13: 000055e29cea706f R14: 000055e2a45f7e18 R15: 00007fb2218cb028
[ 8100.672083] </TASK>
[ 8100.673727] Modules linked in: vfio_iommu_type1 vfio dns_resolver tun overlay nls_iso8859_1 ntfs3 vfat fat xfs sctp ip6_udp_tun
nel udp_tunnel nf_tables nfnetlink tcp_diag inet_diag ib_core isofs skx_edac_common input_leds led_class serio_raw sg virtio_ballo
on binfmt_misc squashfs loop sch_fq_codel dm_multipath fuse drm bpf_preload ip_tables x_tables raid10 async_tx raid1 raid0 linear
dm_mirror dm_region_hash dm_log dm_mod hid_generic usbhid hid virtio_blk virtio_net net_failover failover ghash_clmulni_intel atkb
d sha512_ssse3 aesni_intel vivaldi_fmap sr_mod i2c_i801 i2c_smbus i2c_core cdrom uhci_hcd ehci_pci virtio_pci ehci_hcd virtio_pci_
legacy_dev virtio_pci_modern_dev virtio virtio_ring [last unloaded: hwpoison_inject]
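Look at the warning first: arch/x86/kernel/fpu/core.c:61
In other words, kernel threads are not supposed to use the FPU. With CONFIG_X86_DEBUG_FPU enabled, calling x86_task_fpu() on a kernel thread warns once and returns NULL: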
#ifdef CONFIG_X86_DEBUG_FPU
struct fpu *x86_task_fpu(struct task_struct *task)
{
	if (WARN_ON_ONCE(task->flags & PF_KTHREAD))
		return NULL;

	return (void *)task + sizeof(*task);
}
#endif
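So who called x86_task_fpu()? According to the log it was the read_all process, which is part of LTP. It presumably tries to read everything readable under /proc. I didn't look closely, but that should be roughly right.
Looking around, there is this procfs file: /proc/[PID]/arch_status
Just running cat /proc/*/arch_status crashes the machine, and the log is identical to the one from the LTP crash.
So the question is why this read fails.
First, run cat /proc/*/arch_status on a machine without the problem and look at it under strace: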
execve("/usr/bin/cat", ["cat", "/proc/1009334/arch_status", "/proc/100/arch_status", "/proc/1011/arch_status", "/proc/1
012/arch_status", "/proc/1013/arch_status", "/proc/1014/arch_status", "/proc/1015/arch_status", "/proc/1016/arch_status
", "/proc/1017/arch_status", "/proc/1018/arch_status", "/proc/1019/arch_status", "/proc/101/arch_status", "/proc/1020/a
rch_status", "/proc/1021336/arch_status", "/proc/1021386/arch_status", "/proc/1021/arch_status", "/proc/1022/arch_statu
s", "/proc/1023/arch_status", "/proc/1024/arch_status", "/proc/1025/arch_status", "/proc/1026/arch_status", "/proc/1028
/arch_status", "/proc/1029/arch_status", "/proc/102/arch_status", "/proc/1030/arch_status", "/proc/1032/arch_status", "
/proc/1033/arch_status", "/proc/1034003/arch_status", "/proc/1034/arch_status", "/proc/1035/arch_status", "/proc/1036/a
rch_status", ...], [/* 27 vars */]) = 0
brk(NULL) = 0x1fff000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9a6eeb3000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=48807, ...}) = 0
mmap(NULL, 48807, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f9a6eea7000
close(3) = 0
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`&\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2156592, ...}) = 0
mmap(NULL, 3985920, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f9a6e800000
mprotect(0x7f9a6e9c4000, 2093056, PROT_NONE) = 0
mmap(0x7f9a6ebc3000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1c3000) = 0x7f9a6ebc3000
mmap(0x7f9a6ebc9000, 16896, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f9a6ebc9000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9a6eea6000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9a6eea4000
arch_prctl(ARCH_SET_FS, 0x7f9a6eea4740) = 0
access("/etc/sysconfig/strcasecmp-nonascii", F_OK) = -1 ENOENT (No such file or directory)
access("/etc/sysconfig/strcasecmp-nonascii", F_OK) = -1 ENOENT (No such file or directory)
mprotect(0x7f9a6ebc3000, 16384, PROT_READ) = 0
mprotect(0x60b000, 4096, PROT_READ) = 0
mprotect(0x7f9a6ee21000, 4096, PROT_READ) = 0
munmap(0x7f9a6eea7000, 48807) = 0
brk(NULL) = 0x1fff000
brk(0x2020000) = 0x2020000
brk(NULL) = 0x2020000
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=106176928, ...}) = 0
mmap(NULL, 106176928, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f9a68200000
close(3) = 0
fstat(1, {st_mode=S_IFREG|0644, st_size=2905, ...}) = 0
open("/proc/1009334/arch_status", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
read(3, "AVX512_elapsed_ms:\t-1\n", 65536) = 22
write(1, "AVX512_elapsed_ms:\t-1\n", 22AVX512_elapsed_ms: -1
) = 22
read(3, "", 65536) = 0
close(3) = 0
open("/proc/100/arch_status", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
read(3, "AVX512_elapsed_ms:\t-1\n", 65536) = 22
write(1, "AVX512_elapsed_ms:\t-1\n", 22AVX512_elapsed_ms: -1
) = 22
read(3, "", 65536) = 0
close(3) = 0
open("/proc/1011/arch_status", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
read(3, "AVX512_elapsed_ms:\t-1\n", 65536) = 22
write(1, "AVX512_elapsed_ms:\t-1\n", 22AVX512_elapsed_ms: -1
......
read(3, "", 65536) = 0
close(3) = 0
open("/proc/self/arch_status", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
read(3, "AVX512_elapsed_ms:\t-1\n", 65536) = 22
write(1, "AVX512_elapsed_ms:\t-1\n", 22AVX512_elapsed_ms: -1
) = 22
read(3, "", 65536) = 0
close(3) = 0
open("/proc/thread-self/arch_status", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
read(3, "AVX512_elapsed_ms:\t-1\n", 65536) = 22
write(1, "AVX512_elapsed_ms:\t-1\n", 22AVX512_elapsed_ms: -1
) = 22
read(3, "", 65536) = 0
close(3) = 0
close(1) = 0
close(2) = 0
exit_group(1) = ?
+++ exited with 1 +++
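So it simply reads procfs. Searching the kernel source for AVX512_elapsed_ms shows where the output comes from; here is the read path:
// tid_base_stuff/tgid_base_stuff contain the following entry, which looks like it simply defines a procfs interface: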
#ifdef CONFIG_PROC_PID_ARCH_STATUS
	ONE("arch_status", S_IRUGO, proc_pid_arch_status),
#endif

#define ONE(NAME, MODE, show)				\
	NOD(NAME, (S_IFREG|(MODE)),			\
		NULL, &proc_single_file_operations,	\
		{ .proc_show = show } )

#define NOD(NAME, MODE, IOP, FOP, OP) {			\
	.name = (NAME),					\
	.len  = sizeof(NAME) - 1,			\
	.mode = MODE,					\
	.iop  = IOP,					\
	.fop  = FOP,					\
	.op   = OP,					\
}

/*
 * Report architecture specific information
 */
int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
			struct pid *pid, struct task_struct *task)
{
	/*
	 * Report AVX512 state if the processor and build option supported.
	 */
	if (cpu_feature_enabled(X86_FEATURE_AVX512F))
		avx512_status(m, task);

	return 0;
}

static void avx512_status(struct seq_file *m, struct task_struct *task)
{
	unsigned long timestamp = READ_ONCE(x86_task_fpu(task)->avx512_timestamp);
	long delta;

	if (!timestamp) {
		/*
		 * Report -1 if no AVX512 usage
		 */
		delta = -1;
	} else {
		delta = (long)(jiffies - timestamp);
		/*
		 * Cap to LONG_MAX if time difference > LONG_MAX
		 */
		if (delta < 0)
			delta = LONG_MAX;
		delta = jiffies_to_msecs(delta);
	}

	seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
	seq_putc(m, '\n');
}
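As shown above, a procfs file named arch_status is registered, and its show callback is proc_pid_arch_status(). That function calls avx512_status() if the CPU supports X86_FEATURE_AVX512F.
This is where the problem shows up. What does the timestamp initializer in avx512_status() do? It evaluates x86_task_fpu(task)->avx512_timestamp. Recall x86_task_fpu() from earlier: with CONFIG_X86_DEBUG_FPU enabled, it returns NULL for a kernel thread.
That matches. The root cause: with CONFIG_X86_DEBUG_FPU enabled, x86_task_fpu() returns NULL when it is passed the task_struct of a kernel thread. /proc also contains the PIDs of kernel threads, so when cat /proc/*/arch_status reaches a /proc/[kernel thread]/arch_status, the timestamp initializer in avx512_status() gets NULL back from x86_task_fpu() and then follows ->avx512_timestamp through it, which is a NULL pointer dereference. avx512_status() is static and was presumably inlined, which would explain why the oops RIP points into proc_pid_arch_status.
So the cause of the crash is clear. The next questions are: which commit introduced it, and how should it be fixed?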
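Who introduced the bug?
First question: which commit introduced this? The crash is evidently triggered through the WARN_ON_ONCE path in x86_task_fpu(), so check its commit history.
Sure enough, it is 22aafe3bcb67 ("x86/fpu: Remove init_task FPU state dependencies, add debugging warning for PF_KTHREAD tasks").
The patch explains that it removes the FPU-related state from init_task because init_task does not use an FPU context, and that init_task and other kernel threads should use the FPU only through kernel_fpu_begin()/_end(). So the committer added a PF_KTHREAD check under CONFIG_X86_DEBUG_FPU: for a kernel thread, x86_task_fpu() warns and returns NULL.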
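Fixing it
Second question: how to fix it?
My first idea was to use kernel_fpu_begin()/_end(), as mentioned in the patch, to temporarily grant access while a kernel thread's arch_status is being read and revoke it afterwards. But they don't look usable that way: kernel_fpu_begin()/_end() appear to be meant for the kernel thread's own context. Should I then write a helper that, for a kernel thread, temporarily enables FPU use before x86_task_fpu() is called? Is that even worth it? All this file does is expose the AVX512_elapsed_ms time to userspace, and granting and revoking FPU access on every cat /proc/[PID]/arch_status might cause other problems. So why not simply report -1 when the task is a kernel thread, like this: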
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 9aa9ac8399ae..16f813a42f42 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1859,9 +1859,14 @@ long fpu_xstate_prctl(int option, unsigned long arg2)
*/
static void avx512_status(struct seq_file *m, struct task_struct *task)
{
- unsigned long timestamp = READ_ONCE(x86_task_fpu(task)->avx512_timestamp);
+ unsigned long timestamp = 0;
long delta;
+#ifdef CONFIG_X86_DEBUG_FPU
+ if (!(task->flags & PF_KTHREAD))
+#endif
+ timestamp = READ_ONCE(x86_task_fpu(task)->avx512_timestamp);
+
if (!timestamp) {
/*
* Report -1 if no AVX512 usage
After rebuilding the kernel, the test passes:
root@instance-ogqytwuj:~# ps aux |grep 117440
root 117440 0.0 0.0 0 0 ? I 19:49 0:00 [kworker/3:1]
root 120216 0.0 0.0 9756 2404 pts/0 S+ 20:03 0:00 grep --color=auto 117440
root@instance-ogqytwuj:~# cat /proc/117440/arch_status
AVX512_elapsed_ms: -1
https://lore.kernel.org/all/20250717094308.94450-1-wangfushuai@baidu.com/T/#u
https://lore.kernel.org/all/fa4e5e3d-431c-4dcb-9ffc-b20e6ee66e43@intel.com/T/#t
https://lore.kernel.org/all/20250724013422.307954-1-sohil.mehta@intel.com/T/#t
https://lore.kernel.org/all/11c3284d-1257-4010-b2fb-5cc5b7b87fb4@intel.com/T/#t
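After discussion on the list and with Sohil Mehta co-developing the patch, the final decision was that kernel threads no longer expose AVX-512 usage time to userspace at all, whether or not CONFIG_X86_DEBUG_FPU is enabled: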
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 12ed75c1b567..28e4fd65c9da 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1881,19 +1881,20 @@ long fpu_xstate_prctl(int option, unsigned long arg2)
#ifdef CONFIG_PROC_PID_ARCH_STATUS
/*
* Report the amount of time elapsed in millisecond since last AVX512
- * use in the task.
+ * use in the task. Report -1 if no AVX-512 usage.
*/
static void avx512_status(struct seq_file *m, struct task_struct *task)
{
- unsigned long timestamp = READ_ONCE(x86_task_fpu(task)->avx512_timestamp);
- long delta;
+ unsigned long timestamp;
+ long delta = -1;
- if (!timestamp) {
- /*
- * Report -1 if no AVX512 usage
- */
- delta = -1;
- } else {
+ /* AVX-512 usage is not tracked for kernel threads. Don't report anything. */
+ if (task->flags & (PF_KTHREAD | PF_USER_WORKER))
+ return;
+
+ timestamp = READ_ONCE(x86_task_fpu(task)->avx512_timestamp);
+
+ if (timestamp) {
delta = (long)(jiffies - timestamp);
/*
* Cap to LONG_MAX if time difference > LONG_MAX
Postscript
As mentioned above, a kernel thread that needs the FPU can use the kernel_fpu_begin()/_end() API. How does it work?
/* Code that is unaware of kernel_fpu_begin_mask() can use this */
static inline void kernel_fpu_begin(void)
{
#ifdef CONFIG_X86_64
	/*
	 * Any 64-bit code that uses 387 instructions must explicitly request
	 * KFPU_387.
	 */
	kernel_fpu_begin_mask(KFPU_MXCSR);
#else
	/*
	 * 32-bit kernel code may use 387 operations as well as SSE2, etc,
	 * as long as it checks that the CPU has the required capability.
	 */
	kernel_fpu_begin_mask(KFPU_387 | KFPU_MXCSR);
#endif
}

void kernel_fpu_begin_mask(unsigned int kfpu_mask)
{
	if (!irqs_disabled())
		fpregs_lock();

	WARN_ON_FPU(!irq_fpu_usable());

	/* Toggle kernel_fpu_allowed to false: */
	WARN_ON_FPU(!this_cpu_read(kernel_fpu_allowed));
	this_cpu_write(kernel_fpu_allowed, false);

	if (!(current->flags & (PF_KTHREAD | PF_USER_WORKER)) &&
	    !test_thread_flag(TIF_NEED_FPU_LOAD)) {
		set_thread_flag(TIF_NEED_FPU_LOAD);
		save_fpregs_to_fpstate(x86_task_fpu(current));
	}
	__cpu_invalidate_fpregs_state();

	/* Put sane initial values into the control registers. */
	if (likely(kfpu_mask & KFPU_MXCSR) && boot_cpu_has(X86_FEATURE_XMM))
		ldmxcsr(MXCSR_DEFAULT);

	if (unlikely(kfpu_mask & KFPU_387) && boot_cpu_has(X86_FEATURE_FPU))
		asm volatile ("fninit");
}

/*
 * Track FPU initialization and kernel-mode usage. 'true' means the FPU is
 * initialized and is not currently being used by the kernel:
 */
DEFINE_PER_CPU(bool, kernel_fpu_allowed);
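So it looks like a kernel thread that wants to use the FPU first calls kernel_fpu_begin(), which checks the usability conditions (warning through WARN_ON_FPU if they don't hold) and sets kernel_fpu_allowed to false; after that the kernel code can use the FPU.
So it all seems to hinge on the per-CPU variable kernel_fpu_allowed: true means the FPU is not currently in use by the kernel, false means it is. That per-CPU state tracks the current usage.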
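Please credit the source when reposting. You are welcome to verify the references cited in this article and to point out anything that is wrong or unclear. Leave a comment below or email 857879363@qq.com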