问题背景
某款机型的机器在系统启动的时候稳定出现RCU stall的栈,需要排查并定位
排查过程
- 上游搜索相关patch
首先看一下发生问题时候的栈
[ 45.941504] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[ 45.949072] 00:04: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
[ 45.957833] 00:05: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
[ 45.966229] printk: console [ttyS0] disabled
[ 45.988682] serial 00:04: Runtime PM usage count underflow!
[ 45.989099] 00:06: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
[ 45.989254] printk: console [ttyS0] enabled
[ 66.990978] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 66.990978] rcu: 81-....: (2686 ticks this GP) idle=8cb4/1/0x4000000000000000 softirq=115/115 fqs=1750
[ 66.990978] rcu: hardirqs softirqs csw/system
[ 66.990978] rcu: number: 0 171 0
[ 66.990978] rcu: cputime: 0 0 5360 ==> 10492(ms)
[ 66.990978] rcu: (t=5251 jiffies g=-755 q=66 ncpus=256)
[ 66.990978] CPU: 81 PID: 1 Comm: swapper/0 Not tainted 6.6.0-3.0.0.7 #1
[ 66.990978] Hardware name: Inventec P9000G7/P9000G7 MLB, BIOS P9000G7094 02/13/2026
[ 66.990978] RIP: 0010:console_flush_all+0x19f/0x450
[ 66.990978] Code: 01 49 89 46 58 8b 05 60 60 f0 01 83 f8 ff 0f 84 a8 00 00 00 31 c0 41 88 04 24 e8 5c 2b 00 00 80 e7 02 74 01 f
b 41 0f b6 1c 24 <84> db 0f 85 1a 01 00 00 49 8b 46 58 49 3b 45 00 76 04 49 89 45 00
[ 66.990978] RSP: 0000:ffffc90000087728 EFLAGS: 00000202
[ 66.990978] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 66.990978] RDX: 0000000000000001 RSI: 0000000000000046 RDI: ffffffff83badf40
[ 66.990978] RBP: 0000000000000001 R08: 6166203a33312052 R09: 4142203a302e3030
[ 66.990978] R10: 302e30303a61633a R11: 3030303020696370 R12: ffffc900000877af
[ 66.990978] R13: ffffc900000877b0 R14: ffffffff83254ba0 R15: ffffffff83bad320
[ 66.990978] FS: 0000000000000000(0000) GS:ffff8b7a7e240000(0000) knlGS:0000000000000000
[ 66.990978] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 66.990978] CR2: 0000000000000000 CR3: 0000000005e2e001 CR4: 0000000000770ee0
[ 66.990978] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 66.990978] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 66.990978] PKRU: 55555554
[ 66.990978] Call Trace:
[ 66.990978] <IRQ>
[ 66.990978] ? rcu_dump_cpu_stacks+0xa2/0x110
[ 66.990978] ? rcu_sched_clock_irq+0x35e/0x970
[ 66.990978] ? update_process_times+0x5b/0x90
[ 66.990978] ? tick_sched_handle.isra.9+0x1e/0x50
[ 66.990978] ? tick_sched_timer+0x6f/0x90
[ 66.990978] ? __pfx_tick_sched_timer+0x10/0x10
[ 66.990978] ? __hrtimer_run_queues+0x11c/0x2a0
[ 66.990978] ? hrtimer_interrupt+0x100/0x240
[ 66.990978] ? __sysvec_apic_timer_interrupt+0x4e/0x130
[ 66.990978] ? sysvec_apic_timer_interrupt+0x69/0x90
[ 66.990978] </IRQ>
[ 66.990978] <TASK>
[ 66.990978] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 66.990978] ? console_flush_all+0x19f/0x450
[ 66.990978] ? console_flush_all+0x194/0x450
[ 66.990978] console_unlock+0x75/0xd0
[ 66.990978] vprintk_emit+0x209/0x290
[ 66.990978] _printk+0x58/0x80
[ 66.990978] ? kernfs_notify+0x53/0xb0
[ 66.990978] register_console+0x1ff/0x440
[ 66.990978] serial_core_register_port+0x767/0x780
[ 66.990978] ? __pfx_ignore_unknown_bootoption+0x10/0x10
[ 66.990978] serial8250_register_8250_port+0x3cc/0x610
[ 66.990978] serial_pnp_probe+0x276/0x390
[ 66.990978] ? __pfx_serial_pnp_probe+0x10/0x10
[ 66.990978] pnp_device_probe+0x59/0xb0
[ 66.990978] really_probe+0x126/0x3b0
[ 66.990978] __driver_probe_device+0x80/0x160
[ 66.990978] driver_probe_device+0x1f/0x90
[ 66.990978] ? __pfx___driver_attach+0x10/0x10
[ 66.990978] __driver_attach+0xd3/0x170
[ 66.990978] ? __pfx___driver_attach+0x10/0x10
[ 66.990978] bus_for_each_dev+0x78/0xd0
[ 66.990978] bus_add_driver+0x177/0x230
[ 66.990978] driver_register+0x55/0x100
[ 66.990978] serial8250_init+0x86/0x1e0
[ 66.990978] ? __pfx_serial8250_init+0x10/0x10
[ 66.990978] do_one_initcall+0x4c/0x230
[ 66.990978] kernel_init_freeable+0x327/0x460
[ 66.990978] ? __pfx_kernel_init+0x10/0x10
[ 66.990978] kernel_init+0x16/0x130
[ 66.990978] ret_from_fork+0x2d/0x40
[ 66.990978] ? __pfx_kernel_init+0x10/0x10
[ 66.990978] ret_from_fork_asm+0x1b/0x30
[ 66.990978] </TASK>
发现有一个serial 00:04: Runtime PM usage count underflow!,这里产生了计数器下溢
搜索上游patch,发现ed2761958ad7(“tty: serial: 8250: Fix another runtime PM usage counter underflow”)可能解决这个问题
于是合入这个patch重编内核,测试发现,不再产生underflow了,但是RCU stall还是存在
也就是说,RCU stall的产生与这个underflow日志并没有关系
- 与正常机器对比
找一个”正常机器”(与异常机器并非同一机型,因为异常机器所属的机型都存在问题)对比下看看
发现”正常机器”并没有printk: console [ttyS0] disabled然后又enable的过程
所以此时怀疑就是这一次disable->enable的过程导致了RCU stall
- 去掉console=ttyS0实验
把cmdline中的console=ttyS0去掉,然后重启机器,发现没有call trace了
所以基本可以确定就是ttyS0在系统启动过程中的一次disable->enable导致的RCU stall
- 看代码+做实验实锤猜测
可以看到register_console中有一段:
void register_console(struct console *newcon)
{
......
/*
* If we have a bootconsole, and are switching to a real console,
* don't print everything out again, since when the boot console, and
* the real console are the same physical device, it's annoying to
* see the beginning boot messages twice
*/
if (bootcon_registered &&
((newcon->flags & (CON_CONSDEV | CON_BOOT)) == CON_CONSDEV)) {
newcon->flags &= ~CON_PRINTBUFFER;
}
......
}
这里可以看出,如果是bootcon->realcon的切换的话,新注册的realcon中会清除掉CON_PRINTBUFFER这个标志
这个标志的作用可以在下面代码中体现出来:初始化seq的时候,如果newcon是bootcon或者带有CON_PRINTBUFFER标志,那么这个newcon的seq就赋值为syslog_seq(这个syslog_seq就是syslog的buffer最开始的位置,只要buffer足够大,那syslog_seq基本就是系统启动的第一行log的位置)。
static void console_init_seq(struct console *newcon, bool bootcon_registered)
{
struct console *con;
bool handover;
if (newcon->flags & (CON_PRINTBUFFER | CON_BOOT)) {
/* Get a consistent copy of @syslog_seq. */
mutex_lock(&syslog_lock);
newcon->seq = syslog_seq;
mutex_unlock(&syslog_lock);
} else {
/* Begin with next message added to ringbuffer. */
newcon->seq = prb_next_seq(prb);
......
}
......
}
在这块加点log重编内核测试一下
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -3505,6 +3505,7 @@ static void console_init_seq(struct console *newcon, bool bootcon_registered)
newcon->seq = prb_first_valid_seq(prb);
} else {
/* Get a consistent copy of @syslog_seq. */
+ pr_warn("yzwddsg newcon is in newcon->flags & (CON_PRINTBUFFER | CON_BOOT)\n");
mutex_lock(&syslog_lock);
newcon->seq = syslog_seq;
mutex_unlock(&syslog_lock);
@@ -3521,7 +3522,9 @@ static void console_init_seq(struct console *newcon, bool bootcon_registered)
* start with the message of the enabled boot console that is
* the furthest behind.
*/
+
if (bootcon_registered && !keep_bootcon) {
+ pr_warn("yzwddsg unregister bootcon\n");
/*
* Hold the console_lock to stop console printing and
* guarantee safe access to console->seq.
@@ -3650,6 +3653,7 @@ void register_console(struct console *newcon)
*/
if (bootcon_registered &&
((newcon->flags & (CON_CONSDEV | CON_BOOT)) == CON_CONSDEV)) {
+ pr_warn("yzwddsg: newcon removed CON_PRINTBUFFER\n");
newcon->flags &= ~CON_PRINTBUFFER;
}
发现确实没有清除CON_PRINTBUFFER标志,直接去走了刚才说的newcon->seq = syslog_seq;路径(因为之前并没有bootcon)
那这里就可以确认:RCU stall的直接原因就是因为异常机型的系统启动过程中存在一次ttyS0的disable->enable的过程,这里在enable注册新的ttyS0的过程中,console结构体中的seq会被赋值成syslog_seq,也就导致新ttyS0被注册enable之后需要从头到尾重新打印一遍系统启动的log,而这时候syslog buffer中的log已经有3000多行,串口打印的速度又不是很快,这就导致了系统出现RCU stall。
事情到这还没完,还需要确定为什么异常机型上会存在ttyS0 disable->enable的过程?
- 对比正常机器日志
在”正常机器”和异常机器的cmdline中都分别添加trace_buf_size=20M trace_options=stacktrace ftrace=function ftrace_filter=serial8250*,然后重启机器看一下ftrace的输出,过滤一下结果
正常机器输出如下:
<idle>-0 [000] ...2. 1.219179: serial8250_isa_init_ports.part.0 <-univ8250_console_init
<idle>-0 [000] ...2. 1.219180: serial8250_setup_port <-serial8250_isa_init_ports.part.0
<idle>-0 [000] ...2. 1.219180: serial8250_init_port <-serial8250_setup_port
<idle>-0 [000] ...2. 1.219180: serial8250_set_defaults <-serial8250_setup_port
<idle>-0 [000] ...2. 1.219180: serial8250_setup_port <-serial8250_isa_init_ports.part.0
<idle>-0 [000] ...2. 1.219180: serial8250_init_port <-serial8250_setup_port
<idle>-0 [000] ...2. 1.219180: serial8250_set_defaults <-serial8250_setup_port
<idle>-0 [000] ...2. 1.219180: serial8250_setup_port <-serial8250_isa_init_ports.part.0
<idle>-0 [000] ...2. 1.219180: serial8250_init_port <-serial8250_setup_port
<idle>-0 [000] ...2. 1.219181: serial8250_set_defaults <-serial8250_setup_port
<idle>-0 [000] ...2. 1.219181: serial8250_setup_port <-serial8250_isa_init_ports.part.0
<idle>-0 [000] ...2. 1.219181: serial8250_init_port <-serial8250_setup_port
<idle>-0 [000] ...2. 1.219181: serial8250_set_defaults <-serial8250_setup_port
<idle>-0 [000] ...2. 1.219182: serial8250_console_setup <-univ8250_console_setup
<idle>-0 [000] ...2. 1.219183: serial8250_set_termios <-uart_set_options
<idle>-0 [000] ...2. 1.219183: serial8250_do_set_termios <-uart_set_options
<idle>-0 [000] ...2. 1.219183: serial8250_get_baud_rate <-serial8250_do_set_termios
<idle>-0 [000] ...2. 1.219184: serial8250_get_divisor <-serial8250_do_set_termios
<idle>-0 [000] d..3. 1.219186: serial8250_set_divisor <-serial8250_do_set_termios
<idle>-0 [000] d..3. 1.219187: serial8250_do_set_divisor <-serial8250_do_set_termios
<idle>-0 [000] d..3. 1.219197: serial8250_set_mctrl.part.8 <-serial8250_do_set_termios
<idle>-0 [000] d..3. 1.219197: serial8250_do_set_mctrl <-serial8250_do_set_termios
......
systemd-1 [241] ...1. 15.112899: serial8250_pnp_init <-serial8250_init
systemd-1 [241] ...1. 15.112920: serial8250_register_8250_port <-serial_pnp_probe
systemd-1 [241] ...1. 15.112921: serial8250_set_defaults <-serial8250_register_8250_port
systemd-1 [241] ...1. 15.112964: serial8250_config_port <-univ8250_config_port
systemd-1 [241] ...1. 15.112965: serial8250_request_std_resource <-serial8250_config_port
systemd-1 [241] d..2. 15.112996: serial8250_clear_fifos.part.1 <-serial8250_config_port
systemd-1 [241] d..2. 15.113004: serial8250_clear_IER <-serial8250_config_port
systemd-1 [241] ...1. 15.113008: serial8250_type <-serial_core_register_port
systemd-1 [241] ...1. 15.120475: serial8250_pm <-serial_core_register_port
systemd-1 [241] ...1. 15.120475: serial8250_set_sleep <-serial_core_register_port
systemd-1 [241] d..2. 15.120475: serial8250_set_mctrl <-serial_core_register_port
systemd-1 [241] d..2. 15.120475: serial8250_set_mctrl.part.8 <-serial_core_register_port
systemd-1 [241] d..2. 15.120476: serial8250_do_set_mctrl <-serial_core_register_port
systemd-1 [241] ...1. 15.120520: serial8250_get_port <-serial_pnp_probe
systemd-1 [241] ...1. 15.120575: serial8250_config_port <-univ8250_config_port
systemd-1 [241] ...1. 15.120575: serial8250_request_std_resource <-serial8250_config_port
systemd-1 [241] ...1. 15.120586: serial8250_release_std_resource <-serial8250_config_port
systemd-1 [247] ...1. 15.120646: serial8250_config_port <-univ8250_config_port
systemd-1 [247] ...1. 15.120646: serial8250_request_std_resource <-serial8250_config_port
systemd-1 [247] ...1. 15.120657: serial8250_release_std_resource <-serial8250_config_port
systemd-1 [241] ...1. 15.120710: serial8250_config_port <-univ8250_config_port
systemd-1 [241] ...1. 15.120710: serial8250_request_std_resource <-serial8250_config_port
systemd-1 [241] ...1. 15.120721: serial8250_release_std_resource <-serial8250_config_port
systemd-1 [247] ...1. 15.120764: serial8250_probe <-platform_probe
异常机器输出如下:
=> trace_event_raw_event_initcall_start
=> console_init
=> start_kernel
=> x86_64_start_reservations
=> x86_64_start_kernel
=> secondary_startup_64_no_verify
<idle>-0 [000] ...2. 9.735705: serial8250_isa_init_ports.part.0 <-univ8250_console_init
<idle>-0 [000] ...2. 9.735705: serial8250_setup_port <-serial8250_isa_init_ports.part.0
<idle>-0 [000] ...2. 9.735705: serial8250_init_port <-serial8250_setup_port
<idle>-0 [000] ...2. 9.735705: serial8250_set_defaults <-serial8250_setup_port
<idle>-0 [000] ...2. 9.735706: serial8250_setup_port <-serial8250_isa_init_ports.part.0
<idle>-0 [000] ...2. 9.735706: serial8250_init_port <-serial8250_setup_port
<idle>-0 [000] ...2. 9.735706: serial8250_set_defaults <-serial8250_setup_port
<idle>-0 [000] ...2. 9.735706: serial8250_setup_port <-serial8250_isa_init_ports.part.0
<idle>-0 [000] ...2. 9.735706: serial8250_init_port <-serial8250_setup_port
<idle>-0 [000] ...2. 9.735706: serial8250_set_defaults <-serial8250_setup_port
<idle>-0 [000] ...2. 9.735706: serial8250_setup_port <-serial8250_isa_init_ports.part.0
<idle>-0 [000] ...2. 9.735706: serial8250_init_port <-serial8250_setup_port
<idle>-0 [000] ...2. 9.735706: serial8250_set_defaults <-serial8250_setup_port
<idle>-0 [000] ...2. 9.735707: serial8250_console_setup <-univ8250_console_setup
<idle>-0 [000] ...2. 9.735708: serial8250_set_termios <-uart_set_options
<idle>-0 [000] ...2. 9.735708: serial8250_do_set_termios <-uart_set_options
<idle>-0 [000] ...2. 9.735709: serial8250_get_baud_rate <-serial8250_do_set_termios
<idle>-0 [000] ...2. 9.735709: serial8250_get_divisor <-serial8250_do_set_termios
<idle>-0 [000] d..3. 9.735726: serial8250_set_divisor <-serial8250_do_set_termios
<idle>-0 [000] d..3. 9.735726: serial8250_do_set_divisor <-serial8250_do_set_termios
<idle>-0 [000] d..3. 9.735810: serial8250_set_mctrl.part.8 <-serial8250_do_set_termios
<idle>-0 [000] d..3. 9.735810: serial8250_do_set_mctrl <-serial8250_do_set_termios
<idle>-0 [000] ...1. 11.567817: initcall_finish: func=univ8250_console_init+0x0/0x40 ret=0
<idle>-0 [000] ...1. 11.567818: <stack trace>
.......
=> trace_event_raw_event_initcall_start
=> do_one_initcall
=> kernel_init_freeable
=> kernel_init
=> ret_from_fork
=> ret_from_fork_asm
systemd-1 [240] ...1. 48.507441: serial8250_init <-do_one_initcall
systemd-1 [240] ...1. 48.514565: serial8250_pnp_init <-serial8250_init
systemd-1 [240] ...1. 48.514583: serial8250_register_8250_port <-serial_pnp_probe
systemd-1 [240] ...1. 48.514585: serial8250_set_defaults <-serial8250_register_8250_port
systemd-1 [240] ...1. 48.514623: serial8250_config_port <-univ8250_config_port
systemd-1 [240] ...1. 48.514623: serial8250_request_std_resource <-serial8250_config_port
systemd-1 [240] d..2. 48.514916: serial8250_clear_fifos.part.1 <-serial8250_config_port
systemd-1 [240] d..2. 48.514983: serial8250_clear_IER <-serial8250_config_port
systemd-1 [240] ...1. 48.515008: serial8250_type <-serial_core_register_port
systemd-1 [240] ...1. 48.529539: serial8250_pm <-serial_core_register_port
systemd-1 [240] ...1. 48.529539: serial8250_set_sleep <-serial_core_register_port
systemd-1 [240] d..2. 48.529540: serial8250_set_mctrl <-serial_core_register_port
systemd-1 [240] d..2. 48.529540: serial8250_set_mctrl.part.8 <-serial_core_register_port
systemd-1 [240] d..2. 48.529540: serial8250_do_set_mctrl <-serial_core_register_port
systemd-1 [240] ...1. 48.529595: serial8250_get_port <-serial_pnp_probe
systemd-1 [240] ...1. 48.529605: serial8250_register_8250_port <-serial_pnp_probe
systemd-1 [240] ...1. 48.529606: serial8250_set_defaults <-serial8250_register_8250_port
systemd-1 [240] ...1. 48.529625: serial8250_config_port <-univ8250_config_port
systemd-1 [240] ...1. 48.529625: serial8250_request_std_resource <-serial8250_config_port
systemd-1 [240] d..2. 48.529919: serial8250_clear_fifos.part.1 <-serial8250_config_port
systemd-1 [240] d..2. 48.529986: serial8250_clear_IER <-serial8250_config_port
systemd-1 [240] ...1. 48.530003: serial8250_type <-serial_core_register_port
systemd-1 [240] ...1. 48.544532: serial8250_pm <-serial_core_register_port
systemd-1 [240] ...1. 48.544532: serial8250_set_sleep <-serial_core_register_port
systemd-1 [240] d..2. 48.544532: serial8250_set_mctrl <-serial_core_register_port
systemd-1 [240] d..2. 48.544532: serial8250_set_mctrl.part.8 <-serial_core_register_port
systemd-1 [240] d..2. 48.544532: serial8250_do_set_mctrl <-serial_core_register_port
systemd-1 [240] ...1. 48.544549: serial8250_pm <-serial_core_register_port
systemd-1 [240] ...1. 48.544549: serial8250_set_sleep <-serial_core_register_port
systemd-1 [240] ...1. 48.544582: serial8250_get_port <-serial_pnp_probe
systemd-1 [240] ...1. 48.544588: serial8250_register_8250_port <-serial_pnp_probe
systemd-1 [241] ...1. 48.558764: serial8250_console_exit <-unregister_console
systemd-1 [241] ...1. 48.558765: serial8250_release_port <-serial_core_unregister_port
systemd-1 [241] ...1. 48.558766: serial8250_release_std_resource <-serial_core_unregister_port
systemd-1 [241] ...1. 48.558799: serial8250_set_defaults <-serial8250_register_8250_port
systemd-1 [241] ...1. 48.558824: serial8250_config_port <-univ8250_config_port
systemd-1 [241] ...1. 48.558824: serial8250_request_std_resource <-serial8250_config_port
systemd-1 [241] d..2. 48.559121: serial8250_clear_fifos.part.1 <-serial8250_config_port
systemd-1 [241] d..2. 48.559186: serial8250_clear_IER <-serial8250_config_port
systemd-1 [241] ...1. 48.559205: serial8250_type <-serial_core_register_port
systemd-1 [241] ...1. 48.559212: serial8250_pm <-serial_core_register_port
systemd-1 [241] ...1. 48.559212: serial8250_set_sleep <-serial_core_register_port
systemd-1 [241] d..2. 48.559212: serial8250_set_mctrl <-serial_core_register_port
systemd-1 [241] d..2. 48.559212: serial8250_set_mctrl.part.8 <-serial_core_register_port
systemd-1 [241] d..2. 48.559212: serial8250_do_set_mctrl <-serial_core_register_port
systemd-1 [241] ...1. 48.559229: serial8250_console_setup <-univ8250_console_setup
systemd-1 [241] ...1. 48.559230: serial8250_set_termios <-uart_set_options
systemd-1 [241] ...1. 48.559230: serial8250_do_set_termios <-uart_set_options
systemd-1 [241] ...1. 48.559230: serial8250_get_baud_rate <-serial8250_do_set_termios
systemd-1 [241] ...1. 48.559230: serial8250_get_divisor <-serial8250_do_set_termios
systemd-1 [241] d..2. 48.559246: serial8250_set_divisor <-serial8250_do_set_termios
systemd-1 [241] d..2. 48.559246: serial8250_do_set_divisor <-serial8250_do_set_termios
systemd-1 [241] d..2. 48.559346: serial8250_set_mctrl.part.8 <-serial8250_do_set_termios
systemd-1 [241] d..2. 48.559346: serial8250_do_set_mctrl <-serial8250_do_set_termios
systemd-1 [113] ...1. 73.198659: serial8250_get_port <-serial_pnp_probe
systemd-1 [113] ...1. 73.198713: serial8250_config_port <-univ8250_config_port
systemd-1 [113] ...1. 73.198713: serial8250_request_std_resource <-serial8250_config_port
systemd-1 [113] ...1. 73.198771: serial8250_release_std_resource <-serial8250_config_port
systemd-1 [114] ...1. 73.198842: serial8250_config_port <-univ8250_config_port
systemd-1 [114] ...1. 73.198842: serial8250_request_std_resource <-serial8250_config_port
systemd-1 [114] ...1. 73.198891: serial8250_release_std_resource <-serial8250_config_port
systemd-1 [113] ...1. 73.198939: serial8250_probe <-platform_probe
systemd-1 [113] ..... 73.198943: initcall_finish: func=serial8250_init+0x0/0x1e0 ret=0
systemd-1 [113] ..... 73.198946: <stack trace>
可以发现,异常机器中是比正常机器多两次serial_pnp_probe的
- 看代码做实验
看一下serial8250_register_8250_port函数
int serial8250_register_8250_port(const struct uart_8250_port *up)
{
......
uart = serial8250_find_match_or_unused(&up->port);
if (!uart) {
uart = serial8250_setup_port(nr_uarts);
if (!uart)
goto unlock;
nr_uarts++;
}
if (uart->port.type != PORT_8250_CIR) {
struct mctrl_gpios *gpios;
if (uart->port.dev)
uart_remove_one_port(&serial8250_reg, &uart->port);
......
}
应该就是这里serial8250_find_match_or_unused匹配到了上一个ttyS0是match的或者没有使用,所以后面去触发了uart_remove_one_port去disable旧的ttyS0,然后才去enable新的ttyS0
在serial8250_find_match_or_unused相关函数中添加一些log,然后重编内核重启机器
发现确实是match的,而且match的原因是port相同
这也符合我们的日志看到的00:04和00:06设备的端口都是0x3f8
[ 45.949072] 00:04: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
......
[ 45.966229] printk: console [ttyS0] disabled
......
[ 45.989099] 00:06: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
[ 45.989254] printk: console [ttyS0] enabled
所以这里我们也可以确定了:ttyS0的disable->enable操作是因为00:04和00:06这两个设备都映射的是0x3f8这个端口号。当serial8250_register_8250_port扫到00:06这个设备的时候发现历史上有一个00:04是同样的端口号,所以需要先把00:04给disable掉,然后才能enable 00:06。
这个我们也可以通过系统中的一些sysfs接口看出来:如下,可以看到00:04和00:06的io端口是一样的。
[root@yzwddsg tmp]# cat /sys/bus/pnp/devices/00:04/resources
state = active
io 0x3f8-0x3ff
irq 4
dma disabled
[root@yzwddsg tmp]# cat /sys/bus/pnp/devices/00:06/resources
state = active
io 0x3f8-0x3ff
irq 4
- 继续追查00:04和00:06设备
如上所说,已经可以明确00:04和00:06是两个不同的设备但是映射到相同的0x3f8端口号
dump acpi表看一下DSDT表中的COM1
Device (UAR1)
{
Name (_HID, EisaId ("PNP0501") /* 16550A-compatible COM Serial Port */) // _HID: Hardware ID
Name (_UID, 0x00) // _UID: Unique ID
Name (LDN, 0x02)
Method (_STA, 0, NotSerialized) // _STA: Status
{
Return (^^SIO1.DSTA (0x00))
}
Method (_DIS, 0, NotSerialized) // _DIS: Disable Device
{
^^SIO1.DCNT (0x00, 0x00)
}
Method (_CRS, 0, NotSerialized) // _CRS: Current Resource Settings
{
Return (^^SIO1.DCRS (0x00, 0x00))
}
Method (_SRS, 1, NotSerialized) // _SRS: Set Resource Settings
{
^^SIO1.DSRS (Arg0, 0x00)
}
Name (_DDN, "COM1") // _DDN: DOS Device Name
Name (_PRS, ResourceTemplate () // _PRS: Possible Resource Settings
{
StartDependentFn (0x00, 0x00)
{
IO (Decode16,
0x03F8, // Range Minimum
0x03F8, // Range Maximum
0x01, // Alignment
0x08, // Length
)
IRQNoFlags ()
{4}
DMA (Compatibility, NotBusMaster, Transfer8, )
{}
}
StartDependentFnNoPri ()
{
IO (Decode16,
0x03F8, // Range Minimum
0x03F8, // Range Maximum
0x01, // Alignment
0x08, // Length
)
IRQNoFlags ()
{3,4,5,7,9,10,11,12}
DMA (Compatibility, NotBusMaster, Transfer8, )
{}
}
StartDependentFnNoPri ()
{
IO (Decode16,
0x02F8, // Range Minimum
0x02F8, // Range Maximum
0x01, // Alignment
0x08, // Length
)
IRQNoFlags ()
{3,4,5,7,9,10,11,12}
DMA (Compatibility, NotBusMaster, Transfer8, )
{}
}
StartDependentFnNoPri ()
{
IO (Decode16,
0x03E8, // Range Minimum
0x03E8, // Range Maximum
0x01, // Alignment
0x08, // Length
)
IRQNoFlags ()
{3,4,5,7,9,10,11,12}
DMA (Compatibility, NotBusMaster, Transfer8, )
{}
}
StartDependentFnNoPri ()
{
IO (Decode16,
0x02E8, // Range Minimum
0x02E8, // Range Maximum
0x01, // Alignment
0x08, // Length
)
IRQNoFlags ()
{3,4,5,7,9,10,11,12}
DMA (Compatibility, NotBusMaster, Transfer8, )
{}
}
EndDependentFn ()
})
}
还有个COM2串口,COM2串口也是类似的,只是StartDependentFn中首选是0x02F8那些
怀疑是不是COM2串口使用了StartDependentFnNoPri变成了去映射0x3f8端口导致和COM1冲突了?
但是看了下系统的sysfs接口,这个怀疑被排除了,00:05这个设备就是COM2
[root@yzwddsg bbb]# cat /sys/bus/pnp/devices/00\:05/resources
state = active
io 0x2f8-0x2ff
irq 3
dma disabled
那么00:06是什么设备?
继续看DSDT表中发现这个IUR0也是个PNP0501串口设备
Device (IUR0)
{
Name (_HID, EisaId ("PNP0501") /* 16550A-compatible COM Serial Port */) // _HID: Hardware ID
Name (_UID, "IblUart0") // _UID: Unique ID
Method (_STA, 0, NotSerialized) // _STA: Status
{
If ((UM00 == 0x02) || (UM00 == 0x03))
{
Return (0x0F)
}
Else
{
Return (0x00)
}
}
Method (_CRS, 0, Serialized) // _CRS: Current Resource Settings
{
Return (UACR (UC00, UI00))
}
}
那么会不会是这个呢?
[root@yzwddsg bbb]# cat /sys/bus/pnp/devices/00\:04/firmware_node/path
\_SB_.LPC0.UAR1
[root@yzwddsg bbb]# cat /sys/bus/pnp/devices/00\:05/firmware_node/path
\_SB_.LPC0.UAR2
[root@yzwddsg bbb]# cat /sys/bus/pnp/devices/00\:06/firmware_node/path
\_SB_.UAR1
这里可以看到00:06是直接挂在_SB_下面的
去dump bios看一下,有这个可疑选项
Setup Question = Serial Device
Help String = Sets the Serial Device used to output bios serial log
Token =9003 // Do NOT change this line
Offset =2F8
Width =01
BIOS Default =[01]BMC
Options =*[01]BMC // Move "*" to the desired Option
[02]S3M
它的作用就是dump串口日志到BMC上,所以00:06极有可能就是这个选项控制的串口设备
把该选项改成S3M然后重启测试发现已经没有RCU stall了
所以现在可以确定就是Serial Device选择了BMC导致00:06串口设备和主板COM1的串口设备都被系统检测到并register,而且这俩设备都映射到相同的0x3f8端口。内核在register 00:06的时候检测到0x3f8端口已经被COM1端口使用了,必须先把它disable,然后才能register 00:06。对于00:06来说,这是一个新的console,所以是带有CON_PRINTBUFFER标志位的,当它被enable的时候,就需要把syslog buffer中的所有日志重新flush到这个串口上,当时syslog buffer中已经有3000多条启动日志,它们需要一次flush到串口上,所以出现了RCU stall。
后记:但是如果把Serial Device设置成S3M的话,串口日志就不能打到BMC上了,ilo也不能看到log和交互了。
天天开心噢11st
转载请注明来源,欢迎对文章中的引用来源进行考证,欢迎指出任何有错误或不够清晰的表达。可以在下面评论区评论,也可以邮件至 857879363@qq.com